How much information?
Adapted from a presentation by:
Jim Gray, Microsoft Research, http://research.microsoft.com/~gray
Alex Szalay, Johns Hopkins University, http://tarkus.pha.jhu.edu/~szalay/

How much information is there in the world?
1. What can we store?
2. What is stored?
3. Why are we interested?

Infinite Storage?
• The Terror Bytes are Here
  – 1 TB costs 1 k$ to buy
  – 1 TB costs 300 k$/y to own
• Management & curation are expensive
  – Searching 1 TB takes minutes or hours
• Petrified by Peta Bytes?
• But… people can “afford” them, so
  – Even though they can never actually be seen in your lifetime
  – Automate the process
[Figure: scale of byte prefixes (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) with a “We are here” marker]

How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
[Figure: scale of byte prefixes from Kilo to Yotta, marked with A Book, A Photo, Movie, All books (words), All Books MultiMedia, and Everything Recorded]

First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24” disks
• 1200 rpm
• 100 ms access
• 35 k$/y rent
• Included computer & accounting software (tubes not transistors)

Storage capacity beating Moore’s law
• Improvements:
  – Capacity 60%/y
  – Bandwidth 40%/y
  – Access time 16%/y
• 1000 $/TB today
• 100 $/TB in 2007
Moore’s law: 58.70%/year
TB growth: 112.30%/year since 1993
Price decline: 50.70%/year since 1993
Most (80%) data is personal (not enterprise). This will likely remain true.
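The annual rates above are easier to compare as doubling times. The sketch below is an illustration only: it assumes simple compound growth, and the function names are hypothetical, not part of the original talk.

```python
import math


def doubling_time_years(annual_growth_pct: float) -> float:
    """Years needed to double at a compound annual growth rate (in percent)."""
    return math.log(2) / math.log(1 + annual_growth_pct / 100.0)


def years_to_fall_by(factor: float, annual_decline_pct: float) -> float:
    """Years for a price to fall by `factor` at a compound annual decline (in percent)."""
    return math.log(factor) / -math.log(1 - annual_decline_pct / 100.0)


if __name__ == "__main__":
    # Rates taken from the slide; compound growth is an assumption.
    for label, rate in [("Capacity", 60.0), ("Bandwidth", 40.0), ("Moore's law", 58.7)]:
        print(f"{label}: doubles every {doubling_time_years(rate):.2f} years")
    print(f"Price falls 10x in {years_to_fall_by(10, 50.7):.2f} years at 50.7%/year")
```

Under these assumptions a 60%/y capacity improvement doubles capacity roughly every year and a half, and a 50.7%/y price decline gives a tenfold drop in a little over three years.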

Disk Storage Cheaper Than Paper
• File Cabinet (4 drawer)
    Cabinet:                      250$
    Paper (24,000 sheets):        250$
    Space (2 x 3 @ 10€/ft²):      180$
    Total:                        700$
    0.03 $/sheet (3 pennies per page)
• Disk
    Disk (250 GB):                250$
    ASCII: 100 M pages, 2e-6 $/sheet (10,000 x cheaper), a micro-dollar per page
    Image: 1 M photos, 3e-4 $/photo (100 x cheaper), a milli-dollar per photo
• Store everything on disk
Note: Disk is 100 x to 1000 x cheaper than RAM
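The per-item costs follow from dividing total cost by item count. The sketch below redoes that arithmetic with the slide's figures; the helper name `cost_per_item` is hypothetical, and the unrounded results differ slightly from the slide's rounded values.

```python
def cost_per_item(total_cost_usd: float, items: int) -> float:
    """Cost of storing one item, given the total cost and the number of items."""
    return total_cost_usd / items


# Figures from the slide.
paper_sheet = cost_per_item(700, 24_000)        # filing cabinet, paper, and space
ascii_page = cost_per_item(250, 100_000_000)    # 250 GB disk holding 100 M ASCII pages
photo = cost_per_item(250, 1_000_000)           # 250 GB disk holding 1 M photos

if __name__ == "__main__":
    print(f"Paper: {paper_sheet:.3f} $/sheet")
    print(f"ASCII: {ascii_page:.1e} $/page, {paper_sheet / ascii_page:,.0f}x cheaper than paper")
    print(f"Image: {photo:.1e} $/photo, {paper_sheet / photo:,.0f}x cheaper than paper")
```

The ratios come out near 12,000 x and 120 x, which the slide rounds to 10,000 x and 100 x.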

Trying to fill a terabyte in a year

Item                           Items/TB   Items/day
300 KB JPEG                    3 M        9,800
1 MB Doc                       1 M        2,900
1 hour 256 kb/s MP3 audio      9 K        26
1 hour 1.5 Mbps MPEG video     290        0.8
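A rough way to reproduce this kind of table is to divide one terabyte by the item size and then by the days in a year. The sketch below assumes decimal units (1 TB = 10^12 bytes), a 365-day year, and constant-bitrate streams; the function names are hypothetical. Under these assumptions the first three rows land close to the slide's rounded figures, while the video row comes out near 1,500 items/TB rather than 290, so the slide presumably assumed a larger size per hour of video.

```python
TB = 10**12            # assumption: decimal terabyte
DAYS_PER_YEAR = 365    # assumption


def stream_bytes(bits_per_second: float, seconds: int = 3600) -> float:
    """Size in bytes of a constant-bitrate stream of the given duration."""
    return bits_per_second / 8 * seconds


def items_per_tb(item_bytes: float) -> float:
    """How many items of the given size fit in one terabyte."""
    return TB / item_bytes


ITEMS = {
    "300 KB JPEG": 300_000,
    "1 MB Doc": 1_000_000,
    "1 hour 256 kb/s MP3 audio": stream_bytes(256_000),
    "1 hour 1.5 Mbps MPEG video": stream_bytes(1_500_000),
}

if __name__ == "__main__":
    for name, size in ITEMS.items():
        per_tb = items_per_tb(size)
        print(f"{name:28s} {per_tb:12,.0f} items/TB {per_tb / DAYS_PER_YEAR:10,.1f} items/day")
```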

Portable Computer: 2006?
• 100 Gips processor
• 1 GB RAM
• 1 TB disk
• 1 Gbps network
• “Some” of your software
Finding things is a data mining challenge

80% of data is personal / individual. But what about the other 20%?
• Business
  – Wal-Mart online: 1 PB and growing…
  – Paradox: most “transaction” systems < 1 PB
  – Have to go to image/data monitoring for big data
• Government
  – Government is the biggest business
• Science
  – LOTS of data

Q: Where will the Data Come From? A: Sensor Applications
• Earth Observation
  – 15 PB by 2007
• Medical Images & Information + Health Monitoring
  – Potential 1 GB/patient/y → 1 EB/y
• Video Monitoring
  – ~1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y → filtered???
• Airplane Engines
  – 1 GB sensor data/flight
  – 100,000 engine hours/day
  – 30 PB/y
• Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
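These estimates are back-of-the-envelope multiplications, and the sketch below redoes them. Several inputs are assumptions chosen to match the slide's totals, not figures it states: 1E5 bytes per second per camera (the rate that yields 10 TB/s), 1 GB of sensor data per engine hour (the slide says per flight), and one billion patients (the count that yields 1 EB/y). All names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
GB, TB, PB, EB = 10**9, 10**12, 10**15, 10**18   # decimal units

# Video monitoring: camera count from the slide; per-camera rate is an assumption.
cameras = 10**8
bytes_per_second_per_camera = 10**5
video_rate = cameras * bytes_per_second_per_camera   # bytes per second, all cameras
video_per_year = video_rate * SECONDS_PER_YEAR

# Airplane engines: engine hours from the slide; 1 GB per engine hour is an assumption.
engine_hours_per_day = 100_000
engine_per_year = engine_hours_per_day * 365 * 1 * GB

# Medical: 1 GB/patient/y from the slide; the patient count is an assumption.
patients = 10**9
medical_per_year = patients * 1 * GB

if __name__ == "__main__":
    print(f"Video:   {video_rate / TB:.0f} TB/s, {video_per_year / EB:.0f} EB/y")
    print(f"Engines: {engine_per_year / PB:.1f} PB/y")
    print(f"Medical: {medical_per_year / EB:.0f} EB/y")
```

With these inputs the video total is about 315 EB/y and the engine total about 36.5 PB/y, which agree with the slide's 100 EB/y and 30 PB/y only to within an order of magnitude.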

Premise: DataGrid Computing
• Store exabytes twice (for redundancy)
• Access them from anywhere
• Implies huge archive/data centers
• Supercomputer centers become super data centers
• Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …

Thesis
• Most new information is digital (and old information is being digitized)
• An Information Science Grand Challenge:
  – Capture
  – Organize
  – Summarize
  – Visualize
  this information
• Optimize Human Attention as a resource
• Improve information quality

The Evolution of Science
• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Exploration Science
  – Data captured by instruments or generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files

Computational Science Evolves
• Historically, Computational Science = simulation.
• New emphasis on informatics:
  – Capturing,
  – Organizing,
  – Summarizing,
  – Analyzing,
  – Visualizing
• Largely driven by observational science, but also needed by simulations.
• Too soon to say if comp-X and X-info will unify or compete.
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]

Next-Generation Data Analysis
• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: dark matter, dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
  – Discard notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
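The scaling argument can be made concrete with a small calculation. The sketch below assumes an idealized cost model in which runtime equals operation count divided by machine speed, and both the data size and the machine speed grow by the same factor; the function and dictionary names are hypothetical.

```python
import math
from typing import Callable


def relative_runtime(cost: Callable[[float], float], n: float, growth: float) -> float:
    """Runtime after data size and machine speed both grow by `growth`,
    relative to the runtime before (1.0 means we exactly keep up)."""
    return cost(n * growth) / (growth * cost(n))


COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n**2,
    "N^3": lambda n: n**3,
}

if __name__ == "__main__":
    n, growth = 1e6, 1000.0   # illustrative values, not from the slide
    for name, cost in COSTS.items():
        print(f"{name:8s} runtime grows by a factor of {relative_runtime(cost, n, growth):,.1f}")
```

For these illustrative values the N log N algorithm slows by a factor of only 1.5, while the N^2 and N^3 algorithms slow by factors of a thousand and a million.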

Smart Data (active databases)
• If there is too much data to move around, take the analysis to the data!
• Do all data manipulations at the database
  – Build custom procedures and functions in the database
• Automatic parallelism guaranteed
• Easy to build in custom functionality
  – Databases & procedures being unified
  – Example: temporal and spatial indexing
  – Pixel processing
• Easy to reorganize the data
  – Multiple views, each optimal for certain types of analyses
  – Building hierarchical summaries is trivial
• Scalable to petabyte datasets
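A minimal sketch of taking the analysis to the data, using Python's built-in sqlite3 module as a stand-in for a database server. The table, the column names, and the `calibrate` function are hypothetical, and SQLite runs in-process, so it does not provide the automatic parallelism mentioned above; the point is only that a custom function and a summary query run inside the database, and just the small result leaves it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pixels (tile INTEGER, flux REAL)")
conn.executemany(
    "INSERT INTO pixels VALUES (?, ?)",
    [(0, 1.0), (0, 3.0), (1, 10.0), (1, 30.0), (1, 50.0)],
)


def calibrate(flux: float) -> float:
    """Hypothetical per-pixel calibration, applied inside the database."""
    return flux * 2.0


# Register the custom function so SQL queries can call it by name.
conn.create_function("calibrate", 1, calibrate)

# Summary per tile, computed where the data lives; only one row per tile is returned.
summary = conn.execute(
    "SELECT tile, COUNT(*), AVG(calibrate(flux)) "
    "FROM pixels GROUP BY tile ORDER BY tile"
).fetchall()

if __name__ == "__main__":
    for tile, count, mean_flux in summary:
        print(f"tile {tile}: {count} pixels, mean calibrated flux {mean_flux}")
```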

Challenge: Make Data Publication & Access Easy
• Augment FTP with data query: return intelligent data subsets
• Make it easy to
  – Publish: record structured data
  – Find:
    • Find data anywhere in the network
    • Get the subset you need
  – Explore datasets interactively
• Realistic goal:
  – Make it as easy as publishing/reading web sites today.

Data Federations of Web Services
• Massive datasets live near their owners:
  – Near the instrument’s software pipeline
  – Near the applications
  – Near data knowledge and curation
  – Super Computer centers become Super Data Centers
• Each Archive publishes a web service
  – Schema: documents the data
  – Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
  – A common global schema
• Challenge:
  – What is the object model for your science?
[Figure: Federation]

Web Services: The Key?
• Web SERVER:
  – Given a URL + parameters
  – Returns a web page (often dynamic)
• Web SERVICE:
  – Given an XML document (SOAP message)
  – Returns an XML document
  – Tools make this look like an RPC.
    • F(x, y, z) returns (u, v, w)
  – Distributed objects for the web.
  – + naming, discovery, security, …
• Internet-scale distributed computing
[Figure: your program fetches a web page from a Web Server over HTTP; your program exchanges an object in XML with a Web Service over SOAP, giving data in your address space]
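The "XML document in, XML document out, looks like an RPC" idea can be sketched in a few lines. This is a deliberately simplified illustration, not real SOAP: there is no envelope, no HTTP transport, and no naming, discovery, or security. The element names, the method table, and the body of `f` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def f(x: float, y: float, z: float) -> tuple:
    """Hypothetical service method: F(x, y, z) returns (u, v, w)."""
    return x + y, y + z, z + x


METHODS = {"F": f}


def encode_call(method: str, **params: float) -> str:
    """Client side: turn a call into an XML request document."""
    call = ET.Element("call", {"method": method})
    for name, value in params.items():
        ET.SubElement(call, "param", {"name": name}).text = str(value)
    return ET.tostring(call, encoding="unicode")


def handle(request_xml: str) -> str:
    """Service side: given an XML document, return an XML document."""
    call = ET.fromstring(request_xml)
    args = {p.get("name"): float(p.text) for p in call.findall("param")}
    response = ET.Element("response")
    for value in METHODS[call.get("method")](**args):
        ET.SubElement(response, "result").text = str(value)
    return ET.tostring(response, encoding="unicode")


def call_service(method: str, **params: float) -> tuple:
    """Client stub: makes the XML exchange look like an ordinary function call."""
    response = ET.fromstring(handle(encode_call(method, **params)))
    return tuple(float(r.text) for r in response.findall("result"))


if __name__ == "__main__":
    print(encode_call("F", x=1.0, y=2.0, z=3.0))
    print(call_service("F", x=1.0, y=2.0, z=3.0))
```

In a real deployment `handle` would sit behind a network endpoint; calling it directly here keeps the example self-contained.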

Emerging technologies
• Look at science
• High end computation and storage