
  • Number of slides: 21

How much information?
Adapted from a presentation by:
Jim Gray, Microsoft Research, http://research.microsoft.com/~gray
Alex Szalay, Johns Hopkins University, http://tarkus.pha.jhu.edu/~szalay/

How much information is there in the world
1. What can we store.
2. What is stored.
3. Why are we interested.

Infinite Storage?
• The Terror Bytes are Here
  – 1 TB costs 1 k$ to buy
  – 1 TB costs 300 k$/y to own
• Management & curation are expensive
  – Searching 1 TB takes minutes or hours
• Petrified by Peta Bytes?
• But… people can “afford” them so,
  – Even though they can never actually be seen in your lifetime
  – Automate the process
[Scale figure, largest to smallest: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a “We are here” marker]

How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, and anomaly detection are key technologies
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Scale figure, largest to smallest: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo. Labels along the scale, largest to smallest: Everything! Recorded; All Books MultiMedia; All books (words); Movie; A Photo; A Book]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24” disks
• 1200 rpm
• 100 ms access
• 35 k$/y rent
• Included computer & accounting software (tubes not transistors)

Storage capacity beating Moore’s law
• Improvements: Capacity 60%/y, Bandwidth 40%/y, Access time 16%/y
• 1000 $/TB today
• 100 $/TB in 2007
Moore’s law: 58.70%/year
TB growth: 112.30%/year since 1993
Price decline: 50.70%/year since 1993
Most (80%) data is personal (not enterprise). This will likely remain true.
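The step from 1000 $/TB to 100 $/TB can be checked against the quoted price decline. This is a minimal sketch that assumes "price decline 50.70%/year" means the price is multiplied by (1 - 0.507) every year; the function name is hypothetical and not part of the slides.

```python
import math


def years_to_reach(start_price: float, target_price: float, annual_decline: float) -> float:
    """Years until start_price falls to target_price under compound decline.

    Assumption: each year the price is multiplied by (1 - annual_decline).
    """
    remaining = 1.0 - annual_decline
    return math.log(target_price / start_price) / math.log(remaining)


# Slide figures: 1000 $/TB today, 100 $/TB as the target, 50.7 %/year decline.
years = years_to_reach(1000.0, 100.0, 0.507)
print(f"{years:.1f} years")
```

Under that assumption a tenfold price drop takes a little over three years.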

Disk Storage Cheaper Than Paper
• File Cabinet:
  Cabinet (4 drawer)           250$
  Paper (24,000 sheets)        250$
  Space (2 x 3 @ 10€/ft2)      180$
  Total                        700$
  0.03 $/sheet: 3 pennies per page
• Disk:
  Disk (250 GB =)              250$
  ASCII: 100 m pages           2e-6 $/sheet (10,000x cheaper): micro-dollar per page
  Image: 1 m photos            3e-4 $/photo (100x cheaper): milli-dollar per photo
• Store everything on disk
Note: Disk is 100x to 1000x cheaper than RAM
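The per-page figures follow from dividing each total cost by the number of items it holds. A minimal sketch of that arithmetic, using only the numbers on the slide (700$ for 24,000 sheets, 250$ for 100 million ASCII pages or 1 million photos); the function and variable names are illustrative.

```python
def cost_per_item(total_cost_dollars: float, items: float) -> float:
    """Cost in dollars of storing one item."""
    return total_cost_dollars / items


paper_sheet = cost_per_item(700, 24_000)      # filing cabinet, per sheet
ascii_page = cost_per_item(250, 100_000_000)  # 250 GB disk, per ASCII page
photo = cost_per_item(250, 1_000_000)         # 250 GB disk, per photo

print(f"paper: {paper_sheet:.3f} $/sheet")
print(f"ASCII: {ascii_page:.1e} $/page, {paper_sheet / ascii_page:,.0f}x cheaper than paper")
print(f"image: {photo:.1e} $/photo, {paper_sheet / photo:,.0f}x cheaper than paper")
```

The ratios come out slightly above 10,000x and 100x, which the slide rounds down to the nearest power of ten.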

Trying to fill a terabyte in a year
Item                            Items/TB    Items/day
300 KB JPEG                     3 M         9,800
1 MB Doc                        1 M         2,900
1 hour 256 kb/s MP3 audio       9 K         26
1 hour 1.5 Mbps MPEG video      290         0.8
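The first three rows of the table can be reproduced approximately with a few lines of arithmetic. The sketch below assumes a binary terabyte (2^40 bytes), 1 KB = 1024 bytes, 1 MB = 1024 x 1024 bytes, and 256 kb/s = 256,000 bits per second; none of these conventions is stated on the slide, so treat them as guesses.

```python
TB = 2**40  # assumption: binary terabyte, in bytes


def items_per_tb(item_bytes: float) -> float:
    """How many items of the given size fit in one terabyte."""
    return TB / item_bytes


def items_per_day(item_bytes: float, days: int = 365) -> float:
    """Items needed per day to fill one terabyte in `days` days."""
    return items_per_tb(item_bytes) / days


jpeg_bytes = 300 * 1024              # 300 KB JPEG
doc_bytes = 1024 * 1024              # 1 MB document
mp3_hour_bytes = 256_000 * 3600 / 8  # one hour of 256 kb/s audio

for name, size in [("JPEG", jpeg_bytes), ("Doc", doc_bytes), ("MP3 hour", mp3_hour_bytes)]:
    print(f"{name}: {items_per_tb(size):,.0f} per TB, {items_per_day(size):,.1f} per day")
```

With these conventions the results land close to the rounded slide values for those three rows.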

Portable Computer: 2006?
• 100 Gips processor
• 1 GB RAM
• 1 TB disk
• 1 Gbps network
• “Some” of your software
• Finding things is a data mining challenge

80% of data is personal / individual. But, what about the other 20%?
• Business
  – Wal-Mart online: 1 PB and growing….
  – Paradox: most “transaction” systems < 1 PB.
  – Have to go to image/data monitoring for big data
• Government
  – Government is the biggest business.
• Science
  – LOTS of data.

Q: Where will the Data Come From? A: Sensor Applications
• Earth Observation
  – 15 PB by 2007
• Medical Images & Information + Health Monitoring
  – Potential 1 GB/patient/y → 1 EB/y
• Video Monitoring
  – ~1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y → filtered???
• Airplane Engines
  – 1 GB sensor data/flight,
  – 100,000 engine hours/day
  – 30 PB/y
• Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
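Two of these estimates can be sanity-checked with unit conversions. The sketch below uses decimal units (1 GB = 1e9 bytes, 1 EB = 1e18 bytes) and a 365-day year, which are assumptions; the helper names are illustrative. It converts the slide's 10 TB/s aggregate video rate into a yearly volume, and asks how many patients the 1 EB/y medical figure implies at 1 GB per patient per year.

```python
GB, TB, PB, EB = 1e9, 1e12, 1e15, 1e18  # assumption: decimal units
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_volume(bytes_per_second: float) -> float:
    """Bytes accumulated in one year at a constant rate."""
    return bytes_per_second * SECONDS_PER_YEAR


def implied_sources(total_bytes_per_year: float, bytes_per_source_per_year: float) -> float:
    """Number of sources needed to produce the yearly total."""
    return total_bytes_per_year / bytes_per_source_per_year


video_eb_per_year = yearly_volume(10 * TB) / EB
patients = implied_sources(1 * EB, 1 * GB)

print(f"video: {video_eb_per_year:.0f} EB/y")
print(f"medical: {patients:.0e} patients")
```

The video figure comes out at a few hundred EB/y, the same order of magnitude as the slide's 100 EB/y, and the medical figure corresponds to about a billion patients.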

Premise: DataGrid Computing
• Store exabytes twice (for redundancy)
• Access them from anywhere
• Implies huge archive/data centers
• Supercomputer centers become super data centers
• Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …

Thesis
• Most new information is digital (and old information is being digitized)
• An Information Science Grand Challenge:
  – Capture
  – Organize
  – Summarize
  – Visualize this information
• Optimize Human Attention as a resource
• Improve information quality

The Evolution of Science
• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions.
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Exploration Science
  – Data captured by instruments or data generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files

Computational Science Evolves
• Historically, Computational Science = simulation.
• New emphasis on informatics:
  – Capturing,
  – Organizing,
  – Summarizing,
  – Analyzing,
  – Visualizing
• Largely driven by observational science, but also needed by simulations.
• Too soon to say if comp-X and X-info will unify or compete.
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]

Next-Generation Data Analysis
• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at same rate, we can only keep up with N log N
• A way out?
  – Discard notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
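The scaling argument can be made concrete with a small calculation. The sketch below assumes that data size and compute speed both double in each generation (the doubling factor and the starting N are illustrative choices, not from the slides) and reports how much longer each class of algorithm then takes to run.

```python
import math


def runtime_ratio(cost, n: float, growth: float = 2.0) -> float:
    """Relative run time after data and compute both grow by `growth`.

    cost(n) is the operation count for n items; compute speed is assumed
    to scale by the same factor as the data.
    """
    return cost(n * growth) / (growth * cost(n))


n = 1e9  # illustrative dataset size
ratio_nlogn = runtime_ratio(lambda x: x * math.log2(x), n)
ratio_n2 = runtime_ratio(lambda x: x**2, n)
ratio_n3 = runtime_ratio(lambda x: x**3, n)

print(f"N log N: {ratio_nlogn:.3f}x")
print(f"N^2:     {ratio_n2:.1f}x")
print(f"N^3:     {ratio_n3:.1f}x")
```

Under these assumptions an N log N method takes only a few percent longer per generation, while an N^2 method takes twice as long and an N^3 method four times as long, which is why only the first keeps pace.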

Smart Data (active databases)
• If there is too much data to move around, take the analysis to the data!
• Do all data manipulations at database
  – Build custom procedures and functions in the database
• Automatic parallelism guaranteed
• Easy to build in custom functionality
  – Databases & Procedures being unified
  – Example: temporal and spatial indexing
  – Pixel processing
• Easy to reorganize the data
  – Multiple views, each optimal for certain types of analyses
  – Building hierarchical summaries is trivial
• Scalable to Petabyte datasets
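As a small illustration of taking the analysis to the data, the sketch below registers a custom function inside a database and lets the database compute a per-zone summary, so only the summary rows leave it. It uses Python's built-in sqlite3 module as a stand-in; the slides do not name a database, SQLite does not provide the automatic parallelism mentioned above, and the table, column, and function names are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (zone INTEGER, flux REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1, 10.0), (1, 1000.0), (2, 100.0)],
)


def log_flux(flux: float) -> float:
    """Custom per-row function that will run inside the database engine."""
    return math.log10(flux)


# Register the Python function so SQL queries can call it by name.
conn.create_function("log_flux", 1, log_flux)

# The database does the grouping and averaging; only the summary is returned.
summary = conn.execute(
    "SELECT zone, COUNT(*), AVG(log_flux(flux)) "
    "FROM obs GROUP BY zone ORDER BY zone"
).fetchall()

for zone, count, mean_log_flux in summary:
    print(zone, count, round(mean_log_flux, 3))
```

The same pattern extends to views and coarser GROUP BY levels, which is one way to read the point about hierarchical summaries.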

Challenge: Make Data Publication & Access Easy
• Augment FTP with data query: Return intelligent data subsets
• Make it easy to
  – Publish: Record structured data
  – Find:
    • Find data anywhere in the network
    • Get the subset you need
  – Explore datasets interactively
• Realistic goal:
  – Make it as easy as publishing/reading web sites today.

Data Federations of Web Services
• Massive datasets live near their owners:
  – Near the instrument’s software pipeline
  – Near the applications
  – Near data knowledge and curation
  – Super Computer centers become Super Data Centers
• Each Archive publishes a web service
  – Schema: documents the data
  – Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives
  – A common global schema
• Challenge:
  – What is the object model for your science?
[Diagram label: Federation]

Web Services: The Key?
• Web SERVER:
  – Given a url + parameters
  – Returns a web page (often dynamic)
• Web SERVICE:
  – Given an XML document (soap msg)
  – Returns an XML document
  – Tools make this look like an RPC.
    • F(x, y, z) returns (u, v, w)
  – Distributed objects for the web.
  – + naming, discovery, security, ..
• Internet-scale distributed computing
[Diagrams: your program sends http to a Web Server and gets back a web page; your program sends soap to a Web Service and gets back an object in xml, which becomes data in your address space]
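The "XML document in, XML document out, made to look like an RPC" idea can be sketched without any network. In the example below the service is an ordinary local function and the messages are plain XML, not real SOAP envelopes; the element names, the arithmetic the service performs, and the names service and f are all hypothetical.

```python
import xml.etree.ElementTree as ET


def service(request_xml: str) -> str:
    """Hypothetical service: takes XML holding x, y, z and returns XML holding u, v, w."""
    req = ET.fromstring(request_xml)
    x, y, z = (float(req.findtext(name)) for name in ("x", "y", "z"))
    resp = ET.Element("response")
    for name, value in (("u", x + y), ("v", y + z), ("w", z + x)):
        ET.SubElement(resp, name).text = repr(value)
    return ET.tostring(resp, encoding="unicode")


def f(x: float, y: float, z: float) -> tuple:
    """Client stub: hides the XML so the call looks like an ordinary function."""
    req = ET.Element("request")
    for name, value in (("x", x), ("y", y), ("z", z)):
        ET.SubElement(req, name).text = repr(float(value))
    # In a real deployment this string would travel over the network.
    resp = ET.fromstring(service(ET.tostring(req, encoding="unicode")))
    return tuple(float(resp.findtext(name)) for name in ("u", "v", "w"))


u, v, w = f(1, 2, 3)
print(u, v, w)
```

The caller of f never sees XML; the stub converts arguments to a document and the reply back into values, which is the role the slide assigns to the tools.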

Emerging technologies
• Look at science
• High end computation and storage