Скачать презентацию The Information Avalanche Reducing Information Overload Jim Gray Скачать презентацию The Information Avalanche Reducing Information Overload Jim Gray

dc952c264db0f82f6113f29c4fdf2e63.ppt

  • Количество слайдов: 77

The Information Avalanche: Reducing Information Overload Jim Gray Microsoft Research Onassis Foundation Science Lecture The Information Avalanche: Reducing Information Overload Jim Gray Microsoft Research Onassis Foundation Science Lecture Series http: //www. forth. gr/onassis/lectures/2002 -07 -15/index. html Heraklion, Crete, Greece, 15 -19 July 2002 1

Thesis • Most new information is digital (and old information is being digitized) • Thesis • Most new information is digital (and old information is being digitized) • A Computer Science Grand Challenge: – Capture – Organize – Summarize – Visualize This information • Optimize Human Attention as a resource. • Improve information quality 2

Information Avalanche • The Situation – a census of the data – We can Information Avalanche • The Situation – a census of the data – We can record everything – Everything is a LOT! • The Good news – Changes science, education, medicine, entertainment, …. – Shrinks time and space – Can augment human intelligence • The Bad News – The end of privacy – Cyber Crime / Cyber Terrorism – Monoculture • The Technical Challenges – Amplify human intellect – Organize, summarize and prioritize information – Make programming easy. 3

How much information is there? Yotta • Soon everything can be Everything recorded and How much information is there? Yotta • Soon everything can be Everything recorded and indexed ! Recorded • Most bytes will never be All Books seen by humans. Multi. Media • Data summarization, trend detection All Lo. C books (words) anomaly detection are key technologies See Mike Lesk: How much information is there: http: //www. lesk. com/mlesk/ksg 97/ksg. html See Lyman & Varian: How much information http: //www. sims. berkeley. edu/research/projects/how-much-info/ 24 Yecto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli . Movie A Photo A Book Zetta Exa Peta Tera Giga Mega 4 Kilo

Information Census Lesk Varian & Leyman EB • • ~10 Exabytes PB ~90% digital Information Census Lesk Varian & Leyman EB • • ~10 Exabytes PB ~90% digital > 55% personal TB Print: . 003% of bytes 5 TB/y, but text has lowest entropy • Email is 4 PB/y and is 20% text Media • WWW is ~50 TB deep web ~50 PB • Growth: 50%/y (10 Bmpd) (estimate by Gray) TB/y Growth Rate, % optical 50 70 paper 100 2 100, 000 4 film magnetic 1, 000 total 1, 100, 150 5 55 50

93% 6 93% 6

Storage capacity beating Moore’s law • Improvements: Capacity 60%/y Bandwidth 40%/y Access time 16%/y Storage capacity beating Moore’s law • Improvements: Capacity 60%/y Bandwidth 40%/y Access time 16%/y • 1000 €/TB today • 100 €/TB in 2007 7

Disk Storage Cheaper than Paper • File Cabinet: cabinet (4 drawer) paper (24, 000 Disk Storage Cheaper than Paper • File Cabinet: cabinet (4 drawer) paper (24, 000 sheets) space (2 x 3 @ 10€/ft 2) total 0. 03 €/sheet • Disk: 250$ 180$ 700$ disk (160 GB =) 200$ ASCII: 500 m pages 2 e-7 €/sheet (10, 000 x cheaper) Image: 1 m photos 3 e-4 €/photo (100 x cheaper) • Store everything on disk 8

Low rent min $/byte Shrinks time now or later Shrinks space here or there Low rent min $/byte Shrinks time now or later Shrinks space here or there Automate processing knowbots Immediate OR Time Delayed Why Put Everything in Cyberspace? Point-to-Point OR Broadcast Locate Process Analyze Summarize 9

Storage trends • Right now, it’s affordable to buy 100 GB/year • In 5 Storage trends • Right now, it’s affordable to buy 100 GB/year • In 5 years you can afford to buy 1 TB/year! (assuming storage doubles every 18 months) 10

Trying to fill a terabyte in a year Items/TB Items/day 300 KB JPEG 3. Trying to fill a terabyte in a year Items/TB Items/day 300 KB JPEG 3. 6 M 9800 1 MB Doc 1. 0 M 2900 1 hour 256 kb/s MP 3 audio 1 hour 1. 5 Mbp/s MPEG video 9. 3 K 26 290 4 11

Memex As We May Think, Vannevar Bush, 1945 “A memex is a device in Memex As We May Think, Vannevar Bush, 1945 “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility” “yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely” 12

Gordon Bell’s Main. Brain™ Digitize Everything A BIG shoebox? • • • Scans 20 Gordon Bell’s Main. Brain™ Digitize Everything A BIG shoebox? • • • Scans 20 k “pages” [email protected] 300 dpi Music: 2 k “tacks” Photos: 13 k images Video: 10 hrs Docs: 3 k (ppt, word, . . ) Mail: 100 k messages 1 GB 7 GB 2 GB 3 GB 18 GB 13

Gary Starkweather • • • Scan EVERYTHING 400 dpi TIFF 70 k “pages” ~ Gary Starkweather • • • Scan EVERYTHING 400 dpi TIFF 70 k “pages” ~ 14 GB OCR all scans (98% recognition ocr accuracy) All indexed (5 second access to anything) All on his laptop. 14

Access! 15 Access! 15

50% personal, What about the other 50% • Business – Wall Mart online: 1 50% personal, What about the other 50% • Business – Wall Mart online: 1 PB and growing…. – Paradox: most “transaction” systems have mere PBs. – Have to go to image/data monitoring for big data • Government – Online government is big thrust (cheaper, better, …) • Science 16

Instruments: CERN – LHC Peta Bytes per Year Looking for the Higgs Particle • Instruments: CERN – LHC Peta Bytes per Year Looking for the Higgs Particle • Sensors: 1000 GB/s (1 TB/s) • Events 75 GB/s • Filtered 5 GB/s • Reduced 0. 1 GB/s ~ 2 PB/y • Data pyramid: 100 GB : 1 TB : 100 TB : 1 PB : 10 PB CERN Tier 0 17

LHC Requirements (2005 - ) • 1 E 9 events pa @ 1 MB/ev LHC Requirements (2005 - ) • 1 E 9 events pa @ 1 MB/ev = 1 PB/year/expt • Reconstructed = 100 TB/recon/year/expt • Send to Tier 1 Regional Centres • => 400 TB/year to RAL? • Keep one set + derivatives on disk • …and rest on tape • But UK plans a Tier 1 clone • Many data clones Source: John Gordon IT Department, CLRC/RAL CUF Meeting, October 2000 18

Science Data Volume ESO/STECF Science Archive • 100 TB archive • Similar at Hubble, Science Data Volume ESO/STECF Science Archive • 100 TB archive • Similar at Hubble, Keck, SDSS, … • ~1 PB aggregate 19

Data Pipeline: NASA • • Level 0: raw data Level 1: calibrated data Level Data Pipeline: NASA • • Level 0: raw data Level 1: calibrated data Level 1 A: calibrated & normalized Level 2: derived data metrics data stream measured values flux/magnitude/… vegetation index • Data volume – 0 ~ 1 A << 2 • Level 2 >> level 1 because 4 editions of 4 Level 2 products, each is small, but… E 4 E 3 time – MANY data products – Must keep all published – data Editions (versions) Level 1 A E 2 E 1 EOSDIS Core System Information for Scientists, http: //observer. gsfc. nasa. gov/sec 3/Product. Levels. html 20

Terra. Server http: //Terra. Service. net/ • • • 3 x 2 TB databases Terra. Server http: //Terra. Service. net/ • • • 3 x 2 TB databases 18 TB disk tri-plexed (=6 TB) 3 + 1 Cluster 99. 96% uptime 1 B page views 5 B DB queries Now a. NET web service 21

Image Data 1 m resolution 2 m resolution 12 TB 95 % U. S. Image Data 1 m resolution 2 m resolution 12 TB 95 % U. S. Coverage 1 TB 100% U. S. Coverage USGS Aerial photos “DOQ” 1 Km resolution 100 % World Coverage Encarta Virtual Globe USGS Topo Maps • All in the database 200 x 200 pixel tiles compressed • Spatial access z-Tranform Btree 22

Hardware 8 Compaq DL 360 “Photon” Web Servers One SQL database per rack Each Hardware 8 Compaq DL 360 “Photon” Web Servers One SQL database per rack Each rack contains 4. 5 tb 261 total drives / 13. 7 TB total 2200 Meta Data Stored on 101 GB “Fast, Small Disks” (18 x 18. 2 GB) Imagery Data Stored on 4 339 GB “Slow, Big Disks” (15 x 73. 8 GB) To Add 90 72. 8 GB Disks in Feb 2001 to create 18 TB SAN O O J 2200 Fiber SAN Switches E E J 2200 P Q K L 2200 F SQLInst 1 G SQLInst 2 2200 R S 2200 M N 2200 SQLInst 3 H I Spare 23 4 Compaq Pro. Liant 8500 Db Servers

Terra. Server Lessons Learned • • Hardware is 5 9’s (with clustering) 99999 Software Terra. Server Lessons Learned • • Hardware is 5 9’s (with clustering) 99999 Software is 5 9’s (with clustering) 99999 Admin is 4 9’s (offline maintenance) 9999 Network is 3 9’s (mistakes, environment)9 9 9 • Simple designs are best • 10 TB DB is management limit 1 PB = 100 x 10 TB DB this is 100 x better than 5 years ago. • Minimize use of tape – Backup to disk (snapshots) – Portable disk TBs 24

Sensor Applications • Earth Observation – 15 PB by 2007 • Medical Images & Sensor Applications • Earth Observation – 15 PB by 2007 • Medical Images & Information + Health Monitoring – Potential 1 GB/patient/y 1 EB/y • Video Monitoring – ~1 E 8 video cameras @ 1 E 5 MBps 10 TB/s 100 EB/y filtered? ? ? • Airplane Engines – 1 GB sensor data/flight, – 100, 000 engine hours/day – 30 PB/y • Smart Dust: ? ? EB/y http: //robotics. eecs. berkeley. edu/~pister/Smart. Dust/ http: //www-bsac. eecs. berkeley. edu/~shollar/macro_motes/macromotes. html 25

What do they do with the data business, government, science More later in talk What do they do with the data business, government, science More later in talk y x • Look for anomalies – 1, 2, 1, 1, 1, 2, -5, 1, 0, 2, y • Look for trends and patterns x – 1, 2, 3, 4, 5, • Look for correlations ln(z) – ln(x) – ln(y) ~ c ln(z) • Look at summaries then drill down to details • LOTS of histograms Ln(x/y) 26

Premise: Grid Computing • Store exabytes once or twice (for redundancy) • Access them Premise: Grid Computing • Store exabytes once or twice (for redundancy) • Access them from anywhere • Implies huge archive/data centers • Supercomputer centers become super data centers • Examples: Google, Yahoo!, Hotmail, CERN, Fermilab, SDSC 27

Bandwidth: 3 x bandwidth/year for 25 more years • Today: – 40 Gbps per Bandwidth: 3 x bandwidth/year for 25 more years • Today: – 40 Gbps per channel (λ) – 12 channels per fiber (wdm): 500 Gbps – 32 fibers/bundle = 16 Tbps/bundle • • In lab 3 Tbps/fiber (400 x WDM) In theory 25 Tbps per fiber 1 Tbps = USA 1996 WAN bisection bandwidth Aggregate bandwidth doubles every 8 months! 1 fiber = 25 Tbps 28

Underlying Theme Digital Everything Ø From “words and numbers” to “sights and sounds” Automation Underlying Theme Digital Everything Ø From “words and numbers” to “sights and sounds” Automation Ø From dumb to Web services Ø From manual to self-tuning, self organizing, and self maintaining Ø Beyond reliability to availability New Devices Ø From isolated to adaptive, synchronized, and connected One inter-connected network Ø From stand alone/basic connectivity to always wired (and wireless) 29 Ø Everything over IP

Information Avalanche • The Situation – a census of the data – We can Information Avalanche • The Situation – a census of the data – We can record everything – Everything is a LOT! • The Good news – Changes science, education, medicine, entertainment, …. – Shrinks time and space – Can augment human intelligence • The Bad News – The end of privacy – Cyber Crime / Cyber Terrorism – Monoculture • The Technical Challenges – Amplify human intellect – Organize, summarize and prioritize information – Make programming easy. 30

Online Science • • All literature online All data online All instruments online Great Online Science • • All literature online All data online All instruments online Great analysis tools. 31

Online Education • • • All literature online All lectures online Interactive and time-shifted Online Education • • • All literature online All lectures online Interactive and time-shifted education Just-in-time education Available to everyone everywhere – Economic model is not understood (who pays? ) – One model: “society pays” 32

Online Business • • • Frictionless economy Near-perfect information Very efficient Fully customized products Online Business • • • Frictionless economy Near-perfect information Very efficient Fully customized products Example: Wallmart / Dell: – Traditional business 1 -10 inventory turns/y – e. Buisiness 100 -500 turns/y: no inventory • VERY efficient, huge economic advantage • Your customers & suppliers loan you money! 33

Online Medicine • Traditional medicine: – Can monitor your health continuously – Instant diagnosis Online Medicine • Traditional medicine: – Can monitor your health continuously – Instant diagnosis – Personalized drugs • New Biology – DNA is software – “solve each disease” – Huge impact on agriculture too 34

Cyber-Space Shrinks Time and Distance • Everyone is always connected – Can get information Cyber-Space Shrinks Time and Distance • Everyone is always connected – Can get information they want – Can communicate with friends & family • Everything is online – You never miss a meeting/game/party/movie (you can always watch it) – You never forget anything (its there somewhere) 35

Sustainable Society • Year 2050: 9 B people living at Europe’s standard of living Sustainable Society • Year 2050: 9 B people living at Europe’s standard of living • 100 M people in a city? • Environment can’t sustain it • More efficient cities/transportation/… • 20% consume 60% now if 100 % consume 1/3 of current levels net consumption unchanged. • Need to reduce energy/water/metal consumption 3 x in developed world. 36

Cyber. Space (data) and Tools Can Augment Human Intelligence • See next talk (12 Cyber. Space (data) and Tools Can Augment Human Intelligence • See next talk (12 CS challenges) • My. Main. Brain is a personal example: improved memory • Data mining tools are promising 37

Information Avalanche • The Situation – a census of the data – We can Information Avalanche • The Situation – a census of the data – We can record everything – Everything is a LOT! • The Good news – Changes science, education, medicine, entertainment, …. – Shrinks time and space – Can augment human intelligence • The Bad News – The end of privacy – Cyber Crime / Cyber Terrorism – Monoculture • The Technical Challenges – Amplify human intellect – Organize, summarize and prioritize information – Make programming easy. 38

The End Of Privacy • You can find out all about me. • Organizations The End Of Privacy • You can find out all about me. • Organizations can precisely track us – Credit cards, email, cellphone, … • Animals have “tags” in them, I will probably get a tag (eventually) (I already carry a dozen ID & smart cards). • “You have no privacy, get over it” Scott Mcnealy 39

The Centralization of Power • Computers enable an Orwellian future (1984) • The government The Centralization of Power • Computers enable an Orwellian future (1984) • The government can know everything you ever – – Buy Say Hear See/Read/… • Where you are (phone company already knows) • Who you see and talk to • OK now, but what if Nero/Hitler/Stalin/. . comes to power? 40

Cyber Crime • You can steal my identity – Sell my house – Accumulate Cyber Crime • You can steal my identity – Sell my house – Accumulate huge debts – Make a video of me doing terrible things. • You can steal on a grand scale – Now Trillions of dollars are online. – A LARGE honey-pot for criminals. 41

Cyber Terrorism • • It is easier to attack/destroy than to steal. Viruses, data Cyber Terrorism • • It is easier to attack/destroy than to steal. Viruses, data corruption, data modification Denial of Service Hijacking and then destroying equipment – Utilities (water, energy, transportation) – Production (factories) 42

Monoculture • Radio & TV & movies & Internet are making the world more Monoculture • Radio & TV & movies & Internet are making the world more homogenous. • ½ the world has never made a phone call • But this is changing fast (they want to make phone calls!) • The wired world enables communities to form very easily – e. g. Sanskrit scholars. • But the community has to speak a common language. 43

Information Clutter • Most mail is junk mail • Most e. Mail will soon Information Clutter • Most mail is junk mail • Most e. Mail will soon be junk mail – 30% of hotmail, 75% of my mail (~130 m/d). • Telemarketing wastes people’s time. • Creates info-glut • You have 50, 000 new mail messages • Need systems and interfaces to filter, summarize, prioritize information 44

Information Avalanche • The Situation – a census of the data – We can Information Avalanche • The Situation – a census of the data – We can record everything – Everything is a LOT! • The Good news – Changes science, education, medicine, entertainment, …. – Shrinks time and space – Can augment human intelligence • The Bad News – The end of privacy – Cyber Crime / Cyber Terrorism – Monoculture • The Technical Challenges – Amplify human intellect – Organize, summarize and prioritize information – Make programming easy. 45

Technical Challenges • • • Storing information Organizing information Summarizing information Visualizing information Make Technical Challenges • • • Storing information Organizing information Summarizing information Visualizing information Make programming easy 46

The personal Terabyte (all your stuff online) So you’ve got it – now what The personal Terabyte (all your stuff online) So you’ve got it – now what do you do with it? • Probably not accessed very often but TREASURED (what’s the one thing you would save in a fire? ) • Can you find anything? • Can you organize that many objects? • Once you find it will you know what it is? • Once you’ve found it, could you find it again? • Research Goal: Have GOOD answers for all these Questions 47

Bell, Gemmell, Lueder: My. Life. Bits Guiding Principles 1. Freedom from strict hierarchy – Bell, Gemmell, Lueder: My. Life. Bits Guiding Principles 1. Freedom from strict hierarchy – Full text search & Collections 2. Many visualizations – “don’t metaphor me in” 3. Annotations add value – So make them easy! 4. Keep the links when you author – “transclusion” 5. Everything goes in a database 48

How will we find it? Put everything in the DB (and index it) • How will we find it? Put everything in the DB (and index it) • Need dbms features Consistency, Indexing, Pivoting, Queries, Speed/scalability, Backup, replication If you don’t use one, creating one! • Simple logical structure: – Blob and link is all that is inherent – Additional properties (facets == extra tables) and methods on those tables (encapsulation) • • More than a file system Unifies data and meta-data Simpler to manage Easier to subset and reorganize Set-oriented access Allows online updates Automatic indexing, replication SQL 49

How do we represent it to the outside world? <? xml version= How do we represent it to the outside world? - - • File metaphor too primitive: just a blob • Table metaphor too primitive: just records • Need Metadata describing data context – Format – Providence (author/publisher/ citations/…) schema – Rights – History – Related documents • • … - - -

184. 028935351008 -1. 12590950121524
… - 184. 025719033547 -1. 21795827920186
Data or difgram In a standard format XML and XML schema Data. Set is great example of this World is now defining standard schemas 50

There is a problem Niklaus Wirth: Algorithms + Data Structures = Programs • GREAT!!!! There is a problem Niklaus Wirth: Algorithms + Data Structures = Programs • GREAT!!!! –XML documents are portable objects –XML documents are complex objects –WSDL defines the methods on objects (the class) • But will all the implementations match? –Think of UNIX or SQL or C or… • This is a work in progress. 51

Photo. Server: Managing Photos • • • Load all photos into the database Annotate Photo. Server: Managing Photos • • • Load all photos into the database Annotate the photos View by various attributes Do similarity Search Use XML for interchange Use db. Object, Template for access SQL, Templates, XML data j. Script DOM IIS XML datasets & mime data Templates Schema SQL (for xml) 52

How Similarity Search Works • • For each picture Loader – Inserts thumbnails – How Similarity Search Works • • For each picture Loader – Inserts thumbnails – Extracts 270 Features into a blob When looking for similar picture – Scan all photos comparing features many black squares No 10% orange (dot product of vectors) …etc – Sort by similarity Feature blob is an array 72% match 27% – Today I fake the array with functions and cast(substring(feature, 72, 8) as float) – When SQL Server gets C# I will not have to fake it. – And… it will run 100 x faster (compiled managed code). Idea pioneered by IBM Research, we use a variant by MS Beijing Research. black squares 20% orange …etc match 53

Key Observations • • Data: – XML data sets are a universal way to Key Observations • • Data: – XML data sets are a universal way to represent answers – XML data sets minimize round trips: 1 request/response Search – It is BEST to index – You can put objects and attributes in a row (SQL puts big blobs off-page) – If you can’t index, You can extract attributes and quickly compare – SQL can scan at 2 M records/cpu/second – Sequential scans are embarrassingly parallel. 54

What about Big Data • Talked about organizing personal data • What about BIG What about Big Data • Talked about organizing personal data • What about BIG data. • Most of the following slides inspired by (or even copied from) Alex Szalay JHU and George Djorgovski Cal Tech 55

Data Knowledge ? • Exponential growth of data volume, complexity, quality • But growth Data Knowledge ? • Exponential growth of data volume, complexity, quality • But growth SLOW growth of knowledge understanding • Why? Methodology bottleneck Human wetware limitations … • Need AI-assisted discovery 56 Adapted from slides by Alex Szalay and George Djorgovski

What’s needed? (not drawn to scale) Miners Scientists Data Mining Algorithms Plumbers Database To What’s needed? (not drawn to scale) Miners Scientists Data Mining Algorithms Plumbers Database To store data Execute Queries Question & Answer Visualization Tools 57

How Discoveries Made? adapted from slide by George Djorgovski • Conceptual Discoveries: e. g. How Discoveries Made? adapted from slide by George Djorgovski • Conceptual Discoveries: e. g. , Relativity, QM, Brane World, Inflation … Theoretical, may be inspired by observations • Phenomenological Discoveries: e. g. , Dark Matter, QSOs, GRBs, CMBR, Extrasolar Planets, Obscured Universe … Empirical, inspire theories, can be motivated by them New Technical Capabilities Observational Discoveries Theory Phenomenological Discoveries: Explore parameter space Make new connections (e. g. , multi- ) Understanding of complex phenomena requires complex, information-rich data (and simulations? ) 58

Data Mining in the Image Domain: Can We Discover New Types of Phenomena Using Data Mining in the Image Domain: Can We Discover New Types of Phenomena Using Automated Pattern Recognition? (Every object detection algorithm has its biases and limitations) – Effective parametrization of source morphologies and environments – Multiscale analysis (Also: in the time/lightcurve domain) 59

Exploration of Parameter Spaces in the Catalog Domain (Source Attributes) • Clustering Analysis (supervised Exploration of Parameter Spaces in the Catalog Domain (Source Attributes) • Clustering Analysis (supervised and unsupervised): – How many different types of objects are there? – Are there any rare or new types, outliers? • Multivariate Correlation Search: – Are there significant, nontrivial correlations present in the data? Clusters vs. Correlations: Science Correlations reduction of the statistical dimensionality 60

New Science from Multivariate Correlations DD = 2 DS = 1 f (xi, xj, New Science from Multivariate Correlations DD = 2 DS = 1 f (xi, xj, …) xi Data dimension DD = 2 Statistical dim. DS = 2 xj If DS < DD, then MV correlations are present xk Fundamental Plane of E-galaxies: Correlations objectively define types of objects, e. g. , TFR normal spirals, FP normal ellipticals … and can lead to some new insights 61

But… DD >> 1, DS >> 1 Data Complexity Multidimensionality Discoveries But the bad But… DD >> 1, DS >> 1 Data Complexity Multidimensionality Discoveries But the bad news is … The Curse of Hyper-dimensionality, The computational cost of clustering analysis: K-means: K N I D Expectation Maximization: K N I D 2 Monte Carlo Cross-Validation: M Kmax 2 N I D 2 N = no. of data vectors ~1 e 12, K = no. of clusters chosen, I = no. of iterations, D = no. of data dimensions ~1 e 4 Kmax = max no. of clusters tried M = no. of Monte Carlo trials/partitions Exascale computing and / or better algorithms Some dimensionality reduction methods do exist (e. g. , PCA, class prototypes, hierarchical methods, etc. ), but more work is needed 62

The Curse of Hyper-dimensionality • Visualization! • A fundamental limitation of the human perception: The Curse of Hyper-dimensionality • Visualization! • A fundamental limitation of the human perception: DMAX = 3? 5? (NB: We can certainly understand mathematically much higher dimensionalities, but cannot really visualize them; our own Neural Nets are powerful pattern recognition tools) DM Algorithm User • Interactive visualization a key part of the data mining process: Visualization • Some methodology exists, but much more is needed 63

 Online Multivariate Analysis Challenges • Data heterogeneity, biases, selection effects … • Non-Gaussianity Online Multivariate Analysis Challenges • Data heterogeneity, biases, selection effects … • Non-Gaussianity of clusters (data models) Outlier population, or a non-Gaussian tail? • Non-trivial topology of clustering • Useful vs. “useless” parameters … 64

Useful vs. “Useless” Parameters: Clusters (classes) and correlations may exist/separate in some parameter subspaces, Useful vs. “Useless” Parameters: Clusters (classes) and correlations may exist/separate in some parameter subspaces, but not in others xi xn xj xm 65

Optimal Statistics following slides adapted from Alex Szalay • statistics algorithms scale poorly – Optimal Statistics following slides adapted from Alex Szalay • statistics algorithms scale poorly – Correlation functions N 2, likelihood techniques N 3 • Even if data and computers grow at same rate Computers can do at most N log. N algorithms • Possible solutions? – Assumes infinite computational resources – Assumes that only source of error is statistical – `Cosmic Variance’: we can only observe the Universe from one location (finite sample size) • Solutions require combination of Statistics and CS • New algorithms: not worse than N log. N 66

Clever Data Structures • Heavy use of tree structures: – Initial cost Nln. N Clever Data Structures • Heavy use of tree structures: – Initial cost Nln. N – Large speedup later – Tree-codes for correlations (A. Moore et al 2001) • Fast, approximate heuristic algorithms – No need to be more accurate than cosmic variance – Fast CMB analysis by Szapudi etal (2001) • N log. N instead of N 3 => 1 day instead of 10 million years • Take cost of computation into account – Controlled level of accuracy – Best result in a given time, given our computing resources 67

Angular Clustering with Photo-z • w( ) by Peebles and Groth: – The first Angular Clustering with Photo-z • w( ) by Peebles and Groth: – The first example of publishing and analyzing large data • • Samples based on rest-frame quantities Strictly volume limited samples Largest angular correlation study to date Very clear detection of – Luminosity and color dependence • Results consistent with 3 D clustering T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston and the SDSS Collaboration 68

The Samples 2800 square degrees in 10 stripes, data in custom DB All: 50 The Samples 2800 square degrees in 10 stripes, data in custom DB All: 50 M mr<21 : 15 M 10 stripes: 10 M 0. 1 Mr 0. 1 Mr 2. 2 M 3. 1 M -20 > Mr >-21 > Mr >-23 -21 > Mr >-22 1182 k 931 k 662 k -22 > Mr >-23 343 k 254 k 185 k 316 k 280 k 326 k 185 k 127 k 269 k 69

The Stripes • 10 stripes over the SDSS area, covering about 2800 square degrees The Stripes • 10 stripes over the SDSS area, covering about 2800 square degrees • About 20% lost due to bad seeing • Masks: seeing, bright stars, etc. • Images generated from query by web service 70

The Masks • Stripe 11 + masks • Masks are derived from the database The Masks • Stripe 11 + masks • Masks are derived from the database – Search and intersect extended objects with boundaries 71

The Analysis • e. Sp. ICE : I. Szapudi, S. Colombi and S. Prunet The Analysis • e. Sp. ICE : I. Szapudi, S. Colombi and S. Prunet • Integrated with the database by T. Budavari • Extremely fast processing (N log. N) – 1 stripe with about 1 million galaxies is processed in 3 mins – Usual figure was 10 min for 10, 000 galaxies => 70 days • Each stripe processed separately for each cut • 2 D angular correlation function computed • w( ): average with rejection of pixels along the scan – flat field vector causes mock correlations 72

Angular Correlations I. • Luminosity dependence: 3 cuts -20> M > -21> M > Angular Correlations I. • Luminosity dependence: 3 cuts -20> M > -21> M > -22> M > -23 73

Angular Correlations II. • Color Dependence 4 bins by rest-frame SED type 74 Angular Correlations II. • Color Dependence 4 bins by rest-frame SED type 74

If there’s time • • • Better User Interfaces Organizing photos Organizing newsgroups Enhancing If there’s time • • • Better User Interfaces Organizing photos Organizing newsgroups Enhancing meetings. Attentional interfaces: 0 Task. Galary. MPG 1 Digital Photo. mpg 2 Communities. mpg 3 flows. mpg 4 Side Show. mpg 75

Thesis • Most new information is digital (and old information is being digitized) • Thesis • Most new information is digital (and old information is being digitized) • A Computer Science Grand Challenge: – Capture – Organize – Summarize – Visualize This information • Optimize Human Attention as a resource. 76 • Improve information quality

Information Avalanche • The Situation – a census of the data – We can Information Avalanche • The Situation – a census of the data – We can record everything – Everything is a LOT! • The Good news – Changes science, education, medicine, entertainment, …. – Shrinks time and space – Can augment human intelligence • The Bad News – The end of privacy – Cyber Crime / Cyber Terrorism – Monoculture • The Technical Challenges – Amplify human intellect – Organize, summarize and prioritize information – Make programming easy. 77