Visualization of Heterogeneous Data
Mike Cammarano, Xin (Luna) Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy, Pat Hanrahan

Homogeneous data is easy.

Company    Founded  Headquarters      Logo
Microsoft  1975     47.6 N, 122.1 W   (image)
Enron      1985     29.7 N, 95.3 W    (image)
Google     1998     37.4 N, 122.0 W   (image)

Homogeneous data is easy.

[Figure: the same company table, with the Founded years (1975, 1985, 1998) placed on a timeline axis running from 1970 to 2000.]

Multiple sources?
• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...

DBpedia.org

According to DBpedia.org:
• DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
• The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least:
  • 80,000 persons
  • 70,000 places
  • 35,000 music albums
  • 12,000 films

Database size

We use a subset of DBpedia, mostly infoboxes and geonames.
• 30 M triples
• 2.5 GB

We currently use an in-memory database. The hardware is a dual-processor machine with dual-core AMD Opteron 280s and 8 GB of RAM.

A glimpse inside DBpedia

Kerry: dbp:PLACE_OF_BIRTH → dbp:latitude → 39° 41´ 45˝ N
Poe: dbp:birth_place → w3c:owl#sameAs → geonames:latitude → 42.358403

Heterogeneity
• Types
  • Decimal vs. sexagesimal coordinates: 39.70 vs. 39° 41´ 45˝ N
• Names
  • PLACE_OF_BIRTH vs. birth_place
• Paths
  • dbp:PLACE_OF_BIRTH dbp:latitude vs. dbp:birth_place w3c:owl#sameAs geonames:latitude
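The type mismatch on this slide can be made concrete with a small conversion. The following is a minimal sketch, not part of the presented system: the function name `dms_to_decimal` and the accepted input format are assumptions chosen for illustration.

```python
import re


def dms_to_decimal(text: str) -> float:
    """Convert a sexagesimal coordinate such as '39° 41´ 45˝ N' to decimal degrees.

    Hypothetical helper: it expects exactly three numbers (degrees, minutes,
    seconds) followed by a hemisphere letter N, S, E or W.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    hemisphere = re.search(r"[NSEW]", text.upper())
    if len(numbers) != 3 or hemisphere is None:
        raise ValueError(f"not a degrees/minutes/seconds coordinate: {text!r}")
    degrees, minutes, seconds = (float(n) for n in numbers)
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -value if hemisphere.group() in "SW" else value


if __name__ == "__main__":
    print(round(dms_to_decimal("39° 41´ 45˝ N"), 2))
```

Rounded to two decimals, the sexagesimal value from the slide agrees with the decimal 39.70 shown next to it.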

Scenario / Demo

Vision: Self-configuring data

Contributions
• Visualize heterogeneous data represented as a graph of relationships between objects
• Describe inputs to a visualization:
  • Visualization template
  • Set of keywords per attribute
• Find attributes needed for a visualization by searching paths
  • Within an iterative process of search, visualization, and refinement
• Present an algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
    • A*
    • Random sampling
  • Rank according to:
    • Keywords
    • Heuristics about graph structure

Integrate searching and visualization
• Search for potentially desirable paths
• Refine path selections
• Visualize results in context

Matching problem
• Find the best path to a number for “state latitude”

[Figure: example graph. Edge labels: state, capital, governor, pop, party, spouse, children, birth place, house leader, color, latitude, name. Values: Dianne Feinstein, Harry Reid, blue, 4, 6349000, 42.4, 39.0.]

Basic algorithm
• Find the best path to a number for “state latitude”

1. Explore graph
2. Find paths ending in a number
3. Score and rank paths using TF/IDF

Path                          Score
state.capital.latitude        0.8
state.pop                     0.6
state.governor.children       0.5
spouse.birth_place.latitude   0.5

[Figure: the example graph from the previous slide, with each path that ends in a number annotated with its score.]
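The three steps of the basic algorithm can be sketched on a toy graph. This is an illustration under stated assumptions, not the authors' implementation: the graph, the node ids, and the function names are hypothetical, and the score is a simple keyword-overlap measure standing in for TF/IDF, so the numbers it prints differ from the scores on the slide.

```python
from typing import Dict, List, Tuple, Union

Target = Union[str, int, float]
Graph = Dict[str, List[Tuple[str, Target]]]

# Hypothetical toy graph; node ids and layout are illustrative only.
GRAPH: Graph = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node")],
    "state_node": [("capital", "capital_node"),
                   ("governor", "governor_node"),
                   ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "governor_node": [("children", 4)],
    "spouse_node": [("birth_place", "place_node")],
    "place_node": [("latitude", 39.0)],
}


def numeric_paths(graph, start, max_depth=3):
    """Steps 1 and 2: explore the graph, keep label paths that end in a number."""
    found = []

    def walk(node, labels):
        for label, target in graph.get(node, []):
            path = labels + [label]
            if isinstance(target, (int, float)):
                found.append((path, target))
            elif target in graph and len(path) < max_depth:
                walk(target, path)

    walk(start, [])
    return found


def keyword_score(path, keywords):
    """Stand-in for TF/IDF: keyword coverage times the share of related labels."""
    wanted = {k.lower() for k in keywords}
    label_terms = [set(label.lower().split("_")) for label in path]
    covered = wanted & set().union(*label_terms)
    related = sum(1 for terms in label_terms if terms & wanted)
    return (len(covered) / len(wanted)) * (related / len(path))


def rank_paths(graph, start, keywords, max_depth=3):
    """Step 3: score every numeric path and sort best first."""
    scored = [(keyword_score(path, keywords), ".".join(path), value)
              for path, value in numeric_paths(graph, start, max_depth)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    for score, path, value in rank_paths(GRAPH, "senator", ["state", "latitude"]):
        print(f"{score:.2f}  {path}  ->  {value}")
```

On this toy graph the top-ranked path for the keywords "state latitude" is state.capital.latitude, which matches the ordering on the slide even though the scoring function is different.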

Improving execution time
• New pruning techniques since the paper submission
  • A*
  • Bidirectional search on terms
  • Random sampling

Pruning techniques
• Most paths do not correspond to a “state latitude”
• How can we avoid such bad paths?

[Figure: the example graph with branches annotated “No mention of latitude”, “Many unrelated terms”, and “No potential paths”.]

Pruning techniques / A* Search
• Use a scoring function that penalizes unrelated terms
• Then an A* search ignores paths with many such terms

[Figure: the example graph with one branch annotated “Many unrelated terms”.]
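The pruning idea can be sketched as a best-first search in which every edge label that shares no term with the keywords adds one penalty point, and branches over a penalty budget are dropped. This is a simplified illustration, not the presented scoring function: it is uniform-cost search (A* with a zero heuristic), and the graph, the `max_penalty` budget, and the function name are assumptions.

```python
import heapq
from itertools import count

# Hypothetical toy graph; node ids are illustrative only.
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node")],
    "state_node": [("capital", "capital_node"),
                   ("governor", "governor_node"),
                   ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "governor_node": [("children", 4)],
    "spouse_node": [("birth_place", "place_node")],
    "place_node": [("latitude", 39.0)],
}


def pruned_search(graph, start, keywords, max_penalty=1, max_depth=4):
    """Expand the lowest-penalty path first; drop paths with too many unrelated labels.

    Returns (hits, examined): hits are (penalty, path, value) for numeric
    literals reached within the budget, examined counts the edges inspected.
    """
    wanted = {k.lower() for k in keywords}
    tie = count()  # tie-breaker so the heap never compares nodes or paths
    frontier = [(0, next(tie), start, [])]
    hits, examined = [], 0
    while frontier:
        penalty, _, node, path = heapq.heappop(frontier)
        for label, target in graph.get(node, []):
            examined += 1
            terms = set(label.lower().split("_"))
            new_penalty = penalty + (0 if terms & wanted else 1)
            if new_penalty > max_penalty:
                continue  # prune: too many unrelated terms on this path
            new_path = path + [label]
            if isinstance(target, (int, float)):
                hits.append((new_penalty, new_path, target))
            elif target in graph and len(new_path) < max_depth:
                heapq.heappush(frontier, (new_penalty, next(tie), target, new_path))
    hits.sort(key=lambda hit: hit[0])
    return hits, examined


if __name__ == "__main__":
    found, edges = pruned_search(GRAPH, "senator", ["state", "latitude"])
    for penalty, path, value in found:
        print(penalty, ".".join(path), value)
    print("edges examined:", edges)
```

With a budget of one unrelated label, the spouse branch is abandoned at birth_place, so the latitude edge behind it is never examined. A real scoring function would also rank the surviving hits, since a path such as state.pop still ends in a number.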

A* pruning results (senators on map)

Average # of edges examined at each depth, full enumeration:

Depth  Image     Name      latitude
1      66        66        66
2      5409      5446      5408
3      134226    168673    145549
4      1393766   5245035   1009247

Average # of edges examined at each depth, using A*:

Depth  Image     Name      latitude
1      66        66        66
2      2049      9         598
3      1615      5092      2272
4      198       228       2148

Pruning techniques / Random Sampling
• Do normal A* search for n randomly chosen nodes

[Figure: the example graph (Dianne Feinstein) with one branch annotated “A hit!” and another annotated “No potential paths”.]

Pruning techniques / Random Sampling
• Do normal A* search for n randomly chosen nodes
• Only search known hits for the remaining nodes
• Prevents repeatedly checking where there are likely no paths

[Figure: the example graph (John Kerry) with one branch annotated “A hit!” and another annotated “No potential paths”.]
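The sampling strategy can be sketched as follows: search fully from a few random seed nodes, remember which label paths produced hits, and only replay those paths for the remaining nodes. This is a sketch under assumptions, not the presented code: exhaustive enumeration stands in for the A* search, and the graph, node ids, and function names are hypothetical.

```python
import random

# Hypothetical toy graph: three nodes of the same kind with the same shape.
GRAPH = {
    "senator_a": [("state", "state_a"), ("name", "A")],
    "senator_b": [("state", "state_b"), ("name", "B")],
    "senator_c": [("state", "state_c"), ("name", "C")],
    "state_a": [("latitude", 42.4)],
    "state_b": [("latitude", 39.0)],
    "state_c": [("latitude", 38.6)],
}


def numeric_paths(graph, start, max_depth=3):
    """Full search from one node (stand-in for A*): label paths ending in a number."""
    found = []

    def walk(node, labels):
        for label, target in graph.get(node, []):
            path = labels + (label,)
            if isinstance(target, (int, float)):
                found.append((path, target))
            elif target in graph and len(path) < max_depth:
                walk(target, path)

    walk(start, ())
    return found


def follow(graph, start, path):
    """Replay one known label path; return the number it reaches, else None."""
    node = start
    for label in path:
        targets = [t for edge_label, t in graph.get(node, []) if edge_label == label]
        if not targets:
            return None
        node = targets[0]  # simplification: take the first matching edge
    return node if isinstance(node, (int, float)) else None


def sampled_search(graph, nodes, n_seeds, rng):
    """Search fully from n_seeds random nodes, then replay known hits elsewhere."""
    results, known = {}, set()
    for node in rng.sample(nodes, n_seeds):
        results[node] = numeric_paths(graph, node)
        known.update(path for path, _ in results[node])
    for node in nodes:
        if node in results:
            continue
        results[node] = []
        for path in sorted(known):
            value = follow(graph, node, path)
            if value is not None:
                results[node].append((path, value))
    return results


if __name__ == "__main__":
    senators = ["senator_a", "senator_b", "senator_c"]
    print(sampled_search(GRAPH, senators, 1, random.Random(0)))
```

The trade-off is that a path which exists only for non-seed nodes is never discovered, which is why the number of seeds matters.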

Sampling results

Average # edges examined at all depths:

                  Image  Name  State  Latitude  Longitude  TOTAL
Seed nodes (10)   920    40    200    3100      3100       7360
Others (89)       82     35    175    144       144        580

Total edges examined:
• Without sampling: 7360 × 99 = 728640
• With sampling: 7360 × 10 + 580 × 89 = 125220
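The totals on this slide follow directly from the per-node averages. The short check below recomputes them; the variable names are chosen here for illustration.

```python
# Average edges examined per node, taken from the slide.
EDGES_PER_SEED_NODE = 7360   # full search
EDGES_PER_OTHER_NODE = 580   # only known hits replayed
TOTAL_NODES = 99
SEED_NODES = 10

without_sampling = EDGES_PER_SEED_NODE * TOTAL_NODES
with_sampling = (EDGES_PER_SEED_NODE * SEED_NODES
                 + EDGES_PER_OTHER_NODE * (TOTAL_NODES - SEED_NODES))
reduction = without_sampling / with_sampling

if __name__ == "__main__":
    print(without_sampling, with_sampling, round(reduction, 1))
```

Both totals match the slide, and sampling cuts the number of edges examined by a factor of roughly 5.8 in this example.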

Performance

Runtime for senators’ example (sec):

State latitude    0.911
State longitude   0.854
Image             0.542
Name              0.513
Instances         0.187
Total             3.007

Runtime for astronauts’ example (sec):

Mission launch    1.109
Mission insignia  1.151
Name              0.743
Instances         1.102
Total             4.105

Runtime for each field in countries’ example (sec):

GDP per capita    1.142
Inflation         2.228
Flag              0.867
Name              1.108
Instances         1.136
Total             6.481

• Performance now interactive
• With new pruning techniques, ~100x faster than reported in paper.

Variations – senators’ flags versus birth places

Timeline of manned spaceflight

Scatterplot of inflation vs. GDP

Precision / Recall

Senators – image:

           Correct  Incorrect
Accepted   86       6
Rejected   0        6

Senators – state latitude:

           Correct  Incorrect
Accepted   64       34
Rejected   1        0

Countries – gdp per capita:

           Correct  Incorrect
Accepted   206      58
Rejected   9        0

Summary
• Visualize heterogeneous data represented as a graph of relationships between objects
• Produce visualizations conforming to templates by searching for needed attributes
• Present an algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
  • Rank
• Now fast enough for interactive use
• High precision and recall

Future work
• Improvements
  • UI support for initial discovery and query refinement
  • Robustness of terms / Improved ranking
  • Automatic selection of visualization
  • Visualizing missing data
  • Visualizations that reflect result relevance (selective emphasis)
• Deploy on the web
  • Wikipedia
  • The whole web

Acknowledgements

Funding sources:
• Boeing
• RVAC
• CALO

Tools and data:
• DBpedia
• MIT SIMILE project timeline
• Tom Patterson’s map artwork

The end!

Pruning techniques / Bidirectional Search
• Before A*, search one step back from each literal, following only edges that match keywords
• This saves one step during forward A* search

[Figure: the example graph with one branch annotated “No mention of latitude”.]
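The backward step can be sketched as an index built before the forward search: for every numeric literal, record the node one step back if the connecting edge label matches a keyword. The forward search then stops as soon as it reaches an indexed node. This is an illustration under assumptions, not the presented implementation: plain breadth-first search stands in for A*, and the graph and function names are hypothetical.

```python
from collections import deque

# Hypothetical toy graph; node ids are illustrative only.
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node")],
    "state_node": [("capital", "capital_node"), ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "spouse_node": [("birth_place", "place_node")],
    "place_node": [("latitude", 39.0)],
}


def keyword_targets(graph, keywords):
    """Backward step: nodes one edge away from a number via a keyword-matching label."""
    wanted = {k.lower() for k in keywords}
    targets = {}
    for node, edges in graph.items():
        for label, target in edges:
            if isinstance(target, (int, float)) and set(label.lower().split("_")) & wanted:
                targets.setdefault(node, []).append((label, target))
    return targets


def forward_to_targets(graph, start, targets, max_depth=3):
    """Forward step: breadth-first search that finishes at indexed nodes.

    Nodes are expanded only to depth max_depth - 1 because the final edge
    of each path is already known from the backward index.
    """
    hits = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        for label, value in targets.get(node, []):
            hits.append((path + [label], value))
        if len(path) >= max_depth - 1:
            continue
        for label, target in graph.get(node, []):
            if isinstance(target, str) and target in graph:
                queue.append((target, path + [label]))
    return hits


if __name__ == "__main__":
    index = keyword_targets(GRAPH, ["latitude"])
    for path, value in forward_to_targets(GRAPH, "senator", index):
        print(".".join(path), value)
```

Literals whose incoming edge does not mention a keyword, such as pop here, never enter the index and are therefore never returned.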

Need for multiple paths
