- Number of slides: 41
Visualization of Heterogeneous Data
Mike Cammarano, Xin (Luna) Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Halevy, Pat Hanrahan
Homogeneous data is easy.
Company    Founded  Headquarters      Logo
Microsoft  1975     47.6 N, 122.1 W   [image]
Enron      1985     29.7 N, 95.3 W    [image]
Google     1998     37.4 N, 122.0 W   [image]
Homogeneous data is easy. [Figure: the same Company / Founded / Headquarters / Logo table, with the Founded column plotted on a timeline axis (1970, 1980, 1990, 2000).]
Multiple sources?
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848...
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...
DBpedia.org
According to DBpedia.org:
• DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
• The DBpedia dataset currently provides information about more than 1.95 million “things”, including at least:
  • 80,000 persons
  • 70,000 places
  • 35,000 music albums
  • 12,000 films
Database size
We use a subset of DBpedia, mostly infoboxes and geonames.
• 30 M triples
• 2.5 GB
We currently use an in-memory database. The hardware is a dual-processor machine with dual-core AMD Opteron 280s and 8 GB of RAM.
A glimpse inside DBpedia
A glimpse inside DBpedia
Kerry: dbp:PLACE_OF_BIRTH → dbp:latitude → 39° 41′ 45″ N
Poe: dbp:birth_place → w3c:owl#sameAs → geonames:latitude → 42.358403
Heterogeneity
• Types
  • Decimal vs. sexagesimal coordinates: 39.70 vs. 39° 41′ 45″ N
• Names
  • PLACE_OF_BIRTH vs. birth_place
• Paths
  • dbp:PLACE_OF_BIRTH → dbp:latitude vs. dbp:birth_place → w3c:owl#sameAs → geonames:latitude
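The type mismatch on this slide can be made concrete with a small conversion. The sketch below is illustrative only and is not taken from the system described in the talk; the function name `dms_to_decimal` is hypothetical. It turns a sexagesimal coordinate (degrees, minutes, seconds) into decimal degrees, which is how 39° 41′ 45″ N lines up with the decimal value 39.70.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert a sexagesimal coordinate to signed decimal degrees.

    Hypothetical helper for illustration: south and west are negative.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


# The sexagesimal latitude from the slide, rounded to two decimals.
latitude = dms_to_decimal(39, 41, 45, "N")
print(round(latitude, 2))
```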
Scenario / Demo
Vision: Self-configuring data
Contributions
• Visualize heterogeneous data represented as a graph of relationships between objects
• Describe inputs to a visualization:
  • Visualization template
  • Set of keywords per attribute
• Find attributes needed for a visualization by searching paths
  • Within an iterative process of search, visualization, and refinement
• Present algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
    • A*
    • Random sampling
  • Rank according to:
    • Keywords
    • Heuristics about graph structure
Integrate searching and visualization
• Search for potentially desirable paths
• Refine path selections
• Visualize results in context
Matching problem
• Find the best path to a number for “state latitude”
[Figure: example graph around a senator. Edge labels: state, capital, governor, pop, spouse, party, children, birth place, house leader, color, name, latitude. Nodes and literals: Dianne Feinstein, Harry Reid, 42.4, 4, 6349000, 39.0, blue.]
Basic algorithm
• Find the best path to a number for “state latitude”
1. Explore graph
2. Find paths ending in a number
3. Score and rank paths using TF/IDF
Candidate paths and scores:
state.capital.latitude       0.8
state.pop                    0.6
state.governor.children      0.5
spouse.birth_place.latitude  0.5
[Figure: example graph around a senator. Edge labels: state, capital, governor, pop, spouse, party, children, birth place, house leader, color, name, latitude. Nodes and literals: Dianne Feinstein, Harry Reid, 42.4, 4, 6349000, 39.0, blue.]
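The three steps above can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph contents, the node names, and the helper names `numeric_paths` and `keyword_score` are hypothetical, and the score is a simple Jaccard overlap between keywords and path terms standing in for the TF/IDF ranking named on the slide, so the scores shown on the slide are not reproduced.

```python
# Toy graph (assumed for illustration): node -> list of (edge_label, target).
# A target is either another node name or a literal value.
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node"),
                ("name", "Harry Reid")],
    "state_node": [("capital", "capital_node"), ("governor", "governor_node"),
                   ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "governor_node": [("children", 4)],
    "spouse_node": [("birth_place", "birthplace_node")],
    "birthplace_node": [("latitude", 39.0)],
}


def numeric_paths(graph, start, max_depth):
    """Steps 1 and 2: explore the graph and keep paths that end in a number."""
    results = []
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        if len(labels) == max_depth:
            continue
        for label, target in graph.get(node, []):
            path = labels + [label]
            if isinstance(target, (int, float)):
                results.append((".".join(path), target))
            elif target in graph:
                stack.append((target, path))
    return results


def keyword_score(path, keywords):
    """Step 3 stand-in: Jaccard overlap between keywords and path terms."""
    terms = set(path.replace(".", "_").split("_"))
    return len(terms & keywords) / len(terms | keywords)


keywords = {"state", "latitude"}
ranked = sorted(numeric_paths(GRAPH, "senator", max_depth=3),
                key=lambda item: keyword_score(item[0], keywords),
                reverse=True)
for path, value in ranked:
    print(f"{keyword_score(path, keywords):.2f}  {path} -> {value}")
```

Even with this crude score, the path through state and capital to latitude ranks first for the keywords “state latitude”.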
Improving execution time
• New pruning techniques since the paper submission
  • A*
  • Bidirectional search on terms
  • Random sampling
Pruning techniques
• Most paths do not correspond to a “state latitude”
• How can we avoid such bad paths?
[Figure: the example senator graph, with branches annotated “No mention of latitude”, “Many unrelated terms”, and “No potential paths”.]
Pruning techniques / A* Search
• Use a scoring function that penalizes unrelated terms
• Then an A* search ignores paths with many such terms
[Figure: the example senator graph, with one branch annotated “Many unrelated terms”.]
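The pruning idea can be sketched as a best-first search whose cost counts unrelated terms. This is an assumption-laden illustration, not the scoring function or heuristic from the talk, which the slides do not specify: it uses cost-ordered expansion (what A* reduces to with a zero heuristic), a hypothetical penalty of one per term not among the keywords, and a hypothetical cutoff `max_penalty`. The toy graph and all names are invented for the example.

```python
import heapq
from itertools import count

# Toy graph (assumed for illustration): node -> list of (edge_label, target).
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node"),
                ("name", "Harry Reid")],
    "state_node": [("capital", "capital_node"), ("governor", "governor_node"),
                   ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "governor_node": [("children", 4)],
    "spouse_node": [("birth_place", "birthplace_node")],
    "birthplace_node": [("latitude", 39.0)],
}


def penalty(label, keywords):
    """Hypothetical cost: one unit per term in the label that is not a keyword."""
    return sum(1 for term in label.split("_") if term not in keywords)


def pruned_best_first(graph, start, keywords, max_penalty, max_depth=4):
    """Expand cheapest paths first; drop paths whose penalty exceeds the cutoff."""
    tie = count()  # tie-breaker so the heap never compares paths directly
    frontier = [(0, next(tie), start, [])]
    found, examined = [], 0
    while frontier:
        cost, _, node, labels = heapq.heappop(frontier)
        if len(labels) >= max_depth:
            continue
        for label, target in graph.get(node, []):
            examined += 1
            new_cost = cost + penalty(label, keywords)
            if new_cost > max_penalty:
                continue  # too many unrelated terms: prune this branch
            path = labels + [label]
            if isinstance(target, (int, float)):
                found.append((new_cost, ".".join(path), target))
            elif target in graph:
                heapq.heappush(frontier, (new_cost, next(tie), target, path))
    return sorted(found), examined


found, examined = pruned_best_first(GRAPH, "senator", {"state", "latitude"},
                                    max_penalty=1)
print(found, examined)
```

With the cutoff at one unrelated term, the branch through spouse and birth place is dropped before its latitude edge is ever examined.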
A* pruning results (senators on map)
Average # of edges examined at each depth, full enumeration:
Depth  Image    Name     Latitude
1      66       66       66
2      5409     5446     5408
3      134226   168673   145549
4      1393766  5245035  1009247
Average # of edges examined at each depth, using A*:
Depth  Image    Name     Latitude
1      66       66       66
2      2049     9        598
3      1615     5092     2272
4      198      228      2148
Pruning techniques / Random Sampling
• Do normal A* search for n randomly chosen nodes
[Figure: the example senator graph for Dianne Feinstein, with the latitude path annotated “A hit!” and another branch annotated “No potential paths”.]
Pruning techniques / Random Sampling
• Do normal A* search for n randomly chosen nodes
• Only search known hits for the remaining nodes
• Prevents repeatedly checking where there are likely no paths
[Figure: the example senator graph for John Kerry, with the latitude path annotated “A hit!” and another branch annotated “No potential paths”.]
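The sampling strategy can be sketched as follows. This is a simplified illustration, not the authors' code: plain exhaustive enumeration stands in for the A* search on the seed nodes, and the toy graph and the names `full_search`, `follow`, and `sampled_search` are all hypothetical. Seed nodes get a full search; every other node only re-walks the label paths that produced hits on the seeds.

```python
import random


def make_toy_graph(n=20):
    """Assumed toy data: n senators, each with a state that has a latitude."""
    graph = {"party0": [("color", "blue")], "party1": [("color", "red")]}
    for i in range(n):
        graph[f"senator{i}"] = [("name", f"Senator {i}"),
                                ("state", f"state{i}"),
                                ("party", f"party{i % 2}")]
        graph[f"state{i}"] = [("latitude", 30.0 + i), ("pop", 1000000 + i)]
    return graph


def full_search(graph, start, keyword, max_depth=3):
    """Enumerate every path; keep those ending in a number via `keyword`."""
    hits, examined = {}, 0
    stack = [(start, ())]
    while stack:
        node, labels = stack.pop()
        if len(labels) == max_depth:
            continue
        for label, target in graph.get(node, []):
            examined += 1
            path = labels + (label,)
            if isinstance(target, (int, float)) and label == keyword:
                hits[path] = target
            elif isinstance(target, str) and target in graph:
                stack.append((target, path))
    return hits, examined


def follow(graph, start, path):
    """Walk one known label path from `start`; return (value or None, edges)."""
    node, examined = start, 0
    for label in path:
        nxt = None
        for edge_label, target in graph.get(node, []):
            examined += 1
            if edge_label == label:
                nxt = target
                break
        if nxt is None:
            return None, examined
        node = nxt
    return node, examined


def sampled_search(graph, instances, keyword, n_seeds, rng):
    """Full search on random seeds, then only known hit paths elsewhere."""
    results, known, examined = {}, set(), 0
    for seed in rng.sample(instances, n_seeds):
        hits, cost = full_search(graph, seed, keyword)
        results[seed] = hits
        known.update(hits)
        examined += cost
    for inst in instances:
        if inst in results:
            continue
        results[inst] = {}
        for path in known:
            value, cost = follow(graph, inst, path)
            examined += cost
            if isinstance(value, (int, float)):
                results[inst][path] = value
    return results, examined


graph = make_toy_graph()
senators = [f"senator{i}" for i in range(20)]
results, sampled_cost = sampled_search(graph, senators, "latitude", 3,
                                       random.Random(0))
full_cost = sum(full_search(graph, s, "latitude")[1] for s in senators)
print(sampled_cost, full_cost)
```

The trade-off is that a path which exists only for non-seed nodes is never discovered, which is why the seeds are chosen at random rather than fixed.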
Sampling results
Average # edges examined at all depths:
                 Image  Name  State  Latitude  Longitude  TOTAL
Seed nodes (10)  920    40    200    3100      3100       7360
Others (89)      82     35    175    144       144        580
Total edges examined:
without sampling: 7360 × 99 = 728640
with sampling: 7360 × 10 + 580 × 89 = 125220
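The two totals on this slide follow directly from the per-node averages. The short check below recomputes them from the slide's own numbers; the variable names are illustrative.

```python
# Per-node averages and node counts taken from the slide.
seed_nodes, other_nodes = 10, 89
edges_per_seed, edges_per_other = 7360, 580

# Without sampling, every node pays the full seed-node cost.
without_sampling = edges_per_seed * (seed_nodes + other_nodes)

# With sampling, only the seeds pay it; the rest follow known hits.
with_sampling = edges_per_seed * seed_nodes + edges_per_other * other_nodes

print(without_sampling, with_sampling)
print(f"reduction: {without_sampling / with_sampling:.1f}x")
```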
Performance
Runtime for senators’ example (sec):
State latitude  State longitude  Image  Name   Instances  Total
0.911           0.854            0.542  0.513  0.187      3.007
Runtime for astronauts’ example (sec):
Mission launch  Mission insignia  Name   Instances  Total
1.109           1.151             0.743  1.102      4.105
Runtime for each field in countries’ example (sec):
GDP per capita  Inflation  Flag   Name   Instances  Total
1.142           2.228      0.867  1.108  1.136      6.481
• Performance now interactive
• With new pruning techniques, ~100× faster than reported in paper.
Variations – senators’ flags versus birth places
Timeline of manned spaceflight
Scatterplot of inflation vs. GDP
Precision / Recall
Senators – image:
          Correct  Incorrect
Accepted  86       6
Rejected  0        6
Senators – state latitude:
          Correct  Incorrect
Accepted  64       34
Rejected  1        0
Countries – gdp per capita:
          Correct  Incorrect
Accepted  206      58
Rejected  9        0
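Precision and recall can be derived from these counts under one reading of the tables, which is an assumption here since the slide does not define the cells: accepted and correct results are true positives, accepted and incorrect are false positives, and rejected but correct are false negatives. The function name `precision_recall` is illustrative.

```python
def precision_recall(accepted_correct, accepted_incorrect, rejected_correct):
    """Assumed reading: accepted/correct = TP, accepted/incorrect = FP,
    rejected/correct = FN."""
    precision = accepted_correct / (accepted_correct + accepted_incorrect)
    recall = accepted_correct / (accepted_correct + rejected_correct)
    return precision, recall


# (accepted correct, accepted incorrect, rejected correct) from the slide.
TABLES = {
    "Senators, image": (86, 6, 0),
    "Senators, state latitude": (64, 34, 1),
    "Countries, GDP per capita": (206, 58, 9),
}

for name, counts in TABLES.items():
    p, r = precision_recall(*counts)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```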
Summary
• Visualize heterogeneous data represented as a graph of relationships between objects
• Produce visualizations conforming to templates by searching for needed attributes
• Present algorithm for finding and ranking paths based on keywords
  • Efficiently enumerate paths
  • Rank
• Now fast enough for interactive use
• High precision and recall
Future work
• Improvements
  • UI support for initial discovery and query refinement
  • Robustness of terms / Improved ranking
• Automatic selection of visualization
• Visualizing missing data
• Visualizations that reflect result relevance (selective emphasis)
• Deploy on the web
  • Wikipedia
  • The whole web
Acknowledgements
Funding sources:
• Boeing
• RVAC
• CALO
Tools and data:
• DBpedia
• MIT SIMILE project timeline
• Tom Patterson’s map artwork
The end!
Pruning techniques / Bidirectional Search
• Before A*, search one step back from each literal, following only edges that match keywords
• This saves one step during forward A* search
[Figure: the example senator graph, with one branch annotated “No mention of latitude”.]
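The backward step can be sketched as a pre-pass that marks every node with a keyword-matching edge into a numeric literal, so the forward search only has to reach a marked node. This is a minimal illustration under assumptions: plain depth-limited enumeration stands in for the forward A* search, and the toy graph and the names `backward_step` and `forward_to_targets` are hypothetical.

```python
# Toy graph (assumed for illustration): node -> list of (edge_label, target).
GRAPH = {
    "senator": [("state", "state_node"), ("spouse", "spouse_node"),
                ("name", "Harry Reid")],
    "state_node": [("capital", "capital_node"), ("governor", "governor_node"),
                   ("pop", 6349000)],
    "capital_node": [("latitude", 42.4)],
    "governor_node": [("children", 4)],
    "spouse_node": [("birth_place", "birthplace_node")],
    "birthplace_node": [("latitude", 39.0)],
}


def backward_step(graph, keywords):
    """One step back from each numeric literal, along keyword edges only."""
    targets = {}
    for node, edges in graph.items():
        for label, target in edges:
            if label in keywords and isinstance(target, (int, float)):
                targets.setdefault(node, []).append((label, target))
    return targets


def forward_to_targets(graph, start, targets, max_depth):
    """Forward search that stops one step short: it only has to reach a
    node marked by the backward step."""
    found = []
    stack = [(start, [])]
    while stack:
        node, labels = stack.pop()
        for label, value in targets.get(node, []):
            found.append((".".join(labels + [label]), value))
        if len(labels) == max_depth:
            continue
        for label, nxt in graph.get(node, []):
            if isinstance(nxt, str) and nxt in graph:
                stack.append((nxt, labels + [label]))
    return sorted(found)


targets = backward_step(GRAPH, {"latitude"})
# Paths of length 3 are found with a forward depth of only 2.
paths = forward_to_targets(GRAPH, "senator", targets, max_depth=2)
print(paths)
```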
Need for multiple paths


