Searching for Statistical Diagrams
Michael Cafarella, University of Michigan
Joint work with Shirley Zhe Chen and Eytan Adar
Brigham Young University, November 17, 2011

Statistical Diagrams
- Everywhere in serious academic, governmental, scientific documents
- Our only peek into data behind docs
- Previously precious and rare, Web gives us a precious flood
  - In small Web crawl, found 319K diagrams in 153K academic papers
  - Google makes it easy to find docs, images; very hard to find diagrams

Outline
- Intro
- Related Work
- Our Approach
  - Metadata Extraction
  - Ranking
  - Search Snippets
- Experiments
- Work In Progress: Spreadsheets

Telephone System Manhole Drawing, from Arias, et al, Pattern Recognition Letters 16, 1995

Sample Architectural Drawing, from Ah-Soon and Tombre, Proc’ds of Fourth Int’l Conf on Document Analysis and Recognition, 1997.

Previous Work
- Understanding diagrams isn’t new
- Understanding a Web’s worth of diagrams is new
  - Need to search statistical diagrams in medicine, economics, biology, physics, etc.
  - The (old) phone company can afford a system tailored for manhole diagrams; we can’t
  - Effective scaling with # of topics is central goal of domain-independent information extraction

Domain-Independent IE
- General IE topic since early 1990s
- Goal is to obtain structured information from unstructured raw documents
  - [Title, Price] from online bookstores
  - [Director, Film] from discussion boards
  - [Scientist, Birthday] from biographies
- Traditional IE requires topic-specific code, features, data
  - Supervision costs grow with # of domains
  - Domain-independent IE does not

Related Work
- Domain-independent extraction:
  - Text (Banko et al., IJCAI 2007; Shinyama+Sekine, HLT-NAACL 2006)
  - Tables (Cafarella et al., VLDB 2008)
  - Infoboxes (Wu and Weld, CIKM 2007)
- Specific to diagrams, some DI:
  - Huang et al., “Associating text and graphics…”, ICDAR 2005
  - Huang et al., “Model-based chart image recognition”, GREC 2003
  - Kaiser et al., “Automatic extraction…”, AAAI 2008
  - Liu et al., “Automated analysis…”, IJDAR 2009

Our Approach
- Typical Web search pipeline
  - Crawl Web for documents
  - Obtain and index text
  - Make index queryable
- Our novel components
  - Diagram metadata extraction
  - Custom search ranker
  - Snippet generator

Metadata Extraction
1. Recover good (text, x, y) from PDFs
2. Apply simple role label: title, legend, etc.
   - Does text start with capitalized word?
   - Ratio of text region height to width
   - % of words in region that are nouns
3. Group texts into “model diagram” candidates, throw away unlikely ones
   - E.g., must include something on x scale
4. Relabel text using geometric relationships
   - Distance, angle to diagram’s origin?
   - Leftmost in diagram?
   - Under a caption?
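
The per-region features named on this slide could be computed roughly as follows. This is only a minimal sketch: the `TextRegion` type, the `region_features` function, and the feature names are hypothetical, the region geometry is assumed to be already recovered from the PDF, and the noun-percentage feature is left out because it needs a part-of-speech tagger that the slides do not name.

```python
import math
from dataclasses import dataclass


@dataclass
class TextRegion:
    """Hypothetical container for one recovered text box."""
    text: str
    x: float       # position of the region (same units as the diagram origin)
    y: float
    width: float
    height: float


def region_features(region, origin):
    """Compute a few of the textual and geometric cues listed on the slide.

    `origin` is the (x, y) of the diagram's origin, assumed to be known.
    """
    words = region.text.split()
    first_word = words[0] if words else ""
    dx = region.x - origin[0]
    dy = region.y - origin[1]
    return {
        # Does the text start with a capitalized word?
        "starts_capitalized": first_word[:1].isupper(),
        # Ratio of text region height to width (0.0 for degenerate boxes).
        "height_width_ratio": region.height / region.width if region.width else 0.0,
        # Distance and angle from the diagram's origin to the region.
        "distance_to_origin": math.hypot(dx, dy),
        "angle_to_origin_deg": math.degrees(math.atan2(dy, dx)),
    }
```

A classifier for role labels (title, legend, and so on) would consume dictionaries like this one; which classifier the authors used is not stated in the slides.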

Search Ranker
- Tested four versions
  - Naïve: standard document relevance
  - Reference: caption and context only
  - Field: all fields, equal weighting
  - Weighted: all fields, trained weights
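
The difference between the Field and Weighted rankers can be illustrated with a small scoring function. This is a sketch under assumptions, not the authors' ranker: the `field_score` name, the term-count scoring, and the example weights are all hypothetical, and in the talk the weights are trained rather than set by hand.

```python
def field_score(query, fields, weights=None):
    """Score one diagram by counting query-term matches in each metadata field.

    fields:  dict mapping field name (e.g. "caption", "x_label") to its text.
    weights: dict mapping field name to weight; None means equal weighting,
             which corresponds to the "Field" variant on the slide.
    """
    terms = query.lower().split()
    score = 0.0
    for name, text in fields.items():
        tokens = text.lower().split()
        matches = sum(tokens.count(term) for term in terms)
        weight = 1.0 if weights is None else weights.get(name, 0.0)
        score += weight * matches
    return score


# Hypothetical diagram metadata and hand-set weights, for illustration only.
diagram = {
    "caption": "speedup versus processors",
    "x_label": "number of processors",
    "title": "speedup",
}
equal = field_score("speedup processors", diagram)
weighted = field_score(
    "speedup processors", diagram,
    weights={"caption": 2.0, "x_label": 1.0, "title": 0.5},
)
```

With equal weighting every field match counts the same; with per-field weights a match in a more informative field can count for more.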

Snippet Generation
- Tested five versions
  1. Original-snippet
  2. Small-snippet
  3. Text-snippet: caption and context only, no graphics at all
  4. Integrated-snippet: caption and context accompany graphics
  5. Enhanced-snippet: apply metadata over graphic

Experiments
- Crawled Web for scientific papers
  - From ClueWeb09
  - Any URL ending in .pdf from .edu URL
  - 319K diagrams
- Fed data to prototype search engine
- Evaluated
  - Metadata extraction
  - Rank quality
  - Snippet effectiveness
- All results compared against human judgments
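
The crawl filter described above (PDF URLs on .edu hosts) might look like the following. This is a minimal sketch: `is_candidate_paper_url` is a hypothetical name, and checking the URL path rather than the full URL string is an assumption about what "ending in .pdf" means.

```python
from urllib.parse import urlparse


def is_candidate_paper_url(url):
    """Return True for URLs on a .edu host whose path ends in .pdf."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return host.endswith(".edu") and parsed.path.lower().endswith(".pdf")
```

Because only the path is checked, a URL such as `http://cs.example.edu/paper.pdf?download=1` is still accepted; a stricter reading of the slide would reject it.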

1. Experiments - Extraction

           Recall                   Precision
           Text    All     Full     Text    All     Full
title      0.256   0.651   0.674    0.344   0.609   0.617
Y-scale    0.782   0.796   0.754    0.899   0.843   0.900
Y-label    0.835   0.864   0.874    0.775   0.752   0.797
legend     0.520   0.623   0.656    0.349   0.615   0.631
caption    0.952   0.887   0.839    0.450   0.887   0.929

Rows with cells displaced in extraction (values in source order; column assignment not recoverable):
X-scale: 0.903, 0.835, 0.616, 0.915, 0.896
X-label: 0.241, 0.681, 0.340, 0.842, 0.835
nondiag: 0.768
Displaced values: 0.924, 0.313, 0.850, 0.909, 0.838
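
For reference, per-label precision and recall of the kind reported on this slide can be computed from predicted and human-judged labels as below. The function name and the toy labels are hypothetical; only the standard definitions of precision and recall are assumed.

```python
def precision_recall(predicted, gold, label):
    """Precision and recall for one role label.

    predicted, gold: equal-length lists of labels, one per text region.
    """
    true_pos = sum(1 for p, g in zip(predicted, gold) if p == label and g == label)
    pred_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_pos / pred_count if pred_count else 0.0
    recall = true_pos / gold_count if gold_count else 0.0
    return precision, recall


# Toy example, not data from the talk.
predicted = ["title", "legend", "title", "caption"]
gold = ["title", "title", "legend", "caption"]
title_pr = precision_recall(predicted, gold, "title")
```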

2. Experiments - Ranking

Mean Reciprocal Rank
Naïve      0.633
Reference  0.9643
Field      0.8833
Weighted   0.9667

Improves ranking 52% over naïve ranker
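
Mean reciprocal rank averages, over all queries, the reciprocal of the rank at which the first relevant result appears. The sketch below shows that computation and checks the relative improvement from the slide's numbers; the function names are hypothetical, and treating the 52% as the Weighted ranker's gain over Naïve is an assumption (the Reference ranker gives a similar figure).

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries.

    first_relevant_ranks: 1-based rank of the first relevant result for each
    query, or None when no relevant result was returned (contributes 0).
    """
    total = sum(0.0 if rank is None else 1.0 / rank for rank in first_relevant_ranks)
    return total / len(first_relevant_ranks)


def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Using the MRR values reported on the slide.
gain = relative_improvement(0.9667, 0.633)  # about 0.527
```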

3. Experiments - Snippets
Improves snippet accuracy 33% over naïve solution

Aggregated Diagrams
[Figure slides. Recoverable content: a table with X-Axis and Y-Axis columns of aggregated axis labels, including times, accuracy, timesec, speedup, timeseconds, precision, frequencyhz, frequency, numberofnodes, probability, numberofprocessors, percent, recall, fifodepth, i, Iteration, cumulativeprobability.]
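
Labels such as "timesec" and "numberofnodes" look like axis labels that were lowercased and stripped of spaces and punctuation before counting. The sketch below shows one way such aggregation could work; the normalization rule, the function names, and the raw label spellings are guesses, not something the slides confirm.

```python
import re
from collections import Counter


def normalize_label(label):
    """Lowercase a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def aggregate_labels(labels):
    """Count how often each normalized axis label occurs."""
    return Counter(normalize_label(label) for label in labels)


# Hypothetical raw labels, for illustration only.
counts = aggregate_labels(["Time (sec)", "time sec", "Number of Nodes", "Recall"])
```

Under this rule "Time (sec)" and "time sec" fall into the same bucket, which is what makes aggregate statistics over many diagrams possible.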

Clustering
[Figure slides. Recoverable panel labels: 1978-2004, 2000-2004, 2002-2006.]

Other Applications
- Working now
  - Keyword, axis search
  - Similar diagrams / similar papers
- In future:
  - Improved academic paper search
  - “Show plots that support my hypothesis”

Spreadsheets
- SAUS has >1,300; we’ve downloaded 350K
- Many tasks: search, facets, integration, etc.

Metadata Recovery

Future Work
- Structured search
  - Has experiment X ever been run?
  - WY GDP vs coal production in 2002
- Preemptively compute good diagrams
- Lots (most?) structured data lives outside DBMS
  - Spreadsheets
  - HTML tables
  - Log files, sensor readings, experiments, …

Conclusions
- Metadata extraction enables 52% better search ranking
- Extraction-enhanced snippets allow users to choose 33% more accurately
- We rely on open information extraction, but extracted data not the main product
  - Can be successful even with imperfect extractors

Thanks

WebTables [VLDB08, “WebTables…”, Cafarella, Halevy, Wang, Wu, Zhang]
- In corpus of 14B raw HTML tables, ~154M are “good” databases
  - Largest corpus of databases & schemas we know of
- The WebTables system:
  - Recovers good relations from crawl and enables search
  - Builds novel apps on the recovered data

WebTables Pipeline
[Diagram. Components: Raw crawled pages, Raw HTML Tables, Recovered Relations, Inverted Index, Relation Search, Attribute Correlation Statistics Db.]
- Example schemas shown: (Job-title, company, date), (Make, model, year), (Rbi, ab, h, r, bb, avg, slg), (Dob, player, height, weight), …
- Schema counts shown: 916, 12, 4, 104, …
- 5.4M attributes
- 2.6M distinct schemas
The Unreasonable Effectiveness of Data [Halevy, Norvig, Pereira]

Synonym Discovery
- Use schema statistics to automatically compute attribute synonyms
  - More complete than thesaurus
- Given input “context” attribute set C:
  1. A = all attrs that appear with C
  2. P = all (a, b) where a ∈ A, b ∈ A, a ≠ b
  3. rm all (a, b) from P where p(a, b) > 0
  4. For each remaining pair (a, b) compute:
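
Steps 1 to 3 above can be sketched as follows. This is an illustration under assumptions, not the WebTables implementation: `synonym_candidates` is a hypothetical name, "appear with C" is read as "occur in a schema that contains every attribute of C", p(a, b) > 0 is read as "a and b occur together in at least one schema", and pairs are treated as unordered. The scoring formula of step 4 is not given in this text, so it is not reproduced.

```python
from itertools import combinations


def synonym_candidates(schemas, context):
    """Return candidate synonym pairs for a context attribute set.

    schemas: iterable of attribute lists, one per recovered relation.
    context: the input "context" attribute set C.
    """
    context = set(context)
    schema_sets = [set(schema) for schema in schemas]

    # Step 1: A = all attributes that appear with C.
    attrs = set()
    for schema in schema_sets:
        if context <= schema:
            attrs |= schema - context

    # Step 2: P = all pairs (a, b) with a != b, taken as unordered pairs.
    pairs = combinations(sorted(attrs), 2)

    # Step 3: drop pairs that ever co-occur, i.e. p(a, b) > 0.
    def co_occur(a, b):
        return any(a in schema and b in schema for schema in schema_sets)

    return [(a, b) for a, b in pairs if not co_occur(a, b)]


# Toy schemas, for illustration only.
toy_schemas = [
    ["make", "model", "year"],
    ["make", "model", "yr"],
    ["make", "model", "year", "color"],
]
candidates = synonym_candidates(toy_schemas, ["make", "model"])
```

The intuition behind step 3 is that true synonyms rarely appear side by side in one schema, so any pair that does co-occur is ruled out before scoring.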