
An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
Matthew Michelson & Craig A. Knoblock
University of Southern California / Information Sciences Institute

Unstructured, Ungrammatical Text

Unstructured, Ungrammatical Text
• Example posts with attribute callouts: Car Model, Car Year

Semantic Annotation
Post: "02 M 3 Convertible. . Absolute beauty!!!"
Implied annotation: <Make>BMW</Make> <Model>M 3</Model> <Trim>2 Dr STD Convertible</Trim> <Year>2002</Year>
• "Understand" & query the posts (can query on BMW, even though it is not in the post!)
• Note: this is not extraction! (We are not pulling the values out of the post…)

Reference Sets
• Annotation/extraction is hard:
  • Can't rely on structure (wrappers)
  • Can't rely on grammar (NLP)
• Reference sets are the key (IJCAI 2005)
• Match posts to reference set tuples:
  • Gives a clue to the attributes in posts
  • Provides normalized attribute values when matched

Reference Sets
• Collections of entities and their attributes
• Relational data! E.g., scrape make, model, trim, and year for all cars from 1990-2005…

Contributions
• Previously: the user supplies the reference set. Now: the system selects reference sets from a repository.
• Previously: the user trains record linkage between the reference set & the posts. Now: the matching is unsupervised.

New Unsupervised Approach: Two Steps
Unsupervised semantic annotation of the posts proceeds in two steps (a sketch of the pipeline follows):
1) Unsupervised reference set chooser
2) Unsupervised record linkage
The reference set repository grows over time, increasing coverage.
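To make the two-step flow concrete, here is a minimal structural sketch in Python. The driver takes the chooser and the linkage step as functions, since concrete sketches of both appear after the next slides; the signature and names are illustrative, not the paper's API.

```python
def annotate(posts, repository, choose, link):
    """Two-step unsupervised annotation driver (a sketch).

    `choose` maps (posts, repository) to a list of reference sets;
    `link` maps (post, reference_set) to a matched record or None.
    """
    annotations = {}
    for ref_set in choose(posts, repository):   # step 1: chooser
        for post in posts:
            record = link(post, ref_set)        # step 2: record linkage
            if record is not None:
                annotations.setdefault(post, []).append(record)
    return annotations
```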

Choosing a Reference Set
Vector space model: the set of posts is one document, and each reference set is one document. Select the reference set most similar to the set of posts…
Example posts: "FORD Thunderbird - $4700", "2001 White Toyota Corrolla CE Excellent Condition - $8200"
Similarity against each reference set: Cars 0.7, Hotels 0.4, Restaurants 0.3 (average 0.47)
PD(Cars, Hotels) = 0.75 > T, while PD(Hotels, Restaurants) = 0.33 < T, so only Cars is chosen.
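A minimal sketch of the chooser's scoring step, assuming whitespace tokenization and a similarity of one minus the Jensen-Shannon divergence (the next slide notes TF-IDF was also tried); the paper's exact weighting may differ.

```python
import math
from collections import Counter

def token_dist(text):
    """Unigram probability distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (log base 2, so in [0, 1])."""
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}
    def kl(a):  # KL(a || m); only tokens with a[t] > 0 contribute
        return sum(pr * math.log2(pr / m[t]) for t, pr in a.items())
    return 1.0 - 0.5 * (kl(p) + kl(q))

def score_reference_sets(posts, reference_sets):
    """All posts form one document; each reference set forms another.

    reference_sets maps a name to a list of record strings.
    Returns (name, similarity) pairs, best first.
    """
    post_doc = token_dist(" ".join(posts))
    scores = [(name, js_similarity(post_doc, token_dist(" ".join(rows))))
              for name, rows in reference_sets.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```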

Choosing Reference Sets
• Similarity: Jensen-Shannon distance & TF-IDF are used in the experiments in the paper
• Percent difference as the splitting criterion (see the sketch after this list):
  • A relative measure
  • "Reasonable" threshold: we use 0.6 throughout
  • The score must also exceed the average: small scores with small absolute changes can yield a large percent difference, but they are not better, just relatively so…
• If two or more reference sets are selected, annotation runs iteratively
  • If two reference sets have the same schema, use the one with the higher rank, eliminating redundant matching
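A sketch of the splitting rule, assuming percent difference between adjacent ranked scores is computed as (score - next) / next; the ranked input is what `score_reference_sets` above returns.

```python
def select_reference_sets(ranked, threshold=0.6):
    """ranked: (name, score) pairs, best first.

    Find the first spot where the percent difference between adjacent
    scores exceeds the threshold; everything above that split whose
    score also beats the average is selected. If no gap is large
    enough, nothing is chosen.
    """
    avg = sum(score for _, score in ranked) / len(ranked)
    for i in range(len(ranked) - 1):
        score, nxt = ranked[i][1], ranked[i + 1][1]
        if nxt > 0 and (score - nxt) / nxt > threshold:
            return [name for name, s in ranked[: i + 1] if s > avg]
    return []  # no clear split: refuse to pick a reference set

# The earlier example: Cars 0.7, Hotels 0.4, Restaurants 0.3.
# PD(Cars, Hotels) = 0.75 > 0.6 and 0.7 beats the 0.47 average,
# so only Cars is kept.
print(select_reference_sets([("Cars", 0.7), ("Hotels", 0.4),
                             ("Restaurants", 0.3)]))  # ['Cars']
```

This rule also reproduces the selections in the results tables below: one set each for the BFT and EBay posts, two (Cars, then KBBCars) for the Craig's List posts, and none for the Boat posts.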

Vector Space Matching for Semantic Annotation
• Choosing reference sets compared the set of posts vs. the whole reference set; vector space matching compares each post vs. each reference set record
• Modified Dice similarity (a sketch follows):
  • Modification: if Jaro-Winkler > 0.95, count the token pair in (p ∩ r)
  • Captures spelling errors and abbreviations
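A minimal sketch of the modified Dice score, assuming plain whitespace tokens and the third-party `jellyfish` library for Jaro-Winkler; the paper's tokenizer may differ.

```python
import jellyfish  # third-party: pip install jellyfish

def fuzzy_dice(post, record, jw_threshold=0.95):
    """Dice similarity over token sets, where a post token counts as
    shared when some record token is within Jaro-Winkler > threshold.
    This is the modification that absorbs misspellings and
    abbreviations."""
    p = set(post.lower().split())
    r = set(record.lower().split())
    shared = sum(
        1 for tp in p
        if any(jellyfish.jaro_winkler_similarity(tp, tr) > jw_threshold
               for tr in r))
    return 2.0 * shared / (len(p) + len(r))
```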

Why Dice?
• TF-IDF with cosine similarity:
  • "City" is given more weight than "Ford" in the reference set
  • Post: Near New Ford Expedition XLT 4 WD with Brand New 22 Wheels!!! (Redwood City - Sale This Weekend !!!) $26850
  • TF-IDF match (score 0.20): {VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}
• Jaccard similarity [(p ∩ r)/(p ∪ r)] discounts shorter strings (many posts are short!)
• Dice boosts the numerator: the post above matches {FORD, EXPEDITION, 4 Dr XLT 4 WD SUV, 2005} with Dice 0.32 vs. Jaccard 0.19
  • If the intersection is small, the denominator of Dice is almost the same as Jaccard's, so the numerator matters more
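The contrast is easy to check numerically. This sketch uses exact token sets (no Jaro-Winkler step), so the scores land near, not exactly on, the slide's 0.32 and 0.19; tokenization details account for the difference.

```python
def dice(p, r):
    return 2 * len(p & r) / (len(p) + len(r))

def jaccard(p, r):
    return len(p & r) / len(p | r)

post = set("near new ford expedition xlt 4wd with brand new 22 wheels "
           "redwood city sale this weekend 26850".split())
record = set("ford expedition 4dr xlt 4wd suv 2005".split())

# The long, noisy post shares only 4 tokens with the correct record;
# Dice's doubled numerator penalizes the length mismatch less.
print(round(dice(post, record), 2))     # ~0.35
print(round(jaccard(post, record), 2))  # ~0.21
```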

Vector Space Matching for Semantic Annotation
• "new 2007 altima" → {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007} 0.36 and {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007} 0.36
• "02 M 3 Convertible. . Absolute beauty!!!" → {BMW, M 3, 2 Dr STD Convertible, 2002} 0.5
• "Awesome car for sale! It's an accord, I think…" → {HONDA, ACCORD, 4 Dr LX, 2001} 0.13 < 0.33
Avg. Dice = 0.33
• The average score splits matches from non-matches, eliminating false positives
  • The threshold for matches comes from the data
  • Using the average assumes there are both good matches and bad ones (we see this in the data…)
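A sketch of the average-score split, assuming the average is taken over all candidate scores in the batch (the slide's 0.33 is the average of the four scores shown):

```python
def above_average(candidates):
    """candidates: (record, dice_score) pairs. Keep only candidates
    scoring above the batch average; this assumes the pool contains
    both good and bad matches, so the average separates them."""
    avg = sum(score for _, score in candidates) / len(candidates)
    return [(rec, score) for rec, score in candidates if score > avg]

scores = [("ALTIMA 3.5 SE 2007", 0.36), ("ALTIMA 2.5 S 2007", 0.36),
          ("M3 Convertible 2002", 0.50), ("ACCORD LX 2001", 0.13)]
# avg = 0.3375, so the 0.13 ACCORD candidate is dropped as a false positive
print(above_average(scores))
```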

Vector Space Matching for Semantic Annotation
• Attributes in agreement (a sketch of the rule follows)
  • A set of matches leaves ambiguity in the attributes that differ
  • Which is better? All have the maximum score as matches!
    • We say none, and throw away the differences…
    • Union them? In the real world, not all posts have all attributes
  • E.g.: "new 2007 altima" → {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007} 0.36 and {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007} 0.36: the trims differ, so only make, model, and year are kept
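A sketch of the agreement rule: among tied top matches, keep an attribute only when every match gives the same value. Records here are plain dicts; the field names are illustrative.

```python
def agreed_attributes(matches):
    """matches: dicts with the same keys, all tied at the top score.
    Keep an attribute only if every match agrees on its value;
    disagreeing attributes are thrown away rather than guessed."""
    first = matches[0]
    return {k: v for k, v in first.items()
            if all(m[k] == v for m in matches)}

tied = [{"make": "NISSAN", "model": "ALTIMA",
         "trim": "4 Dr 3.5 SE Sedan", "year": "2007"},
        {"make": "NISSAN", "model": "ALTIMA",
         "trim": "4 Dr 2.5 S Sedan", "year": "2007"}]
# trim differs, so only make, model, and year annotate "new 2007 altima"
print(agreed_attributes(tied))
```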

Experimental Data Sets

Reference Sets:
Name    | Source                     | Attributes                    | Records
Fodors  | Travel Guide               | name, address, city, cuisine  | 534
Zagat   | Restaurant Guide           | name, address, city, cuisine  | 330
Comics  | Price Guide                | title, issue, publisher       | 918
Hotels  | Bidding For Travel         | star rating, name, local area | 132
Cars    | Edmunds & Super Lamb Auto  | make, model, trim, year       | 27,006
KBBCars | Kelly Blue Book Car Prices | make, model, trim, year       | 2,777

Posts:
Name  | Source             | Reference Set Match      | Records
BFT   | Bidding For Travel | Hotels                   | 1,125
EBay  | EBay               | Comics                   | 776
Cars  | Craig's List       | Cars, KBBCars (in order) | 2,568
Boats | Craig's List       | None                     | 1,099

Results: Choosing Reference Sets (Jensen-Shannon), T = 0.6

BFT Posts:
Ref. Set | Score | % Diff.
Hotels   | 0.622 | 2.172
Fodors   | 0.196 | 0.05
Cars     | 0.187 | 0.248
KBBCars  | 0.15  | 0.101
Zagat    | 0.136 | 0.161
Comics   | 0.117 |
Average  | 0.234 |

Boat Posts:
Ref. Set | Score | % Diff.
Cars     | 0.251 | 0.513
Fodors   | 0.166 | 0.144
KBBCars  | 0.145 | 0.089
Comics   | 0.133 | 0.025
Zagat    | 0.13  | 0.544
Hotels   | 0.084 |
Average  | 0.152 |

EBay Posts:
Ref. Set | Score | % Diff.
Comics   | 0.579 | 2.351
Fodors   | 0.173 | 0.152
Cars     | 0.15  | 0.252
Zagat    | 0.12  | 0.186
Hotels   | 0.101 | 0.170
KBBCars  | 0.086 |
Average  | 0.201 |

Craig's List Posts:
Ref. Set | Score | % Diff.
Cars     | 0.52  | 0.161
KBBCars  | 0.447 | 1.193
Fodors   | 0.204 | 0.144
Zagat    | 0.178 | 0.365
Hotels   | 0.131 | 0.153
Comics   | 0.113 |
Average  | 0.266 |

(For the Boat posts, no percent difference exceeds T, so no reference set is selected, which is correct: the repository contains no boat reference set.)

Results: Semantic Annotation

BFT Posts:
Attribute   | Recall | Prec. | F-Measure | Phoebus F-Mes.
Hotel Name  | 88.23  | 89.36 | 88.79     | 92.68
Star Rating | 92.02  | 89.25 | 90.61     | 92.68
Local Area  | 93.77  | 90.52 | 92.17     | 92.68

EBay Posts:
Attribute | Recall | Prec. | F-Measure | Phoebus F-Mes.
Title     | 86.08  | 91.60 | 88.76     | 88.64
Issue     | 70.16  | 89.40 | 78.62     | 88.64
Publisher | 86.08  | 91.60 | 88.76     | 88.64

(Phoebus is supervised machine learning: it has a notion of matches/non-matches in its training data.)

Craig's List Posts:
Attribute | Recall | Prec. | F-Measure | Phoebus F-Mes.
Make      | 93.96  | 86.35 | 89.99     | N/A
Model     | 82.62  | 81.35 | 81.98     | N/A
Trim      | 71.62  | 51.95 | 60.22     | N/A
Year      | 78.86  | 91.01 | 84.50     | N/A

(The lower Trim numbers reflect the attributes-in-agreement issues described earlier.)

Related Work
• Semantic annotation
  • Rule- and pattern-based methods assume structure repeats, which is what makes rules and patterns useful; in our case, unstructured data disallows such assumptions.
  • SemTag (Dill et al. 2003): look up tokens in a taxonomy and disambiguate
    • They disambiguate one token at a time. We disambiguate using all posts during reference set selection, so we avoid their ambiguity issue, such as "is jaguar a car or an animal?" (the reference set would tell us!)
    • We don't require a carefully formed taxonomy, so we can easily exploit widely available reference sets
• Information extraction using reference sets
  • CRAM: unsupervised extraction, but it is given the reference set and labels all tokens (no junk allowed!)
  • Cohen & Sarawagi 2004: supervised extraction; ours is unsupervised
• Resource selection in distributed IR ("Hidden Web") [Survey: Craswell et al. 2000]
  • Probe queries are required to estimate coverage because those systems lack full access to the data. Since we have full access to the reference sets, we don't need probe queries.

Conclusions
• Unsupervised semantic annotation
  • The system can accurately query noisy, unstructured sources without human intervention
    • E.g., aggregate queries (average Honda price?) without reading all the posts
• Unsupervised selection of reference sets
  • The repository grows over time, increasing coverage
  • Necessary to exploit newly collected reference sets automatically
• Unsupervised annotation
  • Competitive with the machine learning approach, but without the burden of labeling matches
  • Allows for large-scale annotation over time, without user intervention
• Future work
  • Unsupervised extraction
  • Collect reference sets and manage them with an information mediator