

  • Number of slides: 28

Predicting Accuracy of Extracting Information from Unstructured Text Collections
Eugene Agichtein and Silviu Cucerzan
Microsoft Research, Text Mining, Search and Navigation Group

Extracting and Managing Information in Text Document Collections
• Input: Web documents, blogs, news alerts, … with varying properties: different languages, varying consistency, noise/errors, …
• A complex problem: usually many parameters, often tuning required
• An Information Extraction System maps the documents to entities, events, and relations
• Success ~ Accuracy

The Goal: Predict Extraction Accuracy
Estimate the expected success of an IE system that relies on contextual patterns before:
• running expensive experiments
• tuning parameters
• training the system
Useful when adapting an IE system to:
• a new task
• a new document collection
• a new language

Specific Extraction Tasks
• Named Entity Recognition (NER) (entity types: Organization, Misc, Location, Person):
  "European champions Liverpool paved the way to the group stages of the Champions League taking a 3-1 lead over CSKA Sofia on Wednesday [...] Gerard Houllier's men started the match in Sofia on fire with Steven Gerrard scoring [...]"
• Relation Extraction (RE):
  "Abraham Lincoln was born on Feb. 12, 1809, in a log cabin in Hardin (now Larue) County, Ky."
  Extracted BORN tuple: Who: Abraham Lincoln; When: Feb. 12, 1809; Where: Hardin County, KY

Contextual Clues
• NER example (left and right context around the entity):
  "… yesterday, Mrs Clinton told reporters the move to the East Room …"
• RE example (left, middle, and right context around the entity pair):
  "… engineers Orville and Wilbur Wright built the first working airplane in 1903."

Approach: Language Modelling
• The presence of contextual clues for a task appears related to extraction difficulty
• The more "obvious" the clues, the easier the task
• This can be modelled as the "unexpectedness" of a word
• Use Language Modelling (LM) techniques to quantify this intuition

Language Models (LM)
• An LM is a summary of the word distribution in text
• Can define unigram, bigram, trigram, n-gram models
• More complex models exist
  – Distance, syntax, word classes
  – But: not robust for web text, other languages, …
• LMs are used in IR, ASR, text classification, clustering:
  – Query Clarity: predicting query performance [Cronen-Townsend et al., SIGIR 2002]
  – Context modelling for NER [Cucerzan et al., EMNLP 1999], [Klein et al., CoNLL 2003] …

Document Language Models
• A basic LM is a normalized word histogram for the document collection
• Unigram (word) models are commonly used
• Higher-order n-grams (bigrams, trigrams) can also be used

Example word frequencies:
  word      | freq
  the       | 0.0584
  to        | 0.0269
  and       | 0.0199
  said      | 0.0147
  …         | …
  's        | 0.0018
  company   | 0.0014
  mrs, won  | …
  president | 0.0003
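Such a unigram model is straightforward to build; below is a minimal sketch in Python (the whitespace tokenizer and the toy sentence are illustrative simplifications, not the authors' setup):

```python
from collections import Counter

def unigram_lm(tokens):
    """Build a unigram language model: a word histogram
    normalized into a probability distribution over the vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy collection: "the" accounts for 3 of the 8 tokens.
tokens = "the company said the president won the case".split()
lm = unigram_lm(tokens)
print(lm["the"])  # 0.375
```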

Context Language Models
• "Senator Christopher Dodd, D-Conn., named general chairman of the Democratic National Committee last week by President Bill Clinton, said it was premature to talk about lifting the U.S. embargo against Cuba…"
• "Although the Clinton's health plan failed to make it through Congress this year, Mrs Clinton vowed continued support for the proposal."
• "A senior White House official, who accompanied Clinton, told reporters…"
• "By the fall of 1905, the Wright brothers' experimental period ended. With their third powered airplane, they now routinely made flights of several …"
• "Against this backdrop, we see the Wright brothers' efforts to develop an airplane …"
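A context model like the ones above can be sketched by collecting a window of words on each side of every entity occurrence; the single-token entity matcher below is a simplification (real entities such as "Wright brothers" span multiple tokens):

```python
from collections import Counter

def context_lm(tokens, entities, k=3):
    """Build a context language model: collect the k words to the
    left and right of each entity occurrence, then normalize the
    resulting word counts into a probability distribution."""
    entities = set(entities)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in entities:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            counts.update(window)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

tokens = "yesterday Mrs Clinton told reporters about the move".split()
lm = context_lm(tokens, {"Clinton"}, k=2)
print(lm)  # {'yesterday': 0.25, 'Mrs': 0.25, 'told': 0.25, 'reporters': 0.25}
```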

Key Observation
• If normally rare words consistently appear in contexts around entities, the extraction task tends to be "easier".
• Contexts for a task are an intrinsic property of the collection and the extraction task, not of a specific information extraction system.

Divergence Measures
• Cosine divergence: one minus the cosine similarity of the two models, viewed as word-frequency vectors
• Relative entropy (KL divergence): KL(P || Q) = Σ_w P(w) log( P(w) / Q(w) )
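As a sketch, both measures can be computed directly over unigram models; the epsilon back-off for words unseen in Q is an assumption here (the slide does not specify the smoothing used):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Relative entropy KL(P || Q) = sum_w P(w) * log(P(w) / Q(w)).
    Words with no mass in Q are backed off to a small eps."""
    return sum(pw * math.log(pw / q.get(w, eps))
               for w, pw in p.items() if pw > 0)

def cosine_divergence(p, q):
    """One minus the cosine similarity of the two models,
    viewed as word-frequency vectors."""
    dot = sum(pw * q.get(w, 0.0) for w, pw in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)

p = {"born": 0.5, "in": 0.5}
q = {"born": 0.25, "in": 0.25, "the": 0.5}
print(kl_divergence(p, q))  # ln 2 = 0.693...
print(cosine_divergence(p, q))
```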

Interpreting Divergence: Reference LM
• Need to calibrate the observed divergence
• Compute a reference model LMR:
  – Pick K random non-stopwords R and compute the context language model around each Ri
    "… the five-star Hotel Astoria is a symbol of elegance and comfort. With an unbeatable location in St Isaac's Square in the heart of St Petersburg, …"
• Normalized KL(LMC) = KL(LMC || LMBG) / KL(LMR || LMBG)
• Normalization corrects for the bias introduced by small sample size

Reference LM (cont.)
• LMR converges to LMBG for large sample sizes
• The divergence of LMR is substantial for small samples

Predicting Extraction Accuracy: The Algorithm
1. Start with a small sample S of entities (or relation tuples) to be extracted
2. Find occurrences of S in the given collection
3. Compute LMBG for the collection
4. Compute LMC for S and the collection
5. Pick |S| random words R from LMBG
6. Compute the context LM for R: LMR
7. Compute KL(LMC || LMBG) and KL(LMR || LMBG)
8. Return normalized KL(LMC)
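The eight steps can be sketched end to end in Python. This is an illustrative sketch, not the authors' implementation: the non-stopword filter in step 5 is omitted, and the tokenizer, random seed, smoothing, and denominator floor are assumptions:

```python
import math
import random
from collections import Counter

def unigram_lm(tokens):
    """Normalized word histogram over the collection."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def context_lm(tokens, targets, k=3):
    """LM over the k-word windows around each target occurrence."""
    targets, counts = set(targets), Counter()
    for i, tok in enumerate(tokens):
        if tok in targets:
            counts.update(tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, q, eps=1e-9):
    """KL(P || Q), backing off unseen words in Q to eps."""
    return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items() if pw > 0)

def predict_extraction_difficulty(tokens, sample, k=3, seed=0):
    """Steps 3-8: background LM, context LM for the entity sample,
    context LM for an equally sized random word sample, and the
    normalized divergence KL(LMC)/KL(LMR)."""
    lm_bg = unigram_lm(tokens)                     # step 3
    lm_c = context_lm(tokens, sample, k)           # step 4
    rng = random.Random(seed)
    rand = rng.sample(sorted(lm_bg), len(sample))  # step 5 (stopword filter omitted)
    lm_r = context_lm(tokens, rand, k)             # step 6
    kl_c, kl_r = kl(lm_c, lm_bg), kl(lm_r, lm_bg)  # step 7
    return kl_c / max(kl_r, 1e-9)                  # step 8: normalized KL(LMC)
```

A higher returned value means the sample's contexts diverge more from the background than random contexts do, i.e. the task is predicted to be easier.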

Experimental Evaluation
• How to measure success?
  – Compare predicted ease of task vs. observed extraction accuracy
• Extraction tasks: NER and RE
  – NER: datasets from the CoNLL 2002 and 2003 evaluations
  – RE: binary relations between NEs and generic phrases

Extraction Task Accuracy

NER accuracy (F-measure):
  Entity  | English | Spanish | Dutch
  LOC     | 90.21   | 79.84   | 79.19
  MISC    | 78.83   | 55.82   | 73.9
  ORG     | 81.86   | 79.69   | 69.48
  PER     | 91.47   | 86.83   | 78.83
  Overall | 86.77   | 79.2    | 75.24

RE accuracy (%):
  Relation | strict | partial | Task difficulty
  BORN     | 0.73   | 0.96    | Easy
  DIED     | 0.34   | 0.97    | Easy
  INVENT   | 0.35   | 0.64    | Hard
  WROTE    | 0.12   | 0.50    | Hard

Document Collections
  Task | Collection                                 | Size
  NER  | Reuters RCV1, 1/100 sample                 | 3,566,125 words
  NER  | Reuters RCV1, 1/10 sample                  | 35,639,471 words
  NER  | EFE newswire articles, May 2000 (Spanish)  | 367,589 words
  NER  | "De Morgen" articles (Dutch)               | 268,705 words
  RE   | Encarta document collection                | 64,187,912 words
Note that the Spanish and Dutch corpora are much smaller.

Predicting NER Performance (English)

F-measure by system:
  Entity  | Florian et al. | Chieu et al. | Klein et al. | Zhang et al. | Carreras et al. | Average
  LOC     | 91.15 | 91.12 | 89.98 | 89.54 | 89.26 | 90.21
  MISC    | 80.44 | 79.16 | 80.15 | 75.87 | 78.54 | 78.83
  ORG     | 84.67 | 84.32 | 80.48 | 80.46 | 79.41 | 81.86
  PER     | 93.85 | 93.44 | 90.72 | 90.44 | 88.93 | 91.47
  Overall | 88.76 | 88.31 | 86.31 | 85.50 | 85.00 | 86.77

Absolute and normalized KL divergence (Reuters 1/10, context = 3 words, stopwords discarded, avg):
  Entity | Absolute | Normalized
  LOC    | 0.98 | 1.07
  MISC   | 1.29 | 1.40
  ORG    | 2.83 | 3.08
  PER    | 4.10 | 4.46
  RANDOM |      | 0.92

LOC exception: large overlap between locations in the training and test collections (i.e., simple gazetteers are effective).

NER – Robustness / Different Dimensions

• Counting stopwords (w) or not (w/o) (Reuters 1/100, context ±3, avg F):
           LOC  | MISC | ORG  | PER  | RAND
  Actual F | 90.2 | 78.8 | 81.9 | 91.5 |
  w        | 0.93 | 1.09 | 2.68 | 3.91 | 0.78
  w/o      | 1.48 | 1.83 | 3.81 | 5.62 | 1.27

• Context size (Reuters 1/100, no stopwords, avg):
  Size | LOC  | MISC | ORG  | PER  | RAND
  1    | 0.88 | 1.26 | 2.12 | 2.94 | 2.43
  2    | 1.06 | 1.47 | 2.95 | 4.11 | 1.14
  3    | 1.07 | 1.40 | 3.08 | 4.46 | 0.92

• Corpus size (Reuters, context ±3, no stopwords, avg):
  Sample | LOC  | MISC | ORG  | PER  | RAND
  1/10   | 1.07 | 1.40 | 3.08 | 4.46 | 0.92
  1/100  | 1.48 | 1.83 | 3.81 | 5.62 | 1.27

Other Dimensions: Sample Size
• The normalized divergence of LMC remains high
  – Contrast with LMR for larger sample sizes

Other Dimensions: N-gram Size
  Entity | Actual F
  LOC    | 90.21
  MISC   | 78.83
  ORG    | 81.86
  PER    | 91.47
Higher-order n-grams may help in some cases.

Other Languages

• Spanish:
  Entity | Actual | Context=1 | Context=2 | Context=3
  LOC    | 79.84  | 1.18 | 1.39 | 1.42
  MISC   | 55.82  | 1.73 | 2.12 | 2.35
  ORG    | 79.69  | 1.42 | 1.59 | 1.64
  PER    | 86.83  | 2.01 | 2.31 | 2.56
  RANDOM |        | 2.42 | 1.82 | 1.53

Problem: very small collections

• Dutch:
  Entity | Actual | Context=1 | Context=2 | Context=3
  LOC    | 79.19  | 1.44 | 1.65 | 1.61
  MISC   | 73.9   | 1.97 | 2.02 | 1.91
  ORG    | 69.48  | 1.53 | 1.86 | 1.92
  PER    | 78.83  | 2.25 | 2.63 | 2.60
  RANDOM |        | 2.59 | 1.89 | 1.71

Predicting RE Performance (English)

Divergence by context size:
  Relation | Context=1 | Context=2 | Context=3
  BORN     | 2.02 | 2.17 | 2.39
  DIED     | 1.89 | 1.86 | 1.83
  INVENT   | 1.94 | 1.75 | 1.72
  WROTE    | 1.59 | 1.53 |
  RANDOM   | 6.87 | 6.24 | 5.79

Accuracy (%):
  Relation | strict | partial
  BORN     | 0.73 | 0.96
  DIED     | 0.34 | 0.97
  INVENT   | 0.35 | 0.64
  WROTE    | 0.12 | 0.50

• 2- and 3-word contexts correctly distinguish between the "easy" tasks (BORN, DIED) and the "difficult" tasks (INVENT, WROTE).
• A 1-word context appears not sufficient for predicting RE accuracy.

Other Dimensions: Sample Size
• Divergence increases with sample size

Results Summary
• Context models can be effective in predicting the success of information extraction systems
• Even a small sample of available entities can be sufficient for making accurate predictions
• The size of the available collection is the most important limiting factor

Other Applications and Future Work
• Could use the results for:
  – Active learning / training IE
  – Improved boundary detection for NER
  – Improved confidence estimation of extraction
    • e.g., Culotta and McCallum [HLT 2004]
• For better results, could incorporate:
  – Internal contexts, gazetteers (e.g., for LOC entities)
    • e.g., Agichtein & Ganti [KDD 2004], Cohen & Sarawagi [KDD 2004]
  – Syntax / logical distance
  – Coreference resolution
  – Word classes

Summary
• Presented the first attempt to predict information extraction accuracy for a given task and collection
• Developed a general, system-independent method utilizing Language Modelling techniques
• Estimates of extraction accuracy can help:
  – Deploy information extraction systems
  – Port information extraction systems to new tasks, domains, collections, and languages

For More Information
Text Mining, Search, and Navigation Group
http://research.microsoft.com/tmsn/