- Number of slides: 27
Suggestions to Improve the Flexibility and Adaptivity of Information Extraction
Irene M. Cramer
Supervisor: Prof. Dr. D. Klakow
Lehrstuhl für Sprachsignalverarbeitung, Saarland University
2004-12-09, IGK Colloquium, Winter 04/05
Outline
- Information Extraction
  - Some comments on IE
  - An IE example system: FASTUS
  - The challenge
- Answers to the challenge
  - The possible method
  - Some case studies
  - Dissertation roadmap
Information Extraction
- Problem:
  - Huge amount of textual information available
  - Who is able to read and analyze it?
Information Extraction
- Solution: IE does the following automatically:
  - Find relevant information
  - Analyze relevant information
  - Structure relevant information
Information Extraction
- Input: a specification of the relevant information (templates) and documents
- Output: a set of instantiated templates, e.g. stored in a database
Information Extraction
- Evaluation: precision/recall, F-measure
- Applications:
  - Text Classification
  - Text Mining
  - Text Summarization
  - Question Answering
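As a quick illustration, precision, recall, and the F-measure over a set of extracted items can be computed as follows. This is a toy sketch; the city sets in the example are invented:

```python
def precision_recall_f1(extracted, gold):
    """Score extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                      # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: two of three extracted "cities" are correct.
p, r, f = precision_recall_f1({"Berlin", "Paris", "Europa"},
                              {"Berlin", "Paris", "Hamburg"})
# p = 2/3, r = 2/3, f = 2/3
```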
IE example system: FASTUS
- FASTUS (= Finite State Automaton-based Text Understanding System)
- MUC IE system
- Extraction of information from unstructured text
- No real text understanding!
IE example system: FASTUS
- Series of cascaded finite-state automata
- Basically, 3 steps:
  - Recognize phrases
    - Complex words (multiwords, proper names)
    - Simple phrases
    - Complex phrases
  - Recognize patterns
  - Merge the bits of information found
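The cascade idea can be sketched in a few lines: each stage consumes the output of the previous one. The stage implementations below are invented toy rules, not FASTUS's actual grammars:

```python
import re

def complex_words(text):
    # Stage 1a: join multiword proper names into single tokens (toy rule).
    return re.sub(r"New York", "New_York", text)

def simple_phrases(text):
    # Stage 1b: bracket simple noun groups (toy grammar: determiner + noun).
    return re.sub(r"\b(a|the) (\w+)", r"[NG \1 \2]", text)

def cascade(text, stages):
    # Feed each stage the output of the previous one.
    for stage in stages:
        text = stage(text)
    return text

out = cascade("the company opened a plant in New York",
              [complex_words, simple_phrases])
# "[NG the company] opened [NG a plant] in New_York"
```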
Information Extraction
- Limitation:
  - Someone has to build the templates, which is time-consuming.
  - Thus, the templates are normally static.
- What about adaptation to a new domain…?
The Challenge
- To be more flexible (and to support open-domain QA):
  - Have many more patterns than in a typical IE system
  - Base the work on an (already existing) QA ontology
  - Learn the patterns automatically!?
The Method – Constraints
- We are looking for common entities (such as MUC Named Entities) …
- … and also for exceptional ones (book titles, sports, occupations, etc.)
- No annotated corpora
- No hand-crafted rules
- Thus, we will have to start with almost nothing: unsupervised or semi-supervised learning
The Method – Bootstrapping
- "… a process where a simple system activates a more complicated system …" (http://en.wikipedia.org)
- "… a complex system emerges by starting simply and, bit by bit, developing more complex capabilities on top of the simpler ones …" (http://en.wikipedia.org)
The Method – Bootstrapping
- Start with a seed:
  - Learn
  - Evaluate what was learned
  - Add the evaluated items to the seed
- Restart with the new seed
Excursus: Bootstrapping for WSD
- Yarowsky 1995:
  - Start with a small set of contexts for a given word (e.g. plant)
  - Determine log-likelihood values from a small annotated corpus
  - Sort the log-likelihood ratios by value
Excursus: Bootstrapping for WSD
- Look for the word (plant) and its context in the corpus
- Assign a sense (sense 1 or sense 2) on the basis of the best applicable log-likelihood ratio
- Find new context words that co-occur with the known context often enough
- Example:
  - target: plant
  - known context: species
  - co-occurrence: animal
Excursus: Bootstrapping for WSD
- Calculate the log-likelihood ratios of this new context
- Add them to the list
- Note: smoothing is useful
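A minimal sketch of the sense decision, including the smoothing the slide recommends. The co-occurrence counts and the helper names are invented for illustration; Yarowsky's actual decision lists are more elaborate:

```python
import math

def log_likelihood_ratio(word, counts, k=0.1):
    """log P(word | sense 1) / P(word | sense 2), with add-k smoothing
    so that unseen context words do not produce infinite ratios."""
    c1, c2 = counts.get(word, (0, 0))
    return math.log((c1 + k) / (c2 + k))

def disambiguate(context, counts):
    # Use the context word with the most decisive (largest-magnitude) ratio.
    best = max(context, key=lambda w: abs(log_likelihood_ratio(w, counts)))
    return "sense1" if log_likelihood_ratio(best, counts) > 0 else "sense2"

# Invented counts: (occurrences near sense 1 'living plant',
#                   occurrences near sense 2 'factory').
counts = {"species": (20, 1), "manufacturing": (0, 15)}
sense = disambiguate(["species", "animal"], counts)   # 'animal' is unseen
# sense == "sense1"
```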
The Method – Bootstrapping
- What does this mean for Information Extraction?
- Start with a small number of instances (and/or patterns):
  - Learn patterns from them
  - Evaluate the patterns
  - Add new patterns to the pattern set
  - Derive more instances from these new patterns
  - Evaluate the new instances
  - Add new instances to the instance set
- Restart with the enlarged instance (or pattern) set
- This is an iterative process. There are basically two "nested" bootstrapping loops.
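The two nested loops can be sketched on a toy corpus. Here a "pattern" is simply the word pair immediately around an instance, and the evaluation/filtering steps from the slide are deliberately left out; the corpus and helper names are invented:

```python
import re

CORPUS = ("Hotels in Berlin und Umgebung . Hotels in Paris und Umgebung . "
          "Flug nach Paris buchen . Flug nach Hamburg buchen .")

def patterns_for(instance):
    # Learn patterns: the word pair immediately around an instance.
    return {(m.group(1), m.group(2))
            for m in re.finditer(r"(\w+) %s (\w+)" % re.escape(instance), CORPUS)}

def instances_for(pattern):
    # Derive instances: whatever fills the pattern's slot in the corpus.
    left, right = pattern
    return set(re.findall(r"%s (\w+) %s" % (re.escape(left), re.escape(right)),
                          CORPUS))

def bootstrap(seed, rounds=2):
    instances, patterns = set(seed), set()
    for _ in range(rounds):
        for inst in list(instances):       # outer loop: instances -> patterns
            patterns |= patterns_for(inst)
        for pat in list(patterns):         # inner loop: patterns -> instances
            instances |= instances_for(pat)
    return instances, patterns

instances, _ = bootstrap({"Berlin"})
# after two rounds: {'Berlin', 'Paris', 'Hamburg'}
```

Note how the second pattern ("Flug nach … buchen") is only reachable via the instance "Paris" learned in the first round; without the evaluation step, noisy patterns would propagate in exactly the same way.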
The Method – Bootstrapping
- Some principal problems:
  - How to evaluate the patterns and the instances?
  - Add all instances (patterns) to the instance (pattern) set?
  - Start with instances or patterns or even with both?
  - By the way, what is a pattern?
  - What about convergence of the algorithm?
  - What about corpus size?
Some Case Studies: Corpus and Method
- Corpus: web and WSJ
- Apply the algorithm described, but choose patterns/instances manually
Some Case Studies: City
- Start with one instance: "Berlin"
- Pattern found: "Hotels in Berlin und Umgebung"
  - Search for "Hotels in *":
    - Paris
    - München
    - Hamburg
    - etc.
    - but also: Europa, Mecklenburg-Vorpommern
- Now restart the web search with the new instances to get new patterns
Some Case Studies: Professions
- Start with one instance: "lawyer"
- Patterns found: "lawyer's job", "hire a lawyer"
  - Search for "*'s job":
    - forester
    - therapist
    - reporter
    - etc.
    - but also: employee, John …
- Now restart the web search with the new instances to get new patterns
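The pattern step above can be sketched with a regular expression; the snippets standing in for web search results are invented, and the noise ("John") is exactly the kind of false hit the slide mentions:

```python
import re

# Invented snippets standing in for web search results.
snippets = [
    "a lawyer's job is to advise clients",
    "the forester's job involves long days outdoors",
    "John's job was never announced",      # false hit: a name, not a profession
]

def candidates_from_pattern(snippets):
    """Apply the "*'s job" pattern to extract candidate professions."""
    found = []
    for s in snippets:
        found += re.findall(r"(\w+)'s job", s)
    return found

candidates = candidates_from_pattern(snippets)
# ['lawyer', 'forester', 'John'] -- 'John' shows why instance evaluation is needed
```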
Some Case Studies: Problems
- Patterns match a lot of different instance types → need criteria to choose good patterns
- Instances could be multiwords → need criteria that determine "instance boundaries"
- Even if the patterns are good, the instances found could be wrong ones → need criteria to decide about instances
Some Case Studies: List Search
- Start a web search with 5 instances at a time, e.g.:
  - tennis
  - football
  - ballet
  - sailing
  - baseball
- Get lists with lots of additional instances all at once
Some Case Studies: Problems
- Only works on the web!
- For some instance types it doesn't work at all!
- Deciding on the 5 instances: too similar or too different → no lists found
- Finding the actual list in the web page
Roadmap
- Decide on a bootstrapping approach and implement it
- Run it for MUC Named Entities
- Run it for "simple", one-word classes (e.g. sports, occupations)
- Run it for "difficult" classes (e.g. book titles, movies)
- Run it for different classes at the same time
Literature Survey
There are some publications which address either bootstrapping or flexible IE:
- E. Riloff, R. Jones (1999): Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.
- E. Agichtein, L. Gravano (2000): Snowball: Extracting Relations from Large Plain-Text Collections.
- R. Yangarber et al. (2000): Automatic Acquisition of Domain Knowledge for Information Extraction.
- O. Etzioni et al. (2004): Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison.
- D. Yarowsky (1995): Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.
- S. Abney (2002): Bootstrapping.
- S. Abney (2004): Understanding the Yarowsky Algorithm.
Thank you!