
I 256 Applied Natural Language Processing, Fall 2009. Lecture 13: Information Extraction (1). Barbara Rosario

Today
• Another project proposal
• Classification recap
• Information Extraction (1)
• Midterm evaluations

Classifying at Different Granularities
• Text Categorization: classify an entire document
• Information Extraction (IE): identify and classify small units within documents
• Named Entity Extraction (NE): a subset of IE; identify and classify proper names (people, locations, organizations)

What is Information Extraction? As a task: filling slots in a database from unstructured text.

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is [...]. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Empty slot template to be filled: NAME, TITLE, ORGANIZATION.

Adapted from slide by William Cohen

Information Extraction. Task: extract structured data, such as tables, from unstructured text.

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is [...]. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Adapted from slide by William Cohen

Information Extraction. As a family of techniques: Information Extraction = segmentation + classification + association.

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Segmentation (aka "named entity extraction/detection") picks out the relevant strings: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation. Classification and association then assign types to these strings and group them into records.

Adapted from slide by William Cohen

Landscape of IE Tasks: Degree of Formatting. Input ranges from:
• Text paragraphs without formatting
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

Example text: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."

Landscape of IE Tasks: Intended Breadth of Coverage
• Web site specific (formatting): Amazon.com book pages
• Genre specific (layout): resumes
• Wide, non-specific (language): university names

Adapted from slide by William Cohen

Landscape of IE Tasks: Complexity
• Closed set, e.g. U.S. states: "He was born in Alabama…", "The big Wyoming sky…"
• Regular set, e.g. U.S. phone numbers: "Phone: (413) 545-1323", "The CALD main office can be reached at 412-268-1299"
• Complex pattern, e.g. U.S. postal addresses: "University of Arkansas, P.O. Box 140, Hope, AR 71802", "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
• Ambiguous patterns, needing context and many sources of evidence, e.g. person names: "…was among the six houses sold by Hope Feldman that year.", "Pawel Opalinski, Software Engineer at WhizBang Labs."

Adapted from slide by William Cohen
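
To make the "regular set" level concrete: something like U.S. phone numbers can often be captured with a single regular expression. A minimal sketch (the pattern is illustrative only and misses many real-world variants):

    import re

    # Illustrative pattern for the two phone-number forms shown above;
    # real numbers appear in many more surface forms.
    phone_re = re.compile(r'\(?\d{3}\)?[ -]\d{3}[ -]\d{4}')

    text = ("Phone: (413) 545-1323. "
            "The CALD main office can be reached at 412-268-1299.")
    print(phone_re.findall(text))   # ['(413) 545-1323', '412-268-1299']

The "closed set" level is even simpler (a fixed list), while the "ambiguous patterns" level is exactly where classifiers with contextual features become necessary.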

Applications. Information Extraction's applications include:
• Extracting structured data out of electronically available scientific literature, especially in the domain of biology and medicine
• Legal documents
• Business intelligence
• Resume harvesting
• Media analysis
• Sentiment detection
• Patent search
• Email scanning

Information Extraction Architecture (diagram)

Main tasks
• Named Entity Recognition
• Relation Extraction
• Relations like subject are syntactic; relations like person, location, agent or message are semantic

Landscape of IE Tasks. Example: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."
• Single entity (named entity recognition): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
• Binary relationship (relation extraction): Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
• N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)

Adapted from slide by William Cohen

Named entity recognition
• The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities.
• Named entities: definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, etc.
• Can be broken down into two sub-tasks:
  1. identifying the boundaries of the NE (segmentation)
  2. identifying its type (classification)
Example: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt." (first identification of the NEs, then classification of the NEs)

Methods
• Look up each word in an appropriate list of names. For example, for locations we could use a gazetteer, or geographical dictionary, such as the Alexandria Gazetteer or the Getty Gazetteer.
  – Error-prone; case distinctions may help, but these are not always present.
(Figure: Location Detection by Simple Lookup for a News Story.)
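
A minimal sketch of what lookup-based location detection looks like, with a toy hand-made gazetteer and a hypothetical lookup_locations helper (a real system would load a resource such as the Alexandria or Getty gazetteer instead):

    # Toy gazetteer; a real one would contain many thousands of place names.
    gazetteer = {"alabama", "wyoming", "taiwan", "hope"}

    def lookup_locations(tokens):
        # Ignoring case and context is exactly what makes this error-prone:
        # "Hope" the Arkansas town and "Hope" the person name look identical here.
        return [tok for tok in tokens if tok.lower() in gazetteer]

    print(lookup_locations("He was born in Alabama".split()))
    # ['Alabama']
    print(lookup_locations("houses sold by Hope Feldman that year".split()))
    # ['Hope']  -- a false positive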

Ambiguity
• Many named entity terms are ambiguous:
  – May and North are likely to be parts of named entities for DATE and LOCATION, respectively, but could both be part of a PERSON.
  – Christian Dior looks like a PERSON but is more likely to be of type ORGANIZATION.
  – A term like Yankee will be an ordinary modifier in some contexts, but will be marked as an entity of type ORGANIZATION in the phrase Yankee infielders.
• Further challenges:
  – Multi-word names like Stanford University
  – Names that contain other names, such as Cecil H. Green Library and Escondido Village Conference Service Center
• In named entity recognition, therefore, we need to be able to identify the beginning and end of multi-token sequences: chunking.

Chunking
• Chunking is useful for entity recognition
• Segment and label multi-token sequences
• Each of these larger (multi-token) units is called a chunk
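
As a first concrete example, NLTK's RegexpParser builds NP chunks from a tag pattern. A minimal sketch in the style of the NLTK book (grammar and sentence are illustrative):

    import nltk

    # An NP chunk = optional determiner, any number of adjectives, then a noun.
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)

    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                ("the", "DT"), ("cat", "NN")]
    print(cp.parse(sentence))   # a tree whose NP subtrees are the chunks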

Chunking
• The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, annotated with part-of-speech tags and chunk tags.
• Three chunk types in CoNLL 2000: NP chunks, VP chunks, PP chunks
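
The CoNLL 2000 data can be read directly through NLTK's corpus reader; a short sketch, assuming the corpus has been downloaded (e.g. with nltk.download('conll2000')):

    from nltk.corpus import conll2000

    # Chunked sentences, restricted here to NP chunks only.
    train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
    print(len(train_sents))      # number of training sentences
    print(train_sents[99])       # one sentence as a chunk tree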

Chunking
• More info in Section 7.2 of the NLTK book
• We may cover this later during the course, but for now just remember that:
• You probably need to do chunking if you do a named entity recognition task
• And perhaps also use parse-tree-based features (next slide):

Path Features (figure from Dan Klein's CS 288 slides, UC Berkeley)

NLTK NE classifier
• NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk().
  – If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE (geo-political entity).
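
A short sketch of using this pre-trained classifier (it assumes the relevant NLTK resources, such as the tokenizer, POS tagger and NE chunker models, have already been downloaded):

    import nltk

    sent = "Jack Welch will retire as CEO of General Electric tomorrow."
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))

    print(nltk.ne_chunk(tagged, binary=True))   # entities tagged only as NE
    print(nltk.ne_chunk(tagged))                # PERSON, ORGANIZATION, GPE, ...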

NE methods
• Next class

Relation Extraction
• Once named entities have been identified in a text, we then want to extract the relations that exist between them
  – Typically relations between specified types of named entity.
• One approach: hand-code patterns that contain strings expressing the relation we are looking for.
  – Example: look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y.

Relation Extraction
• nltk.sem.extract_rels
• Problem with hand-crafted patterns? An alternative: machine-learning systems, which typically attempt to learn 'relation patterns' automatically from a training corpus.
• Next week we'll see some machine learning methods to tackle this
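
Roughly the NLTK book example for nltk.sem.extract_rels: find ORGANIZATION-LOCATION pairs whose intervening string matches a pattern containing the word "in" (a sketch, assuming the IEER corpus has been downloaded):

    import re
    import nltk

    # Filter on the intervening string α: the word "in", but not e.g. "in ...ing".
    IN = re.compile(r'.*\bin\b(?!\b.+ing)')

    for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
        for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                         corpus='ieer', pattern=IN):
            print(nltk.sem.rtuple(rel))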

Semantic Parsers. Example: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."
• Single entity (named entity recognition): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
• Binary relationship (relation extraction): Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
• N-ary record (semantic parsers): Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)

Adapted from slide by William Cohen

PropBank / FrameNet
• FrameNet: roles shared between verbs
• PropBank: each verb has its own roles
• PropBank more used, because it's layered over the treebank (and so has greater coverage, plus parses)
• Note: some linguistic theories postulate even fewer roles than FrameNet (e.g., 5-20 total: agent, patient, instrument, etc.)

From Dan Klein's CS 288 slides (UC Berkeley)

PropBank Examples (three example slides, figures from Dan Klein's CS 288 slides, UC Berkeley)

Message Understanding Conference (MUC)
• DARPA funded significant efforts in IE in the early to mid 1990s.
• The Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:
  – Terrorist events
  – Industrial joint ventures
  – Company management changes
• Information extraction of particular interest to the intelligence community (CIA, NSA).

Message Understanding Conference (MUC)
• Named entity: Person, Organization, Location
• Co-reference: e.g., Clinton ↔ President Bill Clinton
• Template element: Perpetrator, Target
• Template relation: Incident
• Multilingual

MUC Typical Text: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and "metal wood" clubs a month.

Example of IE from FASTUS (1993): Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.

MUC templates extracted:

TIE-UP-1
  Relationship: TIE-UP
  Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
  Joint Venture Company: "Bridgestone Sports Taiwan Co."
  Activity: ACTIVITY-1
  Amount: NT$20000000

ACTIVITY-1
  Activity: PRODUCTION
  Company: "Bridgestone Sports Taiwan Co."
  Product: "iron and 'metal wood' clubs"
  Start Date: DURING: January 1990

Three generations of IE systems
• Hand-Built Systems – Knowledge Engineering [1980s– ]
  – Rules written by hand
  – Require experts who understand both the systems and the domain
  – Iterative guess-test-tweak-repeat cycle
• Automatic, Trainable Rule-Extraction Systems [1990s– ]
  – Rules discovered automatically using predefined templates, using automated rule learners
• Statistical Models [1997– ]
  – Use machine learning to learn which features indicate boundaries and types of entities
  – Learning usually supervised; may be partially unsupervised

Successors to MUC
• CoNLL: Conference on Computational Natural Language Learning
  – Different topics each year
  – 2002, 2003: Language-independent NER
  – 2004: Semantic role recognition
  – 2001: Identify clauses in text
  – 2000: Chunking boundaries
  – http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002, …)
• ACE: Automated Content Extraction
  – Entity Detection and Tracking
  – Sponsored by NIST
  – http://wave.ldc.upenn.edu/Projects/ACE/
• Several others recently
  – See http://cnts.uia.ac.be/conll2003/ner/

CoNLL-2003
• Goal: identify boundaries and types of named entities
  – People, Organizations, Locations, Misc.
  – Experiment with incorporating external resources (gazetteers) and unlabeled data
• Data: 4 pieces of info for each term: Word, POS, Chunk, EntityType
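
The canonical example from the shared-task description illustrates this four-column format (word, POS tag, chunk tag, entity tag); the fragment below is reproduced from memory and may differ slightly in detail:

    U.N.      NNP  I-NP  I-ORG
    official  NN   I-NP  O
    Ekeus     NNP  I-NP  I-PER
    heads     VBZ  I-VP  O
    for       IN   I-PP  O
    Baghdad   NNP  I-NP  I-LOC
    .         .    O     O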

Summary of Results
• 16 systems participated
• Machine learning techniques
  – Combinations of Maximum Entropy Models (5) + Hidden Markov Models (4) + Winnow/Perceptron (4)
  – Others used once were Support Vector Machines, Conditional Random Fields, Transformation-Based Learning, AdaBoost, and memory-based learning
  – Combining techniques often worked well
• Features
  – Choice of features is at least as important as the ML method
  – Top-scoring systems used many types
  – No one feature stands out as essential (other than words)

Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

Evaluating IE Accuracy
• Always evaluate performance on independent, manually annotated test data not used during system development.
• Measure for each test document:
  – Total number of correct extractions in the solution template: N
  – Total number of slot/value pairs extracted by the system: E
  – Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
• Compute average value of metrics adapted from IR:
  – Recall = C/N
  – Precision = C/E
  – F-measure = harmonic mean of recall and precision
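
These metrics are straightforward to compute once N, E and C are counted; a minimal sketch with a hypothetical ie_scores helper:

    def ie_scores(n_in_solution, n_extracted, n_extracted_correct):
        """N = slots in the solution template, E = extracted slot/value pairs,
        C = extracted pairs that are also in the solution template."""
        recall = n_extracted_correct / n_in_solution
        precision = n_extracted_correct / n_extracted
        f_measure = 2 * precision * recall / (precision + recall)
        return precision, recall, f_measure

    # Toy example: 10 slots in the solution, 8 extracted, 6 of them correct.
    print(ie_scores(10, 8, 6))   # precision 0.75, recall 0.6, F ~ 0.67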

Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

Use of External Information
• Improvement from using gazetteers vs. unlabeled data was nearly equal
• Gazetteers were less useful for German than for English (where they were of higher quality)
• Note: a standard way to understand the impact of some features (for a given algorithm) is to remove them and measure the change in error

Precision, Recall, and F-Scores (results table; * = not significantly different). Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

Combining Results
• What happens if we combine the results of all of the systems?
  – Used a majority vote of 5 systems for each set
  – English: F = 90.30 (14% error reduction over the best system)
  – German: F = 74.17 (6% error reduction over the best system)
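
A minimal sketch of this kind of system combination, assuming each system outputs one entity tag per token; the majority_vote helper and the tag sequences are hypothetical, not the actual CoNLL submissions:

    from collections import Counter

    def majority_vote(per_system_tags):
        """per_system_tags: list of tag sequences, one per system,
        all aligned to the same tokens. Returns the majority tag per token."""
        return [Counter(tags).most_common(1)[0][0]
                for tags in zip(*per_system_tags)]

    systems = [
        ["I-PER", "O", "I-ORG"],   # system 1
        ["I-PER", "O", "I-LOC"],   # system 2
        ["I-PER", "O", "I-ORG"],   # system 3
        ["O",     "O", "I-ORG"],   # system 4
        ["I-PER", "O", "I-ORG"],   # system 5
    ]
    print(majority_vote(systems))   # ['I-PER', 'O', 'I-ORG']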

IE Tools
• Research tools
  – GATE: http://gate.ac.uk/
  – MinorThird: http://minorthird.sourceforge.net/
  – Alembic (only NE tagging): http://www.mitre.org/tech/alembic-workbench/
• Commercial
  – ?? I don't know which ones work well

Resources (not checked, but from http://nlp.stanford.edu/links/statnlp.html)

Semantic Parsers
• ASSERT – PropBank semantic roles (and opinions, etc.), by Sameer Pradhan.
• Shalmaneser – FrameNet-based, by Katrin Erk.
• Tree Kernels in SVMlight, by Alessandro Moschitti – a general package, but it has particularly been used for SRL.

Named Entity Recognition
• Stanford Named Entity Recognizer – a Java Conditional Random Field sequence model with trained models for named entity recognition. Java, GPL. By Jenny Finkel.
• LingPipe – tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java, GPL. By Bob Carpenter, Breck Baldwin and co.
• YamCha – SVM-based NP chunker, also usable for POS tagging, NER, etc. C/C++, open source. Won the CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Resources
• Information Extraction / Wrapper Induction
  – Wrapper induction is a method for automatically constructing "wrappers", or scripts which automate the process of retrieving information from a lightly-structured information resource.
  – Introduction to Information Extraction Technology – a tutorial by Douglas E. Appelt and David Israel.
  – IE data sets – updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.
  – Web -> KB – CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic-model IE work, and links to data sets.
  – RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears to not have been updated since 1999.
  – Web IR and IE (Einat Amitay) – various links on IR and IE on the web.
  – Web question answering system (University of Michigan)
  – GATE: General Architecture for Text Engineering (Sheffield)
  – Genia Project – biomedical text information extraction corpus (Tsujii lab), plus IE tutorial slides.

Not checked, but from http://nlp.stanford.edu/links/statnlp.html