  • Number of slides: 114

Information Extraction from Scientific Texts
Junichi Tsujii
Graduate School of Science, University of Tokyo, Japan

Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with other sources such as databases and numerical data. Natural Language Processing: Information Extraction (IE)

Overview of GENIA System (architecture diagram; labels as they appear on the slide)
Modules: Retrieval Module; Corpus Module; Interface Module; Concept Module; Information Extraction Module; Database Module
Task labels: Markup generation / compilation; Annotated corpus construction; BK design / construction / compilation; GUI; HTML conversion; System integration; Identify & classify terms; Identify events; Request enhancement; Spawn request; Classify documents; DB design / access / management; DB construction
Other labels: Text Structure; Annotated; Event; Security; Data model; IR; Request; Abstract; Full Paper; Database; Document; Named-Entity; Markup language; User; Background Knowledge; Ontology; MEDLINE; Corpus; Raw (OCR)

Plan
1. What is IE?
2. General Framework of NLP
3. Basic IE techniques
4. IE in Biology
Automatic Term Recognition (S. Ananiadou)

What is IE?

Application Tasks of NLP
(1) Information Retrieval/Detection: to search and retrieve documents in response to queries for information
(2) Passage Retrieval: to search and retrieve parts of documents in response to queries for information
(3) Information Extraction: to extract information that fits pre-defined database schemas or templates, which specify the output formats
(4) Question/Answering Tasks: to answer general questions by using texts as a knowledge base (fact retrieval, combination of IR and IE)
(5) Text Understanding: to understand texts as people do (Artificial Intelligence)

Ranges of Queries
(1) Information Retrieval/Detection
(2) Passage Retrieval
(3) Information Extraction
(4) Question/Answering Tasks
(5) Text Understanding
Pre-defined: fixed aspects of information carried in texts

Example of IE: FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000

ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990

FASTUS: based on finite-state automata (FSA)
1. Complex Words: recognition of multi-words and proper names. Examples: “set up”, “new Taiwan dollars”
2. Basic Phrases: simple noun groups, verb groups and particles. Examples: “a Japanese trading house”, “had set up”
3. Complex Phrases: complex noun groups and verb groups. Example: “production of 20,000 iron and metal wood clubs”
4. Domain Events: patterns for events of interest to the application; basic templates are to be built. Example: [company] [set up] [Joint-Venture] with [company]
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event.

Information Extraction

… Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. …

Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
  (1) Name: X? Country: Germany
  (2) Name: Y? Country: China

Information Extraction

A German vehicle-firm executive was stabbed to death … Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. …

Different template for crimes:
Crime-Type: Murder
Type: Stabbing
The killed:
  Name: Jurgen Pfrang
  Age: 51
  Profession: Deputy general manager
Location: Nanjing, China

Interpretation of Texts (who interprets)
(1) Information Retrieval/Detection: User
(2) Passage Retrieval: User
(3) Information Extraction: System
(4) Question/Answering Tasks: System
(5) Text Understanding: System

[Diagram] Characterization of Texts; IR System; Queries; Collection of Texts

[Diagram] Knowledge; Interpretation; Characterization of Texts; IR System; Queries; Collection of Texts

[Diagram] Knowledge; Interpretation; Characterization of Texts; Passage IR System; Queries; Collection of Texts

[Diagram] Knowledge; Interpretation; Characterization of Texts; Passage IR System; IE System; Queries; Structures of Sentences; NLP; Collection of Texts; Templates

[Diagram] Knowledge; Interpretation; IE System; Texts; Templates

IE as compromise NLP
[Diagram] Knowledge; Interpretation; IE System; General Framework of NLP/NLU; Texts; Predefined Templates

Performance Evaluation
(1) Information Retrieval/Detection: rather clear
(2) Passage Retrieval: a bit vague
(3) Information Extraction: rather clear
(4) Question/Answering Tasks: a bit vague
(5) Text Understanding: very vague

Query, evaluated against a Collection of Documents
N: Correct Documents
M: Retrieved Documents
C: Correct Documents that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P R}{P + R}$

Query, evaluated against a Collection of Documents
N: Correct Templates
M: Retrieved Templates
C: Correct Templates that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P R}{P + R}$
More complicated due to partially filled templates
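As a small illustration of the definitions above, the sketch below computes P, R and F from two sets. The function name and the document identifiers are made up for this example, and it treats every item as either fully correct or not, so it does not address the partially filled templates mentioned on the slide.

```python
def precision_recall_f(correct: set, retrieved: set) -> tuple:
    """Return (P, R, F) given the correct items and the retrieved items."""
    c = len(correct & retrieved)                    # C: correct items actually retrieved
    p = c / len(retrieved) if retrieved else 0.0    # P = C / M
    r = c / len(correct) if correct else 0.0        # R = C / N
    f = 2 * p * r / (p + r) if (p + r) else 0.0     # F = 2PR / (P + R)
    return p, r, f


# Hypothetical query result: N = 4 correct documents, M = 3 retrieved, C = 2.
correct = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(correct, retrieved)
print(p, r, f)
```

The guards against empty sets are a design choice of this sketch; evaluation campaigns may define those edge cases differently.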

General Framework of NLP

General Framework of NLP, illustrated with the input “John runs.”
Morphological and Lexical Processing: John (P-N); run+s (V, 3-pre; or N, plu)
Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]]
Semantic Analysis: Pred: RUN, Agent: John
Context processing: “John is a student. He runs.”
Interpretation

General Framework of NLP
Morphological and Lexical Processing: Tokenization; Part of Speech Tagging; Inflection/Derivation; Compounding; Term recognition (Ananiadou)
Syntactic Analysis
Semantic Analysis
Context processing
Interpretation
Domain Analysis (Appelt: 1999)

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
Morphological and Lexical Processing: incomplete lexicons: open class words; terms (term recognition); named entities (company names, locations, numerical expressions)
Syntactic Analysis: incomplete grammar: syntactic coverage; domain-specific constructions; ungrammatical constructions
Semantic Analysis, Context processing, Interpretation: incomplete domain knowledge; interpretation rules; predefined aspects of information

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
General Framework of NLP: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing; Interpretation

Difficulties of NLP
(2) Ambiguities: Combinatorial Explosion
Morphological and Lexical Processing: most words in English are ambiguous in terms of their parts of speech.
  runs: v/3 pre, n/plu
  clubs: v/3 pre, n/plu, and two meanings
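To make the combinatorial explosion concrete, the sketch below enumerates every tag sequence licensed by a toy lexicon. The lexicon and function names are hypothetical; only the two tags per word for “runs” and “clubs” come from the slide. The number of candidate sequences is the product of the per-word tag counts, so it grows exponentially with sentence length.

```python
from itertools import product
from math import prod

# Toy lexicon (an assumption of this sketch): each word lists its possible tags.
LEXICON = {
    "John": ["P-N"],
    "runs": ["V/3-pre", "N/plu"],
    "clubs": ["V/3-pre", "N/plu"],
}


def tag_sequences(words):
    """Enumerate every combination of tags, one tag per word."""
    return list(product(*(LEXICON[w] for w in words)))


def count_tag_sequences(words):
    """Count the combinations without enumerating them."""
    return prod(len(LEXICON[w]) for w in words)


for sequence in tag_sequences(["John", "runs", "clubs"]):
    print(sequence)
print(count_tag_sequences(["runs"] * 10))
```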

Difficulties of NLP
(2) Ambiguities: Combinatorial Explosion
Syntactic Analysis: Structural Ambiguities
Semantic Analysis: Predicate-argument Ambiguities

Structural Ambiguities
(1) Attachment Ambiguities
  John bought a car with large seats.
  John bought a car with $3000.
  The manager of Yaxing Benz, a Sino-German joint venture
  The manager of Yaxing Benz, Mr. John Smith
(2) Scope Ambiguities
  young women and men in the room
(3) Analytical Ambiguities
  Visiting relatives can be boring.
Semantic Ambiguities (1)
  John bought a car with Mary.
  $3000 can buy a nice car.
Semantic Ambiguities (2)
  Every man loves a woman.
Co-reference Ambiguities

Difficulties of NLP
(2) Ambiguities: Combinatorial Explosion
Structural ambiguities (Syntactic Analysis) and predicate-argument ambiguities (Semantic Analysis) combine, leading to combinatorial explosion.

Note: Ambiguities vs Robustness
More comprehensive knowledge (big dictionaries, comprehensive grammar): more robust
More comprehensive knowledge: more ambiguities
Adaptability: tuning, learning

Framework of IE: IE as compromise NLP

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
General Framework of NLP: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing; Interpretation
Incomplete domain knowledge; interpretation rules; predefined aspects of information

Techniques in IE
(1) Domain Specific Partial Knowledge: knowledge relevant to the information to be extracted
(2) Ambiguities: ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation Techniques: machine learning, trainable systems

General Framework of NLP: techniques used in IE
Morphological and Lexical Processing
  Part of Speech Tagger: FSA rules; statistical taggers; 95 %
  Open class words: named entity recognition, e.g. locations, persons, companies, organizations, position names
    Domain specific rules: “, Inc.”, “Mr.”
    Local context; statistical bias
    Machine Learning: HMM, Decision Trees
    Rules + Machine Learning
    F-Value 90; domain dependent
Syntactic Analysis
Semantic Analysis
Context processing
Interpretation
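A minimal sketch of the domain specific rules named above, assuming that a capitalized word sequence followed by “, Inc.” is a company and that a capitalized word sequence after “Mr.” is a person. The regular expressions, labels and example sentence are illustrative assumptions, not rules taken from any actual system, and real named entity recognizers combine many more cues.

```python
import re

# Assumed rule: capitalized words immediately followed by ", Inc." form a company name.
COMPANY_RE = re.compile(r"(?:[A-Z][\w&-]*\s)*[A-Z][\w&-]*, Inc\.")
# Assumed rule: capitalized words immediately after "Mr." form a person name.
PERSON_RE = re.compile(r"Mr\.\s+((?:[A-Z][a-z]+\s)*[A-Z][a-z]+)")


def tag_entities(text):
    """Return (label, string) pairs found by the two local-context rules."""
    entities = [("PERSON", m.group(1)) for m in PERSON_RE.finditer(text)]
    entities += [("COMPANY", m.group(0)) for m in COMPANY_RE.finditer(text)]
    return entities


print(tag_entities("Mr. John Smith joined Acme Widgets, Inc. in January."))
```

Rules of this kind cope with open class words because they rely on the local context rather than on a dictionary entry for the name itself.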

FASTUS: based on finite-state automata (FSA)
Shown alongside the General Framework of NLP (Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing; Interpretation)
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are to be built.
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event.

Chomsky Hierarchy
Hierarchy of Grammar          Hierarchy of Automata
Regular Grammar               Finite State Automata
Context Free Grammar          Push Down Automata
Context Sensitive Grammar     Linear Bounded Automata
Type 0 Grammar                Turing Machine
Moving down the table: computationally more complex, less efficient.
Example: $a^n b^n$ cannot be generated by a regular grammar; it requires a context-free grammar.
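The sketch below illustrates the boundary between the first two rows of the hierarchy. A recognizer with a single counter, standing in for the stack of a push-down automaton, accepts exactly the strings a^n b^n, whereas the regular pattern a+b+ can only check the overall shape and cannot compare the two counts. The function and constant names are made up for this example.

```python
import re


def is_anbn(s: str) -> bool:
    """Accept a^n b^n (n >= 1); the counter plays the role of a stack."""
    count = 0
    i = 0
    while i < len(s) and s[i] == "a":   # "push" once per leading a
        count += 1
        i += 1
    if count == 0:
        return False
    while i < len(s) and s[i] == "b":   # "pop" once per following b
        count -= 1
        i += 1
    # Accept only if the whole input was consumed and the counts balanced.
    return i == len(s) and count == 0


# A regular pattern: checks "some a's then some b's" but not equal counts.
SHAPE = re.compile(r"a+b+")

for s in ["aaabbb", "aabbb", "abab"]:
    print(s, is_anbn(s), SHAPE.fullmatch(s) is not None)
```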

[Diagram] Finite-state automaton with states 0 to 4 and arc labels PN, ’s, Art, ADJ, N, P. Example input: John’s interesting book with a nice cover

Pattern-matching
Pattern: {PN ’s / Art} (ADJ)* N (P Art (ADJ)* N)*
Matched instance: PN ’s (ADJ)* N P Art (ADJ)* N
Input: John’s interesting book with a nice cover
[Diagram] The finite-state automaton with states 0 to 4 and arc labels PN, ’s, Art, ADJ, N, P
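The pattern above can be tried out directly as a regular expression over a string of tags. In this sketch the word-to-tag table is a hypothetical stand-in for a part-of-speech tagger, the possessive marker is written with a straight apostrophe, and tags are joined with single spaces so that the slide's pattern carries over almost symbol for symbol.

```python
import re

# {PN 's / Art} (ADJ)* N (P Art (ADJ)* N)*  written over space-separated tags.
NOUN_GROUP = re.compile(r"(?:PN 's|Art)(?: ADJ)* N(?: P Art(?: ADJ)* N)*")

# Hypothetical tag table covering only the words of the slide's example.
TAGS = {
    "John": "PN", "'s": "'s", "interesting": "ADJ", "book": "N",
    "with": "P", "a": "Art", "nice": "ADJ", "cover": "N",
}


def is_noun_group(tokens):
    """Tag each token, then test the whole tag string against the pattern."""
    tag_string = " ".join(TAGS[t] for t in tokens)
    return NOUN_GROUP.fullmatch(tag_string) is not None


phrase = ["John", "'s", "interesting", "book", "with", "a", "nice", "cover"]
print(is_noun_group(phrase))
```

Because the pattern is regular, Python compiles it to a matcher that plays the same role as the automaton in the diagram.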

Example of IE: FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 “metal wood” clubs a month.

1. Complex Words
2. Basic Phrases:
  Bridgestone Sports Co.: Company name
  said: Verb Group
  Friday: Noun Group
  it: Noun Group
  had set up: Verb Group
  a joint venture: Noun Group
  in: Preposition
  Taiwan: Location
3. Complex Phrases

Notes on stages 1 and 2:
  Attachment ambiguities are not made explicit.
  Structural ambiguities of NPs are ignored, e.g. “a Japanese tea house” is not resolved to either a [Japanese tea] house or a Japanese [tea house].

Example of IE: FASTUS (1993)

After 2. Basic Phrases:
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] and [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE], [COMPANY], capitalized at 20 million [CURRENCY-UNIT] [START] production in [TIME] with production of 20,000 [PRODUCT] a month.

3. Complex Phrases: some syntactic structures like …

After 3. Complex Phrases:
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME] with production of [PRODUCT] a month.

Syntactic structures relevant to the information to be extracted are dealt with.

Syntactic variations
  GM set up a joint venture with Toyota.
  GM announced it was setting up a joint venture with Toyota.
  GM signed an agreement setting up a joint venture with Toyota.
  GM announced it was signing an agreement to set up a joint venture with Toyota.
  GM plans to set up a joint venture with Toyota.
  GM expects to set up a joint venture with Toyota.

[Diagram] The parse of “GM signed an agreement setting up …” (S; NP: GM; VP; V: signed; NP; N: agreement; VP; V: setting up) is reduced to S; NP: GM; VP; V: set up, i.e. a single [SET-UP] verb group.

Example of IE: FASTUS (1993)

[COMPANY] [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME] with production of [PRODUCT] a month.

3. Complex Phrases
4. Domain Events
  [COMPANY] [SET-UP] [JOINT-VENTURE] with [COMPANY]
  [COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]

The attachment positions of PPs are determined at this stage. Irrelevant parts of sentences are ignored.
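The second domain-event pattern, with its (others)* gap, can be sketched as a scan over the phrase labels produced by the earlier stages. The (label, text) pair format, the WITH label for the preposition and the function name are assumptions made for this illustration; FASTUS itself compiles such patterns into finite-state machines.

```python
# Assumed output of the complex-phrase stage for the first example sentence.
PHRASES = [
    ("COMPANY", "Bridgestone Sports Co."),
    ("SET-UP", "had set up"),
    ("JOINT-VENTURE", "a joint venture"),
    ("P", "in"),
    ("LOCATION", "Taiwan"),
    ("WITH", "with"),
    ("COMPANY", "a local concern"),
]


def match_tie_up(phrases):
    """Match [COMPANY][SET-UP][JOINT-VENTURE] (others)* with [COMPANY]."""
    labels = [label for label, _ in phrases]
    for i in range(len(labels) - 2):
        if labels[i:i + 3] != ["COMPANY", "SET-UP", "JOINT-VENTURE"]:
            continue
        # (others)*: skip irrelevant phrases until "with" + [COMPANY] is found.
        for j in range(i + 3, len(labels) - 1):
            if labels[j] == "WITH" and labels[j + 1] == "COMPANY":
                return {
                    "Relationship": "TIE-UP",
                    "Entities": [phrases[i][1], phrases[j + 1][1]],
                }
    return None


print(match_tie_up(PHRASES))
```

Skipping over “in Taiwan” is what the slide means by ignoring irrelevant parts of the sentence: the “with” phrase is attached to the set-up event without parsing what lies in between.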

Complications caused by syntactic variations
Relative clause: The mayor, who was kidnapped yesterday, was found dead today.
  [NG] Relpro {NG/others}* [VG] {NG/others}* [VG]
  [NG] Relpro {NG/others}* [VG]
[Diagram] Basic patterns → Surface Pattern Generator (relative clause construction, passivization, etc.) → Patterns used by Domain Event

FASTUS: based on finite-state automata (FSA)
1. Complex Words
2. Basic Phrases
3. Complex Phrases
4. Domain Events: patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event.
Example: NP, who was kidnapped, was found.
Piece-wise recognition of basic templates
Reconstructing information carried via syntactic structures by merging basic templates

Current state of the art of IE
1. Carefully constructed IE systems: F-60 level (inter-annotator agreement: 60-80%)
Domains:
• Telegraphic messages about naval operations (MUC-1: 1987, MUC-2: 1989)
• News articles and transcriptions of radio broadcasts about Latin American terrorism (MUC-3: 1991, MUC-4: 1992)
• News articles about joint ventures (MUC-5: 1993)
• News articles about management changes (MUC-6: 1995)
• News articles about space vehicles (MUC-7: 1997)
2. Handcrafted rules (named entity recognition, domain events, etc.)
Automatic learning from texts:
• Supervised learning: corpus preparation
• Non-supervised, or controlled, learning
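A figure such as "F-60" refers to an F-score, which combines precision and recall into one number. The following is a minimal sketch, assuming the balanced form (beta = 1); the function name f_measure and the counts are invented for illustration and are not MUC results.

```python
def f_measure(correct, produced, expected, beta=1.0):
    """F-score from slot counts.

    correct  -- slots the system filled correctly
    produced -- all slots the system filled
    expected -- all slots in the answer key
    """
    if produced == 0 or expected == 0:
        return 0.0
    precision = correct / produced
    recall = correct / expected
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 56 correct fills out of 86 produced, 100 expected.
score = f_measure(correct=56, produced=86, expected=100)
print(f"F = {100 * score:.1f}")
```

With beta = 1 the score is the harmonic mean of precision and recall, so a system cannot reach a high F-score by maximizing only one of the two.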

IE in Biology

CSNDB (National Institute of Health Sciences)
• A data- and knowledge-base for signaling pathways of human cells.
– It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
– Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
– CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
– The final goal is to make a computerized model for various biological phenomena.

Example 1
• A Standard Reaction
Signal_Reaction: “EGF receptor Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction “SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]
Excerpted from [Takai 98]
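A record like this one, together with the idea of pathways as binary relationships drawn as graphs, can be sketched in a few lines. This is not CSNDB's actual schema (CSNDB is built on ACEDB); the class SignalReaction, the function build_graph, and the adjacency-list layout are assumptions for illustration, with field names borrowed from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SignalReaction:
    """One binary relationship between two biomolecules."""
    from_molecule: str
    to_molecule: str
    effect: str
    tissue: str = ""
    interaction: str = ""
    reference: str = ""


def build_graph(reactions):
    """Adjacency list: molecule -> [(target molecule, effect), ...]."""
    graph = defaultdict(list)
    for r in reactions:
        graph[r.from_molecule].append((r.to_molecule, r.effect))
    return dict(graph)


reactions = [
    SignalReaction(
        from_molecule="EGF receptor",
        to_molecule="Grb2",
        effect="activation",
        tissue="liver",
        interaction="SH2+phosphorylated Tyr",
        reference="Yamauchi_1997",
    ),
]
graph = build_graph(reactions)
```

Each record contributes one directed edge, so a pathway is simply the set of edges reachable from a starting molecule.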

Example 3
• A Polymerization Reaction
Signal_Reaction: “Ah receptor + HSP90”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction “PAS domain” “of Ah receptor”
Activity “inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
Excerpted from [Takai 98]

FASTUS: based on finite-state automata (FSA)
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event.
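The last stage, merging structures, can be sketched as follows. The slides do not say how FASTUS decides that two templates describe the same entity or event, so the criterion used here (no filled slot may conflict) is an assumption, as are the function name merge_templates and the slot names.

```python
def merge_templates(a, b):
    """Merge two flat templates; return None if a shared slot conflicts.

    A slot whose value is None counts as unfilled and never conflicts.
    """
    merged = dict(a)
    for slot, value in b.items():
        if value is None:
            continue
        if merged.get(slot) is None:
            merged[slot] = value          # fill an empty or missing slot
        elif merged[slot] != value:
            return None                   # conflicting information: do not merge
    return merged


# Hypothetical basic templates built from two different sentences.
t1 = {"event": "JOINT-VENTURE", "partners": ("GM", "Toyota"), "product": None}
t2 = {"event": "JOINT-VENTURE", "product": "cars", "start": "1994"}
t3 = {"event": "MANAGEMENT-CHANGE", "person": "Smith"}

merged = merge_templates(t1, t2)
rejected = merge_templates(t1, t3)
```

Here t1 and t2 combine into one fuller template, while t1 and t3 stay separate because their event slots disagree.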

FASTUS: Is separation of stages possible?

FASTUS: Is separation of stages possible?
Open word classes: technical terms
• very long
• specific formation rules
• many semantic classes
• acronyms
• variants
• fairly ambiguous
[[Term recognition]]
Coordination across word formation: A or B and C D

Syntax/Semantics
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
Part-Whole

Full parser based on good grammar formalisms
1. Several attempts at using full parsers: to improve precision
2. Systematic treatment of the interaction of the different phases: unification-based grammar formalisms
The two papers in the NLP session of PSB 2001

Experiment (A. Yakushiji et al., PSB 2001)
XHPSG: HPSG-like grammar translated from XTAG of U-Penn (Y. Tateishi, TAG+ workshop 98)
Automatic conversion: detailed, empirical comparison of grammars of different formalisms (+LFG)
Terms (compound nouns) are chunked beforehand.
180 sentences from abstracts in MEDLINE
Average parse time per sentence: 2.7 sec with a naïve parser (the multi-stage parser can make this 50 times faster)

Argument Frame Extractor
133 argument structures, marked by a domain specialist in 97 of the 180 sentences
Extracted uniquely: 31
Extracted with ambiguity: 32
Extractable from pp’s: 26
Not extractable (parsing failures): 27
Not extractable (memory limitation, etc.): 17
68%

Ontology: Knowledge of the Domain
Open class words: named entity recognition (e.g. locations, persons, companies, organizations, position names)
More refined semantic classes with part-whole relationships, properties, etc.
Acronyms, variants, etc.

Bio Term Bank
• A database for all sorts of biological terms collected from genome databases and biological texts.
• It will contain 2 million terms in 2001 and 5 million terms by 2005.
• Terms are classified by biochemical and terminological attributes, grounded on their resources.
Biological ontology committee Japan, organized by T. Takagi and T. Takai, U. Tokyo, in Genome Projects of MESSC (2000.4 to 2005.3)

GENIA ontology (current version)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       |
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
                                        |       |         +-protein complex
                                        |       |         +-individual protein molecule
                                        |       |         +-subunit of protein complex
                                        |       |         +-substructure of protein
                                        |       |         +-domain or region of protein
                                        |       +-peptide
                                        |       +-amino acid monomer
                                        +-nucleic-+-DNA-+-DNA family or group
                                                  |     +-individual DNA molecule
                                                  |     +-domain or region of DNA
                                                  +-RNA-+-RNA family or group
                                                        +-individual RNA molecule
                                                        +-domain or region of RNA
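A class hierarchy of this kind is easy to hold as a nested dictionary, which makes "is this class a kind of substance?" a simple path lookup. The sketch below transcribes the tree from the slide; the names ONTOLOGY, path_to and is_a are assumptions for illustration and not part of any GENIA tool.

```python
ONTOLOGY = {
    "name": {
        "source": {
            "natural": {
                "organism": {
                    "multi-cell organism": {},
                    "mono-cell organism": {},
                    "virus": {},
                },
                "tissue": {},
                "cell type": {},
                "sub-location of cells": {},
            },
            "artificial": {"cell line": {}},
        },
        "substance": {"compound": {"organic": {
            "amino": {
                "protein": {
                    "protein family or group": {},
                    "protein complex": {},
                    "individual protein molecule": {},
                    "subunit of protein complex": {},
                    "substructure of protein": {},
                    "domain or region of protein": {},
                },
                "peptide": {},
                "amino acid monomer": {},
            },
            "nucleic": {
                "DNA": {
                    "DNA family or group": {},
                    "individual DNA molecule": {},
                    "domain or region of DNA": {},
                },
                "RNA": {
                    "RNA family or group": {},
                    "individual RNA molecule": {},
                    "domain or region of RNA": {},
                },
            },
        }}},
    }
}


def path_to(label, tree=ONTOLOGY, prefix=()):
    """Return the tuple of classes from the root down to label, or None."""
    for node, children in tree.items():
        here = prefix + (node,)
        if node == label:
            return here
        found = path_to(label, children, here)
        if found:
            return found
    return None


def is_a(label, ancestor):
    """True if ancestor lies on the path from the root to label."""
    path = path_to(label)
    return path is not None and ancestor in path


virus_path = path_to("virus")
```

A lookup such as is_a("protein complex", "substance") then tells an extraction pattern whether a tagged term belongs to the semantic class it expects.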

Expansion of GENIA Ontology
• Try to tag all NPs in some MEDLINE abstracts and find the classes that appear in the abstracts but not in the current ontology
• Find frequent verbs and what classes of arguments they take

Expansion of GENIA Ontology
• Chemical class of substances and their substructures
• Sources
• Biological role, or function, of substances
• Reaction
– Biological reaction
– Pathway
– Disease
• Structures themselves
• Experiment, experimental results, and researchers
• Measure

Examples of Entities in the Expanded Ontology
• Biological role, or function, of substances
– receptor, inhibitor, …
• Biological reaction
– activation, binding, inhibition, apoptosis, G2 arrest
– pathway, signal
– immune dysfunction, Ataxia telangiectasia (AT)
• Structures themselves
– alpha-helix
• Experiment, experimental results, researchers
– our results, these studies, we

Verbs Related to Biological Events
Frequent Verbs in 100 MEDLINE Abstracts

Verbs Related to Biological Events
Verbs that take biological entities as arguments
• induce
– noun BE INDUCED BY noun: "activation of these PROTEIN was induced by PROTEIN"
– noun INDUCE noun: "PROTEIN induced the tyrosine phosphorylation"
• bind
– noun BIND TO noun: "the drugs bind to two different PROTEIN"; "motifs previously found to bind the cellular factor"
– noun BINDING noun: "the TATA-box binding protein"
– the BINDING of noun: "the binding of PROTEIN"
Semantic classes: substance, structure, source, experiment, fact, reaction
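The two "induce" frames can be normalized to a single agent/theme structure with a pair of regular expressions. This is a rough sketch over text in which entity names have already been replaced by class labels such as PROTEIN; the pattern names, the function induce_frame and the agent/theme slot names are assumptions for illustration.

```python
import re

# noun BE INDUCED BY noun
PASSIVE = re.compile(r"(?P<theme>.+?) (?:is|was|are|were) induced by (?P<agent>.+)")
# noun INDUCE noun
ACTIVE = re.compile(r"(?P<agent>.+?) (?:induces|induced|induce) (?P<theme>.+)")


def induce_frame(clause):
    """Map either surface pattern to the same agent/theme frame."""
    # The passive pattern is tried first: a passive clause also contains
    # the word "induced" and would otherwise be misread as active.
    for pattern in (PASSIVE, ACTIVE):
        m = pattern.fullmatch(clause)
        if m:
            return {"verb": "induce", "agent": m["agent"], "theme": m["theme"]}
    return None


passive = induce_frame("activation of these PROTEIN was induced by PROTEIN")
active = induce_frame("PROTEIN induced the tyrosine phosphorylation")
```

Both calls yield a frame with the same slots, so later stages do not need to know which surface form appeared in the text.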

Verbs Related to Biological Events
Verbs that take description entities
• report
– noun REPORT that-clause: "we report here that PROTEIN is activated by PROTEIN"
– noun REPORT noun: "we report the characterization of …"; "we report a novel structure of PROTEIN"
Semantic classes: substance, structure, source, experiment, fact, reaction

Verbs Related to Biological Events
Verbs whose arguments depend on syntactic patterns
• show
– noun BE SHOWN to-infinitive: "PROTEIN has been shown to trigger cellular PROTEIN ac…"
– noun SHOW that-clause: "the data show that PROTEIN stimulation is also sufficient"
– noun SHOW noun: "SOURCE showed a dose-dependent inhibition … PROTEIN activity"
Semantic classes: substance, source, experiment, fact

Verbs Related to Biological Events
Verbs that take both entities
• indicate
– noun INDICATE that-clause: "the data indicate that PROTEIN is required in CELL proliferation"
– noun INDICATE noun: "these findings indicate an unexpected role of DNA"
– noun INDICATE that-clause: "the structure indicates that it represents a unique class of PROTEIN"
– noun INDICATE noun: "the structure indicates mechanisms for allosteric effector action"
Semantic classes: substance, structure, source, experiment, fact, reaction, role

Example of NE Annotation
UI - 85146267
TI - Characterization of aldosterone binding sites in circulating human mononuclear leukocytes.
AB - Aldosterone binding sites in human mononuclear leukocytes were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in RPMI-1640 medium, cells were incubated at 37 degrees C for 1 h with different concentrations of [3H]aldosterone plus a 100-fold concentration of RU-26988 (11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one), with or without an excess of unlabeled aldosterone. Aldosterone binds to a single class of receptors with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of desoxycorticosterone = corticosterone = aldosterone greater than hydrocortisone greater than dexamethasone. The results indicate that mononuclear leukocytes could be useful for studying the physiological significance of these mineralocorticoid receptors and their regulation in humans.

Available from our website:
• Definition of ontological classes
• Manual of GMPL: extension of XML to annotate texts
• Manual of Text Annotation
Soon: annotated texts (1000 abstracts) by the end of March

1. IE can contribute to Bio-informatics significantly.
2. However, the domains in Bio-chemistry seem more structurally rich than the domains we have dealt with so far: term formation, rich ontologies, complex syntactic structures.
3. It requires substantial efforts in resource building.
4. However, those resources can contribute to other applications: knowledge sharing, intelligent IR, knowledge discovery.
One of the crucial techniques is ATR ….

Overview of GENIA System
[Diagram: Retrieval Module, Corpus Module, Interface Module, Concept Module, Information Extraction Module and Database Module, together with MEDLINE, the corpus, the ontology (background knowledge) and the user]