
Understanding Text Meaning in Information Applications
Ido Dagan, Bar-Ilan University, Israel

Outline – a Vision
• Why do we need "Text Understanding"?
• Capture understanding by textual entailment – does one text entail another?
• Major challenge – knowledge acquisition
• Initial applications
• Looking 5 years ahead

Text Understanding
• Vision for improving information access
• Common search engines – still: text processing mostly matches query keywords
• Deeper understanding:
  – Consider the meanings of words and the relationships between them
• Relevant for applications – question answering, information extraction, semantic search, summarization


Towards text understanding: Question Answering


Information Extraction (IE)
• Identify information of a pre-determined structure – automatic filling of "forms" (see the sketch below)
• Example – extract product information:
  – Company: Hyundai, Suzuki
  – Product Type: Car, Motorcycle
  – Product Name: Accent, Elantra, R-350
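As a loose illustration of what such a filled "form" could look like in code; the field names and the record shown are illustrative assumptions, not taken from the slide:

    from dataclasses import dataclass

    @dataclass
    class ProductRecord:
        company: str        # e.g. "Hyundai"
        product_type: str   # e.g. "Car"
        product_name: str   # e.g. "Accent"

    # The extractor's task is to fill such records automatically from free text.
    records = [ProductRecord("Hyundai", "Car", "Accent")]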

Search may benefit from understanding
• Query: AIDS treatment
• Irrelevant document: "Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. In 1984, after that was discovered, manufacturers began heating factor VIII to kill the virus. The strategy greatly reduced the problem but was not foolproof. However, many experts believe that adding detergents and other refinements to the purification process has made natural factor VIII virtually free of AIDS." (AP890118-0146, TIPSTER Vol. 1)
• Many irrelevant documents mention AIDS and treatments for other diseases

Relevant Document
• Query: AIDS treatment
• "Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No. 1 killer of AIDS victims. The Food and Drug Administration approved the drug, aerosol pentamidine, on Thursday. The announcement came as the Centers for Disease Control issued greatly expanded treatment guidelines recommending wider use of the drug in people infected with the AIDS virus but who may show no symptoms." (AP890616-0048, TIPSTER Vol. 1)
• Relevant documents may mention specific types of treatments for AIDS


Why is it difficult?
• The mapping between language and meaning is many-to-many: variability (one meaning, many expressions) and ambiguity (one expression, many meanings)

Variability of Semantic Expression
• The Dow Jones Industrial Average closed up 255
• Dow ends up
• Dow gains 255 points
• Dow climbs 255
• Stock market hits a record high

How to capture "understanding"?
• Question: Who bought Overture?
• Expected answer form (hypothesis): X bought Overture
• Text: "Overture's acquisition by Yahoo …" entails the hypothesized answer "Yahoo bought Overture"
• Key task is recognizing that one text entails another (see the sketch below)
• IE – extract buying events: "Y's acquisition by X" entails "X buy Y"
• Search: find acquisitions by Yahoo
• Summarization (multi-document) – identify redundancy
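A minimal sketch of this entailment-based view of QA. The helper entails() is a hypothetical stand-in for a full RTE engine, not a real library call; the trivial substring check is only there to make the sketch runnable:

    def entails(text: str, hypothesis: str) -> bool:
        # Placeholder for real textual-entailment inference (parsing, rules, lexical resources).
        return hypothesis.lower() in text.lower()

    answer_form = "{X} bought Overture"         # expected answer form derived from the question
    hypothesis = answer_form.format(X="Yahoo")  # "Yahoo bought Overture"
    text = "Overture's acquisition by Yahoo was completed in 2003."

    # A real RTE engine, using e.g. the rule "Y's acquisition by X => X bought Y",
    # should accept this text-hypothesis pair; the trivial placeholder above will not.
    print(entails(text, hypothesis))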

Textual Entailment ≈ Human Reading Comprehension
• From a children's English learning book (Sela and Greenberg):
• Reference text: "…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …"
• Hypothesis (true/false?): The Bermuda Triangle is near the United States

PASCAL Recognizing Textual Entailment (RTE) Challenges
• FP-6 funded PASCAL NoE, 2004–2007
• Bar-Ilan University, MITRE, ITC-irst and CELCT (Trento), Microsoft Research

Some Examples
1. Text: Regan attended a ceremony in Washington to commemorate the landings in Normandy.
   Hypothesis: Washington is located in Normandy. (Task: IE; Entailment: False)
2. Text: Google files for its long awaited IPO.
   Hypothesis: Google goes public. (Task: IR; Entailment: True)
3. Text: … a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   Hypothesis: Cardinal Juan Jesus Posadas Ocampo died in 1993. (Task: QA; Entailment: True)
4. Text: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   Hypothesis: The SPD is defeated by the opposition parties. (Task: IE; Entailment: True)

Participation and Impact
• Very successful challenges, worldwide:
  – RTE-1 – 17 groups
  – RTE-2 – 23 groups (~150 downloads!)
  – RTE-3 – 25 groups
  – RTE-4 (2008) – moved to NIST (the TREC organizers)
• High interest in the research community
  – Papers, conference keywords, sessions and areas, Ph.D. theses, influence on funded projects
  – Special issue of the Journal of Natural Language Engineering

Results
First Author (Group) – Accuracy – Average Precision
• Hickl (LCC) – 75.4% – 80.8%
• Tatu (LCC) – 73.8% – 71.3%
• Zanzotto (Milan & Rome) – 63.9% – 64.4%
• Adams (Dallas) – 62.6% – 62.8%
• Bos (Rome & Leeds) – 61.6% – 66.9%
• 11 groups – 58.1%–60.5%
• 7 groups – 52.9%–55.6%
Average: 60%; Median: 59%

What is the main obstacle?
• System reports point at a lack of knowledge: rules, paraphrases, lexical relations, etc.
• It seems that systems that coped better with these issues performed best

Research Directions at Bar-Ilan: Knowledge Acquisition, Inference, Applications
• Oren Glickman, Idan Szpektor, Roy Bar-Haim, Maayan Geffet, Moshe Koppel – Bar-Ilan University
• Shachar Mirkin – Hebrew University, Israel
• Hristo Tanev, Bernardo Magnini, Alberto Lavelli, Lorenza Romano – ITC-irst, Italy
• Bonaventura Coppola, Milen Kouylekov – University of Trento and ITC-irst, Italy

Distributional Word Similarity
• "Similar words appear in similar contexts" (Harris, 1968)
• Hypothesis: similar word meanings → similar contexts
• Distributional similarity model: similar context features → infer similar word meanings

Measuring Context Similarity
• Each word is represented by a vector of its syntactic context features, e.g. for "country" and "state": industry (genitive), neighboring (modifier), visit (object), population (genitive), governor (modifier), parliament (genitive), president (genitive), …
• Word similarity is measured by comparing these context-feature vectors (see the sketch below)
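A minimal sketch of this idea, assuming made-up feature counts and using cosine as a stand-in for whatever similarity measure (e.g. Lin's) the actual model uses:

    import math

    def similarity(u: dict, v: dict) -> float:
        # Cosine between two sparse context-feature vectors.
        dot = sum(u[f] * v[f] for f in u if f in v)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Invented counts of the syntactic contexts in which each word was observed.
    country = {"industry_gen": 12, "neighboring_mod": 8, "visit_obj": 5, "population_gen": 7}
    state = {"industry_gen": 9, "neighboring_mod": 6, "visit_obj": 4, "governor_mod": 11}

    print(round(similarity(country, state), 3))  # higher score = more similar contexts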

Incorporate Indicative Patterns

Acquisition Example
• Top-ranked entailments for "company": firm, bank, group, subsidiary, unit, business, supplier, carrier, agency, airline, division, giant, entity, financial institution, manufacturer, corporation, commercial bank, joint venture, maker, producer, factory …

Learning Entailment Rules
• Q: What reduces the risk of heart attacks?
• Hypothesis: Aspirin reduces the risk of heart attacks
• Text: Aspirin prevents heart attacks
• Entailment rule (template): X prevent Y ⇨ X reduce risk of Y (applied in the sketch below)
• Need a large knowledge base of entailment rules
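A rough sketch of how such a rule can be applied; real systems match syntactic (dependency-tree) templates rather than surface regexes, so this is illustration only:

    import re

    # Left-hand template as a simplified surface pattern, right-hand template as a string.
    RULE = (r"(?P<X>\w+) prevents? (?P<Y>[\w ]+)", "{X} reduces the risk of {Y}")

    def apply_rule(text: str, rule=RULE):
        pattern, rhs = rule
        m = re.search(pattern, text)
        return rhs.format(**m.groupdict()) if m else None

    print(apply_rule("Aspirin prevents heart attacks"))
    # -> "Aspirin reduces the risk of heart attacks", matching the hypothesis above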

TEASE – Algorithm Flow (a toy version is sketched below)
• Resources: lexicon, the Web
• Input template: X subj-accuse-obj Y
• Sample corpus for the input template: "Paula Jones accused Clinton…", "Sanhedrin accused St. Paul…", …
• Anchor Set Extraction (ASE) – anchor sets: {Paula Jones subj; Clinton obj}, {Sanhedrin subj; St. Paul obj}, …
• Sample corpus for the anchor sets: "Paula Jones called Clinton indictable…", "St. Paul defended before the Sanhedrin…", …
• Template Extraction (TE) – templates: X call Y indictable, Y defend before X, …
• Iterate
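A toy, self-contained sketch of this flow over an in-memory "corpus". Real TEASE works at web scale with syntactic templates and iterates over the newly learned templates, so everything here (the corpus, the surface-string matching) is a simplification:

    import re

    CORPUS = [
        "Paula Jones accused Clinton",
        "Sanhedrin accused St. Paul",
        "Paula Jones called Clinton indictable",
        "St. Paul defended himself before the Sanhedrin",
    ]

    def match_template(template: str, sentence: str):
        # Match a surface template such as "X accused Y" and return its (X, Y) anchors.
        pattern = template.replace("X", "(?P<X>.+)").replace("Y", "(?P<Y>.+)")
        m = re.fullmatch(pattern, sentence)
        return (m.group("X"), m.group("Y")) if m else None

    def tease(input_template: str, corpus=CORPUS):
        # ASE: collect anchor sets that instantiate the input template.
        anchor_sets = {a for s in corpus if (a := match_template(input_template, s))}
        # TE: other sentences containing both anchors become new templates.
        templates = set()
        for x, y in anchor_sets:
            for s in corpus:
                if x in s and y in s and match_template(input_template, s) is None:
                    templates.add(s.replace(x, "X").replace(y, "Y"))
        return templates

    print(tease("X accused Y"))
    # e.g. {'X called Y indictable', 'Y defended himself before the X'}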

Sample of Extracted Anchor-Sets for "X prevent Y"
• X='sunscreens', Y='sunburn'; X='sunscreens', Y='skin cancer'; X='vitamin e', Y='heart disease'; X='aspirin', Y='heart attack'; X='vaccine candidate', Y='infection'; X='universal precautions', Y='HIV'; X='safety device', Y='fatal injuries'; X='hepa filtration', Y='contaminants'; X='low cloud cover', Y='measurements'; X='gene therapy', Y='blindness'; X='cooperation', Y='terrorism'; X='safety valve', Y='leakage'; X='safe sex', Y='cervical cancer'; X='safety belts', Y='fatalities'; X='security fencing', Y='intruders'; X='soy protein', Y='bone loss'; X='MWI', Y='pollution'; X='vitamin C', Y='colds'

Sample of Extracted Templates for "X prevent Y"
• X reduce Y; X protect against Y; X eliminate Y; X stop Y; X avoid Y; X for prevention of Y; X provide protection against Y; X combat Y; X ward Y; X lower risk of Y; X be barrier against Y; X fight Y; X reduce Y risk; X decrease the risk of Y; relationship between X and Y; X guard against Y; X be cure for Y; X treat Y; X in war on Y; X in the struggle against Y; X a day keeps Y away; X eliminate the possibility of Y; X cut risk Y; X inhibit Y

Experiment and Evaluation
• 48 randomly chosen input verbs
• 1392 templates extracted; judged by human annotators (see the sketch of the metrics below)
• Encouraging results: average yield of 29 correct templates per verb; average precision of 45.3% per verb
• Future work: improve precision
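For clarity, per-verb yield and precision can be computed from the judgments roughly as follows; this is a sketch with invented toy data, not the actual evaluation code:

    # Invented toy judgments: template -> whether human annotators marked it correct.
    judgments = {
        "prevent": {"X reduce Y": True, "X protect against Y": True, "X resemble Y": False},
        "interact": {"X bind to Y": True, "X be next to Y": False},
    }

    def evaluate(judgments):
        yields, precisions = [], []
        for verb, templates in judgments.items():
            correct = sum(templates.values())
            yields.append(correct)                       # yield: correct templates for this verb
            precisions.append(correct / len(templates))  # precision: correct / extracted
        return sum(yields) / len(yields), sum(precisions) / len(precisions)

    avg_yield, avg_precision = evaluate(judgments)
    print(f"average yield = {avg_yield:.1f}, average precision = {avg_precision:.1%}")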

Syntactic Variability Phenomena (template: X activate Y)
• Passive form – Y is activated by X (handled in the sketch below)
• Apposition – X activates its companion, Y
• Conjunction – X activates Z and Y
• Set – X activates two proteins: Y and Z
• Relative clause – X, which activates Y
• Coordination – X binds and activates Y
• Transparent head – X activates a fragment of Y
• Co-reference – X is a kinase, though it activates Y
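As a toy illustration of handling one of these phenomena (the passive form) before template matching; real systems do this over dependency parses rather than surface regexes, and the protein names are invented:

    import re

    PASSIVE = re.compile(r"(?P<Y>[\w-]+) is activated by (?P<X>[\w-]+)")

    def canonicalize(sentence: str) -> str:
        # Rewrite a passive sentence back into the canonical "X activates Y" form.
        m = PASSIVE.search(sentence)
        return f"{m.group('X')} activates {m.group('Y')}" if m else sentence

    print(canonicalize("ProteinB is activated by ProteinA"))  # -> "ProteinA activates ProteinB"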

Takeaway
• Promising potential for creating huge entailment knowledge bases – millions of rules
• Speculation: is it possible to have a public effort for knowledge acquisition?
  – Human Genome Project analogy
  – Community effort

Initial Applications: Relation Extraction, Semantic Search

Dataset
• Recognizing interactions between annotated protein pairs (Bunescu 2005)
  – 200 Medline abstracts
  – Gold-standard dataset of protein pairs
• Input template: X interact with Y

Manual Analysis – Results
• 93% of interacting protein pairs can be identified with lexical-syntactic templates
• Number of templates needed vs. recall (within that 93%):
  – 10% recall: 2 templates; 20%: 4; 30%: 6; 40%: 11; 50%: 21
  – 60%: 39; 70%: 73; 80%: 107; 90%: 141; 100%: 175

TEASE Output for "X interact with Y" – a sample of correct templates learned:
• X bind to Y; X activate Y; X stimulate Y; X couple to Y; X binding to Y; X Y interaction; X attach to Y; X interaction with Y; interaction between X and Y; X become trapped in Y; X Y complex; X recognize Y; X block Y; X trap Y; X recruit Y; X associate with Y; X be linked to Y; X target Y

TEASE Algorithm – Potential Recall on Training Set
• Input template only: 39%; + iterative: 49%; + morph: 63%
• Iterative – taking the top 5 ranked templates as input
• Morph – recognizing morphological derivations


Integrating IE and Search (with IBM Research Haifa)

Optimistic Conclusions
• Good prospects for better levels of text understanding – enabling more sophisticated information access
• Textual entailment is an appealing framework
  – Boosts research on text understanding
  – Potential for vast knowledge acquisition

Thank you!