Textual Entailment. ACL-2007 Tutorial. Ido Dagan (Bar-Ilan University, Israel), Dan Roth (University of Illinois, Urbana-Champaign, USA), Fabio Massimo Zanzotto (University of Rome, Italy). Page 1
Outline: 1. Motivation and Task Definition 2. A Skeletal Review of Textual Entailment Systems 3. Knowledge Acquisition Methods 4. Applications of Textual Entailment 5. A Textual Entailment View of Applied Semantics. Page 2
I. Motivation and Task Definition Page 3
Motivation • Text applications require semantic inference • A common framework for applied semantics is needed, but still missing • Textual entailment may provide such a framework. Page 4
Desiderata for a Modeling Framework. A framework for a target level of language processing should provide: 1) a generic (feasible) module for applications; 2) a unified (agreeable) paradigm for investigating language phenomena. Most semantics research is scattered: WSD, NER, SRL, lexical semantic relations… (e.g. vs. syntax); the dominating approach is interpretation. Page 5
Natural Language and Meaning Variability Ambiguity Language Page 6
Variability of Semantic Expression. Example variants: "The Dow Jones Industrial Average closed up 255", "Dow ends up", "Dow gains 255 points", "Dow climbs 255", "Stock market hits a record high". Model variability as relations between text expressions: • Equivalence: text1 ⇔ text2 (paraphrasing) • Entailment: text1 ⇒ text2 (the general case). Page 7
Typical Application Inference: Entailment. Question: Who bought Overture? Expected answer form (hypothesized answer): X bought Overture. Text "Overture's acquisition by Yahoo" entails the answer "Yahoo bought Overture". • Similar for IE: X acquire Y • Similar for "semantic" IR: t: Overture was bought for … • Summarization (multi-document): identify redundant info • MT evaluation (and recent ideas for MT) • Educational applications. Page 8
KRAQ'05 Workshop - KNOWLEDGE and REASONING for ANSWERING QUESTIONS (IJCAI-05) CFP: n Reasoning aspects: * information fusion, * search criteria expansion models * summarization and intensional answers, * reasoning under uncertainty or with incomplete knowledge, n Knowledge representation and integration: * levels of knowledge involved (e. g. ontologies, domain knowledge), * knowledge extraction models and techniques to optimize response accuracy … but similar needs for other applications – can entailment provide a common empirical framework? Page 9
Classical Entailment Definition n Chierchia & McConnell-Ginet (2001): A text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true n Strict entailment: doesn't account for some uncertainty allowed in applications Page 10
“Almost certain” Entailments t: The technological triumph known as GPS … was incubated in the mind of Ivan Getting. h: Ivan Getting invented the GPS. Page 11
Applied Textual Entailment n A directional relation between two text fragments, Text (t) and Hypothesis (h): t entails h (t ⇒ h) if humans reading t will infer that h is most likely true n Operational (applied) definition: human gold standard, as in NLP applications; assuming common background knowledge, which is indeed expected from applications Page 12
Probabilistic Interpretation Definition: n t probabilistically entails h if: n n P(h is true | t) > P(h is true) n t increases the likelihood of h being true n ≡ Positive PMI – t provides information on h’s truth P(h is true | t ): entailment confidence n n The relevant entailment score for applications In practice: “most likely” entailment expected Page 13
The Role of Knowledge n For textual entailment to hold we require: n text AND knowledge h but n n knowledge should not entail h alone Systems are not supposed to validate h’s truth regardless of t (e. g. by searching h on the web) Page 14
PASCAL Recognizing Textual Entailment (RTE) Challenges EU FP-6 Funded PASCAL Network of Excellence 2004 -7 Bar-Ilan University MITRE ITC-irst and CELCT, Trento Microsoft Research Page 15
Generic Dataset by Application Use n 7 application settings in RTE-1, 4 in RTE-2/3 n n n n n QA IE “Semantic” IR Comparable documents / multi-doc summarization MT evaluation Reading comprehension Paraphrase acquisition Most data created from actual applications output RTE-2/3: 800 examples in development and test sets 50 -50% YES/NO split Page 16
RTE Examples
1. TEXT: Regan attended a ceremony in Washington to commemorate the landings in Normandy. HYPOTHESIS: Washington is located in Normandy. TASK: IE. ENTAILMENT: False.
2. TEXT: Google files for its long awaited IPO. HYPOTHESIS: Google goes public. TASK: IR. ENTAILMENT: True.
3. TEXT: …a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others. HYPOTHESIS: Cardinal Juan Jesus Posadas Ocampo died in 1993. TASK: QA. ENTAILMENT: True.
4. TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%. HYPOTHESIS: The SPD is defeated by the opposition parties. TASK: IE. ENTAILMENT: True.
Page 17
Participation and Impact n Very successful challenges, worldwide: RTE-1 – 17 groups; RTE-2 – 23 groups; RTE-3 – 25 groups (~150 downloads); joint workshop at ACL-07 n High interest in the research community: papers, conference sessions and areas, PhDs, influence on funded projects; Textual Entailment special issue at JNLE; ACL-07 tutorial Page 18
Methods and Approaches (RTE-2) n Measure similarity match between t and h (coverage of h by t): n n n n n Lexical overlap (unigram, N-gram, subsequence) Lexical substitution (Word. Net, statistical) Syntactic matching/transformations Lexical-syntactic variations (“paraphrases”) Semantic role labeling and matching Global similarity parameters (e. g. negation, modality) Cross-pair similarity Detect mismatch (for non-entailment) Interpretation to logic representation + logic inference Page 19
Dominant approach: Supervised Learning Classifier t, h Similarity Features: Lexical, n-gram, syntactic semantic, global Feature vector n n n Features model similarity and mismatch Classifier determines relative weights of information sources Train on development set and auxiliary t-h corpora Page 20 YES NO
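A minimal sketch of this dominant architecture, using scikit-learn and two toy similarity features (word overlap coverage of H by T and a length ratio); the features, the tiny training pairs and the choice of logistic regression are illustrative assumptions, not any particular RTE-2 system.

```python
from sklearn.linear_model import LogisticRegression

def features(text, hyp):
    t, h = set(text.lower().split()), set(hyp.lower().split())
    overlap = len(t & h) / len(h) if h else 0.0   # coverage of h by t
    len_ratio = len(h) / len(t) if t else 0.0
    return [overlap, len_ratio]

# Toy development data: (text, hypothesis, entailment label)
train_pairs = [("Google files for its long awaited IPO", "Google goes public", 1),
               ("Regan attended a ceremony in Washington", "Washington is located in Normandy", 0)]
X = [features(t, h) for t, h, _ in train_pairs]
y = [label for _, _, label in train_pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("Yahoo took over search company Overture",
                            "Yahoo acquired Overture")]))
```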
RTE-2 Results (First Author (Group): Accuracy / Average Precision)
Hickl (LCC): 75.4% / 80.8%
Tatu (LCC): 73.8% / 71.3%
Zanzotto (Milan & Rome): 63.9% / 64.4%
Adams (Dallas): 62.6% / 62.8%
Bos (Rome & Leeds): 61.6% / 66.9%
11 groups: 58.1%-60.5%
7 groups: 52.9%-55.6%
Average: 60%, Median: 59%
Page 21
Analysis. For the first time, methods that carry some deeper analysis seemed (?) to outperform shallow lexical methods (cf. Kevin Knight's invited talk at EACL-06, titled: Isn't Linguistic Structure Important, Asked the Engineer). Still, most systems that do utilize deep analysis did not score significantly better than the lexical baseline. Page 22
Why? n System reports point at: n n n Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc. ) Lack of training data It seems that systems that coped better with these issues performed best: n n Hickl et al. - acquisition of large entailment corpora for training Tatu et al. – large knowledge bases (linguistic and world knowledge) Page 23
Some suggested research directions n Knowledge acquisition n n Unsupervised acquisition of linguistic and world knowledge from general corpora and web Acquiring larger entailment corpora Manual resources and knowledge engineering Inference n Principled framework for inference and fusion of information levels n Are we happy with bags of features? Page 24
Complementary Evaluation Modes n “Seek” mode: n n Entailment subtasks evaluations n n Input: h and corpus Output: all entailing t ’s in corpus Captures information seeking needs, but requires post-run annotation (TREC-style) Lexical, lexical-syntactic, logical, alignment… Contribution to various applications n QA – Harabagiu & Hickl, ACL-06; RE – Romano et al. , EACL-06 Page 25
II. A Skeletal review of Textual Entailment Systems Page 26
Textual Entailment Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year Entails Subsumed by Yahoo acquired Overture is a search company Google owns Overture ………. Phrasal verb paraphrasing Entity matching Alignment How? Semantic Role Labeling Integration Page 27
A General Strategy for Textual Entailment (flow diagram): given a sentence T and a sentence H, re-represent each at the lexical, syntactic and semantic levels. A knowledge base of semantic, structural and pragmatic transformations/rules feeds the process. Decision: find the set of transformations/features of the new representation (or: use these to create a cost function) that allows embedding of H in T. Page 28
Details of The Entailment Strategy n Preprocessing n n n n n Multiple levels of lexical preprocessing Syntactic Parsing Shallow semantic parsing Annotating semantic phenomena Representation Bag of words, n-grams through tree/graphs based representation Logical representations Knowledge Sources n n n Control Strategy & Decision Making n n n Single pass/iterative processing Strict vs. Parameter based Justification n Page 29 Syntactic mapping rules Lexical resources Semantic Phenomena specific modules RTE specific knowledge sources Additional Corpora/Web resources What can be said about the decision?
The Case of Shallow Lexical Approaches n Preprocessing n n Identify Stop Words Representation n Bag of words Knowledge Sources n Control Strategy & Decision Making n n n Shallow Lexical resources – typically Wordnet Single pass Compute Similarity; use threshold tuned on a development set (could be per task) Justification n Page 30 It works
Shallow Lexical Approaches (Example)
• Lexical/word-based semantic overlap: score based on matching each word in H with some word in T
  - Word similarity measure: may use WordNet
  - May take account of subsequences, word order
  - 'Learn' threshold on maximum word-based match score
(Side note: clearly, this may not appeal to what we think of as understanding, and it is easy to generate cases for which this does not work well. However, it works (surprisingly) well with respect to current evaluation metrics (data sets?).)
Text: The Cassini spacecraft arrived at Titan in July, 2006.
Text: NASA's Cassini-Huygens spacecraft traveled to Saturn in 2006.
Text: The Cassini spacecraft has taken images that show rivers on Saturn's moon Titan.
Hyp: The Cassini spacecraft has reached Titan.
Page 31
An Algorithm: LocalLexicalMatching
• For each word in Hypothesis, Text: if the word matches a stopword, remove it
• If no words are left in Hypothesis or Text, return 0
• numberMatched = 0
• For each word W_H in Hypothesis, for each word W_T in Text: HYP_LEMMAS = Lemmatize(W_H); TEXT_LEMMAS = Lemmatize(W_T); if any term in HYP_LEMMAS matches any term in TEXT_LEMMAS, using WordNet's LexicalCompare(), then numberMatched++
• Return: numberMatched / |HYP_LEMMAS|
Page 32
An Algorithm: LocalLexicalMatching (Cont.)
• LexicalCompare(): return TRUE if any of the following holds: LEMMA_H == LEMMA_T; SynonymOf(textWord, hypothesisWord); MemberOfDistanceFromTo(textWord, hypothesisWord) <= 3; MeronymyDistanceFromTo(textWord, hypothesisWord) <= 3; HypernymDistanceFromTo(textWord, hypothesisWord) <= 3
• LLM Performance: RTE-2: Dev 63.00, Test 60.50; RTE-3: Dev 67.50, Test 65.63
• Notes: LexicalCompare is asymmetric and makes use of a single relation type; additional differences could be attributed to the stop word list (e.g., including aux verbs); straightforward improvements such as bi-grams do not help; more sophisticated lexical knowledge (entities; time) should help.
Page 33
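A runnable sketch of an LLM-style matcher in Python, using NLTK's WordNet interface; it covers only the lemma, synonym and hypernym-distance tests (the member-of/meronymy tests and the exact stop-word list of the original system are omitted, and the stop-word set here is an assumption), so treat it as an approximation rather than the authors' implementation.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

STOPWORDS = {"the", "a", "an", "of", "to", "in", "has", "have", "is", "was"}

def lexical_compare(w_t, w_h, max_hyper_dist=3):
    """TRUE if the hypothesis word matches the text word by lemma, by synonymy,
    or by being a hypernym within max_hyper_dist steps (asymmetric, as in LLM)."""
    if w_t == w_h:
        return True
    syns_t, syns_h = wn.synsets(w_t), wn.synsets(w_h)
    if any(s_t == s_h for s_t in syns_t for s_h in syns_h):
        return True
    for s_t in syns_t:
        frontier = {s_t}
        for _ in range(max_hyper_dist):
            frontier = {h for s in frontier for h in s.hypernyms()}
            if any(s_h in frontier for s_h in syns_h):
                return True
    return False

def llm_score(text, hyp):
    t_words = [w for w in text.lower().split() if w not in STOPWORDS]
    h_words = [w for w in hyp.lower().split() if w not in STOPWORDS]
    if not t_words or not h_words:
        return 0.0
    matched = sum(1 for w_h in h_words
                  if any(lexical_compare(w_t, w_h) for w_t in t_words))
    return matched / len(h_words)

print(llm_score("The Cassini spacecraft arrived at Titan in July 2006",
                "The Cassini spacecraft has reached Titan"))
```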
Details of The Entailment Strategy (Again) n Preprocessing n n n n n Multiple levels of lexical preprocessing Syntactic Parsing Shallow semantic parsing Annotating semantic phenomena Representation Bag of words, n-grams through tree/graphs based representation Logical representations Knowledge Sources n n n Control Strategy & Decision Making n n n Single pass/iterative processing Strict vs. Parameter based Justification n Page 34 Syntactic mapping rules Lexical resources Semantic Phenomena specific modules RTE specific knowledge sources Additional Corpora/Web resources What can be said about the decision?
Preprocessing n Syntactic Processing: n n n Only a few systems Lexical Processing n n n n Syntactic Parsing (Collins; Charniak; CCG) Dependency Parsing (+types) Tokenization; lemmatization For each word in Hypothesis, Text Phrasal verbs Idiom processing Named Entities + Normalization Date/Time arguments + Normalization Semantic Processing n n Semantic Role Labeling Nominalization Modality/Polarity/Factive Co-reference } Page 35 } often used only during decision making
Details of The Entailment Strategy (Again) n Preprocessing n n n n n Multiple levels of lexical preprocessing Syntactic Parsing Shallow semantic parsing Annotating semantic phenomena Representation Bag of words, n-grams through tree/graphs based representation Logical representations Knowledge Sources n n n Control Strategy & Decision Making n n n Single pass/iterative processing Strict vs. Parameter based Justification n Page 36 Syntactic mapping rules Lexical resources Semantic Phenomena specific modules RTE specific knowledge sources Additional Corpora/Web resources What can be said about the decision?
Basic Representations Meaning Representation Inference Logical Forms Semantic Representation Syntactic Parse Local Lexical Raw Textual Entailment n Most approaches augment the basic structure defined by the processing level with additional annotation and make use of a tree/graph/frame-based system. Page 37
Basic Representations (Syntax) Syntactic Parse Hyp: The Cassini spacecraft has reached Titan. Page 38 Local Lexical
Basic Representations (Shallow Semantics: Pred-Arg ) T: The government purchase of the Roanoke building, a former prison, took place in 1902. take PRED The govt. purchase… prison ARG_0 place ARG_1 purchase PRED in 1902 ARG_2 The Roanoke building ARG_1 H: The Roanoke building, which was a former prison, was bought by the government in 1902. buy The government ARG_0 The Roanoke building PRED The Roanoke … prison In 1902 ARG_1 AM_TMP be PRED a former prison ARG_2 ARG_1 Page 39 Roth&Sammons’ 07
Basic Representations (Logical Representation) [Bos & Markert] The semantic representation language is a first-order fragment of the language used in Discourse Representation Theory (DRS), conveying argument structure with a neo-Davidsonian analysis and including the recursive DRS structure to cover negation, disjunction, and implication. Page 40
Representing Knowledge Sources n Rather straight forward in the Logical Framework: n Page 41 Tree/Graph base representation may also use rule based transformations to encode different kinds of knowledge, sometimes represented as generic or knowledge based tree transformations.
Representing Knowledge Sources (cont. ) n In general, there is a mix of procedural and rule based encodings of knowledge sources n n Done by hanging more information on parse tree or predicate argument representation [Example from LCC’s system] Or different frame-based annotation systems for encoding information, that are processed procedurally. Page 42
Details of The Entailment Strategy (Again) n Preprocessing n n n n n Multiple levels of lexical preprocessing Syntactic Parsing Shallow semantic parsing Annotating semantic phenomena Representation Bag of words, n-grams through tree/graphs based representation Logical representations Knowledge Sources n n n Control Strategy & Decision Making n n n Single pass/iterative processing Strict vs. Parameter based Justification n Page 43 Syntactic mapping rules Lexical resources Semantic Phenomena specific modules RTE specific knowledge sources Additional Corpora/Web resources What can be said about the decision?
Knowledge Sources n n n The knowledge sources available to the system are the most significant component of supporting TE. Different systems draw differently the line between preprocessing capabilities and knowledge resources. The way resources are handled is also different across different approaches. Page 44
Enriching Preprocessing n In addition to syntactic parsing several approaches enrich the representation with various linguistics resources n n n n n Pos tagging Stemming Predicate argument representation: verb predicates and nominalization Entity Annotation: Stand alone NERs with a variable number of classes Acronym handling and Entity Normalization: mapping mentions of the same entity mentioned in different ways to a single ID. Co-reference resolution Dates, times and numeric values; identification and normalization. Identification of semantic relations: complex nominals, genitives, adjectival phrases, and adjectival clauses. Event identification and frame construction. Page 45
Lexical Resources n Recognizing that a word or a phrase in S entails a word or a phrase in H is essential in determining Textual Entailment. n WordNet is the most commonly used resource: in most cases, a WordNet based similarity measure between words is used (typically a symmetric relation); lexical chains over WordNet are used, and in some cases care is taken to disallow some chains of specific relations. n Extended WordNet is being used, e.g. to make use of the derivation relation which links verbs with their corresponding nominalized nouns. Page 46
Lexical Resources (Cont.) n Lexical paraphrasing rules: a number of efforts to acquire relational paraphrase rules are under way, and several systems are making use of resources such as DIRT and TEASE. n Some systems seem to have acquired paraphrase rules that are in the RTE corpus: person killed --> claimed one life; hand reins over to --> give starting job to; same-sex marriage --> gay nuptials; cast ballots in the election --> vote; dominant firm --> monopoly power; death toll --> kill; try to kill --> attack; lost their lives --> were killed; left people dead --> people were killed. Page 47
Semantic Phenomena n A large number of semantic phenomena have been identified as significant to Textual Entailment; a large number of them are being handled (in a restricted way) by some of the systems; very little quantification per phenomenon has been done, if at all. n Semantic implications of interpreting syntactic structures [Braz et al. '05; Bar-Haim et al. '07]: n Conjunctions: Jake and Jill ran up the hill ⇒ Jake ran up the hill; Jake and Jill met on the hill ⇏ *Jake met on the hill n Clausal modifiers: But celebrations were muted as many Iranians observed a Shi'ite mourning month. ⇒ Many Iranians observed a Shi'ite mourning month. Semantic Role Labeling handles this phenomenon automatically. Page 48
Semantic Phenomena (Cont.) n Relative clauses: The assailants fired six bullets at the car, which carried Vladimir Skobtsov. ⇒ The car carried Vladimir Skobtsov. Semantic Role Labeling handles this phenomenon automatically. n Appositives: Frank Robinson, a one-time manager of the Indians, has the distinction for the NL. ⇒ Frank Robinson is a one-time manager of the Indians. n Passive: We have been approached by the investment banker. ⇒ The investment banker approached us. Semantic Role Labeling handles this phenomenon automatically. n Genitive modifier: Malaysia's crude palm oil output is estimated to have risen. ⇒ The crude palm oil output of Malaysia is estimated to have risen. Page 49
Logical Structure n Factivity: uncovering the context in which a verb phrase is embedded, e.g. The terrorists tried to enter the building. ⇏ The terrorists entered the building. n Polarity: negative markers or a negation-denoting verb (e.g. deny, refuse, fail), e.g. The terrorists failed to enter the building. ⇏ The terrorists entered the building. n Modality/Negation: dealing with modal auxiliary verbs (can, must, should), which modify verbs' meanings, and with the identification of the scope of negation. n Superlatives/Comparatives/Monotonicity: inflecting adjectives or adverbs. n Quantifiers, determiners and articles. Page 50
Some Examples n n n n [Braz et. al. IJCAI workshop’ 05; PARC Corpus] T: Legally, John could drive. H: John drove. S: Bush said that Khan sold centrifuges to North Korea. H: Centrifuges were sold to North Korea. S: No US congressman visited Iraq until the war. H: Some US congressmen visited Iraq before the war. S: The room was full of women. H: The room was full of intelligent women. S: The New York Times reported that Hanssen sold FBI secrets to the Russians and could face the death penalty. H: Hanssen sold FBI secrets to the Russians. S: All soldiers were killed in the ambush. H: Many soldiers were killed in the ambush. Page 51
Details of The Entailment Strategy (Again) n Preprocessing n n n n n Multiple levels of lexical preprocessing Syntactic Parsing Shallow semantic parsing Annotating semantic phenomena Representation Bag of words, n-grams through tree/graphs based representation Logical representations Knowledge Sources n n n Control Strategy & Decision Making n n n Single pass/iterative processing Strict vs. Parameter based Justification n Page 52 Syntactic mapping rules Lexical resources Semantic Phenomena specific modules RTE specific knowledge sources Additional Corpora/Web resources What can be said about the decision?
Control Strategy and Decision Making n Single iteration: strict logical approaches are, in principle, a single stage computation; the pair is processed and transformed into logical form, and existing theorem provers act on the pair along with the KB. n Multiple iterations: graph based algorithms are typically iterative; following [Punyakanok et al. '04], transformations are applied and an entailment test is done after each transformation. Transformations can be chained, but sometimes the order makes a difference. The algorithm can be greedy or more exhaustive, searching for the best path found [Braz et al. '05; Bar-Haim et al. '07]. Page 53
Transformation Walkthrough [Braz et. al’ 05] T: The government purchase of the Roanoke building, a former prison, took place in 1902. H: The Roanoke building, which was a former prison, was bought by the government in 1902. Does ‘H’ follow from ‘T’? Page 54
Transformation Walkthrough (1) T: The government purchase of the Roanoke building, a former prison, took place in 1902. take PRED The govt. purchase… prison ARG_0 place ARG_1 purchase PRED in 1902 ARG_2 The Roanoke building ARG_1 H: The Roanoke building, which was a former prison, was bought by the government in 1902. buy The government ARG_0 The Roanoke building PRED The Roanoke … prison In 1902 ARG_1 AM_TMP be PRED a former prison ARG_2 ARG_1 Page 55
Transformation Walkthrough (2) T: The government purchase of the Roanoke building, a former prison, took place in 1902. Phrasal Verb Rewriter The government purchase of the Roanoke building, a former prison, occurred in 1902. occur The govt. purchase… prison PRED ARG_0 in 1902 ARG_2 H: The Roanoke building, which was a former prison, was bought by the government. Page 56
Transformation Walkthrough (3) T: The government purchase of the Roanoke building, a former prison, occurred in 1902. Nominalization Promoter: The government purchase the Roanoke building in 1902. (NOTE: depends on the earlier transformation; order is important!) Predicate-argument structure: purchase (PRED), The government (ARG_0), the Roanoke building, a former prison (ARG_1), In 1902 (AM_TMP). H: The Roanoke building, which was a former prison, was bought by the government in 1902. Page 57
Transformation Walkthrough (4) T: The government purchase of the Roanoke building, a former prison, occurred in 1902. Apposition Rewriter The Roanoke building be a former prison. be The Roanoke building PRED ARG_1 a former prison ARG_2 H: The Roanoke building, which was a former prison, was bought by the government in 1902. Page 58
Transformation Walkthrough (5) T: The government purchase of the Roanoke building, a former prison, took place in 1902. purchase PRED ARG_0 The Roanoke … prison In 1902 ARG_1 The government AM_TMP be PRED The Roanoke building a former prison ARG_2 ARG_1 H: The Roanoke building, which was a former prison, was bought by the Word. Net government in 1902. buy The government ARG_0 The Roanoke building PRED The Roanoke … prison In 1902 ARG_1 AM_TMP be PRED a former prison ARG_2 ARG_1 Page 59
Characteristics n Multiple paths => optimization problem n n Shortest or highest-confidence path through transformations Order is important; may need to explore different orderings Module dependencies are ‘local’; module B does not need access to module A’s KB/inference, only its output If outcome is “true”, the (optimal) set of transformations and local comparisons form a proof Page 60
Summary: Control Strategy and Decision Making n Despite the appeal of the strict logical approaches, as of today they do not work well enough. n Bos & Markert: the strict logical approach falls significantly behind good LLMs and multiple levels of lexical pre-processing; only incorporating rather shallow features and using them in the evaluation saves this approach. n Braz et al.: the strict graph based representation is not doing as well as LLM. n Tatu et al.: results show that the strict logical approach is inferior to LLMs, but when put together, it produces some gain. n Using machine learning methods as a way to combine systems and multiple features has been found very useful. Page 61
Hybrid/Ensemble Approaches n Bos et al. : use theorem prover and model builder n n Tatu et al. (2006) use ensemble approach: n n Expand models of T, H using model builder, check sizes of models Test consistency with background knowledge with T, H Try to prove entailment with and without background knowledge Create two logical systems, one lexical alignment system Combine system scores using coefficients found via search (train on annotated data) Modify coefficients for different tasks Zanzotto et al. (2006) try to learn from comparison of structures of T, H for ‘true’ vs. ‘false’ entailment pairs n n Use lexical, syntactic annotation to characterize match between T, H for successful, unsuccessful entailment pairs Train Kernel/SVM to distinguish between match graphs Page 62
Justification n For most approaches, justification is given only empirically, by evaluation on the (preprocessed) data. n Logical approaches: there is a proof-theoretic justification, modulo the power of the resources and the ability to map a sentence to a logical form. n Graph/tree based approaches: there is a model-theoretic justification; the approach is sound, but not complete, modulo the availability of resources. Page 63
Justifying Graph Based Approaches [Braz et al. '05] n R: a knowledge representation language, with a well-defined syntax and semantics over a domain D. n For text snippets s, t: r_s, r_t are their representations in R; M(r_s), M(r_t) their model-theoretic representations. n There is a well-defined notion of subsumption in R, defined model theoretically: for u, v ∈ R, u is subsumed by v when M(u) ⊆ M(v). n Not an algorithm; need a proof theory. Page 64
Defining Semantic Entailment (2) n The proof theory is weak; it will show r_s ⊆ r_t only when they are relatively similar syntactically. n r ∈ R is faithful to s if M(r_s) = M(r). n Definition: Let s, t be text snippets with representations r_s, r_t ∈ R. We say that s semantically entails t if there is a representation r ∈ R that is faithful to s, for which we can prove that r ⊆ r_t. n Given r_s we need to generate many equivalent representations r′_s and test r′_s ⊆ r_t. This cannot be done exhaustively. How to generate alternative representations? Page 65
Defining Semantic Entailment (3) n A rewrite rule (l, r) is a pair of expressions in R such that l ⊆ r. n Given a representation r_s of s and a rule (l, r) for which r_s ⊆ l, the augmentation of r_s via (l, r) is r′_s = r_s ∧ r. n Claim: r′_s is faithful to s. Proof: in general, since r′_s = r_s ∧ r, then M(r′_s) = M(r_s) ∩ M(r). However, since r_s ⊆ l ⊆ r, then M(r_s) ⊆ M(r). Consequently M(r′_s) = M(r_s), and the augmented representation is faithful to s. Page 66
Comments n n The claim suggests an algorithm for generating alternative (equivalent) representations and for semantic entailment. The resulting algorithm is a sound algorithm, but is not complete. Completeness depends on the quality of the KB of rules. The power of this algorithm is in the rules KB. l and r might be very different syntactically, but by satisfying model theoretic subsumption they provide expressivity to the re-representation in a way that facilitates the overall subsumption. Page 67
Non-Entailment n The problem of determining non-entailment is harder, mostly due to its structure. n Most approaches determine non-entailment heuristically: set a threshold for a cost function; if it is not met by the pair, say 'no'. n Several approaches have identified specific features that hint at non-entailment. n A model-theoretic approach for non-entailment has also been developed, although its effectiveness isn't clear yet. Page 68
What are we missing? n It is completely clear that the key missing resource is knowledge: better resources translate immediately to better results. At this point existing resources seem to be lacking in coverage and accuracy; not enough high quality public resources; no quantification. n Some examples: lexical knowledge that is difficult to acquire systematically (e.g. A bought Y ⇒ A has/owns Y); many of the current lexical resources are very noisy; numbers and quantitative reasoning; time and date, temporal reasoning; robust event based reasoning and information integration. Page 69
Textual Entailment as a Classification Task Page 70
RTE as classification task n RTE is a classification task: given a pair we need to decide if T implies H or T does not imply H. n We can learn a classifier from annotated examples. n What do we need: a learning algorithm and a suitable feature space. Page 71
Defining the feature space How do we define the feature space? T 1 H 1 n T 1 “At the end of the year, all solid companies pay dividends. ” H 1 “At the end of the year, all solid insurance companies pay dividends. ” n Possible features n n “Distance Features” - Features of “some” distance between T and H “Entailment trigger Features” “Pair Feature” – The content of the T-H pair is represented Possible representations of the sentences n n n Bag-of-words (possibly with n-grams) Syntactic representation Semantic representation Page 72
Distance Features T H T “At the end of the year, all solid companies pay dividends. ” H “At the end of the year, all solid insurance companies pay dividends. ” Possible features n n Number of words in common Longest common subsequence Longest common syntactic subtree … Page 73
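A minimal sketch of two such distance features, word overlap and longest common (word) subsequence, computed on the slide's T/H example; the helper functions are purely illustrative.

```python
def word_overlap(t, h):
    """Number of word types shared by T and H."""
    return len(set(t.lower().split()) & set(h.lower().split()))

def longest_common_subsequence(t, h):
    """Length of the longest common subsequence of words (standard DP)."""
    a, b = t.lower().split(), h.lower().split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

T = "At the end of the year, all solid companies pay dividends."
H = "At the end of the year, all solid insurance companies pay dividends."
print(word_overlap(T, H), longest_common_subsequence(T, H))
```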
Entailment Triggers. Possible features from (de Marneffe et al., 2006): n Polarity features: presence/absence of negative polarity contexts (not, no or few, without), e.g. "Oil price surged" vs. "Oil prices didn't grow" n Antonymy features: presence/absence of antonymous words in T and H, e.g. "Oil price is surging" vs. "Oil prices is falling down" n Adjunct features: dropping/adding of a syntactic adjunct when moving from T to H, e.g. "all solid companies pay dividends" vs. "all solid companies pay cash dividends" n … Page 74
Pair Features. T: "At the end of the year, all solid companies pay dividends." H: "At the end of the year, all solid insurance companies pay dividends." Possible features: bag-of-word spaces of T and H; syntactic spaces of T and H. (Figure: the pair represented as indicator features end_T, year_T, solid_T, companies_T, pay_T, dividends_T, … and end_H, year_H, solid_H, companies_H, insurance_H, pay_H, dividends_H, …) Page 75
Pair Features: what can we learn? In the bag-of-word spaces of T and H we can learn rules such as: T implies H when T contains "end"…; T does not imply H when H contains "end"… It seems to be totally irrelevant!!! (Figure: the same indicator-feature representation as before.) Page 76
ML Methods in the possible feature spaces (Bos&Markert, 2006) Distance Entailment Trigger Possible Features Pair (Zanzotto&Moschitti, 2006) (de Marneffe et al. , 2006) (Hickl et al. , 2006) (Ipken et al. , 2006) (Kozareva&Montoyo, 2006) (…) (Herrera et al. , 2006) (…) (Rodney et al. , 2006) Bag-of-words Syntactic Sentence representation Page 77 Semantic
Effectively using the Pair Feature Space (Zanzotto, Moschitti, 2006) Roadmap n Motivation: Reason why it is important even if it seems not. n Understanding the model with an example n n n Challenges A simple example Defining the cross-pair similarity Page 78
Observing the Distance Feature Space… (Zanzotto, Moschitti, 2006)
Pair T1-H1: T1 "At the end of the year, all solid companies pay dividends." H1 "At the end of the year, all solid insurance companies pay dividends."
Pair T1-H2: T1 "At the end of the year, all solid companies pay dividends." H2 "At the end of the year, all solid companies pay cash dividends."
In a distance feature space (% common words, % common syntactic dependencies) the two pairs are very likely the same point.
Page 79
What can happen in the pair feature space? (Zanzotto, Moschitti, 2006) T 1 H 1 T 1 “At the end of the year, all solid companies pay dividends. ” H 1 “At the end of the year, all solid insurance companies pay dividends. ” T 1 H 2 T 1 H 2 T 3 H 3 “At the end of the year, all solid companies pay dividends. ” “At the end of the year, all solid companies pay cash dividends. ” S 2 < S 1 T 3 “All wild animals eat plants that have scientifically proven medicinal properties. ” H 3 “All wild mountain animals eat plants that have scientifically proven medicinal properties. ” Page 80
Observations. Some examples are difficult to exploit in the distance feature space; we need a space that considers the content and the structure of textual entailment examples. Let us explore the pair space, using the kernel trick: define the space by defining the distance K(P1, P2), e.g. K(T1H1, T1H2), instead of defining the features. Page 81
Target (Zanzotto & Moschitti, 2006). Cross-pair similarity: KS((T′, H′), (T′′, H′′)) = KT(T′, T′′) + KT(H′, H′′). How do we build it: using a syntactic interpretation of sentences, and a similarity among trees KT(T′, T′′) that counts the number of subtrees in common between T′ and T′′. This is a syntactic pair feature space. Question: do we need something more? Page 82
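A minimal sketch of this cross-pair similarity, assuming parse trees are given as nested tuples and using a deliberately simplified tree similarity that only counts shared (parent, children) productions; the full system uses a proper subtree kernel, so the names and data structures below are illustrative, not the authors' implementation.

```python
from collections import Counter

def productions(tree):
    """Collect (parent, children-labels) productions of a nested-tuple tree,
    e.g. ("S", ("NP", ("NNP", "Yahoo")), ("VP", ("VBD", "owns"), ...))."""
    out = Counter()
    label, *children = tree
    if children and isinstance(children[0], tuple):
        out[(label, tuple(c[0] for c in children))] += 1
        for c in children:
            out += productions(c)
    return out

def kt(t1, t2):
    """Simplified tree similarity: number of shared productions (a stand-in
    for the subtree-counting kernel KT)."""
    p1, p2 = productions(t1), productions(t2)
    return sum(min(p1[k], p2[k]) for k in p1 if k in p2)

def ks(pair1, pair2):
    """Cross-pair similarity KS((T',H'),(T'',H'')) = KT(T',T'') + KT(H',H'')."""
    (t1, h1), (t2, h2) = pair1, pair2
    return kt(t1, t2) + kt(h1, h2)

t_tree = ("S", ("NP", ("NNP", "Yahoo")), ("VP", ("VBD", "acquired"), ("NP", ("NNP", "Overture"))))
h_tree = ("S", ("NP", ("NNP", "Yahoo")), ("VP", ("VBD", "owns"), ("NP", ("NNP", "Overture"))))
print(ks((t_tree, h_tree), (t_tree, h_tree)))
```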
Observing the syntactic pair feature space (Zanzotto, Moschitti, 2006) Can we use syntactic tree similarity? Page 83
Observing the syntactic pair feature space (Zanzotto, Moschitti, 2006) Can we use syntactic tree similarity? Page 84
Observing the syntactic pair feature space (Zanzotto, Moschitti, 2006) Can we use syntactic tree similarity? Not only! Page 85
Observing the syntactic pair feature space (Zanzotto, Moschitti, 2006) Can we use syntactic tree similarity? Not only! We want to use/exploit also the implied rewrite rule a a b c b a d c Page 86 d a c b b d c d
Exploiting Rewrite Rules (Zanzotto, Moschitti, 2006) To capture the textual entailment recognition rule (rewrite rule or inference rule), the cross-pair similarity measure should consider: n n the structural/syntactical similarity between, respectively, texts and hypotheses the similarity among the intra-pair relations between constituents How to reduce the problem to a tree similarity computation? Page 87
Exploiting Rewrite Rules (Zanzotto, Moschitti, 2006) Page 88
Exploiting Rewrite Rules Intra-pair operations (Zanzotto, Moschitti, 2006) Page 89
Exploiting Rewrite Rules Intra-pair operations Finding anchors (Zanzotto, Moschitti, 2006) Page 90
Exploiting Rewrite Rules Intra-pair operations (Zanzotto, Moschitti, 2006) Finding anchors Naming anchors with placeholders Page 91
Exploiting Rewrite Rules Intra-pair operations (Zanzotto, Moschitti, 2006) Finding anchors Naming anchors with placeholders Propagating placeholders Page 92
Exploiting Rewrite Rules Intra-pair operations Cross-pair operations Finding anchors Naming anchors with placeholders Propagating placeholders Page 93 (Zanzotto, Moschitti, 2006)
Exploiting Rewrite Rules Intra-pair operations Finding anchors Naming anchors with placeholders Propagating placeholders (Zanzotto, Moschitti, 2006) Cross-pair operations Matching placeholders across pairs Page 94
Exploiting Rewrite Rules Intra-pair operations Cross-pair operations Finding anchors Naming anchors with placeholders Propagating placeholders Matching placeholders across pairs Renaming placeholders Page 95
Exploiting Rewrite Rules Intra-pair operations Cross-pair operations Finding anchors Naming anchors with placeholders Propagating placeholders Matching placeholders across pairs Renaming placeholders Calculating the similarity between syntactic trees with co-indexed leaves Page 96
Exploiting Rewrite Rules Intra-pair operations Cross-pair operations Finding anchors Naming anchors with placeholders Propagating placeholders Matching placeholders across pairs Renaming placeholders Calculating the similarity between syntactic trees with co-indexed leaves Page 97 (Zanzotto, Moschitti, 2006)
Exploiting Rewrite Rules (Zanzotto, Moschitti, 2006) The initial example: sim(H 1, H 3) > sim(H 2, H 3)? Page 98
Defining the Cross-pair similarity (Zanzotto, Moschitti, 2006) n The cross-pair similarity is based on the similarity between syntactic trees with co-indexed leaves, taken over the possible correspondences between anchors, where: C is the set of all the correspondences between anchors of (T′, H′) and (T′′, H′′); t(S, c) returns the parse tree of the hypothesis (text) S where its placeholders are replaced by means of the substitution c; i is the identity substitution; KT(t1, t2) is a function that measures the similarity between the two trees t1 and t2. Page 99
Defining the Cross-pair similarity Page 100
Refining Cross-pair Similarity (Zanzotto, Moschitti, 2006) n Controlling complexity: we reduced the size of the set of anchors using the notion of chunk. n Reducing the computational cost: many subtree computations are repeated during the computation of KT(t1, t2); this can be exploited for a better dynamic programming algorithm (Moschitti & Zanzotto, 2007). n Focusing on the information within a pair that is relevant for the entailment: text trees are pruned according to where anchors attach. Page 101
BREAK (30 min) Page 102
III. Knowledge Acquisition Methods Page 103
Knowledge Acquisition for TE. What kind of knowledge do we need? n Explicit knowledge (structured knowledge bases): relations among words or concepts (symmetric: synonymy, co-hyponymy; directional: hyponymy, part-of, …); relations among sentence prototypes (symmetric: paraphrasing; directional: inference rules/rewrite rules). n Implicit knowledge: relations among sentences (symmetric: paraphrasing examples; directional: entailment examples). Page 104
Acquisition of Explicit Knowledge Page 105
Acquisition of Explicit Knowledge The questions we need to answer n What? n n Using what? n n What we want to learn? Which resources do we need? Which are the principles we have? How? n How do we organize the “knowledge acquisition” algorithm Page 106
Acquisition of Explicit Knowledge: what? Types of knowledge n Symmetric: co-hyponymy between words (cat ↔ dog); synonymy between words (buy ↔ acquire); sentence prototypes (paraphrasing): X bought Y ↔ X acquired Z% of Y's shares n Directional semantic relations: words (cat ⇒ animal, buy ⇒ own, wheel part-of car); sentence prototypes: X acquired Z% of Y's shares ⇒ X owns Y Page 107
Acquisition of Explicit Knowledge: Using what? Underlying hypotheses n Harris' Distributional Hypothesis (DH) (Harris, 1964): "Words that tend to occur in the same contexts tend to have similar meanings." sim(w1, w2) ≈ sim(C(w1), C(w2)) n Robison's Point-wise Assertion Patterns (PAP) (Robison, 1970): "It is possible to extract relevant semantic relations with some patterns." w1 is in a relation r with w2 if the context matches pattern_r(w1, w2) Page 108
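A minimal sketch of the distributional-hypothesis side, approximating sim(w1, w2) by the cosine similarity of bag-of-context vectors built from a plain-text corpus; the window size, tokenization and toy corpus are illustrative assumptions.

```python
import math, re
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a Counter over the words seen within +/- window tokens."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        tokens = re.findall(r"\w+", sent.lower())
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[w][tokens[j]] += 1
    return vectors

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# sim(w1, w2) is approximated by sim(C(w1), C(w2))
corpus = ["the sun is constituted of hydrogen",
          "the sun is composed of hydrogen"]
vecs = context_vectors(corpus)
print(cosine(vecs["constituted"], vecs["composed"]))
```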
Distributional Hypothesis (DH) simw(W 1, W 2) simctx(C(W 1), C(W 2)) Words or Forms w 1= constitute Context (Feature) Space C(w 1) Corpus: source of contexts … sun is constituted of hydrogen … …The Sun is composed of hydrogen … w 2= compose C(w 2) Page 109
Point-wise Assertion Patterns (PAP). w1 is in a relation r with w2 if the contexts match patterns_r(w1, w2). Example: relation w1 part_of w2, with patterns "w1 is constituted of w2", "w1 is composed of w2"; corpus as source of contexts: "…sun is constituted of hydrogen…", "…The Sun is composed of hydrogen…"; a statistical indicator S_corpus(w1, w2) selects correct vs. incorrect relations among words, e.g. part_of(sun, hydrogen). Page 110
DH and PAP cooperate Distributional Hypothesis Words or Forms w 1= constitute Point-wise assertion Patterns Context (Feature) Space C(w 1) Corpus: source of contexts … sun is constituted of hydrogen … …The Sun is composed of hydrogen … w 2= compose C(w 2) Page 111
Knowledge Acquisition: Where methods differ? Words or Forms w 1= cat Context (Feature) Space C(w 1) w 2= dog On the “word” side n Target equivalence classes: Concepts or Relations n Target forms: words or expressions On the “context” side n Feature Space n Similarity function Page 112 C(w 2)
Directional Verb Entailment (Zanzotto et al. , 2006) Noun Entailment (Geffet&Dagan, 2005) Relation Pattern Learning (ESPRESSO) (Pantel&Pennacchiotti, 2006) ISA patterns (Hearst, 1992) ESPRESSO (Pantel&Pennacchiotti, 2006) Symmetric Types of knowledge KA 4 TE: a first classification of some methods Hearst Concept Learning (Lin&Pantel, 2001 a) TEASE (Szepktor et al. , 2004) Inference Rules (DIRT) (Lin&Pantel, 2001 b) Distributional Hypothesis Point-wise assertion Patterns Underlying hypothesis Page 113
Noun Entailment Relation (Geffet & Dagan, 2006) n Type of knowledge: directional relations n Underlying hypothesis: distributional hypothesis n Main idea: the distributional inclusion hypothesis: w1 ⇒ w2 if all the prominent features of w1 occur with w2 in a sufficiently large corpus. (Figure: the prominent features I(C(w1)) of w1 are included in those of w2, I(C(w2)).) Page 114
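A minimal sketch of the distributional inclusion test; "prominent feature" is approximated here by a raw frequency threshold on plain context-count dictionaries, which is an illustrative simplification of the weighting used in the paper, and the toy vectors are made up.

```python
def prominent_features(context_counts, min_count=2):
    """Approximate the 'prominent' context features of a word by a count threshold."""
    return {f for f, c in context_counts.items() if c >= min_count}

def distributional_inclusion(counts_w1, counts_w2, min_count=2):
    """Suggest w1 => w2 when every prominent feature of w1 also occurs with w2."""
    feats1 = prominent_features(counts_w1, min_count)
    return bool(feats1) and feats1 <= set(counts_w2)

# Toy check: features of 'poodle' are included in those of 'dog', but not vice versa.
poodle = {"barks": 3, "leash": 2}
dog = {"barks": 9, "leash": 5, "bites": 4}
print(distributional_inclusion(poodle, dog), distributional_inclusion(dog, poodle))
```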
Verb Entailment Relations (Zanzotto, Pennacchiotti, Pazienza, 2006) n Type of knowledge: directional (oriented) relations n Underlying hypothesis: point-wise assertion patterns n Main idea: does win ⇒ play? The expression "player wins" suggests it. Relation: v1 ⇒ v2; patterns: "agentive_nominalization(v2) v1"; statistical indicator S(v1, v2): point-wise mutual information. Page 115
Verb Entailment Relations (Zanzotto, Pennacchiotti, Pazienza, 2006). Understanding the idea n Selectional restriction: fly(x) ⇒ has_wings(x); in general, for a verb v, v(x) ⇒ c(x) (if x is the subject of v then x has the property c) n Agentive nominalization: the agentive noun is the doer or the performer of an action v′; "X is a player" may be read as play(x), and c(x) is clearly v′(x) if the property c is derived from v′ by agentive nominalization. Page 116
Verb Entailment Relations. Understanding the idea. Given the expression "player wins": n Seen as a selectional restriction: win(x) ⇒ play(x) n Seen as a selectional preference: P(play(x) | win(x)) > P(play(x)) Page 117
Knowledge Acquisition for TE: How? The algorithmic nature of a DH+PAP method n Direct n n Indirect n n Starting point: target words Starting point: context feature space Iterative n Interplay between the context feature space and the target words Page 118
(Each of the following algorithms instantiates sim(w1, w2) ≈ sim(C(w1), C(w2)) or sim(w1, w2) ≈ sim(I(C(w1)), I(C(w2))).)
Direct Algorithm
1. Select target words w_i from the corpus or from a dictionary
2. Retrieve the contexts of each w_i and represent them in the feature space C(w_i)
3. For each pair (w_i, w_j): compute the similarity sim(C(w_i), C(w_j)) in the context space; if sim(w_i, w_j) = sim(C(w_i), C(w_j)) > t, then w_i and w_j belong to the same equivalence class W
Page 119
Indirect Algorithm
1. Given an equivalence class W, select relevant contexts and represent them in the feature space
2. Retrieve target words (w_1, …, w_n) that appear in these contexts; these are likely to be words in the equivalence class W
3. Eventually, for each w_i, retrieve C(w_i) from the corpus
4. Compute the centroid I(C(W))
5. For each w_i, if sim(I(C(W)), w_i) < t, eliminate w_i from W
Page 120
Iterative Algorithm
1. For each word w_i in the equivalence class W, retrieve the contexts C(w_i) and represent them in the feature space
2. Extract words w_j that have contexts similar to C(w_i)
3. Extract the contexts C(w_j) of these new words
4. For each new word w_j, if sim(C(W), w_j) > t, put w_j in W
Page 121
Knowledge Acquisition using DH and PAH n Direct Algorithms n n Indirect Algorithms n n Concepts from text via clustering (Lin&Pantel, 2001) Inference rules – aka DIRT (Lin&Pantel, 2001) … Hearst’s ISA patterns (Hearst, 1992) Question Answering patterns (Ravichandran&Hovy, 2002) … Iterative Algorithms n n n Entailment rules from Web – aka TEASE (Szepktor et al. , 2004) Espresso (Pantel&Pennacchiotti, 2006) … Page 122
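A minimal sketch of the iterative (bootstrapping) scheme shared by TEASE- and Espresso-style methods, under simplifying assumptions: the corpus is a list of sentences, patterns are literal strings with X/Y slots, and pattern reliability is just a raw instance count rather than the MI/PMI-based measures the papers use; all names here are illustrative.

```python
import re

def instances_for_pattern(corpus, pattern):
    """Find (X, Y) word pairs matching a literal pattern such as 'X is composed of Y'."""
    regex = re.escape(pattern).replace("X", r"(\w+)").replace("Y", r"(\w+)")
    pairs = set()
    for sent in corpus:
        for m in re.finditer(regex, sent, re.IGNORECASE):
            pairs.add((m.group(1).lower(), m.group(2).lower()))
    return pairs

def patterns_for_instances(corpus, pairs):
    """Induce new surface patterns from sentences containing a known (X, Y) pair."""
    patterns = set()
    for sent in corpus:
        for x, y in pairs:
            m = re.search(rf"\b{x}\b(.+?)\b{y}\b", sent, re.IGNORECASE)
            if m:
                patterns.add("X" + m.group(1) + "Y")
    return patterns

def bootstrap(corpus, seed_patterns, iterations=2, keep=5):
    """Alternate between harvesting instances and inducing patterns, keeping the most productive."""
    patterns, instances = set(seed_patterns), set()
    for _ in range(iterations):
        for p in patterns:
            instances |= instances_for_pattern(corpus, p)
        ranked = sorted(patterns_for_instances(corpus, instances),
                        key=lambda p: len(instances_for_pattern(corpus, p)),
                        reverse=True)
        patterns |= set(ranked[:keep])
    return patterns, instances

corpus = ["Water is composed of hydrogen and oxygen.",
          "A molecule is composed of atoms.",
          "Water is made of hydrogen."]
print(bootstrap(corpus, ["X is composed of Y"]))
```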
TEASE (Szpektor et al., 2004). Type: iterative algorithm. On the "word" side: target equivalence classes are fine-grained relations, e.g. prevent(X, Y); target forms are verbs with their arguments (e.g. X subj→ call ←obj Y, with modifiers such as "finally", "indictable"). On the "context" side: feature space of anchor fillers (X_filler, Y_filler), weighted e.g. by mutual information. Innovation with respect to research before 2004: first direct algorithm for extracting rules. Page 123
TEASE (Szpektor et al., 2004). Input template: X subj-accuse-obj Y. Sample corpus for the input template (from the Web): "Paula Jones accused Clinton…", "BBC accused Blair…", "Sanhedrin accused St. Paul…", … Anchor Set Extraction (ASE) yields anchor sets: {Paula Jones subj; Clinton obj}, {Sanhedrin subj; St. Paul obj}, … Sample corpus for the anchor sets: "Paula Jones called Clinton indictable…", "St. Paul defended before the Sanhedrin…". Template Extraction (TE) yields templates: X call Y indictable, Y defend before X, … and the process iterates. Page 124
TEASE (Szpektor et al., 2004). Innovations with respect to research before 2004: first direct algorithm for extracting rules; a feature selection is done to assess the most informative features; extracted forms are clustered to obtain the most general sentence prototype of a given set of equivalent forms. (Figure: two parses of "X call Y indictable" variants, S1 and S2, are merged into a single generalized template, with placeholders {1}, {2} marking which source each node comes from.) Page 125
Espresso (Pantel & Pennacchiotti, 2006). Type: iterative algorithm. On the "word" side: target equivalence classes are relations, e.g. compose(X, Y); target forms are expressions, i.e. sequences of tokens: "Y is composed by X", "Y is made of X". Innovation with respect to research before 2006: a measure to determine specific vs. general patterns (ranking of the equivalent forms). Page 126
Espresso (Pantel & Pennacchiotti, 2006). (Figure: the bootstrapping loop alternates between ranked instance pairs, e.g. (leader, panel), (city, region), (oxygen, water), (tree, land), (atom, molecule), (oxygen, hydrogen), (range of information, FBI report), (artifact, exhibit), with reliability scores such as 1.0, 0.9, 0.7, 0.6, 0.2, and ranked patterns, e.g. "Y is composed by X", "Y is part of X", "X, Y", with scores such as 1.0, 0.8, 0.2.) Page 127
Espresso (Pantel & Pennacchiotti, 2006). Innovations with respect to research before 2006: a measure to determine specific vs. general patterns (ranking of the equivalent forms, e.g. "Y is composed by X" 1.0, "Y is part of X" 0.8, "X, Y" 0.2); both pattern and instance selection are performed; general and specific patterns are used differently in the iterative algorithm. Page 128
Acquisition of Implicit Knowledge Page 129
Acquisition of Implicit Knowledge The questions we need to answer n What? n n What we want to learn? Which resources do we need? Using what? n Which are the principles we have? Page 130
Acquisition of Implicit Knowledge: what? Types of knowledge n Symmetric: near-synonymy between sentences, e.g. "Acme Inc. bought Goofy ltd." ≈ "Acme Inc. acquired 11% of the Goofy ltd.'s shares" n Directional semantic relations: entailment between sentences, e.g. "Acme Inc. acquired 11% of the Goofy ltd.'s shares" ⇒ "Acme Inc. owns Goofy ltd." Note: tricky NOT-entailments are also relevant. Page 131
Acquisition of Implicit Knowledge : Using what? Underlying hypothesis n Structural and content similarity “Sentences are similar if they share enough content” sim(s 1, s 2) according to relations from s 1 and s 2 n A revised Point-wise Assertion Patterns “Some patterns of sentences reveal relations among sentences” Page 132
entails not entails Symmetric Directional Types of knowledge A first classification of some methods Relations among sentences (Hickl et al. , 2006) Relations among sentences (Burger&Ferro, 2005) Paraphrase Corpus (Dolan&Quirk, 2004) Revised Point-wise assertion Patterns Structural and content similarity Underlying hypothesis Page 133
Entailment relations among sentences (Burger & Ferro, 2005) n Type of knowledge: directional relations (entailment) n Underlying hypothesis: revised point-wise assertion patterns n Main idea: in headline news items, the first sentence/paragraph generally entails the title. The relation s2 ⇒ s1 is extracted with the pattern "News Item: Title(s1), First_Sentence(s2)"; this pattern works on the structure of the text. Page 134
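A minimal sketch of this harvesting idea, assuming news items are already available as (title, body) string pairs; the naive sentence splitter and the sample item are illustrative assumptions.

```python
import re

def first_sentence(body):
    """Very rough sentence splitter: cut at the first period followed by whitespace."""
    m = re.search(r"(.+?\.)\s", body + " ")
    return m.group(1).strip() if m else body.strip()

def harvest_entailment_pairs(news_items):
    """Turn (title, body) news items into (text, hypothesis) entailment examples:
    the first sentence of the body is assumed to entail the title."""
    return [(first_sentence(body), title) for title, body in news_items]

items = [("Chrysler Group to Be Sold for $7.4 Billion",
          "DaimlerChrysler confirmed today that it would sell a controlling interest "
          "in its struggling Chrysler Group to Cerberus Capital Management of New York. More text.")]
print(harvest_entailment_pairs(items))
```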
Entailment relations among sentences: examples from the web. Title: New York Plan for DNA Data in Most Crimes. Body: Eliot Spitzer is proposing a major expansion of New York's database of DNA samples to include people convicted of most crimes, while making it easier for prisoners to use DNA to try to establish their innocence. … Title: Chrysler Group to Be Sold for $7.4 Billion. Body: DaimlerChrysler confirmed today that it would sell a controlling interest in its struggling Chrysler Group to Cerberus Capital Management of New York, a private equity firm that specializes in restructuring troubled companies. … Page 135
Tricky Not-Entailment relations among sentences (Hickl et al., 2006) n Type of knowledge: directional relations (tricky not-entailment) n Underlying hypothesis: revised point-wise assertion patterns n Main idea: in a text, sentences that share a named entity generally do not entail each other; sentences connected by "on the contrary", "but", … do not entail each other. Patterns: s1 ⇏ s2 if s1 and s2 are in the same text and share at least a named entity, or if they appear as "s1. On the contrary, s2". Page 136
Tricky Not-Entailment relations among sentences examples from (Hickl et al. , 2006) T One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year. H Irabu said he would take Wells out to dinner when the Yankees visit Toronto. T According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient. H In contrast, he stressed, Clean Mag has a 100 percent pollution retrieval rate, is low cost and can be recycled. Page 137
Context Sensitive Paraphrasing n n n He used a Phillips head to tighten the screw. The bank owner tightened security after a spat of local crimes. The Federal Reserve will aggressively tighten monetary policy. Page 138 Loosen Strengthen Step up Toughen Improve Fasten Impose Intensify Ease Beef up Simplify Curb Reduce ……….
Context Sensitive Paraphrasing n n n Can speak replace command? The general commanded his troops. The general spoke to his troops. The soloist commanded attention. The soloist spoke to attention.
Context Sensitive Paraphrasing n n Need to know when one word can paraphrase another, not just if. Given a word v and its context in sentence S, and another word u: n n n Can u replace v in S and have S keep the same or entailed meaning. Is the new sentence S’ where u has replaced v entailed by previous sentence S The general commanded [V] his troops. [Speak = U] The general spoke to his troops. YES The soloist commanded [V ] attention. [Speak = U] The soloist spoke to attention. NO
Related Work n Paraphrase generation: given a sentence or phrase, generate paraphrases of that phrase which have the same or entailed meaning in some context [DIRT; TEASE]. n A sense disambiguation task, without naming the sense: Dagan et al. '06; Kauchak & Barzilay (in the context of improving MT evaluation); SemEval word substitution task; Pantel et al. '06. n In these cases, this was done by learning (in a supervised way) a single classifier per word u.
Context Sensitive Paraphrasing [Connor & Roth '07] n Use a single global binary classifier f(S, v, u) → {0, 1} n Unsupervised, bootstrapped learning approach n Key: the use of a very large amount of unlabeled data to derive a reliable supervision signal that is then used to train a supervised learning algorithm. n Features are the amount of overlap between the contexts u and v have both been seen with; context sensitivity is included by restricting to contexts similar to S (are both u and v seen in contexts similar to the local context S?). This allows running the classifier on previously unseen pairs (u, v).
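A minimal sketch of the kind of features this description suggests: overlap between the contexts two verbs have been observed with, optionally restricted to contexts that resemble the local sentence S. The feature names, thresholds and toy inputs are illustrative assumptions, not the classifier actually trained by Connor & Roth.

```python
def context_overlap_features(contexts_u, contexts_v, sentence_words, min_shared=1):
    """contexts_u / contexts_v: sets of context signatures (e.g. subject/object words)
    observed with each verb in a large corpus; sentence_words: words of the test sentence S."""
    shared = contexts_u & contexts_v
    # Context sensitivity: keep only shared contexts that also appear in the local context S.
    shared_local = {c for c in shared if c in sentence_words}
    return {
        "n_shared": len(shared),
        "n_shared_local": len(shared_local),
        "jaccard": len(shared) / len(contexts_u | contexts_v) if (contexts_u or contexts_v) else 0.0,
        "local_support": len(shared_local) >= min_shared,
    }

# e.g. can "speak" replace "command" in "The general commanded his troops"?
feats = context_overlap_features({"general", "officer", "attention"},
                                 {"general", "officer", "audience"},
                                 {"the", "general", "commanded", "his", "troops"})
print(feats)
```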
IV. Applications of Textual Entailment Page 143
Relation Extraction (Romano et al. EACL-06) n Identify different ways of expressing a target relation n n Examples: Management Succession, Birth - Death, Mergers and Acquisitions, Protein Interaction Traditionally performed in a supervised manner n n Requires dozens-hundreds examples per relation Examples should cover broad semantic variability n Costly - Feasible? ? ? n Little work on unsupervised approaches Page 144
Proposed Approach Input Template X prevent Y Entailment Rule Acquisition TEASE Templates X prevention for Y, X treat Y, X reduce Y Syntactic Matcher Relation Instances <sunscreen, sunburns> Page 145 Transformation Rules
Dataset n n Bunescu 2005 Recognizing interactions between annotated proteins pairs n n 200 Medline abstracts Input template : X interact with Y Page 146
Manual Analysis - Results n 93% of interacting protein pairs can be identified with lexical-syntactic templates. n Number of templates vs. recall (within the 93%): 10%: 2, 20%: 4, 30%: 6, 40%: 11, 50%: 21, 60%: 39, 70%: 73, 80%: 107, 90%: 141, 100%: 175 templates. n Frequency of syntactic phenomena: transparent head 34%, apposition 24%, conjunction 24%, set 13%, relative clause 8%, co-reference 7%, coordination 7%, passive form 2%. Page 147
TEASE Output for X interact with Y A sample of correct templates learned: X bind to Y X binding to Y X activate Y X Y interaction X stimulate Y X attach to Y X couple to Y X interaction with Y interaction between X and Y X trap Y X become trapped in Y X recruit Y X Y complex X associate with Y X recognize Y X be linked to Y X block Y X target Y Page 148
TEASE Potential Recall on Training Set n Recall by experiment: input 39%; input + iterative 49%; input + iterative + morph 63% n Iterative: taking the top 5 ranked templates as input n Morph: recognizing morphological derivations (cf. semantic role labeling vs. matching) Page 149
Performance vs. Supervised Approaches Supervised: 180 training abstracts Page 150
Textual Entailment for Question Answering n n Sanda Harabagiu and Andrew Hickl (ACL-06) : Methods for Using Textual Entailment in Open-Domain Question Answering Typical QA architecture – 3 stages: 1) 2) 3) § Question processing Passage retrieval Answer processing Incorporated their RTE-2 entailment system at stages 2&3, for filtering and re-ranking Page 151
Integrated three methods 1) 2) 3) Test entailment between question and final answer – filter and re-rank by entailment score Test entailment between question and candidate retrieved passage – combine entailment score in passage ranking Test entailment between question and Automatically Generated Questions (AGQ) created from candidate paragraph § § Utilizes earlier method for generating Q-A pairs from paragraph Correct answer should match that of an entailed AGQ TE is relatively easy to integrate at different stages Results: 20% accuracy increase Page 152
Answer Validation Exercise @ CLEF 2006 -7 n n n Peñas et al. , Journal of Logic and Computation (to appear) Allow textual entailment systems to validate (and prioritize) the answers of QA systems participating at CLEF AVE participants receive: 1) 2) n question and answer – need to generate full hypothesis supporting passage – should entail the answer hypothesis Methodologically: Enables to measure TE systems contribution to QA performance, across many QA systems n TE developers do not need to have full-blown QA system Page 153
V. A Textual Entailment view of Applied Semantics Page 154
Classical Approach = Interpretation Stipulated Meaning Representation (by scholar) Variability Language (by nature) n n Logical forms, word senses, semantic roles, named entity types, … - scattered interpretation tasks Feasible/suitable framework for applied semantics? Page 155
Textual Entailment = Text Mapping Assumed Meaning (by humans) Variability Language (by nature) Page 156
General Case – Inference Meaning Representation Inference Interpretation Language Textual Entailment n n Entailment mapping is the actual applied goal - but also a touchstone for understanding! Interpretation becomes possible means n Varying representation levels may be investigated Page 157
Some perspectives n Issues with semantic interpretation n n Hard to agree on a representation language Costly to annotate semantic representations for training Difficult to obtain - is it more difficult than needed? Textual entailment refers to texts n n n Texts are theory neutral Amenable for unsupervised learning “Proof is in the pudding” test Page 158
Entailment as an Applied Semantics Framework The new view: formulate (all? ) semantic problems as entailment tasks Some semantic problems are traditionally investigated as entailment tasks But also… n Revised definitions of old problems n Exposing many new ones n Page 159
Some Classical Entailment Problems n Monotonicity – traditionally approached via entailment. Given that dog ⇒ animal: upward monotone: Some dogs are nice ⇒ Some animals are nice; downward monotone: No animals are nice ⇒ No dogs are nice. Some formal approaches go via interpretation to logical form; natural logic avoids interpretation to FOL (cf. Stanford @ RTE-3). n Noun compound relation identification: a novel by Tolstoy ⇒ Tolstoy wrote a novel. Practically an entailment task, when relations are represented lexically (rather than as interpreted semantic notions). Page 160
Revised definition of an Old Problem: Sense Ambiguity n n Classical task definition - interpretation: Word Sense Disambiguation What is the RIGHT set of senses? n n n Any concrete set is problematic/subjective … but WSD forces you to choose one A lexical entailment perspective: n n n Instead of identifying an explicitly stipulated sense of a word occurrence. . . identify whether a word occurrence (i. e. its implicit sense) entails another word occurrence, in context Dagan et al. (ACL-2006) Page 161
Synonym Substitution Source = record positive negative Target = disc This is anyway a stunning disc, thanks to the playing of the Moscow Virtuosi with Spivakov. He said computer networks would not be affected and copies of information should be made on floppy discs. Before the dead soldier was placed in the ditch his personal possessions were removed, leaving one disc on the body for identification purposes. Page 162
Unsupervised Direct: kNN Ranking n Test example score: average cosine similarity of the target example with the k most similar (unlabeled) instances of the source word n Rationale: positive examples of the target will be similar to some source occurrence (of the corresponding sense); negative target examples won't be similar to source examples n Rank test examples by score: a classification slant on language modeling Page 163
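A minimal sketch of this kNN ranking, assuming each word occurrence is already represented as a bag-of-words context vector (a dict of counts); k and the cosine-based scoring are taken from the slide, everything else is an illustrative assumption.

```python
import math

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_score(target_vec, source_vecs, k=5):
    """Average cosine similarity to the k most similar source-word occurrences."""
    sims = sorted((cosine(target_vec, s) for s in source_vecs), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0

def rank_target_examples(target_vecs, source_vecs, k=5):
    """Rank target-word occurrences by how 'source-like' their contexts are."""
    return sorted(target_vecs, key=lambda v: knn_score(v, source_vecs, k), reverse=True)
```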
Results (for synonyms): Ranking § kNN improves precision by 8-18% at recall levels up to 25% Page 164
Other Modified and New Problems n Lexical entailment vs. classical lexical semantic relationships: synonym ⇔ synonym; hyponym ⇒ hypernym (but much beyond WN – e.g. "medical technology"); meronym ⇐ ? ⇒ holonym – depending on meronym type, and context (boil on elbow ⇒ boil on arm vs. government voted ⇏ minister voted) n Named Entity Classification – by any textual type: Which pickup trucks are produced by Mitsubishi? Magnum ⇒ pickup truck n Argument mapping for nominalizations (derivations): X's acquisition of Y ⇔ X acquired Y; X's acquisition by Y ⇔ Y acquired X n Transparent head: sell to an IBM division ⇒ sell to IBM; sell to an IBM competitor ⇏ sell to IBM n … Page 165
The importance of analyzing entailment examples n Few systematic manual data analysis works were reported: Vanderwende et al. at the RTE-1 workshop; Bar-Haim et al. at the ACL-05 EMSEE workshop; within Romano et al. at EACL-06; the Xerox PARC dataset; Braz et al. at the IJCAI workshop '05 n These contribute a lot to understanding and defining entailment phenomena and sub-problems n Should be done (and reported) much more… Page 166
Unified Evaluation Framework n n Defining semantic problems as entailment problems facilitates unified evaluation schemes (vs. current state) Possible evaluation schemes: 1) Evaluate on the general TE task, while creating corpora which focus on target sub-tasks § § 2) Define TE-oriented subtasks, and evaluate directly on sub-task § § E. g. a TE dataset with many sense-matching instances Measure impact of sense-matching algorithms on TE performance E. g. a test collection manually annotated for sense-matching Advantages: isolate sub-problem; researchers can investigate individual problems without needing a full-blown TE system (cf. QA research) Such datasets may be derived from datasets of type (1) Facilitates common inference goal across semantic problems Page 167
Summary: Textual Entailment as Goal n The essence of the textual entailment paradigm: Ø Ø n Interpretation and “mapping” methods may compete/complement n n Base applied semantic inference on entailment “engines” and KBs Formulate various semantic problems as entailment sub-tasks at various levels of representations Open question: which inferences n n can be represented at “language” level? require logical or specialized representation and inference? (temporal, spatial, mathematical, …) Page 168
Textual Entailment ≈ Human Reading Comprehension n From a children’s English learning book (Sela and Greenberg): Reference Text: “…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …” ? ? ? Hypothesis (True/False? ): The Bermuda Triangle is near the United States Page 169
Cautious Optimism: Approaching the Desiderata? 1) 2) Generic (feasible) module for applications Unified (agreeable) paradigm for investigating language phenomena Thank you! Page 170