
Semi-automatic Annotation of the Romanian TimeBank 1.2
Corina Forăscu, Radu Ion, Dan Tufiş
Faculty of Computer Science, Al. I. Cuza University of Iasi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy
corinfor@info.uaic.ro, {radu, tufis}@racai.ro
CALP 07 workshop @ RANLP
Outline
1. Fundamentals
2. TimeML & TimeBank
3. Corpus processing
   1. Translation
   2. Pre-processing
   3. Alignment
   4. Annotation import
4. Conclusions
Fundamentals
Temporal information in natural language:
1. Time-denoting expressions: references to a calendar or clock system
   • expressed by NPs, PPs, or AdvPs
   • the 23rd of May, 1998; Monday; tomorrow; the second semester
2. Event-denoting expressions: references to an event
   • expressed by
     1. sentences, more precisely their syntactic head, the main verb: John listens to the music.
     2. noun phrases: Israel will ask the USA to delay a military strike against Iraq.
Motivation (1)
NLP applications that benefit:
• lexicon induction and linguistic investigation, using very large annotated corpora;
• question answering (questions such as when, how often or how long);
• information extraction or information retrieval;
• machine translation (translated and normalized temporal references; mappings between the different behavior of tenses from language to language);
• discourse processing: temporal structure of discourse and summarization.
Motivation (2)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
Now he realised that it was exactly because of this incident that he had suddenly decided to come home and to begin his journal exactly today.
Motivation (3)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
<TIMEX3 temporalFunction="true" tid="t152" type="TIME" value="PRESENT_REF">Acum</TIMEX3> îsi
<EVENT aspect="PROGRESSIVE" class="OCCURRENCE" eid="e153" tense="PAST">dădea</EVENT><MAKEINSTANCE eiid="ei59" eid="e153" cardinality="1" /> seama
<SIGNAL sid="s154">ca</SIGNAL> tocmai din cauza acestui
<EVENT aspect="NONE" class="OCCURRENCE" eid="e156" tense="NONE">incident</EVENT><MAKEINSTANCE eiid="ei60" eid="e156" cardinality="1" /> se
<EVENT aspect="PERFECTIVE" class="I_ACTION" eid="e157" tense="PAST">hotarâse</EVENT><MAKEINSTANCE eiid="ei61" eid="e157" cardinality="1" /> el brusc
<SIGNAL sid="s54">sa</SIGNAL><EVENT aspect="NONE" class="OCCURRENCE" eid="e159" tense="PRESENT">vină</EVENT><MAKEINSTANCE eiid="ei62" eid="e159" cardinality="1" /> acasa
<SIGNAL sid="s160">si</SIGNAL> <SIGNAL sid="s55">sa</SIGNAL>-si
<EVENT aspect="NONE" class="ASPECTUAL" eid="e161" tense="PRESENT">înceapă</EVENT><MAKEINSTANCE eiid="ei63" eid="e161" cardinality="1" /> jurnalul taman
<TIMEX3 temporalFunction="true" tid="t162" type="DATE" value="1984-04-04">astăzi</TIMEX3>.
<TLINK eventInstanceID="ei59" relatedToTime="t152" relType="SIMULTANEOUS" />
<TLINK eventInstanceID="ei60" relatedToEvent="e157" relType="BEFORE" />
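To make the structure of such an annotation concrete, here is a minimal Python sketch that reads a shortened version of the sentence above and resolves one TLINK back to the words it connects. It assumes the attribute names are written without the spaces introduced by slide extraction, and it follows the slide in using eid on MAKEINSTANCE; the function name summarize and the wrapping <s> element are illustrative choices, not part of the corpus tooling.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-cleaned fragment of the annotated sentence (assumption:
# a single <s> root is added so the fragment parses as one XML document).
FRAGMENT = """<s><TIMEX3 tid="t152" type="TIME" value="PRESENT_REF">Acum</TIMEX3> îsi
<EVENT eid="e153" class="OCCURRENCE" tense="PAST" aspect="PROGRESSIVE">dădea</EVENT>
<MAKEINSTANCE eiid="ei59" eid="e153" cardinality="1" /> seama
<TLINK eventInstanceID="ei59" relatedToTime="t152" relType="SIMULTANEOUS" /></s>"""


def summarize(fragment):
    """Collect events, timexes and event-to-time TLINKs from a TimeML fragment."""
    root = ET.fromstring(fragment)
    events = {e.get("eid"): e.text.strip() for e in root.iter("EVENT")}
    timexes = {t.get("tid"): t.text.strip() for t in root.iter("TIMEX3")}
    # Each instance id (eiid) points back to the event id it realizes.
    instances = {m.get("eiid"): m.get("eid") for m in root.iter("MAKEINSTANCE")}
    links = []
    for link in root.iter("TLINK"):
        event_word = events[instances[link.get("eventInstanceID")]]
        time_word = timexes[link.get("relatedToTime")]
        links.append((event_word, link.get("relType"), time_word))
    return {"events": events, "timexes": timexes, "links": links}


summary = summarize(FRAGMENT)
print(summary["links"])
```

The sketch only handles TLINKs that relate an event instance to a time expression; event-to-event links would need a second lookup through the instance table.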
State of the Art
Timeline with year markers 2006, 2005, 2004, 2002, 2001, 2000, 1998 and 1947, listing:
• Time Symposium
• ACL 2005: TARSQI system
• ACE-TERN: TIMEX2 v. 1.2
• TARSQI: TimeML v. 1.2
• ACE-TERN: TIMEX2 v. 1.1
• TERQAS: TimeML v. 1.0
• DAML-Time
• STAG (Setzer)
• ACL-COLING WS: ARTE, Annotating and Reasoning about Time and Events
• TIMEX
• ACL: Temporal and Spatial Information Processing
• MUC 7
• TIDES 2001: TIMEX2 v. 1.0.2
• LREC 2002: Annotation Standards for Temporal Information in Natural Language
• Reichenbach: The tenses of verbs
TERQAS 2002
+ TimeML v. 1.0: metadata standard, in news articles, for
  • marking events,
  • their temporal anchoring and
  • links
+ TimeBank corpus v. 1.0
+ guidelines for temporal annotation
TimeML v. 1.2
A metadata standard developed especially for news articles, for marking:
• events: EVENT, MAKEINSTANCE
• temporal anchoring of events: TIMEX3, SIGNAL
• links between events and/or timexes: TLINK, ALINK, SLINK
Events (1)
• situations that happen or occur; states or circumstances in which something obtains or holds true
• tensed verbs, adjectives, nominalizations
The oat-bran craze[e190] has cost[e189] the world's largest cereal maker market share.
7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION
Events (2)
The oat-bran <EVENT class="OCCURRENCE">craze[e190]</EVENT> has <EVENT class="OCCURRENCE">cost[e189]</EVENT> the world's largest cereal maker market share.
Analysts <EVENT class="REPORTING">say[e28]</EVENT> much of Kellogg's <EVENT class="OCCURRENCE">erosion[e204]</EVENT> has been in such core brands as Corn Flakes, ...
Instances
• Based on the event annotation: how many different instances or realizations a given event has (at least one)
• An instance carries the tense and aspect of the verb-denoted event
John learns[e1] twice on Monday.
<MAKEINSTANCE eiid="ei1" eventID="e1" signalID="s1" cardinality="2" aspect="NONE" tense="PRESENT" />
Temporal expressions: TIMEX3 (1)
Explicit & implicit temporal expressions:
• Times: 11 o’clock; midnight
• Dates:
  • Fully specified (May 23, 2006; winter, 2005)
  • Underspecified (Monday; next week; last month; two years ago)
• Durations: two months; three hours
• Sets: every week; every Tuesday
Temporal expressions: TIMEX3 (2)
<TIMEX3 tid="t192" type="DATE" temporalFunction="false" functionInDocument="CREATION_TIME" value="1989-10-30">10/30/89</TIMEX3>
<TIMEX3 mod="APPROX" tid="t220" type="DURATION" temporalFunction="true" functionInDocument="NONE" value="P2Y" anchorTimeID="t192">the next two years or so</TIMEX3>
<TIMEX3 tid="t207" type="DATE" temporalFunction="true" functionInDocument="NONE" value="FUTURE_REF" anchorTimeID="t192">soon</TIMEX3>
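The three value attributes above illustrate three kinds of normalized value: a calendar date, an ISO 8601 style duration, and a symbolic reference. The Python sketch below shows one way to tell them apart. It is a simplification under stated assumptions: only full YYYY-MM-DD dates and single-unit durations such as P2Y are recognized, and the function name interpret_value is hypothetical.

```python
import re
from datetime import date

# Assumption: only fully specified dates and single-unit durations are handled.
DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
DURATION_RE = re.compile(r"^P(\d+)([YMWD])$")


def interpret_value(value):
    """Classify a TIMEX3 value string as a date, a duration or a symbolic reference."""
    match = DATE_RE.match(value)
    if match:
        year, month, day = (int(part) for part in match.groups())
        return ("date", date(year, month, day))
    match = DURATION_RE.match(value)
    if match:
        return ("duration", (int(match.group(1)), match.group(2)))
    # Anything else (PRESENT_REF, FUTURE_REF, ...) is kept as a symbolic value.
    return ("symbolic", value)


for raw in ("1989-10-30", "P2Y", "FUTURE_REF"):
    print(raw, "->", interpret_value(raw))
```

Resolving a symbolic value such as FUTURE_REF against its anchorTimeID is deliberately left out of the sketch.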
Temporal signals: SIGNAL
Function words that indicate how temporal objects are to be related to each other:
• temporal prepositions, conjunctions and/or modifiers: on, in, at, from, to, before, after, during; before, after, while, when
• negative expressions
• modal verbs
• prepositions signaling modality (“to”)
• special characters denoting ranges in temporal expressions: “-” and “/”
Dependencies: LINKs
• Temporal relations: TLINK
  anchors to time; orders between times and events
• Aspectual relations: ALINK
  phases of an event
• Subordinating relations: SLINK
  events that syntactically subordinate other events
Temporal relations: TLINK (1)
• temporal relation between two temporal elements (event-event, event-timex);
• EVENTs take part through their INSTANCEs
• 13 relTypes, as in Allen’s interval relations:
  • Simultaneous
  • Identical
  • One before (/after) the other
  • One immediately before (/after) the other
  • One including / being included in the other
  • One holding during the duration of the other
  • One being the beginning (/ending) of the other
  • One being begun (/ended) by the other
Temporal relations: TLINK (2)
[Diagram: craze (ei1994), cost (ei1995), quit (ei1996) and 10/30/89 (t192)]
The oat-bran craze[e190/ei1994] has cost[e189/ei1995] the world's largest cereal maker market share. The company's president quit[e3/ei1996] suddenly.
Temporal relations: TLINK (3)
[Diagram: craze (ei1994), cost (ei1995), quit (ei1996) and 10/30/89 (t192)]
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1994" relType="BEFORE" />
<TLINK relatedToTime="t192" eventInstanceID="ei1996" relType="BEFORE" />
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1996" relType="IS_INCLUDED" />
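The three TLINKs above can be read as labelled edges in a small graph. The following Python sketch stores them as (source, relType, target) triples and adds the inverse of each relation, so that every node can be queried for what is known about it. The triple layout and the helper names are assumptions made for illustration, and only the inverses needed for this example are listed.

```python
# Inverse relTypes used in this example only (not the full relType inventory).
INVERSE = {
    "BEFORE": "AFTER",
    "AFTER": "BEFORE",
    "INCLUDES": "IS_INCLUDED",
    "IS_INCLUDED": "INCLUDES",
    "SIMULTANEOUS": "SIMULTANEOUS",
}

# (eventInstanceID, relType, relatedToEventInstance or relatedToTime)
TLINKS = [
    ("ei1994", "BEFORE", "ei1995"),
    ("ei1996", "BEFORE", "t192"),
    ("ei1996", "IS_INCLUDED", "ei1995"),
]


def with_inverses(links):
    """Return the links together with the inverse of every link."""
    closed = set(links)
    for source, rel, target in links:
        closed.add((target, INVERSE[rel], source))
    return closed


def relations_of(node, links):
    """List (relType, other node) pairs that start at the given node."""
    return sorted((rel, other) for source, rel, other in with_inverses(links) if source == node)


print(relations_of("ei1995", TLINKS))
```

The sketch does not compute transitive consequences (for example, what follows from two chained BEFORE links); that would require a proper temporal closure step.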
Aspectual relations: ALINK
Relationship between an aspectual event and its argument event:
• Initiation: John started[ei5] to read[ei6].
  <ALINK eventInstanceID="ei5" relatedToEventInstance="ei6" relType="INITIATES"/>
• Culmination: John finished[ei5] assembling[ei6] the table.
  <ALINK eventInstanceID="ei5" relatedToEventInstance="ei6" relType="CULMINATES"/>
• Termination: John stopped talking.
• Continuation: John kept talking.
Subordination relations: SLINK
For contexts introducing relations between two events, of type:
• Modal: John should have bought some wine.
• Factive: John forgot that he was in Boston yesterday.
• Counterfactive: John prevented the divorce.
• Evidential: John said he bought some wine.
• Negative evidential: John denied he bought only beer.
• Conditional: If John leaves today, Mary will cry.
TimeBank 1.2
• 183 English news report documents, TimeML annotated, distributed through LDC
• 4715 sentences with 10586 unique lexical units, from a total of 61042 lexical units
• Non-TimeML markup in TimeBank 1.1:
  • structure information: header
  • named entity recognition: <ENAMEX>, <NUMEX>, <CARDINAL>
  • sentence boundary information: <s>
TimeBank 1.2: TimeML tag counts
TimeML tags    Count
events          7935
instances       7940
timexes         1414
signals          688
alinks           265
slinks          2932
tlinks          6418
TOTAL          27592
Translation
• 2 “trained translators”; one final correction
• Translation desiderata:
  • 1-1 sentence aligned
  • Preserving POS
  • Verb tense mapped onto Romanian
  • Format of the dates, moments of day and numbers conforms to the norms of written Romanian
• 4715 sentences (translation units), 65375 lexical tokens, including punctuation marks, representing 12640 lexical types
Preprocessing the corpus
• Tokenisation: MtSeg, with idiomatic expressions, clitic splitting
• POS-tagging: TnT, adapted & improved to determine the POS of unknown words
• Lemmatisation: probabilistic, based on a lexicon
• Chunking: REs over POS tags to determine non-recursive NPs, AdvPs, PPs
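As an illustration of chunking with regular expressions over POS tags, the Python sketch below finds non-recursive noun phrases in a tagged sentence. The coarse tagset (DET, ADJ, NOUN, ADV), the pattern and the function name np_chunks are all hypothetical; the chunker used for the corpus works over the corpus tagsets and its actual rules are not shown on the slides.

```python
import re

# Hypothetical NP rule: optional determiner, any adjectives, one or more nouns.
# Tags are joined into one string, each followed by a space, so the regex
# operates on the tag sequence rather than on the words.
NP_PATTERN = re.compile(r"\b(?:DET )?(?:ADJ )*(?:NOUN )+")


def np_chunks(tagged):
    """Return the word lists of non-recursive NPs in a list of (word, tag) pairs."""
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tags):
        # Every tag contributes exactly one space, so counting spaces
        # converts character offsets back into token positions.
        start = tags[:match.start()].count(" ")
        end = start + match.group().count(" ")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


sentence = [("another", "DET"), ("very", "ADV"), ("bad", "ADJ"),
            ("financial", "ADJ"), ("week", "NOUN")]
print(np_chunks(sentence))
```

With this toy rule the adverb blocks the determiner from joining the chunk; a realistic grammar would also allow adverbs inside adjectival modifiers.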
Alignment
• YAWA: 4 stages, evaluated over the data in the Shared Task on Word Alignment, Romanian-English track, organized at ACL 2005
• Current: P = 88.80%, R = 74.83%, F = 81.22%
• 91714 alignments, manually checked, out of which 25346 are NULL-alignments
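The reported F value is consistent with the balanced F-measure, the harmonic mean of precision and recall. A minimal Python check, assuming that this is the measure used (the slides do not define F explicitly):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall YAWA figures from the slide, in percent.
overall_f = f_measure(88.80, 74.83)
print(round(overall_f, 2))
```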
Alignment
1. Content words alignment: based on the translation lexicons; P = 94.08%, R = 34.99%, F = 51.00%
2. Inside-chunks alignment: simple empirical rules to align the words within the corresponding chunks; P = 89.90%, R = 53.90%, F = 67.40%
3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions
4. Correction phase: the wrong links introduced mainly in stage 3 are now removed.
Alignment
The parallel corpus = 183 files in XCES format
<tu id="1">
<seg lang="en">
<s id="Timex.en.1">
<w lemma="on_the_other_hand" ana="14+,ADVE" chunk="Ap#1">On_the_other_hand</w>
<c>,</c>
<w lemma="it" ana="13+,PPER3" chunk="Vp#1">it</w>
<w lemma="be" ana="3+,AUX3" chunk="Vp#1">'s</w>
<w lemma="turn" ana="1+,PPRE" chunk="Vp#1">turning</w>
<w lemma="out" ana="5+,PREP">out</w>
<w lemma="to" ana="15+,TO" chunk="Vp#2">to</w>
<w lemma="be" ana="1+,VINF" chunk="Vp#2">be</w>
<w lemma="another" ana="22+,PI">another</w>
<w lemma="very" ana="14+,ADVE" chunk="Ap#2">very</w>
<w lemma="bad" ana="1+,ADJE" chunk="Ap#2,Np#1">bad</w>
<w lemma="financial" ana="1+,ADJE" chunk="Ap#2,Np#1">financial</w>
<w lemma="week" ana="1+,NN" chunk="Np#1">week</w>
…
</s>
</seg>
<seg lang="ro">
<s id="Timex.ro.1">
<w lemma="pe_de_altă_parte" ana="14+,R" chunk="Ap#1">Pe_de_altă_parte</w>
<c>,</c>
<w lemma="sine" ana="12+,PXA" chunk="Vp#1">se</w>
<w lemma="dovedi" ana="1+,V3" chunk="Vp#1">dovedeşte</w>
<w lemma="a" ana="15+,QN" chunk="Vp#2">a</w>
<w lemma="fi" ana="1+,VN" chunk="Vp#2">fi</w>
<w lemma="alt" ana="22+,PI" chunk="Np#1">altă</w>
<w lemma="săptămână" ana="1+,NSRN" chunk="Np#1">săptămână</w>
<w lemma="financiar" ana="1+,ASN" chunk="Np#1,Ap#2">financiară</w>
<w lemma="foarte" ana="14+,R" chunk="Np#1,Ap#2">foarte</w>
<w lemma="prost" ana="1+,ASN" chunk="Np#1,Ap#2">proastă</w>
…
</s>
</seg>
</tu>
Annotation import
Based on the Romanian-English lexical alignment
Annotation import
For every pair of sentences S_ro and S_en from the TimeBank parallel corpus, with T_en the equivalent English sentence:
1. construct a list E of pairs of English text fragments with sequences of English indexes from S_en and T_en.
   E = {<"In the"; 1, 2>, <"Philippines"; 3>, <", a"; 4, 5>, <"four"; 6>, <"year"; 7>, <"low."; 8, 9>}.
Annotation import
2. add to every element of E the XML context in which that text fragment appeared in the original English TimeBank.
   E' = {<"In the"; 1, 2; s>, <"Philippines"; 3; s, ENAMEX>, …}
3. construct the list RW of Romanian words along with the transferred XML contexts, using E' and the lexical alignment between S_ro and S_en. If a word in S_ro is not aligned, the top context for it, namely s, is considered.
   RW = {<"În"; s>, <"Filipine"; s, ENAMEX>, …}.
Annotation import
4. construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
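Steps 2 to 4 of the import procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: XML contexts are reduced to tuples of tag names (real contexts would also carry attributes and ids, so that two adjacent but distinct elements are not merged), the alignment is a hypothetical dictionary from Romanian to English word positions, and the names transfer and to_xml are invented for the example.

```python
from itertools import groupby

# E' restricted to the two elements spelled out on the slide:
# (English fragment, English word indexes, XML context).
E_PRIME = [
    ("In the", (1, 2), ("s",)),
    ("Philippines", (3,), ("s", "ENAMEX")),
]


def transfer(e_prime, ro_words, alignment):
    """Steps 3 and 4: project contexts through the alignment, then conflate."""
    # Context of every English word index, read off E'.
    context_of = {index: ctx for _, indexes, ctx in e_prime for index in indexes}
    # Step 3: RW. Unaligned Romanian words get the top context ("s",).
    rw = [(word, context_of.get(alignment.get(position), ("s",)))
          for position, word in enumerate(ro_words, start=1)]
    # Step 4: R. Merge adjacent words that share the same context.
    return [(" ".join(word for word, _ in group), ctx)
            for ctx, group in groupby(rw, key=lambda pair: pair[1])]


def to_xml(fragments):
    """Serialize R, wrapping each fragment in the tags below the sentence level."""
    parts = []
    for text, ctx in fragments:
        for tag in reversed(ctx[1:]):
            text = f"<{tag}>{text}</{tag}>"
        parts.append(text)
    return "<s>" + " ".join(parts) + "</s>"


# Hypothetical alignment: Romanian word 1 -> English word 1, word 2 -> word 3.
result = transfer(E_PRIME, ["În", "Filipine"], {1: 1, 2: 3})
print(to_xml(result))
```

Grouping by context alone is the main simplification: it reproduces the conflation of step 4 for this example but would need element identities to be safe on full TimeML markup.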
Annotation import
Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer keeps only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian.
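A minimal Python sketch of this filtering rule, assuming tags are represented as dictionaries of attributes. The function name, the attribute list and the set of transferred ids are hypothetical; the example pretends that event e3 was lost during transfer, so its instance and the links that mention it are dropped.

```python
# Attributes through which a link refers to other annotated structures.
REF_ATTRS = ("eventInstanceID", "relatedToEventInstance", "relatedToTime", "signalID")


def filter_offline(instances, links, transferred_ids):
    """Keep instances and links whose referenced ids all survived the transfer."""
    known = set(transferred_ids)
    # An instance survives only if its event was transferred.
    kept_instances = [m for m in instances if m["eventID"] in known]
    # Surviving instance ids become valid targets for links.
    known.update(m["eiid"] for m in kept_instances)
    kept_links = [link for link in links
                  if all(link[attr] in known for attr in REF_ATTRS if attr in link)]
    return kept_instances, kept_links


instances = [
    {"eiid": "ei1994", "eventID": "e190"},
    {"eiid": "ei1995", "eventID": "e189"},
    {"eiid": "ei1996", "eventID": "e3"},
]
links = [
    {"eventInstanceID": "ei1994", "relatedToEventInstance": "ei1995", "relType": "BEFORE"},
    {"eventInstanceID": "ei1996", "relatedToTime": "t192", "relType": "BEFORE"},
    {"eventInstanceID": "ei1996", "relatedToEventInstance": "ei1995", "relType": "IS_INCLUDED"},
]

# Hypothetical outcome of the inline transfer: e3 did not make it to Romanian.
kept_instances, kept_links = filter_offline(instances, links, {"e190", "e189", "t192"})
print(len(kept_instances), len(kept_links))
```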
Annotation import
TimeML tags    Transferred    % transferred
events          7703          97.07
instances       7706          97.05
timexes         1356          95.89
signals          668          97.09
alinks           249          93.96
slinks          2831          96.55
tlinks          6122          95.38
TOTAL          26635          96.53
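The percentages can be recomputed from the tag counts of the English TimeBank and the transferred counts. The Python sketch below does so; the dictionary and function names are illustrative. Values computed this way may differ from the slide in the last digit for some rows, which would be consistent with the slide truncating rather than rounding, though that is only an inference.

```python
# Tag counts in the English TimeBank and counts transferred to Romanian,
# both taken from the slides.
ORIGINAL = {"events": 7935, "instances": 7940, "timexes": 1414, "signals": 688,
            "alinks": 265, "slinks": 2932, "tlinks": 6418}
TRANSFERRED = {"events": 7703, "instances": 7706, "timexes": 1356, "signals": 668,
               "alinks": 249, "slinks": 2831, "tlinks": 6122}


def percent_transferred(tag):
    """Share of the original tags of one type that reached the Romanian corpus."""
    return 100 * TRANSFERRED[tag] / ORIGINAL[tag]


total_percent = 100 * sum(TRANSFERRED.values()) / sum(ORIGINAL.values())

for tag in ORIGINAL:
    print(f"{tag:10s} {percent_transferred(tag):6.2f}")
print(f"{'TOTAL':10s} {total_percent:6.2f}")
```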
Conclusions & future work
• improve & evaluate the annotation transfer
• adequacy of temporal theories to Romanian
• (semi-)automatic mark-up of the temporal information in Romanian texts (news + literature)
Thank you!
(Temporal) Questions?