  • Number of slides: 39

Semi-automatic Annotation of the Romanian TimeBank 1.2
Corina Forăscu, Radu Ion, Dan Tufiş
Faculty of Computer Science, Al. I. Cuza University of Iasi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy
[email protected] uaic.ro, {radu, tufis}@racai.ro
CALP 07 workshop @ RANLP

Outline
1. Fundamentals
2. TimeML & TimeBank
3. Corpus processing
   1. Translation
   2. Pre-processing
   3. Alignment
   4. Annotation import
4. Conclusions

Fundamentals
Temporal information in natural language:
1. Time-denoting expressions: references to a calendar or clock system
   • expressed by NPs, PPs, or AdvPs
   • the 23rd of May, 1998; Monday; tomorrow; the second semester
2. Event-denoting expressions: references to an event
   • expressed by
     1. sentences, more precisely their syntactic head, the main verb: John listens to the music.
     2. noun phrases: Israel will ask the USA to delay a military strike against Iraq.

Motivation (1)
NLP applications to benefit:
• lexicon induction and linguistic investigation, using very large annotated corpora;
• question answering (questions like when, how often or how long);
• information extraction or information retrieval;
• machine translation (translated and normalized temporal references; mappings between the different behavior of tenses from language to language);
• discourse processing: temporal structure of discourse and summarization.

Motivation (2)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
Now he realised that it was exactly because of this incident that he had suddenly decided to come home and to begin his journal exactly today.

Motivation (3)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.

State of the Art
[Timeline figure; year marks shown: 1947, 1998, 2000, 2001, 2002, 2004, 2005, 2006. Entries:]
• Reichenbach: The tenses of verbs
• MUC 7: TIMEX
• ACL: Temporal and Spatial Information Processing
• STAG (Setzer)
• TIDES 2001: TIMEX2 v. 1.0.2
• DAML-Time
• LREC 2002: Annotation Standards for Temporal Information in Natural Language
• TERQAS: TimeML v. 1.0
• ACE-TERN: TIMEX2 v. 1.1
• ACE-TERN: TIMEX2 v. 1.2
• TARSQI: TimeML v. 1.2
• ACL 2005: TARSQI system
• ACL-COLING WS: ARTE, Annotating and Reasoning about Time and Events
• Time Symposium

TERQAS 2002
• TimeML v. 1.0, a metadata standard for:
  • marking events,
  • their temporal anchoring and
  • links in news articles
• TimeBank corpus v. 1.0
• guidelines for temporal annotation

TimeML v. 1.2
A metadata standard developed especially for news articles, for marking:
• events: EVENT, MAKEINSTANCE
• temporal anchoring of events: TIMEX3, SIGNAL
• links between events and/or timexes: TLINK, ALINK, SLINK

Events (1)
• situations that happen or occur; states or circumstances in which something obtains or holds true
• expressed by tensed verbs, adjectives, nominalizations
The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share.
7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION

Events (2)
The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share. Analysts say [e28] much of Kellogg's erosion [e204] has been in such core brands as Corn Flakes, ...

Instances
• Based on the event annotation: how many different instances or realizations a given event has (at least one)
• Carries the tense and aspect of the verb-denoted event
John learns [e1] twice on Monday.
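
The event/instance split can be illustrated with a small sketch. The fragment below is a hand-written, TimeML-style snippet for the example sentence, not taken from TimeBank; the instance ids (ei1, ei2) and the tense/aspect values are assumptions made for illustration. It groups MAKEINSTANCE elements by the event they realize, so that one EVENT can carry several instances.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written illustration (not from TimeBank): one EVENT, two instances,
# because the learning happens "twice". Ids and attribute values are assumed.
FRAGMENT = """<TimeML>
John <EVENT eid="e1" class="OCCURRENCE">learns</EVENT> twice on Monday.
<MAKEINSTANCE eiid="ei1" eventID="e1" tense="PRESENT" aspect="NONE"/>
<MAKEINSTANCE eiid="ei2" eventID="e1" tense="PRESENT" aspect="NONE"/>
</TimeML>"""


def instances_by_event(xml_text):
    """Return {event id: [instance ids]} for every MAKEINSTANCE in the text."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for instance in root.iter("MAKEINSTANCE"):
        grouped[instance.get("eventID")].append(instance.get("eiid"))
    return dict(grouped)


INSTANCES = instances_by_event(FRAGMENT)
```

Links such as TLINK then refer to the instance ids rather than to the event id, which is why every event needs at least one instance.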

Temporal expressions: TIMEX3 (1)
Explicit & implicit temporal expressions:
• Times: 11 o'clock; midnight
• Dates:
  • fully specified (May 23, 2006; winter, 2005)
  • underspecified (Monday; next week; last month; two years ago)
• Durations: two months; three hours
• Sets: every week; every Tuesday

Temporal expressions: TIMEX3 (2)
Examples: 10/30/89; the next two years or so; soon

Temporal signals: SIGNAL
Function words that indicate how temporal objects are to be related to each other:
• temporal prepositions (on, in, at, from, to, before, after, during), conjunctions (before, after, while, when) and/or modifiers
• negative expressions
• modal verbs
• prepositions signaling modality ("to")
• special characters denoting ranges in temporal expressions: "-" and "/"

Dependencies: LINKs
• Temporal relations: TLINK
  • anchors to time
  • orders between time and events
• Aspectual relations: ALINK
  • phases of an event
• Subordinating relations: SLINK
  • events that syntactically subordinate other events

Temporal relations: TLINK (1)
• temporal relation between two temporal elements (event-event, event-timex)
• EVENTs take part through their INSTANCEs
• 13 relTypes, as in Allen's interval relations:
  • simultaneous
  • identical
  • one before (/after) the other
  • one immediately before (/after) the other
  • one including / being included in the other
  • one holding during the duration of the other
  • one being the beginning (/ending) of the other
  • one being begun (/ended) by the other

Temporal relations: TLINK (2)
The oat-bran craze [e190/ei1994] has cost [e189/ei1995] the world's largest cereal maker market share. The company's president quit [e3/ei1996] suddenly.
[Figure: temporal links between craze (ei1994), cost (ei1995), quit (ei1996) and the date 10/30/89 (t192)]

Temporal relations: TLINK (3)
[Figure: temporal links between craze (ei1994), cost (ei1995), quit (ei1996) and the date 10/30/89 (t192)]

Aspectual relations: ALINK
Relationship between an aspectual event and its argument event:
• Initiation: John started [ei5] to read [ei6].
• Culmination: John finished [ei5] assembling [ei6] the table.
• Termination: John stopped talking.
• Continuation: John kept talking.

Subordination relations: SLINK
For contexts introducing relations between two events, of type:
• Modal: John should have bought some wine.
• Factive: John forgot that he was in Boston yesterday.
• Counterfactive: John prevented the divorce.
• Evidential: John said he bought some wine.
• Negative evidential: John denied he bought only beer.
• Conditional: If John leaves today, Mary will cry.

TimeBank 1.2
• 183 English news report documents, TimeML annotated, distributed through LDC
• 4715 sentences with 10586 unique lexical units, from a total of 61042 lexical units
Non-TimeML markup in TimeBank 1.1:
• structure information: header
• named entity recognition: [tag names not preserved]
• sentence boundary information: [tag name not preserved]

TimeML tags   count
events         7935
instances      7940
timexes        1414
signals         688
alinks          265
slinks         2932
tlinks         6418
TOTAL         27592

Translation
• 2 "trained translators"; one final correction
• Translation desiderata:
  • 1-1 sentence aligned
  • preserving POS
  • verb tense mapped onto Romanian
  • format of the dates, moments of day and numbers conforms to the norms of written Romanian
• 4715 sentences (translation units), 65375 lexical tokens, including punctuation marks, representing 12640 lexical types

Preprocessing the corpus
• Tokenisation: MtSeg, with idiomatic expressions, clitic splitting
• POS-tagging: TnT, adapted & improved to determine the POS of unknown words
• Lemmatisation: probabilistic, based on a lexicon
• Chunking: regular expressions (REs) over POS tags to determine non-recursive NPs, AdvPs, PPs
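
The chunking step can be pictured with a minimal sketch of regular expressions over POS tags. This is not the chunker used for the corpus: the coarse tagset, the one-letter codes and the NP pattern are assumptions chosen for illustration. Each tag is encoded as a single character, so that match offsets in the encoded string coincide with token indexes.

```python
import re

# Hypothetical coarse tagset mapped to one-letter codes; unknown tags become "x".
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "ADV": "R", "PREP": "P", "VERB": "V"}

# Illustrative non-recursive NP: optional determiner, adjectives, one or more
# nouns, then adjectives (Romanian adjectives often follow the noun).
NP_PATTERN = re.compile(r"D?A*N+A*")


def chunk_nps(tagged_tokens):
    """Return the list of NP chunks, each chunk being a list of word forms."""
    encoded = "".join(TAG_CODES.get(tag, "x") for _word, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(encoded):
        # One character per token, so the span indexes the token list directly.
        chunks.append([word for word, _tag in tagged_tokens[match.start():match.end()]])
    return chunks


# Tags assigned by hand for illustration, not produced by a tagger.
SENTENCE = [("se", "PRON"), ("dovedeşte", "VERB"), ("a", "PART"), ("fi", "VERB"),
            ("altă", "DET"), ("săptămână", "NOUN"), ("financiară", "ADJ"),
            ("foarte", "ADV"), ("proastă", "ADJ")]
NP_CHUNKS = chunk_nps(SENTENCE)
```

Because the pattern contains no nested group that could itself be an NP, the chunks are non-recursive by construction; AdvP and PP patterns would be added in the same way.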

Alignment
• YAWA: 4 stages, evaluated over the data in the Shared Task on Word Alignment, Romanian-English track, organized at ACL 2005
• Current: P = 88.80%, R = 74.83%, F = 81.22%
• 91714 alignments, manually checked, out of which 25346 are NULL-alignments

Alignment
1. Content words alignment: based on the translation lexicons; P = 94.08%, R = 34.99%, F = 51.00%
2. Inside-chunks alignment: simple empirical rules to align the words within the corresponding chunks; P = 89.90%, R = 53.90%, F = 67.40%
3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions
4. Correction phase: the wrong links introduced mainly in stage 3 are now removed
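
The F values reported for the aligner are consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below recomputes it from the rounded P and R figures on the slides; since the inputs are already rounded, agreement is only expected to within rounding. The function and variable names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) in percent, copied from the slides.
REPORTED = {
    "overall": (88.80, 74.83, 81.22),
    "stage 1": (94.08, 34.99, 51.00),
    "stage 2": (89.90, 53.90, 67.40),
}

# Absolute difference between the recomputed and the reported F, per entry.
DIFFERENCES = {name: abs(f_measure(p, r) - f) for name, (p, r, f) in REPORTED.items()}
```

The jump in recall from stage 1 to stage 2, at a small cost in precision, is what lifts F from about 51 to about 67.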

Alignment
The parallel corpus = 183 files in XCES format
English: On_the_other_hand , it 's turning out to be another very bad financial week
…
Romanian: Pe_de_altă_parte , se dovedeşte a fi altă săptămână financiară foarte proastă

Annotation import
Based on the Romanian-English lexical alignment

Annotation import
For every pair of sentences Sro and Sen from the TimeBank parallel corpus, with Ten the equivalent English sentence in the original, annotated TimeBank:
1. construct a list E of pairs of English text fragments with sequences of English indexes from Sen and Ten.
   E = {<"In the"; 1, 2>, <"Philippines"; 3>, <", a"; 4, 5>, <"four"; 6>, <"year"; 7>, <"low ."; 8, 9>}.

Annotation import
2. add to every element of E the XML context in which that text fragment appeared in the original English TimeBank.
   E' = {<"In the"; 1, 2; s>, <"Philippines"; 3; s, ENAMEX>, …}
3. construct the list RW of Romanian words along with the transferred XML contexts, using E' and the lexical alignment between Sro and Sen. If a word in Sro is not aligned, the top context for it, namely s, is considered.
   RW = {<"În"; s>, <"Filipine"; s, ENAMEX>, …}.

Annotation import
4. construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
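
Steps 2 to 4 can be sketched as follows. This is a simplified reading of the procedure, not the authors' implementation: it assumes each Romanian word is aligned to at most one English token, and it ignores the case where two distinct adjacent annotations share the same tag names. The first two elements of E' come from the slide; the third element's context, the Romanian words after "Filipine" and the alignment are made up for illustration.

```python
from itertools import groupby

TOP_CONTEXT = ("s",)  # context given to unaligned Romanian words (step 3)


def english_contexts(e_prime):
    """Map each English token index to the XML context of its fragment in E'."""
    contexts = {}
    for _fragment, indexes, context in e_prime:
        for index in indexes:
            contexts[index] = tuple(context)
    return contexts


def build_rw(ro_words, alignment, e_prime):
    """Step 3: pair every Romanian word with the context of its aligned token."""
    contexts = english_contexts(e_prime)
    rw = []
    for ro_index, word in enumerate(ro_words, start=1):
        en_index = alignment.get(ro_index)  # None when the word is unaligned
        rw.append((word, contexts.get(en_index, TOP_CONTEXT)))
    return rw


def conflate(rw):
    """Step 4: merge adjacent words that appear in the same XML context."""
    return [(" ".join(word for word, _context in group), context)
            for context, group in groupby(rw, key=lambda pair: pair[1])]


E_PRIME = [("In the", [1, 2], ["s"]),
           ("Philippines", [3], ["s", "ENAMEX"]),
           (", a", [4, 5], ["s"])]           # context of this element is assumed
RO_WORDS = ["În", "Filipine", ",", "un"]     # words after "Filipine" are assumed
ALIGNMENT = {1: 1, 2: 3, 3: 4, 4: 5}         # Romanian index -> English index (assumed)

RW = build_rw(RO_WORDS, ALIGNMENT, E_PRIME)
R = conflate(RW)
```

Writing R out as XML then amounts to opening, for each fragment, the tags of its context below s and closing them after the fragment.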

Annotation import
Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer kept only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian
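
One possible reading of this filter is sketched below. It is an interpretation, not the authors' code: the two-pass order (instances first, then links) and the attribute names, written here as recalled from TimeML 1.2, are assumptions and should be checked against the specification. Tags are represented as plain (name, attributes) pairs.

```python
# Attribute names assumed to hold references to other TimeML elements.
REFERENCE_ATTRS = ("eventInstanceID", "relatedToEventInstance",
                   "subordinatedEventInstance", "timeID", "relatedToTime",
                   "signalID")


def filter_offline_tags(offline_tags, transferred_ids):
    """Keep offline tags whose references all point to transferred structures."""
    known = set(transferred_ids)
    kept = []
    # Pass 1: an instance survives if its event was transferred; its own id
    # then becomes a valid target for the links examined in pass 2.
    for name, attrs in offline_tags:
        if name == "MAKEINSTANCE" and attrs["eventID"] in known:
            kept.append((name, attrs))
            known.add(attrs["eiid"])
    # Pass 2: a link survives only if every element it references is known.
    for name, attrs in offline_tags:
        if name != "MAKEINSTANCE":
            references = [attrs[a] for a in REFERENCE_ATTRS if a in attrs]
            if all(ref in known for ref in references):
                kept.append((name, attrs))
    return kept


# Made-up example: event e2 was not transferred, so ei2 and link l2 are dropped.
OFFLINE = [
    ("MAKEINSTANCE", {"eiid": "ei1", "eventID": "e1"}),
    ("MAKEINSTANCE", {"eiid": "ei2", "eventID": "e2"}),
    ("TLINK", {"lid": "l1", "eventInstanceID": "ei1", "relatedToTime": "t1"}),
    ("TLINK", {"lid": "l2", "eventInstanceID": "ei2", "relatedToTime": "t1"}),
]
KEPT = filter_offline_tags(OFFLINE, {"e1", "t1"})
```

Under this reading, a single untransferred EVENT removes its instances and every link that touches them, which would explain why the link tags show slightly lower transfer rates than the inline tags.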

Annotation import
TimeML tags   transferred   %
events           7703       97.07
instances        7706       97.05
timexes          1356       95.89
signals           668       97.09
alinks            249       93.96
slinks           2831       96.55
tlinks           6122       95.38
TOTAL           26635       96.53
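
The percentages follow from dividing the transferred counts by the tag counts of the English TimeBank 1.2 given earlier in the deck. The sketch below recomputes them; it assumes the percentages are truncated (not rounded) to two decimals, which is what reproduces every figure in the table, and it uses integer arithmetic to avoid floating-point surprises. Names are illustrative.

```python
# Tag counts in the English TimeBank 1.2 and in the Romanian transfer (from the slides).
ORIGINAL = {"events": 7935, "instances": 7940, "timexes": 1414, "signals": 688,
            "alinks": 265, "slinks": 2932, "tlinks": 6418}
TRANSFERRED = {"events": 7703, "instances": 7706, "timexes": 1356, "signals": 668,
               "alinks": 249, "slinks": 2831, "tlinks": 6122}


def rate_hundredths(transferred, original):
    """Transfer rate in hundredths of a percent, truncated (e.g. 9707 = 97.07%)."""
    return transferred * 10000 // original


RATES = {tag: rate_hundredths(TRANSFERRED[tag], ORIGINAL[tag]) for tag in ORIGINAL}
TOTAL_ORIGINAL = sum(ORIGINAL.values())
TOTAL_TRANSFERRED = sum(TRANSFERRED.values())
TOTAL_RATE = rate_hundredths(TOTAL_TRANSFERRED, TOTAL_ORIGINAL)

for tag, rate in RATES.items():
    print(f"{tag:<10}{rate // 100}.{rate % 100:02d}%")
```

Summing the rows also confirms the two totals reported on the slides.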

Conclusions & future work
• improve & evaluate the annotation transfer
• study the adequacy of temporal theories for Romanian
• (semi-)automatic mark-up of the temporal information in Romanian texts (news + literature)

Thank you!
(Temporal) Questions?