- Number of slides: 120

The Prague Dependency Treebank and Valency Annotation
Jan Hajič, Zdeňka Urešová
Institute of Formal and Applied Linguistics, School of Computer Science
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Sept. 18, 2005, Borovets, Bulgaria
RANLP Tutorial: The Prague Dependency Treebank

Tutorial Outline
- (H1) The Prague Dependency Treebank (PDT)
  - Introduction
  - Morphology and Surface Dependency Syntax
  - "Physical" markup: The Prague Markup Language (PML)
- (H2) The Tectogrammatical Annotation of the PDT
  - "Deep" Syntactic Structure, Valency
  - Topic/focus, Coreference
- (H3) Tectogrammatical Annotation & Valency Lexicon
  - Verbs and Nouns: Relating Form, Syntax and Semantics
  - Linking the Corpus and the Lexicon
  - Demo: annotation of data, valency

The Prague Dependency Treebank Project (Czech Treebank)
- 1996-2005-...
- 1998: PDT v. 0.5 released (JHU workshop)
  - 400k words annotated, unchecked
- 2001: PDT 1.0 released (LDC)
  - 1.3 MW annotated, morphology & surface syntax
- 2005: PDT 2.0 release planned
  - 0.8 MW annotated (50k sentences)
  - the "tectogrammatical layer"
    - underlying (deep) syntax

Related Projects (Treebanks)
- Prague Czech-English Dependency Treebank
  - WSJ portion of PTB, translated to Czech
  - automatically analyzed
    - English side (PTB), too
- Prague Arabic Dependency Treebank
  - apply same representation to annotation of Arabic
  - surface syntax so far
- Both were published in 2004 (LDC)

PDT (Czech) Data
- 4 sources:
  - Lidové noviny (daily newspaper, incl. extra sections)
  - DNES (Mladá fronta Dnes) (daily newspaper)
  - Vesmír (popular science magazine, monthly)
  - Českomoravský Profit (economic journal, weekly)
- Full articles selected
  - article ~ DOCUMENT (basic corpus unit)
- Time period: 1990-1995
- 1.8 million tokens (~110 thousand sentences)

PDT Annotation Layers
- L0 (w) Words (tokens)
  - automatic segmentation and markup only
- L1 (m) Morphology
  - Tag (full morphology, 13 categories), lemma
- L2 (a) Analytical layer (surface syntax)
  - Dependency, analytical dependency function
- L3 (t) Tectogrammatical layer ("deep" syntax)
  - Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
- Releases marked on the slide: PDT 1.0 (2001), PDT 2.0 (2005)

Tokenization, Segmentation, Sentence Breaks (L0, w-layer)
- Basic principles
  - Fully automatic
    - Will have to be the same for the manually annotated part as well as for other plain-text data
  - No access to any linguistic knowledge
    - ...beyond, say, really fail-safe lists of certain types of abbreviations, language identification, coding scheme, and letter classification (upper/lower/...)
  - Standard output markup
    - unified coding scheme (today, Unicode in most cases)

Tokenization
- Words
  - What is a word? (word boundaries)
    - Treatment of hyphens, apostrophes, periods, ...
  - Numbers w/digits (normalization)
    - "periods", thousand separators
    - Types of numbers (?)
      - cardinal, ordinal, money, SSN, tel/fax/..., dates, ...
  - Mixed letters and digits
- Rule of thumb:
  - Split whenever there is the slightest doubt!

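The "split whenever in doubt" rule can be illustrated with a tokenizer that uses character classes only and no linguistic knowledge. This is a minimal sketch, not the PDT tokenizer; the name `tokenize` and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical rule: a run of letters, a run of digits, or any single
# other non-space character each becomes one token.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\S")

def tokenize(text):
    """Split aggressively: hyphens, periods, commas and apostrophes
    become separate tokens, and letters are separated from digits."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    print(tokenize("elektro-technický, 1.300 Kč"))
```

Splitting this eagerly loses nothing, because later layers can still treat adjacent tokens as one unit, while an unsplit token cannot be separated downstream without re-tokenizing.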
Tokenization
- Capitalization
  - Main issues (the "true case"):
    - Names (not identified yet!)
    - Start of sentence (don't know it yet either!)
    - Typographical conventions (unmarked in most cases)
  - Nontrivial
    - Headings
- Rule of thumb:
  - don't solve it (yet), just keep it & possibly mark it

(No) Segmentation, I
- Segmentation ~ (for us) splitting "inside" words ("between two letters")
- examples (not segmented in PDT):
  - elektro|technický (electrotechnical)
  - bílo|červeno|modrý (white-red-blue)
  - tisíci|hlavý (one-thousand-headed)
  - polo|šílený (half-mad)
  - na|č = na co (onto what; a contraction, ~ isn't)
  - pracoval|s = pracoval jsi (you have worked, ~ y'know)
  - za|č|s = za co jsi (for what you have <verb>)

(No) Segmentation, II
- Ambiguity
  - přenos:
    - přenos - transmission
    - přeno|s - you-have-been argued-with
  - a few others
- However: it is not very frequent (Cz, En, Ar) →
  - can be handled by expanded dictionary & tagset design
  - therefore no segmentation (of this kind)!

Sentence Boundaries
- Chicken-and-egg problem:
  - To analyze a text linguistically, we need to know sentence boundaries...
  - but...
  - To know sentence boundaries, we would need to have the text linguistically analyzed.
- Solution:
  - Do something good enough in most cases
  - ...maybe redo it later in the manually annotated part

Layer 1 (m-layer): Morphology
- Prerequisites for the manual annotation process:
  - Tokenized data
  - Annotation guidelines
  - Annotation tool
    - Manual decision-making support
    - Offline (or online) morphological analyzer
  - Quality checking tool
  - Process description
- Results (manually annotated data) to be used for...
  - tagger training, linguistic research, basis for further annotation, ...

Morphological Attributes
- Tag: 13 categories
  - Example: AAFP3----3N----
    - position 1: Adjective
    - position 2: Regular
    - position 3: Feminine
    - position 4: Plural
    - position 5: Dative
    - position 6: no poss. gender
    - position 7: no poss. number
    - position 8: no person
    - position 9: no tense
    - position 10: superlative
    - position 11: negated
    - position 12: no voice
    - position 13: reserve 1
    - position 14: reserve 2
    - position 15: base var.
  - Ex.: nejnezajímavějším "(to) the most uninteresting"
- Lemma: POS-unique identifier
  - books/verb -> book-1, went -> go, to/prep. -> to-1

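Because the tag is positional, it can be decoded by pairing each character with a category name. The sketch below follows the order of categories listed on the slide; the English key names and the function `decode_tag` are assumptions made for illustration.

```python
# Category names in slide order: 13 categories plus two reserve positions.
# The key names themselves are hypothetical labels.
POSITIONS = [
    "pos", "subpos", "gender", "number", "case",
    "poss_gender", "poss_number", "person", "tense",
    "grade", "negation", "voice", "reserve1", "reserve2", "variant",
]

def decode_tag(tag):
    """Map a 15-character positional tag to a {category: value} dict."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return dict(zip(POSITIONS, tag))

if __name__ == "__main__":
    decoded = decode_tag("AAFP3----3N----")
    # Show only the categories that carry a value ("-" means not applicable).
    print({name: value for name, value in decoded.items() if value != "-"})
```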
Morphological Tagset
- 13 categories, 4452 plausible tags (combinations)

Morphological Analysis
- Formally: MA: A+ → Pow(L × T)
  - MA(f) = { [l, t] }; f ∈ A+ (the token), l ∈ L (lemma), t ∈ T (tag)
  - tokens taken in isolation
  - no attempt to solve e.g. auxiliaries vs. full verbs
- Ex.: MA("má") = {
    [mít, VB-S---3P-AA---],
    [můj, PSFS1------1],
    [můj, PSFS5-S1------1],
    [můj, PSNP1-S1------1],
    [můj, PSNP4-S1------1],
    [můj, PSNP5-S1------1] }
  - glosses: má lit. "has" or "my"; mít lit. "to have"; můj lit. "my"

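The formal definition, a function from a token to a set of (lemma, tag) pairs, can be sketched as a dictionary lookup. This is a toy illustration under stated assumptions: `TOY_DICTIONARY` contains only a few of the analyses of "má" quoted above, and the names `analyze` and `lemmas` are hypothetical.

```python
from typing import FrozenSet, Tuple

Analysis = Tuple[str, str]  # (lemma, tag)

# Toy dictionary holding a subset of the analyses shown for "má".
TOY_DICTIONARY = {
    "má": frozenset({
        ("mít", "VB-S---3P-AA---"),
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
    }),
}

def analyze(form: str) -> FrozenSet[Analysis]:
    """MA(f): every (lemma, tag) pair for a token taken in isolation.
    Unknown forms get the empty set; no context is consulted."""
    return TOY_DICTIONARY.get(form, frozenset())

def lemmas(form: str):
    """All candidate lemmas of a form, ignoring the tags."""
    return {lemma for lemma, _ in analyze(form)}

if __name__ == "__main__":
    print(sorted(lemmas("má")))
```

Returning a set rather than a single best pair reflects the point of the slide: choosing among the analyses is left to the later disambiguation step.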
Morphological Analysis: Implementation
- Dictionary-based
  - covers 800k words (lemmas), ~20 mil. forms (w/tag)
- C code implementation
  - standard (regular) derivations on-the-fly; ex.: spojit → spojený, spojenost (joinedliness), spojitelný (joinable), spojitelnost (joinability), plus the adverbs glossed "joinedly" and "joinably"
  - irregular forms listed in dictionary (w/tags)
  - no phonological processing (concatenation only)
  - grammatical prefixes only: negation, superlative

The Morphological Annotation Tool
- DA: manual disambiguation tool

The Process of Morphological Annotation
- From tokenized to annotated text:
  - tokenized text (auto, w-layer)
  - → (Auto) morphological analysis, using the morphological dictionary
  - → text w/morph. interpretations
  - → Manual morphological disambiguation (DA), using the annotation guidelines
  - → text w/selected interpretation
  - → Manual adjudication
  - → annotated text (m-layer)

Using the Results: Morphological Disambiguation
- Full morphological disambiguation
  - more complex than (e.g. English) POS tagging
- Three taggers:
  - (Pure) HMM
  - Feature-based (MaxEnt-like)
    - used in the PDT distribution
  - Voted Perceptron (M. Collins, EMNLP'02)
- All: ~94-95% accuracy (perceptron is best)
  - rule & statistic combination: tiny improvement (Hajič et al., ACL 2001)

The Segmentation Problem: Possible Solution (Arabic)
- Tokenization / segmentation not always trivial
  - Arabic, German, Chinese, Japanese
- Find max. no. of segments
  - 4 for Arabic
- expand every solution (morph. analysis) to the same number of segments, adding "blank" segments to the end
- concatenate tags (→ same length)
- concatenate "lemmas" (roots, ...)
- Result:
  - the same formal definition; can be converted back to segments trivially
  - tagging solves segmentation!

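The padding-and-concatenation idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the actual Arabic tagset: each segment carries a hypothetical one-character tag, "-" marks a blank segment, lemmas are joined with "+", and the example segments are invented.

```python
MAX_SEGMENTS = 4   # the maximum the slide gives for Arabic
BLANK_TAG = "-"    # hypothetical one-character tag for a blank segment
SEPARATOR = "+"    # hypothetical lemma separator; lemmas must not contain it

def pack(segments):
    """Turn [(lemma, tag), ...] into one (lemma, tag) pair of fixed shape."""
    if len(segments) > MAX_SEGMENTS:
        raise ValueError("too many segments")
    padding = [("", BLANK_TAG)] * (MAX_SEGMENTS - len(segments))
    padded = list(segments) + padding
    lemma = SEPARATOR.join(l for l, _ in padded)
    tag = "".join(t for _, t in padded)
    return lemma, tag

def unpack(lemma, tag):
    """Recover the segments: drop every position whose tag is blank."""
    parts = lemma.split(SEPARATOR)
    return [(l, t) for l, t in zip(parts, tag) if t != BLANK_TAG]

if __name__ == "__main__":
    packed = pack([("wa", "C"), ("kitab", "N")])
    print(packed)
    print(unpack(*packed))
```

Since every analysis now has the same shape as an unsegmented (lemma, tag) pair, a tagger that picks one combined tag implicitly picks the segmentation too.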
Layer 2 (a-layer): Analytical Syntax
- Dependency + Analytical Function
- [Figure: dependency tree with governor and dependent marked] The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.

Analytical Syntax: Functions
- Main (for [main] semantic lexemes):
  - Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
  - "Double" dependency: AtrAdv, AtrObj, Atr
- Special (function words, punctuation, ...):
  - Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
  - Prepositions/Conjunctions: AuxP, AuxC
  - Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
- Structural
  - Ellipsis: ExD; Coordination etc.: Coord, Apos

Example
- Že bude zle, bylo jasné hned.
- lit. That it will go wrong, (that) was clear immediately.

Surface Syntax Example
- Complete sentence: Sb, Pred, Obj
  - The-baker bakes rolls.
  - Pekař peče housky.

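An analytical tree of this kind can be stored as one record per token, holding the index of its governor and its analytical function. The sketch below is illustrative only: the `ANode` class is hypothetical, index 0 stands for the technical root, and attaching the final period to the root as AuxK is an assumption about the tree that the slide showed graphically.

```python
from dataclasses import dataclass

@dataclass
class ANode:
    form: str
    head: int   # 1-based index of the governor; 0 = technical root
    afun: str   # analytical function

# "Pekař peče housky." as Sb, Pred, Obj plus final punctuation.
SENTENCE = [
    ANode("Pekař", 2, "Sb"),
    ANode("peče", 0, "Pred"),
    ANode("housky", 2, "Obj"),
    ANode(".", 0, "AuxK"),   # assumed attachment
]

def dependents(nodes, head_index):
    """Forms of all nodes governed by the node at head_index."""
    return [n.form for n in nodes if n.head == head_index]

if __name__ == "__main__":
    print(dependents(SENTENCE, 2))
```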
Surface Syntax Example
- Analytical verb form:
  - (he) allowed would-be to-be enrolled
  - směl by být zapsán

Surface Syntax Example
- Predicate with copula (state)
  - (the) pool has-been already filled
  - bazén byl již napuštěn

Surface Syntax Example
- Passive construction (action)
  - (The) book has-been translated [by Mr. X]
  - Kniha byla přeložena

Surface Syntax Example
- Complement
  - we (are) came three
  - my jsme přišli tři

Surface Syntax Example
- Complement when NP is missing
  - (he) has cooked [his meals]
  - má uvařeno

Surface Syntax Example
- Object
  - (he) gave him a-book
  - dal mu knihu

Surface Syntax Example
- Object used for infinitive of analytical verb forms
  - (he) could come
  - Mohl by přijít

Surface Syntax Example
- Relative clause (embedded)
  - (a) house, which is expensive, (we) (to-ourselves) will-not-buy
  - dům, který je drahý, si nekoupíme

Surface Syntax Example
- Coordination
  - ... (to) magic, mystic(,) etc.
  - ... magii, mystice apod.

Surface Syntax Example
- Apposition
  - cheap, i.e. under 5 crown
  - levný, tj. pod 5 korun

Surface Syntax Example
- Incomplete phrases
  - Peter works well, but Paul badly
  - Petr pracuje dobře, ale Pavel špatně

Surface Syntax Example
- Variants (equality)
  - (he) bought shoes for boy
  - koupil boty pro kluka

Using the Results: Parsing
- Several parsers of Czech
  - Analytical layer dependency syntax
  - Trained on PDT 1.0 data, 1.2 mil. words
  - Collins (98), Charniak (00), Žabokrtský (02), Ribarov (04), Nivre (05), Zeman (05), McDonald (05)
- Best results (accuracy: percent of correct dependencies):
  - 84-85% for a single parser, > 86% for a combination

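The accuracy figure quoted here is the percentage of tokens whose governor was predicted correctly. A minimal sketch of that computation follows; the function name and the list-of-head-indices input format are assumptions, and the published evaluations may differ in details such as the treatment of punctuation.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Percent of tokens whose predicted governor equals the gold one.
    Both arguments are equal-length lists of head indices."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted differ in length")
    if not gold_heads:
        raise ValueError("no tokens to evaluate")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)

if __name__ == "__main__":
    # Four tokens, one wrong attachment.
    print(attachment_accuracy([2, 0, 2, 0], [2, 0, 2, 2]))
```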
A step aside...
- Technical description of the markup
- The Prague Markup Language (PML)

The Prague Markup Language
- XML-based, UTF-8 coding used
- Stand-off annotation
  - strict hierarchical scheme
  - 4 files for each annotated document ~ 4 layers of annotation
- Can capture intermediate annotation
  - e.g., ambiguous analysis after morphological preprocessing
- Lexical resources linked in
  - valency lexicon referenced from t-layer data

XML Annotation Layers
- Strictly top-down links
- w+m+a can be easily "knitted"
- API for cross-layer access (programming)
- PML Schema / Relax NG
- [With slight modification, can be used for spoken data (audio as layer "-1")]

The Prague Markup Language Example
- m-layer data, linked to w-layer:

    <m id="m-tr/_12941_01_00013.fs-s1w4">
      <src.rf>manual</src.rf>
      <w>
        <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
        <trans>basic</trans>
      </w>
      <form>pocházela</form>
      <lemma>pocházet_:T</lemma>
      <tag>VpQW---XR-AA---</tag>
    </m>
    <m id="m-tr/_12941_01_00013.fs-s1w5">...

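A stand-off record like this one can be read with a standard XML parser. The sketch below parses the fragment exactly as it appears on the slide, using Python's `xml.etree.ElementTree`; it assumes no XML namespace and the element names shown here, which may differ from the released PML schema. The function `read_m_node` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The m-layer element from the slide, with extraction spaces removed.
M_FRAGMENT = """
<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>
"""

def read_m_node(xml_text):
    """Extract form, lemma, tag and the w-layer reference of one <m>."""
    m = ET.fromstring(xml_text.strip())
    w_link = m.find("w").findtext("dest.rf")   # e.g. "w#w-tr/..."
    return {
        "id": m.get("id"),
        "form": m.findtext("form"),
        "lemma": m.findtext("lemma"),
        "tag": m.findtext("tag"),
        # The part after "#" is the id of the node in the w-layer file.
        "w_ref": w_link.split("#", 1)[1],
    }

if __name__ == "__main__":
    print(read_m_node(M_FRAGMENT))
```

The reference points downward from the m-layer to the w-layer, matching the strictly top-down linking described for the layers.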
(End of Lecture 1)

Layer 3 (t-layer): Tectogrammatical Annotation
- Underlying (deep) syntax
- 4 sublayers:
  - dependency structure, (detailed) functors
    - valency annotation
  - topic/focus and deep word order
  - coreference (mostly grammatical only)
  - all the rest (grammatemes):
    - detailed functors
    - underlying gender, number, ...
- Total
  - 39 attributes (vs. 5 at m-layer, 2 at a-layer)

Analytical vs. Tectogrammatical annotation
- [Figure: analytical and tectogrammatical trees side by side; TR: sublayer 1 only shown]
- Callouts: Underlying verb + tense; Deep function; Elided Actor in; Another ellipsis...; Prepositions out

Layer 3: Tectogrammatical
- Underlying (deep) syntax
- 4 sublayers:
  - dependency structure, (detailed) functors
  - topic/focus and deep word order
  - coreference (mostly grammatical only)
  - all the rest (grammatemes):
    - detailed functors
    - underlying gender, number, ...

Example - TR
- Graphical visualization
- He worked as an engineer and he liked the work.
- lit. [He] worked as an-engineer and the-work him pleased.

Dependency Structure
- Similar to the surface (Analytical) layer... but:
  - certain nodes deleted
    - auxiliaries, non-autosemantic words, punctuation
  - some nodes added
    - based on word (mostly verb, noun) valency
    - some ellipsis resolution
  - detailed dependency relation labels (functors)

Tectogrammatical Functors
- "Actants": ACT, PAT, EFF, ADDR, ORIG (syntactic)
  - modify: verbs, nouns, adjectives
  - cannot repeat in a clause, usually obligatory
- Free modifications (~50), semantically defined
  - can repeat; optional, sometimes obligatory
  - Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RESTR, DESC; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ...
- Special
  - Coordination, Rhematizers, Foreign phrases, ...

Tectogrammatical Example
- Analytical verb form:
  - (he) allowed would-be to-be enrolled
  - směl by být zapsán
- Collapsed; additional attributes (grammatemes): conditional + "allow"

Tectogrammatical Example
- Predicate with copula (state)
  - (the) pool has-been already filled
  - bazén byl již napuštěný

Tectogrammatical Example
- Passive construction (action)
  - (The) book has-been translated [by Mr. X]
  - Kniha byla přeložena
- Callouts in the tree figure: Disappeared; Added

Tectogrammatical Example
- Object
  - (he) gave him a-book
  - dal mu knihu
- Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor's valency frame

Tectogrammatical Example
- Relative clause (embedded)
  - (a) house, which is expensive, (we) (to-ourselves) will-not-buy
  - dům, který je drahý, si nekoupíme

Tectogrammatical Example
- Incomplete phrases
  - Peter works well, but Paul badly
  - Petr pracuje dobře, ale Pavel špatně
- Callout in the tree figure: Added

Deep Word Order, Topic/Focus
- Example:
  - Analytical dep. tree: Baker bakes rolls. vs. Baker_IC bakes rolls.

Deep Word Order, Topic/Focus
- Deep word order:
  - from "old" information to the "new" one (left-to-right) at every level (head included)
  - projectivity by definition (almost...)
    - i.e., partial level-based order -> total d.w.o.
- Topic/focus/contrastive topic
  - attribute of every node (t, f, c)
  - restricted by d.w.o. and other constraints

Coreference
- Grammatical (easy)
  - relative clauses
    - which, who
      - Peter and Paul, who...
  - control
    - infinitival constructions
      - John promised to go...
  - reflexive pronouns
    - {him, her, them}self(-ves)
      - Mary saw herself in...
- [Tree fragment labels: promise PRED; go PAT; John ACT; he ACT; home DIR3]

Coreference
- Textual
  - Ex.: Peter moved to Iowa after he finished his Ph.D.

Grammatemes
- Detailed functors (subfunctors)
  - only for some functors:
    - TWHEN: before/after
    - LOC: next-to, behind, in-front-of, ...
    - also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
- Lexical (underlying)
  - number (SG/PL), tense, modality, degree of comparison, ...
  - strictly only where necessary (agreement!)

Example - simplified view
- Se zuby jsem měl v minulosti jen problémy.
- With teeth I-have had in the-past only problems.

Fully Annotated Sentence
- The boundaries of some problems seem to be clearer after they were revived by Havel's speech.

Definition of Valency
- Ability ("desire") of words (verbs, nouns, adjectives) to combine with other units of meaning
- Properties of valency:
  - Specific for every word meaning (in general)
    - leave: sb left sth for sb vs. sb left from somewhere
    - same as in PropBank: leave.02 vs. leave.01
  - Typically strongly correlates with surface form
    - morphological case (~ ending), preposition+case, ...
  - Semantic constraints are very dangerous

Structure of Valency
- word (lemma): vyměnit (to replace)
  - word sense group 1: vyměnit1
    - valency frame and surface expression:

        slot 1   slot 2   slot 3
        ACT      PAT      EFF
        Nom.     Acc.     za+Acc.

  - word sense group 2: vyměnit2
    - ...

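The lemma / sense group / frame / slot hierarchy maps naturally onto nested records. The sketch below encodes only the vyměnit1 frame given on the slide; the class names and the `surface_form` helper are hypothetical, not the PDT-VALLEX data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Slot:
    functor: str   # e.g. ACT, PAT, EFF
    form: str      # surface expression, e.g. "Nom." or "za+Acc."

@dataclass
class Frame:
    frame_id: str
    slots: List[Slot]

@dataclass
class Entry:
    lemma: str
    frames: List[Frame] = field(default_factory=list)

# Only the first sense group is spelled out on the slide.
VYMENIT = Entry("vyměnit", [
    Frame("vyměnit1", [
        Slot("ACT", "Nom."),
        Slot("PAT", "Acc."),
        Slot("EFF", "za+Acc."),
    ]),
])

def surface_form(entry: Entry, frame_id: str, functor: str) -> Optional[str]:
    """Surface expression of a functor in a given frame, or None."""
    for frame in entry.frames:
        if frame.frame_id == frame_id:
            for slot in frame.slots:
                if slot.functor == functor:
                    return slot.form
    return None

if __name__ == "__main__":
    print(surface_form(VYMENIT, "vyměnit1", "EFF"))
```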
The Valency Lexicon PDT-VALLEX
- Valency frames
  - each verb, some nouns, adjectives
- Basic set prepared in advance; annotators add entries on-the-go; checking and approval process follows (consistency)
- VALLEX
  - more detailed and complex annotation of valency
  - Žabokrtský, Lopatková (2005), VALLEX 1.0
- All about valency: http://ckl.ms.mff.cuni.cz/~semecky/vallex/

PDT-VALLEX Entry
- dosáhnout: "to reach", "to get [sb to do sth]"
- browser/user-formatted example: [screenshot]

Corpus <-> Valency Lexicon
- Corpus: Sentence 2035, Sentence 15345, Sentence 51042
- Lexicon: ENTRY: uzavřít
  - vf1: ACT(.1) CPHR({smlouva}.4); ex.: u. dohodu (close a contract)
  - vf2: ACT(.1) PAT(.4); ex.: u. pokoj (close a room, house)

The Annotation Process
- 4 sublayers
  - work on structure first, rest in parallel
- Structure
  - automatic preprocessing - programmed conversion from analytical layer annotation
- Grammatemes
  - mostly automatically (based on lower layers' annotation), manual checking, corrections
- Cross-sublayer/cross-layer checking
  - partly automatic, then manual

The Annotation Process Scheme
- [Figure: annotation process scheme]

Using the Results (t-layer)
- Preliminary!
  - PDT 2.0 not published yet (fall 2005)
  - final, checked data available now (50k sentences)
- Functor assignment
  - > 80% accuracy on manually annotated structure
- Tectogrammatical parser
  - in the works
- Coreference
  - preliminary results: > 80%
- Valency
  - frame assignment > 70%

(End of Lecture 2)

Valency & Tectogrammatical Annotation
- Valency and...
  - (surface) form
- Annotation tools
  - TrEd
    - structural annotation
    - valency lexicon integration
- Search
  - TrEd, Netgraph

Valency & Form
- lemma (AL): uvažovat
- ACT: surface ellipsis, node disappears
- PAT: preposition 'o' and a locative case

Tectogrammatical / Analytical
- uvažovat – uvažovat: PAST / já.Masc – PPart.Masc.SG (Pred) / být.Pres.SG.1 (AuxV)
- pravidlo.PL.PAT – o.Prep (AuxP) / pravidlo.PL.Loc (Obj)
- já – 0: CONTEXT NEEDED?
- from another sentence: pravidlo.PL.PAT – pravidlo.PL.Acc (Obj)

Valency & Form
- Valency frame:
  - (per each sense of word)
  - (obligatory) modifiers ↔ functors
- Simplest case:
  - functor → form
  - surface form of a functor: particular case
  - Ex.: ACT in nominative (he says)
  - Ex.: PAT in accusative (she sees him)
- ...but it is not always so simple (as we have already seen)!

Valency & Form: Constraints
- Tree structure: nodes n1, n2, n3, n4 [figure]
- (Sets of) Constraints:
  - n1: lemma=uvažovat mode=active
  - n2: case=Nom afun=Sb
  - n3: lemma=o afun=AuxP
  - n4: case=Loc afun=Obj

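Each constraint set is a list of attribute=value pairs that a surface node must satisfy, which can be checked mechanically. The sketch below tests attribute constraints only; the shape of the tree (which node governs which) is not checked, and the example node attributes beyond those named on the slide are assumptions.

```python
# Constraint sets copied from the slide, keyed by node name.
CONSTRAINTS = {
    "n1": {"lemma": "uvažovat", "mode": "active"},
    "n2": {"case": "Nom", "afun": "Sb"},
    "n3": {"lemma": "o", "afun": "AuxP"},
    "n4": {"case": "Loc", "afun": "Obj"},
}

def satisfies(node, constraints):
    """True if the node carries every required attribute value."""
    return all(node.get(attr) == value for attr, value in constraints.items())

def frame_matches(nodes, constraint_sets):
    """True if every named node exists and satisfies its constraints.
    Tree structure is deliberately ignored in this sketch."""
    return all(
        name in nodes and satisfies(nodes[name], required)
        for name, required in constraint_sets.items()
    )

# Hypothetical surface nodes for a clause with "uvažovat o pravidlech".
NODES = {
    "n1": {"lemma": "uvažovat", "mode": "active", "afun": "Pred"},
    "n2": {"lemma": "já", "case": "Nom", "afun": "Sb"},
    "n3": {"lemma": "o", "afun": "AuxP"},
    "n4": {"lemma": "pravidlo", "case": "Loc", "afun": "Obj"},
}

if __name__ == "__main__":
    print(frame_matches(NODES, CONSTRAINTS))
```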
(General) Valency Lexicon Entries
- [Table flattened in extraction; column headers: Entry, Sense #, Frame #, Valency, Optionality, Form alternatives]
- Cell contents as extracted: 1 1 1 ACT PAT 2 2 ACT PAT LOC {ci} 3 ACT PAT DIR3 {ci} 3 4 ACT PAT {ci} 1 1 ACT 2 2 ACT INT {ci} 1 1 ACT PAT {ci} 2 2 ACT PAT {ci} 2 3 {ci} {ci}

Valency Lexicon Simplification
- Independent form for each slot of a particular valency frame
  - ACT, PAT, ...: own constraint, not a global one
  - functor (oblig./opt.) ↔ constraints for that functor
  - Ex.:
    - lemma1 ACT(Nom.) PAT(o+6) (to consider a rule)
    - lemma2 ACT(Nom.) PAT(4) (create a rule)
- Standard "transformations" of frame form
  - passivization, reflexivization, ...

Example: Valency & Form
- Simple 1:1:
  - ex.: create: ACT(Nom) PAT(Acc)
  - verb in infinitive: INTT(Inf)
  - subordinate clause: PAT(verb)
  - class of words with generic verbs: CPHR({class})
  - no constraint: (often) LOC, TWHEN
    - general constraint for a given functor applies
  - ...more!

Example: Valency & Form
- 1:2
  - relative clause
  - to_say: ACT EFF
  - node constraints: lemma=say mode=active; afun=Sb case=Nom; afun=AuxC lemma=that; afun=Obj POS=verb
  - linear representation: EFF(that[.v])

Example: Valency & Form
- 1:2
  - idiomatic phrase
  - to_follow2: ACT DPHR
  - node constraints: lemma=follow mode=active; afun=Sb case=Nom; afun=Obj lemma=interest case=4 number=pl; afun=Atr lemma=own
  - linear representation: DPHR(interest.P4[own.#])

Example: Valency & Form
- 1:3
  - idiomatic phrase
  - to_follow2: ACT DPHR
  - node constraints: lemma=follow mode=active; afun=Sb case=Nom; afun=Obj lemma=interest case=4 number=pl; afun=Atr lemma=own; afun=Atr lemma=his

Example: Valency & Form
- 1:4
  - idiomatic phrase
  - to_run27: DPHR
  - node constraints: lemma=run mode=active; afun=AuxP lemma=on; afun=Sb lemma=frost case=Nom; afun=Obj lemma=back; afun=Atr POS=poss

Valency and Translation
- leave:
  - leave-1
    - to leave sth for sb
  - leave-2
    - to leave [from] somewhere
- Translating (from English into Czech):
  - which equivalent to choose?
    - nechat vs. odjet/opustit
  - which prepositions, cases, ... to use?
    - accusative vs. "z" ("from") with genitive vs. ...?

Valency and Translation
- leave-1: ACT() PAT() LOC()
  - nechat-3: ACT(.1) PAT(.4) LOC()
- leave-2: ACT() DIR1(from.)
  - odjet-1: ACT(.1) DIR1(z.[.2])

Valency and Text Generation
- Tectogrammatical Representation
  - has all the information to (re)generate the surface form of the sentence:
    - in a "generalized" form
    - non-redundant (almost... but for generation, it is o.k.)
  - ...except the links to a-layer, however
    - links used only for training [statistical models for] parsing/generation modules
    - not present when e.g. doing text planning, translation, ...
- valency dictionary: form of "learned" knowledge

Valency and Text Generation
- Using valency for...
  - ...getting the correct (lemma, tag) of verb arguments
- Example:
  - VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
  - t-layer tree: starat_se PRED; Martin ACT; tygr PAT
  - generated (lemma, tag) sequence: Martin ..1...; se; starat V...; o ....; tygr ..4..
  - "Martin se stará o tygry." ("Martin takes care of tigers.")

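The step from the frame to the argument forms can be sketched as a lookup that assigns each argument its case and, where the frame requires one, a preposition. This is an illustration only: the dictionary encoding of ACT(.1) PAT(o.[.4]) and the function `realize_arguments` are assumptions, and real generation would go on to inflect the lemmas and order the words.

```python
# Hypothetical encoding of the frame ACT(.1) PAT(o.[.4]):
# ACT is a bare nominative, PAT is the preposition "o" plus accusative.
STARAT_SE = {
    "ACT": {"prep": None, "case": "1"},
    "PAT": {"prep": "o", "case": "4"},
}

def realize_arguments(frame, arguments):
    """arguments maps functor -> lemma.
    Returns (lemma, case) pairs; prepositions get case None."""
    realized = []
    for functor, lemma in arguments.items():
        slot = frame[functor]
        if slot["prep"] is not None:
            realized.append((slot["prep"], None))
        realized.append((lemma, slot["case"]))
    return realized

if __name__ == "__main__":
    print(realize_arguments(STARAT_SE, {"ACT": "Martin", "PAT": "tygr"}))
```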
Tectogrammatical Annotation Tools l Manual annotation l l 4 groups of annotators ~ 4 sublayers Special graphical tool (Tr. Ed) l l Customizable graphical tree editor Preprocessing l l Data from analytical layer, preprocessed Online dependency function preassignment Sept. 18, 2005, Боровец, БГ RANLP Tutorial: The Prague Dependency Treebank 101
The [Manual] Annotation Tool l Perl/Perl. Tk based, platform-independent l l Perl as the “macro” language l l “unlimited” online processing capability Flexibility for interactive checking l l Linux, Windows 95/98/2000, Solaris, . . . split screen, graphical “diff” function Customization, printing, “plugins”, . . . Sept. 18, 2005, Боровец, БГ RANLP Tutorial: The Prague Dependency Treebank 102
The “Tr. Ed” Tree Editor l Graphical tool Tr. Ed l Main screen: Original sentence: [This year’s flu season is still quiet in Europe. ] Editing window customization Run a macro Multiwindow editing/compare Sept. 18, 2005, Боровец, БГ RANLP Tutorial: The Prague Dependency Treebank 103
Valency Lexicon in Tr. Ed to write sth (about sth) Sept. 18, 2005, Боровец, БГ RANLP Tutorial: The Prague Dependency Treebank 104
Annotating the Links
- Stand-off annotation principles
  - Links to another layer
  - Links to lexicon
- Minimal work on link annotation (close to zero)
- Macro commands in TrEd
  - transparently keep track of merged nodes, splits, etc., and adapt links correspondingly
- Result:
  - almost no extra work
  - final check after annotators do the last pass
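To picture the bookkeeping such macros take off the annotator's hands, here is a small Python sketch of re-attaching links when two nodes are merged. It assumes fields modeled on the a.lex.rf, a.aux.rf and val_frame.rf attributes listed later in the tutorial; `TNode`, `merge_into` and all ids are hypothetical, and this is not TrEd's own implementation (TrEd macros are written in Perl).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TNode:
    """Toy tectogrammatical node that stores only stand-off links (ids)."""
    id: str
    t_lemma: str
    a_lex_rf: Optional[str] = None                     # "main" analytical node
    a_aux_rf: List[str] = field(default_factory=list)  # other analytical nodes
    val_frame_rf: Optional[str] = None                 # link into the lexicon


def merge_into(kept: TNode, absorbed: TNode) -> TNode:
    """Merge `absorbed` into `kept`; its a-layer links become auxiliary links."""
    moved = ([absorbed.a_lex_rf] if absorbed.a_lex_rf else []) + absorbed.a_aux_rf
    for ref in moved:
        # never duplicate a link and never demote the main link
        if ref != kept.a_lex_rf and ref not in kept.a_aux_rf:
            kept.a_aux_rf.append(ref)
    return kept


# Made-up ids for "se stará": the reflexive particle is absorbed by the verb.
verb = TNode(id="t-w3", t_lemma="starat", a_lex_rf="a-w3")
particle = TNode(id="t-w2", t_lemma="se", a_lex_rf="a-w2")

merged = merge_into(verb, particle)
merged.t_lemma = "starat_se"
merged.val_frame_rf = "v-starat-se-1"  # invented frame id, for illustration
print(merged)
```

Because the nodes hold ids rather than copies of the analytical data, a merge only has to move a few references, which is why the link annotation costs the annotators close to nothing.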
The “Old” PDT 1.0
- Morphology (1.8 MW) & Surface syntax (1.5 MW)
  - 7 attributes + dependency
- SGML format (csts.dtd) + compact “FS”
- Mixed (single-file) annotation
- TrEd (graphical viewer/editor), NetGraph (search capability)
  - simple visualization
What’s New in PDT 2.0
- Tectogrammatical layer (0.8 MW)
  - 39 node attributes + dependency
  - valency dictionary (PDT-VALLEX)
- XML stand-off annotation (“PML”, 4 layers)
- New data division (train/dtest/etest)
  - added morphological annotation to all data
  - corrections of PDT 1.0 files (morphology, syntax)
- Improved tools:
  - TrEd, btred/ntred (batch tree corpus processing)
    - new features, better visualization
Tectogrammatical attributes I
- node typing
  - complex, coap, qcomplex, root, atom, ...
- functor, subfunctor
  - TWHEN: TWHEN.basic, TWHEN.before
- is_member, is_generated, is_parenthesis, is_dsp_root, is_state, quot_type, ...
- grammatemes (16):
  - aspect, degcmp, deontmod, sempos, tense, indeftype, politeness, person, ...
Tectogrammatical attributes II
- topic/focus:
  - tfa, deepord
- valency: t_lemma, val_frame.rf
- bookkeeping: id
- coref_gram.rf, coref_text.rf, compl.rf
  - reference to TR node, type of coreference
- sentmod
- Linking to analytical layer
  - a.lex.rf (“main” analytical node), a.aux.rf (others)
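Since the coreference attributes are references to other tectogrammatical nodes, following them amounts to an id lookup. The Python sketch below walks a chain of textual coreference links back to the first mention. It assumes that coref_text.rf holds a list of node ids; the node table, the ids and the function name `antecedent_chain` are invented for illustration and are not taken from the PDT data.

```python
from typing import Dict, List

# Toy t-layer nodes keyed by their `id` attribute; all values are invented.
nodes: Dict[str, dict] = {
    "t-1": {"t_lemma": "Martin", "coref_text.rf": []},
    "t-2": {"t_lemma": "#PersPron", "coref_text.rf": ["t-1"]},
    "t-3": {"t_lemma": "#PersPron", "coref_text.rf": ["t-2"]},
}


def antecedent_chain(node_id: str, table: Dict[str, dict]) -> List[str]:
    """Follow the first coref_text.rf link repeatedly; return the ids visited."""
    chain: List[str] = []
    seen = {node_id}
    current = node_id
    while True:
        refs = table[current].get("coref_text.rf", [])
        if not refs or refs[0] in seen:  # end of chain, or a cycle in the data
            return chain
        current = refs[0]
        seen.add(current)
        chain.append(current)


for ref in antecedent_chain("t-3", nodes):
    print(ref, nodes[ref]["t_lemma"])
```

The same lookup pattern applies to the other `.rf` attributes: val_frame.rf would be resolved against the lexicon and a.lex.rf against the analytical layer instead of the node table.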
PDT 2.0: The Data
- Data sizes
TrEd
“Batch” data processing
- TrEd -> “batch/networked” btred/ntred

    #!btred -T -N --context PML_T -e GetGenParents
    # For every generated node (one with no surface counterpart), print its
    # t_lemma, the t_lemma of each effective parent, and the file position.
    sub GetGenParents {
      if ($this->{is_generated} == 1) {      # tectogrammatical generated node
        my @parents = GetEParents($this);    # now, get all parents
        if (@parents) {                      # some parents present (not the top of the tree)
          foreach my $parent (@parents) {
            print $this->{t_lemma}, "\t", $parent->{t_lemma}, "\t";
            FPosition();
          }
        }
      }
    }
Parallel data processing
- ntred/btred
Some pointers
- Current version of PDT: v2.0 beta
  - all three levels, 1.9/1.5/0.8 Mwords
  - http://ufal.mff.cuni.cz/pdt2.0
- http://ufal.mff.cuni.cz
  - Projects -> Treebank
- http://www.ldc.upenn.edu
  - LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0)
- http://www.clsp.jhu.edu: Workshop 2002
  - Using TL for MT Generation
(End of Lecture 3, Tutorial)