
Annotation of Grammatemes in the Prague Dependency Treebank 2.0
Magda Razímová, Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
{razimova, zabokrtsky}@ufal.mff.cuni.cz

Outline of the talk
- Introduction
- Prague Dependency Treebank 2.0
- Annotation of grammatemes
  - Motivation
  - Grammateme attributes
  - Two-level node hierarchy
  - Examples of grammateme value assignment
- Final remarks

LREC 2006, Annotation Science

Introduction
- grammatemes in the PDT 2.0
  - one type of attributes of nodes of a deep-syntactic tree
  - capture morphological meanings that are semantically indispensable
    • number for nouns, degree of comparison for adjectives, tense for verbs, etc.
- annotation of grammatemes
  - the last task in the PDT 2.0 annotation procedure
  - can be assigned automatically, profiting from the already available annotation:
    • annotation of the same sentence at the lower layers
    • already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)

Historical background and development of the PDT project
- mid 1960s: Praguian Functional Generative Description (Petr Sgall et al.)
- 1994: Czech National Corpus
- 1995: PDT started
- 1998: PDT 0.5 pre-release
- 2001: PDT 1.0 released by LDC
  - manual annotation of morphology and surface syntax
- 2006: PDT 2.0 to be released by LDC
  - interlinked morphological, surface-syntactic and complex deep-syntactic annotation
    • including annotation of grammatemes

Layers of annotation
- tectogrammatical layer
  - deep-syntactic dependency tree
- analytical layer
  - surface-syntactic dependency tree
- morphological layer
  - m-lemma and m-tag associated with each token
- word layer
  - original text, segmented on word boundaries
Example sentence (lit.: He-was would went to-forest.): He would have gone to the forest.

Interlinking the layers
- any unit at any layer has an ID that is unique within the PDT
- neighboring layers are connected by top-down pointers
Example sentence (lit.: He-was would went to-forest.): He would have gone to the forest.

Size of the PDT 2.0 data (i)
- 7,129 manually annotated textual documents
  - all documents annotated at the m-layer
    • 16,065 sentences with 1,960,657 tokens
  - 75 % of the m-layer data annotated at the a-layer
    • 5,338 documents, 87,980 sentences, 1,504,847 tokens
  - 44 % of the m-layer data annotated also at the t-layer
    • 3,168 documents, 49,442 sentences, 833,357 tokens

Size of the PDT 2.0 data (ii)
- training data (80 %)
- development test data (10 %)
- evaluation test data (10 %)

M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes m-lemma and m-tag)
- positional m-tag: 15 characters
  - 1. (main) POS
  - 2. detailed POS
  - 3. gender
  - 4. number
  - 5. case
  - ...
Example sentence (lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer.): Some contours of the problem seem to be clearer after the resurgence by Havel's speech.
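
The positional m-tag described above can be read by simple string indexing. The following is a minimal sketch, assuming only what the slide states (15 characters, with the first five positions being POS, detailed POS, gender, number and case). The function name, the field names and the sample tag string are illustrative assumptions, not part of the PDT tools.

```python
from typing import Dict

# Names of the first five tag positions, as listed on the slide.
# Positions 6-15 are not named in the talk, so this sketch leaves them out.
POSITION_NAMES = ("pos", "detailed_pos", "gender", "number", "case")


def parse_m_tag(tag: str) -> Dict[str, str]:
    """Split a 15-character positional m-tag into its first five named fields."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character tag, got {len(tag)} characters")
    return {name: tag[index] for index, name in enumerate(POSITION_NAMES)}


# Illustrative tag string (an assumption made for this example).
fields = parse_m_tag("NNIS2-----A----")
print(fields["pos"], fields["number"], fields["case"])
```

Because every category has a fixed position, a rule that needs the number of a noun only has to look at the fourth character of the tag.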

A-layer
- rooted ordered tree with labeled nodes and edges
- a-nodes
  - one token of the m-layer is represented by exactly one a-node
  - labeled with a-lemmas (identical with word forms)
- a-edges
  - represent dependency relations (Sb, Obj, Adv, Atr)
  - represent non-dependency relations (Coord)
- the analytical function attribute appears as an a-node attribute
Example sentence: Some contours of the problem seem to be clearer after the resurgence by Havel's speech.

T-layer
- rooted ordered tree with labeled nodes and edges
- t-nodes
  - complex typed feature structures
  - represent auto-semantic words
  - functional words do not have nodes of their own
  - artificially added nodes
- t-edges
  - dependency relations (functor)
  - non-dependency relations (coordination constructions)
- the functor attribute appears as a t-node attribute
Example sentence: Some contours of the problem seem to be clearer after the resurgence by Havel's speech.

Areas of annotation at the t-layer
- tree structure
- t-lemma attribute
- dependency relation (functor and subfunctor)
- topic-focus attributes
- co-reference attributes
- node typing attributes (nodetype and sempos)
- grammateme attributes
Example: Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
(lit.: [To] all was handed over a certificate of successful graduation from the course.)
They all received a certificate of successful graduation from this course.

Grammatemes: Motivation
- grammatemes
  - t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
  - semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)

Grammateme attributes
- 15 grammatemes
  - number, gender, person, politeness
  - indeftype, numertype, negation, degcmp
  - tense, aspect, verbmod, deontmod, dispmod, resultative, iterativeness

Conditioned presence/absence of grammatemes
- obviously, not all grammatemes are relevant for all nodes
  - no tense for "dog", no degree of comparison for "(he) waits", etc.
- how to formally declare the presence/absence of a given grammateme attribute in a given node?
  - the need for node typing
- chosen solution: two-level typing
  - 1st level: 8 more general types of nodes
    • grammatemes are relevant for only one of them
  - 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech

Presence/absence of grammateme values: Two-level t-node hierarchy
- 1st level: attribute nodetype
- 2nd level: attribute sempos

First level of the hierarchy: attribute nodetype
- 8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
- fully automatic annotation, using
  - the tree structure: root
  - t-attributes
    • t-lemma: qcomplex | list
    • functor: atom | coap | dphr | fphr
  - else: complex
Example: Levnější benzín na Východě, dražší na Západě (Cheaper gasoline in the East, more expensive one in the West)
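
The decision order on this slide (tree structure first, then t-lemma, then functor, otherwise complex) can be sketched as a short cascade. This is only an illustration under stated assumptions: the talk does not list which t-lemmas or functors trigger which value, so the two lookup tables below are hypothetical placeholders, and the `TNode` class and `assign_nodetype` function are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trigger values, chosen only to make the sketch runnable.
# The real inventories are not given in the talk.
T_LEMMA_TO_NODETYPE = {"#Gen": "qcomplex", "#Forn": "list"}
FUNCTOR_TO_NODETYPE = {"RHEM": "atom", "CONJ": "coap", "DPHR": "dphr", "FPHR": "fphr"}


@dataclass
class TNode:
    t_lemma: str
    functor: str
    parent: Optional["TNode"] = None


def assign_nodetype(node: TNode) -> str:
    """Apply the cascade: tree structure, then t-lemma, then functor, else complex."""
    if node.parent is None:  # decided by the tree structure alone
        return "root"
    if node.t_lemma in T_LEMMA_TO_NODETYPE:  # decided by the t-lemma
        return T_LEMMA_TO_NODETYPE[node.t_lemma]
    if node.functor in FUNCTOR_TO_NODETYPE:  # decided by the functor
        return FUNCTOR_TO_NODETYPE[node.functor]
    return "complex"  # everything else


root = TNode(t_lemma="", functor="")
noun = TNode(t_lemma="benzín", functor="PAT", parent=root)
print(assign_nodetype(root), assign_nodetype(noun))
```

Only nodes that fall through to `complex` go on to the second level of the hierarchy, so the order of the checks matters.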

Second level of the hierarchy: attribute sempos
- only complex nodes
- grouped into semantic parts of speech
- 19 values of the attribute sempos: n.. | adj.. | adv.. | v..
- fully automatic annotation, using
  - m-tag
  - t-lemma
  - other t-attributes
- the sempos value delimits the set of relevant grammatemes
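
How a sempos value delimits the relevant grammatemes can be pictured as a lookup table from sempos to a set of attribute names. The sketch below is an assumption-laden illustration: the talk confirms only that number belongs to nouns, degree of comparison to adjectives, and tense and verbmod to verbs; the exact keys (in particular `adj.denot`) and the full contents of each set are assumptions, and the table covers only three of the 19 values.

```python
from typing import Dict, Set

# Partial, illustrative table (an assumption, not the complete PDT definition).
RELEVANT_GRAMMATEMES: Dict[str, Set[str]] = {
    "n.denot": {"number", "gender"},
    "adj.denot": {"degcmp", "negation"},
    "v": {"tense", "aspect", "verbmod", "deontmod",
          "dispmod", "resultative", "iterativeness"},
}


def irrelevant_grammatemes(sempos: str, assigned: Dict[str, str]) -> Set[str]:
    """Return the assigned grammatemes that the given sempos does not license."""
    allowed = RELEVANT_GRAMMATEMES.get(sempos, set())
    return set(assigned) - allowed


# A noun node carrying a verbal grammateme is flagged.
print(irrelevant_grammatemes("n.denot", {"number": "sg", "verbmod": "cdn"}))
```

A check of this kind makes the presence or absence of each grammateme a formal property of the node type rather than a convention.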

Values of nodetype and sempos in the PDT 2.0: an overview
[Figure: nodetype values; sempos values]

Grammateme value assignment
- n-tred environment for processing the PDT data: http://ufal.mff.cuni.cz/~pajas
- automatic annotation
  - 2000 lines of Perl code
    • crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
  - rules using a special economical notation
    • 2000 lines written in a text file
  - lexical resources
    • special-purpose lists of adverbs / verbs
- manual annotation of special problems
  - two annotators working in parallel
  - simplified annotation environment: treebank positions extracted into simple HTML forms

Simple HTML-based environment for manual annotation
[Screenshot of the annotation form; example sentence, lit.: The difference [you] would have to pay yourself.]

Automatic vs. manual assignment
- at the t-layer of the PDT 2.0:
  - 1,594,333 grammateme values assigned at 550,947 complex nodes
  - manually assigned:
    • 17,520 grammateme values
    • inter-annotator agreement: 70-85 %

Grammateme assignment and m-tag
- number grammateme: values sg | pl
- assigned automatically using the m-tag
  - e.g. les (forest)
    • m-layer: tag NNIS2-----A---
    • t-layer: number=sg (node label in the tree: n.denot, number=sg)
- manual assignment
  - nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
  - e.g. dveře (door/doors)
    • m-layer: always plural
    • t-layer: annotator decision sg | pl
Example sentence (lit.: He-was would went to-forest.): He would have gone to the forest.
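
The split between automatic and manual assignment of the number grammateme can be sketched as follows. This is a hedged illustration, not the original Perl rules: the function name, the one-word stand-in for the plural-only noun list, the sample tag strings, and the assumption that the number position uses `S` for singular and `P` for plural are all introduced here for the example.

```python
from typing import Optional, Set

NUMBER_INDEX = 3  # 4th position of the positional m-tag (0-based index 3)

# Assumed tag codes for the number position.
TAG_NUMBER_TO_GRAMMATEME = {"S": "sg", "P": "pl"}


def assign_number(m_lemma: str, m_tag: str, plural_only: Set[str]) -> Optional[str]:
    """Return 'sg' or 'pl', or None when an annotator has to decide."""
    if m_lemma in plural_only:
        # Morphology is always plural, so it says nothing about the meaning.
        return None
    return TAG_NUMBER_TO_GRAMMATEME.get(m_tag[NUMBER_INDEX])


plural_only_nouns = {"dveře"}  # stand-in for the dictionary-derived list
print(assign_number("les", "NNIS2-----A----", plural_only_nouns))
print(assign_number("dveře", "NNFP1-----A----", plural_only_nouns))
```

Returning `None` rather than guessing keeps the manually annotated cases separate from the automatic ones.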

Grammateme assignment and tree structure
- mood grammateme verbmod: values ind | imp | cdn
- assigned automatically
  - one-word verbal forms
    • e.g. jde (goes)
    • m-tag information
  - verbal forms consisting of more word forms (represented by a single node at the t-layer)
    • e.g. byl by šel (would have gone)
    • corresponding a-layer subtree involves the node "by"
    • m-tag of the node "by"
Node label in the tree: v, verbmod=cdn
Example sentence (lit.: He-was would went to-forest.): He would have gone to the forest.
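
The verbmod rule depends on the inter-layer links: a single t-node for a complex verb form points to several a-nodes, and the m-tag of the auxiliary "by" decides the value. A minimal sketch follows, assuming that the detailed POS (second tag position) is `c` for the conditional auxiliary and `i` for imperative forms; these codes, the `ANode` class, the function name and the sample tags are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ANode:
    form: str
    m_tag: str  # positional m-tag of the token behind this a-node


def assign_verbmod(linked_a_nodes: List[ANode]) -> str:
    """Derive verbmod from the m-tags of all a-nodes linked to one verbal t-node."""
    # Collect the detailed POS of every verbal token in the linked a-layer subtree.
    detailed = {node.m_tag[1] for node in linked_a_nodes if node.m_tag[:1] == "V"}
    if "c" in detailed:  # conditional auxiliary present (assumed code)
        return "cdn"
    if "i" in detailed:  # imperative form (assumed code)
        return "imp"
    return "ind"


would_have_gone = [
    ANode("byl", "VpYS---XR-AA---"),
    ANode("by", "Vc-X---3-------"),
    ANode("šel", "VpYS---XR-AA---"),
]
print(assign_verbmod(would_have_gone))
print(assign_verbmod([ANode("jde", "VB-S---3P-AA---")]))
```

The same function covers one-word forms, since a single linked a-node is simply a list of length one.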

Grammateme assignment and co-reference
- grammatemes gender, number and person in relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be "inherited from the antecedents")
Example: Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
(lit.: From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America.)
From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America.
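
A consumer of the data can recover an underspecified value by following the co-reference link to the antecedent. The sketch below shows one way to do this, assuming a simple node class with a single antecedent pointer; the class, the function name and the value `neut` are illustrative assumptions, while the value `inher` comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TNode:
    t_lemma: str
    grammatemes: Dict[str, str] = field(default_factory=dict)
    antecedent: Optional["TNode"] = None  # co-reference link, if any


def resolve_grammateme(node: TNode, name: str) -> Optional[str]:
    """Follow co-reference links until a value other than 'inher' is found."""
    visited = set()
    current: Optional[TNode] = node
    while current is not None and id(current) not in visited:
        visited.add(id(current))  # guard against cyclic links
        value = current.grammatemes.get(name)
        if value != "inher":
            return value  # a concrete value, or None if the attribute is absent
        current = current.antecedent
    return None  # chain ended without a concrete value


milk = TNode("mléko", {"gender": "neut", "number": "sg"})
which = TNode("který", {"gender": "inher", "number": "inher"}, antecedent=milk)
print(resolve_grammateme(which, "gender"), resolve_grammateme(which, "number"))
```

Storing `inher` instead of a copied value keeps the annotation free of information that agreement alone determines.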

Final remarks
- achievements:
  - two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
  - automatic procedure for capturing the node classification and the grammateme attributes
  - verification of the procedure on large-scale data
- experience:
  - it was the existence of the lower annotation layers and of the inter-layer links that allowed the procedure of grammateme assignment to be made more or less automatic

http://ufal.mff.cuni.cz/pdt2.0/