 
Prague Dependency Treebank(s)
Workshop at LSA 2011, Part I

Jan Hajič, Zdeňka Urešová
Institute of Formal and Applied Linguistics, School of Computer Science
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
 
Part I - Text to Syntax
- The Prague Dependency Treebank projects
- The theory behind it
- The corpora
- Morphology
  - Dictionaries, tools (incl. POS tagger)
- Dependency surface syntax
  - Czech, Arabic, English
- Parallel annotated corpus

July 30, 2011, LSA 2011, Prague Dependency Treebanks I
 
The Prague Dependency Treebank(s)
- The idea
  - Apply the "old" Prague theory to real-world texts
  - Provide enough data for ML experiments
- "Old" Prague theory
  - Prague structuralism (1930s)
  - Stratificational approach
  - Centered on "deep syntax"
    - Separated from "surface form"
    - Dependency based (how else?)
 
PDT: The Methodology
- Manual annotation is PRIMARY
  - Some help from existing tools possible
- "No information loss, no redundancy"
  - Much formalization, but...
  - ... original form always retrievable
- Dictionaries
  - In theory: "secondary", side effect of annotation (generalization)
  - In reality: help consistency
  - Links: data → dictionary(-ies)
- Result: used for machine learning
- Ergonomics of annotation
  - Graphical ("linguistic") presentation & editing
 
The Prague Dependency Treebank Project: Czech Treebank
- 1995 (Dublin), 1996-2006-...
- 1998: PDT v. 0.5 released (JHU workshop)
  - 400k words manually annotated, unchecked
- 2001: PDT 1.0 released (LDC)
  - 1.3 MW annotated, morphology & surface syntax
- 2006: PDT 2.0 release
  - 0.8 MW annotated (50k sentences)
  - + PDT 1.0 corrected
  - the "tectogrammatical layer"
    - underlying (deep) syntax
 
Related Projects (Treebanks)
- Prague Czech-English Dependency Treebank
  - WSJ portion of PTB, translated to Czech (1.2 mil. words)
    - Annotated for a basic set of attributes
  - Penn Treebank / WSJ
    - Pre-converted, manually annotated for basic sets of attributes
    - Named entity annotation, co-reference (BBN) merged in
    - Detailed breakdown of NP
- Prague Arabic Dependency Treebank
  - Apply the same representation to annotation of Arabic
  - Surface syntax so far
- Both published (in first version) in 2004 (by LDC)
  - PCEDT/PEDT version 2.0 being prepared (2011?)
  - Preliminary version available for browsing: see the workshop web
 
PDT (Czech) Data
- 4 sources:
  - Lidové noviny (daily newspaper, incl. extra sections)
  - DNES (Mladá fronta Dnes) (daily newspaper)
  - Vesmír (popular science magazine, monthly)
  - Českomoravský Profit (economic journal, weekly)
- Full articles selected
  - article ~ DOCUMENT (basic corpus unit)
- Time period: 1990-1995
- 1.8 million tokens (~110,000 sentences)
 
PDT Annotation Layers
- L0 (w) Words (tokens)
  - automatic segmentation and markup only
- L1 (m) Morphology
  - Tag (full morphology, 13 categories), lemma
- L2 (a) Analytical layer (surface syntax)
  - Dependency, analytical dependency function
- L3 (t) Tectogrammatical layer ("deep" syntax)
  - Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
- Release coverage: PDT 1.0 (2001) reaches up to L2; PDT 2.0 (2006) covers all layers, including L3
 
 
Morphological Attributes
- Tag: 13 categories
  - Example: AAFP3----3N----
  - Read position by position: A = adjective, A = regular, F = feminine, P = plural, 3 = dative, - = no poss. gender, - = no poss. number, - = no person, - = no tense, 3 = superlative, N = negated, - = no voice, - = reserve 1, - = reserve 2, - = base var.
  - Ex.: nejnezajímavějším, "(to) the most uninteresting"
- Lemma: POS-unique identifier
  - Books/verb -> book-1, went -> go, to/prep. -> to-1
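The positional reading of the example tag can be sketched in a few lines of Python. This is only an illustration: the position names below are informal labels derived from the slide's gloss (assumed here, not official attribute names), and the sketch assumes that the 13 categories plus the two reserve positions give the 15 characters seen in the example.

```python
# Informal position labels taken from the slide's gloss of AAFP3----3N----.
# They are assumptions for illustration, not official attribute names.
TAG_POSITIONS = [
    "pos", "subpos", "gender", "number", "case",
    "poss_gender", "poss_number", "person", "tense",
    "grade", "negation", "voice", "reserve1", "reserve2", "variant",
]


def decode_tag(tag: str) -> dict:
    """Map each character of a positional tag to its (assumed) position name."""
    if len(tag) != len(TAG_POSITIONS):
        raise ValueError(f"expected {len(TAG_POSITIONS)} characters, got {len(tag)}")
    return dict(zip(TAG_POSITIONS, tag))


def specified(tag: str) -> dict:
    """Keep only the positions that carry a value ('-' means not applicable)."""
    return {name: value for name, value in decode_tag(tag).items() if value != "-"}


example = specified("AAFP3----3N----")
print(example)
```

Keeping the tag as a fixed-width string makes every category addressable by position, which is what allows a tagger to treat categories as separate features.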
 
Morphological Tagset
- 13 categories, 4452 plausible tags (combinations)
 
Morphological Analysis
- Formally: $\mathrm{MA}: A^{+} \to \mathrm{Pow}(L \times T)$
  - $\mathrm{MA}(f) = \{[l, t]\}$, where $f \in A^{+}$ (the token), $l \in L$ (lemma), $t \in T$ (tag)
- Tokens taken in isolation
  - no attempt to solve e.g. auxiliaries vs. full verbs
- Ex.: MA("má") (lit. "has", "my") = {
  - [mít, VB-S---3P-AA---] (lit. "to have"),
  - [můj, PSFS1-S1------1] (lit. "my"),
  - [můj, PSFS5-S1------1],
  - [můj, PSNP1-S1------1],
  - [můj, PSNP4-S1------1],
  - [můj, PSNP5-S1------1] }
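As a rough illustration of the definition above, morphological analysis can be viewed as a lookup that returns a set of (lemma, tag) pairs for a token in isolation. The sketch below is a toy: the dictionary-based lexicon and the function names are assumptions made for this example, and it holds only a subset of the readings listed on the slide.

```python
from typing import Dict, Set, Tuple

Analysis = Tuple[str, str]  # (lemma, tag)

# Toy lexicon with a subset of the slide's readings for the token "má".
# A real analyzer would cover the whole vocabulary; this is only a sketch.
LEXICON: Dict[str, Set[Analysis]] = {
    "má": {
        ("mít", "VB-S---3P-AA---"),
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
        ("můj", "PSNP4-S1------1"),
        ("můj", "PSNP5-S1------1"),
    },
}


def morphological_analysis(token: str) -> Set[Analysis]:
    """Return all (lemma, tag) readings of a token, ignoring its context."""
    return set(LEXICON.get(token, set()))


def lemmas(token: str) -> Set[str]:
    """Collect the distinct lemmas among the readings of a token."""
    return {lemma for lemma, _ in morphological_analysis(token)}


print(sorted(lemmas("má")))
```

Because the result is a set of alternatives, choosing a single reading in context is left to a separate disambiguation step.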
 
Morphological Disambiguation
- Full morphological disambiguation
  - more complex than (e.g. English) POS tagging
- Several full morphological taggers:
  - (Pure) HMM
  - Feature-based (MaxEnt, NB)
    - used in the PDT distribution
  - Voted Perceptron (M. Collins, EMNLP'02)
- All: ~94-96% accuracy (perceptron is best)
  - rule & statistic combination: tiny improvement (Hajič et al., ACL 2001; Spoustová et al., 2007: > 96%)
 
The Segmentation Problem: Arabic
- Tokenization / segmentation not always trivial
  - Arabic, German, Chinese, Japanese
 
The Segmentation Problem: Solution for Arabic
- Find max. no. of segments, concatenate up to max.
  - 4 (x 10) for Arabic
  - sa-+yu-hbir-u+-hum+0: F-----VIIA-3MS--S----3MP4-----
- Resulting annotation: F-----VIIA-3MS--S----3MP4-----P-----SD----MS-----------P-------------------N-------2R---------------N-------2D---------------A-----FS2D---------------
 
Arabic Tagging Results
- Maximum entropy, features ~ categories
- Experiments on Penn Arabic Treebanks
  - POS: 95.25-97.37%
  - Full morph.: 88.17-89.31%
  - Segmentation: 98.60-99.37%
- Prague Arabic Dependency Treebank
  - POS: 96.02%
  - Full morph.: 89.24%
  - Segmentation: 99.25%
 
 
Layer 2 (a-layer): Analytical Syntax
- Dependency + analytical function (tree edges link a governor to its dependent)
- Example: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.
Analytical Syntax: Functions
- Main (for [main] semantic lexemes): Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
  - "Double" dependency: AtrAdv, AtrObj, Atr
- Special (function words, punctuation, ...):
  - Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
  - Prepositions/Conjunctions: AuxP, AuxC
  - Punctuation, graphics: AuxX, AuxS, AuxG, AuxK
- Structural
  - Ellipsis: ExD
  - Coordination etc.: Coord, Apos
 
Surface Syntax Example
- Complete sentence: Sb, Pred, Obj
  - The-baker bakes rolls.
  - Pekař peče housky.
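The analytical layer pairs every word with a governor and an analytical function. A minimal sketch of one possible in-memory representation for the example sentence follows; the `Node` class, its field names, and the convention that head index 0 means "no governor" are assumptions made for this illustration (they are not the PDT data format), and punctuation is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    idx: int    # 1-based position of the word in the sentence
    form: str   # word form
    head: int   # index of the governor; 0 marks the top of the tree (assumed convention)
    afun: str   # analytical function


# "Pekař peče housky." (The-baker bakes rolls.), functions as given on the slide.
SENTENCE: List[Node] = [
    Node(1, "Pekař", 2, "Sb"),
    Node(2, "peče", 0, "Pred"),
    Node(3, "housky", 2, "Obj"),
]


def dependents(tree: List[Node], head_idx: int) -> List[Node]:
    """Return the nodes governed by the node with the given index."""
    return [node for node in tree if node.head == head_idx]


def top(tree: List[Node]) -> Node:
    """Return the node that has no governor inside the sentence."""
    return next(node for node in tree if node.head == 0)


for node in dependents(SENTENCE, top(SENTENCE).idx):
    print(node.form, node.afun)
```

Storing only a head index per word is enough to recover the whole tree, which is one reason dependency annotation stays compact.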
 
Surface Syntax Example
- Incomplete phrases
  - Peter works well, but Paul badly
  - Petr pracuje dobře, ale Pavel špatně
 
Surface Syntax Example
- Variants (equal meaning)
  - (he) bought shoes for boy
  - koupil boty pro kluka
 
PDT-style Arabic Surface Syntax
- Only a few differences
  - (Sometimes) separate nodes for individual segments (cf. tagging/segmentation)
  - Copula treatment (Czech: rare, treated as ellipsis; Arabic: systematic solution needed): Pred
  - (Added) analytic functions:
    - AuxM (did-not)
    - Ante (what)
- Work by Faculty of Arts, Charles University
  - Arabic language students
 
Arabic Surface Syntax Example
- In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
 
English Analytic Layer
- By conversion from PTB
  - Extended analytic functions
  - Head rules
    - Jason Eisner's, with more added for full conversion
      - Coordination, traces, etc.
  - Coordination handling
    - Same as in Czech/Arabic PDT
 
Penn Treebank
- University of Pennsylvania, 1993
  - Linguistic Data Consortium
- Wall Street Journal texts, ca. 50,000 sentences
  - 1989-1991
  - Financial (most), news, arts, sports
  - 2499 (2312) documents in 25 sections
- Annotation
  - POS (part-of-speech tags)
  - Syntactic "bracketing" + bracket (syntactic) labels
  - (Syntactic) function tags, traces, co-indexing
  - + Propbanking
 
Penn Treebank Example

    ( (S
        (NP-SBJ
          (NP (NNP Pierre) (NNP Vinken) )
          (, ,)
          (ADJP (NP (CD 61) (NNS years) ) (JJ old) )
          (, ,) )
        (VP (MD will)
          (VP (VB join)
            (NP (DT the) (NN board) )
            (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
            (NP-TMP (NNP Nov.) (CD 29) )))
        (. .) ))

- Callouts: "preterminal"; POS tag (NNS): noun, plural; Noun Phrase label (NP)
- Sentence: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
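To make the bracketed notation concrete, here is a small sketch of a reader for it. It is an illustrative recursive-descent parser written for this example (the function names and the `(label, children)` tuple layout are assumptions), not the tooling used in the project, and it ignores traces and other special cases.

```python
import re
from typing import List, Tuple, Union

Tree = Tuple[str, list]  # (label, children); a child is a Tree or a word string

BRACKETED = """
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
             (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
     (VP (MD will)
         (VP (VB join) (NP (DT the) (NN board) )
             (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
             (NP-TMP (NNP Nov.) (CD 29) )))
     (. .) ))
"""


def tokenize(text: str) -> List[str]:
    """Split the bracketing into parentheses and symbols."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens: List[str], pos: int = 0) -> Tuple[Tree, int]:
    """Parse one bracket starting at tokens[pos]; return the node and the next position."""
    if tokens[pos] != "(":
        raise ValueError("expected '('")
    pos += 1
    label = ""  # the outermost bracket of the example has no label
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children: List[Union[Tree, str]] = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])
            pos += 1
    return (label, children), pos + 1


def leaves(node: Tree) -> List[Tuple[str, str]]:
    """Return (word, POS tag) pairs in sentence order."""
    label, children = node
    pairs: List[Tuple[str, str]] = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(leaves(child))
    return pairs


tree, _ = parse(tokenize(BRACKETED))
print(" ".join(word for word, _ in leaves(tree)))
```

The preterminals are the innermost brackets, so the POS tag of each word is simply the label of the bracket that directly contains it.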
 
Penn Treebank Example: Sentence Tree
- Phrase-based tree representation
 
Parallel Czech-English Annotation
- English text -> Czech text (human translation)
- Czech side (goal): all layers manual annotation
- English side (goal):
  - Morphology and surface syntax: technical conversion
    - Penn Treebank style -> PDT analytic layer
  - Tectogrammatical annotation: manual annotation
    - (Slightly) different rules needed for English
- Alignment
  - Natural, sentence level only (now)
 
Human Translation of WSJ Texts
- Hired translators / FCE level
- Specific rules for translation
  - Sentence per sentence only
    - ...to get simple 1:1 alignment
  - Fluent Czech at the target side
  - If a choice, prefer "literal" translation
- The numbers:
  - English tokens: 1,173,766
  - Translated to Czech:
    - Revised/PCEDT 1.0: 487,929
    - Now finished (all 2312 documents)
 
English Annotation: POS and Syntax
- Automatic conversion from Penn Treebank
- PDT morphological layer
  - From POS tags
- PDT analytic layer
  - From:
    - Penn Treebank syntactic structure
    - Non-terminal labels
    - Function tags (non-terminal "suffixes")
  - 2-step process
    - Head determination rules
    - Conversion to dependency + analytic function
 
Head Determination Rules
- Exhaustive set of rules
  - By J. Eisner + M. Cmejrek/J. Curin
  - 4000 rules (non-terminal based)
    - Ex.: (S (NP-SBJ VP .)) → VP
- Additional rules
  - Coordination, apposition
  - Punctuation (end-of-sentence, internal)
- Original idea (possibility of conversion)
  - J. Robinson (1960s)
 
Example: Head Determination Rules (J. E.)
- Rules:
  - (NP (DT NN)) → NN
  - (VP (VB NP)) → VB
  - (VP (MD VP)) → VP
  - (S (… VP …)) → VP
- Head words marked in the tree figure: (join), (will), (join), (board), (the), (board)
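The four rules above are enough to turn the fragment "will join the board" into dependencies. The sketch below is a simplified illustration of head percolation, not the actual rule set or converter: the rule table is reduced to a priority list per label, the last-child fallback is an assumption, and word strings stand in for node identities (which only works while no word repeats).

```python
from typing import List, Tuple

# Priority list of head-child labels per phrase label, reduced from the slide:
# (NP (DT NN)) -> NN, (VP (VB NP)) -> VB, (VP (MD VP)) -> VP, (S (... VP ...)) -> VP
HEAD_RULES = {"NP": ["NN"], "VP": ["VP", "VB"], "S": ["VP"]}

# A node is (label, children) for a phrase or (tag, word) for a preterminal.
TREE = ("VP", [("MD", "will"),
               ("VP", [("VB", "join"),
                       ("NP", [("DT", "the"), ("NN", "board")])])])


def head_child(label: str, children: list):
    """Pick the head child by rule priority; fall back to the last child (assumption)."""
    for wanted in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == wanted:
                return child
    return children[-1]


def lexical_head(node, deps: List[Tuple[str, str]]) -> str:
    """Return the head word of a node, appending (dependent, governor) pairs to deps."""
    label, body = node
    if isinstance(body, str):  # preterminal: the word is its own head
        return body
    head = head_child(label, body)
    head_word = lexical_head(head, deps)
    for child in body:
        if child is not head:
            deps.append((lexical_head(child, deps), head_word))
    return head_word


dependencies: List[Tuple[str, str]] = []
root_word = lexical_head(TREE, dependencies)
print(root_word, dependencies)
```

Every non-head child attaches its own head word to the head word of the phrase, which is how "will" and "board" both end up depending on "join".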
 
Conversion: Analytic Structure, Functions
- Analytic function assignment (conversion)
- Rules based on functional tags:
  - Sb: -SBJ
  - Pnom: -PRD
  - Obj: -BNF, -LGS, -DTV
  - Adv: -DIR, -LOC, -PRP, -TMP, -ADV, -EXT, -MNR, -PUT
- Ad-hoc rules (if functional tags missing)
- Lemmatization (years → year)
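The tag-based part of the function assignment amounts to a table lookup. The following sketch encodes the mapping listed on the slide; the helper name `afun_from_label` and the choice to return `None` when no mapped tag is present (leaving the case to the ad-hoc rules) are assumptions for illustration.

```python
from typing import Optional

# Penn Treebank function tag -> analytic function, as listed on the slide.
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "LGS": "Obj", "DTV": "Obj",
    "DIR": "Adv", "LOC": "Adv", "PRP": "Adv", "TMP": "Adv",
    "ADV": "Adv", "EXT": "Adv", "MNR": "Adv", "PUT": "Adv",
}


def afun_from_label(label: str) -> Optional[str]:
    """Return the analytic function for a label such as 'NP-SBJ'.

    Returns None when the label carries no mapped function tag, so that
    other (ad-hoc) rules can decide.
    """
    for suffix in label.split("-")[1:]:
        if suffix in FUNCTION_TAG_TO_AFUN:
            return FUNCTION_TAG_TO_AFUN[suffix]
    return None


for label in ("NP-SBJ", "NP-TMP", "PP-CLR", "NP"):
    print(label, afun_from_label(label))
```

In the Penn Treebank example sentence, NP-SBJ and NP-TMP are covered by the table, while PP-CLR is not and would need one of the ad-hoc rules.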
 
Example: Analytical Structure, Functions
- Penn Treebank structure (with heads added) → PDT-like analytic representation
- Head words marked in the tree figure: (join), (will), (join), (board), (the), (board)
 
Annotation tools
- TrEd
  - Visualization, annotation, processing, search
  - Perl, Perl-Tk
  - Customizable
  - Multiple data formats, native: Prague ML (XML)
- Demonstration I
  - Surface-syntax annotation
 
Speech reconstruction
- Spoken input
  - ASR does not do capitalization, punctuation
  - Disfluencies
  - Repetitions
  - Corrections
  - Fillers
  - Deviations from syntax
- ...ungrammatical / dissimilar to text
 
Objective
- Create gold-standard data for
  - (Statistical) training
  - Testing
- Use in machine learning of automatic
  - speech reconstruction
  - (eventually) language understanding
- Go beyond state-of-the-art
  - ASR post-correction / disfluency removal (cf. e.g. Fitzgerald, 2008, or López-Cózar & Callejas, 2008)
 
What is Speech Reconstruction... apart from disfluency removal?
- Transcript: and we 're sitting on the step of I think it 's my aunt Molly 's house
- Disfluency removal: ??
- Speech reconstruction: I think we're sitting on the step of my aunt Molly's house.
- original transcript → edited transcript
  - resembling an interview editing for print
 
Word-for-word transcription
- using Transcriber 1.5.1
- audio synchronization
- spoken text
- non-speech events (UHHUH, laughter, etc.)
- rough segmentation (defined acoustically)
- speaker/turn identification
 
Annotation rules (~ written-text standards)
- sentence segmentation
- orthography
- capitalization
- punctuation
- morphosyntax
- word order
- partial ellipsis restoration
- no discourse-irrelevant nonspeech events
- filler and fragment deletion
- but: ... meaning preservation
 
Segment splitting
- (reconstruction) segment = sentence
- (transcript) segment ~ non-silence span
 
Word order, deletions, insertions, ... punctuation, capitalization, ...
 
Function word insertion
 
Annotation layers
- Multilayered standoff annotation
 
Current status
- English dialogues
  - 151,000 words (14.5 h)
  - 16,000 words double annotated
- Audio / manual transcript / reconstruction
- Auto tagging/parsing: syntax / semantics
- Annotation manual
- Baseline automatic SR systems
 
Conclusion
- Language resources
  - Beyond post-ASR corrections: "Speech Reconstruction"
- Integrated annotation of
  - Audio, ASR, manual transcription, edited speech reconstruction
  - [morphology, syntax, semantics]
- The next step: machine learning
 
Czech Example
- Original transcription: ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání
- Recovering punctuation and capitalization: Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k nim jako do zaměstnání
- Translation: Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání
- Reference reconstructions:
  - (a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání.
  - (b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání.