- Number of slides: 15
The Penn Treebank: Lessons Learned and Current Methodology
Ann Bies
Linguistic Data Consortium, University of Pennsylvania
bies@ldc.upenn.edu
Workshop on Treebanks, Rochester NY, April 26, 2007

Outline
- Lessons learned, or how to get treebanking right
- Current methodology
- Change for the better: not the same old WSJ anymore…

Goals of Treebanking
- Representing useful linguistic structure in an accessible way
  - Consistent annotation
  - Searchable trees
  - “Correct” linguistic analysis if possible, but at least consistent and searchable if not
  - Annotation useful to both linguistic and NLP communities
  - Structures that can be used as the base for additional annotation and analysis (PropBank, for example)

Lessons learned: Annotators
- Linguists do make good annotators!
- Guidelines are very important
- Training annotators well takes a very long time
  1. Learn the system
  2. Self-consistency
  3. Inter-annotator agreement (IAA): consistent with everybody else
- Keeping trained annotators is not easy
  - Full time is good (combo annotation and scripting, error searching, workflow, etc.)
- Good results are possible
  - English IAA now = 96 f-measure
  - Arabic IAA now = c. 93 f-measure
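
The slide reports IAA as an f-measure but does not say how it is computed. A minimal sketch, assuming the score is a labeled-bracket F-measure between two annotators' trees for the same sentence (in the spirit of PARSEVAL-style scoring); the function names `parse_tree`, `brackets`, and `bracket_f1` are hypothetical and not LDC tooling:

```python
from collections import Counter


def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def brackets(tree):
    """Count (label, start, end) spans of phrase-level nodes; POS tags are skipped."""
    spans = Counter()

    def walk(node, start):
        label, children = node
        if all(isinstance(child, str) for child in children):
            return start + len(children)  # preterminal: covers its words only
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_f1(tree_a, tree_b):
    """Labeled-bracket F-measure between two annotations of one sentence."""
    a = brackets(parse_tree(tree_a))
    b = brackets(parse_tree(tree_b))
    matched = sum((a & b).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(b.values())
    recall = matched / sum(a.values())
    return 2 * precision * recall / (precision + recall)


annotator_1 = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat))))"
annotator_2 = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a)) (NP (NN cat))))"
score = bracket_f1(annotator_1, annotator_2)
```

In this toy pair the annotators agree on three brackets out of four and five respectively, so the score falls well below the figures on the slide; a corpus-level number would pool bracket counts over all dually annotated sentences rather than average per-sentence scores.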

Lessons learned: Computational requirements
- Good tools
  - Annotation tools
  - Automatic processing tools (tagger, parser, etc.)
- Programming support
- Feedback from end users!
  - And the time and flexibility in the schedule to take advantage of it

Lessons learned: Time
- Long-term commitment
  - For the annotator (long training period makes long productive period desirable)
  - For the project (guidelines, training, dual annotation)
- Ramping up takes time

Annotation Guidelines
- Detailed guidelines are important
- Can be very stable, but never totally done
  - Need a forum for updating
- Recognize and acknowledge unusually difficult annotation decisions
  - Find good workarounds
  - Avoid making the same decision over and over (or differently)

Annotators’ guidelines
- Involving annotators helps the guidelines
  - Buy-in…
  - Avoids building in distinctions, etc. that annotators can’t reliably make
  - Find iconic examples
    - Paint the town red; K- and N-ras; Secretary of State James Baker
    - طويل القامة / tawiylu Al+qAmati / tall (of) the+stature
- Format annotators are comfortable using
  - Searchable, easily accessible
- Content and format useful to end users
  - Feedback helpful

Importance of QC/Error checking
- There will always be human error, no matter how good the annotators are
- Search for errors and fix them
  - Search tools
  - Ideally someone intimately familiar with the annotation and its challenges = a tech-happy annotator
  - As many different ways to look at the data as possible, to turn up errors you might not expect
    - Searching Arabic Treebank using English Treebank experience
    - Feedback from parsing work
    - Feedback from PropBank work
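
The slide does not describe the search tools themselves. One simple kind of error search can be sketched as follows, assuming the data is available as lists of (word, tag) pairs; the function `find_tag_variation` and the toy sentences are hypothetical illustrations, not the LDC's actual searches. The idea is to list every word form that appears with more than one POS tag so a person can review the cases:

```python
from collections import Counter, defaultdict


def find_tag_variation(tagged_sentences):
    """Return {word: {tag: count}} for words annotated with more than one tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: dict(tags) for word, tags in counts.items() if len(tags) > 1}


# Toy data: "that" is tagged three different ways.
corpus = [
    [("He", "PRP"), ("said", "VBD"), ("that", "IN"), ("it", "PRP"), ("works", "VBZ")],
    [("He", "PRP"), ("said", "VBD"), ("that", "DT"), ("it", "PRP"), ("works", "VBZ")],
    [("The", "DT"), ("book", "NN"), ("that", "WDT"), ("he", "PRP"), ("read", "VBD")],
]
candidates = find_tag_variation(corpus)
```

Variation alone is not an error: many words are legitimately ambiguous, so the output is a list of candidates for manual checking, which fits the slide's point that the ideal searcher knows the annotation well.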

Current methodology
- Increasing emphasis on QC/error checking
- Good tools
- Incorporate as much good automatic tagging, parsing, etc. as possible as input to annotation
- Increasing emphasis on coordination with other types of annotation
  - PropBank, MDE, sentence alignment, etc.

A file’s path through annotation
- Selection (in coordination with other annotation projects)
  - Source generated
- Segmentation into sentences and tokens
  - Automatic
  - Manual correction
- POS/morphological tagging
  - Tagger (for English), generation of possible morphological analyses (for Arabic)
  - Manual correction/selection of POS/morphological tag
- Treebank
  - Parser
  - Manual correction (TreeEditor)
    - Two passes, if necessary
    - Including dual annotation for IAA
- Quality control/error correction
  - Error searches
  - Manual correction
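
The path above is an ordered sequence of stages, most of which pair an automatic step with a manual correction step. A minimal sketch of tracking a file through that sequence, assuming each step must finish before the next begins; the names `PIPELINE`, `STEPS`, and `FileRecord` are hypothetical and do not describe the LDC's workflow software:

```python
from dataclasses import dataclass

# Stage and step names follow the slide; the data layout is an assumption.
PIPELINE = [
    ("selection", ["source generated"]),
    ("segmentation", ["automatic", "manual correction"]),
    ("POS/morphological tagging", ["automatic", "manual correction/selection"]),
    ("treebank", ["parser", "manual correction"]),
    ("quality control", ["error searches", "manual correction"]),
]

# Flatten into one ordered list of (stage, step) pairs.
STEPS = [(stage, step) for stage, steps in PIPELINE for step in steps]


@dataclass
class FileRecord:
    name: str
    done: int = 0  # number of steps completed so far

    def next_step(self):
        """Return the next (stage, step) pair, or None when the file is finished."""
        return STEPS[self.done] if self.done < len(STEPS) else None

    def complete_step(self):
        """Mark the current step as finished."""
        if self.done >= len(STEPS):
            raise ValueError(f"{self.name} has already finished the pipeline")
        self.done += 1


record = FileRecord("example_file")  # hypothetical file name
first = record.next_step()
record.complete_step()
record.complete_step()
third = record.next_step()
```

A strictly linear model leaves out details the slide mentions, such as a second treebank pass when necessary and dual annotation for IAA, which would need branching or per-annotator records.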

POS ANNOTATION

Penn Arabic Treebank ‘TreeEditor’

Recent guidelines improvements
- English
  - Improved NP structure
    - NML
    - Distributed modifiers for BioMedical domain/entities
  - More compatible with PropBank
    - Untensed sentential complements
    - Relative clause adjunction
    - Hyphen tokenization (New York - based company)
- Arabic
  - POS changes to reduce mismatches with treebank nodes (feedback from parsing work)
  - Improved NP structure
    - More direct representation of idafa/construct state and other grammatical constructions
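
The hyphen tokenization item in the English list can be illustrated with a small sketch. It assumes the simplest possible rule, splitting every token at each hyphen and keeping the hyphen as its own token, as the slide's example suggests; real guidelines may list exceptions that this does not model, and `split_hyphens` is a hypothetical helper:

```python
import re


def split_hyphens(tokens):
    """Split each token at hyphens, keeping every hyphen as a separate token."""
    result = []
    for token in tokens:
        if re.fullmatch(r"-+", token):
            result.append(token)  # a bare dash token is left unchanged
        else:
            # The capturing group makes re.split keep the hyphens.
            result.extend(part for part in re.split(r"(-)", token) if part)
    return result


tokens = split_hyphens(["a", "New", "York-based", "company"])
```

With the hyphen split off, "New York" can be bracketed as a unit inside the modifier, which is not possible when "York-based" is a single token.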

New data
- English
  - Lots of English Translation Treebank, translated from both Chinese and Arabic
  - Not just WSJ!
  - Very good annotation
- Arabic
  - Revision of ATB 3/Annahar to begin soon
    - cf. Treebank II revision in English Treebank