5b1d2de1ce8e61de5f5ee5060cd30a7f.ppt
- Number of slides: 26
Treebanking a Blackfoot Corpus Joel Dunham UBC
Overview • Blackfoot language • Online Linguistic Database (OLD) • Blackfoot OLD (BOLD) • BOLD Annotation/treebanking
Blackfoot language • Algonquian (Plains): Alberta & Montana • Endangered: < 5000 speakers • Fieldwork: UBC, UCalgary, UMontana
Blackfoot language • Salient properties: • Direct-inverse system • Grammatical animacy • Agglutinative
Blackfoot language • Agglutinative: • kimaaksawohpokooyimasi • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • ‘Why don’t you eat with her?’
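The one-to-one pairing between the segmentation line and the gloss line can be checked mechanically. The following is a minimal sketch, not part of the OLD; the function name `align_gloss` is hypothetical, and the gloss is written with the spacing normalised to `3SG-CONJ`.

```python
def align_gloss(segmentation: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each morpheme with its gloss; both lines are hyphen-delimited."""
    morphemes = segmentation.split("-")
    glosses = gloss.split("-")
    if len(morphemes) != len(glosses):
        raise ValueError(
            f"{len(morphemes)} morphemes but {len(glosses)} glosses"
        )
    return list(zip(morphemes, glosses))


pairs = align_gloss(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
)
for morpheme, gloss in pairs:
    print(f"{morpheme}\t{gloss}")
```

A length mismatch raises an error instead of silently truncating, which is usually what one wants when validating interlinear data at entry time.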
OLD • Online Linguistic Database • www.onlinelinguisticdatabase.org • Web application for documenting and analyzing languages
OLD • Open source (GPL): Python (Pylons), MySQL, HTML/JS • Powerful search capability: regex, boolean • Multi-user, web-based, collaborative • Multi-media: audio, video, images, text • Auto-linking of morphemes
Blackfoot OLD • OLD web application for Blackfoot (BLAOLD; funded by SSHRC) • http://blaold.webfactional.com/ • Other OLD web apps: • Okanagan OLD (OKAOLD) • Plains Cree OLD (CRKOLD) • etc.
BLAOLD
BLAOLD • Forms (morphemes & sentences): 21,788 (2011-07-25) • morphemes: 5,094 • sentences: 3,193 • unclassified: 13,501 • (word tokens: 20,577)
BLAOLD • Sources: • textual: 16,209 forms • field work: 5,569 forms (and growing...)
BLAOLD • Collections • texts created by ordered references to forms • 135 Collections at present • E.g., Creation Story: • http://blaold.webfactional.com/creationstory
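The idea of a text as an ordered list of references to forms can be sketched as a small data structure. This is an illustration only, not the OLD's actual schema: the class names, field names, and IDs below are assumptions, and the two transcriptions are taken from elsewhere in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    id: int                 # hypothetical numeric ID
    transcription: str


@dataclass
class Collection:
    title: str
    form_ids: list[int] = field(default_factory=list)  # ordered references

    def render(self, forms: dict[int, Form]) -> list[str]:
        """Resolve the references, in order, to their transcriptions."""
        return [forms[form_id].transcription for form_id in self.form_ids]


forms = {
    1: Form(1, "kimaaksawohpokooyimasi"),
    2: Form(2, "oma aakííwa iihpóma ónnikii"),
}
text = Collection("Example text", form_ids=[2, 1, 2])
print(text.render(forms))
```

Because the collection stores references and not copies, a correction to a form shows up in every text that cites it, and the same form can appear more than once in a text.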
BLAOLD • Collection (text) created by referencing Forms entered into the BLAOLD. • ...
BLAOLD • Files: • Associate Forms, Collections & Files • 2,159 files (2011-07-25) • 1,744 audio • 259 image • 148 text • 4 video
Form with morphemic analysis • Morpheme segmentation and morpheme gloss lines. Blue text indicates links to morphemic Form entries found by the system. • POS string auto-generated: “prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num” • Associated WAV file (tagged as an object language utterance) • Associated JPG (used as a stimulus in elicitation)
BLAOLD: Goal • Improve efficiency of data collection, dissemination & analysis • automate subtasks & improve search • morphological parsing • treebanking?
Morphological Parser • ‘A morphological parser for Blackfoot’ (Dunham, 2010; WAIL) • input = transcription: • kimaaksawohpokooyimasi • output = <segmentation, morph glosses, POSes>: • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ • agra-adt-oth-adt-vai-fin-thm-agrb
Morphological Parser
• Input: kimaaksawohpokooyimasi
• FST:
  - Phonology: hand-coded into the FST from a grammar
  - Morphotactics & lexicon: extracted programmatically from the BLAOLD
• POS/morphemic N-grams used to select the most probable parse
• Output:
  - k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
  - 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
  - agra-adt-oth-adt-vai-fin-thm-agrb
• Accuracy: ca. 70%
• Challenges:
  - variations in transcription
  - no hard and fast spelling rules
  - researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
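The N-gram selection step can be illustrated with a toy bigram model over POS tags. This is a sketch under stated assumptions, not the parser's actual code: the function names, the add-one smoothing, and the competing candidate `agra-nan-oth-...` are all hypothetical, and the training data is just a handful of POS strings quoted in these slides.

```python
import math
from collections import Counter


def train_bigrams(sequences):
    """Count tag unigrams and bigrams, with sentence-boundary padding."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        unigrams.update(padded[:-1])               # contexts only
        bigrams.update(zip(padded, padded[1:]))
    vocab_size = len({t for seq in sequences for t in seq} | {"</s>"})
    return unigrams, bigrams, vocab_size


def log_prob(seq, model):
    """Add-one smoothed bigram log-probability of a tag sequence."""
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + seq + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def best_parse(candidates, model):
    """Return the candidate POS string with the highest score."""
    return max(candidates, key=lambda c: log_prob(c.split("-"), model))


# Toy training data: word-level POS strings quoted in the slides.
training = [
    "agra-adt-oth-adt-vai-fin-thm-agrb",
    "adt-asp-fin-thm",
    "adt-asp-fin-thm-agrb",
    "drt-num",
    "nan-nin",
]
model = train_bigrams([s.split("-") for s in training])

candidates = [
    "agra-adt-oth-adt-vai-fin-thm-agrb",
    "agra-nan-oth-adt-vai-fin-thm-agrb",   # hypothetical competing parse
]
print(best_parse(candidates, model))
```

In a real setting the candidates would come from the FST and the counts from the whole database; smoothing matters because many tag sequences will not have been seen before.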
Morphological Parser • Benefits of a morphological parse(r): • parse online in real time (i.e., during data entry): save researcher time • create more data to improve searching
Morphological Parser • Search example: find all sentences with an overt subject and an overt object • Regex on POS string for 2 nominal roots: • /n[ai][nr].*/
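A POS-string search of this kind can be sketched with Python's `re` module. The pattern from the slide is used as written; note that on its own it is satisfied by a single nominal tag. The stricter word-level check (`has_two_nominals`) is an assumption added for illustration, as are the record IDs. The POS strings are ones quoted in these slides.

```python
import re

SLIDE_PATTERN = re.compile(r"n[ai][nr].*")   # pattern as given on the slide
NOMINAL = re.compile(r"\bn[ai][nr]\b")       # one nominal-root tag


def has_two_nominals(pos_string: str) -> bool:
    """True if at least two separate words contain a nominal-root tag."""
    words = [w for w in pos_string.split() if NOMINAL.search(w)]
    return len(words) >= 2


records = [
    {"id": 1, "pos": "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num"},
    {"id": 2, "pos": "agra-adt-oth-adt-vai-fin-thm-agrb"},
    {"id": 3, "pos": "drt-num adt-asp-fin-thm drt-num nan-nin und adt-asp-fin-thm-agrb oth"},
]

loose = [r["id"] for r in records if SLIDE_PATTERN.search(r["pos"])]
strict = [r["id"] for r in records if has_two_nominals(r["pos"])]
print("slide pattern:", loose)
print("two nominal words:", strict)
```

Record 3 shows one way a purely string-based search can over-match: `nan-nin` contains two nominal tags but is a single word, so it satisfies a tag-level pattern without providing two separate nominals.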
Morphological Parser • /n[ai][nr].*/ • [Figure: search results labelled Good and Bad]
Treebank
(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))
TGrep: ‘S < (NP $. (VP < NP))’
[Tree diagram; node labels: S, NP, DT, VP, NP, VBD, NP]
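To make the TGrep pattern concrete, the sketch below parses the bracketed tree and hand-codes the one pattern `S < (NP $. (VP < NP))`: an S that immediately dominates an NP whose immediately following sister is a VP that in turn immediately dominates an NP. It is a plain-Python illustration and not a TGrep implementation; `Node`, `parse`, `subtrees`, and `matches` are hypothetical names, and the second (intransitive) tree is made up for contrast.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # Node objects or leaf strings


def parse(text: str) -> Node:
    """Parse a Penn-style bracketed tree into nested Node objects."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = Node(tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                                    # consume ")"
        return node

    return read()


def subtrees(node: Node):
    """Yield the node and every Node below it."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from subtrees(child)


def matches(node: Node) -> bool:
    """Hand-coded check for: S < (NP $. (VP < NP))"""
    if node.label != "S":
        return False
    kids = node.children
    for left, right in zip(kids, kids[1:]):          # adjacent sisters
        if (
            isinstance(left, Node) and left.label == "NP"
            and isinstance(right, Node) and right.label == "VP"
            and any(isinstance(c, Node) and c.label == "NP" for c in right.children)
        ):
            return True
    return False


transitive = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
intransitive = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")  # made-up contrast

print(any(matches(n) for n in subtrees(transitive)))
print(any(matches(n) for n in subtrees(intransitive)))
```

Unlike the regex over a flat POS string, this query refers to structure, so it distinguishes a subject NP from an object NP inside the VP.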
Treebank • Assuming a flat morphological structure, syntactic phrase-structure parsing of Blackfoot may actually be easy relative to English • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words
Treebank
[Tree diagram; node labels: S S S VP VP NP DEM NP VBZ DEM NN CC VBZ]
POS: drt-num adt-asp-fin-thm drt-num nan-nin und adt-asp-fin-thm-agrb oth
Morphemes: ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áy
‘He is building that house and he is still building it.’
Treebank • Worth it to treebank Blackfoot?
Cons: • lots of researcher hours & money • time might be better spent elsewhere, e.g., elicitation
Pros: • might significantly improve search and therefore research efficiency • automated parsing may be relatively easy
Nitsííkoohtaahsi’taki