
Treebanking a Blackfoot Corpus
Joel Dunham, UBC

Overview
  • Blackfoot language
  • Online Linguistic Database (OLD)
  • Blackfoot OLD (BOLD)
  • BOLD Annotation/treebanking

Blackfoot language
  • Algonquian (Plains): Alberta & Montana
  • Endangered: < 5000 speakers
  • Fieldwork: UBC, UCalgary, UMontana

Blackfoot language
  • Salient properties:
      • Direct-inverse system
      • Grammatical animacy
      • Agglutinative

Blackfoot language
  • Agglutinative:
      • kimaaksawohpokooyimasi
      • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
      • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
      • ‘Why don’t you eat with her?’
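The segmentation and gloss lines on this slide pair up one-to-one when split on hyphens. A minimal Python sketch of that alignment follows; the function name `align` is an assumption made for illustration and is not part of the OLD.

```python
def align(segmentation, gloss, sep="-"):
    """Pair each morpheme with its gloss; both lines must have the same length."""
    morphemes = segmentation.split(sep)
    glosses = gloss.split(sep)
    if len(morphemes) != len(glosses):
        raise ValueError(
            f"{len(morphemes)} morphemes but {len(glosses)} glosses"
        )
    return list(zip(morphemes, glosses))


# Segmentation and gloss lines taken from the slide.
pairs = align(
    "k-máak-sa-ohpook-ooyi-m-yii-wa-hsi",
    "2-why-NEG-with-eat-TA-DIR-3SG-CONJ",
)

for morpheme, gloss in pairs:
    print(f"{morpheme}\t{gloss}")
```

Note that the surface word kimaaksawohpokooyimasi is not the plain concatenation of the morphemes, since phonological processes apply between them.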

OLD
  • Online Linguistic Database
  • www.onlinelinguisticdatabase.org
  • Web application for documenting and analyzing languages

OLD
  • Open source (GPL): Python (Pylons), MySQL, HTML/JS
  • Powerful search capability: regex, boolean
  • Multi-user, web-based, collaborative
  • Multi-media: audio, video, images, text
  • Auto-linking of morphemes

Blackfoot OLD
  • OLD web application for Blackfoot (BLAOLD; funded by SSHRC)
  • http://blaold.webfactional.com/
  • Other OLD web apps:
      • Okanagan OLD (OKAOLD)
      • Plains Cree OLD (CRKOLD)
      • etc.

BLAOLD

BLAOLD
  • Forms (morphemes & sentences): 21,788 (2011-07-25)
      • morphemes: 5,094
      • sentences: 3,193
      • unclassified: 13,501
      • (word tokens: 20,577)

BLAOLD
  • Sources:
      • textual: 16,209 forms
      • field work: 5,569 forms (and growing...)

BLAOLD
  • Collections
      • texts created by ordered references to forms
      • 135 Collections at present
      • E.g., Creation Story: http://blaold.webfactional.com/creationstory
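One way to picture "texts created by ordered references to forms" is a Collection that stores only Form IDs and resolves them at display time. The sketch below is an illustration under that assumption; the class names, fields, IDs and title are hypothetical and do not describe the actual OLD schema.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    id: int                 # hypothetical numeric ID
    transcription: str
    translation: str = ""


@dataclass
class Collection:
    title: str
    form_ids: list = field(default_factory=list)  # order defines the text

    def render(self, forms_by_id):
        """Resolve the ordered references into transcription lines."""
        return [forms_by_id[i].transcription for i in self.form_ids]


# Two forms that appear in this deck; the IDs are made up.
forms = {
    1: Form(1, "kimaaksawohpokooyimasi", "Why don't you eat with her?"),
    2: Form(2, "Nitsííkoohtaahsi'taki"),
}

text = Collection(title="Demo text (hypothetical)", form_ids=[2, 1])
print("\n".join(text.render(forms)))
```

Because the Collection holds references rather than copies, a correction to a Form would show up in every text that cites it.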

BLAOLD
  • Collection (text) created by referencing Forms entered into the BLAOLD.
  • ...

BLAOLD
  • Files:
      • Associate Forms, Collections & Files
      • 2,159 files (2011-07-25)
          • 1,744 audio
          • 259 image
          • 148 text
          • 4 video

Form with morphemic analysis
  • Morpheme segmentation and morpheme gloss lines. Blue text indicates links to morphemic Form entries found by the system
  • POS string auto-generated: “prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num”
  • Associated WAV file (tagged as an object language utterance)
  • Associated JPG (used as a stimulus in elicitation)
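The auto-generated POS string can be thought of as a lookup of each segmented morpheme in the morphemic entries already stored in the database. The following is a rough sketch of that idea, not the OLD's actual implementation: the `LEXICON` dictionary and the `pos_string` function are hypothetical, and the morpheme-to-category pairs are an assumption read off by aligning the segmentation and POS rows of the treebank example later in this deck.

```python
# Hypothetical mini-lexicon: morpheme shape -> category tag.
LEXICON = {
    "ann": "drt",
    "om": "drt",
    "wa": "num",
    "yi": "num",
    "náápi": "nan",
    "moyis": "nin",
    "ki": "und",
}


def pos_string(segmentation, lexicon, unknown="?"):
    """Build a POS string: words split on spaces, morphemes on hyphens."""
    tagged_words = []
    for word in segmentation.split():
        tags = [lexicon.get(m, unknown) for m in word.split("-")]
        tagged_words.append("-".join(tags))
    return " ".join(tagged_words)


print(pos_string("ann-wa om-yi náápi-moyis ki", LEXICON))
# A morpheme with no lexicon entry is marked with the placeholder tag.
print(pos_string("om-yi xyz", LEXICON))
```

A real system would also need the gloss to disambiguate homophonous morphemes; keying on the shape alone is a simplification.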

BLAOLD: Goal
  • Improve efficiency of data collection, dissemination & analysis
  • automate subtasks & improve search
      • morphological parsing
      • treebanking?

Morphological Parser
  • ‘A morphological parser for Blackfoot’ (Dunham, 2010; WAIL)
  • input = transcription:
      • kimaaksawohpokooyimasi
  • output:
      • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
      • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
      • agra-adt-oth-adt-vai-fin-thm-agrb

Morphological Parser
  • Input: kimaaksawohpokooyimasi
  • FST:
      • Phonology: hand-coded into the FST (from a grammar)
      • Morphotactics (lexicon): morphotactics & lexicon extracted programmatically from the BLAOLD
  • POS/morphemic N-grams used to select the most probable parse
  • Output:
      • k-máak-sa-ohpook-ooyi-m-yii-wa-hsi
      • 2-why-NEG-with-eat-TA-DIR-3SG-CONJ
      • agra-adt-oth-adt-vai-fin-thm-agrb
  • Accuracy: ca. 70%
  • Challenges:
      • variations in transcription
      • no hard and fast spelling rules
      • researchers differ in the extent to which they use the standard phonemic orthography to capture phonetic detail
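The N-gram selection step can be illustrated with a small bigram model over POS tags. This is a sketch under stated assumptions, not the parser described above: the deck does not specify the N-gram order or the smoothing, so bigrams with add-one smoothing are used here; the training strings are the word-level POS strings that appear in this deck, and the two candidate parses are invented for the demonstration.

```python
import math
from collections import Counter


def train_bigrams(pos_strings):
    """Count tag contexts and tag bigrams, with sentence-boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for s in pos_strings:
        tags = ["<s>"] + s.split("-") + ["</s>"]
        contexts.update(tags[:-1])
        bigrams.update(zip(tags, tags[1:]))
    return contexts, bigrams


def log_prob(pos_string, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of one candidate parse."""
    tags = ["<s>"] + pos_string.split("-") + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tags, tags[1:])
    )


# POS strings seen elsewhere in this deck, used as toy training data.
training = [
    "agra-adt-oth-adt-vai-fin-thm-agrb",
    "drt-num",
    "adt-asp-fin-thm",
    "nan-nin",
    "adt-asp-fin-thm-agrb",
]
contexts, bigrams = train_bigrams(training)
vocab_size = len({t for s in training for t in s.split("-")} | {"</s>"})

# Hypothetical competing analyses of one word.
candidates = ["adt-asp-fin-thm", "adt-fin-asp-thm"]
best = max(candidates, key=lambda c: log_prob(c, contexts, bigrams, vocab_size))
print(best)
```

The candidate whose tag transitions were all observed in training scores higher than the one with unseen transitions, which is the behaviour the selection step relies on.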

Morphological Parser
  • Benefits of a morphological parse(r):
      • parse online in real time (i.e., during data entry): save researcher time
      • create more data to improve searching

Morphological Parser
  • Search example: find all sentences with an overt subject and an overt object
  • Regex on POS string for 2 nominal roots:
      • /n[ai][nr].*/
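A regex search of this kind can be tried on the two sentence-level POS strings that appear in this deck. The sketch below is an assumption-laden illustration: the slide's pattern names a single nominal tag, so the pattern is repeated here to require two of them, and `nominal_word_count` is a hypothetical stricter check added to show why the naive search can over-match.

```python
import re

# Two nominal tags (e.g. nan, nin) anywhere in the POS string.
TWO_NOMINALS = re.compile(r"\bn[ai][nr]\b.*\bn[ai][nr]\b")
NOMINAL = re.compile(r"\bn[ai][nr]\b")


def naive_match(pos_string):
    """True if the POS string contains at least two nominal tags."""
    return bool(TWO_NOMINALS.search(pos_string))


def nominal_word_count(pos_string):
    """Number of words (space-separated) containing a nominal tag."""
    return sum(1 for word in pos_string.split() if NOMINAL.search(word))


# POS strings from the "Form with morphemic analysis" and "Treebank" slides.
pos_a = "prev-asp-vta drt-num nan drt-num agra-nan adt-asp-vai-oth-num"
pos_b = "drt-num adt-asp-fin-thm drt-num nan-nin und adt-asp-fin-thm-agrb oth"

for pos in (pos_a, pos_b):
    print(naive_match(pos), nominal_word_count(pos))
```

Both strings satisfy the naive regex, but in the second the two nominal tags belong to a single compound word (nan-nin), so only one word is nominal. String-level search therefore cannot reliably tell a subject plus an object from one compound noun, which is part of the motivation for structural (treebank) search.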

Morphological Parser
  • Search results for /n[ai][nr].*/ [slide shows two panels labelled Good and Bad]

Treebank
  • (S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))
  • TGrep: ‘S < (NP $. (VP < NP))’
  • [Tree diagram of the bracketing above]
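To make the TGrep pattern concrete, the sketch below parses the bracketed tree with the standard library only and hard-codes the one relation the query expresses: an S that immediately dominates an NP whose immediately following sister is a VP that immediately dominates an NP. It is not a TGrep implementation; the function names are hypothetical, and the second, objectless tree is invented as a negative example.

```python
import re


def parse(bracketed):
    """Parse a Penn-style bracketing into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def label(n):
    return n[0] if isinstance(n, tuple) else None


def subtrees(n):
    if isinstance(n, tuple):
        yield n
        for child in n[1]:
            yield from subtrees(child)


def has_subject_and_object(tree):
    """Hand-coded equivalent of: S < (NP $. (VP < NP))."""
    for lab, kids in subtrees(tree):
        if lab != "S":
            continue
        for left, right in zip(kids, kids[1:]):
            if (label(left) == "NP" and label(right) == "VP"
                    and any(label(k) == "NP" for k in right[1])):
                return True
    return False


transitive = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma) (NP ónnikii)))")
no_object = parse("(S (NP (DT oma) (NP aakííwa)) (VP (VBD iihpóma)))")  # invented
print(has_subject_and_object(transitive), has_subject_and_object(no_object))
```

Unlike a regex over a flat POS string, this check depends on where the NPs sit in the structure, so an NP nested inside the subject does not count as an object.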

Treebank
  • Assuming a flat morphological structure, the syntactic phrase structure parsing of Blackfoot may actually be easy relative to English
  • one of the longest sentences in the BLAOLD by character count (69 characters) has only 5 words

Treebank
  • [Tree diagram: S, VP and NP nodes over the tags DEM, VBZ, NN and CC]
  • POS: drt-num adt-asp-fin-thm drt-num nan-nin und adt-asp-fin-thm-agrb oth
  • Morphemes: ann-wa á'p-á-istot-i-m om-yi náápi-moyis ki saaki-á'p-á-istot-i-m-wa-áy
  • ‘He is building that house and he is still building it.’

Treebank
  • Worth it to treebank Blackfoot?
  • Cons:
      • lots of researcher hours & money
      • time might be better spent elsewhere, e.g., elicitation
  • Pros:
      • might significantly improve search, hence research efficiency
      • automated parsing may be relatively easy

Nitsííkoohtaahsi’taki