
Using Corpora for Language Research, COGS 523 - Lecture 5
METU Turkish Corpus and METU-Sabancı Turkish Treebank: A Developer's Perspective
19.03.2018 COGS 523 - Bilge Say

Related Readings
  • Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge. Development of a Corpus and a Treebank for Present-day Written Turkish. In Proceedings of the Eleventh International Conference of Turkish Linguistics, August 2002.
  • Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür. Building a Turkish Treebank. Invited chapter in Building and Exploiting Syntactically-annotated Corpora, Anne Abeillé (ed.), Kluwer Academic Publishers, 2003.
  • Nart B. Atalay, Kemal Oflazer, Bilge Say. The Annotation Process in the Turkish Treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC), April 13-14, 2003, Budapest, Hungary.

Acknowledgements
  • Funding: METU-BAP, TÜBİTAK
  • METU-Sabancı Treebank: joint work with Prof. Kemal Oflazer
  • Main contributors: Umut Özge and Nart Bedin Atalay, METU; around 5 research assistants and 13 student annotators and trainees at various phases of the project
  • Various faculty members contributed ideas, especially at the initial stages
  • Agreements with 14 publishers (including 3 newspapers and 4 magazines)

Requirements for Corpora for Turkish?
  • Incorporating many registers representatively
  • Diachronic and synchronic
  • Electronic
  • Annotated with standard practices (typographically, morphosyntactically, semantically, prosodically...)
  • Respecting copyright laws
  • Accessible (free availability, support, etc.)
  • Searchable

What is METU Turkish Corpus?
  • A synchronic (1990+) corpus of written Turkish
  • 2,000,000 words from 201 books, 87 journal issues and issues of 3 daily newspapers, totaling 999 samples
  • Various kinds of annotation (creation of a treebank as a separate subproject)
  • Project: 1999-2003

Other Features of METU Turkish Corpus
  • Permissions for each sample obtained from the publishers
  • Opportunistic representativeness!!
  • Platform-independent; XML- and TEI-compliant annotation
  • Accompanying query software
  • Free for academic research purposes on signature of a user agreement
  • http://www.ii.metu.edu.tr/~corpus/

Building the Corpus
  • Text compilation (permissions, scanning if necessary, control)
  • Computer-aided annotation (TEI-XCES for general-typographic annotation; XML-compliant in-house scheme for the treebank)
  • Control
  • Query workbench development

[Figure: Distribution of Text Types]

Annotation of the Corpus
  • Text Encoding Initiative (TEI) compliant
  • XCES (XML-based Corpus Encoding Standard) compliant; XCES is a TEI application
  • Compatible with major current corpora such as the British National Corpus

The TEI Structure - 1
[Diagram (Burnard, 2001): teiCorpus, teiHeader, TEI.2 and text, with text divided into front, body and back]

The TEI Structure - 2
front, body and back contain:
  • divisions, e.g. <div1>
  • components, e.g. …
  • phrase-level elements, e.g. …
(Burnard, 2001)

A Typical Header
<cesHeader>
  <fileDesc>
    <titleStmt>
      <h.title>00017113</h.title>
    </titleStmt>
    …
Further field values on the slide: 2008, 17929, ...

A Typical Header (cont.)
<sourceDesc>
  <biblStruct>
    <analytic>
      <h.title>Anadolu Dağlarının 'Bitki Avcısı': Prof. Dr. Turhan BAYTOP</h.title>
      …
Further field values on the slide: Nalân MAHSERECİ; Bilim ve Ütopya; Mart 2000; İstanbul; 1301-6717

A Typical Header (cont.)
<profileDesc>
  <textClass>
    <catRef>Makale</catRef>
  </textClass>
  …
Further field values on the slide: 12.10.2000; Sedef; "The header part was changed."
(Makale: "article")

A Typical Body
<text>
<body>
<p>Oktay biraz önce, <q>Hadi biz de Sitem'in yanına gidelim,</q> demişti. Sitem'in, kucağında Tomurcuk Beyle Yılanlı İncirlerden yana gittiğini o da görmüştü çünkü. Ben omuz silkmekle yetindim, Oktay da üstelemedi. Sitem ikimizin yüzüne karşı da görünmez kapılar kapamıştı. Benim de elinden kayıp gidivermemden korkan Oktay beni oyalamak için geçen yaz Giray Ağabeysiyle Kirazlı Yaylaya yaptıkları bir gezintiyi anlatmaya başladı.</p>
<p>O gün ve sonrasında olanları elbet sana da anlatmışlardır, Dalya. Gene de o kargaşa, o şaşkınlık, o panik, o kafa karmaşası yaşanmadan bilinemez...</p>
…

[Screenshot: Entering XCES Annotations - 1]

[Screenshot: Entering XCES Annotations - 2]

METU-Sabancı treebank project
  • Annotation of morphological and (surface) syntactic features in a dependency-inspired manner
  • A subcorpus containing 7,300 annotated sentences and 65,000 words, initially whole samples selected from the main corpus (another version contains 5,600 sentences)
  • Genre distribution is proportional to that of the METU Corpus

Building the Treebank
  • Morphological analysis of selected samples from the corpus
  • Preprocessing of the collocations
  • (Manual) disambiguation of the morphological parses
  • Annotating with the dependency structure
  • Control

Annotation - Lexical Level
  • A word can be seen as a sequence of inflectional groups (IGs) of the form Lemma+Infl_1^DB+Infl_2^DB+…^DB+Infl_n
  • evinizdekilerden (from the ones at your house): ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl
  • Each segment between ^DB markers is one inflectional group
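The IG notation above can be taken apart mechanically. The following is a minimal Python sketch, not the treebank's own tooling: the function name `split_inflectional_groups` and the returned dictionary layout are assumptions made for illustration. It only relies on what the slide states, namely that `^DB` separates inflectional groups and `+` separates the lemma and features.

```python
def split_inflectional_groups(parse: str) -> dict:
    """Split a morphological parse into its lemma and inflectional groups (IGs)."""
    # "^DB" marks a derivation boundary; each segment between boundaries is one IG.
    segments = parse.split("^DB")
    # The first segment starts with the lemma, followed by that IG's features.
    lemma, *first_features = segments[0].split("+")
    groups = [first_features]
    for segment in segments[1:]:
        # Later segments begin with "+", so drop the empty leading field.
        groups.append([feature for feature in segment.split("+") if feature])
    return {"lemma": lemma, "groups": groups}


# The example word from the slide: evinizdekilerden
analysis = split_inflectional_groups(
    "ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl"
)
print(analysis["lemma"])
for number, group in enumerate(analysis["groups"], start=1):
    print(number, "+".join(group))
```

For the slide's example this yields the lemma ev and three IGs: the locative noun, the derived adjective, and the derived ablative plural noun.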

Annotation - Syntactic Level
Bu çocuk okuldan erken geldi.
This child school+Abl early come+Past+3sg
"This child came from the school early."
Dependency links:
  • Bu → çocuk: Determiner
  • çocuk → geldi: Subject
  • okuldan → geldi: Abl. adj
  • erken → geldi: Modifier

Annotation - Syntactic Level
Total of 20 syntactic tags:
  • Sentence
  • Object
  • Subject
  • Intensifier
  • Modifier
  • Determiner
  • Question-Particle
  • Relativizer
  • Coordination
  • Possessor
  • Classifier
  • Ablative Adjunct
  • Dative Adjunct
  • Locative Adjunct
  • Instrumental Adjunct
  • ...

Morphosyntactic processing
  • Tokenized text is annotated (ambiguously) with all possible morphological analyses for each token
  • This also involves unknown-word processing
  • A constraint-based disambiguation module performs limited morphological disambiguation
  • Recognition and morphological annotation of collocations

Automatic Dependency Annotation
  • Try to get most of the "easy" relations right automatically, to help and speed up the human annotator
  • The human annotator can override the selected dependency relation if it is not right
  • Pilot work was done, but this was not put into practice in the METU-Sabancı treebank

Automatic Dependency Annotation (cont.)
  • A set of heuristic rules tentatively attaches some of the relations automatically:
      • Appropriately case-marked nouns attach to the immediately following unambiguous postposition as objects
      • Indefinite nominative nouns attach to the first verb to the right as objects
      • Adverbs and adjuncts attach to the first verb to the right as modifiers and adjuncts
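One of the heuristics above, attaching an adverb to the first verb to its right as a modifier, might be sketched as follows. This is an illustration only and not the pilot system's code: the `Token` class, the coarse tags "Det", "Noun", "Adv" and "Verb", and the function name `attach_adverbs` are all assumptions. The attachment is tentative, so tokens that already have a head are left alone for the annotator to review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str
    pos: str                        # hypothetical coarse tag, e.g. "Noun", "Adv", "Verb"
    head: Optional[int] = None      # index of the head token; None while unattached
    relation: Optional[str] = None  # dependency label, e.g. "Modifier"


def attach_adverbs(tokens: List[Token]) -> List[Token]:
    """Tentatively attach each unattached adverb to the first verb to its right."""
    for i, token in enumerate(tokens):
        if token.pos != "Adv" or token.head is not None:
            continue  # not an adverb, or already attached
        for j in range(i + 1, len(tokens)):
            if tokens[j].pos == "Verb":
                token.head, token.relation = j, "Modifier"
                break  # only the first verb to the right is considered
    return tokens


# The example sentence from the syntactic-level slide, with assumed tags.
sentence = [
    Token("Bu", "Det"),
    Token("çocuk", "Noun"),
    Token("okuldan", "Noun"),
    Token("erken", "Adv"),
    Token("geldi", "Verb"),
]
attach_adverbs(sentence)
for token in sentence:
    print(token.form, token.head, token.relation)
```

Only erken is attached here (to geldi, as Modifier); the other tokens stay unattached until another rule or the human annotator handles them.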

The Annotation Tool
  • The text thus processed can now be further annotated with an annotation tool
      • Visualization
      • Review selections (morphology/dependency) and override (for morphology) or annotate (for dependency)
  • The output of the program is morphologically disambiguated, annotated text, encoded according to the XML document and Turkish Treebank formats

[Screenshot: Annotating the Treebank - 1]

[Screenshot: Annotating the Treebank - 2]

Corpus Query Workbench
  • A user-friendly query engine for linguists
  • Organization through sessions
  • Boolean or regular expression queries
  • Filtering queries through bibliographic constraints such as author, genre, year
  • Treebank entries viewed through a graphical interface
  • Printing and saving options for outputs and session queries
  • Implemented in Java SE 1.4.1, compatible with Windows XP/Linux
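The combination of a regular expression query with bibliographic filters can be illustrated with a small Python sketch. This is not the workbench's Java implementation or its interface: the `Sample` record, the `search` function, the placeholder authors and the two toy texts are invented for illustration, and the genre label "Roman" is an assumption ("Makale" appears in the header example).

```python
import re
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Sample:
    author: str
    genre: str
    year: int
    text: str


def search(
    samples: List[Sample],
    pattern: str,
    genre: Optional[str] = None,
    year: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (author, matched string) pairs from samples that pass the filters."""
    regex = re.compile(pattern)
    for sample in samples:
        # Bibliographic constraints are applied before the text is searched.
        if genre is not None and sample.genre != genre:
            continue
        if year is not None and sample.year != year:
            continue
        for match in regex.finditer(sample.text):
            yield sample.author, match.group(0)


# Toy data, not taken from the corpus.
samples = [
    Sample("Author A", "Makale", 2000, "Bu çocuk okuldan erken geldi."),
    Sample("Author B", "Roman", 1995, "Çocuk okuldan eve geldi."),
]

# Word forms ending in "dan", restricted to one genre.
for author, hit in search(samples, r"\w+dan\b", genre="Makale"):
    print(author, hit)
```

A surface pattern such as `\w+dan\b` is only a rough stand-in for a morphological query; it ignores vowel harmony variants and matches any word that happens to end in those letters.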


Post-project developments
  • About 100 user forms received
  • Some uses (from a recent survey):
      • Word sense disambiguation
      • Coherence in Turkish texts
      • Subcategorization frame acquisition
      • Teaching Turkish or NLP
  • CoNLL dependency task for the METU-Sabancı Treebank (~5000 sentences)
  • Frequency lists available (due to Umut Özge and Serge Sharoff)

What would we have done differently?
  • More funding, more interdisciplinary organization, less turnover...
  • Approaching a corpus development project like a software engineering project...
      • Doing a pilot project
      • Better quality control processes, version control and documentation control processes
      • More and better automatic text capture and annotation

Requests from Users
  • Extend the size and variety of the corpus
  • POS-tag the whole corpus
  • Enable users to load their own corpora into the query tool
  • Add statistical features to the query tools
  • Add semantic annotation
  • Treebank-specific ones:
      • 10,000; 7,000 or 5,000 sentences?
      • Detailed stylebook
      • LEM and MORPH fields
      • Better versioning; some entries do not conform to XML

Requirements for future generations of Turkish corpora
  • Turkish National Corpus (like ANC, BNC, or CNC)
  • Spoken part
  • Automatic tools
  • Diachronic part
  • Linguistically motivated morphological and syntactic annotation
  • Some motivation for text providers
  • Well-funded, well-organized project
  • Comparable corpora of Turkic languages

Lecture 6
  • Bernardini et al., A Wacky Introduction.
  • April 14: your tool evaluation presentations and reports; only two weeks left!