Annotation parameters for a spontaneous speech corpus

Eckhard Bick, University of Southern Denmark
Heliana Mello, Universidade Federal de Minas Gerais
Tommaso Raso, Universidade Federal de Minas Gerais

C-ORAL-BRASIL, PALAVRAS, LABLITA

[Figure: example of morphosyntactic annotation]

THE C-ORAL-BRASIL CORPUS

The corpus represents Brazilian Portuguese spontaneous speech events, following the same criteria as the C-ORAL-ROM corpora (Cresti & Moneglia 2005) for European Portuguese, French, Italian and Spanish.

Capturing diaphasic variation is the main goal of the corpus architecture:
• informal vs. formal;
• informal: private vs. public;
• for each corpus half: 1/3 dialogues, 1/3 conversations, 1/3 monologues.

The recordings aim at maximal diaphasic diversity: people grocery shopping and shoe shopping; a construction worker and an engineer at a construction site; a driving lesson; people playing pool, soccer and different table games; people cooking or cleaning the kitchen or the house; people working at the computer; a student helping another one with a recorder; a driver and a passenger talking in a car; waiters waiting at a party; drag queens putting on makeup before a show; a mother telling a story to her child; people recounting dramatic moments of their lives or explaining their jobs; jokes; recipes; and many other situations.

Each recorded session is stored in WAV files (Windows PCM, 22050 Hz, 16 bit). The C-ORAL-BRASIL corpus provides the acoustic source of each session together with the following main annotations:
• Session metadata, in CHAT and IMDI formats.
• Synchronization of each transcribed utterance to the acoustic source, in XML files, via the WinPitch software (© Pitch France, www.winpitch.com).
• The orthographic transcription, in CHAT format, enriched with the tagging of utterance-terminal (//) and utterance-internal non-terminal (/) prosodic breaks, in TXT files (Moneglia & Cresti 1997).

An utterance is defined as the smallest unit of speech with pragmatic autonomy, i.e. a speech act.
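The break tagging described above lends itself to a simple segmentation step. What follows is only a minimal sketch, assuming a simplified transcript line in which `//` and `/` are the only special symbols; the function name `split_prosodic_units` and the sample line are hypothetical, and real CHAT files carry speaker tiers and further symbols that are not handled here.

```python
import re


def split_prosodic_units(line):
    """Split one transcribed line into utterances and tone units.

    '//' marks a terminal break (end of an utterance);
    '/' marks a non-terminal break (end of a tone unit inside an utterance).
    Returns a list of utterances, each a list of tone-unit strings.
    """
    utterances = []
    # First cut the line at terminal breaks.
    for chunk in re.split(r"\s*//\s*", line.strip()):
        # Then cut each utterance at the remaining single slashes.
        units = [unit.strip() for unit in chunk.split("/") if unit.strip()]
        if units:
            utterances.append(units)
    return utterances


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "o menino / ele chegou // ta bom //"
    for number, units in enumerate(split_prosodic_units(sample), start=1):
        print(number, units)
```

Splitting on the terminal break first keeps the two break levels nested, which mirrors the definition of the utterance as the autonomous unit and the tone unit as its internal subdivision.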
THE INFORMATIONAL ANNOTATION

A 200 word minicorpus (more than 30,000 words) was informationally tagged based on the Informational Patterning Theory (Cresti 2000). An interface will be implemented that allows morphosyntactic and informational configurations to be extracted in relation to each other. For example, it will be possible to extract all Topics that are configurationally NPs on the one hand, or all modal verbs within Topic units on the other.

[Figure: example of a Topic search in the WinPitch interface]
[Figure: example of a three-utterance sequence]

THE ILLOCUTIONARY ANNOTATION

The annotation system of the C-ORAL-BRASIL corpus (Raso & Mello 2009; 2010) comprises three levels of tagging:
1. automatic morphosyntactic annotation by the PALAVRAS parser (Bick 2000), adapted to take utterances and information units as its domain of application;
2. informational annotation (Cresti 2000);
3. illocutionary annotation (Cresti 2005).

THE MORPHOSYNTACTIC ANNOTATION

The morphosyntactic annotation of the corpus was carried out by PALAVRAS (Bick 2000), a robust Constraint Grammar (CG) parser for Portuguese which, as a rule-based system, allows systematic adaptation to very different types of data. Like historical texts (Bick 2005), transcribed speech (Bick 1998) poses two main problems for automatic grammatical analysis: non-standard orthography, which affects lexical recall, and non-standard segmentation, which creates problems for contextual disambiguation.

To overcome these problems, we introduced a two-level markup as a preprocessing stage: prosodic information, speaker overlap, repairs etc. were maintained at a meta-annotation level, while a layer of standardized written-text token sequences was created for the parser to work on. To support this process, our program chain had access to a lexicon extension as well as a number of systematic grammatical transformations (e.g. missing person-number inflexion, clitic subject forms, plural interjections etc.).

The segmentation problem was addressed both at the token level, with new rules for contractions and focus markers, and at the syntactic level, by treating prosodic breaks as punctuation (e.g. // as full stop and / as comma). Though it provides deep structural information, such as syntactic function and dependency, our grammatical annotation is strictly word- (token-) based and allows easy integration into databases and tag-based corpus search interfaces such as CorpusEye.
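
The treatment of prosodic breaks as punctuation, combined with a separate meta-annotation level, can be pictured roughly as follows. This is a hedged sketch and not the actual PALAVRAS preprocessing chain: the function name `to_parser_input`, the mapping table and the sample tokens are all assumptions made for illustration, and orthographic normalization, speaker overlap and repairs are left out.

```python
# Hypothetical mapping, following the example given in the text:
# '//' is treated as a full stop and '/' as a comma.
BREAK_TO_PUNCT = {"//": ".", "/": ","}


def to_parser_input(tokens):
    """Build a written-text token layer and a meta layer from a transcript.

    Returns a pair (parser_tokens, meta):
      parser_tokens -- the token sequence with breaks replaced by punctuation
      meta          -- (position, original_symbol) pairs, so the prosodic
                       information can be restored after parsing
    """
    parser_tokens = []
    meta = []
    for token in tokens:
        if token in BREAK_TO_PUNCT:
            # Remember where the break was and which kind it was.
            meta.append((len(parser_tokens), token))
            parser_tokens.append(BREAK_TO_PUNCT[token])
        else:
            parser_tokens.append(token)
    return parser_tokens, meta


if __name__ == "__main__":
    # Invented sample tokens, for illustration only.
    words, breaks = to_parser_input("o menino / ele chegou //".split())
    print(words)
    print(breaks)
```

Keeping the original symbols in a side list rather than discarding them reflects the two-level idea: the parser sees ordinary punctuation, while the prosodic tagging stays recoverable.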