- Number of slides: 38
Russian Corpora: Comparison and Usage
Victor Zakharov, vz1311@yandex.ru
St. Petersburg State University, Department of Mathematical Linguistics
The 16th International Conference TSD 2013
See: Zakharov V. P. Korpusa russkogo yazyka [Corpora of the Russian Language] // Trudy Instituta russkogo yazyka im. V. V. Vinogradova [Proceedings of the V. V. Vinogradov Russian Language Institute]. Issue 6. 2015. Pp. 20-64.
"New Times"
• Conferences on corpus linguistics:
  Dialogue (every year)
  Corpora-20XX (our department), http://corpora.phil.spbu.ru, 2002 – 2013 (every 2 years)
• Publications
• Russian National Corpus (http://ruscorpora.ru/en/index.html): started in 2003, accessible via the Internet since April 2004
Lecture 1-3. Corpora of the Russian Language: The Present Day
The Russian National Corpus is the most popular corpus among linguists, both because it is the best known and because of the opportunities it offers. A deeper analysis is beyond the scope of this presentation, so we focus on its general characteristics and its most distinctive features. To show the state of the art in modern Russian corpus linguistics, we also look in more detail at other corpora that are less well known but worth mentioning.
The Russian National Corpus (2)
The corpus allows us to study the variability and volatility of the frequencies of linguistic phenomena, and to obtain reliable results in the following areas:
1) morphological variants of words and their evolution;
2) word-formation variants and related issues;
3) changes in syntactic relations;
4) changes in the Russian stress system;
5) lexical variation, in particular changes in synonym series and lexical groups and in the semantic relations within them.
The Russian National Corpus (3)
Over 500 million words (March 2013)
The RNC includes the following subcorpora:
• 1) The main corpus
• 2) Deeply annotated corpus (treebank)
• 3) Spoken corpus
• 4) Parallel text corpus
• 5) Dialectal corpus
• 6) Poetic corpus
• 7) Educational corpus
• 8) Newspaper corpus
• 10) Multimodal/multimedia corpus
The main corpus is subdivided into 2 parts:
• modern written texts (from the 1950s to the present day), 230 mln tokens;
• early texts (from the middle of the 18th to the middle of the 20th century).
The modern-text part is the largest of the subcorpora. Texts are represented in proportion to their share in real-life usage; for example, the share of fiction does not exceed 40%.
The main corpus (2)
Every text included in the main corpus receives metatagging and morphological tagging. Morphological tagging is carried out automatically. In a small part of the main corpus (around 6 mln tokens), grammatical homonyms are disambiguated by hand and the results of the automatic morphological analysis are corrected. This part is the model morphological corpus; it serves as a testing ground for search algorithms and for programs of morphological analysis and automated processing. Disambiguated texts are automatically supplied with stress marks. Stress annotation can be turned off when printing or saving search results.
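Turning stress annotation off amounts to removing the stress marks from the saved text. The following is a minimal sketch, assuming that stress is encoded as a combining acute accent (U+0301) placed after the stressed vowel. The corpus's actual internal encoding is not described here, and `strip_stress` is a hypothetical helper, not part of any RNC tool.

```python
COMBINING_ACUTE = "\u0301"  # assumed stress marker; not confirmed by the source


def strip_stress(text: str) -> str:
    """Return the text with every combining acute accent removed."""
    return text.replace(COMBINING_ACUTE, "")


if __name__ == "__main__":
    stressed = "за\u0301мок"       # 'castle', with the stress mark on the first vowel
    print(strip_stress(stressed))  # the same word without the stress mark
```

Because Russian vowels have no precomposed accented forms in Unicode, a plain string replacement is enough under this assumption.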
Searching
Based on the Yandex search engine:
• lemma search;
• wordform search;
• set phrases;
• additional features:
  – before or after punctuation marks
  – at the beginning or at the end of a sentence
  – capitalization, etc.
Additional options:
• grammeme search;
• semantic search;
• metadata search.
For lexico-grammatical search, we can enter a sequence of lexemes and/or word-forms with given grammatical and/or semantic features and combine them in any way.
Searching (2)
• Parentheses are used for compound queries. For example, the query S & (nom|acc) yields nouns in the nominative or accusative.
• Search terms can be truncated on the left or on the right.
• The distance between words can be set as a range from minimum to maximum. Adjacent words are at distance 1; a distance of 0 means that both conditions apply to the same word-form.
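The two conventions on this slide can be sketched in a few lines of Python. This is an illustration only, not the corpus's implementation: `matches` and `find_pairs` are hypothetical helpers, the precedence of `&` over `|` is an assumption, and only forward (non-negative) distances are handled.

```python
import re

TOKEN_RE = re.compile(r"[()&|]|[^\s()&|]+")
OPERATORS = {"(", ")", "&", "|"}


def matches(query: str, tags: set) -> bool:
    """Evaluate a query such as 'S & (nom|acc)' against one word's tag set."""
    tokens = TOKEN_RE.findall(query)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in OPERATORS:
            raise ValueError(f"unexpected token {tok!r}")
        return tok in tags  # a bare tag is true if the word carries it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in query")
    return result


def find_pairs(words, first, second, lo, hi):
    """Index pairs (i, j) with words[i] == first, words[j] == second, lo <= j - i <= hi.

    Adjacent words are at distance 1; distance 0 is the same word-form.
    Assumes 0 <= lo <= hi.
    """
    last = len(words) - 1
    return [
        (i, j)
        for i, w in enumerate(words)
        if w == first
        for j in range(i + lo, min(i + hi, last) + 1)
        if words[j] == second
    ]
```

For instance, `matches("S & (nom|acc)", {"S", "acc", "sg"})` is true, while a noun tagged only `{"S", "gen"}` does not match.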
Russian National Corpus: Search Interface
Grammatical search
A simpler way to search for grammatical features is to use a selection window. It lists the available features by category; for morphology, for example: part of speech, case, gender, voice, number, etc.
Grammatical search (2)
Russian National Corpus: Metadata tagset
The interface groups the metadata parameters into 2 blocks:
I. Passport
• Author: name, gender, year of birth or approximate age
• Text title
• Date of creation (exact or approximate; can also be given as after or before a certain date)
II. Two subgroups with different parameter structures:
• non-fiction,
• fiction.
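One way to picture the "Passport" block and the after/before date filter is the sketch below. Everything in it is an assumption made for illustration: the `Passport` record, its field names, the inclusive treatment of boundary years, and the sample entries are hypothetical and do not reproduce the corpus's actual metadata schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Passport:
    """Hypothetical record mirroring the 'Passport' metadata block."""
    author: str
    title: str
    created_from: int  # earliest possible year of creation
    created_to: int    # latest possible year; equals created_from for an exact date
    author_gender: Optional[str] = None
    author_birth_year: Optional[int] = None


def created_after(p: Passport, year: int) -> bool:
    """True if the whole date range lies in or after `year` (inclusive by assumption)."""
    return p.created_from >= year


def created_before(p: Passport, year: int) -> bool:
    """True if the whole date range lies in or before `year` (inclusive by assumption)."""
    return p.created_to <= year


def select(texts, after=None, before=None):
    """Titles of texts whose creation date satisfies the given bounds."""
    result = []
    for p in texts:
        if after is not None and not created_after(p, after):
            continue
        if before is not None and not created_before(p, before):
            continue
        result.append(p.title)
    return result


# Illustrative entries only; not real corpus records.
TEXTS = [
    Passport("Author A", "Text 1", 1836, 1836),  # exact date
    Passport("Author B", "Text 2", 1863, 1869),  # approximate date given as a range
    Passport("Author C", "Text 3", 1950, 1959),
]
```

Storing an approximate date as a year range lets the same two comparisons serve exact and approximate dates alike.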
RNC: Semantic annotation
The RNC texts are semantically tagged. Semantic annotation in the main corpus is a unique feature of the RNC that distinguishes it from other national corpora.
Semantic and derivational parameters: person, substance, space, movement, diminutive, verbal noun, etc.
The annotation uses the Semantic Dictionary of the Corpus, which is based on the classification system developed since 1992 for the Lexicograph database under the leadership of E. V. Paducheva and E. V. Rakhilina.
The structure of semantic and lexical information
There are three groups of tags assigned to words to reflect lexical and semantic information:
• Class (a name, a reflexive pronoun, etc.)
• Lexical and semantic features (a lexeme's thematic class, indications of causality or assessment, etc.)
• Derivational features (a diminutive, an adjectival adverb, etc.)
The set of semantic and lexical parameters is different for different parts of speech. Moreover, nouns are divided into three subclasses (concrete nouns, abstract nouns, and proper names), each with its own hierarchy of tags.
Lexical and semantic tags
• Taxonomy (a lexeme's thematic class) – for nouns, adjectives and adverbs
• Mereology ("part – whole" and "element – aggregate" relationships) – for concrete and abstract nouns
• Topology – for concrete nouns
• Causation – for verbs
• Auxiliary status – for verbs
• Evaluation – for abstract and concrete nouns, adjectives and adverbs
• Etc.
Lexical and semantic tags: Fragment (1)
• Taxonomy: t:hum – person (человек (human), учитель (teacher)), t:hum:etn – ethnonyms (эфиоп (Ethiopian), итальянка (Italian woman)), t:hum:kin – kinship terms (брат (brother), бабушка (grandmother)), t:animal – animals (корова (cow), сорока (magpie)), etc.
• Mereology: pt:part – parts (верхушка (top)), pt:part & pc:plant – parts of plants (ветка (branch), корень (root)), pt:part & pc:constr – parts of buildings and constructions (комната (room), дверь (door)), etc.
• Topology: top:contain – containers (комната (room), озеро (lake)), top:horiz – horizontal surfaces (пол (floor), площадка (ground, area)), etc.
• Evaluation: ev – evaluation (neither positive nor negative) (озорник (mischief-maker)), ev:posit – positive evaluation (умница (clever person)), ev:neg – negative evaluation (негодяй (scoundrel)).
Lexical and semantic tags: Fragment (2)
Some tags for verbs:
• t:move – movement (бежать (run), бросить (throw))
• t:put – placement (положить (put), спрятать (hide))
• t:impact – physical impact (бить (beat), колоть (prick))
• t:be:exist – existence (жить (live), происходить (happen))
• t:be:appear – start of existence (возникнуть (arise), создать (create))
• t:be:disapp – end of existence (убить (kill), улетучиться (disappear))
• t:loc – location (лежать (lie), стоять (stand)).
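The colon-separated tags in the two fragments form a hierarchy, and several tags can be combined on one word. The sketch below shows one possible way to query such tags. It assumes that a query for a parent tag (for example t:hum) should also match its subtypes (t:hum:kin); whether the corpus search behaves this way is not stated here. `LEXICON`, `tag_matches` and `words_with` are hypothetical names, and the entries are taken from the examples on the slides.

```python
# A tiny lexicon built from the slide examples: word -> set of semantic tags.
LEXICON = {
    "учитель": {"t:hum"},
    "эфиоп": {"t:hum:etn"},
    "брат": {"t:hum:kin"},
    "корова": {"t:animal"},
    "ветка": {"pt:part", "pc:plant"},
    "комната": {"pt:part", "pc:constr", "top:contain"},
    "бежать": {"t:move"},
    "жить": {"t:be:exist"},
    "возникнуть": {"t:be:appear"},
}


def tag_matches(tag: str, query: str) -> bool:
    """True if `tag` equals `query` or is one of its subtypes (assumed semantics)."""
    return tag == query or tag.startswith(query + ":")


def words_with(*query_tags: str) -> list:
    """Words that carry every queried tag, e.g. words_with('pt:part', 'pc:plant')."""
    return sorted(
        word
        for word, tags in LEXICON.items()
        if all(any(tag_matches(t, q) for t in tags) for q in query_tags)
    )
```

Matching on the prefix plus a colon keeps t:hum from accidentally matching an unrelated tag that merely starts with the same letters.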
Search Interface in English
Semantics: search
Semantics: search (2)
RNC: Search results
Search results can be presented in two ways:
• as horizontal text (a broader context)
• as a concordance (next slide).
In both cases the grammatical and semantic features of any word can be inspected: the slide shows the features for the word лук (onion (t:food) OR bow (t:tool:weapon)).
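The concordance view aligns every hit with a fixed window of context on both sides (the keyword-in-context layout). A minimal sketch of that idea follows; `kwic` is a hypothetical helper, the window is counted in tokens, and the sample sentence is invented for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of `keyword`.

    Matching is case-insensitive on the word-form; `width` is the number of
    tokens shown on each side.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    sentence = "Он купил зелёный лук на рынке".split()
    for left, hit, right in kwic(sentence, "лук", width=2):
        # Right-align the left context so that the hits line up in a column.
        print(f"{left:>20}  [{hit}]  {right}")
```

The horizontal-text view would instead return the whole sentence or paragraph around each hit.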
RNC: Search results (2)
RNC: Search results (3)
Deeply Annotated Corpus (RNC)
• 25 000 sentences, more than 350 000 tokens.
• Various topics.
• Underlying approach: the multipurpose system ETAP (machine translation; Laboratory for Computational Linguistics, Institute for Information Transmission Problems, RAS).
• Sentence structure: dependency tree (the syntactic representation goes back to Igor Mel'čuk's Meaning-Text Theory).
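A dependency tree can be stored as a list of tokens in which each token points to its governing word. The sketch below illustrates that data structure only; the `Token` record, the sample sentence, and the relation labels are illustrative assumptions and do not reproduce the corpus's actual file format or relation inventory.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    head: int  # 1-based index of the governing token; 0 marks the root
    rel: str   # relation label (illustrative names only)


# "Мальчик читает книгу" (The boy reads a book): the verb governs both nouns.
SENTENCE = [
    Token("Мальчик", 2, "subject"),
    Token("читает", 0, "root"),
    Token("книгу", 2, "object"),
]


def dependents(sentence, head_index):
    """Word-forms directly governed by the token at 1-based `head_index`."""
    return [t.form for t in sentence if t.head == head_index]


def is_tree(sentence):
    """Check for exactly one root and that every token reaches it without a cycle."""
    roots = [i for i, t in enumerate(sentence, start=1) if t.head == 0]
    if len(roots) != 1:
        return False
    for start in range(1, len(sentence) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen or not 1 <= node <= len(sentence):
                return False  # cycle or dangling head index
            seen.add(node)
            node = sentence[node - 1].head
    return True
```

Here `dependents(SENTENCE, 2)` lists the two words governed by the verb, and `is_tree` guards against malformed annotations.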
Deeply Annotated Corpus (RNC): an example
RNC: The Corpus of Spoken Russian
More than 10 mln tokens. The corpus represents real-life Russian speech and includes recordings of public and spontaneous spoken Russian and transcripts of Russian movies. Speech is transcribed in standard orthography. The corpus contains samples of different genres and types and of different geographic origins, and covers the period from 1930 to 2007. In addition, it has its own annotation layers: accentological and sociological.
ORD corpus (Odin Rechevoj Den’, or One Day of Speech)
The main aim of the ORD corpus is to collect recordings of the actual speech we use in everyday communication.
A balanced group of 30 persons represents various social and age strata of the population of St. Petersburg. Each participant spent one day wearing a recorder around the neck, recording all of that day's communication.
More than 240 hours of recordings were obtained, of which 170 hours contain speech data suitable for further linguistic analysis.
2202 communication episodes.
At present, the orthographic transcription of the corpus numbers more than 50000 word forms.
RNC: the Multimedia Russian Corpus (MURCO)
Fragments of movies from the 1930s through the 2000s and some other materials. The total volume of the movie transcripts is around 3.5 million tokens. The text transcripts are aligned with the corresponding sound and video tracks.
The types of annotation in the MURCO are as follows:
• orthoepic annotation: combinations of sounds are marked;
• annotation of accentological structure;
• speech act annotation: the types of speech acts;
• gesture annotation: the type of gesticulation in a clip.
A user obtains not only a written text annotated from different points of view, but also the corresponding sound and video material.
Parallel corpora of Russian
1. Russian National Corpus (English-Russian, Russian-English, German-Russian, Ukrainian-Russian, Russian-Ukrainian, Belorussian-Russian, Russian-Belorussian)
2. PARUS (SNC, Bratislava) – PAralelní RUsko-Slovenský korpus (rus-slov)
3. PARRUS (Tampere) (rus-fin)
4. InterCorp (ČNK, Praha)
5. ParaSol (The Regensburg Parallel Corpus of Slavonic)
Other text corpora of Russian
HANCO: Helsinki Annotated Corpus (2001-2004; A. Mustajoki, M. Kopotev; Department of Slavonic and Baltic Languages and Literatures, University of Helsinki).
The corpus provides morphological and syntactic information for approximately 100,000 running words extracted from a modern Russian magazine and representing the modern Russian language.
HANCO: the main principles of creation
• Orientation to a wider audience. Potential users are not only a narrow circle of experts, but also students and teachers of Russian. Search parameters are chosen so as to minimize the amount of specialized knowledge required.
• Orientation to the accuracy of the grammatical description, not to the amount of annotated material.
• Orientation to multilevel grammatical information. The HANCO corpus contains multilevel grammatical information, including morphological, syntactic, and functional characteristics, which can be combined in a search.
• Possibility of alternative interpretations. The HANCO creators decided to allow alternative interpretations of linguistic facts. This seeming lack of selectivity demands a lot of manual work, but it makes it easier for users to find the information they need.
Moshkov's Library corpus
AOT-DDC: results
Russian corpora in Sketch Engine
• http://sketchengine.co.uk/
• Lexical Computing Ltd. (A. Kilgarriff)
• More than 150 corpora of different languages
• Among them corpora of Russian, first of all the ruTenTen corpus of 20 billion tokens (WaCky technology)
• Corpus manager Sketch Engine (Masaryk University)
• Different tools: Concordance, Word sketches, Thesaurus, Differences, Clustering, etc.
Russian corpora by Serge Sharoff
Leeds Univ. (corpus manager CQP)
• Russian Reference Corpus (a part of the RNC)
• Russian Reference Corpus, another version
• Russian Fiction (disambiguated)
• Russian Newspapers
• Russian Business Corpus
• Russian Internet Corpus
• Russian corpora together
Russian corpora by Serge Sharoff