
Corpus Linguistics
Iryna Dilay

Lecture 1. Introduction to Corpus Linguistics
• The subject matter of CL
• The notion of a corpus and the corpus-based approach
• The foundation of CL, its representatives and opponents: empirical data vs. introspection
• Types of corpora

Corpus linguistics as an empirical study relies on facts!
• Corpus linguistics is perhaps best described for the moment in simple terms as the study of language based on examples of ‘real life’ language use (Tony McEnery);
• the study of language as expressed in samples (corpora) of “real world” text (Wikipedia);
• the study of natural language. This method (corpus analysis) is aimed at deriving a set of abstract rules by which a natural language is governed.

A corpus (pl. corpora), or text corpus, is a large and structured set of texts that is electronically stored and processed. It is a collection of spoken and written texts, organized by register and coded for other discourse considerations. Corpora are the main knowledge base in CL. Modern corpora are online collections of text and speech.

Types of corpora
• Written language (the Brown Corpus, LOB Corpus)
• Spoken language (MICASE, Switchboard)
• National (the BNC, the COCA)
• International (ICE, GloWbE)
• Monolingual
• Parallel
• Multilingual

How many words…?
• I do uh main- mainly business data processing (spoken)
• He stepped out into the hall, was delighted to encounter a water brother (written)
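The answer to "How many words…?" depends on counting conventions: whether filled pauses (uh), cut-off fragments (main-) and punctuation marks count as words. The Python sketch below is only an illustration of that point. The `FILLERS` set, the regular expression and the `count_words` helper are assumptions made for this example, not a standard tokenizer.

```python
import re

# Assumed inventory of filled pauses; real spoken corpora define their own.
FILLERS = {"uh", "um"}

# A word (optionally hyphenated, optionally cut off with a trailing hyphen)
# or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*-?|[^\w\s]")


def count_words(text, keep_punct=False, keep_fillers=True, keep_fragments=True):
    """Count tokens in `text` under the chosen (hypothetical) conventions."""
    count = 0
    for tok in TOKEN_RE.findall(text):
        if not tok[0].isalnum() and tok[0] != "_":
            # Punctuation mark such as "," or "."
            if keep_punct:
                count += 1
        elif tok.endswith("-"):
            # Cut-off fragment such as "main-"
            if keep_fragments:
                count += 1
        elif tok.lower() in FILLERS:
            # Filled pause such as "uh"
            if keep_fillers:
                count += 1
        else:
            count += 1
    return count


spoken = "I do uh main- mainly business data processing"
written = "He stepped out into the hall, was delighted to encounter a water brother"

print(count_words(spoken))                                           # 8
print(count_words(spoken, keep_fillers=False, keep_fragments=False)) # 6
print(count_words(written))                                          # 13
print(count_words(written, keep_punct=True))                         # 14
```

Under these assumptions the spoken utterance has 8 or 6 words and the written sentence 13 or 14, so any corpus size figure is meaningful only together with the convention used to obtain it.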

Basic terms:
• Types: the number of distinct words in a corpus, that is, the size of the vocabulary.
• Tokens: the total number of running words.
• Lemma: a set of lexical forms having the same stem, part of speech (POS) and word sense, usually cited by its base form.
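The three terms can be illustrated on a toy sentence. The sketch below is a minimal example under stated assumptions: the sentence and the hand-written `LEMMAS` table are invented for illustration, and a real project would use a proper lemmatizer rather than a lookup dictionary.

```python
from collections import Counter

# Toy text invented for this example.
text = "the cat sat on the mat and the cats were sitting on mats"

# Tokens: every running word, repeats included.
tokens = text.split()

# Types: distinct word forms, i.e. the vocabulary of the text.
type_counts = Counter(tokens)

# Hypothetical hand-written lemma table mapping inflected forms
# to their base forms; forms not listed are their own lemma.
LEMMAS = {"cats": "cat", "mats": "mat", "sat": "sit", "sitting": "sit", "were": "be"}
lemmas = {LEMMAS.get(tok, tok) for tok in tokens}

n_tokens = len(tokens)
n_types = len(type_counts)
n_lemmas = len(lemmas)

print(n_tokens)  # 13
print(n_types)   # 10
print(n_lemmas)  # 7
print(type_counts.most_common(1))  # [('the', 3)]
```

Here "cat" and "cats" are two types but one lemma, which is why the lemma count is lower than the type count.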

Unnatural language
Wallace Chafe (1994): “It is a very peculiar thing that so much of contemporary linguistic research has been based on unnatural language”:
• He is the man to whom I wonder who knew which book to give.
• The cat is on the mat.
• The happy boy eats ice cream.
• Colorless green ideas sleep furiously (Chomsky).
• Flying planes can be dangerous (Chomsky).
• Seymour sliced the salami (Lakoff).

Landmarks in CL
• 1961: the Brown Corpus
• 1967: Henry Kucera and Nelson Francis, ‘Computational Analysis of Present-Day American English’
• 1960: Randolph Quirk, ‘Towards a Description of English Usage’
• 1969: American Heritage Dictionary (AHD)
• 1980: launch of COBUILD (Collins Birmingham University International Language Database) at the University of Birmingham, led by J. Sinclair and funded by Collins publishers; the COBUILD Corpus, or the Bank of English
• 1987: 1st edition of the monolingual learner's dictionary “Collins COBUILD English Language Dictionary”
• 1990s: the British National Corpus (BNC), a collaboration of three publishers (Oxford University Press as the lead collaborator, Longman, and W. & R. Chambers), two universities (the University of Oxford and Lancaster University) and the British Library
• 1984-1991: the Helsinki Corpus (diachronic)
• 1990-2010: the Corpus of Contemporary American English (COCA), created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University
• 2013: the Corpus of Global Web-Based English (GloWbE)
Other corpora: Switchboard, CallHome, ATIS, TREC, MUC, COHA, Corpus of Spoken Professional American English, Michigan Corpus of Academic Spoken English (MICASE), LOB Corpus, Kolhapur (Indian English), Wellington (New Zealand English), ACE (Australian English), the Frown Corpus (early 1990s American English), the FLOB Corpus (1990s British English), and ICE (International Corpus of English).

Useful links
• http://corpus.byu.edu/ : Mark Davies’ site
• http://www.natcorp.ox.ac.uk/lookup.html : BNC
• http://www.cobuild.collins.co.uk/form.html : Bank of English
• http://www.spaceless.com : selected web pages
• http://www.edict.com.hk/concordance/ : contains examples of non-native speakers' use of language
• http://www.papyr.com/applets/concordancer/ : allows uploading and searching texts of your own choice

Bibliography
• Sinclair, John. Corpus, Concordance, Collocation. Oxford: OUP, 1991.
• Sinclair, John M. and R. Carter. Trust the Text: Language, Corpus and Discourse. London, New York: Routledge, 2004. 212 p.
• Tognini-Bonelli, Elena. Corpus Linguistics at Work. John Benjamins Publishing, 2001. 223 p.
• McEnery, Tony and Andrew Wilson. Corpus Linguistics: An Introduction, second edition. Edinburgh: Edinburgh University Press, 2001.
• Meyer, Charles F. English Corpus Linguistics: An Introduction. Cambridge: CUP, 2002.
• Корпусна лінгвістика [Corpus Linguistics] / В. А. Широков, О. В. Бугаков, Т. О. Грязнухіна et al.; ed. by В. А. Широков. Kyiv: Довіра, 2005. 471 p.
• Hunston, Susan. Corpora in Applied Linguistics. Cambridge Applied Linguistics. Cambridge University Press, 2002. 241 p.
• McEnery, Tony and Andrew Hardie. Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, 2011. 294 p.
• How to Use Corpora in Language Teaching (ed. by John McHardy Sinclair). John Benjamins Publishing, 2004.

Thank you for your attention