
bd64c6e3089894ceec1a903df116c615.ppt
- Number of slides: 107
Learner corpus research hands-on
Tom Cobb
Didactique des langues / éducation, Université du Québec à Montréal
Saturday, October 31, 8:15 am - 10:15 am
lextutor.ca/cv/slrf_09/corpus.ppt
Dr. Cobb will provide a "crash course" in carrying out research using learner corpora and small teacher- or researcher-built corpora generally. He will lead a walk-through of a study he has conducted using corpus data and address the work that had to be done and issues to be resolved at each stage of the study, offering a behind-the-scenes look at how corpus research is carried out. In addition he will display some new and accessible online tools for corpus work, hoping to encourage instructors or researchers from other areas to get some hands-on experience in the learner corpus paradigm.
Dr. Cobb will provide a [1] "crash course" in carrying out [1a] research using learner corpora and [1b] small teacher- or researcher-built corpora generally. He will lead a [2] walk-through of a study he has conducted using corpus data and [2a] address the work that had to be done and [2b] issues to be resolved at each stage of the study, offering a behind-the-scenes look at how corpus research is carried out. In addition he will display some [3] new and accessible online tools for corpus work, hoping to [4] encourage instructors or researchers from other areas to get some hands-on experience in the learner corpus paradigm.
LEARNER CORPUS
- crash course
  - research using learner corpora
  - or other small corpora
- walk-through of a study
  - address the work that had to be done
  - issues to be resolved at each stage
- display online tools for corpus work
- encourage hands-on experience
- + a bit of context
At 10:15 you will know…
- What a corpus is
- Why corpus research is important
- What it has contributed to applied linguistics
- The uses it can have for researchers
  - … for instructors
- How to build a corpus
- Choice points in building a corpus
  - … interpreting a corpus
- Some tools of corpus analysis
- How to do a learner corpus study
- Results from some published studies
- The future of learner corpus studies
Corpora - what are they?
What is a corpus?
- A large collection of language in use, but
  - Not only large
  - Not necessarily so large
- Assembled systematically, according to explicit criteria
  - of representativeness
- How large?
  - Depends on the goal
Goals and sizes
- Linguistics goal - to represent entire language
  - 100 million words still under-represents common collocations
- Pedagogical goal - Ss meet common words, structures
  - 1 million words gives 10 hits for frequent words
- Applied linguistics goal - trace an acquisition feature
  - 1-200,000 words is common
Sub-goals and sizes
- Pedagogical goal - Ss meet common grammar and vocab
  - Grammar - 1 million is adequate
    - All structures get many hits
  - Lexis
    - Basic vocab - 1 million gives 10 hits @ 2k level
    - Main collocations - 1 million gives the main ones
      - Torrential rain?
    - “Raining cats and dogs”? - 1 billion gives 5 hits
    - Identify specialist lexis - 200,000 may be enough
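The hit counts on the two slides above follow from proportional arithmetic: expected hits are roughly the item's rate per million words times the corpus size in millions. The sketch below is only an illustration; the function name is hypothetical, and the two rates are back-derived from the slide figures (10 hits in 1 million words, 5 hits in 1 billion words) rather than measured.

```python
def expected_hits(rate_per_million, corpus_size_words):
    """Expected number of occurrences of an item in a corpus of a given size."""
    return rate_per_million * corpus_size_words / 1_000_000

# Rates back-derived from the slide figures (assumptions, not measurements).
RATE_2K_WORD = 10.0   # 10 hits in a 1-million-word corpus
RATE_IDIOM = 0.005    # 5 hits in a 1-billion-word corpus

for size in (200_000, 1_000_000, 100_000_000, 1_000_000_000):
    print(f"{size:>13,} words: "
          f"2k-level word ~{expected_hits(RATE_2K_WORD, size):g} hits, "
          f"rare idiom ~{expected_hits(RATE_IDIOM, size):g} hits")
```

Read this way, a 200,000-word corpus would yield only a couple of hits for a 2k-level word, which is why the right size depends on the goal.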
A growth industry
- Brown 1970: 1,000,000 words - http://icame.uib.no/brown/bcm.html
- BNC 1994: 100,000,000 words - www.natcorp.ox.ac.uk
- Cambridge Int’l 2002: 1,000,000,000 words - www.cambridge.org/elt/corpus/international_corpus.htm
- Plus ANC, Bank of English, Cancode …
Design / composition, e.g., Brown (1970s)
Page from Lextutor
What does a corpus represent?
- A language as a whole
  - BNC
- Or a part
  - Cancode oral, MICASE academic
- Or of an individual
  - Jack London’s collected works
- Or a group of individuals
  - Class of ESL learners
How do we read a corpus?
- Cannot read it naturally
  - Defeats the goal
- Needs the help of a search technology
  - concordance
  - index
  - frequency list
  - many others
Concordancers: http://www.lextutor.ca/concordancers/concord_e.htm
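For readers who have not used one, a concordancer essentially finds every occurrence of a search word and lines the hits up with a fixed window of context on each side (keyword in context, KWIC). The following is a minimal sketch of that idea, not Lextutor's implementation; the function name, the window width, and the sample text are all illustrative assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word hit of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword forms a central column.
        lines.append(f"{left:>{width}} {m.group().upper()} {right}")
    return lines

# Invented sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat on the log."
for line in kwic(sample, "sat", width=12):
    print(line)
```

Sorting the returned lines by their right-hand context is usually the next step, because it groups recurring patterns together.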
Lists: http://www.lextutor.ca/freq/compleat_lister
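A frequency list is the simplest of these tools: split the corpus into word tokens, count them, and sort by count. The sketch below assumes a deliberately crude tokenizer (lowercase alphabetic strings, with an optional internal apostrophe); a real lister makes more careful decisions about hyphens, numbers, and word families.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common()

# Invented sample text, for illustration only.
for word, count in frequency_list("The cat and the dog and the bird"):
    print(f"{count:>3}  {word}")
```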
Indexes: http://www.lextutor.ca/concordancers/text_concor
Corpora - why do we need them?
Why do we need corpora?
A. Corpus work is sexy
B. We have computers - let’s use them
Linguistic intuitions are unreliable
Linguistic intuitions are notoriously unreliable
- Demo 1: Do you think “however” is more common in spoken or in written language?
  - By how much? (3 to 1… etc.)
http://www.lextutor.ca/range_corpu
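Answering "by how much?" requires normalizing raw counts, because a spoken and a written corpus are rarely the same size. One common approach is to convert each count to a rate per million words and then take the ratio. The sketch below uses invented counts purely to show the arithmetic; they are not results for "however" or any other word.

```python
def per_million(hits, corpus_size_words):
    """Normalize a raw hit count to occurrences per million words."""
    return hits * 1_000_000 / corpus_size_words

def frequency_ratio(hits_a, size_a, hits_b, size_b):
    """How many times more frequent the item is in corpus A than in corpus B."""
    return per_million(hits_a, size_a) / per_million(hits_b, size_b)

# Hypothetical counts: 600 hits in a 1,000,000-word written corpus,
# 100 hits in a 500,000-word spoken corpus.
ratio = frequency_ratio(600, 1_000_000, 100, 500_000)
print(f"written : spoken = {ratio:g} : 1")
```

With these made-up figures the raw counts suggest 6 to 1, but the normalized ratio is 3 to 1.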
- Demo 2: What are the main senses of “back” and which is most common?
  - By what factor?
- http://www.lextutor.ca/concordancers/concord_e.html
- Demo 3: Can you rank-order these roughly by frequency band? 0-2k, 3k-5k, 6k-10k, 11k-15k
- http://www.lextutor.ca/freq/train/
Try one? http://www.lextutor.ca/freq/trai
But not always
- Demo 4: Which do you think is more common, “man and woman” or “woman and man”? By a factor of 10:1, 5:1, 2:1?
- Go live
- http://www.lextutor.ca/concordancers/concord_e.html
Many linguistic intuitions are unreliable
- Implicit patterns are extremely slow to extract from input (N. Ellis, J. Hulstijn)
- … because of the severe limitations on what we can see and remember unaided …
Scientific instrumentation - a brief history
Not only linguistic intuitions are problematic
- For every appearance, many possible explanations
- Stand outside on a starry evening: what does it look like?
The role of the computer in modern science is well known. In disciplines like physics and biology, the computer's ability to store and process inhumanly large amounts of information has disclosed patterns and regularities in nature beyond the limits of normal human experience. Similarly in language study, computer analysis of large texts reveals facts about language that are not limited to what people can experience, remember, or intuit. In the natural sciences, however, the computer merely continues the extension of the human sensorium that began 200 years ago with the telescope and microscope. But language study did not have its telescope or microscope. The computer is its first analytical tool, making feasible for the first time a truly empirical science of language.
- Cobb 1999
Before the computer, linguists could only study small samples of language at a time because of the limitations of their powers of observation and their memories. Even scholars who relentlessly collected instances of usage all their lives only had a few examples of any particular pattern, and there was no way of telling what they had missed.
- Sinclair, 2003, p. ix
Early corpora
- Dr Johnson
- A Dictionary of the English Language
  - Longman 1755
- Based on quotations from literature copied onto many slips of paper
But using literature has some problems
- Old and recent lit conflated
- Is literature truly representative of life’s typical situations?
- Is its lexis a bit recherché?
120 years later
- James Murray, OED 1879 - REAL LANGUAGE examples sent in
- Oxford City Post Office sets up a special sub-branch for OED
Most sciences supplemented by technologies from the 15th century
- BIOLOGY: microscope
- ASTRONOMY: telescope
- NAVIGATION: astrolabe
- etc.
- Language study - late 20th century: machine readable
Thus the “corpus revolution”
- Dictionaries
- Grammars
- Courses
- Studies
Of particular note… LGSWE
Corpus - successes
Fabled Core of English is close to disclosure
- Main lexis + coverage
  - 2000 word families = 80%, Carrol et al 76
- Main collocations in BNC-speech
  - 84 HF collocations belong in 1k list, Shin & Nation 2007
- Main phrasal verbs
  - 25 phrasal verbs = 1/3 of all phrasal verbs in BNC, Gardner & Davies, 2007
- Main morphologies
  - Bauer & Nation, 1993
- Main stress patterns (Murphy & Kandil)
- Cf. All this coming together at the same time as the human genome, also a corpus project
Prescriptivism is close to defeated in language pedagogy
- Except one debate remains
  - Corpus-based v. corpus-informed approaches
- Corpus-based
  - If it’s in the corpus X times, it’s OK
    - X to be defined
- Corpus-informed
  - Corpus information is one source of information
Numerous errors are now corrected (in principle)
- Definitions no longer harder than the defined word
- Simple present no longer automatically the first verb tense taught
- Written language no longer the model for spoken language
- Status of multi-word units reinstated
- Grammar no longer taught …
  - via unknown lexis
  - as unconnected to lexis
Task
- Grammar as connected to lexis?
- Let’s see what this could mean
  - + practice “reading concordances”
- Get out “borders on”
  - (From Sinclair, http://www.twc.it/)
- What is the pattern?
- What does it mean?
  - Can we call this “word grammar”?
User extract
became is more than just a way of life – it BORDERS on a religion. But there is
of the laws of the sea sometimes BORDERS on arrogance. Not only should the
international collaboration is great and BORDERS on cartel like behaviour.
who say using the extremist label BORDERS on demagoguery and will only serve
Yugoslavia. What is occurring there BORDERS on genocide. No country or society
Careless but losing two in the one day BORDERS on incompetence. Now Charlie
Turkey, the only NATO country which BORDERS on Iraq, is playing a key role in
Her mastery of the short story BORDERS on perfection. kate saunders
country’s stagnant growth, which now BORDERS on recession. Here again, the
challenge looms ugly when recession BORDERS on slump. Everybody is on edge, The auth
the case_0 of maxim ‘The collector’s passion BORDERS on the chaos of memories.’ before staged
d, although and an easy going demeanour which BORDERS on the charismatic, it’s hardly popular
Kosovo, a professional solicitousness which BORDERS on the dangerous edge of savings account
al Asian clash. He said: ‘The hostility there BORDERS on the dangerous.’ Black players and – an
The sky, a then Claire makes a statement that BORDERS on the downright cocky. When I ask The l
emories.’ before staged protests at these two BORDERS on the east and west of their speaking t
ut there is the Sierra Madre” as he dubs them BORDERS on the eccentric. Mountain lions
courses and opportunities, that it BORDERS on the embarrassing. This the straight,
ve. He portrays has a streak of bravery which BORDERS on the foolish. She has delicate to buy.
cause the amount of work he is required to do BORDERS on the incredible. In the case_0 of maxi
rous edge of savings accounts versus shares, BORDERS on the irresponsible. an independent Bos
his private His love for all things maritime BORDERS on the obsessional. He is truly Not surpr
e, four even_0 harbour a passion for DIY that BORDERS on the obsessive. But there is the Sierr
ybody is on edge, The author, a lifelong fan, BORDERS on the obsessive. He portrays has a stre
hen I ask The linear intensity of their songs BORDERS on the paranoid and, although and an eas
. Wander into the The atmosphere of paranoia BORDERS on the pathological. The sky, a then Clai
ng. This the straight, but his winning effort BORDERS on the sensational because the amount of
d his own most dangerous regions on Earth. It BORDERS on the Serbian province of Kosovo, a pro
elicate to buy. A family with three children BORDERS on the socially acceptable, four even_0
of their speaking to troops in Xinjian which BORDERS on the Soviet Central Asian clash. He sai
players and – and to performing them sort of BORDERS on the surreal. He had his own most dang
He is truly Not surprisingly, the atmosphere BORDERS on the surreal. Wander into the The atmo
hardly popular music. In some cases_1, this BORDERS on wholesale plagiarism. That’s
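One way to start on "What is the pattern?" is to tally the word that comes immediately after the node phrase. The sketch below does this for a handful of lines copied from the concordance above; the function name is hypothetical, and counting only the first word to the right is a simplifying assumption (it will not by itself separate "the + adjective" from "the + place name").

```python
import re
from collections import Counter

# A few lines copied from the "borders on" concordance above.
CONC_LINES = [
    "it BORDERS on a religion.",
    "of the laws of the sea sometimes BORDERS on arrogance.",
    "What is occurring there BORDERS on genocide.",
    "Turkey, the only NATO country which BORDERS on Iraq,",
    "His love for all things maritime BORDERS on the obsessional.",
    "The atmosphere of paranoia BORDERS on the pathological.",
    "the atmosphere BORDERS on the surreal.",
]

def right_collocates(conc_lines, node="borders on"):
    """Count the first word found to the right of the node phrase."""
    pattern = re.compile(re.escape(node) + r"\s+(\w+)", re.IGNORECASE)
    counts = Counter()
    for line in conc_lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

for word, count in right_collocates(CONC_LINES).most_common():
    print(f"{count:>2}  {word}")
```

Even on this small sample, "the" dominates the first slot to the right, which points the reader toward the "borders on the + adjective" frame, alongside bare abstract nouns and the literal geographic use.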
Corpus - failures
And yet…
“The corpus-driven revolution in applied linguistics continues apace, and along with it the paradox that as corpora change the face of applied linguistics (most dictionaries, grammars, and course books now claim to be corpus based) it is largely without the participation of practitioners. Only a few teachers or researchers have ever built a corpus or delved through concordance lines.”
- Cobb 2008, review of CBLS
Stalled enterprise (McCarthy, 2008)
Teachers and researchers need to become producers, not just consumers, of corpus research
Why?
- To evaluate “corpus-based” (CB) claims
  - Often vocab but not grammar is CB, etc.
  - What kind of corpus?
- To effectively lobby to get their CB needs met
  - e.g., gram+lex of specific domains
- To develop their own CB materials
  - Who still uses a course book?
- To build their own corpora for action research projects
Stumbling blocks
- Some intimidation remains attached to corpus work
  - It is not universally appreciated in SLA - Widdowson
  - Computer stuff looks daunting
  - Seems more linguistics than applied
POLICY OF THIS WORKSHOP: There are some fairly clear reasons to do this and simple ways to get started
…
- The classic corpora are not easy-access
  - Despite long lists on the Web
  - Even McCarthy’s Cancode is 100% unavailable to researchers
    - Ref. Tribble review of O’Keeffe et al.
  - Especially in languages other than English
    - Lextutor users’ requests for German
=> Solutions <=
[1] Band together (CECL)
[2] Make your own =>
DIY corpus - why?
German: http://www.lextutor.ca/concordancers/braun_info.html
Why bother - Google is a corpus (Ref: Robb)
Google v. corpus
- Classic case, breadth v. depth
- Web-as-corpus gives massive volume
- Even smallish DIY corpus gives
  - Better quality search
    - Families, starts with, ends with
  - Easier access to detail & context
  - Better exposure to pattern
- + you can make your own, target your own needs
  - Material for learners
  - Material from learners
DIY corpus - how?
Build your own - HOW
- Many texts on the Web
  - E.g., http://www.lextutor.ca/bookbox/
  - Question of selection replaces question of access
- Must be or become text files
  - (whatever.txt)
  - “dot txt”
  - Whether you want a one-big-file corpus
  - Or several-small-files corpus
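If the texts are already plain .txt files in one folder, turning a several-small-files corpus into a one-big-file corpus is a short script. This is a sketch under stated assumptions: the function name and the paths in the usage comment are hypothetical, files are assumed to be UTF-8 (undecodable bytes are replaced rather than raising an error), and texts are joined with a blank line.

```python
from pathlib import Path

def build_one_file_corpus(source_dir, out_path):
    """Concatenate every .txt file in source_dir into a single corpus file.

    Returns the number of source files merged.
    """
    # Collect the file list before opening the output file, so the output
    # is not picked up by this run even if it lives in the same folder.
    files = sorted(Path(source_dir).glob("*.txt"))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            out.write(path.read_text(encoding="utf-8", errors="replace").strip())
            out.write("\n\n")
    return len(files)

# Hypothetical usage:
# build_one_file_corpus("learner_essays", "corpus/all_essays.txt")
```

Writing the merged file to a different folder from the sources is the safer habit, since a second run would otherwise fold the previous output back into the corpus.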
Only plain .TXT files make corpora
One big file: a) Insert
One big file: b) Upload http://www.lextutor.ca/tools/corpus_builder2/
DIY corpus for learning materials
Using CB tools to select / develop learning materials?
- Using news texts?
  - Check first against CB frequency lists
- Pre-teaching vocab?
  - Find the CB keywords
- Writing tests?
  - Check it contains gram+lex the Ss have actually seen
- Teaching a speaking course?
  - Check models are speech not writing
Build corpus as learning materials
- For some purpose
- Must make some sampling sense
  - E.g., one London - all London
  - All course materials
  - Corpus of graded readers
Learning materials - multi-file corpus: http://www.lextutor.ca/callwild
Learning materials - one-file corpus: http://conc.lextutor.ca/list_learn/eng/
Learning materials - one-file corpus: http://www.lextutor.ca/corpus_grammar/
DIY for research purposes
1. Written production
Learner text more and more available
- Collect & investigate because it is there?
Some typical purposes
- determine needs
- check progress
  - Cf. active vs. passive ability
- explore for experimental hypothesis
Constraints
- Choose topic carefully
  - Does topic suggest just one verb tense?
  - Cf. capital punishment vs. my holiday
  - Very different language demands
Models of LCs
- Learners vs. NSs
- Ls vs. Ls
  - Snapshot or longitudinal (same Ls at diff times)
  - Or diff Ls at diff stages in learning ≅ longitudinal (cross-sectional)
OR Belz (04, citing Cobb 03): 4 LC variables should be controlled:
1. type of learner (e.g., FL vs. SL)
2. stage of learner
3. text type/purpose/register/conditions
4. the availability of a similar corpus of native speaker data
NS data must be comparable
Best example is UCL’s Locness, the Louvain Corpus of Native English Essays
- 149,574 words of argumentative essays written by American university students
- 18,826 words of literary-mixed essays written by American university students
- 59,568 words of argumentative and literary essays written by British university students
- 60,209 words of British A-level argumentative essays
Issues in LC
- SMALL ISSUES
  - Tag or not?
  - Spell check or not, or at what point?
  - One file or many?
- BIG ISSUE - Granger 2004, p. 124
  - What kind of data is a LC?
  - “LC typically fall into the category of natural or open-ended data” while “SLA researchers tend to prefer [1] introspective or [2] experimental/elicited data…”
- V BIG ISSUE
  - Is this paradigm an instance of Bley-Vroman’s (1983) “comparative fallacy”?
Once made, flat or tagged?
- Pros of flat corpus
  - If for learning materials, = what learners face
    - THEY must make sense of data
    - Tagged does it for them
  - Easier to make, you can have more
  - Search inputs require some work, trial + error
- Pros of tagged corpus
  - Precise comparisons are possible
    - Especially for N-N compounds and errors
  - But learner data poses special problems
    - Tags are needed for error analysis
      - VP + ADV + D OBJ, etc.
    - Yet learner data confuses taggers
Error tagger (UCL Err Extractor - Granger 02)
- specific-purpose, known-target tagging
- Unlikely to confuse tagger, but a ton of work
Here’s a set of studies I’m working on
- LC study typically begins with a practical problem
  - Theoretical conundrums? Not so much
- E.g., this problem: Montreal learners
  - Eight years ESL
  - At 18 many switch to English-language system
  - With insufficient vocabulary for advanced study in English
  - Fully competent only at 1k
Big question
- Input: What lexis are these kids getting in school?
- RQ: Do their NNS teachers have enough vocab themselves to get kids over the 1k hump?
Procedure
- Run vocab size test on Ts
  - Nation’s new 14k - lextutor.ca/tests/
- Get small exploration corpus of their production
  - “How could the TESL program be improved?”
  - Argumentative + opinion
- Get similar-sized NS corpus
  - LOCNESS, A-Levels, UK
  - “An invention that has changed how we live”
- Compare for structure and lexis
  - Quantity (frequency) and quality
  - Focus on lexis 2k+
Prelude
Look at TESLProg.txt in your handout as demo mini-corpus
Writing task was this
- 5-minute in-class writing exercise
  - Peter Elbow, keep writing idea
- Discursive topic
  - How could UQAM new TESL program be improved?
- Homework:
  - identify your main point
  - focus + elaborate for Web publication
- Each paper gets three rounds of feedback
Comparison text from Locness (ex 1)
Computers have become a huge part of our lives in both the areas of work education. But are they such a good thing? When calculators came along a drop in ability of students for mental arithm obvious and now they are used for the simplest calculations. The compute the same thing. Computers encourage laziness in the general public, why something yourself when the computer can do it for you. This is very time efficient but it is causing people to forget basic ideas. For instance, spellin longer as important as it was you can simply use a "spellcheck" to correct which is absurd. For the youth of today computers offer links around the world and millions figures. This could be argued to be educational. However, this is killing the of children and they spend hours sat at a keyboard tapping away in the doo of the house. They should be out enjoying themselves and gaining experie themselves instead of reading about them on a flat screen. It is said that you can meet people through computers and have `relationsh preposterous and people are losing the ability to communicate and form re
Comparison corpus from Locness (2)
Computers may be the future but what part will man have in this future. There will be no need for people to go to school as they could be taught at home, people would hardly ever talk and the only career available would be for computer programmers. I agree that computers are helpful but people should not live through their computers and be so reliant on them. They should read books and live more in order to regain their lost imagination and sense of adventure. Also, in schools I feel that work should be done mainly by hand calculators and computers should only be used minimally in mathematics in order to stop the production of computer addicts and again have normal people.
More lexis? Less? A little? A lot?
http://www.lextutor.ca/vp/bnc/
Which analysis software?
Basic structure snapshot (Qc corpus): http://www.lextutor.ca/concordancers/text_concord
http://www.lextutor.ca/concordancers/text_concord
http://www.lextutor.ca/tuples/eng/
Lexis comparison
Lexis comparison: http://www.lextutor.ca/vp/bnc/
- NNS corpus (Quebec TESL trainees)
  - 155 post-1k word families / 3356 tokens
- NS corpus (UK A-Levels essay)
  - 269 post-1k word families / 3630 tokens
- But that’s not all
  - Split up corpus
  - Look at individuals
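Normalizing the two figures above makes the gap easier to see: 155 / 3356 is about 46 post-1k families per 1000 tokens for the NNS corpus, against 269 / 3630, or about 74 per 1000, for the NS corpus. The sketch below shows the general shape of such a profile. It is not the Vocabprofile algorithm: it counts word types rather than word families, the tokenizer is crude, and the tiny base list stands in for a real 1k frequency list.

```python
import re

def post_list_profile(text, base_list):
    """Count tokens and distinct word types that fall outside base_list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    off_list = [t for t in tokens if t not in base_list]
    types = set(off_list)
    return {
        "tokens": len(tokens),
        "off_list_tokens": len(off_list),
        "off_list_types": len(types),
        "off_list_types_per_1000": 1000 * len(types) / len(tokens) if tokens else 0.0,
    }

# Toy stand-in for a first-1000 list (assumption, for illustration only).
TOY_BASE = {"the", "a", "is", "of", "and", "in", "to"}
print(post_list_profile("The cat is a cat of the night", TOY_BASE))
```

Running the same function once per writer, instead of once per corpus, is the "split up corpus, look at individuals" step.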
Almost all post-2k items are used by one writer
Conclusion
- Interesting peripheral differences for another study
  - Syntax correct but unelaborated
  - Phrases heavy on the short end, light on the long end
  - Low proportion of noun-noun
- Vocab
  - Heavy reliance on 1k vocab
  - Low post-1k
  - Items used by one person
  - Yet good recognition scores at 3k+ levels
- Known words are not getting used
- Unlikely to get used in classroom
2. Oral production corpus
Let’s learn more about the previous study:
- Follow trainees into their classrooms
  - Does the predicted pattern occur?
  - If new words appear, are they recycled?
- *See Horst’s Teacher Talk Corpus study in a forthcoming RIFL (2011)
- (Note: Different subjects - here we are establishing tools & method)
18 hrs of NS-T classroom talk
Looks like rich lexical input…
Summary
- Post-1k words (learning zone)
  - 1570 families
  - 900 appear in one class-hour only
    - Inc. 300 one TIME only
- Recycling is not happening
  - Now add this to the NNS data
    - Few post-1k used in own writing
  - The problem starts to make sense
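The recycling question comes down to range: in how many class-hours does each post-1k word turn up? A sketch of that count follows, assuming one transcript string per class-hour. The function names, the toy base list, and the three sample "sessions" are invented for illustration, and word types again stand in for word families.

```python
import re
from collections import Counter

def range_across_sessions(sessions, base_list):
    """For each off-list word, count the number of sessions it occurs in."""
    session_counts = Counter()
    for text in sessions:
        # A set, so a word counts once per session however often it is said.
        words = set(re.findall(r"[a-z]+", text.lower())) - set(base_list)
        session_counts.update(words)
    return session_counts

def single_session_words(session_counts):
    """Words that appear in exactly one session, i.e. are never recycled."""
    return sorted(w for w, n in session_counts.items() if n == 1)

# Invented sample data, for illustration only.
sessions = ["the glacier is cold", "the glacier melts", "a volcano is hot"]
counts = range_across_sessions(sessions, {"the", "is", "a"})
print(single_session_words(counts))
```

Dividing the length of the single-session list by the total number of off-list words gives the kind of proportion reported on the slide (900 of 1570 families).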
Or, Alert’s 108,000 words, no past tense! (went, saw)
http://www.lextutor.ca/concordancers/concord_e.html
3. Goal clarification
Let’s work through a published study
Ovtcharov & Cobb 2006 (in French)
- Situation: Ottawa
  - Civil service promotions depend on success in L2 oral interview
  - Pass/fail evaluated globally (= impressionistically)
  - “A well developed vocabulary” is one of the stated criteria
  - But what is it? The usual soft focus
Needed for the study
1. Corpus of transcribed oral interviews
   - Both passes, fails, & borderlines
   - 24 of each, 25-35 minutes
   - 100s of hours’ work
2. French version of Vocabprofile
   - Lemmatized large-corpus based, k-leveled frequency lists?
   - Miraculously appear in c. 2001
   - See Cobb & Horst, 2004
3. Usable NS reference corpus
   - Provided by Beeching, 2001
   - French oral interviews in USA
Result
- Identifiable difference at 2k
- Strong difference at 3k+ MHL (off-list)
Significance (assuming replication)
- One less failure-to-communicate in the vastness of high-stakes language instruction
- The instructional design process has a place to begin
So…
Corpus research is a fairly simple, bean-counting type of research
That can solve complex problems in language learning & teaching, both
- Practical
  - What do these people need to learn?
  - Can examiners’ impressions be operationalized?
- Theoretical
  - E.g., piecing together the portrait of advanced interlanguage (Cobb 2003)
Course tie-up
At 10:15 you now know…
- What a corpus is
- Why it is important
- What insights it has yielded in applied linguistics
- The uses it can have for researchers
  - … for instructors
- How to build a corpus
- Choice points in building a corpus
- Some tools of corpus analysis
- How to do a learner corpus study
- The results of some published learner corpus studies
- The future of learner corpus studies
The Future
Where do we go from here?
- Corpus research carries on shining the light into dark corners
  - 2007-2009 work from Dee Gardner, Stuart Webb
- Some increase in corpus awareness
  - Teacher training programs
  - MA methods courses
- Collaboration reduces labour
  - CECL, the Locness reference corpus
  - Promise of automatic corpus comparisons at Calper Gold
- Dev. world can play as tools go online
If we have time…
The final challenge to the utility of frequency lists
- As already seen
  - We are closing in on the Core of English
  - This includes a smaller than expected group of true homonyms
- No corpus tool-kit so far deals with these systematically
  - E.g., a Vocabprofile analysis does not distinguish bank and bank
Go live: http://www.lextutor.ca/concordancers/text_concord
www.lextutor.ca
This PPT at http://www.lextutor.ca/cv/slrf_09/corpus.ppt
References list at http://www.lextutor.ca/cv/slrf_09/hando