
- Number of slides: 121
Tom Cobb (cobb.tom@uqam.ca), Université du Québec à Montréal, Didactique des langues. The past and future of Lextutor.ca. Nov 6, 9h-10h15, Alberta TESL 2010, Edmonton 1
Greetings! Thanks for having me back! We last met on Oct 18, 2002. Anybody who was here last time? The talk then was “Designing comprehensible input with Information Technology”, or operationalising “i+1”. This will be largely more of the same, I’m afraid. Like playing the guitar: easy start, but the hard part is at the end. ATESL 2002 inspired me to keep this going! 2
The main idea was… “i+1” is a great idea, but undefined (like Republican tax cuts?). Yet i+1 can be defined, at least for lexis: with frequency-based vocab tests and a matching frequency-based text profiler. 3
In 2002, Lextutor.ca racked up about 10,000 hits/month, or 120,000/yr. 6
In a good month in 2010, Lextutor.ca racks up 10,000 hits/day. 7
Shameless boasting! Only because the real ownership of Lextutor has long been out of my hands. I am now caretaker to a resource that thousands use and contribute to, from every land on earth but principally about 25 countries. 8
Nov 3, Noon, Eastern Standard Time 9
Nov 3, midnight, Eastern Standard Time 10
Last 10,000: Nov 3, noon, Eastern Std Time. Last 10,000: Nov 3, midnight, Eastern Std Time. 11
Challenges: 90 seconds on Lextutor = 16 hits, some heavy 13
Traffic management tricks: multiple servers; processing offload, server side to client side (JavaScript); analysis of usage patterns; dedicated versions of programs. Example: users are deploying the massive apparatus of Vocabprofile-BNC to check the frequency level of single words. Solution: a resource-light, easy-find, one-word version of the program. 14
Other challenges: Universities’ ever-tightening security (faced with annual tripling of student downloads). Proliferation of Web browsers, plus unannounced changes in how they interpret JavaScript. Ever faster evolution of Web technologies (CSS, Unicode, XML, smart phones). Dwindling of the 33% prof.-time for R+D, just as the D-issues become more complex, and with continuing insistence on R>D for promotions etc. 16
Other challenges (2): New rules at the funding ($$$) agencies (SSHRC et al.): can no longer “buy” my own time; cannot offer salaries of professional programmers. Issues of an “old” Website: need to enrich experience for long-time users without losing access for new users. 17
Other challenges (3): Some headwinds in language teaching. Language technologies are made for a close inspection of language: concordancing (which words go together), frequency (which words are more common, less common… etc.). Yet the communicative potential of these technologies has become dominant in LL, effectively ushering in a “communicative era”, Part 2. Text computing has failed to link to FonF or LA. 18
So… There is a huge role for inspiration in all of this! 19
A brief run through D-evelopments inspired by user voices, 2008-2010. Live demos and hands-on at Workshop, 10h30-11h45. 20
1. Small picture stuff 21
http://www.lextutor.ca/cloze/n_word/ 31
VP_Cloze: What texts “look like” to learners
Forestry A (intact version): Even if used in an unprocessed form, the increasing wood supplies will require a larger labour force, an improved roading network, and expanded transport and processing facilities. If the trees are to be exported, then certain investments must be made. They will include investments in: logging machinery and equipment; logging trucks, and other vehicles required for the transport of processed products; upgrading and maintaining roads (or rail or coastal shipping facilities where appropriate); and port facilities. The list could be extended to include overseas shipping, and accommodation and township facilities forestry workers.
Forestry B (1990 version, 1k words known): Even if used in an unprocessed form, the increasing _____ will require a larger labour force, an improved roading network, and _____ and processing facilities. If the _____ are to be _____, then certain _____ must be _____. They will _____ investments in: logging machinery and _____; logging trucks, and other vehicles _____ for the transport of _____ products; upgrading and maintaining _____ (or _____ or coastal ______ where appropriate); and port _____. The list could be extended to include ______, and _____ and township facilities forestry _____.
Forestry C (mixed profile of words known + affixes + size information): Even if used in an unprocessed form, the increasing ________ will require a larger labour force, an improved roading network, and ______ed _____ and processing facilities. If the ____s are to be ______ed, then certain _____s must be ____. They will ______ investments in: logging machinery and _____ment; logging trucks, and other vehicles _______d for the transport of _______ed products; upgrading and maintaining ____s (or ____ or coastal __________ where appropriate); and port _____. The list could be extended to include ________, and _______ and township facilities forestry ______s. 37
VP-Negative: which HF words are NOT in course book etc… 40
Randomicity: Does Excel provide a nice random numbers list? No. (Click) 41
http://mywordcoach.us.ubi.com/ 42
2. Medium picture stuff. One of Lextutor’s big ideas is to build tutorials “on top of” concordance and frequency output. Frequency: not so hard… 46
Click to go live 47
Concordance: not so easy. But some of this takes some time… The toy-set demo is easy. But classroom-usable? 48
Some mid-1990s ideas… 49
Strong empirical result, 1997. Seemed to crack the no-transfer problem. But it was only a toy system: it made a point but was not really usable beyond its context. A 200-word learning task. Purpose-built definitions: the “basic meaning” underlying several uses of each word. Purpose-built corpus: students’ own course materials, jiggered until there were five concordances per word. So 200 words, 1000 examples. 51
So how long does it take… WITH an empirical finding, WITH a demonstrated need, WITH an available technology, WITH willing learners… to ramp this idea up to a full corpus-based tutor? Initial target: 3,500 most frequent words. Eventual target: 20,000 words. 52
Needed: 3,500 short, basic definitions that “fit” 85%+ of cases; and a corpus that is big enough to assure >10 examples for each word, in language that is comprehensible to a learner with <1500 words. 53
Build comprehensible corpus 54
Scour the Web for >>short<< definitions: Wiki collection; GRE + TOEFL tip-sheets (“Here are the words you need to succeed…”). About 40% of these need a major re-write. 55
April 2010: an online, expandable, running prototype 57
3. Big picture stuff. The original VP scheme for defining i+1 was (frankly) pretty rough. Categories too few, too big: 1k + 2k + AWL. The AWL ran into problems… (see Cobb 2009 in RIFL). 58
Buck did not read the newspapers or he would have known that trouble was brewing, not only for himself but for every tide water dog, strong of muscle and with warm long hair, from Puget Sound to San Diego. Profile (bands 1 / 2 / 3 / remainder): 80% / 7% / 0% / 13%. 60
Demo - Rex Murphy by Classic and by BNC-VP 61
GSL + AWL: 1k 80.17%, 2k 5.65%, AWL 1.68%, off-list 12.50%. BNC + proper noun auto-extract (C+L, 09). BNC lists (N, 2006). 62
How does this help us? Look at the texts on the workshop handout. Rough rank order by lexical richness. Then click here. 66
New frequency lists, provided by huge new corpora like BNC and COCA, clearly have benefits on Lextutor. But size does not solve every problem. Little corpus projects like Lextutor may have a contribution. 68
Even the newest frequency lists fail to distinguish between equivalent written word forms (homographs). “Money banks” are somewhat bigger (more frequent) than “river banks”, and yet the official BNC lists (e.g., Leech et al., 2001) do not make the distinction. 69
Nor do the newest lists (Davies & Gardner, 2010) 71
How important is this problem? If we are going to say “A learner should know 5000 high frequency word families to read an academic text with modest dictionary look-up”, and “here are the 5000 words”, we have to know we have the 5000 words right. 72
For example, row is a GSL-1 item. BUT: row a boat; ducks in a row; a row after the pubs closed. 1. Three unrelated words. 2. None frequent enough in itself to be GSL-1. 3. Except through lump-summing of: different parts of speech within a ‘family’ (OK); related meanings (OK); unrelated meanings (not OK). Unrelated meaning = a homograph. 73
Defining polysemy v. homonymy/homography. From Wang & Nation (2004) 74
How many homographs are there anyway? Wang & Nation (2004) on the extent of homographs in the AWL: 60 out of 570 families are homographs. However, in all but four cases, just one meaning is the basis of membership. Convert (change) = 95% in corpus, vs. convert (points in rugby) = <5% in corpus. So when teaching convert? Forget the rugby meaning. 75
But homography is mainly a feature not of the AWL but of very high frequency words. How many such words? 76
“Homonyms, homographs and homophones: Just how homophobic is high frequency English?” MA study, Wellington NZ, Kevin Parent. Parent’s method: look up every GSL word in the e-Shorter OED; for those with >1 separate entries, check frequency by hand in the BNC corpus; record cases where both meanings are quite frequent. RESULT: 77
178 HF families involved out of 2000, or about 9% 78
Therefore this may be a problem worth tackling. Needed: 1. A way of extracting homograph_1 and homograph_2 from natural texts, on the fly. 2. An official ranking of each in a large corpus. 79
The goal: re-write the BNC lists. 1000 List: … bank_1 … 4000 List (?): … bank_2 80
Or take the case of bear (officially 1k). >70% of uses are “I can’t bear it”; <30% are “bear in the woods”. So should bear = animal be a 1k word? The animals line up rather neatly: horse, cat, dog => 1k; chicken, pig, cow => 2k; mouse, rat, wolf => 3k; deer, snake => 5k; moose => 9k; coyote => 14k. Furry bear about here? My guess: “can’t bear it” => 2k really; “bear in the woods” => 3k really. 81
But how? How do humans extract homographs? Using context: (1) global / semantic / pragmatic context, which is hard to measure and hard to automatize; (2) specific collocates, which are 99% non-overlapping, measurable in principle, and possible to automatize. Stubbs (2009, p. 19) claims that “Homonyms can be automatically distinguished by their collocations” (i.e. by computer programs). But functioning demonstrations are not easy to find! 82
BNC data => input to Lextutor concordancer (lextutor.ca/concordancers/concord_e.html). Token collocates > 5 identified. 83
Each homograph with all collocates >5, familized and inspected. Include “play” if >50 hits (from 1000, so 5%) within 4 words. 84
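The collocate-harvesting step can be sketched roughly as follows. This is only an illustration in Python (the slides mention Perl), and the function name `collocates`, its parameters, and the sample lines are assumptions made for the example, not Lextutor's actual code. It counts, for each word found within 4 words of the node word, the number of concordance lines it occurs in, and keeps those above a share threshold (5% in the slide's example of 50 hits from 1000 lines).

```python
import re
from collections import Counter

def collocates(lines, node, window=4, min_share=0.05):
    """Return words occurring within `window` words of `node`
    in more than `min_share` of the concordance lines.
    Hypothetical helper: names and defaults are illustrative."""
    hits = Counter()
    n_lines = 0
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        if node not in tokens:
            continue
        n_lines += 1
        i = tokens.index(node)  # first occurrence of the node word
        # Count each collocate once per line, left and right context.
        context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        hits.update(context)
    return {w: c for w, c in hits.items() if n_lines and c / n_lines > min_share}

sample = [
    "throw the ball to first base",
    "hit the ball with the bat",
    "the ball rolled away",
]
frequent = collocates(sample, "ball", min_share=0.5)
```

On this tiny sample only the function word "the" clears a 50% threshold, which hints at why the harvested collocates still need to be inspected by hand.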
THEN: Incorporate collocates into a Perl algorithm that can extract data. View at http://www.lextutor.ca/concordancers/text_concord/show_db.pl 85
Develop as input to a Perl algorithm that can extract data from input text. Program goes through words… Comes to ball:
If base or bases within 4 words => ball_1
If bat, bats, batted, batting, batter within 4 words = …
If dance, date, floor, or gives / gave / given + a => b …
If both => majority wins
If tie or nothing => use previous
If nothing and no previous => ball_0 86
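The decision rules on this slide (collocate within 4 words, majority wins, fall back to the previous assignment, otherwise tag _0) can be sketched as below. This is a minimal Python illustration under stated assumptions, not the Perl program itself: the collocate sets for "bank" are invented for the example, the function name `tag_homographs` is hypothetical, and a tie with no previous assignment is treated the same as "nothing and no previous".

```python
import re

# Hypothetical collocate sets, for illustration only.
COLLOCATES = {
    "bank": {
        1: {"money", "account", "loan", "cash"},
        2: {"river", "water", "fishing", "muddy"},
    },
}

def tag_homographs(text, collocates=COLLOCATES, window=4):
    """Tag each listed homograph as word_1, word_2, ... or word_0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    previous = {}  # last sense assigned to each homograph
    tagged = []
    for i, tok in enumerate(tokens):
        senses = collocates.get(tok)
        if senses is None:
            tagged.append(tok)
            continue
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        votes = {s: sum(1 for w in context if w in words)
                 for s, words in senses.items()}
        best = max(votes.values())
        winners = [s for s, v in votes.items() if v == best]
        if best > 0 and len(winners) == 1:
            sense = winners[0]             # clear majority
        else:
            sense = previous.get(tok, 0)   # tie or nothing: use previous, else 0
        if sense:
            previous[tok] = sense
        tagged.append(f"{tok}_{sense}")
    return tagged

demo = tag_homographs(
    "He put money in the bank. Then the bank closed. She sat on the river bank."
)
```

In the demo, the second "bank" has no collocate nearby and so inherits sense 1 from the first, while "river" switches the third to sense 2. Note that the sketch matches exact word forms only, whereas the slides describe collocates being "familized".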
Complexities: Output from one assignment is input to another. The algorithm comes to “bear arms”: which “arms” is it? It helps if “bear” has already been assigned. Was it “momma bear” (bear_2) or “grin and bear” (bear_1)? Every new entry affects those already in place. 87
Complexities: Some search expressions are quite complex, to pick up all instances of a separable collocation: \bbear\s+[a-z ]*\s+in\s+mind matches BEAR (his view / this / it / the cost) in mind. 88
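A separable collocation such as "bear … in mind" can be picked up with a pattern along the following lines. This is a hedged sketch in Python rather than Perl: the pattern is a reading of the slide's expression with its backslashes restored, the trailing word boundary is an addition, and the name `find_separable` is hypothetical.

```python
import re

# Assumed reading of the slide's search expression (backslashes restored).
SEPARABLE_BEAR_IN_MIND = re.compile(r"\bbear\s+[a-z ]*\s+in\s+mind\b")

def find_separable(text):
    """Return every 'bear <something> in mind' stretch found in text."""
    return [m.group(0) for m in SEPARABLE_BEAR_IN_MIND.finditer(text.lower())]

examples = [
    find_separable("Please bear this in mind."),
    find_separable("Bear the cost in mind"),
    find_separable("a bear in the woods"),
]
```

Because `[a-z ]*` is greedy and unbounded, a pattern like this can also swallow long unrelated stretches between "bear" and a later "in mind", which is one reason such expressions need careful testing.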
Development text: input (note the artificially elevated density of homographs) 89
Development text: output (98% correct) 90
Behind the scenes: Basis of choices ** 91
Try yourself / follow progress at http://www.lextutor.ca/concordancers/text_con 92
Test for Dehomynizer: Q: Can it handle novel texts? 93
Problem 2: VH-F multi-words. Some multi-word units (independent, non-compositional meaning) are so frequent that they are actually 1st and 2nd 1000 items. E.g., learners will meet “of course” more frequently than the 2k item “window”. 505 of these belong in the most frequent 5,000 (so >10%). 94
The solution: Re-work the BNC lists to include high frequency multi-words in their proper places. This work has recently been accomplished for the first 5k lists (Martinez & Schmitt). Re-work the VP algorithm to find and categorize these HF-MWUs. 95
Now to integrate this information… into frequency routines, into Vocabprofiles of user texts, into Vocabulary Size Tests =>> MASSIVE UNDERTAKING 97
Step 1: Integrate Freq Lister and N-gram 98
A real programming trick will be… distinguishing the “at_all” phrase from the “at all” words: I have none [at_all] v. [Look at] [all the cars]. In fact (insight) these too are homographs. 101
How would linguists approach this trick? Grammatical parsing. How will educators approach this trick? As we just saw with homographs: leave the job to collocates. 102
Advantages of the collocates approach: faster online than a parser; has built-in tutorial possibilities (build a collocs tutor on top of the algorithm). 103
ALGO: If {some, none, any (+ one|body|thing)} + “at all” => “at_all” is one k-2 phrase. Otherwise => “at all” is two k-1 words. 104
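The rule above might be sketched as follows. This is an illustrative Python version, not the VP algorithm itself: the function name `merge_at_all` is hypothetical, and the 4-word look-back window is an assumption borrowed from the homograph routine, since the slide does not say how close the trigger word must be.

```python
import re

# some / none / any, optionally followed by one | body | thing
TRIGGER = re.compile(r"(some|none|any)(one|body|thing)?")

def merge_at_all(text, window=4):
    """Join 'at all' into the phrase 'at_all' when a trigger word precedes it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    out = []
    i = 0
    while i < len(tokens):
        is_pair = tokens[i] == "at" and i + 1 < len(tokens) and tokens[i + 1] == "all"
        if is_pair:
            before = tokens[max(0, i - window):i]
            if any(TRIGGER.fullmatch(w) for w in before):
                out.append("at_all")  # one phrase
                i += 2
                continue
        out.append(tokens[i])          # otherwise keep separate words
        i += 1
    return out

phrase = merge_at_all("I have none at all")
words = merge_at_all("Look at all the cars")
```

The first sentence yields the single item at_all, while the second keeps "at" and "all" as two separate words, matching the two bracketings shown on the earlier slide.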
All this will make some difference to Vocabprofiles. Word counts of texts will go down, as 2 words => one. Proportion of K-1 will go down, since 2-5k phrases are mainly composed of 1k words. Gone (maybe) such verities as “Most conversation is k1 items” and “A typical newspaper profile is… k1: 70%, k2: 10%, k3: 5%”. 106
All this will make some difference to Vocabprofiles. Example: “As far as I know, this is true”. Current profile: as far as I know this is true; 1k = 8:8 = 100%. New profile: as_far_as I know this is true; 1k = 5:6 = 83%, 3k = 1:6 = 17%. 107
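The arithmetic of this example can be checked with a small sketch. The band lookup below is a stand-in built only for this sentence (every single word is assumed to be 1k and as_far_as to be 3k, as on the slide); the function name `profile` is hypothetical and real Vocabprofile lists are far larger.

```python
from collections import Counter

def profile(items, band_of):
    """Percentage of items falling in each frequency band."""
    counts = Counter(band_of.get(item, "off-list") for item in items)
    total = len(items)
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

words = "as far as i know this is true".split()

# Illustrative band assignments taken from the slide's example.
BANDS = {w: "1k" for w in words}
BANDS["as_far_as"] = "3k"

current = profile(words, BANDS)
new = profile(["as_far_as", "i", "know", "this", "is", "true"], BANDS)
```

Treating the phrase as one item shrinks the text from 8 items to 6 and moves one of them out of the 1k band, which is where the drop in 1k coverage comes from.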
The new goal: re-write the BNC lists again. 1000 List: something … bank_1 … of_course. 4000 List (?): course … something_of_a … bank_2 108
A major benefit will be… better and better predictions of vocabulary coverage; tests with more predictive power; better targeted teaching. 109
A major challenge will be… Maintaining practitioner credibility 110
There is probably no choice. The frequency approach, once embarked upon, can only be pursued to completion. Question: What if we don’t focus on frequency? 111
Montreal kids’ Levels Test scores (no formal vocab training). Levels Test (10k) at Time 1, scores out of 10 (chart marks the 50% line):
K-level:  1     2     3     4     5     6     7     8     9     10
Group 1:  6.33  5.03  4.13  4.90  4.87  2.73  2.93  2.80  1.70  1.47
Group 2:  6.07  4.76  4.20  4.87  4.84  2.73  2.83  2.74  1.64  1.45
(x-axis: K-levels; y-axis: scores/10) 112
What does this profile mean? (Same chart: Levels Test (10k) at Time 1, scores/10 by K-level, Groups 1 and 2.) On their own, kids will learn lots of vocab, but it may not be in the 1k-3k high frequency zone, where coverage power lies. With 5000 words known, these kids are facing <65% coverage in their school materials. Recommended is 95%! 113
We have to get the frequency analysis right: by ourselves, not waiting for the linguists. And convince institutions, teachers, booksellers etc. to use it. Hopeful sign from Pearson-Longman: rep at AAAL-2010 says “We are now using VP to grade materials”, and pleads “Go easy on the tinkering”! Research v. business. 115
References: Cobb papers & software: lextutor.ca/cv/. All others: see me here, or write me at cobb.tom@uqam.ca 116
References (research): Cobb (1997). Is there any measurable learning from hands-on concordancing? System. Cobb (2007). Computing the demands of vocabulary acquisition. LLT. Martinez, R. & Schmitt, N. (2010). A phrasal expressions list. First author’s PhD. 117
References (web): Concordancers: lextutor.ca/concordancers/. VP (+ negative): lextutor.ca/vp/. Tests: lextutor.ca/tests/. Homographs: lextutor.ca/concordancers/text_concord. Computer-Assisted Contextual Inference: lextutor.ca/conc_infer 118
Technology and Language Testing Corpus-Based Testing Parallel Concordancing Analyzing Speech Corpora Concordancing Language Teacher Training in Technology Language Trainer Training in Technology and Listening Computer-Assisted Language Learning Effectiveness Research Learner Modeling in Intelligent Computer-Assisted Language Learning Natural Language Processing in Intelligent Computer-Assisted Language Learning Learner Corpora Automated Speech Recognition Technology and Discourse Intonation Computer-Assisted Vocabulary Load Analysis Technology and Usage-Based Teaching Applications Information Retrieval for Reading Tutors Online Communities of Practice Emerging Technologies for Language Learning Lexical Bundles Technology and Phrases Technology and Teaching Writing Latent Semantic Analysis 120
Mobile Assisted Language Learning Technology and Phonetics Computer-Mediated Communication and Second Language Use Computer-Mediated Communication and Second Language Development Multimodal Computer-Mediated Communication and Distance Education Distance Language Learning Massively Multiplayer Online Games Digital Divide Technology and Teaching Vocabulary Exporting Applied Linguistics Technology Monolingual Lexicography Bilingual Lexicography Across Languages Internet and World English Searchlinguistics Internationalization and Localization Translation Terminology Technology and Literacy Corpora and Literature Keyword Analysis Connectionism Text-to-Speech Synthesis Development Text-to-Speech Synthesis Research Lexical Priming Technology and Culture Computer-Assisted Language Learning and Machine Translation 121