
Technology Transfer of Automated Essay Evaluation: From NLP research through deployment as a business
Jill Burstein, Educational Testing Service
Presented at HLT/NAACL, May 5, 2004, Boston, MA

Criterion℠, E-rater®, Critique, C-rater, & more …
Jill Burstein, Claudia Leacock, Thomas Morton (Educational Testing Service)
Martin Chodorow (Hunter College, CUNY)
Susanne Wolff (Princeton University)

Let’s Talk About Writing & Assessment

Educators’ Vision: Writing Skill Development
• Master basic skills in K-12
  – Grammar, spelling, punctuation, etc.
• Perfect the 5-paragraph essay
  – U.S. concept
  – Thesis, 3 main points, conclusion
• Writing within and beyond discipline
  – Address different audiences
  – Generate various genres: persuasive, compare-and-contrast, research writing within discipline

Evaluation of Vision Through Writing Assessments
• High stakes: Undergraduate & Graduate
  – Admissions
  – Placement
• No/Low stakes: K-12
  – Statewide and national assessments
  – Classroom instruction

What do essays look like?

The Reality of Writing Quality
• Timed Assessments
  – Up to 500 words (grade level dependent)
  – Not Literary Essays: !creative !irony !metaphor
• Instructional Uses
  – Maybe Longer Essays
  – Better Quality with Revision Facility

Most essays look like this

And others like this…
“You are stupid because you can't read. You are also stupid becuase you don't speak English and because you can't add. Your so stupid, you can't even add! Once, a teacher give you a very simple math problem; it was 1+1=?. Now keep in mind that this was in fourth grade, when you should have known the answer. You said it was 23! I laughed so hard I almost wet my pants! How much more stupid can you be?! So have I proved it? Don't you agree that your the stupidest person on earth? I mean, you can't read, speak English, or add. Let's face it, your a moron, no, an idiot, no, even worse, you're an imbosol.”

And this…
“I THINK THAT EVERYONE SHOULD BE ABLE TO WEAR WHATEVER THE HELL THEY WANT TO WEAR.”

And this…
“I don't know how to explain this question because I took a nap while listening. Sorry.”

And this…
This is my topic on presidents. The one i will be talking about is Bill Clinton. He like most of us have fingers, with fingernails. He also has two arms were his fingers are on, were teh fingernails connect. What can a arm be without a hand? nothing thats the answer, so he obviously has two hands in witch the fingers are on, with the fingernails connect too. He also has eyes... EYES o yeah even him the big cheese has eyes, its weird cuz not many people have eyes but this good president does, and he can SEE you, you might not be able to see him , but he can SEE you. One time we were on AOL chat and i was talking to him he said he like to climb trees in his underwear well his arms were covered with sauce.. pizza sauce. He said it makes him feel free and good about himself. Also in the chat he said he has a pigeon for a pet and the things name is Frances, he said they like to make bacon together in the mourning and at night. And they eat at his friends Y place mostly everyday.

And this…
“Is true that in so many jobs people have to wear dress codes for so many reazons. Like in the restaurant the workers are obligaded to use a drees code because they have to look differently and have a good looking to impresionate the cutmers. Not necesesary at all the schools have drees codes but the ones that havet is because Arriba todos mis compas ya llego el rey del cristal, y yo mismo lo cosino para mejor calidad por esos mismos motivos me busca la federal la de estados unidos tambien me quiere agarar. si ellos me buscan por tierra yo me les pelo por mar si piensan que ando colima yo me paseo en michoacan por la ruana y poir tepeque aguililla y cuancoman , cuantas libras va a llevarse”
[Descriptive Translation: “It’s a rhyme about a drug runner … the guy is basically saying that he's the king of Cristal meth, wanted by the DEA and the Feds.”]

Human Scoring Algorithm

essay → human reader score (S1) and human reader score (S2)
Is |S1 − S2| > 1?
• YES: expert human reader score (S3); Final Score = mode, or mean of closest
• NO: Final Score = mean
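The adjudication flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `final_score` is hypothetical, and "mean of closest" is read here as the mean of the expert score and whichever original score lies nearer to it, which the slide does not spell out.

```python
def final_score(s1, s2, expert=None):
    """Resolve two reader scores into one final score.

    Assumption: "mode" applies when the expert score matches one of the two
    original scores; otherwise the expert score is averaged with the closer
    of the two. Equidistant ties fall back to s1 (arbitrary choice).
    """
    # Readers agree exactly or are adjacent: average them.
    if abs(s1 - s2) <= 1:
        return (s1 + s2) / 2
    # Discrepant scores need a third, expert reading.
    if expert is None:
        raise ValueError("scores differ by more than 1; expert score required")
    # Expert matches one reader: that value is the mode of the three scores.
    if expert in (s1, s2):
        return float(expert)
    # Otherwise average the expert score with the nearer original score.
    nearest = min((s1, s2), key=lambda s: abs(s - expert))
    return (expert + nearest) / 2


print(final_score(3, 4))            # adjacent readers
print(final_score(2, 5, expert=4))  # expert closer to the second reader
```

The same function covers the later slide where E-rater supplies S2 in place of the second human reader.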

Building Automated Essay Scoring Capabilities

Some History
– PEG – 1960’s essay scoring (Page, 1966)
  • Transformation of essay length
  • Some syntactic analysis
  • Convincing results
– Writer’s Workbench (Cherry et al, 1982)
  • Editing tool for students
  • Diction, style, spelling
  • Discourse structure (‘topic sentence’ identification)
– Intelligent Essay Assessor (Landauer et al, 1998)
  • Essay scoring with latent semantic analysis (LSA)
  • Style and mechanics measures

Assessment Market Technologically Ready
• Increase in Internet & Computer Access
  – Instructional computers with Internet access in public schools: 12.1 to 1 in 1998 & 4.8 to 1 in 2001 (NCES, 2002)
  – Web resources used in over 40% of college courses (Campus Computing Project, 2000)
  – 99% of public schools have internet access (NCES report, 2002)
• State Assessments: Increase in computer-based delivery
• Largest Test Publishers offer 850+ digital textbook titles

Motivation in Assessment
• Cost Savings for Large-Scale Assessments
• Classroom Integration for Instruction
  – More practice writing possible
  – Electronic writing portfolios
  – Individual performance assessment
  – Classroom assessments

Educators’ Questions About Innovation
1. Reliability: Can automated essay assessments increase scoring consistency for authentic assessments?
2. Assessment Type: Can automated scoring introduce more varied high stakes assessments?
3. Costs/Performance: Can scoring costs be reduced, but scoring performance maintained?

Starting Development

What should a good essay look like?
• Clearly states the author's position, and effectively persuades the reader of the validity of the author's argument.
• Well organized, with strong transitions helping to link words and ideas.
• Develops its arguments with specific, well-elaborated support.
• Varies sentence structures and makes good word choices; very few errors in spelling, grammar, or punctuation.

Mapping Writing Features to NLP Tools
Writing Features: NLP Tools
• Grammar Errors & Sentence Structures: POS Taggers; Syntactic Parsers
• Vocabulary Usage: Content Analysis; POS Taggers
• Sentence & Word Level Mechanics: Spelling Tools; POS Taggers
• Organization & Development of Ideas: Discourse Analyzers

E-rater (2/99 – 9/04)
• 50+ Writing-Relevant Features
  – Syntactic Structure Features: clause types
  – Discourse Structure Features: cue words & terms
  – Content: Content vector analysis
  – Lexical Complexity: e.g., word length, unique words
• NLP Tools
  – Syntactic Parses
  – High level discourse analyzer
  – tf*idf (essay level & argument level)
• Topic-Specific Models
  – Training with Human Scored Essays
  – Stepwise Linear Regression (Variable Feature Set & Weights)
• System Performance
  – Agreement with Humans – Comparable to Two Humans
  – E-rater/Human agreement: 59% exact; 98% exact + adjacent
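The "exact" and "exact + adjacent" agreement figures quoted for system performance can be computed as below. This is a minimal sketch, assuming integer scores on a shared scale and taking "adjacent" to mean a difference of at most one point; the function name `agreement` and the sample scores are illustrative, not ETS data.

```python
def agreement(machine_scores, human_scores):
    """Return (exact, exact_plus_adjacent) agreement rates as fractions.

    Assumption: "adjacent" means the two scores differ by at most one point.
    """
    if not machine_scores or len(machine_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(machine_scores, human_scores))
    # Fraction of essays where both scores are identical.
    exact = sum(m == h for m, h in pairs) / len(pairs)
    # Fraction where the scores are identical or one point apart.
    exact_plus_adjacent = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)
    return exact, exact_plus_adjacent


# Made-up scores for four essays.
exact, near = agreement([4, 3, 5, 2], [4, 4, 5, 4])
print(f"exact={exact:.0%}, exact+adjacent={near:.0%}")
```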

essay → human reader score (S1) and E-rater score (S2)
Is |S1 − S2| > 1?
• YES: expert human reader score (S3); Final Score = mode, or mean of closest
• NO: Final Score = mean

Outcomes of Early Success

NY Times Headline Phobia
Can you spell imbecile?: E-rater® Gives Good Score to Bad Essay
By A. Reporter
ETS’s automated scoring system thinks that this essay should get something like a “B.” Would you want your child to do well on this kind of writing? You be the judge.
“You are stupid because you can't read. You are also stupid becuase you don't speak English and because you can't add. Your so stupid, you can't even add! Once, a teacher give you a very simple math problem; it was 1+1=?. Now keep in mind that this was in fourth grade, when you should have known the answer. You said it was 23! I laughed so hard I almost wet my pants! How much more stupid can you be?! So have I proved it? Don't you agree that your the stupidest person on earth? I mean, you can't read, speak English, or add. Let's face it, your a moron, no, an idiot, no, even worse, you're an imbosol.”

Anomalous Essay Detection
Statistical evaluation of word usage to flag anomalous essays
  – “Your essay does not resemble others being written on this topic.”
  – “Your essay might not be relevant to assigned topic.”
  – “Your essay appears to be restatement of the topic with a few additional concepts.”
  – “Compared to other essays written on this topic, your essay contains more repetition of words.”
  – “Your essay shows less development of a theme than other essays written on this topic.”

Positive Outcomes: Changing Business Model
• Cost Savings
  – 1995 – 2000: ETS Research, some marketing, little sales
• Revenue Generation
  – 2001 – 2003: Spin off (ETS Tech), all marketing, all sales, all product development, all the time
  – 2003 – present: Spin back (to ETS) with vastly increased marketing & sales

What teachers really wanted: Qualitative Feedback

Learning from Assessment Experts
• Holistic scores not meaningful to students
  – Score 3: While a position may be stated, either it is unclear OR undeveloped. May have organization in parts, but lacks organization in other parts. The support of the position may be brief, repetitive, or irrelevant. Inconsistent control of sentence structure, and incorrect word choices; errors in spelling, grammar, or punctuation occasionally interfere with reader understanding.
• Demos for focus groups with teachers, policy makers, assessment experts
  – Errors in grammar, usage, mechanics, and style
  – Organization & Development

More Innovation – More Questions
  – Meaningfulness: Is the feedback consistently related to a clearly defined standard?
  – Self-Evaluation: Can instructional software help students understand evaluation of their writing?
  – Improvement: Can writing practice with immediate feedback help students?

Criterion℠ Online Essay Evaluation Service
• Critique writing analysis tools
  – Grammar
  – Usage
  – Mechanics
  – Style
  – Organization & Development
• E-rater

Motivation For New Capability Development
• What’s free for commercial use: Spelling
• What’s not … free …: Grammar, Usage, Mechanics, Style
• What doesn’t exist: Organization & Development

Methods

Grammar, Usage and Mechanics Errors
• Corpus of well formed text: 30 million words from newspapers
• Features: function words and part of speech tags
  a_AT good_JJ job_NN during_IN
• Collect frequencies for:
  – Unigrams of tags and function words
  – Bigrams of tags and function words
    a_JJ AT_JJ JJ_NN NN_during NN_IN
• Method: pointwise mutual information and log likelihood ratio used to detect unexpected sequences – likely violations of English grammar

Grammar, Usage and Mechanics Errors
• Harvest low probability bigrams from a set of essays.
• Low probability bigrams:
  – DTS_NN, AT_NNS
  – *these pencil, *every teenagers
• Write Filters:
  – *These pencil is yellow.
  – but not These pencil erasers are dirty.
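The bigram method on the two slides above can be illustrated with a small pointwise mutual information (PMI) sketch. It assumes the input is already a list of tag or function-word sequences; the helper names `build_pmi` and `flag_unexpected`, the toy corpus, and the threshold are all hypothetical. The log likelihood ratio test and the hand-written filters mentioned on the slides are not shown.

```python
import math
from collections import Counter


def build_pmi(sequences):
    """Count unigrams and bigrams, and return a PMI scoring function."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        # Unseen bigrams get -inf: maximally unexpected in this toy model.
        if bigrams[(a, b)] == 0:
            return float("-inf")
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log2(p_ab / (p_a * p_b))

    return pmi


def flag_unexpected(tokens, pmi, threshold):
    """Return bigrams in `tokens` whose PMI falls below `threshold`."""
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if pmi(a, b) < threshold]


# Toy "well-formed" corpus of tag sequences (illustrative only).
corpus = [["AT", "JJ", "NN"], ["AT", "NN"], ["DTS", "NNS"]]
pmi = build_pmi(corpus)
# DTS followed by a singular noun (*these pencil) never occurs in the corpus.
print(flag_unexpected(["DTS", "NN"], pmi, threshold=0.0))
```

With a realistic corpus, the threshold would be tuned against annotated essays rather than fixed at zero.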

Grammar
• Fragments
• Garbled Sentences
• Subject Verb Agreement: the motel are …
• Verb form: They are need to distinguish …
• Pronoun Errors: Them are my reasons …
• Possessive Errors: the students grades
• Wrong or Missing Word: The should take the student
• Proofread This!: I think my through problems

Usage
• Article Errors: I like these song
• Confused Words: Because of there different genres …
• Wrong Form of Word: the right choose
• Faulty Comparison: It is more easier
• Nonstandard Verb or Word Form

Mechanics
• Spelling
• Missing Capitalization
• Missing Initial Capitalization
• Missing Question Mark
• Missing Final Punctuation
• Missing Apostrophe: Thats about the only thing
• Missing Comma
• Missing Hyphen: a well deserved vacation
• Fused Words
• Compound Words
• Duplicate Words: escape to the another town

Style
• Short sentences, unusually long sentences, and passives?
• Automatic detection of repeated words
  – 300 essays manually annotated for repetition
  – Word-based text features with C5.0
    • proportion of word use in essays
    • distance between repeated word occurrences
    • pronoun?
    • word length
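The four word-level features listed for repetition detection could be extracted roughly as follows before being handed to a decision-tree learner. This is a sketch only: the function name `repetition_features`, the small pronoun list, and the exact feature definitions are assumptions, since the slide names the features without defining them, and the C5.0 training step is not shown.

```python
from collections import Counter

# Illustrative pronoun list; a real system would use a tagger or fuller lexicon.
PRONOUNS = frozenset({"i", "you", "he", "she", "it", "we", "they",
                      "me", "him", "her", "us", "them"})


def repetition_features(tokens):
    """Build one feature row per token occurrence in an essay."""
    words = [t.lower() for t in tokens]
    counts = Counter(words)
    last_seen = {}
    rows = []
    for i, word in enumerate(words):
        previous = last_seen.get(word)
        rows.append({
            "word": word,
            # Share of the essay's tokens taken up by this word.
            "proportion": counts[word] / len(words),
            # Tokens since the previous occurrence (None on first use).
            "distance_to_previous": None if previous is None else i - previous,
            "is_pronoun": word in PRONOUNS,
            "length": len(word),
        })
        last_seen[word] = i
    return rows


rows = repetition_features("The dog saw the dog".split())
print(rows[3])
```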

How Do We Identify Organization & Development in Essay Writing?
• Discourse Theories
• Lacks Essay-Based Discourse Function
  – Cue word & term detection (Cohen, Hirschberg & Litman, Hovy et al, Knott, Mann & Thompson, Vander Linden & Martin, Sidner, & Quirk, et al)
• Topical Coherence, Not Discourse Function
  – TextTiling (Hearst & Plaunt)
  – LSA (Landauer et al)
  – Select-A-Kibbitzer (Weimer-Hastings & Graesser)
• Not Student Friendly
  – RST Trees (Mann & Thompson)
• Essay-Based Discourse Analyzer (Burstein, Marcu, & Knight)
• Background, Thesis, Main Points, Supporting Ideas, and Conclusion

Organization and Development: Essay-Based Discourse Analyzer
• 1400+ essays manually annotated with pre-defined labels
• Voting Between 3 Systems: 2 Probabilistic & 1 Decision-Based
  – Probable discourse label sequences
  – Essay sublanguage: agree, should, would, opinion, for example, because, however...
  – RST relations: contrast, elaboration, antithesis...
  – Syntactic structures: infinitive, complement, subordinating clauses...
• Identifies background text, thesis statement, main ideas, supporting ideas, & conclusion statement
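Voting between three labelers can be sketched as a per-sentence majority vote. The slide does not say how disagreements among all three systems are resolved, so the fallback to a designated system below is an assumption, as are the function name `vote` and the sample labels.

```python
from collections import Counter


def vote(labels_a, labels_b, labels_c, fallback=0):
    """Combine three per-sentence label sequences by majority vote.

    Assumption: when all three systems disagree, the label from the system
    at index `fallback` (0, 1, or 2) is used.
    """
    if not (len(labels_a) == len(labels_b) == len(labels_c)):
        raise ValueError("all systems must label the same number of sentences")
    combined = []
    for triple in zip(labels_a, labels_b, labels_c):
        label, count = Counter(triple).most_common(1)[0]
        # At least two systems agree: take their label; otherwise fall back.
        combined.append(label if count >= 2 else triple[fallback])
    return combined


system_1 = ["thesis", "main_point", "support"]
system_2 = ["thesis", "support", "support"]
system_3 = ["background", "main_point", "conclusion"]
print(vote(system_1, system_2, system_3))
```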

Evaluating Capabilities
• Precision, Recall, & F-measure
  – Trade off Precision for Recall
  – Better not to show falsely identified errors
• Grammar, Usage, & Mechanics (Bigrams)
  – Minimum Precision for Deployment
• Style & Organization & Development
  – Human annotated data
  – Develop baseline comparisons
  – Precision, Recall, F-measure outperform baselines & approach human agreement
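For reference, precision, recall, and F-measure over a set of flagged items can be computed as in this short sketch. The function name `precision_recall_f` and the example sets are illustrative; items are assumed to be hashable identifiers such as (sentence index, error type) pairs.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Compute precision, recall, and F-beta for predicted vs. gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of flagged items that are real errors.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real errors that were flagged.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Four flagged items, three of them among five annotated errors.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

A deployment rule like the slide's "minimum precision" would then simply compare `p` against the chosen floor before a feedback type is shown to students.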

Some Numbers
• Grammar, Usage, & Mechanics
  – (Minimum) Overall System Precision = 0.90
• Discourse Capability (Org & Dev)
  – Baseline Precision = 0.71
  – Overall System Precision = 0.85
  – Human Agreement = 0.95
• Repetitive Word Use (Style)
  – Baseline Precision = 0.27
  – Overall System Precision = 0.95
  – Human Agreement = 0.55

E-rater v. 2.0 – September 2004
• 12 Features: Relevant to Writing Standards
  – Grammar, Usage, and Mechanics: Error Types
  – Style: Sentence Type, Sentence Length, Repeated Words
  – Organization: Thesis, Main Points, Support, and Conclusion
  – Content: Vocabulary Usage
• Topic-Specific & Grade-Specific Models
  – Training with Human Scored Essays
  – Multiple Regression (Standardized Feature Set & Variable Weights)
• System Performance
  – Agreement with Humans – Comparable to Two Humans
  – E-rater/Human agreement: up to 62% exact (from 59%); 98% exact + adjacent

Criterion℠ Online Essay Evaluation

Student Independence
essay → E-rater score
essay → Critique Writing Analysis Feedback: G, U, M, S; Org & Dev

Effectiveness Studies

Criterion Field Data (Attali, 2004)
• Research Questions
  – Can we evaluate the basic effectiveness of Critique writing analysis tools?
  – Can students understand and respond to system feedback?
• Criterion Field Data
  – Multiple submissions from about 9,000 6th to 12th grade essays
  – Available for analysis:
    • First and last essay submission
    • Total number of submissions

Summary Results
• 25% error reduction across 30+ error types
• Increased number of essay-based discourse elements
  – background
  – main point
  – supporting idea
  – conclusion

Criterion℠ & Standardized Testing (Shermis et al, 2004)
• Research Question
  – Can Criterion use have a positive impact on FCAT writing scores?
• Data/Design
  – 36 10th grade English classes in Miami-Dade
    • 18 used Criterion
    • 18 used “traditional” instruction

Summary Results
• Bad News
  – No significant differences in FCAT scores
• Good News
  – Significant growth in writing performance across different topics; reduced numbers of errors
  – Significantly more writing productivity

Users & Volumes
E-rater
  – GMAT (1999 – present)
  – 350K essays scored each year
  – Moving into Statewide Assessment
Criterion: E-rater + Critique
  – K-12, college, and graduate level practice applications
  – End 2002: 200 clients & 50K subscriptions
  – End 2003: 445 clients & 127K subscriptions
  – March 2004: 544 clients & 437K subscriptions
International Exposure
  – Users in Canada, Mexico, Pakistan, India, Estonia, Puerto Rico, Egypt, Nepal, Taiwan, Hong Kong & Japan

Beyond Essay Evaluation
C-rater (Leacock)
  – short-answer, concept-based evaluation
    • morphological analyzer
    • pronoun resolution
    • syntactic chunker
    • predicate argument structure generator
    • automated spelling correction
    • word similarity matrices
    • part of speech tagger
Test Item Creation Assistants
  – key-distractor selection for word-based test items
    • Statistical word similarity tools (Deane & Higgins)
  – article/paragraph selection for reading comprehension items
    • part of speech tagger
    • rhetorical parser

Next Generation Capabilities

Additional Grammar Error Detection (Chodorow, Leacock, Han & Wolff)
• Missing (or extra) Determiner Errors: I can do so for __ following reasons
• Preposition Errors: a knowledge of/*at math
• Long Distance Subject Verb Agreement: The use of dress codes are becoming a popular subject.
Word Salad Detector: Prevent System Gaming (T. Morton)
• Rare p-o-s tag sequences (mixing up content!): quick The the over brown dogs fox. jumped lazy
• Abnormal p-o-s tag distributions: kfdl afjidaoi djfd &&&&**

Current Research Problems
• Evaluate Coherence in Organization & Development (Higgins, Burstein & Marcu)
  – Does thesis statement respond to the question?
  – Do the main points relate to thesis statement?
  – Are all sentences in a supporting idea related?
• E-rater Trait-Based Scoring

CL Research Contributions to Automated Text Evaluation (1 of 2)
SHORT-ANSWER SCORING PROTOTYPES (1993 – 1995): Paul Jacobs, Jacqueline Kud, & Lisa Rao (General Electric IR Tools)
ESSAY SCORING PROTOTYPE (1996 – 1999): Lisa Braden-Harder, Simon Corston-Oliver, George Heidorn, Karen Jensen, & Steve Richardson (Microsoft syntactic parser+); Robin Cohen, Sidney Greenbaum, Julia Hirschberg, Ed Hovy, Alistair Knott, Julia Lavid, Geoffrey Leech, Keith Vander Linden, Diane Litman, William Mann, James Martin, Elisabeth Maier, Candace Sidner, Sandra Thompson, & Randolph Quirk (discourse theory); *Gerard Salton (Vector Space Analysis)
E-RATER® (1999 – present): *Steve Abney (CASS parser), *Eric Brill (p-o-s tagger), Hoa Trang Dang, Mary Dee Harris, Karen Kukich, Thomas Landauer (scholarly debate), Leah Larkey, Ralph Grishman (COMLEX), *George Miller (WordNet), *Thomas Morton, *Adwait Ratnaparkhi (p-o-s tagger), *Susanne Wolff

CL Research Contributions to Automated Text Evaluation (2 of 2)
CRITIQUE WRITING ANALYSIS TOOLS (2000 – present): Giovanni Flammia (Kappa Tool), Peter Foltz, Walter Kintsch, and Thomas Landauer (LSA & text coherence), Marti Hearst & Christian Plaunt (TextTiling), Derrick Higgins, P. Kinerva, J. Kristoferson, and A. Holtz (Random Indexing), Kevin Knight & *Daniel Marcu (Rhetorical/Discourse Structure Parsers), *Dekang Lin (Word Similarity Indices) & Andrew McCallum & Kamal Nigam (multivariate Bernoulli), Eleni Miltsakaki, Ross Quinlan (C5.0), Rob Schapire (BoosTexter), Peter Weimer-Hastings & Arthur Graesser (essay coherence theory), Magdalena Wolska, & Vladimir Vapnik (Support Vector Machines), Linguistic Data Consortium.
ANOMALOUS ESSAY DETECTION (2000): Martin Chodorow
C-RATER (2000 – present): Claudia Leacock, Rebecca Passoneau
TEST ITEM CREATION ASSISTANTS (2002 – present): Paul Deane & Derrick Higgins, R. Soricut (Rhetorical parser)

More Publications: http://www.ets.org/research/erater.html
Tom Morton’s Freeware Parser: https://sourceforge.net/projects/opennlp