Automatic Detection of Plagiarized Spoken Responses Keelan Evanini


Number of slides: 13

Automated Detection of Plagiarized Spoken Responses
• An increasingly important application, given the growing need for automated scoring of spontaneous speech
• Prevents one type of cheating strategy
  – Test takers may prepare canned answers using test preparation materials prior to the examination.
Copyright © 2014 by Educational Testing Service. All rights reserved. June 18, 2014

Plagiarized Spoken Responses in TOEFL® iBT
• TOEFL® iBT is a large-scale, high-stakes assessment of English for non-native speakers.
  – Independent speaking tasks ask test takers to draw upon their own ideas, opinions, and experiences in a 45-second spoken response.
• Plagiarized spoken responses
  – Test takers may attempt to game the assessment by memorizing canned material from an external source and adapting it to a question.
  – This type of plagiarism can affect the validity of a test taker's speaking score.

One Source Material
Well, the place I enjoy the most is a small town located in France. I like this small town because it has very charming ocean view. I mean the sky there is so blue and the beach is always full of sunshine. You know how romantic it can ever be, just relax yourself on the beach, when the sun is setting down, when the ocean breeze is blowing and the seabirds are singing. Of course I like this small French town also because there are many great French restaurants. They offer the best seafood in the world like lobsters and tuna fishes. The most important, I have been benefited a lot from this trip to France because I made friends with some gorgeous French girls. One of them even gave me a little watch as a souvenir of our friendship.

One Plagiarized Response
family is a little trip to France when I was in primary school ten years ago I enjoy this activity first because we visited a small French town located by the beach the town has very charming ocean view and in the sky is so blue and the beach is always full of sunshine you know how romantic it can ever be just relax yourself on the beach when the sun is settling down the sea birds are singing of course I enjoy this activity with my family also because there are many great French restaurants they offer the best sea food in the world like lobsters and tuna fishes so I enjoy this activity with my family very much even it has passed several years

Canned Response Collection
• Step 1: Human raters flag potentially plagiarized spoken responses.
• Step 2: Rater supervisors review the flagged responses by comparing them to external source material.
• Step 3: If the plagiarized material makes it impossible to provide a valid assessment of the test taker's performance, a score of 0 is assigned.

Data Collection
• 719 potentially plagiarized responses; 239 canned responses assigned a score of 0
• 49 different source materials
• Approximately 300 control responses from each of the four most frequent test questions

Number of words per data set:
Data Set      N      Mean    Standard Deviation
Sources       49     122.5   36.5
Plagiarized   239    109.1   18.9
Control       1196   84.9    24.1

The plagiarized responses are on average about 24 words longer than the control responses (109.1 vs. 84.9 words).

Methodology (1)
• Each test response is compared with each of the 49 reference sources using 9 text-to-text similarity metrics:
  – Word Error Rate (WER)
  – TER and TER-Plus (Snover et al., 2006; Snover et al., 2008)
  – Four similarity metrics based on WordNet (Wu and Palmer, 1994; Leacock and Chodorow, 1998)
  – Latent Semantic Analysis
  – BLEU (Papineni et al., 2002)
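As an illustration of the simplest metric in the list, word error rate can be sketched as a word-level edit distance normalized by the reference length. This is a minimal sketch, not the system described in the slides: the function name `word_error_rate`, the whitespace tokenization, the lowercasing, and the choice of the source text as the reference are all assumptions. A lower value means the response is closer to the source.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and lowercasing; a real system
    would likely apply its own text normalization first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    source = "the beach is always full of sunshine"
    response = "the beach is full of sunshine"
    print(word_error_rate(source, response))
```

Because WER is a distance rather than a similarity, any feature that takes a "maximum similarity" would use the minimum WER instead.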

Methodology (2)
• 4 different features for each similarity metric:
  – Document-level similarity
  – Single maximum similarity value from a sentence-by-sentence comparison
  – Average of the similarity values for all sentence-by-sentence comparisons
  – Average of the maximum similarity values for each sentence in the test response, where the maximum similarity of a sentence is obtained by comparing it with each sentence in the source text
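The four features can be sketched for one response and one source text as follows. This is only an illustration under stated assumptions: `jaccard` is a hypothetical stand-in for any of the nine metrics, `similarity_features` and the feature names are invented here, sentences are assumed to be already segmented, and the metric is assumed to return higher values for more similar text.

```python
from statistics import mean
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in metric: word-set overlap in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def similarity_features(
    response_sents: List[str],
    source_sents: List[str],
    sim: Similarity = jaccard,
) -> Dict[str, float]:
    """Compute the four features for one (response, source) pair."""
    if not response_sents or not source_sents:
        raise ValueError("both texts need at least one sentence")

    # pairwise[i][j] = similarity of response sentence i and source sentence j
    pairwise = [[sim(r, s) for s in source_sents] for r in response_sents]
    all_values = [value for row in pairwise for value in row]

    return {
        # 1. whole response compared with whole source
        "document": sim(" ".join(response_sents), " ".join(source_sents)),
        # 2. single best sentence pair
        "max_pair": max(all_values),
        # 3. average over all sentence pairs
        "mean_pair": mean(all_values),
        # 4. best source match per response sentence, then averaged
        "mean_of_max": mean(max(row) for row in pairwise),
    }


if __name__ == "__main__":
    response = ["the sky is blue", "i like pizza"]
    source = ["the sky is blue", "the beach is sunny"]
    print(similarity_features(response, source))
```

The fourth feature is the most tolerant of partial copying: a response that reuses only a few source sentences still scores highly on those sentences, while the all-pairs average is diluted by unrelated pairs.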

Experimental Setup
• Experiments on both human transcriptions and automatic speech recognition (ASR) outputs
• ASR system
  – Trained on approximately 800 hours of TOEFL® iBT responses
  – WERs were 0.411 on the plagiarized set and 0.362 on the control set
• Maximum Entropy-based sentence boundary detection system (Chen and Yoon, 2011)

Results
Text             Features         Accuracy   Kappa
Transcriptions   ALL              0.903      0.807
Transcriptions   Document-level   0.920      0.839
Transcriptions   Sentence-level   0.847      0.693
ASR Outputs      ALL              0.852      0.703
ASR Outputs      Document-level   0.871      0.742
ASR Outputs      Sentence-level   0.735      0.470

Mean accuracy and kappa values for classification results using the 239 responses in the Plagiarized set and 1000 random subsets of 239 responses from the Control set.
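Because each run pairs 239 plagiarized responses with 239 control responses, chance agreement in Cohen's kappa is exactly 0.5, so kappa reduces to 2 × accuracy − 1. The sketch below checks that relation against the reported values; the function name `cohen_kappa_binary` and the confusion-matrix counts in the usage example are hypothetical and do not come from the slides.

```python
def cohen_kappa_binary(tp: int, fn: int, fp: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    p_true_pos = (tp + fn) / total   # share of truly plagiarized items
    p_pred_pos = (tp + fp) / total   # share of items predicted plagiarized
    expected = p_true_pos * p_pred_pos + (1 - p_true_pos) * (1 - p_pred_pos)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# (accuracy, kappa) pairs as reported in the results table
REPORTED = [
    (0.903, 0.807), (0.920, 0.839), (0.847, 0.693),
    (0.852, 0.703), (0.871, 0.742), (0.735, 0.470),
]

if __name__ == "__main__":
    # Hypothetical counts for one balanced run of 239 + 239 responses.
    tp, fn, fp, tn = 212, 27, 12, 227
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    print(cohen_kappa_binary(tp, fn, fp, tn), 2 * accuracy - 1)

    for acc, kappa in REPORTED:
        # Agreement up to rounding of the reported three-decimal means
        print(acc, kappa, round(2 * acc - 1, 3))
```

The relation holds for any classifier as long as the true classes are split evenly; it would no longer hold on an unbalanced evaluation set.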

Discussion and Future Work (1)
• Precision was higher than recall: 0.948 vs. 0.888 on human transcriptions; 0.904 vs. 0.831 on ASR outputs.
  – In an operational system, it may be desirable to tune the classifier to increase recall.
• The experiments used balanced sets of canned and control responses, whereas the distribution of actual responses is heavily unbalanced.
  – Future experiments will use a much larger control set with ASR outputs.
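The effect of class imbalance on precision can be illustrated with Bayes' rule. The sketch below backs out a false positive rate from the reported precision and recall on the balanced transcription experiments, then recomputes precision for a lower share of plagiarized responses. It rests on several assumptions: the reported means are treated as if they came from a single balanced run, recall and false positive rate are assumed not to change with prevalence, and the 1% prevalence in the usage example is purely hypothetical. The function names are invented for this example.

```python
def false_positive_rate(precision: float, recall: float, prevalence: float) -> float:
    """Solve precision = p*R / (p*R + (1 - p)*FPR) for FPR."""
    return prevalence * recall * (1 - precision) / (precision * (1 - prevalence))


def precision_at_prevalence(recall: float, fpr: float, prevalence: float) -> float:
    """Expected precision when a fraction `prevalence` of responses is plagiarized."""
    true_positives = prevalence * recall
    false_positives = (1 - prevalence) * fpr
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Reported on human transcriptions with balanced classes (prevalence 0.5)
    precision, recall = 0.948, 0.888
    fpr = false_positive_rate(precision, recall, prevalence=0.5)

    # Hypothetical operational setting: 1 in 100 responses is plagiarized
    print(precision_at_prevalence(recall, fpr, prevalence=0.01))
```

Under these assumptions precision falls as plagiarized responses become rarer, which is one reason to evaluate on a much larger control set.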

Discussion and Future Work (2)
• Matching source texts are required: plagiarized responses based on unseen sources cannot be detected.
  – Obtain additional source texts; compare a test response with all previously collected spoken responses for a given population of test takers.
• The above methods may lead to a high number of false positives, especially when based on ASR outputs.
  – Use N-best lists to compute the similarity metrics; introduce additional sources of information, such as stylistic patterns and prosodic features.

Thank You! Questions? Comments?
Xinhao Wang, xwang002@ets.org
Associate Research Scientist, NLP & Speech Group at ETS
Keelan Evanini, kevanini@ets.org
Managing Research Scientist, NLP & Speech Group at ETS