
VARD 2: A tool for dealing with spelling variation in historical corpora
Alistair Baron
Computing Department, Lancaster University
a. [email protected] lancs. ac. uk
www.comp.lancs.ac.uk/~barona/
Alistair Baron | Lancaster University

Outline
  • Early Modern English
      ◦ Characteristics
      ◦ Spelling Variation
      ◦ Corpora
  • VARD 2
      ◦ Demonstration
      ◦ Current and Future Work
  • Questions

Early Modern English (EModE)
  • Period of the English language between 1450 and 1700.
  • Large amount of research interest:
      ◦ Influential period in the formation of modern English.
      ◦ Earliest period of English from which a large corpus can be built, due to a sharp increase in text production:
          ▪ King Henry V’s commitment to the vernacular from 1417.
          ▪ William Caxton’s introduction of the printing press in 1476.
          ▪ An increasingly literate public.
      ◦ Shakespeare’s works.
  • Large amount of spelling variation.

EModE Spelling Variation
  • Large amount of spelling variation in EModE texts:
      ◦ No notion of the importance of having a single spelling for each word.
      ◦ Authors, scribes, editors and printing houses would have their own spelling preferences.
      ◦ Letters would be added or removed to ease line justification.
      ◦ Local dialect could also influence spelling.
  • Spelling variation became less frequent through the EModE period:
      ◦ Spread of London and Chancery English through the introduction of printing.
      ◦ Signified by the introduction of dictionaries, especially Samuel Johnson’s in 1755.

EModE Spelling Variation (Examples)
  • Examples of spelling variants:
      ◦ Addition or removal of ‘e’, e.g. aske, workes, dos
      ◦ Doubling and singling of letters, e.g. smels, heere, leggs
      ◦ Interchanged letters: {u, v}, {j, i}, {ie, y}, {vv, w}, e.g. haue, vnder, maiestie, vvas
      ◦ Usage of apostrophe, e.g. vow’d, ’em
      ◦ Spellings which are still variable today, e.g. centre/center, -or/-our, -ise/-ize
      ◦ Fused forms, e.g. t’is, t’was, o’th
      ◦ Archaic -(e)th and -(e)st endings, e.g. hath, doth, seemeth, shouldst
      ◦ Archaic forms, e.g. betwixt, howbeit
      ◦ Phonetic spellings, e.g. publiquely, blew (blue)
      ◦ Any combination of the above and other irregular spellings, e.g. Iigge (Jig), diuell (devil), shak’d (shook)
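The interchanged-letter pairs above lend themselves to a small illustration. The following is a minimal sketch, not VARD's own method: the `RULES` table, the rule directions, the function names and the tiny `lexicon` are all assumptions made for the example. It applies letter-replacement rules to a variant, one rule at a time, until a form in a modern word list is reached.

```python
# Hypothetical rule table: (variant letters, modern letters).
# Drawn from the interchanged-letter pairs listed above; not VARD's own rule set.
RULES = [("vv", "w"), ("u", "v"), ("v", "u"), ("i", "j"), ("ie", "y")]


def candidates(word):
    """Return every string reachable by applying one rule at one position."""
    found = set()
    for old, new in RULES:
        start = word.find(old)
        while start != -1:
            found.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return found


def modernise(word, lexicon, max_steps=3):
    """Breadth-first search over rule applications; stop at the first lexicon hits."""
    frontier = {word}
    seen = {word}
    for _ in range(max_steps):
        frontier = {c for w in frontier for c in candidates(w)} - seen
        hits = sorted(frontier & lexicon)
        if hits:
            return hits
        seen |= frontier
    return []


# Illustrative modern word list (an assumption, far smaller than a real one).
lexicon = {"have", "under", "majesty", "was", "devil"}
print(modernise("haue", lexicon))    # ['have']
print(modernise("diuell", lexicon))  # [] (needs rules this table does not contain)
```

The second call shows the limit of a fixed rule table: irregular combinations such as diuell are not reachable with these rules alone.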

EModE Corpora
  • The construction of EModE and other historical corpora has become an important focus of research.
  • Research topics using the corpora range from studies of diachronic linguistic change to studies of attitudes towards gender at the time.
  • Corpora include:
      ◦ ARCHER, Helsinki, Lampeter and ZEN (Kytö et al., 1994)
      ◦ Corpus of Early English Correspondence (Nevalainen, 1997)
      ◦ Corpus of English Dialogues (Culpeper and Kytö, 1997)
      ◦ Many versions of Shakespeare’s works
          ▪ e.g. the First Folio as printed in 1623, which can be sourced from the Oxford Text Archive (http://ota.ahds.ac.uk/)
  • Also, increasing amounts of textual data, large quantities of which are historical texts, are being digitised through current initiatives including:
      ◦ The Open Content Alliance (http://www.opencontentalliance.org/)
      ◦ Google Books (http://books.google.com)
      ◦ Early English Books Online (http://eebo.chadwyck.com/home)

EModE Corpora: The problem with spelling variation
  • Many corpus linguistic functions can be completed automatically with software such as Wordsmith Tools, BNCWeb and WMatrix; however, these tools are designed to work with modern English (or other modern languages).
  • Problems occur when corpus linguistic functions are run on historical varieties or dialects of English (and indeed other languages), especially when high levels of spelling variation occur, as in Early Modern English.
  • This can cause problems for even simple functions such as a string search, which returns only words spelt in exactly the same way as the search query.
      ◦ A recent examination of the Lampeter corpus has shown that an average of 1 in 5 word types per text are not found in a large modern word list.
  • Frequency lists will also be incorrect, because a word’s frequency is split between its different spellings.
      ◦ For example, would could be spelt in a variety of forms including: would, wolde, wood, wulde, wud, wald, vvould, vvold, and so on.
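The frequency-splitting problem can be shown with a few lines of code. This is a minimal sketch under stated assumptions: the `VARIANTS` mapping, the `frequency_list` helper and the sample sentence are invented for illustration. The forms wood and wald are deliberately left out of the mapping, since wood is also a modern word and cannot be mapped safely without looking at context.

```python
from collections import Counter

# Hypothetical variant-to-modern mapping built from some of the forms listed above.
VARIANTS = {
    "wolde": "would",
    "wulde": "would",
    "wud": "would",
    "vvould": "would",
    "vvold": "would",
}


def frequency_list(tokens, mapping=None):
    """Count tokens, optionally replacing known variants by their modern form first."""
    mapping = mapping or {}
    return Counter(mapping.get(t.lower(), t.lower()) for t in tokens)


# Invented sample sentence containing three spellings of the same word.
tokens = "I wolde go if he vvould stay but she would not".split()

raw = frequency_list(tokens)               # counts split across spellings
merged = frequency_list(tokens, VARIANTS)  # counts pooled under 'would'

print(raw["would"], raw["wolde"], raw["vvould"])  # 1 1 1
print(merged["would"])                            # 3
```

A string search behaves the same way: a query for would matches one token in the raw text and three once the variants are mapped.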

EModE Corpora: The problem with spelling variation
  • Keyword lists could be obscured by spelling variation, because multiple spellings of a word reduce its ‘keyness’.
  • Collocations would be affected in much the same way, with co-occurring words not being detected due to spelling variation.
  • Rayson et al. (2007) evaluated the accuracy of the CLAWS part-of-speech tagger on EModE corpora (modern accuracy: 96-97%):

      POS tagging accuracy            Shakespeare   Lampeter
      Spelling variation remaining    81.94%        88.46%
      Spelling variation modernised   88.88%        91.24%

  • Archer et al. (2003) discuss developing the USAS semantic tagger for EModE; the paper reports on an evaluation performed on relatively contemporary texts from 1640. Dealing in part with spelling variation reduced the error rate from 2.9% to 1.2% in one text and from 4.0% to 1.4% in the other.

Our Solution: VARD (Variant Detector)
  • Our solution to the problem was to build a pre-processor for corpus linguistic tools which ‘standardizes’ the spelling variation found within the texts.
  • This led to the production of VARD, a search and replace tool which uses a large list of known variants to insert a modern equivalent alongside the original spelling.
  • The processed text can then be passed on to corpus linguistic software, where the system ‘sees’ the modern spelling instead of the spelling variant, thus improving the accuracy of the tool’s techniques.
  • It should be noted that we are not “correcting” the spelling of EModE texts: there was no correct spelling at the time, and the spelling variants are important linguistic features.
      ◦ The original variant is maintained, and it is a simple process to switch between the original and modernised texts.
      ◦ The modern equivalents are inserted for the benefit of other automated software, which would produce inaccurate results with a large number of spelling variants remaining.
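The search and replace idea can be sketched as follows. This is an illustration only: the slides do not give VARD's output format, so the `<reg orig="...">` tag, the `KNOWN_VARIANTS` sample list and all function names here are assumptions. The point is that the modern form is inserted while the original spelling is kept, so either view of the text can be recovered.

```python
import re

# Hypothetical sample of a known-variants list (the real list is far larger).
KNOWN_VARIANTS = {"haue": "have", "vnder": "under", "vvas": "was"}

# Assumed annotation format: <reg orig="ORIGINAL">MODERN</reg>
TAG = re.compile(r'<reg orig="([^"]*)">([^<]*)</reg>')


def annotate(text, variants=KNOWN_VARIANTS):
    """Wrap each known variant in a tag holding both the original and modern form."""
    def swap(match):
        word = match.group(0)
        modern = variants.get(word.lower())
        if modern is None:
            return word  # not a known variant: leave untouched
        return f'<reg orig="{word}">{modern}</reg>'
    return re.sub(r"[A-Za-z']+", swap, text)


def modern_view(annotated):
    """Text as downstream tools should 'see' it: modern spellings only."""
    return TAG.sub(r"\2", annotated)


def original_view(annotated):
    """Recover the text exactly as it was before annotation."""
    return TAG.sub(r"\1", annotated)


text = "I haue seen what vvas vnder it"
annotated = annotate(text)
print(modern_view(annotated))    # I have seen what was under it
print(original_view(annotated))  # I haue seen what vvas vnder it
```

Because the original spelling is stored in the tag, no information is lost, which matches the aim of not "correcting" the text.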

VARD 2
  • The original VARD tool managed to deal with a large amount of spelling variation; however, because the variation is so extensive, it is impossible to include all possible spelling variants in a predefined list.
  • VARD 2 was therefore developed. It employs techniques from modern spell checkers to find potential replacements for spelling variants.
  • The tool also offers an interactive interface where users can view all spelling variants detected and select the desired replacement for each variant from a supplied list.
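The slide does not say which spell-checking techniques VARD 2 uses. As one common example of such a technique, the sketch below ranks words from a modern lexicon by Levenshtein edit distance to the variant. The function names, the distance threshold and the tiny lexicon are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def suggest(variant, lexicon, max_distance=2, limit=5):
    """Return lexicon words within max_distance, closest first, ties alphabetical."""
    scored = sorted((edit_distance(variant, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_distance][:limit]


# Illustrative lexicon (an assumption); a real word list would be far larger.
print(suggest("wolde", ["would", "world", "wild", "under"]))
# ['wild', 'world', 'would']
```

All three suggestions are two edits away from wolde, so distance alone cannot pick the right one. This is one reason a ranked list with a human choosing from it, as in the interactive interface described above, is useful.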

VARD 2 Demonstration

Current and Future Work
  • DICER: Discovery and Investigation of Character Edit Rules
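The slide gives only the name of DICER, so the following is one possible reading of "discovering character edit rules", not a description of the tool itself. The sketch aligns each variant with its modern equivalent using Python's standard `difflib.SequenceMatcher` and counts the character edits that turn one into the other. The training pairs and function names are invented for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_rules(variant, modern):
    """List the (variant letters, modern letters) edits that differ between the two forms."""
    rules = []
    matcher = SequenceMatcher(None, variant, modern)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rules.append((variant[i1:i2], modern[j1:j2]))
    return rules


def count_rules(pairs):
    """Tally edit rules over many (variant, modern) pairs."""
    counts = Counter()
    for variant, modern in pairs:
        counts.update(edit_rules(variant, modern))
    return counts


# Hypothetical training pairs, using variants from the examples slide.
pairs = [("haue", "have"), ("vnder", "under"), ("loue", "love"), ("aske", "ask")]
counts = count_rules(pairs)
print(counts[("u", "v")])  # 2
print(counts[("e", "")])   # 1 (removal of final 'e')
```

Frequent rules found this way could, in principle, feed back into the kind of rule table used to generate candidate replacements.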

Current and Future Work
  • Context-sensitive rules
      ◦ Surrounding grammar (and semantics?)
      ◦ Word bigram and trigram analysis
      ◦ Collocations?
  • Analysis of VARD 2 recall and precision
  • Analysis of effect on corpus linguistic techniques
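For the planned recall and precision analysis, the standard definitions can be written down in a few lines. This is a sketch under assumptions: the slides do not describe the evaluation setup, so representing the tool's output and a hand-checked gold standard as dictionaries from token position to modern form is a choice made for the example, as are the sample data.

```python
def precision_recall(proposed, gold):
    """Compare proposed replacements with a gold standard.

    Both arguments map a token position to the modern form for that token.
    precision = correct replacements / replacements proposed
    recall    = correct replacements / replacements in the gold standard
    """
    correct = sum(1 for pos, form in proposed.items() if gold.get(pos) == form)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: four variants in the gold standard, three replacements proposed,
# one of which picks the wrong modern form.
gold = {0: "have", 3: "under", 5: "was", 8: "devil"}
proposed = {0: "have", 3: "under", 5: "wash"}

precision, recall = precision_recall(proposed, gold)
print(round(precision, 2), recall)  # 0.67 0.5
```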

Any Questions?
  • Thanks for listening!
  • More information about VARD 2 can be found at http://www.comp.lancs.ac.uk/~barona/vard2/

References
  • Archer, D., McEnery, T., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In Archer, D., Rayson, P., Wilson, A. and McEnery, T. (eds.). Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 22-31.
  • Culpeper, J. and Kytö, M. (1997). Towards a corpus of dialogues, 1550-1750. In Ramisch, H. and Wynne, K. (eds.). Language in Time and Space. Studies in Honour of Wolfgang Viereck on the Occasion of His 60th Birthday (Zeitschrift für Dialektologie und Linguistik - Beihefte, Heft 97). Franz Steiner Verlag, Stuttgart, pp. 60-73.
  • Kytö, M., Rissanen, M. and Wright, S. (1994). Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, Cambridge, March 1993. Rodopi, Amsterdam.
  • Nevalainen, T. (1997). Ongoing work on the Corpus of Early English Correspondence. In Hickey, R., Kytö, M., Lancashire, I. and Rissanen, M. (eds.). Tracing the Trail of Time: Proceedings from the Second Diachronic Corpora Workshop. Rodopi, Amsterdam.
  • Rayson, P., Archer, D. and Smith, N. (2005). VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In Proceedings from the Corpus Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN 1747-9398.
  • Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007, July 27-30, University of Birmingham, UK.
  • Vallins, G. H. (1954; revised by Scragg, D. G., 1965). Spelling. André Deutsch.

VARD 2 Screenshots
