
Fixing a Legacy Lexicon
Mike Maxwell
maxwell@ldc.upenn.edu
University of Pennsylvania
Linguistic Data Consortium and Department of Linguistics
3600 Market Street, Philadelphia, PA 19104 U.S.A.
www.ldc.upenn.edu
E-MELD Conference, 15-18 July 2004

The Problem
• Shoebox lexicon of Mawukakan
  – Inconsistencies:
    » Inconsistencies among POSs etc. (fixable in Shoebox)
    » Spelling errors: English, French and Mawu (import into Word, use English and French spell correctors)
    » Errors in hierarchy: missing fields, mis-ordered fields
    » Missing reciprocal cross-references
• Absolutely typical of Shoebox-style lexicons
• Repairs needed for
  – Archiving
  – Publication
  – Export/import

Old Solution
• Parse until error, characterize error, find error in Shoebox, fix error…
• Find all errors, send list to user, user fixes them, re-do…

Partial solutions
• Inconsistencies among POSs etc.
  – Fixable in Shoebox
  – Helpful addition: counts of POS tokens
• Spelling errors
  – Import into Word with automatic marking of language, use English and French spell correctors to fix errors, export back to Shoebox
  – No solution for Mawu spelling (n-grams)
• Missing cross-references
  – Easy to find with shell script, send list to users
  – Would be better to mark errors in lexicon
• Missing bi-directional references
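Two of the partial solutions above (counts of POS tokens, and a list of cross-references that lack a return link) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the shell script mentioned on the slide: it assumes one field per line with the marker as the first token (a leading backslash is tolerated), that `w` starts a record, and that cross-references live in a hypothetical `cf` field. The three sample entries are invented and are not real Mawukakan data.

```python
from collections import Counter


def parse_records(text, record_marker="w"):
    """Split SFM-style text into records, each a list of (marker, value) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        marker, _, value = line.partition(" ")
        marker = marker.lstrip("\\")  # tolerate a leading backslash on markers
        if marker == record_marker or not records:
            records.append([])  # a record marker opens a new record
        records[-1].append((marker, value.strip()))
    return records


def pos_counts(records, pos_marker="pos"):
    """Count how often each POS label occurs; rare labels are likely typos."""
    return Counter(v for rec in records for m, v in rec if m == pos_marker)


def missing_reciprocals(records, xref_marker="cf"):
    """Return (entry, target) pairs where target refers to entry but entry
    has no cross-reference back to target. `cf` is an assumed marker name."""
    refs = {}
    for rec in records:
        head = rec[0][1]  # value of the record's first field (the headword)
        refs.setdefault(head, set()).update(
            v for m, v in rec if m == xref_marker
        )
    return sorted(
        (target, head)
        for head, targets in refs.items()
        for target in targets
        if head not in refs.get(target, set())
    )


# Invented sample entries, for illustration only.
SAMPLE = """\
w kalo
pos Adj
cf mina
w mina
pos v.i
w suru
pos v.i.
cf kalo
"""

records = parse_records(SAMPLE)
for label, count in sorted(pos_counts(records).items()):
    print(f"{count:4d}  {label}")
for entry, target in missing_reciprocals(records):
    print(f"{entry}: missing cross-reference to {target}")
```

In the counts, `v.i` and `v.i.` show up as separate labels, which is the kind of POS inconsistency the counts are meant to expose.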

Partial solutions
• Errors in hierarchy
  w ba’el
  pos v.i
  ex Yax bo’on ta sna Antonio.
  ex.En I’m going to Antonio’s house.
  |
  ex Ban yax ba’at?
  ex.En Where are you going?
  ex.Fr Ou allez-vous?

Repairing the hierarchy
• Solution: special purpose parser, mark SFM file with errors and suggested fixes
• Need hierarchy
  – Cannot (reliably) extract hierarchy from Shoebox typ file
• User or consultant must provide definition of hierarchy, as regex:
  (w ( (pos defn (ex ex.En ex.Fr)* (syn)? ) | (num pos defn (ex ex.En ex.Fr)* (syn)? )+ ) )
  – Tool to extract a list of all occurring record/field patterns
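One way to picture how a hierarchy regex and a pattern-extraction tool fit together is the Python sketch below. It is not the special purpose parser described on the slide: it only reduces each record to its sequence of markers, counts the distinct sequences, and tests each one against the slide's definition translated into a Python regular expression. The function names, the one-field-per-line layout, and the second sample record (`kalo`) are assumptions made for illustration.

```python
import re
from collections import Counter

# Hierarchy definition as given on the slide.
HIERARCHY_SPEC = (
    "(w ( (pos defn (ex ex.En ex.Fr)* (syn)? )"
    " | (num pos defn (ex ex.En ex.Fr)* (syn)? )+ ) )"
)
OPERATORS = set("()|*+?")


def compile_hierarchy(spec):
    """Translate the slide-style spec into a Python regex in which every
    marker is matched as a whole token followed by a single space."""
    tokens = re.findall(r"[()|*+?]|[^\s()|*+?]+", spec)
    parts = [
        t if t in OPERATORS else "(?:" + re.escape(t) + " )" for t in tokens
    ]
    return re.compile("".join(parts))


def marker_sequences(text, record_marker="w"):
    """Return one list of markers per record (assumes one field per line)."""
    sequences = []
    for line in text.splitlines():
        if not line.strip():
            continue
        marker = line.split()[0].lstrip("\\")  # tolerate a leading backslash
        if marker == record_marker or not sequences:
            sequences.append([])
        sequences[-1].append(marker)
    return sequences


def is_well_formed(markers, hierarchy):
    """True if the whole marker sequence matches the hierarchy regex."""
    return hierarchy.fullmatch("".join(m + " " for m in markers)) is not None


def pattern_counts(sequences):
    """Count each distinct record/field pattern."""
    return Counter(" ".join(seq) for seq in sequences)


# First record: the malformed example from the slides.
# Second record: an invented well-formed entry.
SAMPLE = """\
w yax
pos AUX-V
foo bar
degn green
w kalo
pos n
defn placeholder gloss
"""

HIERARCHY = compile_hierarchy(HIERARCHY_SPEC)
sequences = marker_sequences(SAMPLE)
for pattern, count in pattern_counts(sequences).most_common():
    ok = is_well_formed(pattern.split(), HIERARCHY)
    print(f"{count:4d}  {'ok ' if ok else 'BAD'}  {pattern}")
```

Listing patterns with their counts before writing the regex gives the user or consultant a concrete inventory to work from; patterns that occur only once or twice are good candidates for errors.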

Sample output
• regex
  … (ex ex.En ex.Fr)* …
• Input
  …
  ex Yax bo’on ta sna Antonio.
  ex.En I’m going to Antonio’s house.
  |
  ex Ban yax ba’at?
  ex.En Where are you going?
  ex.Fr Ou allez-vous?
• Output:
  …
  ex Yax bo’on ta sna Antonio.
  ex.En I’m going to Antonio’s house.
  |
  ex.Fr ***Missing field inserted***
  ex Ban yax ba’at?
  ex.En Where are you going?
  ex.Fr Ou allez-vous?
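The repair shown above can be imitated for a single fixed group with the Python sketch below. It is a deliberately simplified stand-in, not the author's parser: instead of working from the full hierarchy regex, it hard-codes the group `ex ex.En ex.Fr` and inserts a placeholder for any later member that is missing once an `ex` field has opened the group. The function name `repair_groups` is hypothetical, and stray group members that appear without a preceding `ex` are passed through untouched.

```python
MISSING = "***Missing field inserted***"


def repair_groups(fields, group=("ex", "ex.En", "ex.Fr")):
    """Insert placeholder fields so that every `ex` is followed by the
    remaining members of the group, in order."""
    out = []
    expected = 0  # index of the next expected group member; 0 = outside a group
    for marker, value in fields:
        if expected:
            if marker in group[expected:]:
                # Fill in any members skipped before this one.
                idx = group.index(marker)
                out.extend((m, MISSING) for m in group[expected:idx])
                out.append((marker, value))
                expected = (idx + 1) % len(group)
                continue
            # The group was cut short: fill in the rest and leave the group.
            out.extend((m, MISSING) for m in group[expected:])
            expected = 0
        out.append((marker, value))
        if marker == group[0]:
            expected = 1  # a new group has been opened
    if expected:
        # The record ended in the middle of a group.
        out.extend((m, MISSING) for m in group[expected:])
    return out


# The example fields from the slide.
fields = [
    ("ex", "Yax bo’on ta sna Antonio."),
    ("ex.En", "I’m going to Antonio’s house."),
    ("ex", "Ban yax ba’at?"),
    ("ex.En", "Where are you going?"),
    ("ex.Fr", "Ou allez-vous?"),
]

repaired = repair_groups(fields)
for marker, value in repaired:
    print(marker, value)
```

Writing the placeholder into the file, rather than reporting the problem in a separate list, puts the error where the user will fix it, which is the approach the slides argue for.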

More sample output
• Input
  w pos defn
  yax AUX-V Adj green
• Output
  w pos defn
  yax AUX-V Adj ***Erroneous field*** green

More sample output
• Input
  w yax
  pos AUX-V
  foo bar
  degn green
• Output
  w yax
  error ***Unable to parse record structure***
  pos AUX-V
  foo bar
  degn green

The next language
• Nahuatl lexicon
  – 11,000 entries
  – 5000 record/field patterns
  – 147 SFMs…