Fixing a Legacy Lexicon
Mike Maxwell
maxwell@ldc.upenn.edu
University of Pennsylvania
Linguistic Data Consortium and Department of Linguistics
3600 Market Street, Philadelphia, PA 19104 U.S.A.
www.ldc.upenn.edu
E-MELD Conference, 15-18 July 2004
The Problem
• Shoebox lexicon of Mawukakan
  – Inconsistencies:
    » Inconsistencies among POSs etc. (fixable in Shoebox)
    » Spelling errors: English, French and Mawu (import into Word, use English and French spell correctors)
    » Errors in hierarchy: missing fields, mis-ordered fields
    » Missing reciprocal cross-references
• Absolutely typical of Shoebox-style lexicons
• Repairs needed for
  – Archiving
  – Publication
  – Export/import
Old Solution
• Parse until error, characterize error, find error in Shoebox, fix error…
• Find all errors, send list to user, user fixes them, re-do…
Partial solutions
• Inconsistencies among POSs etc.
  – Fixable in Shoebox
  – Helpful addition: counts of POS tokens
• Spelling errors
  – Import into Word with automatic marking of language, use English and French spell correctors to fix errors, export back to Shoebox
  – No solution for Mawu spelling (n-grams)
• Missing cross-references
  – Easy to find with shell script, send list to users
  – Would be better to mark errors in lexicon
• Missing bi-directional references
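The two checks above that need no parser (POS token counts and missing reciprocal cross-references) can be sketched in a few lines of Python. This is only an illustration, not the shell script mentioned on the slide. It assumes each record has already been read into a list of (marker, value) pairs, that `w` holds the headword, and that `syn` holds the headword of the cross-referenced entry; the function names are hypothetical.

```python
from collections import Counter


def pos_counts(records):
    """Count each POS value; rare values are candidates for typos or variants."""
    return Counter(
        value for record in records for marker, value in record if marker == "pos"
    )


def missing_reciprocals(records, head="w", xref="syn"):
    """Return (source, target) links for which no target -> source link exists.

    Assumption: the xref field contains exactly the headword of the target entry.
    """
    links = set()
    for record in records:
        headword = next((v for m, v in record if m == head), None)
        if headword is None:
            continue  # record without a headword: nothing to link from
        for marker, value in record:
            if marker == xref:
                links.add((headword, value))
    return sorted((a, b) for a, b in links if (b, a) not in links)
```

Sorting the POS counts by frequency puts one-off tags at the end of the list, which is usually where the inconsistencies are. The list of one-way links could be sent to the user as is, or written back into the lexicon as flagged fields.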
Partial solutions
• Errors in hierarchy
    w ba’el
    pos v.i
    ex Yax bo’on ta sna Antonio.
    ex.En I’m going to Antonio’s house.
    | ex Ban yax ba’at?
    ex.En Where are you going?
    ex.Fr Ou allez-vous?
Repairing the hierarchy
• Solution: special-purpose parser, mark SFM file with errors and suggested fixes
• Need hierarchy
  – Cannot (reliably) extract hierarchy from Shoebox typ file
• User or consultant must provide definition of hierarchy, as regex:
    (w
      ( (pos defn (ex ex.En ex.Fr)* (syn)? )
      | (num pos defn (ex ex.En ex.Fr)* (syn)? )+
      )
    )
  – Tool to extract a list of all occurring record/field patterns
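A minimal Python sketch of the two pieces named above: a tool that lists every record/field pattern occurring in the file, and a check of each record against the hierarchy regex. It is not the parser described in this talk. It assumes standard format markers begin with a backslash at the start of a line, that `w` opens a record, and that the hierarchy is written in the notation shown on the slide; all function names are hypothetical.

```python
import re
from collections import Counter


def parse_records(text, record_marker="w"):
    """Split SFM text into records, each a list of (marker, value) pairs."""
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            # Continuation line: append it to the value of the previous field.
            if line.strip() and records:
                marker, value = records[-1][-1]
                records[-1][-1] = (marker, (value + " " + line.strip()).strip())
            continue
        parts = line[1:].split(None, 1)
        marker = parts[0] if parts else ""
        value = parts[1].strip() if len(parts) > 1 else ""
        if marker == record_marker or not records:
            records.append([])  # the record marker opens a new record
        records[-1].append((marker, value))
    return records


def field_patterns(records):
    """Count each distinct sequence of markers found in the records."""
    return Counter(" ".join(m for m, _ in record) for record in records)


def compile_hierarchy(spec):
    """Turn a hierarchy such as '(w (pos defn (ex)* ))' into a Python regex.

    Each marker becomes a token followed by one space, so the compiled regex
    matches a record's marker sequence joined in the same way.
    """
    pieces = []
    for token in re.findall(r"[()|*+?]|[^\s()|*+?]+", spec):
        if token == "(":
            pieces.append("(?:")
        elif token in {")", "|", "*", "+", "?"}:
            pieces.append(token)
        else:
            pieces.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(pieces))


def conforms(record, hierarchy):
    """True if the record's marker sequence matches the whole hierarchy."""
    sequence = "".join(marker + " " for marker, _ in record)
    return hierarchy.fullmatch(sequence) is not None
```

Running `field_patterns` first gives the user or consultant the inventory of patterns from which to write the hierarchy. `conforms` then only says whether a record fits; locating the error and suggesting a fix is the job of the special-purpose parser.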
Sample output
• Regex
    … (ex ex.En ex.Fr)* …
• Input
    …
    ex Yax bo’on ta sna Antonio.
    ex.En I’m going to Antonio’s house.
    | ex Ban yax ba’at?
    ex.En Where are you going?
    ex.Fr Ou allez-vous?
• Output
    …
    ex Yax bo’on ta sna Antonio.
    ex.En I’m going to Antonio’s house.
    | ex.Fr ***Missing field inserted***
    ex Ban yax ba’at?
    ex.En Where are you going?
    ex.Fr Ou allez-vous?
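The repair shown above can be approximated for a single fixed group such as (ex ex.En ex.Fr). The Python sketch below is a simplification under stated assumptions, not the parser's actual algorithm: the record is a list of (marker, value) pairs, the first marker of the group always opens a new group, and the remaining markers must follow in order. The name `repair_groups` is hypothetical.

```python
MISSING = "***Missing field inserted***"


def repair_groups(fields, group=("ex", "ex.En", "ex.Fr")):
    """Insert a flagged placeholder for each group member that is absent."""
    repaired = []
    i = 0
    while i < len(fields):
        marker, value = fields[i]
        repaired.append((marker, value))
        i += 1
        if marker != group[0]:
            continue  # not the start of a group: copy the field unchanged
        # Inside a group: expect the remaining markers in order.
        for expected in group[1:]:
            if i < len(fields) and fields[i][0] == expected:
                repaired.append(fields[i])
                i += 1
            else:
                repaired.append((expected, MISSING))
    return repaired
```

Applied to the input above, the sketch adds a flagged ex.Fr field after the first ex.En and leaves the second, complete group untouched. Because the placeholder text is fixed, the user can search for it in Shoebox and fill in the missing translations.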
More sample output
• Input
    w yax
    pos AUX-V Adj
    defn green
• Output
    w yax
    pos AUX-V Adj ***Erroneous field***
    defn green
More sample output
• Input
    w yax
    pos AUX-V
    foo bar
    degn green
• Output
    w yax
    error ***Unable to parse record structure***
    pos AUX-V
    foo bar
    degn green
The next language
• Nahuatl lexicon
  – 11,000 entries
  – 5,000 record/field patterns
  – 147 SFMs…