CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque • I. Leturia (1), A. Gurrutxaga (1), I. Alegria (2), A. Ezeiza (2) • WAC3 – September 15-16, 2007 – Louvain-la-Neuve • (1) Elhuyar R&D, Usurbil, Basque Country • (2) IXA Group, University of the Basque Country, Donostia, Basque Country
Contents • Motivation • Problems with Basque language • Our approach • CorpEus, a ‘web as corpus’ tool for Basque • EusBila, a search service for Basque • Evaluation
Motivation • No doubt corpora are necessary: – for linguistic research – for language normalization – for developing language technologies • But many corpora are exclusively used for these purposes • They are not made publicly available and searchable through the Internet
Motivation • For Basque, it is essential to have corpora available for querying – Standardization of Basque started only in 1968 – Many rules, words and spellings have been changing since; still, every now and then new rules are released by the Academy of Basque Language – It was not taught in schools until the seventies and in universities until the eighties – No decision as to the correct word or spelling has yet been taken in many areas or words – Even written production abounds with misspellings, errors, uncertainties, etc.
Motivation • The Basque-speaking community needs corpora – Teachers – Writers – Technical text producers – Dictionary makers – Translators – Students – Academics in the field of standardization • Basque is not a language rich in corpora – The existing ones are few, small and not kept up to date
Motivation • Only corpora available (I): – XX. mendeko euskararen corpusa: • Academy of the Basque language • 4.6 million words • Balanced • Literary texts • Twentieth century • http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html
Motivation • Only corpora available (II): – Ereduzko prosa gaur: • University of the Basque Country • 23.8 million words • Literary and press texts regarded as “reference” • 2000-2005 • http://www.ehu.es/euskaraorria/euskara/ereduzkoa/araka.html
Motivation • Only corpora available (III): – Zientzia eta teknologiaren corpusa: • Elhuyar Foundation and the IXA Group of the University of the Basque Country • 7.6 million words • Texts on science and technology • 1990-2002 • http://www.ztcorpusa.net
Motivation • Only corpora available (IV): – Klasikoen gordailua: • Susa publishing house • 10.7 million words • Non-tagged • Classic texts • http://klasikoak.armiarma.com/corpus.htm
Motivation • But we do have the Internet – Huge repository of texts – Constantly updated • A tool for querying the Internet as if it were a Basque corpus would be very interesting
Motivation • Also disadvantages: – Not linguistically tagged: • Always some uncertainty • Variants and misspellings will not appear when looking for a word – It never shows every occurrence, only those in the first results returned by search engines – The Internet is often considered non-representative – The Internet is full of redundancy
Motivation • Nevertheless, we thought that the benefits far exceeded the disadvantages • We embarked on a project to build a ‘web as corpus’ tool for Basque
Problems with Basque language • Similar services exist: – WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi) – WebCorp (http://www.webcorp.org.uk/) – KWiCFinder (http://www.kwicfinder.com) • But these rely on search engines • Search engines don’t work well for Basque
Problems with Basque language • Looking for conjugations and inflections – Basque is an agglutinative language • A given lemma makes many different word forms – lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)… – Looking only for the exact given word, or the word plus an “s” for the plural, is not enough – Wildcards are not an appropriate solution • Looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)…
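The over-matching of wildcards can be illustrated with a small sketch. It uses only the words quoted on the slide; the shell-style matching of Python's `fnmatch` stands in for a search engine's wildcard, which is an assumption, and the function names are hypothetical.

```python
from fnmatch import fnmatch

# Forms of the lemma "lan" listed on the slide.
LAN_FORMS = {"lan", "lana", "lanak", "lanari", "lanei", "lanaren", "lanen"}

# Toy vocabulary: the forms above plus two unrelated words from the slide.
VOCABULARY = sorted(LAN_FORMS | {"lanabes", "lanbro"})


def wildcard_matches(pattern, words):
    """Return the words matched by a shell-style wildcard pattern."""
    return [w for w in words if fnmatch(w, pattern)]


def false_positives(pattern, words, wanted):
    """Words the wildcard returns that are not forms of the wanted lemma."""
    return [w for w in wildcard_matches(pattern, words) if w not in wanted]


print(false_positives("lan*", VOCABULARY, LAN_FORMS))
```

The wildcard finds every form of lan, but it also pulls in lanabes and lanbro, which is why an explicit list of generated forms is preferable.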
Problems with Basque language • Language discrimination – No search engine offers the possibility of returning only pages in Basque – Big problem when looking for: • Technical words that exist also in other languages: anorexia, sulfuroso, byte, allegro, sistema, energia… • Short words: katu (“cat”), ur (“water”)… • Proper nouns: Egipto, Newton, Pluton… – Many non-Basque results are returned, often no Basque results at all
Problems with Basque language • Lack of knowledge about the language – Status of language: • Late standardization • Still many changes in words and rules • Late teaching in schools and universities • Many non-standardised areas or words • Many misspellings and errors in written production – A word might be incorrect but appear often on the web – The user might think it is correct, without knowing that a more appropriate word exists
Our approach • Looking for conjugations and inflections: Morphological query expansion (I) – Morphological generator created by the IXA Group of the University of the Basque Country – We obtain all the forms of a given lemma – We ask the search engine for all of them using an OR operator – etxe (“house”) => etxe OR etxeak OR etxeari OR etxeek OR …
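A minimal sketch of the OR expansion, assuming the morphological generator has already returned a list of forms. The function name `expand_query` is hypothetical, and the list holds only the four forms quoted on the slide; the real generator would return many more.

```python
def expand_query(forms):
    """Join word forms into one OR query, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(forms))
    return " OR ".join(unique)


# Forms of "etxe" quoted on the slide (stand-in for the generator's output).
ETXE_FORMS = ["etxe", "etxeak", "etxeari", "etxeek"]

print(expand_query(ETXE_FORMS))
```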
Our approach • Looking for conjugations and inflections: Morphological query expansion (II) – Minor problems: • The search engines’ APIs each limit the number of words or the length of the search phrase – we had to discover the limits by trial and error • Due to these limits, true lemmatised search is impossible – we looked in a corpus for the most frequent cases, numbers, tenses, etc. of the declensions and conjugations of words – these are the forms sent in the query
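One way to respect such limits is to add forms in descending frequency order until the next form would break a limit. The sketch below assumes the forms arrive already sorted by corpus frequency; the limit values are placeholders, since the slide gives no figures, and `build_limited_query` is a hypothetical name.

```python
def build_limited_query(forms_by_frequency, max_terms, max_chars):
    """Add forms in frequency order until a term or length limit would be exceeded."""
    chosen = []
    for form in forms_by_frequency:
        candidate = " OR ".join(chosen + [form])
        if len(chosen) + 1 > max_terms or len(candidate) > max_chars:
            break
        chosen.append(form)
    return " OR ".join(chosen)


# Assumed frequency order, for illustration only.
forms = ["etxe", "etxeak", "etxeari", "etxeek"]
print(build_limited_query(forms, max_terms=3, max_chars=100))
print(build_limited_query(forms, max_terms=10, max_chars=15))
```

The first call is cut off by the term limit and the second by the length limit, so the rarest forms are the ones dropped in both cases.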
Our approach • Language discrimination: Language-filtering words (I) – We looked in a corpus for the most frequent words in Basque – We include them in the search phrase using an AND operator
Our approach • Language discrimination: Language-filtering words (II) – Minor problems (I): • The most frequent words in Basque exist in other languages too • Several language-filtering words had to be used – the more filtering words, the higher the precision (fewer non-Basque pages returned) but the lower the recall (more Basque pages left out), and vice versa – we favoured precision and include four filtering words – if few results are returned, the user can try again with fewer filtering words to increase recall
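The precision/recall trade-off comes down to how many filtering words are ANDed onto the query. A sketch, assuming a generic `AND`/`OR` syntax with parentheses (the actual API syntax may differ); the four filter words are common Basque words chosen here for illustration and are not taken from the slides.

```python
def add_language_filter(query, filter_words, count=4):
    """Require the first `count` filter words alongside the user's query."""
    required = " AND ".join(filter_words[:count])
    return f"({query}) AND {required}" if required else query


# Hypothetical filter words: frequent Basque words, for illustration only.
FILTER_WORDS = ["eta", "da", "ez", "ere"]

print(add_language_filter("etxe OR etxeak", FILTER_WORDS))           # favours precision
print(add_language_filter("etxe OR etxeak", FILTER_WORDS, count=2))  # favours recall
```

Lowering `count` corresponds to the retry offered to the user when few results are returned.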
Our approach • Language discrimination: Language-filtering words (III) – Minor problems (II): • In bilingual pages, the searched word can be in a piece of text that is not in Basque – We use LangId, a free language identifier developed by the IXA Group of the University of the Basque Country – it is applied to some context around the word to see whether the word is in a piece of Basque text – it does not work well with small contexts, but if the context is too big, pieces in other languages can be included – we start with quite a broad context and progressively reduce its length until the minimal length for LangId to work properly is reached – if at any point LangId says the context is in Basque, we stop and show the occurrence
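The shrinking-context check can be sketched as a loop over ever smaller windows. LangId's real interface is not described in the slides, so the `is_basque` callable, the radius values and the toy detector below are all assumptions made for illustration.

```python
def occurrence_in_basque(tokens, index, is_basque,
                         start_radius=40, min_radius=5, step=5):
    """Test ever smaller windows around tokens[index]; accept on the first Basque verdict."""
    radius = start_radius
    while radius >= min_radius:
        window = tokens[max(0, index - radius):index + radius + 1]
        if is_basque(" ".join(window)):
            return True
        radius -= step
    return False


# Toy stand-in for LangId: "Basque" if at least half the words are frequent Basque words.
BASQUE_HINTS = {"eta", "da", "ez", "ere"}


def toy_is_basque(text):
    words = text.split()
    hits = sum(w in BASQUE_HINTS for w in words)
    return bool(words) and hits / len(words) >= 0.5


# A short Basque stretch surrounded by text in another language.
tokens = ["the"] * 30 + ["eta", "etxea", "da", "ez", "ere"] + ["the"] * 30
print(occurrence_in_basque(tokens, 31, toy_is_basque,
                           start_radius=8, min_radius=2, step=2))
```

In this toy case the wide windows are dominated by the other language and are rejected; only the narrowest window is judged Basque, at which point the loop stops.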
Our approach • Lack of knowledge about the language: Variant suggestion (I) – EDBL, lexical database created by the IXA Group of the University of the Basque Country – Each word is linked to its variants, common errors, old spellings, etc. – When a user enters a word, its standard form or variants are suggested
Our approach • Lack of knowledge about the language: Variant suggestion (II) – Partly alleviates one of the problems caused by the web not being linguistically tagged: • in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma • with our approach, the user can obtain the variants too
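Variant suggestion amounts to a two-way lookup between a standard form and its variants. The sketch below uses a tiny hand-made table as a stand-in for EDBL; the table contents, its structure and the `suggest` function are assumptions and do not reflect the database's actual format.

```python
# Hypothetical table mapping a non-standard spelling to its standard form.
VARIANT_TO_STANDARD = {"haundi": "handi", "bezela": "bezala"}


def suggest(word, variant_to_standard):
    """Return (standard form, other known variants) for the word the user typed."""
    standard = variant_to_standard.get(word, word)
    others = sorted(v for v, s in variant_to_standard.items()
                    if s == standard and v != word)
    return standard, others


print(suggest("haundi", VARIANT_TO_STANDARD))  # user typed a variant
print(suggest("handi", VARIANT_TO_STANDARD))   # user typed the standard form
```

A user who types the variant is pointed to the standard form, and a user who types the standard form can choose to search the variants as well.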
CorpEus • System architecture: – User enters word – Query the EDBL for variants – Query morphological generator to obtain conjugations and inflections – Query APIs of search engines – Download pages – Find occurrences of the forms of the word – Query LangId for the language each occurrence is in – Show KWiCs and counts
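[Figure: CorpEus system architecture. The user sends a word to CorpEus and receives KWiCs and counts. CorpEus sends the word to EDBL (IXA) and gets variants; sends the word and variants to the morphological generator (IXA) and gets inflections and conjugations; sends the search phrase to the search engines’ APIs and gets URLs; sends the URLs to the WWW and gets web pages; sends the occurrence contexts to LangId (IXA) and gets the language.]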
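The architecture steps can be strung together as follows. This is a sketch only: every external component (EDBL, the morphological generator, the search API, the downloader, LangId) is passed in as a callable, and the stubs at the bottom are toy stand-ins, not the real services.

```python
import re


def run_pipeline(word, get_variants, generate_forms, search, download,
                 is_basque, width=30):
    """Follow the architecture steps; returns (KWiC lines, counts per form)."""
    lemmas = [word] + list(get_variants(word))
    forms = [f for lemma in lemmas for f in generate_forms(lemma)]
    urls = search(" OR ".join(forms))
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b")
    kwics = []
    for url in urls:
        text = download(url)
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            # Keep the occurrence only if its context is judged to be Basque.
            if is_basque(left + m.group() + right):
                kwics.append((left, m.group(), right))
    counts = {}
    for _, form, _ in kwics:
        counts[form] = counts.get(form, 0) + 1
    return kwics, counts


# Toy stand-ins for the external components.
PAGES = {"u1": "gure etxeak handiak dira eta etxe bat", "u2": "the house is big"}
kwics, counts = run_pipeline(
    "etxe",
    get_variants=lambda w: [],
    generate_forms=lambda lemma: [lemma, lemma + "ak"],
    search=lambda query: list(PAGES),
    download=PAGES.get,
    is_basque=lambda text: "dira" in text,
)
print(counts)
```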
CorpEus • Features (I): – Lemma-based search – Language-filtered search – Variant suggestion
CorpEus • Features (II): – Ambiguous or unrecognised words: • The user chooses the analysis upon which to base the morphological generation
CorpEus • Features (III): – Search for more than one word: • Lemma-based search performed for all of them • Occurrences of any of the words are shown
CorpEus • Features (IV): – Noun phrase or term searching: • Enclosing various terms in double quotes • Morphological generation applied to last word • Thus, proper lemma-based search for whole noun phrases or terms (in Basque, only the last component of the noun phrase or term is inflected)
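Noun phrase expansion can be sketched by inflecting only the last word and ORing the quoted phrases. The toy generator below simply appends two endings and is only a stand-in for the real morphological generator; `expand_phrase` is a hypothetical name, and the phrase is assumed to be non-empty.

```python
def expand_phrase(phrase, generate_forms):
    """Inflect only the last word of a phrase and OR the quoted results together."""
    *head, last = phrase.split()
    variants = [" ".join(head + [form]) for form in generate_forms(last)]
    return " OR ".join(f'"{v}"' for v in variants)


def toy_forms(word):
    """Toy generator: bare form, singular and plural article endings."""
    return [word, word + "a", word + "ak"]


print(expand_phrase("etxe handi", toy_forms))
```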
CorpEus • Features (V): – Different ordering criteria: • Order in which pages arrive (default) • Form of searched word • Context after the word • Context before the word – Ordered on the fly as they arrive
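The ordering criteria map naturally onto sort keys. In this sketch the `Kwic` tuple and the criterion names are hypothetical, and sorting the left context from the word outwards is a common concordancer convention assumed here, not something the slides specify.

```python
from typing import NamedTuple


class Kwic(NamedTuple):
    left: str   # context before the word
    form: str   # the matched word form
    right: str  # context after the word


SORT_KEYS = {
    "form": lambda k: k.form.lower(),
    "after": lambda k: k.right.lower(),
    # Compare the left context starting from the token nearest the word.
    "before": lambda k: [t.lower() for t in reversed(k.left.split())],
}


def order_kwics(kwics, criterion="arrival"):
    """Return the KWiC lines ordered by the chosen criterion."""
    if criterion == "arrival":
        return list(kwics)
    return sorted(kwics, key=SORT_KEYS[criterion])


lines = [Kwic("gure", "etxeak", "handiak dira"), Kwic("zure", "etxe", "bat da")]
print([k.form for k in order_kwics(lines, "form")])
```

To order on the fly, each arriving line could be inserted at its sorted position instead of re-sorting the whole list.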
CorpEus • Features (VI): – Analysis of the words: • Possible lemmas and POSs of the forms of the searched word are shown in a floating box • Different colours: – Light green: correct word, unambiguous – Dark green: variant, unambiguous – Light yellow: correct word, ambiguous – Dark yellow: variant, ambiguous – Red: unrecognised word
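The colour scheme is a lookup on two properties of a recognised word. A minimal sketch, with a hypothetical function name and boolean flags standing in for the output of the morphological analysis:

```python
# (is_variant, is_ambiguous) -> colour, as listed on the slide.
COLOURS = {
    (False, False): "light green",
    (True, False): "dark green",
    (False, True): "light yellow",
    (True, True): "dark yellow",
}


def colour_for(recognised, is_variant=False, ambiguous=False):
    """Pick the display colour for a word form."""
    if not recognised:
        return "red"
    return COLOURS[(is_variant, ambiguous)]


print(colour_for(True, is_variant=True, ambiguous=False))
```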
CorpEus • Features (VII): – Count charts: • Word forms • Possible lemma or POS • Word before or after • Lemma of word before or after • …
CorpEus • Features (VIII): – Many textual content file types: • HTML • XML • RSS • TXT • PDF • DOC • RTF • PPT • XLS • … – Parallel downloading of pages to avoid blocking
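Parallel downloading can be sketched with a thread pool that yields each page as soon as it is ready, so one slow or failing page does not hold up the rest. The `fetch` callable is injected and the worker count is an arbitrary choice; this is not necessarily how CorpEus implements it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """Yield (url, content or None) in completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result()
            except Exception:
                # A failed download is reported as None instead of stopping the rest.
                yield url, None


def fake_fetch(url):
    """Toy downloader used only for this example."""
    if "bad" in url:
        raise ValueError("download failed")
    return url.upper()


results = dict(fetch_all(["a", "bad", "c"], fake_fetch))
print(results)
```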
CorpEus • Demo: http://www.corpeus.org
EusBila • Search engines don’t work well for Basque • We decided to build a search service for Basque based on the principles of CorpEus: – API based – Lemma-based search – Language-filtered search – Variant suggestion • But it returns URLs and snippets, not KWiCs or charts
EusBila • Problem: limit of calls per day of the APIs – Google: 1,000 calls per day – Yahoo!: 5,000 calls per day – Windows Live Search: 10,000 calls per day • The limits can be enough for a corpus tool, but not for a general-use search service • Microsoft recently raised the limit to 25,000 calls per day and also launched a commercial license with unlimited use
EusBila • Published a paper in iNEWS07 (Improving Non-English Web Searching), a workshop at SIGIR’07 (July 2007, Amsterdam) • It aroused interest, as it is a cost-effective web search solution that can be used by other minority languages with few resources
EusBila • Launch: – By Eleka Ingeniaritza Linguistikoa – Under commercial name Elebila – October 2007
EusBila • Demo: EusBila
Evaluation • The methodology used in EusBila and CorpEus was evaluated for the iNEWS07 paper on EusBila • We evaluated: – Gain in recall due to morphological query expansion – Gain in precision due to language-filtering words – Loss in recall due to language-filtering words
Evaluation • Indicator for precision: percentage of results that were actually in Basque • Indicator for recall: estimated hit counts returned by the API • Compared Windows Live Search’s API with EusBila using this same API • The words for the evaluation were taken from the search logs of a very popular science portal in Basque
Evaluation

Measure | Condition: language-filtering words | Condition: morphological query expansion | Condition: words | Measured variable | Result
--- | --- | --- | --- | --- | ---
Gain in recall due to morphological query expansion | Not applied | - | Only Basque | Hit counts | 89.43% increase
Gain in recall due to morphological query expansion | Applied | - | Any kind | Hit counts | 40.19% increase
Gain in precision due to language-filtering words | - | Not applied | Any kind | % of results in Basque | 70.55 points increase, from 27.19% to 97.74%
Loss in recall due to language-filtering words | - | Not applied | Only Basque | Hit counts | Decrease from 6.48% to 57.69%, depending on the number of language-filtering words*

* The number of filtering words can optionally be reduced to increase the recall when few results are returned
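The two indicators reduce to simple arithmetic. The sketch below checks that the reported precision figures are consistent with each other (27.19% plus 70.55 points gives 97.74%) and shows how a relative change in hit counts would be computed; the hit counts used are hypothetical, as the slides report only percentages.

```python
def precision_pct(basque_results, total_results):
    """Percentage of returned results that are actually in Basque."""
    return 100.0 * basque_results / total_results


def relative_change_pct(before, after):
    """Relative change in estimated hit counts, in percent."""
    return 100.0 * (after - before) / before


# Reported precision before and after adding language-filtering words.
gain_in_points = 97.74 - 27.19
print(round(gain_in_points, 2))

# Hypothetical counts, for illustration only.
print(precision_pct(9, 10))
print(relative_change_pct(200, 250))
```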