1af79fd4a43ce2f7d3df2e5ecd1d64b7.ppt
- Number of slides: 26
Knowledge Center for Processing Hebrew
Alon Itai – CS Technion
Tools for underrepresented languages
o Computer tools and especially the Internet are Anglophile.
o Search engines are not tooled for morphologically rich languages.
Search: "dog", "dogs", "and dogs"
o [Screenshot: Hebrew search-engine results for pages about dogs (כלב, כלבים): dog training, adoption, breeds, pet sites]
Tools for underrepresented languages
o Computer tools and especially the Internet are Anglophile.
o Search engines are not tooled for morphologically rich languages.
o Email and chat do not cope well with strange alphabets.
  • Users resort to (pidgin) English for communication, …
o The local language is used less and less.
The problem
o Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools.
o Even when tools are available, they are not open source.
o Tools developed at universities are not fit for general use:
  • not robust enough
  • no standard interface
  • lack of documentation
Duplication of Effort
o Every researcher has to redevelop her own tools before conducting original research.
o For example, Hebrew has many morphological analyzers:
  • Choueka and Shapira 1964
  • Ornan 1987
  • Lavie et al. 1988
  • Bentur et al. 1992
  • Segal 1999
  • HSPELL
  • Yona and Wintner 2005
The Knowledge Center
o In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew.
o Its aim is to develop products (software and databases) for processing Hebrew and make them available to the public, both in academia and in industry.
o Researchers from four universities are involved in the Center's activities.
The researchers
o Yoad Winter (Technion)
o Shuly Wintner (Haifa University)
o Michael Elhadad (Ben Gurion University)
o Arnon Cohen (Ben Gurion University)
o Yoram Singer (Hebrew University)
o Eli Shamir (Hebrew University)
o Alon Itai (Technion)
The model
o The ministry provides initial funds.
o The Center should be self-sustaining: it should finance itself by selling products.
The problems:
o The market is too small; had it been large, there would have been no need for the Center.
o Selling products contradicts our philosophy of open research and open code.
Licensing Policy
o Available under the GPL (GNU General Public License): you get it for free if all products derived from it are also under the GPL.
o Payments only for special services.
o Commercial users can obtain a non-exclusive license.
XML
o All products are represented in XML.
  • Readable by both machines and humans
  • Enables using off-the-shelf tools for on-screen presentation and validation
Example (info for the morphological parser):
  <item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בקר">
    <noun gender="masculine" number="singular" plural="im">
      <replace gender="masculine" number="plural" script="formal" transliterated="bqarim" undotted="בקרים"/>
    </noun>
  </item>
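Because the lexicon is plain XML, an entry like the one on this slide can be read with any standard XML library. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the `summarize` function and the keys of the dictionary it returns are illustrative names, not part of the Center's tools, and the entry is the slide's example with its quotes normalized.

```python
import xml.etree.ElementTree as ET

# Lexicon entry from the slide, with typographic quotes replaced by plain ones.
ENTRY = """
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בקר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal"
             transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
"""

def summarize(xml_text):
    """Collect a few fields of a noun entry (hypothetical helper)."""
    item = ET.fromstring(xml_text)
    noun = item.find("noun")
    # <replace> children list forms that override the regular inflection.
    overrides = [r.get("transliterated") for r in noun.findall("replace")]
    return {
        "id": item.get("id"),
        "lemma": item.get("transliterated"),
        "gender": noun.get("gender"),
        "plural_suffix": noun.get("plural"),
        "overrides": overrides,
    }

print(summarize(ENTRY))
```

The same entry could be validated against a schema or rendered on screen with off-the-shelf XML tools, which is the point the slide makes.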
XML (2)
o Facilitates the interface between tools.
o For example, the output of the morphological analyzer is the input for the morphological disambiguator.
o Thus one can match different morphological analyzers with different disambiguators and compare their results.
Products
o Morphological analyzers
o Morphological disambiguators
o Lexicon
o Corpora
o Speech database
o Tools for editing lexicons and tagging corpora
o PR: forum, …
The lexicon by part of speech
noun          10,332    preposition     100
verb           4,485    conjunction      62
proper name    4,227    pronoun          60
adjective      1,612    interjection     40
adverb           352    interrogative     9
quantifier       132    negation          6
Total: 21,417
GUI for editing the lexicon
Morphological disambiguators
o Roy Bar-Haim constructed an HMM-based parser which partitions each word in a corpus into morphemes (success rate 96%).
o Erel Segal combined a Brill-like method with a priori occurrence probabilities.
o Meni Adler used an HMM on whole words.
o All three disambiguators are available at the Center.
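To illustrate the general idea behind HMM-based disambiguation, here is a toy Viterbi search that picks one analysis per word from a set of candidates. It is a sketch under stated assumptions, not the implementation of any of the three disambiguators: the tag names, the probabilities, and the `FLOOR` smoothing constant are all invented for the example.

```python
import math

FLOOR = 1e-6  # assumed probability for unseen tag transitions (arbitrary choice)

def viterbi(candidates, transitions, start="<s>"):
    """Return the most probable tag sequence.

    candidates:  one dict per word, mapping each candidate tag to a positive
                 emission (or a priori) probability.
    transitions: dict mapping (previous_tag, tag) to a probability.
    """
    best = {start: (0.0, [])}  # tag -> (log-probability, path so far)
    for options in candidates:
        step = {}
        for tag, emission in options.items():
            # Choose the best predecessor for this tag.
            score, path = max(
                (prev_score + math.log(transitions.get((prev, tag), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score + math.log(emission), path + [tag])
        best = step
    return max(best.values())[1]

# Invented numbers: an unambiguous pronoun followed by a verb/noun ambiguity.
words = [{"pronoun": 1.0}, {"verb": 0.4, "noun": 0.6}]
trans = {("<s>", "pronoun"): 0.5, ("pronoun", "verb"): 0.7, ("pronoun", "noun"): 0.1}
print(viterbi(words, trans))
```

In this toy input the noun reading is more frequent on its own, but the transition probabilities make the verb reading win in context, which is what a sequence model adds over a priori frequencies.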
Corpora
Corpus    Total size    Unique tokens
          11,062,232    319,666
          11,216,867    304,160
           1,300,326    166,780
          17,732,122    262,338
Corpus names on the slide: Arutz 7, Sha'ar la-matkhil (dotted), Knesset
Corpora (2)
o A manually tagged corpus of 6000 sentences (12,000 tokens).
Tree bank
o 6000 syntactically parsed sentences.
o Used for automatic parsing.
Conclusions
o The Center is an example of cooperation between researchers in several universities.
o Many users have downloaded the products.
o 10 companies have purchased licenses.
Conclusions (2)
o Money is running out, …
o The model requires money, experts, and commitment.
o Not suitable for languages with very few speakers, or for poor communities.
Modern Hebrew
o Official language of the State of Israel
o Spoken by 7 million people
o Related to, but linguistically distinct from, Biblical Hebrew
o Morphologically rich
Semitic Word Formation
root + pattern → word
root    pattern CaCaC        pattern yiCCoC
ktb     katab (he wrote)     yiktob (he will write)
šbr     šabar (he broke)     yišbor (he will break)
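Root-and-pattern formation can be sketched in a few lines: each `C` slot in the pattern is filled, in order, with the next consonant of the root. The function name `apply_pattern` is illustrative only, and the sketch ignores the phonological adjustments a real morphological generator has to handle.

```python
import unicodedata

def apply_pattern(root, pattern):
    """Fill each 'C' slot of the pattern with the next root consonant."""
    # Normalize so that a letter such as š counts as one character.
    root = unicodedata.normalize("NFC", root)
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

for root in ("ktb", "šbr"):
    print(root, apply_pattern(root, "CaCaC"), apply_pattern(root, "yiCCoC"))
# ktb katab yiktob
# šbr šabar yišbor
```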
Writing System
o Most vowels are omitted.
o Particles are prepended to words. Example:
  • h: definite article
  • b: preposition (in)
  • w: conjunction (and)
  • wbbyt = w + b + ha + byt (and in the house)
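Since particles are written as single letters attached to the word, a tool has to consider every way of peeling them off. The sketch below enumerates candidate splits for a transliterated token using only the three particles named on the slide; `PARTICLES`, `candidate_splits`, and the `min_stem` cutoff are assumptions made for illustration. It does not model the definite article that is absorbed after b (the "ha" in the slide's example), and it deliberately over-generates: choosing the right split is the disambiguator's job.

```python
# The three prefix particles mentioned on the slide (transliterated).
PARTICLES = {"w": "and", "b": "in", "h": "the"}

def candidate_splits(token, min_stem=2):
    """List (prefixes, stem) pairs for every run of leading particle letters.

    min_stem is an assumed lower bound on the length of the remaining stem.
    """
    results = [((), token)]  # the token may also carry no prefix at all
    prefixes = []
    rest = token
    while len(rest) > min_stem and rest[0] in PARTICLES:
        prefixes.append(rest[0])
        rest = rest[1:]
        results.append((tuple(prefixes), rest))
    return results

for prefixes, stem in candidate_splits("wbbyt"):
    print(" + ".join(prefixes + (stem,)))
```

For `wbbyt` the output includes the intended reading `w + b + byt` alongside spurious ones such as `w + bbyt`.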
Morphological Ambiguity
o Most words are morphologically ambiguous.
o Example: šbth (שבתה)
  1. šavta = šbt + CaCCa = stopped working
  2. šavta = šbh + CaCCa = took prisoner
  3. šabatah = her Saturday
  4. še-b-te = that in tea
  5. še-b-ha-te = that in the tea
  6. še-bit-h = that her daughter
  …