
Marathi – Marathi Monolingual Information Retrieval
Mr. Ashish Almeida, Prof. Pushpak Bhattacharyya
Overview
• Morphological analyzer
• Suffix processing
• Stop-words
• Future work
Present work
• A search for “भारत” (bhaarat, Bharat) will not match pages that contain terms such as
  – भारताचा – bhaarataachaa – of Bharat
  – भारतात – bhaarataat – in Bharat
• Lack of a large corpus
• Unavailability of tools
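The matching problem can be illustrated with a minimal sketch, assuming the simplest possible retrieval: exact comparison of whitespace-separated tokens. The function name and the sample page are hypothetical and serve only as an illustration.

```python
def exact_match(query, document):
    """Return True if the query equals some whitespace-separated token."""
    return query in document.split()


# Hypothetical one-phrase "page" that contains only an inflected form.
page = "भारताचा इतिहास"  # "history of Bharat"

print(exact_match("भारत", page))     # bare form: not a token of the page
print(exact_match("भारताचा", page))  # inflected form: present as a token
```

Without some normalization of word forms, the bare query term and the inflected page term are treated as unrelated strings.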
Corpus Statistics – Marathi
• 99,275 documents (510 MB)
  – Maharashtra Times
  – Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
  – DOC – document
  – DOCNO – document identifier
  – TEXT – article
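A minimal sketch of reading such a collection, assuming each article is wrapped as DOC containing one DOCNO and one TEXT element. The sample record, its identifier, and the function names are made up for illustration. A regular expression is used rather than an XML parser because a file of concatenated DOC elements has no single root element.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def _field(tag, block):
    """Return the stripped content of the first <tag>...</tag> in block."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_documents(raw):
    """Yield (docno, text) pairs from a string of concatenated DOC elements."""
    for block in DOC_RE.findall(raw):
        yield _field("DOCNO", block), _field("TEXT", block)


# Hypothetical record; the identifier and text are not from the corpus.
sample = """
<DOC>
<DOCNO>example-0001</DOCNO>
<TEXT>भारतात पाऊस</TEXT>
</DOC>
"""
docs = list(parse_documents(sample))
print(docs)
```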
Document: example
Topics
• 100 topics
• Aligned with English topics
• XML tags
  – num: query identifier
  – title: title of the query
  – desc: description
  – narr: additional information about the query
• Cover both local and international issues
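A minimal sketch of turning one topic into a query string. The enclosing tag name `top` is an assumption (borrowed from TREC-style topic files), as are the placeholder field contents and the choice of building the query from title and desc only.

```python
import xml.etree.ElementTree as ET


def topic_to_query(topic_xml, fields=("title", "desc")):
    """Return (query id, query text) built from the chosen topic fields."""
    topic = ET.fromstring(topic_xml)
    query_id = topic.findtext("num", default="").strip()
    parts = [topic.findtext(name, default="").strip() for name in fields]
    return query_id, " ".join(part for part in parts if part)


# Hypothetical topic with placeholder text in every field.
sample_topic = """<top>
<num>1</num>
<title>example title</title>
<desc>example description</desc>
<narr>example narrative</narr>
</top>"""

print(topic_to_query(sample_topic))
print(topic_to_query(sample_topic, fields=("title",)))
```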
Topic example
Tools
• Terrier
  – Open source IR system
  – Models
    • TF-IDF (vector space model)
    • DFR-BM25 (probabilistic)
  – Both models available in Terrier
• Evaluation against relevance-judged documents for 25 queries
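To make the two model families concrete, here is a minimal sketch of textbook TF-IDF and classical Okapi BM25 scoring over tokenized documents. This is not Terrier's code: Terrier's DFR-BM25 is derived within the Divergence From Randomness framework and differs in detail. The parameter values k1=1.2 and b=0.75 are common defaults assumed here, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_score(query, doc, docs):
    """Sum of tf * log(N / df) over the query terms (textbook TF-IDF)."""
    n = len(docs)
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df:
            score += tf[term] * math.log(n / df)
    return score


def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Classical Okapi BM25 with document-length normalization."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Hypothetical toy collection, with tokens given in transliteration.
docs = [["bharat", "cricket"], ["ghar", "bharat", "bharat"], ["ghar"]]
query = ["bharat"]
for doc in docs:
    print(doc, tf_idf_score(query, doc, docs), bm25_score(query, doc, docs))
```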
Lemmatizer vs. stemmer
  – भारताला – bhaarataalaa – for Bharat
  – भारताचा – bhaarataachaa – of Bharat
  – भारतात – bhaarataat – in Bharat
  – भारतावर – bhaarataavar – on Bharat
• Lemmatizer finds the lemma
  – भारत
• Stemmer finds the stem: the longest unchangeable word prefix
  – भारता
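The "longest unchangeable word prefix" can be sketched as the longest common prefix of the inflected forms, compared code point by code point. This is only an illustration of the definition, not how any particular stemmer is implemented; the function name is an assumption.

```python
import os


def longest_common_prefix(forms):
    """Return the longest prefix shared by all strings (code-point level)."""
    return os.path.commonprefix(list(forms))


forms = ["भारताला", "भारताचा", "भारतात", "भारतावर"]
stem = longest_common_prefix(forms)
print(stem)
```

For these four forms the shared prefix is भारता, which keeps the vowel sign that the dictionary form भारत lacks. That difference is why a lemma matches the bare query term while a prefix-based stem does not.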
Marathi suffixes
• Suffixes include case markers, postposition markers, etc.
• A suffix may attach after another suffix
• Example:
  – घरासमोरचादेखील
  – घरा-समोर-चा-देखील
  – gharaa-samor-chaa-dekhil
  – house-front-of-also
  – Root word: घर (ghar, house)
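Stacked suffixes can be peeled off one at a time from the right. The sketch below assumes a hypothetical three-entry suffix list and a crude rule that drops a trailing oblique vowel sign once at least one suffix has been removed. A real analyzer would use a lexicon and proper morphological rules.

```python
# Hypothetical suffix list covering only the example word.
SUFFIXES = ["देखील", "समोर", "चा"]
OBLIQUE_MARKER = "ा"  # crude assumption: the oblique stem ends in this sign


def strip_suffixes(word):
    """Repeatedly strip known suffixes; return (root, suffixes left to right)."""
    stripped = []
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                stripped.append(suffix)
                changed = True
                break
    if stripped and word.endswith(OBLIQUE_MARKER):
        word = word[: -len(OBLIQUE_MARKER)]
    stripped.reverse()
    return word, stripped


root, suffixes = strip_suffixes("घरासमोरचादेखील")
print(root, suffixes)
```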
Morphological analyzer
• Use of Marathi morphology analyzer
  – Better matching of words
    • र म versus र म
• Gives all possible roots
  – Selects the first root (the most frequent)
• Used at both the indexing and the query processing end
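A minimal sketch of applying the same normalization at indexing and at query time. The lookup table is a hypothetical stand-in for the morphological analyzer, and its candidate roots are assumed to be ordered most frequent first. All function names and sample documents are illustrative.

```python
from collections import defaultdict

# Hypothetical stand-in for the analyzer: surface form -> candidate roots.
TOY_ANALYSES = {
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
    "भारत": ["भारत"],
}


def lemmatize(token):
    """Pick the first candidate root; fall back to the token itself."""
    roots = TOY_ANALYSES.get(token, [])
    return roots[0] if roots else token


def build_index(docs):
    """Map each lemma to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemmatize(token)].add(doc_id)
    return index


def search(index, query):
    """Lemmatize the query the same way and collect matching documents."""
    results = set()
    for token in query.split():
        results |= index.get(lemmatize(token), set())
    return results


docs = {"d1": "भारताचा इतिहास", "d2": "भारतात पाऊस", "d3": "घर"}
index = build_index(docs)
hits = search(index, "भारत")
print(sorted(hits))
```

Because both sides pass through `lemmatize`, the bare query now reaches the documents that contain only inflected forms.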
Lemmatizer Results

Run                                        MAP      R-precision   Precision at 5   Precision at 10   Recall
TF-IDF without lemmatizer                  0.3366   0.2944        0.3167           0.2583            0.8724
TF-IDF + lemmatizer                        0.4003   0.3551        0.3417           0.2917            0.9686
DFR-BM25 without lemmatizer                0.3455   0.3209        0.3500           0.2667            0.8744
DFR-BM25 + lemmatizer                      0.4140   0.3686        0.3833           0.3083            0.9619
DFR-BM25 + lemmatizer (FIRE submission)    0.3625   0.3797        0.4600           0.3960            0.9178
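The measures reported above can be sketched with their standard textbook definitions, shown here for a single ranked list and a set of relevant documents. The function names and the toy ranking are illustrative, and an evaluation tool may differ in details such as cut-offs.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranked list, relevant set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranking for one query; d6 is relevant but never retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
print(precision_at_k(ranked, relevant, 5), r_precision(ranked, relevant))
print(recall(ranked, relevant), average_precision(ranked, relevant))
```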
Suffixes
• Usually ignored
• Indexing suffixes – not studied
• Index selected suffixes
  – Suffixes of space and time
    • वर – var – on
    • समोर – samor – in front of
    • मध्ये – madhye – in
    • नंतर – nantar – after
• Created manually – list of 66 words
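One way to picture suffix indexing is to emit a selected suffix as its own index term instead of discarding it. The sketch below uses only the four suffixes named above, not the full 66-word list. The splitting rule and function name are assumptions, and the remaining stem would still need to go through the morphological analyzer.

```python
# Four space/time suffixes from the list above; the full list is not shown.
SPACE_TIME_SUFFIXES = ["वर", "समोर", "मध्ये", "नंतर"]


def index_terms(token):
    """Return [stem, suffix] if a selected suffix is attached, else [token]."""
    for suffix in sorted(SPACE_TIME_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return [token[: -len(suffix)], suffix]
    return [token]


print(index_terms("घरासमोर"))  # stem in its oblique form plus the suffix
print(index_terms("घर"))       # no selected suffix attached
```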
Stop-words
• Most frequently occurring words
• Little discriminatory value
• Occur in 80% or more of the documents
• Selected stop-words
  – त , य , न , अस , आह , य , ह , कर , त
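The 80% criterion can be sketched as a document-frequency threshold over tokenized documents. The function name is an assumption, and the toy collection uses made-up tokens in transliteration.

```python
from collections import Counter


def select_stop_words(docs, threshold=0.8):
    """Return the terms that occur in at least `threshold` of the documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term for term, count in df.items() if count / len(docs) >= threshold}


# Hypothetical collection of five tokenized documents.
docs = [
    ["aahe", "ani", "ghar"],
    ["aahe", "ani", "bharat"],
    ["aahe", "ani", "ghar"],
    ["aahe", "ani", "cricket"],
    ["aahe", "ghar"],
]
print(sorted(select_stop_words(docs)))
```

Here "aahe" appears in all five documents and "ani" in four of five, so both meet the threshold, while "ghar" (three of five) does not.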
Results: suffix indexing and stop-words

Run                                                         MAP      R-precision   Precision at 5   Precision at 10   Recall
DFR-BM25 + lemmatization + suffix indexing                  0.4381   0.3846        0.3917           0.3167            0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words     0.4433   0.3798        0.4000           0.3208            0.9731
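As a small worked calculation, the relative change in MAP at each step of the DFR-BM25 runs can be computed from the reported values. This assumes the runs are directly comparable (same queries and relevance judgements); the step labels are shorthand.

```python
# MAP values as reported for the DFR-BM25 runs, in order of added processing.
map_scores = [
    ("DFR-BM25 without lemmatizer", 0.3455),
    ("+ lemmatizer", 0.4140),
    ("+ suffix indexing", 0.4381),
    ("+ stop-words", 0.4433),
]


def relative_gains(scores):
    """Return (label, relative change over the previous run) for each step."""
    gains = []
    for (_, previous), (label, current) in zip(scores, scores[1:]):
        gains.append((label, (current - previous) / previous))
    return gains


for label, gain in relative_gains(map_scores):
    print(f"{label}: {gain:+.1%}")
```

Under that assumption, lemmatization accounts for most of the gain (roughly 20% relative), with smaller gains from suffix indexing (roughly 6%) and stop-word removal (roughly 1%).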
P-R graph
• Precision-recall graph for all four cases
Future work
• Morphological analyzer
  – Accuracy 94.5%, needs to be improved
• Heuristic suffix stripping for unknown words
• Handle derivational morphology
• Spelling variations, common spelling mistakes
Acknowledgement
• “Cross Lingual Information Access” Project
• Maharashtra Times: Times Media Group
  – http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group
  – http://www.sakaal.in/
References
• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
• Jacques Savoy, Searching Strategies for the Bulgarian Language
• Morphological Analyzer, CFILT
Thank you