  • Number of slides: 20

Marathi – Marathi Monolingual Information Retrieval
Mr. Ashish Almeida
Prof. Pushpak Bhattacharyya

Overview
• Morphological analyzer
• Suffix processing
• Stop-words
• Future work

Present work
• Search “भारत” – bhaarat – Bharat
• Will not match pages which have terms such as
  – भारताचा – bharataachaa – of Bharat
  – भारतात – bharataat – in Bharat
• Lack of a large corpus
• Unavailability of tools
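The mismatch described above can be illustrated with a short sketch. It is not the presenters' system: the two-entry suffix list and the helper name `strip_suffix` are assumptions made purely for illustration, and real Marathi morphology needs far more than string trimming.

```python
# Illustrative only: a toy suffix list, not the analyzer used in the talk.
# The spellings of the "of" and "in" markers are assumed for this example.
SUFFIXES = ["ाचा", "ात"]


def strip_suffix(token: str) -> str:
    """Remove the first matching suffix from the token, if there is one."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


query = "भारत"
page_terms = ["भारताचा", "भारतात"]

# Exact term matching finds nothing, because every page term is inflected.
exact_hits = [t for t in page_terms if t == query]

# After normalizing both sides, the inflected forms match the query.
normalized_hits = [t for t in page_terms if strip_suffix(t) == strip_suffix(query)]
```

Applying the same normalization to both the query and the page terms is what makes the two sides comparable.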

Corpus Statistics – Marathi
• 99,275 documents (510 MB)
  – Maharashtra Times
  – Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
  – DOC – document
  – DOCNO – document identifier
  – TEXT – article

Document: example
<DOC>
<DOCNO>Maharashtra.C06E811C6B.htm.txt</DOCNO>
<TEXT>
म हफल वचणय स गललय तरण वर ब बटय च हलल (attack of a leopard on a young man who had gone to collect Moha flowers)
इसल पर , त . २२ - च र ळ आण म हफल वचणय स ठ जगल त गललय एक आद व स तरण वर ब बटय न अच नक हलल कलय न त तरण गभ र जखम झ ल आह. ह घटन शकरव र (त . २० ) मळझर (त . क नवट ) य ग व चय जगल त घडल . . . . इसल पर वन पर कषतर क रय लय अतरगत यण ऱय मळझर यथ ल आद व स तरण मन हर. . .
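A collection in this layout could be read with a few lines of code. The following is a minimal sketch, assuming every document is fully closed with `</TEXT>` and `</DOC>` (the excerpt above is cut off before that point); the function name `parse_docs` and the sample identifiers are hypothetical.

```python
import re

# Assumes well-formed records: <DOC><DOCNO>..</DOCNO><TEXT>..</TEXT></DOC>.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(?P<docno>.*?)</DOCNO>\s*<TEXT>(?P<text>.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)


def parse_docs(raw: str) -> list[dict[str, str]]:
    """Return one {'docno', 'text'} record per <DOC> element found in raw."""
    return [
        {"docno": m.group("docno").strip(), "text": m.group("text").strip()}
        for m in DOC_RE.finditer(raw)
    ]


# Hypothetical sample input, not taken from the corpus.
sample = """<DOC>
<DOCNO>example-0001.txt</DOCNO>
<TEXT>
first article body
</TEXT>
</DOC>
<DOC>
<DOCNO>example-0002.txt</DOCNO>
<TEXT>
second article body
</TEXT>
</DOC>"""

docs = parse_docs(sample)
```

A regular expression is used here rather than an XML parser because a file holding many `<DOC>` elements has no single root element.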

Topics
• 100 topics
• Aligned with English topics
• XML tags
  – num: query identifier
  – title: title of the query
  – desc: description
  – narr: additional information about the query
• Cover all issues – local, international

Topic example
<top>
<num>1</num>
<title>टवट -२० व शवचषक त ल भ रत च कर ड पटतव (India’s championship in the Twenty-20 World Cup)</title>
<desc>पह लय आयस स व शव टवट -२० सरव तकषट -व जत -सपरधत ल भ रत चय व जय च वतत दण र लख श ध . </desc>
<narr>टवट -२० व शचचषक सपरधमध ल प क सत न व रदध भ रत च व जय , हय ऐत ह स क व जय न म तत खळ डन कलल व करम तय न म ळव लल बकष स व परसक र च रककम स मन व र च तसच

Tools
• Terrier
  – Open source IR system
  – Models
    • TF-IDF (vector space model)
    • DFR-BM25 (probabilistic)
  – Both models available in Terrier
• Evaluation against relevance-judged documents for 25 queries

Lemmatizer vs. stemmer
  – भारताला – bhaarataalaa – for Bharat
  – भारताचा – bhaarataachaa – of Bharat
  – भारतात – bhaarataat – in Bharat
  – भारतावर – bhaarataavar – on Bharat
• Lemmatizer finds lemma – भारत
• Stemmer finds stem: longest unchangeable word prefix – भारत

Marathi suffixes
• Suffixes include case markers, postposition markers, etc.
• Suffixes may be attached after another suffix
• Example:
  – घरासमोरचादेखील
  – घरा-समोर-चा-देखील
  – gharaa-samor-chaa-dekhil
  – house-front-of-also
  – Root word: घर (ghar) (house)

Morphological analyzer
• Use of Marathi morphological analyzer
  – Better matching of words
    • र म versus र म
• Gives all possible roots
  – Selects first root – most frequent
• Used at both the indexing and the query processing end

Lemmatizer Results
Run                                       MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                 0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                       0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer               0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                     0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)   0.3625  0.3797       0.4600          0.3960           0.9178

Suffixes
• Usually ignored
• Indexing suffixes – not studied
• Index selected suffixes
  – Suffixes of space and time
    • वर – var – on
    • समोर – samor – in front of
    • मध्ये – madhye – in
    • नंतर – nantar – after
• Created manually – list of 66 words

Stop-words
• Most frequently occurring words
• Little discriminatory value
• Occur in 80% or more documents
• Selected stop-words
  – त , य , न , अस , आह , य , ह , कर , त

Results: suffix indexing and stop-words
Run                                                       MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing                0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words   0.4433  0.3798       0.4000          0.3208           0.9731

P-R graph
• Precision-recall graph for all four cases (figure not reproduced)

Future work
• Morphological analyzer
  – Accuracy 94.5%
  – Needs to be improved
• Heuristic suffix stripping: unknown words
• Handle derivational morphology
• Spelling variations, common spelling mistakes

Acknowledgement
• “Cross Lingual Information Access” Project
• Maharashtra Times: Times Media Group – http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group – http://www.sakaal.in/

References
• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
• Jacques Savoy, Searching strategies for the Bulgarian language
• Morphological Analyzer, CFILT

Thank you
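
As a reading aid for the measures reported in the results slides (MAP, R-precision, precision at 5 and 10, recall), here is a minimal sketch of their standard textbook definitions. It is not Terrier's evaluation code; the function names and the document identifiers in the example are hypothetical.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked: list[str], relevant: set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def recall(ranked: list[str], relevant: set[str]) -> float:
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranking for one query; "d5" is relevant but never retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

p_at_5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

Average precision divides by the total number of relevant documents, so relevant documents that are never retrieved lower the score.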