
  • Number of slides: 20

Marathi – Marathi Monolingual Information Retrieval
Mr. Ashish Almeida
Prof. Pushpak Bhattacharyya

Overview
• Morphological analyzer
• Suffix processing
• Stop-words
• Future work

Present work
• A search for "भारत" (bhaarat, Bharat) will not match pages that contain only inflected forms such as
  – भारताचा (bhaarataachaa): of Bharat
  – भारतात (bhaarataat): in Bharat
• Lack of a large corpus
• Unavailability of tools
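The mismatch described above can be reproduced with a toy inverted index. The following is a minimal sketch, not the system used in this work: the two sample documents, the suffix list, and the function names are all hypothetical, and the naive suffix stripping merely stands in for real morphological analysis.

```python
from collections import defaultdict

# Hypothetical, hand-picked suffixes covering only the forms in this example.
# Longest suffixes are tried first so a longer match is preferred.
SUFFIXES = sorted(["ाचा", "ात", "ाला", "ावर"], key=len, reverse=True)


def strip_suffix(token, min_stem=2):
    """Remove one known suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def build_index(docs, normalize=lambda t: t):
    """Map each normalized whitespace-separated token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


# Hypothetical two-document collection; each uses an inflected form of "भारत".
docs = {
    "d1": "भारताचा विजय",  # "victory of Bharat"
    "d2": "भारतात पाऊस",   # "rain in Bharat"
}

exact = build_index(docs)
stripped = build_index(docs, normalize=strip_suffix)

# Exact lookup: the bare query term is not an index key.
exact_hits = exact.get("भारत", set())
# Normalized lookup: query and documents pass through the same function.
stripped_hits = stripped.get(strip_suffix("भारत"), set())
```

With exact tokens the query finds nothing, while applying the same normalization at indexing and query time retrieves both documents.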

Corpus statistics: Marathi
• 99,275 documents (510 MB)
  – Maharashtra Times
  – Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
  – DOC: document
  – DOCNO: document identifier
  – TEXT: article

Document: example
<DOC>
<DOCNO>Maharashtra.C06E811C6B.htm.txt</DOCNO>
<TEXT>
म हफल वचणय स गललय तरण वर ब बटय च हलल
(Leopard attack on a young man who had gone to collect Moha flowers)
इसल पर , त . २२ - च र ळ आण म हफल वचणय स ठ जगल त गललय एक आद व स तरण वर ब बटय न अच नक हलल कलय न त तरण गभ र जखम झ ल आह. ह घटन शकरव र (त . २० ) मळझर (त . क नवट ) य ग व चय जगल त घडल ...
इसल पर वन पर कषतर क रय लय अतरगत यण ऱय मळझर यथ ल आद व स तरण मन हर ...
</TEXT>
</DOC>
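A collection in this DOC / DOCNO / TEXT layout can be read with a few regular expressions. This is only a sketch under stated assumptions: it expects every block to carry closing tags, takes already-decoded UTF-8 text, and uses a made-up document identifier. The function name `parse_documents` is hypothetical, and this is not how Terrier itself reads the corpus.

```python
import re

# Non-greedy patterns; DOTALL lets "." span the line breaks inside a block.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_documents(raw):
    """Return (docno, text) pairs for every <DOC> block in a decoded string."""
    docs = []
    for block in DOC_RE.findall(raw):
        docno = DOCNO_RE.search(block)
        text = TEXT_RE.search(block)
        if docno is None or text is None:
            continue  # skip malformed blocks rather than guess their content
        docs.append((docno.group(1).strip(), text.group(1).strip()))
    return docs


# Made-up sample with a placeholder identifier.
sample = """<DOC>
<DOCNO>example-0001</DOCNO>
<TEXT>
भारत
</TEXT>
</DOC>
<DOC>
<DOCNO>example-0002</DOCNO>
</DOC>"""

parsed = parse_documents(sample)
```

The second block has no TEXT element, so it is skipped and only the first document is returned.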

Topics
• 100 topics
• Aligned with English topics
• XML tags
  – num: query identifier
  – title: title of the query
  – desc: description
  – narr: additional information about the query
• Cover all issues: local, international
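A topic with the four fields listed above can be loaded into a small record and turned into a query string. The sketch below assumes well-formed XML with a `top` root element. The `Topic` class, the `build_query` helper, and the choice of which fields feed the query are illustrative assumptions, since the slides do not state which fields were used for retrieval.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Topic:
    num: str
    title: str
    desc: str
    narr: str


def parse_topic(xml_text):
    """Parse one well-formed <top> element into a Topic record."""
    top = ET.fromstring(xml_text)

    def field(tag):
        # findtext returns None when the child element is missing.
        return (top.findtext(tag) or "").strip()

    return Topic(field("num"), field("title"), field("desc"), field("narr"))


def build_query(topic, use_desc=False):
    """Join the title, and optionally the description, into one query string."""
    parts = [topic.title]
    if use_desc:
        parts.append(topic.desc)
    return " ".join(p for p in parts if p)


# Made-up topic with English placeholder text.
sample = """<top>
<num>1</num>
<title>sample title</title>
<desc>sample description</desc>
<narr>sample narrative</narr>
</top>"""

topic = parse_topic(sample)
short_query = build_query(topic)
long_query = build_query(topic, use_desc=True)
```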

Topic example
<top>
<num>1</num>
<title>टवट -२० व शवचषक त ल भ रत च कर ड पटतव</title>
(India's championship in the Twenty-20 World Cup)
<desc>पह लय आयस स व शव टवट -२० सरव तकषट -व जत -सपरधत ल भ रत चय व जय च वतत दण र लख श ध .</desc>
<narr>टवट -२० व शचचषक सपरधमध ल प क सत न व रदध भ रत च व जय , हय ऐत ह स क व जय न म तत खळ डन कलल व करम तय न म ळव लल बकष स व परसक र च रककम स मन व र च तसच ...</narr>
</top>

Tools
• Terrier
  – Open source IR system
  – Models
    • TF-IDF (vector space model)
    • DFR-BM25 (probabilistic)
  – Both models are available in Terrier
• Evaluation against relevance-judged documents for 25 queries

Lemmatizer vs. stemmer
  – भारताला (bhaarataalaa): for Bharat
  – भारताचा (bhaarataachaa): of Bharat
  – भारतात (bhaarataat): in Bharat
  – भारतावर (bhaarataavar): on Bharat
• A lemmatizer finds the lemma
  – भारत
• A stemmer finds the stem: the longest unchangeable word prefix
  – भारत

Marathi suffixes
• Suffixes include case markers, postposition markers, etc.
• A suffix may be attached after another suffix
• Example:
  – घरासमोरचादेखील
  – घरा-समोर-चा-देखील
  – gharaa-samor-chaa-dekhil
  – house-front-of-also
  – Root word: घर (ghar, house)

Morphological analyzer
• Use of a Marathi morphological analyzer
  – Better matching of words
    • र म versus र म
• Gives all possible roots
  – Selects the first root, which is the most frequent
• Used at both the indexing and the query-processing end

Lemmatizer results
Run                                        MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                  0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                        0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer                0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                      0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)    0.3625  0.3797       0.4600          0.3960           0.9178

Suffixes
• Usually ignored
• Indexing of suffixes has not been studied
• Index selected suffixes
  – Suffixes of space and time
    • वर (var): on
    • समोर (samor): in front of
    • मध्ये (madhye): in
    • नंतर (nantar): after
• Created manually
  – List of 66 words

Stop-words
• Most frequently occurring words
• Little discriminatory value
• Occur in 80% or more of the documents
• Selected stop-words
  – त , य , न , अस , आह , य , ह , कर , त

Results: suffix indexing and stop-words
Run                                                         MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing                  0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words     0.4433  0.3798       0.4000          0.3208           0.9731

P-R graph
• Precision-recall graph for all four cases [figure]

Future work
• Morphological analyzer
  – Accuracy 94.5%
    • Needs to be improved
• Heuristic suffix stripping for unknown words
• Handle derivational morphology
• Spelling variations, common spelling mistakes

Acknowledgement
• "Cross Lingual Information Access" Project
• Maharashtra Times: Times Media Group
  – http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group
  – http://www.sakaal.in/

References
• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
• Jacques Savoy, Searching strategies for the Bulgarian language
• Morphological Analyzer, CFILT

Thank you
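The results above are reported as MAP, R-precision, precision at 5 and 10, and recall. As a reminder of how these measures are commonly defined, here is a minimal sketch for a single ranked list. The function names and the sample ranking are hypothetical, and the scores in the slides were presumably produced by standard evaluation tooling rather than by code like this.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up ranking: d1 and d2 are retrieved, d9 is relevant but never retrieved.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p5 = precision_at_k(ranked, relevant, 5)
rp = r_precision(ranked, relevant)
ap = average_precision(ranked, relevant)
rec = recall(ranked, relevant)
```

Because the unretrieved relevant document still counts in the denominator, average precision penalizes missed documents as well as poor ranking.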