eca9a65b7683b81177bbc104536dff5a.ppt
- Number of slides: 28
University of Tehran, Database Research Group. Persian@CLEF 2008: Mono & Cross Language Experiments on Persian Text. Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian. Database Research Group, School of Electrical and Computer Engineering, University of Tehran. 18 Sep 2008 1
Outline
• Persian Language
• Persian Test Collections
• Hamshahri in CLEF 2008
• UT Participants
  • Using Part of Speech Tagging in Persian Information Retrieval
  • Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
  • Local Cluster Analysis Using Part of Speech Tagging
  • Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
  • Cross Language Experiments at Persian@CLEF 2008
• Next Year
2
The Persian Language
• A branch of the Indo-European languages
• Official language of Iran, Afghanistan and Tajikistan
• Its morphological analysis is comparatively difficult
• The word “خبر” (news) has two plural forms:
  • Persian rules: “خبرها”
  • Arabic rules: “اخبار”
3
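The two plural forms above show why suffix stripping alone is not enough. The sketch below is a deliberately naive light stemmer, not the stemmer used in the talk: the suffix list, the function name `light_stem` and the minimum stem length are assumptions made for illustration. It handles the regular Persian plural but leaves the Arabic broken plural untouched.

```python
# Hypothetical suffix list for illustration only; a real light stemmer
# would cover many more Persian suffixes.
PLURAL_SUFFIXES = ("های", "ها")


def light_stem(word, min_stem_len=2):
    """Strip a regular Persian plural suffix if enough of the word remains."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # Regular plural: the suffix is removed.
    print(light_stem("خبرها"))
    # Arabic broken plural: no suffix to strip, so the word is returned unchanged.
    print(light_stem("اخبار"))
```

Mapping the broken plural back to its singular would need a lexicon or a morphological analyzer, which is outside the scope of this sketch.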
Some Processing Issues
Writing style issues:
• e.g. “می شود” and “میشود” are the same
• e.g. “کتابها” and “کتاب ها” are the same
KASRE (a short vowel that is usually not written):
• e.g. “چراغ علی خانه را سوزاند” has two different meanings:
  • CheraghAli burned the house
  • Ali’s lantern burned the house
4
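The writing style variants above can be collapsed with a small normalization pass before indexing. The following is a minimal sketch under stated assumptions, not the preprocessing used by the authors: the function name `normalize`, the choice of the joined spelling as the canonical form, and the restriction to the prefix "می" and the suffix "ها" are all illustrative.

```python
import re

# Map Arabic yeh and kaf to their Persian counterparts so that both
# encodings of the same word compare equal.
CHAR_MAP = str.maketrans({"ي": "ی", "ك": "ک"})


def normalize(text):
    """Collapse spacing variants of the verbal prefix and the plural suffix."""
    text = text.translate(CHAR_MAP)
    # Join a standalone prefix "می" to the following word, whether it was
    # separated by whitespace or by a zero-width non-joiner (U+200C).
    text = re.sub(r"\bمی[\s\u200c]+", "می", text)
    # Join a detached plural suffix "ها" to the preceding word.
    text = re.sub(r"[\s\u200c]+ها\b", "ها", text)
    return text


if __name__ == "__main__":
    print(normalize("می شود") == normalize("میشود"))
    print(normalize("کتاب ها") == normalize("کتابها"))
```

This rule is crude: it also joins the rare standalone word "می", and a production normalizer might instead standardize on the zero-width non-joiner form.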
Some Processing Issues Encoding 5
Persian in the Middle East: User Population Growth on the Web (2000-2008), December 31, 2007. Source: Internet World Statistics, http://internetworldstats.com/ 6
Persian Test Collections
IR domain:
• Ghavanin (domain specific)
• Hamshahri (news), WEB: http://ece.ut.ac.ir/dbrg/hamshahri
NLP domain:
• Bijankhan (2 million words), WEB: http://ece.ut.ac.ir/dbrg/bijankhan
7
Hamshahri in CLEF 2008
• News articles from the Hamshahri newspaper, 1996 to 2002
• Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB)
• 22 assessors
• Evaluation based on the DIRECT system
8
Hamshahri in CLEF 2008
Collection size               564 MB (Unicode text)
No. of documents              166,774
No. of unique terms           417,339
Average length of documents   380 terms
No. of categories             9
No. of topics                 50 (bilingual)
9
Implementation of our methods We submitted top 100 for each run 10
Using Part of Speech Tagging in Persian Information Retrieval. Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian 11
Using Part of Speech Tagging in Persian Information Retrieval Config. Corpus Query 1 Tagged Title with equal weighting for all POS tags 2 Stemmed and tagged Stemmed title with equal weighting for all POS tags 3 Stemmed title without POS tagging 4 Stemmed Title plus description 5 Stemmed Title plus description (stop words removed) 6 Stemmed (stop words removed) Tagged 7 Tagged 8 Normal Title plus description with equal weighting for all POS tags Title with various weighting schemes for different POS tags Title (Neither stemmed nor tagged) 12
Using Part of Speech Tagging in Persian Information Retrieval
20 less used tags omitted, others equal weight Noun=3 Noun=0 Verb=2 Verb=0 Adj=1 Avj=3 Adj=0 Adj=1 Adj=0 Adv=1 Adv=0 Adv=0 Adv=1
Average precision   0.2745  0.2635  0.2597  0.1108  0.1198  0.0977
R-Precision         0.3097  0.3104  0.2888  0.1256  0.1186  0.1111
13
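The weighting schemes on this slide assign a per-tag weight to each query term. As a rough illustration of the idea, the sketch below turns a POS-tagged query into a weighted term map. The function name `weight_query`, the tag names, and the specific weights are assumptions chosen for illustration; the slide compares several such schemes and the extracted text does not make the exact column assignment recoverable.

```python
# Illustrative weights only; tags missing from the map are dropped,
# which mimics omitting the less used tags.
POS_WEIGHTS = {"Noun": 3.0, "Verb": 2.0, "Adj": 1.0, "Adv": 1.0}


def weight_query(tagged_terms, pos_weights=POS_WEIGHTS, default=0.0):
    """Map (term, tag) pairs to {term: weight}, skipping zero-weight terms.

    If a term occurs more than once, its weights are summed.
    """
    weighted = {}
    for term, tag in tagged_terms:
        weight = pos_weights.get(tag, default)
        if weight > 0:
            weighted[term] = weighted.get(term, 0.0) + weight
    return weighted


if __name__ == "__main__":
    query = [("book", "Noun"), ("buy", "Verb"), ("quickly", "Adv"), ("the", "Det")]
    print(weight_query(query))
```

Setting a tag's weight to 0 removes terms with that tag from the query, which is one way to read the "=0" entries in the schemes above.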
Using Part of Speech Tagging in Persian Information Retrieval 14
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Weighting Model: Description
• BB2: Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
• BM25: The BM25 probabilistic model
• DFR_BM25: The DFR version of BM25
• IFB2: Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
• In_expB2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
• In_expC2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
• InL2: Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
• PL2: Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
• TF_IDF: The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf
Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
15
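The models above are listed only by name. As a rough illustration of what one of them computes, here is a textbook Okapi BM25 term score. The function name, the defaults `k1=1.2` and `b=0.75`, and the smoothed idf variant are assumptions; Terrier's own implementation may differ in detail.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Score one query term against one document with a textbook BM25 form.

    tf: term frequency in the document
    df: number of documents containing the term
    """
    if tf <= 0 or df <= 0:
        return 0.0
    # Smoothed idf that stays non-negative even for very common terms.
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # Term frequency saturates with k1 and is normalized by document length via b.
    norm_tf = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf


if __name__ == "__main__":
    # Collection figures from the Hamshahri statistics slide: 166,774 documents,
    # average length 380 terms. The tf and df values are made up.
    print(bm25_term_score(tf=3, df=1000, doc_len=380, avg_doc_len=380, n_docs=166774))
```

A document's score for a query is the sum of this term score over the query terms.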
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Weighting Model   Average Precision   R-Precision
BB2               0.3854              0.4167
BM25              0.3562              0.4009
DFR_BM25          0.4006              0.4347
IFB2              0.4017              0.4328
In_expB2          0.3997              0.4329
In_expC2          0.4190              0.4461
InL2              0.3832              0.4200
PL2               0.4314              0.4548
TF_IDF            0.3574              0.4017
16
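The two measures reported here can be computed per topic as sketched below, using the standard definitions of average precision and R-precision. The function names and the toy ranking are illustrative, and the ranking is assumed to contain no duplicate document ids. The reported figures would be means over all topics.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant) / r


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4"]
    relevant = {"d1", "d3", "d5"}
    print(average_precision(ranking, relevant))
    print(r_precision(ranking, relevant))
```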
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track And two other variations of this operator: IOWA and NOWA 17
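The extracted slide text does not name the base operator. Assuming it is the ordered weighted averaging (OWA) operator, which IOWA and NOWA would then extend, a minimal sketch of fusing the scores that several retrieval models give to one document could look like this. The function name `owa` and the example weights are assumptions, and the IOWA and NOWA variants themselves are not reproduced here.

```python
def owa(scores, weights):
    """Ordered weighted average: weights apply to positions in the
    descending sort of the scores, not to particular sources."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


if __name__ == "__main__":
    # Normalized scores for one document from three hypothetical retrieval models.
    model_scores = [0.2, 0.9, 0.5]
    print(owa(model_scores, [0.5, 0.3, 0.2]))
```

With weights `[1, 0, 0]` the operator reduces to the maximum score, with `[0, 0, 1]` to the minimum, and with equal weights to the plain mean, which is why it is a convenient family for score fusion.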
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track 18
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Post hoc Results
Retrieval Method                     Toolkit   Average Precision   R-Precision
TF_IDF with unstemmed single terms   Terrier   0.3847              0.4122
PL2 with 4-gram terms                Terrier   0.3669              0.3939
Indri with stemmed terms             Lemur     0.3955              0.4149
IOWA                                 -         0.4515              0.4708
NOWA                                 -         0.4522              0.4736
Dif                                            +5.67
19
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri 20
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text But the result was not good on the test set 21
Cross Language Experiments at Persian@CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Run                   tot-ret   rel-ret   MAP     Retrieval Model     Tool
Using Light Stemmer   5161      1967      26.89   Vector Space        Lucene
Without Stemmer       5161      1991      27.08   Vector Space        Lucene
3 Grams               5161      1901      26.07   Language Modeling   Lemur
4 Grams               5161      1950      26.70   Language Modeling   Lemur
5 Grams               5161      1983      27.13   Language Modeling   Lemur
Term-Based            5161      2035      28.14   Language Modeling   Lemur
22
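The 3, 4 and 5 gram runs index sub-word units instead of whole terms. The slide does not say how the n-grams were built; the sketch below assumes overlapping character n-grams taken within each whitespace-separated word, with words no longer than n kept whole. The function names are illustrative.

```python
def char_ngrams(word, n):
    """Overlapping character n-grams of a word; words of length <= n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_tokens(text, n):
    """Split on whitespace and replace each word by its character n-grams."""
    tokens = []
    for word in text.split():
        tokens.extend(char_ngrams(word, n))
    return tokens


if __name__ == "__main__":
    print(char_ngrams("retrieval", 4))
    print(ngram_tokens("کتاب ها", 3))
```

Because n-grams never cross word boundaries in this sketch, the two spellings of a plural still produce different token sets unless the text is normalized first.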
Cross Language Experiments at Persian@CLEF 2008
Query Translation
• Probabilistic Structured Queries (PSQ)
• Combinatorial Translation Probability (CTP)
23
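As commonly described, Probabilistic Structured Queries estimate the statistics of a source-language query term as a probability-weighted sum over its translations. The sketch below follows that general description and is not confirmed as the authors' implementation. The function names, the example translations and their probabilities are invented for illustration, and CTP is not covered.

```python
def psq_tf(translations, doc_term_counts):
    """Estimated term frequency of a query term in one document.

    translations: {target_term: p(target_term | query_term)}
    doc_term_counts: {target_term: count in the document}
    """
    return sum(p * doc_term_counts.get(term, 0) for term, p in translations.items())


def psq_df(translations, doc_freqs):
    """Estimated document frequency of a query term in the collection."""
    return sum(p * doc_freqs.get(term, 0) for term, p in translations.items())


if __name__ == "__main__":
    # Hypothetical translation table for the English query term "book".
    book = {"کتاب": 0.8, "دفتر": 0.2}
    print(psq_tf(book, {"کتاب": 3}))
    print(psq_df(book, {"کتاب": 1000, "دفتر": 500}))
```

The estimated tf and df can then be plugged into any monolingual weighting function in place of the usual counts.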
Cross Language Experiments at Persian@CLEF 2008 Query Translation Results 24
Cross Language Experiments at Persian@CLEF 2008
Document Translation
• Using the Shiraz machine translation system from CRL of NMSU
• Took 10 days to translate 130,000+ docs from Persian to English
25
Cross Language Experiments at Persian@CLEF 2008 Document Translation & Hybrid Results 26
Next Year
Ham2 for the next year: extended version of the Hamshahri collection, 2 times larger (~1.5 GB)
<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>دوشنبه 11 دي 1385 - سال چهاردهم - شماره 4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE>
<![CDATA[مديركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد
آيين نامه خريد كتاب اصلاح شد]]>
</TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[فارس: مدير كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت: آيين نام
</TEXT>
</DOC>
<DOC>
27
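A record in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened, well-formed sample built from the field values shown on the slide. The function name `parse_doc` and the sample's placeholder body text are assumptions, and a real collection file holding many `<DOC>` records without a single root element would first need to be split or wrapped.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened sample for illustration; the TEXT body is a placeholder.
SAMPLE_DOC = """<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[آيين نامه خريد كتاب اصلاح شد]]></TITLE>
<TEXT><![CDATA[...]]></TEXT>
</DOC>"""


def parse_doc(xml_text):
    """Parse one <DOC> record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    categories = {
        cat.get(XML_LANG): (cat.text or "").strip() for cat in root.findall("CAT")
    }
    return {
        "docid": root.findtext("DOCID", default="").strip(),
        "date": root.findtext("DATE", default="").strip(),
        "categories": categories,
        "title": root.findtext("TITLE", default="").strip(),
        "text": root.findtext("TEXT", default="").strip(),
    }


if __name__ == "__main__":
    doc = parse_doc(SAMPLE_DOC)
    print(doc["docid"], doc["date"], doc["categories"]["en"])
```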
Questions? Thanks for your attention. Database Research Group http://ece.ut.ac.ir/dbrg 28