Using TF-IDF Weight Ranking Model in CL!NSS as Effective Similarity Measure to Identify Cases of Journalistic Text Re-use
Palkovskii Y., Belov A.
Zhytomyr State University, Institute of Foreign Philology
In affiliation with SkyLine LLC [Plagiarism Prevention Solutions]
Zhytomyr, Ukraine

Who we are / what we do
• Small, devoted group of students / professors at ZSU.
• Focused on Plagiarism Detection (PD) / Cross-Language PD.
• We develop a core text comparison engine for a number of PD-related commercial products for SkyLine LLC (http://Plagiarism-Detector.com): Plagiarism Detector Accumulator Server [PDAS] and Plagiarism Detector Client [PDC].
• We like to participate in plagiarism detection competitions (especially in hot countries) and are proud to have taken part in: PAN 09 Spain, PAN 10 Italy, PAN 11 Amsterdam, PAN 12 Italy, CL!TR 11 India (Mumbai, IIT).
© Palkovskii, Belov et al. 2012. TF-IDF Weight Ranking Model as news similarity measure. In affiliation with Zhytomyr State Uni and SkyLine LLC.

CL!NSS proposed task
• What are we looking for? The “same news event” within a pair of documents.
• Pair-wise document comparison.
• Reasonable processing time.
• Resolution issues for focal news events are not a requirement, at least at this point.
• Focus on the final result and a “starting point” prototype.

How does it work?
• Language normalization via Google Translate.
• Text preprocessing, including removal of the most frequent words (harvested beforehand from both corpora and sorted by frequency).
• Comparison of each document against the test corpus, with the retrieved data saved for further analysis.
• Each cached result for every pair is scored by a predefined set of filters.
• A top-100 list is formed by ascending score value.
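The ranking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' engine (which was written in C#): it assumes the texts are already translated into one language, uses cosine similarity over TF-IDF vectors as a stand-in for the unspecified filter set, and sorts so that the most similar candidate comes first. All function names and the smoothed IDF formula are assumptions made for this example.

```python
import math
from collections import Counter

def tokenize(text, stopwords):
    """Lowercase, split on whitespace, and drop the most frequent words."""
    return [w for w in text.lower().split() if w not in stopwords]

def tfidf_vectors(docs_tokens):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    # Smoothed IDF (an assumption): terms found in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(query, candidates, stopwords=frozenset(), top_n=100):
    """Return up to top_n (candidate_index, score) pairs, most similar first."""
    tokens = [tokenize(query, stopwords)]
    tokens += [tokenize(c, stopwords) for c in candidates]
    vecs = tfidf_vectors(tokens)
    scores = [(i, cosine(vecs[0], v)) for i, v in enumerate(vecs[1:])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

if __name__ == "__main__":
    query = "curiosity rover lands on mars"
    candidates = [
        "new bollywood film released this week",
        "nasa rover curiosity lands safely on mars",
        "cricket match ends in draw",
    ]
    for index, score in rank_candidates(query, candidates, stopwords={"on", "in", "this"}):
        print(index, round(score, 3))
```

Computing the IDF over the query plus its candidates keeps the sketch short; a real run would more likely compute document frequencies once over the whole corpus and cache them.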

Our evaluation methods via Google Images

News set about “Curiosity” landing on Mars via Google Images

Latest Bollywood newsfeeds via Google Images

In detail
• Manually crafted news pairs were inserted into both corpora and their final ranking positions evaluated.
• The news stories varied in uniqueness, ranging from news about the Curiosity landing on Mars to the latest Bollywood film news (i.e. matching the context, character and exact vocabulary of the training set).
• 10 news pairs were planted; 9 out of 10 fell into the “top 10” ranking, which supports the initial hypothesis.
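The planted-pair check reduces to counting how many planted items end up within the top k positions. A minimal sketch follows; the rank values are invented purely for illustration and are not the authors' data, and the function name is an assumption.

```python
def hits_at_k(ranks, k=10):
    """Count planted pairs whose partner document was ranked within the top k (1-based ranks)."""
    return sum(1 for rank in ranks if rank <= k)

if __name__ == "__main__":
    # Hypothetical final ranking positions for ten planted pairs (illustrative only).
    planted_ranks = [1, 1, 2, 3, 1, 5, 8, 1, 4, 37]
    hits = hits_at_k(planted_ranks, k=10)
    print(f"{hits}/{len(planted_ranks)} planted pairs in the top 10")
```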

Detailed document comparison
• PAN 2012 prototype: the “iGTC” project, based on an n-gram matching principle with 3 levels of graphically based clustering, already tuned by a GA for last year's FIRE/PAN to tackle medium-to-high degrees of obfuscation as well as translated and simulated plagiarism.
• We did not use it. The main reason: to retain a purely statistical approach based on TF-IDF values.

CL!NSS Results Achieved: Hindi / English

Rank  Run                              NDCG@1  NDCG@5  NDCG@10
1     run-1-english-hindi-palkovskii   0.3229  0.3259  0.3380
2     run-2-english-hindi-deriupm      0.2100  0.2136  0.2613
3     run-1-english-hindi-deriupm      0.1900  0.2110  0.2168
4     run-1-english-hindi-iiith        0.1939  0.1994  0.2154
5     run-3-english-hindi-deriupm      0.1500  0.1886  0.2030
6     run-3-english-hindi-iiith        0.1837  0.1557  0.1722
7     run-2-english-hindi-iiith        0.0204  0.0462  0.0512
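For readers unfamiliar with the NDCG@k columns, the metric can be sketched as follows. This assumes the common linear-gain, log2-discount form of DCG; the slides do not state which variant the CL!NSS organizers used, and the function names and the optional `ideal_gains` parameter are assumptions for this example.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (position 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k, ideal_gains=None):
    """NDCG@k: DCG of the returned ranking divided by the DCG of the best possible ranking.

    ranked_gains: relevance of each returned document, in ranked order.
    ideal_gains: relevance values of all relevant documents; defaults to ranked_gains.
    """
    pool = ranked_gains if ideal_gains is None else ideal_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

if __name__ == "__main__":
    # One relevant source document, returned at position 2.
    gains = [0, 1, 0, 0, 0]
    for k in (1, 5, 10):
        print(f"NDCG@{k} = {ndcg_at_k(gains, k):.4f}")
```

With a single relevant document per query, NDCG@1 is 1 only when that document is ranked first, while NDCG@5 and NDCG@10 still give partial credit for lower positions.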

CL!NSS Results Achieved: Gujarati / English

Rank  Run                              NDCG@1  NDCG@5  NDCG@10
1     run-1-english-hindi-palkovskii   0.0541  0.0843  0.0955

Ideas to consider:
• Different efficiency for different source types and news types / structure / origin [according to Parth Gupta's analysis of CL!NSS].
• MT substitute for Gujarati.

PAN 2011 / CLEF CLPD Baseline
• Manual: 0.37 P-det (R: .69, P: .26, G: 1)
• Automatic: 0.92 P-det (R: .97, P: .88, G: 1)
• Comparison problem: NDCG* metrics vs P-det (any ideas?)
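Assuming P-det denotes the PAN plagdet score, it combines recall (R), precision (P) and granularity (G) as the F1 measure divided by log2(1 + G). The sketch below recomputes the two baseline figures from the R, P and G values on the slide; the function name is an assumption, and small differences from the slide values are expected because the inputs are rounded.

```python
import math

def plagdet(precision, recall, granularity):
    """PAN plagdet score: F1 of precision and recall, discounted by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

if __name__ == "__main__":
    print("manual:   ", plagdet(precision=0.26, recall=0.69, granularity=1))
    print("automatic:", plagdet(precision=0.88, recall=0.97, granularity=1))
```

With G = 1 the discount is log2(2) = 1, so plagdet equals plain F1. NDCG, by contrast, scores the position of the correct source in a ranked list, which is why the two figures are not directly comparable.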

Hardware / Runtime
• Moderately computationally intensive.
• Single Intel 6-core 990 EX, 6 GB RAM (RAM-intensive usage).
• Single SSD drive.
• Total runtime of 12 hours for the test corpus (excluding the PAN 2012 comparer filter).

Software used
• Microsoft Windows 7
• Microsoft Visual Studio 2010 / C#

What we missed
• Exhaustive tuning of meta-parameters.
• A hybrid approach that uses the PAN 2012 text comparer prototype as an additional scoring mechanism (left out because of runtime limitations and the idea of sticking to a single methodology).
• Post-analysis of successful and failed detections.
• Visualization of results, in the hope of further insights.
• Our competitive colleagues from Austria, Romania, Chile, etc.!
• Layered analysis of each influencing scoring factor [ref. to PAN 2011/2012 analysis].

Things we’re happy to discuss
• Results evaluation.
• Achieved baseline in comparison to PAN results.
• The corpus size.
• Automatic evaluation platform for result processing and evaluation.
• Perspectives of machine learning.
• Hybrid approaches.
• Baseline comparison with other related tracks.

Letters are powered by people:

I would like to thank those people – thank you for your assistance and help:
• Mandar Mitra
• Parth Gupta
• Anwar Shaikh
And an additional “thank you” for getting as far as Kolkata!