Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents
Walid Magdy & Kareem Darwish
IBM Technology Development Center
PO Box 166 El-Ahram, Giza, Egypt
{wmagdy, darwishk}@eg.ibm.com

Outline:
1. Motivation
2. Background
3. Approach
4. Experimental Setup
5. Results
6. Conclusion
7. Future Work

Motivation:
• Problem: 500+ years of legacy documents
• Timeline (1400 to 2000): first printing press; e-text becomes commonplace; 1998: Arabic e-text comes online
• Searching shifts from "read to search" to automated full-text search
• Goal: to search printed documents efficiently and effectively
• Does OCR solve the problem?

Arabic Language Challenges
• Orthography
 – Character shape depends on position
 – 15 of the 28 letters contain dots
 – Optional diacritics may be present
 – Printed text may include ligatures and kashida
• Morphology
 – Prefix, infix, and suffix
 – 6 x 10^10 possible surface forms
 – Example: ﻭﺳـﻴــﻜـﺘﺒﻮﻧـﻬـﺎ = wasaya+ktub+uunahaa (and will + write + they it) = "and they will write it"
• Other factors
 – Eighth most widely spoken language in the world
 – Web growth started only recently

Arabic Pre-processing & Retrieval
• Pre-processing:
 – Remove diacritics
 – Normalize different forms of alef & ya to account for
  ∙ Common spelling errors
  ∙ Grammatical, morphological, and orthographic properties
 – Letter forms shown: ﺀ , ﺉ , ﺅ , ﺃ ، آ ، ﺇ ، ﺍ , ﺍ and ﻯ ، ﻱ ﻱ
• Text Retrieval: Best Index Terms
 – Regular text: light stemming and character 3- and 4-grams are best
 – OCR text: character 3- and 4-grams are best
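The pre-processing and n-gram indexing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact normalization table is an assumption (alef variants to bare alef, alef maqsura to ya), the hamza forms listed on the slide are left untouched because their target form is not stated, and the names `normalize` and `char_ngrams` are hypothetical.

```python
import re

# Arabic diacritic marks (fathatan .. sukun, U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumed normalization table; the slide does not spell out the targets.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
})

def normalize(text: str) -> str:
    """Strip diacritics, then fold alef and ya variants."""
    return DIACRITICS.sub("", text).translate(NORMALIZE)

def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole (a design choice for this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Indexing a normalized word with `char_ngrams(word, 3)` and `char_ngrams(word, 4)` gives the character 3- and 4-gram index terms; a single wrong character then damages only the n-grams that overlap it instead of the whole word.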

Main Idea:
• Image → OCR → degraded text: "VVorcl-Easod Comectlon l 0 r Belrieval Arahie OCR Dcgraclod Doeurnerits"
• Degraded text → Correction → corrected text: "Word-Based Correction for Retrieval of Arabic OCR Degraded Documents"
• We want to examine the effect of correction on retrieval

Approach:
OCR system → OCR degraded text → OCR correction → OCR corrected text → Indexing → Ranked list of documents

Experimental Setup:
• Test collections
• Error Correction
 – Building Error Model
 – Training & Decoding
• Experiments

Document Collections:
• ZAD
 – Printed 14th century religious book, scanned at 300 x 300 dpi and OCR'ed
 – 2,730 documents
 – 25 topics
 – Real degraded text produced by the OCR process
 – WER = 39%
• TREC 2002 CLIR
 – Arabic newswire articles from Agence France Presse (AFP)
 – 383,872 articles
 – 50 topics
 – Synthetic degraded text produced by a degradation model
 – WER = 30.8%

The ZAD Collection: Sample Document: Sample Query: ﺣﻜﻢ ﺍﻟﺘﻴﻤﻢ ﻭﻣﺘﻰ ﺷﺮﻉ

: The TREC 2002 CLIR Collection : Sample Document > <DOCNO>19940513_AFP_ARB 0001</DOCNO > </HEADER ﺍﺭﺍ0080 4 ﻉ 0177 ﻗﺒﺮﺹ /ﺍﻓﺐ-ﺗﺼﺞ 68 ﺍﻟﺸﺮﻕ ﺍﻻﻭﺳﻂ/ﺳﻼﻡ/ﺣﻜﻢ ﺫﺍﺗﻲ > <HEADER > <BODY > </HEADLINE ﺍﻟﻌﻠﻢ ﺍﻟﻔﻠﺴﻄﻴﻨﻲ ﻟﻢ ﺭﻓﻊ ﻓﻮﻕ ﻛﻨﻴﺲ ﺍﺭﻳﺤﺎ ; <HEADLINE> &HT > <TEXT ﺍﺭﻳﺤﺎ )ﺍﻟﻀﻔﺔ ﺍﻟﻐﺮﺑﻴﺔ( 13 -5 )ﺍﻑ ﺏ(- ﻳﻘﻮﻡ ﺍﺣﺪ ﻋﻨﺎﺻﺮ ﺍﻟﺸﺮﻃﺔ ﺍﻟﻔﻠﺴﻄﻴﻨﻴﺔ ﺑﺤﺮﺍﺳﺔ ﻣﺪﺧﻞ ﺍﻟﻜﻨﻴﺲ ﺍﻟﻴﻬﻮﺩﻱ ﻓﻲ ﻭﺳﻂ > <P > </P ﺍﺭﻳﺤﺎ ﺍﺣﺪ آﺨﺮ ﻣﻮﺍﻗﻊ ﺍﻟﻤﺪﻳﻨﺔ ﺍﻟﺘﻲ ﺗﻢ ﺗﺴﻠﻴﻤﻬﺎ ﺍﻟﻰ ﺍﻟﺸﺮﻃﺔ ﺍﻟﻔﻠﺴﻄﻴﻨﻴﺔ ﺍﻻ ﺍﻧﻪ ﻟﻢ ﻳﺘﻢ ﺭﻓﻊ ﺍﻟﻌﻠﻢ ﺍﻟﻔﻠﺴﻄﻴﻨﻲ ﻓﻮﻕ ﺍﻟﻜﻨﻴﺲ > </P ﻭﻗﺎﻝ ﺿﺎﺑﻂ ﻓﻠﺴﻄﻴﻨﻲ ﻟﻔﻠﺴﻄﻴﻨﻴﺔ ﻛﺎﻧﺖ ﺗﺤﺎﻭﻝ ﺭﻓﻊ ﺍﻟﻌﻠﻢ ﺍﻟﻔﻠﺴﻄﻴﻨﻲ ﻓﻮﻕ ﺍﻟﻜﻨﻴﺲ "ﻫﺬﺍ ﻣﻜﺎﻥ ﻣﻘﺪﺱ" > <P ﻭﻗﺒﻴﻞ ﺫﻟﻚ ﺍﻗﺘﺮﺏ ﺛﻼﺛﺔ ﻣﺴﺘﻮﻃﻨﻴﻦ ﻳﻬﻮﺩ ﻣﻦ ﻣﺪﺧﻞ ﺍﻟﻜﻨﻴﺲ ﺍﻟﺬﻱ ﻛﺎﻥ ﺍﻟﺠﻨﻮﺩ ﺍﻻﺳﺮﺍﺋﻴﻠﻴﻮﻥ ﻣﺎ ﺯﺍﻟﻮﺍ ﻳﻮﺀﻣﻨﻮﻥ ﺣﺮﺍﺳﺘﻪ > <P > </P ﻭﻋﻨﺪﻣﺎ ﻣﻨﻌﻬﻢ ﺍﻟﺠﻨﻮﺩ ﻣﻦ ﺍﻟﺪﺧﻮﻝ ﻗﺎﻣﻮﺍ ﺑﺘﻤﺰﻳﻖ ﺛﻴﺎﺑﻬﻢ > </TEXT : Sample Query ﺳﺠﻨﺎﺀ ﺣﺮﺏ ﺍﻳﺮﺍﻧﻴﻴﻦ ﻭﻋﺮﺍﻗﻴﻴﻦ

OCR-Correction Model:
• Training: OCR degraded text + manually corrected OCR text → align characters (mapping) → build error model
• Decoding: OCR degraded text → generate corrections → pick the most likely correction using Bayes' rule → OCR corrected text

Aligning Characters Mapping:
• 1:1 mapping. Ex: walid → vvaicl
 – w → v (S), Null → v (I), a → a (√), l → Null (D), i → i (√), d → c (S), Null → l (I)
• m:n mapping. Ex: walid → vvaicl
 – w → vv, a → a, l → Null, i → i, d → cl
S = substitution, I = insertion, D = deletion, √ = match
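A 1:1 alignment of this kind can be obtained with a standard edit-distance (Levenshtein) backtrace. The sketch below is an illustration under that assumption; the slides do not say which alignment algorithm was used, and the function name `align` is hypothetical. An empty string stands for Null.

```python
def align(correct: str, ocr: str) -> list[tuple[str, str]]:
    """Align two strings character by character with minimum edit cost.

    Returns (correct_char, ocr_char) pairs; "" marks Null on either side.
    """
    n, m = len(correct), len(ocr)
    # dist[i][j] = edit distance between correct[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])):
            pairs.append((correct[i - 1], ocr[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((correct[i - 1], ""))          # deletion
            i -= 1
        else:
            pairs.append(("", ocr[j - 1]))              # insertion
            j -= 1
    pairs.reverse()
    return pairs
```

Several alignments can share the minimum cost, so `align("walid", "vvaicl")` may return a different but equally cheap path than the one on the slide. Merging adjacent pairs (for example w → v and Null → v into w → vv) is one way to move from 1:1 to m:n mappings.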

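Building Error Model: the probability of mapping a correct character segment C_k..C_l to an OCR character segment D_x..D_y is estimated from the aligned training data, where C_k..C_l and D_x..D_y are each one or more characters.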
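One common way to estimate such segment mapping probabilities is by relative frequency over the aligned training pairs. The sketch below assumes that estimator, since the slide's formula did not survive extraction; `build_error_model` is a hypothetical name, and the input is a list of (correct segment, OCR segment) pairs produced by the alignment step.

```python
from collections import Counter, defaultdict

def build_error_model(aligned_pairs):
    """Estimate P(ocr_segment | correct_segment) by relative frequency.

    aligned_pairs: iterable of (correct_segment, ocr_segment) tuples.
    Returns a nested dict: model[correct_segment][ocr_segment] = probability.
    """
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(c for c, _ in aligned_pairs)
    model = defaultdict(dict)
    for (c, d), count in pair_counts.items():
        # Assumed estimator: count(C -> D) / count(C).
        model[c][d] = count / source_counts[c]
    return dict(model)
```

For instance, if "d" is aligned to "d" eight times and to "cl" twice, the model assigns 0.8 and 0.2 to those two mappings.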

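Decoding:
• Bayes' rule (word-level model):
 $$\hat{W} = \arg\max_{Word_{correct}} P(Word_{correct} \mid Word_{OCR}) = \arg\max_{Word_{correct}} P(Word_{OCR} \mid Word_{correct}) \, P(Word_{correct})$$
• Character-level model: $P(Word_{OCR} \mid Word_{correct})$ is the product, over the character segments of the word, of the error-model probabilities of mapping each segment C_k..C_l to D_x..D_y
• $P(Word_{correct})$ = LM probability (a simple unigram probability was used)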

Example:
Character Level Model: 1. Segmentation 2. Mapping 3. Generate Candidates
Ex: dairn
1. Segmentation: d a i r n | da i r n | d ai r n | dai r n | d a i rn | da i rn | d ai rn | dai rn | d a ir n | da ir n | d air n | dair n | d a irn | da irn | d airn | dairn
2. Mapping for the segmentation d a i rn (ε is the empty segment):
 – d → d 0.8, h 0.1, cl 0.08, ε 0.02
 – a → a 0.9, o 0.05, r 0.02, oi 0.015, n 0.005, e 0.005, ε 0.005
 – i → i 0.84, l 0.12, t 0.015, ll 0.005, ε 0.02
 – rn → rn 0.7, m 0.15, im 0.02, ln 0.015, ε 0.005; further entries: l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005
3. Generated candidates: dairn 0.425, daim 0.091, claim 0.0091, aim 0.00227, horn 0.00007
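The three steps (segmentation, mapping, candidate generation) can be sketched as below. This is an illustration only: `TABLE` holds a hand-picked subset of the mapping values from this example, keeping the best score when a candidate is reached by several paths is an assumption, and the names `segmentations` and `candidates` are hypothetical.

```python
from itertools import product
from math import prod

def segmentations(word):
    """Yield every split of word into contiguous non-empty segments."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        for rest in segmentations(word[end:]):
            yield [word[:end]] + rest

def candidates(ocr_word, table):
    """Map each OCR segment to its possible corrections and score the results.

    table[ocr_segment] = {replacement: probability}; "" is the empty segment.
    Segmentations containing a segment missing from the table are skipped.
    """
    best = {}
    for segs in segmentations(ocr_word):
        if not all(s in table for s in segs):
            continue
        for combo in product(*(table[s].items() for s in segs)):
            cand = "".join(rep for rep, _ in combo)
            score = prod(p for _, p in combo)
            best[cand] = max(best.get(cand, 0.0), score)  # assumed: keep best path
    return best

# Subset of the example's mapping values, for illustration.
TABLE = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05},
    "i": {"i": 0.84, "l": 0.12},
    "rn": {"rn": 0.7, "m": 0.15},
}
```

With this table, `candidates("dairn", TABLE)` scores "daim" at 0.8 x 0.9 x 0.84 x 0.15, about 0.091, and "claim" at 0.08 x 0.9 x 0.84 x 0.15, about 0.0091, in line with the figures in the example.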

Example (cont):
Word Level Model: find the frequency of occurrence of each generated word in the dictionary
 – P(dairn | dairn) = 0.425, Freq(dairn) = 0
 – P(daim | dairn) = 0.091, Freq(daim) = 0
 – P(claim | dairn) = 0.0091, Freq(claim) = 1500
 – P(aim | dairn) = 0.00227, Freq(aim) = 4000
 – P(horn | dairn) = 0.00007, Freq(horn) = 150
Result: dairn → claim
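The final ranking in this example can be reproduced by multiplying each character-level probability by the word's dictionary frequency. A minimal sketch, assuming raw frequency stands in for the unigram probability (dividing by the corpus size would not change the ranking); `best_corrections` is a hypothetical name.

```python
def best_corrections(char_probs, freqs, k=1):
    """Rank candidates by character-level probability times word frequency.

    char_probs: {candidate: character-level probability}
    freqs: {word: frequency in the dictionary}; unseen words count as 0.
    Returns up to k candidates with a non-zero score, best first.
    """
    scored = {w: p * freqs.get(w, 0) for w, p in char_probs.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return [w for w in ranked if scored[w] > 0][:k]

# Values taken from the example above.
char_probs = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091,
              "aim": 0.00227, "horn": 0.00007}
freqs = {"claim": 1500, "aim": 4000, "horn": 150}
```

Here "claim" scores 0.0091 x 1500 = 13.65 and "aim" scores 0.00227 x 4000 = 9.08, so `best_corrections(char_probs, freqs)` picks "claim". Passing a larger `k` returns the top few corrections for a word.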

IR Experiments
• Degraded collections were corrected; the best one, two, three, and five corrections for each word were picked to be indexed
• The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words
• Retrieval performance was tested for all combinations of index type and number of corrections
• Measure of merit is Mean Average Precision
• Significance testing was done using a t-test with a p-value threshold of 0.05
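For reference, Mean Average Precision can be computed as in the sketch below. It uses the standard (uninterpolated) definition; the evaluation tool actually used in these experiments is not stated, and the function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids), one entry per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a topic with relevant documents at ranks 1 and 3, average precision is (1/1 + 2/3) / 2, about 0.83. Comparing two indexing setups then amounts to a paired test over the per-topic average precision values.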

Correction Results: ZAD Collection TREC Collection

IR Results: “ZAD Collection” : Clean Bad

IR Results: “TREC Collection” : Clean Bad

Conclusion & future work:
• Conclusion
 – Although WER was halved, IR effectiveness did not improve by a statistically significant margin
 – Using more than one correction does not help
 – Indexing using n-grams (shorter index terms) is better than "moderate" error correction
• Future work
 – Effect of using an n-gram word LM on error correction (Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006)
 – Effect of "good" error correction on retrieval effectiveness

Lnanh gon → Correction → Thank you