Generation of Synthetic Datasets for Performance Evaluation of


Number of slides: 15

Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR
Mathieu Delalandre, CVC, Barcelona, Spain
DAG Meeting, CVC, Barcelona, Spain
Wednesday 19th of November 2008

Introduction
• Text/graphics documents are used in a variety of fields such as geography, engineering, social sciences, …
  Some examples: architectural drawing, utility map, geographic map.
• A huge amount of data exists, from two main sources:
  - digitized documents (modern and old)
  - web images

Introduction
• OCR of text/graphics documents: a character recognition system working with text/graphics documents
  # First related work [Brown’1979]
  # More than 50 references on this topic today: [Fletcher’1988] [Zenzo’1992] [Goto’1999] [Adam’2000] …
• Processing chain (figure):
  full image -> text/graphics separation -> image of text-lines -> text-line detection -> images of single text-line -> character segmentation -> images of single character -> character recognition -> ASCII
• Problematics
  General to any documents:
  - letter segmentation
  - multi-font
  - scale variation
  Specific to text/graphics documents:
  - text/graphics separation
  - rotation variation
  - text-line detection
  - no reading order
  - no dictionary

Introduction
• About performance evaluation
  Workflow (figure): the documents are processed by the system to give results; the same documents go through groundtruthing to give the groundtruth; results and groundtruth are compared in a characterisation step, which yields the performance evaluation.
• The case of general OCR [Kanungo’1999]
  - More than 40 references on the topic [Kanungo’1999]
  - Several standard databases exist (NIST, MARS, CD-ROM English, …)
  - Annual evaluation reports [Rice’1992] [Rice’1993]
• The case of text/graphics document OCR [Wenyin’1997]
  - Only 1 reference on the topic
  - No standard databases
  - No complete evaluation carried out in 20 years of research
• Black-box evaluation: the evaluation considers the OCR system as an indivisible unit and evaluates it from its final results (i.e. OCR output vs. ASCII transcription of the text, using string edit distances).
• White-box evaluation: the evaluation aims to characterize the performance of individual sub-modules of the OCR system (skewing, letter segmentation, block identification, character recognition, etc.).
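The black-box comparison above relies on a string edit distance between the OCR output and the ASCII transcription. The slides do not say which distance is used, so the following is only a minimal sketch assuming the classic Levenshtein distance; the function names `edit_distance` and `character_accuracy` are hypothetical, not taken from the talk.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    # previous[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed score: 1 - distance / len(reference), clipped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))
```

For instance, comparing the transcription "Hello World" with an OCR output "Hel1o Word" costs one substitution and one deletion.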

Introduction
• Scope of the proposed work: performance evaluation of text/graphics document OCR
  # white-box evaluation
  # groundtruthing step
  # datasets for text-line detection and character recognition
  # generation algorithms are “simple”; the main purpose of the talk is the setting
  Figure: contributions located on a grid of evaluation steps (groundtruthing, characterization) against OCR stages (text/graphics separation, text-line detection, character segmentation, character recognition).

Plan
1. Groundtruth definition
2. Datasets for character recognition
3. Datasets for text-line detection
4. In progress datasets

Groundtruth definition
- Character level
  • ASCII code
  • font (name, size, style)
  • location point
  • orientated bounding box
  • orientation (ϴ)
  • scale
- Text level
  • first location point
  • groundtruth of characters
  • word/char positions, e.g.
      char     H e l l o W o r l d
      p word   0 0 0 … 1 1 1
      p char   0 1 2 3 4 …
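The groundtruth fields listed above map naturally onto a small record type. The sketch below is an illustration only: the class and field names are hypothetical, and it assumes that the character position restarts at 0 inside each word, which the slide suggests but does not state.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class CharGroundtruth:
    """Character-level groundtruth (field names are assumptions)."""
    ascii_code: int
    font_name: str
    font_size: int
    font_style: str
    location: Point
    bounding_box: Tuple[Point, Point, Point, Point]  # four corners of the oriented box
    orientation: float  # radians
    scale: float
    word_position: int
    char_position: int


@dataclass
class TextGroundtruth:
    """Text-level groundtruth: first location point plus its characters."""
    first_location: Point
    characters: List[CharGroundtruth] = field(default_factory=list)


def index_positions(text: str) -> List[Tuple[str, int, int]]:
    """Return (character, word position, character position) triples,
    skipping spaces; the character position restarts in each word (assumed)."""
    triples = []
    for word_position, word in enumerate(text.split()):
        for char_position, char in enumerate(word):
            triples.append((char, word_position, char_position))
    return triples
```

Calling `index_positions("Hello World")` gives word position 0 for the letters of "Hello" and 1 for those of "World".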

Datasets for character recognition (1/2)
• Issues
  - How to generate single character images?
  - How many classes?
  - Which image resolution?
  - Which size for the datasets?
  - Which fonts?
  - Etc.
• Main conclusions
  (1) The real sizes of characters can only be estimated.
  (2) The confusion problem (e.g. 6 vs 9) is still not well defined; the 62-class problem (a-z, A-Z, 0-9) is the main goal.
  (3) It is not possible to fix a standard size for the training/test sets, as this information is still not well defined; several thousand images are mandatory for the training.
  (4) The impact of fonts has been little studied and should be taken into account in the evaluation.
• Published experiments (slide columns: image size, class, size, learning, font(s), rotation, scaling)
  Brown’1981       68²          ?? / 20   10 000   ×   ×   yes
  Zenzo’92         ?? / 72      62 000    ×   ×   yes
  Takahashi’1992   24²          ?? / 6    50%   10 400   ×   yes
  Adam’2000        28²          51/15     33%   62 000   ×   yes
  Chen’2003        16² - 512²   26/1      14%   26 000   1   no
  Choisy’2004      28²          51/15     80%   62 000   ×   yes
  Hase’2004        32²          ?? / 3    33%   26 000   3   yes
  Pal’2006         13² -        40/2      yes   2 18   80%   yes   no

Datasets for character recognition (2/2)
• Generation setting
  letter class     62             a-z; A-Z; 0-9
  font class       30 fonts       http://www.codestyle.org/ with lower and upper case, no cursive
  basic fonts      3              times, courier, arial
  character size   32² pixels     max dxdy of font symbols
  dataset size     5 000 / font   62 classes; 40 samples/class; 50%/50%
  training         free           ranked files allow a training specification: 20% training on [file-4001 – file-5000]
• Generation algorithm: font manager, centering, scale and rotation processes
  character scaling    1.0 to 2.0 with a gap of 1/1000
  character rotation   0 to 2×π with a gap of π/500
• Datasets
  Geometry invariance
  tests   scaling   rotation   font(s)/test   fonts   images
  3       no        no         1              3       15 000
  3       yes       no         1              3       15 000
  3       no        yes        1              3       15 000
  3       yes       yes        1              3       15 000

  Font adequacy
  tests   scaling   rotation   font(s)/test   fonts   images
  30      yes       yes        1              30      150 000

  Font scalability
  tests   scaling   rotation   font(s)/test   fonts   images
  4       yes       yes        3; 6; 9; 12    12      15 000 + 30 000 + 45 000 + 60 000 = 150 000
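The scaling and rotation settings above define two quantized parameter grids. The sketch below shows one way such grids could be enumerated and sampled; it is not the generator used for the datasets. Uniform random sampling, the fixed seed, the exclusion of the 2π endpoint (it duplicates 0) and all function names are assumptions.

```python
import math
import random

SCALE_STEP = 1 / 1000          # "gap of 1/1000" from the setting
ROTATION_STEP = math.pi / 500  # "gap of π/500" from the setting


def scale_grid(low: float = 1.0, high: float = 2.0, step: float = SCALE_STEP):
    """All scale factors from low to high inclusive, spaced by step."""
    count = round((high - low) / step)
    return [low + k * step for k in range(count + 1)]


def rotation_grid(step: float = ROTATION_STEP):
    """All angles in [0, 2π), spaced by step; 2π is left out (assumption)."""
    count = round(2 * math.pi / step)
    return [k * step for k in range(count)]


def sample_parameters(n_images: int, scaling: bool, rotation: bool, seed: int = 0):
    """Draw one (scale, angle) pair per image; a disabled transformation
    keeps its neutral value (scale 1.0, angle 0.0)."""
    rng = random.Random(seed)
    scales = scale_grid() if scaling else [1.0]
    angles = rotation_grid() if rotation else [0.0]
    return [(rng.choice(scales), rng.choice(angles)) for _ in range(n_images)]
```

With these grids, a "scaling yes, rotation no" test on one font would call `sample_parameters(5000, scaling=True, rotation=False)`.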

Datasets for text-line detection (1/2)
• Issues
  - How to generate single character images?
  - How many words per image?
  - Which image size?
  - Which size for the datasets?
  - How many fonts?
  - Etc.
• Main conclusions
  (1) The use-cases are heterogeneous; the sizes and resolutions of the images are rarely provided, so the text density is difficult to estimate; images with significant text content are preferred.
  (2) Depending on the use-case, not all the methods work on curved text; a combination of curved and straight text is necessary.
  (3) All the methods use context to extract the text-line (i.e. font type, character size, line…
• Published experiments
  reference          use-case            images   text-lines   curved   font/img   scaling
  Roy’2008           geographic map      ??       5 000        yes      many       yes
  Pal’2004           artistic document   ??       1 521        yes      many       yes
  Loo’2002           poster, newspaper   2        118          yes      many       yes
  Park’2001          poster, publicity   30       1 265        yes      many       yes
  Goto’1999          Japanese form       170      9 831        yes      many       yes
  Tan’1998           map                 8        96           no       many       yes
  He’1996            drawing             1        16           no       many       yes
  Burge’1995         cadastral map       4        150          no       many       yes
  Deseilligny’1995   cadastral map       3        1 250        no       many       yes

Datasets for text-line detection (2/2)
• Setting
  dictionary       422 text-lines   countries and capitals
  font class       30 fonts         http://www.codestyle.org/ with lower and upper case, no cursive
  character size   32² pixels       max dxdy of font symbols
  image size       640²             10-50 text-lines per image
  dataset size     100 images
• Generation algorithm
  text scaling    1.0 to 1.5 with a gap of 1/1000
  text rotation   -π/2 to +π/2 with a gap of π/500
  The insert algorithm (figure: bounding boxes B1, B2; lines l1, l2, l3; offsets dx, dy; distance d; angles a, θ)
• Datasets
  Text-line density
  test   text-line/img   scaling   curved   font(s)/test   words
  1      low             yes       no       3              in progress
  1      medium          yes       no       3              in progress
  1      high            yes       no       3              in progress

  Font context
  test   text-line/img   scaling   curved   font(s)/test   words
  1      medium          no        no       9              in progress
  1      medium          no        no       6              in progress
  1      medium          no        no       3              in progress

  Size context
  test   text-line/img   scaling   curved   font(s)/test   words
  1      medium          no        no       1              in progress
  1      medium          yes       no       1              in progress
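The insert algorithm itself survives only as figure labels, so its details are not recoverable here. As a stand-in, the sketch below places rotated text-line boxes in a 640 × 640 image by plain rejection sampling: draw an angle on the π/500 grid, draw a position, and reject the candidate if its axis-aligned extent overlaps an already placed one. This is a swapped-in technique under stated assumptions, not the author's method; every name and the `max_tries` limit are hypothetical.

```python
import math
import random

IMAGE_SIZE = 640  # image side in pixels, from the setting


def rotated_extent(width: float, height: float, theta: float):
    """Width and height of the axis-aligned box enclosing a
    width x height rectangle rotated by theta."""
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))
    return width * cos_t + height * sin_t, width * sin_t + height * cos_t


def overlaps(a, b) -> bool:
    """True if boxes (x_min, y_min, x_max, y_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_text_lines(sizes, seed: int = 0, max_tries: int = 200):
    """Try to place each (width, height) text-line; return the list of
    (theta, box) pairs that found a free spot."""
    rng = random.Random(seed)
    placed = []
    for width, height in sizes:
        for _ in range(max_tries):
            # Angle in [-π/2, +π/2] on a grid of step π/500.
            theta = rng.randrange(-250, 251) * math.pi / 500
            ext_w, ext_h = rotated_extent(width, height, theta)
            if ext_w > IMAGE_SIZE or ext_h > IMAGE_SIZE:
                continue  # cannot fit at this angle
            x = rng.uniform(0, IMAGE_SIZE - ext_w)
            y = rng.uniform(0, IMAGE_SIZE - ext_h)
            box = (x, y, x + ext_w, y + ext_h)
            if not any(overlaps(box, other) for _, other in placed):
                placed.append((theta, box))
                break
    return placed
```

Using the enclosing axis-aligned box is conservative: it wastes some space around strongly rotated lines but guarantees that accepted text-lines never touch.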

In progress datasets

Conclusions
# work in progress …
# character recognition datasets are ready
# bags of words are still being packaged, but will be ready soon

Perspectives
# middle term: experiments with standard feature extraction methods [Roy’2008] [Valveny’2007]
# long term: experiments with bags of words and text/graphics documents [Delalandre’2007] [Wenyin’1997]

References (1/2)
1. R. Brown; M. Lybanon & L. K. Gronmeyer. Recognition of Handprinted Characters for Automated Cartography: A Progress Report. Proceedings of the SPIE, vol (205), 1979.
2. L. A. Fletcher & R. Kasturi. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol (10), pp. 910-918, 1988.
3. S. D. Zenzo; M. D. Buno; M. Meucci & A. Spirito. Optical recognition of hand-printed characters of any size, position, and orientation. IBM Journal of Research and Development, vol (36), pp. 487-501, 1992.
4. H. Goto & H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol (2), pp. 111-119, 1999.
5. S. Adam; J. M. Ogier; C. Cariou; R. Mullot; J. Labiche & J. Gardes. Symbol and Character Recognition: Application to Engineering Drawings. International Journal on Document Analysis and Recognition (IJDAR), vol (3), pp. 89-101, 2000.
6. T. Kanungo; G. A. Marton & O. Bulbul. Performance evaluation of two Arabic OCR products. Workshop on Advances in Computer-Assisted Recognition (AIPR), SPIE Proceedings, vol (3584), pp. 76-83, 1999.
7. S. V. Rice; J. Kanai & T. A. Nartker. A Report on the Accuracy of OCR Devices. Information Science Research Institute, University of Nevada, USA, 1992.
8. S. V. Rice; J. Kanai & T. A. Nartker. An Evaluation of OCR Accuracy. Information Science Research Institute, University of Nevada, USA, 1993.
9. L. Wenyin & D. Dori. A Protocol for Performance Evaluation of Line Detection Algorithms. Machine Vision and Applications, vol (9), pp. 240-250, 1997.
10. R. M. Brown. Handprinted Symbol Recognition System: A Very High Performance Approach To Pattern Analysis Of Free-form Symbols. Conference Southeastcon, pp. 5-8, 1981.
11. H. Takahashi. Neural network architectures for rotated character recognition. International Conference on Pattern Recognition (ICPR), pp. 623-626, 1992.
12. Q. Chen. Evaluation of OCR algorithms for images with different spatial resolutions and noises. School of Information Technology and Engineering, University of Ottawa, Canada, 2003.

References (2/2)
14. H. Hase; T. Shinokawa; S. Tokai & C. Y. Suen. A robust method of recognizing multi-font rotated characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 363-366, 2004.
15. U. Pal; F. Kimura; K. Roy & T. Pal. Recognition of English Multi-oriented Characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 873-876, 2006.
16. P. P. Roy; U. Pal & J. Llados. Multi-oriented character recognition from graphical documents. International Conference on Cognition and Recognition (ICCR), pp. 30-35, 2008.
17. U. Pal & P. P. Roy. Multi-oriented and curved text lines extraction from Indian documents. IEEE Transactions on Systems, Man and Cybernetics - Part B, vol (34), pp. 1676-1684, 2004.
18. P. K. Loo & C. L. Tan. Word and Sentence Extraction Using Irregular Pyramid. Workshop on Document Analysis System (DAS), Lecture Notes in Computer Science (LNCS), vol (2423), pp. 307-318, 2002.
19. H. C. Park; S. Y. Ok; Y. J. Yu & H. G. Cho. Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model. International Journal on Document Analysis and Recognition (IJDAR), vol (4), pp. 115-130, 2001.
20. H. Goto & H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol (2), pp. 111-119, 1999.
21. C. L. Tan & P. O. Ng. Text extraction using pyramid. Pattern Recognition (PR), vol (31), pp. 63-72, 1998.
22. S. He; N. Abe & C. L. Tan. A clustering-based approach to the separation of text strings from mixed text/graphics documents. International Conference on Pattern Recognition (ICPR), pp. 706-710, 1996.
23. M. Burge & G. Monagan. Extracting Words and Multi Part Symbols in Graphics Rich Documents. International Conference on Image Analysis and Processing (ICIAP), 1995.
24. M. Deseilligny; H. Le Men & G. Stamon. Characters string recognition on maps, a method for high level reconstruction. International Conference on Document Analysis and Recognition (ICDAR), pp. 249-252, 1995.
25. E. Valveny; S. Tabbone; O. Ramos & E. Philippot. Performance Characterization of Shape Descriptors for Symbol Representation. Workshop on Graphics Recognition (GREC), 2007.
26. M. Delalandre; T. Pridmore; E. Valveny; E. Trupin & H. Locteau. Building Synthetic Graphical Documents for