
From Web Documents to Old Books: Works in Progress in Graphics Recognition
Mathieu Delalandre
Meeting of the Document Analysis Group, Computer Vision Center, Barcelona, Spain
Thursday 23rd November 2006

Plan • Short CV • Vector Graphics Indexing and Retrieval • Dropcap Image Retrieval

Short CV
Personal information: Mathieu Delalandre, 32 years old
Academic degrees (1995 - 1998 - 2001):
• Lic. Sc. in Electronics, Rouen University, France
• M. Sc. in Industrial Computing, Rouen University, France
Research periods:
• Lengths: 6 months, 3 ½ years, 5 months, 13 months, 2 months, 3 years
• Positions: Master, Ph. D, Post-doc, Contract, Post-doc
• Laboratories and subjects: LITIS (symbol recognition), LITIS (drawing understanding), SCSIT (vector graphics indexing), L3i (dropcap image retrieval), LITIS (performance evaluation), CVC (…)
Laboratory locations: SCSIT (Nottingham), LITIS (Rouen), L3i (La Rochelle), CVC (Barcelona)

Plan • Short CV • Vector Graphics Indexing and Retrieval • Dropcap Image Retrieval

Vector Graphics Indexing and Retrieval
What are vector graphics?
<rect x="400" y="100" width="400" height="200" fill="yellow" stroke="navy" stroke-width="10" />
Bitmap vs vector graphics: vector graphics are more accurate and lighter.
Known vector graphics formats:
• AI (Adobe Illustrator)
• SVG (Scalable Vector Graphics)
• WMF (Windows Metafile)
• EPS (Encapsulated PostScript)
• DXF (AutoCAD)
• ClipArt
[Figure: examples of a plane (EPS), a cheese (ClipArt) and a pen (WMF)]
Application of vector graphics:
• 1982: Computer Aided Design (DXF 1982)
• 1985: Office software (PS 1985, CGM 1987, WMF 1993)
• 1996: Web (PNG 1996, SVG 2001, …)
Vector graphics are growing on the Web:
• 2001: SVG 1.0
• 2004: SVG widely used for structured documents [Mong'03], geographic maps [Chen'04], technical drawings [Kang'04]
• 2005: Powerful editors (Inkscape, WebDraw, …)
• 2006: Internet Explorer and Mozilla Firefox support SVG

Vector Graphics Indexing and Retrieval
System overview [Doer'98] [Tom'03]
• The system looks like a pattern recognition approach: documents (Doc 1, Doc 2, Doc 3) -> features extraction -> index -> retrieval.
Our key ideas
• Content adaptation: the indexing process must be adapted to the document content (graphics objects are matched against Model 1, Model 2, Model 3 to give indexed objects).
• Structured index: we can improve results by structuring the index.
[Figure: structured index with pattern frequencies over Level 1, Level 2 and Level 3; ranked patterns: square, junction, adjacency, line, inclusion]

Vector Graphics Indexing and Retrieval
Our approach
Before retrieval, we need to extract features.
What are the difficulties?
• How to get R2?
<rect x="400" y="100" width="400" height="200" fill="blue" />
<rect x="650" y="200" width="400" height="200" fill="yellow" />
Parsing and break-up turn the file into a set of objects (R1, R2, R3).
• We need a break-up: filtering, then junction detection, turns the set of lines into a set of broken lines.
• You see 5, you have 9: we need a cleanup.
• How to speed up the process? By sorting the bounding boxes.
[Figure: two bounding boxes with coordinates x11, x12, x21, x22 and y12, y21, y22]
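The bounding-box sorting mentioned on this slide can be illustrated with a small sweep over boxes sorted by their left edge. This is only a sketch under assumptions: the slide does not give the actual algorithm, and the names `rect_boxes` and `overlapping_pairs` are hypothetical. It parses the two `<rect>` elements of the slide and reports which boxes overlap, so that junction detection only needs to look at those pairs.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# The two rectangles of the slide, wrapped in an <svg> root (wrapper is an assumption).
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect x="400" y="100" width="400" height="200" fill="blue"/>
  <rect x="650" y="200" width="400" height="200" fill="yellow"/>
</svg>"""


def rect_boxes(svg_text):
    """Return (xmin, ymin, xmax, ymax) for every <rect> in the SVG text."""
    root = ET.fromstring(svg_text)
    boxes = []
    for el in root.iter(SVG_NS + "rect"):
        x, y = float(el.get("x", 0)), float(el.get("y", 0))
        w, h = float(el.get("width")), float(el.get("height"))
        boxes.append((x, y, x + w, y + h))
    return boxes


def overlapping_pairs(boxes):
    """Sweep over boxes sorted by xmin; test y-overlap only for x-overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    active, pairs = [], []
    for i in order:
        xmin, ymin, xmax, ymax = boxes[i]
        # Drop boxes that end before the current one starts.
        active = [j for j in active if boxes[j][2] >= xmin]
        for j in active:
            if boxes[j][1] <= ymax and ymin <= boxes[j][3]:
                pairs.append((min(i, j), max(i, j)))
        active.append(i)
    return sorted(pairs)


boxes = rect_boxes(SVG)
pairs = overlapping_pairs(boxes)
```

Here `pairs` contains the single pair `(0, 1)`: the blue and yellow rectangles overlap, which is the region R2 the slide asks about.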

Vector Graphics Indexing and Retrieval
Our approach (next)
Line graph building:
• Polyline: while the edge is 2-connected
• Junction: if the node is 3-connected
• Polygon [Wen'01]: from a starting vector, take the nearest vector each time
Region detection (adjacency and inclusion):
• Adjacency: polygons with a common vector
• Inclusion: included bounding box, line gravity center
Working on graphs takes time, so adjacency and inclusion are computed from the vectorial data.
[Figure: result example with polygons 1, 2, 3; processing times on the 'Mikado' database]
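A minimal sketch of the two region relations named on this slide, assuming polygons are given as lists of (x, y) vertices. The function names are hypothetical. The inclusion test uses bounding boxes only, which is a necessary condition rather than a full test for non-rectangular shapes; the slide also mentions a gravity-center check that is not reproduced here.

```python
def edges(polygon):
    """Undirected edges of a closed polygon given as a list of (x, y) points."""
    n = len(polygon)
    return {frozenset((polygon[i], polygon[(i + 1) % n])) for i in range(n)}


def bbox(polygon):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a polygon."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)


def adjacent(p, q):
    """Two polygons are adjacent when they share at least one vector (edge)."""
    return bool(edges(p) & edges(q))


def included(inner, outer):
    """Bounding box of `inner` lies inside the bounding box of `outer`."""
    ix0, iy0, ix1, iy1 = bbox(inner)
    ox0, oy0, ox1, oy1 = bbox(outer)
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


a = [(0, 0), (2, 0), (2, 2), (0, 2)]
b = [(2, 0), (4, 0), (4, 2), (2, 2)]          # shares the edge (2,0)-(2,2) with a
c = [(0.5, 0.5), (1, 0.5), (1, 1), (0.5, 1)]  # small square inside a
```

With these shapes, `adjacent(a, b)` and `included(c, a)` hold, while `adjacent(a, c)` and `included(a, c)` do not.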

Vector Graphics Indexing and Retrieval
Should we work on the retrieval engine now? How would we evaluate the retrieval results afterwards? We must work on performance evaluation first.
[Figure: Doc 1, Doc 2, Doc 3 -> features extraction -> retrieval; ground truths GT 1, GT 2, GT 3 -> performance evaluation]
How to get the ground truth? Producing ground truth from existing documents takes time, so we must produce synthetic documents.
Synthetic document production
• Producing true-to-life documents needs much knowledge, which is hard to do with a computer.
• Our key idea: we can produce 'crazy' but well-formed documents, which is sufficient for performance evaluation purposes.
[Figure: production rules and a 'crazy' but well-formed drawing; connection constraints labelled 0-n, 1-connected, 2-connected]

Vector Graphics Indexing and Retrieval
Rules
• Noise rules (low-level primitives): to scale a line, to break a line, to move a line, …
• Domain rules (graphical objects): must be connected, must be adjacent, must be included, can include, …
• General rules: object number, document size, object choice (probability distribution, rotation and scale range, position constraints, overlapped or not), …
Production process
(1) Insert a new object while the object number is not reached
(2) Move other objects if (1) cannot be done
(3) Exit if (1) and (2) cannot be done, then run (4) and (5)
(4) Move objects according to the domain rules
(5) Delete the oldest isolated objects ('cycle number')
(6) Add noise on the low-level primitives composing the objects (in progress)
Output: vector graphics and ground truth
[Figure: stages I, II, III of the loop; examples: rotate and scale, rotate and distort, scale and overlap]
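As an illustration of step (1) and of the general rules "object number", "document size" and "overlapped or not", here is a minimal sketch that places square objects at random positions and rejects overlapping ones. It is not the production system described on the slide: domain rules, object moves and noise are left out, and `generate_document` with its parameters is a hypothetical name. The returned boxes are exactly the ground truth of the synthetic document.

```python
import random


def overlaps(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def generate_document(n_objects, doc_size, obj_size, max_tries=1000, seed=0):
    """Place up to n_objects non-overlapping squares; return their boxes."""
    rng = random.Random(seed)  # fixed seed so the same document can be regenerated
    placed = []
    tries = 0
    while len(placed) < n_objects and tries < max_tries:
        tries += 1
        x = rng.uniform(0, doc_size - obj_size)
        y = rng.uniform(0, doc_size - obj_size)
        box = (x, y, x + obj_size, y + obj_size)
        # General rule "overlapped or not": here overlaps are forbidden.
        if not any(overlaps(box, other) for other in placed):
            placed.append(box)
    return placed


ground_truth = generate_document(n_objects=5, doc_size=100, obj_size=10)
```

Because the positions are known by construction, no manual labelling is needed to evaluate a retrieval engine on such documents.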

Vector Graphics Indexing and Retrieval
Works done
• Fast graph building from vector graphics
• Production of first synthetic documents
Works in progress …
• To produce more complex synthetic documents …
• To work on model selection …
• To work on index structuration …
About the project (timeline)
• 04/05: SCSIT post-doc
• 02/06: IRCSET application, A. Winstanley (NCG, Dublin University)
• 04/06: Eureka meeting, e.Connector, HP Lab
• 06/06: ANVAR application, informal agreement
• 11/06: EPEIRES contract
• 2007 to 2008: to visit A. Winstanley (NCG, Dublin University); to contact M. Fonseca (IST, Lisbon University); JM Ogier plans to set up a European project

Plan • Short CV • Vector Graphics Indexing and Retrieval • Dropcap Image Retrieval

Dropcap Image Retrieval
Old books of the XVth and XVIth centuries
Which part and kind of graphics in old books?
• Foreground pixels [Jour'05]: 63% textual, 37% graphical
• Graphics types: 41% dropcaps, 59% others (figures, headlines, …)
CESR database
• Books: 46
• Pages: 1385
• Graphics: 4755 (3.4 per page)
[Figure: sample pages from Alciati (1511), Bartolomeo (1534) and Laurens (1621); old graphics: dropcap, figure, headline]

Dropcap Image Retrieval
What interests historians in these images? Retrieving similar printings (query -> DB -> Printing 1, Printing 2).
Why?
(1) Wood plug tracking: following a wood plug between printing houses (plug exchange, copy).
[Figure: wood plug (bottom view); plugs 1, 2, 3 over the periods 1497-1507, 1511-1542, 1531-1548 and 1555-1578; printing houses Vascosan 1555 and Marnef 1576]
(2) User-driven historical metadata acquisition: with retrieval, filling the metadata files is faster and less error-prone than without retrieval.
Real-time process or not?
We cannot index all images for legal reasons; a real-time process will allow queries with images provided by other digital libraries.

Dropcap Image Retrieval
What are the main difficulties? Noise and offset.
Which descriptor to use?
• Scalar descriptors [Loncaric'98]: Hough, Radon, Zernike, Hu, Fourier; scale and orientation invariant; fast; local (character, symbol, digit). Not adapted to our problem.
• Image descriptors [Gesu'99]: template matching, Hausdorff distance; not scale and orientation invariant; slow; global (scene). More adapted but too complex.
Our key idea: to use a compressed representation of the images.
Pipeline: image database -> filtering -> compression -> centering and comparison; a query returns ranked results R1, R2, R3.
Constraints: complexity, accuracy, scalability (several hundred classes, several thousand images).

Dropcap Image Retrieval
Pipeline: Filtering -> Compression -> Centering and Comparison
We have started to work with our images, but the file formats are very different.

Dropcap Image Retrieval: Filtering
Why? Digitization problems [Lawrence'00]:
• several image providers
• several digitization tools
• long process
• human supervised
• complex post-processing platform
• …
Our key idea
• Before working on the retrieval engine, historians need tools to improve the quality of their databases.
• To develop an engine (QUEID) working on image metadata to detect digitization problems and to secure the retrieval system.
[Figure: diagnostic mode (query, expertise analysis, QUEID, base charts, diagnostic steps (1), (2), (3)) and filtering mode (base accepted or rejected); software setting, image exchange, prototype software]
Our database and filtering parameters (engine: QUEID): 2038 files; size 279.7 Mp; gray; format TIFF; uncompressed; resolutions 250 to 350.

Dropcap Image Retrieval: Compression
Our key idea: to use a Run Length Encoding (RLE) of the image.
Which kind of RLE? Foreground, background, or both. An RLE of both seems more adapted.
Compression results: 0.88, 0.75, 0.95
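A minimal sketch of a run-length encoding that keeps both foreground and background runs, as favoured on this slide. The representation chosen here, a list of (value, length) pairs per image row with 1 for foreground and 0 for background, is an assumption; the slide does not specify the actual encoding.

```python
def rle_encode(row):
    """Encode a row of binary pixels as a list of (value, run_length) pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([pixel, 1])   # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Rebuild the pixel row from its (value, run_length) pairs."""
    return [value for value, length in runs for _ in range(length)]


row = [0, 0, 1, 1, 1, 0, 1]
runs = rle_encode(row)  # [(0, 2), (1, 3), (0, 1), (1, 1)]
```

The encoding is lossless: `rle_decode(runs)` gives back the original row, so no accuracy is lost in later comparisons.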

Dropcap Image Retrieval: Centering and Comparison
Centering
• To solve the offset problems, we must use a centering step before the comparison.
• We can do it in an easy way by comparing foreground histograms.
Comparison
• Line (y) of image 1 is compared with line (y+dy) of image 2, run by run, using a reference stack: while x2 < x1, handle image 2; while x1 < x2, handle image 1.
Time results, raster vs RLE (query image against the image database)
[Table: image size in k pixels (raster) and k runs (RLE), and time in s, with min, mean and max; extracted values: 137.7, 337.06, 600.8, 903.62, 176.67, 7.74, 1.1, 22.32, 15.5, 41.68, 87.8, 137.06]
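The two steps of this slide can be sketched as follows, under assumptions. `best_offset` estimates the shift between two foreground histograms by minimising the sum of absolute differences, which is one simple way to compare histograms and not necessarily the one used by the author. `rle_row_difference` walks two run-length encoded rows in parallel, always advancing the row whose current run ends first, and counts differing pixels without decompressing; it assumes both rows have the same total width and use (value, length) runs. Both function names are hypothetical.

```python
def best_offset(hist_a, hist_b, max_shift):
    """Shift to apply to hist_b so that it best matches hist_a."""
    def cost(shift):
        total = 0
        for x, a in enumerate(hist_a):
            k = x - shift
            b = hist_b[k] if 0 <= k < len(hist_b) else 0
            total += abs(a - b)
        return total
    return min(range(-max_shift, max_shift + 1), key=cost)


def rle_row_difference(runs_a, runs_b):
    """Count differing pixels between two rows given as (value, length) runs."""
    i = j = 0
    left_a = runs_a[0][1] if runs_a else 0
    left_b = runs_b[0][1] if runs_b else 0
    diff = 0
    while i < len(runs_a) and j < len(runs_b):
        step = min(left_a, left_b)      # pixels covered by both current runs
        if runs_a[i][0] != runs_b[j][0]:
            diff += step
        left_a -= step
        left_b -= step
        if left_a == 0:                 # run of image 1 exhausted: handle image 1
            i += 1
            if i < len(runs_a):
                left_a = runs_a[i][1]
        if left_b == 0:                 # run of image 2 exhausted: handle image 2
            j += 1
            if j < len(runs_b):
                left_b = runs_b[j][1]
    return diff


shift = best_offset([0, 0, 3, 5, 3, 0], [3, 5, 3, 0, 0, 0], max_shift=3)
row_a = [(0, 2), (1, 3), (0, 2)]   # 0011100
row_b = [(0, 3), (1, 3), (0, 1)]   # 0001110
difference = rle_row_difference(row_a, row_b)
```

The cost of the comparison grows with the number of runs rather than the number of pixels, which is the point of working on the compressed representation.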

Dropcap Image Retrieval
A mean query takes 40 s: how can we reduce it further without using a lossy compression and losing accuracy?
Our key idea: to use a system approach with different levels of operators (from the fastest to the most accurate) to select the images to compare.
Our first system
• Level 1: image sizes
• Level 2: black and white pixels
• Level 3: RLE comparison
How to process the distance curve? Using a basic clustering algorithm ('elbow criterion').
[Figure: query speed against depth for the 1st and 2nd levels; clustering of the distance curve by comparing successive differences]
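A minimal sketch of the system approach, assuming that each level is a distance function between the query and a database image and that the 'elbow' is taken as the largest jump in the sorted distance curve. The slide only names the criterion, so this particular cut rule, the function names and the toy data are all assumptions.

```python
def elbow_select(distances):
    """Indices of the candidates located before the largest jump in the sorted curve."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    if len(order) < 2:
        return order
    gaps = [distances[order[k + 1]] - distances[order[k]]
            for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return order[: cut + 1]


def cascade(query, database, operators):
    """Apply operators from fastest to most accurate, pruning candidates at each level."""
    candidates = list(range(len(database)))
    for distance in operators:
        d = [distance(query, database[i]) for i in candidates]
        candidates = [candidates[k] for k in elbow_select(d)]
    return candidates


# Toy example: images are reduced to a single number (e.g. their size).
selected = elbow_select([0.9, 0.1, 0.15, 0.8, 0.12])
kept = cascade(10, [10, 11, 12, 50, 55], [lambda q, x: abs(q - x)])
```

Only the images kept by the cheap levels reach the expensive RLE comparison, which is how the cascade reduces the mean query time.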

Dropcap Image Retrieval
Selection results (share of the database selected): min 4%, mean 24%, max 59%
From 4% to 59%: how to reduce the variability? Working on a better selection criterion seems ambiguous …
Our key idea: to add an intermediate operator between scalar and image data.

Dropcap Image Retrieval
Example of query result (distances to the query)
• Same plug: 0.1947, 0.2517, 0.3485, 0.3616, 0.3819, 0.4064
• Next plug: 0.4109, 0.4209
First results seem good, but how to get the ground truth and evaluate our system?
Our key idea: to use our engine to produce a benchmark database.
[Figure: the retrieval engine retrieves from the base and displays results in a control interface (IHM); user-driven labelling produces labels and the benchmarks Bench 1 and Bench 2]

Dropcap Image Retrieval
Works done
• QUEID to filter and analyse image databases
• Comparison speed-up using two features: RLE compression and the system approach
Works in progress …
• To add operators to improve the system
• To extend our system to produce a benchmark database
About the project (timeline)
• 09/05: MADONNE post-doc
• 06/06: 1st CESR Technical Meeting
• 09/06: ANAGRAM Workshop (Fribourg)
• 10/06: 2nd CESR Technical Meeting
• 2007: NaviDoMass agreement; GDR-JC project (LMA, LI, CReSTIC, LITIS, CVC); to put the system online on the CESR website; old graphics working group (Glasgow, Tours, …)

Bibliography
1. J. Mong and D. Brailsford. Using SVG as the rendering model for structured and graphically complex web material. In Symposium on Document Engineering (DocEng), pages 88-91, 2003.
2. Y. Chen, J. Gong, W. Jia, and Q. Zhang. XML-based spatial data interoperability on the internet. In Conference of International Society for Photogrammetry and Remote Sensing and Spatial Information Sciences (ISPRS), pages 167-201, 2004.
3. J. Kang, B. Lho, J. Kim, and Y. Kim. XML-based vector graphics: Application for web-based design automation. In International Conference on Computing in Civil and Building Engineering (ICCCBE), pages 170-178, 2004.
4. M. Weindorf. Structure based interpretation of unstructured vector maps. In Workshop on Graphics Recognition (GREC), volume 2390 of Lecture Notes in Computer Science (LNCS), pages 190-199, 2002.
5. N. Journet, R. Mullot, J. Ramel, and V. Eglin. Ancient printed documents indexation: a new approach. In International Conference on Advances in Pattern Recognition (ICAPR), volume 3686 of Lecture Notes in Computer Science (LNCS), pages 513-522, 2005.
6. V. D. Gesu and V. Starovoitov. Distance based function for image comparison. Pattern Recognition Letters (PRL), 20(2): 207-214, 1999.
7. S. Loncaric. A survey of shape analysis techniques. Pattern Recognition (PR), 31(8): 983-1001, 1998.
8. G. Lawrence et al. Risk management of digital information: A file format investigation. RLG DigiNews, 8(4), 2000.