A Framework for Relationship Discovery Among Files of Different Types

Michal Ondrejcek, Jason Kastner and Peter Bajcsy
National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC)
{mondrejc, jkastner, pbajcsy}@ncsa.uiuc.edu

Abstract

We present a framework for relationship discovery from heterogeneous data systems. The framework consists of modules for automated file system analysis, file content analysis, integration of the results from these analyses, storage of metadata, and data-driven decision support for discovering relationships among files. The file content analysis includes filtering for file type detection (e.g., file format identification using DROID and PRONOM) and type-specific content analysis (such as information extraction from 2D engineering drawings using Optical Character Recognition (OCR), and keyword-based extraction of information from 3D CAD models). The integration component consolidates metadata extracted from the file system and from the file content using Resource Description Framework (RDF)-based metadata representations. These are stored using Tupelo in an underlying content repository. We report our preliminary design of the framework and the performance of prototype modules on a test collection of electronic records documenting the Torpedo Weapon Retriever (TWR 841). This test collection presents a problem of unknown relationships among files, which currently include 784 2D image drawings and 22 CAD models.

Framework Design

Figure: An overall design for discovering relationships among multiple sources of electronic records.

File System Information Extraction

• Aperture, a Java framework, was used for metadata extraction from file systems. It saves the metadata following the Nepomuk ontology.
• We studied the size of the extracted metadata and developed prediction capabilities to estimate additional storage requirements.

Figure: Metadata size as a function of the number of files in a file system. The test systems were divided by operating system (OS) type into (c1) Linux-based, 8-CPU Intel Xeon at 2.5 GHz with 8 GB RAM, and (c2) Windows XP, 1-CPU Intel at 2 GHz with 2 GB RAM. The dots correspond to concrete file systems; the blue line represents the metadata size predicted from a simulated file system topology.

File Format Identification

This component calls DROID, a file format identification program. The results are metadata about each file, including the registered PRONOM unique ID. PRONOM is a registry of information about file formats, software products and other technical components. Several 3D file formats are not supported by PRONOM, and for those DROID returns the unidentified file format flag. Those files are then checked against an internal list of 3D file types. The results are converted into RDF triples and stored in a metadata context repository.

Example identification result: U2110_BHD12_Autocad.dwg, Positive, AutoCAD Drawing 2004-2005, http://www.nationalarchives.gov.uk/pronom/fmt/36, image/vnd.dwg

Figure: RDF triples generated for two engineering drawings in TIFF and AutoCAD formats, with PRONOM unique IDs highlighted. A UUID is used as the key for storing the set of triples about the same file.

Content Information Extraction

We study the extraction of content information to discover relationships between engineering drawings (TIFF files) with a title block and the corresponding AutoCAD 3D models (DWG files) of the TWR 841 ship deck.

Figure: Engineering Drawing, RELATIONSHIP, 3D CAD Model.

• Information in engineering drawings: The title block is cropped. Information is extracted using Optical Character Recognition (OCR) software. The extracted information is corrected and encoded into about 15-20 RDF triples using a developed ontology.

Figure panels: Cropped Title Block; Information from OCR; Editing and Ontology Definition; RDF representation of information extracted.

Raw OCR output for the cropped title block (before correction): 120 TORPEDO WEAPONS RETRIEVER TRANSVERSE BULKHEADS BELOW MAIN DECK Of. PAO'Mt. N* Of » NE **v* NAVAL SEA SYSTEMS COMMAND 1/2"-1'-0"& AS SHOWN H 117 -6200895 A LDOBSON 4 -I 0 -86

• Information in 3D CAD files: The 3D CAD models in STEP file format are searched for any ASCII strings that match words in an English dictionary. The information is again encoded in about 8-10 RDF triples.

STEP METADATA SPECIFICATION
    FILE_DESCRIPTION(
      /* description */ (''),
      /* implementation_level */ '2;1');
    FILE_NAME(
      /* name */ '',
      /* time_stamp */ '',
      /* author */ (''),
      /* organization */ (''),
      /* preprocessor_version */ ' ',
      /* originating_system */ '',
      /* authorization */ ' ');

EXPECTED STEP METADATA
    FILE_DESCRIPTION((''), /* implementation_level */ '2;1');
    FILE_NAME(
      '120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK',
      '04-10-86',
      ('LDOBSON'),
      ('NAVAL SEA SYSTEMS COMMAND'),
      ' ',
      'IDA-STEP',
      ' ');

PARSED STEP METADATA
    FILE_DESCRIPTION((''), '2;1');
    FILE_NAME(
      'D:\NARA\Archieve_data_samples\BHD_FR12\U2110_BHD12_2007_05_09.stp',
      '2007-05-10T13:45:37',
      ('rakowpj'),
      (''),
      'Autodesk Inventor 11',
      '');

The table shows an example of information extracted from 3D CAD models of the TWR 841 ship deck stored in the STEP file format.

Conclusions

• We have prototyped a framework for file system and file content metadata extraction. The relationship discovery from metadata is in progress.
• We developed a metadata size prediction capability for file systems.
• We empirically observed that the number of RDF triples generated for relationship discovery averages about 20-30 per file, which leads to a total of 8-12 million RDF triples for an average-size server.

Acknowledgments

This research was partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI-9619019.
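The STEP header table compares the fields of the FILE_NAME entity in a specification, an expected header and a parsed header. As an illustration only, the following minimal Python sketch reads those fields from header text. It is not the poster's implementation: the function names `split_args` and `parse_file_name` are hypothetical, and the field order is taken from the specification column of the table.

```python
import re

# Field order as listed in the STEP metadata specification column.
FIELDS = ["name", "time_stamp", "author", "organization",
          "preprocessor_version", "originating_system", "authorization"]


def split_args(body):
    """Split an argument list on commas that lie outside quotes and parentheses."""
    args, current, depth, in_string = [], [], 0, False
    for ch in body:
        if ch == "'":
            in_string = not in_string  # a doubled quote toggles twice, so it stays inside
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                args.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    args.append("".join(current).strip())
    return args


def parse_file_name(header):
    """Return the FILE_NAME fields as a dict, or None if the entity is absent."""
    text = re.sub(r"/\*.*?\*/", "", header, flags=re.S)  # drop /* ... */ comments
    match = re.search(r"FILE_NAME\s*\((.*?)\)\s*;", text, flags=re.S)
    if match is None:
        return None
    values = []
    for arg in split_args(match.group(1)):
        strings = [s.replace("''", "'")
                   for s in re.findall(r"'((?:[^']|'')*)'", arg)]
        if arg.startswith("("):          # parenthesized list, e.g. author
            values.append(strings)
        else:                            # plain string field
            values.append(strings[0] if strings else "")
    return dict(zip(FIELDS, values))


# Expected header from the table (values copied from the poster).
EXPECTED_HEADER = """
FILE_DESCRIPTION((''), /* implementation_level */ '2;1');
FILE_NAME('120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK',
  '04-10-86', ('LDOBSON'), ('NAVAL SEA SYSTEMS COMMAND'), ' ', 'IDA-STEP', ' ');
"""

fields = parse_file_name(EXPECTED_HEADER)
print(fields["author"], fields["time_stamp"])
```

Comparing the dictionary obtained from a parsed header with one built from the corrected title block fields is one possible way to look for a drawing-to-model relationship; a header with fewer than seven arguments simply yields fewer keys.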
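The file format identification step (DROID first, then an internal list of 3D file types for unidentified files, then RDF triples keyed by a UUID) can be sketched as follows. This is a hedged illustration, not the poster's code: it does not call DROID, and instead assumes the DROID result has already been reduced to an optional PRONOM identifier string such as "fmt/36". The extension list and the `ex:` predicate names are hypothetical.

```python
import uuid
from pathlib import PurePath

PRONOM_BASE = "http://www.nationalarchives.gov.uk/pronom/"

# Hypothetical internal list of 3D file types, used only when DROID
# reports the file as unidentified.
KNOWN_3D_EXTENSIONS = {
    ".stp": "STEP 3D model",
    ".step": "STEP 3D model",
    ".igs": "IGES 3D model",
}


def identify(filename, droid_puid):
    """Return (status, format) from a DROID result or the internal 3D list.

    droid_puid is assumed to be a PRONOM ID such as "fmt/36",
    or None when DROID returned the unidentified flag.
    """
    if droid_puid is not None:
        return "Positive", PRONOM_BASE + droid_puid
    label = KNOWN_3D_EXTENSIONS.get(PurePath(filename).suffix.lower())
    if label is not None:
        return "InternalList", label
    return "Unidentified", None


def to_triples(filename, droid_puid):
    """Encode the result as (subject, predicate, object) tuples under one UUID."""
    subject = "urn:uuid:" + str(uuid.uuid4())  # key shared by all triples of a file
    status, fmt = identify(filename, droid_puid)
    triples = [
        (subject, "ex:fileName", filename),
        (subject, "ex:identificationStatus", status),
    ]
    if fmt is not None:
        triples.append((subject, "ex:format", fmt))
    return triples


for triple in to_triples("U2110_BHD12_Autocad.dwg", "fmt/36"):
    print(triple)
```

Plain tuples stand in for RDF triples here so the sketch stays self-contained; in the framework described above the triples are stored through Tupelo.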
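The triple counts in the conclusions can be checked with simple arithmetic. The sketch below assumes a server with 400,000 files; that file count is not stated on the poster and is inferred here only because it makes 20-30 triples per file consistent with a total of 8-12 million. The function name `triple_count_range` is hypothetical.

```python
def triple_count_range(num_files, per_file_low=20, per_file_high=30):
    """Return the (low, high) total number of RDF triples for num_files files."""
    return num_files * per_file_low, num_files * per_file_high


# Assumed file count for an "average size server" (inferred, not from the poster).
ASSUMED_FILES = 400_000

low, high = triple_count_range(ASSUMED_FILES)
print(f"{low:,} to {high:,} triples")
```

Under that assumption the totals are 400,000 x 20 = 8,000,000 and 400,000 x 30 = 12,000,000, which matches the range reported in the conclusions.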