A Framework for Relationship Discovery Among Files of Different Types

Michal Ondrejcek, Jason Kastner and Peter Bajcsy
National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC)
{mondrejc, jkastner, pbajcsy}@ncsa.uiuc.edu

Abstract

We present a framework for relationship discovery from heterogeneous data systems. The framework consists of modules for automated file system analysis, file content analysis, integration of the results from both analyses, storage of metadata, and data-driven decision support for discovering relationships among files. The file content analysis includes filtering for file type detection (e.g., file format identification using DROID and PRONOM) and type-specific content analysis (such as information extraction from 2D engineering drawings using Optical Character Recognition (OCR) and keyword-based extraction of information from 3D CAD models). The integration component consolidates metadata extracted from the file system and from the file content using Resource Description Framework (RDF)-based metadata representations. These are stored using Tupelo in an underlying content repository. We report our preliminary design of the framework and the performance of prototype modules for a test collection of electronic records documenting the Torpedo Weapon Retriever (TWR 841). This test collection presents a problem of unknown relationships among files, which currently include 784 2D image drawings and 22 CAD models.

Framework Design

An overall design for discovering relationships among multiple sources of electronic records.

Content Information Extraction

We study the extraction of content information to discover relationships between engineering drawings (TIFF files) with a title block and the corresponding AutoCAD 3D models (DWG files) of the TWR 841 ship deck.

Figure: Engineering Drawing, RELATIONSHIP, 3D CAD Model

• Information in engineering drawings: The title block is cropped.
Information is extracted using Optical Character Recognition (OCR) software. The extracted information is corrected and encoded into about 15-20 RDF triples using an ontology developed for this purpose.

File System Information Extraction

• Aperture, a Java framework, was used for metadata extraction from file systems. It saves the metadata following the Nepomuk ontology.
• We studied the size of the extracted metadata and developed prediction capabilities to estimate additional storage requirements.

Figure: Metadata size as a function of the number of files in a file system. The test systems were divided by operating system (OS) type into (c1) a Linux-based system with 8 Intel Xeon CPUs at 2.5 GHz and 8 GB RAM, and (c2) a Windows XP system with 1 Intel CPU at 2 GHz and 2 GB RAM. The dots correspond to concrete file systems, while the blue line represents the metadata size prediction based on a simulated file system topology.

Figure panels: Cropped Title Block; Information from OCR; Editing and Ontology Definition (XML listing)
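
The step of encoding corrected title-block information as RDF triples might look roughly like the sketch below. It is only an illustration: the poster does not give the ontology, so the namespace `http://example.org/twr841#`, the field names (`drawingNumber`, `title`, `revision`) and the sample values are all hypothetical. The output is written by hand in N-Triples syntax, not produced with Tupelo or any other RDF library.

```python
from urllib.parse import quote

# Hypothetical namespace; the poster's actual ontology is not shown.
NS = "http://example.org/twr841#"


def literal(value: str) -> str:
    """Escape a string as a plain N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def title_block_to_ntriples(file_name: str, fields: dict[str, str]) -> list[str]:
    """Return one N-Triples line per non-empty title-block field."""
    subject = f"<{NS}drawing/{quote(file_name)}>"
    lines = []
    for key, value in fields.items():
        if not value.strip():
            continue  # skip fields that the OCR step left empty
        predicate = f"<{NS}{quote(key)}>"
        lines.append(f"{subject} {predicate} {literal(value.strip())} .")
    return lines


if __name__ == "__main__":
    # Hypothetical field names and values, for illustration only.
    example = {"drawingNumber": "EXAMPLE-001", "title": 'DECK "PLAN"', "revision": ""}
    for line in title_block_to_ntriples("sheet 01.tif", example):
        print(line)
```

Each drawing becomes one subject, and each corrected field becomes one triple, so a title block with 15-20 fields would yield a comparable number of triples.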

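To illustrate how metadata size could be estimated from the number of files, the sketch below fits a straight line by ordinary least squares. This is a swapped-in simplification, not the method on the poster, which bases its prediction on a simulated file system topology. The function names and the sample measurements are hypothetical.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_metadata_bytes(n_files: float, slope: float, intercept: float) -> float:
    """Estimate metadata size for a file system with n_files files."""
    return slope * n_files + intercept


if __name__ == "__main__":
    # Hypothetical measurements: (number of files, metadata size in bytes).
    file_counts = [100.0, 200.0, 300.0]
    metadata_sizes = [3000.0, 5000.0, 7000.0]
    slope, intercept = fit_line(file_counts, metadata_sizes)
    print(predict_metadata_bytes(1000, slope, intercept))
```

A linear model is the simplest assumption when each file contributes a roughly constant amount of metadata. A topology-based simulation could also capture effects such as directory depth that a single line cannot.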

