The WebCAT Framework
Automatic Generation of Meta-Data from Web Resources
Bruno Martins and Mário J. Silva
Faculdade de Ciências da Universidade de Lisboa

Outline of the Presentation
• Motivation
• The WebCAT framework
• Overview of the components
  – The Core Parser
  – The Miners
  – The Augmenters
• Applications and results
• Conclusions and future work

Motivation
The WWW is the largest information source in the world, but...
– The Semantic Web is not truly deployed yet
– Poorly authored HTML pages: fuzzy and irregular input
– Content and presentation heavily interlinked (not XHTML)
– No meta-data standard (Dublin Core is not mandatory)
– Multiple formats (Flash, PDF, …)
Designing tools that reuse and remix Web content remains very difficult!

Recently Proposed Semantic Web Systems
Annotation of Web pages with ontology-derived semantic tags
– Manual or semi-automatic tagging
– Laborious and error-prone task
Fully automated systems can provide the means to bootstrap the Semantic Web

WebCAT: Web Content Analysis Tool
Extensible framework for automatically extracting/generating meta-data from present-day Web resources
• Web agents and page scrapers
• Web crawlers
• Web mining applications
Starting point for more advanced annotation systems and Semantic Web tools

The WebCAT Framework

WebCAT Core Parser
Low-level processing related to scanning HTML and extracting information
• Conversion from other file formats to HTML
• Handle fuzzy, noisy, irregular input
  – Similar to HTML browsers, never throw syntax errors
  – Best-effort approach to solve markup problems
  – Fault-tolerant parser written by hand

WebCAT Core Parser: Text Content
• Tokenization based on context pairs
  – Context given by surrounding character(s)
  – HTML scanning and tokenization tightly coupled
• Detection of sentences and individual words
• Character n-grams and collocations
• Keep track of HTML markup information
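The word and character n-gram extraction listed on this slide can be illustrated with a minimal sketch. The regular expression and the function names are assumptions made for illustration; this is not WebCAT's context-pair tokenizer, and it ignores HTML markup.

```python
import re
from collections import Counter

# Hypothetical word pattern: runs of word characters, optionally joined
# by an apostrophe or hyphen (e.g. "web's", "well-known").
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return [match.group(0).lower() for match in WORD_RE.finditer(text)]

def char_ngrams(token, n):
    """Return the character n-grams of a token padded with '_' on both sides."""
    padded = "_" + token + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_counts(text, n=3):
    """Count character n-grams over all word tokens of a text."""
    counts = Counter()
    for token in tokenize(text):
        counts.update(char_ngrams(token, n))
    return counts
```

Padding each token marks word boundaries, so prefixes and suffixes become distinct n-grams.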

WebCAT Core Parser: Hyperlinks
Normalization of HTML links
• Discard URLs not following the syntax
• Convert host names to lowercase
  – www.TEST.COM/ converted to www.test.com/
• Discard default port number
  – www.test.com:80/ converted to www.test.com/
• Normalize file information
  – www.test.com/d1/..// converted to www.test.com/
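The normalization rules above can be sketched with Python's standard library. This is an illustration under stated assumptions, not WebCAT's implementation: the function name `normalize_url` is hypothetical, URLs without a scheme are assumed to be HTTP, and only http/https URLs are accepted.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url):
    """Return a normalized URL, or None if it is not a usable http(s) URL."""
    if "://" not in url:
        url = "http://" + url  # assumption: bare host names are HTTP
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return None
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        return None
    host = parts.hostname  # already lowercased by urlsplit
    if port is not None and port != DEFAULT_PORTS[scheme]:
        host = f"{host}:{port}"  # keep only non-default ports
    # Collapse ".", ".." and repeated slashes in the path.
    raw_path = parts.path or "/"
    path = "/" + posixpath.normpath(raw_path).lstrip("/")
    if raw_path.endswith("/") and not path.endswith("/"):
        path += "/"  # preserve a trailing slash on directories
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Applied to the three slide examples, the sketch returns `http://www.test.com/` in each case. Dropping the fragment is a design choice of this sketch, since fragments do not identify a separate resource.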

WebCAT Core Parser: Meta-Tags
Normalization of Meta-Tag information
– Dublin Core
– GeoTags
– GeoURL
– Robots Exclusion Protocol
– HTTP-Equiv
Extraction of available RDF information
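A minimal sketch of meta-tag collection with case normalization, using Python's lenient `html.parser`. The class and function names are hypothetical, and the sketch does not cover RDF extraction.

```python
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collect <meta> name / http-equiv values, keyed by lowercase name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        content = attributes.get("content")
        if key and content is not None:
            self.meta.setdefault(key.strip().lower(), []).append(content.strip())

def extract_meta(html):
    """Return a dict mapping normalized meta-tag names to lists of values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.meta
```

Keys such as `dc.title`, `robots`, or `content-language` can then be looked up regardless of how the page author capitalized them.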

WebCAT Miners
Task-specific modules that infer knowledge from the available meta-data
• Machine-learning and text analytics techniques
• Some examples:
  – Content fingerprinting algorithm (Rabin hash function)
  – Detecting nepotistic links (Davison '00)
  – Stemming algorithms (Snowball package)
  – Language identification (Martins & Silva '05)
  – Named entity recognition
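As an illustration of content fingerprinting for replica detection, the sketch below hashes normalized text with a simple polynomial hash. This is a stand-in, not a true Rabin fingerprint (which uses polynomial arithmetic over GF(2) modulo an irreducible polynomial); the constants and function names are assumptions.

```python
import re

BASE = 257
MODULUS = (1 << 61) - 1  # a Mersenne prime, chosen for illustration

def fingerprint(text):
    """Hash whitespace-normalized, lowercased text with a polynomial hash."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    value = 0
    for byte in normalized.encode("utf-8"):
        value = (value * BASE + byte) % MODULUS
    return value

def find_replicas(documents):
    """Group document ids whose content fingerprints are equal."""
    groups = {}
    for doc_id, text in documents.items():
        groups.setdefault(fingerprint(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Normalizing case and whitespace before hashing lets trivially different copies of a page map to the same fingerprint.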

WebCAT Miners: Language Identification
Language meta-data useful to bootstrap more advanced algorithms
• Existing language meta-tag information
• Machine learning approach based on n-grams
  – Comparison of most frequently occurring n-grams
  – Efficient similarity measure (Lin '98)
  – Heuristics based on HTML tags
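The comparison of most frequent n-grams can be sketched as follows. Note the substitution: instead of the Lin '98 similarity measure named on the slide, this sketch uses a simple rank-based "out-of-place" distance between n-gram profiles. Function names and the profile size are assumptions.

```python
from collections import Counter

def ngram_profile(text, n=3, size=300):
    """Return the `size` most frequent character n-grams, most frequent first."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language profile
    receive the maximum penalty (the profile length)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[gram]) if gram in rank else penalty
               for i, gram in enumerate(doc_profile))

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))
```

In practice each language profile would be built from a large training corpus; the result for short or mixed-language pages is unreliable, which is where meta-tag and HTML heuristics help.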

WebCAT Miners: Named Entity Recognition
Named entity annotations with references to ontology
• Currently handles locations and organizations with a geographical context (for use in Geo-IR)
• Knowledge-based system with rules combining
  – Name lists (multilingual, based on language meta-data)
  – Context patterns (multilingual, based on language meta-data)
  – Capitalization
• Heuristics for disambiguation + "grounding" to ontology
  – One reference per discourse (Gale et al. '93)
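A toy sketch of how name lists, context patterns, and capitalization might combine. The gazetteer entries, ontology identifiers, and context words are all hypothetical, and real rules would be far richer.

```python
import re

# Hypothetical name list: lowercase name -> hypothetical ontology identifier.
GAZETTEER = {"lisboa": "geo:lisboa", "porto": "geo:porto"}
# Hypothetical Portuguese context words that often precede a place name.
CONTEXT_WORDS = {"em", "de", "no", "na"}

def find_locations(text):
    """Return (surface form, ontology id, has_context) for capitalized
    tokens that appear in the name list."""
    tokens = re.findall(r"\w+", text)
    found = []
    for i, token in enumerate(tokens):
        concept = GAZETTEER.get(token.lower())
        if concept is None or not token[0].isupper():
            continue  # not in the name list, or fails the capitalization rule
        has_context = i > 0 and tokens[i - 1].lower() in CONTEXT_WORDS
        found.append((token, concept, has_context))
    return found
```

For "Visitei o porto de Lisboa", the lowercase common noun "porto" (harbour) is skipped while "Lisboa" is grounded, which shows why capitalization matters as a disambiguation cue.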

WebCAT Augmenters
Augmenting the meta-data extracted/mined from the documents
• Good for simultaneous analysis of a large number of Web resources
• Combination of the available meta-data

WebCAT Augmenters: Assigning Geographical Scopes to Web Pages
Assign each document a geographical scope
• Use geo-references from the NER miner
• Anchor text is propagated to other pages
• Disambiguation made through:
  – Relations on a geographical ontology
  – Graph ranking algorithm (PageRank)
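One way to picture graph ranking over a geographical ontology is a personalized PageRank in which geo-references found in a document seed the ranking and part-of relations carry weight to broader regions. The graph layout, the weighting by reference counts, and the function name are assumptions of this sketch, not the framework's actual design.

```python
def pagerank(edges, weights, damping=0.85, iterations=50):
    """Personalized PageRank by power iteration.

    edges:   dict mapping a node to its list of outgoing neighbours
    weights: dict mapping a node to a non-negative teleport weight
             (here: how often the document references that place)
    """
    nodes = sorted(set(edges) | set(weights)
                   | {m for targets in edges.values() for m in targets})
    total = sum(weights.values())
    teleport = {n: weights.get(n, 0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = edges.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for m in targets:
                    nxt[m] += share
            else:
                # Dangling node: redistribute its rank via the teleport vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank

# Hypothetical ontology fragment: each place points to the region containing it.
ontology = {"lisboa": ["portugal"], "porto": ["portugal"], "paris": ["france"]}
# Hypothetical reference counts produced by the NER miner for one document.
references = {"lisboa": 2, "porto": 1, "paris": 1}
scores = pagerank(ontology, references)
scope = max(scores, key=scores.get)
```

With these inputs the highest-ranked concept is `portugal`: two distinct Portuguese places reinforce their common parent, while the single reference to Paris does not outweigh them.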

Applications
• Open source software
  – http://webcat.sourceforge.net
• In use at the tumba! Web search engine
  – http://www.tumba.pt
  – 10 million Portuguese Web pages
• GREASE Project (Web-Geo-IR)
• Web characterization studies
• Used in participations in TREC and CLEF

Experimental Results
Evaluation of individual components
• The Core Parser
  – Tokenizer achieved 95% accuracy over the WSJ corpus
• The Miners
  – Language identification achieved 91% accuracy in discriminating 11 different languages over Web pages
  – NER achieved 0.89 precision and 0.68 recall on recognizing NEs on a small set of Web pages
• The Augmenters
  – Scope assignment in DMOZ pages gave promising results
Additional experiments currently under way!

Experimental Results
Statistics from a Crawl of the Portuguese Web
Document Statistics
• Avg. words per doc.: 438
• Avg. doc size: 32.4 KB
• Avg. text size: 2.8 GB
• Avg. word length: 5 chars
Collective Statistics
• Documents analyzed: 325140
• Data size: 78 GB
• Textual data: 8.8 GB
• External links: 243930
• Web sites: 131864
• Words: 1652645998
• Distinct words: 7880609
Meta-Data Statistics
• PDF docs: 1.9%
• DOC, XLS, PPT docs: 0.7%
• Description tag: 17%
• Keywords tag: 18%
• Portuguese docs: 73%
• English docs: 17%
• Content replicas: 15.5%

Conclusions and Future Work
• Automatic meta-data generation is a prerequisite for the deployment of the Semantic Web
• Large-scale effort of collecting/generating meta-data for Web resources
• Advantages over other existing methods (DOM parsers or regular expression tools)
• Modular architecture facilitates adding new features
• Some of the specific algorithms require improvements
• API and documentation need some cleaning up

Thanks for your attention.
bmartins@xldb.di.fc.ul.pt
http://webcat.sourceforge.net