SemTag and Seeker: Bootstrapping the Semantic Web

SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation
Presented by: Hussain Sattuwala
Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien
IBM Almaden Research Center
http://www.almaden.ibm.com/webfountain/resources/semtag.pdf

Outline
- Motivation
- Goal
- SemTag
  - Architecture
  - Phases
  - TBD
  - Results
  - Methodology
- Seeker
  - Design
  - Architecture
  - Environment
- Conclusion
- Related and Future Work

Motivation
- Natural language processing is the most significant obstacle to building a machine-understandable web.
- For the Semantic Web to become a reality, we need:
  - Web services to maintain & provide metadata.
  - Annotated documents (OWL, RDF, XML, ...).

Annotations
- Current practice of annotation for knowledge identification, extraction & other applications:
  - is time-consuming
  - needs annotation by experts
  - is complex
- Reduce the burden of text annotation for knowledge management.

Goal
- To perform automated semantic tagging of large corpora: SemTag.
- To introduce a new disambiguation algorithm to resolve ambiguities in a natural language corpus: the TBD (Taxonomy-Based Disambiguation) algorithm.
- To introduce a platform that different tagging applications can share: Seeker.

SemTag
- The goal is to automatically add semantic tags to the existing HTML body of the web.
- Example: "The Chicago Bulls announced yesterday that Michael Jordan will..."
- Becomes: The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource> will...
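The tagging step in the example above can be sketched in a few lines. This is a minimal illustration only, not SemTag's implementation: the `LABEL_TO_URI` table and the `tag_text` function are hypothetical names, the table holds just the two entries from the slide, and matching is exact string matching with no disambiguation.

```python
import re

# Hypothetical label-to-URI table; the two entries mirror the slide's example.
LABEL_TO_URI = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}


def tag_text(text, label_to_uri):
    """Wrap every known label found in `text` in a <resource ref="..."> element."""
    if not label_to_uri:
        return text
    # Try longer labels first so a short label never splits a longer one.
    labels = sorted(label_to_uri, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        return f'<resource ref="{label_to_uri[label]}">{label}</resource>'

    # A single pass over the text, so inserted markup is never re-scanned.
    return pattern.sub(wrap, text)


tagged = tag_text(
    "The Chicago Bulls announced yesterday that Michael Jordan will...",
    LABEL_TO_URI,
)
```

A single regular-expression pass is used here so that text inside an inserted URI cannot be matched a second time.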

SemTag
- Uses the TAP KB
  - TAP is a public, broad, shallow knowledge base.
  - TAP contains lexical and taxonomic information about popular objects such as music, movies, sports, etc.
- Problem: there is no write access to the original document. How do you annotate?
- Uses the concept of a Label Bureau from PICS (Platform for Internet Content Selection)
  - An HTTP server that can be queried for annotation information.
  - A separate store of semantic annotation information.
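The idea of keeping annotations apart from the pages they describe can be sketched as a small in-memory store. This is only a stand-in under stated assumptions: `LabelBureau`, its `add` and `query` methods, the character-offset record layout, and the example page URL are all hypothetical, and the sketch does not implement the PICS protocol or an HTTP server.

```python
class LabelBureau:
    """In-memory stand-in for a label bureau: annotations live apart from pages."""

    def __init__(self):
        self._by_url = {}

    def add(self, url, start, end, ref):
        """Record that characters [start, end) of the page at `url` refer to `ref`."""
        self._by_url.setdefault(url, []).append(
            {"start": start, "end": end, "ref": ref}
        )

    def query(self, url):
        """Return a copy of all annotations stored for `url` (empty list if none)."""
        return list(self._by_url.get(url, []))


bureau = LabelBureau()
page_url = "http://example.com/bulls.html"  # hypothetical page address
page_text = "The Chicago Bulls announced yesterday that Michael Jordan will..."

# The page text itself is never modified; only offsets into it are stored.
start = page_text.find("Chicago Bulls")
bureau.add(
    page_url,
    start,
    start + len("Chicago Bulls"),
    "http://tap.stanford.edu/BasketballTeam_Bulls",
)
```

Any client that knows the page's URL can then call `bureau.query(page_url)` and overlay the annotations on its own copy of the page.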

Example: Annotated Page

SemTag Architecture
[Diagram. Phase labels: Spotting (retrieve documents, tokenize, find context); Learning (determine distribution of terms); Tagging (disambiguate windows, add to DB). Other labels: Automatic, Manual.]

SemTag Phases
1. Spotting:
  - Retrieve documents from Seeker.
  - Tokenize documents.
  - Find contexts (10 words + label + 10 words) for labels that appear in the TAP taxonomy.
2. Learning:
  - Scan a representative sample to determine the distribution of terms at each internal node of the taxonomy.
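The spotting phase can be sketched as follows. This is a simplified illustration, not the SemTag code: `tokenize` and `find_spots` are hypothetical names, tokenization is a plain word regex, and labels are matched exactly (case-sensitive). The default window of 10 tokens on each side follows the slide.

```python
import re


def tokenize(text):
    """Split text into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text)


def find_spots(tokens, labels, window=10):
    """Return (label, context) pairs for every occurrence of a label.

    The context is up to `window` tokens before the label, the label's own
    tokens, and up to `window` tokens after it.
    """
    label_parts = {label: label.split() for label in labels}
    spots = []
    for i in range(len(tokens)):
        for label, parts in label_parts.items():
            n = len(parts)
            if parts and tokens[i:i + n] == parts:
                start = max(0, i - window)
                spots.append((label, tokens[start:i + n + window]))
    return spots


tokens = tokenize(
    "The Chicago Bulls announced yesterday that Michael Jordan will return"
)
# A window of 2 keeps the example output short; the slide uses 10.
spots = find_spots(tokens, ["Chicago Bulls", "Michael Jordan"], window=2)
```

With `window=2`, the first spot's context is truncated on the left because the label sits near the start of the document.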

SemTag Phases, cont’d
3. Tagging:
  - Disambiguate windows (using TBD).
  - Add to the database.
Ambiguity types:
  - The same label appears at multiple locations in the TAP ontology.
  - Some entities have labels that occur in contexts that have no representative in the taxonomy.
Training data:
  - Automatic metadata
  - Manual metadata

TBD Methodology
- Each node has a set of labels.
  - E.g., the nodes for cats, football, and cars all contain the label "Jaguar".
- Each label in the text is stored with a window of 20 words: the context.
- A spot (l, c) is a label l in a context c.
- Each node has an associated similarity function mapping a context to a similarity.
  - The higher the similarity, the more likely the context contains a reference to that node.

TBD - Similarity
- Generate a 200,000-dimensional vector corresponding to each context.
- TF-IDF scheme
  - Each entry of a node's vector is the frequency of the term occurring at that node divided by the corpus frequency of the term.
- IR algorithm: cosine similarity
  - Vector product of the sparse spot vector and the dense node vector.
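The similarity computation described above can be sketched with plain dictionaries. This is an illustration under assumptions, not the paper's code: `node_vector` and `cosine_similarity` are hypothetical names, both vectors are stored as sparse dicts for brevity (the slide describes the node vector as dense), and the tiny term counts are made up for the example.

```python
import math
from collections import Counter


def node_vector(node_term_counts, corpus_term_counts):
    """Weight each term by (frequency at the node) / (frequency in the corpus)."""
    return {
        term: count / corpus_term_counts[term]
        for term, count in node_term_counts.items()
        if corpus_term_counts.get(term, 0) > 0
    }


def cosine_similarity(spot_vec, node_vec):
    """Cosine of the angle between two term-weight vectors stored as dicts."""
    dot = sum(w * node_vec.get(term, 0.0) for term, w in spot_vec.items())
    norm_spot = math.sqrt(sum(w * w for w in spot_vec.values()))
    norm_node = math.sqrt(sum(w * w for w in node_vec.values()))
    if norm_spot == 0 or norm_node == 0:
        return 0.0
    return dot / (norm_spot * norm_node)


# Made-up counts for two nodes that share the label "Jaguar".
corpus_counts = {"engine": 10, "cat": 10, "the": 100}
cars_vec = node_vector({"engine": 8, "the": 10}, corpus_counts)
cats_vec = node_vector({"cat": 9, "the": 10}, corpus_counts)

# Context of one spot, as raw term counts.
spot_vec = Counter(["the", "engine", "the"])

sim_cars = cosine_similarity(spot_vec, cars_vec)
sim_cats = cosine_similarity(spot_vec, cats_vec)
```

Dividing by corpus frequency keeps common words such as "the" from dominating, so the context mentioning "engine" scores higher against the cars node than against the cats node.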

TBD - Algorithm
- Some internal nodes are very popular:
  - Associate with each node u a measurement M_u^s of how accurate Sim is likely to be at that node.
  - Also M_u^a, how ambiguous the node is overall (consistency of human judgment).
- TBD algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v.
- 82% accuracy on 434 million spots.
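One way the two measurements could be combined into a 1/0 decision is sketched below. This is an assumed decision rule for illustration, not a transcription of the algorithm on the slide: it treats `m_s` as the estimated probability that the similarity-based guess is correct at the node and `m_a` as the estimated probability (from human judgments) that a spot at the node is on topic, and it takes the similarity-based guess as an input rather than computing it. The name `tbd_decide` is hypothetical.

```python
def tbd_decide(sim_on_topic, m_s, m_a):
    """Return 1 if the context is judged on topic for the node, else 0.

    sim_on_topic: bool, the guess derived from the similarity function.
    m_s: assumed probability that the similarity-based guess is right here.
    m_a: assumed probability that a spot at this node is on topic at all.
    """
    # Trust whichever measurement is further from a coin flip (0.5).
    if abs(m_a - 0.5) > abs(m_s - 0.5):
        return 1 if m_a > 0.5 else 0
    guess = 1 if sim_on_topic else 0
    # If the similarity function is usually wrong here, invert its guess.
    return guess if m_s > 0.5 else 1 - guess


decision_sim_trusted = tbd_decide(True, m_s=0.9, m_a=0.6)
decision_prior_trusted = tbd_decide(False, m_s=0.6, m_a=0.95)
```

In the first call the similarity function is the more decisive signal, so its guess is returned; in the second the node is almost always on topic, so the prior overrides the similarity-based guess.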

The TBD Algorithm

SemTag Results
- Applied to 264 million pages.
- Produced 550 million labels.
- Final set of 434 million spots with an accuracy of 82%.

SemTag Methodology
1. Lexicon generation:
  - Approximately 90 million total words.
  - 1.4 million unique words.
  - Kept the most frequent 200,000 words.
2. Similarity functions:
  - Estimated the distribution of terms corresponding to the 192 most common TAP nodes to derive f_u.
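Lexicon generation of this kind amounts to counting words and keeping the most frequent ones. The sketch below assumes lowercasing and whitespace tokenization, neither of which is stated on the slide, and `build_lexicon` is a hypothetical name; a real run would use `size=200_000`.

```python
from collections import Counter


def build_lexicon(documents, size):
    """Return the `size` most frequent words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(size)]


docs = ["the cat sat", "the dog sat", "the end"]
lexicon = build_lexicon(docs, size=2)

# Map each lexicon word to a vector dimension for the similarity step.
word_to_index = {word: i for i, word in enumerate(lexicon)}
```

Words outside the lexicon are simply dropped when contexts are turned into vectors, which is what bounds the vector dimension.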

SemTag Methodology, cont’d
3. Measurement values:
  - Determined based on 750 relevant human judgments.
4. Full TBD processing:
  - Applied to 550 million spots.
5. Evaluation:
  - Compared TBD results with an additional 378 human judgments.
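The evaluation step compares the algorithm's 1/0 outputs with human judgments on the same spots. A minimal sketch of that comparison follows; `accuracy` is a hypothetical helper and the four sample values are made up, not taken from the 378 judgments.

```python
def accuracy(predictions, judgments):
    """Fraction of spots where the algorithm agrees with the human judgment."""
    if len(predictions) != len(judgments):
        raise ValueError("predictions and judgments must have the same length")
    if not judgments:
        raise ValueError("at least one judgment is required")
    agreed = sum(1 for p, j in zip(predictions, judgments) if p == j)
    return agreed / len(judgments)


# Made-up example: the algorithm agrees with the human on 3 of 4 spots.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```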

Seeker
- A platform used by SemTag and other increasingly sophisticated text analytics applications.
- Provides scalable, extensible knowledge extraction from erratic resources.
- Erratic resources?

Seeker Design Goals
- Composability
- Modularity
- Extensibility
- Scalability
- Robustness

Seeker Architecture
[Diagram. Labels: Web, Crawls, Storage & Communication, Indexing (Tokens), Query Processing, Annotators, Miners, SemTag Components. Network-level APIs: modular & extensible. Infrastructure: scalability & robustness.]

Seeker Design
- To achieve modularity and extensibility:
  - A service-oriented architecture (SOA) was used, in which agents communicate through a set of language-independent network-level APIs.
- To achieve scalability and robustness:
  - Infrastructure components.

Infrastructure Components
- The Data Store
  - Central repository for all data storage.
  - Communication medium.
- The Indexer
  - For indexing sequences of tokens.
- The Joiner
  - Query processing component.

Analysis Agents
- Annotators
  - Perform local processing on each web page and write the results back to the store in the form of an annotation.
- Miners
  - Perform intermediate processing.
  - Look at the results of spots on many pages in order to disambiguate them.
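The division of labor between annotators and miners can be sketched with a toy data store. Everything here is hypothetical and in-process: `DataStore`, `spot_annotator`, and `spot_count_miner` are illustrative names, the example pages are invented, and Seeker's actual components communicate over network-level APIs rather than direct method calls.

```python
from collections import defaultdict


class DataStore:
    """Toy central store: holds pages and the annotations written about them."""

    def __init__(self):
        self.pages = {}
        self.annotations = defaultdict(list)

    def put_page(self, url, text):
        self.pages[url] = text

    def annotate(self, url, key, value):
        self.annotations[url].append((key, value))


def spot_annotator(store, labels):
    """Annotator: looks at one page at a time and writes results to the store."""
    for url, text in store.pages.items():
        for label in labels:
            if label in text:
                store.annotate(url, "spot", label)


def spot_count_miner(store):
    """Miner: aggregates the annotators' results across many pages."""
    counts = defaultdict(int)
    for page_annotations in store.annotations.values():
        for key, value in page_annotations:
            if key == "spot":
                counts[value] += 1
    return dict(counts)


store = DataStore()
store.put_page("http://example.com/a", "Jaguar unveils a new engine")
store.put_page("http://example.com/b", "The Jaguar is a large cat")
store.put_page("http://example.com/c", "Nothing relevant here")

spot_annotator(store, ["Jaguar"])
label_counts = spot_count_miner(store)
```

Because the annotator only writes to the store and the miner only reads from it, the two never call each other directly, which is the kind of decoupling the store-as-communication-medium design is meant to allow.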

Observation
- Advantages
  - Other applications can obtain semantic annotations from a web-available database.
  - The TBD algorithm uses both human & computer judgments to resolve ambiguous data.
- Disadvantages
  - The system requires a large amount of storage space.
  - A much larger and richer KB is required to build a web-scale ontology.

Conclusion
- Automatic semantic tagging is essential to bootstrap the Semantic Web.
- It’s possible to achieve good accuracy with simple disambiguation approaches.

Future Work
- Develop more approaches and algorithms for automated tagging.
- Make the annotated data public and offer Seeker as a public service.

Related Work
- Systems built as a result of the Semantic Web fall into two types:
  - Ontology creation (semi-automated).
  - Page annotation.
  - Examples: Protégé, OntoAnnotate, Annotea, SHOE, ...
- Some AI approaches were used, but they need a lot of training. Principal tool: wrapping.
- Some used other NL understanding techniques, for example ALPHA.

Questions?