- Number of slides: 91
ONTOLOGY LEARNING Hui Yang (杨慧) Language Technologies Institute, School of Computer Science, Carnegie Mellon University huiyang@cs.cmu.edu 5 Sep 2008 @ Xi'an Jiaotong University
A LITTLE BIT OF CONTEXT…. THE LANGUAGE TECHNOLOGIES INSTITUTE Carnegie Mellon University's School of Computer Science: 1 undergraduate program, 7 graduate departments (CSD, HCI, LTI, RI, SEI, MLD, ETC) The Language Technologies Institute is a graduate department in the School of Computer Science About 25 faculty About 125 graduate students (~85 Ph.D., ~40 M.S.) About 30 visitors, post-docs, staff programmers, … © 2005 JAMIE CALLAN
A LITTLE BIT OF CONTEXT…. THE LANGUAGE TECHNOLOGIES INSTITUTE LTI courses and research focus on Machine translation, especially high-accuracy MT Natural language processing & computational linguistics Information retrieval & text mining Speech recognition & synthesis Computer-assisted language learning & intelligent tutoring Computational biology ("the language of the human genome") … and combinations of the above Speech-to-speech MT Open domain question answering …
A LITTLE BIT ABOUT ME My Research Interests Text Mining Information Retrieval Natural Language Processing Statistical Machine Learning My Earlier Work Question Answering Multimedia Information Retrieval Near-duplicate Detection Opinion Detection
TODAY’S TALK ONTOLOGY LEARNING
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
INFORMATION RETRIEVAL TECHNOLOGIES Web search engines have changed our lives. But have search engines fulfilled our information needs? Despite Google's great achievement: some, only some. What does search bring us? Overwhelming information in search results; tedious manual judgment is still needed
FIND A GOOD KINDERGARTEN IN THE PITTSBURGH AREA
BUY A USED CAR IN THE PITTSBURGH AREA
IT WOULD BE GREAT TO HAVE A process to Crawl related documents Sort through relevant documents Identify important concepts/topics Organize materials
THIS IS EXACTLY THE TASK OF INFORMATION TRIAGE, OR PERSONAL ONTOLOGY LEARNING
INTRODUCTION An ontology is a data model that represents a set of concepts within a domain and the set of pairwise relationships between those concepts.
EXAMPLE: A SIMPLE ONTOLOGY Game Equipment ball table
EXAMPLE: WORDNET
EXAMPLE: ODP
INTRODUCTION Ontology Learning is the task of constructing a well-defined ontology given a text corpus or a set of concept terms
INTRODUCTION An ontology offers a nice way to summarize the important topics in a domain/collection An ontology facilitates knowledge sharing and reuse An ontology offers relational associations for reasoning and inference
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
CONCEPT EXTRACTION Two Steps: Noun N-gram and Named Entity Mining Web-based Concept Filtering
NOUN N-GRAM MINING I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD percent/NN by/IN 2008/CD ./.
NOUN N-GRAM MINING I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD percent/NN by/IN 2008/CD ./. Extracted bi-grams: mercury emissions, power plants
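The extraction step above can be sketched as a scan for runs of consecutive noun tags over tagger output; a minimal sketch, assuming the tagger has already produced (token, tag) pairs with Penn Treebank tags:

```python
# Sketch of noun n-gram mining from POS-tagged text. Input is assumed to be
# (token, tag) pairs; NN/NNS/NNP/NNPS mark nouns in the Penn Treebank tag set.

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def noun_ngrams(tagged, n=2):
    """Return every run of n consecutive noun tokens."""
    grams = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag in NOUN_TAGS for _, tag in window):
            grams.append(" ".join(tok for tok, _ in window))
    return grams

# The example sentence from the slide, already tagged.
sentence = [("I", "PRP"), ("strongly", "RB"), ("urge", "VBP"), ("you", "PRP"),
            ("to", "TO"), ("cut", "VB"), ("mercury", "NN"), ("emissions", "NNS"),
            ("from", "IN"), ("power", "NN"), ("plants", "NNS"), ("by", "IN"),
            ("90", "CD"), ("percent", "NN"), ("by", "IN"), ("2008", "CD")]

print(noun_ngrams(sentence))   # ['mercury emissions', 'power plants']
```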
CONCEPT FILTERING Web-based POS error detection. Assumption: among the first 10 Google snippets, a valid concept appears more than a threshold number of times (4 in our case). Remove POS errors: e.g. protect/NN polar/NN bear/NN. Remove spelling errors: e.g. pullution, polor bear
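The filtering rule can be sketched as a frequency check over snippets; a minimal sketch where `fetch_snippets` is a hypothetical stub standing in for the Google query the real system issues:

```python
# Sketch of Web-based concept filtering. The real system queries Google and
# counts how often the candidate phrase occurs in the top-10 snippets; here
# fetch_snippets is a canned stub (hypothetical, for illustration only).

def fetch_snippets(query, k=10):
    # Stub: a real implementation would call a search API here.
    canned = {
        "polar bear": ["the polar bear is a marine mammal ..."] * 7,
        "pullution": ["did you mean pollution?"] * 10,
    }
    return canned.get(query, [""] * k)

def is_valid_concept(phrase, threshold=4):
    """Keep a phrase only if it occurs in more than `threshold` snippets."""
    snippets = fetch_snippets(phrase)
    hits = sum(1 for s in snippets if phrase in s.lower())
    return hits > threshold

print(is_valid_concept("polar bear"))  # True: 7 matching snippets
print(is_valid_concept("pullution"))   # False: the misspelling never matches
```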
CONCEPT EXTRACTION
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
CLUSTERING Hierarchical Clustering Different Strategies for Concepts at Different Abstraction Levels
EXAMPLE: A SIMPLE ONTOLOGY Game, Equipment (abstract level); ball, table (concrete level)
BOTTOM-UP HIERARCHICAL CLUSTERING Start from concrete concepts. Concept candidates are organized into groups based on the 1st sense of the head noun in WordNet. One of their common head nouns is selected as the parent concept for the group: pollution subsumes water pollution and air pollution. This creates a high-accuracy concept forest at the lower levels of the ontology
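The grouping step can be sketched as bucketing phrases by their head noun; a minimal sketch that takes the last token as the head (the WordNet first-sense lookup the slide mentions is omitted here):

```python
# Sketch of grouping concrete concepts under a shared head noun: each noun
# phrase's last token is taken as its head, and phrases sharing a head are
# grouped under that head as the parent concept. (The actual system also
# consults the head noun's 1st WordNet sense; that lookup is omitted.)

from collections import defaultdict

def group_by_head(phrases):
    groups = defaultdict(list)
    for p in phrases:
        head = p.split()[-1]          # head noun of the phrase
        groups[head].append(p)
    return dict(groups)

concepts = ["water pollution", "air pollution", "power plant", "coal plant"]
print(group_by_head(concepts))
# {'pollution': ['water pollution', 'air pollution'],
#  'plant': ['power plant', 'coal plant']}
```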
ONTOLOGY FRAGMENTS Different fragments are grouped
CONTINUING BOTTOM-UP Problem: still a forest; many concepts at the top level are not grouped. Solution: clustering. But any clustering algorithm needs a metric, and it is hard to know the right metric for measuring distances between those top-level nodes
HUMAN-GUIDED ONTOLOGY LEARNING Learn what? A distance metric function. Learn from what? Concepts at lower levels (since they are highly accurate) and user feedback. After learning, then what? Apply the distance metric function to concepts at the higher level to get distance scores for them, then use any clustering algorithm to group them based on those scores
TRAINING DATA FROM LOWER LEVELS A set of concepts x(i) on the i-th level of the ontology hierarchy, and a distance matrix Y(i). The matrix entry corresponding to concepts x(i)j and x(i)k is y(i)jk ∈ {0, 1}: y(i)jk = 0 if x(i)j and x(i)k are in the same group; y(i)jk = 1 otherwise.
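Building the training matrix from the lower-level groups is direct; a minimal sketch:

```python
# Sketch of building the training distance matrix from the lower-level
# concept groups: y_jk = 0 when two concepts share a group, 1 otherwise.

def distance_matrix(concepts, group_of):
    n = len(concepts)
    return [[0 if group_of[concepts[j]] == group_of[concepts[k]] else 1
             for k in range(n)] for j in range(n)]

concepts = ["water pollution", "air pollution", "power plant"]
group_of = {"water pollution": "pollution",
            "air pollution": "pollution",
            "power plant": "plant"}
print(distance_matrix(concepts, group_of))
# [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
```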
TRAINING DATA FROM LOWER LEVELS
LEARNING THE DISTANCE METRIC The distance metric is represented as a Mahalanobis distance, d(x(i)j, x(i)k) = Φ(x(i)j, x(i)k)T A Φ(x(i)j, x(i)k), where Φ(x(i)j, x(i)k) represents a set of pairwise underlying feature functions and A is a positive semi-definite matrix, the parameter we need to learn. Parameter estimation is done by minimizing squared errors
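As a simplified, dependency-free sketch of this learning step: if A is restricted to be diagonal, positive semi-definiteness reduces to non-negative entries, so the SDP that the slides solve with a standard solver can be approximated by projected gradient descent on the squared error. This diagonal restriction is my simplification, not the method in the slides:

```python
# Minimal sketch of metric learning with a diagonal A: minimize
# sum_jk (phi_jk^T A phi_jk - y_jk)^2 by SGD, projecting each diagonal
# entry back to >= 0 to keep A positive semi-definite. (The full system
# learns a general A with an off-the-shelf SDP solver instead.)

def learn_diag_metric(pairs, steps=2000, lr=0.01):
    """pairs: list of (phi, y) with phi a feature vector and y in {0, 1}."""
    dim = len(pairs[0][0])
    a = [0.0] * dim
    for _ in range(steps):
        for phi, y in pairs:
            pred = sum(a[i] * phi[i] * phi[i] for i in range(dim))
            err = pred - y
            for i in range(dim):
                a[i] -= lr * 2 * err * phi[i] * phi[i]
                a[i] = max(a[i], 0.0)        # projection: keep A PSD
    return a

# Toy data: the second feature perfectly separates same/different groups.
train = [([0.2, 0.0], 0), ([0.1, 0.0], 0), ([0.3, 1.0], 1), ([0.2, 1.0], 1)]
a = learn_diag_metric(train)
d = lambda phi: sum(a[i] * phi[i] ** 2 for i in range(2))
print(d([0.2, 0.0]) < d([0.2, 1.0]))   # True: same-group pairs score smaller
```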
SOLVE THE OPTIMIZATION PROBLEM Optimization can be done by Newton's method, the interior-point method, or any standard semi-definite programming (SDP) solver (SeDuMi, YALMIP)
GENERATE DISTANCE SCORES We have learned A! For any pair of concepts (x(i+1)l, x(i+1)m) at the higher level, the corresponding entry in the distance matrix Y(i+1) is y(i+1)lm = Φ(x(i+1)l, x(i+1)m)T A Φ(x(i+1)l, x(i+1)m)
K-MEDOIDS CLUSTERING Flat clustering at a level. Use one of the concepts as the cluster center. Estimate the number of clusters by the gap statistic [Tibshirani et al. 2000]
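K-medoids fits this setting because medoids are actual concepts, so each cluster center can serve directly as a representative. A minimal sketch over a precomputed distance matrix (the cluster count k is fixed here for brevity; the slides estimate it with the gap statistic):

```python
# Sketch of k-medoids flat clustering over a precomputed distance matrix.
import random

def k_medoids(dist, k, iters=20, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assign every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Re-pick each medoid as the member minimizing intra-cluster cost.
        new = [min(members, key=lambda c: sum(dist[c][o] for o in members))
               for members in clusters.values() if members]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return clusters

# Toy distance matrix: points 0,1 are close; 2,3 are close; groups are far.
D = [[0, 1, 9, 9], [1, 0, 9, 9], [9, 9, 0, 1], [9, 9, 1, 0]]
clusters = k_medoids(D, k=2)
print(sorted(sorted(v) for v in clusters.values()))  # [[0, 1], [2, 3]]
```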
HUMAN-COMPUTER INTERACTION
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
CLUSTER LABELING Problem: concepts are grouped together, but the group is nameless. Solution: a Web-based approach. Send a query formed by concatenating the child concepts to Google, parse the top 10 snippets, and select the most frequent word as the parent of the group
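The labeling procedure can be sketched as a frequency count over snippet words; `fetch_snippets` is again a hypothetical stub standing in for the Google query, and the stop-word list is my own illustrative addition:

```python
# Sketch of naming a cluster: concatenate the child concepts into one query,
# collect snippets, and take the most frequent content word as the parent
# label. The snippet fetch is stubbed (a real system would query Google).
from collections import Counter
import re

def fetch_snippets(query):
    # Stub standing in for the top-10 search snippets.
    return ["water pollution and air pollution are forms of pollution",
            "pollution from power plants"]

def label_cluster(children):
    query = " ".join(children)
    words = re.findall(r"[a-z]+", " ".join(fetch_snippets(query)).lower())
    stop = {"and", "are", "of", "from", "the", "a", "in"}   # illustrative
    counts = Counter(w for w in words if w not in stop)
    return counts.most_common(1)[0][0]

print(label_cluster(["water pollution", "air pollution"]))  # pollution
```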
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
USER STUDY 12 graduate students from political science at the University of Pittsburgh, divided into two groups: manual group, 4; interactive group, 8. Task: construct an ontology hierarchy for 4 datasets: mercury, polar bear, wolf, Toxics Release Inventory (TRI). 90-minute limit, or until the user was satisfied
DATASETS
SOFTWARE USED FOR USER STUDY
QUALITY OF MANUAL VS. INTERACTIVE RUNS Manual users show moderate agreement (0.4-0.6). Interactive runs produce results of similar quality. The difference between manual and interactive runs is NOT statistically significant
COSTS OF MANUAL VS. INTERACTIVE RUNS Interactive users make 40% fewer edits (statistically significant). Interactive runs save 30-60 minutes per ontology. Within interactive runs, a human spends 64% less time than in manual runs
CONTRIBUTIONS OF HUMAN-GUIDED ONTOLOGY LEARNING Effectively combines the strengths of automatic systems and human knowledge. Combines many techniques into a unified framework: pattern-based (concept mining), knowledge-based (use of WordNet), Web-based (concept filtering and cluster naming), and machine learning. A detailed independent user study
WHAT TO IMPROVE? Is bottom-up the best way to go? Maybe not: incremental clustering saves the most effort. We have used different techniques for concepts at different levels; how do we formally generalize this? Model concept abstractness explicitly. We have tested on domain-specific corpora; how about corpora for more general purposes? Can we reconstruct WordNet or ODP?
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
CHALLENGES Hard to find a good name for a new group in a bottom-up clustering framework. Formally model concept abstractness: intelligently use different techniques for concepts at different abstraction levels. Flexibly incorporate heterogeneous features: state-of-the-art systems either use one type of semantic evidence to infer all relationships, or use one type of feature for a particular subtask
CHALLENGES AND SOLUTIONS Hard to find a good name for a new group in a bottom-up clustering framework: solution, incremental clustering. Formally model concept abstractness: solution, learn statistical models for each abstraction level. Flexibly incorporate heterogeneous features: solution, separate metric learning from ontology construction
A UNIFIED SOLUTION Metric-based Ontology Learning
LET’S BEGIN WITH SOME IMPORTANT DEFINITIONS An ontology is a data model T = (C, R | D), where C is the concept set, R the relationship set, and D the domain
MORE DEFINITIONS A Full Ontology: Game, Equipment, ball, table
MORE DEFINITIONS A Partial Ontology: Game, Equipment, ball, table
MORE DEFINITIONS Ontology Metric: a distance function over concept pairs induced by the weighted edges of the ontology (figure: the Game/Equipment/ball/table example with edge weights 1.5, 1, and 2, giving distances d = 2, 1, and 4.5)
MORE DEFINITIONS Information in an Ontology T: the sum of the ontology-metric distances over the concept pairs in T
MORE DEFINITIONS Information in a Level L: the sum of the ontology-metric distances over the concept pairs at level L
ASSUMPTIONS OF ONTOLOGY Minimum Evolution Assumption: the optimal ontology is the one that introduces the least information change! (Build animation: starting from an empty ontology, the concepts ball, table, Equipment, and Game are added one at a time, each placed so as to minimize the information change, until the full Game/Equipment/ball/table ontology is formed.)
ASSUMPTIONS OF ONTOLOGY Abstractness Assumption: each abstraction level has its own information function, e.g. Info1(·) for Game, Info2(·) for Equipment, and Info3(·) for ball and table
FORMAL FORMULATION OF ONTOLOGY LEARNING The task of Ontology Learning is defined as the construction of a full ontology T given a set of concepts C and an initial partial ontology T0 (note that T0 could be empty): keep adding concepts from C into T0 until a full ontology is formed
GOAL OF ONTOLOGY LEARNING Find the optimal full ontology T* such that the information change since T0 is least, i.e., T* = argmin_T |Info(T) - Info(T0)|. Note that this follows from the Minimum Evolution Assumption
GET TO THE GOAL Goal: since the optimal set of concepts is always C, concepts are added incrementally
GET TO THE GOAL Plug the definition of information change into the Minimum Evolution objective function, and transform it into a minimization problem
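The incremental, minimum-evolution construction can be sketched greedily: with information defined as the sum of edge distances, attaching concept c under parent p changes the information by exactly metric(p, c), so each new concept is placed under the node minimizing that change. The metric values below are illustrative toy numbers, not learned ones:

```python
# Sketch of incremental minimum-evolution ontology construction: add
# concepts one at a time, attaching each under the existing node that
# yields the smallest information change (here, the smallest edge distance).

def build_ontology(root, concepts, metric):
    parent = {root: None}
    for c in concepts:
        best = min(parent, key=lambda p: metric(p, c))  # least info change
        parent[c] = best
    return parent

# Toy metric: hand-set distances between the example concepts.
d = {("game", "equipment"): 2,
     ("equipment", "ball"): 1, ("equipment", "table"): 1,
     ("game", "ball"): 3, ("game", "table"): 3,
     ("ball", "table"): 4}
def metric(a, b):
    return d.get((a, b), d.get((b, a), 10))

tree = build_ontology("game", ["equipment", "ball", "table"], metric)
print(tree)
# {'game': None, 'equipment': 'game', 'ball': 'equipment', 'table': 'equipment'}
```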
EXPLICITLY MODEL ABSTRACTNESS Model abstractness for each level by a least-squares fit: plug in the definition of the amount of information for an abstraction level to obtain the abstractness objective function
MULTIPLE-CRITERION OPTIMIZATION FUNCTION The Minimum Evolution objective function and the abstractness objective function are combined via a scalarization variable
THE OPTIMIZATION ALGORITHM
ESTIMATING THE ONTOLOGY METRIC Assume the ontology metric is a linear interpolation of some underlying feature functions; use ridge regression to estimate and predict the ontology metric
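A minimal sketch of this estimation step, assuming the metric is linear in the feature functions, d(xj, xk) = w · φ(xj, xk), with the ridge solution w = (XᵀX + λI)⁻¹Xᵀy hand-rolled for two features to stay dependency-free (the feature values below are made up for illustration):

```python
# Sketch of estimating the ontology metric by ridge regression on pairwise
# feature vectors, using the closed form w = (X^T X + lam*I)^-1 X^T y
# solved by hand for the 2-feature case.

def ridge_2d(X, y, lam=0.1):
    # Accumulate X^T X (plus lam on the diagonal) and X^T y.
    s11 = sum(x[0] * x[0] for x in X) + lam
    s22 = sum(x[1] * x[1] for x in X) + lam
    s12 = sum(x[0] * x[1] for x in X)
    t1 = sum(x[0] * yi for x, yi in zip(X, y))
    t2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Toy feature vectors for concept pairs (e.g. a KL-divergence score and a
# co-occurrence score) with their known lower-level distances as targets.
X = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
y = [0.0, 0.0, 1.0, 1.0]
w = ridge_2d(X, y)
predict = lambda phi: w[0] * phi[0] + w[1] * phi[1]
print(predict((0.9, 0.1)) > predict((0.1, 0.9)))  # True: learned ordering
```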
FEATURES Google KL-divergence; Wikipedia KL-divergence; Minipar syntactic distance; lexico-syntactic patterns; term co-occurrence; word-length difference
EVALUATION Reconstruct subdirectories from WordNet and ODP. 50 WordNet subdirectories from 12 topics: gathering, professional, people, building, place, milk, meal, water, beverage, alcohol, dish and herb. 50 ODP subdirectories from 16 topics: computers, robotics, intranet, mobile computing, database, operating system, linux, tex, software, computer science, data communication, algorithms, data formats, security, multimedia and artificial intelligence
DATASETS
ONTOLOGY RECONSTRUCTION An absolute gain of 10% compared to the state-of-the-art system developed at Stanford University (ACL 2006 Best Paper)
INTERACTION OF ABSTRACTION LEVELS AND FEATURES Abstract concepts are sensitive to the explicit modeling: good modeling of abstract concepts greatly boosts performance. Contributions from different features vary for abstract concepts; for concrete concepts they are largely indifferent. Simple features (term co-occurrence, word length) work best. A combination of heterogeneous features works better than individual features
CONTRIBUTIONS OF METRIC-BASED ONTOLOGY LEARNING Avoids, and hence solves, the problem of unknown group names. Tackles the problem of no control over concept abstractness: experiments show that concepts at different abstraction levels behave differently and are sensitive to different features. Provides a solution for incorporating heterogeneous features. An absolute gain of 10% in precision on both WordNet and ODP over a state-of-the-art system
WE HAVE TALKED ABOUT The Task of Information Triage and Personal Ontology Learning Human-guided Ontology Learning Metric-based Ontology Learning
AT THE BEGINNING, WE SAID: IT WILL BE GREAT TO HAVE A process to Crawl related documents Sort through relevant documents Identify important concepts/topics Organize materials
FIND A GOOD KINDERGARTEN IN THE PITTSBURGH AREA Are we there yet?
THE KINDERGARTEN EXAMPLE We are DONE with the organization! However, does it support further inference for decent decision making? Maybe not: future work! More future work: model multiple relationships simultaneously; more efficient distance metric learning
THANK YOU AND QUESTIONS Hui Yang (huiyang@cs.cmu.edu)