- Number of slides: 91
ONTOLOGY LEARNING Hui Yang (杨慧) Language Technologies Institute, School of Computer Science, Carnegie Mellon University huiyang@cs.cmu.edu 5 Sep 2008 @ Xi'an Jiaotong University
A LITTLE BIT OF CONTEXT…. THE LANGUAGE TECHNOLOGIES INSTITUTE Carnegie Mellon University's School of Computer Science: 1 undergraduate program, 7 graduate departments (CSD, HCI, LTI, RI, SEI, MLD, ETC) The Language Technologies Institute is a graduate department in the School of Computer Science About 25 faculty About 125 graduate students (~85 Ph.D., ~40 M.S.) About 30 visitors, post-docs, staff programmers, … © 2005 JAMIE CALLAN
A LITTLE BIT OF CONTEXT…. THE LANGUAGE TECHNOLOGIES INSTITUTE LTI courses and research focus on Machine translation, especially high-accuracy MT Natural language processing & computational linguistics Information retrieval & text mining Speech recognition & synthesis Computer-assisted language learning & intelligent tutoring Computational biology ("the language of the human genome") … and combinations of the above Speech-to-speech MT Open domain question answering …
A LITTLE BIT ABOUT ME My Research Interests Text Mining Information Retrieval Natural Language Processing Statistical Machine Learning My Earlier Work Question Answering Multimedia Information Retrieval Near-duplicate Detection Opinion Detection
TODAY’S TALK ONTOLOGY LEARNING
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
INFORMATION RETRIEVAL TECHNOLOGIES Web search engines have changed our lives. But have search engines fulfilled our information needs? Despite Google's great achievement: some, only some. What does search bring us? Overwhelming information in search results; tedious manual judgment is still needed
FIND A GOOD KINDERGARTEN IN THE PITTSBURGH AREA
BUY A USED CAR IN THE PITTSBURGH AREA
IT WOULD BE GREAT TO HAVE A process to Crawl related documents Sort through relevant documents Identify important concepts/topics Organize materials
THIS IS EXACTLY THE TASK OF INFORMATION TRIAGE, OR PERSONAL ONTOLOGY LEARNING
INTRODUCTION An ontology is a data model that represents a set of concepts within a domain and the set of pairwise relationships between those concepts.
EXAMPLE: A SIMPLE ONTOLOGY Game Equipment ball table
EXAMPLE: WORDNET
EXAMPLE: ODP
INTRODUCTION Ontology Learning is the task of constructing a well-defined ontology given a text corpus or a set of concept terms
INTRODUCTION An ontology offers a nice way to summarize the important topics in a domain/collection An ontology facilitates knowledge sharing and reuse An ontology offers relational associations for reasoning and inference
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
CONCEPT EXTRACTION Two Steps: Noun N-gram and Named Entity Mining Web-based Concept Filtering
NOUN N-GRAM MINING I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD percent/NN by/IN 2008/CD ./.
NOUN N-GRAM MINING I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD percent/NN by/IN 2008/CD ./. Extracted bi-grams: mercury emissions, power plants
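The extraction step above can be sketched as a scan for runs of consecutive noun tags over tagger output; a minimal sketch, assuming the tagger has already produced (token, tag) pairs with Penn Treebank tags:

```python
# Sketch of noun n-gram mining from POS-tagged text. Input is assumed to be
# (token, tag) pairs; NN/NNS/NNP/NNPS mark nouns in the Penn Treebank tag set.

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def noun_ngrams(tagged, n=2):
    """Return every run of n consecutive noun tokens."""
    grams = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag in NOUN_TAGS for _, tag in window):
            grams.append(" ".join(tok for tok, _ in window))
    return grams

# The example sentence from the slide, already tagged.
sentence = [("I", "PRP"), ("strongly", "RB"), ("urge", "VBP"), ("you", "PRP"),
            ("to", "TO"), ("cut", "VB"), ("mercury", "NN"), ("emissions", "NNS"),
            ("from", "IN"), ("power", "NN"), ("plants", "NNS"), ("by", "IN"),
            ("90", "CD"), ("percent", "NN"), ("by", "IN"), ("2008", "CD")]

print(noun_ngrams(sentence))   # ['mercury emissions', 'power plants']
```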
CONCEPT FILTERING Web-based POS error detection. Assumption: among the first 10 Google snippets, a valid concept appears more than a threshold number of times (4 in our case). Remove POS errors: e.g. protect/NN polar/NN bear/NN. Remove spelling errors: e.g. pullution, polor bear
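The filtering rule can be sketched as a frequency check over snippets; a minimal sketch where `fetch_snippets` is a hypothetical stub standing in for the Google query the real system issues:

```python
# Sketch of Web-based concept filtering. The real system queries Google and
# counts how often the candidate phrase occurs in the top-10 snippets; here
# fetch_snippets is a canned stub (hypothetical, for illustration only).

def fetch_snippets(query, k=10):
    # Stub: a real implementation would call a search API here.
    canned = {
        "polar bear": ["the polar bear is a marine mammal ..."] * 7,
        "pullution": ["did you mean pollution?"] * 10,
    }
    return canned.get(query, [""] * k)

def is_valid_concept(phrase, threshold=4):
    """Keep a phrase only if it occurs in more than `threshold` snippets."""
    snippets = fetch_snippets(phrase)
    hits = sum(1 for s in snippets if phrase in s.lower())
    return hits > threshold

print(is_valid_concept("polar bear"))  # True: 7 matching snippets
print(is_valid_concept("pullution"))   # False: the misspelling never matches
```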
CONCEPT EXTRACTION
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
CLUSTERING Hierarchical Clustering Different Strategies for Concepts at Different Abstraction Levels
EXAMPLE: A SIMPLE ONTOLOGY Game, Equipment (abstract level); ball, table (concrete level)
BOTTOM-UP HIERARCHICAL CLUSTERING Start from concrete concepts. Concept candidates are organized into groups based on the 1st sense of the head noun in WordNet. One of their common head nouns is selected as the parent concept for the group: pollution subsumes water pollution and air pollution. This creates a high-accuracy concept forest at the lower levels of the ontology
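The grouping step can be sketched as bucketing phrases by their head noun; a minimal sketch that takes the last token as the head (the WordNet first-sense lookup the slide mentions is omitted here):

```python
# Sketch of grouping concrete concepts under a shared head noun: each noun
# phrase's last token is taken as its head, and phrases sharing a head are
# grouped under that head as the parent concept. (The actual system also
# consults the head noun's 1st WordNet sense; that lookup is omitted.)

from collections import defaultdict

def group_by_head(phrases):
    groups = defaultdict(list)
    for p in phrases:
        head = p.split()[-1]          # head noun of the phrase
        groups[head].append(p)
    return dict(groups)

concepts = ["water pollution", "air pollution", "power plant", "coal plant"]
print(group_by_head(concepts))
# {'pollution': ['water pollution', 'air pollution'],
#  'plant': ['power plant', 'coal plant']}
```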
ONTOLOGY FRAGMENTS Different fragments are grouped
CONTINUING BOTTOM-UP Problem: still a forest; many concepts at the top level are not grouped. Solution: clustering. But any clustering algorithm needs a metric, and it is hard to know the right metric for measuring distances between those top-level nodes
HUMAN-GUIDED ONTOLOGY LEARNING Learn what? A distance metric function. Learn from what? Concepts at lower levels (since they are highly accurate) and user feedback. After learning, then what? Apply the distance metric function to concepts at the higher level to get distance scores for them, then use any clustering algorithm to group them based on those scores
TRAINING DATA FROM LOWER LEVELS A set of concepts x(i) on the i-th level of the ontology hierarchy, and a distance matrix Y(i). The matrix entry corresponding to concepts x(i)j and x(i)k is y(i)jk ∈ {0, 1}: y(i)jk = 0 if x(i)j and x(i)k are in the same group; y(i)jk = 1 otherwise.
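Building the training matrix from the lower-level groups is direct; a minimal sketch:

```python
# Sketch of building the training distance matrix from the lower-level
# concept groups: y_jk = 0 when two concepts share a group, 1 otherwise.

def distance_matrix(concepts, group_of):
    n = len(concepts)
    return [[0 if group_of[concepts[j]] == group_of[concepts[k]] else 1
             for k in range(n)] for j in range(n)]

concepts = ["water pollution", "air pollution", "power plant"]
group_of = {"water pollution": "pollution",
            "air pollution": "pollution",
            "power plant": "plant"}
print(distance_matrix(concepts, group_of))
# [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
```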
TRAINING DATA FROM LOWER LEVELS
LEARNING THE DISTANCE METRIC The distance metric is represented as a Mahalanobis distance, d(x(i)j, x(i)k) = Φ(x(i)j, x(i)k)T A Φ(x(i)j, x(i)k), where Φ(x(i)j, x(i)k) represents a set of pairwise underlying feature functions and A is a positive semi-definite matrix, the parameter we need to learn. Parameter estimation is done by minimizing squared errors
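As a simplified, dependency-free sketch of this learning step: if A is restricted to be diagonal, positive semi-definiteness reduces to non-negative entries, so the SDP that the slides solve with a standard solver can be approximated by projected gradient descent on the squared error. This diagonal restriction is my simplification, not the method in the slides:

```python
# Minimal sketch of metric learning with a diagonal A: minimize
# sum_jk (phi_jk^T A phi_jk - y_jk)^2 by SGD, projecting each diagonal
# entry back to >= 0 to keep A positive semi-definite. (The full system
# learns a general A with an off-the-shelf SDP solver instead.)

def learn_diag_metric(pairs, steps=2000, lr=0.01):
    """pairs: list of (phi, y) with phi a feature vector and y in {0, 1}."""
    dim = len(pairs[0][0])
    a = [0.0] * dim
    for _ in range(steps):
        for phi, y in pairs:
            pred = sum(a[i] * phi[i] * phi[i] for i in range(dim))
            err = pred - y
            for i in range(dim):
                a[i] -= lr * 2 * err * phi[i] * phi[i]
                a[i] = max(a[i], 0.0)        # projection: keep A PSD
    return a

# Toy data: the second feature perfectly separates same/different groups.
train = [([0.2, 0.0], 0), ([0.1, 0.0], 0), ([0.3, 1.0], 1), ([0.2, 1.0], 1)]
a = learn_diag_metric(train)
d = lambda phi: sum(a[i] * phi[i] ** 2 for i in range(2))
print(d([0.2, 0.0]) < d([0.2, 1.0]))   # True: same-group pairs score smaller
```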
SOLVE THE OPTIMIZATION PROBLEM Optimization can be done by Newton's method, the interior-point method, or any standard semi-definite programming (SDP) solver (SeDuMi, YALMIP)
GENERATE DISTANCE SCORES We have learned A! For any pair of concepts (x(i+1)l, x(i+1)m) at the higher level, the corresponding entry in the distance matrix Y(i+1) is y(i+1)lm = Φ(x(i+1)l, x(i+1)m)T A Φ(x(i+1)l, x(i+1)m)
K-MEDOIDS CLUSTERING Flat clustering at a level. Use one of the concepts as the cluster center. Estimate the number of clusters by the gap statistic [Tibshirani et al. 2000]
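K-medoids fits this setting because medoids are actual concepts, so each cluster center can serve directly as a representative. A minimal sketch over a precomputed distance matrix (the cluster count k is fixed here for brevity; the slides estimate it with the gap statistic):

```python
# Sketch of k-medoids flat clustering over a precomputed distance matrix.
import random

def k_medoids(dist, k, iters=20, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assign every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Re-pick each medoid as the member minimizing intra-cluster cost.
        new = [min(members, key=lambda c: sum(dist[c][o] for o in members))
               for members in clusters.values() if members]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return clusters

# Toy distance matrix: points 0,1 are close; 2,3 are close; groups are far.
D = [[0, 1, 9, 9], [1, 0, 9, 9], [9, 9, 0, 1], [9, 9, 1, 0]]
clusters = k_medoids(D, k=2)
print(sorted(sorted(v) for v in clusters.values()))  # [[0, 1], [2, 3]]
```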
HUMAN-COMPUTER INTERACTION
SUBTASKS IN ONTOLOGY LEARNING Concept Extraction Synonym Detection Relationship Formulation by Clustering Cluster Labeling
CLUSTER LABELING Problem: concepts are grouped together, but the group is nameless. Solution: a Web-based approach. Send a query formed by concatenating the child concepts to Google, parse the top 10 snippets, and select the most frequent word as the parent of the group
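The labeling procedure can be sketched as a frequency count over snippet words; `fetch_snippets` is again a hypothetical stub standing in for the Google query, and the stop-word list is my own illustrative addition:

```python
# Sketch of naming a cluster: concatenate the child concepts into one query,
# collect snippets, and take the most frequent content word as the parent
# label. The snippet fetch is stubbed (a real system would query Google).
from collections import Counter
import re

def fetch_snippets(query):
    # Stub standing in for the top-10 search snippets.
    return ["water pollution and air pollution are forms of pollution",
            "pollution from power plants"]

def label_cluster(children):
    query = " ".join(children)
    words = re.findall(r"[a-z]+", " ".join(fetch_snippets(query)).lower())
    stop = {"and", "are", "of", "from", "the", "a", "in"}   # illustrative
    counts = Counter(w for w in words if w not in stop)
    return counts.most_common(1)[0][0]

print(label_cluster(["water pollution", "air pollution"]))  # pollution
```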
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
USER STUDY 12 graduate students from political science at the University of Pittsburgh, divided into two groups: manual group, 4; interactive group, 8. Task: construct an ontology hierarchy for 4 datasets: mercury, polar bear, wolf, Toxics Release Inventory (TRI). 90-minute limit, or until the user was satisfied
DATASETS
SOFTWARE USED FOR USER STUDY
QUALITY OF MANUAL VS. INTERACTIVE RUNS Manual users show moderate agreement (0.4-0.6). Interactive runs produce results of similar quality. The difference between manual and interactive runs is NOT statistically significant
COSTS OF MANUAL VS. INTERACTIVE RUNS Interactive users make 40% fewer edits (statistically significant). Interactive runs save 30-60 minutes per ontology. Within interactive runs, a human spends 64% less time than in manual runs
CONTRIBUTIONS OF HUMAN-GUIDED ONTOLOGY LEARNING Effectively combines the strengths of automatic systems and human knowledge. Combines many techniques into a unified framework: pattern-based (concept mining), knowledge-based (use of WordNet), Web-based (concept filtering and cluster naming), and machine learning. A detailed independent user study
WHAT TO IMPROVE? Is bottom-up the best way to go? Maybe not: incremental clustering saves the most effort. We have used different techniques for concepts at different levels; how do we formally generalize this? Model concept abstractness explicitly. We have tested on domain-specific corpora; how about corpora for more general purposes? Can we reconstruct WordNet or ODP?
ROADMAP Introduction Subtasks in Ontology Learning Human-Guided Ontology Learning User Study Metric-Based Ontology Learning Experimental Results Conclusions
CHALLENGES Hard to find a good name for a new group in a bottom-up clustering framework. Formally model concept abstractness: intelligently use different techniques for concepts at different abstraction levels. Flexibly incorporate heterogeneous features: state-of-the-art systems either use one type of semantic evidence to infer all relationships, or use one type of feature for a particular subtask
CHALLENGES AND SOLUTIONS Hard to find a good name for a new group in a bottom-up clustering framework: solution, incremental clustering. Formally model concept abstractness: solution, learn statistical models for each abstraction level. Flexibly incorporate heterogeneous features: solution, separate metric learning from ontology construction
A UNIFIED SOLUTION Metric-based Ontology Learning
LET’S BEGIN WITH SOME IMPORTANT DEFINITIONS An ontology is a data model T = (C, R | D), where C is the concept set, R the relationship set, and D the domain
MORE DEFINITIONS A Full Ontology: Game, Equipment, ball, table
MORE DEFINITIONS A Partial Ontology: Game, Equipment, ball, table
MORE DEFINITIONS Ontology Metric: a distance function over concept pairs induced by the weighted edges of the ontology (figure: the Game/Equipment/ball/table example with edge weights 1.5, 1, and 2, giving distances d = 2, 1, and 4.5)
MORE DEFINITIONS Information in an Ontology T: the sum of the ontology-metric distances over the concept pairs in T
MORE DEFINITIONS Information in a Level L: the sum of the ontology-metric distances over the concept pairs at level L
ASSUMPTIONS OF ONTOLOGY Minimum Evolution Assumption: the optimal ontology is the one that introduces the least information change! (Build animation: starting from an empty ontology, the concepts ball, table, Equipment, and Game are added one at a time, each placed so as to minimize the information change, until the full Game/Equipment/ball/table ontology is formed.)
ASSUMPTIONS OF ONTOLOGY Abstractness Assumption: each abstraction level has its own information function, e.g. Info1(·) for Game, Info2(·) for Equipment, and Info3(·) for ball and table
FORMAL FORMULATION OF ONTOLOGY LEARNING The task of Ontology Learning is defined as the construction of a full ontology T given a set of concepts C and an initial partial ontology T0 (note that T0 could be empty): keep adding concepts from C into T0 until a full ontology is formed
GOAL OF ONTOLOGY LEARNING Find the optimal full ontology T* such that the information change since T0 is least, i.e., T* = argmin_T |Info(T) - Info(T0)|. Note that this follows from the Minimum Evolution Assumption
GET TO THE GOAL Goal: since the optimal set of concepts is always C, concepts are added incrementally
GET TO THE GOAL Plug the definition of information change into the Minimum Evolution objective function, and transform it into a minimization problem
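The incremental, minimum-evolution construction can be sketched greedily: with information defined as the sum of edge distances, attaching concept c under parent p changes the information by exactly metric(p, c), so each new concept is placed under the node minimizing that change. The metric values below are illustrative toy numbers, not learned ones:

```python
# Sketch of incremental minimum-evolution ontology construction: add
# concepts one at a time, attaching each under the existing node that
# yields the smallest information change (here, the smallest edge distance).

def build_ontology(root, concepts, metric):
    parent = {root: None}
    for c in concepts:
        best = min(parent, key=lambda p: metric(p, c))  # least info change
        parent[c] = best
    return parent

# Toy metric: hand-set distances between the example concepts.
d = {("game", "equipment"): 2,
     ("equipment", "ball"): 1, ("equipment", "table"): 1,
     ("game", "ball"): 3, ("game", "table"): 3,
     ("ball", "table"): 4}
def metric(a, b):
    return d.get((a, b), d.get((b, a), 10))

tree = build_ontology("game", ["equipment", "ball", "table"], metric)
print(tree)
# {'game': None, 'equipment': 'game', 'ball': 'equipment', 'table': 'equipment'}
```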
EXPLICITLY MODEL ABSTRACTNESS Model abstractness for each level by a least-squares fit: plug in the definition of the amount of information for an abstraction level to obtain the abstractness objective function
MULTIPLE-CRITERION OPTIMIZATION FUNCTION The Minimum Evolution objective function and the abstractness objective function are combined via a scalarization variable
THE OPTIMIZATION ALGORITHM
ESTIMATING THE ONTOLOGY METRIC Assume the ontology metric is a linear interpolation of some underlying feature functions; use ridge regression to estimate and predict the ontology metric
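A minimal sketch of this estimation step, assuming the metric is linear in the feature functions, d(xj, xk) = w · φ(xj, xk), with the ridge solution w = (XᵀX + λI)⁻¹Xᵀy hand-rolled for two features to stay dependency-free (the feature values below are made up for illustration):

```python
# Sketch of estimating the ontology metric by ridge regression on pairwise
# feature vectors, using the closed form w = (X^T X + lam*I)^-1 X^T y
# solved by hand for the 2-feature case.

def ridge_2d(X, y, lam=0.1):
    # Accumulate X^T X (plus lam on the diagonal) and X^T y.
    s11 = sum(x[0] * x[0] for x in X) + lam
    s22 = sum(x[1] * x[1] for x in X) + lam
    s12 = sum(x[0] * x[1] for x in X)
    t1 = sum(x[0] * yi for x, yi in zip(X, y))
    t2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

# Toy feature vectors for concept pairs (e.g. a KL-divergence score and a
# co-occurrence score) with their known lower-level distances as targets.
X = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
y = [0.0, 0.0, 1.0, 1.0]
w = ridge_2d(X, y)
predict = lambda phi: w[0] * phi[0] + w[1] * phi[1]
print(predict((0.9, 0.1)) > predict((0.1, 0.9)))  # True: learned ordering
```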
FEATURES Google KL-divergence; Wikipedia KL-divergence; Minipar syntactic distance; lexico-syntactic patterns; term co-occurrence; word-length difference
EVALUATION Reconstruct subdirectories from WordNet and ODP. 50 WordNet subdirectories from 12 topics: gathering, professional, people, building, place, milk, meal, water, beverage, alcohol, dish and herb. 50 ODP subdirectories from 16 topics: computers, robotics, intranet, mobile computing, database, operating system, linux, tex, software, computer science, data communication, algorithms, data formats, security, multimedia and artificial intelligence
DATASETS
ONTOLOGY RECONSTRUCTION An absolute gain of 10% compared to the state-of-the-art system developed at Stanford University (ACL 2006 Best Paper)
INTERACTION OF ABSTRACTION LEVELS AND FEATURES Abstract concepts are sensitive to the explicit modeling: good modeling of abstract concepts greatly boosts performance. Contributions from different features vary for abstract concepts; for concrete concepts they are largely indifferent. Simple features (term co-occurrence, word length) work best. A combination of heterogeneous features works better than individual features
CONTRIBUTIONS OF METRIC-BASED ONTOLOGY LEARNING Avoids, and hence solves, the problem of unknown group names. Tackles the problem of no control over concept abstractness: experiments show that concepts at different abstraction levels behave differently and are sensitive to different features. Provides a solution for incorporating heterogeneous features. An absolute gain of 10% in precision on both WordNet and ODP over a state-of-the-art system
WE HAVE TALKED ABOUT The Task of Information Triage and Personal Ontology Learning Human-guided Ontology Learning Metric-based Ontology Learning
AT THE BEGINNING, WE SAID: IT WILL BE GREAT TO HAVE A process to Crawl related documents Sort through relevant documents Identify important concepts/topics Organize materials
FIND A GOOD KINDERGARTEN IN THE PITTSBURGH AREA Are we there yet?
THE KINDERGARTEN EXAMPLE We are DONE with the organization! However, does it support further inference for decent decision making? Maybe not: future work! More future work: model multiple relationships simultaneously; more efficient distance metric learning
THANK YOU AND QUESTIONS Hui Yang (huiyang@cs.cmu.edu)