

  • Number of slides: 28

Automating information integration
Sunita Sarawagi

Data integration

  • The process of integrating data from multiple, heterogeneous, loosely structured information sources into a single well-defined structured database
  • A tedious process of:
    • Matching schemas of component databases
    • Extracting structure from any unstructured sources
    • Eliminating duplicates among overlapping sources
    • Substituting missing values
    • Detecting and correcting errors

Application scenarios

  • Data warehousing:
    • Phenomenal amount of time and resources spent on data cleaning
    • Example: segmenting and merging name-address lists during data warehousing
  • Web:
    • Creating structured databases from distributed, unstructured web pages
    • Citation databases: Citeseer and Cora

Recent trends

  • A classical problem that has bothered researchers and practitioners for decades
  • Several commercial solutions for enterprise data integration exist [mid-80s]:
    • Manual, domain-specific, data-driven script-based tools
    • Example: name/address cleaning
    • Require high expertise to code and maintain

Scope of the work

  • Novel application of data mining and machine learning techniques to automate data cleaning operations
  • Focus on two operations:
    • Information extraction
    • Duplicate elimination

Datamold: automatic segmentation of text into structured records

Problem definition

Source: a concatenation of structured elements with limited reordering and some missing fields

  • Examples: addresses, bibliographic records

Address example (slide figure): the string "156 Hillside ctype Scenic drive Powai Mumbai 400076" segmented into the elements House number, Building, Road, Area, City, Zip.

Bibliographic example, segmented into Author, Year, Title, Journal, Volume, Page: P. P. Wangikar, T. P. Graycar, D. A. Estell, D. S. Clark, J. S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

Learning to segment

Given:
  • a list of structured elements
  • several examples showing the position of structured elements in text,
train a model to identify them in unseen text.

At the top level this is a classification problem. Input features:
  • Content of the element
  • Specific keywords like street, zip, vol, pp
  • Position of the element
  • Inter-element sequencing
  • Intra-element sequencing

IE with Hidden Markov Models

  • Probabilistic models for IE

[Slide figure: an HMM with states Author, Year, Title, Journal. Each state has emission probabilities over symbols (e.g. Author: Word 0.5, Letter 0.3, Et.al 0.1; Year: dddd 0.8, dd 0.2; Journal: journal 0.4, IEEE 0.3, ACM 0.2) and the states are connected by transition probabilities (values 0.5, 0.9, 0.6, 0.8, 0.2 appear on the edges).]

Training the HMM

  • Two parts:
    • Learning the structure
    • Learning parameters given the structure:
      • Dictionary (emission) probabilities
      • Transition probabilities

HMM structure

  • Naïve model: one state per element
    • Example: … Mahatma Gandhi Road Near Parkland …
  • Nested model: each element is itself another HMM
    • Example: … Mahatma Gandhi Road [Near Parkland : Landmark] …

HMM dictionary

  • For each word (= feature), associate the probability of emitting that word
    • Multinomial model
  • More advanced models with overlapping features of a word, for example:
    • part of speech
    • capitalized or not
    • type: number, letter, word, etc.
    • Maximum entropy models
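To make the "overlapping features" idea concrete, here is a small sketch of the kind of per-word features the slide lists (type and capitalization alongside the word itself). The feature names and regular expressions are my own illustration, not Datamold's actual feature set.

```python
import re

def word_features(word):
    """Map a token to overlapping features: the word, its coarse type, and capitalization."""
    feats = {"word": word.lower()}
    if re.fullmatch(r"\d+", word):
        feats["type"] = "number"
    elif re.fullmatch(r"[A-Za-z]", word):
        feats["type"] = "letter"
    elif re.fullmatch(r"[A-Za-z]+", word):
        feats["type"] = "word"
    else:
        feats["type"] = "other"
    feats["capitalized"] = word[:1].isupper()
    return feats

print(word_features("Mumbai"))   # {'word': 'mumbai', 'type': 'word', 'capitalized': True}
print(word_features("400076"))   # {'word': '400076', 'type': 'number', 'capitalized': False}
```

A maximum entropy model can consume all of these features at once, whereas the plain multinomial model conditions only on the word identity.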

Learning model parameters

  • When the training data defines a unique path through the HMM:
    • Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i)
    • Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions from i)
  • When the training data defines multiple paths:
    • A more general EM-like algorithm (Baum-Welch)
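In the unique-path case, the counting estimates above reduce to a few lines of code. The tags and toy address below are illustrative, assuming word-level tokens:

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sequences):
    """MLE transition and emission probabilities from (word, state) sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        # count each state-to-state transition along the unique path
        for (_, s), (_, s_next) in zip(seq, seq[1:]):
            trans[s][s_next] += 1
        # count each symbol emitted from each state
        for w, s in seq:
            emit[s][w] += 1
    def normalize(table):
        return {s: {k: v / sum(c.values()) for k, v in c.items()}
                for s, c in table.items()}
    return normalize(trans), normalize(emit)

data = [[("156", "HouseNo"), ("Scenic", "Road"), ("drive", "Road"),
         ("Powai", "Area"), ("Mumbai", "City"), ("400076", "Zip")]]
trans, emit = train_hmm(data)
print(trans["Road"])  # {'Road': 0.5, 'Area': 0.5}
```

With one Road-to-Road and one Road-to-Area transition in the toy data, the Road state splits its outgoing probability evenly, exactly the ratio of counts the slide describes.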

Using the HMM to segment

  • Find the highest-probability path through the HMM
  • Viterbi: a quadratic dynamic programming algorithm
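A minimal Viterbi decoder illustrating the dynamic program: O(T·S²) over T words and S states. The tiny HMM here is invented for illustration; a real model would come from the parameter-learning step.

```python
def viterbi(words, states, start, trans, emit):
    """Return the highest-probability state path for the word sequence."""
    # delta[s] = probability of the best path ending in state s
    delta = {s: start.get(s, 0.0) * emit[s].get(words[0], 1e-6) for s in states}
    back = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] * trans[p].get(s, 0.0))
            delta[s] = prev[best_prev] * trans[best_prev].get(s, 0.0) * emit[s].get(w, 1e-6)
            ptr[s] = best_prev
        back.append(ptr)
    # trace back pointers from the best final state
    path = [max(delta, key=delta.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["HouseNo", "Road", "City"]
start = {"HouseNo": 1.0}
trans = {"HouseNo": {"Road": 1.0}, "Road": {"Road": 0.5, "City": 0.5}, "City": {}}
emit = {"HouseNo": {"156": 0.9}, "Road": {"Scenic": 0.4, "drive": 0.4}, "City": {"Mumbai": 0.8}}
print(viterbi(["156", "Scenic", "drive", "Mumbai"], states, start, trans, emit))
# ['HouseNo', 'Road', 'Road', 'City']
```

The small 1e-6 smoothing constant stands in for unseen-word handling; a production dictionary model would smooth emissions more carefully.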

Results: comparative evaluation

  Dataset                   Instances   Elements
  IITB student addresses    2388        17
  Company addresses         769         6
  US addresses              740         6

The nested model does best in all three cases.

ALIAS: automatic elimination of duplicates using active learning

The de-duplication problem

Given a list of semi-structured records, find all records that refer to the same entity.

  • Challenges:
    • Errors and inconsistencies in the data
    • Spotting duplicates can be hard, as they may be spread far apart:
      • may not be group-able using obvious keys

The learning approach

[Slide figure: labeled record pairs (duplicate "D" / non-duplicate "N") are mapped through similarity functions f1, f2, …, fn into feature vectors with 0/1 labels; a classifier (shown as a decision tree over predicates such as Year.Difference > 1, All-Ngrams < 0.48, Author.Title.Ngrams < 0.4, Title.Is.Null < 1, Page.Match < 0.5, Author.Edit.Dist < 0.8, each leaf labeled Duplicate or Non-Duplicate) is trained on these mapped examples and then applied to the feature vectors of unlabeled record pairs to predict duplicate or non-duplicate.]
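The mapping step in the figure, from a record pair to a similarity vector (f1 … fn), can be sketched as follows. The similarity functions below (character 3-gram Jaccard similarity, exact year match) are simple stand-ins for ALIAS's actual similarity library, and the field names are illustrative.

```python
def ngrams(s, n=3):
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def ngram_sim(a, b):
    """Jaccard similarity over character 3-grams, in [0, 1]."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

def pair_features(r1, r2):
    """Map a record pair to a similarity feature vector (f1..fn)."""
    return [
        ngram_sim(r1["title"], r2["title"]),
        ngram_sim(r1["author"], r2["author"]),
        1.0 if r1["year"] == r2["year"] else 0.0,
    ]

r1 = {"author": "P. P. Wangikar", "title": "Protein and Solvent Engineering", "year": 1993}
r2 = {"author": "Wangikar, P.P.", "title": "Protein & Solvent Engineering", "year": 1993}
print(pair_features(r1, r2))
```

Any standard classifier (the slide shows a decision tree) can then be trained on these vectors with the 0/1 duplicate labels.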

Experiences with the learning approach

  • Too much manual search in preparing training data:
    • Hard to spot challenging and covering sets of duplicates in large lists
    • Even harder to find close non-duplicates that capture the nuances
  • Instead, examine instances that are similar on one attribute but dissimilar on another
  • Active learning is a generalization of this!

The active learning approach

[Slide figure: as before, labeled record pairs are mapped through similarity functions f1, f2, …, fn and used to train a classifier; the classifier is applied to the feature vectors of unlabeled pairs, and an active learner selects the most informative unlabeled instances for the user to label, feeding them back into the training data.]

Committee-based algorithm

  • Train k classifiers C1, C2, …, Ck on the training data
  • For each unlabeled instance x:
    • Find the predictions y1, …, yk from the k classifiers
    • Compute the uncertainty U(x) as the entropy of the above y's
  • Pick the instance with the highest uncertainty
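The selection step above can be sketched directly: score each unlabeled instance by the entropy of the committee's votes and query the most uncertain one. The threshold "classifiers" here are toy stand-ins for the trained committee.

```python
import math

def vote_entropy(votes):
    """Entropy (in bits) of the committee's vote distribution."""
    n = len(votes)
    probs = [votes.count(v) / n for v in set(votes)]
    return -sum(p * math.log2(p) for p in probs)

def pick_most_uncertain(unlabeled, committee):
    """Return the instance on which the committee disagrees most."""
    def uncertainty(x):
        return vote_entropy([clf(x) for clf in committee])
    return max(unlabeled, key=uncertainty)

# Three toy classifiers that disagree only near the decision boundary
committee = [lambda x: int(x > 0.4), lambda x: int(x > 0.5), lambda x: int(x > 0.6)]
unlabeled = [0.1, 0.45, 0.9]
print(pick_most_uncertain(unlabeled, committee))  # 0.45, which splits the committee 2:1
```

Instances on which all k classifiers agree get entropy 0 and are never selected, which is why the method concentrates labeling effort near the decision boundary.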

Forming a classifier committee

Randomly perturb the learnt parameters:
  • Probabilistic classifiers:
    • Sample from the posterior distribution on the parameters given the training data
    • Example: a binomial parameter p has a beta distribution with mean p
  • Discriminative classifiers:
    • Random boundary in the uncertainty region
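For the probabilistic case, committee members can be formed by drawing each binomial parameter from its beta posterior instead of using the point estimate. The sketch below assumes a Beta(k+1, n−k+1) posterior for k successes in n trials (a uniform prior); this is an illustration of the idea, not ALIAS code.

```python
import random

def sample_parameter(successes, trials, rng):
    """Draw one perturbed value of a binomial parameter from its beta posterior."""
    return rng.betavariate(successes + 1, trials - successes + 1)

def committee_parameters(successes, trials, k=5, seed=0):
    """Draw k perturbed parameter values, one per committee member."""
    rng = random.Random(seed)
    return [sample_parameter(successes, trials, rng) for _ in range(k)]

# Five perturbed versions of a parameter whose point estimate is 8/10 = 0.8
params = committee_parameters(8, 10, k=5)
print(params)
```

Each committee member gets its own draw for every parameter, so members agree where the data pins the parameters down and disagree where it does not.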

The ALIAS deduplication system

[Slide figure, system architecture: initial training records Lp and unlabeled input records D are passed, together with the similarity functions F, through a mapper producing mapped labeled instances (training data T, with further pairs inferred using transitivity) and a pool Dp of mapped unlabeled instances. The active learner repeatedly trains a classifier on T and selects uncertain instances S from Dp for labeling. The resulting deduplication function, with a predicate for the uncertain region, is handed to an evaluation engine that uses similarity indices to run it over large record lists A, producing groups of duplicates in A.]

Benefits of active learning

  • Active learning is much better than random selection
  • With only 100 actively selected instances:
    • 97% accuracy; random selection reaches only 30%
  • Committee-based selection is close to optimal

Analyzing selected instances

  • Fraction of duplicates among selected instances: 44%, starting from only 0.5% in the data
  • Is the gain due to the increased fraction of duplicates?
    • Replaced the non-duplicates in the selected set with random non-duplicates
    • Result: only 40% accuracy!

Conclusion and future work

  • Interactive discovery of a deduplication function using active learning
  • Manual effort reduced to:
    • Providing simple similarity functions
    • Labeling selected pairs: two orders of magnitude fewer than random
  • Ongoing work:
    • Efficient evaluation on large data sets
    • Multi-table de-duplication

Overview of research (1999-2002)

  • Intelligent exploration of large multidimensional databases (ICube project)
    • Released software in the public domain under GPL
    • 3 VLDB papers (1999-2001), 2 journal papers (2001, 2001), 2 MTPs
  • Automated data cleaning
    • Licensed software for address segmentation to an Indian software company
    • Transferring software for duplicate elimination to NIC
    • 1 ACM SIGMOD paper (2001), 1 ACM KDD paper (2002), 1 VLDB demo (2002), 1 ICDE demo (2003), 2 MTPs
  • Integrating databases and mining
    • In collaboration with Microsoft Research
    • 1 IEEE ICDE paper (2002)

Topics of further research

  • Practical, general-purpose information integration at web scale involving multiple tables:
    • Richer, more complex models exploiting higher-level structures in the input data, e.g. trees and tables
    • Limited training data: exploiting existing structured databases and unlabeled data
    • Doing this efficiently and incrementally
  • Finally: extending WWW directories with structured information