The LSD Project: Reconciling Schemas of Disparate Data Sources

The LSD Project
Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach
AnHai Doan, Pedro Domingos, Alon Halevy
University of Washington

Data Integration
[Figure: the query "Find houses with four bathrooms priced under $500,000" is posed against the mediated schema; source schemas 1, 2, and 3 sit below it, connected through wrappers to realestate.com, homeseekers.com, and homes.com.]

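Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
[Figure: two schema trees. One has the elements house, address, contact-info, agent-name, agent-phone, num-baths; the other has house, location, contact, name, phone, full-baths, half-baths. Arrows between the trees are labeled "1-1 mapping" and "non 1-1 mapping".]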

Current State of Affairs
- Finding semantic mappings is now the bottleneck!
  - largely done by hand
  - labor intensive & error prone
- Will only be exacerbated
  - data sharing & XML become pervasive
  - proliferation of DTDs
  - translation of legacy data
  - reconciling ontologies on the semantic web
- Need (semi-)automatic approaches to scale up!

The LSD (Learning Source Descriptions) Approach
Suppose the user wants to integrate 100 data sources:
1. User
   - manually creates mappings for a few sources, say 3
   - shows LSD these mappings
2. LSD learns from the mappings
3. LSD proposes mappings for the remaining 97 sources

Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com
  location     listed-price   phone            comments
  Miami, FL    $250,000       (305) 729 0831   Fantastic house
  Boston, MA   $110,000       (617) 253 1429   Great location
  ...
homes.com
  price      contact-phone    extra-info
  $550,000   (278) 345 7215   Beautiful yard
  $320,000   (617) 335 2315   Great beach
  ...
Learned hypotheses
- If “phone” occurs in the name => agent-phone
- If “fantastic” & “great” occur frequently in data values => description

Our Contributions
1. Use of multi-strategy learning
   - well-suited to exploit multiple types of knowledge
   - highly modular & extensible
2. Extend learning to incorporate constraints
   - handle a wide range of domain & user-specified constraints
3. Develop XML learner
   - exploit hierarchical nature of XML

Multi-Strategy Learning
- Use a set of base learners
  - each exploits certain types of information well
- Match schema elements of a new source
  - apply the base learners
  - combine their predictions using a meta-learner
- Meta-learner
  - uses training sources to measure base learner accuracy
  - weighs each learner based on its accuracy
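
The weighting idea on this slide can be illustrated with a small sketch. It is a simplification, not the LSD meta-learner itself: each learner's weight is taken to be its plain accuracy on training sources, and the function names (`learner_weights`, `combine`) and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def learner_weights(training_results):
    """Weight each base learner by its accuracy on the training sources.

    training_results maps a learner name to a list of
    (predicted_label, actual_label) pairs. Assumption: plain accuracy
    is used as the weight, which is a simplification.
    """
    weights = {}
    for learner, pairs in training_results.items():
        correct = sum(1 for predicted, actual in pairs if predicted == actual)
        weights[learner] = correct / len(pairs) if pairs else 0.0
    return weights


def combine(predictions, weights):
    """Merge per-learner {label: confidence} dicts into normalized scores."""
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights.get(learner, 0.0) * confidence
    total = sum(combined.values())
    if total == 0:
        return {}
    return {label: score / total for label, score in combined.items()}


# Hypothetical training outcomes: the name learner got 3 of 4 right,
# Naive Bayes got 4 of 4 right.
results = {
    "name": [("address", "address"), ("price", "price"),
             ("agent-phone", "description"), ("description", "description")],
    "naive_bayes": [("address", "address"), ("price", "price"),
                    ("agent-phone", "agent-phone"), ("description", "description")],
}
weights = learner_weights(results)

# Hypothetical predictions for one element of a new source.
combined = combine(
    {"name": {"address": 0.4, "description": 0.6},
     "naive_bayes": {"address": 0.9, "description": 0.1}},
    weights,
)
print(max(combined, key=combined.get))
```

Here the more accurate learner dominates, so its preferred label wins even though the other learner disagrees.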

Base Learners
- Input
  - schema information: name, proximity, structure, ...
  - data information: value, format, ...
- Output
  - prediction weighted by confidence score
- Examples
  - Name learner
    - agent-name => (name, 0.7), (phone, 0.3)
  - Naive Bayes learner
    - “Kent, WA” => (address, 0.8), (name, 0.2)
    - “Great location” => (description, 0.9), (address, 0.1)
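
As a rough illustration of a data-based base learner, the sketch below implements a tiny multinomial Naive Bayes classifier over word tokens that returns labels with confidence scores. The class name, the tokenizer, and the use of Laplace smoothing are assumptions made for this example; the slides do not describe LSD's actual implementation. The training pairs reuse the realestate.com values shown in the deck.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLearner:
    """Minimal multinomial Naive Bayes over data values (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        """examples is an iterable of (data_value, mediated_label) pairs."""
        for text, label in examples:
            tokens = tokenize(text)
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text):
        """Return [(label, confidence), ...] sorted by descending confidence."""
        if not self.label_counts:
            return []
        tokens = tokenize(text)
        total_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            label_total = sum(self.token_counts[label].values())
            score = math.log(count / total_examples)
            for token in tokens:
                # Laplace (add-one) smoothing so unseen tokens do not zero out a label.
                seen = self.token_counts[label][token]
                score += math.log((seen + 1) / (label_total + len(self.vocabulary)))
            log_scores[label] = score
        # Convert log scores into confidences that sum to 1.
        peak = max(log_scores.values())
        scaled = {label: math.exp(s - peak) for label, s in log_scores.items()}
        total = sum(scaled.values())
        return sorted(((label, s / total) for label, s in scaled.items()),
                      key=lambda pair: pair[1], reverse=True)


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("$250,000", "price"), ("$110,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("(617) 253 1429", "agent-phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
print(learner.predict("Great house"))
```

The confidences produced by this toy model will not match the numbers on the slide; only the shape of the output, a ranked list of (label, confidence) pairs, is the same.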

Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com
  location     listed-price   phone            comments
  Miami, FL    $250,000       (305) 729 0831   Fantastic house
  Boston, MA   $110,000       (617) 253 1429   Great location
  ...
Name Learner training examples
  (location, address)
  (listed-price, price)
  (phone, agent-phone)
  (comments, description)
  ...
Naive Bayes Learner training examples
  (“Miami, FL”, address)
  (“$250,000”, price)
  (“(305) 729 0831”, agent-phone)
  (“Fantastic house”, description)
  ...

Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
homes.com
  area          day-phone        extra-info
  Seattle, WA   (278) 345 7215   Beautiful yard
  Kent, WA      (617) 335 2315   Great beach
  Austin, TX    (512) 427 1115   Close to Seattle
[Figure: the data pass through the Name Learner and Naive Bayes, then through the Meta-Learner.]
Predictions, in the order they appear on the slide
  (address, 0.8), (description, 0.2)
  (address, 0.6), (description, 0.4)
  (address, 0.7), (description, 0.3)
  (agent-phone, 0.9), (description, 0.1)
  (address, 0.6), (description, 0.4)

Domain Constraints
- Impose semantic regularities on sources
  - verified using schema or data
- Examples
  - a = address & b = address => a = b
  - a = house-id => a is a key
  - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
  - when creating mediated schema
  - independent of any actual source schema
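
One way to picture these constraints is as predicates over a candidate mapping, checked against the schema or the data. The sketch below is an illustration under assumptions: the function names, the `(mapping, data)` signature, and the sample values are invented here and are not taken from LSD.

```python
def at_most_one(label):
    """a = label & b = label => a = b: at most one element may map to label."""
    def check(mapping, data):
        return sum(1 for target in mapping.values() if target == label) <= 1
    return check


def is_key(label):
    """a = label => a is a key: the element's data values must be unique."""
    def check(mapping, data):
        for element, target in mapping.items():
            if target == label:
                values = data.get(element, [])
                if len(values) != len(set(values)):
                    return False
        return True
    return check


def violated(mapping, data, constraints):
    """Return the names of the constraints that the mapping breaks."""
    return [name for name, check in constraints.items()
            if not check(mapping, data)]


constraints = {
    "single address": at_most_one("address"),
    "house-id is a key": is_key("house-id"),
}

# Hypothetical candidate mapping and data listings for one source.
mapping = {"area": "address", "extra-info": "address", "ad-id": "house-id"}
data = {"ad-id": ["h1", "h2", "h2"]}
print(violated(mapping, data, constraints))
```

The first predicate needs only the mapping (a schema-level check), while the second inspects data values, which mirrors the slide's point that constraints are verified using schema or data.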

The Constraint Handler
Predictions from Meta-Learner
  area: (address, 0.7), (description, 0.3)
  contact-phone: (agent-phone, 0.9), (description, 0.1)
  extra-info: (address, 0.6), (description, 0.4)
Domain Constraints
  a = address & b = address => a = b
Candidate mappings and their scores
  area: address 0.7, contact-phone: agent-phone 0.9, extra-info: address 0.6 => 0.378
  area: description 0.3, contact-phone: description 0.1, extra-info: description 0.4 => 0.012
  area: address 0.7, contact-phone: agent-phone 0.9, extra-info: description 0.4 => 0.252
- Can specify arbitrary constraints
- User feedback = domain constraint
  - ad-id = house-id
- Extended to handle domain heuristics
  - a = agent-phone & b = agent-name => a & b are usually close to each other
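
The scores on this slide are products of the per-element confidences (0.7 x 0.9 x 0.6 = 0.378, and so on). The highest-scoring combination maps both area and extra-info to address, which the constraint forbids, so the best admissible mapping is the one scoring 0.252. The sketch below reproduces that arithmetic by brute-force enumeration; the exhaustive search and the function names are assumptions for illustration, since the slides do not say how LSD searches the space of mappings.

```python
from itertools import product
from math import prod


def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping that satisfies every constraint.

    predictions maps each source element to {label: confidence}.
    A mapping's score is the product of its chosen confidences.
    Assumption: all combinations are enumerated, which is only
    practical for small schemas.
    """
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[element] for element in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue  # skip mappings that break a domain constraint
        score = prod(predictions[element][label]
                     for element, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


def single_address(mapping):
    """a = address & b = address => a = b."""
    return sum(1 for label in mapping.values() if label == "address") <= 1


# Meta-learner predictions copied from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

mapping, score = best_mapping(predictions, [single_address])
print(mapping, round(score, 3))
```

Without the constraint, the same search would return the 0.378 mapping, which shows how a single domain constraint can overturn the meta-learner's top choice.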

Putting It All Together: the LSD System
[Figure: system architecture with a Training Phase and a Matching Phase. Labeled components: Mediated schema, Source schemas, Data listings, Training data for base learners, L1, L2, ..., Lk, Domain Constraints, User Feedback, Constraint Handler, Mapping Combination.]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
  - uses stacking [Ting&Witten 99, Wolpert 92]
  - returns linear weighted combination of base learners’ predictions

Empirical Evaluation
- Four domains
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
  - create mediated DTD & domain constraints
  - choose five sources
  - extract & convert data listings into XML
  - mediated DTDs: 14-66 elements, source DTDs: 13-48
- Ten runs for each experiment; in each run:
  - manually provide 1-1 mappings for 3 sources
  - ask LSD to propose mappings for remaining 2 sources
  - accuracy = % of 1-1 mappings correctly identified
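
The accuracy measure defined on this slide is simple to state in code. The sketch below assumes mappings are represented as dictionaries from source element to mediated-schema element; that representation, the function name, and the sample mappings are illustrative choices, not details from the evaluation itself.

```python
def matching_accuracy(proposed, reference):
    """Percentage of reference 1-1 mappings that the proposal gets right.

    Both arguments map a source element to a mediated-schema element.
    Elements missing from the proposal count as incorrect.
    """
    if not reference:
        raise ValueError("reference mapping must not be empty")
    correct = sum(1 for element, target in reference.items()
                  if proposed.get(element) == target)
    return 100.0 * correct / len(reference)


# Hypothetical reference and proposed mappings for one test source.
reference = {"area": "address", "price": "price",
             "contact-phone": "agent-phone", "extra-info": "description"}
proposed = {"area": "address", "price": "price",
            "contact-phone": "agent-phone", "extra-info": "address"}
print(matching_accuracy(proposed, reference))
```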

High Matching Accuracy
[Figure: average matching accuracy (%)]
- LSD’s accuracy: 71-92%
- Best single base learner: 42-72%
- + Meta-learner: +5-22%
- + Constraint handler: +7-13%
- + XML learner: +0.8-6%

Performance Sensitivity
[Figure: average matching accuracy (%) versus number of data listings per source]

Contribution of Schema vs. Data
[Figure: average matching accuracy (%)]
- More experiments in the paper!

Related Work
- Rule-based approaches
  - TRANSCM [Milo&Zohar 98], ARTEMIS [Castano&Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
  - utilize only schema information
- Learner-based approaches
  - SEMINT [Li&Clifton 94], ILA [Perkowitz&Etzioni 95]
  - employ a single learner, limited applicability
- Others
  - DELTA [Clifton et al. 97], CLIO [Miller et al. 00][Yan et al. 01]
- Multi-strategy learning in other domains
  - series of workshops [91, 93, 96, 98, 00]
  - [Freitag 98], Proverb [Keim et al. 99]

Summary
- LSD project
  - applies machine learning to schema matching
- Main ideas & contributions
  - use of multi-strategy learning
  - extend learning to handle domain & user-specified constraints
  - develop XML learner
- System design: a contribution to generic schema matching
  - highly modular & extensible
  - handles multiple types of knowledge
  - continuously improves over time

Ongoing & Future Work
- Improve accuracy
  - address current system limitations
- Extend LSD to more complex mappings
- Apply LSD to other application contexts
  - data translation
  - data warehousing
  - e-commerce
  - information extraction
  - semantic web
www.cs.washington.edu/homes/anhai/lsd.html

Contribution of Each Component
[Figure: average matching accuracy (%) for five configurations: Without Name Learner, Without Naive Bayes, Without Whirl Learner, Without Constraint Handler, The complete LSD system]

Exploiting Hierarchical Structure
- Existing learners flatten out all structures
    Gail Murphy MAX Realtors
    Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.
- Developed XML learner
  - similar to the Naive Bayes learner
    - input instance = bag of tokens
  - differs in one crucial aspect
    - consider not only text tokens, but also structure tokens
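
To make the text-token versus structure-token distinction concrete, the sketch below builds a bag of tokens from an XML instance and optionally adds one token per nested tag. The slide does not define what a structure token is, so treating nested tag names as structure tokens is an assumption, and the tag names in the sample instances (`contact`, `name`, `firm`, `description`) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def bag_of_tokens(xml_text, with_structure=True):
    """Turn one XML instance into a bag (multiset) of tokens.

    Text tokens are the lowercase words in the instance. When
    with_structure is True, each nested element also contributes a
    token built from its tag name. The root tag is skipped because it
    is the element being classified.
    """
    root = ET.fromstring(xml_text)
    bag = Counter()
    for node in root.iter():
        if with_structure and node is not root:
            bag["<" + node.tag + ">"] += 1
        for chunk in (node.text, node.tail):
            if chunk:
                bag.update(re.findall(r"[a-z0-9]+", chunk.lower()))
    return bag


# Hypothetical instances: one nested, one flat, sharing the same words.
nested = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
flat = ("<description>To see it, contact Gail Murphy "
        "at MAX Realtors.</description>")

print(bag_of_tokens(nested, with_structure=False))
print(bag_of_tokens(nested))
print(bag_of_tokens(flat))
```

With structure tokens switched off, the nested instance reduces to the same name words that also occur in the flat description; with them on, the extra tag tokens give a Naive Bayes style learner something to separate the two.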