  • Number of slides: 50

Source Discovery and Schema Mapping for Data Integration
Li Xu, Brigham Young University
BYU Data Extraction Group, funded by NSF

Data Integration
  • Example query: find houses with four bedrooms priced under $200,000
  • [Figure: a mediator with a global schema sits above source schemas 1, 2, and 3; wrappers connect to realestate.com, homeseekers.com, and homes.com.]

Problems
  • How to Recognize Applicable Information Sources for an Application?
  • How to Specify Mapping between the Source Schemas and the Global Schema?
  • How to Reformulate User Queries?
  • How to Merge Data from Heterogeneous Sources?
  • …

Recognizing Ontology-Applicable HTML Documents

Application Ontology
  • How to specify an application?

Applicable HTML Documents
  • Multiple-Record Documents
  • Single-Record Documents
  • HTML Forms
  • How to distinguish an applicable HTML document?

Multiple-Record Documents
  • Document 1: Car Ads
  • Document 2: Items for Sale or Rent

Single-Record Document

HTML Forms
  • Information hidden under the HTML form

Recognition Heuristics
  • h1+: Densities
  • h2: Expected Values
  • h3: Grouping
  • How to measure the applicability of an HTML document for an application?

h1+: Densities
  • Document 1: Car Ads
  • Document 2: Items for Sale or Rent

h2: Expected Values
  • Expected values: <Year: 0.98, Make: 0.93, Model: 0.91, Mileage: …
  • Observed counts per object set:

  Object set   Document 1 (Car Ads)   Document 2 (Items for Sale or Rent)
  Year         3                      1
  Make         2                      0
  Model        3                      0
  Mileage      1                      1
  Price        1                      0
  Feature      15                     0
  PhoneNr      3                      4
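The slide does not say how the observed counts are compared with the expected values. One common choice, assumed here, is cosine similarity between the two vectors. The sketch below uses only the three expected values that are legible on the slide (Year, Make, Model); the rest of the vector is truncated in the source and is left out rather than guessed. The function name `cosine` is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Expected values legible on the slide: Year, Make, Model.
expected = [0.98, 0.93, 0.91]

# Observed counts for the same three object sets, taken from the slide.
doc1_car_ads = [3, 2, 3]
doc2_items_for_sale = [1, 0, 0]

score_doc1 = cosine(expected, doc1_car_ads)
score_doc2 = cosine(expected, doc2_items_for_sale)
print(f"Document 1: {score_doc1:.3f}")
print(f"Document 2: {score_doc2:.3f}")
```

On this partial vector the car-ads document scores much closer to the expected values than the mixed classifieds document does, which is the behavior the heuristic is meant to capture.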

h3: Grouping (of 1-Max Object Sets)
  • Document 1: Car Ads
  • Document 2: Items for Sale or Rent
  • [Figure: the 1-max object sets found in each document, bracketed into consecutive groups. One column reads Year, Mileage, …, Mileage, Year, Price, … (two groups); the other reads Year, Make, Model, Price, Year, Model, Year, Make, Model, Mileage, … (three groups).]
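The slide shows the groups but not the scoring formula. A minimal sketch of one plausible scoring, assumed here and not taken from the slide: cut the sequence of recognized 1-max object sets into consecutive groups of a fixed size, count the distinct object sets in each group, and divide by the total number of slots. The function name, the group size of 4, and both input sequences are hypothetical.

```python
def grouping_score(sequence, group_size):
    """Fraction of distinct object sets per full group, averaged over groups.

    A trailing partial group is ignored (an assumption of this sketch).
    """
    groups = [
        sequence[i:i + group_size]
        for i in range(0, len(sequence) - group_size + 1, group_size)
    ]
    if not groups:
        return 0.0
    distinct = sum(len(set(group)) for group in groups)
    return distinct / (len(groups) * group_size)

# Hypothetical sequences, in order of appearance in a document.
well_grouped = ["Year", "Make", "Model", "Price"] * 2
scattered = ["Year", "Year", "Price", "Year", "Price", "Price", "Year", "Year"]

print(grouping_score(well_grouped, 4))  # each group holds 4 distinct object sets
print(grouping_score(scattered, 4))     # each group holds only 2 distinct object sets
```

A document whose records each mention every 1-max object set once scores near 1, while a document that repeats a few object sets without record structure scores lower.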

Classification Problem
  • Subtasks
    – Multiple Records
    – Singleton Record
    – Application Form
  • Learning Algorithm: Decision Tree C4.5
    – (h1+0, h1+1, …, h2, h3, Positive)
    – (h1+0, h1+1, …, h2, h3, Negative)
  • How to construct recognition rules for an application?
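C4.5 chooses splits by gain ratio (information gain divided by split information). The sketch below computes that criterion for one threshold split over heuristic feature vectors of the shape listed above. It is not a full C4.5 implementation, and the four training rows are made-up values for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, feature_index, threshold):
    """Gain ratio of splitting rows on rows[i][feature_index] <= threshold.

    Each row is (feature_0, ..., feature_k, label).
    """
    labels = [row[-1] for row in rows]
    left = [row[-1] for row in rows if row[feature_index] <= threshold]
    right = [row[-1] for row in rows if row[feature_index] > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(rows)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return gain / split_info

# Hypothetical rows: (density, expected-values score, grouping score, label).
rows = [
    (0.40, 0.95, 0.90, "Positive"),
    (0.35, 0.90, 0.85, "Positive"),
    (0.05, 0.50, 0.40, "Negative"),
    (0.10, 0.60, 0.55, "Negative"),
]

print(gain_ratio(rows, 1, 0.75))  # clean split on the expected-values score
print(gain_ratio(rows, 1, 0.55))  # weaker split on the same feature
```

The learner would pick the threshold with the highest gain ratio at each node; the resulting tree is the recognition rule for the application.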

Experiments: Car Ads and Obituaries
  • Training Sets
    – Car Ads (Yes | No): 143 | 363, 614 | 636, 50 | 69
    – Obituaries (Yes | No): 68 | 135, 50 | 69, 62 | 135
  • Test Sets
    – Car Ads (40 | 40)
    – Obituaries (40 | 40)
  • Results (two result boxes on the slide)
    – Precision 95%, Recall 98%, F-measure 96%
    – Precision 95%, Recall 95%, F-measure 95%
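Assuming the F-measure here is the standard balanced one (the harmonic mean of precision P and recall R), both reported F values can be checked from the reported precision and recall:

```latex
F = \frac{2PR}{P + R}, \qquad
\frac{2 \cdot 0.95 \cdot 0.98}{0.95 + 0.98} \approx 0.965, \qquad
\frac{2 \cdot 0.95 \cdot 0.95}{0.95 + 0.95} = 0.95
```

The first value rounds to the 96% on the slide and the second matches the 95%.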

Link Analysis

Form Filling

Form Filling (Cont.)

Incorrect Positive Response
  • Motorcycle: Year, Make, Price, Mileage, PhoneNr, Feature

Historical Figure
  • Deceased Name, Death Date, Birth Date, Age, Relationship, Relative Name

Automating Schema Mapping for Data Integration

Schema Mapping
  • [Figure: a target schema and a source schema for cars. Node labels in extraction order: Year, Make, Model, Make & Model, Feature, Cost, Car, Body Type, Miles, Color, Car, Phone, Mileage, Year, Style, Cost.]

Schema Mapping for Populated Schemas
  • Central Idea: Exploit All Data & Metadata
  • Matching Possibilities (Facets)
    – Attribute Names
    – Data-Value Characteristics
    – Expected Data Values
    – Data-Dictionary Information
    – Structural Properties

The Approach
  • Input:
    – Two Graphs, S and T
    – Data Instances for S and T
    – Lightweight Domain Ontology
  • Output:
    – A Source-to-Target Mapping between S and T
      • Should enable translating data instances from S to T.
    – Direct and Many Indirect Matches
      • (t, s)
      • (t, s’ <= )
  • Framework
    – Individual Facet Matching
    – Combination of Individual Matchers

Attribute Names
  • Target and Source Attributes
    – T: A
    – S: B
  • WordNet
  • C4.5 Decision Tree: feature selection, trained on schemas in DB books
    – f0: same word
    – f1: synonym
    – f2: sum of distances to a common hypernym root
    – f3: number of different common hypernym roots
    – f4: sum of the number of senses of A and B

WordNet Rule
  • The number of different common hypernym roots of A and B
  • The sum of the number of senses of A and B
  • The sum of distances of A and B to a common hypernym

Data-Value Characteristics
  • C4.5 Decision Tree
  • Features
    – Numeric data (Mean, variation, standard deviation, …)
    – Alphanumeric data (String length, numeric ratio, space ratio)
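A minimal sketch of how such per-attribute features might be computed before they are handed to the decision tree. The slide's "variation" is read as variance here, which is an assumption; the function names and the mileage values are hypothetical, while the three strings come from the Expected Data Values slide.

```python
import statistics

def numeric_features(values):
    """Summary statistics for a non-empty column of numbers."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # population variance
        "stdev": statistics.pstdev(values),        # population standard deviation
    }

def alphanumeric_features(values):
    """Length and character-class ratios for a non-empty column of strings."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch == " " for v in values for ch in v)
    return {
        "mean_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars if total_chars else 0.0,
        "space_ratio": spaces / total_chars if total_chars else 0.0,
    }

mileage = [45000, 60000, 75000]                            # hypothetical values
make_model = ["Ford Mustang", "Ford Taurus", "Ford F150"]  # values from the slides

print(numeric_features(mileage))
print(alphanumeric_features(make_model))
```

Two attributes whose feature vectors look alike (for example, similar mean and spread, or similar digit and space ratios) become candidates for a match on this facet.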

Expected Data Values
  • Concepts and Relationships
  • Data Recognizers
    – CarMake: “ford”, “honda”, …
    – CarModel: “accord”, “mustang”, “taurus”, …
  • Target: Make & Model (Ford Mustang, Ford Taurus, Ford F150, …), recognized as CarMake.CarModel
  • Source: Brand (Acura, Audi, BMW, …), recognized as CarMake; Model (Legend, Mustang, A4, …), recognized as CarModel
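A minimal sketch of how data recognizers could expose an indirect match such as the one on this slide, where one attribute holds a make followed by a model and the other schema stores them separately. The regular expressions are built only from the recognizer values shown on the slide; the function name and the matching logic are assumptions, not the deck's implementation.

```python
import re

# Recognizer vocabularies as listed on the slide (the slide's lists are elided with "…").
CAR_MAKES = ["ford", "honda"]
CAR_MODELS = ["accord", "mustang", "taurus"]

MAKE_MODEL_PATTERN = re.compile(
    r"^(" + "|".join(map(re.escape, CAR_MAKES)) + r")\s+("
    + "|".join(map(re.escape, CAR_MODELS)) + r")$",
    re.IGNORECASE,
)

def split_make_model(value):
    """Return (make, model) if value is a recognized make followed by a recognized model."""
    match = MAKE_MODEL_PATTERN.match(value.strip())
    return (match.group(1), match.group(2)) if match else None

print(split_make_model("Ford Mustang"))
print(split_make_model("Ford Taurus"))
print(split_make_model("Ford F150"))  # "f150" is not in the listed model values
```

When most values of one attribute split this way, the attribute corresponds to the concatenation of a make attribute and a model attribute in the other schema, which is an indirect rather than a direct match.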

Structure Matching
  • [Figure: target and source schema graphs for the real-estate example. Node labels in extraction order: MLS, Bedrooms, Name, House, Basic_features, location, Agent, SQFT, Fax, Golf course, Water front, location_description, beds, agent, Address, name, Street, City, fax, State, phone.]

Structure Matching (Cont.)
  • [Figure: five continuation slides repeat the same target and source schema graphs with the same node labels.]

{House, MLS} vs. {MLS}
  • [Figure, shown on two consecutive slides: target and source subgraphs. Node labels in extraction order: MLS, Bedrooms, Basic_features, location, House, SQFT, beds, location_description, Golf course, Water front, Address, Street, City, State.]

{House, MLS} vs. {MLS}
  • [Figure: the same subgraphs with derived source nodes added. Node labels in extraction order: MLS, Bedrooms, Basic_features, House, SQFT, House’, location_description, Golf course, beds, Water front, location, Address1’, Street, City, State.]

{House, MLS} vs. {MLS}
  • [Figure: the same subgraphs with further derived source nodes. Node labels in extraction order: MLS, Basic_features, Bedrooms, location_description, House, SQFT, House’, beds, location, Golf course, Water front, Street, Water front’, Address, City, State, Golf course’, Address1’, Street1’, City1’, State1’.]

{Agent} vs. {agent}
  • [Figure: target Agent and source agent subgraphs. Node labels in extraction order: Name, agent, Agent, address, Fax, name, fax, phone, Address, Street, City, State.]

{Agent} vs. {agent}
  • [Figure: the same subgraphs with derived source nodes added. Node labels in extraction order: agent, Name, name, Agent, Fax, phone, fax, address, Address2’, Address, Street, City, State, Street2’, City2’, State2’.]

Inter-Relationship Set
  • [Figure: target and source graphs. Node labels in extraction order: MLS, Bedrooms, Name, House, Agent, House’, Fax, Golf course, Water front, Address, agent, Street, City, State.]

Example: Source-To-Target Mapping
  • [Figure: mapping diagram. Node labels in extraction order: House’, MLS, name, beds, agent, City’, Golf course’, Street’, Water front’, State’, Address1’, fax, Address2’.]

Target-based Integration and Query System (TIQS)
  • Definition: I = (T, {Si}, {Mi})
  • Phases
    – Design (Source-to-Target Mappings {Mi})
    – Query Processing (Rule Unfolding)

Query Reformulation
  • Query
    – House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_course(x, “Yes”), House-Water_front(x, “Yes”)
  • [Figure, shown on two consecutive slides: the source-to-target mapping diagram. Node labels in extraction order: House’, MLS, name, beds, agent, City’, Golf course’, Street’, Water front’, State’, fax, Address’, Address1’, Address2’.]
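A minimal sketch of rule unfolding for the query body above: each target predicate is replaced by the source-level atoms that a mapping rule defines for it. The target predicate names come from the slide; the source predicate names (`Src-beds` and so on), the shape of the mapping table, and the function name are hypothetical. Renaming of fresh variables, which a full unfolder needs for variables that appear only in a rule body, is left out.

```python
# Each target predicate maps to (formal parameters, list of source atoms).
# An atom is (predicate name, tuple of arguments). Source names are hypothetical.
MAPPINGS = {
    "House-Bedrooms": (("h", "v"), [("Src-beds", ("h", "v"))]),
    "House-Golf_course": (("h", "v"), [("Src-golf_course", ("h", "v"))]),
    "House-Water_front": (("h", "v"), [("Src-water_front", ("h", "v"))]),
}

def unfold(query_body, mappings):
    """Rewrite target atoms into source atoms by substituting actual arguments."""
    unfolded = []
    for predicate, args in query_body:
        params, source_atoms = mappings[predicate]
        binding = dict(zip(params, args))
        for source_predicate, source_args in source_atoms:
            bound_args = tuple(binding.get(a, a) for a in source_args)
            unfolded.append((source_predicate, bound_args))
    return unfolded

# Body of the query from the slide.
query_body = [
    ("House-Bedrooms", ("x", 4)),
    ("House-Golf_course", ("x", "Yes")),
    ("House-Water_front", ("x", "Yes")),
]

for atom in unfold(query_body, MAPPINGS):
    print(atom)
```

Because the mappings are written from source to target, reformulation reduces to this kind of substitution, and adding a source only adds mapping rules without touching existing ones.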

TIQS (Cont.)
  • User Queries
    – Logic Rules
    – Maximal and Sound Query Answers
  • Advantages
    – Rule Unfolding
    – Scalability

Experimental Results

  Application (Number of Schemes)   Precision (%)   Recall (%)   F (%)   Number Matches   Number Correct   Number Incorrect
  Faculty Member (5)                100             100          100     540              540              0
  Course Schedule (5)               99              93           96      490              454              6
  Real Estate (5)                   90              94           92      876              820              92

  • Indirect Matches: precision 87%, recall 94%, F-measure 90%
  • Data borrowed from Univ. of Washington [DDH, SIGMOD 01]
  • Rough Comparison with U of W Results
    – Course Schedule: Accuracy ~71%
    – Real Estate (2 tests): Accuracy ~75%
    – Faculty Member: Accuracy ~92%
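The percentages in the table can be reproduced from the counts, assuming "Number Matches" is the number of true matches, precision is correct / (correct + incorrect), and recall is correct / matches. That reading is an inference from the numbers, not something the slide states; the function name is illustrative.

```python
def scores(actual_matches, correct, incorrect):
    """Return (precision, recall, F-measure) as whole percentages."""
    precision = correct / (correct + incorrect)
    recall = correct / actual_matches
    f_measure = 2 * precision * recall / (precision + recall)
    return round(precision * 100), round(recall * 100), round(f_measure * 100)

# Counts from the table above.
course_schedule = scores(490, 454, 6)
real_estate = scores(876, 820, 92)

print("Course Schedule:", course_schedule)
print("Real Estate:", real_estate)
```

Both rows come out at the precision, recall, and F values shown in the table.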

Conclusion
  • A Robust and Flexible Approach to Check Applicability of HTML Documents
  • A Composite Approach to Automate Schema Mapping
    – Direct Matches
    – Indirect Matches
  • An Approach that Combines Advantages of Basic Approaches to Data Integration

Future Work
  • Test More Applications and Data to Evaluate the Approaches
  • Extend Training Classifiers for Applicability Checking
  • Further Automate Schema Mapping
  • Automate Ontology Mapping on the Semantic Web
  • Automate Mapping between XML Documents
  • …

Thanks! Questions?