Скачать презентацию CSE 636 Data Integration Schema Matching Cupid Скачать презентацию CSE 636 Data Integration Schema Matching Cupid

5c3ded51845cad2157e13d82f0ae18cc.ppt

  • Количество слайдов: 27

CSE 636 Data Integration Schema Matching Cupid CSE 636 Data Integration Schema Matching Cupid

Virtual Integration Architecture Run-Time Design-Time Schema Matching Query Reformulation Query Result End User Mediation Virtual Integration Architecture Run-Time Design-Time Schema Matching Query Reformulation Query Result End User Mediation Language Mediator Web Services Global Schema Optimization & Execution XML Wrapper Local Data Schema Source 2

Schema Heterogeneity Independently created schemas… … might be modeling similar information… … in slightly Schema Heterogeneity Independently created schemas… … might be modeling similar information… … in slightly different ways DB 1 ugrad * ugrad. ID name enrollment * course. ID ugrad. ID grade DB 2 course * course. ID title student * student. ID name type evaluation DB 3 student * student. ID name type course * course. ID title ? type letter 3

Schema Heterogeneity DB 1 ugrad * ugrad. ID name enrollment * course. ID ugrad. Schema Heterogeneity DB 1 ugrad * ugrad. ID name enrollment * course. ID ugrad. ID grade • • DB 2 course * course. ID title student * student. ID name type evaluation DB 3 student * student. ID name type course * course. ID title ? type letter Similar entities represented Dissimilar structures (inverted nesting) Different element names for similar data values Similar element names for different data values 4

Schema Matching vs. Schema Mapping • GAV and LAV are schema mapping languages • Schema Matching vs. Schema Mapping • GAV and LAV are schema mapping languages • Mappings: – set of queries – associations + semantics • Match: – set of associations only • Schema Matching: – Identifying associations – First step towards constructing mappings 5

Schema Matching vs. Schema Mapping Semantics for $s 1 in DB 3/student where $s Schema Matching vs. Schema Mapping Semantics for $s 1 in DB 3/student where $s 1/type = ‘UGRAD’ return {$s 1/student. ID} {$s 1/name} LAV Mapping: DB 1 Q(DB 3) DB 1 ugrad * ugrad. ID name enrollment * course. ID ugrad. ID grade DB 3 Associations student * student. ID name type course * course. ID title ? type letter 6

The Problem of Schema Matching Input • Schemas S 1 and S 2 • The Problem of Schema Matching Input • Schemas S 1 and S 2 • Possibly data instances for S 1 and S 2 • Background knowledge – – – thesauri validated matches standard schemas reference instances ontologies constraints (keys, data types etc) Output • Associations between S 1 and S 2 Goal • Schema matching tools with significant automated support 7

Schema Matching How is the match result expressed? DB 2 course * course. ID Schema Matching How is the match result expressed? DB 2 course * course. ID title student * student. ID name type evaluation DB 3 student * student. ID name type course * course. ID title ? type letter • Pairs of paths • Lists of paths • Schema names 8

Schema Matching What do we match? • 1. 2. 3. Depends on the queries Schema Matching What do we match? • 1. 2. 3. Depends on the queries we want to ask Elements in isolation (leaves in particular) Substructures Whole schemas 9

Motivation • Important component in many applications – Data Integration – Data Migration – Motivation • Important component in many applications – Data Integration – Data Migration – E-Commerce • Model Management [Bernstein, Halevy, Pottinger ’ 00] – Algebra for manipulating models and mappings – Match, Merge, Compose … 10

Problems • Minimize user involvement (semi-automatic) • Data model independent matching (generic) • Schema Problems • Minimize user involvement (semi-automatic) • Data model independent matching (generic) • Schema matching is a hard problem – Naming and structural differences in schemas – Similar, but non-identical concepts modeled – Multiple data models – SQL DDL, XML, ODMG… 11

Schema Matching Approaches How to match? Individual matchers Schema-based Per-Element Content-based Structural Combined matchers Schema Matching Approaches How to match? Individual matchers Schema-based Per-Element Content-based Structural Combined matchers Hybrid Per-Element Linguistic Constraintbased • Graph matching manual automatic composition Linguistic Constraintbased • Names • Descriptions Composite • IR (word • Value pattern frequencies, and ranges key terms) • Types • Keys Taxonomy based survey: Rahm and Bernstein, VLDB J, 2001 12

Cupid Individual matchers Schema-based Per-Element Content-based Structural Combined matchers Hybrid Per-Element Linguistic Constraintbased • Cupid Individual matchers Schema-based Per-Element Content-based Structural Combined matchers Hybrid Per-Element Linguistic Constraintbased • Graph matching manual automatic composition Linguistic Constraintbased • Names • Descriptions Composite • IR (word • Value pattern frequencies, and ranges key terms) • Types • Keys Madhavan, Bernstein and Rahm, VLDB, 2001 13

Cupid Example PO Purchase. Order POLines Items POShip. To Deliver. To Item Name Item Cupid Example PO Purchase. Order POLines Items POShip. To Deliver. To Item Name Item City Street Name Line Item. Number Uo. M Unitof. Measure Qty Address City Street Quantity 14

Cupid Architecture Schema 1 Linguistic Matching Schema 2 LSIM Thesaurus Output Mapping Generate Mapping Cupid Architecture Schema 1 Linguistic Matching Schema 2 LSIM Thesaurus Output Mapping Generate Mapping SSIM WSIM Structure Matching 15

Linguistic Matching • Heuristic name matching – Tokenization of names POOrder. Num PO, Order, Linguistic Matching • Heuristic name matching – Tokenization of names POOrder. Num PO, Order, Num – Expansion of short-forms, acronyms PO Purchase, Order; Number – Clustering of schema elements based on keywords and data-types Street, City, POAddress – Thesaurus of synonyms, hypernyms, acronyms – Linguistic Similarity coefficient (LSIM [0, 1] ) 16

Structure Matching PO Purchase. Order POLines Items Deliver. To POShip. To Item Name Item Structure Matching PO Purchase. Order POLines Items Deliver. To POShip. To Item Name Item City Street Line Item. Number Uo. M Unitof. Measure Qty Name Address City Street Quantity 17

Structure Matching Mutually Reinforcing Similarity PO POLines Item Purchase. Order WSIM> thhigh Line Uo. Structure Matching Mutually Reinforcing Similarity PO POLines Item Purchase. Order WSIM> thhigh Line Uo. M Qty SSIM ++ Items Item. Num Unitof. Measure Quantity 18

Structure Matching Context Dependent Disambiguation PO Purchase. Order POBill. To Invoice. To Deliver. To Structure Matching Context Dependent Disambiguation PO Purchase. Order POBill. To Invoice. To Deliver. To POShip. To Address City Street Address SSIM++ City SSIM-- SSIM++ Street City Street 19

Intuition • Atomic elements are similar – Linguistically and data-type similar – Their ancestors Intuition • Atomic elements are similar – Linguistically and data-type similar – Their ancestors are similar • Compound elements (non-leaf) are similar if – Linguistically similar – Subtrees rooted at the elements are similar • Mutually recursive – Leaves determine internal node similarity – Similarity of internal nodes leads to increase in leaf similarity 20

Structure Match Details • Subtrees are similar if – Immediate children are similar – Structure Match Details • Subtrees are similar if – Immediate children are similar – Leaf sets are similar • Subtree Similarity (nodes s and t) – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa – Less sensitive to variation in intermediate structure • Pruning the number of comparisons – Elements must have comparable number of leaves 21

Referential Integrity Purchase Order ID Order-Customer-fk Customer ID Product Name Customer ID Customer-Purchase-Order Customer Referential Integrity Purchase Order ID Order-Customer-fk Customer ID Product Name Customer ID Customer-Purchase-Order Customer Address Name Schema A Schema B • Join nodes added to the schema tree for each referential integrity constraint • Views can be similarly used 22

Cupid Architecture Schema 1 LSIM Linguistic Matching Schema 2 Thesaurus Generate Mapping Output Mapping Cupid Architecture Schema 1 LSIM Linguistic Matching Schema 2 Thesaurus Generate Mapping Output Mapping Linguistic Similarity (LSIM ) SSIM WSIM Structure Matching Structural (SSIM Weighted (WSIM Similarity ), ) Invoice. To Bill. To 0. 7 Invoice. To Bill. To 0. 8 0. 7 Uo. M Unit. Measure 0. 9 Uo. M Unit. Measure 0. 7 0. 8 City 1. 0 Invoice. To/City Bill. To/City 0. 8 0. 9 23

Mapping Generation • Individual mapping elements computed from WSIM values: – Consider only mapping Mapping Generation • Individual mapping elements computed from WSIM values: – Consider only mapping pairs that have WSIM greater than threshold – For each element of target find most similar source element – Not accepted mappings with high similarity are returned in order to help user modify map 24

Cupid Architecture Schema 1 Linguistic Matching Schema 2 LSIM Thesaurus Output Mapping Generate Mapping Cupid Architecture Schema 1 Linguistic Matching Schema 2 LSIM Thesaurus Output Mapping Generate Mapping SSIM WSIM Structure Matching Input hint 25

Work Needed • A more robust solution – Auto-tuning parameters – Thesaurus Generation and Work Needed • A more robust solution – Auto-tuning parameters – Thesaurus Generation and Evolution • Schema matching component architecture – Easily extensible by adding multiple techniques – Data Instances for matching – Look at COMA & Proto. Plasm systems 26

References 1. J. Madhavan, P. A. Bernstein, E. Rahm Generic Schema Matching with Cupid References 1. J. Madhavan, P. A. Bernstein, E. Rahm Generic Schema Matching with Cupid VLDB, 2001 2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches VLDB, 2002 3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix Industrial-Strength Schema Matching SIGMOD Record 33(4), 2004 27