
5c3ded51845cad2157e13d82f0ae18cc.ppt
- Количество слайдов: 27
CSE 636 Data Integration Schema Matching Cupid
Virtual Integration Architecture Run-Time Design-Time Schema Matching Query Reformulation Query Result End User Mediation Language Mediator Web Services Global Schema Optimization & Execution XML Wrapper Local Data Schema Source 2
Schema Heterogeneity Independently created schemas… … might be modeling similar information… … in slightly different ways DB 1 ugrad * ugrad. ID name enrollment * course. ID ugrad. ID grade DB 2 course * course. ID title student * student. ID name type evaluation DB 3 student * student. ID name type course * course. ID title ? type letter 3
Schema Heterogeneity DB 1 ugrad * ugrad. ID name enrollment * course. ID ugrad. ID grade • • DB 2 course * course. ID title student * student. ID name type evaluation DB 3 student * student. ID name type course * course. ID title ? type letter Similar entities represented Dissimilar structures (inverted nesting) Different element names for similar data values Similar element names for different data values 4
Schema Matching vs. Schema Mapping • GAV and LAV are schema mapping languages • Mappings: – set of queries – associations + semantics • Match: – set of associations only • Schema Matching: – Identifying associations – First step towards constructing mappings 5
Schema Matching vs. Schema Mapping Semantics for $s 1 in DB 3/student where $s 1/type = ‘UGRAD’ return
The Problem of Schema Matching Input • Schemas S 1 and S 2 • Possibly data instances for S 1 and S 2 • Background knowledge – – – thesauri validated matches standard schemas reference instances ontologies constraints (keys, data types etc) Output • Associations between S 1 and S 2 Goal • Schema matching tools with significant automated support 7
Schema Matching How is the match result expressed? DB 2 course * course. ID title student * student. ID name type evaluation DB 3 student * student. ID name type course * course. ID title ? type letter • Pairs of paths • Lists of paths • Schema names 8
Schema Matching What do we match? • 1. 2. 3. Depends on the queries we want to ask Elements in isolation (leaves in particular) Substructures Whole schemas 9
Motivation • Important component in many applications – Data Integration – Data Migration – E-Commerce • Model Management [Bernstein, Halevy, Pottinger ’ 00] – Algebra for manipulating models and mappings – Match, Merge, Compose … 10
Problems • Minimize user involvement (semi-automatic) • Data model independent matching (generic) • Schema matching is a hard problem – Naming and structural differences in schemas – Similar, but non-identical concepts modeled – Multiple data models – SQL DDL, XML, ODMG… 11
Schema Matching Approaches How to match? Individual matchers Schema-based Per-Element Content-based Structural Combined matchers Hybrid Per-Element Linguistic Constraintbased • Graph matching manual automatic composition Linguistic Constraintbased • Names • Descriptions Composite • IR (word • Value pattern frequencies, and ranges key terms) • Types • Keys Taxonomy based survey: Rahm and Bernstein, VLDB J, 2001 12
Cupid Individual matchers Schema-based Per-Element Content-based Structural Combined matchers Hybrid Per-Element Linguistic Constraintbased • Graph matching manual automatic composition Linguistic Constraintbased • Names • Descriptions Composite • IR (word • Value pattern frequencies, and ranges key terms) • Types • Keys Madhavan, Bernstein and Rahm, VLDB, 2001 13
Cupid Example PO Purchase. Order POLines Items POShip. To Deliver. To Item Name Item City Street Name Line Item. Number Uo. M Unitof. Measure Qty Address City Street Quantity 14
Cupid Architecture Schema 1 Linguistic Matching Schema 2 LSIM Thesaurus Output Mapping Generate Mapping SSIM WSIM Structure Matching 15
Linguistic Matching • Heuristic name matching – Tokenization of names POOrder. Num PO, Order, Num – Expansion of short-forms, acronyms PO Purchase, Order; Number – Clustering of schema elements based on keywords and data-types Street, City, POAddress – Thesaurus of synonyms, hypernyms, acronyms – Linguistic Similarity coefficient (LSIM [0, 1] ) 16
Structure Matching PO Purchase. Order POLines Items Deliver. To POShip. To Item Name Item City Street Line Item. Number Uo. M Unitof. Measure Qty Name Address City Street Quantity 17
Structure Matching Mutually Reinforcing Similarity PO POLines Item Purchase. Order WSIM> thhigh Line Uo. M Qty SSIM ++ Items Item. Num Unitof. Measure Quantity 18
Structure Matching Context Dependent Disambiguation PO Purchase. Order POBill. To Invoice. To Deliver. To POShip. To Address City Street Address SSIM++ City SSIM-- SSIM++ Street City Street 19
Intuition • Atomic elements are similar – Linguistically and data-type similar – Their ancestors are similar • Compound elements (non-leaf) are similar if – Linguistically similar – Subtrees rooted at the elements are similar • Mutually recursive – Leaves determine internal node similarity – Similarity of internal nodes leads to increase in leaf similarity 20
Structure Match Details • Subtrees are similar if – Immediate children are similar – Leaf sets are similar • Subtree Similarity (nodes s and t) – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa – Less sensitive to variation in intermediate structure • Pruning the number of comparisons – Elements must have comparable number of leaves 21
Referential Integrity Purchase Order ID Order-Customer-fk Customer ID Product Name Customer ID Customer-Purchase-Order Customer Address Name Schema A Schema B • Join nodes added to the schema tree for each referential integrity constraint • Views can be similarly used 22
Cupid Architecture Schema 1 LSIM Linguistic Matching Schema 2 Thesaurus Generate Mapping Output Mapping Linguistic Similarity (LSIM ) SSIM WSIM Structure Matching Structural (SSIM Weighted (WSIM Similarity ), ) Invoice. To Bill. To 0. 7 Invoice. To Bill. To 0. 8 0. 7 Uo. M Unit. Measure 0. 9 Uo. M Unit. Measure 0. 7 0. 8 City 1. 0 Invoice. To/City Bill. To/City 0. 8 0. 9 23
Mapping Generation • Individual mapping elements computed from WSIM values: – Consider only mapping pairs that have WSIM greater than threshold – For each element of target find most similar source element – Not accepted mappings with high similarity are returned in order to help user modify map 24
Cupid Architecture Schema 1 Linguistic Matching Schema 2 LSIM Thesaurus Output Mapping Generate Mapping SSIM WSIM Structure Matching Input hint 25
Work Needed • A more robust solution – Auto-tuning parameters – Thesaurus Generation and Evolution • Schema matching component architecture – Easily extensible by adding multiple techniques – Data Instances for matching – Look at COMA & Proto. Plasm systems 26
References 1. J. Madhavan, P. A. Bernstein, E. Rahm Generic Schema Matching with Cupid VLDB, 2001 2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches VLDB, 2002 3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix Industrial-Strength Schema Matching SIGMOD Record 33(4), 2004 27