
A Survey of Approaches to Automatic Schema Matching
Name: Samer Samarah
Number: 3740535
Email: SSamarah@Site.Uottawa.Ca
This presentation is based on the following paper: E. Rahm, P. Bernstein, "A Survey of Approaches to Automatic Schema Matching", VLDB Journal, 10(4): 334-350 (2001).

Presentation Outline
o Definition of the schema matching problem.
o Applications of schema matching.
o Schema matching approaches.
o Personal contribution.
o Conclusion.

Schema Matching: Definition
o Schema matching is the process of finding semantic correspondences between the elements of two schemas.
o Schema matching is achieved through the Match operation.
o The Match operator is a function that takes two schemas as input and returns a mapping between them as output, called the match result: Match(S1, S2) -> Match Result.

The Match Operator
Match(S1, S2) -> Match Result
o A schema (either S1 or S2) is defined as a set of elements connected by some structure (ER model, OO model, ...).
o The match result is a set of mapping elements, each of which indicates that certain elements of S1 are mapped to certain elements of S2.
o A mapping expression, which specifies how the S1 and S2 elements are related, may be associated with each mapping element.
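
The definition above can be sketched as a small data model. This is only an illustration: the class name MappingElement, its fields, and the case-insensitive name comparison are assumptions made for this example and do not come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappingElement:
    """One entry of a match result: some S1 elements mapped to some S2 elements."""
    s1_elements: tuple[str, ...]
    s2_elements: tuple[str, ...]
    expression: Optional[str] = None  # optional mapping expression


def match(s1: set[str], s2: set[str]) -> list[MappingElement]:
    """Toy Match(S1, S2): pair elements whose names are equal ignoring case.

    A real matcher would use far richer evidence; this stand-in only
    shows the shape of the input and of the match result.
    """
    by_lower = {name.lower(): name for name in s2}
    result = []
    for name in sorted(s1):
        partner = by_lower.get(name.lower())
        if partner is not None:
            result.append(MappingElement((name,), (partner,), f"{partner} = {name}"))
    return result


result = match({"Name", "Salary", "Dept"}, {"name", "salary", "Phone"})
for element in result:
    print(element.s1_elements, "->", element.s2_elements, "|", element.expression)
```

Each MappingElement holds tuples on both sides so that the same structure can also carry 1:n or n:1 correspondences.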

The Match Operator [figure]

Schema Matching Problem
o Schema matching is currently performed manually, which makes it:
  - Error-prone.
  - Tedious.
  - Time-consuming.
o The solution is to automate the match function, but:
  - There is no mathematical model that captures the matching process.
  - Matching is application dependent.

Schema Matching Applications
o Schema integration.
o Data warehouses.
o E-commerce.
o Semantic query processing.
o Data integration systems.

1 - Schema Integration
o Schema integration is the process of constructing a global view from a set of independently developed schemas.
o It is achieved by identifying the inter-schema relationships (applying schema matching) and then unifying the matched elements.

Schema Matching Example:
S1 {SName, ...}, S2 {FName, LName, ...}, ..., Sn -> Match + Unify -> Sg {Name, ...}
Determining the correspondence between SName and FName, LName is achieved through the match operation.

2 - Data Warehouses
o A data warehouse (DWH) is a decision-support database that is extracted from a set of data sources.
o The DWH and the sources represent data in different formats (e.g. relations or XML versus a multidimensional view).
o Constructing a DWH requires transforming the data from its source format into the DWH format.
o The match operation can be used to identify the elements in the sources that are represented in the DWH; an appropriate transformation can then be designed according to this mapping.

3 - E-commerce
o Trading partners exchange messages that describe business transactions.
o Each partner uses its own message format (EDI, XML, ...) or a different message schema.
o To exchange messages, they must be translated into the format required by each partner (a matching problem).

4 - Semantic Query Processing
o A run-time scenario in which the user specifies the output of a query (the SELECT clause) in terms of concepts familiar to them, which may not be the same concepts present in the DB, and the system figures out how to produce that output (i.e. the FROM and WHERE clauses in SQL).
o The match operation is used to determine the mapping between the user's concepts and the DB concepts.

4 - Semantic Query Processing (Example)
User request: all employees who earn more than $2000.
DB schema: Employee(FName, LName, Salary)
Applying Match yields the mapping elements, from which a new query is produced in SQL format:
{Select FName, LName From Employee Where Salary > 2000}
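
How a mapping between user concepts and DB concepts can drive query generation is sketched below. The concept names ("first name", "earnings", ...), the dictionary layout, and the build_query helper are all hypothetical; the sketch also assumes that every concept maps to a column of a single table.

```python
# Hypothetical result of a match step: user concept -> (table, column).
concept_to_column = {
    "first name": ("Employee", "FName"),
    "last name": ("Employee", "LName"),
    "earnings": ("Employee", "Salary"),
}


def build_query(outputs, condition_concept, operator, value):
    """Translate user-level concepts into a single-table SQL query string."""
    select_cols = [concept_to_column[c][1] for c in outputs]
    table, cond_col = concept_to_column[condition_concept]
    return (
        f"Select {', '.join(select_cols)} "
        f"From {table} Where {cond_col} {operator} {value}"
    )


query = build_query(["first name", "last name"], "earnings", ">", 2000)
print(query)
```

A production system would also have to derive joins when the matched columns live in different tables, and would bind the value as a query parameter rather than formatting it into the string.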

5 - Data Integration System
o The major component of a data integration system is the source description.
o The source description maps the source schemas to the mediated schema.
o The match operation is applied to specify this mapping.

Generic Match Architecture
o The schemas to be matched are represented in a uniform internal representation.
o An importer translates the input schemas from their native representation into the internal representation.
o An exporter translates the match result produced by the matcher from the internal representation into the representation required by each tool.

Generic Match Architecture [diagram]
Tools: Tool 1 (Portal Schemas), Tool 2 (E-business Schemas), Tool 3 (DWH Schemas), Tool 4 (DB Schemas)
Components: Schema import / export; Global Libraries (dictionaries, schemas, ...); Generic Match Implementation; Internal Schema Representation

Classification of schema matching approaches
o Instance vs. schema: consider instance data or schema information.
o Element vs. structure: matching is performed for individual schema elements (attributes) or for combinations of elements (structures).
o Language vs. constraint: use linguistic information (names and textual descriptions) or constraint information (keys, relationships).
o Matching cardinality: the overall match result may relate one or more elements of one schema to one or more elements of the other (1:1, 1:n, m:n).
o Auxiliary information: the use of auxiliary information (dictionaries, previous matching results, user input, ...).

Classification of schema matching approaches [taxonomy diagram]
Schema Matching Approaches
  Individual Matcher
    Schema Based: Element Level, Structure Level
    Instance Based: Element Level
    (element-level and structure-level matchers are further divided into Linguistic and Constraint based)
  Combined Matcher
    Hybrid Matcher
    Composite Matcher: Manual, Automatic

Schema level matcher
o Considers only schema information, not instance data, such as:
  - Name
  - Description
  - Data type
  - Relationship (is-a, part-of)
  - Constraints
  - Schema structure
o Multiple match candidates may be found, each of which is assigned a degree of similarity.

Granularity of match
o Element-level matching: for each element in the first schema, determine the matching elements in the second schema.
o Structure-level matching: match combinations of elements that appear together in a structure.
  - The match can be full or partial.

Granularity of match: Example
(S.L = structure level, E.L = element level)

S1 elements      S2 elements        Match granularity
Address          CustomerAddress    S.L, full match
  Street           Street
  City             City             E.L
  State            USState          E.L
  ZIP              PostalCode       E.L
AccountOwner     Customer           S.L, partial match
  Name             CName            E.L
  Address          CAddress         E.L
  Birthdate
  TaxExempt
                   CPhone

Linguistic approaches
o Linguistic matchers use names and text (words or sentences) to find semantically similar schema elements:
  - Name matching
  - Description matching

Name Matching
o Name-based matching matches schema elements with equal or similar names:
  - Equality of the canonical name representations after stemming and other preprocessing.
  - Equality of synonyms.
  - Equality of hypernyms (X is a hypernym of Y if Y is a kind of X).
  - Similarity based on common substrings, edit distance, or Soundex.
  - User-provided name matches.
o Thesauri or dictionaries should be exploited.

Name Matching: Example
Two schemas, S1 and S2, represent two automobile suppliers.

S1 elements    S2 elements        Matching based on
CarID          TruckID            Car, Truck (hypernyms): a car is an automobile and a truck is an automobile
Brand          Make               Synonyms
Price          Price              Equality of names
SoldTo         Sold2              Soundex
CAddress       CustomerAddress    Preprocessing
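
The name-matching criteria can be sketched as a cascade of checks. The abbreviation list, the thesaurus entries, and the hypernym table below are tiny hypothetical stand-ins for real dictionaries, and the Soundex routine is a compact variant written for this example. Edit distance and common-substring similarity are left out to keep the sketch short.

```python
import re

ABBREVIATIONS = {"c": "customer"}           # hypothetical abbreviation list
SYNONYMS = {frozenset({"brand", "make"})}   # hypothetical thesaurus entries
HYPERNYMS = {"car": "automobile", "truck": "automobile"}  # hypothetical

SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}


def canonical(name):
    """Split a CamelCase name into lower-case tokens and expand abbreviations."""
    tokens = re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", name)
    return " ".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens)


def soundex(name):
    """Four-character Soundex code; non-letters are ignored."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":          # H and W do not separate equal codes
            continue
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]


def name_match(a, b):
    """Return the criterion under which a and b match, or None."""
    ca, cb = canonical(a), canonical(b)
    if a.lower() == b.lower():
        return "equality of names"
    if ca == cb:
        return "preprocessing"
    if frozenset({ca, cb}) in SYNONYMS:
        return "synonyms"
    if ca in HYPERNYMS and HYPERNYMS.get(ca) == HYPERNYMS.get(cb):
        return "common hypernym"
    if soundex(a) == soundex(b):
        return "soundex"
    return None


pairs = [("Price", "Price"), ("CAddress", "CustomerAddress"),
         ("Brand", "Make"), ("Car", "Truck"), ("SoldTo", "Sold2")]
for s1_name, s2_name in pairs:
    print(s1_name, s2_name, name_match(s1_name, s2_name))
```

The order of the checks is a design choice: cheaper and more reliable criteria come first, and the phonetic comparison is used only as a last resort.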

Constraint-based approaches
o Exploit the constraint information associated with the input schemas to determine the similarity of schema elements:
  - Data types and domain constraints
  - Key characteristics (primary, unique)
  - Relationship cardinality
  - Structural constraints such as foreign keys (used by structure-level matching approaches)

Constraint-based approaches: Example
S1 elements:
  Employee: EmpNo {int, PK}, EmpName {string}, DeptNo {int, ref dep}, Salary {single}, BirthDate {date}
  Department: DeptNo {int, PK}, DeptName {string}
S2 elements:
  Personal: Pno {int, PK}, Pname {string}, Dept {string}, Born {date}
Matching:
  Born  <-> BirthDate {type}
  Pno   <-> EmpNo | DeptNo {key}
  Pname <-> DeptName {type}, EmpName {type}
  Dept  <-> DeptName {type}, EmpName {type}
Several match candidates can result, so this approach can be used to limit the number of candidates.
Note (structural matching): S2.Personal {Pno, Pname, Dept, Born} corresponds to
Select S1.Employee.EmpNo, S1.Employee.EmpName, S1.Department.DeptName, S1.Employee.BirthDate From S1.Employee, S1.Department Where (S1.Employee.DeptNo = S1.Department.DeptNo)
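
A minimal sketch of constraint-based candidate generation, using the element names of this example. The dictionary layout and the rule "same data type and same key flag" are assumptions made for illustration; they are not the definition used in the paper.

```python
# Each element is described by its data type and whether it is a key.
S1 = {
    "Employee.EmpNo": {"type": "int", "key": True},
    "Employee.EmpName": {"type": "string", "key": False},
    "Employee.DeptNo": {"type": "int", "key": False},
    "Employee.Salary": {"type": "single", "key": False},
    "Employee.BirthDate": {"type": "date", "key": False},
    "Department.DeptNo": {"type": "int", "key": True},
    "Department.DeptName": {"type": "string", "key": False},
}
S2 = {
    "Personal.Pno": {"type": "int", "key": True},
    "Personal.Pname": {"type": "string", "key": False},
    "Personal.Dept": {"type": "string", "key": False},
    "Personal.Born": {"type": "date", "key": False},
}


def constraint_candidates(s1, s2):
    """For every S2 element, list the S1 elements with identical constraints."""
    return {
        e2: sorted(e1 for e1, c1 in s1.items() if c1 == c2)
        for e2, c2 in s2.items()
    }


candidates = constraint_candidates(S1, S2)
for s2_element, s1_elements in candidates.items():
    print(s2_element, "<->", s1_elements)
```

Constraints alone leave several candidates for Pno, Pname and Dept, which is why this criterion is better suited to pruning the candidates produced by other matchers than to deciding a match by itself.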

Description Matching
o Based on a linguistic evaluation of the comments associated with schema elements.
Example:
  S1: empn // employee name
  S2: name // name of employee
NL understanding technology: empn <-> name

Reusing Schema and Mapping Information
o This approach supports and exploits the reuse of common schema components and previously determined mappings.
o It is useful when different but similar schemas are matched to the same destination schema.

Reusing Schema and Mapping Information: Example
Schema S1: Purchase order 2
Schema S2: Purchase order
Schema S: POrder (Article, Payee, BillAddress, Recipient, ShipAddress)
Elements of S1 and S2 shown in the diagram: Product, BillTo, ShipTo, Contact, ContactPhone, Name, Address
Goal: map S1 to S. The match results between S2 and S were determined previously and can be reused to map S1 to S.
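
Reuse can be pictured as composing two mappings. In the sketch below both dictionaries are hypothetical: the individual correspondences are assumed for illustration and are not read off the slide.

```python
# Assumed, previously determined match result between S2 and S.
s2_to_s = {"Product": "Article", "BillTo": "Payee", "ShipTo": "Recipient"}

# Assumed match result between the similar schemas S1 and S2.
s1_to_s2 = {
    "Product": "Product",
    "BillTo": "BillTo",
    "ShipTo": "ShipTo",
    "Contact": "ContactPhone",
}


def compose(first, second):
    """Chain two 1:1 mappings; elements without a partner in `second` are dropped."""
    return {a: second[b] for a, b in first.items() if b in second}


s1_to_s = compose(s1_to_s2, s2_to_s)
print(s1_to_s)
```

Elements that drop out of the composition (Contact in this sketch) still have to be matched directly against S.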

Match cardinality
o Global cardinality: in how many mapping elements of the match result an S1 (or S2) element can participate.
o Local cardinality: how many elements of S1 are matched to how many elements of S2 within an individual mapping element.
o Most approaches are restricted to 1:1 local cardinality and 1:1 or 1:n global cardinality.

Match cardinality: Example

#  Local match cardinality                      S1 elements               S2 elements          Matching expression
1  1:1 element level                            Price                     Amount               Amount = Price
2  n:1 element level                            Price, Tax                Cost                 Cost = Price * (1 + Tax/100)
3  1:n element level                            Name                      FirstName, LastName  FirstName, LastName = Extract(Name, ..)
4  n:1 structure level (n:m element level)      B.Title, B.PuNo, P.Name   A.Book, A.Publisher  A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Price has 1:n global cardinality.
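
The first three mapping expressions can be written as small functions. The function names are made up for this example, and split_name is only a hypothetical stand-in for Extract(Name, ..), whose real arguments are not given on the slide.

```python
def amount(price):
    """1:1 element level: Amount = Price."""
    return price


def cost(price, tax):
    """n:1 element level: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax / 100)


def split_name(name):
    """1:n element level: assumed behaviour, split at the first space."""
    first, _, last = name.partition(" ")
    return first, last


# Price participates in two mapping elements (Amount and Cost),
# which is what gives it 1:n global cardinality.
mapping_elements = [
    (("Price",), ("Amount",), amount),
    (("Price", "Tax"), ("Cost",), cost),
    (("Name",), ("FirstName", "LastName"), split_name),
]
price_uses = sum("Price" in s1 for s1, _, _ in mapping_elements)

print(amount(100), cost(100, 25), split_name("Samer Samarah"), price_uses)
```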

Instance level approaches
o Consider the data contents.
o Useful when schema information is limited (or there is no schema at all).
o Enhance schema matching by considering elements whose instances are more similar:
  - Linguistic approach based on IR techniques for text elements.
  - Constraint-based approach, such as value ranges and averages for numeric elements.

Instance level approach: Example

EmpNo  Dept          SSN  Works for
234    Marketing     230  Accounting
235    Accounting    229  Marketing
236    Marketing     228  Marketing

{Dept ≈ Works for} (based on the frequency of "Marketing")
{EmpNo ≈ SSN} (based on value range)
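
A rough sketch of how instance data could support these two correspondences. Both similarity measures are assumptions chosen for illustration: text columns are compared by the overlap of their relative value frequencies, and numeric columns by the closeness of their averages.

```python
from collections import Counter
from statistics import mean

emp_no = [234, 235, 236]
dept = ["Marketing", "Accounting", "Marketing"]
ssn = [230, 229, 228]
works_for = ["Accounting", "Marketing", "Marketing"]


def text_similarity(a, b):
    """Overlap of relative value frequencies (1.0 = identical distributions)."""
    fa, fb = Counter(a), Counter(b)
    return sum(min(fa[v] / len(a), fb[v] / len(b)) for v in fa.keys() | fb.keys())


def numeric_similarity(a, b):
    """Closeness of the column averages, scaled into [0, 1] for positive data."""
    ma, mb = mean(a), mean(b)
    largest = max(abs(ma), abs(mb))
    return 1.0 if largest == 0 else 1 - abs(ma - mb) / largest


dept_score = text_similarity(dept, works_for)
number_score = numeric_similarity(emp_no, ssn)
print(round(dept_score, 3), round(number_score, 3))
```

With only three rows per column the scores are merely suggestive; real instance-level matchers work on much larger samples.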

Combining matchers
o Combine several approaches to obtain good match candidates.
  - Hybrid matcher: combines several matching approaches to determine match candidates based on multiple criteria (name, type constraints).
    * More effective (poor candidates are filtered out early).
    * Better performance (fewer passes over the schema).
  - Composite matcher: combines the results of several independently executed matchers, including hybrid matchers.
    * More flexible than hybrid matchers (allows selecting from a set of matchers).
    * The combination of the results can be automatic or manual.
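
A composite matcher can be sketched as a weighted combination of independently computed scores. The two matchers, the integer weights, and the element layout below are assumptions for this example; difflib.SequenceMatcher from the Python standard library stands in for a real name matcher.

```python
from difflib import SequenceMatcher


def name_matcher(e1, e2):
    """String similarity of the element names, in [0, 1]."""
    return SequenceMatcher(None, e1["name"].lower(), e2["name"].lower()).ratio()


def type_matcher(e1, e2):
    """1.0 if the data types are identical, otherwise 0.0."""
    return 1.0 if e1["type"] == e2["type"] else 0.0


def composite(e1, e2, matchers, weights):
    """Weighted average of the scores of independently executed matchers."""
    total = sum(w * m(e1, e2) for m, w in zip(matchers, weights))
    return total / sum(weights)


street_1 = {"name": "street", "type": "string"}
street_2 = {"name": "street", "type": "string"}
emp_id = {"name": "eID", "type": "integer"}

matchers, weights = [name_matcher, type_matcher], [2, 1]
score_same = composite(street_1, street_2, matchers, weights)
score_diff = composite(street_1, emp_id, matchers, weights)
print(score_same, round(score_diff, 3))
```

Because the matchers run independently, one can be added or dropped by editing the two lists, which is the flexibility attributed to composite matchers above.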

Personal contribution
o A prototype Datalog program was designed to implement a composite matcher.
o The program was tested on the DES system (Datalog Educational System), available at: http://www.fdi.ucm.es/professor/fernan/DES/
o The program gives priority to the linguistic approach over the constraint-based approach.

Database Description
o The database contains the following predicates:
  - Source(ElementName, DataType, Constraints, ...) // describes the sources
  - Dictionary(Name1, Name2) // a dictionary that provides synonyms

The Program Rules
% Linguistic candidates: equal names (cand1) or dictionary synonyms (cand2).
cand1(N, N) :- s1(N, D, C), s2(N, DataType, Constraint).
cand2(N, N1) :- s1(N, D, C), s2(N1, DataType, Constraint), d(N, N1).
% Names that already have a linguistic match.
ok1(N1) :- cand1(N, N1).
ok2(N1) :- cand2(N, N1).
% Constraint candidates for the remaining names: same data type (cand3) or same constraint (cand4).
cand3(N, N1) :- s1(N, D, C), s2(N1, D, Constraint), not(ok1(N)), not(ok2(N1)), not(ok1(N1)), not(ok2(N)).
cand4(N, N1) :- s1(N, DataType, C), s2(N1, D, C), not(ok1(N)), not(ok2(N1)), not(ok1(N1)), not(ok2(N)).
% Final match result.
match(N, N1) :- cand1(N, N1).
match(N, N1) :- cand2(N, N1).
match(N, N1) :- cand3(N, N1), cand4(N, N1).
match(N, N1) :- cand3(N, N1), not(cand4(N, N1)).
match(N, N1) :- cand4(N, N1), not(cand3(N, N1)).

Example
o The program was tested on the following schemas:
s1(eNo, integer, pk).
s1(city, string, 20).
s1(street, string, 30).
s1(state, string, 12).
s2(eID, integer, pk).
s2(cName, string, 15).
s2(street, string, 10).
s2(province, string, 25).
d(state, province).
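
To see how rules of this kind evaluate on the facts above, the following Python sketch emulates them with set comprehensions. It is a hand translation under one reading of the program (equal names, then dictionary synonyms, then same data type or same constraint for names that have no linguistic match yet); it is not output from DES and is not meant to reproduce the author's result screen.

```python
s1 = [("eNo", "integer", "pk"), ("city", "string", "20"),
      ("street", "string", "30"), ("state", "string", "12")]
s2 = [("eID", "integer", "pk"), ("cName", "string", "15"),
      ("street", "string", "10"), ("province", "string", "25")]
d = {("state", "province")}  # synonym dictionary

# cand1: equal names; cand2: names related through the dictionary.
cand1 = {(n, n1) for n, _, _ in s1 for n1, _, _ in s2 if n == n1}
cand2 = {(n, n1) for n, _, _ in s1 for n1, _, _ in s2 if (n, n1) in d}

# ok1 / ok2: the second component of each linguistic candidate.
ok1 = {n1 for _, n1 in cand1}
ok2 = {n1 for _, n1 in cand2}


def unmatched(n, n1):
    """True when neither name appears in ok1 or ok2."""
    return not ({n, n1} & ok1) and not ({n, n1} & ok2)


# cand3: same data type; cand4: same constraint; both only for unmatched names.
cand3 = {(n, n1) for n, t, _ in s1 for n1, t1, _ in s2
         if t == t1 and unmatched(n, n1)}
cand4 = {(n, n1) for n, _, c in s1 for n1, _, c1 in s2
         if c == c1 and unmatched(n, n1)}

# The match rules together amount to the union of all candidate sets.
match = cand1 | cand2 | cand3 | cand4
for pair in sorted(match):
    print(pair)
```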

Results [figure]

Conclusion
o A taxonomy of schema matching approaches was presented in order to compare the different approaches.
o The generic implementation described in the paper could serve as the basis for any new implementation.
o Different techniques could be used to automate schema matching, such as natural language processing, machine learning, and IR.
o Fully automated matching (without user interaction) has not been achieved yet.

Thank You

References
o E. Rahm, P. A. Bernstein, "A Survey of Approaches to Automatic Schema Matching", VLDB Journal, 10(4): 334-350 (2001).
o DES system (Datalog Educational System), available at: http://www.fdi.ucm.es/professor/fernan/DES/.
o Jayant Madhavan, Philip A. Bernstein, Erhard Rahm, "Generic Schema Matching with Cupid", Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
o Hong-Hai Do, Sergey Melnik, Erhard Rahm, "Comparison of Schema Matching Evaluations", University of Leipzig, Augustusplatz 10-11, 04109 Leipzig, Germany.
o AnHai Doan, Pedro Domingos, Alon Levy, "Learning Source Descriptions for Data Integration", University of Washington, Seattle, WA 98195.

Appendix
1 - Cupid (Microsoft Research)
2 - LSD (Learning Source Description)

1 - Cupid (Microsoft Research)
o A hybrid matcher based on element and structure matching.
o What is new in this approach is that the schemas are represented as graphs that encode the referential constraints as structures, which can be matched just like other structures.
o The algorithm has two phases:
  - Linguistic matching: matches individual schema elements based on their names, data types, domains, ...
  - Structural matching: matches schema elements based on the similarity of their contexts.

2 - LSD (Learning Source Description)
o A composite matcher with automatic combination of match results.
o An attempt to automate the mappings between source schemas and the mediated schema in a data integration system.
o Uses machine learning techniques to match a new data source against a previously determined global schema.
o After a set of data sources has been manually mapped to a mediated schema, the system should be able to glean significant information from these mappings for subsequent data sources.