
84af209732c582dbebd84034d743ec58.ppt
- Number of slides: 21
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
Bin He
Joint work with: Kevin Chen-Chuan Chang, Jiawei Han
Univ. Illinois at Urbana-Champaign
Context: MetaQuerier (slide 2)
- Large-scale integration of the deep Web
- [Figure labels: Query, Result, MetaQuerier, The Deep Web]
Challenge: Matching query interfaces (QIs) (slide 3)
- [Figure labels: Book Domain, Music Domain, m:n complex matching, 1:1 simple matching]
Demo (slide 4)
Traditional approaches to schema matching: pairwise attribute correspondence (slide 5)
- Pairwise matching
  - S1: author, title, subject, ISBN
  - S2: name, title, category, format
- Pairwise attribute correspondence
  - S1.author ↔ S2.name
  - S1.subject ↔ S2.category
- But scale is a challenge
  - How to address the challenge of large scale?
- And scale is an opportunity!
  - How to leverage the opportunity of large scale?
A holistic schema matching paradigm (slide 6)
- Input: set of schemas
  - S1: author, title, subject, ISBN
  - S2: writer, title, category, format
  - S3: name, title, keyword, binding
- Output: semantic model for all attribute matchings
  - author = name = writer
  - subject = category
  - format = binding
Holistic matching is, in essence, data mining to discover semantics for information integration (slide 7)
- [Diagram: Semantics (semantic correspondences) and Observations (attribute occurrences), linked by Hidden Regularities]
- Our hypothesis: Generation
- Our approach: Statistical analysis for model discovery
Regularity: Co-occurrence patterns (slide 8)
- Grouping attributes
- Synonym attributes
- Example: Author = {Last Name, First Name}
- [Figure: query interfaces from (a) amazon.com, (b) www.randomhouse.com, (c) bn.com, (d) 1bookstreet.com]
Schema matching as correlation mining (slide 9)
Across many sources:
- Synonym attributes with negative correlation
  - synonym attributes are semantic alternatives
  - thus, they rarely co-occur in query interfaces
- Grouping attributes with positive correlation
  - grouping attributes are semantic complements
  - thus, they often co-occur in query interfaces
Data preparation: Prepare schema transactions to be mined (slide 10)
- Interface Extraction [SIGMOD'04]
  - [Figure labels: attribute, operator, value]
- Type Recognition
  - Type is not declared in Web interfaces
  - Identify types from instance values, e.g., integer, datetime
  - Used for constraining merging and matching
- Syntactic Merging
  - merge attributes with syntactically similar names, e.g., "title of book" to "title", "author's name" to "author"
  - merge attributes with syntactically similar instance values
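The type recognition step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the list of date formats, and the choice to return "any" when an attribute has no instance values are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical date formats; a real system would need a broader list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def is_integer(value):
    """True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False

def is_datetime(value):
    """True if the string matches one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

def infer_type(instance_values):
    """Guess an attribute type from the instance values shown in the interface."""
    values = [v.strip() for v in instance_values if v.strip()]
    if not values:
        return "any"  # assumption: no instance values means an unconstrained type
    if all(is_integer(v) for v in values):
        return "integer"
    if all(is_datetime(v) for v in values):
        return "datetime"
    return "string"
```

For instance, `infer_type(["1", "2", "3"])` gives "integer", while a list of format names such as "hardcover" and "paperback" falls through to "string".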
DCM: Dual Correlation Mining framework (slide 11)
1. Positive correlation mining as potential groups
   - Mining positive correlations
   - Last Name (any), First Name (any)
2. Negative correlation mining as potential matchings
   - Mining negative correlations
   - Author (any) = {Last Name (any), First Name (any)}
   - ISBN (any) = {Last Name (any), First Name (any)}
3. Matching selection as model construction
   - Author (any) = {Last Name (any), First Name (any)}
   - Subject (string) = Category (string)
   - Format (string) = Binding (string)
Correlation measure for qualification (slide 12)
To find groups and matchings that pass the correlation threshold
- Observation: pairwise correlations
  - e.g., in the Airfares domain, to = arrival city = destination
  - to and arrival city are negatively correlated
  - to and destination are negatively correlated
  - arrival city and destination are negatively correlated
- Measure: Cmin = min m(Ai, Aj), for all i ≠ j
  - m: some correlation measure for two items
  - supports downward closure, which enables the Apriori algorithm
  - accommodates different measures m
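The Cmin measure can be sketched in a few lines of Python. The pairwise scores below are made-up numbers for illustration only, and `c_min`, `SCORES`, and `pair_score` are hypothetical names, not taken from the original work.

```python
from itertools import combinations

def c_min(attributes, m):
    """Cmin: smallest value of the pairwise measure m over all distinct pairs."""
    return min(m(a, b) for a, b in combinations(attributes, 2))

# Made-up negative-correlation scores, for illustration only.
SCORES = {
    frozenset({"to", "arrival city"}): 0.85,
    frozenset({"to", "destination"}): 0.90,
    frozenset({"arrival city", "destination"}): 0.80,
}

def pair_score(a, b):
    """Symmetric lookup of the illustrative score for an attribute pair."""
    return SCORES[frozenset({a, b})]

triple = c_min(["to", "arrival city", "destination"], pair_score)
pair = c_min(["to", "destination"], pair_score)
```

Because the minimum is taken over a larger set of pairs, the Cmin of a set can never exceed the Cmin of any of its subsets. This is the downward closure property that the slide relies on.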
The mining process: a standard Apriori algorithm (slide 13)
- Schema transactions
  - Departure City, Destination
  - From, …, To, …
  - …
  - Departure City, …, Arrival City, …
- Correlated items with length 2
  - Destination = To
  - Destination = Arrival City
  - To = Arrival City
  - Departure City = From
  - …
- Correlated items with length 3
  - Destination = To = Arrival City
  - …
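The level-wise search can be sketched as follows. This is a generic Apriori-style loop under the Cmin qualification, not the authors' code. The function name, the threshold, and the pairwise scores are assumptions chosen so that the output mirrors the example on the slide.

```python
from itertools import combinations

def mine_correlated_sets(attributes, m, threshold):
    """Level-wise search for attribute sets whose Cmin passes the threshold."""
    # Level 2: keep every pair whose measure passes the threshold.
    current = {frozenset(p) for p in combinations(sorted(attributes), 2)
               if m(*p) >= threshold}
    result = []
    while current:
        result.extend(current)
        # Candidate generation: join two sets that differ in exactly one item.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # Downward closure: a candidate qualifies only if every subset that is
        # one item smaller qualified, which implies all of its pairs pass.
        current = {c for c in candidates
                   if all(c - {x} in current for x in c)}
    return result

# Made-up pairwise scores, for illustration only; missing pairs score 0.
SCORES = {
    frozenset({"destination", "to"}): 0.90,
    frozenset({"destination", "arrival city"}): 0.80,
    frozenset({"to", "arrival city"}): 0.85,
    frozenset({"departure city", "from"}): 0.90,
}

def pair_score(a, b):
    return SCORES.get(frozenset({a, b}), 0.0)

ATTRS = ["destination", "to", "arrival city", "departure city", "from"]
found = mine_correlated_sets(ATTRS, pair_score, threshold=0.5)
```

With these scores the search returns four sets of length 2 and one of length 3 (destination, to, arrival city).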
Correlation measures for ranking (slide 14)
To rank and select matchings in model construction
- Qualification measure is not good for ranking
  - a set cannot win over its subset due to the downward closure, e.g., min({1, 2, 3}) < min({2, 3})
  - yet a superset contains more matchings and should be preferred
- Ranking measure: Cmax = max m(Ai, Aj), for all i ≠ j
  - a set does not win over its superset
  - ties are broken by semantic richness
    - A1 = A2 = A3 is semantically richer than A1 = A2
    - A1 = {A2, A3} is semantically richer than A1 = A2
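A minimal sketch of ranking by Cmax follows. It flattens each candidate matching into a tuple of attribute names and uses the attribute count as a stand-in for semantic richness. Both simplifications, along with the scores and names, are assumptions for this example rather than the method as published.

```python
from itertools import combinations

def c_max(attributes, m):
    """Cmax: largest value of the pairwise measure m over all distinct pairs."""
    return max(m(a, b) for a, b in combinations(attributes, 2))

def rank_matchings(candidates, m):
    """Sort by Cmax, breaking ties in favour of candidates with more attributes."""
    return sorted(candidates, key=lambda c: (c_max(c, m), len(c)), reverse=True)

# Made-up pairwise scores, for illustration only.
SCORES = {
    frozenset({"destination", "to"}): 0.90,
    frozenset({"destination", "arrival city"}): 0.80,
    frozenset({"to", "arrival city"}): 0.85,
    frozenset({"departure city", "from"}): 0.70,
}

def pair_score(a, b):
    return SCORES[frozenset({a, b})]

CANDIDATES = [
    ("destination", "to"),
    ("departure city", "from"),
    ("destination", "to", "arrival city"),
]
ranked = rank_matchings(CANDIDATES, pair_score)
```

The pair (destination, to) and the triple both have Cmax = 0.90, so the tie-break places the triple first.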
Choosing the m: Measuring the correlation of two items (slide 15)
- Contingency table
- We explore 22 measures, e.g.,
  - Lift = f00 f11 / (f01 f10)
  - Jaccard = f11 / (f11 + f01 + f10)
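The two example measures can be computed from schema transactions as sketched below. The contingency table itself is not preserved in this text, so the sketch assumes the usual reading: f11 counts schemas containing both attributes, f10 and f01 count schemas containing only the first or only the second, and f00 counts schemas containing neither. Lift is coded exactly as written on the slide. The zero-denominator handling and the toy transactions are assumptions.

```python
def contingency(transactions, a, b):
    """Count (f11, f10, f01, f00) for attributes a and b across schemas."""
    f11 = sum(1 for t in transactions if a in t and b in t)
    f10 = sum(1 for t in transactions if a in t and b not in t)
    f01 = sum(1 for t in transactions if a not in t and b in t)
    f00 = len(transactions) - f11 - f10 - f01
    return f11, f10, f01, f00

def lift(f11, f10, f01, f00):
    """Lift as written on the slide: f00 * f11 / (f01 * f10)."""
    if f01 * f10 == 0:
        return float("inf")  # assumption: treat a zero denominator as unbounded
    return (f00 * f11) / (f01 * f10)

def jaccard(f11, f10, f01, f00):
    """Jaccard: f11 / (f11 + f01 + f10); co-absence f00 is ignored."""
    total = f11 + f01 + f10
    return f11 / total if total else 0.0

# Toy schema transactions, invented for illustration.
TRANSACTIONS = [
    {"author", "title"},
    {"author", "title", "isbn"},
    {"title", "isbn"},
    {"author"},
    {"subject"},
]
counts = contingency(TRANSACTIONS, "author", "title")
```

Here `counts` is (2, 1, 1, 1), so `lift(*counts)` is 2.0 and `jaccard(*counts)` is 0.5.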
Choosing the m: The problems of existing measures (slide 16)
- Co-presence (f11) is more important than co-absence (f00)
  - Less positive correlation but a higher Lift = 17
  - More positive correlation but a lower Lift = 0.69
- Rare attributes are not statistically convincing
  - Ap as a rare attribute: Jaccard = 0.02
  - No rare attributes: Jaccard = 0.02
Choosing the m: H-measure (slide 17)
- H-measure: H = f01 f10 / (f+1 f1+)
- Ignores the co-absence
  - Less positive correlation: H = 0.25
  - More positive correlation: H = 0.07
- Differentiates the subtlety of negative correlations
  - Ap as a rare attribute: H = 0.49
  - No rare attributes: H = 0.92
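The H-measure formula translates directly into code. The sketch assumes the standard marginals f1+ = f11 + f10 and f+1 = f11 + f01. Returning 0.0 when an attribute never occurs is a convention chosen for this example, and the input counts are invented.

```python
def h_measure(f11, f10, f01, f00=0):
    """H = f01 * f10 / (f+1 * f1+); f00 is accepted but never used."""
    f1_plus = f11 + f10   # schemas containing the first attribute
    f_plus1 = f11 + f01   # schemas containing the second attribute
    if f1_plus == 0 or f_plus1 == 0:
        return 0.0  # assumption: undefined case mapped to no evidence
    return (f01 * f10) / (f_plus1 * f1_plus)

never_together = h_measure(f11=0, f10=10, f01=10)   # strongest negative correlation
always_together = h_measure(f11=10, f10=0, f01=0)   # no negative correlation
mixed = h_measure(f11=5, f10=5, f01=5)
mixed_many_absent = h_measure(f11=5, f10=5, f01=5, f00=1000)
```

Two attributes that never co-occur score 1.0, two that always co-occur score 0.0, and adding co-absent schemas (`f00=1000`) leaves the score unchanged, which is the "ignore the co-absence" property named on the slide.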
Experimental setup (slide 18)
- 447 deep Web sources in 8 domains
  - Domains
    - Travel: Airfares, Hotels, Car Rentals
    - Entertainment: Books, Movies, Music Records
    - Living: Jobs, Automobiles
  - Available as the TEL-8 dataset in the UIUC Web Integration Repository
    - http://metaquerier.cs.uiuc.edu/repository/
Results in Books and Airfares domains (slide 19)
- Books
  - ✓ author (any) = {last name (any), first name (any)}
  - ✓ subject (string) = category (string)
  - ✓ format (string) = binding (string)
- Airfares
  - ✓ passenger (integer) = {adult (integer), child (integer), infant (integer)}
  - ✓ from (string) = departure city (string) = depart (string)
  - ✓ departure date (datetime) = depart (datetime)
  - ✓ return date (datetime) = return (datetime)
  - ✓ class (string) = cabin (string)
  - destination (string) = to (string) = {departure city (string), arrival city (string)}
Contributions (slide 20)
- Insight
  - We build a conceptually novel connection between data integration and correlation mining
    - schema matching as a new application of correlation mining
    - correlation mining as a new approach for schema matching
- Techniques
  - The dual correlation mining framework
  - Measures for qualification and ranking
  - H-measure, robust for negative correlations
Thank You! (slide 21)