Query Processing over Incomplete Autonomous Web Databases
MS Thesis Defense by Hemal Khatri
Committee Members: Prof. Subbarao Kambhampati (chair), Prof. Chitta Baral, Prof. Yi Chen, Prof. Huan Liu

Introduction to Web Databases
- Many websites let users pose queries through a form-based interface and are backed by databases
- Consider used-car sales websites such as Cars.com, Yahoo! Autos, etc.

Incompleteness in Web Databases
- Web databases are often populated by lay individuals without any curation, e.g. Cars.com, Yahoo! Autos
- Web databases are also populated using automated information extraction techniques, which are inherently imperfect
- The local schema of a data source may not support certain attributes supported by the global schema
- Incomplete/uncertain tuple: a tuple in which one or more attributes have a missing value

  Website         # of attributes  # of tuples  Incomplete tuples  Body style  Engine
  autotrader.com  13               25127        33.67%             3.6%        8.1%
  carsdirect.com  14               32564        98.74%             55.7%       55.8%

Problem Statement
- Many entities corresponding to tuples with missing values might be relevant to the user query
- Current query processing techniques return only answers that exactly satisfy the user query
  - Such techniques return results with high precision but low recall
  - Example: for Q: Make=Honda, the tuple (null, Accord, 2003, sedan) is relevant but is not returned
- Relevant uncertain tuple: a tuple that does not exactly satisfy the query predicates but represents an entity that might be relevant to the query
- How can we support query processing over incomplete autonomous databases so that ranked uncertain results are retrieved?

Challenges Involved
- How to predict missing values in autonomous databases?
- As autonomous databases are accessible only through form-based interfaces, how to retrieve relevant uncertain answers?
  - How to keep query processing cost manageable when retrieving uncertain tuples?
- How to rank the retrieved uncertain answers?

Related Work
- Probabilistic databases
  - Incomplete databases are similar to probabilistic databases once the probabilities for the missing values have been assessed
  - TRIO: uncertainty with lineage
  - ConQuer: handling inconsistency over databases
    - These systems assume that probability distributions are given for uncertain or inconsistent attributes
  - We assess the probability distribution for a missing attribute and use it to rank rewritten queries that retrieve relevant answers, since the probabilities cannot be stored in the autonomous databases
  - Our query rewriting framework is general and can be used by these systems if the databases are autonomous
- Handling missing values
  - EM algorithm, Bayes nets, association rules

Possible Approaches
For a query Q: body style = convt
1. Certain Answers Only (CAO): return certain answers only, as in traditional databases. Low recall.
2. All Uncertain Answers (AUA): null matches any concrete value, so return all answers having body style=convt along with all answers having a null body style. Low precision, infeasible.
3. Relevant Uncertain Answers (RUA): rank answers by predicting the values of the missing attribute. Costly, infeasible.
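A minimal Python sketch of the three approaches over an in-memory list of tuples. The sample rows, the function names and the `predict` callback are illustrative assumptions and not part of the thesis system. A real autonomous source typically cannot be asked for its null-valued tuples through a form interface, which is one reason the slide marks AUA and RUA as infeasible.

```python
# Illustrative data; None stands for a missing (null) value.
CARS = [
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": None},
    {"make": "Audi", "model": "a4", "body_style": None},
    {"make": "Audi", "model": "a4", "body_style": "sedan"},
]

def certain_answers_only(rows, attr, value):
    """CAO: keep only tuples that exactly satisfy attr = value."""
    return [r for r in rows if r[attr] == value]

def all_uncertain_answers(rows, attr, value):
    """AUA: additionally keep every tuple whose attr is null."""
    return [r for r in rows if r[attr] == value or r[attr] is None]

def relevant_uncertain_answers(rows, attr, value, predict):
    """RUA: certain answers first, then null tuples ranked by predicted relevance.

    predict(row) is a hypothetical callback returning P(attr = value | row).
    """
    certain = certain_answers_only(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=predict, reverse=True)
    return certain + uncertain
```

CAO misses both null tuples, AUA returns them unranked, and RUA orders them by how likely each one is to be a convertible.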

Outline
- Introduction
- QPIAD: Query Processing over Incomplete Autonomous Databases
- Data Integration over Incomplete Autonomous Databases
- Other Contributions
- Conclusion

QPIAD System Architecture

RRUA: Generating Rewritten Queries
- The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only relevant incomplete tuples instead of retrieving all tuples as in AUA and RUA
- Consider a query Q: Body style=convt
- Base result set RS(Q):

  Make     Model    Year  Price  Body style
  Audi     a4             20000  convt
  BMW      z4       2003  17000  convt
  Porsche  boxster  2000  13000  convt
  ...      ...      ...   ...    ...

- Rewritten queries are based on the determining set from the AFD for Body style: Model ~~> Body style (confidence 0.9)
  - Determining attribute set (dtrSet): Model
  - Q1: model='a4'
  - Q2: model='z4'
  - Q3: model='boxster'
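The rewriting step above can be sketched as a projection of the base result set onto the determining set. This is a minimal illustration, assuming tuples are Python dicts and a query is a dict of attribute bindings; the function name and the sample rows are hypothetical.

```python
def rewritten_queries(base_result_set, dtr_set):
    """Project the base result set onto dtr_set; emit one query per distinct value combination."""
    seen = set()
    queries = []
    for row in base_result_set:
        key = tuple(row.get(a) for a in dtr_set)
        if None in key or key in seen:
            continue  # skip incomplete or repeated combinations
        seen.add(key)
        queries.append(dict(zip(dtr_set, key)))
    return queries

# Base result set for Q: Body style=convt (illustrative rows).
base = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
]
queries = rewritten_queries(base, ["model"])
```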

Learning Attribute Correlations
- AFD: VIN ~~> Model, where VIN is an approximate key (AKey) with high confidence
- VIN is not useful for query rewriting or feature selection, since a query on VIN cannot retrieve any additional new tuples
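The slide does not say how AFD or AKey confidence is computed. As an assumption, the sketch below uses a common g3-style measure: the confidence of an AFD is the largest fraction of tuples that can be kept so that the dependency holds exactly, and the confidence of an approximate key is the fraction of tuples that survive after removing duplicates on the key. Function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Fraction of rows agreeing with the majority rhs value of their lhs group (1 - g3 error)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(max(c.values()) for c in groups.values())
    return kept / len(rows)

def akey_confidence(rows, attrs):
    """Fraction of rows that remain after removing duplicates on attrs."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)

rows = [
    {"vin": "v1", "model": "z4", "body_style": "convt"},
    {"vin": "v2", "model": "z4", "body_style": "convt"},
    {"vin": "v3", "model": "a4", "body_style": "sedan"},
    {"vin": "v4", "model": "a4", "body_style": "convt"},
]
```

On these rows VIN determines Model perfectly, but it is also a key, so every VIN value matches exactly one tuple. An AFD whose determining set is an approximate key would be pruned for that reason.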

RRUA: Ranking Rewritten Queries
- Not all rewritten queries are equally good at retrieving relevant answers
  - Cars with model "z4" are more likely to be convertibles than cars with model "a4"
- When database or network resources are limited, the mediator can choose to issue only the top K queries to get the most relevant uncertain answers
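A small sketch of the top-K selection, assuming each rewritten query already carries a relevance score. The query strings and the helper name are illustrative; the scores are placeholders for whatever estimate the mediator uses.

```python
import heapq

def top_k_queries(scored_queries, k):
    """Return the k (query, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scored_queries, key=lambda qs: qs[1])

scored = [("model='a4'", 0.4), ("model='z4'", 1.0), ("model='boxster'", 0.7)]
best_two = top_k_queries(scored, 2)
```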

Learning Value Distributions
- Value distributions are used to rank queries based on the determining set of attributes from the AFD for the query attribute
- We use a Naïve Bayes classifier with m-estimates, with AFDs as a feature selection step
- Rank of a rewritten query Q_i: R(Q_i) = P(A_m = v_m | t_i), where t_i ∈ Π_{dtrSet(A_m)}(RS(Q))
  - Q1: model='a4', R(Q1) = P(bodystyle=convt | model=a4) = 0.4
  - Q2: model='z4', R(Q2) = P(bodystyle=convt | model=z4) = 1.0
  - Q3: model='boxster', R(Q3) = P(bodystyle=convt | model=boxster) = 0.7
  - R(Q2) > R(Q3) > R(Q1)
- Relevant uncertain answers are ranked by the rank of the rewritten query that retrieved them
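A minimal sketch of a Naive Bayes estimate with m-estimate smoothing, restricted to the attributes in the determining set. The function names, the uniform prior inside the m-estimate, the value m=1 and the toy counts are all assumptions made for illustration; the toy data does not reproduce the 0.4 / 1.0 / 0.7 values on the slide.

```python
from collections import Counter, defaultdict

def train(rows, features, target):
    """Count class and (feature value, class) frequencies over complete rows."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (feature, class) -> counts of feature values
    domains = defaultdict(set)           # feature -> distinct observed values
    for r in rows:
        if r[target] is None or any(r[f] is None for f in features):
            continue
        class_counts[r[target]] += 1
        for f in features:
            value_counts[(f, r[target])][r[f]] += 1
            domains[f].add(r[f])
    return class_counts, value_counts, domains

def posterior(model, evidence, m=1.0):
    """Return P(class | evidence) for every class, using m-estimates."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for f, v in evidence.items():
            p = 1.0 / len(domains[f])  # uniform prior over the feature's values
            score *= (value_counts[(f, c)][v] + m * p) / (n_c + m)
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Toy sample: counts are made up for illustration.
sample = (
    [{"model": "z4", "body_style": "convt"}] * 3
    + [{"model": "a4", "body_style": "convt"}]
    + [{"model": "a4", "body_style": "sedan"}] * 2
    + [{"model": "boxster", "body_style": "convt"}] * 2
    + [{"model": "boxster", "body_style": "coupe"}]
)
model = train(sample, ["model"], "body_style")
rank = {v: posterior(model, {"model": v})["convt"] for v in ("a4", "z4", "boxster")}
```

With these counts the ordering matches the slide: the query on z4 ranks above boxster, which ranks above a4.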

Combining AFDs and Classifiers
- More than one AFD may exist for some attributes
- We experimented with several approaches:
  - Only the best AFD, i.e. the one with the highest confidence
  - All attributes, ignoring AFDs
  - Hybrid One-AFD
  - Ensemble of classifiers

Empirical Evaluation of QPIAD
- Test databases: the AutoTrader database containing 100K tuples and the Census database from the UCI Repository containing 50K tuples
- Oracular study: to evaluate the effectiveness of our system against a ground truth, we artificially insert missing values into 10% of the tuples in these databases
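The oracular setup can be sketched as follows. This is an illustration only: the slide states that values are removed from 10% of the tuples, while the choice of exactly one masked attribute per selected tuple, the fixed seed and the function name are assumptions.

```python
import random

def mask_values(rows, attrs, fraction=0.10, seed=0):
    """Null out one attribute in a random fraction of rows; return (masked_rows, ground_truth)."""
    rng = random.Random(seed)
    masked = [dict(r) for r in rows]  # copy so the originals stay intact
    chosen = rng.sample(range(len(masked)), int(len(masked) * fraction))
    truth = {}
    for i in chosen:
        attr = rng.choice(attrs)
        truth[i] = (attr, masked[i][attr])  # remember the removed value
        masked[i][attr] = None
    return masked, truth
```

The returned `truth` map plays the role of the oracle: an uncertain answer counts as relevant when the value that was removed satisfies the query.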

RRUA vs AUA vs RUA

Precision over Top K Tuples

Ranking the Rewritten Queries
[Figure panels: Cars database, Census database]

Robustness of QPIAD

User Relevance Issues with QPIAD
- When the query processor presents incomplete tuples, it becomes a recommender system
- How can the system convince users to trust its results?
- For a query Q: year=2000, an uncertain answer is shown together with an explanation:

  Make   Model  Year  Price  Mileage
  Honda  Civic  null  15000  18000

  Explanation: "We have determined that this car's year is 60% likely to be 2000 based on price=15000 and mileage=18000"

Leveraging Correlations between Data Sources
Q: Body style=coupe
Mediator: GS(Make, Model, Year, Price, Mileage, Body style)

Correlated Source and Maximum Correlated Source
- Consider four sources with schemas:
  - S1(Make, Model, Year, Price)
  - S2(Engine, Drive, Body style)
    - AFD: {Engine, Drive} ~~> Body style, confidence 0.7
  - S3(Make, Model, Body style)
    - AFD: Model ~~> Body style, confidence 0.8
  - S4(Make, Price, Body style)
    - AFD: {Make, Price} ~~> Body style, confidence 0.6
  - Mediator global schema GS(Make, Model, Year, Price, Body style, Engine, Drive)
- S3 and S4 are correlated sources for S1 on the Body style attribute
- S3 is the maximum correlated source for S1 on the Body style attribute
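A sketch of how the mediator might pick these sources. It assumes the following reading of the definitions, which the slide does not spell out: a source is correlated with a target source on an attribute if it has an AFD determining that attribute and every attribute in the AFD's determining set is supported by the target; the maximum correlated source is the one whose AFD has the highest confidence. The data layout and function names are hypothetical.

```python
SOURCES = {
    "S1": {"schema": {"Make", "Model", "Year", "Price"}, "afds": []},
    "S2": {"schema": {"Engine", "Drive", "Body style"},
           "afds": [({"Engine", "Drive"}, "Body style", 0.7)]},
    "S3": {"schema": {"Make", "Model", "Body style"},
           "afds": [({"Model"}, "Body style", 0.8)]},
    "S4": {"schema": {"Make", "Price", "Body style"},
           "afds": [({"Make", "Price"}, "Body style", 0.6)]},
}

def correlated_sources(sources, target, attr):
    """Return {source name: best AFD confidence} for sources correlated with target on attr."""
    target_schema = sources[target]["schema"]
    found = {}
    for name, src in sources.items():
        if name == target:
            continue
        for lhs, rhs, conf in src["afds"]:
            # The determining set must be queryable on the target source.
            if rhs == attr and lhs <= target_schema:
                found[name] = max(conf, found.get(name, 0.0))
    return found

def maximum_correlated_source(sources, target, attr):
    """Pick the correlated source whose AFD has the highest confidence, or None."""
    found = correlated_sources(sources, target, attr)
    return max(found, key=found.get) if found else None
```

Under this reading S2 is excluded for S1 because S1 supports neither Engine nor Drive, even though S2's AFD has a higher confidence than S4's.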

Retrieving Relevant Uncertain Answers from CarsDirect.com
- Consider a query Q: body style = coupe on the global schema GS
- Cars.com has an AFD: Model ~~> Body style (0.9)
- Cars.com is the maximum correlated source for CarsDirect.com, which does not support Body style but does support the Model attribute

  Make   Model    Year  Price  Body style  Rewritten query
  Honda  Accord   2003  19000  coupe       Q1: model=Accord
  Ford   Mustang  2004  29100  coupe       Q2: model=Mustang
  Acura  Legend   1997  12000  coupe       Q3: model=Legend
  BMW    325      2003  28000  coupe       Q4: model=325

Empirical Evaluation of Using Correlation between Data Sources
- We consider a mediator performing data integration over three sources: Cars.com, Yahoo! Autos and CarsDirect.com
- Some of these sources do not allow querying on body style, but once tuples are retrieved we can check their body style attribute to determine whether they have the body style specified in the query
- Evaluation uses attribute correlations and value distributions learned from Cars.com, for 5 test queries on the body style attribute

Retrieving Relevant Answers using Correlations from Cars.com

Handling Joins over Incomplete Autonomous Databases
- Mediator performing data integration across two sources:
  - Source S1 is incomplete
  - Source S2 is complete

  Source  Local schema
  S1      Cars(Make, Model, Year, Price)
  S2      Review(Model, Ratings)

  Mediator view: UsedCars(Make, Model, Year, Price, Ratings) :- Cars(Make, Model, Year, Price), Review(Model, Ratings)

Issues in Handling Joins
- Performing joins over probabilistic databases leads to a disjunction in the join results
- Consider joining uncertain tuples from the two sources:

  Cars tuple:
  Make   Model                        Year  Price
  Honda  null [0.6 Civic] [0.4 Accord]  2003  18000

  Review tuples:
  Model   Ratings
  Civic   5
  Accord  4

  Join result:
  Make   Model   Year  Price  Ratings  Approximation
  Honda  Civic   2003  18000  5        0.6
  or
  Honda  Accord  2003  18000  4        0.4
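The disjunction can be made concrete with a short sketch that expands one car tuple with a null Model into one joined tuple per predicted model value. The function name and the dict-based representation of the Review source are assumptions for illustration.

```python
def join_uncertain(car, model_distribution, reviews):
    """Join a car tuple with a null Model against reviews, one result per predicted model."""
    results = []
    for model, prob in model_distribution.items():
        if model in reviews:  # only predicted models that have a review can join
            joined = dict(car, Model=model, Ratings=reviews[model])
            results.append((joined, prob))
    return sorted(results, key=lambda jp: jp[1], reverse=True)

car = {"Make": "Honda", "Model": None, "Year": 2003, "Price": 18000}
reviews = {"Civic": 5, "Accord": 4}
alternatives = join_uncertain(car, {"Civic": 0.6, "Accord": 0.4}, reviews)
```

Each alternative carries the probability of the predicted Model value, so the mediator can rank the joined tuples or keep only the most likely one.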

Handling Join Queries
- Q: σ_{Make=Honda}(UsedCars)
- Assume AFDs: {Make, Year} ~~> Model, Model ~~> Make

  Source S1 (Cars):
  Make    Model(FK)  Year  Price
  Honda   Odyssey    2000  10000
  Honda   Accord     2004  20000
  Honda   null       2000  15000
  null    Accord     2002  18000
  Toyota  Camry      2003  16000

  Predicted Model for (Honda, null, 2000, 15000): 0.6 Civic, 0.4 Accord

  Source S2 (Review):
  Model(PK)  Ratings
  Civic      5
  Corolla    4
  Accord     4
  Altima     3
  Camry      5
  Odyssey    3

  Rewritten queries:
  Q1: Model=Odyssey, R(Q1)=1
  Q2: Model=Accord, R(Q2)=1

  Queries on source S2 to join:
  Q3: Model=Odyssey, R(Q3)=1
  Q4: Model=Accord, R(Q4)=1
  Q5: Model=Civic, R(Q5)=0.6

  Join results:
  Make   Model    Year  Price  Ratings  R
  Honda  Odyssey  2000  10000  3
  Honda  Accord   2004  20000  4
  null   Accord   2002  18000  4        1.0
  Honda  null     2000  15000  5        0.6

Experimental Results: Joins

QUIC: Querying under Imprecision and Incompleteness
- Consider a query Q: model like Civic (Cars)
- The user might be interested in similar cars such as "Accord", "Camry", etc.
- Ranking results in the presence of both similar and incomplete tuples

  Id  Make    Model     Year  Body style
  1   Honda   Civic     2000  Sedan
  2   Honda   Accord    2004  Coupe
  3   Toyota  Camry     2001  Sedan
  4   Honda   null      2004  Coupe
  5   Honda   null      2000  Sedan
  6   Honda   Civic     2004  Coupe
  7   BMW     3 series  2001  convt
  8   Toyota  null      1999  sedan

Other Contributions [*Collaboration with Garrett Wolf]
- Handling multi-attribute selection queries for incomplete databases*
- QUIC system for query processing under imprecision and incompleteness
- Online learning of value distributions based on the base result set to avoid sample biases

Conclusion
- The thesis proposed a framework for query processing over incomplete autonomous web databases:
  - QPIAD: query processing over incomplete autonomous databases
  - QPIAD: data integration over multiple incomplete data sources
- Results of empirical evaluation on real-world databases show that our system returns relevant answers with high precision while keeping the query processing cost manageable

Thank You! Questions?