Peer Data-Management Systems: Plumbing for the Semantic Web

Peer Data-Management Systems: Plumbing for the Semantic Web
Alon Halevy
University of Washington
Joint work with Anhai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos

Agenda
• Elements of the Semantic Web
• Piazza: a peer data-management system
  – A database guy’s contribution to the semantic web
• The key issue: mapping between different models:
  – Some recent progress and current directions.
• The critical issue: crossing the structure chasm.
• The talk I’m not giving today:
  – A critique of the Semantic Web.
• Work and thoughts are in progress

The Semantic Web (my view)
• Web sites include structural annotations
  – You can pose meaningful queries on them.
  – Ontologies provide the semantic glue.
  – Internal implementation of web sites left open.
• Agents perform tasks:
  – Query one or more web sites
  – Perform updates (e.g., set schedules)
  – Coordinate actions
  – Trust each other (or not).
I.e., agents operating on a gigantic heterogeneous distributed database.

Getting there
• Robust infrastructure for querying
  – Peer data management systems.
• Facilitate mapping between different structures. Need tools for:
  – Locating relevant structures
  – Easily joining the semantic web.
• Get data into structured form
  – Should we worry about the legacy web?

Agenda
• Elements of the Semantic Web (personal view)
• Piazza: a peer data-management system
  – A database guy’s contribution to the semantic web
• The key issue: mapping between different models:
  – Some recent progress and current directions.
• The critical issue: crossing the structure chasm.

Piazza: Peer Data-Management
Goal: To enable users to share data across local or wide area networks in an ad-hoc, highly dynamic distributed architecture.
• Peers can:
  – Export base data
  – Provide views on base data
  – Serve as logical mediators for other peers
• Every peer can be both a server and a client.
• Peers join and leave the PDMS at will.

Extending the Vision to Data Sharing

Relationship of PDMS to…
• P2P overlay networks (the “S” word)
• Data integration systems (no central logical mediated schema)
• Federated databases (scale, ad-hoc nature)
• Distributed databases (no central administration)

Representing Data
• A spectrum of possibilities:
  – Relational tables, some integrity constraints
  – XML: can encode relational, hierarchical, OO
    – XQuery: emerging standard query language (SQL for XML)
  – RDF: “XML on drugs”.
    – Sees only the logic; ignores other aspects.
  – DAML+OIL: full-blown knowledge representation language.
• They all have semantics; just different expressive powers.
• We keep the data simple. Mappings between data at different peers are more complex.

Piazza Querying
• Semantic mappings between peers provide glue:
    LH:CritBed(bed, hosp, room, PID, status)   H:CritBed(bed, hosp, room) & H:Patient(PID, bed, status)
    DC:SkilledPerson(PID, "Doctor") :- H:Doctor(SID, h, l, s, e)
    DC:SkilledPerson(PID, "EMT") :- H:EMT(SID, h, vid, s, e)
• Query processing phases:
  – Reformulate a query into queries over stored data.
    – MiniCon algorithm (++) for answering queries using views.
    – Extensions in Piazza enable chaining multiple peer mappings.
  – Find best plan for the query and execute it:
    – Tukwila data integration engine: an efficient processor for network-bound XML/relational data.
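
The reformulation phase can be pictured with a much simpler stand-in than MiniCon: plain rule unfolding over relation names. The sketch below is not the Piazza algorithm. It ignores variables, joins, and views, and it assumes the rules are acyclic. The `X:Responder` peer is hypothetical and is there only to show two mappings being chained.

```python
from itertools import product

# Rules: head relation -> list of alternative bodies (each body is a conjunction).
RULES = {
    # From the slide: two rules defining DC:SkilledPerson.
    "DC:SkilledPerson": [["H:Doctor"], ["H:EMT"]],
    # Hypothetical extra peer, added only to show chaining across two mappings.
    "X:Responder": [["DC:SkilledPerson"]],
}

def unfold(relation, rules):
    """Return every conjunction of stored relations that can answer `relation`.

    A relation with no rule is treated as stored data. Assumes acyclic rules.
    """
    if relation not in rules:
        return [[relation]]
    rewritings = []
    for body in rules[relation]:
        # Expand each body atom, then combine one alternative per atom.
        expanded = [unfold(atom, rules) for atom in body]
        for combo in product(*expanded):
            rewritings.append([rel for part in combo for rel in part])
    return rewritings

print(unfold("X:Responder", RULES))  # [['H:Doctor'], ['H:EMT']]
```

Each inner list is one conjunctive rewriting over stored relations; the union of all of them answers the original query.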

Efficiency Issues in Piazza
• Intelligent data placement:
  – We may want to place views over data at key points in the PDMS:
    – Save work for frequently asked queries.
    – Increase availability in cases of failures.
  – Akamai for structured data. A form of automated reformulation.
  – Large search space of possibilities. Surprising lower bounds on very simple cases [Chirkova et al., VLDB 2001].
• Efficient propagation of updates:
  – Approach: publish updategrams as first-class citizens.

Additional Piazza Issues
• The catalog of data sources
  – What does a catalog of structured data sources look like? How can it be browsed by humans?
  – How do we facilitate joining a PDMS? How can the catalog be distributed physically?
• Systems issues:
  – Architecture of a Piazza node: what are the components?
  – Naming issues
  – Security
• Piazza collaborators: Etzioni, Gribble, Ives, Levy, Suciu, Mork, Rodrig, Tatarinov.

Agenda
• Elements of the Semantic Web
• Piazza: a peer data-management system
  – A database guy’s contribution to the semantic web
• The key issue: mapping between different models:
  – Some recent progress and current directions.
• The critical issue: crossing the structure chasm.

It’s All About the Mappings
It’s not about understanding the data: it’s about understanding each other.
• Whenever you see a model for some domain, there is another one hiding around the corner.
• Mappings provide semantic relationships between different peers.
• Specifying mappings: inherently a human-assisted task. Goal: make it easy, fast, incremental. Not a new problem!

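Example Semantic Mapping
• Mapping between XML DTDs
[Figure: two house DTDs. The first has the elements house, address, contact-info, agent-name, agent-phone, and num-baths. The second has house, location, contact, name, phone, full-baths, and half-baths. Lines between the two mark 1-1 mappings and non-1-1 mappings.]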

Desiderata from Proposed Solutions
• Accuracy, efficiency, ease of use.
• Extensible: accommodate in a principled fashion:
  – User feedback
  – Domain constraints
  – General heuristics
• “Memory”, knowledge reuse:
  – System should exploit knowledge from previous matching tasks [LSD].
• Some underlying semantics.

Why Matching is Difficult
• Structures represent same entity differently
  – different names => same entity:
    – area & address => location
  – same names => different entities:
    – area => location or square-feet
• Intended semantics is typically subjective!
  – IBM Almaden Lab = IBM?
• Schema, data and rules never fully capture semantics!
  – not adequately documented, certainly not for machine consumption.
• Often hard for humans (committees are formed!)

Learning for Mapping
• We started simple: generating semantic mappings between a mediated schema and a large set of data source schemas.
• Key idea: generate the first mappings manually, and learn from them to generate the rest.
• Technique: multi-strategy learning (extensible!)
• L(earning) S(ource) D(escriptions) [SIGMOD 2001].
• Recent and current work:
  – (simple) Ontology mapping [WWW-02]
  – Complex mappings [COMAP]
  – Semantics [Madhavan et al., AAAI-02]

Data Integration (a simple PDMS)
[Figure: the query “Find houses with four bathrooms priced under $500,000” is posed over a mediated schema. Query reformulation and optimization translate it into queries over source schema 1 (realestate.com), source schema 2 (homeseekers.com), and source schema 3 (homes.com).]
• Applications: WWW, enterprises, science projects
• Techniques: virtual data integration, warehousing, custom code.

Learning from the Manual Mappings

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com:
  listed-price   contact-name   contact-phone    office           comments
  $250K          James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
  $320K          Mike Doan      (617) 253 1429   (617) 112 2315   Great location

Schema of homes.com:
  sold-at   contact-agent    extra-info
  $350K     (206) 634 9435   Beautiful yard
  $230K     (617) 335 4243   Close to Seattle
  $190K     (512) 342 1263   Great lot

Learned hypotheses:
• If “office” occurs in the name => office-phone
• If “fantastic” & “great” occur frequently in data instances => description

Multi-Strategy Learning
• Use a set of base learners:
  – Name learner, Naïve Bayes, Whirl, XML learner
• And a set of recognizers:
  – County name, zip code, phone numbers.
• Each base learner produces a prediction weighted by confidence score.
• Combine base learners with a meta-learner, using stacking.
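
A recognizer can be as small as a regular expression applied to a column's data instances. The sketch below is illustrative only and is not LSD's recognizer code. The US-style phone and zip formats, the label names, and the `recognize` function are all assumptions.

```python
import re

# Assumed US-style formats, matching the sample listings used in this talk.
PHONE = re.compile(r"^\(\d{3}\) ?\d{3}[ -]?\d{4}$")
ZIP = re.compile(r"^\d{5}(-\d{4})?$")

def recognize(values):
    """Return {label: fraction of values that match} for each recognizer."""
    patterns = {"phone": PHONE, "zip-code": ZIP}
    total = len(values) or 1  # avoid division by zero on an empty column
    return {
        label: sum(bool(p.match(v.strip())) for v in values) / total
        for label, p in patterns.items()
    }

scores = recognize(["(206) 634 9435", "(617) 335 4243", "(512) 342 1263"])
print(scores)  # {'phone': 1.0, 'zip-code': 0.0}
```

The fraction of matching instances can then be used as a confidence score alongside the base learners' predictions.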

Base Learners
• Training: from training examples (X1, C1), (X2, C2), ..., (Xm, Cm), where each Xi is an object and Ci its observed label, learn a classification model (hypothesis).
• Matching: given an object X, output labels weighted by confidence score.
• Name Learner
  – training: (“location”, address), (“contact name”, name)
  – matching: agent-name => (name, 0.7), (phone, 0.3)
• Naive Bayes Learner
  – training: (“Seattle, WA”, address), (“250K”, price)
  – matching: “Kent, WA” => (address, 0.8), (name, 0.2)
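
The train/match contract of a base learner can be sketched with a tiny multinomial Naive Bayes over word tokens. This is a minimal illustration, not LSD's implementation. The tokenizer, the add-one smoothing, and the two extra training examples marked below are assumptions, so the confidence scores will not match the numbers on the slide.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    """Lowercase alphanumeric tokens; an assumed, very simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesLearner:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokens(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}; the confidences sum to 1."""
        n = sum(self.label_counts.values())
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen tokens
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n)  # class prior
            for tok in tokens(text):
                score += math.log((self.word_counts[label][tok] + 1) / (total + v))
            log_scores[label] = score
        top = max(log_scores.values())  # subtract the max for numerical stability
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: e / z for label, e in exp.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Seattle, WA", "address"),   # from the slide
    ("250K", "price"),            # from the slide
    ("Bend, OR", "address"),      # assumed extra example
    ("James Smith", "name"),      # assumed extra example
])
prediction = learner.predict("Kent, WA")
print(max(prediction, key=prediction.get))  # address
```

Every base learner exposes the same two operations, which is what lets the meta-learner treat them uniformly.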

Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
• Training
  – uses training data to learn weights
  – one for each (base-learner, mediated-schema element) pair
  – weight(Name-Learner, address) = 0.2
  – weight(Naive-Bayes, address) = 0.8
• Matching: combine predictions of base learners
  – computes weighted average of base-learner confidence scores
Example: source element area with data instances “Seattle, WA”, “Kent, WA”, “Bend, OR”
  Name Learner:  (address, 0.4)
  Naive Bayes:   (address, 0.9)
  Meta-Learner:  (address, 0.4*0.2 + 0.9*0.8 = 0.8)
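
The matching step of the meta-learner is a weighted combination, which the sketch below reproduces using the numbers from the slide. Only the combination is shown. How the weights are learned during stacking is not covered, and the function and variable names are assumptions.

```python
# Weights from the slide, one per (base-learner, mediated-schema element) pair.
WEIGHTS = {
    ("Name-Learner", "address"): 0.2,
    ("Naive-Bayes", "address"): 0.8,
}

def combine(predictions, weights, label):
    """Weighted sum of the base learners' confidence scores for one label."""
    return sum(
        weights[(learner, label)] * scores.get(label, 0.0)
        for learner, scores in predictions.items()
    )

# Base-learner predictions for the source element "area", from the slide.
predictions = {
    "Name-Learner": {"address": 0.4},
    "Naive-Bayes": {"address": 0.9},
}
score = combine(predictions, WEIGHTS, "address")
print(round(score, 2))  # 0.8
```

Because the two weights sum to 1, the weighted sum here is also a weighted average.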

The LSD Architecture
[Figure: two phases. Training phase: the mediated schema and source schemas yield training data for the base learners; Base-Learner 1 through Base-Learner k produce Hypothesis 1 through Hypothesis k, and the Meta-Learner produces weights for the base learners. Matching phase: Base-Learner 1 through Base-Learner k and the Meta-Learner give predictions for instances; the Prediction Combiner turns these into predictions for elements; the Constraint Handler applies the domain constraints and outputs the mappings.]

Domain Constraints
• Encode user knowledge about the domain
• Specified by examining mediated schema
• Examples
  – at most one source-schema element can match address
  – if a source-schema element matches house-id then it is a key
  – avg-value(price) > avg-value(num-baths)
• Given a mapping combination
  – can verify if it satisfies a given constraint
Example mapping combination:
  area:          address
  sold-at:       price
  contact-agent: agent-phone
  extra-info:    address
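
Verifying a mapping combination against a constraint is a simple check, sketched below for the "at most one source-schema element can match address" example. The representation of constraints as Python predicates and all function names are assumptions, not LSD's constraint language.

```python
from collections import Counter

def at_most_one(mapping, target):
    """Constraint: at most one source element may map to `target`."""
    return Counter(mapping.values())[target] <= 1

def satisfies(mapping, constraints):
    """True if the mapping combination passes every constraint."""
    return all(check(mapping) for check in constraints)

constraints = [lambda m: at_most_one(m, "address")]

# The mapping combination shown on the slide: two elements map to address.
candidate = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}
print(satisfies(candidate, constraints))  # False

# Remapping extra-info to description removes the conflict.
repaired = dict(candidate, **{"extra-info": "description"})
print(satisfies(repaired, constraints))  # True
```

A constraint handler can use checks like this to discard or penalize candidate mapping combinations.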

Empirical Evaluation
• Four domains
  – Real Estate I & II, Course Offerings, Faculty Listings
• For each domain
  – create mediated DTD & domain constraints
  – choose five sources
  – extract & convert data listings into XML (faithful to schema!)
  – mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
• Ten runs for each experiment - in each run:
  – manually provide 1-1 mappings for 3 sources
  – ask LSD to propose mappings for remaining 2 sources
  – accuracy = % of 1-1 mappings correctly identified
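
The accuracy measure can be written down directly. The sketch below assumes mappings are dictionaries from source-schema element to mediated-schema element. The gold and proposed mappings are made-up illustrations, not experimental data.

```python
def matching_accuracy(proposed, gold):
    """Percentage of gold 1-1 mappings that the proposal identifies correctly."""
    if not gold:
        raise ValueError("gold mapping must not be empty")
    correct = sum(1 for src, tgt in gold.items() if proposed.get(src) == tgt)
    return 100.0 * correct / len(gold)

# Illustrative (made-up) mappings for one source.
gold = {"sold-at": "price", "contact-agent": "agent-phone", "extra-info": "description"}
proposed = {"sold-at": "price", "contact-agent": "agent-phone", "extra-info": "address"}

accuracy = matching_accuracy(proposed, gold)
print(f"{accuracy:.1f}%")  # 66.7%
```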

Matching Accuracy
[Chart: average matching accuracy (%)]
• LSD’s accuracy: 71 - 92%
• Best single base learner: 42 - 72%
• + Meta-learner: + 5 - 22%
• + Constraint handler: + 7 - 13%
• + XML learner: + 0.8 - 6%

Sensitivity to Amount of Available Data
[Chart: average matching accuracy (%) against number of data listings per source (Real Estate I)]

Contribution of Schema vs. Data
[Chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and the complete LSD]
• More experiments in the paper [Doan et al. 01]

Contribution of Each Component
[Chart: average matching accuracy (%) without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and for the complete LSD system]

The Next Steps
• Learning is a useful component. But it needs to be combined with:
  – User feedback
  – Domain constraints
  – General heuristics
• Need a representation of mappings:
  – First step: see [Madhavan et al., AAAI-02]
  – Also defines key inference problems for such a representation,
  – Provides answers for the mapping language used in Piazza.
  – Ultimately, some first-order probabilistic representation.
• Need benchmarks to measure progress.

Agenda
• Elements of the Semantic Web
• Piazza: a peer data-management system
  – A database guy’s contribution to the semantic web
• The key issue: mapping between different models:
  – Some recent progress and current directions.
• The critical issue: crossing the structure chasm.

Can We Cross the Structure Chasm?
• There are two worlds:
  – U-world: the current web, keyword search, Google
  – S-world: databases, knowledge bases, structured queries
• The web succeeded because it’s in the U-world.
• For the semantic web to succeed, we need to make it dead simple for people to:
  – Structure data, locate relevant data and data sets, query.
• However:
  – People have a hard time structuring their data
  – It’s harder to query structured data: need to know a terminology.
  – It’s harder to understand each other in the S-world.
• DB and KR people have no clue how to deal with this.
• More expressive power in the languages won’t help.