- Number of slides: 50
Principles of Dataspace Systems Alon Halevy PODS June 26, 2006
Outline • Example data management challenges ➢ Denote by: “dataspaces” [Franklin, H., Maier] • Dataspace Support Platforms: – “Pay-as-you-go” data management • Putting meat to the bones: – Specific research problems, recent progress – Querying, dataspace evolution, reflection • (Possibly) some predictions and subliminal messages.
Shrapnels in Baghdad Story courtesy of Phil Bernstein
Personal Information Management [Semex: Dong et al.] Relationships: AttachedTo, Recipient, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Sender, Cites, EarlyVersion, ArticleAbout, PresentationFor, FrequentEmailer, CoAuthor, BudgetOf, OriginatedFrom, HomePage, AddressOf
Google Base
The Web is Getting Semantic • Forms (millions) • Vertical search engines (hundreds) • Annotation schemes: Flickr, ESP Game • Google Coop – DB search engine coming soon! “A little semantics goes a long way” [See Madhavan talk on Wednesday afternoon]
“Data is the plural of anecdote” • Digital libraries, enterprises, “smart homes” • Corie Environmental Observation System – Talk to Maier • Circle of Blue – Data about the world’s water sources • The Boeing 777 – [Hanrahan @ Stanford]
Requirements • A system that: – Is defined by boundaries (organizational, physical, logical), not explicit entry. – Provides best-effort services • Little or no setup time – Leverages semantics when possible Manage dataspaces
Other Dataspace Characteristics • All dataspaces contain >20% porn. • The rest has >50% spam.
Dataspaces vs. Data Integration Data integration requires semantic mappings. Inventory Database A: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords). Inventory Database B: (Title, ISBN, Price, DiscountPrice, Edition); Authors(ISBN, FirstName, LastName); BookCategories(ISBN, Category); CDs(Album, ASIN, Price, DiscountPrice, Studio); CDCategories(ASIN, Category); Artists(ASIN, ArtistName, GroupName)
Dataspaces vs. Data Integration Dataspaces are “pay as you go” [Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions. Artist: Mike Franklin]
Dataspaces vs. Data Integration ✓ Data integration systems require semantic mappings. – Dataspaces are “pay-as-you-go” • Dataspaces capture a broader class of semantic relationships: – DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …
Why Do This Now? • Fact: – Data management is about people, not enterprises. • Prediction: – In the next few years, DB conferences will be about data management for the masses. • “CMA”: – If not, our community will become largely irrelevant. • Observation: – We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …
Outline ✓ Example data management challenges ✓ Denote by: “dataspaces” ➢ Dataspace Support Platforms: – “Pay-as-you-go” data management • Putting meat to the bones: – Specific research problems, recent progress – Querying, dataspace evolution, reflection • (Possibly) some predictions and subliminal messages.
Logical Model: Participants and Relationships [Figure: participants (RDB, SDB, XML, WSDL, RDF, java, sensor) connected by relationships (snapshot, 1 hr updates, manually created, view, schema mapping, replica)]
Relationships • General form: (Obj1, Rel, Obj2, p) • Obj1, Obj2: instances or sources • p: degree of certainty • Language for describing relationships? – [Rosati]
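The general form (Obj1, Rel, Obj2, p) can be sketched as a small record type. This is only an illustration: the class name `Relationship`, the helper `confident`, the sample participant names, and the assumption that p lies in [0, 1] are not from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One (Obj1, Rel, Obj2, p) entry; field names are illustrative."""
    obj1: str
    rel: str
    obj2: str
    p: float  # degree of certainty, assumed here to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")


def confident(relationships, threshold):
    """Keep only the relationships whose certainty reaches the threshold."""
    return [r for r in relationships if r.p >= threshold]


# Hypothetical catalog entries; the relationship names come from the slides,
# the participants and certainty values are made up.
catalog = [
    Relationship("sales_rdb", "SnapshotOf", "orders_xml", 0.9),
    Relationship("report_v2", "DerivedFrom", "report_v1", 0.6),
    Relationship("sensor_feed", "HighlyCorrelatedWith", "weather_rdf", 0.3),
]
strong = confident(catalog, 0.5)
```

Storing p on every entry lets a system keep weak, automatically guessed relationships next to hand-verified ones and choose a cut-off per task.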
Dataspace Support Platforms (DSSP) [Figure: DSSP components (Catalog, Discover & Enhance, Local Store & Index, Search & query, Administration) drawn over the participants and relationships of the logical model]
Outline ✓ Example data management challenges ✓ Denote by: “dataspaces” ✓ Dataspace Support Platforms: – “Pay-as-you-go” data management ➢ Putting meat to the bones: – Specific research problems, recent progress – Querying, dataspace evolution, reflection • (Possibly) some predictions and subliminal messages.
Technical Outline Query Evolve Reflect • Queries • Answers • Query processing
Dataspace Queries • Keyword queries as starting point – Later may be refined to add structure – Formulated in terms of user’s “schema” • Mostly of the form – Instance*: • “britany spears” – P(instance): • “chicago weather” • “PC chair PODS”
Semantics of Answers 1. The actual answers: – P(instance), P*(instance)
Weather Seattle
Semantics of Answers 1. The actual answers: – P(instance), P*(instance) 2. Sources where answer can be found: – Partially specify the query to the source – Help the user clean the query
Toyota Corolla Palo alto
Volvo Palo alto Toyota Palo alto Volvo Palo alto
Semantics of Answers 1. The actual answers: – P(instance), P*(instance) 2. Sources where answer can be found: – Partially specify the query to the source – Help the user clean the query 3. Supporting facts or sources: – Facts that can be used to derive P(instance) – Rest of derivation may be obvious to user
Related or Partial Answers • In which country was Dan Suciu born? – Bucharest • Latest edition of software X: – 2004 edition • Is the Space Needle higher than the Eiffel Tower? – Height of Seattle Space Needle: 184 m – Height of Eiffel Tower: 324 m Rank all types of answers
Query Processing: Data Integration [Figure: the query Weather(Chicago) is reformulated into sub-queries (Q1, Q41, Q42, Q43) over the sources] PDMS, Active XML, Data exchange
Query Processing: DSSPs Query: Jan Van den Bussche address First name: Jan Middle name: Last name: Van den Bussche Address: ?
Query Processing: DSSPs First name: Jan Middle name: Last name: Van den Bussche Address: ? Keyword query: J. vd Bussche address: City required StreetAdr city, zip (t1, p1) … … City? … … (t1, p3) … … (t2, p2) Companies address
Two Principles • Mappings are approximate at best – What do approximate mappings mean? – Answering queries with them? – Composition? • Answering queries = Finding evidence + combining evidence
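One way to picture "combining evidence" is a noisy-OR over the confidences of independent pieces of evidence. The talk does not prescribe a combination rule; noisy-OR, the independence assumption, and the function name `noisy_or` are all assumptions made for this sketch.

```python
from math import prod


def noisy_or(confidences):
    """Combine independent confidences for the same answer.

    The answer is supported unless every piece of evidence fails,
    so the combined confidence is 1 - product(1 - p_i).
    """
    return 1.0 - prod(1.0 - p for p in confidences)


# Two hypothetical sources each support an answer with confidence 0.5.
combined = noisy_or([0.5, 0.5])
```

Under this rule extra evidence never lowers confidence, which fits a pay-as-you-go setting only if the sources really are independent; correlated sources would be over-counted.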
Technical Outline Query Evolve Reflect • Reuse human attention
The Cost of Semantics Semantic integration modeled by: {(Obj1, rel, Obj2, p), …} [Figure: benefit plotted against investment (time, cost) for dataspaces and data integration solutions, marked with a question mark. Artist: Mike Franklin]
Reusing Human Attention • Principle: – User action = statement of semantic relationship ➢ Leverage actions to infer other semantic relationships • Examples – Providing a semantic mapping • Infer other mappings – Writing a query • Infer content of sources, relationships between sources – Creating a “digital workspace” • Infer “relatedness” of documents/sources • Infer co-reference between objects in the dataspace – Annotating, cutting & pasting, browsing among docs
Past, Present and Future • Leverage past actions & existing structure: – [Dong et al., 2004, 2005], [He & Chang, 2003] • Generalize from current actions – Queries, schema mappings • Beg for extra attention: – ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.]
Reuse: Learning Schema Mappings [Doan et al.] Action Mediated schema (S1, M, S, p) • Classifiers for mediated schema ➢ Thousands of web forms mapped in little time ➢ Transformic Inc: deep web search. • [Madhavan et al.]: infer mappings for any schemas in the domain
Technical Outline Query Evolve Reflect • Unify lineage, uncertainty, and inconsistency • Model them on views
Dataspace Reflection • Answers are uncertain in dataspaces: – data, – mappings, – query answering techniques • Data may be inconsistent • Tracking lineage is crucial
A DSSP Needs to • Process uncertain data • Update uncertainty with new evidence • Be proactive about reducing uncertainty • Live with inconsistency • Leverage lineage to reduce uncertainty (Obj1, Rel, Obj2, p)
Two Principles • Need a single formalism for modeling: – uncertainty, – inconsistency, and – lineage • Model them on views
Israel population Uncertainty & Lineage Trio @ Stanford
Uncertainty and Inconsistency • Inconsistency = uncertainty about the truth – Salary (John Doe, $120,000) – Salary (John Doe, $135,000) ➢ Salary (John Doe, $120,000 | $135,000) • Orchestra, Ives @ U. Penn.
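The step from two conflicting salary facts to one uncertain fact can be sketched as grouping values by key. The function name `merge_conflicts` and the (key, value) encoding of facts are assumptions made for illustration.

```python
from collections import defaultdict


def merge_conflicts(facts):
    """Turn conflicting (key, value) facts into one set of alternatives per key."""
    alternatives = defaultdict(set)
    for key, value in facts:
        alternatives[key].add(value)
    return dict(alternatives)


# The two inconsistent Salary facts from the slide.
facts = [("John Doe", 120_000), ("John Doe", 135_000)]
salary = merge_conflicts(facts)
# salary maps "John Doe" to the alternatives {120000, 135000}
```

A key with a single alternative is simply a certain fact, so consistent and inconsistent data share one representation.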
Uncertainty Formalisms 101 • Represent a set of possible worlds • A-tuples – uncertainty on attribute values: – (PODS, 2006, {Chicago, Baltimore}) – Tuples can be optional • Not closed under relational operators
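A minimal sketch of how an a-tuple stands for a set of possible worlds: each uncertain attribute independently takes one value from its set. The function name `a_tuple_worlds` and the encoding of uncertain attributes as Python sets are assumptions; optional tuples are not modeled here.

```python
from itertools import product


def a_tuple_worlds(a_tuple):
    """Enumerate the concrete tuples an a-tuple may stand for.

    An attribute given as a set is uncertain; any other value is certain.
    """
    choices = [
        sorted(value) if isinstance(value, (set, frozenset)) else [value]
        for value in a_tuple
    ]
    return [tuple(choice) for choice in product(*choices)]


# The a-tuple from the slide: the city is uncertain.
worlds = a_tuple_worlds(("PODS", 2006, {"Chicago", "Baltimore"}))
# worlds == [("PODS", 2006, "Baltimore"), ("PODS", 2006, "Chicago")]
```

Because every attribute varies independently, an a-tuple cannot say "PODS goes with Chicago and PODC goes with Baltimore", which is one way to see the limits of the formalism.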
Uncertainty Formalisms 101 (2) • X-tuples – uncertainty on entire tuples: – {(PODS, 2006, Chicago), (PODC, 2006, Baltimore)} • Still not closed under relational operators
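For x-tuples the choice is made between whole tuples, so attribute values stay correlated. The sketch below enumerates the worlds of a relation made of x-tuples; the function name `x_relation_worlds`, the list-of-alternatives encoding, and the second (certain) x-tuple are illustrative assumptions.

```python
from itertools import product


def x_relation_worlds(x_tuples):
    """Each x-tuple is a list of alternative tuples.

    A possible world picks exactly one alternative from every x-tuple.
    """
    return [set(choice) for choice in product(*x_tuples)]


relation = [
    # The x-tuple from the slide: two alternatives for the same fact.
    [("PODS", 2006, "Chicago"), ("PODC", 2006, "Baltimore")],
    # A hypothetical certain tuple, encoded as a single alternative.
    [("SIGMOD", 2006, "Chicago")],
]
worlds = x_relation_worlds(relation)
```

Here the conference name and the city change together, which the a-tuple (PODS, 2006, {Chicago, Baltimore}) cannot express.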
Uncertainty 101: C-Tables • {(PODC, 2006, Chicago, X = 1), (PODS, 2006, Chicago, X <> 1), (SIGMOD, 2006, Chicago, X <> 1)} • Closed and Complete! • Understandable? – See [Das Sarma et al., ICDE 2006]: working models for uncertain data
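A c-table attaches a condition over variables to each tuple, and every valuation of the variables selects one possible world. The sketch below encodes the slide's example; the function name `c_table_world`, the use of Python callables as conditions, and the sample valuations for X are assumptions.

```python
def c_table_world(rows, valuation):
    """Return the tuples whose condition holds under the given valuation."""
    return {tup for tup, condition in rows if condition(valuation)}


# The c-table from the slide, with conditions written as predicates on X.
rows = [
    (("PODC", 2006, "Chicago"), lambda v: v["X"] == 1),
    (("PODS", 2006, "Chicago"), lambda v: v["X"] != 1),
    (("SIGMOD", 2006, "Chicago"), lambda v: v["X"] != 1),
]

world_when_x_is_1 = c_table_world(rows, {"X": 1})
world_otherwise = c_table_world(rows, {"X": 2})
```

The shared variable X ties rows together: PODS and SIGMOD appear or vanish as a pair, a dependency across tuples that neither a-tuples nor x-tuples can state.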
Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06] {(PODS, 2006, Ch), (PODC, 2006, Ba)} { (…) (…) } (t) (t’) [Figure labels: l1, l2, !(l1 & l2)] No effect on complexity of relational operators
Adding Lineage to A-tuples (PODS, {2005, 2006}, {Chicago, Baltimore}) l1 l2 l3 l4 Lineage on views (projections) ➢ Uncertainty should also be on views! (Halevy, Los Altos, CA, professor}, 0.8 0.6 ➢ Answering queries using uncertain views ✓ See [Dalvi & Suciu, 2005] for a great start
Putting it all Together • Query: – Approximate mappings – Evidence combination • Evolve: – Reuse human attention – Create approximate semantic relationships • Reflect: – Foundation for reasoning about uncertainty, inconsistency and lineage
Conclusion and Outlook • Data management moving to consumers • Dataspaces: key element in this agenda – Pay as you go data management – Reuse human attention • The role of theory: – Reflect, generalize and explain – People, people
Some References • SIGMOD Record, December 2005: – Original dataspace vision paper • Maier EDBT 2006 talk • PODS 2006 proceedings: challenges • Data Integration: the Teenage Years – VLDB 2006 • Teaching integration to undergraduates: – SIGMOD Record, September 2003.