  • Number of slides: 50

Principles of Dataspace Systems
Alon Halevy
PODS, June 26, 2006

Outline
• Example data management challenges
  ➢ Denote by: “dataspaces” [Franklin, H., Maier]
• Dataspace Support Platforms:
  – “Pay-as-you-go” data management
• Putting meat to the bones:
  – Specific research problems, recent progress
  – Querying, dataspace evolution, reflection
• (Possibly) some predictions and subliminal messages.

Shrapnels in Baghdad
Story courtesy of Phil Bernstein

Personal Information Management [Semex: Dong et al.]
[Figure: association labels AttachedTo, Recipient, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Sender, Cites, EarlyVersion, ArticleAbout, PresentationFor, FrequentEmailer, CoAuthor, BudgetOf, OriginatedFrom, HomePage, AddressOf]

Google Base

The Web is Getting Semantic
• Forms (millions)
• Vertical search engines (hundreds)
• Annotation schemes: Flickr, ESP Game
• Google Coop
  – DB search engine coming soon!
“A little semantics goes a long way”
[See Madhavan talk on Wednesday afternoon]

“Data is the plural of anecdote”
• Digital libraries, enterprises, “smart homes”
• Corie Environmental Observation System
  – Talk to Maier
• Circle of Blue
  – Data about the world’s water sources
• The Boeing 777
  – [Hanrahan @ Stanford]

Requirements
• A system that:
  – Is defined by boundaries (organizational, physical, logical), not explicit entry.
  – Provides best-effort services
    • Little or no setup time
  – Leverages semantics when possible
Manage dataspaces

Other Dataspace Characteristics
• All dataspaces contain >20% porn.
• The rest has >50% spam.

Dataspaces vs. Data Integration
Data integration requires semantic mappings
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
  BookCategories(ISBN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)

Dataspaces vs. Data Integration
Dataspaces are “pay as you go”
[Figure: benefit plotted against investment (time, cost), with curves for dataspaces and for data integration solutions. Artist: Mike Franklin]

Dataspaces vs. Data Integration
✓ Data integration systems require semantic mappings.
  – Dataspaces are “pay-as-you-go”
• Dataspaces capture a broader class of semantic relationships:
  – DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …

Why Do This Now?
• Fact:
  – Data management is about people, not enterprises.
• Prediction:
  – In the next few years, DB conferences will be about data management for the masses.
• “CMA”:
  – If not, our community will become largely irrelevant.
• Observation:
  – We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …

Outline
✓ Example data management challenges
  ✓ Denote by: “dataspaces”
➢ Dataspace Support Platforms:
  – “Pay-as-you-go” data management
• Putting meat to the bones:
  – Specific research problems, recent progress
  – Querying, dataspace evolution, reflection
• (Possibly) some predictions and subliminal messages.

Logical Model: Participants and Relationships
[Figure: participants (RDB, SDB, XML, RDF, WSDL, java, sensor) connected by relationships (snapshot, 1 hr updates, manually created, view, schema mapping, replica)]

Relationships
• General form: (Obj1, Rel, Obj2, p)
• Obj1, Obj2: instances or sources
• p: degree of certainty
• Language for describing relationships?
  – [Rosati]
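
The (Obj1, Rel, Obj2, p) form above can be sketched as a small record type. This is only an illustration: the class name, the helper `related`, the sample catalog entries, and the reading of p as a number in [0, 1] are assumptions, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One dataspace relationship (Obj1, Rel, Obj2, p)."""
    obj1: str   # an instance or a source
    rel: str    # relationship label, e.g. "SnapshotOf"
    obj2: str   # an instance or a source
    p: float    # degree of certainty; assumed here to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")


def related(catalog, obj, min_p=0.0):
    """Relationships touching `obj` with certainty >= min_p, most certain first."""
    hits = [r for r in catalog if obj in (r.obj1, r.obj2) and r.p >= min_p]
    return sorted(hits, key=lambda r: r.p, reverse=True)


# Hypothetical catalog entries, invented for the example.
catalog = [
    Relationship("orders.xml", "SnapshotOf", "orders_rdb", 0.9),
    Relationship("sales_view", "DerivedFrom", "orders_rdb", 0.7),
    Relationship("sensor_feed", "HighlyCorrelatedWith", "weather.rdf", 0.4),
]
hits = related(catalog, "orders_rdb", min_p=0.5)
```

Keeping p on every record means a low-certainty relationship can still be stored and later filtered or upgraded, which fits the pay-as-you-go idea.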

Dataspace Support Platforms (DSSP)
[Figure: DSSP components (Catalog, Discover & Enhance, Local Store & Index, Search & query, Administration) placed over the participants-and-relationships graph: RDB, SDB, XML, RDF, WSDL, java, and sensor sources linked by snapshot, 1 hr updates, manually created, view, schema mapping, and replica relationships]

Outline
✓ Example data management challenges
  ✓ Denote by: “dataspaces”
✓ Dataspace Support Platforms:
  – “Pay-as-you-go” data management
➢ Putting meat to the bones:
  – Specific research problems, recent progress
  – Querying, dataspace evolution, reflection
• (Possibly) some predictions and subliminal messages.

Technical Outline
Query
• Queries
• Answers
• Query processing
Evolve
Reflect

Dataspace Queries
• Keyword queries as starting point
  – Later may be refined to add structure
  – Formulated in terms of user’s “schema”
• Mostly of the form
  – Instance*:
    • “britany spears”
  – P(instance)
    • “chicago weather”
    • “PC chair PODS”

Semantics of Answers
1. The actual answers:
  – P(instance), P*(instance)

Weather Seattle

Semantics of Answers
1. The actual answers:
  – P(instance), P*(instance)
2. Sources where answer can be found:
  – Partially specify the query to the source
  – Help the user clean the query

Toyota Corolla Palo alto

Volvo Palo alto
Toyota Palo alto
Volvo Palo alto

Semantics of Answers
1. The actual answers:
  – P(instance), P*(instance)
2. Sources where answer can be found:
  – Partially specify the query to the source
  – Help the user clean the query
3. Supporting facts or sources:
  – Facts that can be used to derive P(instance)
  – Rest of derivation may be obvious to user

Related or Partial Answers
• In which country was Dan Suciu born?
  – Bucharest
• Latest edition of software X:
  – 2004 edition
• Is the Space Needle higher than the Eiffel Tower?
  – Height of Seattle Space Needle: 184 m
  – Height of Eiffel Tower: 324 m
Rank all types of answers

Query Processing: Data Integration
[Figure: query Weather(Chicago) over a network of nodes, with labels Q, 1, 2, 3, 4, Q41, Q42, Q43]
PDMS, Active XML, Data exchange

Query Processing: DSSPs
Query: Jan Van den Bussche address
First name: Jan
Middle name:
Last name: Van den Bussche
Address: ?

Query Processing: DSSPs
First name: Jan
Middle name:
Last name: Van den Bussche
Address: ?
Keyword query: J. vd Bussche address
[Figure: sources and returned evidence, with labels “City required”, “StreetAdr city, zip”, “City?”, “Companies address”, and answer pairs (t1, p1), (t1, p3), (t2, p2)]

Two Principles
• Mappings are approximate at best
  – What do approximate mappings mean?
  – Answering queries with them?
  – Composition?
• Answering queries = Finding evidence + combining evidence
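
One way to read "finding evidence + combining evidence" is sketched below. The slides do not say how evidence is combined; the noisy-or rule used here (which assumes the pieces of evidence are independent), the function names, and the numeric confidences are all assumptions chosen for illustration. The input mirrors the (t1, p1), (t1, p3), (t2, p2) pairs on the preceding slide.

```python
from math import prod


def combine_noisy_or(confidences):
    """Chance that at least one supporting piece of evidence is right,
    under the (assumed) independence of the pieces."""
    return 1.0 - prod(1.0 - c for c in confidences)


def rank_answers(evidence):
    """evidence: iterable of (candidate_answer, confidence) pairs,
    possibly with the same candidate reported by several sources."""
    grouped = {}
    for candidate, confidence in evidence:
        grouped.setdefault(candidate, []).append(confidence)
    scored = {c: combine_noisy_or(ps) for c, ps in grouped.items()}
    # Highest combined confidence first.
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for p1, p3, p2.
evidence = [("t1", 0.5), ("t1", 0.5), ("t2", 0.6)]
ranked = rank_answers(evidence)
```

With these numbers t1 outranks t2 even though each single report of t1 is weaker, because two sources agree on it.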

Technical Outline
Query
Evolve
• Reuse human attention
Reflect

The Cost of Semantics
Semantic integration modeled by: {(Obj1, rel, Obj2, p), …}
[Figure: benefit plotted against investment (time, cost), with curves for dataspaces and for data integration solutions, and a “?” marker. Artist: Mike Franklin]

Reusing Human Attention
• Principle:
  ▪ User action = statement of semantic relationship
  ➢ Leverage actions to infer other semantic relationships
• Examples
  – Providing a semantic mapping
    • Infer other mappings
  – Writing a query
    • Infer content of sources, relationships between sources
  – Creating a “digital workspace”
    • Infer “relatedness” of documents/sources
    • Infer co-reference between objects in the dataspace
  – Annotating, cutting & pasting, browsing among docs

Past, Present and Future
• Leverage past actions & existing structure:
  – [Dong et al., 2004, 2005], [He & Chang, 2003]
• Generalize from current actions
  – Queries, schema mappings
• Beg for extra attention:
  – ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.]

Reuse: Learning Schema Mappings [Doan et al.]
[Figure labels: Action, Mediated schema, (S1, M, S, p)]
• Classifiers for mediated schema
➢ Thousands of web forms mapped in little time
➢ Transformic Inc: deep web search.
❖ [Madhavan et al.]: infer mappings for any schemas in the domain

Technical Outline
Query
Evolve
Reflect
• Unify lineage, uncertainty, and inconsistency
• Model them on views

Dataspace Reflection
• Answers are uncertain in dataspaces:
  – data,
  – mappings,
  – query answering techniques
• Data may be inconsistent
• Tracking lineage is crucial

A DSSP Needs to
• Process uncertain data
• Update uncertainty with new evidence
• Be proactive about reducing uncertainty
• Live with inconsistency
• Leverage lineage to reduce uncertainty
(Obj1, Rel, Obj2, p)

Two Principles
• Need a single formalism for modeling:
  – uncertainty,
  – inconsistency, and
  – lineage
• Model them on views

Uncertainty & Lineage
Israel population
Trio @ Stanford

Uncertainty and Inconsistency
• Inconsistency = uncertainty about the truth
  – Salary (John Doe, $120,000)
  – Salary (John Doe, $135,000)
  ➢ Salary (John Doe, $120,000 | $135,000)
• Orchestra, Ives @ U. Penn.
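
The step from two conflicting Salary facts to one uncertain fact can be sketched as a grouping operation. This is a minimal illustration, assuming the first attribute acts as a key; the function name and the extra "Jane Roe" row are invented for the example.

```python
from collections import defaultdict


def merge_conflicts(facts):
    """Group (key, value) facts by key. A key with several distinct values
    is treated as one uncertain fact whose value is a set of alternatives."""
    alternatives = defaultdict(set)
    for key, value in facts:
        alternatives[key].add(value)
    return dict(alternatives)


# The two John Doe rows are from the slide; Jane Roe is hypothetical.
salary = [
    ("John Doe", 120_000),
    ("John Doe", 135_000),
    ("Jane Roe", 98_000),
]
merged = merge_conflicts(salary)
```

Here `merged["John Doe"]` holds both alternatives, which corresponds to Salary (John Doe, $120,000 | $135,000), while a consistent key keeps a single value.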

Uncertainty Formalisms 101
• Represent a set of possible worlds
• A-tuples – uncertainty on attribute values:
  – (PODS, 2006, {Chicago, Baltimore})
  – Tuples can be optional
• Not closed under relational operators
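
To make "a set of possible worlds" concrete, the sketch below expands a relation of A-tuples into the ordinary relations it stands for. The representation (every attribute stored as a set of alternatives, plus an `optional` flag per tuple) and the function names are assumptions made for the example.

```python
from itertools import product


def tuple_choices(a_tuple, optional=False):
    """All ordinary tuples an A-tuple may stand for.
    Each attribute is a set of alternatives; a certain value is a 1-element set.
    An optional tuple may also be absent, represented here by None."""
    choices = [tuple(vals) for vals in product(*[sorted(attr) for attr in a_tuple])]
    if optional:
        choices.append(None)
    return choices


def possible_worlds(relation):
    """relation: list of (a_tuple, optional) pairs. Returns a list of worlds,
    each world being a frozenset of ordinary tuples."""
    per_tuple = [tuple_choices(t, opt) for t, opt in relation]
    return [frozenset(t for t in combo if t is not None)
            for combo in product(*per_tuple)]


pods = ({"PODS"}, {2006}, {"Chicago", "Baltimore"})
worlds = possible_worlds([(pods, False)])
worlds_optional = possible_worlds([(pods, True)])
```

The A-tuple from the slide yields two worlds, one per city; marking it optional adds a third world in which the tuple is missing.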

Uncertainty Formalisms 101 (2)
• X-tuples – uncertainty on entire tuples:
  – {(PODS, 2006, Chicago), (PODC, 2006, Baltimore)}
• Still not closed under relational operators
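
A short sketch of how X-tuples differ from A-tuples: an X-tuple picks one whole alternative, so attribute values stay tied together. The list-of-alternatives representation and the function names are assumptions for illustration.

```python
from itertools import product


def x_tuple_worlds(x_relation):
    """x_relation: list of X-tuples, each a list of alternative ordinary tuples.
    Every world picks exactly one alternative from each X-tuple."""
    return [frozenset(choice) for choice in product(*x_relation)]


def a_tuple_expansions(a_tuple):
    """Ordinary tuples allowed by an A-tuple (independent choice per attribute)."""
    return [tuple(vals) for vals in product(*[sorted(attr) for attr in a_tuple])]


# The X-tuple from the slide: PODS goes with Chicago, PODC with Baltimore.
x_relation = [[("PODS", 2006, "Chicago"), ("PODC", 2006, "Baltimore")]]
x_worlds = x_tuple_worlds(x_relation)

# The closest A-tuple cannot keep conference and city correlated.
loose = a_tuple_expansions(({"PODC", "PODS"}, {2006}, {"Baltimore", "Chicago"}))
```

The X-tuple gives exactly two worlds, while the A-tuple over the same values admits four combinations, including (PODS, 2006, Baltimore), which the X-tuple rules out.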

Uncertainty 101: C-Tables
• { (PODC, 2006, Chicago, X=1), (PODS, 2006, Chicago, X<>1), (SIGMOD, 2006, Chicago, X<>1) }
• Closed and Complete!
• Understandable?
  – See [Das Sarma et al., ICDE 2006]: working models for uncertain data
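
The C-table on this slide can be expanded the same way: each value of the variable X selects the tuples whose condition holds. The sketch assumes X ranges over a small finite domain and encodes conditions as Python predicates; both choices, and the function name, are assumptions for the example.

```python
def c_table_worlds(c_table, domain):
    """c_table: list of (tuple, condition) pairs, where condition is a predicate on X.
    Returns the set of distinct worlds obtained by trying every value of X."""
    worlds = set()
    for x in domain:
        worlds.add(frozenset(t for t, holds in c_table if holds(x)))
    return worlds


c_table = [
    (("PODC", 2006, "Chicago"), lambda x: x == 1),
    (("PODS", 2006, "Chicago"), lambda x: x != 1),
    (("SIGMOD", 2006, "Chicago"), lambda x: x != 1),
]
# Assumed domain for X; any domain containing 1 and some other value behaves the same.
c_worlds = c_table_worlds(c_table, domain=[1, 2, 3])
```

Two worlds result: PODC alone, or PODS together with SIGMOD. The shared condition X<>1 makes two tuples appear or disappear together, a correlation across tuples that the A-tuple and X-tuple examples cannot express.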

Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06]
[Figure: lineage labels l1, l2, and !(l1 & l2) linking {(PODS, 2006, Ch), (PODC, 2006, Ba)} and { (…) (…) }, with alternatives marked (t) and (t’)]
No effect on complexity of relational operators

Adding Lineage to A-tuples
(PODS, {2005, 2006}, {Chicago, Baltimore}), with lineage labels l1, l2, l3, l4
Lineage on views (projections)
➢ Uncertainty should also be on views!
  (Halevy, Los Altos, CA, professor}, 0.8 0.6
➢ Answering queries using uncertain views
✓ See [Dalvi & Suciu, 2005] for a great start

Putting it all Together
Query
• Approximate mappings
• Evidence combination
Evolve
• Reuse human attention
• Create approximate semantic relationships
Reflect
• Foundation for reasoning about uncertainty, inconsistency and lineage

Conclusion and Outlook
• Data management moving to consumers
• Dataspaces: key element in this agenda
  – Pay as you go data management
  – Reuse human attention
• The role of theory:
  – Reflect, generalize and explain
  – People, people

Some References
• SIGMOD Record, December 2005:
  – Original dataspace vision paper
• Maier EDBT 2006 talk
• PODS 2006 proceedings: challenges
• Data Integration: the Teenage Years
  – VLDB 2006
• Teaching integration to undergraduates:
  – SIGMOD Record, September, 2003.