- Number of slides: 95
Tutorial #5: Scientific Data Integration and Mediation
Bertram Ludäscher, Ilkay Altintas, Amarnath Gupta, Kai Lin
San Diego Supercomputer Center, U.C. San Diego
Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure
Acknowledgements
• National Science Foundation (NSF) – www.nsf.gov
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/
Outline
• 8:30 – 10:30 am: Tutorial: Data Integration & Mediation
  – Introduction to database mediation:
    • motivation and architecture
    • XML-based data integration
  – Database mediation theory primer:
    • logic view definitions, view unfolding, computing feasible plans
  – From XML-based to knowledge-based mediation:
    • use of ontologies in data integration, ...
• 10:30 – 10:45 am: BREAK
• 10:45 – 12:00: Applications and Demos
  – 10:45 – 11:05 Mediator Demo
  – 11:05 – 11:20 Queries w/ Ontology Support
  – 11:20 – 11:40 Scientific Workflows
  – 11:40 – 12:00 KNOW-ME Ontology Tool
Information Integration Challenges
=> reconciling S4 heterogeneities (System aspects, Syntax, Structure, Semantics)
=> “gluing” together multiple data sources
=> bridging information and knowledge gaps computationally
• System aspects: “Grid” Middleware
  – distributed data & computing
  – Web Services, WSDL/SOAP, …
  – sources = functions, files, databases, …
• Syntax & Structure: XML-Based Mediators
  – wrapping, restructuring
  – XML queries and views
  – sources = XML databases
• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – SemanticWeb/KnowledgeGrid stuff: ontologies, description logics (RDF(S), DAML+OIL, OWL, ...)
  – sources = knowledge bases (DB+CMs+ICs)
Information Integration from a DB Perspective
• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
  => Si has a schema (relational, XML, OO, ...)
  => Si can be queried
  => define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  => questions become queries Qi against V(S1, ..., Sk)
Standard (XML-Based) Mediator Architecture
[Figure: the USER/Client sends a query Q(G(S1, ..., Sk)) against the integrated global (XML) view G. The MEDIATOR holds the integrated view definition of G(..) in terms of S1(..) … Sk(..) and exchanges XML queries and results with wrappers that export (XML) views of the sources S1, S2, ..., Sk. Wrappers are implemented as web services.]
Some BIRNing Data Integration Questions
Biomedical Informatics Research Network, http://nbirn.net
• Data Integration Approaches:
  – Let’s just share data, e.g., link everything from a web page!
  – ... or better put everything into a relational or XML database
  – ... and do remote access using the Grid
  – ... or just use Web services!
• Nice try. But:
  – “Find the files where the amygdala was segmented.”
  – “Which other structures were segmented in the same files?”
  – “Did the volume of any of those structures differ much from normal?”
  – “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
“One-World” Mediation
[Figure: Information Integration over WWW sources: addall.com, public library, amazon.com, barnes&noble.com, half.com, A1books.com]
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
“Multiple-Worlds” Mediation
[Figure: Information Integration over sources: Realtor, Crime Stats, School Rankings, Demographics]
A Geoscientist’s Information Integration Problem
What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
“Complex Multiple-Worlds” Mediation
[Figure: Information Integration over sources: GeoPhysical (gravity contours), GeoChronologic (Concordia), Geologic Map (Virginia), GeoChemical, Foliation Map (structure DB)]
A Neuroscientist’s Information Integration Problem
Biomedical Informatics Research Network, http://nbirn.net
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
“Complex Multiple-Worlds” Mediation
[Figure: Information Integration over sources: protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
Structural / XML-Based Mediation
Abstract XML-Based Mediator Architecture
[Figure: the USER/Client sends a query Q o V(S_1, ..., S_k) against the integrated XML view V. The MEDIATOR holds the integrated view definition IVD(S1, ..., Sn) and exchanges XML queries and results with wrappers that export XML views of the sources S_1, S_2, ..., S_k.]
Extensible Markup Language (XML)
[Figure: the same content shown as marked-up text, as an XML element, as a tree, and as nested boxes]
Marked-up text:
  ... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T. B. Lee</author>, the authors show how ...
XML element:
  <book>
    <title>SemWeb Tractat</title>
    <author>B. Schatz</author>
    <author>T. B. Lee</author>
  </book>
Tree / boxes:
  book: title: “SemWeb Tractat”, author: “B. Schatz”, author: “T. B. Lee”
• (meta)language for marking up text & data with user-definable tags
  – (X)HTML, XSLT, XML Schema, ...
  – MathML, BioML, GeoML, NeuroML, ...
  – XML-RPC, SOAP, ...
• semistructured tree data model
  – flexible: marked-up text, web-pages, databases, ...
• container model:
  – “boxes within boxes”
Example: Relational Data => XML
Relation R:
  A   B   C
  a1  b1  c1
  a2  b2  c2
  a3  b3  c3
XML encoding:
  <R>
    <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple>
    <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple>
    …
  </R>
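The relational-to-XML encoding on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name relation_to_xml is made up here, and the standard-library ElementTree module stands in for whatever wrapper technology a real source would use.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tuple_elem = ET.SubElement(root, "tuple")
        for column, value in zip(columns, row):
            # One child element per attribute, named after the column.
            ET.SubElement(tuple_elem, column).text = value
    return root

# The relation R(A, B, C) from the slide.
rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")]
r_elem = relation_to_xml("R", ["A", "B", "C"], rows)
xml_text = ET.tostring(r_elem, encoding="unicode")
print(xml_text)
```

Each row becomes one tuple element, so the nesting depth is fixed at three levels regardless of the relation's size.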
Tag Names & Nesting => XML DTDs (Grammars)
Grammar Rules:
  bibliography → paper*
  paper → authors fullPaper? title booktitle
  authors → author+
XML DTD:
  <!ELEMENT bibliography (paper*)>
  <!ELEMENT paper (authors, fullPaper?, title, booktitle)>
  <!ELEMENT authors (author+)>
XML DTDs vs. XML Schema
• XML DTDs
  – set of allowed tag names
  – their nesting structure (via grammar rules)
• XML Schema
  – tag names and nesting structure
  – user-defined complex data types
  – subtyping (no multiple inheritance): RESTRICT and EXTEND
  – separate “namespace” for type names and tag (=element) names
  – ...
XML Schema: User-Defined Type/Class Hierarchy
XML Schema Declarations (“home-style” syntax): Complex Type Declarations
XML Schema (“home-style”): Simple Type Declarations, Complex Types
XML Schema: Substitution Groups. Elements of a substitution group (hexagons) and associated complex types (boxes)
XML Schema Declarations (W3C syntax)
XML Query Languages
• XPath:
  – root//books/book[cover_style=“paperback”][price<80]
• XQuery
  – the W3C XML query language
• XSLT
  – XML transformations (XML=>HTML, XML=>XML)
• ...
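The XPath expression above can be approximated with Python's standard library as a rough sketch. Two assumptions apply: the catalog document below is invented sample data, and ElementTree supports only a subset of XPath (child-equality predicates work, numeric comparisons such as price<80 do not), so the price test is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the slide's XPath expression.
CATALOG = """
<root>
  <books>
    <book><title>A</title><cover_style>paperback</cover_style><price>25</price></book>
    <book><title>B</title><cover_style>hardcover</cover_style><price>60</price></book>
    <book><title>C</title><cover_style>paperback</cover_style><price>95</price></book>
  </books>
</root>
"""

root = ET.fromstring(CATALOG)

# [cover_style='paperback'] is within ElementTree's XPath subset.
paperbacks = root.findall(".//books/book[cover_style='paperback']")

# [price<80] is not supported by ElementTree, so filter explicitly.
cheap_paperbacks = [b for b in paperbacks if float(b.findtext("price")) < 80]
titles = [b.findtext("title") for b in cheap_paperbacks]
print(titles)
```

A full XPath engine would evaluate both predicates in one expression; the split here is a limitation of the library, not of XPath.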
Transforming and Rendering XML: XSLT
XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”
  CONSTRUCT <books>
              <book> $a1 $t
                <pubs> $p {$p} </pubs>
              </book> {$a1, $t}
            </books>
  WHERE <books.book> $a1: <author/> $t: <title/> </> IN "amazon.com"
    AND <authors.author> $a2: <author/> <pubs> $p: <pub/> </> IN "www...DBLP…"
    AND value($a1) = value($a2)
[Figure: the XMAS query is translated into the XMAS algebra]
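What this integrated view computes (a join on author, then grouping publications by author and title) can be sketched in Python. This is not XMAS and not the mediator's algebra; the function name integrated_view and the sample tuples are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical extracts of the two sources: (author, title) and (author, publication).
amazon_books = [("B. Schatz", "SemWeb Tractat"), ("T. B. Lee", "SemWeb Tractat")]
dblp_pubs = [
    ("B. Schatz", "pub1"),
    ("B. Schatz", "pub2"),
    ("T. B. Lee", "pub3"),
    ("Someone Else", "pub4"),
]

def integrated_view(books, pubs):
    """Join on author value, then group publications by (author, title)."""
    grouped = defaultdict(list)
    for a1, title in books:
        for a2, pub in pubs:
            if a1 == a2:  # value($a1) = value($a2)
                grouped[(a1, title)].append(pub)  # {$p} collected per {$a1, $t}
    return dict(grouped)

view = integrated_view(amazon_books, dblp_pubs)
for (author, title), pubs in view.items():
    print(author, title, pubs)
```

The nested loop is the naive join; a mediator would instead push subqueries to the two wrappers and combine the returned XML.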
Database Mediation Theory Primer
Mediator Query Processing
[Figure: compile-time: Query Q and Integrated View Definition V → Translator → parsed plan → Composition (Q o V) → composed plan → Rewriter/Optimizer → optimized plan; run-time: Plan Execution]
Logic View Definitions (Global-as-View), or: Querying and Reasoning with the Family ...
• Warm up: Who says this?
  – “You are my son, but I’m not your father!”
• The mother!
Logic View Definitions (Global-as-View)
• Global-as-View (GAV)
  – Integrated view V is defined in terms of the sources Src_1, ..., Src_k
• Given the following source databases:
  – Src_1 schema = { father(Child, Father), mother(Child, Mother) }
  – Src_2 schema = { spouse(Spouse, Spouse) }
  – Src_3 schema = { male(Person), female(Person) }
• Can you define integrated views V for ...?
  – parent(Child, Parent)
    • short: parent/2, i.e., table/relation name is ‘parent’, arity (#columns) is 2
  – son/2, daughter/2
  – brother/2, sister/2
  – brother_in_law/2, sister_in_law/2
  – aunt/2, uncle/2
  – married/1, bachelor/1
Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
• parent(C, P) :- father(C, P) ; mother(C, P).
• son(P, S) :- parent(S, P), male(S).
• brother(X, B) :- parent(X, P), son(P, B), X ≠ B.
• brother_in_law(X, B) :- sister(X, Z), spouse(Z, B) ; spouse(X, Z), brother(Z, B).
Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
• uncle(X, U) :- parent(X, Z), brother(Z, U) ; parent(X, Z), brother_in_law(Z, U).
• aunt(X, A) :- parent(X, Z), sister(Z, A) ; parent(X, Z), sister_in_law(Z, A).
• married(X) :- spouse(X, _).
• bachelor(X) :- [person(X)], not married(X).
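A minimal sketch of how such GAV views could be evaluated bottom-up over source relations, written in Python rather than a logic language. The sample facts are invented, and tuples follow the argument order used in the rules, i.e. father(Child, Father) and mother(Child, Mother).

```python
# Hypothetical source facts; each pair is (child, parent).
father = {("tom", "bob"), ("ann", "bob"), ("joe", "bob")}
mother = {("tom", "sue"), ("ann", "sue")}
male = {"tom", "joe", "bob"}

def parent():
    # parent(C, P) :- father(C, P) ; mother(C, P).
    return father | mother

def son():
    # son(P, S) :- parent(S, P), male(S).
    return {(p, s) for (s, p) in parent() if s in male}

def brother():
    # brother(X, B) :- parent(X, P), son(P, B), X != B.
    return {
        (x, b)
        for (x, p) in parent()
        for (p2, b) in son()
        if p == p2 and x != b
    }

print(sorted(brother()))
```

Each view is just a function over the source sets, which mirrors the GAV idea that the integrated view is defined purely in terms of the sources.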
Query Rewriting and Query Evaluation
• Query Rewriting:
  – Given a user query Q in terms of virtual views V ...
  – Find an equivalent query Q’ in terms of the sources Src_1, ..., Src_k
• Query Evaluation:
  – Given a query Q’, evaluate Q’ over the source databases D := Src_1 ∪ ... ∪ Src_k
• Examples:
  – Q_uncle/2 = { (X, Y) | uncle(X, Y) holds in D }
  – Q_tom’s_uncle/1 = { X | uncle(tom, X) holds in D }
  – Q_whose_uncle_is_tom/1 = { X | uncle(X, tom) holds in D }
Query Rewriting (for GAV)
• Query rewriting:
  – Given a user query Q in terms of virtual views V ...
  – Find an equivalent query Q’ in terms of the sources Src_1, ..., Src_k
• Query Q, views V, source schemas S
• View unfolding:
  – starting with Q, repeatedly replace view predicates by their definitions
• Creating a feasible plan:
  – here: compute disjunctive normal form (DNF)
  – DNF = disjunction of conjunctions (= “union of joins”)
  – order goals within each conjunction according to the sources’ query capabilities
Example
• ?- plan(brother(X0, X1)).
brother(X0, X1) == LQP ==>
  (father(X0, X2) v mother(X0, X2)) & (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)
brother(X0, X1) == NNF LQP ==>
  (father(X0, X2) v mother(X0, X2)) & (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)
Example (Cont’d)
• ?- plan(brother(X0, X1)).
brother(X0, X1) == DNF LQP ==>
  father(X0, X2) & father(X1, X2) & male(X1) & neq(X0, X1) v
  mother(X0, X2) & father(X1, X2) & male(X1) & neq(X0, X1) v
  father(X0, X2) & mother(X1, X2) & male(X1) & neq(X0, X1) v
  mother(X0, X2) & mother(X1, X2) & male(X1) & neq(X0, X1)
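The step from the unfolded plan to DNF amounts to distributing conjunction over disjunction. A small Python sketch of that single step follows; the list-of-lists encoding and the function name to_dnf are assumptions for illustration, and the order of the resulting conjunctions may differ from the prototype's output.

```python
from itertools import product

# Unfolded plan for brother(X0, X1): a conjunction whose members are
# disjunctions, each written as a list of alternative atoms.
lqp = [
    ["father(X0,X2)", "mother(X0,X2)"],
    ["father(X1,X2)", "mother(X1,X2)"],
    ["male(X1)"],
    ["neq(X0,X1)"],
]

def to_dnf(conjunction_of_disjunctions):
    """Distribute AND over OR: pick one alternative from every disjunction."""
    return [list(choice) for choice in product(*conjunction_of_disjunctions)]

dnf = to_dnf(lqp)
print(" v\n".join(" & ".join(conjunct) for conjunct in dnf))
```

Two binary disjunctions yield 2 x 2 = 4 conjunctions, which is the "union of joins" form the mediator then orders per source.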
Example (Cont’d)
• ?- plan(brother(X0, X1)).
brother(X0, X1) == Bp ordered LQP ==>
  parentDb(father(X1, X2) & father(X0, X2)) & genderDb(male(X1)) & mediator(neq(X0, X1)) v
  parentDb(father(X1, X2) & mother(X0, X2)) & genderDb(male(X1)) & mediator(neq(X0, X1)) v
  parentDb(mother(X1, X2) & father(X0, X2)) & genderDb(male(X1)) & z_mediator(neq(X0, X1)) v
  parentDb(mother(X1, X2) & mother(X0, X2)) & genderDb(male(X1)) & z_mediator(neq(X0, X1))
Computing Feasible Plans (Goal Ordering)
• A conjunctive query Q is an expression of the form
  – q(X) :- p1(X1), ..., pn(Xn)
  – order of subgoals p_i is irrelevant
• An ordered plan P is an expression of the form
  – q(X) :- [p1(X1), ..., pn(Xn)]
  – order of subgoals p_i is important
• Problem:
  – given Q, compute P which is feasible, i.e., observes the limited query capabilities of sources
  – Here: binding patterns, i.e., predicates’ arguments can be
    • “b” – bound
    • “f” – free
    • “_” – bound or free
A Simple Algorithm for Ordering Goals
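One common way to order goals under binding patterns is a greedy loop: repeatedly pick any goal that is callable with the variables bound so far, then mark its variables as bound. The Python sketch below illustrates that idea and is not necessarily the algorithm shown on the slide. Assumptions: only “b” positions constrain a call (“f” and “_” are treated permissively, since the mediator can filter afterwards), all arguments are variables, and the binding patterns for father and male are invented.

```python
def callable_with(pattern, args, bound):
    """True if every 'b' position of the pattern holds an already-bound variable."""
    return len(pattern) == len(args) and all(
        arg in bound for flag, arg in zip(pattern, args) if flag == "b"
    )

def order_goals(goals, patterns, initially_bound=()):
    """Return a feasible ordering of goals, or None if none exists."""
    bound = set(initially_bound)
    remaining = list(goals)
    plan = []
    while remaining:
        for goal in remaining:
            pred, args = goal
            if any(callable_with(p, args, bound) for p in patterns[pred]):
                plan.append(goal)
                bound.update(args)      # a call binds all of the goal's variables
                remaining.remove(goal)
                break
        else:
            return None                 # no remaining goal is callable
    return plan

# Hypothetical capabilities: father needs one bound argument, male can enumerate.
patterns = {"father": ["bf", "fb"], "male": ["f"]}
goals = [("father", ("X0", "X2")), ("father", ("X1", "X2")), ("male", ("X1",))]
plan = order_goals(goals, patterns)
print(plan)
```

Under these assumptions the greedy choice is safe: the set of bound variables only grows, so a goal that is callable now stays callable later.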
Query Containment
• A query Q1 is contained in Q2, denoted Q1 ⊆ Q2,
  – if for all possible database instances, the set of answers to Q1 is contained in the set of answers to Q2.
• Q1 and Q2 are called equivalent
  – if Q1 ⊆ Q2 and Q2 ⊆ Q1.
• Query containment is undecidable for many languages, e.g., for the relational calculus (SQL).
• For conjunctive queries, the problem is NP-complete (and thus decidable)
  – Since query sizes tend to be “small” (in particular, when compared to database sizes), query containment is still of use in practice (indeed, it is one of the most fundamental tools for logic-based query optimization).
Query Containment
• Q1(Xs, Ys) is contained in Q2(Xs, Zs)
  iff ALL Xs: (EXISTS Ys: Q1(Xs, Ys)) => (EXISTS Zs: Q2(Xs, Zs))
• iff we can refute its negation
• iff NOT ALL Xs: (EXISTS Ys: Q1(Xs, Ys)) => (EXISTS Zs: Q2(Xs, Zs)) |= []
• iff EXISTS Xs: (EXISTS Ys: Q1(Xs, Ys)) AND NOT (EXISTS Zs: Q2(Xs, Zs)) |= []
• iff canonical_db(Q1) AND NOT Q2(Xs, Zs) |= []
  – create a database from Q1 (freeze its variables into constants), then run Q2 as a query ...
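The canonical-database test for conjunctive queries can be sketched in Python as follows. This is an illustrative sketch, not the tutorial's Prolog code: the query encoding (head arguments plus a list of (predicate, args) atoms), the convention that variables start with an uppercase letter, and the "c_" prefix for frozen constants are all assumptions.

```python
def is_var(term):
    return term[:1].isupper()

def freeze(term):
    # Turn a variable into a fresh constant; constants stay as they are.
    return "c_" + term if is_var(term) else term

def match(subst, pattern, fact):
    """Extend subst so that pattern maps onto fact, or return None."""
    subst = dict(subst)
    for term, const in zip(pattern, fact):
        if is_var(term):
            if subst.setdefault(term, const) != const:
                return None
        elif term != const:
            return None
    return subst

def contained_in(q1, q2):
    """True if conjunctive query q1 is contained in conjunctive query q2."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    canonical_db = {(p, tuple(freeze(a) for a in args)) for p, args in body1}
    frozen_head = tuple(freeze(a) for a in head1)

    def search(goals, subst):
        if not goals:
            return True
        (pred, args), rest = goals[0], goals[1:]
        for fact_pred, fact_args in canonical_db:
            if fact_pred == pred and len(fact_args) == len(args):
                extended = match(subst, args, fact_args)
                if extended is not None and search(rest, extended):
                    return True
        return False

    # Q2 must reproduce Q1's frozen head when run on the canonical database.
    start = match({}, head2, frozen_head)
    return start is not None and search(list(body2), start)

# q1(X) :- parent(X, Z), parent(Z, W), male(X).   q2(X) :- parent(X, Z).
q1 = (("X",), [("parent", ("X", "Z")), ("parent", ("Z", "W")), ("male", ("X",))])
q2 = (("X",), [("parent", ("X", "Z"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

The backtracking search is exponential in the worst case, in line with the NP-completeness of the problem, but queries are typically small.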
Query Containment Algorithm (in Prolog)
• Applications:
  – query minimization (a conjunctive query is minimal if no conjunct can be dropped)
  – semantic query optimization
    • Q ⊆ denial
    • here: a denial is an integrity constraint and states what must not hold
    • example: denial = false :- mother(X, M), father(Y, M)
Example
• 50% of the clauses of the executable plan are irrelevant ...
Mediator Demo
• Computer Science Challenges:
  – Given a query Q over virtual integrated database V, how to come up with Q’ over the source schemas? (cf. Garlic, DiscoveryLink, ...)
    • query rewriting of Q(V) into Q’(SRCs) using unfolding and normalization
    • computation of feasible orders (NP-complete!?) while minimizing number of “chunks” sent to sources
    • semantic query optimization (reasoning over plans!); e.g., conjunctive query containment is NP-complete [Chandra-Merlin-77]
• A quick demo of the current prototype:
  – Find 3D reconstructions of cells found in ‘cerebellar cortex’:
    • ?- ccdbData('cerebellar cortex').
    • Join everything reachable along ‘cerebellar-cortex’.(has-a)* in UMLS ...
    • ... with concept markup in CCDB ...
    • ... retrieve (links to) results ...
    • ... also show on SmartAtlas tool
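The demo query combines a (has-a)* path traversal over a domain ontology with a join against concept markup. A rough Python sketch of that combination follows. Everything concrete in it is hypothetical: the has-a edges are a toy stand-in for UMLS, the dataset identifiers stand in for CCDB records, and ccdb_data is only a namesake of the prototype's predicate.

```python
# Hypothetical has-a edges (toy stand-in for the ontology).
HAS_A = {
    "cerebellar cortex": ["purkinje cell layer", "granular layer"],
    "purkinje cell layer": ["purkinje cell"],
    "granular layer": ["granule cell"],
}

# Hypothetical concept markup: concept -> dataset identifiers.
CCDB_MARKUP = {
    "purkinje cell": ["recon-17"],
    "granule cell": ["recon-42"],
    "pyramidal cell": ["recon-99"],
}

def reachable(start, edges):
    """All concepts reachable from start via zero or more edges, i.e. start.(has-a)*."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def ccdb_data(concept):
    """Join the reachable concepts with the markup and return matching datasets."""
    concepts = reachable(concept, HAS_A)
    return sorted(d for c in concepts for d in CCDB_MARKUP.get(c, []))

print(ccdb_data("cerebellar cortex"))
```

The traversal runs at the mediator, while the markup lookup would be a subquery sent to the source.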
Mediator Demo
From XML-Based to Logic and Model-Based (“Semantic”) Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  – DTDs talk about element nesting
  – XML Schema schemas give you data types
  – need anything else? => write comments!
• Domain Semantics is complex:
  – implicit assumptions, hidden semantics
  => sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
  1. employ richer OO models
  2. make domain semantics and “glue knowledge” explicit
  3. use ontologies to fix terminology and conceptualization
  4. avoid ambiguities by using formal semantics
From XML-Based to Model-Based Mediation
• Data and Knowledge Sharing Potential:
  Database Mediation + Knowledge Representation = Model-Based Mediation
• Basic Ideas:
  – turn primary data sources into knowledge sources
  – employ secondary glue knowledge sources
    • generic: UMLS, ...
    • specific: community/laboratory ontologies
Information Integration Landscape
[Figure: systems plotted by conceptual complexity/depth (low: DB mediation techniques; high: ontologies, KR formalisms) against conceptual distance (one-world to multiple-worlds). Labels: addall, book-buyer, home-buyer, 24x7 consumer, BLAST, Entrez, MIA, Tambis, RiboWeb, EcoCyc, GO, UMLS, WordNet, Cyc, Bioinformatics, Geoinformatics, Model-Based Mediation]
Knowledge Representation: Relating Theory to the World via Formal Models
Source: John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”
XML-Based vs. Model-Based Mediation
• XML-Based Mediation
  – Integrated-DTD := XML-QL(Src1-DTD, ...)
  – no domain constraints
  – structural constraints (DTDs): parent, child, sibling, ...; e.g., A = (B*|C), D; B = ...
  – XML elements, XML models, raw data
• Model-Based Mediation
  – CM ~ {Descr. Logic, ER, UML, RDF/XML(-Schema), …}
  – CM-QL ~ {F-Logic, DAML+OIL, …}
  – Integrated-CM := CM-QL(Src1-CM, ...)
  – glue maps: DMs, PMs
  – logical domain constraints (IF ... THEN ...)
  – classes, relations, is-a, has-a, ...
  – (XML) objects, conceptual models, raw data
What’s the Glue? What’s in a Link?
• Syntactic Joins (equality)
  – (X, Y) := X.SSN = Y.SSN
  – (X, Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
  – (X, Y, Score) := BLAST(X, Y, Score)
• Semantic/Rule-Based Joins
  – (X, Y, C) := X isa C, Y isa C, BLAST(X, Y, S), S > 0.8 (homology, lub)
  – (X, Y, [produces, B, increased_in]) := X produces B, B increased_in Y. (rule-based)
    e.g., X = -secretase, B = beta amyloid, Y = Alzheimer’s disease
• Challenge:
  – compile semantic joins into efficient syntactic ones
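The difference between a syntactic join and a semantic join via a least upper bound (lub) in an is-a hierarchy can be sketched in Python. The hierarchy, record layout, and function names below are assumptions made for illustration; the similarity test (BLAST score) from the slide is left out to keep the sketch small.

```python
# Hypothetical is-a hierarchy: concept -> direct superconcept.
ISA = {
    "purkinje cell": "neuron",
    "pyramidal cell": "neuron",
    "neuron": "cell",
    "astrocyte": "glial cell",
    "glial cell": "cell",
}

def ancestors(concept):
    """The concept followed by all of its superconcepts, most specific first."""
    chain = [concept]
    while concept in ISA:
        concept = ISA[concept]
        chain.append(concept)
    return chain

def lub(x, y):
    """Most specific common superconcept of x and y, or None."""
    above_y = set(ancestors(y))
    return next((c for c in ancestors(x) if c in above_y), None)

def syntactic_join(left, right, key):
    """(X, Y) := X.key = Y.key"""
    return [(a["id"], b["id"]) for a in left for b in right if a[key] == b[key]]

def semantic_join(left, right, within):
    """(X, Y, C) := X isa C, Y isa C, with C = lub(X, Y) at or below `within`."""
    result = []
    for a in left:
        for b in right:
            common = lub(a["concept"], b["concept"])
            if common is not None and within in ancestors(common):
                result.append((a["id"], b["id"], common))
    return result

src1 = [{"id": "s1", "concept": "purkinje cell"}]
src2 = [{"id": "n1", "concept": "pyramidal cell"}, {"id": "n2", "concept": "astrocyte"}]
print(syntactic_join(src1, src2, "concept"))
print(semantic_join(src1, src2, within="neuron"))
```

The syntactic join finds nothing because no concept strings are equal, while the semantic join links the two neuron records through their common superconcept.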
Model-Based Mediation Methodology ...
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  – complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  – explicit representation of (“hidden”) source semantics
  – logic rules over OM(S)
• Contextualization CON(S):
  – situate OM(S) data using “glue maps” (GMs):
    => domain maps DMs (ontology) = terminological knowledge: concepts + roles
    => process maps PMs = “procedural knowledge”: states + transitions
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  – declarative (logic) rules with object-oriented features
  – defined over CM(S), domain maps, process maps
  – needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  – mediator composes the user query Q with the IVD
  – ... rewrites (Q o IVD), sends subqueries to sources
  – ... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
[Figure: the USER/Client queries the integrated view (a CM). The Mediator Engine (XSB engine with FL rule processing, LP rule processing, and graph processing) evaluates the Integrated View Definition (IVD) over the “glue” maps (GMs: domain maps DMs and process maps PMs), which provide the semantic context CON(S), and over the source models CM S1, CM S2, CM S3 in the GCM. CM queries and results are exchanged in XML with CM-wrappers (XML-wrappers) around the sources S1, S2, S3, where CM(S) = OM(S) + KB(S) + CON(S).]
First results & demos: KIND prototype, formal DM semantics, PMs [SSDBM 00] [VLDB 00] [ICDE 01] [NIH-HB 01] (w/ Gupta, Martone)
Domain Maps (Ontologies) as Glue Knowledge Sources
• Domain Map = Ontology
  – representation of terminological knowledge
• Use in Model-Based Mediation
  – (derived) concepts as “drop points”, “anchor points”, “context” for source classes
  – compile-time use: view definition, subsumption, classification, ...
  – runtime use: querying/deduction, path queries, ...
• Formalisms:
  – Semantic nets, Thesauri, Frame-logic, Description logics, ...
Ontologies
• So what is an Ontology?
  – definition of things that are relevant to your application
  – representation of terminological knowledge (“TBox”)
  – explicit specification of a conceptualization
  – concept hierarchy (“is-a”)
  – further semantic relationships between concepts
  – abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
  – NCMIR ANATOM
  – GO (Gene Ontology)
  – UMLS (Unified Medical Language System)
  – CYC
Formalism for Ontologies: Description Logic
• DL definition of “Happy Father” (example from Ian Horrocks, U Manchester, UK)
Description Logic Statements as Rules
• In first-order logic (rule form):
  happyFather(X) :- man(X), child(X, C1), child(X, C2), blue(C1), green(C2), not (child(X, C3), poorunhappyChild(C3)).
  poorunhappyChild(C) :- not rich(C), not happy(C).
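Evaluating these two rules over one given database instance (querying, as opposed to reasoning over all instances) can be sketched in Python. The relation contents are invented sample data, and negation is read as "not in the given set", i.e. under a closed-world assumption.

```python
def poor_unhappy_child(c, rich, happy):
    # poorunhappyChild(C) :- not rich(C), not happy(C).
    return c not in rich and c not in happy

def happy_fathers(man, child, blue, green, rich, happy):
    """All X that satisfy the happyFather rule in the given instance."""
    result = set()
    for x in man:
        kids = {c for (p, c) in child if p == x}
        has_blue = any(c in blue for c in kids)      # child(X, C1), blue(C1)
        has_green = any(c in green for c in kids)    # child(X, C2), green(C2)
        no_poor_unhappy = not any(poor_unhappy_child(c, rich, happy) for c in kids)
        if has_blue and has_green and no_poor_unhappy:
            result.add(x)
    return result

# Hypothetical instance: child pairs are (parent, child).
man = {"al", "bo"}
child = {("al", "k1"), ("al", "k2"), ("bo", "k3"), ("bo", "k4")}
blue, green = {"k1", "k3"}, {"k2", "k4"}
rich, happy = {"k1", "k3"}, {"k2"}

fathers = happy_fathers(man, child, blue, green, rich, happy)
print(fathers)
```

Here "bo" fails only because child k4 is neither rich nor happy, which shows the effect of the negated subgoal.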
Description Logics
• Terminological Knowledge (TBox)
  – Concept Definition (naming of concepts):
  – Axiom (constraining of concepts):
  => a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
  – the marked neuron in image 27
  => the concrete instances/individuals of the concepts/classes that your sources export
Querying vs. Reasoning
• Querying:
  – given a DB instance I (= logic interpretation), evaluate a query expression φ (e.g., SQL, FO formula, Prolog program, ...)
  – boolean query: check if I |= φ (i.e., if I is a model of φ)
  – (ternary) query: { (X, Y, Z) | I |= φ(X, Y, Z) }
  => check happyFathers in a given database
• Reasoning:
  – check if I |= φ implies I |= ψ for all databases I,
  – i.e., if φ => ψ
  – undecidable for FO, F-logic, etc.
  – Description Logics are decidable fragments
    => concept subsumption, concept hierarchy, classification
    => semantic tableaux, resolution, specialized algorithms
What’s in an Answer? (What’s in a Link? revisited)
• Semantic/Rule-Based Joins
  – (X, Y, [produces, B, increased_in]) := X produces B, B increased_in Y. (rule-based)
    e.g., X = -secretase, B = beta amyloid, Y = Alzheimer’s disease
• What is the Erdős number of person P?
  – 3
• Really? Why?
  – authority based: <VIP> said so
  – faith based: don’t know but firmly believe
  – query statement Q = ... derived it from DB I
  – query Q = ... derived it from DB I and KB T using derivation D
  => logic-based systems often “come with explanations” (“computations as proofs”)
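An answer that carries its own derivation can be sketched in Python with a breadth-first search that returns the Erdős number together with the co-author chain justifying it. The co-author graph, the names in it, and the function name are hypothetical.

```python
from collections import deque

def erdos_number_with_proof(coauthors, person, root="erdos"):
    """Return (number, chain from person back to root), or (None, []) if unreachable."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        author = queue.popleft()
        if author == person:
            break
        for other in coauthors.get(author, ()):
            if other not in parent:
                parent[other] = author   # remember how `other` was reached
                queue.append(other)
    if person not in parent:
        return None, []
    chain, node = [], person
    while node is not None:
        chain.append(node)
        node = parent[node]
    return len(chain) - 1, chain

# Hypothetical co-authorship graph (symmetric adjacency lists).
coauthors = {
    "erdos": ["a"],
    "a": ["erdos", "b"],
    "b": ["a", "p"],
    "p": ["b"],
}
number, chain = erdos_number_with_proof(coauthors, "p")
print(number, " -> ".join(chain))
```

The chain plays the role of the derivation D: it can be checked edge by edge against the database, which is what distinguishes it from an authority-based or faith-based answer.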
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
  Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figure: the domain expert knowledge rendered as a Domain Map (DM) graph and as a DM in description logic]
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
=> sources can register new concepts at the mediator ...
Example: ANATOM Domain Map
Browsing Registered Data with Domain Maps
Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:
• related formalisms
Summary: Mediation Scenarios & Techniques
Federated Databases | XML-Based Mediation | Model-Based Mediation
One-World | One-/Multiple-Worlds | Complex Multiple-Worlds
Common Schema | Mediated Schema | Common Glue Maps
SQL, rules | XML query languages | DOOD query languages
Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
Syntactic Joins | Syntactic Joins | “Semantic” Joins via Glue Maps
DB expert | DB expert | KRDB + domain expert
Semantic (Community) Webs
“Within the next decade, computing technology will transform the Internet into the Interspace, an information infrastructure that supports semantics indexing and concept navigation across widely distributed community repositories.” (Bruce Schatz, IEEE Computer, Jan. 2002)
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Tim Berners-Lee et al., 2001)
Combine Everything: Die eierlegende Wollmilchsau (“the egg-laying wool-milk-sow”, i.e., one system that does it all):
• Database Federation/Mediation
  – query rewriting under GAV/LAV
  – w/ binding pattern constraints
  – distributed query processing
• Semantic Mediation
  – semantic integrity constraints, reasoning w/ plans, automated deduction
  – deductive database/logic programming technology, AI “stuff” ...
  – Semantic Web technology
• Scientific Workflow Management
  – more procedural than database mediation (often the scientist is the query planner)
  – deployment using web services
BREAK ... followed by demos ...
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries” (GSC)
• concept hierarchies: Genesis, Fabric, Composition, Texture
• “smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)
• data at different description levels can be found and processed
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries”
http://klin-pc.sdsc.edu:8080/examples/jsp/geon/composition.jsp
GEON Ontology Demo
• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/old-rock.jsp
• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/rock.jsp
Architecture of Ontology Based Map Integration
[Figure: components: Global Web Map Server, Ontology Mapping, Web Map Server, Database]
DOE Scientific Data Management Center
• Scientific Workflow Demo
Example: A Scientific Workflow [Figure: workflow with stages A, B, C; labels: Microarray analysis, cDNA cluster, Database search for promoter identification, Promoter sequences, Common promoter alignment, Promoter model, Database search; * = new candidate target genes] Adapted from Thomas Werner, Biomolecular Engineering, 17:87-94 (2001)
Conceptual Workflow [Figure: step labels] • Compute clusters (min. distance) • Select gene-set (cluster-level) • For each gene – Retrieve matching cDNA – Retrieve genomic sequence – Extract promoter region (begin, end) • For each promoter – Retrieve transcription factors – Compute subsequence labels – Arrange transcription factors • With all promoter models – Align promoters – Create consensus sequence • Compute joint promoter model
Mapping This Workflow To Web Sites
Customized CGI Application
ClustalW Output • Transfac Query Results
SDM-SciDAC System Architecture [Figure labels: User; WF-Pilot (design, execution monitoring); WF-Compiler (AWF to EWF translation, web service matching, AAV rules, ET schemas, semantic type checking, query rewriting); WF-Engine (scheduling and execution, web service invocation, data type conversion, conversion rules); executable tasks (ET) such as Genbank and BLAST; Abstract Task (AT) Repository; Executable Task (ET) Repository; Data & Parameter Ontologies; Datatype & Conversion Repository]
AWF to EWF • Declarative specification: For each gene – Retrieve matching cDNA – Retrieve genomic sequence – Extract promoter region (begin, end)
GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :- GENBANK(+{selectedGene}, -{cDNASequence}), BLAST(+{cDNASequence}, +dbName, +format, -{rankedGenomicSequenceList}).
GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :- GENBANK(+{selectedGene}, -{cDNASequence}), BLAT(+{cDNASequence}, +QueryType, +SortCriteria, +OutputType, -{rankedGenomicSequenceList}).
IdentifyPromoterElements(+{rankedGenomicSequenceList}, -{element}) :- PromoterSequences(+{rankedGenomicSequenceList}, getBeginEnd(+Species, -Begin, -End), -{element}).
Slide annotations: • User supplied • Same functionality, different operational constraints and availability • Need extra domain knowledge • Translation to EWF needs creation of iterators
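The rules above give two alternative bodies for the same abstract task, each annotated with input (+) and output (-) parameters. A minimal Python sketch of how a compiler might pick one of them is shown below. It assumes a simple greedy strategy: take the first rule whose executable tasks are all available, and order its tasks so that every input is bound before the task runs. The function names `order_tasks` and `compile_task` and the set-based encoding are hypothetical; the actual WF-Compiler algorithm is not given on the slides.

```python
# Each rule body is a list of (task, inputs, outputs), mirroring the +/- annotations.
RULES = {
    "GetGenomicSequence": [
        [("GENBANK", {"selectedGene"}, {"cDNASequence"}),
         ("BLAST", {"cDNASequence", "dbName", "format"},
          {"rankedGenomicSequenceList"})],
        [("GENBANK", {"selectedGene"}, {"cDNASequence"}),
         ("BLAT", {"cDNASequence", "QueryType", "SortCriteria", "OutputType"},
          {"rankedGenomicSequenceList"})],
    ],
}


def order_tasks(body, bound):
    """Greedily order tasks so every input is bound before a task runs.

    Returns the list of task names, or None if some task can never run.
    """
    bound = set(bound)
    pending = list(body)
    plan = []
    while pending:
        ready = next((t for t in pending if t[1] <= bound), None)
        if ready is None:
            return None  # remaining tasks all wait on unbound inputs
        pending.remove(ready)
        plan.append(ready[0])
        bound |= ready[2]  # the task's outputs become available
    return plan


def compile_task(abstract_task, bound, available):
    """Return the first feasible plan that uses only available executable tasks."""
    for body in RULES[abstract_task]:
        if all(task in available for task, _, _ in body):
            plan = order_tasks(body, bound)
            if plan is not None:
                return plan
    return None


# BLAST is unavailable here, so the BLAT variant is chosen.
plan = compile_task(
    "GetGenomicSequence",
    bound={"selectedGene", "QueryType", "SortCriteria", "OutputType"},
    available={"GENBANK", "BLAT"},
)
```

This covers only the "same functionality, different availability" aspect. Iterating over sets of genes and the extra domain knowledge needed by `getBeginEnd` are left out of the sketch.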
Abstract Task (AT) Registration
Abstract Task (AT) View and Delete
Abstract Task (AT) Update
AWF Design
EWF Planning and Compilation
EWF Execution
BIRN Tools Demo
Some References (starting points) • XML – General: http://xml.coverpages.org/xml.html – XQuery: http://www.w3.org/XML/Query – XSLT: http://xml.coverpages.org/xsl.html • Query Rewriting – database research literature • Logic Programming – Learn Prolog Now!: http://www.coli.uni-sb.de/~kris/learn-prolog-now/ – SWI-Prolog (a free Prolog system): http://www.swi-prolog.org/ • Ontologies – Web Ontology Language (OWL): http://www.w3.org/TR/owl-features/ – http://www-ksl.stanford.edu/kst/what-is-an-ontology.html – http://www.cs.utexas.edu/users/mfkb/related.html • Model-Based Mediation – http://www.sdsc.edu/~ludaesch/Paper/icde01.html • Semantic Web – http://www.w3.org/2001/sw/
References: Project Web Sites • GEOsciences Network (NSF) – www.geongrid.org • Biomedical Informatics Research Network (NIH) – www.nbirn.net • Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org • Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/