- Number of slides: 40
Knowledge and Provenance: A knowledge model perspective Carole Goble, University of Manchester, UK
Talk roadmap
• What is this provenance about and for?
• Knowledge for Provenance
  – Knowledge technologies
  – How do we represent knowledge for and about provenance?
• The Provenance of Knowledge
  – Where do knowledge assertions come from?
my Context: knowledge-driven middleware for data-intensive in silico experiments in biology. http://www.mygrid.org.uk
Any and every experimental item attracts provenance (so long as you can ID it).
• Experimental design components – workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers; services
• Experimental instances that are records of enacted experiments – data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results
• Experimental glue that groups and links design and instance components – a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist
Provenance is metadata …
• intended for sharing, retrieving, integrating, aggregating and processing.
• generated with the hope that it is comprehensive enough to be future-proofed.
• recorded for users we do not yet know, who will likely use the object in a different way.
• machine-computable: free text is of limited help.
• Provenance is the knowledge that makes
  – an item interpretable and reusable within a context
  – an item reproducible or at least repeatable.
• It's part of the information model of any system.
Question: What ATPase superfamily proteins are found in mouse?
1. Q9CQV8 O70468 143B_MOUSE from Swiss-Prot version 30, 05/11/02, 16:45 GMT, EBI server.
2. O70455, P54775 143B_MOUSE from Swiss-Prot version 29, 05/11/02 16:45 GMT, local copy.
3. P43686 and P54775 derived by a distributed query over DB1 and DB2.
4. InterPro (no particular version) is a pattern database for protein superfamilies and domains for GPCRs but you need an account.
5. The publicly available workflow mouse ATPase (http://www.somelab.edu/bio/carole/wf/3345.wsfl) will generate the result from data in your personal repository and you have permission to run the services it needs. Click to run it.
6. The Attwood lab expertise is in nucleotide binding proteins (ATPase superfamily proteins are nucleotide binding proteins).
7. Jones published a new paper on this in Nature Genetics two weeks ago, and you have an account to access it on-line.
8. Smith in your lab asked this question yesterday and the answer he got is annotated by a commentary in his e-Log Book.
9. P43686 (human) calculated by applying the algorithm ABC located at NCBI using data in database AAA.
Knowledge services behind the answers (slide labels):
• Database query (know-what)
• Virtual data products (know-how)
• Workflow (know-how)
• Personalised profile (know-whom-to)
• Collaboration & community (know-where, know-when)
• Digital archive (know-which)
• Provenance (know-wherefrom)
• Replicas (know-which)
• Ontology and Inference (know-whether)
• Authorisation, Authentication and Accounting (know-who)
• Explanation (know-why)
• Annotation & notes (know-that)
Provenance forms
• Derivations
  – A path like a workflow, script or query.
  – Linking items, usually in a directed graph.
  – An explanation of when, by whom and how something was produced.
  – Execution, process-centric
• Annotations
  – Attached to items or collections of items, in a structured, semi-structured or free text form.
  – Annotations on one item or linking items.
  – An explanation of why, when, where, who, what, how.
  – Data-centric
[Figure: derivation graph of data products, each node labelled with its parameters, e.g. mass = 200; decay = bb, ZZ or WW; stability = 1 or 3; LowPt = 20; HighPt = 10000; event = 8; plot = 1]
Semantic discovery – services & workflows
• Services and workflows in the registry have RDF and OWL descriptions
• Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform…
• Querying using RDQL over the RDF UDDI registry for operational metadata
• Matching using FaCT OWL classification for concept-based metadata
[Screenshots: a registry browser; a workflow wizard]
Provenance forms in myGrid
• Derivations
  – The FreeFluo Workflow Enactment Engine provides a detailed provenance record, stored in the myGrid Information Repository (mIR), describing what was done, with what services and when
  – XML document, soon to be an RDF model
• Annotations
  – Every mIR object has Dublin Core provenance properties described in an attribute-value model
Provenance of data
• Operational execution trail
[Diagram: a process node with a start time and end time, run_for urn:Clare Jennings, by_service lsid:HGVBase_retrieve, with input Gene:AC005412.6 and output SNP:000010197]
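The execution trail on this slide can be read as a handful of subject-predicate-object assertions about a single process run. The following is a minimal Python sketch under stated assumptions: the process identifier `urn:process:1` and both timestamps are invented for illustration, and only the gene, SNP, user and service labels come from the slide.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifier for the process run; not from the talk.
PROCESS = "urn:process:1"

trail: List[Triple] = [
    Triple(PROCESS, "input", "Gene:AC005412.6"),
    Triple(PROCESS, "output", "SNP:000010197"),
    Triple(PROCESS, "run_for", "urn:Clare Jennings"),
    Triple(PROCESS, "by_service", "lsid:HGVBase_retrieve"),
    # Both timestamps are made up for the example.
    Triple(PROCESS, "start_time", "2003-12-01T10:00:00Z"),
    Triple(PROCESS, "end_time", "2003-12-01T10:00:05Z"),
]


def describe(process: str, triples: List[Triple]) -> Dict[str, str]:
    """Collect every property asserted about one process into a dict."""
    return {t.predicate: t.obj for t in triples if t.subject == process}


record = describe(PROCESS, trail)
print(record["by_service"], "produced", record["output"], "from", record["input"])
```

Keeping the trail as flat triples, rather than a fixed record type, means new properties can be attached to the process later without changing any schema.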
Provenance of knowledge
• Declarative semantic execution trail
[Diagram: the process (start time, end time, run_for urn:Claire Jennings, by_service lsid:HGVBase_retrieve) has input Gene:AC005412.6 and output SNP:000010197; the statement "Gene:AC005412.6 contains_single_nucleotide_polymorphism SNP:000010197" is marked "as stated by" that process]
Provenance of knowledge
• Trust and attribution
[Diagram: the statement "Gene:AC005412.6 contains_single_nucleotide_polymorphism SNP:000010197" is "as stated by" the process (start time, end time, run_for urn:Claire Jennings, by_service lsid:HGVBase_retrieve) and "disputed by" urn:Carole Goble]
Provenance of knowledge
• Aggregation and integration
[Diagram: the statement "Gene:AC005412.6 contains_single_nucleotide_polymorphism SNP:000010197" is "as stated by" two processes: one run_for urn:Bill Jones by_service lsid:BIGDbretrieve, the other run_for urn:Claire Jennings by_service lsid:HGVBase_retrieve, each with its own start time and end time]
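The three "provenance of knowledge" slides attach attribution to a statement rather than to a data item: a process states it, a person may dispute it, and several processes may state the same thing. A rough Python sketch of that idea follows, assuming hypothetical process identifiers (`urn:process:hgvbase-run`, `urn:process:bigdb-run`) and a made-up `KnowledgeBase` class; it is not the myGrid data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    stated_by: Set[str] = field(default_factory=set)    # processes asserting it
    disputed_by: Set[str] = field(default_factory=set)  # agents contesting it


class KnowledgeBase:
    """Stores each distinct statement once and aggregates its attribution."""

    def __init__(self) -> None:
        self.statements: Dict[Tuple[str, str, str], Statement] = {}

    def state(self, s: str, p: str, o: str, process: str) -> Statement:
        stmt = self.statements.setdefault((s, p, o), Statement(s, p, o))
        stmt.stated_by.add(process)
        return stmt

    def dispute(self, s: str, p: str, o: str, agent: str) -> None:
        self.statements[(s, p, o)].disputed_by.add(agent)


SNP_CLAIM = ("Gene:AC005412.6", "contains_single_nucleotide_polymorphism", "SNP:000010197")

kb = KnowledgeBase()
# The process identifiers below are invented for illustration.
kb.state(*SNP_CLAIM, process="urn:process:hgvbase-run")
kb.state(*SNP_CLAIM, process="urn:process:bigdb-run")
kb.dispute(*SNP_CLAIM, agent="urn:Carole Goble")

claim = kb.statements[SNP_CLAIM]
print(len(claim.stated_by), "supporting runs; disputed by", sorted(claim.disputed_by))
```

Keying statements by their (subject, predicate, object) content is what lets independent runs aggregate onto one assertion; a trust policy could then weigh supporters against disputers.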
20,000 feet and ground level
Top Down provenance
  – What is going on?
  – Unification and summaries of collective provenance knowledge.
  – Collaborative, Awareness, Experience base, Scientific Corporate memory.
  – “What projects have something to do with human SNPs?”
  – “What experiments use the PSI-BLAST service regardless of version?”
Bottom Up provenance
  – Where did this data object http://doh.dah.ac.uk/… come from?
  – Which version of SwissProt was run in workflow http://blah.ac.uk/…?
Build up layers of provenance knowledge: User, Trust, Domain, Experiment, Execution, Data, Services, Workflow
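The two viewpoints amount to two kinds of query over the same run records: a top-down query summarises across experiments, and a bottom-up query traces one data object back to its origin. A small Python sketch, assuming an invented `Run` record and made-up experiment names, version labels and data identifiers (none are from the talk):

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Run:
    experiment: str
    service: str
    version: str
    output: str


# Entirely hypothetical run log.
RUNS: List[Run] = [
    Run("exp-1", "PSI-BLAST", "v1", "data:a"),
    Run("exp-2", "PSI-BLAST", "v2", "data:b"),
    Run("exp-3", "SwissProt", "29", "data:c"),
]


def experiments_using(service: str, runs: List[Run]) -> Set[str]:
    """Top down: which experiments used a service, regardless of version?"""
    return {r.experiment for r in runs if r.service == service}


def origin_of(data_id: str, runs: List[Run]) -> Optional[Run]:
    """Bottom up: which run produced this data object, if any is recorded?"""
    return next((r for r in runs if r.output == data_id), None)


print(sorted(experiments_using("PSI-BLAST", RUNS)))
print(origin_of("data:c", RUNS))
```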
Provenance for People and Machines
[Diagram: the provenance layers (Experiment, User, Trust, Services, Domain, Data, Execution, Workflow) placed between two poles: People (subjective, contextual, manual/semi-automated) and Machines (objective, context-free, automated)]
1. Explicitly capture Context
• Reuse methods and strategies (e.g., protocols)
• Make explicit the situational bias that is normally implicit
• Enable future generations of scientists to follow our work
• To capture meaning, we must devise a way of representing concepts and their relationships
Source: Hero, http://hero.geog.psu.edu/Hero_knowledge_management.pdf, downloaded 30/11/03
1. Explicitly capture Context
Using models and terms
• that can be shared and interpreted
• that are extensible and preclude premature restrictions
• that are navigable and computationally processable
Source: Hero, http://hero.geog.psu.edu/Hero_knowledge_management.pdf, downloaded 30/11/03
2. Bridge islands of exported provenance
[Diagram: separate islands of provenance exported by Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1]
Not all exports are the same
[Diagram: the provenance exports of Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1]
So we need to…
• Uniquely identify items through URIs and Life Science Identifiers (GSH/GSR/Handle.net…)
• Explicitly expose provenance by assertions in a common data model…
• Publish and share consensually agreed ontologies so we can share the provenance metadata and add in background knowledge…
• Then we can query, filter, integrate and aggregate the provenance metadata…
• and reason over it to infer more provenance metadata using rules…
• and attribute trust to the provenance…
• Flexibly, so that we do not cast models and terms in stone, and so can cope with different degrees of description.
What’s an Ontology?
• A common vocabulary of terms
• Some specification of the meaning of the terms
• Concepts, relationships, axioms
• A shared consensual understanding for people and machines
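The step "reason over it to infer more provenance metadata using rules" can be illustrated with two simple rules applied until nothing new appears. This is only a sketch: the predicate names (`input`, `output`, `derived_from`), the identifiers and the rules themselves are assumptions chosen for the example, not rules given in the talk.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def infer_derivations(facts: Set[Triple]) -> Set[Triple]:
    """Forward-chain two rules until no new triple is produced."""
    known = set(facts)
    while True:
        inputs = {(s, o) for (s, p, o) in known if p == "input"}
        outputs = {(s, o) for (s, p, o) in known if p == "output"}
        derived = {(s, o) for (s, p, o) in known if p == "derived_from"}
        new: Set[Triple] = set()
        # Rule 1: a process with input X and output Y implies Y derived_from X.
        for proc_in, x in inputs:
            for proc_out, y in outputs:
                if proc_in == proc_out:
                    new.add((y, "derived_from", x))
        # Rule 2: derived_from is transitive.
        for a, b in derived:
            for b2, c in derived:
                if b == b2:
                    new.add((a, "derived_from", c))
        if new <= known:
            return known
        known |= new


# Hypothetical two-step pipeline: data1 -> data2 -> data3.
facts: Set[Triple] = {
    ("urn:process:1", "input", "lsid:data1"),
    ("urn:process:1", "output", "lsid:data2"),
    ("urn:process:2", "input", "lsid:data2"),
    ("urn:process:2", "output", "lsid:data3"),
}
closure = infer_derivations(facts)
print(("lsid:data3", "derived_from", "lsid:data1") in closure)
```

The inferred `derived_from` links were never recorded by either process; they only exist because both exports share identifiers and a common model.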
W3C Metadata language/model: Resource Description Framework
• Common model for metadata
• Assertions as triples (subject, predicate, object) forming graphs.
• Associate URIs (LSIDs) with other URIs (LSIDs).
• Associate URIs with OWL concepts (which are URIs).
• RDQL, repositories, integration tools, presentation tools
• Query over, link together, aggregate, integrate assertions.
• Avoids pre-commitment
  – Self-describing
  – Incremental
  – Extensible
  – Advantage and drawback.
[Diagram: Data, Workflow, Experiment, User and Service linked in one graph. Graphic based on Tim Berners-Lee, http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html]
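Because every assertion is a triple over shared identifiers, two independently exported graphs can be merged with a plain set union and then queried by pattern. The sketch below uses only the Python standard library rather than a real RDF toolkit or RDQL; the `match` helper, the predicate names and all identifiers are hypothetical.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in graph:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


# Two "islands" exported separately (all names invented for illustration).
workflow_export: Set[Triple] = {
    ("lsid:workflow1", "used_service", "lsid:service1"),
    ("lsid:workflow1", "produced", "lsid:data1"),
}
service_export: Set[Triple] = {
    ("lsid:service1", "has_concept", "concept:SequenceRetrieval"),
}

# Integration is a set union because both islands share identifiers.
merged = workflow_export | service_export

# Which concepts describe the services that workflow1 used?
concepts = sorted(
    concept
    for (_, _, service) in match(merged, s="lsid:workflow1", p="used_service")
    for (_, _, concept) in match(merged, s=service, p="has_concept")
)
print(concepts)
```

Neither export alone can answer the final question; the join works only because `lsid:service1` names the same thing in both graphs.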
Bridging islands
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1 as islands to be bridged]
Bridging islands: Concepts and LSID
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, with their provenance expressed as RDF]
W3C Ontology language/model: OWL
• Continuum of expressivity
  – Concepts, roles, individuals, axioms
  – From simple frames to description logics
  – Sound and complete formal semantics
  – Compositional and property based
• Reasoning to infer classification
• Eas(ier) to extend and evolve and merge ontologies
• A web language
• Tools, tools!
[Diagram: language lineage: DAML, OIL and RDF feed into DAML+OIL, which leads to OWL]
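A description-logic reasoner such as FaCT infers classification from concept definitions, which is far more than can be shown here. As a much simpler stand-in, the sketch below computes only the transitive closure of asserted subclass links. The lymphocyte and neutrophil links reuse the white blood cell example from this talk; the `white_blood_cell` to `cell` link and all function names are assumptions added for illustration.

```python
from typing import Dict, Set

# Asserted ("told") subclass links.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "lymphocyte": {"white_blood_cell"},
    "neutrophil": {"white_blood_cell"},
    "white_blood_cell": {"cell"},  # assumption, not from the talk
}


def ancestors(concept: str, told: Dict[str, Set[str]]) -> Set[str]:
    """Return every concept that subsumes `concept` via asserted links."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        for parent in told.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(child: str, parent: str, told: Dict[str, Set[str]]) -> bool:
    """True if `parent` equals or transitively subsumes `child`."""
    return child == parent or parent in ancestors(child, told)


print(is_subsumed_by("lymphocyte", "cell", SUBCLASS_OF))
print(sorted(ancestors("neutrophil", SUBCLASS_OF)))
```

With subsumption available, a query for items annotated with "white blood cell" can also return items annotated with the more specific concepts.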
Bridging islands: Concepts and LSIDs
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, each identified by an LSID and linked through RDF]
Layers of Knowledge Languages: Attribution; Explanation; Rules & Inference; Ontologies; Metadata; Standard Syntax; Identity. Wedding cake courtesy of Tim Berners-Lee.
myGrid: everything has a concept & LSID
[Diagram: Workflows; Literature; Provenance record of workflow runs; Notes; Ontologies; People; Data holdings; Services]
Linking objects to objects via URIs and LSIDs
[Diagram: a workflow linked to: People who wrote the workflow; Literature; People to notify of the workflow status; Provenance record of workflow runs; Provenance of the workflow template; Related workflows; Notes; Data holdings; Ontologies describing workflows; Services used]
Generated link anchors: lymphocyte and neutrophil are subsumed by the concept white blood cell.
Annotating a workflow log with concepts
1. Choose the ontology
2. Select an area to annotate
3. Select the concept
4. Provide a description
5. Create the annotation
Generating provenance
[Diagram labels: Scufl workflow template; identify workflow; bind services (OWL descriptions, RDF registry); FreeFluo WFEE workflow execution; input data & parameters (mIR); execution provenance log (startTime, endTime, service instances invoked …); data and metadata from the run (RDF+OWL); workflow knowledge template; knowledge arising from workflow (RDF+OWL)]
P. Afflard et al., The Grid(s)? @ Novartis, presented at PRISM PharmaGrid retreat, July 2003
William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal, Supporting Collaborative Science through a Knowledge and Data Management Portal, in 1st Semantic Web Conference (ISWC 2003) Workshop on Retrieval of Scientific Data, Florida, USA, October 2003
Two views of a gravity model concept from the Hero CODEX web tool (William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal, Supporting Collaborative Science through a Knowledge and Data Management Portal, ISWC 2003 Workshop on Retrieval of Scientific Data, October 2003)
• An ontological description shows how one geoscientist constructs a model
• A social network reveals which users favour different instances of the model, with edge length suggesting the degree of support.
Collaboratory for Multi-Scale Chemical Science (CMCS)
• CMCS “Pedigree Graph” portlet showing provenance relationships between resources (colour coded by original relationship type).
• CMCS Pedigree Browser showing the metadata and relationships of the selected data set.
Provenance dimensions connected by concepts and identifiers
[Diagram labels: project; Services; Workflow instances; Author; workflow template. Based on http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html]
Reflections: annotations
• The annotation metadata model for myGrid holdings is a graph
  – If it waddles like RDF and quacks like RDF, it's RDF
  – Experiments in RDF scalability
  – Co-existence of RDF and other data models (relational)
• Acquisition of annotations and adverts
  – Automated by mining WSDL docs, mining ws-info docs
  – Deep annotation works OK for bioinformatics service concepts (it’s an EMBL record) but…
  – Annotating with biologically meaningful concepts is harder
    • Data in the mIR (it’s a lymphocyte)
    • Manual annotation cost is high!
  – Service/workflow publication tools
• Dealing with change
  – Ontology changes; service changes; annotations change.
Random Thoughts
• Where does the knowledge come from (see Luc)?
• How do we model trust (see Luc)?
• Scalability of Semantic Web technologies?
• Visualisation of knowledge (see Monica)?
• What’s the lifecycle of provenance?
• Different knowledge models for different disciplines?
• Layers of provenance
• Provenance that is domain knowledge
• Provenance for context vs execution workflow provenance
• People vs machine
• Different models for different items, but they still need to be integrated
• Technologies for sharing and integrating that are flexible.
Talk provenance
• myGrid http://www.mygrid.org.uk – Jun Zhao, Mark Greenwood, Chris Wroe, Phil Lord, Chris Greenhalgh, Luc Moreau, Robert Stevens
• Hero http://hero.geog.psu.edu/ – William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal
• Collaboratory for Multi-Scale Chemical Science (CMCS) – James D. Myers, Carmen Pancerella, Carina Lansing, Karen L. Schuchardt, Brett Didier
• Chimera – Michael Wilde, Ian Foster
• Knowledge Space – Novartis
• And special thanks to Ian Cottam for heroic support when my laptop died yesterday afternoon.