61306d5fd93b919a5d7fe39ce451b859.ppt
- Number of slides: 149
Feasting on Brains! From Web Services to Web 2.0 to the Semantic Web and back again… A personal journey through the Semantic Web and Web Services for Health Care and Life Sciences. Mark Wilkinson (markw@illuminae.com), Assistant Professor, Medical Genetics, University of British Columbia; Heart and Lung Research Institute at St. Paul’s Hospital
Benjamin Good (He’s a “Creep”!)
Approach: “Bioinformatics” is a broad field and suffers SEVERE interoperability problems. “Bioinformaticians” tend to be specialists in a particular domain of computational analysis. As a group, the brains of all bioinformaticians contain all (known) bioinformatics. Is it possible to extract the knowledge required for interoperability from the brains of bioinformaticians en masse?
“Human Computation” (Luis von Ahn)
Ontology Spectrum: Catalog/ID → Terms/glossary → Thesauri (“narrower term” relation) → Informal is-a → Formal is-a → Formal instance → Frames (properties) → Value Restrs. → Selected Logical Constraints (disjointness, inverse, …) → General Logical constraints. Originally from AAAI 1999 - Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
An ontology is a representation of knowledge: classes, instances, properties, relationships. [Diagram: Animal, Mammal, Primate, Zombie, Lemur, Human, Hair, Brains, Shoots, Chips, linked by is_a, has, and eats]
Classes: Animal, Mammal, Hair, Primate, Zombie, Brains, Lemur, Shoots, Human, Chips
Instances
Properties: has, is_a, eats
Relations: has, is_a, eats
An ontology is a representation of knowledge: classes, instances, properties, relationships. [Diagram: the same graph as before, with additional Hair nodes]
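The classes, properties, and relations on these slides can be written down as subject-predicate-object triples. The following is a minimal sketch, not anything from the talk itself: the exact edges are a guess at the diagram (in particular, which animal eats what and where Zombie sits in the hierarchy), and the helper names `ancestors` and `has_property` are hypothetical.

```python
# Toy triple store for the zombie example. The edge assignments are an
# assumption; the flattened slide text does not say which node links to which.
TRIPLES = {
    ("Mammal", "is_a", "Animal"),
    ("Primate", "is_a", "Mammal"),
    ("Zombie", "is_a", "Primate"),
    ("Lemur", "is_a", "Primate"),
    ("Human", "is_a", "Primate"),
    ("Mammal", "has", "Hair"),
    ("Zombie", "eats", "Brains"),
    ("Lemur", "eats", "Shoots"),
    ("Human", "eats", "Chips"),
}


def ancestors(cls):
    """Return every class reachable from cls by following is_a edges."""
    found = []
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in TRIPLES:
            if subject == current and predicate == "is_a" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


def has_property(cls, predicate, value):
    """True if cls, or any class it inherits from, asserts (predicate, value)."""
    return any((c, predicate, value) in TRIPLES for c in [cls] + ancestors(cls))


print(sorted(ancestors("Zombie")))
print(has_property("Lemur", "has", "Hair"))
```

Inheritance along is_a is what lets a Lemur "have" Hair even though the property is only stated once, on Mammal.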
Web Service? A software tool that is accessible over the Web. Services are intended to be accessed by machines, not people.
Interoperability? The ability of two Web Services to exchange information, and use that information correctly. This generally requires Semantics in the form of Ontologies…
Mmmm… Brains!! BioMoby: Eating brains to enable Web Service Interoperability
What does BioMoby do?
The BioMoby Plan
• Create an ontology of bioinformatics data-types
• Define an ontology of bioinformatics operations
• Open these ontologies for community input
• Define Web Services v.v. these two ontologies
• A Machine can find an appropriate service
• A Machine can execute that service unattended
• Ontology is community-extensible
Overview of BioMoby Semantic Interoperability [Diagram: MOBY hosts & services (Alignment, Sequence, Gene names, Express., Phylogeny, Protein, Primers, Alleles, …) connected through MOBY Central]
Why couldn’t we do this before?
Interoperability is HARD!
Interoperability through Human Computation. BioMoby Data Type Ontology: an explicit list of all biological data-types, and the relationships between them. Ontology built, brain by brain, by informaticians! We achieve interoperability simply because informaticians donate their brain-power: HUMAN COMPUTATION
A portion of the BioMoby Ontology …built from the brains of the community!
…so what can I do with it?
Analytical workflow Discovery
• No explicit coordination between providers
• Run-time discovery of appropriate tools
• Automated execution of those tools
The machine “understands” the data you have in-hand, and assists you in choosing the next step in your analysis.
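Run-time discovery of the next analysis step can be sketched as a lookup against a registry keyed by data-type. The example below is an illustration only: the type names, service names, and the rule that a service accepting a parent type also matches its child types are all assumptions, and none of this is the BioMoby API.

```python
# Hypothetical fragment of a data-type ontology: child -> parent (is_a).
IS_A = {
    "DNASequence": "Sequence",
    "ProteinSequence": "Sequence",
    "Sequence": "Object",
}

# Hypothetical registry entries: service name -> (input type, output type).
SERVICES = {
    "runBlast": ("Sequence", "BlastHit"),
    "translate": ("DNASequence", "ProteinSequence"),
    "designPrimers": ("DNASequence", "Primers"),
}


def lineage(data_type):
    """Return data_type followed by all of its ancestors in the ontology."""
    chain = [data_type]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def next_steps(data_type):
    """Names of services whose declared input is data_type or an ancestor of it."""
    accepted = set(lineage(data_type))
    return sorted(
        name for name, (consumes, _produces) in SERVICES.items()
        if consumes in accepted
    )


print(next_steps("DNASequence"))
print(next_steps("ProteinSequence"))
```

Because matching goes through the shared ontology, the three hypothetical providers never need to know about each other; a new service becomes discoverable as soon as it declares its input type.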
Interoperability through Human Computation: Individuals contributed their knowledge about bioinformatics data-types to a central ontology. Their combined knowledge enabled the construction of an interoperable framework.
Who uses BioMoby?
Usage Statistics: 15 Nations; >60 independent institutions; >1600 interoperable Bioinformatics Resources; ~500,000 requests for “brokering” each month
What have we learned? We can consume the brains of a large community… …to generate something complex, yet organized
Open Kimono: The BioMoby ontology is actually quite messy… …communal brains can build useful ontologies, but the problem is…
Ontologies are HARD!
How are ontologies usually constructed?
By small, hard-working, dedicated groups with lots of money!
• Gene Ontology & code – Curated: ~5 full-time staff – ~$25 Million (Lewis, S, personal communication)
• NCI Metathesaurus & code – Curated: ~12 full-time staff – ~$15 Million (Peter K., estimate)
• Health Level 7 (HL7) – Curated – $Lots… Some claim as much as $15 Billion (Smith, Barry, KBB Workshop, Montreal, 2005)
To build the global Semantic Web for Systems Biology we need to encode knowledge from EVERY domain of biology, from barley root apex structure and function to HIV clinical-trials outcomes… and this knowledge is constantly changing! At >$15M each, can we afford the Semantic Web???
Mmmm… Need MORE Brains!! iCAPTURer experiment
Dr. Bruce McManus with a human heart in his hands. He knows his hearts… …but he doesn’t know how to build an ontology
What we need
The Problem
The Solution?
So… how do we do it?
Remember what we learned from Moby… …communities CAN build ontologies!
Building Systems Biology Ontologies through Human Computation
iCAPTURer: Benjamin Good, Ph.D. Student, UBC Bioinformatics. Genome BC Better Biomarkers in Transplantation project, St. Paul’s Hospital iCAPTURE Centre
Old Way • KE drills the brain of one or a very few experts. • Painful, expensive, and time-consuming…
New Way? – the iCAPTURer • KE creates a clever interface • No direct interaction with expert • Thousands of experts • Cheap!
iCAPTURer 1.0
• Go to a scientific conference
• Text-mine conference abstracts
• Auto-extract concepts
• Put concepts into a series of question “templates”
• A web interface presents questions about these concepts to conference attendees
• Give points for every question they answer
• Give a prize to the highest point winner
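The template-and-points loop described above can be sketched in a few lines. This is a hedged illustration, not the iCAPTURer code: the concept list stands in for the text-mining step, the template wording is taken from the results slide, and scoring one point per non-empty answer is an assumption (the talk does not give the scoring rule).

```python
from collections import Counter

# Question templates as worded on the results slide.
TEMPLATES = [
    "Is {c} a meaningful term?",
    "What is a synonym for {c}?",
    "Where does {c} fit in the following tree of related terms?",
]


def make_questions(concepts):
    """Fill every template with every extracted concept."""
    return [t.format(c=c) for c in concepts for t in TEMPLATES]


def tally(answers):
    """Count points per attendee.

    answers: iterable of (attendee, question, response) tuples.
    Assumed scoring rule: one point for each non-empty response.
    """
    points = Counter()
    for attendee, _question, response in answers:
        if response:
            points[attendee] += 1
    return points


def prize_winner(points):
    """Attendee with the most points."""
    return max(points, key=points.get)


questions = make_questions(["STEMI", "cardiac myocyte"])
points = tally([
    ("ann", questions[0], "Yes"),
    ("bob", questions[0], "No"),
    ("ann", questions[1], "ST Elevated Myocardial Infarction"),
    ("bob", questions[1], ""),  # skipped question earns nothing
])
print(prize_winner(points))
```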
Results
Question template | Interface | Knowledge Points Captured
Is _____ a meaningful term? | Yes, No, I don’t know buttons | 464
What is a synonym for ______ | Text entry box | 340
Where does _____ fit in the following tree of related terms? | Clickable tree | 207
Total | | 1011
Observations Yes/No questions work well Text entry is less effective Adding to a tree is a disaster! Competition is a great motivator for human computation!
COST?
COST? < $15,000
iCAPTURer 1.5
• Start with hypothetical concept tree
• Put concept-concept relations into a series of true/false questions
• Make a web interface to ask questions
• If a relationship is false, then re-start at the root of the concept tree
• Give points for every question they answer
• Give a prize to the highest point winner
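One possible reading of the "re-start at the root" step is sketched below: every hypothesized parent-child edge becomes a true/false question, and a concept whose edge is rejected is walked down again from the root until no child is confirmed as its parent. The function names, the single-parent tree, and the `ask` callback (standing in for the web interface and the expert) are assumptions made for illustration; the talk does not specify the algorithm at this level.

```python
def edge_questions(tree):
    """Yield (child, parent) pairs, one true/false question per hypothesized edge."""
    for parent, children in tree.items():
        for child in children:
            yield child, parent


def place_from_root(concept, tree, root, ask):
    """Descend from the root while the expert confirms 'concept is a type of <child>'."""
    node = root
    while True:
        for child in tree.get(node, []):
            if child != concept and ask(concept, child):
                node = child
                break
        else:
            return node  # no child confirmed, so the concept belongs under node


def review(tree, root, ask):
    """Return a corrected copy of tree after asking about every edge."""
    corrected = {parent: list(children) for parent, children in tree.items()}
    for child, parent in list(edge_questions(tree)):
        if not ask(child, parent):
            corrected[parent].remove(child)
            new_parent = place_from_root(child, corrected, root, ask)
            corrected.setdefault(new_parent, []).append(child)
    return corrected


# Hypothetical starting tree with one wrong edge (myocyte under neuron).
tree = {
    "cell": ["cardiac cell", "neuron"],
    "cardiac cell": [],
    "neuron": ["cardiac myocyte"],
}
# Simulated expert: the set of "X is a type of Y" statements judged true.
truth = {
    ("cardiac cell", "cell"),
    ("neuron", "cell"),
    ("cardiac myocyte", "cell"),
    ("cardiac myocyte", "cardiac cell"),
}
corrected = review(tree, "cell", lambda child, parent: (child, parent) in truth)
print(corrected)
```

Only yes/no answers are needed from the expert, which matches the observation that true/false questions worked well while free placement in a tree did not.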
“Chatterbot”: “I’ve heard that a cardiac myocyte is a type of cardiac cell. Is this true?” “I’ve heard that STEMI means the same thing as ST Elevated Myocardial Infarction. Is that nonsense, or is it correct?” “How do you feel about your mother?”
Results: Knowledge capture in 3 days, >11,000 Concepts
COST $0
Full details of this experiment are available in: Proceedings of the Pacific Symposium on Biocomputing, 2006
Ontology Quality?
Potential Ontology Evaluation Metrics
• Domain independent
– philosophical desiderata: Manual, subjective
– graphical structure: Auto, questionable value
– satisfiability: Auto, useful, not enough
• Domain specific
– “Fit” to text: Auto, dependent on NLP
– Similarity to a gold standard: Auto/Manual; gold standard must exist!
– Task-based: Optimal! Auto/Manual, but not generalizable
“Good”???
What do we mean by “Good”? Ontology construction is “motivated by the goal of alignment not on concepts but on the universals in reality and thereby also on the corresponding instances” - Barry Smith. Reality should be the benchmark for the “goodness” of an ontology.
ontology evaluation based on referents in reality
Chosen Philosophical Principle: “Epistemology Precedes Ontology”
• A Class should refer to an invariant pattern of properties common among all its instances
– Mammals have mammary glands and hair
– Humans are an instance of the class Mammal
• Therefore…
– If class-instances are mapped into an ontology
– Each instance has “properties” or “qualities”
– These properties or qualities SHOULD segregate into different classes if the ontology is any good
Philosophical Desiderata (Cimino, J, 1998)
• Non-vagueness – at least one instance can exist with the Class pattern – Vague class: “mammalian cell wall”
• Non-ambiguity – no more than one common pattern per Class – Ambiguous class: “cell” (e.g. cell phone, jail cell)
• Non-redundancy – within the same level of granularity, no other class refers to same common properties – Redundant classes: “human”, “homo sapiens”
Realist Evaluation: Step 1. Table of Instance-Properties (Test one class at a time) [Diagram: classes A, B, C containing instances I.1 to I.4]
Instance | Char 1 | Char 2 | Class B?
I.1 | Y | N | Y
I.2 | Y | Y | Y
I.3 | N | … | N
I.4 | N | … | N
… | … | … | …
Realist Evaluation: Step 2. Machine Learning over the same Instance / Char 1 / Char 2 / Class B? table
Pattern | Class B score for this pattern
If char 1 = Y Then Class X | 100%
WEKA Produced by Waikato University in New Zealand An open source library containing implementations of hundreds of machine learning algorithms (rule learners, LDA, SVM, neural networks. . . )
Realist Evaluation: Class Score for Each Class [Diagram: the Instance / Char 1 / Char 2 / Class 1? table is evaluated once per class, giving each class in the ontology a score; example scores shown: 0.35, 0.92, 0.1]
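The per-class scoring can be illustrated with a deliberately tiny rule search. The talk uses WEKA for the machine-learning step; the sketch below does not, and instead swaps in a single-property rule learner written in plain Python so the idea is visible end to end. The accuracy-based score, the function names, and the Char 2 values for I.3 and I.4 (elided on the slide) are all assumptions.

```python
def best_rule(instances, labels, target):
    """Find the single 'property == value' test that best separates target.

    instances: list of dicts mapping property name -> value
    labels:    list of class names, parallel to instances
    Returns (score, (property, value)), where score is the fraction of
    instances for which 'rule fires' agrees with 'instance is in target'.
    """
    best_score, best_test = 0.0, None
    for prop in instances[0]:
        for value in {inst[prop] for inst in instances}:
            agree = sum(
                (inst[prop] == value) == (label == target)
                for inst, label in zip(instances, labels)
            )
            score = agree / len(instances)
            if score > best_score:
                best_score, best_test = score, (prop, value)
    return best_score, best_test


def class_scores(instances, labels):
    """Score every class by its best single-property rule."""
    return {c: best_rule(instances, labels, c)[0] for c in set(labels)}


# Instances I.1 to I.4; the last two Char 2 values are made up for illustration.
instances = [
    {"char1": "Y", "char2": "N"},
    {"char1": "Y", "char2": "Y"},
    {"char1": "N", "char2": "Y"},
    {"char1": "N", "char2": "N"},
]
labels = ["B", "B", "A", "A"]

print(best_rule(instances, labels, "B"))
print(class_scores(instances, labels))
```

Plain accuracy is a crude score when classes are very unbalanced; a real evaluation would presumably rely on the learner's own cross-validated statistics.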
Realist Evaluation - positive control
1. Identify an ontology that already has logical constraints on the properties of its classes.
2. Assemble instances that have those properties
3. Classify the instances with a reasoner
4. Remove class restrictions from the ontology, but keep instances assigned to their classes
5. Look for patterns of instance properties
6. If successful, patterns should be detected
7. The higher the pattern score, the “gooder” the ontology is
Positive Control: Phosphabase • An ontology describing different classes of phosphatase enzymes. • Given the domain composition of a protein, phosphatase class can be inferred automatically. Wolstencroft et al (2006) Protein classification using ontology classification. Bioinformatics, Vol. 22 no. 14, pages 530–538
Remove the Logical Rules • Remove the defining rules for each class • Maintain the classified instances • Execute the realist evaluation • Can we re-discover the patterns that the logical class-rules used to dictate?
Realist Evaluation Positive Control • 25 classes from phosphabase tested on 700 simulated protein instances • 21 - pattern correctly identified for 100% of instances • For 4 others, patterns identified covering 99, 92, 82% of instances respectively.
Realist Evaluation Positive Control • So the Phosphabase ontology is “good” • We can detect strong patterns of properties in its instances that follow the philosophical desiderata • This is unsurprising, since we knew that it was “good” in the first place…
Evaluation of Gene Ontology is ongoing…
Interesting side effect… Class-defining rules are generated by the realist evaluation. Most existing bio-ontologies lack formal class-definitions. This evaluation could be used to create such rules, i.e. automatic classifiers. It can also detect what TYPE of property is best “classified” by current bio-ontologies.
Is Realist Evaluation a valid metric? The realist evaluation measures the success of an ontology in classifying a specific set of properties. We claim that this is a metric relating to the quality of that ontology. Is this metric any better than other metrics like graph complexity, or fit-to-text?
Evaluating metrics
OntoLoki – Making mischief with Ontologies
1. Take an ontology that we claim is “good”
2. Make it worse by mischievously adding changes
3. Measure the degree of “mischief”
4. Run the evaluation metric of interest
5. Metric score should correlate with the amount of mischief added
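The five steps can be sketched as a small noise-injection experiment. This is an illustrative stand-in rather than the OntoLoki implementation: "mischief" here means moving a fixed number of instances to a wrong class, and the metric under test is a simple purity score (the fraction of instances that sit in the majority class of their property signature). Both choices, and all function names, are assumptions.

```python
import random
from collections import Counter, defaultdict


def add_mischief(labels, k, rng):
    """Return a copy of labels with k distinct instances moved to a wrong class."""
    classes = sorted(set(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), k):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


def purity(signatures, labels):
    """Fraction of instances assigned to the majority class of their signature."""
    groups = defaultdict(Counter)
    for signature, label in zip(signatures, labels):
        groups[signature][label] += 1
    agree = sum(max(counts.values()) for counts in groups.values())
    return agree / len(labels)


# A "good" toy ontology: property signatures separate the two classes perfectly.
signatures = [("Y",)] * 10 + [("N",)] * 10
labels = ["A"] * 10 + ["B"] * 10

rng = random.Random(0)  # fixed seed so the run is repeatable
noise_levels = (0, 2, 4)
scores = [purity(signatures, add_mischief(labels, k, rng)) for k in noise_levels]
print(list(zip(noise_levels, scores)))
```

A metric that passes this test falls steadily as mischief increases; a metric whose score stays flat or moves erratically is not tracking ontology quality.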
Comparison of ontology quality metrics [Plot: y-axis Measured Ontology Quality; x-axis Amount of noise added (ontology quality decreasing); curves: Good Metric, Bad Metric]
Is Reality Evaluation a good metric?
Let’s OntoLoki it to find out!
OntoLoki test of Realist Evaluation Metric: Good Metric! [Plot: y-axis Average Class Score; x-axis Noise Added (a measure of nodes affected)]
Conclusion
• Human computation can collect significant amounts of knowledge in an organized way
• OntoLoki seems to be effective at evaluating the evaluation metrics
• Reality evaluation is an interesting new metric for testing ontologies
Subjective iCAPTURer Observations: Humans had an EXTREMELY difficult time classifying concepts into preexisting categories. Humans had an EXTREMELY difficult time defining new categories and placing them into the existing classification system.
Classification is HARD!
Abandoning Classification
(briefly…)
An ontology is a representation of knowledge: classes, instances, properties, relationships. [Diagram: Animal, Mammal, Primate, Gorilla, Lemur, Human, Hair, linked by is_a and has; has_size with values Big, Small, Medium]
AN ontology is ONE representation of knowledge. Ontology of Anatomy [Diagram: the same graph, with additional Hair nodes; has_size with values Big, Small, Medium]
AN ontology is ONE representation of knowledge. Ontology of Habitat [Diagram: Animal, African_animal, Southern_African_animal, linked by is_a; lives with values mountain, Aquatic, plains, Africa] Also might want… Odour, # digits, bone density, friendliness, cuteness…
Clay Shirky: Ontology is Overrated…
• Attempts to predict the future – “Soviet Union” used to be a category in the Library of Congress
• Attempts mind-reading – Size, location, odour… Authors must predict what users are interested in
• Great minds don’t think alike… – No two people are likely to create the same ontology
http://www.shirky.com/writings/ontology_overrated.html
Categories Properties
BRAINS!! MORE BRAINS!! Mass Collaborative Tagging
Mass Open Social Tagging: A rapidly growing trend on the Web. Unstructured. Mass-collaboration. Anyone can say anything about anything using any words they wish.
Connotea: Scientific Tagging (Connotea is a product of Nature Publishing Group)
Connotea Growth
Tagging is EASY!
The Tagged World: Tagging is easy! Tagging costs nothing. Tagging empowers all viewpoints. Tagging is happening!!!!!!
Lexical Comparison of Tagging with Formal Indexing Systems and Ontologies
Ontology (FMA): FMA Preflabels (11) [Radar chart; axes: % OLP uniterms, % OLP duplets, OLP flexibility, % containedByAnother, Standard Deviation - Term Length, Skewness - Term Length, complements, compositions]
Ontology (GO Molecular Function): GO_MF (15) [Radar chart; same axes]
Ontology (GO Biological Process): GO_BP (13) [Radar chart; same axes]
Tagging (Bibsonomy): Bibsonomy (20) [Radar chart; same axes]
Tagging (CiteULike): CiteULike (22) [Radar chart; same axes]
Tagging (Connotea): Connotea (21) [Radar chart; same axes]
Ontologies and Folksonomies are fundamentally different! [Radar charts side by side: GO_MF (15) vs. Bibsonomy (20); same axes]
Problem?? Folksonomies and ontologies are fundamentally different! It may not be possible to derive one from the other accurately. Nevertheless, we would like to take advantage of tagging behaviour while gaining the power of controlled vocabularies/Ontologies.
E.D.: The Entity Describer
Connotea tagging: User types in all tags. Type-ahead displays previously used tags.
Connotea + E.D. Tagging
Leveraging Tagging? “Tagging” effectively assigns properties to entities. ED Tagging constrains those properties to a controlled vocabulary or ontology. Can we discover patterns in those properties that indicate a “natural” classification system? Can a “realist-evaluation” generate logical rules that define classes based on patterns of tags?
Final Thoughts: Ontologies are important, but hard to build. iCAPTURer: formal, template-based, cost-free consumption of biologists’ brains seems to work! Informal annotation (tagging) is cheap, easy, and scalable, and is HAPPENING. Can we leverage tagging to create ontology-like structures? Maybe… Maybe not!
My journey back to Web Services
Why do I care about WS so passionately?
The Deep Web: All the data and knowledge only accessible through Web Forms. Estimated to be orders of magnitude greater than the “surface Web”: 91,000 Terabytes in the Deep Web vs. 167 Terabytes in the Surface Web. Much of the Deep Web CANNOT be represented on the Semantic Web since it DOES NOT EXIST until the Web Form is accessed.
Moby 2.0 and CardioSHARE: Merging the Deep Web and the Semantic Web
What Web Services do [Diagram: Sequence Data → BLAST SERVICE → Blast Hit]
What BioMoby does [Diagram: Sequence Data, “Want Blast” → MOBY → BLAST SERVICE → Blast Hit]
The implied relationship between input and output: Sequence Data givesBlastResult Blast Hit. Not “Biologically” Meaningful
The implied biological relationship between input and output: Sequence Data hasHomologyTo Blast Hit …looks a lot like the RDF statement… URI hasHomologyTo URI
To merge Web Services and the Semantic Web… …Simply assert the relationship and let Moby do the rest!
Start with a partial Triple: URI (rdf:type Sequence) hasHomologyTo …
What Moby 2.0 Does [Diagram: the partial triple, URI (rdf:type Sequence) hasHomologyTo, goes to the Moby 2.0 Predicate-to-Service Translator: “Need Service consuming rdf:type Sequence”; “hasHomologyTo property provided by BLAST Web services”; MOBY invokes the BLAST SERVICE, which returns the Blast Hit]
Moby 2.0 Query: FIND SERVICES THAT Consume Sequence Data; Provide hasHomologyTo Property; Attached to other Sequence Data
Moby 2.0 extends SPARQL: queries contain concepts and relationships of interest. Map RDF predicates onto Moby services capable of generating them. Registry query: “What Moby service consumes [subject] and generates the [predicate] relationship type?”
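The registry query quoted above amounts to a lookup keyed on (subject type, predicate), followed by invoking whatever services match in order to generate the missing objects of the triple. The sketch below illustrates that idea under stated assumptions: `Service`, `REGISTRY`, `find_services`, `complete`, and the fake BLAST function are hypothetical names invented for this example and are not the Moby 2.0 or BioMoby API.

```python
from typing import Callable, List, NamedTuple, Tuple


class Service(NamedTuple):
    name: str
    consumes: str                    # rdf:type of the subject the service accepts
    generates: str                   # predicate its output attaches to the subject
    run: Callable[[str], List[str]]  # subject URI -> list of object URIs


def fake_blast(uri: str) -> List[str]:
    """Stand-in for a remote BLAST call; returns made-up hit URIs."""
    return [uri + "#hit1", uri + "#hit2"]


# Hypothetical registry with a single entry.
REGISTRY = [Service("blastService", "Sequence", "hasHomologyTo", fake_blast)]


def find_services(subject_type: str, predicate: str) -> List[Service]:
    """Which services consume subject_type and generate predicate?"""
    return [
        s for s in REGISTRY
        if s.consumes == subject_type and s.generates == predicate
    ]


def complete(subject_uri: str, subject_type: str,
             predicate: str) -> List[Tuple[str, str, str]]:
    """Turn a partial triple (subject, predicate, ?) into full triples."""
    return [
        (subject_uri, predicate, obj)
        for service in find_services(subject_type, predicate)
        for obj in service.run(subject_uri)
    ]


triples = complete("urn:example:seq1", "Sequence", "hasHomologyTo")
print(triples)
```

The triples do not exist anywhere until the query asks for them, which is the sense in which this approach exposes Deep Web data to a Semantic Web query.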
But wait, there’s more!
CardioSHARE: Exploit knowledge in OWL-DL ontologies to enhance query [Diagram: evaluate the query expression (subject and predicates); look up and execute a Moby service that consumes STKs and provides the inhibitor property; look up and execute a Moby service that consumes proteins and generates the functional annotation property]
CardioSHARE: Exploit knowledge in OWL-DL ontologies to enhance query. This SPARQL query could be posed on a database of RAW, UNANNOTATED Protein sequences, and be answered by Moby 2.0
What do Moby 2.0 and CardioSHARE achieve?
• Makes the Deep Web transparently accessible as if it were a Semantic Web Resource
• Allows SPARQL to do truly semantic queries!
• Reduces the requirement of Biologists to know how/where to get their data of interest
• Simplifies construction of complex analytical pipelines by automating much of the discovery/execution tasking
Ontology Spectrum: Catalog/ID → Terms/glossary → Thesauri (“narrower term” relation) → Informal is-a → Formal is-a → Formal instance → Frames (properties) → Value Restrs. → Selected Logical Constraints (disjointness, inverse, …) → General Logical constraints. Originally from AAAI 1999 - Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
Fin