
Feasting on Brains! From Web Services to Web 2.0 to the Semantic Web and back again… A personal journey through the Semantic Web and Web Services for Health Care and Life Sciences. Mark Wilkinson (markw@illuminae.com), Assistant Professor, Medical Genetics, University of British Columbia; Heart and Lung Research Institute at St. Paul’s Hospital

Benjamin Good (He’s a “Creep”!)

Approach: “Bioinformatics” is a broad field and suffers SEVERE interoperability problems. “Bioinformaticians” tend to be specialists in a particular domain of computational analysis. As a group, the brains of all bioinformaticians contain all (known) bioinformatics. Is it possible to extract the knowledge required for interoperability from the brains of bioinformaticians en masse?

“Human Computation” (Luis von Ahn)

Ontology Spectrum (from least to most formal): Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints. Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

An ontology is a representation of knowledge: classes, instances, properties, relationships. [Diagram: Animal, Mammal, Primate, Zombie, Lemur, Human, Hair, Brains, Shoots, Chips, linked by has, is_a, eats]

Classes: Animal, Mammal, Hair, Primate, Zombie, Brains, Lemur, Shoots, Human, Chips

Instances

Properties: has, is_a, eats

Relations: has, is_a, eats

An ontology is a representation of knowledge: classes, instances, properties, relationships. [Diagram, with instances added: Animal, Mammal, Primate, Zombie, Lemur, Human, Hair, Brains, Shoots, Chips, linked by has, is_a, eats]
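The classes, properties and relations on this slide can be sketched as a small set of subject-predicate-object triples. This is only an illustration: the slide's arrows are not recoverable from the text, so every edge below (which class is_a which, who eats what) is an assumption, and the helper names `ancestors` and `values` are hypothetical.

```python
# Toy ontology as (subject, predicate, object) triples.
# All edges are assumed for illustration; the slide only lists the node names.
TRIPLES = [
    ("Mammal", "is_a", "Animal"),
    ("Primate", "is_a", "Mammal"),
    ("Lemur", "is_a", "Primate"),
    ("Human", "is_a", "Primate"),
    ("Zombie", "is_a", "Primate"),
    ("Mammal", "has", "Hair"),
    ("Zombie", "eats", "Brains"),
    ("Lemur", "eats", "Shoots"),
    ("Human", "eats", "Chips"),
]


def ancestors(cls, triples=TRIPLES):
    """Follow is_a links upward (single inheritance assumed)."""
    chain = []
    current = cls
    while True:
        parents = [o for s, p, o in triples if s == current and p == "is_a"]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)


def values(cls, predicate, triples=TRIPLES):
    """Collect property values asserted on a class or inherited from its ancestors."""
    found = set()
    for c in [cls] + ancestors(cls, triples):
        found |= {o for s, p, o in triples if s == c and p == predicate}
    return found


print(ancestors("Lemur"))      # chain of superclasses
print(values("Human", "has"))  # Hair is inherited from Mammal
```

The point of the sketch is that "has Hair" is stated once, on Mammal, and every subclass picks it up through is_a.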

Web Service? A software tool that is accessible over the Web. Services are intended to be accessed by machines, not people.

Interoperability? The ability of two Web Services to exchange information, and use that information correctly. This generally requires Semantics in the form of Ontologies…

Mmmm… Brains!! BioMoby: Eating brains to enable Web Service Interoperability

What does BioMoby do?

The BioMoby Plan
• Create an ontology of bioinformatics data-types
• Define an ontology of bioinformatics operations
• Open these ontologies for community input
• Define Web Services v.v. these two ontologies
– A Machine can find an appropriate service
– A Machine can execute that service unattended
– Ontology is community-extensible
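The plan above, services described against a data-type ontology and an operation ontology so that a machine can find them, can be sketched as follows. This is not the BioMoby API: the data types, service names, and the `find_services` function are all hypothetical, and the registry is just an in-memory list.

```python
# Hypothetical data-type ontology: child -> parent.
DATA_TYPES = {
    "Object": None,
    "Sequence": "Object",
    "DNASequence": "Sequence",
    "AminoAcidSequence": "Sequence",
}

# Hypothetical registry: each service declares what it consumes and what it does.
SERVICES = [
    {"name": "runBlast", "consumes": "Sequence", "operation": "Alignment"},
    {"name": "designPrimers", "consumes": "DNASequence", "operation": "PrimerDesign"},
    {"name": "translate", "consumes": "DNASequence", "operation": "Translation"},
]


def lineage(data_type):
    """Return the data type followed by all of its ancestors in the ontology."""
    out = []
    while data_type is not None:
        out.append(data_type)
        data_type = DATA_TYPES[data_type]
    return out


def find_services(data_type, operation=None):
    """Find services that accept this data type or any of its ancestor types."""
    accepted = set(lineage(data_type))
    return [
        s["name"]
        for s in SERVICES
        if s["consumes"] in accepted
        and (operation is None or s["operation"] == operation)
    ]


print(find_services("AminoAcidSequence"))
print(find_services("DNASequence", operation="Alignment"))
```

Because a DNASequence is_a Sequence in the ontology, a service registered for Sequence is discovered for DNA data without its provider ever having coordinated with the data's provider.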

Overview of BioMoby Semantic Interoperability. [Diagram: MOBY hosts & services (Alignment, Sequence, Gene names, Sequence Align, Express., Phylogeny, Protein, Primers, Alleles, …) registered with MOBY Central]

Why couldn’t we do this before?

Interoperability is HARD!

Interoperability through Human Computation. BioMoby Data Type Ontology: an explicit list of all biological data-types, and the relationships between them. Ontology built, brain by brain, by informaticians! We achieve interoperability simply because informaticians donate their brain-power: HUMAN COMPUTATION

A portion of the BioMoby Ontology …built from the brains of the community!

…so what can I do with it?

Analytical workflow Discovery: No explicit coordination between providers. Run-time discovery of appropriate tools. Automated execution of those tools. The machine “understands” the data you have in-hand, and assists you in choosing the next step in your analysis.

Interoperability through Human Computation: Individuals contributed their knowledge about bioinformatics data-types to a central ontology. Their combined knowledge enabled the construction of an interoperable framework.

Who uses BioMoby?

Usage Statistics: 15 Nations; > 60 independent institutions; > 1600 interoperable Bioinformatics Resources; ~500,000 requests for “brokering” each month

What have we learned? We can consume the brains of a large community… …to generate something complex, yet organized

Open Kimono: The BioMoby ontology is actually quite messy… …communal brains can build useful ontologies, but the problem is…

Ontologies are HARD!

How are ontologies usually constructed?

By small, hard-working, dedicated groups with lots of money!
• Gene Ontology & code
– Curated: ~5 full-time staff
– ~$25 Million (Lewis, S., personal communication)
• NCI Metathesaurus & code
– Curated: ~12 full-time staff
– ~$15 Million (Peter K., estimate)
• Health Level 7 (HL7)
– Curated
– $Lots… Some claim as much as $15 Billion (Smith, Barry, KBB Workshop, Montreal, 2005)

To build the global Semantic Web for Systems Biology we need to encode knowledge from EVERY domain of biology, from barley root apex structure and function, to HIV clinical-trials outcomes… and this knowledge is constantly changing! At >$15 M each, can we afford the Semantic Web???

Mmmm… Need MORE Brains!! iCAPTURer experiment

Dr. Bruce McManus with a human heart in his hands. He knows his hearts… …but he doesn’t know how to build an ontology

What we need

The Problem

The Solution?

So… how do we do it?

Remember what we learned from Moby… …communities CAN build ontologies!

Building Systems Biology Ontologies through Human Computation

iCAPTURer: Benjamin Good, Ph.D. Student, UBC Bioinformatics. Genome BC Better Biomarkers in Transplantation project, St. Paul’s Hospital iCAPTURE Centre

Old Way
• KE (knowledge engineer) drills the brain of one or a very few experts.
• Painful, expensive, and time-consuming…

New Way? The iCAPTURer
• KE creates a clever interface
• No direct interaction with expert
• Thousands of experts
• Cheap!

iCAPTURer 1.0: Go to a scientific conference. Text-mine conference abstracts. Auto-extract concepts. Put concepts into a series of question “templates”. A web interface presents questions about these concepts to conference attendees. Give points for every question they answer. Give a prize to the highest point winner.

Results (Knowledge Points Captured)
Is _____ a meaningful term? (Yes, No, I don’t know buttons): 464
What is a synonym for ______? (Text entry box): 340
Where does _____ fit in the following tree of related terms? (Clickable tree): 207
Total: 1011

Observations: Yes/No questions work well. Text entry is less effective. Adding to a tree is a disaster! Competition is a great motivator for human computation!

COST?

COST? < $15,000

iCAPTURer 1.5

Start with a hypothetical concept tree. Put concept-concept relations into a series of true/false questions. Make a web interface to ask the questions. If a relationship is false, then re-start at the root of the concept tree. Give points for every question they answer. Give a prize to the highest point winner.
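The question loop described above can be sketched roughly as follows. This is one possible reading of the procedure, not the iCAPTURer code: each question is found by walking down from the root through relations already confirmed, so a rejected relation sends the walk back to the root and nothing beneath it is asked. The tree, the `answer` callback, and both function names are hypothetical.

```python
def next_question(tree, node, verdicts):
    """Walk down from node through confirmed relations to the first unasked one."""
    for child in tree.get(node, []):
        key = (child, node)
        if key not in verdicts:
            return key
        if verdicts[key]:  # only descend below relations judged true
            deeper = next_question(tree, child, verdicts)
            if deeper is not None:
                return deeper
    return None


def run_session(tree, root, answer, max_questions=100):
    """Ask 'is child a kind of parent?' questions; one point per answer."""
    verdicts = {}
    points = 0
    while points < max_questions:
        question = next_question(tree, root, verdicts)
        if question is None:
            break
        child, parent = question
        verdicts[question] = answer(child, parent)
        points += 1
    return verdicts, points


# Hypothetical concept tree (parent -> children) and a simulated expert.
tree = {
    "cell": ["cardiac cell", "neuron"],
    "cardiac cell": ["cardiac myocyte", "hepatocyte"],
    "hepatocyte": ["stellate cell"],
}
truth = {
    ("cardiac cell", "cell"),
    ("neuron", "cell"),
    ("cardiac myocyte", "cardiac cell"),
}
verdicts, points = run_session(tree, "cell", lambda c, p: (c, p) in truth)
print(points, verdicts)
```

In this run the expert rejects "hepatocyte is a cardiac cell", so the question about "stellate cell" below it is never posed.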

“Chatterbot”: “I’ve heard that a cardiac myocyte is a type of cardiac cell. Is this true?” “I’ve heard that STEMI means the same thing as ST Elevated Myocardial Infarction. Is that nonsense, or is it correct?” “How do you feel about your mother?”

Results: Knowledge capture in 3 days: >11,000 Concepts

COST $0

Full details of this experiment are available in: Proceedings of the Pacific Symposium on Biocomputing, 2006

Ontology Quality?

Potential Ontology Evaluation Metrics
• Domain independent
– philosophical desiderata: Manual, subjective
– graphical structure: Auto, questionable value
– satisfiability: Auto, useful, not enough
• Domain specific
– “Fit” to text: Auto, dependent on NLP
– Similarity to a gold standard: Auto/Manual; gold standard must exist!
– Task-based: Optimal! Auto/Manual, but not generalizable

“Good”???

What do we mean by “Good”? Ontology construction is “motivated by the goal of alignment not on concepts but on the universals in reality and thereby also on the corresponding instances” (Barry Smith). Reality should be the benchmark for the “goodness” of an ontology.

ontology evaluation based on referents in reality

Chosen Philosophical Principle: “Epistemology Precedes Ontology”
• A Class should refer to an invariant pattern of properties common among all its instances
– Mammals have mammary glands and hair
– Humans are an instance of the class Mammal
• Therefore…
– If class-instances are mapped into an ontology
– Each instance has “properties” or “qualities”
– These properties or qualities SHOULD segregate into different classes if the ontology is any good

Philosophical Desiderata
• Non-vagueness: at least one instance can exist with the Class pattern
– Vague class: “mammalian cell wall”
• Non-ambiguity: no more than one common pattern per Class
– Ambiguous class: “cell” (e.g. cell phone, jail cell)
• Non-redundancy: within the same level of granularity, no other class refers to same common properties
– Redundant classes: “human”, “homo sapiens”
Cimino, J, 1998

Realist Evaluation: Step 1. Table of Instance-Properties (test one class at a time). [Diagram: classes A, B, C with instances I.1 to I.4 assigned to them]
Instance   Char 1   Char 2   Class B?
I.1        Y        N        Y
I.2        Y        Y        Y
I.3        N        ...      N
I.4        N        ...      N
...

Realist Evaluation: Step 2. Machine Learning. The instance table (Instance, Char 1, Char 2, Class B?) is given to a learner, which returns a pattern and a class score for that pattern. Pattern: If char 1 = Y Then Class X. Score: 100%
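The step above can be illustrated with the simplest possible learner: try every rule of the form "characteristic = value" and keep the one that best predicts class membership. This is a stand-in for the rule learners mentioned in the talk, not the method actually used; the function name is hypothetical, and the Char 2 values for I.3 and I.4 are made up because the slide elides them.

```python
def best_single_rule(rows, label):
    """Return (feature, value, score) for the best rule 'feature == value => in class'.

    score is the fraction of instances whose membership the rule predicts correctly.
    """
    features = [k for k in rows[0] if k != label]
    best = None
    for feature in features:
        for value in sorted({r[feature] for r in rows}):
            hits = sum((r[feature] == value) == r[label] for r in rows)
            score = hits / len(rows)
            if best is None or score > best[2]:
                best = (feature, value, score)
    return best


# Instance table from the slide; char2 for I.3 and I.4 is assumed.
rows = [
    {"char1": "Y", "char2": "N", "in_class_B": True},   # I.1
    {"char1": "Y", "char2": "Y", "in_class_B": True},   # I.2
    {"char1": "N", "char2": "Y", "in_class_B": False},  # I.3 (char2 assumed)
    {"char1": "N", "char2": "N", "in_class_B": False},  # I.4 (char2 assumed)
]

feature, value, score = best_single_rule(rows, "in_class_B")
print(f"If {feature} = {value} then Class B ({score:.0%})")
```

A class whose instances share no such pattern would only reach a score near chance, which is what makes the score usable as a quality signal for the class.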

WEKA: Produced by the University of Waikato in New Zealand. An open source library containing implementations of hundreds of machine learning algorithms (rule learners, LDA, SVM, neural networks...)

Realist Evaluation: Class Score for Each Class. [Diagram: the instance table (Instance, Char 1, Char 2, Class 1?) is evaluated once per class, giving each class in the ontology a score, e.g. 0.35, 0.92, 0.1]

Realist Evaluation - positive control
1. Identify an ontology that already has logical constraints on the properties of its classes.
2. Assemble instances that have those properties
3. Classify the instances with a reasoner
4. Remove class restrictions from the ontology, but keep instances assigned to their classes
5. Look for patterns of instance properties
6. If successful, patterns should be detected
7. The higher the pattern score, the “gooder” the ontology is

Positive Control: Phosphabase
• An ontology describing different classes of phosphatase enzymes.
• Given the domain composition of a protein, phosphatase class can be inferred automatically.
Wolstencroft et al (2006) Protein classification using ontology classification. Bioinformatics, Vol. 22 no. 14, pages 530–538

Remove the Logical Rules
• Remove the defining rules for each class
• Maintain the classified instances
• Execute the realist evaluation
• Can we re-discover the patterns that the logical class-rules used to dictate?

Realist Evaluation Positive Control
• 25 classes from Phosphabase tested on 700 simulated protein instances
• For 21 classes, the pattern was correctly identified for 100% of instances
• For 4 others, patterns identified covering 99, 92, 82% of instances respectively.

Realist Evaluation Positive Control
• So the Phosphabase ontology is “good”
• We can detect strong patterns of properties in its instances that follow the philosophical desiderata
• This is unsurprising, since we knew that it was “good” in the first place…

Evaluation of Gene Ontology is ongoing…

Interesting side effect… Class-defining rules are generated by the realist evaluation. Most existing bio-ontologies lack formal class-definitions. This evaluation could be used to create such rules (automatic classifiers). It can also detect what TYPE of property is best “classified” by current bio-ontologies.

Is Realist Evaluation a Valid metric? The realist evaluation measures the success of an ontology in classifying a specific set of properties. We claim that this is a metric relating to the quality of that ontology. Is this metric any better than other metrics like graph complexity, or fit-to-text?

Evaluating metrics

OntoLoki: Making mischief with Ontologies
1. Take an ontology that we claim is “good”
2. Make it worse by mischievously adding changes
3. Measure the degree of “mischief”
4. Run the evaluation metric of interest
5. Metric score should correlate with the amount of mischief added
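The five steps above can be sketched on a toy example. Everything here is an assumption made for illustration: the "ontology" is just three classes whose instances are perfectly separated by one property, the mischief is re-assigning k instances to a wrong class (the talk perturbs the ontology itself), and the metric is a simplified average class score. None of the names come from OntoLoki.

```python
import random

CLASSES = ["A", "B", "C"]


def class_score(instances, cls):
    """Best accuracy of any rule 'domain == v' at predicting membership of cls."""
    values = sorted({inst["domain"] for inst in instances})
    return max(
        sum((inst["domain"] == v) == (inst["cls"] == cls) for inst in instances)
        / len(instances)
        for v in values
    )


def average_class_score(instances):
    """The metric under test: mean of the per-class scores."""
    return sum(class_score(instances, c) for c in CLASSES) / len(CLASSES)


def add_mischief(instances, k, rng):
    """Return a copy in which k distinct instances are moved to a wrong class."""
    noisy = [dict(inst) for inst in instances]
    for inst in rng.sample(noisy, k):
        inst["cls"] = rng.choice([c for c in CLASSES if c != inst["cls"]])
    return noisy


# Step 1: a "good" starting point, 10 instances per class, property matches class.
clean = [{"domain": c.lower(), "cls": c} for c in CLASSES for _ in range(10)]

# Steps 2-4: add increasing mischief and re-run the metric each time.
rng = random.Random(0)
curve = [(k, average_class_score(add_mischief(clean, k, rng))) for k in (0, 3, 6, 9)]
for k, score in curve:
    print(k, round(score, 3))
```

Step 5 would then check that the scores fall as k rises; a metric whose curve stays flat under added mischief is not measuring quality.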

Comparison of ontology quality metrics. [Plot: Measured Ontology Quality (y-axis) against Amount of noise added, i.e. ontology quality decreasing (x-axis), with one curve for a Good Metric and one for a Bad Metric]

Is Reality Evaluation a good metric?

Let’s OntoLoki it to find out!

OntoLoki test of Realist Evaluation Metric. [Plot: Average Class Score (y-axis) against Noise Added, a measure of nodes affected (x-axis). Good Metric!]

Conclusion: Human computation can collect significant amounts of knowledge in an organized way. OntoLoki seems to be effective at evaluating the evaluation metrics. Reality evaluation is an interesting new metric for testing ontologies.

Subjective iCAPTURer Observations: Humans had an EXTREMELY difficult time classifying concepts into preexisting categories. Humans had an EXTREMELY difficult time defining new categories and placing them into the existing classification system.

Classification is HARD!

Abandoning Classification

(briefly…)

An ontology is a representation of knowledge: classes, instances, properties, relationships. [Diagram: Animal, Mammal, Primate, Gorilla, Lemur, Human, Hair, and sizes Big, Medium, Small, linked by has, is_a, has_size]

AN ontology is ONE representation of knowledge. [Diagram: Ontology of Anatomy: Animal, Mammal, Primate, Gorilla, Lemur, Human, Hair, and sizes Big, Medium, Small, linked by has, is_a, has_size]

AN ontology is ONE representation of knowledge. [Diagram: Ontology of Habitat: Animal, African_animal, Southern_African_animal, and habitats mountain, Aquatic, plains, Africa, linked by lives, is_a] Also might want… Odour, # digits, bone density, friendliness, cuteness…

Clay Shirky: Ontology is Overrated…
• Attempts to predict the future
– “Soviet Union” used to be a category in the Library of Congress
• Attempts mind-reading
– Size, location, odour… Authors must predict what users are interested in
• Great minds don’t think alike…
– No two people are likely to create the same ontology
http://www.shirky.com/writings/ontology_overrated.html

Categories Properties

BRAINS!! MORE BRAINS!! Mass Collaborative Tagging

Mass Open Social Tagging: A rapidly growing trend on the Web. Unstructured. Mass-collaboration. Anyone can say anything about anything using any words they wish.

Connotea: Scientific Tagging (Connotea is a product of Nature Publishing Group)

Connotea Growth

Tagging is EASY!

The Tagged World: Tagging is easy! Tagging costs nothing. Tagging empowers all viewpoints. Tagging is happening!!!!!!

Lexical Comparison of Tagging with Formal Indexing Systems and Ontologies

Ontology (FMA): FMA Preflabels (11). [Radar chart over the lexical metrics: % OLP uniterms, % OLP duplets, OLP flexibility, % containedByAnother, Standard Deviation - Term Length, Skewness - Term Length, complements, compositions]

Ontology (GO Molecular Function): GO_MF (15). [Radar chart over the same lexical metrics]

Ontology (GO Biological Process): GO_BP (13). [Radar chart over the same lexical metrics]

Tagging (Bibsonomy): Bibsonomy (20). [Radar chart over the same lexical metrics]

Tagging (CiteULike): CiteULike (22). [Radar chart over the same lexical metrics]

Tagging (Connotea): Connotea (21). [Radar chart over the same lexical metrics]

Ontologies and Folksonomies are fundamentally different! [Side-by-side radar charts: GO_MF (15) and Bibsonomy (20), over % OLP uniterms, % OLP duplets, OLP flexibility, % containedByAnother, Standard Deviation - Term Length, Skewness - Term Length, complements, compositions]

Problem?? Folksonomies and ontologies are fundamentally different! It may not be possible to derive one from the other accurately. Nevertheless, we would like to take advantage of tagging behaviour while gaining the power of controlled vocabularies/Ontologies.

E.D.: The Entity Describer

Connotea tagging: User types in all tags. Type-ahead displays previously used tags.

Connotea + E.D. Tagging

Leveraging Tagging? “Tagging” effectively assigns properties to entities. E.D. Tagging constrains those properties to a controlled vocabulary or ontology. Can we discover patterns in those properties that indicate a “natural” classification system? Can a “realist-evaluation” generate logical rules that define classes based on patterns of tags?

Final Thoughts: Ontologies are important, but hard to build. iCAPTURer: formal, template-based, cost-free consumption of biologists’ brains seems to work! Informal annotation (tagging) is cheap, easy, and scalable, and is HAPPENING. Can we leverage tagging to create ontology-like structures? Maybe… Maybe not!

My journey back to Web Services

Why do I care about WS so passionately?

The Deep Web: All the data and knowledge only accessible through Web Forms. Estimated to be orders of magnitude greater than the “surface Web”: 91,000 Terabytes in the Deep Web, 167 Terabytes in the Surface Web. Much of the Deep Web CANNOT be represented on the Semantic Web since it DOES NOT EXIST until the Web Form is accessed.

Moby 2.0 and CardioSHARE: Merging the Deep Web and the Semantic Web

What Web Services do. [Diagram: Sequence Data goes into the BLAST SERVICE, which returns a Blast Hit]

What BioMoby does. [Diagram: Sequence Data plus “Want Blast” goes to MOBY, which finds the BLAST SERVICE, which returns a Blast Hit]

The implied relationship between input and output: Sequence Data givesBlastResult Blast Hit. Not “Biologically” Meaningful

The implied biological relationship between input and output: Sequence Data hasHomologyTo Blast Hit …looks a lot like the RDF statement… URI hasHomologyTo URI

To merge Web Services and the Semantic Web… …Simply assert the relationship and let Moby do the rest!

Start with a partial Triple: URI (rdf:type Sequence) hasHomologyTo [unknown]

What Moby 2.0 Does. [Diagram: the partial triple, URI (rdf:type Sequence) hasHomologyTo, is passed to Moby 2.0. Its Predicate to Service Translator asks MOBY: need a Web Service consuming rdf:type Sequence and providing the hasHomologyTo property. MOBY returns the BLAST SERVICE, whose output completes the triple with the Blast Hit.]

Moby 2.0 Query: FIND SERVICES THAT consume Sequence Data, provide the hasHomologyTo property, attached to other Sequence Data

Moby 2.0 extends SPARQL. Queries contain concepts and relationships of interest. Map RDF predicates onto Moby services capable of generating them. Registry query: “What Moby service consumes [subject] and generates the [predicate] relationship type?”
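The registry query quoted above can be sketched as a lookup keyed on the subject's type and the predicate, followed by a service call whose results become the objects of new triples. This is a toy model, not Moby 2.0: the registry entries, the `fake_blast` stand-in, the `urn:example:` URIs, and both function names are hypothetical.

```python
# Hypothetical registry: which service consumes which type and generates which predicate.
REGISTRY = [
    {"service": "blast", "consumes": "Sequence", "generates": "hasHomologyTo"},
    {"service": "annotate", "consumes": "Protein", "generates": "hasFunctionalAnnotation"},
]


def fake_blast(subject_uri):
    """Stand-in for a remote BLAST service; returns made-up hit URIs."""
    return [subject_uri + "/hit1", subject_uri + "/hit2"]


# Hypothetical local bindings from service names to callables.
IMPLEMENTATIONS = {"blast": fake_blast}


def find_services(subject_type, predicate):
    """Registry query: what service consumes [subject] and generates [predicate]?"""
    return [
        entry["service"]
        for entry in REGISTRY
        if entry["consumes"] == subject_type and entry["generates"] == predicate
    ]


def complete_triple(subject_uri, subject_type, predicate):
    """Fill in the missing objects of (subject, predicate, ?) by running services."""
    triples = []
    for name in find_services(subject_type, predicate):
        service = IMPLEMENTATIONS.get(name)
        if service is None:
            continue  # registered but not callable in this sketch
        for obj in service(subject_uri):
            triples.append((subject_uri, predicate, obj))
    return triples


for triple in complete_triple("urn:example:seq1", "Sequence", "hasHomologyTo"):
    print(triple)
```

The triples do not exist anywhere until the query asks for them, which is the sense in which the service output is made to look like Semantic Web data.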

But wait, there’s more!

CardioSHARE: Exploit knowledge in OWL-DL ontologies to enhance query. [Diagram: Evaluate Query Expression (Subject, Predicate, Predicate). For one predicate: look up and execute Moby service that consumes STKs and provides inhibitor property. For the other: look up and execute Moby service that consumes proteins and generates Functional annotation property.]

CardioSHARE: Exploit knowledge in OWL-DL ontologies to enhance query. This SPARQL query could be posed on a database of RAW, UNANNOTATED Protein sequences, and be answered by Moby 2.0

What do Moby 2.0 and CardioSHARE achieve? Makes the Deep Web transparently accessible as if it were a Semantic Web Resource. Allows SPARQL to do truly semantic queries! Reduces the requirement of Biologists to know how/where to get their data of interest. Simplifies construction of complex analytical pipelines by automating much of the discovery/execution tasking.

Ontology Spectrum (from least to most formal): Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints. Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

Fin