- Number of slides: 76
EECS 800 Research Seminar: Mining Biological Data
Instructor: Luke Huan
Fall 2006, The University of Kansas
Administrative
• Class presentation schedule is online
• First class presentation is “kernel-based classification” by Han Bin on Nov 6th
• Project design is due Oct 30th
10/16/2006 Text Mining Biological Data KU EECS 800, Luke Huan, Fall ’06 slide 2
Overview
• Gene ontology
  • Challenges
  • What is gene ontology
  • Construct gene ontology
• Text mining, natural language processing and information extraction: an introduction
• Summary
Ontology
• <philosophy> A systematic account of Existence.
• <artificial intelligence> (From philosophy) An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
• <information science> The hierarchical structuring of knowledge about things by subcategorising them according to their essential (or at least relevant and/or cognitive) qualities. This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices.
• The philosophy of indexing everything in existence?
Aristotle’s (384-322 BC) Ontology
• Substance (plants, animals, ...)
• Quality
• Quantity
• Relation
• Where
• When
• Position
• Having
• Action
• Passion
Ontology and -informatics
In information sciences, ontology is better defined as:
• “a domain of knowledge, represented by facts and their logical connections, that can be understood by a computer” (J. Bard, BioEssays, 2003)
• “Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing” (Gruber, 1993)
Information Exchange in Bio-sciences
• Basic challenges: definition, definition
  • What is a name?
  • What is a function?
Cell [five image-only slides; image from http://microscopy.fsu.edu]
What’s in a name?
• The same name can be used to describe different concepts
What’s in a name?
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism
• Gluconeogenesis
All refer to the process of making glucose from simpler components
What’s in a name?
• The same name can be used to describe different concepts
• A concept can be described using different names
• Comparison is difficult, in particular across species or across databases
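The naming problem above can be sketched as a lookup from free-text names to one canonical label. This is a minimal illustration only: the five names come from the slide, while the choice of canonical label and the helper names `normalize` and `same_concept` are assumptions made for the example.

```python
# Hypothetical synonym table: the five names are from the slide,
# the canonical label is an illustrative choice.
CANONICAL = "gluconeogenesis"
SYNONYMS = {
    "glucose synthesis": CANONICAL,
    "glucose biosynthesis": CANONICAL,
    "glucose formation": CANONICAL,
    "glucose anabolism": CANONICAL,
    "gluconeogenesis": CANONICAL,
}


def normalize(name: str) -> str:
    """Map a free-text process name to its canonical term, if known."""
    key = " ".join(name.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key, key)


def same_concept(a: str, b: str) -> bool:
    """Two names denote the same concept if they normalize to the same label."""
    return normalize(a) == normalize(b)
```

With such a table, records from two databases that use different names can be compared on the canonical label rather than on the raw strings.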
What is Function? The Hammer Example
Function (what)              Process (why)
Drive nail (into wood)       Carpentry
Drive stake (into soil)      Gardening
Smash roach                  Pest Control
Clown’s juggling object      Entertainment
Information Explosion
Entering the Genome Sequencing Era
Eukaryotic Genome Sequences
Organism                        Year   Genome Size (Mb)   # Genes
Yeast (S. cerevisiae)           1996   12                 6,000
Worm (C. elegans)               1998   97                 19,100
Fly (D. melanogaster)           2000   120                13,600
Plant (A. thaliana)             2001   125                25,500
Human (H. sapiens, 1st Draft)   2001   ~3000              ~35,000
What is the Gene Ontology?
• A Common Language for Annotation of Genes from Yeast, Flies and Mice
• …and Plants and Worms
• …and Humans
• …and anything else!
http://www.geneontology.org/
What is the Gene Ontology?
• Gene annotation system
• Controlled vocabulary that can be applied to all organisms
• Organism independent
• Used to describe gene products (proteins and RNA) in any organism
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
  • the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
• Biological Process = biological goal or objective
  • broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
  • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
Cellular Component
• where a gene product acts
• [two image-only slides]
• Enzyme complexes in the component ontology refer to places, not activities.
Molecular Function
• Examples: insulin binding, insulin receptor activity
• activities or “jobs” of a gene product, e.g. glucose-6-phosphate isomerase activity
• A gene product may have several functions; a function term refers to a single reaction or activity, not a gene product.
• Sets of functions make up a biological process.
Biological Process
• a commonly recognized series of events
• Examples:
  • cell division
  • transcription
  • metabolism: degradation or synthesis of biomolecules
  • development: how a group of cells becomes a tissue
  • courtship behavior
Ontology applications
Can be used to:
• Formalise the representation of biological knowledge
• Standardise database submissions
• Provide unified access to information through ontology-based querying of databases, both human and computational
• Improve management and integration of data within databases
• Facilitate data mining
Gene Ontology Structure
• Ontologies can be represented as directed acyclic graphs (DAGs), where the nodes are connected by edges
• Nodes = terms in biology
• Edges = relationships between the terms
  • is-a
  • part-of
Parent-Child Relationships
• Chromosome
  • Cytoplasmic chromosome
  • Mitochondrial chromosome
  • Nuclear chromosome
  • Plastid chromosome
• A child is a subset or an instance of a parent’s elements
Parent-Child Relationships
• [Diagram] Terms: cell, membrane, mitochondrial membrane, chloroplast membrane
• Edge types: is-a, part-of
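A DAG of terms with typed edges can be sketched with a plain adjacency map. The term names below come from the two Parent-Child slides, but the exact wiring of the edges is an assumption made for illustration and is not a copy of the real GO graph; `ancestors` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative edges only: (child, relation, parent). The wiring is assumed.
EDGES = [
    ("cytoplasmic chromosome", "is-a", "chromosome"),
    ("mitochondrial chromosome", "is-a", "chromosome"),
    ("nuclear chromosome", "is-a", "chromosome"),
    ("plastid chromosome", "is-a", "chromosome"),
    ("mitochondrial membrane", "is-a", "membrane"),
    ("chloroplast membrane", "is-a", "membrane"),
    ("membrane", "part-of", "cell"),
]

# Map each child term to its list of (relation, parent) pairs.
parents = defaultdict(list)
for child, relation, parent in EDGES:
    parents[child].append((relation, parent))


def ancestors(term):
    """Return every term reachable by following is-a / part-of edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for _, parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Because a term may have several parents, the upward walk keeps a `seen` set so that shared ancestors are visited only once.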
Annotation in GO
• A gene product is usually a protein but can be a functional RNA
• An annotation is a piece of information associated with a gene product
• A GO annotation is a Gene Ontology term associated with a gene product
Terms, Definitions, IDs
• Term: MAPKKK cascade (mating sensu Saccharomyces)
• Goid: GO:0007244
• Definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
• Evidence code: how annotation is done
• Definition_reference: PMID:9561267
Annotation Example
• Gene Product: nek2
• GO Term: centrosome (GO:0005813)
• Evidence Code: IDA (Inferred from Direct Assay)
• Reference: PMID:11956323
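The four parts of the annotation example can be held in one small record. This is a sketch under assumptions: the field values are from the slide, while the class name `GOAnnotation`, its field names, and `describe` are hypothetical and do not follow any official GO file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """One GO term attached to one gene product, with its supporting evidence."""
    gene_product: str
    go_id: str
    go_term: str
    evidence_code: str
    reference: str


# Values taken from the slide's example.
nek2 = GOAnnotation(
    gene_product="nek2",
    go_id="GO:0005813",
    go_term="centrosome",
    evidence_code="IDA",
    reference="PMID:11956323",
)


def describe(a: GOAnnotation) -> str:
    """Render an annotation as a one-line summary."""
    return (f"{a.gene_product} -> {a.go_term} ({a.go_id}), "
            f"evidence {a.evidence_code}, source {a.reference}")
```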
GO Annotation [three image-only slides]
Evidence Code
• Indicates the type of evidence in the cited source that supports the association between the gene product and the GO term
• http://www.geneontology.org/GO.evidence.html
Types of evidence codes
• Experimental codes: IDA, IMP, IGI, IPI, IEP
• Computational codes: ISS, IEA, RCA, IGC
• Author statement: TAS, NAS
• Other codes: IC, ND
Two types of annotation
• Manual Annotation
• Electronic Annotation
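The grouping of evidence codes can be written down as a lookup table. The groups and codes are exactly those listed on the slide; the function names `evidence_group` and `is_electronic` are hypothetical, and treating IEA as the only electronic code is an assumption based on the deck describing it as having no manual checking.

```python
# Evidence code groups as listed on the slide.
EVIDENCE_GROUPS = {
    "experimental": {"IDA", "IMP", "IGI", "IPI", "IEP"},
    "computational": {"ISS", "IEA", "RCA", "IGC"},
    "author statement": {"TAS", "NAS"},
    "other": {"IC", "ND"},
}


def evidence_group(code: str) -> str:
    """Return the group name for an evidence code, or raise ValueError."""
    code = code.upper()
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code}")


def is_electronic(code: str) -> bool:
    """Assumption: IEA annotations are the ones assigned without manual checking."""
    return code.upper() == "IEA"
```

A filter such as `evidence_group(code) == "experimental"` is one simple way to keep only annotations backed by a direct experiment.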
IDA: Inferred from Direct Assay
• direct assay for the function, process, or component indicated by the GO term
  • Enzyme assays
  • In vitro reconstitution (e.g. transcription)
  • Immunofluorescence (for cellular component)
  • Cell fractionation (for cellular component)
IMP: Inferred from Mutant Phenotype
• variations or changes such as mutations or abnormal levels of a single gene product
  • Gene/protein mutation
  • Deletion mutant
  • RNAi experiments
  • Specific protein inhibitors
  • Allelic variation
IGI: Inferred from Genetic Interaction
• Any combination of alterations in the sequence or expression of more than one gene or gene product
  • Traditional genetic screens: suppressors, synthetic lethals
  • Functional complementation
  • Rescue experiments
• An entry in the ‘with’ column is recommended
IPI: Inferred from Physical Interaction
• Any physical interaction between a gene product and another molecule, ion, or complex
  • 2-hybrid interactions
  • Co-purification
  • Co-immunoprecipitation
  • Protein binding experiments
• An entry in the ‘with’ column is recommended
IEP: Inferred from Expression Pattern
• Timing or location of expression of a gene
• Transcript levels: Northerns, microarray
• Exercise caution when interpreting expression results
ISS: Inferred from Sequence or structural Similarity
• Sequence alignment, structure comparison, or evaluation of sequence features such as composition
  • Sequence similarity
  • Recognized domains/overall architecture of protein
• An entry in the ‘with’ column is recommended
RCA: Inferred from Reviewed Computational Analysis
• non-sequence-based computational method
  • large-scale experiments
    • genome-wide two-hybrid
    • genome-wide synthetic interactions
  • integration of large-scale datasets of several types
  • text-based computation (text mining)
IGC: Inferred from Genomic Context
• Chromosomal position
• Most often used for bacteria (operons)
• Direct evidence for a gene being involved in a process is minimal, but for surrounding genes in the operon, the evidence is well-established
IEA: Inferred from Electronic Annotation
• depends directly on computation or automated transfer of annotations from a database
  • Hits from BLAST searches
  • InterPro2GO mappings
• No manual checking
• Entry in ‘with’ column is allowed (ex. sequence ID)
TAS: Traceable Author Statement
• publication used to support an annotation doesn't show the evidence
  • Review article
  • Text mining!
• Would be better to track down cited reference and use an experimental code
NAS: Non-traceable Author Statement
• Statements in a paper that cannot be traced to another publication
ND: No biological Data available
• Can find no information supporting an annotation to any term
• Indicates that a curator has looked for info but found nothing
  • Place holder
  • Date
IC: Inferred by Curator
• annotation is not supported by evidence, but can be reasonably inferred from other GO annotations for which evidence is available
• ex. evidence = transcription factor (function); IC = nucleus (component)
Choosing the correct evidence code
• Ask yourself: What is the experiment that was done?
• Text mining can help you review papers faster!
Beyond GO: Open Biomedical Ontologies
• Orthogonal to existing ontologies to facilitate combinatorial approaches
• Share unique identifier space
• Include definitions
Gene Ontology and Text Mining
• Derive ontology from text data
• More general goal: understand text data automatically
Finding GO terms
…for B. napus PERK1 protein (Q9ARH1)
“In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…” (PubMed ID: 12374299)
• Function: protein serine/threonine kinase activity (GO:0004674)
• Component: integral to plasma membrane (GO:0005887)
• Process: response to wounding (GO:0009611)
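A very rough way to see how the three terms could be found in the abstract is phrase matching. This is only a sketch: the GO IDs and term names come from the slide, whereas the trigger patterns and the function name `find_go_terms` are assumptions chosen to fit this one abstract, not a real annotation pipeline.

```python
import re

# GO ID -> (term name from the slide, hypothetical trigger pattern).
GO_PATTERNS = {
    "GO:0004674": ("protein serine/threonine kinase activity",
                   r"serine/threonine kinase activity"),
    "GO:0005887": ("integral to plasma membrane",
                   r"integral membrane protein"),
    "GO:0009611": ("response to wounding",
                   r"wound response"),
}


def find_go_terms(text):
    """Return (GO ID, term name) pairs whose trigger pattern occurs in the text."""
    hits = []
    for go_id, (term, pattern) in GO_PATTERNS.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append((go_id, term))
    return hits


# Fragments of the abstract quoted on the slide.
abstract = (
    "the kinase domain of PERK1 has serine/threonine kinase activity ... "
    "supports the prediction that PERK1 is an integral membrane protein ... "
    "these kinases have been implicated in early stages of wound response"
)
```

Hand-written patterns like these do not generalize; they only make concrete what a curator or a text-mining system has to recognize in the sentence.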
Mining Text Data
Data Mining / Knowledge Discovery can be applied to:
• Structured Data: HomeLoan ( Loanee: Frank Rizzo; Lender: MWF; Agency: Lake View; Amount: $200,000; Term: 15 years )
• Multimedia: Loans($200K, [map], ...)
• Free Text: Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.
• Hypertext: <a href>Frank Rizzo</a> Bought <a href>this home</a> from <a href>Lake View Real Estate</a> In <b>1992</b>. <p>...
(Taken from ChengXiang Zhai, CS 397cxz, UIUC, CS, Fall 2003)
Bag-of-Tokens Approaches
• Documents: “Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”
• Feature extraction turns each document into a token set: nation: 5, civil: 1, war: 2, men: 2, died: 4, people: 5, Liberty: 1, God: 1, …
• Loses all order-specific information! Severely limits context!
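The feature extraction step can be sketched with a token counter. This is a minimal illustration: the tokenizer (lowercase alphabetic runs) and the name `bag_of_tokens` are assumptions, and the counts it produces for the single sentence below differ from the slide's counts, which were taken over the full speech.

```python
import re
from collections import Counter


def bag_of_tokens(text):
    """Count lowercase alphabetic tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


first_sentence = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)
bag = bag_of_tokens(first_sentence)
```

The loss of order is easy to see: `bag_of_tokens("dog chases boy")` and `bag_of_tokens("boy chases dog")` are equal, although the sentences mean different things.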
Natural Language Processing
Example: “A dog is chasing a boy on the playground”
• Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
• Syntactic analysis (parsing): Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence
• Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1, b1, p1).
• Inference: adding the rule Scared(x) if Chasing(_, x, _). gives Scared(b1)
• Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
General NLP: Too Difficult!
• Word-level ambiguity
  • “design” can be a noun or a verb (ambiguous POS)
  • “root” has multiple meanings (ambiguous sense)
• Syntactic ambiguity
  • “natural language processing” (modification)
  • “A man saw a boy with a telescope.” (PP attachment)
• Anaphora resolution
  • “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)
• Presupposition
  • “He has quit smoking.” implies that he smoked before.
• Humans rely on context to interpret (when possible). This context may extend beyond a given document!
Shallow Linguistics
Progress on useful sub-goals:
• English Lexicon
• Part-of-Speech Tagging
• Word Sense Disambiguation
• Phrase Detection / Parsing
WordNet
An extensive lexical network for the English language
• Contains over 138,838 words.
• Several graphs, one for each part-of-speech.
• Synsets (synonym sets), each defining a semantic sense.
• Relationship information (antonym, hyponym, meronym …)
• Downloadable for free (UNIX, Windows)
• Expanding to other languages (Global WordNet Association)
• Funded >$3 million, mainly government (translation interest), to George Miller, National Medal of Science, 1991.
[Figure: “wet” with synonyms watery, damp, moist; “dry” with synonyms parched, anhydrous, arid; “wet” and “dry” linked as antonyms]
Part-of-Speech Tagging
• Training data (annotated text): This/Det sentence/N serves/V1 as/P an/Det example/N of/P annotated/V2 text/N …
• POS tagger input: “This is a new sentence.”
• POS tagger output: This/Det is/Aux a/Det new/Adj sentence/N
• Pick the most likely tag sequence:
  • Independent assignment: most common tag
  • Partial dependency: HMM
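The “independent assignment, most common tag” option can be sketched as a per-word lookup learned from annotated text. The training pairs are the ones shown on the slide; the function names `train` and `tag` and the fallback tag `"N"` for unseen words are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Annotated training text from the slide.
TRAINING = [
    ("This", "Det"), ("sentence", "N"), ("serves", "V1"), ("as", "P"),
    ("an", "Det"), ("example", "N"), ("of", "P"), ("annotated", "V2"),
    ("text", "N"),
]


def train(tagged_words):
    """Learn, for each word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag_ in tagged_words:
        counts[word.lower()][tag_] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(words, model, default="N"):
    """Tag each word independently; unseen words get the default tag (an assumption)."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train(TRAINING)
tagged = tag(["This", "is", "a", "new", "sentence"], model)
```

With so little training data, “is”, “a” and “new” are unseen and fall back to the default tag, which is wrong for all three. An HMM would improve on this by also using the tags of neighboring words.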
Word Sense Disambiguation
Example: “The difficulties of computational linguistics are rooted in ambiguity.” (which sense of “rooted”?)
Tags around the target word: linguistics/N are/Aux rooted/V in/P ambiguity/N
Supervised learning features:
• Neighboring POS tags (N Aux V P N)
• Neighboring words (linguistics are rooted in ambiguity)
• Stemmed form (root)
• Dictionary/thesaurus entries of neighboring words
• High co-occurrence words (plant, tree, origin, …)
• Other senses of word within discourse
Algorithms:
• Rule-based learning (e.g. IG guided)
• Statistical learning (e.g. Naïve Bayes)
• Unsupervised learning (e.g. Nearest Neighbor)
Parsing
Choose most likely parse tree…
• Grammar (probabilistic CFG): S → NP VP; NP → Det BNP; NP → NP PP; BNP → N; VP → V; VP → Aux V NP; VP → VP PP; PP → P NP; …
• Rule probabilities shown next to the grammar: 1.0, 0.3, 0.4, 0.3, …, 1.0, …
• Lexicon: V → chasing; Aux → is; N → dog; N → boy; N → playground; Det → the; Det → a; P → on; …
• Lexical probabilities shown next to the lexicon: 0.01, 0.003, …
• [Figure: two parse trees for “A dog is chasing a boy on the playground”, one attaching the PP “on the playground” to the verb phrase and one attaching it to the noun phrase “a boy”; the tree probabilities are 0.000015 and 0.000011]
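Under a probabilistic CFG, the probability of a parse tree is the product of the probabilities of the rules used in it. The sketch below illustrates that computation for the two PP attachments. It is a simplification under stated assumptions: the rule probabilities are hypothetical (not the slide's values), the Aux and BNP symbols are dropped, and lexical probabilities are ignored.

```python
from math import prod

# Hypothetical rule probabilities: (left-hand side, right-hand side) -> probability.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("NP", "PP")): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
}


def label(node):
    """A leaf is a preterminal label string; an inner node is (label, child, ...)."""
    return node if isinstance(node, str) else node[0]


def tree_prob(node):
    """Multiply the probabilities of all rules used in the tree."""
    if isinstance(node, str):
        return 1.0  # lexical probabilities are ignored in this sketch
    head, *children = node
    rhs = tuple(label(c) for c in children)
    return RULES[(head, rhs)] * prod(tree_prob(c) for c in children)


np_ = ("NP", "Det", "N")
pp = ("PP", "P", np_)
vp_attach = ("S", np_, ("VP", ("VP", "V", np_), pp))   # PP modifies the verb phrase
np_attach = ("S", np_, ("VP", "V", ("NP", np_, pp)))   # PP modifies the object noun phrase

best = max([vp_attach, np_attach], key=tree_prob)
```

A parser built on this idea would enumerate candidate trees (or use dynamic programming) and keep the one with the highest product.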
Obstacles
• Ambiguity: “A man saw a boy with a telescope.”
• Computational intensity: imposes a context horizon.
Text Mining NLP Approach:
1. Locate promising fragments using fast IR methods (bag-of-tokens).
2. Only apply slow NLP techniques to promising fragments.
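The two-step approach can be sketched as a cheap filter followed by an expensive step that runs only on the survivors. Everything here is illustrative: `cheap_score` is a simple term-overlap count standing in for a fast IR method, `expensive` is a placeholder for whatever slow NLP step is used, and the threshold `min_hits` is an assumed parameter.

```python
def cheap_score(fragment, query_terms):
    """Fast bag-of-tokens score: number of query terms present in the fragment."""
    tokens = set(fragment.lower().split())
    return len(tokens & query_terms)


def two_stage(fragments, query_terms, expensive, min_hits=2):
    """Step 1: keep promising fragments. Step 2: run the slow step only on those."""
    promising = [f for f in fragments if cheap_score(f, query_terms) >= min_hits]
    return [expensive(f) for f in promising]


fragments = [
    "PERK1 is a receptor-like kinase",
    "The weather was fine",
    "kinase activity of PERK1 was shown",
]
calls = []


def fake_parser(fragment):
    """Stand-in for a slow NLP step; records which fragments it was asked to process."""
    calls.append(fragment)
    return len(fragment.split())


results = two_stage(fragments, {"perk1", "kinase"}, fake_parser)
```

Here the slow step runs on two of the three fragments; the unrelated sentence is discarded by the cheap filter before any parsing is attempted.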
Summary: Shallow NLP
However, shallow NLP techniques are feasible and useful:
• Lexicon: machine-understandable linguistic knowledge
  • possible senses, definitions, synonyms, antonyms, typeof, etc.
• POS tagging: limit ambiguity (word/POS), entity extraction
  • “... research interests include text mining as well as bioinformatics.” (text mining: NP; bioinformatics: N)
• WSD: stem/synonym/hyponym matches (doc and query)
  • Query: “Foreign cars” Document: “I’m selling a 1976 Jaguar…”
• Parsing: logical view of information (inference?, translation?)
  • “A man saw a boy with a telescope.”
Even without complete NLP, any additional knowledge extracted from text data can only be beneficial. Ingenuity will determine the applications.
Reference for GO
• Gene ontology teaching resources: http://www.geneontology.org/GO.teaching.resources.shtml
References for TM
1. C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing”, MIT Press, 1999.
2. S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach”, Prentice Hall, 1995.
3. S. Chakrabarti, “Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data”, Morgan Kaufmann, 2002.
4. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi, Five papers on WordNet, Princeton University, August 1993.
5. C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC, Fall 2003.
6. M. Hearst, Untangling Text Data Mining, ACL’99, invited paper. http://www.sims.berkeley.edu/~hearst/papers/acl99-tdm.html
7. R. Sproat, Introduction to Computational Linguistics, LING 306, UIUC, Fall 2003.
8. A Road Map to Text Mining and Web Mining, University of Texas resource page. http://www.cs.utexas.edu/users/pebronia/text-mining/
9. Computational Linguistics and Text Mining Group, IBM Research, http://www.research.ibm.com/dssgrp/