6cb6bf0c8b0f3d592cb7c7ca136f5f88.ppt
- Number of slides: 23
Unstructured Machine Learning: Providing the link between Genetic Data and Published Research
Dr Tony C Smith
Reel Two, Inc.
9 Hartley Street, Hamilton, New Zealand
+64 7 839 7808
www.reeltwo.com

What is Machine Learning?
- Creating computer programs that get better with experience
- Learning how to make expert judgments
- Discovering previously hidden, potentially useful information (data mining)
How does it work?
- The user provides the learning system with examples of the concept to be learned
- An induction algorithm infers a characteristic model of the examples
- The model is used to predict whether future novel instances are also examples, and it does this very consistently and very quickly

Structured Learning: Mushroom Data
[Table: example records with the fields Weight (heavy, normal, light), Damage, Dirt and Firmness, and the class Quality (good, poor); individual cell values are not recoverable]
[Figure: decision tree that splits on weight (heavy, normal, light), then on dirt (mild, clean) or firmness (hard, soft), with leaves good or poor]

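As a rough illustration of structured learning, the sketch below induces a small decision tree from records with fixed fields, choosing at each step the field that leaves the least class entropy (an ID3-style heuristic). The rows are hypothetical, since the slide's table values did not survive, and the names `ROWS`, `induce` and `predict` are invented for this example. It is not a description of any Reel Two product.

```python
import math
from collections import Counter

# Hypothetical records (illustrative only): fixed fields -> quality label.
ROWS = [
    ({"weight": "heavy", "dirt": "mild", "firmness": "hard"}, "poor"),
    ({"weight": "heavy", "dirt": "clean", "firmness": "hard"}, "good"),
    ({"weight": "normal", "dirt": "clean", "firmness": "soft"}, "good"),
    ({"weight": "normal", "dirt": "mild", "firmness": "hard"}, "good"),
    ({"weight": "light", "dirt": "clean", "firmness": "hard"}, "good"),
    ({"weight": "light", "dirt": "clean", "firmness": "soft"}, "poor"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def split(rows, field):
    """Group rows by the value they hold in the given field."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[field], []).append((features, label))
    return groups


def induce(rows, fields):
    """Return a leaf label, or a (field, {value: subtree}) node."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1 or not fields:
        return Counter(labels).most_common(1)[0][0]

    def remainder(field):
        # Weighted entropy left over after splitting on this field.
        return sum(
            len(group) / len(rows) * entropy([label for _, label in group])
            for group in split(rows, field).values()
        )

    best = min(fields, key=remainder)
    rest = [f for f in fields if f != best]
    return (best, {value: induce(group, rest) for value, group in split(rows, best).items()})


def predict(tree, features, default="poor"):
    """Walk the tree; fall back to a default label for unseen field values."""
    while isinstance(tree, tuple):
        field, branches = tree
        if features[field] not in branches:
            return default
        tree = branches[features[field]]
    return tree


TREE = induce(ROWS, ["weight", "dirt", "firmness"])
```

Calling `predict(TREE, {"weight": "heavy", "dirt": "clean", "firmness": "soft"})` walks from the root to a leaf and returns a quality label for a record the tree has not seen before.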
Unstructured Learning
- Data does not have fixed fields with specific values
- Examples: images, continuous signals, expression data, text
- Learning proceeds by correlating the presence or absence of any and all salient attributes
Document Classification
- Given examples of documents covering some topic, learn a semantic model that can recognize whether or not other documents are relevant
- Prioritize them, i.e. quantify “how relevant” documents are to the topic
- Not limited to keywords (nor is it misled by them)
- Adapts to the user’s needs (ephemeral or long-term)

How Text Mining Works
- Users supply the system with training data: documents that are good examples of the desired category
- The system builds ‘classifiers’: statistical models based on the training data
- The system classifies novel data: it identifies other documents about the desired category
- Results are displayed or stored: files can be viewed, routed to end users or stored in databases

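The train, build, classify and rank steps above can be sketched with a small multinomial naive Bayes classifier. This is only a stand-in: the slides do not say which statistical model the Classification System uses. The training sentences, the labels `relevant` and `other`, and the names `NaiveBayes`, `relevance` and `rank` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, text, label):
        words = tokens(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokens(text):
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / total)
        return score

    def relevance(self, text, label="relevant", other="other"):
        """Log-odds that the text belongs to `label` rather than `other`."""
        return self.log_score(text, label) - self.log_score(text, other)


def rank(model, docs):
    """Order documents from most to least relevant."""
    return sorted(docs, key=model.relevance, reverse=True)


# Hypothetical training data: good examples of the category, plus counter-examples.
MODEL = NaiveBayes()
MODEL.train("p38 MAP kinase activation in the MAPK cascade", "relevant")
MODEL.train("protein kinase phosphorylation signalling", "relevant")
MODEL.train("patent filing deadlines and legal fees", "other")
MODEL.train("quarterly sales report for the region", "other")

NOVEL = ["legal fees report", "kinase cascade activation observed"]
RANKED = rank(MODEL, NOVEL)
```

A positive `relevance` value means the model favours the trained category. Sorting by that value gives the relevance ranking that the slides describe, without relying on any single keyword.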
Classification System
- Client-specific categories
- Familiar Windows-style interface
- Drag-and-drop documents to create custom categories
- Classified documents are ranked by relevance
- View contents of individual documents: sentences are highlighted by their relevance to the category

The Gene Ontology – A Good First Step
The Initial Problem: Individual curators evaluate data differently
While scientists can agree to use the word "kinase," they must also agree to support this by stating how and why they use "kinase," and consistently apply it. Only in this way can they hope to compare gene products and find out if and how they are related.
[Figure labels: Activation of p38 MAP Kinase; Protein Modification; MAPK-KK Cascade]
The Initial Solution: The Gene Ontology (GO), a controlled vocabulary with defined relationships between items.
GO consists of more than 13,000 nodes, or ‘GO Terms’, divided into three main trees: Biological Process, Cellular Component and Molecular Function.
Of these, only about 3,800 GO Terms are ‘active’, that is, terms annotated with more than just one or two publications.

The Gene Ontology Knowledge Discovery System
GO is only a partial solution
- Enormous gap between GO-annotated documents (27,000) and the full MEDLINE database (12 million entries)
- Updates lag behind
- Scientists must understand and agree to use the GO
- Knowledge changes and alters definitions
Using GO “as is” takes too long and delivers too little
GO KDS – Filling the gaps in GO
- GO KDS bridges the gap by classifying all of MEDLINE
- New documents are classified as they’re added
- Scientists can now annotate gene targets quickly and reliably
- GO KDS is updated along with GO and MEDLINE

GO KDS Interface Tour
- Current GO term(s) open
- Location of listed term in GO
- All sub-terms for the listed term: click on a term to further refine your search
- Enter a keyword to search in this GO category
- Opens abstract in separate window
- Color of stars identifies the GO branch; number of stars indicates confidence of category placement
- Original GO classifications (by domain expert)
- KDS discovers novel classifications

GO KDS Key Benefits
www.go-kds.com
- Quickly sort documents into the categories most relevant to the user
- Replace laborious annotation by domain experts with a trainable, automated system
- Discover conceptual links between previously unrelated scientific domains
- Identify key articles for pertinent research
- Integrate public, private and proprietary documents

How is document classification useful?
Life Science Research
- Finding relevant literature
- Prioritizing articles/reports
- Discovering hidden connections
- Distributing information
Patent preparation
- Searching patent databases
- Collecting relevant documents
- Synthesizing information
Drug Approval
- Collecting information
- Organizing/Collating documents
- Satisfying approval criteria

Intelligent Text Mining: Therapeutic Courses
One Reel Two client is using the Classification System to rapidly sort through large volumes of medical documentation in disparate therapeutic areas.
The Problem: The client must generate E-Learning courses from hundreds of pages of reports, literature and product documentation.
Old Solution: Manually read through documents to find paragraphs related to ‘Diagnosis’, ‘Etiology’, ‘Epidemiology’, etc.
New Solution: Use the Reel Two Classification System to build a custom taxonomy, then automatically classify and extract relevant document sections into Therapeutic Area categories.

Intelligent Text Mining – Patent Analysis
Search patent filings for the ideas or concepts behind one’s analysis
- Explore the state of prior art, the competitive landscape or ‘innovation gaps’
- Overcome intentionally vague language in patent filings
Example Project: Identifying ‘Mechanism of Action’ in life science patents
Patents are classified according to a taxonomy built by the client:
- Alzheimer’s Patents
- MoA: 5-HT Inhibitor
- MoA: Acetylcholinesterase
- MoA: Antioxidant
- MoA: Antiviral…
Sample Output
ACTIVITY - Analgesic; neuroprotective; nootropic; antiparkinsonian; neuroleptic; tranquilizer; antiinflammatory; antidepressant; anabolic; anorectic; anticonvulsant; uropathic; gastrointestinal; antiaddictive; gynecological.
MECHANISM OF ACTION - Neurotransmitter release modulator. In an in vitro assay, 2-chloro-5-(3-(R)-pyrrolidinylmethoxy)-3-pyridinecarbaldoxime (Ia) exhibited a Ki value for binding to neuronal nicotinic acetylcholine receptors of 0.012 nM.
The Mechanism of Action listed for this patent is "Neurotransmitter release modulator." However, the Classification System identified that this chemical modulator binds to the acetylcholine receptor, which is the true mechanism of action, and classified this patent in “MoA: Acetylcholinesterase”.

“Life Science Information Management will form the largest unmet need for IT companies in the 21st Century”
Caroline Kovac, General Manager, IBM Life Sciences

Appendix: GO KDS Interface
1. Search for a particular GO term by opening one of the main branches.
2. ‘Drill down’ through the taxonomy to find a term of interest. Click on that term.
3. Select the desired GO term. ‘Open’ the category by clicking on ‘new search with this term.’
4. Scroll down to view abstracts.
5. Discover conceptual links to other GO categories. Click on the category to add the term to your search.
6. View the data intersection between GO categories. Scroll through to view abstracts.
7. GO terms identify concepts embodied in the abstracts, enabling quick review.
8. Select an abstract of interest, and click to open the complete abstract.
9. The abstract will open in a new window, allowing you to continue with your search, or to link directly to the journal.