f6328388df2bc042781f80e51780a8a3.ppt
- Number of slides: 36
The BioText Project Myers Seminar Sept 22, 2003 Marti Hearst Associate Professor SIMS, UC Berkeley Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech 1
BioText Project Goals • Provide fast, flexible, intelligent access to information for use in biosciences applications. • Focus on – Textual Information – Tightly integrated with other resources • Ontologies • Record-based databases 2
People • Project Leaders: – PI: Marti Hearst – Co-PI: Adam Arkin • Computational Linguistics – Barbara Rosario – Preslav Nakov • Database Research – Ariel Schwartz – Gaurav Bhalotia (graduated) • User Interface / Information Retrieval – Kevin Li – Emilia Stoica • Bioscience – Dr. Ting Zhang 3
Outline • Main Goals – System Architecture – Apoptosis problem statement • Recent results in – Abbreviation definition recognition – Semantic relation recognition (from text) – Search User Interfaces – Hierarchical grouping of journals 4
BioText: Main Goals • Sophisticated Text Analysis • Annotations in Database • Improved Search Interface 5
Recent Result (Schwartz & Hearst 03) • Fast, simple algorithm for recognizing abbreviation definitions. – Simpler and faster than previous approaches – Higher precision and recall – Idea: Work backwards from the end of the abbreviation and of the candidate definition • Examples: – In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). – Gcn5-related N-acetyltransferase (GNAT) • Idea: use redundancy across abstracts to figure out an abbreviation's meaning even when the definition is not present. 6
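The "work backwards from the end" idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `find_long_form` and `extract_definitions`, the parenthesis regex, and the word-window heuristic are assumptions made for this example. Characters of the short form are matched right to left against the text preceding the parentheses, and the first character of the short form must begin a word of the long form.

```python
import re


def find_long_form(short_form, candidate):
    """Match the short form's characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` (starting at a word boundary)
    that contains the characters in order, or None if no match exists.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip punctuation inside the short form
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Find '(SHORT)' patterns and try to match each against the preceding words."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        if not short_form:
            continue
        words = sentence[: m.start()].split()
        # Assumed heuristic: only look at a limited window of preceding words.
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


sentence = ("In eukaryotes, the key to transcriptional regulation of the Heat Shock "
            "Response is the Heat Shock Transcription Factor (HSF).")
print(extract_definitions(sentence))
```

On the two slide examples this sketch pairs HSF with "Heat Shock Transcription Factor" and GNAT with "Gcn5-related N-acetyltransferase". It makes a single pass over the text and needs no training data, which is consistent with the "fast, simple" claim on the slide.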
BioText: A Two-Sided Approach • Empirical Computational Linguistics Algorithms • Resources: BLAST, Medline, Journal Full Text, MeSH, WordNet, SwissProt, GO • Sophisticated Database Design & Algorithms 7
Apoptosis Network (diagram nodes): Survival Factors Signaling; Death Receptors Signaling; Genotoxic Stress; Loss of Attachment; Cell Cycle stress, etc.; ER Stress; Initiator Caspases (8, 10); p53 pathway; BH3-only; Ca++ Signaling; NFkB; Bcl-2 like; Bax, Bak; Smac; Caspase 12; IAPs; Mitochondria; Cytochrome c; Apaf-1; AIF; Caspase 9; Effector Caspases (3, 6, 7); Apoptosis • Slide courtesy Ting Zhang 8
The issues (courtesy Ting Zhang): • The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published. • The supporting experimental data are gathered in different organs, tissues, cells using various techniques. • There are various levels of uncertainty associated with different techniques used to answer certain questions. • Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts. • We need to keep track of ALL the information in order to understand the system better. 9
Simple cases: • Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-lymphoma cell line KO52DA20). • Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library). • Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI 7.1, CH1, WEHI 231, WEHI 415, B6.23.16BW2 cell extracts). • BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids). • BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above). 10
Computational Language Goals • Recognizing and annotating entities within textual documents • Identifying semantic relations among entities • To (eventually) be used in tandem with semi-automated reasoning systems. 11
Main Ideas for NLP Approach • Assign Semantics using – Statistics – Hierarchical Lexical Ontologies to generalize – Redundancy in the data • Build up Layers of Representation – Syntactic and Semantic – Use these in a feedback loop 12
Computational Linguistics Goals • Mark up text with semantic relations 13
Recent Result: Descent of Hierarchy • Idea: – Use the top levels of a lexical hierarchy to identify semantic relations • Hypothesis: – A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair. 14
Definition • NC: Any sequence of nouns that itself functions as a noun – asthma hospitalizations – health care personnel hand wash • Technical text is rich with NCs: – Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment. 15
NCs: Three tasks • Identification • Syntactic analysis (attachments) – [Baseline [headache frequency]] – [[Tension headache] patient] • Our Goal: Semantic analysis – Headache treatment: treatment for headache – Corticosteroid treatment: treatment that uses corticosteroid 16
Main Idea: • Top-level MeSH categories can be used to indicate which relations hold between the nouns in a compound • headache – C23.888.592.612.441; recurrence – C23.550.291.937; pain – G11.561.796.444 • breast – A01.236; cancer – C04; cells – A11 17
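As a rough illustration of how the MeSH tree numbers above reduce to category pairs, the sketch below truncates each noun's tree number to its first component and groups two-word compounds by the resulting pair. The dictionary `MESH`, the helper names, and the choice of example compounds are assumptions for this example; only the tree numbers themselves come from the slide.

```python
from collections import defaultdict

# Tree numbers as listed on the slide (one code per noun; real MeSH terms
# may have several codes, which this sketch ignores).
MESH = {
    "headache": "C23.888.592.612.441",
    "recurrence": "C23.550.291.937",
    "pain": "G11.561.796.444",
    "breast": "A01.236",
    "cancer": "C04",
    "cells": "A11",
}


def top_level(tree_number):
    """Keep only the first component of a tree number, e.g. 'A01.236' -> 'A01'."""
    return tree_number.split(".")[0]


def category_pair(modifier, head):
    """Map a two-word noun compound to its (modifier, head) category pair."""
    return (top_level(MESH[modifier]), top_level(MESH[head]))


# Illustrative compounds built from the nouns on the slide.
compounds = [("headache", "recurrence"), ("headache", "pain"),
             ("breast", "cancer"), ("breast", "cells")]

groups = defaultdict(list)
for modifier, head in compounds:
    groups[category_pair(modifier, head)].append(f"{modifier} {head}")

for pair, members in sorted(groups.items()):
    print(pair, members)
```

Under the hypothesis on the earlier slide, every compound that falls into the same category pair would be assigned the same semantic relation.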
Linguistic Motivation • Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure. – (used-in): kitchen knife – (made-of): steel knife – (instrument-for): carving knife – (used-on): putty knife – (used-by): butcher’s knife 18
Distribution of Frequent Category Pairs 19
How Far to Descend? • Anatomy: 250 CPs – 187 (75%) remain first level – 56 (22%) descend one level – 7 (3%) descend two levels • Natural Science (H01): 21 CPs – 1 (4%) remain first level – 8 (39%) descend one level – 12 (57%) descend two levels • Neoplasm (C04): 3 CPs – 3 (100%) descend one level 20
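One way to picture the descent is as a rule lookup that starts with the first-level category pair and moves to more specific levels only when no rule applies. The sketch below is an assumption-laden simplification: the rule table `RULES` and its relation labels are hypothetical placeholders, and both sides of the pair descend together, whereas the actual procedure may descend on one side at a time.

```python
def truncate(tree_number, levels):
    """Keep the first `levels` components, e.g. ('C23.888.592', 2) -> 'C23.888'."""
    return ".".join(tree_number.split(".")[:levels])


# Hypothetical rule table: category pair -> relation label.
# The labels are placeholders, not relations reported in the talk.
RULES = {
    ("A01", "C04"): "relation-1",           # decided at the first level
    ("C23.888", "C23.550"): "relation-2",   # needs one level of descent
}


def find_relation(modifier_code, head_code, rules=RULES, max_levels=3):
    """Return (category_pair, relation) for the shallowest matching rule, or None."""
    for levels in range(1, max_levels + 1):
        pair = (truncate(modifier_code, levels), truncate(head_code, levels))
        if pair in rules:
            return pair, rules[pair]
    return None


print(find_relation("A01.236", "C04"))                              # breast cancer
print(find_relation("C23.888.592.612.441", "C23.550.291.937"))      # headache recurrence
```

A pair that remains at the first level corresponds to a rule keyed on first-level codes only; pairs that "descend one level" need the second component before a single relation applies.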
Evaluation • Apply the rules to a test set • Accuracy: – Anatomy: 91% accurate – Natural Science: 79% – Diseases: 100% • Total: – 89.6% via intra-category averaging – 90.8% via extra-category averaging 21
Summary of NC Work • Lexical hierarchy useful for inferring semantic relations • Works because semantics are constrained and word sense ambiguity is not too much of a problem • Can it be extended to other types of relations? – Preliminary results on one set of relations are promising. 22
Database Research Issues • Efficiently and effectively combining – Relational databases & Text – Hierarchical Ontologies – Layers of Annotations 23
Interface Issues • Create intuitive, appealing interfaces that are better than what’s currently out there. • Start with existing assigned metadata • As text analysis improves, incorporate the results into the interface. 24
25
26
27
28
Some Recent Work • Organizing BioScience Journal Names – Currently there are > 3500 29
30
31
Some Recent Work • Organizing BioScience Journal Names – Currently there are > 3500 • Idea: – Group them into faceted hierarchies semi-automatically – Using clustering of title terms, synonym similarity via WordNet, and other techniques 32
33
34
Summary • BioText aims to improve access to bioscience information via – Sophisticated language analysis – Integration of results into • Annotated database • Flexible user interface • Eventual goal – Semi-automated mining and discovery 35
There’s lots to do! For more information: biotext.berkeley.edu 36