


Machine Learning Group
Learning to Extract Proteins and their Interactions from Medline Abstracts
Raymond J. Mooney, Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong – Department of Computer Sciences
Edward M. Marcotte, Arun Ramani – Institute for Cellular and Molecular Biology
University of Texas at Austin

Biological Motivation
• The Human Genome Project has produced huge amounts of genetic data.
• The next step is analyzing and interpreting this data.


Starting at the tip of chromosome 1 . . .
[The slide shows a long stretch of raw nucleotide sequence (taaccctaacc taaccctaaa cccaacccca ...) with position numbers, ending with the note: "and 3 x 10^9 more."]

Proteomics 101
• Genes code for proteins.
• Proteins are the basic components of biological machinery.
• Proteins accomplish their functions by interacting with other proteins.
• Knowledge of protein interactions is fundamental to understanding gene function.
• Chains of interactions compose large, complex gene networks.

Sample Gene Network
[Figure: a sample gene network.]

Yeast Gene Network
• Yeast: ~5,800 genes => ~5,800 proteins
• x 2-10 interactions/protein => ~12,000-60,000 interactions
• ~10,000-20,000 known => ~1/3 of the way to a complete map!

Human Gene Network
• ~40,000 genes => >>40,000 proteins
• x 2-10 interactions/protein => >>80,000-400,000 interactions
• <5,000 known => approx. 1% of the complete map!
• => We're a long way from the complete map.

Relevant Sources of Data
• Biological literature: ~14 million documents
• DNA sequence data: ~10^10 nucleotides
• Gene expression data: ~10^8 measurements, but . . .
• DNA polymorphisms: ~10^7 known
• Gene inactivation (knockout) studies: ~10^5
• Protein structure data: ~10^4 structures
• Protein interaction data: ~10^4 interactions, but . . .
• Protein expression data: ~10^4 measurements, but . . .
• Protein location data: ~10^4 measurements

Extraction from Biomedical Literature
• An ever-increasing wealth of biological information is present in millions of published articles, but retrieving it in structured form is difficult.
• Much of this literature is available through the NIH NLM's Medline repository.
• 11 million abstracts in electronic form are available through Medline.
• Excellent source of information on protein interactions.
• Need automated information extraction to easily locate and structure this information.

TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein
AB - Originally identified as a 'mitotic cyclin', cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene. Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene. However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to govern cell cycle progression remain unresolved. In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein. The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells. Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro. In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. Immunoprecipitation experiments with human osteosarcoma cells and Ewing's sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity. Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the p105Rb protein. This study identifies several common aspects of cyclin biochemistry, including tyrosine phosphorylation and the potential to interact directly or indirectly with the Rb protein, that may ultimately relate membrane-mediated signaling events to the regulation of gene expression.


Manually Developed IE Systems for Medline
• A number of projects have focused on the manual development of information extraction (IE) systems for biomedical literature.
• KEX for extracting protein names (Fukuda et al., 1998): extract words with special symbols, excluding those with more than half of the characters being special symbols, hence eliminating strings such as "+/−".
• Suiseki for extracting protein interactions (Blaschke et al., 2001):
  PROT (0-2) complex NOUN between (0-3) PROT (0-3) and (0-3) PROT

Learning Information Extractors
• Manually developing IE systems is tedious and time-consuming, and such systems do not capture all possible formats and contexts for the desired information.
• Machine learning from supervised corpora is becoming the standard approach to building information extractors.
• Recently, several learning approaches have been applied to Medline extraction (Craven & Kumlein, 1999; Tanabe & Wilbur, 2002; Raychaudhuri et al., 2002).
• We have explored a variety of machine learning techniques for developing IE systems that extract human protein names and interactions, presenting uniform results on a single, reasonably large, human-annotated corpus.

Non-Learning Protein Extractors
• Dictionary-based extraction
• KEX (Fukuda et al., 1998)

Learning Methods for Protein Extraction
• Rule-based pattern induction
  – Rapier (Califf & Mooney, 1999)
  – BWI (Freitag & Kushmerick, 2000)
• Token classification (chunking approach):
  – K-nearest neighbor
  – Transformation-based learning: Abgene (Tanabe & Wilbur, 2002)
  – Support vector machine
  – Maximum entropy
• Hidden Markov Models
• Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001)
• Relational Markov Networks (Taskar, Abbeel, and Koller, 2002)

Our Biomedical Corpora
• 750 abstracts that contain the word "human" were randomly chosen from Medline for testing protein name extraction. They contain a total of 5,206 protein references.
• 200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins. They contain 1,101 interactions and 4,141 protein names.
• As negative examples for interaction extraction are rare, an extra set of 30 abstracts containing sentences with non-interacting proteins is included.
• The resulting 230 abstracts are used for testing protein interaction extraction.

The Yapex Corpus
• 200 abstracts from Medline, manually tagged for protein names.
• 147 randomly chosen such that they contain the MeSH terms "protein binding", "interaction", "molecular".
• 53 randomly chosen from the GENIA corpus.
• http://www.sics.se/humle/projects/prothalt/

Evaluation Metrics for Information Extraction
• Precision is the percentage of extracted items that are correct.
• Recall is the percentage of correct items that are extracted.
• Extracted protein names are considered correct if the same character sequences have been human-tagged as protein names in the exact positions.
• Extracted protein interactions from an abstract are considered correct if both proteins have been human-tagged as interacting in that abstract. Positions are not taken into account.
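To make the exact-match criterion concrete, here is a minimal sketch (function and data names are illustrative, not from the original system) of computing precision, recall, and F-measure over extracted character spans:

    def precision_recall_f1(extracted, gold):
        """Exact-match scoring: an extracted span (start, end, text) counts
        as correct only if the identical span was human-tagged."""
        extracted, gold = set(extracted), set(gold)
        correct = extracted & gold
        p = len(correct) / len(extracted) if extracted else 0.0
        r = len(correct) / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Illustrative spans: (start offset, end offset, text)
    gold = {(0, 8, "cyclin A"), (13, 22, "cyclin D1")}
    extracted = {(0, 8, "cyclin A"), (13, 19, "cyclin")}
    print(precision_recall_f1(extracted, gold))   # (0.5, 0.5, 0.5)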

Dictionary as Source of Domain Knowledge
• Before applying machine learning, abstracts are tagged by matching n-grams against entries from a dictionary. Tagged abstracts are used as input for subsequent methods.
• A dictionary of 42,000 protein names is used (synonyms included).
• Generalization of protein names leads to increased coverage (see the sketch below the table):

  Original Protein Name    Generalized Name
  Interleukin-1 beta       Interleukin num greek
  Interferon alpha-D       Interferon greek roman
  NF-IL6-beta              NF IL num greek
  TR2                      TR num
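The generalization in the table can be approximated with a few token-class rules: digits become num, Greek-letter names become greek, Roman numerals become roman. A sketch; the slide does not fully specify the original system's rules, so this is an approximation that reproduces the table's examples:

    import re

    GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "sigma"}
    # Strict Roman-numeral form, so "D" matches but ordinary words do not.
    ROMAN = re.compile(r"(?=[IVXLCDM])M*(CM|CD|D?C{0,3})"
                       r"(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

    def generalize(name):
        """e.g. 'Interleukin-1 beta' -> 'Interleukin num greek'."""
        tokens = re.findall(r"[A-Za-z]+|\d+", name)   # split letter/digit runs
        out = []
        for tok in tokens:
            if tok.isdigit():
                out.append("num")
            elif tok.lower() in GREEK:
                out.append("greek")
            elif ROMAN.fullmatch(tok):
                out.append("roman")
            else:
                out.append(tok)
        return " ".join(out)

    for name in ["Interleukin-1 beta", "Interferon alpha-D", "NF-IL6-beta", "TR2"]:
        print(name, "->", generalize(name))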

Rule-based Learning Algorithms: Rapier and BWI
• Rule-based learning algorithms are used for inducing patterns for extracting protein names.
• For Rapier (Califf & Mooney, 1999), each rule consists of a pre-filler pattern, a filler pattern and a post-filler pattern.
  [ human ] [ (2) transcriptase ] [ ( ]
• For BWI (Freitag & Kushmerick, 2000), rules are composed of contextual patterns called wrappers, recognizing the start or end of a protein name.
  [ human ] [] [ transcriptase ] [ ( ]
• High precision (> 70%) but low recall (< 25%).

Hidden Markov Models
• We use part-of-speech information in HMMs as described in (Ray & Craven, 2001).
• We train a positive model that generates sentences containing proteins, and a null model that generates sentences containing no proteins.
• Select the model which gives the highest likelihood of generating a particular sentence, and tag the sentence using the Viterbi path in that model.
  [Figure: two state-transition diagrams running START -> ... -> END, the positive model with protein states such as NN:PROT and the null model with plain POS states such as NN.]
• Moderate precision (~60%) and moderate recall (~40%).
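A compact sketch of the model-selection step: score a sentence under both HMMs with the scaled forward algorithm and keep it only if the positive model wins (the Viterbi tagging step is omitted, and the model parameters are placeholders rather than the models trained by Ray & Craven):

    import numpy as np

    def forward_loglik(obs, pi, A, B):
        """log P(obs | model) via the scaled forward algorithm.
        pi: (S,) initial probs; A: (S, S) transitions; B: (S, V) emissions;
        obs: a list of observation indices."""
        alpha = pi * B[:, obs[0]]
        loglik = np.log(alpha.sum())
        alpha = alpha / alpha.sum()               # rescale to avoid underflow
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            loglik += np.log(alpha.sum())
            alpha = alpha / alpha.sum()
        return loglik

    def sentence_has_protein(obs, pos_model, null_model):
        """Pick whichever model assigns the sentence higher likelihood."""
        return forward_loglik(obs, *pos_model) > forward_loglik(obs, *null_model)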

Name Extraction by Token Classification ("Chunking" Approach)
• Since in our data no protein names directly abut each other, we can reduce the extraction problem to classifying individual words as being part of a protein name or not.
• Protein names are extracted by identifying the longest sequences of words classified as being part of a protein name (see the sketch below).
  Example: "Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein"
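A sketch of the span-recovery step just described: turn per-token binary decisions into maximal runs of positive tokens (tokens and labels below are illustrative):

    def labels_to_names(tokens, labels):
        """Extract maximal runs of tokens classified as part of a protein name."""
        names, start = [], None
        for i, lab in enumerate(labels + [0]):    # sentinel closes a final run
            if lab == 1 and start is None:
                start = i
            elif lab != 1 and start is not None:
                names.append(" ".join(tokens[start:i]))
                start = None
        return names

    tokens = ["Two", "potentially", "oncogenic", "cyclins", ",",
              "cyclin", "A", "and", "cyclin", "D1"]
    labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
    print(labels_to_names(tokens, labels))        # ['cyclin A', 'cyclin D1']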

Constructing Feature Vectors for Classification
• For each token, we take the following as features:
  – Current token
  – Last 2 tokens and next 2 tokens
  – Output of the dictionary-based tagger for these 5 tokens
  – Suffix for each of the 5 tokens (last 1, 2, and 3 characters)
  – Class labels for the last 2 tokens
[The running example sentence is shown again, with the 5-token window highlighted.]
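A sketch of how the feature list above can be assembled for one token position; the dictionary tagger is passed in as precomputed per-token tags, and the feature names are illustrative:

    def token_features(tokens, dict_tags, prev_labels, i):
        """Features for classifying tokens[i]: a 5-token window, the
        dictionary-tagger output for the window, 1-3 character suffixes,
        and the class labels of the two previous tokens."""
        feats = {}
        for k in range(-2, 3):                    # current token +/- 2
            j = i + k
            tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
            feats["w[%d]" % k] = tok
            feats["dict[%d]" % k] = dict_tags[j] if 0 <= j < len(tokens) else "O"
            for n in (1, 2, 3):                   # last 1-3 characters
                feats["suf%d[%d]" % (n, k)] = tok[-n:]
        feats["y[-1]"], feats["y[-2]"] = prev_labels
        return feats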

Maximum-Entropy Token Classifier
• Distinguish among 5 types of tags: S(-tart), C(-ontinue), E(-nd), U(-nique), O(-ther).
• Feature templates (Collins, ACL 02):
  – current, previous, next word, and previous tag
  – part-of-speech for current, previous, next word
  – word class (full), e.g. FGF1 => AAA0
  – word class (brief), e.g. FGF1 => A0
• An extraction's confidence is the minimum of the transition probabilities along its tag path (for an example spanning 4 tokens, the minimum over those 4 transitions), where α_t(y) is the forward probability of getting to state y at time step t.
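The word-class features can be reproduced directly from the slide's examples: map each character to its type, and collapse repeats for the brief form. A sketch:

    import re

    def word_class(w, brief=False):
        """Full class: FGF1 -> AAA0; brief class: FGF1 -> A0."""
        s = "".join("A" if c.isupper() else
                    "a" if c.islower() else
                    "0" if c.isdigit() else c
                    for c in w)
        return re.sub(r"(.)\1+", r"\1", s) if brief else s

    print(word_class("FGF1"))               # AAA0
    print(word_class("FGF1", brief=True))   # A0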

MaxEnt: Greedy Extraction
• Use a Viterbi-like algorithm to find the most likely complete sequence of tags.
• Drawback: many low-confidence extractions are missed.
• Want to be able to increase recall beyond the Viterbi results to control the precision-recall trade-off.
• Solution: use a greedy extraction algorithm on all token sequences between any two consecutive Viterbi extractions.

Experimental Method
• 10-fold cross-validation: average results over 10 trials with different training and (independent) test data.
• For methods which produce confidence in extractions, vary the extraction threshold in order to explore the recall-precision trade-off.
• Use standard methods from information retrieval to generate a complete precision-recall curve.
• Maximizing F-measure assumes a particular cost-benefit trade-off between incorrect and missed extractions.

Protein Name Extraction Results (Bunescu et al., 2004)
[Figure: precision-recall curves for the compared methods.]

Graphical Models
An intuitive representation of conditional independence between domain variables.
• Directed models => well suited to represent temporal and causal relationships (Bayesian networks, HMMs).
• Undirected models => appropriate for representing statistical correlation between variables (Markov networks).
• Generative models => define a joint probability over observations and labels (HMMs).
• Discriminative models => define a probability over labels given a set of observations (Conditional Random Fields [Lafferty et al. 2001]); these allow for arbitrary, overlapping features over the observation sequence.

Discriminative Markov Networks
• G = (V, E) – an undirected graph
• V = X ∪ Y – a set of discrete random variables
  – X – observed variables
  – Y – hidden variables (labels)
• C(G) – the cliques of G
• V_c = X_c ∪ Y_c – the set of vertices in a clique c ∈ C(G)
• Φ = {φ_c} – the set of clique potentials
• A clique potential φ_c specifies the compatibility of any possible assignment of values over the nodes in the associated clique c.
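Putting the pieces together, the conditional distribution such a network defines is the usual normalized product of clique potentials; a reconstruction in LaTeX from the definitions above:

    P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C(G)} \varphi_c(x_c, y_c),
    \qquad
    Z(x) = \sum_{y'} \prod_{c \in C(G)} \varphi_c(x_c, y'_c)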

Conditional Random Fields [Lafferty et al. 2001]
• CRFs are a type of discriminative Markov network used for tagging sequences.
• CRFs have shown superior or competitive performance in various tasks, such as:
  – Shallow parsing [Sha & Pereira 2003]
  – Entity recognition [McCallum & Li 2003]
  – Table extraction [Pinto et al. 2003]

Conditional Random Fields (CRFs) (Lafferty, McCallum & Pereira, 2001)
• Undirected graphical model for sequence segmentation.
• Log-linear model, different from the MaxEnt model because of "global normalization" (see the formula below).
[Figure: a linear-chain CRF over tag nodes Start, T1.tag, T2.tag, ..., Tn.tag, End, with observed features Tj.w and Tj.cap attached at each position.]
• Tj.tag – the tag (one of S, C, E, U, O) at position j
• Tj.w – true if word w occurs at position j
• Tj.cap – true if the word at position j begins with a capital letter, ...
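The globally normalized log-linear form from Lafferty et al. (2001), written for this tag sequence; the f_k are indicator features built from the node values above (e.g. Tj.w or Tj.cap paired with Tj.tag):

    p(\mathbf{t} \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})}
        \exp\Big( \sum_{j} \sum_{k} \lambda_k \, f_k(t_{j-1}, t_j, \mathbf{w}, j) \Big)

Unlike a locally normalized MaxEnt tagger, the partition function Z(w) sums over all complete tag sequences, which is the "global normalization" referred to above.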

Protein Name Extraction Results (Yapex)
[Figure: results chart.]

Collective Classification of Web Pages [Taskar, Abbeel & Koller 2002]
[Figure: illustration of collective classification over linked web pages.]

Collective Information Extraction
Task:
• Extracting protein/gene names from Medline abstracts.
Approach:
• Collectively classify all candidate phrases from the same abstract.
• Binary classification:
  – e.label = 0 => e is not a protein name
  – e.label = 1 => e is a protein name
• Use two types of label correlations:
  – Acronyms and their long forms.
  – Repetitions of the same phrase.

Collective Information Extraction
"The control of human ribosomal protein L22 ( rpL22 ) to enter into the nucleolus and its ability to be assembled into the ribosome is regulated by its sequence. The nuclear import of rpL22 depends on a classical nuclear localization signal of four lysines at positions 13 - 16 ... Once it reaches the nucleolus, the question of whether rpL22 is assembled into the ribosome depends upon the presence of the N - domain."
[Figure: candidate entities e1 "ribosomal protein L22", e2 "( rpL22 )", e3 "L22", and the later occurrences of "rpL22" (e4, e5), connected by overlap, acronym, and repetition edges.]

Relational Markov Networks [Taskar, Abbeel & Koller 2002]
Discriminative Markov networks, augmented with clique templates:
• Overlap Template (OT)
• Acronym Template (AT)
• Repeat Template (RT)
[Figure: the candidate entities from the previous slide (e1 "ribosomal protein L22", e2 "( rpL22 )", e3 "L22", e4, e5) with the cliques induced by these templates.]

Candidate Entities: Definition
• The set of candidate entities usually depends on the type of named entity.
• In general, one could consider as candidates all phrases of length < L, where L may be task dependent.
Two examples:
• [Genes, Proteins] Most entity names are base noun phrases or parts of them. Thus a candidate extraction is any contiguous sequence of tokens whose POS tags are from {"JJ", "VBN", "VBG", "POS", "NNS", "NNPS", "CD", ...} and whose head is either a noun or a number (see the sketch after this list).
• [People, Organizations, Locations] Most entity names are sequences of proper names, potentially interspersed with definite articles and prepositions.
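A sketch of candidate generation for the gene/protein case; the maximum length L and the completion of the partly garbled tag set are assumptions, and the head is approximated as the span's last token:

    # POS tags from the slide; the original list is partly garbled, so the
    # singular-noun tags "NN"/"NNP" are added here as an assumption.
    ALLOWED = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD"}
    HEAD = {"NN", "NNS", "NNP", "NNPS", "CD"}   # head must be a noun or number
    L = 8                                       # maximum candidate length (assumed)

    def candidate_entities(tokens, pos_tags):
        """All contiguous spans of at most L tokens whose POS tags are all
        allowed and whose last token is in HEAD."""
        cands = []
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + L, len(tokens)) + 1):
                tags = pos_tags[i:j]
                if all(t in ALLOWED for t in tags) and tags[-1] in HEAD:
                    cands.append((i, j, " ".join(tokens[i:j])))
        return cands

    toks = ["the", "ribosomal", "protein", "L22", "enters", "the", "nucleolus"]
    tags = ["DT", "JJ", "NN", "NN", "VBZ", "DT", "NN"]
    print(candidate_entities(toks, tags))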

Candidate Entities: Local Features
"... to the antioxidant superoxide dismutase 1 ( SOD 1 ) enzyme and ..."
Entity features, based on features introduced in [Collins '02]:
• head word, with a generic placeholder for numbers => "HD = 0"
• entity text => "TXT = superoxide dismutase - 1"
• entity type, e.g. the concatenation of its word types => "TYPE = a a - 0"
• bigrams/trigrams at the entity's left/right boundaries, based on combinations of lexical tokens and word types:
  – bigrams left => "BL = antioxidant superoxide", "BL = antioxidant a", ...
  – bigrams right => "BR = 0 (", ...
  – trigrams left => "TL = the antioxidant superoxide", "TL = the antioxidant a", ...
  – trigrams right => "TR = 0 ( SOD 1", "TR = 0 ( A 0", ...
• suffix/prefix lists of words and word types:
  – prefixes => "PF = superoxide", "PF = superoxide dismutase", ...
  – suffixes => "SF = 0", "SF = - 0", "SF = dismutase - 0", ...

Overlap Template
• Entity names should not overlap => hardwired overlap potential φ_OT.
"... to the antioxidant superoxide dismutase 1 ( SOD 1 ) enzyme and ..." (overlapping candidates e1 and e2)

  φ_OT          e2.label=0   e2.label=1
  e1.label=0        1            1
  e1.label=1        1            0

Repeat Template
"Production of nitric oxide ( NO ) in endothelial cells is regulated by direct interactions of endothelial nitric oxide synthase ( eNOS ) ... Here we have used the yeast two-hybrid system and identified a novel 34 kDa protein, termed NOSIP ( eNOS interaction protein ), which avidly binds to the carboxyl-terminal region of the eNOS oxygenase domain."
[Figure: two occurrences u and v of the repeated phrase "eNOS", each with its candidate entities (u1 ... um, v1 ... vn, e.g. v1 = "eNOS interaction", v2 = "eNOS interaction protein"), connected through auxiliary OR nodes u.OR and v.OR.]

Acronym Template
"to the antioxidant superoxide dismutase 1 ( SOD 1 ) enzyme and"
[Figure: the long-form candidates v1, v2, ..., vn preceding the parenthesized acronym are connected to it through an auxiliary OR node v.OR.]

Experimental Results
Datasets:
• Yapex – a dataset of 200 Medline abstracts, manually tagged for protein names.
• Aimed – a dataset of 225 Medline abstracts, of which 200 are known to mention protein interactions.
• CoNLL – the CoNLL 2003 English dataset.
Compared three approaches:
• LT-RMN – extraction using local templates plus the Overlap Template.
• GLT-RMN – extraction using both local and global templates.
• CRF – extraction as token classification using Conditional Random Fields [Lafferty et al. 2001], with features based on the current word, previous/next words, short/long word types, and POS tags [Bunescu et al. 2004].

Experimental Results – Yapex
[Figure: results chart.]

Experimental Results – Aimed
[Figure: results chart.]

Experimental Results – CoNLL 2003
[Figure: results chart.]

Protein Interaction Extraction
• Most IE methods focus on extracting individual entities.
• Protein interaction extraction requires extracting relations between entities.
• Our current results on relation extraction have focused on rule-based learning approaches.

Rapier and BWI Revisited: the Inter-filler Approach
• Existing rule-based learning algorithms are used for inducing patterns for identifying protein interactions.
• Rules are learned for extracting inter-fillers:
  "SHPTPW interacts with another signaling protein, Grb7."
• Inter-fillers are sometimes very long (~9 tokens on average; 215 tokens maximum!). For some rule-based learning algorithms (e.g. Rapier), the time complexity can grow exponentially in the length of inter-fillers.

Rapier and BWI Revisited: the Role-filler Approach
• In the role-filler approach, we extract the two interacting proteins into different slots, which we call the interactor and the interactee.
• A sentence is divided into segments. Interactors are associated with interactees in the same segment using simple heuristics.
  "We show that the S252W mutation allows the mesenchymal splice form of FGFR2 (FGFR2c) to bind and to be activated by the mesenchymally expressed ligands FGF7 or FGF10 and the epithelial splice form of FGFR2 (FGFR2b) to be activated by FGF2, FGF6, and FGF9."
• Moderately high precision (> 60%) but low recall (< 40%).

ELCS (Extraction using Longest Common Subsequences)
• A new method for inducing rules that extract interactions between previously tagged proteins.
• Each rule consists of a sequence of words with allowable word gaps between them (similar to Blaschke & Valencia, 2001, 2002), e.g.:
  - (7) interactions (0) between (5) PROT (9) PROT (17) .
• Any pair of proteins in a sentence, if tagged as interacting, forms a positive example; otherwise it forms a negative example.
• Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.
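A sketch of how such a rule can be matched against a tagged sentence: each "(g)" allows at most g arbitrary tokens before the next required element, and PROT matches any token tagged as a protein (represented literally here; the trailing gap to the sentence end is omitted):

    def rule_matches(rule, tokens):
        """rule: list of (max_gap, word) pairs, e.g. the slide's rule
        [(7, 'interactions'), (0, 'between'), (5, 'PROT'), (9, 'PROT')]."""
        pos = 0
        for max_gap, word in rule:
            window = tokens[pos:pos + max_gap + 1]   # word must fall in this window
            if word not in window:
                return False
            pos += window.index(word) + 1
        return True

    rule = [(7, "interactions"), (0, "between"), (5, "PROT"), (9, "PROT")]
    sent = ("Physical and functional interactions between the transcriptional "
            "inhibitors PROT and PROT .").split()
    print(rule_matches(rule, sent))   # True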

Generalizing Rules using Longest Common Subsequence
"The self-association site appears to be formed by interactions between helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of alpha spectrin repeat 1 of the other dimer to form two combined alpha beta triple-helical segments."
"Title – Physical and functional interactions between the transcriptional inhibitors Id3 and ITF-2b."
  => - (7) interactions (0) between (5) PROT (9) PROT (17) .
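The generalization of two rules rests on the textbook longest-common-subsequence dynamic program over token sequences; a minimal version, with ELCS's gap-bound bookkeeping omitted:

    def lcs(xs, ys):
        """One longest common subsequence of two token sequences,
        via the standard O(len(xs) * len(ys)) dynamic program."""
        m, n = len(xs), len(ys)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if xs[i] == ys[j]
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        out, i, j = [], m, n                      # backtrack to recover tokens
        while i and j:
            if xs[i - 1] == ys[j - 1]:
                out.append(xs[i - 1]); i -= 1; j -= 1
            elif dp[i - 1][j] >= dp[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return out[::-1]

    a = "interactions between helices of PROT with PROT".split()
    b = "interactions between the inhibitors PROT and PROT".split()
    print(lcs(a, b))   # ['interactions', 'between', 'PROT', 'PROT']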

The ELCS Framework
• A greedy-covering, bottom-up rule induction method is used to cover all the positive examples without covering many negative examples.
• We use an algorithm similar to beam search that considers only the n = 25 best rules for generalization at any time.
• The confidence level of a rule is based on the number of positive and negative examples the rule covers, while allowing some margin for noise (Cestnik, 1990); see the estimate below.
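Cestnik (1990) is the m-estimate paper, so the confidence presumably has this general form, with p and n the numbers of positive and negative examples covered by the rule, p_0 the prior probability of the positive class, and m a parameter controlling the tolerance for noise:

    \text{conf}(r) = \frac{p + m \cdot p_0}{p + n + m}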

Protein Interaction Extraction Results
[Figure: results chart.]

Protein Interaction Extraction Results (full)
[Figure: results chart.]

Ongoing and Future Work
• Extracted proteins and their interactions from 753,459 Medline abstracts on human biology. Evaluation of the results is in progress.
• Improve the RMN approach with better local and global templates, better candidate entity generation, and better algorithms for probabilistic inference.
• Extend the RMN approach to handle extracting relations between entities.
• Evaluate the RMN approach on other biological entities and relations and on other, non-biological corpora.
• Reduce human effort by actively selecting the best training examples for human labeling.
• Combine evidence from text with other biological data sources to derive accurate, comprehensive gene networks.

Conclusions
• We have compared a wide variety of existing machine-learning methods for extracting human protein names and interactions.
• The CRF approach performs best among the existing methods.
• We developed a new, more general approach based on RMNs that allows collective extraction, integrating information across all potential extractions.
• For extracting protein interactions, we found that several methods for learning extraction rules outperform hand-written rules with respect to precision and robustness to noisy protein tags.

The End