Protein-protein interaction networks
Most ideas on biological networks are based on the initial high-throughput (HT) data for Saccharomyces cerevisiae (Y2H screens by Uetz and by Ito): unicellular, compartmentalized, all proteins.
Program for today: further experimental data sets
(1) C. elegans: multicellular, focus on proteins that promote cell-cell interactions
(2) Drosophila melanogaster
(3) Classification of networks
(4) Essentiality of proteins
8. Lecture WS 2004/05 Bioinformatics III

Current status of research on real networks
► Many systems show the small-world property.
► Statistical abundance of "hubs": p(k) follows a power-law distribution. Consequence: robustness to random damage, vulnerability to targeted attacks.
► Most complex networks are the result of a growth process.
► View networks as dynamical systems that evolve through the successive addition and deletion of vertices and edges. This requires a dynamical theory.
What is left?
Eur. Phys. J. B 38, 143 (2004)

10 questions
Are there formal ways of classifying the structure of different growing models? Open questions concern the universality of some topological properties, the correlations introduced by the dynamic process, and the interplay between clustering, hierarchies and centralities in networks. Aim for rigorous mathematical analysis.
Are there further statistical distributions (besides p(k), <C>, <L> and the degree-degree correlation p(k, k')) that can provide insight into the structure and classification of complex networks? Why are most networks modular?
Are there universal features of network dynamics? Networks are specified not only by their topology but also by the dynamics of information or traffic flow along the links. Aim for a mathematical characterization of the general principles describing the networks' dynamics.
Eur. Phys. J. B 38, 143 (2004)

10 open questions
How do the dynamical processes taking place on a network shape the network topology? Dynamics, traffic and underlying topology of networks are mutually correlated. We need large empirical data sets that simultaneously capture the topology of the network and the time-resolved dynamics taking place on it.
What are the evolutionary mechanisms that shape the topology of biological networks? Uncovering evolution is relatively simple in technological and large infrastructure networks, but not in biological networks. Here, the role of evolution and selection in shaping biological networks is still unclear, especially if we want a quantitative dynamical implementation of evolutionary principles in network modeling.
Eur. Phys. J. B 38, 143 (2004)

10 questions
How to quantify the interaction between networks of different character (network of networks)? Most networks are interconnected, forming networks of networks, e.g. the Internet, energy and power distribution, or a gene network interconnected with the protein-protein interaction network and with the metabolic network. Understanding and characterizing the complicated set of regulatory and feedback mechanisms connecting various networks is one of the most ambitious tasks in network research.
How to characterize small networks? Many real-world networks are far from being large-scale objects that are well described by statistical measures. Scaling behavior or average properties are not well defined for small networks; new concepts and mathematics are required.
Why are social networks all assortative, while all biological and technological networks are disassortative? In social networks, hubs tend to connect to each other ("assortative"). In biological and technological networks, hubs mostly connect to less connected nodes.
Eur. Phys. J. B 38, 143 (2004)

C. elegans protein-protein interaction network
As Y2H baits, select a set of 3024 predicted worm proteins that relate directly or indirectly to multicellular functions. Cloned ORFs that do not autoactivate the Y2H GAL1::HIS3 reporter: 1873.
Only consider interactions that activate at least 2 out of 3 different Gal4-responsive promoters. Divide them into 3 confidence classes:
Core-1: 858 interactions, found at least 3 times independently
Core-2: 1299 interactions, found < 3 times, passed retest
Non-Core: 1892 other interactions found in the Y2H screen
Li et al., Science 303, 540 (2004)

C. elegans protein-protein interaction network
Coaffinity purification assays. Shown are 10 examples from the Core-1, Core-2, and Non-Core data sets. The top panels show Myc-tagged prey expression after affinity purification on glutathione-Sepharose, demonstrating binding to GST-bait. The middle and bottom panels show expression of Myc-prey and GST-bait, respectively. The lanes alternate between extracts expressing GST-bait proteins (+) and GST alone (–). ORF pairs are identified in table S1, with the lane number corresponding to the order in which they appear in the table.
Li et al., Science 303, 540 (2004)

Confidence of interactions
Li et al., Science 303, 540 (2004)

C. elegans protein-protein interaction network
Estimate coverage: of 108 known interactions in WormPD (the "literature data set"), only 8 Core and 2 Non-Core interactions are recovered. Against this benchmark, coverage is ca. 10% of all interactions.
In silico searches for potentially conserved interactions ("interologs"): protein pairs whose orthologs are known to interact in one or more other species. High-confidence yeast interaction data yield 949 potential worm interologs. Together with other data, this gives 5534 interactions for worm, connecting 15% of the C. elegans proteome.
Li et al., Science 303, 540 (2004)
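
The interolog search described above can be sketched as a lookup of ortholog pairs in a known interaction set. This is a minimal illustration only: the function name, the one-to-one ortholog map and the toy identifiers are assumptions, not the procedure or data of Li et al.

```python
def predict_interologs(yeast_interactions, worm_to_yeast):
    """Return worm protein pairs whose yeast orthologs are known to interact."""
    # Store yeast interactions as unordered pairs so lookup ignores direction.
    known = {frozenset(pair) for pair in yeast_interactions}
    worm = sorted(worm_to_yeast)
    predicted = set()
    for i, a in enumerate(worm):
        for b in worm[i + 1:]:
            if frozenset((worm_to_yeast[a], worm_to_yeast[b])) in known:
                predicted.add((a, b))
    return predicted


# Toy data with hypothetical identifiers.
yeast_ppi = [("Y1", "Y2"), ("Y2", "Y3")]
orthologs = {"w1": "Y1", "w2": "Y2", "w3": "Y4"}
interologs = predict_interologs(yeast_ppi, orthologs)
```

In this toy case only the pair (w1, w2) is predicted, because w3 maps to a yeast protein without a known interaction partner.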

Analysis of the C. elegans protein-protein network
(A) Nodes (proteins) are colored according to their phylogenetic class: ancient (red), multicellular (yellow), and worm (blue). The giant network component contains 2898 nodes connected by 5460 edges. The inset highlights a small part of the network.
(B) The proportion of proteins, P(k), with different numbers of interacting partners, k, is shown for C. elegans proteins used as baits or preys and for S. cerevisiae proteins. Again, the worm interactome network exhibits small-world and scale-free properties.
(C) Do evolutionarily recent proteins preferentially interact with each other?
"Ancient": 748 proteins with a yeast ortholog
"Multicellular": 1314 proteins with an ortholog in Drosophila, Arabidopsis or human but not in yeast
"Worm": 836 proteins with no ortholog outside of worm
The 3 groups connect equally well with each other. This suggests that new cellular functions rely on a combination of evolutionarily new and ancient elements.
Li et al., Science 303, 540 (2004)

Relate interactome with transcriptome and phenome
(D) Overlap with transcriptome, as in yeast: Pearson correlation coefficients (PCCs) were calculated and graphed for each pair of proteins in the interaction data sets and their corresponding randomized data sets. The red area corresponds to interactions that show a significant relationship to expression profiling data (P < 0.05). 9.5% of interacting core proteins are co-expressed. Note: 75% of literature pairs (= biologically relevant) do not co-express.
(E) Example of a highly connected cluster (Y2H interactions) where both proteins belong to common C. elegans expression clusters.
(F) Proportion of interaction pairs where both genes are embryonic lethal (P < 10^-7).
Li et al., Science 303, 540 (2004)
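
As a reminder of what is computed in (D), here is a minimal sketch of the Pearson correlation coefficient between two expression profiles. The profiles are made-up toy vectors, not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical expression profiles over four conditions.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]   # rises with profile_a
profile_c = [4.0, 3.0, 2.0, 1.0]   # falls as profile_a rises

pcc_ab = pearson(profile_a, profile_b)
pcc_ac = pearson(profile_a, profile_c)
```

Interacting pairs would then be compared against randomized pairs by looking at the distributions of such coefficients.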

Example of highly connected subnetwork
Conclusions:
- The Y2H data set provides functional hypotheses for thousands of uncharacterized proteins of C. elegans.
- Integration with other functional genomic data indicates that the correlation between transcriptome and interactome data is lower than that found in yeast.
Explanation? Biological processes in multicellular organisms may occur differently across the organism, in various organs, tissues, or single cells.
Li et al., Science 303, 540 (2004)

Drosophila is an important model for human biology. A high-throughput Y2H screen identified 20405 interactions involving 7048 proteins.
Giot et al., Science 302, 1727 (2003)

Confidence scores for protein-protein interactions
Manual method: an expert biologist reviewed the list of interactions on the basis of the names of the proteins in each interaction pair.
High-confidence interactions: those published previously or those involving two proteins of the same complex.
Low-confidence interactions: those unlikely to occur in vivo, e.g. an interaction between a nuclear and an extracellular protein.
Automated method: compare Drosophila interactions with those in yeast.
Positive examples: interacting proteins whose yeast orthologs interact as well.
Negative examples: Drosophila interactions whose yeast orthologs are 3 or more protein-protein interaction links apart (a pair of random yeast proteins has a distance of 2.8).
Positive training set: 129 examples (70 manual, 65 automated, 6 common).
Negative training set: 196 examples (88 manual, 112 automated, 4 common).
Giot et al., Science 302, 1727 (2003)
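
The training set sizes follow from counting the examples common to both methods only once. Writing P and N for the positive and negative sets (symbols introduced here for illustration):

```latex
\begin{align}
|P| &= 70 + 65 - 6 = 129, \\
|N| &= 88 + 112 - 4 = 196.
\end{align}
```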

Confidence scores for protein-protein interactions
Fit a generalized linear model to the training set with the predictors
- number of times each interaction was observed
- number of interaction partners of each protein
- local clustering of the network
- gene region (5' untranslated region, coding sequence, 3' untranslated region)
Validate selection: the confidence score for an interaction correlates well with the correlation of gene ontology (GO) annotation (see C).
Find dividing surface.
Giot et al., Science 302, 1727 (2003)
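
A minimal sketch of how such a model turns predictors into a confidence score, assuming a logistic link. The feature names, weights and intercept below are hypothetical placeholders, not the fitted values of Giot et al., and the categorical gene-region predictor is left out for brevity.

```python
from math import exp


def confidence(features, weights, intercept):
    """Logistic score in (0, 1) from a weighted sum of the predictors."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients for illustration only.
weights = {"times_observed": 0.8, "partners_bait": -0.05,
           "partners_prey": -0.05, "clustering": 1.5}

seen_once = {"times_observed": 1, "partners_bait": 10,
             "partners_prey": 4, "clustering": 0.0}
seen_often = dict(seen_once, times_observed=4)

score_once = confidence(seen_once, weights, intercept=-1.0)
score_often = confidence(seen_often, weights, intercept=-1.0)
```

With these made-up weights, an interaction observed four times scores higher than the same interaction observed once; the dividing surface corresponds to a fixed cut on the score.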

Confidence scores for protein-protein interactions
(D) p(k) for all interactions (black circles) and for the high-confidence interactions (green circles). Linear behavior in this log-log plot would indicate a power-law distribution. Although regions of each distribution appear linear, neither distribution can be adequately fit by a single power law. Both can be fit, however, by a combination of power-law and exponential decay.
The faster decay for the high-confidence interactions may indicate that highly connected proteins are suppressed in biological networks.
Giot et al., Science 302, 1727 (2003)
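
One common way to write a combination of power-law and exponential decay is a power law with an exponential cutoff. The symbols α and β are fit parameters introduced here for illustration; the slide gives no values for them.

```latex
\begin{equation}
p(k) \propto k^{-\alpha} \, e^{-\beta k}
\end{equation}
```

For β = 0 this reduces to a pure power law (a straight line in the log-log plot); a larger β bends the curve downward at high k.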

Statistical properties of refined Drosophila PI map
The high-confidence Drosophila protein-protein interactions form a small-world network with evidence for a hierarchy of organization. Network properties are presented for the giant connected component, in which 3659 pairwise interactions connect 3039 proteins into a single cluster.
(A) The probability distribution for the shortest path between a pair of proteins in the actual network (green points) peaks at 9 to 11 links, with a mean of 9.4 links. In contrast, an ensemble of randomly rewired networks shows a mean separation of 7.7 links between proteins. Biological organization may be responsible for flattening the actual network by enhancing links between proteins that are already close.
Giot et al., Science 302, 1727 (2003)
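
The shortest-path distribution in (A) can be obtained by breadth-first search from every node. The sketch below assumes an undirected, connected graph stored as a dictionary of neighbor sets; the four-node chain is a toy example, not the Drosophila network.

```python
from collections import deque


def path_length_distribution(adj):
    """Count unordered node pairs by their shortest-path distance."""
    counts = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target > source:           # count each unordered pair once
                counts[d] = counts.get(d, 0) + 1
    return counts


# Toy chain A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
counts = path_length_distribution(chain)
mean_length = sum(d * c for d, c in counts.items()) / sum(counts.values())
```

Running the same function on randomly rewired copies of a network gives the reference distribution to compare against.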

Statistical properties of refined Drosophila PI map
(B) Clustering is analyzed quantitatively by counting the number of closed loops (triangles, squares, pentagons, etc.) in which the perimeter is formed by a series of proteins connected head-to-tail, with no protein repeated. The actual network (green points) shows an enhancement of loops with perimeter up to 10 to 11 relative to the random network (red points).
In both (A) and (B), the one-level and two-level models produce nearly indistinguishable fits for the random networks, indicating the absence of structured clustering (= a hierarchy where proteins connect to protein complexes, and a second level where protein complexes interact with each other).
Giot et al., Science 302, 1727 (2003)
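
As an illustration of loop counting, the sketch below counts only the shortest closed loops, triangles, in an undirected graph stored as a dictionary of neighbor sets. Longer perimeters need a more general cycle enumeration, which is not shown; the toy graph is made up.

```python
from itertools import combinations


def count_triangles(adj):
    """Count closed loops of perimeter 3, each exactly once."""
    triangles = 0
    for u in adj:
        # sorted() guarantees v < w; requiring u < v fixes the order u < v < w.
        for v, w in combinations(sorted(adj[u]), 2):
            if u < v and w in adj[v]:
                triangles += 1
    return triangles


# Toy graph: triangle A-B-C with a pendant node D attached to C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
n_triangles = count_triangles(toy)
```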

Protein family/human disease ortholog view
Proteins are color-coded according to protein family as annotated by the GO hierarchy. Proteins orthologous to human disease proteins have a jagged, starry border. Interactions were sorted according to interaction confidence score, and the top 3000 interactions are shown with their corresponding 3522 proteins.
This representation is particularly relevant to understanding human diseases and potential treatments.
Giot et al., Science 302, 1727 (2003)

Subcellular location view
(B) Subcellular localization view. This view shows the fly interaction map with each protein colored by its GO Cellular Component annotation. The map has been filtered to show only proteins with at most 20 interactions and at least one GO annotation, and only interactions with a confidence score of 0.5 or higher. This results in a map with 2346 proteins and 2268 interactions.
The view allows annotation of subcellular localization and potential function for proteins not annotated so far.
Giot et al., Science 302, 1727 (2003)

Local interaction networks involved in (A) transcription, (B) splicing, (C) signal transduction
Giot et al., Science 302, 1727 (2003)

Cell cycle regulation: Drosophila Skp pathway
Network surrounding the Skp protein complex that targets proteins to ubiquitin-mediated proteasomal degradation. Target proteins are recruited to the Skp complex by F-box proteins. Among the Skp proteins, only SkpA was reported to bind the F-box proteins Morgue and Slmb.
Giot et al., Science 302, 1727 (2003)

Local pathway views
10 examples of local pathway views identified in the interaction network
Giot et al., Science 302, 1727 (2003)

The functional significance of a gene is defined by its essentiality. A pair of non-essential genes can be synthetically lethal (cell death occurs when both genes are deleted simultaneously).
Hypothesis of "marginal benefit": many non-essential genes make significant but small contributions to the fitness of the cell, although the effects may not be sufficiently large to be detected by conventional methods.
Yu et al., Trends Genet. 20, 227 (2004)

"Marginal essentiality"
Define "marginal essentiality" (M) as a quantitative measure of the importance of a non-essential gene to a cell. Incorporate results from a diverse set of 4 large-scale knockout experiments that examined different aspects of the impact of a protein on the fitness of a yeast cell: (i) growth rate, (ii) phenotypes under diverse environments, (iii) sporulation efficiency, (iv) sensitivity to small molecules.
Previous findings:
(Jeong & Barabasi) hubs tend to be essential
(Fraser et al.) the effect of an individual protein on cell fitness correlates with its number of interaction partners k
Analyzed data set: 4743 yeast proteins, 23294 unique interactions
Yu et al., Trends Genet. 20, 227 (2004)

Basic analysis
Essential proteins have ca. twice as many links as non-essential proteins. The degree distribution of essential proteins has a shallower slope, i.e. a larger proportion of them are "hubs".
Yu et al., Trends Genet. 20, 227 (2004)

Determine hubs that are essential
"Hub": a protein that belongs to the upper 25% of proteins with the most links.
(a) 43% of the hubs are essential vs. 20% for random proteins. Within the network, essential proteins tend to be more cliquish and more closely connected to each other (see mean path length).
(b) Out-degree: transcription factors (TFs) with many (> 100) targets are more often essential than the other TFs.
(c) In-degree: genes regulated by many TFs are less likely to be essential than genes regulated by few TFs.
(d) Genes with more functions are more likely to be essential.
Yu et al., Trends Genet. 20, 227 (2004)
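
The hub definition and the comparison in (a) can be sketched as follows. The degrees and the set of essential proteins are invented toy values, and ties at the 25% boundary are broken arbitrarily here, which may differ from the original analysis.

```python
def hub_essentiality(degree, essential, top_fraction=0.25):
    """Return the hubs and the essential fraction among hubs and overall."""
    ranked = sorted(degree, key=degree.get, reverse=True)
    n_hubs = max(1, int(len(ranked) * top_fraction))
    hubs = ranked[:n_hubs]                 # upper 25% by number of links
    frac_hubs = sum(p in essential for p in hubs) / n_hubs
    frac_all = sum(p in essential for p in degree) / len(degree)
    return hubs, frac_hubs, frac_all


# Hypothetical degrees and essentiality labels.
degree = {"p1": 10, "p2": 8, "p3": 3, "p4": 2,
          "p5": 2, "p6": 1, "p7": 1, "p8": 1}
essential = {"p1", "p5"}
hubs, frac_hubs, frac_all = hub_essentiality(degree, essential)
```

In this toy set half of the hubs but only a quarter of all proteins are essential, mirroring the direction of the 43% vs. 20% finding.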

Properties of essential genes
Most essential genes are "house-keeping" genes: their expression level is much higher and the fluctuation of their expression is much lower than for non-essential genes.
Essential genes tend to be regulated by fewer factors, whereas non-essential genes often use more regulators to control the expression of their gene products.
Explanation? Essential proteins perform the most basic and important functions within the cell and always need to be switched "on". Their expression should not be regulated by many factors, because this would make the essential genes dependent on the viability of more regulators, and the cell less stable.
Yu et al., Trends Genet. 20, 227 (2004)

Analysis of non-essential genes
The more marginally essential a protein is, the more likely it is
- to have a large number of interaction partners (a, b)
- to be closely connected to other proteins (short characteristic path length) (c)
- to be one of the 1061 hubs (d)
Yu et al., Trends Genet. 20, 227 (2004)

Evolution of Drosophila melanogaster PI network
Various network growth and evolution models have been developed to explain the properties of real-world networks:
- small world (Watts & Strogatz)
- preferential attachment (Barabasi et al.)
- duplication-mutation mechanisms
However, model parameters can often be tuned such that multiple models with widely varying mechanisms fit the motivating real network perfectly in terms of single selected features such as the scale-free exponent and the clustering coefficient.
Aim here: use a discriminative classification technique from machine learning to assign a given real network to one of many proposed network mechanisms by enumerating local substructures.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Analysis of Drosophila melanogaster protein interaction network
Data set: protein-protein interaction map for Drosophila by Giot et al.
Problem: the data set is subject to numerous false positives. Giot et al. assign a confidence score p ∈ [0, 1] to each interaction, measuring how likely the interaction is to occur in vivo.
What threshold p* should be used? Measure the size of the components for all possible values of p*. Observe: for p* = 0.65, the two largest components are connected; use this value as threshold.
Edges in the graph correspond to interactions for which p > p*. After removing self-interactions and isolated vertices, this leaves 3359 (4625) nodes with 2795 (4683) edges for p* = 0.65 (0.5).
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)
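
The threshold scan can be sketched by building the graph for a given p* and listing its connected-component sizes. The scored edge list below is a made-up toy example; the function name and the data layout (tuples of two proteins and a score) are assumptions.

```python
def component_sizes(scored_edges, threshold):
    """Component sizes (largest first) of the graph with edges p > threshold."""
    adj = {}
    for u, v, p in scored_edges:
        if p > threshold and u != v:       # drop weak edges and self-interactions
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    seen, sizes = set(), []
    for start in adj:                      # isolated vertices never enter adj
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                       # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical scored interactions.
edges = [("a", "b", 0.9), ("b", "c", 0.7), ("c", "d", 0.6),
         ("d", "e", 0.8), ("f", "f", 0.95)]
sizes_strict = component_sizes(edges, 0.65)
sizes_loose = component_sizes(edges, 0.5)
```

In this toy example the stricter threshold splits the graph into two components, while the looser one joins them through the edge with score 0.6.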

Network evolution models considered
Duplication-mutation-complementation (DMC) algorithm: based on a model which proposes that most of the duplicate genes observed today have been preserved by functional complementation. If either the gene or its copy loses one of its functions (edges), the other becomes essential in assuring the organism's survival.
Algorithm: a duplication step is followed by mutations that preserve functional complementarity.
At every time step, choose a node v at random. A twin vertex vtwin is introduced, copying all of v's edges.
For each edge of v, delete with probability qdel either the original edge or the corresponding edge of vtwin.
Conjoin the twins themselves with independent probability qcon, representing an interaction of a protein with its own copy.
No edges are created by mutations: the DMC algorithm assumes that the probability of creating new advantageous functions by random mutations is negligible.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)
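
The steps above can be sketched as follows. This is a minimal reading of the slide, not the authors' implementation: the function names, the integer node labels, the two-node seed graph, and the equal chance of removing the original or the twin edge are assumptions.

```python
import random


def dmc_step(adj, q_del, q_con, rng):
    """One duplication-mutation-complementation step on an undirected graph."""
    v = rng.choice(sorted(adj))
    twin = max(adj) + 1
    adj[twin] = set(adj[v])                # duplication: copy all of v's edges
    for u in adj[twin]:
        adj[u].add(twin)
    for u in list(adj[v]):                 # neighbors shared by v and its twin
        if rng.random() < q_del:
            loser = rng.choice((v, twin))  # only one of the two edges is lost
            adj[loser].discard(u)
            adj[u].discard(loser)
    if rng.random() < q_con:               # conjoin the twins
        adj[v].add(twin)
        adj[twin].add(v)
    return twin


def grow_dmc(n_nodes, q_del, q_con, seed=0):
    """Grow a graph from two connected seed nodes until it has n_nodes."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_nodes:
        dmc_step(adj, q_del, q_con, rng)
    return adj


network = grow_dmc(50, q_del=0.5, q_con=0.1, seed=1)
```

Because at most one of the two copies of an edge is deleted, every former neighbor of v stays connected to v or to its twin, which is the complementation idea. Note that this sketch can produce isolated vertices, which would have to be removed before a comparison with the filtered Drosophila network.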

Network evolution models considered
Variant of DMC: duplication-random mutations (DMR) algorithm. Possible interactions between twins are neglected. Instead, edges between vtwin and the neighbors of v can be removed with probability qdel, and new edges can be created at random between vtwin and any other vertices with probability qnew/N, where N is the current total number of vertices. DMR emphasizes the creation of new advantageous functions by mutation.
Other models:
- linear preferential attachment (LPA) (Barabasi)
- random static networks (RDS) (Erdős-Rényi)
- random growing networks (RDG: growing graphs where new edges are created randomly between existing nodes)
- aging vertex networks (AGV: growing graphs modeling citation networks, where the probability for new edges decreases with the age of the vertex)
- small-world networks (SMW: interpolation between regular ring lattices and randomly connected graphs)
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)
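
For comparison with DMC, a DMR step can be sketched like this. Again the names, integer labels and seed graph are assumptions, and "any other vertices" is read here as all vertices that are neither v nor already neighbors of v.

```python
import random


def dmr_step(adj, q_del, q_new, rng):
    """One duplication-random-mutation step on an undirected graph."""
    nodes = sorted(adj)
    n = len(nodes)                         # current number of vertices N
    v = rng.choice(nodes)
    twin = max(nodes) + 1
    # Inherited edges: each copied edge is removed with probability q_del.
    inherited = {u for u in sorted(adj[v]) if rng.random() >= q_del}
    # New edges: created with probability q_new / N to other vertices.
    created = {u for u in nodes
               if u != v and u not in adj[v] and rng.random() < q_new / n}
    adj[twin] = inherited | created
    for u in adj[twin]:
        adj[u].add(twin)
    return twin


def grow_dmr(n_nodes, q_del, q_new, seed=0):
    """Grow a graph from two connected seed nodes until it has n_nodes."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_nodes:
        dmr_step(adj, q_del, q_new, rng)
    return adj


network = grow_dmr(50, q_del=0.3, q_new=0.5, seed=1)
```

Unlike in the DMC sketch, the original edges of v are never touched and the twins are never joined directly.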

Training set
Create 1000 graphs as training data for each of the seven different models. Every graph is generated with the same number of edges and nodes as measured in Drosophila.
Quantify the topology of a network by counting all possible subgraphs up to a given cut-off, which could be the number of nodes, the number of edges, or the length of a given walk.
Here: count all subgraphs that can be constructed by a walk of length 8 (148 non-isomorphic subgraphs) or length 7 (130 non-isomorphic subgraphs). Use these counts as input features for the classifier.
Note that the average shortest path between two nodes of the Drosophila network's giant component is 11.6 (9.4) for p* = 0.65 (0.5). Walks of length 8 can traverse large parts of the network.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Learning algorithm: Alternating Decision Tree
Rectangles: decision nodes. A given network's subgraph counts determine paths in the tree dictated by inequalities specified by the decision nodes.
For each class, the ADT outputs a real-valued prediction score, which is the sum of all weights over all paths.
The class with the highest score wins.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)
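
The scoring rule can be sketched with a small hand-written tree. The rule layout (name, parent branch, feature, threshold, two weights), the feature names and all numbers are hypothetical; a real ADT is learned by boosting, which is not shown here.

```python
def adt_score(root_value, rules, counts):
    """Sum the weights of all prediction nodes reached for one class."""
    score = root_value
    outcome = {}                           # branch taken at each reached rule
    for name, parent, feature, threshold, w_true, w_false in rules:
        if parent is not None:
            parent_name, parent_branch = parent
            if outcome.get(parent_name) != parent_branch:
                continue                   # this decision node is not reached
        branch = counts[feature] < threshold
        outcome[name] = branch
        score += w_true if branch else w_false
    return score


# Hypothetical rules for one class; parents are listed before their children.
rules_dmc = [
    ("r1", None, "triangles", 5, -1.0, 1.0),
    ("r2", ("r1", False), "chains8", 10, 0.25, -0.5),
    ("r3", ("r1", True), "chains8", 10, -0.25, 0.5),
]
counts = {"triangles": 12, "chains8": 3}
scores = {"DMC": adt_score(0.5, rules_dmc, counts),
          "LPA": adt_score(-0.5, [], counts)}
winner = max(scores, key=scores.get)
```

Here the network has many triangles, so rule r1 takes its "false" branch, only r2 is reached below it, and the DMC score is 0.5 + 1.0 + 0.25.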

Performance on training set
The confusion matrix shows truth and prediction for the test sets. 5 out of 7 classes have nearly perfect prediction accuracy.
AGV is constructed as an interpolation between LPA and a ring lattice. The AGV, LPA and SMW mechanisms are therefore equivalent in specific parameter regimes and show a non-negligible overlap.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Discriminating similar networks
Ten graphs from each of two different mechanisms exhibit similar average geodesic lengths and almost identical degree distributions and clustering coefficients.
(a) Cumulative degree distribution p(k > k0), average clustering coefficient <C> and average geodesic length <L>, all quantities averaged over a set of 10 graphs.
(b) Prediction scores for all ten graphs and all five cross-validated ADTs. The two sets of graphs can be perfectly separated by the classifier.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Learning algorithm: Alternating Decision Tree
The figure shows the first few decision nodes (out of 120) of a resulting ADT.
The prediction scores reveal that a high count of 3-cycles suggests a DMC network. The DMC mechanism indeed facilitates the creation of many 3-cycles by allowing the 2 copies to attach to each other, thus creating 3-cycles with their common neighbors.
A low count of 3-cycles but a high count of 8-edge linear chains is a good predictor for LPA and DMR networks.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Prediction for Drosophila melanogaster network
Now use this classifier (ADT), which has good prediction accuracy, to determine the network mechanism that best reproduces the Drosophila network (or any network of the same size).
Shown are prediction scores for the Drosophila protein network for different confidence thresholds p* and different cut-offs in subgraph size.
Drosophila is consistently classified as a DMC network, with an especially strong prediction for a confidence threshold of p* = 0.65, independently of the cut-off in subgraph size.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Visualization of subgraphs
A qualitative and more intuitive way of interpreting the classification result is to visualize the subgraph profiles.
Subgraphs associated with Figures 3 and 1. A representative subset of 50 subgraphs out of 148 is shown.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Subgraph profiles
The average subgraph count of the training data for every mechanism is shown for the 50 representative subgraphs S1-S50. Black lines indicate that this model is closest to Drosophila based on the absolute difference between the subgraph counts.
For 60% of the subgraphs (S1-S30), the counts for Drosophila are closest to the DMC model. All of these subgraphs contain one or more cycles, including highly connected subgraphs (S1) and long linear chains ending in cycles (S16, S18, S22, S23, S25).
The DMC algorithm is the only mechanism that produces such cycles with a high occurrence.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)

Robustness against noise
Edges in the Drosophila network are randomly replaced and the network is classified again. Plotted are the prediction scores for each of the 7 classes as more and more edges are replaced. Every point is an average over 200 independent random replacements.
For high noise levels (beyond 80%), the network is classified as an Erdős-Rényi (RDS) graph. For low noise (< 30%), the confidence in the classification as a DMC network is even higher than the confidence in the classification as an RDS network at high noise.
The prediction score y(c) for class c is related to the estimated probability p(c) that the tested network belongs to class c.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)
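
The formula linking y(c) and p(c) is not preserved in this transcript. For classifiers trained by boosting, such as ADTs, a commonly used relation is the logistic form below; it is given here as an assumption, not as a quotation from the slide or the paper.

```latex
\begin{equation}
p(c) = \frac{e^{2 y(c)}}{1 + e^{2 y(c)}}
\end{equation}
```

Under this form, y(c) = 0 corresponds to p(c) = 1/2, and large positive or negative scores push p(c) toward 1 or 0.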

Conclusions
Very nice (!) method that allows one to infer growth mechanisms for real networks.
The method is robust against noise and data subsampling; no prior assumption about network features or topology is required.
The learning algorithm does not assume any relationships between features (e.g. orthogonality). Therefore the input space can be augmented with various features in addition to subgraph counts.
The protein interaction network of Drosophila is confidently classified as a DMC network.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)