  • Number of slides: 43

Unifying measures of gene function and evolution
Eugene V. Koonin, National Center for Biotechnology Information, NIH, Bethesda
"Nothing in (systems) biology makes sense except in the light of evolution" (after Theodosius Dobzhansky, 1970)
Wolf, Carmel, Koonin, Proc. Roy. Soc. B, in press

Systems Biology and Evolution
With the advent of OMICS data… The game of correlations began…

Evolutionary systems biology:
• In principle, we address the classical problem: the relationship between the (largely neutral?) evolution of the genome and the (largely adaptive) evolution of the phenotype.
• In practice, the progress of genomics + other OMICS allows us to measure, on a whole-genome scale, the effects of all kinds of molecular phenotypic characteristics (expression level, protein-protein interactions, etc.) on evolutionary rates – this typically yields weak, even if significant, correlations.
• Can we synthesize these measurements to produce a coherent picture of the links between phenomic and genomic evolution?

The Cautionary Tale
"It was six men of Indostan / To learning much inclined, / Who went to see the Elephant / (Though all of them were blind), / That each by observation / Might satisfy his mind" (J. G. Saxe)

The Cautionary Tail
"…each was partly in the right / And all were in the wrong" (J. G. Saxe)

Different Faces of the Hypercube?
[Figure labels: Pairwise correlations; Synthesis]

Analysis of Multidimensional Data

Analysis of Multidimensional Data
[Figure labels: "fair world" model; "unfair world" model]

Analysis of Multidimensional Data
[Figure: data cloud with axes PC1, PC2, PC3]
Principal Components Analysis (PCA) introduces a new orthogonal coordinate system where axes are ranked by the fraction of original variance accounted for.

PCA
• PCA takes a set of variables and defines new variables that are linear combinations of the initial variables.
• PCA expects the variables you enter to be correlated (as is the case in the correlation game of Systems Biology).
• PCA returns new, uncorrelated variables, the principal components or axes, that summarize the information contained in the original full set of variables.
• PCA does not test any hypotheses or predict values for dependent variables; it is more of an exploratory technique.
• The data entered represent a cloud of points in n-space.
• The cloud is, typically, longer in one direction than another, and that longest dimension is where the points are the most different; that's where PCA draws a line called the first principal component.
• The first principal component is guaranteed to be the line that places your sample points the farthest apart from each other; in that way, PCA "extracts the most variance" from your data. This process is repeated, each time orthogonally to the components already found, to get multiple components, or axes.
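The points above can be illustrated with a minimal sketch. It assumes PCA on standardized variables (eigen-decomposition of the correlation matrix), which the slides do not state explicitly; the function name `pca` and the synthetic data are hypothetical and serve only as an illustration.

```python
import numpy as np

def pca(data):
    """PCA by eigen-decomposition of the correlation matrix.

    data: array of shape (n_samples, n_variables) without missing values.
    Returns loadings (one column per component), scores, and the
    fraction of variance accounted for by each component.
    """
    # Standardize each variable to mean 0 and standard deviation 1.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # rank axes by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs                        # coordinates on the new axes
    explained = eigvals / eigvals.sum()
    return eigvecs, scores, explained

# Synthetic example: three correlated variables plus one independent one.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
columns = [latent + 0.5 * rng.normal(size=(500, 1)) for _ in range(3)]
columns.append(rng.normal(size=(500, 1)))
data = np.hstack(columns)

loadings, scores, explained = pca(data)
print(np.round(explained, 3))
```

The new variables in `scores` are mutually uncorrelated, and `explained` is sorted from the largest to the smallest fraction of variance, which mirrors the ranking of axes described above.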

The Data Set: KOGs
Ideally, we would like to obtain and synthesize the data on individual genes in precise space-time coordinates (e.g., instantaneous evolutionary rates). However:
• some of the variables are not easily measurable (if defined at all) for genes in extant species [e.g., rate of evolution];
• other variables are measurable in principle but, in practice, are available only for a few species [e.g., expression level];
• much of the data are inherently noisy, either due to technical problems or true biological variation [e.g., fitness effect of gene disruption].
Thus, we analyze orthologous protein sets, using the proteins from different species to derive complementary data and smooth out variations in others. Practically, this means using the KOG dataset (with additions): 10058 KOGs from 15 species (Koonin et al. 2004, Genome Biol).

The Data Set: KOGs
[Figure: species tree with a 100 Myr scale bar; leaves: Arath, Orysa, Dicdi, Enccu, Maggr, Neucr, Schpo, Sacce, Canal, Caeel, Caebr, Drome, Cioin, Homsa, Musmu]
Original KOGs for some species, "index orthologs" for others. 10058 KOGs altogether.

Variables: Gene Loss
Propensity for Gene Loss (PGL), introduced by Krylov et al. (Genome Res. 13, 2229-2235, 2003).
[Figure: phyletic pattern over At, Ce, Dm, Hs, Sc, Sp, Ec, with a gene loss marked]
Computed from the KOG phyletic pattern. Originally an empirical measure (Dollo parsimony reconstruction of events; ratio of branch lengths). In this work, the computation employs an Expectation Maximization algorithm.

Variables: Gene Duplication
Number of Paralogs: the average number observed for a given KOG.
Example: KOG0417 (Ubiquitin-protein ligase) and KOG0424 (Ubiquitin-protein ligase).

KOG0417:
At1g16890 At1g36340 At1g64230 At1g78870 At2g16740 At2g32790 At3g08690 At3g08700 At3g13550 At4g27960 At5g25760 At5g41700 At5g53300 At5g56150
CE03482 CE09712 CE10824 CE28997
7292764 7292948 7295708_2 7296089 7297757 7298165 7299919
Hs17476541 Hs22043797 Hs22054779 Hs22064361 Hs4507773 Hs4507775 Hs4507777 Hs4507779 Hs4507793 Hs5454146 Hs7661808 Hs8393719
YBR082c YDR059c YDR092w YGR133w
SPAC11E3.04c SPAC1250.03 SPBC119.02 SPBC1198.09
ECU10g0940 ECU11g1990

KOG0424:
At3g57870
CE01332 CE09784
7296195
Hs4507785
YDL064w
SPAC30D11.13
ECU01g0940
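As a rough illustration of the "average number of paralogs" for the two example KOGs, the sketch below tallies members by ID prefix and averages over the genomes in which the KOG is present. Both the grouping of IDs by prefix and the choice to average over genomes with at least one member are assumptions; the slide does not define the averaging procedure, and the function name is hypothetical.

```python
# Member counts per genome, tallied by ID prefix from the two member lists
# on the slide. The prefix-to-genome grouping is an assumption
# ("numeric" stands for the IDs that consist of digits only).
member_counts = {
    "KOG0417": {"At": 14, "CE": 4, "numeric": 7, "Hs": 12, "Y": 4, "SP": 4, "ECU": 2},
    "KOG0424": {"At": 1, "CE": 2, "numeric": 1, "Hs": 1, "Y": 1, "SP": 1, "ECU": 1},
}

def average_paralogs(counts_by_genome):
    """Mean number of members per genome, over genomes with at least one member."""
    present = [n for n in counts_by_genome.values() if n > 0]
    return sum(present) / len(present)

np_values = {kog: average_paralogs(c) for kog, c in member_counts.items()}
for kog, value in np_values.items():
    print(kog, round(value, 2))
```

Under these assumptions the two KOGs share the same functional annotation but differ several-fold in their number of paralogs, which is the contrast the example appears intended to show.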

Variables: Evolution Rate
• Select a taxon;
• Build an alignment (MUSCLE);
• Compute distance matrix (PAML);
• Select minimum distance between members of the two subtrees of the group.
Ascomycota: Sordariomycetes vs. Yeasts
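The last step of this procedure can be sketched as follows. This is only an illustration: the distance values and sequence names are invented, the function name `min_cross_distance` is hypothetical, and the alignment and distance-estimation steps (MUSCLE, PAML) are not reproduced here.

```python
import numpy as np

def min_cross_distance(dist, names, subtree_a, subtree_b):
    """Smallest pairwise distance between members of two subtrees.

    dist: symmetric (n, n) distance matrix; names: labels of its rows.
    Returns None when either subtree has no member in the matrix.
    """
    index = {name: i for i, name in enumerate(names)}
    rows = [index[n] for n in subtree_a if n in index]
    cols = [index[n] for n in subtree_b if n in index]
    if not rows or not cols:
        return None
    # np.ix_ selects the block of distances between the two groups.
    return float(dist[np.ix_(rows, cols)].min())

# Hypothetical distances for one KOG within Ascomycota.
names = ["Maggr_1", "Neucr_1", "Sacce_1", "Sacce_2", "Canal_1"]
dist = np.array([
    [0.0, 0.3, 1.2, 1.5, 1.1],
    [0.3, 0.0, 1.0, 1.4, 0.9],
    [1.2, 1.0, 0.0, 0.2, 0.6],
    [1.5, 1.4, 0.2, 0.0, 0.7],
    [1.1, 0.9, 0.6, 0.7, 0.0],
])
sordariomycetes = ["Maggr_1", "Neucr_1"]
yeasts = ["Sacce_1", "Sacce_2", "Canal_1"]

rate = min_cross_distance(dist, names, sordariomycetes, yeasts)
print(rate)
```

Taking the minimum rather than the mean means that, when a KOG contains paralogs, the closest cross-subtree pair is used, which plausibly corresponds to the most conserved (likely orthologous) pair.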

Variables: Expression Level
Expression Level data for S. cerevisiae, D. melanogaster and H. sapiens were downloaded from the UCSC Table Browser (hgFixed).

Organism  Table              No. exp.  No. probes  No. KOGs
Sacce     yeastChoCellCycle        17        6602      3030
Drome     arbFlyLifeAll           162        4921      2617
Homsa     gnfHumanAtlas2All       158       10197      3872

Standardized (μ = 0; σ = 1) log values; median expression level among paralogs was used to represent a KOG.
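A minimal sketch of this preprocessing is given below. It assumes a single expression value per gene and standardization over all genes of one organism; how the individual experiments were combined is not stated on the slide. The gene names, KOG names, values and function names are hypothetical.

```python
import numpy as np

def standardize_log(raw_by_gene):
    """Log-transform expression values and scale them to mean 0, SD 1."""
    genes = list(raw_by_gene)
    logged = np.log([raw_by_gene[g] for g in genes])
    z = (logged - logged.mean()) / logged.std()
    return dict(zip(genes, z.tolist()))

def kog_median(z_by_gene, kog_members):
    """Represent each KOG by the median value among its measured paralogs."""
    result = {}
    for kog, genes in kog_members.items():
        values = [z_by_gene[g] for g in genes if g in z_by_gene]
        if values:  # KOGs without any measured member stay undefined
            result[kog] = float(np.median(values))
    return result

# Hypothetical raw expression values and KOG membership.
raw = {"g1": 10.0, "g2": 100.0, "g3": 1000.0, "g4": 100.0, "g5": 10.0}
kogs = {"KOG_A": ["g1", "g2", "g3"], "KOG_B": ["g4"], "KOG_C": ["g9"]}

z = standardize_log(raw)
el = kog_median(z, kogs)
print(el)
```

The median is less sensitive than the mean to a single atypically expressed paralog, which may be why it was chosen as the KOG-level summary.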

Variables: Interactions
Protein and Genetic Interactions (PPI and GI) data for S. cerevisiae, C. elegans and D. melanogaster were downloaded from the GRID Web site. Median number of interaction partners among paralogs was used to represent a KOG.

Variables: Lethality
Lethality of Gene Knockout data for S. cerevisiae were downloaded from the MIPS FTP site (0/1 values). Embryonic Lethality of RNA Interference data for C. elegans were taken from Kamath et al., 2003 (0/1 values).

Missing Data
Total: 38 variables in 10058 KOGs – lots of missing data.
Complete data (all 38 variables available): 23 KOGs – too few.
Combined data: 7 variables, 1482 KOGs with complete data; 4124 with at most one missing point; 3912 KOGs after removal of outliers.

Example: evolution rate by taxon comparison (At.Os, Sc.Ca, Mg.Nc, Hs.Mm, Pl.MF); "-" marks missing data.

First panel, available values per KOG:
KOG0009   0.168  0.300  0.405
KOG0010   0.671  1.252  0.606  0.087  1.492
KOG0011   0.905  1.698  0.428  0.073  1.547
KOG0012   2.238  0.665  0.244
KOG0013   0.355  0.014  1.343
KOG0014   1.913  4.041  -0.126  2.840
KOG0015   2.286  0.400  0.027
KOG0016   -

Second panel (same comparisons), values in slide order:
0.090 0.575 -0.212 1.006 0.672 1.166 0.781 1.358 0.911 0.821 0.984 0.810 1.201 1.275 3.275 0.532 0.181 0.703 2.869 2.168 -1.692 1.487 1.227 0.767 0.365 0.970 5.087 -

Average: 0.293 0.957 0.977 1.917 0.472 2.054 0.786 3.028
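The bookkeeping behind these counts (complete rows versus rows with at most one missing point) can be sketched as below, assuming missing values are encoded as NaN. The toy table and the function name are hypothetical, and the slide does not say how the single missing point was subsequently handled or how outliers were defined, so neither step is shown.

```python
import numpy as np

def missingness_summary(table):
    """Count rows (KOGs) by the number of missing variables.

    table: array of shape (n_kogs, n_variables) with NaN for missing data.
    """
    n_missing = np.isnan(table).sum(axis=1)
    return {
        "complete": int((n_missing == 0).sum()),
        "at_most_one_missing": int((n_missing <= 1).sum()),
    }

nan = np.nan
toy = np.array([
    [0.1, 0.2, 0.3],
    [0.4, nan, 0.6],
    [nan, nan, 0.9],
    [1.0, 1.1, 1.2],
])

summary = missingness_summary(toy)
# Keep only the rows that would enter an "at most one missing point" data set.
kept = toy[np.isnan(toy).sum(axis=1) <= 1]
print(summary, kept.shape)
```

Relaxing the requirement from "complete" to "at most one missing point" is what raises the usable set from 1482 to 4124 KOGs in the numbers quoted above.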

Variables
Phenotypic
• EL – expression level
• PPI – protein-protein interactions
• GI – genetic interactions
• KE – knockout effect
• NP – number of paralogs
Evolutionary
• ER – (sequence) evolution rate
• PGL – propensity for gene loss

The correlations

        NP      PPI     GI      PGL     ER      EL      KE
NP      -
PPI     0.057   -
GI      0.060   0.034   -
PGL     0.000  -0.125  -0.019   -
ER     -0.070  -0.200   0.034   0.141   -
EL      0.129   0.199  -0.050  -0.099  -0.277   -
KE      0.027   0.234  -0.048  -0.181  -0.155   0.188   -

Two Tiers of Variables
Observation on the pattern of pairwise relationships in the data: "phenotypic" and "evolutionary" variables behave differently.
• "phenotypic" variables: "bigger is better"
• "evolutionary" variables: "slow is good, fast is bad"

Two Tiers of Variables
Observation on the pattern of pairwise relationships in the data: "phenotypic" and "evolutionary" variables behave differently.
[Diagram: positive correlations among "phenotypic" variables; negative correlations between the two tiers; positive correlations among "evolutionary" variables]

The correlations
(The same correlation matrix as on the preceding "The correlations" slide.)
Annotation: non-essential (almost by definition), low-expressed, relatively fast-evolving

PCA of the Data Space

         PC1     PC2     PC3
NP       0.17    0.69    0.44
PPI      0.46    0      -0.17
GI       0       0.67   -0.54
PGL     -0.33    0       0.51
ER      -0.47    0      -0.20
EL       0.48    0       0.36
KE       0.45   -0.27   -0.21
----------------------------
% var.   25.0    15.3    14.5

Sphericity

PCA of the Data Space
[Plot: PC1 vs. PC2]

PCA of the Data Space
[Plot: PC2 vs. PC3]

PC1 – Gene's "status"
[Plot: PC1 vs. PC2; PC1 labels: "accessory", "important"]

PC2 – "Adaptability"
[Plot: PC1 vs. PC2; PC2 labels: "rigid", "flexible"]

PC2 and Expression Profile
[Illustration: Skew ~0 vs. Skew >0]

                   Status - LO                      Status - HI
                   PC2 LO  PC2 HI  p-value          PC2 LO  PC2 HI  p-value
S. cerevisiae      0.29    …       1 x 10^0         0.32    0.44    3 x 10^-3
D. melanogaster    1.82    1.84    4 x 10^-1        1.82    1.90    7 x 10^-2
H. sapiens         1.75    1.94    7 x 10^-4        1.87    2.12    <1 x 10^-20
Omnibus test                       1 x 10^-2                        … x 10^-20

PC3 – "Reactivity"
[Plot: PC2 vs. PC3]

PC3 and Expression Profile
[Illustration: Skew ~0 vs. Skew >0]

                   Status - LO                      Status - HI
                   PC3 LO  PC3 HI  p-value          PC3 LO  PC3 HI  p-value
S. cerevisiae      0.26    0.31    3 x 10^-1        0.22    0.50    <1 x 10^-20
D. melanogaster    1.77    1.88    6 x 10^-2        1.86    1.85    9 x 10^-1
H. sapiens         1.80    1.94    3 x 10^-4        1.86    2.13    <1 x 10^-20
Omnibus test                       4 x 10^-4                        … x 10^-20

Relationships Between Variables
[Diagram labels: "STATUS", "ADAPTABILITY", "REACTIVITY"; "phenotypic" variables; "evolutionary" variables]

Status and Adaptability of Genes
Classification of KOGs into 4 major categories

Status and Adaptability of Genes
[Plot labels: Status, Adaptability, Reactivity; categories: INF, CELL, MET, UNKN]
Classification of KOGs into 4 major categories

Status and Adaptability of Genes
Cytoplasmic and Mitochondrial ribosomal proteins

Status and Adaptability of Genes
Vacuolar ATPase and Vacuolar Sorting proteins

Status and Adaptability of Genes
Replication Licensing Complex and Histones

Status and Adaptability of Genes
Core Cluster (spliceosome and mRNA cleavage-polyadenylation complex)
RNA processing and modification

Adaptability and Reactivity of Genes
[Plot labels: carbohydrate transport and metabolism; translation and ribosome; replication, RNA processing and modification; signal transduction]

Conclusions
• Three composite, independent variables – "status", "adaptability" and "reactivity" – dominate the multidimensional data space of quantitative genomics.
• The notion of status provides biologically relevant null hypotheses regarding the connections between various measures.
• Breaks in the pattern possibly indicate something nontrivial (targets for further investigation).
• Functional groups of genes show distinctive patterns of status, adaptability, and reactivity.

Co-Authors
Liran Carmel, Yuri Wolf, Eugene Koonin