Codons, Genes and Networks Andrei Zinovyev Bioinformatics service

Codons, Genes and Networks Andrei Zinovyev Bioinformatics service Math@Bio group of M. Gromov

Plan of the talk
- Part I: 7-cluster structure of the genome (codons and genes)
- Part II: Coding and non-coding DNA scaling laws (genes and networks)

Part I: 7-cluster genome structure
Dr. Tatyana Popova, R&D Centre in Biberach, Germany
Prof. Alexander Gorban, Centre for Mathematical Modelling

Genomic sequence as a text in an unknown language
…cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgc…
Frequency dictionaries of the fragment tagggacgcacgtggtgagctgatgctaggg:
- singles, N = 4 = 4^1: t a g g g a c g c a c g …
- doublets, N = 16 = 4^2: ta gg ga cg ca cg tg gt ga gc tg at gc ta gg
- triplets, N = 64 = 4^3: tag gga cgc acg …
- quadruplets, N = 256 = 4^4: tagg gacg cacg tggt gagc tgat
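A minimal sketch of how such a frequency dictionary could be computed. The function name `frequency_dictionary` and the `step` parameter are illustrative assumptions, not taken from the talk; by default the words are read without overlap, as in the splits shown on the slide.

```python
from collections import Counter

def frequency_dictionary(text, k, step=None):
    """Relative frequencies of length-k words in text.

    step is the distance between consecutive word starts; the default
    (step = k) reads non-overlapping words, as on the slide.
    """
    step = k if step is None else step
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, step)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Fragment taken from the slide; k = 3 gives the triplet dictionary.
fragment = "tagggacgcacgtggtgagctgatgctaggg"
triplets = frequency_dictionary(fragment, 3)
```

A dictionary for word length k has at most 4^k entries, which is where the sizes N = 4, 16, 64 and 256 come from.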

From text to geometry
Genome: cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgc… (length ~10^7)
Fragments of length ~200-400:
cgtggtgagctgatgctagggacgcacact
tgagctgatgctagggacgcacaattc
gtgagctgatgctagggacgcacggtg
……
gagctgatgctagggacgcacaagtga
10000-20000 fragments, each mapped to a point in $R^N$

Method of visualization: principal components analysis (PCA)
$R^N \to R^2$ (PCA plot)
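The pipeline from fragments to a PCA plot can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the window width (300) is chosen inside the 200-400 range from the slide, the step (50) and all function names are hypothetical, a random sequence stands in for a real genome, and PCA is done with a plain SVD in NumPy.

```python
import itertools
import numpy as np

TRIPLETS = ["".join(p) for p in itertools.product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets in a fragment: a point in R^64."""
    v = np.zeros(64)
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # skip triplets containing non-acgt symbols
            v[j] += 1
    total = v.sum()
    return v / total if total > 0 else v

def fragment_matrix(genome, width=300, step=50):
    """One row of triplet frequencies per sliding window."""
    starts = range(0, len(genome) - width + 1, step)
    return np.array([triplet_vector(genome[s:s + width]) for s in starts])

def pca_2d(X):
    """Project centered rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Placeholder data: a random sequence, NOT a real genome.
rng = np.random.default_rng(0)
genome = "".join(rng.choice(list("acgt"), size=5000))
points = pca_2d(fragment_matrix(genome))  # one 2D point per fragment
```

A random sequence has no coding structure, so no clusters are expected from this placeholder input; a real annotated genome would be needed to reproduce the plots from the talk.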

Caulobacter crescentus
PCA plots: singles (N=4), doublets (N=16), triplets (N=64), quadruplets (N=256)
!!! The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgc

Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg
gtgaatcggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tcg
gtg ggt gag tgt gct cgg tgg
gtg agt gtg ctg

Non-coding parts
Point mutations: insertions, deletions
a gtgagctgatgctagggr cgcacgaat

The flower-like 7-cluster structure is flat

Seven classes vs seven clusters
Georgia Institute of Technology, TIGR, Stanford
- Lomsadze A., Ter-Hovhannisyan V., Chernoff Y.O., Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.
- Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.
- Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17): 10026-31, 1998.

Computational gene prediction Accuracy >90%

Mean-field approximation for triplet frequencies
$F_{IJK}$: frequency of triplet IJK ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers
Position-specific letter frequencies + correlations: 12 numbers (4 letters × 3 codon positions)

Why hexagonal symmetry?
GC-content = $P_C + P_G$
Signatures: -+0, 0+-, +-0, +0-, 0-+, -0+

Genome codon usage and mean-field approximation
correct frameshift
…ggtga ATG gat gct agg … gtc gca cgc TAA tgagct: 64 frequencies $F_{IJK}$
…ggtga ATG gat gct agg … gtc gca cgc TAA tgagct: 12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

The position-specific frequencies $P^j_I$ are linear functions of GC-content (eubacteria, archaea)

THE MYSTERY OF TWO STRAIGHT LINES ???
$R^{12}$, $R^{64}$
$F_{IJK} = P^1_I P^2_J P^3_K + \text{correlations}$
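The decomposition into a mean-field term and correlations can be sketched numerically. This is only an illustration: the function names are hypothetical, and the seven codons are the ones visible in the example sequence of the talk, far too few for meaningful statistics.

```python
import numpy as np

LETTERS = "acgt"

def codon_frequencies(codons):
    """64 triplet frequencies F[i, j, k], stored as a 4 x 4 x 4 array."""
    F = np.zeros((4, 4, 4))
    for codon in codons:
        i, j, k = (LETTERS.index(x) for x in codon)
        F[i, j, k] += 1
    return F / F.sum()

def mean_field(F):
    """12 position-specific letter frequencies and their product."""
    P1 = F.sum(axis=(1, 2))  # letter frequencies at codon position 1
    P2 = F.sum(axis=(0, 2))  # letter frequencies at codon position 2
    P3 = F.sum(axis=(0, 1))  # letter frequencies at codon position 3
    M = np.einsum("i,j,k->ijk", P1, P2, P3)  # mean-field term P1*P2*P3
    return (P1, P2, P3), M

# Toy input: codons read in frame from the example sequence.
codons = ["atg", "gat", "gct", "agg", "gtc", "gca", "cgc"]
F = codon_frequencies(codons)
(P1, P2, P3), M = mean_field(F)
correlations = F - M  # the part of F not captured by the 12 numbers
```

By construction the mean-field term has the same position-specific letter frequencies as F, so the correlation term sums to zero over any two of its three indices.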

Codon usage signature: 0-+

19 possible eubacterial signatures

Example: Palindromic signatures

Four symmetry types of the basic 7-cluster structure (eubacteria): parallel triangles, perpendicular triangles, degenerated, flower-like

F. nucleatum (GC=27%), E. coli (GC=51%), B. halodurans (GC=44%), S. coelicolor (GC=72%)

Using branching principal components to analyze 7-cluster genome structures

Using branching principal components to analyze 7-cluster genome structures: Streptomyces coelicolor, Bacillus halodurans, Fusobacterium nucleatum, Escherichia coli

Web-site: cluster structures in genomic sequences
http://www.ihes.fr/~zinovyev/7clusters

Papers (type Zinovyev in Google)
- Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.
- Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
- Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
- Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
- Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

Part II: Coding and non-coding DNA scaling laws
Dr. Sebastian Ahnert, Dr. Thomas Fink
Bioinformatics service
Cavendish laboratory, University of Cambridge

C-value and G-value paradox
Neither genome length nor gene number accounts for the complexity of an organism
- Drosophila melanogaster (fruit fly): C = 120 Mb
- Podisma pedestris (mountain grasshopper): C = 1650 Mb

Non-linear growth of regulation
[Figure: log number of regulatory genes vs log number of genes, for bacteria and archaea; slope = 1.96]
The "amount of regulation" scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated.
Mattick, J. S. Nature Reviews Genetics 5, 316-323 (2004).

Complexity ceiling for prokaryotes
- Adding a new function ΔS requires adding a regulatory overhead ΔR; the total increase is ΔN = ΔR + ΔS
- Since R ~ N^2, at some point ΔR > ΔS, i.e. the gain from a new function becomes too expensive for the organism: it requires too much regulation to be integrated
- There is a maximum possible genome length for prokaryotes (~10 Mb)
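The ceiling argument can be made explicit with a short worked derivation. It assumes, for illustration only, the exact quadratic law $R = aN^2$ with a constant $a$ (the slide states only the proportionality $R \sim N^2$), and treats the increments as small.

```latex
\begin{align*}
R &= aN^2, \qquad N = R + S \\
\Delta R &\approx 2aN\,\Delta N \\
\Delta S &= \Delta N - \Delta R \approx (1 - 2aN)\,\Delta N \\
\Delta R > \Delta S &\iff 2aN > 1 - 2aN \iff N > \frac{1}{4a} \\
\Delta S \le 0 &\iff N \ge \frac{1}{2a}
\end{align*}
```

Under this assumption, regulation starts to cost more than the function it serves once $N > 1/(4a)$, and no new function can be added at all beyond $N = 1/(2a)$, which plays the role of the ceiling.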

How did eukaryotes bypass this limitation?
- Presumably, they invented a cheaper (digital) regulatory system, based on RNA
- This regulatory information is stored in the "non-coding" DNA

Simple model: accelerated networks
A node is a gene (c genes); an edge is a "regulation" (n edges)
Connectivity < $k_{max}$: regulators are only proteins, $n = ac^2$
Connectivity > $k_{max}$: the deficit of regulations is taken from non-coding DNA

How much regulation does a genome need to take from non-coding DNA?
$c_{max}$ (prokaryotic ceiling)
These regulations must be encoded in the non-coding part of the genome, therefore
N: non-coding DNA length
C: coding DNA length
$C_{prok}$: ceiling for prokaryotes (~10 Mb)
b: some coefficient

Observation: coding length vs non-coding b=1 Minimum non-coding length needed for the «deficit» regulation

Hypothesis
- Prokaryotes: <Non-coding length> = a <Coding length>, a = 5-15% (little constant add-on: promoters, UTRs…), 15% ≈ 1/7
- Eukaryotes: $N_{reg} = \frac{b}{2}\,\frac{C}{C_{maxprok}}\,(C - C_{maxprok}) \sim C^2$, with $C_{maxprok} \approx 10$ Mb, $b \approx 1$
- This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

This is only a hypothesis, but…
Prediction of $N_{reg}$ for human:
$N_{reg}$ = 87 Mb = 3% of genome length
C = 48 Mb = 1.7%
$N_{reg}$ + C = 4.7%
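The prediction can be checked against the eukaryotic formula from the hypothesis with a few lines of arithmetic. This is a sketch under assumptions: the function name is hypothetical, the rounded values $C_{maxprok}$ = 10 Mb and b = 1 are used, and returning zero below the ceiling is an assumption of this example rather than a statement from the talk.

```python
def regulatory_noncoding_mb(coding_mb, c_max_prok_mb=10.0, b=1.0):
    """N_reg = (b / 2) * (C / C_maxprok) * (C - C_maxprok), in Mb.

    Below the prokaryotic ceiling no regulatory deficit is assumed,
    so the function returns 0 (an assumption of this sketch).
    """
    if coding_mb <= c_max_prok_mb:
        return 0.0
    return b / 2 * coding_mb / c_max_prok_mb * (coding_mb - c_max_prok_mb)

human_coding_mb = 48.0  # C for human, from the slide
n_reg = regulatory_noncoding_mb(human_coding_mb)  # about 91 Mb
```

With these rounded parameters the result is about 91 Mb, close to the 87 Mb quoted on the slide; a ceiling slightly above 10 Mb (about 10.4 Mb) would give 87 Mb, but the exact parameter values behind the slide are not stated.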

Thank you for your attention
Questions?