  • Number of slides: 29

Hierarchical Cluster Structures and Symmetries in Genomic Sequences
Andrei Zinovyev
Institut des Hautes Études Scientifiques, Math@Bio group of M. Gromov

Plan of the talk
Genomic sequences: geometric approach, clustering
  • Genomic sequence as text
  • Basic 7-cluster structure
  • Global structure of codon frequencies
  • Internal structure of codon frequencies
  • Applications

Introduction
Frequency dictionaries

Genomic sequence as a text in an unknown language
…cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgc…
[Figure: the same sequence cut into words of different lengths]
Frequency dictionaries for the fragment tagggrcgcacgtggtgagctgatgctaggg:
  • N = 4 = 4^1: t a g g g r c g c a c g t g a g c t g a t g c t a g g g
  • N = 16 = 4^2: ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
  • N = 64 = 4^3: tag ggr cgc acg tga gct gat gct agg
  • N = 256 = 4^4: tagg grcg cacg tggt gagc tgat gcta gggr
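A frequency dictionary of this kind can be sketched in a few lines of Python. This is only an illustration, assuming the text is cut into non-overlapping words of length k starting from the first letter; the function name `frequency_dictionary` is hypothetical and does not come from the talk.

```python
from collections import Counter


def frequency_dictionary(text, k):
    """Relative frequencies of non-overlapping words of length k.

    Assumption: words are read from the first letter; a trailing
    remainder shorter than k is ignored.
    """
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


if __name__ == "__main__":
    fragment = "tagggrcgcacgtggtgagctgatgctaggg"  # fragment shown on the slide
    for k in (1, 2, 3, 4):
        dictionary = frequency_dictionary(fragment, k)
        print(k, len(dictionary), "distinct words observed")
```

For a four-letter alphabet the dictionary has at most 4^k entries, which gives the values N = 4, 16, 64, 256 listed above.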

From text to geometry
Whole sequence, length ~10^7:
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgc
3000-4000 fragments, each of length ~300-400:
cgtggtgagctgatgctagggrcgcacact
tgagctgatgctagggrcgcacaattc
gtgagctgatgctagggrcgcacggtg
……
gagctgatgctagggrcgcacaagtga
Fragments → R^N
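The step from text to points in R^N can be sketched as follows for triplets (N = 64). This is a minimal illustration under stated assumptions: the window width of 300 follows the fragment length quoted on the slide, while the step of 30 letters and the names `triplet_vector` and `fragments_to_points` are hypothetical choices made here.

```python
from itertools import product

# All 64 triplets over the alphabet {a, c, g, t}, in a fixed order.
TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets in one fragment (length-64 list)."""
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # skip triplets containing other letters, e.g. "r"
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 64


def fragments_to_points(sequence, width=300, step=30):
    """Slide a window along the sequence; each window becomes a point in R^64.

    Assumption: width and step are illustrative values, not the talk's settings.
    """
    starts = range(0, len(sequence) - width + 1, step)
    return [triplet_vector(sequence[s:s + width]) for s in starts]
```

Each returned row is one fragment's frequency dictionary written as a vector, so the whole genome becomes a cloud of points.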

Method of visualization: principal components analysis
R^N → R^2 (PCA plot)
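The projection from R^N to R^2 can be sketched with a small pure-Python PCA. This is an illustration only: it finds the two leading principal directions by power iteration with deflation, which is one standard way to compute them and not necessarily the procedure used for the talk's plots. All function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(points):
    """Subtract the mean point from every point."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    return [[p[j] - mean[j] for j in range(dim)] for p in points]


def leading_direction(rows, iterations=200, seed=0):
    """Unit vector of largest variance, found by power iteration on X^T X."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        proj = [dot(r, v) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(proj, rows)) for j in range(dim)]  # X^T (X v)
        norm = dot(w, w) ** 0.5
        if norm == 0:  # no variance left in the data
            return [0.0] * dim
        v = [x / norm for x in w]
    return v


def pca_2d(points):
    """Coordinates of every point on the first two principal components."""
    rows = center(points)
    first = leading_direction(rows, seed=0)
    # Deflation: remove the first component, then repeat on the residual.
    residual = [[x - dot(r, first) * f for x, f in zip(r, first)] for r in rows]
    second = leading_direction(residual, seed=1)
    return [(dot(r, first), dot(q, second)) for r, q in zip(rows, residual)]
```

The sign of each principal direction is arbitrary, so a plot may appear mirrored between runs with different seeds. For large data sets a library routine such as `numpy.linalg.svd` would be the usual choice.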

Chapter 1
Basic 7-cluster structure (level 1 of non-randomness)

Caulobacter crescentus
[Figure panels: singles (N=4), doublets (N=16), triplets (N=64), quadruplets (N=256)]
!!! The information in the genomic sequence is encoded by non-overlapping triplets

First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgc

Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg
gtgaatcggtgaqtgtgctgctatgagc
atc ggt ggg tga gtg tgc
tcg gtg ggt gag tgt gct
cgg tgg gtg agt gtg ctg
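The three triplet rows under each sequence are the same text read with three different starting offsets. A minimal sketch of that reading, using the hypothetical helper name `reading_frames`:

```python
def reading_frames(sequence):
    """Split a sequence into non-overlapping triplets for offsets 0, 1 and 2."""
    return [
        [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]
        for offset in range(3)
    ]


if __name__ == "__main__":
    # First sequence shown on the slide.
    for offset, frame in enumerate(reading_frames("gtgagctgatgctagggrcgcacgtggtgagc")):
        print(offset, " ".join(frame))
```

A fragment cut from inside a gene falls into one of these three frames relative to the true codon boundaries, so its triplet frequencies depend on the shift.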

Non-coding parts
Point mutations: insertions, deletions
a gtgagctgatgctagggr cgcacgaat

Mean-field approximation for triplet frequencies
F_IJK: frequency of triplet IJK (I, J, K ∈ {A, C, G, T}):
F_AAA, F_AAT, F_AAC, …, F_GGC, F_GGG: 64 numbers
letter frequency + correlations: 12 numbers
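The reduction from 64 numbers to 12 can be sketched as follows. The code assumes the usual product form of a mean-field approximation, F_IJK ≈ P_I^1 · P_J^2 · P_K^3, where P_I^1 is the frequency of letter I in the first triplet position (and likewise for positions 2 and 3). The slide text does not spell the formula out, so treat the product form and the function names as assumptions.

```python
from collections import Counter, defaultdict


def triplet_frequencies(sequence):
    """F_IJK: relative frequencies of non-overlapping triplets."""
    triplets = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    counts = Counter(triplets)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def position_frequencies(F):
    """P^1, P^2, P^3: letter frequencies at each triplet position (3 x 4 = 12 numbers)."""
    P = [defaultdict(float), defaultdict(float), defaultdict(float)]
    for triplet, f in F.items():
        for position, letter in enumerate(triplet):
            P[position][letter] += f
    return P


def mean_field(P):
    """Assumed mean-field estimate: product of the three positional frequencies."""
    return {
        a + b + c: P[0][a] * P[1][b] * P[2][c]
        for a in P[0] for b in P[1] for c in P[2]
    }
```

Comparing `mean_field(position_frequencies(F))` with `F` itself shows how much of the triplet distribution is explained by position-specific letter frequencies alone.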

Why hexagonal symmetry?
GC-content = P_C + P_G
Sign patterns: -+0, 0+-, +-0, +0-, 0-+, -0+

Chapter 2
Global structure of codon frequencies (143 complete bacterial genomes)

Genome codon usage and mean-field approximation
Correct frameshift:
…ggtga ATG gat gct agg … gtc gca cgc TAA tgagct
64 frequencies F_IJK
12 frequencies P_I^1, P_J^2, P_K^3

Global structure of codon frequencies
[Figure labels: archaea, eubacteria]

PIJ are linear functions of GC-content

Four symmetry types of the basic 7-cluster structure (eubacteria)
  • parallel triangles
  • perpendicular triangles
  • degenerated
  • flower-like

Chapter 3
Internal structure of codon frequencies (level 2 of non-randomness)

Second level of hierarchy?

Distribution of genes
[Figure: clusters of genes in R^64, labeled function 1, function 2, function 3]

Fast-growing bacteria
[Figure: gene clusters labeled I, II, IV]
  • Genes of class I (most genes)
  • Genes of class II (highly expressed)
  • Genes of class III (unusual)
  • Genes of class IV (hydrophobic proteins)

Escherichia coli
  • Genes of class I (most genes)
  • Genes of class II (highly expressed)
  • Genes of class III (unusual)
  • Genes of class IV (hydrophobic proteins)

Chapter 4
Applications

Computational gene prediction
Accuracy >90%

Protein expression optimization
Gene sequence S, protein A
[Figure: gene clusters labeled I, II, IV]
Gene sequence S', same protein A, higher expression

Web-site: cluster structures in genomic sequences
http://www.ihes.fr/~zinovyev/7clusters

Papers
  • Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
  • Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology, V. 3, 0039.
  • Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

People
  • Dr. Tanya Popova, Institute of Computational Modeling, Russia
  • Professor Alexander Gorban, University of Leicester, UK