  • Number of slides: 35

Comparative genomics: Overview & Tools + MUMmer algorithm
Urmila Kulkarni-Kale, Bioinformatics Centre, University of Pune, Pune 411 007. urmila@bioinfo.ernet.in

Genome sequence: Fact file
• 1995: The first complete genome sequence, that of Haemophilus influenzae Rd, was published
• Biological systems are dynamic and evolving
• The fourth dimension: Time
• Genome sequence is a snapshot of evolution
• Correlation between phenotypic properties and genomic regions is not straightforward, as phenotypic properties are the result of many-to-many interactions
Jan 21, 2010 © UKK, Bioinformatics Centre, University of Pune.

Genomes: the current status (GOLD database, as of Jan 21, 2010)
• Published complete genomes: 403
  » Archaeal: 81
  » Bacterial: 1226
  » Eukaryal: 169
• Ongoing:
  » Archaeal: 107
  » Prokaryotic: 3478
  » Eukaryotic: 1209
• Metagenomics: 203
• Viral: >4500

Genome databases
• Genomes at NCBI, EBI, TIGR

H. influenzae Complete Genome

Function information clock of E. coli (generated in March 2004)

Comparison of the coding regions
• Begins with the gene identification algorithm: infer what portions of the genomic sequence actively code for genes.
• There are four basic approaches.

Knowledge of Full Genome sequence: Solutions or new questions…?
• Correct # of genes…?
• Still struggling with the gene counters…

Genome analyses
• Variation in
– Genome size: E. coli: 4.6 Mbp; M. pneumoniae: 0.81 Mbp; B. subtilis: 4.20 Mbp
– GC content: B. burgdorferi: 29%; M. tuberculosis: 68%
– Codon usage
– Amino acid composition: G, A, P, R: GC rich; I, F, Y, M, D: AT rich
– Genome organisation
  • Single circular chromosomes
  • Linear chromosome + extra chromosomal elements
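The GC-content figures above are simple base-composition ratios. A minimal sketch of the calculation follows; the function name `gc_content` is illustrative and not part of any tool mentioned in these slides, and ambiguous bases such as N are assumed to be ignored.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    # Count only unambiguous bases (assumption: N and other codes are skipped).
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / total


if __name__ == "__main__":
    # Toy sequence, not a real genome.
    print(f"{gc_content('ATGCGCNNAT'):.2f}")
```

On a whole genome the same ratio would be computed over the concatenated chromosome sequence; a value near 0.29 or 0.68 would correspond to the two extremes quoted on the slide.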

CG: Comparisons between genomes
• The strains of the same species
• The closely related species
• The distantly related species
– List of orthologs
– Evolution of individual genes
– Evolution of organisms

CG helps to ask some interesting questions
• Identifying similarities and differences between genomes may allow us to understand:
– How did two organisms evolve?
– Why do certain bacteria cause disease while others do not?
– How can drug targets be identified and prioritized?

CG: Unit of comparison
• Unit of comparison: Gene/Genome
– Number
– Content (sequence)
– Location (map position)
– Gene order
– Gene cluster (genes that are part of a known metabolic pathway are found to exist as a group)
– Colinearity of gene order is referred to as synteny
– A conserved group of genes in the same order in two genomes is called a syntenic group or syntenic cluster
– Translocation: movement of a genomic part from one position to another

Structure of tryptophan operon (Dandekar et al., 1998)
• Numbers: Gene number
• Arrows: Direction of transcription
• //: Dispersion of operon by 50 genes
• trpB and trpA: genetically linked separate genes
• Domain fusion: trpD and trpG; trpF and trpC

Important observations with regard to Gene Order
• Order is highly conserved in closely related species but gets changed by rearrangements
• With greater evolutionary distance, there is no correspondence between the gene order of orthologous genes
• Groups of genes having similar biochemical function tend to remain localized
– Genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes

Synteny
• Refers to regions of two genomes that show considerable similarity in terms of
– sequence and
– conservation of the order of genes
• Such regions are likely to be related by common descent.
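Conserved gene order can be illustrated with a small sketch. It assumes each genome is given as a list of ortholog identifiers in map order, that each identifier appears at most once per genome, and that only same-orientation runs are of interest (inversions are ignored). The function name `syntenic_blocks` and the gene labels are hypothetical.

```python
def syntenic_blocks(order_a, order_b, min_len=2):
    """Return runs of genes that are adjacent and in the same order in both genomes."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        # Extend the run only if this gene directly follows the previous one in genome B.
        if gene in pos_b and current and pos_b[gene] == pos_b[current[-1]] + 1:
            current.append(gene)
        else:
            if len(current) >= min_len:
                blocks.append(current)
            current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    genome_a = ["g1", "g2", "g3", "x", "g4", "g5"]
    genome_b = ["g4", "g5", "y", "g1", "g2", "g3"]
    print(syntenic_blocks(genome_a, genome_b))
```

In this toy input the two blocks have swapped places between the genomes, which is the kind of rearrangement that preserves local gene order while breaking global colinearity.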

COGs: Phylogenetic classification of proteins encoded in complete genomes

Genome analyses@NCBI
Pairwise genome comparison of protein homologs (symmetrical best hits)
http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi
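The idea of symmetrical best hits can be sketched as follows. This is not the NCBI implementation; it assumes a single table of similarity scores keyed by (protein in genome A, protein in genome B), whereas a real pipeline would typically run two directed searches. The name `symmetrical_best_hits` and the protein labels are hypothetical.

```python
def symmetrical_best_hits(scores):
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    best_for_a, best_for_b = {}, {}
    for (a, b), score in scores.items():
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    # Keep only pairs that are each other's top-scoring partner.
    return sorted((a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a)


if __name__ == "__main__":
    toy_scores = {
        ("a1", "b1"): 90, ("a1", "b2"): 40,
        ("a2", "b1"): 70, ("a2", "b2"): 60,
        ("a3", "b2"): 65,
    }
    print(symmetrical_best_hits(toy_scores))
```

Here a2 is left unpaired: its best hit is b1, but b1 scores higher against a1, so the pair is not symmetrical.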

Integr8: CG site at EBI
http://www.ebi.ac.uk/integr8

Comparative Genomics Tools
• BLAST2
• MUMmer
• PipMaker
• AVID/VISTA
• Comparisons and analyses at both
– Nucleic acid and protein level

BLAST2
• Available at NCBI
• Input: GI or FASTA sequence (range can be specified)
• Output:
– Graphical
– Alignment of 2 genomes

Genome Alignment Algorithm: MUMmer
• Developed by
– Dr. Steven Salzberg's group at TIGR
– NAR (1999) 27: 2369-2376
– NAR (2002) 30: 2478-2483
• Availability
– Free
– TIGR site

Features of MUMmer
• The algorithm assumes that sequences are closely related
• Can quickly compare millions of bases
• Outputs:
– Base to base alignment
– Highlights the exact matches and differences in the genomes
– Locates
  • SNPs
  • Large inserts
  • Significant repeats
  • Tandem repeats and reversals

Definitions are drawn from biology
• SNP: single mutation surrounded by two matching regions
– Regions of DNA where 2 sequences have diverged by more than one SNP
• Large inserts: regions inserted into one of the genomes
– Sequence reversals, lateral gene transfer
• Repeats: a form of duplication that has occurred in either genome
• Tandem repeats: regions of repeated DNA in immediate succession but with different copy number in different genomes
– A repeat can occur 2.5 times
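A fractional copy number such as 2.5 simply means that the last copy of the repeat unit is incomplete. A minimal sketch, assuming the repeat unit and its start position are already known (the function name `tandem_copy_number` is illustrative):

```python
def tandem_copy_number(seq: str, unit: str, start: int = 0) -> float:
    """Count how many copies of `unit`, including a trailing partial copy, begin at `start`."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    matched = 0
    # Walk along the sequence while it keeps following the repeating unit.
    while start + matched < len(seq) and seq[start + matched] == unit[matched % len(unit)]:
        matched += 1
    return matched / len(unit)


if __name__ == "__main__":
    # Two full copies of ACGT followed by half a copy (AC), then unrelated bases.
    print(tandem_copy_number("ACGTACGTACTT", "ACGT"))
```

Comparing this value for the same locus in two genomes would expose the copy-number difference that the slide describes.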

Techniques used in the MUMmer Algorithm
• Compute suffix trees for every genome
• Longest Increasing Subsequence (LIS)
• Alignment using Smith & Waterman algorithm
• Integration of these techniques for genome alignment

MUMmer: Steps in the alignment process (flowchart, in order)
• Read two genomes
• Perform Maximal Unique Match (MUM) search of the genomes
• Sort and order the MUMs using LIS
• Close the gaps in the alignment, using SNPs, mutation regions, repeats, tandem repeats
• Output alignment:
– MUMs
– regions that do not match exactly

MUMmer steps
• Locating MUMs
• Sorting MUMs
• Closure with gaps
G1: ACTGATTACGTGAACTGGATCCA
G2: ACTCTAGGTGAAGTGATCCA

Genome 1: ACTGATTACGTGAACTGGATCCA
Genome 2: ACTCTAGGTGAAGTGATCCA

ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGT-GATCCA
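The gapped alignment of Genome 1 and Genome 2 can be read column by column to separate exact matches from differences. The sketch below does only that bookkeeping on the two aligned rows; it does not compute the alignment itself, and the function name `summarize_alignment` is illustrative.

```python
def summarize_alignment(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Classify each alignment column as match, mismatch or gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    summary = {"match": 0, "mismatch": 0, "gap": 0, "mismatch_columns": []}
    for col, (x, y) in enumerate(zip(row_a, row_b)):
        if x == gap or y == gap:
            summary["gap"] += 1
        elif x == y:
            summary["match"] += 1
        else:
            summary["mismatch"] += 1
            summary["mismatch_columns"].append(col)  # 0-based column index
    return summary


if __name__ == "__main__":
    aligned_1 = "ACTGATTACGTGAACTGGATCCA"
    aligned_2 = "ACTC--TAGGTGAAGT-GATCCA"
    print(summarize_alignment(aligned_1, aligned_2))
```

For these two rows the 23 columns split into 17 matches, 3 single-base mismatches and 3 gap columns.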

What is a MUM?
• A MUM is a subsequence that occurs exactly once in both genomes and is NOT part of any longer matching sequence
• The two characters that bound a MUM on either side are always mismatches
Genome A: tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta
Genome B: gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag
• Principle: if a long matching sequence occurs exactly once in each genome, it is almost certainly part of the global alignment
• Similar to BLAST & FASTA!!
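The MUM definition can be checked directly on short sequences with a brute-force search. This sketch is for illustration only: MUMmer itself relies on suffix trees to do this efficiently, whereas the code below compares every pair of positions. The names `find_mums` and `count_occurrences` are hypothetical, and comparison is case-sensitive so that the lowercase flanks in the slide example stay distinct from the uppercase match.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a: str, b: str, min_len: int = 20):
    """Return (pos_a, pos_b, length) for matches that are maximal and unique in both sequences."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # could be extended to the left, so not maximal here
            length = 0
            while i + length < len(a) and j + length < len(b) and a[i + length] == b[j + length]:
                length += 1  # extend to the right until the bounding mismatch
            if length < min_len:
                continue
            match = a[i:i + length]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


if __name__ == "__main__":
    genome_a = "tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta"
    genome_b = "gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag"
    print(find_mums(genome_a, genome_b))  # [(6, 6, 39)]
```

The single result covers the 39-base uppercase region starting at offset 6 in both sequences, bounded by mismatching characters on each side.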

Sorting & ordering MUMs
• MUMs are sorted according to their position in Genome A
• The order of matching MUMs in Genome B is considered
[Figure: MUMs numbered by position in Genome A and mapped onto Genome B. MUM 3: random match / inexact repeat. MUM 5: transposition.]
• LIS algorithm to locate the longest set of MUMs which occur in ascending order in both genomes
• Leads to global MUM-alignment
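The ordering step can be sketched as a plain longest increasing subsequence over Genome B positions, after sorting MUMs by Genome A position. This is a simplification: it assumes MUMs are (pos_a, pos_b, length) tuples and ignores match lengths and overlaps, which a production aligner would also weigh. The function name and the coordinates are made up for illustration.

```python
from bisect import bisect_left


def longest_increasing_mums(mums):
    """Return the longest chain of MUMs whose positions increase in both genomes."""
    ordered = sorted(mums)                 # sort by position in genome A
    tail_values, tail_indices = [], []     # best chain end for each chain length
    previous = [None] * len(ordered)       # back-pointers for reconstructing the chain
    for idx, (_, pos_b, _) in enumerate(ordered):
        k = bisect_left(tail_values, pos_b)
        if k == len(tail_values):
            tail_values.append(pos_b)
            tail_indices.append(idx)
        else:
            tail_values[k] = pos_b
            tail_indices[k] = idx
        previous[idx] = tail_indices[k - 1] if k > 0 else None
    chain = []
    idx = tail_indices[-1] if tail_indices else None
    while idx is not None:
        chain.append(ordered[idx])
        idx = previous[idx]
    return chain[::-1]


if __name__ == "__main__":
    # Seven toy MUMs; the third and fifth are out of order in genome B.
    toy_mums = [(10, 12, 30), (50, 55, 25), (90, 400, 21), (130, 140, 40),
                (180, 20, 35), (230, 240, 28), (300, 310, 50)]
    print(longest_increasing_mums(toy_mums))
```

With these toy coordinates the chain keeps MUMs 1, 2, 4, 6 and 7 and drops the two out-of-order matches, mirroring the random match and the transposition in the figure.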

MUMmer Results
• 2 strains of M. tuberculosis
– H37Rv & CDC1551
– Genome size: 4 Mb
– Time: 55 s
  • Generating suffix tree: 5 s
  • Sorting MUMs: 45 s
  • S&W alignment: 5 s

Alignment of M. tuberculosis strains CDC1551 (top) & H37Rv (bottom)
• Single green lines indicate SNPs
• Blue lines indicate insertions

Comparison of 2 Mycoplasma genomes: cousins that are distantly related
• M. genitalium: 580 074 nt
• M. pneumoniae: 816 394 nt (+226 000)
• Analysis of proteins tells us that all M. g. proteins are present in M. p.
• Alignment was carried out using
– FASTA (dividing each genome into 1000 bp segments)
– All-against-all searches
– Fixed length of pattern (25)
– MUMmer (length = 25)
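The fixed-length pattern comparison can be approximated by indexing every k-mer of one genome and scanning the other, which yields the coordinates that would be plotted as dots. A minimal sketch under that assumption; `shared_kmers` is an illustrative name and the sequences are toy strings, not the Mycoplasma genomes.

```python
from collections import defaultdict


def shared_kmers(a: str, b: str, k: int = 25):
    """Return (pos_a, pos_b) for every exact k-mer shared by the two sequences."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)        # all start positions of each k-mer in a
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # k is shortened to 4 so the toy example fits on one line.
    print(shared_kmers("ACGTACGT", "TTACGTAA", k=4))
```

Unlike MUMs, these hits are neither required to be unique nor extended to maximal length, so repeated k-mers produce several dots each.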

Comparison of 2 Mycoplasma genomes
[Figure panels: using FASTA; fixed-length patterns (25-mers); MUMmer]

Post-sequencing challenges
• Genome sequencing is just the beginning of appreciating biocomplexity
• Sequence-based function assignment approaches fail as the sequence similarity drops …
• Structure-based function prediction approaches are limited by the availability of structures and of associations between structural motifs and functional descriptors
• As a result, in any genome:
– Genes with known function: ~40%
– Genes with unknown function: ~60%