New applications of alignment-free methods for biological sequence analysis and comparison

Susana Vinga
13th Portugaliae Genetica, IPATIMUP, Porto, 19 March 2010
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

The -omics era

Bioinformatics & Computational Biology
Alignment-free methodologies based on vector maps
• Multidisciplinary area
– Experimental (manipulation and measurement)
– Computational (mining and modeling)
– Computer science, dynamic systems, statistics, graph theory, molecular biology, biochemistry, physiology, …
Bioinformatics
• Genome analysis
• Sequence analysis
• Structural bioinformatics
• Gene expression
• Genetics and population analysis
• Systems biology
• Databases and ontologies
• Phylogenetics
• Data and text mining
http://bioinformatics.oupjournal.org

Outline
• Motivation & Introduction
– Biological sequence analysis and comparison: alignment-based and alignment-free strategies
– Modeling strategies, problem definition and concepts
• Methods & Examples (global and local analysis)
– Resolution-based methods: L-tuple composition
– Iterated function systems: CGR/USM and genomic signatures
– Information theory: Rényi entropy of DNA
– Markov chain models: statistical significance of motifs
– Entropic profiles
• Summary & Conclusions

GenBank DNA
How to…
• Classify
• Analyze
• Integrate
… this increasing amount of complex information?
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Sequence alignment
• The fundamental idea of alignment is that sequences that share the same substrings might have the same function or be related by homology

Alignment-based algorithms
Needleman-Wunsch (1970), Smith-Waterman (1981)
• Build up the best alignment by using optimal alignments of smaller subsequences
• Example of dynamic programming (Richard Bellman, 1953):
– A divide-and-conquer strategy: break the problem into smaller subproblems.
– Solve the smaller problems optimally.
– Use the sub-problem solutions to construct an optimal solution for the original problem.
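The dynamic-programming idea above can be sketched as a minimal global alignment scorer in the style of Needleman-Wunsch. The scoring values (match=1, mismatch=-1, gap=-1) and the function name are illustrative assumptions, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap cost.

    score[i][j] holds the best score for aligning a[:i] with b[:j],
    built from the optimal scores of the three smaller subproblems.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]
```

Filling the table costs time and memory proportional to len(a) × len(b), which is the cost that alignment-free methods try to avoid.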

Alignment-based algorithms

Multiple alignment: Clustal

BLAST
Times Cited (ISI): 27,556


Alignment-free methods review (2003)

Vector-valued functions of biological sequences
GAATTCT AATCTCC CTCTCAA CCCTACA GTACCCA mapped to (f1, …, fn)

Words in sequences: L-tuple composition
• Count “words” in sequence
• Example: DNA sequence X=GTGTGA, extract and count (overlapping) 3-tuples
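The counting step for the example X=GTGTGA can be sketched as follows; the function name is a hypothetical helper chosen for illustration.

```python
from collections import Counter


def ltuple_counts(seq, L):
    """Count all overlapping L-tuples (words of length L) in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


# The slide's example: overlapping 3-tuples of GTGTGA are GTG, TGT, GTG, TGA.
counts = ltuple_counts("GTGTGA", 3)
print(dict(counts))
```

Dividing each count by the total number of words (len(seq) - L + 1) gives the L-tuple frequency vector used by the dissimilarity measures.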

Metrics and dissimilarities
• Euclidean
• Cosine (angle θ)
• Minkowski, with city-block as the special case m=1
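A minimal sketch of these three measures applied to two frequency vectors of equal length. The function names are illustrative, and the cosine measure is returned here as the angle θ between the vectors, which is one common convention and an assumption about what the slide showed.

```python
import math


def minkowski(u, v, m):
    """Minkowski distance of order m; m=1 is city-block, m=2 is Euclidean."""
    return sum(abs(a - b) ** m for a, b in zip(u, v)) ** (1.0 / m)


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return minkowski(u, v, 2)


def cosine_angle(u, v):
    """Angle theta (radians) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding slightly outside the domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))
```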

Dissimilarities
• L-tuple
• Resolution-free
• Successful applications
Vinga (2007), Editors: T. D. Pham et al., pp. 71-105.

W-metric
• Based on 1-tuple frequencies and PAM/BLOSUM weights
Vinga et al. Bioinformatics, 2004. 20(2): p. 206-215.

Transforming L-tuples
Processing frequency vectors: normalize, filter, and select features using algebraic and statistical tools
• Normalize by expected frequencies (Pietrokovski et al., 1990)
– according to (L-1)-tuple and (L-2)-tuple frequencies: contrast L-vocabulary (CV)
• Oligonucleotide bias (Rocha et al., 1998)
– over- and under-represented L-tuples in genomic datasets might indicate phenomena of positive/negative selection in B. subtilis
• Shortest unique substrings (Haubold et al., 2005)
– occur only once and cannot be further reduced in length without losing the property of uniqueness; Caenorhabditis elegans, human and mouse genomes
• UniMarkers (Chen et al., 2002)
– fixed-length unique sequence markers might be used to assign the genomic positions of SNP sites. UMs appear only once in the genome, thus allowing SNPs to be located much faster than with alignment-based methods
– create synteny maps (Liao et al., 2004)

Human beta globin genes

HUMHBB classification

10 European Languages

Natural languages
http://commons.wikimedia.org/wiki/File:Languages_of_Europe.png

Phylogenetic inference
• Genome trees and the nature of genome evolution (Snel et al., Annual Rev Microbiol, 2005)
– Alignment-free genome trees: reconstruction methods use a statistic of the entire genomic DNA, or of all encoded proteins, to derive a distance between genomes that is then used to cluster them.
• “The fact that these alignment-free methods do not incorporate so much standard molecular evolutionary methodology and proven powerful evolutionary concepts raises interesting questions, especially because they perform reasonably well”
– Kolmogorov complexity (Li et al., Bioinformatics 2001) and Lempel-Ziv complexity (Otu et al., Bioinformatics 2003): mitochondria
– Qi J et al. (2004); Volkovich Z et al. (2010); Zheng XQ et al. (2009)
– Haubold et al. (2009): estimate mutation distances

Hybrid approaches
• Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
• Sperisen, P. & Pagni, M. (2005). JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture. BMC Bioinformatics, 6, 216.

Iterated Function Systems (IFS)
• Chaos Game Representation (CGR) for DNA
• Higher-order generalization: Universal Sequence Maps (USM)
• Represent discrete sequences in a continuous map (bijection)
• Relation with Markov chains and suffix properties
Jeffrey, H. J. (1990). Chaos game representation of gene structure. Nucleic Acids Res, 18(8): 2163–2170.
Almeida, J. S. and Vinga, S. (2002). Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics, 3(1): 6.

CGR/USM Algorithm
Figure: CGR map of X=ATGCGAGTGT... on the unit square (both axes from 0 to 1) with corners A (0,0), C (0,1), G (1,1) and T (1,0); the points for the successive prefixes A, AT, ATG, ATGC, ..., ATGCGAGTGT are labeled.
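A minimal sketch of the CGR iteration for the sequence in the figure, assuming the usual corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0) and a start at the centre of the square (0.5, 0.5); the starting point and the function name are assumptions made for illustration.

```python
# Corner of the unit square assigned to each nucleotide (assumed convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return one CGR point per symbol of seq.

    Each step moves halfway from the current point towards the corner
    of the symbol being read, so point i encodes the prefix seq[:i+1].
    """
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


for prefix_end, point in enumerate(cgr_points("ATGCGAGTGT"), start=1):
    print("ATGCGAGTGT"[:prefix_end], point)
```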

CGR/USM Example (figure: CGR square with corners C, A, G, T)

Suffix property
AAAGTAGGATAGTT
• A given L-tuple (e.g. AGT) is always mapped in the same region: fractal properties
• Generalization of Markov chain models
• Transition probability tables: k-order corresponds to 2^(k+1) divisions
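The suffix property can be checked on the slide's sequence with a short sketch. It assumes the corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0), a start at (0.5, 0.5), and a grid with 2^L divisions per axis for L-tuples; the helper name is hypothetical.

```python
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_cells(seq, L):
    """For each position, return the (column, row) of the 2^L x 2^L grid
    cell that contains the CGR point of the prefix ending there."""
    x, y = 0.5, 0.5
    side = 2 ** L
    cells = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        cells.append((int(x * side), int(y * side)))
    return cells


seq = "AAAGTAGGATAGTT"
cells = cgr_cells(seq, 3)
# Positions (0-based) of the last symbol of every occurrence of AGT.
agt_ends = [i + 2 for i in range(len(seq) - 2) if seq[i:i + 3] == "AGT"]
agt_cells = {cells[i] for i in agt_ends}
print(agt_ends, agt_cells)
```

Both occurrences of AGT land in the same cell regardless of what precedes them, so counting points per cell at this resolution is equivalent to counting 3-tuples.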

Genomic signatures
• Pervasive patterns even for ‘short’ DNA segments
GENSTYLE; Deschavanne et al. 1999 Mol Biol Evol

Generalizations
• How to maximally “fill” the space?
• Sierpinski triangle (n=3)
• CGR (n=4)
• …
• Very academic…
Almeida and Vinga BMC Bioinformatics 2009 10: 100

Information Theory
• Definition: Shannon’s entropy
– N states with probabilities p_i, i=1, ..., N
– L-tuple frequencies
• Measures the randomness level of a given system (or predictability, complexity, ...)

Rényi entropy of order α of probability distributions
• discrete
• continuous
Generalization of Shannon entropy
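For reference, the standard textbook definitions can be written as below, with α denoting the order (α ≥ 0, α ≠ 1) and the logarithm base left unspecified; this is a reminder of the usual forms rather than a transcription of the slide.

```latex
% Shannon entropy of N states with probabilities p_1, ..., p_N
H = -\sum_{i=1}^{N} p_i \log p_i

% Renyi entropy of order \alpha, discrete case
H_\alpha = \frac{1}{1-\alpha} \log \sum_{i=1}^{N} p_i^{\alpha}

% Renyi entropy of order \alpha, continuous case with density f
H_\alpha(f) = \frac{1}{1-\alpha} \log \int f(x)^{\alpha} \, dx

% Shannon entropy is recovered in the limit
\lim_{\alpha \to 1} H_\alpha = H
```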

Example: N=2
• Coin: a not necessarily fair coin…
• Maximum entropy when p=0.5: highest unpredictability
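The coin example can be reproduced numerically with a small sketch. Entropies are measured in bits (base-2 logarithm), which is an assumption about units; the function names are illustrative.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """Discrete Renyi entropy of order alpha (alpha != 1), in bits."""
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    coin = [p, 1 - p]
    print(p, round(shannon_entropy(coin), 4), round(renyi_entropy(coin, 2), 4))
```

Both measures peak at p=0.5 and fall towards zero as the coin becomes more predictable.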

Global Rényi entropy of DNA sequences
1. Represent the DNA sequence: Chaos Game Representation/Universal Sequence Maps (CGR/USM)
2. Estimate the probability density function (pdf): Parzen’s window method using normal (Gaussian) kernels with different variances σ²
3. Calculate the entropy of the estimated pdf: Rényi continuous entropy of DNA (simplification)
4. Compare with a random model

Algebraic simplification
• When using Gaussian kernels, Rényi quadratic entropy (α=2) simplifies to a closed-form expression:
– Exact results
– No numerical integration
Vinga & Almeida 2004 J Theor Biol
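A minimal sketch of why no numerical integration is needed, assuming a Parzen estimate with isotropic two-dimensional Gaussian kernels of variance σ² and integration over the whole plane. Under those assumptions the integral of the squared density reduces to a double sum of Gaussians with variance 2σ² evaluated at the pairwise differences. The function name and the use of natural logarithms are illustrative choices, and the exact expression should be checked against Vinga & Almeida (2004).

```python
import math


def renyi_quadratic_entropy(points, sigma2):
    """Renyi quadratic entropy (alpha=2) of a Gaussian Parzen estimate.

    points: list of (x, y) coordinates, e.g. CGR points of a sequence.
    sigma2: kernel variance. The convolution of two Gaussian kernels of
    variance sigma2 is a Gaussian of variance 2*sigma2, which gives the
    closed form H2 = -ln( (1/N^2) * sum_ij G(x_i - x_j; 2*sigma2) ).
    """
    n = len(points)
    norm = 1.0 / (4.0 * math.pi * sigma2)  # 2-D Gaussian with variance 2*sigma2
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
            total += norm * math.exp(-d2 / (4.0 * sigma2))
    return -math.log(total / n ** 2)
```

The double sum costs time quadratic in the sequence length, so this direct form is only practical for short sequences.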

Results: pdf estimation
• ATC motif over-represented
• Suffix property = high density in the pdf estimate f̂

Global Rényi entropy (figure: sequences ranging from more random to less random; length N=2000)

Random sequences: medians (figure axes: length N, ln N)

Local sequence information: statistical significance of motifs
• SMILE program: Structured Motifs Inference and Evaluation (Sagot et al.) and RISO (Carvalho et al. 2006)
– Exact algorithm to find motifs or models in sequence sets
– Calculates statistical significance of patterns found, based on permutation tests and Markov chain models
• Flexible input parameters
– Example: BOX 1 = 6-9, SPACER = 15-19, BOX 2 = 5-7
– ERROR for each box and total

Application: inference of conserved motifs
• Organisms: E. coli, B. subtilis, H. pylori
• Several promoter families:
– TTAAGC [19-23] TATAAT
– TTTTAA [10-14] TATAAT
– TTGACA [15-19] TATAAT
• High statistical significance. Biological significance?
Vanet et al. 2000 JMB

Motifs
• Xu, M. L. and Su, Z. C. (2010). A novel alignment-free method for comparing transcription factor binding site motifs. PLoS ONE 5(1).
• Comin, M. and Verzotto, D. (2010). Classification of protein sequences by means of irredundant patterns. BMC Bioinformatics 11.
• Casimiro, A. C., Vinga, S., Freitas, A. T., and Oliveira, A. L. (2008). An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance. BMC Bioinformatics 9, 89.

Distribution of motifs
• Analyze the position where the motif occurs
• Test for uniformity
• Motifs that are both overrepresented and non-uniformly distributed might be biologically significant
S. cerevisiae promoter sequences; TFs Aft2p, Dal80p, Gln3p, Met4p and Gat1p
Casimiro, A. C., et al. (2008) BMC Bioinformatics 9, 89.

Uniformity tests
• Small samples: choose the “best” uniformity test
– Chi-Square, Kolmogorov-Smirnov and bootstrap Chi-Square
– Optimize number of bins
– Specificity and ROC curves
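A minimal sketch of the two classical test statistics for motif positions within a promoter region of known length. It assumes positions lie in [0, length) and computes only the statistics; converting them to p-values (or to a bootstrap distribution) is left out. The function names and the binning rule are illustrative assumptions, not the procedure of Casimiro et al.

```python
def chi_square_uniform(positions, length, bins):
    """Chi-square statistic of binned positions against equal expected counts."""
    counts = [0] * bins
    for p in positions:
        # Map the position to its bin; clamp so p == length falls in the last bin.
        counts[min(int(p * bins / length), bins - 1)] += 1
    expected = len(positions) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


def ks_uniform(positions, length):
    """Kolmogorov-Smirnov statistic against the uniform distribution on [0, length]."""
    xs = sorted(p / length for p in positions)
    n = len(xs)
    # Largest gap between the empirical CDF (before and after each step) and F(x)=x.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
```

With few motif occurrences the chi-square statistic depends strongly on the number of bins, which is why the choice of bins appears as a tuning step above.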

Applications of entropy to extract local information
• Linguistic complexity: S. cerevisiae, H. influenzae
• Low entropy = high repeatability
• Coding vs. non-coding comparisons
Crochemore & Vérin 1999 Comput Chem; Troyanskaya et al. 2002 Bioinformatics

Entropic Profiles
• Local information plots that indicate overall conservation of motifs in genomes
• Obtained by unfolding the probability density function used in Rényi global entropy estimation using CGR
• A new tool implementation, based on new data structures and algorithmic simplifications, allows whole genomes to be processed in a few minutes.
• http://kdbio.inesc-id.pt/software/ep/
Vinga and Almeida (2007) BMC Bioinformatics, 8: 393
Fernandes et al. (2009) BMC Research Notes 2: 72.

New kernel function applied to CGR
• L: resolution
• φ: smoothing
Almeida and Vinga, Algorithms for Molecular Biology 2006, 1: 18

Properties
• CGR properties are maintained
• Domain is respected
• Estimation of pdf is also straightforward
Almeida and Vinga, Algorithms for Molecular Biology 2006, 1: 18

EP algorithm
• Calculation: simplified to suffix counts
• Entropic profile for the ith symbol s_i, coordinate x_i
– L is the length resolution
– φ is a smoothing parameter
– uses the number of motifs (s_(i-k+1)…s_i) in the whole sequence
Vinga and Almeida (2007) BMC Bioinformatics
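A minimal sketch of how an entropic profile could be computed from suffix counts. It assumes the estimator has the form f(x_i) = (1 + (1/N) Σ_{k=1..L} 4^k φ^k c(s_(i-k+1)…s_i)) / Σ_{k=0..L} φ^k, where N is the sequence length and c(·) counts occurrences of the suffix in the whole sequence. This form, the handling of the first positions (suffixes that would start before the sequence are skipped) and the function name are assumptions to be checked against Vinga and Almeida (2007).

```python
from collections import Counter


def entropic_profile(seq, L, phi):
    """Assumed entropic-profile estimator for every position of a DNA sequence.

    L   : length resolution (longest suffix considered)
    phi : smoothing parameter weighting longer suffixes
    """
    n = len(seq)
    # Occurrence counts of every word of length 1..L in the whole sequence.
    counts = Counter(
        seq[j:j + k] for k in range(1, L + 1) for j in range(n - k + 1)
    )
    norm = sum(phi ** k for k in range(L + 1))
    profile = []
    for i in range(n):
        weighted = 0.0
        for k in range(1, L + 1):
            start = i - k + 1
            if start < 0:
                break  # suffix would extend before the start of the sequence
            weighted += (4 * phi) ** k * counts[seq[start:i + 1]]
        profile.append((1 + weighted / n) / norm)
    return profile
```

A naive counter like this is only suitable for short sequences; processing whole genomes in minutes relies on dedicated data structures, as noted for the tool implementation.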

Whole genome case-studies (figure panels: statistical significance, input parameters, entropic profile)
• For L>6 the Chi site motif emerges

Whole genome case-studies
• Maximum at L=8 (motif length)
• EPmax>7 std

Position study: Escherichia coli genome
• Corresponds to a Chi site (Crossover Hotspot Instigator), 5’-GCTGGTGG-3’
* Key region that modulates the exonuclease activity of RecBCD, an enzyme that is necessary for chromosomal dsDNA repair and integration of exogenous dsDNA
The detection of relevant and statistically significant segments can be accomplished in an unsupervised way by spanning the parameter space to find local maxima.

Motif study: Haemophilus influenzae genome
• USS+ highly overrepresented, EPmax>10 (histogram)
• Analysis of the motif that represents a USS+ (Uptake Signal Sequence), 5’-AAGTGCCGGT-3’
* USSs are involved in natural competence, which is a genetically controlled form of horizontal gene transfer in some bacterial species

EP conclusions
• Entropic profiles (EP) provide local information about global features of DNA
• Excellent performance for sequences up to 2 Gbp (time and memory)
• Whole-genome tests corroborate the strengths of this approach to detect biologically meaningful DNA segments, related to the detection of local scales and the over- or under-representation of suffixes/motifs

Alignment-free algorithms
• Advantages
– General approach with rich collection of methods
– Robust to shuffling and recombination events
– Applicable even when less conservation is present
– All genome information can be considered
– Computationally less intensive
• Disadvantages
– Methods are less developed and integrated
– Limited detailed local information: hard to identify point mutations, indels
– Less discriminating for querying databases and genome searches
– Symbol order is sometimes neglected

Summary and conclusions
• Global and local sequence analysis
– Metrics, CGR, information theory, statistics
Alignment-free techniques can provide new tools to…
• Classify
• Analyze
• Integrate
… biological sequence data
“All Models Are Wrong But Some Are Useful” (George E. P. Box)

Acknowledgments
• Prof. Jonas S. Almeida
• KDBIO Group @ INESC-ID
• Biomathematics Group @ ITQB
• Project DynaMo (PTDC/EEA-ACR/69530/2006) from FCT
Thank you!
http://kdbio.inesc-id.pt/~svinga