Скачать презентацию The Lion the Leopard the Wolf or the Скачать презентацию The Lion the Leopard the Wolf or the

e43833d773500c2430c444277870dc22.ppt

  • Количество слайдов: 129

The Lion, the Leopard, the Wolf, or the Boar… Why not more? Comparative Genomics The Lion, the Leopard, the Wolf, or the Boar… Why not more? Comparative Genomics

Bud Mishra Professor of Computer Science, Mathematics and Cell Biology ¦ Courant Institute, NYU Bud Mishra Professor of Computer Science, Mathematics and Cell Biology ¦ Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine

Outline • Biology – Evolution by Duplication – Segmental Duplications in mammalian genomes • Outline • Biology – Evolution by Duplication – Segmental Duplications in mammalian genomes • Mathematical Challenges in Systems Biology • Compare Genomes All-vs-All – – Models: Blast and SWAT: MUMMER: PRIZM • Nondeterminism, Probability and Determinism: Computing the Priors and Mixed Strategies. • Demos: IT WORKS!

Introduction to Biology Introduction to Biology

Introduction to Biology • Genome: – Hereditary information of an organism is encoded in Introduction to Biology • Genome: – Hereditary information of an organism is encoded in its DNA and enclosed in a cell (unless it is a virus). All the information contained in the DNA of a single organism is its genome. • DNA molecule can be thought of as a very long sequence of nucleotides or bases: S = {A, T, C, G}

Complementarity • DNA is a double-stranded polymer and should be thought of as a Complementarity • DNA is a double-stranded polymer and should be thought of as a pair of sequences over S. However, there is a relation of complementarity between the two sequences: – A , T, C , G – That is if there is an A (respectively, T, C, G) on one sequence at a particular position the other sequence must have a T (respectively, A, G, C) at the same position. • We will measure the sequence length (or the DNA length) in terms of base pairs (bp): for instance, human (H. sapiens) DNA is 3. 3 £ 109 bp measuring about 6 ft of DNA polymer completely stretched out!

Inert & Rigid • Complementary base pairs (A-T and C-G) are connected by hydrogen Inert & Rigid • Complementary base pairs (A-T and C-G) are connected by hydrogen bonds and the base-pair forms a coplanar “rung” connecting the two strands. – Cytosine and thymine are smaller (lighter) molecules, called pyrimidines – Guanine and adenine are bigger (bulkier) molecules, called purines. – Adenine and thymine allow only for double hydrogen bonding, while cytosine and guanine allow for triple hydrogen bonding. • Thus the chemical (through hydrogen bonding) and the mechanical (purine to pyrimidine) constraints on the pairing lead to the complementarity and makes the double stranded DNA both chemically inert and mechanically quite rigid and stable.

J. B. S. Haldane • “If I were compelled to give my own appreciation J. B. S. Haldane • “If I were compelled to give my own appreciation of the evolutionary process…, I should say this: In the first place it is very beautiful. In that beauty, there is an element of tragedy…In an evolutionary line rising from simplicity to complexity, then often falling back to an apparently primitive condition before its end, we perceive an artistic unity … • “To me at least the beauty of evolution is far more striking than its purpose. ” • J. B. S. Haldane, The Causes of Evolution. 1932.

Human Condition Human Condition

Mer-scape… • Overlapping words of different sizes are analyzed for their frequencies along the Mer-scape… • Overlapping words of different sizes are analyzed for their frequencies along the whole human genome • Red: 24 -mers, • Green: 21 -mers • Blue: 18 mers • Gray: 15 mers • To the very left is a ubiquitous human transposon Alu. The high frequency is indicative of its repetitive nature. • To the very right is the beginning of a gene. The low frequency is indicative of its uniqueness in the whole genome.

Doublet Repeats • Serendipitous discovery of a new uncataloged class of short duplicate sequences; Doublet Repeats • Serendipitous discovery of a new uncataloged class of short duplicate sequences; doublet repeats. – almost always < 100 bp – appear to use duplicative machinery that does not have the hallmarks of transposition, segmental duplication, or pseudogene formation. • (Top). The distance between the two loci of a doublet is plotted versus the chromosomal position of the first locus. • (Bottom) : Distribution of doublets (black) and segmental duplications (red) across human chromosome 2

EBD • J. B. S. Haldane (1932): – “A redundant duplicate of a gene EBD • J. B. S. Haldane (1932): – “A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene. ” • Susumu Ohno (1970): – “Natural selection merely modified, while redundancy created. ”

Evolution By Duplication • Tandem: – Polymerase slippage – Unequal crossover: γ-globin duplication mediated Evolution By Duplication • Tandem: – Polymerase slippage – Unequal crossover: γ-globin duplication mediated by L 1. • Interspersed: – Transposons – Exon shuffling – Segmental duplication • Whole genome – S. cerevisiae; early vertebrates; A. thaliana, etc

Duplicated Genes • Mutation during nonfunctionalization (MDN) • Gene sharing, Duplication-degenerationcomplementary (DDC) • Dosage Duplicated Genes • Mutation during nonfunctionalization (MDN) • Gene sharing, Duplication-degenerationcomplementary (DDC) • Dosage compensation • Multi-function constraint

Chromosomal Aberrations • Breakage • Translocation (Among non-homologous chromosomes. ) • Formation of acentric Chromosomal Aberrations • Breakage • Translocation (Among non-homologous chromosomes. ) • Formation of acentric and dicentric chromosomes. • Gene Conversions • Amplifications and deletions • Point mutations • Jumping genes a Transposition of DNA segments • Programmed rearrangements a E. g. , antibody responses.

Recent Segmental Duplications • 3. 5% ~ 5% of the human genome is found Recent Segmental Duplications • 3. 5% ~ 5% of the human genome is found to contain • segmental duplications, with length > 5 or 1 kb, identity > 90%. • These duplications are estimated to have emerged about 40 Mya under neutral assumption. • The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally. Human From [Bailey, et al. 2002]

The Model f -- deletion or mutation insertion f +- f +Duplication by recombination The Model f -- deletion or mutation insertion f +- f +Duplication by recombination between other repeats or other mechanisms deletion or mutation insertion f ++ Duplication by recombination between repeats f ++ Mutation accumulation in the duplicated sequences

The Mathematical Model Time after duplication 1 -α-2β h 0 -- 1 -α-2β α The Mathematical Model Time after duplication 1 -α-2β h 0 -- 1 -α-2β α α α f -- α f ++ h 1 -γ H 0 h 0 2β h 0+- γ α 2γ h 0++ h 1++ 2γ 0≤d<ε 1 -α-β/2 -γ β/2 α 2β α 1 -α-β/2 -γ β/2 1 -α-2γ γ α 1 -α-β/2 -γ H 1 2β 2γ α β/2 α 1 -α-2γ ε ≤ d < 2ε h 1: proportion of duplications by repeat recombination; h 1++: proportion of duplications by recombination of the specific repeat; h 1 - - : proportion of duplications by recombination of other repeats; h 0: proportion of duplications by other repeat-unrelated mechanism; h 0++: proportion of h 0 with common specific repeat in the flanking regions; h 0+-: proportion of h 0 with no common specific repeat in the flanking regions; h 0 - -: proportion of h 0 with no specific repeat in the flanking regions; 1 -α-2γ (k-1)ε ≤ d < kε α: mutation rate in duplicated sequences; β: insertion rate of the specific repeat; γ: mutation rate in the specific repeat; d: divergence level of duplications; ε: divergence interval of duplications.

Mechanisms of Segmental Duplications in Mammalian Genomes • To test and quantify the hypothesis, Mechanisms of Segmental Duplications in Mammalian Genomes • To test and quantify the hypothesis, we designed a Markov model that incorporates: – evolutionary dynamics of the sequence elements – interactions between different elements • Segmental duplication is a multi-mechanism process. • 12% was caused by recombination between the most active interspersed repeats. • Physical instability in the DNA sequences is also involved in duplication mechanism. • The results showed dynamic interaction between duplications at different scales.

TASKS • Paralogs: – Compare one genome against itself • Orthologs: – Compare one TASKS • Paralogs: – Compare one genome against itself • Orthologs: – Compare one genome against another • Compare multiple sets of genomes • Metagenomes:

Grand Challenges in Computational Systems Biology Grand Challenges in Computational Systems Biology

Mathematical challenges in Systems Biology 1. 2. 3. 4. 5. 6. How to fully Mathematical challenges in Systems Biology 1. 2. 3. 4. 5. 6. How to fully decipher the (digital) information content of the genome How to do all-vs-all comparisons of 1000 s of genomes How to extract protein and gene regulatory networks from 1 & 2 How to integrate multiple high-throughout data types dependably How to visualize & explore large- scale, multi- dimensional data How to convert static network maps into dynamic mathematical models 7. How to predict protein function ab initio 8. How to identify signatures for cellular states (e. g. healthy vs. diseased) 9. How to build hierarchical models across multiple scales of time & space 10. How to reduce complex multi- dimensional models to underlying principles

Evolution by Mutation • Mutant gene or DNA sequence: – Substitution, Insertion/Deletion, Recombination, Duplication, Evolution by Mutation • Mutant gene or DNA sequence: – Substitution, Insertion/Deletion, Recombination, Duplication, Gene Conversion – Spread through a population by genetic drift and/or natural selection and eventually is fixed in a species • Nucleotide substitution can be divided into two classes: – Transitions: Substitution of a purine by purine (A, G) or a pyrimidine by a pyrimidine (C, T) – Transversions: The other types. • More specific properties: – Frameshift mutation, nonsense mutation, synonymous or silent substitutions, non-synonymous or amino acid replacement substitution – Transposons, gene conversions, horizontal gene transfer.

Jukes Cantor • The nucleotide substitution occurs at any site with equal frequency, and Jukes Cantor • The nucleotide substitution occurs at any site with equal frequency, and that at each site a nucleotide changes to of the three remaining nucleotides with a probability of a per year. • Probability of a change of a nucleotide to another is r = 3 a • qt = Proportion of identical nucleotides between two sequences qt+1 = (1 -2 r) qt +(2/3) r (1 -qt) q(t) = 1 – ¾(1 -e-8 rt/3)

Kimura 2 Parameters • The rate of transitional substitution per site per year = Kimura 2 Parameters • The rate of transitional substitution per site per year = a • The rate of transversional substitution per site per year = 2 b • Total substitution rate, r = a + 2 b • Pt is the proportion of identical transition-type pairs AG, GA, CT, TC • Qt is the proportion of identical transversion-type pairs AG, GA, CT, TC P(t) = (1/4)(1 - 2 e-4(a+b)t +e-8 b t) Q(t) = (1/2)(1 –e-8 b t)

Other Models • Tajima & Nei: – Substitutions that seem to be rather insesnsitive Other Models • Tajima & Nei: – Substitutions that seem to be rather insesnsitive to various disturbing factors. • Tamura: – Takes into account varying GC content • Hasegawa et al. (HKY) • Rzhetsky & Nei • Tamura & Nei

Blast and SWAT: Blast, and the world Blasts with you; SWAT, and you SWAT Blast and SWAT: Blast, and the world Blasts with you; SWAT, and you SWAT alone…

Pair-wise Alignment: Task Definition • Given – a pair of sequences (DNA or protein) Pair-wise Alignment: Task Definition • Given – a pair of sequences (DNA or protein) – a method for scoring the similarity of a pair of characters • Do – determine the correspondences between substrings in the sequences such that the similarity score is maximized

DP Alignment Algorithms • Needleman & Wunsch – Improvement for Affine Gap-penalty, by Smith DP Alignment Algorithms • Needleman & Wunsch – Improvement for Affine Gap-penalty, by Smith & Waterman (Swat) – Dynamic programming: • Determine alignment of two sequences by determining alignment of all prefixes of the sequences

Comparing Syntenic Regions Comparison of Human and Fugu orthologous genomic sequence using Plains and Comparing Syntenic Regions Comparison of Human and Fugu orthologous genomic sequence using Plains and EMBOSS. This sequence contains six exon regions. Both Plains and EMBOSS correctly identify the exon region 2 in Fugu with 3 in Human, but only Plains correctly aligns the exon region 5 in Fugu with 5 in Human.

Heuristic Alignment • FASTA [Pearson & Lipman, 1988] • BLAST [Altschul et al. , Heuristic Alignment • FASTA [Pearson & Lipman, 1988] • BLAST [Altschul et al. , 1990] – BLAST heuristically finds high scoring segment pairs (HSPs): • identical length segments from 2 sequences with statistically significant match scores • i. e. ungapped local alignments • key tradeoff: sensitivity vs. speed

The MUMmer System • Delcher et al. , Nucleic Acids Research, 1999 • Given The MUMmer System • Delcher et al. , Nucleic Acids Research, 1999 • Given genomes A and B – Find all maximal, unique, matching subsequences (MUMs) – Extract the longest possible set of matches that occur in the same order in both genomes – Close the gaps – Output the alignment

Find Longest Subsequence • Sort MUMs according to position in genome A • Solve Find Longest Subsequence • Sort MUMs according to position in genome A • Solve variation of Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes – Requires O(k log k) time where k is number of MUMs

The More the Merrier Algorithm Homology Seed Indexing BLAST exact k-mer Scanned with DFA* The More the Merrier Algorithm Homology Seed Indexing BLAST exact k-mer Scanned with DFA* [31] WABA Array [32] Sorted Array [33] BLASTZ wobble base degenerate k-mer randomly projected k-mer with

Summary • Many innovative sequence alignment tools are available for detailed comparative genomic studies. Summary • Many innovative sequence alignment tools are available for detailed comparative genomic studies. – Recent segmental duplications in mammalian genomes (with identity level >90%) can be detected using BLAST and many other tools. • They use exact or inexact k-mers as homology seeds. • As homology levels become lower, they encounter a dilemma between sensitivity and computational efficiency – homologous segments, – segmental duplications, or – homology-based phylogentic distances.

Summary • To improve sensitivity they rely on exhaustive searches of exact matches with Summary • To improve sensitivity they rely on exhaustive searches of exact matches with short mers or inexact matches with long mers. – They encounter too many false-positives, to be later filtered through an expensive post-processing step. • Instead, more stringent search criteria (longer mers with more exact matches) may be used to improve efficiency. – Then these algorithms fail to detect low-homology regions, such as ancient duplication events. • In order to detect less-recent duplications, orthologous genes have been used as “anchors” to map out the duplication blocks. But, for obvious reasons, they are unsuitable for identifying duplications that are not subject to strong selection process.

#2 PRIZM (Paxia, Rastogi, Zhou, Mishra) How to do all-vs-all comparisons of 1000 s #2 PRIZM (Paxia, Rastogi, Zhou, Mishra) How to do all-vs-all comparisons of 1000 s of genomes

PRIZM • It uses a Bayesian scheme. – It is efficient…Linear time. – It PRIZM • It uses a Bayesian scheme. – It is efficient…Linear time. – It computes homologous regions between two genomes even when the homology level drops to a value around 65%. • Incorporates background knowledge about genome evolution, by experimenting with several priors (noninformative improper prior, exponential and Gamma priors, and priors based on Juke-Cantor one parameter and Kimura’s two parameters models of evolution). • The results appear to be unaffected by these choices, while the computational efficiency is only mildly affected by what prior is employed.

A Simple Observation • Homology is hard to compute but easy to verify. Quadratic A Simple Observation • Homology is hard to compute but easy to verify. Quadratic vs. Linear time. – Can a probabilistic approach replace nondeterminism? If so, we can expect a probabilistic linear time algorithm. – Unlikely! But if we can use priors based on the underlying distributions, there is hope. – Many computational biology problems share this feature!

#2 The genomic sequences under comparison: A) M. genitalium (Y -axis) and M. pneumoniae #2 The genomic sequences under comparison: A) M. genitalium (Y -axis) and M. pneumoniae (X-axis), computed using 300 probes from six iterations, taking about 5 seconds using a nonoptimized algorithm

#2 The mouse chromosome 10 and rat chromosome 1 share a syntenic region about #2 The mouse chromosome 10 and rat chromosome 1 share a syntenic region about 20 Mb at the beginning of the two chromosomes (arrow).

#2 Alignment between Mouse Chromosome 8 and human chromosome 4. In silico experiment uses #2 Alignment between Mouse Chromosome 8 and human chromosome 4. In silico experiment uses 21 mers (5, 105, 3), and takes about 45 seconds.

Demo Human vs. Mouse M. genitalium vs. M. pneumoniae Demo Human vs. Mouse M. genitalium vs. M. pneumoniae

Homology Curve • An m-mer is a word of length m, selected from either Homology Curve • An m-mer is a word of length m, selected from either genome. • Consider a location in the first genome, G 1[a] and a short window, starting at a. W 1, a = G 1[a, a+m− 1] • Compare this window with a word of equal length from the second genome starting at G 2[b]: W 2, b = G 2[b, b+m − 1]. • Define the homology level for the locations G 1[a] and G 2[b] and a window of size m as h(2)(G 1[a], G 2[b], m) = (1/m) åg=1 m-1 IG 1[a+g] = G 2[a + g].

Homology Curve • Let us define, h(a) to denote the highest homology level for Homology Curve • Let us define, h(a) to denote the highest homology level for genome G 1 at position a and computed with respect to G 2: h(a) = max 1 · b <|G 2| - m h(2)(G 1[a], G 2[b], m) • The “homology curve” for the first genome, G 1 with the respect to the second genome G 2 is then defined as: h : [1. . G 1 −m − 1] → [0, 1] : a a h(a)

PRIZM • Replacing non-determinism with a probabilistic guessing scheme. – The probability distributions are PRIZM • Replacing non-determinism with a probabilistic guessing scheme. – The probability distributions are based on biologically meaningful priors. – Using these priors it guesses a local homology curve, and designs and performs an in silico experiment. – It uses the results to verify its guess (in linear time) and refines the local homology curve and the probability distributions for the next iteration.

Probabilistic guesses • Use a Bayesian scheme and a boosting approach to modify the Probabilistic guesses • Use a Bayesian scheme and a boosting approach to modify the probability distributions of the “guessing experiments” from one iteration to next. • At each iteration, a sequence of words with a specific distribution is selected from one genome, and is optimally partitioned into groups for “in silico experiments” involving – exact-match search, – inexact-match search with one error, – inexact match search with two errors, etc.

Probabilistic guesses • These searches can be efficiently conducted over the second genome, – Probabilistic guesses • These searches can be efficiently conducted over the second genome, – Assuming that the other genome has been preprocessed and stored in an efficient data structure (e. g. , suffix arrays or hash table). – From the results of the experiments, a Bayesian estimator can compute the local homology levels for the genome, and use it to verify and sharpen the probabilistic distributions for the next iteration. • The algorithm converges to the true local homology levels after a few iterations.

In Silico Experiments • Let b, B = | G 2 |/b, w, W, In Silico Experiments • Let b, B = | G 2 |/b, w, W, m, N 0, N 1, . . . , Nk (k ≤ m, and in our applications usually k = 2) be some pre-specified parameters. – Choose k+1 random subsets, S 0, S 1, . . . , Sk, of words each of length m, randomly (i. i. d. uniform) from G 1[a, a+w− 1], such that |S 0| = N 0, |S 1| = N 1, . . . , and |Sk| = Nk. – Consider a block in the second genome of length b: Bb = G 2 [b, b+b− 1]. Let X 0 (X 1, . . . , Xk, respectively) be defined as the number of m-mers from set S 0 (S 1, . . . , Sk, respectively) that match exactly (with one, two and so on up to k mismatches, respectively) to an m-mer in G 2[b, b +b − 1].

Experiment Design • By examining the sensitivity (∂(Xi/Ni)/∂h = a′i[h]) we can divide the Experiment Design • By examining the sensitivity (∂(Xi/Ni)/∂h = a′i[h]) we can divide the interval for h into three intervals: [1/4, θ 1] ≈ [1/4, (m− 2)/m], (θ 1, θ 2] ≈ [(m− 2)/m, (m− 1)/m] and (θ 1, 1] ≈ ((m− 1)/m, 1], such that the choices of (N 0, N 1, N 2) are based on the following mixed strategies: N 0 = (K/b) sq 21 pi, I(h) dh N 1 = (K/3 bm) sq 1 q 2 pi, I(h) dh N 2 = (2 K/9 b(m 2 –m)) s 1/4 q 1 pi, I(h) dh ….

In Silico Experiments • Thus Xi’s for i in [0. . k] are binomially In Silico Experiments • Thus Xi’s for i in [0. . k] are binomially distributed random variables whose parameters depend on the homology level h. • We can estimate the local homology by the following robust estimators: h h | X 0, X 1, … Xk i = s 01 h p(h | X 0, …, Xk) dh = s 01 h p(h) p(X 0, …, Xk|h) dh/ s 01 p(h) p(X 0, …, Xk|h) dh – Similarly, compute the mean, standard deviation and confidence of the homology function over Bb. – Let b* = arg maxb mean(Bb). Then the homology function is estimated at a by mean(Bb*).

Conditional Probabilities ri = b pm åj=1 i C[m, j] 3 j si = Conditional Probabilities ri = b pm åj=1 i C[m, j] 3 j si = åj=1 i C[m, j] hm−j (1 − h)j bi = (1 − si)(1 − ri) ai = 1 − bi = si + ri − si ri. p(X 1, X 2, …, XK|h) / Õj aj Xj bj. Nj – Xj

Initial Priors • Using Jukes-Cantor: the random variable r represents the rate of nucleotide Initial Priors • Using Jukes-Cantor: the random variable r represents the rate of nucleotide substitution per site per year. – In this model, it is assumed that nucleotide substitution occurs at any nucleotide site with equal frequency and at each site a nucleotide changes to one of the three remaining nucleotides with a probability a per year: r = 3 a. – The substitution rate is often higher at functionally less important sites than at functionally more important sites. – Case 1: r » Exponential(λ): fexp(r) = λe−λr. In that case p(h) » (4 h − 1)3λ/8 T− 1 – Case 2: r » Gamma(λ, ν): f Γ(r) = λn e−λrrn− 1/Γ(n). In that case p(h) » (4 h − 1)3λ/8 T− 1 ln[3/(4 h − 1)/T]n− 1

Initial Priors • A more complex structures arise as we consider multi-parameter models: e. Initial Priors • A more complex structures arise as we consider multi-parameter models: e. g. , Kimura’s Two-Parameter Method. In this model, the rate of transitional substitution per site per year (a) is assumed to be different from that of transversional substitution (2β). h = (1 − P)(1 − Q) P = (1/4) (1 − 2 e− 4(a+β)T + e− 8βT) Q = (1/2) (1 − e 8βT)

Initial Priors • Assume that α » Exponential(λa) and β » Exponential(λβ) and they Initial Priors • Assume that α » Exponential(λa) and β » Exponential(λβ) and they are independent. p(h) = (λa λβ/8 T 2) s 0 1 (1 − P)2+(λa+λb)/8 T(2 h + P − 1)(λa−λb)/8 T (h − 2 P + P 2)−λa/4 T /(2 h + P − 1)(h − 2 P + P 2) d. P.

Refining Priors • In iteration i, let us consider an interval I with k Refining Priors • In iteration i, let us consider an interval I with k homology estimates: h m 1, s 1 i, h m 2, s 2 i, …, h mk, sk i • Assume that the homology values h 1, h 2, . . . , hk is sampled from a distribution h » N(μ, σ2). • Furthermore, we assume the following: μ » N(ξ, t 2) r = (σ2 + t 2)− 1 » Γ(a, β).

New Prior – Prior = Kummer’s hypergeometric function of order 1 f(h|ξ, t, a New Prior – Prior = Kummer’s hypergeometric function of order 1 f(h|ξ, t, a , β) » s 01/τ2 ra-2 1/√(1 − r t 2) e−Br dr » 1 F 1(a − 1, a − 1/2, −B/ t 2) ¼ ((h-x)2 + 2 b)/2 t 2)-a+1 – Estimates ξ = h μji t 2 = h μ 2 i − h μj i 2 a/β = h (σ2 i + t 2)− 1 i a/β 2 = h σ2 i + t 2)− 2 i - h σ2 i + t 2)− 1 i 2

Optimizing the parameters • The parameter choices are as follows: – Let the number Optimizing the parameters • The parameter choices are as follows: – Let the number of blocks (b) and the number of windows (w) be chosen a priori based on the needed resolution for homology. – We may choose these parameters so that b = O(p(G 2)) and w = O(p(G 1)). We assign K = O(1) amount of work to a region defined by a combination of any single block with any single window. – Thus the amount of work is roughly K(G 1 G 2)/(w b) = O(G 1+G 2) per iteration. – The mer size parameter ‘m’ is chosen so that the probability of a “hit” in a block containing a homologous sequence much higher than in a random block: (b/4 m) << E(h 0, G)m.

#7 How to identify signatures for cellular states (e. g. healthy vs. diseased) #7 How to identify signatures for cellular states (e. g. healthy vs. diseased)

Microarray Analysis Normal DNA • Representations are reproducible samplings of DNA populations in Tumor Microarray Analysis Normal DNA • Representations are reproducible samplings of DNA populations in Tumor DNA which the resulting DNA has a new format and reduced complexity. Normal LCR Tumor LCR Label Hybridize – We array probes derived from low complexity representations of the normal genome – We measure differences in gene copy number between samples ratiometrically – Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific and noise

Copy Number Fluctuation A 1 B 1 C 1 A 2 B 2 C Copy Number Fluctuation A 1 B 1 C 1 A 2 B 2 C 2 A 3 B 3 C 3

A MAP (Maximum A Posteriori) Estimators • Priors: – Deletion + Amplification • Data: A MAP (Maximum A Posteriori) Estimators • Priors: – Deletion + Amplification • Data: – Priors + Noise • Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers • Generalizes HMM: The prior depends on two parameters pe and pb. – pe is the probability of a particular probe being “normal”. – pb is the average number of intervals per unit length. (pe, pb) max at (0. 55, 0. 01)

A reasonable choice of priors yields good segmentation. A reasonable choice of priors yields good segmentation.

A reasonable choice of priors yields good segmentation. A reasonable choice of priors yields good segmentation.

Prior Selection: F criterion • For each break we have a T 2 statistic Prior Selection: F criterion • For each break we have a T 2 statistic and the appropriate tail probability (p value) calculated from the distribution of the statistic. In this case, this is an F distribution. • The best (pe, pb) is the one that leads to the maximum min p-value. (pe, pb) max at (0. 55, 0. 01)

Current Implementation • NYU Array CGH • Versatile: – Works well for BAC array, Current Implementation • NYU Array CGH • Versatile: – Works well for BAC array, ROMA & Agilent – Raw Affymetrix data is noisier, but our new algorithm for “background correction and summarization (BCS)” makes Affy-data significantly better than the competitors. – (BCS also generalizes to gene expression and performs better than RMA, Li-Wong, etc. ) • Global Analysis (LOH analysis, detecting TSG and onco-genes) • Fast implementation, with visualization and integration (to be released soon…)

Finding Cancer Genes • LOH/Deletion Analysis analysis • Hypothesize a TSG (Tumor Suppressor Gene) Finding Cancer Genes • LOH/Deletion Analysis analysis • Hypothesize a TSG (Tumor Suppressor Gene) • Score function for each possible genomic region containing the TSG – Evolutionary history – Interactions – Parameters • This score can be computed using estimation from data and also prior information on how the deletions arise. We use a simple approximation; we assume there is a Poisson process that generates breakpoints along the genome and an Exponential process that models the length of the deletions.

Simulations • • • There are S = 200 individuals. Deletions occur in each Simulations • • • There are S = 200 individuals. Deletions occur in each individual. • A cell having incurred a deletion overlapping one of the TSGs, will multiply quickly and • generate many copies of itself. These copies evolve independently for • a while until we collect the information. • Several scenarios: Model 1: one TSG [300, 350]. All individuals are diseased because of homozygous deletion of the TSG. Model 2: one TSG [300, 350]; 50% of the individuals are diseased because of homozygous deletion of the TSG and the rest are diseased because of hemizygous deletion of the TSG. Model 3: one TSG [300, 350]. All individuals are diseased because of hemizygous deletion of the TSG. Model 4: two TSGs; 50% of the people are diseased because of homozygous deletion in the first TSG and the other half are diseased because of homozygous deletion in the second TSG.

Model Simulation Model Simulation

#10 Host-Pathogen Interaction #10 Host-Pathogen Interaction

Bio. COMP/Biospice GOALIE Gene Ontology Algorithmic Logic Invariant Extractor Bio. COMP/Biospice GOALIE Gene Ontology Algorithmic Logic Invariant Extractor

A Picture Biological System: Part-List + Functional Relations Measurements Regulatory, Metabolic & Signaling Relations A Picture Biological System: Part-List + Functional Relations Measurements Regulatory, Metabolic & Signaling Relations Recomputation Redescription Descriptive Relation Numerical Traces KRIPKE MODELS Invariants

Yeast-Cell Cycle Data: Spellman et al. Yeast-Cell Cycle Data: Spellman et al.

Invariants G 0 G 1(I) M G 1(II) G 2 S Invariants G 0 G 1(I) M G 1(II) G 2 S

Another Example • Pre-Apoptosis 6 time points data analysis • Six time-point data at Another Example • Pre-Apoptosis 6 time points data analysis • Six time-point data at 2 h, 4 h, 6 h, 8 h, 12 h, 24 h – Cells treated with SEB – Control: untreated cells • Data from Jett-Lab (Walter-Reed)

Hypothesized pathway Hypothesized pathway

 • Note that GOALIE is not intimately tied to any particular ontology: It • Note that GOALIE is not intimately tied to any particular ontology: It can be customized to work with other controlled vocabulary: e. g. , UMLS, Meta. CYC, Reactome… • Thus GOALIE also provides a way to “compare” different ontologies… From the GO web site. The path back to each ontology from a gene. We will call each term in a path a split.

Structure of a GO annotation Each gene can have several annotated GOs and each Structure of a GO annotation Each gene can have several annotated GOs and each GO can have several splits. E. g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits

Time Coarse GO categorization • GOALIE partitions the temporal data in order to understand Time Coarse GO categorization • GOALIE partitions the temporal data in order to understand local behavior. Data are grouped by considering (weighted) windows of time points (2 -4 -6, 4 -6 -8, 6 -8 -12, 8 -12 -24) • Next GOALIE uses a K-Means algorithm (genesis) to produce clusters for each window • GOALIE then associates to each cluster a set of GO descriptors (with p-values and controlled FDR, false discovery rates) • GOALIE computes “patterns” of GO categories across the time points • All these steps can be run from one unifying interface that GOALIE provides. GOALIE will be embedded inside VALIS.

X 1 X 2 X 3 X 4 Female, Sings, Overweight Male, Talks, Thin X 1 X 2 X 3 X 4 Female, Sings, Overweight Male, Talks, Thin Male, Sings, Overweight Female, Sings, Underweight

X 3 X 4 X 3 X 2 X 1 X 4 X 2 X 3 X 4 X 3 X 2 X 1 X 4 X 2

X 3 X 4 X 3 X 2 X 1 X 4 X 2 X 3 X 4 X 3 X 2 X 1 X 4 X 2

X 3 X 4 X 3 X 2 X 1 X 4 X 2 X 3 X 4 X 3 X 2 X 1 X 4 X 2

X 2 X 4 X 3 X 1 X 2 Finale X 2 X 4 X 3 X 1 X 2 Finale

X 2 X 4 X 3 X 4 : O, : FLS X 1 X 2 X 4 X 3 X 4 : O, : FLS X 1 X 2 X 3 : O, : FLS X 1 : O, FLS Finale O

X 2 X 4 X 3 X 4 : O, : FLS : O X 2 X 4 X 3 X 4 : O, : FLS : O UFLS X 1 X 2 X 3 : O, : FLS : O UFLS X 1 : O, FLS : O UFLS : O, FLS Finale O

X 2 X 4 X 3 X 4 : O, : FLS : O X 2 X 4 X 3 X 4 : O, : FLS : O UFLS X 1 X 2 X 3 : O, : FLS : O UFLS X 1 : O, FLS : O UFLS : O, FLS Finale O

X 2 X 4 X 3 X 4 : O, : FLS : O X 2 X 4 X 3 X 4 : O, : FLS : O UFLS X 1 X 2 X 3 : O, : FLS : O UFLS A[: O UFLS] It ain’t over ‘til the fat lady sings : O, : FLS : O UFLS X 1 : O, FLS : O UFLS : O, FLS Finale O

GOALIE: GO Algorithmic Logic for Invariant Extraction Clusters connection tree Each level a “window” GOALIE: GO Algorithmic Logic for Invariant Extraction Clusters connection tree Each level a “window” Micro-array accessions GO categories Clusters information Connection information Cluster Information

GOALIE: GO Algorithmic Logic for Invariant Extraction GO categories describing genes in “source” cluster GOALIE: GO Algorithmic Logic for Invariant Extraction GO categories describing genes in “source” cluster GO categories shared with “destination” cluster GO categories describing “destination” cluster but not “source” GO categories describing “source” cluster but not “destination”

GOALIE: SEB Analysis Preliminary Results 1. 2. Time Course Window 1 to Time Course GOALIE: SEB Analysis Preliminary Results 1. 2. Time Course Window 1 to Time Course Window 2: Connection 1: 9 to 2: 18. By inspecting the first cluster in the first window (Cluster~1: 9), we note that one of the connection to the cluster 2 in the second window (Cluster~2: 18) is labeled (among many others) by the GO categories “circulation” (GO: 0008015), and by the category “negative regulation of heart rate” (GO: 0045822). This represents a constant set of biological processes shared by this cluster chain, traversing Cluster 3: 17, to Cluster 4: 13. Time Course Window 1 to Time Course Window 2: Connection 1: 9 to 2: 6. The connection between Cluster 1: 9 and Cluster 2: 6 is interesting because it shows how the category “regulation of lymphocyte proliferation” (GO: 0050670) becomes activated in the next time-window (Cluster 2: 6), while the categories “antigen presentation” and “antigen processing” became inactive. This should indicate that some of the genes in the clusters start a response to the pathogen in the second time point.

Hidden Kripke Model • “Hidden Kripke Model” reconstruction via ontology based redescription of time-sliced Hidden Kripke Model • “Hidden Kripke Model” reconstruction via ontology based redescription of time-sliced clusters of time-course measurements (arrays) offers a novel viewpoint form which formulate biological hypotheses and eventually reconstruct “biological first principles” • The GOALIE tool, in its embryonic state, has already been proven “correct” and offered a wide range of insights into a number of biological datasets • Spellman’s Yeast Cell Cycle • SEB host-pathogen data from WRAIR • New datasets being analyzed now include • P. falciparum dataset [Bozdech et al, 1(1): 085] • Subset of Genome Module Map dataset [Segal et al]

Story generation Dataset Formula Generator Formula Temporal Logic Analysis Natural Language Story Generation HTML Story generation Dataset Formula Generator Formula Temporal Logic Analysis Natural Language Story Generation HTML file • Temporal Logic formulae can be rendered in English. • Temporal Logic formulae can be generated automatically (with care). • Each formula can be tested against a set of datasets; differences can then be noted.

#4 How to integrate multiple high-throughout data types dependably; How to visualize & explore #4 How to integrate multiple high-throughout data types dependably; How to visualize & explore large- scale, multi- dimensional data #5 Chr 1 Ns ATs Reps MER 57 A CDs ΔG Dup Copy # Mer Freq VALIS vast¢active¢ living¢intelligence¢system L 1 P

Databases • Sequence – GENBANK, EMBL, DDBJ – SWISS-PROT, Gen. Pept, TREMBL, PIR – Databases • Sequence – GENBANK, EMBL, DDBJ – SWISS-PROT, Gen. Pept, TREMBL, PIR – PDB, SCOP, … • Genetic (Flybase, Wormbase, GDB, At. DB, SGD, …) • Expression (RNA Expression from microarrays, . . . ) • Metabolic (KEGG, WIT, …) • Literature (MEDLINE, …) • Bioperl • International open-source collaboration • 7 Years of development • 600 Modules • 100, 000 lines of code • Easy-to-use, stable, and consistent programming interface for bioinformatics application programmers

Bio. Perl Module Statistics mean 144 Bio. Perl Module Statistics mean 144

Bioinformatics Data • Majority of the data kept in (indexed) flat files & Relational Bioinformatics Data • Majority of the data kept in (indexed) flat files & Relational Databases • Flat Files – Unstructured/Difficult to update • Relational Databases – Strings are atomic objects – Blobs • Example: Golden. Path /UCSC Genome Browser • Rough Estimates: – 800 Tables – 14000 fields – 1600 varchars [most varchar(255) ] – 1300 blobs • Blobs: – exon. Start, Exon. End, q. Starts, t. Starts, …

Key Feature of Valis • State of the art of rapid prototyping in bioinformatics Key Feature of Valis • State of the art of rapid prototyping in bioinformatics • Multilanguage Scripting • Data storage • Graphical User interfaces

Multi-Scripting • A Valis script can be written in any supported language: – JScript, Multi-Scripting • A Valis script can be written in any supported language: – JScript, VBScript, Python, PERL, Lisp, R and SETL. – All the scripts see the same Valis class hierarchy. – For example, once a user learns that a Valis Sequence Object has a method called Input that will read the sequence from a file, the user can subsequently use this same primitive from all the different languages.

Data Storage #4 • Based on Extended B-Trees • At the lowest level there Data Storage #4 • Based on Extended B-Trees • At the lowest level there is an Heap of pages • Must correctly keep track of the reference counts of each record/object to implement value semantics

Visualization #5 • Once the processing is completed, it is very important to be Visualization #5 • Once the processing is completed, it is very important to be able to quickly visualize the results. • For this reason Valis provides numerous visualization tools that allow a user to quickly display – – sequences, maps, microarray data, tables, graphs and annotations. • These widget can be customized from the scripts.

How to convert static network maps into dynamic mathematical models; How to predict protein How to convert static network maps into dynamic mathematical models; How to predict protein function ab initio; How to build hierarchical models across multiple scales of time & space; How to reduce complex multi- dimensional models to underlying principles #6 Glycogen #9 #10 P_i Glucose-1 -P Phosphorylase a Phosphoglucomutase Glucokinase Glucose-6 -P Phosphoglucose isomerase SIMPATHICA Fructose-6 -P Phosphofructokinase Glycolysis

Sim. Pathica System Sim. Pathica System

Simpathica is a multi-functional system Front End Analysis XSSYS (TL) Time/Frequency Combined TL/TF Model Simpathica is a multi-functional system Front End Analysis XSSYS (TL) Time/Frequency Combined TL/TF Model data structures Equations Handling Octave/Matlab code generation Simulation Dashboard Matlab Octave m. Array DB NYUMAD Visualization Pt. Plot gnuplot

Simpathica is a modular system Canonical Form: Characteristics: • Predefined Modular Structure • Automated Simpathica is a modular system Canonical Form: Characteristics: • Predefined Modular Structure • Automated Translation from Graphical to Mathematical Model • Scalability

Purine Metabolism • Purine Metabolism – Provides the organism with building blocks for the Purine Metabolism • Purine Metabolism – Provides the organism with building blocks for the synthesis of DNA and RNA. – The consequences of a malfunctioning purine metabolism pathway are severe and can lead to death. • The entire pathway is almost closed but also quite complex. It contains – several feedback loops, – cross-activations and – reversible reactions • Thus is an ideal candidate for reasoning with computational tools.

Simple Model Simple Model

Biochemistry of Purine Metabolism • The main metabolite in purine biosynthesis is 5 -phosphoribosyl-a-1 Biochemistry of Purine Metabolism • The main metabolite in purine biosynthesis is 5 -phosphoribosyl-a-1 pyrophosphate (PRPP). – A linear cascade of reactions converts PRPP into inosine monophosphate (IMP). IMP is the central branch point of the purine metabolism pathway. – IMP is transformed into AMP and GMP. – Guanosine, adenosine and their derivatives are recycled (unless used elsewhere) into hypoxanthine (HX) and xanthine (XA). – XA is finally oxidized into uric acid (UA).

Purine Metabolism Purine Metabolism

Queries • Variation of the initial concentration of PRPP does not change the steady Queries • Variation of the initial concentration of PRPP does not change the steady state. (PRPP = 10 * PRPP 1) implies steady_state() • This query will be true when evaluated against the modified simulation run (i. e. the one where the initial concentration of PRPP is 10 times the initial concentration in the first run – PRPP 1). TRUE • • Persistent increase in the initial concentration of PRPP does cause unwanted changes in the steady state values of some metabolites. If the increase in the level of PRPP is in the order of 70% then the system does reach a steady state, and we expect to see increases in the levels of IMP and of the hypoxanthine pool in a “comparable” order of magnitude. Always (PRPP = 1. 7*PRPP 1) implies steady_state() TRUE

Queries • Consider the following statement: • Eventually (Always (PRPP = 1. 7 * Queries • Consider the following statement: • Eventually (Always (PRPP = 1. 7 * PRPP 1) implies steady_state() and Eventually Always(IMP < 2* IMP 1)) and Eventually (Always (hx_pool < 10*hx_pool 1))) • where IMP 1 and hx_pool 1 are the values observed in the unmodified trace. The above statement turns out to be false over the modified experiment trace. . • • False In fact, the increase in IMP is about 6. 5 fold while the hypoxanthine pool increase is about 60 fold. Since the above queries turn out to be false over the modified trace, we conclude that the model “over-predicts” the increases in some of its products and that it should therefore be amended

Final Model Final Model

Purine Metabolism Purine Metabolism

Quantum leaps “provoke creative dreaming” Quantum leaps “provoke creative dreaming”

Shotgun Mapping • Schematics – – – – Experiment Design Robotics Bio. Chemistry Imaging Shotgun Mapping • Schematics – – – – Experiment Design Robotics Bio. Chemistry Imaging Image Analysis Statistical Algorithms Visualization

Shotgun Optical Mapping of Genomes Gentig’s Successes E. coli • E. coli • P. Shotgun Optical Mapping of Genomes Gentig’s Successes E. coli • E. coli • P. falciparum ¦ D. radiodurans ¦ Y. Pestis • Rhodobacter sphaeroides ¦ Shigella flexneri ¦ Salmonella enterica • Aspergillus fumigatus • … P. falciparum • The automated Gentig system is routinely used – to map a microbe genome quickly & effortlessly – by a scientist with no quantitative or computational training. Y. Pestis

Some Interesting Applications • • Sequence Validation Haplotyping Sequencing Comparative Genomics Rearrangement events Hemizygous Some Interesting Applications • • Sequence Validation Haplotyping Sequencing Comparative Genomics Rearrangement events Hemizygous Deletions Epigenomics Characterizing c. DNAs – Expression Profiling – Alternate Splicing

Sequencing in Post-Genomic Era • Haplotypic Sequencing of 6. 6 Billion Base Pairs in Sequencing in Post-Genomic Era • Haplotypic Sequencing of 6. 6 Billion Base Pairs in a Diploid Human genome • Less than $700 • Less than 24 Hours • Draft Quality (Not Resequencing)

Ingredients • Single Molecule Optical Mapping – Methylation Sensitive Restriction Enzymes – Multiple-Enzyme Maps Ingredients • Single Molecule Optical Mapping – Methylation Sensitive Restriction Enzymes – Multiple-Enzyme Maps • Probe Hybridization on the Surface – PNA, LNA, TFO • Sequencing by Hybridization – Localize algorithms

PSBH problem • The PSBH problem: – Can overcome the complexity issues associated with PSBH problem • The PSBH problem: – Can overcome the complexity issues associated with the SBH problem. – For each L-mer probe, in addition to knowing whether it hybridizes with the unknown sequence (with or without count) we are provided constraints on the location of the L-mer in the sequence: – The constraint takes the form of a set of permissible locations for each L-mer (which need not be contiguous). – However it is known that if the constraint limits each L-mer to no more than two exact locations on the sequence, then it admits an efficient algorithmic solution—If 3 or more locations are possible the reconstruction problem becomes NP-complete. – However if the location constraints are in the form of K contiguous locations then the reconstruction problem is exponential in K, rather than the sequence length m.

Multiple PSBH problem • • For our data set of probe maps for all Multiple PSBH problem • • For our data set of probe maps for all 6 -mers, we have multiple instances of each L-mer, for L=6 (about one every 4 Kb on each strand of the DNA) in the sequence, and for each instance we can constrain the location to within about 200 base pairs depending on the optical resolution. A special case of the PSBH problem, which we will call the “Multiple Positional Sequence by Hybridization” (Multiple PSBH) problem, where we have separate constraints for each of the multiple instances of each L-mer. – By focusing on a small window of about 2000 bp, in which most L-mers will occur only once, we can solve the standard PSBH problem where separate constraints for multiple instances of each L-mer are not important. – However if we solve each local PSBH problem for each 2000 bp window separately, the worst case exponential time reconstruction is unlikely to apply to most windows, so if we simply limit the amount of time spent in each window to some upper bound linear in the window size, we can still be able to reconstruct the sequence in most windows in linear time.

Initial Experiments Initial Experiments

The end The end

The end The end