  • Number of slides: 39

CSE 182 - L9: Gene Finding (DNA signals); Genome Sequencing and Assembly

An HMM for Gene structure

Gene Finding via HMMs
• Gene finding can be interpreted as a dynamic programming (d.p.) approach that threads genomic sequence through the states of a ‘gene’ HMM.
– Einit, Efin, Emid (initial, final, and internal exons)
– I (intron), IG (intergenic)
• Note: not all links are shown here.
[Figure: gene HMM with states IG, I, Efin, Emid, Einit]

Generalized HMMs, and other refinements
• A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described.
• In standard HMMs, the time spent in a state follows a geometric distribution (the discrete analogue of the exponential), because the state is left with the same probability at every step.
• This is violated by many states of the gene structure HMM. The solution is to model these using generalized HMMs.
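The duration claim can be checked numerically. The sketch below assumes a single state with a self-transition probability `p_stay` (a hypothetical parameter name, not from the slides) and computes the probability of staying exactly `d` steps.

```python
def duration_pmf(p_stay, d):
    """Probability of spending exactly d consecutive steps in a state whose
    self-transition probability is p_stay (standard HMM)."""
    if d < 1:
        return 0.0
    # Stay d-1 times, then leave once.
    return p_stay ** (d - 1) * (1.0 - p_stay)

# The probabilities shrink by a constant factor p_stay at every step,
# so the most likely duration is always 1.
pmf = [duration_pmf(0.9, d) for d in range(1, 6)]
```

A distribution that always peaks at length 1 is a poor fit for exons, whose typical length is well above one base, which is the motivation for explicit duration models.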

Length distributions of Introns & Exons

Generalized HMM for gene finding
• Each state also emits a ‘duration’ for which it will cycle in the same state. The time is generated according to a random process that depends on the state.

Forward algorithm for gene finding
• Duration Prob.: Probability that you stayed in state q_k for j-i+1 steps
• Emission Prob.: Probability that you emitted X_i..X_j in state q_k (given by the 5th-order Markov model)
• Forward Prob.: Probability that you emitted i symbols and ended up in state q_k
[Figure: a segment from position i to position j assigned to state q_k]
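One way these three quantities combine into a recurrence is sketched below. The symbols are notation assumed here rather than taken from the slide: F_k(j) is the forward probability of emitting the first j symbols with a segment in state q_k ending at j, a_{lk} is the transition probability from q_l to q_k, d_k is the duration distribution of q_k, and e_k is the emission probability of a whole segment in q_k.

```latex
F_k(j) \;=\; \sum_{i=1}^{j} \; \sum_{l} F_l(i-1)\, a_{lk}\, d_k(j-i+1)\, e_k(x_i \ldots x_j)
```

Here F_l(0) plays the role of the initial state probability. Because the sum runs over every possible segment start i, the computation is quadratic in the sequence length unless the maximum duration is bounded.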

De novo Gene prediction: Summary
• Various signals distinguish coding regions from non-coding.
• HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
• Further improvement may come from improved signal detection.

DNA Signals
• Coding versus non-coding
• Splice Signals
• Translation start
[Figure: gene structure labeling the transcription start, 5’ UTR, translation start (ATG), exon, intron, donor splice site, acceptor, and 3’ UTR]

DNA signal example:
• The donor site marks the junction where an exon ends, and an intron begins.
• For gene finding, we are interested in computing a probability
– D[i] = Prob[Donor site at position i]
• Approach: Collect a large number of donor sites, align, and look for a signal.

PWMs (position weight matrices)
• Fixed length for the splice signal.
• Each position is generated independently according to a distribution.
• Figure shows data from > 1200 donor sites.
Example aligned donor sites (columns: last 3 exon positions, then first 6 intron positions):
  321123456
  AAGGTGAGT
  CCGGTAAGT
  GAGGTGAGG
  TAGGTAAGG
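A minimal sketch of building and applying a PWM, using only the four example donor sites listed on the slide (the real figure is based on more than 1200 sites, so the numbers here are purely illustrative). The function names `build_pwm` and `pwm_probability` and the optional `pseudocount` parameter are assumptions of this sketch.

```python
from collections import Counter

def build_pwm(sites, alphabet="ACGT", pseudocount=0.0):
    """Estimate one base distribution per column from aligned sites."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must be aligned"
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return pwm

def pwm_probability(pwm, seq):
    """Probability of seq under the PWM: columns are treated as independent."""
    prob = 1.0
    for column, base in zip(pwm, seq):
        prob *= column[base]
    return prob

sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(sites)
p_site = pwm_probability(pwm, "AAGGTGAGT")      # a sequence seen in training
p_non_site = pwm_probability(pwm, "AAGCTGAGT")  # violates the invariant GT
```

With so few training sites, any base never seen in a column gets probability zero; a positive `pseudocount` is one common way to soften that.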

Improvements to signal detection
• Pr[GGTA is a donor site]?
– 0.5 * 0.5
• Pr[CGTA is a donor site]?
– 0.5 * 0.5
• Is something wrong with this explanation?
Sites shown: GGTA, CGTG
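The problem can be made concrete with a small sketch, assuming the two sites shown on the slide (GGTA and CGTG) are the whole training set. The helper names are hypothetical. A PWM gives CGTA the same probability as GGTA, even though CGTA was never observed.

```python
from collections import Counter

training = ["GGTA", "CGTG"]

def column_frequencies(sites):
    """Per-column base frequencies, i.e. a PWM without pseudocounts."""
    n = len(sites)
    return [
        {base: count / n for base, count in Counter(s[pos] for s in sites).items()}
        for pos in range(len(sites[0]))
    ]

def independent_probability(freqs, seq):
    """PWM probability: product of per-column frequencies."""
    prob = 1.0
    for column, base in zip(freqs, seq):
        prob *= column.get(base, 0.0)
    return prob

def empirical_probability(sites, seq):
    """Fraction of training sites that are exactly seq."""
    return sites.count(seq) / len(sites)

freqs = column_frequencies(training)
p_seen = independent_probability(freqs, "GGTA")    # 0.5 * 0.5
p_unseen = independent_probability(freqs, "CGTA")  # also 0.5 * 0.5
```

In this toy set the first and last columns are perfectly correlated (G pairs with A, C pairs with G), which is exactly the information the independence assumption throws away.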

MDD
• PWMs do not capture correlations between positions.
• Many position pairs in the Donor signal are correlated.

Maximal Dependence Decomposition
• Choose the position i which has the highest correlation score.
• Split sequences into two: those which have the consensus at position i, and the remaining.
• Recurse on each subset.
– Stop if the number of sequences is ‘small enough’.
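The steps above can be sketched as a small recursive procedure. Several choices here are assumptions of the sketch, since the slide only speaks of a "correlation score": the score for position i is taken to be the sum of chi-square statistics between "position i matches its consensus" and the base at every other position, `min_size` stands in for "small enough", and recursion also stops when no dependence remains. In a full model each leaf would get its own PWM.

```python
from collections import Counter

ALPHABET = "ACGT"

def consensus(sites, pos):
    """Most frequent base at a column (ties broken by first occurrence)."""
    return Counter(s[pos] for s in sites).most_common(1)[0][0]

def chi_square(sites, i, j):
    """Chi-square statistic: 'position i matches consensus' vs. base at j."""
    cons = consensus(sites, i)
    n = len(sites)
    observed = Counter((s[i] == cons, s[j]) for s in sites)
    row_totals = Counter(s[i] == cons for s in sites)
    col_totals = Counter(s[j] for s in sites)
    stat = 0.0
    for match in (True, False):
        for base in ALPHABET:
            expected = row_totals[match] * col_totals[base] / n
            if expected > 0:
                stat += (observed[(match, base)] - expected) ** 2 / expected
    return stat

def dependence_score(sites, i):
    """Total dependence of all other positions on position i."""
    return sum(chi_square(sites, i, j) for j in range(len(sites[0])) if j != i)

def mdd(sites, min_size=4, used=frozenset()):
    """Return a decomposition tree as nested dicts."""
    candidates = [i for i in range(len(sites[0])) if i not in used]
    if len(sites) < min_size or not candidates:
        return {"leaf": sites}
    best = max(candidates, key=lambda i: dependence_score(sites, i))
    cons = consensus(sites, best)
    match = [s for s in sites if s[best] == cons]
    rest = [s for s in sites if s[best] != cons]
    if dependence_score(sites, best) == 0 or not rest:
        return {"leaf": sites}  # nothing left to explain by splitting
    return {
        "position": best,
        "consensus": cons,
        "match": mdd(match, min_size, used | {best}),
        "rest": mdd(rest, min_size, used | {best}),
    }

# Toy data: first and last columns are perfectly correlated.
tree = mdd(["GGTA"] * 4 + ["CGTG"] * 4)
```

On the toy data the procedure splits on the first column and leaves two homogeneous subsets, each of which a plain PWM then models well.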

MDD for Donor sites

Gene prediction: Summary
• Various signals distinguish coding regions from non-coding.
• HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
• Further improvement may come from improved signal detection.

How many genes do we have?
[Figure labels: Nature, Science]

Alternative splicing

Comparative methods
• Gene prediction is harder with alternative splicing.
• One approach might be to use comparative methods to detect genes.
• Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
• Yes, with a variant of alignment algorithms that penalizes introns separately from other gaps.
• There is a genome sequencing project for a different Hirudo species. You could compare the Hirudo ESTs against the genome to do gene finding.

Comparative gene finding tools
• Procrustes/Sim4: mRNA vs. genomic
• GeneWise: proteins versus genomic
• CEM: genomic versus genomic
• Twinscan: combines comparative and de novo approaches
• Mass Spec related?
– Later in the class we will consider mass spectrometry data.
– Can we use this data to identify genes in eukaryotic genomes? (Research project)

Databases
• RefSeq and other databases maintain sequences of full-length transcripts/genes.
• We can query using sequence.

Course
• Sequence Comparison (BLAST & other tools)
• Protein Motifs:
– Profiles/Regular Expression/HMMs
• Discovering protein coding genes
– Gene finding HMMs
– DNA signals (splice signals)
• How is the genomic sequence itself obtained?
[Figure labels: Gene finding, ESTs, Protein sequence analysis]

Silly Quiz
• Who are these people, and what is the occasion?

Genome Sequencing and Assembly

DNA Sequencing
• DNA is double-stranded.
• The strands are separated, and a polymerase is used to copy the second strand.
• Special bases terminate this process early.

Sequencing
• A break at T is shown here.
• Measuring the lengths using electrophoresis allows us to get the position of each T.
• The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides.
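The read-out idea can be illustrated with a toy sketch: if, for each base, we know the lengths of the fragments that terminated at that base, sorting all fragments by length spells out the sequence. The function name and the input layout are assumptions made for illustration, and the data is idealized (every position observed exactly once, no noise).

```python
def read_sequence(termination_lengths):
    """Reconstruct a sequence from fragment lengths.

    termination_lengths maps each base to the lengths of the fragments
    that ended with that (terminating) base.
    """
    base_at_length = {}
    for base, lengths in termination_lengths.items():
        for n in lengths:
            base_at_length[n] = base
    # Shorter fragments terminated earlier, so increasing length = read order.
    return "".join(base_at_length[n] for n in sorted(base_at_length))

fragments = {"G": [1], "A": [2, 5, 7], "T": [3, 4], "C": [6]}
sequence = read_sequence(fragments)
```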

• Automated detectors ‘read’ the terminating bases.
• The signal decays after 1000 bases.

Sequencing Genomes: Clone by Clone
• Clones are constructed to span the entire length of the genome.
• These clones are ordered and oriented correctly (Mapping).
• Each clone is sequenced individually.

Shotgun Sequencing
• Shotgun sequencing of clones was considered viable.
• However, researchers in 1999 proposed shotgunning the entire genome.

Library
• Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.

Sequencing

Questions
• Algorithmic: How do you put the genome back together from the pieces? Will be discussed in the next lecture.
• Statistical?
– EX: Let G be the length of the genome, and L be the length of a fragment. How many fragments do you need to sequence?
– The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.

Lander Waterman Statistics
[Figure: clones of length L placed along a genome of length G; a maximal set of overlapping clones forms an island]

LW statistics: questions
• As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.
• Q1: What is the expected number of islands?
• Ans: N exp(-c), where N is the number of clones and c = NL/G is the coverage.
• As N grows, the number increases at first, and then gradually decreases.
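A short numeric sketch of this answer, assuming coverage is defined as c = NL/G (N clones of length L on a genome of length G) and ignoring any minimum-overlap requirement. The function names are hypothetical.

```python
import math

def coverage(n_clones, clone_len, genome_len):
    """Coverage c = N * L / G."""
    return n_clones * clone_len / genome_len

def expected_islands(n_clones, clone_len, genome_len):
    """Expected number of islands, N * exp(-c)."""
    c = coverage(n_clones, clone_len, genome_len)
    return n_clones * math.exp(-c)

G, L = 1_000_000, 1_000
for n in (500, 1_000, 2_000, 10_000):
    print(n, coverage(n, L, G), round(expected_islands(n, L, G), 2))
```

The island count peaks at c = 1 (here N = 1000) and then falls: at 10x coverage fewer than one island boundary is expected beyond the first.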

Analysis: Expected Number of Islands
• Computing the expected # islands.
• Let X_i = 1 if an island ends at position i, X_i = 0 otherwise.
• Number of islands = ∑_i X_i
• Expected # islands = E(∑_i X_i) = ∑_i E(X_i)

Prob. of an island ending at i
• E(X_i) = Prob(Island ends at pos. i)
• = Prob(clone began at position i-L+1 AND no clone began in the next L-T positions)
[Figure: a clone of length L ending at position i, with overlap threshold T]
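Carrying this probability through gives the expected island count. The derivation below is a sketch under stated assumptions: clones start independently at each position with probability α = N/G (α is notation introduced here), T is read as the minimum overlap needed to detect that two clones overlap, and σ = 1 - T/L.

```latex
\begin{align*}
E(X_i) &= \alpha\,(1-\alpha)^{L-T} \;\approx\; \alpha\, e^{-\alpha (L-T)}, \qquad \alpha = \frac{N}{G} \\
E[\#\text{islands}] &= \sum_{i=1}^{G} E(X_i) \;\approx\; G\,\alpha\, e^{-\alpha (L-T)}
 \;=\; N\, e^{-\frac{NL}{G}\left(1-\frac{T}{L}\right)} \;=\; N\, e^{-c\sigma}
\end{align*}
```

With T = 0 (so σ = 1) this reduces to N e^{-c}.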

LW statistics
• Pr[Island contains exactly j clones]?
• Consider an island that has already begun. With probability e^{-c}, it will never be continued. Therefore
• Pr[Island contains exactly j clones] = (1 - e^{-c})^{j-1} e^{-c}
• Expected # j-clone islands = N e^{-c} (1 - e^{-c})^{j-1} e^{-c}

Expected # of clones in an island
• Why?
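The formula for this slide did not survive extraction. One argument consistent with the expected island count N e^{-c} is sketched below, ignoring the overlap threshold T: N clones are spread over N e^{-c} islands on average, and the same value is the mean of a geometric distribution in which each clone is the last of its island with probability e^{-c}.

```latex
E[\#\text{clones per island}] \;=\; \frac{N}{N e^{-c}} \;=\; e^{c}
\qquad\text{equivalently}\qquad
\sum_{j \ge 1} j\,(1-e^{-c})^{\,j-1} e^{-c} \;=\; \frac{1}{e^{-c}} \;=\; e^{c}
```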

Expected length of an island