- Number of slides: 39
CSE 182, L9: Gene Finding (DNA Signals); Genome Sequencing and Assembly
An HMM for Gene structure
Gene Finding via HMMs • Gene finding can be interpreted as a dynamic programming approach that threads the genomic sequence through the states of a ‘gene’ HMM. – States: Einit, Emid, Efin (initial, middle, and final exons), I (intron), IG (intergenic) • Note: not all links are shown here. [Figure: state diagram over IG, I, Einit, Emid, Efin]
Generalized HMMs, and other refinements • A probabilistic model for each of the states (e.g., exon, splice site) needs to be described. • In standard HMMs, the time spent in a state follows a geometric distribution (the discrete analogue of the exponential). • Many states of the gene structure HMM violate this. The solution is to model them using generalized HMMs.
Length distributions of Introns & Exons
Generalized HMM for gene finding • Each state also emits a ‘duration’ for which it will cycle in the same state. The time is generated according to a random process that depends on the state.
Forward algorithm for gene finding • Duration prob.: probability that you stayed in state qk for j-i+1 steps. • Emission prob.: probability that you emitted Xi..Xj in state qk (given by the 5th-order Markov model). • Forward prob.: probability that you emitted i symbols and ended up in state qk. [Figure: segment Xi..Xj spent in state qk]
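The recursion behind these three quantities can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: `ghmm_forward`, `duration`, and `emit` are hypothetical names, the emission model is a uniform placeholder rather than the 5th-order Markov model, and the toy duration table is invented.

```python
def ghmm_forward(x, states, init, trans, duration, emit):
    """Forward table for a generalized HMM.

    f[j][k] = Prob(x[0:j] was emitted and a segment in state k ends at j).
    duration(k, d): probability that state k lasts exactly d symbols.
    emit(k, seg):   probability that state k emits the string seg.
    """
    n = len(x)
    f = [{k: 0.0 for k in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for k in states:
            total = 0.0
            for i in range(1, j + 1):  # segment covers symbols i..j (1-based)
                seg_prob = duration(k, j - i + 1) * emit(k, x[i - 1:j])
                if seg_prob == 0.0:
                    continue
                if i == 1:
                    # First segment of the sequence: use the initial distribution.
                    total += init[k] * seg_prob
                else:
                    # Otherwise arrive from any state l whose segment ended at i-1.
                    total += seg_prob * sum(
                        f[i - 1][l] * trans[l].get(k, 0.0) for l in states
                    )
            f[j][k] = total
    return f


# Toy model (all numbers are illustrative assumptions).
states = ["E", "I"]
init = {"E": 1.0, "I": 0.0}
trans = {"E": {"I": 1.0}, "I": {"E": 1.0}}
toy_durations = {"E": {2: 0.5, 3: 0.5}, "I": {1: 1.0}}


def duration(k, d):
    return toy_durations[k].get(d, 0.0)


def emit(k, seg):
    return 0.25 ** len(seg)  # uniform placeholder emission model


f = ghmm_forward("ACGTAC", states, init, trans, duration, emit)
print(sum(f[-1].values()))  # total probability of the sequence under the toy model
```

For sequences of realistic length the products underflow, so a practical version would work with log probabilities and would cap the maximum segment length.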
De novo Gene prediction: Summary • Various signals distinguish coding regions from non-coding • HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals. • Further improvement may come from improved signal detection
DNA Signals • Coding versus non-coding • Splice signals • Translation start [Figure: gene structure showing transcription start, 5’ UTR, translation start (ATG), exon, intron, donor splice site, acceptor, 3’ UTR]
DNA signal example: • The donor site marks the junction where an exon ends, and an intron begins. • For gene finding, we are interested in computing a probability – D[i] = Prob[Donor site at position i] • Approach: Collect a large number of donor sites, align, and look for a signal.
PWMs • Fixed length for the splice signal. • Each position is generated independently according to a distribution. • Figure shows data from > 1200 donor sites. Example sites (positions -3..-1, +1..+6): AAGGTGAGT, CCGGTAAGT, GAGGTGAGG, TAGGTAAGG
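A minimal Python sketch of a PWM, built only from the four example sites on this slide (not from the > 1200 real donor sites). The helper names `build_pwm` and `pwm_prob` are hypothetical, and no pseudocounts are used, so unseen bases get probability 0.

```python
from collections import Counter


def build_pwm(sites, alphabet="ACGT"):
    """One frequency distribution per position of the aligned sites."""
    pwm = []
    for column in zip(*sites):
        counts = Counter(column)
        pwm.append({base: counts[base] / len(sites) for base in alphabet})
    return pwm


def pwm_prob(pwm, seq):
    """Probability of seq when each position is generated independently."""
    p = 1.0
    for column, base in zip(pwm, seq):
        p *= column[base]
    return p


sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(sites)
print(pwm_prob(pwm, "AAGGTGAGT"))  # a site from the training set
print(pwm_prob(pwm, "AAGCTGAGT"))  # breaks the conserved G at position +1
```

In a gene finder, D[i] would typically be derived from such a score by comparing it with a background model; that step is left out here.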
Improvements to signal detection • Pr[GGTA is a donor site]? – 0.5*0.5 • Pr[CGTA is a donor site]? – 0.5*0.5 • Is something wrong with this explanation? Observed sites: GGTA, CGTG
MDD • PWMs do not capture correlations between positions • Many position pairs in the Donor signal are correlated
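The GGTA/CGTG example can be checked numerically. The sketch below assumes the two strings on the previous slide are the whole training set; `column_freqs` and `pwm_prob` are hypothetical helper names.

```python
from collections import Counter

training = ["GGTA", "CGTG"]


def column_freqs(sites):
    """Per-position base frequencies (a PWM without pseudocounts)."""
    return [
        {base: n / len(sites) for base, n in Counter(column).items()}
        for column in zip(*sites)
    ]


def pwm_prob(freqs, seq):
    p = 1.0
    for column, base in zip(freqs, seq):
        p *= column.get(base, 0.0)
    return p


freqs = column_freqs(training)
empirical = Counter(training)
for candidate in ["GGTA", "CGTA"]:
    # PWM probability vs. how often the string was actually observed
    print(candidate, pwm_prob(freqs, candidate), empirical[candidate] / len(training))
```

The PWM gives both candidates 0.5*0.5 even though CGTA was never observed: in the data, the first and last positions are perfectly correlated, and the independence assumption throws that away.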
Maximal Dependence Decomposition • Choose the position i which has the highest correlation score. • Split sequences into two: those which have the consensus at position i, and the remaining. • Recurse until
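One split step can be sketched in Python. The slide does not say how the correlation score is computed; this sketch assumes the commonly described choice, a chi-square statistic between "position i matches its consensus" and the base at position j, summed over all j ≠ i. The names `chi_square`, `consensus`, and `mdd_split` are hypothetical, and the stopping rule for the recursion is not modeled.

```python
from collections import Counter


def chi_square(pairs):
    """Chi-square statistic of the contingency table built from (row, col) pairs."""
    n = len(pairs)
    rows = Counter(r for r, _ in pairs)
    cols = Counter(c for _, c in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def consensus(sites, pos):
    """Most frequent base at a position."""
    return Counter(s[pos] for s in sites).most_common(1)[0][0]


def mdd_split(sites):
    """Pick the position with the highest summed score and split on its consensus."""
    length = len(sites[0])
    scores = []
    for i in range(length):
        cons = consensus(sites, i)
        scores.append(sum(
            chi_square([(s[i] == cons, s[j]) for s in sites])
            for j in range(length) if j != i
        ))
    best = max(range(length), key=scores.__getitem__)
    cons = consensus(sites, best)
    match = [s for s in sites if s[best] == cons]
    rest = [s for s in sites if s[best] != cons]
    return best, match, rest, scores


best, match, rest, scores = mdd_split(["GGTA", "GGTA", "GGTA", "CGTG"])
print(best, match, rest, scores)
```

Each of the two subsets would then be processed again in the same way, or modeled with its own PWM once the recursion stops.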
MDD for Donor sites
Gene prediction: Summary • Various signals distinguish coding regions from non-coding • HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals. • Further improvement may come from improved signal detection
How many genes do we have? Nature Science
Comparative methods • Gene prediction is harder with alternative splicing. • One approach might be to use comparative methods to detect genes. • Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence? • Yes, with a variant of alignment algorithms that penalizes introns separately from other gaps. • There is a genome sequencing project for a different Hirudo species. You could compare the Hirudo ESTs against the genome to do gene finding.
Comparative gene finding tools • Procrustes/Sim4: mRNA vs. genomic • Genewise: proteins versus genomic • CEM: genomic versus genomic • Twinscan: combines comparative and de novo approaches. • Mass spec related? – Later in the class we will consider mass spectrometry data. – Can we use this data to identify genes in eukaryotic genomes? (Research project)
Databases • RefSeq and other databases maintain sequences of full-length transcripts/genes. • We can query using sequence.
Course • Sequence Comparison (BLAST & other tools) • Protein Motifs: – Profiles/Regular Expression/HMMs • Discovering protein coding genes – Gene finding HMMs – DNA signals (splice signals) • How is the genomic sequence itself obtained? [Figure labels: Gene finding, ESTs, Protein sequence analysis]
Silly Quiz • Who are these people, and what is the occasion?
Genome Sequencing and Assembly
DNA Sequencing • DNA is double-stranded. • The strands are separated, and a polymerase is used to copy the second strand. • Special bases terminate this process early.
Sequencing • A break at T is shown here. • Measuring the lengths using electrophoresis allows us to get the position of each T • The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides
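The idea of reading positions off fragment lengths can be sketched as follows. This is an idealized illustration (every fragment length observed exactly once, no noise); `read_from_fragments` and the example data are hypothetical.

```python
def read_from_fragments(fragment_lengths):
    """Rebuild the copied strand from the lengths of terminated fragments.

    fragment_lengths maps each terminating base to the lengths of the
    fragments that ended with it; a fragment of length n says that
    position n of the copied strand holds that base.
    """
    base_at = {}
    for base, lengths in fragment_lengths.items():
        for n in lengths:
            base_at[n] = base
    return "".join(base_at[n] for n in sorted(base_at))


# Invented electrophoresis readout: one lane (or dye) per terminating base.
fragments = {"A": [1, 5], "C": [2], "G": [3], "T": [4, 6]}
print(read_from_fragments(fragments))  # ACGTAT
```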
• Automated detectors ‘read’ the terminating bases. • The signal decays after 1000 bases.
Sequencing Genomes: Clone by Clone • Clones are constructed to span the entire length of the genome. • These clones are ordered and oriented correctly (Mapping) • Each clone is sequenced individually
Shotgun Sequencing • Shotgun sequencing of clones was considered viable • However, researchers in 1999 proposed shotgunning the entire genome.
Library • Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.
Questions • Algorithmic: How do you put the genome back together from the pieces? Will be discussed in the next lecture. • Statistical? • EX: Let G be the length of the genome, and L be the length of a fragment. How many fragments do you need to sequence? – The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.
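Lander-Waterman Statistics [Figure: clones of length L placed on a genome of length G; overlapping clones form an island]
LW statistics: questions • As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. • Q1: What is the expected number of islands? • Ans: N exp(-cσ), where N is the number of clones, c = NL/G is the coverage, and σ = 1 - T/L for a minimum detectable overlap T (this is N exp(-c) when T is negligible). • As clones are added, the number of islands increases at first and then gradually decreases.
Analysis: Expected Number of Islands • Computing the expected # islands. • Let X_i = 1 if an island ends at position i, X_i = 0 otherwise. • Number of islands = Σ_i X_i • Expected # islands = E(Σ_i X_i) = Σ_i E(X_i)
Prob. of an island ending at i [Figure: clone of length L ending at position i; T is the minimum overlap] • E(X_i) = Prob(island ends at position i) = Prob(a clone began at position i-L+1 AND no clone began in the next L-T positions)
LW statistics • Pr[Island contains exactly j clones]? • Consider an island that has already begun. With probability exp(-cσ), where σ = 1 - T/L, it will never be continued. Therefore • Pr[Island contains exactly j clones] = (1 - exp(-cσ))^(j-1) exp(-cσ) • Expected # j-clone islands = N exp(-2cσ) (1 - exp(-cσ))^(j-1)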
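The rise and fall of the island count can be seen with a few lines of Python. This sketch assumes the Lander-Waterman form N exp(-cσ) with c = NL/G and σ = 1 - T/L, where T is the minimum detectable overlap (T = 0 gives N exp(-c)); the function names and the genome and clone sizes are illustrative.

```python
import math


def coverage(n_clones, clone_len, genome_len):
    """c = N L / G."""
    return n_clones * clone_len / genome_len


def expected_islands(n_clones, clone_len, genome_len, min_overlap=0):
    """N exp(-c sigma), with sigma = 1 - T/L."""
    c = coverage(n_clones, clone_len, genome_len)
    sigma = 1 - min_overlap / clone_len
    return n_clones * math.exp(-c * sigma)


G, L = 1_000_000, 1_000  # illustrative sizes
for n in (100, 500, 1_000, 2_000, 5_000, 10_000):
    print(n, coverage(n, L, G), round(expected_islands(n, L, G), 2))
```

With T = 0 the count peaks at c = 1 (here N = 1000) and then falls off exponentially, which is why several-fold coverage is needed before the islands merge.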
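One way to finish the calculation, assuming clone start positions are independent and a clone starts at any given position with probability α = N/G (α is notation introduced here, with N the number of clones):

```latex
\begin{align*}
E(X_i) &= \alpha\,(1-\alpha)^{L-T} \;\approx\; \alpha\, e^{-\alpha (L-T)},
  \qquad \alpha = \frac{N}{G},\\
E(\#\text{islands}) &= \sum_{i=1}^{G} E(X_i)
  \;\approx\; G\,\alpha\, e^{-\alpha (L-T)}
  \;=\; N e^{-c\sigma},
  \qquad c = \frac{NL}{G},\quad \sigma = 1 - \frac{T}{L}.
\end{align*}
```

The last step uses α(L-T) = (N/G)·L·(1 - T/L) = cσ; the approximation (1-α)^(L-T) ≈ exp(-α(L-T)) holds because α is small.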
Expected # of clones in an island = exp(cσ), with σ = 1 - T/L. Why?
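A numerical check of the island-size distribution, assuming the geometric form that follows from the argument above: each clone in an island is the last one with probability exp(-cσ). The function names are hypothetical, and σ defaults to 1 (no minimum overlap).

```python
import math


def prob_j_clones(j, c, sigma=1.0):
    """Pr[island has exactly j clones] = (1 - s)^(j-1) * s, with s = exp(-c sigma)."""
    stop = math.exp(-c * sigma)
    return (1 - stop) ** (j - 1) * stop


def expected_clones_per_island(c, sigma=1.0):
    """Mean of the geometric distribution above: 1 / s = exp(c sigma)."""
    return math.exp(c * sigma)


c = 2.0
total = sum(prob_j_clones(j, c) for j in range(1, 2001))
mean = sum(j * prob_j_clones(j, c) for j in range(1, 2001))
print(total, mean, expected_clones_per_island(c))
```

The same mean also follows from dividing the N clones among the N exp(-cσ) expected islands.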
Expected length of an island