- Number of slides: 41
Bioinformatics Algorithms and Data Structures
BLAST
Lecturer: Dr. Rose
BLAST Slides: Adaptation of Nir Friedman’s slides from the Computational Methods in Molecular Biology course (Spring 2001) at Hebrew University, Jerusalem, Israel
February 21, 2007
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology
BLAST
Q: What is BLAST?
A: BLAST is an acronym: Basic Local Alignment Search Tool, a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
You can find it at: http://www.ncbi.nlm.nih.gov/BLAST/
BLAST
• Q: Why do you care?
• A: Because you are going to do a project.
Accession    Description
U51112       Membrane protein that transports sodium and hydrogen
J03581       Tyrosinase: people lacking this are albino
NM_000245    MET, an oncogene: mutations in this cause cancer
NM_010849    MYC, another oncogene
NM_007409    Alcohol dehydrogenase: good to have when drinking
NM_002475    Myosin: one of the muscle proteins
XM_086788    Crystallin, the major protein in the lens
M30047       Myelin basic protein: protects the neurons
NM_000518    Hemoglobin, oxygen-carrying protein in RBC
NM_000477    Albumin, major serum protein: does a lot of things
NM_008476    Keratin, skin and integument protein
BLAST
• BLAST is designed to efficiently find alignments of a query string s against large databases
  – Motivation: increase speed by finding fewer and better hot spots.
  – Idea: find high-scoring matches using a substitution matrix rather than exact matches.
  – We are still searching only for gapless matches.
High-Scoring Pair
• Two strings s and t are a high-scoring pair (HSP) if d(s, t) > T
• Given a query s[1..n], BLAST constructs all words w (fixed-length strings of length k) such that w scores > T against some k-substring of s
  – Each match to such a word in the database is called a hit
• Typical k: 12 for nucleotides, 3-5 for amino acids.
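The word-construction step can be sketched in a few lines. This is only an illustration under stated assumptions: the match/mismatch values (+5/-4), the function names `score` and `neighborhood`, and the brute-force enumeration of all words are not from the slides, and real BLAST uses a substitution matrix such as BLOSUM for proteins.

```python
from itertools import product

def score(u, v, match=5, mismatch=-4):
    """Toy substitution score (assumed values): sum of per-position scores."""
    return sum(match if a == b else mismatch for a, b in zip(u, v))

def neighborhood(query, k, threshold, alphabet="ACGT"):
    """Map each word w of length k to the query offsets whose
    k-substring scores > threshold against w."""
    seeds = {}
    for i in range(len(query) - k + 1):
        sub = query[i:i + k]
        # Brute force over all |alphabet|^k words; fine for small k only.
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            if score(w, sub) > threshold:
                seeds.setdefault(w, []).append(i)
    return seeds

# With these toy scores, threshold 5 admits words with at most one mismatch.
seeds = neighborhood("ACGTA", k=3, threshold=5)
```

Any database position that matches a key of `seeds` exactly would then count as a hit for the corresponding query offsets.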
High-Scoring Pair
• Try to extend each such hit to an alignment with maximal score (still with no gaps). Keep all HSPs.
  – The threshold is chosen so that a random match with such a score is unlikely.
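A minimal sketch of gapless extension, assuming a simple stopping rule: extension in each direction stops once the running score falls more than `x_drop` below the best score seen. The `x_drop` parameter, the toy `match_score` function, and all names here are illustrative assumptions, not taken from the slides.

```python
def extend_one_way(s, t, i, j, step, score_fn, x_drop):
    """Walk from (i, j) in direction step (+1 or -1).
    Return (best score gain, number of positions kept)."""
    best = total = 0
    best_len = length = 0
    while 0 <= i < len(s) and 0 <= j < len(t):
        total += score_fn(s[i], t[j])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break  # score dropped too far below the best; give up
        i += step
        j += step
    return best, best_len

def extend_hit(s, t, i, j, k, score_fn, x_drop=10):
    """Extend the k-long hit s[i:i+k] / t[j:j+k] without gaps.
    Return (score, start in s, start in t, length)."""
    seed = sum(score_fn(a, b) for a, b in zip(s[i:i + k], t[j:j + k]))
    right, r_len = extend_one_way(s, t, i + k, j + k, +1, score_fn, x_drop)
    left, l_len = extend_one_way(s, t, i - 1, j - 1, -1, score_fn, x_drop)
    return seed + left + right, i - l_len, j - l_len, l_len + k + r_len

def match_score(a, b):
    """Toy scoring (assumed values): +5 match, -4 mismatch."""
    return 5 if a == b else -4

hsp = extend_hit("GGACGTACTT", "CCACGTACAA", 3, 3, 3, match_score)
```

Here the 3-letter hit `CGT` grows to the 6-letter gapless match `ACGTAC`; whether it is kept as an HSP depends on comparing its score with the threshold.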
Finding Potential Matches
We can locate seed words in a large database in a single pass:
• Construct a finite-state automaton (FSA) that recognizes seed words
• Use hashing techniques to locate matching words
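The hashing approach can be illustrated with a plain dictionary. This sketch indexes only the exact k-substrings of the query for brevity (a full seed set would also include high-scoring neighbor words); the function names are hypothetical.

```python
def query_words(query, k):
    """Hash table from each k-substring of the query to its offsets."""
    table = {}
    for i in range(len(query) - k + 1):
        table.setdefault(query[i:i + k], []).append(i)
    return table

def find_hits(table, target, k):
    """Single pass over the target: collect (query_offset, target_offset)
    for every window of the target that is a seed word."""
    hits = []
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits.append((i, j))
    return hits

table = query_words("ACGTAC", 3)
hits = find_hits(table, "TTACGTT", 3)
```

Each target window costs one expected constant-time lookup, so the scan is linear in the database length regardless of how many seed words there are.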
Extending Potential Matches
• Once a seed is found, BLAST attempts to find a local alignment that extends the seed
• Seeds on the same diagonal are combined (as in FASTA)
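Two hits lie on the same diagonal of the comparison matrix when the difference between their database offset and query offset is equal. A small sketch of grouping hits that way, with `group_by_diagonal` as a hypothetical helper name and hits given as (query_offset, target_offset) pairs:

```python
from collections import defaultdict

def group_by_diagonal(hits):
    """Group (query_offset, target_offset) hits by diagonal j - i."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    # Sort each diagonal so neighboring seeds can be merged in order.
    return {d: sorted(pairs) for d, pairs in diagonals.items()}

groups = group_by_diagonal([(3, 1), (0, 2), (1, 3)])
```

Hits that share a diagonal and lie close together are candidates for merging into one gapless segment.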
Which programs are used?
• Originally BLAST did not allow gaps.
  – Now people use gapped BLAST.
  – Gapped BLAST joins different diagonals.
• For proteins, BLAST is superior.
• For nucleotides, FASTA is better.
Review: Unrelated Sequences
• Our model of unrelated sequences is simple
  – Each position is sampled independently from a distribution over the alphabet
  – We assume there is a distribution q(·) that describes the probability of letters in such positions
• Then: $P(s, t \mid R) = \prod_i q(s[i]) \, q(t[i])$
• R denotes the assumption that s and t are random unrelated strings
Review: Related Sequences
• We assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor
• Let p(a, b) be a distribution over pairs of letters.
• p(a, b) is the probability that some ancestral letter evolved into this particular pair of letters
• Then: $P(s, t \mid M) = \prod_i p(s[i], t[i])$
• Here M denotes the assumption that s and t are related strings.
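Review: Ratio Test for Alignment
• $\frac{P(s, t \mid M)}{P(s, t \mid R)} = \prod_i \frac{p(s[i], t[i])}{q(s[i]) \, q(t[i])}$
• Taking the logarithm of both sides, we get $\log \frac{P(s, t \mid M)}{P(s, t \mid R)} = \sum_i \log \frac{p(s[i], t[i])}{q(s[i]) \, q(t[i])}$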
Review: Probabilistic Interpretation of Scoring Rule
• If we take $\sigma(a, b) = \log \frac{p(a, b)}{q(a) \, q(b)}$
• then the score of an alignment is the log-ratio between the two models:
  – Score > 0: M (related) is more “probable”
  – Score < 0: R (unrelated) is more “probable”
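As a worked illustration of the log-odds scoring rule, the sketch below builds a score table from a pair distribution p and a background distribution q, then scores a gapless alignment. The two-letter alphabet and all probability values are made up for the example.

```python
import math

def log_odds_matrix(p, q):
    """sigma(a, b) = log( p(a, b) / (q(a) * q(b)) ) for every pair in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}

def alignment_score(s, t, sigma):
    """Score of a gapless alignment: sum of per-position log-odds."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))

# Toy distributions (assumed): identical pairs are favored under the related model.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
sigma = log_odds_matrix(p, q)

related_like = alignment_score("AAB", "AAB", sigma)    # positive
unrelated_like = alignment_score("AB", "BA", sigma)    # negative
```

A positive total means the related model explains the aligned pair better than the independent background model, and a negative total means the opposite.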
Problems with Scoring Rule
When searching for an optimal alignment in a big database, a number of problems arise with this simple scheme.
• We are assuming P(M) = P(R), which assumes there is an equal number of related and unrelated sequences in the database.
• When searching through a big database, there is a high probability that an unrelated sequence will receive a high score.
• When searching for an optimal local alignment, we have many possible starting points, heavily biasing the score towards being a related sequence.
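Prior Probability on the models
• What we really wish to calculate is: $\frac{P(M \mid s, t)}{P(R \mid s, t)} = \frac{P(s, t \mid M) \, P(M)}{P(s, t \mid R) \, P(R)}$
• The log score being: $\log \frac{P(M \mid s, t)}{P(R \mid s, t)} = \mathrm{Score}(s, t) + \log \frac{P(M)}{P(R)}$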
Prior Probability on the models
• Our threshold should be: $\mathrm{Score}(s, t) > \log \frac{P(R)}{P(M)}$
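To get a feel for the size of this correction: declaring s and t related when the posterior odds P(M | s, t) / P(R | s, t) exceed 1 is the same as requiring the log-odds score to exceed log(P(R) / P(M)). The sketch below evaluates that bound, assuming M and R are the only two hypotheses so that P(R) = 1 - P(M); the function name and the prior values are illustrative.

```python
import math

def score_threshold(prior_related):
    """Score needed for posterior odds > 1, i.e. log(P(R) / P(M)),
    assuming P(R) = 1 - P(M)."""
    return math.log((1 - prior_related) / prior_related)

even_prior = score_threshold(0.5)     # equal priors: threshold is 0
rare_prior = score_threshold(1e-4)    # 1 related sequence in 10,000
```

With equal priors the threshold falls back to Score > 0; with one related sequence per 10,000 the required score (in natural-log units) rises to about 9.2.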
The Hazard of Large Databases
• Define $\varepsilon = P(\mathrm{Score}(s, t) > T \mid R)$
• This is the probability that two unrelated sequences will match with score > T by chance
• Assume there are N strings in our database
• Assuming that they are independent of each other, and all are unrelated to s, we have $P(\max_i \mathrm{Score}(s, t_i) > T) = 1 - (1 - \varepsilon)^N$
The Hazard of Large Databases
[Figure: curves f(x, 0.001), f(x, 0.00001) and f(x, 0.000001) plotted for x from 0 to 100000; vertical axis from 0 to 1]
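The effect of database size can be checked numerically. The sketch assumes that each of n unrelated database strings independently exceeds the score threshold with some small probability eps, so the chance of at least one spurious hit is 1 - (1 - eps)^n; the function name is hypothetical and the eps values mirror the curve labels on the slide.

```python
def prob_false_hit(eps, n):
    """Probability that at least one of n independent unrelated
    sequences exceeds the threshold, each with probability eps."""
    return 1 - (1 - eps) ** n

for eps in (0.001, 0.00001, 0.000001):
    row = [round(prob_false_hit(eps, n), 3) for n in (20000, 60000, 100000)]
    print(eps, row)
```

Even a per-sequence false-positive rate of 0.001 makes a spurious hit nearly certain once the database holds tens of thousands of sequences.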
Local Matching
• Question: Which local alignment query is expected to give a higher score:
  – To a short sequence?
  – To a long sequence?
• A local match can begin at any of the nm entries in the DP matrix.
• The score is the best over all these starting points.
• If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials.
Score Significance (FASTA)
• How meaningful is a score?
• Calculate the distribution of scores and related scores
• Under reasonable assumptions, the scores for ungapped alignment behave according to the Extreme Value Distribution.
Extreme Value Distribution (BLAST)
• We ask the following question: given a database of size n and a sequence of size m, what is the expected number of hits with score at least S?
• This number is called an E-score: $E(S) = K m n \, e^{-\lambda S}$
• Notice that the number of such hits follows a Poisson distribution with mean E(S).
• K corrects for the dependencies
• $\lambda$ depends on the scoring matrix
• Doubling n, the size of the database, doubles the expectation
• Doubling S, the score, causes E() to decrease exponentially
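The expected number of chance hits is commonly written E = K m n e^(-λS), and the scaling behavior described on this slide can be checked directly from that formula. The values of K and λ below are placeholders chosen for illustration; in practice they are derived from the scoring matrix and gap penalties.

```python
import math

def expected_hits(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * score): expected number of chance hits
    with score >= `score` for query length m and database size n."""
    return K * m * n * math.exp(-lam * score)

K, LAM = 0.1, 0.3          # assumed constants, not real BLAST parameters
base = expected_hits(50, m=300, n=1_000_000, K=K, lam=LAM)
bigger_db = expected_hits(50, m=300, n=2_000_000, K=K, lam=LAM)
higher_score = expected_hits(100, m=300, n=1_000_000, K=K, lam=LAM)
```

Doubling the database size doubles E, while raising the score from 50 to 100 multiplies E by e^(-50λ).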
BLAST P-value
• Recall the Poisson distribution: $P(k \text{ hits}) = \frac{E^k e^{-E}}{k!}$
  – Probability of finding no hits with a score ≥ S: $e^{-E}$
  – Therefore the probability of finding at least one hit with score ≥ S is $1 - e^{-E}$
  – This is called the P-value.
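Under the Poisson model with mean E, the probability of at least one hit is 1 - e^(-E). A short sketch of that conversion, using `math.expm1` to keep precision when E is tiny (the function name `p_value` is an illustrative choice):

```python
import math

def p_value(e_value):
    """P(at least one hit) = 1 - exp(-E) for a Poisson count with mean E."""
    return -math.expm1(-e_value)

small = p_value(1e-5)   # for small E, the P-value is approximately E
one = p_value(1.0)      # about 0.632
large = p_value(10.0)   # close to 1
```

For small E the two measures are nearly identical, which is why reports often quote only the E-value.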
A Typical GenBank Entry
Sequence Information
The Sequence
BLAST programs
• BLASTN - Nucleotide query searching a nucleotide database.
• BLASTP - Protein query searching a protein database.
• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.
• TBLASTN - Protein query searching a translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.
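The "6 frames" in the translated searches are the three codon offsets on the given strand plus the three on its reverse complement. The sketch below only splits a DNA string into the codons of each frame; translating codons to amino acids with a codon table is omitted, and the function name and frame labels are illustrative.

```python
def six_frames(dna):
    """Return the codon lists for the 3 forward and 3 reverse-complement frames."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = dna.translate(complement)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement)):
        for offset in range(3):
            # Stop at the last complete codon for this offset.
            end = len(seq) - (len(seq) - offset) % 3
            frames[f"{strand}{offset + 1}"] = [
                seq[p:p + 3] for p in range(offset, end, 3)
            ]
    return frames

frames = six_frames("ATGGCC")
```

Each of the six codon lists would be translated and searched as a separate protein sequence.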
BLAST Search
BLAST Output
• List of hits
  – Database accession codes, name, description.
  – Score in bits (usually > 30 bits is significant)
  – Expectation value E()
• For each hit
  – A header including hit name, description, length
  – Each hit may contain several HSPs
  – Score and expectation value
  – How many identical residues
  – How many residues contributing positively to the score
• The local alignment itself
PSI-BLAST (Position Specific Iterated)
• BLAST provides a new automatic “profile like” search.
• Iterative procedure:
  – Perform BLAST on database.
  – Use significant alignments to construct a “position specific” score matrix.
  – This matrix replaces the query sequence in the next round of database searching.
• The program may be iterated until no new significant alignments are found.
• Most commonly used search method today.
Multiple Alignment
• Proteins can be classified into families:
  – Common structure.
  – Common function.
  – Common evolutionary origin.
• For a set of sequences belonging to some family
  – Each pair has some differences
  – But there are some common motifs in almost all sequences of the family
• A multiple alignment carries more information than pairwise alignment
Protein Families
• Consider Zinc Fingers:
• All have the same function:
  – Bind to DNA
• All have similar structure
• They constitute a Protein Family
• In a protein family some parts of the sequence (the functional parts) are more conserved than others.
Definition
A multiple alignment of strings $S_1, S_2, \dots, S_k$ is a series of strings with blanks $S'_1, S'_2, \dots, S'_k$ such that:
  – $|S'_1| = |S'_2| = \dots = |S'_k|$
  – $S'_j$ is an extension of $S_j$ obtained by insertion of blanks.
Example
  AGT..CTT.ACGCG
  AGTAGCTT...GCG
  ..TAGC.T..GGCG
  .CTA.C.TAACCCG
  ACTA...TAAC...
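The two conditions in the definition of a multiple alignment can be checked mechanically. In this sketch "." stands for the blank, the five rows are the example alignment with rows of length 14, and the function name is hypothetical.

```python
def is_multiple_alignment(originals, aligned, blank="."):
    """True if all aligned rows have equal length and each row
    reduces to its original string when the blanks are removed."""
    if len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) <= 1
    restores = all(
        row.replace(blank, "") == seq for seq, row in zip(originals, aligned)
    )
    return same_length and restores

rows = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]
originals = [row.replace(".", "") for row in rows]
valid = is_multiple_alignment(originals, rows)
broken = is_multiple_alignment(originals, rows[:4] + ["ACTA...TAAC.."])
```

Dropping one blank from the last row breaks the equal-length condition, so `broken` is false while `valid` is true.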
Sum of Pairs
• The sum of pairwise distances between all pairs of sequences for some scoring matrix
• Not only assumes that the alignment of each column is independent, but also each pair of sequences.
  – Each sequence is scored as if descended from k-1 sequences instead of one common ancestor.
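A minimal sketch of the sum-of-pairs computation, assuming a toy unit-cost distance (0 for identical symbols, 1 for a mismatch or a letter against a blank). The cost function and names are illustrative; a real scoring matrix would replace `pair_cost`.

```python
from itertools import combinations

def pair_cost(a, b):
    """Toy distance (assumed): 0 if the symbols are equal, else 1."""
    return 0 if a == b else 1

def sum_of_pairs(rows):
    """Sum the pairwise cost over every column and every pair of rows."""
    return sum(
        pair_cost(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )

sp = sum_of_pairs(["AC.", "ACG", "A.G"])
```

Because columns are scored independently, the total equals the sum of the three induced pairwise alignment distances (1 + 2 + 1 in this example).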
Calculation of Multiple Alignment
• The optimal alignment can be calculated exactly using k-dimensional dynamic programming.
  – Space complexity $O(n^k)$
  – Time complexity $O(2^k n^k)$
• A heuristic program called ClustalW quickly finds a good multiple alignment.
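The exponential cost is easy to see with concrete numbers. For k sequences of length n the DP table has (n + 1)^k cells, and each cell looks at up to 2^k - 1 predecessor cells (every non-empty subset of sequences can advance by one letter). The helper name below is hypothetical.

```python
def dp_cost(n, k):
    """Return (table cells, cell-predecessor lookups) for exact
    k-dimensional alignment of k sequences of length n."""
    cells = (n + 1) ** k
    moves = 2 ** k - 1
    return cells, cells * moves

pairwise = dp_cost(100, 2)   # ordinary pairwise alignment
triple = dp_cost(100, 3)
ten_way = dp_cost(100, 10)
```

Three sequences of length 100 already need about a million cells, and ten need on the order of 10^20, which is why heuristics are used in practice.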
Creating a PSSM
• After aligning the sequences we see that there are some conserved regions.
• We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix.
• This matrix represents information from a whole family; it is more strict in highly conserved regions.
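One simple way to turn aligned columns into position-specific scores is a log-odds of smoothed column frequencies against a background distribution. This is only a sketch of the idea: the pseudocount scheme, the uniform background, and the function names are assumptions, and PSI-BLAST's actual weighting of sequences and pseudocounts is more elaborate.

```python
import math
from collections import Counter

def build_pssm(rows, background, pseudocount=1.0, blank="."):
    """One dict per column: letter -> log( smoothed frequency / background )."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != blank)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log((counts[a] + pseudocount) / total / background[a])
            for a in alphabet
        })
    return pssm

def score_window(pssm, window):
    """Score a gapless window of the same length as the PSSM."""
    return sum(col[a] for col, a in zip(pssm, window))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(["ACG", "ACG", "ACT", "AGG"], background)
```

In the fully conserved first column the score for A is higher, and the penalty for any other letter larger, than in the more variable columns, which is the sense in which the matrix is stricter in conserved regions.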