ae6651badb998d6cbd2ed52f48a696fc.ppt
- Number of slides: 43
BBSI Research Simulation
News
• Project proposals
  - Monday, June 16
  - Format (see News, Presentations and other dates)
• Renaissance fair and other events
• Party at Greg’s house
BBSI Research Simulation
PSSMs and Search for Repeats in DNA
Application of PSSMs
• Regulatory proteins and their binding sites
• Palindromic DNA and its significance
• How to find protein binding sites: Meme
• PSSMs to find beginning of genes
• Repeated sequences and location of protein binding sites: Li et al (2002)
Regulatory Proteins and their Binding Sites
lacZ
5’-GTGAGTTAGCTCACNNNNNTANNNTNNNNNNNNNNNNNNATGNNNNNNNN
3’-CACTCAATCGAGTGNNNNNATNNNANNNNNNNNNNNNNNTACNNNNNNNN
GTA..(8)..C TAC
Labels: Operator, Crp, RNA Polymerase
Presence of CRP sites → Regulation by carbon source
Presence of X sites → Regulation by Y
Regulatory Proteins and their Binding Sites
Palindromic sequences
NNNNNNN TTAATGTGAGTTAGCTCATT NNNNNNN
NNNNNNN AATTACACTCAATCGAGTAA NNNNNNN
recognizes GTGAGTT
[Animation frames: the same palindromic-sequence slide shown rotating; no new text.]
Regulatory Proteins and their Binding Sites
Palindromic sequences
NNNNNNN TTAATGTGAGTTAGCTCATT NNNNNNN
NNNNNNN AATTACACTCAATCGAGTAA NNNNNNN
Palindromes: serve as binding sites for dimeric proteins
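A DNA palindrome reads the same on both strands, that is, the sequence equals its own reverse complement. The following is a minimal sketch of that check, not part of the original slides; the function names are hypothetical. It uses the perfect 14-mer GTGAGTTAACTCAC from the Li et al slides and the lacZ-slide site GTGAGTTAGCTCAC, which is nearly but not perfectly palindromic.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence is identical to its reverse complement."""
    return seq == reverse_complement(seq)


def mismatched_pairs(seq: str) -> int:
    """Count mirror-position pairs that break the palindrome.

    Position i is compared with the complement of position -1-i.
    For odd lengths the central base is ignored.
    """
    rc = reverse_complement(seq)
    half = len(seq) // 2
    return sum(seq[i] != rc[i] for i in range(half))


half_site = "GTGAGTT"
print(reverse_complement(half_site))        # the other half-site
print(is_palindromic("GTGAGTTAACTCAC"))     # perfect dimer site
print(mismatched_pairs("GTGAGTTAGCTCAC"))   # site from the lacZ slide
```

A dimeric protein whose subunits each recognize GTGAGTT would see the same half-site on either strand of a perfect palindrome, which is why the mirror-pair count is a convenient measure of how far a site departs from the ideal.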
Regulatory Proteins and their Binding Sites
Palindromic sequences
5’-TTAATGTGAGTTAGCTCATT-3’
3’-AATTACACTCAATCGAGTAA-5’
DNA: cruciform
RNA: stem/loop (tRNA)
[Figure: the palindrome drawn as a cruciform, with paired stems between the flanks TTAAT and TCATT (top strand) and AATTA and AGTAA (bottom strand)]
Regulatory Proteins and their Binding Sites
Palindromic sequences
5’-TTAATGTGAGTTAGCTCATT-3’
3’-AATTACACTCAATCGAGTAA-5’
Function of palindrome
RNA secondary structure?
Binding site for dimeric protein?
How to tell?
[Figure: the palindrome drawn as a cruciform/stem-loop, and as a linear duplex TTAATGTGAGTTAGCTCATT / AATTACACTCAATCGAGTAA]
[Animation frames of the same slide: in the stem-loop drawing one G·C pair becomes A·C and then A·T; in the linear duplex the top strand becomes TTAATGTAAGTTAGCTCATT and the bottom strand then becomes AATTACATTCAATCGAGTAA.]
Regulatory Proteins and their Binding Sites
Palindromic sequences
5’-TTAATGTGAGTTAGCTCATT-3’
3’-AATTACACTCAATCGAGTAA-5’
How to tell?
Compensatory mutations: RNA (secondary structure)
Uncorrelated mutations: protein (binding site)
[Figure: stem-loop drawing and mutated linear duplex TTAATGTAAGTTAGCTCATT / AATTACATTCAATCGAGTAA]
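The "compensatory versus uncorrelated" test can be illustrated with a small classifier for one stem position. This is only a sketch under stated assumptions: the function name and the category labels are hypothetical, only Watson-Crick pairs are counted (G·U wobble pairs in RNA are ignored), and bases are written in the DNA alphabet as on the slides.

```python
# Watson-Crick pairs only; G.U wobble pairing is deliberately ignored here.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def classify_stem_change(before: tuple, after: tuple) -> str:
    """Classify a change at two positions that face each other in a stem.

    before, after: (base_i, base_j) for the two stem positions.
    """
    if before == after:
        return "unchanged"
    if after in WATSON_CRICK:
        # Pairing holds after the change. If it also held before, both
        # bases must have changed together (a compensatory double change).
        return "compensatory" if before in WATSON_CRICK else "restoring"
    return "disrupting"


# A stem pair changing from G.C to A.C and then to A.T:
print(classify_stem_change(("G", "C"), ("A", "C")))  # first mutation alone
print(classify_stem_change(("A", "C"), ("A", "T")))  # second mutation
print(classify_stem_change(("G", "C"), ("A", "T")))  # net effect of both
```

If aligned homologous sequences mostly show net "compensatory" changes at mirrored positions, base pairing (RNA structure) is being conserved; if the two halves change independently of each other, a protein reading each half-site is the more natural explanation.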
Regulatory Proteins and their Binding Sites
How to find them?
Count all in certain class (Li et al, 2000)
Guess a pattern and improve (Meme, Gibbs sampler)
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin         GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E           TGCCGCCGCGTGACCTTCACACTTCCGGTTCTTT
rpS14             GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rpS17             TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal pS19    ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1     GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2      GGGAGGGTATATAAGCGTTGGCGGACGGTTGTAGCA
α-actin skel-m.   CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin   TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCACCC
β-actin           CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
Regulatory Proteins and their Binding Sites
How Meme finds them
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Matches (one per sequence):
ACAGGGCAGAA
CCCGGGTGTTT
CCGGGGACGCG
CCCCCGGGCCT
CCGCAGAGCTG
Regulatory Proteins and their Binding Sites
How Meme finds them
How do pattern finders work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. Move around to find local maximum
Step 6. If probability score high, remember pattern and score
Step 7. Repeat Steps 1 - 5
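The seven steps can be sketched as a toy pattern finder. This is not Meme's actual algorithm (Meme uses expectation maximization); it is a simplified greedy illustration of the steps as worded on the slide. The function names, the motif width of 11 (taken from the example matches), the pseudocount of 0.5, the uniform background of 0.25, and the number of restarts are all assumptions made for the sketch.

```python
import math
import random

# The five upstream sequences shown on the slide.
SEQS = [
    "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",  # snRNA U1
    "GCCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",       # histone H1t
    "CGGCGGGGAGCCCGCGGCCGGGGACGCGGG",            # HMG-14
    "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",  # TP1
    "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",  # protamine P1
]
BASES = "ACGT"


def windows(seq, w):
    """All substrings of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def frequency_table(sites, pseudo=0.5):
    """Step 3: per-position base frequencies (with pseudocounts)."""
    table = []
    for column in zip(*sites):
        total = len(column) + pseudo * len(BASES)
        table.append({b: (column.count(b) + pseudo) / total for b in BASES})
    return table


def log_odds(site, table, background=0.25):
    """Step 4: probability of a site under the table, relative to background."""
    return sum(math.log2(table[i][b] / background) for i, b in enumerate(site))


def refine(seqs, sites, w, max_rounds=20):
    """Step 5: re-pick the best-scoring window per sequence until stable."""
    for _ in range(max_rounds):
        table = frequency_table(sites)
        new_sites = [max(windows(s, w), key=lambda x: log_odds(x, table))
                     for s in seqs]
        if new_sites == sites:
            break
        sites = new_sites
    table = frequency_table(sites)
    return sites, sum(log_odds(s, table) for s in sites)


def find_motif(seqs, w, tries=50, seed=0):
    rng = random.Random(seed)
    best_score, best_sites = float("-inf"), None
    for _ in range(tries):                                  # Step 7: repeat
        source = rng.choice(seqs)                           # Step 1: arbitrary
        start = rng.randrange(len(source) - w + 1)          # candidate pattern
        pattern = source[start:start + w]
        sites = [min(windows(s, w), key=lambda x: mismatches(x, pattern))
                 for s in seqs]                             # Step 2
        sites, score = refine(seqs, sites, w)               # Steps 3-5
        if score > best_score:                              # Step 6: remember
            best_score, best_sites = score, sites
    return best_score, best_sites


score, sites = find_motif(SEQS, 11)
for site in sites:
    print(site)
```

Because each restart only climbs to a local maximum, the result depends on the starting pattern, which is why the procedure is repeated from many arbitrary starts and only the best-scoring pattern is kept.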
Regulatory Proteins and their Binding Sites
How Meme finds them
• You’ve found a gene related to Purple Tongue Syndrome
• BlastP: Encoded protein related to cAMP-binding proteins
• Are the similarities trivial? Related to cAMP binding?
• Does your protein contain a cAMP-binding site?
• What IS a cAMP-binding site?
Task
1. Determine what a cAMP-binding site is
2. Determine if your protein has one
Regulatory Proteins and their Binding Sites
How Meme finds them
Strategy
1. Collect sequences of known cAMP-binding proteins
2. Run Meme, a pattern-finding program
   Ask it to find any significant motifs
   Do it
3. Rerun Meme. Demand that every protein has identified motifs
4. Run Pfam over known sequence to check
PSSMs in action
Identification of beginning of gene
Experimentally proven start sites
aceB  ACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB  ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA  ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ  TTCACACAGGAAACAGCTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGGGAAATGGCTCAA
sucA  GATGCTTAAGGGATCACGATGCAGAAC
trpE  CAAAATTAGAGAATAACAATGCAAACA
unknown
PSSMs in action
Identification of beginning of gene
aceB  ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI        ACCTCGAAGGGAGCAG...GAGTGAAAAAC
bioB        ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA          ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH        TAGAAAAAAGGAAATG...CTATGAAGTCT
lacZ         TTCACACAGGAAACAG..CTATGACCATG
rpsJ           AATTGGAGCTCTGGTCTCATGCAGAAC
serC        GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA         GATGCTTAAGGGATCA..CGATGCAGAAC
trpE        CAAAATTAGAGAATA...ACAATGCAAACA
PSSM row labels: A C G T
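A position-specific scoring matrix for these start sites can be built by counting bases per column and converting the counts to log-odds scores. The sketch below is an illustration rather than the deck's own procedure: it uses the ten ungapped 27-nt windows from the "Experimentally proven start sites" slide, and the pseudocount of 1, the uniform background of 0.25, and all function names are assumptions.

```python
import math

# Ungapped 27-nt windows around experimentally proven start sites.
START_SITES = {
    "aceB": "ACTATGGAGCATCTGCACATGAAAACC",
    "atpI": "ACCTCGAAGGGAGCAGGAGTGAAAAAC",
    "bioB": "ACGTTTTGGAGAAGCCCCATGGCTCAC",
    "glnA": "ATCCAGGAGAGTTAAAGTATGTCCGCT",
    "glnH": "TAGAAAAAAGGAAATGCTATGAAGTCT",
    "lacZ": "TTCACACAGGAAACAGCTATGACCATG",
    "rpsJ": "AATTGGAGCTCTGGTCTCATGCAGAAC",
    "serC": "GCAACGTGGTGAGGGGAAATGGCTCAA",
    "sucA": "GATGCTTAAGGGATCACGATGCAGAAC",
    "trpE": "CAAAATTAGAGAATAACAATGCAAACA",
}
BASES = "ACGT"


def count_matrix(seqs):
    """Count each base at each position (one dict per column)."""
    width = len(seqs[0])
    counts = [{b: 0 for b in BASES} for _ in range(width)]
    for seq in seqs:
        for i, base in enumerate(seq):
            counts[i][base] += 1
    return counts


def pssm_from_counts(counts, pseudo=1.0, background=0.25):
    """Convert counts to log2-odds scores against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudo * len(BASES)
        matrix.append({
            b: math.log2(((column[b] + pseudo) / total) / background)
            for b in BASES
        })
    return matrix


def score(seq, matrix):
    """Sum the per-position scores for a candidate window."""
    return sum(matrix[i][base] for i, base in enumerate(seq))


counts = count_matrix(list(START_SITES.values()))
pssm = pssm_from_counts(counts)

# Print the count table with one row per base, as on the slide.
for base in BASES:
    print(base, " ".join(f"{col[base]:2d}" for col in counts))
```

To test an unknown gene, one would slide a 27-nt window along its upstream region and call `score(window, pssm)` at each offset; high-scoring offsets are candidate start sites.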
PSSMs in action Algorithm to find binding sites (Li et al)
Li et al (2002) Algorithm
Calculation of probability by Poisson equation
How likely is it to find: GTGAGTTAACTCAC
Frequency of GTGAGTT = $f_1$
Frequency of AACTCAC = $f_2$
Frequency of joint occurrence = $f_1 \cdot f_2 = f_{12}$
Dimer occurred $n$ times. How likely is that?
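Li et al (2002) Algorithm
Calculation of probability by Poisson equation
Probability of $n$ occurrences of dimer =
\[
\underbrace{f_{12} \cdots f_{12}}_{n \text{ times}} \cdot \underbrace{(1 - f_{12}) \cdots (1 - f_{12})}_{N - n \text{ times}} \cdot \frac{N!}{n! \, (N - n)!}
\]
where $\frac{N!}{n! \, (N - n)!} = {}_N C_n$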
Li et al (2002) Algorithm
Calculation of probability by Poisson equation
Probability of $n$ occurrences of dimer =
\[
\frac{N!}{n! \, (N - n)!} \cdot (f_{12})^{n} \cdot (1 - f_{12})^{N - n}
\]
Expected number = $m = f_{12} \cdot N$, so $f_{12} = m / N$
\[
= \frac{N!}{n! \, (N - n)!} \cdot (m/N)^{n} \cdot (1 - m/N)^{N - n}
\]
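Li et al (2002) Algorithm
Calculation of probability by Poisson equation
Probability of $n$ occurrences of dimer =
\[
\frac{N!}{n! \, (N - n)!} \cdot (m/N)^{n} \cdot (1 - m/N)^{N - n}
\]
\[
= \frac{N!}{n! \, (N - n)!} \cdot \frac{m^{n} \cdot (1 - m/N)^{N}}{N^{n} \cdot (1 - m/N)^{n}}
\]
\[
\approx \frac{N!}{n! \, (N - n)!} \cdot \frac{m^{n} \cdot (1 - m/N)^{N}}{N^{n} \cdot (1)^{n}}
\]
\[
\approx \frac{N!}{n! \, (N - n)!} \cdot \frac{m^{n} \cdot e^{-m}}{N^{n} \cdot (1)^{n}}
\]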
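Li et al (2002) Algorithm
Calculation of probability by Poisson equation
Probability of $n$ occurrences of dimer =
\[
\frac{N!}{n! \, (N - n)!} \cdot \frac{m^{n} \cdot e^{-m}}{N^{n} \cdot (1)^{n}}
\]
\[
= \frac{N \cdot (N - 1) \cdot (N - 2) \cdots (N - n + 1)}{n!} \cdot \frac{m^{n} \cdot e^{-m}}{N^{n} \cdot (1)^{n}}
\]
\[
\approx \frac{m^{n} \cdot e^{-m}}{n!}
\]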
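The derivation can be checked numerically by comparing the exact binomial probability with the Poisson approximation. In this sketch the values of N, f_1 and f_2 are hypothetical (N is taken to be the number of positions examined, which the slides do not specify), and the function names are not from the source.

```python
import math


def binomial_prob(n: int, N: int, f12: float) -> float:
    """Exact probability of n occurrences in N trials."""
    return math.comb(N, n) * f12**n * (1.0 - f12) ** (N - n)


def poisson_prob(n: int, m: float) -> float:
    """Poisson approximation: m^n * e^(-m) / n!"""
    return m**n * math.exp(-m) / math.factorial(n)


# Hypothetical numbers, for illustration only.
N = 1_000_000          # positions examined (assumed)
f1 = 2.0e-3            # frequency of the first half-site (assumed)
f2 = 1.0e-3            # frequency of the second half-site (assumed)
f12 = f1 * f2          # joint frequency, as defined on the slide
m = f12 * N            # expected number of dimers

for n in range(6):
    exact = binomial_prob(n, N, f12)
    approx = poisson_prob(n, m)
    print(f"n={n}  binomial={exact:.6f}  poisson={approx:.6f}")
```

The two columns agree closely whenever N is large and m/N is small, which is exactly the regime assumed in the approximation steps. A dimer observed far more often than m predicts has a very small Poisson probability and is therefore a candidate binding site.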