d49a7f776c7157ddb2b529f2f93ca407.ppt
- Number of slides: 25
PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search
Laxman Yetukuri
T-61.6070: Modeling of Proteomics Data
Outline
- Motivation
- Basics: MS and MS/MS for Protein Identification
- Computational Framework of Database Search
- Scoring Algorithms
  - PepHMM
  - MOWSE
- Results
- Summary
Motivation
- Proteomics studies are dynamic and context-sensitive
- Speed and accuracy of omics-driven methods
  - High-throughput MS-based approaches
- Real analysis starts with protein identification
- Protein identification is challenging
  - The heart of a protein identification algorithm is its scoring function
Protein Identification Is Challenging
- Sample contamination
- Imperfect fragmentation
- Post-translational modifications
- Low signal-to-noise ratio
- Machine errors
Basics: MS and MS/MS for Protein Identification
[Workflow figure; labels: Liquid Chromatography; Trypsin Digest; Mass Spectrometry; Precursor selection + collision-induced dissociation (CID); MS/MS]
Computational Problem
Source: Nesvizhskii and Aebersold, Drug Discovery Today, 2004, 9, 173-181
Peptide Fragmentation: b & y ions
[Figure: peptide backbone (-HN-CH-CO-NH-CH-) with side chains R_i and R_{i+1}; labeled ions: b_i, b_{i+1}, y_{n-i-1}]
Peptide Fragmentation: b & y ions
b ions   residue   y ions
    88      S       1166
   145      G       1080
   292      F       1022
   405      L        875
   534      E        762
   663      E        633
   778      D        504
   907      E        389
  1020      L        260
  1166      K        147
[Figure: MS/MS spectrum, % Intensity (0-100) vs m/z (0-1000), with peaks labeled b3-b9 and y2-y9]
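The b and y ion values tabulated on this slide for the peptide SGFLEEDELK can be approximately reproduced from residue masses. The sketch below is illustrative only: it assumes singly charged ions and standard monoisotopic residue masses, and the function name `b_y_ions` is hypothetical. The slide appears to list nominal masses, so agreement is expected only to within about 1 Da.

```python
# Monoisotopic residue masses (Da), restricted to residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056   # mass of H2O, added to y ions
PROTON = 1.00728   # mass of the charge-carrying proton (singly charged ions assumed)


def b_y_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is the N-terminal fragment of length i, y[i-1] the C-terminal
    fragment of length i, for i = 1 .. len(peptide) - 1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b_ions, y_ions = b_y_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:8.2f}   y{i} = {y:8.2f}")
```

The last b-column entry on the slide (1166) coincides with the protonated intact peptide mass, so it is not generated by this fragment-only calculation.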
Peptide Fragmentation with other ions
[Figure: peptide backbone (-HN-CH-CO-NH-CH-) with side chains R_i and R_{i+1}; labeled ions: a_i, b_i, c_i, b_{i+1}, x_{n-i}, y_{n-i}, z_{n-i}, y_{n-i-1}]
Peptide Identification
Two main methods for tandem MS:
- De novo interpretation
- Sequence database search
De Novo Interpretation
[Figure: MS/MS spectrum, % Intensity (0-100) vs m/z (0-1000), with residue letters annotated between consecutive peaks]
Sequence Database Search
- Widely used approach
- Compares peptides from a protein sequence database with experimental spectra
- A scoring function summarises the comparison
  - Critical for any search engine
- Score each peptide against the spectrum
  - Cross-correlation (SEQUEST)
  - MOWSE scoring and its extensions (MASCOT)
  - Probabilistic scoring systems (OMSSA, OLAV, ProbID, …)
- PepHMM is an HMM-based probabilistic scoring function
Computational Framework for PepHMM
- MSDB-based peptide extraction
- Hypothetical spectrum generation
  - b, y, y-H2O, b2+ and y2+
- Computing probabilistic scores
  - Initial classification: match, missing or noise
  - Compute PepHMM scores (discussed later)
- Compute Z-score
- Compute E-score
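The initial classification step (match, missing or noise) can be sketched as follows. This is a minimal illustration under assumptions not stated on the slide: a fixed absolute m/z tolerance, nearest-peak matching, and the hypothetical function name `classify_peaks`. A theoretical ion with an observed peak within tolerance is a match, a theoretical ion without one is missing, and an observed peak assigned to no theoretical ion is noise.

```python
def classify_peaks(theoretical_mz, observed_mz, tolerance=0.5):
    """Split peaks into matches, missing ions and noise peaks.

    theoretical_mz: m/z values of the hypothetical spectrum.
    observed_mz:    m/z values of the experimental spectrum.
    tolerance:      assumed absolute matching window in m/z units.
    """
    matches, missing = [], []
    used = set()  # indices of observed peaks assigned to some theoretical ion
    for t in theoretical_mz:
        best = None
        for idx, o in enumerate(observed_mz):
            within = abs(o - t) <= tolerance
            closer = best is None or abs(o - t) < abs(observed_mz[best] - t)
            if within and closer:
                best = idx
        if best is None:
            missing.append(t)
        else:
            matches.append((t, observed_mz[best]))
            used.add(best)
    noise = [o for idx, o in enumerate(observed_mz) if idx not in used]
    return matches, missing, noise


matches, missing, noise = classify_peaks(
    [100.0, 200.0, 300.0], [100.2, 250.0, 299.9]
)
print(matches, missing, noise)
```

Nearest-peak matching is only one possible choice; a real search engine may use a relative (ppm) tolerance or prefer the most intense peak in the window.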
Contents of the PepHMM Model
- PepHMM combines information on correlation among the ions, peak intensity and match tolerance
- Input: sets of matches, missing and noise
- Model is based on b and y ions
- Each match is associated with an observation (T, I): match tolerance T and peak intensity I
- Observation state = observed (T, I)
- Hidden state = true assignment of the observations
Model Structure
Four possible assignments, corresponding to four hidden states
Model Computation
- Goal: find the highest-scoring peptide in the database
- Let a path in the HMM be a sequence of hidden states; each path represents one configuration of the state assignments and has an associated probability
Model Computation (continued)
- Considering all possible paths
- Forward algorithm: computes the total probability of all possible paths from the first position to state v at position i
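The forward recursion described on this slide can be sketched for a generic HMM as follows. This is the textbook algorithm, not the PepHMM implementation itself: the two-state toy parameters, the names `forward` and `brute_force`, and the convention that `emit[i][v]` already holds the likelihood of the observation at position i under state v are all assumptions made for illustration. The brute-force sum over every path is included only to show that both give the same total.

```python
from itertools import product


def forward(init, trans, emit):
    """Forward algorithm.

    init[v]     : probability of starting in state v
    trans[u][v] : probability of moving from state u to state v
    emit[i][v]  : likelihood of the observation at position i under state v
    Returns (f, total), where f[i][v] is the summed probability of all paths
    that end in state v at position i, and total is the sum over f[-1].
    """
    n_states = len(init)
    f = [[init[v] * emit[0][v] for v in range(n_states)]]
    for i in range(1, len(emit)):
        prev = f[-1]
        f.append([
            emit[i][v] * sum(prev[u] * trans[u][v] for u in range(n_states))
            for v in range(n_states)
        ])
    return f, sum(f[-1])


def brute_force(init, trans, emit):
    """Sum the probability of every possible path explicitly (exponential cost)."""
    n_states, n_pos = len(init), len(emit)
    total = 0.0
    for path in product(range(n_states), repeat=n_pos):
        p = init[path[0]] * emit[0][path[0]]
        for i in range(1, n_pos):
            p *= trans[path[i - 1]][path[i]] * emit[i][path[i]]
        total += p
    return total


# Toy parameters (hypothetical, two hidden states, three positions).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]

f, total = forward(init, trans, emit)
print(total, brute_force(init, trans, emit))
```

The recursion costs time proportional to positions × states², whereas explicit enumeration grows exponentially with the number of positions.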
Emission Probabilities
Probability of observing (Tb, Ib) and (Ty, Iy) for state 1 at position i
- Normal distribution
- Exponential distribution
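An emission probability of this kind can be sketched as a product of densities. The slide's formula did not survive, so the following rests on labeled assumptions: match tolerance T is taken to follow a normal distribution, peak intensity I an exponential distribution, the b and y observations are treated as independent given the state, and both ions share the same parameters. The names `emission_state1`, `mu`, `sigma` and `rate` are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate (0 for x < 0)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def emission_state1(t_b, i_b, t_y, i_y, mu, sigma, rate):
    """Assumed emission density when both the b and the y ion are observed.

    Tolerances (t_b, t_y) are scored with the normal density and
    intensities (i_b, i_y) with the exponential density; the four factors
    are multiplied under the independence assumption stated above.
    """
    return (normal_pdf(t_b, mu, sigma) * exponential_pdf(i_b, rate)
            * normal_pdf(t_y, mu, sigma) * exponential_pdf(i_y, rate))


# Hypothetical parameter values, for illustration only.
print(emission_state1(0.1, 0.8, -0.2, 0.5, mu=0.0, sigma=0.3, rate=2.0))
```

In the model described in the deck, such parameters would be estimated from training spectra rather than set by hand.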
MOWSE Scoring System
- The MOWSE algorithm is implemented in the MASCOT software
- where m_{i,j} are the elements of the MOWSE frequency matrix
Data Sets
ISB data set:
1. A, B mixtures of 18 different proteins with modifications/relative amounts
2. Analysed using SEQUEST and other in-house software
3. Data set is curated
4. Final data set with charge 2+ for trypsin digestion contains 857 spectra
5. 5-fold cross-validation by random selection
   - Training set: 687 spectra
   - Testing set: 170 spectra
6. EM algorithm is used for estimating parameters
Results: Distributions of Ions
[Figure panels: b and y ions; Match Tolerance; Noise; Parameter estimates]
Comparative Studies
- Data selection was repeated 10 times to select both training and test data sets
- For each group, the parameter estimates take similar values
- A prediction is considered correct if the correct peptide has the highest score
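The evaluation protocol (repeated random selection of training and test spectra, with a prediction counted as correct when the right peptide ranks first) can be sketched as below. The split sizes 687/170 come from the data set slide; the function names, the seeding scheme and the toy scores are hypothetical and only illustrate the bookkeeping.

```python
import random


def random_split(n_spectra=857, n_test=170, seed=0):
    """Randomly partition spectrum indices into (train, test) lists."""
    rng = random.Random(seed)
    idx = list(range(n_spectra))
    rng.shuffle(idx)
    return idx[n_test:], idx[:n_test]


def top1_accuracy(scores_per_spectrum, true_peptides):
    """Fraction of spectra whose highest-scoring candidate is the true peptide.

    scores_per_spectrum: one dict per spectrum mapping candidate peptide -> score.
    true_peptides:       the known correct peptide for each spectrum.
    """
    correct = 0
    for scores, truth in zip(scores_per_spectrum, true_peptides):
        best = max(scores, key=scores.get)
        correct += best == truth
    return correct / len(true_peptides)


# Ten independent random selections, one per seed.
splits = [random_split(seed=s) for s in range(10)]
print([(len(train), len(test)) for train, test in splits][:2])

# Toy scores for two spectra (made-up values).
toy_scores = [{"PEPTIDEA": 1.0, "PEPTIDEB": 2.0}, {"PEPTIDEA": 3.0, "PEPTIDEB": 0.5}]
print(top1_accuracy(toy_scores, ["PEPTIDEB", "PEPTIDEB"]))
```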
Independent Data Set
- A. Y's Lab: a second, independent data set for comparison with other tools such as SEQUEST and MASCOT
- Size of data set = 20,980 spectra
False/True Positive Rates
Summary
- Developed a probabilistic scoring function called PepHMM for improving protein identification
- PepHMM outperforms other tools such as MASCOT, with a low false positive rate (always?)
- Can it handle ion types other than b and y ions?
- Needs to handle post-translational modifications