cb864122e4b299a3140fce93d7c4b2c7.ppt
- Number of slides: 1
Predicting Novel Transcription Factor Binding Sites in Human Using a Machine Learning Approach

Sonya Liberman 1,2, Nir Friedman 1 & Hanah Margalit 2
1 School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
2 Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University, Jerusalem, Israel

Transcription factors (TFs) regulate gene expression by binding to specific sequences on the DNA. A major challenge is to expand the known repertoire of TF-target pairs by identifying novel transcription factor binding sites (TFBSs) from sequence data. One main difficulty in such computational predictions is the large number of false positives they generate. Here we examine the association of five features with TFBSs and show that they differ between true binding sites and similar sequences that are predicted as binding sites. Using machine learning approaches, we developed a computational scheme for TFBS prediction in which sites predicted from sequence data are filtered and further classified according to these features. This results in a significant reduction in the number of false positive predictions and enables the construction of a more accurate transcription regulation network.

• Transcription is governed by cis-regulatory elements and associated transcription factors
• In order to predict new TFBSs we use motifs of known TFBSs represented by PSSMs
We use a motif search tool (TestMOTIF [6]) that predicts new TFBSs in promoter sequences according to known motifs and assigns a p-value to each prediction.
[Figure: Gene 1 and Gene 2, a promoter with a known site and a promoter without a known site]

• Known human TFBSs from the TRANSFAC database were mapped onto the human genome
• 210 sites were chosen as a reliable set of known TFBSs
• Promoters of 150 genes were searched for putative binding sites for 98 different TFs
• We predicted ~150,000 statistically significant new sites, including 174 out of 210 known TFBSs (~83%)
[Figure: known TFBSs and predicted TFBSs on a gene; true positives (83%), false negatives, false positives (?)]

To differentiate between true positive predictions and false positive predictions, the differentiation is made based on the following five features:
• Evolutionary conservation
• Number of neighboring known binding sites of other TFs
• Number of neighboring sites with a similar sequence
• Distance from the TSS of the target gene
• Orientation of the transcription factor binding

Evolutionary Conservation
Average conservation score:
  Known transcription factor binding sites: 0.57
  Sites predicted by a motif search tool: 0.24
Known TFBSs are on average more conserved than other predicted sites.

Number of neighboring known binding sites of other transcription factors
[Figure: Gene 1 with clustered TFBSs and scattered TFBSs. Different shapes indicate binding sites for different TFs.]
Sites for which the distance is less than 200 bp are considered neighbors.
Average number of neighboring known TFBSs:
  Known transcription factor binding sites: 1.04
  Sites predicted by a motif search tool: 0.44
Known TFBSs have on average more neighbors among known TFBSs than other predicted sites do.

Number of neighboring sites with a similar sequence
[Figure: Gene 1 and Gene 2, promoter with a known TFBS and promoter without a known TFBS; sequences shown: AATGATGC, GCATCATT, TTACTACG, CGTAGTAA]
Average number of sites with a similar sequence:
  Promoter with a known TFBS: 14.65
  Promoter without a known TFBS: 5.98 or 10.52 [the flattened layout does not show which of these two values belongs here and which is the true-positive average value in the cross-validation results]
Known TFBSs tend to be surrounded by other sites that match their motif.

Distance from the TSS (Transcription Start Site) of the target gene
[Figure: Distribution of the distance of sites from their target genes. Axes: position relative to TSS, number of real sites.]
• 61% of sites are located within 200 bp upstream of the TSS
• 75% are located within 400 bp upstream of the TSS

Orientation of the transcription factor binding
[Figure: Gene 1 and Gene 3; sequences shown: AACCCA, TTGGGT, TGGGTT, ACCCAA]
Among known TFBSs, only a few TFs have a specific binding orientation, i.e.:
• E2F has a defined orientation of upstream binding sites (90% have the same orientation)
• EBOX has a defined orientation of downstream binding sites (86%)
Unfortunately, only a few transcription factors have enough known binding sites to enable reliable statistics.

The Model
To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countably infinite number of Gaussian components.

Learning two separate models
The frequencies of the Gaussian components of the mixture and the parameters of each component were learned separately for the set of known sites predicted by TestMOTIF and for the set of new sites.
Average log probabilities (the four reported values are 2.623, -88.862, 17.631 and 2.296; their pairing with the rows below is not recoverable from the flattened layout):
• Known sites given a model built according to known sites (159)
• Known sites given a model built according to newly predicted sites (73607)
• Newly predicted sites given a model built according to newly predicted sites (73607)
• Newly predicted sites given a model built according to known sites (159)

Classification
New sites predicted by TestMOTIF can be classified according to their probability of being generated by the first or the second set of parameters.

Training Sets
• Each site was represented as a 5-coordinate vector (X1, X2, X3, X4, X5)
• A positive set was constructed from 159 known sites that were also discovered by TestMOTIF
• A negative set was constructed from 159 randomly chosen sites from the set of new sites predicted by TestMOTIF

Kernels
The sites were classified using 4 different kernels: Gaussian, linear, polynomial and sigmoidal.

Cross-Validation
• A sevenfold cross-validation was performed to evaluate performance using each of the kernel functions
• The linear kernel achieved the best cross-validation results
Sevenfold cross-validation results for the linear kernel:
  True positives: 83.68% (average value 10.52 or 5.98, whichever is not the similar-sequence average noted above)
  False negatives: 16.32% (average value -0.92)
  True negatives: 86.82% (average value -2.20)
  False positives: 13.18% (average value 0.64)

Classification Results
• A classifier was trained on the full set of 318 sites and correctly separated 88.68% of the training data
• All new sites predicted by TestMOTIF were tagged by the classifier
• The threshold for defining true binding sites was set to a positive score of 2
• 936/73607 (~1.3%) sites received a score above the threshold (222 unique pairs of TF and target gene)
• The final set included new target genes for 51 known transcription factors

References
1. Sinha, S., M. Blanchette, and M. Tompa, PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 2004. 5: p. 170.
2. Neal, R.M., Regression and classification using Gaussian process priors. Oxford University Press, 1998: p. 475-501.
3. Siepel, A., et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 2005. 15(8): p. 1034-50.
4. Li, N. and M. Tompa, Analysis of computational approaches for motif discovery. Algorithms Mol Biol, 2006. 1: p. 8.
5. Jensen, S.T., X.S. Liu, Q. Zhou, and J.S. Liu, Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Statistical Science, 2004. 19(1): p. 188-204.
6. Barash, Y., G. Elidan, T. Kaplan, and N. Friedman, CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics, 2005. 21(5): p. 596-600.
Acknowledgments
• Dr. Yael Altuvia for help with the feature definition
• Tommy Kaplan for his help with the TestMOTIF tool
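
The poster's neighbor-based features count other binding sites near a given site, treating sites less than 200 bp apart as neighbors. A minimal Python sketch of the "number of neighboring known binding sites of other TFs" count might look like the following. The `Site` record, the coordinate convention (position relative to the TSS), the TF and gene names, and the toy data are illustrative assumptions and are not taken from the poster.

```python
from dataclasses import dataclass

# From the poster: sites less than 200 bp apart are considered neighbors.
NEIGHBOR_WINDOW_BP = 200


@dataclass(frozen=True)
class Site:
    tf: str        # transcription factor name (hypothetical labels below)
    gene: str      # target gene whose promoter holds the site
    position: int  # assumed coordinate relative to the TSS, in bp


def count_neighbors_other_tfs(site, known_sites, window=NEIGHBOR_WINDOW_BP):
    """Count known sites of *other* TFs on the same promoter within `window` bp."""
    return sum(
        1
        for other in known_sites
        if other.gene == site.gene
        and other.tf != site.tf
        and abs(other.position - site.position) < window
    )


# Toy data, invented for illustration only.
known = [
    Site("E2F", "GENE1", -120),
    Site("SP1", "GENE1", -60),
    Site("MYC", "GENE1", -450),
    Site("SP1", "GENE2", -100),
]

clustered = count_neighbors_other_tfs(Site("E2F", "GENE1", -120), known)
scattered = count_neighbors_other_tfs(Site("MYC", "GENE1", -450), known)
print(clustered, scattered)
```

The same loop could count "neighboring sites with a similar sequence" by keeping only sites of the same TF instead of excluding them; whether the authors restricted neighbors to the same promoter is an assumption of this sketch.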
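
The "Learning two separate models" idea, fitting one density to the known sites and another to the newly predicted sites and then asking which model makes a site more probable, can be sketched as below. This is not the poster's model: the poster uses a Gaussian mixture with a countably infinite number of components, whereas this sketch swaps in a single axis-aligned Gaussian per set so that it runs with the standard library alone. The two-feature vectors (conservation score, neighbor count) and all data values are hypothetical.

```python
import math


def fit_diag_gaussian(rows):
    """Estimate a per-feature (mean, variance) pair from a list of feature vectors."""
    n = len(rows)
    params = []
    for d in range(len(rows[0])):
        column = [r[d] for r in rows]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        params.append((mean, max(var, 1e-6)))  # variance floor avoids division by zero
    return params


def log_prob(x, params):
    """Log density of x under an axis-aligned Gaussian with the given parameters."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mean) ** 2 / var)
        for xi, (mean, var) in zip(x, params)
    )


def score(x, known_model, predicted_model):
    """Log-likelihood ratio; positive means x resembles the known sites more."""
    return log_prob(x, known_model) - log_prob(x, predicted_model)


# Toy feature vectors (conservation, neighbor count), invented for illustration.
known_rows = [(0.60, 1), (0.50, 2), (0.70, 1), (0.55, 0)]
predicted_rows = [(0.20, 0), (0.30, 1), (0.25, 0), (0.10, 0)]

known_model = fit_diag_gaussian(known_rows)
predicted_model = fit_diag_gaussian(predicted_rows)

for site in [(0.60, 1), (0.20, 0)]:
    label = "known-like" if score(site, known_model, predicted_model) > 0 else "predicted-like"
    print(site, label)
```

A log-likelihood ratio is used here because it compares the two fitted models directly; it is a different quantity from the classifier score with a threshold of 2 reported under Classification Results.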