

Motif Discovery: Algorithm and Application
Dan Scanfeld, Hong Xue, Sumeet Gupta, Varun Aggarwal

Objective: Motif discovery and use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif using a motif-finding algorithm
  • Genome-wide functional analysis using the motif to find biological patterns

Why Nanog: Relevance to ES Cells
  • 1 genome, 1 cell; >200 phenotypes, 10^13 cells
  • Repress a key set of genes needed for an embryo to develop
  • Activate certain genes essential for cell growth
  • This key set of repressed genes activates entire networks for generating many different specialized cells and tissues

Objective: Motif discovery and use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif (Nanog) using a motif-finding algorithm
  • Genome-wide functional analysis using the motif to find biological signals

Location Analysis (ChIP-chip) in Human ES Cells (Boyer et al., Cell 122: 947-956)
  • Crosslink
  • Fragment
  • Enrich for Nanog
  • Differentially label
  • Agilent 44k 10 Set

ChIP-chip Data Analysis
  • Obtain intensities using GenePix
  • Negative-control subtracted, set-normalized
  • Perform median normalization
  • Probe-set p-value, p = 0.005
  • Sequences (500 bp), May 2004 genome release
  [Figure: enrichment ratio along chromosomal position, with IP signal and WCE signal tracks and thresholds P <= 0.01, P <= 0.005, P <= 0.001]

Objective: Motif discovery and use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif (Nanog) using a motif-finding algorithm (state of the art)
  • Genome-wide functional analysis using the motif to find biological patterns

Motif Finding Algorithm (MacIsaac et al., 2006)
  • Use structural prior (database, MacIsaac et al.)
  • Refinement: Expectation-Maximization (ZOOPS)
  • Score of found motifs: classification on unseen data
  • Significance testing on score: use of empirical p-value
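
The slide names an empirical p-value for the motif score but does not say how the null scores are generated. The sketch below is only an illustration under assumptions: `null_scores` is a hypothetical list of scores from motifs that carry no real signal (for example shuffled motifs), and the add-one correction is a common convention, not something stated in the slides.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator (an assumption, not from the slides)
    keeps the estimate away from an impossible p-value of exactly zero.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


# Hypothetical null scores; only one of nine reaches the observed score.
null_scores = [0.50, 0.52, 0.48, 0.55, 0.63, 0.51, 0.49, 0.50, 0.53]
p = empirical_p_value(0.612, null_scores)
```

With more null samples the smallest reachable p-value shrinks, so the resolution of the test is set by how many null motifs are scored.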

Refinement: Expectation-Maximization
Differences from EM in Lab 1:
  • Use of structural prior (beta = strength of prior)
  • ZOOPS (zero or one occurrence per sequence) model
  • 5th-order Markov model for background, trained over unbound sequences
  • SVM for hypothesis testing
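
A background model of this kind can be sketched as follows. This is not the authors' MATLAB code: the function names, the pseudocount, and the uniform scoring of the first `order` bases are assumptions made for the illustration, and sequences are assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"


def train_background(sequences, order=5, pseudocount=1.0):
    """Count (context -> next base) transitions and return a probability lookup."""
    counts = {}
    for seq in sequences:
        for i in range(order, len(seq)):
            row = counts.setdefault(seq[i - order:i], {})
            row[seq[i]] = row.get(seq[i], 0.0) + 1.0

    def prob(context, base):
        # Unseen contexts fall back to uniform through the pseudocounts.
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(BASES)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_prob_background(seq, prob, order=5):
    """log P(S|B); the first `order` bases are scored uniformly (assumption)."""
    lp = min(order, len(seq)) * math.log(1.0 / len(BASES))
    for i in range(order, len(seq)):
        lp += math.log(prob(seq[i - order:i], seq[i]))
    return lp


# Toy example with order 1 so the counts are easy to check by hand.
prob = train_background(["ACACACACACAC"], order=1)
```

In the setting of the slides, `sequences` would be the unbound set and `order` would be 5.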

ZOOPS Model (Bailey & Elkan 1994)
  • B: background model; M: motif model
  • Λ: fraction of bound sequences (mixture model parameter)
  • Sequences are drawn from the distribution
      P(S) = \Lambda \, P(S \mid M) + (1 - \Lambda) \, P(S \mid B)
  • Hidden variable for EM: Z_{ij} = 1 if position j in sequence i is bound by the TF, 0 if not
  • E-step (the left-hand side is P(M bound at j | S_i); the denominator is P(S_i)):
      \Pr(Z_{ij} = 1) = \frac{\Lambda \, P(S_i \text{ bound at } j \mid M)}{(1 - \Lambda) \, P(S_i \mid B) + \Lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}
  • M-step (same as before):
      • Updating M (motif model): for each position p of the motif model and each base b (A, C, G or T), with Base_{ij} the base at position j of sequence i,
          \mathrm{PWM}(p, b) \propto \sum_{i} \sum_{j} \Pr(Z_{i,\, j-p+1} = 1) \, [\mathrm{Base}_{ij} = b] + \text{pseudocounts}
        then normalize over b
      • Background model: not updated
      • Updating Λ:
          \Lambda = \frac{\sum_{i} \sum_{j} \Pr(Z_{ij} = 1)}{\text{number of sequences}}
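
The E-step and the Λ update on this slide can be sketched for a single sequence as below. Two simplifications are assumptions of this sketch, not the slides' setup: the background is a zero-order base-frequency table instead of the 5th-order Markov model, and a uniform prior 1/m over the m possible start positions is folded into P(S_i bound at j | M). Dividing numerator and denominator by P(S_i | B) leaves only per-window likelihood ratios.

```python
def zoops_e_step(seq, pwm, background, lam):
    """Posterior Pr(Z_ij = 1) for every start position j of one sequence.

    pwm        : list of dicts, one per motif column, base -> probability
    background : dict base -> probability (zero-order, an assumption)
    lam        : current estimate of the fraction of bound sequences
    """
    width = len(pwm)
    m = len(seq) - width + 1  # number of possible start positions
    ratios = []
    for j in range(m):
        r = 1.0
        for p in range(width):
            base = seq[j + p]
            r *= pwm[p][base] / background[base]
        ratios.append(r / m)  # uniform prior over start positions
    denom = (1.0 - lam) + lam * sum(ratios)
    return [lam * r / denom for r in ratios]


def update_lambda(posteriors_per_sequence):
    """Lambda = (sum over i and j of Pr(Z_ij = 1)) / number of sequences."""
    total = sum(sum(post) for post in posteriors_per_sequence)
    return total / len(posteriors_per_sequence)


# Toy 2-column motif preferring "TA" (hypothetical numbers).
toy_pwm = [
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
posterior = zoops_e_step("GGTAGG", toy_pwm, uniform, lam=0.5)
```

Because at most one site per sequence is allowed, the posteriors of one sequence sum to at most 1, and that sum is the posterior probability that the sequence is bound at all.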

Hypothesis testing
  • Get motifs from EM
  • Use 2 sets of bound and unbound sequences (train and test)
  • Train a linear SVM on the train set
  • Find classification error on the test set: Error = misclassifications / total samples
  • Score = 1 - error
  • Classifier input = P(S|M)/P(S|B); output = bound (B) or unbound (UB)
  [Diagram: EM on bound sequences yields the motif (M); train set: train classifier; test set: test classifier]
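
The scoring step can be illustrated with a deliberately simpler classifier. The slides use a linear SVM (S. R. Gunn's MATLAB toolbox); the sketch below swaps in a plain threshold on a single feature, assumed here to be log(P(S|M)/P(S|B)), so that it stays self-contained. The function names and all numbers are hypothetical.

```python
def fit_threshold(features, labels):
    """Pick the cut on a 1-D feature that minimizes training error.

    Stand-in for the linear SVM of the slides; labels are True for bound.
    """
    values = sorted(set(features))
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)

    def errors(t):
        return sum((f > t) != y for f, y in zip(features, labels))

    return min(candidates, key=errors)


def classification_score(features, labels, threshold):
    """Score = 1 - error, with error = misclassifications / total samples."""
    wrong = sum((f > threshold) != y for f, y in zip(features, labels))
    return 1.0 - wrong / len(labels)


# Hypothetical log-likelihood ratios for a train and a test set.
train_x, train_y = [2.0, 1.5, 0.2, -0.3], [True, True, False, False]
test_x, test_y = [1.0, 0.5, 0.9, -1.0], [True, True, False, False]
cut = fit_threshold(train_x, train_y)
score = classification_score(test_x, test_y, cut)
```

The threshold is fitted on the train set only, so the test score is measured on sequences the classifier has not seen, which is the point of the slide.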

Expectation-Maximization: When to stop? Will it overtrain?
  • Rules of thumb (stop when the likelihood increases very slowly)
      • Second derivative is negative for a given number of times
      • Euclidean distance is less than a given value
  • Over-trains to the given sequences
      • Maximizes likelihood of the motif in the given sequences; disregards its likelihood in unbound sequences
  • Find test classification error at each EM step using SVMs
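
One way to read the second-derivative rule is sketched below, using the discrete second difference of the log-likelihood curve. Requiring the negative values to be consecutive is an assumption of this sketch; the defaults of 60 minimum iterations and 5 negative values follow the run details given later in the deck, and the function name is hypothetical.

```python
def should_stop(loglik_history, min_iterations=60, patience=5):
    """Stop once the last `patience` second differences are all negative.

    A negative second difference means the likelihood gain is shrinking,
    i.e. the curve is flattening out.
    """
    n = len(loglik_history)
    if n < max(min_iterations, patience + 2):
        return False
    recent = loglik_history[-(patience + 2):]
    second_diffs = [
        recent[k + 2] - 2.0 * recent[k + 1] + recent[k]
        for k in range(patience)
    ]
    return all(d < 0 for d in second_diffs)


# Hypothetical log-likelihood curve that rises and flattens.
history = [-100.0 * 0.9 ** t for t in range(70)]
stop_now = should_stop(history)
```

This rule looks only at the training likelihood, so it says nothing about overtraining; that is why the slides also track test classification error at each step.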

Expectation-Maximization: A different methodology
  • 4 sets of data: bound (for EM); B & UB (train SVM); B & UB (test SVM); B & UB (validation)
  • At each EM iteration, train SVM and find test error
  • Use two kinds of motifs
      • Best-test-error motif
      • EM last-iteration motif
  • Choose 10 best hypotheses
  • Use larger validation set
  [Diagram: initial points pass through EM iterations, each followed by "SVM & Error", to reach the final motif]

Expectation-Maximization: Details of run
  • Transcription factor: Nanog
  • Beta = [0 0.2 0.35 0.6 0.7 1] (strength of prior)
  • 5 motifs per beta by masking motifs
  • Motif length: 8
  • 25 bound sequences for EM
  • 500 base pairs in each sequence
  • 150 total train sequences (SVM) [low: noisy]
  • 150 total test sequences (SVM) [low: noisy]
  • 500 total validation sequences
  • c = [1e-3, 0.05, 100.0] (SVM: budget for misclassifications)
  • EM for a minimum of 60 iterations; second derivative is negative for five iterations

Expectation-Maximization: Representative score graphs during EM iterations
  [Figure: four panels, Beta 0.0, Beta 0.35, Beta 0.6 and Beta 0.7; x-axis: EM iteration; y-axis: score of motif]

Expectation-Maximization: Test and validate error of refined motifs
  [Figure: two panels, test classification score and validate classification score; x-axis: beta value; y-axis: score of motif; *: end-of-iteration EM result; o: best of iteration]

Expectation-Maximization iteration: When is it the best-of-iteration?
  [Figure: for each run, total iterations and the iteration at which the best-of-iteration motif occurred]

Expectation-Maximization Results
  • 6 out of 7 top-ranking motifs were best-of-iteration and 1 was end-of-iteration (6 out of 10 as well)
  • Best motif: validate error over set of 500
  • Score: 61.2%, Error: 38.8%

  Position   1         2         3         4         5         6         7         8
  A          0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
  C          0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
  G          0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
  T          0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167
  Consensus  T         A         A         T         T         A or G    C or G    T
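
The consensus row can be re-derived from the matrix with a few lines. The matrix values are copied from the slide; the 0.4 cutoff for calling a base at a position is an assumption of this sketch, chosen because it reproduces the consensus shown, and the bracket notation for two-base positions is likewise just a convention.

```python
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268, 0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022, 0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620, 0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090, 0.526506, 0.000040, 0.080927, 0.482167],
}


def consensus(pwm, threshold=0.4):
    """Report every base whose probability reaches `threshold` at each position."""
    width = len(pwm["A"])
    out = []
    for p in range(width):
        strong = [b for b in "ACGT" if pwm[b][p] >= threshold]
        if not strong:
            # No base is dominant: fall back to the single most likely one.
            strong = [max("ACGT", key=lambda b: pwm[b][p])]
        out.append(strong[0] if len(strong) == 1 else "[" + "".join(strong) + "]")
    return "".join(out)


motif = consensus(PWM)  # "TAATT[AG][CG]T"
```

The TAAT core at positions 1 to 4 is the strongest part of the motif; positions 6 and 7 are each split almost evenly between two bases.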

Assumptions and Caveats
  • Random baseline: end-of-run motif in EM
  • Low number of sequences for test error
  • Bound sets may actually not be bound; better to use highly probable sequences as bound
  • All runs (including beta = 0) used the structural prior as the starting point

Objective: Motif discovery and use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif (Nanog) using a motif-finding algorithm
  • Genome-wide functional analysis using the motif to find biological patterns

GSEA (Subramanian et al. 2005)
  • Gene Set Enrichment Analysis (GSEA) determines whether an a priori defined set of genes shows statistically significant differences between two biological states.

GSEA Output
  • Enrichment plot
  • Gene list
  • Gene set information

GSEA Ranked List
  • Set of promoter sequences for every human gene: 2000 bp upstream and 200 bp downstream of the transcription initiation site
  • Score each promoter for likelihood of the motif
  • Input this ranked list into GSEA
  • Search for gene sets enriched in the ranked list
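
The slides do not say how a promoter is scored for the motif. The sketch below assumes one plausible choice, the best log-odds window against a uniform background, and then sorts genes by that score to produce a ranked list. The gene names, sequences and the toy 2-column motif are hypothetical, and sequences are assumed to contain only A, C, G and T.

```python
import math


def best_window_score(seq, pwm, background=0.25):
    """Highest log-odds score of the motif over all windows of the promoter."""
    width = len(pwm)
    best = -math.inf
    for j in range(len(seq) - width + 1):
        score = sum(math.log(pwm[p][seq[j + p]] / background) for p in range(width))
        best = max(best, score)
    return best


def rank_promoters(promoters, pwm):
    """Return (gene, score) pairs sorted from most to least motif-like."""
    scored = [(gene, best_window_score(seq, pwm)) for gene, seq in promoters.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy motif preferring "TA" and three made-up promoters.
toy_pwm = [
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
]
promoters = {"geneA": "GGGTAGG", "geneB": "GGGGGGG", "geneC": "GGTGAGG"}
ranked = rank_promoters(promoters, toy_pwm)
```

GSEA's preranked mode takes a list of this shape, one gene and one score per row, so the output can be written out directly as its input.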

Results
  • Human embryonic stem cell genes OCT4, NANOG, STELLAR, and GDF3 are expressed in both seminoma and breast carcinoma (Ezeh et al. 2006)
  • Breast cancer gene set found at p-value: 0.008

Implementation Details
  • Young Lab error model for ChIP-chip data analysis
  • Motif-finding algorithm in MATLAB
      • Implemented Markov model
      • Implemented ZOOPS model
      • Integrated SVM Toolbox (by S. R. Gunn) with code
  • Used structural prior from MacIsaac et al. 2006
  • Used GSEA software for functional analysis

Future Directions
  • Algorithm
      • Better use of classification error: maximize likelihood in bound + minimize likelihood in unbound (multi-objective optimization using GAs)
      • Biological information: distance from transcription site, conservation
      • Integrating expression data
  • Cross-species motif search and functional analysis, maybe using GO terms
  • Scoring
  • Sequence length

Acknowledgments
  • Fraenkel Lab
  • Young Lab
  • Kenzie D. MacIsaac
  • Dr. David Gifford (CSAIL)
  • Dr. Richard Young (WIBR)
  • Dr. Tommi Jaakkola (CSAIL)