Identification of regulatory elements using high-throughput binding evidence

  • Number of slides: 99

Identification of regulatory elements using high-throughput binding evidence. Inference of population structure on large genetic data sets.
Stoyan Georgiev
Advisors: Uwe Ohler and Sayan Mukherjee
Computational Biology and Bioinformatics, Duke University
February 2011

Outline
• Motif analysis
  – Transcriptional regulation
    • genome-wide DNA binding data (Georgiev et al. 2010)
  – Post-transcriptional regulation
    • transcriptome-wide RNA binding data (Mukherjee et al., under review; Corcoran* and Georgiev* et al., submitted)
• Inference of population structure
  – randomized algorithm

Motif analysis

Outline
• Introduction
• Transcriptional regulation
  – Problem statement
  – Genomic assays
  – Statistical framework
  – Results
• Post-transcriptional regulation

Gene regulation
[Figure labels: Nucleus: transcription (DNA motifs); splicing, capping, poly-adenylation; export. Cytoplasm: stability; translation; miR, RBP, miR-RBP complexes. RBP = RNA-binding proteins (RNA motifs)]

Gene regulatory code
• Transcriptional regulation: short patterns in DNA (motifs) control the initiation of production of gene transcripts
  – mechanism: sequence-specific DNA binding proteins (TFs)
  – Motif Discovery Tool: cERMIT (Georgiev et al. 2010)
• Post-transcriptional regulation: short patterns in RNA control the utilization of gene transcripts
  – mechanism: sequence-specific RNA binding proteins (RBPs), or microRNA mediated
  – Motif Analysis Tool (Corcoran* and Georgiev* et al.; Mukherjee et al.)

Transcriptional regulation

Transcriptional regulation
• Chromatin arrangement
• Activity of transcription factors
  – intra-cellular environment
  – cis-regulatory code
• DNA methylation
• Copy Number Variation

Simplified abstraction
[Figure; axis label: location]

ChIP-seq

cERMIT
• Computational tool for de novo motif discovery
  – Predicts the binding motif and functional targets of a specific transcription factor (TF) of interest using genome-wide measurements of binding (e.g. ChIP-seq, ChIP-chip) (Georgiev et al. 2010)
• Input: set of sequence regions with assigned binding evidence
• Output: ranked list of predicted binding motifs and corresponding target locations

Brief introduction to cERMIT
• Binding site representation: consensus sequence
• Search for the "best" binding site that explains the genome-wide binding evidence
  – "best": occurs in regions that tend to have high evidence of being bound (this is formalized as a normalized average score)
  – in theory, we can evaluate all possible binding sites up to some reasonable length
  – in practice, we try to cover as many as possible
    • start with all possible 5-mers (AAAAA, AAAAG, AAAAC, ..., TTTTT)
    • for each, evaluate its "neighbours" and replace it with the "best" one
    • repeat until no neighbour scores better than the current motif
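
The seed-and-evolve search on this slide can be sketched in a few lines of Python. This is an illustration only, not the cERMIT implementation: the z-like form of the score, the neighbourhood (single-base substitutions plus one-base extensions), the plain ACGT alphabet without IUPAC codes or reverse complements, and all function names are assumptions made for the example.

```python
import itertools
import math
import statistics

BASES = "ACGT"

def enrichment_score(motif, regions):
    """Z-like normalized average evidence of the regions containing `motif`.

    Assumed form (the slide only says "normalized average score"):
    (mean evidence of hit regions - overall mean) / (overall sd / sqrt(#hits)).
    """
    scores = [ev for _, ev in regions]
    hits = [ev for seq, ev in regions if motif in seq]
    if not hits:
        return float("-inf")
    sd = statistics.pstdev(scores)
    if sd == 0:
        return 0.0
    return (statistics.mean(hits) - statistics.mean(scores)) / (sd / math.sqrt(len(hits)))

def neighbours(motif, max_len=8):
    """Motifs one edit away: single-base substitutions and one-base extensions."""
    out = set()
    for i, b in itertools.product(range(len(motif)), BASES):
        if motif[i] != b:
            out.add(motif[:i] + b + motif[i + 1:])
    if len(motif) < max_len:
        for b in BASES:
            out.add(b + motif)
            out.add(motif + b)
    return sorted(out)  # sorted for reproducible tie-breaking

def evolve(seed, regions):
    """Greedy hill climbing: move to the best neighbour until none scores better."""
    current, best = seed, enrichment_score(seed, regions)
    while True:
        cand_score, cand = max((enrichment_score(m, regions), m) for m in neighbours(current))
        if cand_score <= best:
            return current, best
        current, best = cand, cand_score

def discover(regions, k=5, top=3):
    """Evolve every k-mer seed and return the `top` distinct (motif, score) pairs."""
    seeds = ("".join(p) for p in itertools.product(BASES, repeat=k))
    results = {evolve(s, regions) for s in seeds}
    return sorted(results, key=lambda r: (-r[1], r[0]))[:top]
```

Each region is a `(sequence, evidence)` pair. Because a move is accepted only on a strict improvement, every seed stops after finitely many steps at a local optimum of the assumed score.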

Algorithmic view
[Figure: sequence regions ordered from high evidence to low evidence; 512 seed motifs (AAAAA ... TTTTT; example ES = 1.5) are evolved into longer motifs (RTGASTCA, ES = 15.0; TGACTCA, RTGASTCAK, GAWTCAYY, TGAWTCAK, ...)]
ES = normalized average binding evidence

Variable definitions

Motif model

Motif model

ChIP-seq motif discovery: input and output

Results

ChIP-chip validation
• conservation filter improves prediction accuracy (Georgiev et al. 2010)

Example yeast ChIP-chip output
[Figure panels: SKO1, GCN4]

Human ChIP-seq: prediction vs. literature
• CTCF: Barski et al. 2007
• STAT1: Robertson et al. 2007
• SRF: Valouev et al. 2008

Post-transcriptional control

Gene regulatory code
• Post-transcriptional regulation: short patterns in RNA control the utilization of gene transcripts
  – mechanism: sequence-specific RNA binding proteins (RBPs), or microRNA mediated, to control translation
  – Motif Analysis Tool (Corcoran*, Georgiev* et al.; Mukherjee et al.)

PAR-CLIP
• CLIP: cross-linking and immunoprecipitation
  – a method for transcriptome-wide identification of RNA-protein interaction sites
  – problem: quite noisy
• PAR-CLIP = CLIP + photoactivatable nucleotides
  – more efficient cross-linking
  – directly observable evidence of protein-RNA cross-linking: upon reverse transcription, a T->C conversion appears at or near the interaction site

PAR-CLIP
1. culture with 4-SU
2. cross-link
3. immunoprecipitate & size-select
4. convert into a cDNA library & sequence
[Hafner et al. 2010]

RBP motif analysis pipeline
[Figure labels: RBP, cERMIT, motif seeds, motif predictions]

Modified motif score

Variable definitions

Motif model

Motif model

Motif model

Results

Pumilio predicted motif
• 2 million mapped reads
• # clusters with site / total # clusters = 1,162 / 8,483 (Hafner et al. 2010)

Summary
• cERMIT: motif discovery using genome-wide binding data
  – identify motifs that are highly enriched in targets with high binding evidence
  – applicable to RNA and DNA binding data
  – adjust for sequence biases and other potential confounders using a linear regression framework
• In progress…
  – Bayesian formulation
  – improve stability of predictions
  – more comprehensive search

Inference of population structure and generalized eigendecomposition

Outline
• Motivation
• Current approaches
• Extensions
  – large data sets
  – supervised dimension reduction
• Empirical results
  – Wishart simulation
  – WTCCC Crohn’s disease data set

Motivation
• A classic problem in biology and genetics is to study population structure (Cavalli-Sforza 1978, 2003)
• Genotype data on millions of loci and thousands of individuals
• Can we detect structure based on the genetic data?
  – infer population demographic histories
  – correct for population structure in disease association studies
  – correspondence to geography

Current approaches
• Structure (Pritchard et al. 2000)
  – Bayesian model-based clustering of genotype data
• Eigenstrat (Patterson et al. 2006)
  – PCA-based inference of axes of genetic variation

Population structure within Europe (Novembre et al. 2008)

Eigenstrat (Patterson et al. 2006)
• Combines Principal Component Analysis and Random Matrix Theory

Eigenstrat (Patterson et al. 2006)
• Runtime: O(m^2 n) computation, for m individuals and n SNPs
• The challenge: future (current?) genetic data sets with n ≥ 500,000 and m ≥ 20,000 (e.g. WTCCC, Nature 2007: 17,000 individuals, 500K SNP array)
• Can we extend Eigenstrat so that it runs on such data on a standard desktop?
• Assume low rank, k << min(m, n)
• Approximate algorithm in O(kmn) computation
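
To make the gap between the two operation counts concrete, here is a worked example at the data-set sizes quoted on the slide. The rank k = 10 is an assumed value chosen only for illustration, and constants hidden by the O-notation are ignored.

```latex
% Assumed sizes: m = 2 x 10^4 individuals, n = 5 x 10^5 SNPs, k = 10 (illustrative)
\begin{align*}
m^2 n   &= (2\times10^{4})^2 \cdot (5\times10^{5}) = 2\times10^{14},\\
k\,m\,n &= 10 \cdot (2\times10^{4}) \cdot (5\times10^{5}) = 1\times10^{11},\\
\frac{m^2 n}{k\,m\,n} &= \frac{m}{k} = 2000 .
\end{align*}
```

Under these assumptions the approximate algorithm needs roughly three orders of magnitude fewer operations, and the ratio m/k grows as more individuals are genotyped.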

Randomized PCA
Basic steps:
1. Random projection (approximately preserves distances)
  • project data onto low dimensional space
  • do SVD on Y -- similar to SVD on M
2. Power method: when spectrum decay is slow
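
The two steps can be sketched with NumPy as follows. This follows the standard randomized range-finder scheme described by Halko et al. (reference [5]) and is not necessarily the exact variant used in this work; the function name, the oversampling parameter, and the Gaussian test matrix are assumptions made for the example.

```python
import numpy as np

def randomized_pca(M, k, n_iter=1, oversample=10, seed=0):
    """Approximate the top-k singular triplets of an m x n matrix M."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    ell = min(k + oversample, m, n)  # working dimension, slightly larger than k

    # Step 1: random projection. Y = M @ Omega spans (approximately) the
    # dominant column space of M.
    omega = rng.standard_normal((n, ell))
    Y = M @ omega

    # Step 2: power iterations, useful when the singular values decay slowly.
    # Re-orthonormalizing each round keeps the computation numerically stable.
    for _ in range(n_iter):
        Y = np.linalg.qr(Y)[0]
        Y = M @ (M.T @ Y)

    # Orthonormal basis Q for the sampled range, then an exact SVD of the
    # small ell x n matrix B = Q^T M.
    Q = np.linalg.qr(Y)[0]
    B = Q.T @ M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]
```

Every product with M costs O(mn) per column of the working matrix, so the total cost scales with the number of iterations times k·m·n, which is the O(ikmn) figure quoted for the method.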

Properties of Randomized PCA
• Error bound on the rank-k approximation: power iteration drives the leading constant to one exponentially fast as the number of iterations i increases!
• Top k eigenvalues and eigenvectors can be well approximated in time O(ikmn)
  – rapid convergence when close to low rank structure (i = 1-3)
  – slowly decaying singular values require more iterations
• Clearly no benefit when ik ≈ m << n

Properties of Randomized PCA
• Empirical observations
  – we don’t seem to need power iteration, as random projection is good enough (data is low rank)
  – eigenvalue accuracy estimate can be “sloppy” if emphasis is on subspace estimation, assuming a spectral gap
  – often we care mainly about subspace estimation accuracy

Generalized eigendecomposition
1. (Semi) supervised dimension reduction
  – add prior information by means of class labels
  – linear and non-linear variations: (L)SIR (Li et al. 1991, Wu et al. 2010)
2. (Non-) linear embeddings
  – Laplacian Eigenmaps (Belkin and Niyogi 2002)
  – Locality Preserving Projections (He and Niyogi, 2003)
3. Canonical Correlation Analysis

Empirical results
• Wishart Covariance Structure
  – independent N(0, 1) entries for data matrix
• The Wellcome Trust Case Control Consortium (Nature 2007)
  – Crohn’s Disease; 500K SNP array; 5,000 individuals

Subspace distance metric
• Exact method -- subspace A, approx. method -- subspace B (consider column spaces)
• Construct projection operators
• Define distance metric: (Ye and Weiss, 2003)
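
The formula on this slide did not survive extraction, so the sketch below uses one common projection-based choice, the Frobenius norm of the difference between the two orthogonal projectors. That choice is an assumption and may differ from the metric actually taken from Ye and Weiss (2003). The function names are hypothetical, and both inputs are assumed to have full column rank.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto the column space of A (full column rank assumed)."""
    Q = np.linalg.qr(A)[0]  # orthonormal basis for span(A)
    return Q @ Q.T

def subspace_distance(A, B):
    """Frobenius norm of P_A - P_B; zero when the two column spaces coincide."""
    return np.linalg.norm(projector(A) - projector(B), ord="fro")
```

The value depends only on the subspaces, not on the particular bases, so it can compare the eigenvectors from the exact method with those from the randomized method even when they differ by sign or rotation.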

Wishart covariance
• Data matrix: independent N(0, 1) entries
• Runtime improvement over exact

Spiked Wishart (rank = 5)

WTCCC Crohn’s disease data set

Subspace distance metric (WTCCC)

Subspace distance metric (WTCCC)

Acknowledgements
• Uwe Ohler 1,2 & Sayan Mukherjee 2,3
• David Corcoran 1,2
• Nick Patterson 4
• Ohler & Mukherjee Group
1 Department of Biostatistics and Bioinformatics, Duke University
2 Institute for Genome Sciences and Policy, Duke University
3 Department of Statistical Sciences, Duke University
4 Broad Institute, Harvard and MIT

Thank you!

Wishart Covariance Structure
• Data matrix: independent N(0, 1) entries
• Runtime improvement over exact

Decreasing difference in dimension size

Random Wishart (# iter = 1)

Random Wishart (# iter = 2)

Random Wishart (# iter = 3)

Spiked Wishart (rank = 5)

WTCCC data (# iter = 1)

WTCCC data (# iter = 2)

Sequence region binding evidence = log[# T->C conversion events]
[Figure labels: T-T, T-C, X-linked (cross-linked) clusters; example sequence TCATGCTATTTTAGCGATCTGATCGTAGACTGTTAGTCGATGCTGTGTATTTGCA]
[David Corcoran]

Quaking predicted motif
• 4 million mapped reads
• # clusters with site / total # clusters = 3,740 / 9,998

Bibliography
[1] Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. Inference of Population Structure Using Multilocus Genotype Data (2000). Genetics, Vol. 155, 945-959.
[2] Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD. Genes mirror geography within Europe (2008). Nature. Nov 6; 456(7218): 98-101.
[3] Patterson N, Price AL, Reich D: Population Structure and Eigenanalysis (2006). PLoS Genetics 2(12): e190. doi:10.1371/journal.pgen.0020190
[4] Rokhlin V, Szlam A and Tygert M: A randomized algorithm for principal component analysis (2009). SIAM Journal on Matrix Analysis and Applications, 31(3): 1100-1124.
[5] Halko N, Martinsson P, Tropp JA. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. arXiv:0909.4061v2 [math.NA]
[6] Ye and Weiss RE: Using the bootstrap to select one of a new class of dimension reduction methods (2003). Journal of the American Statistical Association. 98, pp. 968-979.
[7] Zhu Y and Zeng P: Fourier methods for estimating the central subspace and the central mean subspace in regression (2006). Journal of the American Statistical Association. 101, pp. 1638-1651.
[8] The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls (2007). Nature. 447, pp. 661-678.

ChIP-seq papers
CTCF: Barski A, Cuddapah S, Cui K, Roh T, Schones D, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell 2007.
STAT1: Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith O, He A, Marra M, Snyder M, Jones S. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 2007.
SRF: Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods 2008.

Example of cluster generation in the Argonaute dataset

Eigenstrat

Properties of Randomized PCA
• Empirical observations
  – we don’t seem to need power iteration, as random projection is good enough (data is low rank)
  – eigenvalue accuracy estimate can be “sloppy” if emphasis is on subspace estimation, assuming a spectral gap
  – often we care mainly about subspace estimation accuracy
• Lots of “painful” implementation details
  – efficient matrix multiply
  – data packing

Inference of population structure and generalized eigendecomposition
(with Sayan Mukherjee 1 and Nick Patterson 2)
1 Department of Statistical Sciences, Duke University
2 Broad Institute, Harvard and MIT

PARalyzer
• Non-parametric kernel-density estimate classifier to identify the RNA-protein interaction sites from a combination of T=>C conversions and read density
1. Reads that have been aligned to the genome and overlap by at least 1 nucleotide are grouped together.
2. Within each read-group we generate two smoothed kernel density estimates: the first of T=>C transitions and the other of non-transition events.
3. Nucleotides within the grouped reads that maintain a minimum read depth, and where the relative likelihood of T=>C conversion is higher than non-conversion, are considered interaction sites.
4. This region is then extended either to include the underlying reads, or by a generic window size (by 3 nt for Pum).
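
Steps 2 and 3 can be illustrated with a small pure-Python sketch. It is not the PARalyzer code: the Gaussian kernel, the bandwidth, the minimum depth of 5, the use of count-weighted (unnormalized) densities, and the function names are all assumptions made for the example.

```python
import math

def kernel_density(positions, bandwidth):
    """Return f(x): sum of Gaussian kernels centred on each observed position.

    The sum is deliberately not divided by the number of events, so a signal
    backed by more events gets a proportionally higher density (an assumption).
    """
    norm = bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions) / norm

    return density

def interaction_sites(conversions, non_conversions, depth, min_depth=5, bandwidth=3.0):
    """Positions with enough reads where the T>C density exceeds the non-conversion density.

    conversions / non_conversions: positions (within the read group) of observed
    T>C and unconverted T events; depth[x] is the read depth at position x.
    """
    f_conv = kernel_density(conversions, bandwidth)
    f_non = kernel_density(non_conversions, bandwidth)
    return [x for x in range(len(depth))
            if depth[x] >= min_depth and f_conv(x) > f_non(x)]
```

The returned positions form the core of a cluster; step 4 would then extend that core to the underlying reads or by a fixed window.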

AGO
• The largest number of clusters for the Argonaute dataset was found in intergenic regions.
• Requiring at least two separate locations with observed T=>C conversions within the cluster removed a large proportion (67%) of those sites, while only removing a small proportion (24%) of clusters found in 3'UTRs.
• We therefore require all clusters to have more than one location with a T=>C conversion for all subsequent analysis.
• To increase the stringency of the CCRs, we required the mode location to have had at least 20% T=>C conversion.

Argonaute (AGO) PAR-CLIP Analysis

microRNA Enrichment Analysis Tool (mEAT)
[Figure: sequence regions ordered from high evidence to low evidence; miR seeds (miR-93, miR-15, let-7, ...); example ES = 15.0]
ES = average binding evidence across miR canonical seeds
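
A minimal sketch of the quantity behind ES for a single miRNA is given below. It assumes the canonical seed is nucleotides 2-8 of the mature miRNA, that a target site is the reverse complement of that seed, and that ES is the plain mean of the evidence over clusters containing the site. The slide does not define these details, so all three are assumptions, as are the function names.

```python
def seed_site(mirna, start=2, end=8):
    """DNA-alphabet target site complementary to miRNA nucleotides start..end (1-based)."""
    seed = mirna.upper().replace("U", "T")[start - 1:end]
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seed))

def mean_evidence(site, clusters):
    """Average evidence of the clusters whose sequence contains `site` (None if no hits)."""
    hits = [evidence for sequence, evidence in clusters if site in sequence]
    return sum(hits) / len(hits) if hits else None
```

For example, `seed_site("UGAGGUAGUAGGUUGUAUAGUU")` (the let-7a sequence) gives the site `CTACCTC`, which can then be looked up in `(sequence, evidence)` cluster pairs with `mean_evidence`.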

Variable Definitions

miR seed enrichment

miR seed enrichment

miR seed enrichment

Results

Regression Interpretation

PARalyzer (PAR-CLIP data analyzer)
• Translation of mRNAs into protein can be regulated through sequence motifs on the mRNA transcript
  – RNA binding proteins (RBPs)
• Input:
  – binding evidence for transcribed mRNAs
  – library of mRNA sequence motifs
• Output: enriched mRNA sequence motifs
[David Corcoran]

PARalyzer (PAR-CLIP data analyzer)
1. Align reads to a reference genome
2. Group adjacent reads into clusters (sequence regions)
3. Assign binding evidence to each cluster: log2[# reads]
4. Use clusters to find enriched motifs
[David Corcoran]
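
Steps 2 and 3 amount to an interval merge followed by a log transform. The sketch below assumes reads are given as `(start, end)` pairs on a single chromosome and strand with an exclusive end coordinate, and that "adjacent" means overlapping by at least one nucleotide. The data layout and function name are illustrative, not taken from the tool.

```python
import math

def cluster_reads(reads):
    """Merge overlapping reads into clusters and score each with log2(# reads)."""
    clusters = []
    for start, end in sorted(reads):
        # A read joins the current cluster if it overlaps it by >= 1 nt.
        if clusters and start < clusters[-1]["end"]:
            current = clusters[-1]
            current["end"] = max(current["end"], end)
            current["n_reads"] += 1
        else:
            clusters.append({"start": start, "end": end, "n_reads": 1})
    for cluster in clusters:
        cluster["evidence"] = math.log2(cluster["n_reads"])
    return clusters
```

A cluster supported by a single read gets evidence 0, so in practice a minimum read count per cluster would be applied before motif analysis.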

PARalyzer (PAR-CLIP data analyzer)
1. Align reads to a reference genome, allowing for up to 3 mismatches (i.e. up to 3 T->C conversion events per read)
2. Group overlapping reads
  – groups with ≥ 5 reads are further analyzed
  – clusters are extended to either the longest read that overlaps a ‘positive’ signal or until there are no longer at least 5 reads at a location
  – filter groups based on known repeat regions
3. Within each group generate sub-groups (clusters) based on the observed T->C conversion events
  – identify regions with enriched T->C relative to T->T
  – use non-parametric smoothing (KDE) to call peaks
4. Use sub-groups in downstream motif enrichment analysis
[David Corcoran]

mEAT

mEAT: Enrichment vs. Expression

Gene expression in the eukaryotic cell

ChIP-seq
PAR-CLIP
1. culture with 4-SU
2. cross-link
3. immunoprecipitate & size-select
4. convert into a cDNA library & sequence
[Hafner et al. 2010]

Motif model (Georgiev et al. 2010)

HuR
• # reads = 20M, aligned 13M
• # clusters = 250K, # clusters after pre-processing = 125K
• “explained” with presence of binding motif = 25% long, 75% two short
• plots with T->C conversions (David)
• in vitro binding studies have shown that HuR is capable of binding to AREs, including AUUUA pentamers, long poly-U stretches, and 3 to 5 nucleotide stretches of Us separated by A, C, or G (Levine et al., 1993; Meisner et al., 2004)

Quaking
• # clusters with site / total # clusters = 3,740 / 9,998
• # reads, # clusters, # “explained” with presence of binding motif, plots with T->C conversions (David)
• Group using reads, as not all X-linked

Pumilio