Скачать презентацию A Gene Expression Project Richard Mott Wellcome Trust Скачать презентацию A Gene Expression Project Richard Mott Wellcome Trust

327fa4549344c2cb61c36850b62ecaac.ppt

  • Количество слайдов: 38

A Gene Expression Project Richard Mott Wellcome Trust Centre for Human Genetics A Gene Expression Project Richard Mott Wellcome Trust Centre for Human Genetics

Project Summary • Gene Expression Analysis – Comparison with phenotypes – Gene Ontology Analysis Project Summary • Gene Expression Analysis – Comparison with phenotypes – Gene Ontology Analysis – Comparison between tissues – Gene Coexpression Networks • NB: There are many R packages available from CRAN for gene expression and network analysis, which are not covered in this lecture. – You should explore them!

Gene expression datasets • Hippocampus (460 mice), • Liver and Lung (260 mice) • Gene expression datasets • Hippocampus (460 mice), • Liver and Lung (260 mice) • 100 Phenotypes • Mice are from a Heterogeneous Stock, from 164 families

Gene Expression data • Gene expression measured on Illumina Mouse arrays – 47000 50 Gene Expression data • Gene expression measured on Illumina Mouse arrays – 47000 50 -mer probes – Approx 2 probes per gene – Covariates (eg Sex, Family) also available • > load("liver. exp. RData") • > load("liver. cov. RData") • > source("expression. tutorial. R")

Exploring Expression Data > liver. median <- apply(liver. exp, 2, median ) > hist(liver. Exploring Expression Data > liver. median <- apply(liver. exp, 2, median ) > hist(liver. median, breaks=50) > liver. subset <- liver. exp[, liver. median>6]

Sex Effects • Which transcripts have different expression levels for the two sexes? – Sex Effects • Which transcripts have different expression levels for the two sexes? – Use a T-test on each transcript – The R apply() function speeds up the analysis – First define a function tfunc that performs the T test and reports the P-value – tfunc <- function( X, GENDER ) { tt <- t. test( X ~ GENDER ); return(tt$p. value) } – Then compute the test for each transcript – > sex. pvalue <- apply(liver. subset, 2, tfunc, liver. cov$GENDER ) – Then plot the distribution of p-values – > hist( sex. pvalue, breaks=50) – > sum(sex. pvalue<1. 0 e-5) – [1] 78

Sex Effects 312/2796 (11%) of transcripts with median level > 6 have sex effects Sex Effects 312/2796 (11%) of transcripts with median level > 6 have sex effects with P < 0. 01 78/2796 (2%) of transcripts with median level > 6 have sex effects with P < 0. 00001

Family effects (Heritability) • • • Which transcripts are affected by genetic background? Use Family effects (Heritability) • • • Which transcripts are affected by genetic background? Use one-way ANOVA wrapped inside apply() First define a function to return the p-value of the ANOVA: anova. pvalue <- function( X, factor ) { a <- anova(lm( X ~ factor)) return(a[1, 5]) } Then find the transcripts with high heritability family. pvalue <- apply( liver. subset, 2, anova. pvalue, liver. cov$Family )

Family Effects 18% of transcripts with median level >6 have heritability p-value < 0. Family Effects 18% of transcripts with median level >6 have heritability p-value < 0. 01 0. 2% of transcripts with median level >6 have heritability p-value < 0. 00001

Body Weight • We can find transcripts associated with body weight in a similar Body Weight • We can find transcripts associated with body weight in a similar fashion to family effects, except that linear regression is used. – Note that the direction of causality is no longer certain, ie it is not clear whether variation in a transcript is causative for variation in weight or vice versa > weight. pvalue <- apply( liver. subset, 2, anova. pvalue, liver. cov$End. Normal. BW ) > hist(weight. pvalue, breaks=50)

Body Weight 11% of transcripts with median levels > 6 are significant at P Body Weight 11% of transcripts with median levels > 6 are significant at P < 0. 01 1. 5% of transcripts with median levels > 6 are significant at P < 0. 00001

What do the genes do? • So far we have identified sets of genes What do the genes do? • So far we have identified sets of genes which are associated with sex, family and weight • How can we characterise these genes ? • One popular method is to test if the annotations of these genes have unusual features. • Annotations include: – genome location – protein domain architecture (eg from INTERPRO) – gene function, where known (eg from GO) • From a statistical perspective, is is importation that a controlled vocabulary (ontology) is used to describe gene functions. – The analysis then does not have to understand any biology!!

The Gene Ontology (GO) http: //www. geneontology. org/ • GO associates a set of The Gene Ontology (GO) http: //www. geneontology. org/ • GO associates a set of GO-terms with every gene, describing aspects of the gene’s known function. • It has become a very popular tools for automated investigation of large sets of genes. • But note that: – GO is not complete, covering only biological processes, cellular components and molecular functions. Other ontologies are also important. – many genes have no known function

GO annotation Examples GO: 0000001 mitochondrion inheritance GO: 0000002 mitochondrial genome maintenance GO: 0000003 GO annotation Examples GO: 0000001 mitochondrion inheritance GO: 0000002 mitochondrial genome maintenance GO: 0000003 reproduction GO: 0000005 ribosomal chaperone activity GO: 0000006 high affinity zinc uptake transporter activity GO: 0000007 low-affinity zinc ion transporter activity GO: 0000008 thioredoxin GO: 0000009 alpha-1, 6 -mannosyltransferase activity GO: 0000010 trans-hexaprenyltransferase activity GO: 0000011 vacuole inheritance GO: 0000012 single strand break repair ENSMUSG 00000061404 Olfr 936 GO: 0001584 GO: 0016020 GO: 0007600 GO: 0007166 GO: 0004930 GO: 0031224 GO: 0003674 GO: 0005623 GO: 0050896 GO: 0016021 GO: 0004888 GO: 0004871 GO: 0050877 GO: 0007582 GO: 0005575 GO: 0007186 GO: 0007608 GO: 0007165 GO: 0004872 GO: 0007154 GO: 0044464 GO: 0044425 GO: 0004984 GO: 0007606 GO: 0050874 GO: 0009987 GO: 0051869 GO: 0008150 ENSMUSG 00000030105 Arl 8 b GO: 0016043 GO: 0007046 GO: 0051233 GO: 0005737 GO: 0016817 GO: 0044424 GO: 0048487 GO: 0005515 GO: 0016787 GO: 0043014 GO: 0005623 GO: 0007028 GO: 0044422 GO: 0044237 GO: 0007242 GO: 0043228 GO: 0005856 GO: 0007582 GO: 0008152 GO: 0007165 GO: 0015630 GO: 0008092 GO: 0019003 GO: 0016462 GO: 0005622 GO: 0044464 GO: 0007154 GO: 0003824 GO: 0006139 GO: 0006364 GO: 0005488 GO: 0003924 GO: 0043170 GO: 0016818 GO: 0019001 GO: 0009987 GO: 0005525 GO: 0008150 GO: 0017076 GO: 0043229 GO: 0006396 GO: 0016072 GO: 0007059 GO: 0043232 GO: 0050875 GO: 0044430 GO: 0043283 GO: 0044446 GO: 0030496 GO: 0015631 GO: 0003674 GO: 0042254 GO: 0007264 GO: 0000166 GO: 0005819 GO: 0017111 GO: 0044238 GO: 0043226 GO: 0016070 GO: 0005575 GO: 0006996 ENSMUSG 00000042428 Mgat 3 GO: 0016020 GO: 0008375 GO: 0043413 GO: 0005615 GO: 0044421 GO: 0005737 GO: 0031224 GO: 0044267 GO: 0044424 GO: 0008194 GO: 0009058 GO: 0005623 GO: 0044422 GO: 0044237 GO: 0007582 GO: 0008152 GO: 0044425 GO: 0005622 GO: 0044464 GO: 0003824 GO: 0044431 GO: 0003830 GO: 0043227 GO: 0043170 GO: 0019538 GO: 0006487 GO: 0009059 GO: 0006412 GO: 0009987 GO: 0016740 GO: 0008150 GO: 0043229 GO: 0006486 GO: 0009101 GO: 0050875 GO: 0016758 GO: 0043283 GO: 0044446 GO: 0005795 GO: 0003674 GO: 0009100 GO: 0016021 GO: 0005576 GO: 0044238 GO: 0044249 GO: 0043226 GO: 0043231 GO: 0005575 GO: 0044260 GO: 0016757 GO: 0006464 GO: 0005794 GO: 0043412 GO: 0044444

Testing for GO association • Set of genes G is classified into two groups Testing for GO association • Set of genes G is classified into two groups eg by sex • A given GO annotation term classifies the genes into two groups (present, absent) • The data are a 2 x 2 contingency table classified by sex and GO, and the test of GO/sex association can be done either by a chi-squared test or by Fisher’s Exact Test FET, or a generalised linear model with Poisson link function. • The most popular methods use the FET, which can be calculated quickly using the hypergeometric distribution, and is exact even when the counts of data are small

Testing for GO association • Read in a file of GO terms associated with Testing for GO association • Read in a file of GO terms associated with each Ensembl Mouse gene (this set has been reduced to include only those GO terms present in more than 1% of genes) go 1 <- read. delim("GO. Ensembl. 01. txt") > dim(go 1) [1] 19988 387 • Find the common transcripts between liver. subset and the annotations, and those transcripts with sex p-values < 0. 01 > intersect intersect <- unique(sort(intersect))

Testing for GO association > liver. subset. intersect <- liver. subset[, match(intersect, colnames(liver. subset))] Testing for GO association > liver. subset. intersect <- liver. subset[, match(intersect, colnames(liver. subset))] > dim( liver. subset. intersect) [1] 275 1650 > go. intersect <- go 1[match(intersect, go 1$transcript), ] > dim(go. intersect) [1] 1650 388 > sex. ids <- colnames(liver. subset)[sex. pvalue<0. 01] > sex. intersect <- sex. ids[match(sex. ids, intersect, nomatch=0)] > length(sex. intersect) [1] 174 > sex. idx <- go. intersect$transcript %in% sex. ids

Testing for GO Association using apply() and fisher. test() • fisher. func <- function( Testing for GO Association using apply() and fisher. test() • fisher. func <- function( X, sex. idx) { X <- as. factor(X) ; if ( nlevels(X) == 2 ) {f <- fisher. test(X, sex. idx); return (f$p. value)} else return(1) } • > fish <- apply( go. intersect[, 4: ncol(go. intersect)], 2, fisher. func, sex. idx ) • • > length(fish) [1] 385 • • > fish[fish < 0. 01] GO. 0000267 GO. 0002376 GO. 0003735 GO. 0005624 GO. 0005783 GO. 0005840 5. 255498 e-03 6. 142841 e-04 9. 096193 e-03 4. 108839 e-04 1. 153113 e-03 9. 125263 e-03 GO. 0006412 GO. 0006955 GO. 0009058 GO. 0009059 GO. 0016740 GO. 0016788 9. 852476 e-05 4. 726468 e-05 7. 243732 e-03 4. 532035 e-05 3. 915276 e-03 4. 224464 e-03 GO. 0030529 GO. 0043170 GO. 0043234 GO. 0044249 GO. 0044422 GO. 0044446 2. 250219 e-03 2. 157644 e-03 5. 039780 e-04 2. 347306 e-04 2. 360906 e-03 • •

Significant GO terms > data. frame( pvalue=fish[fish<0. 01], desc=as. character(go 2 name$desc[go 2 name$go Significant GO terms > data. frame( pvalue=fish[fish<0. 01], desc=as. character(go 2 name$desc[go 2 name$go %in% names(fish[fish<0. 01])])) pvalue desc GO. 0000267 5. 255498 e-03 cell fraction GO. 0002376 6. 142841 e-04 immune system process GO. 0003735 9. 096193 e-03 structural constituent of ribosome GO. 0005624 4. 108839 e-04 membrane fraction GO. 0005783 1. 153113 e-03 endoplasmic reticulum GO. 0005840 9. 125263 e-03 ribosome GO. 0006412 9. 852476 e-05 protein biosynthesis GO. 0006955 4. 726468 e-05 immune response GO. 0009058 7. 243732 e-03 biosynthesis GO. 0009059 4. 532035 e-05 macromolecule biosynthesis GO. 0016740 3. 915276 e-03 transferase activity GO. 0016788 4. 224464 e-03 hydrolase activity, acting on ester bonds GO. 0030529 2. 250219 e-03 ribonucleoprotein complex GO. 0043170 2. 157644 e-03 macromolecule metabolism GO. 0043234 5. 039780 e-04 protein complex GO. 0044249 2. 347306 e-04 cellular biosynthesis GO. 0044422 2. 360906 e-03 organelle part GO. 0044446 2. 360906 e-03 intracellular organelle part

Lecture V Lecture V

Comparing gene expression across experments • We sometimes want to compare expression levels between Comparing gene expression across experments • We sometimes want to compare expression levels between experiments – In different tissues in the same individuals – At different time points or under different experimental conditions with genetically identical individuals • Inbred lines • Cell lines • Comparisons may be done – At the level of the individual – Across the population

Comparing gene expression across tissues • A common set of individuals are used for Comparing gene expression across tissues • A common set of individuals are used for two tissues (eg liver, lung) • Look for transcripts whose expression levels across samples between the tissues are correlated samples Lung transcripts Liver transcripts

An efficient way to do the computation in R Liver transcripts Process each column An efficient way to do the computation in R Liver transcripts Process each column in the combined matrix using apply() samples Lung transcripts

Comparing gene expression across tissues > load(“lung. liver. exp”) # CONTAINS lung and liver Comparing gene expression across tissues > load(“lung. liver. exp”) # CONTAINS lung and liver stacked vertically > lung. liver <- c(rep( TRUE, 262), rep(FALSE, 262) > lung. liver. cor. pvalue <- apply( lung. liver. exp, 2, function( X, by ) { f <- cor. test(X[by], X[!by]); return (f$p. value) }, by=lung. liver ) > hist(-log 10(lung. liver. cor. pvalue), breaks=50) > lung. liver. cor. log. P <- -log 10(lung. liver. cor. pvalue) > sum( lung. liver. cor. log. P> [1] 0. 0837884 > sum( lung. liver. cor. log. P> [1] 0. 04374960 > sum( lung. liver. cor. log. P> [1] 0. 02791541 > sum( lung. liver. cor. log. P> [1] 0. 01906007 > 2)/47429 3)/47429 4)/47429 5)/47429

Distribution of P-values Distribution of P-values

Multiple Testing • Conventionally, the significance threshold a for a test is a = Multiple Testing • Conventionally, the significance threshold a for a test is a = 0. 05, or 0. 01 • However, Statistical Genetics problems frequently involve multiple testing – eg a genome scan may test over one million SNPs for association, a gene expression microarray will screen tens of thousands of transcripts. • One approach: adjust significance thresholds to be more conservative, in order to prevent too many false positives occurring. – – – • Bonferroni Correction for multiple testing: multiply p-values by the number of tests – – • In N independent tests, the expected P-value of the most significant result to occur by chance is 1/(N+1) [because P-values are uniformly distributed on [0, 1]] Whilst it is not true that tests are independent, it is often a good guide to what will be observed in reality. Often, at these scales, P-values are reported as –log 10(P-value), so that N=10 6 implies the most significant Pvalue is about 10 -6 , ie log. P = 6. Pr( all N P-values < t) = 1 -(1 -t) N ~ Nt Assumes Null hypothesis is true for all tests – probably unrealistic Raising the significance threshold lowers the power to detect associations, – so larger sample sizes are needed to compensate

Multiple Testing: False Discovery Rate • Often a different approach, based around False Discovery Multiple Testing: False Discovery Rate • Often a different approach, based around False Discovery Rates (FDR) is employed – The idea is to choose a lax threshold so that false positives occur, but to limit the proportion of false positives to say 10%. – The FDR is useful when the purpose of the screen is to identify items (SNPs, genes, etc) for further experimental investigation, and which will provide more information. – FDR = E[(#False Positives)/(#False+#True Positives) • • N p-values are arranged in increasing order p 1, p 2, …. p. N Null hypothesis: expected p-values and 1/N, 2/N, …. 1 For a given α, accept all tests 1…k such that pk < kα/N implies expected FDR is < α if tests are independent – The q-value of a test with p-value p is the FDR if p is the threshold for determining significance • T = observed # of tests with p-value

GO associations of transcripts with correlated expression between liver and lung • General-purpose function GO associations of transcripts with correlated expression between liver and lung • General-purpose function for testing GO associations – genes is a vector of the transcript names under consideration – classification is a boolean vector the same length as genes, indicating which transcripts are in the classification – go is a matrix where go[transcript, Goterm] = TRUE if Goterm is attached to transcript – go 2 name is a data frame with the descriptions of the Go terms > GO. analysis <- function( genes, classification, go 2 name ) { intersect <- unique(sort(genes[match(go$transcript, genes, nomatch=0)])) go. intersect <- go[match(intersect, go$transcript), ] classification. intersect <- classification[match(classification, intersect, nomatch=0)] class. idx <- go. intersect$transcript %in% genes[classification] fish <- apply( go. intersect[, 4: ncol(go. intersect)], 2, fisher. func, class. idx) fish <- fish[names(fish) %in% go 2 name$go] r <- order(fish) return( data. frame( p. value=fish 2, desc=go 2 name$desc[match(go 2 name$go, names(fish 2[r]), nomatch=0)])) }

Many very significant GO terms > fish <- GO. analysis( colnames(lung. liver. exp), lung. Many very significant GO terms > fish <- GO. analysis( colnames(lung. liver. exp), lung. liver. cor. log. P> 4, go 1, go 2 name ) > head(fish$pvalue, n=20) GO. 0001584 2. 577931 e-18 GO. 0043227 2. 572202 e-12 GO. 0005737 2. 648800 e-08 GO. 0005634 4. 469984 e-07 GO. 0004930 GO. 0004888 GO. 0004984 GO. 0007186 GO. 0043231 3. 011990 e-17 3. 477225 e-14 2. 396701 e-13 1. 138140 e-12 1. 838885 e-12 GO. 0044424 GO. 0005622 GO. 0007166 GO. 0043229 GO. 0043226 1. 000009 e-11 2. 866695 e-11 6. 100367 e-10 6. 792908 e-10 6. 806115 e-10 GO. 0004872 GO. 0004871 GO. 0044444 GO. 0007582 GO. 0043170 3. 289317 e-08 5. 339634 e-08 7. 805342 e-08 9. 355738 e-08 1. 049470 e-07 GO. 0050875 4. 844296 e-07

Biological Processes coordinated between liver and lung • • • • GO. 0001584 GO. Biological Processes coordinated between liver and lung • • • • GO. 0001584 GO. 0004930 GO. 0004888 GO. 0004984 GO. 0007186 GO. 0043231 GO. 0043227 GO. 0044424 GO. 0005622 GO. 0007166 GO. 0043229 GO. 0043226 GO. 0005737 GO. 0004872 GO. 0004871 2. 577931 e-18 3. 011990 e-17 3. 477225 e-14 2. 396701 e-13 1. 138140 e-12 1. 838885 e-12 2. 572202 e-12 1. 000009 e-11 2. 866695 e-11 6. 100367 e-10 6. 792908 e-10 6. 806115 e-10 2. 648800 e-08 3. 289317 e-08 5. 339634 e-08 GO. 0001584 bud site selection GO. 0004930 activation of MAPKK activity during cell wall biogenesis GO. 0004888 C-8 sterol isomerase activity GO. 0004984 protein import into nucleus, translocation GO. 0007186 shmoo orientation GO. 0043231 nuclear interphase chromosome GO. 0043227 isoleucine/valine: sodium symporter activity GO. 0044424 cytoplasmic interphase chromosome GO. 0005622 spliceosomal catalysis GO. 0007166 inactivation of MAPK (pseudohyphal growth) GO. 0043229 second spliceosomal transesterification activity GO. 0043226 mitochondrion inheritance GO. 0005737 re-entry into mitotic cell cycle GO. 0004872 sulfite transport GO. 0004871 organellar small ribosomal subunit >

Mapping expression Quantitative Trait Loci (e. QTL) • e. QTL are Genetic loci at Mapping expression Quantitative Trait Loci (e. QTL) • e. QTL are Genetic loci at which genotype variation is associated with variation in a transcript cis e. QTL co-localise with the correaponding gene’s position trans e. QTL map to other locations in the genome Many transcripts have cis e. QTL as well as trans e. QTL hubs are loci that contain many trans e. QTL, indicating sets of genes under the same control – Basic principle of mapping any quantitative trait is to determine the association between the phenotype and each genotyped SNP – anova(lm( transcript ~ genotype)) – –

cis e. QTLs on mouse Chr 19 cis e. QTLs on mouse Chr 19

 Tissue Hippocampus Liver Lung without SNPs Hippocampus Liver Lung with SNPs cis 1996 Tissue Hippocampus Liver Lung without SNPs Hippocampus Liver Lung with SNPs cis 1996 408 3442 733 209 424 trans 2774 3411 911 368 426 450

A simple e. QTL mapping programme Data: transcript is a matrix of transcript levels A simple e. QTL mapping programme Data: transcript is a matrix of transcript levels (columns), genotypes is a matrix of genotypes (columns) The rows are the subjects, which must be identical across the two matrices >load(“g 19. RData”) >load(“liver. 19. exp. RData”) > map. eqtl <- function( transcript, genotypes ) { log. P = -log 10(apply( genotypes, 2, function( X, transcript ) { X <- as. factor(X) ; if ( nlevels(X) > 1 ) { a <- anova(lm( transcript ~ X )); return(a[1, 5])} else return(1)}, transcript)) return(log. P) } > map. eqtls <- function( genotypes, transcripts ) { genos <- data. frame(genotypes[, 2: ncol(genotypes)]) log. P <- apply( transcripts, 2, map. eqtl, genos ) rownames(log. P) <- colnames(genos) colnames(log. P) <- colnames(transcripts) return(log. P) } log. P <- map. eqtls( g 19, liver. 19. exp[, c(20346, 29095)]) # map transcripts 29095 = "LIVER. express. scl 074170. 1_145. S" 20346 = "LIVER. express. scl 0023972. 1_63. S"

Example chromosome scans LIVER. express. scl 074170. 1_145. S LIVER. express. scl 0023972. 1_63. Example chromosome scans LIVER. express. scl 074170. 1_145. S LIVER. express. scl 0023972. 1_63. S

Different Types of Association Analysis • • SNP genotypes G take three possible values Different Types of Association Analysis • • SNP genotypes G take three possible values AA, AB, BB Under an additive genetic model, the phenotypic effect of a genotype is proportional to the number of A alleles (or equivalently B alleles) – we recode the genotype as a number X <- additive(G) • BB = 0, • AB = 1 • AA = 2 • Under a dominant genetic model, the phenotypic effect is zero unless an A allele is present (or vice-versa) – we recode the genotype as a number X <- dominance(G) • BB = 0, • AB = 1 • AA = 1 • • Under a full genetic mode, the phenotypic effect of each genotype is arbitrary Note that a full model is the sum of the additive and dominance models – phenotype ~ G – phenotype ~ additive(G) + dominance(G)

R functions for converting full genotypes to submodels additive <- function( G ) { R functions for converting full genotypes to submodels additive <- function( G ) { G <- as. factor(G); G <- reorder( G, sort(levels(G)) return(as. numeric(G) } dominance <- function(G) { a <- additive(G) return(ifelse(a>0, 1, 0)) }

Fast Gene-expression analysis and Permutation in Linear Models – M computed usingular value techniques Fast Gene-expression analysis and Permutation in Linear Models – M computed usingular value techniques – ANOVA F statistics can be calculated from • computing M is slow, but need only be done once • permutation test: – permute y, keep X and M fixed • Gene expression: – change y, keep X, M fixed