Contacts
• Data sets are available at: http://www.bioinformatica.unito.it/bioinformatics/DAGEL.II/
• Post-workshop support: – support on the data analysis course will be supplied by Prof. Calogero; send an email to raffaele.[email protected] it containing in the Subject: Post DW support

Workshop topics
• We will analyse the following topics:
– Experimental design.
– Quality controls.
– Data filtering.
– Statistical inference of differential expression.
– Annotation.
– Assessing the biological meaning of differential expression.

Analysis pipe-line: Sample Preparation + Array Fabrication → Hybridization → Scanning + Image Analysis (platform specific devices) → Quality control → Filtering → Normalization → Statistical analysis → Annotation → Biological Knowledge extraction (Bioconductor).

Open source software
• Bioconductor is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.
• TMEV 4.0 is an application that allows the viewing of processed microarray slide representations and the identification of genes and expression patterns of interest. A variety of normalization algorithms and clustering analyses allow the user flexibility in creating meaningful views of the expression data.

Analysis of microarray
• Bioconductor (www.bioconductor.org) is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.
• Advantages: – highly dynamic, – free of charge, – introductory and advanced courses are available in Europe/USA every year.
• Disadvantages: – limited graphical interface, – limited documentation.
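As a point of reference for the hands-on sessions, the sketch below shows how the package used in this workshop could be installed and loaded from the R prompt. The biocLite() bootstrap was the Bioconductor installation route when this material was written; current Bioconductor releases use BiocManager::install() instead, so treat the exact commands as an illustrative assumption rather than a prescribed step.

  # Install oneChannelGUI and its dependencies from Bioconductor (historical biocLite route)
  source("http://bioconductor.org/biocLite.R")
  biocLite("oneChannelGUI")

  # Loading the package starts the affylmGUI/oneChannelGUI graphical interface
  library(oneChannelGUI)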

oneChannelGUI
• This is a graphical interface to Bioconductor libraries devoted to the analysis of data derived from single channel platforms.
• affylmGUI is a graphical interface to the limma library, which allows differential expression detection by means of linear model analysis.
• oneChannelGUI is an extension of affylmGUI capabilities.

oneChannelGUI
• 3’ IVT arrays (e.g. HGU133 Plus 2): – Primary analysis (probe level QC, probe set summary and normalization), secondary analysis (replicates QC, filtering, statistical analysis, classification) and data mining (GO enrichment).
• Exon arrays: – Secondary analysis (replicates QC, filtering, statistical analysis, classification, basic Splice Index inspection) using Expression Console as the source of primary data.
• Large data sets (i.e. probe set expression in tab delimited format): – Secondary analysis (replicates QC, filtering, statistical analysis, classification) using Expression Console/GEO/ArrayExpress data as the source of primary data.

Starting R and oneChannelGUI
Setting the virtual RAM at 2 GB: C:\..\R\R-2.3.0\bin\Rgui.exe --max-mem-size=2048M

Double click on “R” to start.

Click on “Packages” to load Bioconductor packages.

Click on “Load package” to select the oneChannelGUI package, then click on “OK” to load it.

Click on “Yes” to start the affylmGUI interface. Wait a few seconds! Then click on “Yes” to start the oneChannelGUI interface.

Standard affylmGUI menu: overlaying oneChannelGUI on affylmGUI will change the default affylmGUI menu to the oneChannelGUI menu for 3’IVT Affymetrix arrays.

oneChannelGUI
• Analysis steps for 3’IVT arrays:
– Loading the .CEL files to be analyzed
– Array quality control: • Raw data plots • Robust probe-level model library (affyPLM): NUSE plot, RLE plot

Summary of loaded data: none is available since no .CEL files have been loaded.

Click on “File”, then on “New” to start a new project. Select 3’IVT arrays and select as working dir the folder containing the .CEL files.

Select the “targets” file, then press OK to continue. The targets file is a tab delimited text file containing the description of the experiment. It is made of three columns: Name: the name you want to assign to each array. FileName: the name of the corresponding .CEL file. Target: the experimental condition associated to the array (e.g. mock, treated, etc). At least two conditions should be present.

Define the name of your analysis and press OK to continue. … Now the arrays will be loaded in a specific R object called an environment. Raw data are now loaded and ready for normalization.

Exercise 1 (15 minutes)
• Data set for this exercise: – spugnini.et.al: 7 CEL files were generated for two prototypic situations (biological replicas):
– C, untreated MSTO-211H cells: • baldi.c1_b.CEL, baldi.c2.CEL, baldi.c3.CEL
– T, MSTO-211H cells treated with 780 μM piroxicam for 48 hours: • baldi.f1_b.CEL, baldi.f2.CEL, baldi.f3.CEL, baldi.f4.CEL
See next page

Exercise 1
• Go in the folder spugnini.et.al.
• Create, with Excel, a tab delimited file named targets.txt: – The targets file is made of three columns with the following header: • Name • FileName • Target – In column Name place a brief name (e.g. c1, c2, etc) – In column FileName place the name of the corresponding .CEL file – In column Target place the experimental conditions (e.g. control, treatment, etc)
See next page
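For orientation only, a targets.txt laid out for the .CEL files listed on the previous page might look like the sketch below (columns separated by tabs; the short names in the Name column are illustrative choices, not prescribed ones):

  Name	FileName	Target
  c1	baldi.c1_b.CEL	control
  c2	baldi.c2.CEL	control
  c3	baldi.c3.CEL	control
  t1	baldi.f1_b.CEL	treatment
  t2	baldi.f2.CEL	treatment
  t3	baldi.f3.CEL	treatment
  t4	baldi.f4.CEL	treatment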

Exercise 1
• Open R
• Load oneChannelGUI
• Start a new project: – Change the working dir to spugnini.et.al – Load the targets.txt file (previously created) – Set as project name: ex1

Analysis pipe-line: Quality control → Normalization → Filtering → Statistical analysis → Annotation → Biological Knowledge extraction.

The next steps are a few simple basic quality controls. Click on the Quality Control menu.

You can now evaluate the intensity histogram for one array at a time.

You can now evaluate the intensity density plot for one array at a time.

You can now evaluate all array intensities as box plots.

It is possible that the cRNA concentration in sample sE2 was over-estimated and a low cRNA amount was loaded on the array. As a result a lot of signals are below the value 100 [log2(100) = 6.64].

Some other basic controls can be done after the calculation of the probe set intensity summary using a special Bioconductor library, affyPLM. Fit the model (BE PATIENT!!!): the end of the fitting procedure is signalled by a message, then the NUSE/RLE function is automatically called.

affyPLM QC library
• affyPLM provides a number of useful tools based on probe-level modelling procedures.
• The affyPLM package allows array quality controls.

What is a Probe Level Model?
• A Probe Level Model (PLM) is a model that is fit to probe-intensity data.
• affyPLM fits a model with probe level and chip level parameters on a probe set by probe set basis.
• In quality control the chip level parameters are a factor variable with a level for each array.

What is a PLMset?
• The main function for fitting a PLM is fitPLM.
• This function will fit a linear model with an effect estimated for each chip and an effect for each probe.
• fitPLM implements iteratively re-weighted least squares M-estimation regression.
• The fitted model is stored in a PLMset object containing the chip level parameter estimates and the corresponding standard errors.
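For readers who prefer the command line, the QC steps the GUI drives can be sketched directly with the affy and affyPLM packages; the snippet below is a minimal illustration, assuming the .CEL files sit in the current working directory and using illustrative object names.

  library(affy)      # ReadAffy() for loading .CEL files
  library(affyPLM)   # fitPLM(), NUSE(), RLE()

  raw  <- ReadAffy()         # read all .CEL files in the working directory
  pset <- fitPLM(raw)        # fit the robust probe-level model (can take a while)

  NUSE(pset)                 # boxplots of Normalized Unscaled Standard Errors
  RLE(pset)                  # boxplots of Relative Log Expression

  rm(pset); gc()             # the PLMset can be large: remove it after quality control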

Default fitted model
• log2(y_kij) = θ_kj + φ_ki + ε_kij, where θ_kj is the log2 probe set expression value of probe set k on array j and the φ_ki are probe effects.
• To make the model identifiable the constraint Σ_i φ_ki = 0 is used.
• For this default model, the parameter estimates given are probe set expression values.

Relative Log Expression (RLE)
• RLE values are computed for each probe set by comparing the expression value on each array against the median expression value for that probe set across all arrays.
• Assuming that most genes are not changing in expression across arrays, ideally most of these RLE values will be near 0.
• Boxplots of these values, one for each array, provide a quality assessment tool.
• RLE plots: – Estimate the expression θ̂_gi for each gene g on each array i. – Compute the median value across arrays for each gene.
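Putting the two bullets together, the quantity plotted for gene g on array i can be written (in the notation of the probe-level model above) as:

  \mathrm{RLE}_{gi} = \hat{\theta}_{gi} - \operatorname{median}_{j}\,\hat{\theta}_{gj}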

Relative Log Expression (RLE)

Normalized Unscaled Standard Errors (NUSE)
• The standard error measures the uncertainty of the fitted expression value for each probe set on each array.
• Normalized Unscaled Standard Errors (NUSE) can also be used for assessing quality.
• The standard error estimates obtained for each gene on each array from fitPLM are standardized across arrays so that the median standard error for that gene is 1 across all arrays: NUSE_gi = SE(θ̂_gi) / median_j SE(θ̂_gj).
• This process accounts for differences in variability between genes.
• An array where the SEs are elevated relative to the other arrays is typically of lower quality.
• Boxplots of these values, separated by array, can be used to compare arrays.

Since the fitPLM object can be very big, it is a good idea to delete it after quality control. (Screenshots: before and after Delete PLM.)

Exercise 2 (20 minutes)
• Starting from the data set you have loaded: – check the raw data with the available plots – analyze the data using NUSE and RLE plots.
• Answer the following questions: – Is there any array characterized by a very narrow probe intensity distribution? • YES / NO – Is there any array which is significantly different from the others by means of NUSE or RLE plots? • YES / NO
See next page

Exercise 2
• If any array has a NUSE very different from the others, create a new target file without it and load the remaining .CEL files again.

Analysis pipe-line: Quality control → Normalization → Filtering → Statistical analysis → Annotation → Biological Knowledge extraction.

affylmGUI
• Analysis steps: – Calculating probe set summaries: • RMA • GCRMA • PLIER – Normalization: • Quantile method

Brief summary about probe set intensity calculation
• The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take into account unspecific probe hybridization in the probe set background calculation.
• GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).
• The PLIER (Probe Logarithmic Intensity Error) method produces an improved signal by accounting for experimentally observed patterns in probe behavior and handling error appropriately at low and high signal values.
• Methods such as PLIER+16 and GCRMA, which use model-based background correction, maintain relatively good accuracy without losing much precision.
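For reference, the same probe set summaries can be produced outside the GUI with the corresponding Bioconductor functions. The sketch below assumes the AffyBatch object raw from the earlier affyPLM example; PLIER is available through the separate plier package and is not shown here.

  library(affy)     # rma()
  library(gcrma)    # gcrma()

  eset.rma   <- rma(raw)     # RMA: background correction + quantile normalization + median polish
  eset.gcrma <- gcrma(raw)   # GCRMA: sequence-based background correction, then quantile normalization

  head(exprs(eset.rma))      # log2 probe set expression matrix (probe sets x arrays)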

Yeast DNA on human chips
• G, C stretch length performs better than G + C content.
• The distribution of the 4 G-C strata in the non-specific sample is well approximated by a log-normal distribution.

Observed log (base 2) expression versus nominal log concentration (in picomolar).

Why Normalization?
• To remove systematic biases, which include: – Sample preparation – Variability in hybridization – Spatial effects – Scanner settings – Experimenter bias
Extracted from the D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray

What Normalization Is & What It Isn’t
• Methods and algorithms applied after some image analysis
• Applied before subsequent data analysis
• Allows comparison of experiments
• Not a cure for poor data

Where Normalization Fits In: Sample Preparation + Array Fabrication → Hybridization → Scanning + Image Analysis (spot location, assignment of intensities, background correction, etc.) → Normalization → Data Analysis (subsequent analysis, e.g. clustering, uncovering genetic networks).

Quantile normalization. Extracted from the Irizarry presentation at the Bioconductor Course (Brixen, IT, 2005).

Extracted from the Irizarry presentation at the Bioconductor Course (Brixen, IT, 2005).

Extracted from the Irizarry presentation at the Bioconductor Course (Brixen, IT, 2005).

Quantile normalized. Extracted from the Irizarry presentation at the Bioconductor Course (Brixen, IT, 2005).

M-A Plots
• The M-A plot is a 45° rotation of the standard scatter plot of log R versus log G.
• M = log R – log G (M = Minus); A = ½ [log R + log G] (A = Add).
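As a concrete illustration, the rotation can be computed directly in R; R and G below are hypothetical vectors holding the red and green channel intensities of the same spots on one array.

  M <- log2(R) - log2(G)           # log ratio: "Minus"
  A <- 0.5 * (log2(R) + log2(G))   # average log intensity: "Add"
  plot(A, M, pch = ".", main = "M-A plot")
  abline(h = 0, lty = 2)           # unchanged genes scatter around M = 0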

The next step is normalization and calculation of the probe set summary. Click on the probe set menu and select the probe set summary and normalization option.

Normalization and intensity calculation come together. Three normalization/intensity calculation options are available: – RMA + quantile normalization – GCRMA + quantile normalization – PLIER + quantile normalization. At any time it is possible to check the structure of the normalized data set.

Exercise 3a (20 minutes)
• Data set for the following exercises:
• estrogenIGF1: 18 CEL files were generated for 6 prototypic situations (biological replicas):
– MCF7 ctrl, untreated MCF7: • M1.CEL, M4.CEL, M7.CEL
– MCF7 E2, MCF7 treated for 3 hours with Estrogen: • M3.CEL, M6.CEL, M9.CEL
– MCF7 IGF, MCF7 treated for 3 hours with IGF1: • M2.CEL, M5.CEL, M8.CEL
– SKER3 ctrl, untreated SKER3: • s1.CEL, s4.CEL, s7.CEL
– SKER3 E2, SKER3 treated for 3 hours with Estrogen: • s3.CEL, s6.CEL, s9.CEL
– SKER3 IGF, SKER3 treated for 3 hours with IGF1: • s2.CEL, s5.CEL, s8.CEL
See next page

Exercise 3a
• Go in the folder estrogenIGF1.
• Create, with Excel, a tab delimited file named targets.txt: – The targets file is made of three columns with the following header: • Name • FileName • Target – In column Name place a brief name (e.g. c1, c2, etc) – In column FileName place the name of the corresponding .CEL file – In column Target place the experimental conditions (e.g. control, treatment, etc)
• Create a target only for MCF7 with/without IGF1 treatment
See next page

Exercise 3a
• Calculate probe set summaries with GCRMA for the estrogenIGF1 set.
• Save the limma project as ex3.
See next page

Replicates quality control
• To evaluate sample replicate quality we will use a dimension-reduction technique called Principal Component Analysis (PCA).

Principal component analysis
• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
• The first principal component accounts for as much of the variability in the data as possible.
• Each succeeding component accounts for as much of the remaining variability as possible.
• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
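A minimal sketch of this replicate check outside the GUI, assuming a normalized ExpressionSet called eset (for instance the output of the rma()/gcrma() example above); each array is projected on the first two principal components.

  library(Biobase)                          # exprs(), sampleNames()

  pca <- prcomp(t(exprs(eset)))             # samples as rows, probe sets as variables
  plot(pca$x[, 1], pca$x[, 2], pch = 19,
       xlab = "PC1", ylab = "PC2")
  text(pca$x[, 1], pca$x[, 2],
       labels = sampleNames(eset), pos = 3) # label each array to spot outlying replicates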

(PCA plot: second principal component, PCA2, versus first principal component, PCA1.)

The 2nd PC will be orthogonal to the 1st. In general the first three components account for nearly all the variability; therefore, PCA can be reasonably represented in a 3D space.

To perform sample replicates QC we use principal component analysis (PCA). This check is performed on probe set summaries! t4 is clearly an outlier!

Exercise 3b
• Use the estrogenIGF1 data set to evaluate the quality of the replicas by means of PCA analysis.
• Questions: – Are CTRL and TRT well separated? • YES / NO – Are the replicates homogeneous? • YES / NO

Analysis pipe-line: Quality control → Normalization → Filtering → Statistical analysis → Annotation → Biological Knowledge extraction.

Filtering
• Filtering affects the false discovery rate.
• The researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset.
• If the truly differentially expressed genes are over-represented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered due to the filtering.
Extracted from: Heydebreck et al., Bioconductor Project Working Papers, 2004

Filtering can be performed at various levels:
• Annotation features: – Specific gene features (i.e. GO term, presence of transcriptional regulative elements in promoters, etc.)
• Signal features: – % of intensities greater than a user defined value – Interquartile range (IQR) greater than a defined value
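A sketch of the two signal-feature filters using the Bioconductor genefilter package, assuming the normalized ExpressionSet eset from the earlier examples; the thresholds shown (intensity above log2(100) in at least 25% of the arrays, IQR above 0.25) are the illustrative values used later in the course.

  library(genefilter)

  f1 <- pOverA(p = 0.25, A = log2(100))            # >= 25% of arrays above log2(100)
  f2 <- function(x) IQR(x) > 0.25                  # interquartile range above 0.25
  keep <- genefilter(exprs(eset), filterfun(f1, f2))

  eset.filt <- eset[keep, ]                        # filtered ExpressionSet
  sum(keep)                                        # number of probe sets retained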

Specific gene feature
• In transcriptional studies focusing on genes characterized by a specific feature (i.e. transcription factor elements in promoters) the best filtering approach is selecting only those genes linked to the “peculiar feature”.
• For example, identification of genes modulated by estradiol (ER) or IGF1 by direct binding to Estrogen-Responsive Elements (ERE):
• HGU133 Plus 2: – 54675 probe sets, 19951 Entrez Genes
• HGU133 Plus 2 with ERE in putative promoter regions: – 6764 probe sets, 3058 Entrez Genes

Specific gene feature
• Data derived from specifically devoted annotation data sets can be used for functional filtering.
• The Ingenuity Pathways Knowledge Base is the world's largest curated database of biological networks created from millions of individually modeled relationships between: – proteins, genes, complexes, cells, tissues, drugs, diseases.
• The Ingenuity Pathways Analysis software (IPA) identifies relations between genes.
• The relations that can be grasped are: – Regulates – Regulated by – Binds

Start an Ingenuity session at: https://analysis.ingenuity.com/pa/login.jsp

Specific classes of proteins can be searched and exported.

A key word can also be used to perform a wide search.

After selection of the Functions & Diseases of interest, the genes should be visualized as gene details before being exported to a file to be used for filtering expression data. Results are exported in a table as previously.

The Entrez Gene IDs present in this file can be used to extract a specific subset of genes. To filter using a list of EGs you need to extract from the IPA table only the Entrez genes of interest and save them in a text file without a header.

Exercise 4 (30 minutes)
• Export from Ingenuity a table related to transcription factors.
• Create a file containing only the Entrez genes, without header, and use it to filter the data set of exercise 3.
• Save the affylm project as ex4.TF.lma
• Inspect the sample PCA: – Are CTRL and TRT well separated? • YES / NO

Non-specific filtering
• This technique has as its premise the removal of genes that are deemed to be not expressed or unchanged according to some specific criterion that is under the control of the user.
• The aim of non-specific filtering is to remove genes that, e.g. due to their low overall intensity or variability, are unlikely to carry information about the phenotypes under investigation.
Extracted from: Heydebreck et al., Bioconductor Project Working Papers, 2004

Intensity distributions of background-level probe sets under RMA and GCRMA.

How to define the efficacy of a filtering procedure?
• The enrichment is very similar to that used to evaluate the purification fold of a protein after a chromatographic step.

Filtering by genefilter pOverA (keep a probe set if ≥ 25% of the arrays have intensities ≥ log2(100)): before filtering, 22300 probe sets with 42/42 spike-ins (enrichment: 100%); after filtering, 5553 probe sets with 42/42 spike-ins (enrichment: 401%).

Filtering by InterQuartile Range (IQR): the range between the 25% and 75% quantiles of a probe set's intensities.

How does filtering by genefilter IQR work? The distribution of all intensity values of a differential expression experiment is the summary of the distributions of each gene's expression over the experimental conditions.

How does filtering by IQR work? The filter removes genes that show little change across the experimental points.

Filtering by genefilter IQR (removing probe sets whose intensity IQR is below 0.25 or 0.5): before filtering, 22300 probe sets with 42/42 spike-ins (enrichment: 100%); IQR 0.25: 244 probe sets with 42/42 spike-ins (enrichment: 9139%); IQR 0.5: 68 probe sets with 42/42 spike-ins (enrichment: 32794%).

In this example only those genes will be selected that have an intensity ≥ 100 in at least 50% of the arrays.

Exercise 5 (20 minutes)
• Use the estrogenIGF1 data normalized with GCRMA (ex3).
• Perform an IQR filter at 0.25 followed by an intensity filter requiring an intensity over 100 in 50% of the arrays.
• Export the data as a tab delimited file.
• Questions: – How many probe sets are left after the first and the second filter? • …………………………… – Which changes are observable in the probe set distribution after the first and the second filter? • ………………………………………
• Save the affylm project as ex5.lma.
• Apply the same filtering to ex4.TF.lma
• Question: – How many probe sets are left after the first and the second filter?

Expression Console® (Affymetrix person)
• Presentation of Expression Console®: – General structure – Loading exon libraries – QC for raw data – Exporting gene level and exon level analysis

QC and filtering for exon data
• At the time oneChannelGUI was set up, Bioconductor tools for handling raw data from Affymetrix exon arrays were not available.
• For this reason oneChannelGUI uses the libraries and primary analysis outputs from Affymetrix Expression Console®.
• Exon raw data quality control is performed using the Expression Console®.
• Sample QC and filtering are performed in oneChannelGUI.

Exon arrays on oneChannelGUI
• In oneChannelGUI, gene level and exon level data from Expression Console® are loaded.
• The user needs to specify where the Expression Console® library files are located each time a new exon data set is loaded.

When loading an exon array data set it is necessary to indicate the organism, which kind of exon data are going to be loaded (core, extended, full), and the location of the Expression Console libraries.

Subsequently three files have to be loaded: the target file, which has the same structure previously described, and the tab delimited files containing the GENE and EXON level data exported from the Expression Console.

A new menu is then available for exon data.

Exon array QC on oneChannelGUI

The brain (b) replicates are very poor. The quality is particularly bad for exon data. However, we have to consider that these data are derived from tissues coming from different post-mortem donors.

Exon array filtering
Since the knowledge on exon data is still relatively limited, we have little empirical information about the background threshold. Exon/intron housekeeping gene information available in exon data might be a possible approach to define it. Different colour lines indicate the possible thresholds to be selected. The intensity density plots for introns are shown in black, those for exons in red.

The IQR filter works as described for 3’IVT arrays. However, any filter done at gene level will also affect the corresponding exon data (starting condition / after filtering).

The intensity filter is instead based on the threshold previously selected on the basis of the exon/intron HK expression signals. In this example we keep only the genes where all samples have a signal greater than the pre-defined BG.

Exercise 6 (30 minutes)
• Use breast and cerebellum Affymetrix exon data.
• Calculate, with Expression Console, gene/exon level expression for the CORE subset.
• Evaluate the sample quality using gene/exon PCA.
• Question: – Are the replicates homogeneous? – YES / NO
• Define a BG threshold using the intron information.
• Perform an IQR filter at 0.25 followed by an intensity filter at a threshold of your choice.
• Save the affylm project as ex6.lma.

Splice Index
• The Splicing Index captures the basic metric for the analysis of alternative splicing.
• It is a measure of how much exon specific expression (with gene induction factored out) differs between two samples.
Defining a function-oriented data set for splice index calculation.
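For reference (the formula itself is not reproduced on the slide), the usual Affymetrix formulation first normalizes each exon signal by its gene-level signal and then takes the log ratio of these normalized intensities between the two conditions:

  NI_{e,j} = \frac{\text{exon}_{e,j}}{\text{gene}_{g,j}}, \qquad SI_{e} = \log_2\!\left(\frac{NI_{e,1}}{NI_{e,2}}\right)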

Use a set of function-oriented EGs to select probe set IDs.

Use the selected probe set IDs for “Filtering using a list of probe sets”.

ATTENTION: this is only a very rough descriptive instrument! Much work still needs to be done on exon analysis! Splice Index inspection is performed by modelling the splice index exon profiles for the two experimental conditions. Results are saved in a pdf file in your working dir.

The subset of splice indexes to be inspected is defined using two filters:

Example of the output for one gene.

This plot gives some advice about the scattering levels of the Splice Indexes over the gene under analysis. The splice indexes are modelled over the two experimental conditions; red dashed lines indicate the confidence interval of the model.

Plots of the significance p-value of the alternative splicing versus the average Splice Index values. In this example only one exon seems to be differentially spliced: exon 14.

Significance p-value of the alternative splicing versus the average Splice Index values. In this example only one exon seems to be differentially spliced. Filtering conditions are shown over the plot of intensity values versus exon number.

Exercise 7 (30 minutes)
• Use ex6.lma.
• Perform a filtering on the basis of the probe sets related to TFs: – Create in NetAffx a Transcript Cluster view containing only Transcript Cluster IDs – The gene level probe set list is created by interrogating NetAffx with the EGs used in exercise 4
• Calculate the Splice Index on the data subset.
• Save the affylm project as ex7.lma.
• Filter for p-value ≤ 0.01 and mean average Splice Index differences ≥ 2.
• Inspect the output plots.

Analysis pipe-line: Quality control → Normalization → Filtering → Statistical analysis → Annotation → Biological Knowledge extraction. This step is the same for 3’IVT arrays and for exon array gene level analyses.

Log2(T/C) is frequently used to evaluate fold change variation. (Figure: bar plot of log2(t/c) for a set of example probes, together with the raw t/c ratios (e.g. 200/100, 400/100, 800/100, 1600/100, 32000/100), showing that down-regulation is compressed on the t/c scale while the log2 scale treats up- and down-regulation symmetrically.)

Statistical analysis
• Intensity changes between experimental groups (i.e. control versus treated) are known as fold change.
• Ranking genes based on fold change alone implicitly assigns equal variance to every gene.
• Fold change alone is not sufficient to indicate the significance of the expression changes.
• Fold change has to be supported by statistical information.

Multiple testing errors
• When performing multiple statistical tests, two types of errors can occur: – Type I error (false positive) – Type II error (false negative)
• Reducing type I errors increases the number of type II errors.
• It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).

The multiple tests problem
• As the number of tests increases, the tails of the distribution become more and more populated.

Type I error correction
• Null hypothesis (H0): the mean of treated and the mean of control for a gene i belong to the same distribution.
• Type I error: H0 is rejected although it is true (false positive).
• Šidák significance point: K(g, α) = 1 − (1 − α)^(1/g), where α is the acceptance level (e.g. 0.05) and g is the number of independent tests.
• If a p-value is lower than K(g, α) the corresponding H0 is rejected; all the remaining H0 are considered true.

Type I error correction (FWER)
• Ordered p-values of the differentially expressed genes: < 10⁻⁶, < 10⁻⁶, 2×10⁻⁵, 0.047, …
• Step-down Šidák thresholds: 1 − (1 − 0.05)^(1/5) = 0.0102; 1 − (1 − 0.05)^(1/4) = 0.0127; 1 − (1 − 0.05)^(1/3) = 0.0170; 1 − (1 − 0.05)^(1/2) = 0.0253.

Statistical analysis
• The sensitivity of statistical tests is affected by the number of available replicates.
• Replicates can be: – Technical – Biological
• Biological replicates better summarize the variability of samples belonging to a common group.
• The minimum number of replicates is an important issue!

How important are replicates? Yang YH and Speed T, 2002

Sample size
• Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates.
• The issue of how many replicates are required in a typical experimental system needs to be addressed.
• Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).

Assessing sample sizes in microarray experiments
• The R package sizepower is used to calculate sample size and power in the planning stage of a microarray study.
• It helps the user to determine how many samples are needed to achieve a specified power for a test of whether a gene is differentially expressed or, in reverse, to determine the power of a given sample size.
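As a back-of-the-envelope illustration of the same question with base R (this is not the sizepower interface, only the underlying single-gene power calculation; the effect size, standard deviation and error rates below are made-up numbers):

  # How many replicates per group to detect a log2 fold change of 1
  # with per-gene SD 0.7, per-gene alpha 0.001 and 80% power?
  power.t.test(delta = 1, sd = 0.7, sig.level = 0.001, power = 0.8)

  # In reverse: what power do 3 replicates per group give at the same settings?
  power.t.test(n = 3, delta = 1, sd = 0.7, sig.level = 0.001)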

Exercise 8 (20 minutes)
• Use the breast/cerebellum filtered exon data (ex6.lma).
• Using the sizepower implementation: – check the sample size and the experimental power, selecting various thresholds of false positives and log2(FC).

Comments about experimental design
• If the biological material is not a limiting factor, “THINK WIDE”.

Selecting differentially expressed genes. (Diagram: overlap between statistical validation methods II and III; differential expression linked to a specific biological event.)

Selecting differentially expressed genes
• Each method grasps some true signals but not all.
• Each method catches some false signals.
• The trick is to find the best condition to maximize true signals while minimizing fakes.

(Figure: two distributions, population Ctrl with mean C and population Trtd with mean T, and a sample mean “s”.) There is less than a 5% chance that the sample with mean s came from population C, i.e., s is significantly different from “mean C” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population T.

• The t-statistic is widespread in assessing differential expression.
• The unstable variance estimates that arise when the sample size is small can be corrected using: – an error fudge factor (SAM) – Bayesian methods (limma)

SAM: Significance Analysis of Microarrays

SAM (Significance Analysis of Microarrays) (Tusher et al. 2001): a fudge factor regularizes the t-statistic by inflating the denominator; s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
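For reference, the regularized statistic of Tusher et al. (2001), not reproduced on the slide, has the form

  d(i) = \frac{\bar{x}_T(i) - \bar{x}_C(i)}{s(i) + s_0}

where \bar{x}_T(i) and \bar{x}_C(i) are the mean expression of gene i in the two groups, s(i) is the pooled standard deviation and s_0 is the fudge factor.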

Some SAM designs
• Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to a between-subjects t-test).
• Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to a paired t-test).
• Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA).
• Others: ………………………………

• SAM uses permutations of the sample labels (e.g. N N N T T T → N T N T N T, T T N T N T, …) to define the set of significantly differentially expressed genes.

The FDR is given by π0 × False / Called, where π0 is the prior probability that a gene is not differentially expressed.

How does SAM calculate the False Discovery Rate for a specific delta? (Figure: for each permutation of the sample labels the number of genes called at the chosen delta is counted; the mean over the permutations gives the mean number of false calls.)

A SAM analysis can be performed in Bioconductor using the siggenes library. Two-class or multi-class analysis is selected automatically according to the structure of the Target information. The delta table shows the user the number of differentially expressed genes obtained for a given FDR.
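The GUI wraps the siggenes package; a minimal command-line equivalent is sketched below, assuming the filtered ExpressionSet eset.filt from the filtering sketch and an illustrative two-class label vector derived from the Target column (0 = control, 1 = treated).

  library(siggenes)

  cl <- c(0, 0, 0, 1, 1, 1)                        # one label per array (illustrative)
  sam.out <- sam(exprs(eset.filt), cl, rand = 123) # two-class unpaired SAM analysis

  print(sam.out)            # delta table: significant genes and estimated FDR per delta
  plot(sam.out, 1.2)        # SAM plot at a chosen delta (1.2 is illustrative)
  summary(sam.out, 1.2)     # genes called significant at that delta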

The user selects a delta value and checks the behaviour of the differentially expressed genes.

The user selects a delta value and checks the behaviour of the differentially expressed genes.

Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes.

Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes.

The table can be saved in a tab delimited file.

The result table reports: the relative difference in gene expression, the standard deviation, the raw p-value, the fold change, and a significance measurement derived from the raw p-value.

Exercise 9 (20 minutes)
• Use the estrogenIGF1 data set.
• Perform a two-class analysis on ctrl versus IGF1 treatment, after IQR and/or intensity filtering.
• Select FDR ≤ 0.1 and |log2(FC)| ≥ 1.
• Save the results in ex9.xls



