Presented by John Quackenbush, Ph.D., at the June 10, 2003 meeting of the Pharmacology Toxicology Subcommittee of the Advisory Committee for Pharmaceutical Science

Challenges in Data Management and Analysis for Microarrays. FDA, 10 June 2003

Selecting the Appropriate Platform

Affymetrix GeneChip™ Expression Analysis
  • Generate DNA sequence
  • Design and synthesize chips
(Figure: a target DNA sequence with a set of short, overlapping probe sequences derived from it.)

Affymetrix GeneChip™ Expression Analysis
  • Obtain RNA samples (control and test)
  • Prepare fluorescently labeled probes
  • Hybridize and wash chips
  • Scan chips
  • Analyze (figure labels: PM, MM)

Microarray Overview I
  • Microbial ORFs: design PCR primers, giving PCR products
  • Eukaryotic genes: select cDNA clones, giving PCR products
  • Microtiter plates: many different plates containing different genes
  • Microarray slide (with 60,000 or more spotted genes): for each plate set, many identical replicas

Microarray Gene Chip Overview II
  • Obtain RNA samples (control and test)
  • Prepare fluorescently labeled probes
  • Hybridize, wash
  • Measure fluorescence in 2 channels (red/green)
  • Analyze the data to identify patterns of gene expression

Microarray Expression Analysis
  • Tissue selection
  • Differential state/stage selection
  • RNA preparation and labeling
  • Competitive hybridization
(Figure labels: gene spots on an array; fluorescence intensity; expression measurement.)

Platform-related issues
  • Lack of standardization makes direct comparison of results a challenge
  • Lot-to-lot variation in arrays can introduce artifacts: are the results dependent on the biology or on the arrays (or technician, or reagent lots, or ...)?
  • Commercial arrays provide a standard and remove some design considerations (one sample, one array), but can cost 10x or more what in-house arrays cost
  • Arrays demand a good LIMS for sample tracking

Microarray Analysis

General Microarray Strategy
  • Choose an experimentally interesting and tractable model system
  • Design an experiment with comparisons between related variants
  • Include sufficient biological replication to make good estimates
  • Hybridize and collect data
  • Normalize and filter
  • Mine data for biological patterns of expression
  • Integrate expression data with other ancillary data, including genotype, phenotype, the genome, and its annotation

Annotating and Comparing Arrays

TIGR Gene Indices home page: www.tigr.org/tdb/tgi (~60 species; >16,000 sequences)

The Mouse Gene Index <http://www.tigr.org/tdb/mgi>

A TC Example

GO Terms and EC Numbers (Babak Parvizi)

The TIGR Gene Indices <http://www.tigr.org/tdb/tgi> (Dan Lee, Ingeborg Holt)

Building TOGs: Reflexive, Transitive Closure (figure labels: Tentative Orthologues and Paralogues). Thanks to Woytek Makałowski and Mark Boguski
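
One way to read "transitive closure" on this slide: if a sequence in one species pairs with a sequence in a second, and that one pairs with a sequence in a third, all three land in the same group. Below is a minimal sketch of that grouping step using union-find. It assumes the pairwise matches have already been computed; the identifiers and the pair list are hypothetical, and this is not presented as the actual TOG-building pipeline.

```python
def group_by_transitive_closure(pairs):
    """Group items connected through any chain of pairwise matches."""
    parent = {}

    def find(x):
        # Follow parent links to the group representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical pairwise matches between sequences from different species.
pairs = [
    ("human_TC1", "mouse_TC7"),
    ("mouse_TC7", "rat_TC3"),
    ("human_TC2", "mouse_TC9"),
]
groups = group_by_transitive_closure(pairs)
```

Here `human_TC1` and `rat_TC3` end up together even though they were never matched directly, which is exactly why closure can also pull paralogues into a group and why the results need checking.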

TOGA: A Sample Alignment: bithoraxoid-like protein

Gene Finding in Humans is easy! (Razvan Sultana)

Gene Finding in Humans is easy? (Razvan Sultana)

Gene Finding in Humans is difficult? (Razvan Sultana)

Gene Finding in Humans is difficult? A genome and its annotation is only a hypothesis that must be tested. (Razvan Sultana)

RESOURCERER (Jennifer Tsai) http://pga.tigr.org/tools.shtml

RESOURCERER: An Example

RESOURCERER: Using Genetic Markers. Next step: Integrate QTLs

Annotation Issues
  • The “complete” genome is incomplete
  • Gene names are not yet well defined
      • One gene may have many names
      • One gene may have many sequences
      • One sequence may have many names
  • Analysis and interpretation depend on well-annotated gene sets
      • Gene names, Gene Ontology assignments, and pathway information
  • Cross-species comparisons require good knowledge of orthologues and paralogues

Tools and Techniques for Array Analysis

Analysis steps
  • Design the experiment
  • Perform the hybridizations and generate images
  • Analyze images to identify genes and expression levels (hybridization intensities)
  • Normalize expression levels to facilitate comparisons
  • Analyze expression data to find biologically relevant patterns

MADAM: Microarray Data Manager
  • MAGE-ML export by June
  • Available with OSI source and MySQL
  • Joseph White, Jerry Li, Alexander Saeed, Vasily Sharov, Syntek Inc.

Why Normalize Data?
  • Goal is to measure ratios of gene expression levels: ratio_i = R_i / G_i, where R_i and G_i are, respectively, the measured red and green intensities for the i-th spot
  • In a self-self hybridization, we would expect all ratios to be equal to one: R_i / G_i = 1 for all i. But they may not be.
  • Why not?
      • Unequal labeling efficiencies for Cy3/Cy5
      • Noise in the system
      • Differential expression
  • Normalization brings (appropriate) ratios back to one
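
As an illustration of the idea, here is a minimal sketch of one simple global normalization: compute log2(R/G) for every spot and subtract the median, so that a typical spot ends up with a ratio of one (log ratio of zero). It assumes that most spots are not differentially expressed. The intensities are made up, and this is not necessarily the method used in the talk.

```python
import math
from statistics import median


def median_center_log_ratios(red, green):
    """Return log2(R/G) per spot, shifted so the median log ratio is 0."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [lr - shift for lr in log_ratios]


# Hypothetical self-self hybridization where the red channel labels
# roughly twice as efficiently as the green channel.
red = [2000.0, 410.0, 8100.0, 1250.0, 640.0]
green = [1000.0, 200.0, 4000.0, 600.0, 330.0]
normalized = median_center_log_ratios(red, green)
```

Before normalization every log ratio sits near 1 (a twofold dye bias); afterwards they scatter around 0, which is what a self-self hybridization should show.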

LOWESS Results
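
The slide shows only results, so the following is a simplified sketch of what an intensity-dependent LOWESS normalization does, not the talk's implementation. It fits a locally weighted line to M = log2(R/G) as a function of A = average log2 intensity and subtracts the fitted trend. Assumptions: tricube weights, a fixed span `frac`, and no robustness iterations (full LOWESS adds those). All names are illustrative.

```python
import math


def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit of y on x; returns the fit at each x[i]."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each local fit
    fitted = []
    for xi in x:
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights: 1 at xi, falling to 0 at distance h and beyond.
        w = [max(0.0, 1 - (abs(xj - xi) / h) ** 3) ** 3 for xj in x]
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted


def lowess_normalize(red, green, frac=0.5):
    """Subtract the intensity-dependent trend from M = log2(R/G)."""
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # average log intensity
    m = [math.log2(r / g) for r, g in zip(red, green)]        # log ratio
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

Unlike a single global shift, this removes a dye bias that changes with spot intensity, at the cost of a span parameter that has to be chosen.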

MIDAS: Data Analysis (Wei Liang)
  • Variance Stabilization
  • Adding Error Models, MAANOVA, Automated Reporting
  • Available with source

MeV: Data Mining Tools
  • Available with OSI source
  • Alexander Saeed, Alexander Sturn, Nirmal Bhagabati, John Braisted, Syntek Inc., Datanaut, Inc.

Analysis Issues
  • There is no standard method for data analysis
  • The same algorithm with a small change in parameters (such as distance metric) can produce very different results
  • Data normalization plays a big role in identifying “differentially expressed” genes
  • Much of the apparent disparity in microarray datasets can be attributed to differences in data analysis methods, from image processing to normalization to data mining
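
A small sketch of the distance-metric point, with made-up expression profiles: the same nearest-neighbour question gets a different answer under Euclidean distance than under Pearson correlation distance. The profiles and names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance: sensitive to the magnitude of the values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles have the same shape."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)


# Hypothetical expression profiles across four conditions (arbitrary units).
query = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "same_shape": [11.0, 12.0, 13.0, 14.0],  # same trend, large offset
    "close_flat": [2.6, 2.5, 2.4, 2.3],      # similar magnitude, opposite trend
}

nearest_euclid = min(candidates, key=lambda k: euclidean(query, candidates[k]))
nearest_pearson = min(candidates, key=lambda k: pearson_distance(query, candidates[k]))
```

Euclidean distance picks `close_flat`, while Pearson distance picks `same_shape`, so a clustering built on one metric can group genes quite differently from one built on the other.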

Data Reporting Standards

What data should we collect? EVERYTHING
  • Nature Genetics 29, December 2001
  • MAGE-ML: XML-based data exchange format

Publications on Microarray Data Exchange Standards
  • MIAME Standards: Nature family, Cell family, EMBO reports, Bioinformatics, Genome Research, Genome Biology, Science, The Lancet, and others

Standardization Issues
  • MIAME Standards are a start, but still evolving
  • Implementation will require further development of ontologies to create standard descriptors
  • MIAME-Tox represents an attempt to extend this to toxicology
  • Software must be developed to read/write MAGE-ML
  • Public databases need to be extended to meet Tox needs

Science

Integrating Expression with other data

Figure (David Schwartz; adapted from Godowski, NEJM 1999; 340: 1835). Labels: Innate Immunity; Adaptive Immunity; LPS; LBP; BPI; CD14; MD-2; TLR Proteins; MyD88; IRAK2; TRAF-6; NIK; IkB Degradation; NF-kB; Immunomodulatory Genes; Cytokines and Adhesion Proteins; Inflammatory Cell Recruitment; Antigen Presentation; Pathophysiologic Conditions: Sepsis, ARDS, Asthma

Examples
(Figure: lavage PMNs, x 10^3/ml, across BXD recombinant inbred strains, n=32; labeled strains C57BL/6, DBA/2, BXD5, BXD29, BXD39, BXD42; sample labels P1, P2, L1-L3, H1-H3; hybridization design with reference R (P1+P2); 53 hybridizations.)
  • Result: ~425 “significant” genes

Examples
(Figure: lavage PMNs, x 10^3/ml, across BXD recombinant inbred strains, n=32; labeled strains C57BL/6, DBA/2, BXD5, BXD29, BXD39, BXD42.)
  • IDEA: Build QTL maps and use those to filter expression data
  • Goal: Find differentially expressed genes genetically linked to response

Microarray Expression-QTL Consensus Candidate Genes
(Venn diagram: 525 genes in QTL; 426 genes by microarray; 46 shared.)
  • Candidate genes for follow-up and validation
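
If the 46 on this slide is the overlap of the two gene lists (a reading of the Venn diagram, which the extracted text does not state outright), then the filtering step is a set intersection. A minimal sketch with hypothetical gene identifiers:

```python
def consensus_candidates(genes_in_qtl, genes_by_microarray):
    """Genes that are both inside a QTL interval and called by the microarray."""
    return sorted(set(genes_in_qtl) & set(genes_by_microarray))


# Hypothetical identifiers; the real lists would have 525 and 426 entries.
genes_in_qtl = ["gene_A", "gene_B", "gene_C", "gene_D"]
genes_by_microarray = ["gene_C", "gene_D", "gene_E"]
candidates_for_followup = consensus_candidates(genes_in_qtl, genes_by_microarray)
```

The genetic map and the expression data are independent lines of evidence, so requiring both shrinks the list far more than either filter alone.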

Candidate Gene Set for LPS response
(Two-column table flattened by extraction; accessions and gene names are listed in source order, and the row-by-row pairing is not recoverable.)
Accessions (first block): BG076932, BG085317, BG064781, BG085740, BG063515, BG078398, AW556835, BG077485, BG085186, AW550270, BG065761, BG074379, BG080688, BG067349, BG073439, AW551388, BG076460, BG080666, BG067921, BG072974, BG070296, BG074109, BG077487
Gene names (first block): annexin A1 (Anxa1); arginase type II (Arg2); cytidine 5'-triphosphate synthase (Ctps); ets-related transcription facto…; ferritin heavy chain (Fth); MARCKS-like protein (Mlp); protein tyrosine phosphatase, non-receptor type 2 (Ptpn2); ring finger protein (C3HC4 type) 19 (Rnf19); surfactant protein-D gene; tenascin C (Tnc); tumor necrosis factor, alpha-induced protein 2 (Tnfaip2); co-chaperone mt-GrpE#2 precursor; putative CSF-1; C-type lectin Mincle; DKFZp564O1763; E2F-like transcriptional repressor protein; glutamate-cysteine ligase catalytic subunit (GLCLC); gly96; GTP binding protein; DKFZp547B146; DKFZp566F164; Hsp86-1; hypoxia inducible factor 1
Accessions (second block): BG078274, BG084405, BG069214, BG067127, BG080268, BG070106, BG064651, BG063925, BG077818, BG073108, BG064928, BG072801, BG086320, BG072793, BG073446, BG072227, BG068491, BG071081, BG067341, BG067620, BG067670, BG066678, BG071169
Gene names (second block): I kappa B alpha gene; IAP-1 inhibitor of apoptosis protein 1; interferon regulatory factor 1; KC; lipocalin; MAIL; metallothionein II; metallothionein-I; MHC class III region RD; mitogen-responsive 96; S100A9; SDF-1-beta; T-cell activating protein; TH1 protein; TNFa

Sleep Deprivation Studies in Mouse
(Figure: timeline with time points 0, 3, 6, 9, and 12.)

Experimental Paradigm
  • Compare gene expression between sleeping and sleep-deprived mice in cortex and hypothalamus
  • Perform 3 biological replicates
  • Normalize and filter data and use data mining techniques to select distinct patterns of gene expression
  • Use Gene Ontology (GO) assignments to classify genes by cellular localization, molecular function, and biological process
  • Use GO analysis to develop an understanding of response

Differential Expression in Cortex
  • Stress Response
  • Metabolism and Signal Transduction
  • Energy Metabolism
  • Transcription; Mitochondrial and Ribosomal Proteins

Differential Expression in Hypothalamus
  • Sleep signaling

Predicting Outcome

The problem
  • Patients present with tumors, many of which are indistinguishable.
  • Histology can provide some information, but it has little predictive power.
  • Microarrays provide a “fingerprint” that can serve as a phenotypic measure that may be linked to outcome.
  • This is a huge problem in data mining.

The problem in pictures: Adenocarcinomas

32k Human Arrays

cDNA Multi-Organ Cancer Classifier
  • 77 tumor samples; 144 hybridization assays (tumor types shown: breast, ovary, lung)
  • Normalization and flip-dye replica consistency check
  • Divide experiments into training (75%) and validation (25%) sets
  • Statistical filtering of genes (Kruskal-Wallis H-test), p < 0.05: 685 genes
  • UNSUPERVISED CLASSIFICATION: hierarchical clustering (Pearson correlation)
  • SUPERVISED CLASSIFICATION: artificial neural network training and validation
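
A minimal sketch of the gene-filtering step, assuming one expression value per sample and three tumor classes. The H statistic below omits the tie correction, and the p-value uses the chi-square approximation, which for three groups (2 degrees of freedom) has the closed form exp(-H/2). The expression values are made up.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups (no tie correction)."""
    values = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(values)
    # Assign ranks 1..n, giving tied values their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[j + 1][0] == values[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sums = [0.0] * len(groups)
    for (_, gi), r in zip(values, ranks):
        rank_sums[gi] += r
    return 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)


def p_value_three_groups(h):
    """Chi-square upper tail with 2 degrees of freedom: exp(-H/2)."""
    return math.exp(-h / 2)


# Hypothetical expression of two genes in breast, ovary, and lung tumors.
separating_gene = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
flat_gene = [[1, 5, 9, 12], [2, 6, 7, 11], [3, 4, 8, 10]]

h_sep = kruskal_wallis_h(separating_gene)
h_flat = kruskal_wallis_h(flat_gene)
keep_sep = p_value_three_groups(h_sep) < 0.05
keep_flat = p_value_three_groups(h_flat) < 0.05
```

Only the gene whose expression separates the tumor classes passes the p < 0.05 filter; repeating the test for every gene on the array is what reduces the list to a few hundred genes.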

Neural Networks and Cancer
  • Input data: a list of genes with expression levels
  • Output data: a tumor type call
  • “Hidden layers” allow complex connections

Neural Networks and Cancer
  • Training: adjusts weights and connections
(Figure label: Breast Tumor.)
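
To make the input/hidden/output picture concrete, here is a sketch of the forward pass of a network with one hidden layer. It is illustrative only: the weights are hand-picked rather than trained, there are no bias terms, and the layer sizes and tumor labels are assumptions. Training would adjust `w_hidden` and `w_out` so that the calls match known tumor types.

```python
import math


def forward(expression, w_hidden, w_out):
    """One hidden layer of sigmoid units, then a softmax over tumor types."""
    hidden = [
        1 / (1 + math.exp(-sum(w * x for w, x in zip(row, expression))))
        for row in w_hidden
    ]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tumor_types = ["breast", "ovary", "lung"]

# Hypothetical expression levels for three genes and hand-picked weights.
expression = [2.0, -1.0, 0.5]
w_hidden = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 hidden units x 3 genes
w_out = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # 3 tumor types x 2 hidden units

probabilities = forward(expression, w_hidden, w_out)
call = tumor_types[probabilities.index(max(probabilities))]
```

The tumor type call is simply the class with the highest output probability.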

Tumors in the Universal Classifier
543 tumor samples; 21 tumor types; 95% of all cancers

Tumor Type | Number of Samples | Array Platform
Bladder | 19 | U95, HU6800
Breast | 42 | U95, HU6800, TIGR 32k
Central Nervous – Atypical Teratoid/Rhabdoid | 10 | HU6800
Central Nervous – Glioma | 10 | HU6800
Central Nervous – Medulloblastoma | 70 | HU6800
Colon | 41 | U95, HU6800, TIGR 32k
Stomach/EG Junction | 30 | U95, TIGR 32k
Kidney | 31 | U95, HU6800, TIGR 32k
Leukemia – Acute Lymphocytic B Cell | 10 | HU6800
Leukemia – Acute Lymphocytic T Cell | 10 | HU6800
Leukemia – Acute Myelogenous | 10 | HU6800
Lung – Adenocarcinoma | 71 | U95, HU6800, TIGR 32k
Lung – Squamous Cell Carcinoma | 21 | U95
Lymphoma – Follicular | 11 | HU6800
Lymphoma – Large B Cell | 11 | HU6800
Melanoma | 10 | HU6800
Mesothelioma | 10 | HU6800
Ovary | 44 | U95, HU6800, TIGR 32k
Pancreas | 26 | U95, HU6800, TIGR 32k
Prostate | 42 | U95, HU6800
Uterus | 10 | HU6800

(Workflow figure.)
  • Data Acquisition: Microarray Database (figure labels: U95A=124, Hu6800=136, TIGR)
  • Normalization and Scaling: average across chips using reference U95A; gene-by-gene using reference U95A
  • Statistical Screening: Kruskal-Wallis, Bonferroni, applied to all normalized and scaled genes
  • Neural Network Training and Validation: training set and test set (Tumor 1 ... Tumor n)
  • Correlative Gene Subset
  • Classifier
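
The screening box names a Bonferroni step. A minimal sketch, assuming it is applied to per-gene p-values from the Kruskal-Wallis test: a gene is kept only if its p-value is below alpha divided by the number of genes tested. The value alpha = 0.05 and the p-values are assumptions for illustration.

```python
def bonferroni_filter(p_values, alpha=0.05):
    """Keep genes whose p-value passes alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [gene for gene, p in p_values.items() if p < threshold]


# Hypothetical per-gene p-values from a screening test.
p_values = {"gene_1": 0.0001, "gene_2": 0.02, "gene_3": 0.5, "gene_4": 0.011}
selected = bonferroni_filter(p_values)
```

With four tests the threshold is 0.0125, so `gene_2` is dropped even though its p-value is below 0.05; with thousands of genes on an array the correction is far stricter.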

Summary
  • We collected 540 expression profiles: 21 tumor types; 95% of all cancers
  • 10 independent classifiers
  • 75% of data for training, 25% for test
  • Average ~88% accuracy
  • Web-based classifier available: so far, 7 of 8* in classification
  • 84% accuracy in classifying primary source of mets
* Bad RNA
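
One plausible reading of "10 independent classifiers" with a 75%/25% split is ten separate random partitions of the data, each used to train and test one classifier, with the accuracies averaged. That reading is an assumption; the slide does not say how the splits were made. The sketch below is not stratified by tumor type, and the sample names are hypothetical.

```python
import random


def split_75_25(samples, seed):
    """Shuffle the samples and split them 75% training / 25% test."""
    rng = random.Random(seed)  # seeded so each split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


def mean_accuracy(accuracies):
    """Average test accuracy across the independently trained classifiers."""
    return sum(accuracies) / len(accuracies)


# Hypothetical labeled samples: (sample name, tumor type).
samples = [("tumor_%d" % i, "breast" if i % 2 else "lung") for i in range(40)]
splits = [split_75_25(samples, seed) for seed in range(10)]
```

Each `(training, test)` pair in `splits` would feed one classifier, and `mean_accuracy` would then summarize the ten test-set accuracies.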

Further challenges in analysis?
  • Statistical significance is not the same as biological significance
  • If you perturb a system, many genes change their expression levels
  • Multiple pathways and features in the data can be revealed through different analysis methods
  • Genes that are good for classification or prognostics may not be biologically relevant
  • Extracting meaning from microarrays will require new software and tools
  • The most important thing we need is more data collected and stored in a standard fashion

Barriers to Toxicology Applications
  • The “complete” genomes are incomplete
  • Many of the signatures we see on arrays do not have immediate biological implications
  • Most often genes are included on the arrays that are used solely for normalization
  • Larger datasets may reveal diagnostic or prognostic patterns that are not obvious at present
  • Reported “variation” in the assays must be understood
      • Differences in laboratory and analysis protocols are likely sources
      • There is a need to define QC and analysis standards
  • There is clearly a need for a large database of expression profiles linked to other relevant ancillary information

Science is built with facts as a house is with stones – but a collection of facts is no more a science than a heap of stones is a house. – Jules Henri Poincaré

Acknowledgments <johnq@tigr.org>
(Multi-column slide; names are kept in source order because the column layout was lost in extraction.)
TIGR Human/Mouse/Arabidopsis Expression Team; H. Lee Moffitt Center/USF: Timothy J. Yeatman, Emily Chen, Greg Bloom, Bryan Frank, Renee Gaspard; PGA Collaborators: Jeremy Hasseman, Gary Churchill (TJL), Heenam Kim, Greg Evans (NHLBI), Lara Linford, Harry Gavaras (BU), Simon Kwong, Howard Jacob (MCW), John Quackenbush, Anne Kwitek (MCW), Shuibang Wang, Allan Pack (Penn), Yonghong Wang; Emeritus: Beverly Paigen (TJL), Ivana Yang, Jennifer Cho (TGI), Luanne Peters (TJL), Yan Yu, Ingeborg Holt (TGI), David Schwartz (Duke); Array Software Hit Team: Feng Liang (TGI), Nirmal Bhagabati, Kristie Abernathy (mA); TIGR PGA Collaborators: John Braisted, Sonia Dharap (mA), Norman Lee, Tracey Currier, Julie Earle-Hughes (mA), Renae Malek, Jerry Li, Cheryl Gay (mA), Hong-Ying Wang, Wei Liang, Priti Hegde (mA), Truong Luu, John Quackenbush, Rong Qi (mA), Bobby Behbahani, Alexander I. Saeed, Erik Snesrud (mA), Vasily Sharov, Mathangi Thaiagarjian, Joseph White; Assistant: Sue Mineo
Funding provided by the Department of Energy and the National Science Foundation
Funding provided by the National Cancer Institute, the National Heart, Lung, Blood Institute, and the National Science Foundation
The TIGR Gene Index Team: Foo Cheung, Svetlana Karamycheva, Yudan Lee, Babak Parvizi, Geo Pertea, Razvan Sultana, Jennifer Tsai, John Quackenbush, Joseph White
TIGR Faculty, IT Group, and Staff