  • Number of slides: 62

Driving Discovery Through Data Integration and Analysis
John Quackenbush
Molecular Diagnostics World 2010, 28 October 2010

Disease Progression and Personalized Care
[Diagram labels: Birth, Treatment, Natural History of Disease, Clinical Care, Environment + Lifestyle, Outcomes, Treatment Options, Disease Staging, Patient Stratification, Early Detection, Genetic Risk, Biomarkers, Quality Of Life, Death]

Turning the vision into a reality
- Assure access to samples and rational consent
- Develop a technology platform
- Make information integration a central mission
- Conduct research as a vital component
- Present data and information to the local community
- Enable research beyond your own
- Engage corporate partners
- Communicate the mission to the community

Assure Access to Samples

Access, Research, Security
- Patients want to be part of the process of curing disease
- Informed consent needs to be structured to allow patients to be partners in the research process
- HIPAA requires both informed consent and that we assure patient confidentiality
- But “identifiability” is a moving target in a genomic age
- With the <$1000 genome, in the age of Facebook, what this means remains unclear
- The new Genomics is a disruptive technology

Develop a Technology Platform

2006: State of the Art Sequencing
PRODUCTION
- Rooms of equipment
- Subcloning > picking > prepping
- 35 FTEs
- 3-4 weeks
SEQUENCING
- 74 x Capillary Sequencers
- 10 FTEs
- 15-40 runs per day
- 1-2 Mb per instrument per day
- 120 Mb total capacity per day
Sequencing the genome took ~15 years and $3B

2008: Enabling a New Era in Genome Analysis
PRODUCTION
- 1 x Cluster Station
- 1 FTE
- 1 day
SEQUENCING
- 1 x Genome Analyzer
- Same FTE as above
- 1 run per 5 days
- 15 Gb per instrument per run
- >3 Gb per day (1x genome coverage)
We can now re-sequence the genome in ~1 week
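
The throughput figures on the 2006 and 2008 slides can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the conversion 1 Gb = 1000 Mb is an assumption made here for the comparison.

```python
# Figures below are copied from the 2006 and 2008 slides; names are illustrative.
instruments_2006 = 74
mb_per_instrument_day = (1, 2)  # low and high estimate, Mb per instrument per day

# Implied facility-wide range for 2006, in Mb per day.
capacity_2006 = tuple(instruments_2006 * mb for mb in mb_per_instrument_day)

# 2008: one instrument, 15 Gb per run, one run per 5 days.
gb_per_run_2008 = 15
days_per_run_2008 = 5
gb_per_day_2008 = gb_per_run_2008 / days_per_run_2008

# Compare one 2008 instrument with the stated 2006 total of 120 Mb per day
# (assumption: 1 Gb = 1000 Mb).
stated_2006_mb_per_day = 120
speedup = gb_per_day_2008 * 1000 / stated_2006_mb_per_day

print(capacity_2006, gb_per_day_2008, speedup)
```

Under these assumptions the stated 120 Mb per day falls inside the range implied by 74 instruments, and a single 2008 instrument delivers roughly 25 times the daily output of the whole 2006 facility.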

The Challenge
- New technologies inspired by the Human Genome Project are transforming biomedical research from a laboratory science to an information science
- We need new approaches to making sense of the data we generate
- The winners in the race to understand disease are going to be those best able to collect, manage, analyze, and interpret the data

Make information integration a central mission

http://compbio.dfci.harvard.edu
[Diagram labels: Gene, RNA, Protein, Network, Patient; Gene Index Databases; TM4 Microarray Software; Predict Network, Candidate Gene(s), Perturb Network (RNAi), Assay Response (mA); Resourcerer; Other Databases; Other tools; MeSHer; ClusterMed; Bayesian Nets; Central Warehouse; DNA Microarray Analysis. Other Things: Mesoscopic Expression, Correlated Signatures, State Space Gene Models, Tiling Arrays to Genes]

Dealing with an Information Overload

Beating Information Overload
[Diagram labels: Central Warehouse; Clinical Data, Genomics, Cytogenomics, Metabolomics, Transcriptomics, Chemical Biology, Clinical Trials, Epigenomics, Proteomics, Etc.; PubMed, The HapMap, The Genome, Disease Databases (OMIM), Published Datasets, DrugBank; Improved Diagnostics, Individualized Therapies, More Effective Agents]

Conceptual Architecture
[Diagram labels: Dana Farber Clinical Systems (IDX, Rx, Lab, Clinical Trial, misc); External (PubMed, GenBank, Partners, OMICS); Dana Farber Lab; Enterprise Service Bus; Rules Engine; Web Center Portal; BAM Dashboard; Portals; Business Intelligence; Idm & Security; HTB; ODS; genomics; Web Service Directory; BPEL; Custom; De-identification; Terminology; EMPI; Mapping; Facts; Severity Score; Clinical Pathways; Security; Auditing; RFID; Dana-Farber Research DB. Legend: Build or Buy, Oracle, Existing]

An Example: Signature Analysis
[Diagram labels: Warehouse, ArrayExpress, GEO, Random Websites, Fenglong Liu]
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard

GeneChip Oncology Database
Fenglong Liu

GeneChip Oncology Database
Soon to be replaced by EBI’s Gene Expression Atlas
Fenglong Liu

An Example: Signature Analysis
[Diagram labels: PubMed, Kerm Picard, ArrayExpress, Warehouse, Analysis, Fenglong Liu, GEO, Random Websites, In-House Studies]
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard

GeneSigDB - release 2
http://compbio.dfci.harvard.edu/genesigdb

GeneSigDB - comparing cancers

Cancer is a Cell-Cycle Disease
Aedin Culhane, Daniel Gusenleitner

Breast Cancer has unique signatures
Aedin Culhane, Daniel Gusenleitner

A sample research question
How many Multiple Myeloma patients, with bone marrow or blood samples in the bank, and who have a chromosome 13 deletion, responded (complete, partial, or minor remission) to therapy and how many did not respond?
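
A question like this is a filter-and-count over integrated clinical, sample-bank, and cytogenetic records. The sketch below shows one way to express it; the `Patient` record, its field names, the response labels, and the toy data are all hypothetical and invented for illustration, not taken from any actual warehouse schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Patient:
    """Hypothetical integrated record; field names are illustrative only."""
    diagnosis: str
    banked_samples: frozenset      # sample types held in the bank
    chr13_deletion: bool
    response: Optional[str]        # "complete", "partial", "minor", "none", or None if unknown


RESPONSES = {"complete", "partial", "minor"}
SAMPLE_TYPES = {"bone marrow", "blood"}


def count_responses(patients):
    """Return (responders, non_responders) for the sample research question."""
    responders = non_responders = 0
    for p in patients:
        if p.diagnosis != "multiple myeloma":
            continue
        if not (p.banked_samples & SAMPLE_TYPES):
            continue  # no bone marrow or blood sample in the bank
        if not p.chr13_deletion:
            continue
        if p.response in RESPONSES:
            responders += 1
        elif p.response == "none":
            non_responders += 1
    return responders, non_responders


# Toy data, made up for this example.
patients = [
    Patient("multiple myeloma", frozenset({"bone marrow"}), True, "complete"),
    Patient("multiple myeloma", frozenset({"blood"}), True, "none"),
    Patient("multiple myeloma", frozenset({"blood"}), True, "minor"),
    Patient("multiple myeloma", frozenset(), True, "partial"),
    Patient("multiple myeloma", frozenset({"bone marrow"}), False, "partial"),
    Patient("lymphoma", frozenset({"blood"}), True, "complete"),
]
responders, non_responders = count_responses(patients)
print(responders, non_responders)
```

The point of the sketch is that each filter draws on a different source system (diagnosis, sample bank, cytogenetics, outcomes), which is why the question is hard to answer without a central warehouse.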

A Path Forward
- We are working to develop a two-way strategy for the future: Clinic → Lab → Clinic
- Consider Oncotype DX
- This approach represents the intellectual framework for future success, and the bridges between the various laboratories and programs

Conduct research as a vital component

Bayesian Networks
Amira Djebbari, Raktim Sinha, Dan Schlauch

When we say “Networks” we mean…
- Genes are represented as “nodes”
- Interactions are represented by “edges”
- Edges can be directed to show “causal” interactions
- Edges are not necessarily direct interactions
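
The node-and-edge vocabulary maps directly onto a small directed-graph structure. The following is a minimal sketch; the class name `GeneNetwork`, its methods, and the example topology are assumptions made for illustration.

```python
from collections import defaultdict


class GeneNetwork:
    """Directed graph: genes are nodes, interactions are directed edges."""

    def __init__(self):
        self.nodes = set()
        self.children = defaultdict(set)  # source gene -> set of target genes

    def add_edge(self, source, target):
        """Record a directed (possibly indirect) interaction source -> target."""
        self.nodes.update((source, target))
        self.children[source].add(target)

    def parents(self, gene):
        """Genes with an edge pointing into `gene`."""
        return {src for src, targets in self.children.items() if gene in targets}


# Hypothetical topology, chosen only to exercise the structure.
net = GeneNetwork()
net.add_edge("Gene1", "Gene2")
net.add_edge("Gene2", "Gene3")
net.add_edge("Gene2", "Gene4")
print(sorted(net.parents("Gene3")))
```

Note that an edge here only records that one gene's state is informative about another's; as the slide says, it need not correspond to a direct physical interaction.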

Bayesian network - example
- Edges represent dependencies
- Conditional probability table at node “Gene 2”
[Diagram: nodes Gene 1, Gene 2, Gene 3, Gene 4. Table: Gene 1 states -1, 0, 1; P(Gene 2 = 1 | Gene 1): 0.2, 0.7 (remaining entry missing)]
Learning Bayesian networks:
- Structure
- Conditional probability tables
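
Once a structure is fixed, a conditional probability table can be estimated from discretized expression data by counting. The sketch below uses relative frequencies (maximum likelihood) as an assumed estimator; the function name and the toy observations are invented for illustration and are not the values shown on the slide.

```python
from collections import Counter


def estimate_cpt(samples):
    """Estimate P(Gene2 = 1 | Gene1 = state) by relative frequency.

    samples: iterable of (gene1_state, gene2_state) pairs, with gene1_state
    in {-1, 0, 1} and gene2_state in {0, 1}.
    """
    totals = Counter()
    ones = Counter()
    for gene1_state, gene2_state in samples:
        totals[gene1_state] += 1
        if gene2_state == 1:
            ones[gene1_state] += 1
    return {state: ones[state] / totals[state] for state in totals}


# Toy observations, made up for this example.
samples = [
    (-1, 0), (-1, 0), (-1, 0), (-1, 1),
    (0, 0), (0, 1),
    (1, 1), (1, 1), (1, 1), (1, 0),
]
cpt = estimate_cpt(samples)
print(cpt)
```

In practice a pseudocount is often added to each cell so that states with few observations do not yield probabilities of exactly 0 or 1; that refinement is omitted here for brevity.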

Bayesian networks - priors
- No free lunch theorem (Wolpert & Macready, 1996): The performance of a general-purpose optimization algorithm iterated on a cost function is independent of the algorithm when averaged over all cost functions
- This suggests that, when considering a specific application, one can introduce a potentially useful bias using domain knowledge

A low-cost lunch?
- One can “help” the search along by providing a seed structure representing what we believe is the most likely network
- The network search process will then use gene expression data to look for perturbations on the structure that are supported by the data
- There are many possible sources of prior structures, including the biomedical literature and large-scale interaction studies (PPI)

Bayesian networks using microarray data and literature
- Test Set: Golub et al. ALL/AML dataset
- Learn BN with literature network as prior structure, Protein-Protein Interaction data (PPI), and literature+PPI
- Perform 200 bootstrap network estimations and find links that are “high confidence”
- Compare without prior (microarray data only) vs. with prior structure from the literature to look for known interactions
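
The bootstrap step can be sketched as follows: resample the samples with replacement, re-estimate the network on each resample, and score each link by the fraction of resamples in which it appears. To keep the example short, a simple correlation threshold stands in for Bayesian network structure learning; that substitution, the function names, the cutoff, and the toy data are all assumptions, not the method used in the talk.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def estimate_edges(data, cutoff=0.8):
    """Stand-in structure learner: undirected edge wherever |r| >= cutoff."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(data), 2)
        if abs(pearson(data[a], data[b])) >= cutoff
    }


def bootstrap_confidence(data, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is recovered."""
    rng = random.Random(seed)
    n_samples = len(next(iter(data.values())))
    counts = {}
    for _ in range(n_boot):
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        resampled = {gene: [values[i] for i in idx] for gene, values in data.items()}
        for edge in estimate_edges(resampled):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}


# Toy expression data: geneB tracks geneA exactly, geneC does not.
data = {
    "geneA": [float(i) for i in range(20)],
    "geneB": [2.0 * i + 1.0 for i in range(20)],
    "geneC": [1.0 if i % 2 == 0 else -1.0 for i in range(20)],
}
confidence = bootstrap_confidence(data)
high_confidence = {edge for edge, c in confidence.items() if c >= 0.9}
print(high_confidence)
```

The 0.9 threshold for calling a link “high confidence” is likewise an illustrative choice; the slide does not state the threshold used.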

BN: No Priors
Amira Djebbari

BN: PPI Data
Amira Djebbari

BN: Literature Priors
Amira Djebbari

BN: Literature + PPI
Cell Cycle Gene Subnetwork
Amira Djebbari

Improving the Seeds
- Co-occurrence does not provide directionality for interactions, but a BN is a DAG and our assignment is ad hoc
- The literature contains information about how the genes (and their products) interact
- The challenge is extracting that information from the literature: there is too much to read
- Text mining doesn’t work well for the biomedical literature

Improving the Seeds (2)
Solution: Use a hybrid approach!
- Use text-mining tools to find sentences that contain names of two or more genes
- Use the Amazon Mechanical Turk to extract [subject]-[predicate]-[object] triples
- Define relationships between genes based on the “consensus” interaction
- Combine these results with pathway databases to build seed networks
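
The “consensus” step can be sketched as a vote over the triples that different annotators extract for the same gene pair. The majority rule, the `min_agreement` parameter, the function name, and the placeholder gene names below are assumptions made for illustration; the slides do not specify how consensus is computed.

```python
from collections import Counter, defaultdict


def consensus_relations(triples, min_agreement=0.5):
    """Pick one predicate per (subject, object) pair by majority vote.

    triples: iterable of (subject, predicate, object) annotations.
    A pair is kept only if its most common predicate has a share of
    votes strictly greater than min_agreement.
    """
    votes = defaultdict(Counter)
    for subject, predicate, obj in triples:
        votes[(subject, obj)][predicate] += 1

    consensus = {}
    for pair, counter in votes.items():
        predicate, n = counter.most_common(1)[0]
        if n / sum(counter.values()) > min_agreement:
            consensus[pair] = predicate
    return consensus


# Made-up annotations with placeholder gene names.
triples = [
    ("GeneA", "activates", "GeneB"),
    ("GeneA", "activates", "GeneB"),
    ("GeneA", "activates", "GeneB"),
    ("GeneA", "inhibits", "GeneB"),
    ("GeneC", "inhibits", "GeneA"),
    ("GeneC", "binds", "GeneA"),
]
relations = consensus_relations(triples)
print(relations)
```

Because the subject and object are kept in order, the agreed relations carry the directionality that co-occurrence alone cannot provide, which is what a seed network for a DAG needs.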

“PredictiveNetworks” seeds from the literature

Present data and information to the local community

LGRC Research Portal

LGRC Research Portal

PAGE DETAILS
- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes
Actions:
- Go to data download for selected cohort
- Go to assay detail for selected cohort
- Go to cohort manager

LGRC Research Portal

PAGE DETAILS
Search
- Facets
- Search within results
- Keyword prompts
- Search history
Table:
- Paged results
- Sortable columns
Actions:
- Go to Gene detail page
- Add genes to ‘gene set’

PAGE DETAILS
Annotation summary & summary view for each assay/data type:
- GEXP: expression profile across major Dx categories
- RNASeq: exon structure of the gene
- SNPs: table of SNPs in region of gene, highlighting association with major Dx group
- Methylation: methylation profile in region around gene
- Genomic alterations: table of CNVs & alterations observed w/ freq in region around gene
Accordion style sections: Annotation Summary, Gene Expression Summary, RNASeq
Actions:
- Click through to assay detail page
- Add gene to set

LGRC Research Portal

Analysis Tools
[Form mock-up: Cohort 1: Set 1; Cohort 2: Set 2; Job name: My job 1; View analysis parameters; Start Analysis; Job Status: Running]
PAGE DETAILS
- Very minimal parameters and options; here just 2 cohorts of interest, maybe a p-value cutoff
- Generates comprehensive report
- Edit in place results: don’t set parameters, edit the results
- Analysis goes into queue, email notification when finished

Analysis of Differential Expression: My Job 1
[Result sections: Supervised Analysis, Unsupervised analysis, Meta analysis]
PAGE DETAILS
- Very minimal parameters and options
- Generates comprehensive report
- Edit in place results: don’t set parameters, edit the results
- Accordion style result sections
- Generate PDF report of analysis
- Analysis goes into queue, email notification when finished

Engage corporate partners

We need to find the best tools
- We received a $1M Oracle Commitment grant to create our integrated clinical/research data warehouse
- We’ve partnered with IDBS to create data portals
- We are working with Illumina on a variety of projects
- We are forging relationships with Thomson Reuters to link genomic profiling data to drug, trial, and patent information
- We are building partnerships with Roche, Genomatix, NEB, and others interested in entering the personal genomics space

Enable research beyond your own

John Quackenbush, Director
Mick Correll, Associate Director

The Mission
The mission of the CCCB is to provide broad-based support for the analysis and interpretation of ‘omic data and in doing so to further basic, clinical and translational research. CCCB also will conduct research that opens new ways of understanding cancer.

CCCB Service Offering
IT Infrastructure
- Application hosting
- Data management
- Custom software development
- Comprehensive collaboration portals

CCCB Service Offering
[Diagram labels: IT Infrastructure]
Next-Gen Sequencing
- Competitive per-lane pricing
- Integrated informatics
- Major focus for development in 2010

CCCB Service Offering
[Diagram labels: Sequencing, IT Infrastructure]
Analytical Consulting
- Bioinformatics / statistical data analysis
- Experimental design
- Value-add for IT/Sequencing services

CCCB Collaborative Consulting Model
1. Initial meeting to understand project scope and objectives
2. Development of an analysis plan and time/cost estimate
3. During project execution, data and results are exchanged through a secure, password-protected collaboration portal
4. Available as ad-hoc service, or larger scale support agreements
[Diagram labels: Sequencing, IT Infrastructure, Consulting]

Communicate the mission to the community.

The LGRC

Genomics is here to stay

Acknowledgments
The Gene Index Team: Corina Antonescu, Valentin Antonescu, Fenglong Liu, Geo Pertea, Razvan Sultana, John Quackenbush
Array Software Hit Team: Katie Franklin, Eleanor Howe, Sarita Nair, Jerry Papenhausen, John Quackenbush, Dan Schlauch, Raktim Sinha, Joseph White
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom
Center for Cancer Computational Biology and Microarray Expression Team: Stefan Bentink, Mick Correll, Thomas Chittenden, Howie Goodell, Aedin Culhane, Kristina Holton, Jerry Papenhausen, Jane Pak, Patricia Papastamos, Renee Rubio, John Quackenbush
(Former) Stellar Students: Martin Aryee, Kaveh Maghsoudi, Jess Mar
Systems Support: Stas Alekseev, Sys Admin
Assistant: Patricia Papastamos
http://cccb.dfci.harvard.edu
http://compbio.dfci.harvard.edu