  • Number of slides: 66

Large-Scale Expression Data Mining and Management
Dr Ewan Hunter, Senior Scientist (Europe)

Silicon Genetics
Founded in 1998 to provide scientists with software that efficiently analyzes, interprets, and manages large volumes of expression data

Customer List Continues to Grow
Over 4000 customers at over 500 organizations, including leading research institutes, biotech and pharmaceutical companies:
Pfizer, Stanford University, Merck & Co, Celera, Bristol-Myers Squibb, TGen, Novartis, NASA Ames Research, Cold Spring Harbor, UCLA, UCSF, UC Davis, Lawrence Berkeley Labs, Merck KGaA, Baylor College of Medicine, Applied Biosystems, Genentech, Cedars-Sinai, SRI International, Vancouver General Hospital, GlaxoSmithKline, Cornell University, AstraZeneca, US EPA, US NIH, US FDA, Wyeth/AHP, Roche, Schering-Plough, Boehringer Ingelheim, Bayer, Affymetrix, Swiss Array Consortium, Celgene, NERC, Biogen, Emory University, Aventis

Our recipe for success
• Corporate focus on expression informatics
• Independently owned and profitable since inception six years ago
• Customer-driven software development
• Responsive, knowledgeable and proactive technical support team
• Extensive experience in software implementation

The expression informatics leader
Market leadership confirmed in recent GenomeWeb survey of 167 industry professionals.

Extensive citations in leading journals
GeneSpring citations appearing in leading peer-reviewed journals
[Chart: year-end citation counts; values shown: 230, 61, 23, 1, 8]
www.silicongenetics.com/citations.html

Gene Expression Data Flow
[Diagram: output from Affymetrix®, Clontech™, Agilent and others flows into Silicon Genetics GeneSpring™ and GeNet™, then on to validation. Labels: Scanning, Image Processing, Data Processing, Normalization, Scaling, Error models, Formatting, Analysis & Visualization, Validation, Clustering, Cross validation, ANOVA, Biochemical, Class Prediction, Literature, Pathways]

Different data types must be integrated
• Raw Data
  – Residing in custom-developed databases
  – Residing in LIMS
  – Residing in flat-files/spreadsheets
• Sample and gene annotation
  – Residing in custom-developed databases
  – Residing in LIMS
  – Residing in flat-files/spreadsheets
• Pre-processed data
  – From third-party applications
  – From flat-files
• Analysis results
  – From GeneSpring
  – From flat-files
  – From third-party applications

GeNet as a Centralized Workspace

Automated synchronization capabilities
• Synchronize existing data repositories with GeNet using SampleLoader API
• Integrate data from LIMS systems and corporate databases
• Integrate pre-processed data from third-party applications

SampleLoader populating the workspace with annotated raw data

Integration of Standard Annotation with Sample Data
[Diagram labels: Treatment Type, Age, Gender, Array Design, Stage, Duration, Concentration, Dosage, Compound, Sample ID, Author, Time, Disease Type, Organ, Tissue Type]

Enforceable Annotation Standards via SampleLoader
• Compliance with MIAME or in-house annotation standards can be easily enforced (via XML DTD)
• Attributes can be chosen from a standard list complete with dropdown options
• Attributes can be indicated as required, recommended or optional
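
The required/recommended/optional idea can be illustrated with a minimal Python sketch. It is not SampleLoader's implementation: the slides say enforcement happens through an XML DTD, whereas this example checks a plain dictionary. The `AttributeRule` class, the `check_sample` function, the chosen levels and the allowed values are all hypothetical; only the attribute names are taken from the annotation slide.

```python
from dataclasses import dataclass

# Hypothetical requirement levels mirroring the slide's
# "required, recommended and optional" attribute categories.
REQUIRED, RECOMMENDED, OPTIONAL = "required", "recommended", "optional"


@dataclass(frozen=True)
class AttributeRule:
    name: str
    level: str
    allowed: tuple = ()  # empty tuple means any free-text value is accepted


def check_sample(annotation, rules):
    """Return (errors, warnings) for one sample's annotation dict."""
    errors, warnings = [], []
    for rule in rules:
        value = annotation.get(rule.name)
        if value in (None, ""):
            if rule.level == REQUIRED:
                errors.append(f"missing required attribute: {rule.name}")
            elif rule.level == RECOMMENDED:
                warnings.append(f"missing recommended attribute: {rule.name}")
            continue
        # A non-empty 'allowed' tuple plays the role of the dropdown list.
        if rule.allowed and value not in rule.allowed:
            errors.append(f"{rule.name}: {value!r} not in standard list")
    return errors, warnings


# Illustrative rule set; levels and allowed values are made up for this example.
RULES = [
    AttributeRule("Sample ID", REQUIRED),
    AttributeRule("Gender", REQUIRED, ("male", "female", "unknown")),
    AttributeRule("Tissue Type", RECOMMENDED),
    AttributeRule("Author", OPTIONAL),
]

errors, warnings = check_sample({"Sample ID": "S-001", "Gender": "M"}, RULES)
```

Here the sample is rejected because "M" is not in the standard list for Gender, and a warning is raised for the missing Tissue Type; a missing optional Author passes silently.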

Integration of clinical information to create a searchable and standardized repository

SampleLoader populating the workspace with pre-processed data

Mining GeNet with GeneSpring

The GeneSpring client – list of powerful analysis capabilities continues to grow
• Scripting Language (automated analysis)
• Ontology and Homology Tools
• MIAME support (Published Meta Data Structure)
• Two-Way ANOVA
• Post-Hoc tests
• Find Similar Samples Algorithm
• Boolean Filtering
• PCA on conditions
• SVM classifier
• Multiple Clustering Algorithms
  – QT clustering
  – Hierarchical
  – K-means
  – SOM

Seamless interaction with GeNet
• Sample data residing in GeNet can be accessed from the Sample Manager in GeneSpring
• Users can easily search for samples of interest and proceed with analysis in GeneSpring

Populating the workspace with analysis results

The GeNet Workspace

Configuring data upload to GeNet
• Data upload and download via GeneSpring to and from GeNet is seamless
  – Users can upload important analysis results to GeNet with a click of the mouse
  – Data residing in GeNet is automatically available to the GeneSpring user upon login
• Data upload to GeNet via SampleLoader can be configured via customizable XML files
  – “SampleLoader Configuration Files”
  – Runs nightly cron-jobs to synchronize existing data repositories with GeNet

More on Integration… External Program Interface
• The External Program Interface (EPI) allows you to run external programs from within GeneSpring
• Used to integrate GeneSpring directly with other analysis and visualization packages
  – Ex: SAS, S+, R, JMP, MatLab
• Extends out-of-the-box capabilities of GeneSpring
  – Interface with custom code (C, C++, Java, Perl, etc.)
• Results can be stored in GeNet

Example EPIs
• Ariadne’s PathwayAssist
• Lion’s SRS and LTE
• SAS
• Bioconductor/R
• S+

Utilizing the Full Power of the GeNet Workspace

Custom Integration and Future APIs
• Further customization and integration work can be performed with the help of Silicon Genetics’ Professional Services
• Additional APIs are in development that will allow other applications to query GeNet for data
  – Key customer input and assistance in developing specs is welcome and encouraged

Making GeNet a Workspace
GeNet API – Architecture
S.O.A.P. (Simple Object Access Protocol): “a lightweight XML-based messaging protocol used to encode the information in Web service request and response messages before sending them over a network.” (Webopedia)
In addition we will distribute WSDL (Web Services Description Language) files that allow specialized applications to auto-generate routines in a variety of languages that generate GeNet-specific SOAP objects.
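
To make the SOAP description concrete, here is a minimal Python sketch that encodes a request as a SOAP 1.1 envelope and decodes it again using only the standard library. The slides do not publish the GeNet API, so the `urn:example:genet` namespace, the `FindSamples` operation and its parameters are hypothetical placeholders. A real client would more likely generate these calls from the distributed WSDL files.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
GENET_NS = "urn:example:genet"  # hypothetical namespace, not taken from the slides


def build_request(operation, params):
    """Build a SOAP 1.1 envelope with a single operation element in the body."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    op = ET.SubElement(body, f"{{{GENET_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(op, f"{{{GENET_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_body(xml_text):
    """Return (operation name, {parameter: text}) from an envelope built above."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    op = body[0]

    def local(tag):
        # ElementTree stores tags as "{namespace}name"; keep only the name.
        return tag.split("}", 1)[1]

    return local(op.tag), {local(child.tag): child.text for child in op}


# Hypothetical operation and parameters, for illustration only.
request = build_request("FindSamples", {"organ": "liver", "limit": 10})
operation, params = parse_body(request)
```

The round trip shows the point of the protocol: the request is plain XML text, so any language with an XML library can produce or consume it.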

Making GeNet a Workspace
GeNet API – Use Cases
  – Generating normalized samples using a third party application
  – Updating genomic annotations using a “custom spidering” application
  – Adding sample attachments created by third-party applications

Scenario #1: Biologist analyzes own data
Biologist
• Data pre-processing and normalization
• Statistical Data Analysis
• Data visualization
• Clustering and pathway analysis
• Report Generation
Results
• GeneLists
• P-values
• Fold Change
• Pathways
• Clusters
• Graphs

Example Workflow for Scenario #1
[Diagram labels: Annotated Data, SampleLoader, LIMS, GeneSpring, GeNet, Finished Results, Biologist]

Scenario #2: Bioinformatician analyzes data for Biologist
Bioinformatician
• Data pre-processing and normalization
• Statistical Data Analysis
• Clustering and pathway analysis
Biologist
• Data visualization
• Report Generation
Results
• GeneLists
• P-values
• Fold Change
• Pathways
• Clusters
• Graphs

Example Workflow for Scenario #2
[Diagram labels: LIMS, SampleLoader, Raw Data, GeNet, API, Primary Results, Finished Results, R, SAS, EPI, GeneSpring, Statistician/Bioinformatician, GeNet Viewer, Results, 3rd party tool, Biologist]

Scenario #3: Analysis responsibilities shared
Statistician/Bioinformatician
• Data pre-processing and normalization
• Statistical data analysis
Biologist
• Data visualization
• Clustering and pathway analysis
• Report Generation
Results
• GeneLists
• P-values
• Fold Change
• Pathways
• Clusters
• Graphs

Catering to the statistician and biologist sharing analysis responsibilities
• Statistician/Bioinformatician
  – Can perform initial analysis with both 3rd party applications and GeneSpring
  – 3rd party apps
    • Probe-level analysis
    • Data normalization and QC
    • Statistical tests
  – GeneSpring
    • Data normalization and QC
    • Boolean Filters
    • Statistical tests
• Biologist
  – Can complete analysis with GeneSpring
    • Data visualization
    • Clustering
    • Ontology Builder
    • Homology Tools
    • Sequence support
    • Pathway Analysis
    • Final Report Generation
    • Image Export

Example Workflow for Scenario #3
[Diagram labels: Data & Analyses, LIMS, SampleLoader, Raw Data, Primary Results, GeNet, API, R and D-chip (probe-level analysis), SAS and S+ (data processing, statistical analysis), Statistician/Bioinformatician, GeneSpring, Finished Results, Biologist]

Key GeneSpring Features
• Automated integration of biological information
• Filtering
• Statistics
• Clustering
• PCA
• Class Prediction
• Sequence Analysis
• Pathway Analysis

Automated Ontology Builder
• Builds hierarchical, ontological classifications based on annotation in master gene table
• Genelists categorized by biological process, molecular function and cellular component

Automated Homology Builder
• Homology tables between organisms can be automatically created
• Aids in comparing functionality in model systems
• Aids in comparing identical genes from different technologies

Statistics in GeneSpring
• Basic Statistics
  – Mean
  – Standard Deviation
  – Standard Error
• One-sample t-test p-values with Multiple Testing Correction option
• One-way and Two-way ANOVA with Multiple Testing Correction option
• Post-Hoc tests
• Global Error Model-derived Statistics
• Similar Lists p-values
• Correlation for Similar Samples
• External Program Interfaces to other statistical packages
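
The slides mention a Multiple Testing Correction option without saying which procedures are offered. As one common example, the sketch below implements the Benjamini-Hochberg false discovery rate adjustment in plain Python. Choosing this method is an assumption made for illustration; it is not a statement about what GeneSpring does internally.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One p-value per gene, e.g. from a per-gene t-test (made-up numbers).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

With thousands of genes tested at once, filtering on the adjusted values rather than the raw ones keeps the expected share of false positives in the resulting gene list under control.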

Easy to execute ANOVA tests
• User can execute both 1-way and 2-way ANOVA tests from a simple interface
[Screenshot callouts: Choose 1-way or 2-way test; Choose variable to test; Choose test type; Choose MTC; Choose Post-hoc for 1-way ANOVA; Run test]
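
For readers who want to see what a 1-way ANOVA computes for a single gene, the following Python sketch derives the F statistic from between-group and within-group variation. It is a textbook calculation and not GeneSpring's code. The function name and the sample values are made up, and turning F into a p-value additionally needs the F distribution, which a statistics package would supply.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Expression of one gene under three conditions (made-up values).
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A large F means the condition means differ by more than the replicate noise would explain, which is why the test is run gene by gene and followed by a multiple testing correction.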

Easy to interpret ANOVA results
Results from 2-way ANOVA are returned in a spreadsheet format
Post-hoc test summary by groups
Lists can be saved and viewed in GeneSpring or displayed in a Venn Diagram

An impressive list of clustering options
• Gene Tree
• Condition Tree
• K-means
• SOM
• QT clustering

Principal Components Analysis
• GeneSpring enables the user to easily perform both PCA on genes and PCA on conditions

Class Prediction
• Used to identify genes that discriminate well among phenotypes
• Used for quality control or class discovery
  – Samples representing potential outliers
• Uses K-nearest neighbors algorithm
• Leave-one-out cross validation tests accuracy of prediction rule
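
The combination of K-nearest neighbors and leave-one-out cross validation can be sketched in a few lines of Python. This is a generic illustration and not GeneSpring's class prediction tool: it skips the step of selecting discriminating genes, and it assumes Euclidean distance and a simple majority vote, neither of which the slides specify. The sample profiles and labels are invented.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label for `query` by majority vote among the k nearest
    training samples (Euclidean distance over expression values)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, predict it from the rest, and
    return the fraction predicted correctly."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, profile, k) == label:
            correct += 1
    return correct / len(samples)


# Each sample is (expression profile over two genes, phenotype); made-up data.
samples = [
    ((1.0, 1.1), "control"), ((0.9, 1.0), "control"), ((1.1, 0.9), "control"),
    ((3.0, 3.2), "treated"), ((3.1, 2.9), "treated"), ((2.9, 3.0), "treated"),
]
accuracy = leave_one_out_accuracy(samples, k=3)
```

Because every sample is predicted by a rule that never saw it, the accuracy estimates how the rule would perform on new samples. A sample that is consistently misclassified is also a candidate outlier, which matches the quality-control use mentioned above.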

Sequence Analysis
• Complete sequence information for entire organisms can be loaded and visualized
• Advanced searches for potential regulatory sequences and specific promoters can be performed
• Genes and sequences can be visualized on organism-specific chromosomal maps

Powerful Pathway Analysis
• Powerful capabilities to visualize and manipulate pathways
• Genes automatically placed on KEGG and GenMAPP pathways and linked with expression data
• Pathways are easily converted to genelists for further analysis
• Seamless integration with popular pathway tools, such as Ariadne Genomics’ PathwayAssist
  – Utilizes Natural Language Processing (NLP) to data mine biological publications via PubMed (NIH db)

Utilizing GeNet as a Workspace
• These powerful analysis features in GeneSpring can be easily extended to mine the entire GeNet repository of data
  – “Database” converted to a “Workspace”
• Complex querying through scripting function
• Compute Farm for computationally intensive analysis procedures

Mining Data in the Workspace using GeneSpring
Example: Find Similar Samples
Sample Pool can be chosen from all samples stored locally and on GeNet

Interpreting the Results
• Results show the correlation of every sample in the user-specified pool with the target sample
• Is the sample of interest highly correlated to a sample that previously exhibited toxicity in a different study?
• Is the sample of interest highly correlated to a sample that demonstrated high therapeutic potential?
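
A ranking of this kind can be sketched in Python as follows. The slides do not say which correlation measure Find Similar Samples uses, so Pearson correlation is an assumption here, and the sample names and expression values are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_similar(target, pool):
    """Return (sample name, correlation) pairs, most similar first."""
    scored = [(name, pearson(target, profile)) for name, profile in pool.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical sample pool: one expression value per gene, same gene order.
pool = {
    "tox_study_A": [2.0, 4.0, 6.0, 8.0],
    "vehicle_ctl": [4.0, 3.0, 2.0, 1.0],
    "other_study": [1.0, 3.0, 2.0, 4.0],
}
target = [1.0, 2.0, 3.0, 4.0]
ranking = rank_similar(target, pool)
```

In this toy pool the target tracks `tox_study_A` perfectly, which is the kind of result that would prompt the toxicity question raised on the slide.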

GeNet as a Workspace
Example: Find Similar Gene Lists
• When creating a new gene list in GeneSpring, GeNet is automatically queried to find whether previously created lists are statistically similar to the new list
• Does my genelist share a significant number of members with an important genelist from a previous study?
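
One standard way to ask whether two gene lists overlap more than chance would predict is a hypergeometric tail probability. The Python sketch below illustrates that idea. The slides do not state how GeNet scores list similarity, so both the choice of test and the function name `overlap_p_value` are assumptions, and the gene identifiers are placeholders.

```python
from math import comb


def overlap_p_value(list_a, list_b, universe_size):
    """Return (observed overlap, P(overlap >= observed)) when list_b is drawn
    at random from a universe of `universe_size` genes that contains list_a."""
    a, b = set(list_a), set(list_b)
    observed = len(a & b)
    total = comb(universe_size, len(b))
    upper = min(len(a), len(b))
    # Sum the hypergeometric probabilities for every overlap size
    # from the observed one up to the largest possible.
    tail = sum(
        comb(len(a), k) * comb(universe_size - len(a), len(b) - k)
        for k in range(observed, upper + 1)
    )
    return observed, tail / total


# Placeholder gene identifiers on a hypothetical 10-gene array.
new_list = ["g1", "g2", "g3"]
previous_list = ["g1", "g2", "g4"]
observed, p_value = overlap_p_value(new_list, previous_list, universe_size=10)
```

A small p-value says the shared members are unlikely to be a coincidence given the list sizes and the number of genes on the array, which is the question the slide poses.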

Complex Querying Capabilities
• Leverage the knowledge of the entire organization by performing highly customizable, database-wide queries through the scripting function
• Queries can be performed on a large scale using the entire GeNet repository or on a smaller scale using the individual researcher's sample repository

Automating Analyses through Scripts
Complex and routine analyses with a series of steps can be bundled into one push-button operation with the ScriptEditor™

A Flexible Visual Scripting Language
• Power users create standard scripts using our powerful visual scripting language
• Scripts are easily executed by novice GeneSpring users
• Scripts can execute any external program/algorithm
• Scripts can be bundled within scripts
• All scripts and results can be stored and shared on GeNet

Simple script execution
[Screenshot callouts: Specify inputs using mini-navigator; If needed, specify knob values; Script Description; Choose to run script locally or remotely]

The BioScript Library
Major Categories in BioScript Library
1. QC Filtering
2. Study-Centric Queries
   - Analysis of groups – Multi-group comparison
   - Analysis of series – Single time/dose series analysis
3. Biological Queries
   - Biological fold analysis
   - Gene Ontology (GO) analysis
   - Biological pathway analysis
   - Sequence… promoter analysis
4. Gene-Centric Queries
5. GeNet-Wide Queries
6. Analysis via External Applications

Accessing Scripts in GeneSpring
• The BioScript Library and custom scripts are available in the GeneSpring Navigator or via a connection to GeNet

An architecture designed specifically for high-volume data mining
• Our unique architecture gives users the best of both worlds
  – Flexibility and responsiveness of the desktop
  – High power and administrative ease of the server
• Effectively bypasses disadvantages associated with desktop-only and server-only systems
  – Computational limitations of lower-memory PCs
  – Slowness of server and long wait-times for results

One-click Remote
Computationally intensive analyses can be sent to a compute farm with one click
You’re now free to keep working while the analysis is completed

GeNet Public Data Repository (GPDR)
• Fully annotated, ready-for-analysis public data repository with over 6,500 samples available to GeNet customers

Open and Scalable System
• Open system based on industry standards
• Architecture scales easily to support a large number of users
  – Easy to connect additional clients to GeNet
  – Easy to add additional Remote Servers as computational needs grow
  – SampleLoader can connect to an unlimited number of external data sources
• Powerful scripting language and external program interface allow for further customization and standardization
• API for querying GeNet allows for greater integrative capacities

A look into the future…
• GeNet will intelligently integrate other data types that are valuable to investigate in the context of gene expression data
• Genotyping data integration in prototype stage
  – SNP analysis tool
• Extended support for:
  – Proteomics data
  – Metabolomics data
  – Diagnostic and clinical statistics
  – Sequence data
  – Other data types…