G2P Visual Analytics Group Roadmap
San Diego Meeting, January 2010
Ruth Grene, chair
Greg Abram, co-chair
Bernice Rogowitz, Bjoern Usadel, Lenny Heath, Nick Provart, Eric Lyons, Steve Welch, Tom Brutnell
January 10, 2010 draft

Our process
1. Worked with iPLANT biologists to understand specific experimental and analytical “use cases” as part of the process of uncovering intermediate steps between phenotype and genotype
2. Expressed these processes as high-level “workflow” diagrams, showing the steps they would go through to accomplish a scientific discovery
3. Captured their insights on “gaps”: what technology capabilities would enhance their science?
4. Developed recommendations for the overall iPLANT cyberinfrastructure that would support scientists’ and educators’ current and future needs

Our Workflow Approach
• We worked with iPLANT scientists to express their scientific use cases in terms of data and operations on those data
  – Multiple types of data (e.g., experimental, computed, archival)
  – Multiple types of operations (e.g., analytical, visualization, search)
• A workflow is a pathway of operations on data
  – Operation
  – Data
  – Flow
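The operation/data/flow vocabulary above can be illustrated with a minimal Python sketch. This is not taken from the deck: the `Operation` class, the `run_workflow` helper, the two example operations, and the expression values are all hypothetical, chosen only to show data passing through an ordered sequence of operations.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Operation:
    """One step in a workflow: a named function from data to data."""
    name: str
    func: Callable[[Any], Any]


def run_workflow(operations: List[Operation], data: Any) -> Any:
    """Flow: pass the data through each operation in order."""
    for op in operations:
        data = op.func(data)
    return data


# Hypothetical operations; real steps would call analysis or search tools.
def keep_expressed(genes: dict) -> dict:
    """Analytical operation: keep genes with expression above 1.0."""
    return {g: v for g, v in genes.items() if v > 1.0}


def rank_by_expression(genes: dict) -> list:
    """Search-like operation: gene IDs sorted by descending expression."""
    return sorted(genes, key=genes.get, reverse=True)


workflow = [
    Operation("filter", keep_expressed),
    Operation("rank", rank_by_expression),
]

# Made-up expression values, for illustration only.
expression = {"geneA": 0.4, "geneB": 2.5, "geneC": 1.7}
ranked = run_workflow(workflow, expression)
print(ranked)
```

Because each step is just an entry in a list, swapping one operation for another changes the workflow without touching the rest of it.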

G2P Workflows
• Workflow for Maize Gene Analysis (“Tom’s Workflow”)
• Analysis of Omics Data from a Model Species (“Ruth’s Workflow”)
• Analysis of Gene Expression Data from a Partially Sequenced Species
• Simplified Analysis of Omics Data (proposed first prototype)
• Interactive visualization of gene expression data
• Steve’s analytical workflow
• iPTOL workflow

Workflow for Maize Gene Analysis
[Workflow diagram; elements as labeled]
• Candidate maize gene
• Homolog Finder (e.g., CoGe)
• List of homologous Arabidopsis gene IDs
• Co-Expression Analysis (e.g., ATTED2)
• Expression network of 10 Arabidopsis genes
• Homolog Finder (e.g., CoGe)
• List of 20 homologous maize gene IDs
• Find expression values for these genes (e.g., Next Gen)
• Expression data for 20 maize genes
• Examine clusters that can handle maize data (e.g., eNorthern, MapMan). Note: very limited data for maize, so may need to go to rice
• 5 genes of interest
• For each, examine structure of transcripts and expression over time (e.g., EFP Maize Genome Browser)
• Literature search
• Modeling and Statistical Inference
• iterate
/tb/ber

DNA Subway Metaphor
• Tom’s workflow fits the DNA Subway metaphor very nicely
  – Linear workflow
  – Options can be substituted at various nodes

Workflow for Analysis of Omics Data in a Model Species
• Integrated gene expression and metabolomic data
• Interactive visual and statistical analysis
• Explicit support for iterative what-if analysis
[Workflow diagram; elements as labeled]
• Experiments
• Gene expression data
• Metabolite data
• Identify sub-cellular locations of gene products (e.g., SUBA, Interactome)
• Inferred protein interactions
• Interactive Visual & Statistical Analysis (e.g., ViVA, co-expression analysis, PlantMetGenMap, Cytoscape/Gene Mania)
• Visualize
• Visually-identified, cell-based, network regions of interest
• Visually identified genes and metabolites to map onto functional pathways
• Visualize
• Visually-identified enriched pathways
• Testable hypotheses
• iterate
/rg/ber

But, analysis paths are not always linear…
• Ruth’s non-linear use case, for example, shows the analytical importance of
  – Branching
  – Recursion
  – Back-tracking
• N.B.: The DNA Subway can be conceptualized as a special case of a more general workflow model
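One way to picture a linear workflow as a special case of a more general model is to treat a workflow as a dependency graph of operations. The sketch below is only an illustration under that assumption: it uses Python's standard `graphlib` module (Python 3.9 or later), and the node names, the `run_graph` helper, and the gene IDs are hypothetical. It covers branching and merging only; recursion and back-tracking would need cycles or re-execution, which this acyclic sketch does not handle.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Each node maps to (function, names of the upstream nodes it consumes).
Node = Tuple[Callable[..., Any], List[str]]


def run_graph(nodes: Dict[str, Node], inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run every node after the upstream nodes it depends on."""
    order = TopologicalSorter({name: deps for name, (_, deps) in nodes.items()})
    results = dict(inputs)
    for name in order.static_order():
        if name in results:  # a supplied input, nothing to compute
            continue
        func, deps = nodes[name]
        results[name] = func(*(results[d] for d in deps))
    return results


# Hypothetical operations with made-up gene IDs.
def find_homologs(genes):
    table = {"zm1": "at1", "zm2": "at2", "zm3": "at3"}  # placeholder lookup
    return {g: table[g] for g in genes}


def responsive(genes):
    return [g for g in genes if g != "zm2"]  # placeholder filter


def combine(homologs, responsive_genes):
    return [homologs[g] for g in responsive_genes]


# "genes" branches into two operations whose outputs are merged again.
nodes = {
    "homologs": (find_homologs, ["genes"]),
    "responsive": (responsive, ["genes"]),
    "candidates": (combine, ["homologs", "responsive"]),
}
results = run_graph(nodes, {"genes": ["zm1", "zm2", "zm3"]})
print(results["candidates"])
```

A linear, subway-style workflow is the case in which every node has exactly one upstream node.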

Gene Expression from A Partially Sequenced Species
[Workflow diagram with steps numbered 1 to 7, including 4a and 4b; elements as labeled]
• 1: Experimental exposure of plants to differing conditions
• Ecophysiological data
• Responsive genes
• Meta Annotator: explore known features of these genes (e.g., signaling pathways, eFP, literature)
• Paint responsive genes onto pathways (e.g., MapMan)
• Visually-identified enriched pathways
• Identification of homologs in reference species (e.g., CoGe)
• Homologs in reference
• Compare magnitude of activity across reference pathways (e.g., PageMan, KEGG, GO, MapMan)
• 4b: Identification of candidate homologs that have been reported as co-expressed with reference homologs (e.g., statistical correlation)
• Network of coexpressed genes for reference species
• Formulate mechanistic models
/rg/ber
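The workflow above names statistical correlation as one way to identify co-expressed genes. As a rough sketch of that idea, and not the method of any tool named in the deck, the code below computes Pearson correlation between expression profiles and keeps gene pairs above a threshold. The profiles, the threshold of 0.9, and the function names are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(
    profiles: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Gene pairs whose profiles correlate at or above the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            edges.append((g1, g2))
    return edges


# Made-up expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [3.0, 1.0, 4.0, 2.0],
}
edges = coexpression_edges(profiles)
print(edges)
```

The resulting edge list is the simplest form of a co-expression network; a real analysis would also need to account for sample size and multiple testing.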

Workflow for Interactive Gene Expression Analysis
High-level view of interactive visual analysis workflow
[Workflow diagram; elements as labeled]
• Array experiments: multiple conditions for multiple genes
• Gene expression data
• Interactive visual analysis
• Visually-identified patterns and relationships
/ber

A Workflow for Interactive Visual Analysis
[Workflow diagram; elements as labeled]
• Array experiments: multiple conditions for multiple genes
• Gene expression data
• Common data model
• Parallel Coordinates
• Histogram Stack
• Scatterplot
• Gene Array Visualizer
• Chromosome Map Visualization
• Visually select regions of interest in data for each experiment
• Identify gene expression patterns across multiple experiments
• View and find clusters in two or three variables
• View behavior of, and select genes in, the gene array
• Explore which chromosomes have highest expression (e.g., eQTL)
• Drill-down reveals the underlying components
• Visually-identified patterns and relationships
/ber

Interactive Visual Analysis of Gene Expression Data
[Figure; visible label: strand]

Modeling Work Flow

eQTL Visualization
[Workflow diagram; elements as labeled]
• eQTL data from database
• Method-dependent parameters (from GUI)
• Select phenotypes
• List of phenotypes
• Method-dependent parameters (from GUI)
• Select genome regions
• List of genomic regions
• Specialized visualization component

iPToL Visualization Workflows
1. Large Trees
  – Inputs: phylogenetic tree; tip and node labels
  – Interactive tree visualization, incorporating intelligent filtering (e.g., PhyloWidget, Paloverde)
  – Output: visually-identified patterns and relationships
  – Challenge: visualization tools that support analysis of 100-500K tips, interactive what-if analysis, and “semantic” zoom, in addition to standard zoom, pan, and select functions
2. Tree Comparisons
  – Inputs: phylogenetic tree 1 (e.g., species tree); phylogenetic tree 2 (e.g., gene tree)
  – Two dynamically-linked interactive tree visualizations, highlighting correspondences
  – Output: visually-identified correspondences
  – Challenge: infrastructure that supports semantic “brushing” and linking between different representations

General Observations from our Scientists
– We found that a workflow model worked very well for representing the data and processes in plant genomic research exploration
  • Suitable for representing gene expression, metabolites and signaling components
  • Accommodated different approaches, including, e.g., metabolism, growth habit, and physiology
– Testable hypotheses may be presented visually as linear pathways or as networks
– Having a common structure for representing use cases helped identify similarities and patterns across different biological domains
– The workflow methodology helped scientists clarify and communicate the processes they used
– Using this methodology helped scientists clarify requirements for data and integration
– Most important: this methodology helped scientists identify operations that don’t exist today which could help them create next-generation science

Specific Observation 1: Different User Populations
Need to serve three different types of users
– Plant biologists who are not computer scientists: need sophisticated, easy-to-use analysis and visualization tools
  • Additionally, will serve educational and outreach applications
– Power users (computationally-savvy plant biologists): need an easy-to-use system that allows them to create personalized and customized workflows
– Developers who are creating new tools: need established APIs so that their tools can be easily used by the community

Different Interfaces for Different Users
• Biologists’ View: High-level templates for plant biologists and casual users, with lots of defaults and pre-selected parameters
  [Diagram labels: List of genes; Co-expression analysis; Network]
• Power Users’ View: Visibility into underlying workflows, with freedom to add different data sources, select tools and parameters
  [Diagram labels: List of genes; Metabolites; Co-expression analysis; Statistical analysis tool; Interactive Visual Analysis; Pathways; Network]
• Underlying infrastructure supports both views: provides explicit treatment of underlying data, databases, data integration, tools, operations, parameters, defaults, wrappers, provenance, interconnectivity, access, etc.

Specific Observation 2: Dynamic and Interactive Data Analysis
– Most current tools and methods are static
  • Scientists want to interact with their data, to do interactive “what-if” experiments
  • Scientists want to have access to dynamic, time-varying data, and tools to help them analyze it

Specific Observation 3: Multiple Data Types
• Multiple data types. Scientists want to bring multiple types of data into their analyses, to see patterns across multiple data types, including networks, pathways, sequences, tabular data, images, 3-D, and text

Example: Using PubMed to Integrate Data about Relevant Literature
1. Click on PubMed link for selected term
2. Clips citing selected term
3. Obtain cited article by clicking PubMed icon
Reference: http://brainmaps.org

Specific Observation 4: Exploring links across different visual representations
• Interactive “brushing” (à la ViVA): support for color “painting” in one visual representation that is reflected in other linked representations
  – e.g., painting a metabolic pathway with metabolomic and gene expression magnitude
  – e.g., using interactive visualization to identify genes of interest, and having the relevant sub-cellular structures or pathways automatically highlighted
  – e.g., marking a pathway or tree node and seeing gene expression for that region
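Linked brushing is often built on a shared selection that every view observes. The sketch below shows that pattern in plain Python with no rendering. It is an assumption about how such linking could be structured, not a description of ViVA or any other tool, and the class names, pathway names, and gene IDs are hypothetical.

```python
from typing import Callable, Dict, List, Set


class SelectionModel:
    """Shared set of brushed gene IDs that linked views subscribe to."""

    def __init__(self) -> None:
        self.selected: Set[str] = set()
        self._listeners: List[Callable[[Set[str]], None]] = []

    def subscribe(self, listener: Callable[[Set[str]], None]) -> None:
        self._listeners.append(listener)

    def brush(self, gene_ids: Set[str]) -> None:
        """Replace the selection and notify every linked view."""
        self.selected = set(gene_ids)
        for listener in self._listeners:
            listener(self.selected)


class PathwayView:
    """Highlights pathways that contain at least one brushed gene."""

    def __init__(self, pathways: Dict[str, Set[str]], model: SelectionModel) -> None:
        self.pathways = pathways
        self.highlighted: List[str] = []
        model.subscribe(self.on_selection)

    def on_selection(self, selected: Set[str]) -> None:
        self.highlighted = sorted(
            name for name, genes in self.pathways.items() if genes & selected
        )


class ExpressionView:
    """Shows expression values for the brushed genes only."""

    def __init__(self, expression: Dict[str, float], model: SelectionModel) -> None:
        self.expression = expression
        self.shown: Dict[str, float] = {}
        model.subscribe(self.on_selection)

    def on_selection(self, selected: Set[str]) -> None:
        self.shown = {g: v for g, v in self.expression.items() if g in selected}


# Made-up pathways and expression values.
model = SelectionModel()
pathway_view = PathwayView({"glycolysis": {"g1", "g2"}, "photosynthesis": {"g3"}}, model)
expression_view = ExpressionView({"g1": 0.8, "g2": 2.4, "g3": 1.1}, model)

model.brush({"g2"})  # brushing once updates both views
print(pathway_view.highlighted)
print(expression_view.shown)
```

Because the views know only the selection model and not each other, a new representation can be linked in by subscribing one more listener.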

Specific Observation 5: Annotation and Provenance
• Capturing scientist-identified features. When a scientist identifies a key pathway or relationship, the system should allow him/her to capture that pattern, so that it can later be used for communication, as a template for future analyses, or to search for similar patterns in other data sets.
  – The scientists want tools to help them keep track of where data came from, what operations were done, and why
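Keeping track of where data came from, what was done, and why can be sketched as a log that records each operation as it runs. The example below is a minimal illustration of that idea only. The `ProvenanceLog` class, its fields, the file name, and the `threshold` operation are hypothetical and do not describe any existing provenance system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


@dataclass
class ProvenanceRecord:
    operation: str              # what was done
    parameters: Dict[str, Any]  # with which settings
    source: str                 # where the data came from
    reason: str                 # why the scientist did it
    timestamp: str              # when it happened (UTC, ISO 8601)


@dataclass
class ProvenanceLog:
    records: List[ProvenanceRecord] = field(default_factory=list)

    def run(self, func: Callable[..., Any], data: Any, *,
            source: str, reason: str, **parameters: Any) -> Any:
        """Apply func to data and record what, where from, and why."""
        result = func(data, **parameters)
        self.records.append(ProvenanceRecord(
            operation=func.__name__,
            parameters=parameters,
            source=source,
            reason=reason,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        return result


def threshold(values: List[float], cutoff: float) -> List[float]:
    """Hypothetical operation: keep values at or above the cutoff."""
    return [v for v in values if v >= cutoff]


log = ProvenanceLog()
kept = log.run(
    threshold, [0.2, 1.5, 3.0],
    source="hypothetical_array_experiment.csv",  # made-up file name
    reason="drop low-expression genes",
    cutoff=1.0,
)
print(kept)
print(log.records[0].operation, log.records[0].parameters)
```

A stored sequence of such records could later be replayed on another data set, which is one way a captured analysis might serve as a template.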

Specific Observation 6: Need to Validate Tools and Workflows
• How does a scientist know if a “pre-packaged” component or workflow is valid, appropriate, or computationally correct?
• One suggestion: use “Social Computing” methodologies
  – The community provides ratings, comments and annotations that build knowledge
    • Community of plant biologists and computational scientists
  – For components: users can rate components, discuss their uses, register criticisms, provide comparisons, suggest competing technologies
  – For workflows: users can comment on their uses, suggest extensions, quibble with particular choices, etc.

Specific Observation 7: Scalability and Extensibility
• The scientists need a system that will allow them to continually update, improve and modify the work they do, keeping pace with
  – Larger data sets
  – Improved analytics
  – Alternate methods
  – New tools
  – New data
  – Joining different types of data
• Need for easy methods to substitute components, visualizations, methods, analyses, data, etc.
• The iPLANT cyberinfrastructure has to be able to grow organically and flexibly

Implications for the CyberInfrastructure
• The CyberInfrastructure needs to be:
  – Based on re-usable, composable components
  – Extensible, able to support
    • Updated components
    • New data types
    • New visual and analytics methods
    • Iteration
    • Leaps forward in scale, interactivity and dynamic data
  – Easy to use for plant biologists, power users and developers

VizTrails: example workflow methodology
[Figure labels: Conceptual workflow; Power-user workflow; Provenance and metadata; Interactive visualizations]
• Visual programming interface for representing data and operations as workflows
• Loose coupling, using parameterizable Python wrappers
• Extensible, flexible, re-usable components and workflows
• Coupled with an attractive, flexible user interface (to be developed)
• Supported by “plumbing” infrastructure that provides explicit treatment of underlying data, databases, data integration, tools, interconnectivity, access, etc.
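To make "parameterizable Python wrappers" concrete, the sketch below wraps an arbitrary function behind a uniform call with named defaults that a user may override. It is a generic illustration under stated assumptions and does not use or reproduce the API of the workflow system named on the slide; `ToolWrapper` and `top_n` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToolWrapper:
    """Uniform call interface around a tool, with overridable defaults."""
    name: str
    func: Callable[..., Any]
    defaults: Dict[str, Any] = field(default_factory=dict)

    def __call__(self, data: Any, **overrides: Any) -> Any:
        unknown = set(overrides) - set(self.defaults)
        if unknown:
            raise TypeError(f"{self.name}: unknown parameters {sorted(unknown)}")
        params = {**self.defaults, **overrides}  # overrides win over defaults
        return self.func(data, **params)


def top_n(values, n, descending):
    """Hypothetical tool: the n largest (or smallest) values."""
    return sorted(values, reverse=descending)[:n]


wrapped = ToolWrapper("top_n", top_n, defaults={"n": 2, "descending": True})

# Casual use relies on the defaults; a power user overrides them.
default_result = wrapped([3, 1, 4, 1, 5])
custom_result = wrapped([3, 1, 4, 1, 5], n=3, descending=False)
print(default_result, custom_result)
```

The same wrapper serves both audiences the deck describes: pre-selected defaults for casual users and explicit parameters for power users.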

Multiple Levels
• Conceptual Level: High-level templates for plant biologists and casual users, with lots of defaults pre-selected
• Power-User Level: Visibility into underlying workflows, with freedom to select tools and data sources and program new operations
• Infrastructure Level: The explicit treatment of underlying data, databases, data integration, tools, operations, parameters, defaults, wrappers, provenance, interconnectivity, access, etc.
[Diagram labels: List of genes; Co-expression analysis; Network; List of genes; Interactive Visual Analysis; Statistical analysis tool; APIs; iPLANT Cyberinfrastructure]

[Architecture diagram; elements as labeled]
• iPlant RIC
• iP Visual Programming Interface
• Predefined Visual Programs
• Other iP Visualization Tools
• Other iPlant Tools
• iPlant GUI Application
• Dataflow Engine and Component Set
• Visualization Data Cache
• iPlant Cyberinfrastructure
• iPlant Resource … iPlant Resource

Visual Analytics Roadmap
• Stage 1: Explore workflow methodology with plant biologists, develop sample workflows, and provide insights into requirements for the iPLANT cyberinfrastructure (done)
• Stage 2: Use VizTrails to provide a prototype workflow for a real biology problem using real data (1Q 2010)

Simplified Omics Workflow for First Demo
• Integrated gene expression and metabolomic data
• Interactive visual and statistical analysis
• Explicit support for iterative what-if analysis
[Workflow diagram; elements as labeled]
• Experiments
• Gene expression data
• Metabolite data
• Interactive Visual & Statistical Analysis (e.g., ViVA, co-expression analysis, PlantMetGenMap, Cytoscape/Gene Mania)
• Visually identified genes and metabolites to map onto functional pathways
• Visualize
• Visually-identified enriched pathways
• Testable hypotheses
• iterate
/rg/ber

Some Conclusions
• For plant biologists, the intellectual process is creative and diverse; there is no one-size-fits-all solution
• There is a great hunger to
  – Use existing tools
  – Use components of existing tools
  – Develop new tools
• For visual analysis, no set of tools will be comprehensive
  – Pre-existing tools are very valuable
  – Reusable visualization components will be important
  – New tools, especially more abstract, general tools, will be invaluable

Review of Scientists’ Requirements
• Whatever the direction, we need to support
  – Different users with different needs
  – Dynamic and interactive data
  – Multiple data types
  – Interactive “brushing” or “painting” across all visual representations
  – Annotation and provenance
  – A methodology to validate our components and tools
  – Scalability and extensibility

And most important, we need to…
Provide an analysis environment that will enable new science
– Enable scientists to see relationships across multiple types of data (e.g., integrating gene expression and metabolomics data)
– Enable scientists to do what-if experiments, interactive exploration of static and dynamic data, and to integrate modeling and visualization capabilities
– Enable scientists to mark a region in one representation and see the impact in all other linked representations
– Provide a system that is flexible and extensible, which will grow organically as data grows in volume, and new data, tools, and methodologies emerge
iPLANT should be the platform for the future of plant science, and the choice for future plant scientists

Back-up

“Build” or “Buy” Continuum
[Continuum diagram, running from time consuming/expensive to quick/inexpensive/flexible]
– Build from scratch, e.g., with OpenGL
– Enhance existing software
– Wrap components from existing software
– Wrap existing applications
• Key question: Do we build a fixed system, or do we explicitly design a system that welcomes different applications (entire software packages) or components (e.g., visualizations, algorithms)?
• The workflow approach: building a system that is extensible and evolutionary, able to integrate new functions that we have not anticipated
  – For the first release, identify the tasks and capabilities we want to support. Question: Who decides?
  – Understand the state of the art in commercial and academic systems. Identify candidate applications or components to wrap.
  – Identify gaps, since this will determine the need for new software (e.g., modifying a component to accept plant genomes)
  – Focus development effort on overall design goals (e.g., dynamic linking between representations, translators, wrappers, etc.)
RG/BR 12/16/09

Alternate Simplified Demo
Simplified from the Gene Expression from A Partially Sequenced Species workflow
[Workflow diagram with numbered steps (1, 2, 3, 4, 6); elements as labeled]
• 1: Experimental exposure of plants to differing conditions
• Ecophysiological data
• Responsive genes
• Paint identified genes onto pathways (e.g., MapMan)
• Identification of homologs in reference species (e.g., CoGe)
• Homologs in reference
• Explore known features of reference homologs from the literature
• Formulate testable hypotheses
/rg/ber

Other data sources to be Incorporated
1. Motifs from Regulatory Regions in Model Species
2. Cell-specific Expression
3. Pathways Wiki: place gene(s) of interest in established pathways
4. Metabolites: incorporate information from Reactome
5. Literature: PubMed Assistant???
6. Displays of inferred regulatory networks, as in Gene Mania
7. Small RNAs and target genes

Specific Requirements for Visual Analysis
• Multiple data types. Ability to see patterns across different types of data, including networks, pathways, sequences, tabular data, images, 3-D, text.
• Multiple modes of interaction, including static visualizations, interactive “what-if” visual analyses, multiple time slices, dynamic data.
• Interactive “brushing”. Support for color “painting” in one visual representation that is reflected in other linked representations (e.g., using interactive visualization to identify genes of interest, and have the relevant sub-cellular structures or pathways automatically highlighted).
• Capturing scientist-identified features. When a scientist identifies a key pathway or relationship, the system should allow him/her to capture that pattern, so that it can later be used for communication, as a template for future analyses, or to search for similar patterns in other data sets.