
  • Number of slides: 56

Providing an environment where every data-driven researcher will thrive
Professor Carole Goble
carole.goble@manchester.ac.uk
University of Manchester, UK

• Pipelines
  – Scientific workflows over (web) services
  – Data pipelines, model population and validation, simulation sweeps
  – Distributed, federated datasets and analyses combined with local datasets and analysis
  – Opening up resources.
• e-Laboratories
  – Crowd-sourcing, group curating and sharing/reusing scientific assets.
  – Web 2.0 and Semantic Web.
  – Social networking, community content, collaborative filtering
  – Sharing and exchanging “Research Objects”
  – Opening up capabilities and capacity.

• Pan European collaboration.
• Systems Biology of Microorganisms: 13 projects, 91 institutes
  – Different research outcomes
  – A cross-section of microorganisms, incl. bacteria, archaea and yeast.
• Record and describe the dynamic molecular processes occurring in microorganisms by computerized mathematical models.
  – Modellers meet experimentalists
• Pool research capacities, data, models and know-how.
• Retrospectively.
http://www.sysmo.net
BaCell-SysMO, COSMIC, SUMO, KOSMOBAC, SysMO-LAB, PSYSMO, Valla, MOSES, TRANSLUCENT, STREAM, SulfoSYS + two more

Data-driven
• Multiple ‘omics
  – genomics, transcriptomics
  – proteomics, metabolomics
• Images, Reaction Kinetics
• Models
  – SBML, Agent-based, Mechanics based
• Analysis of data
Data sets + experiments + models

Systems biology workflows in MCISB

• High throughput experimental methods
• Public data sets (e.g. EBI)
• Web Services
• ~1400 NAR January Issue
Little Data
• Little databases
• Lab books
• Spreadsheets
• Private and Shared.
• Proliferation
• Derived data
• Long tail.

Big Data. Group Science. Data services. Access.
“Little” Data. “Local” Science. Publish. My Datasets. My Analytics.

Massive decentralisation – wikis, sticks, spreadsheets
Massive centralisation – commons, clouds, curated core facilities
Tremendous fragility
Digital Dust in Data Tombs

Picking Pain Points. Keeping it Real.
• Project Directors
  – Data remains with us under our control.
  – We control who sees what.
  – Just enough exchange.
• SysMO PALs
  – Spreadsheets.
  – Yellow Pages.
  – Standard Operating Procedures.

An education
Modellers vs Experimentalists
Computational thinking
Systems thinking

Gray’s Laws (modified)
• Working Now, Working to working
  – Gateways and ramps
  – Jam today, jam tomorrow
  – Just enough, just in time
  – Work with what you've got already
• 20 questions
  – Is there any group generating kinetic data?
  – Is this data available?
  – Who is working with which organism?
  – What methods are being used to determine enzyme activity?
  – Under which experimental conditions are my partners measuring glucose concentration?

Help people search for and find stuff
Data Services Processes Models Software Experts

Interlinking ASSETS CATALOGUE
SysMO SEEK Assets Catalogue. Archive. Social Network. Sharing Space. Gateway.
• Yellow Pages
  – People. Expertise. Projects. Institutions. Facilities. Studies.
• Data
  – Experimental data sets and analysed results.
  – Gateway to data stores – SABIO-RK, ‘omics
• Models
  – Store. Simulate. Publish. Curate.
  – Gateway to COPASI, JWS Online, BioModels.
• Processes
  – Laboratory protocols – Standard Operating Procedures
  – Bioinformatics analyses – computational workflows – Taverna
  – Model population and validation – workflows – Taverna
  – Gateway to myExperiment, MolMeth, OpenWetWare…

Linking data to process
Standard Operating Procedures
Models
Software
Provenance
The Lab Book
Retrospective method reconstruction
The myth of reproducible science

• Scientists willing to share methods and protocols.
• SOPs an early win.
• Defined standard metadata model based on Nature Protocols.
• Seeded.

Linking data with stuff
• Research Objects for packaging and exchanging Assets
  – Workflows linked to models linked to data linked to SOPs
  – Encapsulate community standards
  – Mixed resources: External and central.
  – Trust
  – “Preservation Packet”
  – Bechhofer et al 2010, forthcoming in The Future of The Web for Collaborative Science 2010.
• SBRML
  – Systems Biology Results Markup Language
  – To tie to the SBML

At the coal-face
The Spreadsheet. The Content Management Systems.
Legacy assets are assets.
Metadata ramps.

The Content Management System
• Lightweight and flexible. Low take-on, hidden operations costs. Knowledgeable Civilians. Looks nice.
• Anarchy amenable.

SysMOLab Spreadsheets
• Template distribution
• Template mapping

Everyone wants metadata. No one wants to collect it.
Standards mayhem
Metadata millstones
Most data is thrown away.
Metadata for my sake
Metadata compliance by stealth
Preparation for publishing

Minimum Information Models
63% 47%
CIMR: Core Information for Metabolomics Reporting
MIABE: Minimal Information About a Bioactive Entity
MIACA: Minimal Information About a Cellular Assay
MIAME: Minimum Information About a Microarray Experiment
MIAME/Env: MIAME / Environmental transcriptomic experiment
MIAME/Nutr: MIAME / Nutrigenomics
MIAME/Plant: MIAME / Plant transcriptomics
MIAME/Tox: MIAME / Toxicogenomics
MIAPA: Minimum Information About a Phylogenetic Analysis
MIAPAR: Minimum Information About a Protein Affinity Reagent
MIAPE: Minimum Information About a Proteomics Experiment
MIARE: Minimum Information About a RNAi Experiment
MIASE: Minimum Information About a Simulation Experiment
MIENS: Minimum Information about an ENvironmental Sequence
MIFlowCyt: Minimum Information for a Flow Cytometry Experiment
MIGen: Minimum Information about a Genotyping Experiment
MIGS: Minimum Information about a Genome Sequence
MIMIx: Minimum Information about a Molecular Interaction Experiment
MIMPP: Minimal Information for Mouse Phenotyping Procedures
MINI: Minimum Information about a Neuroscience Investigation
MINIMESS: Minimal Metagenome Sequence Analysis Standard
MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
MIPFE: Minimal Information for Protein Functional Evaluation
MIQAS: Minimal Information for QTLs and Association Studies
MIqPCR: Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM: Minimal Information Required In the Annotation of biochemical Models
MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA: Standards for Reporting Enzymology Data
TBC: Tox Biology Checklist
BioPAX: Biological Pathways Exchange, http://www.biopax.org/
FuGE: Functional Genomics Experiment
MIBBI: Minimum Information for Biological and Biomedical Investigations
MGED: Microarray Experimental Conditions
http://www.mibbi.org/index.php/MIBBI_portal

Just Enough Results Model
• Harvest standards e.g. MIAME (MIBBI.org)
• Analyse consortium schemas and spreadsheets
• JERMs for each data type – microarray, metabolomics, proteomics…
• Map project data sources to JERMs.
• Distribute JERM spreadsheet templates
“I only want to collect and share just enough results”
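
The "map project data sources to JERMs" step can be pictured as a column-renaming table per project. The sketch below is illustrative only: the field names in `JERM_MICROARRAY` and the headers in `PROJECT_COLUMN_MAP` are hypothetical, since the slides do not list the actual JERM fields.

```python
# Hypothetical "just enough" field set for one data type (not the real JERM).
JERM_MICROARRAY = ("organism", "sample_id", "platform", "measurement")

# Hypothetical mapping from one project's spreadsheet headers to JERM fields.
PROJECT_COLUMN_MAP = {
    "Species": "organism",
    "Sample": "sample_id",
    "Array type": "platform",
    "Signal": "measurement",
}


def to_jerm(row, column_map=PROJECT_COLUMN_MAP, fields=JERM_MICROARRAY):
    """Rename a project row's columns to JERM fields and list the fields still missing."""
    mapped = {column_map[key]: value for key, value in row.items() if key in column_map}
    missing = [field for field in fields if field not in mapped]
    return mapped, missing


# A project row with one unmapped column ("Operator") and no platform recorded.
row = {"Species": "Bacillus subtilis", "Sample": "S-01", "Signal": 3.2, "Operator": "anon"}
record, missing = to_jerm(row)
```

Columns outside the mapping are simply dropped, and the `missing` list shows which template cells a project would still need to fill in.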

JERM Spreadsheets Templates
Controlled vocabulary plug in
• RDF for ripping, mashing and comparing spreadsheets.
• A little semantics goes a long way
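
One way to read "ripping, mashing and comparing" is to turn each spreadsheet cell into a subject/predicate/object triple and compare the resulting sets. This is a minimal sketch using plain Python tuples rather than an RDF library; the `BASE` namespace, column names and sample values are assumptions made up for illustration.

```python
import csv
import io

BASE = "http://example.org/jerm/"  # hypothetical namespace, not a SysMO URI


def rows_to_triples(csv_text, key_column):
    """Turn each non-empty cell into a (subject, predicate, object) triple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + "sample/" + row[key_column]
        for column, value in row.items():
            if column != key_column and value:
                triples.add((subject, BASE + column, value))
    return triples


# Two made-up spreadsheets that follow the same template.
lab_a = "sample_id,organism,glucose_mM\nS1,E. coli,5.0\nS2,B. subtilis,2.5\n"
lab_b = "sample_id,organism,glucose_mM\nS1,E. coli,5.0\nS2,B. subtilis,3.0\n"

a = rows_to_triples(lab_a, "sample_id")
b = rows_to_triples(lab_b, "sample_id")
shared = a & b      # statements both labs agree on
only_in_a = a - b   # statements that differ or are absent in lab B
```

Because both sheets share a template, set intersection and difference are enough to find where two labs agree or disagree on a sample.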

Reward curation
Local curation at the point of capture – ISA-TAB for ‘omics.
Centralised curation – SBML, CellML, SBO
Automated curation.
Which data is worth curating?

• Blue-Collar Science.
• Curator Credit
• Curator Career
• Funding.
• Personal and institutional visibility
• Scholarly citation metrics
• Federate workloads
• Unpopular with the big data providers.
www.biocurators.org

Commons-based Quality Control.

Progressive Curation: “lazy evaluation” metadata
Just enough, Just in time
Jam today and Jam tomorrow
Pain vs Gain: Very BAD. Just right. Good, but Unlikely.

Sensitive sharing. Collaborate to compete
Good reasons not to.
Just enough just in time sharing.
Data kept at host. Registered centrally through harvesting.
Pre-Publication sharing vs Publication
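
"Data kept at host, registered centrally through harvesting" can be sketched as a central registry that copies only metadata and a pointer to where the data lives. The record fields, the two visibility levels and the example URLs below are hypothetical; the slides do not describe the actual SysMO-DB schema or permission model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    title: str
    owner: str
    location: str    # pointer to the data at the host; the registry holds no data
    visibility: str  # hypothetical levels: "project" or "public"


def harvest(host_catalogue):
    """Copy metadata (never the data itself) from a host catalogue into the registry."""
    return [
        RegistryEntry(
            title=item["title"],
            owner=item["owner"],
            location=item["location"],
            visibility=item.get("visibility", "project"),  # default to restricted
        )
        for item in host_catalogue
    ]


def visible_titles(registry, is_project_member):
    """List the titles a user may see: everything for members, public entries otherwise."""
    allowed = {"public", "project"} if is_project_member else {"public"}
    return [entry.title for entry in registry if entry.visibility in allowed]


host_catalogue = [
    {"title": "Glucose time course", "owner": "Lab A",
     "location": "https://lab-a.example.org/data/17"},
    {"title": "Published kinetics set", "owner": "Lab B",
     "location": "https://lab-b.example.org/data/4", "visibility": "public"},
]
registry = harvest(host_catalogue)
```

Defaulting unlabelled entries to the restricted level mirrors the "just enough" stance: pre-publication data is discoverable inside the project without leaving the host.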

Rewards: Competitive advantage. Academic vanity. Adoption. Reputation.
Risks: Scrutiny. Being scooped. Misinterpretation. Reputation. Legal issues.
Nature 461, 145 (10 September 2009) | doi:10.1038/461145a

Just Enough Sharing
Access Permissions
Reusing myExperiment

Reward sharing and reusing not reinventing.
Technically. Culturally. Institutionally.
Credit and Risk Mitigation.

Reward and Provenance
Attribution. Trust. Credit
Reusing myExperiment

Some pretty key things
• Data citation
• Stable and shared ids and names
  – A nightmare.
  – Sharednames.org
  – Biosharing.org
• Versioning and Provenance
  – Models, software, data sets
  – Ensembl web service doesn’t report version number.

Data commons, Data havens
For data after the project has ended.
For the common good
Beth’s Provenance Objects or me.
Tidy and untidy data.
Bio2RDF

Access and availability of data and data analysis resources
Web services underpin the ESFRI ELIXIR programme.
Interfaces that are understandable and stable. Designed for people too.
No access, no tools, no point (Keith Haines)
Deposition to community databanks that minimise pain.

What is it? Is it working?

Data analysis, model population and data pipelining ramps.
Crossing the adoption chasm
There is a world of complexity for data preparation, processing and analysis
Science Informatics Sweatshops.
E-Laboratories. Workflows. Portals.
Pre-cooked processes and process templates. Pre-cooked interfaces.
Training.

Lymphoma Prediction Workflow
caArray: MicroArray from tumor tissue
Microarray preProcessing
Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.
Lymphoma prediction
GenePattern
Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT), Wei Tan (Univ. Chicago)

myExperiment Communities
• Supermarket shoppers
• Tool builders
• Trainers and Trainees

Drop and Compute (Ian Cottam)
Local folder synchronised and shared via cloud
Condor job submitted by drag and drop
Results appear in Dropbox

Bashing against local IT
NO – you can’t access that datastore / run your analysis.
Joined up thinking.

Data + Publications
Data trapped in documents
Supplemental information
Text mining workflows
Text mining to find method and controls

Reflect. Elsevier Challenge Winner 2009

Manual and Auto-mark up [Oscar-3]

Do not underestimate the power of Interactive Visualisation and Browsing
Pre-cooked complex queries. Navigation.
With my data. At the click of a button.

• Distributed Annotation Service
• Upload and overlay my data

SysMO summary
• Providing an environment where every data-driven researcher will thrive
• Reality is messy.
  – Extreme Technology Determinism vs Voluntarist Sociocultural shaping
• Extreme and continuous partnership with users.
  – Act Local Think Global
• Agile development environment facilitated stream of features to tackle pain points.
  – Leverage other e-Laboratories, Maintaining scientists’ buy-in.
• Socio-Political Axis dominates the Technical Axis.
  – Collaboration evolutions, Confidence in exchange.

Six Action Plan Areas
Coordination. Sustainability. Data. Adoption. Capacity. Interoperability.

Capacity building of our skills base
• Influence training and capacity building programmes.
• Promote training for young and mid-career researchers and research technologists.
• Enable mixed skilled research teams to include research and information technologists.
• Value and reward highly skilled research and information technologists within HE institutions with a career structure.

Data Silo culture
Funding silos
Discipline silos

Academic Credit and Risk Mitigation for sharing, curating, and reusing not reinventing

Data and Software is free like puppies are free

Wolfgang Müller, Sergejs Aleksejevs, Carole Goble, Isabel Rojas, Olga Krebs, Katy Wolstencroft, Stuart Owen, Jacky Snoep, Finn Bacal
EML Research gGmbH, Germany
University of Manchester, UK
University of Stellenbosch, South Africa

Links
• myGrid Project – http://www.mygrid.org.uk
• SysMO-DB – http://www.sysmo-db.org
• myExperiment – http://www.myexperiment.org
• Taverna – http://www.taverna.org.uk
• JWS Online – http://jjj.biochem.sun.ac.za/
• SABIO-RK – http://sabio.villa-bosch.de/