E-science grid facility for Europe and Latin America
Computational challenges on Grid Computing for workflows applied to Phylogeny
R. Isea (1), E. Montes (2), A. J. Rubio-Montero (2) and R. Mayo (2)
(1) Fundación IDEA (Venezuela), (2) CIEMAT (Spain)
IWPACBB 2009, Salamanca, June 12th, 2009
www.eu-eela.eu
Outline
• Phylogenetics: a reminder
• Challenges in Phylogenetics
  – Computational methods: MrBayes
  – Exploiting Grid technology
• MrBayes and Bioinformatics resources on the Grid
• The PhyloGrid approach
  – General description and objectives
  – Taverna workflow
  – GridSphere portal
  – Future work: GridWay metascheduler
• Some results: HPV case study
• Summary and conclusions
Phylogenetics: a reminder
• Phylogeny: reconstruction of the evolutionary history (evolutionary tree) of organisms
  – Influence and relationship between species
  – Evolution of selected populations
• Applications in Life Sciences, Industry, etc.:
  – Knowing the real history of evolution: Tree of Life
  – Drug discovery
  – Tracing geographical origin, dating the introduction of strains
  – Prediction of gene and protein function
  – Epidemiological studies
[Figure: in July 1837 Darwin drew his first known sketch of an evolutionary tree]
[Figure: complete Tree of Life]
Computational problem: so many trees…
• Number of possible labelled topologies with n species or taxa, for rooted trees and for unrooted trees
[Table: Nº of taxa | Nº of rooted trees | Nº of unrooted trees]
• Exhaustive enumeration of all possible phylogenies is not computationally feasible
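As a rough illustration of why exhaustive enumeration is not feasible, the counts can be computed with the standard textbook formulas for fully resolved (bifurcating) labelled trees: (2n-3)!! rooted and (2n-5)!! unrooted topologies for n taxa. These formulas are brought in here as an assumption, since the formulas and table shown on the slide are not available in this text; the function names are illustrative only.

```python
def odd_double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * ... * 3 * 1 for odd k (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labelled topologies: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n_taxa - 3)


def unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, bifurcating, labelled topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    # Print how quickly the search space grows with the number of taxa.
    for n in (3, 5, 10, 20, 50):
        print(n, rooted_trees(n), unrooted_trees(n))
```

Already at 10 taxa there are 34,459,425 rooted topologies, and the count grows faster than exponentially, which is why heuristic or sampling methods are used instead of enumeration.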
Computational methods
• Phenetics: no evolutionary model
  – Distance-matrix based methods (Neighbour-Joining)
• Cladistics:
  – Maximum Parsimony (not statistically consistent)
  – Maximum Likelihood
  – Bayesian inference (Markov Chain Monte Carlo): simulation techniques for approximating the posterior probability distribution of trees
• MrBayes (http://mrbayes.csit.fsu.edu)
  – Sequential and parallel implementations (MPI enabled)
  – High CPU and memory consumption:
    § 50 taxa: simulation of 250,000 generations ~ 50 hours on a P4 2.8 GHz
    § 2900 sequences of HIV-1: computational challenge
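The idea behind Markov Chain Monte Carlo can be sketched with a toy Metropolis sampler. This is not how MrBayes is implemented: here the "trees" are just four hypothetical candidate topologies with made-up unnormalised posterior weights, and the proposal simply picks a candidate uniformly at random. The point is only that the fraction of generations spent on each candidate approximates its posterior probability.

```python
import random
from collections import Counter


def metropolis_sample(weights, n_generations, seed=0):
    """Toy Metropolis sampler over a small discrete set of candidates.

    weights: positive, unnormalised posterior weights (hypothetical values).
    Returns the fraction of generations spent on each candidate.
    """
    rng = random.Random(seed)
    current = rng.randrange(len(weights))
    visits = Counter()
    for _ in range(n_generations):
        # Symmetric proposal: pick any candidate uniformly at random.
        proposal = rng.randrange(len(weights))
        # Accept with probability min(1, posterior ratio).
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
        visits[current] += 1
    return [visits[i] / n_generations for i in range(len(weights))]


if __name__ == "__main__":
    weights = [1.0, 2.0, 3.0, 4.0]  # hypothetical posterior weights
    freqs = metropolis_sample(weights, n_generations=200_000, seed=1)
    total = sum(weights)
    for w, f in zip(weights, freqs):
        print(f"target {w / total:.3f}  sampled {f:.3f}")
```

In a real phylogenetic analysis the state space is the set of trees plus model parameters, and each generation requires a likelihood evaluation over the whole alignment, which is where the CPU cost quoted above comes from.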
Challenges for Bioinformatics
• Still a computational problem
  – Part of the scientific community: inefficient local facilities
  – Rise in provision of HPC facilities: additional skills required
• A different approach to accessing computing infrastructures irrespective of their location: Grid Computing
Why Grid Computing?
• Grids represent a powerful new tool for e-Science
  – Provide seamless sharing of computing and storage resources
  – Enable the creation of scalable VOs: Biomed VO
  – Service Grids (EGEE, EELA) and Opportunistic Grids
• Benefit for applications demanding non-trivial computing capabilities
• Local and remote computing and storage facilities
Bioinformatics Grid resources
• Wide range of Bioinformatics resources through Web interfaces:
  – Public database projects (genomes, proteins, etc.):
    § EMBL-EBI (UK), NCBI (USA), DDBJ and PDBJ (Japan), etc.
  – Web services for Bioinformatics toolkits:
    § EBI web services, NCBI Entrez Utils, DDBJ, BioMoby services
  – Bioinformatics Web services index/registry servers:
    § EMBRACE service registry (BioCatalogue), BioMoby Central Registry
• Grid-enabled software packages:
  – EELA-2: grEMBOSS (UNAM)
• Grid portals to mask applications
  – Genius, GridSphere
• Grid infrastructures & VOs
  – EGEE related: Biomed, GENE, EELA-prod VOs
  – myGrid, caBIG, TeraGrid
How to access MrBayes on the Grid
• Simply sending a standard job to a site
  – Software must be preinstalled at the sites
  – Successfully tested in several projects
    § National Grid Service (UK)
    § FIRB LIBI “International Laboratory for Bioinformatics” project (Italy)
    § BioinfoGRID project
    § EELA: MPI version installed and tested at the EELA-CIEMAT site
  – Supported by EELA-2/EGEE sites
• Grid bureaucracy: certificates, VOs, etc.
  – Biologists are usually not advanced Grid users
• Need for friendly interfaces to Grid facilities
PhyloGrid aim
Offer the scientific community an easy interface for calculating phylogenies on the Grid without requiring the user to know the computational procedure:
  – Based on the MPI-enabled version of MrBayes
    § By means of a Taverna workflow
  – Takes advantage of the computational power of current Grid infrastructures
The use of Taverna workflows:
  – Allows multiple database selection
  – Extendable with access to complementary tools (ClustalW-MPI) or other workflows (myExperiment repository)
PhyloGrid architecture
[Diagram components: user with portal certificate; GridSphere Portal + WF Enactor/Engine; gLite UI + Submission WS; gLite Grid (WMS, CE, WNs, SE, LFC Catalog). Labelled links: HTTPS, SOAP, Grid protocols]
Taverna Workflow Management System
• A bioinformatician can easily implement Grid workflows without Grid skills
• Public workflow repository (myExperiment)
• Several plugins to use WS
  – myGrid, caBIG, GridSAM, BioMoby
  – Many public databases
  – GT4 services and gRavi developer framework
• Many tools/plugins
  – Manipulating files, format converter, local and remote execution, visualization applets, tools for accessing WS
PhyloGrid Workflow for MrBayes
• Input parameters received from the GridSphere portal
• Converts ALN/ClustalW, PHYLIP and MSA input to NEXUS format
• Builds the NEXUS file for MrBayes
• Creates the JDL file
• Job submission
• Nested workflow checks Grid job execution
• Gets output from the SE
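The "build NEXUS file" and "create JDL file" steps can be sketched as plain text generation. This is an illustrative sketch, not the PhyloGrid implementation: the function names, the wrapper script name `run_mrbayes.sh`, the file names and the chosen MrBayes settings are all assumptions. The `mrbayes` block uses ordinary MrBayes commands (`lset`, `mcmc`, `sump`, `sumt`), and the JDL uses common gLite attributes whose accepted values varied between gLite releases and sites.

```python
def build_nexus(alignment, ngen=250000, samplefreq=100, nchains=4):
    """Build a NEXUS file (data block + mrbayes block) from aligned DNA.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    rows = "\n".join(
        f"    {name.ljust(width)}  {seq}" for name, seq in alignment.items()
    )
    return (
        "#NEXUS\n"
        "begin data;\n"
        f"  dimensions ntax={len(alignment)} nchar={nchar};\n"
        "  format datatype=dna missing=? gap=-;\n"
        "  matrix\n"
        f"{rows}\n"
        "  ;\n"
        "end;\n"
        "begin mrbayes;\n"
        "  set autoclose=yes nowarn=yes;\n"
        "  lset nst=6 rates=invgamma;\n"  # assumed model choice (GTR+I+G)
        f"  mcmc ngen={ngen} samplefreq={samplefreq} nchains={nchains};\n"
        "  sump;\n"
        "  sumt;\n"
        "end;\n"
    )


def build_jdl(nexus_name, wrapper="run_mrbayes.sh", nodes=8):
    """Build a gLite-style JDL description for an MPI job (assumed values)."""
    return (
        "[\n"
        f'  Executable = "{wrapper}";\n'  # hypothetical wrapper script
        f'  Arguments = "{nexus_name}";\n'
        '  JobType = "MPICH";\n'
        f"  NodeNumber = {nodes};\n"
        '  StdOutput = "std.out";\n'
        '  StdError = "std.err";\n'
        f'  InputSandbox = {{"{wrapper}", "{nexus_name}"}};\n'
        '  OutputSandbox = {"std.out", "std.err"};\n'
        "]\n"
    )


if __name__ == "__main__":
    toy = {"seq_A": "ACGT-ACG", "seq_B": "ACGTTACG", "seq_C": "ACGATACG"}
    print(build_nexus(toy, ngen=1000))
    print(build_jdl("toy.nex"))
```

In this sketch only the standard output and error files come back through the output sandbox; the wrapper script would be responsible for copying the tree files to the SE, from where the last workflow step retrieves them.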
GridSphere portal
• PhyloGrid web portal built on top of the GridSphere portal framework (http://www.gridsphere.org):
  – A Grid portal improves usability of Grids
    § Hiding the complexity of the technology involved
  – A Grid portal improves utilization of Grids
    § Providing an appealing, user-friendly Web interface
    § Enforcing Grid utilization policies
      • PKI security, etc.
• Cohesive Grid portals
[Figure: snapshot of the virtual work area of the PhyloGrid portal with some results]
Future work: GridWay
• The JDL job approach
  – Hard to handle job errors within a Taverna workflow
  – gLite plugin for Taverna is under development
    § Taverna must be installed on a UI, or
    § Use remote execution on a UI (Taverna remote workflow enactor)
• GridWay metascheduler
  – Characteristics
    § Fully compatible with gLite-based Grids (EELA-2, EGEE)
    § Better resource selection based on internal statistics
    § Automatic migration and rescheduling of failed jobs
    § Checkpointing management for long-running tasks
  – Taverna binding implementation:
    § WS GRAM interface deployed over GridWay
    § By means of GT4 plugins or by directly implementing a JSDL plugin
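The kind of error handling that is awkward to express in a workflow, and that a metascheduler automates, can be sketched as a poll-and-resubmit loop. This is a generic illustration, not GridWay's API or the PhyloGrid nested workflow: `submit` and `get_status` are hypothetical callables supplied by the caller, and the status strings are assumptions.

```python
import time


def run_with_resubmission(submit, get_status, max_attempts=3, poll_seconds=0.0):
    """Submit a job, poll until it finishes, and resubmit it on failure.

    submit():           hypothetical callable returning a job identifier.
    get_status(job_id): hypothetical callable returning "RUNNING",
                        "DONE" or "FAILED".
    Returns (job_id, attempt_number) of the successful run.
    """
    for attempt in range(1, max_attempts + 1):
        job_id = submit()
        while True:
            status = get_status(job_id)
            if status == "DONE":
                return job_id, attempt
            if status == "FAILED":
                break  # leave the polling loop and resubmit
            time.sleep(poll_seconds)
    raise RuntimeError(f"job failed after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated grid: the first job fails, the second one succeeds.
    job_ids = iter(["job-1", "job-2"])
    script = {
        "job-1": iter(["RUNNING", "FAILED"]),
        "job-2": iter(["RUNNING", "RUNNING", "DONE"]),
    }
    print(run_with_resubmission(lambda: next(job_ids),
                                lambda job: next(script[job])))
```

A metascheduler additionally chooses a different resource for the retry and can restart from a checkpoint, which this sketch does not attempt.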
HPV case study with PhyloGrid
• HPV is a recognized underlying factor in cervical cancer:
  – 90% of cases show infection by some HPV strain
• Complete HPV nucleotide sequences are about 8000 bases long:
  – E1, E2, E4-E7 early expression genes and L1, L2 late expression genes
  – HPV classification according to L1 variability (> 100 types)
  – Two different categories with respect to oncogenic potential
• Study: check whether this categorization really fits the evolutionary history of HPV
  – 121 HPV sequences
  – Molecular phylogenetic calculations for the L1, L2 and E7 genes
Results obtained with PhyloGrid
Molecular phylogeny of HPV in oncogenes from L1, L2, E7
• 121 HPV nucleotide sequences of L1 (the major capsid gene)
• Phylogenetic tree for L1
• Thicker lines mark differences between this tree and the tree derived from the L2 gene
• Topology similarity score of 85% between L1 and L2
• Conflict with the HPV classification based on variability of the L1 gene
[Figure: phylogenetic tree for L1]
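The slides do not say how the topology similarity score was computed. As one possible illustration only, two rooted trees over the same taxa can be compared by the fraction of clades (sets of descendant leaves) they share. The tree encoding as nested tuples, the function names and the scoring formula below are assumptions made for this sketch, not the measure used in the study.

```python
def clades(tree):
    """Return the set of non-trivial clades of a rooted tree.

    tree: nested tuples with string leaves, e.g. (("A", "B"), ("C", "D")).
    Each clade is a frozenset of leaf names; the root clade is excluded.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by construction
    return found


def shared_clade_fraction(tree_a, tree_b):
    """Score in [0, 1]: 2 * |shared clades| / (|clades A| + |clades B|)."""
    a, b = clades(tree_a), clades(tree_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    balanced = (("A", "B"), ("C", "D"))
    ladder = ((("A", "B"), "C"), "D")
    print(shared_clade_fraction(balanced, balanced))  # identical trees
    print(shared_clade_fraction(balanced, ladder))    # share only {A, B}
```

For unrooted trees the same idea is usually applied to bipartitions of the leaf set rather than clades.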
Summary and conclusions
• PhyloGrid is a tool for phylogenetic studies on the Grid by means of MPI-enabled MrBayes:
  – Friendly interface (GridSphere portal): no computational or Grid skills required to perform calculations
  – Automation of tasks: Taverna workflow
• PhyloGrid takes advantage of the computational power of current Grid infrastructures
  – Allowing phylogenetic analysis on a large scale
  – Reducing the technological divide that part of the scientific community faces in accessing computational platforms such as the Grid
Thanks for your attention
Contact
R. Isea (1): raul.isea at gmail.com
E. Montes (2): esther.montes at ciemat.es
A. J. Rubio-Montero (2): antonio.rubio at ciemat.es
R. Mayo (2): rafael.mayo at ciemat.es
http://www.ciemat.es/portal.do?IDR=1481&TR=C
(1) Fundación IDEA (Venezuela), (2) CIEMAT (Spain)


