


The Encyclopedia of Life (EOL) Project
An initiative to analyze and provide annotation for putative protein sequences from all publicly available genome data

Baldridge, K.; Baru, C.; Bourne, P.; Clingman, E.; Cotofana, C.; Ferguson, C.; Fountain, A.; Greenberg, J.; Jermanis, D.; Li, W.; Matthews, J.; Miller, M.; Mitchell, J.; Mosley, M.; Pekurovsky, D.; Quinn, G. B.; Rowley, J.; Shindyalov, I.; Smith, C.; Stoner, D.; Veretnik, S.
San Diego Supercomputer Center, MC 0505, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA

The Need for Protein Annotation

Accompanying the massive supply of genomic data is a need to annotate proteins from structural and functional points of view. Questions that researchers look to answer using the massive amount of new genomic data include:
- What other genomic proteins are similar to the protein that I am researching?
- What level of conservation is there for a particular protein sequence across species?
- Which protein domains are common to various protein sequences?
- What is the likely cellular location of a specific protein or class of proteins?

On a limited basis, researchers are able to manually perform BLAST searches, sequence analysis, and data collation for small collections of protein sequences of interest, but for the very large numbers of sequences (10,000 to 15,000 or greater) coded for in an individual genome, this becomes impractical. Therefore, key to large-scale genomic sequence analysis is the creation of a reliable and automated software "pipeline" to handle both the analysis functions and the collation of output data from the analysis.
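The automation the poster calls for can be sketched in a few lines of Python. Everything here is illustrative: the analysis function is a stand-in for real tools such as BLAST and the feature predictors named later, and the record layout and sequence IDs are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Collated analysis output for one putative protein sequence."""
    seq_id: str
    length: int
    notes: dict = field(default_factory=dict)

def analyze(seq_id: str, sequence: str) -> Annotation:
    # Stand-in for the real per-sequence analysis steps
    # (BLAST search, feature prediction, etc.).
    ann = Annotation(seq_id=seq_id, length=len(sequence))
    ann.notes["low_complexity"] = len(set(sequence)) < 4  # toy heuristic
    return ann

def run_pipeline(genome: dict) -> list:
    """Analyze every sequence and collate the output -- the step that
    is impractical to do by hand for 10,000-15,000 sequences."""
    return [analyze(sid, seq) for sid, seq in sorted(genome.items())]

# Toy "genome" with two sequences (IDs and residues invented).
genome = {"At1g01010": "MEDQVGFGFRPNDEEL", "At1g01020": "AAAAAAAAAA"}
results = run_pipeline(genome)
print(len(results))                        # 2
print(results[1].notes["low_complexity"])  # True
```

The point is not the toy heuristic but the shape: once analysis and collation are functions, the same loop scales from one sequence to an entire genome.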
The EOL Model

The EOL model (Figure 2) applies the iGAP pipeline (proven by the PAT project) to all available (currently 800+) genomes. It is a key goal of the project to provide the computational and storage resources necessary to accommodate the analysis of this magnitude of sequence data (current estimates are 300 CPU-years with available hardware). Ongoing efforts are aimed at obtaining more CPU resources and at improving the efficiency of computational resource utilization.

Figure 2: The EOL data analysis and delivery model. [Schematic: sequence data from genomic sequencing projects feeds ported pipeline applications; pipeline data passes through load/update scripts into a data warehouse and MySQL data mart(s); data are published through Web Services and an API (application server, SOAP/Web server, UDDI directory), EOL Web pages served via JSP, the EOL Notebook, incorporation into third-party web pages, and automated downloads to mirrors and researchers.]

The Sequence Analysis Pipeline

The Proteins of Arabidopsis thaliana (PAT) project was a prototype initiative to establish a reliable and accurate pipeline for genome annotation (iGAP) (Figure 1). Using homology modeling, iGAP provides functional annotations and predicts three-dimensional structures (where possible) for proteins encoded in the Arabidopsis thaliana genome. The results from iGAP (WU-BLAST, PSI-BLAST, 123D+, COILS, TMHMM, SignalP) were combined and organized into a relational database with a web-based GUI. The pipeline steps are:
- Prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG).
- Creation of PSI-BLAST profiles for protein sequences.
- Building FOLDLIB from PDB chains, SCOP domains, PDP domains, and CE matches of PDB vs. SCOP, keeping sequences that are 90% non-identical, at least 25 aa long, and meet coverage criteria (90% coverage, gaps < 30, ends < 30).
- Structural assignment of domains by PSI-BLAST against FOLDLIB.
- For sequences without an A-category prediction only: structural assignment of domains by 123D against FOLDLIB.
- For sequences still without an A-category prediction: functional assignment by PFAM, NR, and PSI-Pred assignments.
- Domain location prediction by sequence.
- Storage of assigned regions in the database.

Figure 1: The genomic analysis pipeline (iGAP) used to analyze sequence data; inputs include NR, PFAM, SCOP, and PDB.

Steps in Protein Annotation
• Structural assignment by sequence similarity and fold recognition.
– Fold assignment.
– Function assignment.
– Modeling by aligning with template.
• Functional assignment by sequence similarity.
• Assignment of special classes (filtering).
• Assignment of protein features.

An important issue in this process is automation and its associated automated quality assessment. In the pipeline model, this was addressed by:
• Introduction of six reliability categories, based on a selectivity benchmark, where sensitivity = tp/(tp+fn) and selectivity = tp/(tp+fp):
A. Certain (99.9% of true positives among predicted positives)
B. Reliable (99%)
C. Probable (90%)
D. Possible (50%)
E. Potential (10%)
F. No annotation
• Introduction of a benchmark based on 1000 non-redundant SCOP folds [Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C. J. Mol. Biol. 1995, 247:536].
• Testing a variety of search conditions and methods within this benchmark.

Further information about the PAT project may be found at the PAT web site: http://arabidopsis.sdsc.edu

Stages in EOL Data Processing and Delivery

Figure 3: The book-metaphor Web interface.

A unique aspect of the EOL model is its ability to deliver data through multiple routes. One arm of this data delivery system is the Web interface, driven by Java Server Pages (JSP). Building on the "Encyclopedia of Life" concept, the interface provides fast access to EOL data through a book-metaphor design (Figure 3). Data is cataloged alphabetically by species, and the user is provided with multiple additional tools to search sequence data, including:
• BLAST search with a protein query sequence against one or more specific species' data.
• Keyword search.
• Natural Language Query search.
• Sequence identifier (accession ID) search.
• Putative function browser.
• SCOP fold browser.

Data processing and delivery proceed in the following stages:
• Publicly available genomic sequence data are obtained via a high-speed Internet2 connection from NCBI to the San Diego Supercomputer Center.
• Sequence data is distributed to several large-scale computing resources at partner institutions, such as the BII in Singapore and the TeraGrid at SDSC (see below), to which the PAT software pipeline has been ported.
• Data from the pipeline is deposited into a DB2-based, multi-species version of the PAT data warehouse schema and federated with data from a number of other local database projects.
• Multiple complex queries on the data are run and the results are stored in the database.
• Data is loaded into multiple data marts for fast, read-only query access and distribution both to end users (via a Web interface and a SOAP-based Web services paradigm) and to EOL data mirror sites.
• Researchers throughout the world are able to access the data by pointing their Web browser to the EOL data Web site or one of its mirrors. Additionally, the World Wide Web Consortium (W3C) standards-based Web Service protocol allows for peer-to-peer automated computer data access for a variety of uses.

Query results will be returned in multiple forms, including a Web page summary at the genome, sequence, and structure data levels, as well as links to the same information in XML, a printer-friendly PDF output, an EOL Notebook version (see below), and a narrated summary in Flash. The end-user experience of accessing data processed in this manner is fast, comprehensive, and flexible. The Web interfaces make extensive use of Scalable Vector Graphics (SVG) components to deliver fast, client-side graphical data renderings using XML-encapsulated data according to W3C standards. An example is the SVG "chromosome mapper" shown in Figure 4. SVG molecular rendering is used at the client side to provide fast, interactive, and visually informative molecular graphics.

Large-Scale Computing Resources and Data Storage

Key to the success of the EOL project has been the ability to partner with computing projects that will provide the resources to drive the software pipeline to process over 800 available genomes. Large-scale computing resources being recruited for the EOL project include the TeraGrid, the world's largest, fastest, most comprehensive distributed infrastructure for open scientific research (http://www.teragrid.org); PRAGMA, an open organization in which Pacific Rim institutions formally collaborate to develop grid-enabled applications and to deploy Grid infrastructure throughout the Pacific region; and NRAC resources, including SDSC's Blue Horizon, the University of Michigan's AMD cluster, and the University of Wisconsin Condor flock.

Another factor in the development of EOL has been the ability to deploy large-scale mass storage to handle the enormous amount of data generated by iGAP analyses and loaded into the EOL data warehouse schema and data marts. Ultimately, more than 10 terabytes of storage will be deployed for genome annotation alone.

Multiple EOL Data Mirror Sites

Data mirrors will be a major component of the EOL data distribution system. A software package can be downloaded from the EOL interface that allows researchers to store selected EOL data on local machines and, if desired, to act as a public EOL data mirror. This mirror package software will be based upon a freely available relational database management system (MySQL) and application server (JBoss).
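The read-only query pattern such a mirror serves can be sketched as follows. To keep the example self-contained, Python's built-in sqlite3 stands in for the MySQL data mart, and the table, column names, and rows are all invented for illustration:

```python
import sqlite3

# Stand-in for a local EOL data mart: sqlite3 replaces MySQL here so
# the sketch runs anywhere; the schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE domain_assignment "
           "(seq_id TEXT, fold TEXT, reliability TEXT)")
db.executemany("INSERT INTO domain_assignment VALUES (?, ?, ?)",
               [("At1g01010", "TIM barrel", "A"),
                ("At1g01020", "Rossmann fold", "C"),
                ("At1g01030", "TIM barrel", "B")])

# A typical fast, read-only mirror query: all high-confidence
# (category A or B) assignments for a given fold.
rows = db.execute("SELECT seq_id FROM domain_assignment "
                  "WHERE fold = ? AND reliability IN ('A', 'B') "
                  "ORDER BY seq_id", ("TIM barrel",)).fetchall()
print([r[0] for r in rows])  # ['At1g01010', 'At1g01030']
```

Because the mirror is read-only, queries like this can be served locally without coordinating writes with the main EOL repository.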
Basing the mirror package on these freely available components ensures the widest possible deployment of an EOL mirror data repository, from major university and biotech sites to the smallest research institutions, even including high schools.

Figure 4: Client-side data rendering using SVG.

Web Services and the EOL Notebook

In addition to access to EOL data via the web, other components of data delivery include publication of a Web Services-based API and the SDSC Blue Titan web services network direction system. Through Web Services, any researcher or data service is able to access EOL data automatically and with minimal programmatic effort.

The EOL Notebook is a subproject within EOL (and bioinformatics.org) to create a Java-based application, distributed via JNLP, that will act as a local repository for EOL data. In addition to storing and searching data locally, the EOL Notebook will also be a consumer of EOL Web Services and, via automation, will ensure that locally kept data (stored in XML format for interoperability) is kept in sync with data in the main EOL repository.

For further information about EOL, please visit us online at http://www.eolproject.info or contact Mark Miller at mmiller@sdsc.edu, +1-858-822-0866.
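The sensitivity and selectivity definitions and the six reliability categories given under "Steps in Protein Annotation" above translate directly into code. A minimal sketch, with the thresholds taken from the poster and the function names my own:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = tp / (tp + fn), as defined in the poster."""
    return tp / (tp + fn)

def selectivity(tp: int, fp: int) -> float:
    """Selectivity = tp / (tp + fp): the fraction of true positives
    among predicted positives."""
    return tp / (tp + fp)

def reliability_category(sel: float) -> str:
    """Map a selectivity value onto the poster's six categories."""
    thresholds = [(0.999, "A"),  # Certain
                  (0.99,  "B"),  # Reliable
                  (0.90,  "C"),  # Probable
                  (0.50,  "D"),  # Possible
                  (0.10,  "E")]  # Potential
    for cutoff, category in thresholds:
        if sel >= cutoff:
            return category
    return "F"  # No annotation

sel = selectivity(tp=95, fp=5)      # 0.95
print(reliability_category(sel))    # C
print(reliability_category(0.999))  # A
```

This is why only sequences without an A-category prediction fall through to the later pipeline stages: an "A" assignment already means 99.9% of predicted positives are true positives on the benchmark.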