
38b74c115d414357fa23bcfe3e00ccb9.ppt
- Number of slides: 30
BASys: A Web Server for Automated Bacterial Annotation • Savita Shrivastava • Feb 25th, 2005 • Lab Presentation
BASys: Introduction • A web server for automated, in-depth annotation of bacterial genomic sequences • BASys uses more than 30 programs to determine ~60 annotation subfields for each gene • BASys also generates colorful, clickable and fully zoomable maps of each query chromosome • Annotations and maps can be generated in ~24 hrs for an average bacterial chromosome (5 Mb) or 3000 genes • BASys annotations may be viewed or downloaded anonymously or through a password-protected access system • The BASys server and databases can also be downloaded and run locally • BASys is available at: http://wishart.biology.ualberta.ca/basys
Automated genome annotation: why? • Complete published genomes – 21 Archaeal – 205 Bacterial – 32 Eukaryal • Ongoing projects – 655 Prokaryotic genomes – 474 Eukaryotic genomes
Challenges • Sheer volume of data • The heterogeneous and growing types of annotations • The time sensitivity of searches • Computing power can be expensive • The need to present the information in an integrated graphical fashion
Existing automated genome annotation systems • GeneQuiz • PEDANT (from protein sequence to biochemical function using a variety of search and analysis methods) (focuses on protein-based annotation as well as on many DNA-based analyses) • Genotator (gene prediction, and searches for homologs, promoters, splice sites, and ORFs) • MAGPIE and Bluejay (gene description, gene taxonomic information, similarity searches, metabolic pathways, GO, etc.) • GAIA (structural annotation) • TIGR CMR (gene and protein name, GO, M.W., pI and taxonomic information of the organism)
BASys in detail • Data submission and scheduling • The BASys annotation engine • The BASys report generator
Data submission and scheduling • A front-end web interface for: – Submitting the raw genomic data – Scheduling the annotations – Monitoring and reporting the annotation progress
Submitting the raw genomic data • Anonymous access – For anonymous submission, the user is emailed a secure URL for monitoring the progress of their annotations and retrieving the results – For single-chromosome submission • Login-based access – Register with BASys – Password-protected – Allows users to submit and monitor multiple chromosome and plasmid annotations
Submitting the raw genomic data • BASys provides a web-based form for submitting – Chromosome data as a FASTA-formatted file – Chromosome topology (circular or linear) – Gram stain subtype – Chromosome identifier
Submitting the raw genomic data • Genes are predicted using “Glimmer”, a popular gene prediction program • If gene positions are already known, they can be supplied to BASys in a simple TAB-delimited format or as an NCBI “.ffn”-formatted FASTA file • A “.ffn” file includes the nucleotide coding sequences along with their location and direction along the chromosome.
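As an illustration of supplying known gene positions in a TAB-delimited form, here is a minimal Python sketch of a reader for such a file. The slide does not specify the column layout, so the four columns used here (name, start, end, strand) and the names `GenePosition` and `parse_gene_positions` are assumptions made for the example, not the BASys input specification.

```python
from typing import List, NamedTuple


class GenePosition(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" (forward) or "-" (reverse)


def parse_gene_positions(text: str) -> List[GenePosition]:
    """Parse TAB-delimited gene positions.

    Assumed (hypothetical) column order: name, start, end, strand.
    Blank lines and lines starting with '#' are skipped.
    """
    genes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 TAB-separated fields, got {len(fields)}"
            )
        name, start, end, strand = fields
        if strand not in ("+", "-"):
            raise ValueError(f"line {line_no}: strand must be '+' or '-'")
        genes.append(GenePosition(name, int(start), int(end), strand))
    return genes
```

Validating the field count and strand up front keeps a malformed row from silently shifting coordinates in later steps.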
BASys annotation engine: Overview • BASys is a distributed system operating in a clustered computing environment; it accommodates multiple users simultaneously performing long-running, resource-intensive genome annotations • Master node: hosts the web server and runs the queuing and scheduling system; can also issue directives to suspend, resume, restart, and remove the genome annotation jobs on the slave nodes • [Figure: architecture diagram. Recoverable labels: user, genome data, email, master node, slave node; Reference DB; Similarity searches (SwissProt, CCDB, G.D. model organisms, etc.); Sequence analysis (predictSPTM, PROSITE, Pfam); Structure analysis (Homodeller, VADAR, PDB)]
Monitoring and reporting the annotation progress • Each slave node continually communicates its progress to the master node while generating the annotations and reports • Upon completion of the annotation job, the submitter is notified by email that the annotations are ready • The MySQL client-server protocol is used to communicate directives and status • The Apache web server (HTTP) is used to transfer the sequence data and reports
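The master node's directives (suspend, resume, restart, remove) can be pictured as transitions on a job's state. The sketch below is one possible model in Python; the state names and the transition table are assumptions chosen for illustration, since the slides name the directives but not the states BASys tracks.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    REMOVED = "removed"


# (current state, directive) -> next state. Hypothetical table.
TRANSITIONS = {
    (JobState.RUNNING, "suspend"): JobState.SUSPENDED,
    (JobState.SUSPENDED, "resume"): JobState.RUNNING,
    (JobState.RUNNING, "restart"): JobState.RUNNING,
    (JobState.SUSPENDED, "restart"): JobState.RUNNING,
    (JobState.RUNNING, "remove"): JobState.REMOVED,
    (JobState.SUSPENDED, "remove"): JobState.REMOVED,
}


def apply_directive(state: JobState, directive: str) -> JobState:
    """Return the state a job moves to when the master issues a directive."""
    try:
        return TRANSITIONS[(state, directive)]
    except KeyError:
        raise ValueError(
            f"directive {directive!r} is not allowed in state {state.value!r}"
        ) from None
```

Keeping the allowed transitions in one table makes it easy to reject a directive that does not apply, such as resuming a job that is already running.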
BASys in detail • Data submission and scheduling • The BASys annotation engine • The BASys report generator
BASys annotation engine • Function prediction • Comparative annotations • Structural annotations • Secondary structure analysis • Metabolic annotations • General properties prediction
BASys annotation pipeline • [Figure: pipeline flowchart. Recoverable labels: Genomic sequence data, Gene identification, Translation, Proteomic sequence data; BLAST (e-10) against SwissProt and CCDB, with "Exact" and "No hit" branches; Check for missing annotations; BLAST against nr database for protein function prediction; Hypothetical protein <orf number>; Annotations from other sources (KEGG, PROSITE, predictSPTM, Pfam, COG information, PSORTB, TargetDB status and availability, metabolic information); Structural analysis (BLAST PDB database, Homodeller, VADAR, PsiPred, modification of secondary structure if transmembrane regions are present, structure class); General properties; Operon structure; Homologues & paralogues; Preceding and following gene annotations; Annotation collection; Annotation parser (annotations + features); outputs in CCDB format, evidence cards, and HTML format]
Annotations from multiple sources • Example: subcellular location • SwissProt • If the gene ontology is associated with hydrolase, nuclease, endonuclease or ribonuclease activity, or with nucleic acid or RNA binding properties, then the subcellular location is "Cytoplasmic" • If the protein name is related to transcriptional activities, then the subcellular location is "Cytoplasmic" • CCDB • If transmembrane regions are present, then the subcellular location is "Membrane" • PSORTB • If none of the above cases apply, then the subcellular location is assigned as "Cytoplasmic"
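The subcellular location rules can be read as an ordered cascade. The Python sketch below applies them in the order the slide lists them; that ordering, the keyword matching, the use of a PSORTB prediction as the next-to-last step, and the function name `assign_location` are all assumptions made for illustration rather than the actual BASys logic.

```python
from typing import Iterable, Optional

# Substrings looked for in gene ontology terms. "nuclease" also matches
# "endonuclease" and "ribonuclease".
CYTOPLASMIC_GO_KEYWORDS = (
    "hydrolase",
    "nuclease",
    "nucleic acid binding",
    "rna binding",
)


def assign_location(
    go_terms: Iterable[str],
    protein_name: str,
    has_transmembrane: bool,
    psortb_location: Optional[str] = None,
) -> str:
    """Assign a subcellular location using the rule order from the slide."""
    terms = [term.lower() for term in go_terms]
    # Rule 1: gene ontology points to a cytoplasmic activity.
    if any(key in term for term in terms for key in CYTOPLASMIC_GO_KEYWORDS):
        return "Cytoplasmic"
    # Rule 2: protein name relates to transcription.
    if "transcription" in protein_name.lower():
        return "Cytoplasmic"
    # Rule 3: transmembrane regions are present.
    if has_transmembrane:
        return "Membrane"
    # Rule 4 (assumed): fall back to a PSORTB prediction when one exists.
    if psortb_location:
        return psortb_location
    # Rule 5: default.
    return "Cytoplasmic"
```

For example, `assign_location([], "ABC transporter permease", True)` falls through the first two rules and is assigned "Membrane".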
Annotations from multiple sources • Example: Enzyme Classification (EC) number and its related fields • SwissProt • CCDB • KEGG database • Metabolic information from CCDB is transferred – When the EC number from SwissProt/KEGG matches the EC number from CCDB, or – If an EC number is not available from SwissProt/KEGG
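The transfer condition for CCDB metabolic information reduces to a small predicate. This Python sketch encodes the two cases from the slide; the function name and the handling of a missing CCDB EC number (nothing to transfer) are assumptions for the example.

```python
from typing import Optional


def should_transfer_ccdb_metabolism(
    primary_ec: Optional[str], ccdb_ec: Optional[str]
) -> bool:
    """Decide whether metabolic information from CCDB should be transferred.

    primary_ec: EC number from SwissProt/KEGG, or None if unavailable.
    ccdb_ec:    EC number from the matching CCDB entry, or None.
    """
    if ccdb_ec is None:
        # Assumption: without a CCDB EC number there is nothing to transfer.
        return False
    if primary_ec is None:
        # Case 2 on the slide: no EC number from SwissProt/KEGG.
        return True
    # Case 1 on the slide: the two EC numbers match.
    return primary_ec.strip() == ccdb_ec.strip()
```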
Annotation parsing • CCDB format (Annotations) • Text format (Annotations and evidence) • HTML format ( Annotation table)
CCDB format • Clean view • Annotations are marked with – [S] if exact match to SwissProt – [H] if homology to a SwissProt entry – [C] if homology to a CCDB entry • Annotations are linked to online sources, i.e. Pfam, PROSITE, InterPro accession no., GI numbers from homologues, etc.
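The [S], [H] and [C] markers amount to a lookup from the kind of match to a tag. A minimal Python sketch follows; the key names, the function `tag_annotation`, and placing the marker after the value are assumptions, since the slide gives only the markers and their meanings.

```python
# Marker for each kind of supporting match (key names are hypothetical).
MARKERS = {
    "swissprot_exact": "[S]",    # exact match to SwissProt
    "swissprot_homolog": "[H]",  # homology to a SwissProt entry
    "ccdb_homolog": "[C]",       # homology to a CCDB entry
}


def tag_annotation(value: str, match_kind: str) -> str:
    """Append the source marker for match_kind to an annotation value."""
    if match_kind not in MARKERS:
        raise ValueError(f"unknown match kind: {match_kind!r}")
    return f"{value} {MARKERS[match_kind]}"
```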
Text format • Provides evidence – Source of annotation, i.e. database name and version – Evidence used to support the annotation, i.e. BLAST report in case of similarity search – Quality indicator such as “marginal”, “strong” or “clear” – Time of generation of annotations
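The four pieces of evidence listed for the text format map naturally onto a small record type. The Python sketch below is illustrative only: the class name, field names, output layout, and the decision to treat the three quoted quality indicators as a closed set are assumptions, not the actual BASys evidence format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumption: the slide says "such as", so the real set may be larger.
QUALITY_LEVELS = ("marginal", "strong", "clear")


@dataclass
class Evidence:
    field_name: str       # annotation field this evidence supports
    value: str            # annotated value
    source_db: str        # database name
    source_version: str   # database version
    support: str          # e.g. a BLAST report summary
    quality: str          # one of QUALITY_LEVELS
    generated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if self.quality not in QUALITY_LEVELS:
            raise ValueError(f"unknown quality indicator: {self.quality!r}")

    def to_text(self) -> str:
        """Render the evidence as plain text, one item per line."""
        return "\n".join([
            f"Field: {self.field_name}",
            f"Value: {self.value}",
            f"Source: {self.source_db} (version {self.source_version})",
            f"Evidence: {self.support}",
            f"Quality: {self.quality}",
            f"Generated: {self.generated_at.isoformat()}",
        ])
```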
Table format • For a quick view of annotations • Shows the start and end position and direction of the gene, accession no., gene name, COG id and protein function
BASys annotation pipeline • Each analysis program is written in object-oriented Perl and uses the Bioperl library • The annotation API is fully compatible with the Bioperl project • The BASys system currently contains about 54 Perl modules and many small scripts, with more than 60,000 lines of fully object-oriented code • We have tried to document the code fully
BASys annotation pipeline • ~8 external tools to analyze the data – Glimmer, HMMER, BLAST, Homodeller, VADAR, predictSPTM, ps_scan, etc. • ~20 databases as a source of annotation – SwissProt, CCDB, nr, COG, KEGG, PROSITE, reference database of model organisms, PDB, PSORTB, TargetDB, Gene Ontology, etc.
BASys and BacMap • The BASys annotation engine is used in BacMap to generate annotations of bacterial genomes • Annotation of 200 bacterial and archaeal genomes from NCBI has been completed successfully
BASys in detail • Data submission and scheduling • The BASys annotation engine • The BASys report generator
The BASys report generator • A navigable circular genome map is generated automatically once the annotations are done, for genome visualization and exploration • BASys uses the CGView application to produce the navigable circular genome map • BASys passes annotations to CGView in the form of an XML document • CGView then renders this information as a series of hyperlinked PNG image files • The map shows annotated genes and their COG category classification
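To make the XML hand-off to CGView concrete, here is a hedged Python sketch that serializes gene positions into a CGView-style document. The element and attribute names (`cgview`, `featureSlot`, `feature`, `featureRange`, `sequenceLength`, `backboneRadius`, `decoration`) are written from memory of CGView's XML input and should be checked against the CGView documentation; the function name, the fixed radius, and the tuple layout are assumptions, and this is not the XML that BASys itself emits.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

Gene = Tuple[str, int, int, str]  # (label, start, stop, strand "+" or "-")


def build_cgview_xml(sequence_length: int, genes: Iterable[Gene]) -> str:
    """Build a CGView-style XML string with one slot per strand."""
    root = ET.Element(
        "cgview",
        {"sequenceLength": str(sequence_length), "backboneRadius": "160"},
    )
    slots = {
        "+": ET.SubElement(root, "featureSlot", {"strand": "direct"}),
        "-": ET.SubElement(root, "featureSlot", {"strand": "reverse"}),
    }
    for label, start, stop, strand in genes:
        # Arrow direction follows the strand the gene is on.
        decoration = (
            "clockwise-arrow" if strand == "+" else "counterclockwise-arrow"
        )
        feature = ET.SubElement(
            slots[strand], "feature", {"label": label, "decoration": decoration}
        )
        ET.SubElement(
            feature, "featureRange", {"start": str(start), "stop": str(stop)}
        )
    return ET.tostring(root, encoding="unicode")
```

Generating the document through an XML library rather than string concatenation ensures gene labels with special characters are escaped correctly.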
The BASys report generator • Each identified gene is displayed and labeled on the map • Each gene is hyperlinked to a gene card containing the annotations for that gene • Each gene card contains hyperlinks to an evidence card, which gives a more detailed description of the source and quality of each annotation, and to an annotation table, which gives brief annotations
Future work • BLAST and text searching • Manual annotation • TIGRFAMs • BLOCKS • PRINTS
Publications • G. H. Van Domselaar, P. Stothard, S. Shrivastava, J. Cruz, A. Guo, X. Dong, P. Lu, D. Szafron, R. Greiner, and D. S. Wishart (2005) BASys: A web server for automated bacterial genome annotation. Nucleic Acids Research (accepted). • P. Stothard, G. Van Domselaar, S. Shrivastava, A. Guo, B. O'Neill, J. Cruz, M. Ellison, and D. S. Wishart (2005) BacMap: an interactive picture atlas of annotated bacterial genomes. Nucleic Acids Research 33: D317-D320.
Acknowledgements • Prof. David Wishart • Dr. Gary Van Domselaar • Dr. Paul Stothard • Anchi Guo • Joseph Cruz • Xiaoli Dong • Nelson Young • All the lab members and Dr. Warren Gallin