
Protein Bioinformatics – Advances and Challenges
By Sona Vasudevan and Peter McGarvey

Outline
• What is Bioinformatics? Past & Present
• About PIR
• PIR resources
• UniProt resources
• PIR's leading role in caBIG, Biodefense and Ontology

What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000)
• Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computer (Information) + Mouse (Biology) = Bioinformatics

“A science which hesitates to forget its founders is lost.” (A. N. Whitehead)

Evolution of Protein Databases (Georgetown University)
Dr. Margaret Oakley Dayhoff (1925–1983): the origin of the single-letter code for the amino acids

Challenges we are facing today!
Total number of sequences in NR:          ~4,919,302
Total number of environmental sequences:  ~6,028,191 (NCBI)
Number of domain families (Pfam):         ~8957
Number of domain families (SMART):        ~665
Number of structures (PDB):               ~43339
Number of COGs:                           ~4873 (Unicellular), ~4852 (Eukaryote)

Molecular Biology Databases
• The DNA sequence database has exceeded 100 gigabases.
• 719 databases in 14 categories

The birth of the “omes” and the “omic” era in biology

Genomics, Proteomics, Functionomics, Unknomics, Metagenomics


Protein Information Resource (PIR): Integrated Protein Informatics Resource for Proteomics Research
• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Protein Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis
http://pir.georgetown.edu

UniProt Databases
• UniParc: Comprehensive Sequence Archive with Sequence History
• UniProt: Knowledgebase with Full Classification and Functional Annotation
• UniRef: Non-redundant Reference Databases for Sequence Search
[Diagram labels: source databases (Swiss-Prot, TrEMBL, PIR-PSD, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data); merging into UniParc (Archive); classification, literature-based and automated annotation in UniProt (Knowledgebase); clustering at 100, 90, 50% identity into UniRef100 (NREF), UniRef90, UniRef50]

UniProt Knowledgebase
• Objective: Stable, Comprehensive, Fully Classified, Richly and Accurately Annotated
• Information Content
  • Isoform Presentation
  • Nomenclature
  • Family Classification and Domain Identification
  • Functional Annotation
• Approaches
  • Full Classification
  • Automated Annotation
  • Literature-Based Curation
  • Database Cross-References
  • Controlled Vocabularies & Ontologies
  • Evidence Attribution

PIRSF Classification System
• PIRSF:
  • Reflects evolutionary relationships of full-length proteins
  • A network structure from superfamilies to subfamilies
• Definitions:
  • Homeomorphic Family (HF): Basic Unit
  • Homologous: Common ancestry, inferred by sequence similarity
  • Homeomorphic: Full-length similarity & common domain architecture
  • Hierarchy: Flexible number of levels with varying degrees of sequence conservation
  • Network Structure: Allows multiple parents
• Advantages:
  • Annotate both general biochemical and specific biological functions
  • Accurate propagation of annotation and development of standardized protein nomenclature and ontology
Credit: A. N. Nikolskaya
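The two definitions that carry the most structure here, "homeomorphic" and "network structure with multiple parents", can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the domain labels, the node names, and the 0.8 coverage cutoff are hypothetical and are not taken from PIRSF itself.

```python
def is_homeomorphic(domains_a, domains_b, aligned_len, len_a, len_b,
                    min_coverage=0.8):
    """Rough homeomorphic test: same ordered domain architecture AND an
    alignment that spans most of the longer protein (full-length similarity).
    The 0.8 cutoff is an illustrative assumption, not a PIRSF parameter."""
    same_architecture = list(domains_a) == list(domains_b)
    coverage = aligned_len / max(len_a, len_b)
    return same_architecture and coverage >= min_coverage


def ancestors(node, parents):
    """Collect every ancestor of a node in a classification network
    where a node may have more than one parent."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical network: one family attached to two superfamilies.
parents = {
    "subfamily_X": {"family_A"},
    "family_A": {"superfamily_1", "superfamily_2"},
}

same_family = is_homeomorphic(["domA", "domB"], ["domA", "domB"], 450, 500, 480)
different_architecture = is_homeomorphic(["domA"], ["domA", "domB"], 450, 500, 480)
lineage = ancestors("subfamily_X", parents)
```

Storing parents as a set per node, rather than a single parent pointer, is what lets a family sit under more than one superfamily.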

PIRSF Classification System: Protein Classification and Functional Annotation (http://pir.georgetown.edu/pirsf/)
• Comprehensive Classification of All UniProt Proteins
• Curated Families with Protein Name and Site Rules
• Classification and Visualization Tools
  • Taxonomy Distribution and Phylogenetic Pattern
  • Iterative BlastClust
  • Tree with Annotation Table, MSA & Phylogenetic Tree

Classification Tool: BlastClust
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
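Single-linkage clustering can be sketched in a few lines of Python. This is not BlastClust's implementation: it assumes the pairwise BLAST results have already been reduced to hypothetical `(id_a, id_b, percent_identity, coverage)` tuples, and the threshold values are illustrative only.

```python
from collections import defaultdict


def single_linkage_clusters(ids, hits, min_identity=30.0, min_coverage=0.9):
    """Union-find single linkage: two sequences share a cluster if they are
    connected by a chain of pairwise hits that each pass both thresholds."""
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for seq_id in ids:
        clusters[find(seq_id)].add(seq_id)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical pairwise hits: (id_a, id_b, percent identity, length coverage)
hits = [
    ("P1", "P2", 62.0, 0.95),
    ("P2", "P3", 41.0, 0.92),
    ("P3", "P4", 55.0, 0.40),  # fails the coverage threshold
    ("P4", "P5", 80.0, 0.99),
]
ids = ["P1", "P2", "P3", "P4", "P5"]
strict = single_linkage_clusters(ids, hits)
relaxed = single_linkage_clusters(ids, hits, min_coverage=0.3)
```

P1 and P3 end up together even though they have no direct hit, because both link to P2; that chaining is the defining property of single linkage. Holding the coverage cutoff fixed while varying identity is one way a curator could explore the clustering at several levels.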

PIRSF-Based Protein Annotation
Classification-Driven Rule-Based Annotation Provides Consistent Annotation and Database Integrity Check
Includes:
• Site Rule (PIRSR): Position-Specific Site Feature (FT)
• Name Rule (PIRNR): transfer name from PIRSF to individual proteins
  • Protein Name (DE) with Synonym, EC, Misnomer
  • GO Term
Name Rule Interface (Rule ID / Rule Condition / Rule Description):
• PIRNR000881-1 / PIRSF000881 member and vertebrates / Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
• PIRNR000881-2 / PIRSF000881 member and not vertebrates / Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
• PIRNR025624-1 / PIRSF025624 member / Name: ACT domain protein; Misnomer: chorismate mutase
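The name rules listed on this slide can be read as condition/action pairs. The Python sketch below encodes the three rules shown; the rule IDs, names and EC numbers come from the slide, while the `NameRule` class, the `apply_name_rules` function and the use of a "Vertebrata" label in a lineage list are assumptions made for illustration, not the PIR rule format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class NameRule:
    rule_id: str
    pirsf_id: str                      # family the protein must belong to
    condition: Callable[[set], bool]   # extra test on the taxonomic lineage
    name: str
    ec: Optional[str] = None
    misnomer: Optional[str] = None


RULES = [
    NameRule("PIRNR000881-1", "PIRSF000881",
             lambda lineage: "Vertebrata" in lineage,
             "S-acyl fatty acid synthase thioesterase", ec="3.1.2.14"),
    NameRule("PIRNR000881-2", "PIRSF000881",
             lambda lineage: "Vertebrata" not in lineage,
             "Type II thioesterase", ec="3.1.2.-"),
    NameRule("PIRNR025624-1", "PIRSF025624",
             lambda lineage: True,     # membership alone is sufficient
             "ACT domain protein", misnomer="chorismate mutase"),
]


def apply_name_rules(pirsf_id, lineage, rules=RULES):
    """Return the first rule whose family and condition both match, else None."""
    lineage_set = set(lineage)
    for rule in rules:
        if rule.pirsf_id == pirsf_id and rule.condition(lineage_set):
            return rule
    return None


vertebrate_hit = apply_name_rules("PIRSF000881", ["Eukaryota", "Vertebrata"])
bacterial_hit = apply_name_rules("PIRSF000881", ["Bacteria"])
no_hit = apply_name_rules("PIRSF999999", ["Bacteria"])  # hypothetical family ID
```

Recording the misnomer alongside the name is what allows the same rule to serve as an integrity check: an entry in PIRSF025624 that is still called "chorismate mutase" can be flagged rather than silently renamed.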

Rule-based Annotation of Protein Entries Using PIRSF
[Figure labels: Structure; Binding/active sites; Identification of residues]

Methodology
• Defining a Rule
  • Select template structure
  • Align curated PIRSF seed members and structural template
  • Structure-based sequence alignment of seeds
  • Edit MSA retaining conserved regions covering all site residues
  • Build Site HMM from concatenated conserved regions
• Rule Condition
  • Membership Check (PIRSF HMM threshold)
  • Conserved Region Check (site HMM threshold)
  • Site Residue Check (position-specific residue in HMMAlign)
• Rule Propagation
  • Propagate conserved feature annotation to all members that fit the rule
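The three rule-condition checks and the propagation step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the HMM scores and the mapping from template alignment columns to target positions are taken as precomputed inputs (in practice they would come from HMM search and alignment tools), and the `SiteRule` class, the `propagate_sites` function, the rule ID and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRule:
    rule_id: str
    family_threshold: float   # cutoff for the PIRSF (family) HMM score
    site_threshold: float     # cutoff for the site HMM score
    site_residues: dict       # template column -> set of allowed residues


def propagate_sites(rule, sequence, family_score, site_score, column_to_position):
    """Return {1-based position: residue} for the site features to annotate,
    or {} if any of the three checks fails."""
    # 1. Membership check against the family HMM threshold.
    if family_score < rule.family_threshold:
        return {}
    # 2. Conserved region check against the site HMM threshold.
    if site_score < rule.site_threshold:
        return {}
    # 3. Site residue check at each aligned position.
    features = {}
    for column, allowed in rule.site_residues.items():
        position = column_to_position.get(column)
        if position is None or not 1 <= position <= len(sequence):
            return {}
        residue = sequence[position - 1]
        if residue not in allowed:
            return {}
        features[position] = residue
    return features


# Hypothetical rule: a serine at template column 10, a histidine at column 25.
rule = SiteRule("EXAMPLE-SITE-RULE", family_threshold=100.0, site_threshold=20.0,
                site_residues={10: {"S"}, 25: {"H"}})
alignment = {10: 4, 25: 6}   # template column -> position in the target protein

annotated = propagate_sites(rule, "MKTSAHLG", 150.0, 35.0, alignment)
weak_member = propagate_sites(rule, "MKTSAHLG", 80.0, 35.0, alignment)
mutated_site = propagate_sites(rule, "MKTAAHLG", 150.0, 35.0, alignment)
```

Requiring every site residue to match before annotating anything is a conservative choice in this sketch: a protein that belongs to the family but has lost a catalytic residue receives no site features at all.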

An example of a PIR rule integrated into a Swiss-Prot (SP) record
[Figure label: PIR Rule]

PIRSF Protein Classification provides a platform for protein annotation
• Improves Annotation Quality
  • Annotation of biological function of whole proteins
  • Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
  • Correction of annotation errors
  • Improvement of under- or over-annotated proteins
• Standardization of Protein Names

Data Integration
• Data Warehouse
  • Local Copy of Databases in a Unified Database Schema
  • Allows Local Control of Data; Update Problem
• Hypertext Navigation
  • Browsing Model with Hypertext Links
  • Allows Direct Interaction; Easily Lost in Cyberspace
• iProClass Approach
  • Data Warehouse + Hypertext Navigation
  • Rich Links (Links + Executive Summaries)
  • Modular and Open Framework for Adding New Components in Distributed Networking Environment

iProClass Database: Integrated Protein Family, Function, Structure Information
• ~5,000 Protein Sequences
• Rich Links to >80 Databases
• Value-Added Views for UniProt

iProClass Views
• Family Report
• Sequence Report

PIR iProClass Searches
• ID Mapping
• Peptide Search
• Text Search
• BLAST Search

1. Albert Einstein College of Medicine: T. gondii, C. parvum
2. Caprion Pharmaceuticals: B. abortus
3. Harvard Institute of Proteomics: V. cholerae, B. anthracis
4. Myriad Genetics: B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory: S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps: SARS CoV, Influenza
7. University of Michigan: B. anthracis
[Diagram labels: DATA from Albert Einstein, PNNL, U of Michigan, Harvard, Myriad, Scripps and Caprion; Resource Center: SSS, PIR, VBI]

[Figure labels: Organism; Research Center; Data Type]

Master Protein Directory
Colonization Pathway Proteins
Currently contains 3,733 ORF clones out of 3,784 proteins

Protein and Reagent Information
• Search for Related Proteins in Catalog by Family Classification or Similarity Searches
• Protein Summary Report
• Order Clones from Repositories
• Clone Sequences

Mouse proteins detected in B. anthracis- and S. typhimurium-infected macrophages

NCI caBIG Initiative
cancer Biomedical Informatics Grid:
• Informatics platform to enable sharing of research, data and tools
• Designed and built by an open federation of organizations
• Facilitate connectivity via common standards and unifying architecture
• Open source and open access principles
Domain Workspaces
• Clinical Trial Management Systems
• Integrative Cancer Research
• Imaging
• Tissue Banks and Pathology Tools
Cross Cutting Workspaces
• Architecture
• Vocabularies and Common Data Elements

PIR Activities in caBIG™
• Integrative Cancer Research Workspace
  • Developer
    • Grid-enablement of PIR
  • Adopter
    • SEED Genome Annotation Tool (completed)
    • GeneConnect Genomic Identifier Mapping Service
• Vocabularies and Common Data Elements
  • Participant
