
“Proteomics & Bioinformatics” MBI, Master's Degree Program in Helsinki, Finland
Lecture 4, 10 May 2007
Sophia Kossida, BRF, Academy of Athens, Greece
Esa Pitkänen, University of Helsinki, Finland
Juho Rousu, University of Helsinki, Finland

Proteomics and biology / Applications
• Proteome mining: identifying as many as possible of the proteins in your sample
• Protein expression profiling: identification of proteins in a particular sample as a function of a particular state of the organism or cell
• Post-translational modifications: identifying how and where the proteins are modified
• Functional proteomics
• Protein quantitation or differential analysis
• Protein-protein interactions
• Protein-network mapping: determining how the proteins interact with each other in living systems
• Structural proteomics

Databases and tools Melanie

General workflow of proteomics analysis
[Figure: workflow diagram. Stages shown: proteins/peptides; digestion and/or separation; 2D gel image acquisition and storage; MALDI, MS/MS; identification (PMF, MS/MS); quantification (DIGE, LC-MS & tags); store peak lists and all metadata. External data sources: taxonomy, ontologies, bibliography… Applications: systems biology (pathways, interactions…), biomarker discovery, drug targets.]

General workflow of proteomics analysis
[Figure: the workflow diagram annotated with the databases and tools covered in this lecture.]
• 2D-PAGE databases: Make2D, SWISS-2DPAGE, GelBank, Cornea 2D-PAGE, World-2DPAGE
• Imaging tools: Melanie, PDQuest, Progenesis, Delta2D
• Storing/organising: ProteinScape, MSight
• Sequence databases: EMBL Nucleotide Sequence Database, GenBank, UniProtKB/Swiss-Prot & TrEMBL, Ensembl, EST database, PIR
• Identification and quantification: Mascot, Sequest, Aldente, Popitam, Phenyx, FindMod, ProFound, PepFrag, MS-Fit, OMSSA, SearchXLinks
• External data sources: KEGG, PDB, DIP, OMIM, Reactome, PROSITE, Pfam, SPIN-PP, BOND, STRING, AmiGO, DAVID, PubMed, MEDLINE

General workflow of proteomics analysis
[Figure: the workflow diagram with the 2D gel stage highlighted.]
2D gel databases (data integration on the web; image data and textual information): • SWISS-2DPAGE • GelBank • Cornea 2D-PAGE • World-2DPAGE • Make2D
Imaging software (the ability to compare two gel images and then identify differentially expressed spots): • Melanie • PDQuest • Progenesis • Delta2D
ProteinScape: platform for storing and organizing data
MSight: representation of mass spectra along with data from the separation
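
The gel comparison step mentioned on this slide can be illustrated with a minimal Python sketch. It assumes that spots have already been detected and matched between the two gels, and that each gel is given as a mapping from spot ID to normalized spot volume. The function name, the spot IDs, and the two-fold threshold are hypothetical choices for illustration and do not reflect how any of the listed packages work internally.

```python
import math


def differential_spots(gel_a, gel_b, min_fold=2.0):
    """Return matched spots whose volume changes at least min_fold.

    gel_a, gel_b: dict mapping matched spot ID -> normalized spot volume.
    The result maps spot ID -> log2(volume in gel_b / volume in gel_a).
    """
    threshold = math.log2(min_fold)
    changed = {}
    for spot_id in gel_a.keys() & gel_b.keys():
        vol_a, vol_b = gel_a[spot_id], gel_b[spot_id]
        if vol_a <= 0 or vol_b <= 0:
            continue  # skip spots without a measurable volume
        log_ratio = math.log2(vol_b / vol_a)
        if abs(log_ratio) >= threshold:
            changed[spot_id] = log_ratio
    return changed


# Hypothetical spot volumes, for illustration only.
control = {"spot1": 1.0, "spot2": 2.0, "spot3": 0.5}
treated = {"spot1": 1.1, "spot2": 8.0, "spot3": 0.1}
print(sorted(differential_spots(control, treated)))
```

Working on a log2 scale treats up- and down-regulation symmetrically, which is why the threshold is applied to the absolute log ratio.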

2D Gel Databases
• SWISS-2DPAGE: www.expasy.ch
• GelBank: http://www.gelscape.ualberta.ca:8080/htm/gdbIndex.html
• Cornea 2D-PAGE: http://www.cornea-proteomics.com/
• World-2DPAGE, index of 2D gel databases: http://ca.expasy.org/ch2d/2d-index.html

SWISS-2DPAGE viewer

GelBank

Cornea 2D-PAGE

World-2DPAGE http://ca.expasy.org/ch2d/2d-index.html

Make2D database
A software package to create, convert, publish, interconnect and keep up to date 2-DE databases. Provided by ExPASy.
• The database can be queried by description, by accession number, or by clicking on a spot.
• Cross-references are provided to other federated 2D-PAGE database entries, Medline and Swiss-Prot.
• Entries are linked to images showing the experimentally determined and theoretical protein locations.
• Search via clickable images or keywords.
• It runs on most UNIX-based operating systems (Linux, Solaris/SunOS, IRIX).
• The tool is under continuous development and evolves in concert with the Proteomics Standards Initiative of the Human Proteome Organization (HUPO).
• Data can be marked as public, or as fully or partially private.
• A secured administration Web interface handles external data integration, data export, data privacy control, database publication and version control.

Federated database
A collection of databases that are treated as one entity and viewed through a single user interface (pcmag.com).
• Robustness
• Consistency
• Maintenance of the database
• Data quality
Limitations of current databases:
• They do not contain strict/detailed descriptions of the protocol (buffers, sample volume, staining techniques: all important information for gel comparisons).
• They are designed as 2D (and not proteomics) databases and are therefore not readily expandable to incorporate other proteomics data, e.g. MS, MDLC.
• They are designed for reference gels, not ongoing projects.

Guidelines for building a federated 2-DE database
http://ca.expasy.org/ch2d/fed-rules.html
• Individual entries in the database must be accessible by a keyword search. Other methods are possible but not required.
• The database must be linked to other databases by active hypertext cross-references, linking together all related databases. Database entries must at least be linked to the main index.
• A main index has to be supplied that provides a means of querying all databases through one unique query point.
• Individual protein entries must be available through clickable images.
• 2-DE analysis software designed for use with federated databases must be able to access individual entries in any federated 2-DE database.
For a complete reference, see Appel et al., Electrophoresis 17 (1996), 540-546.
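
The keyword-search and main-index rules can be pictured with a small Python sketch. Everything in it is hypothetical: the class names, the database names `db_a` and `db_b`, and the accession numbers are invented for illustration, and real federated 2-DE databases are linked over the Web rather than held in memory.

```python
class GelDatabase:
    """Toy stand-in for one member database of a federation."""

    def __init__(self, name, entries):
        self.name = name
        # entries: dict mapping accession -> free-text description
        self.entries = entries

    def keyword_search(self, keyword):
        """Return accessions whose description contains the keyword."""
        keyword = keyword.lower()
        return [acc for acc, desc in self.entries.items()
                if keyword in desc.lower()]


class MainIndex:
    """Single query point that forwards a keyword to every member."""

    def __init__(self, databases):
        self.databases = databases

    def query(self, keyword):
        hits = []
        for db in self.databases:
            for accession in db.keyword_search(keyword):
                hits.append((db.name, accession))
        return sorted(hits)


# Hypothetical databases and accessions, for illustration only.
db_a = GelDatabase("db_a", {"X0001": "Serum albumin",
                            "X0002": "Alpha-enolase"})
db_b = GelDatabase("db_b", {"Y0001": "Albumin precursor"})
index = MainIndex([db_a, db_b])
print(index.query("albumin"))
```

Each hit carries the name of the database it came from, so a client can follow the result back to the individual entry.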

Image analysis software
• ImageMaster 2D / Melanie
• PDQuest (Bio-Rad, USA)
• Progenesis (Nonlinear, UK)
• Delta2D (Decodon, Germany)

Melanie http://au.expasy.org/melanie/

Melanie http://www.2d-gel-analysis.com/

PDQuest http://www.bio-rad.com/

Progenesis http://www.nonlinear.com/products/progenesis/

Delta2D http://www.decodon.com/Solutions/Delta2D/

ProteinScape
Platform for storing, organizing and analyzing data generated during the proteomics workflow.
• Hierarchy: Project → Sample → Gel → Spots → MS Data → Search Events
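
The hierarchy on this slide can be sketched as nested Python dataclasses. This is only an illustration of the nesting order; the class and field names are assumptions and do not describe ProteinScape's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchEvent:
    engine: str       # e.g. the search engine used (hypothetical field)
    top_hit: str      # best-scoring protein identifier (hypothetical field)


@dataclass
class MSData:
    peak_list: List[float]
    searches: List[SearchEvent] = field(default_factory=list)


@dataclass
class Spot:
    spot_id: str
    ms_data: List[MSData] = field(default_factory=list)


@dataclass
class Gel:
    gel_id: str
    spots: List[Spot] = field(default_factory=list)


@dataclass
class Sample:
    name: str
    gels: List[Gel] = field(default_factory=list)


@dataclass
class Project:
    name: str
    samples: List[Sample] = field(default_factory=list)


def count_search_events(project):
    """Walk the whole hierarchy and count the search events at the bottom."""
    return sum(len(ms.searches)
               for sample in project.samples
               for gel in sample.gels
               for spot in gel.spots
               for ms in spot.ms_data)


# Hypothetical content, for illustration only.
ms = MSData(peak_list=[842.51, 1045.56],
            searches=[SearchEvent("engine_1", "protein_A"),
                      SearchEvent("engine_2", "protein_A")])
project = Project("demo", [Sample("sample_1", [Gel("gel_1", [Spot("s1", [ms])])])])
print(count_search_events(project))
```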

MSight
Specifically developed for the representation of mass spectra along with data from the separation.
http://www.expasy.org/MSight

General workflow of proteomics analysis
[Figure: the workflow diagram with the sequence databases highlighted.]
Sequence databases: EMBL Nucleotide Sequence Database, GenBank, UniProtKB/Swiss-Prot & TrEMBL, Ensembl, EST database, PIR

EMBL Nucleotide Sequence Database
Maintained at the EBI in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Newly collected sequence data are exchanged, and each database is updated daily.

EBI

GenBank
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is available for searching at NCBI.
Each entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features that identifies coding regions and other sites of biological significance, such as transcription units, sites of mutations or modifications, and repeats. Protein translations for coding regions are included in the feature table. Bibliographic references are included, along with a link to the Medline unique identifier for all published sequences.
http://www.psc.edu/general/software/packages/genbank.html
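
To make the entry structure concrete, here is a minimal Python sketch that reads the description, the organism, and the feature table from a tiny flat-file record. The record is invented for illustration (it is not a real GenBank entry), and the parser is deliberately simplified: it assumes one-line values, single-token feature locations, and single-line qualifiers, which real entries often violate.

```python
RECORD = """\
LOCUS       HYPO0001                  12 bp    DNA     linear   SYN 10-MAY-2007
DEFINITION  Hypothetical example record, not a real GenBank entry.
ACCESSION   HYPO0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..12
     CDS             1..12
                     /product="example peptide"
                     /translation="MAK"
ORIGIN
        1 atggccaaat aa
//
"""


def parse_record(text):
    """Extract a few header fields and the feature table (simplified)."""
    record = {"features": []}
    in_features = False
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN"):
            in_features = False
        elif in_features:
            stripped = line.strip()
            if stripped.startswith("/"):
                # Qualifier line: attach it to the most recent feature.
                key, _, value = stripped[1:].partition("=")
                record["features"][-1]["qualifiers"][key] = value.strip('"')
            else:
                kind, location = stripped.split()
                record["features"].append(
                    {"type": kind, "location": location, "qualifiers": {}})
        else:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] in ("DEFINITION", "ACCESSION",
                                                "ORGANISM"):
                record[parts[0].lower()] = parts[1].strip()
    return record


entry = parse_record(RECORD)
print(entry["accession"], entry["organism"])
print(entry["features"][1]["qualifiers"]["translation"])
```

For real data, an established parser is a safer choice than a hand-written one like this.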

Search GenBank http://www.ncbi.nlm.nih.gov/Genbank/index.html

DDBJ

INSDC

UniProt: Universal Protein Resource
Joins the information contained in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and PIR. It comprises three components:
• UniProt Knowledgebase (UniProtKB): curated protein information, including function, classification, and cross-references.
• UniProt Reference Clusters (UniRef): combines closely related sequences into a single record to speed up searches.
• UniProt Archive (UniParc): a repository reflecting the history of all protein sequences.

ExPASy Proteomics Server
The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2D-PAGE.
http://www.isb-sib.ch/
http://ca.expasy.org/

UniProtKB/Swiss-Prot
The UniProtKB/Swiss-Prot Protein Knowledgebase is an annotated protein sequence database established in 1986. It is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
http://ca.expasy.org/sprot/

Swiss-Prot

TrEMBL
• UniProtKB/TrEMBL is a computer-annotated protein sequence database complementing the UniProtKB/Swiss-Prot Protein Knowledgebase.
• It contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases, and also protein sequences extracted from the literature or submitted to UniProtKB/Swiss-Prot.
• The database is enriched with automated classification and annotation.
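
Translating a coding sequence is the basic operation behind the entries described above. The following Python sketch shows the idea using the standard genetic code only; it assumes the CDS starts in frame, stops at the first stop codon, and marks unrecognized codons with `X`. It is an illustration, not the pipeline actually used to build TrEMBL.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds):
    """Translate a coding sequence up to (not including) the first stop."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))
```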

PIR http://pir.georgetown.edu/pirwww/

EST database (dbEST)
An expressed sequence tag (EST) is a unique DNA sequence within a coding region of a gene that is useful for identifying full-length genes and serves as a landmark for mapping. dbEST is a division of GenBank that contains sequence data and other information on “single-pass” cDNA sequences from a number of organisms.
http://www.ncbi.nlm.nih.gov/dbEST/

Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute that aims to develop a system that maintains automatic annotation of large eukaryotic genomes. Access to all the software and data is free and without constraints of any kind.
http://www.ebi.ac.uk/ensembl/

IPI: International Protein Index

General workflow of proteomics analysis
[Figure: the workflow diagram with the identification and quantification tools highlighted.]
Tools: Mascot, Sequest, Aldente, Popitam, Phenyx, FindMod, ProFound, PepFrag, MS-Fit, OMSSA, SearchXLinks, TagIdent

Proteomics tools
http://restools.sdsc.edu/biotools19.html
http://ca.expasy.org/tools/

PROWL

Identification and Characterization Tools
PMF data:
• Mascot (Matrix Science)
• Aldente (ExPASy)
• ProFound (Rockefeller University)
• MS-Fit (Prospector; UCSF)
MS/MS data:
• Sequest
• Mascot
• OMSSA
• X!Hunter

Identification and Characterization Tools
• Popitam (ExPASy, SIB)
• Phenyx (GeneBio, Switzerland)
• PepFrag (Rockefeller University, USA)
• SearchXLinks (Caesar, Germany)

Popitam is designed to characterize peptides with unexpected modifications (e.g. post-translational modifications or mutations) by tandem mass spectrometry (ExPASy, SIB).
http://expasy.org/cgi-bin/popitam/help.pl

Popitam results

Phenyx is a software platform for the identification and characterization of proteins and peptides from mass spectrometry data. Developed by GeneBio in collaboration with SIB.
http://www.phenyx-ms.com/about_phenyx.html

PepFrag
Searches known protein sequences with peptide fragment mass information.
http://prowl.rockefeller.edu/
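
Mass-based searching of known sequences can be illustrated with a small Python sketch in the style of peptide mass fingerprinting: digest a candidate protein in silico, compute peptide masses, and count how many observed masses match. This is not PepFrag's algorithm. The sketch assumes trypsin specificity (cleavage after K or R, but not before P), monoisotopic residue masses, observed values given as neutral peptide masses rather than m/z, and a hypothetical tolerance of 0.2 Da.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(protein, observed_masses, tolerance=0.2):
    """Count observed masses that match any theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(obs - mass) <= tolerance for mass in theoretical)
               for obs in observed_masses)


# Hypothetical sequence and masses, for illustration only.
print(tryptic_digest("MAKGRPLK"))
print(count_matches("MAKGRPLK", [348.18, 500.0]))
```

A search tool would repeat this for every sequence in the database and rank candidates by a score; the match count here is only the simplest possible stand-in for such a score.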

SearchXLinks
http://www.searchxlinks.de/
Analysis of mass spectra of modified, cross-linked, and digested proteins whose amino acid sequence is known.

Identification and Characterization Tools
• FindMod predicts potential protein post-translational modifications (PTM) and finds potential single amino acid substitutions in peptides.
• FindPept identifies peptides that result from unspecific cleavage of proteins from experimental masses, taking into account artefactual chemical modifications, post-translational modifications (PTM) and protease autolytic cleavage.
• GlycoMod predicts possible oligosaccharide structures that occur on proteins from their experimentally determined masses.
• AACompIdent achieves identification with amino acid composition.
• TagIdent identifies proteins with isoelectric point (pI), molecular weight (Mw) and sequence tag, generating a list of proteins close to a given pI and Mw.
• MultiIdent achieves cross-species identification with multiple parameters (pI, Mw, sequence tag and peptide mass fingerprinting data).
http://au.expasy.org/tools/findmod/
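
The idea of predicting a modification from a mass can be sketched in a few lines of Python: compare the observed peptide mass with the theoretical unmodified mass and look the difference up in a table of known mass shifts. This is a simplified illustration and not FindMod's implementation. The table holds only four common modifications (monoisotopic shifts), and the function name and the 0.01 Da tolerance are assumptions.

```python
# Monoisotopic mass shifts in Da for a few common modifications.
PTM_MASS_SHIFT = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
    "oxidation": 15.99491,
}


def candidate_ptms(observed_mass, theoretical_mass, tolerance=0.01):
    """Return modifications whose mass shift explains the mass difference."""
    delta = observed_mass - theoretical_mass
    return sorted(name for name, shift in PTM_MASS_SHIFT.items()
                  if abs(delta - shift) <= tolerance)


# Hypothetical masses, for illustration only.
print(candidate_ptms(1079.97, 1000.0))
```

Handling multiple modifications on one peptide would need a search over combinations of shifts, which this sketch leaves out.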

General workflow of proteomics analysis
[Figure: the workflow diagram with the external data sources highlighted.]
External data sources: KEGG, PDB, DIP, OMIM, Reactome, PROSITE, Pfam, SPIN-PP, BOND, STRING, AmiGO, DAVID, PubMed, MEDLINE

KEGG: Kyoto Encyclopedia of Genes and Genomes
• Organism-specific entry points: KEGG Organisms
• Subject-specific entry points: DRUG, GLYCAN, REACTION, KAAS
http://www.genome.jp/kegg2.html

KEGG is a “biological systems” database integrating both molecular building block information and higher-level systematic information.
• Manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for metabolism, other cellular processes, and human diseases.
• Functional hierarchies and binary relations of KEGG objects, including genes and proteins, compounds and reactions, drugs and diseases, and cells and organisms.
• Gene catalogs of all complete genomes and some partial genomes with ortholog annotation (KO assignment), enabling KEGG PATHWAY mapping and BRITE mapping.
• A composite database of chemical substances and reactions representing our knowledge on the chemical repertoire of biological systems and environments.

Search Pathway Carbon fixation

Search “Pathway”

“Pathways” _motifs

Reactome

PubMed http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed

DAVID http://david.abcc.ncifcrf.gov/home.jsp

Protein Data Bank
Provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease.
http://www.rcsb.org/pdb/home.do

OMIM
This database is a catalog of human genes and genetic disorders. It contains textual information and references, as well as links to MEDLINE and sequence records.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

Protein family classification
• PROSITE (ExPASy)
• Pfam (Sanger Institute)
• SMART (EMBL)

PROSITE
Proteins can be grouped, on the basis of their sequences, into a limited number of families. Some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.
http://au.expasy.org/prosite/
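
A signature of this kind is commonly written as a PROSITE pattern, for example `N-{P}-[ST]-{P}.` for the N-glycosylation site. The Python sketch below converts the core of that pattern syntax into a regular expression and scans a sequence with it. It is a simplified illustration: it covers `x`, `[...]`, `{...}`, repetition counts and the `<`/`>` anchors, ignores rarer constructs, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repetition count.
        element = element.replace("(", "{").replace(")", "}")
        # x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Hypothetical sequence, for illustration only.
match = re.search(regex, "AANGSAA")
print(match.start(), match.group())
```

The conversion order matters: curly braces are rewritten first so that the braces produced from repetition counts are not mistaken for exclusion sets.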

PROSITE

Pfam
Multiple sequence alignments and HMMs of protein domains and families, at the Sanger Institute.
http://www.sanger.ac.uk/Software/Pfam/help/index.shtml

Browse interactions

SMART http://smart.embl-heidelberg.de/

Structure databases / interactions
• STRING (EMBL)
• BOND (Unleashed Informatics)
• Cytoscape
• DIP (UCLA)
• iHOP
• SPIN-PP (protein-protein interfaces in the PDB)
• MIPS (Mammalian Protein-Protein Interaction Database)
• IntAct (protein interactions from literature curation)

STRING http://string.embl.de

STRING search results

STRING graphical

STRING_ new node

BOND: the Biomolecular Object Network Databank
http://bond.unleashedinformatics.com

Cytoscape is an open source bioinformatics software platform for visualizing molecular interactions with gene expression profiles and other state data.

Node label position can be controlled by the new GUI in VizMapper.

Cytoscape plugins
Plugins are available for network and molecular profile analysis, for example:
• Filter the network
• Find active subnetworks / pathway modules
• Find clusters
• Determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network
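
Overrepresentation of a GO category can be tested in several ways; the slide does not say which statistic the plugin uses. As one common choice, the Python sketch below computes a one-sided hypergeometric p-value: the probability of seeing at least the observed number of annotated genes in the selected set by chance. The function and parameter names are assumptions, no multiple-testing correction is applied, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """One-sided hypergeometric p-value for over-representation.

    population: number of genes in the reference set
    annotated: genes in the reference set that carry the GO category
    sample: number of genes in the selected set
    annotated_in_sample: selected genes that carry the GO category
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(comb(annotated, i) * comb(population - annotated, sample - i)
               for i in range(annotated_in_sample, upper + 1))
    return tail / total


# Hypothetical counts, for illustration only.
print(enrichment_p_value(population=10, annotated=5, sample=2,
                         annotated_in_sample=2))
```

When many categories are tested at once, the resulting p-values should be corrected for multiple testing before any category is called enriched.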

Database of Interacting Proteins
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein interactions.
http://dip.doe-mbi.ucla.edu/
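
Combining several sources into one consistent set can be pictured with a short Python sketch. It treats an interaction as an unordered pair, so that A-B and B-A from different sources collapse into one record, and it keeps track of which sources reported each pair. The source names and protein identifiers are hypothetical, and DIP's actual curation process is far more involved than this.

```python
def merge_interactions(*sources):
    """Merge (source_name, pairs) inputs into one non-redundant mapping.

    Returns a dict mapping frozenset({protein_a, protein_b}) to the set
    of source names that reported that interaction.
    """
    merged = {}
    for source_name, pairs in sources:
        for protein_a, protein_b in pairs:
            key = frozenset((protein_a, protein_b))  # order-independent
            merged.setdefault(key, set()).add(source_name)
    return merged


# Hypothetical sources and identifiers, for illustration only.
merged = merge_interactions(
    ("study_1", [("P1", "P2"), ("P2", "P3")]),
    ("study_2", [("P2", "P1")]),
)
for pair, reported_by in merged.items():
    print(sorted(pair), sorted(reported_by))
```

Keeping the list of supporting sources for each pair makes it possible to filter later on how often an interaction was observed.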

iHOP http://www.ihop-net.org/UniPub/iHOP/