Bioinformatics of proteins Sequence structure and the symbiosis

Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them Maya Schushan The Ben-Tal lab

OUTLINE • Sequence: Databases, domains, motifs & annotations • Structure: Secondary structure, structure databases, visualization and identification of functional sites

Sequences, domains, motifs & annotations UniProt • UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). • In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.

Sequences, domains, motifs & annotations UniProt • The world's most comprehensive catalog of information on proteins • Sequence, function & more… • Comprised mainly of two databases: – Swiss-Prot: 366,226 protein entries last year, 412,525 now; high-quality annotation, non-redundant and cross-referenced to many other databases. – TrEMBL: 5,708,298 protein entries last year, 7,341,751 now; computer translation of the genetic information from the EMBL Nucleotide Sequence Database. Many proteins are poorly annotated, since only automatic annotation is generated.

Sequences, domains, motifs & annotations UniProt • Annotation description includes: – Function(s) of the protein; – Post-translational modification(s) such as carbohydrates, phosphorylation, acetylation and GPI-anchor; – Domains and sites, for example calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes; – Secondary structure, e.g. alpha helix, beta sheet; – Quaternary structure, e.g. homodimer, heterotrimer, etc.; – Similarities to other proteins; – Disease(s) associated with any number of deficiencies in the protein; – Sequence conflicts, variants, etc.

Sequences, domains, motifs & annotations UniProt • Connected to many other databases (e.g. Pfam, PROSITE, EC, GO, PDBsum, PDB (to be discussed…)) • Each sequence has a unique 6-character accession • Entries in Swiss-Prot also have IDs, which usually make sense (e.g. CADH1_HUMAN for a human cadherin) • Download the sequence in FASTA format

Sequences, domains, motifs & annotations UniProt: http://www.uniprot.org/ Type accession: P05102 Or ID: MTH1_HAEPH

Sequences, domains, motifs & annotations General data: name, origin, EC (enzymatic reaction)…

Sequences, domains, motifs & annotations Functional data, including the GO annotations Scroll down to find the sequence & download the FASTA

Sequences, domains, motifs & annotations Known sites, predicted/known secondary structures, Natural variation or mutagenesis

Sequences, domains, motifs & annotations The protein’s sequence in FASTA format Download Send to BLAST
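A downloaded FASTA file is plain text: a header line starting with ">" followed by one or more sequence lines. The sketch below shows one way to read such text in Python. The function name `parse_fasta` is hypothetical, the example sequence is a made-up placeholder (not the real P05102 sequence), and the header layout `db|accession|ID` is assumed to follow the UniProt style.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record: the residues below are invented for illustration only.
example = """>sp|P05102|MTH1_HAEPH placeholder description
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
header, sequence = records[0]
accession = header.split("|")[1]  # assumes a UniProt-style header
print(accession, len(sequence))
```

Sequence lines are joined without separators, so the record can be wrapped at any width.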

Sequences, domains, motifs & annotations References for all info in the page- important to take a look…

Sequences, domains, motifs & annotations Connections to other databases • Other sequence databases, e.g. GenBank • Related structures in the PDB (if available) • Model structure in the ModBase database (automatically derived!) • All sorts of domain/motif databases • The family related to the entry

Sequences, domains, motifs & annotations Pfam- domain database • Proteins are generally composed of one or more functional regions, commonly termed domains. • Different combinations of domains give rise to the diverse range of proteins found in nature. • The identification of domains that occur within proteins can therefore provide insights into their function.

Sequences, domains, motifs & annotations Pfam- domain database • The Pfam database is a large collection of protein domain families. • Each family is represented by multiple sequence alignments and hidden Markov models (HMMs). • Pfam entries are classified in one of four ways: Family: A collection of related proteins Domain: A structural unit which can be found in multiple protein contexts Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present Motifs: A short unit found outside globular domains

Sequences, domains, motifs & annotations Pfam- domain database There are two components to Pfam: • Pfam-A entries are high-quality, manually curated families. These Pfam-A entries cover a large proportion of the sequences in the sequence database. • Pfam-B entries are automatically generated. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. • Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

Sequences, domains, motifs & annotations Pfam- domain database http://pfam.sanger.ac.uk/ allows you to: • Analyze your protein sequence for Pfam matches • View Pfam family annotation and alignments • See groups of related families • Look at the domain organization of a protein sequence • Find the domains on a PDB structure • Query Pfam by keyword

Sequences, domains, motifs & annotations Pfam- domain database Searching for a certain protein accession

Sequences, domains, motifs & annotations Other domain/motif databases: • PROSITE • InterPro • BLOCKS • SMART • Etc…

Sequences, domains, motifs & annotations Classifying protein function • Each protein performs one (or more…) specific functions. This can be, e.g., catalysis of a specific enzymatic reaction, transport of an ion, interaction with a DNA molecule, etc. • To address specific functions easily, attempts have been made to enumerate and classify the various functions performed by proteins.

Sequences, domains, motifs & annotations Classifying protein function Example: some of the diverse functions exhibited by membrane proteins.

Sequences, domains, motifs & annotations Enzyme Commission number (EC number) • A numerical classification scheme for enzymes, based on the chemical reactions they catalyze • EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same EC number. • By contrast, UniProt database identifiers uniquely specify a protein by its amino acid sequence.

Sequences, domains, motifs & annotations Enzyme Commission number (EC number) • Every enzyme code consists of the letters "EC" followed by four numbers separated by periods. Those numbers represent a progressively finer classification of the enzyme. • For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4": • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule) • EC 3.4 are hydrolases that act on peptide bonds • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide • EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide

Sequences, domains, motifs & annotations Enzyme Commission number (EC number) • For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", as shown for an enzyme from Lactobacillus helveticus in BRENDA, the Comprehensive Enzyme Information System:

Sequences, domains, motifs & annotations Enzyme Commission number (EC number) • EC 1 - Oxidoreductases • EC 2 - Transferases • EC 3 - Hydrolases • EC 4 - Lyases • EC 5 - Isomerases • EC 6 - Ligases
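Because each EC number is a path from a broad class to a specific reaction, it can be split into its progressively finer levels with a few lines of Python. This is only an illustrative sketch: the names `EC_CLASSES`, `ec_levels` and `ec_class` are hypothetical, and the class table contains only the six top-level classes listed on the slide.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def _digits(ec_number):
    """Strip the 'EC' prefix and split the remainder on periods."""
    return ec_number.replace("EC", "").strip().split(".")


def ec_levels(ec_number):
    """Return the progressively finer prefixes of an EC number."""
    digits = _digits(ec_number)
    return ["EC " + ".".join(digits[:i]) for i in range(1, len(digits) + 1)]


def ec_class(ec_number):
    """Return the name of the top-level class of an EC number."""
    return EC_CLASSES[int(_digits(ec_number)[0])]


print(ec_levels("EC 3.4.11.4"))
print(ec_class("EC 3.4.11.4"))
```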

Sequences, domains, motifs & annotations Gene Ontology • A collaborative effort to address the need for consistent descriptions of gene products in different databases • The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. • The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels.

Sequences, domains, motifs & annotations Gene Ontology Cellular component A cellular component is just that, a component of a cell, with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

Sequences, domains, motifs & annotations Gene Ontology Biological process A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Examples of biological process terms are signal transduction or pyrimidine metabolism. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

Sequences, domains, motifs & annotations Gene Ontology Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

Sequences, domains, motifs & annotations Gene Ontology Topology The ontologies are in the form of directed acyclic graphs (DAGs), with the graph nodes being GO terms. The ontologies are hierarchically structured, but a more specialized term (child) can be related to more than one less specialized term (parent). E.g. the biological process hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process, because a biosynthetic process is a type of metabolic process and a hexose is a type of monosaccharide. When any gene is involved in hexose biosynthetic process, it is automatically annotated to both hexose metabolic process and monosaccharide biosynthetic process.
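The automatic annotation to parent terms amounts to collecting all ancestors of a term in the DAG. The following Python sketch illustrates the idea on a tiny hand-written fragment. The `PARENTS` table is an illustrative assumption, not an export of the real ontology (in particular, the shared grandparent "monosaccharide metabolic process" is included only to show that a term reached by two paths is counted once), and the function name `ancestors` is hypothetical.

```python
# Illustrative fragment: each term maps to its direct parents.
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


# A gene annotated to the child term is implicitly annotated to all of these.
implied = ancestors("hexose biosynthetic process")
print(sorted(implied))
```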

Sequences, domains, motifs & annotations Gene Ontology Example

Sequences, domains, motifs & annotations Gene Ontology Interface Search by gene or protein accession http://www.geneontology.org/

Sequences, domains, motifs & annotations Summary of the first part- protein sequence databases and tools • UniProt- the most comprehensive protein sequence database. Connected to many other databases and resources. • Pfam- domain database. Many others… InterPro, PROSITE, BLOCKS etc. • EC and GO classifications of protein function

OUTLINE • Sequence: Databases, domains, motifs & annotations • Structure: Secondary structure, structure databases, visualization and identification of functional sites

Investigating & visualizing protein structures From Sequence to Structure • All information about the native structure of a protein is encoded in the amino acid sequence + its native solution environment. • Many conformations are possible, yet only one or a few native folds are exhibited by each protein (Levinthal’s paradox) • Protein folding is driven by various forces: – Ionic forces – Hydrogen bonds – The hydrophobic effect – …

Investigating & visualizing protein structures Secondary Structure Prediction Why predict secondary structures of proteins? 1) When the structure of the protein is still unknown. This can serve as the first step for structure prediction: first predict the secondary structures, then how they are arranged together. 2) For calculating better multiple sequence alignments or pairwise alignments.

Investigating & visualizing protein structures Predicting 2° Structure • Each amino acid has a different propensity for being in each 2° structure. • For example, proline causes a kink which destroys the helix structure. Thus, proline is usually found only at the helix end. • The different structures also have typical lengths.

Investigating & visualizing protein structures Predicting 2° Structure http://www.predictprotein.org/

Investigating & visualizing protein structures Predicting 2° Structure All these and more…

Investigating & visualizing protein structures Predicting 2° Structure • Input: Sequence • Output: Secondary structure prediction, globular regions, coiled-coil regions, transmembrane helices, PROSITE motifs, bound cysteines… • The Meta PredictProtein server now allows many other options… http://www.predictprotein.org/meta.php

Investigating & visualizing protein structures Predicting 2° Structure • A common measure is Q3 = the % of amino acids whose secondary structure state was predicted correctly. • Today, Q3 is about 75-78% (as determined objectively by CASP). • The theoretical limit is thought to be about 90%.
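Q3 is straightforward to compute once a predicted and an observed secondary structure string are available for the same protein. The sketch below assumes the usual three-state alphabet (H = helix, E = strand, C = coil); the function name `q3` and both example strings are invented for illustration.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state assignment matches."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up 16-residue example: two positions differ.
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEEEEC"
print(q3(predicted, observed))  # 14 of 16 correct -> 87.5
```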

Investigating & visualizing protein structures Predicting 2° Structure E.g. PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/psiform.html • A simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST. • Using a very stringent cross-validation method to evaluate the method's performance, the recent version of PSIPRED achieves an average Q3 score of 80.7%.

Investigating & visualizing protein structures Protein 3D Structures A protein’s structure has a critical effect on its function: 1. Binding pockets PDB ID 1nw7

Investigating & visualizing protein structures Protein 3D Structures A protein’s structure has a critical effect on its function: 2. Areas of specific chemical/electrical properties

Investigating & visualizing protein structures Protein 3D Structures A protein’s structure has a critical effect on its function: 3. Importance of the global fold for function

Investigating & visualizing protein structures Tertiary structure = protein fold Complete 3-dimensional structure Why is it interesting? Isn’t the sequence enough? • The structure is more conserved • Detection of distant evolutionary relationships • A key to understanding protein function • Structure-based drug design

Investigating & visualizing protein structures RCSB- the Protein Data Bank • The main, comprehensive database for biological macromolecular structures • Each structure receives a PDB ID: a unique 4-character identifier • Search by author, PDB ID or any keyword • Download structures

Investigating & visualizing protein structures RCSB- The Protein Data Bank http://www.rcsb.org/pdb/home.do PDB ID: 3mht

Investigating & visualizing protein structures RCSB- The Protein Data Bank • Download structure • The paper describing the structure • Data concerning the structure: resolution, R-value… • Display structure

Investigating & visualizing protein structures RCSB- The Protein Data Bank PDB files have a specific format: • TITLE • REMARK • COMPND • JRNL- reference • SEQRES- the original sequence • HELIX, SHEET- secondary structure • ATOM- the actual protein/DNA/RNA chain • HETATM- additional atoms such as ligands, water etc.

Investigating & visualizing protein structures RCSB – The Protein Data Bank PDB files have a specific format:
[Figure: excerpt of ATOM and HETATM records (MET and ILE residues of chain A, an SAH ligand and HOH water molecules), annotated with the columns: atom name; residue or molecule; chain, if it exists; numbering; coordinates X, Y, Z. For example, atom 7 is SD of MET 1 in chain A, at (-29.059, 28.614, 71.539).]
http://www.wwpdb.org/documentation/format3.1-20080211.pdf
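ATOM and HETATM records are fixed-width: each field occupies a defined column range, so they are best read by slicing rather than by splitting on whitespace. The following is a minimal Python sketch of such a reader. The function name `parse_atom_line` is hypothetical, and the sample line is assembled here with a format string (so its columns are guaranteed to line up) from values matching the MET SD atom of the excerpt.

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM record into its main fields (0-based slices)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21].strip(),        # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }


# Build a correctly aligned sample record instead of typing spaces by hand.
sample = "%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f          %2s" % (
    "ATOM", 7, " SD", "", "MET", "A", 1, "",
    -29.059, 28.614, 71.539, 1.00, 26.90, "S",
)

atom = parse_atom_line(sample)
print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Slicing by column matters because adjacent fields may touch, for example when a coordinate uses its full width.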

Investigating & visualizing protein structures RCSB – The Protein Data Bank More Sequences Than Structures Discrepancy between the number of known sequences and solved structures: 5,047,807 UniRef90 entries vs. 19,988 structures that are non-redundant at 90% identity. Computational methods are needed to obtain more structures.

Investigating & visualizing protein structures Fold classification Classification: clustering proteins into structural families Motivation? • Profound analysis of evolutionary mechanisms • Constraints on secondary structure packing? • Classification at domain level

Investigating & visualizing protein structures Fold classification http://scop.berkeley.edu • The SCOP database aims to provide a description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the PDB. • The SCOP classification of proteins has been constructed manually, but with the assistance of tools to make the task manageable and help provide generality.

Investigating & visualizing protein structures Fold classification 1. Family: Clear evolutionary relationship. Generally, this means that pairwise residue identities between the proteins are 30% and greater. 2. Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.

Investigating & visualizing protein structures Fold classification 3. Fold: Major structural similarity Same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins of the same fold category may not have a common evolutionary origin: the structural similarities could arise from convergent evolution.

Investigating & visualizing protein structures [Figure: growth in the number of unique folds as defined by SCOP, by year]

Investigating & visualizing protein structures Fold classification: CATH • Hierarchical classification of protein domain structures in the PDB. • Domains are clustered at five major levels: Class, Architecture, Topology, Homologous superfamily, Sequence family

Investigating & visualizing protein structures Fold classification • Class [C] - derived from secondary structure content (automatic): alpha, beta, alpha and beta, few secondary structures. • Architecture [A] - derived from the orientation of secondary structures (manual) • Topology [T] - derived from topological connections and secondary structures (by automated structural alignment) • Homologous Superfamily [H] / sequence family - clusters of similar structures & functions.

Investigating & visualizing protein structures SCOP vs. CATH Same SCOP family, different CATH topologies: d1rh6b (a.6.1.7) / 1rh6B00 (1.10.1660.20) vs. d1g4da (a.6.1.7) / 1g4dA00 (1.10.10) Csaba et al., 2009 Different SCOP classes, same CATH homologous superfamily: d1bbxd (b.34.13.1) / 1bbxD00 (2.40.50.40) vs. d1rhpa (d.9.1.1) / 1rhpA00 (2.40.50.40)

Investigating & visualizing protein structures SCOP vs. CATH • SCOP levels: class, fold, superfamily • CATH levels: class, architecture, topology, homologous superfamily, sequence family • CATH is more directed toward structural classification; SCOP pays more attention to evolutionary relationships

Investigating & visualizing protein structures PDBsum • A database providing an overview of all biological macromolecular structures • Connected to UniProt: find the sequence accession of a known PDB ID • Detailed description of many structure properties, e.g.: – EC number – Chains & ligands and their interactions – Clefts – Secondary structure – FASTA sequence of the structure – …

Investigating & visualizing protein structures PDBsum http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ Search by: • PDB ID • Free text • Sequence

Investigating & visualizing protein structures PDBsum • Useful tabs • UniProt accession • Chains & ligands

Investigating & visualizing protein structures PDBsum • GO annotation • EC and reaction • Highlights from the related paper

Investigating & visualizing protein structures PDBsum Protein tab: secondary structure from the PDB

Investigating & visualizing protein structures PDBsum Ligand tab • The ligand’s structure • LigPlot: predicts the residues that bind the ligand

Investigating & visualizing protein structures Before the invention of computer graphics, trained artists were employed to hand-draw understandable pictures of proteins. Irving Geis (1908 – 1997)

Investigating & visualizing protein structures PyMOL Viewer features: • Viewing 3D Structures • Rendering Figures • Giving Presentations • Animating Molecules • Sharing Visualizations • Exporting Geometry

Investigating & visualizing protein structures PyMOL Viewer: Potassium channel (KcsA) from Streptomyces lividans, PDB ID 1bl8. Doyle et al., 1998

Investigating & visualizing protein structures View Manipulation • Identify the different parts of the screen: -the external GUI window -the internal GUI window. • The internal window contains the viewer, which displays the molecule, and the command line.

Investigating & visualizing protein structures View Manipulation To manipulate an object, we use the letter icons near its name - A – Action - S – Show - H – Hide - L – Label - C – Color

Investigating & visualizing protein structures View Manipulation Change the representation of the object to “Cartoon” using: S (show) As Cartoon

Investigating & visualizing protein structures View Manipulation Other protein representations under “S” “As”: • Lines • Ribbons • Sticks • Dots • Spheres • Surface

Investigating & visualizing protein structures Part 1: View Manipulation Color by chain: C (color) by chain

Investigating & visualizing protein structures View Manipulation Other coloring options: • Color by spectrum: b-factor, rainbow • Color by secondary structure (“SS”) • Color by element • Many colors are available; others can be defined in the external GUI under “settings” “colors…” “new”

Investigating & visualizing protein structures Selecting and manipulating specific parts of the molecule • Select specific amino acids by clicking on them. • Select a range in the sequence by clicking the first residue and then “shift+click” on the last residue. • The selection will be indicated on the structure (in pink dots).

Investigating & visualizing protein structures Selecting and manipulating specific parts of the molecule • In the object list, a new object “(sele)” was added. • This object represents the current selection • You can manipulate it with the buttons next to the object. For example, change its representation to sticks • (“S” “As” “Sticks”)

Investigating & visualizing protein structures Selecting and manipulating specific parts of the molecule • Give a different name to the selection, so you can easily manipulate it later. • Select the first chain again (using the sequence) and change its name to “chain1” by pressing “Action” “Rename Selection” and typing “chain1”.

Investigating & visualizing protein structures Making high-quality photos 1. Change the background color to white, with “Display Background White” on the external GUI menu:

Investigating & visualizing protein structures Making high-quality photos 2. Type in the command line: “ray [x], [y]” and wait… 3. Save the image with “Save” “Image”. Take care not to click on the image before saving!
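The same steps can also be typed as commands, which makes a figure reproducible. The Python sketch below only assembles the text of a PyMOL command script; it does not call PyMOL itself. The helper name `build_pymol_script` and the output file name are hypothetical, while `fetch`, `hide`, `show`, `util.cbc`, `bg_color`, `ray` and `png` are PyMOL commands (`fetch` needs network access; `util.cbc` colors by chain).

```python
def build_pymol_script(pdb_id, width, height, image_path):
    """Return PyMOL commands that reproduce the GUI steps described above."""
    commands = [
        f"fetch {pdb_id}",          # download and load the structure
        "hide everything",
        "show cartoon",             # S > As > Cartoon
        "util.cbc",                 # C > by chain
        "bg_color white",           # Display > Background > White
        f"ray {width}, {height}",   # ray-trace at the requested size
        f"png {image_path}",        # save the rendered image
    ]
    return "\n".join(commands)


script = build_pymol_script("1bl8", 1200, 900, "kcsa.png")
print(script)
```

The printed text can be pasted into the PyMOL command line or saved to a `.pml` file and run from there.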

Investigating & visualizing protein structures ConSurf The goal: identification of functionally important amino acids that mediate the interaction of a query protein with ligands, DNA/RNA, other proteins etc. Approach: Functionally important amino acid sites are often evolutionarily conserved

Investigating & visualizing protein structures ConSurf Beta Class N6-Adenine DNA Methyltransferase

Investigating & visualizing protein structures ConSurf The 3D structure of Beta Class N6-Adenine DNA Methyltransferase has already been solved: PDB ID: 1nw7

Investigating & visualizing protein structures ConSurf • The ConSurf webserver calculates the evolutionary rate for each position in the protein • The results, mapped on the structure, reveal residues crucial for function and structural stability • In this case, the ligand is bound in a highly conserved cluster of residues http://consurf.tau.ac.il/

Investigating & visualizing protein structures Consurf The consensus sequence approach: . . W. . . . E. . G. .

Investigating & visualizing protein structures Consurf However, some sequences might be close homologues of each other . . W. . . . E. . primates . . G. . Conclusion: Assessing conservation without taking into consideration the phylogenetic relations may lead to uneven sampling in sequence space

Investigating & visualizing protein structures Consurf Phylogenetic reconstruction may be used to distinguish between two possible cases: 1. Structural/functional constraints that truly result in sequence conservation as a result of evolutionary pressure. 2. Short evolutionary time that may be mistaken as sequence conservation, while no evolutionary pressure affects the examined position.

Investigating & visualizing protein structures ConSurf Rate4Site: an algorithm for calculating the evolutionary rate at each amino acid site. Definition: Evolutionary rate = number of AA replacements/(site*year). Conserved sites evolve slowly; variable sites evolve rapidly. Pupko et al., 2002; Mayrose et al., 2005

Investigating & visualizing protein structures ConSurf Web-Server: http://consurf.tau.ac.il/ Landau et al., 2005

Investigating & visualizing protein structures ConSurf coloring bar The Rate4Site conservation scores are continuous values, not integers, which makes them hard to display directly on a structure. Hence, the ConSurf webserver divides them into 9 bins: 1 for highly variable, 9 for the most conserved.
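The idea of turning continuous rates into nine color grades can be sketched in a few lines of Python. This is not ConSurf's actual binning procedure: the sketch simply assumes equal-width bins between the lowest and highest rate, and the function name `grade_scores` and the example rates are invented. The only property carried over from the slide is the direction of the scale (slowly evolving sites get grade 9, rapidly evolving sites get grade 1).

```python
def grade_scores(rates, n_grades=9):
    """Map evolutionary rates to integer grades; lowest rate -> highest grade."""
    lowest, highest = min(rates), max(rates)
    if highest == lowest:
        # All sites evolve at the same rate: place them in the middle grade.
        return [(n_grades + 1) // 2] * len(rates)
    width = (highest - lowest) / n_grades
    grades = []
    for rate in rates:
        # Bin 0 holds the slowest sites; clamp the maximum into the last bin.
        bin_index = min(int((rate - lowest) / width), n_grades - 1)
        grades.append(n_grades - bin_index)
    return grades


# Invented rates for five sites, from slow (conserved) to fast (variable).
rates = [0.1, 0.3, 0.9, 1.4, 1.9]
grades = grade_scores(rates)
print(grades)
```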

Investigating & visualizing protein structures ConSurf The ConSurf webserver Essential input (MSA and tree constructed by ConSurf): 1. PDB ID/PDB file/model structure and chain Essential and optional input, through “advanced options”: 1. PDB ID/PDB file/model structure and chain 2. Constructed MSA, with the query sequence included 3. Phylogenetic tree

http://consurf.tau.ac.il/index.html

Essential and optional input: • 1NW7 (check in PDBsum…) • MSA • Sequence name in the MSA • Tree • Bayesian or maximum likelihood • Email

Essential input: 1NW7 (check in PDBsum…)

Essential input: • Email • Alignment method • SWISS-PROT or UniProt • Additional BLAST options

Calculation Finished: • Easy web-based viewer • Viewer for producing medium-quality images* • View scores • Produced or input MSA • View phylogenetic tree • Script for coloring in RasTop* • Instructions for PyMOL*

Investigating & visualizing protein structures Consurf Jmol- Easy web-based viewer

Investigating & visualizing protein structures ConSurf Summary - MSA Quality • ConSurf is dependent on the quality of the MSA. • When an MSA is not given by the user, sequences are automatically gathered by PSI-BLAST and aligned by CLUSTALW with default parameters. • Even though these alignments are usually good, it is highly recommended to inspect the alignment manually and with other tools in order to improve the quality of the evolutionary data.

Investigating & visualizing protein structures ConSurf A caveat: In some cases the functionally important region may not be conserved at all. Example: the peptide-binding groove of the MHC class I heavy chain. PDB ID: 2vaa

Investigating & visualizing protein structures PatchFinder: identification of functional sites Patch: a spatially continuous cluster of surface residues. Problems: – Subjectivity of boundaries. – Difficult to apply to large datasets

Investigating & visualizing protein structures PatchFinder Input: 1. Protein structure 2. Multiple sequence alignment (MSA) Steps: (1) Assignment of conservation scores (Rate4Site [3]) (2) Identification of exposed residues (3) Extraction of the surface patch of conserved residues with the highest statistical significance (ML-patch) (4) Identification of non-overlapping secondary patches References: [1] Nimrod et al., 2005 [2] Nimrod et al., 2008 [3] Mayrose et al., 2004

Investigating & visualizing protein structures PatchFinder- http://patchfinder.tau.ac.il/

Investigating & visualizing protein structures Summary of structure-related databases & tools • Secondary structure prediction- PredictProtein, Meta PredictProtein and PSIPRED. • PDB, SCOP and CATH- collection and classification of experimentally determined structures. • Structure visualization- PyMOL • Conservation analysis- ConSurf and PatchFinder

Protein structure prediction

Structure Prediction Approaches 1. Homology (Comparative) Modeling Based on sequence similarity with a protein for which a structure has been solved. 2. Threading (Fold Recognition) Requires a structure similar to a known structure 3. Ab-initio fold prediction Not based on similarity to a sequence/structure

Ab-initio Structure prediction from “first principles”: Given only the sequence, try to predict the structure based on physico-chemical properties (energy, hydrophobicity etc.) • Used when all else fails; works for novel folds • Shows that we understand the process

The Force Field (energy function) A group of mathematical expressions describing the potential energy of a molecular system • Each expression describes a different type of physicochemical interaction between atoms in the system: • Van der Waals forces • Covalent bonds • Hydrogen bonds • Charges • Hydrophobic effects Non-bonded terms

Approaches to Ab-initio Prediction 1. Molecular Dynamics • Simulates the forces that govern the protein within water. • Since proteins usually fold naturally, this would lead to the native protein structure. Problems: • Thousands of atoms • Huge number of time steps to reach the folded protein, so it is feasible only for very small proteins

Approaches to Ab-initio Prediction 2. Minimal Energy Assumption: the folded form is the minimal energy conformation of a protein Main principles: • Define an energy function. • Search for the 3D conformation that minimizes the energy.

Ab-initio 2. Minimal Energy • Use of simplified energy function • Search methods for minimal energy conformation: – Greedy search – Simulated annealing –…
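Simulated annealing, one of the search methods named above, can be illustrated on a toy problem. The sketch below is not a protein folding program: the "conformation" is a single number, the energy function is an invented one-dimensional landscape with several local minima, and all names and parameter values (`energy`, `simulated_annealing`, step size, temperatures) are assumptions chosen for illustration. What it does show is the core idea: downhill moves are always accepted, uphill moves are accepted with a probability that shrinks as the temperature is lowered.

```python
import math
import random


def energy(x):
    """Toy one-dimensional energy landscape with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def simulated_annealing(start, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Return the lowest-energy point visited and its energy."""
    rng = random.Random(seed)
    current, current_e = start, energy(start)
    best, best_e = current, current_e
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = current + rng.gauss(0.0, 0.5)
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Metropolis criterion: accept uphill moves with probability exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


start = 4.0
best_x, best_e = simulated_annealing(start)
print(best_x, best_e)
```

A greedy search would correspond to never accepting uphill moves, which makes it prone to getting stuck in the nearest local minimum.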

Ab-initio • Current methods (e.g. Rosetta) primarily utilize the fact that although we are far from observing all protein folds, we have probably seen nearly all substructures: Local sequence-structure relationships: • A library of known sub-structures (fragments of fewer than 10 residues) is created. • A range of possible conformations is selected for each fragment in the query protein. Moult J. Philos. Trans. R. Soc. B. 361: 453– 458 (2006)

Ab-initio Non-local sequence-structure relationships: • The primary nonlocal interactions considered are hydrophobic burial, electrostatics, main-chain hydrogen bonding etc. Structures that are consistent with both the local and non-local interactions are generated by minimizing the non-local interaction energy in the space defined by the local structure distributions. Moult J. Philos. Trans. R. Soc. B. 361: 453– 458 (2006)

Ab-initio - Example Moult J. Philos. Trans. R. Soc. B. 361: 453– 458 (2006)

Fold Recognition (Threading) Given a sequence and a library of folds, thread the sequence through each fold. Take the one with the highest score. • Method will fail if new protein does not belong to any fold in the library. • Score of the threading is computed based on known physical chemistry properties and statistics of amino acids.

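Threading: example • structural template • neighbor definition • energy function
Sequence threaded onto the template: ACCECADAAC
Pair energies Eab (excerpt):
      A    C    D    E
A    -3   -1    0    0
C    -1   -4    1    2
D     0    1    5    6
E     0    2    6    7
Score (sum over neighboring pairs): -3 -1 -4 -4 -1 -4 -3 -3 = -23
[Figure: structural template with numbered positions 1-10 and their neighbor contacts]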
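The score in this example is the sum of pair energies over all residue pairs that are neighbors in the template. The Python sketch below reproduces that arithmetic. The energy table and the sequence come from the slide; the contact list is hypothetical, because the template diagram is not recoverable from the text, and was chosen only so that the pair terms match the eight values listed (-3 -1 -4 -4 -1 -4 -3 -3). The names `PAIR_ENERGY`, `pair_energy` and `threading_score` are likewise invented.

```python
# Symmetric pair energies from the slide, stored once per unordered pair.
PAIR_ENERGY = {
    ("A", "A"): -3, ("A", "C"): -1, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "C"): -4, ("C", "D"): 1, ("C", "E"): 2,
    ("D", "D"): 5, ("D", "E"): 6,
    ("E", "E"): 7,
}


def pair_energy(a, b):
    """Look up the energy of an unordered residue pair."""
    return PAIR_ENERGY[tuple(sorted((a, b)))]


def threading_score(sequence, contacts):
    """Sum pair energies over template contacts (1-based positions)."""
    return sum(pair_energy(sequence[i - 1], sequence[j - 1]) for i, j in contacts)


sequence = "ACCECADAAC"
# Hypothetical neighbor list (assumption): 3 A-A, 2 A-C and 3 C-C contacts.
contacts = [(1, 6), (1, 2), (2, 3), (3, 5), (9, 10), (5, 10), (6, 9), (8, 9)]

terms = [pair_energy(sequence[i - 1], sequence[j - 1]) for i, j in contacts]
print(terms, threading_score(sequence, contacts))
```

Threading the same sequence onto a different template only changes the contact list, so candidate folds can be ranked by calling `threading_score` once per template.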

Find best fold for a protein sequence: Fold recognition (threading) [Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is threaded through each potential fold 1 … n in the library, and each threading receives a score (values shown: -10, -123, 20.5)]

GenTHREADER • Align the query sequence with each template (requires some sequence homology!) • Assess the alignment by: – Sequence alignment score – Pairwise potentials – Solvation function • Record lengths of: alignment, query, template • The overall score is computed using a neural network. Jones DT et al. J. Mol. Biol. 287:797-815 (1999)

GenTHREADER Jones DT et al. J. Mol. Biol. 287:797-815 (1999)

I-TASSER- Hybrid Approach • In a recent community-wide blind experiment, CASP7, I-TASSER generated the best 3D structure predictions among all automated servers. • Based on secondary-structure-enhanced threading and the iterative implementation of the Threading ASSEmbly Refinement (TASSER) program. • For predicting the biological function of the protein, the I-TASSER server matches the predicted 3D models to the proteins in 3 independent libraries, which consist of proteins with known enzyme classification (EC) numbers, gene ontology (GO) vocabulary, and ligand-binding sites.

I-TASSER

Test Case: Rosetta vs. TASSER Grey: Crystal structure of Betannnn: Purple: Rosetta prediction, starting from homology modeling Green: TASSER prediction

Homology Modeling – Basic Idea 1. A protein structure is defined by its amino acid sequence. 2. Closely related sequences adopt highly similar structures; distantly related sequences may still fold into similar structures. 3. The three-dimensional structure of proteins from the same family is more conserved than their primary sequences. Example: triosephosphate isomerases, 44.7% sequence identity, 0.95 Å RMSD
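The two numbers quoted for the example, sequence identity and RMSD, can each be computed with a short function. The Python sketch below is illustrative only. It assumes that the two sequences are already aligned (gaps written as "-") and that the two coordinate sets are already superimposed and matched atom by atom; it does not perform the alignment or the superposition. Identity is defined here as identical residues divided by the number of columns without gaps, which is one of several conventions in use. All names and example values are invented.

```python
import math


def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched, superimposed atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


# Invented toy data: 3 of 4 ungapped columns match; one atom is displaced by 2.
identity = percent_identity("ACDE-", "ACDFG")
deviation = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
print(identity, deviation)
```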

General Scheme 1. Searching for structures related to the query sequence 2. Selecting templates 3. Aligning the query sequence with template structures 4. Building a model for the query using information from the template structures 5. Evaluating the model Fiser A et al. Methods in Enzymology 374:461-491 (2004)

Homology modeling requires handling structures & sequences • Query- only the protein sequence is available, usually found in the UniProt database • Template- after identification, both structural and sequence-related data should be found: UniProt (or NCBI databases), RCSB and PDBsum

Homology modeling- query-template alignment Different levels of similarity between the template & query call for different computational approaches:

Homology modeling- model evaluation Evolutionary Conservation http://consurf.tau.ac.il

Homology Modeling • The accuracy of the model depends on its sequence identity with the template: