
  • Number of slides: 54

Structural genomics efforts are gaining momentum and helping to assign functions to ORFs and to fill in the space of all possible protein folds.

Atomic Resolution Structural Biology
[Figure labels: Organ, Tissue, Cell, Molecule, Atoms]
• A cell is an organization of millions of molecules
• Proper communication between these molecules is essential to the normal functioning of the cell
• To understand communication: *Determine the arrangement of atoms*

Atomic Resolution Structural Biology
Determine atomic structure to analyze why molecules interact

From John Norvell - NIH

The Reward: Understanding Control
[Figure labels: Anti-tumor activity, Duocarmycin, Atomic interactions, Shape]

Atomic Structure in Context
[Figure labels: NER, RPA, BER, RR; Molecule, Pathway, Activity; Structural Genomics, Structural Proteomics, Systems Biology]

The Strategy of Atomic Resolution Structural Biology
• Break down complexity so that the system can be understood at a fundamental level
• Build up a picture of the whole from the reconstruction of the high resolution pieces
• Understanding basic governing principles enables prediction, design, control

Structural Genomics Pipeline
• Genomic Based Target Selection
• PDB Deposition & Release
• Publication
• Data Collection
• Functional Annotation
• Structure Determination
• Isolation, Expression, Purification, Crystallization

High-throughput Biological Data
• Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming
– genomic sequences
– gene expression data
– mass spec. data
– protein-protein interaction
– protein structures
– ...

Structural Proteomics: The Motivation
[Chart: Structures vs. Sequences; vertical axis from 0 to 200000]

[Chart: number of released entries by year]

History of the PDB
1970s
– Community discussions about how to establish an archive of protein structures
– Cold Spring Harbor meeting in protein crystallography
– PDB established at Brookhaven (October 1971; 7 structures)
1980s
– Number of structures increases as technology improves
– Community discussions about requiring depositions
– IUCr guidelines established
– Number of structures deposited increases
1990s
– Structural genomics begins
– PDB moves to RCSB
2000s
– wwPDB formed

Protein structural data explosion
Protein Data Bank (PDB): 14500 structures (6 March 2001)
• 10900 X-ray crystallography
• 1810 NMR
• 278 theoretical models
• others...

Policies and Practices for 3D Coordinate Data
• Structural biology
– Release of coordinates upon publication required by most journals worldwide
– Deposition and release required by many US funding agencies
– Some depositions from pharmaceutical companies
• Structural genomics
– Deposition of coordinates upon completion of refinement
– Release: US 6 weeks, international 6 months

Sequence versus structural data
• Despite structural genomics efforts, growth of the PDB slowed somewhat in 2001-2002. Structural genomics initiatives are now in full swing and growth is up again.
• More than 300 completely sequenced genomes
• Increasing gap between structural and sequence data

Protein Structure Initiative
• Organize and recruit interested structural biologists and structural biology centres from around the world
• Coordinate target selection
• Develop new kinds of high-throughput techniques
• Solve, solve, solve…

Structural Proteomics Status
• 20 registered centres (~30 organisms)
• 82700 targets have been selected
• 52705 targets have been cloned
• 29855 targets have been expressed
• 12311 targets are soluble
• 1493 X-ray structures determined
• 502 NMR structures determined
• 1743 structures deposited in PDB
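
The counts on the status slide describe a funnel, and the attrition between stages is easier to see as ratios. The following is a minimal sketch that only restates the slide's numbers; the stage names, the `attrition` helper, and the choice to treat the stages as a strict linear sequence are assumptions made for illustration.

```python
# Counts copied from the status slide; the linear ordering of the
# stages is an assumption made for this illustration.
STAGES = [
    ("selected", 82700),
    ("cloned", 52705),
    ("expressed", 29855),
    ("soluble", 12311),
    ("deposited in PDB", 1743),
]


def attrition(stages):
    """Return rows of (stage, count, fraction of previous stage, fraction of first stage)."""
    first = stages[0][1]
    prev = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


if __name__ == "__main__":
    for name, count, step, overall in attrition(STAGES):
        print(f"{name:18s} {count:6d}  step {step:6.1%}  overall {overall:6.1%}")
```

Read this way, roughly 2% of selected targets had reached a PDB deposition at the time of the slide, with the largest proportional drop between soluble protein and deposited structure.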

Protein Structure Initiative (PSI) Long-Range Goal
To make three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences

PSI Benefit
• Collection of structures will address key biochemical and biophysical problems
– Protein folding, prediction, folds, evolution, etc.
• Benefits to biologists
– Technology developments
– Structural biology facilities
– Availability of reagents and materials
– Experimental outcome data on protein production and crystallization

Expected PSI Benefits
• Structure provides information on function and will aid in the design of experiments
• Development of better therapeutic targets from comparisons of protein structures from:
– Pathogens vs. hosts
– Diseased vs. normal tissues

Structural Genomics Basics
• Target strategy: systematic sampling of protein sequence families to search for unique protein structures
• Experimental determination of unique protein structures in high-throughput operation
• Computational modeling of structures of sequence family homologs

PSI Pilot Phase Lessons Learned
1. Structural genomics pipelines can be constructed and scaled up
2. High-throughput operation works for many proteins
3. Genomic approach works for structures
4. Bottlenecks remain for some proteins
5. A coordinated target selection policy must be developed
6. Homology modeling methods need improvement

We have the human genome sequence. Now what?

The work is just beginning…

Gene-Finding Strategies (Genomic Sequence)
Content-Based: bulk properties of sequence
• Open reading frames
• Codon usage
• Repeat periodicity
• Compositional complexity
Site-Based: absolute properties of sequence
• Consensus sequences
• Donor and acceptor splice sites
• Transcription factor binding sites
• Polyadenylation signals
• “Right” ATG start
• Stop codons out-of-context
Comparative: inferences based on sequence homology
• Protein sequence with similarity to translated product of query
• Modular structure of proteins usually precludes finding complete gene
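
As an illustration of the simplest content-based signal listed above, open reading frames, here is a minimal sketch of an ORF scan. It is not a gene finder: it assumes the standard stop codons, looks only at the forward strand, ignores introns and codon usage, and the function name `find_orfs` and the `min_codons` threshold are hypothetical choices for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def find_orfs(seq, min_codons=3):
    """Find ATG...stop open reading frames on the forward strand.

    Returns a list of (start, end, frame) tuples, where start is 0-based,
    end is exclusive and includes the stop codon, and frame is 0, 1 or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(dna):
        print(frame, start, end, dna[start:end])
```

A real pipeline would also scan the reverse complement and combine ORF length with the site-based and comparative evidence from the same slide.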

GENOME ANNOTATION
• Two main levels:
– STRUCTURAL ANNOTATION: finding genes and other biologically relevant sites, thus building up a model of the genome as objects with specific locations
– FUNCTIONAL ANNOTATION: objects are used in database searches (and experiments); the aim is to attribute biologically relevant information to the whole sequence and to individual objects

What Is Annotation?
• Interpreting raw sequence data into useful biological information
• Identification, structural description, and characterisation of putative protein products and other features in primary genomic sequence
• Addition of as much reliable and up-to-date information as possible to describe a sequence
• Information is attached to genomic coordinates with a start and end point, and can occur at different levels
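
The last point, information attached to genomic coordinates at different levels, can be sketched as a small data structure. Everything here is illustrative: the `Feature` class, its field names, the 1-based inclusive coordinate convention, and the `features_at` helper are assumptions, not a format taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotation attached to a coordinate range (hypothetical layout)."""
    seqid: str        # sequence or chromosome identifier
    start: int        # 1-based, inclusive (assumed convention)
    end: int          # 1-based, inclusive
    kind: str         # annotation level, e.g. "gene", "exon", "domain"
    notes: tuple = ()  # free-text descriptions

    def __post_init__(self):
        if self.start < 1 or self.end < self.start:
            raise ValueError("need 1 <= start <= end")

    def overlaps(self, other):
        """True if both features share at least one position on the same sequence."""
        return (self.seqid == other.seqid
                and self.start <= other.end
                and other.start <= self.end)


def features_at(features, seqid, position):
    """Return every feature, at any level, that covers one position."""
    return [f for f in features
            if f.seqid == seqid and f.start <= position <= f.end]


if __name__ == "__main__":
    gene = Feature("chr1", 100, 500, "gene", ("putative kinase",))
    exon = Feature("chr1", 100, 200, "exon")
    print(features_at([gene, exon], "chr1", 150))
```

Querying a single position returns the gene-level and exon-level records together, which is what "different levels" amounts to in practice.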

Annotation is the description of:
• Function(s) of the protein
• Post-translational modification(s)
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Disease(s) associated with deficiency(ies) in the protein
• Sequence conflicts, variants, etc.

Additional information for proteins
• FUNCTION
• CATALYTIC ACTIVITY
• COFACTOR
• INDUCTION
• ENZYME REGULATION
• PATHWAY
• SUBUNIT
• DOMAIN
• SPLICE PRODUCTS
• POLYMORPHISM
• DISEASE
• TISSUE SPECIFICITY
• DEVELOPMENTAL STAGE
• SUBCELLULAR LOCATION
• TRANSMEMBRANE

Annotation sources:
• Publications that report experimental data
• Review articles on specific protein families or groups of proteins
• Protein sequence analysis
• External experts on the organism
• Comparison with other, related sequenced organisms

Predicting function from sequence similarity
• Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function
• Paralogs: arise from duplication within a genome; the second copy may have a new or changed function
• Equivalogs: proteins with equivalent functions
• Analogs: proteins catalyzing the same reaction but not structurally related
• Some enzymes may have sequence similarity simply because of a common catalytic site, substrate, or pathway

From sequence to function

PROTEIN SEQUENCE ANALYSIS FROM HOMOLOGY
• Protein sequence can come from gene predictions, literature or peptide sequencing
• Simplest case: match for the whole sequence in a database, allowing determination of structure and function
• In between: partial matches across the sequence to diverse or hypothetical proteins
• Difficult case: no match, so information has to be derived from amino acid properties, pattern searches, etc.

TYPES OF HOMOLOGY
[Diagram labels: Superfamily; PROTEIN/DOMAIN; A, B; B1, B2]
• Duplication within species: paralogs may have different functions
• Speciation: orthologs may have different functions; if same, equivalogs

EXAMPLE OF ANNOTATION PIPELINE
[Flowchart boxes, in extracted order]
• NEW SEQUENCES FROM SEQUENCING PROJECT
• SEARCH FOR PATTERNS & FUNCTION DBs
• BLAST/FASTA
• NO SIGNIFICANT HITS
• PSI-BLAST
• SIGNIFICANT HITS
• IF EQUIVALOG, INFER FUNCTION
• Search SCOP
• NB look out for multi-domain proteins, put into genome context
• NO SIGNIFICANT HITS
• HIT TO 3D PROTEIN STRUCTURE & FUNCTION
• PHYSICAL PROPERTIES, LOCALISATION ETC
• SIGNIFICANT HITS
• ASSIGN PROTEIN FAMILY OR DOMAIN, CF OTHER PROTEINS IN FAMILY, INFER FUNCTION
• Supplement with manual curation and use evidence tags

BLAST (http://www.ncbi.nlm.nih.gov/blast/)
BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type.
PSI-BLAST (http://bioportal.weizmann.ac.il/education/materials/gcg/psiblast.html)
PSI-BLAST iteratively searches one or more protein databases for sequences similar to one or more protein query sequences. PSI-BLAST is similar to BLAST except that it uses position-specific scoring matrices derived during the search.
Position-specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position-specific scoring matrix, PSSM) is constructed automatically from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero.

Many functionally and evolutionarily important protein similarities are recognizable only through comparison of three-dimensional structures. When such structures are not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities. There is a large literature on the definition and construction of these patterns, which have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models. In essence, for each position in the derived pattern, every amino acid is assigned a score. If a residue is highly conserved at a particular position, that residue is assigned a high positive score, and others are assigned high negative scores. At weakly conserved positions, all residues receive scores near zero. Position-specific scores can also be assigned to potential insertions and deletions.
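
The scoring idea described above can be illustrated with a toy position-specific scoring matrix. This is a minimal sketch under simplifying assumptions, not PSI-BLAST's actual procedure: it uses a uniform background frequency of 1/20, a flat pseudocount, no sequence weighting, and no gap scores. The names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log2-odds score table, one {residue: score} dict per column.

    Assumes an ungapped alignment of equal-length sequences and a uniform
    background frequency (both simplifications for illustration).
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in alignment if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    pssm = build_pssm(["ACD", "ACE", "ACF", "AGH"])
    print(round(pssm[0]["A"], 2), round(pssm[2]["D"], 2))
    print(round(score(pssm, "ACD"), 2), round(score(pssm, "WWW"), 2))
```

In this toy alignment the fully conserved first column gives its residue the highest score, while the variable third column gives every observed residue a smaller score, which matches the qualitative behaviour described in the text.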

SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.

Inferring function from homology
[Chart: axis marks at 40%, 30%, 20%, 10%]

Limits of protein sequence annotation (1)
• Predicting function from sequence requires another sequence to be mapped to a function
– many hypothetical proteins in databases
• If sequence homologues are found, they may not be functional homologs
– qualitative rather than quantitative process
– orthologs may have different functions
– enzyme homologs may be inactive
– equivalent functions may use different genes, not orthologs
• Analogy can often infer molecular function, but not necessarily cellular function

Limits of protein sequence annotation (2)
• Databases are biased in sequence and amino acid composition, and search is dependent on database size
• If no homology is found, only a limited amount of information can be inferred
• Incorrect annotation can be propagated when the similarity is over a part of the sequence not used in the annotation
• No answers to tissue specificity, binding of ligands, or the relationship between genotype and phenotype

Limits of protein sequence annotation (3)
• Additional information from experiments is needed, e.g. glycosylation sites can be predicted, but not the kind of sugar attached
• Problem with multidomain proteins (do you assign orthology on the basis of domains or of the domain composition of the whole protein?)

Genome annotation problems:
• Assembling the genome
• Analysis & interpretation
• Lack of consistency from gene to gene
• Lack of consistency from person to person
• Lack of controlled vocabulary
• Parts we don’t know
• Bacteria vs mammals
• Graphical user interface
• Gene expression/molecular interactions
• Dimensions
• Updates and maintenance

The ideal annotation of “MyGene”
• All clones
• All SNPs
• Promoter(s)
• All mRNAs
• All proteins
• All structures
• All protein modifications
• Ontologies
• Interactions (complexes, pathways, networks)
• Expression (where and when, and how much)
• Evolutionary relationships

The Biologist’s Wishlist
• A complete and accurate set of all genes and their genomic positions
• A set of all the transcripts produced by each gene
• The location and timing of expression of each transcript
• The protein produced from each transcript
• The location and timing of each protein’s expression
• The complete structure of each protein
• The functions of each protein