c734e0c731d1438324ec485f9f2c7d65.ppt
- Number of slides: 47
Bioinformatics & Computational Biology Drena Dobbs, Iowa State University Thanks to Mark Gerstein (Yale) & Eric Green (NIH) for many borrowed & modified PPTs 1
What is Bioinformatics? (& What is Computational Biology?) Wikipedia: • Bioinformatics & computational biology involve the use of techniques from mathematics, informatics, statistics, and computer science (& engineering) to solve biological problems Gerstein: • (Molecular) Bioinformatics is conceptualizing biology in terms of molecules & applying “informatics” techniques - derived from disciplines such as mathematics, computer science, and statistics - to organize and understand information associated with these molecules, on a large scale 2
What is the Information? Biological Sequences, Structures, Processes Central Dogma of Molecular Biology: • DNA sequence -> RNA -> Protein -> Phenotype • Molecules -> Sequence, Structure, Function • Processes -> Mechanism, Specificity, Regulation Central Paradigm for Bioinformatics: • Genomic (DNA) Sequence -> mRNAs & other RNA sequences -> Protein sequences -> RNA & Protein Structures -> RNA & Protein Functions -> Phenotype • Large Amounts of Information -> Standardized -> Statistical (Modified from Mark Gerstein; idea from D Brutlag, Stanford; graphics from S Strobel) 3
Explosion of "Omes" & "Omics!" Genome, Transcriptome, Proteome • Genome - the complete collection of DNA (genes and "non-genes") of an organism • Transcriptome - the complete collection of RNAs (mRNAs & others) expressed in an organism • Proteome - the complete collection of proteins expressed in an organism 4
Genome = Constant; Transcriptome & Proteome = Variable • Genome - the complete collection of DNA (genes and "non-genes") of an organism • Transcriptome - the complete collection of RNAs (mRNAs & others) expressed in an organism* • Proteome - the complete collection of proteins expressed in an organism* * Note: Although the DNA is "identical" in all cells of an organism, the sets of RNAs or proteins expressed in different cells & tissues of a single organism vary greatly and depend on variables such as environmental conditions, age, developmental stage, disease state, etc. 5
Molecular Biology Information: DNA & RNA Sequences Functions: • Genetic material • Information transfer (mRNA) • Protein synthesis (tRNA/mRNA) • Catalytic & regulatory activities (some very new!) DNA sequence: atggcaattaaaattggtatcaatggttttggtcgtat gcacaacaccgtgatgacattgaagttgtaggtattaa atggcttatatgttgaaatatgattcaactcacggtcg aaagatggtaacttagtggttaatggtaaaactatccg Gcaaacttaaactggggtgcaatcggtgttgatatcgctttaactg atgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagtt Information: • 4 letter alphabet (DNA nucleotides: AGCT); RNA sequence has "U" instead of "T" • ~ 1,000 base pairs in a small gene • ~ 3 x 10^9 bp in a genome (human) • Where are the genes? • Which DNA sequences encode mRNA? • Which DNA sequences are "junk"? • Which RNA sequences encode protein? Modified from Mark Gerstein 6
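The alphabet facts on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `transcribe` and the constant `DNA_ALPHABET` are hypothetical, the input is the first fragment shown on the slide, and the "two bits per base" figure is a back-of-the-envelope assumption about storage, not something the slide states.

```python
# Minimal sketch: check the 4-letter DNA alphabet and write the RNA version
# of a sequence by replacing T with U (names here are illustrative only).
DNA_ALPHABET = set("ACGT")


def transcribe(dna: str) -> str:
    """Return the RNA spelling of a DNA sequence (T -> U), in upper case."""
    seq = dna.upper()
    unknown = set(seq) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return seq.replace("T", "U")


# First fragment of the example sequence shown on the slide.
fragment = "atggcaattaaaattggtatcaatggttttggtcgtat"
rna = transcribe(fragment)
print(rna)

# Assumption: 2 bits are enough to encode one of 4 letters,
# so ~3 x 10^9 bp needs about 6 x 10^9 bits.
bits_for_human_genome = 2 * 3 * 10**9
print(bits_for_human_genome)
```

Real sequence files also contain ambiguity codes such as N, which this sketch deliberately rejects.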
Molecular Biology Information: Protein Sequences Functions: Most cellular functions are performed or facilitated by proteins • Biocatalysis • Cofactor transport/storage • Mechanical motion/support • Immune protection • Regulation of growth and differentiation Protein sequences: d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL d3dfr__ TAFLWAQDRDGLIGKDGHLPWHLPDDLHYFRAQTV Information: • 20 letter alphabet (amino acids): ACDEFGHIKLMNPQRSTVWY (but not BJOUXZ) • ~ 300 aa in an average protein (in bacteria) • ~ 3 x 10^6 known protein sequences • What is this protein? • Which amino acids are most important -- for folding, activity, interaction with other proteins? • Which sequence variations are harmful (or beneficial)? Modified from Mark Gerstein 7
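One rough way to approach "which amino acids are most important?" is to look for positions that stay the same across related proteins. The sketch below compares the first two fragments from the slide. It assumes they are already aligned position by position with no gaps, which is how they appear on the slide; the helper names `is_protein` and `identical_positions` are hypothetical.

```python
# Minimal sketch: validate the 20-letter amino acid alphabet and count
# identical positions between two pre-aligned fragments of equal length.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_protein(seq: str) -> bool:
    """True if every symbol is one of the 20 standard amino acid letters."""
    return set(seq.upper()) <= AMINO_ACIDS


def identical_positions(a: str, b: str) -> int:
    """Count positions where two equal-length sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to equal length")
    return sum(x == y for x, y in zip(a, b))


# The first two fragments listed on the slide.
seq_a = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT"
seq_b = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS"

matches = identical_positions(seq_a, seq_b)
percent_identity = 100 * matches / len(seq_a)
print(f"{matches}/{len(seq_a)} identical positions ({percent_identity:.1f}%)")
```

Conservation alone does not prove functional importance, and sequences of different lengths need a real alignment method rather than this position-by-position count.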
Molecular Biology Information: Macromolecular Structures DNA/RNA/Protein Structures • How does a protein (or RNA) sequence fold into an active 3-dimensional structure? • Can we predict structure from sequence? • Can we predict function from structure (or perhaps, from sequence alone?) Modified from Mark Gerstein 8
We don't yet understand the protein folding code - but we try to engineer proteins anyway! Modified from Mark Gerstein 9
Molecular Biology Information: Biological Processes Functional Genomics • How do patterns of gene expression determine phenotype? • Which genes and proteins are required for differentiation during development? • How do proteins interact in biological networks? • Which genes and pathways have been most highly conserved during evolution? 10
On a Large Scale? Whole Genome Sequencing Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading. -- G A Petsko, Nature 401: 115-116 (1999) Modified from Mark Gerstein 11
Automated Sequencing for Genome Projects Another recent improvement: rapid & high resolution separation of fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU) More recently? Pyro-sequencing 454 sequencing http://www.454.com/ $1000 genomes? Modified from Eric Green 12
1st Draft Human Genome - "Finished" in 2001 Modified from Eric Green 13
Human Genome Sequencing Two approaches: • Public (government) - International Consortium (6 countries, NIH-funded in US) • "Hierarchical" cloning & BAC-by-BAC sequencing • Map-based assembly • Private (industry) - Celera (Craig Venter) • Whole genome random "shotgun" sequencing • Computational assembly (took advantage of public maps & sequences, too) How many genes? ~ 20,000 Guess which human genome they sequenced? Craig's (Science May 2007) 14
Public Sequencing - International Consortium Modified from Eric Green 15
Comparison of Sequenced Genome Sizes Plants? Some have much larger genomes than human! Modified from Eric Green 16
"Complete" Human Genome Sequence - What next? from Eric Green 17
Next Step after the Sequence? Understanding Gene Function on a Genomic Scale • Expression Analysis • Structural Genomics • Protein Interactions • Pathway Analysis • Systems Biology Evolutionary Implications of: • Introns & Exons • Intergenic Regions as "Gene Graveyard" Modified from Mark Gerstein 18
Interpreting the Human Genome Sequence! from Eric Green 19
Comparative Genomics: compare entire genomic sequences from Eric Green 20
Comparing Genomes: Functional Elements from Eric Green 21
Gene Expression Data: the Transcriptome (& Proteome) Microarray Data Yeast Expression Data: • Levels for all 6,000 genes! • Experiments to investigate how genes respond to changes in environment or how patterns of expression change in normal vs cancerous tissue Modified from Mark Gerstein (courtesy of J Hager) ISU's Biotechnology Facilities include state-of-the-art Microarray & Proteomics instrumentation 22
Other Whole-Genome Experiments Systematic Knockouts: Make "knockout" (null) mutations in every gene, one at a time, and analyze the resulting phenotypes! For yeast: 6,000 KO mutants! 2-hybrid Experiments: For each (and every) protein, identify every other protein with which it interacts! For yeast: 6000 x 6000 / 2 ~ 18 M interactions!! Modified from Mark Gerstein 23
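The "~18 M" figure is the number of protein pairs that would have to be tested, and the arithmetic is easy to check. The sketch below compares the slide's n x n / 2 estimate with the exact count of unordered pairs; the variable names are illustrative, and the round figure of 6,000 yeast proteins is taken from the slide.

```python
# Minimal sketch: how many pairwise tests does an all-against-all
# two-hybrid screen of ~6,000 yeast proteins imply?
from math import comb

n_proteins = 6000

# The slide's quick estimate: n * n / 2.
slide_estimate = n_proteins * n_proteins // 2

# Exact number of unordered pairs of two different proteins: n * (n - 1) / 2.
distinct_pairs = comb(n_proteins, 2)

# If self-interactions (homodimers) are also tested, add one test per protein.
pairs_with_self = distinct_pairs + n_proteins

print(slide_estimate)    # 18000000
print(distinct_pairs)    # 17997000
print(pairs_with_self)   # 18003000
```

Either way the count grows roughly with the square of the number of proteins, which is why such screens have to be automated.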
Molecular Biology Information: Integrating Data • Understanding the function of genomes requires integration of many diverse and complex types of information: -> Metabolic pathways -> Regulatory networks -> Whole organism physiology -> Evolution, phylogeny -> Environment, ecology -> Literature (MEDLINE) Modified from Mark Gerstein 24
Storing & Analyzing Large-scale Information: Exponential Growth of Data Matched by Development of Computer Technology CPU vs Disk & Net • Both the increase in computer speed and the ability to store large amounts of information on computers have been crucial • Improved computing resources have been a driving force in Bioinformatics ISU's supercomputer "CyBlue" is among the 100 most powerful in the world Modified from Mark Gerstein (Internet picture adapted from D Brutlag, Stanford) 25
from Mark Gerstein Weber Cartoon 26
Challenges in Organizing & Understanding High-throughput Data: Redundancy and Multiplicity • Different sequences can have the same structure • Organism has many similar genes • Single gene may have multiple functions • Genes and proteins function in genetic and regulatory pathways • How do we organize all this information so that we can make sense of it? Integrative Genomics: genes <> structures <> functions <> pathways <> expression <> regulatory systems <> ... Modified from Mark Gerstein 27
"Simple" example? Proteins Molecular Parts = Conserved Domains Modified from Mark Gerstein 28
"Parts List" approach to bike maintenance: How many roles can these play? How flexible and adaptable are they mechanically? What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -types of parts (nuts & washers)? Modified from Mark Gerstein Where are the parts located? 29
World of protein structures is also finite, providing a valuable simplification! (human) ~20,000 genes ~2,000 folds (T. pallidum) ~2,000 genes Global Surveys of a Finite Set of Parts from Many Perspectives Same logic for pathways, functions, sequence families, blocks, motifs... Modified from Mark Gerstein Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop... 30
BUT, what actually happens in cells & in whole organisms is much more complex! providing a challenging complication!! Exploring the Virtual Cell at ISU Virtual Cell projects elsewhere. . . NCBI's Bookshelf - a great resource! 31
So, having a list of parts is not enough! BIG QUESTION? How do parts work together to form a functional system? SYSTEMS BIOLOGY What is a system? Macromolecular complex, pathway, network, cell, tissue, organism, ecosystem… 32
Is this Bioinformatics? (#1, with Answers) • Creating digital libraries -> Automated bibliographic search and textual comparison -> Knowledge bases for biological literature • Motif discovery using Gibbs sampling • Methods for structure determination -> Computational X-ray crystallography -> NMR structure determination • Distance Geometry • Metabolic pathway simulation Modified from Mark Gerstein YES YES 33
Is this Bioinformatics? #2 • Gene identification by sequence inspection -> Prediction of splice sites, promoters, etc. • DNA methods in forensics • Modeling populations of organisms -> Ecological Modeling • Genomic sequencing methods -> Assembling contigs -> Physical and genetic mapping YES YES • Linkage analysis -> Linking specific genes to various traits Modified from Mark Gerstein YES 34
Is this Bioinformatics? #3 • Rational drug design • RNA structure prediction • Protein structure prediction YES • Radiological image processing -> Computational representations for human anatomy (e.g., Visible Human) • Artificial life simulations -> Artificial immunology -> Virtual cells Modified from Mark Gerstein Maybe Yes 35
So, this is Bioinformatics What is it good for? 36
EXAMPLES OF BIOINFORMATICS RESEARCH A few general ones & a few personal favorites! 37
Designing New Drugs • Understanding how proteins bind other molecules • Structural modeling & ligand docking • Designing inhibitors or modulators of key proteins Modified from Mark Gerstein (Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center). 38
Finding homologs of "new" human genes Modified from Mark Gerstein 39
Finding WHAT? Homologs - "same genes" in different organisms (actually, orthologs) • Human vs. Mouse vs. Yeast -> Much easier to do experiments on yeast to determine function -> Often, function of an ortholog in at least one organism is known
Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
Modified from Mark Gerstein 40
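The table above ranks human/yeast matches by BLASTX P-value, where a smaller value means a less likely chance match. As a rough illustration of how such a list might be filtered, the sketch below keeps only the strongest hits from a handful of rows copied from the table. The cutoff of 1e-50 and the helper name `strong_matches` are hypothetical choices for this example, not values or methods given on the slide.

```python
# Minimal sketch: filter and rank human/yeast matches by BLASTX P-value.
# Each tuple is (human gene, yeast gene, P-value), copied from the table.
hits = [
    ("NF1", "IRA2", 2.0e-46),
    ("MSH2", "MSH2", 9.2e-261),
    ("MNK", "CCC2", 2.1e-17),
    ("CFTR", "YCF1", 1.3e-167),
    ("SOD1", "SOD1", 2.0e-58),
]

P_CUTOFF = 1e-50  # hypothetical threshold chosen only for this example


def strong_matches(rows, cutoff):
    """Keep rows with P-value at or below the cutoff, best (smallest) first."""
    kept = [row for row in rows if row[2] <= cutoff]
    return sorted(kept, key=lambda row: row[2])


best = strong_matches(hits, P_CUTOFF)
for human_gene, yeast_gene, p_value in best:
    print(f"{human_gene} ~ {yeast_gene}: P = {p_value:.1e}")
```

A strong sequence match suggests, but does not prove, that the yeast protein performs the same function as the human one.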
Comparative Genomics Genome/Transcriptome/Proteome/Metabolome Databases, statistics • Occurrence of specific genes or features in a genome -> How many kinases in yeast? • Compare Tissues -> Which proteins are expressed in cancer vs normal tissues? • Diagnostic tools • Drug target discovery Modified from Mark Gerstein 41
Molecular Recognition: Analyzing & Predicting Macromolecular Interfaces (in DNA, RNA & protein complexes) Drena Dobbs, GDCB Jae-Hyung Lee Michael Terribilini Jeff Sander Pete Zaback Vasant Honavar, Com S Feihong Wu Cornelia Caragea Robert Jernigan, BBMB Taner Sen Andrzej Kloczkowski Kai-Ming Ho, Physics 42
Designing Zinc Finger DNA-binding proteins to recognize specific sites in genomic DNA Drena Dobbs, GDCB Jeff Sander Pete Zaback Dan Voytas, GDCB Fenglli Fu Les Miller, Com S Vasant Honavar, Com S Keith Joung, Harvard 43
Structure & function of human telomerase: Predicting structure & functional sites in a clinically important but "recalcitrant" RNP Cell Biologist: www.intl-pag.org/ Biochemist: www.chemicon.com Imagined structure: Lingner et al (1997) Science 276: 561-567. How would a systems biologist study telomerase? 44
Resources for Bioinformatics & Computational Biology • Wikipedia: Bioinformatics • NCBI - National Center for Biotechnology Information • ISCB - International Society for Computational Biology • JCB - Jena Center for Bioinformatics • UBC - Bioinformatics Links Directory 45
ISU Resources & Experts ISU Research Centers & Graduate Training Programs: BCB - Bioinformatics & Computational Biology Baker Center - Bioinformatics & Biological Statistics CIAG - Center for Integrated Animal Genomics CILD - Computational Intelligence, Learning & Discovery ISU Facilities: Biotech - Instrumentation Facilities CIAG - Center for Integrated Animal Genomics PSI - Plant Sciences Institute PSI Centers 46
For fun: DNA Interactive: "Genomes" A tutorial on genomic sequencing, gene structure, gene prediction Howard Hughes Medical Institute (HHMI) Cold Spring Harbor Laboratory (CSHL) http://www.dnai.org/c/index.html 47