
  • Number of slides: 109

Introduction to Bioinformatics

GENERAL INFORMATION: Course Methodology
The course consists of the following components:
i. a series of 10 lectures and 10 mini-exams,
ii. 7 skills classes, each with one programming task,
iii. one final written exam.
• The lectures present the main theoretical aspects.
• Each lecture starts with a "mini-exam" of three short questions on the previous lecture.
• In the skills classes (SCs) several programming tasks are performed, one of which has to be submitted before the next SC.
• Finally, the course ends with an open-book exam.

GENERAL INFORMATION: 10 lectures and 10 mini-exams
Prologue (In praise of cells)
Chapter 1. The first look at a genome (sequence statistics)
Chapter 2. All the sequence's men (gene finding)
Chapter 3. All in the family (sequence alignment)
Chapter 4. The boulevard of broken genes (hidden Markov models)
Chapter 5. Are Neanderthals among us? (variation within and between species)
Chapter 6. Fighting HIV (natural selection at the molecular level)
Chapter 7. SARS: a post-genomic epidemic (phylogenetic analysis)
Chapter 8. Welcome to the hotel Chlamydia (whole genome comparisons)
Chapter 9. The genomics of wine-making (analysis of gene expression)
Chapter 10. A bed-time story (identification of regulatory sequences)

GENERAL INFORMATION: mini-exams
* First 15 minutes of the lecture
* Closed book
* Three short questions on the previous lecture
* Counts as bonus points for the final mark
* There is a resit, where you can redo individual mini-exams that you missed with legitimate leave

GENERAL INFORMATION: Skills Class
* Each Friday: one hour hands-on with real data
* Hand in one task a week for a bonus point


GENERAL INFORMATION: Final Exam
* 10 short questions regarding the course material
* Open book

GENERAL INFORMATION: Grading
The relative weights of the components are:
i. 10 mini-exams: B1 bonus points (max 1)
ii. 7 skills class programming tasks: B2 bonus points (max 1)
iii. final written exam (open-book, three hours): E points (max 10)
Final grade = min(E + (B1 + B2), 10)
Study Points: 6 ECTS / 4 NSP

GENERAL INFORMATION: Course Book
Introduction to Computational Genomics: A Case Studies Approach, Nello Cristianini and Matthew W. Hahn

GENERAL INFORMATION: Additional recommended texts
• Bioinformatics: The Machine Learning Approach, Baldi & Brunak.
• Introduction to Bioinformatics, Lesk.
• Introduction to Bioinformatics, Attwood & Parry-Smith.

Introduction to Bioinformatics. LECTURES

Introduction to Bioinformatics. LECTURE 1:
* Prologue (In praise of cells)
* Chapter 1. The first look at a genome (sequence statistics)

Introduction to Bioinformatics. Prologue: In praise of cells
* Nothing in Biology Makes Sense Except in the Light of Evolution (Theodosius Dobzhansky)

GENOMICS and PROTEOMICS
Genomics is the study of an organism's genome and the use of the genes. It deals with the systematic use of genome information, associated with other data, to provide answers in biology, medicine, and industry.
Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteomics is much more complicated than genomics. Most importantly, while the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism will have radically different protein expression in different parts of its body, in different stages of its life cycle and in different environmental conditions.

Development of Genomics/Proteomics Databases

Modern map-makers have mapped the entire human genome. Hurrah, we know the entire 3.3 billion bp of the human genome! But what does it mean?

Metabolic activity in GENETIC PATHWAYS

How can we measure metabolic processes and gene activity?

EXAMPLE: Caenorhabditis elegans

Some fine day in 1982 …

Boy, do I want to map the activity of these genes!

Until recently we lacked tools to measure gene activity.
1989 saw the introduction of the microarray technique by Stephen Fodor.
Only in 1992 did this technique become generally available, and it was still very costly.

[Figure: a microarray, and Stephen Fodor, developer of the microarray]


Some fine day many, many years later …

Now I'm almost there …

Using microarray technology we can now make time series of the activity of our 22,000 genes: so-called genome-wide expression profiles.

The identification of genetic pathways from microarray time series: sequences of genome-wide expression profiles at consecutive time points become more realistic with decreasing costs …

Genome-wide expression profiles: 25,000 genes

Now the problem is to map these microarray series of genome-wide expression profiles into something that tells us what the genes are actually doing, for instance a network representing their interactions.


GENOMICS: structure and coding

DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions specifying the biological development of all cellular forms of life (and most viruses). DNA is a long polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code, a triplet code of nucleotides.


DNA under electron microscope

3D model of a section of the DNA molecule

James Watson and Francis Crick


Genetic code
The genetic code is a set of rules that maps DNA sequences to proteins in the living cell, and is employed in the process of protein synthesis. Nearly all living things use the same genetic code, called the standard genetic code, although a few organisms use minor variations of the standard code.
Fundamental code in DNA: {x(i) | i = 1..N, x(i) in {C, A, T, G}}
Human: N = 3.3 billion

Genetic code

Replication of DNA

Genetic code: TRANSCRIPTION
DNA → RNA
Transcription is the process through which a DNA sequence is enzymatically copied by an RNA polymerase to produce a complementary RNA; in other words, the transfer of genetic information from DNA into RNA. In the case of protein-encoding DNA, transcription is the beginning of the process that ultimately leads to the translation of the genetic code (via the mRNA intermediate) into a functional peptide or protein. Transcription has some proofreading mechanisms, but they are fewer and less effective than the controls for DNA replication; therefore, transcription has a lower copying fidelity than DNA replication. Like DNA replication, transcription proceeds in the 5' → 3' direction (i.e. the old polymer is read in the 3' → 5' direction and the new, complementary fragments are generated in the 5' → 3' direction).
In RNA: Thymine (T) → Uracil (U)
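The base-pairing rule behind transcription can be sketched in a few lines of Python. This is only an illustration of the complementarity and the T → U substitution described on the slide; the function names `transcribe_template` and `coding_to_rna` are made up for this sketch and ignore everything the real enzyme does (promoters, termination, proofreading).

```python
# Hypothetical helper names; a toy model of base pairing only.
# DNA template base -> complementary RNA base (note A pairs with U, not T).
TEMPLATE_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe_template(template_3_to_5: str) -> str:
    """Return the RNA (written 5'->3') complementary to a DNA template
    strand that is given in the 3'->5' reading direction."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5.upper())


def coding_to_rna(coding_5_to_3: str) -> str:
    """Shortcut: the RNA has the same sequence as the coding strand,
    with thymine (T) replaced by uracil (U)."""
    return coding_5_to_3.upper().replace("T", "U")


if __name__ == "__main__":
    print(transcribe_template("TACGGT"))  # template read 3'->5'
    print(coding_to_rna("ATGCCA"))        # matching coding strand 5'->3'
```

Both calls describe the same double-stranded fragment, so they should print the same RNA string.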

Genetic code: TRANSLATION
DNA triplet → RNA triplet = codon → amino acid
RNA codon table: there are 20 standard amino acids used in proteins; here are some of the RNA codons that code for each amino acid.

Amino acid   Codons
Ala  A       GCU, GCC, GCA, GCG
Leu  L       UUA, UUG, CUU, CUC, CUA, CUG
Arg  R       CGU, CGC, CGA, CGG, AGA, AGG
Lys  K       AAA, AAG
Asn  N       AAU, AAC
Met  M       AUG
Asp  D       GAU, GAC
Phe  F       UUU, UUC
Cys  C       UGU, UGC
Pro  P       CCU, CCC, CCA, CCG
...
Start        AUG, GUG
Stop         UAG, UGA, UAA

PROTEOMICS: structure and function

Protein Structure: primary structure

Protein Structure: secondary structure
a: alpha-helix, b: beta-sheet

Protein Structure: super-secondary structure

Protein structure = protein function

EVOLUTION and the origin of SPECIES

Tree of Life


Phylogenetic relations between cetaceans and artiodactyls

Unsolved problems in biology

Life. How did it start? Is life a cosmic phenomenon? Are the conditions necessary for the origin of life narrow or broad? How did life originate and diversify over hundreds of millions of years? Why, after rapid diversification, do microorganisms remain unchanged for millions of years? Did life start on this planet or was there an extraterrestrial intervention (for example a meteor from another planet)? Why have so many biological systems developed sexual reproduction? How do organisms recognize like species? How are the sizes of cells, organs, and bodies controlled? Is immortality possible?

DNA / Genome. Do all organisms link together to a primary source? Given a DNA sequence, what shape will the protein fold into? Given a particular desired shape, what DNA sequence will produce it? What are all the functions of the DNA? Other than the structural genes, which is the simpler part of the system? What is the complete structure and function of the proteome (the proteins expressed by a cell or organ at a particular time and under specific conditions)? What is the complete function of the regulator genes? The building block of life may be a precursor to a generation of electronic devices and computers, but what are the electronic properties of DNA? Does junk DNA function as molecular garbage?

Viruses / Immune system. What causes immune system deficiencies? What are the signs of current or past infection to discover where Ebola hides between human outbreaks? What is the origin of antibody diversity? What leads to the complexity of the immune system? What is the relationship between the immune system and the brain?

Humanity. Why are there drastic changes in hominid morphology? Why are there giant hominid skeletons and very small hominid skeletons? Is hominid evolution static? Is hominid devolution possible? Are there Human-Neanderthal hybrids? What explains the differences between Human and Neanderthal fossils?

Introduction to Bioinformatics. LECTURE 1: CHAPTER 1: The first look at a genome (sequence statistics)
* A mathematical model should be as simple as possible, but not too simple! (A. Einstein)
* All models are wrong, but some are useful. (G. Box)

Introduction to Bioinformatics. The first look at a genome (sequence statistics)
• Genome and genomic sequences
• Probabilistic models and sequences
• Statistical properties of sequences
• Standard data formats and databases

1.1 Genomic era, year zero
• 1958: Fred Sanger (Cambridge, UK): Nobel prize for developing protein sequencing techniques
• 1978: Fred Sanger: first complete viral genome
• 1980: Fred Sanger: first mitochondrial genome
• 1980: Fred Sanger: Nobel prize for developing DNA sequencing techniques
• 1995: Craig Venter (TIGR): complete genome of Haemophilus influenzae
• 2001: entire genome of Homo sapiens
• Start of post-genomic era (?!)

1.1 Genomic era, year zero

ORGANISM        DATE   SIZE        DESCRIPTION
Phage phiX174   1978   5,368 bp    1st viral genome
Human mtDNA     1980   16,571 bp   1st organelle genome
HIV             1985   9,193 bp    AIDS retrovirus
H. influenzae   1995   1,830 Kb    1st bacterial genome
H. sapiens      2001   3,500 Mb    complete human genome

1.2 The anatomy of a genome
• Definition of genome
• Prokaryotic genomes
• Eukaryotic genomes
• Viral genomes
• Organellar genomes

1.3 Probabilistic models of genome sequences
• Alphabets, sequences, and sequence space
• Multinomial sequence model
• Markov sequence model

1.3 Probabilistic models of genome sequences
Alphabets, sequences, and sequence space
* 4-letter alphabet: N = {A, C, G, T} (= nucleotides)
* sequence: s = s1 s2 … sn, e.g. s = ATATGCCTGACTG
* sequence space: the space of all sequences (up to a certain length)
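To get a feeling for how quickly sequence space grows, the following sketch enumerates all sequences up to a given length and counts them. It assumes that the empty sequence is not counted; the function names are invented for this illustration.

```python
from itertools import product


def sequence_space(max_len: int, alphabet: str = "ACGT"):
    """Yield every sequence of length 1..max_len over the alphabet."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def sequence_space_size(max_len: int, alphabet_size: int = 4) -> int:
    """Number of sequences of length 1..max_len: 4 + 4^2 + ... + 4^max_len."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


if __name__ == "__main__":
    print(list(sequence_space(1)))
    print(sequence_space_size(2))
    print(sequence_space_size(13))  # length of the example ATATGCCTGACTG
```

Enumeration is only feasible for very short lengths; the count alone already shows why whole genomes cannot be analysed by listing all possible sequences.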

1.3 Probabilistic models of genome sequences
Multinomial sequence model
* Nucleotides are independent and identically distributed (i.i.d.)
* p = {pA, pC, pG, pT}, pA + pC + pG + pT = 1
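Under the i.i.d. assumption, the probability of a sequence is simply the product of the probabilities of its letters, P(s) = p(s1) · p(s2) · … · p(sn). The sketch below computes that product in log space (to avoid underflow on long sequences) and draws random sequences from the model. The function names are illustrative, and the example probabilities are the rounded H. influenzae base frequencies quoted elsewhere in this lecture.

```python
import math
import random


def multinomial_log_prob(seq: str, p: dict) -> float:
    """log P(seq) under the multinomial model: sum of log p[base]."""
    return sum(math.log(p[base]) for base in seq.upper())


def sample_multinomial(n: int, p: dict, seed: int = 0) -> str:
    """Draw n independent bases according to the probabilities in p."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    bases = list(p)
    return "".join(rng.choices(bases, weights=[p[b] for b in bases], k=n))


if __name__ == "__main__":
    # Rounded base frequencies of H. influenzae, used here as example values.
    p = {"A": 0.3102, "C": 0.1916, "G": 0.1898, "T": 0.3083}
    s = sample_multinomial(20, p)
    print(s, multinomial_log_prob(s, p))
```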

1.3 Probabilistic models of genome sequences
Markov sequence model
* Start-state probability π
* State transition matrix T
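For a first-order Markov model, the probability of a sequence is the start probability of its first letter times the transition probabilities of each consecutive pair: P(s) = π(s1) · T(s1, s2) · … · T(s(n-1), sn). The sketch below assumes π is stored as a dict and T as a dict of dicts indexed as T[previous][current]; both the layout and the function name are choices made for this example.

```python
import math


def markov_log_prob(seq: str, start: dict, trans: dict) -> float:
    """log P(seq) under a first-order Markov chain.

    start[a]    : probability that the sequence begins with base a
    trans[a][b] : probability that base b follows base a
    """
    seq = seq.upper()
    if not seq:
        return 0.0  # the empty sequence has probability 1
    log_p = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(trans[prev][cur])
    return log_p


if __name__ == "__main__":
    bases = "ACGT"
    start = {b: 0.25 for b in bases}
    # With uniform transitions the model reduces to the multinomial case.
    trans = {a: {b: 0.25 for b in bases} for a in bases}
    print(markov_log_prob("ATATGCCTGACTG", start, trans))
```

With every row of T equal to the same distribution, this model gives the same probabilities as the multinomial model, which is a convenient sanity check.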

1.4 Annotating a genome: statistical sequence analysis
• Base composition & sliding window plot
• GC content & change point analysis
• Finding unusual DNA words
• Biological relevance of unusual motifs
• Pattern matching versus pattern discovery

1.4 Annotating a genome: statistical sequence analysis
Base composition of H. influenzae

BASE   AMOUNT   FREQUENCY
A      567623   0.3102
C      350723   0.1916
G      347436   0.1898
T      564241   0.3083
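A table like the one above can be produced by counting the four bases and dividing by the total. The sketch below is one minimal way to do this; it assumes that characters other than A, C, G, T (for example N) should be ignored, and the function name is illustrative.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return {base: (count, frequency)} for A, C, G, T.

    Characters other than A, C, G, T are ignored (an assumption of this sketch).
    """
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {
        b: (counts[b], counts[b] / total if total else 0.0)
        for b in "ACGT"
    }


if __name__ == "__main__":
    for base, (count, freq) in base_composition("ATATGCCTGACTG").items():
        print(base, count, round(freq, 4))
```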

Haemophilus influenzae type b

1.4 Annotating a genome: statistical sequence analysis
Base composition & sliding window plot
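A sliding-window plot is built by computing the base composition inside a window that is moved along the sequence in fixed steps. The sketch below returns the numbers that would be plotted; the window and step sizes, and the function name, are placeholders chosen for this example rather than the values used for the plots in the lecture.

```python
def sliding_window_composition(seq: str, window: int, step: int) -> list:
    """Return [(window_start, {base: frequency}), ...] along the sequence.

    Only complete windows are reported; a trailing partial window is dropped.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        rows.append((start, {b: chunk.count(b) / window for b in "ACGT"}))
    return rows


if __name__ == "__main__":
    for start, freqs in sliding_window_composition("AAAACCCCGGGGTTTT", 8, 4):
        print(start, freqs)
```

Plotting one frequency (or the sum of G and C) against the window start position gives the kind of curve shown on the slides.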


Evidence for co-evolution of gene order and recombination rate
Csaba Pál & Laurence D. Hurst, Nature Genetics 33, 392-395 (2003)
Figure 3. Sliding-window plot of the number of essential genes (black line) and standard deviation from chromosomal mean recombination rate (gray line) along chromosome 9. Dot indicates the centromere. The windows were each ten genes long, and one gene jump was made between windows.

GC content: GC versus AT

Organism          GC content (%)
H. influenzae     38.8
M. tuberculosis   65.8
S. enteritidis    49.5
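GC content is the percentage of bases in a sequence that are G or C. A minimal sketch, assuming that ambiguous characters such as N are excluded from the denominator (the function name is illustrative):

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the A, C, G, T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0  # no countable bases
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


if __name__ == "__main__":
    print(gc_content("ATATGCCTGACTG"))
```

Applied window by window along a genome, the same quantity can be used to look for regions whose composition differs from the rest.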

GC content
• Detect foreign genetic material
• Horizontal gene transfer
• Change point analysis
• AT denatures (= splits) at lower temperatures
• Thermophilic archaebacteria: high GC
• Evolution: Archaea > Eubacteria > Eukaryotes

GC content: example of very high GC content
Average GC content: 61%

GC content

Change points in lambda phage

k-mer frequency: motif bias
• dimer, trimer, k-mer: nucleotide word of length 2, 3, k
• "unusual" k-mers
• 2-mers in H. influenzae

k-mer frequency: motif bias
2-mer (dinucleotide) density in H. influenzae (rows: first base, columns: second base)

      *A      *C      *G      *T
A*  0.1202  0.0505  0.0483  0.0912
C*  0.0665  0.0372  0.0396  0.0484
G*  0.0514  0.0522  0.0363  0.0499
T*  0.0721  0.0518  0.0656  0.1189

NB: freq('AT') ≠ freq(A or T)
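The dinucleotide table above is a special case (k = 2) of counting all overlapping words of length k. A minimal sketch, with an illustrative function name:

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of every overlapping k-mer that occurs in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    for kmer, freq in sorted(kmer_frequencies("ATATGCCTGACTG", 2).items()):
        print(kmer, round(freq, 4))
```

Words that never occur are simply absent from the result, so a caller that needs the full 4 × 4 table should look them up with a default of 0.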

k-mer frequency: motif bias
Most frequent 10-mers in H. influenzae: AAAGTGCGGT and ACCGCACTTT. Why?
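One observation about the two words can be checked directly: each is the reverse complement of the other, so they are the same site read on the two opposite strands. The sketch below verifies this; it does not by itself explain why that site is so frequent. The function name is illustrative.

```python
# Translation table mapping each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, written 5'->3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAAGTGCGGT"))
```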


Unusual DNA-words
Compare the OBSERVED with the EXPECTED frequency of a word using the multinomial model.
Observed/expected ratio (rows: first base, columns: second base):

      *A      *C      *G      *T
A*  1.2491  0.8496  0.8210  0.9535
C*  1.1182  1.0121  1.0894  0.8190
G*  0.8736  1.4349  1.0076  0.8526
T*  0.7541  0.8763  1.1204  1.2505

This also takes into account the relative proportions pA, pC, pG, pT.
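Under the multinomial model the expected frequency of a dinucleotide xy is p(x) · p(y), so the ratio is observed(xy) / (p(x) · p(y)). The sketch below computes these ratios for a sequence; the function name is illustrative, and dinucleotides whose expected frequency is zero are left out.

```python
from collections import Counter


def observed_expected_ratios(seq: str) -> dict:
    """Observed/expected ratio of each dinucleotide under the multinomial model."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return {}
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = base_freq[x] * base_freq[y]
            if expected > 0:  # skip pairs involving a base that never occurs
                ratios[x + y] = (pair_counts[x + y] / total_pairs) / expected
    return ratios


if __name__ == "__main__":
    for pair, ratio in sorted(observed_expected_ratios("ATATGCCTGACTG").items()):
        print(pair, round(ratio, 3))
```

As a hand check against the tables in this lecture: for AA, 0.1202 / (0.3102 × 0.3102) ≈ 1.249, which agrees with the tabulated 1.2491 up to rounding of the inputs.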

Unusual DNA-words
Restriction sites: very unusual words
CTAG -> "kinking" of the DNA strand

Genome signature: nucleotide motif bias in four genomes

1.5 Finding data: GenBank, EMBL, and DDBJ
• Online databases
• FASTA: a standard data format
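In the FASTA format each record starts with a header line beginning with ">", followed by one or more lines of sequence. The sketch below is a minimal reader for text in that layout; it is written for illustration only (the function name and the example records are made up), and in practice an established library parser would normally be preferred.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {header: sequence}.

    The header is stored without the leading '>'; sequence lines
    belonging to one record are concatenated.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


if __name__ == "__main__":
    example = ">seq1 made-up example record\nATATGCC\nTGACTG\n>seq2\nGGCC\n"
    for name, sequence in parse_fasta(example).items():
        print(name, len(sequence), sequence)
```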

DATABASES
Generalized (DNA, proteins and carbohydrates, 3D structures)
Specialized (EST, STS, SNP, RNA, genomes, protein families, pathways, microarray data ...)

OVERVIEW OF DATABASES
1. Database indexing and specification of search terms (retrieval, follow-up, analysis)
2. Archives (databases on: nucleic acid sequences, genomes, protein sequences, structures, proteomics, expression, pathways)
3. Gateways to archives (NCBI, Entrez, PubMed, ExPASy, Swiss-Prot, SRS, PIR, Ensembl)

Generalized DNA, protein and carbohydrate databases
Primary sequence databases
EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI, Hinxton, UK)
GenBank (at National Center for Biotechnology Information, NCBI, Bethesda, MD, USA)
DDBJ (DNA Data Bank of Japan at CIB, Mishima, Japan)

NCBI: National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

NCBI - GenBank

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

EBI: European Bioinformatics Institute
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures.
Our mission
• To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
• To contribute to the advancement of biology through basic investigator-driven research in bioinformatics
• To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators
• To help disseminate cutting-edge technologies to industry

What is DDBJ
DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank. DNA sequences record organismic evolution more directly than other biological materials and are thus invaluable not only for research in life sciences, but also for human welfare in general. The databases are, so to speak, a common treasure of human beings. With this in mind, we make the databases accessible online to anyone in the world.

ExPASy Proteomics Server (SWISS-PROT)
The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE.

Generalized DNA, protein and carbohydrate databases
Protein sequence databases
SWISS-PROT (Swiss Institute of Bioinformatics, SIB, Geneva, CH)
TrEMBL (= Translated EMBL: computer-annotated protein sequence database at EBI, UK)
PIR-PSD (PIR-International Protein Sequence Database, annotated protein database by PIR, MIPS and JIPID at NBRF, Georgetown University, USA)
UniProt (joined data from Swiss-Prot, TrEMBL and PIR)
UniRef (UniProt NREF (Non-redundant REFerence) database at EBI, UK)
IPI (International Protein Index; human, rat and mouse proteome database at EBI, UK)

Generalized DNA, protein and carbohydrate databases
CarbBank (former complex carbohydrate structure database, CCSD, discontinued!)
3D structure databases
PDB (Protein Data Bank curated by RCSB, USA)
EBI-MSD (Macromolecular Structure Database at EBI, UK)
NDB (Nucleic Acid structure Database at Rutgers State University of New Jersey, USA)

PROTEIN DATA BANK

DATABASE SEARCH
Text-based (SRS, Entrez ...)
Sequence-based (sequence similarity search) (BLAST, FASTA ...)
Motif-based (ScanProsite, eMOTIF)
Structure-based (structure similarity search) (VAST, DALI ...)
Mass-based protein search (ProteinProspector, PeptIdent, Prowl ...)

Search across databases: the Entrez cross-database search page
• PubMed: biomedical literature citations and abstracts
• PubMed Central: free, full text journal articles
• Site Search: NCBI web and FTP sites
• Books: online books
• OMIM: online Mendelian Inheritance in Man
• OMIA: online Mendelian Inheritance in Animals
• Nucleotide: sequence database (GenBank)
• Protein: sequence database
• Genome: whole genome sequences
• Structure: three-dimensional macromolecular structures
• Taxonomy: organisms in GenBank
• SNP: single nucleotide polymorphism
• Gene: gene-centered information
• HomoloGene: eukaryotic homology groups
• PubChem Compound: unique small molecule chemical structures
• PubChem Substance: deposited chemical substance records
• Genome Project: genome project information
• UniGene: gene-oriented clusters of transcript sequences
• CDD: conserved protein domain database
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• PopSet: population study data sets
• GEO Profiles: expression and molecular abundance profiles
• GEO DataSets: experimental sets of GEO data
• Cancer Chromosomes: cytogenetic databases
• PubChem BioAssay: bioactivity screens of chemical substances
• GENSAT: gene expression atlas of mouse central nervous system
• Probe: sequence-specific reagents

New! The Assembly Archive, recently created at NCBI, links together trace data and finished sequence, providing complete information about a genome assembly. The Assembly Archive's first entries are a set of closely related strains of Bacillus anthracis. The assemblies are available at TraceAssembly. See more about the Bacillus anthracis genome.

Bacillus licheniformis ATCC 14580
Release Date: September 15, 2004
Reference: Rey, M. W., et al. Complete genome sequence of the industrial bacterium Bacillus licheniformis and comparisons with closely related Bacillus species. Genome Biol. 5, R77 (2004)
Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
Organism: Bacillus licheniformis ATCC 14580
Genome sequence information: chromosome - CP000002 - NC_006270
Size: 4,222,336 bp
Proteins: 4161
Sequence data files submitted to GenBank/EMBL/DDBJ can be found at NCBI FTP: GenBank or RefSeq Genomes

Bacillus cereus ZK
Release Date: September 15, 2004
Reference: Brettin, T. S., et al. Complete genome sequence of Bacillus cereus ZK
Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus cereus group.
Organism:

BLAST
NCBI → BLAST
Latest news: 6 December 2005: BLAST 2.2.13 released
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
Nucleotide
• Quickly search for highly similar sequences (megablast)
• Quickly search for divergent sequences (discontiguous megablast)
• Nucleotide-nucleotide BLAST (blastn)
• Search for short, nearly exact matches
• Search trace archives with megablast or discontiguous megablast
Protein-protein BLAST (blastp)

Fasta Protein Database Query
Provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the Fasta programs.

Kangaroo: MOTIF-BASED SEARCH
Kangaroo is a pattern search program that facilitates searching for gene and protein patterns and sequences. Given a sequence pattern, the program will find all the records that contain that pattern. To use this program, simply enter a sequence of DNA or amino acids in the pattern window, choose the type of search and the taxonomy, and submit your request.

ANALYSIS TOOLS
DNA sequence analysis tools
RNA analysis tools
Protein sequence and structure analysis tools (primary, secondary, tertiary structure)
Tools for protein function assignment
Phylogeny
Microarray analysis tools

MISCELLANEOUS
Literature search
Patent search
Bioinformatics centers and servers
Links to other collections of bioinformatics resources
Medical resources
Bioethics
Protocols
Software
(Bio)chemistry
Educational resources

Introduction to Bioinformatics. END of LECTURE 1