Yes if you train quickly you can create


Number of slides: 78

Yes, if you train quickly, you can create a new database of databases, but first eat your dinner!
An introduction to biological databases
Sept 2002

Database or databank? In the beginning, subtle distinctions were made between databases and databanks (in the UK, but not in the USA), such as: « Database management programs for the management of databanks ». From now on, the term « database » (db) is usually preferred.

What is a database?
A collection of data that is:
• structured
• searchable (index) -> table of contents
• updated periodically (release) -> new edition
• cross-referenced (hyperlinks) -> links with other db
It also includes the associated tools (software) necessary for db access, db updating, db information insertion, db information deletion…
Data storage management: flat files, relational databases…

Database: a « flat file » example
« Introduction To Databases » Teacher Database (flat file, 3 entries)
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA 2000; DEA 2001; DEA 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet 2000; EMBnet 2001; EMBnet 2002; DEA 2000; DEA 2001; DEA 2002
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter
Course: EMBnet 2000; EMBnet 2001; EMBnet 2002; DEA 2000; DEA 2001; DEA 2002
http://www.expasy.org/people/Marie-Claude.Blatter.html
//
Easy to manage: all the entries are visible at the same time!
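As a rough illustration of how such a flat file can be read by a program, here is a minimal Python sketch. It assumes one "Field: value" pair per line, a "//" line closing each entry, and URL lines starting with "http"; the function name parse_flat_file and the "URL" key are made up for this example.

```python
def parse_flat_file(text):
    """Split a flat file into entries; each entry ends with a '//' line."""
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":                  # entry terminator
            if current:
                entries.append(current)
            current = {}
        elif line.startswith("http"):     # bare URL line (assumed convention)
            current["URL"] = line
        else:
            key, sep, value = line.partition(":")
            if sep:
                current[key.strip()] = value.strip()
    if current:                           # tolerate a missing final '//'
        entries.append(current)
    return entries


SAMPLE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA 2000; DEA 2001; DEA 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet 2000; EMBnet 2001; EMBnet 2002; DEA 2000; DEA 2001; DEA 2002
//
"""

entries = parse_flat_file(SAMPLE)
courses_of_second = entries[1]["Course"].split("; ")
```

Every query (for example "who teaches EMBnet?") has to scan all entries, which is fine for three teachers but becomes slow for large collections unless an index is built.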

Database: a « relational » example
Relational database (« table file »):
Teacher    Accession number   Education
Amos       1                  Biochemistry
Laurent    2                  Biochemistry
M-Claude   3                  Biochemistry
Course   Date               Involved teachers
DEA      2000; 2001; 2002   1; 2; 3
EMBnet   2000; 2001; 2002   2; 3
Easier to manage; choice of the output
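The same teacher data can be sketched as relational tables with Python's built-in sqlite3 module. This is only an illustration: the table and column names are invented here, and the "Involved teachers" column is stored as a separate link table (teaches), which is one common way to model it, not necessarily how the slide's database was built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course  (name TEXT PRIMARY KEY, dates TEXT);
-- link table (an assumption): one row per (course, teacher) pair
CREATE TABLE teaches (course TEXT REFERENCES course(name),
                      accession INTEGER REFERENCES teacher(accession));
""")
conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (1, "Amos", "Biochemistry"),
    (2, "Laurent", "Biochemistry"),
    (3, "M-Claude", "Biochemistry"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    ("DEA", "2000; 2001; 2002"),
    ("EMBnet", "2000; 2001; 2002"),
])
conn.executemany("INSERT INTO teaches VALUES (?, ?)", [
    ("DEA", 1), ("DEA", 2), ("DEA", 3), ("EMBnet", 2), ("EMBnet", 3),
])


def teachers_of(course):
    """Return the names of the teachers involved in a course."""
    rows = conn.execute(
        "SELECT t.name FROM teacher t "
        "JOIN teaches x ON x.accession = t.accession "
        "WHERE x.course = ? ORDER BY t.accession",
        (course,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling teachers_of("EMBnet") selects only the requested column for the requested course, which is what "choice of the output" refers to.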

Why biological databases? Exponential growth in biological data. Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, microarrays…) are no longer published in a conventional manner, but directly submitted to databases. Databases are essential tools for biological research.

Distribution of sequence databases Books, articles Computer tapes Floppy disks CD-ROM FTP On-line services WWW DVD 1968 -> 1985 1982 ->1992 1984 -> 1990 1989 -> ? 1982 -> 1994 1993 -> ? 2001 -> ?

Some statistics
More than 1000 different ‘biological’ databases
Variable size: <100 Kb to >10 Gb
DNA: >10 Gb; Protein: 1 Gb; 3D structure: 5 Gb; Other: smaller
Update frequency: daily to annually
Usually accessible through the web (free!?)
Amos’ links: www.expasy.org/alinks.html
Biohunt: http://www.expasy.org/BioHunt/
Google: http://www.google.com/

Some databases in the field of molecular biology… AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NRSub, O-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPM, etc... !!!!

Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family (----> tools)
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• ‘Others’ (Microarrays, …)

Sequence databases 1. DNA/RNA 2. Proteins

Ideal minimal content of a « sequence » db Sequences !! Accession number (AC) Taxonomic data References ANNOTATION/CURATION Keywords Cross-references Documentation

Sequence database: example SWISS-PROT (flat file)
Slide callouts: accession number, taxonomy, reference, annotations (comments), cross-references, keywords
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588; Q9UHA0; Q9UEZ5; Q9UDZ0;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=85137899; PubMed=3838366;
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
….
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
…
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
….
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.

Sequence database: example (cont.)
Slide callouts: annotations (features), sequence
FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   PROPEP      190    193       MAY BE REMOVED IN PROCESSED PROTEIN.
FT   DISULFID     34    188
FT   DISULFID     56     60
FT   CARBOHYD     51     51       N-LINKED (GLCNAC...).
FT   CARBOHYD     65     65       N-LINKED (GLCNAC...).
FT   CARBOHYD    110    110       N-LINKED (GLCNAC...).
FT   CARBOHYD    153    153       O-LINKED (GALNAC...).
FT   VARIANT     131    132       SL -> NF (IN AN HEPATOCELLULAR CARCINOMA).
FT                                /FTId=VAR_009870.
FT   VARIANT     149    149       P -> Q (IN AN HEPATOCELLULAR CARCINOMA).
FT                                /FTId=VAR_009871.
FT   CONFLICT     40     40       E -> Q (IN REF. 1; CAA26095).
FT   CONFLICT     85     85       Q -> QQ (IN REF. 5).
FT   CONFLICT    140    140       G -> R (IN REF. 1; CAA26095).
**
**   ######### INTERNAL SECTION #########
**CL 7q22;
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//

Sequence databases: some « technical » definitions
Data storage management:
• flat file: text file
• relational (e.g., Oracle, Postgres)
• object oriented (rare in the biological field)
Flat file formats: FASTA, GCG, NBRF/PIR, MSF… standardized format?

Sequence database: example
…a SWISS-PROT entry, in FASTA format:
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
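To show how little structure the FASTA format carries, here is a small Python sketch that reads such a record and splits the header. It assumes the "db|accession|entry_name description" header layout seen in this example; other sources use different header conventions, and the function names are invented for the illustration.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # a new record starts
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'db|accession|entry_name description' (layout assumed here)."""
    ident, _, description = header.partition(" ")
    db, accession, entry_name = ident.split("|")
    return db, accession, entry_name, description


FASTA = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""

header, sequence = read_fasta(FASTA)[0]
db, accession, entry_name, description = split_header(header)
```

Compared with the full SWISS-PROT entry, only the identifiers, a one-line description and the sequence survive; all annotation is lost.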

Database 1: nucleotide sequences
The main DNA sequence db are EMBL (Europe) / GenBank (USA) / DDBJ (Japan).
There are also specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
3D structure (DNA and RNA)
Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA editing sites; Multimedia Telomere Resource…

Nucleotide and associated topics databases (Amos’ links)
EMBL - EMBL Nucleotide sequence db (EBI)
Genbank - GenBank Nucleotide Sequence db (NCBI)
DDBJ - DNA Data Bank of Japan
dbEST - dbEST (Expressed Sequence Tags) db (NCBI)
dbSTS - dbSTS (Sequence Tagged Sites) db (NCBI)
NDB - Nucleic Acid Databank (3D structures)
BNASDB - Nucleic acid structure db from University of Pune
AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA db
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)
MPDB - Molecular probe db
OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)

EMBL/GenBank/DDBJ
These 3 db contain mostly the same information, synchronized within 2-3 days (with a few differences in format and syntax).
They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
• Genome projects
• Sequencing centers
• Individual scientists
• Patent offices (e.g. European Patent Office, EPO)
Non-confidential data are exchanged daily.
Currently: 18 x 10^6 sequences, over 20 x 10^9 bp. Over the last 12 months the database size has tripled.
Sequences from >50'000 different species.

The tremendous increase in nucleotide sequences
[Figure: growth of EMBL data over time. Annotations: 1980: 80 genes fully sequenced!; first increase in data due to the development of PCR; high-throughput genomes (HTG): human, mouse, rat]

Categories/qualities of nucleotide sequences
ESTs: single-pass cDNA reads (human and mouse)
GSS: Genome Survey Sequences, single-pass genomic DNA sequences
HTG: ‘unfinished’ DNA sequences generated by the high-throughput sequencing centers

EMBL/GenBank/DDBJ
Heterogeneous sequence length: genomes, variants, fragments…
Sequence sizes: max 300'000 bp/entry (! genomic sequences, overlapping); min 10 bp/entry
Archive: nothing goes out -> highly redundant!
Full of errors: in sequences, in annotations, in CDS attribution…
No consistency of annotations; most annotations are done by the submitters; heterogeneous quality, completeness and updating of the information

EMBL/GenBank/DDBJ
Unexpected information you can find in these db:
FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"
Or:
FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"

EMBL entry: example
Slide callouts: keyword, taxonomy, references, cross-references
ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   SWISS-PROT; P01588; EPO_HUMAN.
XX
…

EMBL entry (cont.)
Slide callouts: annotation, sequence, CDS = coding sequence
CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120
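The join(...) locations of this feature table can be checked with a few lines of Python. The sketch below only handles plain "start..end" ranges inside join(); it deliberately ignores other location operators such as complement() or partial-end markers, and its function names are invented for the illustration.

```python
import re


def parse_join(location):
    """Return [(start, end), ...] for 'join(a..b,c..d)' or a single 'a..b'.

    Coordinates are 1-based and inclusive, as in the feature table.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total number of bases covered by all segments of a location."""
    return sum(end - start + 1 for start, end in parse_join(location))


CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
MRNA = "join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)"

cds_bases = spliced_length(CDS)      # 13 + 146 + 87 + 180 + 156
cds_codons = cds_bases // 3
mrna_bases = spliced_length(MRNA)
```

The CDS covers 582 bases, that is 194 codons: 193 amino acids plus the stop codon, which agrees with the 193 AA length of the EPO_HUMAN protein entry.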

GenBank entry: same entry
LOCUS       HSERPG       3398 bp    DNA             PRI       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1  GI:31224
KEYWORDS    erythropoietin; glycoprotein hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of
            human erythropoietin
  JOURNAL   Nature 313 (6005), 806-810 (1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES             Location/Qualifiers
     source          1..3398
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     exon            397..627
                     /number=1
     sig_peptide     join(615..627,1194..1261)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /codon_start=1
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /db_xref="GI:312304"
                     /db_xref="SWISS-PROT:P01588"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
                     EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
                     RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI
                     …

GenBank entry (cont.)
                     …
                     TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
     intron          628..1193
                     /number=1
     exon            1194..1339
                     /number=2
     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2760)
                     /product="erythropoietin"
     intron          1340..1595
                     /number=2
     exon            1596..1682
                     /number=3
     intron          1683..2293
                     /number=3
     exon            2294..2473
                     /number=4
     intron          2474..2607
                     /number=4
     exon            2608..3327
                     /note="3' untranslated region"
                     /number=5
BASE COUNT      698 a   1034 c    991 g    675 t
ORIGIN
        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
      121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
      181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
      241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc

EMBL: the genome divisions http://www.ebi.ac.uk/genomes/ Example: Schizosaccharomyces pombe strain 972h- complete genome

Human genome (2002)
• The completion of the draft human genome sequence was announced on 26 June 2000.
• The public human genome sequence was published in Nature on 15 February 2001. Approx. 30,000 genes are analysed, 1.4 million SNPs and much more.
• The draft sequence data are available at EMBL/GenBank/DDBJ.
• Finished: the clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.
• The general assumption is that about 50% of the bases are redundant.

Nucleotide databases and « associated » genomic projects/databases
Problem: redundancy makes BLAST searches of the complete databases useless for detecting anything beyond the closest homologs.
Solutions:
• assemblies of genomic sequence data (contigs) and corresponding RNA and protein sequences -> dataset of genomic contigs, RNAs and proteins
• annotation of genes, RNAs, proteins, variation (SNPs), STS markers, gene prediction, nomenclature and chromosomal location
• computed connections to other resources (cross-references)
Examples: RefSeq/LocusLink (drosophila, human, mouse, rat and zebrafish), TIGR (microbes and plants), EnsEMBL (Eukaryota)…

LocusLink / RefSeq: erythropoietin receptor

Database 2: protein sequences
SWISS-PROT: created in 1986 (A. Bairoch) http://www.expasy.org/sprot/
TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (« proteomic » version of EMBL)
PIR-PSD: Protein Information Resource http://pir.georgetown.edu/
GenPept: « proteomic » version of GenBank
Many specialized protein databases for specific families or groups of proteins. Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), YPD (yeast), etc.

SWISS-PROT
Collaboration between the SIB (CH) and EMBL/EBI (UK)
Fully annotated (manually), non-redundant, cross-referenced, documented protein sequence database.
~113'000 sequences from more than 6'800 different species; 70'000 references (publications); 550'000 cross-references (databases); ~200 Mb of annotations.
Weekly releases; available from about 50 servers across the world, the main source being ExPASy

TrEMBL (Translation of EMBL)
It is impossible to cope with the quantity of newly generated data AND to maintain the high quality of SWISS-PROT -> TrEMBL, created in 1996.
TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools.
Contains everything that is not in SWISS-PROT. SWISS-PROT + TrEMBL = all known protein sequences.
Well-structured SWISS-PROT-like resource.

The simplified story of a SWISS-PROT entry
Some data are not submitted to the public databases!! (delayed or cancelled…)
cDNAs, genomes, … -> EMBLnew -> EMBL -> CDS -> TrEMBLnew -> TrEMBL -> SWISS-PROT
« Automated » (TrEMBL):
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)
« Manual » (SWISS-PROT):
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD…)
• Brain storming
Once in SWISS-PROT, the entry is no longer in TrEMBL, but is still in EMBL (archive).
CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally proven or derived from gene prediction programs).
TrEMBL neither translates DNA sequences nor uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.

Remark: about 30% of the genes annotated in newly sequenced genomes such as Arabidopsis thaliana are, at present (Sept 2001), purely the result of computational predictions. Pertea et al., Nucleic Acids Research (2001), 29, 1185-1190

TrEMBL: a platform for improving automated annotation tools
• After a lot of testing, many new annotation tools are going to be applied systematically (SignalP, TMMPred, REP, InterPro domain assignment).
• EVIDENCE TAGS are added to any part of a TrEMBL entry not derived from the original EMBL entry (not available to external users). -> follow-up of all added information

Some nomenclature. Example: SRS6 at the Sanger Centre http://www.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+top

SWISS-PROT + TrEMBL new (SWALL, SPTR) (Standard) (Preliminary)
TrEMBL = SPTrEMBL + REMTrEMBL
SPTrEMBL contains TrEMBL entries which will be integrated into SWISS-PROT.
REMTrEMBL contains TrEMBL entries which will never be integrated into SWISS-PROT.
TrEMBLnew contains entries which have not yet been integrated into TrEMBL (weekly update to TrEMBL).
SPTR (SWALL) = SWISS-PROT + (SP)TrEMBL + TrEMBLnew

Line code   Content                          Occurrence in an entry
---------   ------------------------------   ----------------------
ID          Identification                   One; starts the entry
AC          Accession number(s)              One or more
DT          Date                             Three times
DE          Description                      One or more
GN          Gene name(s)                     Optional
OS          Organism species                 One or more
OG          Organelle                        Optional
OC          Organism classification          One or more
RN          Reference number                 One or more
RP          Reference position               One or more
RC          Reference comment(s)             Optional
RX          Reference cross-reference(s)     Optional
RA          Reference authors                One or more
RT          Reference title                  Optional
RL          Reference location               One or more
CC          Comments or notes                Optional
DR          Database cross-references        Optional
KW          Keywords                         Optional
FT          Feature table data               Optional
SQ          Sequence header                  One
            Amino acid sequence              One
//          Termination line                 One; ends the entry
Lines in which you may find ‘manually annotated’ information
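Because every line starts with a two-letter code, an entry can be taken apart with very little code. The following Python sketch groups the lines of one entry by line code; it is an illustration only (the function names are invented, continuation handling is naive, and the sample keeps just a few lines of the EPO_HUMAN entry).

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its line contents."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):        # termination line: end of the entry
            break
        code, content = line[:2], line[2:].strip()
        if code.strip():                 # sequence lines start with blanks: skipped
            fields[code].append(content)
    return dict(fields)


def accession_numbers(fields):
    """Split the AC line(s) into individual accession numbers."""
    joined = " ".join(fields.get("AC", []))
    return [ac.strip() for ac in joined.split(";") if ac.strip()]


ENTRY = """\
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588; Q9UHA0; Q9UEZ5; Q9UDZ0;
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
//
"""

fields = group_by_line_code(ENTRY)
primary_ac = accession_numbers(fields)[0]
```

The first accession number on the AC line is the primary one; the others are secondary accession numbers kept for entries that were merged.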

a Swiss-Prot entry… overview Entry name Accession number sequence

Protein name Gene name Taxonomy

References

Comments

Cross-references

Keywords

Feature table (sequence description)

TrEMBL: example. Original TrEMBL entry which has been integrated into the SWISS-PROT EPO_HUMAN entry and is therefore no longer found in TrEMBL.

SWISS-PROT / TrEMBL: minimal redundancy
• Combining SWISS-PROT and TrEMBL introduces some degree of redundancy
• Only 100% identical sequences are automatically merged between SWISS-PROT and TrEMBL
• Complete sequences or fragments with 1-3 conflicts will be automatically merged soon (genome projects; check for chromosomal location and gene names)

SWISS-PROT / TrEMBL: minimal redundancy. Human EPO: Blastp results

SWISS-PROT and TrEMBL introduce a new arithmetical concept! How many sequences in SWISS-PROT + TrEMBL? 113'000 + 670'000 = about 450'000 (Sept 2002)

SWISS-PROT and TrEMBL introduce a new arithmetical concept! In the case of human data, the redundancy is still very high: 8'400 + 41'000 = about 20'000

SWISS-PROT and the cross-references (X-ref)
• SWISS-PROT was the first database with X-refs;
• Explicitly X-referenced to 36 databases: X-ref to DNA (EMBL/GenBank/DDBJ), 3D structure (PDB), literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D gel (SWISS-2DPAGE), specialized db (PROSITE, TRANSFAC);
• Implicitly X-referenced to 17 additional db added by the ExPASy servers on the WWW (e.g. GeneCards, PRODOM, HUGE, etc.)
Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55

[Diagram: SWISS-PROT at the center, cross-referenced to the following groups of databases]
Nucleotide sequence db: EMBL, GenBank, DDBJ
Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb
Human diseases: MIM
2D and 3D structural dbs: HSSP, PDB
Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, Zebrafish
Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
PTM: CarbBank, GlycoSuiteDB
2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE

Database 2: Protein sequence What else ?

http://pir.georgetown.edu/

PIR-PSD: example « well annotated »

Databases 3: ‘genomics’
Contain information on gene chromosomal location (mapping) and nomenclature, and provide links to sequence databases; usually contain no sequences.
Exist for most organisms important in life science research; usually species-specific.
Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), etc.
Generally relational db (Oracle, Sybase or AceDb).

MIM
OMIM™: Online Mendelian Inheritance in Man. A catalog of human genes and genetic disorders; contains a summary of the literature and reference information, as well as links to publications and sequence information.

GeneCards: an electronic encyclopedia of biological and medical information based on intelligent knowledge navigation technology

http://www.genelynx.org/

Collections of hyperlinks for each human gene

Databases 4: mutation/polymorphism
Contain information on sequence variations, whether or not linked to genetic diseases.
Mainly human, but: OMIA - Online Mendelian Inheritance in Animals
General db:
• OMIM
• HGMD - Human Gene Mutation db
• SVD - Sequence variation db
• HGBASE - Human Genic Bi-Allelic Sequences db
• dbSNP - Human single nucleotide polymorphism (SNP) db
Disease-specific db: most of these databases are linked either to a single gene or to a single disease:
• p53 mutation db
• ADB - Albinism db (mutations in human genes causing albinism)
• Asthma and Allergy gene db
• …

For human

Mutation/polymorphism: definitions
SNPs: single nucleotide polymorphisms; occur approximately once every 100 to 300 bases.
c-SNPs: coding single nucleotide polymorphisms (single nucleotide polymorphisms within cDNA sequences)
SAPs: single amino-acid polymorphisms
Missense mutation -> SAP
Nonsense mutation -> STOP
Insertion/deletion of nucleotides -> frameshift…
! The numbering of the mutated amino acid depends on the db (aa no. 1 is not necessarily the initiator Met!)
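The difference between missense and nonsense mutations can be illustrated with a short Python sketch that compares a reference codon with a mutated codon using the standard genetic code. The "silent" category and the function name classify_substitution are additions made for this example; they do not appear on the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a single-codon change as silent, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if alt_aa == ref_aa:
        return "silent"        # same amino acid: no SAP
    if alt_aa == "*":
        return "nonsense"      # premature STOP
    return "missense"          # amino acid replaced: a SAP


examples = {
    ("GAA", "CAA"): classify_substitution("GAA", "CAA"),   # Glu -> Gln
    ("TGG", "TGA"): classify_substitution("TGG", "TGA"),   # Trp -> stop
    ("CTT", "CTC"): classify_substitution("CTT", "CTC"),   # Leu -> Leu
}
```

Insertions and deletions are not covered here: they shift the reading frame, so every downstream codon has to be re-read rather than a single codon compared.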

Mutation/polymorphism
The SNP Consortium (TSC) http://snp.cshl.org/
Public/private collaboration: Bayer, Roche, IBM, Pfizer, Novartis, Motorola…
Has to date discovered and characterized nearly 1.5 million SNPs; in addition, the allele frequencies in three major world populations have been determined on a subset of ~57,000 SNPs.
dbSNP at NCBI http://www.ncbi.nlm.nih.gov/SNP/
Collaboration between the National Human Genome Research Institute and the National Center for Biotechnology Information (NCBI)
Mission: central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms (several species)
As of August 2002, dbSNP has submissions for 4'700'000 SNPs.
Chromosome 21 cSNP db http://csnp.isb-sib.ch/
A joint project between the Division of Medical Genetics of the University of Geneva Medical School and the SIB
Mission: comprehensive cSNP (single nucleotide polymorphisms within cDNA sequences) database and map of chromosome 21

Mutation/polymorphism
Generally of modest size; the lack of coordination and standards in these databases makes the data difficult to access.
There are initiatives to unify these databases: Mutation Database Initiative (4 July 1996).
-> SVD - Sequence Variation Database project at EBI (HMutDB) http://www2.ebi.ac.uk/mutations/
-> HUGO Mutation Database Initiative (MDI), Human Genome Variation Society http://www.genomic.unimelb.edu.au/mdi/dblist.html

Before… End of the first part… After the first part…