
  • Number of slides: 106

Biological Databases
• What is a database?
• What types of data are available?
• What is the GenBank schema?
• What does an entry in a biological database look like?
• How are they used?

Data Jungle
(Figure labels: toxicology, physiology, sequencing information, molecular biology, genetics, medicine, gene expression, structural biology)

Bio-databases: A short word on problems
• Even today we face some key limitations
 – There is no standard format
   • Every database or program has its own format
 – There is no standard nomenclature
   • Every database has its own names
 – Data is not fully optimized
   • Some datasets have missing information without any indication of it
 – Data errors
   • Data is sometimes of poor quality, erroneous, or misspelled
   • Errors propagate through computer annotation

What is a database?
• A collection of data that must be:
 – structured
 – searchable
 – regularly updated
 – linked and cross-referenced to other data collections

Types of Databases (NAR 2004)
• Genomic databases have grown considerably
• Traditional sequence databases are now only a small part of the total
• The new databases contain new types of data
• Metabolic pathway databases and disease-focused databases

Bioinformatics Information Space (July 17, 1999)
Nucleotide sequences: 4,456,822
Protein sequences: 706,862
3D structures: 9,780
Human UniGene clusters: 75,832
Maps and complete genomes: 10,870
Different species: 52,889
dbSNP: 6,377
RefGenes: 515
Human contigs > 250 kb: 341 (4.9 MB)
PubMed records: 10,372,886
OMIM records: 10,695

Bioinformatics Information Space (February 10, 2004)
Nucleotide records: 36,653,899
Protein sequences: 4,436,362
3D structures: 19,640
Interactions & complexes: 52,385
Human UniGene clusters: 118,517
Maps and complete genomes: 6,948
Different taxonomy nodes: 283,121
Human dbSNP: 13,179,601
Human RefSeq records: 22,079
bp in human contigs > 5,000 kb (116): 2,487,920,000
PubMed records: 12,570,540
OMIM records: 15,138

GenomeNet / KEGG
GenomeNet: www.genome.ad.jp
• Regulatory pathways
• Metabolic pathways
• Graphical pathway maps and ortholog group tables
• Maps are fully interactive

http://nar.oupjournals.org/content/vol31/issue2/

Biological databases
• Like any other kind of database
 – The data are organized for optimal analysis
• Very diverse data types
 – DNA/RNA sequences
 – SNPs
 – Protein sequences
 – Protein structures
 – Functions
 – Genomes
 – Gene expression data
 – Metabolic pathways
 – Scientific literature
 – Species-specific databases
 – Protein-protein and protein-DNA interactions
 – Diseases, etc.

Types of Databases (NAR 2005)

Databases: four layers
Information systems / Query systems / Storage systems / Data

• Data: GenBank flat file, PDB file, interaction record, title of a book, book
• Storage systems: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Query systems: a list, a catalogue, indexed files, SQL, grep
• Information systems: Google, Entrez, SRS

Types of database organization
• Flat file databases (flat DBMS)
 – Simple, restrictive, tables
• Hierarchical databases (hierarchical DBMS)
 – Simple, restrictive, tables
• Relational databases (RDBMS)
 – Complex, versatile, tables
• Object-oriented databases (ODBMS)
 – Complex, versatile, organized as objects

DBMS
• Internal organization
 – Controls speed and flexibility
• A set of programs that:
 – Store
 – Extract
 – Modify
(Diagram: USER(S) store, extract and modify data in the Database.)

Advanced databases
• Relational databases
 – Contain data and the relationships between them
• Version control
• Consistency control
• Multi-author/multi-user, with security
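To make "data and their relationships" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema and the function names are hypothetical and chosen only for illustration; the accessions, lengths and taxon ID are taken from the GenBank examples shown later in this deck.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relation
    conn.executescript("""
        CREATE TABLE organism (
            taxon_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL
        );
        CREATE TABLE sequence (
            accession TEXT PRIMARY KEY,
            length_bp INTEGER NOT NULL,
            taxon_id  INTEGER NOT NULL REFERENCES organism(taxon_id)
        );
    """)
    conn.execute("INSERT INTO organism VALUES (9606, 'Homo sapiens')")
    conn.execute("INSERT INTO sequence VALUES ('AF053749', 1943, 9606)")
    conn.execute("INSERT INTO sequence VALUES ('U40282', 1789, 9606)")
    conn.commit()
    return conn

def accessions_for(conn, organism_name):
    """Follow the sequence -> organism relation with a JOIN."""
    rows = conn.execute(
        "SELECT s.accession FROM sequence AS s "
        "JOIN organism AS o ON o.taxon_id = s.taxon_id "
        "WHERE o.name = ? ORDER BY s.accession",
        (organism_name,),
    )
    return [row[0] for row in rows]

conn = build_demo_db()
print(accessions_for(conn, "Homo sapiens"))
```

The organism name is stored once and referenced by key, which is what lets a relational system keep many sequence records consistent when the shared data change.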

Data warehouse
• Periodically import data from other databases and store them locally.
• With data obtained from other sites, the same databases can be recreated or new ones built, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with gene expression and other experimental data).
• Drawbacks: expensive, resource-intensive, needs staff and updates.
• Advantages: easy control of the data mining process when automated tasks are used.

Data collections around the world
Storage in databases
Global efforts to collect:
• sequence data
• structure data
• protein expression profiles
• functional data
• gene expression profiles ...
Data analysis: Bioinformatics

Data exchange between the nucleotide archives
• Information is mirrored daily between DDBJ, GenBank and EMBL.
• NCBI (NIH): GenBank, queried through Entrez; receives submissions and updates
• EBI (EMBL): EMBL, queried through SRS; receives submissions and updates
• CIB (NIG): DDBJ, queried through getentry; receives submissions and updates

Databases
• Primary (archival)
 – GenBank/EMBL/DDBJ
 – UniProt
 – PDB
 – Medline (PubMed)
 – BIND
• Secondary (curated)
 – RefSeq
 – Taxon
 – UniProt
 – OMIM
 – SGD

Nucleic acid databases
DDBJ, GenBank, EMBL, VECTOR, GenPept, ESTs, RepBase, HIVbase, IMGT, TrEMBL

Protein databases
Swiss-Prot, GenPept, HIVbase, PIR, TrEMBL, Enzyme, PROSITE, PDB, BLOCKS

DNA Sequence Files
• Formats
 – Information content
• Conversion/use

Common formats
• GenBank
• ASN.1
• FASTA
• GCG
• IG (Intelligenetics)
• Text
• Others!

FASTA

>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGAT
AATACTAGATTTATTCCTGGAAACTTAAATTAGTTATTTTAAGT
TATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTC
GTTCCAA
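Because a FASTA record is just a ">" header line followed by sequence lines, it can be read with very little code. The sketch below is a minimal reader written for illustration (the function name parse_fasta is hypothetical, and it does no validation of the residues); the example text is the beginning of the record shown on this slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

EXAMPLE = """>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGAT
AATACTAGATTTATTCCTGGAAACTTAAATTAGTTATTTTAAGT
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence))
```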

What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/GenbankOverview.html
Benson et al., 2004, Nucleic Acids Res. 32:D23-D26

GenBank Flat File (GBFF)

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama 930-01, Japan
            (E-mail:[email protected] toyama-mpu.ac.jp
            Tel:+81-764-34-2281(ex.2841), Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttttttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca cccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

Sections of the record, as labelled on the slide:
• Header: title, taxonomy, citation
• Features (AA seq)
• DNA sequence
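The first line of a flat-file record already carries several searchable fields. As a rough sketch, the function below (parse_locus_line, a hypothetical name) splits a LOCUS line laid out like the one in this record: name, length, "bp", molecule type, division, date. It assumes exactly that layout; LOCUS lines with extra fields would need more careful handling.

```python
def parse_locus_line(line):
    """Split a LOCUS line of the form shown on the slide into named fields."""
    parts = line.split()
    # Assumed layout: LOCUS <name> <length> bp <molecule> <division> <date>
    if len(parts) != 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout: %r" % line)
    return {
        "name": parts[1],
        "length_bp": int(parts[2]),
        "molecule": parts[4],
        "division": parts[5],
        "date": parts[6],
    }

locus = parse_locus_line(
    "LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997"
)
print(locus["name"], locus["length_bp"], locus["division"])
```

Here the division code ROD identifies a rodent sequence, which matches the Rodentia lineage in the ORGANISM block.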

Example: growth factor implicated in Parkinson syndrome (GenBank entry)

LOCUS       AF053749     1943 bp    DNA             PRI       09-JUL-1999
DEFINITION  Homo sapiens glial cell line-derived neurotrophic factor (GDNF)
            gene, 5' flanking sequence and exon 1.
ACCESSION   AF053749
NID         g5430697
VERSION     AF053749.1  GI:5430697
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1943)
  AUTHORS   Baecker,P.A., Lee,W.H., Verity,A.N., Eglen,R.M. and Johnson,R.M.
  TITLE     Characterization of a promoter for the human glial cell
            line-derived neurotrophic factor gene
  JOURNAL   Brain Res. Mol. Brain Res. 69 (2), 209-222 (1999)
  MEDLINE   99296655
REFERENCE   2  (bases 1 to 1943)
  AUTHORS   Baecker,P.A., Lee,W.H., Verity,A.N., Eglen,R.M. and Johnson,R.M.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-MAR-1998) Molecular and Cellular Biochemistry,
            Roche Bioscience, 3401 Hillview Avenue, Palo Alto, CA 94304, USA
...

Example: growth factor implicated in Parkinson syndrome (continued)

FEATURES             Location/Qualifiers
     source          1..1943
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="5"
                     /map="5p12-p13.1"
     gene            1..>1943
                     /gene="GDNF"
     misc_feature    1..1643
                     /gene="GDNF"
                     /note="5' flanking region"
     mRNA            1644..>1817
                     /gene="GDNF"
                     /product="glial cell line-derived neurotrophic factor"
     5'UTR           1644..>1817
                     /gene="GDNF"
     exon            1644..1817
                     /gene="GDNF"
                     /number=1
BASE COUNT      356 a    662 c    576 g    349 t
ORIGIN
     GAATTCAGGT CCAATGGCTT CCGGAAAACA GGTTTCTGCT TAGCAAAGAC ATGCCCTATT     60
     TAGTACATTA TTTTAGAGGT ACAGCCAATT CCATGCCCCA TGTGAA ATGTATTTAT    120
     GGTTATAGCC ATGCACAGGG TGTGTAAGGA CTTGCCCTCC TCCTGTCCTC TACAAAAGAA    180
     GGCTCAGGCA GCTTCTGGTG GTGAACTAAC CAACAAAAGG AATGCCCAGA AGGTCTCACC    240
     TCTCCCATCC ACAGAGCTCT GGAATGGGGG CCGGGCCCCT GATCGCTGGA AACTCAGCAT    300
     CCAAGTGGGC GCTTGCTGAA GTTTCCCATC TGCATTTTCG AAAATCTGGA TAAAAGCAGG    360
     TTTAGCTCAA CCTCCCCTAA CCCGTTCCTG ATAAAGTGAT CTTACGCCTC TGGAATTGGG    420
...
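Feature locations in this entry come in three shapes: a range (1..1943), a range whose end is marked partial with ">" (1644..>1817), and, in other records, a single position. The sketch below parses just those simple forms; the function name and the dictionary keys are hypothetical, and compound locations such as complement(...) or join(...) are deliberately not handled.

```python
import re

# Optional "<" before the start, optional ">" before the end, optional end.
_LOCATION = re.compile(r"(<?)(\d+)(?:\.\.(>?)(\d+))?")

def parse_location(text):
    """Parse a simple feature location such as '1644..>1817' or '1803'."""
    match = _LOCATION.fullmatch(text.strip())
    if match is None:
        raise ValueError("unsupported location: %r" % text)
    start = int(match.group(2))
    end = int(match.group(4)) if match.group(4) else start
    return {
        "start": start,
        "end": end,
        "start_partial": match.group(1) == "<",
        "end_partial": match.group(3) == ">",
        "length": end - start + 1,   # positions are 1-based and inclusive
    }

print(parse_location("1644..>1817"))
print(parse_location("1..1643"))
```

For the GDNF exon, the partial end flag signals that the mRNA continues beyond the last base of the submitted sequence.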

GenBank divisions
PRI: primate sequences
ROD: rodent sequences
MAM: other mammalian sequences
VRT: other vertebrate sequences
INV: invertebrate sequences
PLN: plant, fungal and algal sequences
BCT: bacterial sequences
VRL: viral sequences
PHG: bacteriophage sequences
SYN: synthetic sequences
UNA: unannotated sequences
EST: expressed sequence tags
PAT: patent sequences
STS: sequence tagged sites
GSS: genome survey sequences
HTC: high throughput cDNA sequences
HTG: high throughput genomic sequences

Features

FEATURES             Location/Qualifiers
     source          1..1234
                     /organism="Pneumocystis carinii f. sp. carinii"
                     /strain="Form 6"
                     /note="450 kb chromosome"
                     /db_xref="taxon:38081"
     5'UTR           1..90
     gene            91..1155
                     /gene="pcg1"

CDS: Critical Evidence?

     CDS             91..1155
                     /gene="pcg1"
                     /note="G-protein alpha subunit"
                     /codon_start=1
                     /product="guanosine nucleotide binding protein alpha subunit"
                     /protein_id="AAC49295.1"
                     /db_xref="PID:g1345099"
                     /db_xref="GI:1345099"
                     /translation="MGCCFSATYNQDTLRSKEIE
                     SYLRQEQEHACHEAKILLLGAGES…

What’s missing in DNA sequence files?
• Expression data
• Variation
• Curation/referee system limited
• EC or other standard bio-links
• Auto-update links to other information
• Specific clone information
 – Plasmid construction

What’s missing in protein files?
• Evidence that the protein exists
 – MOST ARE INFERRED from DNA
   • (DNA-to-protein links are not truly dynamic)
• EC links to metabolism/regulation/structure
 – Not uniformly done (see NoEc.gb.txt)
• Uniform description of modifications
• Cellular location

Types of files in GenBank
• From one-gene investigators
 – Often a very well annotated cDNA
 – A genomic segment from a new invertebrate
 – A mitochondrion or virus
• From population/phylogenetic analysis
 – rRNA amplicon from environmental sampling
• From genome centers:
 – Gene expression:
   • Expressed Sequence Tags (ESTs)
   • Full-length insert cDNA
 – Genome sequencing projects
   • WGS
   • HTG
   • CON

UniProt
• New protein sequence database resulting from the merger of SWISS-PROT and PIR. It will be the annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
• UniProt is a flat-file database, just like EMBL and GenBank.
• The flat-file format is Swiss-Prot-like, or EMBL-like.

Swiss-Prot

Sections marked on the slide: identification (ID, AC, DT, DE, GN), taxonomy (OS, OC), citation (RN to RL), comments (CC) including the disclaimer, database cross-references (DR), keywords (KW), features (FT) and sequence (SQ).

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
RN   [2]
RP   SEQUENCE FROM N.A., AND CHARACTERIZATION.
RC   STRAIN=DBY939;
RX   MEDLINE; 93328685. [NCBI, ExPASy, Israel, Japan]
RA   YAMAGATA S., D'ANDREA R.J., FUJISAKI S., ISAJI M., NAKAMURA K.;
RT   "Cloning and bacterial expression of the CYS3 gene encoding
RT   cystathionine gamma-lyase of Saccharomyces cerevisiae and the
RT   physicochemical and enzymatic properties of the protein.";
RL   J. BACTERIOL. 175:4800-4808(1993).
RN   [3]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93289814. [NCBI, ExPASy, Israel, Japan]
RA   BARTON A.B., KABACK D.B., CLARK M.W., KENG T., OUELLETTE B.F.F.,
RA   STORMS R.K., ZENG B., ZHONG W.W., FORTIN N., DELANEY S., BUSSEY H.;
RT   "Physical localization of yeast CYS3, a gene whose product resembles
RT   the rat gamma-cystathionase and Escherichia coli cystathionine
RT   gamma-synthase enzymes.";
RL   YEAST 9:363-369(1993).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=S288C / AB972;
RX   MEDLINE; 93209532. [NCBI, ExPASy, Israel, Japan]
RA   OUELLETTE B.F.F., CLARK M.W., KENG T., STORMS R.K., ZHONG W.W.,
RA   ZENG B., FORTIN N., DELANEY S., BARTON A.B., KABACK D.B., BUSSEY H.;
RT   "Sequencing of chromosome I from Saccharomyces cerevisiae: analysis
RT   of a 32 kb region between the LTE1 and SPO7 genes.";
RL   GENOME 36:32-42(1993).
RN   [5]
RP   SEQUENCE OF 1-18, AND CHARACTERIZATION.
RX   MEDLINE; 93289817. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., ISHII N., NAITO K., MIYOSHI S.-I., SHINODA S., YAMAMOTO S.,
RA   OHMORI S.;
RT   "Cystathionine gamma-lyase of Saccharomyces cerevisiae: structural
RT   gene and cystathionine gamma-synthase activity.";
RL   YEAST 9:389-397(1993).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE + NH(3) +
CC       2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a
CC   collaboration between the Swiss Institute of Bioinformatics and the
CC   EMBL outstation the European Bioinformatics Institute. There are no
CC   restrictions on its use by non-profit institutions as long as its
CC   content is in no way modified and this statement is not removed.
CC   Usage by and for commercial entities requires a license agreement
CC   (See http://www.isb-sib.ch/announce/ or send an email to
CC   [email protected] ch).
CC   --------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203              PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
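Every line of a Swiss-Prot entry begins with a two-letter line code (ID, AC, DT, ..., SQ), which makes the format easy to read mechanically. The sketch below groups the lines of one entry by that code. It is illustrative only: group_by_line_code is a hypothetical helper, the "sequence" key is my own convention for the code-less sequence lines, and the sample text is a shortened excerpt of the CYS3_YEAST entry.

```python
from collections import defaultdict

def group_by_line_code(entry_text):
    """Group the lines of one Swiss-Prot style entry by two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):          # terminator line ends the entry
            break
        code, content = line[:2], line[2:].strip()
        if not content:
            continue                       # skip blank or empty lines
        if code.strip():
            fields[code].append(content)
        else:
            # Sequence lines carry no code; keep residues without spaces.
            fields["sequence"].append(content.replace(" ", ""))
    return dict(fields)

ENTRY = """\
ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD
//
"""

fields = group_by_line_code(ENTRY)
print(fields["AC"], len(fields["OC"]))
```

Repeated codes (two OC lines here) are kept as a list in order, so multi-line fields such as the taxonomy can be re-joined by the caller.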

Swiss-Prot

Functional divisions
PAT: Patent
EST: Expressed Sequence Tags
STS: Sequence Tagged Site
GSS: Genome Survey Sequence
HTG: High Throughput Genome (unfinished)
HTC: High throughput cDNA (unfinished)
CON: Contig assembly instructions
ENV: Environmental sampling methods

Organismal divisions: BCT, PRI, FUN, ROD, INV, SYN, MAM, VRL, PHG, VRT, PLN

Swiss-Prot
• SWISS-PROT incorporates:
 – Function of the protein
 – Post-translational modifications
 – Domains and sites
 – Secondary structure
 – Quaternary structure
 – Similarities to other proteins
 – Diseases associated with deficiencies in the protein
 – Sequence conflicts, variants, etc.

TrEMBL
• TrEMBL is a computer-annotated protein sequence database supplementing the SWISS-PROT Protein Sequence Data Bank.
• TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in SWISS-PROT.
• TrEMBL can be considered a preliminary section of SWISS-PROT. SWISS-PROT accession numbers have already been assigned to all TrEMBL entries that should eventually be upgraded to standard SWISS-PROT quality.

PDB
• Protein Data Bank
 – Protein and nucleic acid (NA) 3D structures
 – Sequence present
 – YAFFF

PDB
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM coordinates

HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
COMPND   2 ATF/CREB SITE DNA
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC
AUTHOR    T.J.RICHMOND
REVDAT   1   22-JUN-94 1DGC    0
JRNL        AUTH   P.KONIG,T.J.RICHMOND
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA
JRNL        TITL 3 FLEXIBILITY
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM                    X-PLOR
REMARK   3   AUTHORS                    BRUNGER
REMARK   3   R VALUE                    0.216
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES
REMARK   3
REMARK   3   NUMBER OF REFLECTIONS      3296
REMARK   3   RESOLUTION RANGE    10.0 - 3.0    ANGSTROMS
REMARK   3   DATA CUTOFF                3.0    SIGMA(F)
REMARK   3   PERCENT COMPLETION         98.2
REMARK   3
REMARK   3   NUMBER OF PROTEIN ATOMS             456
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS        386
REMARK   4
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO
REMARK   4 ACID BIOSYNTHETIC ENZYMES.
REMARK   5
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.
REMARK   6
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.
REMARK   7
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -
REMARK   7 226 ARE NOT WELL ORDERED.
REMARK   8
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:
REMARK   8 5' T G G A T G A C G T C A T C C
REMARK   8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9
REMARK   9
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.
REMARK  10
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:
REMARK  10
REMARK  10    0 -1  0     X     117.32     X SYMM
REMARK  10   -1  0  0     Y  +  117.32  =  Y SYMM
REMARK  10    0  0 -1     Z      43.33     Z SYMM
SEQRES   1 A   62  ILE VAL PRO GLU SER ASP PRO ALA LEU LYS ARG
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ARG SER ARG ALA ARG
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU
SEQRES   4 A   62  GLU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL
SEQRES   5 A   62  ALA ARG LEU LYS LEU VAL GLY GLU ARG
SEQRES   1 B   19  T G G A T G A C G T C
SEQRES   2 B   19  A T C C
HELIX    1   A ALA A  228  LYS A  276  1
CRYST1   58.660   86.660   90.00   P 41 21 2     8
(ORIGX and SCALE records: only the values 1.000000 0.000000 1.000000 0.017047 0.000000 0.011539 0.00000 survive in the transcript)
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94
ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82
...
ATOM    842  C5    C B   9      57.692 100.286  22.744  1.00 29.82
ATOM    843  C6    C B   9      58.128 100.193  21.465  1.00 30.63
TER     844
MASTER       46    0    0    1    0    0    0    6  842    2    0    7
END

Format
• ASN.1
• Flat files
 – DNA
 – Protein
• FASTA
 – DNA
 – Protein

Abstract Syntax Notation (ASN.1)

FASTA

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
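The header line of this record packs several identifiers into one bar-separated token: a gi number, a source database tag (sp for Swiss-Prot here), an accession and an entry name, followed by a free-text description. A minimal sketch of pulling those apart is given below. The function name and the result keys are hypothetical, and the layout handled is only the one seen in the two FASTA examples of this deck.

```python
def parse_defline(defline):
    """Split a header like 'gi|121066|sp|P03069|GCN4_YEAST description...'."""
    defline = defline.strip().lstrip(">").strip()
    id_part, _, description = defline.partition(" ")
    fields = id_part.split("|")
    result = {"description": description}
    if len(fields) >= 2 and fields[0] == "gi":
        result["gi"] = fields[1]          # GenInfo identifier
        fields = fields[2:]
    if len(fields) >= 2:
        result["db"] = fields[0]          # e.g. 'sp' or 'gb'
        result["accession"] = fields[1]
    if len(fields) >= 3 and fields[2]:
        result["name"] = fields[2]        # entry or locus name
    return result

info = parse_defline(">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4")
print(info["db"], info["accession"], info["name"])
```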

Graphical Representation

Guiding Principles
In GenBank, records are grouped for various reasons: understanding this is key to using this database and taking full advantage of it.

Identifiers
• You need identifiers that are stable through time
• You need identifiers that will always refer to specific sequences
• You need these identifiers to track the history of sequence updates
• You also need feature and annotation identifiers

LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in the database. It is not maintained amongst databases and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system in which the accession and version together play the same role as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
PID: Protein identifier: a g, e or d prefix to a gi number. A CDS can have one or two.
Protein gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
protein_id: Identifier with the same structure and function as the nucleotide accession.version numbers, but in a slightly different format.
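The accession.version scheme separates the stable part of an identifier from the part that changes with the sequence. As a small illustration, the hypothetical helpers below split such an identifier and compare two of them while ignoring the version; the example values U40282.1 and AAC16892.1 are the ones used in the ILK record in this deck.

```python
def split_version(identifier):
    """Return (accession, version); version is None if there is no '.N' suffix."""
    identifier = identifier.strip()
    accession, dot, version = identifier.rpartition(".")
    if not dot or not version.isdigit():
        return identifier, None
    return accession, int(version)

def same_record(id_a, id_b):
    """True if both identifiers share an accession, whatever their versions."""
    return split_version(id_a)[0] == split_version(id_b)[0]

print(split_version("U40282.1"))
print(same_record("U40282.1", "U40282"))
```

Citing the bare accession points at the record as a whole, while citing accession.version pins down one exact sequence.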

LOCUS, Accession, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

Identifiers in this record:
LOCUS: HSU40282
ACCESSION: U40282
VERSION (accession.version): U40282.1
gi: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1

EST: Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/

STS: Sequence Tagged Sites are operationally unique sequences that identify the combination of primer pairs used in a PCR assay that generates a mapping reagent which maps to a single position within the genome.
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/

GSS: Genome Survey Sequences are similar in nature to the ESTs, except that their sequences are genomic in origin, rather than cDNA (mRNA). The GSS division contains:
• random "single pass read" genome survey sequences
• single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences
Also see: http://www.ncbi.nlm.nih.gov/dbGSS/

HTG: High Throughput Genome Sequences are records from unfinished genome sequencing efforts. Unfinished records have gaps in the nucleotide sequence, low accuracy, and no annotations.
Also see: http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7: 952-955

HTGS in GenBank
phase 0: Acc = AC000003, gi = 1235673
phase 1: Acc = AC000003, gi = 1556454
phase 2: Acc = AC000003, gi = 2182283
phase 3: Acc = AC000003, gi = 2204282
Division: HTG, HTG, PRI

HTGS in GenBank
• Unfinished record
– Sequencing will be unfinished
– Phase 1 or phase 2
– HTG division
– KEYWORDS: HTG; HTGS_PHASE1 or HTGS_PHASE2
• Finished record
– Sequencing will be finished
– Phase 3
– Organismal division it belongs to: PRI, INV or PLN
– KEYWORDS: HTG

HTC in GenBank
• GenBank division for unfinished high-throughput cDNA sequencing (HTC).
• HTC sequences may have 5'UTR and 3'UTR at their ends, partial coding regions, and introns.
• A keyword of "HTC" will be present, in addition to division code "HTC". Those HTC sequences that undergo finishing (e.g. resequencing) will move to the appropriate taxonomic GenBank division and the "HTC" keyword will be removed.

WGS in GenBank
• Contigs from ongoing Whole Genome Shotgun sequencing projects
• The nucleotides from WGS projects go into the BLAST ‘wgs’ database, whereas the proteins go into the BLAST nr database.
• More info, and how to submit to this division: http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
• Accession format is 4+2+6: a four-letter project code, a two-digit assembly version and a six-digit contig number
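As a rough illustration of the 4+2+6 accession layout, the sketch below splits such an accession with a regular expression. It assumes exactly four letters, two digits and six digits as stated on the slide; the accession AAAA01000001 and the function name are made up for the example and do not refer to a real project.

```python
import re

# 4 letters (project) + 2 digits (version) + 6 digits (contig), per the slide.
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession):
    """Return (project, version, contig) or None if the layout does not match."""
    match = WGS_ACCESSION.match(accession)
    if match is None:
        return None
    project, version, contig = match.groups()
    return project, int(version), int(contig)


example = parse_wgs_accession("AAAA01000001")   # hypothetical accession
not_wgs = parse_wgs_accession("U40282")          # ordinary accession, no match
```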

CON in GenBank
• Points to the records that make up the contig; does not actually contain sequence
• ‘Invented’ by NCBI to deal with tracking of segmented sets and the 350 kb limit in DDBJ/EMBL/GenBank

CON in GenBank
LOCUS       AH007743    7832 bp    DNA    CON    26-MAY-1999
DEFINITION  Gallus gallus ornithine transcarbamylase (OTC) gene, complete cds.
ACCESSION   AH007743
VERSION     AH007743.1  GI:4927367
KEYWORDS    .
SOURCE      chicken.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Archosauria;
            Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
[..]
FEATURES             Location/Qualifiers
     source          1..7832
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /chromosome="1"
CONTIG      join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),
            AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),
            AF065634.1:1..707,gap(),AF065635.1:1..836,gap(),
            AF065636.1:1..1614,gap(),AF065637.1:1..605,gap(),
            AF065638.1:1..501)
//
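To show how a CON record "points to" other records, here is a small Python sketch that reads the CONTIG line above. The regular expression and function name are assumptions made for this example, not a full parser for GenBank location syntax.

```python
import re

# CONTIG line of AH007743, copied from the record above.
CONTIG = (
    "join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),"
    "AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),"
    "AF065634.1:1..707,gap(),AF065635.1:1..836,gap(),"
    "AF065636.1:1..1614,gap(),AF065637.1:1..605,gap(),"
    "AF065638.1:1..501)"
)

# accession.version:start..end
SEGMENT = re.compile(r"([A-Z]+\d+\.\d+):(\d+)\.\.(\d+)")


def parse_contig(text):
    """Return the referenced segments and the number of gap() entries."""
    segments = [(acc, int(start), int(end))
                for acc, start, end in SEGMENT.findall(text)]
    gaps = text.count("gap(")
    return segments, gaps


segments, gaps = parse_contig(CONTIG)
total_bases = sum(end - start + 1 for _, start, end in segments)
```

Here the nine referenced segments add up to the 7832 bp given on the LOCUS line, so the gap() entries contribute no length in this record.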

EST: Expressed Sequence Tags are shorter (300-1000 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/

Sequences NOT in GenBank
• SNPs
• SAGE tags
• RefSeq (genomic, mRNA, or protein)
• Consensus sequences

STS: Sequence Tagged Sites are operationally unique sequences that identify the combination of primer pairs used in a PCR assay that generates a mapping reagent which maps to a single position within the genome.
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/

LOCUS       BV102466    500 bp    DNA    linear    STS    29-JAN-2005
DEFINITION  47926ij From 19q13.4 public sequences in the databases from UCSC
            NT_011109 Homo sapiens STS genomic clone CTC-258N23, sequence
            tagged site.
ACCESSION   BV102466
VERSION     BV102466.1  GI:58330885
KEYWORDS    STS.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 500)
  AUTHORS   Slim,R., Fallahian,M., Riviere,J.-B. and Zali,M.R.
  TITLE     Evidence of genetic heterogeneity of familial hydatidiform moles
  JOURNAL   Placenta 26 (1), 5-9 (2005)
  MEDLINE   15664405

COMMENT     Contact: Rima Slim
            McGill University Health Center, Montreal General Hospital Research Institute
            room L12-132, 1650 Cedar Avenue, H3G 1A4, Montreal, Canada
            Tel: (514) 934-1934 ext 44519
            Fax: (514) 934 8265
            Email: rima. [email protected] mcgill. ca
            Primer A: CCGAGTGGGGTGGCACAT
            Primer B: GGTGGAGCAATTGGGAAGATACTA
            STS size: 260
            PCR Profile:
              Presoak: 0 degrees C for 0.00 minute(s)
              Denaturation: 94 degrees C for 4.00 minute(s)
              Denaturation: 94 degrees C for 0.45 minute(s)
              Annealing: 55 degrees C for 0.45 minute(s)
              Polymerization: 72 degrees C for 1.00 minute(s)
              PCR cycles: 35
              Thermal Cycler: Perkin Elmer GeneAmp 9700
            Protocol:
              Template: 200 ng
              Primer: each 1 uM
              dNTPs: each 200 uM
              Taq Polymerase: 0.07 unit/ul
              Total volume: 13 ul
            Buffer:
              MgCl2: 1.5 mM
              KCl: 50 mM
              Tris-HCl: 10 mM
              pH: 8.3

FEATURES             Location/Qualifiers
     source          1..500
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /map="19q13.4"
                     /clone="CTC-258N23"
                     /clone_lib="From 19q13.4 public sequences in the databases from UCSC NT_011109"
                     /note="From public sequences in the databases from UCSC NT_011109"
     STS             64..323
     primer_bind     64..81
     primer_bind     complement(300..323)
ORIGIN
        1 ataccagcct agactacaaa gtgagatccc atttctacaa aaataaaaat tagctgggct
       61 cagccgagtg gggtggcaca tgcctgtagt cccagctact caggaggctg aggcgggagg
      121 ctcgcttgag cctaggagtt gggggctgca atgagctatg attttgccac tgcactccag
      181 cctgggcaac agagtgaggc cctgtctcaa aaatacacac acacacgcac acacac
      241 acactt aaaaaacaaa aggttgaaaa tgaaacacat agtgaaataa aagttctgat
      301 agtatcttcc caattgctcc acctcacttt agagtcaggt gaggat ggtggcggag
      361 gctgcaacac agtggctgag acatcgctct cagcgtgtca ccgtgaggtc tcccagggag
      421 ggtgtggaga aaacagtgcc caggacagag cctgagaaac ctcaccggga agatggagca
      481 taacaaggaa agcattactc
//
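The relation between the two primers in the COMMENT block and the primer_bind features can be checked with a short Python sketch. The two sequence fragments are copied from the ORIGIN rows of this record (bases 61..81 and 300..323); the helper names are hypothetical.

```python
# Translation table for complementing DNA, keeping upper/lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def span_length(start, end):
    """Length of a 1-based, inclusive feature location start..end."""
    return end - start + 1


primer_a = "CCGAGTGGGGTGGCACAT"
primer_b = "GGTGGAGCAATTGGGAAGATACTA"

bases_61_to_81 = "cagccgagtggggtggcacat"        # row 61 and the first base of 81
bases_300_to_323 = "tagtatcttcccaattgctccacc"    # end of row 241, start of row 301

# Primer A binds the forward strand at 64..81 (skip bases 61..63).
forward_hit = bases_61_to_81[3:].upper() == primer_a
# Primer B binds the reverse strand, hence complement(300..323).
reverse_hit = reverse_complement(primer_b).lower() == bases_300_to_323
sts_size = span_length(64, 323)
```

Both primers match their annotated locations, and the STS feature 64..323 spans 260 bases, which agrees with "STS size: 260" in the COMMENT.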

HTC in GenBank
• GenBank division for unfinished high-throughput cDNA sequencing (HTC).
• HTC sequences may have 5'UTR and 3'UTR at their ends, partial coding regions, and introns.
• A keyword of "HTC" will be present, in addition to division code "HTC". Those HTC sequences that undergo finishing (e.g. resequencing) will move to the appropriate taxonomic GenBank division and the "HTC" keyword will be removed.

LOCUS       CR926482    2728 bp    RNA    linear    HTC    11-JAN-2005
DEFINITION  Pongo pygmaeus mRNA; cDNA DKFZp469F2123 (from clone DKFZp469F2123).
ACCESSION   CR926482
VERSION     CR926482.1  GI:56541783
KEYWORDS    HTC.
SOURCE      Pongo pygmaeus (orangutan)
  ORGANISM  Pongo pygmaeus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Pongo.
REFERENCE   1 (bases 1 to 2728)
  AUTHORS   Ansorge,W., Krieger,S., Regiert,T., Rittmueller,C., Schwager,B.,
            Mewes,H.W., Weil,B., Amid,C., Osanger,A., Fobo,G., Han,M. and
            Wiemann,S.
  CONSRTM   The German cDNA Consortium
  TITLE     Direct Submission
  JOURNAL   Submitted (08-DEC-2004) MIPS, Ingolstaedter Landstr. 1, D-85764
            Neuherberg, GERMANY
COMMENT     Clone from S. Wiemann, Molecular Genome Analysis, German Cancer
            Research Center (DKFZ); Email s. [email protected] de ; sequenced by EMBL
            (European Molecular Biology Laboratories, Heidelberg/Germany)
            within the cDNA sequencing consortium of the German Genome Project.
            This clone (DKFZp469F2123) is available at the RZPD Deutsches
            Ressourcenzentrum fuer Genomforschung GmbH in Berlin, Germany.
            Please contact RZPD for ordering:
            http://www.rzpd.de/cgi-bin/products/cl.cgi?CloneID=DKFZp469F2123
            Further information about the clone and the sequencing project is
            available at http://mips.gsf.de/projects/cdna/.

FEATURES             Location/Qualifiers
     source          1..2728
                     /organism="Pongo pygmaeus"
                     /mol_type="pre-RNA"
                     /db_xref="taxon:9600"
                     /clone="DKFZp469F2123"
                     /tissue_type="kidney"
                     /clone_lib="469 (synonym: pkid1). Vector pSport1_Sfi; host DH10B; sites SfilA + SfilB"
                     /dev_stage="adult"
                     /note="Rh type C glycoprotein (Homo sapiens), not fully spliced"
     gene            [..]
     CDS             join(40..165,194..1015,1217..1354,1663..1788,2038..2244)
                     /gene="DKFZp469F2123"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="CAI30274.1"
                     /db_xref="GI:56541784"
                     /translation="MAWNTNLRWRLPLTCLLLEVVMVILFGVFVRYDFDADAHWWSWR
                     TEFYYRYPSFQDVHVMVFVGFGFLMTFLQRYGFSAVGFNFLLAAFGIQWALLMQGWFH
                     FLQGRYIVVGVENLINADFCVASVCVAFGAVLGKVSPIQLLIMTFFQVTLFAVNEFIL
                     LNLLKVKDAGGSMTIHTFGAYFGLTVTRILYRRNLEQSKERQNSVYQSDLFAMIGTLF
                     LWMYWPSFNSAISYHGDSQHRAAINTYCSLAACVLTSVAISSALHKKGKLDMVHIQNA
                     TPAGGVAVGTAAEMMLMPYGALIVGFVCGIISTLGFVYLTPFLESRLHIQDTCGINNL
                     HGIPGIIGGIVGAVTAASASLEVYGKEGLVHSFDFQGFKRDWTARTQGKFQIYGLLVT
                     LAMALMGGIIVGVGLILRLPFWGQPSDENCFEDAVYWEMPEGNSTVYIPEDPTFKPSG
                     PSVPSVPMVSPLPMASSVPLVP"
ORIGIN 1 61 121 181 241 301 361 421 481 gaaccgcccg cgctggcggc gtgttcgtgc aacttgagcg cgtgatggtc cgccgtgggctggttc cgctgacttc ccccattcag ctgccc tgccgctcac gctacgactt acgtggagaa ttcgtgggct ttcaacttcc cacttcttac tgcgtggcct ctactcatca ggcccggcac ctgctc cgacgccgac ccgaattcta tcggcttcct tgttggcggc aaggccgcta ctgtctgcgt tgactttctt ccctgcagca ctggaggtgg gcccactggt ctatcgctac catgaccttcggcatcgtcgtg ggcttttggg ccaagtgacc tggcctggaa ttatggtgat ggtcacagac ccaagcttcc ctgcagcgct cagtgggcgc ggcgtggaga gcagttctgg ctcttcgccg caccaacctc tctctttggg gaagcacaag aggacgtgca acggcttcag tgctcatgca acctcatcaa gtaaagtcag tgaatgagtt
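The CDS location in this record is a join of five intervals, which fits the note "not fully spliced" and the earlier remark that HTC sequences may contain introns. A minimal Python sketch, with hypothetical function names and a deliberately simple regular expression, works out the exon and intron sizes implied by that location.

```python
import re

# CDS location copied from the CR926482 feature table above.
CDS_LOCATION = "join(40..165,194..1015,1217..1354,1663..1788,2038..2244)"


def parse_intervals(location):
    """Return the start..end pairs of a simple join() location as integers."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def exon_lengths(intervals):
    """Length of each joined interval (coordinates are 1-based, inclusive)."""
    return [end - start + 1 for start, end in intervals]


def intron_lengths(intervals):
    """Length of the stretch skipped between consecutive intervals."""
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(intervals, intervals[1:])]


intervals = parse_intervals(CDS_LOCATION)
exons = exon_lengths(intervals)
introns = intron_lengths(intervals)
coding_bases = sum(exons)
```

The joined intervals total 1419 bases, a multiple of three, as expected for a coding region read from /codon_start=1.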

Sequences NOT in GenBank
• WGS
• TPA
• SNPs
• SAGE tags
• RefSeq (genomic, mRNA, or protein)
• Consensus sequences

What is UniProt?
UniProt is a new protein sequence database that is the result of a merge of SWISS-PROT and PIR and is in great part funded by the NIH. It is the main distributed, annotated, and curated protein sequence database. Data in UniProt is primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data, but also from sequences in PIR and SP. UniProt is a flat-file database, just like EMBL and Swiss-Prot.
• http://www.pir.uniprot.org/
• Bairoch et al., The Universal Protein Resource (UniProt), Nucl. Acids Res. 2005 33: D154-D159

UniProt
UniProt incorporates:
• Function of the protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.

Swissprot Annotation
ID   AMPA_CHLTR     STANDARD;      PRT;   499 AA.
AC   O84049;
DT   15-FEB-2000 (Rel. 39, Created)
DT   15-FEB-2000 (Rel. 39, Last sequence update)
DT   15-FEB-2000 (Rel. 39, Last annotation update)
DE   PROBABLE CYTOSOL AMINOPEPTIDASE (EC 3.4.11.1) (LEUCINE AMINOPEPTIDASE)
DE   (LAP).
CC   -!- CATALYTIC ACTIVITY: AMINOACYL-PEPTIDE + H(2)O = AMINO ACID +
CC       PEPTIDE.
CC   -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY M17; ALSO KNOWN AS THE
CC       CYTOSOL AMINOPEPTIDASE FAMILY.
FT   METAL 263
FT   METAL 268
FT   ACT_SITE 275
MANGANESE OR ZINC (BY SIMILARITY). POTENTIAL.
Probable - Putative - Potential

Other Databases
Rebase - the restriction enzyme database
Prosite - protein functional sites (patterns and profiles)
PDB - protein structures
OMIM - Online Mendelian Inheritance in Man

Protein Domain Databases
• Prosite: short conserved patterns (+ profiles)
• Prints: fingerprints (aligned unweighted motifs)
• Blocks: aligned weighted motifs
• Pfam: domain HMMs
• Prodom: domain multiple alignments
• INTERPRO

Databases: Human Genome
NCBI: Human_assembled, LocusLink, Refseq, Unigene
EBI: EnsEMBL_cDNA, EnsEMBL_prot

RefSeq and LocusLink

Ensembl

Formatdb
• Formatdb will take fasta input files and index the sequences
• The index files are in blast format, allowing a user to do a blast search against the newly created indices
• To produce a blast database type: formatdb -i infile -p T -o T (use -p T if the input is protein, -p F if it is nucleotide)

Requirements
• The system must integrate various types of data seamlessly.
• The system must provide a uniform user interface.
• The system must be able to scale.
• Must be easy to maintain and update.
• Must grow with new data type needs.

What is SRS (1/2)
• Read-only data warehouse.
– Its raw working material is text (flatfile databases: EMBL, Swiss-Prot, MEDLINE, HTML, XML files, etc.)
• Uses context parsers which are word oriented: ‘glutathione transferase’ indexes as ‘glutathione’ + ‘transferase’
– it is not a ‘free text search’ engine!
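The word-oriented indexing described above can be pictured with a toy inverted index in Python. This is only an illustration of the idea, not SRS itself: the entry identifiers and descriptions are invented, and the tokenizer (lower-cased runs of letters and digits) is an assumption.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-cased words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(entries):
    """Map each word to the set of entry ids whose text contains it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in tokenize(text):
            index[word].add(entry_id)
    return index


def search_all_words(index, query):
    """Entry ids that contain every word of the query (whole words only)."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


# Hypothetical entries, for illustration only.
entries = {
    "E1": "glutathione transferase",
    "E2": "acetyl transferase",
    "E3": "glutathione reductase",
}
index = build_word_index(entries)
```

Because whole words are indexed, a query for "glutathione transferase" is answered by intersecting two word lists, while a fragment such as "transfer" finds nothing, which is the sense in which this is not a free-text search.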

What is SRS (2/2)
• Enables linking between databases, as these share primary and secondary identifiers: EMBL, PDB, InterPro, PROSITE, SWISS-PROT, PFAM, BLOCKS

SRS Linking
• Linking is possible because of the presence of cross-reference information:
– DR records in Swiss-Prot and EMBL
– RX records from Swiss-Prot and EMBL to MEDLINE and PubMed.

SRS Linking
[Diagram: link indexing. An entry with ID A1 in databank A carries a DR cross-reference to entry B3 in databank B; indexing these cross-references produces the A>B link index.]

Main advantage of SRS linking
Direct link from ‘A’ to ‘B’
Direct link from ‘B’ to ‘C’
Multi-step link from ‘A’ to ‘C’
An important thing to note is that links are bi-directional: if (‘A’ to ‘B’) exists then (‘B’ to ‘A’) does as well, and so does (‘C’ to ‘A’).
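The bi-directional and multi-step behaviour described above can be sketched as a small graph of databanks in Python. The class and method names are hypothetical and the breadth-first search is a stand-in chosen for this example; the slides do not say how SRS resolves multi-step links internally.

```python
from collections import defaultdict, deque


class LinkGraph:
    """Toy model of databanks joined by bi-directional links."""

    def __init__(self):
        self.neighbours = defaultdict(set)

    def add_link(self, a, b):
        # A link declared from a to b can also be followed from b to a.
        self.neighbours[a].add(b)
        self.neighbours[b].add(a)

    def path(self, start, goal):
        """Shortest chain of databanks from start to goal, or None."""
        previous = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                route = []
                while node is not None:
                    route.append(node)
                    node = previous[node]
                return route[::-1]
            for nxt in sorted(self.neighbours.get(node, ())):
                if nxt not in previous:
                    previous[nxt] = node
                    queue.append(nxt)
        return None


graph = LinkGraph()
graph.add_link("A", "B")   # direct link from 'A' to 'B'
graph.add_link("B", "C")   # direct link from 'B' to 'C'
```

With only the two direct links declared, the multi-step link from ‘A’ to ‘C’ and its reverse from ‘C’ to ‘A’ both come out of the same search.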

MEDLINE

Remote use of SRS
• SRS can be used as a network-based cross-reference tool:
– http://srs.ebi.ac.uk/cgi-bin/wgetz?+e+[embl-id:hscfos]
– http://www.ebi.ac.uk/cgi-bin/emblfetch?hscfos
• Typically, it is used to cross-reference results from similarity, homology and function prediction searches (e.g. fasta, blast, MPsrch, InterProScan, etc.)

New InterProScan

Remote linking to SRS
• Permits free and easy access to remote data. :-)
• Not necessary to maintain too many databanks locally. :-)
• It is dependent on network availability. :-(
• It is not maintenance intensive. :-)

Databanks under SRS at EBI
• Currently more than 200 libraries and 154 tools publicly visible.
– Databank total is 180, with many awaiting ‘publishing’.
• We are aiming at reducing the number of databases available at the EBI by creating virtual libraries:
– EMBLRel. + EMBLNEW = EMBL
– swissprot + (swissnewdelta) + sptrembl + tremblnew = UniProt

Constraints
• A databank must have informative value to enhance the system.
• Fasta formatted databases are implicitly avoided.
• Linking (hard and relative links) is encouraged.

Database Search Tools

GenomeNet
www.genome.ad.jp
Keywords: metabolic pathways / proteomics / metabolomics
KEGG
• Metabolic pathways
• Regulatory pathways
• Disease Catalogs, Cell Catalogs
• Molecule Catalogs: compounds and enzymes
• Gene Catalogs
• Genome Maps
• Gene Expression Profiles
• Computational Tools
• Links to other pathway and compound sites

What to take home
• Databases are a collection of data
– Need to access and maintain them easily and flexibly
• Biological information is vast and sometimes very redundant
• Distributed databases bring it all together with quality controls, cross-referencing and standardization
• Computers can only create data; they do not give answers
• Review suggestion: “Integrating biological databases”, Stein, Nature Reviews Genetics 2003