
1cedf412fc2c07a6c55e0576d4083c45.ppt
- Number of slides: 51
The UniProtKB/Swiss-Prot protein knowledgebase: trends and challenges. Amos Bairoch; University of Geneva and Swiss Institute of Bioinformatics (SIB), Swiss-Prot group. Vancouver, March 8, 2007
The Swiss-Prot staff at SIB and EBI
• Group leaders: Amos Bairoch, Rolf Apweiler, Lydie Bougueleret
• Annotators/curators: Yasmin Alam-Faruque, Philippe Aldebert, Severine Altairac, Nicola Althorpe, Ghislaine Argoud Puy, Andrea Auchincloss, Kristian Axelsen, Kirsty Bates, Marie-Claude Blatter, Emmanuel Boutet, Silvia Braconi Quintaje, Lionel Breuza, Alan Bridge, Paul Browne, Evelyn Camon, Wei mun Chan, Luciane Ciapina, Guy Cochrane, Danielle Coral, Elisabeth Coudert, Isabelle Cusin, Tania de Oliveira Lima, Kirill Degtyarenko, Paula Duek, Ruth Eberhardt, Anne Estreicher, Livia Famiglietti, Nathalie Farriol-Mathis, Nadeem Faruque, Serenella Ferro, Marc Feuermann, Rebecca Foulger, Gill Fraser, Gabriella Frigerio, John Garavelli, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Julius Jacobsen, Janet James, Silvia Jimenez, Florence Jungo, Vivien Junker, Guillaume Keller, Kati Laiho, Lydie Lane, Petra Langendijk-Genevaux, Duncan Legge, Philippe Lemercier, Virginie Lesaux, Damien Lieberherr, Michele Magrane, Karine Michoud, Madelaine Moinat, Anne Morgat, Nicola Mulder, Marisa Nicolas, Claire O'Donovan, Sandra Orchard, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Manuela Prüss, Sorogini Reynaud, Catherine Rivoire, Bernd Röchert, Michel Schneider, Christian Sigrist, André Stutz, Shyamala Sundaram, Michael Tognoli, Claudia Vitorello, Eleanor Whitfield, Luiz Fernando Zuleta
• Programmers and system administrators: Delphine Baratin, Daniel Barrell, Laurent Bollondi, Lawrence Bower, Matias Castro, Michael Darsow, Edouard de Castro, Paula de Matos, Mike Donnelly, Séverine Duvaud, Alexander Fedetov, Wolfgang Fleischmann, Elisabeth Gasteiger, Alain Gateau, Sebastien Gehant, Andre Hackmann, Henning Hermjakob, Alessandro Innocenti, Eric Jain, Phil Jones, Alexander Kanapin, Paul Kersey, Ernst Kretschmann, Corinne Lachaize, Vincente Lara, Vincent Le Texier, Maria-Jesus Martin, Xavier Martin, John O’Rourke, Salvo Paesano, Sam Patient, Isabelle Phan, Astrid Rakow, Nicole Redaschi, Emilio Salazar, Nataliya Skylar, Karin Sonesson, Peter Sterk, Daniela Wieser, Dan Wu, WeiMin Zhu
• Research staff: Valeria Amendolia, Brigitte Boeckmann, Lorenz Cerutti, Fabrice David, David Perret, Violaine Pillet, Anne-Lise Veuthey, Lina Yip
• Clerical and secretarial assistance: Dolnide Dornevil, Claudia Sapsezian, Margaret Shore-Nye, Kerry Smith, Laure Verbregue
An avalanche of data • In 1954: publication of the first sequence of a protein, bovine insulin, by Frederick Sanger; • More than 50% of the biomolecular data available today was produced in the last two years; • In 1986: 4’000 proteins in Swiss-Prot; today: 4’000 new proteins will enter Swiss-Prot+TrEMBL.
The implications… • The Life Sciences have undergone a dramatic revolution in the last 20 years: – They used to be rich in hypotheses, well-off in knowledge and poor in data; – They are now very rich in data, not so well-off in knowledge and very poor in hypotheses. • How do we go from a list of parts to a complex system? • BTW: I was told to give a “nuts and bolts kind of talk”, so here they are!
UniProtKB/Swiss-Prot • Created in July 1986; since 1987, a collaboration of the SIB and the EMBL/EBI; from 2003 onward it is the central part of the UniProt project; • Annotated, non-redundant, cross-referenced, documented protein sequence knowledge resource; • 260’000 sequences; 130’000 literature references; 3’600’000 cross-references to 100 databases; ~700 Mb of annotations; • About 3’900’000 sequences in TrEMBL, its computer-annotated supplement; • Biweekly releases; available from about 50 servers, the main source being ExPASy (www.expasy.org).
UniProt in one slide… • Universal Protein Resource; • Collaboration between 3 groups: the Swiss-Prot groups at SIB and EBI and the PIR group; • www.uniprot.org (a new and hopefully much better version will be online in the summer); • The UniProt Knowledgebase (UniProtKB) is the core component and comprises Swiss-Prot+TrEMBL; • The UniProt Non-redundant Reference (UniRef) clusters combine closely related sequences into a single record to speed searches. Three versions exist: UniRef50, UniRef90 and UniRef100; • The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
The universe in which Swiss-Prot evolves • 1953: 1st sequence (bovine insulin) • 1986: 4’000 sequences • 2007: 5 million sequences • Where will it stop? 179'000'025'042 (179 billion)
179'000'025'042
• 1st estimate: ~30 million species (1.5 million named)
• 2nd estimate:
– 20 million bacteria/archaea x 4'000 genes
– 5 million protists x 6'000 genes
– 3 million insects x 14'000 genes
– 1 million fungi x 6'000 genes
– 0.6 million plants x 20'000 genes
– 0.2 million molluscs, worms, arachnids, etc. x 20'000 genes
– 0.2 million vertebrates x 25'000 genes
• The calculation: $2 \times 10^{7} \times 4000 + 5 \times 10^{6} \times 6000 + 3 \times 10^{6} \times 14000 + 10^{6} \times 6000 + 6 \times 10^{5} \times 20000 + 2 \times 10^{5} \times 20000 + 2 \times 10^{5} \times 25000 + 25000$ (Craig Venter) $+\ 42$ (Douglas Adams) $= 179\,000\,025\,042$
• Caveat: this is an estimate of the number of potential sequence entries, not of the number of distinct protein entities in the biosphere.
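The estimate can be checked with a few lines of arithmetic. The sketch below uses only the species counts and genes-per-genome figures listed on the slide; the variable and function names are illustrative choices, not part of the original talk.

```python
# Species counts and genes per genome, exactly as listed in the 2nd estimate.
GROUPS = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}

# The two tongue-in-cheek additions at the end of the calculation.
EXTRAS = {"Craig Venter": 25_000, "Douglas Adams": 42}


def estimate_total(groups, extras):
    """Return (entries per group, grand total of potential sequence entries)."""
    per_group = {name: species * genes for name, (species, genes) in groups.items()}
    return per_group, sum(per_group.values()) + sum(extras.values())


per_group, total = estimate_total(GROUPS, EXTRAS)
total_species = sum(species for species, _ in GROUPS.values())

for name, entries in per_group.items():
    print(f"{name:35s} {entries:>18,}")
print(f"{'total':35s} {total:>18,}")
```

The species counts in the 2nd estimate add up to the ~30 million species of the 1st estimate, and the seven group products plus the two extras give the 179 billion figure.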
When will UniProtKB be complete? • Swiss-Prot: – In July 2009: 500’000 entries; – In 2013: 1 million entries; – In 2026 (40th anniversary): 10 million entries; – In 2036 (50th anniversary): 100 million entries. • TrEMBL: – In May 2080 TrEMBL will have reached 10 billion entries; – Somewhere in the 22nd century, we could reach 179 billion entries; – But we are confident these dates are worthless, as new sequencing techniques (example: Solexa) will have made all of these projections a very futile exercise!
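To see what growth these Swiss-Prot milestones imply, one can compute the compound annual growth rate between consecutive dates. This is only a rough sketch: it assumes smooth exponential growth between milestones and rounds every date to a whole year (so "July 2009" is treated as 2009). The function names are illustrative.

```python
import math

# (year, projected number of Swiss-Prot entries), taken from the slide.
MILESTONES = [
    (2009, 500_000),
    (2013, 1_000_000),
    (2026, 10_000_000),
    (2036, 100_000_000),
]


def annual_growth(year0, count0, year1, count1):
    """Compound annual growth rate implied by two (year, count) points."""
    years = year1 - year0
    return (count1 / count0) ** (1 / years) - 1


def doubling_time(rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


for (y0, n0), (y1, n1) in zip(MILESTONES, MILESTONES[1:]):
    rate = annual_growth(y0, n0, y1, n1)
    print(f"{y0}-{y1}: {rate:.1%} per year, "
          f"doubling every {doubling_time(rate):.1f} years")
```

Under these assumptions the projection is not a single exponential: the implied yearly growth rate is higher for 2026 to 2036 than for the earlier intervals.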
EMBL → TrEMBL: automated extraction of protein sequence (CDS), gene name and references + automated annotation. TrEMBL → Swiss-Prot: manual annotation of the sequence and associated biological information.
In a Swiss-Prot entry, you can expect to find: • All the names of a given protein (and of its gene); • Its biological origin with links to the taxonomic databases; • A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, etc.; • Selected keywords and ontological descriptions; • A description of important sequence features: domains, PTMs, variations, etc.; • A selection of references; • Numerous cross-references; • An (often corrected) protein sequence and the description of various isoforms/variants.
Swiss-Prot entry creation flowchart
Tools to help with the manual annotation process • Sequence analysis: – A field which has matured quite a lot; – There is now a variety of software tools that, when smartly applied, can help with the prediction of domain structure and family relationships, and of sequence features, including some PTMs and active sites; – We have developed Anabelle, an integrative platform, to help us make the best use of existing tools and algorithms.
Anabelle: integration and workflow [Diagram; recoverable labels: Swiss-Prot editor, annotation, sequence analysis programs, Anabelle modules 1) PSAT, 2) SelMo (selected data), 3) SAM, macro, HAMAP, ProRules]
Anabelle selection module [Screenshot of the viewer layout; recoverable labels: link to the entry's NiceProt view, Blast of the (full) entry, link to InterPro, link to the most similar entry's NiceProt view, alignment of the most similar entry with the entry, Blast of an uncharted region, link to a domain's original database, more links]
And here is what the annotator gets back
The goals of the Swiss-Prot Human Proteome Initiative • Annotation of all known human proteins; • Annotation of mammalian orthologs of human proteins; • Annotation of all known human polymorphisms at the protein sequence level; • Annotation of all known post-translational modifications in human proteins; • Tight links to structural information.
How many human proteins? The total number is dependent on: – The number of protein coding genes; – The extent of RNA processing events (alternative splicing, RNA editing, etc.); – The extent of translational variation (alternative initiation, ribosomal frameshifting, etc.); – The extent of post-translational modifications (PTMs).
From genome to proteome (in order of increasing protein complexity): ~21’000 human genes → alternative splicing of mRNA (2-5 fold increase) → ~80 to 100’000 human transcripts → post-translational modifications of proteins (PTMs) (5-10 fold increase) → ~1'000'000 human proteins
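The fold increases on this slide can be turned into rough ranges. The sketch below simply multiplies the slide's figures; it treats each fold increase as a uniform multiplier, which is a simplifying assumption, and the names are illustrative.

```python
GENES = 21_000                    # ~21'000 human genes (from the slide)
SPLICING_FOLD = (2, 5)            # alternative splicing: 2-5 fold increase
TRANSCRIPTS = (80_000, 100_000)   # ~80 to 100'000 human transcripts (from the slide)
PTM_FOLD = (5, 10)                # PTMs: 5-10 fold increase


def scale(value_range, fold_range):
    """Multiply a (low, high) range by a (low, high) fold increase."""
    return (value_range[0] * fold_range[0], value_range[1] * fold_range[1])


# Range of transcripts implied by the gene count and the splicing fold increase.
transcripts_from_genes = scale((GENES, GENES), SPLICING_FOLD)

# Range of protein forms implied by the quoted transcript count and PTM fold increase.
proteins_from_transcripts = scale(TRANSCRIPTS, PTM_FOLD)

print("transcripts implied by genes x splicing:", transcripts_from_genes)
print("protein forms implied by transcripts x PTMs:", proteins_from_transcripts)
```

The quoted 80 to 100’000 transcripts sit inside the 42’000 to 105’000 range implied by the gene count, and applying the PTM fold increase to them gives between 400’000 and 1’000’000 protein forms.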
Protein polymorphism • Called ‘c-SNPs’ (coding single nucleotide polymorphisms) or ‘SAPs’ (single amino-acid polymorphisms); • Swiss-Prot already holds information on many protein variants (almost 31’000); • About 50% of them are linked to genetic disorders; • Mutations that cause major changes to a protein sequence (such as frameshift mutations) are not considered to be relevant to Swiss-Prot, as their deleterious effect on a given protein’s function is usually obvious.
Visualisation of 3 D models
Disease-related annotation • According to OMIM, there are currently 2’200 genes known to be associated with one or more genetic disorders; • Swiss-Prot contains disease-related information for about 2’100 proteins; • 1’443 Swiss-Prot entries contain information on sequence variants associated with a disease state.
Protein identification and database accuracy are interdependent [Cycle diagram; recoverable labels: Proteomics experiment; Database search; Protein identifications; Publication; Direct submission (identifications only); Submission to specialized proteomic databases (spectra+identifications); Data filtering, curation; Sequence correction, annotation enrichment (PTM, splicing forms)]
Proteomic studies have already allowed us to update ~3’000 human entries, mainly with PTM information: Phosphorylation (83%), Glycosylation (9%), Subcellular location (4%), Other PTMs (4%)
Microbial genomes and proteomes
So what does HAMAP mean? High-quality Automated and Manual Annotation of microbial Proteomes. Lots of microbial genomes, lots of proteins. What should we do with them in UniProt?
Automatic annotation of proteins belonging to specified families (1) • Allows proteins that belong to well-defined protein families to be annotated automatically, yet with a very high level of quality; • Can be applied to both characterized families and to some UPFs (Uncharacterized Protein Families); • This project requires the continuous development or adaptation of software tools as well as the development of a database of annotation rules for each type of specified microbial protein (so far about 1’400).
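The idea of a per-family annotation rule can be illustrated with a toy example. The slides do not describe HAMAP's rule format or its checks, so everything below is a hypothetical sketch: the class names, the two conditions (taxonomic scope and sequence length), and the example family and accessions are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FamilyRule:
    """Hypothetical annotation rule for one protein family."""
    family_id: str
    allowed_kingdoms: Tuple[str, ...]         # taxonomic scope of the rule
    length_range: Tuple[int, int]             # expected sequence length (min, max)
    annotations: Tuple[Tuple[str, str], ...]  # (field, value) pairs to propagate


@dataclass
class ProteinEntry:
    """Hypothetical, heavily simplified protein entry."""
    accession: str
    family_id: Optional[str]                  # None if no family was assigned
    kingdom: str
    length: int
    annotations: Dict[str, str] = field(default_factory=dict)


def apply_rules(entry, rules):
    """Annotate the entry if a rule exists for its family and its conditions hold.

    Returns True when annotation was propagated; otherwise the entry is left
    untouched, to be handled by a curator.
    """
    rule = rules.get(entry.family_id)
    if rule is None:
        return False
    low, high = rule.length_range
    if entry.kingdom not in rule.allowed_kingdoms or not (low <= entry.length <= high):
        return False
    entry.annotations.update(dict(rule.annotations))
    return True


# Invented example data (not real families or accessions).
RULES = {
    "TOY_FAM_1": FamilyRule(
        family_id="TOY_FAM_1",
        allowed_kingdoms=("Bacteria",),
        length_range=(200, 260),
        annotations=(("protein_name", "Toy kinase"), ("keyword", "Kinase")),
    )
}

typical = ProteinEntry("TOY00001", "TOY_FAM_1", "Bacteria", 231)
atypical = ProteinEntry("TOY00002", "TOY_FAM_1", "Bacteria", 90)
unassigned = ProteinEntry("TOY00003", None, "Archaea", 300)

results = [apply_rules(e, RULES) for e in (typical, atypical, unassigned)]
print(results)
```

In this sketch only the typical family member is annotated automatically; the atypical and unassigned entries are skipped rather than annotated with lower confidence, which is one way to keep automatic annotation at a high quality level.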
Using HAMAP, we can currently annotate between 10% and 50% of a complete microbial proteome to Swiss-Prot quality level
From pull to push… • For more than 20 years now we have been «pulling» information and knowledge from various sources, but mainly from the literature; • It is now time to make sure that the next 20 years will be defined by researchers «pushing» their results, and the interpretation of their results, into the knowledgebase.
• An attempt to get the community to directly submit information on the proteins that they are studying; • Using a Wikipedia-type model/interface; • Will first be «field-tested» in the yeast community; • We are hopeful, yet we are realistic: only a small percentage of life science researchers will take the time, and are altruistic enough, to fully participate in such a scheme.
Grey grey matter counts! • Many life scientists who have knowledge of the molecular world and are computer-proficient are reaching retirement age; • Some want to continue to play a role in the advancement of research, yet they will not be able to do lab work anymore; • We should offer them the tools they need to contribute to the annotation process.
But what about the rest of the life scientists? • We saw how we could get parents (adopt a protein) and grandparents (grey matter counts) involved, but what about the children… • …the young researchers, those who are active in producing new knowledge?
Two carrots, a stick and lots of education! • The carrots: – Making sure that granting agencies look favorably on the involvement of researchers in the process of submitting information to databases; – The same criteria should be considered by any hiring or promotion committee; • The stick: getting journal editors to refuse to publish a paper if the results have not been submitted to the relevant knowledge resources.
Education! • Everyone should feel concerned; • Awareness of the content and usage of knowledge resources is a prerequisite for doing any type of «serious» research in the field of molecular life sciences; • Organizations such as EMBNet, EBI, SIB, NCBI, NIG, HUPO should continue and strengthen their «outreach» efforts; • We (database providers) should do more in terms of providing tutorials (on-line and on-site).
Aiala, Alain x 4, Alastair, Alex x 2, Alexander x 2, Alexandre x 2, Alice, Alistair, Allyson, Alvis, Amanda, Ana Tereza, Anastasia, Andre x 3, Andreas, Andrew, Angela, Anne x 4, Anne-Lise, Anthony, Antoine, Anulka, Arnaud x 2, Arthur, Astrid, Athel, Barbara x 2, Barend, Baris, Barry, Bart, Bastien, Bengt, Bernard x 2, Bernd, Bernhard x 2, Bill, Bob, Brigitte, Bruno x 2, Burkhard, Carl, Carola, Carolyn, Catherine x 4, Cathy x 2, Cecile x 2, Cecilia, Cedric, Cesare, Chantal x 3, Charles x 2, Chrissie, Christian x 3, Christiane, Christine x 2, Christophe, Christopher, Christos, Claire, Claudia x 2, Claudine, Colin, Colombe, Corinne, Cristiano, Damien, Dana, Daniel x 3, Daniela, Danielle, Darcy, Darren, Dave x 2, David x 5, Delphine, Denis x 2, Dennis, Des, Dietmar, Dolnide, Dominique, Doron, Dorothy, Doug, Duncan, Eddie, Edgar, Edouard, Eleanor, Elisabeth x 2, Elmar, Elvis, Emily, Emmanuel, Eric x 3, Erik x 2, Ernest, Ernst, Esther, Eugene x 2, Eva, Evelyn, Evgenia, Evgeny, Ewan, Fabrice, Fiona, Flavio, Florence x 3, Fotis, Francis, Frank, François x 3, Frederic, Frederique x 2, Gabriella, Ganesh, Gaston, Geoff, Gerry, Gert, Ghislaine, Gilbert, Gill, Goran, Gottfried, Graham x 2, Gregoire, Guido, Guillaume, Gunnar, Guy x 2, Guy-Olivier, Hanah, Heidi, Henning, Hien, Hilde, Holger, Hongzhan, Howard, Hsing-Kuo, Ian, Iirit, Ilkka, Ioannis, Irving, Isabelle x 2, Ivan x 2, Ivo, Jack, Jacques x 2, Jaime, Janet x 2, Jean-Charles, Jean-François, Jean-Jacques,
Jean-Michel, Jean-Pierre x 2, Jeffrey, Jenny, Jerome, Jim, Jingchu, Joachim, Joanna, Joel, John x 7, Jonas, Jonathan x 2, Jorja, Jos, Juan, Juergen, Julia, Julio, Julius, Kai, Karine, Kati, Katja, Katsumi, Kay, Keiichi, Keith x 3, Ken x 2, Kenta, Khaled, Kirill, Kirsty, Kristian, Larry, Laurent x 3, Lee, Leigh, Leon, Lina, Lionel, Lisa x 2, Livia, Lorenzo, Louise, Luca, Luciane, Lucien, Luisa x 2, Luiz, Lydie x 2, Ma'ayan, Madelaine, Maggie, Mahesh, Manolo, Manuel x 2, Manuela, Marc x 6, Marcia, Marco, Margaret x 2, Mari Trini, Maria Esperanza, Maria-Jesus, Marie-Claude, Marilyn, Marisa, Mark x 2, Martine, Marvin, Mary, Massimo, Mathias, Matteo, Matthew, Mauricio, Michael x 7, Michel x 3, Michele, Michelle, Miguel, Mike x 2, Minna, Minoru, Monica, Monika, Morido, Nabil, Nadeem, Nadine x 2, Naruya, Nasri, Natalia, Nathalie, Neil x 2, Nicky, Nicolas x 3, Nicoletta, Nicolle, Nikos, Nina, Oliver, Olivier x 4, Orna, Owen, Paolo, Pascal, Patricia x 6, Patrick x 5, Paula, Pavel, Pedro, Peer, Peter x 7, Petra, Phil x 2, Philippe x 3, Pierre-Alain, Pieter, Piotr, Rachael, Raffaella, Rainer, Raja, Rasko, Raton laveur, Rebecca x 2, Reinhard x 2, Remi, Reto, Reynaldo, Richard, Robert x 2, Roberto, Robin, Rodger, Rodrigo, Roland, Rolf, Ron, Rosita, Ross, Roy, Russ x 2, Ruth x 3, Saeid, Salvo, Samia, Samuel x 2, Sandor, Sandra x 2, Sandrine, Sarah, Scott, Sebastien x 2, Serenella, Sergio, Severine x 2, Shigehaki, Shmuel, Shoko, Shoshana, Shyamala, Silvia x 2, Sineaid, Siv, Sona, Soren, Sorogini, Steffen x 2, Steffi, Stephanie x 2, Steven, Stuart x 2, Stylianos, Sunil, Sylvain, Sylvie x 2, Takashi, Tamara, Tammera, Tania x 2, Temple, Terri, Terry, Thomas x 3, Thure, Tim x 2, Timothy, Toby, Tom, Toni, Torsten, Ujwal, Ulrich, Ursula, Valeria, Vassilios, Veronique, Vicente, Victor x 2, Vincent, Vinnei, Violaine, Virginie x 2, Vitaliano, Vitek, Vivien x 2, Vivienne, Wanessa, Wei mun, Weimin, Williams, Willy, Winona, Winston, Witek, Wolfgang, Xavier x 2, Yasmin, Yasuhiro, 
Yongxing, Yoshio, Youla, Young-Ki, Zeev, Zhang-Zhi.