Functional-Annotation-Ben.pptx
- Number of slides: 35
Functional Annotation. Ben Rosen, J. Craig Venter Institute. Plant Informatics Workshop, July 17, 2013
Sequencing Gets Cheaper and Faster
Cost of one human genome:
• HGP: $3 billion
• 2004: $30,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012: $1,000
• 2013: $300
Time to sequence one genome: from years/months to hours/days
Massive parallelization.
Credits: Haibao Tang, JCVI
Functional Annotation ?
Outline • Basic Searches to Run • Advanced Assignments • Protein Families • Naming Genes
1. Basic Searches to Run
Basic Searches to Run
• BLAST (nucleotide or protein homology)
  ◦ Non-redundant protein sequences (nr)
  ◦ UniRef (UniProt: Swiss-Prot, TrEMBL)
  ◦ Trusted genomes (TAIR)
• CDD (NCBI's Conserved Domain Database)
• InterPro (protein families, domains and functional sites)
• HMMER or SAM (searches using statistical descriptions)
  ◦ Pfam (database of protein families and HMMs)
  ◦ TIGRFAMs (protein family based HMMs)
• SCOP (structural domains)
• TMHMM (transmembrane domains)
• SignalP (signal peptide cleavage sites)
• TargetP (subcellular location)
• Many others
Erythropoietin
Myostatin
Arabidopsis
Web BLAST
• NCBI BLAST: http://www.ncbi.nlm.nih.gov/blast/
• WU BLAST: http://genome.wustl.edu/tools/blast/
• UniProt Swiss-Prot BLAST: http://www.uniprot.org/
• JCVI Medicago BLAST: http://blast.jcvi.org/er-blast/index.cgi?project=mtbe
• Phytozome: http://www.phytozome.net/search.php
• The Gene Indices: http://compbio.dfci.harvard.edu/tgi/
• Sanger projects: http://www.sanger.ac.uk/DataSearch/
• TAIR: http://www.arabidopsis.org/Blast/index.jsp
CDD
• Collection of multiple sequence alignments
• Contains protein domain models imported from outside sources, such as Pfam, SMART, COGs (Clusters of Orthologous Groups of proteins) and PRK (PRotein Klusters), along with domain models curated at NCBI.
InterPro
• Database of protein families, domains and functional sites, in which identifiable features found in known proteins can be applied to unknown protein sequences.
Hidden Markov Model
• Databases of HMM domains to search:
  ◦ Pfam: http://www.sanger.ac.uk/Software/Pfam/
  ◦ TIGRFAMs: http://www.jcvi.org/cms/research/projects/tigrfams/overview/
  ◦ SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
  ◦ TMHMM: http://www.cbs.dtu.dk/services/TMHMM/
• Tools to use:
  ◦ HMMER, HMMPFAM: http://hmmer.janelia.org/
  ◦ SAM (Sequence Alignment and Modeling System): http://compbio.soe.ucsc.edu/sam.html
Pfam
• For each family in Pfam you can:
  ◦ Look at multiple alignments
  ◦ View protein domain architectures
  ◦ Examine species distribution
  ◦ Follow links to other databases
  ◦ View known protein structures
TMHMM
• Predicts transmembrane helices in integral membrane proteins using HMMs
SignalP
• Predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms.
• Based on a combination of artificial neural networks and HMMs.
TargetP
• TargetP predicts the subcellular location of eukaryotic proteins.
• The location assignment is based on the predicted presence of any of the N-terminal presequences:
  ◦ chloroplast transit peptide (cTP)
  ◦ mitochondrial targeting peptide (mTP)
  ◦ secretory pathway signal peptide (SP)
2. Advanced Assignments
Advanced Assignments
• Enzyme Commission (EC) Number: http://www.chem.qmul.ac.uk/iubmb/enzyme/
• Gene Ontology (GO) Terms
• Pathways
  ◦ KEGG
  ◦ MetaCyc
  ◦ Pathway Tools
Assigning EC Number
• The EC classification scheme is a hierarchical numerical classification based on the chemical reactions that enzymes catalyze.
• Every enzyme code consists of four numbers separated by periods. Example: EC 1.1.1.1, alcohol dehydrogenase.
• EC numbers may be assigned computationally.
• There are many available tools and methods for predicting EC numbers and pathways.
• Common problems:
  ◦ The computational method may not be specific enough to assign a full EC number. It may reliably place a gene in an enzyme family without identifying the specific enzyme. In that case the fourth number is often left blank (e.g. EC 1.1.1.-).
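The four-level structure of an EC number, and the practice of leaving the last level blank, can be illustrated with a short Python sketch. The function names (`parse_ec`, `ec_depth`, `is_complete`) are hypothetical and not part of any annotation tool mentioned on the slides; preliminary EC numbers with non-numeric levels are not handled.

```python
from typing import List, Optional


def parse_ec(ec: str) -> List[Optional[int]]:
    """Split an EC number into four levels; '-' or a missing level becomes None."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"not an EC number: {ec!r}")
    levels: List[Optional[int]] = []
    for part in parts:
        part = part.strip()
        levels.append(None if part in ("-", "") else int(part))
    # Pad so that every result has exactly four levels.
    levels += [None] * (4 - len(levels))
    # Once a level is unassigned, all deeper levels must be unassigned too.
    seen_blank = False
    for level in levels:
        if level is None:
            seen_blank = True
        elif seen_blank:
            raise ValueError(f"assigned level after a blank level: {ec!r}")
    return levels


def ec_depth(ec: str) -> int:
    """Number of assigned levels (4 means a specific enzyme)."""
    return sum(level is not None for level in parse_ec(ec))


def is_complete(ec: str) -> bool:
    """True only when all four levels are assigned."""
    return ec_depth(ec) == 4
```

With this sketch, `is_complete("EC 1.1.1.1")` is true, while `ec_depth("1.1.1.-")` reports three assigned levels, which corresponds to a family-level assignment.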
GO Terms
• Gene Ontology (Gene Ontology Consortium™) is a method used to structure biological knowledge using a dynamic controlled vocabulary across organisms.
  ◦ Molecular function
    – What the gene product does
    – Think 'activity'
    – Ion channel activity
  ◦ Biological process
    – A biological objective
    – Ion transport, transmembrane transport
  ◦ Cellular component
    – Location in the cell (or smaller unit)
    – Or part of a complex
    – Membrane, plasma membrane
View Pathways
• Graphical interface for users to visualize the substrates, final products and steps in a completed pathway catalyzed by an enzyme (gene).
  ◦ KEGG: http://www.genome.jp/kegg/tool/search_pathway.html
  ◦ MetaCyc: http://metacyc.org
  ◦ Pathway Tools: http://bioinformatics.ai.sri.com/ptools
Pathway Tools
3. Protein Families
Why Compute Protein Families? • To group proteins by probable function. • To identify possible gene structure problems. • To identify evolutionary relationships between protein families. • Gene naming and Transposable Element assignment
Domain Based Protein Families (JCVI Paralogous families)
Workflow: protein sequences; identify Pfam and all-vs-all BLASTP based domains; families grouped based on type and number of domains
Domain Based Protein Families (JCVI Paralogous families)
Workflow: protein sequences; identify Pfam and all-vs-all BLASTP based domains
Example family para_246: 9 family members contain:
• PF00027 - Cyclic nucleotide-binding domain
• PF00520 - Ion transport protein
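Grouping by type and number of domains can be sketched in a few lines of Python. This is a minimal illustration, not the JCVI pipeline: it assumes the domain hits per protein are already known, and the function names and protein identifiers are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Signature = Tuple[Tuple[str, int], ...]


def domain_signature(domains: List[str]) -> Signature:
    """Order-independent summary of which domains occur and how many times."""
    return tuple(sorted(Counter(domains).items()))


def group_by_domains(protein_domains: Dict[str, List[str]]) -> Dict[Signature, List[str]]:
    """Place proteins with the same domain types and counts in one family."""
    families: Dict[Signature, List[str]] = defaultdict(list)
    for protein, domains in protein_domains.items():
        if domains:  # proteins without any domain hit are left ungrouped
            families[domain_signature(domains)].append(protein)
    return dict(families)


# Hypothetical proteins; the Pfam accessions are the two named on the slide.
example = {
    "geneA": ["PF00520", "PF00027"],
    "geneB": ["PF00027", "PF00520"],
    "geneC": ["PF00520"],
}
families = group_by_domains(example)
```

Here `geneA` and `geneB` share one family because both carry one copy each of PF00027 and PF00520, while `geneC` falls into a separate family.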
TribeMCL Protein Clustering
• Markov clustering method for grouping proteins into families
  ◦ http://doc.bioperl.org/bioperl-run/lib/Bio/Tools/Run/TribeMCL.html
  ◦ Nucleic Acids Res. 2002 April 1; 30(7): 1575–1584.
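The slide only names the method, so the following is a rough sketch of the general Markov clustering loop (alternating expansion and inflation on a column-stochastic matrix), not TribeMCL itself. It assumes a small, dense, symmetric similarity matrix, for example one derived from all-vs-all BLASTP scores; the function names, the self-loop weight of 1.0 and the default parameters are illustrative choices.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m: Matrix) -> Matrix:
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m: Matrix, power: float) -> Matrix:
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(similarity: Matrix, inflation: float = 2.0,
                   max_iter: int = 100, tol: float = 1e-9) -> List[List[int]]:
    """Cluster node indices of a similarity matrix with a basic MCL loop."""
    n = len(similarity)
    if n == 0:
        return []
    m = [[float(x) for x in row] for row in similarity]
    for i in range(n):
        if m[i][i] == 0.0:
            m[i][i] = 1.0  # simplification: add a self-loop of fixed weight
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Each row that retains non-zero entries defines a cluster of its columns.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

Raising the inflation value tends to produce smaller, tighter clusters. Real protein similarity graphs are large and sparse, so a production implementation would use sparse matrices rather than nested lists.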
4. Naming Genes
Methods to Name Gene Products
1. Top BLAST hit to database of choice
2. Manually aggregate evidence from multiple sources
3. Computationally aggregate evidence
   1. Automated Assignment of Human Readable Descriptions (AHRD): https://github.com/groupschoof/AHRD
   2. Weighted Keyword (JCVI)
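The first method, naming by top BLAST hit, can be sketched as follows. The example assumes BLAST tabular output with the standard twelve columns (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the function name and the identifiers in the sample lines are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def top_hits(blast_lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Return, for each query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column tabular record
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical records for one query with two candidate subjects.
sample = [
    "q1\tsubjA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-100\t350.0",
    "q1\tsubjB\t80.0\t180\t36\t0\t1\t180\t5\t184\t1e-60\t210.0",
]
hits = top_hits(sample)
```

The gene would then inherit the description of the selected subject. A name taken from a single hit is only as good as that hit's own annotation, which is one motivation for the evidence-aggregation methods listed after it.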
AHRD
AHRD
• cyclic nucleotide-gated channel 15 LENGTH=678
• cyclic nucleotide-gated channel 17 LENGTH=720
• cyclic nucleotide-gated channel 15 LENGTH=678
• cyclic nucleotide-gated channel 16 LENGTH=705
• cyclic nucleotide gated channel 1 LENGTH=716
• cyclic nucleotide gated channel 5 LENGTH=717
• cyclic nucleotide gated channel 1 LENGTH=716
• cyclic nucleotide-gated channel 18 LENGTH=706
• cyclic nucleotide-gated channel 13 LENGTH=696
Weighted Keyword
get_common_names.pl: Score evidence; Collect keywords; common names; Singletons; Score keywords; Score names; Final names; Family name homogenization
Weighted Keyword
get_common_names.pl: Score evidence; Collect keywords; Score names
Score names:
• putative cyclic nucleotide-gated ion channel 15-like
• probable cyclic nucleotide-gated ion channel 17-like
• putative cyclic nucleotide-gated ion channel 15-like
• probable cyclic nucleotide-gated ion channel 16-like
• Cyclic nucleotide-gated ion channel, putative
• cyclic nucleotide-gated ion channel 18-like
• Isoform 2 of Probable cyclic nucleotide-gated ion channel 10
Family name homogenization:
• cyclic nucleotide-gated ion channel-like protein
• cyclic nucleotide-gated ion channel-like protein
• cyclic nucleotide-gated ion channel-like protein
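The slides do not show how get_common_names.pl works internally, so the following Python sketch only illustrates the general idea named in the steps: collect keywords from evidence descriptions, weight them by the strength of the evidence, and score each candidate name by its keywords. The stop-word list, the evidence weights and all function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical list of words treated as uninformative for naming.
STOPWORDS = {"putative", "probable", "protein", "like", "of", "isoform",
             "expressed", "unclassified"}


def keywords(description: str) -> Set[str]:
    """Lower-case alphabetic tokens of a description, minus stop words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return {t for t in tokens if t not in STOPWORDS}


def score_names(evidence: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Score each description by the mean weight of its keywords."""
    keyword_weight: Counter = Counter()
    for description, weight in evidence:
        for kw in keywords(description):
            keyword_weight[kw] += weight  # keywords seen in more evidence weigh more
    scored = []
    for description, _ in evidence:
        kws = keywords(description)
        score = sum(keyword_weight[k] for k in kws) / len(kws) if kws else 0.0
        scored.append((description, score))
    return scored


def best_name(evidence: List[Tuple[str, float]]) -> str:
    """Highest-scoring description; ties go to the earliest evidence entry."""
    return max(score_names(evidence), key=lambda pair: pair[1])[0]


# Two descriptions from the slide plus one unrelated hit, with made-up weights.
evidence = [
    ("putative cyclic nucleotide-gated ion channel 15-like", 1.0),
    ("probable cyclic nucleotide-gated ion channel 17-like", 1.0),
    ("Ribonuclease H-like superfamily protein", 0.5),
]
chosen = best_name(evidence)
```

Descriptions whose keywords recur across several pieces of evidence score higher than an outlier hit, so the cyclic nucleotide-gated ion channel names outrank the unrelated one in this example. Family name homogenization, the last step on the slide, is not modeled here.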
Transposable Elements
get_common_names.pl: Score evidence; Collect keywords; Score names
Score names:
• Retrotransposon protein, putative, unclassified
• Putative non-LTR reverse transcriptase
• Ribonuclease H-like superfamily protein LENGTH=575
• RNA-directed DNA polymerase (reverse transcriptase)-related
• Retrotransposon protein, putative, unclassified, expressed
• Putative reverse transcriptase
• Retrotransposon protein, putative, unclassified
Family name homogenization: reverse transcriptase zinc-binding protein; reverse transcriptase zinc-binding protein; non-ltr retroelement reverse transcriptase zinc-binding protein