RNAseq-Assembly-Haibao.pptx
- Number of slides: 26
de novo RNA-seq Assembly
Haibao Tang, J. Craig Venter Institute
Plant Informatics Workshop (Jul-18-2013)
Outline
• RNA-seq overview and goals
• Two strategies for RNA-seq assembly
  – “Align-then-assemble”, “Assemble-then-align”
• Some advantages of the “Assemble-then-align” approach
• Popular de novo RNA-seq assemblers: Trinity and Rnnotator
A Paradigm for Genomic Research [Diagram labels: WGS Sequencing, Assemble, Draft Genome Scaffolds; Methylation, TF binding sites, SNPs, Proteins] Credits: Brian Haas, Broad
A Paradigm for Genomic Research [Diagram labels: RNA-Seq, Align, Transcripts, Expression added to the WGS Sequencing, Assemble, Draft Genome Scaffolds workflow] Credits: Brian Haas, Broad
A Maturing Paradigm [Diagram labels: RNA-Seq, Assemble, Align, Transcripts, Expression alongside WGS Sequencing, Assemble, Draft Genome Scaffolds; Methylation, Tx-factor binding sites, SNPs, Proteins] Credits: Brian Haas, Broad
A Paradigm for Genomic Research [Diagram: the same workflow annotated with relative costs, “$$$$$ + $$$$$” and “$”] Credits: Brian Haas, Broad
Why sequence RNA (versus DNA)?
• The genome may be constant, but an experimental condition can have a pronounced effect on gene expression
  – e.g. drought-challenged vs. normal condition
  – e.g. wild type vs. mutant
• Some molecular features can only be observed at the RNA level
  – Alternative isoforms, fusion transcripts, RNA editing
• Predicting transcript sequence from genome sequence is difficult (RNA evidence helps annotation)
• ‘Regulatory’ mutations that affect which mRNA isoform is expressed and how much, without an obvious effect on protein sequence
  – e.g. splice sites, promoters, exonic/intronic splicing motifs, etc.
Goals of RNA-Seq analysis
• Comprehensive transcript discovery and annotation
• Gene expression and differential expression
• New transcript structures derived from chromosomal aberrations or structural polymorphisms
• Alternative expression analysis and quantitative discrimination of isoforms
• Allele-specific expression
  – Relating to SNPs or mutations
• Mutation discovery
• Fusion detection
• Non-coding RNAs
• RNA editing
RNA-seq experimental design
• Key issues:
  – Sequencing depth: how much?
  – Number of replicates: how many? (biological rep, technical rep)
• Aims of the data:
  – Transcriptome assembly / transcript characterisation: maximise depth
  – Detection of differential expression (de novo or reference): balance depth and replication
Two assembly strategies
• There is no one ‘correct’ way to analyze RNA-seq data (though there are some incorrect ways)
• Two major branches
  1. Direct alignment of reads (spliced or un-spliced) to genome or transcriptome
  2. Assembly of reads followed by alignment
• Assembly is the only option when working with a creature with no genome sequence
Image from Haas & Zody, 2010
Pros and cons
• Alignment to genome / transcriptome
  – Computationally inexpensive
  – Spliced (exon junction) reads map correctly
  – In principle, maximum sensitivity, but depends on correct read-to-reference alignment, a task that is complicated by splicing, reads mapping to multiple locations, sequencing errors, and missing or incomplete reference genomes
• de novo assembly
  – The assembly-first approach does not require any read-to-reference alignment, which is important when the genomic sequence is unavailable, gapped, highly fragmented, or substantially altered, as in cancer cells
  – Allows detection of chimeric transcripts and resolution of ‘breakpoints’
  – Is less prone to making fused transcripts from adjacent loci
Identify novel transcript structures Robertson et al., 2010
Identify fused genes Schematic of detecting a fusion gene. The contig aligns to two genomic regions. The regions may be on different chromosomes, or on one chromosome but separated by a distance that is much longer than the ~200-bp PE insert length. Robertson et al., 2010
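The rule in the schematic above can be sketched as a small check on two alignments of one contig. This is only an illustration, not the method of Robertson et al.: the `Alignment` class, the function name, and the `factor` threshold that stands in for "much longer than the insert length" are all hypothetical, and a real pipeline would also have to rule out long introns.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One genomic alignment block of an assembled contig (hypothetical type)."""
    chrom: str
    start: int
    end: int


def looks_like_fusion(a, b, insert_length=200, factor=10):
    """Flag a contig whose two alignment blocks suggest a fusion.

    Assumption: "much longer than the insert length" is modelled as
    gap > factor * insert_length; `factor` is an arbitrary illustrative value.
    """
    if a.chrom != b.chrom:
        return True  # blocks on different chromosomes
    # Distance between the two blocks (negative if they overlap)
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap > factor * insert_length


inter_chrom = looks_like_fusion(Alignment("chr1", 100, 600), Alignment("chr5", 900, 1400))
far_apart = looks_like_fusion(Alignment("chr1", 100, 600), Alignment("chr1", 1_000_600, 1_001_000))
nearby = looks_like_fusion(Alignment("chr1", 100, 600), Alignment("chr1", 1_100, 1_500))
```

Here `inter_chrom` and `far_apart` are flagged, while `nearby` (a 500-bp gap) is not.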
Alternative splicing [Griffith and Marra 07]
Cucumber re-annotation “We used spaln and PASA to align 90,307 cucumber ESTs, 260 cucumber FL-cDNAs downloaded from NCBI, and transcripts reconstructed by Inchworm.” Li et al., BMC Genomics, 2011
RNA-Seq assembly challenges
• In general:
  – Volume of data (longer run time, large RAM requirements)
  – Parameter optimization
  – Evaluating the assembly
• Specific to RNA-Seq:
  – The relative abundance of RNAs varies wildly, spanning 5 to 7 orders of magnitude (10^5 to 10^7-fold)
  – Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads
  – Ribosomal and mitochondrial genes
  – Coverage within the same transcript varies greatly, violating the uniform-coverage assumption of some assemblers (CABOG, ALLPATHS will NOT work!)
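A rough sketch of why a few highly expressed genes can dominate the reads. The numbers are assumptions made for illustration only: 1000 hypothetical genes of equal length, with abundances spread evenly on a log scale over 5 orders of magnitude, so that the expected share of reads is proportional to abundance.

```python
def read_share_of_top(abundances, top_fraction):
    """Expected fraction of reads drawn from the most abundant genes.

    Assumes all transcripts have the same length, so the chance of sampling
    a read from a gene is proportional to its abundance.
    """
    ordered = sorted(abundances, reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


# Hypothetical abundances: log-uniform from 1 to 10^5 across 1000 genes
n_genes = 1000
abundances = [10 ** (5 * i / (n_genes - 1)) for i in range(n_genes)]

share = read_share_of_top(abundances, 0.10)
print(f"Top 10% of genes take {share:.0%} of reads")
```

Under these assumptions the top 10% of genes account for a bit over two thirds of all reads, leaving the rarest transcripts with very thin coverage.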
Read density variability Credits: CSIRO, Nescent 2011
Assembly – Kmer graphs K=4 Miller et al., 2010
Assembly – Kmer graphs
• Tips
  – Sequencing error
• Bubbles
  – Sequencing error
  – Polymorphism
• Frayed rope / cycles
  – Repeats
Miller et al., 2010
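A minimal sketch of a k-mer graph with K=4, assuming the simplest construction: nodes are k-mers and an edge links two k-mers that are consecutive in a read. The toy reads and helper names are made up for illustration; real assemblers also handle reverse complements and coverage, which this ignores.

```python
def kmer_graph(reads, k=4):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def branch_points(graph):
    """K-mers with more than one successor (where tips or bubbles start)."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


def dead_ends(graph):
    """K-mers with no successor (ends of paths, including tips)."""
    return sorted(node for node, nxt in graph.items() if not nxt)


# Two error-free reads plus one read with a hypothetical error near its end
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTAG"]
graph = kmer_graph(reads, k=4)
print(branch_points(graph))  # ['ACGT']
print(dead_ends(graph))      # ['GTAG', 'TGCA']
```

The erroneous read creates a short side path (ACGT, CGTA, GTAG) that ends in a dead end, which is the "tip" pattern; an error in the middle of a read would instead rejoin the main path and form a bubble.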
Trinity Grabherr et al., 2011
Trinity For comparison: the “Align-then-assemble” approach sometimes yields a single fused gene model. Grabherr et al., 2011
Rnnotator Martin et al., 2010
Rnnotator Multiple Velvet assemblies with different k-mer sizes. Martin et al., 2010
Assembly QC
• Collect length statistics; N50 is still a good indicator
• Ask these questions:
  – Accuracy: How many of the assembled contigs map to the genome?
  – Accuracy: What are the contigs that do not align? (BLAST to nr)
  – Completeness: How many previously annotated genes are covered by the contigs? How many are full-length?
  – Contiguity: Does a single contig cover each gene?
• Compare results from multiple programs
Martin et al., 2010
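For the length statistics, N50 can be computed as below. This is a minimal sketch using the common definition (the contig length at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size); the example lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("need at least one contig")


contig_lengths = [100, 200, 300, 400, 500]  # hypothetical contigs, total 1500
print(n50(contig_lengths))  # 400, since 500 + 400 = 900 >= 750
```

For transcriptome assemblies N50 is best read alongside the accuracy and completeness questions above, since many short contigs can be genuine short transcripts.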
Assembly QC All 380 assemblies (cov cutoffs 2 to 20 and kmer sizes 25 to 63) screened for complete transcripts of five genes. If a complete transcript was present in an assembly, it was marked in grey, and in black otherwise. Gruenheit et al., 2012
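The figure of 380 assemblies is consistent with a full parameter grid, assuming integer coverage cutoffs 2..20 (19 values) and odd k-mer sizes 25..63 (20 values); that step size is an assumption, not something stated on the slide. The sketch below builds the grid and a presence table like the one described; the gene names and results are placeholders.

```python
from itertools import product

cov_cutoffs = range(2, 21)     # 2..20 inclusive: 19 values (assumed integer steps)
kmer_sizes = range(25, 64, 2)  # 25, 27, ..., 63: 20 values (assumed odd k only)
grid = list(product(cov_cutoffs, kmer_sizes))
print(len(grid))  # 380


def presence_table(complete_by_assembly, genes):
    """For each (cov_cutoff, kmer) pair, record which genes were fully assembled.

    `complete_by_assembly` maps (cov_cutoff, kmer) to the set of genes with a
    complete transcript in that assembly; missing entries count as none.
    """
    return {
        params: [gene in complete_by_assembly.get(params, set()) for gene in genes]
        for params in grid
    }


# Placeholder results for two of the 380 assemblies
genes = ["geneA", "geneB"]
results = {(2, 25): {"geneA"}, (10, 31): {"geneA", "geneB"}}
table = presence_table(results, genes)
```

Each `True` in `table` corresponds to a grey cell in the figure and each `False` to a black one.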
Talk summary
• de novo RNA-seq assembly is often the only choice without a reference genome (and is also a cost-effective method)
• de novo RNA-seq assembly is useful for identifying novel transcript structures and gene fusions
• de novo assembly of RNA-seq reads can create new gene models, improve existing gene models, and estimate expression levels
• FIN