Bioinformatic Techniques Tools for SNP Analysis Jill

Bioinformatic Techniques & Tools for SNP Analysis Jill L Wegrzyn Department of Plant Sciences University of California at Davis, CA 95616 USA E-mail: jlwegrzyn@ucdavis. edu

SNP Analysis Agenda Sequence-Based SNP Identification Common Bioinformatic Solutions Phred, Phrap, Consed, Polyphred, and Polybayes Re-Sequencing Project (ADEPT 2) High-Throughput SNP Identification Solution (Pine. SAP)

Finding SNPs: Sequence-based SNP Mining Genomic BAC Library RRS Library m. RNA Random Shotgun c. DNA Library Align to Reference EST Overlap DNA SEQUENCING BAC Overlap Shotgun Overlap Sequence Overlap - SNP Discovery GTTACGCCAATACAGGATCCAGGAGATTACC GTTACGCCAATACAGCATCCAGGAGATTACC

Comprehensive SNP Discovery: Resequencing • Overlapping PCR Amplicons across entire gene • Make no assumptions about sequence function • Sequence diversity and genetic structure for each gene is different • Proper association studies can only be designed in this context • Complete resequencing facilitates population genetics methods

Sequence-based SNP Identification Amplify DNA 5’ 3’ Sequence Phrap Phred Sequence each end Base-calling Contig assembly of the fragment. Quality determination Final quality determination Poly. Phred/Polybayes Polymorphism detection ATAGACG ATACACG Consed Sequence viewing Polymorphism tagging Analysis Homozygotes Heterozygote Polymorphism reporting Individual genotyping Phylogenetic analysis

What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a DNA sequence analysis package from University of Washington: l Trace file (chromatograms) reading l Quality (confidence) assignment to each individual base l Vector and repeat sequences identification and masking Sequence assembly l Assembly visualization and editing l Automatic finishing

Phred, Phrap, Consed, Polyphred, Polybayes l phred: Base calling and quality assignments l phrap: Contig formation and new quality assignments l consed: Visual X-Windows graphic interface, to view and edit alignments and contigs, and to view the original traces l polyphred: find polymorphisms in phrap contigs, quality calls, add data to phrap files to permit consed finding and visualization of polymorphisms. l polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.

Files and Program Execution l Input files: ABI chromat or SCF files l Order of Program execution: n phred, then phrap, then consed n or: phred, then phrap, then polyphred, polybayes, then consed n or: phred, then phrap, then consed, then polyphred, polybayes, then consed

Trace File High quality region – no ambiguities

Trace File Medium quality region – some ambiguities

Trace File Poor quality region – low confidence

What is Phred? • Phred observes the base trace, makes base calls, and assigns quality values (qv) of bases in the sequence. • Writes base calls and qv to output files for Phrap assembly. • Quality values (phred scores) range from 0 to 60. 20 and above is considered a confident base call (1 in 100 chance that is has been called wrong ~99% accuracy) • Useful for consensus sequence construction ATGCATTC string 1 CGTTCATGC string 2 ATGC-TTCATGC superstring • Here we have a mismatch ‘A’ and ‘G’, the qv will determine the dash in the superstring. The base with higher qv will replace the dash.

Phred value formula q = - 10 x log 10 (p) where q - quality value p - estimated probability error for a base call Examples: q = 20 means p = 10 -2 (1 error in 100 bases) q = 40 means p = 10 -4 (1 error in 10, 000 bases)

Why Phred? • Output sequence might contain errors. • Vector contamination • Dye-terminator reaction might not occur. • Weak or variable signal strength of peak corresponding to a base.

The structure of a phd file BEGIN_SEQUENCE 01 EBV 10201 A 02. g BEGIN_COMMENT CHROMAT_FILE: EBV 10201 A 02. g ABI_THUMBPRINT: PHRED_VERSION: 0. 990722. g CALL_METHOD: phred QUALITY_LEVELS: 99 TIME: Thu May 24 00: 18: 58 2001 TRACE_ARRAY_MIN_INDEX: 0 TRACE_ARRAY_MAX_INDEX: 12153 TRIM: CHEM: term DYE: big END_COMMENT BEGIN_DNA t 8 5 c 13 17 a 19 26 c 19 32 t 24 2221 a 24 2232 a 22 2245 a 27 2261 g 25 2272 c 19 2286 c 12 2302 t 19 2314 g 12 2324 g 15 2331 g 19 2346 g 23 2363 t 33 2378 g 36 2390 c 44 2404 c 44 2419 t 39 2433 a 39 2446 a 34 2460 t 35 2470 g 34 2482 t 16 8191 g 19 8200 t 13 8211 c 13 8229 g 4 8241 n 4 8253 c 4 8263 t 10 8276 t 9 8286 c 12 8301 t 16 8313 c 12 8329 c 12 8336 c 15 8343 t 19 8356 c 9 8371 g 13 8386 g 14 8397 a 7 8417 g 9 8427 g 4 8445 t 6 11908 a 6 11921 g 6 11927 t 6 11947 c 6 11953 a 6 11964 g 6 11981 c 4 11994 n 4 12015 c 4 12037 n 4 12044 n 4 12058 n 4 12071 n 4 12085 n 4 12098 n 4 12111 n 4 12124 c 4 12144 n 4 12151 END_DNA END_SEQUENCE

Phrap • Phrap constructs the contig sequence as a mosaic • • of the highest quality parts of the reads rather than a statistically computed “consensus”. Avoids the complex algorithm issues associated with multiple alignment methods • Speed and accuracy • Phrap is an assembler NOT an aligner The sequence produced by Phrap is quite accurate • Less than 1 error per 10 kb in typical datasets. • Phrap considers sequence quality at a given position (determined by Phred)

Running phred. Phrap Construct directory structure l directory for the run, containing subdirectories chromat_dir, edit_dir, phd_dir, poly_dir Move tracefiles to directory chromat_dir Run phred. Phrap from directory edit_dir n Determine phred. Phrap parameters

The greater the forcelevel, the more liberal the assembly (generally produces fewer contigs). however results in more manual editing reviewing polymorphisms later Default Adjusting Parameters

Phrap output files • *. contigs – fasta file containing the contigs - Contigs with more than one read - Singletons (single reads with a match to some other contig but could not be merged consistently) • *. singlets – fasta file of the singlet reads - Reads with no match to other read • *. ace – for viewing the assembly w/ Consed • *. view – for viewing the assembly w/ Phrapview

Consed is a program for viewing and editing assemblies produced by Phrap a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence. b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads. c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc.

Phred/Phrap/Consed Pipeline Directories: Chromat_dir Phd_dir Edit_dir

Using Poly. Phred to Visualize SNPs • Compares sequences across traces obtained from different individuals to identify sites for SNPs. • Will occasionally miscall genotypes - frequency of such mistakes depends on the sequencing chemistry used to generate the trace. • To reduce the number of miscalled sites, ignores regions of poor quality & ends

Polyphred l polyphred -ace gene_file. ace. 1 -tag p -snp hom -indel -f 50 > gene_file. polyphred. out n n n n where gene_file. ace. 1 is the. ace file present in the edit_dir directory. Note: this command will give polymorphisms of quality 'ranks' 1 through 3: 1 is the highest quality, 6 is the lowest quality The qualifier -f 50 is used to list 50 bp of flanking sequence on each side of the detected polymorphism The qualifier -indel is used to identify insertions or deletions The qualifier -snp hom is used to identify homozygous SNPs only The qualifier -tag p is used to list the tagged polymorphisms in the polyphred output file *. polyphred. out. It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.

Polyphred n Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. n Reads the PHD and POLY files associated with each trace. n During the SNP search phase, Poly. Phred combines information from all of the sequence traces to derive a genotype and a score for each sequence n The score indicates how well the trace at the site matches the expected pattern for a SNP. n Updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using Consed.

Polybayes Bayesian statistical model takes into account: - depth of coverge - base quality values of the sequences -a priori expected rate of polymorphic sites in region Polybayes calculations are aided with information on major/minor allele frequencies as well as polymorphism rates within the species under investigation **Can also integrate into the poly files for viewing with Consed

Allele Discovery of Economic Pine Traits l Goals n Re-sequence ~8, 000 amplicons in Loblolly Pine n SNP Discovery n Genotyping of 3, 000 trees using Illumina technology n Phenotyping of same 3, 000 trees n Association studies

Resequencing Project Pipeline EST Databases (UGA, UMN) Assemble Contigs Diversity Panel DNA Extraction Megagametophyte Primer Design Illumina Genotyping Primer Validation Agencourt Sequencing 18 individuals & 5 other SNP Identication Phenotyping Drought, Dieases, & Wood Population DNA Extraction (Needles) Association Studies

Primer Selection and Validation Primer Selection Using OSP primer picking Report to Tree. Genes - several primers per amplicon will be stored in Tree. Genes Successes - select 1 primer set per amplicon Primer synthesis at Illumina Sequencing on validation DNA Report to Tree. Genes Successes BLASTing of validation DNA sequence back to EST database Report to Tree. Genes Successes Sequencing of diversity panel - 1 primer set per amplicon which has been successful at each step Report to Tree. Genes

Species Num. Successful Amplicons Pinus taeda 7, 424 Pinus radiata 6, 249 (84%) Pinus lamertiana 2, 234 (30%) Picea abies 1, 024 (13%) Pseudotsuga menziesii 750 (10%) **312 successful in all 5 species

Alignment and SNP Calling Pipeline Challenges in High-Throughput SNP Identification l Alignment Critical in the automation of base calls n Commonly used Phrap (from Phred. Phrap) is an assembler and is NOT ideal for alignments Many commonly used aligners work best with protein sequences or with a reference sequence n Preservation of quality scores for input into SNP identification programs n Speed for high-throughput programs l Automated SNP Calls - Reference Sequence Required - Traditional approaches without reference sequence include “e. SNPs” (human, maize, and pine) n -Very little redundancy outside of abundant genes -Overall high number of false positives (single pass reads) - Not specific to frequencies observed in different organisms High number of false positives in currently accepted methods (Polybayes & Polyphred)

Pine. SAP Re-Sequencing data from Agencourt: Initial Processing Base Calling Sequence Alignment SNP Identification Machine Learning Data Storage & Release

Base Calling and Sequence Alignment Modified Phred. Phrap allows for trimming of bases from start and end of sequence based on trace quality Ace 2 FASTA Converts native Phred. Phrap output (ace file) into an unaligned FASTA file Probcons. RNA Optimal DNA sequence alignment program Aligned. Contig 2 Read. FASTA Provides single multifasta file with all reads aligned to the contig from Phred. Phrap AND the contigs alignment to the other contigs from probcons. RNA FASTA 2 Ace Converts resulting FASTA file back into ace file for SNP Identification

Phrap assembly Number of contigs 1 2 3 4 5 6 7 8 Percentage 23. 67% 27. 00% 19. 67% 14. 00% 6. 00% 4. 33% 2. 67% 2. 33%

Alignment Pipeline Success Problems (out of 356 validation set) Sequencing problems 3 0. 84% Problems in end or middle 4 1. 12% F&R don't overlap 2 0. 56% Alignment problems due to low quality 1 0. 28% Alignment problem due to wrong direction in original phrap alignment 2 0. 56% Summary (out of 356 validation set) Correct alignments Alignments we are working on fixing 344 96. 63% +5 98. 03%

SNP Identification and Machine Learning Algorithms applied Independently for SNP Identification: Polyphred and Polybayes both utilize alignment and base quality information to identify potential polymorphisms. Both programs yield a high # of false positives Machine Learning Utilizes sequence based information and compares actual base calls to proposed calls to develop SNP identification filters specific to pine Parameters under Consideration: Local and global quality scores, frequencies of major and minor alleles, local and global SNP frequences, relative positions in sequence to existing variation, and polyphred and polybayes scores

SNP Identification Overview l Examine features to improve the accuracy of SNP location prediction l Utilize machine learning to apply the features l Refine the accuracy of the learning algorithm through adjustments to feature representation (iterative process) l Utilize the final classifier against the large re-sequenced set to improve accuracy of SNP calls originating from Polybayes and Polyphred

Features Selected Description Representation Sequence Depth Continuous Variation Type Categorical Polybayes Score Continuous Polyphred Score Continuous Freq of major/minor alleles Continuous Max quality of major/minor alleles Continuous Local average quality Continuous Overall average quality Continuous Alignment Quality Continuous

Feature Representation l Sequence depth n n Count of number of sequences in the alignment at the position of variation. All sequences in the alignment may not overlap at the position of variation; l number is different from the total number of the sequences in the alignment l Variation type n l Poly. Bayes score n l Variation type can be SNP or INDEL Poly. Bayes program assigns a Bayesian posterior probability value for each called SNP using the frequency priors given for observing a variation at that position Polyphred score n Polyphred assigns a score calculated primarily from sequence depth and quality score.

Feature Representation l Base frequencies n The number of occurrences of different bases at the position of variation is important in determining a polymorphic position. l l Frequencies of the first (major allele) and the second (minor allele) are represented as ratio to sequence depth. Relative distance n Sequence quality at the ends of the alignment tends to be poor due to inherent limitations of current sequencing technology. l SNP position was represented as the ratio of the distance in the consensus sequence from the closest end, or the relative distance

Feature Representation l Sequence quality n n Variation is observed because of a poor quality base. Based on the base frequencies calculated: l l l maximum qualities of the major and minor alleles average qualities of major and minor alleles Alignment quality Misalignment of bases caused by sequence alignment programs sometimes result in an erroneous SNP call. n In the neighborhood of the SNP (+/- 10 bases) all the mismatches with the consensus sequence are given a penalty and the penalty is more if the mismatch is continuous n

Classification Goal: Prediction Software: Decision tree C 4. 5 program is open-source C code (WEKA) n At each point in the construction of the decision tree, C 4. 5 selects the feature to test based on maximum information gain. n The goal is to generate a minimum size tree that correctly classifies all the elements in the training set. n The size of the tree is the number of nodes (decision nodes) and the number of leaves are the classes (categories) that they are distributed to.

Alignment and SNP Identification Datasets l Training set for loblolly pine was composed of a total of 300 validated sequences. n n l Training set for poplar was composed of 42 validated sequences selected at random n l Divided to represent the relative percentages of sequence source Testing set is composed of 120 validated sequence sets Testing set is composed of a total of 30 validated sequence sets. Validation = manually observed FP, FN, TP, and TN SNP calls through observation of tracefiles in Consed.

Alignment and SNP Identification Evaluation Criteria l Accuracy = (TP + TN)/total l Sensitivity = TP/(TP + FN) Evaluation J 48 Polyphred Polybayes l Specificity = TN/(FP + TN) Accuracy 93. 6 76. 25 78. 02 Sensitivity 88. 21 83. 22 86. 54 Specificity 98. 73 N/A Evaluation J 48 Polyphred Polybayes Accuracy 94. 6 79. 35 80. 24 Sensitivity 90. 54 85. 01 88. 14 Specificity 97. 23 N/A

Resulting Identifications l l l SNPs have been called using Pine. SAP on 7424 amplicons representing 6924 Unigenes. Average amplicon length is 445 bp Average of 5. 5 SNPs/amplicons Pine is highly polymorphic! Total SNPS ~41, 500 Distribution of SNPs/amplicon n 1404 - 0 SNPs 1133 - 1 SNP 4887 - 2 or more SNPs

Alignment and SNP Identification SNP Formatting Custom lists can be exported into Illumina style input **option for adding IUPAC codes for SNPs in flanking sequence

Alignment and SNP Identification Illumina Design

Data Storage and Release SNP Identifcations GENOTYPING

SNP Discovery: db. SNP database db. SNP NCBI SNP database

SNP data submitted to db. SNP processing of SNPs submitted by research communtiy (submitted SNPs = ss#) Unique mapping to a genome location (reference SNP = rs#) (by 2 hit-2 allele)

Summary l Pine. SAP improves n Inaccuracies introduced by using Phrap to align sequences n Time which would be required by using a aligner such as Probcons. RNA or Clustal. W on its own l Pine. SAP has a 98% success rate when used to align loblolly resequencing data. l Pine. SAP identified a success list of features to enhance polymorphism predictions and generated a prediction accuracy of 93% l Pine. SAP provided a full alignment and polymorphism detection system that can be adapted to specific genomes

More Information… Phred/Phrap/Consed: www. phrap. org/ Polyphred: droog. mbt. washington. edu/Poly. Phred. html Polybayes: genomeold. wustl. edu/groups/informatics/software/polybayes/