The SSAHA Trace Server

Zemin Ning, Will Spooner, Mark Rae, Steven Leonard, Martin Widlake and Tony Cox
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK

INTRODUCTION

Genome projects have created many large biological databases. The total size of the DNA sequence data, for example, is estimated at approximately 200 GB, including WGS and clone reads, finished sequences, RefSeq, etc. Designing services that make all of this data searchable in a fast, sensitive and flexible way poses significant challenges in both algorithm development and hardware architecture. In this poster, we outline a system with the potential to accomplish this challenging but extremely worthwhile task.

SSAHA2 Client:
(1) Communicates over TCP/IP with the SSAHA2 server;
(2) Inputs the query data;
(3) Outputs the alignment results.

SSAHA2 Servers: Run on 64-bit Linux machines with 16 or 32 GB of memory.
(1) Communicate with the SSAHA2 client;
(2) Receive input data and carry out search and alignment;
(3) Output the search results to the client.

Computer Nodes Selection and Data Filtration:
Species_Code – Human, mouse, zebrafish, etc.;
Trace_Type – Finished sequence, WGS reads, EST reads, etc.;
Centre_Name – SC, WIBR, WUGSC, etc.

Sequence Representation

Sequence S: (s_1, s_2, …, s_i, …, s_m)
K-tuple: (s_i s_{i+1} … s_{i+k-1})

Using two binary digits for each base, we may have the following representation:
"A" = 00; "C" = 01; "G" = 10; "T" = 11

For any of the m/k non-overlapping k-tuples in the sequence, an integer E may be used to represent the k-tuple uniquely.

Index: E = \sum_{i=0}^{2k-1} b_i 2^{i}, with 0 \le E \le E_{max} = 2^{2k} - 1,
where b_i = 0 or 1, depending on the value of the sequence base, and E_{max} is the maximum of the possible E values.
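The two-bit encoding under Sequence Representation can be sketched in a few lines of Python. This is a minimal illustration, not the SSAHA2 implementation: the function names are hypothetical, and the choice to treat the first base as the most significant pair of bits is an assumption inferred from the poster's 2-tuple table, where AA = 0, AC = 1, …, TT = 15.

```python
# Two-bit base codes as given on the poster.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # BASES[code] inverts BASE_CODE


def encode_ktuple(ktuple: str) -> int:
    """Pack a k-tuple into an integer E, two bits per base.

    Assumption: the first base occupies the most significant bits,
    which reproduces the ordering AA=0, AC=1, ..., TT=15 for k=2.
    """
    e = 0
    for base in ktuple:
        e = (e << 2) | BASE_CODE[base]
    return e


def decode_ktuple(e: int, k: int) -> str:
    """Recover the k-tuple from its integer code E."""
    return "".join(BASES[(e >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


def e_max(k: int) -> int:
    """Largest possible E for k-tuples: 2^(2k) - 1."""
    return 2 ** (2 * k) - 1


if __name__ == "__main__":
    for word in ("AA", "AC", "CA", "GC", "TT"):
        print(word, encode_ktuple(word))
    print("E_max for k=2:", e_max(2))
```

Because E ranges over 0 … E_max without gaps, it can be used directly as an array index, which is what makes a flat lookup table possible.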
Hardware Requirement for the System: The server system requires six 16 GB or four 32 GB Linux boxes, each with 4 CPUs.

Search Speed: The aim is to provide a near real-time (under 10 seconds) search service for a clustered 200 GB database. The solution can be extended by plugging in extra appliances.

Hash Table:

A 2-tuple hashing table of S1, S2 and S3

S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

Positions within each sequence are numbered i = 1, 2, …, m.

E   k-tuple  Ni  Indices and Offsets
0   AA       1   (2, 19)
1   AC       3   (1, 9) (2, 5) (2, 11)
2   AG       2   (1, 15) (2, 35)
3   AT       2   (2, 13) (3, 3)
4   CA       7   (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
5   CC       4   (1, 21) (2, 31) (3, 5) (3, 7)
6   CG       1   (1, 5)
7   CT       6   (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17)
8   GA       4   (1, 3) (1, 17) (2, 15) (2, 25)
9   GC       0
10  GG       5   (1, 25) (1, 31) (2, 17) (2, 29) (3, 1)
11  GT       6   (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19)
12  TA       1   (3, 25)
13  TC       6   (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11)
14  TG       3   (1, 13) (2, 7) (3, 9)
15  TT       0

Each element stored in the hash table combines a sequence index and an offset.

Memory: M = 4 N_s / k + 4 \cdot 2^{2k}, where k = k-mer size and N_s = number of bases.
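The 2-tuple hashing table of S1, S2 and S3 can be reproduced with a short Python sketch. It is illustrative only: the function names are hypothetical, the table is keyed by the k-tuple string rather than by the integer E for readability, and the memory estimate simply evaluates the poster's formula (the poster does not state the unit, so treating M as bytes is an assumption).

```python
from collections import defaultdict

S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"


def build_hash_table(sequences, k):
    """Map each non-overlapping k-tuple to its (sequence index, offset) hits.

    Sequence indices and offsets are 1-based, as in the poster's table.
    """
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        # Step by k so that the k-tuples do not overlap.
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


def estimate_memory(num_bases, k):
    """Evaluate M = 4*Ns/k + 4*2^(2k) (unit assumed to be bytes)."""
    return 4 * num_bases / k + 4 * 2 ** (2 * k)


table = build_hash_table([S1, S2, S3], k=2)

if __name__ == "__main__":
    for ktuple in sorted(table):
        print(ktuple, len(table[ktuple]), table[ktuple])
    total_bases = len(S1) + len(S2) + len(S3)
    print("M for the toy database:", estimate_memory(total_bases, 2))
```

The first term of M grows with the database size and shrinks with k, while the second term grows as 4^k, so the choice of k trades table overhead against the number of stored hits.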
Array of index and offset data

Query sequence: Sq = (TGCAACAT)

For each k-tuple of the query at position t, f(t) lists the hits found in the hash table, and F(t) lists the same hits with the offset shifted by -(t-1).

t = 1, TG, shift 0
  f(t): (1, 13) (2, 7) (3, 9)
  F(t): (1, 13) (2, 7) (3, 9)
t = 2, GC, shift -1
  f(t): none
  F(t): none
t = 3, CA, shift -2
  f(t): (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
  F(t): (2, 1) (2, 7) (2, 19) (2, 25) (2, 31) (3, 19) (3, 21)
t = 4, AA, shift -3
  f(t): (2, 19)
  F(t): (2, 16)
t = 5, AC, shift -4
  f(t): (1, 9) (2, 5) (2, 11)
  F(t): (1, 5) (2, 1) (2, 7)
t = 6, CA, shift -5
  f(t): (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
  F(t): (2, -2) (2, 4) (2, 16) (2, 22) (2, 28) (3, 16) (3, 18)
t = 7, AT, shift -6
  f(t): (2, 13) (3, 3)
  F(t): (2, 7) (3, -3)

Fs(t), the sorted F(t) list:
(1, 5) (1, 13) (2, -2) (2, 1) (2, 1) (2, 4) (2, 7) (2, 7) (2, 7) (2, 7) (2, 16) (2, 16) (2, 19) (2, 22) (2, 25) (2, 28) (2, 31) (3, -3) (3, 9) (3, 16) (3, 18) (3, 19) (3, 21)

The run of four (2, 7) entries shows that the query matches S2 starting at offset 7.

Data Structures and Distributions

Hash tables for all the CPU nodes are generated from a certain number of traces (FASTA only), ordered by species or trace type, and stored in the RAM of the individual nodes. The Oracle database stores all the traces. SSAHA2 finds matching seeds in the hash tables and calls the database to pull out the sequences. Full sequence alignment is then performed.

SSAHA2 = SSAHA + Cross_Match: SSAHA for matching seeds, cross_match for sequence alignment.

[Figure labels: Client; Query; Subject; SSAHA seeds; Match Start; Match End; Exact Match; Near Exact Match; Edge length; Sequence for cross_match]

Output Format

Labelled fields: Query, Subject, Q_start, Q_end, S_start, S_end.

F Zfish37251-2938a06.p1c Z35723-a3166b08.q1c 37 210 415 588 174 98.28
Alignment score: 152

Query:  37 ATTGCCATTAAAATAATAATAAAAGGACATATTGATATTTTGGTCATGCTATTCCT 96
           ATTGCCATTAAAATAATAAT AAAGGACATATTGATATTTTGG CATGCTATTCCT
Sbjct: 415 ATTGCCATTAAAATAATAATGAAAGGACATATTGATATTTTGGCCATGCTATTCCT 474

Query:  97 AATGTCATCTCTGAATACAAAGACAGCAAATGGCCTGTGAAATAAACCCTGTCCAA 156
           AATGTCATCTCTGAATACAAAGACAGCAAATGGCCTGTGAAATAAACCCTGTCCAA
Sbjct: 475 AATGTCATCTCTGAATACAAAGACAGCAAATGGCCTGTGAAATAAACCCTGTCCAA 534

Query: 157 TAAGACAATGATCAAACATTCACTATTTTTTATAATAATCTGTATATTCTATAA 210
           TAAGACAATGATCAAACATTCACTATTTT TATAATAATCTGTATATTCTATAA
Sbjct: 535 TAAGACAATGATCAAACATTCACTATTTTGTATAATAATCTGTATATTCTATAA 588

F Zfish37251-2938a06.p1c Z35723-a3166b08.q1c 240 379 77 222 140 80.00
Alignment score: 36

Query: 240 AAATAAAATCATTCAAACAATAATAACATGATATTTTGGTCATC 299
AAATAAAAT T C CATT AAACAATAATAAAAT ACATGATATTTTG TCATC
Sbjct: 77 AAATAAA-TAAAATGTTGC-CATTAAAACAATAATAAAATGACATGATATTTTGATCATC 134

Query: 300 -----TATCCCTA-T-T-ATCTCTGAAATCAAAGACAGAGAACACCCTATGAAACC 351
TAT CCTA T T ATCTCTGAAATCAAAGACAG AACA CCT T AAAC AACC
Sbjct: 135 CTATGTATTCCTAATGTCATCTCTGAAATCAAAGACAGCAAACAGCCTGTAAAACC 194

Query: 352 CTGCCTCTCCGATTAGACAATGATCAAA 379
CTGCCT C GATT ACAATGATCAAA
Sbjct: 195 CTGCCTGATTTTACAATGATCAAA 222

References

(1) Ning, Z., Cox, A. J. and Mullikin, J. C. 2001. SSAHA: A Fast Search Method for Large DNA Databases. Genome Research 11: 1725-1729.
(2) http://www.phrap.com/

We would like to thank Professor Phil Green, University of Washington, who has kindly agreed for the Phrap/Cross_Match package to be used for sequence alignment in the SSAHA system.
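The seed search tabulated under "Array of index and offset data" can be sketched as follows. This is a toy Python illustration of the shift-and-sort step only, assuming the 2-tuple table of S1, S2 and S3 from the poster; the function names and the `min_hits` threshold are hypothetical, and no cross_match alignment is attempted.

```python
from collections import Counter, defaultdict

S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"


def build_hash_table(sequences, k):
    """Non-overlapping k-tuples -> list of 1-based (sequence index, offset)."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


def find_seed_runs(query, table, k, min_hits=2):
    """Return the sorted shifted hit list and the runs of repeated entries.

    Every overlapping k-tuple of the query (position t = 1, 2, ...) is looked
    up; each hit offset is shifted by -(t - 1) so that hits belonging to the
    same ungapped match collapse onto one (index, offset) value.
    """
    shifted = []
    for t in range(1, len(query) - k + 2):
        ktuple = query[t - 1:t - 1 + k]
        for index, offset in table.get(ktuple, []):
            shifted.append((index, offset - (t - 1)))
    shifted.sort()
    counts = Counter(shifted)
    runs = [(hit, n) for hit, n in sorted(counts.items()) if n >= min_hits]
    return shifted, runs


table = build_hash_table([S1, S2, S3], k=2)
shifted, runs = find_seed_runs("TGCAACAT", table, k=2)
best = max(runs, key=lambda run: run[1])

if __name__ == "__main__":
    print("sorted shifted hits:", shifted)
    print("runs:", runs)
    print("best seed (index, offset), hits:", best)
```

Sorting brings identical (index, offset) pairs together, so a match shows up as a run whose length is the number of supporting k-tuples; the longest run here points at S2, offset 7, where the query occurs exactly.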