ebdb10f3c24efff075a433f8908f6b90.ppt
- Number of slides: 11

A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression
Kunihiko Sadakane
Department of Information Science, University of Tokyo

Promising Techniques
Block Sorting Compression [Burrows, Wheeler 94]
• faster than PPMs
• decoding is much faster
• compression performance comparable to PPMs
Suffix Array [Manber, Myers 93]
• search data structure
• can find any substring
• more memory-efficient than suffix trees
We unify compression and search by using both.
Key: the Burrows-Wheeler Transformation (BWT)

Block Sorting Compression
• The Burrows-Wheeler Transformation (BWT) permutes the symbols of the text according to the lexicographic order of the suffixes that follow them.
• The permuted text becomes more compressible.

Novel Feature of the Block Sorting
• The BWT is defined by the suffix array (the sorted indexes of the suffixes).
• The suffix array is recovered from the compressed text.
The suffix array can be compressed by Block Sorting!
But the recovered suffix array cannot be used for case-insensitive search, because the suffixes are sorted in case-sensitive order.
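
The recovery of the suffix array during decoding can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the rotation-based BWT, a text that is not a repetition of a shorter string, and that the decoder is also given the row of the original text in the sorted order (the parameter name `primary` is hypothetical). The input is the BWTed text cCABAb of the text AbcABC.

```python
def reverse_bwt(bwt_text, primary):
    """Recover the text and its suffix array from a rotation-based BWT.

    primary is the row of the unrotated text in the sorted rotation list
    (an assumption of this sketch; the slides do not name it).
    """
    n = len(bwt_text)
    # A stable sort of the last column gives the first column; order[j] is
    # the last-column row whose symbol ends up in first-column row j.
    order = sorted(range(n), key=lambda i: bwt_text[i])
    lf = [0] * n
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row

    text = [""] * n
    suffix_array = [0] * n
    row = primary
    # Walk backwards through the text: the last symbol of the row for
    # rotation k+1 is text[k], and lf moves to the row for rotation k.
    for k in range(n - 1, -1, -1):
        text[k] = bwt_text[row]
        row = lf[row]
        suffix_array[row] = k
    return "".join(text), suffix_array


decoded, sa = reverse_bwt("cCABAb", primary=1)
print(decoded)  # AbcABC
print(sa)       # [3, 0, 4, 5, 1, 2]
```

The suffix array comes out as a by-product of the same backward walk that rebuilds the text, so no separate sorting step is needed on the decoding side.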

Our Contribution
• We propose the Modified Burrows-Wheeler Transformation
  – used for compressing a text and its suffix array
• The decoded suffix array can be used for case-insensitive search.
• Any unification function is available (not only case-insensitive search).
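
How a suffix array sorted on unified text supports case-insensitive search can be sketched with a plain binary search. This is an illustrative sketch under assumptions: the function names are hypothetical, the suffix array is built here by naive sorting rather than decoded from compressed text, and the unification function is assumed to preserve string length.

```python
def build_unified_suffix_array(text, unify=str.lower):
    """Naive suffix array of the unified text (fine for small examples)."""
    u = unify(text)
    return sorted(range(len(u)), key=lambda i: u[i:])


def find_all(text, suffix_array, pattern, unify=str.lower):
    """Return all start positions where pattern matches after unification."""
    u = unify(text)
    p = unify(pattern)
    m = len(p)

    # Lower bound: first row whose length-m prefix is >= p.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if u[suffix_array[mid]:suffix_array[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first row whose length-m prefix is > p.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if u[suffix_array[mid]:suffix_array[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "AbcABC"
sa = build_unified_suffix_array(text)
print(find_all(text, sa, "BC"))  # [1, 4]
print(find_all(text, sa, "ab"))  # [0, 3]
```

Because both the text and the pattern pass through the same unification function, swapping `str.lower` for another function changes which symbols are treated as equivalent without touching the search itself.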

An Application: Distributed Web Search Robots
Web sites → search robot → collected text (xyz, Abc, XYZ, ABC) → compress by Block Sorting → transfer via network

Search Server
transfer via network → decode → text and suffix array → merge into database
decoded text: xyz XYZ, suffix array: 14 2 8 3 9 5 10 ...
decoded text: Abc ABC, suffix array: 3 10 8 5 2 7 ...
suffix array on disk: 8 4 100 251 58 ...
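
The original BWT
Input text: AbcABC
Rotations of the input text:
0 AbcABC
1 bcABCA
2 cABCAb
3 ABCAbc
4 BCAbcA
5 CAbcAB
After sorting (index, rotation, last symbol):
3 ABCAb c
0 AbcAB C
4 BCAbc A
5 CAbcA B
1 bcABC A
2 cABCA b
BWTed text: c C A B A b
Suffix array: 3 0 4 5 1 2
Reverse BWT: sorting the BWTed text gives the first symbols of the sorted rows (A A B C b c); the text and the suffix array 3 0 4 5 1 2 are recovered from them.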
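
The example above can be reproduced with a short sketch. It sorts all rotations directly, which is quadratic in memory and only meant for illustration; the function name is hypothetical, and Python's default string ordering (uppercase before lowercase) is assumed to match the order used on the slide.

```python
def bwt_by_rotations(text):
    """Return (suffix_array, bwt_text) by sorting all rotations of text."""
    n = len(text)
    # Rotation i starts at position i and wraps around to the front.
    rotations = [(text[i:] + text[:i], i) for i in range(n)]
    rotations.sort()
    suffix_array = [i for _, i in rotations]
    # The BWTed text is the last symbol of each sorted rotation.
    bwt_text = "".join(rotation[-1] for rotation, _ in rotations)
    return suffix_array, bwt_text


sa, bwted = bwt_by_rotations("AbcABC")
print(sa)     # [3, 0, 4, 5, 1, 2]
print(bwted)  # cCABAb
```

In this order the rotations starting with A and a end up far apart, which is why this suffix array cannot serve a case-insensitive search.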

Unification
We identify character equivalence.
• unify capital/small letters (tolower): DCC = dcc
• unify double-byte codes and single-byte codes in Japanese EUC code: ＡＢＣ (a3c1 a3c2 a3c3) = ABC (41 42 43)
• unify Japanese Hiragana and Katakana: あいうえお = アイウエオ
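
The three unifications listed above can be sketched as simple symbol mappings. This is an assumption-laden illustration: it works on Unicode code points rather than on the EUC byte codes shown on the slide, the function names are hypothetical, and only the full-width ASCII variants and the basic Katakana block are covered.

```python
# Full-width ASCII variants U+FF01..U+FF5E map to ASCII U+0021..U+007E.
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}

# Katakana U+30A1..U+30F6 map to Hiragana U+3041..U+3096.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def unify_case(s):
    """Unify capital and small letters."""
    return s.lower()


def unify_width(s):
    """Unify full-width (double-byte in EUC) and ASCII letters."""
    return s.translate(FULLWIDTH_TO_ASCII)


def unify_kana(s):
    """Unify Katakana with Hiragana."""
    return s.translate(KATAKANA_TO_HIRAGANA)


print(unify_case("DCC"))         # dcc
print(unify_width("ＡＢＣ"))      # ABC
print(unify_kana("アイウエオ"))  # あいうえお
```

Each function maps one symbol to one symbol, so the unified text has the same length as the original and positions in the two texts correspond directly.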
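
Modified BWT
Input text: AbcABC
Unified text: abcabc$
Suffixes of the unified text:
0 abcabc$
1 bcabc$
2 cabc$
3 abc$
4 bc$
5 c$
The MBWT permutes the symbols of the input text by the suffix array of the unified text.
After sorting (index, unified suffix, preceding symbol of the input text):
3 abc$ c
0 abcabc$ C
4 bc$ A
1 bcabc$ A
5 c$ B
2 cabc$ b
MBWTed text: c C A A B b
Suffix array: 3 0 4 1 5 2
Reverse MBWT: unify the MBWTed text (c c a a b b), sort it to get the first symbols of the sorted rows (a a b b c c), then proceed as in the reverse BWT.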
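
The forward MBWT of this example can be sketched as follows. It is a naive illustration rather than the author's implementation: the function name and parameters are hypothetical, the sentinel `$` is assumed to be smaller than every unified symbol and absent from the text, and the unification function is assumed to preserve length.

```python
def mbwt(text, unify=str.lower, sentinel="$"):
    """Return (suffix_array, mbwt_text) for text under a unification function.

    Suffixes are ordered by their unified form, but the emitted symbols
    are taken from the original text, so no information is lost.
    """
    unified = unify(text)
    n = len(text)
    # Sort positions by the unified suffix terminated with the sentinel.
    suffix_array = sorted(range(n), key=lambda i: unified[i:] + sentinel)
    # Emit the original symbol that precedes each suffix; for position 0
    # the index -1 wraps around to the last symbol of the text.
    mbwt_text = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, mbwt_text


sa, permuted = mbwt("AbcABC")
print(sa)                # [3, 0, 4, 1, 5, 2]
print(permuted)          # cCAABb
print(permuted.lower())  # ccaabb
```

Compared with the original BWT of the same input, suffixes that differ only in case are now adjacent in the suffix array, while the output still carries the original capital and small letters.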

Compression Ratio and Speed
HTML files (total 90 Mbytes), block size: 9 Mbytes
unification func.   comp. ratio   comp. time (s)
identical (BWT)     1.743         363.58
normal (MBWT)       1.764         363.41
LSB 4               2.523         443.89
MSB 4               2.707         438.04
zero (no BWT)       5.772         411.74
• small difference between BWT and MBWT
• MBWT provides case-insensitive searches.