- Number of slides: 17
Computer Science Department, Technion – Israel Institute of Technology
Genomic Sorting with Length-Weighted Reversals
Ron Y. Pinter, Technion
Steve Skiena, SUNY Stony Brook

Genome Rearrangement
• events
  – duplication
  – translocation
  – reversal (inversion)
• occur primarily during reproduction
• allow large-scale genomic comparisons

Sorting by Reversals
• genome represented as a permutation of 1, 2, …, n
  – n = # of homologous genes among species
• assumptions
  – can identify genes
  – genes are distinct
• operation: reversal of a subsequence (of genes)
  – models inversion (occurs during crossover)
• one of the permutations can be 1, 2, …, n
  – appropriately relabel others

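
The reversal operation and the relabeling step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `reverse_segment` and `relabel` are hypothetical, genomes are plain Python lists, and positions are 0-based and inclusive.

```python
def reverse_segment(perm, i, j):
    """Return a copy of perm with positions i..j (0-based, inclusive) reversed."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def relabel(genome, reference):
    """Rename genes so that `reference` becomes 1, 2, ..., n.

    Assumes both lists contain the same distinct genes.
    """
    rank = {gene: k for k, gene in enumerate(reference, start=1)}
    return [rank[gene] for gene in genome]


# One genome is taken as the identity; the other is expressed relative to it.
print(relabel(["c", "a", "b"], ["a", "b", "c"]))   # [3, 1, 2]
print(reverse_segment([1, 2, 3, 4, 5], 1, 3))      # [1, 4, 3, 2, 5]
```
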
Example
4 2 8 7 1 5 6 11 10 [positions of 3 and 9 in this first row are not recoverable]
4 3 2 1 7 8 5 6 9 10 11
1 2 3 4 8 7 6 5 9 10 11
1 2 3 4 5 6 7 8 9 10 11
• 6 reversals
• in our model (for f(l) = l): cost = 18

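
The cost stated in the example can be checked mechanically. The sketch below rests on an assumption: the starting row is taken to be 4 3 2 8 7 1 5 6 11 10 9, a reconstruction chosen because it reaches the later rows in six reversals of total length 18. The slide text does not fix where 3 and 9 sit in that row. The helper `apply_reversals` and the list `steps` are hypothetical names.

```python
def apply_reversals(perm, reversals, f=lambda l: l):
    """Apply (i, j) reversals (0-based, inclusive) in order.

    Returns the resulting list and the total cost, charging f(length)
    for each reversal.
    """
    perm = list(perm)
    total = 0
    for i, j in reversals:
        perm[i:j + 1] = perm[i:j + 1][::-1]
        total += f(j - i + 1)
    return perm, total


# ASSUMED starting row (reconstruction, not taken verbatim from the slide).
start = [4, 3, 2, 8, 7, 1, 5, 6, 11, 10, 9]
# Six reversals passing through the two intermediate rows of the example.
steps = [(3, 5), (8, 10), (0, 3), (4, 5), (6, 7), (4, 7)]

result, cost = apply_reversals(start, steps)
print(result, cost)
```

Passing `f=lambda l: 1` to the same helper gives the unit-cost count of reversals instead of the length-weighted cost.
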
Our Model
• unsigned
• cost of reversal of a subsequence of length l is f(l)
• total sorting cost (or distance) is Σ_j f(length(s_j)), where the s_j are the reversed subsequences

Cost Functions
• additive: f(x+y) = f(x) + f(y)
• subadditive: f(x+y) < f(x) + f(y)
• superadditive: f(x+y) > f(x) + f(y)
• other
  – e.g. bitonic

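
To make the four classes concrete, the following sketch tests a candidate cost function against the three inequalities on a finite range of lengths. It is an empirical check only, not a proof. The name `classify`, the bound `max_len`, and the sample bitonic function are assumptions made for illustration.

```python
def classify(f, max_len=50):
    """Empirically classify f on lengths 1..max_len.

    Compares f(x + y) with f(x) + f(y) for all x, y >= 1 with
    x + y <= max_len and reports which relation holds throughout.
    """
    signs = set()
    for x in range(1, max_len + 1):
        for y in range(1, max_len + 1 - x):
            lhs, rhs = f(x + y), f(x) + f(y)
            signs.add((lhs > rhs) - (lhs < rhs))  # -1, 0 or +1
    if signs == {0}:
        return "additive"
    if signs == {-1}:
        return "subadditive"
    if signs == {1}:
        return "superadditive"
    return "other"


print(classify(lambda l: l))        # additive
print(classify(lambda l: 1))        # subadditive (unit cost)
print(classify(lambda l: l * l))    # superadditive
# Hypothetical bitonic cost: rises, then falls.
print(classify(lambda l: min(l, 12 - l), max_len=11))  # other
```
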
Problems
• algorithm to sort any permutation
  – worst-case min cost
• approximate min cost for a given permutation

Extremal Costs
• highly subadditive: e.g. unit cost, f(l) = 1
  – NP-complete [Caprara, ’97]
  – series of approximation ratios: 2, 1.75, 1.375
• highly superadditive: f(l) > l²
  – essentially bubblesort

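
One way to read the "essentially bubblesort" remark: when long reversals are very expensive, a sorter falls back on reversals of length 2, which are adjacent swaps. The sketch below is a plain bubble sort that charges f(2) per swap. The function name and the default cost f(l) = l² are assumptions for illustration, and the sketch makes no optimality claim.

```python
def bubble_sort_by_reversals(perm, f=lambda l: l * l):
    """Sort using only length-2 reversals (adjacent swaps).

    Returns the sorted list and the total cost, charging f(2) per swap.
    The number of swaps equals the number of inversions in perm.
    """
    a = list(perm)
    cost = 0
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]  # a reversal of length 2
                cost += f(2)
    return a, cost


print(bubble_sort_by_reversals([3, 1, 2]))  # ([1, 2, 3], 8)
```
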
Our Results
• additive cost function
  – specifically f(l) = l
• QuickSort-like algorithm for worst-case
  – complexity: O(n lg² n)
• min cost approximation ratio of O(lg² n)

MedianEject(a, b)
• find r maximal blocks of wrong-sided elements with respect to median
• for lg r do: flip every other pair of blocks of wrong-sided and adjacent blocks
• move wrong-sided blocks to median boundary
• reverse left and right blocks

Sample Run
• complexity: O((b-a) lg r)

ReversalSort(a, b)
  MedianEject(a, b);
  ReversalSort(a, (a+b)/2);
  ReversalSort((a+b)/2 + 1, b);
Complexity
  T(n) = 2 T(n/2) + O(f(n) lg n) = O(f(n) lg² n) = O(n lg² n) for f(n) ~ n

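
The recursive structure above can be tried out with a small runnable sketch. It is not the slides' MedianEject: the partition step here is a simpler divide-and-conquer substitute that partitions each half recursively and then swaps the two middle blocks with one reversal. All names are hypothetical, elements are assumed distinct, and the cost is f(l) = l. Under these assumptions each partition costs O(n lg n), so the total follows the same recurrence, T(n) = 2 T(n/2) + O(n lg n).

```python
def reversal_sort(perm):
    """Sort distinct integers using only reversals; return (sorted_list, cost)."""
    a = list(perm)
    cost = 0

    def reverse(i, j):
        # Reverse a[i:j] (half-open) and charge its length, f(l) = l.
        nonlocal cost
        a[i:j] = a[i:j][::-1]
        cost += j - i

    def partition(lo, hi, pivot):
        # Rearrange a[lo:hi] so elements <= pivot come first;
        # return the index of the first element > pivot.
        if hi - lo == 1:
            return lo + 1 if a[lo] <= pivot else lo
        mid = (lo + hi) // 2
        left_split = partition(lo, mid, pivot)
        right_split = partition(mid, hi, pivot)
        # Layout is small | LARGE | small | LARGE; one reversal of the
        # two middle blocks brings the small elements together.
        if left_split < mid and mid < right_split:
            reverse(left_split, right_split)
        return left_split + (right_split - mid)

    def sort(lo, hi):
        if hi - lo <= 1:
            return
        # The median is found by inspection only; data moves only via reverse().
        median = sorted(a[lo:hi])[(hi - lo - 1) // 2]
        split = partition(lo, hi, median)
        sort(lo, split)
        sort(split, hi)

    sort(0, len(a))
    return a, cost


print(reversal_sort([2, 1]))  # ([1, 2], 2)
```
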
Algorithmic Improvements
I. simplify “short” phases
II. merge the 2 last steps of MedianEject when possible (2p+q vs. 3p+q)
    [figure: adjacent blocks of lengths p, q, p]
III. apply II recursively

Approximation Ratio
• M(p) is the maximal total distance between pairs of out-of-order elements
• Lemma 4: min cost is Ω(M(p))
• but Lemma 6: # of out-of-order elements < 3 M(p)
• + Lemma 7: MedianEject touches only elements within linear range from out-of-order elements
• yields:
• each round of MedianEject takes O(M(p) lg² n)
• ReversalSort costs O(M(p) lg² n)
• ReversalSort is at most O(lg² n) times optimal

Bioinformatic “Validation”
• use our cost (= distance) to build phylogenetic trees
  [figure: tree with leaves Cyanophora, Cyanidium, Guilardia, Porphyra]
• 4 plants (chloroplastic genes)
• consistent with [Martin et al., PNAS Sept ’02]
• work in progress [M. Shoham]

Open Problems: Algorithmic
• weighted genes
• tighter approximation ratio
  – close to O(lg n)
  – can get to constant?
• other cost functions (incl. bitonic)
• the signed case

Open Problems: Modeling
• chromosomal ordering
• what is the right cost function?
  – consider cost(l) = l^d
• combine with constraint-based models
  – restricted regions
  – “undesired” reversal sequences
• deal with duplication and translocation events


