Encoding Information for DNA Computing
Shinnosuke Seki


Purpose
- What is the advantage of encoding?
- To make a "good", or tractable, code set for DNA computing.
- Development of polynomial-time algorithms that decide whether a given code set is "good" or "bad".

Claude Elwood Shannon
- The father of information theory (Shannon's entropy).
- Boolean algebra with binary arithmetic makes it possible to simplify electromechanical relays.
- In "A mathematical theory of communication" [Sha48], he showed that we can send information with an arbitrarily small error rate even over a noisy channel.
- Chess program using a minimax evaluation procedure, etc.

Shannon's information channel
[Figure: sender → encoder → channel of capacity C → decoder → receiver, with information flow R and positive and negative noise acting on the channel]
- R > C: overflow.
- R ≤ C: we can make the error rate as small as possible.
- To attain R = C in the noisy channel, we need to find a "good" code.

Biological perspective
- Every biological reaction can be modeled as an information channel.
  - Example: the case of heredity.
    [Figure: parent DNA → heredity → child DNA, subject to natural selection and mutation]
- Over billions of years, has Mother Nature developed a wonderful code system?
- Biology -> Computer Science

Review: in vitro DNA computing
1. Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs).
2. Compute by a succession of bio-operations.
3. Decode the resulting solution and extract its output.

Review: WK-complementarity
- Hydrogen bonds: A-T, C-G.
- Two strands that are
  1. complementary to each other and
  2. oppositely directed
  can form a (complete) dsDNA.
- Example:
    5'-ATCGGTCAACTGCCCTAATG-3'
    3'-TAGCCAGTTGACGGGATTAC-5'
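The pairing rule above can be checked mechanically. The following is a minimal Python sketch, assuming both strands are passed in 5'->3' order; the function names are illustrative and do not come from the slides.

```python
# Watson-Crick pairs: A-T and C-G.
WK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the WK-complement of a strand, read in the opposite direction."""
    return "".join(WK[base] for base in reversed(strand))


def forms_complete_duplex(s1: str, s2: str) -> bool:
    """True if s1 and s2 (both written 5'->3') can form a complete dsDNA."""
    return s2 == reverse_complement(s1)


upper = "ATCGGTCAACTGCCCTAATG"         # top strand of the example, 5'->3'
lower_3_to_5 = "TAGCCAGTTGACGGGATTAC"  # bottom strand as drawn, 3'->5'
lower = lower_3_to_5[::-1]             # rewrite it 5'->3'
print(forms_complete_duplex(upper, lower))
```

Reversing the bottom strand before the comparison is what encodes the "opposite directions" condition.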

Adleman's first trial
- Find a solution of the Hamiltonian path problem in a test-tube solution, in time polynomial in the size of the input graph.
- The solution is filled with encoding oligonucleotides.
[Figure: input graph on the vertices 1, 2, 3, 4]

  Vertex   Code word        Edge     Code word
  1        ACG CTT          1 -> 2   GAA TAT
  2        ATA GAT          2 -> 3   CTA GCC
  3        CGG TTA          3 -> 4   AAT TGA
  4        ACT TAA
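The relationship between the vertex and edge words on this slide can be sketched in Python. The grouping of the triplets into four 6-base vertex words and three 6-base edge words is a reading of the flattened figure, and `edge_strand` is a hypothetical helper name: each edge word is taken to be the base-by-base complement of the second half of the source vertex followed by the first half of the target vertex.

```python
WK = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Vertex code words as read from the slide (written 5'->3').
VERTEX = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}


def edge_strand(u: int, v: int) -> str:
    """Edge u -> v, written 3'->5' so that it lines up under the vertex words.

    It is the base-by-base complement of the second half of u's word
    followed by the first half of v's word.
    """
    half = len(VERTEX[u]) // 2
    junction = VERTEX[u][half:] + VERTEX[v][:half]
    return "".join(WK[b] for b in junction)


for u, v in [(1, 2), (2, 3), (3, 4)]:
    print(f"{u} -> {v}: {edge_strand(u, v)}")
```

Under this reading the three computed strands coincide with the edge words listed on the slide, so an edge strand can splint two consecutive vertex strands together.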

What's a good code set?
- No code word (oligonucleotide) should form an undesirable structure.
  [Figure: the code word ATA GAT of vertex 2 folding back on itself]
  Such folding may make the word inert.
- Code words should not interact with each other in undesirable ways.
- Structure formation is due to
  - WK-complementarity
  - Gibbs free energy

What's a good code set? (cont.)
- Uniform melting temperature
- Preventing undesirable hybridizations
- Other constraints
  - Avoiding repeated bases
  - Forbidden subsequences
    - When a restriction enzyme is used, its recognition site should appear only at the intended sites.
  - Using only 3 types of nucleotides: A, C, T

Melting temperature
- The melting temperature Tm of a dsDNA is the temperature at which half of the dsDNA molecules are denatured.
- The higher Tm is, the more stable the dsDNA is.
- Tm = ΔH / (ΔS + R ln(Ct / α)), where
  - R: gas constant
  - Ct: total oligo concentration
  - ΔH, ΔS: enthalpy and entropy
  - α: 1 for self-complementary strands and 4 for non-self-complementary strands
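As a rough illustration of how these quantities combine, here is a minimal Python sketch. It assumes the standard two-state expression Tm = ΔH / (ΔS + R ln(Ct / α)), which matches the symbols listed on the slide; the thermodynamic values in the example are hypothetical and chosen only for illustration.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)


def melting_temperature(dH: float, dS: float, Ct: float,
                        self_complementary: bool = False) -> float:
    """Two-state melting temperature in kelvin.

    dH in cal/mol, dS in cal/(K*mol), Ct in mol/L.
    """
    alpha = 1.0 if self_complementary else 4.0
    return dH / (dS + R * math.log(Ct / alpha))


# Hypothetical thermodynamic values, not measured data.
tm = melting_temperature(dH=-80_000.0, dS=-220.0, Ct=1e-6)
print(f"Tm = {tm - 273.15:.1f} degrees C")
```

In practice ΔH and ΔS would come from nearest-neighbor parameter tables rather than being supplied by hand.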

Nearest-neighbor method
Refer to [AlSa97] and [TKY04] ([8] and [9] in this table).

Melting temperature (cont.)
- Uniform melting temperature
  - Making Tm uniform eliminates hybridization bias.
- GC content
  - The ratio of the number of G's and C's to the total number of nucleotides in a sequence.
  - A G-C pair is more stable than an A-T pair.
  - Higher GC content implies higher Tm.
  - Sequences are designed with 50% GC content.
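GC content as defined above is a one-line computation. A minimal Python sketch follows; the two test words are the code words of vertices 1 and 2 from the Adleman slide.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


print(gc_content("ACGCTT"))  # 3 of the 6 bases are G or C
print(gc_content("ATAGAT"))  # 1 of the 6 bases is G or C
```

The second word falls well short of the 50% design target, which is one reason such a word would be rejected by the criteria on this slide.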

Gibbs free energy (ΔG)
- A well-known indicator of the stability of DNA structures.
  - A structure with lower ΔG is more stable.
  - The ΔG of an entire structure is the sum of the ΔG of its substructures [ZuSt81].

Secondary structures look like…

Template method [ArKo02]
- Prepare two bit sequences, each of which has some desirable property (e.g., 50% GC content, error correction).
- Using a conversion rule, construct a DNA sequence from these two sequences.

Template method (cont.)
- Design criteria
  - Template
    - An element x should have at least d mismatches with x^R, xx, x^R x^R, x x^R, and x^R x.
    - A good template is found by exhaustive search.
  - Map (error-correcting code)
    - A code whose words have at least k mismatches with one another.
    - e.g., BCH code
- Drawback
  - It cannot prevent sequences from forming secondary structures.

AG-templates, GC-templates [KKA03]
- GC-template
  - The template contains the same number of 0's and 1's (50% GC content).
  - The map is an error-correcting code.
- AG-template
  - The map is a constant-weight code (50% GC content).
  - Results in a larger set of sequences.
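One way to picture how a GC-template and a map word combine is sketched below in Python. It assumes that a template bit selects the base class ([GC] for 1, [AT] for 0) and that the map bit selects the base within that class. The concrete bit-to-base table `RULE` is an illustrative assumption and is not claimed to be the exact conversion rule of [ArKo02] or [KKA03].

```python
# Hypothetical conversion table: (template bit, map bit) -> base.
RULE = {("0", "0"): "A", ("0", "1"): "T",
        ("1", "0"): "C", ("1", "1"): "G"}


def convert(template: str, map_word: str) -> str:
    """Combine a template and an equally long map word into a DNA word."""
    if len(template) != len(map_word):
        raise ValueError("template and map word must have equal length")
    return "".join(RULE[t, m] for t, m in zip(template, map_word))


template = "110100"  # three 1's and three 0's, hence 50% GC content
for map_word in ("000000", "010101", "111000"):
    print(map_word, convert(template, map_word))
```

Because the template alone fixes which positions hold G or C, every word built from it has the same GC content regardless of the map word, while the map code controls the mismatches between words.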

Other approaches
- DNASequenceGenerator [FBR00]
  - Software with a GUI.
  - Creates sequences with a given melting temperature and GC content and with no palindromes, start codons, or restriction sites.

Other approaches
- Suyama's approach [YoSu00]
  - Generate sequences randomly and add a sequence to the sequence set iff it satisfies all of the following constraints:
    - Uniform melting temperature
    - No mis-hybridization
    - No formation of stable secondary structure
  - Drawback: it easily falls into local optima.
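The generate-and-test loop described above can be sketched in Python as follows. The three constraint checks are deliberately simplified stand-ins (GC content for melting temperature, Hamming distances for hybridization and self-structure); they are assumptions made for illustration, not the criteria used in [YoSu00].

```python
import random

WK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def revcomp(s: str) -> str:
    return "".join(WK[b] for b in reversed(s))


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def acceptable(cand: str, code: list[str], d: int) -> bool:
    """Simplified stand-ins for the three constraints on the slide."""
    # Uniform Tm, approximated by exactly 50% GC content.
    if 2 * sum(b in "GC" for b in cand) != len(cand):
        return False
    # No stable self-structure, approximated by the distance
    # between the word and its own reverse complement.
    if hamming(cand, revcomp(cand)) < d:
        return False
    # No mis-hybridization, approximated by the distance to every
    # accepted word and to its reverse complement.
    return all(hamming(cand, w) >= d and hamming(cand, revcomp(w)) >= d
               for w in code)


def generate(n_words: int, length: int, d: int, tries: int = 100_000,
             seed: int = 0) -> list[str]:
    """Draw random words and keep each one that fits the current set."""
    rng = random.Random(seed)
    code: list[str] = []
    for _ in range(tries):
        if len(code) == n_words:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if acceptable(cand, code, d):
            code.append(cand)
    return code


print(generate(n_words=5, length=8, d=3))
```

Accepted words are never revisited, so an early poor choice can block later candidates; this is the local-optimum drawback noted on the slide.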

Other approaches
- Hybrid randomized neighborhoods [TuHo03]
  - Stochastic local search (SLS) algorithm.
  - With probability ε, it searches neighbors by randomly mutating the current best sequences.
  - With probability 1 - ε, it moves in the direction in which the number of constraint conflicts decreases the most.
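The ε / (1 - ε) scheme can be illustrated with a small Python sketch. This is a simplified stochastic local search under assumed constraints (pairwise Hamming distance to other words and to their reverse complements); the neighborhood, constraint set, and parameter values are illustrative and do not reproduce the algorithm of [TuHo03].

```python
import random

WK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def revcomp(s: str) -> str:
    return "".join(WK[b] for b in reversed(s))


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def conflicts(code: list[str], d: int) -> int:
    """Number of word pairs violating the simplified distance constraints."""
    count = 0
    for i, u in enumerate(code):
        for v in code[i + 1:]:
            if hamming(u, v) < d or hamming(u, revcomp(v)) < d:
                count += 1
    return count


def mutate(code: list[str], i: int, pos: int, base: str) -> list[str]:
    """Copy of code in which position pos of word i is replaced by base."""
    word = code[i]
    return code[:i] + [word[:pos] + base + word[pos + 1:]] + code[i + 1:]


def sls(code: list[str], d: int, eps: float = 0.2, steps: int = 500,
        seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    best = current = list(code)
    for _ in range(steps):
        if conflicts(current, d) == 0:
            break
        i = rng.randrange(len(current))
        if rng.random() < eps:
            # Random step: mutate one random position of word i.
            pos = rng.randrange(len(current[i]))
            current = mutate(current, i, pos, rng.choice("ACGT"))
        else:
            # Greedy step: the single-base change of word i that
            # leaves the fewest conflicts.
            current = min(
                (mutate(current, i, p, b)
                 for p in range(len(current[i])) for b in "ACGT"),
                key=lambda c: conflicts(c, d),
            )
        if conflicts(current, d) < conflicts(best, d):
            best = current
    return best


start = ["ACGTACGT", "ACGTACGA", "TCGTACGT", "ACGTTCGT"]
result = sls(start, d=4)
print(conflicts(start, 4), "->", conflicts(result, 4))
```

The random steps let the search leave a plateau that pure greedy descent would stay on, at the cost of occasionally making the current set worse; the best set seen so far is kept separately for that reason.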

Other approaches
- GA (genetic algorithm)-based approach [ANH00]
  - Uses a GA in which the fitness of solutions is evaluated.
  - Criteria:
    - Restriction sites
    - GC content
    - Hamming distance
    - Same-base repetition

Other approaches
- Gibbs free energy based approach
  - Takes thermodynamics into consideration.
  - Uses Gibbs free energy as a stability measure.
  - Advantage
    - Greater accuracy, because it takes into account the stability of loops and of stacking between base pairs.
  - Disadvantage
    - More computation time is needed to calculate free energy.
  - How to decrease this computational cost? See [TKY05], [KNO08].

A formal language approach
- Design a set of structure-free codes in terms of WK-complementarity.
- Advantages
  - More reliable codes than the free-energy approach
  - More efficient algorithms for decision problems
- Disadvantage
  - Each structure needs to be considered separately.

A formal language approach (cont.)
- Abstraction of concepts
  - {A, C, G, T} → an alphabet V
  - WK-complementarity → an antimorphic involution θ
    - Involution: a mapping θ such that θ² is the identity (symmetry).
    - Antimorphism: θ(xy) = θ(y)θ(x) (opposite direction).
  - e.g., θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA
      5'-TCATCCGATTTCGGG-3'
      3'-AGTAGGCTAAAGCCC-5'
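The two defining properties can be checked directly on the slide's example. A minimal Python sketch, in which `theta` is an illustrative name for the WK-complementarity map on words over {A, C, G, T}:

```python
WK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """WK-complementarity as an antimorphic involution on {A, C, G, T}*."""
    return "".join(WK[b] for b in reversed(word))


x, y = "TCATCCG", "ATTTCGGG"
print(theta(x + y))                          # the example word from the slide
print(theta(theta(x + y)) == x + y)          # involution: theta applied twice is the identity
print(theta(x + y) == theta(y) + theta(x))   # antimorphism: the order of factors is reversed
```

The split of the example word into x and y is arbitrary; any factorization satisfies the antimorphism identity.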

Bond-free properties [KKS05]
- θ-non-overlapping
- θ-compliant
  - Strictly (a): property (a) together with θ-non-overlapping

Bond-free properties [KKS05] (cont.)
- θ-p-compliant
- θ-s-compliant

Bond-free properties [KKS05] (cont.)
- θ-free
- θ-sticky-free

Bond-free properties [KKS05] (cont.)
- θ-3'-overhang-free
- θ-5'-overhang-free
- θ-overhang-free: both of these

Decidability [KKS05]
- Theorem: the following problem is decidable in quadratic time with respect to |A|.
  - Input: an NFA A.
  - Output: Yes/No, depending on whether L(A) satisfies any of the following properties (or their strict versions):
    - θ-compliant, θ-p-compliant, θ-s-compliant
    - θ-sticky-free
    - θ-3'-overhang-free, θ-5'-overhang-free, θ-overhang-free

Decidability and maximality [KKS05]
- Theorem
  - Let M be a regular language and let L be a regular subset of M with a property ρ, where ρ is one of the following:
    - θ-compliant, θ-p-compliant, θ-s-compliant, or θ-sticky-free.
  - Then it is decidable whether L is a maximal subset of M satisfying ρ.

Secondary structure prevention
- Secondary structures:
  - Hairpin loop (or simply hairpin)
  - Internal loop
  - Multiple-branch loop
  - Pseudoknot
- They can be undesirable,
  - e.g., for Adleman's encoding technique for the Hamiltonian Path Problem (HPP).

Secondary structures
[Figure: a hairpin frame (multiple loop) and an internal loop, drawn on strands with labeled 5' and 3' ends]

Hairpin-free language
- A formal model of a hairpin: x v y θ(v) z.
    TAA---ACG---CGTTA---CGT---CGGT
     x     v      y     θ(v)   z
- Hairpin-freeness
  - Intuitively, it is almost impossible to prevent hairpins of short stack length (say 2 or 3).
  - Our aim is to prevent any hairpin whose stack length is at least some given parameter k.

Hairpin-free language [KKL06]
- A word w is (θ, k)-hairpin-free (abbr. hp(θ, k)-free) iff w = xvyθ(v)z implies |v| < k.
- hpf(θ, k): the set of all hp(θ, k)-free words in Σ*.
- hp(θ, k): Σ* - hpf(θ, k).
- A language L is called (θ, k)-hairpin-free iff L ⊆ hpf(θ, k).
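For a single word, hairpin-freeness can be tested by brute force. The Python sketch below assumes the definition that a word is hp(θ, k)-free when it cannot be written as x v y θ(v) z with |v| ≥ k, with θ taken to be the WK antimorphic involution; the function names are illustrative.

```python
WK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    return "".join(WK[b] for b in reversed(word))


def is_hairpin_free(w: str, k: int) -> bool:
    """True if w has no factorization x v y theta(v) z with |v| >= k.

    It suffices to test stems of length exactly k: a longer stem always
    contains a length-k stem whose partner still lies to its right.
    """
    for i in range(len(w) - 2 * k + 1):
        v = w[i:i + k]
        # theta(v) must start at or after the end of v (y may be empty).
        if w.find(theta(v), i + k) != -1:
            return False
    return True


# The word from the hairpin model: v = ACG is followed later by theta(v) = CGT.
print(is_hairpin_free("TAAACGCGTTACGTCGGT", 3))
```

This runs in time polynomial in the word length for fixed k, which is enough for checking individual code words; deciding the property for a whole regular language requires the automaton-based methods of [KKL06].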

Regularity of hairpin languages
[Figure: words of the form X w X θ(w) X]
- hp(θ, k) and hpf(θ, k) are regular.
- In particular, there exists a finite automaton M such that L(M) = hpf(θ, k).

Hairpin-freedom problems
- Hairpin-freedom problem
  - Input: a nondeterministic automaton M.
  - Output: Y/N, depending on whether L(M) is hp(θ, k)-free.
- Maximal hairpin-freedom problem
  - Input: a deterministic automaton M1 and an NFA M2.
  - Output: Y/N, depending on whether there is a word w in L(M2) - L(M1) such that L(M1) ∪ {w} is hp(θ, k)-free.

Decidability
- The hairpin-freedom problem for regular languages is decidable in time.
- The maximal hairpin-freedom problem for regular languages is decidable in time.

Hairpin frames
- Also called multiple loops.
- hp-frame of degree n:
- The figure is an example of an hp-frame of degree 3.
- A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.

Regularity & decidability
- hp(θ, fr, j): the set of all hp(fr, j)-words in Σ*.
- hpf(θ, fr, j): its complement in Σ*.
- The languages hp(θ, fr, j) and hpf(θ, fr, j) are regular.
- The hp(fr, j)-freedom problem is decidable in linear time.
- The maximal hp(fr, j)-freedom problem is decidable in time.

Application: DNA-HRAMs
[Figure: the strand --A-C-T-G-T-C-G-A-C-A-G-T-- shown folded as a hairpin and unfolded, with opening and closing transitions between the states labeled 0 and 1]
- An n-bit DNA-HRAM consists of n hairpins.
- Each hairpin stores 1 bit of information by forming and unfolding a hairpin, as shown above.

n-bit DNA-HRAM
- A concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n.
- For this word to work as an n-bit RAM, the following subword should be hp(θ, 20)-free.
- A DNA memory with 4 hairpins was proposed in [KYO08].

References
- [AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594
- [ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277
- [ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882
- [FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA 6 (2000)
- [KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA 10, LNCS 3384 (2005) 182-191
- [KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475
- [KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA 8, LNCS 2568 (2003) 205-214

References (cont.)
- [KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA 14 (2008) 152-161
- [KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298
- [RFL01] Ruben, A.J., Freeland, S.J., Landweber, L.F.: PUNCH: An evolutionary algorithm for optimizing bit selection. DNA 7 (2001) 150-160
- [Sha48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656
- [TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150
- [TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911

References (cont.)
- [TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433
- [YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22
- [ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148