This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

On Coding for an Abstracted Nanopore Channel for DNA Storage

Reyna Hulett,  Shubham Chandak,  and Mary Wootters Computer Science Department, Stanford University. [email protected]. RH’s research supported in part by a NSF Graduate Research Fellowship under grant DGE-1656518. Electrical Engineering Department, Stanford University. [email protected]. Computer Science and Electrical Engineering Departments, Stanford University. [email protected]. RH and MW’s research supported in part by NSF grants CCF-1844628 and SemiSynBio-1807371.
(January 2021)
Abstract

In the emerging field of DNA storage, data is encoded as DNA sequences and stored. The data is read out again by sequencing the stored DNA. Nanopore sequencing is a new sequencing technology that has many advantages over other methods; in particular, it is cheap, portable, and can support longer reads. While several practical coding schemes have been developed for DNA storage with nanopore sequencing, the theory is not well understood. Towards that end, we study a highly abstracted (deterministic) version of the nanopore sequencer, which highlights key features that make its analysis difficult. We develop methods and theory to understand the capacity of our abstracted model, and we propose efficient coding schemes and algorithms.

1 Introduction

In the emerging field of DNA storage, data is encoded as DNA sequences and stored; the data can be read back by sequencing the stored DNA. This technology promises high storage density and stability, as well as efficient duplication of data and random access using PCR-based technologies; we refer the reader to [4] and the references therein for an excellent overview.

Both the synthesis and sequencing processes are noisy, and as a result the data must be encoded before the synthesis stage to ensure accurate data recovery. Prior work has studied methods for encoding data in order to protect it against (aspects of) the noise introduced by these processes, for example [10, 3, 7, 22, 17, 14, 15, 5].

In this work, we focus on one particular stage of this noisy process, the nanopore sequencer. Nanopore sequencing—and in particular the MinION sequencer developed by Oxford Nanopore Technologies [12]—is an emerging sequencing technology. While initial works in DNA storage used Illumina sequencing, nanopore sequencing has been attracting interest due to its portability, low cost, and ability to support significantly longer reads than Illumina.

At a high level, the nanopore sequencer works as follows. A single strand of DNA is passed through a pore, leading to variations in a current readout. The pore can hold kk nucleotides (for our purposes, a nucleotide is just a value in {A,C,G,T}\{A,C,G,T\}) at a time; in practice kk is about six. The value of the current readout depends on which nucleotides are in the pore. For example, if the strand of DNA is AGCTAAT, and the pore sees the sub-strand AGCT, it will output one current reading. As the strand is passed through the pore, the contents of the pore will shift, say from AGCT to GCTA, then to CTAA, then to TAAT, and so on. This will result in a change in the current reading, according to some function ff that maps kk-mers to current levels. The process is depicted in Figure 1. Given the current readout, the goal is to recover the original DNA sequence.

kk bases in the pore at a time TTGGAACCCCGGTTAAtimecurrent
Figure 1: High-level view of the nanopore sequencer.

This channel is difficult to analyze, for several reasons. First, the output at any given time depends on k>1k>1 bases, and so there is inter-symbol interference. Second, there may be collisions in the output: two different pore contents may lead to similar current readouts. Third, the current readout can be noisy. Fourth, the amount of time that each kk-mer spends in the pore can vary, and sometimes never occur at all, leading to synchronization errors in the output.

Due to this complexity, typical basecallers (that is, methods for recovering the original sequence from the current readouts) rely on machine learning techniques [21, 9, 20]. This is effective in practice, but difficult to get a theoretical handle on. While there are several practical approaches to error correction for nanopore sequencing in DNA storage [22, 17, 14, 15, 5], the theoretical limits of this channel are not well understood. The work [16], discussed more below, proposed a probabilistic model of the nanopore sequencer and developed bounds on the capacity of the resulting channel. However, this problem is quite difficult; in particular, taking into account the possibility of synchronization errors places this problem as more difficult than determining the capacity of the deletion channel, a notorious open problem.

In this work, we take a different approach to developing a theoretical understanding of the nanopore sequencer. We develop a highly abstracted, deterministic model, which is meant to highlight the first two sources of noise mentioned above: the inter-symbol interference, and the prospect of collisions when multiple kk-mers lead to similar current outputs. We develop methods and theory to understand the capacity of our abstracted channel, and we propose efficient coding schemes and algorithms.

Our goal in this work is to open up a research direction towards a theoretical understanding of the nanopore sequencer. We fully acknowledge that our work is not yet practical; in particular our abstracted channel does not include noise in the current readings or synchronization errors that can arise from variable pore dwelling times. It is our hope that a solid understanding of our abstracted channel can then be combined with (much more well-studied) theories of coding for substitution and synchronization errors, in order to make progress on a more realistic channel model.

Contributions. Our contributions are as follows.

  1. 1.

    We propose a novel abstraction of the nanopore sequencer, which highlights the inter-symbol interference and the prospect of collisions. This abstraction is simple enough that it is tractable, yet complex enough that it (a) captures some fundamental properties of the nanopore sequencer, and (b) already gives rise to extremely interesting problems from a theoretical perspective. We hope that our abstraction will lead to future work in this area.

  2. 2.

    In Section 3, we develop an algorithm to determine the capacity of this channel. The algorithm is inefficient as pore size kk grows, but we can use it to determine the capacity for small kk.

  3. 3.

    In Section 4, we develop simple bounds on the “best” and “worst” capacity for an arbitrary pore size kk (where “best” and “worst” refers to the choice of the map from kk-mers to current readouts). We use our algorithms mentioned above to compare these bounds to the exact values for k=2k=2. These preliminary results suggest there may be some intriguing “bumpiness” in the worst-case capacity as the number of distinct current levels increases, but that the average-case is closer to best- than worst-case.

  4. 4.

    In Section 5, we develop efficient coding schemes for our abstracted channel. In particular, we develop a scheme that achieves rate CfϵC_{f}-\epsilon, where CfC_{f} is the capacity, that requires preprocessing time O(1ϵ41/ϵ)O\left(\frac{1}{\epsilon}\cdot 4^{1/\epsilon}\right), and has encoding/decoding time O(n)O(n), where the O()O(\cdot) notation hides a constant that depends on ff. We also present an efficient coding scheme that improves over the “naive” one with high probability, where the probability is over a random map from kk-mers to current readouts.

1.1 Related Work

DNA storage has been around as an idea since the 1960’s, and there has been renewed interest in it in the past decade, starting with the works [6, 10]. We refer the reader to [4] for an excellent survey on DNA storage. Most works on DNA storage have focused on other sequencing technologies like Illumina, but starting with the work of [22], there have been several works which developed practical DNA storage systems for nanopore sequencers, including [22, 17, 14, 15, 5]. All of these works developed practical coding schemes for DNA storage with a nanopore sequencer, but did not explicitly model the sequencer or analyze the theoretical limits.

Perhaps the work most related to ours is that of [16], who also developed a model of the nanopore sequencer and studied the capacity of their model. In particular, they give a multi-letter capacity formula for their channel, and derive computable bounds for the capacity, in terms of a Markov transition matrix PP that captures the probability of transitioning from one kk-mer to the next in the pore. Our work complements that work by focusing on different aspects of the problem. In more detail, that work differs from ours in several ways. First, the model in [16] is stochastic, while ours is deterministic. As a result, they take an information-theoretic approach, while our approach is more combinatorial in nature. Second, their model includes the possibility that kk-mers might get dropped; in particular, the binary deletion channel appears as one part of their model. Since understanding the capacity of the binary deletion channel is a difficult open problem, this makes their problem extremely difficult. In contrast, we ignore this aspect in order to more cleanly focus in on the effects of the inter-symbol interaction and the potential confusability between the kk-mers. Third, that work focuses on the nanopore sequencer for general applications (not necessarily for DNA storage), and in particular does not consider efficient coding schemes for their model. Finally, that work derives bounds on the channel capacity for a particular choice of (a stochastic analog of) the map ff from kk-mers to current levels, derived from experimental data. In contrast, we are interested in results for any ff, and in particular for the best and worst such functions ff. While the former direction is obviously of immediate interest for existing technology, it is our hope that understanding how the capacity of the channel changes with ff could perhaps guide how nanopore technology is developed in the future. We note that this is still an emerging area and the technology is evolving; see [11] for an overview of recent advances.

2 Abstract Model of Nanopore Sequencer

In this section we formalize our model. As mentioned above, our goal is to focus on the challenges of (1) inter-symbol interference, and (2) the possibility of different kk-mers producing similar current readouts. With that in mind, we propose a very simple model for the nanopore sequencer. As input, we take an encoded string s0s1sn1{A,C,G,T}ns_{0}s_{1}\cdots s_{n-1}\in\{A,C,G,T\}^{n}. This is transformed into a sequence of kk-mers according to a sliding window, to obtain (s0sk1),(s1sk),(s2sk+1),,(snksn1)(s_{0}\cdots s_{k-1}),(s_{1}\cdots s_{k}),(s_{2}\cdots s_{k+1}),\ldots,(s_{n-k}\cdots s_{n-1}). Finally, each of these kk-mers is mapped to one of bb distinct current levels, according to a mapping f:{A,C,G,T}k{0,1,,b1}f:\{A,C,G,T\}^{k}\to\{0,1,\ldots,b-1\}. This mapping ff defines the channel.

Definition 1 (Abstract Nanopore Channel).

Given a mapping f:{A,C,G,T}k{0,1,,b1}f:\{A,C,G,T\}^{k}\to\{0,1,\dots,b-1\} from kk-mers to current levels, let f:{A,C,G,T}{0,1,,b1}f^{*}:\{A,C,G,T\}^{*}\to\{0,1,\dots,b-1\}^{*} represent the mapping from DNA strands to their full current readout, i.e.,

f(s0s1sn1)=f(s0sk1)f(s1sk)f(snksn1).f^{*}(s_{0}s_{1}\cdots s_{n-1})=f(s_{0}\cdots s_{k-1})\circ f(s_{1}\cdots s_{k})\circ\cdots\circ f(s_{n-k}\cdots s_{n-1}).

We call ff^{*} the abstract nanopore channel given by ff.

Given a mapping ff, we are interested in the capacity CfC_{f} (in bits-per-base), which we define as follows.

Definition 2.

Let f:{A,C,G,T}k{0,1,,b1}f:\{A,C,G,T\}^{k}\to\{0,1,\ldots,b-1\}. The capacity CfC_{f} of the abstract nanopore channel determined by ff is defined as

Cf=limnlog|{c{0,1,,b1}nk+1s{A,C,G,T}n s.t. f(s)=c}|n.C_{f}=\lim_{n\to\infty}\frac{\log{|\{c\in\{0,1,\dots,b-1\}^{n-k+1}\mid\exists s\in\{A,C,G,T\}^{n}\text{ s.t. }f^{*}(s)=c\}|}}{n}. (1)

Observe that if 𝒮\mathcal{S} is a collection of strings s{A,C,G,T}ns\in\{A,C,G,T\}^{n} so that f(s)f^{*}(s) are all distinct for s𝒮s\in\mathcal{S}, then by assigning a different message to each s𝒮s\in\mathcal{S}, we can communicate perfectly (if not necessarily efficiently) across the abstract nanopore channel.

We will focus on balanced mappings ff. These are mappings so that |f1(i)||f^{-1}(i)| is the same for all i{0,1,,b1}i\in\{0,1,\ldots,b-1\}.

3 Computing the Capacity

Our first contribution is an algorithm (Algorithm 1 below) that computes CfC_{f}, given ff. The basic idea is to consider a finite automaton on the alphabet {0,1,,b1}\{0,1,\dots,b-1\} that accepts exactly those current readouts that can be generated by some DNA strand; then we use the transfer matrix method [2, 8] for counting accepting paths in that finite automaton.

Formally, an Nondeterministic Finite Automaton (NFA) is a tuple (Q,Σ,Δ,Q0,F)(Q,\Sigma,\Delta,Q_{0},F) where QQ is the set of states, Σ\Sigma is the alphabet, Δ:Q×Σ2Q\Delta:Q\times\Sigma\to 2^{Q} is the transition function, Q0QQ_{0}\subseteq Q is the set of initial states, and FQF\subseteq Q is the set of accepting states. Likewise, a Deterministic Finite Automaton (DFA) is a tuple (Q,Σ,δ,q0,F)(Q,\Sigma,\delta,q_{0},F), defined analogously except that δ:Q×ΣQ\delta:Q\times\Sigma\to Q is the transition function and there is only a single initial state q0q_{0}.

We consider the NFA MM and the DFA MM^{\prime} described in Algorithm 1. The NFA MM has states indexed by strings in {A,C,G,T}k1\{A,C,G,T\}^{k-1} and alphabet Σ={0,1,,b1}\Sigma=\{0,1,\ldots,b-1\}, the current levels. Given a state (s0sk2)(s_{0}\cdots s_{k-2}) and an input current level iΣi\in\Sigma, the NFA MM can transition to any other state of the form (s1sk1)(s_{1}\cdots s_{k-1}) so that f(s0s1sk1)=if(s_{0}s_{1}\cdots s_{k-1})=i. All states are accepting states. By construction, a sequence of current readings cΣnk+1c\in\Sigma^{n-k+1} can be an output f(s0s1sn1)f^{*}(s_{0}s_{1}\cdots s_{n-1}) of the nanopore sequencer if and only if cc is accepted by MM. The DFA MM^{\prime} accepts exactly the same strings as MM, and is obtained using the classic subset construction [18], such that each state of the DFA corresponds to a subset of the states of the NFA.

The transfer matrix method is a method for obtaining a generating function gf(z)g_{f}(z) for the number of states accepted by a given DFA. In more detail, given the transfer matrix TT for the automaton (so that Ti,jT_{i,j} is the number of transitions from state ii to state jj), the transfer matrix method gives an expression for a function gf(z)g_{f}(z) (in terms of the matrix TT), so that

gf(z)=m=0Nmzm,g_{f}(z)=\sum_{m=0}^{\infty}N_{m}z^{m},

so that NmN_{m} is the number of strings of length mm accepted by the finite automaton. We will use the notation [zm]gf(z)[z^{m}]g_{f}(z) to denote the coefficient NmN_{m} on zmz^{m}.

Lemma 1 (Transfer Matrix Method [2]).

Given a DFA D=(Q,Σ,δ,q0,F)D=(Q,\Sigma,\delta,q_{0},F), let D=(QqF,Σλ,δ,q0,{qF})D^{\prime}=(Q\cup q_{F},\Sigma\cup\lambda,\delta^{\prime},q_{0},\{q_{F}\}) be obtained from the DFA DD by adding a new state qFq_{F} and new symbol λ\lambda, along with λ\lambda-transitions from each of the accepting states of DD to qFq_{F}. Let TT be the transfer matrix of DD^{\prime}, where TijT_{ij} is the number of transitions from state ii to state jj, with q0q_{0} being state 0 and qFq_{F} being state |Q||Q|. Then the generating function for DD^{\prime} is

gf(z)=(1)|Q|×det(IzT:|Q|,0)zdet(IzT)g_{f}(z)=(-1)^{|Q|}\times\frac{\det(I-zT:|Q|,0)}{z\det(I-zT)}

where (IzT:|Q|,0)(I-zT:|Q|,0) is the minor of index |Q|,0|Q|,0, i.e., the matrix IzTI-zT with the |Q|th|Q|^{th} row and 0th0^{th} column deleted.

In our case, the number of strings of length mm accepted by our finite automaton will be the number of current readouts of length m=nk+1m=n-k+1 that can be generated by some DNA strand of length nn.

From gf(z)g_{f}(z), we can derive the asymptotic behavior of the number of possible current readouts required to determine CfC_{f}. As shown in the proof below, it is related to the smallest positive singularity of gf(z)g_{f}(z).

Q{A,C,G,T}k1Q\leftarrow\{A,C,G,T\}^{k-1}
Σ{0,1,,b1}\Sigma\leftarrow\{0,1,\dots,b-1\}
NFA M(Q,Σ,Δ,Q,Q)M\leftarrow(Q,\Sigma,\Delta,Q,Q) where Δ(s0sk2,i)={s1sk1f(s0s1sk1)=i}\Delta(s_{0}\cdots s_{k-2},i)=\{s_{1}\cdots s_{k-1}\mid f(s_{0}s_{1}\cdots s_{k-1})=i\}
DFA M(2Q,Σ,δ,q0,F)M^{\prime}\leftarrow(2^{Q},\Sigma,\delta,q_{0},F) obtained from the subset construction on MM
M(2Q{qF},Σ{λ},δ,q0,{qF})M^{\prime}\leftarrow(2^{Q}\cup\{q_{F}\},\Sigma\cup\{\lambda\},\delta^{\prime},q_{0},\{q_{F}\}) where
δ(q,σ)={δ(q,σ)q2Q and σλqFqF and σ=λotherwise\delta^{\prime}(q,\sigma)=\begin{cases}\delta(q,\sigma)&q\in 2^{Q}\text{ and }\sigma\neq\lambda\\ q_{F}&q\in F\text{ and }\sigma=\lambda\\ \emptyset&\text{otherwise}\end{cases}
TT\leftarrow transfer matrix of MM^{\prime}, i.e., Ti,j=T_{i,j}= number of transitions from state ii to state jj
gf(z)(1)2|Q|×det(IzT:2|Q|,0)zdet(IzT)g_{f}(z)\leftarrow(-1)^{2^{|Q|}}\times\frac{\det(I-zT:2^{|Q|},0)}{z\det(I-zT)} where (IzT:2|Q|,0)(I-zT:2^{|Q|},0) is the minor of index 2|Q|,02^{|Q|},0
rr\leftarrow smallest positive real root of the denominator of gf(z)g_{f}(z) (simplified)
return log1r\log\frac{1}{r}
Algorithm 1 Calculate CfC_{f}
Theorem 2.

Given a mapping f:{A,C,G,T}k{0,1,,b1}f:\{A,C,G,T\}^{k}\to\{0,1,\dots,b-1\}, Algorithm 1 computes CfC_{f}.

Proof.

First, observe that the NFA MM accepts exactly those current readouts that can be generated by some DNA strand under the mapping ff. For any current readout accepted by MM, consider an accepting path P=s0sk2,s1sk1,,smsm+k2P=s_{0}\cdots s_{k-2},s_{1}\cdots s_{k-1},\dots,s_{m}\cdots s_{m+k-2}. Then by construction, the DNA strand s0sm+k2s_{0}\cdots s_{m+k-2} generates that current readout. In the other direction, for any DNA strand s0sm+k2s_{0}\cdots s_{m+k-2}, the path PP is an accepting path for f(s0sm+k2)f^{*}(s_{0}\cdots s_{m+k-2}).

The DFA obtained from the subset construction accepts the same current readouts as MM [18]. Then we apply the transfer matrix method described in Lemma 1 to obtain the generating function gf(z)g_{f}(z), which counts the number of current readouts accepted by MM.

Finally, we need to extract the asymptotic behavior of [zm]gf(z)[z^{m}]g_{f}(z). Since gf(z)g_{f}(z) is a generating function with non-negative coefficients, the Exponential Growth Formula [8] tells us that lim supm([zm]gf(z))1/m=1/r\limsup_{m\to\infty}\left([z^{m}]g_{f}(z)\right)^{1/m}=1/r where rr is the smallest positive singularity of gf(z)g_{f}(z). Note that the number of possible current readouts is monotonically non-decreasing in mm, so the lim sup is equal to the limit. Therefore, based on equation (1), and the fact that the length of the current readouts for DNA strands of length nn is m=nk+1m=n-k+1,

Cf\displaystyle C_{f} =limnlog([znk+1]gf(z))n\displaystyle=\lim_{n\to\infty}\frac{\log\left([z^{n-k+1}]g_{f}(z)\right)}{n}
=limnnk+1nlog([znk+1]gf(z))1/(nk+1)\displaystyle=\lim_{n\to\infty}\frac{n-k+1}{n}\log\left([z^{n-k+1}]g_{f}(z)\right)^{1/(n-k+1)}
=log1r\displaystyle=\log\frac{1}{r}

3.1 Complexity of Algorithm 1

Unfortunately, the DFA obtained from the subset construction has 24k12^{4^{k-1}} states, so the runtime of Algorithm 1 is exponential in the problem size (i.e., the description length of ff).

It is possible that calculating CfC_{f} may be hard, because computing such a statistic for NFAs in general is PSPACE-complete [19]. Specifically, even determining whether Cf=logbC_{f}=\log b for b=2b=2 or 4 (Cf=logbC_{f}=\log b is the highest possible capacity over all mapping functions ff; see Lemma 5) is equivalent to determining whether the corresponding NFA is universal (i.e., accepts every string in the alphabet). We make this precise in the following lemma.

Lemma 3.

For b=2b=2 or 44 and any ff, CfC_{f} is equal to logb\log b if and only if every current readout in {0,1,,b1}\{0,1,\dots,b-1\}^{*} can be generated by some DNA strand.

Proof.

First we verify the “if” direction. Indeed, if every current readout can be generated, then Cf=limnlogbnk+1n=logbC_{f}=\lim_{n\to\infty}\frac{\log b^{n-k+1}}{n}=\log b.

To show the “only if” direction we show the contrapositive. Suppose there is some current readout cc of length \ell which cannot be generated by any DNA strand. Observe that any current readout which contains cc as a contiguous substring can also not be generated by any DNA strand (otherwise we could generate cc by truncation). Thus, consider any current readout of length nk+1n-k+1 which can be generated by some DNA strand. If we divide it up into sections of length \ell, obviously none of those sections can be equal to cc. So we see that

Cf\displaystyle C_{f} limnlog((b1)nk+1b(nk+1)mod)n\displaystyle\leq\lim_{n\to\infty}\frac{\log\left((b^{\ell}-1)^{\lfloor\frac{n-k+1}{\ell}\rfloor}\ b^{(n-k+1)\bmod\ell}\right)}{n}
=limnnk+1log(b1)n\displaystyle=\lim_{n\to\infty}\frac{\lfloor\frac{n-k+1}{\ell}\rfloor\log(b^{\ell}-1)}{n}
=1log(b1)<logb.\displaystyle=\frac{1}{\ell}\log(b^{\ell}-1)<\log b.

Since universality is PSPACE-complete for general NFAs, an efficient algorithm would have to in some way leverage the highly structured nature of the NFAs corresponding to some mapping ff.

Remark 4.

Although we currently don’t know how to calculate (or approximate) CfC_{f} efficiently, it is possible to approximate the capacity for strands of fixed length \ell, Cf()C_{f}(\ell). Specifically, there are existing randomized approximation algorithms for counting the number of strings of a given length accepted by an NFA, in time polynomial in \ell and the size of the NFA [13, 1].

4 Bounding the Capacity

The above approach for computing CfC_{f} exactly given a mapping ff is only practical for small window sizes kk. However, we can derive some general bounds that apply to any mapping ff. In particular, we are interested in the worst-case mapping ff (e.g., the one that attains minfCf\min_{f}C_{f}), as well as the best-case mapping ff (e.g., the one that attains maxfCf\max_{f}C_{f}). We prove the following lemma.

Lemma 5.

For a given window size kk and with bb distinct current levels, we have the following bounds on CfC_{f}:

  1. 1.

    maxfCf=min(log(b),2)\max_{f}C_{f}=\min(\log(b),2)

  2. 2.

    minfCflog(b)k\min_{f}C_{f}\geq\frac{\log(b)}{k}

  3. 3.

    minfCf1\min_{f}C_{f}\leq 1 when b2kb\leq 2^{k}

Proof.

We prove each bound independently:

  1. 1.

    Observe that the the size of the set in the numerator of equation (1) is bounded above both by the number of possible current readouts (bnk+1b^{n-k+1}) and the number of DNA strands (4n4^{n}). Therefore maxfCfmin(log(b),2)\max_{f}C_{f}\leq\min(\log(b),2).

    Furthermore, this rate is achievable. For b=2b=2, it is achieved by, for instance,

    f(s0sk1)={0if s0{A,C}1otherwisef(s_{0}\cdots s_{k-1})=\begin{cases}0&\text{if $s_{0}\in\{A,C\}$}\\ 1&\text{otherwise}\end{cases}

    since then all possible current readouts are achievable: given c{0,1,,b1}nk+1c\in\{0,1,\dots,b-1\}^{n-k+1}, simply replace all the 0’s with AA’s and the 1’s with GG’s, and add any k1k-1 bases to the end.

    For b4b\geq 4, the optimal rate is achieved by any mapping where f(s0sk1)=f(t0tk1)f(s_{0}\cdots s_{k-1})=f(t_{0}\cdots t_{k-1}) implies s0=t0s_{0}=t_{0}, since then every DNA strand ending with k1k-1 A’s gives rise to a unique current readout.

  2. 2.

    Regardless of the mapping ff, we can always generate at least bn/kb^{\lfloor n/k\rfloor} distinct current readouts: any choice of desired 0th0^{th}, kthk^{th}, 2kth2k^{th}, etc. current readings can be obtained because they correspond to non-overlapping length-kk windows. Thus minfCflimnlogbn/kn=log(b)k\min_{f}C_{f}\geq\lim_{n\to\infty}\frac{\log b^{\lfloor n/k\rfloor}}{n}=\frac{\log(b)}{k}.

  3. 3.

    When b2kb\leq 2^{k}, we can effectively make the bases AA and CC indistinguishable from each other, and likewise the bases GG and TT indistinguishable from each other. Consider any balanced mapping f:{A,G}k{0,1,,b1}f^{\prime}:\{A,G\}^{k}\to\{0,1,\dots,b-1\} — in this case, balanced means that for each ii,

    |{s0sk1{A,G}kf(s0sk1)=i}|=2k/b.\left\lvert\{s_{0}\cdots s_{k-1}\in\{A,G\}^{k}\mid f^{\prime}(s_{0}\cdots s_{k-1})=i\}\right\rvert=2^{k}/b.

    Then define f(s)=f(s)f(s)=f^{\prime}(s^{\prime}) where ss^{\prime} is obtain from ss by replacing all CC’s with AA’s and all TT’s with GG’s. Observe that ff is a balanced mapping, and also the size of the set in the numerator of equation (1) is now at most 2n2^{n}. Thus minfCflimnlog(2n)n=1\min_{f}C_{f}\leq\lim_{n\to\infty}\frac{\log(2^{n})}{n}=1.

Combining these bounds, we obtain the plot in Figure 3. One might wonder how tight the bounds shown in Figure 3 are, and also about where CfC_{f} lies for a “typical” mapping function ff. Using Algorithm 1, we are able to exactly calculate the maximum, minimum, and average CfC_{f} for k=2k=2, restricted to balanced mappings. These empirical results are shown in Figure 3.

1122011222k2^{k}2k+12^{k+1}4k4^{k}bbits-per-base
Figure 2: Bounds on the value of CfC_{f}. The dashed blue line is equal to maxfCf\max_{f}C_{f}. The value of minfCf\min_{f}C_{f} must lie somewhere in the red shaded area.
11224488161601122bbits-per-base
Figure 3: For k=2k=2, the true maxfCf\max_{f}C_{f} (dashed blue), minfCf\min_{f}C_{f} (solid red), and 𝔼fCf\mathbb{E}_{f}C_{f} (dotted green), superimposed over our bounds from Figure 3.

The lower bound is on minfCf\min_{f}C_{f} curious. Our theoretical bounds on minfCf\min_{f}C_{f} are not tight, but the k=2k=2 results suggest there may be some “bumpiness” in the true bound. Also based on the k=2k=2 results, it appears that random (balanced) mappings are closer to the best-case scenario than the worst-case. Section 5.2 illustrates one coding scheme that takes advantage of a property shared by most random mappings for b=2b=2, and we may hope there exist more general coding schemes that perform well for random mappings.

5 Coding Schemes

In the proof of Lemma 5, we observed that given a mapping ff with window size kk and bb distinct current levels, we can achieve a rate of log(b)k\frac{\log(b)}{k} by coding only on the 0th,kth,2kth,0^{th},k^{th},2k^{th}, etc. current readings. Here, we propose two generalizations of this trivial scheme to improve the rate with additional preprocessing based on the mapping ff.

5.1 “Block” Encoding

Instead of coding on non-overlapping windows of length kk, each of which maps to one of bb current readings, we may instead choose a block length k\ell\geq k and code on non-overlapping blocks of length \ell. This requires precomputing an alphabet Σ(){A,C,G,T}\Sigma(\ell)\subseteq\{A,C,G,T\}^{\ell} such that fΣ()f^{*}\restriction_{\Sigma(\ell)} is an injective function with the same range as f{A,C,G,T}f^{*}\restriction_{\{A,C,G,T\}^{\ell}}. That is, each DNA strand in Σ()\Sigma(\ell) generates a unique current readout, and together they generate every possible current readout of length k+1\ell-k+1 given the mapping ff. Provided our desired strand length nn is divisible by \ell or tends to infinity, this coding scheme obtains the rate

Cf()=log|{c{0,1,,b1}k+1s{A,C,G,T} s.t. f(s)=c}|=log|Σ()|C_{f}(\ell)=\frac{\log{|\{c\in\{0,1,\dots,b-1\}^{\ell-k+1}\mid\exists s\in\{A,C,G,T\}^{\ell}\text{ s.t. }f^{*}(s)=c\}|}}{\ell}=\frac{\log|\Sigma(\ell)|}{\ell}

Since limCf()=Cf\lim_{\ell\to\infty}C_{f}(\ell)=C_{f}, it is possible to get arbitrarily close to the optimal rate by picking a sufficiently large \ell.

Theorem 6.

Given a mapping ff, for all ϵ>0\epsilon>0, there is a coding scheme achieving rate CfϵC_{f}-\epsilon with linear time encoding and decoding that requires preprocessing time O(1ϵ41/ϵ)O(\frac{1}{\epsilon}\cdot 4^{1/\epsilon}), where the O()O(\cdot) notation hides constants that may depend on the mapping ff.

Proof.

We will exhibit such a scheme by choosing an appropriate block length \ell. Let |Σ()||\Sigma(\ell)| be the number of distinct current readouts that can be generated from DNA strands of length \ell. Equivalently, |Σ()|=[zk+1]gf(z)|\Sigma(\ell)|=[z^{\ell-k+1}]g_{f}(z) is the number of current readouts of length k+1\ell-k+1 accepted by the NFA MM constructed by Algorithm 1. Because gf(z)g_{f}(z) is a counting function for a regular language, and because [zk+1]gf(z)[z^{\ell-k+1}]g_{f}(z) is non-decreasing in \ell, the asymptotic behavior of [zk+1]gf(z)[z^{\ell-k+1}]g_{f}(z) has a simple form ([8], Theorem V.3):

|Σ()|=[zk+1]gf(z)=Θ(Π(k+1)(2Cf)k+1)|\Sigma(\ell)|=[z^{\ell-k+1}]g_{f}(z)=\Theta(\Pi(\ell-k+1)(2^{C_{f}})^{\ell-k+1})

where Π(x)\Pi(x) is a polynomial.

Thus, there must exist some 0\ell_{0} and some constant CC depending only on the mapping ff, such that for all 0\ell\geq\ell_{0}, |Σ()|C(2Cf)k+1|\Sigma(\ell)|\geq C(2^{C_{f}})^{\ell-k+1}. Therefore, if we choose max(0,Cf(k1)logCϵ)\ell\geq\max\left(\ell_{0},\frac{C_{f}\cdot(k-1)-\log C}{\epsilon}\right), we see that

Cf()\displaystyle C_{f}(\ell) =log|Σ()|\displaystyle=\frac{\log|\Sigma(\ell)|}{\ell}
log(C(2Cf)k+1)\displaystyle\geq\frac{\log\left(C(2^{C_{f}})^{\ell-k+1}\right)}{\ell}
=logC+Cf(k+1)\displaystyle=\frac{\log C+C_{f}\cdot(\ell-k+1)}{\ell}
=CfCf(k1)logC\displaystyle=C_{f}-\frac{C_{f}\cdot(k-1)-\log C}{\ell}
Cfϵ.\displaystyle\geq C_{f}-\epsilon.

Therefore, to achieve rate CfϵC_{f}-\epsilon, we should choose \ell proportional to 1/ϵ1/\epsilon, with the constants depending only on the mapping ff.

Given the block length \ell, we now describe how to compute the alphabet Σ()\Sigma(\ell). We will construct an array EE containing the alphabet Σ()\Sigma(\ell) and a hash table DD mapping current readouts of length k+1\ell-k+1 to the index in EE of the DNA strand that generates that readout.

For each DNA strand ss of length \ell, compute f(s)f^{*}(s). If f(s)f^{*}(s) has not yet been added to the hash table DD, append ss to the end of array EE and map f(s)f^{*}(s) to the appropriate index. This preprocessing takes O()O(\ell) time for each of 44^{\ell} DNA strands, for a total of O(4)=O(1ϵ41/ϵ)O(\ell\cdot 4^{\ell})=O(\frac{1}{\epsilon}\cdot 4^{1/\epsilon}).

Encoding and decoding are then quite straightforward: Convert the message to base |Σ()||\Sigma(\ell)| and use lookup table EE to map each digit to a block of length \ell. Similarly, decode each block of k+1\ell-k+1 current readings using the hash table DD (skipping the k1k-1 readings that straddle each pair of adjacent blocks). ∎

As an example, consider the case when b=2b=2 or 44 and ff is any mapping with the highest-possible capacity Cf=logbC_{f}=\log b. Per Lemma 3, this implies that every current readout can be generated by some DNA strand, so |Σ()|=bk+1|\Sigma(\ell)|=b^{\ell-k+1} and Cf()=(k+1)log(b)C_{f}(\ell)=\frac{(\ell-k+1)\log(b)}{\ell}. In this case, we can calculate the dependence of \ell on ϵ\epsilon exactly:

Cfϵ\displaystyle C_{f}-\epsilon =(k+1)log(b)\displaystyle=\frac{(\ell-k+1)\log(b)}{\ell}
ϵ\displaystyle-\epsilon =(k+1)log(b)\displaystyle=\frac{(-k+1)\log(b)}{\ell}
\displaystyle\ell =(k1)log(b)ϵ.\displaystyle=\frac{(k-1)\log(b)}{\epsilon}.

For instance, when k=2,b=2,k=2,b=2, and ϵ=0.1\epsilon=0.1, we would require =10\ell=10.

5.2 “Greedy” Encoding

Alternatively, instead of changing the lengths of the blocks used in the trivial scheme, we may relax the “non-overlapping” requirement.

This may not always be possible, depending on the mapping ff. Indeed, in the worst case, it is possible that once you have fixed the first kk bases, the next k1k-1 current readings may also be fixed—for instance, if the first kk bases are all AA, and all windows starting with AA map to the same current level. However, this pessimal case shouldn’t happen for most mappings.

Consider the case of b=2b=2 current levels, and suppose that for some length 1<k1\leq\ell<k, for every “prefix” p{A,C,G,T}p\in\{A,C,G,T\}^{\ell}, at least one window in f1(0)f^{-1}(0) and at least one window in f1(1)f^{-1}(1) starts with that prefix. Then it is possible to have every (k)th(k-\ell)^{th} current reading code for one binary symbol independently. This would give us a rate of 1k\frac{1}{k-\ell} rather than 1k\frac{1}{k}. Such an event is not too unlikely with a random mapping ff.

Lemma 7.

Given a random mapping ff with b=2b=2 distinct current levels and a length 1<k1\leq\ell<k, ff admits the coding scheme described above with rate 1k\frac{1}{k-\ell} with probability at least

142(12)4k.1-4^{\ell}\cdot 2\cdot\left(\frac{1}{2}\right)^{4^{k-\ell}}.

Furthermore,

  1. 1.

    we can determine whether such a scheme exists for a given ff and \ell in O(4k)O(4^{k}).

  2. 2.

    we can find the maximum \ell for which such a scheme exists for a given ff in O(k4k)O(k4^{k}).

  3. 3.

    we can implement such a scheme with O(k4k)O(k4^{k}) preprocessing and linear encoding and decoding.

Proof.

Let ff be a uniformly random mapping from {A,C,G,T}k{0,1}\{A,C,G,T\}^{k}\to\{0,1\} (not necessarily balanced). For a given prefix pp of length \ell and current level β\beta, the probability that all 4k4^{k-\ell} windows beginning with that prefix belong to f1(β)f^{-1}(\beta) is (12)4k\left(\frac{1}{2}\right)^{4^{k-\ell}}. Thus by the union bound, the probability of being able to execute this scheme for a given length \ell is at least

142(12)4k.1-4^{\ell}\cdot 2\cdot\left(\frac{1}{2}\right)^{4^{k-\ell}}.

This probability would only be higher if we restrict ff to be a random balanced mapping.

Given a specific mapping ff and desired length \ell, we can trivially check whether this property is satisfied by checking, for each prefix of length \ell, the current level corresponding to each of the 4k4^{k-\ell} windows beginning with that prefix, in time O(4k)O(4^{k}).

Naturally, we could determine the maximum \ell that works for a given mapping ff in O(k4k)O(k4^{k}) by repeating the above procedure for each {1,2,,k1}\ell\in\{1,2,\dots,k-1\}.

Furthermore, such a scheme could be straightforwardly implemented for a given ff. For instance, we could construct tries — trie 0 corresponding to f1(0)f^{-1}(0) and trie 1 corresponding to f1(1)f^{-1}(1) — in O(k4k)O(k4^{k}). For ease of encoding, we will also equip each leaf node s0s1sk1s_{0}s_{1}\cdots s_{k-1} of the tries with two pointers, pointing to the internal nodes sksk+1sk1s_{k-\ell}s_{k-\ell+1}\cdots s_{k-1} in both trie 0 and trie 1. To encode from binary to a DNA strand, we then start with any length-kk window in the trie corresponding to the first bit. For each subsequent bit, follow the pointer from our current leaf to the internal node of the trie corresponding to that bit, then append an arbitrary path from that prefix to a leaf in the appropriate trie. Decoding is simply a matter of extracting every (k)th(k-\ell)^{th} current reading. Thus both encoding and decoding are linear in the length of the DNA strand. ∎

For example, with k=6,=4k=6,\ell=4, we see that we can obtain a rate of 1/21/2 (compared to the trivial 1/61/6) with probability at least 111281-\frac{1}{128}.

6 Conclusion

We have initiated a theoretical study of coding for a highly abstracted version of the nanopore sequencer for DNA storage. We have provided algorithms and bounds for understanding the capacity, and we have given efficient coding schemes. However, we view our work as the tip of an iceberg. First, even for this abstracted model, much remains open. Can one derive better bounds on the capacity, or compute it efficiently for, say, k=6k=6? Second, we hope that our insights will generalize to more practical models, including with substitution and synchronization errors.

References

  • [1] Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram and Cristian Riveros “Efficient Logspace Classes for Enumeration, Counting, and Uniform Generation” In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2019
  • [2] Abdulbaki Aydin, Lucas Bang and Tevfik Bultan “Automata-Based Model Counting for String Constraints” In Computer Aided Verification Cham: Springer International Publishing, 2015, pp. 255–272
  • [3] Meinolf Blawat et al. “Forward error correction for DNA data storage” In Procedia Computer Science 80 Elsevier, 2016, pp. 1011–1022
  • [4] Luis Ceze, Jeff Nivala and Karin Strauss “Molecular digital data storage using DNA” In Nature Reviews Genetics 20.8 Nature Publishing Group, 2019, pp. 456–466
  • [5] Shubham Chandak et al. “Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes” In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8822–8826 IEEE
  • [6] George M Church, Yuan Gao and Sriram Kosuri “Next-generation digital information storage in DNA” In Science 337.6102 American Association for the Advancement of Science, 2012, pp. 1628–1628
  • [7] Yaniv Erlich and Dina Zielinski “DNA Fountain enables a robust and efficient storage architecture” In Science 355.6328 American Association for the Advancement of Science, 2017, pp. 950–954
  • [8] Philippe Flajolet and Robert Sedgewick “Analytic Combinatorics” Cambridge University Press, 2009
  • [9] “Flappie: Flip-flop basecaller for Oxford Nanopore reads” Last accessed: January 2021, https://github.com/nanoporetech/flappie
  • [10] Nick Goldman et al. “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA” In Nature 494.7435 Nature Publishing Group, 2013, pp. 77–80
  • [11] Yusuke Goto, Rena Akahori, Itaru Yanagi and Ken-ichi Takeda “Solid-state nanopores towards single-molecule DNA sequencing” In Journal of human genetics 65.1 Nature Publishing Group, 2020, pp. 69–77
  • [12] Miten Jain, Hugh E Olsen, Benedict Paten and Mark Akeson “The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community” In Genome biology 17.1 Springer, 2016, pp. 1–11
  • [13] Sampath Kannan, Z. Sweedyk and Steve Mahaney “Counting and Random Generation of Strings in Regular Languages” In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’95 San Francisco, California, USA: Society for IndustrialApplied Mathematics, 1995, pp. 551–557
  • [14] Henry H Lee et al. “Terminator-free template-independent enzymatic DNA synthesis for digital information storage” In Nature communications 10.1 Nature Publishing Group, 2019, pp. 1–12
  • [15] Randolph Lopez et al. “DNA assembly for nanopore data storage readout” In Nature communications 10.1 Nature Publishing Group, 2019, pp. 1–9
  • [16] Wei Mao, Suhas N Diggavi and Sreeram Kannan “Models and information-theoretic bounds for nanopore sequencing” In IEEE Transactions on Information Theory 64.4 IEEE, 2018, pp. 3216–3236
  • [17] Lee Organick et al. “Random access in large-scale DNA data storage” In Nature biotechnology 36.3 Nature Publishing Group, 2018, pp. 242
  • [18] M. O. Rabin and D. Scott “Finite Automata and Their Decision Problems” In IBM Journal of Research and Development 3.2, 1959, pp. 114–125 DOI: 10.1147/rd.32.0114
  • [19] Narad Rampersad, Jeffrey Shallit and Zhi Xu “The computational complexity of universality problems for prefixes, suffixes, factors, and subwords of regular languages” In CoRR abs/0907.0159, 2009 arXiv: http://arxiv.org/abs/0907.0159
  • [20] “Scrappie: a technology demonstrator for the Oxford Nanopore Research Algorithms group” Last accessed: January 2021, https://github.com/nanoporetech/scrappie
  • [21] Ryan R Wick, Louise M Judd and Kathryn E Holt “Performance of neural network basecalling tools for Oxford Nanopore sequencing” In Genome biology 20.1 Springer, 2019, pp. 1–10
  • [22] SM Hossein Tabatabaei Yazdi, Ryan Gabrys and Olgica Milenkovic “Portable and error-free DNA-based data storage” In Scientific reports 7.1 Nature Publishing Group, 2017, pp. 1–6