This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Fundamental Limits of Multiple Sequence Reconstruction from Substrings

Kel Levick University of Illinois, Urbana-Champaign
Urbana, IL, USA
[email protected]
   Ilan Shomorony University of Illinois, Urbana-Champaign
Urbana, IL, USA
[email protected]
Abstract

The problem of reconstructing a sequence from the set of its length-kk substrings has received considerable attention due to its various applications in genomics. We study an uncoded version of this problem where multiple random sources are to be simultaneously reconstructed from the union of their kk-mer sets. We consider an asymptotic regime where m=nαm=n^{\alpha} i.i.d. source sequences of length nn are to be reconstructed from the set of their substrings of length k=βlognk=\beta\log n, and seek to characterize the (α,β)(\alpha,\beta) pairs for which reconstruction is information-theoretically feasible. We show that, as nn\to\infty, the source sequences can be reconstructed if β>max(2α+1,α+2)\beta>\max(2\alpha+1,\alpha+2) and cannot be reconstructed if β<max(2α+1,α+32)\beta<\max(2\alpha+1,\alpha+\tfrac{3}{2}), characterizing the feasibility region almost completely. Interestingly, our result shows that there are feasible (α,β)(\alpha,\beta) pairs where repeats across the source strings abound, and non-trivial reconstruction algorithms are needed to achieve the fundamental limit.

I Introduction

The problem of reconstructing a sequence from the set of its length-kk substrings (or kk-mers) has a long history, dating back to the development of the sequencing-by-hybridization (SBH) technology in the late 1980s [1]. Due to this technological advance, significant attention was originally devoted to understanding when the set of kk-mers uniquely determines the underlying sequence, and to designing algorithms to perform the reconstruction efficiently [2, 3, 4, 5].

Inspired by Ukkonen’s 1992 paper [2], which characterized how large kk needs to be to allow reconstruction, a more recent line of work considered this problem from an information-theoretic perspective [6, 7, 8, 9, 10]. In particular, [7] considered a random source 𝐱{\bf x} (modeling a genome sequence), and provided conditions on how large kk needs to be (and how many length-kk substrings need to be observed) in order for the reconstruction of a source sequence 𝐱{\bf x} to be information-theoretically feasible.

In this work, we consider a multiple-source version of this problem. We assume that there are mm random source sequences 𝐱1,,𝐱m{\bf x}_{1},\dots,{\bf x}_{m}, each of length nn, and we wish to reconstruct them from the union of their sets of kk-mers. This multiple-source setting can be motivated by modern sequencing assays such as metagenomics and RNA-seq, where one observes substrings from a mixture of distinct source sequences (different microbial genomes in the case of metagenomics and different mRNA molecules in the case of RNA-seq). It can also be motivated by the fact that many bioinformatics algorithms first convert a set of reads into the set of their constituent kk-mers and then operate in kk-mer space. It is thus relevant to understand when such sets fully preserve the information in the original reads. Our setting is also connected with the problem of DNA-based data storage, where one wishes to recover a set of mm synthesized DNA molecules by jointly sequencing them. Our work can thus be seen as an uncoded version of the recent work on multi-strand reconstruction from substrings [11], or a random coding approach to it (see also “Related work” below).

Refer to caption
Figure 1: The blue region is given by β>max(2α+1,α+2)\beta>\max(2\alpha+1,\alpha+2), and corresponds to (α,β)(\alpha,\beta) pairs for which 𝐱1,,𝐱m{\bf x}_{1},\dots,{\bf x}_{m} are uniquely reconstructible from the set of kk-mers with vanishing error probability. The red region is given by β<max(2α+1,α+2)\beta<\max(2\alpha+1,\alpha+2), and corresponds to (α,β)(\alpha,\beta) pairs where the solution is not unique. We illustrate the types of kk-mer de Bruijn graphs that lead to the ambiguity. The white region corresponds to the unknown (α,β)(\alpha,\beta) pairs.

Our goal is to characterize how large kk needs to be to allow perfect reconstruction of all mm source sequences. We assume each source is i.i.d. Bern(12){\rm Bern}(\tfrac{1}{2}) and consider an asymptotic regime where m=nαm=n^{\alpha} and k=βlognk=\beta\log n, for some constants α,β>0\alpha,\beta>0. We aim to characterize for which (α,β)(\alpha,\beta) pairs the set of kk-mers uniquely determines the set of source sequences {𝐱1,,𝐱m}\{{\bf x}_{1},\dots,{\bf x}_{m}\} with vanishing error probability. Our main result, illustrated in Figure 1, characterizes the (α,β)(\alpha,\beta) feasibility region almost completely. We establish this feasibility region by carefully analyzing the (random) de Bruijn graph formed by the kk-mer set. In this graph, the set of source sequences corresponds to a set of paths that cover all edges. Our analysis shows that there are feasible (α,β)(\alpha,\beta) pairs where the kk-mer de Bruijn graph is fairly complex, due to the presence of repeats across the source sequences, but nevertheless there is a unique way to decompose the graph into mm source paths.

Related work: An important line of related work deals with the coded version of the problem we consider. Since a sequence 𝐱{\bf x} containing no repeats of length k1k-1 can be reconstructed from the set of its kk-mers [2], [12] studied how large a repeat-free code can be. They showed that, as long as k=βlognk=\lceil\beta\log n\rceil, for β>1\beta>1, one can asymptotically build (k1)(k-1)-repeat-free codes with rate 11. The setup where one seeks to reconstruct a codeword 𝐱{\bf x} from a noisy version of its set of kk-mers have also received considerable attention [13, 14, 15, 16].

A multi-sequence version of the coding problem in [12] was considered in [13]. Similar to our paper, they provide conditions on how kk needs to scale with nn and mm to guarantee that the asymptotic rate of all kk-mer-reconstructible codes approaches 1. Our work can be seen as an uncoded counterpart to [13], where we characterize how kk needs to scale with nn and mm to allow the reconstruction of random source sequences with vanishing failure probability. As an important distinction, we notice that the code constructions in [13] guarantee that there are no repeats across the mm sequences to be reconstructed, while in our setting, mm source sequences with repeats may still be reconstructible for some (α,β)(\alpha,\beta).

II Problem Setting and Main Results

We consider mm source sequences 𝐱1,,𝐱m{\bf x}_{1},\dots,{\bf x}_{m}, each of which is generated as an i.i.d. Ber(12){\rm Ber}(\tfrac{1}{2}) length-nn sequence. The goal is to reconstruct the set of source sequences 𝒳:={𝐱1,,𝐱m}{\mathcal{X}}:=\{{\bf x}_{1},\dots,{\bf x}_{m}\} from the set of all of their (k+1)(k+1)-mers, i.e.,

𝒴(𝒳):={𝐱i[j:j+k]:1im, 1jnk},\displaystyle{\mathcal{Y}}({\mathcal{X}}):=\{{\bf x}_{i}[j:j+k]:1\leq i\leq m,\,1\leq j\leq n-k\},

where 𝐱i[a:b]{\bf x}_{i}[a:b] is the substring of 𝐱i{\bf x}_{i} between (and including) the aath and bbth symbols. Here, we use (k+1)(k+1)-mers rather than kk-mers for notational convenience, and the final results are not affected. When clear from context, we write 𝒴{\mathcal{Y}} for 𝒴(𝒳){\mathcal{Y}}({\mathcal{X}}). We will also use the notation 𝐱i(a){\bf x}_{i}(a) to refer to the aath kk-mer of 𝐱i{\bf x}_{i}, for 1ank+11\leq a\leq n-k+1.

We consider an asymptotic regime where m=nαm=n^{\alpha} and k=βlognk=\beta\log n, for positive constants α\alpha and β\beta. As it turns out, this is the asymptotic regime of interest, where the fundamental limits are non-trivial. We say that 𝒴{\mathcal{Y}} reconstructs 𝒳{\mathcal{X}} if 𝒳{\mathcal{X}} is the unique set of mm length-nn sequences (up to relabeling) that could have generated 𝒴{\mathcal{Y}}. It is then natural to define a notion of feasibility as follows.

Definition 1

We say that (α,β)(\alpha,\beta) is a feasible pair if for m=nαm=n^{\alpha} and k=βlognk=\beta\log n, limnPr(𝒴 reconstructs 𝒳)=1.\lim_{n\to\infty}\Pr({\mathcal{Y}}\text{ reconstructs }{\mathcal{X}})=1.

The goal is to characterize the feasibility region +2{\mathcal{F}}\subset{\mathbb{R}}_{+}^{2}, defined as the set of all feasible (α,β)(\alpha,\beta) pairs.

An equivalent and useful representation of the kk-mer set 𝒴{\mathcal{Y}} is its de Bruijn graph G(𝒴)G({\mathcal{Y}}). The de Bruijn graph G(𝒴)G({\mathcal{Y}}) can be obtained by taking the set of kk-mers in 𝒴{\mathcal{Y}} (i.e., length-kk prefixes and suffixes of the (k+1)(k+1)-mers in 𝒴{\mathcal{Y}}) as the node set, and connecting two kk-mers with a directed edge if they are the prefix and suffix of a (k+1)(k+1)-mer in 𝒴{\mathcal{Y}}. Notice that the true set of sequences 𝒳{\mathcal{X}} can be seen as a set of paths on G(𝒴)G({\mathcal{Y}}) that cover all the edges (edges can be used multiple times).

Our main result is to characterize, almost completely, the feasibility region {\mathcal{F}} as follows:

Theorem 1

The following bounds on {\mathcal{F}} hold:

  • (α,β)(\alpha,\beta)\in{\mathcal{F}} if β>max(2α+1,α+2)\beta>\max(2\alpha+1,\alpha+2),

  • (α,β)(\alpha,\beta)\notin{\mathcal{F}} if β<max(2α+1,α+32)\beta<\max(2\alpha+1,\alpha+\tfrac{3}{2}).

As illustrated in Figure 1, Theorem 1 nearly completely characterizes {\mathcal{F}}. In particular, for α>1\alpha>1, (α,β)(\alpha,\beta)\in{\mathcal{F}} if β>2α+1\beta>2\alpha+1, and (α,β)(\alpha,\beta)\notin{\mathcal{F}} if β<2α+1\beta<2\alpha+1. For α<1\alpha<1 there is a small uncharacterized region of (α,β)(\alpha,\beta) pairs.

The rest of the paper is organized as follows. In Section III, as a warm-up, we characterize the repeat-free region, a (strict) subset of {\mathcal{F}} where there are no length-kk repeats across the source sequences (and the source strings are thus reconstructible). In Section IV, we prove the inner bound part of Theorem 1, and in Section V, we prove the outer bound part of Theorem 1.

III Repeat-Free Region

We start by considering a simple scenario where 𝒳\mathcal{X} can be recovered correctly: when each kk-mer is unique across all source sequences. In this case, it is well known that the de Bruijn graph G(𝒴)G({\mathcal{Y}}) will contain mm disjoint paths, each of which corresponds to one of the 𝐱i{\bf x}_{i}s, and 𝒳{\mathcal{X}} is thus reconstructible. We prove the following result.

Lemma 1

If β>2α+2\beta>2\alpha+2, with probability tending to 11 as nn\to\infty, there are no repeated kk-mers across or within 𝐱i{\bf x}_{i}s.

This region is marked in Figure 1. Notice that it does not include all pairs of feasible (α,β)(\alpha,\beta) pairs.

To prove Lemma 1, we note that, if iji\neq j,

Pr(𝐱i(a)=𝐱j(b))=2k=2βlogn=nβ.\displaystyle\Pr\left({\bf x}_{i}(a)={\bf x}_{j}(b)\right)=2^{-k}=2^{-\beta\log n}=n^{-\beta}.

Hence, the probability that any two strings 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j}, for iji\neq j have at least one kk-mer in common is upper-bounded using the union bound by

(m2)n2nβm2n2βn2α+2β.\displaystyle\binom{m}{2}n^{2}n^{-\beta}\leq m^{2}n^{2-\beta}\leq n^{2\alpha+2-\beta}.

Additionally, we need to consider the case where a single string contains a repeat kk-mer. In this case, not all of the kk-mers are independent due to potential overlaps. However, a careful analysis reveals that, even if |ba|<k|b-a|<k and 𝐱i(a){\bf x}_{i}(a) and 𝐱i(b){\bf x}_{i}(b) overlap, we still have Pr(𝐱i(a)=𝐱i(b))=nβ\Pr\left({\bf x}_{i}(a)={\bf x}_{i}(b)\right)=n^{-\beta} (see Section A for a similar analysis). We conclude that the probability that there is repeated kk-mer within some 𝐱i{\bf x}_{i} is at most mn2nβ=nα+2βmn^{2}n^{-\beta}=n^{\alpha+2-\beta}, and the probability that any kk-mer is repeated across or within 𝐱1,,𝐱m{\bf x}_{1},\dots,{\bf x}_{m} is upper-bounded by

n2α+2β+nα+2β,\displaystyle n^{2\alpha+2-\beta}+n^{\alpha+2-\beta},

which tends to zero for β>2α+2\beta>2\alpha+2. A converse result can also be shown; i.e., for β<2α+2\beta<2\alpha+2, there will be repeats with probability tending to 11 (but we do not describe that). In the next section, we show that there are feasible (α,β)(\alpha,\beta) pairs that are in this repeat-abundant regime.

IV Inner Bound to Feasibility Region

In order to characterize {\mathcal{F}}, we define the error event ={𝒴 does not reconstruct 𝒳}{\mathcal{E}}=\{{\mathcal{Y}}\text{ does not reconstruct }{\mathcal{X}}\}, and seek to characterize the asymptotic behavior of Pr()\Pr({\mathcal{E}}) for a given choice of (α,β)(\alpha,\beta). To do so, we will need to define several events related to repeats:

𝒜={\displaystyle{\mathcal{A}}=\{ i,a,b:𝐱i(a)=𝐱i(b)}\displaystyle\exists i,a,b:{\bf x}_{i}(a)={\bf x}_{i}(b)\}
={\displaystyle{\mathcal{B}}=\{ i,j,l,a,b,c,d:𝐱i(a)=𝐱j(c),𝐱i(b)=𝐱l(d),\displaystyle\exists i,j,l,a,b,c,d:{\bf x}_{i}(a)={\bf x}_{j}(c),{\bf x}_{i}(b)={\bf x}_{l}(d),
0ba<k, and either {jl} or\displaystyle 0\leq b-a<k,\text{ and either }\{j\neq l\}\text{ or }
{j=l and dcba}}\displaystyle\{j=l\text{ and }d-c\neq b-a\}\}
𝒞={\displaystyle{\mathcal{C}}=\{ i,j,a:𝐱i(1)=𝐱j(a)}{i,j,a:𝐱i(n)=𝐱j(a)}\displaystyle\exists i,j,a:{\bf x}_{i}(1)={\bf x}_{j}(a)\}\cup\{\exists i,j,a:{\bf x}_{i}(n^{\prime})={\bf x}_{j}(a)\}
𝒟={\displaystyle{\mathcal{D}}=\{ i,j,a:𝐱i(a)=𝐱j(a)}\displaystyle\exists i,j,a:{\bf x}_{i}(a)={\bf x}_{j}(a)\}

where we let n=nk+1n^{\prime}=n-k+1 for conciseness.

Event 𝒜{\mathcal{A}} is the event that there is at least one intra-sequence repeat. Event {\mathcal{B}} is the event that there are two pairs of repeats with an overlap (i.e., 𝐱i(a){\bf x}_{i}(a) and 𝐱i(b){\bf x}_{i}(b) overlap in 𝐱i{\bf x}_{i}), but they are not simply a longer length-tt repeat for t>kt>k. Notice that {\mathcal{B}} includes the event that a triple repeat 𝐱i(a)=𝐱j(b)=𝐱l(c){\bf x}_{i}(a)={\bf x}_{j}(b)={\bf x}_{l}(c) occurs. Event 𝒞{\mathcal{C}} is the event that the first or last kk-mer of some 𝐱i{\bf x}_{i} appears somewhere else. Event 𝒟{\mathcal{D}} is the event that the aath kk-mer of two sequences 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j} is the same. Notice that events 𝒜{\mathcal{A}}, {\mathcal{B}} and 𝒞{\mathcal{C}} are not error events per se (they each do not imply {\mathcal{E}}), but they will be useful in the analysis. On the other hand, 𝒟{\mathcal{D}}\Rightarrow{\mathcal{E}}, since given 𝒟{\mathcal{D}} it is possible to replace 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j} with two incorrect sequences

𝐱~i=𝐱i[1:a1]𝐱j[a:n]\displaystyle\tilde{\bf x}_{i}={\bf x}_{i}[1:a-1]{\bf x}_{j}[a:n]
𝐱~j=𝐱j[1:a1]𝐱i[a:n].\displaystyle\tilde{\bf x}_{j}={\bf x}_{j}[1:a-1]{\bf x}_{i}[a:n].

We will upper bound the error probability as

Pr()\displaystyle\Pr({\mathcal{E}}) Pr(𝒜𝒞𝒟)\displaystyle\leq\Pr({\mathcal{A}}\cup{\mathcal{B}}\cup{\mathcal{C}}\cup{\mathcal{D}}\cup{\mathcal{E}})
Pr(𝒜)+Pr()+Pr(𝒞)+Pr(𝒟)+\displaystyle\leq\Pr({\mathcal{A}})+\Pr({\mathcal{B}})+\Pr({\mathcal{C}})+\Pr({\mathcal{D}})+
+Pr(𝒜¯¯𝒞¯𝒟¯)\displaystyle\quad+\Pr({\mathcal{E}}\cap\bar{\mathcal{A}}\cap\bar{\mathcal{B}}\cap\bar{\mathcal{C}}\cap\bar{\mathcal{D}}) (1)

and analyze the conditions for each term to go to zero. First we notice that, by the union bound,

Pr(𝒜)mn2nβ=nα+2β,\displaystyle\Pr({\mathcal{A}})\leq mn^{2}n^{-\beta}=n^{\alpha+2-\beta}, (2)

from which we see that Pr(𝒜)0\Pr({\mathcal{A}})\to 0 if β>α+2\beta>\alpha+2. Similarly,

Pr()m3n32kn2β+m2n2(2k)2n2β\displaystyle\Pr({\mathcal{B}})\leq m^{3}n^{3}2kn^{-2\beta}+m^{2}n^{2}(2k)^{2}n^{-2\beta}
n3α+32β+o(1),\displaystyle\quad\quad\quad\leq n^{3\alpha+3-2\beta+o(1)}, (3)
Pr(𝒞)2m2nnβ=2n2α+1β,\displaystyle\Pr({\mathcal{C}})\leq 2m^{2}nn^{-\beta}=2n^{2\alpha+1-\beta}, (4)
Pr(𝒟)m2nnβ=4n2α+1β.\displaystyle\Pr({\mathcal{D}})\leq m^{2}nn^{-\beta}=4n^{2\alpha+1-\beta}. (5)

We point out that to establish (3), we need a careful analysis of the case where 𝐱j(c){\bf x}_{j}(c) and 𝐱j(d){\bf x}_{j}(d) also overlap, presented in Lemma 4.

As a result we see that, if β>max(α+2,32α+32,2α+1)\beta>\max(\alpha+2,\tfrac{3}{2}\alpha+\tfrac{3}{2},2\alpha+1), Pr(𝒜)+Pr()+Pr(𝒞)+Pr(𝒟)0\Pr({\mathcal{A}})+\Pr({\mathcal{B}})+\Pr({\mathcal{C}})+\Pr({\mathcal{D}})\to 0. What remains is to bound Pr(𝒜¯¯𝒞¯𝒟¯)\Pr({\mathcal{E}}\cap\bar{\mathcal{A}}\cap\bar{\mathcal{B}}\cap\bar{\mathcal{C}}\cap\bar{\mathcal{D}}).

Now suppose 𝒜¯¯𝒞¯𝒟¯\bar{\mathcal{A}}\cap\bar{\mathcal{B}}\cap\bar{\mathcal{C}}\cap\bar{\mathcal{D}} holds. Let the multiplicity μ𝒳(𝐯)\mu_{{\mathcal{X}}}({\bf v}) of an node 𝐯{\bf v} in G(𝒴)G({\mathcal{Y}}) be the number of times it is traversed by paths in 𝒳{\mathcal{X}}. Notice that ¯\bar{\mathcal{B}} guarantees that no node can be traversed three or more times, so given ¯\bar{\mathcal{B}}, node multiplicities can only be 11 or 22.

The proof of the following lemma is in Appendix B.

Lemma 2

Suppose 𝒜¯\bar{\mathcal{A}}, ¯\bar{\mathcal{B}}, and 𝒞¯\bar{\mathcal{C}} hold. Then the multiplicity of every node in G(𝒴)G({\mathcal{Y}}) is fully determined. In other words, every valid solution of G(𝒴)G({\mathcal{Y}}) traverses a given node in G(𝒴)G({\mathcal{Y}}) the same number of times.

Now suppose {\mathcal{E}} occurs. By definition, there is at least one 𝒳~𝒳\tilde{\mathcal{X}}\neq{\mathcal{X}} such that 𝒴(𝒳~)=𝒴(𝒳){\mathcal{Y}}(\tilde{\mathcal{X}})={\mathcal{Y}}({\mathcal{X}}). Let 𝒳~\tilde{\mathcal{X}} be one such choice with minimum set difference |𝒳𝒳~||{\mathcal{X}}-\tilde{\mathcal{X}}|. Let |𝒳𝒳~|=c|{\mathcal{X}}-\tilde{\mathcal{X}}|=c. By Lemma 2, we know that for all nodes 𝐯{\bf v},

μ𝒳(𝐯)=μ𝒳~(𝐯):=μG(𝐯).\displaystyle\mu_{{\mathcal{X}}}({\bf v})=\mu_{\tilde{\mathcal{X}}}({\bf v}):=\mu_{G}({\bf v}). (6)

Additionally,

μ𝒳𝒳~(𝐯)+μ𝒳𝒳~(𝐯)=μ𝒳(𝐯)\displaystyle\mu_{{\mathcal{X}}\cap\tilde{{\mathcal{X}}}}({\bf v})+\mu_{{\mathcal{X}}-\tilde{{\mathcal{X}}}}({\bf v})=\mu_{{\mathcal{X}}}({\bf v}) =μG(𝐯)\displaystyle=\mu_{G}({\bf v}) (7)
μ𝒳𝒳~(𝐯)+μ𝒳~𝒳(𝐯)=μ𝒳~(𝐯)\displaystyle\mu_{{\mathcal{X}}\cap\tilde{{\mathcal{X}}}}({\bf v})+\mu_{\tilde{{\mathcal{X}}}-{\mathcal{X}}}({\bf v})=\mu_{\tilde{\mathcal{X}}}({\bf v}) =μG(𝐯),\displaystyle=\mu_{G}({\bf v}), (8)

implying that μ𝒳𝒳~(𝐯)=μ𝒳~𝒳(𝐯)\mu_{{\mathcal{X}}-\tilde{{\mathcal{X}}}}({\bf v})=\mu_{\tilde{{\mathcal{X}}}-{\mathcal{X}}}({\bf v}).

We now build a “difference de Bruijn graph” G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) on the (k+1)(k+1)-mers in 𝒴{\mathcal{Y}} that are present in 𝒳𝒳~{\mathcal{X}}-\tilde{\mathcal{X}}. It is easy to see that G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) is the same as the subgraph of G(𝒴)G({\mathcal{Y}}) induced by the edges corresponding to (k+1)(k+1)-mers in 𝒳𝒳~{\mathcal{X}}-\tilde{\mathcal{X}} with nodes {𝐯:μ𝒳𝒳~(𝐯)>0}\{{\bf v}:\mu_{{\mathcal{X}}-\tilde{\mathcal{X}}}({\bf v})>0\}. Note that the nodes and their multiplicities are equivalent to those of the set {𝐯:μ𝒳~𝒳(𝐯)>0}\{{\bf v}:\mu_{\tilde{\mathcal{X}}-{\mathcal{X}}}({\bf v})>0\}, and the set of edges in G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) must also correspond to all of the (k+1)(k+1)-mers in 𝒴{\mathcal{Y}} that are present in 𝒳~𝒳\tilde{\mathcal{X}}-{\mathcal{X}}. Therefore G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) is the same as the de Bruijn graph on the (k+1)(k+1)-mers in 𝒴{\mathcal{Y}} that are present in 𝒳~𝒳\tilde{\mathcal{X}}-{\mathcal{X}}; i.e., G~(𝒴,𝒳,𝒳~)=G~(𝒴,𝒳~,𝒳)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}})=\tilde{G}({\mathcal{Y}},\tilde{\mathcal{X}},{\mathcal{X}}). Notice that G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) is a de Bruijn graph for both the set of cc missing true paths 𝒳𝒳~={𝐱1,,𝐱c}{\mathcal{X}}-\tilde{\mathcal{X}}=\{{\bf x}_{1},\dots,{\bf x}_{c}\} and for the set of incorrectly reconstructed paths 𝒳~𝒳={𝐱~1,,𝐱~c}\tilde{\mathcal{X}}-{\mathcal{X}}=\{\tilde{\bf x}_{1},\dots,\tilde{\bf x}_{c}\}.

Define a maximal shared subpath in G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) as a directed sequence 𝐩\mathbf{p} of nodes with maximal length such that μ𝒳𝒳~(𝐯)=2\mu_{{\mathcal{X}}-\tilde{\mathcal{X}}}({\bf v})=2 for all nodes 𝐯𝐩{\bf v}\in\mathbf{p}. We have the following lemma.

Lemma 3

Given 𝒜¯,¯,𝒞¯,𝒟¯\bar{\mathcal{A}},\bar{\mathcal{B}},\bar{\mathcal{C}},\bar{\mathcal{D}}, there are at least cc maximal shared subpaths in G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}).

Proof:

Given 𝒜¯\bar{\mathcal{A}}, ¯\bar{\mathcal{B}}, 𝒞¯\bar{\mathcal{C}}, and Lemma 2, the multiplicities of all nodes in G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) are fully determined and must be either 11 or 22. Now consider the paths corresponding to 𝐱~1,,𝐱~c\tilde{\bf x}_{1},\dots,\tilde{\bf x}_{c} in G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}). We claim that, given 𝒞¯\bar{\mathcal{C}} and 𝒟¯\bar{\mathcal{D}}, each 𝐱~i\tilde{\bf x}_{i} must traverse at least two maximal shared subpaths. Otherwise, suppose 𝐱~i\tilde{\bf x}_{i} only traverses a single maximal shared subpath, say 𝐱1[s:t]{\bf x}_{1}[s:t], which is shared between 𝐱1{\bf x}_{1} and 𝐱2{\bf x}_{2}. Given 𝒞¯\bar{\mathcal{C}}, 𝐱~i\tilde{\bf x}_{i} must start and end at nodes with zero in-degree and out-degree respectively. Hence we must be able to write 𝐱~i\tilde{\bf x}_{i}, without loss of generality, as

𝐱~i=𝐱1[1:s1]𝐱1[s:t]𝐱2[t+1:n]\displaystyle\tilde{\bf x}_{i}={\bf x}_{1}[1:s-1]{\bf x}_{1}[s:t]{\bf x}_{2}[t+1:n]

But this implies that 𝐱1(s)=𝐱2(s){\bf x}_{1}(s)={\bf x}_{2}(s), which contradicts 𝒟¯\bar{\mathcal{D}}.

Finally, since each 𝐱~i\tilde{\bf x}_{i}, i=1,,ci=1,\dots,c, must traverse at least two maximal shared subpaths and, given ¯\bar{\mathcal{B}}, every maximal shared subpath can be shared by at most two true paths 𝐱i{\bf x}_{i}, we conclude that there must be at least cc maximal shared subpaths in G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}). ∎

Figure 2 shows examples of the graph G~(𝒴,𝒳,𝒳~)\tilde{G}({\mathcal{Y}},{\mathcal{X}},\tilde{\mathcal{X}}) given 𝒜¯,¯,𝒞¯,𝒟¯\bar{\mathcal{A}},\bar{\mathcal{B}},\bar{\mathcal{C}},\bar{\mathcal{D}}, for c=2c=2 and c=4c=4. Notice that we have cc paths and cc maximal shared subpaths. Also notice that, in both examples, there exist two choices for the set of cc paths.

Lemma 3 allows us to analyze Pr()\Pr({\mathcal{E}}) by partitioning it into events c:={|𝒳~𝒳|=c}{\mathcal{E}}_{c}:={\mathcal{E}}\cap\{|\tilde{\mathcal{X}}-{\mathcal{X}}|=c\}. Notice that a maximal shared subpath between 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j} occurs with probability at most nβn^{-\beta}. Moreover, by the minimality of 𝒳~\tilde{\mathcal{X}}, each 𝐱~i\tilde{\bf x}_{i} must contain at least one maximal shared subpath. By the union bound, we conclude that

Pr(\displaystyle\Pr( c𝒜¯¯𝒞¯𝒟¯)\displaystyle{\mathcal{E}}_{c}\cap\bar{\mathcal{A}}\cap\bar{\mathcal{B}}\cap\bar{\mathcal{C}}\cap\bar{\mathcal{D}})
(mc)nc(cn)cnβcmcn2cnβc=nc(α+2β),\displaystyle\leq\binom{m}{c}n^{c}(cn)^{c}n^{-\beta c}\leq m^{c}n^{2c}n^{-\beta c}=n^{c(\alpha+2-\beta)},

where we used the fact that (mc)(m/c)c\tbinom{m}{c}\leq(m/c)^{c}. Summing over all values of cc, we obtain

Pr(𝒜¯¯𝒞¯𝒟¯)\displaystyle\Pr({\mathcal{E}}\cap\bar{\mathcal{A}}\cap\bar{\mathcal{B}}\cap\bar{\mathcal{C}}\cap\bar{\mathcal{D}}) c=2mnc(α+2β)\displaystyle\leq\sum_{c=2}^{m}n^{c(\alpha+2-\beta)}
=n2(α+2β)(1n(m1)(α+2β))1nα+2β,\displaystyle=\frac{n^{2(\alpha+2-\beta)}(1-n^{(m-1)(\alpha+2-\beta)})}{1-n^{\alpha+2-\beta}},

which tends to zero as nn\to\infty provided that β>α+2\beta>\alpha+2. Plugging this back into (1), we conclude that any (α,β)(\alpha,\beta) pair with β>max(α+2,32α+32,2α+1)=max(α+2,2α+1)\beta>\max(\alpha+2,\tfrac{3}{2}\alpha+\tfrac{3}{2},2\alpha+1)=\max(\alpha+2,2\alpha+1) is feasible. In other words, we showed that

{(α,β):β>max(α+2,2α+1)},\displaystyle\{(\alpha,\beta):\beta>\max(\alpha+2,2\alpha+1)\}\subseteq{\mathcal{F}},

concluding the proof of the achievability part of Theorem 1.

Refer to caption
Figure 2: Two examples of difference de Bruijn graphs G~\tilde{G}. (a) Example G~\tilde{G} for c=2c=2. The red and blue paths represent the correct strings, while the other reconstruction follows the paths red–blue–red and blue–red–blue. (b) Example G~\tilde{G} for c=4c=4. The red, blue, green, and purple paths represent the correct strings, while a possible incorrect reconstruction follows the paths red–blue–green, blue–red–purple, green–purple–red, purple–green–blue.

V Outer Bound to Feasibility Region

In this section, we will prove the outer bound part of Theorem 1. First, we will show that Pr()↛0\Pr({\mathcal{E}})\not\to 0 as nn\to\infty, if β<2α+1\beta<2\alpha+1. Second, we will show that Pr()↛0\Pr({\mathcal{E}})\not\to 0 if β<α+32\beta<\alpha+\tfrac{3}{2}. To do so, we will show that in those regimes, with positive probability, there will be multiple reconstructions.

Define Zi,j,a=𝟙𝐱i(a)=𝐱j(a)Z_{i,j,a}=\mathbbm{1}_{{\bf x}_{i}(a)={\bf x}_{j}(a)} for i<ji<j; i.e., Zi,j,aZ_{i,j,a} indicates whether 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j} share a kk-mer at a common position aa. Let V=i=1m1j=i+1ma=1nk+1Zi,j,aV=\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{a=1}^{n-k+1}Z_{i,j,a}, and note that 𝒟={V>0}{\mathcal{D}}=\{V>0\}. In particular, since any Zi,j,a=1Z_{i,j,a}=1 implies that there are multiple valid reconstructions, we have that Pr()Pr(V>0)\Pr({\mathcal{E}})\geq\Pr(V>0). By the Paley-Zygmund inequality, we then have

Pr()Pr(V>0)𝔼[V]2𝔼[V2].\displaystyle\Pr({\mathcal{E}})\geq\Pr(V>0)\geq\frac{\mathbb{E}[V]^{2}}{\mathbb{E}[V^{2}]}. (9)

It is straightforward to see 𝔼[Zi,j,a]=nβ\mathbb{E}[Z_{i,j,a}]=n^{-\beta}, and by the linearity of expectation,

𝔼[V]=(m2)(nk+1)nβ.\mathbb{E}[V]=\binom{m}{2}(n-k+1)n^{-\beta}.

The computation of E[V2]E[V^{2}] requires more care in handling Pr(𝐱i(a)=𝐱j(a),𝐱s(b)=𝐱t(b))\Pr({\bf x}_{i}(a)={\bf x}_{j}(a),{\bf x}_{s}(b)={\bf x}_{t}(b)). First, we note that when the pairs (i,j)(i,j) and (s,t)(s,t) are disjoint, the events 𝐱i(a)=𝐱j(a){\bf x}_{i}(a)={\bf x}_{j}(a) and 𝐱s(b)=𝐱t(b){\bf x}_{s}(b)={\bf x}_{t}(b) are independent:

Pr(𝐱i(a)=𝐱j(a),𝐱s(b)=𝐱t(b))\displaystyle\Pr({\bf x}_{i}(a)={\bf x}_{j}(a),{\bf x}_{s}(b)={\bf x}_{t}(b)) =n2β.\displaystyle=n^{-2\beta}.

The events are also independent when i=si=s but jtj\neq t, or j=tj=t but isi\neq s,

Pr(𝐱i(a)=𝐱j(a),𝐱i(b)=𝐱t(b))=n2β,\displaystyle\Pr({\bf x}_{i}(a)={\bf x}_{j}(a),{\bf x}_{i}(b)={\bf x}_{t}(b))=n^{-2\beta},

and when i=si=s, j=tj=t, and |ab|k|a-b|\geq k. However, when i=si=s, j=tj=t, and |ab|<k|a-b|<k,

Pr(𝐱i(a)=𝐱j(a),𝐱i(b)=𝐱j(b))=2(k+|ab|)nβ\displaystyle\Pr({\bf x}_{i}(a)={\bf x}_{j}(a),{\bf x}_{i}(b)={\bf x}_{j}(b))=2^{-(k+|a-b|)}\leq n^{-\beta}

as there are |ab||a-b| symbols of overlap between 𝐱i(a){\bf x}_{i}(a) and 𝐱j(b){\bf x}_{j}(b). Therefore,

𝔼[V2]\displaystyle\mathbb{E}[V^{2}] =𝔼[i=1m1j=i+1ma=1nk+1Zi,j,as=1m1t=s+1mb=1nk+1Zs,t,b]\displaystyle=\mathbb{E}\left[\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{a=1}^{n-k+1}Z_{i,j,a}\sum_{s=1}^{m-1}\sum_{t=s+1}^{m}\sum_{b=1}^{n-k+1}Z_{s,t,b}\right]
m4n22β+2m2n1βk.\displaystyle\leq m^{4}n^{2-2\beta}+2m^{2}n^{1-\beta}k. (10)

Combining 𝔼[V]\mathbb{E}[V] and (10), we have that

Pr()\displaystyle\Pr({\mathcal{E}}) (m2)2(nk+1)2n2βm4n22β+2m2n1βk\displaystyle\geq\frac{\binom{m}{2}^{2}(n-k+1)^{2}n^{-2\beta}}{m^{4}n^{2-2\beta}+2m^{2}n^{1-\beta}k} (11)
m4n2n2β4m4n22β+8m2n1βk\displaystyle\sim\frac{m^{4}n^{2}n^{-2\beta}}{4m^{4}n^{2-2\beta}+8m^{2}n^{1-\beta}k} (12)
=14+8βlognn2α+1β\displaystyle=\frac{1}{4+\frac{8\beta\log n}{n^{2\alpha+1-\beta}}} (13)

where (12) is asymptotically equivalent as nn\to\infty. Finally, we notice that if β<2α+1\beta<2\alpha+1, (13) converges to 14\frac{1}{4}, implying that Pr()\Pr({\mathcal{E}}) is bounded away from zero and (α,β)(\alpha,\beta)\notin{\mathcal{F}}.

Next, we show that Pr()↛0\Pr({\mathcal{E}})\not\to 0 when β<α+32\beta<\alpha+\frac{3}{2}. For this regime, we will define the event {\mathcal{H}} that there are two strings 𝐱i{\bf x}_{i}, 𝐱j{\bf x}_{j} and indices aa, bb, cc such that

𝐱i(a)=𝐱j(c),𝐱i(b)=𝐱j(c+ba).{\bf x}_{i}(a)={\bf x}_{j}(c),{\bf x}_{i}(b)={\bf x}_{j}(c+b-a).

This is illustrated in Figure 2(a). Essentially, we have two repeated kk-mers across sequences 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j}, and the gap between them is the same in both sequences. Given {\mathcal{H}}, an incorrect reconstruction exists because we can swap the middle parts of 𝐱i{\bf x}_{i} and 𝐱j{\bf x}_{j}, creating

𝐱~i=𝐱i[1:a1]𝐱j[c:c+ba1]𝐱i[b:n]\displaystyle\tilde{\bf x}_{i}={\bf x}_{i}[1:a-1]{\bf x}_{j}[c:c+b-a-1]{\bf x}_{i}[b:n]
𝐱~j=𝐱j[1:c1]𝐱i[a:b1]𝐱j[c+ba:n],\displaystyle\tilde{\bf x}_{j}={\bf x}_{j}[1:c-1]{\bf x}_{i}[a:b-1]{\bf x}_{j}[c+b-a:n],

which can be verified to be two length-nn sequences. We will show that Pr()↛0\Pr({\mathcal{H}})\not\to 0 as nn\to\infty.

Define Wi,j,𝐚=𝟙𝐱i(a1)=𝐱j(a3),𝐱i(a2)=𝐱j(a3+a2a1)W_{i,j,\mathbf{a}}=\mathbbm{1}_{{\bf x}_{i}(a_{1})={\bf x}_{j}(a_{3}),{\bf x}_{i}(a_{2})={\bf x}_{j}(a_{3}+a_{2}-a_{1})} for 𝐚=(a1,a2,a3)\mathbf{a}=(a_{1},a_{2},a_{3}) and let U=i,j𝐚Wi,j,𝐚U=\sum_{i,j}\sum_{\mathbf{a}}W_{i,j,\mathbf{a}}. We can again use the Paley-Zygmund inequality to find a lower bound on Pr(U>0)\Pr(U>0). The second moment 𝔼[U2]\mathbb{E}[U^{2}] can be computed similarly to 𝔼[V2]\mathbb{E}[V^{2}], considering separately the terms of the form 𝔼[Wij𝐚Wij𝐛]\mathbb{E}[W_{ij\mathbf{a}}W_{ij\mathbf{b}}] where the kk-mers indicated by 𝐚\mathbf{a} overlap with those indicated by 𝐛\mathbf{b}:

𝔼[U2]\displaystyle\mathbb{E}[U^{2}] =𝔼[i=1m1j=i+1m𝐚Wi,j,𝐚s=1m1t=s+1m𝐛Ws,t,𝐛]\displaystyle=\mathbb{E}\left[\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{\mathbf{a}}W_{i,j,\mathbf{a}}\sum_{s=1}^{m-1}\sum_{t=s+1}^{m}\sum_{\mathbf{b}}W_{s,t,\mathbf{b}}\right]
(m2n3n2β)2+m2n3(2k)3n2β\displaystyle\leq(m^{2}n^{3}n^{-2\beta})^{2}+m^{2}n^{3}(2k)^{3}n^{-2\beta} (14)

where the first term in (14) covers all kk-mers that are independent of each other and the second term covers the case where i=si=s, j=tj=t, and aia_{i} overlaps with bib_{i} for i=1,2,3i=1,2,3.

To compute the first moment, 𝔼[U]\mathbb{E}[U], we must account for which assignments of 𝐚\mathbf{a} are possible. More precisely, we need to count the vectors (a1,a2,a3)[1:n]3(a_{1},a_{2},a_{3})\in[1:n^{\prime}]^{3} such that a1<a2a_{1}<a_{2} and a2+a3a1[1:n]a_{2}+a_{3}-a_{1}\in[1:n^{\prime}], where n=nk+1n^{\prime}=n-k+1. This is cumbersome, so we consider a lower bound by considering a1[1:n/4]a_{1}\in[1:n^{\prime}/4], a2[n/4+1:n/2]a_{2}\in[n^{\prime}/4+1:n^{\prime}/2], and a3[n/4+1:n/2]a_{3}\in[n^{\prime}/4+1:n^{\prime}/2]. Notice that for any of the (n/4)3(n^{\prime}/4)^{3} choices, a2+a3a1[1:n]a_{2}+a_{3}-a_{1}\in[1:n^{\prime}]. We obtain the (rather loose) lower bound

𝔼[U](m2)(n4)3n2β.\displaystyle\mathbb{E}[U]\geq\binom{m}{2}\left(\frac{n^{\prime}}{4}\right)^{3}n^{-2\beta}. (15)

Therefore, we can lower bound the error probability as

Pr()\displaystyle\Pr({\mathcal{E}}) 𝔼[U]2𝔼[U2](m2)2(n/4)6n4βm4n6n6β+m2n3(2k)3n2β\displaystyle\geq\frac{\mathbb{E}[U]^{2}}{\mathbb{E}[U^{2}]}\geq\frac{\binom{m}{2}^{2}(n^{\prime}/4)^{6}n^{-4\beta}}{m^{4}n^{6}n^{-6\beta}+m^{2}n^{3}(2k)^{3}n^{-2\beta}}
47m4n64βm4n64β+8m2k3n32β\displaystyle\sim\frac{4^{-7}m^{4}n^{6-4\beta}}{m^{4}n^{6-4\beta}+8m^{2}k^{3}n^{3-2\beta}}
=471+8k3n2α+32β,\displaystyle=\frac{4^{-7}}{1+\frac{8k^{3}}{n^{2^{\alpha+3-2\beta}}},}

which tends to 474^{-7} as nn\to\infty, as long as β<α+32\beta<\alpha+\tfrac{3}{2}. We conclude that for any (α,β)(\alpha,\beta) such that β<α+32\beta<\alpha+\tfrac{3}{2} or β<2α+1\beta<2\alpha+1, Pr()↛0\Pr({\mathcal{E}})\not\to 0 as nn\to\infty. This concludes the proof of the outer bound in Theorem 1.

VI Concluding Remarks

In this paper, we have characterized the feasibility of a multiple sequence reconstruction problem using kk-mers for a large region of (α,β)(\alpha,\beta) parameters. We identified a region where the kk-mers of each string are disjoint with high probability and correct recovery of the source sequences is trivial, as well as a region where, while repeats exist, the only possible reconstruction of the source sequences is the correct one. In the latter repeat-abundant region, careful examination of the de Bruijn graph G(𝒴)G({\mathcal{Y}}) reveals the unique set of paths covering the graph. Finally, we have determined a region where incorrect reconstructions exist with positive probability.

The above findings have characterized nearly all (α,β)(\alpha,\beta) pairs for α>0\alpha>0, β>0\beta>0, except for a small region between β=max(2α+1,α+32)\beta=\max(2\alpha+1,\alpha+\frac{3}{2}) and β=α+2\beta=\alpha+2. We believe that part of the remaining region may be where the solution is not unique, and close analysis of additional difference de Bruijn graph topologies for a larger number of strings and shared kk-mers might result in a larger red region in Figure 1.

This work invites future study in several directions. One of these is generalization to other source distributions; some results here may be immediately trivially extended to a general i.i.d. source. Additionally, motivated by real-world DNA sequencing technologies, future work may analyze source reconstruction from noisy kk-mers or a random number of copies of each kk-mer.

Acknowledgements

The work of I.S. was supported in part by the National Science Foundation under CCF grants 2007597 and 2046991. The authors would like to thank the anonymous reviewers of this manuscript for their detailed feedback and suggestions for improvement.

References

  • [1] R. Dramanac, I. Labat, I. Brukner, and R. Crkvenjakov, “Sequencing of megabase plus DNA by hybridization: theory of the method,” Genomics, vol. 4, no. 2, pp. 114–128, 1989.
  • [2] E. Ukkonen, “Approximate string-matching with q-grams and maximal matches,” Theoretical computer science, vol. 92, no. 1, pp. 191–211, 1992.
  • [3] P. A. Pevzner, “l-tuple DNA sequencing: computer analysis,” Journal of Biomolecular structure and dynamics, vol. 7, no. 1, pp. 63–73, 1989.
  • [4] P. A. Pevzner, H. Tang, and M. S. Waterman, “An eulerian path approach to DNA fragment assembly,” Proceedings of the national academy of sciences, vol. 98, no. 17, pp. 9748–9753, 2001.
  • [5] S. S. Skiena and G. Sundaram, “Reconstructing strings from substrings,” Journal of Computational Biology, vol. 2, no. 2, pp. 333–353, 1995.
  • [6] F. P. Preparata and E. Upfal, “Sequencing-by-hybridization at the information-theory bound: an optimal algorithm,” in Proceedings of the fourth annual international conference on Computational molecular biology, 2000, pp. 245–253.
  • [7] A. S. Motahari, G. Bresler, and N. David, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273–6289, 2013.
  • [8] S. Mohajer, A. Motahari, and D. Tse, “Reference-based DNA shotgun sequencing: Information theoretic limits,” in 2013 IEEE International Symposium on Information Theory.   IEEE, 2013, pp. 1635–1639.
  • [9] A. Motahari, K. Ramchandran, D. Tse, and N. Ma, “Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads,” in 2013 IEEE International Symposium on Information Theory.   IEEE, 2013, pp. 1640–1644.
  • [10] I. Shomorony, T. Courtade, and D. Tse, “Do read errors matter for genome assembly?” in 2015 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2015, pp. 919–923.
  • [11] Y. Yehezkeally, S. Marcovich, and E. Yaakobi, “Multi-strand reconstruction from substrings,” in 2021 IEEE Information Theory Workshop (ITW), 2021, pp. 1–6.
  • [12] O. Elishco, R. Gabrys, E. Yaakobi, and M. Médard, “Repeat-free codes,” IEEE Transactions on Information Theory, vol. 67, no. 9, pp. 5749–5764, 2021.
  • [13] Y. Yehezkeally and N. Polyanskii, “On codes for the noisy substring channel,” in 2021 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2021, pp. 1700–1705.
  • [14] S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings spectrum,” IEEE Transactions on Information Theory, vol. 67, no. 7, pp. 4369–4384, 2021.
  • [15] Z. Chang, J. Chrisnata, M. F. Ezerman, and H. M. Kiah, “Rates of DNA sequence profiles for practical values of read lengths,” IEEE Transactions on Information Theory, vol. 63, no. 11, pp. 7166–7177, 2017.
  • [16] R. Gabrys and O. Milenkovic, “Unique reconstruction of coded sequences from multiset substring spectra,” in 2018 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2018, pp. 2540–2544.

Appendix A Proof of Equation (3)

Taking the union over the events {jl}\{j\neq l\} and {j=l}\{j=l\}, the probability of {\mathcal{B}} can be bounded as

Pr()\displaystyle\Pr({\mathcal{B}}) m3n32kn2β+m2n2(2k)2Pi,j(a,b,c,d)\displaystyle\leq m^{3}n^{3}2kn^{-2\beta}+m^{2}n^{2}(2k)^{2}P_{i,j}(a,b,c,d) (16)

where Pi,j(a,b,c,d)=Pr(𝐱i(a)=𝐱j(c),𝐱i(b)=𝐱j(d))P_{i,j}(a,b,c,d)=\Pr({\bf x}_{i}(a)={\bf x}_{j}(c),{\bf x}_{i}(b)={\bf x}_{j}(d)) is shown in the following lemma to be upper-bounded by n2βn^{-2\beta} when dcbad-c\neq b-a.

Lemma 4

For any two sequences 𝐱i{\bf x}_{i}, 𝐱j{\bf x}_{j} and indices aa, bb, cc, dd, with b>ab>a and either ba>1b-a>1 or |dc|>1|d-c|>1, the probability that 𝐱i(a)=𝐱j(c){\bf x}_{i}(a)={\bf x}_{j}(c) and 𝐱i(b)=𝐱j(d){\bf x}_{i}(b)={\bf x}_{j}(d) is upper-bounded by n2βn^{-2\beta}.

Proof:

If either bakb-a\geq k or |dc|k|d-c|\geq k, then it is clear that either 𝐱i(a){\bf x}_{i}(a) and 𝐱i(b){\bf x}_{i}(b) or 𝐱j(c){\bf x}_{j}(c) and 𝐱j(d){\bf x}_{j}(d), respectively, are independent of each other. Therefore, Pr(𝐱i(a)=𝐱j(c),𝐱i(b)=𝐱j(d))=Pr(𝐱i(a)=𝐱j(c))Pr(𝐱i(b)=𝐱j(d))=n2β\Pr({\bf x}_{i}(a)={\bf x}_{j}(c),{\bf x}_{i}(b)={\bf x}_{j}(d))=\Pr({\bf x}_{i}(a)={\bf x}_{j}(c))\Pr({\bf x}_{i}(b)={\bf x}_{j}(d))=n^{-2\beta}.

If both ba<kb-a<k and |dc|<k|d-c|<k, then closer analysis is required, since 𝐱i(a){\bf x}_{i}(a) overlaps with 𝐱i(b){\bf x}_{i}(b) and 𝐱j(c){\bf x}_{j}(c) overlaps with 𝐱j(d){\bf x}_{j}(d) in this case. Note that if ba=|dc|<kb-a=|d-c|<k, 𝐱i(a)=𝐱j(c){\bf x}_{i}(a)={\bf x}_{j}(c), and xi(a+1)xj(c+1)x_{i}(a+1)\neq x_{j}(c+1), then Pr(𝐱i(b)=𝐱j(d))=0\Pr({\bf x}_{i}(b)={\bf x}_{j}(d))=0, since bb and dd will contain the differing symbols in xi(a+1)x_{i}(a+1) and xj(c+1)x_{j}(c+1), respectively.

We first assume that both a<ba<b and c<dc<d. Let ti=k(ba)t_{i}=k-(b-a) and tj=k(dc)t_{j}=k-(d-c); without loss of generality, we assume that ti>tjt_{i}>t_{j} (if not, we can swap the labels of xix_{i} and xjx_{j}). These values represent the overlap in symbols between the substrings of each string. If 𝐱i(a)=𝐱j(c){\bf x}_{i}(a)={\bf x}_{j}(c), then

xi(a)\displaystyle x_{i}(a) =xj(c)\displaystyle=x_{j}(c)
\displaystyle\dots
xi(a+k1)\displaystyle x_{i}(a+k-1) =xj(c+k1),\displaystyle=x_{j}(c+k-1),

and if 𝐱i(b)=𝐱j(d){\bf x}_{i}(b)={\bf x}_{j}(d), then

xi(b)\displaystyle x_{i}(b) =xj(d)\displaystyle=x_{j}(d)
\displaystyle\dots
xi(b+k1)\displaystyle x_{i}(b+k-1) =xj(d+k1).\displaystyle=x_{j}(d+k-1).

Furthermore, if 𝐱i(a){\bf x}_{i}(a) and 𝐱i(b){\bf x}_{i}(b) overlap by tit_{i} symbols,

xi(a+kti)\displaystyle x_{i}(a+k-t_{i}) =xi(b)\displaystyle=x_{i}(b)
\displaystyle\dots
xi(a+k1)\displaystyle x_{i}(a+k-1) =xi(b+ti1).\displaystyle=x_{i}(b+t_{i}-1).

Similarly, if 𝐱j(c){\bf x}_{j}(c) and 𝐱j(d){\bf x}_{j}(d) overlap by tjt_{j} symbols,

xj(c+ktj)\displaystyle x_{j}(c+k-t_{j}) =xj(d)\displaystyle=x_{j}(d)
\displaystyle\dots
xj(c+k1)\displaystyle x_{j}(c+k-1) =xj(d+tj1).\displaystyle=x_{j}(d+t_{j}-1).

In other words, the last tit_{i} symbols of 𝐱i(a){\bf x}_{i}(a) are equal to the first tit_{i} symbols of 𝐱i(b){\bf x}_{i}(b), and the last tjt_{j} symbols of 𝐱j(c){\bf x}_{j}(c) are equal to the first tjt_{j} symbols of 𝐱j(d){\bf x}_{j}(d). Since 𝐱i(a)=𝐱j(c){\bf x}_{i}(a)={\bf x}_{j}(c) and 𝐱i(b)=𝐱j(d){\bf x}_{i}(b)={\bf x}_{j}(d), this implies that the last tjt_{j} symbols of 𝐱i(a){\bf x}_{i}(a) are equal to the first tjt_{j} symbols of 𝐱i(b){\bf x}_{i}(b), or

𝐱i(a+ktj)\displaystyle{\bf x}_{i}(a+k-t_{j}) =𝐱i(b)\displaystyle={\bf x}_{i}(b)
\displaystyle\dots
𝐱i(a+k1)\displaystyle{\bf x}_{i}(a+k-1) =𝐱i(b+tj1).\displaystyle={\bf x}_{i}(b+t_{j}-1).

The probability that there are these tjt_{j} replicated symbols in 𝐱i(a){\bf x}_{i}(a) is 2tj2^{-t_{j}}, the probability that 𝐱i(a)=𝐱j(c){\bf x}_{i}(a)={\bf x}_{j}(c) is 2k2^{-k}, and the probability that 𝐱i(b)=𝐱j(d){\bf x}_{i}(b)={\bf x}_{j}(d) given 𝐱i(a)=𝐱j(c){\bf x}_{i}(a)={\bf x}_{j}(c) and the requisite repetition of symbols in 𝐱i(a){\bf x}_{i}(a) is 2(ktj)2^{-(k-t_{j})}. Combining these probabilities,

Pr(𝐱i(a)=𝐱j(c),𝐱i(b)=𝐱j(d))=n2β.\displaystyle\Pr({\bf x}_{i}(a)={\bf x}_{j}(c),{\bf x}_{i}(b)={\bf x}_{j}(d))=n^{-2\beta}.

The case where a<ba<b and c>dc>d can be shown similarly; here, the last tjt_{j} symbols of 𝐱i(b){\bf x}_{i}(b) must be equal to the first tjt_{j} symbols of 𝐱i(a){\bf x}_{i}(a). ∎

Therefore,

Pr()\displaystyle\Pr({\mathcal{B}}) m3n32kn2β\displaystyle\leq m^{3}n^{3}2kn^{-2\beta} (17)
=n3α+32β+log(βlogn)logn\displaystyle=n^{3\alpha+3-2\beta+\frac{\log(\beta\log n)}{\log n}} (18)
=n3α+32β+o(1).\displaystyle=n^{3\alpha+3-2\beta+o(1)}. (19)

Appendix B Proof of Lemma 2

Proof:

This can be proven by demonstrating an algorithm that can sequentially and unambiguously determine the multiplicity of each node. According to 𝒞¯\bar{\mathcal{C}}, the first (start) and last (end) nodes of each string are known and have multiplicity 1. The node-labeling algorithm can thus label the mm start nodes with 1 and continue towards the end nodes until a node that has two incoming edges is reached. Due to ¯\bar{\mathcal{B}}, this node can be shared by a maximum of two sequences, and due to 𝒜¯\bar{\mathcal{A}}, each of those strings can contain the node’s corresponding kk-mer exactly once, so the multiplicity of this node must be 2. This node is then followed either by an edge pointing to a node that also has multiplicity 2 or two edges pointing to nodes of multiplicity 1. ∎