Fundamental Limits of Multiple Sequence Reconstruction from Substrings
Abstract
The problem of reconstructing a sequence from the set of its length- substrings has received considerable attention due to its various applications in genomics. We study an uncoded version of this problem where multiple random sources are to be simultaneously reconstructed from the union of their -mer sets. We consider an asymptotic regime where i.i.d. source sequences of length are to be reconstructed from the set of their substrings of length , and seek to characterize the pairs for which reconstruction is information-theoretically feasible. We show that, as , the source sequences can be reconstructed if and cannot be reconstructed if , characterizing the feasibility region almost completely. Interestingly, our result shows that there are feasible pairs where repeats across the source strings abound, and non-trivial reconstruction algorithms are needed to achieve the fundamental limit.
I Introduction
The problem of reconstructing a sequence from the set of its length- substrings (or -mers) has a long history, dating back to the development of the sequencing-by-hybridization (SBH) technology in the late 1980s [1]. Due to this technological advance, significant attention was originally devoted to understanding when the set of -mers uniquely determines the underlying sequence, and to designing algorithms to perform the reconstruction efficiently [2, 3, 4, 5].
Inspired by Ukkonen’s 1992 paper [2], which characterized how large needs to be to allow reconstruction, a more recent line of work considered this problem from an information-theoretic perspective [6, 7, 8, 9, 10]. In particular, [7] considered a random source (modeling a genome sequence), and provided conditions on how large needs to be (and how many length- substrings need to be observed) in order for the reconstruction of a source sequence to be information-theoretically feasible.
In this work, we consider a multiple-source version of this problem. We assume that there are random source sequences , each of length , and we wish to reconstruct them from the union of their sets of -mers. This multiple-source setting can be motivated by modern sequencing assays such as metagenomics and RNA-seq, where one observes substrings from a mixture of distinct source sequences (different microbial genomes in the case of metagenomics and different mRNA molecules in the case of RNA-seq). It can also be motivated by the fact that many bioinformatics algorithms first convert a set of reads into the set of their constituent -mers and then operate in -mer space. It is thus relevant to understand when such sets fully preserve the information in the original reads. Our setting is also connected with the problem of DNA-based data storage, where one wishes to recover a set of synthesized DNA molecules by jointly sequencing them. Our work can thus be seen as an uncoded version of the recent work on multi-strand reconstruction from substrings [11], or a random coding approach to it (see also “Related work” below).

Our goal is to characterize how large needs to be to allow perfect reconstruction of all source sequences. We assume each source is i.i.d. and consider an asymptotic regime where and , for some constants . We aim to characterize for which pairs the set of -mers uniquely determines the set of source sequences with vanishing error probability. Our main result, illustrated in Figure 1, characterizes the feasibility region almost completely. We establish this feasibility region by carefully analyzing the (random) de Bruijn graph formed by the -mer set. In this graph, the set of source sequences corresponds to a set of paths that cover all edges. Our analysis shows that there are feasible pairs where the -mer de Bruijn graph is fairly complex, due to the presence of repeats across the source sequences, but nevertheless there is a unique way to decompose the graph into source paths.
Related work: An important line of related work deals with the coded version of the problem we consider. Since a sequence containing no repeats of length can be reconstructed from the set of its -mers [2], [12] studied how large a repeat-free code can be. They showed that, as long as , for , one can asymptotically build -repeat-free codes with rate . The setup where one seeks to reconstruct a codeword from a noisy version of its set of -mers have also received considerable attention [13, 14, 15, 16].
A multi-sequence version of the coding problem in [12] was considered in [13]. Similar to our paper, they provide conditions on how needs to scale with and to guarantee that the asymptotic rate of all -mer-reconstructible codes approaches 1. Our work can be seen as an uncoded counterpart to [13], where we characterize how needs to scale with and to allow the reconstruction of random source sequences with vanishing failure probability. As an important distinction, we notice that the code constructions in [13] guarantee that there are no repeats across the sequences to be reconstructed, while in our setting, source sequences with repeats may still be reconstructible for some .
II Problem Setting and Main Results
We consider source sequences , each of which is generated as an i.i.d. length- sequence. The goal is to reconstruct the set of source sequences from the set of all of their -mers, i.e.,
where is the substring of between (and including) the th and th symbols. Here, we use -mers rather than -mers for notational convenience, and the final results are not affected. When clear from context, we write for . We will also use the notation to refer to the th -mer of , for .
We consider an asymptotic regime where and , for positive constants and . As it turns out, this is the asymptotic regime of interest, where the fundamental limits are non-trivial. We say that reconstructs if is the unique set of length- sequences (up to relabeling) that could have generated . It is then natural to define a notion of feasibility as follows.
Definition 1
We say that is a feasible pair if for and ,
The goal is to characterize the feasibility region , defined as the set of all feasible pairs.
An equivalent and useful representation of the -mer set is its de Bruijn graph . The de Bruijn graph can be obtained by taking the set of -mers in (i.e., length- prefixes and suffixes of the -mers in ) as the node set, and connecting two -mers with a directed edge if they are the prefix and suffix of a -mer in . Notice that the true set of sequences can be seen as a set of paths on that cover all the edges (edges can be used multiple times).
Our main result is to characterize, almost completely, the feasibility region as follows:
Theorem 1
The following bounds on hold:
-
•
if ,
-
•
if .
As illustrated in Figure 1, Theorem 1 nearly completely characterizes . In particular, for , if , and if . For there is a small uncharacterized region of pairs.
The rest of the paper is organized as follows. In Section III, as a warm-up, we characterize the repeat-free region, a (strict) subset of where there are no length- repeats across the source sequences (and the source strings are thus reconstructible). In Section IV, we prove the inner bound part of Theorem 1, and in Section V, we prove the outer bound part of Theorem 1.
III Repeat-Free Region
We start by considering a simple scenario where can be recovered correctly: when each -mer is unique across all source sequences. In this case, it is well known that the de Bruijn graph will contain disjoint paths, each of which corresponds to one of the s, and is thus reconstructible. We prove the following result.
Lemma 1
If , with probability tending to as , there are no repeated -mers across or within s.
This region is marked in Figure 1. Notice that it does not include all pairs of feasible pairs.
To prove Lemma 1, we note that, if ,
Hence, the probability that any two strings and , for have at least one -mer in common is upper-bounded using the union bound by
Additionally, we need to consider the case where a single string contains a repeat -mer. In this case, not all of the -mers are independent due to potential overlaps. However, a careful analysis reveals that, even if and and overlap, we still have (see Section A for a similar analysis). We conclude that the probability that there is repeated -mer within some is at most , and the probability that any -mer is repeated across or within is upper-bounded by
which tends to zero for . A converse result can also be shown; i.e., for , there will be repeats with probability tending to (but we do not describe that). In the next section, we show that there are feasible pairs that are in this repeat-abundant regime.
IV Inner Bound to Feasibility Region
In order to characterize , we define the error event , and seek to characterize the asymptotic behavior of for a given choice of . To do so, we will need to define several events related to repeats:
where we let for conciseness.
Event is the event that there is at least one intra-sequence repeat. Event is the event that there are two pairs of repeats with an overlap (i.e., and overlap in ), but they are not simply a longer length- repeat for . Notice that includes the event that a triple repeat occurs. Event is the event that the first or last -mer of some appears somewhere else. Event is the event that the th -mer of two sequences and is the same. Notice that events , and are not error events per se (they each do not imply ), but they will be useful in the analysis. On the other hand, , since given it is possible to replace and with two incorrect sequences
We will upper bound the error probability as
(1) |
and analyze the conditions for each term to go to zero. First we notice that, by the union bound,
(2) |
from which we see that if . Similarly,
(3) | |||
(4) | |||
(5) |
We point out that to establish (3), we need a careful analysis of the case where and also overlap, presented in Lemma 4.
As a result we see that, if , . What remains is to bound .
Now suppose holds. Let the multiplicity of an node in be the number of times it is traversed by paths in . Notice that guarantees that no node can be traversed three or more times, so given , node multiplicities can only be or .
The proof of the following lemma is in Appendix B.
Lemma 2
Suppose , , and hold. Then the multiplicity of every node in is fully determined. In other words, every valid solution of traverses a given node in the same number of times.
Now suppose occurs. By definition, there is at least one such that . Let be one such choice with minimum set difference . Let . By Lemma 2, we know that for all nodes ,
(6) |
Additionally,
(7) | ||||
(8) |
implying that .
We now build a “difference de Bruijn graph” on the -mers in that are present in . It is easy to see that is the same as the subgraph of induced by the edges corresponding to -mers in with nodes . Note that the nodes and their multiplicities are equivalent to those of the set , and the set of edges in must also correspond to all of the -mers in that are present in . Therefore is the same as the de Bruijn graph on the -mers in that are present in ; i.e., . Notice that is a de Bruijn graph for both the set of missing true paths and for the set of incorrectly reconstructed paths .
Define a maximal shared subpath in as a directed sequence of nodes with maximal length such that for all nodes . We have the following lemma.
Lemma 3
Given , there are at least maximal shared subpaths in .
Proof:
Given , , , and Lemma 2, the multiplicities of all nodes in are fully determined and must be either or . Now consider the paths corresponding to in . We claim that, given and , each must traverse at least two maximal shared subpaths. Otherwise, suppose only traverses a single maximal shared subpath, say , which is shared between and . Given , must start and end at nodes with zero in-degree and out-degree respectively. Hence we must be able to write , without loss of generality, as
But this implies that , which contradicts .
Finally, since each , , must traverse at least two maximal shared subpaths and, given , every maximal shared subpath can be shared by at most two true paths , we conclude that there must be at least maximal shared subpaths in . ∎
Figure 2 shows examples of the graph given , for and . Notice that we have paths and maximal shared subpaths. Also notice that, in both examples, there exist two choices for the set of paths.
Lemma 3 allows us to analyze by partitioning it into events . Notice that a maximal shared subpath between and occurs with probability at most . Moreover, by the minimality of , each must contain at least one maximal shared subpath. By the union bound, we conclude that
where we used the fact that . Summing over all values of , we obtain
which tends to zero as provided that . Plugging this back into (1), we conclude that any pair with is feasible. In other words, we showed that
concluding the proof of the achievability part of Theorem 1.

V Outer Bound to Feasibility Region
In this section, we will prove the outer bound part of Theorem 1. First, we will show that as , if . Second, we will show that if . To do so, we will show that in those regimes, with positive probability, there will be multiple reconstructions.
Define for ; i.e., indicates whether and share a -mer at a common position . Let , and note that . In particular, since any implies that there are multiple valid reconstructions, we have that . By the Paley-Zygmund inequality, we then have
(9) |
It is straightforward to see , and by the linearity of expectation,
The computation of requires more care in handling . First, we note that when the pairs and are disjoint, the events and are independent:
The events are also independent when but , or but ,
and when , , and . However, when , , and ,
as there are symbols of overlap between and . Therefore,
(10) |
Combining and (10), we have that
(11) | ||||
(12) | ||||
(13) |
where (12) is asymptotically equivalent as . Finally, we notice that if , (13) converges to , implying that is bounded away from zero and .
Next, we show that when . For this regime, we will define the event that there are two strings , and indices , , such that
This is illustrated in Figure 2(a). Essentially, we have two repeated -mers across sequences and , and the gap between them is the same in both sequences. Given , an incorrect reconstruction exists because we can swap the middle parts of and , creating
which can be verified to be two length- sequences. We will show that as .
Define for and let . We can again use the Paley-Zygmund inequality to find a lower bound on . The second moment can be computed similarly to , considering separately the terms of the form where the -mers indicated by overlap with those indicated by :
(14) |
where the first term in (14) covers all -mers that are independent of each other and the second term covers the case where , , and overlaps with for .
To compute the first moment, , we must account for which assignments of are possible. More precisely, we need to count the vectors such that and , where . This is cumbersome, so we consider a lower bound by considering , , and . Notice that for any of the choices, . We obtain the (rather loose) lower bound
(15) |
Therefore, we can lower bound the error probability as
which tends to as , as long as . We conclude that for any such that or , as . This concludes the proof of the outer bound in Theorem 1.
VI Concluding Remarks
In this paper, we have characterized the feasibility of a multiple sequence reconstruction problem using -mers for a large region of parameters. We identified a region where the -mers of each string are disjoint with high probability and correct recovery of the source sequences is trivial, as well as a region where, while repeats exist, the only possible reconstruction of the source sequences is the correct one. In the latter repeat-abundant region, careful examination of the de Bruijn graph reveals the unique set of paths covering the graph. Finally, we have determined a region where incorrect reconstructions exist with positive probability.
The above findings have characterized nearly all pairs for , , except for a small region between and . We believe that part of the remaining region may be where the solution is not unique, and close analysis of additional difference de Bruijn graph topologies for a larger number of strings and shared -mers might result in a larger red region in Figure 1.
This work invites future study in several directions. One of these is generalization to other source distributions; some results here may be immediately trivially extended to a general i.i.d. source. Additionally, motivated by real-world DNA sequencing technologies, future work may analyze source reconstruction from noisy -mers or a random number of copies of each -mer.
Acknowledgements
The work of I.S. was supported in part by the National Science Foundation under CCF grants 2007597 and 2046991. The authors would like to thank the anonymous reviewers of this manuscript for their detailed feedback and suggestions for improvement.
References
- [1] R. Dramanac, I. Labat, I. Brukner, and R. Crkvenjakov, “Sequencing of megabase plus DNA by hybridization: theory of the method,” Genomics, vol. 4, no. 2, pp. 114–128, 1989.
- [2] E. Ukkonen, “Approximate string-matching with q-grams and maximal matches,” Theoretical computer science, vol. 92, no. 1, pp. 191–211, 1992.
- [3] P. A. Pevzner, “l-tuple DNA sequencing: computer analysis,” Journal of Biomolecular structure and dynamics, vol. 7, no. 1, pp. 63–73, 1989.
- [4] P. A. Pevzner, H. Tang, and M. S. Waterman, “An eulerian path approach to DNA fragment assembly,” Proceedings of the national academy of sciences, vol. 98, no. 17, pp. 9748–9753, 2001.
- [5] S. S. Skiena and G. Sundaram, “Reconstructing strings from substrings,” Journal of Computational Biology, vol. 2, no. 2, pp. 333–353, 1995.
- [6] F. P. Preparata and E. Upfal, “Sequencing-by-hybridization at the information-theory bound: an optimal algorithm,” in Proceedings of the fourth annual international conference on Computational molecular biology, 2000, pp. 245–253.
- [7] A. S. Motahari, G. Bresler, and N. David, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273–6289, 2013.
- [8] S. Mohajer, A. Motahari, and D. Tse, “Reference-based DNA shotgun sequencing: Information theoretic limits,” in 2013 IEEE International Symposium on Information Theory. IEEE, 2013, pp. 1635–1639.
- [9] A. Motahari, K. Ramchandran, D. Tse, and N. Ma, “Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads,” in 2013 IEEE International Symposium on Information Theory. IEEE, 2013, pp. 1640–1644.
- [10] I. Shomorony, T. Courtade, and D. Tse, “Do read errors matter for genome assembly?” in 2015 IEEE International Symposium on Information Theory (ISIT). IEEE, 2015, pp. 919–923.
- [11] Y. Yehezkeally, S. Marcovich, and E. Yaakobi, “Multi-strand reconstruction from substrings,” in 2021 IEEE Information Theory Workshop (ITW), 2021, pp. 1–6.
- [12] O. Elishco, R. Gabrys, E. Yaakobi, and M. Médard, “Repeat-free codes,” IEEE Transactions on Information Theory, vol. 67, no. 9, pp. 5749–5764, 2021.
- [13] Y. Yehezkeally and N. Polyanskii, “On codes for the noisy substring channel,” in 2021 IEEE International Symposium on Information Theory (ISIT). IEEE, 2021, pp. 1700–1705.
- [14] S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings spectrum,” IEEE Transactions on Information Theory, vol. 67, no. 7, pp. 4369–4384, 2021.
- [15] Z. Chang, J. Chrisnata, M. F. Ezerman, and H. M. Kiah, “Rates of DNA sequence profiles for practical values of read lengths,” IEEE Transactions on Information Theory, vol. 63, no. 11, pp. 7166–7177, 2017.
- [16] R. Gabrys and O. Milenkovic, “Unique reconstruction of coded sequences from multiset substring spectra,” in 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 2540–2544.
Appendix A Proof of Equation (3)
Taking the union over the events and , the probability of can be bounded as
(16) |
where is shown in the following lemma to be upper-bounded by when .
Lemma 4
For any two sequences , and indices , , , , with and either or , the probability that and is upper-bounded by .
Proof:
If either or , then it is clear that either and or and , respectively, are independent of each other. Therefore, .
If both and , then closer analysis is required, since overlaps with and overlaps with in this case. Note that if , , and , then , since and will contain the differing symbols in and , respectively.
We first assume that both and . Let and ; without loss of generality, we assume that (if not, we can swap the labels of and ). These values represent the overlap in symbols between the substrings of each string. If , then
and if , then
Furthermore, if and overlap by symbols,
Similarly, if and overlap by symbols,
In other words, the last symbols of are equal to the first symbols of , and the last symbols of are equal to the first symbols of . Since and , this implies that the last symbols of are equal to the first symbols of , or
The probability that there are these replicated symbols in is , the probability that is , and the probability that given and the requisite repetition of symbols in is . Combining these probabilities,
The case where and can be shown similarly; here, the last symbols of must be equal to the first symbols of . ∎
Therefore,
(17) | ||||
(18) | ||||
(19) |
Appendix B Proof of Lemma 2
Proof:
This can be proven by demonstrating an algorithm that can sequentially and unambiguously determine the multiplicity of each node. According to , the first (start) and last (end) nodes of each string are known and have multiplicity 1. The node-labeling algorithm can thus label the start nodes with 1 and continue towards the end nodes until a node that has two incoming edges is reached. Due to , this node can be shared by a maximum of two sequences, and due to , each of those strings can contain the node’s corresponding -mer exactly once, so the multiplicity of this node must be 2. This node is then followed either by an edge pointing to a node that also has multiplicity 2 or two edges pointing to nodes of multiplicity 1. ∎