Reconstruction of Sequences Distorted by Two Insertions
Abstract
Reconstruction codes are generalizations of error-correcting codes that correct errors with the help of a given number of noisy reads. The study of such codes was initiated by Levenshtein in 2001 and has developed rapidly in recent years due to applications in modern storage devices such as racetrack memories and DNA storage. The central problem on this topic is to design codes with redundancy as small as possible for a given number of noisy reads. In this paper, the minimum redundancy of such codes for binary channels with exactly two insertions is determined asymptotically for all values of the number of noisy reads. Previously, such codes were studied only for channels with a single edit error or with two deletion errors.
Index Terms:
Sequence reconstruction, insertion channel, DNA storage
I Introduction
The study of sequence reconstruction problems was initiated by Levenshtein in [1, 2, 3] to combat errors by repeatedly transmitting a message without coding, which is of interest in fields like informatics, molecular biology, and chemistry. The setting of this problem is as follows: the sender transmits a sequence through several different noisy channels and the receiver obtains all the channel outputs, called noisy reads; the receiver then aims to reconstruct the transmitted sequence with the help of these outputs. There are probabilistic and combinatorial versions of this problem. In the probabilistic version, the noisy reads are obtained by transmitting the sequence through probabilistic channels, and the goal is to minimize the number of noisy reads required for reconstruction with high probability; see [2, 4, 5, 6, 7, 8, 9] and the references therein. In the combinatorial version, the noisy reads are obtained by introducing the maximum number of errors, and the goal is to determine the minimum number of noisy reads needed for zero-error reconstruction [2].
In this paper, we focus on the combinatorial version, which requires that all channel outputs are different from each other [2]. The problem of determining the minimum number of noisy reads is then reduced to determining the maximum intersection size of two distinct error balls; specifically, the minimum number of reads needed for zero-error reconstruction is one more than this maximum intersection size. In [2, 3], Levenshtein solved the problem for several error types, such as substitutions, insertions, deletions, transpositions, and asymmetric errors. In [10, 11, 12], permutation errors were analyzed. For general error graphs, see [2, 13, 14]. Note that these works considered the uncoded case, that is, the transmitted sequences are selected from the entire space. Due to applications in DNA storage, many researchers have also investigated this problem under the setting where the transmitted sequences are chosen from a given code with a certain error-correcting property; see for example [15, 16, 17, 18, 19].
Recently, the dual problem of sequence reconstruction was developed under the scenario that the number of noisy reads is a fixed system parameter. Motivated by applications in DNA-based data storage and racetrack memories, Kiah et al. [20, 21] and Chrisnata et al. [22] initiated the study of designing reconstruction codes, for which the size of the intersection between any two error balls is strictly less than the given number of reads. Note that when only one read is available, reconstruction codes are exactly the classical error-correcting codes. With more reads, one can increase the information capacity, or equivalently, reduce the number of redundant bits, by leveraging these multiple reads. The redundancy of a binary code of length is defined to be . In [20, 21], the authors focused on channels that introduce a single edit error, that is, a single substitution, insertion, or deletion, and their variants. For single-deletion binary codes, they showed that the number of redundant symbols required can be greatly reduced if one increases the number of noisy reads from one to two. For two deletions, the redundancy of the best known two-deletion correcting code (i.e., the single-read case) is given in [23]. In [22, 24], Chrisnata et al. reduced this redundancy by increasing the number of noisy reads from one to five, where each noisy read suffers exactly two deletions. For general multiple deletions, it was proved in [24] that the reconstruction code with two reads for a single deletion [21] is able to reconstruct codewords from a suitable number of distinct noisy reads.
It is well known that a code can correct a given number of deletions if and only if it can correct the same number of insertions. However, this equivalence no longer holds for reconstruction codes with more than one read, since the problem is then closely related to the size of the intersection of two error balls. From [17], [18] and [21, Proposition 9], we can see that in general the intersection size of two deletion balls is different from that of two insertion balls.
In this paper, we study binary reconstruction codes for channels that cause exactly two insertion errors. We show that the number of redundant symbols required to reconstruct a codeword uniquely can be reduced by increasing the number of noisy reads from one to five, and reduced further if the number of noisy reads is large enough. For the reader's easy reference, we summarize the best known results on binary two-deletion reconstruction codes and two-insertion reconstruction codes in Table I. For general multiple insertions, we can show a result similar to that in [24]: the reconstruction code with two reads for a single insertion [21] is able to reconstruct codewords from a suitable number of distinct noisy reads.
TABLE I: Best known results on binary two-deletion reconstruction codes (2-DRC) and two-insertion reconstruction codes (2-IRC).
Code | Number of reads | Redundancy | Reference
2-DRC | | | [23, Theorem 1]
| | | [24, Theorem 10]
| | | [24, Theorem 5]
| | | [24, Theorem 9]
| | | [2, Page 7], [24, Theorem 3]
2-IRC | | | [23, Theorem 1]
| | | Theorem V.2
| | | Proposition III.2
| | | Proposition III.2, Theorem IV.2
| | | Proposition III.2
This paper is organized as follows. In Section II, we introduce some notations and formally state the problem. In Section III, we derive some preliminary results, which help us to settle the trivial cases. In Section IV, we first completely characterize the structures of two sequences whose 2-insertion balls have an intersection of one of two particular sizes. Then, by utilizing these structures, we give explicit reconstruction codes for the corresponding range of the number of reads, which are asymptotically optimal in terms of redundancy. In Section V, by applying higher order parity checks on short subwords, we construct asymptotically optimal codes for the remaining range. Finally, we conclude this paper in Section VI with some open problems.
II Notations and Problem statement
In this section, we introduce some notations and preliminary results needed throughout this paper. Let denote the binary alphabet, and let denote the set of all binary sequences of length . Further, let denote the set consisting of all binary sequences of finite length. A sequence of length zero is the empty sequence, denoted by .
For a sequence and a sequence , where is an integer satisfying , if there exist indices such that for all , we say that is a subsequence of , or that is a supersequence of . In particular, when the chosen indices are consecutive, we say that is a subword of , and use the notation with and to denote . Let denote the length of . For any sequence $\mathbf{x}$ of length $n$ and any integer $t\ge 1$, let $I_t(\mathbf{x})$ denote the set of all supersequences of $\mathbf{x}$ of length $n+t$; we call $I_t(\mathbf{x})$ the $t$-insertion ball centered at $\mathbf{x}$. It is known from [2] that the size of $I_t(\mathbf{x})$ is independent of the choice of the center $\mathbf{x}$ and is equal to
$|I_t(\mathbf{x})| = \sum_{i=0}^{t} \binom{n+t}{i}$.    (1)
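As a quick sanity check of Equation (1), the following Python sketch (the helper names are mine, not from the paper) enumerates the $t$-insertion ball of a binary word by brute force and confirms that its size matches the binomial sum above for every centre of a small length.

```python
from itertools import product
from math import comb

def insertion_ball(x: str, t: int) -> set[str]:
    """All binary supersequences of x of length len(x) + t."""
    ball = {x}
    for _ in range(t):
        ball = {y[:i] + b + y[i:] for y in ball
                for i in range(len(y) + 1) for b in "01"}
    return ball

def ball_size(n: int, t: int) -> int:
    # Equation (1): |I_t(x)| = sum_{i=0}^{t} C(n+t, i), independent of the centre x
    return sum(comb(n + t, i) for i in range(t + 1))

if __name__ == "__main__":
    n, t = 6, 2
    for bits in product("01", repeat=n):
        assert len(insertion_ball("".join(bits), t)) == ball_size(n, t)
    print(f"|I_{t}(x)| = {ball_size(n, t)} for every centre of length {n}")
```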
The Levenshtein distance between two sequences, denoted by , is the smallest integer $t$ such that their $t$-insertion balls have a nonempty intersection. It is easy to see that this distance is symmetric, and that it equals zero if and only if the two sequences are equal. Finally, the Hamming distance between and , denoted by , is the number of coordinates where and differ, and the Hamming weight of is the number of nonzero coordinates of .
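Under this reading of the Levenshtein distance (the smallest $t$ for which the two $t$-insertion balls meet), it coincides, for two binary words of the same length, with the length minus the length of a longest common subsequence. The sketch below (my own helper names) verifies this identity exhaustively for a small length.

```python
from itertools import product

def lcs_len(x: str, y: str) -> int:
    """Length of a longest common subsequence, by the standard dynamic program."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def insertion_ball(x: str, t: int) -> set[str]:
    ball = {x}
    for _ in range(t):
        ball = {y[:i] + b + y[i:] for y in ball for i in range(len(y) + 1) for b in "01"}
    return ball

def smallest_t_with_common_supersequence(x: str, y: str) -> int:
    t = 0
    while not insertion_ball(x, t) & insertion_ball(y, t):
        t += 1
    return t

if __name__ == "__main__":
    n = 4
    words = ["".join(bits) for bits in product("01", repeat=n)]
    for x in words:
        for y in words:
            assert smallest_t_with_common_supersequence(x, y) == n - lcs_len(x, y)
    print(f"the smallest t with intersecting t-insertion balls equals n - LCS for n = {n}")
```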
The intersection size of two error balls is a key ingredient of the reconstruction problem in coding theory. By Levenshtein’s original framework [2], for channels causing insertion errors, it is always possible to exactly reconstruct a transmitted sequence given sufficiently many distinct elements of its insertion ball, namely one more than the maximum intersection size of two distinct insertion balls, which is given by
(2)
This notation was generalized in [18] for any two sequences with a minimum Levenshtein distance. For integers , , let
then the explicit formula is given by
(3) |
Note that when , the values of reduce to and , respectively. See [18, Corollary 9] and [18, Corollary 10] for details.
Given a code with , the read coverage of after insertions is defined as
(4) |
It is clear that if , we can uniquely recover any codeword in by its distinct reads. So we have the following definition.
Definition II.1.
(Reconstruction codes) A code is an -reconstruction code if , where is the read coverage of after insertions.
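The following sketch (hypothetical helper names; it assumes that the read coverage in Equation (4) is the largest intersection of two $t$-insertion balls over distinct codewords) checks Definition II.1 by brute force for a small code. It is only a toy verifier for tiny parameters, not one of the constructions given in this paper.

```python
from itertools import combinations

def insertion_ball(x: str, t: int) -> set[str]:
    ball = {x}
    for _ in range(t):
        ball = {y[:i] + b + y[i:] for y in ball for i in range(len(y) + 1) for b in "01"}
    return ball

def read_coverage(code: list[str], t: int) -> int:
    """Assumed form of Equation (4): max |I_t(x) ∩ I_t(y)| over distinct codewords."""
    return max(len(insertion_ball(x, t) & insertion_ball(y, t))
               for x, y in combinations(code, 2))

def is_reconstruction_code(code: list[str], t: int, N: int) -> bool:
    # Definition II.1: the read coverage must be smaller than the number of reads N,
    # so that any N distinct reads from one t-insertion ball identify the codeword.
    return read_coverage(code, t) < N

if __name__ == "__main__":
    # toy example: length-6 words of even weight, checked for t = 2 insertions
    even_weight = [format(v, "06b") for v in range(64) if format(v, "06b").count("1") % 2 == 0]
    nu = read_coverage(even_weight, 2)
    print("read coverage after 2 insertions:", nu)
    print("reconstruction code with N =", nu + 1, "reads:",
          is_reconstruction_code(even_weight, 2, nu + 1))
```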
Given a code , the redundancy of is defined to be the value . Then we are interested in the following quantity
which is called the optimal redundancy of an -reconstruction code. By definition, an -reconstruction code is also an -reconstruction code for any . So we have for all .
When , an -reconstruction code is indeed a -insertion correcting code, or equivalently a -deletion correcting code. This class of codes has been extensively studied in recent years, see [25, 26, 27, 28, 23] for more details. For , we have the following single deletion correcting code [29]:
$\mathrm{VT}_a(n) = \Big\{\mathbf{x}\in\{0,1\}^n : \sum_{i=1}^{n} i\,x_i \equiv a \pmod{n+1}\Big\},$    (5)
where $a$ is a fixed integer between $0$ and $n$. This is the famous Varshamov-Tenengolts (VT) code, which has asymptotically optimal redundancy in the single-read case [30, Corollary 2.3]. In [21, Theorem 24], Cai et al. determined the asymptotic value of the optimal redundancy for all numbers of reads, as below.
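Equation (5) can be tried out directly. The sketch below (helper names are mine) builds $\mathrm{VT}_a(n)$ for a small $n$ and confirms the classical fact that each such code corrects a single deletion, i.e., the single-deletion balls of distinct codewords are disjoint.

```python
from itertools import combinations

def vt_code(n: int, a: int) -> list[str]:
    """VT_a(n): binary words x of length n with sum_i i*x_i ≡ a (mod n+1), as in Equation (5)."""
    words = [format(v, f"0{n}b") for v in range(2 ** n)]
    return [x for x in words if sum((i + 1) * int(b) for i, b in enumerate(x)) % (n + 1) == a]

def single_deletion_ball(x: str) -> set[str]:
    """All words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}

if __name__ == "__main__":
    n = 8
    for a in range(n + 1):
        code = vt_code(n, a)
        for x, y in combinations(code, 2):
            assert single_deletion_ball(x).isdisjoint(single_deletion_ball(y))
    print(f"every VT_a({n}) corrects a single deletion")
```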
Theorem II.1.
Taking in [21, Theorem 24], we have
Our work extends the result in Theorem II.1 to the case of two insertions. As will be shown in Corollary III.2 and the paragraph prior to Proposition III.2, the lower bound on the optimal redundancy can be deduced from the single-insertion case. The upper bound needs explicit code constructions. So the main task is to construct an -reconstruction code with redundancy as small as possible, for any given number of reads. Before closing this section, we introduce the following notations, which will be used later.
For , we write for the concatenation of and . In particular, if , we define to be the sequence obtained by concatenating ’s; if , is the empty sequence. If , we denote as the complementary sequence of . For any and , we denote , , , , and .
Example II.1.
Let , and . Then , and . Let , and . Then , , , and .
Given a sequence , we say that it has period $p$ if $p$ is the smallest positive integer such that every symbol equals the symbol $p$ positions later, whenever both are defined. A sequence with period 2 is called alternating. We also view an empty sequence or a length-one sequence as an alternating sequence. For example, 0, 10, 010, 0101, and 10101 are all alternating sequences.
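The period of a word can be computed by a direct scan. The short sketch below (my own helper names) does this and flags the alternating words, i.e., those of length at most one or of period two, matching the convention above.

```python
def period(x: str) -> int:
    """Smallest p >= 1 with x[i] == x[i+p] for all valid i; the full length if no proper self-overlap."""
    for p in range(1, len(x)):
        if all(x[i] == x[i + p] for i in range(len(x) - p)):
            return p
    return len(x)

def is_alternating(x: str) -> bool:
    # length-0/1 words are treated as alternating, as in the text; otherwise period 2
    return len(x) <= 1 or period(x) == 2

if __name__ == "__main__":
    for x in ["", "0", "01", "0101", "10101", "0110", "000"]:
        print(repr(x), "period:", period(x) if x else "-", "alternating:", is_alternating(x))
```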
III Preliminary results
As mentioned before, the intersection size of two error balls is a key ingredient of the reconstruction problem. In this section, we characterize the structures of two binary sequences whose -insertion balls have an intersection of a given size.
Let us first recall the results for a single insertion, that is, the size of the intersection of two 1-insertion balls. By Equation (2), we know that this intersection has size at most two for any two distinct sequences. The characterization of when equality holds was given in [21], which needs the following notions.
Definition III.1.
(Type-A confusability) Suppose that . We say that are Type-A confusable, if there exist , , such that
where is an alternating sequence of length at least one, i.e., for some or for some , where or .
For example, the two sequences and are Type-A confusable with , and . In general, if the Hamming distance , then and are always Type-A confusable by definition. Note that the case where was not included in the original definition in [21, Definition 8], where is required when . However, the following statement is still true, which is just a straightforward combination of the two claims in [21, Proposition 9].
Lemma III.1.
[21, Proposition 9] Let be distinct. Then if and only if are Type-A confusable.
In fact, if two sequences and are Type-A confusable, then
for some , and . By Lemma III.1, for the left case and for the right case. Next, we define the concept of Type-B confusability, which was introduced in [22, Definition 18] and [24, Definition 11]. This concept will help to characterize the case .
Definition III.2.
(Type-B confusability) Suppose that with . We say that are Type-B confusable, if there exist , , and such that
Example III.1.
We give some examples of these two types of confusability.
• Let and . Then and are Type-A confusable with , and . They are also Type-B confusable with , , , and .
• Let and . Then and are Type-A confusable with and . Obviously, they are not Type-B confusable.
• Let and . Then and are Type-B confusable with , , and . It is easy to see that they are not Type-A confusable.
• The two sequences and are neither Type-A confusable nor Type-B confusable.
From Example III.1, we can see that Type-A confusability and Type-B confusability do not contain each other. In general, we have the following proposition, which is easy to verify.
Proposition III.1.
From Definition III.1 and Definition III.2, one can easily check that
(1) if , are Type-A confusable, then they are Type-B confusable if and only if for some , or for some ;
(2) if , are Type-B confusable, then they are Type-A confusable if and only if one of the following two conditions is satisfied:
(i) and for some ;
(ii) and for some .
Lemma III.2.
Suppose . If , then are Type-B confusable.
Proof:
Since , then by Lemma III.1, and are not Type-A confusable, which further implies that . So we can assume that
where and . Let be the leftmost and rightmost indices, respectively, where and differ. See Figure 1 for a depiction.
Suppose that . If , then , are Type-A confusable, which is a contradiction. If , then and thus , a contradiction. Therefore, we conclude that and are both nonempty.
Let and . Suppose that and are inserted in and to get , respectively. Then there are only two possibilities: and , or and . If and , then and thus . This means that there exists such that and . Therefore, are Type-B confusable. If and , the proof is similar. ∎
By Lemma III.2, if , then we must have for some , , and . Consequently, is the unique sequence in . Combining Lemma III.1 and Lemma III.2, we completely characterize the structures of two binary sequences for which the intersection of -insertion balls is of a certain size. See the following corollary and Figure 2.
Corollary III.1.
Let be distinct. Then
• if and only if and are neither Type-A confusable, nor Type-B confusable;
• if and only if and are Type-B confusable, but not Type-A confusable;
• if and only if and are Type-A confusable.
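Corollary III.1 can be checked exhaustively for small lengths. The sketch below (hypothetical helper names) tabulates the intersection sizes of the 1-insertion balls over all pairs of distinct words of a small length; in line with the single-insertion bound recalled after Equation (2), no pair shares more than two common supersequences, and the counts of the sizes that occur can be read off directly.

```python
from collections import Counter
from itertools import combinations, product

def insertion_ball(x: str, t: int) -> set[str]:
    ball = {x}
    for _ in range(t):
        ball = {y[:i] + b + y[i:] for y in ball for i in range(len(y) + 1) for b in "01"}
    return ball

if __name__ == "__main__":
    n = 6
    words = ["".join(bits) for bits in product("01", repeat=n)]
    sizes = Counter(len(insertion_ball(x, 1) & insertion_ball(y, 1))
                    for x, y in combinations(words, 2))
    print(f"|I_1(x) ∩ I_1(y)| over all distinct pairs of length {n}:", dict(sizes))
    assert max(sizes) <= 2  # consistent with the single-insertion bound
```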
Now we proceed to estimate the intersection size of two 2-insertion balls, that is , based on the value of . By Equation (2), we have that . That is, for any . The following lemma further restricts its value in the case that .
Lemma III.3.
Let and . If , then . In particular, when and , we have
• if and only if and are neither Type-B confusable, nor Type-A confusable;
• if and only if and are Type-B confusable, but not Type-A confusable;
• if and only if and are Type-A confusable.
Proof:
Since , we conclude that are Type-B confusable by Lemma III.2. Assume that and for some and . Let . Then . Let . We prove that by induction on the length .
The base case is , that is, and . In this case, we have and
The second equality in the first line follows from the fact that , and similarly for the second equality in the second line.
By the form of , we have and . Thus is the disjoint union of and . By Equation 1, . So it remains to show that , or equivalently by Equation 2. Otherwise, if , then and , or and for some . For both cases, and are Type-A confusable by Proposition III.1 (2), which contradicts the fact that . This completes the proof of the base case by Corollary III.1.
In the base case, we have shown that for some set (i.e. ) of size at most two. Suppose we have proved the same conclusion for the codeword length , and we will prove it for length . Now , and . Without loss of generality, we assume that and for some and . Then . Denote , and , which are obtained from by deleting the first element . By the induction hypothesis, , for some set of size at most two. Then . Hence , and this completes the proof. ∎
By Lemma III.3, we are able to give a rough estimate of based on the different values of , and consequently based on the confusability of and by Corollary III.1. See the following lemma and Figure 2.
Lemma III.4.
Let where . Then
(i) if and only if ;
(ii) if and only if ;
(iii) if and only if .
Proof:
(i) We prove the sufficiency by contradiction. If , that is , then by Equation 3. If , then by Lemma III.3. So we must have .
For the necessity, recall that . Since , by Lemma III.1, there must be some , and , such that
For the left case, let and . Then and . Noting that by Equation 1, we have . Therefore, . For the right case, the proof is similar.
The claim (ii) is clear by Lemma III.3, the claim (i) and the fact that if . The claim (iii) then follows too. ∎
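Lemma III.4 links the size of the intersection of two 2-insertion balls to that of the corresponding 1-insertion balls. The following brute-force sketch (toy length only, my own helper names) groups all distinct pairs of short words by the single-insertion intersection size and prints which two-insertion intersection sizes occur in each group, which makes the correspondence visible for small lengths.

```python
from collections import defaultdict
from itertools import combinations, product

def insertion_ball(x: str, t: int) -> set[str]:
    ball = {x}
    for _ in range(t):
        ball = {y[:i] + b + y[i:] for y in ball for i in range(len(y) + 1) for b in "01"}
    return ball

if __name__ == "__main__":
    n = 5
    words = ["".join(bits) for bits in product("01", repeat=n)]
    by_single = defaultdict(set)
    for x, y in combinations(words, 2):
        s1 = len(insertion_ball(x, 1) & insertion_ball(y, 1))
        s2 = len(insertion_ball(x, 2) & insertion_ball(y, 2))
        by_single[s1].add(s2)
    for s1 in sorted(by_single):
        print(f"|I_1 ∩ I_1| = {s1} -> observed |I_2 ∩ I_2| values:", sorted(by_single[s1]))
```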
By Lemma III.4 and Figure 2, the following relations between reconstruction codes for two insertions and those for a single insertion are immediate under certain conditions. See Figure 3 for a depiction.
Corollary III.2.
Let and .
(i) When , is an -reconstruction code if and only if is an -reconstruction code.
(ii) When , is an -reconstruction code if and only if is an -reconstruction code.
By now, we are able to determine for most values of . First, since for any , we have and for any . By Lemma III.4, we see that if , an -reconstruction code must be an -reconstruction code, and thus by Theorem II.1; if , an -reconstruction code must be an -reconstruction code, and thus by Theorem II.1. The last case corresponds to a 2-insertion error-correcting code, or equivalently a 2-deletion error-correcting code, whose best known upper bound on the optimal redundancy is given in [23, Theorem 1], and whose best known lower bound follows from [29]. Proposition III.2 summarizes the information given above.
Proposition III.2.
Therefore, the remaining cases are and , which will be handled in the rest of this paper. In fact, we construct asymptotically optimal reconstruction codes for with redundancy , and for with redundancy , thus determining the asymptotically optimal redundancy for all .
IV Reconstruction Codes for
In this section, we construct reconstruction codes for with redundancy as small as possible. The main idea is as follows. First, we choose an -reconstruction code with small redundancy, or equivalently an -reconstruction code (see Corollary III.2 (i)), which can be found in [21, Theorem 17]. Then we impose additional constraints on such that under these constraints, and consequently we obtain an -reconstruction code .
The main task in our construction is to characterize the structures of two sequences when or . By Lemma III.4, we have . Then by Lemma III.2, we can assume that
for some and , where satisfies neither of the following conditions by Proposition III.1:
(6) |
Let . In the proof of Lemma III.3, we have shown that for some set of size at most two. The following result reveals more information about , whose proof is deferred to Appendix A.
Lemma IV.1.
Let and for some and , and let . If and are not Type-A confusable, then
and the union is disjoint.
Let be the length of . Then by Equation 1, we have and . Lemma IV.1 implies
By Equation 6 and Proposition III.1, and are Type-B confusable but not Type-A confusable, and hence by Corollary III.1. Then by Lemma III.3. We also have due to the fact . So the above equality implies that or if and only if or , respectively. Thus, we can further assume that . The following theorem characterizes the conditions when or .
TABLE II
Row number | Conditions
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
Theorem IV.1.
Proof:
Denote . Let and . Then
(7) |
(i). From Lemma III.3, we know that if and only if and are Type-A confusable. That is to say, and for some , where is an alternating sequence of length at least one. Denote and . Then
(8) |
Since , we have . If , consider the prefix of . By the third line of Equation 8 and the second line of Equation 7, we have and , respectively. Thus , which contradicts the second line of Equation 8. So we have , and Equation 8 becomes
(9) |
Next, we determine the explicit structure of according to the values of and . Write . Then
(10) |
For convenience, let , . Since , the first two lines of Equation 7 and the first line of Equation 9 imply that is alternating. Therefore,
(11) |
We divide our discussion into three cases.
• . In this case, we have and . Therefore, and when is even, or and when is odd.
• . In this case, we have and . Therefore, and when is odd, or and when is even.
• . In this case, for by the second line of Equation 9 and Equation 10. So by the third line of Equation 9. Hence is alternating by the first line of Equation 9 and Equation 10. Since , we have and by the first line of Equation 9. Now, combining with Equation 11, we come to the desired forms of : for all ,
and
This completes the proof of (i).
(ii). The proof of this case is long and tedious, thus we move it to Appendix B. ∎
Theorem IV.1 gives a full characterization of the structures of two sequences when or , respectively. This will help to exclude such pairs of sequences from an -reconstruction code (or an -reconstruction code) to obtain an -reconstruction code, as mentioned in the beginning of this section. We use the -reconstruction code in [21, Theorem 17] as a candidate for our construction. For that, we need the following notation.
For any , define
By definition, for any and any ,
(12) |
where denotes the number of appearing in . Let denote the set of all sequences such that the length of any subword with period of is at most , for any . By definition, if and .
Lemma IV.2.
([31, Theorem 13]) For , if , we have that .
For , let and . In [21, Theorem 17], the authors defined the following code
They proved that is an -reconstruction code in the following way. The constraint on implies that the Hamming weights of any two distinct codewords and in have the same parity and thus . Therefore, if , we can conclude that and for some and some alternating sequence of positive even length. On the one hand, it can be shown that . So the constraint on leads to and hence . On the other hand, the length of is upper bounded by since has period . So we have , which is impossible. Therefore, is an -reconstruction code.
By Corollary III.2 (i), the code is an -reconstruction code. To obtain an -reconstruction code from , we have to exclude pairs of sequences satisfying that or . By Theorem IV.1, such pairs have a common subword with a certain periodic property. Thus we can further bound the length of and use the constraint on to get a contradiction and exclude such pairs. See our construction below.
Theorem IV.2.
() For integers such that , let and . Let
Then is an -reconstruction code. Furthermore, if , then has redundancy at most for some choice of and .
Proof:
Since , that is , it suffices to show that for any two distinct sequences .
If, on the contrary, there exists such a pair , then without loss of generality, we can assume that
for some and . Since , we conclude that .
If , then for some by Theorem IV.1(i). Since , we have . From Equation 12, we know that
So by the definition of . Then , which implies that , a contradiction.
If , then is one of the four forms in Table II under the case by Theorem IV.1(ii). For each case, we can deduce that , thus a contradiction. For example, if for some , then . On the other hand, since , we have , , and . So , which implies , a contradiction.
So we conclude that is an -reconstruction code. Finally, we can choose . Then by Lemma IV.2. Note that for different pairs , the codes are disjoint from each other. So by the pigeonhole principle, there must be some and such that , which has redundancy for large enough. ∎
Obviously, the code in Theorem IV.2 is also an -reconstruction code. Then combining with Proposition III.2, we have for . However, we can do a little bit better for . Under the same notations, let
Then we can show that is an -reconstruction code by an argument similar to that in Theorem IV.2. Furthermore, if we set , then has redundancy at most for some choice of and . Since this is about half of the corresponding value in Theorem IV.2, we know that the redundancy of is about one bit smaller than that of .
V Reconstruction codes for
In this section, we provide reconstruction codes for with redundancy as small as possible. It seems that we could also construct such a code by characterizing the structures of two sequences such that or , as we did in the last section. However, this would be a long and tedious task. So the construction we use here is quite different from the one for .
In [27, Theorem 2], Sima et al. constructed a class of -deletion correcting codes with redundancy , by generalizing the VT code with higher order parity checks of some indicator vectors of the original sequences. Inspired by their constructions, we will present an -reconstruction code with redundancy by utilizing five different reads. This improves the trivial upper bound to , which is tight in terms of the leading term. We first recall the construction in [27, Theorem 2], which needs the following notations.
For ( ) and , the -indicator of is defined as follows:
For example, if , then and . Note that the - and -indicators of any sequence do not contain consecutive ones. Define the following integer vectors of length :
(13) |
In other words, the th components of , and are , and , respectively. Then given , the higher order parity checks for and are defined as
(14) |
where denotes the inner product over the integers. By applying the above parity checks, Sima et al. proved the following result.
Theorem V.1.
Clearly, the redundancy of is at most for some choice of , , , and . Although this implies an -reconstruction code, the redundancy is large. To reduce the redundancy, we note that the parity checks in the construction of Theorem V.1 are applied to the whole sequence of length to correct two random insertion errors. However, as will be shown later, the two random insertion errors are located in a short subword (say of length ) of the original sequence, if we impose some constraint (belonging to ) on the sequences. So we can apply the parity checks in Equation 14 only on this subword of length to combat the two insertion errors, which will greatly reduce the redundancy of the code if .
Now assume that () and . Then by Lemma III.4, and hence by Lemma III.1. So we can always assume that
for some and . Further, and , since otherwise . As in Lemma IV.1, we have
(15) |
In particular, . The proof of Equation 15 is similar to that of Lemma IV.1 and thus omitted.
By Equation 15, we know that if , then any sequence in can be obtained from by inserting two symbols in , or from by inserting two symbols in . Therefore, the idea is: if for some , then by partitioning (or ) into non-overlapping segments of length (see Figure 4; the rightmost segment has length at most ), we can see that the two inserted symbols are located in at most two adjacent segments. Thus we can apply the higher order parity checks in Equation 14 to the subword of length . Intuitively, the redundancy can thus be greatly reduced.
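The localization argument can be illustrated with a toy simulation (parameters and helper names are mine; this is not the parity-check construction itself, only the segment bookkeeping). Whenever the two insertion positions are at most one segment length apart, as the discussion around Equation 15 guarantees under the constraints imposed later, they fall inside a single window formed by two adjacent segments, and the read agrees with the original word outside that window up to the length shift.

```python
import random

def insert(x: str, pos: int, bit: str) -> str:
    return x[:pos] + bit + x[pos:]

if __name__ == "__main__":
    random.seed(1)
    n, k = 60, 10  # toy sequence length and segment length, with k dividing n
    for _ in range(5_000):
        x = "".join(random.choice("01") for _ in range(n))
        # two insertion positions whose span is at most k (the situation ensured above)
        p = random.randrange(n)
        q = random.randrange(p, min(p + k, n) + 1)
        z = insert(insert(x, q, random.choice("01")), p, random.choice("01"))
        i = p // k                              # first affected segment of x
        lo, hi = i * k, min((i + 2) * k, n)     # window of at most two adjacent segments
        # outside this window, the read agrees with x (up to the length-2 shift)
        assert z[:lo] == x[:lo]
        assert z[hi + 2:] == x[hi:]
    print(f"two insertions with span at most {k} always lie in one window of two adjacent segments")
```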
To bound the length of or , we restrict our sequences to be within , which is of size at least when by Lemma IV.2. Suppose . If we can prove that and are both concatenations of a finite number of sequences of period at most , then and can be upper bounded by . Next, we will show that this is indeed the case.
Let , where and . Then , where
It is clear that , and are mutually disjoint, and each has size at most two. Suppose that . Then we must have:
(16) |
For the former case of Equation 16, and are Type-A confusable, i.e., there exist some such that
(17) |
where is an alternating sequence of length at least one. Since , we conclude that and are Type-A confusable or Type-B confusable. With these notations and observations, we characterize the structures of and in the following two lemmas for different cases. The proofs are in Appendix C.
Lemma V.1.
Suppose that and are Type-A confusable, and that and are Type-A confusable. Then and for some , where the four sequences and all have period at most .
Lemma V.2.
Suppose that and are Type-A confusable, and that and are Type-B confusable. Then and for some , where the six sequences and all have period at most .
For the latter case of Equation 16, we have and are Type-A confusable by , and and are Type-A (or B) confusable by . As in Equation 17, we can let and , and show that Lemma V.1 and Lemma V.2 are both true for this case.
Now suppose that , and . Then Lemma V.1 and Lemma V.2 imply that
(18) |
So in the rest of this subsection, we let . For convenience, we always assume that , that is, each segment in Figure 4 has length . All results in this subsection can be extended to the case by a truncation method; see Remark V.2.
For with , partition into segments of length as in Figure 4. For any two consecutive segments, we define the parity checks on this subword of length as in Equation 14 as follows. For each , let and , which are the indicators of the -th segment of . Define
where is defined in Equation 13 but with length . If we let , then it is easy to see that and . Next, we will sum up these ’s and ’s according to the parity of . Let be the set of even integers and let be the set of odd integers in , respectively. It is clear that if (or ) and , then the two intervals and are disjoint. Define
with the summation over , and define
with the summation over . Now we are ready to give our construction.
Theorem V.2.
() Let and be integers such that and . For any , , and , let be the set of all such that the following four conditions hold:
• ;
• ;
• ; and
• .
Then is an -reconstruction code. Furthermore, if , then has redundancy at most for some choice of , , and .
Proof:
According to Equation 5, the first condition implies that is a subset of a VT code. So , and hence for any distinct . Assume that
for some and . By Lemma III.4, . We will show that by contradiction.
Assume that . Then by Equation 18 we can partition the sequences and into segments each of length , such that the two inserted symbols in the common supersequence are located in at most two adjacent segments containing or . See Figure 4 for illustrations. This means that there exists some such that and are subwords of and , respectively. Without loss of generality, we assume and the proof is the same for the case . Let and . Then , , and . By the choice of , we know that for each , and hence and . Therefore, we have
The second condition in this theorem implies and , i.e., and . By Theorem V.1, we know that . However, by Equation 15, , a contradiction. Thus the code is indeed an -reconstruction code.
Finally, we choose , which implies that by Lemma IV.2. So there must be some , , , such that , where . By computing the redundancy, we complete the proof. ∎
Remark V.1.
If we replace the fourth condition in Theorem V.2 with , and take and , then we will get an -reconstruction code with redundancy at most , the constant term of which is slightly smaller than that of the redundancy of the code given in Theorem V.2.
Remark V.2.
Now we explain how to extend the construction in Theorem V.2 to the case . Let and be as before. Let be the smallest integer such that and . For each , let be the word of length obtained by appending zeros at the end of . Then in Theorem V.2, modify the second and third conditions by replacing with , and keep the other two conditions unchanged. By almost the same arguments, we can show that there exists an -reconstruction code with redundancy at most .
When , an -reconstruction code must be an -reconstruction code. So the lower bound of is (see Equation 5). Therefore, the redundancy of the code in Theorem V.2 has optimal leading term. It is worth noting that the idea in this section can be applied to all . The main task is to upper bound the length of and under some restrictions. We only accomplished this task when . When , the analysis will be more complicated.
To conclude this section, we summarize the main results into the following theorem.
Theorem V.3.
VI Conclusion
In this paper, we study the redundancy of reconstruction codes for channels with exactly two insertions. Let be the number of given channels and let be the sequence length. It turns out that the nontrivial cases are and . For and , our constructions of codes are both explicit, and the ideas are quite different. When , the construction is based on characterizations of two sequences when the intersection size of their -insertion balls is exactly or . When , the construction is based on higher order parity checks on short subwords. Consequently, for all , we construct codes which are asymptotically optimal in terms of redundancy.
For general -insertions (with being a constant), we have the following result as a generalization of Lemma III.3.
Lemma VI.1.
Let and . Suppose that , then
for any .
The proof of Lemma VI.1 is in Appendix D. By Equation 1 and Equation 2, we can see that . Thus Lemma VI.1 gives a result similar to that in [24]: the reconstruction code with two reads for a single insertion [21] is able to reconstruct codewords from distinct noisy reads.
For future research, we raise the following questions.
(1) In Theorem V.2, we construct an -reconstruction code with redundancy . The leading term is optimal, but we do not know whether the coefficient of the secondary term is optimal or not. This is left for future study.
(2) Using the notations of Section V, we can see that if and only if and . In other words, if and only if
By these equivalent conditions, we can determine the explicit structures of and . Then, similarly to Theorem IV.2, we can construct an -reconstruction code with redundancy . However, the proofs are long and tedious. The details of this construction can be provided upon request. Again, we do not know whether the secondary term is optimal or not.
(3) Find a way to upper bound the lengths of and in Section V. This will be helpful for constructing -reconstruction codes for .
Appendix A The proof of Lemma IV.1
Proof:
Recall that , and . For simplicity, we denote , , and . Clearly, , and the two sets are disjoint. Next we will prove . It is easy to see that
So we only need to prove that the opposite inclusion is true. Since , it is sufficient to show . Assume that . Suppose that are the two insertion positions in (meaning that two symbols are inserted to the left of and , respectively, to get ), and that are the two insertion positions in . Let and be the leftmost and rightmost indices where and differ (see Figure 5). Comparing with , we can divide the discussion into three cases: (1) ; (2) ; (3) .
Case (1): (Figure 6). Since , we must have in this case, and either or .
Subcase (1): If , then by matching positions, we have
Subcase (2): If , then
For , we have
If the second bracket is not empty, then . This holds if and only if and , or and , for some . Both cases imply that and are Type-A confusable by Proposition III.1(2), which is a contradiction. Therefore, is empty and so
This means that , since is also in () by assumption.
Case (2) (Figure 7): . Since , we cannot have exactly one of them less than . Thus we have or in this case.
Subcase (1): If then we must have since . So we have
Subcase (2): If , then
Note that . So
The last equality follows from . If is not empty, then we have , which means and so . This implies that and are Type-A confusable, which is a contradiction. Therefore, is empty and
This means that , since is also in () by assumption.
Case (3) (Figure 8): . In this case, we have or . By a process similar to that of the previous two cases, we will get or .
Combining cases (1), (2) and (3), we conclude that . Thus the proof is completed. ∎
Appendix B Proof of Case (ii) of Theorem IV.1
Proof:
(ii). Recall that , and , where is the length of . In addition, we write and .
By Lemma III.3, if and only if and are Type-B confusable, but not Type-A confusable. That is to say,
(19) |
or
(20) |
for some and . Note that since and have the same Hamming weight. Suppose that , where . Then and so .
In order to simplify arguments in the sequel, we let , , and . Then it is clear that and for all . Both of Equation 19 and Equation 20 indicate that and . Therefore, we have the following two identities:
(21) |
for any and . From the first row in Equation 21, we know that is the same as that in Equation 11. And the second row in Equation 21 implies that
(22) |
From Equation 22, we can see that if , then . However, if , we get no information from Equation 22. Furthermore, if is even, we have , ; if is odd, we have , . Therefore, we must have since .
Now we divide our discussions into two cases, according to Equation 19 or Equation 20. As we will see later, they will lead to different values of .
• Firstly, we assume that Equation 19 holds. Then , , , and . Thus we have
(23) The three identities above imply that
(24) Therefore, if , then . Combining Equations 11, 22 and 24, we get
(25) for all and . Recall that if and if . So the first and last rows correspond to the case where , while the second and third rows correspond to the case where .
• Secondly, we assume that Equation 20 holds. Comparing with the previous case, the only difference is that Equation 23 becomes
(26) Therefore, if , we have . Recall that . From the first two rows in Equation 26, we can deduce that . Otherwise, the first two rows in Equation 26 will result in , which contradicts Equation 11. Now the three identities in Equation 26 indicate that has period , and that and .
If , then and . So from the periodicity of , we have and . Similarly, if , then and . So and . If , then . So and , where the last equality follows from Equation 11. This contradicts the fact that . So we cannot have .
By the above discussions, we have
(27) Recall that if and if . Now, combining Equations 11, 22 and 27, for all and , we get eight cases for the values of , which are listed in Table III. Note that in the third and seventh rows of Table III, we cannot have . Otherwise, we will get , which contradicts the second row in Equation 26.
From Equation 25 and Table III, we can derive the results in Table II. Precisely (in Table III, we require ),
• the first row in Table II follows from the second row in Equation 25;
• the second row in Table II follows from the third row in Equation 25;
• the third and fourth rows in Table II follow from Table III;
• the fifth row in Table II follows from the first row in Equation 25;
• the sixth row in Table II follows from the fourth row in Equation 25;
• the seventh and eighth rows in Table II follow from Table III.
Therefore, we have proved the “only if” part of the conclusion.
For the “if” part, it is straightforward to verify that if we take to be one of these values, and are Type-B confusable. Comparing Table II with the results in (i), we can see that and are not Type-A confusable. This completes the proof of (ii).
TABLE III
No. | Conditions for and |
1 | | No
2 | | Yes
3 | | Yes
4 | | No
5 | | Yes
6 | | No
7 | | No
8 | | Yes
∎
Appendix C Proofs of Lemma V.1 and Lemma V.2
Lemma V.1 is the combination of Claim 1–Claim 4, and Lemma V.2 is the combination of Claim 1 and Claim 5–Claim 7.
In this section, we always denote , and . Then , and . Let . For simplicity, we define and . So . Since , we conclude that and are Type-A confusable or Type-B confusable. Therefore, we can choose to be the leftmost index such that and to be the rightmost index such that . By the choice of and , we have and . Furthermore, according to Definition III.1 and Definition III.2, we conclude that
• if and are Type-A confusable, then is an alternating sequence and .
• if and are Type-B confusable, then . Besides, one of the following two conditions must hold:
(28) or
(29)
Claim 1.
If , then is a periodic sequence and its period is at most . If , then is a periodic sequence and its period is at most .
Proof:
We only prove the conclusion for , since for , the proof is similar.
When , it is clear that the period of is . So we assume that . When , we have , which implies that is a periodic sequence and its period is at most . ∎
Claim 2.
Suppose that . If and are Type-A confusable, then , where are sequences of period at most .
Proof:
Since , we know that . When , the conclusion is trivial. So we assume that . When , we have , and . The first and third equalities imply that and both have period at most . Recall that is an alternating sequence. So or . Otherwise the second equality will lead to , which is impossible. Now we can conclude that is the concatenation of two sequences, and each of them has period at most . ∎
Claim 3.
Suppose that and . If and are Type-A confusable, then , where are sequences of period at most .
Proof:
Since , we know that . When , the conclusion is trivial. So we assume that . When and , we have and . The second equality implies that has period at most . Recall that is an alternating sequence and . So or . Otherwise the first equality will lead to , which is impossible. Now we can take and . ∎
Claim 4.
Suppose that and are Type-A confusable, and that . If , or if and , then , where are sequences of period at most .
Claim 5.
Suppose that and , and that and are Type-B confusable. Then
• , where are sequences of period at most ; or
• , where are sequences of period at most .
Proof:
When , the conclusion is trivial. So we assume that . If Equation 28 is true, then and . Now we can conclude that is a run and hence has period , and that is a sequence of period at most .
If Equation 29 is true, then and . Now we can conclude that is a sequence of period at most , and that is a sequence of period at most .
∎
Claim 6.
Suppose that , and that and are Type-B confusable. Then , where are sequences of period at most .
Proof:
Similar to the proof of Claim 2, we can see that and both have period at most . If Equation 28 is true, then . Now we can conclude that is a run and hence has period .
If Equation 29 is true, then . Therefore, is a sequence of period at most . ∎
Claim 7.
Suppose that and are Type-B confusable, and that .
• If and , then
– , where are sequences of period at most ; or
– , where are sequences of period at most .
• If , then , where are sequences of period at most .
Appendix D The proof of Lemma VI.1
Firstly, we need the following lemma, which is a generalization of the base case (i.e., the case where ) in the proof of Lemma III.3.
Lemma D.1.
Let and such that
for some and . If , then
for any .
Proof:
It is clear that . Let . Then and
If , then and , or and for some . For both cases, and are Type-A confusable by Proposition III.1 (2), which contradicts the fact that . Therefore, and hence . ∎
Proof of Lemma VI.1 Since , by Lemma III.2 we can assume that and for some and . Let . Then . Let . We prove this theorem by induction on and .
The base case is or . When , the conclusion follows from Lemma D.1. When , the conclusion follows from Lemma III.3, since . Now suppose we have proved this theorem for all and . Without loss of generality, we assume that for some and . Clearly, , and
By the induction hypothesis, and . By Equations (1) and (2), we can get the following two identities,
Then the proof is completed.
References
- [1] V. I. Levenshtein, “Reconstruction of objects from the minimum number of distorted patterns,” Dokl. Akad. Nauk, vol. 354, no. 5, pp. 593–596, 1997.
- [2] ——, “Efficient Reconstruction of Sequences,” IEEE Trans. Inf. Theory, vol. 47, no. 1, pp. 2–22, Jan. 2001.
- [3] ——, “Efficient Reconstruction of Sequences from Their Subsequences or Supersequences,” J. Combinat. Theory, A, vol. 93, no. 2, pp. 310–332, Feb. 2001.
- [4] T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing Strings from Random Traces,” in Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), New Orleans, LA, USA, Jan. 2004, pp. 910–918.
- [5] S. Kannan and A. McGregor, “More on reconstructing strings from random traces: insertions and deletions,” in Proc. Int. Symp. Inf. Theory (ISIT), Adelaide, Australia, Sep. 2005, pp. 297–301.
- [6] K. Viswanathan and R. Swaminathan, “Improved string reconstruction over insertion-deletion channels,” in Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), San Francisco, CA, USA, Jan. 2008, pp. 399–408.
- [7] T. Holenstein, M. Mitzenmacher, R. Panigrahy, and U. Wieder, “Trace reconstruction with constant deletion probability and related results,” in Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), San Francisco, CA, USA, Jan. 2008, pp. 389–398.
- [8] M. Cheraghchi, R. Gabrys, O. Milenkovic, and J. Ribeiro, “Coded Trace Reconstruction,” IEEE Trans. Inf. Theory, vol. 66, no. 10, pp. 6084–6103, Oct. 2020.
- [9] J. Brakensiek, R. Li, and B. Spang, “Coded trace reconstruction in a constant number of traces,” in Proc. Annu. Symp. Found. Comput. Sci. (FOCS), Durham, NC, USA, Nov. 2020, pp. 482–493.
- [10] E. Konstantinova, V. I. Levenshtein, and J. Siemons, “Reconstruction of permutations distorted by single transposition errors,” arXiv: 0702191, 2007.
- [11] E. Konstantinova, “On reconstruction of signed permutations distorted by reversal errors,” Discrete Math., vol. 308, no. 5, pp. 974–984, Mar. 2008.
- [12] ——, “Reconstruction of permutations distorted by reversal errors,” Discrete Appl. Math., vol. 155, no. 18, pp. 2426–2434, Nov. 2007.
- [13] V. I. Levenshtein, E. Konstantinova, E. Konstantinov, and S. Molodtsov, “Reconstruction of a graph from 2-vicinities of its vertices,” Discrete Appl. Math., vol. 156, no. 9, pp. 1399–1406, May 2008.
- [14] V. I. Levenshtein and J. Siemons, “Error graphs and the reconstruction of elements in groups,” J. Combinat. Theory, A, vol. 116, no. 4, pp. 795–815, May 2009.
- [15] E. Yaakobi, M. Schwartz, M. Langberg, and J. Bruck, “Sequence reconstruction for Grassmann graphs and permutations,” in Proc. Int. Symp. Inf. Theory (ISIT), Istanbul, Turkey, Oct. 2013, pp. 874–878.
- [16] R. Gabrys and E. Yaakobi, “Sequence reconstruction over the deletion channel,” in Proc. Int. Symp. Inf. Theory (ISIT), Barcelona, Spain, Aug. 2016, pp. 1596–1600.
- [17] ——, “Sequence Reconstruction Over the Deletion Channel,” IEEE Trans. Inf. Theory, vol. 64, no. 4, pp. 2924–2931, Apr. 2018.
- [18] F. Sala, R. Gabrys, C. Schoeny, and L. Dolecek, “Exact Reconstruction From Insertions in Synchronization Codes,” IEEE Trans. Inf. Theory, vol. 63, no. 4, pp. 2428–2445, Apr. 2017.
- [19] V. L. P. Pham, K. Goyal, and H. M. Kiah, “Sequence Reconstruction Problem for Deletion Channels: A Complete Asymptotic Solution,” arXiv:2111.04255v1, 2021.
- [20] H. M. Kiah, T. Thanh Nguyen, and E. Yaakobi, “Coding for Sequence Reconstruction for Single Edits,” in Proc. Int. Symp. Inf. Theory (ISIT), Los Angeles, CA, USA, 2020, pp. 676–681.
- [21] K. Cai, H. M. Kiah, T. T. Nguyen, and E. Yaakobi, “Coding for Sequence Reconstruction for Single Edits,” IEEE Trans. Inf. Theory, vol. 68, no. 1, pp. 66–79, Jan. 2022.
- [22] J. Chrisnata, H. M. Kiah, and E. Yaakobi, “Optimal Reconstruction Codes for Deletion Channels,” arXiv: 2004.06032, 2020.
- [23] V. Guruswami and J. Håstad, “Explicit two-deletion codes with redundancy matching the existential bound,” IEEE Trans. Inf. Theory, vol. 67, no. 10, pp. 6384–6394, Oct. 2021.
- [24] J. Chrisnata, H. M. Kiah, and E. Yaakobi, “Correcting Deletions with Multiple Reads,” IEEE Trans. Inf. Theory, vol. Early Access, 2022.
- [25] J. Brakensiek, V. Guruswami, and S. Zbarsky, “Efficient Low-Redundancy Codes for Correcting Multiple Deletions,” IEEE Trans. Inf. Theory, vol. 64, no. 5, pp. 3403–3410, May 2018.
- [26] R. Gabrys and F. Sala, “Codes Correcting Two Deletions,” IEEE Trans. Inf. Theory, vol. 65, no. 2, pp. 965–974, Feb. 2019.
- [27] J. Sima, N. Raviv, and J. Bruck, “Two Deletion Correcting Codes From Indicator Vectors,” IEEE Trans. Inf. Theory, vol. 66, no. 4, pp. 2375–2391, Apr. 2020.
- [28] J. Sima and J. Bruck, “On Optimal -Deletion Correcting Codes,” IEEE Trans. Inf. Theory, vol. 67, no. 6, pp. 3360–3375, Jun. 2021.
- [29] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, Feb. 1966.
- [30] N. J. A. Sloane, “On single-deletion-correcting codes,” Codes and Designs, vol. 10, pp. 273–291, May 2002.
- [31] Y. M. Chee, H. M. Kiah, A. Vardy, V. K. Vu, and E. Yaakobi, “Coding for Racetrack Memories,” IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 7094–7112, Nov. 2018.