This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Reconstruction of Sequences Distorted by Two Insertions

Zuo Ye, Xin Liu, Xiande Zhang and Gennian Ge This project was supported by the National Key Research and Development Program of China under Grant 2020YFA0712100, Grant 2018YFA0704703 and Grant 2020YFA0713100, the National Natural Science Foundation of China under Grant 11971325, Grant 12171452 and Grant 12231014, and Beijing Scholars Program.Z. Ye and X. Zhang are with the School of Mathematical Sciences, University of Science and Technology of China, Hefei 230026, Anhui, China (e-mails: [email protected]; [email protected]).X. Liu and G. Ge are with the School of Mathematical Sciences, Capital Normal University, Beijing 100048, China (e-mails: [email protected]; [email protected]).
Abstract

Reconstruction codes are generalizations of error-correcting codes that can correct errors by a given number of noisy reads. The study of such codes was initiated by Levenshtein in 2001 and developed recently due to applications in modern storage devices such as racetrack memories and DNA storage. The central problem on this topic is to design codes with redundancy as small as possible for a given number NN of noisy reads. In this paper, the minimum redundancy of such codes for binary channels with exactly two insertions is determined asymptotically for all values of N5N\geq 5. Previously, such codes were studied only for channels with single edit errors or two-deletion errors.

Index Terms:
Sequence reconstruction, insertion channel, DNA storage

I Introduction

The study of sequence reconstruction problems was initiated by Levenshtein in [1, 2, 3] to combat errors by repeatedly transmitting a message without coding, which is of interest in fields like informatics, molecular biology, and chemistry. The setting of this problem is as follows: the sender transmits a sequence through NN different noisy channels and the receiver receives all the channel outputs, called noisy reads; then the receiver aims to reconstruct the transmitted sequence with the help of these NN outputs. There are probabilistic and combinatorial versions of this problem. In the probabilistic version, the noisy reads are obtained by transmitting the sequence 𝒙\bm{x} through probabilistic channels, and the goal is to minimize the value of NN required for reconstructing 𝒙\bm{x} with high probability, see [2, 4, 5, 6, 7, 8, 9] and the references therein. In the combinatorial version, noisy reads are obtained by introducing the maximum number of errors, and the goal is to determine the minimum value of NN needed for zero-error reconstruction [2].

In this paper, we focus on the combinatorial version, which requires that all channel outputs are different from each other[2]. Then the problem of determining the minimum value of NN is reduced to determining the maximum intersection size of two distinct error balls. Specifically, the maximum intersection size is equal to N1N-1. In [2, 3], Levenshtein solved the problem for several error types, such as substitutions, insertions, deletions, transpositions, and asymmetric errors. In [10, 11, 12], permutation errors were analyzed. For general error graphs, see [2, 13, 14]. Note that these works considered the uncoded case, that is, the transmitted sequences are selected from the entire space. Due to applications in DNA storage, many researchers also investigated this problem under the setting where the transmitted sequences are chosen from a given code with a certain error-correcting property, see for example [15, 16, 17, 18, 19].

Recently, the dual problem of the sequence reconstruction was developed under the scenario that the number of noisy reads is a fixed system parameter. Motivated by applications in DNA-based data storage and racetrack memories, Kiah et al. [20, 21] and Chrisnata et al. [22] initiated the study of designing reconstruction codes, for which the size of intersections between any two error-balls is strictly less than the number of reads NN, for a given NN. Note that when N=1N=1, reconstruction codes are the classical error-correcting codes. When N>1N>1, one can increase the information capacity, or equivalently, reduce the number of redundant bits by leveraging on these multiple reads. The redundancy of a binary code 𝒞\mathcal{C} of length nn is defined to be nlog2|𝒞|n-\log_{2}|\mathcal{C}|. In [20, 21], the authors focused on channels that introduce single edit error, that is a single substitution, insertion, or deletion, and their variants. For single deletion binary codes, they showed that the number of redundant symbols required can be reduced from log2(n)+O(1)\log_{2}(n)+O(1) to log2log2(n)+O(1)\log_{2}\log_{2}(n)+O(1) if one increases the number of noisy reads NN from one to two. For two deletions, the best known 22-deletion correcting code (i.e., N=1N=1) has redundancy 4log2(n)+o(log2(n))4\log_{2}(n)+o(\log_{2}(n)) [23]. In [22, 24], Chrisnata et al. reduced this redundancy to 2log2(n)+o(log2(n))2\log_{2}(n)+o(\log_{2}(n)) by increasing the number of noisy reads NN from one to five, while each noisy read suffers exactly two deletions. For tt-deletions (with t2t\geq 2), it was proved in [24] that the reconstruction code with two reads for a single deletion [21] is able to reconstruct codewords from N=nt1/(t1)!+O(nt2)N=n^{t-1}/(t-1)!+O(n^{t-2}) distinct noisy reads.

It is well known that a code can correct tt-deletions (i.e., N=1N=1) if and only if it can correct tt-insertions. However, it is not the case when N>1N>1. When N>1N>1, the problem is closely related to the size of the intersection of two error balls. From [17], [18] and [21, Proposition 9], we can see that in general the intersection size of two tt-deletion balls is different from that of two tt-insertion balls.

In this paper, we study binary reconstruction codes for channels that cause exactly two insertion errors. We show that the number of redundant symbols required to reconstruct a codeword uniquely can be reduced from 4log2(n)+o(log2(n))4\log_{2}(n)+o(\log_{2}(n)) to log2(n)+O(log2log2(n))\log_{2}(n)+O(\log_{2}\log_{2}(n)) by increasing the number of noisy reads NN from one to five. The redundancy can be further reduced to log2log2(n)+Θ(1)\log_{2}\log_{2}(n)+\Theta(1) if the number of noisy reads is at least n+4n+4. For readers’ easy reference, we summarize the best known results on binary 22-deletion reconstruction codes and 22-insertion reconstruction codes in Table I. For general tt-insertions (with t>2t>2), we can show the similar result as in [24]: the reconstruction code with two reads for a single insertion [21] is able to reconstruct codewords from N=nt1/(t1)!+O(nt2)N=n^{t-1}/(t-1)!+O(n^{t-2}) distinct noisy reads.

TABLE I: Best known results, where DRC denotes deletion reconstruction codes and IRC denotes insertion reconstruction codes
The value of NN Redundancy Reference
22-DRC N4N\leq 4 Θ(log2(n))\Theta(\log_{2}(n)) [23, Theorem 1]
N=5N=5 2log2(n)+O(log2log2(n))2\log_{2}(n)+O(\log_{2}\log_{2}(n)) [24, Theorem 10]
N=7N=7 log2(n)+O(1)\log_{2}(n)+O(1) [24, Theorem 5]
N=n+1N=n+1 log2log2(n)+O(1)\log_{2}\log_{2}(n)+O(1) [24, Theorem 9]
N>2n4N>2n-4 0 [2, Page 7],[24, Theorem 3]
22-IRC N4N\leq 4 Θ(log2(n))\Theta(\log_{2}(n)) [23, Theorem 1]
N=5,6N=5,6 log2(n)+O(log2log2(n))\log_{2}(n)+O(\log_{2}\log_{2}(n)) Theorem V.2
6<Nn+36<N\leq n+3 log2(n)+Θ(1)\log_{2}(n)+\Theta(1) Proposition III.2
n+4N2n+4n+4\leq N\leq 2n+4 log2log2(n)+Θ(1)\log_{2}\log_{2}(n)+\Theta(1) Proposition III.2,Theorem IV.2
N>2n+4N>2n+4 0 Proposition III.2

This paper is organized as follows. In Section II, we introduce some notations and the problem statement formally. In Section III, we derive some preliminary results, which help us to solve the trivial cases. In Section IV, we first completely characterize the structures of two sequences when the intersection size of their 22-insertion balls is exactly n+4n+4 or n+5n+5. Then by utilizing the structures, we give explicit reconstruction codes for N=n+4N=n+4, which are asymptotically optimal in terms of redundancy. In Section V, by applying higher order parity checks on short subwords, we construct asymptotically optimal codes when N=5N=5. Finally, we conclude this paper in Section VI with some open problems.

II Notations and Problem statement

In this section, we introduce some notations and preliminary results needed throughout this paper. Let Σ2\Sigma_{2} denote the alphabet {0,1}\{0,1\}, and Σ2n\Sigma_{2}^{n} denote the set of all binary sequences of length nn. Further, let Σ2n=0Σ2n\Sigma_{2}^{*}\triangleq\mathop{\cup}\limits_{n=0}^{\infty}\Sigma_{2}^{n}, that is the set consisting of all binary sequences of finite length. Here if n=0n=0, we mean the empty sequence, denoted by \emptyset.

For a sequence 𝒙=x1xnΣ2n\bm{x}=x_{1}\cdots x_{n}\in\Sigma_{2}^{n} and a sequence 𝒛=z1ztΣ2t\bm{z}=z_{1}\cdots z_{t}\in\Sigma_{2}^{t}, where tt is an integer satisfying 1tn1\leq t\leq n, if there exist 1i1<<itn1\leq i_{1}<\cdots<i_{t}\leq n such that zj=xijz_{j}=x_{i_{j}} for all 1jt1\leq j\leq t, we say that 𝒛\bm{z} is a subsequence of 𝒙\bm{x}, or 𝒙\bm{x} is a supersequence of 𝒛\bm{z}. In particular, when ij+1=ij+1i_{j+1}=i_{j}+1 for all jj, we say that 𝒛\bm{z} is a subword of 𝒙\bm{x}, and use the notation 𝒙[l:k]\bm{x}[l:k] with l=i1l=i_{1} and k=itk=i_{t} to denote 𝒛\bm{z}. Let |𝒙||\bm{x}| denote the length of 𝒙\bm{x}. For any 𝒙Σ2n\bm{x}\in\Sigma_{2}^{n} and t0t\geq 0, let It(𝒙){𝒛Σ2n+t: 𝒛 is a supersequence of 𝒙}I_{t}(\bm{x})\triangleq\{\bm{z}\in\Sigma_{2}^{n+t}:\text{ }\bm{z}\text{ is a supersequence of }\bm{x}\}. We call It(𝒙)I_{t}(\bm{x}) the tt-insertion ball centered at 𝒙\bm{x}. It is known that |It(𝒙)|\left|I_{t}(\bm{x})\right| is independent of the choice of 𝒙\bm{x} by [2] and equal to

I(n,t)i=0t(n+ti).I(n,t)\triangleq\mathop{\sum}\limits_{i=0}^{t}\binom{n+t}{i}. (1)

The Levenshtein distance between 𝒙\bm{x} and 𝒚\bm{y} in Σ2n\Sigma_{2}^{n}, denoted by dL(𝒙,𝒚)d_{L}(\bm{x},\bm{y}), is the smallest integer \ell such that I(𝒙)I(𝒚)I_{\ell}(\bm{x})\cap I_{\ell}(\bm{y}) is not empty. For example, dL(10100,01001)=1d_{L}(10100,01001)=1. It is easy to see that 0dL(𝒙,𝒚)n0\leq d_{L}(\bm{x},\bm{y})\leq n, and dL(𝒙,𝒚)=0d_{L}(\bm{x},\bm{y})=0 if and only if 𝒙=𝒚\bm{x}=\bm{y}. Finally, the Hamming distance between 𝒙\bm{x} and 𝒚\bm{y}, denoted by dH(𝒙,𝒚)d_{H}(\bm{x},\bm{y}), is the number of coordinates where 𝒙\bm{x} and 𝒚\bm{y} differ, and the Hamming weight wtH(𝒙)wt_{H}(\bm{x}) of 𝒙\bm{x} is the number of nonzero coordinates of 𝒙\bm{x}.

The intersection size of two error balls is a key ingredient of the reconstruction problem in coding theory. By Levenshtein’s original framework [2], for channels causing tt insertion errors, it is always possible to exactly reconstruct 𝒙Σ2n\bm{x}\in\Sigma_{2}^{n} given N+(n,t)+1N^{+}(n,t)+1 distinct elements of It(𝒙)I_{t}(\bm{x}), where

N+(n,t)\displaystyle N^{+}(n,t) max{|It(𝒙)It(𝒚)|: 𝒙,𝒚Σ2n and 𝒙𝒚}\displaystyle\triangleq\max\{\left|I_{t}(\bm{x})\cap I_{t}(\bm{y})\right|:\text{ }\bm{x},\bm{y}\in\Sigma_{2}^{n}\text{ and }\bm{x}\neq\bm{y}\} (2)
=i=0t1(n+ti)(1(1)ti).\displaystyle=\mathop{\sum}\limits_{i=0}^{t-1}\binom{n+t}{i}(1-(-1)^{t-i}).

This notation was generalized in [18] for any two sequences with a minimum Levenshtein distance. For integers nn, t0t\geq\ell\geq 0, let

N+(n,t,)max{|It(𝒙)It(𝒚)|: 𝒙,𝒚Σ2n and dL(𝒙,𝒚)},N^{+}(n,t,\ell)\triangleq\max\{\left|I_{t}(\bm{x})\cap I_{t}(\bm{y})\right|:\text{ }\bm{x},\bm{y}\in\Sigma_{2}^{n}\text{ and }d_{L}(\bm{x},\bm{y})\geq\ell\},

then the explicit formula is given by

N+(n,t,)=j=ti=0tj(2jj)(t+ji2j)(n+ti)(1)t+ji.N^{+}(n,t,\ell)=\mathop{\sum}\limits_{j=\ell}^{t}\mathop{\sum}\limits_{i=0}^{t-j}\binom{2j}{j}\binom{t+j-i}{2j}\binom{n+t}{i}(-1)^{t+j-i}. (3)

Note that when =0,1\ell=0,1, the values of N+(n,t,l)N^{+}(n,t,l) reduce to I(n,t)I(n,t) and N+(n,t)N^{+}(n,t), respectively. See [18, Corollary 9] and [18, Corollary 10] for details.

Given a code 𝒞Σ2n\mathcal{C}\subseteq\Sigma_{2}^{n} with |𝒞|2|\mathcal{C}|\geq 2, the read coverage νt(𝒞)\nu_{t}(\mathcal{C}) of 𝒞\mathcal{C} after tt insertions is defined as

νt(𝒞)max{|It(𝒙)It(𝒚)|: 𝒙,𝒚𝒞 and 𝒙𝒚}.\nu_{t}(\mathcal{C})\triangleq\max\{\left|I_{t}(\bm{x})\cap I_{t}(\bm{y})\right|:\text{ }\bm{x},\bm{y}\in\mathcal{C}\text{ and }\bm{x}\neq\bm{y}\}. (4)

It is clear that if νt(𝒞)<N\nu_{t}(\mathcal{C})<N, we can uniquely recover any codeword in 𝒞\mathcal{C} by its NN distinct reads. So we have the following definition.

Definition II.1.

(Reconstruction codes) A code 𝒞Σ2n\mathcal{C}\subset\Sigma_{2}^{n} is an (n,N;BI(t))(n,N;B^{I(t)})-reconstruction code if νt(𝒞)<N\nu_{t}(\mathcal{C})<N, where νt(𝒞)\nu_{t}(\mathcal{C}) is the read coverage of 𝒞\mathcal{C} after tt insertions.

Given a code 𝒞Σ2n\mathcal{C}\subset\Sigma_{2}^{n}, the redundancy of 𝒞\mathcal{C} is defined to be the value nlog2|𝒞|n-\log_{2}\left|\mathcal{C}\right|. Then we are interested in the following quantity

ρ(n,N;BI(t))min{nlog2|𝒞|: 𝒞Σ2n and νt(𝒞)<N},\rho(n,N;B^{I(t)})\triangleq\min\{n-\log_{2}\left|\mathcal{C}\right|:\text{ }\mathcal{C}\subseteq\Sigma_{2}^{n}\text{ and }\nu_{t}(\mathcal{C})<N\},

which is called the optimal redundancy of an (n,N;B2I(t))(n,N;B_{2}^{I(t)})-reconstruction code. By definition, an (n,N;BI(t))(n,N;B^{I(t)})-reconstruction code is also an (n,N;BI(t))(n,N^{\prime};B^{I(t)})-reconstruction code for any N>NN^{\prime}>N. So we have ρ(n,N;BI(t))ρ(n,N+1;BI(t))\rho(n,N;B^{I(t)})\geq\rho(n,N+1;B^{I(t)}) for all N1N\geq 1.

When N=1N=1, an (n,1;BI(t))(n,1;B^{I(t)})-reconstruction code is indeed a tt-insertion correcting code, or equivalently a tt-deletion correcting code. This class of codes has been extensively studied in recent years, see [25, 26, 27, 28, 23] for more details. For t=1t=1, we have the following single deletion correcting code [29]:

{𝒙=x1xnΣ2n:i=1nixia(modn+1)},\left\{\bm{x}=x_{1}\cdots x_{n}\in\Sigma_{2}^{n}\;:\;\mathop{\sum}\limits_{i=1}^{n}ix_{i}\equiv a\pmod{n+1}\right\}, (5)

where aa is a fixed integer between 0 and nn. This is the famous Varshamov-Tenengolts (VT) code, which has asymptotically optimal redundancy log2(n+1)\log_{2}(n+1) when a=0a=0 [30, Corollary 2.3]. In [21, Theorem 24], Cai et al. determined the asymptotic value of ρ(n,N;BI(1))\rho(n,N;B^{I(1)}) for all N1N\geq 1 as below.

Theorem II.1.

Taking q=2q=2 in [21, Theorem 24], we have

ρ(n,N;BI(1))={log2(n)+Θ(1),if N=1,log2log2(n)+Θ(1),if N=2,0,if N3.\rho(n,N;B^{I(1)})=\begin{cases}\log_{2}(n)+\Theta(1),&\mbox{if }N=1,\\ \log_{2}\log_{2}(n)+\Theta(1),&\mbox{if }N=2,\\ 0,&\mbox{if }N\geq 3.\end{cases}

Our work is to extend the result in Theorem II.1 to the case of t=2t=2. As will be shown in Corollary III.2 and the paragraph prior to Proposition III.2, the lower bound of the optimal redundancy can be deduced from the case t=1t=1. The upper bound needs explicit code constructions. So the main task is to construct an (n,N;BI(2))(n,N;B^{I(2)})-reconstruction code with redundancy as small as possible, for any given NN. Before closing this section, we introduce the following notations which will be used later.

For 𝒙,𝒚Σ2\bm{x},\bm{y}\in\Sigma_{2}^{*}, we write 𝒙𝒚\bm{x}\bm{y} for the concatenation of 𝒙\bm{x} and 𝒚\bm{y}. In particular, if i1i\geq 1, we define 𝒙i\bm{x}^{i} to be the sequence obtained by concatenating ii 𝒙\bm{x}’s; if i=0i=0, 𝒙0\bm{x}^{0} is the empty sequence. If 𝒖=u1unΣ2n\bm{u}=u_{1}\cdots u_{n}\in\Sigma_{2}^{n}, we denote 𝒖¯=(1u1)(1un)\overline{\bm{u}}=(1-u_{1})\cdots(1-u_{n}) as the complementary sequence of 𝒖\bm{u}. For any SΣ2S\subseteq\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}, we denote Sa{𝒙S: 𝒙 starts with a}S^{a}\triangleq\{\bm{x}\in S:\text{ }\bm{x}\text{ starts with }a\}, Sb{𝒙S: 𝒙 ends with b}S_{b}\triangleq\{\bm{x}\in S:\text{ }\bm{x}\text{ ends with }b\}, Sba{𝒙S: 𝒙 starts with a and ends with b}S_{b}^{a}\triangleq\{\bm{x}\in S:\text{ }\bm{x}\text{ starts with }a\text{ and ends with }b\}, aS{a𝒚: 𝒚S}aS\triangleq\{a\bm{y}:\text{ }\bm{y}\in S\}, and Sb{𝒚b: 𝒚S}Sb\triangleq\{\bm{y}b:\text{ }\bm{y}\in S\}.

Example II.1.

Let 𝐱=10\bm{x}=10, 𝐲=101\bm{y}=101 and 𝐮=101101\bm{u}=101101. Then 𝐱𝐲=10101\bm{x}\bm{y}=10101, 𝐱2=1010\bm{x}^{2}=1010 and 𝐮¯=010010\overline{\bm{u}}=010010. Let S={10101,0101,01010}S=\left\{10101,0101,01010\right\}, a=0a=0 and b=1b=1. Then Sa={0101,01010}S^{a}=\left\{0101,01010\right\}, Sb={10101,0101}S_{b}=\left\{10101,0101\right\}, Sba={0101}S_{b}^{a}=\left\{0101\right\}, aS={010101,00101,001010}aS=\left\{010101,00101,001010\right\} and Sb={101011,01011,010101}Sb=\left\{101011,01011,010101\right\}.

Given a sequence 𝒙=x1xnΣ2n\bm{x}=x_{1}\cdots x_{n}\in\Sigma_{2}^{n} with n2n\geq 2, we say that 𝒙\bm{x} has period \ell (n\leq n) if \ell is the smallest integer such that xi=xi+x_{i}=x_{i+\ell} for all 1in1\leq i\leq n-\ell. A sequence with period 22 is called alternating. We also view an empty sequence or a length-one sequence as an alternating sequence. For example, \emptyset, 11, 0, 1010, 0101, 101101 are all alternating sequences.

III Preliminary results

As mentioned before, the intersection size of two error balls is a key ingredient of the reconstruction problem. In this section, we characterize the structures of two binary sequences whose intersection of 22-insertion balls are of a certain size.

Let us first recall the results for a single insertion, that is the size of I1(𝒙)I1(𝒚)I_{1}(\bm{x})\cap I_{1}(\bm{y}). By Equation (2), we know that |I1(𝒙)I1(𝒚)|2\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|\leq 2 for any two distinct sequences 𝒙,𝒚Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n}. The characterization of 𝒙,𝒚\bm{x},\bm{y} when the equality holds was given in [21], which needs the following notations.

Definition III.1.

(Type-A confusability) Suppose that 𝐱𝐲Σ2n\bm{x}\neq\bm{y}\in\Sigma_{2}^{n}. We say that 𝐱,𝐲\bm{x},\bm{y} are Type-A confusable, if there exist 𝐮\bm{u}, 𝐯\bm{v}, 𝐰\bm{w} Σ2\in\Sigma_{2}^{*} such that

𝒙=𝒖𝒘𝒗 and 𝒚=𝒖𝒘¯𝒗,\bm{x}=\bm{u}\bm{w}\bm{v}\text{ and }\bm{y}=\bm{u}\overline{\bm{w}}\bm{v},

where 𝐰\bm{w} is an alternating sequence of length at least one, i.e., 𝐰=(aa¯)i\bm{w}=(a\bar{a})^{i} for some i1i\geq 1 or 𝐰=(aa¯)ja\bm{w}=(a\bar{a})^{j}a for some j0j\geq 0, where a=0a=0 or 11.

For example, the two sequences 110110 and 100100 are Type-A confusable with 𝒖=1\bm{u}=1, 𝒘=1\bm{w}=1 and 𝒗=0\bm{v}=0. In general, if the Hamming distance dH(𝒙,𝒚)=1d_{H}(\bm{x},\bm{y})=1, then 𝒙\bm{x} and 𝒚\bm{y} are always Type-A confusable by definition. Note that the case where dH(𝒙,𝒚)=1d_{H}(\bm{x},\bm{y})=1 was not included in the original definition in [21, Definition 8], where j1j\geq 1 is required when 𝒘=(aa¯)ja\bm{w}=(a\bar{a})^{j}a. However, the following statement is still true, which is just a straightforward combination of the two claims in [21, Proposition 9].

Lemma III.1.

[21, Proposition 9] Let 𝐱,𝐲Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n} be distinct. Then |I1(𝐱)I1(𝐲)|\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right| =2=2 if and only if 𝐱,𝐲\bm{x},\bm{y} are Type-A confusable.

In fact, if two sequences 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable, then

{𝒙=𝒖(aa¯)m+1𝒘𝒚=𝒖(a¯a)m+1𝒘 or {𝒙=𝒖(aa¯)ma𝒘𝒚=𝒖(a¯a)ma¯𝒘\left\{\begin{array}[]{l}\bm{x}=\bm{u}(a\bar{a})^{m+1}\bm{w}\\ \bm{y}=\bm{u}(\bar{a}a)^{m+1}\bm{w}\end{array}\right.\text{\ or\ }\left\{\begin{array}[]{l}\bm{x}=\bm{u}(a\bar{a})^{m}a\bm{w}\\ \bm{y}=\bm{u}(\bar{a}a)^{m}\bar{a}\bm{w}\end{array}\right.

for some 𝒖,𝒘Σ2\bm{u},\bm{w}\in\Sigma_{2}^{*}, aΣ2a\in\Sigma_{2} and m0m\geq 0. By Lemma III.1, I1(𝒙)I1(𝒚)={𝒖(a¯a)m+1a¯𝒘,𝒖(aa¯)m+1a𝒘}I_{1}(\bm{x})\cap I_{1}(\bm{y})=\{\bm{u}(\bar{a}a)^{m+1}\bar{a}\bm{w},\bm{u}(a\bar{a})^{m+1}a\bm{w}\} for the left case and I1(𝒙)I1(𝒚)={𝒖(a¯a)m+1𝒘,𝒖(aa¯)m+1𝒘}I_{1}(\bm{x})\cap I_{1}(\bm{y})=\{\bm{u}(\bar{a}a)^{m+1}\bm{w},\bm{u}(a\bar{a})^{m+1}\bm{w}\} for the right case. Next, we define the concept of Type-B confusability, which was introduced in [22, Definition 18] and [24, Definition 11]. This concept will help to characterize the case |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1.

Definition III.2.

(Type-B confusability) Suppose that 𝐱𝐲Σ2n\bm{x}\neq\bm{y}\in\Sigma_{2}^{n} with n3n\geq 3. We say that 𝐱,𝐲\bm{x},\bm{y} are Type-B confusable, if there exist 𝐮\bm{u}, 𝐯\bm{v}, 𝐰\bm{w} Σ2\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2} such that

{𝒙,𝒚}={𝒖aa¯𝒗b𝒘,𝒖a¯𝒗bb¯𝒘}.\left\{\bm{x},\bm{y}\right\}=\left\{\bm{u}a\bar{a}\bm{v}b\bm{w},\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w}\right\}.
Example III.1.

We give some examples of these two types of confusability.

  • Let 𝒙=11101010\bm{x}=11101010 and 𝒚=11010110\bm{y}=11010110. Then 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable with 𝒖=11\bm{u}=11, 𝒘=1010\bm{w}=1010 and 𝒗=10\bm{v}=10. They are also Type-B confusable with 𝒖=11\bm{u}=11, 𝒗=1\bm{v}=1, 𝒘=10\bm{w}=10, a=1a=1 and b=0b=0.

  • Let 𝒙=111010\bm{x}=111010 and 𝒚=110110\bm{y}=110110. Then 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable with 𝒖=11\bm{u}=11 and 𝒘=𝒗=10\bm{w}=\bm{v}=10. Obviously, they are not Type-B confusable.

  • Let 𝒙=1110100110\bm{x}=1110100110 and 𝒚=1101001010\bm{y}=1101001010. Then 𝒙\bm{x} and 𝒚\bm{y} are Type-B confusable with 𝒖=11\bm{u}=11, 𝒗=100\bm{v}=100, 𝒘=10\bm{w}=10 and a=b=1a=b=1. It is easy to see that they are not Type-A confusable.

  • The two sequences 00110011 and 11101110 are neither Type-A confusable nor Type-B confusable.

From Example III.1, we can see that Type-A confusability and Type-B confusability do not contain each other. In general, we have the following proposition, which is easy to verify.

Proposition III.1.

From Definition III.1 and Definition III.2, one can easily check that

  • (1)

    if 𝒙=𝒖𝒘𝒗\bm{x}=\bm{u}\bm{w}\bm{v}, 𝒚=𝒖𝒘¯𝒗\bm{y}=\bm{u}\overline{\bm{w}}\bm{v} Σ2n\in\Sigma_{2}^{n} are Type-A confusable, then they are Type-B confusable if and only if 𝒘{(10)m,(01)m}\bm{w}\in\{(10)^{m},(01)^{m}\} for some m2m\geq 2, or 𝒘{(10)m1,(01)m0}\bm{w}\in\{(10)^{m}1,(01)^{m}0\} for some m1m\geq 1;

  • (2)

    if 𝒙=𝒖aa¯𝒗b𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w}, 𝒚=𝒖a¯𝒗bb¯𝒘\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w} Σ2n\in\Sigma_{2}^{n} are Type-B confusable, then they are Type-A confusable if and only if one of the following two conditions is satisfied:

    • (i)

      a=ba=b and 𝒗=(aa¯)m\bm{v}=(a\bar{a})^{m} for some m0m\geq 0;

    • (ii)

      a=b¯a=\bar{b} and 𝒗=(aa¯)ma\bm{v}=(a\bar{a})^{m}a for some m0m\geq 0.

𝒖\bm{u}aa𝒅\bm{d}bb𝒘\bm{w}𝒙\bm{x}𝒖\bm{u}a¯\bar{a}𝒆\bm{e}b¯\bar{b}𝒘\bm{w}𝒚\bm{y}iijj
Figure 1: We denote ii and jj to be the leftmost and rightmost indices where 𝒙\bm{x} and 𝒚\bm{y} differ.
Lemma III.2.

Suppose 𝐱,𝐲Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n}. If |I1(𝐱)I1(𝐲)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, then 𝐱,𝐲\bm{x},\bm{y} are Type-B confusable.

Proof:

Since |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, then by Lemma III.1, 𝒙\bm{x} and 𝒚\bm{y} are not Type-A confusable, which further implies that dH(𝒙,𝒚)2d_{H}(\bm{x},\bm{y})\geq 2. So we can assume that

𝒙=𝒖a𝒅b𝒘, and 𝒚=𝒖a¯𝒆b¯𝒘,\bm{x}=\bm{u}a\bm{d}b\bm{w},\text{ and }\bm{y}=\bm{u}\bar{a}\bm{e}\bar{b}\bm{w},

where a,bΣ2a,b\in\Sigma_{2} and 𝒖,𝒅,𝒆,𝒘Σ2\bm{u},\bm{d},\bm{e},\bm{w}\in\Sigma_{2}^{*}. Let i,ji,j be the leftmost and rightmost indices, respectively, where 𝒙\bm{x} and 𝒚\bm{y} differ. See Figure 1 for a depiction.

Suppose that 𝒅=𝒆=\bm{d}=\bm{e}=\emptyset. If a=b¯a=\bar{b}, then 𝒙\bm{x}, 𝒚\bm{y} are Type-A confusable, which is a contradiction. If a=ba=b, then |wtH(𝒙)wtH(𝒚)|=2|wt_{H}(\bm{x})-wt_{H}(\bm{y})|=2 and thus |I1(𝒙)I1(𝒚)|=0\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=0, a contradiction. Therefore, we conclude that 𝒅\bm{d} and 𝒆\bm{e} are both nonempty.

Let 𝒛=z1z2zn+1\bm{z}=z_{1}z_{2}\cdots z_{n+1} and {𝒛}=I1(𝒙)I1(𝒚)\{\bm{z}\}=I_{1}(\bm{x})\cap I_{1}(\bm{y}). Suppose that zkz_{k} and zz_{\ell} are inserted in 𝒙\bm{x} and 𝒚\bm{y} to get 𝒛\bm{z}, respectively. Then there are only two possibilities: kik\leq i and >j\ell>j, or k>jk>j and i\ell\leq i. If kik\leq i and >j\ell>j, then 𝒛=𝒖a¯a𝒅b𝒘=𝒖a¯𝒆b¯b𝒘\bm{z}=\bm{u}\bar{a}a\bm{d}b\bm{w}=\bm{u}\bar{a}\bm{e}\bar{b}b\bm{w} and thus a𝒅=𝒆b¯a\bm{d}=\bm{e}\bar{b}. This means that there exists 𝒗Σ2\bm{v}\in\Sigma_{2}^{*} such that 𝒅=𝒗b¯\bm{d}=\bm{v}\bar{b} and 𝒆=a𝒗\bm{e}=a\bm{v}. Therefore, 𝒙,𝒚\bm{x},\bm{y} are Type-B confusable. If k>jk>j and i\ell\leq i, the proof is similar. ∎

By Lemma III.2, if |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, then we must have {𝒙,𝒚}={𝒖aa¯𝒗b𝒘,𝒖a¯𝒗bb¯𝒘}\left\{\bm{x},\bm{y}\right\}=\left\{\bm{u}a\bar{a}\bm{v}b\bm{w},\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w}\right\} for some 𝒖\bm{u}, 𝒗\bm{v}, 𝒘\bm{w} Σ2\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}. Consequently, 𝒛=𝒖aa¯𝒗bb¯𝒘\bm{z}=\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w} is the unique sequence in I1(𝒙)I1(𝒚)I_{1}(\bm{x})\cap I_{1}(\bm{y}). Combining Lemma III.1 and Lemma III.2, we completely characterize the structures of two binary sequences for which the intersection of 11-insertion balls is of a certain size. See the following corollary and Figure 2.

Corollary III.1.

Let 𝐱,𝐲Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n} be distinct. Then

  • |I1(𝒙)I1(𝒚)|=0\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=0 if and only if 𝒙\bm{x} and 𝒚\bm{y} are neither Type-A confusable, nor Type-B confusable;

  • |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1 if and only if 𝒙\bm{x} and 𝒚\bm{y} are Type-B confusable, but not Type-A confusable;

  • |I1(𝒙)I1(𝒚)|=2\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=2 if and only if 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable.

Now we proceed to estimate the intersection size of two 22-insertions balls, that is |I2(𝒙)I2(𝒚)|\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|, based on the value of |I1(𝒙)I1(𝒚)|\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|. By Equation (2), we have that N+(n,2)=2n+4N^{+}(n,2)=2n+4. That is, |I2(𝒙)I2(𝒚)|2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq 2n+4 for any 𝒙𝒚Σ2n\bm{x}\neq\bm{y}\in\Sigma_{2}^{n}. The following lemma further restricts its value in the case that |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1.

Lemma III.3.

Let n3n\geq 3 and 𝐱,𝐲Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n}. If |I1(𝐱)I1(𝐲)|=1|I_{1}(\bm{x})\cap I_{1}(\bm{y})|=1, then n+3|I2(𝐱)I2(𝐲)|n+5n+3\leq\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq n+5. In particular, when 𝐱=aa¯𝐯b\bm{x}=a\bar{a}\bm{v}b and 𝐲=a¯𝐯bb¯\bm{y}=\bar{a}\bm{v}b\bar{b}, we have

  • |I2(𝒙)I2(𝒚)|=n+3\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+3 if and only if aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} are neither Type-B confusable, nor Type-A confusable;

  • |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+4 if and only if aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} are Type-B confusable, but not Type-A confusable;

  • |I2(𝒙)I2(𝒚)|=n+5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+5 if and only if aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} are Type-A confusable.

Proof:

Since |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, we conclude that 𝒙,𝒚\bm{x},\bm{y} are Type-B confusable by Lemma III.2. Assume that 𝒙=𝒖aa¯𝒗b𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w} and 𝒚=𝒖a¯𝒗bb¯𝒘\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w} for some a,bΣ2a,b\in\Sigma_{2} and 𝒖,𝒗,𝒘Σ2\bm{u},\bm{v},\bm{w}\in\Sigma_{2}^{*}. Let I1(𝒙)I1(𝒚)={𝒛}I_{1}(\bm{x})\cap I_{1}(\bm{y})=\{\bm{z}\}. Then 𝒛=𝒖aa¯𝒗bb¯𝒘\bm{z}=\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w}. Let S=I2(𝒙)I2(𝒚)S=I_{2}(\bm{x})\cap I_{2}(\bm{y}). We prove that n+3|S|n+5n+3\leq|S|\leq n+5 by induction on the length nn.

The base case is 𝒖=𝒘=\bm{u}=\bm{w}=\emptyset, that is, 𝒙=aa¯𝒗b,𝒚=a¯𝒗bb¯\bm{x}=a\bar{a}\bm{v}b,\bm{y}=\bar{a}\bm{v}b\bar{b} and 𝒛=aa¯𝒗bb¯\bm{z}=a\bar{a}\bm{v}b\bar{b}. In this case, we have S=SaSb¯Sba¯S=S^{a}\cup S_{\bar{b}}\cup S_{b}^{\bar{a}} and

Sa=a(I2(a¯𝒗b)I1(a¯𝒗bb¯))=aI1(a¯𝒗bb¯)I1(𝒛),Sb¯=(I1(aa¯𝒗b)I2(a¯𝒗b))b¯=I1(aa¯𝒗b)b¯I1(𝒛),Sba¯=a¯(I1(aa¯𝒗)I1(𝒗bb¯))b.\begin{array}[]{l}S^{a}=a\left(I_{2}(\bar{a}\bm{v}b)\cap I_{1}(\bar{a}\bm{v}b\bar{b})\right)=aI_{1}(\bar{a}\bm{v}b\bar{b})\subseteq I_{1}(\bm{z}),\\ S_{\bar{b}}=\left(I_{1}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b)\right)\bar{b}=I_{1}(a\bar{a}\bm{v}b)\bar{b}\subseteq I_{1}(\bm{z}),\\ S_{b}^{\bar{a}}=\bar{a}\left(I_{1}(a\bar{a}\bm{v})\cap I_{1}(\bm{v}b\bar{b})\right)b.\end{array}

The second equality in the first line follows from the fact that I2(a¯𝒗b)I1(a¯𝒗bb¯)=I1(a¯𝒗bb¯)I_{2}(\bar{a}\bm{v}b)\cap I_{1}(\bar{a}\bm{v}b\bar{b})=I_{1}(\bar{a}\bm{v}b\bar{b}), and similarly for the second equality in the second line.

By the form of 𝒛\bm{z}, we have I1(𝒛)SaSb¯I_{1}(\bm{z})\subseteq S^{a}\cup S_{\bar{b}} and I1(𝒛)Sba¯=I_{1}(\bm{z})\cap S_{b}^{\bar{a}}=\emptyset. Thus SS is the disjoint union of I1(𝒛)I_{1}(\bm{z}) and Sba¯S_{b}^{\bar{a}}. By Equation 1, |I1(𝒛)|=n+3|I_{1}(\bm{z})|=n+3. So it remains to show that |Sba¯|2|S_{b}^{\bar{a}}|\leq 2, or equivalently aa¯𝒗𝒗bb¯a\bar{a}\bm{v}\neq\bm{v}b\bar{b} by Equation 2. Otherwise, if aa¯𝒗=𝒗bb¯a\bar{a}\bm{v}=\bm{v}b\bar{b}, then a=ba=b and 𝒗=(aa¯)m\bm{v}=(a\bar{a})^{m}, or a=b¯a=\bar{b} and 𝒗=(aa¯)ma\bm{v}=(a\bar{a})^{m}a for some m0m\geq 0. For both cases, 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable by Proposition III.1 (2), which contradicts the fact that |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1. This completes the proof of the base case by Corollary III.1.

In the base case, we have shown that S=I1(𝒛)J(𝒙,𝒚)S=I_{1}(\bm{z})\cup J({\bm{x}},{\bm{y}}) for some set J(𝒙,𝒚)J({\bm{x}},{\bm{y}}) (i.e. Sba¯S_{b}^{\bar{a}}) of size at most two. Suppose we have proved the same conclusion for the codeword length n1n-1, and we will prove it for length nn. Now 𝒙=𝒖aa¯𝒗b𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w}, 𝒚=𝒖a¯𝒗bb¯𝒘\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w} and 𝒛=𝒖aa¯𝒗bb¯𝒘\bm{z}=\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w}. Without loss of generality, we assume that |𝒖|1|\bm{u}|\geq 1 and 𝒖=c𝒖\bm{u}=c\bm{u}^{*} for some cΣ2c\in\Sigma_{2} and 𝒖Σ2\bm{u}^{*}\in\Sigma_{2}^{*}. Then Sc¯S^{\bar{c}} == c¯(I1(𝒙)I1(𝒚))\bar{c}\left(I_{1}(\bm{x})\cap I_{1}(\bm{y})\right) == {c¯𝒛}=I1(𝒛)c¯\{\bar{c}\bm{z}\}=I_{1}(\bm{z})^{\bar{c}}. Denote 𝒙~=𝒖aa¯𝒗b𝒘\widetilde{\bm{x}}=\bm{u}^{*}a\bar{a}\bm{v}b\bm{w}, 𝒚~=𝒖a¯𝒗bb¯𝒘\widetilde{\bm{y}}=\bm{u}^{*}\bar{a}\bm{v}b\bar{b}\bm{w} and 𝒛~=𝒖aa¯𝒗bb¯𝒘\widetilde{\bm{z}}=\bm{u}^{*}a\bar{a}\bm{v}b\bar{b}\bm{w}, which are obtained from 𝒙,𝒚,𝒛\bm{x},\bm{y},\bm{z} by deleting the first element cc. By the induction hypothesis, I2(𝒙~)I2(𝒚~)=I1(𝒛~)J(𝒙~,𝒚~)I_{2}(\widetilde{\bm{x}})\cap I_{2}(\widetilde{\bm{y}})=I_{1}(\widetilde{\bm{z}})\cup J(\widetilde{\bm{x}},\widetilde{\bm{y}}), for some set J(𝒙~,𝒚~)J(\widetilde{\bm{x}},\widetilde{\bm{y}}) of size at most two. Then Sc=c(I2(𝒙~)I2(𝒚~))=cI1(𝒛~)cJ(𝒙~,𝒚~)=I1(𝒛)ccJ(𝒙~,𝒚~)S^{c}=c\left(I_{2}(\widetilde{\bm{x}})\cap I_{2}(\widetilde{\bm{y}})\right)=cI_{1}(\widetilde{\bm{z}})\cup cJ(\widetilde{\bm{x}},\widetilde{\bm{y}})=I_{1}(\bm{z})^{c}\cup cJ(\widetilde{\bm{x}},\widetilde{\bm{y}}). Hence S=Sc¯Sc=I1(𝒛)cJ(𝒙~,𝒚~)S=S^{\bar{c}}\cup S^{c}=I_{1}(\bm{z})\cup cJ(\widetilde{\bm{x}},\widetilde{\bm{y}}), and this completes the proof. ∎

By Lemma III.3, we are able to give a rough estimate of |I2(𝒙)I2(𝒚)|\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right| based on the different values of |I1(𝒙)I1(𝒚)|\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|, and consequently based on the confusability of 𝒙\bm{x} and 𝒚\bm{y} by Corollary III.1. See the following lemma and Figure 2.

I1=2I_{1}=2I2=2n+4I_{2}=2n+4I1=1I_{1}=1n+3I2n+5n+3\leq I_{2}\leq n+5I1=0I_{1}=0I26I_{2}\leq 6Type-AType-B not Type-Aneither Type-A nor Type-B
Figure 2: The relation of two kinds of confusability and the intersection sizes, where I1|I1(𝒙)I1(𝒚)|I_{1}\triangleq\left|I_{1}\left(\bm{x}\right)\cap I_{1}\left(\bm{y}\right)\right| and I2|I2(𝒙)I2(𝒚)|I_{2}\triangleq\left|I_{2}\left(\bm{x}\right)\cap I_{2}\left(\bm{y}\right)\right|.
Lemma III.4.

Let 𝐱𝐲Σ2n\bm{x}\neq\bm{y}\in\Sigma_{2}^{n} where n4n\geq 4. Then

  1. (i)

    |I1(𝒙)I1(𝒚)|=2\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=2 if and only if |I2(𝒙)I2(𝒚)|=2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=2n+4;

  2. (ii)

    |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1 if and only if n+3|I2(𝒙)I2(𝒚)|n+5n+3\leq\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq n+5;

  3. (iii)

    |I1(𝒙)I1(𝒚)|=0\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=0 if and only if |I2(𝒙)I2(𝒚)|6\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq 6.

Proof:

(i) We prove the sufficiency by contradiction. If |I1(𝒙)I1(𝒚)|=0\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=0, that is dL(𝒙,𝒚)2d_{L}(\bm{x},\bm{y})\geq 2, then |I2(𝒙)I2(𝒚)|N+(n,2,2)=6<2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq N^{+}(n,2,2)=6<2n+4 by Equation 3. If |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, then |I2(𝒙)I2(𝒚)|n+5<2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq n+5<2n+4 by Lemma III.3. So we must have |I1(𝒙)I1(𝒚)|=2\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=2.

For the necessity, recall that |I2(𝒙)I2(𝒚)|2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq 2n+4. Since |I1(𝒙)I1(𝒚)|\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right| =2=2, by Lemma III.1, there must be some 𝒖,𝒘Σ2\bm{u},\bm{w}\in\Sigma_{2}^{*}, aΣ2a\in\Sigma_{2} and m0m\geq 0, such that

{𝒙=𝒖(aa¯)m+1𝒘𝒚=𝒖(a¯a)m+1𝒘 or {𝒙=𝒖(aa¯)ma𝒘𝒚=𝒖(a¯a)ma¯𝒘.\left\{\begin{array}[]{l}\bm{x}=\bm{u}(a\bar{a})^{m+1}\bm{w}\\ \bm{y}=\bm{u}(\bar{a}a)^{m+1}\bm{w}\end{array}\right.\text{\ or\ }\left\{\begin{array}[]{l}\bm{x}=\bm{u}(a\bar{a})^{m}a\bm{w}\\ \bm{y}=\bm{u}(\bar{a}a)^{m}\bar{a}\bm{w}\end{array}.\right.

For the left case, let 𝒛1=𝒖(a¯a)m+1a¯𝒘\bm{z}_{1}=\bm{u}(\bar{a}a)^{m+1}\bar{a}\bm{w} and 𝒛2=𝒖(aa¯)m+1a𝒘\bm{z}_{2}=\bm{u}(a\bar{a})^{m+1}a\bm{w}. Then I1(𝒙)I1(𝒚)={𝒛1,𝒛2}I_{1}(\bm{x})\cap I_{1}(\bm{y})=\{\bm{z}_{1},\bm{z}_{2}\} and I1(𝒛1)I1(𝒛2)={𝒖(a¯a)m+2𝒘,𝒖(aa¯)m+2𝒘}I_{1}(\bm{z}_{1})\cap I_{1}(\bm{z}_{2})=\{\bm{u}(\bar{a}a)^{m+2}\bm{w},\bm{u}(a\bar{a})^{m+2}\bm{w}\}. Noting that |I1(𝒛1)|=|I1(𝒛2)|=n+3\left|I_{1}(\bm{z}_{1})\right|=\left|I_{1}(\bm{z}_{2})\right|=n+3 by Equation 1, we have |I2(𝒙)I2(𝒚)||I1(𝒛1)I1(𝒛2)|\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\geq|I_{1}(\bm{z}_{1})\cup I_{1}(\bm{z}_{2})| 2n+4\geq 2n+4. Therefore, |I2(𝒙)I2(𝒚)|=2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=2n+4. For the right case, the proof is similar.

The claim (ii) is clear by Lemma III.3, the claim (i) and the fact that |I2(𝒙)I2(𝒚)|6\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq 6 if |I1(𝒙)I1(𝒚)|=0\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=0. The claim (iii) then follows too. ∎

By Lemma III.4 and Figure 2, it is immediate to have the following relations between reconstruction codes for two insertions and those for single insertion under certain conditions. See Figure 3 for a depiction.

Corollary III.2.

Let n4n\geq 4 and 𝒞Σ2n\mathcal{C}\subset\Sigma_{2}^{n}.

  1. (i)

    When n+5<N2n+4n+5<N\leq 2n+4, 𝒞\mathcal{C} is an (n,N;BI(2))(n,N;B^{I(2)})-reconstruction code if and only if 𝒞\mathcal{C} is an (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code.

  2. (ii)

    When 6<Nn+36<N\leq n+3, 𝒞\mathcal{C} is an (n,N;BI(2))(n,N;B^{I(2)})-reconstruction code if and only if 𝒞\mathcal{C} is an (n,1;BI(1))(n,1;B^{I(1)})-reconstruction code.

N13,N22n+5N_{1}\geq 3,N_{2}\geq 2n+5N1=2N_{1}=2n+5<N22n+4n+5<N_{2}\leq 2n+4N2=n+4,n+5N_{2}=n+4,n+5N1=1N_{1}=16<N2n+36<N_{2}\leq n+31N261\leq N_{2}\leq 6
Figure 3: Relations between (n,N1;BI(1))(n,N_{1};B^{I(1)})-reconstruction codes and (n,N2;BI(2))(n,N_{2};B^{I(2)})-reconstruction codes, and a hierarchy relation for different values of NN. Values of N1N_{1} and N2N_{2} in the same layer satisfy the “if and only if” statement, that is an (n,N1;BI(1))(n,N_{1};B^{I(1)})-reconstruction code is an (n,N2;BI(2))(n,N_{2};B^{I(2)})-reconstruction code, and vice verse. Values NN in a smaller square and values NN^{\prime} in a bigger square satisfy the “inclusion” statement, that is, an (n,N;BI(t))(n,N;B^{I(t)})-reconstruction code is an (n,N;BI(t))(n,N^{\prime};B^{I(t)})-reconstruction code for t=1,2t=1,2, but the converse is not true.

By now, we are able to determine ρ(n,N;BI(2))\rho(n,N;B^{I(2)}) for most values of NN. First, since |I2(𝒙)I2(𝒚)|2n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\leq 2n+4 for any 𝒙𝒚Σ2n\bm{x}\neq\bm{y}\in\Sigma_{2}^{n}, we have ν2(Σ2n)=2n+4\nu_{2}(\Sigma_{2}^{n})=2n+4 and ρ(n,N;BI(2))=0\rho(n,N;B^{I(2)})=0 for any N>2n+4N>2n+4. By Lemma III.4, we see that if N=n+4,n+5N=n+4,n+5, an (n,N;BI(2))(n,N;B^{I(2)})-reconstruction code must be an (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code, and thus ρ(n,N;BI(2))log2log2(n)+Ω(1)\rho(n,N;B^{I(2)})\geq\log_{2}\log_{2}(n)+\Omega(1) by Theorem II.1; if 2N62\leq N\leq 6, an (n,N;BI(2))(n,N;B^{I(2)})-reconstruction code must be an (n,1;BI(1))(n,1;B^{I(1)})-reconstruction code, and thus ρ(n,N;BI(2))log2(n)+Ω(1)\rho(n,N;B^{I(2)})\geq\log_{2}(n)+\Omega(1) by Theorem II.1. The last case N=1N=1, corresponds to a 22-insertion error-correcting code, or equivalently a 22-deletion error-correcting code, whose best known upper bound on the optimal redundancy is 4log2(n)+O(log2log2(n))4\log_{2}(n)+O(\log_{2}\log_{2}(n)) [23, Theorem 1], and the best known lower bound is 2log2(n)O(1)2\log_{2}(n)-O(1) (see [29]). Proposition III.2 summarizes the information given above.

Proposition III.2.
ρ(n,N;BI(2))={0,if N>2n+4,log2log2(n)+Θ(1),if n+5<N2n+4,log2log2(n)+Ω(1),if N=n+4,n+5,log2(n)+Θ(1),if 6<Nn+3,Θ(log2(n)),if 1N6.\rho(n,N;B^{I(2)})=\begin{cases}0,&\mbox{if }N>2n+4,\\ \log_{2}\log_{2}(n)+\Theta(1),&\mbox{if }n+5<N\leq 2n+4,\\ \geq\log_{2}\log_{2}(n)+\Omega(1),&\mbox{if }N=n+4,n+5,\\ \log_{2}(n)+\Theta(1),&\mbox{if }6<N\leq n+3,\\ \Theta(\log_{2}(n)),&\mbox{if }1\leq N\leq 6.\end{cases}

Therefore, the remaining cases are N{n+4,n+5}N\in\{n+4,n+5\} and 2N62\leq N\leq 6, which will be handled in the rest of this paper. In fact, we construct asymptotically optimal reconstruction codes for N{n+4,n+5}N\in\{n+4,n+5\} with redundancy log2log2(n)+O(1)\log_{2}\log_{2}(n)+O(1), and for N{5,6}N\in\{5,6\} with redundancy log2(n)+14log2log2(n)+O(1)\log_{2}(n)+14\log_{2}\log_{2}(n)+O(1), thus determine the asymptotically optimal redundancy for all N5N\geq 5.

IV Reconstruction Codes for N{n+4,n+5}N\in\{n+4,n+5\}

In this section, we construct reconstruction codes for N{n+4,n+5}N\in\{n+4,n+5\} with redundancy as small as possible. The main idea is as follows. First, we choose an (n,n+6;BI(2))(n,n+6;B^{I(2)})-reconstruction code with small redundancy, or equivalently an (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code 𝒞1\mathcal{C}_{1} (see Corollary III.2 (i)), which can be found in [21, Theorem 17]. Then we impose additional constraints on 𝒞1\mathcal{C}_{1} such that |I2(𝒙)I2(𝒚)|{n+4,n+5}\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\notin\{n+4,n+5\} under these constraints, and consequently we obtain an (n,n+4;BI(2))(n,n+4;B^{I(2)})-reconstruction code 𝒞\mathcal{C}.

The main task in our construction is to characterize the structures of two sequences 𝒙,𝒚\bm{x},\bm{y} when |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+4 or n+5n+5. By Lemma III.4, we have |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1. Then by Lemma III.2, we can assume that

𝒙=𝒖aa¯𝒗b𝒘 and 𝒚=𝒖a¯𝒗bb¯𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w}\text{ and }\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w}

for some 𝒖,𝒗,𝒘Σ2\bm{u},\bm{v},\bm{w}\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}, where 𝒗\bm{v} satisfies neither of the following conditions by Proposition III.1:

a=b and 𝒗=(aa¯)m for some m0,a=b¯ and 𝒗=(aa¯)ma for some m0.\begin{array}[]{c}a=b\text{ and }\bm{v}=(a\bar{a})^{m}\text{ for some }m\geq 0,\\ a=\bar{b}\text{ and }\bm{v}=(a\bar{a})^{m}a\text{ for some }m\geq 0.\end{array} (6)

Let 𝒛=𝒖aa¯𝒗bb¯𝒘\bm{z}=\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w}. In the proof of Lemma III.3, we have shown that I2(𝒙)I2(𝒚)=I1(𝒛)J(𝒙,𝒚)I_{2}(\bm{x})\cap I_{2}(\bm{y})=I_{1}(\bm{z})\cup J({\bm{x}},{\bm{y}}) for some set J(𝒙,𝒚)J({\bm{x}},{\bm{y}}) of size at most two. The following result reveals more information about J(𝒙,𝒚)J({\bm{x}},{\bm{y}}), whose proof is deferred to Appendix A.

Lemma IV.1.

Let 𝐱=𝐮aa¯𝐯b𝐰\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w} and 𝐲=𝐮a¯𝐯bb¯𝐰\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w} for some 𝐮,𝐯,𝐰\bm{u},\bm{v},\bm{w} Σ2\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}, and let 𝐳=𝐮aa¯𝐯bb¯𝐰\bm{z}=\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w}. If 𝐱\bm{x} and 𝐲\bm{y} are not Type-A confusable, then

I2(𝒙)I2(𝒚)=I1(𝒛)𝒖(I2(aa¯𝒗b)I2(a¯𝒗bb¯)I1(aa¯𝒗bb¯))𝒘,\begin{array}[]{l}I_{2}(\bm{x})\cap I_{2}(\bm{y})=I_{1}(\bm{z})\cup\bm{u}\left(I_{2}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b\bar{b})\setminus I_{1}(a\bar{a}\bm{v}b\bar{b})\right)\bm{w},\end{array}

and the union is disjoint.

Let nn^{\prime} be the length of aa¯𝒗ba\bar{a}\bm{v}b. Then by Equation 1, we have |I1(𝒛)|=n+3\left|I_{1}(\bm{z})\right|=n+3 and |I1(aa¯𝒗bb¯)|=n+3\left|I_{1}(a\bar{a}\bm{v}b\bar{b})\right|=n^{\prime}+3. Lemma IV.1 implies

|I2(𝒙)I2(𝒚)|\displaystyle\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right| =|I1(𝒛)|+|I2(aa¯𝒗b)I2(a¯𝒗bb¯)I1(aa¯𝒗bb¯)|\displaystyle=\left|I_{1}(\bm{z})\right|+\left|I_{2}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b\bar{b})\setminus I_{1}(a\bar{a}\bm{v}b\bar{b})\right|
=n+3+(|I2(aa¯𝒗b)I2(a¯𝒗bb¯)|n3).\displaystyle=n+3+\left(\left|I_{2}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b\bar{b})\right|-n^{\prime}-3\right).

By Equation 6 and Proposition III.1, aa¯𝒗ba\bar{a}\bm{v}b and a¯𝒗bb¯\bar{a}\bm{v}b\bar{b} are Type-B confusable but not Type-A confusable, and hence |I1(aa¯𝒗b)I1(a¯𝒗bb¯)|=1\left|I_{1}(a\bar{a}\bm{v}b)\cap I_{1}(\bar{a}\bm{v}b\bar{b})\right|=1 by Corollary III.1. Then |I2(aa¯𝒗b)I2(a¯𝒗bb¯)|{n+3,n+4,n+5}\left|I_{2}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b\bar{b})\right|\in\{n^{\prime}+3,n^{\prime}+4,n^{\prime}+5\} by Lemma III.3. We also have |I2(𝒙)I2(𝒚)|{n+3,n+4,n+5}\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\in\{n+3,n+4,n+5\} due to the fact |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1. So the above equality implies that |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+4 or n+5n+5 if and only if |I2(aa¯𝒗b)I2(a¯𝒗bb¯)|=n+4\left|I_{2}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b\bar{b})\right|=n^{\prime}+4 or n+5n^{\prime}+5, respectively. Thus, we can further assume that 𝒙=aa¯𝒗b and 𝒚=a¯𝒗bb¯.\bm{x}=a\bar{a}\bm{v}b\text{ and }\bm{y}=\bar{a}\bm{v}b\bar{b}. The following theorem characterizes the conditions when |I2(aa¯𝒗b)I2(a¯𝒗bb¯)|=n+4\left|I_{2}(a\bar{a}\bm{v}b)\cap I_{2}(\bar{a}\bm{v}b\bar{b})\right|=n^{\prime}+4 or n+5n^{\prime}+5.

TABLE II: case (ii) of Theorem IV.1
𝒗\bm{v} Conditions Row Number
a=ba=b (aa¯)ia¯j(aa¯)k(a\bar{a})^{i}\bar{a}^{j}(a\bar{a})^{k} i0,j2,k0i\geq 0,j\geq 2,k\geq 0 1
(aa¯)iaj(aa¯)k(a\bar{a})^{i}a^{j}(a\bar{a})^{k} i0,j2,k0i\geq 0,j\geq 2,k\geq 0 2
(aa¯)i(a¯aa¯)j(a¯a)ka¯(a\bar{a})^{i}(\bar{a}a\bar{a})^{j}(\bar{a}a)^{k}\bar{a} i0,j1,k0i\geq 0,j\geq 1,k\geq 0 3
(aa¯)ia(aa¯a)j(aa¯)k(a\bar{a})^{i}a(a\bar{a}a)^{j}(a\bar{a})^{k} i0,j1,k0i\geq 0,j\geq 1,k\geq 0 4
a=b¯a=\bar{b} (aa¯)ia¯j(a¯a)k(a\bar{a})^{i}\bar{a}^{j}(\bar{a}a)^{k} i0,j1,k0i\geq 0,j\geq 1,k\geq 0 5
(aa¯)iaj(a¯a)k(a\bar{a})^{i}a^{j}(\bar{a}a)^{k} i0,j3,k0i\geq 0,j\geq 3,k\geq 0 6
(aa¯)i(a¯aa¯)j(a¯a)k(a\bar{a})^{i}(\bar{a}a\bar{a})^{j}(\bar{a}a)^{k} i0,j1,k0i\geq 0,j\geq 1,k\geq 0 7
(aa¯)ia(aa¯a)j(aa¯)ka(a\bar{a})^{i}a(a\bar{a}a)^{j}(a\bar{a})^{k}a i0,j1,k0i\geq 0,j\geq 1,k\geq 0 8
Theorem IV.1.

Let n3n^{\prime}\geq 3. Suppose that a,bΣ2a,b\in\Sigma_{2}, and 𝐯Σ2n3\bm{v}\in\Sigma_{2}^{n^{\prime}-3} do not satisfy the conditions in (6). Then 𝐱=aa¯𝐯b\bm{x}=a\bar{a}\bm{v}b and 𝐲=a¯𝐯bb¯\bm{y}=\bar{a}\bm{v}b\bar{b} satisfy the following properties.

  1. (i)

    |I2(𝒙)I2(𝒚)|=n+5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n^{\prime}+5 if and only if

    • a=ba=b and 𝒗{(aa¯)ia(aa¯)j,(aa¯)i(a¯a)ja¯}\bm{v}\in\{(a\bar{a})^{i}a(a\bar{a})^{j},(a\bar{a})^{i}(\bar{a}a)^{j}\bar{a}\} for some i,j0i,j\geq 0; or

    • a=b¯a=\bar{b} and 𝒗{(aa¯)i(a¯a)j,(aa¯)iaa(a¯a)j}\bm{v}\in\{(a\bar{a})^{i}(\bar{a}a)^{j},(a\bar{a})^{i}aa(\bar{a}a)^{j}\} for some i,j0i,j\geq 0.

  2. (ii)

    |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n^{\prime}+4 if and only if a,ba,b and 𝒗\bm{v} satisfy one of the conditions listed in Table II.

Proof:

Denote l=n3l=n^{\prime}-3. Let 𝒙~=aa¯𝒗=x~1x~l+2\widetilde{\bm{x}}=a\bar{a}\bm{v}=\tilde{x}_{1}\cdots\tilde{x}_{l+2} and 𝒚~=𝒗bb¯=y~1y~l+2\widetilde{\bm{y}}=\bm{v}b\bar{b}=\tilde{y}_{1}\cdots\tilde{y}_{l+2}. Then

{x~1=a,x~2=a¯;x~i+2=y~i, for all i=1,,l;y~l+1=b,y~l+2=b¯.\left\{\begin{array}[]{l}\tilde{x}_{1}=a,\tilde{x}_{2}=\bar{a};\\ \tilde{x}_{i+2}=\tilde{y}_{i},\text{ for all }i=1,\ldots,l;\\ \tilde{y}_{l+1}=b,\tilde{y}_{l+2}=\bar{b}.\end{array}\right. (7)

(i). From Lemma III.3, we know that |I2(𝒙)I2(𝒚)|=n+5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n^{\prime}+5 if and only if 𝒙~\widetilde{\bm{x}} and 𝒚~\widetilde{\bm{y}} are Type-A confusable. That is to say, 𝒙~=𝒖𝒄𝒘\widetilde{\bm{x}}=\bm{u}\bm{c}\bm{w} and 𝒚~=𝒖𝒄¯𝒘\widetilde{\bm{y}}=\bm{u}\bar{\bm{c}}\bm{w} for some 𝒖,𝒄,𝒘Σ2\bm{u},\bm{c},\bm{w}\in\Sigma_{2}^{*}, where 𝒄\bm{c} is an alternating sequence of length at least one. Denote s=|𝒖|0s=|\bm{u}|\geq 0 and t=|𝒄|1t=|\bm{c}|\geq 1. Then

{x~i=y~i, for i=1,,s,s+t+1,,l+2 by 𝒖 and 𝒘;x~i=y~i¯, for i=s+1,,s+t by 𝒄 and 𝒄¯;x~i=x~i+2, for i=s+1,,s+t2 by alternating of 𝒄.\left\{\begin{array}[]{ll}\tilde{x}_{i}=\tilde{y}_{i},\text{ for }i=1,\ldots,s,s+t+1,\ldots,l+2&\text{ by $\bm{u}$ and $\bm{w}$};\\ \tilde{x}_{i}=\overline{\tilde{y}_{i}},\text{ for }i=s+1,\ldots,s+t&\text{ by $\bm{c}$ and $\bar{\bm{c}}$};\\ \tilde{x}_{i}=\tilde{x}_{i+2},\text{ for }i=s+1,\ldots,s+t-2&\text{ by alternating of $\bm{c}$}.\end{array}\right. (8)

Since wtH(𝒙~)=wtH(𝒚~)=wtH(𝒗)+1wt_{H}(\widetilde{\bm{x}})=wt_{H}(\widetilde{\bm{y}})=wt_{H}(\bm{v})+1, we have |𝒄|=dH(𝒙~,𝒚~)2|\bm{c}|=d_{H}(\widetilde{\bm{x}},\widetilde{\bm{y}})\geq 2. If |𝒄|3|\bm{c}|\geq 3, consider the prefix x~s+1x~s+2x~s+3\tilde{x}_{s+1}\tilde{x}_{s+2}\tilde{x}_{s+3} of 𝒄\bm{c}. By the third line of Equation 8 and the second line of Equation 7, we have x~s+1=x~s+3\tilde{x}_{s+1}=\tilde{x}_{s+3} and x~s+3=y~s+1\tilde{x}_{s+3}=\tilde{y}_{s+1}, respectively. Thus x~s+1=y~s+1\tilde{x}_{s+1}=\tilde{y}_{s+1}, which contradicts the second line of Equation 8. So we have |𝒄|=2|\bm{c}|=2, and Equation 8 becomes

{x~i=y~i, for i=1,,s,s+3,,l+2;x~i=y~i¯, for i=s+1,s+2;x~s+1=x~s+2¯.\left\{\begin{array}[]{l}\tilde{x}_{i}=\tilde{y}_{i},\text{ for }i=1,\ldots,s,s+3,\ldots,l+2;\\ \tilde{x}_{i}=\overline{\tilde{y}_{i}},\text{ for }i=s+1,s+2;\\ \tilde{x}_{s+1}=\overline{\tilde{x}_{s+2}}.\end{array}\right. (9)

Next, we determine the explicit structure of 𝒗\bm{v} according to the values of ss and ll. Write 𝒗=v1vl\bm{v}=v_{1}\cdots v_{l}. Then

vi=x~i+2=y~i, for all i=1,,l.v_{i}=\tilde{x}_{i+2}=\tilde{y}_{i},\text{ for all }i=1,\ldots,l. (10)

For convenience, let v1=x~1=av_{-1}=\tilde{x}_{1}=a, v0=x~2=a¯v_{0}=\tilde{x}_{2}=\bar{a}. Since lsl\geq s, the first two lines of Equation 7 and the first line of Equation 9 imply that v1v0vsv_{-1}v_{0}\cdots v_{s} is alternating. Therefore,

v1v0vs={(aa¯)s+22 if s is even,(aa¯)s+12a if s is odd.v_{-1}v_{0}\cdots v_{s}=\left\{\begin{array}[]{ll}(a\bar{a})^{\frac{s+2}{2}}&\text{ if }s\text{ is even},\\ (a\bar{a})^{\frac{s+1}{2}}a&\text{ if }s\text{ is odd}.\end{array}\right. (11)

We divide our discussion into three cases.

  • l=sl=s. In this case, we have vl=x~l+2=y~l+2¯=bv_{l}=\tilde{x}_{l+2}=\overline{\tilde{y}_{l+2}}=b and vl1=x~l+1=y~l+1¯=b¯v_{l-1}=\tilde{x}_{l+1}=\overline{\tilde{y}_{l+1}}=\bar{b}. Therefore, 𝒗=(aa¯)l2\bm{v}=(a\bar{a})^{\frac{l}{2}} and a=b¯a=\bar{b} when ll is even, or 𝒗=(aa¯)l12a\bm{v}=(a\bar{a})^{\frac{l-1}{2}}a and a=ba=b when ll is odd.

  • l=s+1l=s+1. In this case, we have vl=x~l+2=y~l+2=b¯v_{l}=\tilde{x}_{l+2}=\tilde{y}_{l+2}=\bar{b} and vl1=x~l+1=y~l+1¯=b¯v_{l-1}=\tilde{x}_{l+1}=\overline{\tilde{y}_{l+1}}=\bar{b}. Therefore, 𝒗=(aa¯)l12a¯\bm{v}=(a\bar{a})^{\frac{l-1}{2}}\bar{a} and a=ba=b when ll is odd, or 𝒗=(aa¯)l22aa\bm{v}=(a\bar{a})^{\frac{l-2}{2}}aa and a=b¯a=\bar{b} when ll is even.

  • ls+2l\geq s+2. In this case, vi=y~i=x~i¯v_{i}=\tilde{y}_{i}=\overline{\tilde{x}_{i}} for i=s+1,s+2i=s+1,s+2 by the second line of Equation 9 and Equation 10. So vs+1=vs+2¯v_{s+1}=\overline{v_{s+2}} by the third line of Equation 9. Hence vs+1vs+2vlv_{s+1}v_{s+2}\cdots v_{l} is alternating by the first line of Equation 9 and Equation 10. Since l+2s+4l+2\geq s+4, we have vl1=x~l+1=y~l+1=bv_{l-1}=\tilde{x}_{l+1}=\tilde{y}_{l+1}=b and vl=x~l+2=y~l+2=b¯v_{l}=\tilde{x}_{l+2}=\tilde{y}_{l+2}=\bar{b} by the first line of Equation 9. Now, combining with Equation 11, we come to the desired forms of 𝒗\bm{v}: for all 0sl20\leq s\leq l-2,

     when a=b¯, 𝒗=v1vl={(aa¯)s2(a¯a)ls2 if l,s are both even,(aa¯)s12aa(a¯a)l1s2 if s is odd and l is even,\text{ when }a=\bar{b},\text{ }\bm{v}=v_{1}\cdots v_{l}=\left\{\begin{array}[]{ll}(a\bar{a})^{\frac{s}{2}}(\bar{a}a)^{\frac{l-s}{2}}&\text{ if }l,s\text{ are both even},\\ (a\bar{a})^{\frac{s-1}{2}}aa(\bar{a}a)^{\frac{l-1-s}{2}}&\text{ if }s\text{ is odd and }l\text{ is even},\end{array}\right.

    and

     when a=b, 𝒗=v1vl={(aa¯)s2(a¯a)l1s2a¯ if s is even and l is odd,(aa¯)s12a(aa¯)ls2 if l,s are both odd.\text{ when }a=b,\text{ }\bm{v}=v_{1}\cdots v_{l}=\left\{\begin{array}[]{ll}(a\bar{a})^{\frac{s}{2}}(\bar{a}a)^{\frac{l-1-s}{2}}\bar{a}&\text{ if }s\text{ is even and }l\text{ is odd},\\ (a\bar{a})^{\frac{s-1}{2}}a(a\bar{a})^{\frac{l-s}{2}}&\text{ if }l,s\text{ are both odd}.\end{array}\right.

This completes the proof of (i).

(ii). The proof of this case is long and tedious, thus we move it to Appendix B. ∎

Theorem IV.1 gives a full characterization of the structures of two sequences 𝒙,𝒚Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n} when |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+4 or n+5n+5, respectively. This will help to exclude such pairs of sequences from an (n,n+6;BI(2))(n,n+6;B^{I(2)})-reconstruction code (or an (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code) to obtain an (n,n+4;BI(2))(n,n+4;B^{I(2)})-reconstruction code, as mentioned in the beginning of this section. We use the (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code in [21, Theorem 17] as a candidate for our construction. For that, we need the following notation.

For any 𝒙Σ2n\bm{x}\in\Sigma_{2}^{n}, define

Inv(𝒙)=|{(i,j):1i<jn and xi>xj}|.\textup{Inv}\left(\bm{x}\right)=\left|\{(i,j):1\leq i<j\leq n\text{ and }x_{i}>x_{j}\}\right|.

By definition, for any aΣ2a\in\Sigma_{2} and any 𝒖,𝒗,𝒘Σ2\bm{u},\bm{v},\bm{w}\in\Sigma_{2}^{*},

|Inv(𝒖aa¯𝒗a¯𝒘)Inv(𝒖a¯𝒗a¯a𝒘)|=|Inv(aa¯𝒗a¯)Inv(a¯𝒗a¯a)|=N𝒗(a¯)+2,|\textup{Inv}\left(\bm{u}a\bar{a}\bm{v}\bar{a}\bm{w}\right)-\textup{Inv}\left(\bm{u}\bar{a}\bm{v}\bar{a}a\bm{w}\right)|=|\textup{Inv}\left(a\bar{a}\bm{v}\bar{a}\right)-\textup{Inv}\left(\bar{a}\bm{v}\bar{a}a\right)|=N_{\bm{v}}(\bar{a})+2, (12)

where N𝒗(a¯)N_{\bm{v}}(\bar{a}) denotes the number of a¯\bar{a} appearing in 𝒗\bm{v}. Let R(n,,t)R(n,\ell,t) denote the set of all sequences 𝒙Σ2n\bm{x}\in\Sigma_{2}^{n} such that the length of any subword with period \ell^{\prime} of 𝒙\bm{x} is at most tt, for any \ell^{\prime}\leq\ell. By definition, R(n,,t)R(n,ȷ,s)R(n,\ell,t)\subseteq R(n,\jmath,s) if ȷ\ell\geq\jmath and tst\leq s.

Lemma IV.2.

(​​[31, Theorem 13]) For 1\ell\geq 1, if tlog2(n)++1t\geq\lceil\log_{2}(n)\rceil+\ell+1, we have that |R(n,,t)|2n1|R(n,\ell,t)|\geq 2^{n-1}.

For n,P>0n,P>0, let c1+Pc\in\mathbb{Z}_{1+P} and d2d\in\mathbb{Z}_{2}. In [21, Theorem 17], the authors defined the following code

𝒞1=𝒞1(n;c,d)={𝒙R(n,2,2P):Inv(𝒙)c(mod1+P),wtH(𝒙)d(mod2)}.\mathcal{C}_{1}=\mathcal{C}_{1}(n;c,d)=\{\bm{x}\in R(n,2,2P):\textup{Inv}\left(\bm{x}\right)\equiv c\pmod{1+P},wt_{H}(\bm{x})\equiv d\pmod{2}\}.

They proved that 𝒞1\mathcal{C}_{1} is an (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code in the following way. The constraint on wtH(𝒙)wt_{H}(\bm{x}) implies that the Hamming weights of any two distinct codewords 𝒙\bm{x} and 𝒚\bm{y} in 𝒞1\mathcal{C}_{1} have the same parity and thus dH(𝒙,𝒚)2d_{H}(\bm{x},\bm{y})\geq 2. Therefore, if |I1(𝒙)I1(𝒚)|=2\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=2, we can conclude that 𝒙=𝒖𝒄𝒘\bm{x}=\bm{u}\bm{c}\bm{w} and 𝒚=𝒖𝒄¯𝒘\bm{y}=\bm{u}\overline{\bm{c}}\bm{w} for some 𝒖,𝒘Σ2\bm{u},\bm{w}\in\Sigma_{2}^{*} and some alternating sequence 𝒄\bm{c} of positive even length. On the one hand, it can be shown that |Inv(𝒙)Inv(𝒚)|=|𝒄|2\left|\textup{Inv}\left(\bm{x}\right)-\textup{Inv}\left(\bm{y}\right)\right|=\frac{|\bm{c}|}{2}. So the constraint on Inv(𝒙)\textup{Inv}\left(\bm{x}\right) leads to (1+P)|𝒄|2(1+P)\mid\frac{|\bm{c}|}{2} and hence 1+P|𝒄|21+P\leq\frac{|\bm{c}|}{2}. On the other hand, the length of 𝒄\bm{c} is upper bounded by 2P2P since 𝒄\bm{c} has period 22. So we have 1+P|𝒄|2P1+P\leq\frac{|\bm{c}|}{2}\leq P, which is impossible. Therefore, 𝒞1\mathcal{C}_{1} is an (n,2;BI(1))(n,2;B^{I(1)})-reconstruction code.

By Corollary III.2 (i), the code 𝒞1\mathcal{C}_{1} is an (n,n+6;BI(2))(n,n+6;B^{I(2)})-reconstruction code. To obtain an (n,n+4;BI(2))(n,n+4;B^{I(2)})-reconstruction code from 𝒞1\mathcal{C}_{1}, we have to exclude pairs of sequences 𝒙,𝒚𝒞1\bm{x},\bm{y}\in\mathcal{C}_{1} satisfying that |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+4 or n+5n+5. By Theorem IV.1, such pairs have a common subword 𝒗\bm{v} with a certain periodic property. Thus we can further bound the length of 𝒗\bm{v} and use the constraint on Inv(𝒙)\textup{Inv}\left(\bm{x}\right) to get a contradiction and exclude such pairs. See our construction below.

Theorem IV.2.

(N=n+4N=n+4) For integers n4,P6n\geq 4,P\geq 6 such that 3P3\mid P, let c1+Pc\in\mathbb{Z}_{1+P} and d2d\in\mathbb{Z}_{2}. Let

𝒞(n;c,d)={𝒙R(n,3,P3):Inv(𝒙)c(mod1+P),wtH(𝒙)d(mod2)}}.\mathcal{C}(n;c,d)=\{\bm{x}\in R(n,3,\frac{P}{3}):\textup{Inv}\left(\bm{x}\right)\equiv c\pmod{1+P},wt_{H}(\bm{x})\equiv d\pmod{2}\}\}.

Then 𝒞(n;c,d)\mathcal{C}(n;c,d) is an (n,n+4;BI(2))(n,n+4;B^{I(2)})-reconstruction code. Furthermore, if P=3log2(n)+12P=3\lceil\log_{2}(n)\rceil+12, then 𝒞(n;c,d)\mathcal{C}(n;c,d) has redundancy at most log2(P+1)+2=log2log2(n)+O(1)\log_{2}(P+1)+2=\log_{2}\log_{2}(n)+O(1) for some choice of cc and dd.

Proof:

Since R(n,3,P3)R(n,2,2P)R(n,3,\frac{P}{3})\subseteq R(n,2,2P), that is 𝒞(n;c,d)𝒞1(n;c,d)\mathcal{C}(n;c,d)\subseteq\mathcal{C}_{1}(n;c,d), it suffices to show that |I2(𝒙)I2(𝒚)|{n+4,n+5}\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\notin\{n+4,n+5\} for any two distinct sequences 𝒙,𝒚𝒞(n;c,d)\bm{x},\bm{y}\in\mathcal{C}(n;c,d).

If, on the contrary, there exists such a pair 𝒙,𝒚\bm{x},\bm{y}, then without loss of generality, we can assume that

𝒙=𝒖aa¯𝒗b𝒘 and 𝒚=𝒖a¯𝒗bb¯𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w}\text{ and }\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w}

for some 𝒖,𝒗,𝒘\bm{u},\bm{v},\bm{w} Σ2\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}. Since wtH(𝒙)wtH(𝒚)d(mod2)wt_{H}(\bm{x})\equiv wt_{H}(\bm{y})\equiv d\pmod{2}, we conclude that a=b¯a=\bar{b}.

If |I2(𝒙)I2(𝒚)|=n+5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+5, then 𝒗{(aa¯)i(a¯a)j,\bm{v}\in\{(a\bar{a})^{i}(\bar{a}a)^{j}, (aa¯)iaa(a¯a)j}(a\bar{a})^{i}aa(\bar{a}a)^{j}\} for some i,j0i,j\geq 0 by Theorem IV.1(i). Since 𝒙R(n,3,P3)\bm{x}\in R(n,3,\frac{P}{3}), we have 2i,2jP32i,2j\leq\frac{P}{3}. From Equation 12, we know that

|Inv(𝒙)Inv(𝒚)|=|Inv(aa¯𝒗a¯)Inv(a¯𝒗a¯a)|=i+j+2.\left|\textup{Inv}\left(\bm{x}\right)-\textup{Inv}\left(\bm{y}\right)\right|=\left|\textup{Inv}\left(a\bar{a}\bm{v}\bar{a}\right)-\textup{Inv}\left(\bar{a}\bm{v}\bar{a}a\right)\right|=i+j+2.

So i+j+20(mod1+P)i+j+2\equiv 0\pmod{1+P} by the definition of 𝒞\mathcal{C}. Then 1+Pi+j+2P3+21+P\leq i+j+2\leq\frac{P}{3}+2, which implies that P1P\leq 1, a contradiction.

If |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n+4, then 𝒗\bm{v} is one of the four forms in Table II under the case a=b¯a=\bar{b} by Theorem IV.1(ii). For each case, we can deduce that P3P\leq 3, thus a contradiction. For example, if 𝒗=(aa¯)i(a¯aa¯)j(a¯a)k\bm{v}=(a\bar{a})^{i}(\bar{a}a\bar{a})^{j}(\bar{a}a)^{k} for some i0,j1,k0i\geq 0,j\geq 1,k\geq 0, then |Inv(𝒙)Inv(𝒚)|=i+2j+k+20(mod1+P)|\textup{Inv}\left(\bm{x}\right)-\textup{Inv}\left(\bm{y}\right)|=i+2j+k+2\equiv 0\pmod{1+P}. On the other hand, since 𝒙R(n,3,P3)\bm{x}\in R(n,3,\frac{P}{3}), we have 2iP32i\leq\frac{P}{3}, 3jP33j\leq\frac{P}{3}, and 2kP32k\leq\frac{P}{3}. So 1+Pi+2j+k+25P9+21+P\leq i+2j+k+2\leq\frac{5P}{9}+2, which implies P2P\leq 2, a contradiction.

So we conclude that 𝒞(n;c,d)\mathcal{C}(n;c,d) is an (n,n+4;BI(2))(n,n+4;B^{I(2)})-reconstruction code. Finally, we can choose P=3log2(n)+12P=3\lceil\log_{2}(n)\rceil+12. Then |R(n,3,P3)|2n1\left|R(n,3,\frac{P}{3})\right|\geq 2^{n-1} by Lemma IV.2. Note that for different pairs (c,d)1+P×2(c,d)\in\mathbb{Z}_{1+P}\times\mathbb{Z}_{2}, the codes 𝒞(n;c,d)\mathcal{C}(n;c,d) are disjoint from each other. So by pigeonhole principle, there must be some cc and dd such that |𝒞(n;c,d)|2n12(1+P)|\mathcal{C}(n;c,d)|\geq\frac{2^{n-1}}{2(1+P)}, which has redundancy log2(P+1)+2=log2log2(n)+O(1)\log_{2}(P+1)+2=\log_{2}\log_{2}(n)+O(1) for nn large enough. ∎

Obviously, the code 𝒞(n;c,d)\mathcal{C}(n;c,d) in Theorem IV.2 is also an (n,n+5;BI(2))(n,n+5;B^{I(2)})-reconstruction code. Then combining with Proposition III.2, we have ρ(n,N;BI(2))=log2log2(n)+Θ(1)\rho(n,N;B^{I(2)})=\log_{2}\log_{2}(n)+\Theta(1) for N{n+4,n+5}N\in\{n+4,n+5\}. However, we can do a little bit better for N=n+5N=n+5. Under the same notations, let

𝒞2(n;c,d)={𝒙R(n,2,2P3):Inv(𝒙)c(mod1+P),wtH(𝒙)d(mod2)}}.\mathcal{C}_{2}(n;c,d)=\{\bm{x}\in R(n,2,\frac{2P}{3}):\textup{Inv}\left(\bm{x}\right)\equiv c\pmod{1+P},wt_{H}(\bm{x})\equiv d\pmod{2}\}\}.

Then we can show that 𝒞2(n;c,d)\mathcal{C}_{2}(n;c,d) is an (n,n+5;BI(2))(n,n+5;B^{I(2)})-reconstruction code by the similar argument in Theorem IV.2. Furthermore, if we set 2P=3log2(n)+92P=3\lceil\log_{2}(n)\rceil+9, then 𝒞2(n;c,d)\mathcal{C}_{2}(n;c,d) has redundancy at most log2(P+1)+2\log_{2}(P+1)+2 for some choice of cc and dd. Since this PP is about a half of the PP in Theorem IV.2, we know that the redundancy of 𝒞2(n;c,d)\mathcal{C}_{2}(n;c,d) is about one bit smaller than that of 𝒞(n;c,d)\mathcal{C}(n;c,d).

V Reconstruction codes for N{5,6}N\in\{5,6\}

In this section, we provide reconstruction codes for N{5,6}N\in\{5,6\} with redundancy as small as possible. It seems that we can also construct a code by characterizing the structures of two sequences 𝒙,𝒚\bm{x},\bm{y} such that |I2(𝒙)I2(𝒚)|=5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=5 or 66, as we did in the last section. However, it will be a long and tedious task. So the construction we use here is quite different from the one for N=n+4N=n+4.

In [27, Theorem 2], Sima et al. constructed a class of 22-deletion correcting codes with redundancy 7log2(n)+o(log2(n))7\log_{2}(n)+o(\log_{2}(n)), by generalizing the VT code with higher order parity checks of some indicator vectors of the original sequences. Inspired by their constructions, we will present an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code with redundancy log2(n)+O(log2log2(n))\log_{2}(n)+O(\log_{2}\log_{2}(n)) by utilizing five different reads. This improves the trivial upper bound 7log2(n)+o(log2(n))7\log_{2}(n)+o(\log_{2}(n)) to log2(n)+O(log2log2(n))\log_{2}(n)+O(\log_{2}\log_{2}(n)), which is tight in terms of the leading term. We first recall the construction in [27, Theorem 2], which needs the following notations.

For 𝒙=x1x2xnΣ2n\bm{x}=x_{1}x_{2}\ldots x_{n}\in\Sigma_{2}^{n} ( n2n\geq 2) and a,bΣ2a,b\in\Sigma_{2}, the abab-indicator 𝟙ab(𝒙)Σ2n1\mathbbm{1}_{ab}(\bm{x})\in\Sigma_{2}^{n-1} of 𝒙\bm{x} is defined as follows:

𝟙ab(𝒙)i={1,if xi=a and xi+1=b,0,otherwise.\mathbbm{1}_{ab}(\bm{x})_{i}=\begin{cases}1,&\mbox{if }x_{i}=a\text{ and }x_{i+1}=b,\\ 0,&\mbox{otherwise}.\end{cases}

For example, if 𝒙=10011010\bm{x}=10011010, then 𝟙10(𝒙)=1000101\mathbbm{1}_{10}(\bm{x})=1000101 and 𝟙01(𝒙)=0010010\mathbbm{1}_{01}(\bm{x})=0010010. Note that the 1010- and 0101-indicators of any sequence do not contain consecutive ones. Define the following integer vectors of length n1n-1:

𝒎(0)(1,2,,n1),𝒎(1)(1,1+2,1+2+3,,n(n1)2),𝒎(2)(12,12+22,12+22+32,,(n1)n(2n1)6).\begin{array}[]{l}\bm{m}^{(0)}\triangleq(1,2,\ldots,n-1),\\ \bm{m}^{(1)}\triangleq\left(1,1+2,1+2+3,\ldots,\frac{n(n-1)}{2}\right),\\ \bm{m}^{(2)}\triangleq\left(1^{2},1^{2}+2^{2},1^{2}+2^{2}+3^{2},\ldots,\frac{(n-1)n(2n-1)}{6}\right).\end{array} (13)

In other words, the iith components of 𝒎(0)\bm{m}^{(0)}, 𝒎(1)\bm{m}^{(1)} and 𝒎(2)\bm{m}^{(2)} are ii, i(i+1)/2i(i+1)/2 and i(i+1)(2i+1)/6i(i+1)(2i+1)/6, respectively. Then given 𝒙Σ2n\bm{x}\in\Sigma_{2}^{n}, the higher order parity checks for 𝟙10(𝒙)\mathbbm{1}_{10}(\bm{x}) and 𝟙01(𝒙)\mathbbm{1}_{01}(\bm{x}) are defined as

f(𝒙)=(𝟙10(𝒙)𝒎(0)(mod2n),𝟙10(𝒙)𝒎(1)(modn2),𝟙10(𝒙)𝒎(2)(modn3)),h(𝒙)=(wtH(𝟙01(𝒙))(mod3),𝟙01(𝒙)𝒎(1)(mod2n)),\begin{array}[]{l}f(\bm{x})=\left(\mathbbm{1}_{10}(\bm{x})\cdot\bm{m}^{(0)}\pmod{2n},~{}~{}\mathbbm{1}_{10}(\bm{x})\cdot\bm{m}^{(1)}\pmod{n^{2}},~{}~{}\mathbbm{1}_{10}(\bm{x})\cdot\bm{m}^{(2)}\pmod{n^{3}}\right),\\ h(\bm{x})=\left(wt_{H}\left(\mathbbm{1}_{01}(\bm{x})\right)\pmod{3},~{}~{}\mathbbm{1}_{01}(\bm{x})\cdot\bm{m}^{(1)}\pmod{2n}\right),\end{array} (14)

where \cdot denotes the inner product over the integers. By applying the above parity checks, Sima et al. proved the following result.

Theorem V.1.

[27, Theorem 2] For any a12na_{1}\in\mathbb{Z}_{2n}, a2n2a_{2}\in\mathbb{Z}_{n^{2}}, a3n3a_{3}\in\mathbb{Z}_{n^{3}}, a43a_{4}\in\mathbb{Z}_{3} and a52na_{5}\in\mathbb{Z}_{2n}, let

𝒟(a1,a2,a3,a4,a5)={𝒙Σ2n:f(𝒙)=(a1,a2,a3),h(𝒙)=(a4,a5)}.\mathcal{D}(a_{1},a_{2},a_{3},a_{4},a_{5})=\left\{\bm{x}\in\Sigma_{2}^{n}:f(\bm{x})=\left(a_{1},a_{2},a_{3}\right),h(\bm{x})=\left(a_{4},a_{5}\right)\right\}.

Then 𝒟(a1,a2,a3,a4,a5)\mathcal{D}(a_{1},a_{2},a_{3},a_{4},a_{5}) is a 22-deletion (or insertion) correcting code.

Clearly, the redundancy of 𝒟(a1,a2,a3,a4,a5)\mathcal{D}(a_{1},a_{2},a_{3},a_{4},a_{5}) is at most 7log2(n)+O(1)7\log_{2}(n)+O(1) for some choice of a1a_{1}, a2a_{2}, a3a_{3}, a4a_{4} and a5a_{5}. Although this implies an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code, the redundancy is large. To reduce the redundancy, we note that the parity checks in the construction of Theorem V.1 are applied to the whole sequence of length nn to correct two random insertion errors. However, as will be shown later, the two random insertion errors are located in a short subword (say of length LL) of the original sequence, if we impose some constraint (belonging to R(n,3,P)R(n,3,P)) on the sequences. So we can apply the parity checks in Equation 14 only on this subword of length LL to combat the two insertion errors, which will greatly reduce the redundancy of the code if L=o(n)L=o(n).

Now assume that 𝒙𝒚Σ2n\bm{x}\neq\bm{y}\in\Sigma_{2}^{n} (n4n\geq 4) and |I2(𝒙)I2(𝒚)|6|I_{2}(\bm{x})\cap I_{2}(\bm{y})|\leq 6, then |I1(𝒙)I1(𝒚)|=0|I_{1}(\bm{x})\cap I_{1}(\bm{y})|=0 by Lemma III.4, and hence dH(𝒙,𝒚)2d_{H}(\bm{x},\bm{y})\geq 2 by Lemma III.1. So we can always assume that

𝒙=𝒖a𝒅b𝒘 and 𝒚=𝒖a¯𝒆b¯𝒘\bm{x}=\bm{u}a\bm{d}b\bm{w}\text{ and }\bm{y}=\bm{u}\bar{a}\bm{e}\bar{b}\bm{w}

for some 𝒖,𝒘,𝒅,𝒆Σ2\bm{u},\bm{w},\bm{d},\bm{e}\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}. Further, a𝒅𝒆b¯a\bm{d}\neq\bm{e}\bar{b} and 𝒅ba¯𝒆\bm{d}b\neq\bar{a}\bm{e}, since otherwise |I1(𝒙)I1(𝒚)|0|I_{1}(\bm{x})\cap I_{1}(\bm{y})|\neq 0. As in Lemma IV.1, we have

I2(𝒙)I2(𝒚)=𝒖(I2(a𝒅b)I2(a¯𝒆b¯))𝒘.I_{2}(\bm{x})\cap I_{2}(\bm{y})=\bm{u}\left(I_{2}(a\bm{d}b)\cap I_{2}(\bar{a}\bm{e}\bar{b})\right)\bm{w}. (15)

In particular, |I2(𝒙)I2(𝒚)|=|I2(a𝒅b)I2(a¯𝒆b¯)|\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=\left|I_{2}(a\bm{d}b)\cap I_{2}(\bar{a}\bm{e}\bar{b})\right|. The proof of Equation 15 is similar to that of Lemma IV.1 and thus omitted.

By Equation 15, we know that if |I2(𝒙)I2(𝒚)|{5,6}|I_{2}(\bm{x})\cap I_{2}(\bm{y})|\in\{5,6\}, then any sequence in I2(𝒙)I2(𝒚)I_{2}(\bm{x})\cap I_{2}(\bm{y}) can be obtained from 𝒙\bm{x} by inserting two symbols in a𝒅ba\bm{d}b, or from 𝒚\bm{y} by inserting two symbols in a¯𝒆b¯\bar{a}\bm{e}\bar{b}. Therefore, the idea is: if |a𝒅b|=|a¯𝒆b¯|m\left|a\bm{d}b\right|=\left|\bar{a}\bm{e}\bar{b}\right|\leq m for some m=o(n)m=o(n), then by partitioning 𝒙\bm{x} (or 𝒚\bm{y}) into non-overlapping segments of length mm (see Figure 4, the rightmost segment has length at most mm), we can see that the two inserted symbols are located in at most two adjacent segments. Thus we can apply the higher order parity checks in Equation 14 to the subword of length L=2mL=2m. Intuitively, we can deem that the redundancy can be greatly reduced.

mmmmm\leq m\star\star
(a) inserted symbols are in the same segment
mmmmmmm\leq m\star\star
(b) inserted symbols are in two adjacent segments
Figure 4: Stars represent the inserted symbols

To bound the length of a𝒅ba\bm{d}b or a¯𝒆b¯\bar{a}\bm{e}\bar{b}, we restrict our sequences to be within R(n,,P)R(n,\ell,P), which is of size at least 2n12^{n-1} when P=log2(n)++1P=\lceil\log_{2}(n)\rceil+\ell+1 by Lemma IV.2. Suppose 𝒙,𝒚R(n,,P)\bm{x},\bm{y}\in R(n,\ell,P). If we can prove that a𝒅ba\bm{d}b and a¯𝒆b¯\bar{a}\bm{e}\bar{b} are both concatenation of a finite number of sequences of period at most \ell, then |a𝒅b|\left|a\bm{d}b\right| and |a¯𝒆b¯|\left|\bar{a}\bm{e}\bar{b}\right| can be upper bounded by m=O(P)=o(n)m=O(P)=o(n). Next, we will show that this is true.

Let S=I2(a𝒅b)I2(a¯𝒆b¯)S=I_{2}(a\bm{d}b)\cap I_{2}(\bar{a}\bm{e}\bar{b}), where a𝒅𝒆b¯a\bm{d}\neq\bm{e}\bar{b} and 𝒅ba¯𝒆\bm{d}b\neq\bar{a}\bm{e}. Then S=SbaSb¯aSba¯Sb¯a¯S=S^{a}_{b}\cup S^{a}_{\bar{b}}\cup S_{b}^{\bar{a}}\cup S^{\bar{a}}_{\bar{b}}, where

Sba=a(I2(𝒅){a¯𝒆b¯})b,Sb¯a=a(I1(𝒅b)I1(a¯𝒆))b¯,Sba¯=a¯(I1(a𝒅)I1(𝒆b¯))b,Sb¯a¯=a¯({a𝒅b}I2(𝒆))b¯.\begin{array}[]{l}S^{a}_{b}=a\left(I_{2}(\bm{d})\cap\{\bar{a}\bm{e}\bar{b}\}\right)b,\\ S^{a}_{\bar{b}}=a\left(I_{1}(\bm{d}b)\cap I_{1}(\bar{a}\bm{e})\right)\bar{b},\\ S^{\bar{a}}_{b}=\bar{a}\left(I_{1}(a\bm{d})\cap I_{1}(\bm{e}\bar{b})\right)b,\\ S^{\bar{a}}_{\bar{b}}=\bar{a}\left(\{a\bm{d}b\}\cap I_{2}(\bm{e})\right)\bar{b}.\end{array}

It is clear that SbaS^{a}_{b}, Sb¯aS^{a}_{\bar{b}} Sba¯S_{b}^{\bar{a}} and Sb¯a¯S^{\bar{a}}_{\bar{b}} are mutually disjoint, and each has size at most two. Suppose that |S|=|I2(𝒙)I2(𝒚)|{5,6}|S|=\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\in\{5,6\}. Then we must have:

|Sb¯a|=2 and |Sba¯|1; or |Sba¯|=2 and |Sb¯a|1.\left|S_{\bar{b}}^{a}\right|=2\text{ and }\left|S_{b}^{\bar{a}}\right|\geq 1;\text{ or }\left|S_{b}^{\bar{a}}\right|=2\text{ and }\left|S_{\bar{b}}^{a}\right|\geq 1. (16)

For the former case of Equation 16, 𝒅b\bm{d}b and a¯𝒆\bar{a}\bm{e} are Type-A confusable, i.e., there exist some 𝜶,𝜷Σ2\bm{\alpha},\bm{\beta}\in\Sigma_{2}^{*} such that

𝒅b=𝜶𝒘𝜷 and a¯𝒆=𝜶𝒘¯𝜷,\bm{d}b=\bm{\alpha}\bm{w}\bm{\beta}\text{ and }\bar{a}\bm{e}=\bm{\alpha}\overline{\bm{w}}\bm{\beta}, (17)

where 𝒘Σ2\bm{w}\in\Sigma_{2}^{*} is an alternating sequence of length at least one. Since |Sba¯|1\left|S_{b}^{\bar{a}}\right|\geq 1, we conclude that a𝒅a\bm{d} and 𝒆b¯\bm{e}\bar{b} are Type-A confusable or Type-B confusable. With these notations and observations, we characterize the structures of 𝜶\bm{\alpha} and 𝜷\bm{\beta} in the following two lemmas for different cases. The proofs are in Appendix C.

Lemma V.1.

Suppose that 𝐝b\bm{d}b and a¯𝐞\bar{a}\bm{e} are Type-A confusable, and that a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-A confusable. Then 𝛂=𝐩1𝐩2\bm{\alpha}=\bm{p}_{1}^{\prime}\bm{p}_{2}^{\prime} and 𝛃=𝐩1𝐩2\bm{\beta}=\bm{p}_{1}\bm{p}_{2} for some 𝐩1,𝐩2,𝐩1,𝐩2Σ2\bm{p}_{1},\bm{p}_{2},\bm{p}_{1}^{\prime},\bm{p}_{2}^{\prime}\in\Sigma_{2}^{*}, where the four sequences 𝐩1,𝐩2,𝐩1\bm{p}_{1},\bm{p}_{2},\bm{p}_{1}^{\prime} and 𝐩2\bm{p}_{2}^{\prime} all have period at most 22.

Lemma V.2.

Suppose that 𝐝b\bm{d}b and a¯𝐞\bar{a}\bm{e} are Type-A confusable, and that a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-B confusable. Then 𝛂=𝐩1𝐩2𝐩3\bm{\alpha}=\bm{p}_{1}^{\prime}\bm{p}_{2}^{\prime}\bm{p}_{3}^{\prime} and 𝛃=𝐩1𝐩2𝐩3\bm{\beta}=\bm{p}_{1}\bm{p}_{2}\bm{p}_{3} for some 𝐩1,𝐩2,𝐩3,𝐩1,𝐩2,𝐩3Σ2\bm{p}_{1},\bm{p}_{2},\bm{p}_{3},\bm{p}_{1}^{\prime},\bm{p}_{2}^{\prime},\bm{p}_{3}^{\prime}\in\Sigma_{2}^{*}, where the six sequences 𝐩1,𝐩2,𝐩3,𝐩1,𝐩2\bm{p}_{1},\bm{p}_{2},\bm{p}_{3},\bm{p}_{1}^{\prime},\bm{p}_{2}^{\prime} and 𝐩3\bm{p}_{3}^{\prime} all have period at most 33.

For the latter case of Equation 16, we have a𝒅a\bm{d} and 𝒆b¯\bm{e}\bar{b} are Type-A confusable by |Sba¯|=2\left|S_{b}^{\bar{a}}\right|=2, and 𝒅b\bm{d}b and a¯𝒆\bar{a}\bm{e} are Type-A (or B) confusable by |Sb¯a|1\left|S_{\bar{b}}^{a}\right|\geq 1. As in Equation 17, we can let a𝒅=𝜶𝒘𝜷a\bm{d}=\bm{\alpha}\bm{w}\bm{\beta} and 𝒆b¯=𝜶𝒘¯𝜷\bm{e}\bar{b}=\bm{\alpha}\overline{\bm{w}}\bm{\beta}, and show that Lemma V.1 and Lemma V.2 are both true for this case.

Now suppose that 𝒙,𝒚R(n,3,P)\bm{x},\bm{y}\in R(n,3,P), and |I2(𝒙)I2(𝒚)|{5,6}\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\in\{5,6\}. Then Lemma V.1 and Lemma V.2 imply that

|a𝒅b|=|a¯𝒆b¯|7P+1.\left|a\bm{d}b\right|=\left|\bar{a}\bm{e}\bar{b}\right|\leq 7P+1. (18)

So in the rest of this subsection, we let m7P+1m\triangleq 7P+1. For convenience, we always assume that mnm\mid n, that is, each segment in Figure 4 has length mm. All results in this subsection can be extended to the case mnm\nmid n by a truncated method, see Remark V.2 later. Before giving the construction of an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code, we define the following notations.

For 𝒙Σ2n\bm{x}\in\Sigma_{2}^{n} with n=smn=sm, partition 𝒙\bm{x} into ss segments of length mm as in Figure 4. For any two consecutive segments, we define the parity checks on this subword of length 2m2m as in Equation 14 as follows. For each k{0,1,,s2}k\in\{0,1,\ldots,s-2\}, let 𝟙10(k)(𝒙)𝟙10(𝒙[km+1:km+2m])\mathbbm{1}^{(k)}_{10}(\bm{x})\triangleq\mathbbm{1}_{10}(\bm{x}[km+1:km+2m]) and 𝟙01(k)(𝒙)𝟙01(𝒙[km+1:km+2m])\mathbbm{1}^{(k)}_{01}(\bm{x})\triangleq\mathbbm{1}_{01}(\bm{x}[km+1:km+2m]), which are the indicators of the kk-th segment of 𝒙\bm{x}. Define

fk(𝒙)=(𝟙10(k)(𝒙)𝒎(0)(mod4m),𝟙10(k)(𝒙)𝒎(1)(mod4m2),𝟙10(k)(𝒙)𝒎(2)(mod8m3)),hk(𝒙)=(wtH(𝟙01(k)(𝒙))(mod3),𝟙01(k)(𝒙)𝒎(0)(mod4m)),\begin{array}[]{l}f_{k}(\bm{x})=\left(\mathbbm{1}^{(k)}_{10}(\bm{x})\cdot\bm{m}^{(0)}\pmod{4m},~{}~{}\mathbbm{1}^{(k)}_{10}(\bm{x})\cdot\bm{m}^{(1)}\pmod{4m^{2}},~{}~{}\mathbbm{1}^{(k)}_{10}(\bm{x})\cdot\bm{m}^{(2)}\pmod{8m^{3}}\right),\\ h_{k}(\bm{x})=\left(wt_{H}\left(\mathbbm{1}^{(k)}_{01}(\bm{x})\right)\pmod{3},~{}~{}\mathbbm{1}^{(k)}_{01}(\bm{x})\cdot\bm{m}^{(0)}\pmod{4m}\right),\end{array}

where 𝒎(i)\bm{m}^{(i)} is defined in Equation 13 but with length 2m12m-1. If we let 𝒛=𝒙[km+1:km+2m]\bm{z}=\bm{x}[km+1:km+2m], then it is easy to see that fk(𝒙)=f(𝒛)f_{k}(\bm{x})=f(\bm{z}) and hk(𝒙)=h(𝒛)h_{k}(\bm{x})=h(\bm{z}). Next, we will sum up these fkf_{k}’s and hkh_{k}’s according to the parity of kk. Let KeK_{e} be the set of even integers and let KoK_{o} be the set of odd integers in {0,1,,s2}\{0,1,\ldots,s-2\}, respectively. It is clear that if k1,k2Kek_{1},k_{2}\in K_{e} (or KoK_{o}) and k1k2k_{1}\neq k_{2}, then the two intervals [k1m+1,k1m+2m][k_{1}m+1,k_{1}m+2m] and [k2m+1,k2m+2m][k_{2}m+1,k_{2}m+2m] are disjoint. Define

f~e(𝒙)kKefk(𝒙) and f~o(𝒙)kKofk(𝒙)\tilde{f}_{e}(\bm{x})\triangleq\sum_{k\in K_{e}}f_{k}(\bm{x})\text{ and }\tilde{f}_{o}(\bm{x})\triangleq\sum_{k\in K_{o}}f_{k}(\bm{x})

with the summation over 4m×4m2×8m3\mathbb{Z}_{4m}\times\mathbb{Z}_{4m^{2}}\times\mathbb{Z}_{8m^{3}}, and define

h~e(𝒙)kKehk(𝒙) and h~o(𝒙)kKohk(𝒙)\tilde{h}_{e}(\bm{x})\triangleq\sum_{k\in K_{e}}h_{k}(\bm{x})\text{ and }\tilde{h}_{o}(\bm{x})\triangleq\sum_{k\in K_{o}}h_{k}(\bm{x})

with the summation over 3×4m\mathbb{Z}_{3}\times\mathbb{Z}_{4m}. Now we are ready to give our construction.

Theorem V.2.

(N=5N=5) Let nn and PP be integers such that m=7P+1<nm=7P+1<n and mnm\mid n. For any 𝐚\bm{a} == (a1,a2,a3,a4,a5)(a_{1},a_{2},a_{3},a_{4},a_{5}), 𝐛=(b1,b2,b3,b4,b5)\bm{b}=(b_{1},b_{2},b_{3},b_{4},b_{5}) \in 4m×4m2×8m3×3×4m\mathbb{Z}_{4m}\times\mathbb{Z}_{4m^{2}}\times\mathbb{Z}_{8m^{3}}\times\mathbb{Z}_{3}\times\mathbb{Z}_{4m}, and an+1a\in\mathbb{Z}_{n+1}, let (n;a,𝐚,𝐛)\mathcal{E}(n;a,\bm{a},\bm{b}) be the set of all 𝐱=x1xnΣ2n\bm{x}=x_{1}\cdots x_{n}\in\Sigma_{2}^{n} such that the following four conditions hold:

  • i=1nixi=a(modn+1)\mathop{\sum}\limits_{i=1}^{n}ix_{i}=a\pmod{n+1};

  • (f~e(𝒙),h~e(𝒙))=𝒂\left(\tilde{f}_{e}(\bm{x}),\tilde{h}_{e}(\bm{x})\right)=\bm{a};

  • (f~o(𝒙),h~o(𝒙))=𝒃\left(\tilde{f}_{o}(\bm{x}),\tilde{h}_{o}(\bm{x})\right)=\bm{b}; and

  • 𝒙R(n,3,P)\bm{x}\in R(n,3,P).

Then (n;a,𝐚,𝐛)\mathcal{E}(n;a,\bm{a},\bm{b}) is an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code. Furthermore, if P=log2(n)+4P=\lceil\log_{2}(n)\rceil+4, then (n;a,𝐚,𝐛)\mathcal{E}(n;a,\bm{a},\bm{b}) has redundancy at most 19+2log2(3)+log2(n+1)+14log2(7P+1)=log2(n)+14log2log2(n)+O(1)19+2\log_{2}(3)+\log_{2}(n+1)+14\log_{2}(7P+1)=\log_{2}(n)+14\log_{2}\log_{2}(n)+O(1) for some choice of 𝐚\bm{a}, 𝐛\bm{b}, and aa.

Proof:

According to Equation 5, the first condition implies that (n;a,𝒂,𝒃)\mathcal{E}(n;a,\bm{a},\bm{b}) is a subset of a VT code. So I1(𝒙)I1(𝒚)=I_{1}(\bm{x})\cap I_{1}(\bm{y})=\emptyset, and hence dH(𝒙,𝒚)2d_{H}(\bm{x},\bm{y})\geq 2 for any distinct 𝒙,𝒚\bm{x},\bm{y}. Assume that

𝒙=𝒖a𝒅b𝒘 and 𝒚=𝒖a¯𝒆b¯𝒘\bm{x}=\bm{u}a\bm{d}b\bm{w}\text{ and }\bm{y}=\bm{u}\bar{a}\bm{e}\bar{b}\bm{w}

for some 𝒖,𝒘,𝒅,𝒆Σ2\bm{u},\bm{w},\bm{d},\bm{e}\in\Sigma_{2}^{*} and a,bΣ2a,b\in\Sigma_{2}. By Lemma III.4, |I2(𝒙)I2(𝒚)|6|I_{2}(\bm{x})\cap I_{2}(\bm{y})|\leq 6. We will show that |I2(𝒙)I2(𝒚)|<5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|<5 by contradiction.

Assume that |I2(𝒙)I2(𝒚)|5\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|\geq 5. Then by Equation 18 we can partition the sequences 𝒙\bm{x} and 𝒚\bm{y} into s=n/ms=n/m segments each of length mm, such that the two inserted symbols in the common supersequence are located in at most two adjacent segments containing a𝒅ba\bm{d}b or a¯𝒆b¯\bar{a}\bm{e}\bar{b}. See Figure 4 for illustrations. This means that there exists some k{0,1,,s2}k\in\{0,1,\ldots,s-2\} such that a𝒅ba\bm{d}b and a¯𝒆b¯\bar{a}\bm{e}\bar{b} are subwords of 𝒙[km+1:km+2m]\bm{x}[km+1:km+2m] and 𝒚[km+1:km+2m]\bm{y}[km+1:km+2m], respectively. Without loss of generality, we assume kKek\in K_{e} and the proof is the same for the case kKok\in K_{o}. Let 𝒙𝒙[km+1:km+2m]\bm{x}^{\prime}\triangleq\bm{x}[km+1:km+2m] and 𝒚𝒚[km+1:km+2m]\bm{y}^{\prime}\triangleq\bm{y}[km+1:km+2m]. Then fk(𝒙)=f(𝒙)f_{k}(\bm{x})=f(\bm{x}^{\prime}), fk(𝒚)=f(𝒚)f_{k}(\bm{y})=f(\bm{y}^{\prime}), hk(𝒙)=h(𝒙)h_{k}(\bm{x})=h(\bm{x}^{\prime}) and hk(𝒚)=h(𝒚)h_{k}(\bm{y})=h(\bm{y}^{\prime}). By the choice of kk, we know that 𝒙[km+1:(k+2)m]\bm{x}[k^{\prime}m+1:(k^{\prime}+2)m] == 𝒚[km+1:(k+2)m]\bm{y}[k^{\prime}m+1:(k^{\prime}+2)m] for each kKe{k}k^{\prime}\in K_{e}\setminus\{k\}, and hence fk(𝒙)=fk(𝒚)f_{k^{\prime}}(\bm{x})=f_{k^{\prime}}(\bm{y}) and hk(𝒙)=hk(𝒚)h_{k^{\prime}}(\bm{x})=h_{k^{\prime}}(\bm{y}). Therefore, we have

f~e(𝒙)f~e(𝒚)=fk(𝒙)fk(𝒚)=f(𝒙)f(𝒚),h~e(𝒙)h~e(𝒚)=hk(𝒙)hk(𝒚)=h(𝒙)h(𝒚).\begin{array}[]{l}\tilde{f}_{e}(\bm{x})-\tilde{f}_{e}(\bm{y})=f_{k}(\bm{x})-f_{k}(\bm{y})=f(\bm{x}^{\prime})-f(\bm{y}^{\prime}),\\ \tilde{h}_{e}(\bm{x})-\tilde{h}_{e}(\bm{y})=h_{k}(\bm{x})-h_{k}(\bm{y})=h(\bm{x}^{\prime})-h(\bm{y}^{\prime}).\end{array}

The second condition in this theorem implies f~e(𝒙)=f~e(𝒚)\tilde{f}_{e}(\bm{x})=\tilde{f}_{e}(\bm{y}) and h~e(𝒙)=h~e(𝒚)\tilde{h}_{e}(\bm{x})=\tilde{h}_{e}(\bm{y}), i.e., f(𝒙)=f(𝒚)f(\bm{x}^{\prime})=f(\bm{y}^{\prime}) and h(𝒙)=h(𝒚)h(\bm{x}^{\prime})=h(\bm{y}^{\prime}). By Theorem V.1, we know that |I2(𝒙)I2(𝒚)|=0\left|I_{2}(\bm{x}^{\prime})\cap I_{2}(\bm{y}^{\prime})\right|=0. However, by Equation 15, |I2(𝒙)I2(𝒚)|\left|I_{2}(\bm{x}^{\prime})\cap I_{2}(\bm{y}^{\prime})\right| == |I2(𝒙)I2(𝒚)|\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right| 5\geq 5, a contradiction. Thus the code (n;a,𝒂,𝒃)\mathcal{E}(n;a,\bm{a},\bm{b}) is indeed an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code.

Finally, we choose P=log2(n)+4P=\lceil\log_{2}(n)\rceil+4, which implies that |R(n,3,P)|2n1\left|R(n,3,P)\right|\geq 2^{n-1} by Lemma IV.2. So there must be some an+1a\in\mathbb{Z}_{n+1}, 𝒂\bm{a}, 𝒃\bm{b} \in 4m×4m2×8m3×3×4m\mathbb{Z}_{4m}\times\mathbb{Z}_{4m^{2}}\times\mathbb{Z}_{8m^{3}}\times\mathbb{Z}_{3}\times\mathbb{Z}_{4m}, such that |(n;a,𝒂,𝒃)|2n1(n+1)M2|\mathcal{E}(n;a,\bm{a},\bm{b})|\geq\frac{2^{n-1}}{(n+1)M^{2}}, where M=4m4m28m334m=329m7M={4m}\cdot{4m^{2}}\cdot{8m^{3}}\cdot 3\cdot{4m}=3\cdot 2^{9}\cdot m^{7}. By computing the redundancy, we complete the proof. ∎

Remark V.1.

If we replace the fourth condition in Theorem V.2 with 𝐱R(n,2,P)\bm{x}\in R(n,2,P), and take P=log2(n)+3P=\left\lceil\log_{2}(n)\right\rceil+3 and m=5P+1m=5P+1, then we will get an (n,6;BI(2))(n,6;B^{I(2)})-reconstruction code with redundancy at most log2(n)+14log2log2(n)+O(1)\log_{2}(n)+14\log_{2}\log_{2}(n)+O(1), the constant term of which is slightly smaller than that of the redundancy of the code given in Theorem V.2.

Remark V.2.

Now we explain how to extend the construction in Theorem V.2 to the case mnm\nmid n. Let P=log2(n)+4P=\lceil\log_{2}(n)\rceil+4 and m=7P+1m=7P+1 as before. Suppose that n¯\bar{n} be the smallest integer such that n¯>n\bar{n}>n and mn¯m\mid\bar{n}. For each 𝐱Σ2n\bm{x}\in\Sigma_{2}^{n}, let 𝐱¯\bar{\bm{x}} be the word of length n¯\bar{n} by appending n¯n\bar{n}-n zeros at the end of 𝐱\bm{x}. Then in Theorem V.2, modify the second and third conditions by replacing 𝐱\bm{x} with 𝐱¯\bar{\bm{x}}, and keep the other two conditions unchanged. By almost the same arguments, we can show that there exists an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code with redundancy at most log2(n)+14log2log2(n)+O(1)\log_{2}(n)+14\log_{2}\log_{2}(n)+O(1).

When N{5,6}N\in\{5,6\}, an (n,N;BI(2))(n,N;B^{I(2)})-reconstruction code must be an (n,1;BI(1))(n,1;B^{I(1)})-reconstruction code. So the lower bound of ρ(n,N;BI(2))\rho(n,N;B^{I(2)}) is log2(n)+Ω(1)\log_{2}(n)+\Omega(1) (see Equation 5). Therefore, the redundancy of the code in Theorem V.2 has optimal leading term. It is worth noting that the idea in this section can be applied to all N6N\leq 6. The main task is to upper bound the length of a𝒅ba\bm{d}b and a¯𝒆b¯\bar{a}\bm{e}\bar{b} under some restrictions. We only accomplished this task when N{5,6}N\in\{5,6\}. When N4N\leq 4, the analysis will be more complicated.

To conclude this section, we summarize the main results into the following theorem.

Theorem V.3.
ρ(n,N;BI(2))={Θ(log2(n)),if 1N4,log2(n)+O(log2log2(n)),if N=5,6,log2(n)+Θ(1),if 6<Nn+3,log2log2(n)+Θ(1),if n+4N2n+4,0,if N>2n+4.\rho(n,N;B^{I(2)})=\begin{cases}\Theta(\log_{2}(n)),&\mbox{if }1\leq N\leq 4,\\ \log_{2}(n)+O(\log_{2}\log_{2}(n)),&\mbox{if }N=5,6,\\ \log_{2}(n)+\Theta(1),&\mbox{if }6<N\leq n+3,\\ \log_{2}\log_{2}(n)+\Theta(1),&\mbox{if }n+4\leq N\leq 2n+4,\\ 0,&\mbox{if }N>2n+4.\\ \end{cases}

VI Conclusion

In this paper, we study the redundancy of reconstruction codes for channels with exactly two insertions. Let NN be the number of given channels and let nn be the sequence length. It turns out that the nontrivial cases are 1N61\leq N\leq 6 and N{n+4,n+5}N\in\{n+4,n+5\}. For N=n+4N=n+4 and N=5N=5, our constructions of codes are both explicit, and the ideas are quite different. When N=n+4N=n+4, the construction is based on characterizations of two sequences when the intersection size of their 22-insertion balls is exactly n+4n+4 or n+5n+5. When N=5N=5, the construction is based on higher order parity checks on short subwords. Consequently, for all N5N\geq 5, we construct codes which are asymptotically optimal in terms of redundancy.

For general tt-insertions (with t>2t>2 being a constant), we have the following result as a generalization of Lemma III.3.

Lemma VI.1.

Let n3n\geq 3 and 𝐱,𝐲Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n}. Suppose that |I1(𝐱)I1(𝐲)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, then

|It(𝒙)It(𝒚)|I2(n+1,t1)+N2+(n1,t1)\displaystyle\left|I_{t}(\bm{x})\cap I_{t}(\bm{y})\right|\leq I_{2}(n+1,t-1)+N_{2}^{+}(n-1,t-1)

for any t2t\geq 2.

The proof of Lemma VI.1 is in Appendix D. By Equation 1 and Equation 2, we can see that I2(n+1,t1)+N2+(n1,t1)=nt1(t1)!+O(nt2)I_{2}(n+1,t-1)+N_{2}^{+}(n-1,t-1)=\frac{n^{t-1}}{(t-1)!}+O(n^{t-2}). Thus Lemma VI.1 gives a similar result as in [24]: the reconstruction code with two reads for a single insertion [21] is able to reconstruct codewords from N=nt1/(t1)!+O(nt2)N=n^{t-1}/(t-1)!+O(n^{t-2}) distinct noisy reads.

For future research, we raise the following questions.

  1. (1)

    In Theorem V.2, we construct an (n,5;BI(2))(n,5;B^{I(2)})-reconstruction code with redundancy log2(n)+14log2log2(n)+O(1)\log_{2}(n)+14\log_{2}\log_{2}(n)+O(1). The leading term log2(n)\log_{2}(n) is optimal. But we do not know whether the coefficient of the secondary term is optimal or not. This is left for future study.

  2. (2)

    Using the notations of Section V, we can see that |I2(𝒙)I2(𝒚)|=6\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=6 if and only if |Sb¯a|=|Sa¯b|=2\left|S^{a}_{\bar{b}}\right|=\left|S^{b}_{\bar{a}}\right|=2 and |Sba|=|Sa¯b¯|=1\left|S^{a}_{b}\right|=\left|S^{\bar{b}}_{\bar{a}}\right|=1. In other words, |I2(𝒙)I2(𝒚)|=6\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=6 if and only if

    𝒅b and a¯𝒆 are Type-A confusable,a𝒅 and 𝒆b¯ are Type-A confusable,a¯𝒆b¯I2(𝒅), and a𝒅bI2(𝒆).\begin{array}[]{l}\bm{d}b\text{ and }\bar{a}\bm{e}\text{ are Type-A confusable},\\ a\bm{d}\text{ and }\bm{e}\bar{b}\text{ are Type-A confusable},\\ \bar{a}\bm{e}\bar{b}\in I_{2}(\bm{d}),\text{ and }a\bm{d}b\in I_{2}(\bm{e}).\end{array}

    By these equivalent conditions, we can determine the explicit structures of 𝒅\bm{d} and 𝒆\bm{e}. Then similar to Theorem IV.2, we can construct an (n,6;BI(2))(n,6;B^{I(2)})-reconstruction code with redundancy log2(n)+2log2log2(n)+O(1)\log_{2}(n)+2\log_{2}\log_{2}(n)+O(1). But the proofs are long and tedious. The details of this construction can be provided upon requirement. Again, we do not know whether the secondary term is optimal or not.

  3. (3)

    Find a way to upper bound the length of a𝒅ba\bm{d}b and a¯𝒆b¯\bar{a}\bm{e}\bar{b} in Section V. This will be helpful for us to construct (n,N;BI(2))(n,N;B^{I(2)})-reconstruction codes for 2N42\leq N\leq 4.

Appendix A The proof of Lemma IV.1

Proof:

Recall that 𝒙=𝒖aa¯𝒗b𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w}, 𝒚=𝒖a¯𝒗bb¯𝒘\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w} and 𝒛=𝒖aa¯𝒗bb¯𝒘\bm{z}=\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w}. For simplicity, we denote 𝒙~=aa¯𝒗b\bm{\widetilde{x}}=a\bar{a}\bm{v}b, 𝒚~=a¯𝒗bb¯\bm{\widetilde{y}}=\bar{a}\bm{v}b\bar{b}, 𝒛~=aa¯𝒗bb¯\bm{\widetilde{z}}=a\bar{a}\bm{v}b\bar{b} and T=I2(𝒙~)I2(𝒚~)I1(𝒛~)T=I_{2}(\widetilde{\bm{x}})\cap I_{2}(\widetilde{\bm{y}})\setminus I_{1}(\bm{\widetilde{z}}). Clearly, 𝒖I1(𝒛~)𝒘I1(𝒛)\bm{u}I_{1}(\bm{\widetilde{z}})\bm{w}\subset I_{1}(\bm{z}), and the two sets I1(𝒛),𝒖T𝒘I_{1}(\bm{z}),\bm{u}T\bm{w} are disjoint. Next we will prove I2(𝒙)I2(𝒚)=I1(𝒛)𝒖T𝒘I_{2}(\bm{x})\cap I_{2}(\bm{y})=I_{1}(\bm{z})\cup\bm{u}T\bm{w}. It is easy to see that

I1(𝒛)𝒖T𝒘I2(𝒙)I2(𝒚).I_{1}(\bm{z})\cup\bm{u}T\bm{w}\subseteq I_{2}(\bm{x})\cap I_{2}(\bm{y}).

So we only need to prove that the opposite inclusion is true. Since 𝒖I1(𝒛~)𝒘I1(𝒛)\bm{u}I_{1}(\bm{\widetilde{z}})\bm{w}\subset I_{1}(\bm{z}), it is sufficient to show I2(𝒙)I2(𝒚)I_{2}(\bm{x})\cap I_{2}(\bm{y}) \subseteq I1(𝒛)𝒖(I2(𝒙~)I2(𝒚~))𝒘I_{1}(\bm{z})\cup\bm{u}\left(I_{2}(\widetilde{\bm{x}})\cap I_{2}(\widetilde{\bm{y}})\right)\bm{w}. Assume that 𝒛I2(𝒙)I2(𝒚)\bm{z}^{\prime}\in I_{2}(\bm{x})\cap I_{2}(\bm{y}). Suppose that k1k2k_{1}\leq k_{2} are the two insertion positions in 𝒙\bm{x},111This means that two symbols are inserted to the left of xk1x_{k_{1}} and xk2x_{k_{2}}, respectively, to get 𝒛\bm{z}^{\prime}. and 12\ell_{1}\leq\ell_{2} are the two insertion positions in 𝒚\bm{y}. Let ii and jj be the leftmost and rightmost indices where 𝒙\bm{x} and 𝒚\bm{y} differ (see Figure 5). Comparing k1,k2k_{1},k_{2} with ii, we can divide the discussions into three cases: (1) k1,k2ik_{1},k_{2}\leq i; (2) k1i,k2>ik_{1}\leq i,k_{2}>i; (3) k1,k2>ik_{1},k_{2}>i.

Case (1): k1,k2ik_{1},k_{2}\leq i (Figure 6). Since xjyjx_{j}\neq y_{j}, we must have 2>j\ell_{2}>j in this case, and either 1>j\ell_{1}>j or 1j\ell_{1}\leq j.

Subcase (1): If 1>j\ell_{1}>j, then by matching positions, we have

𝒛I2(𝒖)aa¯𝒗b𝒘𝒖a¯𝒗bb¯I2(𝒘)=𝒖(I2(𝒙~)I2(𝒚~))𝒘.\bm{z}^{\prime}\in I_{2}(\bm{u})a\bar{a}\bm{v}b\bm{w}\cap\bm{u}\bar{a}\bm{v}b\bar{b}I_{2}(\bm{w})=\bm{u}\left(I_{2}(\bm{\widetilde{x}})\cap I_{2}(\bm{\widetilde{y}})\right)\bm{w}.

Subcase (2): If 1j\ell_{1}\leq j, then

𝒛\displaystyle\bm{z}^{\prime} I2(𝒖)aa¯𝒗b𝒘I1(𝒖a¯𝒗b)b¯b𝒘\displaystyle\in I_{2}(\bm{u})a\bar{a}\bm{v}b\bm{w}\cap I_{1}(\bm{u}\bar{a}\bm{v}b)\bar{b}b\bm{w}
=(I2(𝒖)aa¯𝒗I1(𝒖a¯𝒗b)b¯)b𝒘.\displaystyle=\left(I_{2}(\bm{u})a\bar{a}\bm{v}\cap I_{1}(\bm{u}\bar{a}\bm{v}b)\bar{b}\right)b\bm{w}.

For I2(𝒖)aa¯𝒗I1(𝒖a¯𝒗b)b¯I_{2}(\bm{u})a\bar{a}\bm{v}\cap I_{1}(\bm{u}\bar{a}\bm{v}b)\bar{b}, we have

I2(𝒖)aa¯𝒗I1(𝒖a¯𝒗b)b¯=(I2(𝒖)aa¯𝒗𝒖I1(a¯𝒗b)b¯)(I2(𝒖)aa¯𝒗I1(𝒖)a¯𝒗bb¯).I_{2}(\bm{u})a\bar{a}\bm{v}\cap I_{1}(\bm{u}\bar{a}\bm{v}b)\bar{b}=\left(I_{2}(\bm{u})a\bar{a}\bm{v}\cap\bm{u}I_{1}(\bar{a}\bm{v}b)\bar{b}\right)\cup\left(I_{2}(\bm{u})a\bar{a}\bm{v}\cap I_{1}(\bm{u})\bar{a}\bm{v}b\bar{b}\right).

If the second bracket I2(𝒖)aa¯𝒗I1(𝒖)a¯𝒗bb¯I_{2}(\bm{u})a\bar{a}\bm{v}\cap I_{1}(\bm{u})\bar{a}\bm{v}b\bar{b} is not empty, then aa¯𝒗=𝒗bb¯a\bar{a}\bm{v}=\bm{v}b\bar{b}. This holds if and only if 𝒗=(aa¯)m\bm{v}=(a\bar{a})^{m} and a=ba=b, or 𝒗=(aa¯)ma\bm{v}=(a\bar{a})^{m}a and a=b¯a=\bar{b}, for some m0m\geq 0. Both cases imply that 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable by Proposition III.1(2), which is a contradiction. Therefore, I1(𝒖)a¯aa¯𝒗I1(𝒖)a¯𝒗bb¯I_{1}(\bm{u})\bar{a}a\bar{a}\bm{v}\cap I_{1}(\bm{u})\bar{a}\bm{v}b\bar{b} is empty and so

𝒛(I2(𝒖)aa¯𝒗𝒖I1(a¯𝒗b)b¯)b𝒘𝒖I1(a¯𝒗b)b¯b𝒘𝒖I2(𝒚~)𝒘.\bm{z}^{\prime}\in\left(I_{2}(\bm{u})a\bar{a}\bm{v}\cap\bm{u}I_{1}(\bar{a}\bm{v}b)\bar{b}\right)b\bm{w}\subseteq\bm{u}I_{1}(\bar{a}\bm{v}b)\bar{b}b\bm{w}\subseteq\bm{u}I_{2}(\bm{\widetilde{y}})\bm{w}.

This means that 𝒛𝒖(I2(𝒙~)I2(𝒚~))𝒘\bm{z}^{\prime}\in\bm{u}\left(I_{2}(\bm{\widetilde{x}})\cap I_{2}(\bm{\widetilde{y}})\right)\bm{w}, since 𝒛\bm{z}^{\prime} is also in I2(𝒖aa¯𝒗b𝒘)I_{2}(\bm{u}a\bar{a}\bm{v}b\bm{w}) (=I2(𝒙)=I_{2}(\bm{x})) by assumption.

Case (2) (Figure 7): k1i,k2>ik_{1}\leq i,k_{2}>i. Since xiyix_{i}\neq y_{i}, we can not have exactly one of i\ell_{i} less than ii. Thus we have 1,2i\ell_{1},\ell_{2}\leq i or 1,2>i\ell_{1},\ell_{2}>i in this case.

Subcase (1): If 1,2i\ell_{1},\ell_{2}\leq i then we must have k2>jk_{2}>j since xjyjx_{j}\neq y_{j}. So we have

𝒛\displaystyle\bm{z}^{\prime} I1(𝒖)aa¯𝒗bI1(𝒘)I2(𝒖)a¯𝒗bb¯𝒘\displaystyle\in I_{1}(\bm{u})a\bar{a}\bm{v}bI_{1}(\bm{w})\cap I_{2}(\bm{u})\bar{a}\bm{v}b\bar{b}\bm{w}
=I1(𝒖)aa¯𝒗bb¯𝒘I2(𝒖)a¯𝒗bb¯𝒘\displaystyle=I_{1}(\bm{u})a\bar{a}\bm{v}b\bar{b}\bm{w}\cap I_{2}(\bm{u})\bar{a}\bm{v}b\bar{b}\bm{w}
I1(𝒖aa¯𝒗bb¯𝒘)=I1(𝒛).\displaystyle\subseteq I_{1}(\bm{u}a\bar{a}\bm{v}b\bar{b}\bm{w})=I_{1}(\bm{z}).

Subcase (2): If 1,2>i\ell_{1},\ell_{2}>i, then

𝒛\displaystyle\bm{z}^{\prime} I1(𝒖)aI1(a¯𝒗b𝒘)𝒖a¯I2(𝒗bb¯𝒘)\displaystyle\in I_{1}(\bm{u})aI_{1}(\bar{a}\bm{v}b\bm{w})\cap\bm{u}\bar{a}I_{2}(\bm{v}b\bar{b}\bm{w})
=𝒖a¯aI1(a¯𝒗b𝒘)𝒖a¯I2(𝒗bb¯𝒘)\displaystyle=\bm{u}\bar{a}aI_{1}(\bar{a}\bm{v}b\bm{w})\cap\bm{u}\bar{a}I_{2}(\bm{v}b\bar{b}\bm{w})
=𝒖a¯(aI1(a¯𝒗b𝒘)I2(𝒗bb¯𝒘)).\displaystyle=\bm{u}\bar{a}\left(aI_{1}(\bar{a}\bm{v}b\bm{w})\cap I_{2}(\bm{v}b\bar{b}\bm{w})\right).

Note that I1(a¯𝒗b𝒘)=I_{1}(\bar{a}\bm{v}b\bm{w})= I1(a¯𝒗b)𝒘a¯𝒗bI1(𝒘)I_{1}(\bar{a}\bm{v}b)\bm{w}\cup\bar{a}\bm{v}bI_{1}(\bm{w}). So

aI1(a¯𝒗b𝒘)I2(𝒗bb¯𝒘)\displaystyle aI_{1}(\bar{a}\bm{v}b\bm{w})\cap I_{2}(\bm{v}b\bar{b}\bm{w}) =(aI1(a¯𝒗b)𝒘I2(𝒗bb¯𝒘))(aa¯𝒗bI1(𝒘)I2(𝒗bb¯𝒘))\displaystyle=\left(aI_{1}(\bar{a}\bm{v}b)\bm{w}\cap I_{2}(\bm{v}b\bar{b}\bm{w})\right)\cup\left(a\bar{a}\bm{v}bI_{1}(\bm{w})\cap I_{2}(\bm{v}b\bar{b}\bm{w})\right)
=(aI1(a¯𝒗b)𝒘I2(𝒗bb¯)𝒘)(aa¯𝒗bI1(𝒘)I1(𝒗bb¯)I1(𝒘)).\displaystyle=\left(aI_{1}(\bar{a}\bm{v}b)\bm{w}\cap I_{2}(\bm{v}b\bar{b})\bm{w}\right)\cup\left(a\bar{a}\bm{v}bI_{1}(\bm{w})\cap I_{1}(\bm{v}b\bar{b})I_{1}(\bm{w})\right).

The last equality follows from I2(𝒗bb¯𝒘)=I2(𝒗bb¯)𝒘𝒗bb¯I2(𝒘)I1(𝒗bb¯)I1(𝒘)I_{2}(\bm{v}b\bar{b}\bm{w})=I_{2}(\bm{v}b\bar{b})\bm{w}\cup\bm{v}b\bar{b}I_{2}(\bm{w})\cup I_{1}(\bm{v}b\bar{b})I_{1}(\bm{w}). If aa¯𝒗bI1(𝒘)I1(𝒗bb¯)I1(𝒘)a\bar{a}\bm{v}bI_{1}(\bm{w})\cap I_{1}(\bm{v}b\bar{b})I_{1}(\bm{w}) is not empty, then we have aa¯𝒗bI1(𝒗bb¯)a\bar{a}\bm{v}b\in I_{1}(\bm{v}b\bar{b}), which means aa¯𝒗b=𝒗bb¯ba\bar{a}\bm{v}b=\bm{v}b\bar{b}b and so aa¯𝒗=𝒗bb¯a\bar{a}\bm{v}=\bm{v}b\bar{b}. This implies that 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable, which is a contradiction. Therefore, aa¯𝒗bI1(𝒘)I1(𝒗bb¯)I1(𝒘)a\bar{a}\bm{v}bI_{1}(\bm{w})\cap I_{1}(\bm{v}b\bar{b})I_{1}(\bm{w}) is empty and

𝒛𝒖a¯(aI1(a¯𝒗b)𝒘I2(𝒗bb¯)𝒘)𝒖a¯I2(𝒗bb¯)𝒘𝒖I2(𝒚~)𝒘.\bm{z}^{\prime}\in\bm{u}\bar{a}\left(aI_{1}(\bar{a}\bm{v}b)\bm{w}\cap I_{2}(\bm{v}b\bar{b})\bm{w}\right)\subseteq\bm{u}\bar{a}I_{2}(\bm{v}b\bar{b})\bm{w}\subseteq\bm{u}I_{2}(\bm{\tilde{y}})\bm{w}.

This means that 𝒛𝒖(I2(𝒙~)I2(𝒚~))𝒘\bm{z}^{\prime}\in\bm{u}\left(I_{2}(\bm{\tilde{x}})\cap I_{2}(\bm{\tilde{y}})\right)\bm{w}, since 𝒛\bm{z}^{\prime} is also in I2(𝒖aa¯𝒗b𝒘)I_{2}(\bm{u}a\bar{a}\bm{v}b\bm{w}) (=I2(𝒙)=I_{2}(\bm{x})) by assumption.

Case (3) (Figure 8): k1,k2>ik_{1},k_{2}>i. In this case, we have 1,2i\ell_{1},\ell_{2}\leq i or 1i,2>i\ell_{1}\leq i,\ell_{2}>i. By a process similar to that of the previous two cases, we will get 𝒛𝒖(I2(𝒙~)I2(𝒚~))𝒘\bm{z}^{\prime}\in\bm{u}\left(I_{2}(\bm{\tilde{x}})\cap I_{2}(\bm{\tilde{y}})\right)\bm{w} or 𝒛I1(𝒛)\bm{z}^{\prime}\in I_{1}(\bm{z}).

Combining cases (1), (2) and (3), we conclude that I2(𝒙)I2(𝒚)I1(𝒛)𝒖T𝒘I_{2}(\bm{x})\cap I_{2}(\bm{y})\subseteq I_{1}(\bm{z})\cup\bm{u}T\bm{w}. Thus the proof is completed. ∎

𝒖\bm{u}aaa¯\bar{a}𝒗\bm{v}bb𝒘\bm{w}𝒙\bm{x}𝒖\bm{u}a¯\bar{a}𝒗\bm{v}bbb¯\bar{b}𝒘\bm{w}𝒚\bm{y}iijj
Figure 5: We denote ii and jj to be the leftmost and rightmost indices where 𝒙\bm{x} and 𝒚\bm{y} differ.
ii\star\star𝒙\bm{x}jj\star\star𝒚\bm{y}
(a) subcase (1): 1,2>j\ell_{1},\ell_{2}>j.
ii\star\star𝒙\bm{x}jj\star\star𝒚\bm{y}
(b) subcase (2): 1j,2>j\ell_{1}\leq j,\ell_{2}>j.
Figure 6: Case (1): k1,k2ik_{1},k_{2}\leq i. Stars are the inserted positions.
ii\star\star𝒙\bm{x}ii\star\star𝒚\bm{y}
(a) subcase (1): 1,2i\ell_{1},\ell_{2}\leq i.
ii\star\star𝒙\bm{x}ii\star\star𝒚\bm{y}
(b) subcase (2): 1,2>i\ell_{1},\ell_{2}>i.
Figure 7: Case (2): k1i,k2>ik_{1}\leq i,k_{2}>i. Stars are the inserted positions.
ii\star\star𝒙\bm{x}ii\star\star𝒚\bm{y}
(a) subcase (1): 1,2i\ell_{1},\ell_{2}\leq i.
ii\star\star𝒙\bm{x}ii\star\star𝒚\bm{y}
(b) subcase (2): 1i,2>i\ell_{1}\leq i,\ell_{2}>i.
Figure 8: Case (3): k1,k2>ik_{1},k_{2}>i. Stars are the inserted positions.

Appendix B Proof of Case (ii) of Theorem IV.1

Proof:

(ii). Recall that 𝒙~=aa¯𝒗\widetilde{\bm{x}}=a\bar{a}\bm{v}, 𝒚~=𝒗bb¯\widetilde{\bm{y}}=\bm{v}b\bar{b} and 𝒗=v1vl\bm{v}=v_{1}\cdots v_{l}, where l0l\geq 0 is the length of 𝒗\bm{v}. In addition, we write 𝒙~=x~1x~l+2\widetilde{\bm{x}}=\tilde{x}_{1}\cdots\tilde{x}_{l+2} and 𝒚~=y~1y~l+2\widetilde{\bm{y}}=\tilde{y}_{1}\cdots\tilde{y}_{l+2}.

By Lemma III.3, |I2(𝒙)I2(𝒚)|=n+4\left|I_{2}(\bm{x})\cap I_{2}(\bm{y})\right|=n^{\prime}+4 if and only if aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} are Type-B confusable, but not Type-A confusable. That is to say,

𝒙~=aa¯𝒗=𝒖αα¯𝒗~β𝒘, 𝒚~=𝒗bb¯=𝒖α¯𝒗~ββ¯𝒘\widetilde{\bm{x}}=a\bar{a}\bm{v}=\bm{u}\alpha\bar{\alpha}\widetilde{\bm{v}}\beta\bm{w},\text{ }\widetilde{\bm{y}}=\bm{v}b\bar{b}=\bm{u}\bar{\alpha}\widetilde{\bm{v}}\beta\bar{\beta}\bm{w} (19)

or

𝒙~=aa¯𝒗=𝒖α¯𝒗~ββ¯𝒘, 𝒚~=𝒗bb¯=𝒖αα¯𝒗~β𝒘\widetilde{\bm{x}}=a\bar{a}\bm{v}=\bm{u}\bar{\alpha}\widetilde{\bm{v}}\beta\bar{\beta}\bm{w},\text{ }\widetilde{\bm{y}}=\bm{v}b\bar{b}=\bm{u}\alpha\bar{\alpha}\widetilde{\bm{v}}\beta\bm{w} (20)

for some 𝒖,𝒗~,𝒘Σ2\bm{u},\widetilde{\bm{v}},\bm{w}\in\Sigma_{2}^{*} and α,βΣ2\alpha,\beta\in\Sigma_{2}. Note that β=α¯\beta=\bar{\alpha} since aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} have the same Hamming weight. Suppose that |𝒖|=s|\bm{u}|=s,|𝒘|=t|\bm{w}|=t, where s,t0s,t\geq 0. Then l+2=|aa¯𝒗|=s+t+3+|𝒗~|l+2=|a\bar{a}\bm{v}|=s+t+3+|\widetilde{\bm{v}}| and so ls+t+1l\geq s+t+1.

In order to simplify arguments in the sequel, we let v1=av_{-1}=a, v0=a¯v_{0}=\bar{a}, vl+1=bv_{l+1}=b and vl+2=b¯v_{l+2}=\bar{b}. Then it is clear that x~i=vi2\tilde{x}_{i}=v_{i-2} and y~i=vi\tilde{y}_{i}=v_{i} for all 1il+21\leq i\leq l+2. Both of Equation 19 and Equation 20 indicate that x~1x~s=y~1y~s\tilde{x}_{1}\cdots\tilde{x}_{s}=\tilde{y}_{1}\cdots\tilde{y}_{s} and x~l+3tx~l+2=y~l+3ty~l+2\tilde{x}_{l+3-t}\cdots\tilde{x}_{l+2}=\tilde{y}_{l+3-t}\cdots\tilde{y}_{l+2}. Therefore, we have the following two identities:

v1vs2=v1vs,vlt+1vl=vlt+3vl+2,\begin{array}[]{c}v_{-1}\cdots v_{s-2}=v_{1}\cdots v_{s},\\ v_{l-t+1}\cdots v_{l}=v_{l-t+3}\cdots v_{l+2},\end{array} (21)

for any s,t0s,t\geq 0 and ls+t+1l\geq s+t+1. From the first row in Equation 21, we know that v1v0vsv_{-1}v_{0}\cdots v_{s} is the same as that in Equation 11. And the second row in Equation 21 implies that

vlt+1vl+2={(vlt+1vlt+2)t+22,if t is even,(vlt+1vlt+2)t+12vlt+1,if t is odd.v_{l-t+1}\cdots v_{l+2}=\begin{cases}(v_{l-t+1}v_{l-t+2})^{\frac{t+2}{2}},&\mbox{if }t\text{ is even},\\ (v_{l-t+1}v_{l-t+2})^{\frac{t+1}{2}}v_{l-t+1},&\mbox{if }t\text{ is odd}.\end{cases} (22)

From Equation 22, we can see that if t>0t>0, then b¯=vl+2=vl\bar{b}=v_{l+2}=v_{l}. However, if t=0t=0, we get no information from Equation 22. Furthermore, if tt is even, we have vl+1=vlt+1v_{l+1}=v_{l-t+1}, vl+2=vlt+2v_{l+2}=v_{l-t+2}; if tt is odd, we have vl+1=vlt+2v_{l+1}=v_{l-t+2}, vl+2=vlt+1v_{l+2}=v_{l-t+1}. Therefore, we must have vlt+1vlt+2v_{l-t+1}\neq v_{l-t+2} since vl+1=bb¯=vl+2v_{l+1}=b\neq\bar{b}=v_{l+2}.

Now we divide our discussions into two cases, according to Equation 19 or Equation 20. As we will see later, they will lead to different values of 𝒗\bm{v}.

  • Firstly, we assume that Equation 19 holds. Then vs1=x~s+1=α=x~¯s+2=v¯sv_{s-1}=\tilde{x}_{s+1}=\alpha=\overline{\tilde{x}}_{s+2}=\bar{v}_{s}, vlt=x~l+2t=βv_{l-t}=\tilde{x}_{l+2-t}=\beta, vs+1=y~s+1=α¯v_{s+1}=\tilde{y}_{s+1}=\bar{\alpha}, vl+2t=y~l+2t=β¯=y~¯l+1t=v¯l+1tv_{l+2-t}=\tilde{y}_{l+2-t}=\bar{\beta}=\overline{\tilde{y}}_{l+1-t}=\bar{v}_{l+1-t} and x~s+3x~l+1t=𝒗~=y~s+2y~lt\tilde{x}_{s+3}\cdots\tilde{x}_{l+1-t}=\widetilde{\bm{v}}=\tilde{y}_{s+2}\cdots\tilde{y}_{l-t}. Thus we have

    vs1=v¯s=v¯s+1,vlt=v¯lt+2=vlt+1,vs+1vlt1=vs+2vlt.\begin{array}[]{l}v_{s-1}=\bar{v}_{s}=\bar{v}_{s+1},\\ v_{l-t}=\bar{v}_{l-t+2}=v_{l-t+1},\\ v_{s+1}\cdots v_{l-t-1}=v_{s+2}\cdots v_{l-t}.\end{array} (23)

    The three identities above imply that

    v¯s1=vs==vlt+1=v¯lt+2.\bar{v}_{s-1}=v_{s}=\cdots=v_{l-t+1}=\bar{v}_{l-t+2}. (24)

    Therefore, if t=0t=0, then b¯=vl+2=v¯l\bar{b}=v_{l+2}=\bar{v}_{l}. Combining Equations 11, 22 and 24, we get

    𝒗=v1vl={(aa¯)s2a¯lst(a¯a)t2, s,t both even,(aa¯)s2a¯lst(a¯a)t12a¯, s even,t odd,(aa¯)s12al+1st(aa¯)t2, s odd,t even,(aa¯)s12al+1st(aa¯)t12a, s,t both odd,\bm{v}=v_{1}\cdots v_{l}=\left\{\begin{array}[]{ll}(a\bar{a})^{\frac{s}{2}}\bar{a}^{l-s-t}(\bar{a}a)^{\frac{t}{2}},&\text{ }s,t\text{ both even},\\ (a\bar{a})^{\frac{s}{2}}\bar{a}^{l-s-t}(\bar{a}a)^{\frac{t-1}{2}}\bar{a},&\text{ }s\text{ even},t\text{ odd},\\ (a\bar{a})^{\frac{s-1}{2}}a^{l+1-s-t}(a\bar{a})^{\frac{t}{2}},&\text{ }s\text{ odd},t\text{ even},\\ (a\bar{a})^{\frac{s-1}{2}}a^{l+1-s-t}(a\bar{a})^{\frac{t-1}{2}}a,&\text{ }s,t\text{ both odd},\end{array}\right. (25)

    for all s,t0s,t\geq 0 and ls+t+1l\geq s+t+1. Recall that vl=b¯v_{l}=\bar{b} if t1t\geq 1 and vl=bv_{l}=b if t=0t=0. So the first and last rows correspond to the case where a=b¯a=\bar{b}, while the second and third rows correspond to the case where a=ba=b.

  • Secondly, we assume that Equation 20 holds. Comparing with the previous case, the only difference is that Equation 23 becomes

    vs1=vs+2=v¯s+1,v¯lt=vlt+2=vlt1,vsvlt2=vs+3vlt+1.\begin{array}[]{l}v_{s-1}=v_{s+2}=\bar{v}_{s+1},\\ \bar{v}_{l-t}=v_{l-t+2}=v_{l-t-1},\\ v_{s}\cdots v_{l-t-2}=v_{s+3}\cdots v_{l-t+1}.\end{array} (26)

    Therefore, if t=0t=0, we have b¯=vl+2=v¯l\bar{b}=v_{l+2}=\bar{v}_{l}. Recall that ls+t+1l\geq s+t+1. From the first two rows in Equation 26, we can deduce that ls+t+2l\geq s+t+2. Otherwise, the first two rows in Equation 26 will result in vs1=v¯s+1=vsv_{s-1}=\bar{v}_{s+1}=v_{s}, which contradicts Equation 11. Now the three identities in Equation 26 indicate that vs+1v_{s+1}\cdots vlt+1v_{l-t+1} has period 33, and that vs+1=v¯s1=vs=vs+3v_{s+1}=\bar{v}_{s-1}=v_{s}=v_{s+3} and vs+2=vs1v_{s+2}=v_{s-1}.

    If ls+t(mod3)l\equiv s+t\pmod{3}, then lt+1s+1(mod3)l-t+1\equiv s+1\pmod{3} and lt1s+2(mod3)l-t-1\equiv s+2\pmod{3}. So from the periodicity of vs+1v_{s+1}\cdots vlt+1v_{l-t+1}, we have vlt+1=vs+1=vsv_{l-t+1}=v_{s+1}=v_{s} and vlt+2=vlt1=vs+2=vs1vs=vlt+1v_{l-t+2}=v_{l-t-1}=v_{s+2}=v_{s-1}\neq v_{s}=v_{l-t+1}. Similarly, if ls+t+1(mod3)l\equiv s+t+1\pmod{3}, then lt+1(s+1)+1(mod3)l-t+1\equiv(s+1)+1\pmod{3} and lt1s+3(mod3)l-t-1\equiv s+3\pmod{3}. So vlt+1=vs+2=vs1v_{l-t+1}=v_{s+2}=v_{s-1} and vlt+2=vlt1=vs+3=vsvlt+1v_{l-t+2}=v_{l-t-1}=v_{s+3}=v_{s}\neq v_{l-t+1}. If ls+t+2(mod3)l\equiv s+t+2\pmod{3}, then lt+1(s+1)+2(mod3)l-t+1\equiv(s+1)+2\pmod{3}. So vlt+1=vs+3=vsv_{l-t+1}=v_{s+3}=v_{s} and vlt+2=v¯lt=v¯s+2=v¯s1=vsv_{l-t+2}=\overline{v}_{l-t}=\overline{v}_{s+2}=\overline{v}_{s-1}=v_{s}, where the last equality follows from Equation 11. This contradicts the fact that vlt+2vlt+1v_{l-t+2}\neq v_{l-t+1}. So we can not have ls+t+2(mod3)l\equiv s+t+2\pmod{3}.

    By the above discussions, we have

    vs+1vlt={(vsvs1vs)lst3,if ls+t(mod3),(vsvs1vs)lst13vs,if ls+t+1(mod3).v_{s+1}\cdots v_{l-t}=\begin{cases}(v_{s}v_{s-1}v_{s})^{\frac{l-s-t}{3}},&\mbox{if }l\equiv s+t\pmod{3},\\ (v_{s}v_{s-1}v_{s})^{\frac{l-s-t-1}{3}}v_{s},&\mbox{if }l\equiv s+t+1\pmod{3}.\end{cases} (27)

    Recall that vl=b¯v_{l}=\bar{b} if t1t\geq 1 and vl=bv_{l}=b if t=0t=0. Now, combining Equations 11, 22 and 27, for all s,t,0s,t,\geq 0 and ls+t+2l\geq s+t+2, we get eight cases for the values of 𝒗\bm{v}, which are listed in Table III. Note that in the third and seventh rows of Table III, we can not have t=0t=0. Otherwise, we will get vl=vl1v_{l}=v_{l-1}, which contradicts the second row in Equation 26.

From Equation 25 and Table III, we can derive the results in Table II. Precisely (in Table III, we require ls+t+2l\geq s+t+2),

Therefore, we have prove the “only if” part of the conclusion.

For the “if” part, it is straightforward to verify that if we take 𝒗\bm{v} to be one of these values, aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} are Type-B confusable. Comparing Table II with the results in (i), we can see that aa¯𝒗a\bar{a}\bm{v} and 𝒗bb¯\bm{v}b\bar{b} are not Type-A confusable. This completes the proof of (ii).

TABLE III: values of 𝒗\bm{v} derived from Equation 20
No. 𝒗\bm{v} conditions for l,sl,s and tt a=ba=b
1 (aa¯)s2(a¯aa¯)lst3(a¯a)t2(a\bar{a})^{\frac{s}{2}}(\bar{a}a\bar{a})^{\frac{l-s-t}{3}}(\bar{a}a)^{\frac{t}{2}}
ss is even, tt is even,
ls+t(mod3)l\equiv s+t\pmod{3}
No
2 (aa¯)s2(a¯aa¯)lst3(a¯a)t12a¯(a\bar{a})^{\frac{s}{2}}(\bar{a}a\bar{a})^{\frac{l-s-t}{3}}(\bar{a}a)^{\frac{t-1}{2}}\bar{a}
ss is even, tt is odd,
ls+t(mod3)l\equiv s+t\pmod{3}
Yes
3 (aa¯)s2(a¯aa¯)lst13(a¯a)t2a¯(a\bar{a})^{\frac{s}{2}}(\bar{a}a\bar{a})^{\frac{l-s-t-1}{3}}(\bar{a}a)^{\frac{t}{2}}\bar{a}
ss is even, t2t\geq 2 is even
ls+t+1(mod3)l\equiv s+t+1\pmod{3}
Yes
4 (aa¯)s2(a¯aa¯)lst13(a¯a)t+12(a\bar{a})^{\frac{s}{2}}(\bar{a}a\bar{a})^{\frac{l-s-t-1}{3}}(\bar{a}a)^{\frac{t+1}{2}}
ss is even, tt is odd
ls+t+1(mod3)l\equiv s+t+1\pmod{3}
No
5 (aa¯)s12a(aa¯a)lst3(aa¯)t2(a\bar{a})^{\frac{s-1}{2}}a(a\bar{a}a)^{\frac{l-s-t}{3}}(a\bar{a})^{\frac{t}{2}}
ss is odd, tt is even
ls+t(mod3)l\equiv s+t\pmod{3}
Yes
6 (aa¯)s12a(aa¯a)lst3(aa¯)t12a(a\bar{a})^{\frac{s-1}{2}}a(a\bar{a}a)^{\frac{l-s-t}{3}}(a\bar{a})^{\frac{t-1}{2}}a
ss is odd, tt is odd
ls+t(mod3)l\equiv s+t\pmod{3}
No
7 (aa¯)s12a(aa¯a)lst13(aa¯)t2a(a\bar{a})^{\frac{s-1}{2}}a(a\bar{a}a)^{\frac{l-s-t-1}{3}}(a\bar{a})^{\frac{t}{2}}a
ss is odd, t2t\geq 2 is even
ls+t+1(mod3)l\equiv s+t+1\pmod{3}
No
8 (aa¯)s12a(aa¯a)lst13(aa¯)t+12(a\bar{a})^{\frac{s-1}{2}}a(a\bar{a}a)^{\frac{l-s-t-1}{3}}(a\bar{a})^{\frac{t+1}{2}}
ss is odd, tt is odd
ls+t+1(mod3)l\equiv s+t+1\pmod{3}
Yes

Appendix C Proofs of Lemma V.1 and Lemma V.2

Lemma V.1 is the combination of Claim 1Claim 4, and Lemma V.2 is the combination of Claim 1 and Claim 5Claim 7.

In this section, we always denote k=|𝒘|k=\left|\bm{w}\right|, s=|𝜶|s=\left|\bm{\alpha}\right| and t=|𝜷|t=\left|\bm{\beta}\right|. Then k1k\geq 1, s0s\geq 0 and t0t\geq 0. Let βt+1b¯\beta_{t+1}\triangleq\bar{b}. For simplicity, we define 𝒙~a𝒅\widetilde{\bm{x}}\triangleq a\bm{d} and 𝒚~𝒆b¯\widetilde{\bm{y}}\triangleq\bm{e}\bar{b}. So |𝒙~|=|𝒚~|=k+s+t\left|\widetilde{\bm{x}}\right|=\left|\widetilde{\bm{y}}\right|=k+s+t. Since |Sba¯|1\left|S_{b}^{\bar{a}}\right|\geq 1, we conclude that 𝒙~\widetilde{\bm{x}} and 𝒚~\widetilde{\bm{y}} are Type-A confusable or Type-B confusable. Therefore, we can choose ii to be the leftmost index such that x~iy~i\widetilde{x}_{i}\neq\widetilde{y}_{i} and jj to be the rightmost index such that x~jy~j\widetilde{x}_{j}\neq\widetilde{y}_{j}. By the choice of ii and jj, we have x~1x~i1\widetilde{x}_{1}\cdots\widetilde{x}_{i-1} == y~1y~i1\widetilde{y}_{1}\cdots\widetilde{y}_{i-1} and x~j+1x~s+k+t\widetilde{x}_{j+1}\cdots\widetilde{x}_{s+k+t} == y~j+1y~s+k+t\widetilde{y}_{j+1}\cdots\widetilde{y}_{s+k+t}. Furthermore, according to Definition III.1 and Definition III.2, we conclude that

  • if 𝒙~\widetilde{\bm{x}} and 𝒚~\widetilde{\bm{y}} are Type-A confusable, then x~ix~j\widetilde{x}_{i}\cdots\widetilde{x}_{j} is an alternating sequence and x~ix~j=y~iy~j¯\widetilde{x}_{i}\cdots\widetilde{x}_{j}=\overline{\widetilde{y}_{i}\cdots\widetilde{y}_{j}}.

  • if 𝒙~\widetilde{\bm{x}} and 𝒚~\widetilde{\bm{y}} are Type-B confusable, then ji2j-i\geq 2. Besides, one of the following two conditions must hold:

    x~ix~i+1=y~i,x~j=y~j1y~j and x~i+2x~j1=y~i+1y~j2,\widetilde{x}_{i}\neq\widetilde{x}_{i+1}=\widetilde{y}_{i},\widetilde{x}_{j}=\widetilde{y}_{j-1}\neq\widetilde{y}_{j}\text{ and }\widetilde{x}_{i+2}\cdots\widetilde{x}_{j-1}=\widetilde{y}_{i+1}\cdots\widetilde{y}_{j-2}, (28)

    or

    x~i=y~i+1y~i,x~j1=y~jx~j and x~i+1x~j2=y~i+2y~j1.\widetilde{x}_{i}=\widetilde{y}_{i+1}\neq\widetilde{y}_{i},\widetilde{x}_{j-1}=\widetilde{y}_{j}\neq\widetilde{x}_{j}\text{ and }\widetilde{x}_{i+1}\cdots\widetilde{x}_{j-2}=\widetilde{y}_{i+2}\cdots\widetilde{y}_{j-1}. (29)
Claim 1.

If js+k+1j\leq s+k+1, then 𝛃\bm{\beta} is a periodic sequence and its period is at most 22. If isi\geq s, then 𝛂\bm{\alpha} is a periodic sequence and its period is at most 22.

Proof:

We only prove the conclusion for 𝜷\bm{\beta}, since for 𝜶\bm{\alpha}, the proof is similar.

When 0t20\leq t\leq 2, it is clear that the period of 𝜷\bm{\beta} is 22. So we assume that t3t\geq 3. When js+k+1j\leq s+k+1, we have β1βt1\beta_{1}\cdots\beta_{t-1} == x~s+k+2x~s+k+t\widetilde{x}_{s+k+2}\cdots\widetilde{x}_{s+k+t} == y~s+k+2y~s+k+t\widetilde{y}_{s+k+2}\cdots\widetilde{y}_{s+k+t} == β3βtb¯\beta_{3}\cdots\beta_{t}\bar{b}, which implies that 𝜷\bm{\beta} is a periodic sequence and its period is at most 22. ∎

Claim 2.

Suppose that is+k+2i\geq s+k+2. If a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-A confusable, then 𝛃=𝐩1𝐩2\bm{\beta}=\bm{p}_{1}\bm{p}_{2}, where 𝐩1,𝐩2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 22.

Proof:

Since s+k+2is+k+ts+k+2\leq i\leq s+k+t, we know that t2t\geq 2. When t=2t=2, the conclusion is trivial. So we assume that t3t\geq 3. When is+k+2i\geq s+k+2, we have β1βisk2\beta_{1}\cdots\beta_{i-s-k-2} == x~s+k+2x~i1\widetilde{x}_{s+k+2}\cdots\widetilde{x}_{i-1} == y~s+k+2y~i1\widetilde{y}_{s+k+2}\cdots\widetilde{y}_{i-1} == β3βisk\beta_{3}\cdots\beta_{i-s-k}, βisk1βjsk1\beta_{i-s-k-1}\cdots\beta_{j-s-k-1} == x~ix~j\widetilde{x}_{i}\cdots\widetilde{x}_{j} == y~iy~j¯\overline{\widetilde{y}_{i}\cdots\widetilde{y}_{j}} == βisk+1βjsk+1¯\overline{\beta_{i-s-k+1}\cdots\beta_{j-s-k+1}} and βjskβt1\beta_{j-s-k}\cdots\beta_{t-1} == x~j+1x~s+k+t\widetilde{x}_{j+1}\cdots\widetilde{x}_{s+k+t} == y~j+1y~s+k+t\widetilde{y}_{j+1}\cdots\widetilde{y}_{s+k+t} == βjsk+2βt+1\beta_{j-s-k+2}\cdots\beta_{t+1}. The first and third equalities imply that β1βisk\beta_{1}\cdots\beta_{i-s-k} and βjskβt+1\beta_{j-s-k}\cdots\beta_{t+1} both have period at most 22. Recall that x~ix~j\widetilde{x}_{i}\cdots\widetilde{x}_{j} is an alternating sequence. So ji=0j-i=0 or 11. Otherwise the second equality will lead to βisk+1=βisk1=β¯isk+1\beta_{i-s-k+1}=\beta_{i-s-k-1}=\bar{\beta}_{i-s-k+1}, which is impossible. Now we can conclude that β\beta is the concatenation of two sequences, and each of them has period at most 22. ∎

Claim 3.

Suppose that is+k+1i\leq s+k+1 and js+k+2j\geq s+k+2. If a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-A confusable, then 𝛃=𝐩1𝐩2\bm{\beta}=\bm{p}_{1}\bm{p}_{2}, where 𝐩1,𝐩2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 22.

Proof:

Since s+k+2js+k+ts+k+2\leq j\leq s+k+t, we know that t2t\geq 2. When t=2t=2, the conclusion is trivial. So we assume that t3t\geq 3. When is+k+1i\leq s+k+1 and js+k+2j\geq s+k+2, we have β1βjsk1\beta_{1}\cdots\beta_{j-s-k-1} == x~s+k+2x~j\widetilde{x}_{s+k+2}\cdots\widetilde{x}_{j} == y~s+k+2y~j¯\overline{\widetilde{y}_{s+k+2}\cdots\widetilde{y}_{j}} == β3βjsk+1¯\overline{\beta_{3}\cdots\beta_{j-s-k+1}} and βjskβt1\beta_{j-s-k}\cdots\beta_{t-1} == x~j+1x~s+k+t\widetilde{x}_{j+1}\cdots\widetilde{x}_{s+k+t} == y~j+1y~s+k+t\widetilde{y}_{j+1}\cdots\widetilde{y}_{s+k+t} == βjsk+2βt+1\beta_{j-s-k+2}\cdots\beta_{t+1}. The second equality implies that βjskβt+1\beta_{j-s-k}\cdots\beta_{t+1} has period at most 22. Recall that x~ix~j\widetilde{x}_{i}\cdots\widetilde{x}_{j} is an alternating sequence and js+k+2j\geq s+k+2. So j=s+k+2j=s+k+2 or s+k+3s+k+3. Otherwise the first equality will lead to β3=β1=β¯3\beta_{3}=\beta_{1}=\bar{\beta}_{3}, which is impossible. Now we can take 𝒑1=β1β2\bm{p}_{1}=\beta_{1}\beta_{2} and 𝒑2=β3βt\bm{p}_{2}=\beta_{3}\cdots\beta_{t}. ∎

Similar to Claim 2 and Claim 3, we can prove the following claim.

Claim 4.

Suppose that a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-A confusable, and that s3s\geq 3. If js1j\leq s-1, or if is1i\leq s-1 and jsj\geq s, then 𝛂=𝐩1𝐩2\bm{\alpha}=\bm{p}_{1}\bm{p}_{2}, where 𝐩1,𝐩2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 22.

Claim 5.

Suppose that is+k+1i\leq s+k+1 and js+k+2j\geq s+k+2, and that a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-B confusable. Then

  • 𝜷=β1𝒑1𝒑2\bm{\beta}=\beta_{1}\bm{p}_{1}\bm{p}_{2}, where 𝒑1,𝒑2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 22; or

  • 𝜷=𝒑1𝒑2\bm{\beta}=\bm{p}_{1}\bm{p}_{2}, where 𝒑1,𝒑2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 33.

Proof:

When t3t\leq 3 the conclusion is trivial. So we assume that t4t\geq 4. If Equation 28 is true, then β2βjsk2\beta_{2}\cdots\beta_{j-s-k-2} == x~s+k+3x~j1\widetilde{x}_{s+k+3}\cdots\widetilde{x}_{j-1} == y~s+k+2y~j2\widetilde{y}_{s+k+2}\cdots\widetilde{y}_{j-2} == β3βjsk1\beta_{3}\cdots\beta_{j-s-k-1} and βjskβt1\beta_{j-s-k}\cdots\beta_{t-1} == x~j+1x~s+k+t\widetilde{x}_{j+1}\cdots\widetilde{x}_{s+k+t} == y~j+1y~s+k+t\widetilde{y}_{j+1}\cdots\widetilde{y}_{s+k+t} == βjsk+2βt+1\beta_{j-s-k+2}\cdots\beta_{t+1}. Now we can conclude that β2βjsk1\beta_{2}\cdots\beta_{j-s-k-1} is a run and hence has period 11, and that βjskβt+1\beta_{j-s-k}\cdots\beta_{t+1} is a sequence of period at most 22.

If Equation 29 is true, then β1βjsk3\beta_{1}\cdots\beta_{j-s-k-3} == x~s+k+2x~j2\widetilde{x}_{s+k+2}\cdots\widetilde{x}_{j-2} == y~s+k+3y~j1\widetilde{y}_{s+k+3}\cdots\widetilde{y}_{j-1} == β4βjsk\beta_{4}\cdots\beta_{j-s-k} and
βjskβt1\beta_{j-s-k}\cdots\beta_{t-1} == x~j+1x~s+k+t\widetilde{x}_{j+1}\cdots\widetilde{x}_{s+k+t} == y~j+1y~s+k+t\widetilde{y}_{j+1}\cdots\widetilde{y}_{s+k+t} == βjsk+2βt+1\beta_{j-s-k+2}\cdots\beta_{t+1}. Now we can conclude that β1βjsk\beta_{1}\cdots\beta_{j-s-k} is sequence of period at most 33, and that βjskβt+1\beta_{j-s-k}\cdots\beta_{t+1} is a sequence of period at most 22. ∎

Claim 6.

Suppose that is+k+2i\geq s+k+2, and that a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-B confusable. Then 𝛃=𝐩1𝐩2𝐩3\bm{\beta}=\bm{p}_{1}\bm{p}_{2}\bm{p}_{3}, where 𝐩1,𝐩2,𝐩3Σ2\bm{p}_{1},\bm{p}_{2},\bm{p}_{3}\in\Sigma_{2}^{*} are sequences of period at most 33.

Proof:

Similar to the proof of Claim 2, we can see that β1βisk\beta_{1}\cdots\beta_{i-s-k} and βjskβt+1\beta_{j-s-k}\cdots\beta_{t+1} both have period at most 22. If Equation 28 is true, then βisk+1βjsk2\beta_{i-s-k+1}\cdots\beta_{j-s-k-2} == x~i+2x~j1\widetilde{x}_{i+2}\cdots\widetilde{x}_{j-1} == y~i+1y~j2\widetilde{y}_{i+1}\cdots\widetilde{y}_{j-2} == βisk+2βjsk1\beta_{i-s-k+2}\cdots\beta_{j-s-k-1}. Now we can conclude that βisk+1βjsk1\beta_{i-s-k+1}\cdots\beta_{j-s-k-1} is a run and hence has period 11.

If Equation 29 is true, then βiskβjsk3\beta_{i-s-k}\cdots\beta_{j-s-k-3} == x~i+1x~j2\widetilde{x}_{i+1}\cdots\widetilde{x}_{j-2} == y~i+2y~j1\widetilde{y}_{i+2}\cdots\widetilde{y}_{j-1} == βisk+3βjsk\beta_{i-s-k+3}\cdots\beta_{j-s-k}. Therefore, βiskβjsk\beta_{i-s-k}\cdots\beta_{j-s-k} is sequence of period at most 33. ∎

Similar to Claim 5 and Claim 6, we can prove the following claim.

Claim 7.

Suppose that a𝐝a\bm{d} and 𝐞b¯\bm{e}\bar{b} are Type-B confusable, and that s4s\geq 4.

  • If jsj\geq s and is1i\leq s-1, then

    • 𝜶=𝒑1𝒑2αs\bm{\alpha}=\bm{p}_{1}\bm{p}_{2}\alpha_{s}, where 𝒑1,𝒑2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 22; or

    • 𝜶=𝒑1𝒑2\bm{\alpha}=\bm{p}_{1}\bm{p}_{2}, where 𝒑1,𝒑2Σ2\bm{p}_{1},\bm{p}_{2}\in\Sigma_{2}^{*} are sequences of period at most 33.

  • If js1j\leq s-1, then 𝜶=𝒑1𝒑2𝒑3\bm{\alpha}=\bm{p}_{1}\bm{p}_{2}\bm{p}_{3}, where 𝒑1,𝒑2,𝒑3Σ2\bm{p}_{1},\bm{p}_{2},\bm{p}_{3}\in\Sigma_{2}^{*} are sequences of period at most 33.

Appendix D The proof of Lemma VI.1

Firstly, we need the following lemma, which is a generalization of the base case (i.e., the case where 𝒖=𝒘=\bm{u}=\bm{w}=\emptyset) in the proof of Lemma III.3.

Lemma D.1.

Let n3n\geq 3 and 𝐱,𝐲Σ2n\bm{x},\bm{y}\in\Sigma_{2}^{n} such that

𝒙=aa¯𝒗b, 𝒚=a¯𝒗bb¯\bm{x}=a\bar{a}\bm{v}b,\text{ }\bm{y}=\bar{a}\bm{v}b\bar{b}

for some a,bΣ2a,b\in\Sigma_{2} and 𝐯\bm{v} Σ2n3\in\Sigma_{2}^{n-3}. If I1(𝐱)I1(𝐲)={𝐳}I_{1}(\bm{x})\cap I_{1}(\bm{y})=\{\bm{z}\}, then

|It(𝒙)It(𝒚)||It1(𝒛)|+N2+(n1,t1)=I2(n+1,t1)+N2+(n1,t1)\left|I_{t}(\bm{x})\cap I_{t}(\bm{y})\right|\leq\left|I_{t-1}(\bm{z})\right|+N_{2}^{+}(n-1,t-1)=I_{2}(n+1,t-1)+N_{2}^{+}(n-1,t-1)

for any t2t\geq 2.

Proof:

It is clear that 𝒛=aa¯𝒗bb¯\bm{z}=a\bar{a}\bm{v}b\bar{b}. Let S=It(𝒙)It(𝒚)S=I_{t}(\bm{x})\cap I_{t}(\bm{y}). Then S=SaSb¯Sba¯S=S^{a}\cup S_{\bar{b}}\cup S_{b}^{\bar{a}} and

Sa=a(It(a¯𝒗b)It1(a¯𝒗bb¯))=aIt1(a¯𝒗bb¯)It1(𝒛),Sb¯=(It1(aa¯𝒗b)It(a¯𝒗b))b¯=It1(aa¯𝒗b)b¯It1(𝒛),Sba¯=a¯(It1(aa¯𝒗)It1(𝒗bb¯))b.\begin{array}[]{l}S^{a}=a\left(I_{t}(\bar{a}\bm{v}b)\cap I_{t-1}(\bar{a}\bm{v}b\bar{b})\right)=aI_{t-1}(\bar{a}\bm{v}b\bar{b})\subseteq I_{t-1}(\bm{z}),\\ S_{\bar{b}}=\left(I_{t-1}(a\bar{a}\bm{v}b)\cap I_{t}(\bar{a}\bm{v}b)\right)\bar{b}=I_{t-1}(a\bar{a}\bm{v}b)\bar{b}\subseteq I_{t-1}(\bm{z}),\\ S_{b}^{\bar{a}}=\bar{a}\left(I_{t-1}(a\bar{a}\bm{v})\cap I_{t-1}(\bm{v}b\bar{b})\right)b.\end{array}

If aa¯𝒗=𝒗bb¯a\bar{a}\bm{v}=\bm{v}b\bar{b}, then a=ba=b and 𝒗=(aa¯)m\bm{v}=(a\bar{a})^{m}, or a=b¯a=\bar{b} and 𝒗=(aa¯)ma\bm{v}=(a\bar{a})^{m}a for some m0m\geq 0. For both cases, 𝒙\bm{x} and 𝒚\bm{y} are Type-A confusable by Proposition III.1 (2), which contradicts the fact that |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1. Therefore, aa¯𝒗𝒗bb¯a\bar{a}\bm{v}\neq\bm{v}b\bar{b} and hence |It(𝒙)It(𝒚)|=|S|=|SaSb¯|+|Sba¯||It1(𝒛)|+N2+(n1,t1)\left|I_{t}(\bm{x})\cap I_{t}(\bm{y})\right|=|S|=|S^{a}\cup S_{\bar{b}}|+|S_{b}^{\bar{a}}|\leq\left|I_{t-1}(\bm{z})\right|+N_{2}^{+}(n-1,t-1). ∎

Proof of Lemma VI.1 Since |I1(𝒙)I1(𝒚)|=1\left|I_{1}(\bm{x})\cap I_{1}(\bm{y})\right|=1, by Lemma III.2 we can assume that 𝒙=𝒖aa¯𝒗b𝒘\bm{x}=\bm{u}a\bar{a}\bm{v}b\bm{w} and 𝒚=𝒖a¯𝒗bb¯𝒘\bm{y}=\bm{u}\bar{a}\bm{v}b\bar{b}\bm{w} for some a,bΣ2a,b\in\Sigma_{2} and 𝒖,𝒗,𝒘Σ2\bm{u},\bm{v},\bm{w}\in\Sigma_{2}^{*}. Let n0=|aa¯𝒗b|n_{0}=|a\bar{a}\bm{v}b|. Then nn0n\geq n_{0}. Let S=It(𝒙)It(𝒚)S=I_{t}(\bm{x})\cap I_{t}(\bm{y}). We prove this theorem by induction on nn and tt.

The base case is n=n0n=n_{0} or t=2t=2. When n=n0n=n_{0}, the conclusion follows from Lemma D.1. When t=2t=2, the conclusion follows from Lemma III.3, since I2(n+1,1)+N2+(n1,1)=n+5I_{2}(n+1,1)+N_{2}^{+}(n-1,1)=n+5. Now suppose we have proved this theorem for all n<nn^{\prime}<n and t<tt^{\prime}<t. Without loss of generality, we assume that 𝒖=c𝒖\bm{u}=c\bm{u}^{*} for some cΣ2c\in\Sigma_{2} and 𝒖Σ2\bm{u}^{*}\in\Sigma_{2}^{*}. Clearly, S=Sc¯ScS=S^{\bar{c}}\cup S^{c}, and

Sc¯=c¯(It1(𝒙)It1(𝒚)),Sc=c(It(𝒖aa¯𝒗b𝒘)It(𝒖a¯𝒗bb¯𝒘)).\begin{array}[]{c}S^{\bar{c}}=\bar{c}\left(I_{t-1}(\bm{x})\cap I_{t-1}({\bm{y}})\right),\\ S^{c}=c\left(I_{t}(\bm{u}^{*}a\bar{a}\bm{v}b\bm{w})\cap I_{t}(\bm{u}^{*}\bar{a}\bm{v}b\bar{b}\bm{w})\right).\end{array}

By the induction hypothesis, |Sc¯|I2(n+1,t2)+N2+(n1,t2)|S^{\bar{c}}|\leq I_{2}(n+1,t-2)+N_{2}^{+}(n-1,t-2) and |Sc|I2(n,t1)+N2+(n2,t1)|S^{c}|\leq I_{2}(n,t-1)+N_{2}^{+}(n-2,t-1). By Equations (1) and (2), we can get the following two identities,

I2(n+1,t1)=I2(n+1,t2)+I2(n,t1), N2+(n1,t1)=N2+(n1,t2)+N2+(n2,t1).\begin{array}[]{c}I_{2}(n+1,t-1)=I_{2}(n+1,t-2)+I_{2}(n,t-1),\\ \text{ }N_{2}^{+}(n-1,t-1)=N_{2}^{+}(n-1,t-2)+N_{2}^{+}(n-2,t-1).\end{array}

Then the proof is completed. \blacksquare

References

  • [1] V. I. Levenshtein, “Reconstruction of objects from the minimum number of distorted patterns,” Dokl. Akad. Nauk, vol. 354, no. 5, pp. 593–596, 1997.
  • [2] ——, “Efficient Reconstruction of Sequences,” IEEE Trans. Inf. Theory, vol. 47, no. 1, pp. 2–22, Jan. 2001.
  • [3] ——, “Efficient Reconstruction of Sequences from Their Subsequences or Supersequences,” J. Combinat. Theory, A, vol. 93, no. 2, pp. 310–332, Feb. 2001.
  • [4] T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing Strings from Random Traces,” in Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), New Orleans, LA, USA, Jan. 2004, pp. 910–918.
  • [5] S. Kannan and A. McGregor, “More on reconstructing strings from random traces: insertions and deletions,” in Proc. Int. Symp. Inf. Theory (ISIT), Adelaide, Australia, Sep. 2005, pp. 297–301.
  • [6] K. Viswanathan and R. Swaminathan, “Improved string reconstruction over insertion-deletion channels,” in Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), San Francisco, CA, USA, Jan. 2008, pp. 399–408.
  • [7] T. Holenstein, M. Mitzenmacher, R. Panigrahy, and U. Wieder, “Trace reconstruction with constant deletion probability and related results,” in Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), San Francisco, CA, USA, Jan. 2008, pp. 389–398.
  • [8] M. Cheraghchi, R. Gabrys, O. Milenkovic, and J. Ribeiro, “Coded Trace Reconstruction,” IEEE Trans. Inf. Theory, vol. 66, no. 10, pp. 6084–6103, Oct. 2020.
  • [9] J. Brakensiek, R. Li, and B. Spang, “Coded trace reconstruction in a constant number of traces,” in Proc. Annu. Symp. Found. Comput. Sci. (FOCS), Durham, NC, USA, Nov. 2020, pp. 482–493.
  • [10] E. Konstantinova, V. I. Levenshtein, and J. Siemons, “Reconstruction of permutations distorted by single transposition errors,” arXiv: 0702191, 2007.
  • [11] E. Konstantinova, “On reconstruction of signed permutations distorted by reversal errors,” Discrete Math., vol. 308, no. 5, pp. 974–984, Mar. 2008.
  • [12] ——, “Reconstruction of permutations distorted by reversal errors,” Discrete Appl. Math., vol. 155, no. 18, pp. 2426–2434, Nov. 2007.
  • [13] V. I. Levenshtein, E. Konstantinova, E. Konstantinov, and S. Molodtsov, “Reconstruction of a graph from 2-vicinities of its vertices,” Discrete Appl. Math., vol. 156, no. 9, pp. 1399–1406, May 2008.
  • [14] V. I. Levenshtein and J. Siemons, “Error graphs and the reconstruction of elements in groups,” J. Combinat. Theory, A, vol. 116, no. 4, pp. 795–815, May 2009.
  • [15] E. Yaakobi, M. Schwartz, M. Langberg, and J. Bruck, “Sequence reconstruction for Grassmann graphs and permutations,” in Proc. Int. Symp. Inf. Theory (ISIT), Istanbul, Turkey, Oct. 2013, pp. 874–878.
  • [16] R. Gabrys and E. Yaakobi, “Sequence reconstruction over the deletion channel,” in Proc. Int. Symp. Inf. Theory (ISIT), Barcelona, Spain, Aug. 2016, pp. 1596–1600.
  • [17] ——, “Sequence Reconstruction Over the Deletion Channel,” IEEE Trans. Inf. Theory, vol. 64, no. 4, pp. 2924–2931, Apr. 2018.
  • [18] F. Sala, R. Gabrys, C. Schoeny, and L. Dolecek, “Exact Reconstruction From Insertions in Synchronization Codes,” IEEE Trans. Inf. Theory, vol. 63, no. 4, pp. 2428–2445, Apr. 2017.
  • [19] V. L. P. Pham, K. Goyal, and H. M. Kiah, “Sequence Reconstruction Problem for Deletion Channels: A Complete Asymptotic Solution,” arXiv:2111.04255v1, 2021.
  • [20] H. M. Kiah, T. Thanh Nguyen, and E. Yaakobi, “Coding for Sequence Reconstruction for Single Edits,” in Proc. Int. Symp. Inf. Theory (ISIT), Los Angeles, CA, USA, 2020, pp. 676–681.
  • [21] K. Cai, H. M. Kiah, T. T. Nguyen, and E. Yaakobi, “Coding for Sequence Reconstruction for Single Edits,” IEEE Trans. Inf. Theory, vol. 68, no. 1, pp. 66–79, Jan. 2022.
  • [22] J. Chrisnata, H. M. Kiah, and E. Yaakobi, “Optimal Reconstruction Codes for Deletion Channels,” arXiv: 2004.06032, 2020.
  • [23] V. Guruswami and J. Håstad, “Explicit two-deletion codes with redundancy matching the existential bound,” IEEE Trans. Inf. Theory, vol. 67, no. 10, pp. 6384–6394, Oct. 2021.
  • [24] J. Chrisnata, H. M. Kiah, and E. Yaakobi, “Correcting Deletions with Multiple Reads,” IEEE Trans. Inf. Theory, vol. Early Access, 2022.
  • [25] J. Brakensiek, V. Guruswami, and S. Zbarsky, “Efficient Low-Redundancy Codes for Correcting Multiple Deletions,” IEEE Trans. Inf. Theory, vol. 64, no. 5, pp. 3403–3410, May 2018.
  • [26] R. Gabrys and F. Sala, “Codes Correcting Two Deletions,” IEEE Trans. Inf. Theory, vol. 65, no. 2, pp. 965–974, Feb. 2019.
  • [27] J. Sima, N. Raviv, and J. Bruck, “Two Deletion Correcting Codes From Indicator Vectors,” IEEE Trans. Inf. Theory, vol. 66, no. 4, pp. 2375–2391, Apr. 2020.
  • [28] J. Sima and J. Bruck, “On Optimal kk-Deletion Correcting Codes,” IEEE Trans. Inf. Theory, vol. 67, no. 6, pp. 3360–3375, Jun. 2021.
  • [29] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, Feb. 1966.
  • [30] N. J. A. Sloane, “On single-deletion-correcting codes,” Codes and Designs, vol. 10, pp. 273–291, May 2002.
  • [31] Y. M. Chee, H. M. Kiah, A. Vardy, V. K. Vu, and E. Yaakobi, “Coding for Racetrack Memories,” IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 7094–7112, Nov. 2018.