This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\hideLIPIcs

M&D Data Science Center, Institute of Integrated Research, Institute of Science Tokyo, [email protected]://orcid.org/0000-0002-6856-5185JSPS KAKENHI Grant Number JP24K02899 Kyushu Institute of Technology, [email protected]://orcid.org/0000-0001-9106-6192 Department of Informatics, Kyushu University, [email protected]://orcid.org/0000-0001-6269-9353JSPS KAKENHI Grant Number JP21K17705 and JP23H04386 \ccsdesc[500]Theory of computation Data compression \ccsdesc[500]Mathematics of computing Combinatorics on words \EventEditorsJohn Q. Open and Joan R. Access \EventNoEds2 \EventLongTitle42nd Conference on Very Important Topics (CVIT 2016) \EventShortTitleCVIT 2016 \EventAcronymCVIT \EventYear2016 \EventDateDecember 24–27, 2016 \EventLocationLittle Whinging, United Kingdom \EventLogo \SeriesVolume42 \ArticleNo23

On the compressiveness of the Burrows-Wheeler transform

Hideo Bannai    Tomohiro I    Yuto Nakashima
Abstract

The Burrows-Wheeler transform (BWT) is a reversible transform that converts a string ww into another string 𝖡𝖶𝖳(w)\mathsf{BWT}(w). The size of the run-length encoded BWT (RLBWT) can be interpreted as a measure of repetitiveness in the class of representations called dictionary compression which are essentially representations based on copy and paste operations. In this paper, we shed new light on the compressiveness of BWT and the bijective BWT (BBWT). We first extend previous results on the relations of their run-length compressed sizes rr and rBr_{B}. We also show that the so-called “clustering effect” of BWT and BBWT can be captured by measures other than empirical entropy or run-length encoding. In particular, we show that BWT and BBWT do not increase the repetitiveness of the string with respect to various measures based on dictionary compression by more than a polylogarithmic factor. Furthermore, we show that there exists an infinite family of strings that are maximally incompressible by any dictionary compression measure, but become very compressible after applying BBWT. An interesting implication of this result is that it is possible to transcend dictionary compression in some cases by simply applying BBWT before applying dictionary compression.

keywords:
Data Compression, Bijective Burrows-Wheeler Transform, Fibonacci words
category:
\relatedversion

1 Introduction

The Burrows-Wheeler Transform (BWT) [5] is a reversible mapping from a string to another string that enables compression and efficient pattern search, and is the theoretical cornerstone for essential tools in the field of bioinformatics [17, 18]. The compressibility of BWT has been studied in various contexts, but more recently, rather than statistical measures such as empirical entropy which are not helpful in highly repetitive datasets, the size rr of the run-length encoded BWT (RLBWT) and its relation to the many other repetitiveness measures related to dictionary compression has become an important topic of study (See [21] for a comprehensive survey).

Dictionary compression is a family of compressed representations which are essentially based on copy and paste operations. Kempa and Prezza [15] proposed the notion of string attractors to view dictionary compression in a uniform way, and showed that the size γ\gamma of the smallest string attractor lower bounds all other measures of dictionary compression, since they implicitly give a string attractor of the same size as their representations: the size bb of the smallest bidirectional macro scheme (BMS) [23], the size zz of the LZ77 parsing [25], the size grlg_{\mathrm{rl}} of the smallest grammar with run-length rules, the size gg of the smallest grammar, rr, and more. Another measure δ\delta based on substring complexity is known to lowerbound γ\gamma [16].

Concerning rr in relation to the other measures, Navarro et al. [22] showed that given an RLBWT of size rr, we can construct a BMS of size 2r2r, thus showing b=O(r)b=O(r). Kempa and Kociumaka [14] showed r=O(zlog2n)r=O(z\log^{2}n), and further r=O(δlog2n)r=O(\delta\log^{2}n)111More precisely, r=O(δlogδmax(1,lognδlogδ))r=O(\delta\log\delta\max(1,\log\frac{n}{\delta\log\delta})). Also, z=O(blogn)z=O(b\log n) [22] has been shown, and therefore z=O(rlogn)z=O(r\log n) or, r=Ω(z/logn)r=\Omega(z/\log n). Recently, the size rBr_{B} of the run-length encoded bijective BWT (RLBBWT) has been studied [2, 1], and was shown, similarly to rr, that rB=O(zlog2n)r_{B}=O(z\log^{2}n) [1]. Also, the existence of an infinite family of strings such that r=o(rB)r=o(r_{B}), namely, r=O(1)r=O(1) and rB=Ω(logn)r_{B}=\Omega(\log n) was shown [1]. The existence of the opposite case, i.e., if there are string families for which rB=o(r)r_{B}=o(r), was left open.

The “clustering effect” of the BWT in terms of the run length encoding was studied by Mantaci et al. [19], where they proved that BWT can increase the size of the run-length encoding of the string by a factor of at most 22. The maximal clustering effect for run-length encoding can be seen for example with the Fibonacci words: Fibonacci words of length nn have a run-length encoding of size Θ(n)\Theta(n), while r=2r=2. While this is remarkable when viewed from run-length encoding since a maximally incompressible string (in terms of rle) is transformed into a compressible one, in terms of the smallest bidirectional macro scheme, Fibonacci words and their BWT have the same size 44, and one might argue that BWT is merely transforming a compressible string into another compressible string.

In this paper, we extend previous results on the relation between rr and rBr_{B}. We also investigate the clustering effect of BWT and BBWT in terms of other repetitiveness measures and show that they can be much more powerful than previously perceived, giving an example where BBWT transforms a maximally incompressible string (in terms of dictionary compression) into a compressible one.

The contributions of this paper are as follows.

  1. 1.

    We show an infinite family of strings such that r=Ω(logn)r=\Omega(\log n) and rB=O(1)r_{B}=O(1), answering affirmatively an open question about the existence of a family of strings where rB=o(r)r_{B}=o(r) raised by Badkobeh et al. [1].

  2. 2.

    We show a polylogarithmic (O(log4n)O(\log^{4}n)) upper bound on the multiplicative difference between rr and rBr_{B}.

  3. 3.

    We analyze the clustering effect of BWT and BBWT in terms of other repetitiveness measures, and show that the application of BWT and BBWT can only increase various repetitiveness measures of a string by a polylogarithmic factor.

  4. 4.

    We show that BBWT on the strings whose BBWT images are Fibonacci words exhibit a maximal clustering effect in terms of the repetitiveness measures δ\delta, γ\gamma, bb, vv, or rr, from almost linear (Θ(n/logn)\Theta(n/\log n)) to constant. The main combinatorial part of our proof is an elegant characterization of the LF mapping function on Fibonacci strings that uses the Zeckendorf representation of the positions.

    As a byproduct, this result shows that it is possible in some cases to transcend dictionary compression by simply applying BBWT before applying dictionary compression.

2 Preliminaries

Let Σ\Sigma denote a set of symbols called the alphabet. A string is an element of Σ\Sigma^{*}. If w=xyzw=xyz for any strings w,x,y,zw,x,y,z, then, xx, yy, zz are respectively a prefix, substring, and suffix of ww and are a proper prefix, substring, or suffix if they are not equal to ww. The length of string xx is denoted by |x||x|. The empty string, i.e., the string of length 0 is denoted by ε\varepsilon. The symbol at position ii will be denoted by x[i]x[i], and we will use a 0-based index, i.e., x=x[0]x[|x|1]x=x[0]\cdots x[|x|-1]. For integers i,j[0,|x|)i,j\in[0,|x|), let x[i..j]=x[i]x[j]x[i..j]=x[i]\cdots x[j] if iji\leq j and x[i..j]=εx[i..j]=\varepsilon if i>ji>j. We will also use x[i..j)=x[i..j1]x[i..j)=x[i..j-1]. For any string xx, x0=εx^{0}=\varepsilon, and for any integer i1i\geq 1, xi=xi1xx^{i}=x^{i-1}x. A string ww is primitive if it cannot be represented as w=xkw=x^{k} for some string xx and integer k2k\geq 2. For any symbol cΣc\in\Sigma let |x|c=|{ix[i]=c}||x|_{c}=|\{i\mid x[i]=c\}|, i.e., the number of cc’s in xx.

For any string xx, let 𝗋𝗈𝗍(x)=x[1..|x|)x[0]\mathsf{rot}(x)=x[1..|x|)x[0], denote a left cyclic rotation of xx by one symbol. Notice that for any integer ii, 𝗋𝗈𝗍i(x)\mathsf{rot}^{i}(x) naturally corresponds to the cyclic rotation of xx by ii symbols to the left if i>0i>0, and to the right if i<0i<0. We will use 𝗋𝗈𝗍(x)\mathsf{rot}^{*}(x) to denote the lexicographic smallest rotation of xx. A conjugacy class is an equivalence class of the equivalence relation defined by xy𝗋𝗈𝗍(x)=𝗋𝗈𝗍(y)x\equiv y\iff\mathsf{rot}^{*}(x)=\mathsf{rot}^{*}(y).

A string ww is a Lyndon word if it is lexicographically smaller than any of its, proper rotations, i.e., w<𝗋𝗈𝗍k(w)w<\mathsf{rot}^{k}(w) for any k[1,|w|)k\in[1,|w|). From the definition, it is clear that a Lyndon word must be primitive. A rotation of a primitive string xx is always primitive and 𝗋𝗈𝗍(x)\mathsf{rot}^{*}(x) is Lyndon and unique, which we will sometimes refer to as the Lyndon rotation of xx. It is known that Lyndon words cannot have a non-empty proper border (a proper prefix that is also a proper suffix). The Lyndon factorization [6] of a string ww is a unique partitioning of ww into a sequence of non-increasing Lyndon words, i.e., w=L1e1L(w)e(w)w=L^{e_{1}}_{1}\cdots L^{e_{\ell(w)}}_{\ell(w)} where each Li(i[1,(w)])L_{i}\leavevmode\nobreak\ (i\in[1,\ell(w)]), which we will call Lyndon factors of ww, is a Lyndon word and Li>Li+1L_{i}>L_{i+1} for i[1,(w))i\in[1,\ell(w)). It is known that any occurrence of a Lyndon word in ww must be a substring of a Lyndon factor of ww.

The ω\omega-order <ω<_{\omega} between primitive strings or same length strings x,yx,y is defined as x<ωyx<yx<_{\omega}y\iff x^{\infty}<y^{\infty}.222As we will not be comparing non-primitive strings of different lengths in this paper, the definition here is a simplified version of the original. The ω\omega-order can be different from the standard lexicographic order, but are identical when comparing strings of the same length or when comparing Lyndon words.

For a string x[0..|x|)x[0..|x|), let 𝗋𝖺𝗇𝗄c(i,x)=|x[0..i)|c\mathsf{rank}_{c}(i,x)=|x[0..i)|_{c}, i.e., the number of symbols cc in x[0..i)x[0..i). Let ρ(x)\rho(x) denote the size of the run-length encoding of xx, i.e., the maximal number of same symbol runs in xx. Let ρc(x)\rho_{c}(x) denote the number of runs of symbol cc in the run-length encoding of xx.

2.1 Repetitiveness Measures

A set of positions Γ\Gamma is a string attractor [15] of ww if any substring of ww has an occurrence in ww that covers a position in Γ\Gamma. The size of the smallest string attractor of ww is denoted by γ(w)\gamma(w). The measure δ\delta [16] is defined as maxk[1,|w|]Sk/k\max_{k\in[1,|w|]}S_{k}/k, where SkS_{k} is the number of distinct length-kk substrings of ww.

The Burrows-Wheeler transform (BWT) [5] 𝖡𝖶𝖳(w)\mathsf{BWT}(w) of a string ww is defined as the sequence of last (or equivalently, previous, in the cyclic sense) symbols of all rotations of ww, in lexicographic order of the rotations. The size of the run-length encoding of 𝖡𝖶𝖳(w)\mathsf{BWT}(w) will be denoted by r(w)r(w), i.e., r(w)=ρ(𝖡𝖶𝖳(w))r(w)=\rho(\mathsf{BWT}(w)). The bijective BWT (BBWT) [11] 𝖡𝖡𝖶𝖳(w)\mathsf{BBWT}(w) of a string ww is defined as the sequence of last (or again, previous, in the cyclic sense) symbols of all the rotations of all the Lyndon factors of ww, in ω\omega-order of the rotations. Actually, BWT can be understood as a special case of BBWT: 𝖡𝖶𝖳(w)=𝖡𝖡𝖶𝖳(𝗋𝗈𝗍(w))\mathsf{BWT}(w)=\mathsf{BBWT}(\mathsf{rot}^{*}(w)) because 𝗋𝗈𝗍(w)=Le\mathsf{rot}^{*}(w)=L^{e} for some Lyndon word LL and integer e=|w|/|L|e=|w|/|L|, and the ω\omega-order is equivalent to lexicographic order in this case since all the compared strings are of the same length. The size of the run-length encoding of 𝖡𝖡𝖶𝖳(w)\mathsf{BBWT}(w) will be denoted by rB(w)r_{B}(w), i.e., rB(w)=ρ(𝖡𝖡𝖶𝖳(w))r_{B}(w)=\rho(\mathsf{BBWT}(w)).

The inverse transform of 𝖡𝖶𝖳\mathsf{BWT} and 𝖡𝖡𝖶𝖳\mathsf{BBWT} on a string xx can be defined by the LF mapping, which is a function Ψx(i)=j\Psi_{x}(i)=j over positions of xx where c=x[i]=s[j]c=x[i]=s[j], 𝗋𝖺𝗇𝗄c(i,x)=𝗋𝖺𝗇𝗄c(j,s)\mathsf{rank}_{c}(i,x)=\mathsf{rank}_{c}(j,s), and ss is the string obtained by sorting the multiset of symbols of xx in increasing order. Ψx\Psi_{x} is a permutation and thus forms cycles on the set of positions of xx. A cycle (i,Ψ1(i),,Ψk1(i))(i,\Psi^{1}(i),\cdots,\Psi^{k-1}(i)) where kk is the smallest positive integer such that Ψk(i)=i\Psi^{k}(i)=i, corresponds to a (cyclic) string x[Ψk1(i)]x[Ψ1(i)]x[i]x[\Psi^{k-1}(i)]\cdots x[\Psi^{1}(i)]x[i]. This string is always primitive, and by concatenating, in non-increasing order, all the Lyndon rotations of the strings corresponding to all cycles, it can be shown that 𝖡𝖡𝖶𝖳1(x)\mathsf{BBWT}^{-1}(x) is obtained. Ψx\Psi_{x} can be interpreted as returning, given the ω\omega-order rank of a given rotation of a cycle, the ω\omega-order rank of the previous rotation of the cycle. Note that when xx is a 𝖡𝖶𝖳\mathsf{BWT} image of a primitive string, Ψx\Psi_{x} consists of only a single cycle and that although the Lyndon rotation will give the string ww for which 𝖡𝖡𝖶𝖳(w)=x\mathsf{BBWT}(w)=x, we have 𝖡𝖶𝖳(w)=x\mathsf{BWT}(w^{\prime})=x for any rotation ww^{\prime} of ww.

A Bidirectional Macro Scheme (BMS) [23] of a string ww is a partitioning of ww into phrases, where each phrase is either a single symbol, or is a substring that has an occurrence elsewhere in ww which we call the source, or the reference of the phrase. The references of the phrases must be such that the induced reference for each position in the phrases are acyclic, i.e., the referencing on the position forms a forest where the roots are the positions corresponding to single symbol phrases. The size of the smallest bidirectional macro scheme for ww is denoted by b(w)b(w).

The LZ77 factorization [25] of a string ww is a BMS of ww where all references are left-referencing, i.e., they must point to a smaller position, and the phrases are determined greedily from left-to-right. It is known that LZ77 is the smallest among left-referencing BMS. The size of LZ77 of ww is denoted by z(w)z(w).

The lex-parse [22] of a string ww is a BMS of ww where all references point to a rotation of smaller lexicographic (or equivalently ω\omega-order) rotation, and the phrases are determined greedily from left-to-right. Similarly, it is known that lex-parse is the smallest among BMS with such constraint. The size of lex-parse of ww is denoted by v(w)v(w).

2.2 Fibonacci Words

Since we will later use Fibonacci words or their slight modifications to show some of our results, we introduce them here.

The Fibonacci words are defined recursively as: F0=𝚋F_{0}=\mathtt{b}, F1=𝚊F_{1}=\mathtt{a}, and for any integer i2i\geq 2, Fi=Fi1Fi2F_{i}=F_{i-1}F_{i-2}. Fibonacci words can also be defined via a morphism ϕ\phi defined as ϕ(𝚊)=𝚊𝚋\phi(\mathtt{a})=\mathtt{ab}, and ϕ(𝚋)=𝚊\phi(\mathtt{b})=\mathtt{a}, and Fi=ϕi(𝚋)F_{i}=\phi^{i}(\mathtt{b}) for all i0i\geq 0. The lengths of the Fibonacci words correspond to the Fibonacci sequence, i.e., f0=f1=1f_{0}=f_{1}=1, and fi=fi1+fi2f_{i}=f_{i-1}+f_{i-2}. For technical reasons, we define fi=0f_{i}=0 for i<0i<0. The observation below follows from a simple induction. {observation} For any i1i\geq 1, |Fi|𝚊=fi1|F_{i}|_{\mathtt{a}}=f_{i-1}, and |Fi|𝚋=fi2|F_{i}|_{\mathtt{b}}=f_{i-2}.

3 New results for rr vs rBr_{B}

3.1 Strings family giving r(w)/rB(w)=Ω(logn)r(w)/r_{B}(w)=\Omega(\log n)

Here, we answer an open question raised by Badkobeh et al. [1], and show that there exists an infinite family of strings such that rBr_{B} is asymptotically strictly smaller than rr.

Theorem 3.1.

There exists an infinite family of strings such that r=Ω(rBlogn)r=\Omega(r_{B}\log n).

Proof 3.2.

Consider string wk=𝚋F2kw_{k}=\mathtt{b}F^{*}_{2k}, where F2k=𝗋𝗈𝗍(F2k)F^{*}_{2k}=\mathsf{rot}^{*}(F_{2k}). Then, for any k3k\geq 3, rB(wk)=3r_{B}(w_{k})=3 (Lemma 3.3), and r(wk)=2kr(w_{k})=2k (Corollary 3.8).

Lemma 3.3.

Let wk=𝚋F2kw_{k}=\mathtt{b}F^{*}_{2k}, where F2k=𝗋𝗈𝗍(F2k)F^{*}_{2k}=\mathsf{rot}^{*}(F_{2k}). Then, rB(wk)=3r_{B}(w_{k})=3 for any k3k\geq 3.

Proof 3.4.

The Lyndon factorization of wkw_{k} results in the factors 𝚋\mathtt{b} and F2kF^{*}_{2k}, and it is known that 𝖡𝖡𝖶𝖳(F2k)=𝖡𝖶𝖳(F2k)=𝚋f2k2𝚊f2k1\mathsf{BBWT}(F^{*}_{2k})=\mathsf{BWT}(F^{*}_{2k})=\mathtt{b}^{f_{2k-2}}\mathtt{a}^{f_{2k-1}}. Since 𝚋\mathtt{b} is greater, in ω\omega-order, than any rotation of F2kF^{*}_{2k}, we have 𝖡𝖡𝖶𝖳(wk)=𝚋f2k2𝚊f2k1𝚋\mathsf{BBWT}(w_{k})=\mathtt{b}^{f_{2k-2}}\mathtt{a}^{f_{2k-1}}\mathtt{b} and rB(wk)=3r_{B}(w_{k})=3.

The rest of the proof will focus on showing r(wk)=2kr(w_{k})=2k. We show two proofs, one that relies heavily on previous results, and the other an alternate direct proof.

3.1.1 Proof for r(wk)=2kr(w_{k})=2k

Actually, it turns out that wkw_{k} is a rotation of the strings whose BWT are shown to have 2k2k runs in Proposition 3 of [12] and in Proposition 3 of [13]. The former defines it as v2k𝚋v_{2k}\mathtt{b} for v2k=F2kRv_{2k}=F_{2k}^{R}. The latter defines it as inserting 𝚋\mathtt{b} at position f2k12f_{2k-1}-2 of F2kF_{2k}, and mentions that it is a rotation of the former string. Our proof of r(wk)=2kr(w_{k})=2k further connects these strings with the lexicographically smallest rotation F2kF^{*}_{2k} of F2kF_{2k}.

The following is known:

Theorem 3.5 (Theorem 1 in [7]).

The rotation of FnF_{n} with rank ρ\rho in the lexicographically sorted list of all the rotations of FnF_{n}, for n2n\geq 2, ρ[0..fn)\rho\in[0..f_{n}) is the rotation 𝗋𝗈𝗍i(Fn)\mathsf{rot}^{i}(F_{n}) where

i={(ρfn21)modfn if n odd((ρ+1)fn21)modfn if n even.i=\begin{cases}(\rho\cdot f_{n-2}-1)\bmod f_{n}&\mbox{ if $n$ odd}\\ (-(\rho+1)\cdot f_{n-2}-1)\bmod f_{n}&\mbox{ if $n$ even}.\end{cases}

Therefore, we have

Corollary 3.6.

For every integer k1k\geq 1, F2k=𝗋𝗈𝗍f2k11(F2k)F^{*}_{2k}=\mathsf{rot}^{f_{2k-1}-1}(F_{2k}).

Proof 3.7.

From Theorem 3.5 with ρ=0\rho=0, F2k=𝗋𝗈𝗍i(F2k)F^{*}_{2k}=\mathsf{rot}^{i}(F_{2k}) where i=(f2k2+1)f2k11(modf2k)i=-(f_{2k-2}+1)\equiv f_{2k-1}-1\pmod{f_{2k}}.

which means that inserting a 𝚋\mathtt{b} at position f2k12f_{2k-1}-2 of F2kF_{2k} is a rotation of 𝚋F2k\mathtt{b}F^{*}_{2k}, leading to:

Corollary 3.8.

Let wk=𝚋F2kw_{k}=\mathtt{b}\cdot F^{*}_{2k}, where F2k=𝗋𝗈𝗍(F2k)F^{*}_{2k}=\mathsf{rot}^{*}(F_{2k}). Then, r(wk)=2kr(w_{k})=2k for any k3k\geq 3.

3.1.2 Alternate proof for r(wk)=2kr(w_{k})=2k

We also give an alternate direct proof based on morphisms, which is perhaps slightly simpler compared to the proof for r(v2k𝚋)=2kr(v_{2k}\mathtt{b})=2k presented in [12] which relies on additional results on special factors of standard words [4]. Results on morphisms and their effects on rr have been studied by Fici et al. [9], but to the best of our knowledge, their results do not directly apply to our case.

We use a string morphism θ\theta defined as: θ(𝚊)=𝚊𝚊𝚋\theta(\mathtt{a})=\mathtt{aab}, θ(𝚋)=𝚊𝚋\theta(\mathtt{b})=\mathtt{ab}. The following claim can be shown by a simple induction.

Claim 1 (Claim 5 in [20]).

For any string w{𝚊,𝚋}w\in\{\mathtt{a},\mathtt{b}\}^{*}, θ(w)=𝗋𝗈𝗍2(ϕ2(w))\theta(w)=\mathsf{rot}^{2}(\phi^{2}(w)).

To prove Theorem 3.1, we first show the following lemma.

Lemma 3.9.

For every integer k0k\geq 0, F2k=θk(𝚋)F^{*}_{2k}=\theta^{k}(\mathtt{b}).

Proof 3.10.

It clearly holds for k=0,1k=0,1. For k2k\geq 2, we show that θk(𝚋)=𝗋𝗈𝗍f2k11(F2k)\theta^{k}(\mathtt{b})=\mathsf{rot}^{f_{2k-1}-1}(F_{2k}) by induction on kk. We can see that the statement holds for k=2k=2 since θ2(𝚋)=𝚊𝚊𝚋𝚊𝚋=𝗋𝗈𝗍2(F4)\theta^{2}(\mathtt{b})=\mathtt{aabab}=\mathsf{rot}^{2}(F_{4}). Suppose that θk(𝚋)=𝗋𝗈𝗍f2k11(F2k)\theta^{k^{\prime}}(\mathtt{b})=\mathsf{rot}^{f_{2k^{\prime}-1}-1}(F_{2k^{\prime}}) holds for some k2k^{\prime}\geq 2. We have

θk+1(𝚋)\displaystyle\theta^{k^{\prime}+1}(\mathtt{b}) =θ(θk(𝚋))=θ(𝗋𝗈𝗍f2k11(F2k))\displaystyle=\theta(\theta^{k^{\prime}}(\mathtt{b}))=\theta(\mathsf{rot}^{f_{2k^{\prime}-1}-1}(F_{2k^{\prime}})) by induction hypothesis
=θ(𝗋𝗈𝗍1(F2k2F2k1))\displaystyle=\theta(\mathsf{rot}^{-1}(F_{2k^{\prime}-2}F_{2k^{\prime}-1}))
=𝗋𝗈𝗍2(ϕ2(𝗋𝗈𝗍1(F2k2F2k1))\displaystyle=\mathsf{rot}^{2}(\phi^{2}(\mathsf{rot}^{-1}(F_{2k^{\prime}-2}F_{2k^{\prime}-1})) by 1

Since the last symbol of F2kF_{2k^{\prime}} is F2k[f2k11]=𝚊F_{2k^{\prime}}[f_{2k^{\prime}-1}-1]=\mathtt{a} and |ϕ2(𝚊)|=3|\phi^{2}(\mathtt{a})|=3, we have

θk+1(𝚋)\displaystyle\theta^{k^{\prime}+1}(\mathtt{b}) =𝗋𝗈𝗍2(𝗋𝗈𝗍3(F2kF2k+1))\displaystyle=\mathsf{rot}^{2}(\mathsf{rot}^{-3}(F_{2k^{\prime}}F_{2k^{\prime}+1}))
=𝗋𝗈𝗍f2k+11(F2k+2)\displaystyle=\mathsf{rot}^{f_{2k^{\prime}+1}-1}(F_{2k^{\prime}+2})
=𝗋𝗈𝗍f2(k+1)11(F2(k+1))\displaystyle=\mathsf{rot}^{f_{2(k^{\prime}+1)-1}-1}(F_{2(k^{\prime}+1)})

Hence, the statement also holds for k=k+1k=k^{\prime}+1, and θk(𝚋)=𝗋𝗈𝗍f2k11(F2k)\theta^{k}(\mathtt{b})=\mathsf{rot}^{f_{2k-1}-1}(F_{2k}) for every k2k\geq 2. By combining it with Corollary 3.6, we obtain F2k=θk(𝚋)F^{*}_{2k}=\theta^{k}(\mathtt{b}).

We are ready to prove the following lemma.

Lemma 3.11.

Let wk=𝚋F2kw_{k}=\mathtt{b}F^{*}_{2k}, where F2k=𝗋𝗈𝗍(F2k)F^{*}_{2k}=\mathsf{rot}^{*}(F_{2k}). Then, r(wk)=2kr(w_{k})=2k for any k2k\geq 2.

Proof 3.12.

We first prove that r(yk)=2k+2r(y_{k})=2k+2 for yk=𝚌F2ky_{k}=\mathtt{c}F^{*}_{2k} for any k3k\geq 3. From Lemma 3.9, we have yk=𝚌F2k=𝚌θk(𝚋)y_{k}=\mathtt{c}F^{*}_{2k}=\mathtt{c}\theta^{k}(\mathtt{b}). Below, we extend the definition of θ\theta so that θ(𝚌)=𝚌\theta(\mathtt{c})=\mathtt{c}, so yk=θk(𝚌𝚋)y_{k}=\theta^{k}(\mathtt{cb}). We claim that

𝖡𝖶𝖳(yk)\displaystyle\mathsf{BWT}(y_{k}) =𝚌𝚋f2k2kj=1k(𝚊f2j2𝚋)\displaystyle=\mathtt{c}\mathtt{b}^{f_{2k-2}-k}\prod_{j=1}^{k}(\mathtt{a}^{f_{2j-2}}\mathtt{b}) (1)

We show this statement by induction on kk. If k=3k=3, y3=𝚌𝚊𝚊𝚋𝚊𝚊𝚋𝚊𝚋𝚊𝚊𝚋𝚊𝚋y_{3}=\mathtt{caabaababaabab} and 𝖡𝖶𝖳(y3)=𝚌𝚋2𝚊𝚋𝚊2𝚋𝚊5𝚋\mathsf{BWT}(y_{3})=\mathtt{cb}^{2}\mathtt{aba}^{2}\mathtt{ba}^{5}\mathtt{b} hold. Assume that the statement holds for some k3k^{\prime}\geq 3. We show the statement also holds for k=k+1k=k^{\prime}+1. We consider the three disjoint sets of the rotations of yk+1y_{k^{\prime}+1} as follows: the rotations (i) starting with 𝚊2\mathtt{a}^{2}, (ii) starting with 𝚊𝚋\mathtt{ab}, (iii) starting with 𝚋\mathtt{b} or 𝚌\mathtt{c}.

  1. (i)

    From the definition of θ\theta, the number of rotations starting with 𝚊2\mathtt{a}^{2} of yk+1y_{k^{\prime}+1} is f2k1(=|F2k|𝚊)f_{2k^{\prime}-1}\leavevmode\nobreak\ (=|F_{2k^{\prime}}|_{\mathtt{a}}). These rotations are the first f2k1f_{2k^{\prime}-1} rotations in the lexicographically increasingly sorted list of all rotations of yk+1y_{k^{\prime}+1}. By the definition of yky_{k}, the lexicographically smallest rotation is 𝗋𝗈𝗍(yk)\mathsf{rot}(y_{k^{\prime}}), and the preceding symbol of this rotation is 𝚌\mathtt{c}. The other rotations of this case are preceded by 𝚋\mathtt{b}. Hence, 𝖡𝖶𝖳(yk+1)[0..f2k11]=𝚌𝚋f2k11\mathsf{BWT}(y_{k^{\prime}+1})[0..f_{2k^{\prime}-1}-1]=\mathtt{cb}^{f_{2k^{\prime}-1}-1}.

  2. (ii)

    Let 𝗉𝗈𝗌\mathsf{pos} be the set of positions in yk+1y_{k^{\prime}+1} that are preceded by an occurrence of 𝚊𝚋\mathtt{ab}. From the definitions of yky_{k} and θ\theta, every occurrence of 𝚊𝚋\mathtt{ab} in yk+1y_{k^{\prime}+1} is a suffix of an occurrence of the substrings (𝚊𝚊𝚋\mathtt{aab} or 𝚊𝚋\mathtt{ab}) produced from θ\theta and an occurrence of 𝚊\mathtt{a} or 𝚋\mathtt{b} in yky_{k^{\prime}}. Therefore, we have |𝗉𝗈𝗌|=|yk|1=f2k|\mathsf{pos}|=|y_{k^{\prime}}|-1=f_{2k^{\prime}} and if s(i)\mathit{s}(i) denotes the position in yky_{k^{\prime}} that produces the corresponding symbol at position ii in yk+1y_{k^{\prime}+1} by the morphism θ\theta, then 𝗋𝗈𝗍i(yk+1)=θ(𝗋𝗈𝗍s(i)(yk))\mathsf{rot}^{i}(y_{k^{\prime}+1})=\theta(\mathsf{rot}^{\mathit{s}(i)}(y_{k^{\prime}})) for any i𝗉𝗈𝗌i\in\mathsf{pos}. Thus, for any positions i1,i2𝗉𝗈𝗌i_{1},i_{2}\in\mathsf{pos},

    𝗋𝗈𝗍i12(yk+1)<𝗋𝗈𝗍i22(yk+1)\displaystyle\mathsf{rot}^{i_{1}-2}(y_{k^{\prime}+1})<\mathsf{rot}^{i_{2}-2}(y_{k^{\prime}+1}) 𝗋𝗈𝗍i1(yk+1)<𝗋𝗈𝗍i2(yk+1)\displaystyle\iff\mathsf{rot}^{i_{1}}(y_{k^{\prime}+1})<\mathsf{rot}^{i_{2}}(y_{k^{\prime}+1})
    θ(𝗋𝗈𝗍s(i1)(yk))<θ(𝗋𝗈𝗍s(i2)(yk))\displaystyle\iff\theta(\mathsf{rot}^{\mathit{s}(i_{1})}(y_{k^{\prime}}))<\theta(\mathsf{rot}^{\mathit{s}(i_{2})}(y_{k^{\prime}}))
    𝗋𝗈𝗍s(i1)(yk)<𝗋𝗈𝗍s(i2)(yk).\displaystyle\iff\mathsf{rot}^{\mathit{s}(i_{1})}(y_{k^{\prime}})<\mathsf{rot}^{\mathit{s}(i_{2})}(y_{k^{\prime}}). (2)

    where the last relation follows from the fact that θ\theta is an order preserving morphism (i.e., s<tθ(s)<θ(t)s<t\Leftrightarrow\theta(s)<\theta(t) holds for any s,ts,t). We can also see that for any i𝗉𝗈𝗌i\in\mathsf{pos}, the symbol yk+1[i3]y_{k^{\prime}+1}[i-3] preceding the occurrence of 𝚊𝚋\mathtt{ab} is 𝚊\mathtt{a} (resp. 𝚋\mathtt{b}) iff yk[s(i)1]y_{k^{\prime}}[\mathit{s}(i)-1] is 𝚊\mathtt{a} (resp. 𝚋\mathtt{b}). Thus, together with Equation 2, the sequence of entries in 𝖡𝖶𝖳(yk+1)\mathsf{BWT}(y_{k^{\prime}+1}) corresponding to rotations that start with 𝚊𝚋\mathtt{ab} are equivalent to the sequence of entries in 𝖡𝖶𝖳(yk)\mathsf{BWT}(y_{k^{\prime}}) for rotations that start with 𝚊\mathtt{a} or 𝚋\mathtt{b}, i.e., 𝖡𝖶𝖳(yk+1)[f2k1..f2k+11]=𝖡𝖶𝖳(yk)[1..f2k]=𝚋f2k2kj=1k(𝚊f2j2𝚋)\mathsf{BWT}(y_{k^{\prime}+1})[f_{2k^{\prime}-1}..f_{2k^{\prime}+1}-1]=\mathsf{BWT}(y_{k^{\prime}})[1..f_{2k^{\prime}}]=\mathtt{b}^{f_{2k^{\prime}-2}-k^{\prime}}\prod_{j=1}^{k^{\prime}}(\mathtt{a}^{f_{2j-2}}\mathtt{b}). See also Figure 1.

  3. (iii)

    From the definition of θ\theta, the number of rotations starting with 𝚋\mathtt{b} in yk+1y_{k^{\prime}+1} is f2kf_{2k^{\prime}}, and the lexicographically largest rotation is yk+1y_{k^{\prime}+1} itself. These rotations are the last f2k+1f_{2k^{\prime}}+1 rotations in the lexicographically increasingly sorted list of all rotations of yk+1y_{k^{\prime}+1}. The preceding symbol of the largest rotation, or equivalently, the last symbol of yky_{k} is 𝚋\mathtt{b}. All rotations starting with 𝚋\mathtt{b} are preceded by 𝚊\mathtt{a}. Hence, 𝖡𝖶𝖳(yk+1)[f2k+1..f2k+2]=𝚊f2k𝚋\mathsf{BWT}(y_{k^{\prime}+1})[f_{2k^{\prime}+1}..f_{2k^{\prime}+2}]=\mathtt{a}^{f_{2k^{\prime}}}\mathtt{b}.

All together, we have

𝖡𝖶𝖳(yk+1)\displaystyle\mathsf{BWT}(y_{k^{\prime}+1}) =𝚌𝚋f2k11(𝗂)𝚋f2k2kj=1k(𝚊f2j2𝚋)(𝗂𝗂)𝚊f2k𝚋(𝗂𝗂𝗂)\displaystyle=\overbrace{\mathtt{cb}^{f_{2k^{\prime}-1}-1}\vphantom{\prod_{j}^{k^{\prime}}]}}^{\mathsf{(i)}}\cdot\overbrace{\mathtt{b}^{f_{2k^{\prime}-2}-k^{\prime}}\prod_{j=1}^{k^{\prime}}(\mathtt{a}^{f_{2j-2}}\mathtt{b})}^{\mathsf{(ii)}}\cdot\overbrace{\mathtt{a}^{f_{2k^{\prime}}}\mathtt{b}\vphantom{\prod_{j}^{k^{\prime}}]}}^{\mathsf{(iii)}}
=𝚌𝚋f2k(k+1)j=1k+1(𝚊f2j2𝚋)\displaystyle=\mathtt{c}\mathtt{b}^{f_{2k^{\prime}}-(k^{\prime}+1)}\prod_{j=1}^{k^{\prime}+1}(\mathtt{a}^{f_{2j-2}}\mathtt{b})

and the statement holds for k=k+1k=k^{\prime}+1, proving Equation 1. Therefore, r(yk)=2k+2r(y_{k})=2k+2. We note that Equation 1 holds for k=2k=2, but since f2k22=0f_{2k-2}-2=0, r(y2)=5r(y_{2})=5.

The last symbol of F2kF^{*}_{2k} is 𝚋\mathtt{b}, so changing the 𝚌\mathtt{c} in yky_{k} to 𝚋\mathtt{b} changes the unique occurrence of 𝚋𝚌\mathtt{bc} in all rotations of yky_{k} except for yky_{k} itself, to the unique occurrence of 𝚋𝚋\mathtt{bb} in all rotations of wkw_{k} except for wkw_{k} itself. Therefore, the lexicographic order of rotations of wkw_{k} are equivalent to those in yky_{k}, except for wkw_{k} itself. We can see that 𝖡𝖶𝖳(wk)\mathsf{BWT}(w_{k}) will have the following changes compared to 𝖡𝖶𝖳(yk)\mathsf{BWT}(y_{k}): 1) the lexicographically smallest rotation 𝗋𝗈𝗍(wk)\mathsf{rot}(w_{k}) is preceded by 𝚋\mathtt{b} instead of 𝚌\mathtt{c}, and 2) wkw_{k} is the lexicographically smallest rotation that starts with 𝚋\mathtt{b} (rather than 𝚌\mathtt{c}). 1) changes 𝚌\mathtt{c} to 𝚋\mathtt{b} and thus decreases the run-length by 11. 2) moves the last 𝚋\mathtt{b} right after the second to last 𝚋\mathtt{b} in 𝖡𝖶𝖳(yk)\mathsf{BWT}(y_{k}) (the preceding symbol of the largest rotation starting with 𝚊\mathtt{a}) and thus decreases the run-length by 11. Therefore, in total, r(wk)=2kr(w_{k})=2k which can be confirmed to hold for k=2k=2 as well.

Refer to caption
Figure 1: Illustration of the proof of Lemma 3.11. Every position in the set 𝗉𝗈𝗌\mathsf{pos} is preceded by an occurrence of 𝚊𝚋\mathtt{ab}. We can obtain the lexicographic rank of the rotation 𝗋𝗈𝗍i2(yk+1)\mathsf{rot}^{i-2}(y_{k^{\prime}+1}) that starts with 𝚊𝚋\mathtt{ab} by using the corresponding rotation 𝗋𝗈𝗍s(i)(yk)\mathsf{rot}^{s(i)}(y_{k^{\prime}}). The preceded symbols (underlined symbols) of yk+1[i3]y_{k^{\prime}+1}[i-3] and yk[s(i)1]y_{k^{\prime}}[\mathit{s}(i)-1] are the always same. The figure shows the case of yk+1[j3]=yk[s(j)1]=𝚊y_{k^{\prime}+1}[j-3]=y_{k^{\prime}}[\mathit{s}(j)-1]=\mathtt{a} by the position ii and the case of yk+1[j3]=yk[s(j)1]=𝚋y_{k^{\prime}+1}[j-3]=y_{k^{\prime}}[\mathit{s}(j)-1]=\mathtt{b} by the position jj.

3.2 Bounds for r(w)/rB(w)r(w)/r_{B}(w)

We show poly-logarithmic upper and lower bounds on the ratio r(w)/rB(w)r(w)/r_{B}(w) for any string.

We first show a simple upper bound on the multiplicative rotation sensitivity of zz.

Lemma 3.13.

For any strings w,ww,w^{\prime} of length nn in the same conjugacy class, z(w)=O(z(w)logn)z(w^{\prime})=O(z(w)\log n).

Proof 3.14.

It is easy to see that b(w)=Θ(b(w))b(w^{\prime})=\Theta(b(w)). This is because, given any BMS of ww of size bb, we can reuse the parsing and referencing, except for the following changes, to construct a BMS for ww^{\prime} of size 2b+12b+1: (1) if ww^{\prime} starts in the middle of a BMS phrase of ww, we split the phrase, and (2) if a phrase references a substring of ww but is split in ww^{\prime}, we split the phrase. Since only one of the phrases of the split phrase in (1) is split by (2), the total number of phrases is at most 2b+12b+1.

Since b(w)z(w)=O(b(w)logn)b(w)\leq z(w)=O(b(w)\log n) and b(w)z(w)=O(b(w)logn)b(w^{\prime})\leq z(w^{\prime})=O(b(w^{\prime})\log n), we have z(w)/z(w)=O(logn)z(w^{\prime})/z(w)=O(\log n).

Lemma 3.15.

For any strings w,ww,w^{\prime} of length nn in the same conjugacy class, rB(w)=O(rB(w)log4n)r_{B}(w^{\prime})=O(r_{B}(w)\log^{4}n).

Proof 3.16.

Badkobeh et al. [1] showed (1) b(w)=O(rB(w))b(w)=O(r_{B}(w)), which implies z(w)=O(rB(w)logn)z(w)=O(r_{B}(w)\log n), and (2) rB(w)=O(z(w)log2n)r_{B}(w)=O(z(w)\log^{2}n). Therefore, rB(w)/rB(w)=O(z(w)/z(w)log3n)=O(log4n)r_{B}(w^{\prime})/r_{B}(w)=O(z(w^{\prime})/z(w)\log^{3}n)=O(\log^{4}n) from Lemma 3.13.

Corollary 3.17.

For any string ww of length nn, rB(w)=O(r(w)log4n)r_{B}(w)=O(r(w)\log^{4}n) and r(w)=O(rB(w)log4n)r(w)=O(r_{B}(w)\log^{4}n).

4 Clustering effects of 𝖡𝖶𝖳\mathsf{BWT} and 𝖡𝖡𝖶𝖳\mathsf{BBWT}

4.1 In terms of ρ\rho

The clustering effect of 𝖡𝖶𝖳\mathsf{BWT} was measured by the length ρ\rho of the run-length encoding. Namely, the following statement was claimed by Mantaci et al. [19].

Theorem 4.1 (Theorem 3.3 in [19]).

For any string ww, ρ(𝖡𝖶𝖳(w))2ρ(w)\rho(\mathsf{BWT}(w))\leq 2\rho(w).

While the statement is true, we believe there is a non-trivial case that was not covered in their original proof. Here, we address this case and also show that the same statement holds for 𝖡𝖡𝖶𝖳\mathsf{BBWT}. Note that the following Theorem 4.2 implies Theorem 4.1 because 𝖡𝖶𝖳(w)=𝖡𝖡𝖶𝖳(𝗋𝗈𝗍(w))\mathsf{BWT}(w)=\mathsf{BBWT}(\mathsf{rot}^{*}(w)) and ρ(𝗋𝗈𝗍(w))ρ(w)\rho(\mathsf{rot}^{*}(w))\leq\rho(w).

Theorem 4.2.

For any string ww, ρ(𝖡𝖡𝖶𝖳(w))2ρ(w)\rho(\mathsf{BBWT}(w))\leq 2\rho(w).

Proof 4.3.

We first recall the arguments of [19] for BWT. Consider a range [i..j][i..j] of x=𝖡𝖶𝖳(w)x=\mathsf{BWT}(w) that correspond to rotations of ww that start with a symbol cΣc\in\Sigma. Then, x[i..j]x[i..j] will consist of |w|cρc(w)|w|_{c}-\rho_{c}(w) occurrences of cc’s that correspond to those preceding a cc in the same run of cc’s in ww, and ρc(w)\rho_{c}(w) occurrence of symbols not equal to cc that precede a maximal run of cc’s in ww. Thus, ρ(x[i..j])\rho(x[i..j]) is maximized when the ρc(w)\rho_{c}(w) non-cc symbols in x[i..j]x[i..j] are not adjacent to each other. Mantaci et al. argued that this implies ρ(x[i..j])2ρc(w)\rho(x[i..j])\leq 2\rho_{c}(w) summing up to 2ρ(w)2\rho(w) for all cc. However, they did not give a reason to why ρ(x[i..j])\rho(x[i..j]) could not be 2ρc(w)+12\rho_{c}(w)+1 (summing up to ρ(w)+σ\rho(w)+\sigma), which is possibly the case when x[i..j]x[i..j] starts and ends with the symbol cc. We show that in such a case, an LF mapping with a single cycle cannot be defined, and ultimately, ρ(x(w)[i..j])2ρc(w)\rho(x(w)[i..j])\leq 2\rho_{c}(w) holds.

Let w[i..j]w[i^{\prime}..j^{\prime}] be a maximal run of symbol cc in ww and let kk be such that x[k]x[k] corresponds to w[j]w[j^{\prime}], i.e., kk is the ω\omega-order rank of 𝗋𝗈𝗍j+1(w)\mathsf{rot}^{j^{\prime}+1}(w). Since w[j+1]cw[j^{\prime}+1]\neq c, k[i..j]k\not\in[i^{\prime}..j^{\prime}]. Also, since w[i]==w[j]=cw[i]=\cdots=w[j]=c, the ω\omega-order ranks k,Ψx(k),,Ψxji+1(k)k,\Psi_{x}(k),\ldots,\Psi^{j^{\prime}-i^{\prime}+1}_{x}(k), corresponding respectively to those of 𝗋𝗈𝗍j+1(w),,𝗋𝗈𝗍i(w)\mathsf{rot}^{j^{\prime}+1}(w),\ldots,\mathsf{rot}^{i^{\prime}}(w), are in [i..j][i..j] (except kk) and monotonic, i.e., either k<iΨx(k)<<Ψxji+1(k)jk<i\leq\Psi_{x}(k)<\cdots<\Psi^{j^{\prime}-i^{\prime}+1}_{x}(k)\leq j or k>jΨx(k)>>Ψxji+1(k)ik>j\geq\Psi_{x}(k)>\cdots>\Psi^{j^{\prime}-i^{\prime}+1}_{x}(k)\geq i holds. Also, x[Ψxji+1(k)]=w[i1]cx[\Psi^{j^{\prime}-i^{\prime}+1}_{x}(k)]=w[i^{\prime}-1]\neq c. On the other hand, for any p,qp,q s.t. ip<qji\leq p<q\leq j and x[p]=x[q]=cx[p]=x[q]=c, Ψx(p)<Ψx(q)\Psi_{x}(p)<\Psi_{x}(q) must hold implying that for any such p,qp,q, it cannot be that pp is part of a monotonically decreasing cycle and qq is part of a monotonically increasing cycle. Since each of the above monotonic sequences end at a non-cc position in x[i..j]x[i..j], there must be some position k[i..j]k^{\prime}\in[i..j] such that Ψx(k)=k\Psi_{x}(k^{\prime})=k^{\prime}, and Ψx\Psi_{x} cannot be defined to cover all positions in [i..j][i..j] in a single cycle. Therefore ρ(x[i..j])2ρc(w)\rho(x[i..j])\leq 2\rho_{c}(w) holds.

For x=𝖡𝖡𝖶𝖳(w)x=\mathsf{BBWT}(w), similarly consider the range [i..j][i..j] of 𝖡𝖡𝖶𝖳(w)\mathsf{BBWT}(w) such that the corresponding rotations of the Lyndon factors of ww start with the symbol cΣc\in\Sigma. Since it is known that Lyndon factors of ww (except for single symbol factors) must start and end at run-boundaries of ww [10], a maximal run of cc’s in ww is either a substring of a Lyndon factor of ww, or a maximal run of a single symbol Lyndon factors. If it is a substring of a Lyndon factor but not a prefix, then there is one non-cc symbol per such run that precedes the run and occurs in x[i..j]x[i..j]. If it is a prefix of a Lyndon factor longer than 11, then the previous symbol (or the last symbol of the Lyndon factor) is non-cc due to the Lyndon factor not having a proper border. Otherwise, the maximal run is a maximal run of single symbol Lyndon factors cc, in which case the previous symbol for these occurring in x[i..j]x[i..j] will also be cc. Thus, the arguments in the previous paragraph for BWT hold for BBWT as well, except for the last part: it can be that x[i]=x[j]=cx[i]=x[j]=c and Ψ(k)=k\Psi(k^{\prime})=k^{\prime} for some k[i,j]k^{\prime}\in[i,j]. However, this implies that this corresponds to a single symbol Lyndon factor which would not have introduced a non-cc symbol in x[i..j]x[i..j]. Therefore ρ(x[i..j])2ρc(w)\rho(x[i..j])\leq 2\rho_{c}(w) still holds.

4.2 In terms of other repetitiveness measures

We consider measuring the clustering effect of 𝖡𝖶𝖳\mathsf{BWT} and 𝖡𝖡𝖶𝖳\mathsf{BBWT} other than ρ\rho. Here, we show that 𝖡𝖶𝖳\mathsf{BWT} or 𝖡𝖡𝖶𝖳\mathsf{BBWT} can only increase the repetitiveness of the string by a polylogarithmic factor.

Theorem 4.4.

For repetitiveness measure μ{δ,γ,b,v,z,grl}\mu\in\{\delta,\gamma,b,v,z,g_{\mathrm{rl}}\} and any string ww of length nn, it holds that μ(𝖡𝖶𝖳(w))=O(μ(w)log2n)\mu(\mathsf{BWT}(w))=O(\mu(w)\log^{2}n). Moreover, μ(𝖡𝖡𝖶𝖳(w))=O(μ(w)log2n)\mu(\mathsf{BBWT}(w))=O(\mu(w)\log^{2}n) for μ{z,grl}\mu\in\{z,g_{\mathrm{rl}}\}, and μ(𝖡𝖡𝖶𝖳(w))=O(μ(w)log6n)\mu(\mathsf{BBWT}(w))=O(\mu(w)\log^{6}n) for μ{δ,γ,b,v}\mu\in\{\delta,\gamma,b,v\}.

Proof 4.5.

The proof for 𝖡𝖶𝖳\mathsf{BWT} follows from a simple observation that for any string ww, μ(w)=O(ρ(w))\mu(w)=O(\rho(w)) implying μ(𝖡𝖶𝖳(w))=O(ρ(𝖡𝖶𝖳(w))\mu(\mathsf{BWT}(w))=O(\rho(\mathsf{BWT}(w)). Since r(w)=ρ(𝖡𝖶𝖳(w))=O(δ(w)log2n)r(w)=\rho(\mathsf{BWT}(w))=O(\delta(w)\log^{2}n), we have μ(𝖡𝖶𝖳(w))=O(δ(w)log2n)=O(μ(w)log2n)\mu(\mathsf{BWT}(w))=O(\delta(w)\log^{2}n)=O(\mu(w)\log^{2}n), where the last relation follows from the fact that δ\delta lower bounds all the repetitiveness measures considered.

For 𝖡𝖡𝖶𝖳\mathsf{BBWT}, μ(𝖡𝖡𝖶𝖳(w))=O(ρ(𝖡𝖡𝖶𝖳(w)))\mu(\mathsf{BBWT}(w))=O(\rho(\mathsf{BBWT}(w))) holds, but only rB(w)=ρ(𝖡𝖡𝖶𝖳(w))=O(z(w)log2n)r_{B}(w)=\rho(\mathsf{BBWT}(w))=O(z(w)\log^{2}n) has been proved, from which rB(w)=O(grl(w)log2n)r_{B}(w)=O(g_{\mathrm{rl}}(w)\log^{2}n) and the statement follows for μ{z,grl}\mu\in\{z,g_{\mathrm{rl}}\}. For the other measures, Corollary 3.17 gives us rB(w)=O(r(w)log4n)=O(δ(w)log6n)r_{B}(w)=O(r(w)\log^{4}n)=O(\delta(w)\log^{6}n) from which the statement follows.

4.3 A family of significantly “clustered” strings by BBWT

Finally, we show an infinite family of strings for which BBWT exhibits an asymptotically maximal clustering effect in terms of δ,γ,b,v,r\delta,\gamma,b,v,r, and slightly weaker in terms of z,grl,gz,g_{\mathrm{rl}},g. Namely, we consider the family of strings whose BBWT image are Fibonacci words.

Theorem 4.6.

Let ww be a string where 𝖡𝖡𝖶𝖳(w)\mathsf{BBWT}(w) is a Fibonacci word of length nn. Then, for μ{δ,γ,b,v,r}\mu\in\{\delta,\gamma,b,v,r\}, μ(𝖡𝖡𝖶𝖳(w))=O(lognnμ(w))\mu(\mathsf{BBWT}(w))=O(\frac{\log n}{n}\mu(w)) and for μ{z,grl,g}\mu\in\{z,g_{\mathrm{rl}},g\}, μ(𝖡𝖡𝖶𝖳(w))=O(log2nnμ(w))\mu(\mathsf{BBWT}(w))=O(\frac{\log^{2}n}{n}\mu(w)).

In order to understand ww, we first introduce the tools we use to characterize the LF mapping ΨFk\Psi_{F_{k}} on Fibonacci words.

The Zeckendorf representation [24] of a non-negative integer is a unique (sub-)set of distinct non-consecutive Fibonacci numbers {fkk1}\{f_{k}\mid k\geq 1\} that sum up to the integer. We represent a non-negative integer ii as a bit string Z(i)Z(i), where i=Z(i)=j0Z(i)[j]fj+1i=||Z(i)||=\sum_{j\geq 0}Z(i)[j]\cdot f_{j+1}. We use Zk(i)Z_{k}(i) to denote the length-kk prefix Z(i)[0..k1]Z(i)[0..k-1] of Z(i)Z(i). Note that for any i[0,fk)i\in[0,f_{k}), it suffices that a subset of {f1,,fk1}\{f_{1},\dots,f_{k-1}\} is used, and thus Z(i)Z(i) requires only up to (k1)(k-1) bits, i.e., 11’s will only occur in Z(i)[0..k2]Z(i)[0..k-2], and the rest Z(i)[k1..]Z(i)[k-1..] will be 0.

We observe that length i0i\geq 0 prefix of a Fibonacci word can be uniquely factorized into a sequence of distinct non-consecutive Fibonacci words in order of decreasing length, where the length of each Fibonacci word corresponds to an element in the Zeckendorf representation of ii.

Lemma 4.7.

For any i0i\geq 0,

F[0..i)\displaystyle F_{\infty}[0..i) =j=k20Fj+1Z(i)[j]=Fk1Z(i)[k2]F1Z(i)[0],\displaystyle=\prod_{j={k-2}}^{0}F_{j+1}^{Z(i)[j]}=F_{k-1}^{Z(i)[k-2]}\cdots F_{1}^{Z(i)[0]}, (3)

where kk is the smallest integer such that i<fki<f_{k}.

Proof 4.8.

Consider a greedy factorization of F[0..i)F_{\infty}[0..i) that takes the longest prefix of the remaining string that is a Fibonacci word. Let kk be the smallest integer such that ifki\leq f_{k}. If i=fki=f_{k} then we simply take FkF_{k} as the last element. Otherwise, we take Fk1F_{k-1}. Notice that in this case, the remaining string is F[fk1..i)=Fk[fk1..i)=Fk2[0..ifk1)F_{\infty}[f_{k-1}..i)=F_{k}[f_{k-1}..i)=F_{k-2}[0..i-f_{k-1}) of length i=ifk1<fk2i^{\prime}=i-f_{k-1}<f_{k-2}, and we repeat the process to find a greedy factorization of Fk2[0..i)F_{k-2}[0..i^{\prime}). The remaining string is a proper prefix of Fk2F_{k-2} and thus Fk2F_{k-2} will not be chosen next, implying that consecutive Fibonacci words are not chosen. Thus, the lengths of the sequence of strings will be a set of non-consecutive set of distinct Fibonacci numbers that sum up to ii. The lemma follows from the uniqueness of the Zeckendorf representation of ii.

The following relation about the least significant bit in the Zeckendorf representation and the iith symbol in the (infinite) Fibonacci word is known.

Lemma 4.9 (Problem 6 in [8]).

For i=0,1,i=0,1,\dots,

F[i]\displaystyle F_{\infty}[i] ={𝚊if Z(i)[0]=0,𝚋if Z(i)[0]=1.\displaystyle=\begin{cases}\mathtt{a}&\mbox{if }Z(i)[0]=0,\\ \mathtt{b}&\mbox{if }Z(i)[0]=1.\end{cases}

Now, we make observations on the LF-mapping for Fibonacci words.

Lemma 4.10.

The LF-mapping function for the Fibonacci word FkF_{k} is, for any 0i<fk0\leq i<f_{k},

ΨFk(i)={𝗋𝖺𝗇𝗄𝚊(i,Fk)if Fk[i]=𝚊,fk1+𝗋𝖺𝗇𝗄𝚋(i,Fk)if Fk[i]=𝚋.\Psi_{F_{k}}(i)=\begin{cases}\mathsf{rank}_{\mathtt{a}}(i,F_{k})&\mbox{if }F_{k}[i]=\mathtt{a},\\ f_{k-1}+\mathsf{rank}_{\mathtt{b}}(i,F_{k})&\mbox{if }F_{k}[i]=\mathtt{b}.\end{cases}
Proof 4.11.

Straightforward from the definition of ΨFk\Psi_{F_{k}} and Section 2.2.

The next lemma shows that the LF mappings on FkF_{k} can be interpreted as rotation operations on the Zeckendorf representation of the position to be mapped.

Lemma 4.12.

For any i[0,fk)i\in[0,f_{k}),

Zk(ΨFk(i))\displaystyle Z_{k}(\Psi_{F_{k}}(i)) ={𝗋𝗈𝗍(Zk(i))if Fk[i]=𝚊,𝗋𝗈𝗍2(Zk(i))if Fk[i]=𝚋.\displaystyle=\begin{cases}\mathsf{rot}(Z_{k}(i))&\text{if $F_{k}[i]=\mathtt{a}$},\\ \mathsf{rot}^{2}(Z_{k}(i))&\text{if $F_{k}[i]=\mathtt{b}$.}\end{cases}
Proof 4.13.

Let j=ΨFk(i)j=\Psi_{F_{k}}(i). Since i,j[0,fk)i,j\in[0,f_{k}), the most significant bits of Zk(i)Z_{k}(i) and Zk(j)Z_{k}(j), corresponding to fkf_{k}, are always zero, i.e., Zk(i)[k1]=Zk(j)[k1]=0Z_{k}(i)[k-1]=Z_{k}(j)[k-1]=0.

From Lemma 4.9, Fk[i]=𝚊F_{k}[i]=\mathtt{a} implies that Zk(i)[0]=0Z_{k}(i)[0]=0. Thus, 𝗋𝗈𝗍(Zk(i))=Zk(i)[1..k1]0\mathsf{rot}(Z_{k}(i))=Z_{k}(i)[1..k-1]\cdot 0. Therefore, we have

ΨFk(i)\displaystyle\Psi_{F_{k}}(i) =𝗋𝖺𝗇𝗄𝚊(i,Fk)\displaystyle=\mathsf{rank}_{\mathtt{a}}(i,F_{k}) by Lemma 4.10
=|j=k20Fj+1Z(i)[j]|𝚊\displaystyle=\left|\prod_{j=k-2}^{0}F_{j+1}^{Z(i)[j]}\right|_{\mathtt{a}} by Lemma 4.7
=j=0k2Zk(i)[j]fj=j=0k1Zk(i)[j]fj\displaystyle=\sum_{j=0}^{k-2}Z_{k}(i)[j]\cdot f_{j}=\sum_{j=0}^{k-1}Z_{k}(i)[j]\cdot f_{j}
=Zk(i)[0]fk+j=0k2Zk(i)[j+1]fj+1\displaystyle=Z_{k}(i)[0]\cdot f_{k}+\sum_{j=0}^{k-2}Z_{k}(i)[j+1]\cdot f_{j+1}
=j=0k1Zk(i)[(j+1)modk]fj+1\displaystyle=\sum_{j=0}^{k-1}Z_{k}(i)[(j+1)\bmod k]\cdot f_{j+1}
=𝗋𝗈𝗍(Zk(i))\displaystyle=||\mathsf{rot}(Z_{k}(i))||

On the other hand, again from Lemma 4.9, Fk[i]=𝚋F_{k}[i]=\mathtt{b} implies that Zk(i)[0]=1Z_{k}(i)[0]=1 and since the Zeckendorf representation does not use consecutive Fibonacci numbers, Zk(i)[1]=0Z_{k}(i)[1]=0. Thus, 𝗋𝗈𝗍2(Zk(i))=Zk(i)[2..k1]10\mathsf{rot}^{2}(Z_{k}(i))=Z_{k}(i)[2..k-1]\cdot 1\cdot 0. Therefore, we have

𝗋𝖺𝗇𝗄𝚋(i,Fk)\displaystyle\mathsf{rank}_{\mathtt{b}}(i,F_{k}) =|j=k20Fj+1Z(i)[j]|𝚋\displaystyle=\left|\prod_{j=k-2}^{0}F_{j+1}^{Z(i)[j]}\right|_{\mathtt{b}} by Lemma 4.7
=j=0k2Zk(i)[j]fj1=j=0k1Zk(i)[j]fj1\displaystyle=\sum_{j=0}^{k-2}Z_{k}(i)[j]\cdot f_{j-1}=\sum_{j=0}^{k-1}Z_{k}(i)[j]\cdot f_{j-1}
=Zk(i)[0]f1+Zk(i)[1]f0+j=0k3Zk(i)[j+2]fj+1\displaystyle=Z_{k}(i)[0]\cdot f_{-1}+Z_{k}(i)[1]\cdot f_{0}+\sum_{j=0}^{k-3}Z_{k}(i)[j+2]\cdot f_{j+1}
=j=0k3Zk(i)[j+2]fj+1\displaystyle=\sum_{j=0}^{k-3}Z_{k}(i)[j+2]\cdot f_{j+1}

Furthermore,

ΨFk(i)\displaystyle\Psi_{F_{k}}(i) =𝗋𝖺𝗇𝗄𝚋(i,Fk)+fk1\displaystyle=\mathsf{rank}_{\mathtt{b}}(i,F_{k})+f_{k-1} by Lemma 4.10
=j=0k3Zk(i)[j+2]fj+1+fk1\displaystyle=\sum_{j=0}^{k-3}Z_{k}(i)[j+2]\cdot f_{j+1}+f_{k-1}
=j=0k3Zk(i)[j+2]fj+1+Zk(i)[0]fk1+Zk(i)[1]fk\displaystyle=\sum_{j=0}^{k-3}Z_{k}(i)[j+2]\cdot f_{j+1}+Z_{k}(i)[0]\cdot f_{k-1}+Z_{k}(i)[1]\cdot f_{k}
=j=0k1Zk(i)[(j+2)modk]fj+1\displaystyle=\sum_{j=0}^{k-1}Z_{k}(i)[(j+2)\bmod k]\cdot f_{j+1}
=𝗋𝗈𝗍2(Zk(i)).\displaystyle=||\mathsf{rot}^{2}(Z_{k}(i))||.

thus proving the lemma (see also Figure 2).

Refer to caption
Figure 2: Example of Lemma 4.12 for F7F_{7}. The 𝚊\mathtt{a} at position 1818 is the 1212th 𝚊\mathtt{a} in F7[0..18]F_{7}[0..18], and thus the LF mapping should point to position 1111, whose Zeckendorf representation Z7(11)Z_{7}(11) is a 1-bit left rotation of Z7(18)Z_{7}(18). The 𝚋\mathtt{b} at position 1717 is the 77th 𝚋\mathtt{b} in F7[0..17]F_{7}[0..17], and since there are 1313 𝚊\mathtt{a}’s in F7F_{7}, the LF mapping should point to position 13+71=1913+7-1=19, whose Zeckendorf representation Z7(19)Z_{7}(19) is a 2-bit left rotation of Z7(17)Z_{7}(17).

We are now ready to prove Theorem 4.6: See 4.6

Proof 4.14.

For any Fibonacci word FkF_{k}, δ(Fk)γ(Fk)b(Fk)v(Fk)r(Fk)=2\delta(F_{k})\leq\gamma(F_{k})\leq b(F_{k})\leq v(F_{k})\leq r(F_{k})=2 and z(Fk),grl(Fk),g(Fk)=O(logn)z(F_{k}),g_{\mathrm{rl}}(F_{k}),g(F_{k})=O(\log n) are known. In the rest of the proof, we show δ(w)=Ω(n/logn)\delta(w)=\Omega(n/\log n) implying the same lower bound for all other measures which finishes the proof.

From Lemma 4.12, ΨFk\Psi_{F_{k}} can be interpreted as a rotation operation on a kk-bit string corresponding to the Zeckendorf representation of the given position. Therefore, the number of cycles that ΨFk\Psi_{F_{k}} produces is equivalent to the number of conjugacy classes among bit strings corresponding to the Zeckendorf representations of the set of positions [0,Fk)[0,F_{k}). Since the Zeckendorf representations do not use adjacent Fibonacci numbers, this is equivalent to the number C(k)C(k) of binary necklaces of length kk that do not contain “11”. This corresponds to entry A000358 in the on-line encyclopedia of integer sequences, which is known [3] to be:

C(k)\displaystyle C(k) =1kd|kφ(k/d)(fd2+fd),\displaystyle=\frac{1}{k}\sum_{d|k}\varphi(k/d)(f_{d-2}+f_{d}), (4)

where d|kd|k represents that dd is a divisor of kk, and φ\varphi is Euler’s totient function which returns, for integer nn, the number of positive integers less than nn that are relatively prime to nn (i.e., having gcd\gcd of 11).

As seen in the proof of Lemma 4.12, an 𝚊\mathtt{a} in FkF_{k} implies a 1-bit left rotation on ZkZ_{k} for ΨFk\Psi_{F_{k}}, and 𝚋\mathtt{b} in FkF_{k} implies a 2-bit left rotation on ZkZ_{k} for ΨFk\Psi_{F_{k}}. For simplicity of the analysis, we will restrict kk to being prime. Each necklace will then correspond to some cyclic string ww^{\prime}, where

|w|𝚊+2|w|𝚋\displaystyle|w^{\prime}|_{\mathtt{a}}+2|w^{\prime}|_{\mathtt{b}} =k,\displaystyle=k, (5)

except for the string 𝚊\mathtt{a} corresponding to the bit string 0k0^{k}, since all other necklaces must be primitive and a total of kk-bits must be rotated in order to come back to the same bit string. Consider the lexicographically smallest rotations of all of them, which are Lyndon words. All of them are distinct, and any two of them (except for the single 𝚊\mathtt{a}) cannot be a substring of the other due to the constraint on their lengths (Equation 5). The string w=𝖡𝖡𝖶𝖳1(Fk)w=\mathsf{BBWT}^{-1}(F_{k}) is a concatenation (in non-increasing order) of all these strings with the single 𝚊\mathtt{a} appended at the end, corresponding to the Lyndon factorization of ww. Since Lyndon words can only occur as a substring of a Lyndon factor, it follows that there are Ω(C(k))\Omega(C(k)) distinct uniquely occurring strings (Lyndon factors) that do not overlap. For prime kk, Equation 4 becomes:

C(k)\displaystyle C(k) =1k(φ(1)(fk2+fk)+φ(k)(f1+f1))=1k(fk2+fk+k1)fk/k.\displaystyle=\frac{1}{k}\left(\varphi(1)(f_{k-2}+f_{k})+\varphi(k)(f_{-1}+f_{1})\right)=\frac{1}{k}(f_{k-2}+f_{k}+k-1)\geq f_{k}/k.

Since the uniquely occurring Lyndon factors of ww satisfy Equation 5 and have length at most k1k-1, any length 2k2k substring of ww will contain at least one full occurrence of a Lyndon factor of ww. Since all such strings must also have unique occurrences and therefore be distinct, δ(w)(fk2k)/2k=Ω(n/logn)\delta(w)\geq(f_{k}-2k)/2k=\Omega(n/\log n) holds.

For example, for 𝖡𝖡𝖶𝖳1(F7)\mathsf{BBWT}^{-1}(F_{7}), the binary necklaces of length 77 that do not contain “11” are: 00000000000000, 00000010000001, 00001010000101, 00010010001001, 00101010010101, which respectively correspond to 𝚊\mathtt{a}, 𝚊𝚊𝚊𝚊𝚊𝚋\mathtt{aaaaab}, 𝚊𝚊𝚊𝚋𝚋\mathtt{aaabb}, 𝚊𝚊𝚋𝚊𝚋\mathtt{aabab}, and 𝚊𝚋𝚋𝚋\mathtt{abbb}, and therefore, 𝖡𝖡𝖶𝖳1(F7)=𝚊𝚋𝚋𝚋𝚊𝚊𝚋𝚊𝚋𝚊𝚊𝚊𝚋𝚋𝚊𝚊𝚊𝚊𝚊𝚋𝚊\mathsf{BBWT}^{-1}(F_{7})=\mathtt{abbbaababaaabbaaaaaba}.

We note that 𝖡𝖡𝖶𝖳1(Fk)\mathsf{BBWT}^{-1}(F_{k}) is asymptotically maximally incompressible by dictionary compression since it is known that for any string, δγz=O(n/logσn)\delta\leq\gamma\leq z=O(n/\log_{\sigma}n) holds, which is O(n/logn)O(n/\log n) for binary strings.

5 Discussion

The log\log factors in our bounds, especially in Theorem 4.4 are most likely not tight; Our focus here was on showing the existence of such bounds. A natural open question is how many log\log factors can be shaved.

References

  • [1] Golnaz Badkobeh, Hideo Bannai, and Dominik Köppl. Bijective BWT based compression schemes. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 16–25. Springer, 2024. doi:10.1007/978-3-031-72200-4\_2.
  • [2] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the bijective Burrows-Wheeler transform. In Giuseppa Castiglione and Marinella Sciortino, editors, Proceedings of the 24th Italian Conference on Theoretical Computer Science, Palermo, Italy, September 13-15, 2023, volume 3587 of CEUR Workshop Proceedings, pages 129–142. CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3587/4564.pdf.
  • [3] M. Bona. Handbook of Enumerative Combinatorics. Discrete Mathematics and Its Applications. CRC Press, 2015. URL: https://books.google.co.jp/books?id=j3kZBwAAQBAJ.
  • [4] Jean-Pierre Borel and Christophe Reutenauer. On Christoffel classes. RAIRO Theor. Informatics Appl., 40(1):15–27, 2006. URL: https://doi.org/10.1051/ita:2005038, doi:10.1051/ITA:2005038.
  • [5] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, Systems Reseach Center, 1994. SRC Research Report 124.
  • [6] Kuo Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. Free differential calculus, IV. The quotient groups of the lower central series. Annals of Mathematics, 68(1):81–95, 1958.
  • [7] Manolis Christodoulakis, Costas S. Iliopoulos, and Yoan José Pinzón Ardila. Simple algorithm for sorting the fibonacci string rotations. In Jirí Wiedermann, Gerard Tel, Jaroslav Pokorný, Mária Bieliková, and Julius Stuller, editors, SOFSEM 2006: Theory and Practice of Computer Science, 32nd Conference on Current Trends in Theory and Practice of Computer Science, Merín, Czech Republic, January 21-27, 2006, Proceedings, volume 3831 of Lecture Notes in Computer Science, pages 218–225. Springer, 2006. doi:10.1007/11611257\_19.
  • [8] Maxime Crochemore, Thierry Lecroq, and Wojciech Rytter. Fibonacci Words and Fibonacci Numeration System, page 17–52. Cambridge University Press, 2021.
  • [9] Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. On the impact of morphisms on BWT-runs. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPIcs.CPM.2023.10, doi:10.4230/LIPICS.CPM.2023.10.
  • [10] Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Almost linear time computation of maximal repetitions in run length encoded strings. In Yoshio Okamoto and Takeshi Tokuyama, editors, 28th International Symposium on Algorithms and Computation, ISAAC 2017, December 9-12, 2017, Phuket, Thailand, volume 92 of LIPIcs, pages 33:1–33:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2017.33, doi:10.4230/LIPICS.ISAAC.2017.33.
  • [11] Joseph Yossi Gil and David Allen Scott. A bijective string sorting transform. CoRR, abs/1201.3077, 2012. URL: http://arxiv.org/abs/1201.3077, arXiv:1201.3077.
  • [12] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, and Anna Toffanello. Novel results on the number of runs of the Burrows-Wheeler-transform. In Tomás Bures, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdzinski, Claus Pahl, Florian Sikora, and Prudence W. H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science - 47th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2021, Bolzano-Bozen, Italy, January 25-29, 2021, Proceedings, volume 12607 of Lecture Notes in Computer Science, pages 249–262. Springer, 2021. doi:10.1007/978-3-030-67731-2\_18.
  • [13] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Bit catastrophes for the Burrows-Wheeler transform. In Frank Drewes and Mikhail Volkov, editors, Developments in Language Theory - 27th International Conference, DLT 2023, Umeå, Sweden, June 12-16, 2023, Proceedings, volume 13911 of Lecture Notes in Computer Science, pages 86–99. Springer, 2023. doi:10.1007/978-3-031-33264-7\_8.
  • [14] Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
  • [15] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
  • [16] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
  • [17] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10, 2009.
  • [18] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 05 2009. arXiv:https://academic.oup.com/bioinformatics/article-pdf/25/14/1754/48994219/bioinformatics\_25\_14\_1754.pdf, doi:10.1093/bioinformatics/btp324.
  • [19] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino, and Luca Versari. Measuring the clustering effect of BWT via RLE. Theor. Comput. Sci., 698:79–87, 2017. URL: https://doi.org/10.1016/j.tcs.2017.07.015, doi:10.1016/J.TCS.2017.07.015.
  • [20] Takuya Mieno, Shunsuke Inenaga, and Takashi Horiyama. RePair grammars are the smallest grammars for Fibonacci words. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 26:1–26:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.26, doi:10.4230/LIPICS.CPM.2022.26.
  • [21] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2022. doi:10.1145/3432999.
  • [22] Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory, 67(2):1008–1026, 2021. doi:10.1109/TIT.2020.3042746.
  • [23] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928–951, 1982. doi:10.1145/322344.322346.
  • [24] Edouard Zeckendorf. Representations des nombres naturels par une somme de nombres de Fibonacci on de nombres de Lucas. Bulletin de La Society Royale des Sciences de Liege, pages 179–182, 1972. URL: https://cir.nii.ac.jp/crid/1570009749187075840.
  • [25] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.