Local Editing in LZ-End Compressed Data

Daniel Roodt1, Ulrich Speidel2, Vimal Kumar1, Ryan K. L. Ko3




1University of Waikato, Hamilton, New Zealand 2University of Auckland, Auckland, New Zealand 3University of Queensland, Brisbane, Australia contact email: [email protected]
Abstract

This paper presents an algorithm for the modification of data compressed using LZ-End, a derivative of LZ77, without prior decompression. The algorithm’s performance and the impact of the modifications on the compression ratio are evaluated. Finally, we discuss the importance of this work as a first step towards local editing in Lempel-Ziv compressed data.

Index Terms:
Lempel-Ziv, compression, local editing, local decoding, LZ-End, entropy, calibrated strings, logistic map

I Introduction

The demand for compressed data storage has led to the development of methods to search in [1, 2] and randomly access compressed data [3, 4, 5]. In addition, some purpose-built algorithms allow editing parts of the compressed data string [6]. Recent innovations allow adapting any universal compressor to allow random access [7, 8], and random access with local editing [8]. All of these have been achieved without decompressing the entire string, which processing compressed data traditionally requires.

The algorithms of Tatwawadi et al. [7] and Vatedka and Tchamkerten [8] share a limitation: neither applies to the parsing algorithm directly. Both involve breaking the data into a series of blocks which are compressed independently of one another by an arbitrary parsing algorithm (e.g. LZ77 or PPM). Local editing of the data is achieved by the method in which the blocks of data are separated and merged together [8]. Similarly, random access is achieved by splitting the data into blocks, some of which are represented as a sparse vector; this sparse vector is compressed so as to allow random access [9, 10]. Although the LZ compression algorithms can be modified in this way to allow random access and edits, the parsing algorithm itself is left untouched.

In 2010, Kreft and Navarro developed LZ-End, an adaptation of LZ77 which allows retrieval of arbitrary sections of data without global decompression [3]. Other algorithms, such as LZ78 and LZW [12, 1], allow searching over compressed data, but not extraction of arbitrary pieces of data. In this paper, we show how to use this same feature of the LZ-End algorithm to allow local editing of the compressed data without global decompression. This approach is the first step towards efficient local editing in LZ-compressed data. Each edit made to the compressed data results in a degradation of the compression ratio. We evaluate how different kinds of edits affect the compression ratio, and conclude by discussing the importance of this algorithm as an initial step towards locally editing general LZ data.

II The LZ-End Compression Algorithm

Like LZ77, LZ-End encodes the data as a series of phrases, each of which consists of a pointer to a previous phrase and an “innovation” symbol which extends that previous phrase (see Section II-A). The main difference between LZ77 and LZ-End is that LZ-End restricts where each phrase may end. This allows LZ-End to retrieve arbitrary data independently of any other part of the compressed string [3]. Kreft and Navarro used their first version of LZ-End in 2010 [3] to build a self-index [13]. Later variations of LZ-End allow parsing in linear time [14] and in compressed space [15].

II-A Representation of the Compressed Data

LZ-End builds on the LZ77 algorithm [16] in a way that allows decompression of an arbitrary portion of the compressed data without decompressing any other part. This section gives a rough overview of LZ-End, as presented in [3].

Definition 1.

The LZ-End parsing of a series of symbols $\mathbf{T}=T_0\dots T_{n-1}$ is a sequence $\mathbf{P}$ of $n'$ phrases. If the first $q$ phrases of $\mathbf{P}$ encode the first $i$ symbols of $\mathbf{T}$, then the $q$-th phrase of $\mathbf{P}$ encodes the longest prefix of $T_i\dots T_{n-1}$ which is a suffix of one of the phrases $P_0\dots P_{q-1}$ [3].

The compressed output of the LZ-End algorithm is a series of phrases, each represented conceptually as a tuple $\langle q,l,s\rangle$:

  • $q$: a pointer to the previous phrase which the current phrase builds upon.

  • $l$: the number of symbols encoded by the current phrase.

  • $s$: the final symbol of the current phrase.

Each phrase encodes either:

  • a single symbol. In this case, $q$ is null and $s$ is the symbol encoded.

  • a series of symbols. $q$ points to the last phrase which the current phrase extends. By making $l-1$ greater than the length of the phrase at index $q$, this pointer may reference multiple adjacent phrases, with the remaining symbols of the current phrase derived from phrases $q-1$, $q-2$, etc. As in LZ77, $s$ extends this sequence of symbols from the previous phrases [3].

Unlike LZ77, LZ-End parsing does not use a sliding window [3]; instead, it reads the whole string into memory and constructs a suffix array of the data [3]. Three arrays store the compressed data. To compress a string $\mathbf{T}=T_0T_1\dots T_{n-1}$ of $n$ symbols into $n'$ phrases:

  • char[$n'$] stores $s$ from the tuple above for each phrase,

  • source[$n'$] stores $q$ from the tuple for each phrase, and

  • $\mathbf{B}[n]$, where $\mathbf{B}[i]=1$ if $T_i$ is at the end of a phrase, and 0 otherwise. This array is sparse and stored as a list of indices of the non-zero elements (see [17]). This supports two constant-time operations:

    • $\mathit{rank}_B(p)$: the number of 1’s in $\mathbf{B}$ at indices up to $p$, i.e., the number of phrases preceding $T_p$.

    • $\mathit{select}_B(p)$: the position of the $p$-th 1 in $\mathbf{B}$, i.e., the index of the last symbol in $\mathbf{T}$ encoded by the $p$-th phrase.

    Collectively, $\mathit{rank}_B()$ and $\mathit{select}_B()$ yield the length $l$ of each phrase.
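For illustration, the sketch below (ours, not from [3] or [17]) models the sparse representation of $\mathbf{B}$ in Python, with $\mathit{rank}_B()$ and $\mathit{select}_B()$ as binary searches over the list of phrase-end positions; the structure of [17] achieves the same interface in constant time, and the inclusive reading of “indices up to $p$” is one consistent interpretation of the convention above.

    from bisect import bisect_right

    class SparseBitvector:
        """B[0..n-1], stored sparsely as the sorted positions i with B[i] = 1."""

        def __init__(self, ones):
            # 'ones' holds the indices of the phrase-final symbols of T.
            self.ones = sorted(ones)

        def rank(self, p):
            # Number of 1's at indices up to (and here including) p:
            # the number of phrases ending at or before T_p.
            return bisect_right(self.ones, p)

        def select(self, p):
            # Index in T of the last symbol of the p-th phrase (1-based p).
            return self.ones[p - 1]

        def phrase_length(self, p):
            # l of the p-th phrase: the gap between consecutive phrase ends.
            prev_end = self.ones[p - 2] if p > 1 else -1
            return self.ones[p - 1] - prev_end

    # Phrase ends at T_0, T_1 and T_4: three phrases of lengths 1, 1 and 3.
    bv = SparseBitvector([0, 1, 4])
    assert bv.rank(3) == 2 and bv.select(3) == 4 and bv.phrase_length(3) == 3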

II-B Notation

When a phrase $y$ points back to a phrase $x$, we say that $y$ depends on $x$, and we call $y$ a dependent of $x$. Note that $y$ may depend on any number of symbols from $x$ and its immediate predecessors. A phrase may have any number of dependents, including none. Let $[a,b]$ denote the range of integers between $a$ and $b$ inclusive. When referring to a range of variables in an array, we use $T[i,j]$ to refer to the symbols $T_i\dots T_j$ inclusive, and $T[i,j)$ to refer to symbols $T_i\dots T_{j-1}$ inclusive. For arrays of phrases, $||$ denotes the append operation: $\mathbf{b}\leftarrow\mathbf{b}\,||\,a$ appends a variable $a$ to array $\mathbf{b}$. Let $p.q$ denote the index, $p.l$ the length, and $p.s$ the innovation of a phrase $p$. Angle brackets denote the tuple of a phrase: $\langle q,l,s\rangle$.
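In code, a phrase maps naturally onto a small record type; the dataclass below is our illustrative stand-in for $\langle q,l,s\rangle$, not a structure from [3].

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Phrase:
        q: Optional[int]  # p.q: index of the last phrase extended (None for a literal)
        l: int            # p.l: number of symbols encoded, innovation included
        s: str            # p.s: the innovation symbol that ends the phrase

    # A literal phrase <null, 1, 'a'> and a copying phrase <0, 3, 'b'>:
    literal = Phrase(q=None, l=1, s="a")
    copying = Phrase(q=0, l=3, s="b")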

III Local Editing Algorithm

Algorithm 1 $\mathit{modify}(i,j,\mathbf{str})$:
0:  1. an array of $n'$ phrases, $\mathbf{P}$, which form the LZ-End parsing of the string $\mathbf{T}=T_0,\dots,T_{n-1}$; 2. $0\leq i\leq j$; 3. a string of symbols $\mathbf{str}$. Result: the array $\mathbf{P}$ with symbols $[T_i,T_j)$ removed and $\mathbf{str}$ inserted in their place.
1:  $x\leftarrow\mathit{rank}_B(i)$
2:  $l\leftarrow\mathit{rank}_B(j)-x$
3:  $\mathbf{dep}\leftarrow\mathit{find\_dependent\_phrases}(\mathbf{P},x,x+l)$ (Algorithm 2)
4:  $\mathbf{rep}\leftarrow\mathit{find\_replacement\_phrases}(\mathbf{P},i,j,\mathbf{dep})$ (Algorithm 3)
5:  $\mathbf{comp\_str}\leftarrow\mathit{encode\_str}(\mathbf{P},i,j,\mathbf{str})$ (Algorithm 4)
6:  delete phrases $\mathbf{P}[\mathit{rank}_B(i),\mathit{rank}_B(j-1)]$
7:  insert $\mathbf{comp\_str}$ into $\mathbf{P}$, between $(\mathit{rank}_B(i)-1)$ and $\mathit{rank}_B(i)$
8:  $\mathbf{P}\leftarrow\mathit{adjust\_pointers}(\mathbf{P},\,x+\mathbf{comp\_str}.\mathit{size}-1,\,l+1,\,\mathbf{comp\_str}.\mathit{size},\,\mathbf{dep},\,\mathbf{rep})$ (Algorithm 5)
9:  return $\mathbf{P}$
Algorithm 2 $\mathit{find\_dependent\_phrases}(\mathbf{P},i,j)$
0:  1. an array $\mathbf{P}$ of LZ-End phrases; 2. an integer $i$ denoting the first phrase to be deleted; 3. an integer $j$ denoting the first phrase of $\mathbf{P}$ not deleted. Result: an array of phrase pointers, $\mathbf{dep}$, which point to one or more symbols in $[T_i,T_j)$.
1:  $\mathbf{dep}\leftarrow$ an empty array of phrases
2:  $\mathbf{dist}\leftarrow$ an array of $\mathbf{P}.\mathit{size}-j$ integers, where $\mathbf{dist}[-1]=0$
3:  $\mathit{ptr}\leftarrow j$
4:  while $\mathit{ptr}<\mathbf{P}.\mathit{size}$ do
5:     $\mathbf{dist}[\mathit{ptr}-j]\leftarrow\mathbf{dist}[\mathit{ptr}-j-1]+\mathbf{P}[\mathit{ptr}].l$
6:     if $i\leq\mathbf{P}[\mathit{ptr}].q<j$ or $\mathbf{dist}[\mathbf{P}[\mathit{ptr}].q]\leq\mathbf{P}[\mathit{ptr}].l-1$ then
7:        $\mathbf{dep}\leftarrow\mathbf{dep}\,||\,\mathit{ptr}$
8:     end if
9:     $\mathit{ptr}\leftarrow\mathit{ptr}+1$
10:  end while
11:  return $\mathbf{dep}$

This section presents the algorithm to locally edit data compressed using LZ-End parsing. We assume prior knowledge of the index positions of the symbols within $\mathbf{T}$ which are to be edited; this is a reasonable assumption because the shared phrase structure of the LZ77 and LZ-End parsings allows the pattern matching algorithm for LZ77-compressed data [2] to apply to LZ-End.

We can express any operation we wish to perform on the data $\mathbf{T}=T_0\dots T_{n-1}$ as a variation of a generic operation $\mathit{modify}(i,j,\mathbf{str})$, as follows: $i$ indexes the first symbol $T_i$ to be removed, $j$ the first symbol $T_j$ at or after $T_i$ that is not removed, and the symbols of $\mathbf{str}$ replace $[T_i,T_j)$. Deletion thus happens for $j>i$; for $j=i$, nothing is deleted, and $j<i$ is not allowed. If $\mathbf{str}$ is null, we have a pure deletion. Insertion happens at position $i$: if $j=i$ (no deletion), symbol $T_i$ moves to the right and we have a pure insertion. If $j>i$ and $\mathbf{str}$ is not null, we have a replacement.
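On the uncompressed string these semantics reduce to a one-liner; the sketch below (ours) is the plaintext reference against which the compressed-domain algorithm can be checked.

    def modify_plain(T: str, i: int, j: int, s: str) -> str:
        """Reference semantics of modify(): remove T[i:j], insert s in its place."""
        assert 0 <= i <= j <= len(T)
        return T[:i] + s + T[j:]

    # j == i gives a pure insertion; an empty s gives a pure deletion.
    assert modify_plain("abcdef", 2, 4, "XY") == "abXYef"    # replacement
    assert modify_plain("abcdef", 2, 2, "XY") == "abXYcdef"  # insertion
    assert modify_plain("abcdef", 2, 4, "") == "abef"        # deletion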

Algorithm 1 and its subroutines (Algorithms 2, 3, 4 and 5) together implement the $\mathit{modify}(i,j,\mathbf{str})$ function. The algorithm first identifies the target phrases which encode the symbols in $[T_i,T_j)$. It then identifies all dependents of the target phrases (Algorithm 2). Next, the algorithm replaces each dependent with one or more phrases which do not depend on the target (Algorithm 3). The next step prepends (appends) to $\mathbf{str}$ the symbols in the first (last) phrase of the target which precede $T_i$ (follow $T_{j-1}$). The updated $\mathbf{str}$ is then parsed by the LZ-End algorithm as a separate string; no part of the existing compressed data is used to encode $\mathbf{str}$ (Algorithm 4). The algorithm then replaces the target phrases by the phrases for $\mathbf{str}$. The final step adjusts the back-references of each phrase which follows the original target phrases, to account for the phrases inserted and deleted in the previous steps (Algorithm 5). Note that our implementation replaces the inner loop shown in Algorithm 5 by a binary search, achieving log-linear rather than quadratic running time.

The algorithm to find replacement phrases for each dependent (Algorithm 3) first determines the exact symbols $[T_a,T_b]$ in $\mathbf{T}$ which the dependent references, where $a$ and $b$ are set in Steps 4 and 3, respectively. The algorithm then checks if the dependent references any symbol(s) preceding the target. If it does, the dependent’s first replacement phrase references the symbols in $[T_a,T_i)$ (Step 6). The same happens in Step 19 for symbols $[T_j,T_b]$. The part of the dependent which references one or more phrases in the target is replaced by exact copies of those referenced phrases (Steps 13–18).

Algorithm 3 $\mathit{find\_replacement\_phrases}(\mathbf{P},i,j,\mathbf{dep})$
0:  1. an array of LZ-End phrases $\mathbf{P}$; 2. an index $i$ of the first symbol being removed; 3. an index $j$ of the first symbol after $i$ not being removed; 4. an array $\mathbf{dep}$ of phrases which depend on one or more symbols in $\mathbf{T}[i,j)$. Result: a ragged 2-D array of phrases encoding each phrase in $\mathbf{dep}$ without dependence on symbols in $\mathbf{T}[i,j)$.
1:  $\mathbf{rep}\leftarrow$ an array of $\mathbf{dep}.\mathit{size}$ many empty arrays of phrases
2:  for $k$ in $[0,\mathbf{dep}.\mathit{size})$ do
3:     $b\leftarrow\mathit{select}_B(\mathbf{dep}[k].q)$
4:     $a\leftarrow b-\mathbf{dep}[k].l+1$
5:     $\mathit{len}\leftarrow\mathbf{dep}[k].l$
6:     if $a<i$ then
7:        $\mathbf{rep}[k]\leftarrow\mathbf{rep}[k]\,||\,\langle\mathit{rank}_B(i)-1,\ i-a,\ \mathbf{P}[\mathit{rank}_B(i)-1].s\rangle$
8:        $\mathit{len}\leftarrow\mathit{len}-(i-a)$
9:        $\mathit{ptr}\leftarrow\mathit{rank}_B(i)$
10:     else
11:        $\mathit{ptr}\leftarrow\mathit{rank}_B(a)$
12:     end if
13:     while $\mathit{len}>0$ and $\mathit{ptr}\leq\mathit{rank}_B(j)$ do
14:        $\mathbf{rep}[k]\leftarrow\mathbf{rep}[k]\,||\,\mathbf{P}[\mathit{ptr}]$
15:        $\mathbf{rep}[k].l\leftarrow\mathit{Min}(\mathit{len},\ \mathbf{rep}[k].l)$
16:        $\mathit{len}\leftarrow\mathit{len}-\mathbf{P}[\mathit{ptr}].l$
17:        $\mathit{ptr}\leftarrow\mathit{ptr}+1$
18:     end while
19:     if $\mathit{len}>0$ then
20:        $\mathbf{rep}[k]\leftarrow\mathbf{rep}[k]\,||\,\langle\mathbf{dep}[k].q,\ \mathit{len},\ \mathbf{dep}[k].s\rangle$
21:     end if
22:  end for
23:  return $\mathbf{rep}$
Algorithm 4 $\mathit{encode\_str}(\mathbf{P},i,j,\mathbf{str})$
0:  1. an array of LZ-End phrases $\mathbf{P}$; 2. an index $i$ of the first symbol $\mathbf{T}[i]$ to be deleted; 3. an index $j$ of the first symbol after $\mathbf{T}[i]$ to not be deleted; 4. a string of symbols $\mathbf{str}$ to be inserted in place of $\mathbf{T}[i,j)$. Result: the LZ-End phrases of the symbols in $\mathbf{str}$, as well as the symbols before $i$ which are in the same phrase as $i$, and the symbols after and including $j$ which are in the same phrase as $j-1$.
1:  $\mathit{first}\leftarrow\mathit{select}_B(\mathit{rank}_B(i)-1)+1$
2:  $\mathit{last}\leftarrow\mathit{select}_B(\mathit{rank}_B(j-1))$
3:  $\mathit{prefix}\leftarrow$ the symbols $\mathbf{T}[\mathit{first},i)$, obtained with Kreft and Navarro’s $\mathit{extract}()$ algorithm [3]
4:  $\mathit{suffix}\leftarrow$ the symbols $\mathbf{T}[j,\mathit{last}]$, obtained with the same $\mathit{extract}()$ algorithm
5:  $\mathbf{str}\leftarrow\mathit{prefix}\,||\,\mathbf{str}\,||\,\mathit{suffix}$
6:  $\mathbf{comp\_str}\leftarrow$ the symbols of $\mathbf{str}$, encoded with the LZ-End parsing algorithm [3]
7:  return $\mathbf{comp\_str}$
Algorithm 5 $\mathit{adjust\_pointers}(\mathbf{P},x,l,z,\mathbf{dep},\mathbf{rep})$
0:  1. an array $\mathbf{P}$ of LZ-End phrases; 2. a pointer $x$ to the first phrase whose index may need adjusting; 3. the number of phrases $l$ which were removed from $\mathbf{P}$; 4. the number of phrases $z$ in the encoding of $\mathbf{str}$; 5. an array $\mathbf{dep}$ of pointers to phrases which depended on phrases which have been removed; 6. a ragged 2-D array $\mathbf{rep}$ of phrases which were inserted in place of each phrase in $\mathbf{dep}$. Result: the array $\mathbf{P}$, once its index pointers have been corrected to account for the modification.
1:  for $i$ in $[x,\mathbf{P}.\mathit{size})$ do
2:     if $\mathbf{P}[i].q\geq x$ then
3:        $\mathbf{P}[i].q\leftarrow\mathbf{P}[i].q+z-l$
4:        for $j$ in $[0,\mathbf{dep}.\mathit{size})$ do
5:           if $\mathbf{P}[i].q\geq\mathbf{dep}[j]$ then
6:              $\mathbf{P}[i].q\leftarrow\mathbf{P}[i].q+\mathbf{rep}[j].\mathit{size}$
7:           else
8:              break
9:           end if
10:        end for
11:     end if
12:  end for
13:  return $\mathbf{P}$

Note that replacement does not change existing phrase boundaries, i.e., any phrases referencing a dependent can instead reference the corresponding final replacement phrase. The worst case run-times of the subroutines are as listed, where the variables are as defined in Algorithm 1. The derivation of the runtime complexity for each subroutine is found in the Appendix:

  • Find dependent phrases (Algorithm 2): $\Theta(n'-\mathit{rank}_B(j))$.

  • Find replacement phrases (Algorithm 3): $O((n'-\mathit{rank}_B(j))\times(\mathit{rank}_B(j)-\mathit{rank}_B(i)+2))$.

  • Encoding $\mathbf{str}$ (Algorithm 4): $O(S\log s)$, $S\leq sl$, where $l$ is the length of the longest phrase encoding $\mathbf{str}$, and $s$ is the size of $\mathbf{str}$ after Step 5 of Algorithm 4.

  • Adjust pointers (Algorithm 5): $O((n'-\mathit{rank}_B(j))\times\log(n'-\mathit{rank}_B(j)))$. This worst case occurs when, before the modification, every phrase after $\mathit{rank}_B(j)$ was a dependent of all preceding phrases up to and including $\mathit{rank}_B(j)$. Each dependent is replaced by at least one phrase which references phrases preceding $\mathit{rank}_B(i)$ (i.e., needs no index pointer adjustment) and by at most one phrase with index $>\mathit{rank}_B(j)$ (which needs adjustment).

IV Evaluation

The LZ-End parsing is known to be coarsely optimal [18]. That is, for any $k$, the compression ratio is asymptotically bounded by the $k$-th order entropy $H_k$ of the string. The proof of coarse optimality depends on the assumption that each phrase in the LZ-End parsing of a string represents a unique sequence of symbols. Although this assumption is valid for the LZ-End parsing, the $\mathit{modify}()$ operation does not preserve phrase uniqueness. The coarse optimality of the LZ-End parsing therefore no longer holds after an application of the $\mathit{modify}()$ operation. This section empirically evaluates the effect of the $\mathit{modify}()$ operation on the compressed data.

IV-A Metrics

The compression performance of the LZ-End algorithm has already been extensively evaluated theoretically and empirically [3]. We therefore focus on evaluating the effect which the $\mathit{modify}()$ operation has on the compressed data. The evaluation considers the following factors:

  • the type of modification – that is, insertions, deletions, and replacements.

  • the size of the modification (as a fraction of overall file size).

  • the effect of successive modifications.

  • the characteristics of the file being modified.

To this end, the experiments consisted of applying the same set of modifications to two different sets of files. The first set was the Canterbury Corpus [19], which is well known in the literature. The second consisted of a set of generated files of calibrated entropy, as described by Ebeling et al. [20]. Together, these two sets of experiments allow the evaluation of the $\mathit{modify}()$ function on well-known and easily accessed files, as well as extrapolation of the results to files of various entropies.

The effect of $\mathit{modify}()$ is measured by the modification ratio (MR). This is calculated as the size of the compressed data following the application of $\mathit{modify}()$, divided by the size of the compressed data if it had been decompressed, modified and recompressed.
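Formally, writing $\mathbf{T}'$ for the plaintext after the edit and $|\cdot|$ for compressed size, this is

$$\mathit{MR}=\frac{|\mathit{modify}(\mathbf{P},i,j,\mathbf{str})|}{|\text{LZ-End}(\mathbf{T}')|},$$

so that $\mathit{MR}=1$ means the local edit costs nothing relative to decompressing, modifying and recompressing, while larger values quantify the degradation.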

The following set of modifications was carried out on the files:

  • Incremental modifications (Figure 1(a) and left column of Table II).

    The ratio reported is the mean MR of each file following 100 insertions, 100 deletions, or 100 replacements. Each modification was 0.5% of the original file size, so that each set of insertions, deletions, and replacements affected 50% of the original file.

  • Size of the modification (Figure 1(b) and middle columns of Table II).

    The reported MR is the mean MR of insertions, deletions and replacements of various sizes between 5% and 95% of the file size.

  • Position of the modification (Figure 1(c) and rightmost columns of Table II).

    The reported MR is the mean MR after a modification of size 0.5% of the file size, made at various points between 0.05 and 0.95 of the file length. For a file of 100 bytes, an edit made at 0.05 starts at the 5th byte, while an edit made at 0.95 starts at the 95th byte.

Although not particularly time-efficient, this algorithm represents the first step in a new and interesting direction for improving the efficiency of compression functions.

IV-B Implementation Notes

The $\mathit{modify}()$ operation does not require any change to the LZ-End parsing or decoding algorithms: the phrases following an application of $\mathit{modify}()$ are valid LZ-End phrases (except that phrases are no longer necessarily unique), so the decoder does not need to be changed at all.

The implementation of the LZ-End parser did not optimise the encoding of short phrases; rather, it encodes even a single character in its own phrase. This is similar to the original LZ77 parsing [16], and unlike the more optimised version of Storer and Szymanski [21]. Whether fast local decoding of LZ-End data remains possible under an LZSS-type parsing of the LZ-End algorithm has not been investigated.

The calibrated entropy strings were produced by a bipartition $c\in[0,1]$ of the logistic map, as proposed by Ebeling et al. in [20]. To further demonstrate the dependence on entropy, one can apply an FIR filter to the bit streams generated by the logistic map, correlating subsequent bits to obtain a lower-entropy string. $F_n$, the $n$-th output value produced by the FIR filter, results from the logistic map bit sequence $S_n$ via an equation such as the following:

$$F_n = 0.5\,S_n + 0.1\,S_{n-1} + \dots + 0.1\,S_{n-5}\qquad\text{where }S_{-1}=\dots=S_{-5}=1$$

The value of the $n$-th bit is then set using a bipartition at 0.4, similar to that of the logistic map. This means that the FIR filter string will have a bit value of 1 wherever the logistic map string has a bit value of 1, or wherever four or more of the preceding five bits in the logistic map string are set to 1. Note that the order and the coefficients of the FIR filter above are chosen somewhat arbitrarily – all we need is some correlation, which can be achieved with a large range of FIR designs.
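The sketch below (ours) generates such bit streams under our reading of [20]: the fully chaotic map parameter $r=4$, the uniform noise model and the clipping to $[0,1]$ are assumptions on our part, while the FIR coefficients and both bipartitions follow the text above.

    import random

    def logistic_bits(n, xi, c=0.5, seed=1):
        """n bits from a noisy logistic map via a bipartition at c.
        The map parameter r = 4 and the uniform noise are our assumptions."""
        rng = random.Random(seed)
        x = rng.random()
        bits = []
        for _ in range(n):
            x = 4.0 * x * (1.0 - x) + xi * rng.uniform(-1.0, 1.0)
            x = min(max(x, 0.0), 1.0)  # keep the state within [0, 1]
            bits.append(1 if x >= c else 0)
        return bits

    def fir_bits(s, threshold=0.4):
        """F_n = 0.5*S_n + 0.1*(S_{n-1} + ... + S_{n-5}), with the initial
        history S_{-1} = ... = S_{-5} = 1, then a bipartition at 0.4."""
        s = [1] * 5 + list(s)
        return [1 if 0.5 * s[n] + 0.1 * sum(s[n - 5:n]) >= threshold else 0
                for n in range(5, len(s))]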

Ten strings of 50000 bytes each were generated for each value of the noise amplitude $\xi$ in Table I, with calibration $c=0.5$ [20]. The table shows the average, minimum and maximum size in KiB to which these strings compressed. At low entropies, FIR filtering has little impact on the compressed size but, as expected, reduces the compressed size somewhat at larger entropies. This shows that the effects seen cannot be mere artefacts of the logistic map generation.

Each experiment was run with three kinds of $\mathbf{str}$:

  • a low-entropy $\mathbf{str}$ consisting only of a series of ASCII symbols “a”,

  • a medium-entropy $\mathbf{str}$ which compresses to half its size, and

  • a high-entropy $\mathbf{str}$ which does not compress.

The mean MR for these three values of $\mathbf{str}$ is reported.
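Sources with approximately these properties can be generated as below (our sketch); the 16-symbol alphabet in the medium-entropy case is an assumption, chosen because a uniform 16-symbol source carries about half the entropy of uniform bytes and therefore compresses to roughly half its size.

    import os
    import random

    def low_entropy_str(n):
        """Highly compressible: a single repeated ASCII symbol."""
        return b"a" * n

    def medium_entropy_str(n, seed=1):
        """Roughly half-compressible: uniform bytes from a 16-symbol alphabet."""
        rng = random.Random(seed)
        return bytes(rng.choice(b"0123456789abcdef") for _ in range(n))

    def high_entropy_str(n):
        """Practically incompressible: uniform random bytes."""
        return os.urandom(n)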

We have conducted extensive validation to confirm the correctness of $\mathit{modify}()$. This involved incrementally making a series of 100 random modifications of random sizes and in random positions within each file of the Canterbury Corpus [19], and confirming that each file decompressed correctly.

IV-C Results

TABLE I: Compressed sizes of strings with calibrated entropy
Logistic map Logistic map + FIR filter
$\xi$   avg.   min.   max.     avg.   min.   max.
0.0001 8.88 8.79 8.94 8.88 8.82 8.94
0.00025 11.91 11.69 12.71 11.91 11.69 12.74
0.0005 15.45 15.38 15.50 15.45 15.42 15.48
0.00075 17.40 17.31 17.46 17.41 17.31 17.47
0.001 18.95 18.90 19.03 18.94 18.86 19.01
0.0025 26.40 26.26 26.51 26.86 26.68 26.98
0.005 32.12 31.98 32.22 32.81 32.73 32.92
0.0075 35.88 35.77 35.97 35.03 34.92 35.15
0.01 38.94 38.82 39.05 35.88 35.73 36.01
0.025 52.60 52.49 52.78 38.16 38.06 38.28
TABLE II: Modification ratios of the Canterbury Corpus
File           Incremental MR      Sizes                 Positions
                                   0.05   0.5    0.95    0.05   0.5    0.95
aaa.txt 242.0 1.565 1.527 1.590 1.889 1.718 1.650
alice29.txt 1.169 1.015 1.155 1.527 1.014 1.002 1.000
alphabet.txt 32.20 1.557 1.481 1.471 1.812 1.809 1.725
asyoulik.txt 1.217 1.012 1.134 1.511 1.012 1.003 1.000
bible.txt 1.518 1.018 1.181 1.567 1.014 1.004 1.045
cp.html 1.394 1.011 1.115 1.273 1.015 1.002 1.004
E.coli 1.335 1.010 1.134 1.431 1.059 1.002 1.000
fields.c 1.597 1.025 1.091 1.216 1.013 1.006 1.004
grammar.lsp 1.747 1.037 1.187 1.140 1.024 1.032 1.004
kennedy.xls 1.205 1.014 1.070 1.318 1.006 1.001 1.000
lcet10.txt 1.701 1.013 1.142 1.272 1.015 1.003 1.000
pi.txt 1.275 1.010 1.121 1.350 1.011 1.002 1.000
plrabn12.txt 1.302 1.011 1.141 1.375 1.012 1.002 1.000
ptt5 1.577 1.008 1.291 1.256 1.001 1.003 1.001
random.txt 1.167 1.010 1.118 1.183 1.011 1.002 1.000
sum 1.406 1.003 1.436 1.468 1.013 1.001 1.009
world192.txt 1.326 1.011 1.132 1.239 1.016 1.002 1.002
xargs.1 1.479 1.017 1.145 1.118 1.015 1.008 1.005

The first column in Table II shows the effect of 100 incremental modifications. The values reported are the mean ratio for insertions, deletions and replacements, for high, medium and low entropy $\mathbf{str}$ sources. Figure 1(a) shows the breakdown of how these operations and the entropy of the $\mathbf{str}$ relate to one another. As expected, modification most negatively affects the lowest entropy files (e.g., aaa.txt or alphabet.txt). This is because a single phrase of data from a low-entropy source encodes much more data than one from a higher entropy source. We see from Figure 1(a) that the insertion of a low-entropy $\mathbf{str}$ has a much smaller effect than that of a $\mathbf{str}$ of higher entropy. We also see that insertion has less of an effect than the replacement and deletion operations. This is because the degradation of the compression ratio occurs mainly in the replacement of dependent phrases, and an insertion will have few dependent phrases, since no phrases are being removed. The replacement operation can be viewed as a deletion followed by an insertion, which explains the greater degradation in the MR for replacements compared to insertions.

The middle columns of Table II show the effect of the size of the edit on the MR. Figure 1(b) demonstrates how the deletion, insertion and replacement operations compare for edits of different sizes. Insertion has a constant effect on the MR, which stays just above 1 for any modification size. Deletion has the highest MR for all sizes of edit, with the MR for replacements sandwiched between those of deletion and insertion. The MR of a deletion or replacement generally increases for larger edits. However, the experiments on the Canterbury Corpus show that this is not a universal rule, and depends on the characteristics of the individual file.

Finally, the rightmost columns of Table II show the effect of the position of the edit on the MR. Figure 1(c) shows how insertions, deletions and replacements relate to one another for the strings of calibrated entropy. Generally, insertions have a constant effect, while replacements and deletions have almost identical effects. Deletions and replacements at the start of a file have the greatest effect on the MR, and this quickly tapers off, because the initial phrases at the very start of a file have many dependents. However, specific files such as bible.txt demonstrate exceptions to this principle. The edits made in these tests are very small (0.5% of the file size), hence the small effect on the MR. Once again, the files of lowest entropy (aaa.txt and alphabet.txt) have the greatest MR.

Figure 1: Evaluation of various modifications carried out on the strings generated using the logistic map and the FIR filter. (a) Effect of 100 incremental modifications on the compressed size of strings with calibrated entropies; the High, Medium and Low Entropy classes refer to the entropy of the inserted $\mathbf{str}$. (b) MR vs. relative size of a single modification; the size of an insertion does not affect the MR, while deletions have a larger effect than replacements. (c) Effect of the position of a single local modification on the MR; the insertion ratio is independent of position, while deletion and replacement are nearly identical for these small modifications.

V Conclusion and Future Work

This paper presented the first algorithm to locally modify data compressed by a type of Lempel-Ziv parsing. Existing results on the LZ-End algorithm allowed data to be found and decoded without decompression. Modifying the compressed data, however, was previously not possible without decompressing the entire compressed string first. Our implementation of the algorithm was subjected to extensive testing to ensure correctness.

Investigations using Canterbury Corpus data and files of calibrated entropy show that small local modifications usually have a very small effect on the compression ratio of the data, as do insertions of any size and position, compared to what would be the case when decompressing, modifying and recompressing the data. Local deletions and replacements which affect a larger part of the data have a greater effect on the compression. The effect which the position of the local modification has on the compression is dependent on the properties of the data itself; in the general case, however, modifications made towards the start of the data have a larger effect on the compression ratio.

The algorithm to modify the data is not particularly time-efficient, as it is the first of its kind for an LZ-type parser. The primary factor limiting its performance is the lack of a sliding window in the LZ-End parser. The LZ77 algorithm does incorporate a sliding window into its parsing [16]; however, the LZ77 parser does not currently support local decoding. Should the LZ77 parsing allow local decoding, the methods presented in this paper can be used to allow random editing as well. Such random editing would also be faster than the current method for LZ-End, as the number of dependent phrases would be strictly limited by the size of the sliding window.

Acknowledgment

The authors would like to thank Mark Titchener for his suggestions regarding generating the calibrated entropy strings.

References

  • [1] T. Tao and A. Mukherjee, “LZW Based Compressed Pattern Matching,” in Data Compression Conference, Snowbird, Utah. IEEE, 2004, p. 568.
  • [2] G. Navarro and M. Raffinot, “A General Practical Approach to Pattern Matching Over Ziv-Lempel Compressed Text,” in Combinatorial Pattern Matching, M. Crochemore and M. Paterson, Eds. Springer, 1999, pp. 14–36.
  • [3] S. Kreft and G. Navarro, “LZ77-like Compression with Fast Random Access,” in Data Compression Conference, Snowbird, Utah. IEEE, 2010, pp. 239–248.
  • [4] R. Vestergaard, D. E. Lucani, and Q. Zhang, “A Randomly Accessible Lossless Compression Scheme for Time-Series Data,” in Global Communications Conference (GLOBECOM), Waikoloa, Hawaii.   IEEE, 2019.
  • [5] R. Grossi, R. Raman, S. S. Rao, and R. Venturini, “Dynamic Compressed Strings with Random Access,” in Automata, Languages, and Programming, F. V. Fomin, R. Freivalds, M. Kwiatkowska, and D. Peleg, Eds., Springer, 2013, pp. 504–515.
  • [6] J. Jansson, K. Sadakane, and W.-K. Sung, “CRAM: Compressed Random Access Memory,” in Automata, Languages, and Programming, A. Czumaj, K. Mehlhorn, A. Pitts, and R. Wattenhofer, Eds. Springer, 2012, pp. 510–521.
  • [7] K. Tatwawadi, S. S. Bidokhti, and T. Weissman, “On Universal Compression with Constant Random Access,” in International Symposium on Information Theory (ISIT), Vail, Colorado. IEEE, 2018, pp. 891–895.
  • [8] S. Vatedka and A. Tchamkerten, “Local Decoding and Update of Compressed Data,” in International Symposium on Information Theory (ISIT), Paris, France. IEEE, 2019, pp. 572–576.
  • [9] A. Mazumdar, V. Chandar, G. W. Wornell, “Local Recovery in Data Compression for General Sources”, in International Symposium on Information Theory (ISIT), Hong Kong. IEEE, 2015, pp. 2984–2988.
  • [10] H. Buhrman, P. B. Miltersen, J. Radhakrishnan and S. Venkatesh, “Are Bitvectors Optimal?”, in SIAM Journal on Computing, vol. 31, no. 6, pp. 1723–1744. Society for Industrial and Applied Mathematics, 2002.
  • [11] G. Navarro and Y. Nekrich, “Optimal Dynamic Sequence Representations,” SIAM Journal on Computing, vol. 43, no. 5, pp. 1781–1806. Society for Industrial and Applied Mathematics, 2014.
  • [12] G. Navarro and J. Tarhio, “Boyer—Moore String Matching over Ziv-Lempel Compressed Text,” in Combinatorial Pattern Matching, R. Giancarlo and D. Sankoff, Eds., Springer, 2000, pp. 166–180.
  • [13] S. Kreft and G. Navarro, “Self-indexing based on LZ77,” in Annual Symposium on Combinatorial Pattern Matching, Palermo, Italy. Springer, 2011, pp. 41–54.
  • [14] D. Kempa and D. Kosolobov, “LZ-End Parsing in Linear Time,” in 25th Annual European Symposium on Algorithms, Vienna, Austria, 2017.
  • [15] D. Kempa and D. Kosolobov, “LZ-End Parsing in Compressed Space,” in Data Compression Conference, Snowbird, Utah. IEEE, 2017, pp. 350–359.
  • [16] J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
  • [17] R. Raman, V. Raman, and S. S. Rao, “Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets,” in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California. Society for Industrial and Applied Mathematics, 2002, pp. 233–242.
  • [18] S. Kreft and G. Navarro, “On Compressing and Indexing Repetitive Sequences,” Theoretical Computer Science, vol. 483, pp. 115–133. Elsevier, 2013.
  • [19] R. Arnold and T. Bell, “A Corpus for the Evaluation of Lossless Compression Algorithms,” in Data Compression Conference, Snowbird, Utah. IEEE, 1997, pp. 201–210.
  • [20] W. Ebeling, R. Steuer, and M. Titchener, “Partition-based Entropies of Deterministic and Stochastic Maps,” Stochastics and Dynamics, vol. 1, no. 01, pp. 45–61, 2001.
  • [21] J. A. Storer and T. G. Szymanski, “Data Compression via Textual Substitution”, Journal of the ACM, vol. 29, no. 4, pp. 928–951, 1982.

-A Complexity of Algorithm 2

Algorithm 2 is a fairly straightforward linear algorithm. It iteratively checks each phrase which follows the modification location, to determine whether that phrase is dependent on the section of data being edited. This iterates from the phrase containing the first character which is not removed ($\mathit{rank}_B(j)$) until the end of the compressed data ($n'$).

The complexity of Algorithm 2 is therefore $\Theta(n'-\mathit{rank}_B(j))$.

-B Complexity of Algorithm 3

The $\mathit{rank}_B()$ and $\mathit{select}_B()$ operations are constant-time array look-ups [17]. We therefore focus on the for and while loops in the algorithm.

The for loop iterates once for each dependent phrase and the while loop iterates once for each phrase which is referenced by that dependent. The worst-case scenario is when all phrases following the location of the edit are dependent, and when each dependent phrase references all phrases which are being edited.

The number of phrases which follow the edit location is $n'-\mathit{rank}_B(j)$, and the number of phrases which are edited is $\mathit{rank}_B(j)-\mathit{rank}_B(i)+2$.

The overall complexity of the $\mathit{find\_replacement\_phrases}()$ subroutine is therefore $O((n'-\mathit{rank}_B(j))\times(\mathit{rank}_B(j)-\mathit{rank}_B(i)+2))$.

-C Complexity of Algorithm 4

Algorithm 4 simply encodes the symbols of $\mathbf{str}$ after it has been extended in the first part of the subroutine. The extended $\mathbf{str}$ is then encoded as its own string, without any backreferences to previous parts of the compressed data. The complexity of calculating the LZ-End parsing of a string is $O(S\log s)$, where $S\leq sl$, $l$ is the length of the longest phrase and $s$ is the length of the string [3].

-D Complexity of Algorithm 5

The final subroutine iterates through every phrase which follows the point at which the edit was made, and adjusts the back-references of each phrase which points to some phrase following the edit location.

Each dependent phrase is replaced by one or more replacement phrases in Algorithm 3. The replacement phrases each point to phrases preceding the edit location, and these back-references do not need to be adjusted. The only exception is the final replacement phrase for each dependent, which may point to a phrase immediately after the edit location. The worst-case scenario is when each phrase which comes after the edit location is dependent on the edited phrases (and its backreference therefore needs to be adjusted).

The inner for loop in Step 4 of the algorithm is for didactic purposes only. In practice, a binary search is used, achieving logarithmic search time.
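A minimal sketch (ours) of that optimisation: with $\mathbf{dep}$ sorted in ascending order, the offset contributed by the inner loop is a prefix sum over the replacement sizes, located with a single binary search.

    from bisect import bisect_right
    from itertools import accumulate

    def pointer_offset(q, dep, size_prefix):
        """Offset added to back-reference q by the inner loop of Algorithm 5:
        the total size of all replacement lists rep[j] with dep[j] <= q.
        dep is sorted ascending; size_prefix[k] = rep[0].size + ... + rep[k-1].size."""
        return size_prefix[bisect_right(dep, q)]

    # Dependents at phrase indices 4 and 9 were replaced by 3 and 2 phrases:
    dep = [4, 9]
    size_prefix = [0] + list(accumulate([3, 2]))  # [0, 3, 5]
    assert pointer_offset(7, dep, size_prefix) == 3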

Prior to making the modification, there were $n'-\mathit{rank}_B(j)$ phrases after the edited target phrases. Each of these therefore has at most one replacement phrase which satisfies the criterion in Step 2 of the algorithm, whereby its backreference needs adjusting.

The complexity of this subroutine is therefore $O((n'-\mathit{rank}_B(j))\times\log(n'-\mathit{rank}_B(j)))$.