
A High-Throughput Hardware Accelerator for Lempel-Ziv 4 Compression Algorithm

Tao Chen1, Suwen Song2, and Zhongfeng Wang1,2, Fellow, IEEE
1 School of Electronic Science and Engineering, Nanjing University, Nanjing, China
2 School of Integrated Circuits, Sun Yat-sen University, Shenzhen, China
Email: [email protected], [email protected], [email protected]
Abstract

This paper delves into recent hardware implementations of the Lempel-Ziv 4 (LZ4) algorithm, highlighting two key factors that limit the throughput of single-kernel compressors. Firstly, the actual parallelism exhibited in single-kernel designs falls short of the theoretical potential. Secondly, the clock frequency is constrained by the presence of feedback loops. To tackle these challenges, we propose a novel scheme that restricts each parallelization window to a single match, thus elevating the level of actual parallelism. Furthermore, by restricting the maximum match length, we eliminate the feedback loops within the architecture, enabling a significant boost in throughput. Finally, we present a high-speed hardware architecture. The implementation results demonstrate that the proposed architecture achieves a throughput of up to 16.10 Gb/s, a 2.648× improvement over the state-of-the-art. The new design incurs only an acceptable compression ratio reduction, ranging from 4.93% to 11.68% for various numbers of hash table entries, compared to the LZ4 compression ratio achieved by the official software implementation published on GitHub.

Index Terms:
LZ4, lossless compression, FPGA, high throughput, hardware accelerator.

I Introduction

In the era of big data, massive amounts of data pose significant challenges to transmission bandwidth, data storage, and data processing. As an effective method to reduce data size, lossless compression algorithms have been widely used in most data application scenarios[1, 2, 3]. For instance, the Lempel-Ziv 77 (LZ77) algorithm[4] and its variant, the Lempel-Ziv-Stac (LZS) algorithm, are utilized for compressing IP packets to reduce network transmission overhead.

Most compression algorithms were originally implemented in software, resulting in relatively slow speeds. However, with the rapid development of digital technologies such as big data and multimedia, the demand for high-throughput implementations of compression algorithms is steadily growing. Consequently, there is rising interest in moving these compression tasks from software platforms to hardware, where field-programmable gate arrays (FPGAs) have emerged as a popular choice.

The LZ4 algorithm, introduced by Collet[5] in 2011, is a variant of LZ77. LZ4 is known for its outstanding compression speed, surpassing other LZ-family algorithms in practical implementations. Besides, a comprehensive analysis of LZ4 from the perspective of hardware design[6] has revealed additional benefits of the algorithm, including low latency and high resource efficiency, making it a highly advantageous choice for hardware-based compression tasks. Hence, we select LZ4 for our design in this paper.

Several FPGA-based acceleration designs for LZ4 compression have been proposed[6, 7, 8, 9, 10, 11]. In 2015, Bartík et al.[6] pioneered the use of an FPGA to implement LZ4 for compressing 4K transmissions. Liu et al.[7] refined this work and introduced a modified LZ4 format to raise the clock frequency. To further improve throughput, in 2019 Bartík et al. and Beneš et al.[9, 10] exploited parallelism in the match searching stage, yielding a significant increase in throughput. Building on the parallel implementation, [11] proposed a scheme that minimizes resource utilization so that multiple kernels can be deployed within the same resource constraints, again enhancing throughput.

While some existing architectures[8, 9, 10, 11] exploit parallelism to improve throughput, they avoid sacrificing compression ratio by traversing all non-overlapping matches in the extended match stage. This significantly reduces the parallelism of the extended match stage below that of the preceding stages, and the mismatch greatly lowers the overall throughput. Moreover, in current parallel hardware architectures[9, 10, 11], timing dependencies in the address signals of the extended match stage form feedback loops, which prevent further frequency increases through pipeline insertion and thereby cap the achievable throughput.

The objective of this paper is to address the aforementioned issues to improve throughput. To tackle the mismatch in parallelism, we constrain each parallelization window to contain a single match, ensuring that every component of the architecture exhibits the same level of parallelism. To remove the loops in the architecture, we impose a maximum limit on the match length during the extended match stage. This makes the architecture fully feedforward, enabling the insertion of pipeline stages to raise the frequency, and it guarantees predictable output delays, albeit at the cost of some compression ratio loss. In certain scenarios, such as multimedia applications, throughput and predictability matter more than compression ratio[6, 12], so this sacrifice is acceptable. The implementation of our proposed architecture on FPGA demonstrates a throughput of up to 16.10 Gb/s, which is 2.648× higher than that of the best-performing architecture in the open literature.

The rest of the paper is structured as follows. Section II provides a review of relevant works and investigates the throughput bottlenecks in current parallel architectures. In Section III, we introduce solutions to mitigate throughput bottlenecks while analyzing the impact on compression ratio. The hardware architecture of our proposed scheme and a comparison with other FPGA designs of LZ4 algorithm are presented in Section IV. Finally, Section V concludes the paper.

II Preliminaries

II-A LZ4 Algorithm

LZ4 is a dictionary-based compression algorithm[13], which was initially conceived as a distinct compressed data format. Compressed data files are composed of LZ4 sequences, which include a token, literal length, literals, offset, and match length, as depicted in Fig. 1.


Figure 1: Data format of an LZ4 sequence.

LZ4 achieves data compression by leveraging statistical redundancy in data patterns and employing a mechanism to eliminate repeated literals. It identifies recurring data strings from previous data and substitutes them with an offset-length pair, utilizing a dictionary to determine the offset of previous literals and their match length.

II-B Parallel Implementation of the LZ4 Algorithm

A typical hardware parallel implementation of the LZ4 compression algorithm consists of five major stages, including hash calculation, hash table update, match searching, extended match, and sequence encoding, as illustrated in Fig. 2. We use a designed Parallelization Window Size (PWS) to represent parallelism. Within this parallelization window, a set of spatial locations enables concurrent execution of the compression algorithm across multiple consecutive bytes.


Figure 2: The flow chart of the LZ4 algorithm.

Hash Calculation

Fibonacci hashing, which achieves simplicity and efficiency through a single multiplication by the constant 2654435761, is the most widely adopted method for hash calculation[6, 7, 8, 9, 10, 11]. Multiplication by this 32-bit constant can be implemented efficiently with four DSP48 slices on the FPGA. The generated IP (Intellectual Property) core is pipelined, and the multiplication result is truncated so that the most significant bits serve as the hash table address.
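For reference, the following is a minimal C sketch of this computation under stated assumptions: the function name is illustrative, and HASH_BITS = 8 matches the 256-entry hash table used later in Section IV-A.

```c
#include <stdint.h>

/* A minimal sketch of the Fibonacci hash described above. HASH_BITS
 * is an assumption: the hardware keeps the most significant bits of
 * the 32-bit product as the hash table address (8 bits for the
 * 256-entry table of Section IV-A). */
#define HASH_BITS 8

static inline uint32_t fib_hash(uint32_t four_bytes)
{
    /* Multiply by the Fibonacci constant, keep the top HASH_BITS bits. */
    return (four_bytes * 2654435761U) >> (32 - HASH_BITS);
}
```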

Hash Table Update

LZ4 utilizes a hash table to swiftly identify repeated data with a constant time complexity of O(1). Indexed by the hash value of the current string, the hash table retrieves the candidate address and four bytes of the candidate string, where four bytes is the minimum match length specified by the LZ4 algorithm. Subsequently, the address of the current string and its corresponding four-byte data are stored, replacing the previous ones[7, 9, 10].
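The read-before-write behavior of one table entry can be sketched in C as follows; the entry layout mirrors the one detailed in Section IV-A (a buffer pointer plus the stored string's first four bytes), and all names are illustrative.

```c
#include <stdint.h>

/* One hash-table entry: the address of a previous string and its
 * first four bytes (the LZ4 minimum match length). */
typedef struct {
    uint16_t addr;    /* pointer into the 64 KB input buffer  */
    uint32_t first4;  /* first 4 bytes of the stored string   */
} hash_entry_t;

static hash_entry_t hash_table[256];

/* Return the previous (candidate) entry for hash value h, then
 * overwrite it with the current position and its 4-byte string. */
static hash_entry_t update_and_lookup(uint32_t h, uint16_t pos, uint32_t cur4)
{
    hash_entry_t candidate = hash_table[h];  /* retrieve old entry    */
    hash_table[h].addr   = pos;              /* store current address */
    hash_table[h].first4 = cur4;             /* and its 4-byte string */
    return candidate;
}
```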

Match Searching

The LZ4 algorithm compares the four bytes of the candidate string with the current four bytes to validate the match. To identify the longest match, the earliest candidate among the PWS possible matches should be selected, i.e., the one whose input string lies nearest the beginning of the parallelization window. Upon a successful match between the candidate and current strings, an extended match process is initiated to search for a longer match. Conversely, if no match is found, the input window advances by PWS bytes and the process repeats.
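In software terms, this selection is a priority pick across the PWS lanes, as the sketch below illustrates (names are ours); in hardware, the loop collapses into a PWS-wide priority encoder.

```c
#include <stdint.h>

#define PWS 8  /* parallelization window size used throughout this paper */

/* Pick the lowest-index lane whose candidate 4 bytes equal the current
 * 4 bytes; return -1 if no lane matches. */
static int select_earliest_match(const uint32_t cur4[PWS],
                                 const uint32_t cand4[PWS])
{
    for (int lane = 0; lane < PWS; lane++)
        if (cur4[lane] == cand4[lane])
            return lane;  /* earliest candidate wins */
    return -1;            /* no match: window advances by PWS bytes */
}
```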

Extended Match

Upon detecting a 4-byte match, we proceed by fetching PWS bytes of data from the address obtained by advancing 4 bytes from the candidate address and comparing them with the current data. Depending on the result of this comparison, we determine whether to continue reading PWS bytes from the address obtained by advancing (PWS + 4) bytes from the candidate address in the following clock cycle, or to read PWS bytes from a new address. This process repeats iteratively until the maximum match length is found.

For example, assume that the PWS is set to 8 bytes, the current string is “abcdefghijklmnopqrst”, and the candidate string is “abcdefghijklmnoheshe”. Upon discovering the 4-byte substring match “abcd”, we proceed to compare the next 8 bytes, “efghijkl”, which all match. Since the match length equals the window size, we fetch the next 8-byte substring, “mnoheshe”, and find that only 3 bytes (“mno”) match. The match length is now less than PWS bytes, which means the maximum match length (here, 4 + 8 + 3 = 15 bytes) has been found.
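The C sketch below mirrors this iteration under the stated assumptions; each outer iteration stands for one clock cycle comparing PWS bytes, and buffer-bound checks are omitted for brevity. The cycle-to-cycle address dependence it exhibits is precisely the feedback loop discussed next.

```c
#include <stdint.h>

#define PWS 8

/* Extend a confirmed 4-byte match PWS bytes at a time, returning the
 * maximum match length (15 for the example strings above). */
static int extend_match(const uint8_t *cur, const uint8_t *cand)
{
    int len = 4;                        /* the confirmed 4-byte match */
    for (;;) {
        int k = 0;
        while (k < PWS && cur[len + k] == cand[len + k])
            k++;                        /* compare PWS bytes this "cycle" */
        len += k;
        if (k < PWS)
            return len;                 /* mismatch: maximum length found */
        /* all PWS bytes matched: fetch the next PWS bytes next cycle,
         * from an address that depends on this cycle's outcome */
    }
}
```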

Due to the temporal logic dependency, this process inevitably creates a feedback loop in the circuit[9, 10, 11], which becomes a bottleneck for increasing the frequency. As illustrated in Fig. 2, this loop is highlighted by the red line.

Besides, in some implementations, the compression ratio is optimized by traversing all non-overlapping matches in the extended match stage, which can be achieved either by utilizing FIFOs to segregate the match searching module and the extended match module[10] or by advancing the starting address of the subsequent parallelization window to the end of the ongoing match[11]. However, this adjustment inherently diminishes the effective parallelism level of the compression kernel in comparison to the theoretical parallelism represented by PWS. This occurs because, during the extended match stage, the sliding step becomes narrower than the PWS, ultimately restricting the overall throughput.

Sequence Encoding

Upon completion of a match round, a comprehensive compression data set is produced, comprising literals, literal length, and details regarding matches (length and offset). Subsequently, the output encoding module encodes this information into sequences as shown in Fig. 1.
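For concreteness, the sketch below emits one sequence in the standard LZ4 block format of Fig. 1: the token's high nibble holds the literal length and its low nibble holds the match length minus 4, a nibble value of 15 signals additional length bytes (each adding up to 255, the last one below 255), and the offset is written as two little-endian bytes. Function names are illustrative and output bounds checks are omitted.

```c
#include <stdint.h>
#include <string.h>

/* Emit the extra length bytes used when a token nibble saturates at 15. */
static uint8_t *emit_len(uint8_t *out, uint32_t n)
{
    while (n >= 255) { *out++ = 255; n -= 255; }
    *out++ = (uint8_t)n;
    return out;
}

/* Encode one LZ4 sequence; match_len includes the 4-byte minimum. */
static uint8_t *emit_sequence(uint8_t *out, const uint8_t *literals,
                              uint32_t lit_len, uint16_t offset,
                              uint32_t match_len)
{
    uint32_t ml = match_len - 4;                        /* minmatch is 4 */
    *out++ = (uint8_t)(((lit_len < 15 ? lit_len : 15) << 4)
                     |  (ml      < 15 ? ml      : 15)); /* token         */
    if (lit_len >= 15) out = emit_len(out, lit_len - 15);
    memcpy(out, literals, lit_len);                     /* copy literals */
    out += lit_len;
    *out++ = (uint8_t)(offset & 0xFF);                  /* offset, LE    */
    *out++ = (uint8_t)(offset >> 8);
    if (ml >= 15) out = emit_len(out, ml - 15);
    return out;
}
```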

II-C Summary

We have conducted a thorough analysis of the typical LZ4 implementation on FPGA and identified two primary factors that significantly limit the potential for throughput enhancement in parallel architectures.

  • Some implementations employ FIFO structures[10] or advance the parallelization window[11] to capture all non-overlapping matches, aiming to enhance the compression ratio. However, this approach inherently leads to a decrease in the effective degree of parallelism compared to the theoretical parallelism.

  • The presence of a loop within the current parallel architectures[9, 10, 11] acts as a bottleneck, constraining the operating frequency and thereby limiting the maximum achievable throughput.

III The Proposed High-Throughput Compressor

We propose solutions aimed at addressing the two aforementioned disadvantageous factors that constrain throughput and evaluate their influence on the compression ratio.

III-A Enhancement of Parallelism

As previously discussed in Section II-B, researchers have employed FIFOs or advanced the starting address of the next parallelization window to enhance the compression ratio, at the expense of reduced parallelism. These operations achieve a higher compression ratio because they avoid overlooking scenarios in which multiple non-overlapping matches occur within the same parallelization window.

For instance, with PWS set to 8 bytes, there is only one scenario in which multiple non-overlapping matches exist within the parallelization window. As depicted in Fig. 3(a), the first match ends within the current parallelization window, and a subsequent match starts within the remaining data of the same window. In this situation, after spending one clock cycle compressing the first match, compressing the second match within the same window requires an additional clock cycle. This yields a lower degree of parallelism than the preceding compression stages, where one parallelization window is fully processed in a single clock cycle. Whenever the scenario of Fig. 3(a) occurs, the actual parallelism falls below the theoretical parallelism, by an amount that depends on how frequently such scenarios appear in the dataset. This decrease in actual parallelism ultimately reduces the overall throughput. For example, Beneš et al.[10] saw throughput drop from 10 Gbps to 6.08 Gbps after introducing the FIFO, while Liu et al.[11] saw throughput drop from 6.4 Gbps to 4.5 Gbps after advancing the starting address of the subsequent parallelization window. In both cases, throughput decreased by approximately 30% to 40% due to the decline in parallelism.


Figure 3: Examples of different schemes within the current parallelization window.

Hence, to raise the actual parallelism to the theoretical level, we constrain each parallelization window to process only a single match, as depicted in Fig. 3(b). This approach does not search for further matches within the remaining data of the current window; instead, it resumes match searching from the beginning of the next parallelization window. Inevitably, some data within the matched segment is then treated as literals, as marked by the orange box in Fig. 3(b), leading to a slight but acceptable compromise in compression ratio.
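The difference between the two window-advance policies reduces to a few lines of C; this is a simplified sketch, and match_end and the function names are illustrative.

```c
#include <stdint.h>

#define PWS 8

/* Proposed scheme (Fig. 3(b)): a constant stride of PWS, so one window
 * is retired per clock cycle regardless of where the single match ends. */
static uint32_t advance_proposed(uint32_t pos)
{
    return pos + PWS;
}

/* Prior schemes (Fig. 3(a)): the stride depends on where the ongoing
 * match ends, so a second match inside the same window costs an extra
 * cycle and the effective parallelism drops below PWS. */
static uint32_t advance_prior(uint32_t pos, uint32_t match_end)
{
    return (match_end > pos + PWS) ? match_end : pos + PWS;
}
```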

Subsequently, we evaluate the compression ratio degradation caused by our proposed approach using standard test datasets, commonly referred to as corpora. Such corpora encompass a diverse range of input data, ensuring relevance to real-world scenarios. The Calgary corpus[14] is selected for the compression ratio evaluation. The performance of the aforementioned method is presented in Table I.

TABLE I: Compression Ratioa of Different Schemes within the Parallelization Window on the Calgary Corpus

Schemeb                 Number of entries in the hash table
                          64     128    256    512    1024   2048   4096   8192
Multiple matches         1.277  1.340  1.419  1.512  1.615  1.725  1.830  1.908
Only a single match      1.266  1.325  1.400  1.488  1.582  1.671  1.750  1.805
Attenuation ratio        0.86%  1.12%  1.34%  1.59%  2.04%  3.13%  4.37%  5.39%
  • a The compression ratio is calculated as (average size of all original files in the corpus) / (average size of all compressed files).

  • b In our implementation, each data block is treated as independent. We set the input buffer size to 64 KB and implement a parallelism level of 8.

As shown in Table I, the compression ratio suffers from a modest attenuation ranging from 0.86% to 5.39% across various numbers of hash table entries, which can be considered negligible when contrasted with the throughput enhancement achieved by aligning the actual parallelism with the theoretical parallelism.

III-B Elimination of the Loop Limitations

Enhancing the frequency of the circuit is one of the most straightforward and effective ways to boost throughput. The underlying cause of the frequency bottleneck in contemporary parallel architectures[9, 10, 11], as extensively illustrated in Section II-B, is the presence of a loop within the architecture. In order to break this loop, it is essential to resolve the time dependency of the address signal during the extended match stage.

In this paper, we aim to establish an upper boundary for the maximum match length, ensuring that the search for the maximum match length can be accomplished within a single clock cycle. By this approach, we no longer need to rely on the match results of the current cycle to determine the address for the next cycle, thus effectively eliminating the temporal dependency and breaking the loop within the architecture.

Furthermore, this operation ensures predictable output delay. By adopting the proposed strategy, the maximum match length information can be promptly obtained within a single clock cycle during the extended match stage, immediately following the discovery of the first match. This attribute distinguishes our approach from the original extended match stage, which exhibited unpredictable delay due to its reliance on the size of the match length.
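A sketch of the capped comparison follows. Because all MAX_MATCH - 4 byte comparisons are mutually independent, the loop collapses in hardware into a single combinational compare-and-count; MAX_MATCH = 36 anticipates the limit chosen later in this section, and the function name is illustrative.

```c
#include <stdint.h>

#define MAX_MATCH 36  /* hard cap on the match length */

/* Extend a confirmed 4-byte match up to MAX_MATCH bytes. All byte
 * compares are independent, so the result is available one cycle
 * after the first match is found. Bounds checks omitted. */
static int extend_match_capped(const uint8_t *cur, const uint8_t *cand)
{
    int len = 4;
    while (len < MAX_MATCH && cur[len] == cand[len])
        len++;
    return len;
}
```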

The disadvantage of establishing a maximum match length limit is apparent. As clearly depicted in Fig. 4, this approach may result in the fragmentation of a long match into multiple shorter ones. Additionally, any small match at the end, measuring less than 4 bytes, will be discarded. This is due to the output format of the LZ4 algorithm, which requires a minimum match length of 4 bytes.


Figure 4: (a) Original vs (b) Modified extended match stage.
TABLE II: Compression Ratioa of the Modified Extended Match Stage on the Calgary Corpus

Number of entries        Maximum match length
in the hash tableb       Not limited  Limit to 12  Limit to 20  Limit to 36  Limit to 68
64                          1.277        1.188        1.204        1.220        1.238
128                         1.340        1.235        1.255        1.275        1.295
256                         1.419        1.292        1.317        1.341        1.366
512                         1.512        1.362        1.394        1.423        1.451
1024                        1.615        1.439        1.478        1.511        1.545
2048                        1.725        1.518        1.564        1.602        1.640
4096                        1.830        1.591        1.643        1.685        1.727
8192                        1.908        1.649        1.705        1.751        1.797
  • a The compression ratio is calculated as (average size of all original files in the corpus) / (average size of all compressed files).

  • b In our implementation, each data block is treated as independent. We set the input buffer size to 64 KB and implement a parallelism level of 8.

To quantitatively assess the impact of this measure on the compression ratio, we conduct tests utilizing the Calgary corpus. As depicted in Table II, as the maximum match length limit increases, the decrease in compression ratio gradually wanes.

However, setting an excessively high maximum match length limit is not always optimal. As the limit increases, the hardware resources required for match comparison grow accordingly. Consequently, to balance compression ratio against hardware resource utilization, we choose a maximum match length limit of 36 bytes. This choice results in a compression ratio reduction of about 4.46% to 8.23%.

III-C Combination of the Two Schemes

This section evaluates the change in the compression ratio when combining the two proposed schemes. The results, tested using the Calgary corpus, are presented in Table III. The compression ratio reduction ranges from 4.93% to 11.68% with various numbers of hash table entries. Given that throughput often takes precedence over compression ratio in numerous applications of lossless compression algorithms[6, 12], a compromise of about 10% in compression ratio to achieve a substantial enhancement in throughput is deemed acceptable.

TABLE III: The Compression Ratioa of the Combined Scheme on the Calgary Corpus

Schemeb                 Number of entries in the hash table
                          64     128    256    512    1024   2048   4096    8192
GitHub[15]               1.277  1.340  1.419  1.512  1.615  1.725  1.830   1.908
Combinationc             1.214  1.267  1.331  1.409  1.491  1.569  1.638   1.685
Attenuation ratio        4.93%  5.45%  6.20%  6.81%  7.68%  9.04%  10.49%  11.68%
  • a The compression ratio is calculated as (average size of all original files in the corpus) / (average size of all compressed files).

  • b In our implementation, each data block is treated as independent. We set the input buffer size to 64 KB and implement a parallelism level of 8.

  • c Here we limit the maximum match length to 36 bytes.

IV Hardware Architecture and Implementation Results

In this section, we introduce the detailed architecture of the parallel compression kernel, as shown in Fig. 5. Furthermore, we evaluate the hardware resource utilization of the proposed solutions on the FPGA platform and compare it with the implementation results of prior works.


Figure 5: The detailed architecture of the parallel compression kernel.

IV-A Hardware Architecture

The parallel compression kernel consists of several modules, all designed to facilitate multi-byte parallel processing.

Input Buffer

The input buffer releases a continuous stream of data consisting of PWS bytes per clock cycle. As depicted in the left portion of Fig. 5, the processing window slides forward by PWS bytes, producing a fresh set of PWS bytes in each cycle. Besides, the LZ4 format limits the maximum input data size to 64 KB, making it compatible with the Level-1 cache of modern processors. For convenience, we also set the input buffer to this maximum size of 64 KB.

Word Shift

In this module, the input is parallelized into PWS input strings. As depicted in the upper-left corner of Fig. 5, each of these PWS input strings is formed by assembling PWS + 3 bytes.
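As a rough software analogue, the sketch below draws PWS overlapping 4-byte words from a window of PWS + 3 consecutive bytes so that every position in the window can be hashed in parallel; this is one reading of the assembly described above, with illustrative names, and byte order is immaterial to the sketch.

```c
#include <stdint.h>
#include <string.h>

#define PWS 8

/* Form PWS overlapping 4-byte strings from PWS + 3 input bytes:
 * lane i receives bytes i .. i+3 of the window. */
static void word_shift(const uint8_t *in /* PWS + 3 bytes */,
                       uint32_t strings[PWS])
{
    for (int lane = 0; lane < PWS; lane++)
        memcpy(&strings[lane], in + lane, 4);
}
```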

Hash Calculation

During each cycle, the Hash Calculation module accepts PWS strings from the Word Shift module and produces their corresponding hash values for transmission to the subsequent module. As we have mentioned in Section II-B, DSPs are employed for executing the PWS-way Fibonacci hash computation.

Hash Table

This module features a parallel hash table that supports the simultaneous updating of PWS records. The Live Value Table (LVT) was chosen as the method for constructing a multi-port memory, over an alternative approach that uses an XOR technique to recover the most recent value[9, 10]. Multi-banking is another prevalent method for building multi-port memory and minimizes BRAM usage[11]; nonetheless, it has a notable drawback: bank conflicts may cause considerable compression ratio attenuation. Consequently, we opt for an LVT-based architecture to realize the multi-port hash table, as depicted in detail at the bottom of Fig. 5.

The shared hash table contains 256 entries, as in [9, 10], each composed of an input buffer pointer (up to 16 bits) and a 4-byte string. Unlike the LZ4 software implementation available on GitHub[15], which stores only address pointers in the hash table, we additionally store the 4-byte data element to resolve potential hash collisions. In our LVT scheme, the LVT also stores validity flags[8], eliminating the need to clear the dictionary between input data blocks.
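A conceptual C sketch of an LVT-based multi-port memory follows, assuming one bank per write port; sizes and names are illustrative, and each 64-bit entry stands for the packed pointer-plus-string record described above.

```c
#include <stdint.h>

#define PORTS 8    /* one write port per parallel lane (illustrative) */
#define DEPTH 256  /* hash table entries, as above */

static uint64_t bank[PORTS][DEPTH];  /* one memory bank per write port */
static uint8_t  lvt[DEPTH];          /* index of the last-writing port */
static uint8_t  valid[DEPTH];        /* validity flag, per the text    */

/* Write through a given port and record it as the live copy. */
static void lvt_write(int port, uint32_t addr, uint64_t entry)
{
    bank[port][addr] = entry;
    lvt[addr]   = (uint8_t)port;
    valid[addr] = 1;
}

/* Read the freshest copy; returns 0 when no valid entry exists, so the
 * dictionary never needs clearing between input blocks. */
static int lvt_read(uint32_t addr, uint64_t *entry)
{
    if (!valid[addr])
        return 0;
    *entry = bank[lvt[addr]][addr];
    return 1;
}
```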

Match Searching

In this module, the match result is confirmed by comparing the current and candidate strings. To determine the longest match, we select the earliest candidate among the PWS potential matches, i.e., the one whose input string lies nearest the beginning of the parallelization window.

Extended Match

Based on the selection signal generated from the previous clock cycle, one of the PWS input data is selected for comparison with the candidate string read from the data memory module. This comparison aims to identify the maximum length of match data.

Sequence Encoding

This module is responsible for encoding the generated compression information into an LZ4 sequence as illustrated in Fig. 1.

Output Buffer

In previous studies[6], a comprehensive analysis of the output buffer size has been conducted. For simplicity, here we opt to set the output buffer size to be the same as that of the input buffer.

IV-B Implementation Results

We implement our hardware architecture on an FPGA to build a high-throughput compression kernel. The selected platform is the Xilinx Kintex-7 XC7K325TFBG676-3, the same device used in [7].

Given that previous works[9, 10, 11] consistently set the PWS of the LZ4 parallel architecture to 8 bytes, we also set our PWS to 8 bytes for a fair comparison. The implementation results are compared with other FPGA implementations of the LZ4 algorithm. Table IV presents a comparison of the designs in terms of frequency, resource utilization, and throughput. The respective implementation platforms (FPGAs) are comparable in terms of resources and/or frequency.

TABLE IV: The Implementation Results of Different LZ4 Acceleration Kernels

Scheme                   Bartik's[6]  MLZ4C[7]   Benes'[10]  Xilinx[16]  HybriDC[11]  Ours
Intra-core parallelism        ×           ×           ✓           ×           ✓          ✓
FPGA Chip                 XC7A100T    XC7K325T    XCKU040      XCU200     10AX048H2   XC7K325T
FPGA Technology             28nm        28nm        20nm        16nm        20nm        28nm
LUT slices                   764         573        14076       3000        12336       10863
FF slices                    375         937        2803        3500        4081        3378
DSPs                           3           4          32          NA          16          32
BRAM (Kbits)a                612        2484        2952        1908        1460        2880
Frequency (MHz)              146         240       156.25        300         100       251.57
Throughput (Gbps)             NA        1.92        6.08        2.32        4.50       16.10
Throughput/slicesb            NA        1.272       0.360       0.357       0.274       1.131
(Gbps/K slices)
  • a Due to variations in the size of block RAMs across different platforms, we standardized the comparison of on-chip memory usage by converting the number of BRAMs into a consistent capacity measurement.

  • b Summation of LUT and FF slices.

Table IV demonstrates that, despite being implemented in the oldest technology node, our single-kernel parallel architecture achieves a frequency of 251.57 MHz, which surpasses the highest frequency among current parallel architectures (Beneš[10]) by 1.610× and trails only the serial architecture (Xilinx[16]), owing to our parallel implementation and the technology difference. As a result of this enhanced frequency, our architecture attains the maximum throughput among all existing LZ4 architectures, reaching a peak of 16.10 Gbps. This throughput is 2.648× higher than the current leading throughput reported by Beneš[10].

In terms of hardware resources, our design also holds advantages. By imposing the two limiting conditions, we eliminate the feedback control loops of the original circuit, reducing LUT and FF usage. As can be seen from Table IV, setting aside the MLZ4C design[7], which is incompatible with the official LZ4 tool due to its altered compression format, our design achieves an area efficiency of 1.131 Gbps/K slices, notably surpassing the other architectures, including Beneš[10], Xilinx[16], and HybriDC[11].

V Conclusion

This paper introduces a single-kernel parallel architecture for the LZ4 compression algorithm. We propose two methods to break the throughput bottleneck of current architectures. First, we restrict each parallelization window to a single match to enhance the actual parallelism. Second, we limit the maximum match length to break the feedback loops in the architecture and thus improve the frequency. Both methods sacrifice a certain degree of compression ratio. The FPGA implementation shows that the throughput reaches a maximum of 16.10 Gbps, surpassing the highest throughput reported for existing LZ4 architectures by 2.648×.

VI Acknowledgment

This work was supported in part by the National Key R&D Program of China under Grant 2022YFB4400604, and in part by the Shenzhen Science and Technology Program (Grant No. RCBS20231211090610015). (Corresponding authors: Suwen Song; Zhongfeng Wang.)

References

  • [1] P. Kavitha, “A survey on lossless and lossy data compression methods,” Int. J. Comput. Sci. & Eng. Technol., vol. 7, no. 03, pp. 110–114, 2016.
  • [2] U. Ferraro Petrillo, F. Palini, G. Cattaneo, and R. Giancarlo, “FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy,” BMC bioinformatics, vol. 22, pp. 1–21, 2021.
  • [3] Y. Jia, Z. Shao, and F. Chen, “SlimCache: An efficient data compression scheme for flash-based key-value caching,” ACM Trans. Storage (TOS), vol. 16, no. 2, pp. 1–34, 2020.
  • [4] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337–343, 1977.
  • [5] Y. Collet, “Real time data compression: LZ4 explained,” 2011.
  • [6] M. Bartík, S. Ubik, and P. Kubalik, “LZ4 compression algorithm on FPGA,” in Proc IEEE Int. Conf. Electron Circuits Syst. (ICECS), 2015, pp. 179–182.
  • [7] W. Liu, F. Mei, C. Wang, M. O’Neill, and E. E. Swartzlander, “Data Compression Device Based on Modified LZ4 Algorithm,” IEEE Trans. Consumer Electron., vol. 64, no. 1, pp. 110–117, 2018.
  • [8] S. M. Lee, J. H. Jang, J. H. Oh, J. K. Kim, and S. E. Lee, “Design of hardware accelerator for Lempel-Ziv 4 (LZ4) compression,” IEICE Electronics Express, vol. 14, no. 11, p. 20170399, 2017.
  • [9] M. Bartík, T. Beneš, and P. Kubalík, “Design of a High-Throughput Match Search Unit for Lossless Compression Algorithms,” in IEEE Annu. Comput. Commun. Workshop Conf. (CCWC), 2019, pp. 0732–0738.
  • [10] T. Beneš, M. Bartík, and P. Kubalík, “High Throughput and Low Latency LZ4 Compressor on FPGA,” in Int. Conf. Reconfigurable Comput. FPGAs (ReConFig), 2019, pp. 1–5.
  • [11] P. Liu, Z. Wei, C. Yu, and S. Chen, “HybriDC: A Resource-Efficient CPU-FPGA Heterogeneous Acceleration System for Lossless Data Compression,” Micromachines, vol. 13, no. 11, p. 2029, 2022.
  • [12] M. Bartík, T. Beneš, and P. Kubalík, “An In-Sight Into How Compression Dictionary Architecture Can Affect the Overall Performance in FPGAs,” IEEE Access, vol. 8, pp. 183101–183116, 2020.
  • [13] K. Sayood, Introduction to data compression. Morgan Kaufmann, 2017.
  • [14] T. Bell, I. H. Witten, and J. G. Cleary, “Modeling for text compression,” ACM Computing Surveys (CSUR), vol. 21, no. 4, pp. 557–591, 1989.
  • [15] T. Matsuoka, “LZ4 - Extremely fast compression,” https://github.com/lz4/lz4/, accessed on 25 March 2024.
  • [16] Xilinx, “Xilinx LZ4 Streaming Compression,” https://xilinx.github.io/Vitis_Libraries/data_compression/2022.1/source/L2/lz4_compress_streaming.html, accessed on 25 March 2024.