
Deep Lossy Plus Residual Coding for
Lossless and Near-lossless Image Compression

Yuanchao Bai, Xianming Liu, Kai Wang, Xiangyang Ji, Xiaolin Wu, Wen Gao

Yuanchao Bai, Xianming Liu and Kai Wang are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China. E-mail: {yuanchao.bai, csxm}@hit.edu.cn, [email protected]. Xiangyang Ji is with the Department of Automation, Tsinghua University, Beijing, 100084, China. E-mail: [email protected]. Xiaolin Wu is with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, L8G 4K1, Ontario, Canada. E-mail: [email protected]. Wen Gao is with the School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China. E-mail: [email protected].
(Corresponding author: Xianming Liu)
Abstract

Lossless and near-lossless image compression is of paramount importance to professional users in many technical fields, such as medicine, remote sensing, precision engineering and scientific research. However, despite rapidly growing research interest in learning-based image compression, no published method offers both lossless and near-lossless modes. In this paper, we propose a unified and powerful deep lossy plus residual (DLPR) coding framework for both lossless and near-lossless image compression. In the lossless mode, the DLPR coding system first performs lossy compression and then lossless coding of residuals. We formulate the joint lossy and residual compression problem within the framework of VAEs, and add autoregressive context modeling of the residuals to enhance the lossless compression performance. In the near-lossless mode, we quantize the original residuals to satisfy a given \ell_{\infty} error bound, and propose a scalable near-lossless compression scheme that works for variable \ell_{\infty} bounds instead of training multiple networks. To expedite the DLPR coding, we increase the degree of algorithm parallelization with a novel design of coding context, and accelerate the entropy coding with an adaptive residual interval scheme. Experimental results demonstrate that the DLPR coding system achieves state-of-the-art lossless and near-lossless image compression performance with competitive coding speed.

Index Terms:
Deep Learning, Image Compression, Lossless Compression, Near-lossless Compression, Lossy Plus Residual Coding.

1 Introduction

In many important technical fields, such as medicine, remote sensing, precision engineering and scientific research, imaging in high spatial, spectral and temporal resolutions is instrumental to discoveries and innovations. As achievable resolutions of modern imaging technologies steadily increase, users are inundated by the resulting astronomical amount of image and video data. For example, pathology imaging scanners can easily produce 1GB or more data per specimen. For the sake of cost-effectiveness and system operability (e.g., real-time access via clouds to high-fidelity visual objects), acquired raw images of high resolutions in multiple dimensions have to be compressed.

Unlike in consumer applications (e.g., smartphones and social media), where users are mostly interested in the visual appeal of decompressed images and can be quite oblivious to compression distortions at the signal level, high fidelity of decompressed images is of paramount importance to professional users in many technical fields. In the latter case, the current gold standard is mathematically lossless image compression. Shannon's source coding theorem [1] establishes the theoretical foundation of lossless image compression: it proves that the expected codelength is lower-bounded by the information entropy of the true probability distribution of the image data. In practice, the compression performance of any specific lossless image codec depends on how well it can approximate this unknown probability distribution, in order to approach the theoretical lower bound. Despite years of research, typical compression ratios of traditional lossless image codecs [2, 3, 4, 5] are limited to between 2:1 and 3:1. An alternative way to improve the compression performance while keeping the high fidelity of decompressed images is near-lossless image compression [6, 7]. Instead of being mathematically lossless, near-lossless image compression imposes a strict \ell_{\infty} constraint on the decompressed images, requiring the maximum reconstruction error of each pixel to be no larger than a given tight numerical bound. By introducing the \ell_{\infty}-constrained error bound, near-lossless image compression can guarantee the reliability of each pixel while breaking the theoretical compression limit of lossless image compression. When the tight error bound is set to zero, near-lossless image compression is equivalent to lossless image compression. Traditional lossless image codecs, such as JPEG-LS [2] and CALIC [3, 8], provide users with both lossless and near-lossless image compression in order to meet the bandwidth and cost-effectiveness requirements of diverse imaging and vision systems.

With the fast progress of deep neural networks (DNNs), learning-based image compression has achieved tremendous progress over the last five years. However, most of these methods are designed for rate-distortion optimized lossy image compression [9], which cannot realize lossless or near-lossless image compression even at sufficiently high bit-rates. Recently, a number of research teams have embarked on developing end-to-end optimized lossless image compression methods [10, 11, 12, 13, 14, 15, 16, 17]. These methods take advantage of sophisticated deep generative models, such as autoregressive models [18, 19, 20], flow models [21] and variational auto-encoder (VAE) models [22], to learn the unknown probability distribution of given image data and entropy encode the image data to bitstreams according to the learned models. While they achieve compression performance superior to traditional lossless image codecs, existing learning-based lossless image compression methods usually suffer from excessively slow coding speed and can hardly be applied to practical full-resolution image compression tasks. It is also regrettable that, unlike for traditional JPEG-LS [2] and CALIC [3, 8], no studies (except our recent work [23]) have been carried out on learning-based near-lossless image compression, despite its great potential as discussed above.

In this paper, we propose a unified and powerful deep lossy plus residual (DLPR) coding framework for both lossless and near-lossless image compression, which addresses the challenges of learning-based lossless image compression to a large extent. The remarkable characteristics of the DLPR coding system include: state-of-the-art lossless and near-lossless image compression performance, scalable near-lossless image compression with variable \ell_{\infty} bounds in a single network, and competitive coding speed even on 2K resolution images. Specifically, for lossless image compression, the DLPR coding system first performs lossy compression and then lossless coding of residuals. Both the lossy image compressor and the residual compressor are designed with advanced neural network architectures. We formulate the joint lossy image and residual compression problem within the framework of VAEs, and add autoregressive context modeling of the residuals to enhance the lossless compression performance. Note that our VAE model is different from transform coding based VAEs for purely lossy image compression [24, 25] and from bits-back coding based VAEs [26] for lossless image compression. For near-lossless image compression, we quantize the original residuals to satisfy a given \ell_{\infty} error bound, and compress the quantized residuals instead of the original residuals. To achieve scalable near-lossless compression with variable \ell_{\infty} error bounds, we derive the probability model of the quantized residuals by quantizing the learned probability model of the original residuals for lossless compression, without training multiple networks. Because residual quantization leads to a context mismatch between training and inference, we propose a scalable quantized residual compressor with a bias correction scheme to correct the bias of the derived probability model. The bottleneck in expediting the DLPR coding is the serialized autoregressive context model in residual and quantized residual compression. We thus propose a novel design of coding context to increase the degree of algorithm parallelization, and further accelerate the entropy coding with an adaptive residual interval scheme. Finally, the lossless or near-lossless compressed image is stored as the bitstreams of the encoded lossy image together with the encoded original or quantized residuals.

In summary, the major contributions of this research are as follows:

  • We propose a unified DLPR coding framework to realize both lossless and near-lossless image compression. The framework can be interpreted as a VAE model and end-to-end optimized. Though lossless and near-lossless modes have been supported in traditional lossless image codecs, such as JPEG-LS and CALIC, we are the first to support both modes in learning-based image compression.

  • We realize scalable near-lossless image compression with variable \ell_{\infty} error bounds. Given the \ell_{\infty} bounds, we quantize the original residuals and derive the probability model of the quantized residuals from the learned probability model of the original residuals for lossless image compression, instead of training multiple networks. A bias correction scheme further improves the compression performance.

  • To expedite the DLPR coding system, we propose a novel design of coding context to increase the degree of algorithm parallelization without compromising the compression performance. Meanwhile, we further introduce an adaptive residual interval scheme to reduce the entropy coding time.

Experimental results demonstrate that the DLPR coding system achieves state-of-the-art lossless and near-lossless image compression performance, and attains competitive PSNR with a much smaller \ell_{\infty} error compared with lossy image codecs at high bit rates. At the same time, the DLPR coding system is practical in terms of runtime, compressing and decompressing 2K resolution images in several seconds.

Note that this paper is a non-trivial extension of our recent work [23]. First, this paper addresses both lossless and near-lossless image compression, rather than only near-lossless image compression as in [23]. Second, we improve the network architectures of the lossy image compressor, the residual compressor and the scalable quantized residual compressor beyond [23], leading to a more powerful yet more concise DLPR coding system. Third, to expedite the DLPR coding system, we introduce a novel design of context coding to increase the degree of algorithm parallelization and an adaptive residual interval scheme to accelerate the entropy coding. Finally, we conduct comprehensive experiments to demonstrate that the resulting DLPR coding system achieves state-of-the-art lossless and near-lossless image compression performance, significantly outperforms its prototype in [23], and enjoys much faster coding speed.

The rest of the paper is organized as follows. We provide a brief review of related works in Sec. 2. We theoretically analyze lossless and near-lossless image compression problems, and formulate the DLPR coding framework in Sec. 3. The network architecture and acceleration of DLPR coding framework are presented in Sec. 4. Experiments and conclusions are in Sec. 5 and Sec. 6, respectively.

2 Related Work

This section reviews related works from three aspects: learning-based lossy image compression, learning-based lossless image compression and near-lossless image compression. Our DLPR coding framework takes advantage of the recent progress in learning-based lossy image compression, and achieves state-of-the-art lossless and near-lossless image compression performance.

2.1 Learning-based Lossy Image Compression

Early learning-based lossy image compression methods with DNNs are based on recurrent neural networks (RNNs), starting from the work of Toderici et al. [27], who proposed a long short-term memory (LSTM) network to progressively encode images or residuals and achieved multi-rate image compression by increasing the number of RNN iterations. Following [27], Toderici et al. [28] and Johnston et al. [29] improved RNN-based lossy image compression by modifying the RNN architectures, introducing LSTM-based entropy coding and adding spatially adaptive bit allocation.

Apart from RNN-based methods, a general end-to-end convolutional neural network (CNN) based compression framework was proposed by Ballé et al. [24] and Theis et al. [30], which can be interpreted as VAEs [22] based on transform coding [31]. In this framework, raw images are transformed to a latent space, quantized and entropy encoded to bitstreams at the encoder side. At the decoder side, the quantized latent variables are recovered from the bitstreams and then inversely transformed to reconstruct lossy images. During training, the compression rate is approximated and minimized via the entropy of the quantized latent variables, while the reconstruction distortion is usually measured with PSNR or MS-SSIM [32], leading to rate-distortion optimization [1]. This framework is followed by most recent learned lossy image compression methods and has been improved from three aspects, i.e., transform (network architectures) [33, 34, 35], quantization [36, 37, 38] and entropy coding [25, 39, 40, 41, 42, 43, 44, 45].

Inspired by the recent progress of learned lossy image compression, we propose a DLPR coding framework for both lossless and near-lossless image compression, by integrating lossy image compression with residual compression. The DLPR coding system achieves the state-of-the-art lossless and near-lossless image compression performance with competitive coding speed.

2.2 Learning-based Lossless Image Compression

Lossless image compression can usually be solved in two steps: 1) statistical modeling of the given image data; 2) encoding the image data to bitstreams according to the statistical model, with entropy coding tools such as arithmetic coding [46] or asymmetric numeral systems [47]. Given the strong connections between lossless compression and unsupervised machine learning, deep generative models have been introduced to solve the first step of lossless image compression, which is a challenging task due to the complexity of the unknown probability distribution of raw images. Three kinds of deep generative models dominate lossless image compression: autoregressive models, flow models and VAE models.

Autoregressive models. Oord et al. [18, 19] proposed PixelRNN and PixelCNN, which estimate the joint distribution of pixels in an image as the product of conditional distributions over the pixels. Masked convolution is used to ensure that the conditional probability estimation of the current pixel depends only on previously observed pixels. Following [18, 19], Salimans et al. [20] proposed PixelCNN++, which improves the implementation of PixelCNN in several aspects. Reed et al. [48] proposed multi-scale parallelized PixelCNN that allows efficient probability density estimation. Zhang et al. [49] studied the out-of-distribution generalization of autoregressive models and utilized a local PixelCNN for lossless image compression.

Flow models. Hoogeboom et al. [14] proposed an integer discrete flow (IDF) model for lossless image compression that learns rich invertible transforms on discrete image data. The latent variables resulting from IDF are assumed to follow simpler distributions, enabling efficient entropy coding, and can recover raw images losslessly. Following [14], Berg et al. [50] further proposed an IDF++ model improving several aspects of the IDF model, such as the network architecture. In [34], Ma et al. proposed a wavelet-like transform for lossy and lossless image compression, which can be considered a special flow model. In [15], Ho et al. proposed a local bits-back coding scheme and realized lossless image compression with continuous flows. In [16], Zhang et al. proposed an invertible volume-preserving flow (iVPF) model to achieve a discrete bijection for lossless image compression. Beyond [16], Zhang et al. [17] further proposed an iFlow model composed of modular scale transforms and uniform base conversion systems, leading to state-of-the-art performance.

VAE models. Townsend et al. [26] proposed bits-back with asymmetric numeral systems (BB-ANS), which performs lossless image compression with VAE models. The bits-back coding scheme estimates posterior distributions of latent variables conditioned on given images and decodes the latent variables from auxiliary bits accordingly. Kingma et al. [12] further generalized BB-ANS with a bit-swap scheme based on hierarchical VAE models, to avoid a large number of auxiliary bits. In [13], Townsend et al. proposed an alternative hierarchical latent lossless compression (HiLLoC) method integrating BB-ANS with hierarchical VAE models, and adopted FLIF [4] to compress parts of the image data as auxiliary bits. Besides bits-back coding schemes, Mentzer et al. [10] proposed a practical lossless image compression (L3C) model, which can also be interpreted as a hierarchical VAE model.

In contrast to the above-mentioned methods, we propose a DLPR coding framework, which can be utilized for lossless image compression and interpreted in terms of VAE models. In [11], Mentzer et al. used the traditional BPG lossy image codec [51] to compress raw images and proposed a CNN model to compress the corresponding residuals, which is a special case of our framework. We further design our network architecture by integrating the VAE model with an autoregressive context model, leading to superior lossless image compression performance. Meanwhile, we propose a novel design of coding context to increase coding parallelization, making the DLPR coding system practical for real-world image compression tasks.

2.3 Near-lossless Image Compression

Near-lossless image compression requires the maximum reconstruction error of each pixel to be no larger than a given tight numerical bound, i.e., the \ell_{\infty} error bound. It is a challenging task to realize near-lossless image compression, because the \ell_{\infty} error bound is non-differentiable and must be strictly satisfied.

Traditional near-lossless image compression methods can be divided into three categories: 1) pre-quantization: adjusting raw pixel values according to the \ell_{\infty} error bound, and then compressing the pre-processed images with lossless image compression, such as near-lossless WebP [52]; 2) predictive coding: predicting subsequent pixels based on previously encoded pixels, quantizing the prediction residuals to satisfy the \ell_{\infty} error bound, and finally compressing the quantized residuals, such as [53, 7], near-lossless JPEG-LS [2] and near-lossless CALIC [8]; 3) lossy plus residual coding: similar to 2), but replacing the predictive coder with a lossy image coder and encoding both the lossy image and the quantized residuals, as discussed in [6]. Compared with learning-based lossy and lossless image compression, learning-based near-lossless image compression is still in its infancy.

In this paper, we propose a DLPR coding framework inspired by traditional lossy plus residual coding, which can be utilized for near-lossless image compression. The DLPR coding framework supports scalable near-lossless image compression with variable \ell_{\infty} bounds without training multiple networks, and achieves the state-of-the-art compression performance. Recently, Zhang et al. [54] proposed a learning-based soft-decoding method to improve the reconstruction performance of near-lossless CALIC. Though PSNR is improved, the soft-decoding method cannot strictly guarantee the \ell_{\infty} error bound.

3 Deep Lossy Plus Residual Coding

In this section, we introduce a DLPR coding framework for lossless and near-lossless image compression, by integrating lossy image compression with residual compression. We theoretically analyze lossless and near-lossless image compression problems, and formulate the DLPR coding framework in terms of VAEs.

3.1 DLPR coding for Lossless Image Compression

Lossless image compression guarantees that raw images are perfectly reconstructed from the compressed bitstreams. Assuming that raw images \mathbf{x} are sampled from an unknown probability distribution p(\mathbf{x}), the shortest expected codelength of the compressed images with lossless image compression is theoretically lower-bounded by the information entropy [1, 55]:

H(p)=\mathbb{E}_{p(\mathbf{x})}[-\log p(\mathbf{x})]  (1)

In practice, the compression performance of any specific lossless image compression method depends on how well it can approximate p(\mathbf{x}) with an underlying model p_{\boldsymbol{\theta}}(\mathbf{x}). The corresponding compression performance is given by the cross entropy [1, 55]:

H(p,p_{\boldsymbol{\theta}})=\mathbb{E}_{p(\mathbf{x})}[-\log p_{\boldsymbol{\theta}}(\mathbf{x})]\geq H(p)  (2)

where the equality in (2) holds only if p_{\boldsymbol{\theta}}(\mathbf{x})=p(\mathbf{x}).
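To make the relation between (1) and (2) concrete, the following toy Python snippet (illustrative only; the distributions are assumed, not taken from the paper) computes the entropy of a 4-symbol source and the expected codelength obtained when coding with a mismatched model:

```python
import math

# Hypothetical 4-symbol source p(x) and a mismatched model p_theta(x) (assumed values).
p       = [0.5, 0.25, 0.125, 0.125]
p_theta = [0.4, 0.3, 0.2, 0.1]

# Entropy H(p): theoretical lower bound of the expected codelength (bits/symbol), eq. (1).
H = -sum(px * math.log2(px) for px in p)

# Cross entropy H(p, p_theta): expected codelength when coding with p_theta, eq. (2).
H_cross = -sum(px * math.log2(qx) for px, qx in zip(p, p_theta))

print(f"H(p) = {H:.3f} bits, H(p, p_theta) = {H_cross:.3f} bits")  # 1.750 vs ~1.801
```

The gap H(p,p_{\boldsymbol{\theta}})-H(p) is the KL divergence between p and p_{\boldsymbol{\theta}}, i.e., the extra rate paid for the model mismatch.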

In order to approximate p(\mathbf{x}), the latent variable model p_{\boldsymbol{\theta}}(\mathbf{x}) is extensively employed and is formulated as a marginal distribution:

p_{\boldsymbol{\theta}}(\mathbf{x})=\int p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{y})d\mathbf{y}=\int p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})p_{\boldsymbol{\theta}}(\mathbf{y})d\mathbf{y}  (3)

where \mathbf{y} is an unobserved latent variable and \boldsymbol{\theta} denotes the parameters of this model. Since directly learning the marginal distribution p_{\boldsymbol{\theta}}(\mathbf{x}) with (3) is typically intractable, an alternative is to optimize the evidence lower bound (ELBO) via VAEs [22]. By introducing an inference model q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x}) to approximate the posterior p_{\boldsymbol{\theta}}(\mathbf{y}|\mathbf{x}), the logarithm of the marginal likelihood p_{\boldsymbol{\theta}}(\mathbf{x}) can be rewritten as:

\log p_{\boldsymbol{\theta}}(\mathbf{x})=\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})}\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})}}_{ELBO}+\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})}\log\frac{q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y}|\mathbf{x})}}_{D_{kl}(q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})||p_{\boldsymbol{\theta}}(\mathbf{y}|\mathbf{x}))}  (4)

where D_{kl}(\cdot||\cdot) is the Kullback-Leibler (KL) divergence and \boldsymbol{\phi} denotes the parameters of the inference model q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x}). Because D_{kl}(q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})||p_{\boldsymbol{\theta}}(\mathbf{y}|\mathbf{x}))\geq 0 and \log p_{\boldsymbol{\theta}}(\mathbf{x})\leq 0, the ELBO is a lower bound of \log p_{\boldsymbol{\theta}}(\mathbf{x}). Thus, we have

\mathbb{E}_{p(\mathbf{x})}[-\log p_{\boldsymbol{\theta}}(\mathbf{x})]\leq\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})}\left[-\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{y})}{q_{\boldsymbol{\phi}}(\mathbf{y}|\mathbf{x})}\right]  (5)

and can minimize the expectation of the negative ELBO as a proxy for the expected codelength \mathbb{E}_{p(\mathbf{x})}[-\log p_{\boldsymbol{\theta}}(\mathbf{x})].

In order to minimize the expectation of the negative ELBO, we propose the DLPR coding framework. We first adopt lossy image compression based on transform coding [31] to compress the raw image \mathbf{x} and obtain its lossy reconstruction \tilde{\mathbf{x}}. The expectation of the negative ELBO can be reformulated as follows:

\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}}|\mathbf{x})}\left[\cancelto{0}{\log q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}}|\mathbf{x})}\underbrace{-\log p_{\boldsymbol{\theta}}(\mathbf{x}|{\hat{\mathbf{y}}})}_{Distortion}\underbrace{-\log p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}})}_{R_{\hat{\mathbf{y}}}}\right]  (6)

where \hat{\mathbf{y}} is the quantized result of the continuous latent representation \mathbf{y}, and \mathbf{y} is deterministically transformed from \mathbf{x}. Like [25], we relax the quantization of \mathbf{y} by adding noise from \mathcal{U}(-\frac{1}{2},\frac{1}{2}), and assume q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}}|\mathbf{x})=\prod_{i}\mathcal{U}(y_{i}-\frac{1}{2},y_{i}+\frac{1}{2}). Thus, \log q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}}|\mathbf{x})=0 is dropped from (6). For purely lossy image compression, such as [24, 25, 40, 33], the second term of (6) can be regarded as the distortion loss between \mathbf{x} and its lossy reconstruction \tilde{\mathbf{x}} from \hat{\mathbf{y}}. The third term can be regarded as the rate loss of lossy image compression. Only \hat{\mathbf{y}} needs to be encoded to the bitstreams and stored.
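As a minimal sketch of the additive-uniform-noise relaxation used above (training replaces rounding with noise from \mathcal{U}(-\frac{1}{2},\frac{1}{2}); inference uses hard rounding), assuming a PyTorch implementation with an illustrative function name:

```python
import torch

def relaxed_quantize(y: torch.Tensor, training: bool) -> torch.Tensor:
    """Quantization of the latent representation y: during training, rounding is
    relaxed by adding noise from U(-0.5, 0.5) so that gradients can flow; at
    inference time, hard rounding is applied before entropy coding."""
    if training:
        noise = torch.empty_like(y).uniform_(-0.5, 0.5)
        return y + noise
    return torch.round(y)
```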

Beyond lossy image compression, we further take residual compression into consideration. The residual \mathbf{r} is computed by \mathbf{r}=\mathbf{x}-\tilde{\mathbf{x}}. We have the following Proposition 1.

Proposition 1.

p_{\boldsymbol{\theta}}(\mathbf{x}|{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}({\tilde{\mathbf{x}}},\mathbf{r}|{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}(\mathbf{r}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}}).

Proof.

For each \mathbf{x} and all (\tilde{\mathbf{x}},\mathbf{r}) pairs satisfying \tilde{\mathbf{x}}+\mathbf{r}=\mathbf{x}, we have p_{\boldsymbol{\theta}}(\mathbf{x}|{\hat{\mathbf{y}}})=\sum_{\tilde{\mathbf{x}}+\mathbf{r}=\mathbf{x}}p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}},\mathbf{r}|{\hat{\mathbf{y}}}). Following Bayes' rule, we have p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}},\mathbf{r}|{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}}|{\hat{\mathbf{y}}})\cdot p_{\boldsymbol{\theta}}(\mathbf{r}|\tilde{\mathbf{x}},{\hat{\mathbf{y}}}). Thus, p_{\boldsymbol{\theta}}(\mathbf{x}|{\hat{\mathbf{y}}})=\sum_{\tilde{\mathbf{x}}+\mathbf{r}=\mathbf{x}}p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}}|{\hat{\mathbf{y}}})\cdot p_{\boldsymbol{\theta}}(\mathbf{r}|\tilde{\mathbf{x}},{\hat{\mathbf{y}}}). Because the lossy reconstruction \tilde{\mathbf{x}} is computed by the deterministic inverse transform of \hat{\mathbf{y}}, there is only one \tilde{\mathbf{x}} with p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}}|{\hat{\mathbf{y}}})=1, and all other \tilde{\mathbf{x}}'s satisfy p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}}|{\hat{\mathbf{y}}})=0, so the corresponding terms of the sum vanish. Thus, p_{\boldsymbol{\theta}}(\mathbf{x}|{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}}|{\hat{\mathbf{y}}})\cdot p_{\boldsymbol{\theta}}(\mathbf{r}|\tilde{\mathbf{x}},{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}(\tilde{\mathbf{x}},\mathbf{r}|{\hat{\mathbf{y}}})=1\cdot p_{\boldsymbol{\theta}}(\mathbf{r}|\tilde{\mathbf{x}},{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}(\mathbf{r}|\tilde{\mathbf{x}},{\hat{\mathbf{y}}}). ∎

Based on (6) and Proposition 1, we substitute p_{\boldsymbol{\theta}}(\mathbf{r}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}}) for p_{\boldsymbol{\theta}}(\mathbf{x}|{\hat{\mathbf{y}}}) and obtain the DLPR coding formulation:

\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}}|\mathbf{x})}\left[\underbrace{-\log p_{\boldsymbol{\theta}}(\mathbf{r}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}})}_{R_{\mathbf{r}}}\underbrace{-\log p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}})}_{R_{\hat{\mathbf{y}}}}\right]  (7)

where the first term R_{\mathbf{r}} and the second term R_{\hat{\mathbf{y}}} of (7) are the expected codelengths of entropy encoding \mathbf{r} and \hat{\mathbf{y}} with p_{\boldsymbol{\theta}}(\mathbf{r}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}}) and p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}}), respectively. During training, we relax the quantization of \tilde{\mathbf{x}} by adding noise from \mathcal{U}(-\frac{1}{2},\frac{1}{2}), and have \log p_{\boldsymbol{\theta}}({\tilde{\mathbf{x}}}|{\hat{\mathbf{y}}})=0, consistent with the precondition of Proposition 1. Because (7) is equivalent to the expectation of the negative ELBO in (5), it is an upper bound of the expected codelength \mathbb{E}_{p(\mathbf{x})}[-\log p_{\boldsymbol{\theta}}(\mathbf{x})] and can be minimized as a proxy.

Note that no distortion loss of lossy image compression is specified in (7). Therefore, we can embed arbitrary lossy image compressors and minimize (7) to achieve lossless image compression. A special case is the previous lossless image compression method [11], in which the BPG lossy image compressor [51] with a learned quantization parameter classifier minimizes -\log p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}}) and a CNN-based residual compressor minimizes -\log p_{\boldsymbol{\theta}}(\mathbf{r}|{\tilde{\mathbf{x}}}).

3.2 DLPR coding for Near-lossless Image Compression

We further extend the DLPR coding framework for near-lossless image compression. Given a tight \ell_{\infty} error bound \tau\in\{1,2,\ldots\}, near-lossless methods compress a raw image \mathbf{x} satisfying the following distortion constraint:

D_{nll}(\mathbf{x},{\hat{\mathbf{x}}})=\|\mathbf{x}-{\hat{\mathbf{x}}}\|_{\infty}=\max_{i,c}|x_{i,c}-\hat{x}_{i,c}|\leq\tau  (8)

where \hat{\mathbf{x}} is the near-lossless reconstruction of the raw image \mathbf{x}. x_{i,c} and \hat{x}_{i,c} are the pixels of \mathbf{x} and \hat{\mathbf{x}}, respectively. i denotes the i-th spatial position in a pre-defined scan order, and c denotes the c-th channel. If \tau=0, near-lossless image compression is equivalent to lossless image compression.

In order to satisfy the \ell_{\infty} constraint (8), we extend the DLPR coding framework by quantizing the residuals. First, we still obtain a lossy reconstruction \tilde{\mathbf{x}} of the raw image \mathbf{x} through lossy image compression. Although lossy image compression methods can achieve high PSNR at relatively low bit rates, it is difficult for these methods to ensure a tight error bound \tau for each pixel of \tilde{\mathbf{x}}. We then compute the residual \mathbf{r}=\mathbf{x}-\tilde{\mathbf{x}} and suppose that \mathbf{r} is quantized to \hat{\mathbf{r}}. Letting \hat{\mathbf{x}}=\tilde{\mathbf{x}}+\hat{\mathbf{r}}, the reconstruction error \mathbf{x}-\hat{\mathbf{x}} is equal to the quantization error \mathbf{r}-\hat{\mathbf{r}} of \mathbf{r}. Thus, we adopt a uniform residual quantizer whose bin size is 2\tau+1 and whose quantized value is [2, 8]:

\hat{r}_{i,c}=\operatorname*{sgn}(r_{i,c})\cdot(2\tau+1)\left\lfloor\frac{|r_{i,c}|+\tau}{2\tau+1}\right\rfloor  (9)

where \operatorname*{sgn}(\cdot) denotes the sign function, and r_{i,c} and \hat{r}_{i,c} are the elements of \mathbf{r} and \hat{\mathbf{r}}, respectively. With (9), we have |r_{i,c}-\hat{r}_{i,c}|\leq\tau for each \hat{r}_{i,c} in \hat{\mathbf{r}}, satisfying the tight error bound (8). Because residual quantization (9) is deterministic, the DLPR coding framework for near-lossless image compression can be formulated as:

\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}}|\mathbf{x})}\left[\underbrace{-\log p_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}})}_{R_{\hat{\mathbf{r}}}}\underbrace{-\log p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}})}_{R_{\hat{\mathbf{y}}}}\right]  (10)

where the first term R_{\hat{\mathbf{r}}} of (10) is the expected codelength of the quantized residual \hat{\mathbf{r}} with p_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}}), rather than that of the original residual \mathbf{r} in (7). Finally, we concatenate the bitstream of \hat{\mathbf{r}} with that of \hat{\mathbf{y}}, leading to the near-lossless image compression result.
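For concreteness, a small NumPy sketch of the residual quantizer (9) is given below; the function name and test values are illustrative, not the paper's released code. The assertion verifies the \ell_{\infty} bound (8) on the quantization error.

```python
import numpy as np

def quantize_residual(r: np.ndarray, tau: int) -> np.ndarray:
    """Uniform residual quantizer of (9): bin size 2*tau + 1; each residual is
    mapped to a multiple of (2*tau + 1) so that |r - r_hat| <= tau element-wise."""
    if tau == 0:
        return r.copy()                       # tau = 0 reduces to lossless coding
    bin_size = 2 * tau + 1
    return np.sign(r) * bin_size * ((np.abs(r) + tau) // bin_size)

rng = np.random.default_rng(0)
r = rng.integers(-255, 256, size=(8, 8))      # residuals of an 8x8 toy patch
r_hat = quantize_residual(r, tau=2)
assert np.max(np.abs(r - r_hat)) <= 2         # the l_inf constraint (8) holds
```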

4 Network Architecture and Acceleration

4.1 Network Architecture of DLPR Coding

We propose the network architecture of our DLPR coding framework, including a lossy image compressor (LIC), a residual compressor (RC) and a scalable quantized residual compressor (SQRC), as illustrated in Fig. 1. With LIC and RC, we realize DLPR coding for lossless image compression. With LIC and SQRC, we further realize DLPR coding for near-lossless image compression with variable \ell_{\infty} bounds \tau\in\{1,2,\ldots\} in a single network, instead of training multiple networks for different \tau's. We next specify each of the three components in the following subsections.

Figure 1: Network architecture of DLPR coding framework, including a lossy image compressor (LIC), a residual compressor (RC) and a scalable quantized residual compressor (SQRC).

4.1.1 Lossy Image Compressor

In LIC, we employ a sophisticated image encoder and decoder together with an efficient hyper-prior model [25], as shown in Fig. 2. The image encoder g_{e}(\cdot) and decoder g_{d}(\cdot) are composed of analysis, synthesis and Swin-Attention blocks following the philosophy of residual and self-attention learning [56, 57, 33], as detailed in Fig. 3. In the Swin-Attention blocks, we adopt the window and shifted-window based multi-head self-attention (W-MSA/SW-MSA) of [58] to aggregate information within and across local windows adaptively, which improves the representation ability of g_{e}(\cdot) and g_{d}(\cdot) with moderate computational complexity. With g_{e}(\cdot) and g_{d}(\cdot), we transform an input raw image \mathbf{x} to its latent representation \mathbf{y}=g_{e}(\mathbf{x}), quantize \mathbf{y} to \hat{\mathbf{y}}=Q(\mathbf{y}), and inversely transform \hat{\mathbf{y}} to the lossy reconstruction \tilde{\mathbf{x}}=g_{d}({\hat{\mathbf{y}}}).

Because the sophisticated image encoder and decoder can largely reduce the spatial redundancies in \mathbf{x}, the burden of entropy coding \hat{\mathbf{y}} is relieved, and we employ the efficient hyper-prior model [25] without any context models to ensure the coding parallelization on GPUs. The hyper-prior model extracts side information \hat{\mathbf{z}}=Q(h_{e}(\mathbf{y})) to model the probability distribution of \hat{\mathbf{y}}, where h_{e}(\cdot) is the hyper-encoder. We assume a factorized Gaussian distribution model \mathcal{N}({\boldsymbol{\mu}},{\boldsymbol{\sigma}})=h_{d}({\hat{\mathbf{z}}}) for p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}}|{\hat{\mathbf{z}}}), where h_{d}(\cdot) is the hyper-decoder, and a non-parametric factorized density model for p_{\boldsymbol{\theta}}({\hat{\mathbf{z}}}). The R_{\hat{\mathbf{y}}} in (7) and (10) is thus extended to

R_{{\hat{\mathbf{y}}},{\hat{\mathbf{z}}}}=\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}},{\hat{\mathbf{z}}}|\mathbf{x})}\left[-\log p_{\boldsymbol{\theta}}({\hat{\mathbf{y}}}|{\hat{\mathbf{z}}})-\log p_{\boldsymbol{\theta}}({\hat{\mathbf{z}}})\right]  (11)

where R_{{\hat{\mathbf{y}}},{\hat{\mathbf{z}}}} is the cost of encoding both \hat{\mathbf{y}} and \hat{\mathbf{z}}.

Figure 2: Network architecture of the lossy image compressor (LIC). We employ a sophisticated image encoder/decoder with an efficient hyper-prior model. The channel numbers are set uniformly to 192 for all layers. (AE: arithmetic encoding. AD: arithmetic decoding. Q: quantization.)

Figure 3: Detailed structures of different blocks in LIC. (a) Analysis block. (b) Synthesis block. (c) Swin-Attention block. The window size and head number of the Swin-Attention blocks are both set to 8 for 4\times down-sampled feature maps, and are set to 4 and 8 for 16\times down-sampled feature maps. The channel numbers are set uniformly to 192 for all layers. (GDN: generalized divisive normalization [59]. IGDN: inverse GDN. W-MSA/SW-MSA: window/shifted window based multi-head self-attention [58]. ResBlock: residual block [56].)

4.1.2 Residual Compressor

Given the raw image \mathbf{x} and its lossy reconstruction \tilde{\mathbf{x}} from LIC, we compute the residual \mathbf{r}=\mathbf{x}-\tilde{\mathbf{x}}. We next introduce RC to estimate the probability mass function (PMF) of \mathbf{r} and compress \mathbf{r} with arithmetic coding [46] accordingly.

Let \mathbf{u}=g_{\mathbf{u}}({\hat{\mathbf{y}}}), where the feature \mathbf{u} is generated from \hat{\mathbf{y}} by g_{\mathbf{u}}(\cdot). g_{\mathbf{u}}(\cdot) and the image decoder g_{d}(\cdot) share the network except for the last convolutional layer, as shown in Fig. 1. We interpret \mathbf{u} as the feature of the residual \mathbf{r} given \tilde{\mathbf{x}} and \hat{\mathbf{y}}. The feature \mathbf{u} shares the same height and width with \mathbf{r} and has 256 channels. Unlike the latent representation \hat{\mathbf{y}}, whose spatial redundancies are largely reduced by the image encoder, the residual \mathbf{r} in the pixel domain has spatial redundancies that cannot be fully exploited by the feature \mathbf{u} alone. Therefore, we further introduce an autoregressive model into the statistical modeling of \mathbf{r}, leading to

p_{\boldsymbol{\theta}}(\mathbf{r}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}})=p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u})=\prod_{i,c}p_{\boldsymbol{\theta}}(r_{i,c}|\mathbf{u},r_{<(i,c)})  (12)

where r_{<(i,c)} denotes the elements of \mathbf{r} encoded or decoded before r_{i,c} in a pre-defined scan order. In practice, we implement the spatial autoregressive model using a masked convolutional layer with a specific receptive field, rather than depending on all elements in r_{<(i,c)}. We regard the receptive field of the masked convolutional layer as the context C_{\mathbf{r}}. Based on (12) and C_{\mathbf{r}}, we reformulate the R_{\mathbf{r}} in (7) as

R_{\mathbf{r}}=\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}},{\hat{\mathbf{z}}}|\mathbf{x})}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}})\right]  (13)

Specifically, we utilize a 7\times 7 masked convolutional layer with 256 channels to extract the context C_{r_{i}}\in C_{\mathbf{r}} from r_{<(i,c)}. C_{r_{i}} is shared by r_{i,c} of all channels. For RGB images with three channels, we have

p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}})=\prod_{i}p_{\boldsymbol{\theta}}(r_{i,1},r_{i,2},r_{i,3}|u_{i},C_{r_{i}})  (14)

We further adopt a channel autoregressive scheme over r_{i,1}, r_{i,2}, r_{i,3} [20] and reformulate p_{\boldsymbol{\theta}}(r_{i,1},r_{i,2},r_{i,3}|u_{i},C_{r_{i}}) as

p_{\boldsymbol{\theta}}(r_{i,1},r_{i,2},r_{i,3}|u_{i},C_{r_{i}})=p_{\boldsymbol{\theta}}(r_{i,1}|u_{i},C_{r_{i}})\cdot p_{\boldsymbol{\theta}}(r_{i,2}|r_{i,1},u_{i},C_{r_{i}})\cdot p_{\boldsymbol{\theta}}(r_{i,3}|r_{i,1},r_{i,2},u_{i},C_{r_{i}})  (15)

We model the PMF of r_{i,c} with a discrete logistic mixture likelihood [20] and propose a sub-network to estimate the corresponding entropy parameters, including mixture weights \pi_{i}^{k}, means \mu_{i,c}^{k}, variances \sigma_{i,c}^{k} and mixture coefficients \beta_{i,t}, where k denotes the index of the k-th logistic distribution and t denotes the channel index of \beta. The network architecture of the entropy model is shown in Fig. 4. We utilize a mixture of K=5 logistic distributions. The channel autoregressive scheme over r_{i,1},r_{i,2},r_{i,3} is implemented by updating the means using:

\tilde{\mu}_{i,1}^{k}=\mu_{i,1}^{k},\quad\tilde{\mu}_{i,2}^{k}=\mu_{i,2}^{k}+\beta_{i,1}\cdot r_{i,1},\quad\tilde{\mu}_{i,3}^{k}=\mu_{i,3}^{k}+\beta_{i,2}\cdot r_{i,1}+\beta_{i,3}\cdot r_{i,2}  (16)

With \pi_{i}^{k}, \tilde{\mu}_{i,c}^{k} and \sigma_{i,c}^{k}, we have

p_{\boldsymbol{\theta}}(r_{i,c}|r_{i,<c},u_{i},C_{r_{i}})\sim\sum_{k=1}^{K}\pi_{i}^{k}\operatorname*{logistic}(\tilde{\mu}_{i,c}^{k},\sigma_{i,c}^{k})  (17)

where \operatorname*{logistic}(\cdot) denotes the logistic distribution. For discrete r_{i,c}, we evaluate p_{\boldsymbol{\theta}}(r_{i,c}|r_{i,<c},u_{i},C_{r_{i}}) as [20]:

\sum_{k=1}^{K}\pi_{i}^{k}\left[S\left(\frac{r_{i,c}^{+}-\tilde{\mu}_{i,c}^{k}}{\sigma_{i,c}^{k}}\right)-S\left(\frac{r_{i,c}^{-}-\tilde{\mu}_{i,c}^{k}}{\sigma_{i,c}^{k}}\right)\right]  (18)

where S(\cdot) denotes the sigmoid function, r_{i,c}^{+}=r_{i,c}+0.5 and r_{i,c}^{-}=r_{i,c}-0.5. The probability inference scheme of \mathbf{r} is sketched in Fig. 5a.
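The following sketch shows how (17)-(18) can be evaluated for a single residual value with a K-component discrete logistic mixture; shapes and names are illustrative, and the mixture parameters would in practice come from the entropy model in Fig. 4.

```python
import torch

def discrete_logistic_mixture_pmf(r: torch.Tensor, pi: torch.Tensor,
                                  mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Evaluate (18): probability of an integer residual value r under a K-component
    logistic mixture with weights pi, (updated) means mu and scales sigma, all of shape [K]."""
    cdf_plus  = torch.sigmoid((r + 0.5 - mu) / sigma)   # S((r^+ - mu) / sigma)
    cdf_minus = torch.sigmoid((r - 0.5 - mu) / sigma)   # S((r^- - mu) / sigma)
    return torch.sum(pi * (cdf_plus - cdf_minus))

# Toy example with K = 5 components (parameters would come from the entropy model).
K = 5
pi    = torch.softmax(torch.randn(K), dim=0)            # mixture weights, sum to 1
mu    = torch.randn(K) * 3.0
sigma = torch.rand(K) + 0.5
p = discrete_logistic_mixture_pmf(torch.tensor(2.0), pi, mu, sigma)
bits = -torch.log2(p)                                    # ideal codelength of this residual
```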

Figure 4: Network architecture of the entropy model in RC. Given \mathbf{u} and C_{\mathbf{r}}, the entropy model estimates the parameters of the discrete logistic mixture likelihoods corresponding to the probability distribution of \mathbf{r}. All 1\times 1 convolutional layers except the last layer have 256 channels. The last convolutional layer has 10\cdot K channels split into \pi, \mu, \sigma and \beta.

4.1.3 Scalable Quantized Residual Compressor

Figure 5: Probability inference of residuals and quantized residuals. (a) Probability inference of residuals with RC. (b) Probability inference of quantized residuals with the \tau-specific scheme. (c) Scalable probability inference of quantized residuals without SQRC. (d) Scalable probability inference of quantized residuals with SQRC. (RQ: residual quantization. PQ: PMF quantization.)

We finally introduce SQRC to realize scalable near-lossless image compression with variable \ell_{\infty} bounds \tau\in\{1,2,\ldots\}. Though near-lossless image compression for a specific \tau can be realized by optimizing (10), this \tau-specific scheme in Fig. 5b leads to two problems:

  • Relaxation problem of residual quantization. Unlike rounding quantization, the bin size of the residual quantization (9) is much larger. Moreover, the original residuals are not uniformly distributed in each bin, and thus cannot be relaxed by adding uniform noise.

  • Storage problem of multiple networks. To deploy the near-lossless codec, we have to transmit and store multiple networks for different τ\tau’s, which is storage-inefficient.

Instead, we propose a scalable near-lossless image compression scheme, which can circumvent the relaxation of residual quantization and utilize a single network to satisfy variable \ell_{\infty} error bounds \tau\in\{1,2,\ldots\}. Specifically, the scalable compression scheme is based on the learned lossless compression with the DLPR coding framework. We keep the lossy reconstruction \tilde{\mathbf{x}} fixed and quantize the original residual \mathbf{r} to \hat{\mathbf{r}} with variable \tau's by (9). To encode the quantized \hat{\mathbf{r}}, we can derive the PMF of \hat{\mathbf{r}} from the learned PMF of the original \mathbf{r}. Given \tau and the learned PMF p_{\boldsymbol{\theta}}(r_{i,c}|r_{i,<c},u_{i},C_{r_{i}}) of the original r_{i,c}, the PMF \hat{p}_{\boldsymbol{\theta}}(\hat{r}_{i,c}|r_{i,<c},u_{i},C_{r_{i}}) of the quantized \hat{r}_{i,c} can be computed by the following PMF quantization:

\hat{p}_{\boldsymbol{\theta}}(\hat{r}_{i,c}|r_{i,<c},u_{i},C_{r_{i}})=\sum_{v=\hat{r}_{i,c}-\tau}^{\hat{r}_{i,c}+\tau}p_{\boldsymbol{\theta}}(v|r_{i,<c},u_{i},C_{r_{i}})  (19)

We show an illustrative example in Fig. 6. Together with (14) and (15), we can derive the probability model \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\mathbf{r}}) of \hat{\mathbf{r}}, which is optimal given the learned p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) of \mathbf{r}. The resulting cost of encoding \hat{\mathbf{r}}, denoted by R^{\tau}_{{\hat{\mathbf{r}}}}, is reduced significantly with the increase of \tau.
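A sketch of the PMF quantization (19), assuming the learned PMF of the original residual is available as a vector over its integer support (names and layout are illustrative):

```python
import numpy as np

def quantize_pmf(pmf: np.ndarray, support: np.ndarray, tau: int):
    """PMF quantization (19): the probability of each quantized level is the sum of
    the probabilities of all original residual values falling into its bin."""
    bin_size = 2 * tau + 1
    # Quantized level of every value in the support, following (9).
    levels = np.sign(support) * bin_size * ((np.abs(support) + tau) // bin_size)
    q_support = np.unique(levels)
    q_pmf = np.array([pmf[levels == v].sum() for v in q_support])
    return q_support, q_pmf

# Toy example: a uniform PMF over residual values -4..4 with tau = 1 (bins of size 3).
support = np.arange(-4, 5)
pmf = np.full(9, 1.0 / 9.0)
q_support, q_pmf = quantize_pmf(pmf, support, tau=1)
# q_support = [-3, 0, 3]; each q_pmf entry sums three original probabilities (= 1/3).
```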

However, encoding \hat{\mathbf{r}} with \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\mathbf{r}}) results in undecodable bitstreams, since the original residual \mathbf{r} is unknown to the decoder: \hat{p}_{\boldsymbol{\theta}}(\hat{r}_{i,c}|r_{i,<c},u_{i},C_{r_{i}}) cannot be evaluated without r_{i,<c} and the causal context C_{r_{i}}. Instead, we can evaluate the PMF using the quantized residual \hat{\mathbf{r}}, i.e., we evaluate p_{\boldsymbol{\theta}}(r_{i,c}|\hat{r}_{i,<c},u_{i},C_{\hat{r}_{i}}) and derive \hat{p}_{\boldsymbol{\theta}}(\hat{r}_{i,c}|\hat{r}_{i,<c},u_{i},C_{\hat{r}_{i}}) with (19), leading to \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\hat{\mathbf{r}}}) for the encoding of \hat{\mathbf{r}}. Because of the mismatch between the training (with \mathbf{r}) and inference (with \hat{\mathbf{r}}) phases, this leads to biased PMFs p_{\boldsymbol{\theta}}(r_{i,c}|\hat{r}_{i,<c},u_{i},C_{\hat{r}_{i}}), \hat{p}_{\boldsymbol{\theta}}(\hat{r}_{i,c}|\hat{r}_{i,<c},u_{i},C_{\hat{r}_{i}}) and \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\hat{\mathbf{r}}}). The above probability inference scheme is sketched in Fig. 5c.

Figure 6: PMF quantization corresponding to residual quantization (9) with \tau=1. Each red number is the value of the quantized residual \hat{r}_{i,c}. The probability of each quantized value is the sum of the probabilities of the values in the same bin.

SQRC for Bias Correction: Because of the discrepancy between the oracle \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\mathbf{r}}) and the biased \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\hat{\mathbf{r}}}), encoding \hat{\mathbf{r}} with \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\hat{\mathbf{r}}}) degrades the compression performance. In order to tackle this problem, we propose SQRC for bias correction, which closes the gap between the oracle \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\mathbf{r}}) and the biased \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\hat{\mathbf{r}}}) while keeping the resulting bitstreams decodable. The components of SQRC are illustrated in Fig. 1. The masked convolutional layer in SQRC is shared with that in RC. The conditional entropy model has the same network architecture as the entropy model illustrated in Fig. 4, but replaces the convolutional layers with the conditional convolutional layers [19, 60] illustrated in Fig. 7.

The probability inference scheme with SQRC is sketched in Fig. 5d. For \tau=0, we still select the entropy model in RC to estimate p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) and encode \mathbf{r} accordingly. For \tau\in\{1,2,\ldots,N\}, we select the conditional entropy model in SQRC to estimate p_{\boldsymbol{\varphi}}(\mathbf{r}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau) conditioned on \tau. We then derive \hat{p}_{\boldsymbol{\varphi}}({\hat{\mathbf{r}}}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau) with (19) to encode \hat{\mathbf{r}}, where \boldsymbol{\varphi} denotes the parameters of SQRC. As \hat{p}_{\boldsymbol{\varphi}}({\hat{\mathbf{r}}}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau) approximates the oracle \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\mathbf{r}}) better than the biased \hat{p}_{\boldsymbol{\theta}}({\hat{\mathbf{r}}}|\mathbf{u},C_{\hat{\mathbf{r}}}), the compression performance is improved. Since evaluating \hat{p}_{\boldsymbol{\varphi}}({\hat{\mathbf{r}}}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau) is independent of \mathbf{r}, the resulting bitstreams are decodable. In experiments, we demonstrate that the proposed scalable near-lossless compression scheme with SQRC in Fig. 5d outperforms both the \tau-specific near-lossless scheme in Fig. 5b and the scalable near-lossless compression scheme without SQRC in Fig. 5c.

4.2 Training Strategy of DLPR coding

Figure 7: Conditional convolutional layer for SQRC. Different outputs can be generated conditioned on \tau\in\{1,2,\ldots,N\}. We set N=5 in this paper.

4.2.1 Training LIC and RC

The full loss function for jointly optimizing LIC and RC, i.e., DLPR coding for lossless image compression, is

\mathcal{L}({\boldsymbol{\theta}},{\boldsymbol{\phi}})=R_{{\hat{\mathbf{y}}},{\hat{\mathbf{z}}}}+R_{\mathbf{r}}+\lambda\cdot D_{ls}  (20)

where \boldsymbol{\theta} and \boldsymbol{\phi} are the learned parameters of LIC and RC. Besides the rate terms R_{{\hat{\mathbf{y}}},{\hat{\mathbf{z}}}} in (11) and R_{\mathbf{r}} in (13), we further introduce a distortion term D_{ls}(\mathbf{x},{\tilde{\mathbf{x}}}) to minimize the mean square error (MSE) between the raw image \mathbf{x} and its lossy reconstruction \tilde{\mathbf{x}}:

D_{ls}(\mathbf{x},{\tilde{\mathbf{x}}})=\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{i,c}(x_{i,c}-\tilde{x}_{i,c})^{2}  (21)

As discussed in [25], minimizing the MSE loss is equivalent to learning a LIC that fits the residual \mathbf{r} to a factorized Gaussian distribution. However, the discrepancy between the real distribution of \mathbf{r} and the factorized Gaussian distribution is usually large. Therefore, we utilize a sophisticated discrete logistic mixture likelihood model to encode \mathbf{r} in our DLPR coding framework.

The \lambda in (20) is a “rate-distortion” trade-off parameter between the lossless compression rate and the MSE distortion. When \lambda=0, the loss function (20) is consistent with the theoretical DLPR coding formulation (7), and \tilde{\mathbf{x}} becomes a latent variable without any constraints. In experiments, we study the effects of different \lambda's on the lossless and near-lossless image compression performance. We set \lambda=0 for the best lossless image compression, and \lambda=0.03 for robust near-lossless image compression with variable \tau's.
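A simplified sketch of the overall training objective (20), assuming the rate terms R_{{\hat{\mathbf{y}}},{\hat{\mathbf{z}}}} and R_{\mathbf{r}} are already computed (e.g., as expected bits) by LIC and RC; the function signature is illustrative:

```python
import torch

def dlpr_loss(rate_y_z: torch.Tensor, rate_r: torch.Tensor,
              x: torch.Tensor, x_tilde: torch.Tensor, lam: float = 0.03) -> torch.Tensor:
    """Loss (20): rate of (y_hat, z_hat) plus rate of the residual r plus lambda * MSE.
    Setting lam = 0 recovers the purely rate-based objective (7)."""
    mse = torch.mean((x - x_tilde) ** 2)      # distortion term D_ls in (21)
    return rate_y_z + rate_r + lam * mse
```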

4.2.2 Training SQRC

For training SQRC, we generate random \tau\in\{1,2,\ldots,N\} and quantize \mathbf{r} to \hat{\mathbf{r}} with (9). Given \mathbf{u} and the context C_{\hat{\mathbf{r}}} extracted from the quantized \hat{\mathbf{r}}, we use the conditional entropy model to estimate -\log p_{\boldsymbol{\varphi}}(\mathbf{r}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau) conditioned on different \tau's, and minimize

\mathcal{L}({\boldsymbol{\varphi}})=\mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{q_{\boldsymbol{\phi}}({\hat{\mathbf{y}}},{\hat{\mathbf{z}}}|\mathbf{x})}\mathbb{E}_{\tau}\left[\log\frac{p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}})}{p_{\boldsymbol{\varphi}}(\mathbf{r}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau)}\right]  (22)

where \boldsymbol{\varphi} denotes the learned parameters of the conditional entropy model. -\log p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) is estimated by the entropy model in RC. \mathcal{L}({\boldsymbol{\varphi}}) can be considered an approximate KL divergence, or relative entropy [55], between p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) and p_{\boldsymbol{\varphi}}(\mathbf{r}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau).

SQRC is trained together with LIC and RC, but minimizing (22) only updates the parameters of the conditional entropy model, as shown in Fig. 1. The masked convolutional layer is shared with that in RC, and is thus updated by minimizing (20). This design has three advantages: 1) we achieve the target conditional entropy model that closes the gap between training with \mathbf{r} and inference with \hat{\mathbf{r}}; 2) we circumvent the aforementioned relaxation problem of residual quantization; 3) we avoid degrading the estimation of p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) in RC caused by training with randomly generated \tau. Because the entropy model in RC receives the context C_{\mathbf{r}} extracted from the original residual \mathbf{r}, p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) approximates the true distribution p(\mathbf{r}|{\tilde{\mathbf{x}}},{\hat{\mathbf{y}}}) better than p_{\boldsymbol{\varphi}}(\mathbf{r}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau). Thus, -\log p_{\boldsymbol{\theta}}(\mathbf{r}|\mathbf{u},C_{\mathbf{r}}) is a lower bound of -\log p_{\boldsymbol{\varphi}}(\mathbf{r}|\mathbf{u},C_{{\hat{\mathbf{r}}}},\tau) on average.

4.3 Acceleration of DLPR Coding

To realize practical DLPR coding, we must address its bottleneck: the serialized autoregressive model in RC and SQRC, which severely limits the coding speed on GPUs. We thus propose a novel design of context coding to increase the degree of algorithm parallelization, and further accelerate the entropy coding with an adaptive residual interval scheme.

4.3.1 Context Design and Parallelization

Figure 8: Context design and parallelization (patch size P=9, kernel size k=7). (a) Context model M_{7}^{5}: raster scan order, P^{2}=81 serial decoding steps. (b) Context model M_{7}^{5}: 14.04^{\circ} parallel scan, 5P-4=41 decoding steps. (c) Context model M_{7}^{4}: 18.43^{\circ} parallel scan, 4P-3=33 decoding steps. (d) Context model M_{7}^{3}: 26.57^{\circ} parallel scan, 3P-2=25 decoding steps. (e) Context model M_{7}^{2}: 45^{\circ} parallel scan, 2P-1=17 decoding steps. (f) Context model M_{7}^{1}: 90^{\circ} parallel scan, P=9 decoding steps.

Generally, autoregressive models suffer from serialized decoding and cannot be efficiently implemented on GPUs. Given an H\times W image, the context model has to be computed HW times to decode all pixels sequentially. In lossy image compression, the checkerboard context model [45] and the channel-wise context model [44] were introduced to accelerate the probability inference of latent variables. However, these two context models are too weak for our residual coding and, without the help of transform coding, result in significant performance degradation.

To improve the parallelization of residual coding, we first adopt a common operation to split an H×WH\times W image into multiple non-overlapping P×PP\times P patches and code all P×PP\times P patches in parallel, reducing HWHW times sequential context computations to P2P^{2} times. We next propose a novel design of context coding to improve the algorithm parallelization given the P×PP\times P patches and k×kk\times k mask convolution, as illustrated in Fig. 8. Assuming that P=9P=9 and k=7k=7, we need P2=81P^{2}=81 sequential decoding steps in raster scan order for the commonly used context model shown in Fig. 8a. The number in each pixel denotes the time step tt at which the pixel is decoded. Since P>k2P>\lceil\frac{k}{2}\rceil is usually satisfied, the currently decoded pixel only depends on some of the previously decoded pixels. For example, the red pixel is decoded currently and the yellow pixels are its context. The blue pixels are previously decoded pixels but are not included in the context of the red pixel. Hence, there are pixels that can be potentially decoded in parallel by revising the scan order. By using 180πarctan(2k+1)\frac{180}{\pi}\cdot\arctan(\frac{2}{k+1}) degree parallel scan, the pixels with the same number tt can be decoded simultaneously, as shown in Fig. 8b. The number of decoding steps is reduced from P2P^{2} to k+32Pk+12\frac{k+3}{2}\cdot P-\frac{k+1}{2}. In this case, we use 14.0414.04^{\circ} parallel scan, leading to 5P4=415P-4=41 sequential decoding steps. The similar scan order was also used in [61]. Moreover, we can remove one context pixel in the upper right of the currently decoded pixel. The newly designed context model leads to 180πarctan(2k1)\frac{180}{\pi}\cdot\arctan(\frac{2}{k-1}) degree parallel scan and reduces decoding steps to k+12Pk12\frac{k+1}{2}\cdot P-\frac{k-1}{2}. As shown in Fig. 8c, we use 18.4318.43^{\circ} parallel scan and 4P3=334P-3=33 decoding steps with this context model. When more upper-right context pixels are removed, the coding parallelization can be further improved while the compression performance is gradually compromised on. As shown in Fig. 8d, the context model leads to 180πarctan(2k3)\frac{180}{\pi}\cdot\arctan(\frac{2}{k-3}) degree parallel scan and k12Pk32\frac{k-1}{2}\cdot P-\frac{k-3}{2} decoding steps, i.e., 26.5726.57^{\circ} parallel scan and 3P2=253P-2=25 decoding steps in our example. When 4545^{\circ} parallel scan is reached, this special case is the zig-zag scan [43] and we need 2P1=172P-1=17 decoding steps, as shown in Fig. 8e. Finally, the fastest case is shown in Fig. 8f. We can use 9090^{\circ} parallel scan and only P=9P=9 decoding steps.

In summary, the proposed design of context coding demonstrates that, given P\times P image patches and a k\times k mask convolution with P>\lceil\frac{k}{2}\rceil, we can design a series of context models \{M^{(k+3)/2}_{k},M^{(k+1)/2}_{k},\ldots,M^{1}_{k}\} leading to \{\frac{k+3}{2}\cdot P-\frac{k+1}{2},\frac{k+1}{2}\cdot P-\frac{k-1}{2},\ldots,P\} parallel decoding steps, by gradually adjusting the context pixels. The corresponding scan angles are \{\frac{180}{\pi}\cdot\arctan(\frac{2}{k+1}),\frac{180}{\pi}\cdot\arctan(\frac{2}{k-1}),\ldots,90^{\circ}\}, respectively. In experiments, we set P=64, k=7 and select the context model M^{3}_{7} shown in Fig. 8d. M^{3}_{7} achieves almost the same compression performance as M_{7}^{5} in Fig. 8a and Fig. 8b, but needs far fewer coding steps.
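The parallel decoding schedules above can be summarized by a simple wavefront rule: under our reading of Fig. 8, context model M_{k}^{m} decodes the pixel at patch coordinates (r,c) at step (m-1)\cdot r+c+1, so the total number of steps is m\cdot P-(m-1). The following sketch (illustrative only) reproduces the step counts quoted above:

import numpy as np

def decoding_schedule(P, m):
    # Time step at which each pixel of a P x P patch is decoded under context
    # model M_k^m; pixels sharing a step are conditionally independent given
    # the causal context and can therefore be decoded in parallel.
    r, c = np.mgrid[0:P, 0:P]
    return (m - 1) * r + c + 1

for m in (5, 4, 3, 2, 1):
    print(f"M_7^{m}: {decoding_schedule(9, m).max()} steps")   # 41, 33, 25, 17, 9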

4.3.2 Adaptive Residual Interval

Since the pixels of both a raw image \mathbf{x} and its lossy reconstruction \tilde{\mathbf{x}} lie in the interval [0,255], each element r_{i,c} of the corresponding residual \mathbf{r} lies in the interval [-255,255]. To entropy code each r_{i,c}, we need to compute and utilize a PMF with 511 elements, which is relatively large and slows down the entropy coding process.

In practice, the theoretical interval [-255,255] of r_{i,c} is rarely fully occupied. Hence, we can compute and record r_{\min}=\min_{i,c}r_{i,c} and r_{\max}=\max_{i,c}r_{i,c} of each image, and reduce the domain of the PMF to the adaptive interval [r_{\min},r_{\max}] with r_{\max}-r_{\min}+1 elements. The overhead of recording r_{\min} and r_{\max} is negligible. For near-lossless image compression with \tau>0, we can similarly compute and record the quantized \hat{r}_{\min}=\min_{i,c}\hat{r}_{i,c} and \hat{r}_{\max}=\max_{i,c}\hat{r}_{i,c} of each image, and the domain of the quantized PMF can be further reduced to the adaptive interval \{\hat{r}_{\min},\hat{r}_{\min}+2\tau+1,\hat{r}_{\min}+2\cdot(2\tau+1),\ldots,\hat{r}_{\max}\}, i.e., \frac{\hat{r}_{\max}-\hat{r}_{\min}}{2\tau+1}+1 elements in total. This reduction of the residual interval significantly accelerates the entropy coding on average.
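A minimal sketch of the adaptive residual interval follows (function and variable names are illustrative): for lossless coding the PMF support shrinks from the 511-element interval [-255,255] to [r_{\min},r_{\max}], and for near-lossless coding the quantized residuals lie on a lattice with spacing 2\tau+1, so only \frac{r_{\max}-r_{\min}}{2\tau+1}+1 symbols remain.

import numpy as np

def adaptive_interval(residual, tau=0):
    # residual: the (quantized) residual of one image, as an integer array.
    r_min, r_max = int(residual.min()), int(residual.max())
    step = 2 * tau + 1                           # lattice spacing of quantized residuals
    support = np.arange(r_min, r_max + 1, step)  # symbols the PMF must cover
    return r_min, r_max, support

r = np.random.randint(-30, 31, size=(3, 64, 64))   # toy residual
print(len(adaptive_interval(r, tau=0)[2]))         # e.g. 61 symbols instead of 511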

5 Experiments

5.1 Experimental Settings

We train the DLPR coding system on the DIV2K high resolution training dataset [62], consisting of 800 2K-resolution RGB images. Although DIV2K was originally built for the image super-resolution task, it contains a large number of high-quality images suitable for training our codec. During training, the 2K images are first cropped into 121,379 non-overlapping patches of size 128\times 128. We then flip these patches horizontally and vertically, each with probability 0.5, and further randomly crop the flipped patches to the size of 64\times 64. We optimize the proposed network for 600 epochs using Adam [63] with minibatches of size 64. The learning rate is initially set to 1\times 10^{-4} and is decayed by a factor of 0.9 at epochs 350, 390, 430, 470, 510, 550 and 590.
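A minimal PyTorch sketch of this training recipe is given below; the model object is a stand-in for the DLPR network, and the transform and scheduler settings simply mirror the description above.

import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomCrop(64),        # 128 x 128 patches -> random 64 x 64 crops
    transforms.ToTensor(),
])

model = torch.nn.Conv2d(3, 3, 3)      # stand-in for the DLPR network (LIC + RC + SQRC)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[350, 390, 430, 470, 510, 550, 590], gamma=0.9)

for epoch in range(600):
    # ... iterate minibatches of size 64, compute the loss in (20),
    #     optimizer.zero_grad(); loss.backward(); optimizer.step() ...
    scheduler.step()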

We evaluate the trained DLPR coding system on six image datasets:

  • ImageNet64. The ImageNet64 validation dataset [64] is a downsampled variant of the ImageNet validation dataset [65], consisting of 50,000 images of size 64\times 64.

  • DIV2K. DIV2K high resolution validation dataset [62] consists of 100 2K color images sharing the same domain with the DIV2K high resolution training dataset.

  • CLIC.p. The CLIC professional validation dataset (https://www.compression.cc/challenge/) consists of 41 color images taken by professional photographers. Most images in CLIC.p are in 2K resolution but some are smaller.

  • CLIC.m. The CLIC mobile validation dataset (https://www.compression.cc/challenge/) consists of 61 color images taken with mobile phones. Most images in CLIC.m are in 2K resolution but some are smaller.

  • Kodak. The Kodak dataset [66] consists of 24 uncompressed 768\times 512 color images, widely used in evaluating lossy image compression methods.

  • Histo24. Besides natural images, we build a Histo24 dataset consisting of 24 uncompressed 768\times 512 histological images, in order to evaluate our codec on images of a different modality. These histological images are randomly cropped from the high-resolution ANHIR dataset [67], which was originally built for the histological image registration task.

The DLPR coding system is implemented with PyTorch. We train the DLPR coding system on an NVIDIA V100 GPU, and evaluate the compression performance and running time on an Intel i9-10900K CPU with 64GB RAM and an NVIDIA RTX 3090 GPU. We use torchac [10], an arithmetic coding tool for PyTorch, for entropy coding.

5.2 Lossless Results of DLPR coding

TABLE I: Lossless image compression performance (bpsp) of the proposed DLPR coding system with \lambda=0, compared with other lossless image codecs on ImageNet64, DIV2K, CLIC.p, CLIC.m, Kodak and Histo24 datasets.
Codec ImageNet64 DIV2K CLIC.p CLIC.m Kodak Histo24
PNG 5.42 4.23 3.93 3.93 4.35 3.79
JPEG-LS [2] 4.45 2.99 2.82 2.53 3.16 3.39
CALIC [3] 4.71 3.07 2.87 2.59 3.18 3.48
JPEG2000 [68] 4.74 3.12 2.93 2.71 3.19 3.36
WebP [52] 4.36 3.11 2.90 2.73 3.18 3.29
BPG [51] 4.42 3.28 3.08 2.84 3.38 3.82
FLIF [4] 4.25 2.91 2.72 2.48 2.90 3.23
JPEG-XL [5] 4.94 2.79 2.63 2.36 2.87 3.07
L3C [10] 4.48 3.09 2.94 2.64 3.26 3.53
RC [11] - 3.08 2.93 2.54 - -
Bit-Swap [12] 5.06 - - - - -
HiLLoC [13] 3.90 - - - - -
IDF [14] 3.90 - - - - -
IDF++ [50] 3.81 - - - - -
LBB [15] 3.70 - - - - -
iVPF [16] 3.75 2.68 2.54 2.39 - -
iFlow [17] 3.65 2.57 2.44 2.26 - -
DLPR (Ours) 3.69 2.55 2.38 2.16 2.86 2.96
TABLE II: Near-lossless image compression performance (bpsp) of the proposed DLPR coding system with \lambda=0.03, compared with near-lossless JPEG-LS, near-lossless CALIC and near-lossless WebP on ImageNet64, DIV2K, CLIC.p, CLIC.m, Kodak and Histo24 datasets. The error bounds of near-lossless WebP are powers of two.
Codec \tau^{*} ImageNet64 DIV2K CLIC.p CLIC.m Kodak Histo24
WebP nll[52] 1 3.61 2.45 2.26 2.11 2.41 2.31
2 3.11 2.04 1.89 1.85 2.01 1.76
4 2.70 1.83 1.73 1.75 1.82 1.73
JPEG-LS[2] 1 4.01 2.62 2.34 2.44 2.90 1.99
2 3.25 2.07 1.80 1.89 2.30 1.58
4 2.49 1.53 1.28 1.35 1.68 1.24
CALIC[8] 1 3.69 2.45 2.18 2.28 2.75 1.78
2 2.94 1.88 1.62 1.70 2.14 1.28
4 2.41 1.31 1.07 1.13 1.51 0.84
DLPR (Ours) 1 2.59 1.69 1.56 1.50 1.81 1.71
2 2.06 1.26 1.13 1.09 1.37 1.23
4 1.55 0.84 0.69 0.67 0.90 0.65

We evaluate the lossless image compression performance of the proposed DLPR coding system, measured in bits per subpixel (bpsp); each RGB pixel has three subpixels. We compare with eight traditional lossless image codecs, including PNG, JPEG-LS [2], CALIC [3], JPEG2000 [68], WebP [52], BPG [51], FLIF [4] and JPEG-XL [5], and nine recent learning-based lossless image compression methods, including L3C [10], RC [11], Bit-Swap [12], HiLLoC [13], IDF [14], IDF++ [50], LBB [15], iVPF [16] and iFlow [17]. The code of Bit-Swap, HiLLoC, IDF, IDF++ and LBB can hardly be applied to practical full-resolution image compression, so these methods are only evaluated on the ImageNet64 dataset. For RC, iVPF and iFlow, we report the compression performance published by their authors, because their code is either unavailable or difficult to generalize to arbitrary datasets. We set \lambda=0 in (20), which leads to the best lossless image compression performance.
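For reference, bpsp is simply the total number of compressed bits divided by the number of subpixels; the sketch below uses hypothetical numbers for a 768\times 512 RGB image.

def bpsp(num_compressed_bytes, height, width, channels=3):
    # bits per subpixel = total compressed bits / (H * W * C)
    return num_compressed_bytes * 8.0 / (height * width * channels)

print(bpsp(num_compressed_bytes=421_888, height=512, width=768))   # about 2.86 bpsp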

As reported in Table I, the proposed DLPR coding system achieves the best lossless compression performance on the DIV2K validation dataset, which shares the same domain with the training dataset. The DLPR coding system also achieves the best compression performance on the CLIC.p, CLIC.m, Kodak and Histo24 datasets, and the second best on the ImageNet64 validation dataset. Although iFlow outperforms our codec on ImageNet64, it is trained on the ImageNet64 training dataset of the same domain, whereas ours is trained on DIV2K. These results demonstrate that the DLPR coding system achieves state-of-the-art lossless image compression performance and generalizes effectively to images of various domains and modalities.

5.3 Near-lossless Results of DLPR coding

Figure 9: Rate-distortion performance of our DLPR coding system compared with other near-lossless image codecs and lossy image codecs on Kodak dataset.
Figure 10: Near-lossless reconstructions of our DLPR coding system on Kodak dataset. (a) Raw/Lossless; (b) \tau=1; (c) \tau=2; (d) \tau=4.
Figure 11: Near-lossless reconstructions of our DLPR coding system on Histo24 dataset. (a) Raw/Lossless; (b) \tau=1; (c) \tau=2; (d) \tau=4.

We next evaluate the near-lossless image compression performance of the proposed DLPR coding system. We set \lambda=0.03, which leads to robust near-lossless compression results across variable \tau's. We compare with near-lossless WebP (WebP nll) [52], near-lossless JPEG-LS [2] and near-lossless CALIC [8], as reported in Table II. Near-lossless WebP adjusts pixel values to meet the \ell_{\infty} error bound \tau and compresses the pre-processed images losslessly. Near-lossless JPEG-LS and CALIC adopt predictive coding schemes and encode the residuals quantized by (9). These three codecs rely on handcrafted pre-processors, predictors and probability estimators, which are not efficient enough for variable \tau's. In contrast, our DLPR coding system is based on jointly trained LIC, RC and SQRC. We employ (9) to realize variable error bounds \tau, and the probability distributions of the quantized residuals are derived from the learned SQRC. Therefore, our DLPR coding system outperforms near-lossless WebP, JPEG-LS and CALIC by a wide margin.
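Since (9) is not reproduced in this section, the sketch below shows the standard uniform \ell_{\infty}-bounded residual quantizer with bin width 2\tau+1 that we assume (9) denotes; it guarantees |r-\hat{r}|\leq\tau and reduces to the identity when \tau=0.

import numpy as np

def quantize_residual(r, tau):
    # Uniform quantizer with bin width 2*tau + 1, centered so that the
    # reconstruction error of each element never exceeds tau.
    step = 2 * tau + 1
    return np.sign(r) * step * np.floor((np.abs(r) + tau) / step)

r = np.arange(-10, 11)
for tau in (0, 1, 2, 4):
    r_hat = quantize_residual(r, tau)
    assert np.all(np.abs(r - r_hat) <= tau)     # the ell_inf bound holds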

Besides existing near-lossless image codecs, we also compare our near-lossless DLPR coding system with six traditional lossy image codecs, i.e., JPEG [69], JPEG2000 [68], WebP [52], BPG [51], Lossy FLIF [4] and VVC [70], and three representative learned lossy image compression methods, i.e., Ballé[MSE] [25], Minnen[MSE] [40] and Cheng[MSE] [33], as shown in Fig. 9. Because recent learned lossy image compression methods are all trained at relatively low bit rates (\leq 2 bpp \approx 0.67 bpsp on Kodak), we re-implement Ballé[MSE], Minnen[MSE] and Cheng[MSE] at high bit rates (\geq 0.8 bpsp on Kodak). Although Cheng[MSE] outperforms Ballé[MSE] and Minnen[MSE] at low bit rates, it performs worse than them at high bit rates because its sophisticated analysis and synthesis transforms hinder it from reaching very high-quality reconstructions. In terms of the rate-distortion performance measured by \ell_{\infty} error, our DLPR coding consistently yields the best results among all codecs. Measured by PSNR, our DLPR coding achieves competitive performance at bit rates above 0.8 bpsp, even though the PSNR of near-lossless reconstructions is not our optimization objective.

In Figs. 10 and 11, we display the near-lossless reconstructed images produced by our DLPR coding at variable \tau's on the Kodak and Histo24 datasets. When \tau is not large, the human eye can hardly differentiate between the raw images and the near-lossless reconstructions.

5.4 Runtime of DLPR coding

We evaluate the runtime of our DLPR coding on images of three different sizes. We compare with four representative traditional lossless image codecs including JPEG-LS [2], BPG [51], FLIF [4] and JPEG-XL [5]. We also compare with the practical learned lossless image compression method L3C [10] and the learned lossy image compression method Minnen[MSE] [40] with a serial autoregressive model. Note that the reported runtime includes both inference time and entropy coding time.

As reported in Table III, our lossless DLPR coding is almost as fast as FLIF in encoding, and much faster than BPG, JPEG-XL, L3C and the lossy Minnen[MSE]. Although our lossless DLPR coding is slower than traditional codecs in decoding, it is still practical for 2K-resolution images and much faster than the learned L3C and lossy Minnen[MSE]. When \tau>0, the near-lossless DLPR coding is even faster because the entropy coding time is reduced by the adaptive residual interval scheme. These results demonstrate the practicality of the DLPR coding system and its potential for real lossless and near-lossless image compression tasks.

TABLE III: Runtime (sec.) of DLPR coding on images of different sizes (encoding/decoding time). For lossless compression (\tau=0), we use \lambda=0. For near-lossless compression (\tau>0), we use \lambda=0.03. *OOM denotes out-of-memory.
Codec 768\times512 996\times756 2040\times1356
JPEG-LS [2] 0.12/0.12 0.23/0.22 0.83/0.80
BPG [51] 2.38/0.13 4.46/0.27 16.52/0.98
FLIF [4] 0.90/0.16 1.84/0.35 7.50/1.35
JPEG-XL [5] 0.73/0.08 12.48/0.14 40.96/0.42
L3C [10] 8.17/7.89 15.25/14.55 OOM*
Minnen[MSE] [40] 2.55/5.18 5.13/10.36 18.71/37.97
DLPR lossless 1.26/1.80 2.28/3.24 8.20/11.91
DLPR (\tau=1) 0.79/1.24 0.98/1.86 2.09/5.56
DLPR (\tau=2) 0.75/1.20 0.92/1.78 1.87/5.24
DLPR (\tau=4) 0.73/1.18 0.89/1.74 1.68/5.03

5.5 Ablation Study

TABLE IV: The relationships between different network architectures of LIC and lossless image compression performance (bpsp). Conv. denotes the 5\times 5 convolutional layers used in [25]. A/S Blks. denotes the proposed analysis and synthesis blocks. Attn. denotes the attention block used in [33]. Swin Attn. denotes the proposed Swin attention blocks.
Conv. [25] A/S Blks. Attn. [33] Swin-Attn. ImageNet64 DIV2K CLIC.m
\checkmark \times \times \times 3.75 (+0.06) 2.60 (+0.05) 2.20 (+0.04)
\times \checkmark \times \times 3.71 (+0.02) 2.57 (+0.02) 2.18 (+0.02)
\times \checkmark \checkmark \times 3.71 (+0.02) 2.56 (+0.01) 2.17 (+0.01)
\times \checkmark \times \checkmark 3.69 2.55 2.16
TABLE V: The relationships between different network architectures of RC and lossless image compression performance (bpsp).
\mathbf{u} C_{\mathbf{r}} ImageNet64 DIV2K CLIC.m
\checkmark \times 4.03 (+0.34) 2.91 (+0.36) 2.47 (+0.31)
\times \checkmark 3.78 (+0.09) 2.64 (+0.09) 2.24 (+0.08)
\checkmark \checkmark 3.69 2.55 2.16

Network architectures of LIC and RC. In Table IV, we study the relationships between different network architectures of LIC and lossless compression performance. Compared with the 5\times 5 convolutional layers used in [25] and the attention blocks used in [33], the proposed analysis/synthesis blocks and Swin attention blocks can effectively improve the lossless image compression performance. These results also demonstrate that LIC plays an important role in our DLPR coding system.

In Table V, we study the relationships between different network architectures of RC and lossless image compression performance. Both the feature \mathbf{u} and the context C_{\mathbf{r}} effectively improve the lossless image compression performance, which demonstrates the effectiveness of the proposed network architecture of RC. In Table VI, we further compare the lossless and near-lossless image compression performance of the logistic mixture model, the Gaussian single model and the Gaussian mixture model for RC and SQRC. The logistic mixture model and the Gaussian mixture model achieve almost identical performance, and both outperform the Gaussian single model on the complex real distributions of \mathbf{r} and \hat{\mathbf{r}}. We utilize the logistic mixture model because its cumulative distribution function (CDF) is a sigmoid function, making it easier to compute the probability of each discrete residual using (18). In contrast, the CDF of the Gaussian distribution is the more complex Gauss error function.
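As a sketch of why the sigmoid CDF is convenient, the probability of a discrete residual value under a K-component logistic mixture is a weighted difference of two sigmoids per component; the exact parameterization of (18) may differ, and the names below are illustrative.

import torch

def logistic_mixture_pmf(v, weights, means, log_scales):
    # Probability of integer value v under a K-component discretized logistic
    # mixture: each bin probability is a difference of two sigmoid CDFs.
    inv_s = torch.exp(-log_scales)
    cdf_hi = torch.sigmoid((v + 0.5 - means) * inv_s)
    cdf_lo = torch.sigmoid((v - 0.5 - means) * inv_s)
    return (weights * (cdf_hi - cdf_lo)).sum(dim=-1)

w = torch.softmax(torch.randn(5), dim=-1)            # K = 5 mixture weights
mu, log_s = torch.randn(5), torch.randn(5)
p = logistic_mixture_pmf(torch.tensor(3.0), w, mu, log_s)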

TABLE VI: Lossless and near-lossless image compression performance (bpsp) resulting from the logistic mixture model (lmm.) with K=5, the Gaussian single model (gsm.) and the Gaussian mixture model (gmm.) with K=5.
model ImageNet64 DIV2K CLIC.m
lmm.(lossless) 3.69 2.55 2.16
gsm. (lossless) 3.78 (+0.09) 2.57 (+0.02) 2.22 (+0.06)
gmm.(lossless) 3.70 (+0.01) 2.55 (=) 2.17 (+0.01)
lmm. (\tau=1) 2.59 1.69 1.50
gsm. (\tau=1) 2.61 (+0.02) 1.72 (+0.03) 1.52 (+0.02)
gmm. (\tau=1) 2.59 (=) 1.68 (-0.01) 1.50 (=)

\tau-specific vs. scalable. As discussed in Sec. 4.1.3, a \tau-specific near-lossless image compression scheme leads to the challenging problem of relaxing residual quantization. To compare with our scalable near-lossless compression scheme, we realize \tau-specific models by relaxing residual quantization with the straight-through estimator (copying gradients from the quantized \hat{\mathbf{r}} to the original \mathbf{r}). As shown in Fig. 12, the resulting \tau-specific models perform worse than our scalable model in most cases, due to the gradient bias during training.

Figure 12: Comparisons between the \tau-specific scheme and the proposed scalable near-lossless compression scheme.

In Fig. 13, we show the lossless to near-lossless image compression performance (\tau\in\{0,1,\ldots,5\}) of each image compressed by our scalable model with \lambda=0.03 on the Kodak dataset. On average, R_{\hat{\mathbf{y}},\hat{\mathbf{z}}} accounts for about 16% of R_{\hat{\mathbf{y}},\hat{\mathbf{z}}}+R_{\mathbf{r}} at \tau=0. As \tau increases, the bit rate R_{\hat{\mathbf{r}}}^{\tau} of the quantized residual \hat{\mathbf{r}} is significantly reduced. In particular, the \tau=1 near-lossless mode saves about 39% of the bit rate compared with the \tau=0 lossless mode.

Figure 13: Lossless to near-lossless image compression performance (\tau\in\{0,1,\ldots,5\}) of each image compressed by the DLPR coding system with \lambda=0.03 on Kodak dataset.

SQRC for bias correction. In Fig. 14, we demonstrate the efficacy of SQRC for bias correction. Because of the discrepancy between the oracle \hat{p}_{\boldsymbol{\theta}}(\hat{\mathbf{r}}|\mathbf{u},C_{\mathbf{r}}) and the biased \hat{p}_{\boldsymbol{\theta}}(\hat{\mathbf{r}}|\mathbf{u},C_{\hat{\mathbf{r}}}), encoding \hat{\mathbf{r}} with \hat{p}_{\boldsymbol{\theta}}(\hat{\mathbf{r}}|\mathbf{u},C_{\hat{\mathbf{r}}}) (without SQRC) degrades the compression performance. Instead, we encode \hat{\mathbf{r}} with \hat{p}_{\boldsymbol{\varphi}}(\hat{\mathbf{r}}|\mathbf{u},C_{\hat{\mathbf{r}}},\tau) (with SQRC), resulting in lower bit rates. As \tau increases, the PMF of \mathbf{r} is quantized with larger bins and becomes coarser, so the compression performance with SQRC approaches the oracle. However, the gap between the performance without SQRC and the oracle remains large, because the discrepancy between \hat{p}_{\boldsymbol{\theta}}(\hat{\mathbf{r}}|\mathbf{u},C_{\mathbf{r}}) and \hat{p}_{\boldsymbol{\theta}}(\hat{\mathbf{r}}|\mathbf{u},C_{\hat{\mathbf{r}}}) is also magnified with increasing \tau. Note that the compression performance of our DLPR coding without SQRC is still better than near-lossless WebP, JPEG-LS and CALIC.

Figure 14: Near-lossless image compression performance of the scalable DLPR coding system with the oracle \hat{p}_{\boldsymbol{\theta}}(\hat{\mathbf{r}}|\mathbf{u},C_{\mathbf{r}}), with SQRC and without SQRC.

Discussion on different \lambda's. The \lambda in (20) balances the lossless compression rate of the raw image \mathbf{x} against the distortion of the lossy reconstruction \tilde{\mathbf{x}}. In Table VII, we evaluate the effects of different \lambda's \in\{0,0.001,0.03,0.06\} on lossless image compression performance on the Kodak dataset. As \lambda increases, both the rate R_{\hat{\mathbf{y}},\hat{\mathbf{z}}} and the PSNR of \tilde{\mathbf{x}} become higher. Although the residual rate R_{\mathbf{r}} decreases at the same time, its decrease is smaller than the increase of R_{\hat{\mathbf{y}},\hat{\mathbf{z}}}, leading to a degradation of lossless image compression performance; \lambda=0 yields the best lossless compression performance. We visualize the lossy reconstructions \tilde{\mathbf{x}}, residuals \mathbf{r} and features \mathbf{u} resulting from different \lambda's in Fig. 15. Interestingly, DLPR coding with \lambda=0 learns to set \tilde{\mathbf{x}}=\mathbf{0} and \mathbf{r}=\mathbf{x}. The LIC then becomes a special feature compressor extracting the feature \mathbf{u} for lossless image compression (shown to be effective in Table V). This special case of DLPR coding with \lambda=0 is similar to PixelVAE [71].

In Fig. 16, we further study the effects of different \lambda's on the near-lossless image compression performance. Although \lambda=0 leads to the best lossless compression performance, it is unsuitable for near-lossless compression, since the residual quantization (9) would then be applied directly to \mathbf{r}=\mathbf{x}. For near-lossless compression, we set \lambda=0.03. Compared with \lambda=0 and 0.001, the quantized residual \hat{\mathbf{r}} of \lambda=0.03 has much lower entropy and smaller context bias, since most elements of \mathbf{r} and \hat{\mathbf{r}} are zero or close to zero. The reduction of R_{\hat{\mathbf{r}}}^{\tau} at \lambda=0.03 compensates for the larger R_{\hat{\mathbf{y}},\hat{\mathbf{z}}} as \tau increases. Compared with \lambda=0.06, \lambda=0.03 enjoys a similar R_{\hat{\mathbf{r}}}^{\tau} but a lower R_{\hat{\mathbf{y}},\hat{\mathbf{z}}}. Therefore, \lambda=0.03 achieves the most robust near-lossless image compression performance, only slightly worse than \lambda=0 and 0.001 at \tau=1.

TABLE VII: The effects of different \lambda's on the lossless image compression performance (bpsp) on Kodak dataset.
\lambda Total Rate R_{\hat{\mathbf{y}},\hat{\mathbf{z}}} R_{\mathbf{r}} \tilde{\mathbf{x}} (PSNR, dB)
0 2.86 0.04 2.82 6.78
0.001 2.94 0.13 2.81 29.80
0.03 2.99 0.49 2.50 38.77
0.06 3.02 0.59 2.43 39.74
Figure 15: Visual examples of lossy reconstructions, residuals and features \mathbf{u} with different \lambda's. When \lambda=0, the LIC becomes a special feature compressor extracting the feature \mathbf{u} for lossless image compression. The DLPR coding with \lambda=0 is similar to PixelVAE [71].
Figure 16: The effects of different \lambda's on the near-lossless image compression performance on Kodak dataset.
Figure 17: Rate-distortion performance of LIC in DLPR coding at different \lambda's on Kodak dataset.
TABLE VIII: Lossless image compression performance (bpsp) of different context models on ImageNet64, DIV2K, CLIC.p, CLIC.m, Kodak and Histo24 datasets. The context model M_{7}^{5} is set as the anchor.
Context (7\times 7) ImageNet64 DIV2K CLIC.p CLIC.m Kodak Histo24
M_{7}^{5} (Fig. 8a & 8b) 3.69 2.55 2.37 2.15 2.86 2.96
M_{7}^{4} (Fig. 8c) 3.70 (+0.01) 2.56 (+0.01) 2.39 (+0.02) 2.16 (+0.01) 2.87 (+0.01) 2.97 (+0.01)
M_{7}^{3} (Fig. 8d, Selected) 3.69 (=) 2.55 (=) 2.38 (+0.01) 2.16 (+0.01) 2.86 (=) 2.96 (=)
M_{7}^{2} (Fig. 8e) 3.71 (+0.02) 2.59 (+0.04) 2.40 (+0.03) 2.19 (+0.04) 2.95 (+0.09) 3.01 (+0.05)
M_{7}^{1} (Fig. 8f) 3.88 (+0.19) 2.74 (+0.19) 2.56 (+0.19) 2.34 (+0.19) 2.95 (+0.09) 3.24 (+0.28)
checkerboard 3.86 (+0.17) 2.73 (+0.18) 2.54 (+0.17) 2.33 (+0.18) 2.95 (+0.09) 3.19 (+0.23)
channel-only 4.03 (+0.34) 2.91 (+0.36) 2.70 (+0.33) 2.47 (+0.32) 3.09 (+0.23) 3.48 (+0.52)
w/o context 4.47 (+0.78) 3.37 (+0.82) 3.19 (+0.82) 3.14 (+0.99) 3.61 (+0.75) 3.45 (+0.49)
TABLE IX: Runtime (sec.) of lossless DLPR coding with different context models and adaptive residual interval (AdaRI.) on Kodak dataset.
Context (7\times 7) AdaRI. Enc./Dec. Time
M_{7}^{5} (Fig. 8a, Serial) \times 12.46/23.24
M_{7}^{5} (Fig. 8a, Serial) \checkmark 11.93 (-0.53)/22.74 (-0.50)
M_{7}^{5} (Fig. 8b) \checkmark 1.47 (-10.99)/2.30 (-20.94)
M_{7}^{3} (Fig. 8d, Selected) \checkmark 1.24 (-11.22)/1.75 (-21.49)

Rate-distortion performance of LIC. We conduct an ablation study on the rate-distortion performance of the LIC in our DLPR coding framework in Fig. 17. Besides the lossy reconstruction \tilde{\mathbf{x}}, our LIC also generates the feature \mathbf{u} of the residual \mathbf{r} and is jointly trained with RC. Therefore, the rate R_{\hat{\mathbf{y}},\hat{\mathbf{z}}} of our LIC carries not only the information of the lossy reconstruction \tilde{\mathbf{x}} but also part of the information of the residual \mathbf{r}. As a result, the rate-distortion performance of our LIC in the DLPR coding framework lies between Ballé[MSE] [25] and JPEG2000 on the Kodak dataset.

Context design and adaptive residual interval. In Table VIII, we study the lossless image compression performance resulting from the designed 7\times 7 context models in Fig. 8. The context model M_{7}^{5} is set as the anchor. Based on the experimental results, the selected context model M_{7}^{3}, which removes two upper-right context pixels from M_{7}^{5}, achieves almost the same lossless compression performance as M_{7}^{5}, and is even slightly better than M_{7}^{4}. At the same time, M_{7}^{3} reduces the number of parallel coding steps by about 40%, from 5P-4 to 3P-2, compared with M_{7}^{5}. Although the context models M_{7}^{2} (zig-zag) and M_{7}^{1} can be faster, the corresponding lossless compression performance drops significantly. Besides, we also show the compression performance resulting from the checkerboard context model [45]. Without the help of transform coding, the checkerboard context model is only slightly better than M_{7}^{1} and worse than M_{7}^{2} in our DLPR coding framework. We further switch off the spatial context models and show the compression performance resulting from only the channel-wise autoregressive scheme in (15) and (16). The channel-only setting cannot reduce spatial redundancies among residuals, resulting in worse compression performance. We also evaluate the compression performance without both spatial and channel context models. Compared with the channel-only setting, the w/o context setting, which ignores channel redundancies, results in the worst compression performance, except on Histo24. This is because the color properties of stained histological images differ from natural images, and the channel autoregressive model trained on DIV2K does not generalize as well on Histo24 as on the other datasets. Finally, Table IX shows that the designed context models and the adaptive residual interval scheme effectively reduce the runtime of DLPR coding in practice.

6 Conclusion

In this paper, we propose a unified DLPR coding framework for both lossless and near-lossless image compression. The DLPR coding framework consists of a lossy image compressor, a residual compressor and a scalable quantized residual compressor, which is formulated in terms of VAEs and solved with end-to-end training. The framework supports scalable near-lossless image compression with variable \ell_{\infty} error bounds \tau in a single network, instead of multiple networks for different \tau's. We further propose a novel design of context coding and an adaptive residual interval scheme to significantly accelerate the coding process. Extensive experiments demonstrate that the DLPR coding system achieves not only state-of-the-art compression performance, but also competitive coding speed for practical full-resolution image compression tasks.

References

  • [1] C. E. Shannon, “A mathematical theory of communication,” The Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [2] M. J. Weinberger, G. Seroussi, and G. Sapiro, “The loco-i lossless image compression algorithm: Principles and standardization into jpeg-ls,” IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1309–1324, 2000.
  • [3] X. Wu and N. Memon, “Context-based, adaptive, lossless image coding,” IEEE Transactions on Communications, vol. 45, no. 4, pp. 437–444, 1997.
  • [4] J. Sneyers and P. Wuille, “Flif: Free lossless image format based on maniac compression,” in IEEE International Conference on Image Processing.   IEEE, 2016, pp. 66–70.
  • [5] J. Alakuijala, R. Van Asseldonk, S. Boukortt, M. Bruse, I.-M. Comsa, M. Firsching, T. Fischbacher, E. Kliuchnikov, S. Gomez, R. Obryk et al., “Jpeg xl next-generation image compression architecture and coding tools,” in Applications of Digital Image Processing XLII, vol. 11137.   SPIE, 2019, pp. 112–124.
  • [6] R. Ansari, N. D. Memon, and E. Ceran, “Near-lossless image compression techniques,” Journal of Electronic Imaging, vol. 7, no. 3, pp. 486 – 494, 1998.
  • [7] L. Ke and M. W. Marcellin, “Near-lossless image compression: minimum-entropy, constrained-error dpcm,” IEEE Transactions on Image Processing, vol. 7, no. 2, pp. 225–228, 1998.
  • [8] X. Wu and P. Bao, “\ell_{\infty} constrained high-fidelity image compression via adaptive context modeling,” IEEE Transactions on Image Processing, vol. 9, no. 4, pp. 536–542, 2000.
  • [9] Y. Hu, W. Yang, Z. Ma, and J. Liu, “Learning end-to-end lossy image compression: A benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4194–4211, 2022.
  • [10] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, “Practical full resolution learned lossless image compression,” in IEEE Conference on Computer Vision and Pattern Recogition, 2019, pp. 10 621–10 630.
  • [11] F. Mentzer, L. Van Gool, and M. Tschannen, “Learning better lossless compression using lossy compression,” in IEEE Conference on Computer Vision and Pattern Recogition, 2020.
  • [12] F. H. Kingma, P. Abbeel, and J. Ho, “Bit-swap: Recursive bits-back coding for lossless compression with hierarchical latent variables,” in International Conference on Machine Learning, 2019.
  • [13] J. Townsend, T. Bird, J. Kunze, and D. Barber, “Hilloc: Lossless image compression with hierarchical latent variable models,” in International Conference on Learning Representations, 2020.
  • [14] E. Hoogeboom, J. Peters, R. van den Berg, and M. Welling, “Integer discrete flows and lossless compression,” in Advances in Neural Information Processing Systems, 2019, pp. 12 134–12 144.
  • [15] J. Ho, E. Lohn, and P. Abbeel, “Compression with flows via local bits-back coding,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [16] S. Zhang, C. Zhang, N. Kang, and Z. Li, “ivpf: Numerical invertible volume preserving flow for efficient lossless compression,” in IEEE Conference on Computer Vision and Pattern Recogition, 2021, pp. 620–629.
  • [17] S. Zhang, N. Kang, T. Ryder, and Z. Li, “iflow: Numerically invertible flows for efficient lossless compression via a uniform coder,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [18] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in International Conference on Machine Learning.   JMLR.org, 2016, pp. 1747–1756.
  • [19] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, “Conditional image generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
  • [20] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications,” in International Conference on Learning Representations, 2017.
  • [21] I. Kobyzev, S. J. Prince, and M. A. Brubaker, “Normalizing flows: An introduction and review of current methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 3964–3979, 2020.
  • [22] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations, 2014.
  • [23] Y. Bai, X. Liu, W. Zuo, Y. Wang, and X. Ji, “Learning scalable \ell_{\infty}-constrained near-lossless image compression via joint lossy image and residual compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp. 11 946–11 955.
  • [24] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in International Conference on Learning Representations, 2017.
  • [25] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in International Conference on Learning Representations, 2018.
  • [26] J. Townsend, T. Bird, and D. Barber, “Practical lossless compression with latent variables using bits back coding,” in International Conference on Learning Representations, 2019.
  • [27] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” in International Conference on Learning Representations, 2016.
  • [28] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in IEEE Conference on Computer Vision and Pattern Recogition, 2017, pp. 5306–5314.
  • [29] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in IEEE Conference on Computer Vision and Pattern Recogition, 2018, pp. 4385–4393.
  • [30] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in International Conference on Learning Representations, 2017.
  • [31] V. K. Goyal, “Theoretical foundations of transform coding,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, 2001.
  • [32] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2.   Ieee, 2003, pp. 1398–1402.
  • [33] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in IEEE Conference on Computer Vision and Pattern Recogition, 2020, pp. 7939–7948.
  • [34] H. Ma, D. Liu, N. Yan, H. Li, and F. Wu, “End-to-end optimized versatile image compression with wavelet-like transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1247–1263, 2022.
  • [35] Y. Zhu, Y. Yang, and T. Cohen, “Transformer-based transform coding,” in International Conference on Learning Representations, 2022.
  • [36] M. Li, W. Zuo, S. Gu, J. You, and D. Zhang, “Learning content-weighted deep image compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [37] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, “Conditional probability models for deep image compression,” in IEEE Conference on Computer Vision and Pattern Recogition, 2018, pp. 4394–4402.
  • [38] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [39] Y. Hu, W. Yang, and J. Liu, “Coarse-to-fine hyper-prior modeling for learned image compression,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 013–11 020.
  • [40] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10 771–10 780.
  • [41] Y. Qian, Z. Tan, X. Sun, M. Lin, D. Li, Z. Sun, H. Li, and R. Jin, “Learning accurate entropy model with global reference for image compression,” in International Conference on Learning Representations, 2021.
  • [42] Y. Qian, M. Lin, X. Sun, Z. Tan, and R. Jin, “Entroformer: A transformer-based entropy model for learned image compression,” in International Conference on Learning Representations, 2022.
  • [43] M. Li, K. Ma, J. You, D. Zhang, and W. Zuo, “Efficient and effective context-based convolutional entropy modeling for image compression,” IEEE Transactions on Image Processing, vol. 29, pp. 5900–5911, 2020.
  • [44] D. Minnen and S. Singh, “Channel-wise autoregressive entropy models for learned image compression,” in IEEE International Conference on Image Processing.   IEEE, 2020, pp. 3339–3343.
  • [45] D. He, Y. Zheng, B. Sun, Y. Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” in IEEE Conference on Computer Vision and Pattern Recogition, 2021.
  • [46] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
  • [47] J. Duda, “Asymmetric numeral systems,” arXiv preprint arXiv:0902.0271, 2009.
  • [48] S. Reed, A. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. Freitas, “Parallel multiscale autoregressive density estimation,” in International Conference on Machine Learning.   PMLR, 2017, pp. 2912–2921.
  • [49] M. Zhang, A. Zhang, and S. McDonagh, “On the out-of-distribution generalization of probabilistic image modelling,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [50] R. v. d. Berg, A. A. Gritsenko, M. Dehghani, C. K. Sønderby, and T. Salimans, “Idf++: Analyzing and improving integer discrete flows for lossless compression,” in International Conference on Learning Representations, 2021.
  • [51] F. Bellard, “BPG image format,” https://bellard.org/bpg/.
  • [52] Google, “WebP image format,” https://developers.google.com/speed/webp/.
  • [53] K. Chen and T. V. Ramabadran, “Near-lossless compression of medical images through entropy-coded dpcm,” IEEE Transactions on Medical Imaging, vol. 13, no. 3, pp. 538–548, 1994.
  • [54] X. Zhang and X. Wu, “Ultra high fidelity deep image decompression with \ell_{\infty}-constrained compression,” IEEE Transactions on Image Processing, vol. 30, pp. 963–975, 2021.
  • [55] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).   USA: Wiley-Interscience, 2006.
  • [56] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recogition, 2016, pp. 770–778.
  • [57] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, “Residual non-local attention networks for image restoration,” in International Conference on Learning Representations, 2019.
  • [58] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [59] J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” in International Conference on Learning Representations, 2016.
  • [60] Y. Choi, M. El-Khamy, and J. Lee, “Variable rate deep image compression with a conditional autoencoder,” in International Conference on Computer Vision, 2019, pp. 3146–3154.
  • [61] M. Zhang, J. Townsend, N. Kang, and D. Barber, “Parallel neural local lossless compression,” arXiv preprint arXiv:2201.05213, 2022.
  • [62] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: dataset and study,” in IEEE Conference on Computer Vision and Pattern Recogition Workshop, July 2017.
  • [63] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
  • [64] P. Chrabaszcz, I. Loshchilov, and F. Hutter, “A downsampled variant of imagenet as an alternative to the cifar datasets,” arXiv preprint arXiv:1707.08819, 2017.
  • [65] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recogition.   Ieee, 2009, pp. 248–255.
  • [66] E. Kodak, “Kodak lossless true color image suite (photocd pcd0992),” http://r0k.us/graphics/kodak/, 1993.
  • [67] J. Borovec, J. Kybic, I. Arganda-Carreras, D. V. Sorokin, G. Bueno, A. V. Khvostikov, S. Bakas, I. Eric, C. Chang, S. Heldmann et al., “Anhir: automatic non-rigid histological image registration challenge,” IEEE Transactions on Medical Imaging, vol. 39, no. 10, pp. 3042–3052, 2020.
  • [68] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The jpeg 2000 still image compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, 2001.
  • [69] G. K. Wallace, “The jpeg still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
  • [70] J.-R. Ohm and G. J. Sullivan, “Versatile video coding–towards the next generation of video compression,” in Picture Coding Symposium, 2018.
  • [71] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, “Pixelvae: A latent variable model for natural images,” in International Conference on Learning Representations, 2017.
[Uncaptioned image] Yuanchao Bai (Member, IEEE) received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2013. He received the Ph.D. degree in computer science from Peking University, Beijing, China, in 2020. He was a postdoctoral fellow in Peng Cheng Laboratory, Shenzhen, China, from 2020 to 2022. He is currently an assistant professor with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. His research interests include image/video compression and processing, deep unsupervised learning, and graph signal processing.
[Uncaptioned image] Xianming Liu (Member, IEEE) is a Professor with the School of Computer Science and Technology, Harbin Institute of Technology (HIT), Harbin, China. He received the B.S., M.S., and Ph.D degrees in computer science from HIT, in 2006, 2008 and 2012, respectively. In 2011, he spent half a year at the Department of Electrical and Computer Engineering, McMaster University, Canada, as a visiting student, where he then worked as a post-doctoral fellow from December 2012 to December 2013. He worked as a project researcher at National Institute of Informatics (NII), Tokyo, Japan, from 2014 to 2017. He has published over 60 international conference and journal publications, including top IEEE journals, such as T-IP, T-CSVT, T-IFS, T-MM, T-GRS; and top conferences, such as CVPR, IJCAI and DCC. He is the receipt of IEEE ICME 2016 Best Student Paper Award.
[Uncaptioned image] Kai Wang received the B.S. degree in software engineering from Harbin Engineering University, Harbin, China, in 2020 and received the M.S. degree of electronic information in software engineering from Harbin Institute of Technology, Harbin, China, in 2022. He is currently pursuing the docter degree in electronic information in Harbin Institute of Technology, Harbin, China. His research interests include image/video compression and deep learning.
[Uncaptioned image] Xiangyang Ji (Member, IEEE) received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He joined Tsinghua University, Beijing, in 2008, where he is currently a Professor with the Department of Automation, School of Information Science and Technology. He has authored over 100 referred conference and journal papers. His current research interests include signal processing, image/video compressing, and intelligent imaging.
[Uncaptioned image] Xiaolin Wu (Fellow, IEEE) received the B.Sc. degree in computer science from Wuhan University, China, in 1982, and the Ph.D. degree in computer science from the University of Calgary, Canada, in 1988. He started his academic career in 1988. He was a Faculty Member with Western University, Canada, and New York Polytechnic University (NYU-Poly), USA. He is currently with McMaster University, Canada, where he is a Distinguished Engineering Professor and holds an NSERC Senior Industrial Research Chair. His research interests include image processing, data compression, digital multimedia, low-level vision, and network-aware visual communication. He has authored or coauthored more than 300 research articles and holds four patents in these fields. He served on technical committees of many IEEE international conferences/workshops on image processing, multimedia, data compression, and information theory. He was a past Associated Editor of IEEE TRANSACTIONS ON MULTIMEDIA. He is also an Associated Editor of IEEE TRANSACTIONS ON IMAGE PROCESSING.
[Uncaptioned image] Wen Gao (Fellow, IEEE) received the Ph.D. degree in electronics engineering from The University of Tokyo, Japan, in 1991. He is currently a Boya Chair Professor in computer science at Peking University. He is the Director of Peng Cheng Laboratory, Shenzhen. Before joining Peking University, he was a Professor with Harbin Institute of Technology from 1991 to 1995. From 1996 to 2006, he was a Professor at the Institute of Computing Technology, Chinese Academy of Sciences. He has authored or coauthored five books and over 1000 technical articles in refereed journals and conference proceedings in the areas of image processing, video coding and communication, computer vision, multimedia retrieval, multimodal interface, and bioinformatics. He served on the editorial boards for several journals, such as ACM CSUR, IEEE TRANSACTIONS ON IMAGE PROCESSING (TIP), IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), and IEEE TRANSACTIONS ON MULTIMEDIA (TMM). He served on the advisory and technical committees for professional organizations. He was the Vice President of the National Natural Science Foundation (NSFC) of China from 2013 to 2018 and the President of China Computer Federation (CCF) from 2016 to 2020. He is the Deputy Director of China National Standardization Technical Committees. He is an Academician of the Chinese Academy of Engineering and a fellow of ACM. He chaired a number of international conferences, such as IEEE ICME 2007, ACM Multimedia 2009, and IEEE ISCAS 2013.