Learned Image Compression with Gaussian-Laplacian-Logistic Mixture Model and Concatenated Residual Modules
Abstract
Recently, deep learning-based image compression methods have made significant progress and have gradually outperformed traditional approaches, including the latest standard Versatile Video Coding (VVC), in both PSNR and MS-SSIM metrics. Two key components of learned image compression are the entropy model of the latent representations and the encoding/decoding network architectures. Various models have been proposed, such as autoregressive, softmax, logistic mixture, Gaussian mixture, and Laplacian. Existing schemes only use one of these models. However, due to the vast diversity of images, it is not optimal to use one model for all images, or even for different regions within one image. In this paper, we propose a more flexible discretized Gaussian-Laplacian-Logistic mixture model (GLLMM) for the latent representations, which can adapt to different contents in different images, and to different regions of one image, more accurately and efficiently, given the same complexity. In addition, for the encoding/decoding network design, we propose a concatenated residual module (CRM), where multiple residual blocks are serially connected with additional shortcut connections. The CRM improves the learning ability of the network, which further improves the compression performance. Experimental results using the Kodak, Tecnick-100, and Tecnick-40 datasets show that the proposed scheme outperforms all the leading learning-based methods and existing compression standards, including VVC intra coding (4:4:4 and 4:2:0), in terms of both PSNR and MS-SSIM. The source code is available at https://github.com/fengyurenpingsheng.
Index Terms:
Deep learning-based image compression, Entropy coding, Gaussian Mixture Model, Residual Network.
I Introduction
Image compression is an important step in many applications. The classical approaches, e.g., JPEG [1], JPEG 2000 [2], and BPG (intra-frame coding of H.265/HEVC) [3], mainly use techniques such as linear transform, quantization, and entropy coding to remove the redundancy of the input and achieve good rate-distortion (R-D) performance, as illustrated in Fig. 1. Recently, deep learning-based methods have been investigated, where the three main components are re-designed based on the properties of neural networks. This approach has gradually outperformed traditional methods in both PSNR and MS-SSIM metrics [4], and shows great potential.

The most important difference of learning-based schemes is that the classic linear transform is replaced by a non-linear neural network, which is learned from training data. Therefore, how to design the network architecture to reduce the correlation of the latent representations, which are the output of the encoding network (also known as the feature maps), is critical to the performance of learned image compression schemes. Another important task is to design a good probability model to capture the remaining correlation of the latent representations, so that they can be encoded efficiently.
I-A Encoder/Decoder Architectures
Table I: Representative encoder/decoder architectures for learned image compression.

Methods | Highlights
---|---
RNN [5, 6] | The first work on end-to-end LSTM-based RNN for variable-rate image compression.
GDN [7] | GDN is first used in an image compression framework, showing great potential.
Residual Network [8] | The residual network is first used for CNN-based image compression.
GAN [9] | The first to use GAN for image compression. A multi-scale framework is also proposed.
Importance Map [10] | An importance map is introduced to achieve content-aware bit allocation.
Non-Local Attention Module [11] | A non-local attention module is introduced to capture long-range correlation.
Simplified Attention Model [12] | The non-local module in [11] is simplified. The Gaussian mixture model (GMM) is also used.
Octave Convolution [13] | The multi-resolution octave convolution is employed in an image compression framework.
iWave++ [14] | A lifting-based network similar to the wavelet transform is proposed, supporting both lossy and lossless compression.
Transformer [15] | The vision transformer is combined with convolutional layers to boost compression performance.
Table II: Typical entropy coding models for learned image compression.

Methods | Highlights
---|---
PixelCNN [16] | Masked convolutional networks are used to predict the distribution of each pixel value.
PixelCNN++ [17] | The discretized logistic mixture model is used to reduce the complexity of PixelCNN.
3D-CNN [18] | The 2D-CNN model in [17] is generalized to 3D. The importance map in [10] is also used.
Hyperprior [19] | The hyperprior is first introduced. A zero-mean Gaussian model with a scale parameter is used.
GMM and autoregressive models [20] | The GMM and an autoregressive model are developed for the hyperprior method.
Context Model [21] | A context-adaptive entropy model is proposed with two types of contexts.
Laplace-smoothed histogram [8] | The Laplace-smoothed histogram is used for entropy coding.
Laplacian [22] | The Laplacian distribution is used for lossy compression.
Logistic Mixture [23] | The discretized logistic mixture distribution is used.
GMM and quality enhancement network [24] | The GMM and a quality enhancement subnetwork are used to achieve the state of the art.
Multivariate GMM [25] | Multivariate GMM and vector quantization are used. The parameters are estimated in a cascaded approach.
Table I summarizes some representative encoder/decoder architectures for learned image coding.
One of the first learned image compression schemes was proposed in [5], which was based on the long short-term memory (LSTM) recurrent neural network (RNN) and was used to compress thumbnail images. In [6], the scheme in [5] is generalized to full-resolution images.
In [7, 26], an encoder network that includes convolution, downsampling, and the generalized divisive normalization (GDN) is proposed. It was the first learned image compression scheme that achieved better performance than JPEG2000 in terms of both PSNR and MS-SSIM.
Most recent schemes are based on the autoencoder framework, whose latent representations have a much lower dimension than the input, making it well suited for data compression.
A powerful building block of many cutting-edge neural networks is the residual block, first proposed in ResNet [27], which uses shortcut connections to facilitate the design and training of deep networks, and can effectively improve the performance of many computer vision tasks. As a result, the residual block has also been used in many learning-based image compression schemes.
In [8], the residual block concept in [27] was used in the autoencoder architecture, which also achieved comparable performance to JPEG2000.
Generative adversarial network (GAN) is another powerful framework for many applications, and has also been used in several learned image coding schemes [9, 28, 29].
In [10], an importance map is introduced to achieve content-adaptive bit allocation in different regions, but the importance mask needs to be sent to the decoder. In [11], a non-local attention module is used to capture long-range correlation and generate attention masks without sending side information; it is used in both the core network and the hyper network. In [12], the attention module in [11] is simplified.
In [30, 13], a multi-resolution network architecture based on the octave convolution in [31] is developed, similar to the wavelet transform. In [14], another wavelet-like scheme is proposed based on the lifting scheme, which supports both lossy and lossless compression.
In [15], the vision transformer framework is introduced and combined with convolutional layers to boost the image compression performance.
I-B Entropy Coding Models
Table II is an overview of typical entropy coding models used for learned image coding.
In earlier schemes [32, 8, 26], the quantized latent representations (latents) are assumed to be independent and identically distributed, following a simple marginal distribution. Once the parameters of the distribution are trained, the probabilities of all latents are fixed for all images, and are used by the entropy encoding and decoding of the quantized latents. Since a fixed entropy model is used, the performance of these methods is compromised.
To improve the entropy coding performance, recent methods consider the correlation of neighboring pixels or latents (contexts), and achieve image-adaptive entropy coding. To estimate the joint probability with manageable complexity, the chain rule is usually used, which factorizes the joint probability into products of conditional probabilities.
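In symbols, this standard factorization can be written as (stated here for reference, using $\hat{y}_i$ for the $i$-th quantized latent, consistent with the notation of Sec. II):

```latex
p(\hat{y}) \;=\; \prod_{i} p\!\left(\hat{y}_i \mid \hat{y}_1, \ldots, \hat{y}_{i-1}\right),
```

so that each latent only needs a conditional model given its causal context, rather than an explicit joint model.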
In PixelCNN [16], masked convolutional networks are used to predict the distribution of each pixel value, conditioned on the causal neighbors in an autoregressive manner; it is used for image generation and as the image decoder in an autoencoder. In [17], PixelCNN++ is proposed, which uses the discretized logistic mixture distribution to reduce the complexity of [16]. In [23], PixelCNN++ is applied to lossless image compression.
To help the estimation of the conditional probabilities of the latents, the hyperprior network is introduced in [19], where a hyper encoder network is used to extract some hyperpriors from the latents, which are coded as side information and sent to a hyper decoder network. The latter uses the reconstructed hyperpriors to estimate the conditional probability of the latents, thereby making the entropy model image-dependent and spatially adaptive.
In [19], the conditional probability of each latent is assumed to follow a zero-mean Gaussian scale mixture (GSM) model, which achieves better performance than BPG (4:4:4). In [20], the non-zero-mean Gaussian mixture model (GMM) is further proposed. In addition, the autoregressive context model in [16] is introduced to estimate the conditional probability of each latent from both the hyperpriors and its spatial context.
Other models have also been proposed to encode the latents. For example, in [22], the Laplacian distribution was used. In [12] and [24], the GMM model is adopted. A quality enhancement subnetwork is also introduced in [24] to improve the performance. It is the first learned image compression approach that achieves better performance than H.266/VVC (4:4:4) in both PSNR and MS-SSIM. As far as we know, the results in [24] are currently the best in the literature.
I-C Multivariate Models
The entropy coding approach above predicts the conditional probability of a latent given some causal neighbors and the hyperpriors, according to the chain rule, which approximates the joint probability implicitly. Another possible approach is to use the more general multivariate models to represent the joint probability explicitly.
In [25], the multivariate GMM is used in learned image compression to capture the inter-channel correlation of co-located latents across feature maps (this is captured implicitly by the approach above using 1x1 filters in the hyper decoder). The means and covariance matrices are estimated in a cascaded approach. Vector quantization is used to encode the latents, as in [32]. The hyper coding network is not used. However, since no spatial context is used, the R-D performance of [25] is only comparable to VVC on the Kodak test set, and not as good as other leading methods such as [24]. This indicates that if the learned image compression framework can consider the joint distribution via context model, non-local operator, and attention module, it is sufficient to use univariate models to model the remaining redundancy in the latents.
On the other hand, the overall complexity of multivariate model-based methods is usually quite high, as noted in [25], because the number of parameters of multivariate models increases quadratically with the size of the covariance matrix. Therefore, the complexity of the network model and the complexity of calculating multivariate probability increase dramatically. The training of the network is also more difficult, because the reverse gradients of different variables affect each other. Although the method in [25] can be parallelized because no spatial context is used, it still has very high demands for GPU and memory. In fact, other methods using spatial context can also be parallelized [33, 34, 35].
Multivariate models have also been used in other applications. Most of them are based on the Gaussian distribution. In [36], a deep GMM is proposed for classification by using the factor-analytic representation of the GMM in each layer. The Expectation-Maximization (EM) algorithm is used to estimate the parameters. However, it is found in [37] that sometimes it is quite challenging for this method to infer its parameters, even for a small-scale problem with 3 layers and 76000 latents.
In [38], a novel face recognition framework based on the multivariate GMM is proposed to make feature embeddings extract more identity-relevant information. In [39], a learned multiple graph Gaussian embedding model is developed to learn highly informative network features by mapping high-dimensional networks into a low-dimensional latent space. In [40], the multivariate skewed t-distribution is proposed for hyperspectral anomaly detection. In [41], the last fully-connected (FC) layer of a deep network is combined with a mixture of GMM (MoGMM) for image recognition. In [42], the multivariate GMM is considered as a prior to recover degraded images. In [43], a framework for deep unconstrained face verification is proposed to map learned discriminative facial features to a regularized metric space, in which matching and non-matching pairs follow multivariate Gaussian distributions.
I-D Contributions of this Paper
Although learned methods have made significant progress, the compression performance can be further improved. First, previous entropy models only use a single distribution, which is not optimal for every latent. Second, the latents still contain some spatial redundancy. To address these issues, we make the following contributions in this paper:
Instead of using a single probability model, we propose a discretized Gaussian-Laplacian-Logistic mixture model (GLLMM), which is more flexible and efficient in estimating the conditional probabilities of the latents, given the same complexity. In fact, the previous models are special cases of the proposed model. Ablation studies demonstrate that the proposed joint model outperforms all previously proposed single models. For example, it achieves 0.1-0.15 dB gain in PSNR compared to the GMM model for the Kodak dataset.
We improve the basic building block of the encoder network by developing a concatenated residual module (CRM) with additional shortcut connections. The CRM improves the information flow, reduces the correlation of the output, and facilitates the training of the network. Ablation studies using the Kodak dataset show that the proposed CRM can achieve 0.2-0.3 dB gain in terms of PSNR compared to the original residual blocks. We also conduct ablation studies to compare the performance of CRMs with two and three stages of residual blocks.
One of the key contributions of [24] is the post-processing component, which achieves 0.5 dB gain in [24] compared to its baseline. We also investigate the performance of this component in our scheme. However, our experiments show that there is no need to apply the post-processing in our method, because the proposed CRM and entropy model have done a good job. This also reduces the complexity of the scheme.
Experimental results using the Kodak[44], Tecnick-100, and Tecnick-40 datasets in [45] show that the proposed scheme outperforms all the previous learning-based methods including [24, 12] and traditional codecs including VVC in both PSNR and MS-SSIM. For example, in terms of PSNR, for the Kodak dataset, when the bit rate is higher than 0.4 bpp, our method is 0.2-0.3 dB higher than [24, 12], and 0.3-0.4 dB higher than VVC (4:4:4). For the Tecnick dataset, when the bit rate is higher than 0.2 bpp, our method is 0.3-0.4 dB better than VVC (4:4:4). This represents the new state of the art in learned image compression.
The remainder of the paper is organized as follows. In Section II, we propose our image compression framework with the discretized Gaussian-Laplacian-Logistic mixture model and the concatenated residual modules. In Sec. III, we compare our method with some state-of-the-art learning-based methods and classical image compression methods. Ablation experiments are carried out to investigate the performance gains of the proposed scheme. The conclusions are reported in Sec. IV.

II The Proposed Image Compression Scheme with CRM and GLLMM
In this section, we first introduce the overall learned image coding framework adopted in this paper, and then describe the proposed concatenated residual module (CRM) and the discretized Gaussian-Laplacian-Logistic mixture model (GLLMM).
II-A The Overall Framework
The learned image coding framework adopted in this paper is illustrated in Fig. 2, which is mainly based on [12]. It is comprised of two main parts: the core autoencoder and the hyperprior coding.
The size of the input color image is $W \times H \times 3$, where $W$ and $H$ represent the width and height of the image, respectively. In the training of the network, the pixel values are normalized.
The core autoencoder includes an encoder ($g_a$) and a decoder ($g_s$). The input image $x$ is first sent to the encoder network $g_a$, which aims to reduce the redundancy and learn a compact latent representation $y$ of the input image. The latent $y$ is then quantized and entropy coded. The quantized latent is denoted by $\hat{y}$.
The core encoder network includes various convolution layers and four stages of pooling operators to get the latents. The residual blocks with shortcut connections are used extensively. The GDN operator is used when the size is changed.
To improve the R-D performance, the simplified attention module in [12] is applied at two resolutions to capture long-range correlation, and strengthen more important areas, so that they will get more bit allocation.
To estimate the distribution of the latents and improve entropy coding efficiency, the hyperprior coding part is introduced in [19], which consists of a hyper encoder ($h_a$) and a hyper decoder ($h_s$). The hyper encoder extracts the hyperprior $z$ from the latents, which is quantized into $\hat{z}$ and entropy coded as side information for the hyper decoder. The decoder side first recovers $\hat{z}$ via entropy decoding, and then uses the hyper decoder network to estimate the parameters of the conditional distribution of $\hat{y}$, which are used by the entropy encoding and decoding of $\hat{y}$.
Since the input dimension of the hyper network is much smaller than the original image, the hyper network is much simpler than the core network. Leaky ReLU is utilized in most convolution layers, except for the last layers of the hyper encoder and hyper decoder, which do not have any activation function.
To further improve the entropy coding efficiency, the context model network in [20] and [12] is also used in our system, which uses masked convolutions to capture the correlation of neighboring latents. The output layer of the context model is concatenated with the output of the first part of hyper decoder, and then further processed by some additional convolutional layers to estimate the parameters of the conditional distributions of the latents, which are then used to entropy encode and decode .
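To make the context model concrete, below is a minimal PyTorch-style sketch of the PixelCNN-type masked convolution [16] underlying such context models; the kernel size and channel counts in the example are illustrative assumptions, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is masked so that each output position
    only sees causal (already-decoded) neighbors, as in PixelCNN [16]."""

    def __init__(self, *args, mask_type: str = "A", **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        # Zero out the center tap (for type "A") and everything to its right...
        mask[:, :, kh // 2, kw // 2 + (1 if mask_type == "B" else 0):] = 0
        # ...and all rows below the center row.
        mask[:, :, kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply the causal mask before the standard convolution.
        return nn.functional.conv2d(
            x, self.weight * self.mask, self.bias,
            self.stride, self.padding, self.dilation, self.groups)

# Example: a 5x5 masked convolution over 192 latent channels (illustrative).
ctx = MaskedConv2d(192, 384, kernel_size=5, padding=2, mask_type="A")
```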
Since an autoregressive context model is used, the symbols have to be decoded in a serial manner. After all the symbols of $\hat{y}$ are decoded, they are sent to the core decoder ($g_s$) to generate the reconstructed image. Note that the context model can be sped up using the methods in [33, 34, 35].
In Fig. 2, the encoders and decoders in the core and hyper networks are symmetric, except that they use convolution and deconvolution operators, respectively.
II-B The Proposed Concatenated Residual Module (CRM)
When the size does not change, to further remove the spatial correlation in the latent representation, we develop two deeper residual modules in this paper, as illustrated in Fig. 3. The basic building block is the standard residual block developed in ResNet [27], shown in Fig. 3(a). As in [12], we first employ the leaky ReLU activation function to replace the ReLU function and remove the batch normalization layer from the residual block; leaky ReLU can speed up the convergence of the network. The resulting structure, which is used in [12], is shown in Fig. 3(b). Based on this, we develop two concatenated residual modules, as shown in Fig. 3(c) and Fig. 3(d).
In Fig. 3(c), two residual blocks of Fig. 3(b) are concatenated, and another shortcut connection is added between the input and the output. In Fig. 3(d), three residual blocks are concatenated, with an additional shortcut connection as well.
Compared to the standard residual block in Fig. 3(b), the concatenated modules have larger receptive fields. They can remove more spatial correlation, which also helps the attention module in the network. In Sec. III, we will compare the performances of the two types of concatenated modules.
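To illustrate the structure, the following is a minimal PyTorch-style sketch of the two-stage CRM described above, assuming 3×3 convolutions and leaky ReLU as in [12]; the channel count and the leaky slope are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with leaky ReLU and no batch normalization (Fig. 3(b))."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(negative_slope=0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv2(self.act(self.conv1(x)))  # inner shortcut

class CRM(nn.Module):
    """Two-stage concatenated residual module (Fig. 3(c)): two residual
    blocks in series, plus an extra shortcut from input to output."""

    def __init__(self, channels: int):
        super().__init__()
        self.blocks = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.blocks(x)  # outer shortcut spanning both blocks

# Example: apply a CRM to a 128-channel feature map.
out = CRM(128)(torch.randn(1, 128, 16, 16))
```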
II-C The Proposed Gaussian-Laplacian-Logistic Mixture Model (GLLMM)
II-C1 Existing Models
The core encoder and decoder in Fig. 2 can be represented by

$$y = g_a(x; \phi_g), \qquad \hat{y} = Q(y), \qquad \hat{x} = g_s(\hat{y}; \theta_g), \tag{1}$$

where $\phi_g$ and $\theta_g$ are the parameters of the encoder and decoder networks that need to be optimized, and $Q(\cdot)$ represents the quantization operator.
In earlier work, a fixed distribution is used to encode all entries of $\hat{y}$ for all images, which is not optimal. In [19], the hyper encoder and hyper decoder are introduced to help learn the distribution of each entry of $\hat{y}$, which makes the entropy coding image-adaptive. The output $z$ of the hyper encoder network is quantized into $\hat{z}$ and entropy coded as side information for the decoder. The hyper coding part can be represented by

$$z = h_a(y; \phi_h), \qquad \hat{z} = Q(z), \qquad p_{\hat{y} \mid \hat{z}} = h_s(\hat{z}; \theta_h), \tag{2}$$

where $\phi_h$ and $\theta_h$ are the parameters of the hyper encoder and hyper decoder, and $p_{\hat{y} \mid \hat{z}}$ is the conditional distribution of $\hat{y}$ given $\hat{z}$.
In [19], $\hat{y}$ is assumed to follow an independent zero-mean Gaussian distribution with scale vector $\sigma$. In [20], it is allowed to have a non-zero-mean Gaussian distribution with mean vector $\mu$:

$$p_{\hat{y} \mid \hat{z}}(\hat{y} \mid \hat{z}) = \prod_i \mathcal{N}\!\left(\mu_i, \sigma_i^2\right). \tag{3}$$
In addition, the autoregressive context model in [16] is introduced in [20] to exploit the correlation from causal neighboring latents in estimating the distribution of each latent.
Based on [20], a Gaussian mixture model (GMM) is further introduced in [12] to estimate the latents. Since a context model is also used as in [20], the conditional probability is denoted by $p_{\hat{y}_i \mid \hat{z}, \hat{y}_{<i}}(\hat{y}_i)$, where $\hat{y}_{<i}$ represents the neighboring latents (contexts) of $\hat{y}_i$. Therefore the GMM-based estimation can be written as

$$p_{\hat{y}_i \mid \hat{z}, \hat{y}_{<i}}(\hat{y}_i) = \sum_{k=1}^{K} w_i^{(k)} \, \mathcal{N}\!\left(\mu_i^{(k)}, \sigma_i^{2(k)}\right). \tag{4}$$
At each position $i$, the conditional distribution of $\hat{y}_i$ is a weighted average of $K$ Gaussian models with different means and variances (the weights sum to 1). This is more powerful than a single Gaussian model.

Models with other distributions have also been investigated, such as logistic mixture model in [17, 23], and Laplacian distribution in [22].
However, the aforementioned methods only use one type of probability model to learn the probability distribution of the latent representation. It is difficult for one probability model to learn the distributions of all images. Therefore, better performance can be expected if different types of distributions are combined.
II-C2 The Proposed GLLMM Model
In this paper, we propose a powerful Gaussian-Laplacian-Logistic mixture model (GLLMM) distribution, as shown in Fig. 4, for $\hat{y}_i$:

$$p_{\hat{y}_i \mid \hat{z}, \hat{y}_{<i}}(\hat{y}_i) = w_i^{(g)} \sum_{k=1}^{K} w_i^{(g,k)} \mathcal{N}\!\left(\mu_i^{(g,k)}, \sigma_i^{2(g,k)}\right) + w_i^{(la)} \sum_{m=1}^{M} w_i^{(la,m)} \mathcal{L}a\!\left(\mu_i^{(la,m)}, \sigma_i^{(la,m)}\right) + w_i^{(lo)} \sum_{j=1}^{J} w_i^{(lo,j)} \mathcal{L}o\!\left(\mu_i^{(lo,j)}, \sigma_i^{(lo,j)}\right), \tag{5}$$

which is a weighted average of a Gaussian mixture of order $K$, a Laplacian mixture of order $M$ (denoted $\mathcal{L}a$), and a logistic mixture of order $J$ (denoted $\mathcal{L}o$), with the normalization constraints $\sum_{k} w_i^{(g,k)} = 1$, $\sum_{m} w_i^{(la,m)} = 1$, $\sum_{j} w_i^{(lo,j)} = 1$, and $w_i^{(g)} + w_i^{(la)} + w_i^{(lo)} = 1$. The $\mu$'s and $\sigma$'s with different superscripts represent the mean and scale parameters of the corresponding components, and the $w$'s are the weights of the three types of distributions.
The total number of parameters in Eq. (5) is $3(K + M + J) + 3$: each mixture component has a weight, a mean, and a scale, plus the three top-level weights. Each latent needs its own set of parameters to estimate its conditional distribution. All of these parameters are generated by the hyper decoder network, whose last layer uses $1 \times 1$ filters to estimate the distribution parameters of each latent. In this paper, the values of $K$, $M$, and $J$ are all chosen to be 3, based on the experimental results in Sec. III. Therefore there are $3(3+3+3)+3 = 30$ parameters to estimate for each latent, i.e., 30 output channels of parameters for each latent feature map, as shown in Fig. 2.
Compared to the previous models with a single distribution, the proposed GLLMM model includes three types of distributions, and can capture the distribution of the latent more accurately and efficiently, given the same complexity, thereby improving the performance.
Eq. (5) is a continuous distribution, but the entropy coding part needs the distribution of the quantized $\hat{y}$. Since quantization is not differentiable, a standard solution during training is to add a uniform noise $\mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$ to $y$ to achieve a differentiable approximation of the quantization step, which enables back-propagation-based training [19]. During inference, $y$ is quantized to $\hat{y}$ as usual, which is then encoded into the bit stream via entropy coding.
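A minimal sketch of this standard training-time relaxation (the function name is hypothetical):

```python
import torch

def soft_quantize(y: torch.Tensor, training: bool) -> torch.Tensor:
    """Additive-uniform-noise proxy for quantization during training [19];
    hard rounding during inference."""
    if training:
        # U(-0.5, 0.5) noise keeps the pipeline differentiable.
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)
```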
Based on the approach above, the distribution of the discrete-valued $\hat{y}_i$ after quantization is given by [12]

$$p_{\hat{y}_i \mid \hat{z}, \hat{y}_{<i}}(\hat{y}_i) = F\!\left(\hat{y}_i + \tfrac{1}{2}\right) - F\!\left(\hat{y}_i - \tfrac{1}{2}\right), \tag{6}$$

where $i$ is the location index in the feature tensor, and $F(\cdot)$ is the cumulative distribution function of the mixture model in Eq. (5).
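The following sketch evaluates Eq. (6) for the proposed GLLMM with $K = M = J = 3$ in PyTorch; the tensor layout (a trailing dimension holding the 9 component parameters, with weights assumed already normalized and multiplied by the top-level weights) is an assumption for illustration.

```python
import torch
from torch.distributions import Normal, Laplace

def logistic_cdf(x, loc, scale):
    # CDF of the logistic distribution: sigmoid((x - loc) / scale).
    return torch.sigmoid((x - loc) / scale)

def gllmm_likelihood(y_hat, w, loc, scale):
    """Eq. (6): F(y + 0.5) - F(y - 0.5) under a 3+3+3 GLLMM.

    w, loc, scale have a trailing dimension of 9: components 0-2 are
    Gaussian, 3-5 Laplacian, 6-8 logistic; w sums to 1 over that dim.
    """
    y = y_hat.unsqueeze(-1)
    upper, lower = y + 0.5, y - 0.5
    gauss = Normal(loc[..., 0:3], scale[..., 0:3])
    lap = Laplace(loc[..., 3:6], scale[..., 3:6])

    def mix_cdf(v):
        return torch.cat([gauss.cdf(v), lap.cdf(v),
                          logistic_cdf(v, loc[..., 6:9], scale[..., 6:9])], dim=-1)

    return torch.sum(w * (mix_cdf(upper) - mix_cdf(lower)), dim=-1)
```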
II-D Loss Function
In this paper, we focus on optimizing the learned image compression to achieve the best rate-distortion (R-D) performance, based on information theory. Let $R$ be the expected length of the bitstream, and $D$ be the reconstruction error between the source image and the reconstructed image. The tradeoff between the rate and distortion is adjusted by a Lagrange multiplier denoted by $\lambda$. The objective cost function is then defined as follows:

$$L = \lambda D + R = \lambda \, d(x, \hat{x}) + R_{\hat{y}} + R_{\hat{z}}, \tag{7}$$

where the distortion $D = d(x, \hat{x})$ is the reconstruction error between the original image $x$ and the decompressed image $\hat{x}$, and $R_{\hat{y}}$ and $R_{\hat{z}}$ are the entropies of the latents and hyperpriors, based on the estimated conditional probabilities, as a measure of the bits needed to encode them.
The mean square error (MSE) and MS-SSIM are the most widely used distortion metrics, which are chosen in this paper. Other terms such as the GAN cost can be added to the loss function to achieve other goals. However, when more terms are introduced, the optimized result has to trade off these different constraints.
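A minimal sketch of how Eq. (7) can be computed with the MSE distortion, assuming `p_y` and `p_z` hold the per-element probabilities from Eq. (6) (variable names hypothetical):

```python
import torch
import torch.nn.functional as F

def rd_loss(x, x_hat, p_y, p_z, lam):
    """Rate-distortion loss of Eq. (7): lambda * D + R_y + R_z (rates in bpp)."""
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    rate_y = -torch.log2(p_y).sum() / num_pixels  # bits for latents
    rate_z = -torch.log2(p_z).sum() / num_pixels  # bits for hyperpriors
    distortion = F.mse_loss(x_hat, x)
    return lam * distortion + rate_y + rate_z
```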
III Experiment
In this section, we compare our method with different learning-based methods and traditional compression standards using the Kodak PhotoCD dataset [44] and two Tecnick datasets [45]. The Kodak dataset consists of 24 images with a resolution of 768×512 or 512×768. The first Tecnick dataset, named Tecnick-100, includes 100 uncompressed images with a resolution of 1200×1200. The second Tecnick dataset, named Tecnick-40, has 40 uncompressed images with a resolution of 1200×1200. The learning-based methods are recent state-of-the-art methods, including Balle2017 [26], Li2018 [10], Lee2019 [46], Cheng2020 [12], Lee2020 [24], Li2020 [47], Chen2021 [11], Ho2021 [48], Hu2021 [49], Ma2022 [14], and Zhu2022 [25]. We also compare with traditional methods, including VVC-Intra (4:4:4) (version 12.1) [50], VVC-Intra (4:2:0), BPG (4:4:4) [3], JPEG2000, WebP [51], and JPEG. Both PSNR and MS-SSIM metrics are used. The original MS-SSIM values are represented in dB scale by $-10 \log_{10}(1 - \text{MS-SSIM})$.
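For clarity, the dB conversion used in the figures is simply:

```python
import math

def msssim_to_db(msssim: float) -> float:
    # -10 * log10(1 - MS-SSIM): higher is better; expands the range near 1.
    return -10.0 * math.log10(1.0 - msssim)
```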
III-A Training Set
The CLIC dataset [52] and LIU4K dataset [53] are employed to train our models. The images of the training dataset are rescaled to a fixed resolution, which is better for training. Data augmentation (i.e., rotation and scaling) is used to randomly crop 81,650 patches from these images. The patches are stored as lossless PNG images.
III-B Parameter Settings
We optimize the proposed models using the mean square error (MSE) and MS-SSIM metrics, respectively. When optimized for the MSE metric, the parameter $\lambda$ takes seven values, where each value trains a network for a particular bit rate. The number of filters is set to 128 for the three lower bit rates, and to 256 for the four higher bit rates. When the MS-SSIM metric is targeted, the parameter $\lambda$ takes four values. The number of filters is set to 128 for the two lower bit rates, and to 256 for the two higher bit rates. Each model was trained up to $1.5 \times 10^6$ iterations to obtain stable performance. The Adam solver with a batch size of 8 is adopted. The learning rate is kept fixed in the first 750,000 iterations, and is then reduced by half after every 100,000 iterations in the last 750,000 iterations.
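A sketch of the stated schedule (the base learning rate value is an assumption; the schedule itself follows the description above):

```python
def learning_rate(step: int, base_lr: float = 1e-4) -> float:
    """Piecewise schedule: constant for 750k steps, then halved every 100k."""
    if step < 750_000:
        return base_lr
    return base_lr * 0.5 ** ((step - 750_000) // 100_000)
```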
III-C Performances on Kodak, Tecnick-100, and Tecnick-40 Datasets
The average MS-SSIM and PSNR performances over the 24 Kodak images are illustrated in Fig. 5. Note that in Fig. 5 to Fig. 7, the notations Scheme (PSNR) and Scheme (MS-SSIM) in the legends mean that the model in the scheme is optimized for PSNR and MS-SSIM respectively.
In Fig. 5, when optimized for PSNR, Lee2020 (PSNR) [24] is the best among previous methods, and even obtains better performance than VVC (4:4:4) at high rates. The next closest method to VVC (4:4:4) is Cheng2020 [12]. When the bit rate is less than 0.3 bpp, our method has similar performance to [24] and VVC (4:4:4). When the bit rate is higher than 0.4 bpp, our method achieves the best performance, which is 0.2-0.3 dB higher than [24] and 0.3-0.4 dB higher than VVC (4:4:4).
In the MS-SSIM results in Fig. 5, Lee2020 (MS-SSIM) [24] also achieves better performance than previous learning-based methods and all the traditional image codecs including VVC (4:4:4). Our proposed method optimized for MS-SSIM achieves slightly better results than Lee2020 (MS-SSIM).
Fig. 6 compares the performances of different methods on the Tecnick-100 dataset. Our scheme also outperforms all available learning-based methods and all the traditional image codecs including VVC (4:4:4) in terms of both PSNR and MS-SSIM. Our method is the only learning-based method that is better than VVC. When the bit rate is above 0.2 bpp, our method is about 0.2-0.3 dB higher than VVC (4:4:4).
Fig. 7 compares the performances of some methods on the Tecnick-40 dataset. The results of other methods are not available. Our scheme also outperforms other learned methods including Lee2020 [24] and VVC (4:4:4) in both PSNR and MS-SSIM.
An example is shown in Fig. 8 to compare the visual quality of different methods. Our method produces the most visually pleasing results.
III-D Ablation Studies

III-D1 Contributions of CRM and GLLMM
We first present an ablation study to show the gains of the concatenated residual module (CRM) and the GLLMM model. The scheme in [12] is used as the baseline, on top of which we add the proposed modules in turn. To make the comparison as fair as possible, the parameter $\lambda$ takes five values: the number of filters is set to 128 for the two lower bit rates, and 256 for the three higher bit rates. Other training setups are the same. The MSE objective function is optimized in the ablation experiments.
The results are shown in Fig. 9. We first replace the GMM in the baseline by GLLMM, denoted as Baseline+GLLMM, which improves the R-D performance by about 0.15 dB at the same bit rate. Compared to Baseline+GLLMM, the proposed full method in this paper has a further improvement of 0.3 dB, which is the contribution of the CRM over the original residual blocks.
III-D2 Number of Concatenated Residual Modules

Next, we compare the standard residual block (RB), two-stage concatenated residual module (RB+two-stage), and three-stage concatenated residual module (RB+three-stage). The results are shown in Fig. 10.
The method with the standard RB [27] achieves the worst performance. The RB+two-stage method is 0.3 dB higher than RB. RB+three-stage achieves the same performance as RB+two-stage at low bit rates and is slightly worse at high bit rates, while also increasing the model size. Therefore, we adopt the RB+two-stage method in our framework.
III-D3 Comparisons of Different Entropy Coding Models

In Fig. 11, we use the Kodak dataset to compare the performances of different entropy coding models, including the logistic mixture model (LoMM), the Gaussian mixture model (GMM), the Gaussian-Logistic mixture model (GLoMM), the Gaussian-Laplacian mixture model (GLaMM), and the proposed Gaussian-Laplacian-Logistic mixture model (GLLMM). If some distributions are not used, we set the corresponding weights $w^{(g)}$, $w^{(la)}$, or $w^{(lo)}$ in Eq. (5) to 0. It can be seen that LoMM achieves the worst performance. GMM is slightly better than LoMM, which agrees with the results in [12]. Adding either the Laplacian or the logistic distribution to GMM can further improve its performance, with more gain from the logistic than from the Laplacian. Finally, adding both the Laplacian and logistic distributions, as in the proposed GLLMM, achieves the best result. The gaps between the different curves are quite consistent at all bit rates.
In Fig. 12, we use image Kodim21 from the Kodak dataset to visualize the impacts of different entropy models using the same network architecture. The feature map with the highest entropy is displayed. The first column shows the values of the quantized latents. The second and third columns are the predicted parameters $\mu$ and $\sigma$. For the proposed GLLMM model, the mean is predicted by

$$\mu_i = w_i^{(g)} \sum_{k=1}^{K} w_i^{(g,k)} \mu_i^{(g,k)} + w_i^{(la)} \sum_{m=1}^{M} w_i^{(la,m)} \mu_i^{(la,m)} + w_i^{(lo)} \sum_{j=1}^{J} w_i^{(lo,j)} \mu_i^{(lo,j)}. \tag{8}$$

The means and variances of the other models are obtained similarly. If some distributions are not used, the corresponding weights $w^{(g)}$, $w^{(la)}$, or $w^{(lo)}$ are set to 0.
The fourth column in Fig. 12 shows the normalized latents $(\hat{y} - \mu)/\sigma$, as used in [20, 12, 11]: if the model is accurate, the normalized result should have zero mean and unit variance. Therefore the normalized result helps to visualize the remaining redundancy that is not captured by the entropy models. The last column shows the bits required to encode the latent at each position, which is calculated using the entropy of the predicted discretized distribution at that position.
Table III reports the mean and variance of the normalized latents in the fourth column, and the average bits per latent in the fifth column.
It can be observed from Fig. 12 and Table III that our GLLMM model produces more uniform normalized latents, i.e., less remaining redundancy, and needs fewer bits for encoding.
Fig. 13 further zooms in on one particular location, at coordinate [10, 22] of the feature map in Fig. 12, and plots the estimated continuous and quantized distributions using GMM and GLLMM, respectively. The first column shows the reconstructed images of the two methods. The second column indicates the location of the selected latent. The third column plots the estimated individual distributions, the final mixed probability, and the corresponding parameters. It can be seen that in GLLMM, all the components are used effectively to represent this complex probability. The last column shows the histograms of the quantized probabilities.
Table III: Mean and variance of the normalized latents (column 4 of Fig. 12), and the average bits per latent (column 5).

Method | Col. 4 Mean | Col. 4 Variance | Col. 5 Mean
---|---|---|---
GMM | 0.390 | 2.214 | 4.202
GLaMM | 0.077 | 2.412 | 6.314
GLLMM | 0.049 | 1.303 | 3.359
GLLMM+NoCRM | 0.060 | 4.773 | 8.386
III-D4 GLLMM with Different Orders
We next compare the performances of GLLMM with different orders. We use GLLMM$n$ to denote the GLLMM with $K = M = J = n$ in Eq. (5). The results are shown in Fig. 14. It can be seen that GLLMM2 achieves the worst performance. GLLMM3 is clearly higher than GLLMM2, and GLLMM4 achieves almost the same performance as GLLMM3. Therefore, the values of $K$, $M$, and $J$ in the other experiments of this paper are all set to 3.

III-D5 Comparison of GLLMM and Higher-order GMM
We also compare the performance of GLLMM3 and higher-order GMMs, denoted as GMM$n$. Theoretically, a GMM with enough components can approximate any distribution. However, this comes at the price of increased complexity, and the gain diminishes gradually. For each method, we train two models at low and high bit rates, respectively. The results are shown in Fig. 15. It can be seen that higher-order GMMs have better performance, but the performance saturates gradually. GLLMM3 has better performance than GMM15. Moreover, GLLMM3 has 30 parameters in Eq. (5), whereas GMM15 has 45 parameters in Eq. (4). As a result, compared to GLLMM3, the encoding and decoding time of GMM15 is 52.32% slower, its model size is 17.07% larger, and its training time is 8.23% longer. Therefore the proposed GLLMM is more effective than the GMM when complexity is taken into account.
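The parameter counts follow directly from the two models' definitions: each component contributes a weight, a mean, and a scale, and the GLLMM adds three top-level weights:

```latex
\text{GLLMM3}: \; 3(K + M + J) + 3 = 3(3 + 3 + 3) + 3 = 30, \qquad
\text{GMM15}: \; 3n = 3 \times 15 = 45.
```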
III-D6 Impact of the Attention Module and Context Model
Fig. 16 shows the effectiveness of the attention module and the context model. In this figure, the Baseline is the proposed method with CRM and GLLMM, which achieves the best performance. In Baseline+NoAttention, the attention module is removed, which has similar performance to the Baseline. In Baseline+NoAttention+NoAR, the autoregressive (AR) context model is also removed, which is 0.2-0.3 dB lower than the Baseline.

III-D7 Impact of Different Loss Functions
Fig. 17 shows the impact of different loss functions. In this figure, Ours+MSE is the proposed method with only the MSE as the distortion measure in the loss function, which achieves the best PSNR performance. In Ours+MSE+MS-SSIM, we include both the MSE and MS-SSIM in the distortion part of the loss function, whose PSNR is 0.2-0.3 dB lower than the MSE-only loss function.

III-D8 Impact of the Post-Processing Module in [24]
We also study the effectiveness of the post-processing module of [24] on the compression performance of our method. In [24], the post-processing module is integrated into the image compression framework and achieves 0.5 dB gain compared to its baseline without post-processing. The post-processing can be applied to other learning-based schemes as well. To study its effectiveness in our scheme, we implement the same post-processing as in [24] in our method. The corresponding results are listed in Table IV. We can observe that the two schemes have almost the same performance on the Kodak dataset. The main reason is that we have much better residual modules and entropy model, so there is no need to apply the post-processing. This also reduces the complexity of the scheme.
Table IV: Impact of the post-processing module of [24] on the Kodak dataset.

Method | Number of Filters | λ | Objective Function | BPP | PSNR | MS-SSIM
---|---|---|---|---|---|---
Ours | 128 | 0.0032 | PSNR | 0.1556 | 29.63 dB | 12.59 dB
Ours+post-processing | 128 | 0.0032 | PSNR | 0.1554 | 29.65 dB | 12.59 dB
Ours | 256 | 0.015 | PSNR | 0.4408 | 33.97 dB | 16.70 dB
Ours+post-processing | 256 | 0.015 | PSNR | 0.4400 | 33.98 dB | 16.71 dB
III-E Encoding and Decoding Complexity
Table V: Encoding/decoding time and model size of different methods on the Kodak dataset.

Method | Encoding Time | Decoding Time | Model Size (Low Rates) | Model Size (High Rates)
---|---|---|---|---
VVC [50] | 402.27 s | 0.61 s | 7.2 MB | 7.2 MB
Lee2019 [46] | 15.721 s | 42.88 s | 123.8 MB | 292.6 MB
Hu2020 [54] | 281.25 s | 450.23 s | 84.6 MB | 290.9 MB
Cheng2020 [12] | 20.89 s | 22.14 s | 50.8 MB | 175.18 MB
Chen2021 [11] | 402.26 s | 2405.14 s | 200.99 MB | 200.99 MB
GLLMM | 385.26 s | 387.62 s | 77.08 MB | 241.03 MB
Table V compares the complexities of different approaches. Since VVC, Hu2020 [54], and Cheng2020 [12] only run on CPU, we evaluate the encoding and decoding time of all methods at similar bit rates on a 2.9 GHz Intel Xeon Gold 6226R CPU. The average time over all Kodak images is reported. The average model sizes at low bit rates and high bit rates are also reported.
IV Conclusions
In this paper, we improve the state of the art of learning-based image compression by presenting a more flexible conditional probability model based on the discretized Gaussian-Laplacian-Logistic mixture distribution, which captures the correlation in the latent representations more effectively. We also develop an improved concatenated residual module (CRM) for the encoder network.
Experiments demonstrate that the proposed method outperforms VVC (4:4:4) in terms of both PSNR and MS-SSIM on the Kodak, Tecnick-100, and Tecnick-40 datasets. Our scheme also achieves better performance than all previous state-of-the-art learning-based methods.
Our method still has room for improvement. Since the autoregressive context model is employed in our framework, the symbols have to be decoded in a serial manner. Although the context model effectively reduces the spatial correlation of the latent representations, it significantly increases the time complexity. How to reduce the complexity of the context model without degrading the performance too much is a future research topic. The complexity of our method can also be further reduced by approaches such as model compression and optimization.
Another possible future topic is to develop low-cost multivariate mixture models for learned image compression.
References
- [1] G. K. Wallace, “The jpeg still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. 18–34, 1992.
- [2] D. Taubman and M. Marcellin, JPEG2000: image compression fundamentals, standards, and practice. Kluwer Academic Publishers, 2002.
- [3] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
- [4] Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, 2003, pp. 1398–1402.
- [5] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” in Proc. Int. Conf. Learn. Representations, 2016.
- [6] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5435–5443.
- [7] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” in 2016 Picture Coding Symposium (PCS), 2016, pp. 1–5.
- [8] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in International Conference on Learning Representations., 2017.
- [9] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in International Conference on Machine Learning, 2017, pp. 2922–2930.
- [10] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3214–3223.
- [11] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “End-to-end learnt image compression via non-local attention optimization and improved context modeling,” IEEE Transactions on Image Processing, vol. 30, pp. 3179–3191, 2021.
- [12] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7939–7948.
- [13] M. Akbari, J. Liang, J. Han, and C. Tu, “Learned bi-resolution image coding using generalized octave convolutions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 08, Feb 2021, pp. 6592–6599.
- [14] H. Ma, D. Liu, N. Yan, H. Li, and F. Wu, “End-to-end optimized versatile image compression with wavelet-like transform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1247–1263, 2022.
- [15] M. Lu, P. Guo, H. Shi, C. Cao, and Z. Ma, “Transformer-based image compression,” in Data Compression Conference, March 2022.
- [16] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems(NIPS), 2016, pp. 4797–4805.
- [17] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications,” in Intl. Conf. on Learning Representations (ICLR), 2017.
- [18] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [19] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in International Conference on Learning Representations, 2018, pp. 1–23.
- [20] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10 794–10 803.
- [21] J. Lee, S. Cho, and S.-K. Beack, “Context-adaptive entropy model for end-to-end optimized image compression,” in International Conference on Learning Representations, 2019.
- [22] L. Zhou, C. Cai, Y. Gao, S. Su, and J. Wu, “Variational autoencoder for low bit-rate image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018, pp. 2617–2620.
- [23] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, “Practical full resolution learned lossless image compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10 629–10 638.
- [24] J. Lee, S. Cho, and M. Kim, “An end-to-end joint learning scheme of image compression and quality enhancement with improved entropy minimization,” arXiv:1912.12817, 2020.
- [25] X. Zhu, J. Song, L. Gao, F. Zheng, and H. T. Shen, “Unified multivariate gaussian mixture for efficient neural image compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 17 612–17 621.
- [26] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in International Conference on Learning Representations, 2017.
- [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
- [28] S. Santurkar, D. Budden, and N. Shavit, “Generative compression,” in 2018 Picture Coding Symposium (PCS), June 2018, pp. 258–262.
- [29] M. Akbari, J. Liang, and J. Han, “Dsslic: Deep semantic segmentation-based layered image compression,” in The 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 2042–2046.
- [30] J. Lin, M. Akbari, H. Fu, Q. Zhang, S. Wang, J. Liang, D. Liu, F. Liang, G. Zhang, and C. Tu, “Variable-rate multi-frequency image compression using modulated generalized octave convolution,” in 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), 2020, pp. 1–6.
- [31] Y. Chen, H. Fan, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng, “Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 3435–3444.
- [32] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
- [33] D. He, Y. Zheng, B. Sun, Y. Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 14 771–14 780.
- [34] M. Li, K. Ma, J. You, D. Zhang, and W. Zuo, “Efficient and effective context-based convolutional entropy modeling for image compression,” IEEE Transactions on Image Processing, vol. 29, pp. 5900–5911, 2020.
- [35] M. Lu and Z. Ma, “High-efficiency lossy image coding through adaptive neighborhood information aggregation,” arXiv preprint arXiv:2204.11448, 2022.
- [36] C. Viroli and G. McLachlan, “Deep gaussian mixture models,” Statistics and Computing, vol. 29, no. 1, pp. 43–51, 2019.
- [37] M. Selosse, I. Gormley, J. Jacques, and C. Biernacki, “A bumpy journey: exploring deep gaussian mixture models,” I Can’t Believe It’s Not Better, NeurIPS 2020, Dec 2020, Vancouver, Canada, hal-02985701.
- [38] H. Ai, Q. Liao, Y. Chen, and J. Qian, “Gaussian mixture distribution makes data uncertainty learning better,” in 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), 2021, pp. 01–08.
- [39] M. Xu, D. L. Sanz, P. Garces, F. Maestu, Q. Li, and D. Pantazis, “A graph gaussian embedding method for predicting alzheimer’s disease progression with meg brain networks,” IEEE Transactions on Biomedical Engineering, vol. 68, no. 5, pp. 1579–1588, 2021.
- [40] K. Kayabol, E. B. Aytekin, S. Arisoy, and E. E. Kuruoglu, “Skewed t-distribution for hyperspectral anomaly detection based on autoencoder,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
- [41] J. Xie, Z. Ma, J.-H. Xue, G. Zhang, J. Sun, Y. Zheng, and J. Guo, “Ds-ui: Dual-supervised mixture of gaussian mixture models for uncertainty inference in image recognition,” IEEE Transactions on Image Processing, vol. 30, pp. 9208–9219, 2021.
- [42] M. Niknejad, H. Rabbani, and M. Babaie-Zadeh, “Image restoration using gaussian mixture models with spatially constrained patch clustering,” IEEE Trans. on Image Processing, vol. 24, no. 11, pp. 3624–3636, 2015.
- [43] A. Ali, M. Testa, T. Bianchi, and E. Magli, “Biometricnet: Deep unconstrained face verification through learning of metrics regularized onto gaussian distributions,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., 2020, pp. 133–149.
- [44] Kodak PhotoCD dataset, December 2016. [Online]. Available: http://r0k.us/graphics/kodak/
- [45] N. Asuni and A. Giachetti, “TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms.” The Eurographics Association, 2014.
- [46] J. Lee, S. Cho, and S.-K. Beack, “Context-adaptive entropy model for end-to-end optimized image compression,” in International Conference on Learning Representations, 2019.
- [47] M. Li, W. Zuo, S. Gu, J. You, and D. Zhang, “Learning content-weighted deep image compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
- [48] Y.-H. Ho, C.-C. Chan, W.-H. Peng, H.-M. Hang, and M. Domański, “Anfic: Image compression using augmented normalizing flows,” IEEE Open Journal of Circuits and Systems, vol. 2, pp. 613–626, 2021.
- [49] Y. Hu, W. Yang, Z. Ma, and J. Liu, “Learning end-to-end lossy image compression: A benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
- [50] Fraunhofer HHI, “VVC official test model VTM,” 2019. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/tree/VTM-12.1
- [51] Webp, September 2010. [Online]. Available: https://developers.google.com/speed/webp/
- [52] G. Toderici, W. Shi, R. Timofte, L. Theis, J. Ballé, E. Agustsson, N. Johnston, and F. Mentzer, “Workshop and challenge on learned image compression (CLIC 2020),” CVPR, 2020. [Online]. Available: http://www.compression.cc
- [53] J. Liu, D. Liu, W. Yang, S. Xia, X. Zhang, and Y. Dai, “A comprehensive benchmark for single image compression artifact reduction,” IEEE Transactions on Image Processing, vol. 29, pp. 7845–7860, 2020.
- [54] Y. Hu, W. Yang, and J. Liu, “Coarse-to-fine hyper-prior modeling for learned image compression,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 013–11 020.
Haisheng Fu (Student Member, IEEE) received the B.S. degree in automation engineering from Henan Polytechnic University, China. He is currently pursuing the Ph.D. degree in electronic science and technology with Xi’an Jiaotong University, Xi’an, and is currently a visiting student at Simon Fraser University, Canada. His research interests include machine learning, image and video compression, deep learning, and VLSI design.

Feng Liang is currently a Professor with the School of Microelectronics at Xi’an Jiaotong University. He earned his B.E. from Zhengzhou University and his M.E. and Ph.D. from Xi’an Jiaotong University. His current research interests include signal processing, machine learning, VLSI design, CIM, and computer architecture.

Jianping Lin received the B.S. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2016, where he is currently pursuing the Ph.D. degree with the Department of Electronic Engineering and Information Science. His research interests mainly include video coding/processing and machine learning.

Bing Li received his Ph.D. degree from Xi’an Jiaotong University, Xi’an, in 2021. He is now an engineer at Huawei. His current research interests include elliptic curve cryptosystems, machine learning, and hardware implementation of neural networks.

Mohammad Akbari (Student Member, IEEE) received the B.S. degree in software engineering from the Shahid Bahonar University of Shiraz, Shiraz, Iran, in 2010, the M.Sc. degree in computer science from the University of Lethbridge, Lethbridge, AB, Canada, in 2014, and the Ph.D. degree in engineering science from Simon Fraser University, Burnaby, BC, Canada, in 2020. He is currently an AI Researcher with Huawei Technologies, Markham, ON, Canada. His research interests include deep learning, learned image compression, and music information retrieval. He was a Best Student Paper Award finalist at the 2020 International Conference on Multimedia and Expo, received the 2015 Convocation Medal of Merit from the University of Lethbridge, and was the winner of the 2014 Canadian Microsoft Imagine Cup Innovation Competition.

Jie Liang (Senior Member, IEEE) received the B.E. and M.E. degrees from Xi’an Jiaotong University, China, the M.E. degree from the National University of Singapore, and the Ph.D. degree from Johns Hopkins University, USA, in 1992, 1995, 1998, and 2003, respectively. From 2003 to 2004, he worked at the Video Codec Group of the Microsoft Digital Media Division. Since May 2004, he has been with the School of Engineering Science, Simon Fraser University, Canada, where he is currently a Professor. His research interests include image and video processing, computer vision, and deep learning. He has served as an Associate Editor for several journals, including IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), and IEEE Signal Processing Letters, and has served on three IEEE Technical Committees. He received the 2014 IEEE TCSVT Best Associate Editor Award, the 2014 SFU Dean of Graduate Studies Award for Excellence in Leadership, and a 2015 Canada NSERC Discovery Accelerator Supplements (DAS) Award.

Guohe Zhang received the B.S. and Ph.D. degrees in electronic science and technology from Xi’an Jiaotong University, Shaanxi, China, in 2003 and 2008, respectively. He is currently an Associate Professor with the School of Microelectronics, Xi’an Jiaotong University. In 2009, he joined the School of Electronic and Information Engineering as a Lecturer, and was promoted to Associate Professor in 2013. From 2009 to 2011, he was a Postdoctoral Researcher with the School of Nuclear Science and Technology, Xi’an Jiaotong University. From February to May 2013, he was a short-term visitor at the University of Liverpool, U.K. His research interests fall in the areas of semiconductor device physics and modeling, and VLSI design and testing.

Dong Liu (Senior Member, IEEE) received the B.S. and Ph.D. degrees in electrical engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2004 and 2009, respectively. He was a member of research staff with Nokia Research Center, Beijing, China, from 2009 to 2012. He joined USTC as an Associate Professor in 2012. His research interests include image and video coding, multimedia signal processing, and multimedia data mining. He has authored or coauthored more than 100 papers in international journals and conferences, holds 16 granted patents, and has one technical proposal adopted by AVS. He received the 2009 IEEE Transactions on Circuits and Systems for Video Technology Best Paper Award and the VCIP 2016 Best 10% Paper Award. He and his team won several technical challenges held in ICCV 2019, ACM MM 2018, ECCV 2018, CVPR 2018, and ICME 2016. He is a senior member of the CCF and CSIG, and an elected member of the MSA TC of the IEEE CAS Society. He served as a registration co-chair for ICME 2019 and a symposium co-chair for WCSP 2014.

Chengjie Tu (Member, IEEE) received the B.S. degree in mechanical engineering from the University of Science and Technology of China, Hefei, China, in 1994, and the M.S. and Ph.D. degrees in electrical and computer engineering from Johns Hopkins University, Baltimore, MD, USA, in 2001 and 2003, respectively. He is currently the Chief Video Codec Expert with the Cloud Architecture and Platform Department, Tencent, Shenzhen, China. His research interests include image and video coding, processing, and real-time multimedia communication.

Jingning Han (Senior Member, IEEE) received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2007, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 2008 and 2012, respectively. He joined the WebM Codec Team, Google, Mountain View, CA, USA, in 2012, where he is the main architect of the VP9 and AV1 codecs and leads the software video codec team. He has published more than 60 research articles and holds more than 50 U.S. patents in the field of video coding. His research interests include video coding and computer architecture. Dr. Han received the Dissertation Fellowship from the Department of Electrical and Computer Engineering, University of California at Santa Barbara, in 2012. He was a recipient of the Best Student Paper Award at the IEEE International Conference on Multimedia and Expo in 2012, and received the IEEE Signal Processing Society Best Young Author Paper Award in 2015.