Learned Lossless Image Compression with a HyperPrior and Discretized Gaussian Mixture Likelihoods

Abstract

Lossless image compression is an important task in the field of multimedia communication. Traditional image codecs typically support lossless mode, such as WebP, JPEG2000, FLIF. Recently, deep learning based approaches have started to show the potential at this point. HyperPrior is an effective technique proposed for lossy image compression. This paper generalizes the hyperprior from lossy model to lossless compression, and proposes a L2-norm term into the loss function to speed up training procedure. Besides, this paper also investigated different parameterized models for latent codes, and propose to use Gaussian mixture likelihoods to achieve adaptive and flexible context models. Experimental results validate our method can outperform existing deep learning based lossless compression, and outperform the JPEG2000 and WebP for JPG images.

Index Terms— Lossless Image Compression, Deep Learning, HyperPrior, Gaussian Mixture Model

1 Introduction

Image compression is a fundamental task in the field of signal processing in many decades for efficient transmission and storage. Classical image compression standards, such as JPEG [2], JPEG2000 [3], WebP [4], HEVC/H.265-intra [5] and lossless FLIF [6], usually rely on hand-crafted encoder/decoder (codec) block diagrams. They use fixed linear transform, quantization and the entropy coder to reduce spatial redundancy for images. Typically they also support lossless model to compress images lossless. However, along with the fast development of new image formats and high-resolution mobile devices, existing image compression algorithms are not expected to be optimal solutions.

Recently, we have seen a great surge of deep learning based image compression approaches. Most of them focus on lossy image compression. For example, some image compression approaches use generative models to learn the distribution of images using adversarial training [7, 8, 9] to achieved impressive subjective quality at extremely low bit rate. Some works use recurrent neural networks to compress the residual information recursively, such as [10, 11, 12] to realize scalable coding. Some approaches propose a hyperprior-based and context-adaptive context model to compress codes effectively in [13, 14, 15]. Some methods decorrelate each channel of latent codes and apply deep residual learning to improve the performance as [16, 17, 18]. However, deep learning based lossless compression has rarely discussed. One related work is L3C [19] to propose a hierarchical architecture with $3$ scales to compress images lossless.

In this paper, we propose a learned lossless image compression using a hyperprior and discretized Gaussian mixture likelihoods. Our contributions mainly consist of two aspects. First, we generalize the hyperprior from lossy model to lossless compression model, and propose a loss function with L2-norm for lossless compression to speed up training. Second, we investigate four parameterized distributions and propose to use Gaussian mixture likelihoods for the context model. Experimental results have demonstrated our method can outperform recent learned compression approach L3C. Besides, our method outperform JPEG2000 and WebP for JPG images.

2 Learned Lossless Image Compression

2.1 Formulation of Compression with a Hyperprior

According to the transform coding approach [20], image compression can be formulated as

$\displaystyle\bm{y}$	$\displaystyle=g_{a}(\bm{x};\bm{\phi})$	(1)
$\displaystyle\hat{\bm{y}}$	$\displaystyle=Q(\bm{y})$
$\displaystyle\hat{\bm{x}}$	$\displaystyle=g_{s}(\hat{\bm{y}};\bm{\theta})$

where $\bm{x}$ , $\hat{\bm{x}}$ , $\bm{y}$ , and $\hat{\bm{y}}$ are raw images, reconstructed images, a latent presentation before quantization, and compressed codes, respectively. $Q$ represents the quantization , which is approximated by a uniform noise $\mathcal{U}(-\frac{1}{2},\frac{1}{2})$ during the training, which is denoted as $U|Q$ in Fig. 1. $\bm{\phi}$ and $\bm{\theta}$ are optimized parameters of analysis and synthesis transforms. The operation diagram is explained in Fig. 1(a).

In the work [13], Ballé proposed a hyperprior for lossy image compression to achieve promising results. Hyperprior introduces a side information $\bm{z}$ to capture spatial dependencies among the elements of $\bm{y}$ , formulated as

	$\displaystyle\bm{z}=h_{a}(\bm{y};\bm{\phi_{h}});\;\;$	(2)
	$\displaystyle\hat{\bm{z}}=Q(\bm{z})$
$\displaystyle($	$\displaystyle\bm{\mu_{y}},\bm{\sigma_{y}})=h_{s}(\hat{\bm{z}};\bm{\theta_{h}})$

where $h_{a}$ and $h_{s}$ denote the analysis and synthesis functions in the auxiliary autoencoder, where $\bm{\phi_{h}}$ and $\bm{\theta_{h}}$ are optimized parameters of the hyperprior, respectively. $\bm{\mu_{y}}$ and $\bm{\sigma_{y}}$ are generated by the auxiliary autoencoder to model a Gaussian distribution $\mathcal{N}(\bm{\mu_{y}},\bm{\sigma_{y}}^{2})$ . Originally, Ballé assumed a zero-mean scalar Gaussian distribution, but in the enhanced work [14], a mean and scale Gaussian distribution achieves better results. Then we use a mean and scale one in this paper. The operation diagram is explained in Fig. 1(b).

However, both cases are lossy compression. In this paper, we generalize a lossy model to a lossless model as Fig. 1(c). Not only the reconstructed pixel value, we predict a probability model for raw images, and model $\bm{x}$ using another parameterized distributions. This is feasible because entropy coding techniques such as arithmetic coding [21] can losslessly compress the signals if a probability model is given. The distribution difference of lossy and lossless compression is illustrated in Fig. 1(d), where we can think lossy compression is to predict a delta distribution at the value $\hat{\bm{x}}$ for each element, while lossless compression is to predict a more generalized and arbitrary probability models. One intuitive choice is to use Gaussian distribution $\mathcal{N}(\bm{\mu_{x}},\bm{\sigma_{x}}^{2})$ , like $\bm{y}$ , as Fig. 1(d).

2.2 Proposed Loss Function with a L2-Norm

Different from lossy compression which controls the rate-distortion tradeoff, the loss function of lossless compression only consists of required bits for $\bm{x}$ , $\bm{y}$ and $\bm{z}$ , that is,

	$\displaystyle\mathcal{L}=$	$\displaystyle\mathop{\mathbb{E}}[-\log_{2}(p_{\hat{\bm{x}}}(\hat{\bm{x}}\|\bm{\mu_{x}},\bm{\sigma_{x}}))]+\mathop{\mathbb{E}}[-\log_{2}(p_{\hat{\bm{y}}}(\hat{\bm{y}}\|\bm{\mu_{y}},\bm{\sigma_{y}}))]+$		(3)
		$\displaystyle\mathop{\mathbb{E}}[-\log_{2}(p_{\hat{\bm{z}}}(\hat{\bm{z}}\|\bm{\psi}))]$		(3)

where the probability models of $\bm{\hat{x}}$ are estimated by the parameters $\bm{\mu_{x}},\bm{\sigma_{x}}$ , which are conditioned on $\bm{\hat{y}}$ actually. Similarly, the probability model of $\bm{y}$ are estimated by parameters conditioned on $\bm{\hat{z}}$ , i.e.

\displaystyle p_{\hat{\bm{x}}}(\hat{\bm{x}}|\hat{\bm{y}})=\prod_{i}(\mathcal{N}(\mu_{x_{i}},\sigma_{x_{i}}^{2})\ast\mathcal{U}(-\frac{1}{2},\frac{1}{2}))(\hat{x}_{i})

(4)

\displaystyle p_{\hat{\bm{y}}}(\hat{\bm{y}}|\hat{\bm{z}})=\prod_{i}(\mathcal{N}(\mu_{y_{i}},\sigma_{y_{i}}^{2})\ast\mathcal{U}(-\frac{1}{2},\frac{1}{2}))(\hat{y}_{i})

(5)

\displaystyle p_{\hat{\bm{z}}}(\hat{\bm{z}}|\bm{\psi})=\prod_{i}(p_{z_{i}|\bm{\psi}_{i}}(\bm{\psi}_{i})\ast\mathcal{U}(-\frac{1}{2},\frac{1}{2}))(\hat{z}_{i})

(6)

where $\bm{\psi}$ are learnable factorized distributions [13] for $\bm{z}$ because there is no prior for $\bm{z}$ .

Basically, by using Eq.(3), we can train a neural network for lossless compression. However, by experiments, we have found the converge is very slow. By observing the distribution in the distribution of Fig. 1(d), we can notice the actual marginal distribution of $\bm{x}$ typically is not identical to the estimated entropy model $p_{\hat{\bm{x}}}$ , especially for the initial training stage of neural networks. It takes quite a long time to find a proper and accurate distribution, which is centralized at the actual $\bm{x}$ value. Therefore, we novelly introduce a L2-norm term between the ground-truth value and predicted mean value into the lossless compression, that is,

\mathcal{L^{\prime}}=||\mu_{\bm{x}}-\bm{\hat{x}}||^{2}+||\mu_{\bm{y}}-\bm{\hat{y}}||^{2}

(7)

Then the updated loss function for lossless image compression is defined as

	$\displaystyle\mathcal{L}=$	$\displaystyle\mathop{\mathbb{E}}[-\log_{2}(p_{\hat{\bm{x}}}(\hat{\bm{x}}\|\hat{\bm{y}}))+\mathop{\mathbb{E}}[-\log_{2}(p_{\hat{\bm{y}}}(\hat{\bm{y}}\|\hat{\bm{z}}))+$		(8)
		$\displaystyle\mathop{\mathbb{E}}[-\log_{2}(p_{\hat{\bm{z}}}(\hat{\bm{z}}\|\bm{\psi}))+\lambda\cdot(\|\|\mu_{\bm{x}}-\bm{\hat{x}}\|\|^{2}+\|\|\mu_{\bm{y}}-\bm{\hat{y}}\|\|^{2})$		(8)

where $\lambda$ controls the weights of L2-norm term.

2.3 Proposed Discretized Gaussian Mixture Likelihoods

Whether the parameterized distribution model fits the margin distribution of $\bm{x}$ and $\hat{\bm{y}}$ is a key factor for the performance. We visualize the margin distribution of all sub-pixel values $\bm{x}$ and compressed codes $\bm{\hat{y}}$ for kodim02 from Kodak dataset [28] in Fig. 3, and it does not actually satisfy the assumption that they follow Gaussian distribution. Previous works have investigated several parameterized distribution models, but to our knowledge, which distribution model can fit true margin distribution best is still an open question. For example, standard PixelCNN [22] uses full 256-way softmax likelihoods, which is memory-consuming and impractical for large images. PixelCNN++ [23] proposed a discretized logistic mixture likelihoods to achieve faster training. L3C [19] followed the PixelCNN++ to use a logistic mixture model. Lossy compression work [13] assumed a univariate Gaussian distribution. The work [24] used Laplace distribution. Traditional coding work [25] used a Cauchy distribution to model coefficients.

We investigate all the above models, including Gaussian distribution, Cauchy distribution, Laplace distribution and Logistic distribution. We replace them for $\hat{\bm{y}}$ individually. Examples of their distribution models are visualized in Fig. 4, where we clip the ranges and add the tail masses to edge cases. It can be observed Cauchy distribution has longer tail than other distributions. Corresponding results are listed in Table 1 and performance is evaluated in terms of bits per sub-pixel (bpsp) following the L3C on Kodak dataset. Gaussian distribution can achieve the best performance among them.

Table 1: The effect of different distribution models

Model	Gaussian	Cauchy	Laplace	Logistic
bpsp	3.568	4.189	3.778	3.814

To enhance the performance, we further improve the parameterized probabilities from univariate Gaussian model to multivariate Gaussian mixture model as

		$\displaystyle p_{\hat{\bm{y}}}(\hat{\bm{y}}\|\hat{\bm{z}})=\prod_{i}p_{\hat{\bm{y}}}(\hat{y}_{i}\|\hat{\bm{z}})$		(9)
		$\displaystyle p_{\hat{\bm{y}}}(\hat{y}_{i}\|\hat{\bm{z}})=(\sum_{k=1}^{K}w_{y_{i}}^{(k)}\mathcal{N}(\mu_{y_{i}}^{(k)},\sigma_{y_{i}}^{2(k)})\ast\mathcal{U}(-\frac{1}{2},\frac{1}{2}))(\hat{y}_{i})$		(9)

where the $k$ -th mixture distribution is characterized by Gaussian distribution with three parameters, i.e. weights $w_{i}^{(k)}$ , means $\mu_{i}^{(k)}$ and variances $\sigma_{i}^{2(k)}$ for each $i$ -th element. The mixture models for $\bm{x}$ are conducted in the same way as $\bm{y}$ . Multivariate mixture model offers more flexibility than univariate one to model arbitrary likelihoods, as Fig. 3(a).

Table 2: Compression performance on CLIC Professional, CLIC Mobile and Kodak datasets.

[JPG]	Method	CLICP	CLICM	Kodak
Non-Learned Methods	JPEG2000	2.936 +7.7%	2.897 +9.0%	2.822 +7.0%
	WebP	2.777 +1.9%	2.735 +2.9%	2.708 +2.7%
	FLIF	2.631 -3.5%	2.516 -5.4%	2.479 -6.1%
Learned Methods	L3C	2.739 +0.5%	2.647 -0.5%	2.730 +3.5%
	Proposed	2.726	2.659	2.638

3 Experiments

3.1 Implementation details

The network architectures are illustrated in Fig. 2. This residual learning architecture refers to [18] with competitive performance. We include one mask convolution to $\bm{y}$ as [14]. In our experiments, $K$ is set as $3$ . We set $\lambda$ in Eq. (8) as $0.6$ for warm-up ( $8\times 10^{4}$ steps in our experiments) and set $\lambda$ as $0$ after that, because optimizing this L2-norm basically contributes to smaller $\mathcal{L}$ , instead, optimizing $\mathcal{L}$ not absolutely requires the minimized L2-norm term at the late training stage. For instance, some symbols with very large variance would not put too much constraint on accurate mean estimation. This L2-norm is similar to mean square error (MSE) distortion in lossy image compression. To some degree, lossless compression is a generalized form of lossy one, because lossy compression only samples the variable $\hat{\bm{x}}$ from $p_{\hat{\bm{x}}}$ with the largest probability, while lossless compression predicts the full likelihoods.

For training, we use about $35000$ patches with the size of $128\times 128$ cropped from ImageNet ILSVRC dataset [26]. Batch size is $8$ . The model was optimized using Adam [27] algorithm. In addition, the learning rate was fixed at $1\times 10^{-4}$ and was reduced to $1\times 10^{-5}$ for the last $80k$ steps. We train up to $6.8\times 10^{5}$ steps to achieve stable performance.

3.2 Performance Comparison and Discussion

For evaluation, we use three datasets, CLIC [29] professional validation dataset with $41$ images, CLIC mobile validation dataset with $61$ images, and Kodak dataset [28] with 24 images to cover many different kinds of contents. Following [19], we resize the high-resolution CLIC images to $768$ pixels for the longer side. We compare our method with classical compression algorithms FLIF [6], WebP 1.0.2 [4], and OpenJPEG (JPEG2000 official software) [3], and L3C [19]. For each model, average bits per sub-pixel (bpsp) are measured on average across all the test images.

Currently the format of large-scale training image dataset is JPG, which has already been largely compressed with many artifacts, such as smoothing areas, blocking and ringing artifacts. Even though some artifacts are invisible, they largely affect the distributions. Therefore, to evaluate the performance more close to what we train, we decide to evaluate all these methods in two ways. First, we convert test set to JPG images with quality level as 95 using PIL library for evaluation, which followed L3C [19] way. Second, we evaluate results with lossless PNG images to show the performance gap when the test set is much different from training set.

The performance comparison on JPG images is shown in Table 2. It can be observed that our method achieves better performance than JPEG2000 and WebP for all the dataset. Our method is better than L3C for CLICP and Kodak, but is slightly worse than L3C for CLICM. On the other hand, the performance on lossless images is listed in Table 3. Our method is still better than L3C. Although our performance is worse than JPEG2000 and WebP, it comes from the difference between training datasets and test datasets. As a future work, finding some ways to remove inherent compression artifacts of training images or build a lossless dataset is worthy trying.

Table 3: Compression performance on lossless images.

[PNG]	Method	CLICP	CLICM	Kodak
Non-Learned	PNG	4.298	4.374	4.350
	JPEG2000	3.403	3.266	3.191
	WebP	3.254	3.212	3.206
	FLIF	3.141	3.083	2.903
Learned	L3C	4.098	3.691	3.547
	Proposed	3.647	3.486	3.475

3.3 Relation to Prior Work

The most related works are hyperprior [13] and L3C [19]. Compared to [13], we generalize the hyperprior from a lossy compression model to a lossless one. L3C is a recent learned lossless compression with $3$ scales, while our architecture is somewhat like a simplified hierarchical network with $2$ scales (i.e. $\bm{y}$ and $\bm{z}$ ). Except for the difference on architectures, more importantly, we propose two novel strategies to improve the performance. 1) We add a L2-norm to loss function which contributes to stable and fast training. 2) We used Gaussian mixture model, while L3C used Logistic mixture model. We have discussed $4$ parameterized models in Section 2.3 to show Gaussian achieves slightly better performance.

4 Conclusion

In this paper, we propose a learned lossless image compression using a hyperprior and discretized Gaussian mixture likelihoods. First, we formulate a hyperprior-based lossless image compression model. Based on it, we propose corresponding loss function with L2-norm to speed up training. Second, the determination of parameterized distributions is a significant factor for performance. We investigate 4 types and propose to use Gaussian mixture likelihoods. Experimental results have demonstrated our method can outperform recent learned compression approach L3C. Besides, our method outperform JPEG2000 and WebP for JPG images.

5 Acknowledgement

The authors would like to thank Fabian Mentzer (the first author of L3C) for fruitful discussion and insightful feedback on the evaluation methods and datasets.

References

[1]
[2] G. K Wallace, “The JPEG still picture compression standard”, IEEE Trans. on Consumer Electronics, vol. 38, no. 1, pp. 43-59, Feb. 1991.
[3] Majid Rabbani, Rajan Joshi, “An overview of the JPEG2000 still image compression standard”, Image Communication, vol. 17, no, 1, pp. 3-48, Jan. 2002. JPEG2000 official software OpenJPEG, https://jpeg.org/jpeg2000/software.html
[4] WebP Image Format, https://developers.google.com/speed/webp/
[5] G. J. Sullivan, J. Ohm, W. Han, T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[6] J. Sneyers and P. Wuille. “FLIF: Free lossless image format based on MANIAC compression”, ICIP 2016. https://flif.info/
[7] Ripple Oren, L. Bourdev, “Real Time Adaptive Image Compression”, Proc. of Machine Learning Research, Vol. 70, pp. 2922-2930, 2017.
[8] S. Santurkar, D. Budden, N. Shavit, “Generative Compression”, Picture Coding Symposium, 2018.
[9] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, “Generative Adversarial Networks for Extreme Learned Image Compression”, arXiv:1804.02958.
[10] G. Toderici, S. M.O’Malley, S. J. Hwang, et al., “Variable rate image compression with recurrent neural networks”, arXiv: 1511.06085, 2015.
[11] G, Toderici, D. Vincent, N. Johnson, et al., “Full Resolution Image Compression with Recurrent Neural Networks”, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, July 21-26, 2017.
[12] Nick Johnson, Damien Vincent, David Minnen, et al., “Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks”, arXiv:1703.10114, pp. 1-9, March 2017.
[13] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, “Variational Image Compression with a Hyperprior”, Intl. Conf. on Learning Representations (ICLR), 2018.
[14] D. Minnen, J. Ballé, G. Toderici, “Joint Autoregressive and Hierarchical Priors for Learned Image Compression”, arXiv.1809.02736.
[15] J. Lee, S. Cho, S-K Beack, “Context-Adaptive Entropy Model for End-to-end optimized Image Compression”, Intl. Conf. on Learning Representations (ICLR) 2019.
[16] Z. Cheng, H. Sun, M. Takeuchi, J. Katto, “Deep Convolutional AutoEncoder-based Lossy Image Compression”, Picture Coding Symposium, pp. 1-5, June 24-27, 2018.
[17] Z. Cheng, H. Sun, M. Takeuchi, J. Katto, “Learning Image and Video Compression through Spatial-Temporal Energy Compaction”, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 16-20, 2019.
[18] Z. Cheng, H. Sun, M. Takeuchi, J. Katto, “Deep Residual Learning for Image Compression”, CVPR Workshop, pp. 1-4, June 16-20, 2019.
[19] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, L. V. Gool, “Practical Full Resolution Learned Lossless Image Compression”, CVPR 2019. https://github.com/fab-jul/L3C-PyTorch (Accessed on June 3 2019)
[20] Goyal Vivek K. “Theoretical Foundations of Transform Coding”, IEEE Signal Processing Magazine, Vol. 18, No. 5, Sep. 2001.
[21] Rissanen Jorma and Glen G. Langdon Jr., “Universal modeling and coding”, IEEE Transactions on Information Theory, Vol. 27, No. 1, Jan. 1981.
[22] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espehold, A. Graves, and K. Kavukcuoglu, “Conditional Image Generation with PixelCNN Decoders”, Advances in Neural Information Processing Systems (NIPS), 2016
[23] T. Salimans, A. Karpathy, X. Chen, D. P. Kingma, “PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications”, Intl. Conf. on Learning Representations (ICLR), 2017.
[24] L. Zhou, C. Cai, Y. Gao, S. Su, J. Wu, “Variational Autoencoder for Low Bit-rate Image Compression”, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) workshop, 2019.
[25] J. Liu, Y. Cho, Z. Guo, C.-C Jay Kuo, “Bit Allocation for Spatial Scalability Coding of H.264/SVC With Dependent Rate-Distortion Analysis”, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 20, No. 7, July 2010.
[26] J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1-8, June 20-25, 2009.
[27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, arXiv:1412.6980, pp.1-15, Dec. 2014.
[28] Kodak Lossless True Color Image Suite, Download from http://r0k.us/graphics/kodak/
[29] Workshop and Challenge on Learned Image Compression, CVPR2019, http://www.compression.cc/