CPIPS: Learning to Preserve Perceptual Distances in End-to-End Image Compression
Abstract
Lossy image coding standards such as JPEG and MPEG have successfully achieved high compression rates for human consumption of multimedia data. However, with the increasing prevalence of IoT devices, drones, and self-driving cars, machines rather than humans are processing a growing share of captured visual content. Consequently, it is crucial to pursue an efficient compressed representation that caters not only to human vision but also to image processing and machine vision tasks. Drawing inspiration from the efficient coding hypothesis for biological systems and the modeling of the sensory cortex in neuroscience, we repurpose the compressed latent representation to prioritize semantic relevance while preserving perceptual distance. Our proposed metric, Compressed Perceptual Image Patch Similarity (CPIPS), can be derived at minimal cost from a learned neural codec and computed significantly faster than DNN-based perceptual metrics such as LPIPS and DISTS.
Keywords: End-to-end learned compression, image quality assessment, perceptual distance, coding for machines.
1 Introduction
The concept of efficient coding [1, 2] in early biological sensory processing systems hypothesized that the internal representation of images in the human visual system is optimized to encode the visual information it processes efficiently. In other words, the brain effectively compresses visual information.
Neuroscience has made progress in modeling neural single-unit and population responses in higher visual cortical areas using goal-driven hierarchical convolutional neural networks (HCNNs) [3]. The sensory cortex's fundamental framework models the visual system through encoding, the process by which stimuli are transformed into patterns of neural activity, and decoding, the process by which neural activity generates behavior. In [3], HCNNs successfully describe the mapping from stimuli to measured neural responses in the brain.
In recent years, the rapid advancement of deep neural network techniques has significantly improved computer vision tasks [4, 5, 6] and image processing tasks [7, 8, 9]. Neural compression [10], an end-to-end learned image compression method [11, 12, 13, 14, 15, 16, 17], has also gained significant attention and has been shown to outperform traditional expert-designed image codecs. Traditionally, most image processing algorithms cannot be directly applied to hand-crafted image codecs like JPEG [18]. As a result, the first step before further image processing or analysis is typically decompressing the image into raw pixels. With the evolution of neural compression, there is a growing trend to apply CNN-based methods directly to the compressed latent space [19, 20, 21, 22], leveraging the advantages of joint compression-accuracy optimization [21] and eliminating the need for decompression. Consequently, international standards such as JPEG AI [23] and MPEG VCM (Video Coding for Machines) [24] have been initiated to bridge data compression and computer vision, catering to both human and machine vision needs.

Drawing inspiration from sensory cortex modeling [3] and the efficient coding hypothesis employed in information-theoretic perceptual quality metrics [25], we aim to develop an end-to-end learned image compression method jointly trained with the ImageNet classification task as the goal-driven HCNN. Fig. 1 illustrates our proposed architecture, which resembles a UNet network. The compressed latent representations and the intermediate decoder output layers are mapped to a semantic space that preserves the perceptual distance between two different images. We name our method the Compressed Perceptual Image Patch Similarity (CPIPS), which utilizes the entropy-coded bitstream and intermediate decoder output to measure the perceptual distance between images. In the context of Coding for Machines, the compressed image bitstream transmitted by IoT devices can be readily utilized by machines to assess perceptual distortions resulting from image operations.
Our contributions can be summarized as follows:
• We demonstrate the utilization of a goal-driven HCNN as an auxiliary task to map the latent space of the end-to-end learned image compression method to a space with semantic meaning.
• We provide guidance and insights on designing the network architecture when a high-level computer vision task is jointly trained with a variational autoencoder network.
• We derive CPIPS, a perceptual distance metric obtained at minimal cost from the learned codec's bitstream and computed significantly faster than DNN-based metrics such as LPIPS and DISTS.
2 Related Works
2.1 Learned Image Compression
The field of learned image compression has witnessed significant advancements with the introduction of convolutional neural networks. Several approaches have been proposed in the literature, starting with Ballé et al. [12] that surpassed traditional codecs like JPEG [18] and JPEG 2000 [28] in terms of PSNR and SSIM metrics. Minnen et al. [13] further improved coding efficiency by employing a joint autoregressive and hierarchical prior model, surpassing the performance of the HEVC [29] codec. More recently, Cheng et al. [15] developed techniques that achieved comparable performance to the latest coding standard VVC [30]. Several comprehensive survey and introduction papers [31, 32, 10] have summarized these advancements in end-to-end learned compression.
Currently, two challenges remain in this field [33]: computational complexity and subjective image quality. Neural compressors employ high-capacity networks to model data dependencies end-to-end in exchange for better rate-distortion efficiency. The channel-conditional method proposed by Minnen et al. [14] achieves performance close to VVC but at the cost of high computational complexity (600K FLOPs/pixel). Regarding image quality, Valenzise et al. [34] conducted subjective tests on DNN-based methods and observed that these methods produce artifacts that are difficult to evaluate using traditional metrics like PSNR; they concluded that PSNR is inadequate for evaluating DNN-based methods. Upenik et al. [35] benchmarked a set of DNN-based image codecs using a crowdsourcing-based subjective quality evaluation procedure with Differential Mean Opinion Scores (DMOS). Their results show that learning-based approaches can achieve promising bitrate-DMOS performance compared to HEVC. However, despite their superior subjective scores, these DNN-based image codecs are still optimized with pixel-difference-based distortion functions.
2.2 Perceptual Quality Metrics
The evaluation of image codec quality traditionally relies on full-reference image quality assessment (FR-IQA) metrics, which measure the similarity between the reconstructed image and the original image as perceived by human observers. In addition to mean squared error (MSE) and PSNR, various FR-IQA metrics, such as SSIM variants [36, 37], PIM [25], and DISTS [27], have been proposed to predict subjective image quality judgments. Johnson et al. [38] proposed using the feature-vector distance from the VGG network [39] as a perceptual loss for image transformation tasks, based on the hypothesis that the same image features used for image classification are also helpful for other tasks.
Zhang et al. [26] introduced the BAPPS dataset, which includes a large-scale collection of human judgments on image pairs, and trained the Learned Perceptual Image Patch Similarity (LPIPS) metric. LPIPS was found to be more aligned with human judgments than traditional quality metrics such as L2, PSNR, and SSIM. Ding et al. [40] conducted an interesting study to evaluate whether DNN-based quality metrics can be used as objectives for optimizing image processing algorithms. Developing effective perceptual quality metrics for image tasks remains a challenging problem.
2.3 Coding for Machines
Lossy image coding standards such as JPEG and MPEG have primarily focused on achieving high compression rates for human consumption of multimedia data. However, with the rise of IoT devices, drones, and self-driving cars, there is a growing need for efficient compressed representations that cater not only to human vision but also to image processing and machine vision tasks. Techniques such as image data hiding [19], image denoising [20], and image super-resolution [41] have been developed to operate directly on neural compressed latent spaces.
Le et al. [21] proposed an inference-time content-adaptive fine-tuning scheme that optimizes the latent representation to improve compression efficiency for machine consumption. Duan et al. [22] employed transfer learning to perform semantic inference directly from quantized latent features in the deep compressed domain without pixel reconstruction. Choi et al. [42] introduced scalable image coding frameworks based on well-developed neural compressors, achieving up to 80% bitrate savings for machine vision tasks.
3 Proposed Methods
To enable joint training of the image compression network and an image classification task, one has to design a suitable network architecture that can be shared between a variational encoder network $g_a$ and a DNN feature extraction network. We leverage the successful UNet [5] and VGG [39] networks and propose a Left-UNet. Our Left-UNet consists of five downsampling convolution layers, each with two convolution blocks. As shown in Fig. 1, the first orange block from the top-left represents the intermediate encoder output feature from the second convolution block of the first layer, denoted as conv_1_2. The innermost latent vector $y$, colored green in Fig. 1, is output from conv_5_2. This vector is quantized to an approximation $\hat{y}$, which is then entropy-coded.
3.1 Image Classification
We illustrate the Left-UNet architecture in Fig. 2 and provide details in Table 1. The feature extraction network for the image classification task uses the parameterized ReLU (PReLU) as the activation function for all layers, while the encoder network employs Generalized Divisive Normalization (GDN) at the end of each downsampling layer. GDN, proposed by Ballé et al. [43], is inspired by the modeling of neurons in biological visual systems and has been proven effective in Gaussianizing image densities for a superior rate-distortion trade-off.

The extracted image features are then average-pooled and connected to a linear layer with 1,000 neurons, optimized with the cross-entropy classification loss:

$$\mathcal{L}_{\text{cls}} = -\sum_{c=1}^{1000} t_c \log p_c, \qquad (1)$$

where $t_c$ is the one-hot ground-truth label and $p_c$ is the predicted softmax probability for class $c$.
It is known that a high-capacity neural network trained for a high-level vision task implicitly learns to reason about relevant semantics [38]. Our goal is not to solve the classification problem directly. Instead, we aim to design a moderately-sized network that can learn semantic features without significantly increasing the encoder-decoder complexity.
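As a concrete illustration, the following PyTorch sketch shows one Left-UNet downsampling layer and the auxiliary classification head, following the widths and strides in Table 1. The GDN layer comes from CompressAI; the padding choices, the class names, and the training call are our assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from compressai.layers import GDN  # learned divisive normalization [43]


class DownBlock(nn.Module):
    """One Left-UNet layer: conv_l_1 (stride 1) + conv_l_2 (stride 2), cf. Table 1."""

    def __init__(self, in_ch, out_ch, use_gdn=False):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.act1 = nn.PReLU()
        # PReLU for the classification branch, GDN when used as the compression encoder
        self.act2 = GDN(out_ch) if use_gdn else nn.PReLU()

    def forward(self, x):
        return self.act2(self.conv2(self.act1(self.conv1(x))))


class LeftUNetClassifier(nn.Module):
    """Left-UNet feature extractor with the auxiliary 1,000-way classifier head."""

    def __init__(self, widths=(32, 64, 128, 256, 320), num_classes=1000):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.blocks = nn.ModuleList(
            [DownBlock(chans[i], chans[i + 1]) for i in range(len(widths))]
        )
        self.fc = nn.Linear(widths[-1], num_classes)

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # multi-scale features conv_1_2 ... conv_5_2
        logits = self.fc(F.adaptive_avg_pool2d(x, 1).flatten(1))
        return logits, feats


# Eq. (1): standard cross-entropy against ImageNet labels
# logits, _ = LeftUNetClassifier()(images)
# loss_cls = F.cross_entropy(logits, labels)
```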
3.2 Image Compression Network
A typical learned neural codec consists of an encoder-decoder pair, a quantization module, and an entropy coder. Given an input image $x$, the neural encoder $g_a$ transforms $x$ into a latent representation $y$, which is then quantized to a discrete-valued vector $\hat{y}$. The discrete probability distribution $p_{\hat{y}}(\hat{y})$ is estimated with a neural network, and $\hat{y}$ is encoded into a bitstream by an entropy coder. The rate $R$ of this discrete code is lower-bounded by the entropy of $p_{\hat{y}}$. On the decoder side, we decode $\hat{y}$ from the bitstream and reconstruct the image $\hat{x}$ using the neural decoder $g_s$. The distortion $D$ is measured by a perceptual metric $d(x, \hat{x})$. Overall, we optimize the network parameters for a weighted sum of rate and distortion, $R + \lambda D$, over a set of images.
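This pipeline can be sketched with CompressAI building blocks. In the hedged example below, EntropyBottleneck plays the role of the learned entropy model; the encoder and decoder arguments stand in for the Left-UNet encoder and the Table 2 decoder, and the class name NeuralCodec is ours.

```python
import torch.nn as nn
from compressai.entropy_models import EntropyBottleneck


class NeuralCodec(nn.Module):
    """Generic transform-coding sketch: encoder g_a, entropy model, decoder g_s."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, latent_ch: int = 320):
        super().__init__()
        self.g_a = encoder                                  # analysis transform: x -> y
        self.g_s = decoder                                  # synthesis transform: y_hat -> x_hat
        self.entropy_bottleneck = EntropyBottleneck(latent_ch)

    def forward(self, x):
        y = self.g_a(x)                                     # latent representation y
        y_hat, y_likelihoods = self.entropy_bottleneck(y)   # quantize and estimate p(y_hat)
        x_hat = self.g_s(y_hat)                             # reconstruction
        return x_hat, y_hat, y_likelihoods
```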
Table 2 illustrates the decoder network $g_s$, which is designed to mirror the encoder. In the generic neural codec formulation, the innermost latent vector $y$ corresponds to the discrete-valued vector $\hat{y}$ after quantization. During image reconstruction, the intermediate output vectors from each upsampling layer conv_l_2 play a crucial role: they represent learned multi-scale semantic layers analogous to the feature layers of a VGG-16 network.
Table 1: Left-UNet encoder network architecture.

| Layer | Kernel | Stride | In | Out |
|---|---|---|---|---|
| conv_1_1 | 3 | 1 | 3 | 32 |
| PReLU | | | | |
| conv_1_2 | 3 | 2 | 32 | 32 |
| PReLU or GDN | | | | |
| conv_2_1 | 3 | 1 | 32 | 64 |
| PReLU | | | | |
| conv_2_2 | 3 | 2 | 64 | 64 |
| PReLU or GDN | | | | |
| conv_3_1 | 3 | 1 | 64 | 128 |
| PReLU | | | | |
| conv_3_2 | 3 | 2 | 128 | 128 |
| PReLU or GDN | | | | |
| conv_4_1 | 3 | 1 | 128 | 256 |
| PReLU | | | | |
| conv_4_2 | 3 | 2 | 256 | 256 |
| PReLU or GDN | | | | |
| conv_5_1 | 3 | 1 | 256 | 320 |
| PReLU | | | | |
| conv_5_2 | 3 | 2 | 320 | 320 |
Table 2: Decoder network architecture.

| Layer | Kernel | Stride | In | Out |
|---|---|---|---|---|
| deconv_5_1 | 3 | 2 | 320 | 320 |
| PReLU | | | | |
| conv_5_2 | 3 | 1 | 320 | 256 |
| GDN | | | | |
| deconv_4_1 | 3 | 2 | 256 | 256 |
| PReLU | | | | |
| conv_4_2 | 3 | 1 | 256 | 128 |
| GDN | | | | |
| deconv_3_1 | 3 | 2 | 128 | 128 |
| PReLU | | | | |
| conv_3_2 | 3 | 1 | 128 | 64 |
| GDN | | | | |
| deconv_2_1 | 3 | 2 | 64 | 64 |
| PReLU | | | | |
| conv_2_2 | 3 | 1 | 64 | 32 |
| GDN | | | | |
| deconv_1_1 | 3 | 2 | 32 | 32 |
| PReLU | | | | |
| conv_1_2 | 3 | 1 | 32 | 3 |
Like [12], we employ kernel density estimation with a neural network to obtain the probability distribution $p_{\hat{y}}(\hat{y})$. The rate loss is computed as the expected code length:

$$\mathcal{L}_R = \mathbb{E}\left[-\log_2 p_{\hat{y}}(\hat{y})\right]. \qquad (2)$$
In our experiments, we use MSE as the distortion function; alternative quality metrics such as SSIM variants [36, 37] could be employed to better match perceptual quality. The distortion loss is defined as:

$$\mathcal{L}_D = \frac{1}{N}\left\| x - \hat{x} \right\|_2^2, \qquad (3)$$

where $N$ is the number of pixels.
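For reference, below is a minimal sketch of Eqs. (2)-(3) computed from the likelihoods returned by the entropy model in the codec sketch above; the weighting value lam is purely illustrative, not the value used in our experiments.

```python
import math

import torch
import torch.nn.functional as F


def rate_distortion_loss(x, x_hat, y_likelihoods, lam=0.01):
    """Eq. (2) + Eq. (3): bits-per-pixel rate plus lambda-weighted MSE distortion."""
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    # Eq. (2): -log2 likelihood of the quantized latents, normalized per pixel
    rate_bpp = torch.log(y_likelihoods).sum() / (-math.log(2) * num_pixels)
    # Eq. (3): mean squared error between the input and its reconstruction
    distortion = F.mse_loss(x_hat, x)
    return rate_bpp + lam * distortion, rate_bpp, distortion
```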
3.3 Joint Compression-Classification Learning
Although intermediate convolution output features are seldom used in most machine learning tasks, these features, tuned to be predictive of essential structures, exhibit a high correlation with human perceptual similarity [26]. However, storing intermediate latent features is wasteful in the context of data compression, since the final bottleneck layer already contains sufficient information for the decoder to reconstruct the image. Another way to avoid this waste is to reduce the number of downsampling layers; however, modeling the sensory cortex in the visual system [3] requires at least five layers of feature extraction to generate neural responses, a finding that our experiments also validate. Consequently, we utilize the intermediate outputs of the decoder as proxies for the multi-scale semantic features and apply a regularizer to constrain the decoder. Specifically, we employ the $\ell_2$ distance to define our regularization loss:
$$\mathcal{L}_{\text{reg}} = \sum_{l} \left\| e_l(x) - d_l(x) \right\|_2^2, \qquad (4)$$

where $e_l(x)$ denotes the encoder feature at downsampling layer $l$ and $d_l(x)$ the corresponding intermediate decoder output.
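A minimal sketch of this regularizer follows, assuming the decoder's intermediate outputs are matched against the encoder features at the same scale (our reading of Fig. 1) and using a mean-squared reduction per layer:

```python
import torch


def feature_regularization_loss(encoder_feats, decoder_feats):
    """Eq. (4): L2 penalty between encoder features and the decoder's
    intermediate outputs at matching scales (multi-scale semantic proxy)."""
    loss = 0.0
    for e_l, d_l in zip(encoder_feats, decoder_feats):
        loss = loss + torch.mean((e_l - d_l) ** 2)  # per-layer mean-squared L2
    return loss / len(encoder_feats)
```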
To initialize the Left-UNet encoder and an auxiliary classifier, we utilize pre-trained semantic features from the image classification task mentioned in Section 3.1. Subsequently, we train an end-to-end image compression network using the overall loss function:
$$\mathcal{L} = \mathcal{L}_R + \lambda\,\mathcal{L}_D + \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}}. \qquad (5)$$
The hyper-parameter $\lambda$ controls the rate-distortion trade-off and can be adjusted according to the desired image quality factor. We keep these hyper-parameters fixed across our experiments.
Through joint compression-classification training, the weights of the Left-UNet encoder are initialized with pre-trained semantic features, and the gradient-descent optimizer then updates the encoder-decoder weights to analyze and synthesize the image while improving classification accuracy.
3.4 Compressed Perceptual Image Patch Similarity
To obtain the distance between two images, denoted as $x_0$ and $x_1$, we follow the same procedure as LPIPS [26] and learn a linear layer on the BAPPS dataset. This linear layer assigns weights to the compressed latents and intermediate decoder outputs. Fig. 3 illustrates the process of obtaining the distance from the entropy-decoded $\hat{y}$ and the feature outputs of our decoder network $g_s$. We extract feature maps for all layers and unit-normalize them along the channel dimension. The activations are then scaled channel-wise by a learned vector $w_l$, the $\ell_2$ distance is computed, and we average across the spatial dimensions of each layer:

$$d_l(x_0, x_1) = \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{z}^{\,l}_{0,hw} - \hat{z}^{\,l}_{1,hw} \right) \right\|_2^2, \qquad (6)$$

where $\hat{z}^{\,l}_{0}$ and $\hat{z}^{\,l}_{1}$ are the normalized layer-$l$ activations of $x_0$ and $x_1$.
Eq. (7) then sums the per-layer distances to obtain the final distance between $x_0$ and $x_1$:

$$\mathrm{CPIPS}(x_0, x_1) = \sum_{l} d_l(x_0, x_1). \qquad (7)$$
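The aggregation in Eqs. (6)-(7) mirrors the LPIPS formulation and can be sketched as follows; the feature lists and channel-weight shapes are assumptions about how the latents and decoder outputs are collected.

```python
import torch
import torch.nn.functional as F


def cpips_distance(feats0, feats1, channel_weights):
    """Eqs. (6)-(7): perceptual distance from compressed latents and
    intermediate decoder outputs of two images (LPIPS-style aggregation)."""
    total = 0.0
    for z0, z1, w in zip(feats0, feats1, channel_weights):
        z0 = F.normalize(z0, dim=1)                  # unit-normalize along channels
        z1 = F.normalize(z1, dim=1)
        # w has shape (1, C, 1, 1): learned per-channel importance
        diff = (w * (z0 - z1)) ** 2
        layer_d = diff.sum(dim=1).mean(dim=(1, 2))   # Eq. (6): spatial average per layer
        total = total + layer_d                      # Eq. (7): sum over layers
    return total
```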

Furthermore, we train a smaller network, denoted as $G$, to predict perceptual judgments from the distance pair on the BAPPS 151k-patch 2AFC (two-alternative forced choice) dataset.
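Following the LPIPS setup, $G$ can be a tiny MLP that maps the distance pair to the probability that a human prefers the second patch; the hidden width and the binary cross-entropy training shown below are assumptions, not details reported here.

```python
import torch
import torch.nn as nn


class JudgmentNet(nn.Module):
    """Small network G: maps a distance pair (d0, d1) to P(human prefers patch 1)."""

    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, d0, d1):
        return self.mlp(torch.stack([d0, d1], dim=-1)).squeeze(-1)


# Training on BAPPS 2AFC: binary cross-entropy against the fraction of human
# raters who preferred patch 1 over patch 0 for the same reference patch.
# g = JudgmentNet()
# loss = nn.functional.binary_cross_entropy(g(d0, d1), human_preference_fraction)
```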
4 Experimental Results
4.1 Experimental Settings
To implement CPIPS, we utilize the CompressAI [44] implementation (https://github.com/InterDigitalInc/CompressAI) of the hyperprior neural compressor [12] and the official release of LPIPS (https://github.com/richzhang/PerceptualSimilarity). We pre-train the image classification task on the ImageNet dataset, which consists of 1.2 million images, using the PyTorch Adam optimizer with a learning rate of 0.0001 for 120 epochs. We then jointly train the compression-classification task from the pre-trained weights for 150 epochs, again with Adam and a learning rate of 0.0001.
Regarding the CPIPS channel weights $w_l$ and the judgment network $G$, we train them for ten epochs on the BAPPS 2AFC dataset, following the original LPIPS paper.
4.2 Left-UNet Image Classification
Table 3 reports our top-1 and top-5 accuracy compared to high-capacity deep networks such as VGG-16. The achieved top-1 accuracy of 60.11% is favorable, indicating that the pre-trained weights can serve as a suitable initialization for the Left-UNet encoder $g_a$.
Table 3: ImageNet classification accuracy.

| Network | Top-1 Acc. | Top-5 Acc. |
|---|---|---|
| AlexNet | 56.52% | 79.06% |
| Left-UNet | 60.11% | 81.95% |
| ResNet18 | 69.36% | 89.03% |
| VGG-16 | 71.51% | 93.38% |
4.3 Human Judgment Accuracy
We compare our method with LPIPS and the traditional L2 and SSIM metrics in terms of the accuracy of image judgments against human ratings on the BAPPS dataset. Table 4 and Fig. 4 present the results.
Table 4: 2AFC agreement with human judgments (%) on the BAPPS test subsets.

| Method | Trad. | CNN | S.Res | DeBlur | Color | F.Interp |
|---|---|---|---|---|---|---|
| LPIPS-Alex | 74.64 | 83.37 | 71.34 | 60.86 | 65.47 | 62.97 |
| Left-UNet | 71.23 | 82.27 | 70.51 | 59.74 | 62.50 | 61.39 |
| CPIPS | 64.77 | 81.77 | 67.21 | 59.20 | 61.91 | 58.00 |
| L2 | 59.94 | 77.76 | 64.67 | 58.19 | 63.50 | 55.02 |
| SSIM | 62.73 | 77.59 | 63.13 | 54.23 | 60.88 | 57.10 |

Evidently, the metrics incorporating learned semantic features (LPIPS, Left-UNet, and CPIPS) correlate more strongly with human judgments than L2 and SSIM. While Left-UNet does not reach the accuracy of LPIPS, it serves as an upper bound for our proposed CPIPS since the two share the same feature extraction convolution layers. CPIPS matches Left-UNet closely on the CNN, DeBlur, and Color subsets but drops more noticeably on the Traditional, Super-Res, and Frame-Interp subsets. We attribute this drop to two factors: 1) the rate-distortion optimization alters the semantic properties of the latent vectors, affecting the perceptual representation; and 2) the multi-scale feature maps are only proxies for the feature extraction vectors, reconstructed in the decoding stages through the regularization loss. Investigating and improving upon these factors is left as future work.
Qualitatively, we select sample image patches from the BAPPS dataset and present their different judgments in Fig. 5. L2 and SSIM fail to reflect human perceptual preferences, whereas CPIPS and LPIPS align better with the ground truth. The second image pair in Fig. 5 shows that SSIM has a strong bias toward structure and tends to be affected by additive noise.

4.4 Computational Complexity
We measured the computation time of the metrics on an Intel i7-9700K workstation with an Nvidia RTX 3090 GPU. To compare our CPIPS metric with LPIPS and DISTS (https://github.com/dingkeyan93/DISTS), we used the Kodak dataset [45] and report the average time cost in Table 5. Because CPIPS uses a less complex network and only requires decoding the bitstream and intermediate features, it is approximately 50 times faster.
Table 5: Average computation time per image on the Kodak dataset.

| Method | Avg. Time (secs.) |
|---|---|
| CPIPS | 0.0205 |
| LPIPS-Alex | 1.0681 |
| DISTS | 1.0373 |
5 Conclusions
In this work, we have introduced an end-to-end learned approach to image compression that aims to preserve perceptual distances. By pre-training on an image classification task and then jointly training for compression and classification, we initialize the parameters of the learned image coding model with semantic features and guide the gradient-descent process to emphasize semantic relevance. We have proposed a UNet-inspired network architecture, Left-UNet, shared between the image classifier and the image encoder. Our approach computes the feature-vector difference between the rate-distortion optimized compressed latents and the intermediate decoder outputs of two images, yielding a perceptual-distance-preserving metric. We refer to this metric as CPIPS; it is derived from a learned image codec bitstream at no additional cost. Our experimental results demonstrate that CPIPS aligns better with human subjective judgments than traditional distortion metrics such as L2 and SSIM.
Acknowledgment
The authors would like to thank the NSTC of Taiwan and CITI SINICA for supporting this research under the grant numbers 111-2221-E-002-134-MY3 and Sinica 3012-C3447.
References
- [1] Fred Attneave “Some informational aspects of visual perception.” In Psychological review 61.3 American Psychological Association, 1954, pp. 183
- [2] Horace B Barlow “Possible principles underlying the transformation of sensory messages” In Sensory communication 1.01, 1961, pp. 217–233
- [3] Daniel LK Yamins and James J DiCarlo “Using goal-driven deep learning models to understand sensory cortex” In Nature neuroscience 19.3 Nature Publishing Group, 2016, pp. 356–365
- [4] Alex Krizhevsky, Ilya Sutskever and Geoffrey E Hinton “ImageNet classification with deep convolutional neural networks” In Communications of the ACM 60.6 ACM New York, NY, USA, 2017, pp. 84–90
- [5] Olaf Ronneberger, Philipp Fischer and Thomas Brox “U-net: Convolutional networks for biomedical image segmentation” In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 2015, pp. 234–241 Springer
- [6] Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao “YOLOv4: Optimal Speed and Accuracy of Object Detection”, 2020 arXiv:2004.10934 [cs.CV]
- [7] Kai Zhang et al. “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising” In IEEE Transactions on Image Processing 26.7 IEEE, 2017, pp. 3142–3155
- [8] Chao Dong, Chen Change Loy, Kaiming He and Xiaoou Tang “Learning a deep convolutional network for image super-resolution” In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, 2014, pp. 184–199 Springer
- [9] Jiahui Yu et al. “Generative image inpainting with contextual attention” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5505–5514
- [10] Yibo Yang, Stephan Mandt and Lucas Theis “An introduction to neural data compression” In arXiv preprint arXiv:2202.06533, 2022
- [11] Johannes Ballé, Valero Laparra and Eero P Simoncelli “End-to-end optimized image compression” In arXiv preprint arXiv:1611.01704, 2016
- [12] Johannes Ballé et al. “Variational image compression with a scale hyperprior” In arXiv preprint arXiv:1802.01436, 2018
- [13] David Minnen, Johannes Ballé and George D Toderici “Joint autoregressive and hierarchical priors for learned image compression” In Advances in Neural Information Processing Systems 31, 2018, pp. 10771–10780
- [14] David Minnen and Saurabh Singh “Channel-wise autoregressive entropy models for learned image compression” In 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 3339–3343 IEEE
- [15] Zhengxue Cheng, Heming Sun, Masaru Takeuchi and Jiro Katto “Learned image compression with discretized gaussian mixture likelihoods and attention modules” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7939–7948
- [16] Tong Chen et al. “End-to-end learnt image compression via non-local attention optimization and improved context modeling” In IEEE Transactions on Image Processing 30 IEEE, 2021, pp. 3179–3191
- [17] Zhihao Duan, Ming Lu, Zhan Ma and Fengqing Zhu “Lossy Image Compression with Quantized Hierarchical VAEs” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 198–207
- [18] Gregory K Wallace “The JPEG still picture compression standard” In IEEE transactions on consumer electronics 38.1 IEEE, 1992, pp. xviii–xxxiv
- [19] Huang Chen-Hsiu and Wu Ja-Ling “Image Data Hiding in Neural Compressed Latent Representations” In unpublished, 2023
- [20] Michela Testolina, Evgeniy Upenik and Touradj Ebrahimi “Towards image denoising in the latent space of learning-based compression” In Applications of Digital Image Processing XLIV 11842, 2021, pp. 412–422 SPIE
- [21] Nam Le et al. “Learned image coding for machines: A content-adaptive approach” In 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6 IEEE
- [22] Zhihao Duan, Zhan Ma and Fengqing Zhu “Unified Architecture Adaptation for Compressed Domain Semantic Inference” In IEEE Transactions on Circuits and Systems for Video Technology IEEE, 2023
- [23] João Ascenso and Evgeniy Upenik “White paper on JPEG AI scope and framework v1.0” In ISO/IEC JTC 1/SC 29/WG1 N90049, 2021
- [24] “Call for Evidence for Video Coding for Machines” In ISO/IEC JTC 1/SC 29/WG 2, 2020
- [25] Sangnie Bhardwaj, Ian Fischer, Johannes Ballé and Troy Chinen “An unsupervised information-theoretic perceptual quality metric” In Advances in Neural Information Processing Systems 33, 2020, pp. 13–24
- [26] Richard Zhang et al. “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric” In CVPR, 2018
- [27] Keyan Ding, Kede Ma, Shiqi Wang and Eero P Simoncelli “Image quality assessment: Unifying structure and texture similarity” In IEEE transactions on pattern analysis and machine intelligence 44.5 IEEE, 2020, pp. 2567–2581
- [28] Michael D Adams “The JPEG-2000 still image compression standard” In ISO/IEC JTC 1/SC 29/WG 1 N 2412, 2001 Citeseer
- [29] Jani Lainema, Miska M Hannuksela, Vinod K Malamal Vadakital and Emre B Aksu “HEVC still image coding and high efficiency image file format” In 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 71–75 IEEE
- [30] Jens-Rainer Ohm and Gary J Sullivan “Versatile video coding–towards the next generation of video compression” In Picture Coding Symposium 2018, 2018
- [31] Siwei Ma et al. “Image and video compression with neural networks: A review” In IEEE Transactions on Circuits and Systems for Video Technology IEEE, 2019
- [32] Dipti Mishra, Satish Kumar Singh and Rajat Kumar Singh “Deep architectures for image compression: a critical review” In Signal Processing 191 Elsevier, 2022, pp. 108346
- [33] Johannes Ballé “DCC 2023 - Perception: the Next Milestone in Learned Image Compression”, 2023 URL: https://www.youtube.com/watch?v=Y3ySwIhwvTE
- [34] Giuseppe Valenzise, Andrei Purica, Vedad Hulusic and Marco Cagnazzo “Quality assessment of deep-learning-based image compression” In 2018 ieee 20th international workshop on multimedia signal processing (mmsp), 2018, pp. 1–6 IEEE
- [35] Evgeniy Upenik et al. “Large-scale crowdsourcing subjective quality evaluation of learning-based image coding” In 2021 International Conference on Visual Communications and Image Processing (VCIP), 2021, pp. 1–5 IEEE
- [36] Zhou Wang, Alan C Bovik, Hamid R Sheikh and Eero P Simoncelli “Image quality assessment: from error visibility to structural similarity” In IEEE transactions on image processing 13.4 IEEE, 2004, pp. 600–612
- [37] Zhou Wang, Eero P Simoncelli and Alan C Bovik “Multiscale structural similarity for image quality assessment” In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003 2, 2003, pp. 1398–1402 IEEE
- [38] Justin Johnson, Alexandre Alahi and Li Fei-Fei “Perceptual losses for real-time style transfer and super-resolution” In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, 2016, pp. 694–711 Springer
- [39] Karen Simonyan and Andrew Zisserman “Very deep convolutional networks for large-scale image recognition” In arXiv preprint arXiv:1409.1556, 2014
- [40] Keyan Ding, Kede Ma, Shiqi Wang and Eero P Simoncelli “Comparison of full-reference image quality models for optimization of image processing systems” In International Journal of Computer Vision 129 Springer, 2021, pp. 1258–1281
- [41] Evgeniy Upenik, Michela Testolina and Touradj Ebrahimi “Towards super resolution in the compressed domain of learning-based image codecs” In Applications of Digital Image Processing XLIV 11842, 2021, pp. 531–541 SPIE
- [42] Hyomin Choi and Ivan V Bajić “Scalable image coding for humans and machines” In IEEE Transactions on Image Processing 31 IEEE, 2022, pp. 2739–2754
- [43] Johannes Ballé, Valero Laparra and Eero P Simoncelli “End-to-end optimization of nonlinear transform codes for perceptual quality” In 2016 Picture Coding Symposium (PCS), 2016, pp. 1–5 IEEE
- [44] Jean Bégaint, Fabien Racapé, Simon Feltman and Akshay Pushparaja “CompressAI: a PyTorch library and evaluation platform for end-to-end compression research” In arXiv preprint arXiv:2011.03029, 2020
- [45] “Kodak PhotoCD dataset”, http://r0k.us/graphics/kodak/