Exploring Compressed Image Representation as a Perceptual Proxy: A Study
Abstract
We propose an end-to-end learned image compression codec wherein the analysis transform is jointly trained with an object classification task. This study affirms that the compressed latent representation can predict human perceptual distance judgments with an accuracy comparable to a custom-tailored DNN-based quality metric. We further investigate various neural encoders and demonstrate the effectiveness of employing the analysis transform as a perceptual loss network for image tasks beyond quality judgments. Our experiments show that an off-the-shelf neural encoder proves proficient in perceptual modeling without needing an additional VGG network. We expect this research to serve as a valuable reference for developing a semantic-aware and coding-efficient neural encoder.
1 Introduction
We consider a natural image as a point in the signal space that triggers stimuli in the brain’s sensory cortex through the visual system. The effectiveness of a deep neural network (DNN) lies in its ability to learn a complex transformation that maps stimuli to points in the semantic space for object classification, as illustrated in Figure 1 (a). Equivalently, research in image quality assessment aims to learn a perceptual mapping such that the Euclidean distance between points in the perceptual space highly correlates with human perception.


In the field of neural science [1], large capacity goal-driven hierarchical convolutional neural networks (HCNNs) like the VGG network [2], which are trained for image classification, have been demonstrated to effectively model neural stimuli and population responses in higher visual cortical areas. Consequently, the VGG network can map images into a feature vector space where distances between vectors closely align with the image’s structures and semantics. Rather than employing per-pixel difference loss, studies have shown that utilizing perceptual loss from VGG leads to superior image quality in various image transformation tasks, such as style transfer [3, 4] and super-resolution [5].
Zhang et al. [6] introduced the Learned Perceptual Image Patch Similarity (LPIPS) metric, which involves learning a set of linear weights applied to HCNN-extracted vectors to align with human perceptual judgments. LPIPS represents the pioneering attempt to employ a semantic transformation as a proxy for perceptual transformation. Essentially, they repurpose a semantic model $f_s$ as a perceptual model $f_p = g \circ f_s$, with $g$ representing the learned linear mapping. Huang et al. [7] proposed the Compressed Perceptual Image Patch Similarity (CPIPS) metric, which incorporates both an analysis transform and a synthesis transform within an end-to-end learned image compression codec [8] as a goal-driven HCNN for image reconstruction. In CPIPS, the analysis transform is pre-trained on ImageNet object classification to achieve a certain level of accuracy and is subsequently jointly optimized for rate-distortion loss and classification loss. CPIPS affirms that, with a mapping function applied to the compressed latent representation, we can conduct human perceptual distance judgments with an accuracy comparable to LPIPS. This work extends CPIPS further, as conceptually illustrated in Figure 1 (b). We study the following two key questions:
1) To what extent can the image encoder network be effectively utilized from both perceptual and semantic perspectives?
2) What outstanding tasks remain for a neural codec to serve as a perceptual space transform?
2 Related Works
Learned Image Compression
End-to-end learned image compression approaches have been proposed in the literature, starting with Ballé et al. [8], which surpassed traditional codecs like JPEG and JPEG 2000 in terms of the PSNR and SSIM metrics. Minnen et al. [9] further improved coding efficiency by employing a joint autoregressive and hierarchical prior model, surpassing the performance of the HEVC codec [10]. Cheng et al. [11] developed techniques that achieved performance comparable to the latest coding standard, VVC [12]. Several comprehensive surveys and introductory papers [13, 14] have summarized these advancements in end-to-end learned compression.
Image Quality Assessment
The evaluation of image coding quality relies on full-reference image quality assessment metrics, which gauge the similarity between the reconstructed and the original images as perceived by human observers. In addition to the traditional pixel-based PSNR, metrics such as SSIM [15], GMSD [16], and NLPD [17] have been widely employed. Recently, new DNN-based methods like LPIPS [6], PIM [18], and DISTS [19] have been proposed. These approaches have demonstrated superior predictive performance for subjective image quality. Their efficacy has been confirmed through validation against human judgments on benchmark datasets like BAPPS [6] and CLIC2021 [20], which comprise a comprehensive collection of two-alternative forced choice (2AFC) human judgments.
Semantic Features as Perceptual Loss
The feature space of the VGG network has been leveraged by Gatys et al. [3] to generate high-quality images for style transfer. Johnson et al. [4] further introduced the feature reconstruction loss and style reconstruction loss, computed up to the VGG network layers relu3_3 and relu4_3, respectively. Features extracted from higher VGG layers encapsulate high-level semantics, including color, texture, and shape, collectively representing an image's style. Johnson's work jointly optimizes the content and style losses and extends beyond style transfer to generic image transformation tasks.
In the domain of image super-resolution, Ledig et al. [5] introduced SRGAN, which incorporates adversarial training and a perceptual loss function comprising an adversarial loss and a content loss. To capture low-level image details, they define the content loss based on the VGG network up to the relu2_2 layer. SRGAN has demonstrated remarkable results in achieving visually appealing 4X super-resolution and has garnered superior mean opinion scores (MOS) compared to prior methods. In addition to these two image tasks, other image reconstruction tasks [21] have also used perceptual loss to enhance the visual quality.
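To make the notion of a VGG-based perceptual loss concrete, the following is a minimal PyTorch sketch of the feature reconstruction loss in the spirit of [4, 5], truncating torchvision's VGG-16 at relu2_2; the cut index and the plain MSE formulation here are illustrative assumptions rather than the exact settings of the cited works.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG-16 truncated after relu2_2 (index 8 of the `features` module).
vgg_relu2_2 = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:9].eval()
for p in vgg_relu2_2.parameters():
    p.requires_grad_(False)

def feature_reconstruction_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean squared error between relu2_2 activations of two image batches."""
    return F.mse_loss(vgg_relu2_2(pred), vgg_relu2_2(target))
```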
3 Proposed Method
Transform coding plays a crucial role in conventional image compression as it separates the task of decorrelating a source from its coding process [22]. Popular transforms such as the Discrete Cosine Transform (DCT) and the Wavelet transform are linear and unitary. Consider a unitary transform $T$ in $\mathbb{R}^{n}$, i.e., $T^{\top}T = I$; then the energy in the transform domain equals the energy in the original domain: $\|Tx\|_2^2 = \|x\|_2^2$. If $y_1 = Tx_1$ and $y_2 = Tx_2$, we have

$$\|y_1 - y_2\|_2 = \|T(x_1 - x_2)\|_2 = \|x_1 - x_2\|_2. \qquad (1)$$
A unitary transform preserves Euclidean distances from the signal space to the latent space. As a result, leveraging the latent representation in traditional image compression for high-level semantic computation is equivalent to processing in the signal domain. However, the development of complex non-linear transforms through deep neural networks in modern learned image compression [8, 9] has made it possible to perform processing and analysis in the compressed domain, giving rise to the advocacy of Video Coding for Machines (VCM) [23].
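As a quick numerical illustration of Equation (1), the sketch below compares Euclidean distances before and after an orthonormal 2-D DCT using SciPy; the block size and random inputs are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 8))   # two arbitrary 8x8 signal blocks
x2 = rng.standard_normal((8, 8))

# Orthonormal (unitary) 2-D DCT: Euclidean distance is preserved.
y1 = dctn(x1, norm="ortho")
y2 = dctn(x2, norm="ortho")

print(np.linalg.norm(x1 - x2))  # distance in the signal domain
print(np.linalg.norm(y1 - y2))  # same value up to floating-point error
```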
CPIPS: An Image Encoder Tuned with Semantics
In CPIPS [7], the weights of the analysis transform are initialized with the semantic features pre-trained on ImageNet. Through joint compression-classification training, the gradient descent optimizer updates the encoder-decoder weights to analyze and synthesize the image while improving classification accuracy. We enhance CPIPS by removing the regularization loss and leveraging the encoder instead of the decoder. Figure 2 shows the detailed network architecture.
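A minimal sketch of such a joint rate-distortion-classification objective is given below; the weighting factors `lmbda` and `gamma` are illustrative placeholders, not the values used in our training.

```python
import torch.nn.functional as F

def joint_loss(x, x_hat, bpp, logits, labels, lmbda=0.01, gamma=1.0):
    """Joint compression-classification objective (sketch).

    x, x_hat : original and reconstructed image batches
    bpp      : estimated rate (bits per pixel) from the entropy model
    logits   : class predictions produced from the latent representation
    """
    distortion = F.mse_loss(x_hat, x)                 # reconstruction term
    classification = F.cross_entropy(logits, labels)  # semantic auxiliary term
    return bpp + lmbda * distortion + gamma * classification
```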

To obtain the distance between two images, denoted as $x$ and $x_0$, we follow the same procedure as LPIPS [6] by learning a linear layer on the BAPPS dataset. We extract feature maps $\hat{y}^l$ for all layers $l$ and normalize them in the channel dimension. The activations are then scaled channel-wise using the vector $w_l$, and the $\ell_2$ distance is computed. Finally, we average across the spatial dimensions and sum over all layers for the final distance:
$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \circledast \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\|_2^2 \qquad (2)$$
where $\circledast$ denotes the convolution operator. We jointly train with the compression-classification loss, achieving 51.58% top-1 and 76.27% top-5 accuracy. The enhanced CPIPS performs better than LPIPS on the BAPPS human perceptual judgment dataset, as shown in Table 3. However, its coding efficiency is somewhat compromised, surpassing only JPEG's.
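For clarity, a sketch of the distance in Equation (2) is shown below: per-layer activations are unit-normalized along the channel axis, scaled by the learned per-channel weights (equivalent to a 1×1 convolution), spatially averaged, and summed over layers. The variable names are ours.

```python
import torch

def cpips_distance(feats_x, feats_x0, weights):
    """Weighted feature distance in the style of Equation (2).

    feats_x, feats_x0 : lists of feature maps, each of shape (N, C_l, H_l, W_l)
    weights           : list of learned per-channel vectors, each of shape (C_l,)
    """
    total = 0.0
    for y, y0, w in zip(feats_x, feats_x0, weights):
        y = y / (y.norm(dim=1, keepdim=True) + 1e-10)    # channel-wise unit norm
        y0 = y0 / (y0.norm(dim=1, keepdim=True) + 1e-10)
        diff = w.view(1, -1, 1, 1) * (y - y0)            # learned channel scaling
        total = total + diff.pow(2).sum(dim=1).mean(dim=(1, 2))  # spatial average
    return total                                         # per-image distance, shape (N,)
```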
Hyperprior Encoder and Semantic-tuned Hyperprior
A perceptual distance-preserving image encoder only makes sense when its coding efficiency remains superior to traditional hand-crafted codecs. To address this, we turn to the well-known hyperprior codec [8], which has been reported to outperform HEVC in terms of the PSNR metric. Given that image reconstruction is a well-defined goal-driven HCNN task, we assess the feasibility of utilizing the hyperprior's analysis transform as the perceptual transform. Additionally, we apply the same joint compression-classification optimization strategy to fine-tune the hyperprior encoder/decoder to examine the semantic-aware benefits derived from the auxiliary image classification task. Table 1 reports the Bjøntegaard-Delta (BD) metrics compared to JPEG on the Kodak dataset. Fine-tuning the hyperprior encoder significantly reduces its coding efficiency, but it remains superior to traditional codecs.
Table 1: BD metrics relative to JPEG on the Kodak dataset and ImageNet classification accuracy.

| Comparison | BD Rate (%) | BD PSNR (dB) | Top-1 Acc. (%) | Top-5 Acc. (%) |
|---|---|---|---|---|
| CPIPS | -1.17 | 0.54 | 51.58 | 76.27 |
| Hyperprior | -55.61 | 3.93 | - | - |
| Hyperprior-tune | -31.85 | 1.72 | 24.25 | 46.63 |
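For reference, BD-rate numbers such as those in Table 1 can be computed from per-codec (bit-rate, PSNR) points with the standard Bjøntegaard procedure; the sketch below is a minimal implementation using cubic fits in the log-rate domain and is not the exact script used for our table.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-Delta rate (%) of the test codec relative to the anchor."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    c_a = P.polyfit(psnr_anchor, lr_a, 3)
    c_t = P.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a, int_t = P.polyint(c_a), P.polyint(c_t)
    avg_diff = ((P.polyval(hi, int_t) - P.polyval(lo, int_t))
                - (P.polyval(hi, int_a) - P.polyval(lo, int_a))) / (hi - lo)
    return (10 ** avg_diff - 1) * 100.0
```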
Style Transfer and Super-resolution
In addition to perceptual similarity judgments, we assess the efficacy of the analysis transform as a perceptual loss for image transformation tasks. We employ Gatys' approach [3] for style transfer and SRGAN's technique [5] for super-resolution, utilizing CPIPS, the hyperprior codec, and the hyperprior-tuned encoder as the perceptual loss networks. The feature layers we extract are detailed in Table 2. For style transfer, we focus on early layers of the analysis transform to capture low-level structures as the content loss. Conversely, for super-resolution we extract the full set of layers to represent semantic features, as the original paper suggests. Each encoder network includes a final Gaussian-conditioned hyperprior bottleneck layer, reflecting the rate constraint imposed on the latent vectors.
Table 2: Feature extraction layers used as the perceptual loss for each network.

| Network | Style Transfer | SRGAN |
|---|---|---|
| Vgg16 | conv2_2+ReLU | conv5_3+ReLU |
| CPIPS | conv2_2+GDN | conv5_2+bottleneck |
| Hyperprior | conv2+GDN | conv4+bottleneck |
| Hyperprior-tune | conv2+GDN | conv4+bottleneck |
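The intermediate activations listed in Table 2 can be captured with standard PyTorch forward hooks; below is a minimal sketch, where the module names passed in `layer_names` depend on the specific encoder implementation and are therefore assumptions.

```python
import torch

def collect_activations(encoder: torch.nn.Module, image: torch.Tensor, layer_names):
    """Return a dict of intermediate feature maps from an analysis transform."""
    activations, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            activations[name] = output
        return hook

    # Attach hooks to the requested sub-modules (e.g. an early conv+GDN block).
    for name, module in encoder.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        encoder(image)
    for h in handles:
        h.remove()
    return activations
```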
4 Experimental Results
For our proposed CPIPS, we begin by pre-training the analysis transform on the ImageNet dataset for 90 epochs on an image classification task, reaching a top-1 accuracy of 59.68%. Subsequently, we jointly train the compression-classification task, starting from the pre-trained weights, for an additional 120 epochs, using the Adam optimizer with a learning rate of 0.0001. For the hyperprior method, we use the CompressAI implementation (https://github.com/InterDigitalInc/CompressAI) along with a pre-trained model configured at quality setting 8. We apply the same joint compression-classification optimization strategy to fine-tune the hyperprior codec for 120 epochs, resulting in a semantic-tuned hyperprior, denoted as Hyperprior-tune.
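As an illustration of how the pre-trained hyperprior encoder can be reused as a feature extractor, the following sketch loads the CompressAI model at quality 8 and runs its analysis transform (exposed as the `g_a` attribute in CompressAI) on a dummy image; the printed latent shape is indicative only.

```python
import torch
from compressai.zoo import bmshj2018_hyperprior

# Pre-trained scale-hyperprior model, quality level 8.
net = bmshj2018_hyperprior(quality=8, pretrained=True).eval()
analysis_transform = net.g_a          # CompressAI's analysis transform module

x = torch.rand(1, 3, 256, 256)        # dummy RGB image with values in [0, 1]
with torch.no_grad():
    y = analysis_transform(x)         # compressed-domain latent representation
print(y.shape)                        # roughly (1, 320, 16, 16) at this quality
```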
To evaluate our approach, we compare the perceptual judgment results with various quality metrics, namely GMSD, NLPD, LPIPS (https://github.com/richzhang/PerceptualSimilarity), PIM (https://github.com/google-research/perceptual-quality), and DISTS (https://github.com/dingkeyan93/DISTS), on two datasets: BAPPS [6] and CLIC2021 [20]. These datasets consist of 151k and 122k patches, respectively, and employ a 2AFC judgment methodology.
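For completeness, the 2AFC evaluation reduces to the simple agreement score sketched below, following the scoring convention commonly used with the BAPPS benchmark; `human_prob` is the fraction of raters who judged the second patch as closer to the reference.

```python
import numpy as np

def two_afc_score(d0, d1, human_prob):
    """2AFC agreement between metric distances and human judgments.

    d0, d1     : metric distances from the reference to patches p0 and p1
    human_prob : fraction of raters preferring p1 as the closer patch
    """
    d0, d1, h = map(np.asarray, (d0, d1, human_prob))
    agreement = (d0 < d1) * (1.0 - h) + (d1 < d0) * h + (d0 == d1) * 0.5
    return float(agreement.mean())
```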
For style transfer, we adopt a separate model for each style, fine-tuning the weights for the content and style losses based on the activation magnitudes of the different networks to achieve visually appealing results. For super-resolution, we conduct experiments on three widely recognized benchmark datasets: Set5, Set14, and BSD100. All experiments are performed with a scale factor of 4× between the low- and high-resolution images.
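The style term follows Gatys' Gram-matrix formulation computed on the extracted feature maps; a minimal sketch is shown below, with the normalization constant chosen for illustration rather than taken from our training configuration.

```python
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Channel correlation matrix of a feature map of shape (N, C, H, W)."""
    n, c, h, w = features.shape
    f = features.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # illustrative normalization

def style_loss(generated_feats, style_feats):
    """Sum of squared Gram-matrix differences over the selected layers."""
    return sum(F.mse_loss(gram_matrix(g), gram_matrix(s))
               for g, s in zip(generated_feats, style_feats))
```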
Perceptual Distance Judgments
The quality metrics are compared on the benchmark datasets in Table 3. It is evident that there is a significant gap between modern DNN-based quality metrics and traditional pixel-based metrics. PIM exhibits the best performance among the specially designed perceptual metrics, and DISTS also outperforms LPIPS. Notably, DISTS likewise leverages the VGG network as a fundamental building block.
Table 3: 2AFC scores on the BAPPS dataset (per distortion category and average) and on CLIC2021.

| Method | Trad. | CNN | S.Res | DeBlur | Color | F.Interp | BAPPS Avg. | CLIC |
|---|---|---|---|---|---|---|---|---|
| LPIPS-Vgg [6] | 73.36 | 82.20 | 69.51 | 59.43 | 61.62 | 62.34 | 68.08 | 61.06 |
| CPIPS [7] | 77.08 | 83.33 | 71.18 | 60.25 | 63.63 | 61.62 | 69.52 | 60.92 |
| CPIPS-hyper | 72.76 | 82.78 | 70.62 | 60.45 | 62.64 | 62.37 | 68.60 | 62.16 |
| CPIPS-hyper-tune | 72.98 | 82.79 | 71.50 | 60.29 | 62.00 | 62.27 | 68.64 | 61.73 |
| DISTS [19] | 77.18 | 82.18 | 71.03 | 60.03 | 62.67 | 62.48 | 69.26 | 61.22 |
| PIM [18] | 80.11 | 87.37 | 76.01 | 66.84 | 70.53 | 66.68 | 74.59 | 65.06 |
| GMSD [16] | 58.81 | 79.41 | 66.39 | 58.84 | 57.74 | 56.47 | 62.94 | 53.55 |
| NLPD [17] | 56.81 | 79.09 | 64.85 | 58.26 | 56.03 | 55.41 | 61.74 | 53.05 |
| PSNR | 59.94 | 77.76 | 64.67 | 58.19 | 63.50 | 55.02 | 63.18 | 54.57 |
| SSIM | 62.73 | 77.59 | 63.13 | 54.23 | 60.88 | 57.10 | 62.62 | 55.05 |
The off-the-shelf perceptual model, CPIPS, closely trails custom-tailored quality metrics such as DISTS. Built on a neural encoder, CPIPS performs slightly better than DISTS on the BAPPS dataset but is less competitive on the CLIC dataset. Surprisingly, the CPIPS variant built on the original hyperprior codec, referred to as CPIPS-hyper, delivers competitive performance on the BAPPS dataset and is superior on the CLIC dataset, closely rivaling a well-designed quality metric like DISTS. We argue that this phenomenon echoes the conclusion from [1] that a neural network well trained on an image reconstruction task behaves as a goal-driven HCNN. Another finding is that fine-tuning the hyperprior encoder with the joint compression-classification optimization has no meaningful impact on perceptual prediction performance. We attribute this to a network architecture design problem: a network pre-trained for classification, like CPIPS, struggles to achieve satisfying coding efficiency, while fine-tuning a highly efficient hyperprior codec does not improve perceptual prediction. Investigating and improving upon these findings is left as future work. Nevertheless, we confirmed that a vanilla hyperprior encoder closely rivals a well-designed quality metric like DISTS.
Style Transfer
We assess the analysis transform's ability to extract high-level style features and transfer them to another content image. Figure 3 showcases the qualitative results obtained using different networks as the perceptual loss. The CPIPS encoder, through semantic fine-tuning, effectively disentangles content and style features, allowing for style transfer similar to the VGG network. However, the stylistic effect, while present, is not as pronounced as with VGG according to subjective evaluation. On the other hand, the raw hyperprior encoder struggles to differentiate between style and content features, resulting in less meaningful outputs. Applying semantic fine-tuning similar to CPIPS greatly enhances the hyperprior's performance, producing stylized results comparable to those of CPIPS. Therefore, the vanilla neural encoder needs to be fine-tuned to accomplish the style transfer task.

Super-resolution
While it is acknowledged that SRGAN with a perceptual loss yields higher mean squared errors, it often achieves superior mean opinion scores. We begin by quantifying and presenting the objective measurements in Table 4. When utilizing the CPIPS encoder as the perceptual loss, we obtain quantitative results comparable to those achieved with the VGG network. Meanwhile, the raw and tuned hyperprior encoder networks outperform the VGG network on the PSNR and SSIM metrics. This result could be attributed to the fact that neural codecs are designed to excel at image reconstruction.
Table 4: PSNR (dB) and SSIM of 4× super-resolution results on Set5, Set14, and BSD100 using different perceptual loss networks.

| Method | Set5 PSNR | Set5 SSIM | Set14 PSNR | Set14 SSIM | BSD100 PSNR | BSD100 SSIM |
|---|---|---|---|---|---|---|
| Vgg16 | 28.87 | 0.8371 | 25.82 | 0.7256 | 25.78 | 0.7016 |
| CPIPS [7] | 28.77 | 0.8374 | 25.73 | 0.7241 | 25.75 | 0.7024 |
| CPIPS-hyper | 29.30 | 0.8537 | 26.06 | 0.7396 | 25.98 | 0.7128 |
| CPIPS-hyper-tune | 29.27 | 0.8534 | 26.04 | 0.7389 | 25.97 | 0.7118 |
In terms of visual quality, as illustrated in Figure 4, all the analysis transforms yield sharpened and visually enhanced outputs. Remarkably, using the VGG network introduces finer details than other methods. This observation suggests that the analysis transform derived from a neural codec could be a reliable proxy for perceptual loss in specific image transformation tasks.

Rate-Distortion Trade-offs
In lossy compression, defining distinct quality factors for low and high-bit-rate applications through rate-distortion optimization is a common practice. We evaluated the influence of bit-rate trade-offs on the efficacy of perceptual representation. Our experiments revealed that even with the lowest bit-rate setting, such as quality 1, the accuracy of perceptual judgment, style transfer effectiveness, and super-resolution quality remained remarkably close to that achieved with a high-quality setting.
5 Conclusion
This study demonstrated that the analysis transform within a neural codec is highly proficient at extracting feature representations suitable for perceptual distance judgment and various image transformation tasks. Specifically, we have shown its effectiveness as a perceptual loss in super-resolution, generating visually pleasing outputs with or without classification fine-tuning. In the case of style transfer, fine-tuning the analysis transform is crucial for producing stylized content. To repurpose a neural encoder as a perceptual transform, one must carefully re-examine the network architecture design, pre-training techniques, and rate-distortion-semantic optimization. Further exploration of auxiliary tasks beyond image classification may also serve as an avenue for semantically fine-tuning the neural encoder.
6 References
- [1] Daniel LK Yamins and James J DiCarlo, “Using goal-driven deep learning models to understand sensory cortex,” Nature neuroscience, vol. 19, no. 3, pp. 356–365, 2016.
- [2] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [3] Leon A Gatys, Alexander S Ecker, and Matthias Bethge, “A neural algorithm of artistic style,” arXiv preprint arXiv:1508.06576, 2015.
- [4] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 694–711.
- [5] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017, pp. 4681–4690.
- [6] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
- [7] Chen-Hsiu Huang and Ja-Ling Wu, “Cpips: Learning to preserve perceptual distances in end-to-end image compression,” in 2023 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2023.
- [8] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston, “Variational image compression with a scale hyperprior,” arXiv preprint arXiv:1802.01436, 2018.
- [9] David Minnen, Johannes Ballé, and George D Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” Advances in Neural Information Processing Systems, vol. 31, pp. 10771–10780, 2018.
- [10] Jani Lainema, Miska M Hannuksela, Vinod K Malamal Vadakital, and Emre B Aksu, “Hevc still image coding and high efficiency image file format,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 71–75.
- [11] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in CVPR, 2020, pp. 7939–7948.
- [12] Jens-Rainer Ohm and Gary J Sullivan, “Versatile video coding–towards the next generation of video compression,” in Picture Coding Symposium, 2018, vol. 2018.
- [13] Dipti Mishra, Satish Kumar Singh, and Rajat Kumar Singh, “Deep architectures for image compression: a critical review,” Signal Processing, vol. 191, pp. 108346, 2022.
- [14] Yibo Yang, Stephan Mandt, and Lucas Theis, “An introduction to neural data compression,” arXiv preprint arXiv:2202.06533, 2022.
- [15] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
- [16] Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE transactions on image processing, vol. 23, no. 2, pp. 684–695, 2013.
- [17] Valero Laparra, Johannes Ballé, Alexander Berardino, and Eero P Simoncelli, “Perceptual image quality assessment using a normalized laplacian pyramid,” Electronic Imaging, vol. 2016, no. 16, pp. 1–6, 2016.
- [18] Sangnie Bhardwaj, Ian Fischer, Johannes Ballé, and Troy Chinen, “An unsupervised information-theoretic perceptual quality metric,” Advances in Neural Information Processing Systems, vol. 33, pp. 13–24, 2020.
- [19] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 5, pp. 2567–2581, 2020.
- [20] “Clic 2021: Workshop and challenge on learned image compression,” https://clic.compression.cc/2021/tasks/index.html.
- [21] Longfei Liu, Sheng Li, Yisong Chen, and Guoping Wang, “X-gans: Image reconstruction made easy for extreme cases,” arXiv preprint arXiv:1808.04432, 2018.
- [22] Johannes Ballé, Philip A Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin Hwang, and George Toderici, “Nonlinear transform coding,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 2, pp. 339–353, 2020.
- [23] “Call for evidence for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 2, 2020.