Conditional and Residual Methods in Scalable Coding for Humans and Machines
Abstract
We present methods for conditional and residual coding in the context of scalable coding for humans and machines. Our focus is on optimizing the rate-distortion performance of the reconstruction task using the information available in the computer vision task. We include an information analysis of both approaches to provide baselines, and we propose an entropy model suitable for conditional coding with increased modelling capacity and tractability similar to that of previous work. We apply these methods to image reconstruction, using, in one instance, representations created for semantic segmentation on the Cityscapes dataset, and in another instance, representations created for object detection on the COCO dataset. In both experiments, the conditional and residual methods perform similarly, and the resulting rate-distortion curves are contained within our baselines.
Index Terms— learnable compression, scalable coding, conditional coding, residual coding, entropy modelling
1 Introduction
With the prominence of artificial intelligence, digital content is consumed not only by humans but also by computer programs. These programs analyze content in different ways, according to their purpose. Depending on the task, only a subset of the available information might be necessary. Moreover, the required information can be represented in a form better suited to the program, one that does not necessarily resemble the natural representation that humans often need to consume such content.
In a collaborative setting [1], where edge devices capture signals that are processed and transmitted to cloud services to complete a set of tasks, it is efficient to transmit only the information necessary to achieve these tasks. Creating representations for every subset of tasks does not scale well with the number of tasks. In addition, if information for some tasks has already been transmitted and a superset of the original tasks is now required for the same input, transmitting the new corresponding representation would incur an overhead in redundant information. Thus, we would like to compose the information required for tasks in a scalable fashion [2], in which base representations are shared among multiple tasks and only incremental amounts of information are required for more specific tasks.
Creating learnable tasks that make use of different streams of information in which some are fitted for different purposes is a challenge [3]. The lower-dimensional manifold induced by a particular task might not be readily usable by a different task. Translating representations from one manifold to another, such that the maximum amount of information is usable in a secondary task, is limited by the modelling capacity of the transformation and the data available [4, 5]. Conditional and residual coding have prevailed as two different approaches to incorporate side information in learnable compression settings. These approaches can leverage dedicated learnable transformations to explicitly transfer information to the target domain.
We restrict our study to a common setting in which we have an image reconstruction task and a computer vision task whose representation is shared with the former. This configuration is referred to as scalable image coding for humans and machines [3]. We present conditional and residual approaches for scalable learnable compression in which we transform the representations to share a common feature space. We derive baselines for these approaches and compare them empirically. Our experiments perform reconstruction on different datasets using representations for semantic image segmentation and object detection. We also present an entropy model with increased modelling capacity suitable for conditional coding.
2 Related Work


In learnable compression, an information bottleneck is induced on an intermediate representation between the input and the output [6]. Successful approaches follow a variational framework [7] in which a hyper-prior representation learns the dependencies between the different factors of a latent representation and operates as side information [8, 9, 10, 11]. An auto-regressive entropy model is used to induce the information bottleneck and to entropy code the learnt representations.
Recent work on scalable coding for humans and machines applies the ideas of learnable compression to both the reconstruction and computer vision tasks [12, 3, 13]. In these approaches, the reconstruction task uses a dedicated and a shared representation. These representations are concatenated after being decoded, and are used as input for a reconstruction model. Through rate-distortion optimization, this approach could in principle create independent representations with little redundant information between them, but the results of [3] and [13] show considerable redundancy. This work focuses primarily on efficiently coding the dedicated representation for the reconstruction task. In the conditional approach, the dedicated representation contains all information relevant to reconstruction, but the uncertainty resolved by the shared representation is exploited during coding. In the residual approach, the information of the shared representation is removed from the target representation before coding it, and then added back later in the pipeline after decoding.
Residual and conditional coding in the context of learnable compression have been explored before for video compression [14, 15, 16, 17]. Our formulation is different in that the prediction is completely explained by the original signal and, as such, the information of the residual cannot increase, whereas in video compression this can occur since the previous frame is used to compute the prediction. In [16], it is shown that learnable conditional coding often requires a transformation of the side information, potentially resulting in information loss to a degree where a residual approach could outperform the conditional approach. In this work, we propose to transform the side information in both approaches and show how this can improve the performance of the residual approach with respect to the conditional approach.
Many of the existing entropy models for learnable compression support the conditional coding of the target representation given the hyper-prior representation [8]. Recent entropy models extend these ideas to efficiently capture the spatial and dimensional dependencies of these representations, by grouping factors together [18], and reorganizing and parallelizing the decoding order of the spatial locations [19]. In this work, we utilize our conditional information in a similar fashion, but we increase the modelling capacity by augmenting its receptive field and adding scaled residual connections.
3 Proposed Methods
For an input image , a lossily-compressed base representation is learned so as to minimize the distortion with respect to a given computer vision target , a task distortion function , and a learnable decoding function .
In the conditional setting, a lossily-compressed enhancement representation is learned to minimize the distortion , using an image reconstruction distortion function and a learnable decoder . All information used for the reconstruction task is contained in , and the information contained in is utilized to efficiently code . Conditional coding effectively models , where is a learnable transformation of that intuitively has a feature space similar to that of the enhancement representation , so that their similarities can be exploited. Any information that reduces the conditional entropy should be maintained in since keeping it incurs no rate penalty.
In the residual approach, an analogous representation is created to minimize . Here, is a learnable transformation of that implicitly reconstructs the image. The prediction is added at the end of the reconstruction process. Fig. 1 shows architecture diagrams for both configurations. (Footnote 1: Official code release: https://github.com/adeandrade/research)
3.1 Bounds for conditional coding
Our theoretical analysis is performed in the lossless case to motivate the proposed baselines for our lossy approaches. In conditional coding, we model , having as a lower bound:
(1) |
where we used due to the data processing inequality. This bound is tight when and , or equivalently, when . This corresponds to a decrease of information in of . An upper bound is obtained by:
(2) |
This bound is tight when , which corresponds to and being independent.
We provide an upper baseline for the conditional approach by using a standalone enhancement representation generated without relying on any side information, and measuring , where is an entropy estimate. As a lower baseline we use . This is motivated by considering that can be more efficient as a task representation than , and by the bounds in (1) and (2).
3.2 Bounds for residual coding
It has been shown that conditional coding bounds the performance of residual coding from above, in the sense that the conditional rate is never larger than the residual rate [16, 17]:
(3) | ||||
(4) | ||||
(5) |
Here (3) uses the fact that having observed , the only uncertainty in is due to . The inequality in (4) uses the non-negativity of mutual information. is rewritten in (5) using the definition of mutual information and once again the fact that given , the only uncertainty in is due to .
The term in (4) acts as a penalty term on the residual formulation. To minimize it, the residual and the prediction must be as independent from each other as possible. This can be achieved when collapses values in so that decreases, or when produces a constant value for different values in , reducing . Reducing increases , which in turn could have the adverse effect of increasing , as shown in (5).
In our proposed method we train to minimize and . By extension, we also minimize . As shown in (5), this reduces both and . Hence, this optimization procedure encourages the learnable function to create a representation that recovers the input as accurately as possible, while at the same time being as independent as possible from the resulting residual .
Note that enforcing the similarity of and the original input may not be an optimal procedure, since even though such an optimization will decrease , it may lead to an increase in . This explains why in preliminary experiments, we found that having a function that explicitly reconstructs does not perform as well as our proposed method.
Due to the previous considerations stemming from our proposed method, we find that can be easier to achieve. This motivates us to compare against the same baselines used in the conditional approach.
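The relation between the residual rate, the conditional rate and the mutual-information penalty discussed in this subsection can be checked numerically on a toy discrete source. The following is a minimal sketch under stated assumptions: the names `X`, `Y`, `R` and `prediction` are illustrative choices rather than the paper's notation, the predictor is a fixed lookup table instead of a learned transformation, and the setting is lossless. It verifies the identity H(R) = H(X|Y) + I(R;Y) for R = X - prediction[Y], which is the standard decomposition behind the bound in [16].

```python
import numpy as np


def entropy(pmf):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def residual_vs_conditional(joint_xy, x_values, prediction):
    """Return (H(X|Y), H(R), I(R;Y)) for the residual R = X - prediction[Y].

    joint_xy[i, j] is P(X = x_values[i], Y = j).
    prediction[j] is a hypothetical integer predictor output for Y = j.
    """
    joint_xy = np.asarray(joint_xy, dtype=float)
    num_y = joint_xy.shape[1]
    # H(X|Y) = H(X, Y) - H(Y)
    h_x_given_y = entropy(joint_xy) - entropy(joint_xy.sum(axis=0))

    # Build the joint distribution of (R, Y) by relocating probability mass.
    residuals = sorted({x - prediction[j] for x in x_values for j in range(num_y)})
    index = {r: k for k, r in enumerate(residuals)}
    joint_ry = np.zeros((len(residuals), num_y))
    for i, x in enumerate(x_values):
        for j in range(num_y):
            joint_ry[index[x - prediction[j]], j] += joint_xy[i, j]

    h_r = entropy(joint_ry.sum(axis=1))
    h_y = entropy(joint_ry.sum(axis=0))
    # I(R;Y) = H(R) + H(Y) - H(R, Y)
    mi_ry = h_r + h_y - entropy(joint_ry)
    return h_x_given_y, h_r, mi_ry


if __name__ == "__main__":
    # Toy source: X takes values 0..3, Y is a binary side-information variable.
    joint = [[0.4, 0.0], [0.1, 0.0], [0.0, 0.1], [0.0, 0.4]]
    for pred in ([0, 3], [0, 2]):
        h_c, h_r, mi = residual_vs_conditional(joint, [0, 1, 2, 3], pred)
        print(pred, round(h_c, 4), round(h_r, 4), round(mi, 4))
```

In this toy case the conditional entropy does not depend on the predictor, while the residual entropy changes with it only through the penalty term, which is why a predictor whose output leaves the residual less dependent on the side information is preferable.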
3.3 Entropy modelling


To conditionally code a representation that exploits as much information as possible from , we model the spatial and dimensional dependencies between and within representations using a CNN [18]. Our proposed entropy model strikes a balance between complexity and accuracy. We group channels with a fixed size [18]. Within each group, the same location across channels is processed in parallel, using as context all locations in the previous groups and all previous locations across all channels of the current group, within the receptive field of the convolutional layer. Similarly to [8], locations in the spatial domain are processed in a top-to-bottom, left-to-right fashion. The Markov property is enforced by a mask applied to the convolution kernels. Fig. 2(a) shows the kernel mask for a single output channel of a layer.
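The masking rule described above can be sketched as a function that builds a binary mask for one convolutional layer. This is an illustrative sketch, not the released implementation: the function name, the `num_groups` parameterization, the weight layout `(out_channels, in_channels, k, k)` and the `include_center` switch (whether a layer may see the current location of its own group, in the style of PixelCNN mask types) are all assumptions.

```python
import numpy as np


def grouped_causal_mask(in_channels, out_channels, num_groups, kernel_size, include_center):
    """Binary mask of shape (out_channels, in_channels, k, k).

    Channels are split into `num_groups` contiguous groups on both sides.
    An output channel in group g may see:
      * every kernel position of input channels in groups before g;
      * only raster-scan-previous positions of input channels in group g
        (plus the centre when `include_center` is True);
      * nothing from groups after g.
    """
    assert kernel_size % 2 == 1, "odd kernel sizes keep the centre well defined"
    assert in_channels % num_groups == 0 and out_channels % num_groups == 0
    k = kernel_size
    center = k // 2
    in_group = in_channels // num_groups
    out_group = out_channels // num_groups

    # Positions preceding the centre in top-to-bottom, left-to-right order.
    causal = np.zeros((k, k))
    causal[:center, :] = 1.0
    causal[center, :center] = 1.0
    if include_center:
        causal[center, center] = 1.0

    mask = np.zeros((out_channels, in_channels, k, k))
    for o in range(out_channels):
        for i in range(in_channels):
            g_out, g_in = o // out_group, i // in_group
            if g_in < g_out:
                mask[o, i] = 1.0      # earlier group: whole window is visible
            elif g_in == g_out:
                mask[o, i] = causal   # same group: previous locations only
    return mask


if __name__ == "__main__":
    m = grouped_causal_mask(4, 4, num_groups=2, kernel_size=3, include_center=False)
    print(m[2, 0])  # group 1 looking at group 0
    print(m[2, 2])  # group 1 looking at itself
```

Multiplying the kernel weights element-wise by such a mask before every forward pass keeps each prediction dependent only on elements that the decoder has already recovered.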
Unlike previous work, the CNN architecture of our entropy model has scalable residual connections and deeper layers with kernel sizes larger than one for its auto-regressive convolutions. This removes some of the modelling limitations imposed by similar entropy models. The CNN architecture has blocks of three layers in which the input channels are scaled up, transformed in a higher-dimensional space, and scaled back down to the original number of channels. The residual connections are introduced in-between these blocks, such that the inputs can be re-scaled differently across the channel dimension. To maintain the Markov property when the number of channels changes, the group sizes are re-scaled accordingly and the channels can only change in multiples of . Fig. 2(b) shows the architecture overview of a single block in the CNN.
The conditional representation is available as another channel group and is transformed by the CNN in the same way as other groups, except that its context is restricted to the receptive field within that group. All elements of the conditioned representation have access to this group.
Similarly to [8] and more recent works, the predictions of the entropy model correspond to the means and scales of a univariate Gaussian distribution assigned to each element in the representation. The symbols are obtained by , and the corresponding probability is . During training, the rounding operation is simulated by adding uniform noise .
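The discretized Gaussian likelihood described above can be sketched as follows. This is a hedged illustration that assumes the formulation common to [8] and [9]: symbols are obtained by rounding the mean-removed value, the probability of a symbol is the Gaussian mass in a unit-width bin around it, and the training noise is uniform on [-0.5, 0.5]. The function names and the `min_scale` and `min_likelihood` safeguards are illustrative assumptions.

```python
import math

import numpy as np

# Element-wise error function built from the standard library.
_erf = np.vectorize(math.erf)


def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))


def quantize(y, mean):
    """Round the mean-removed value and add the mean back (inference)."""
    return np.round(y - mean) + mean


def add_uniform_noise(y, rng):
    """Differentiable stand-in for rounding used during training."""
    return y + rng.uniform(-0.5, 0.5, size=np.shape(y))


def gaussian_likelihood(symbol, mean, scale, min_scale=1e-2, min_likelihood=1e-9):
    """Probability mass of a unit-width bin centred on `symbol`."""
    scale = np.maximum(scale, min_scale)  # avoid division by a vanishing scale
    upper = std_normal_cdf((symbol + 0.5 - mean) / scale)
    lower = std_normal_cdf((symbol - 0.5 - mean) / scale)
    return np.maximum(upper - lower, min_likelihood)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.array([0.2, 1.7, -0.6])
    mean = np.zeros(3)
    scale = np.ones(3)
    print(gaussian_likelihood(quantize(y, mean), mean, scale))
    print(gaussian_likelihood(add_uniform_noise(y, rng), mean, scale))
```

Evaluating the same likelihood on noisy values during training and on rounded values at inference is what lets the rate term be optimized with gradient descent.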
3.4 Learnable scalable compression
As an architecture for learnable compression, we use a simplified version of the work in [11]. We drop the side information components from the coder, introduced as a hyper-prior in [8]. We also remove the attention layers introduced in [10]. To reduce the memory footprint and speed up the training procedure, we incrementally reduce the channels in the first layers of the analyzers and the last layers of the synthesizers.
The base and enhancement tasks have this same architecture. The base generates a representation with the same dimensionality and resolution as the input but reconstruction of the input is not enforced. This representation is the input for the computer vision model. The coder and the computer vision model are trained together end-to-end.
In the conditional approach, is a traditional CNN composed of blocks with residual connections between them. Each block has three convolutional layers that perform a transformation in a higher dimensionality and scale the output back to a lower dimensionality. The first half of the blocks maintain the same dimensionality as , while the second half transitions to the same dimensionality as , obtaining . The resolution is maintained across the network as both and have the same resolution. In the residual approach, uses the same architecture as our synthesizers, to transform into . The synthesizer upscales the representation to match the resolution and dimensionality of .
A representation that is optimized for can contain information that might not be beneficial for reconstruction on its own. Moreover, the information in is represented in a way that is suitable for the computer vision task and bringing it all back to the image feature space through can be challenging. To overcome these obstacles, we add a small reconstruction penalty on a transformed to the rate-distortion Lagrange minimization formulation:
(6) |
where is an auxiliary network with the same architecture as the other synthesizers and and are hyper-parameters. For the enhancement representations, we use the traditional rate-distortion loss function [6]:
for the conditional and residual approaches, respectively. During training, either the base network remains frozen or the gradients from the reconstruction network do not flow into the base network.
4 Experiments




We conduct experiments to analyze the rate-distortion performance of the proposed conditional and residual methods for scalable coding and compare them against the proposed baselines. We perform two sets of experiments: one using semantic segmentation as the computer vision task on the Cityscapes dataset [20], and another using object detection as the computer vision task on the COCO 2017 dataset [21].
We first train the base representation on the computer vision task to obtain and for rate-distortion points under different values of . We choose a model corresponding to a point on the rate-distortion curve that achieves reasonable distortion and subsequently use it to generate the representations for the conditional and residual approaches. The upper baseline is created by training the reconstruction task with no side information. We use the same architecture for the analysis and synthesis transforms as the other models. The entropy model for the upper baseline has the same architecture as the one used in the residual approach. The lower baseline is obtained by adding the rate of the base representation used for the conditional and residual approaches.
Across all experiments, we allocated channels to the base representation and channels to the enhancement representation. In the analysis transforms, the four down-scaling operations have output channels 24, 48, 192, and or , while the up-scaling operations in the synthesis transform have output channels 192, 48, 24, and . The entropy model consists of 5 blocks with and .
To train the reconstruction tasks in all experiments, we use the RMSE function as the distortion function . We compute and report the bits-per-pixel (BPP) using the entropy estimates, which in several experiments differed by at most 0.5% from the achieved BPP. Also, to speed up the computation of the rate-distortion curve, we often train a model under low-compression settings and use its weights as initialization for the models trained to obtain the rest of the curve. The parameters are updated using Adam at a learning rate of . We train models with early stopping but first decay the learning rate by a factor of 0.75 if a plateau is reached.
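As a small illustration of how a BPP figure can be derived from entropy estimates, the sketch below sums the negative log-likelihoods of all coded elements and divides by the number of image pixels. It is an assumption-laden example: the function names are hypothetical, and the likelihoods are taken to be the per-element probabilities produced by an entropy model.

```python
import numpy as np


def estimated_bpp(likelihoods, height, width):
    """Estimated bits-per-pixel from per-element likelihoods.

    `likelihoods` holds the model probability of every coded element of the
    representation; `height` and `width` are the input image dimensions.
    """
    total_bits = -np.sum(np.log2(np.asarray(likelihoods, dtype=float)))
    return float(total_bits / (height * width))


def relative_gap(estimated, achieved):
    """Relative difference between estimated and achieved BPP, in percent."""
    return abs(estimated - achieved) / achieved * 100.0


if __name__ == "__main__":
    # Eight elements coded with probability 0.5 each, for a 2x2 image.
    bpp = estimated_bpp(np.full(8, 0.5), height=2, width=2)
    print(bpp, relative_gap(bpp, achieved=2.01))
```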
4.1 Image semantic segmentation on Cityscapes
Cityscapes is a set of images of urban scenes for semantic understanding [20]. We use DeepLabV3 [22] as the computer vision model for segmentation with MobileNetV3 [23] as a back-end. Here, is the per-pixel multi-class cross-entropy, although we report the mean intersection over union (mIoU) metric. We set and report the results on the validation dataset. For data augmentation, we use random crops of pixels, random horizontal flips and color jittering. The front-end of the model corresponding to the coder is trained with Adam using a learning rate of , while the classifier is trained with stochastic gradient descent using momentum and a learning rate of . A loss is added to the weights of the classifier to prevent over-fitting, with a scale factor of .
Fig. 3(a) shows the rate-distortion curves for the conditional and residual approaches. Both curves lie between the baselines, producing rate-distortion points that respect them. Compared to the rate-distortion curve of the lower baseline, the conditional approach has a BD-Rate [24] of , whereas the residual approach achieves a rate reduction. Thus, the conditional approach performs marginally better than the residual approach in terms of BD-Rate. Looking at the ratio between these BD-Rate scores and the BD-Rate score achieved by the upper baseline, we can compute the percentage of the base representation utilized. As such, the conditional approach uses of the side information rate, whereas the residual approach uses . In the lowest-compression settings under both approaches, the utilization is higher.
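For readers unfamiliar with the BD-Rate metric [24], the following is a minimal sketch of the usual calculation: a cubic polynomial is fitted to the log-rate as a function of the quality metric for each curve, both fits are integrated over the overlapping quality range, and the average log-rate difference is converted to a percentage. The function and argument names are illustrative, each curve needs at least four points, and using this exact variant (rather than, for example, a piecewise interpolation) is an assumption about the evaluation rather than a description of it.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average rate difference (%) of the test curve relative to the anchor.

    Negative values mean the test curve needs less rate for the same quality.
    The quality axis can be any metric shared by both curves.
    """
    qa = np.asarray(quality_anchor, dtype=float)
    qt = np.asarray(quality_test, dtype=float)
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of quality for each curve.
    fit_a = np.polyfit(qa, log_ra, 3)
    fit_t = np.polyfit(qt, log_rt, 3)

    # Integrate both fits over the quality interval covered by both curves.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    area_a = np.polyval(int_a, hi) - np.polyval(int_a, lo)
    area_t = np.polyval(int_t, hi) - np.polyval(int_t, lo)

    avg_log_diff = (area_t - area_a) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)


if __name__ == "__main__":
    quality = [30.0, 33.0, 36.0, 39.0]
    anchor = [0.1, 0.2, 0.4, 0.8]
    test = [0.09, 0.18, 0.36, 0.72]  # 10% less rate at every quality level
    print(bd_rate(anchor, quality, test, quality))
```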
Fig. 3(c) shows the rate-distortion performance of the base task. The chosen value places a penalty on both the rate and the task performance but allows the base representation to be exploited by the architecture. We attribute the small imperfections in this rate-distortion curve to the choice of and the limitations of the training algorithm.
4.2 Object detection on COCO
COCO 2017 has 123,287 domain-agnostic images for object detection and segmentation [21]. We use Faster R-CNN [25] for object detection with ResNet-50 [26] as the back-end. For this task, is the sum of the different loss functions employed by this architecture. We report the mean average precision (mAP) metric computed according to [21], and set . As data preprocessing for training, we use random horizontal flips and generate batches with similar aspect ratios grouped in 3 clusters. Images inside a batch are resized to the minimum size and the bounding boxes are adjusted accordingly when training the computer vision task. When training for reconstruction, the images in a batch are center-cropped to their minimum size. All weights are trained with Adam using a learning rate of .
As shown in Fig. 3(b), the performance of both approaches is comparable, with a and a BD-Rate improvement over the lower baseline, for the conditional and residual methods, respectively. We achieve a utilization of the base in terms of BD-Rate of 49.24% and 29.32% for the conditional and residual approaches, respectively. Fig. 3(d) shows the rate-distortion performance of the base task. Compared to semantic segmentation on Cityscapes, the task model is better at reaching the uncompressed task performance. Also, for a similar distortion penalty, this task uses more rate. The rates obtained by the reconstruction task on the COCO dataset are almost an order of magnitude larger than the ones from Cityscapes. This can be explained by the simplicity of the content in the images found in Cityscapes, and the higher amount of compression artifacts found in the COCO dataset.
5 Conclusion
We present conditional and residual methods for scalable coding for humans and machines. Our experiments show that the proposed architectures for conditional and residual coding perform similarly and that the rate-distortion performance is within the presented baselines or operational bounds. In addition, the proposed conditional entropy model is able to match the performance of the residual method.
References
- [1] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in IEEE ICASSP, 2021, pp. 8493–8497.
- [2] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE TCSVT, vol. 17, no. 9, pp. 1103–1120, 2007.
- [3] H. Choi and I. V. Bajić, “Scalable image coding for humans and machines,” IEEE TIP, vol. 31, pp. 2739–2754, 2022.
- [4] B. J. Culpepper and B. A. Olshausen, “Learning transport operators for image manifolds,” in NeurIPS, 2009, pp. 423–431.
- [5] M. Connor, G. Canal, and C. Rozell, “Variational autoencoder with learned latent structure,” in AISTATS, 2021, vol. 130, pp. 2359–2367.
- [6] N. Tishby, F. C. N. Pereira, and W. Bialek, “The information bottleneck method,” CoRR, vol. physics/0004057, 2000.
- [7] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
- [8] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in ICLR, 2018.
- [9] D. Minnen, J. Ballé, and G. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in NeurIPS, 2018, pp. 10794–10803.
- [10] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in IEEE CVPR, 2020, pp. 7936–7945.
- [11] D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y. Wang, “ELIC: efficient learned image compression with unevenly grouped space-channel contextual adaptive coding,” in IEEE CVPR, 2022, pp. 5708–5717.
- [12] H. Choi and I. V. Bajić, “Latent-space scalability for multi-task collaborative intelligence,” in IEEE ICIP, 2021, pp. 3562–3566.
- [13] E. Ozyilkan, M. Ulhaq, H. Choi, and F. Racape, “Learned disentangled latent representations for scalable image coding for humans and machines,” arXiv 2301.04183, 2023.
- [14] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neural inter-frame compression for video coding,” in IEEE ICCV, 2019, pp. 6420–6428.
- [15] T. Ladune, P. Philippe, W. Hamidouche, L. Zhang, and O. Déforges, “Conditional coding for flexible learned video compression,” in ICLR, 2021.
- [16] F. Brand, J. Seiler, and A. Kaup, “On benefits and challenges of conditional interframe video coding in light of information theory,” in PCS, 2022, pp. 289–293.
- [17] H. Hadizadeh and I. V. Bajić, “LCCM-VC: Learned conditional coding modes for video coding,” arXiv 2210.15883, 2022.
- [18] D. Minnen and S. Singh, “Channel-wise autoregressive entropy models for learned image compression,” in IEEE ICIP, 2020, pp. 3339–3343.
- [19] D. He, Y. Zheng, B. Sun, Y. Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” in IEEE CVPR, 2021, pp. 14771–14780.
- [20] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in IEEE CVPR, 2016, pp. 3213–3223.
- [21] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV, 2014, pp. 740–755.
- [22] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” CoRR, vol. abs/1706.05587, 2017.
- [23] A. Howard, R. Pang, H. Adam, Q. V. Le, M. Sandler, B. Chen, W. Wang, L.-C. Chen, M. Tan, G. Chu, V. Vasudevan, and Y. Zhu, “Searching for MobileNetV3,” in IEEE ICCV, 2019, pp. 1314–1324.
- [24] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” ITU-T SC16/Q6 VCEG-M33, 2001.
- [25] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE TPAMI, vol. 39, no. 6, pp. 1137–1149, 2017.
- [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016, pp. 770–778.