Conditional and Residual Methods in Scalable Coding for Humans and Machines

Abstract

We present methods for conditional and residual coding in the context of scalable coding for humans and machines. Our focus is on optimizing the rate-distortion performance of the reconstruction task using the information available in the computer vision task. We include an information analysis of both approaches to provide baselines, and we also propose an entropy model suitable for conditional coding, with increased modelling capacity and tractability similar to that of previous work. We apply these methods to image reconstruction, using, in one instance, representations created for semantic segmentation on the Cityscapes dataset, and in another instance, representations created for object detection on the COCO dataset. In both experiments, we obtain similar performance between the conditional and residual methods, with the resulting rate-distortion curves contained within our baselines.

Index Terms—  learnable compression, scalable coding, conditional coding, residual coding, entropy modelling

1 Introduction

With the prominence of artificial intelligence, digital content is not only consumed by humans but also by computer programs. These programs analyze content in different ways, according to their purpose, and depending on the task, only a subset of the available information may be necessary. Moreover, the required information can be represented in a form better suited to the computer program, one that does not necessarily resemble the original natural representation that humans typically need to consume such content.

In a collaborative setting [1], where edge devices capture signals that are processed and transmitted to cloud services to complete a set of tasks, it is efficient to transmit only the information necessary to achieve these tasks. Creating representations for every subset of tasks does not scale well with the number of tasks. In addition, if information for some tasks has already been transmitted and a superset of the original tasks is now required for the same input, transmitting the new corresponding representation would incur an overhead in redundant information. Thus, we would like to compose the information required for tasks in a scalable fashion [2], in which base representations are shared among multiple tasks and only incremental amounts of information are required for more specific tasks.

Designing learnable tasks that make use of multiple streams of information, some of which are fitted for other purposes, is a challenge [3]. The lower-dimensional manifold induced by a particular task might not be readily usable by a different task. Translating representations from one manifold to another, such that the maximum amount of information is usable in a secondary task, is limited by the modelling capacity of the transformation and the data available [4, 5]. Conditional and residual coding have prevailed as two different approaches to incorporate side information in learnable compression settings. These approaches can leverage dedicated learnable transformations to explicitly transfer information to the target domain.

We limit our findings to a common setting in which we have an image reconstruction task and a computer vision task whose representation is shared with the former. This configuration is referred to as scalable image coding for humans and machines [3]. We present conditional and residual approaches for scalable learnable compression in which we transform the representations to share a common feature space. We derive baselines for these approaches and empirically compare them. Our experiments perform image reconstruction on different datasets using representations for semantic image segmentation and object detection. We also present an entropy model with increased modelling capacity suitable for conditional coding.

2 Related Work

(a) Conditional
(b) Residual
Fig. 1: Overall architecture of the residual and conditional methods. The dotted line signifies that the enhancement network does not affect the base network. The conditional entropy decoder models $H(Y_c|Y_t)$.

In learnable compression, an information bottleneck is induced on an intermediate representation between the input and the output [6]. Successful approaches follow a variational framework [7] in which a hyper-prior representation learns the dependencies between the different factors of a latent representation and operates as side information [8, 9, 10, 11]. An auto-regressive entropy model is used to induce the information bottleneck and to entropy code the learnt representations.

Recent work on scalable coding for humans and machines applies the ideas of learnable compression to both the reconstruction and computer vision tasks [12, 3, 13]. In these approaches, the reconstruction task uses a dedicated and a shared representation. These representations are concatenated after being decoded, and are used as input for a reconstruction model. Through rate-distortion optimization, this approach could create independent representations with little redundancy of information between them, but the results of [3] and [13] show considerable redundancy. Our work focuses primarily on efficiently coding the dedicated representation for the reconstruction task. In the conditional approach, the dedicated representation contains all information relevant to reconstruction, but the uncertainty resolved by the shared representation is exploited during coding. In the residual approach, the information of the shared representation is removed from the target representation before coding it, and then added back down the pipeline after decoding.

Residual and conditional coding in the context of learnable compression have been explored before for video compression [14, 15, 16, 17]. Our formulation is different in that the prediction is completely explained by the original signal; as such, the information in the residual cannot exceed that of the original signal, whereas in video compression this can occur since the previous frame is used to compute the prediction. In [16], it is shown that learnable conditional coding often requires a transformation of the side information, potentially resulting in information loss to a degree where a residual approach could outperform the conditional approach. In this work, we propose to transform the side information in both approaches and show how this can improve the performance of the residual approach with respect to the conditional approach.

Many of the existing entropy models for learnable compression support the conditional coding of the target representation given the hyper-prior representation [8]. Recent entropy models extend these ideas to efficiently capture the spatial and dimensional dependencies of these representations, by grouping factors together [18], and reorganizing and parallelizing the decoding order of the spatial locations [19]. In this work, we utilize our conditional information in a similar fashion, but we increase the modelling capacity by augmenting its receptive field and adding scaled residual connections.

3 Proposed Methods

For an input image $X\in\mathbb{R}^{C_x\times H\times W}$, a lossily-compressed base representation $Y_b=f_b(X)$ is learned so as to minimize the distortion $D_b=\mathbb{E}_X[d_b(g_b(Y_b),T)]$ with respect to a given computer vision target $T$, a task distortion function $d_b(\cdot,\cdot)$, and a learnable decoding function $g_b(\cdot)$.

In the conditional setting, a lossily-compressed enhancement representation $Y_c=f_e(X)$ is learned to minimize the distortion $D_c=\mathbb{E}[d_e(\widehat{X},X)]$, where $\widehat{X}=g_e(Y_c)$, using an image reconstruction distortion function $d_e(\cdot,\cdot)$ and a learnable decoder $g_e(\cdot)$. All information used for the reconstruction task is contained in $Y_c$, and the information contained in $Y_b$ is used to code $Y_c$ efficiently. Conditional coding effectively models $H(Y_c|Y_t)$, where $Y_t=h_c(Y_b)$ is a learnable transformation of $Y_b$ that, intuitively, has a feature space similar to that of the enhancement representation $Y_c$, so that their similarities can be exploited. Any information that reduces the conditional entropy should be retained in $Y_t$, since its rate is not penalized.

In the residual approach, an analogous representation $Y_r=f_e(X_r)$, with $X_r=X-X_p$ and $X_p=h_r(Y_b)$, is created to minimize $D_r=\mathbb{E}[d_e(g_e(Y_r)+X_p,X)]$. Here, $h_r(\cdot)$ is a learnable transformation of $Y_b$ that implicitly reconstructs the image. The prediction $X_p$ is added at the end of the reconstruction process. Fig. 1 shows architecture diagrams for both configurations. (Official code release: https://github.com/adeandrade/research)
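To make the two formulations concrete, the following is a minimal PyTorch sketch of the conditional and residual enhancement paths, assuming encoder/decoder modules $f_e$, $g_e$, the transforms $h_c$, $h_r$, and an entropy model are supplied elsewhere; the module and argument names are ours and are not taken from the official code release.

```python
import torch
import torch.nn as nn

class ConditionalEnhancement(nn.Module):
    """Sketch of the conditional path: Y_c carries all reconstruction
    information; Y_t = h_c(Y_b) is used only by the entropy model."""

    def __init__(self, f_e, g_e, h_c, entropy_model):
        super().__init__()
        self.f_e, self.g_e, self.h_c, self.entropy_model = f_e, g_e, h_c, entropy_model

    def forward(self, x: torch.Tensor, y_b: torch.Tensor):
        y_c = self.f_e(x)                    # enhancement latent
        y_t = self.h_c(y_b.detach())         # side info; no gradient into the base
        rate = self.entropy_model(y_c, y_t)  # estimate of H(Y_c | Y_t) in bits
        x_hat = self.g_e(y_c)                # reconstruction uses Y_c only
        return x_hat, rate

class ResidualEnhancement(nn.Module):
    """Sketch of the residual path: the prediction X_p = h_r(Y_b) is
    subtracted before coding and added back after decoding."""

    def __init__(self, f_e, g_e, h_r, entropy_model):
        super().__init__()
        self.f_e, self.g_e, self.h_r, self.entropy_model = f_e, g_e, h_r, entropy_model

    def forward(self, x: torch.Tensor, y_b: torch.Tensor):
        x_p = self.h_r(y_b.detach())         # implicit reconstruction from the base
        x_r = x - x_p                        # residual to be coded
        y_r = self.f_e(x_r)
        rate = self.entropy_model(y_r)       # estimate of H(Y_r) in bits
        x_hat = self.g_e(y_r) + x_p          # prediction added back after decoding
        return x_hat, rate
```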

3.1 Bounds for conditional coding

Our theoretical analysis is performed in the lossless case to motivate the proposed baselines for our lossy approaches. In conditional coding, we model $H(Y_b)+H(Y_c|Y_t)$, which has $H(Y_c)$ as a lower bound:

\begin{align}
H(Y_c) &\leq H(Y_c) + H(Y_t|Y_c) = H(Y_c, Y_t) \nonumber\\
&= H(Y_t) + H(Y_c|Y_t) \leq H(Y_b) + H(Y_c|Y_t), \tag{1}
\end{align}

where we used $H(Y_t)\leq H(Y_b)$ due to the data processing inequality. This bound is tight when $H(Y_t|Y_c)=0$ and $H(Y_b|Y_t)=0$, or equivalently, when $H(Y_b)=I(Y_c;Y_t)$. This corresponds to $H(Y_c|Y_t)$ being reduced by $H(Y_b)$ relative to $H(Y_c)$. An upper bound is obtained by:

\begin{align}
H(Y_b) + H(Y_c|Y_t) &= H(Y_b) + H(Y_c) - I(Y_c;Y_t) \nonumber\\
&\leq H(Y_b) + H(Y_c). \tag{2}
\end{align}

This bound is tight when $I(Y_c;Y_t)=0$, which corresponds to $Y_c$ and $Y_t$ being independent.

We provide an upper baseline for the conditional approach by using a standalone enhancement representation $Y_e$ generated without relying on any side information, and measuring $\hat{H}(Y_e)$, where $\hat{H}(\cdot)$ is an entropy estimate. As a lower baseline we use $\hat{H}(Y_b)+\hat{H}(Y_e)$. This is motivated by considering that $Y_e$ can be more efficient as a task representation than $Y_c$, and by the bounds in (1) and (2).
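As a sanity check on the bounds in (1) and (2), the following toy example computes the relevant entropies for a small discrete joint distribution of our own choosing, with $Y_t$ a deterministic function of $Y_b$; it is purely illustrative and not part of the experimental pipeline.

```python
import numpy as np

# Toy check of bounds (1)-(2). Rows index Y_b (4 values), columns index Y_c.
# Y_t = Y_b mod 2 plays the role of the learnable transform of Y_b.
p_joint = np.array([[0.15, 0.05, 0.05, 0.00],
                    [0.05, 0.15, 0.00, 0.05],
                    [0.05, 0.00, 0.15, 0.05],
                    [0.00, 0.05, 0.05, 0.15]])

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) pmf."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

p_b = p_joint.sum(axis=1)                                 # marginal of Y_b
p_c = p_joint.sum(axis=0)                                 # marginal of Y_c
p_tc = np.stack([p_joint[0] + p_joint[2],                 # Y_t = 0
                 p_joint[1] + p_joint[3]])                # Y_t = 1
p_t = p_tc.sum(axis=1)

H_b, H_c, H_t = H(p_b), H(p_c), H(p_t)
H_c_given_t = H(p_tc) - H_t                               # H(Y_c|Y_t) = H(Y_c,Y_t) - H(Y_t)

print(f"H(Y_c)              = {H_c:.3f}")
print(f"H(Y_b) + H(Y_c|Y_t) = {H_b + H_c_given_t:.3f}")
print(f"H(Y_b) + H(Y_c)     = {H_b + H_c:.3f}")
assert H_c <= H_b + H_c_given_t <= H_b + H_c              # bounds (1) and (2)
```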

3.2 Bounds for residual coding

It has been shown that conditional coding is a lower bound of residual coding [16, 17]:

\begin{align}
H(X|X_p) &= H(X_r + X_p\,|\,X_p) = H(X_r|X_p) \tag{3}\\
&= H(X_r) - I(X_p;X_r) \leq H(X_r) \tag{4}\\
&= H(X|X_p) + I(X_p;X_r). \tag{5}
\end{align}

Here, (3) uses the fact that having observed $X_p$, the only uncertainty in $X_r+X_p$ is due to $X_r$. The inequality in (4) uses the non-negativity of mutual information. $H(X_r)$ is rewritten in (5) using the definition of mutual information and, once again, the fact that given $X_p$, the only uncertainty in $X-X_p$ is due to $X$.

The term $I(X_p;X_r)$ in (4) acts as a penalty term on the residual formulation. To minimize it, the residual $X_r$ and the prediction $X_p$ must be as independent from each other as possible. This can be achieved when $X_p$ collapses values in $X$ so that $H(X_r)$ decreases, or when $X_p$ produces a constant value for different values of $X$, reducing $H(X_p)$. Reducing $H(X_p)$ increases $H(X|X_p)$, which in turn could have the adverse effect of increasing $H(X_r)$, as shown in (5).

In our proposed method we train to minimize $D_r$ and $\hat{H}(Y_r)$. By extension, we also minimize $\hat{H}(X_r)$. As shown in (5), this reduces both $H(X|X_p)$ and $I(X_p;X_r)$. Hence, this optimization procedure encourages the learnable function $h_r(\cdot)$ to create a representation $X_p$ that recovers the input $X$ as accurately as possible, while at the same time being as independent as possible from the resulting residual $X_r$.

Note that enforcing the similarity of $X_p$ and the original input $X$ may not be an optimal procedure, since even though such an optimization will decrease $H(X|X_p)$, it may lead to an increase in $I(X_p;X_r)$. This explains why, in preliminary experiments, we found that having a function $h_r(\cdot)$ that explicitly reconstructs $X$ does not perform as well as our proposed method.

Due to the previous considerations stemming from our proposed method, we find that $H(Y_c|Y_t)=H(Y_r)$ can be easier to achieve. This motivates us to compare $\hat{H}(Y_b)+\hat{H}(Y_r)$ against the same baselines used in the conditional approach.

3.3 Entropy modelling

(a) Convolution mask
(b) CNN block
Fig. 2: Entropy model overview. (a) The convolution has kernel size $3\times 3$ and the input is $12\times 4\times 4$. The conditional input has size $3\times 4\times 4$ and there are $K=4$ groups. With an input padding and a stride of 1, this is the 7th step of the convolution.

To conditionally code a representation $Y_c$ so that it exploits as much information as possible from $Y_t$, we model the spatial and dimensional dependencies between and within representations using a CNN [18]. Our proposed entropy model strikes a balance between complexity and accuracy. We group channels into groups of a fixed size $K$ [18]. Within each group, the same spatial location across channels is processed in parallel, using as context all locations in the previous groups and all previous locations across all channels of the current group, within the receptive field of the convolutional layer. Similarly to [8], locations in the spatial domain are processed in a top-to-bottom, left-to-right fashion. The Markov property is enforced by a mask applied to the convolution kernels. Fig. 2(a) shows the kernel mask for a single output channel of a layer.

Unlike previous work, the CNN architecture of our entropy model has scaled residual connections and deeper layers with kernel sizes larger than one for its auto-regressive convolutions. This removes some of the modelling limitations imposed by similar entropy models. The CNN architecture has blocks of three layers in which the input channels are scaled up, transformed in a higher-dimensional space, and scaled back down to the original number of channels. The residual connections are introduced in-between these blocks, such that the inputs can be re-scaled differently across the channel dimension. To maintain the Markov property when the number of channels changes, the group sizes are re-scaled accordingly and the number of channels can only change in multiples of $M$. Fig. 2(b) shows the architecture overview of a single block in the CNN.

The conditional representation is available as another channel group and is transformed by the CNN in the same way as other groups, except that its context is restricted to the receptive field within that group. All elements of the conditioned representation have access to this group.
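Below is a sketch of how such a kernel mask could be constructed for a single masked layer whose outputs cover only the coded channels, assuming the input channels are laid out as the conditional group followed by contiguous coded groups of size K. This is our reading of the scheme illustrated in Fig. 2(a); the helper names and layout are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_group_mask(out_ch, in_ch, k, K, cond_ch):
    """Mask of shape (out_ch, in_ch, k, k). Inputs: [conditional group | coded
    channels in groups of K]; outputs: the coded channels, also grouped by K."""
    assert out_ch == in_ch - cond_ch
    mask = torch.zeros(out_ch, in_ch, k, k)
    c = k // 2                                  # kernel center
    for o in range(out_ch):
        g_out = o // K                          # group index of the output channel
        mask[o, :cond_ch] = 1.0                 # conditional group: fully visible
        for i in range(cond_ch, in_ch):
            g_in = (i - cond_ch) // K           # group index of the coded input channel
            if g_in < g_out:
                mask[o, i] = 1.0                # previous groups: whole receptive field
            elif g_in == g_out:
                mask[o, i, :c, :] = 1.0         # rows above the current location
                mask[o, i, c, :c] = 1.0         # same row, strictly to the left
            # later groups remain masked out
    return mask

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is multiplied by a fixed autoregressive mask."""
    def __init__(self, in_ch, out_ch, k, K, cond_ch):
        super().__init__(in_ch, out_ch, k, padding=k // 2)
        self.register_buffer("mask", build_group_mask(out_ch, in_ch, k, K, cond_ch))

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```

In deeper layers of the entropy model, the conditional group would also be carried through and transformed with a context restricted to itself; the sketch above omits this for brevity.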

Similarly to [8] and more recent works, the predictions of the entropy model correspond to the means $W$ and scales $\Sigma$ of a univariate Gaussian distribution assigned to each element in the representation. The symbols are obtained by $Q=\lfloor Y-W\rceil$, and the corresponding probability is $P_{\mathcal{N}(\mathbf{0},\Sigma)}[Q-\nicefrac{1}{2} \leq q \leq Q+\nicefrac{1}{2}]$. During training, the rounding operation is simulated by adding uniform noise $\mathcal{U}(-\nicefrac{1}{2},\nicefrac{1}{2})$.
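A minimal sketch of how these per-element probabilities can be turned into a differentiable rate estimate is shown below, assuming the entropy model has already produced the means and (positive) scales for each latent element; the function name and the numerical clamp are our own.

```python
import torch

def rate_bits(y, mean, scale, training=True):
    """Rate estimate (in bits) under the per-element Gaussian model.
    At training time, rounding is replaced by additive uniform noise."""
    if training:
        q = (y - mean) + torch.empty_like(y).uniform_(-0.5, 0.5)
    else:
        q = torch.round(y - mean)
    gauss = torch.distributions.Normal(torch.zeros_like(q), scale)
    # probability mass of the unit-width interval around each symbol
    p = gauss.cdf(q + 0.5) - gauss.cdf(q - 0.5)
    return -torch.log2(p.clamp_min(1e-9)).sum()
```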

3.4 Learnable scalable compression

As an architecture for learnable compression, we use a simplified version of the work in [11]. We drop the side information components from the coder, introduced as a hyper-prior in [8]. We also remove the attention layers introduced in [10]. To reduce the memory footprint and speed up the training procedure, we incrementally reduce the channels in the first layers of the analyzers and the last layers of the synthesizers.

The base and enhancement tasks have this same architecture. The base generates a representation with the same dimensionality and resolution as the input $X$, but reconstruction of the input is not enforced. This representation is the input for the computer vision model. The coder and the computer vision model are trained together end-to-end.

In the conditional approach, $h_c(\cdot)$ is a traditional CNN composed of blocks with residual connections between them. Each block has three convolutional layers that perform a transformation in a higher-dimensional space and scale the output back to a lower dimensionality. The first half of the blocks maintain the same dimensionality as $Y_b$, while the second half transitions to the same dimensionality as $Y_c$, obtaining $Y_t$. The resolution is maintained across the network, as both $Y_b$ and $Y_c$ have the same resolution. In the residual approach, $h_r(\cdot)$ uses the same architecture as our synthesizers to transform $Y_b$ into $X_p$. The synthesizer upscales the representation to match the resolution and dimensionality of $X$.
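The following is a rough sketch of such a transform $h_c(\cdot)$; only the overall block structure is described above, so the kernel sizes, activation, expansion factor, and number of blocks used here are our assumptions.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """One block of h_c: expand, transform at higher width, project back."""
    def __init__(self, ch, expand=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch * expand, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * expand, ch * expand, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * expand, ch, 1),
        )

    def forward(self, x):
        return x + self.body(x)                 # residual connection between blocks

def make_h_c(c_b=32, c_e=256, n_blocks=4):
    """First half keeps the width of Y_b; second half moves to that of Y_c."""
    half = n_blocks // 2
    layers = [Bottleneck(c_b) for _ in range(half)]
    layers += [nn.Conv2d(c_b, c_e, 1)]          # transition to the enhancement width
    layers += [Bottleneck(c_e) for _ in range(n_blocks - half)]
    return nn.Sequential(*layers)
```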

A representation $Y_b$ that is optimized for $D_b$ can contain information that might not be beneficial for reconstruction on its own. Moreover, the information in $Y_b$ is represented in a way that is suitable for the computer vision task $T$, and bringing it all back to the image feature space through $h_r(\cdot)$ can be challenging. To overcome these obstacles, we add a small reconstruction penalty on a transformed $Y_b$ to the rate-distortion Lagrange minimization formulation:

\begin{align}
\mathcal{L}_b = D_b + \lambda_b\,\hat{H}(Y_b) + \beta\,\mathbb{E}[d_e(\hat{h}_r(Y_b),X)], \tag{6}
\end{align}

where $\hat{h}_r(\cdot)$ is an auxiliary network with the same architecture as the other synthesizers, and $\lambda_b$ and $\beta$ are hyper-parameters. For the enhancement representations, we use the traditional rate-distortion loss function [6]:

\begin{align}
\mathcal{L}_c = D_c + \lambda_e\,\hat{H}(Y_c|Y_t), \qquad \mathcal{L}_r = D_r + \lambda_r\,\hat{H}(Y_r), \nonumber
\end{align}

for the conditional and residual approaches, respectively. During training, either the base network remains frozen or the gradients from the reconstruction network do not flow into the base network.
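As an illustration of how the enhancement loss can be optimized while leaving the base network untouched, here is a sketch of one training step for the conditional variant, reusing the hypothetical modules from the earlier sketch; the per-pixel rate normalization and the helper names are our own assumptions.

```python
import torch

def enhancement_step(model, base_encoder, x, d_e, lambda_e, optimizer):
    """One optimization step of L_c = D_c + lambda_e * H(Y_c | Y_t)."""
    with torch.no_grad():                    # gradients never reach the base network
        y_b = base_encoder(x)
    x_hat, rate_bits = model(x, y_b)         # rate_bits estimates H(Y_c | Y_t)
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    loss = d_e(x_hat, x) + lambda_e * rate_bits / num_pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```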

4 Experiments

(a) Scalable coding rate-distortion curves for Cityscapes
(b) Scalable coding rate-distortion curves for COCO
(c) Rate-distortion curve for semantic segmentation on Cityscapes
(d) Rate-distortion curve for object detection on COCO
Fig. 3: Scalable coding results. The purple lines represent the performance attained with $\lambda_b=0$ and $\beta=0$.

We conduct experiments to analyze the rate-distortion performance of the proposed conditional and residual methods for scalable coding and compare them against the proposed baselines. We perform two sets of experiments: one using semantic segmentation as the computer vision task on the Cityscapes dataset [20], and another using object detection as the computer vision task on the COCO 2017 dataset [21].

We first train the base representation on the computer vision task to obtain $f_b(\cdot)$ and $g_b(\cdot)$ for rate-distortion points under different values of $\lambda_b$. We choose a model corresponding to a point on the rate-distortion curve that achieves reasonable distortion and subsequently use it to generate the representations $Y_b$ for the conditional and residual approaches. The upper baseline is created by training the reconstruction task with no side information. We use the same architecture for the analysis and synthesis transforms as the other models. The entropy model for the upper baseline has the same architecture as the one used in the residual approach. The lower baseline is obtained by adding the rate of the base representation used for the conditional and residual approaches.

Across all experiments, we allocated $C_b=32$ channels to the base representation and $C_e=256$ channels to the enhancement representation. In the analysis transforms, the four down-scaling operations have output channels 24, 48, 192, and $C_b$ or $C_e$, while the up-scaling operations in the synthesis transform have output channels 192, 48, 24, and $C_x$. The entropy model consists of 5 blocks with $K=16$ and $M=1$.

To train the reconstruction tasks in all experiments, we use the RMSE function as the distortion function $d_e(\cdot,\cdot)$. We compute and report the bits-per-pixel (BPP) using the entropy estimates, which in several experiments differed by at most 0.5% from the achieved BPP. Also, to speed up the computation of the rate-distortion curve, we often train a model under low-compression settings and use its weights as initialization for the models trained to obtain the rest of the curve. The parameters are updated using Adam at a learning rate of $10^{-4}$. We train models with early stopping, but first decay the learning rate by a factor of 0.75 if a plateau is reached.
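A sketch of this optimization schedule (Adam at $10^{-4}$, learning-rate decay by 0.75 on a plateau, and early stopping) is given below; the patience values and loop structure are our own assumptions, as they are not reported above.

```python
import torch

def train_with_plateau_decay(model, train_one_epoch, validate, max_epochs=1000,
                             lr=1e-4, factor=0.75, patience=10, es_patience=30):
    """Adam + ReduceLROnPlateau (factor 0.75) + early stopping on validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=factor, patience=patience)
    best, wait = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_loss = validate(model)
        scheduler.step(val_loss)            # decay LR when validation plateaus
        if val_loss < best:
            best, wait = val_loss, 0
        else:
            wait += 1
            if wait >= es_patience:         # early stopping
                break
    return best
```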

4.1 Image semantic segmentation on Cityscapes

Cityscapes is a set of images of urban scenes for semantic understanding [20]. We use DeepLabV3 [22] as the computer vision model for segmentation, with MobileNetV3 [23] as a back-end. Here, $d_b(\cdot,\cdot)$ is the per-pixel multi-class cross-entropy, although we report the mean intersection over union (mIoU) metric. We set $\beta=0.1$ and report the results on the validation dataset. For data augmentation, we use random crops of $768\times 768$ pixels, random horizontal flips, and color jittering. The front-end of the model corresponding to the coder is trained with Adam using a learning rate of $10^{-4}$, while the classifier is trained with stochastic gradient descent using momentum and a learning rate of $10^{-2}$. An $\ell_2$ loss is added to the weights of the classifier to prevent over-fitting, with a scale factor of $10^{-4}$.

Fig. 3(a) shows the rate-distortion curves for the conditional and residual approaches. We notice that these lines lie in between the baselines, producing rate-distortion points that respect them. Compared to the rate-distortion curve of the lower baseline, the conditional approach has a BD-Rate [24] of $-16.56\%$, whereas the residual approach achieves a $-14.6\%$ rate reduction. Thus, the conditional approach performs marginally better than the residual approach in terms of BD-Rate. Looking at the ratio between these BD-Rate scores and the BD-Rate score achieved by the upper baseline, we can compute the percentage of the base representation utilized. As such, the conditional approach uses $43.01\%$ of the side information rate, whereas the residual approach uses $37.91\%$. In the lowest-compression settings under both approaches, the utilization is higher.

Fig. 3(c) shows the rate-distortion performance of the base task. The chosen $\beta$ value places a penalty on both the rate and the task performance but allows the base representation to be exploited by the architecture. We attribute the small imperfections in this rate-distortion curve to the choice of $\beta$ and the limitations of the training algorithm.

4.2 Object detection on COCO

COCO 2017 has 123,287 domain-agnostic images for object detection and segmentation [21]. We use Faster R-CNN [25] for object detection, with ResNet-50 [26] as the back-end. For this task, $d_b(\cdot,\cdot)$ is the sum of the different loss functions employed by this architecture. We report the mean average precision (mAP) metric computed according to [21], and set $\beta=0.05$. As data preprocessing for training, we use random horizontal flips and generate batches with similar aspect ratios grouped into 3 clusters. Images inside a batch are resized to the minimum size and the bounding boxes are adjusted accordingly when training the computer vision task. When training for reconstruction, the images in a batch are center-cropped to their minimum size. All weights are trained with Adam using a learning rate of $10^{-4}$.

As shown in Fig. 3(b), the performance of both approaches is comparable, with a $-4.14\%$ and a $-2.47\%$ BD-Rate improvement over the lower baseline for the conditional and residual methods, respectively. In terms of BD-Rate, the base representation utilization is 49.24% for the conditional approach and 29.32% for the residual approach. Fig. 3(d) shows the rate-distortion performance of the base task. Compared to semantic segmentation on Cityscapes, the task model is better at reaching the uncompressed task performance. Also, for a similar distortion penalty, this task uses more rate. The rate-distortion curves obtained by the reconstruction task on the COCO dataset lie at rates almost an order of magnitude larger than those from Cityscapes. This can be explained by the simplicity of the content in the Cityscapes images and the larger amount of compression artifacts found in the COCO dataset.

5 Conclusion

We present conditional and residual methods for scalable coding for humans and machines. Our experiments show that the proposed architectures for conditional and residual coding perform similarly and that the rate-distortion performance is within the presented baselines or operational bounds. In addition, the proposed conditional entropy model is able to match the performance of the residual method.

References

  • [1] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in IEEE ICASSP, 2021, pp. 8493–8497.
  • [2] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE TCSVT, vol. 17, no. 9, pp. 1103–1120, 2007.
  • [3] H. Choi and I. V. Bajić, “Scalable image coding for humans and machines,” IEEE TIP, vol. 31, pp. 2739–2754, 2022.
  • [4] B. J. Culpepper and B. A. Olshausen, “Learning transport operators for image manifolds,” in NeurIPS, 2009, pp. 423–431.
  • [5] M. Connor, G. Canal, and C. Rozell, “Variational autoencoder with learned latent structure,” in AISTATS, 2021, vol. 130, pp. 2359–2367.
  • [6] N. Tishby, F. C. N. Pereira, and W. Bialek, “The information bottleneck method,” CoRR, vol. physics/0004057, 2000.
  • [7] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [8] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in ICLR, 2018.
  • [9] D. Minnen, J. Ballé, and G. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in NeurIPS, 2018, pp. 10794–10803.
  • [10] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in IEEE CVPR, 2020, pp. 7936–7945.
  • [11] D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y. Wang, “ELIC: efficient learned image compression with unevenly grouped space-channel contextual adaptive coding,” in IEEE CVPR, 2022, pp. 5708–5717.
  • [12] H. Choi and I. V. Bajić, “Latent-space scalability for multi-task collaborative intelligence,” in IEEE ICIP, 2021, pp. 3562–3566.
  • [13] E. Ozyilkan, M. Ulhaq, H. Choi, and F. Racape, “Learned disentangled latent representations for scalable image coding for humans and machines,” arXiv 2301.04183, 2023.
  • [14] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neural inter-frame compression for video coding,” in IEEE ICCV, 2019, pp. 6420–6428.
  • [15] T. Ladune, P. Philippe, W. Hamidouche, L. Zhang, and O. Déforges, “Conditional coding for flexible learned video compression,” in ICLR, 2021.
  • [16] F. Brand, J. Seiler, and A. Kaup, “On benefits and challenges of conditional interframe video coding in light of information theory,” in PCS, 2022, pp. 289–293.
  • [17] H. Hadizadeh and I. V. Bajić, “LCCM-VC: Learned conditional coding modes for video coding,” arXiv 2210.15883, 2022.
  • [18] D. Minnen and S. Singh, “Channel-wise autoregressive entropy models for learned image compression,” in IEEE ICIP, 2020, pp. 3339–3343.
  • [19] D. He, Y. Zheng, B. Sun, Y. Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” in IEEE CVPR, 2021, pp. 14771–14780.
  • [20] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in IEEE CVPR, 2016, pp. 3213–3223.
  • [21] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV, 2014, pp. 740–755.
  • [22] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” CoRR, vol. abs/1706.05587, 2017.
  • [23] A. Howard, R. Pang, H. Adam, Q. V. Le, M. Sandler, B. Chen, W. Wang, L.-C. Chen, M. Tan, G. Chu, V. Vasudevan, and Y. Zhu, “Searching for MobileNetV3,” in IEEE ICCV, 2019, pp. 1314–1324.
  • [24] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” ITU-T SC16/Q6 VCEG-M33, 2001.
  • [25] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE TPAMI, vol. 39, no. 6, pp. 1137–1149, 2017.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016, pp. 770–778.