LCCM-VC: Learned Conditional Coding Modes for Video Compression
Abstract
End-to-end learning-based video compression has made steady progress over the last several years. However, unlike learning-based image coding, which has already surpassed its handcrafted counterparts, the progress on learning-based video coding has been slower. In this paper, we present learned conditional coding modes for video coding (LCCM-VC), a video coding model that competes favorably against the HEVC reference software implementation. Our model utilizes conditional – rather than residual – coding, and introduces additional coding modes to improve compression performance. The compression efficiency is especially good in the high-quality/high-bitrate range, which is important for broadcast and video-on-demand streaming applications. The implementation of LCCM-VC is available at https://github.com/hadihdz/lccm_vc
Index Terms— End-to-end learned video coding, conditional coding, augmented normalizing flows, autoencoders
1 Introduction
With the rapid advancement of deep learning technologies, various end-to-end learned image/video codecs have been developed [1, 2, 3] to rival their handcrafted counterparts such as JPEG, High Efficiency Video Coding (HEVC) [4], and Versatile Video Coding (VVC) [5]. For instance, in the seminal work by Ballé et al. [1], a variational autoencoder (VAE) was used to construct an end-to-end learned image compression system based on a context-adaptive entropy model. This model incorporates a hyperprior as side information to effectively capture dependencies in the latent representation, thereby improving entropy modeling. Many follow-up VAE-based models were then developed to further improve compression performance [2, 6, 3]. A popular one by Cheng et al. [3] used discretized Gaussian mixture likelihoods to parameterize the latent distribution for entropy modeling, achieving high rate-distortion (RD) performance. In fact, the results in [3] show that this model achieves superior performance to JPEG, JPEG2000, and HEVC (Intra) on both PSNR and MS-SSIM, and comparable performance to VVC (Intra).
Although VAEs have proven effective for image compression, their ability to provide a wide range of bitrates and reconstruction qualities has been called into question [7]. To address this issue, an image compression method based on normalizing flows was proposed in [7]. Using augmented normalizing flows (ANF), Ho et al. [8] developed ANF for image compression (ANFIC), which combines VAEs and normalizing flows to achieve state-of-the-art image compression performance, surpassing even [3].
Building on the success of learned image compression, learned video compression is catching up quickly. Lu et al. [9] presented deep video compression (DVC) as the first end-to-end learned video codec based on temporal predictive coding. Agustsson et al. [10] proposed an end-to-end video coding model based on a learning-based motion compensation framework in which a warped frame produced by a learned flow map is used as a predictor for coding the current video frame. Liu et al. [11] used feature-domain warping in a coarse-to-fine manner for video compression. Hu et al. [12] employed deformable convolutions for feature warping.

Most of the existing video codecs rely on residual coding. However, Ladune et al. [14, 15] argued that conditional coding relative to a predictor is more efficient than residual coding using the same predictor. Building on this idea, Ho et al. [16] proposed conditional augmented normalizing flows for video coding (CANF-VC), which achieves very strong performance among learned video codecs. CANF-VC uses conditional coding for both motion and inter-frame coding.
Here we extend these ideas further. We provide a more comprehensive theoretical justification for conditional coding relative to multiple predictors/modes, and construct a codec called Learned Conditional Coding Modes for Video Compression (LCCM-VC), which outperforms representative learning-based codecs on the commonly used test sets. The proposed system is described in Section 2. Experiments are presented in Section 3, followed by conclusions in Section 4.
2 Proposed Method
2.1 Motivation
Let $X$ and $Y$ be two random variables and let $Z = X - Y$. Then

$$H(X \mid Y) \overset{(a)}{=} H(Z \mid Y) \overset{(b)}{\leq} H(Z), \qquad (1)$$

where $H(\cdot)$ is the entropy, $H(\cdot \mid \cdot)$ is the conditional entropy, (a) follows from the fact that given $Y$, the only uncertainty in $X$ is due to $Z$, and (b) follows from the fact that conditioning does not increase entropy [17].

Now consider the following Markov chain: $X \to Y \to f(Y)$, where $f(\cdot)$ is an arbitrary function. By the data processing inequality [17], we have $I(X;Y) \geq I(X;f(Y))$, where $I(\cdot\,;\cdot)$ is the mutual information. Expanding the two mutual informations as $I(X;Y) = H(X) - H(X \mid Y)$ and $I(X;f(Y)) = H(X) - H(X \mid f(Y))$, and applying the data processing inequality, we conclude

$$H(X \mid Y) \leq H(X \mid f(Y)). \qquad (2)$$

In video compression, coding modes are constructed via predictors; for example, inter coding modes use other frames to predict the current frame, while intra coding modes use information from the same frame for prediction. Let $X$ be the current frame, $\mathcal{Y} = \{Y_1, Y_2, \dots, Y_n\}$ be a set of candidate predictors, and $f(\mathcal{Y})$ be a predictor for $X$ constructed from $\mathcal{Y}$. Function $f$ could, for example, use different combinations of $Y_1, \dots, Y_n$ in different regions of the frame. Further, let $f^*(\mathcal{Y})$ be an optimal predictor for $X$ that minimizes the conditional entropy $H(X \mid f(\mathcal{Y}))$. Then, based on (1) and (2),

$$H(X \mid \mathcal{Y}) \overset{(a)}{\leq} H\big(X \mid f^*(\mathcal{Y})\big) \overset{(b)}{\leq} H\big(X \mid f(\mathcal{Y})\big) \overset{(c)}{\leq} H\big(X - f(\mathcal{Y})\big), \qquad (3)$$

where (a) follows from (2), (b) follows from the fact that $f^*$ is the optimal predictor, and (c) follows from (1). Also note that if $m \geq n$, then

$$H\big(X \mid f^*(Y_1, \dots, Y_m)\big) \leq H\big(X \mid f^*(Y_1, \dots, Y_n)\big), \qquad (4)$$

because an optimal predictor with more candidates ($m$) can, at the very least, choose to ignore $m - n$ of them and achieve the same performance as the predictor with $n$ candidates.
Our proposed codec is built on the idea shown in (4). By generating a large number of candidate predictors (modes), we want to minimize the bitrate needed for coding the current frame conditioned on an optimal combination of these modes. Note that the theory promises even better performance via multi-conditional coding: inequality (a) in (3). However, this requires estimating conditional probabilities in very high-dimensional spaces, and is not pursued in this paper.
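As a sanity check of inequality (1), the following toy example (our own illustration, not part of the original paper) numerically compares the conditional entropy $H(X \mid Y)$ with the residual entropy $H(X - Y)$ for a small, arbitrarily chosen joint distribution: conditional coding never requires more bits than residual coding with the same predictor.

```python
import numpy as np

# Toy joint pmf p(x, y) over X in {0,1,2} and Y in {0,1,2} (hypothetical values,
# chosen only to illustrate inequality (1): H(X|Y) <= H(X - Y)).
p_xy = np.array([[0.20, 0.05, 0.00],
                 [0.05, 0.20, 0.05],
                 [0.00, 0.10, 0.35]])

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Conditional entropy H(X|Y) = H(X,Y) - H(Y).
H_xy = entropy(p_xy.flatten())
H_y = entropy(p_xy.sum(axis=0))          # marginal of Y (sum over x)
H_x_given_y = H_xy - H_y

# Entropy of the residual Z = X - Y, marginalized over the joint pmf.
residual_pmf = {}
for x in range(3):
    for y in range(3):
        residual_pmf[x - y] = residual_pmf.get(x - y, 0.0) + p_xy[x, y]
H_residual = entropy(np.array(list(residual_pmf.values())))

print(f"H(X|Y) = {H_x_given_y:.3f} bits")   # conditional coding
print(f"H(X-Y) = {H_residual:.3f} bits")    # residual coding
assert H_x_given_y <= H_residual + 1e-12
```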
2.2 Codec description
Fig. 1 depicts the structure of our proposed video compression system. It consists of three main components: 1) motion coder, 2) mode generator, and 3) inter-frame coder. The exact functionality of each of these components is described below.
Motion coder: Given the current frame $x_t$ and its reconstructed reference frame $\hat{x}_{t-1}$, we first feed them to a learned optical flow estimation network such as PWC-Net [13] to obtain a motion flow map $f_t$. The obtained flow is then encoded by the encoder (AE) of the HyperPrior-based coder from [1], and the resulting motion bitstream is transmitted to the decoder. At the decoder side, the transmitted flow map is reconstructed by the decoder (AD) of the HyperPrior coder to obtain $\hat{f}_t$. Then, $\hat{x}_{t-1}$ is warped by $\hat{f}_t$ using bilinear sampling [16] to obtain a motion-compensated frame $x_c$.
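To illustrate the warping step, the sketch below (an assumed implementation detail, not the authors' released code) performs backward warping of the reference frame with a dense flow map via bilinear sampling, using PyTorch's grid_sample.

```python
import torch
import torch.nn.functional as F

def warp(ref: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Bilinearly warp `ref` (N,C,H,W) towards the current frame using `flow` (N,2,H,W).

    flow[:, 0] holds horizontal displacements in pixels, flow[:, 1] vertical ones.
    This is a generic backward-warping sketch; the actual codec may differ in details.
    """
    n, _, h, w = ref.shape
    # Base sampling grid with integer pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=ref.device, dtype=ref.dtype),
                            torch.arange(w, device=ref.device, dtype=ref.dtype),
                            indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # (N,H,W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)   # (N,H,W,2)
    return F.grid_sample(ref, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Example: motion-compensate a reference frame with a decoded flow map.
x_prev_hat = torch.rand(1, 3, 64, 64)   # reconstructed reference frame
f_hat = torch.zeros(1, 2, 64, 64)       # decoded flow (zero motion here)
x_c = warp(x_prev_hat, f_hat)           # motion-compensated frame
```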

The above-described motion coder is only used for the first P-frame in each group of pictures (GOP). For the subsequent P-frames, we use the CANF-based motion coder shown in Fig. 2. Here, the extrapolation network from CANF-VC [16] is used to extrapolate a flow map $f_{\text{ext}}$ from the three previously decoded frames $\hat{x}_{t-1}$, $\hat{x}_{t-2}$, $\hat{x}_{t-3}$ and two decoded flow maps $\hat{f}_{t-1}$, $\hat{f}_{t-2}$. The current flow map $f_t$ is then coded conditionally relative to $f_{\text{ext}}$. I-frames are coded using ANFIC [8].
Mode generator: The goal of the mode generator is to produce additional coding modes to operationalize (4). For this purpose, the previous reconstructed frame $\hat{x}_{t-1}$, the motion-compensated frame $x_c$, and the decoded flow map $\hat{f}_t$ are concatenated and fed to the mode generator to produce two weight maps, $\alpha$ and $\beta$, which are of the same size as the input image. The mode generator is implemented as a simple convolutional network whose convolutional layers use 32 kernels of size $3\times 3$ (stride = 1, padding = 1), with LeakyReLU activations in between. Since $\alpha$ and $\beta$ are produced from previously (de)coded data, they can be regenerated at the decoder without any additional bits. These two maps are fed through a sigmoid layer to bound their values between 0 and 1. Then, a frame predictor $\tilde{x}_t$ is generated as:

$$\tilde{x}_t = \alpha \odot \hat{x}_{t-1} + (\mathbf{1} - \alpha) \odot x_c, \qquad (5)$$

where $\odot$ denotes the Hadamard (element-wise) product and $\mathbf{1}$ is the all-ones matrix. Moreover, $\beta$ is multiplied by both $x_t$ and $\tilde{x}_t$, and the resultant two frames, $\beta \odot x_t$ and $\beta \odot \tilde{x}_t$, are fed to the CANF-based conditional inter-frame coder for coding $\beta \odot x_t$ conditioned on $\beta \odot \tilde{x}_t$. Note that $\alpha$, $\beta$, and $\tilde{x}_t$ are available at the decoder.
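A minimal PyTorch sketch of the mode generator and of the predictor in Eq. (5) is given below. The layer widths follow the description above, but details such as the number of layers, the output head, and the LeakyReLU slope are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ModeGenerator(nn.Module):
    """Produces per-pixel weight maps alpha and beta from previously (de)coded data.

    A sketch based on the description in Sec. 2.2; the exact architecture
    (depth, slope, output head) is assumed, not taken from the released code.
    """
    def __init__(self, in_channels: int = 3 + 3 + 2):  # x_prev_hat, x_c, f_hat
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(32, 2, kernel_size=3, stride=1, padding=1),  # 2 maps: alpha, beta
            nn.Sigmoid(),  # bound both maps to (0, 1)
        )

    def forward(self, x_prev_hat, x_c, f_hat):
        maps = self.net(torch.cat([x_prev_hat, x_c, f_hat], dim=1))
        alpha, beta = maps[:, 0:1], maps[:, 1:2]
        return alpha, beta

# Build the frame predictor of Eq. (5) and the conditioned inputs for the inter-frame coder.
x_prev_hat = torch.rand(1, 3, 64, 64)   # previous reconstructed frame
x_c = torch.rand(1, 3, 64, 64)          # motion-compensated frame
f_hat = torch.rand(1, 2, 64, 64)        # decoded flow map
x_t = torch.rand(1, 3, 64, 64)          # current frame (encoder side only)

alpha, beta = ModeGenerator()(x_prev_hat, x_c, f_hat)
x_pred = alpha * x_prev_hat + (1.0 - alpha) * x_c   # Eq. (5)
coder_input, coder_condition = beta * x_t, beta * x_pred
```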
Inter-frame coder: The inter-frame coder codes $\beta \odot x_t$ conditioned on $\beta \odot \tilde{x}_t$ using the inter-frame coder of CANF-VC to obtain $\widehat{\beta \odot x_t}$ at the decoder. The final reconstruction of the current frame, i.e. $\hat{x}_t$, is then obtained by:

$$\hat{x}_t = \widehat{\beta \odot x_t} + (\mathbf{1} - \beta) \odot \tilde{x}_t. \qquad (6)$$

In the limiting case when $\alpha = \mathbf{1}$, the predictor $\tilde{x}_t$ becomes equal to $\hat{x}_{t-1}$, and when $\alpha = \mathbf{0}$, $\tilde{x}_t$ becomes equal to the motion-compensated frame $x_c$. For $\mathbf{0} < \alpha < \mathbf{1}$, the predictor is a pixel-wise mixture of $\hat{x}_{t-1}$ and $x_c$. Hence, $\alpha$ provides the system with more flexibility for choosing the predictor for each pixel within the current frame being coded. Also, for pixels where $\beta = 0$, $\hat{x}_t$ becomes equal to $\tilde{x}_t$, so the inter-frame coder does not need to code anything. This resembles the SKIP mode in conventional coders: depending on the value of $\alpha$, the system can directly copy from $\hat{x}_{t-1}$, from $x_c$, or from a mixture of these two, to obtain $\hat{x}_t$. When $\beta = \mathbf{1}$, only the inter-frame coder is used to obtain $\hat{x}_t$. In the limiting case when $\alpha = \mathbf{0}$ and $\beta = \mathbf{1}$, the proposed method reduces to CANF-VC [16].
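The following self-contained snippet illustrates the reconstruction of Eq. (6) and the SKIP-like behaviour discussed above, using randomly generated tensors and an identity stand-in for the (lossy) CANF-based inter-frame coder; it is an illustration of the data flow, not the actual codec.

```python
import torch

# Hypothetical tensors standing in for the quantities defined earlier.
x_prev_hat = torch.rand(1, 3, 64, 64)
x_c = torch.rand(1, 3, 64, 64)
x_t = torch.rand(1, 3, 64, 64)
alpha = torch.rand(1, 1, 64, 64)
beta = torch.zeros(1, 1, 64, 64)                    # beta = 0: SKIP-like case
x_pred = alpha * x_prev_hat + (1.0 - alpha) * x_c   # Eq. (5)

# Stand-in for the CANF-based inter-frame coder, used only to illustrate Eq. (6);
# the real coder is lossy and conditioned on beta * x_pred.
def inter_frame_codec(conditioned_input, condition):
    return conditioned_input

x_t_hat = inter_frame_codec(beta * x_t, beta * x_pred) + (1.0 - beta) * x_pred  # Eq. (6)

# With beta = 0 everywhere, the inter-frame coder contributes nothing and the
# reconstruction is copied directly from the predictor (SKIP-like behaviour).
assert torch.allclose(x_t_hat, x_pred)
```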
Note that a somewhat similar approach was proposed in [14]. However, [14] used only one weight map, which is similar to our $\alpha$ map, and that map was coded and transmitted to the decoder. In our proposed system, two maps, $\alpha$ and $\beta$, are used to create a larger number of modes. Moreover, these two maps are constructed from previously (de)coded information, so they can be regenerated at the decoder without any additional bits to signal the coding modes.

3 Experiments
3.1 Training
We trained the proposed LCCM-VC on the Vimeo-90K septuplet dataset [18], which consists of 91,701 seven-frame sequences with a fixed resolution of 448×256, extracted from 39K video clips. We randomly cropped these sequences into patches and used them for training LCCM-VC with a fixed training GOP. We employed the Adam optimizer [19] with a batch size of 4, and adopted a two-stage training scheme. In the first stage, we froze the CANF-based conditional coders with their pre-trained weights and optimized the remainder of the model for 5 epochs. In the second stage, we trained the entire system end-to-end for 5 more epochs. Four separate models were trained for four different bitrates using the following loss function:

$$\mathcal{L} = \sum_{i} w_i \, \mathcal{L}_i, \qquad (7)$$

where $w_i$ is the weight assigned to the $i$-th training frame and $\mathcal{L}_i$ is the rate-distortion (RD) loss of the $i$-th training frame as defined in [16], which balances rate and distortion via a Lagrange multiplier $\lambda$. Note that [16] used $\sum_i \mathcal{L}_i$ as the training loss, without weighting. In our experiments, we first trained the model with the largest $\lambda$ (highest rate), and all lower-rate models were then initialized from this model.
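The sketch below shows how such a weighted multi-frame RD loss can be assembled, assuming the common form $\mathcal{L}_i = R_i + \lambda D_i$ for the per-frame loss; the frame weights, $\lambda$ value, and rate/distortion numbers are placeholders, not the values used to train LCCM-VC.

```python
import torch

def rd_loss(rate_bpp: torch.Tensor, distortion_mse: torch.Tensor, lam: float) -> torch.Tensor:
    """Per-frame rate-distortion loss L_i = R_i + lambda * D_i (assumed form)."""
    return rate_bpp + lam * distortion_mse

def training_loss(per_frame_rates, per_frame_mses, weights, lam):
    """Weighted multi-frame loss of Eq. (7): L = sum_i w_i * L_i."""
    return sum(w * rd_loss(r, d, lam)
               for w, r, d in zip(weights, per_frame_rates, per_frame_mses))

# Placeholder numbers for a 5-frame training GOP (illustrative only).
rates = [torch.tensor(x) for x in (0.12, 0.09, 0.08, 0.08, 0.07)]      # bits per pixel
mses = [torch.tensor(x) for x in (1.1e-3, 1.3e-3, 1.4e-3, 1.4e-3, 1.5e-3)]
weights = [1.0] * 5          # hypothetical frame weights w_i
lam = 2048                   # hypothetical rate point (larger lambda -> higher rate)
loss = training_loss(rates, mses, weights, lam)
```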
3.2 Evaluation methodology
We evaluate the performance of LCCM-VC on three datasets commonly used in learning-based video coding: UVG [20] (7 sequences), MCL-JCV [21] (30 sequences), and HEVC Class B [22] (5 sequences). Following the common test protocol used in the recent literature [16], we encoded only the first 96 frames of each test video, with a GOP size of 32.
We used the following video codecs as benchmarks: x265 (‘very slow’ preset) [23], the HEVC Test Model (HM 16.22) with the LDP profile [24], M-LVC [25], DCVC [26], and CANF-VC [16]. Since the quality of the I-frames plays a significant role in the RD performance of video codecs, for a fair comparison we used ANFIC [8] as the I-frame coder for all learned codecs in the experiments. Note that ANFIC achieves state-of-the-art performance for static image coding [8].
Following existing practice in the learned video coding literature [16, 25, 26], bitrates were measured in bits per pixel (BPP) and reconstruction quality was measured by both RGB-PSNR and RGB-MS-SSIM. The RD performance was then summarized using BD-Rate [27].
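For reference, one common way of computing BD-Rate [27] is sketched below (a generic implementation, not necessarily the script used for this paper): log-rate is fit as a cubic polynomial of quality for the anchor and the test codec, and the average horizontal gap between the two fits over the overlapping quality range gives the average bitrate difference.

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Average bitrate difference (%) of the test codec w.r.t. the anchor.

    Generic BD-Rate [27] sketch: fit log10(rate) as a cubic polynomial of quality,
    integrate both fits over the overlapping quality range, and convert the mean
    log-rate gap back to a percentage. Negative values mean bitrate savings.
    """
    lr_a, lr_t = np.log10(rates_anchor), np.log10(rates_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    avg_diff = (int_t - int_a) / (hi - lo)          # mean log10 rate difference
    return (10 ** avg_diff - 1) * 100.0

# Hypothetical RD points (BPP, PSNR in dB) for an anchor and a test codec.
anchor_bpp, anchor_psnr = [0.05, 0.10, 0.20, 0.40], [33.0, 35.5, 38.0, 40.5]
test_bpp, test_psnr = [0.04, 0.08, 0.17, 0.35], [33.2, 35.8, 38.3, 40.8]
print(f"BD-Rate: {bd_rate(anchor_bpp, anchor_psnr, test_bpp, test_psnr):.1f}%")
```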
3.3 Results
In Fig. 3, we plot the RGB-PSNR vs. BPP (top) and RGB-MS-SSIM vs. BPP (bottom) curves of the various codecs on the three datasets. Notably, both LCCM-VC and CANF-VC achieve better performance than HM 16.22 at higher bitrates/qualities, whereas HM 16.22 has a slight advantage at lower bitrates; all three codecs offer comparable performance at medium bitrates. Tables 1 and 2 show the BD-Rate (%) relative to the x265 anchor using RGB-PSNR and RGB-MS-SSIM, respectively, with negative values indicating an average bitrate reduction (i.e., coding gain) relative to the anchor. The best result in each row is shown in bold, and the second-best result in italics. As seen in the tables, LCCM-VC achieves the best RGB-PSNR results among the learned codecs, and the best RGB-MS-SSIM results overall.
It should be mentioned that choosing x265 as the anchor – in keeping with established practice in learned video coding – has the consequence of excluding the high-quality portions of the RD curves from the BD-Rate computation. Yet, high quality is where LCCM-VC (and CANF-VC) have the biggest advantage over HM 16.22, as seen in Fig. 3. If we instead choose HM 16.22 as the anchor, LCCM-VC achieves BD-Rates of –2.8%, +3.4%, and –1.1% on the three datasets (using RGB-PSNR), so it beats HM 16.22 on two out of three datasets even in terms of RGB-PSNR, with the gains concentrated in the high-quality region.
Table 1: BD-Rate (%) relative to the x265 anchor, computed using RGB-PSNR. The best result in each row is in bold, the second best in italics.

| Dataset | HM 16.22 | DCVC | M-LVC | CANF-VC | LCCM-VC |
|---|---|---|---|---|---|
| HEVC-B | **–32.1** | –10.5 | –9.7 | –27.7 | *–31.7* |
| UVG | **–41.6** | –16.3 | –12.1 | –35.9 | *–36.6* |
| MCL-JCV | **–38.6** | –21.3 | –5.3 | –32.0 | *–35.6* |
Table 2: BD-Rate (%) relative to the x265 anchor, computed using RGB-MS-SSIM. The best result in each row is in bold, the second best in italics.

| Dataset | HM 16.22 | DCVC | CANF-VC | LCCM-VC |
|---|---|---|---|---|
| HEVC-B | *–31.0* | –27.0 | –27.7 | **–39.0** |
| UVG | –34.3 | –33.2 | *–35.0* | **–42.7** |
| MCL-JCV | –32.0 | –36.1 | *–36.9* | **–49.7** |
4 Conclusions
In this paper, we proposed learned conditional coding modes for video coding (LCCM-VC), an end-to-end learned video codec that achieves excellent results among learning-based codecs. We also provided a theoretical justification for the advantages of conditional coding relative to multiple coding modes. LCCM-VC outperforms the other benchmark codecs on three commonly used test datasets in terms of RGB-MS-SSIM, and is competitive with HM 16.22 even in terms of RGB-PSNR.
References
- [1] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Intl. Conf. on Learning Representations (ICLR), 2018, pp. 1–23.
- [2] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, vol. 31.
- [3] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized Gaussian mixture likelihoods and attention modules,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [4] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits and Systems for Video Technology, vol. 22, no. 12, 2012.
- [5] B. Bross, Y. K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J. R. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Trans. Circuits and Systems for Video Technology, vol. 31, no. 10, 2021.
- [6] J. Lee, S. Cho, and S. K. Beack, “Context-adaptive entropy model for end-to-end optimized image compression,” in Intl. Conf. on Learning Representations (ICLR), 2019.
- [7] L. Helminger, A. Djelouah, M. Gross, and C. Schroers, “Lossy image compression with normalizing flows,” in Intl. Conf. on Learning Representations (ICLR), 2021.
- [8] Y. H. Ho, C. C. Chan, W. H. Peng, H. M. Hang, and M. Domanski, “ANFIC: Image compression using augmented normalizing flows,” IEEE Open Journal of Circuits and Systems, vol. 2, pp. 613–626, 2021.
- [9] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11006–11015.
- [10] E. Agustsson, D. Minnen, N. Johnston, J. Ballé, S. J. Hwang, and G. Toderici, “Scale-space flow for end-to-end optimized video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8503–8512.
- [11] H. Liu, M. Lu, Z. Ma, F. Wang, Z. Xie, X. Cao, and Y. Wang, “Neural video coding using multiscale motion compensation and spatiotemporal context model,” IEEE Trans. Circuits and Systems for Video Technology, 2020.
- [12] Z. Hu, G. Lu, and D. Xu, “FVC: A new framework towards deep video compression in feature space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1502–1511.
- [13] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
- [14] T. Ladune, P. Philippe, W. Hamidouche, L. Zhang, and O. Déforges, “Optical flow and mode selection for learning-based video coding,” in IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), 2020.
- [15] T. Ladune, P. Philippe, W. Hamidouche, L. Zhang, and O. Déforges, “Conditional coding for flexible learned video compression,” in Neural Compression: From Information Theory to Applications – Workshop @ ICLR 2021, 2021.
- [16] Y. H. Ho, C. P. Chang, P. Y. Chen, A. Gnutti, and W. H. Peng, “CANF-VC: Conditional augmented normalizing flows for video compression,” European Conference on Computer Vision, 2022.
- [17] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 2nd edition, 2006.
- [18] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
- [19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference for Learning Representations, 2015.
- [20] A. Mercat, M. Viitanen, and J. Vanne, “UVG dataset: 50/120fps 4K sequences for video codec analysis and development,” in Proceedings of the 11th ACM Multimedia Systems Conference, 2020, pp. 297–302.
- [21] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, “MCL-JCV: a JND-based H.264/AVC video quality assessment dataset,” in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 1509–1513.
- [22] F. Bossen, “Common test conditions and software reference configurations,” JCT-VC document JCTVC-L1100, 12th meeting, 2013.
- [23] x265, “An open-source HEVC encoder,” https://x265.readthedocs.io/en/master/, 2022-03-10.
- [24] HM, “Reference software for HEVC,” https://vcgit.hhi.fraunhofer.de/Zhu/HM/blob/HM-16.22/cfg/encoder_lowdelay_P_main.cfg, 2022-03-10.
- [25] J. Lin, D. Liu, H. Li, and F. Wu, “M-LVC: multiple frames prediction for learned video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3546–3554.
- [26] J. Li, B. Li, and Y. Lu, “Deep contextual video compression,” in Advances in Neural Information Processing Systems, 2021.
- [27] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” Apr. 2001, VCEG-M33.