Conditional Neural Video Coding with Spatial-Temporal Super-Resolution
Abstract
This document is an expanded version of a one-page abstract originally presented at the 2024 Data Compression Conference. It describes our proposed method for the video track of the Challenge on Learned Image Compression (CLIC) 2024. Our scheme follows the typical hybrid coding framework with some novel techniques. Firstly, we adopt the SpyNet network to produce accurate motion vectors for motion estimation. Secondly, we introduce a context mining scheme with conditional frame coding to fully exploit spatial-temporal information. To cope with the low target bitrates specified by CLIC, we integrate spatial-temporal super-resolution modules to improve rate-distortion performance. Our team name is IMCLVC.
1 Introduction
In the video track of CLIC 2024, participants are required to submit a codec to compress a set of 10-second sequences with diverse content and frame rates. The winners are chosen based on human perceptual visual quality. To meet these requirements, we propose an efficient neural video codec optimized for MS-SSIM. In this paper, we briefly describe the design and training details.
2 Related Works
In recent years, the field of learning-based video compression has witnessed remarkable advancements. These methods predominantly follow the conventional hybrid coding architecture while integrating neural networks to enhance components such as motion compensation and context analysis. DVC [1] was the first end-to-end video compression framework; it replaces every component of a traditional video compression framework with neural networks. A notable feature of DVC is its use of an optical flow network for motion estimation, setting a foundational architecture for subsequent variations. Building on this architecture, several variations have emerged, each addressing specific challenges in video compression. SSF [2] introduces a scale parameter to the optical flow, enhancing the network's ability to model uncertainty; this proves particularly effective in scenarios involving disocclusions or rapid motion. FVC [3] conducts the key video compression operations, including motion estimation, motion compression, motion compensation, and residual compression, in the feature space. ELF-VC [4] offers a flexible-rate framework, enabling adjustable rate control within a single model. Incorporating reference modes from traditional video compression, LHBDC [5] integrates bi-directional prediction as well as a hierarchical coding architecture into learning-based video compression. C2F [6] presents a coarse-to-fine framework accompanied by hyperprior-guided mode prediction, which significantly enhances motion compensation and optimizes coding modes. ENVC [7] designs a reference feature pyramid as the prediction source to refine prediction precision, and additionally introduces a weighted prediction mechanism focused on the reconstruction of fine details.
A recent shift in video compression strategies is the growing interest in a new paradigm known as conditional coding, first introduced in DCVC [8]. This approach harnesses contextual information derived from motion compensation to enhance the encoding and decoding processes, moving from explicit residual coding, where the differences (residuals) between frames are encoded, to implicit conditional coding. DCVC-TCM [9] extends this concept by learning multi-scale temporal contexts for more effective information transmission. Subsequent works [10, 11, 12] continue to explore various model architecture designs, further enriching the potential and scope of conditional coding in video compression.
3 Proposed Methods
Spatial-temporal Super-Resolution
To enhance the rate-distortion (RD) performance, particularly at low bitrates, we combine downsampling with spatial-temporal super-resolution, as shown in Fig. 1.
For the ultra-low bitrate scenario of 0.05 Mbps, the input video is downsampled in both the spatial and temporal dimensions by a factor of 4. This drastic downsampling yields a substantially smaller video, which is then encoded and transmitted efficiently.
At the decoder side, we first decode the downsampled video. The frame interpolation method EMA-VFI [13] is then applied to reconstruct the omitted frames, compensating for the temporal information lost during downsampling. To reverse the spatial downsampling, bicubic interpolation is employed.
At the higher bitrate of 0.5 Mbps, only temporal downsampling and interpolation are used, leaving the spatial resolution intact. By doing so, the texture and detail of the original video are better preserved.
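The listing below is a minimal sketch of such a pre-/post-processing wrapper, assuming PyTorch. The function names, the `interpolate_frames` callback (e.g. a wrapper around EMA-VFI), and the factor arguments are illustrative placeholders rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


def preprocess(frames, spatial_factor=4, temporal_factor=4):
    """Drop frames temporally and bicubically downsample the kept ones.

    frames: list of tensors shaped (1, 3, H, W) with values in [0, 1].
    """
    kept = frames[::temporal_factor]  # temporal downsampling
    if spatial_factor > 1:
        kept = [F.interpolate(f, scale_factor=1.0 / spatial_factor,
                              mode="bicubic", align_corners=False)
                for f in kept]        # spatial downsampling
    return kept


def postprocess(decoded, original_size, spatial_factor=4, temporal_factor=4,
                interpolate_frames=None):
    """Invert the preprocessing: bicubic upsampling plus frame interpolation."""
    if spatial_factor > 1:
        decoded = [F.interpolate(f, size=original_size,
                                 mode="bicubic", align_corners=False)
                   for f in decoded]
    if temporal_factor > 1 and interpolate_frames is not None:
        # interpolate_frames(a, b, n) is assumed to synthesize n intermediate
        # frames between two decoded reconstructions (e.g. with EMA-VFI).
        full = []
        for a, b in zip(decoded[:-1], decoded[1:]):
            full.append(a)
            full.extend(interpolate_frames(a, b, temporal_factor - 1))
        full.append(decoded[-1])
        decoded = full
    return decoded
```

For the 0.5 Mbps target, the same wrapper is used with `spatial_factor=1`, so only the temporal path is active.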


Codec Architecture
The proposed end-to-end optimized video codec follows the typical hybrid coding framework with some new techniques. In particular, we adopt the conditional coding framework. As shown in Fig. 2, the main body can be divided into three parts: motion coding, context mining and frame coding (a structural sketch of this pipeline is given after the list below).
1) Motion Coding: The optical flow produced by the pretrained SpyNet [14] is taken as the estimated motion vector, which is then encoded and decoded by a hyperprior-based autoencoder.
2) Context Mining: The rich spatial-temporal context information provided by the decoded frame and motion vector should be fully extracted. We adopt the temporal context mining (TCM) module from DCVC-TCM [9], which uses multi-scale warping to generate multi-scale temporal contexts and to capture spatially and temporally non-uniform motion and texture.
3) Frame Coding: The spatial-temporal context feature generated by the context mining module is introduced as the conditional information for frame coding. Specifically, the contextual encoder and contextual decoder take in the conditional features at different scales to fully utilize the conditional information. Our context model is developed based on the unevenly grouped channel-wise context model in [15], which helps the model strike a balance between complexity and performance.
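The following is a structural sketch of one P-frame coding step under the description above, assuming PyTorch. SpyNet, the hyperprior motion autoencoder, the TCM-style context module, and the contextual encoder/decoder are treated as given modules with hypothetical interfaces; only the data flow between them is illustrated, and the average-pooling feature pyramid is a stand-in for a learned one.

```python
import torch
import torch.nn.functional as F


def warp(feature, flow):
    """Backward-warp a feature map with a flow field shaped (B, 2, H, W)."""
    b, _, h, w = feature.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feature.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # absolute sample positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)           # (B, H, W, 2)
    return F.grid_sample(feature, grid_norm, align_corners=True)


def code_p_frame(x_t, ref_frame, ref_feature, spynet, motion_codec,
                 context_miner, ctx_encoder, ctx_decoder):
    # 1) Motion coding: estimate flow, then compress it with a hyperprior AE.
    flow = spynet(x_t, ref_frame)
    flow_hat, rate_mv = motion_codec(flow)

    # 2) Context mining: multi-scale temporal contexts obtained by warping the
    #    reference feature with the decoded motion (TCM-style).
    contexts = []
    for scale in (1, 2, 4):
        if scale == 1:
            feat, scaled_flow = ref_feature, flow_hat
        else:
            feat = F.avg_pool2d(ref_feature, scale)  # stand-in for a learned feature pyramid
            scaled_flow = F.interpolate(flow_hat, scale_factor=1.0 / scale,
                                        mode="bilinear", align_corners=False) / scale
        contexts.append(warp(feat, scaled_flow))
    contexts = context_miner(contexts)

    # 3) Conditional frame coding: encoder and decoder take the contexts as
    #    conditioning information at their matching scales.
    y, rate_frame = ctx_encoder(x_t, contexts)
    x_hat, feature_hat = ctx_decoder(y, contexts)
    return x_hat, feature_hat, rate_mv + rate_frame
```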
Training Strategy
The training procedure of our codec can be divided into two stages. In the first stage, the loss calculation and model optimization are carried out per frame. The I-frame model is trained first and frozen while the P-frame model is trained. For P-frame training, we also adopt the multi-stage training strategy proposed in DCVC-TCM [9], where the motion codec is trained separately from the frame codec. With this strategy, we pre-train the model at high rates and then fine-tune it to the target rate in the next stage.
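A minimal sketch of this first-stage schedule, assuming PyTorch, is shown below; the `motion_codec` and `frame_codec` attribute names for the two parameter groups of the P-frame model are hypothetical.

```python
import torch


def freeze(module):
    """Stop gradient flow into an already-trained module."""
    module.eval()
    for p in module.parameters():
        p.requires_grad_(False)


def build_stage1_optimizers(i_frame_model, p_frame_model, lr=1e-4):
    """Return the per-sub-stage optimizers: motion codec first, then frame codec."""
    freeze(i_frame_model)  # the I-frame model is trained first and then frozen
    opt_motion = torch.optim.Adam(p_frame_model.motion_codec.parameters(), lr=lr)
    opt_frame = torch.optim.Adam(p_frame_model.frame_codec.parameters(), lr=lr)
    return opt_motion, opt_frame
```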
The second stage is multi-frame co-optimization, in which we jointly optimize the I-frame model and the P-frame model. In our low-delay setting, each P-frame uses the previously reconstructed P-frame or I-frame as its reference. Since the reference frame is reconstructed with some loss at the decoder side, errors accumulate as the picture order count (POC) increases. Jointly optimizing the I-frame and P-frame models makes them aware of this error propagation and mitigates it. Furthermore, we increase the training GoP size as training proceeds, first setting it to 3 and later to 5, so that the model is better equipped to handle the larger GoP sizes used during testing. The overall rate-distortion (RD) optimization target is shown in Eq. 1:
$$\mathcal{L} = \big(R_I + \lambda D_I\big) + \sum_{t=1}^{N-1} \big(R_{P_t} + \lambda D_{P_t}\big) \qquad (1)$$
where $R_I$-$D_I$ and $R_{P_t}$-$D_{P_t}$ represent the rate-distortion terms of the I-frame and the $t$-th P-frame, respectively, and $N$ represents the training GoP size. Since perceptual quality is evaluated in CLIC, we choose MS-SSIM as our distortion objective.
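The sketch below illustrates how the multi-frame loss of Eq. 1 can be accumulated over a training GoP, assuming PyTorch and the pytorch-msssim package; the model interfaces (returned reconstruction, propagated feature, and estimated rate) are illustrative placeholders.

```python
from pytorch_msssim import ms_ssim


def gop_rd_loss(frames, i_frame_model, p_frame_model, lmbda):
    """frames: list of N tensors shaped (B, 3, H, W) in [0, 1]; frames[0] is the I-frame."""
    x_hat, feature, rate = i_frame_model(frames[0])
    loss = rate + lmbda * (1.0 - ms_ssim(x_hat, frames[0], data_range=1.0))
    ref_frame, ref_feature = x_hat, feature
    for x_t in frames[1:]:
        x_hat, ref_feature, rate = p_frame_model(x_t, ref_frame, ref_feature)
        loss = loss + rate + lmbda * (1.0 - ms_ssim(x_hat, x_t, data_range=1.0))
        ref_frame = x_hat  # the lossy reconstruction, not the original, becomes the next reference
    return loss
```

Propagating the lossy reconstruction as the reference is what exposes the models to error accumulation during training, and scheduling the GoP size from 3 to 5 frames lengthens the prediction chain they must cope with.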
4 References
- [1] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao, “Dvc: An end-to-end deep video compression framework,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11006–11015.
- [2] Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Ballé, Sung Jin Hwang, and George Toderici, “Scale-space flow for end-to-end optimized video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8503–8512.
- [3] Zhihao Hu, Guo Lu, and Dong Xu, “Fvc: A new framework towards deep video compression in feature space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1502–1511.
- [4] Oren Rippel, Alexander G Anderson, Kedar Tatwawadi, Sanjay Nair, Craig Lytle, and Lubomir Bourdev, “Elf-vc: Efficient learned flexible-rate video coding,” arXiv preprint arXiv:2104.14335, 2021.
- [5] M Akın Yılmaz and A Murat Tekalp, “End-to-end rate-distortion optimized learned hierarchical bi-directional video compression,” IEEE Transactions on Image Processing, vol. 31, pp. 974–983, 2021.
- [6] Zhihao Hu, Guo Lu, Jinyang Guo, Shan Liu, Wei Jiang, and Dong Xu, “Coarse-to-fine deep video coding with hyperprior-guided mode prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5921–5930.
- [7] Zongyu Guo, Runsen Feng, Zhizheng Zhang, Xin Jin, and Zhibo Chen, “Learning cross-scale weighted prediction for efficient neural video compression,” IEEE Transactions on Image Processing, 2023.
- [8] Jiahao Li, Bin Li, and Yan Lu, “Deep contextual video compression,” Advances in Neural Information Processing Systems, vol. 34, pp. 18114–18125, 2021.
- [9] Xihua Sheng, Jiahao Li, Bin Li, Li Li, Dong Liu, and Yan Lu, “Temporal context mining for learned video compression,” IEEE Transactions on Multimedia, 2022.
- [10] Yung-Han Ho, Chih-Peng Chang, Peng-Yu Chen, Alessandro Gnutti, and Wen-Hsiao Peng, “Canf-vc: Conditional augmented normalizing flows for video compression,” in European Conference on Computer Vision. Springer, 2022, pp. 207–223.
- [11] Linfeng Qi, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu, “Motion information propagation for neural video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6111–6120.
- [12] Jiahao Li, Bin Li, and Yan Lu, “Neural video compression with diverse contexts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22616–22626.
- [13] Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, and Limin Wang, “Extracting motion and appearance via inter-frame attention for efficient video frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5682–5692.
- [14] Anurag Ranjan and Michael J Black, “Optical flow estimation using a spatial pyramid network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4161–4170.
- [15] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang, “Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5718–5727.