A QP-adaptive Mechanism for CNN-based Filter in Video Coding
Abstract
Convolutional neural network (CNN)-based filters have achieved great success in video coding. However, in most previous works, an individual model is needed for each quantization parameter (QP) band. This paper presents a generic method that helps an arbitrary CNN filter handle different levels of quantization noise. We model the quantization-noise problem and implement a feasible solution in CNNs, which introduces the quantization step (Qstep) into the convolution. When the quantization noise increases, the ability of the CNN filter to suppress noise improves accordingly. This method can be used directly to replace the (vanilla) convolution layer in any existing CNN filter. Using only 25% of the parameters, the proposed method achieves better overall performance than using multiple models with the VTM-6.3 anchor. Besides, an additional BD-rate reduction of 0.2% is achieved by our proposed method for the chroma components.
Index Terms— Convolutional Neural Network, In-loop filter, Video Coding, H.266/VVC.
1 Introduction
Quantization [1] in hybrid coding frameworks such as H.265/HEVC [2] and H.266/VVC [3] is a crucial part of lossy compression. However, it also causes severe distortion and artifacts such as ringing (the Gibbs effect). Filters such as deblocking (DB), sample adaptive offset (SAO), and the adaptive loop filter (ALF) have been proposed to alleviate these artifacts. Learning-based filters, especially CNN-based filters [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], have also shown great potential and aroused widespread interest. Previous studies show that different structures and designs, such as serial networks [5, 7, 8, 9, 12, 15] and parallel networks [4, 6, 11, 13, 14], can significantly improve both subjective and objective quality. Liu et al. [5] proposed to use depthwise separable convolution (DSC) [16] in the CNN to reduce complexity. Dai et al. proposed a parallel network structure, VRCNN [6], which uses convolution kernels of different sizes in the same layer to extract features from different receptive fields. In contrast, Wang et al. proposed DCAD [7], a serial structure that stacks 10 convolutional layers and also achieves good performance. In addition, Zhou et al. used ResNet [17] as the backbone of their Tucodec [8] and adopted leaky ReLU [18] instead of ReLU [19] as the activation function.
Although CNN-based in-loop filters have achieved great success, few previous studies have explored their generalization capability across different quantization parameters (QPs), and most works require training a specific model for each QP band. Training and storing a large number of models is impractical for actual filtering due to limited storage resources. Moreover, these CNN filters do not make full use of coding side information, such as the QP of the reconstructed image.
In this paper, we propose a novel method to solve this important but easily neglected problem. Specifically, we model the problem in the frequency domain and obtain a feasible solution that makes a simple filtering model adapt to different levels of quantization noise. By further decomposition, the simplified solution is applied to every convolution layer rather than only the first one, which improves robustness and performance. With the VTM-6.3 anchor, we conduct extensive experiments on four models of different complexity to demonstrate the versatility of the proposed method. Compared with using multiple models, a single model with the proposed method reduces the number of parameters by about three quarters and achieves an extra 0.2% BD-rate reduction on the chroma components.
2 Literature Review and Analysis
To the best of our knowledge, only the approach of using a QP map [20] as an extra input has been proposed to solve this problem, and similar methods [21, 22] have been built on it. The network in [20] controls the filtering strength by exploiting the internal relationship between the QP map and the distortion. To analyze its working principle, we consider the first convolution layer of this model:
$F = W \ast (X \oplus M_{\mathrm{QP}}) + B$    (1)
The notations $X$ and $F$ are the input and the output, $W$ and $B$ are the weights and the biases, $\ast$ and $\cdot$ denote convolution and multiplication, $\oplus$ is the concatenation operation, and $M_{\mathrm{QP}}$ denotes the QP map whose entries all equal QP. By expanding it:
$F = W_{X} \ast X + W_{\mathrm{QP}} \ast M_{\mathrm{QP}} + B = W_{X} \ast X + B'$    (2)
where $W_{X}$ and $W_{\mathrm{QP}}$ are the kernel slices applied to $X$ and to the QP map, respectively, and
$B' = B + \mathrm{QP} \cdot \textstyle\sum W_{\mathrm{QP}}$    (3)
From (3), it can be found that the adaptiveness of [20] is actually achieved by adding a linear function of QP to the bias term (a toy numerical check of this collapse is given below). This method has several drawbacks:
1. A linear model for the bias may not capture the internal relationship between QP and the filtering strength.
2. It may be less effective because it acts only on the bias; it is the weights rather than the biases that dominate a CNN, so building the QP-adaptive mechanism into the weights may be more effective.
3. It lacks robustness and does not fully tap the potential of the QP, since the QP is introduced only at the input layer.
Considering these shortcomings, a better adaptive filtering strategy is designed in this paper.
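To make the collapse in (2)-(3) concrete, the following toy NumPy/SciPy sketch (our own illustration, not code from [20]; all names are ours) checks numerically that convolving a constant QP plane is equivalent, away from the image border, to shifting the bias by QP times the sum of the corresponding kernel weights.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 8))        # toy "reconstructed" block
QP = 37.0
QP_map = np.full_like(X, QP)           # constant QP plane

W_x = rng.standard_normal((3, 3))      # kernel slice applied to X
W_qp = rng.standard_normal((3, 3))     # kernel slice applied to the QP map
b = 0.1                                # original bias

# Two-channel convolution, as in the QP-map approach: conv([X, QP_map]) + b
F_concat = (convolve2d(X, W_x, mode="same") +
            convolve2d(QP_map, W_qp, mode="same") + b)

# Equivalent single-channel convolution with a QP-dependent bias shift
F_bias = convolve2d(X, W_x, mode="same") + (b + QP * W_qp.sum())

# Compare away from the zero-padded border, where the equivalence is exact
center = (slice(2, -2), slice(2, -2))
print(np.allclose(F_concat[center], F_bias[center]))  # True
```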
3 Proposed Method
In this section, the proposed QP-adaptive mechanism is introduced. To begin with, we present the modeling to the problem. Then the implemented solution for CNN is provided.
3.1 Proposed QP-adaptive Mechanism
Given a simple filtering model (we focus on the weight, so the bias is dropped here):
$\hat{Y} = W \ast X$    (4)
where $W$ is the trained convolution kernel, $X$ is the distorted image, and $\hat{Y}$ is the filtered image. It is known that spatial-domain convolution is essentially equivalent to frequency-domain multiplication:
$\mathcal{F}(\hat{Y}) = \mathcal{F}(W) \cdot \mathcal{F}(X)$    (5)
where $\mathcal{F}(\cdot)$ represents the Fourier transform. We assume that this simple model can effectively remove the specific noise in $X$ but cannot handle variable quantization noise, so its output is approximated as the original image $Y$:
$\mathcal{F}(W) \cdot \mathcal{F}(X) \approx \mathcal{F}(Y)$    (6)
As is known, an increase of the coding parameter QP corresponds to added noise whose strength is related to Qstep. Let $n$ denote the noise caused by the change of QP; the distorted image then becomes:
$X' = X + n$    (7)
Therefore, a feasible solution for achieving adaptiveness to various quantization noise is to find a changeable convolution kernel $W'$ that minimizes the loss function between the filtered image $W' \ast X'$ and the original image $Y$. Similarly, (7) in the frequency domain can be expressed as:
$\mathcal{F}(X') = \mathcal{F}(X) + N, \qquad N = \mathcal{F}(n)$    (8)
Here we choose the classical mean square error (MSE) as the loss function:
$L = \mathbb{E}\big[(W' \ast X' - Y)^{2}\big]$    (9)
where $\mathbb{E}[\cdot]$ denotes expectation. Considering (6) and Parseval's theorem for the Fourier transform, the loss can be transformed into:
$L = \mathbb{E}\Big[\big|\mathcal{F}(W')\cdot(\mathcal{F}(X)+N) - \mathcal{F}(W)\cdot\mathcal{F}(X)\big|^{2}\Big]$    (10)
By taking the derivative with respect to $\mathcal{F}(W')$ and setting it to zero, we obtain the solution:
$\mathcal{F}(W') = \mathcal{F}(W)\cdot\dfrac{1}{1 + P_{N}/P_{X}}$    (11)
where $P_{X} = \mathbb{E}\big[|\mathcal{F}(X)|^{2}\big]$ and $P_{N} = \mathbb{E}\big[|N|^{2}\big]$. Here, the first term is the original filter in (5), and the second term is the influence factor that compensates for the increased quantization noise. This solution is similar in form to Wiener deconvolution [23]; the difference lies in the motivation and form. Wiener deconvolution aims to recover the original signal from the distorted signal using priors on the input signal, the noise, and the degradation function, whereas this solution has no concept of a degradation function and aims at making a specific filter adaptive to the changing quantization noise.
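For completeness, a short sketch of the step from (10) to (11) in our notation, assuming the signal spectrum $\mathcal{F}(X)$ and the noise $N$ are uncorrelated: treating $\mathcal{F}(W')$ at each frequency as a free variable and setting the derivative of (10) (with respect to the conjugate $\mathcal{F}(W')^{*}$, since the variables are complex) to zero gives the per-frequency normal equation

$\mathcal{F}(W')\,\big(P_{X} + P_{N}\big) - \mathcal{F}(W)\,P_{X} = 0 \;\;\Rightarrow\;\; \mathcal{F}(W') = \mathcal{F}(W)\cdot\dfrac{P_{X}}{P_{X} + P_{N}},$

which is exactly the form in (11).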
Table 1: BD-rate reduction and parameter counts of the global, separate, and proposed methods (VTM-6.3 anchor).

Models | Global (single model ×1) | Separate (single model ×4) | Proposed (single model ×1)
 | Param. | Y | U | V | Param. | Y | U | V | Param. | Y | U | V
Liu et al. [5] | 12,266 | -0.97% | -1.55% | -2.61% | 12,266×4 | -2.28% | -1.64% | -2.68% | 12,555 | -2.28% | -1.70% | -2.90%
VRCNN [6] | 54,512 | -0.48% | -1.57% | -2.23% | 54,512×4 | -1.88% | -1.58% | -2.34% | 54,673 | -1.85% | -1.62% | -2.46%
DCAD [7] | 296,641 | -2.21% | -2.00% | -3.07% | 296,641×4 | -3.83% | -2.46% | -3.84% | 297,218 | -3.74% | -2.78% | -3.93%
Tucodec [8] | 447,681 | -3.72% | -2.73% | -3.63% | 447,681×4 | -4.49% | -2.72% | -3.88% | 448,514 | -4.54% | -2.95% | -4.21%
[Fig. 1: Schematic diagram of the proposed method, with the influence factor applied to the feature maps of each convolution layer.]
[Fig. 2: PSNR gain over the VTM baseline at different QPs for the global, separate, and proposed methods.]
3.2 Applying QP-adaptive Mechanism to CNN
From the perspective of the frequency domain, the features extracted by a CNN are equivalent to specific selections of the input image at different frequencies, which establishes the link between the CNN and the frequency-domain solution. For example, a Gaussian kernel is a low-pass filter and a Laplacian kernel is a high-pass one. The filter over the entire frequency band can thus be decomposed into different sub-filters, each working in a selected frequency sub-band. With (11), $\mathcal{F}(W')$ can be written as:
$\mathcal{F}(W') = \sum_{i} \mathcal{F}(W_{i}) \cdot \dfrac{1}{1 + P_{N_i}/P_{X_i}}$    (12)
The first term in (12) is equivalent to the convolution kernels in a CNN. Due to the decomposition, the second term, the influence factor of each kernel, can be regarded as acting only in a frequency sub-band. In this sub-band, the signal power $P_{X_i}$ in the influence factor is approximated as a constant, since the strength of the original signal is invariable in the task of adapting to different quantization noises. As for the intensity of the quantization noise $P_{N_i}$, it is proportional to the square of Qstep at all frequencies under the default coding setting, and the decomposed noise follows the same pattern in the selected sub-band:
$P_{N_i} \propto Q_{\mathrm{step}}^{2}$    (13)
We use trainable parameters $\lambda$ to absorb this proportional relationship (together with the constant $P_{X_i}$). Therefore, (12) can be rewritten as follows:
$\mathcal{F}(W') = \sum_{i} \mathcal{F}(W_{i}) \cdot \dfrac{1}{1 + \lambda_{i}\, Q_{\mathrm{step}}^{2}}$    (14)
where $\lambda_{i}$ indicates a specific parameter in the set $\lambda$. If $C$ denotes the number of feature maps of the CNN, attaching one influence factor to every convolution kernel would introduce on the order of $C \times C$ parameters per layer, the same order as the number of kernels. Inspired by DSC [16], which uses depthwise convolution instead of standard convolution, we instead apply the influence factor to the feature maps rather than to the convolution kernels. Fig. 1 shows the schematic diagram of our proposed method. The parameter count thus becomes $C$ per layer, the same order as the number of biases (as in the QP-map method [20]).
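As a concrete illustration, the following Keras sketch (our own hedged implementation; the layer name, variable names, and API details are assumptions, not the authors' released code) scales each feature map by the factor $1/(1+\lambda_{i} Q_{\mathrm{step}}^{2})$ from (14), with one trainable $\lambda$ per channel, and shows how such a layer could follow each convolution:

```python
import tensorflow as tf

class QStepScale(tf.keras.layers.Layer):
    """Sketch of the influence factor in (14): each feature map c is scaled by
    1 / (1 + lambda_c * qstep2), with one trainable lambda per channel."""

    def build(self, input_shape):
        channels = int(input_shape[0][-1])            # C feature maps
        # lambda starts at 0 so the layer is initially an identity, i.e. the
        # original CNN filter; the constraint keeps lambda non-negative.
        self.lam = self.add_weight(name="lambda",
                                   shape=(channels,),
                                   initializer="zeros",
                                   constraint=tf.keras.constraints.NonNeg())
        super().build(input_shape)

    def call(self, inputs):
        features, qstep2 = inputs                     # qstep2 shape: (batch, 1)
        scale = 1.0 / (1.0 + self.lam * qstep2[:, None, None, :])
        return features * scale

# Usage sketch: the scaling follows every convolution, so each layer
# receives the quantization information.
x_in = tf.keras.Input(shape=(None, None, 1))          # reconstructed luma
q_in = tf.keras.Input(shape=(1,))                     # normalized Qstep^2 (Sec. 3.3)
h = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x_in)
h = QStepScale()([h, q_in])
out = tf.keras.layers.Conv2D(1, 3, padding="same")(h)
model = tf.keras.Model([x_in, q_in], out)
```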
3.3 Implementation Detail
The same as in HEVC [1], the relationship between QP and Qstep in VVC can be written as [24]:
$Q_{\mathrm{step}} = 2^{(\mathrm{QP}-4)/6}$    (15)
so the square of Qstep is:
$Q_{\mathrm{step}}^{2} = 2^{(\mathrm{QP}-4)/3}$    (16)
Due to the trainable multiplier $\lambda$ in (14), multiplying every $Q_{\mathrm{step}}^{2}$ by the same constant does not affect the performance of the model. A normalization that divides $Q_{\mathrm{step}}^{2}$ by a fixed constant is therefore adopted, which avoids the gradient vanishing problem caused by large $Q_{\mathrm{step}}^{2}$ values. Besides, the parameter $\lambda$ should be greater than 0 because both $P_{X}$ and $P_{N}$ are greater than 0; when all $\lambda_{i}$ are 0, the proposed model degenerates into the original CNN filter. There are two common ways to enforce this constraint: 1. reparametrizing $\lambda$ as the square of unconstrained trainable parameters; 2. directly truncating $\lambda$ to be non-negative. We adopt the second one in this paper.
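A minimal sketch of the Qstep mapping, assuming the normalization constant is the $Q_{\mathrm{step}}^{2}$ value at QP 37 (our assumption for illustration; the text above only states that a fixed constant is used, and the trainable $\lambda$ absorbs it anyway):

```python
def qstep(qp: float) -> float:
    """Eq. (15): Qstep = 2 ** ((QP - 4) / 6)."""
    return 2.0 ** ((qp - 4) / 6.0)

def qstep2_input(qp: float, qp_ref: float = 37.0) -> float:
    """Normalized Qstep^2 fed to the network.
    Eq. (16) gives Qstep^2 = 2 ** ((QP - 4) / 3); dividing by the constant
    Qstep^2 at qp_ref keeps the input near 1 (qp_ref = 37 is our choice,
    the largest QP used in the experiments)."""
    return 2.0 ** ((qp - qp_ref) / 3.0)

for qp in (22, 27, 32, 37):
    print(qp, qstep(qp), qstep2_input(qp))
# e.g. QP = 37 gives Qstep ~ 45.25 and a normalized Qstep^2 of exactly 1.0
```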
In H.266/VVC with a CNN filter, the filtering order is luma mapping with chroma scaling (LMCS), DB, the CNN filter, SAO, and ALF. Both SAO and ALF need to signal bits for their offsets and coefficients. By putting the CNN-based filter before SAO and ALF, a filtered image of higher quality is passed to SAO and ALF, thereby reducing the number of coded bits. DB, which relies on preset thresholds to perform filtering, might require those thresholds to be modified if the CNN filter were placed before it; therefore, instead of using this strategy, we place the CNN filter after DB.
4 Experiment
In this section, the experimental setting is first introduced. Then we provide the experimental results on coding efficiency and complexity. Finally, comparisons with previous work are provided.
4.1 Experimental Setting
By integrating our method into different models, the BD-rate [25] reduction of the proposed method can be tested in various situations, covering different complexities, activations, and serial or parallel structures. Here we chose four models, namely Liu et al. [5], VRCNN [6], DCAD [7], and Tucodec [8], as the backbones for our experiments. The CNN filter was integrated into the Versatile Video Coding Test Model (VTM) 6.3 [26] and placed between DB and SAO. The DIV2K dataset [27] was used to train and validate all of the mentioned CNN filters: the 900 pictures in DIV2K were divided into 800 for training and 100 for validation. Four QPs (22, 27, 32, and 37) from the common test conditions (CTC) [28] were used to encode these pictures. By cutting the pictures into blocks, we obtained 522,877 training samples and 66,712 validation samples for each QP. With the Keras framework [29] and the Adam optimizer [30], about 40,000 iterations were trained for each QP with batch size 128. In the test phase, the first frames of the HEVC test sequences were used to evaluate the performance of the aforementioned filters. These test sequences do not overlap with the datasets used in the training or validation phase.
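A minimal sketch of this training configuration, assuming a hypothetical data loader and reusing the `model` sketched in Section 3.2; the optimizer settings and epoch count below are placeholders rather than reported values:

```python
import tensorflow as tf

# `load_patch_pairs` is a hypothetical helper standing in for the DIV2K/VTM
# pipeline; it yields (reconstructed block, normalized Qstep^2, original block).
x_rec, q2, y_org = load_patch_pairs(split="train")
x_val, q2_val, y_val = load_patch_pairs(split="val")

model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit([x_rec, q2], y_org,
          validation_data=([x_val, q2_val], y_val),
          batch_size=128, epochs=10)
```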
4.2 Performance Evaluation
From the results in Table 1, the "Global" column, which uses a single model, shows the lowest BD-rate reduction. Using multiple models, shown in the "Separate" column, improves the overall BD-rate reduction of the CNN filter, but the number of required parameters increases fourfold. The "Proposed" column shows that our method enables a single model to achieve excellent performance for all four backbones while adding only a small number of parameters, demonstrating its versatility and flexibility. Its BD-rate reduction is almost the same as that of the separate method on the luma component, and it achieves an extra 0.2% BD-rate reduction on the chroma components. This demonstrates that our method effectively improves the generalization ability of the model: only the luma component is used for training, so the performance on the chroma components depends entirely on the model's generalization ability. Fig. 2 plots the PSNR gains of the methods relative to the VTM baseline. The dotted lines represent the separate method, and the QP in each legend entry indicates the dataset used for training. Each curve peaks at the QP of its training dataset but performs poorly at the other QPs; at lower QPs it may even degrade performance. The blue solid line represents the global method, which obtains an ordinary BD-rate gain at higher QPs but, similarly, hurts the reconstructed image at lower QPs. In contrast, the proposed method delivers significant filtering gains over a wide range of QPs and almost reaches the optimal performance of using multiple separate models. This further demonstrates the effectiveness and versatility of our model.
Table 2: Decoding complexity relative to the VTM-6.3 anchor.

Class | Liu et al. [5] | DCAD [7] | Tucodec [8]
 | Global | Proposed | Global | Proposed | Global | Proposed
A | 345.3% | 353.4% | 602.1% | 612.5% | 683.2% | 694.6%
B | 453.4% | 461.0% | 693.1% | 704.9% | 794.6% | 814.5%
C | 432.2% | 442.7% | 775.4% | 770.1% | 817.3% | 834.0%
D | 585.3% | 627.4% | 1443.2% | 1461.3% | 1463.7% | 1487.9%
E | 555.3% | 563.8% | 1047.2% | 1059.8% | 1148.0% | 1177.0%
Average | 474.3% | 489.6% | 912.2% | 921.7% | 981.4% | 1001.6%
4.3 Complexity Evaluation
In Table 1, the parameter counts of the different methods are shown in the "Param." columns. In addition, Table 2 shows the decoding complexity relative to the VTM anchor under the same test setting as Section 4.2. Although the separate method uses more models than the global method, for the filtering of a given QP both use a single model with the same structure, so their complexity is the same. From Table 2, the decoding complexity of our proposed method increases by only about 2% compared with the global method, so its impact on complexity is minimal. This lays a good foundation for the practical application of our method.
[Fig. 3: Relative PSNR gain of the proposed method and the QP-map method of Song et al. [20] on different backbones.]
4.4 Comparison with Previous Work
The performance comparison between the proposed method and the QP-map method (Song et al. [20]) is shown in Fig. 3. The two methods achieve similar relative PSNR gains with the backbones of Liu et al. [5] and DCAD [7]. For VRCNN [6], Song et al. [20] has a negative impact at lower QPs, whereas our proposed method still achieves a minor gain there. Our method is more robust, probably because it provides the quantization information to every convolution layer, whereas Song et al. [20] only does so for the input layer. This comparison demonstrates the robustness and versatility of our method.
5 Conclusion
In this paper, we present a novel method to improve the adaptability of CNN filters to different QPs. By adding influence factors related to Qstep to the CNN filter, the network can suppress quantization noise as the noise changes. The proposed method achieves excellent performance on previous CNN filters and yields a BD-rate reduction similar to that of using multiple models. Besides, the complexity evaluations of the different trained models show that it brings only a slight increase in complexity and has a promising future for practical applications. Finally, the comparison with previous work shows that our proposed method is more robust and stable. We believe that more efficient methods will emerge in the future based on further design and modeling.
6 Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant 61674041, in part by Alibaba Group through Alibaba Innovative Research (AIR) Program, in part by the STCSM under Grant 16XD1400300, in part by the pioneering project of academy for engineering and technology and Fudan-CIOMP joint fund, in part by the National Natural Science Foundation of China under Grant 61525401, in part by the Program of Shanghai Academic/Technology Research Leader under Grant 16XD1400300, in part by the Innovation Program of Shanghai Municipal Education Commission, in part by JST, PRESTO Grant Number JPMJPR19M5, Japan.
References
- [1] Madhukar Budagavi, Arild Fuldseth, and Gisle Bjøntegaard, “Hevc transform and quantization,” in High Efficiency Video Coding (HEVC), pp. 141–169. Springer, 2014.
- [2] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1649–1668, 2012.
- [3] “H.266,” https://en.wikipedia.org/wiki/Versatile_Video_Coding.
- [4] C. Liu, H. Sun, J. Chen, Z. Cheng, M. Takeuchi, J. Katto, X. Zeng, and Y. Fan, “Dual learning-based video coding with inception dense blocks,” in 2019 Picture Coding Symposium (PCS), Nov 2019, pp. 1–5.
- [5] Chao Liu, Heming Sun, Jiro Katto, Xiaoyang Zeng, and Yibo Fan, “A convolutional neural network-based low complexity filter,” arXiv preprint arXiv:2009.02733, 2020.
- [6] Yuanying Dai, Dong Liu, and Feng Wu, “A convolutional neural network approach for post-processing in hevc intra coding,” in International Conference on Multimedia Modeling. Springer, 2017, pp. 28–39.
- [7] Tingting Wang, Mingjin Chen, and Hongyang Chao, “A novel deep learning-based method of improving coding efficiency from the decoder-end for hevc,” in 2017 Data Compression Conference (DCC). IEEE, 2017, pp. 410–419.
- [8] Lei Zhou, Chunlei Cai, Yue Gao, Sanbao Su, and Junmin Wu, “Variational autoencoder for low bit-rate image compression.,” in CVPR Workshops, 2018, pp. 2617–2620.
- [9] Heming Sun, Chao Liu, Jiro Katto, and Yibo Fan, “An image compression framework with learning-based filter,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 152–153.
- [10] Jan P Klopp, Liang-Gee Chen, and Shao-Yi Chien, “Utilising low complexity cnns to lift non-local redundancies in video coding,” IEEE Transactions on Image Processing, 2020.
- [11] Daowen Li and Lu Yu, “An in-loop filter based on low-complexity cnn using residuals in intra video coding,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
- [12] Xiaoyi He, Qiang Hu, Xiaoyun Zhang, Chongyang Zhang, Weiyao Lin, and Xintong Han, “Enhancing hevc compressed videos with a partition-masked convolutional neural network,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 216–220.
- [13] Yongbing Zhang, Tao Shen, Xiangyang Ji, Yun Zhang, Ruiqin Xiong, and Qionghai Dai, “Residual highway convolutional neural networks for in-loop filtering in hevc,” IEEE Transactions on image processing, vol. 27, no. 8, pp. 3827–3841, 2018.
- [14] Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, Jiaying Liu, Shiliang Pu, and Siwei Ma, “Content-aware convolutional neural network for in-loop filtering in high efficiency video coding,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3343–3356, 2019.
- [15] Chen Li, Li Song, Rong Xie, and Wenjun Zhang, “Cnn based post-processing to improve hevc,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 4577–4580.
- [16] Laurent Sifre and Stéphane Mallat, “Rigid-motion scattering for image classification,” Ph. D. thesis, 2014.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [18] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, 2013, vol. 30, p. 3.
- [19] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
- [20] Xiaodan Song, Jiabao Yao, Lulu Zhou, Li Wang, Xiaoyang Wu, Di Xie, and Shiliang Pu, “A practical convolutional neural network as loop filter for intra frame,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 1133–1137.
- [21] Shufang Zhang, Zenghui Fan, Nam Ling, and Minqiang Jiang, “Recursive residual convolutional neural network-based in-loop filtering for intra frames,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
- [22] Han Zhu, Xiaozhong Xu, and Shan Liu, “Residual convolutional neural network based in-loop filter with intra and inter frames processed respectively for avs3,” in 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2020, pp. 1–6.
- [23] “Wiener deconvolution,” https://en.wikipedia.org/wiki/Wiener_deconvolution.
- [24] Jianle Chen, Yan Ye, and Seung Hwan Kim, “Algorithm description for versatile video coding and test model 6 (VTM 6),” document JVET-O2001, 15th JVET meeting: Gothenburg, SE, July 2019.
- [25] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 2001.
- [26] “Video coding test model,” https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/.
- [27] Eirikur Agustsson and Radu Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
- [28] Frank Bossen, Jill Boyce, X Li, V Seregin, and K Sühring, “Jvet common test conditions and software reference configurations for sdr video,” Joint Video Experts Team (JVET) of ITU-T SG, vol. 16, 2018.
- [29] François Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
- [30] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.