An Efficient QP Variable Convolutional Neural Network Based In-loop Filter for Intra Coding

Zhijie Huang Xiaopeng Guo Mingyu Shang Jie Gao and Jun Sun^∗
Wangxuan Institute of Computer Technology Peking University Beijing, 100871, China {zhijiehuang,jsun}@pku.edu.cn ^∗Corresponding author

Abstract

In this paper, a novel QP variable convolutional neural network based in-loop filter is proposed for VVC intra coding. To avoid training and deploying multiple networks, we develop an efficient QP attention module (QPAM) which can capture compression noise levels for different QPs and emphasize meaningful features along channel dimension. Then we embed QPAM into the residual block, and based on it, we design a network architecture that is equipped with controllability for different QPs. To make the proposed model focus more on examples that have more compression artifacts or is hard to restore, a focal mean square error (MSE) loss function is employed to fine tune the network. Experimental results show that our approach achieves 4.03% BD-Rate saving on average for all intra configuration, which is even better than QP-separate CNN models while having less model parameters.

1 1. Introduction

In-loop filtering is an essential module in video coding, which can not only improve the quality of current frames directly by reducing the compression artifacts but also provide high-quality reference frames for succeeding pictures. In the latest video coding standard Versatile Video Coding (VVC) [1], four in-loop filtering steps, namely a luma mapping with chroma scaling (LMCS) process [2], followed by a deblocking filter (DBF) [3], an SAO filter [4] and an adaptive loop filter (ALF) [5] are applied to the reconstructed samples. The DBF and SAO are similar to that of the HEVC [6] standard, whereas LMCS and ALF are newly adopted in VVC.

Besides the built-in in-loop filters in video coding, various convolutional neural network (CNN) based in-loop filters have been proposed in recent years. In [7], a very deep recursive residual CNN (RRCNN) was developed to recover the reconstructed intra frames. Zhang et al. [8] introduced a deep residual highway CNN (RHCNN) based in-loop filtering in HEVC. Wang et al. [9] designed a dense residual CNN based in-loop filter (DRNLF) for VVC. Typically, since the compression noise levels are distinct for videos compressed with different quantization parameters (QPs), we need to train many CNN models for different QPs. To address this issue, Zhang et al. [8] merged the QPs into several bands, and trained the optimal models for each band, but they still had to train and deploy several networks. Song et al. [10] combined QPs as an input and fed them into the CNN training stage by simply padding the scalar QPs into a matrix with the same size of input frames or patches. However, these QP-combined models are inferior to QP-separate CNN models in terms of rate-distortion (RD) performance. And their flexibility and scalability are not strong.

In this paper, a novel QP variable convolutional neural network based in-loop filter is proposed for VVC intra frames. Specifically, Considering different compression noise levels for different QPs, a QP attention module (QPAM) is developed which assigns different weights to each channel of the input feature map according to the QP value. Compared with other methods, the proposed QPAM has wide applicability and stronger scalability, which can also be applied to adapt to different frame types. Then QPAM is embed into the residual block. Based on it, we design a network architecture that can not only fully utilize residual feature, but has the controllability for different QPs. To further make the proposed model pay more attention to examples that have more compression artifacts or is hard to restore, a focal mean square error (MSE) loss function is employed to fine tune the network. Experimental results verify the efficiency of the proposed QPAM and network architecture, which also outperforms other methods.

2 2. The Proposed QPALF Method

Refer to caption — Figure 1: Overview of the QP attention module (QPAM). $\otimes$ denotes element-wise product.

QPAM

Inspired by channel attention module in [11], we propose a QPAM to avoid training and deploying multiple networks. Unlike channel attention module which extracts the attention map from the input feature map, our QP attention module is controlled by the QP value. The overview of the proposed QPAM is illustrated in Figure 1. Given a feature map $\mathbf{F}\in\mathbb{R}^{H\times W\times C}$ as input, QPAM sequentially infers a 1D QP attention map $\mathbf{M}\in\mathbb{R}^{1\times 1\times C}$ . The attention process can be summarized as:

\mathbf{F^{\prime}}=\mathbf{M}\otimes\mathbf{F}

(1)

where $\otimes$ denotes element-wise multiplication. During multiplication, the attention values are broadcast along the channel dimension. $\mathbf{F^{\prime}}$ is the refined output. The process of generating the QP attention map is as follows: Given a QP value $q\in\Omega=[a,b]$ , since $q$ is an integer, we first map $q$ to a vector $\mathbf{v}_{\Omega}(q)\in\mathbb{R}^{m\times 1}$ by one-hot encoding. $m$ is the length of $|\Omega|$ . The QP attention map $\mathbf{M}$ is calculated by:

\begin{array}[]{l}\mathbf{M^{\prime}}=\sigma(U\mathbf{v}_{\Omega}(q))\\ \mathbf{M}=\text{reshape}(\mathbf{M^{\prime}})\end{array}

(2)

where $U\in\mathbb{R}^{C\times m}$ is a weight matrix and $\sigma(x)=log(1+e^{x})$ . From the process we can see that the QPAM assigns different weights to each channel of the input feature map according to the QP value, so that the module can capture compression noise levels among different QPs. Meanwhile, the module can also emphasize meaningful features along channel axes through this process. Moreover, compared with other methods, the proposed QPAM has stronger scalability, which can also be easily applied to other discrete variables, e.g. frame types.

QPALF

Architecture. In figure 2, we present the overview of the architecture of our QPALF network. This is also one of the popular architecture used by many other methods [8, 7, 9], which usually consists of three parts: the head part, the backbone part and the reconstruction part. The head part is responsible for initial feature extraction with only one convolutional layer followed by an activate function. Given a compressed input $X$ , we can get a shallow feature $F_{0}$ through this layer:

F_{0}=\mathcal{F}(X)

(3)

The backbone part is the key component of the network, which is also the most distinct part of various networks. Here it makes up of $D$ cascaded residual feature aggregation modules (RFA). The backbone part receives the feature $F_{0}$ as input and sends the extracted global feature $F$ to the reconstruction part, which can be formulated as:

F_{D}=\mathcal{R}_{D}(F_{D-1})=\mathcal{R}_{D}\left(\mathcal{R}_{D-1}\left(\ldots\left(\mathcal{R}_{1}\left(F_{0}\right)\right)\ldots\right)\right)

(4)

F=F_{0}+\mathcal{R}([F_{1},\ldots,F_{d},\ldots,F_{D}])

(5)

where $\mathcal{R}_{d}$ denotes the d-th RFA module function. $F_{d-1}$ is the input feature of the d-th RFA module function and $F_{d}$ is the corresponding output. The output features of the $D$ RFAs are concatenated together. Then we utilize a long skip connection to extract a global feature $F$ . Finally, the global feature $F$ is transformed through the reconstruction part

\hat{Y}=X+\mathcal{H}(F)

(6)

where $\hat{Y}$ is the output and $\mathcal{H}$ is the reconstruction function, which consists of only one convolutional layer. A global residual learning is usually used to ease the training difficulty in the reconstruction part.

Inspired by [12], we propose a RFA module to make a better use of the local residual features. Figure 3 illustrates the detail of the RFA, which contains three residual blocks and one convolution layer. The input is extracted by three residual blocks at three different levels. Then the outputs from three residual blocks are concatenated and a $1\times 1$ convolution is applied at the end of the RFA for channel dimension reduction. As depicted in Figure 3, the detail of the residual block is the same as that in [13] except that we employ the proposed QPAM after the first convolution layer.

Considering the complexity of the proposed model, we use 6 RFA modules, that is, $D=6$ here. All of $3\times 3$ convolutional layers are followed by a parametric rectified linear unit (PReLU) [14] activation, and the filter number of each $3\times 3$ layer convolutional is 64. The $1\times 1$ convolutions are applied for channel dimension reduction.

Dataset. For the network training, we build a dataset using DIV2K [15] which contains 800 high-resolution images. First, we convert these images to YUV 4:2:0 color format and encode them by VTM6.0 [16] with all-intra (AI) configuration at four QPs, 22, 27, 32, 37. The built-in in-loop filters are all enabled when compressed these images. Then the compressed images are divided into two non-overlapping sets of training (700 images), validation (100 images). To further expand the training dataset, we split the reconstructions to small patches of $64\times 64$ with stride 16. And we remove the patches whose PSNR are more than 50.0 or less than 20.0. When training QP variable models, four training datasets are mixed in a random order.

Loss Function. Let $X$ be the input and $\theta$ be the set of network parameters to be optimized. Our goal is to learn an end-to-end function $F$ for generating a higher quality reconstruction $\hat{Y}=F(X;\theta)$ that is close to the ground truth $Y$ . The loss function is the MSE between $\hat{Y}$ and $Y$ :

L_{rec}=\frac{1}{N}\sum_{i=1}^{N}{\lVert\hat{Y}^{(i)}-Y^{(i)}\rVert^{2}}

(7)

where $N$ is the number of training samples in each batch. In order to train a more robust QP-combined network, we analyse the restoration ability of the network for different QPs. First we train a QPALF network using mixed dataset (the training detail will be presented in the follow). Then we plot the valid image cumulative proportion over the PSNR gain rate for different QPs. And the PSNR gain rate is defined as follows:

R=1-\frac{L_{rec}}{L_{init}}

(8)

where $L_{init}$ is the MSE between $X$ and $Y$ . From Figure 4, we can find: 1) The network has lower PSNR gain rate overall on dataset with smaller QP, especially at QP=22. Obviously smaller QP means less compression artifacts, and we do not expect the network to pay much attention on data with less compression artifacts. 2) The PSNR gain rate of 80% of the valid data is less than 10%, that is, the valid data with low PSNR gain rate accounts for a large proportion. So we expect the network to focus more on data that has low PSNR gain rate. To this end, we propose a focal MSE loss function to fine tune the proposed network, which can be calculated by:

L=\alpha_{q}(1-R)^{\gamma}L_{rec}=\alpha_{q}\frac{L^{1+\gamma}_{rec}}{L_{init}}\\

(9)

where $\alpha_{q}$ is a weighting factor over QP value, and $\gamma$ is focusing parameter. Herein $\alpha_{q}=0.1,0.25,0.3,0.35$ for four QPs respectively and $\gamma=1$ . Table 1 shows the coding performances of three networks over test sequences. QPALF-I, QPALF-II, QPALF-III denote QPALF without fine tuning, QPALF fine tuned by MSE and QPALF fine tuned by focal MSE respectively. As we can see, QPALF-III achieves more bit-rate saving than QPALF-II, which demonstrates the effectiveness of focal MSE.

Table 1: The coding performance of three QPALF networks

Class	BD-Rate(%)
Class	QPALF-I	QPALF-II	QPALF-III
A1	-1.54	-1.73	-2.02
A2	-1.98	-2.18	-2.29
B	-3.14	-3.30	-3.32
C	-4.49	-4.61	-4.73
D	-5.48	-5.60	-5.70
E	-5.43	-5.66	-5.84
Average	-3.75	-3.91	-4.03

Training Detail. The widely adopted deep learning framework Pytorch [17] is utilized to train our models. We use Adam [18] optimization to train these models and a batch size of 64. The learning rate discounts 0.5 every 25 epochs. The training takes 100 epochs in total. For QP-separate models, we first train the model for QP=37 and then use it to initialize the parameters of the networks with smaller QP. The initial learning rate is $10^{-4}$ for QP=37 and $10^{-5}$ for other QPs. For QP-combined models, the initial learning rate is $10^{-4}$ and the fine-tune process takes 50 epochs with learning rate $10^{-5}$ . All models are trained on NVIDIA Titan X (Pascal) GPUs.

Implementation. We integrate the QPALF into VVC as an additional tool of in-loop filters between DBF and SAO. To get better performance for video coding, a frame level flag is signaled in the bitstream to indicate whether QPALF is enabled for this frame in the decoder. When the reduction of RD cost is greater than 0, the flag will be enabled and the QPALF will be applied for the frame on luma component.

Table 2: The BD-Rate of different models on Y channel under AI configuration

Class	Sequence	BD-Rate(%)
Class	Sequence	RHCNN	DRNLF	QPMLF	QPALF-S	QPALF
A1	Tango2	-0.62	-0.63	-0.78	-0.62	-1.86
	Campfire	-0.82	-1.32	-0.79	-1.42	-2.01
	FoodMarket4	-0.74	-0.09	-0.29	-0.89	-2.20
A2	CatRobot	-1.10	-2.20	-1.97	-2.28	-3.39
	DaylightRoad2	-0.43	0.07	-0.17	1.02	-0.50
	ParkRunning3	-1.01	-1.96	-1.56	-2.04	-2.99
B	RitualDance	-1.88	-4.29	-4.03	-4.85	-6.32
	MarketPlace	-1.34	-2.33	-2.09	-2.64	-3.58
	BasketballDrive	-0.73	-1.63	-1.05	-1.82	-2.84
	BQTerrace	-0.64	-1.38	-1.02	-1.56	-2.06
	Cactus	-0.84	-2.07	-1.71	-1.69	-1.78
C	BasketballDrill	-2.29	-5.43	-4.50	-5.76	-7.48
	BQMall	-1.93	-4.31	-3.73	-4.58	-5.49
	PartyScene	-1.22	-3.01	-2.57	-3.19	-3.62
	RaceHorsesC	-0.81	-1.75	-1.39	-1.81	-2.31
D	BasketballPass	-2.10	-5.24	-4.42	-5.67	-6.76
	BlowingBubbles	-1.59	-3.64	-3.19	-3.87	-4.45
	BQSquare	-1.93	-5.12	-4.40	-5.28	-6.20
	RaceHorses	-2.03	-4.54	-4.21	-4.67	-5.40
E	FourPeople	-2.05	-4.73	-4.08	-5.09	-6.49
	Johnny	-1.63	-3.90	-3.17	-4.12	-5.72
	KristenAndSara	-1.59	-3.95	-3.31	-4.24	-5.31
Average All		-1.54	-2.88	-2.47	-3.05	-4.03

3 3. Experiment

Experimental Setting

In our experiments, all approaches for in-loop filtering are incorporated into the VVC reference software VTM6.0. The Libtorch [17] library is integrated into VTM 6.0 to perform the in-loop filtering with the different models. Four typical QP values are tested, including 22, 27, 32, 37. We use the AI configuration suggested by VVC common test condition (CTC) [19]. The anchor for all experiments is VTM6.0 with all built-in in-loop filters enabled. The coding efficiency is evaluated on standard video sequences from class A1 to class E recommended by JVET. The BD-Rate [20] are referred to measure the coding performance. We only train and apply the models on Y channel, but our approach can be extended to any arbitrary number of channels.

Evaluation on VVC Test Sequences

RD performance. First, we compare our QPALF with VTM baseline, and two CNN based in-loop filters, RHCNN [8] and DRNLF [9]. For a fair comparison, we also train the models using our dataset and integrate the trained model into VTM6.0 between DBF and SAO. The results are displayed in Table 2. It can be seen obviously, our QPALF further improves the coding efficiency, which obtains 4.03% bit-rate saving overall for the luma component on all the test sequences. To further verify the efficiency of the proposed model, we also compare our QPALF with QPMLF and QPALF-D. QPALF-D is the QPALF model trained separately on four QPs. QPMLF is the QPALF without QPAM and use the QP map method [10]. As we can see, compared with QP-separate model QPALF-S, the performance of the QP-combined model QPMLF degrades while our model QPALF achieves even better coding performance. Moreover, the PSNR gain of three models on multiple QPs is also depicted in Figure 5. We can observe that our QPAM obtains the highest PSNR gain over all QPs, which also demonstrates the generalization ability and robustness of the proposed method. (Since the models are trained on only four QPs, {22, 27, 32, 37}, we first map other QP values to the four QPs.)

Subjective evaluation. Figure 6 illustrates the subjective visual quality comparison among all four approaches. It can be observed that the images enhanced by our approach remain less distortion than those by other approaches, e.g., the clearer edge of the basketball net line. In Figure 7, we display the residual map of three methods over VTM baseline. Compared with RHCNN and DRNLF, our method can restore more image texture details.

Complexity. Table 3 shows the average encode/decode complexity increase and parameters of different models on Intel(R) Xeon(R) CPU E5-2697v4 and Titan X (Pascal). All of the neural networks are conducted with GPU acceleration. The complexity increase is calculated by $\Delta T=(\hat{T}-T)/T$ , where $\hat{T}$ is the encode/decode time with integrating the models, $T$ is the original encode/decode time. Our proposed QPALF has less model parameters overall. Compared with QPMLF, our model achieves much better RD performance with little complexity increase.

Table 3: Average complexity increase and parameters of different models

Method	$\Delta$ ET	$\Delta$ DT	#Params
RHCNN	5.43%	10695.9%	$6.79M\times 4$
DRN	3.29%	7808.0%	$671.30k\times 4$
QPMLF	4.41%	8232.2%	$838.98k\times 1$
QPALF	4.70%	8428.6%	$905.22k\times 1$

4 4. Conclusion and Future Work

In this paper, an efficient QP variable CNN based in-loop filter for VVC is proposed. With the proposed QPAM, the QPALF can be adaptive to different QPs while achieving better RD performance. And a focal MSE is introduced to train a more robust model. Experimental results demonstrate that our QPALF can significantly improve the coding efficiency, which outperforms other CNN based methods. Moreover, the proposed QPAM has wide applicability and stronger scalability which can be easily to implement in networks and extend to other types. In our future work, we will extend our model to inter mode and speed up our QPALF.

Acknowledgment

This work was supported by National Natural Foundation of China under contract No. 61671025

5 References

References

[1] S. Liu, B. Bross, and J. Chen, “Versatile Video Coding (Draft 10),” JVET-S2001, Joint Video Experts Team (JVET), Jul. 2020.
[2] T. Lu, F. Pu, P. Yin, S. McCarthy, W. Husak, T. Chen, E. Francois, C. Chevance, F. Hiron, J. Chen, R. Liao, Y. Ye, and J. Luo, “Luma Mapping with Chroma Scaling in Versatile Video Coding,” in 2020 Data Compression Conference (DCC), 2020, pp. 193–202.
[3] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, “HEVC Deblocking Filter,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1746–1754, 2012.
[4] C. Fu, C. Chen, Y. Huang, and S. Lei, “Sample adaptive offset for HEVC,” in 2011 IEEE 13th International Workshop on Multimedia Signal Processing, 2011, pp. 1–5.
[5] C. Tsai, C. Chen, T. Yamakage, I. S. Chong, Y. Huang, C. Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz, and S. Lei, “Adaptive Loop Filtering for Video Coding,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 934–945, 2013.
[6] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[7] S. Zhang, Z. Fan, N. Ling, and M. Jiang, “Recursive Residual Convolutional Neural Network-Based In-Loop Filtering for Intra Frames,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1888–1900, 2020.
[8] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, “Residual Highway Convolutional Neural Networks for in-loop Filtering in HEVC,” IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827–3841, 2018.
[9] Y. Wang, Z. Chen, Y. Li, L. Zhao S. Liu, and X. Li, “Test Results of Dense Residual Convolutional Neural Network Based in-Loop Filter,” in document Rep. JVET-M0508, Marrakech, Morocco, Jan. 2019.
[10] X. Song, J. Yao, L. Zhou, L. Wang, X. Wu, D. Xie, and S. Pu, “A Practical Convolutional Neural Network as Loop Filter for Intra Frame,” in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 1133–1137.
[11] S. Woo, J. Park, J. Lee, and I. Kweon, “CBAM: Convolutional Block Attention Module,” in Computer Vision – ECCV 2018, Cham, 2018, pp. 3–19.
[12] J. Liu, W. Zhang, Y. Tang, J. Tang, and G. Wu, “Residual Feature Aggregation Network for Image Super-Resolution,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2356–2365.
[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[15] E. Agustsson and R. Timofte, “NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1122–1131.
[16] “Versatile video coding Test Model (VTM), 6.0.1,” 2019.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, and Z. Lin, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[18] M. Kingma, P. Diederik, and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv e-prints, p. arXiv:1412.6980, Dec 2014.
[19] K. Suehring and X. Li, “JVET common test conditions and software reference configurations,” in Document JVET-H1010 8th JVET Meeting, Oct. 2017, vol. 22.
[20] G. Bjøntegaard, “Document VCEG-M33: Calculation of average PSNR differences between RD-Curves,” Proceedings of the ITU-T Video Coding Experts Group (VCEG) Thirteenth Meeting, 01 2001.