DCNGAN: A DEFORMABLE CONVOLUTION-BASED GAN WITH QP ADAPTATION FOR PERCEPTUAL QUALITY ENHANCEMENT OF COMPRESSED VIDEO
Abstract
In this paper, we propose a deformable convolution-based generative adversarial network (DCNGAN) for the perceptual quality enhancement of compressed videos. DCNGAN is also adaptive to the quantization parameters (QPs). Compared with optical flow, deformable convolutions align frames more effectively and efficiently. A deformable convolution can operate on multiple frames simultaneously, leveraging more temporal information, which is beneficial for enhancing the perceptual quality of compressed videos. Moreover, because it processes all frames in a single pass rather than aligning them in a pairwise manner, it incurs lower computational complexity. Experimental results demonstrate that the proposed DCNGAN outperforms state-of-the-art compressed video quality enhancement algorithms.
Index Terms— Compressed video perceptual quality enhancement, Deformable convolution, GAN, QP adaptation
1 Introduction
Recent years have witnessed tremendous development of video compression algorithms [1][2]. However, compressed videos, especially at low bit rates, still suffer from degraded quality due to compression artifacts. It is therefore crucial to enhance the quality of compressed videos.
Previous works focus on enhancing the objective quality of compressed videos [3][4][5]. Yang et al. [3] proposed MFQE, a compressed video quality enhancement algorithm that aggregates information from neighboring high-quality frames. By further optimizing the network in [3], Guan et al. [4] proposed MFQE 2.0 and achieved better performance. Deng et al. [5] incorporated deformable convolutions [6] to efficiently enhance the PSNR of compressed videos. However, objective quality is not always consistent with perceptual quality [7]. To achieve a higher quality of experience (QoE) [8], many works aim to enhance the perceptual quality of compressed videos [9][10]. Wang et al. [9] proposed a generative adversarial network (GAN) based on the multi-level wavelet packet transform to enhance compressed videos. Wang et al. [10] removed visual artifacts with an enhancement network built on residual blocks.
However, these algorithms require training and storing separate models to enhance videos compressed at different QPs, which places a high demand on memory. To adapt to specific QPs, Huang et al. [11] achieved QP-adaptive CNN filters by feeding the coded QP to the network, and Liu et al. [12] embedded the quantization step (Qstep) into the network to adapt to the QP and improve coding performance.

In this paper, we propose a QP-adaptive deformable convolution-based GAN to enhance the perceptual quality of compressed videos. Thanks to QP adaptation, our method can enhance videos displayed on TV, where QP values vary greatly from frame to frame. Moreover, our method can be used for historical content enhancement to meet higher quality requirements. Our main contributions are:
1. A GAN framework based on deformable convolutions to enhance the perceptual quality of compressed videos.
2. A single adaptive model to enhance videos compressed at various QPs.
3. We compare the proposed DCNGAN with state-of-the-art compressed video quality enhancement networks, showing its superior performance.
2 Proposed Framework
Our framework is illustrated in Fig. 1 and consists of a generator $G$ and a discriminator $D$. At a particular frame index $t$, the generator processes a sequence of three consecutive decoded frames $\tilde{X}_t = \{\tilde{x}_{t-1}, \tilde{x}_t, \tilde{x}_{t+1}\}$ together with the corresponding QP value $q_t$ and outputs the enhanced frame $\hat{x}_t = G(\tilde{X}_t, q_t)$.
2.1 Generator
The generator consists of two modules: the frame alignment module and the quality enhancement module. To leverage temporal information, the former takes the previous, current and next frames and aligns them using a deformable convolution. The output of the deformable convolution is a single representation that already integrates information from the three frames. The key to correct alignment is the computation of the corresponding offsets, which are predicted by a network based on U-Net [13]. The number of offset channels is determined by the number of input frames (i.e., three in our case), the number of spatial dimensions (i.e., two) and the size of the convolutional kernels. This approach is more efficient than alignment based on optical flow, which is generally conducted in a pairwise manner.
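To make the alignment step concrete, below is a minimal PyTorch sketch assuming 3×3 deformable kernels, 64 output channels, and a plain convolution standing in for the U-Net-based offset predictor; it illustrates the offset-channel bookkeeping described above rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FrameAlignment(nn.Module):
    """Aligns three stacked luminance frames with a single deformable convolution.

    Assumptions (not from the paper): 3x3 kernels, 64 output channels, and a
    plain convolution standing in for the U-Net that predicts the offsets.
    """
    def __init__(self, num_frames=3, kernel_size=3, out_channels=64):
        super().__init__()
        # 2 spatial dims x num_frames offset groups x K*K sampling points
        # = 2 * 3 * 9 = 54 offset channels under the 3x3-kernel assumption.
        offset_channels = 2 * num_frames * kernel_size * kernel_size
        # Placeholder for the U-Net-based offset predictor described in the paper.
        self.offset_pred = nn.Conv2d(num_frames, offset_channels, 3, padding=1)
        # One deformable convolution fuses all three frames into a single feature map.
        self.dcn = DeformConv2d(num_frames, out_channels, kernel_size,
                                padding=kernel_size // 2)

    def forward(self, frames):            # frames: (N, 3, H, W), luminance only
        offsets = self.offset_pred(frames)
        return self.dcn(frames, offsets)  # (N, out_channels, H, W)

# Usage: previous, current and next luminance frames stacked along channels.
x = torch.rand(1, 3, 64, 64)
aligned = FrameAlignment()(x)
print(aligned.shape)  # torch.Size([1, 64, 64, 64])
```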
The quality enhancement module is based on an encoder-decoder structure with nine residual blocks. To avoid storing multiple sets of model parameters for enhancing videos compressed at different QPs, encoded QP information is embedded into each residual block so that the network is modulated by the QP value. Specifically, the QP is represented using one-hot encoding [11] and fed into a fully connected layer. Softplus is selected as the activation function since it ensures positive outputs, which we found beneficial in training. Finally, channel-wise multiplication is performed between the encoded QP information and the feature maps output by the first convolution layer of the residual block (before the ReLU).
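The QP modulation described above could be sketched as follows; the channel width, the supported QP set, and the exact layout of the residual block are assumptions for illustration, not the paper's precise configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QPModulatedResBlock(nn.Module):
    """Residual block whose first-conv features are scaled channel-wise by an
    encoded QP vector (one-hot QP -> fully connected layer -> Softplus).

    Channel width and the supported QP set are illustrative assumptions.
    """
    def __init__(self, channels=64, qps=(22, 27, 32, 37)):
        super().__init__()
        self.qps = list(qps)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # One-hot QP -> per-channel scale; Softplus keeps the scales positive.
        self.qp_fc = nn.Linear(len(self.qps), channels)

    def forward(self, x, qp):
        one_hot = F.one_hot(torch.tensor([self.qps.index(qp)], device=x.device),
                            num_classes=len(self.qps)).float()
        scale = F.softplus(self.qp_fc(one_hot)).view(1, -1, 1, 1)
        h = self.conv1(x) * scale          # channel-wise modulation before the ReLU
        h = self.conv2(F.relu(h))
        return x + h

# Usage
feat = torch.rand(1, 64, 32, 32)
out = QPModulatedResBlock()(feat, qp=32)
```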
2.2 Discriminator
We use a patch discriminator [14] which outputs the probability of each patch being real or fake. It is implemented in a fully convolutional fashion, and the final output is the average real/fake classification probability across patches.
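A minimal sketch of such a patch discriminator is given below, assuming single-channel (luminance) inputs and illustrative layer widths; the paper's exact architecture follows [14].

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: each output element scores one patch.
    Layer widths and depth are illustrative, not the paper's exact configuration."""
    def __init__(self, in_channels=1):  # luminance input assumed
        super().__init__()
        layers, ch = [], in_channels
        for out_ch in (64, 128, 256):
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 4, padding=1)]   # per-patch real/fake logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        patch_logits = self.net(x)                # (N, 1, H', W')
        # Average patch score per sample; a sigmoid converts it to a probability.
        return patch_logits.mean(dim=(1, 2, 3))
```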
2.3 Objective
Our model optimizes the following adversarial loss [15] (hereinafter we omit the subindex $t$ for simplicity)

$$\mathcal{L}_{\mathrm{adv}}(G,D) = \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{(\tilde{X},q)}\big[\log\big(1 - D(G(\tilde{X}, q))\big)\big] \qquad (1)$$

where $\tilde{X}$ is a sequence of three consecutive decoded frames with corresponding QP value $q$, and $x$ is the corresponding raw frame, both obtained from a video dataset. The generator is trained to minimize the value of Eq. (1) while the discriminator is trained to maximize it.
We use a perceptual loss based on VGG features [16] that enforces that the features of the enhanced frame $G(\tilde{X}, q)$ produced for a given pair $(\tilde{X}, q)$ match the features of the corresponding target raw frame $x$

$$\mathcal{L}_{\mathrm{VGG}}(G) = \mathbb{E}_{(\tilde{X},q,x)}\left[\sum_{j}\frac{1}{N_j}\sum_{i}\left\|\phi_j^{i}\big(G(\tilde{X}, q)\big) - \phi_j^{i}(x)\right\|_1\right] \qquad (2)$$

where $\phi_j^{i}$ represents the $i$-th spatial element of the output tensor of the $j$-th layer from a pre-trained VGG-19 model and $N_j$ is the number of elements in that layer.
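As an illustration of how the VGG features in Eq. (2) could be extracted, here is a hedged PyTorch sketch; the chosen layers and the torchvision weights argument (recent torchvision assumed) are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torchvision

class VGGFeatures(nn.Module):
    """Extract intermediate VGG-19 feature maps for a perceptual loss like Eq. (2).

    The layer indices below (roughly relu1_1 ... relu5_1) are a common choice for
    perceptual losses, not necessarily the exact layers used in the paper.
    """
    def __init__(self, layer_ids=(1, 6, 11, 20, 29)):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="DEFAULT").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)      # VGG is only a fixed feature extractor
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, x):                # x: (N, 3, H, W), ImageNet-style input
        feats = []
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.layer_ids:
                feats.append(x)
        return feats

# Eq. (2) then reduces to summing distances between the two feature lists.
```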
Similarly, we also enforce matching the features of enhanced and of raw images in the discriminator

$$\mathcal{L}_{\mathrm{FM}}(G,D) = \mathbb{E}_{(\tilde{X},q,x)}\left[\sum_{j}\frac{1}{M_j}\sum_{i}\left\|D_j^{i}\big(G(\tilde{X}, q)\big) - D_j^{i}(x)\right\|_1\right] \qquad (3)$$

where $D_j^{i}$ represents the $i$-th spatial element of the output tensor of the $j$-th layer selected from the discriminator and $M_j$ is the number of elements in that layer.
Finally, during training we optimize

$$\min_{G}\left(\max_{D}\mathcal{L}_{\mathrm{adv}}(G,D) + \lambda_{\mathrm{VGG}}\,\mathcal{L}_{\mathrm{VGG}}(G) + \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}}(G,D)\right) \qquad (4)$$

where $\lambda_{\mathrm{VGG}}$ and $\lambda_{\mathrm{FM}}$ weight the perceptual and feature matching terms, respectively.
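The sketch below shows how the three terms could be combined for the generator update, using a non-saturating BCE adversarial term, L1 feature distances, and placeholder weights; these choices are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, fake_feats, real_feats,
                   fake_vgg, real_vgg, w_vgg=10.0, w_fm=10.0):
    """Combine adversarial, VGG-perceptual and discriminator feature-matching
    terms for the generator. The weights (w_vgg, w_fm) are placeholders, not
    values reported in the paper.

    d_fake_logits          : discriminator scores for enhanced frames
    fake_feats / real_feats: lists of intermediate discriminator features
    fake_vgg / real_vgg    : lists of VGG-19 features for enhanced / raw frames
    """
    # Adversarial term (standard non-saturating stand-in for Eq. (1)):
    # push D(G(x~, q)) towards the "real" label.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Eq. (2): match VGG features of enhanced and raw frames.
    vgg = sum(F.l1_loss(f, r) for f, r in zip(fake_vgg, real_vgg))
    # Eq. (3): match discriminator features of enhanced and raw frames.
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
    return adv + w_vgg * vgg + w_fm * fm
```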
Table 1: LPIPS / DISTS comparison on the JCT-VC test sequences (lower is better; "—" indicates values not reported).

| QP | Class | Sequence | Compressed | MFQE 2.0 [4] | STDF [5] | MW-GAN [9] | VPE-GAN [10] | Proposed |
|----|-------|----------|------------|--------------|----------|------------|--------------|----------|
| 32 | A | Traffic | 0.170 / 0.014 | 0.184 / 0.014 | 0.094 / 0.009 | 0.138 / — | 0.179 / 0.029 | 0.070 / 0.006 |
| 32 | A | PeopleOnStreet | 0.150 / 0.018 | 0.167 / 0.018 | 0.133 / 0.010 | 0.130 / — | 0.135 / 0.015 | 0.086 / 0.008 |
| 32 | B | Kimono | 0.258 / 0.043 | 0.294 / 0.046 | 0.160 / 0.026 | 0.189 / — | 0.180 / 0.034 | 0.108 / 0.023 |
| 32 | B | ParkScene | 0.276 / 0.044 | 0.286 / 0.045 | 0.182 / 0.027 | 0.244 / — | 0.196 / 0.037 | 0.123 / 0.023 |
| 32 | B | Cactus | 0.260 / 0.022 | 0.288 / 0.022 | 0.136 / 0.012 | 0.151 / — | 0.126 / 0.017 | 0.096 / 0.010 |
| 32 | B | BQTerrace | 0.215 / 0.032 | 0.241 / 0.034 | 0.152 / 0.021 | 0.116 / — | 0.140 / 0.040 | 0.113 / 0.018 |
| 32 | B | BasketballDrive | 0.247 / 0.028 | 0.279 / 0.031 | 0.166 / 0.022 | 0.141 / — | 0.132 / 0.025 | 0.099 / 0.015 |
| 32 | C | RaceHorses | 0.147 / 0.066 | 0.174 / 0.075 | 0.120 / 0.061 | 0.126 / — | 0.101 / 0.055 | 0.089 / 0.042 |
| 32 | C | BQMall | 0.124 / 0.066 | 0.145 / 0.071 | 0.089 / 0.050 | 0.091 / — | 0.112 / 0.063 | 0.072 / 0.038 |
| 32 | C | PartyScene | 0.101 / 0.057 | 0.126 / 0.060 | 0.067 / 0.042 | 0.026 / — | 0.091 / 0.045 | 0.075 / 0.029 |
| 32 | C | BasketballDrill | 0.156 / 0.073 | 0.181 / 0.079 | 0.126 / 0.068 | 0.109 / — | 0.105 / 0.060 | 0.072 / 0.040 |
| 32 | D | RaceHorses | 0.122 / 0.121 | 0.143 / 0.132 | 0.098 / 0.113 | 0.117 / — | 0.093 / 0.126 | 0.072 / 0.091 |
| 32 | D | BQSquare | 0.110 / 0.150 | 0.121 / 0.160 | 0.084 / 0.130 | 0.073 / — | 0.066 / 0.112 | 0.104 / 0.123 |
| 32 | D | BlowingBubbles | 0.102 / 0.117 | 0.111 / 0.128 | 0.068 / 0.104 | 0.063 / — | 0.072 / 0.096 | 0.065 / 0.084 |
| 32 | D | BasketballPass | 0.116 / 0.135 | 0.135 / 0.150 | 0.099 / 0.127 | 0.095 / — | 0.085 / 0.116 | 0.067 / 0.099 |
| 32 | E | FourPeople | 0.120 / 0.037 | 0.128 / 0.038 | 0.089 / 0.022 | 0.080 / — | 0.103 / 0.028 | 0.054 / 0.016 |
| 32 | E | Johnny | 0.148 / 0.035 | 0.159 / 0.035 | 0.111 / 0.021 | 0.083 / — | 0.178 / 0.059 | 0.063 / 0.014 |
| 32 | E | KristenAndSara | 0.134 / 0.038 | 0.148 / 0.039 | 0.106 / 0.025 | 0.108 / — | 0.136 / 0.046 | 0.062 / 0.019 |
| 32 | | Average | 0.164 / 0.061 | 0.184 / 0.065 | 0.116 / 0.049 | 0.115 / — | 0.124 / 0.056 | 0.083 / 0.039 |
| 22 | | Average | 0.077 / 0.020 | 0.087 / 0.022 | 0.050 / 0.014 | — / — | 0.097 / 0.047 | 0.042 / 0.017 |
| 27 | | Average | 0.116 / 0.037 | 0.130 / 0.040 | 0.077 / 0.029 | — / — | 0.103 / 0.054 | 0.059 / 0.026 |
| 37 | | Average | 0.223 / 0.089 | 0.232 / 0.086 | 0.168 / 0.080 | 0.177 / — | 0.148 / 0.070 | 0.120 / 0.058 |

3 Experimental Results
3.1 Dataset and Training Strategy
We employ the 106 sequences collected by [4] for training and the Joint Collaborative Team on Video Coding (JCT-VC) standard test sequences for testing. All sequences are compressed with the H.265/HEVC reference software HM 16.5 under the Low Delay P configuration at QPs 22, 27, 32 and 37. The 106 raw sequences and their compressed versions are randomly cropped into clips to form the training samples, and the training dataset contains mixed and shuffled samples from all four QPs.
We employ the Adam optimizer. The batch size is set to 32 and the learning rate is kept constant throughout training. For a fair comparison with previous work, we only enhance the luminance component.
3.2 Quantitative and Qualitative Comparison
We compare the proposed DCNGAN with the state-of-the-art video quality enhancement networks MFQE 2.0 [4], STDF [5], MW-GAN [9] and VPE-GAN [10]. (For a fair comparison, the MW-GAN results in Table 1 are taken from the published paper, since our retrained model did not reach the reported performance and pretrained models for the four QPs have not been released.) LPIPS [17] and DISTS [18] are employed to quantitatively evaluate the perceptual quality of the enhanced videos; smaller values indicate better perceptual quality. For reference, the LPIPS and DISTS of "Compressed" (i.e., the input videos) are also shown in Table 1. Note that we train a single model for the test sequences compressed at the four QPs, whereas the other networks require four separate models since they cannot achieve QP adaptation. As shown in Table 1, the proposed DCNGAN achieves superior performance while alleviating memory requirements at the enhancement stage.
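For reference, LPIPS scores can be computed with the official lpips package as sketched below; replicating the luminance channel to three channels and rescaling to [-1, 1] is our assumption for scoring Y-only frames, not a protocol stated in the paper.

```python
import torch
import lpips  # pip install lpips

# Lower LPIPS means better perceptual quality.
loss_fn = lpips.LPIPS(net='alex')  # 'vgg' is another common backbone choice

def lpips_luma(enhanced_y, raw_y):
    """Score a luminance-only frame pair.

    enhanced_y / raw_y: (N, 1, H, W) tensors in [0, 1]. Replicating the Y channel
    to three channels and rescaling to [-1, 1] is an assumption made here for
    feeding grayscale frames to LPIPS.
    """
    def prep(y):
        return y.repeat(1, 3, 1, 1) * 2.0 - 1.0
    with torch.no_grad():
        return loss_fn(prep(enhanced_y), prep(raw_y)).mean().item()
```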
In addition, three examples are randomly selected to qualitatively compare the proposed DCNGAN with the other video quality enhancement networks. As shown in Fig. 2, MFQE 2.0 and STDF, which were designed to improve the PSNR of compressed videos, tend to produce blurred results that penalize perceptual quality. VPE-GAN alleviates blurring to some extent, but sometimes generates artifacts (e.g., on the horse and the wall in Fig. 2), which also degrade perceptual quality. In contrast, the proposed DCNGAN generates more realistic high-frequency details that counteract blurring and greatly improve the perceptual quality of compressed videos.
3.3 QP Adaptation Performance Evaluation
To evaluate the QP adaptation ability of the proposed DCNGAN, we separately train four models at the four QPs, denoted "Trained_QP22", "Trained_QP27", "Trained_QP32" and "Trained_QP37", and compare them with the single model trained to adapt to various QPs, denoted "Trained_4QPs". The performance comparison is shown in Fig. 3, where LPIPS is averaged over all sequences in Table 1.
Overall, each model trained at a particular QP achieves the best performance when tested at that QP, because it has fully learned the characteristics of videos compressed at that QP during training, while it can hardly achieve satisfactory performance on videos compressed at other QPs, whose characteristics differ. By feeding encoded QP information into the proposed DCNGAN (HM 16.5 compresses videos with only a small QP variation across frames; we tried feeding the per-frame QP to the network but did not observe a benefit given the small delta QPs, so we only feed the QP of the I frame), the single model "Trained_4QPs" can enhance the perceptual quality of videos compressed at various QPs with competitive performance.


3.4 Deformable Convolutions vs. Optical Flows
An alternative to deformable convolutions for aligning frame information is optical flow. For comparison, we replace the deformable convolutions with optical flows estimated by a pre-trained SpyNet model [19] to perform frame alignment. The other modules of the proposed DCNGAN (i.e., the quality enhancement module with QP adaptation, the discriminator and the loss functions) are kept unchanged. Finally, the average LPIPS and runtime of the two approaches are compared. As shown in Fig. 4, the network using deformable convolutions is much faster and achieves better performance than the one using optical flows, which highlights the accuracy and efficiency of deformable convolutions.
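To make the contrast concrete, pairwise optical-flow alignment requires one flow estimate and one backward warp per neighboring frame, as in the hedged sketch below (a generic grid_sample warp; SpyNet itself and its exact flow conventions are not reproduced here), whereas the deformable convolution of Sec. 2.1 fuses all frames in a single operation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """Backward-warp `frame` (N, C, H, W) with an optical flow (N, 2, H, W).

    Assumes flow[:, 0] holds horizontal and flow[:, 1] vertical displacements,
    in pixels; this convention is an assumption of the sketch.
    """
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                               # (N, 2, H, W)
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

# Pairwise alignment: one flow estimate plus one warp per neighboring frame,
# whereas the deformable convolution fuses all frames in one pass.
```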
4 Conclusion
In this paper, DCNGAN is proposed to enhance the perceptual quality of videos compressed at various QPs. By incorporating deformable convolutions to align temporally neighboring frames and feeding the encoded QP to modulate the network, the proposed DCNGAN achieves superior performance while saving model parameters. Experimental results show the superiority of the proposed DCNGAN over state-of-the-art compressed video quality enhancement networks.
References
- [1] T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra, “Overview of the H.264/AVC Video Coding Standard,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, July 2003, doi: 10.1109/TCSVT.2003.815165.
- [2] G. J. Sullivan, J. Ohm, W. Han and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012, doi: 10.1109/TCSVT.2012.2221191.
- [3] R. Yang, M. Xu, Z. Wang and T. Li, “Multi-Frame Quality Enhancement for Compressed Video,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6664-6673, doi: 10.1109/CVPR.2018.00697.
- [4] Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu and Z. Wang, “MFQE 2.0: A New Approach for Multi-Frame Quality Enhancement on Compressed Video,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 949-963, 1 March 2021, doi: 10.1109/TPAMI.2019.2944806.
- [5] J. Deng, L. Wang, S. Pu and C. Zhuo, “Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp.10696-10703, 2020.
- [6] J. Dai et al., “Deformable Convolutional Networks,” 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764-773, doi: 10.1109/ICCV.2017.89.
- [7] Y. Blau and T. Michaeli, “The Perception-Distortion Tradeoff,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228-6237, doi: 10.1109/CVPR.2018.00652.
- [8] C. G. Bampis, Z. Li, A. K. Moorthy, I. Katsavounidis, A. Aaron and A. C. Bovik, “Study of Temporal Effects on Subjective Video Quality of Experience,” in IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5217-5231, Nov. 2017, doi: 10.1109/TIP.2017.2729891.
- [9] J. Wang, X. Deng, M. Xu, C. Chen, and Y. Song, “Multi-Level Wavelet-Based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video,” in European Conference on Computer Vision. Springer, 2020, pp. 405–421.
- [10] T. Wang, J. He, S. Xiong, P. Karn and X. He, “Visual Perception Enhancement for HEVC Compressed Video Using a Generative Adversarial Network,” 2020 International Conference on UK-China Emerging Technologies (UCET), 2020, pp. 1-4, doi: 10.1109/UCET51115.2020.9205459.
- [11] Z. Huang, J. Sun, X. Guo and M. Shang, “One-for-all: An Efficient Variable Convolution Neural Network for In-loop Filter of VVC,” in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2021.3089498.
- [12] C. Liu, H. Sun, J. Katto, X. Zeng and Y. Fan, “A QP-Adaptive Mechanism for CNN-Based Filter in Video Coding,” arXiv preprint, arXiv:2010.13059, Oct. 2020.
- [13] O. Ronneberger, P. Fischer and T. Brox, “U-net: Convolutional Networks for Biomedical Image Segmentation,” International Conference on Medical image computing and computer-assisted intervention, pp. 234-241, 2015.
- [14] P. Isola, J. Zhu, T. Zhou and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967-5976, doi: 10.1109/CVPR.2017.632.
- [15] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang and S. P. Smolley, “Least squares generative adversarial networks,” 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2813-2821, doi: 10.1109/ICCV.2017.304.
- [16] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz and B. Catanzaro, “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798-8807, doi: 10.1109/CVPR.2018.00917.
- [17] R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 586-595, doi: 10.1109/CVPR.2018.00068.
- [18] K. Ding, K. Ma, S. Wang and E. P. Simoncelli, “Image Quality Assessment: Unifying Structure and Texture Similarity,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2020.3045810.
- [19] A. Ranjan and M. J. Black, “Optical Flow Estimation Using a Spatial Pyramid Network,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.