
Group-based Bi-Directional Recurrent Wavelet Neural Networks
for Video Super-Resolution

Young-Ju Choi
Department of IT Engineering
Sookmyung Women’s University, Seoul, Korea
[email protected]
   Young-Woon Lee
Department of Computer Engineering
Sunmoon University, Asan, Korea
[email protected]
   Byung-Gyu Kim
Department of IT Engineering
Sookmyung Women’s University, Seoul, Korea
[email protected]
Abstract

Video super-resolution (VSR) aims to estimate a high-resolution (HR) frame from low-resolution (LR) frames. The key challenge for VSR lies in the effective exploitation of spatial correlation within a frame and temporal dependency between consecutive frames. However, most previous methods treat different types of spatial features identically and extract spatial and temporal features in separate modules, which limits the extraction of meaningful information and the enhancement of fine details. In VSR, there are three types of temporal modeling frameworks: 2D convolutional neural networks (CNN), 3D CNN, and recurrent neural networks (RNN). Among them, the RNN-based approach is well suited to sequential data, so SR performance can be greatly improved by using the hidden states of adjacent frames. However, at each time step of the recurrent structure, previous RNN-based works use neighboring features only restrictively. Since the range of accessible motion per time step is narrow, they still struggle to restore missing details under dynamic or large motion. In this paper, we propose group-based bi-directional recurrent wavelet neural networks (GBR-WNN) to exploit sequential data and spatio-temporal information effectively for VSR. The proposed group-based bi-directional RNN (GBR) temporal modeling framework is built on a well-structured process over a group of pictures (GOP). We also propose a temporal wavelet attention (TWA) module, in which attention is adopted for both spatial and temporal features. Experimental results demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods in both quantitative and qualitative evaluations.

1 Introduction

Super-resolution (SR) is a traditional problem in the low-level vision field. The goal of SR is to reconstruct a high-resolution (HR) image from the corresponding low-resolution (LR) image; therefore, recovering missing edge and texture details in the LR image plays an important role in SR. SR imaging is widely used in various computer vision applications such as medical, satellite, surveillance, and low-bitrate media imaging systems. Moreover, with the growth of display industries, SR has become even more crucial in recent years.

The SR problem can be separated into single-image super-resolution (SISR) [4, 14, 18, 21, 27, 34], multi-image super-resolution (MISR) [6, 7, 22], and video super-resolution (VSR) [1, 2, 10, 13, 26, 29, 31]. Given an LR video consisting of $(2N+1)$ LR frames $\{LR_{t-N},\dots,LR_{t},\dots,LR_{t+N}\}$, the target frame is $LR_{t}$. SISR resolves each video frame independently. However, this technique does not consider the temporal information from the other frames, which is very inefficient. MISR utilizes temporal details from the neighboring frames and fuses them to super-resolve $LR_{t}$. In MISR, however, the frames are aligned without any technique that accounts for temporal smoothness, and this inaccurate motion alignment causes results containing discontinuous regions. In contrast with SISR and MISR, VSR exploits temporal dependency among frames while maintaining their sequential characteristics.

With the success of convolutional neural networks (CNN) in computer vision tasks such as image classification [16] and object detection [8], CNNs have also been successfully applied to the VSR task. By learning the non-linear LR-to-HR mapping function directly, reconstruction accuracy and visual quality were remarkably enhanced. Earlier deep learning-based VSR methods [3, 20, 28] were designed as simple extensions of SISR, which means that the temporal information among video frames was not considered properly.

To exploit more temporal dependency between consecutive frames, some studies [1, 13, 26] have used an optical flow-based explicit motion estimation and compensation process. However, it is difficult to estimate accurate motion for sequences containing occlusions and large motions. To address this problem, recent methods [12, 29] used implicit motion compensation. The latest state-of-the-art methods [2, 5, 25, 31] have designed architectures with more elaborate pipelines and multiple stages to extract spatio-temporal features and reconstruct the target frame.

Furthermore, VSR methods can be divided into three categories based on their temporal modeling framework: 2D CNN [1, 2, 25, 29, 30, 31], 3D CNN [5, 12], and recurrent neural networks (RNN) [10, 11, 26]. Most recent VSR approaches use frame concatenation and several stacked 2D CNN layers to extract spatial and temporal information. In this case, even when an additional temporal feature extraction module is constructed, the representation of dynamic motion remains limited because the input frames are simply concatenated together. In VSR, input frames are stacked along the temporal axis; to alleviate the limitation of 2D CNNs, some methods apply 3D CNNs directly to the stacked frames.

A 3D CNN can extract both spatial and temporal information within a temporal sliding window simultaneously. However, 3D CNNs have higher computational complexity and need much more memory than 2D CNNs. The RNN is effective in dealing with sequential data, so it can be employed for VSR. In an RNN structure, the frames are fed into convolution layers in temporal order. With an RNN-based approach, performance can be greatly improved because it is possible to use the hidden states of adjacent frames.

Exploiting both intra-frame spatial correlations and inter-frame temporal dependencies between consecutive frames plays an important role in VSR. From a temporal point of view, previous uni-directional RNN-based temporal modeling frameworks [10, 26] consider only the previous hidden states. In the previous bi-directional RNN-based temporal modeling framework [11], the previous and future hidden states at time steps $t-1$ and $t+1$ are connected to the hidden state at the current time step. Because these frameworks deal with a narrow range of motion representation, they may fail to address large and dynamic motions in the sequence.

In an image, the low-frequency component describes the basic background information and the high-frequency component represents the edge and texture details. From a spatial point of view, most previous methods handle these features identically or simply combine an edge map, so the HR output image lacks meaningful information. Furthermore, since most previous architectures treat the spatial and temporal feature extractors separately, there is a discontinuity between the two extracted features. For better feature extraction, it is preferable that the spatial and temporal information be extracted together in a single module. Recently, Wang et al. [31] proposed the temporal and spatial attention (TSA) module to extract spatio-temporal features. The TSA module sequentially performs temporal attention and spatial attention with several convolution layers. However, the TSA module may be limited because it treats the low-frequency and high-frequency features equally.

In this paper, we propose group-based bi-directional recurrent wavelet neural networks (GBR-WNN) to exploit sequential data and spatio-temporal dependencies effectively for VSR. The core components of GBR-WNN are (1) a temporal modeling framework, called group-based bi-directional RNN (GBR), and (2) a temporal wavelet attention (TWA) module.

The proposed GBR framework can cover a wide range of motion by utilizing the previous and future hidden states of time steps $\{t-N,\dots,t,\dots,t+N\}$ within a defined group of pictures (GOP). The proposed TWA module can extract elaborate spatio-temporal features by generating two attention maps, one for spatial and one for temporal information. In the TWA module, we apply the discrete wavelet transform (DWT) for spatial attention to exploit its ability to decompose a feature into multiple features with distinct frequency properties. Weighting features temporally across frames reduces discontinuities along the time axis, and weighting features through the four DWT attention maps with different frequency characteristics preserves their inherently precise and sharp attributes.

This paper is organized as follows. In section 2, we introduce the related work. In section 3, we present the proposed methodology. The experimental results are shown in section 4. Finally, section 5 concludes this paper.

Figure 1: An overview of the proposed GBR-WNN.

2 Related Work

2.1 Single Image Super-Resolution

As one of the earliest works of deep learning-based SISR, Dong et al. [4] proposed the super-resolution convolutional neural network (SRCNN), a relatively shallow network. Kim et al. [14] later developed a deeper network with residual learning called very deep super-resolution (VDSR). After that, an efficient sub-pixel convolutional neural network (ESPCN) was proposed by Shi et al. [27] to reduce computational complexity while keeping a deep network. Ledig et al. [18] proposed super-resolution using a generative adversarial network (SRGAN), a model focusing on high-frequency details. Based on [18], Lim et al. [21] proposed an enhanced deep super-resolution network (EDSR), which modified the residual module by removing batch normalization. Recently, much deeper CNNs, including the residual dense network (RDN) [34], DBPN [9], and RCAN [33], were introduced; they outperformed previous networks by a substantial margin.

2.2 Video Super-Resolution

2.2.1 2D CNN

Inspired by SRCNN in SISR, research on deep learning-based VSR began. Kappeler et al. [13] proposed a two-step framework consisting of flow estimation and frame warping, namely, video super-resolution with convolutional neural networks (VSRNet). Caballero et al. [1] introduced the first end-to-end CNN for VSR, called video super-resolution using an efficient sub-pixel convolutional neural network (VESPCN), which jointly trains flow estimation and spatio-temporal networks. The aforementioned methods used optical flow to estimate the motion between frames and perform warping. However, motion estimation with optical flow can be inaccurate for sequences containing occlusion and large motion. To address this issue, most of the latest models [2, 25, 29, 31] handle the problem using implicit motion compensation and surpass the optical flow-based methods. Some of the latest methods [19, 30] still estimate optical flow and utilize the warped features in sophisticated multi-stage CNN architectures.

2.2.2 3D CNN

Typically, a 3D CNN is more appropriate than a 2D CNN for extracting spatio-temporal features from a sequence. Jo et al. [12] proposed a VSR network estimating dynamic upsampling filters (DUF) with stacked 3D convolutional layers. DUF avoids explicit motion compensation by using implicit motion compensation and surpasses flow-based methods. Fan et al. [5] proposed a 3D CNN-based architecture for deep video restoration, namely, the wide-activated 3D convolutional network for video restoration (WDVR).

2.2.3 RNN

The RNN architecture is well suited to sequential data such as video. Sajjadi et al. [26] proposed the frame-recurrent architecture called FRVSR, which uses previously inferred HR estimates for the SR of subsequent frames. After that, Haris et al. [10] proposed the recurrent back-projection network (RBPN), which collects temporal and spatial information from frames surrounding the target frame. As a bi-directional RNN approach, Huang et al. [11] proposed a bi-directional recurrent convolutional network using recurrent and 3D feedforward convolutions.

3 Methodology

3.1 Overview

In this section, we introduce the overall architecture of our group-based bi-directional recurrent wavelet neural networks (GBR-WNN). The proposed GBR-WNN consists of two core components: a temporal modeling framework, namely group-based bi-directional RNN (GBR), and a feature extraction module called temporal wavelet attention (TWA).

The proposed network is illustrated in Figure 1. Given $2N+1$ consecutive LR input frames $LR_{input} = \{LR_{t-N},\dots,LR_{t},\dots,LR_{t+N}\}$, we denote the central LR frame $LR_{t}$ as the target frame and the other frames as neighboring frames, each of size $H \times W$, where $H$ is the height and $W$ is the width. The goal of VSR is to estimate an HR target frame $SR_{t}$ that is close to the ground truth frame $HR_{t}$ of size $sH \times sW$, where $s$ is the scaling factor.

In the GBR framework, the available temporal information is exploited through the previous and future hidden states within the scope of a group of pictures (GOP). The input LR frames are fed to the TWA module to extract elaborate spatio-temporal features based on the 2D discrete wavelet transform (DWT) [24]. The output features of the TWA module pass through a reconstruction and upsampling module. Our reconstruction module is designed with several 2D CNN layers and 2D residual blocks without batch normalization units, as in [21]. The predicted HR residual frame is obtained by adopting the depth-to-space transformation of [27]. Finally, the HR estimated frame $SR_{t}$ is obtained by adding the predicted residual frame to a directly upsampled LR frame $LR_{t}$.
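To make this reconstruction and upsampling path concrete, the PyTorch sketch below shows a batch-normalization-free residual block in the spirit of [21] together with the depth-to-space and space-to-depth operations reused in Section 3.2. The channel counts, layer sizes, and module names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlockNoBN(nn.Module):
    """Residual block without batch normalization, in the spirit of EDSR [21]."""
    def __init__(self, num_feat=128):
        super().__init__()
        self.conv1 = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.conv2 = nn.Conv2d(num_feat, num_feat, 3, 1, 1)

    def forward(self, x):
        # Identity skip connection around two convolutions with a ReLU in between.
        return x + self.conv2(F.relu(self.conv1(x)))

# Depth-to-space (used to upsample the HR residual) and its inverse,
# space-to-depth (used in Eq. (2) to reshape SR frames back to LR size).
hr_residual = F.pixel_shuffle(torch.randn(1, 3 * 4 * 4, 64, 64), 4)   # -> (1, 3, 256, 256)
lr_shaped   = F.pixel_unshuffle(hr_residual, 4)                       # -> (1, 48, 64, 64)
```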

3.2 Group-based Bi-directional RNN

Figure 2: The structure of the proposed GBR framework.

When a sequence contains large motion, the distant neighboring frames are likely to contain the details missing from the target frame. Therefore, if we utilize a wide range of motion, we have a good chance of improving VSR performance. To manage the hidden states of neighboring frames, as well as their LR inputs, dynamically, we propose the GBR temporal modeling framework.

The structure of the GBR framework is shown in Figure 2. We define the $(2N+1)$ LR input frames as a group of pictures (GOP). In a GOP, each LR frame in $\{LR_{t-N},\dots,LR_{t},\dots,LR_{t+N}\}$ becomes the current target LR frame $LR_{curr}$ in SR order. To form an effective structure for estimating the central HR target frame $SR_{t}$, the SR order is gradually directed toward the center. As the previous and future hidden states and SR frames, we use the features of the frames already predicted at the current point. The time steps of the previous and future neighboring features are illustrated in Figure 2, where $L0$, $L1$, and $N$ denote the previous picture list, the future picture list, and none, respectively. At time step $t-3$ for the first target frame, the previous LR frame $LR_{l0}$ and the previous and future estimations $H_{l0}$, $H_{l1}$, $SR_{l0}$, and $SR_{l1}$ are initialized with zero. At time step $t+3$ for the last target frame, the future LR frame $LR_{l1}$ and the future estimations $H_{l1}$ and $SR_{l1}$ are initialized with zero.
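To illustrate the center-directed SR order, the snippet below generates one plausible processing order within a GOP. Reading the order as alternating from the GOP boundaries toward the central frame, which is restored last, is our interpretation of Figure 2 and should be treated as an assumption rather than the authors' exact schedule.

```python
def gop_processing_order(t, n):
    """Hypothetical SR processing order within a GOP of 2n+1 frames,
    alternating from the boundaries toward the central target frame t.
    The exact order is an interpretation of Figure 2, not taken verbatim
    from the paper."""
    order = []
    for k in range(n, 0, -1):          # k = n, n-1, ..., 1
        order.append(t - k)            # frame on the previous-list (L0) side
        order.append(t + k)            # frame on the future-list (L1) side
    order.append(t)                    # the central target frame is restored last
    return order

# Example: with the center at index 3 and n = 3,
# gop_processing_order(3, 3) -> [0, 6, 1, 5, 2, 4, 3]
```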

At every time step, the GBR-WNN uses the following equations to estimate the output SR frame at the current point. The TWA module takes the three LR frames $LR_{l0}$, $LR_{curr}$, and $LR_{l1}$ as input and produces the temporal wavelet feature $TWF_{curr}$:

$TWF_{curr} = f_{twa}([LR_{l0}, LR_{curr}, LR_{l1}])$,   (1)

where $f_{twa}(\cdot)$ and $[\cdot,\cdot,\cdot]$ denote the mapping function of the TWA module and concatenation, respectively. We adopt the space-to-depth transformation [27] to match the shapes of the previous and future SR frames $SR_{l0}$ and $SR_{l1}$ to the LR frames with scaling factor $s$:

$SR_{l0}^{s\rightarrow d} = f_{s\rightarrow d}(SR_{l0})$,
$SR_{l1}^{s\rightarrow d} = f_{s\rightarrow d}(SR_{l1})$,   (2)

where $f_{s\rightarrow d}(\cdot)$, $SR_{l0}^{s\rightarrow d}$, and $SR_{l1}^{s\rightarrow d}$ denote the space-to-depth mapping function and the reshaped previous and future SR frames, respectively. Then, to further enhance the feature, the temporal wavelet feature is concatenated with the previous and future hidden states and SR frames and fed into the reconstruction module. This process can be represented as

$H_{curr} = f_{rec}([TWF_{curr}, H_{l0}, H_{l1}, SR_{l0}^{s\rightarrow d}, SR_{l1}^{s\rightarrow d}])$,   (3)

where $f_{rec}(\cdot)$ denotes the mapping function of the reconstruction module and $H_{curr}$ represents the extracted current hidden state. For the upsampling process, the depth-to-space mapping function $f_{d\rightarrow s}(\cdot)$ is applied to the hidden state $H_{curr}$ with scaling factor $s$. Lastly, the output SR frame $SR_{curr}$ of each time step is obtained by global residual learning using the estimated HR residual frame and the upsampled LR target frame:

$SR_{curr} = f_{d\rightarrow s}(H_{curr}) + f_{up}^{\times s}(LR_{curr})$,   (4)

where $f_{up}^{\times s}(\cdot)$ denotes the bilinear upsampling function with scaling factor $s$. When the frame index of the current target frame is smaller than that of the next target frame, the current hidden state $H_{curr}$ becomes the previous hidden state $H_{l0}$ and the current SR frame $SR_{curr}$ becomes the previous SR frame $SR_{l0}$. In contrast, if the frame index of the current target frame is larger than that of the next target frame, the current hidden state $H_{curr}$ becomes the future hidden state $H_{l1}$ and the current SR frame $SR_{curr}$ becomes the future SR frame $SR_{l1}$.
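The per-time-step computation of Eqs. (1)-(4) can be summarized by the following hedged PyTorch sketch. Here `twa` and `rec` stand for the TWA and reconstruction modules; their interfaces and the channel layout of the hidden state are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def gbr_step(lr_l0, lr_curr, lr_l1, h_l0, h_l1, sr_l0, sr_l1,
             twa, rec, scale=4):
    """One hypothetical GBR time step following Eqs. (1)-(4)."""
    # Eq. (1): temporal wavelet feature from the three LR frames.
    twf = twa(lr_l0, lr_curr, lr_l1)

    # Eq. (2): space-to-depth so the SR estimates match the LR resolution.
    sr_l0_sd = F.pixel_unshuffle(sr_l0, scale)
    sr_l1_sd = F.pixel_unshuffle(sr_l1, scale)

    # Eq. (3): fuse with previous/future hidden states and SR estimates.
    h_curr = rec(torch.cat([twf, h_l0, h_l1, sr_l0_sd, sr_l1_sd], dim=1))

    # Eq. (4): depth-to-space residual plus a bilinearly upsampled LR target.
    # h_curr is assumed to have 3 * scale**2 channels so that pixel_shuffle
    # yields an RGB residual of HR size.
    sr_curr = F.pixel_shuffle(h_curr, scale) + F.interpolate(
        lr_curr, scale_factor=scale, mode='bilinear', align_corners=False)
    return h_curr, sr_curr
```

Depending on the SR order, the returned `h_curr` and `sr_curr` are then assigned to the previous (L0) or future (L1) slots for the next time step, as described above.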

3.3 Temporal Wavelet Attention

Figure 3: The network architecture of the proposed TWA module.
Figure 4: An example of the 2D Haar discrete wavelet transform (DWT).

Extracting fine features of both intra-frame spatial and inter-frame temporal information is very important for improving the quality of the output SR frame. To obtain spatio-temporal features, we design the TWA module based on the 2D Haar DWT and an attention mechanism, as illustrated in Figure 3. The key role of the TWA is to generate a temporal attention map and a wavelet attention map. By utilizing the temporal attention map, we can restore missing features from neighboring frames with different degrees of motion information. Meanwhile, we can strengthen the edge and texture details by applying the spatial attention map.

Each of the three LR frames is fed into a CNN embedding layer to increase the number of features:

$E_{l0} = f_{emb}(LR_{l0})$,
$E_{curr} = f_{emb}(LR_{curr})$,
$E_{l1} = f_{emb}(LR_{l1})$,   (5)

where $f_{emb}(\cdot)$ denotes the embedding function and $E_{l0}$, $E_{curr}$, and $E_{l1}$ are the embedded features of the three LR frames, respectively. The temporal attention features $TAtt_{l0}$, $TAtt_{curr}$, and $TAtt_{l1}$ for the three LR frames can be calculated as

$TAtt_{l0} = \mathrm{sigmoid}(E_{l0} \cdot E_{curr})$,
$TAtt_{curr} = \mathrm{sigmoid}(E_{curr} \cdot E_{curr})$,
$TAtt_{l1} = \mathrm{sigmoid}(E_{l1} \cdot E_{curr})$,   (6)

where $\cdot$ and $\mathrm{sigmoid}(\cdot)$ denote the dot product and the sigmoid activation function, respectively. The temporal attention maps are then multiplied with the original LR frames:

$\tilde{E}_{l0} = LR_{l0} \odot TAtt_{l0}$,
$\tilde{E}_{curr} = LR_{curr} \odot TAtt_{curr}$,
$\tilde{E}_{l1} = LR_{l1} \odot TAtt_{l1}$,   (7)

where $\odot$ denotes element-wise multiplication. The temporally weighted embedded features $\tilde{E}_{l0}$, $\tilde{E}_{curr}$, and $\tilde{E}_{l1}$ are then concatenated, and the current temporal feature $TF_{curr}$ can be represented as

$TF_{curr} = f_{fusion}([\tilde{E}_{l0}, \tilde{E}_{curr}, \tilde{E}_{l1}])$,   (8)

where $f_{fusion}(\cdot)$ denotes the fusion function based on several CNN layers.
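A minimal sketch of this temporal-attention path is given below. Interpreting the dot product in Eq. (6) as a per-pixel inner product over the channel dimension and reducing the fusion $f_{fusion}$ to a single 1x1 convolution are both assumptions made for illustration; layer sizes are likewise illustrative.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of the temporal-attention half of TWA (Eqs. (5)-(8))."""
    def __init__(self, num_feat=64):
        super().__init__()
        self.emb = nn.Conv2d(3, num_feat, 3, 1, 1)      # f_emb, shared across frames
        self.fusion = nn.Conv2d(3 * 3, num_feat, 1)     # f_fusion over concatenated weighted frames

    def forward(self, lr_l0, lr_curr, lr_l1):
        # Eq. (5): embed each LR frame.
        e_l0, e_curr, e_l1 = self.emb(lr_l0), self.emb(lr_curr), self.emb(lr_l1)
        # Eq. (6): per-pixel correlation with the current embedding, then sigmoid.
        att = lambda e: torch.sigmoid((e * e_curr).sum(dim=1, keepdim=True))
        # Eq. (7): weight each LR frame with its temporal attention map (broadcast over channels).
        w_l0   = lr_l0   * att(e_l0)
        w_curr = lr_curr * att(e_curr)
        w_l1   = lr_l1   * att(e_l1)
        # Eq. (8): concatenate and fuse into the temporal feature TF_curr.
        return self.fusion(torch.cat([w_l0, w_curr, w_l1], dim=1))
```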

To extract the spatial feature, we use the 2D Haar DWT. An example of the 2D Haar DWT is shown in Figure 4. An image can be decomposed into four sub-band images using four 2D DWT filters. The low-pass filter $f_{LL}$ yields the approximation of the image, and the high-pass filters $f_{HL}$, $f_{LH}$, and $f_{HH}$ yield the vertical, horizontal, and diagonal edges of the image, respectively. The DWT filters used in this paper are defined as

$f_{LL} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$, $f_{HL} = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}$, $f_{LH} = \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}$, $f_{HH} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$.   (9)
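For reference, the 2D Haar DWT of Eq. (9) can be realized as a stride-2 grouped convolution, as in the sketch below. The omission of filter normalization is an assumption about the exact convention used; spatial sizes are assumed to be even.

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(x):
    """2D Haar DWT of Eq. (9) as a stride-2 grouped convolution.
    Input x: (B, C, H, W) with even H and W; output: (B, 4*C, H/2, W/2),
    with the four sub-bands ordered LL, HL, LH, HH per input channel."""
    c = x.shape[1]
    f_ll = torch.tensor([[1.,  1.], [ 1.,  1.]])
    f_hl = torch.tensor([[1., -1.], [ 1., -1.]])
    f_lh = torch.tensor([[1.,  1.], [-1., -1.]])
    f_hh = torch.tensor([[1., -1.], [-1.,  1.]])
    filters = torch.stack([f_ll, f_hl, f_lh, f_hh]).to(x)    # (4, 2, 2)
    # Apply the same 4 filters to every input channel via groups=c.
    weight = filters.repeat(c, 1, 1).unsqueeze(1)             # (4*c, 1, 2, 2)
    return F.conv2d(x, weight, stride=2, groups=c)
```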

The estimated temporal feature $TF_{curr}$ can be decomposed into four sub-band components using the 2D DWT. The wavelet attention map $WAtt_{curr}$ is then generated by an upsampling function and the sigmoid activation function. This process can be expressed as

$WAtt_{curr} = \mathrm{sigmoid}(f_{up}^{\times 2}(DWT(TF_{curr})))$.   (10)

Because the wavelet attention map $WAtt_{curr}$ is composed of four components, the size of the map is $H \times W \times nF \times 4$, where $nF$ denotes the number of features. Finally, by multiplying the temporal feature $TF_{curr}$ with each DWT component of the wavelet attention map $[WAtt_{curr}^{LL}, WAtt_{curr}^{HL}, WAtt_{curr}^{LH}, WAtt_{curr}^{HH}]$, the temporal wavelet feature $TWF_{curr}$ is obtained as

$TWF_{curr}^{LL} = TF_{curr} \odot WAtt_{curr}^{LL}$,
$TWF_{curr}^{HL} = TF_{curr} \odot WAtt_{curr}^{HL}$,
$TWF_{curr}^{LH} = TF_{curr} \odot WAtt_{curr}^{LH}$,
$TWF_{curr}^{HH} = TF_{curr} \odot WAtt_{curr}^{HH}$,   (11)
$TWF_{curr} = [TWF_{curr}^{LL}, TWF_{curr}^{HL}, TWF_{curr}^{LH}, TWF_{curr}^{HH}]$.   (12)
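Putting Eqs. (10)-(12) together, a hedged sketch of the wavelet-attention path is shown below. It reuses the `haar_dwt2d` helper above and assumes bilinear x2 upsampling and even spatial sizes; both choices are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def wavelet_attention(tf_curr):
    """Sketch of the wavelet-attention half of TWA (Eqs. (10)-(12)).
    Input tf_curr: (B, nF, H, W); output TWF_curr: (B, 4*nF, H, W)."""
    b, nf, h, w = tf_curr.shape
    # Eq. (10): DWT, x2 upsampling back to (H, W), then sigmoid.
    sub_bands = haar_dwt2d(tf_curr)                                  # (B, 4*nF, H/2, W/2)
    watt = torch.sigmoid(F.interpolate(sub_bands, scale_factor=2,
                                       mode='bilinear', align_corners=False))
    # Split into the LL, HL, LH, HH attention maps, each of shape (B, nF, H, W).
    watt = watt.view(b, nf, 4, h, w)
    # Eqs. (11)-(12): weight TF_curr with each sub-band map and concatenate.
    return torch.cat([tf_curr * watt[:, :, i] for i in range(4)], dim=1)
```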

4 Experimental Results

4.1 Datasets and Implementation Details

For training, we used the Vimeo-90K [32] dataset, a large and diverse dataset with high-quality frames and a range of motion types. The resolution of each sample in the Vimeo-90K dataset is $448 \times 256$. We used 3-channel patches of size $64 \times 64$ as inputs and augmented the training data with random horizontal flips and $90^{\circ}$ rotations. We evaluated our methods on the Vid4 [23] dataset, which consists of the four test sequences walk, foliage, city, and calendar commonly reported in recent methods.

In all our experiments, the scaling factor $s$ of SR was set to 4. For evaluation, we use the peak signal-to-noise ratio (PSNR) to measure the quality of each frame. The PSNR value of each video clip was calculated by averaging the PSNRs over all frames in the clip, and the overall PSNR value of the whole Vid4 dataset was then calculated by averaging the PSNR values of the video clips.
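For reproducibility, a common way to compute the Y-channel PSNR for this kind of evaluation is sketched below; the ITU-R BT.601 RGB-to-Y conversion and the simple per-frame averaging are assumptions about the exact convention used in the paper.

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y) channel from an RGB image in [0, 255], using the
    ITU-R BT.601 weights commonly adopted in SR evaluation (an assumption)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 65.481 * r / 255. + 128.553 * g / 255. + 24.966 * b / 255. + 16.

def psnr(pred, gt, max_val=255.):
    """Peak signal-to-noise ratio between two frames of the same shape."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10. * np.log10(max_val ** 2 / mse)

# Per-clip PSNR: average the per-frame PSNR values; the overall Vid4 score
# is then the average over the four clips.
```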

For our GBR-WNN, the network takes seven consecutive frames (i.e., $N=3$) as inputs. The number of features in each residual block was set to 128. To train our network, we used the Charbonnier penalty function [17] as the loss function:

$L = \sqrt{\|\hat{O}_{t} - O_{t}\|^{2} + \varepsilon^{2}}$,   (13)

where $\varepsilon$ was set to $1 \times 10^{-3}$. We used the Adam optimizer [15] and initially set the learning rate to $4 \times 10^{-4}$. The number of iterations was set to 600K.
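A minimal implementation of the Charbonnier penalty in Eq. (13) is shown below, applied element-wise and averaged over all elements with $\varepsilon = 10^{-3}$; the mean reduction is an assumption about the exact form used in training.

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty of Eq. (13), applied element-wise and averaged
    (the reduction over pixels and batch is an assumption)."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```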

The proposed GBR-WNN was implemented in PyTorch on a PC with 8 NVIDIA Tesla V100 16GB GPUs. We trained with a global mini-batch size of 128, which means that the mini-batch size for each GPU was 16.

Table 1: Quantitative comparison on Vid4 for 4× video SR on the Y channel. Red and blue indicate the best and the second-best performance, respectively.
Method | Calendar (dB) | City (dB) | Foliage (dB) | Walk (dB) | Average (dB) | Params.
Bicubic | 20.45 | 25.22 | 23.57 | 26.27 | 23.88 | -
SOF-VSR [30] | 22.66 | 26.94 | 25.45 | 29.18 | 26.06 | 1.0M
WAEN [2] | 23.81 | 27.61 | 26.00 | 30.37 | 26.95 | 9.6M
WDVR [5] | 23.47 | 27.36 | 25.84 | 30.11 | 26.69 | 1.2M
FRVSR [26] | 23.02 | 27.93 | 26.26 | 29.61 | 26.71 | 5.1M
RBPN [10] | 23.96 | 27.74 | 26.21 | 30.70 | 27.15 | 12.7M
GBR-WNN-S (Ours) | 23.93 | 27.75 | 26.18 | 30.81 | 27.17 | 5.9M
GBR-WNN-M (Ours) | 23.99 | 27.79 | 26.20 | 30.88 | 27.21 | 8.8M
GBR-WNN-L (Ours) | 24.00 | 27.80 | 26.22 | 30.89 | 27.23 | 11.8M
Table 2: Quantitative comparison on Vid4 for 4× video SR on RGB channels. Red and blue indicate the best and the second-best performance, respectively.
Method | Calendar (dB) | City (dB) | Foliage (dB) | Walk (dB) | Average (dB) | Params.
Bicubic | 18.96 | 23.75 | 22.21 | 24.94 | 22.47 | -
SOF-VSR [30] | 20.96 | 25.43 | 24.01 | 27.80 | 24.55 | 1.0M
WAEN [2] | 22.04 | 26.08 | 24.59 | 28.99 | 25.42 | 9.6M
WDVR [5] | 21.75 | 25.84 | 24.44 | 28.74 | 25.19 | 1.2M
FRVSR [26] | 21.37 | 26.39 | 24.84 | 28.24 | 25.21 | 5.1M
RBPN [10] | 22.17 | 26.21 | 24.78 | 29.31 | 25.62 | 12.7M
GBR-WNN-S (Ours) | 22.15 | 26.21 | 24.77 | 29.42 | 25.64 | 5.9M
GBR-WNN-M (Ours) | 22.22 | 26.25 | 24.79 | 29.49 | 25.69 | 8.8M
GBR-WNN-L (Ours) | 22.23 | 26.26 | 24.81 | 29.51 | 25.70 | 11.8M
Table 3: Analysis of the adopted GBR framework and TWA module on Vid4 for 4× video SR (experiments here adopt a medium-size model with 20 RBs). Red and blue indicate the best and the second-best performance, respectively.
Model | Test Model1 | Test Model2 | Test Model3 | GBR-WNN-M
Params. | 8.3M | 8.3M | 8.6M | 8.8M
Temporal Modeling Framework | 2D CNN | 2D CNN | Uni-directional RNN | GBR
TWA Module | ✗ | ✓ | ✓ | ✓
PSNR (dB), Y | 26.62 | 26.80 | 26.92 | 27.21
PSNR (dB), RGB | 25.11 | 25.28 | 25.40 | 25.69
Figure 5: Qualitative comparison on Vid4 for 4× video SR on the Calendar and City clips. Panels: (a) Bicubic, (b) SOF-VSR [30], (c) WAEN [2], (d) WDVR [5], (e) FRVSR [26], (f) RBPN [10], (g) GBR-WNN-L, (h) GT. Zoom in to see better visualization.

4.2 Comparison with State-of-the-Art Methods

We compare the proposed GBR-WNN with Bicubic and several state-of-the-art methods covering all three types of temporal modeling frameworks on the Vid4 dataset: (1) 2D CNN: SOF-VSR [30], which is based on HR optical flow estimation, and the wavelet attention embedding network (WAEN) [2]; (2) 3D CNN: the wide-activated 3D convolutional network for video restoration (WDVR) [5]; and (3) RNN: the recurrent back-projection network (RBPN) [10] and the frame-recurrent VSR (FRVSR) [26].

For SOF-VSR [30] and RBPN [10], we used the provided pre-trained models to produce their results. For WAEN [2] and WDVR [5], we trained the models on the Vimeo-90K dataset. We produced the results of FRVSR [26] using the super-resolved Vid4 images provided by the authors, because the source code and the pre-trained model were not released. In this paper, the first two frames and the last two frames of each sequence are not used in the overall performance evaluation because the FRVSR [26] results exclude these four frames.

The quantitative results in terms of PSNR on the Y (luminance) and RGB channels are shown in Table 1 and Table 2, respectively, together with the number of parameters for each model. In our experiments, we evaluated GBR-WNN with different numbers of residual blocks (RBs) in the reconstruction module: a large model with 30 RBs (GBR-WNN-L), a medium model with 20 RBs (GBR-WNN-M), and a small model with 10 RBs (GBR-WNN-S). The results show that models with more RBs tend to perform better.

Compared with the other methods, our GBR-WNN-L shows the best performance and GBR-WNN-M is in second place on average in both the Y and RGB channels. Even though RBPN [10] needs 0.9M additional parameters, our GBR-WNN-L outperforms RBPN [10] in terms of PSNR. Furthermore, when comparing our small model GBR-WNN-S with RBPN, our method still achieves better average performance.

Although FRVSR [26] performs better on the City and Foliage clips in both the Y and RGB channels, the gap between our GBR-WNN-L and FRVSR [26] is small on these clips. On the contrary, the proposed GBR-WNN-L outperforms FRVSR by a large margin on the Calendar and Walk clips. Compared with WAEN, which also uses a DWT-based feature extractor, our methods achieve better performance. This indicates that the proposed GBR framework is more effective in estimating sophisticated temporal features than the general 2D CNN temporal modeling framework.

The qualitative results on the Vid4 dataset for the Calendar and City clips are presented in Figure 5. For each method, the upper row shows a zoomed visual area and the lower row shows the temporal profile. For the temporal profiles, we used the 144th line of the Calendar clip and the 550th line of the City clip over the entire frame sequence. Our GBR-WNN-L recovers more accurate textures with smoother temporal transitions compared with the existing methods.

For a more detailed analysis of the proposed GBR-WNN, Table 3 shows the results of adopting the GBR framework and the TWA module in our method. These results show that combining the two proposed core components produces better performance than using a single component. By comparing Test Model2 with Test Model1, we can verify that the TWA module exploits useful information for the VSR problem. When comparing GBR-WNN-M with Test Model2 and Test Model3, we can see that the GBR framework is efficient and beneficial in dealing with sequential data.

5 Conclusion

In this paper, we have proposed group-based bi-directional recurrent wavelet neural networks (GBR-WNN) for VSR. The proposed method consists of two core components: the group-based bi-directional RNN (GBR) temporal modeling framework and the temporal wavelet attention (TWA) module. In the proposed GBR framework, the temporal continuities between multiple frames are aligned through a management structure based on the group of pictures (GOP). With this enhanced temporal modeling framework, the proposed network can recover precise temporal details and improve SR performance. In addition, the proposed TWA module makes full use of the spatio-temporal information among consecutive frames to generate enriched features. We have compared the performance of the proposed method with other recent state-of-the-art VSR approaches, and the experimental results demonstrate that the proposed GBR-WNN achieves better VSR quality.

References

  • [1] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 4778–4787, 2017.
  • [2] Y. J. Choi, Y. W. Lee, and B. G. Kim. Wavelet attention embedding networks for video super-resolution. In Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), pages 7314–7320, 2020.
  • [3] Q. Dai, S. H. Yoo, A. Kappeler, and A. K. Katsaggelos. Dictionary-based multiple frame video super-resolution. In IEEE International Conference on Image Processing (ICIP), pages 83–87, 2015.
  • [4] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision (ECCV), pages 184–199. Springer, 2014.
  • [5] Y. Fan, J. Yu, D. Liu, and T. S. Huang. An empirical investigation of efficient spatio-temporal modeling in video restoration. In Proceedings of the IEEE conference on computer vision and pattern recognition Workshops, 2019.
  • [6] Esmaeil Faramarzi, Dinesh Rajan, and Marc P Christensen. Unified blind method for multi-image super-resolution and single/multi-image blur deconvolution. IEEE Transactions on Image Processing, 22(6):2101–2114, 2013.
  • [7] Diogo C Garcia, Camilo Dorea, and Ricardo L de Queiroz. Super resolution for multiview images using depth information. IEEE Transactions on Circuits and Systems for Video Technology, 22(9):1249–1256, 2012.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
  • [9] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1664–1673, 2018.
  • [10] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3897–3906, 2019.
  • [11] Yan Huang, Wei Wang, and Liang Wang. Video super-resolution via bidirectional recurrent convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 40(4):1015–1028, 2017.
  • [12] Y. H. Jo, S. W. Oh, J. Y. Kang, and S. J. Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3224–3232, 2018.
  • [13] A. Kappeler, S. H. Yoo, Q. Dai, and A. K. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
  • [14] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1646–1654, 2016.
  • [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097–1105, 2012.
  • [17] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 624–632, 2017.
  • [18] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 4681–4690, 2017.
  • [19] Feng Li, Huihui Bai, and Yao Zhao. Learning a deep dual attention network for video super-resolution. IEEE Transactions on Image Processing, 29:4474–4488, 2020.
  • [20] R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia. Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 531–539, 2015.
  • [21] B. Lim, S. H. Son, H. W. Kim, S. J. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition Workshops, pages 136–144, 2017.
  • [22] Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013.
  • [23] C. Liu and D. Sun. On bayesian adaptive video super resolution. IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013.
  • [24] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
  • [25] K. Purohit, S. Mandal, and A. N. Rajagopalan. Mixed-dense connection networks for image and video super-resolution. Neurocomputing, 398:360–376, 2019.
  • [26] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 6626–6634, 2018.
  • [27] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1874–1883, 2016.
  • [28] H. Takeda, P. Milanfar, M. Protter, and M. Elad. Super-resolution without explicit subpixel motion estimation. IEEE Transactions on Image Processing, 18(9):1958–1975, 2009.
  • [29] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3360–3369, 2020.
  • [30] Longguang Wang, Yulan Guo, Li Liu, Zaiping Lin, Xinpu Deng, and Wei An. Deep video super-resolution using hr optical flow estimation. IEEE Transactions on Image Processing, 29:4323–4336, 2020.
  • [31] X. Wang, K. C.K. Chan, K. Yu, C. Dong, and C. C. Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition Workshops, 2019.
  • [32] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman. Video enhancement with task-oriented flow. arXiv preprint arXiv:1711.09078, 2017.
  • [33] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018.
  • [34] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 2472–2481, 2018.