Online Video Super-Resolution with Convolutional Kernel Bypass Graft
Abstract
Deep learning-based models have achieved remarkable performance in video super-resolution (VSR) in recent years, but most of these models are less applicable to online video applications. These methods solely consider the distortion quality and ignore crucial requirements for online applications, e.g., low latency and low model complexity. In this paper, we focus on online video transmission, in which VSR algorithms are required to generate high-resolution video sequences frame by frame in real time. To address such challenges, we propose an extremely low-latency VSR algorithm based on a novel kernel knowledge transfer method, named convolutional kernel bypass graft (CKBG). First, we design a lightweight network structure that does not require future frames as inputs and saves extra time costs for caching these frames. Then, our proposed CKBG method enhances this lightweight base model by bypassing the original network with “kernel grafts”, which are extra convolutional kernels containing the prior knowledge of external pretrained image SR models. In the testing phase, we further accelerate the grafted multi-branch network by converting it into a simple single-path structure. Experiment results show that our proposed method can process online video sequences up to 110 FPS, with very low model complexity and competitive SR performance.
Index Terms:
Video super-resolution, deep lightweight model, video restoration
I Introduction
Video super-resolution (VSR) is a fundamental task in computer vision, which aims to generate high-resolution (HR) video sequences given the corresponding low-resolution (LR) counterparts. In general, VSR is a challenging problem because of its ill-posed nature, which means that an LR frame can be generated from infinitely many possible HR frames. With the fast development of video applications in the last decade, VSR has shown substantial industrial value and thus attracted researchers' attention.
Online video applications (e.g., cloud gaming, live broadcasting, and online video conferencing) have become increasingly popular, especially during COVID-19. Although existing deep learning-based models have achieved unprecedented success in VSR tasks, most of them are less applicable in online situations because they solely pursue performance improvement and seldom consider latency and model complexity [1, 2]. Therefore, super-resolving online video sequences remains a challenging and necessary problem. In this paper, we focus on the online VSR setting, where users need to receive super-resolved video sequences frame by frame in real time.

In contrast to offline VSR, where there is no restriction on model complexity and latency, online VSR poses two key challenges to existing VSR methods. The first challenge is that online applications have extremely strict requirements on latency and buffering lag, because they involve real-time user interaction. In some online applications (e.g., online video conferencing), even slight latency will significantly affect the user's experience and the conference quality, which is undesirable. However, most state-of-the-art VSR methods need to cache future frames to super-resolve the current frame, which unavoidably introduces large latency. For example, TOF [3] and RLSP [4] use optical flows to align the future frames, and propagate the information contained in the future frames to the current frame for feature aggregation. Secondly, to reduce the transmission bandwidth, online VSR methods are usually deployed on client devices rather than cloud servers, which requires the models to have low complexity and real-time speed. However, the majority of deep VSR models adopt complicated network modules with high computational complexity in exchange for better performance, such as progressive fusion blocks [5, 6, 7] and non-local attention blocks [8, 9]. Client devices (e.g., personal computers and mobile phones) are usually resource-constrained and cannot support models with such high computational complexity. It should be noted that the distortion quality of VSR methods is highly related to their model complexity, and reducing the model complexity inevitably deteriorates the distortion quality of the generated images. Therefore, it is essential for online VSR algorithms to achieve a good trade-off among the distortion quality of the generated videos, the processing latency, and the model complexity.
In this paper, we propose a low-latency online VSR solution based on a novel knowledge transfer method called convolutional kernel bypass graft (CKBG). To maximize performance without cached future frames and with limited model capacity and latency, we propose to leverage the prior information from large pretrained image SR models. Borrowing the concept of a heart bypass graft, CKBG enhances a VSR base model by bypassing the original network with “kernel grafts”, which are extra convolutional kernels containing the prior knowledge of external pretrained image SR models. Specifically, given a large number of kernels extracted from pretrained image SR models, CKBG learns a set of kernel bases by clustering in the Wasserstein space, and the “kernel grafts” are obtained by learning a linear representation in the space spanned by the learned bases. The grafted multi-branch structure of our network can be converted into a single-path structure with re-parameterization [10, 11], which effectively reduces the model complexity and latency.
The main contributions of this paper are as follows:
- We focus on online video super-resolution, where no future frames are accessible, and propose an extremely low-latency and effective online VSR method.
- To further improve the performance, we propose the CKBG scheme, which incorporates the prior information learned from large pretrained image SR models into a VSR base model.
- Experiment results show that our proposed method can process video sequences at up to 110 FPS and achieve promising performance compared with other state-of-the-art deep VSR methods, as summarized in Figure 1.
II Related Works
II-A Deep Lightweight Image SR Methods
In recent years, increasing efforts have been invested in exploring deep lightweight models for image SR, because most promising deep SR methods [12, 13] require high computational complexity, which significantly limits their applications on resource-constrained devices. A common approach [14, 15, 16, 17] to reducing model parameters is to adopt recurrent structures, which share weights and enhance features over multiple cycles for image reconstruction. However, the recursive strategy increases the processing time, and the performance gain is limited. Considering the effectiveness of the group convolution in [18], Ahn et al. [19] proposed a cascaded convolutional network for image super-resolution. They combined a modified residual block with group convolution, significantly reducing the number of model parameters. The methods in [20, 21] extend the group convolution and propose an information-distillation block, which splits the input feature into several groups for further processing and then concatenates the output features of the groups for feature fusion. These methods have shown a remarkable trade-off between reducing model complexity and maintaining distortion quality. Building on the split-and-concatenate strategy, Zhao et al. [22] proposed a pixel-attention mechanism to enhance useful information at each pixel location. Considering the characteristics of local regions, Xiao et al. [23] proposed an efficient method for generating dynamic convolution kernels, which adaptively extracts local features for image super-resolution. Unlike the above methods, which design efficient modules by hand, the methods in [24, 25] employ neural architecture search techniques to automatically find efficient model designs. Inspired by structural re-parameterization techniques [11, 10], Zhang et al. [26] proposed the edge-oriented convolutional block to accelerate the running speed of deep models. Even though these deep lightweight image SR models have shown their effectiveness in terms of performance and real-time running speed, they do not consider the long-range temporal dependency when applied to video sequences.
II-B Deep Video Super-resolution Methods
With the rapid development of video applications, deep learning-based VSR methods have become increasingly popular. Unlike image SR, VSR needs to consider object motions and the temporal correspondence between successive frames. The method in [7] adopted an efficient spatial transformer for motion compensation and then combined it with ESPCN [27] to synthesize HR video sequences. As object motions are spatially variant, DUF [28] used a dual-path residual dense network to predict the residue between the ground-truth frames and the input LR frames in one path, and dynamically upsampled the input LR frames in the other path. In [3], it was found that incorporating an optical flow estimation network into a task-specific network for joint training, named TOF, is beneficial to the overall performance. However, its estimated motion field is different from the ground-truth optical flow, and the accuracy of the estimated optical flow is very sensitive to local illumination changes. Instead of using optical flow for the alignment of video frames, EDVR [5] adopted deformable convolution [29] to align features at multiple scales. According to [30], deformable convolution cannot effectively capture long-range dependencies and suffers from unstable training. In practice, the running speed of deformable convolution is also slow, which cannot satisfy real-time requirements. To achieve better efficiency, Fuoli et al. [4] proposed an efficient recurrent network that exploits information from adjacent frames only for practical VSR tasks. This method directly feeds the extracted features of the hidden state from the previous step into the current step for feature fusion, so that information can propagate temporally along the video sequences. Since VSR is an ill-posed problem, RSDN [31] introduced structural and detailed information to regularize the process of generating HR frames. Recently, bi-directional recurrent methods [32, 33, 34], such as BasicVSR and its variant [32, 34], have shown their effectiveness in VSR, as they can fully exploit the information from both the forward and backward directions of input video sequences. However, these methods need to acquire the whole video sequence beforehand and are hence impractical in online-VSR scenarios. It should be noted that video sequences are transmitted in a streaming format in online applications, so capturing the information contained in future frames inevitably introduces undesirable latency. In this paper, we focus on online VSR scenarios in which future frames are inaccessible and the deployed devices, such as low-configuration client devices, are resource-constrained.
II-C Knowledge Transfer for Image and Video Super-resolution
Knowledge distillation is a well-known technique for transferring knowledge from a large deep model to a smaller one. In knowledge distillation, the large model is called the teacher network, while the smaller model is called the student network. Recently, Gao et al. [35] proposed to distill knowledge from a teacher image super-resolution model by minimizing the distance between statistical properties (e.g., maximum values and mean values) of the feature maps of the teacher network and the student network. He et al. [36] proposed a feature-affinity knowledge distillation (FAKD) method for image super-resolution, which transfers knowledge by using the correlation matrices of feature maps. Lee et al. [37] proposed to leverage the privileged information from ground-truth images and distill the knowledge by minimizing the distance between the features of the teacher network and the student network. Xiao et al. [38] proposed an effective knowledge-distillation method for video super-resolution, which enforces the spatial and temporal characteristics of the teacher network and the student network to be consistent. All these distillation-based super-resolution models require the network topology of the teacher network and the student network to be consistent. In contrast, our proposed CKBG does not restrict the network structures, which provides more flexibility than distillation-based methods. In addition, the large teacher models used in our CKBG-based model are not involved in the training process, which is significantly different from the distillation-based methods.
III The Proposed Method
The overall structure of our proposed method is illustrated in Figure 2. Our method adopts a recurrent structure, which only utilizes the information in the current and previous frames. Therefore, it does not incur the extra cost of caching future frames in the super-resolution process. In particular, the proposed network first uses optical flow to align the features extracted from the previous frame with the current frame, and then temporally fuses the aligned features with the features extracted from the current frame. After that, the temporally aggregated feature is forwarded to the cascaded bypass-graft blocks (BGBs), which are constructed based on our proposed CKBG method, for feature extraction. At the output of our model, we adopt the PixelShuffle operator to upsample the extracted features and combine them with the bilinearly upscaled LR inputs to generate the final HR frames.
III-A Temporal Aggregation
For temporal aggregation, we first utilize SpyNet [39], denoted as $S(\cdot)$, to estimate the optical flow between the current LR frame $I^{LR}_{t}$ and the previous frame $I^{LR}_{t-1}$, which is computed as follows:

$O_{t} = S\left(I^{LR}_{t}, I^{LR}_{t-1}\right)$   (1)

Then, we use the estimated optical flow to perform alignment in the feature space. The warped feature of the previous frame is computed as follows:

$\bar{F}_{t-1} = W\left(F_{t-1}, O_{t}\right)$   (2)

where $F_{t-1}$ represents the feature extracted from the previous frame, $\bar{F}_{t-1}$ is the corresponding warped feature, and $W(\cdot)$ represents the warp operator. To avoid significantly increasing the model complexity, we simply concatenate the feature $F_{t}$ extracted from the current frame with the aligned feature $\bar{F}_{t-1}$ along the channel dimension. Then, a convolutional layer is used to aggregate the features temporally. The aggregated feature is calculated as follows:

$F^{agg}_{t} = \mathrm{Conv}\left(\left[F_{t}, \bar{F}_{t-1}\right]\right)$   (3)

where $[\cdot, \cdot]$ is the concatenation operator and $\mathrm{Conv}(\cdot)$ represents a convolutional operator.
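To make the aggregation step concrete, the following PyTorch sketch shows one possible realization of Eqns. (1)-(3). The `spynet` argument, the `flow_warp` helper, and the channel width are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Warp a feature map `feat` (N, C, H, W) with an optical flow field `flow` (N, 2, H, W)."""
    n, _, h, w = feat.shape
    # Build a normalized sampling grid shifted by the estimated flow.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)          # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                    # (N, 2, H, W)
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)                # (N, H, W, 2)
    return F.grid_sample(feat, grid_norm, align_corners=True)

class TemporalAggregation(nn.Module):
    """Align the previous-frame feature with the current frame and fuse them (Eqns. (1)-(3))."""
    def __init__(self, spynet, channels=32):
        super().__init__()
        self.spynet = spynet                       # pretrained flow estimator, e.g., SpyNet
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, lr_cur, lr_prev, feat_cur, feat_prev):
        flow = self.spynet(lr_cur, lr_prev)        # Eqn. (1): flow between current and previous frames
        feat_warp = flow_warp(feat_prev, flow)     # Eqn. (2): warp the previous-frame feature
        fused = self.fuse(torch.cat([feat_cur, feat_warp], dim=1))  # Eqn. (3): concat + conv
        return fused
```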


III-B Convolutional Kernel Bypass Graft
III-B1 Wasserstein Distance
Suppose $\mu$ and $\nu$ are probability measures defined on the Polish spaces $X$ and $Y$, respectively. Let $\mathcal{P}(X \times Y)$ be the set of probability distributions defined over the space $X \times Y$. The set of joint distributions of $\mu$ and $\nu$ is denoted as $\Pi(\mu, \nu)$, i.e., $\Pi(\mu, \nu) = \{\pi \in \mathcal{P}(X \times Y) : \pi(A \times Y) = \mu(A),\ \pi(X \times B) = \nu(B)\}$ for all measurable sets $A \subseteq X$ and $B \subseteq Y$, where $\pi$ is the joint probability distribution. The optimal transportation problem aims to find an optimal transport mapping with the minimum transportation cost between two locations $x \in X$ and $y \in Y$, which is defined as follows:

$\min_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c(x, y)\, d\pi(x, y)$   (4)

where $c(x, y)$ denotes the cost function from $x$ to $y$. This is the well-known Kantorovich formulation. In this case, $\pi$ is also called a transport mapping, and $\pi(x, y)$ specifies the transported mass between $x$ and $y$. Based on this, the Wasserstein distance [40] is defined as follows:

$W_{p}(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} d(x, y)^{p}\, d\pi(x, y) \right)^{1/p}$   (5)

where $d(\cdot, \cdot)$ is a distance metric and $p \geq 1$. If $p = 2$, it is called the $2$-Wasserstein distance, denoted as $W_{2}$. Assuming that there are $N$ probability distributions $\{\nu_{i}\}_{i=1}^{N}$ defined in the $2$-Wasserstein metric space, the Fréchet mean of these distributions is defined as follows:

$\bar{\mu} = \arg\min_{\mu} \sum_{i=1}^{N} \lambda_{i} W_{2}^{2}(\mu, \nu_{i})$   (6)

where $\lambda_{i}$ is the weight associated with the $i$-th probability distribution, and $\sum_{i=1}^{N} \lambda_{i} = 1$. The solution $\bar{\mu}$ is also called the Wasserstein barycenter, which achieves the minimum weighted $2$-Wasserstein distance to every $\nu_{i}$.
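As a 1-D illustration of Eqn. (6) and of the effect visualized in Figure 4, the toy script below compares the Euclidean mean of two shifted unimodal densities with their 2-Wasserstein barycenter; for 1-D distributions with equal weights, the barycenter can be obtained by averaging the quantile (inverse CDF) functions. This is only a toy example, not the variational solver used later in the paper.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)

def gaussian(mean, std):
    p = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return p / p.sum()

p1, p2 = gaussian(-4.0, 1.0), gaussian(4.0, 1.0)

# Euclidean mean of the densities: becomes bimodal and loses the original unimodal shape.
euclidean_mean = 0.5 * (p1 + p2)

# 2-Wasserstein barycenter (equal weights) in 1-D: average the quantile functions.
qs = np.linspace(0.0, 1.0, 1001)
def quantile(p):
    cdf = np.cumsum(p)
    return np.interp(qs, cdf, x)

barycenter_quantiles = 0.5 * (quantile(p1) + quantile(p2))

print("Euclidean mean peaks near:",
      x[np.argmax(euclidean_mean[:1000])], x[1000 + np.argmax(euclidean_mean[1000:])])
print("Wasserstein barycenter centered near:", barycenter_quantiles.mean())
```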
III-B2 Kernel Prior Learning and Grafting
As shown in Figure 3, the proposed CKBG is based on a two-stage learning framework. At the first stage, CKBG extracts convolutional kernels from a pretrained image SR model, and then learns a set of kernel bases by clustering the extracted kernels in the Wasserstein metric space. After that, the “kernel grafts” are obtained by learning linear representations under the space spanned by the kernel bases. A “kernel graft” is grafted alongside each convolutional layer in the original network, forming a bypass-graft block (BGB). It should be noted that the obtained BGB has a multi-branch structure in the training stage, but the multi-branch structure can be further converted into a single-path structure for acceleration in the testing phase. In this section, we will elaborate on the learning, grafting, and testing of the proposed CKBG.
Kernel Prior Learning. Given a pretrained image SR model, we first extract convolutional kernels from it. In this paper, the adopted image SR model is EDSR [12], which is a classic image SR model and has shown promising results. The extracted kernels contain a lot of prior information for generating HR images. To remove redundancy and obtain more representative kernels, we cluster similar kernels by performing the K-Means algorithm in the Wasserstein space, resulting in a set of cluster centroids. Specifically, assuming that $N$ convolutional kernels, denoted as $\{k_{i}\}_{i=1}^{N}$, are obtained from a pretrained image SR model, the cluster centroids are obtained by solving the following optimization problem:

$\min_{\{S_{j}\}, \{c_{j}\}} \sum_{j=1}^{K} \sum_{k_{i} \in S_{j}} W_{2}^{2}(k_{i}, c_{j})$   (7)

where $\{S_{j}\}_{j=1}^{K}$ denotes the partition of the extracted kernels and $K$ is the number of clusters. $c_{j}$ represents the $j$-th cluster centroid in the Wasserstein space. From Eqn. (6), we can find that obtaining the kernel centroids is equivalent to computing the Wasserstein barycenter of each kernel cluster. In this case, the weights associated with the Wasserstein distances between the kernels and the centroids are all the same. We use the variational method provided in [41] to solve the above optimization problem. To better inherit the prior information from the extracted kernels, we perform K-Means clustering in the Wasserstein space rather than in the Euclidean space, because the optimal transport mapping can better preserve the geometric properties (e.g., shape) of the extracted kernels. As a result, the cluster centroids learned in the Wasserstein space have a geometric structure similar to that of the extracted kernels. Figure 4(a) illustrates a 1-D example of six unimodal distributions. As shown in Figure 4(b), the centroid distribution computed in the Euclidean space is severely distorted. In contrast, the geometric properties of the cluster centroids obtained in the Wasserstein space are more similar to those of the original distributions. It is worth noting that the output responses are highly related to the geometric properties of the kernels, so the kernel centroids obtained in the Wasserstein space avoid such distortion, leading to similar output responses.
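The clustering step can be sketched under a strong simplification: each extracted 3×3 kernel is flattened, shifted to be non-negative, and normalized so that it can be treated as a 1-D discrete distribution, and the cluster centers are updated as medoids, a cheap stand-in for the Wasserstein barycenter. The actual method clusters 2-D kernels with the variational solver of [41]; the sketch only illustrates the assign/update loop of Eqn. (7).

```python
import numpy as np
from scipy.stats import wasserstein_distance

def to_distribution(kernel):
    """Treat a flattened 3x3 kernel as a discrete distribution over its 9 positions (toy view)."""
    w = kernel.flatten() - kernel.min() + 1e-8   # shift to non-negative weights
    return w / w.sum()

def w_dist(p, q):
    positions = np.arange(p.size)
    return wasserstein_distance(positions, positions, p, q)

def wasserstein_kmedoids(kernels, n_clusters=4, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    dists = np.array([to_distribution(k) for k in kernels])
    centroids = dists[rng.choice(len(dists), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid in the Wasserstein metric (Eqn. (7)).
        labels = np.array([np.argmin([w_dist(d, c) for c in centroids]) for d in dists])
        # Update step: medoid of each cluster as a cheap stand-in for the Wasserstein barycenter.
        for j in range(n_clusters):
            members = dists[labels == j]
            if len(members) == 0:
                continue
            costs = [sum(w_dist(m, o) for o in members) for m in members]
            centroids[j] = members[int(np.argmin(costs))]
    return centroids, labels

# Example: cluster 100 random 3x3 kernels (stand-ins for kernels extracted from EDSR).
kernels = np.random.randn(100, 3, 3)
centroids, labels = wasserstein_kmedoids(kernels)
print(centroids.shape, np.bincount(labels))
```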


The proposed CKBG enhances the base VSR model by learning a set of new kernels and grafting them into the original network. We call the learned kernels “kernel grafts”. In this paper, the “kernel grafts” are learned based on a kernel space constructed from the cluster centroids, which contain prior knowledge extracted from the pretrained image SR model. Specifically, this kernel space is represented by the kernel bases obtained by performing principal component analysis on the cluster centroids, as follows:

$C C^{T} = U \Sigma U^{T}$   (8)

where $C$ is the cluster centroid matrix, $C^{T}$ is its transpose, $U$ is the eigenvector matrix of $C C^{T}$, and $\Sigma$ contains the corresponding eigenvalues sorted in descending order. The eigenvectors in $U$ are the principal components of the cluster centroids, which can be used as the potential bases of the kernel prior sub-space. Larger eigenvalues indicate that the corresponding eigenvectors are more significant.
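A minimal NumPy sketch of Eqn. (8), with randomly generated placeholders standing in for the learned centroids:

```python
import numpy as np

# Suppose `centroids` holds K cluster centroids of vectorized 3x3 kernels, shape (K, 9).
centroids = np.random.randn(32, 9)               # placeholder for the learned centroids
C = centroids.T                                  # centroid matrix, one centroid per column (9 x K)

# Eqn. (8): eigendecomposition of C C^T gives the principal components of the centroids.
M = C @ C.T                                      # (9 x 9)
eigvals, eigvecs = np.linalg.eigh(M)             # returned in ascending order
order = np.argsort(eigvals)[::-1]                # sort eigenvalues in descending order
eigvals, bases = eigvals[order], eigvecs[:, order]

# Each column of `bases` is a candidate kernel basis; larger eigenvalues mark more significant ones.
print(bases.shape, eigvals[:3])
```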
Kernel Graft. As shown in Figure 3, for each convolutional layer in the base network, several kernel grafts are obtained, which bypass the original convolutional kernels. For each kernel graft, we first select a set of bases to form the kernel space, and the selected bases are the principal components of the cluster centroids sampled from the following categorical distribution:

$P(u_{i}) = \frac{\sigma_{i}}{\sum_{j} \sigma_{j}}$   (9)

where $u_{i}$ denotes the $i$-th eigenvector in $U$ and $\sigma_{i}$ is the corresponding eigenvalue. This implies that the bases are sampled according to their significance, which corresponds to their eigenvalues: the more significant principal components are more likely to be selected as the bases of the kernel space. Meanwhile, the randomness introduced by the sampling process brings diversity to the kernel grafts.

Finally, the convolutional kernel graft is obtained by learning a linear combination of the sampled kernel bases. As shown in Figure 5, in our implementation, learning a linear combination of the kernel bases is equivalent to adding a $1 \times 1$ convolution (i.e., the patterned boxes) before the “kernel graft”. This $1 \times 1$ convolution is updated simultaneously with the parameters of the base network, while the parameters of the “kernel grafts” are kept fixed.
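The sampling of Eqn. (9) and the linear-combination view of the graft can be sketched as follows. Here the sampled bases are stored as fixed depthwise 3×3 kernels and a trainable 1×1 convolution learns their combination; this realizes the same linear combination as Figure 5, where the 1×1 convolution is drawn before the fixed graft, and all class and argument names are illustrative rather than the released implementation.

```python
import numpy as np
import torch
import torch.nn as nn

def sample_bases(bases, eigvals, n_bases, rng):
    """Sample basis indices from the categorical distribution of Eqn. (9)."""
    probs = eigvals / eigvals.sum()
    idx = rng.choice(len(eigvals), size=n_bases, replace=False, p=probs)
    return bases[:, idx].T.reshape(n_bases, 1, 3, 3)      # (n_bases, 1, 3, 3) basis kernels

class KernelGraftBranch(nn.Module):
    """Fixed basis kernels followed by a learnable 1x1 combination (the 'kernel graft')."""
    def __init__(self, basis_kernels, in_ch, out_ch):
        super().__init__()
        n = basis_kernels.shape[0]
        # Depthwise-style fixed 3x3 bases applied to every input channel.
        weight = torch.as_tensor(basis_kernels, dtype=torch.float32).repeat(in_ch, 1, 1, 1)
        self.bases = nn.Conv2d(in_ch, in_ch * n, 3, padding=1, groups=in_ch, bias=False)
        self.bases.weight.data.copy_(weight)
        self.bases.weight.requires_grad_(False)            # prior kernels stay fixed
        self.combine = nn.Conv2d(in_ch * n, out_ch, 1)     # trainable linear combination

    def forward(self, x):
        return self.combine(self.bases(x))

# Placeholder bases/eigenvalues, e.g., obtained from the PCA step sketched above.
rng = np.random.default_rng(0)
C = rng.standard_normal((9, 32))
eigvals, bases = np.linalg.eigh(C @ C.T)
eigvals, bases = eigvals[::-1], bases[:, ::-1]

basis_kernels = sample_bases(bases, eigvals, n_bases=4, rng=rng)
graft = KernelGraftBranch(basis_kernels, in_ch=32, out_ch=32)
print(graft(torch.randn(1, 32, 16, 16)).shape)             # torch.Size([1, 32, 16, 16])
```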
BGB in testing. In the testing phase, the multi-branch structure of a BGB can be converted into a single convolution according to the linear property of the convolutional operation [10]. The network structure of the proposed BGB in the training and testing stages is illustrated in Figure 5. Specifically, the consecutive $1 \times 1$ and $3 \times 3$ convolutional operators are first merged into a single $3 \times 3$ convolution. Then, all the parallel convolutional operators are merged into a single convolutional operator, resulting in a highly efficient single-path structure. It is worth noting that the re-parameterized kernel at the testing stage is equivalent to the original multi-branch structure in the training stage, without sacrificing any performance. More details about kernel re-parameterization can be found in Appendix A.

III-C Loss function
We adopt the Charbonnier loss to measure the distance between the generated frames and the ground-truth (GT) frames, which is defined as follows:
$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\left\| I^{SR}_{t} - I^{GT}_{t} \right\|^{2} + \varepsilon^{2}}$   (10)

where $I^{SR}_{t}$ and $I^{GT}_{t}$ denote the generated SR frame and the ground-truth frame at the $t$-th time step, respectively, $T$ is the number of input LR frames, and $\varepsilon$ is a hyper-parameter.
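A compact PyTorch form of Eqn. (10), applied per pixel and averaged; the value of ε below is an assumed placeholder, since the paper's exact setting is not reproduced here:

```python
import torch

def charbonnier_loss(sr_frames, gt_frames, eps=1e-3):
    """Charbonnier loss of Eqn. (10); `eps` is an assumed value, not the paper's setting."""
    # sr_frames, gt_frames: tensors of shape (T, C, H, W) for a super-resolved sequence.
    diff = sr_frames - gt_frames
    return torch.sqrt(diff.pow(2) + eps ** 2).mean()

# Example usage with random tensors standing in for SR and GT frames.
loss = charbonnier_loss(torch.rand(5, 3, 64, 64), torch.rand(5, 3, 64, 64))
```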
IV Experiments and Analysis
Datasets. The REDS dataset [42] and Vimeo-90K dataset [3], which are two widely used datasets in VSR, are adopted for training. In the testing stage, the REDS4 dataset [32] and Vid4 dataset [43] are used for evaluation. In these two datasets, the upscaling factor is 4, and the bicubic downsampling kernel is used to generate LR video sequences.
Evaluation Metrics. The performance of online VSR algorithms should be evaluated from three perspectives: model complexity, latency, and the distortion quality of the generated videos. We measure model complexity by the number of model parameters, the number of floating-point operations (FLOPs), and the number of activations. For distortion quality assessment, the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) are adopted. Since our proposed method focuses on online scenarios, the latency is very important and is defined as follows:

$\mathrm{Latency} = T_{\mathrm{cache}} + T_{\mathrm{run}}$   (11)

where $T_{\mathrm{cache}}$ denotes the time to cache the frames required by a VSR method and $T_{\mathrm{run}}$ denotes the runtime required to super-resolve an input frame. If the frame rate is 24 FPS, the cache time for one frame is approximately 40 ms. It is worth noting that distortion and latency form a well-known trade-off. The trade-off score function of [44] is used to objectively measure the performance of the algorithms in online situations and is defined as follows:

(12)

where $C$ is a constant set in the experiment. A model with a higher score achieves a better trade-off between distortion quality and latency, and is thus more effective and applicable in online scenarios.
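As a quick arithmetic check of Eqn. (11) against the frame rates reported in Table II (no frames are cached in the online setting, so the caching term is zero):

```python
def latency_ms(cache_ms, run_ms):
    # Eqn. (11): total latency is the caching time plus the per-frame runtime.
    return cache_ms + run_ms

def fps(latency):
    return 1000.0 / latency

print(fps(latency_ms(0, 9)))    # ~111 FPS on a 2080 Ti (Table II)
print(fps(latency_ms(0, 12)))   # ~83 FPS on a 1080 Ti (Table II)
```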
Implementation Details. In the experiments, we train our proposed model using fixed-size patches randomly cropped from the input video sequences. For data augmentation, we randomly flip and rotate the input video sequences. The number of feature channels and the batch size are kept fixed throughout training. We use Adam [45] to update the weights of the model, and the cosine annealing strategy is utilized to adaptively adjust the learning rate from its initial value over a fixed total number of training iterations. We implemented the proposed method with PyTorch on Tesla V-100 GPUs, and training took approximately eight days to complete.
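The optimizer and learning-rate schedule described above could be configured along the following lines; the learning rate, momentum terms, and iteration count shown are placeholders, not the paper's exact values:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the CKBG network

base_lr = 1e-4          # placeholder value; the paper's exact learning rate is not reproduced here
total_iters = 300_000   # placeholder iteration count
optimizer = Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.999))  # beta values are assumptions
scheduler = CosineAnnealingLR(optimizer, T_max=total_iters)           # cosine-annealed learning rate

# Inside the training loop, after each iteration:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```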
| | Methods | # Param. | FLOPs | # Act. | Online | Time | Score | REDS4 (RGB) PSNR/SSIM | Vid4 (Y) PSNR/SSIM |
|---|---|---|---|---|---|---|---|---|---|
| | Bicubic | - | - | - | - | - | - | 26.13/0.7388 | 23.78/0.6347 |
| SISR | EDSR-M [12] | 1,571 K | 114.28 G | 201.83 M | ✓ | 17 ms | 19.46 | 28.37/0.8078 | 25.31/0.6608 |
| SISR | PAN [22] | 272 K | 28.29 G | 237.88 M | ✓ | 12 ms | 28.34 | 28.41/0.8089 | 25.35/0.7306 |
| VSR | TOF [3] | 1,405 K | 2,175.25 G | 1,251.93 M | ✗ | 334 ms | 0.75 | 27.98/0.7990 | 25.89/0.7651 |
| VSR | RLSP [4] | 1,154 K | 132.94 G | 108.74 M | ✗ | 47 ms | 7.86 | 28.53/0.8136 | 25.69/0.7530 |
| VSR | RRN [46] | 3,364 K | 193.62 G | 164.96 M | ✓ | 20 ms | 22.59 | 28.82/0.8234 | 25.85/0.7660 |
| VSR | BasicVSR∗ [32] | 1,887 K | 71.33 G | 185.24 M | ✓ | 32 ms | 27.38 | 29.81/0.8535 | 26.47/0.7986 |
| VSR | CKBG (ours) | 1,750 K | 17.85 G | 34.09 M | ✓ | 9 ms | 91.14 | 29.73/0.8514 | 26.34/0.7857 |


IV-A Experiment on REDS4 and Vid4 Datasets
In this experiment, we compared our proposed method with EDSR-M [12], PAN [22], RLSP [4], RRN [46], TOF [3], and BasicVSR [32]. Among these methods, EDSR-M and PAN are single-image SR models with lower model complexity. Since online VSR algorithms have a high requirement on latency, these two methods are relevant baselines, as they do not need to cache frames and can process video in real time. RLSP and RRN are recurrent-based video SR methods and have achieved promising results. Specifically, RRN only incorporates the information of the previous frame into the current state for generating HR frames, so this method is also suitable for online scenarios. Both methods were originally trained on videos blurred by Gaussian kernels, so we retrained them on the same training datasets for a fair comparison. BasicVSR is a bi-directional VSR model, so it is not directly applicable to online scenarios. We modified BasicVSR by removing the backward part and reducing its model size for online application, and denote the resulting model as BasicVSR∗. The implementation details of BasicVSR∗ can be found in Appendix B. We used the open-source codes provided by the authors to implement the compared methods. All the evaluation results, including the number of model parameters, the number of activations, FLOPs, the average PSNR and SSIM, the required runtime, and the trade-off score, are reported in Table I.

As observed, our proposed model has the lowest computational complexity, requiring fewer FLOPs than all other compared methods, including the two image SR methods, EDSR-M and PAN. Although our proposed method uses an external network to estimate optical flow, the number of model parameters required by our method is only slightly higher than that of EDSR-M, which is acceptable. For distortion quality, our proposed method significantly outperforms EDSR-M, PAN, TOF, RLSP, and RRN in terms of PSNR and SSIM. This shows that our proposed method is more efficient and effective for VSR. Surprisingly, the latency of our proposed method measured on a Tesla V-100 is less than 10 ms on the REDS4 dataset. The LR frames in the REDS4 dataset have a resolution of 320×180, and the upscaling factor used in our experiments is 4. This means that our proposed method takes less than 10 ms to produce a 1280×720 video frame in online scenarios, which is highly efficient. In addition, our method has much lower latency than the most efficient compared image SR model (i.e., PAN) and runs much faster than the other compared VSR methods. Overall, compared to other efficient image SR and VSR methods, our proposed method has extremely low latency and achieves competitive performance in terms of distortion quality. Therefore, our proposed method obtains a higher trade-off score, which shows that it outperforms the other compared methods in online situations.

Figures 6 and 7 illustrate the visual results of different VSR methods. For better visualization, a region of each image marked by a red rectangular box is cropped and enlarged. In Figure 6, we can find that the video frames generated by EDSR, PAN, RLSP, TOF, and RRN suffer from severe distortion and lack sharp texture information in the marked dense grids, while our method can effectively preserve object details (e.g., edges and textures). Similarly, in Figure 7, we can easily observe that our proposed method generates clearer image content (e.g., the word “REE” in the marked region) than the other compared methods. Overall, the illustrated results show that our proposed method is better at producing images with low distortion and high visual quality. Please find more visual results in Appendix C.
IV-B Experiment on Different Hardware Devices
Online VSR methods are usually deployed on clients' devices, and the configurations of personal devices vary. In this experiment, we evaluated the latency of different VSR methods on two devices, an NVIDIA 2080 Ti and a 1080 Ti. These two devices have been widely used in gaming applications, which require video algorithms to process up to 60 FPS (https://www.gamingscan.com/best-fps-gaming). The REDS4 dataset is used in this experiment, where the size of the generated video frame is 1280×720. The evaluation results on the 2080 Ti and 1080 Ti are shown in Table II. In addition, the relationship between the PSNR and the achieved FPS is illustrated in Figure 1. We can find that the latency of our method is significantly lower than that of the other methods. In particular, on the 2080 Ti, the processing speed of our proposed method reaches up to 110 FPS. It is worth noting that none of the three compared methods can meet the latency requirements of online gaming on devices equipped with a 1080 Ti, but our method can still process the video sequences at a rate of up to 80 FPS. Note that personal laptops equipped with a 1080 Ti are not high-end configurations for online gaming. This shows that our method is less affected by hardware devices and more applicable to different industrial products.
| | EDSR-M | PAN | RRN | CKBG (ours) |
|---|---|---|---|---|
| # Param. | 1,571 K | 272 K | 3,364 K | 1,750 K |
| PSNR | 28.37 | 28.41 | 28.82 | 29.73 |
| 2080 Ti: Time | 22 ms | 16 ms | 27 ms | 9 ms |
| 2080 Ti: FPS | 45.45 | 62.5 | 37.03 | 111.11 |
| 2080 Ti: Gaming | ✗ | ✓ | ✗ | ✓ |
| 1080 Ti: Time | 40 ms | 36 ms | 44 ms | 12 ms |
| 1080 Ti: FPS | 25.00 | 27.77 | 22.72 | 83.33 |
| 1080 Ti: Gaming | ✗ | ✗ | ✗ | ✓ |
IV-C Ablation Study on Kernel Graft
In this experiment, we evaluate the effectiveness of the kernel bypass graft. Specifically, we train our proposed method under two additional settings for comparison. In the first setting, the model does not adopt any bypass-graft kernel, denoted as w/o Graft. In the second, the model adopts bypass-graft kernels obtained in the Euclidean space, denoted as E-KMeans. Our proposed method, which learns the kernel grafts in the Wasserstein space, is denoted as W-KMeans. The average PSNR and SSIM of the different settings on the REDS4 dataset are shown in Table III. As observed, the model with grafts learned in the Euclidean space achieves results comparable to those of the model without grafts, because the grafted kernels learned in the Euclidean space cannot effectively inherit prior information from the extracted kernels. In contrast, our model using kernel grafts learned in the Wasserstein space significantly outperforms the other settings, because the optimal transport mapping helps the learned kernels better preserve the geometric properties.
| | w/o Graft | E-KMeans | W-KMeans |
|---|---|---|---|
| PSNR | 29.58 | 29.59 | 29.73 |
| ΔPSNR | +0.00 | +0.01 | +0.15 |
| SSIM | 0.8480 | 0.8484 | 0.8514 |
| ΔSSIM | +0.0000 | +0.0004 | +0.0034 |
IV-D Ablation Study on Different Kernels
In this experiment, we compared our proposed bypass-grafted kernels with other re-parameterized kernels, namely the ACB kernel [47], the RepVGG kernel [11], and the ECB kernel [26]. Specifically, we retrained the model by replacing our proposed grafted kernel with each of these re-parameterized kernels. The average PSNR and SSIM of the different kernel settings on the REDS4 dataset are shown in Table IV. We can find that our method achieves the best performance among the compared kernel settings. Figure 8 illustrates the feature maps generated from the last convolutional block of the models with different kernels. We highlight regions of dense grids with red and blue rectangular boxes for better visual comparison. As observed, the result generated by our method looks much sharper, and the marked region of dense grids contains less distorted content.
| | ACB [47] | RepVGG [11] | ECB [26] | CKBG (ours) |
|---|---|---|---|---|
| PSNR | 29.53 | 29.51 | 29.58 | 29.73 |
| SSIM | 0.8471 | 0.8464 | 0.8478 | 0.8514 |




IV-E Ablation Study on Knowledge Transfer
We further compare our method with two deep super-resolution models based on knowledge distillation, i.e., FAKD [36] and FDVDNet [38]. These methods distill prior knowledge from the teacher network based on the feature maps generated from its intermediate layers. In addition, the teacher networks provide supervised signals to guide the training of the student networks. In contrast, our proposed CKBG method extracts convolutional kernels from a pretrained image super-resolution model and grafts the extracted kernels into a lightweight model before training. The average PSNR and runtime of the different methods on the Vid4 dataset are tabulated in Table V, where the best results are highlighted in bold. The runtime of the methods is measured on a device equipped with an NVIDIA GeForce GTX 1080 Ti. From Table V, our method achieves the best performance with the lowest runtime, compared with FAKD and FDVDNet. This shows that our method is more effective and efficient for video super-resolution.
| | Bicubic | FAKD [36] | FDVDNet [38] | CKBG (ours) |
|---|---|---|---|---|
| PSNR | 23.78 | 25.42 | 26.14 | 26.34 |
| Time | - | 28.31 ms | 17.50 ms | 12.26 ms |
V Conclusion
Online video applications impose high requirements on video super-resolution (VSR) algorithms in terms of processing latency, model complexity, and distortion quality, but few existing VSR methods can handle these challenges simultaneously. In this paper, we propose an extremely low-latency VSR method for online applications, based on a novel knowledge transfer method called convolutional kernel bypass graft (CKBG). The proposed CKBG improves the performance of a base VSR network by bypassing the original network with a set of extra kernels (i.e., kernel grafts) containing rich prior knowledge from external pretrained SR models. Experiment results show that our proposed method can process a video sequence at a rate of up to 110 FPS and achieves the best trade-off between distortion quality and processing latency, compared with other competitive VSR methods.
Appendix A Kernel Re-parameterization of CKBG
In this paper, we propose a novel kernel knowledge-transfer method, called convolutional kernel bypass graft (CKBG), which enhances a base VSR model by bypassing the original network with “kernel grafts” learned from a large pretrained model. The resulting network module is called a bypass graft block (BGB), which has a multi-branch structure for feature extraction. In our method, the proposed BGB is trained in the form of a multi-branch structure, but the multiple branches can be converted into a single convolutional operator in the testing stage because of the linear property of the convolutional operation [11]. This kernel conversion process is called kernel re-parameterization, as illustrated in Figure 5. Our proposed method involves two types of kernel re-parameterization: sequential convolution and parallel convolution. In the testing stage, the sequence of a $1 \times 1$ convolution and a $3 \times 3$ convolution is first merged into a single $3 \times 3$ convolution. Then, the convolutional kernels of all parallel branches are merged into a single convolution kernel. The resulting structure is a single-path topology. Next, we elaborate on how to perform kernel re-parameterization for sequential and parallel convolutions.
Type I: Sequential convolution. Suppose a sequential convolution of a $1 \times 1$ kernel and a $3 \times 3$ kernel, denoted as $K_{1} \in \mathbb{R}^{D \times C \times 1 \times 1}$ and $K_{3} \in \mathbb{R}^{E \times D \times 3 \times 3}$, respectively, where $C$, $D$, and $E$ represent the numbers of channels of the input feature, the intermediate feature, and the output feature, respectively. Given an input feature $X$, the output of the sequential convolution is computed as follows:

$Y = K_{3} \ast \left( K_{1} \ast X + b_{1} \right) + b_{3}$   (13)
$\;\;\;\, = K_{3} \ast \left( K_{1} \ast X \right) + K_{3} \ast b_{1} + b_{3}$   (14)

where $\ast$ denotes the convolutional operation, and $b_{1}$ and $b_{3}$ are the corresponding bias terms. According to the linear property of convolution, the convolution kernels $K_{1}$ and $K_{3}$ in the first term can be merged as follows:

$\hat{K} = K_{3} \ast K_{1}^{T}$   (15)

where $K_{1}^{T} \in \mathbb{R}^{C \times D \times 1 \times 1}$ is the transpose of $K_{1}$ (obtained by swapping its input and output channel dimensions), and $\hat{K} \in \mathbb{R}^{E \times C \times 3 \times 3}$ denotes the resulting convolution kernel with the size of $3 \times 3$. For the second term $K_{3} \ast b_{1}$, the corresponding re-parameterized result is computed as follows:

$\hat{b}^{\prime}_{j} = \sum_{d=1}^{D} b_{1,d} \sum_{u=1}^{3} \sum_{v=1}^{3} K_{3}(j, d, u, v)$   (16)

where $j = 1, \dots, E$, and $K_{3}(j, d, u, v)$ represents the kernel value at the position $(u, v)$ of the convolution kernel $K_{3}$. Then, the final re-parameterized bias term is obtained as follows:

$\hat{b}_{j} = \hat{b}^{\prime}_{j} + b_{3,j}$   (17)

As a result, the sequential convolution of a $1 \times 1$ convolution and a $3 \times 3$ convolution is merged into a single convolutional kernel after performing kernel re-parameterization, and Eqn. (14) can be re-written as follows:

$Y = \hat{K} \ast X + \hat{b}$   (18)
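The sequential merge of Eqns. (13)-(18) can be verified numerically with a few lines of PyTorch; the helper below folds a 1×1 convolution followed by a 3×3 convolution into a single 3×3 convolution and checks that the outputs match.

```python
import torch
import torch.nn.functional as F

def merge_1x1_3x3(k1, b1, k3, b3):
    """Fold a 1x1 conv (k1: D x C x 1 x 1, b1: D) followed by a 3x3 conv (k3: E x D x 3 x 3, b3: E)
    into one 3x3 conv, as in Eqns. (15)-(17)."""
    k_merged = F.conv2d(k3, k1.permute(1, 0, 2, 3))   # Eqn. (15): merged kernel of shape (E, C, 3, 3)
    b_merged = k3.sum(dim=(2, 3)) @ b1 + b3           # Eqns. (16)-(17): the 1x1 bias passes
    return k_merged, b_merged                         # through the 3x3 kernel and adds to its bias

torch.manual_seed(0)
C, D, E = 4, 6, 8
x = torch.randn(1, C, 16, 16)
k1, b1 = torch.randn(D, C, 1, 1), torch.randn(D)
k3, b3 = torch.randn(E, D, 3, 3), torch.randn(E)

# No zero padding here: with zero padding, border pixels require the intermediate feature to be
# padded with b1 (as done in [10]) for the equivalence to hold exactly.
y_two_step = F.conv2d(F.conv2d(x, k1, b1), k3, b3)
k_m, b_m = merge_1x1_3x3(k1, b1, k3, b3)
y_merged = F.conv2d(x, k_m, b_m)
print(torch.allclose(y_two_step, y_merged, atol=1e-4))  # True
```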


Type II: Parallel convolution. Assume that the parameter set of the $m$-th branch in the $l$-th BGB is denoted as $\theta^{(l)}_{m} = \{K^{(l)}_{m}, b^{(l)}_{m}\}$, where $K^{(l)}_{m}$ and $b^{(l)}_{m}$ represent the convolution kernel and the bias term, respectively. Without loss of generality, suppose a BGB has $M$ branches. Given an input feature map $X$, the output of the multi-branch block is computed as follows:

$Y = \sum_{m=1}^{M} \left( K^{(l)}_{m} \ast X + b^{(l)}_{m} \right)$   (19)

Then, we can perform kernel re-parameterization based on the linear property of convolution again. As a result, Eqn. (19) can be re-written as follows:

$Y = \left( \sum_{m=1}^{M} K^{(l)}_{m} \right) \ast X + \sum_{m=1}^{M} b^{(l)}_{m}$   (20)
$\;\;\;\, = \hat{K}^{(l)} \ast X + \hat{b}^{(l)}$   (21)

where $\hat{K}^{(l)} = \sum_{m=1}^{M} K^{(l)}_{m}$ and $\hat{b}^{(l)} = \sum_{m=1}^{M} b^{(l)}_{m}$ are the corresponding re-parameterized convolution kernel and bias term. As a result, the multi-branch structure is converted into a single-path structure in the testing stage.
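Similarly, Eqns. (19)-(21) amount to summing the kernels and biases of the parallel branches; a quick numerical check, assuming all branches have already been converted to 3×3 convolutions of the same shape:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
M, C, E = 3, 4, 8                                   # M parallel branches
x = torch.randn(1, C, 16, 16)
kernels = [torch.randn(E, C, 3, 3) for _ in range(M)]
biases = [torch.randn(E) for _ in range(M)]

# Eqn. (19): sum of the branch outputs.
y_branches = sum(F.conv2d(x, k, b, padding=1) for k, b in zip(kernels, biases))

# Eqns. (20)-(21): a single conv with the summed kernel and bias.
k_hat, b_hat = torch.stack(kernels).sum(0), torch.stack(biases).sum(0)
y_merged = F.conv2d(x, k_hat, b_hat, padding=1)
print(torch.allclose(y_branches, y_merged, atol=1e-4))  # True
```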
Since kernel re-parameterization relies on the linear property of the convolution operation, the re-parameterized kernel in the testing stage is equivalent to the multi-branch structure in the training stage, without sacrificing any performance.
Appendix B Implementation Details of BasicVSR∗
BasicVSR [32] is a bi-directional model and has shown its effectiveness for VSR. However, BasicVSR cannot be applied to online video applications because it requires access to the whole video sequence, which is impractical in this setting. To make BasicVSR meet the online requirements, we modify it into a variant denoted as BasicVSR∗. Specifically, we first remove the backward branch of the model, so it does not need to cache future frames. To meet the requirement of low model complexity, we reduce the number of residual blocks from 60 to 15 and the number of feature channels from 64 to 32. The other configurations remain the same as in the original model for training.
Appendix C More Visual Results
In Figures 9 and 10, we provide additional visual results generated by different image/video SR methods for visual comparison. As illustrated, our proposed method preserves the shapes and textures of objects better than the other compared methods. In addition, the images generated by our proposed method contain less distorted content than those produced by the other compared methods, leading to the best visual quality.
References
- [1] X. Nan, X. Guo, Y. Lu, Y. He, L. Guan, S. Li, and B. Guo, “Delay–rate–distortion optimization for cloud gaming with hybrid streaming,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 12, pp. 2687–2701, 2016.
- [2] C. P. Lau, A. Alabbasi, and B. Shihada, “An efficient live tv scheduling system for 4g lte broadcast,” IEEE Systems Journal, vol. 11, no. 4, pp. 2737–2748, 2016.
- [3] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, vol. 127, pp. 1106–1125, 2019.
- [4] D. Fuoli, S. Gu, and R. Timofte, “Efficient video super-resolution through recurrent latent space propagation,” in Proceedings of the IEEE International Conference on Computer Vision Workshop, 2019.
- [5] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, “Edvr: Video restoration with enhanced deformable convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
- [6] M. Haris, G. Shakhnarovich, and N. Ukita, “Recurrent back-projection network for video super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- [7] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [8] W. Li, X. Tao, T. Guo, L. Qi, J. Lu, and J. Jia, “Mucan: Multi-correspondence aggregation network for video super-resolution,” in European Conference on Computer Vision, 2020.
- [9] P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma, “Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations,” in Proceedings of the IEEE International Conference on Computer Vision, 2019.
- [10] X. Ding, X. Zhang, J. Han, and G. Ding, “Diverse branch block: Building a convolution as an inception-like unit,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
- [11] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “Repvgg: Making vgg-style convnets great again,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
- [12] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
- [13] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
- [14] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1637–1645.
- [15] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3147–3155.
- [16] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3867–3876.
- [17] Q. Li, Z. Li, L. Lu, G. Jeon, K. Liu, and X. Yang, “Gated multiple feedback network for image super-resolution,” in British Machine Vision Conference (BMVC), 2019.
- [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- [19] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 252–268.
- [20] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 723–731.
- [21] Z. Hui, X. Gao, Y. Yang, and X. Wang, “Lightweight image super-resolution with information multi-distillation network,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2024–2032.
- [22] H. Zhao, X. Kong, J. He, Y. Qiao, and C. Dong, “Efficient image super-resolution using pixel attention,” in Proceedings of European Conference on Computer Vision, 2020.
- [23] J. Xiao, Q. Ye, R. Zhao, K.-M. Lam, and K. Wan, “Self-feature learning: An efficient deep lightweight network for image super-resolution,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4408–4416.
- [24] X. Chu, B. Zhang, and R. Xu, “Multi-objective reinforced evolution in mobile neural architecture search,” in European Conference on Computer Vision. Springer, 2020, pp. 99–113.
- [25] D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, and Y. Wang, “Efficient residual dense block search for image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 007–12 014.
- [26] X. Zhang, H. Zeng, and L. Zhang, “Edge-oriented convolution block for real-time super resolution on mobile devices,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021.
- [27] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.
- [28] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim, “Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3224–3232.
- [29] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
- [30] J. Lin, Y. Huang, and L. Wang, “Fdan: Flow-guided deformable alignment network for video super-resolution,” arXiv preprint arXiv:2105.05640, 2021.
- [31] T. Isobe, X. Jia, S. Gu, S. Li, S. Wang, and Q. Tian, “Video super-resolution with recurrent structure-detail network,” in European Conference on Computer Vision. Springer, 2020, pp. 645–660.
- [32] K. C. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy, “Basicvsr: The search for essential components in video super-resolution and beyond,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
- [33] Y. Huang, W. Wang, and L. Wang, “Video super-resolution via bidirectional recurrent convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 1015–1028, 2017.
- [34] K. C. Chan, S. Zhou, X. Xu, and C. C. Loy, “Basicvsr++: Improving video super-resolution with enhanced propagation and alignment,” arXiv preprint arXiv:2104.13371, 2021.
- [35] Q. Gao, Y. Zhao, G. Li, and T. Tong, “Image super-resolution using knowledge distillation,” in Asian Conference on Computer Vision. Springer, 2018, pp. 527–541.
- [36] Z. He, T. Dai, J. Lu, Y. Jiang, and S.-T. Xia, “Fakd: Feature-affinity based knowledge distillation for efficient image super-resolution,” in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 518–522.
- [37] W. Lee, J. Lee, D. Kim, and B. Ham, “Learning with privileged information for efficient image super-resolution,” in European Conference on Computer Vision. Springer, 2020, pp. 465–482.
- [38] Z. Xiao, X. Fu, J. Huang, Z. Cheng, and Z. Xiong, “Space-time distillation for video super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2113–2122.
- [39] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, 2017, pp. 4161–4170.
- [40] C. Villani, Optimal transport: old and new. Springer, 2009, vol. 338.
- [41] L. Mi, W. Zhang, X. Gu, and Y. Wang, “Variational wasserstein clustering,” in Proceedings of the European Conference on Computer Vision, 2018.
- [42] S. Son, S. Lee, S. Nah, R. Timofte, and K. M. Lee, “Ntire 2021 challenge on video super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, June 2021, pp. 166–181.
- [43] C. Liu and D. Sun, “On bayesian adaptive video super resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 346–360, 2013.
- [44] A. Ignatov, A. Romero, H. Kim, and R. Timofte, “Real-time video super-resolution on smartphones with deep learning, mobile ai 2021 challenge: Report,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
- [45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [46] T. Isobe, F. Zhu, X. Jia, and S. Wang, “Revisiting temporal modeling for video super-resolution,” in Proceedings of the British Machine Vision Conference, 2020.
- [47] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, “Efficient dense modules of asymmetric convolution for real-time semantic segmentation,” in Proceedings of the ACM Multimedia Asia, 2019.