
STDAN: Deformable Attention Network for Space-Time Video Super-Resolution

Hai Wang, Xiaoyu Xiang, Yapeng Tian, Wenming Yang, and Qingmin Liao H. Wang, W. Yang and Q. Liao are with Tsinghua Shenzhen International Graduate School / Department of Electronic Engineering, Tsinghua University, Shenzhen 518055, China (e-mail: [email protected]; [email protected]; [email protected]).X. Xiang is with the On-Device AI team at Meta Reality Labs, Menlo Park, CA 94025 USA (e-mail: [email protected]).Y. Tian is with the Department of Computer Science, University of Rochester, Rochester, NY 14627 USA (e-mail: [email protected]).
Abstract

The target of space-time video super-resolution (STVSR) is to increase the spatial-temporal resolution of low-resolution (LR) and low frame rate (LFR) videos. Recent approaches based on deep learning have made significant improvements, but most of them only use two adjacent frames, that is, short-term features, to synthesize the missing frame embedding, which cannot fully explore the information flow of consecutive input LR frames. In addition, existing STVSR models hardly exploit the temporal contexts explicitly to assist high-resolution (HR) frame reconstruction. To address these issues, in this paper, we propose a deformable attention network called STDAN for STVSR. First, we devise a long-short term feature interpolation (LSTFI) module, which is capable of excavating abundant content from more neighboring input frames for the interpolation process through a bidirectional RNN structure. Second, we put forward a spatial-temporal deformable feature aggregation (STDFA) module, in which spatial and temporal contexts in dynamic video frames are adaptively captured and aggregated to enhance SR reconstruction. Experimental results on several datasets demonstrate that our approach outperforms state-of-the-art STVSR methods. The code is available at https://github.com/littlewhitesea/STDAN.

Index Terms:
Deformable attention, space-time video super-resolution, feature interpolation, feature aggregation.

I Introduction

The goal of space-time video super-resolution (STVSR) is to reconstruct photo-realistic high-resolution (HR) and high frame rate (HFR) videos from corresponding low-resolution (LR) and low frame rate (LFR) ones. STVSR methods have attracted much attention in the computer vision community since HR slow-motion videos provide more visually appealing content for viewers. Many traditional algorithms [8, 19, 3, 2, 4] have been proposed to solve the STVSR task. However, due to the strict assumptions in their manually designed regularization, these methods struggle with the ubiquitous object and camera motions in videos.

In recent years, deep learning approaches have made great progress in diverse low-level visual tasks [53, 62, 31, 30, 9, 20, 26, 46]. Particularly, video super-resolution (VSR) [53, 38] and video frame interpolation (VFI) [31, 25] networks among these approaches can be combined together to tackle STVSR. Specifically, the VFI model interpolates the missing LR video frames. Then, the VSR model can be adopted to reconstruct HR frames. Nevertheless, the two-stage STVSR approaches usually have large model sizes, and the essential association between the temporal interpolation and spatial super-resolution is not explored.

Figure 1: Example of space-time video super-resolution (STVSR). Compared with three recent SOTA STVSR methods, our network can reconstruct more accurate structures.

To build an efficient model and explore mutual information between temporal interpolation and spatial super-resolution, several one-stage STVSR networks [13, 36, 64, 65] are proposed. These approaches can simultaneously handle the space and time super-resolution of videos in diverse scenes. However, most of them only leverage the corresponding two adjacent frames for interpolating the missing frame feature, although other neighboring input LR frames can also contribute to the interpolation process. In addition, existing one-stage STVSR networks are limited in fully exploiting spatial and temporal contexts among various frames for SR reconstruction. To alleviate these problems, in this paper, we propose a one-stage framework named STDAN for STVSR, which is superior to recent methods, as illustrated in Fig. 1. The core of STDAN consists of (1) a feature interpolation module, Long-Short Term Feature Interpolation (LSTFI), and (2) a feature aggregation module, Spatial-Temporal Deformable Feature Aggregation (STDFA).

The LSTFI module, composed of long-short term cells (LSTCs), utilizes a bidirectional RNN [37] structure to synthesize features for missing intermediate frames. Specifically, to interpolate the intermediate feature, we adopt the forward and backward deformable alignment [13] for dynamically sampling two neighboring frame features. Then, the preliminary intermediate feature in the current LSTC is mingled with the hidden state that contains long-term temporal context from previous LSTCs to obtain the final interpolated features.

The STDFA module aims to capture spatial-temporal contexts among different frames to enhance SR reconstruction. To dynamically aggregate the spatial-temporal information, we propose to use deformable attention to adaptively discover and leverage relevant spatial and temporal information. The process of STDFA can be divided into two phases: cross-frame spatial aggregation and adaptive temporal aggregation. Through deformable attention, the cross-frame spatial aggregation phase dynamically fuses useful content from different frames. The adaptive temporal aggregation phase then mixes the temporal contexts among these fused frame features to acquire enhanced features.

The contributions of this work are three-fold: (1) We design a deformable attention network (STDAN) to deal with STVSR. Our STDAN with fewer parameters achieves state-of-the-art performance on multiple datasets; (2) We propose a long-short term feature interpolation module, where abundant information from more neighboring frames is explored for the interpolation process of missing frame features; (3) We put forward a spatial-temporal deformable feature aggregation module, which can dynamically capture spatial and temporal contexts among video frames for enhancing features to reconstruct HR frames.

II Related Work

In this section, we discuss some relevant works on video super-resolution, video frame interpolation, and space-time video super-resolution.

Video Super-Resolution. The goal of video super-resolution (VSR) [55, 38, 44, 48] is to generate temporally coherent high-resolution (HR) videos from corresponding low-resolution (LR) ones. Since input LR video frames are consecutive, many researchers focus on how to aggregate the temporal contexts from the neighboring frames for super-resolving the reference frame. Several VSR approaches [32, 24, 23, 51, 52] adopt optical flow to align the reference frame with neighboring video frames. Nevertheless, the estimated optical flow may be inaccurate due to occlusion and fast motions, leading to poor reconstruction results. To avoid using optical flow, deformable convolution [39, 40] is applied in [53, 54, 38] to perform the temporal alignment in a feature space. In addition, Li et al. [43] established a multi-correspondence aggregation network to exploit similar patches between and within frames. Dynamic filters [49] and non-local [42, 47] modules are also exploited to aggregate the temporal information.

Video Frame Interpolation. Video frame interpolation (VFI) [29, 28, 27, 31, 62] aims to synthesize the missing intermediate frame from two adjacent video frames, and is extensively used in slow-motion video generation. Specifically, for generating the intermediate frame, U-Net structure modules [25] are employed to compute optical flows and visibility maps between two input frames. To cope with occlusion in VFI, contextual features [21] are further introduced into the interpolation process. Furthermore, Bao et al. [31] proposed a depth-aware module to detect occlusions explicitly for VFI. On the other hand, unlike most VFI methods using optical flow, Niklaus et al. [22, 30] adopted adaptive convolution to predict kernels directly and then leveraged these kernels to estimate pixels of the intermediate video frame. Recently, attention mechanisms [31] and deformable convolution [57, 28] have also been explored.

Space-Time Video Super-Resolution. Compared to video super-resolution, space-time video super-resolution (STVSR) needs to implement super-resolution in both the time and space dimensions. Due to strict assumptions and manual regularization, conventional STVSR methods [19, 3, 4] cannot effectively process the spatial-temporal super-resolution of sophisticated LR input videos. In recent years, significant advances have been made with deep neural networks (DNNs). Through merging VSR and VFI into a joint framework, Kang et al. [64] put forward a DNN model for STVSR. To exploit mutually informative relationships between the time and space dimensions, STARnet [66] with an extra optical flow branch is proposed to generate HR slow-motion videos. In addition, Xiang et al. [13] developed a deformable ConvLSTM [58] module, which can achieve sequence-to-sequence (S2S) learning in STVSR. Based on [13], Xu et al. [36] proposed a temporal modulation block to perform controllable STVSR. Recently, Geng et al. [65] proposed an STVSR network based on the Swin Transformer. However, most of these methods only leverage two adjacent frame features to interpolate the intermediate frame feature, and they hardly explore spatial and temporal contexts explicitly among video frames. To address these problems, we propose a spatial-temporal deformable network (1) to use more content from the input LR frames for the interpolation process and (2) to employ deformable attention to dynamically capture spatial-temporal contexts for HR frame reconstruction.

Figure 2: The architecture of our proposed STDAN. Long-short term feature interpolation is capable of exploring more neighboring LR frames to synthesize the intermediate frame in the feature space. Spatial-temporal deformable feature aggregation is utilized to capture spatial-temporal contexts by deformable attention. For clarity, this figure only shows two input LR video frames from a long video sequence.

III Our Method

The architecture of our proposed network is illustrated in Fig. 2, which consists of four parts: a feature extraction module, a long-short term feature interpolation (LSTFI) module, a spatial-temporal deformable feature aggregation (STDFA) module, and a frame feature reconstruction module. Given a low-resolution (LR) and low frame rate (LFR) video with $N$ frames $\left\{I_{2t-1}^{lr}\right\}_{t=1}^{N}$, our STDAN can generate $2N-1$ consecutive high-resolution (HR) and high frame rate (HFR) frames $\left\{I_{t}^{hr}\right\}_{t=1}^{2N-1}$. The structure of each module is described in the following.
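
To make this data flow concrete, a minimal PyTorch-style skeleton of the four-part pipeline is sketched below; the class and attribute names, as well as the shape comments, are illustrative assumptions rather than the released implementation.

```python
import torch.nn as nn


class STDANSketch(nn.Module):
    """Illustrative skeleton of the four-part STDAN pipeline (hypothetical names, not the official code)."""

    def __init__(self, extractor, lstfi, stdfa, reconstructor):
        super().__init__()
        self.extractor = extractor          # 3x3 conv + RSTBs (Sec. III-A)
        self.lstfi = lstfi                  # long-short term feature interpolation (Sec. III-B)
        self.stdfa = stdfa                  # spatial-temporal deformable feature aggregation (Sec. III-C)
        self.reconstructor = reconstructor  # RSTBs + PixelShuffle upsampling (Sec. III-D)

    def forward(self, lr_frames):
        # lr_frames: (B, N, C, H, W), the N input LR and LFR frames
        feats = self.extractor(lr_frames)   # N per-frame features
        feats = self.lstfi(feats)           # interpolate the N-1 missing features -> 2N-1 features
        feats = self.stdfa(feats)           # enhance each feature with spatial-temporal contexts
        return self.reconstructor(feats)    # (B, 2N-1, C, sH, sW) HR and HFR frames
```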

III-A Frame Feature Extraction

We first use a $3\times 3$ convolutional layer in the feature extraction module to obtain shallow features $\left\{F_{2t-1}^{s}\right\}_{t=1}^{N}$ from the $N$ input LR video frames. These shallow features lack long-range spatial information due to the locality of the naive convolutional layer, which may degrade the quality of the subsequent feature interpolation module. We therefore process these shallow features further to establish correlations between distant locations.

Recently, Transformer-based models have achieved good performance in computer vision [5, 7, 6, 10] owing to the strong capacity of the Transformer to model long-range dependencies. However, the computational cost of self-attention in the Transformer is high, which limits its extensive application in video-related tasks. To overcome this drawback, Liu et al. [11] put forward the Swin Transformer block (STB) to achieve linear computational complexity with respect to image size. Based on the efficient and effective STB [11], Liang et al. [12] proposed the residual Swin Transformer block (RSTB) to construct SwinIR for image restoration. Thanks to the powerful long-range modeling ability of RSTB, SwinIR [12] obtains state-of-the-art (SOTA) performance compared with CNN-based methods. In this paper, to acquire features $\left\{F_{2t-1}\right\}_{t=1}^{N}$ that capture long-range spatial information, we also use $m_{f}$ RSTBs [12] to further process the shallow features $\left\{F_{2t-1}^{s}\right\}_{t=1}^{N}$, as illustrated in Fig. 2. The RSTB is a residual block with several STBs and one convolutional layer. Given a tensor $X_{in}$ as input, the process by which an STB produces the output $X_{out}$ is formulated as:

$$\begin{split}&X=MSA(LayerNorm(X_{in}))+X_{in},\\ &X_{out}=MLP(LayerNorm(X))+X,\end{split}$$ (1)

where $MSA$ and $X$ denote the multi-head self-attention module and the intermediate result, respectively, and $MLP$ denotes a multi-layer perceptron.
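
For illustration, a minimal PyTorch sketch of the pre-norm residual structure in Eq. (1) is given below; it uses plain multi-head self-attention over flattened tokens rather than the shifted-window attention of the actual STB [11], so it only mirrors the residual layout, and its names and hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn


class STBSketch(nn.Module):
    """Pre-norm residual block mirroring Eq. (1); plain MSA stands in for window attention."""

    def __init__(self, dim=64, num_heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x_in):                                    # x_in: (B, L, dim) flattened spatial tokens
        x = self.norm1(x_in)
        x = self.msa(x, x, x, need_weights=False)[0] + x_in     # X = MSA(LayerNorm(X_in)) + X_in
        return self.mlp(self.norm2(x)) + x                      # X_out = MLP(LayerNorm(X)) + X
```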

Figure 3: The framework of our long-short term feature interpolation (LSTFI) module. It consists of long-short term cells (LSTCs) in a bidirectional RNN, which can fully exploit all input video frame features during the interpolation process. Note that the two neighboring frame features and the hidden state from the previous LSTC provide short-term and long-term content for the interpolation results, respectively. Here, $h^{f}_{0}$ and $h^{b}_{8}$ denote the initialized hidden states for the forward and backward recurrent propagation, respectively. More specifically, $h^{f}_{0}$ serves as the forward hidden state for predicting the first missing frame feature $F_{2}$, while $h^{b}_{8}$ is regarded as the backward hidden state for predicting the last missing frame feature $F_{6}$.

III-B Long-Short Term Feature Interpolation

To implement super-resolution in the time dimension, we also utilize a feature interpolation module to synthesize the intermediate frames in the LR feature space, as in [13, 36]. Specifically, given the two extracted features $F_{1}$ and $F_{3}$, the feature interpolation module synthesizes the feature $F_{2}$ corresponding to the missing frame $I_{2}^{lr}$. Generally, to obtain the intermediate feature, we should first capture pixel-wise motion information. Optical flow is usually adopted to estimate the motion between video frames. However, using optical flow for interpolation has several shortcomings: computing optical flow precisely is computationally expensive, and the estimated flow may be inaccurate due to occlusion or motion blur, which leads to poor interpolation results.

Considering the drawbacks of optical flow, Xiang et al. [13] employed multi-level deformable convolution [38] to perform frame feature interpolation. The learned offsets used in deformable convolution can implicitly capture forward and backward motion information and achieve good performance. However, the synthesis of the intermediate frame feature in [13, 36] only utilizes the two neighboring frame features, which cannot fully exploit the information from the other input frames to assist the process. Unlike the feature interpolation in previous STVSR algorithms [13, 36], we propose a long-short term feature interpolation (LSTFI) module to synthesize the intermediate frame feature in our STDAN, which is capable of exploiting helpful information from more input frames.

Figure 4: Overview of the proposed LSTC and the fusion process of interpolation results from the forward and backward branches. We adopt the DFI block [13] to adaptively align the hidden state from the previous LSTC with the current preliminary interpolation result. Note that the final intermediate frame feature is obtained by fusing the interpolation results from the forward and backward branches.

As illustrated in Fig. 3, we adopt a bidirectional recurrent neural network (RNN) [37] to construct the LSTFI module, which consists of two branches in the forward and backward directions. Take the forward branch as an example. Two neighboring frame features and the hidden state from the previous long-short term cell (LSTC) are fed into each LSTC, which then generates the corresponding intermediate frame feature and the current hidden state used by the subsequent LSTC. Here, the two neighboring frame features and the hidden state serve as short-term and long-term information for the intermediate feature, respectively. However, each branch's hidden state only considers a unidirectional information flow. To fully mine the information flow of these frame features for the interpolation procedure, we fuse the interpolation results from LSTCs in the forward and backward branches to acquire the final intermediate frame feature.

The architecture of the LSTC and the fusion process are shown in Fig. 4. Given two neighboring frame features $F_{1}$ and $F_{3}$, we employ the deformable feature interpolation (DFI) block [13] to implicitly capture the forward and backward motion between the two features. For simplicity, we take the feature $F_{3\rightarrow 1}$, which has undergone backward motion compensation, as an example. As illustrated in Fig. 4(b), the two frame features are concatenated along the channel dimension and then pass through the offset generation function $H_{og}^{b}$ to predict an offset with backward motion information:

$$\theta_{3\rightarrow{1}}=H_{og}^{b}([F_{3},F_{1}]),$$ (2)

where $H_{og}^{b}$ consists of convolutional layers, and $[\cdot,\cdot]$ denotes concatenation along the channel dimension. With the learned offset, we adopt deformable convolution [40] as the motion compensation function to obtain the compensated feature:

$$F_{3\rightarrow{1}}=DConv([F_{3},\theta_{3\rightarrow{1}}]),$$ (3)

where $DConv$ denotes the operation of deformable convolution.

To blend the features $F_{1\rightarrow 3}$ and $F_{3\rightarrow 1}$, which have undergone forward and backward motion compensation respectively, a $1\times 1$ convolutional layer is applied to perform pixel-level linear weighting and produce the preliminary interpolation feature $F_{2}^{p}$. Note that the acquisition of $F_{2}^{p}$ only utilizes short-term information. To incorporate the long-term information $h_{0}^{f}$, the hidden state from the previous LSTC, we first use another DFI block to align $h_{0}^{f}$ with the current feature $F_{2}^{p}$, since there may be some misalignment. The process is expressed as:

$$h_{0\rightarrow{2}}^{f}=DAlign(h_{0}^{f},F_{2}^{p}),$$ (4)

where $DAlign(\cdot)$ indicates the operation of the DFI block. At the end of the LSTC, we apply a fusion function to the aligned hidden state $h_{0\rightarrow 2}^{f}$ and the preliminary interpolation result $F_{2}^{p}$ to obtain the forward intermediate feature:

$$F_{2}^{f}=H_{fs}(F_{2}^{p},h_{0\rightarrow{2}}^{f}),$$ (5)

where $H_{fs}$ refers to the fusion function. Then, the intermediate feature $F_{2}^{f}$ passes through a convolutional layer and an activation layer in sequence to produce the hidden state $h_{2}^{f}$ for the subsequent LSTC.

To fully explore all input frame features for interpolation, the bidirectional RNN structure is utilized in our LSTFI module; we therefore fuse the forward intermediate feature $F_{2}^{f}$ and the backward intermediate feature $F_{2}^{b}$ to get the final intermediate frame feature $F_{2}$, as shown in Fig. 4(c).
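
A simplified PyTorch sketch of one forward LSTC following Eqs. (2)-(5) is shown below. It uses torchvision's single-level DeformConv2d as a stand-in for the multi-level PCD-based DFI block of [13], and all module and argument names are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DFISketch(nn.Module):
    """Simplified stand-in for the DFI block of [13]: predict offsets from a feature pair,
    then deformably sample the first feature toward the second (single level, no PCD pyramid)."""

    def __init__(self, ch=64, k=3):
        super().__init__()
        self.offset = nn.Sequential(                 # H_og in Eq. (2): offsets from concatenated features
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * k * k, 3, padding=1),
        )
        self.dconv = DeformConv2d(ch, ch, k, padding=k // 2)   # DConv in Eq. (3)

    def forward(self, f_src, f_ref):
        theta = self.offset(torch.cat([f_src, f_ref], dim=1))
        return self.dconv(f_src, theta)


class LSTCSketch(nn.Module):
    """One forward long-short term cell: short-term interpolation plus hidden-state alignment."""

    def __init__(self, ch=64):
        super().__init__()
        self.back = DFISketch(ch)              # backward compensation, F_{3->1}
        self.fwd = DFISketch(ch)               # forward compensation,  F_{1->3}
        self.blend = nn.Conv2d(2 * ch, ch, 1)  # 1x1 pixel-wise blending -> F_2^p
        self.align = DFISketch(ch)             # align the previous hidden state with F_2^p, Eq. (4)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)   # H_fs in Eq. (5)
        self.to_hidden = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1, True))

    def forward(self, f1, f3, h_prev):
        f2_p = self.blend(torch.cat([self.fwd(f1, f3), self.back(f3, f1)], dim=1))
        h_aligned = self.align(h_prev, f2_p)
        f2_f = self.fuse(torch.cat([f2_p, h_aligned], dim=1))
        return f2_f, self.to_hidden(f2_f)      # interpolated feature and the next hidden state
```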

III-C Spatial-Temporal Deformable Feature Aggregation

With the assistance of the LSTFI module, we now have $2N-1$ frame features, where the generation of the $N-1$ intermediate frame features combines their adjacent frame features with hidden states. Although the hidden states can introduce certain temporal information, the interpolation procedure hardly explores the temporal information between frames explicitly. In addition, the $N$ input frame features are merely processed independently in the feature extraction module. However, these frame features $\left\{F_{t}\right\}_{t=1}^{2N-1}$ are consecutive, which means abundant temporal content among them remains unexploited.

For a feature vector $\mathbf{f}_{t}$ at location $\mathbf{p}_{o}$ on feature $F_{t}$, the simplest way to aggregate temporal information is adaptive fusion with the feature vectors at the same location in the other $2N-2$ frame features. However, this aggregation approach has several drawbacks. Generally, the corresponding point in the other frame features may not lie at the same location due to inter-frame motion. Furthermore, there may be multiple helpful feature vectors for $\mathbf{f}_{t}$ in each of the $2N-2$ frame features. Based on the above analysis, we propose a spatial-temporal deformable feature aggregation (STDFA) module to adaptively mix cross-frame spatial information and capture long-range temporal information.

Figure 5: The detailed process of the spatial-temporal deformable feature aggregation (STDFA) module. Note that we only show the case when the number of frame features is 4. In this case, the value of $t$ can be 2, 3 or 4 for frame feature $F_{1}$.

Specifically, we utilize the STDFA module to learn residual auxiliary information from the remaining $2N-2$ frame features for each frame feature $F_{t}$. As presented in Fig. 5, the processing of the STDFA module can be divided into two parts: spatial aggregation and temporal aggregation. To adaptively fuse cross-frame spatial content of frame feature $F_{i}$ from the other frame features, we perform deformable attention on each pair $F_{i}$ and $F_{j}$ ($j\in[1,2N-1]$, $j\neq i$). In detail, frame feature $F_{i}$ passes through a linear layer to get the embedded feature $Q_{i}$. Similarly, frame feature $F_{j}$ is fed into two linear layers to obtain the embedded features $K_{j}$ and $V_{j}$, respectively.

To implement deformable attention between $F_{i}$ and $F_{j}$, we first predict the offset map:

$$\Delta{M_{j\rightarrow{i}}}=H_{og}([Q_{i},K_{j}]),$$ (6)

where $H_{og}$ indicates the offset generation function consisting of several convolutional layers with $k\times k$ kernels. The offset map $\Delta{M_{j\rightarrow{i}}}$ at position $\mathbf{p}_{o}$ is expressed as:

$$\Delta{M_{j\rightarrow{i}}(\mathbf{p}_{o})}=[\Delta{\mathbf{p}_{1}},\Delta{\mathbf{p}_{2}},\cdots,\Delta{\mathbf{p}_{\xi}},\cdots,\Delta{\mathbf{p}_{k^{2}}}].$$ (7)

Then the offsets $\Delta{M_{j\rightarrow{i}}(\mathbf{p}_{o})}$ are combined with $k^{2}$ pre-specified sampling locations to perform deformable sampling. Here, we denote the pre-specified sampling location as $\mathbf{p}_{\xi}$, and the value set of $\mathbf{p}_{\xi}$ for a $k\times k$ kernel is defined as:

𝐩ξ{(k2,k2),,(k2,k2)},\mathbf{p}_{\xi}\in{\left\{(-{\left\lfloor\frac{k}{2}\right\rfloor},-{\left\lfloor\frac{k}{2}\right\rfloor}),\cdots,({\left\lfloor\frac{k}{2}\right\rfloor},{\left\lfloor\frac{k}{2}\right\rfloor})\right\}}\ , (8)

where $\left\lfloor\cdot\right\rfloor$ denotes the rounding-down (floor) function.

With the offsets $\Delta{M_{j\rightarrow{i}}}(\mathbf{p}_{o})$, the embedded feature vector $Q_{i}(\mathbf{p}_{o})$ can attend to $k^{2}$ related points in $K_{j}$. Nevertheless, not all the information from these $k^{2}$ points is helpful for $Q_{i}(\mathbf{p}_{o})$. In addition, each point on the embedded feature $Q_{i}$ needs to search $k^{2}$ points, which inevitably causes a large memory footprint. To avoid irrelevant points and reduce memory usage, we only keep the $T$ most relevant points. To select these $T$ points, we calculate the inner product between two embedded feature vectors as the relevance score:

$$RS_{j\rightarrow{i}}(\mathbf{p}_{o},\xi)=Q_{i}(\mathbf{p}_{o})\cdot K_{j}(\mathbf{p}_{o}+\mathbf{p}_{\xi}+\Delta{\mathbf{p}_{\xi}}).$$ (9)

The larger the score, the more relevant the two points are. According to this criterion, we can determine the $T$ points. In the following, to distinguish the selected $T$ points from the original $k^{2}$ points, we denote the pre-specified sampling location and learned offset of the $T$ points as $\mathbf{\overline{p}}_{\xi}$ and $\Delta{\mathbf{\overline{p}}_{\xi}}$, respectively.

To adaptively mingle the spatial information from the $T$ locations for each embedded feature vector $Q_{i}(\mathbf{p}_{o})$, we first adopt the softmax function to calculate the weight of these points:

$$w_{\xi}=\frac{e^{Q_{i}(\mathbf{p}_{o})\cdot K_{j}(\mathbf{p}_{o}+\mathbf{\overline{p}}_{\xi}+\Delta{\mathbf{\overline{p}}_{\xi}})}}{\sum_{\xi=1}^{T}{e^{Q_{i}(\mathbf{p}_{o})\cdot K_{j}(\mathbf{p}_{o}+\mathbf{\overline{p}}_{\xi}+\Delta{\mathbf{\overline{p}}_{\xi}})}}}.$$ (10)

Then, with the weights and the embedded feature vectors $K_{j}(\mathbf{p}_{o}+\mathbf{\overline{p}}_{\xi}+\Delta{\mathbf{\overline{p}}_{\xi}})$, we can obtain the corresponding updated embedded feature vector:

$$K_{j\rightarrow{i}}(\mathbf{p}_{o})=\sum_{\xi=1}^{T}{w_{\xi}\cdot{K_{j}(\mathbf{p}_{o}+\mathbf{\overline{p}}_{\xi}+\Delta{\mathbf{\overline{p}}_{\xi}})}}.$$ (11)

In the same way as $K_{j\rightarrow{i}}(\mathbf{p}_{o})$, the updated vector $V_{j\rightarrow{i}}(\mathbf{p}_{o})$ can also be obtained with the weights $w_{\xi}$. Finally, we calculate the updated relevance weight map $W_{j\rightarrow{i}}$ at each position $\mathbf{p}_{o}$ between $Q_{i}$ and $K_{j\rightarrow{i}}$ for the following temporal aggregation:

$$W_{j\rightarrow{i}}(\mathbf{p}_{o})=Q_{i}(\mathbf{p}_{o})\cdot K_{j\rightarrow{i}}(\mathbf{p}_{o}).$$ (12)

To capture the temporal contexts of the frame feature vector $F_{i}(\mathbf{p}_{o})$ from the remaining $2N-2$ features, we also utilize the softmax function to adaptively aggregate the feature vectors $V_{j\rightarrow{i}}(\mathbf{p}_{o})$. Specifically, the normalized temporal weight of each vector $V_{j\rightarrow{i}}(\mathbf{p}_{o})$ ($j\in[1,2N-1]$, $j\neq i$) is expressed as:

$$\hat{W}_{j\rightarrow{i}}(\mathbf{p}_{o})=\frac{e^{W_{j\rightarrow{i}}(\mathbf{p}_{o})}}{\sum_{j=1,j\neq{i}}^{2N-1}{e^{W_{j\rightarrow{i}}(\mathbf{p}_{o})}}}.$$ (13)

Then, by fusing the embedded feature vectors $V_{j\rightarrow{i}}(\mathbf{p}_{o})$ ($j\in[1,2N-1]$, $j\neq i$) with the corresponding normalized weights, we can obtain the embedded feature $V_{i}^{*}$ that aggregates the spatial and temporal contexts from the other $2N-2$ embedded features. The weighted fusion process is defined as:

$$V_{i}^{*}(\mathbf{p}_{o})=\sum_{j=1,j\neq{i}}^{2N-1}{\hat{W}_{j\rightarrow{i}}(\mathbf{p}_{o})\cdot{V_{j\rightarrow{i}}(\mathbf{p}_{o})}}.$$ (14)

At the end of the STDFA module, the embedded feature $V_{i}^{*}$ is sent into a linear layer to acquire the residual auxiliary feature $F_{i}^{res}$. Finally, we add the frame feature $F_{i}$ and the residual auxiliary feature $F_{i}^{res}$ to get the enhanced feature $F_{i}^{*}$ that aggregates spatial and temporal contexts from the other $2N-2$ frame features.
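
The whole aggregation procedure of Eqs. (6)-(14) can be summarized by the following simplified, single-head PyTorch sketch; the linear layers are implemented as 1x1 convolutions, bilinear grid sampling stands in for the exact deformable sampling, and the class and argument names are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STDFASketch(nn.Module):
    """Simplified single-head sketch of spatial-temporal deformable feature aggregation.
    Offsets are predicted per pixel for k*k sampling points (Eqs. 6-8), the T most relevant
    points are kept (Eq. 9) and spatially fused by softmax (Eqs. 10-12), then the per-frame
    responses are fused across time by another softmax (Eqs. 13-14)."""

    def __init__(self, ch=64, emb=72, k=3, top_t=2):
        super().__init__()
        self.k2, self.top_t = k * k, top_t
        self.to_q, self.to_k, self.to_v = (nn.Conv2d(ch, emb, 1) for _ in range(3))
        self.offset = nn.Conv2d(2 * emb, 2 * k * k, k, padding=k // 2)   # H_og in Eq. (6)
        self.proj = nn.Conv2d(emb, ch, 1)                                # residual output F_i^res
        # pre-specified k x k sampling grid p_xi (Eq. 8), stored as (k*k, 2) in (x, y) order
        r = torch.arange(k) - k // 2
        self.register_buffer("p_xi", torch.stack(torch.meshgrid(r, r, indexing="ij"), -1)
                             .flip(-1).reshape(-1, 2).float())

    def _sample(self, feat, offsets):
        # feat: (B, C, H, W); offsets: (B, k*k, 2, H, W), absolute displacements in pixels
        b, c, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                                torch.arange(w, device=feat.device), indexing="ij")
        loc = torch.stack([xs, ys], 0).float() + offsets                 # p_o + p_xi + delta p_xi
        grid = torch.stack([loc[:, :, 0] / (w - 1) * 2 - 1,              # normalize to [-1, 1]
                            loc[:, :, 1] / (h - 1) * 2 - 1], -1).reshape(b * self.k2, h, w, 2)
        feat_r = feat.unsqueeze(1).expand(-1, self.k2, -1, -1, -1).reshape(b * self.k2, c, h, w)
        return F.grid_sample(feat_r, grid, align_corners=True).reshape(b, self.k2, c, h, w)

    def forward(self, feats):
        # feats: list of 2N-1 per-frame features, each (B, C, H, W)
        outs = []
        for i, f_i in enumerate(feats):
            q = self.to_q(f_i)
            w_all, v_all = [], []
            for j, f_j in enumerate(feats):
                if j == i:
                    continue
                k_j, v_j = self.to_k(f_j), self.to_v(f_j)
                off = self.offset(torch.cat([q, k_j], 1))                # (B, 2*k*k, H, W)
                off = off.reshape(off.shape[0], self.k2, 2, *off.shape[-2:])
                off = off + self.p_xi.view(1, self.k2, 2, 1, 1)          # add p_xi to delta p_xi
                k_s, v_s = self._sample(k_j, off), self._sample(v_j, off)
                score = (q.unsqueeze(1) * k_s).sum(2)                    # Eq. (9): (B, k*k, H, W)
                top, idx = score.topk(self.top_t, dim=1)                 # keep the T most relevant points
                w_sp = top.softmax(dim=1).unsqueeze(2)                   # Eq. (10)
                gather = idx.unsqueeze(2).expand(-1, -1, k_s.shape[2], -1, -1)
                k_ji = (w_sp * k_s.gather(1, gather)).sum(1)             # Eq. (11)
                v_ji = (w_sp * v_s.gather(1, gather)).sum(1)
                w_all.append((q * k_ji).sum(1, keepdim=True))            # Eq. (12)
                v_all.append(v_ji)
            w_t = torch.cat(w_all, 1).softmax(dim=1)                     # Eq. (13): softmax over frames
            v_star = (w_t.unsqueeze(2) * torch.stack(v_all, 1)).sum(1)   # Eq. (14)
            outs.append(f_i + self.proj(v_star))                         # enhanced feature F_i^*
        return outs
```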

TABLE I: Quantitative comparisons of our STDAN and other SOTA methods for space-time video super-resolution. The best two results are highlighted in red and blue colors. Note that we conduct a padding operation to the input LR frames before feeding them into the networks, so the results of comparative methods on Vid4 are different from the reported results in the original papers.
| VFI Method | (V)SR Method | Vid4 PSNR / SSIM | SPMC-11 PSNR / SSIM | Vimeo-Slow PSNR / SSIM | Vimeo-Medium PSNR / SSIM | Vimeo-Fast PSNR / SSIM | Speed (FPS) | Parameters (Million) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuperSloMo | Bicubic | 22.84 / 0.5772 | 24.91 / 0.6874 | 28.37 / 0.8102 | 29.94 / 0.8477 | 31.88 / 0.8793 | - | 19.8 |
| SuperSloMo | RCAN | 23.78 / 0.6385 | 26.50 / 0.7527 | 30.69 / 0.8624 | 32.50 / 0.8884 | 34.52 / 0.9076 | 2.49 | 19.8+16.0 |
| SuperSloMo | RBPN | 24.00 / 0.6587 | 26.14 / 0.7582 | 30.48 / 0.8584 | 32.79 / 0.8930 | 34.73 / 0.9108 | 2.06 | 19.8+12.7 |
| SuperSloMo | EDVR | 24.22 / 0.6700 | 26.46 / 0.7689 | 30.99 / 0.8673 | 33.85 / 0.8967 | 35.05 / 0.9136 | 6.85 | 19.8+20.7 |
| SepConv | Bicubic | 23.51 / 0.6273 | 25.67 / 0.7261 | 29.04 / 0.8290 | 30.61 / 0.8633 | 32.27 / 0.8890 | - | 21.7 |
| SepConv | RCAN | 24.99 / 0.7259 | 28.16 / 0.8226 | 32.13 / 0.8967 | 33.59 / 0.9125 | 34.97 / 0.9195 | 2.42 | 21.7+16.0 |
| SepConv | RBPN | 25.75 / 0.7829 | 28.65 / 0.8614 | 32.77 / 0.9090 | 34.09 / 0.9229 | 35.07 / 0.9238 | 2.01 | 21.7+12.7 |
| SepConv | EDVR | 25.89 / 0.7876 | 28.86 / 0.8665 | 32.96 / 0.9112 | 34.22 / 0.9240 | 35.23 / 0.9252 | 6.36 | 21.7+20.7 |
| DAIN | Bicubic | 23.55 / 0.6268 | 25.68 / 0.7263 | 29.06 / 0.8289 | 30.67 / 0.8636 | 32.41 / 0.8910 | - | 24.0 |
| DAIN | RCAN | 25.03 / 0.7261 | 28.15 / 0.8224 | 32.26 / 0.8974 | 33.82 / 0.9146 | 35.27 / 0.9242 | 2.23 | 24.0+16.0 |
| DAIN | RBPN | 25.76 / 0.7783 | 28.57 / 0.8598 | 32.92 / 0.9097 | 34.45 / 0.9262 | 35.55 / 0.9300 | 1.88 | 24.0+12.7 |
| DAIN | EDVR | 25.90 / 0.7830 | 28.77 / 0.8649 | 33.11 / 0.9119 | 34.66 / 0.9281 | 35.81 / 0.9323 | 5.20 | 24.0+20.7 |
| STARnet | (one-stage) | 25.99 / 0.7819 | 29.04 / 0.8509 | 33.10 / 0.9164 | 34.86 / 0.9356 | 36.19 / 0.9368 | 14.08 | 111.61 |
| Zooming Slow-Mo | (one-stage) | 26.14 / 0.7974 | 28.80 / 0.8635 | 33.36 / 0.9138 | 35.41 / 0.9361 | 36.81 / 0.9415 | 16.50 | 11.10 |
| RSTT | (one-stage) | 26.20 / 0.7991 | 28.86 / 0.8634 | 33.50 / 0.9147 | 35.66 / 0.9381 | 36.80 / 0.9403 | 15.36 | 7.67 |
| TMNet | (one-stage) | 26.23 / 0.8011 | 28.78 / 0.8640 | 33.51 / 0.9159 | 35.60 / 0.9380 | 37.04 / 0.9435 | 14.69 | 12.26 |
| STDAN (Ours) | (one-stage) | 26.28 / 0.8041 | 28.94 / 0.8687 | 33.66 / 0.9176 | 35.70 / 0.9387 | 37.10 / 0.9437 | 13.80 | 8.29 |

III-D High-Resolution Frame Reconstruction

To reconstruct HR frames from the enhanced features $\left\{F_{t}^{*}\right\}_{t=1}^{2N-1}$, we first employ $m_{b}$ RSTBs [12] to map each feature $F_{t}^{*}$ to a deep feature $F_{t}^{d}$. Then, these deep features further pass through an upsampling module to produce the HR video frames $\left\{I_{t}^{hr}\right\}_{t=1}^{2N-1}$. Specifically, the upsampling module consists of the PixelShuffle layer [16] and several convolutional layers. To optimize our proposed network, we adopt the Charbonnier function [17] as the reconstruction loss:

$$L_{rec}=\sqrt{||I_{t}^{hr}-I_{t}^{GT}||^{2}+\epsilon^{2}},$$ (15)

where $I_{t}^{GT}$ indicates the ground truth of the $t$-th reconstructed video frame $I_{t}^{hr}$, and the value of $\epsilon$ is empirically set to $1\times 10^{-3}$. With this loss function, our STDAN can be trained end-to-end to generate HR slow-motion videos from corresponding LR and LFR counterparts.
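
A minimal sketch of Eq. (15) in the common element-wise form, averaged over all pixels and all $2N-1$ reconstructed frames, is given below; it is a standard Charbonnier implementation and an assumption rather than the authors' exact training code.

```python
import torch


def charbonnier_loss(sr, gt, eps=1e-3):
    """Charbonnier reconstruction loss of Eq. (15); sr and gt are (B, 2N-1, C, H, W) tensors."""
    return torch.sqrt((sr - gt) ** 2 + eps ** 2).mean()
```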

IV Experiments

In this section, we first introduce the datasets and evaluation metrics used in our experiments. Then, the implementation details of our STDAN are elaborated. Next, we compare our proposed network with state-of-the-art methods on public datasets. Finally, we carry out ablation studies to investigate the effect of the modules adopted in our STDAN.

Figure 6: Visual comparisons of different STVSR approaches on Vid4 and Vimeo datasets. We can see that our model can recover more accurate structures.

IV-A Datasets and Evaluation Metrics

Datasets We use the Vimeo-90K dataset [32] to train our network. Specifically, the Vimeo-90K dataset consists of more than 60,000 training video sequences, and each video sequence has seven frames. We adopt the raw seven frames as our HR and HFR supervision. The corresponding four LR and LFR input frames are obtained by keeping the odd-numbered frames and downscaling them by a factor of 4 with bicubic sampling. Vimeo-90K also provides corresponding test sets that can be divided into Vimeo-Slow, Vimeo-Medium and Vimeo-Fast according to the degree of motion. These three test sets serve as the evaluation datasets in our experiments. Following STVSR methods [13, 36], six video sequences in the Vimeo-Medium test set and three sequences in the Vimeo-Slow test set are removed to avoid infinite PSNR values. In addition, we report results on Vid4 [33] and SPMC-11 [23] for the different approaches.
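
As a concrete illustration of this protocol (keep the odd-numbered frames, then bicubically downscale them by a factor of 4), a minimal PyTorch sketch is given below; bicubic interpolation via torch.nn.functional.interpolate is used as an approximation, and the actual data pipeline of the released code may differ.

```python
import torch
import torch.nn.functional as F


def make_lr_lfr_input(hr_frames, scale=4):
    """hr_frames: (7, C, H, W) HR/HFR ground-truth sequence from Vimeo-90K.
    Returns the 4 LR/LFR input frames built from the odd-numbered frames (indices 0, 2, 4, 6)."""
    lfr = hr_frames[::2]                                  # temporal subsampling: keep odd-numbered frames
    return F.interpolate(lfr, scale_factor=1.0 / scale,   # spatial x4 bicubic downscaling
                         mode="bicubic", align_corners=False)
```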

Figure 7: Three different aggregation methods in the feature aggregation module. ‘STFA’ refers to spatial-temporal feature aggregation. Note that we only show 4 frames for illustration, and STFA in a deformable window denotes our STDFA module.
Figure 8: Ablation study on the feature aggregation module. ‘FW’ indicates fixed window, while ‘DW’ refers to deformable window.

Evaluation Metrics To compare diverse STVSR networks quantitatively, Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity (SSIM) [34] are adopted in our experiments as evaluation metrics. In this paper, we calculate the PSNR and SSIM metrics on the Y channel of the YCbCr color space. In addition, we also compare the parameters and inference speed of various models.
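
For completeness, a minimal sketch of Y-channel PSNR is shown below; it assumes the standard BT.601 RGB-to-Y conversion, which may differ slightly in constants from the authors' evaluation script.

```python
import torch


def psnr_y(img1, img2):
    """PSNR on the Y channel of YCbCr; img1 and img2 are (..., 3, H, W) RGB tensors in [0, 1]."""
    def to_y(x):
        r, g, b = x[..., 0, :, :], x[..., 1, :, :], x[..., 2, :, :]
        return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0  # BT.601 luma, rescaled to [0, 1]
    mse = ((to_y(img1) - to_y(img2)) ** 2).mean()
    return 10.0 * torch.log10(1.0 / mse)
```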

IV-B Implementation Details

In our STDAN, the numbers of RSTBs in the feature extraction module and the frame feature reconstruction module are 2 and 6, respectively, and each RSTB contains 6 STBs. In addition, the numbers of feature and embedded feature channels are set to 64 and 72, respectively. In the LSTFI module, we utilize the Pyramid, Cascading, and Deformable (PCD) structure of [38] to implement the DFI block. The hidden states in the forward and backward branches are initialized to zeros. In the STDFA module, the values of $k$ and $T$ are set to 3 and 2, respectively. We augment the training frames by random horizontal flipping and $90^{\circ}$ rotations during training. We then randomly crop $32\times 32$ patches from the input LR frames as network inputs, and the batch size is set to 18. Our model is trained with the Adam [35] optimizer, with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. We employ cosine annealing [41] to decay the learning rate from $2\times 10^{-4}$ to $1\times 10^{-7}$. We implement STDAN in PyTorch and train our model on 6 NVIDIA GTX-1080Ti GPUs.
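
The optimizer and learning-rate schedule described above can be set up as in the following sketch; the parameter list is a stand-in, and the total number of iterations is an assumption, since the paper does not state it.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the STDAN parameters
num_iters = 600_000                            # assumed total training iterations (not given in the paper)

optimizer = Adam(params, lr=2e-4, betas=(0.9, 0.999))
# cosine annealing decays the learning rate from 2e-4 down to 1e-7
scheduler = CosineAnnealingLR(optimizer, T_max=num_iters, eta_min=1e-7)
```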

Figure 9: Ablation study on the proposed modules. We can see that STDFA can effectively suppress blurring artifacts and recover correct visual structures, and that LSTFI can further help to reconstruct fine details.

IV-C Comparison with State-of-the-art Methods

We compare our STDAN with existing state-of-the-art (SOTA) one-stage STVSR approaches: STARnet [66], Zooming Slow-Mo [13], RSTT [65] and TMNet [36]. In addition, we also compare the performance of our network with two-stage STVSR algorithms composed of SOTA VFI and SR methods. The VFI networks are SuperSloMo [25], SepConv [30] and DAIN [31], while the SR approaches are RCAN [56], RBPN [52] and EDVR [38].

Quantitative results of the various STVSR methods are shown in Table I. From the table, we can see that: (1) our STDAN, with fewer parameters, obtains SOTA performance on both Vid4 [33] and Vimeo [32]; (2) on the SPMC-11 [23] dataset, our model is only 0.1 dB lower than STARnet [66] in terms of PSNR, but our STDAN achieves better SSIM [34], which demonstrates that our network can recover more correct structures. In addition, our model needs only one thirteenth of the parameters of STARnet.

Visual comparisons of the different models are displayed in Fig. 6. We observe that our STDAN, with the proposed LSTFI and STDFA modules, restores more accurate structures with fewer motion blur artifacts than other STVSR approaches, which is consistent with the higher PSNR and SSIM values achieved by our model.

IV-D Ablation Study

To investigate the effect of the proposed modules in our STDAN, we conduct comprehensive ablation studies in this section.

TABLE II: Ablation study on the proposed modules. Our long-short term feature interpolation leverages more input LR frames to assist in the interpolation process. The proposed spatial-temporal feature aggregation in the deformable window can adaptively capture spatial-temporal contexts among different frames for HR frame reconstruction. ‘STFA’ indicates spatial-temporal feature aggregation.
| Method | $\Omega_{1}$ | $\Omega_{2}$ | $\Omega_{3}$ | $\Omega_{4}$ | $\Omega_{5}$ |
| --- | --- | --- | --- | --- | --- |
| Parameters (M) | 5.44 | 5.54 | 5.54 | 5.82 | 8.29 |
| Feature Interpolation: Short-term feature interpolation | ✓ | ✓ | ✓ | ✓ | |
| Feature Interpolation: Long-short term feature interpolation | | | | | ✓ |
| Feature Aggregation: STFA in a 1x1 fixed window | | ✓ | | | |
| Feature Aggregation: STFA in a 3x3 fixed window | | | ✓ | | |
| Feature Aggregation: STFA in a deformable window | | | | ✓ | ✓ |
| Vid4 (slow motion) PSNR | 25.27 | 25.69 | 25.85 | 25.97 | 26.28 |
| Vimeo-Fast (fast motion) PSNR | 35.88 | 36.22 | 36.41 | 36.63 | 37.10 |

Feature Aggregation To validate the effect of the proposed spatial-temporal deformable feature aggregation (STDFA) module, we establish a baseline, model $\Omega_{1}$. It only adopts short-term information to perform interpolation and then directly reconstructs HR video frames through the frame feature reconstruction module without any feature aggregation. In contrast, we compare three different models with feature aggregation: $\Omega_{2}$, $\Omega_{3}$ and $\Omega_{4}$. For the spatial-temporal feature aggregation process in model $\Omega_{2}$, illustrated in Fig. 7(a), each feature vector aggregates the information at the same position of the other frame features, that is, the feature vector attends to the valuable spatial content in a $1\times 1$ window. We enlarge the window size to 3 in model $\Omega_{3}$. Considering large motions between frames, a deformable window is applied in model $\Omega_{4}$. As shown in Fig. 7(c), model $\Omega_{4}$ adopts the STDFA module to perform feature aggregation.

Quantitative results on the Vid4 [33] and Vimeo-Fast [32] datasets are shown in Table II. From the table, we observe that: (1) the feature aggregation module improves the reconstruction results; (2) the larger the spatial range of feature aggregation, the more useful information can be captured to enhance the recovery quality of HR frames. Qualitative results of the three models are presented in Fig. 8, which confirms that feature aggregation in a deformable window can acquire more helpful content.

Feature Interpolation To investigate the effect of the proposed long-short term feature interpolation (LSTFI) module, we compare two models: $\Omega_{4}$ and $\Omega_{5}$. As shown in Fig. 3, model $\Omega_{5}$ with LSTFI can exploit the short-term information of two neighboring frames and the long-term information of hidden states from other LSTCs. In comparison, model $\Omega_{4}$ only uses two adjacent frames to interpolate the feature of the intermediate frame. As shown in Table II, combining long-term and short-term information achieves better feature interpolation results, which leads to high-quality HR frames with more details, as illustrated in Fig. 9.

Efficiency of selecting the first $T$ points We also investigate the efficiency of selecting the first $T$ points in our STDFA module. Specifically, the model's inference time per Vimeo sequence without/with the keypoint selection is 0.542 s/0.543 s, which demonstrates that the keypoint selection in our STDFA module does not lead to a significant increase in the inference time of the model.

Figure 10: Visualization of deformable sampling locations. The red star in the frame feature $F_{1}^{*}$ denotes a feature vector, and green stars in the other frame features indicate the corresponding sampling locations of the feature vector. Note that we directly show the sampling locations on the video frames rather than on the frame features, and we only show 4 frames for better illustration.

V Failure Analysis

Although our method outperforms existing SOTA methods, it is not perfect, especially when handling fast-motion videos. As shown in Fig. 10, we find that our deformable attention might sample wrong locations when video motions are fast. The key reason is that the predicted deformable offsets cannot accurately capture relevant visual contexts under large motions.

VI Conclusion

In this paper, we propose a deformable attention network called STDAN for STVSR. Our STDAN can utilize more input video frames for the interpolation process. In addition, the network adopts deformable attention to dynamically capture spatial and temporal contexts among frames to enhance SR reconstruction. Thanks to the LSTFI and STDFA modules, our model demonstrates superior performance to recent SOTA STVSR approaches on public datasets.

References

  • [1] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.
  • [2] U. Mudenagudi, S. Banerjee, and P. K. Kalra, “Space-time super-resolution using graph-cut optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 995–1008, 2010.
  • [3] H. Takeda, P. v. Beek, and P. Milanfar, “Spatiotemporal video upscaling using motion-assisted steering kernel (mask) regression,” in High-Quality Visual Experience.   Springer, 2010, pp. 245–274.
  • [4] O. Shahar, A. Faktor, and M. Irani, Space-time super-resolution from a single video.   IEEE, 2011.
  • [5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision.   Springer, 2020, pp. 213–229.
  • [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [7] M. Chen, H. Peng, J. Fu, and H. Ling, “Autoformer: Searching transformers for visual recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 270–12 280.
  • [8] T. Li, X. He, Q. Teng, Z. Wang, and C. Ren, “Space–time super-resolution with patch group cuts prior,” Signal Processing: Image Communication, vol. 30, pp. 147–165, 2015.
  • [9] L. Zhang, J. Nie, W. Wei, Y. Li, and Y. Zhang, “Deep blind hyperspectral image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2388–2400, 2020.
  • [10] B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 448–10 457.
  • [11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [12] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1833–1844.
  • [13] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu, “Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3370–3379.
  • [14] C. You, L. Han, A. Feng, R. Zhao, H. Tang, and W. Fan, “Megan: Memory enhanced graph attention network for space-time video super-resolution,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1401–1411.
  • [15] K. C. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy, “Understanding deformable alignment in video super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 973–981.
  • [16] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.
  • [17] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 624–632.
  • [18] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2015.
  • [19] E. Shechtman, Y. Caspi, and M. Irani, “Space-time super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 531–545, 2005.
  • [20] M. Tassano, J. Delon, and T. Veit, “Fastdvdnet: Towards real-time deep video denoising without flow estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1354–1363.
  • [21] S. Niklaus and F. Liu, “Context-aware synthesis for video frame interpolation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1701–1710.
  • [22] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive convolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 670–679.
  • [23] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-revealing deep video super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4472–4480.
  • [24] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4778–4787.
  • [25] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz, “Super slomo: High quality estimation of multiple intermediate frames for video interpolation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9000–9008.
  • [26] X. Zhang, R. Jiang, T. Wang, and J. Wang, “Recursive neural network for video deblurring,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 8, pp. 3025–3036, 2020.
  • [27] J. Park, K. Ko, C. Lee, and C.-S. Kim, “Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation,” in European Conference on Computer Vision.   Springer, 2020, pp. 109–125.
  • [28] H. Lee, T. Kim, T.-y. Chung, D. Pak, Y. Ban, and S. Lee, “Adacof: Adaptive collaboration of flows for video frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5316–5325.
  • [29] W. Bao, W.-S. Lai, X. Zhang, Z. Gao, and M.-H. Yang, “Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 3, pp. 933–948, 2019.
  • [30] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive separable convolution,” in 2017 IEEE International Conference on Computer Vision (ICCV).   Venice, Italy: IEEE, 2017, pp. 261–270.
  • [31] W. Bao, W.-S. Lai, C. Ma, X. Zhang, Z. Gao, and M.-H. Yang, “Depth-aware video frame interpolation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   Long Beach, CA, USA: IEEE, 2019, pp. 3698–3707.
  • [32] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
  • [33] C. Liu and D. Sun, “On bayesian adaptive video super resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 346–360, 2013.
  • [34] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [36] G. Xu, J. Xu, Z. Li, L. Wang, X. Sun, and M.-M. Cheng, “Temporal modulation network for controllable space-time video super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6388–6397.
  • [37] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [38] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, “Edvr: Video restoration with enhanced deformable convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.   Long Beach, CA, USA: IEEE, 2019, pp. 1954–1963.
  • [39] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
  • [40] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More deformable, better results,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316.
  • [41] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
  • [42] P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma, “Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3106–3115.
  • [43] W. Li, X. Tao, T. Guo, L. Qi, J. Lu, and J. Jia, “Mucan: Multi-correspondence aggregation network for video super-resolution,” in European Conference on Computer Vision.   Springer, 2020, pp. 335–351.
  • [44] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video super-resolution with convolutional neural networks,” IEEE transactions on computational imaging, vol. 2, no. 2, pp. 109–122, 2016.
  • [45] B. Bare, B. Yan, C. Ma, and K. Li, “Real-time video super-resolution via motion convolution kernel estimation,” Neurocomputing, vol. 367, pp. 236–245, 2019.
  • [46] Y. Zheng, X. Yu, M. Liu, and S. Zhang, “Single-image deraining via recurrent residual multiscale networks,” IEEE transactions on neural networks and learning systems, 2020.
  • [47] H. Wang, D. Su, C. Liu, L. Jin, X. Sun, and X. Peng, “Deformable non-local network for video super-resolution,” IEEE Access, vol. 7, pp. 177 734–177 744, 2019.
  • [48] K. C. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy, “Basicvsr: The search for essential components in video super-resolution and beyond,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4947–4956.
  • [49] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim, “Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3224–3232.
  • [50] S. Li, F. He, B. Du, L. Zhang, Y. Xu, and D. Tao, “Fast spatio-temporal residual network for video super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 522–10 531.
  • [51] L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An, “Learning for video super-resolution through hr optical flow estimation,” in Asian Conference on Computer Vision.   Springer, 2018, pp. 514–529.
  • [52] M. Haris, G. Shakhnarovich, and N. Ukita, “Recurrent back-projection network for video super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3897–3906.
  • [53] Y. Tian, Y. Zhang, Y. Fu, and C. Xu, “Tdan: Temporally-deformable alignment network for video super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3360–3369.
  • [54] X. Ying, L. Wang, Y. Wang, W. Sheng, W. An, and Y. Guo, “Deformable 3d convolution for video super-resolution,” IEEE Signal Processing Letters, vol. 27, pp. 1500–1504, 2020.
  • [55] T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y.-L. Li, S. Wang, and Q. Tian, “Video super-resolution with temporal group attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8008–8017.
  • [56] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV).   Munich, Germany: Springer, 2018, pp. 294–310.
  • [57] X. Cheng and Z. Chen, “Multiple video frame interpolation via enhanced deformable separable convolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [58] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” Advances in neural information processing systems, vol. 28, 2015.
  • [59] M. Choi, H. Kim, B. Han, N. Xu, and K. M. Lee, “Channel attention is all you need for video frame interpolation,” in AAAI.   New York City, NY, USA: AAAI Press, 2020, pp. 10 663–10 671.
  • [60] E. Shechtman, Y. Caspi, and M. Irani, “Increasing space-time resolution in video,” in European Conference on Computer Vision.   Copenhagen, Denmark: Springer, 2002, pp. 753–768.
  • [61] O. Shahar, A. Faktor, and M. Irani, Space-time super-resolution from a single video.   Colorado Springs, CO, USA: IEEE, 2011.
  • [62] B. Zhao and X. Li, “Edge-aware network for flow-based video frame interpolation,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [63] T. Li, X. He, Q. Teng, Z. Wang, and C. Ren, “Space–time super-resolution with patch group cuts prior,” Signal Processing: Image Communication, vol. 30, pp. 147–165, 2015.
  • [64] J. Kang, Y. Jo, S. W. Oh, P. Vajda, and S. J. Kim, “Deep space-time video upsampling networks,” in European Conference on Computer Vision.   Springer, 2020, pp. 701–717.
  • [65] Z. Geng, L. Liang, T. Ding, and I. Zharkov, “Rstt: Real-time spatial temporal transformer for space-time video super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 441–17 451.
  • [66] M. Haris, G. Shakhnarovich, and N. Ukita, “Space-time-aware multi-resolution video enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2859–2868.