
Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition

Shuyang Gu
USTC
[email protected]
&Jianmin Bao
Microsoft Research
[email protected]
&Dong Chen
Microsoft Research
[email protected]
Abstract

A key challenge in video enhancement and action recognition is to fuse useful information from neighboring frames. Recent works suggest establishing accurate correspondences between neighboring frames before fusing temporal information. However, the generated results heavily depend on the quality of correspondence estimation. In this paper, we propose a more robust solution: sampling and fusing multi-level features across neighboring frames to generate the results. Based on this idea, we introduce a new module that improves the capability of 3D convolution, namely, learnable sampling 3D convolution (LS3D-Conv). We add learnable 2D offsets to 3D convolution, which aim to sample locations on the spatial feature maps across frames. The offsets can be learned for specific tasks. LS3D-Conv can flexibly replace 3D convolution layers in existing 3D networks, yielding new architectures that learn the sampling at multiple feature levels. Experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.

1 Introduction

There is increasing interest in video interpolation [43, 60, 20, 35, 33], video super-resolution [28, 42, 22, 41, 27], and video denoising [11, 7, 32, 18]. The aim of these tasks is to recover a high-quality video from an input video suffering from degradation (low bit rate, low resolution, or noise). The key to the success of video interpolation and restoration is to collect and fuse useful information from neighboring frames.

Most existing approaches adopt an align-then-fuse strategy. Optical-flow-based methods [36, 5, 50, 20, 54] align the reference frame and its neighboring frames by explicitly estimating a motion/flow vector for each pixel. However, estimating optical flow remains a challenging problem due to fast-moving objects, occlusions, and motion blur. Besides, flow estimation networks are usually trained on synthetic datasets, so their generalization to the real world is limited. Other approaches achieve implicit motion compensation through dynamic filtering [21, 29] or deformable convolution [46, 53]. Without supervision, it is difficult to estimate accurate motion, and inaccurate motion may cause incorrect fusion of pixels from neighboring frames, resulting in ghosting or blurring.

Recently, non-local neural networks [52] have been applied to video super-resolution [55]. The non-local operation fuses all possible pixels by computing correlations with neighboring frames. However, it usually fuses patches with similar appearance rather than the same semantic instance, and its computational cost for capturing long-range dependencies is relatively heavy.

Figure 1: The middle image is the interpolation result of the two input frames (left and right). We show the sampling locations on the input frames that correspond to each output pixel (red points) from our proposed LS3D-ConvNet. We can observe that the learned sampling positions are more concentrated in regions with small motion but more scattered in areas with large motion.
| Module | Dimension | Learned offset or regular grid | Learned kernel or interpolation |
| 2D Convolution | 2D | regular grid | N/A |
| 3D Convolution | 3D | regular grid | N/A |
| 2D Deformable Conv | 2D | learned offset | bilinear |
| SDC-Net/MEMC-Net | 2D | learned offset | learned kernel |
| LS3D-Conv (ours) | 3D | learned offset | learned kernel + bilinear |
Table 1: Comparison of different methods.

In this paper, we propose a novel end-to-end deep neural network composed of a set of learnable sampling 3D convolution (LS3D-Conv) modules for this task. Our network learns to sample and fuse multi-level features from the neighboring frames to generate the video, which is the key difference from existing work. At every level, LS3D-Conv learns how to sample valuable features from adjacent frames and then automatically fuses the collected feature candidates using a 3D convolution. The aggregated features are used for sampling at the next level. By iterating this process, the network finally outputs a reconstructed frame.

Inspired by the parametrization of 2D deformable convolution [8], the sampling in neighboring frames can be modeled by 2D offsets around each location in those frames. Furthermore, we add an importance scalar to each sampling location to indicate the importance of the feature at that location. At every level, the network directly learns such a set of 2D offsets and an importance scalar for each location.

Compared with traditional 3D convolution [19] or flow-guided convolution [57], LS3D-Conv brings several advantages: (a) In general, flow-based approaches may fail to find accurate correspondences across frames, especially under large motion or motion blur. LS3D-Conv only needs the collected samples to cover valuable (or best-matching) samples, which relaxes the requirement of matching accuracy. (b) LS3D-Conv can learn and update the sampling locations for the target tasks. (c) The sampling strategy learned by LS3D-Conv adapts to the confidence of correspondence estimation, which is more robust than single-level deformable convolution frameworks [46, 53]. As the example in Figure 1 shows, the learned sampling is more concentrated in regions with high-confidence motion estimation and more scattered in areas with large motion or motion blur.

The experiments show the effectiveness of our proposed LS3D-Conv in video interpolation, video super-resolution, and video denoising. Furthermore, the proposed LS3D-Conv operator can also be flexibly applied to popular action recognition backbones to boost their performance.

2 Related work

We briefly summarize the works most related to video enhancement. In general, most of these works can be divided into three types. The first uses optical flow [15, 2] to estimate the correspondence between neighboring frames and generate the result. The second leverages newer techniques (e.g., non-local operations [52] or 2D deformable convolution [8, 59]) to build the correspondence and reconstruct the result. The third directly applies 3D CNNs [47]. Next, we discuss these works in detail.

Flow-based methods. Recent advances in optical flow estimation, such as FlowNet [10], EpicFlow [40], FlowNet2.0 [17], and PWCNet [44], have given rise to many flow-based methods for video enhancement and action recognition. For example, Deep Voxel Flow [30] proposes an end-to-end deep network that learns to borrow voxels from nearby frames for video frame synthesis. For high-frame-rate video interpolation, Super SloMo [20] proposes an end-to-end convolutional neural network for variable-length multi-frame video interpolation. The more recent TOFlow [54] proposes task-oriented flow for accurate motion estimation, where the flow is learned from the target tasks. Although the flow estimation can be learned from the target tasks, it still suffers from fast-moving objects, occlusions, and motion blur.

Build correspondence beyond flow. Instead of relying on flow estimation to build the correspondence, several works have incorporated a learned motion estimation network into burst processing [45, 21, 13, 12]. The recently proposed non-local blocks have also been applied to build long-range correspondences for video super-resolution [55] and video denoising [9]. From another point of view, video interpolation can also be formulated as a convolution operation. Following this idea, AdaConv [34] and SepConv [35] estimate spatially-adaptive convolutional kernels for each output pixel. Moreover, recent works [46, 53, 48, 51] adopt deformable convolution to align the features between two adjacent frames for video super-resolution.

3D CNN based methods. It is natural to extend 2D CNNs to 3D CNNs for video-related tasks. VESPCN [3] applies a spatio-temporal sub-pixel convolution network for real-time video super-resolution. 3DSRNet [24] proposes a 3D CNN based framework for video super-resolution. The more recent FSTRN [27] proposes an efficient 3D convolution based operator that achieves impressive video super-resolution results. We highlight the differences between our method and these approaches in Table 1.

We also note several approaches that do not belong to these types. For example, DAIN [1] uses extra information in videos, such as depth, for video interpolation. IM-Net [37] proposes an end-to-end framework for high-resolution video frame interpolation and formulates interpolated motion estimation as a classification problem. DeepSR-ITM [25] proposes a joint super-resolution and inverse tone-mapping framework that is able to restore fine details in videos.

Figure 2: Illustration of our proposed LS3D-ResBlock and VI-LS3D-ConvNet architectures for video interpolation. The S after each convolution or deconvolution denotes the strides used along the T, H, and W dimensions, respectively.

3 Methods

In this paper, we propose a novel end-to-end framework composed of a set of learnable sampling 3D convolution (LS3D-Conv) modules for video enhancement and action recognition. The proposed framework aims to sample and fuse multi-level features from neighboring frames for the target task. We first take the video interpolation task as an example and introduce the proposed architecture. Then we describe how our proposed LS3D-Conv learns to sample and fuse valuable features for the next level. Finally, we discuss and clarify the relations and differences with some existing works.

3.1 Learnable Sampling 3D Convolution Network

Although 3D CNNs [19, 39] are widely used in video-related tasks, we notice that few 3D CNN architectures have been proposed for video interpolation. For the integrity of the paper, we first detail the network architecture of the proposed method for video interpolation. As shown in Figure 2(b), our method mainly consists of three components: (1) An encoder-like component consisting of two 3D convolution layers (strides 1, 2, 2 for T, H, W, respectively) to reduce the spatial input size. (2) An interpolation component consisting of 6 learnable sampling 3D-ResBlocks, i.e., LS3D-ResBlocks, which will be introduced later, and two 3D deconvolution layers (strides 2, 1, 1 for T, H, W, respectively). The two 3D deconvolution layers are placed after the second and the fourth LS3D-ResBlocks, respectively, and are used for temporal upsampling. (3) A decoder-like component that uses two 3D deconvolution layers (strides 1, 2, 2 for T, H, W, respectively) to reconstruct the final results. The proposed LS3D-Conv network can interpolate 3 in-between frames for two input frames. Our framework is simple and general. It can also be used for other video tasks, e.g., video super-resolution, video denoising, and video recognition. We only need to change the third component and the temporal stride of the 3D deconvolution layers as needed.

The proposed LS3D-ResBlock structure is shown in Figure 2(a). It is a newly designed residual block, in which we replace the first convolution layer of the 3D-ResBlock with the learnable sampling 3D convolution (LS3D-Conv). The LS3D-Conv layers in the network learn how to sample valuable features from adjacent frames and then automatically fuse the collected feature candidates at multiple levels. We introduce the technical details of how to learn to sample and fuse valuable features at each level in the following section.
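A minimal PyTorch sketch of the LS3D-ResBlock described above is given below. A plain nn.Conv3d serves as a stand-in for LS3DConv3d (a full sketch of that layer follows Eq. (3)) so the snippet runs on its own; the channel width and activation choices are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn


# Stand-in for the learnable sampling 3D convolution introduced in Section 3.2.
def LS3DConv3d(in_channels, out_channels):
    return nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)


class LS3DResBlock(nn.Module):
    """3D residual block whose first convolution is replaced by LS3D-Conv."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = LS3DConv3d(channels, channels)      # learnable sampling conv
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection around (LS3D-Conv -> ReLU -> 3D-Conv).
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))


# Six such blocks, interleaved with two temporal deconvolutions and wrapped by
# the spatial encoder/decoder of Figure 2(b), form the VI-LS3D-ConvNet.
block = LS3DResBlock(64)
print(block(torch.randn(1, 64, 4, 32, 32)).shape)        # torch.Size([1, 64, 4, 32, 32])
```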

3.2 Learnable sampling 3D convolutions

The main novelty of learnable sampling 3D convolution lies in the learnable process of sampling and fusing the input features of neighboring frames. We describe these two steps in detail below.

Sampling. Let us first consider a conventional 3D convolution operator. Given input feature maps $\{\mathbf{x}_{t}\}$, where $\mathbf{x}_{t}\in\mathbb{R}^{C\times H\times W}$ is the feature of the $t$-th frame, suppose the 3D convolution kernel is $\mathbf{w}\in\mathbb{R}^{C\times 3\times 3\times 3}$. For an output spatial location $\mathbf{p}$, the sampling locations of the traditional 3D convolution are $\mathbf{p}+\mathbf{p}^{n}$ on the input feature map of frame $t+\tau$, where $\mathbf{p}^{n}$ enumerates the $3\times 3$ spatial locations of a standard 2D convolution, $\mathbf{p}^{n}\in\{(-1,-1),(-1,0),\ldots,(1,1)\}$, and $\tau$ enumerates the temporal dimension, $\tau\in\{-1,0,1\}$.

However, sampling features at a fixed grid around location $\mathbf{p}$ across neighboring frames usually fails to provide valuable information for reconstruction when motion occurs. To address this problem, we add 2D offsets $\Delta\mathbf{p}^{n}_{t+\tau}$ to the sampling locations $\mathbf{p}^{n}$ at frame $t+\tau$. Thus the sampling positions become $\mathbf{p}+\mathbf{p}^{n}+\Delta\mathbf{p}^{n}_{t+\tau}$.

In this case, the sampling locations on the input feature map at frame $t+\tau$ can be irregular locations $\mathbf{p}+\mathbf{p}^{n}+\Delta\mathbf{p}_{t+\tau}^{n}$. The offsets $\Delta\mathbf{p}_{t+\tau}^{n}$ are obtained by directly adding a traditional 3D convolution layer on the input feature map, so the sampling locations can be learned from the input feature map.
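The offset branch is easy to picture in code. The sketch below shows a plain 3D convolution on the input feature map producing the offsets: a 3x3x3 kernel has 27 sampling taps per output location, each needing a (dy, dx) pair, hence 2*27 = 54 offset channels. The variable names and the zero initialization (so training starts from the regular grid) are our assumptions for illustration.

```python
import torch
import torch.nn as nn

C, T, H, W = 64, 4, 32, 32
x = torch.randn(1, C, T, H, W)                       # input feature map

offset_conv = nn.Conv3d(C, 2 * 27, kernel_size=3, padding=1)
nn.init.zeros_(offset_conv.weight)                   # regular grid at the start
nn.init.zeros_(offset_conv.bias)

offsets = offset_conv(x)                             # (1, 54, T, H, W): one (dy, dx) per tap
print(offsets.shape)
```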

Since the offsets $\Delta\mathbf{p}_{t+\tau}^{n}$ may be fractional, the corresponding feature $\mathbf{x}_{t+\tau}(\mathbf{\widetilde{p}})$, where $\mathbf{\widetilde{p}}=\mathbf{p}+\mathbf{p}^{n}+\Delta\mathbf{p}_{t+\tau}^{n}$, is computed by bilinear interpolation with a sampling kernel $G$:

\mathbf{x}_{t+\tau}(\mathbf{\widetilde{p}})=\sum_{\mathbf{q}}G(\mathbf{q},\mathbf{\widetilde{p}})\cdot\mathbf{x}_{t+\tau}(\mathbf{q}). \qquad (1)

Here $\mathbf{q}$ enumerates all integral spatial locations in the feature map $\mathbf{x}_{t+\tau}$, and $G$ is the bilinear interpolation kernel.
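The sketch below illustrates Eq. (1) with PyTorch: features at fractional positions are gathered by bilinear interpolation, which torch.nn.functional.grid_sample implements once the positions are normalized to [-1, 1]. The tensor shapes and example coordinates are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

C, H, W = 8, 16, 16
feat = torch.randn(1, C, H, W)                 # x_{t+tau}: one frame's feature map

# Fractional sampling positions (y, x) in pixel coordinates, e.g. p + p^n + delta p.
pos = torch.tensor([[3.4, 7.9], [10.2, 2.5]])  # two sample points

# Normalize to grid_sample's [-1, 1] convention (x first, then y).
norm_x = 2.0 * pos[:, 1] / (W - 1) - 1.0
norm_y = 2.0 * pos[:, 0] / (H - 1) - 1.0
grid = torch.stack([norm_x, norm_y], dim=-1).view(1, 1, -1, 2)

sampled = F.grid_sample(feat, grid, mode='bilinear', align_corners=True)
print(sampled.shape)                           # (1, C, 1, 2): one feature vector per position
```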

Fusing. After sampling the input features at positions $\mathbf{p}+\mathbf{p}^{n}+\Delta\mathbf{p}^{n}_{t+\tau}$, the next step is to fuse these features. Here we use the 3D convolution operator for feature fusion, so the output feature $\mathbf{y}_{t}(\mathbf{p})$ is calculated as:

\mathbf{y}_{t}(\mathbf{p})=\sum_{\tau=-1}^{1}\sum_{n=1}^{9}\mathbf{w}_{\tau}(\mathbf{p}^{n})\,\mathbf{x}_{t+\tau}(\mathbf{p}+\mathbf{p}^{n}+\Delta\mathbf{p}_{t+\tau}^{n}), \qquad (2)

where $\mathbf{w}_{\tau}$ is the kernel weight for frame $t+\tau$. Besides, the sampled features may contribute differently to the final result, so we use an independent importance scalar $\mathbf{m}^{n}_{t+\tau}$ in the range $[0,1]$ for each sampling location across frames; Eq. (2) then becomes

\mathbf{y}_{t}(\mathbf{p})=\sum_{\tau=-1}^{1}\sum_{n=1}^{9}\mathbf{m}^{n}_{t+\tau}\,\mathbf{w}_{\tau}(\mathbf{p}^{n})\,\mathbf{x}_{t+\tau}(\mathbf{p}+\mathbf{p}^{n}+\Delta\mathbf{p}_{t+\tau}^{n}). \qquad (3)

The importance scalars can also be obtained by directly adding a traditional 3D convolution layer on the same input feature map. During training, the offsets and the importance scalars are learned from the target tasks.
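To make the two steps concrete, the following is a self-contained PyTorch sketch of one LS3D-Conv layer written directly from Eqs. (1)-(3): ordinary 3D convolutions predict the offsets and importance scalars, features are gathered by bilinear sampling, and the 27 taps are fused with learned per-tap weights. This is a naive reference implementation under our own assumptions (module and variable names, sigmoid for the importance scalars, zero initialization of both branches), not the authors' released code, and it favors readability over speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LS3DConv3d(nn.Module):
    """Naive reference sketch of a 3x3x3 learnable sampling 3D convolution."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.k = 27                                    # 3 (T) x 3 (H) x 3 (W) taps
        self.out_channels = out_channels
        # One (C_out x C_in) weight matrix per tap (the w_tau(p^n) of Eq. (3)).
        self.weight = nn.Parameter(torch.randn(self.k, out_channels, in_channels) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_channels))
        # Offset (2 per tap) and importance (1 per tap) branches on the input.
        self.offset_conv = nn.Conv3d(in_channels, 2 * self.k, 3, padding=1)
        self.mask_conv = nn.Conv3d(in_channels, self.k, 3, padding=1)
        for m in (self.offset_conv, self.mask_conv):   # start from the regular grid
            nn.init.zeros_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x):
        n, c, t, h, w = x.shape
        offsets = self.offset_conv(x).view(n, self.k, 2, t, h, w)
        masks = torch.sigmoid(self.mask_conv(x))       # importance scalars in [0, 1]
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=x.dtype, device=x.device),
            torch.arange(w, dtype=x.dtype, device=x.device), indexing='ij')
        x_pad = F.pad(x, (0, 0, 0, 0, 1, 1))           # zero-pad the temporal border
        out = x.new_zeros(n, self.out_channels, t, h, w)
        k = 0
        for tau in (-1, 0, 1):                         # temporal taps
            x_tau = x_pad[:, :, 1 + tau: 1 + tau + t]  # frame t+tau for every t
            frames = x_tau.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
            for dy in (-1, 0, 1):                      # spatial taps p^n
                for dx in (-1, 0, 1):
                    off = offsets[:, k]                # (n, 2, t, h, w)
                    py = ys + dy + off[:, 0]           # p + p^n + delta p (rows)
                    px = xs + dx + off[:, 1]           # p + p^n + delta p (cols)
                    grid = torch.stack([2 * px / (w - 1) - 1,
                                        2 * py / (h - 1) - 1], dim=-1)
                    sampled = F.grid_sample(frames, grid.reshape(n * t, h, w, 2),
                                            mode='bilinear', padding_mode='zeros',
                                            align_corners=True)      # Eq. (1)
                    sampled = sampled.view(n, t, c, h, w).permute(0, 2, 1, 3, 4)
                    sampled = sampled * masks[:, k:k + 1]             # importance
                    out += torch.einsum('oc,ncthw->nothw', self.weight[k], sampled)
                    k += 1
        return out + self.bias.view(1, -1, 1, 1, 1)


layer = LS3DConv3d(16, 32)
print(layer(torch.randn(2, 16, 4, 24, 24)).shape)      # torch.Size([2, 32, 4, 24, 24])
```

In practice the 27 per-tap grid_sample calls would be batched into a single gather for efficiency; the loop form is kept here only to mirror the double sum of Eq. (3).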

3.3 In Context of Related Works

In this section, we detail the relations and differences between our proposed method and some existing works.

Particle Filter [14]. The particle filter is a popular technique for tracking tasks in videos. It basically goes through four steps: (1) initialization; (2) measurement update; (3) resampling; and (4) prediction. Our proposed LS3D-Conv network shares similar steps at inference time: (1) at the first LS3D-Conv layer, we obtain an initial set of sampling locations; (2) we fuse the sampled features and obtain an updated feature map; (3) based on the new feature map, the next LS3D-Conv layer updates the sampling locations so that they are more suitable for the target; and (4) steps (2)-(3) are repeated layer by layer until the end of the network.

Deformable Convolution [8, 58]. Deformable ConvNets augment the sampling locations of 2D convolution with learnable offsets and modulations to handle geometric variations. Technically, our work can be viewed as extending 2D deformable convolution to 3D while retaining 2D offsets. However, there are three differences compared to [8, 58]. (1) The function of the offsets is different: our LS3D-Conv aims to sample meaningful features across neighboring frames instead of handling geometric variations as in [8, 58]. (2) Our offsets are not motion vectors. Compared with [46, 53], which use deformable convolution to align features across frames, the offsets learned by a single LS3D-Conv layer may seem meaningless; but when multiple LS3D-Conv layers work together, our method samples a large area to find useful features, as shown in Figure 1. (3) To our knowledge, we are the first to successfully use deformable convolution for the video interpolation task.

TrajectoryNet [57]. This work applies estimated flow to 3D CNNs so that motion features can be aggregated along motion paths. There are three main differences between this work and our method: (1) The sampling locations in different layers of trajectory convolution are the same, while our method uses different offsets for different layers; in Section 4.1, we show the effectiveness of using multiple LS3D-Conv layers. (2) In trajectory convolution, the estimated trajectories are obtained by a separate motion estimation network, which introduces a large additional computation cost; the approach used in our method to obtain offsets is much more efficient. (3) As mentioned above, the offsets learned by our method are not motion vectors.

4 Experiments

In this section, we first analyze the behavior of the LS3D-Conv network on the video interpolation task. Then we show results on video interpolation, video super-resolution, and video denoising. Finally, we show that our proposed LS3D-Conv can flexibly replace 3D-Conv in action recognition backbones to boost their performance.

Figure 3: Each image triplet shows the two input frames and the interpolation results of the 3D-Conv network and the LS3D-Conv network. We present the sampling locations on the input frames that correspond to the output pixel (red points) from the two models under large motion (left) and small motion (right).

4.1 Understanding LS3D-ConvNets

To analyze and understand the behavior of the LS3D-Conv network, we take the video interpolation task as an example and examine the learned sampling locations across frames. We train a baseline 3D-Conv network and compare it with the proposed LS3D-Conv network. All experiments are conducted on the MLB-Youtube dataset [38]. Please refer to Section 4.2 for more implementation details.

Visualization of the sampling locations. To obtain the learned sampling locations for each model, we back-propagate from a chosen position on the interpolation output to obtain the gradient on the two input images. The gradient magnitude approximates the sampling frequency [31]. Figure 3 illustrates several examples of the learned sampling locations of the 3D-ConvNet and the VI-LS3D-ConvNet. We observe that the sampling positions of the 3D-ConvNet stay almost around the chosen positions. In contrast, the learned sampling locations of the LS3D-ConvNet adapt to different motions. When the output pixel lies on an object with large motion, e.g., the leg in Figure 3, our method samples a large number of candidate positions and fuses the valuable information to reconstruct the result. When the output pixel lies on an object with small motion, our learnable sampling convolution samples a small number of candidate positions around the output pixel.
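The visualization procedure amounts to a standard gradient (saliency) map on the inputs. A minimal sketch is given below; `model` is assumed to be any trained interpolation network taking a (N, 3, 2, H, W) pair of frames, which is an assumption about the interface rather than the released code.

```python
import torch


def sampling_map(model, frames, t_out, y, x):
    """Gradient of one output pixel w.r.t. the two input frames."""
    frames = frames.clone().requires_grad_(True)   # (1, 3, 2, H, W)
    out = model(frames)                            # (1, 3, T_out, H, W)
    # Sum over channels of the chosen output position, then back-propagate.
    out[0, :, t_out, y, x].sum().backward()
    # Per-pixel gradient magnitude on each of the two input frames,
    # which approximates where the network sampled from [31].
    return frames.grad.abs().sum(dim=1)[0]         # (2, H, W)
```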

Which stage to use LS3D-Conv? We compare the results of adding LS3D-Conv to different stages. We apply LS3D-ResBlocks at different layers and measure the video interpolation error. As shown in Table 2, LS3D-Conv at deeper layers yields better results. One possible explanation is that features from deeper layers contain stronger semantic information, which helps to sample the correct positions on the neighboring frames.

Multi-level feature sampling leads to better results. Table 3 shows the results of using LS3D-Conv on multi-level features. We train three models with different numbers of LS3D-Conv layers: 2 LS3D-Conv layers (res5,6), 4 LS3D-Conv layers (res3,4,5,6), and all 6 LS3D-Conv layers in the backbone. We find that with more LS3D-Conv layers, the model achieves better results.

| Model | PSNR | SSIM |
| baseline | 29.35 | 0.942 |
| res1,2 | 29.70 | 0.943 |
| res3,4 | 30.45 | 0.951 |
| res5,6 | 30.97 | 0.954 |
Table 2: Comparison of results when adding learnable sampling 3D convolution into different stages.
| Model | PSNR | SSIM |
| baseline | 29.35 | 0.942 |
| 2-LS3D-Conv | 30.97 | 0.954 |
| 4-LS3D-Conv | 31.70 | 0.962 |
| 6-LS3D-Conv | 31.98 | 0.964 |
Table 3: Comparison of results when adding 2, 4, and 6 learnable sampling 3D convolution layers into the model.
Figure 4: Qualitative comparison of video interpolation results from SepConv [35], Super SloMo [20], DAIN [1], and our method on the MLB-Youtube dataset [38] and the Gait Analysis dataset [49].

4.2 LS3D-ConvNets for Video Interpolation

| Methods | Cityscapes | MLB-baseball | Gait |
| SepConv [35] | 27.32/0.819 | 28.87/0.926 | 33.01/0.947 |
| Super SloMo [20] | 26.98/0.807 | 28.38/0.914 | 32.70/0.932 |
| DAIN [1] | 27.48/0.830 | 30.31/0.954 | 35.02/0.964 |
| Ours | 27.62/0.837 | 31.98/0.964 | 35.82/0.978 |
Table 4: Comparison with state-of-the-art methods on three datasets for the video interpolation task. We report the PSNR/SSIM score of each method.
| Methods | Dancing | Treadmill | Flag | Fan | Turbine | Average |
| SRGAN [26] | 27.91 / 0.87 | 22.61 / 0.73 | 28.71 / 0.83 | 34.25 / 0.94 | 27.84 / 0.81 | 29.20 / 0.84 |
| BRCN [16] | 28.08 / 0.88 | 22.67 / 0.74 | 28.86 / 0.84 | 34.15 / 0.94 | 27.63 / 0.82 | 29.16 / 0.85 |
| VESPCN [3] | 27.89 / 0.86 | 22.46 / 0.74 | 29.01 / 0.85 | 34.40 / 0.94 | 28.19 / 0.83 | 29.40 / 0.85 |
| FSTRN [27] | 28.66 / 0.89 | 23.06 / 0.76 | 29.81 / 0.88 | 34.79 / 0.95 | 28.57 / 0.84 | 29.95 / 0.87 |
| Ours | 29.06 / 0.91 | 23.13 / 0.77 | 29.98 / 0.89 | 35.07 / 0.95 | 28.82 / 0.85 | 30.17 / 0.88 |
Table 5: Comparison of results on the 25 YUV format benchmark. Performance is measured by PSNR/SSIM.
| Methods | PSNR | SSIM |
| Bicubic | 29.79 | 0.8483 |
| TOFlow [54] | 33.08 | 0.9054 |
| RCAN [56] | 33.61 | 0.9101 |
| DUF [21] | 34.33 | 0.9227 |
| 3D-ConvNets | 33.25 | 0.9133 |
| Ours | 34.90 | 0.9295 |
Table 6: Quantitative comparison with state-of-the-art video super-resolution methods on the Vimeo-90K dataset.
| Methods | Gaussian-15 | Gaussian-25 |
| Fixed Flow [54] | 36.25/0.9626 | 34.74/0.9411 |
| TOFlow [54] | 36.63/0.9628 | 34.89/0.9518 |
| 3D-ConvNets | 36.35/0.9645 | 34.67/0.9452 |
| Ours | 36.90/0.9732 | 35.23/0.9605 |
Table 7: Comparison of our approach with existing methods on the video denoising task. Performance is measured by PSNR/SSIM.

Training datasets. We conduct video interpolation experiments on three datasets: the Cityscapes dataset [6], the MLB-Youtube dataset [38], and the Gait Analysis dataset [49]. Cityscapes contains 2,974 17-fps video clips, each with 30 frames. The MLB-Youtube dataset consists of 4,290 30-fps video clips from 20 baseball games; each clip contains diverse baseball activities such as swing, hit, ball, and strike. The gait dataset consists of 240 30-fps videos from 20 persons, and each person has 12 video sequences captured from different directions. The input and output resolutions for these three datasets are $256\times 512$, $360\times 720$, and $240\times 352$ in our experiments, respectively. For all these datasets, we predict the three in-between frames given two input frames.

Training details. For the training loss, we use an $\mathcal{L}_{1}$ loss and two kinds of adversarial losses, a 2D adversarial loss and a 3D adversarial loss, to generate more realistic and coherent sequences. Please refer to the supplementary material for more implementation details.
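A hedged sketch of such a combined objective is shown below: an L1 reconstruction term plus a 2D (per-frame) and a 3D (clip-level) adversarial term. The discriminators `d2d` and `d3d`, the BCE-style GAN loss, and the weight `w_adv` are assumptions made for illustration; the paper's exact losses and weights are given in its supplementary material.

```python
import torch
import torch.nn.functional as F


def generator_loss(pred, target, d2d, d3d, w_adv=0.01):
    # pred / target: interpolated and ground-truth clips, (N, 3, T, H, W).
    l1 = F.l1_loss(pred, target)

    # The 2D discriminator scores individual frames ...
    frames = pred.permute(0, 2, 1, 3, 4).flatten(0, 1)   # (N*T, 3, H, W)
    s2d = d2d(frames)
    adv2d = F.binary_cross_entropy_with_logits(s2d, torch.ones_like(s2d))

    # ... while the 3D discriminator scores whole clips for temporal coherence.
    s3d = d3d(pred)
    adv3d = F.binary_cross_entropy_with_logits(s3d, torch.ones_like(s3d))

    return l1 + w_adv * (adv2d + adv3d)
```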

Figure 4 presents qualitative comparisons with state-of-the-art methods on the MLB-Youtube and Gait Analysis datasets; in this figure, we only visualize the middle frame. We find that our method handles large motion and occlusion better. Table 4 reports the quantitative comparison; our method achieves better performance than the state-of-the-art methods. More video results are presented in the supplementary material.

4.3 LS3D-ConvNets for Video Super-Resolution

Training datasets. We conduct video super-resolution experiments on two datasets: the 25 YUV format benchmark and Vimeo-90K [54]. The 25 YUV format benchmark contains 25 YUV sequences for training and 5 sequences for testing, and has been used previously in [16, 26, 3, 27]. The Vimeo-90K dataset is a large, high-quality, and diverse dataset for video super-resolution, video denoising, and other video restoration tasks. Its super-resolution benchmark consists of 91,701 7-frame sequences with a fixed resolution of $448\times 256$, extracted from 39K selected video clips of Vimeo-90K. For both datasets, we conduct our experiments with an upsampling scale of 4.

Training details. For the 25 YUV format benchmark, we follow the training and evaluation settings used in FSTRN [27] and use a P3D based framework as the backbone. We directly replace the $3\times 1\times 1$ 3D-Conv layers with LS3D-Conv. For the Vimeo-90K dataset [54], we use a model based on an image super-resolution model but with fewer ResBlocks; the sub-pixel layer [42] is also applied in the network. Please refer to the supplementary material for more details.

Table 5 shows the quantitative comparison on the 25 YUV format benchmark. Our model achieves higher PSNR and SSIM scores, which demonstrates the effectiveness of learnable sampling.

4.4 LS3D-ConvNets for Video Denoising

Training datasets and details. We conduct the video denoising experiment on the Vimeo-90K denoising benchmark [54], which uses the same sequences as the Vimeo-90K super-resolution benchmark. We train and evaluate our method following TOFlow [54] with two kinds of noise: Gaussian noise with a standard deviation of 15 intensity levels (Gaussian-15) and Gaussian noise with a standard deviation of 25 (Gaussian-25).
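The noisy inputs for these two settings can be synthesized as in the minimal sketch below (additive Gaussian noise on 8-bit intensities, clipped to the valid range); the function name and the clipping choice are illustrative assumptions.

```python
import torch


def add_gaussian_noise(frames, sigma):
    """frames: float tensor in [0, 255]; returns a noisy copy, clipped to range."""
    noise = torch.randn_like(frames) * sigma   # sigma = 15 or 25 intensity levels
    return (frames + noise).clamp(0.0, 255.0)
```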

In Table 7, we quantitatively compare our method with Fixed Flow [54], TOFlow [54], and our baseline 3D-ConvNets. Our method achieves higher PSNR and SSIM scores.

| Methods | Top1 | Top5 | Params | GFLOPs |
| I3D | 72.0% | 89.9% | 28.04M | 86.59 |
| I3D+LS3D-Conv | 72.9% | 90.6% | 28.21M | 87.13 |
| I3D+non-local | 73.6% | 91.0% | 35.40M | 100.10 |
| I3D+non-local+LS3D-Conv | 74.2% | 91.2% | 35.57M | 100.64 |
Table 8: Action recognition results on Kinetics-400, reported on the validation set. We use the ResNet-50 I3D model as our baseline.

4.5 LS3D-ConvNets for Action Recognition

We also investigate our proposed LS3D-ConvNets for the action recognition task on the Kinetics-400 [23] dataset. Kinetics-400 contains ~246k training videos and 20k validation videos covering 400 human action categories. We train all models on the training set and evaluate on the validation set following [4]. We choose two baseline models: ResNet-50 I3D [4] and ResNet-50 I3D with non-local blocks [52]. In both baselines, we replace the $3\times 1\times 1$ 3D convolutions in all bottleneck structures with our proposed learnable sampling 3D convolution. We train our models using the weights of the corresponding baseline model as initialization. Training runs for 45 epochs with an initial learning rate of 0.001, which is divided by 10 every 15 epochs; the input size is $32\times 224\times 224$.
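The schedule stated above corresponds to a standard step decay; a minimal sketch follows. The optimizer (SGD with momentum) and the stand-in model are assumptions made so the snippet runs on its own; the real backbone is ResNet-50 I3D fine-tuned from the baseline weights.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(3, 8, kernel_size=3)               # stand-in for the I3D backbone
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(45):
    # ... one pass over Kinetics-400 with 32x224x224 clips would go here ...
    optimizer.step()                                  # placeholder update
    scheduler.step()                                  # lr: 1e-3 -> 1e-4 -> 1e-5
```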

Table 8 shows the results of the baseline models and the models with the proposed LS3D-Conv. The computational cost (GFLOPs) and the number of learnable parameters are also reported. Our model (+LS3D-Conv) obtains a 0.9% top-1 accuracy improvement over the simple baseline with negligible additional overhead. Adding non-local blocks (+non-local) yields a larger improvement over the baseline, but comes with a considerable increase in computation cost and model size. Our model is also compatible with non-local blocks: the combination (+non-local+LS3D-Conv) gains a further 0.6% top-1 accuracy over the non-local model (+non-local), which further demonstrates the effectiveness of our proposed LS3D-ConvNets.

5 Conclusion

In this paper, we address video enhancement and action recognition by sampling and fusing multi-level features. To achieve this, we propose a new operator, LS3D-Conv, with learnable 2D offsets and importance scalars. The sampling locations and the importance scalars can be learned from the target tasks, and LS3D-Conv can flexibly replace 3D convolution in various tasks. The experiments show the superiority of LS3D-Conv in video enhancement and action recognition, and the new operator can also be applied to popular action recognition frameworks to boost their performance.

References

  • [1] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3703–3712, 2019.
  • [2] John L Barron, David J Fleet, and Steven S Beauchemin. Performance of optical flow techniques. International journal of computer vision, 12(1):43–77, 1994.
  • [3] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
  • [4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [5] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 1105–1114, 2017.
  • [6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [7] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing, 16(8):2080–2095, 2007.
  • [8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [9] Axel Davy, Thibaud Ehret, Jean-Michel Morel, Pablo Arias, and Gabriele Facciolo. A non-local cnn for video denoising. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2409–2413. IEEE, 2019.
  • [10] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
  • [11] Rakesh Dugad and Narendra Ahuja. Video denoising by combining kalman and wiener estimates. In Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348), volume 4, pages 152–156. IEEE, 1999.
  • [12] Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, and Lu Yuan. Mask-guided portrait editing with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3436–3445, 2019.
  • [13] Shuyang Gu, Congliang Chen, Jing Liao, and Lu Yuan. Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8222–8231, 2018.
  • [14] Fredrik Gustafsson, Fredrik Gunnarsson, Niclas Bergman, Urban Forssell, Jonas Jansson, Rickard Karlsson, and P-J Nordlund. Particle filters for positioning, navigation, and tracking. IEEE Transactions on signal processing, 50(2):425–437, 2002.
  • [15] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
  • [16] Yan Huang, Wei Wang, and Liang Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems, pages 235–243, 2015.
  • [17] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
  • [18] Hui Ji, Chaoqiang Liu, Zuowei Shen, and Yuhong Xu. Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1791–1798. IEEE, 2010.
  • [19] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2012.
  • [20] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9000–9008, 2018.
  • [21] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3232, 2018.
  • [22] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
  • [23] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [24] Soo Ye Kim, Jeongyeon Lim, Taeyoung Na, and Munchurl Kim. 3dsrnet: Video super-resolution using 3d convolutional neural networks. arXiv preprint arXiv:1812.09079, 2018.
  • [25] Soo Ye Kim, Jihyong Oh, and Munchurl Kim. Deep sr-itm: Joint learning of super-resolution and inverse tone-mapping for 4k uhd hdr applications. arXiv preprint arXiv:1904.11176, 2019.
  • [26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • [27] Sheng Li, Fengxiang He, Bo Du, Lefei Zhang, Yonghao Xu, and Dacheng Tao. Fast spatio-temporal residual network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10522–10531, 2019.
  • [28] Ce Liu and Deqing Sun. A bayesian approach to adaptive video super resolution. In CVPR 2011, pages 209–216. IEEE, 2011.
  • [29] Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 2507–2515, 2017.
  • [30] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 4463–4471, 2017.
  • [31] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems, pages 4898–4906, 2016.
  • [32] Mona Mahmoudi and Guillermo Sapiro. Fast image and video denoising via nonlocal means of similar neighborhoods. IEEE signal processing letters, 12(12):839–842, 2005.
  • [33] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1710, 2018.
  • [34] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 670–679, 2017.
  • [35] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 261–270, 2017.
  • [36] Katsunori Ohnishi, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Hierarchical video generation from orthogonal information: Optical flow and texture. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [37] Tomer Peleg, Pablo Szekely, Doron Sabo, and Omry Sendik. Im-net for high resolution video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2398–2407, 2019.
  • [38] AJ Piergiovanni and Michael S. Ryoo. Fine-grained activity recognition in baseball videos. In CVPR Workshop on Computer Vision in Sports, 2018.
  • [39] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
  • [40] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1164–1172, 2015.
  • [41] Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.
  • [42] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • [43] Gary J Sullivan and Richard L Baker. Motion compensation for video compression using control grid interpolation. In [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, pages 2713–2716. IEEE, 1991.
  • [44] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  • [45] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 4472–4480, 2017.
  • [46] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally deformable alignment network for video super-resolution. arXiv preprint arXiv:1812.02898, 2018.
  • [47] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [48] Hua Wang, Dewei Su, Longcun Jin, and Chuangchuang Liu. Deformable non-local network for video super-resolution. arXiv preprint arXiv:1909.10692, 2019.
  • [49] Liang Wang, Tieniu Tan, Huazhong Ning, and Weiming Hu. Silhouette analysis-based gait recognition for human identification. IEEE transactions on pattern analysis and machine intelligence, 25(12):1505–1518, 2003.
  • [50] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
  • [51] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [52] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [53] Xiangyu Xu, Muchen Li, and Wenxiu Sun. Learning deformable kernels for image and video denoising. arXiv preprint arXiv:1904.06903, 2019.
  • [54] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
  • [55] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE International Conference on Computer Vision, pages 3106–3115, 2019.
  • [56] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
  • [57] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Trajectory convolution for action recognition. In Advances in Neural Information Processing Systems, pages 2204–2215, 2018.
  • [58] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.
  • [59] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017.
  • [60] C Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. In ACM transactions on graphics (TOG), volume 23, pages 600–608. ACM, 2004.