
Blur More To Deblur Better: Multi-Blur2Deblur For Efficient Video Deblurring

Dongwon Park1     Dong Un Kang1     Se Young Chun
Department of Electrical Engineering
UNIST, Republic of Korea
{ dong1, qkrtnskfk23, sychun }@unist.ac.kr
Abstract

One of the key components for video deblurring is how to exploit neighboring frames. Recent state-of-the-art methods either used adjacent frames aligned to the center frame or recurrently propagated the information of past frames to the current frame. Here we propose multi-blur-to-deblur (MB2D), a novel concept for exploiting neighboring frames for efficient video deblurring. Firstly, inspired by unsharp masking, we argue that using more blurred images with long exposures as additional inputs significantly improves performance. Secondly, we propose a multi-blurring recurrent neural network (MBRNN) that can synthesize more blurred images from neighboring frames, yielding substantially improved performance with existing video deblurring methods. Lastly, we propose multi-scale deblurring with a connecting recurrent feature map from MBRNN (MSDR) to achieve state-of-the-art performance on the popular GoPro and Su datasets in a fast and memory-efficient manner.

1 Introduction

1 Equal contribution. Corresponding author.

Video deblurring is a highly ill-posed inverse problem that aims to recover the sharp latent image from blurred video frames. Solving it is increasingly important due to the massive amount of video data captured by hand-held devices such as smartphones. A number of factors, such as camera shake, object motion, and depth variation, blur videos non-uniformly and make this inverse problem particularly challenging. Video deblurring is a long-standing computer vision topic. Most non-deep-learning works investigated how to estimate blur kernels and/or latent frames using neighboring video frames [2, 3, 31, 27, 25, 7]. Among them, multi-image blind deblurring methods have been developed to incorporate observations from multiple blurred images that share a common underlying latent image with different blurs [2, 31, 27].

Figure 1: Video deblurring results of Kim [9], STFAN [30], Pan [15], and our MB2D (Ours) on the GoPro dataset [13].

Recent deep learning-based approaches for video deblurring have investigated how to utilize neighboring blurred video frames, largely either by aligning adjacent frames to the reference frame with deep neural networks (DNNs) or by recurrently propagating information about past frames to the reference frame with a recurrent neural network (RNN). One group of works exploits adjacent video frames by warping them to the center frame at the image and/or feature level so that sharp pixel information from other frames can be utilized for deblurring [20, 24, 30, 15]. Since alignment reduces the variation between video frames, DNNs seem to work more efficiently with aligned frames than with unaltered frames, yielding state-of-the-art (SOTA) video deblurring performance. However, temporal alignment has a couple of disadvantages: alignment in video deblurring is itself ill-posed and challenging, so potential alignment errors may cause undesirable deblurring artifacts, and alignment often requires heavy computation and large memory for the warping operation. The other group of works utilizes RNNs to sequentially restore the sharp image from video frames using features from previous steps [9, 14, 30]. Since no alignment operation is needed, these methods have low computation cost, but they yielded slightly lower deblurring performance than SOTA methods developed around the same time.

Here, we propose multi-blur-to-deblur (MB2D), a novel concept on how to exploit neighboring frames for efficient video deblurring as an alternative to achieve both SOTA performance and fast computation with small memory.

First of all, we present the fundamental argument behind our MB2D: using more blurred images with long exposures as additional inputs to video deblurring significantly improves performance. This argument was inspired by classical unsharp masking [10, 17], which enhances high-frequency components via a more blurred input image. We conjecture that if more blurred images with long exposures were available for the same reference frame, they could encode more information about motion and thus potentially improve deblurring performance. Thus, our proposed MB2D consists of two steps: more-blurring (MB) and then deblurring (D).

Secondly, for the MB step of our MB2D, we propose the multi-blurring recurrent neural network (MBRNN) that synthesizes more blurred images (not available during testing) from neighboring blurred frames. Figure 2 illustrates (a) video deblurring with alignment [20, 24, 30, 15] and (b) our MB2D with MBRNN. While the former warps neighboring frames to the reference frame, the latter (MBRNN) progressively synthesizes more blurred images by appending small amounts of motion blur predicted from neighboring frames. Since no alignment is needed, our MBRNN is fast and memory-efficient, while yielding substantially improved performance with existing video deblurring algorithms.

Lastly, for the D step of our MB2D, we propose multi-scale deblurring with a connecting recurrent feature map (CRFM) from MBRNN (MSDR) to achieve SOTA video deblurring performance on the GoPro and Su datasets in a fast and memory-efficient manner. Progressively generated recurrent feature maps further improved video deblurring performance with the synthesized more blurred images, and thus our proposed MSDR with CRFM performed favorably against existing SOTA methods with relatively few parameters and fast computation, as depicted in Figure 1.

The main contributions of this work are summarized as:
\bullet For the first time, we show that using long-exposure images along with the regular-exposure input substantially improves deblurring performance. We then propose MB2D, a novel approach with multi-blur and deblur steps.
\bullet We propose MBRNN that progressively synthesizes more blurred images from the input and its neighboring frames. MBRNN improved the performance of existing methods.
\bullet We propose MSDR for video deblurring that exploits CRFM from MBRNN so that the multi-blur and deblur steps are connected at feature levels.
\bullet Our proposed MB2D with MBRNN and MSDR outperformed other SOTA methods on the GoPro [13] and Su [20] datasets while being fast and parameter-efficient.

Figure 2: Two pre-processing approaches for video deblurring: (a) temporal alignment [20, 24, 30, 15], which warps neighboring frames to the reference frame (e.g., via optical flow); (b) our multi-blurring, which progressively synthesizes more blurred images with long exposures corresponding to the reference frame using adjacent frames.

2 Related Works

Non-DNN multi-image / video deblurring Blind deconvolution of motion blur is challenging, but multiple blurred images often make this inverse problem less ill-posed. Multi-image deblurring has been investigated in the forms of multi-channel deconvolution [31], iterative kernel estimation [2], multi-channel blind deconvolution with augmented Lagrangian optimization [19], homography estimation for image registration [3], and multi-image registration [25]. There have also been works on video deblurring without DNNs, such as patch-based restoration of blurry areas by detecting sharp areas that share the same content in nearby frames [4], simultaneous estimation of the latent image and optical flow by locally approximating the pixel-wise varying kernels with bi-directional optical flow [8], and local deblurring through weighted Fourier accumulation after warping adjacent frames to the reference using optical flow consistency [5].

DNN video deblurring with alignment There have been a number of video deblurring methods that use neighboring frames with temporal alignment. Su et al. proposed to align adjacent frames to the center frame, decreasing the spatial variance across video frames [20]. EDVR employed deformable convolutions to warp the information of adjacent frames to the center frame at the feature level [24]. Zhou et al. proposed STFAN to implicitly estimate dynamic alignment filters that transform the features in a spatio-temporal filter adaptive network [30]. Pan et al. developed a temporal sharpness prior that exploits sharp pixels and constrains the deep CNN model to generate aligned intermediate latent frames with optical flow [15].

RNN video deblurring Kim et al. [9] proposed a method that recurrently estimates the latent image with an RNN using concatenated multi-frame features. Nah et al. [14] developed an RNN that exploits intra-frame and inter-frame iterations with hidden-state updates by reusing RNN cell parameters for video deblurring. Zhong et al. [29] proposed an RNN-based method that globally aggregates spatio-temporal correlations from the high-level spatial features of video frames generated by an RNN cell.

DNN single image deblurring Nah et al. [13] proposed single image deblurring with a multi-scale (MS) approach, Tao et al. [23] proposed to share models across the stages of the MS approach, and Gao et al. [6] developed a partial sharing method considering different levels of blur at each stage of the MS approach. Zhang et al. [26] and Suin et al. [21] independently proposed to adjust the receptive field using multi-patch approaches with coarse-to-fine structures. Zhang et al. [28] proposed a data augmentation method that predicts a realistic blurred image from a single sharp image using a GAN to reduce the gap between synthetic and real blur. Park et al. [16] introduced a progressive deblurring method that recurrently estimates the intermediate and final latent images by exploiting time-resolved deblurring data augmentation.

Our MB2D consists of the MB step and the D step. MBRNN utilized an RNN not for deblurring [9, 14] but for blurring more. Unlike [28], this blurring step encodes the information of neighboring frames for the same reference frame. MSDR employed the MS approach [13] for deblurring, but with a novel CRFM from MBRNN so that the two steps are connected at both image and feature levels.

3 More Blurred Images with Long Exposures

This section investigates the fundamental assumption of our proposed MB2D: using more blurred images with long exposures as additional inputs to video deblurring significantly improves performance. More blurred images with long exposures are not available during testing, but they can be synthesized during training from video deblurring datasets captured with high-speed cameras [13, 20]. While Park et al. [16] generated blurred images with short exposures, we propose to synthesize blurred images with long exposures.

Figure 3: Difference maps between (a) the input blurred image ($B^{11}$) and (b) a more blurred image ($B^{15}$). The more blurred image extracts high-frequency information around edges, both in sharp regions with small motions (c) and in blurred regions with large motions (d).

Dynamic scenes blur dataset Photons are accumulated on a sensor during the exposure time, yielding real blurred images. Simulating this, the blurred image $B$ is generated by the integration of successive sharp images $S$ from high-speed cameras [13, 20] as follows:

B^{n}_{t}=g\left(\frac{1}{n}\sum_{i=-\lfloor n/2\rfloor}^{\lfloor n/2\rfloor}S\left[nt+i\right]\right)   (1)

where $n$ denotes the number of sharp frames (odd, proportional to the exposure time of the blurred image $B^{n}_{t}$), $S[i]$ denotes the $i$th acquired sharp video frame from a high-speed camera, $g$ is a camera response function, and $t$ is an (integer) index for the generated blurred image. A typical video deblurring problem is to predict $S[nt]$ or $B^{1}_{t}$ (the sharp ground truth) from $B^{n}_{t}$ and its neighboring blurred frames $B^{n}_{t-1}$ and $B^{n}_{t+1}$. Note that as $n$ increases, $B^{n}_{t}$ becomes more blurred.
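To make (1) concrete, the following minimal NumPy sketch averages $n$ consecutive high-speed sharp frames around a center index; approximating the camera response function $g$ with a simple gamma curve is an assumption (the real response depends on the camera), and `synthesize_blur` is a hypothetical helper name.

```python
import numpy as np

def synthesize_blur(sharp_frames, center, n, g=lambda x: np.clip(x, 0, 1) ** (1 / 2.2)):
    """Sketch of Eq. (1): average n consecutive sharp frames around `center`
    and apply a camera response function g. `sharp_frames` is a float array
    of shape (num_frames, H, W, C) in [0, 1] from a high-speed camera; the
    gamma curve used for g is only a stand-in (assumption)."""
    half = n // 2                                      # n is odd
    window = sharp_frames[center - half:center + half + 1]
    return g(window.mean(axis=0))

# The input frame B^n_t and its "more blurred" counterparts B^{n+2}_t, B^{n+4}_t,
# B^{n+6}_t share the same reference frame S[n*t]; they simply widen the averaging
# window around the same center index (longer exposure).
```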

Figure 4: An overview of our MB2D, which consists of (a) the MB (multi-blurring) step and (b) the D (deblurring) step. (a) For given video frames, our MBRNN (denoted by $\mathrm{Net_{B}}$, parameter shared) progressively generates more blurred images by taking the previous output and recurrent feature maps. (b) MSDR (denoted by $\mathrm{Net_{D}}$, parameter shared) uses a coarse-to-fine framework with the estimated more blurred images from (a) and the input frame. The connecting recurrent feature map from (a) is also used to restore the latent sharp image.

3.1 Ideal More Blurred Images

Unsharp masking [10, 17] is a classical image sharpening technique that enhances high-frequency components in an image by utilizing the difference between the input image and its blurred (or unsharp) version obtained with Gaussian filtering (we call this the more blurred image). This difference emphasizes (or masks) high-frequency components such as edges that were degraded by Gaussian filtering (or more blurring). In this work, we extend this idea of unsharp masking to video deblurring.

Instead of Gaussian blurring, we propose to generate more blurred images for video deblurring using sharp images with (1). For the blurred image $B^{n}_{t}$, the more blurred video frames for the same reference frame $S[nt]$ are $B^{n+2}_{t}$, $B^{n+4}_{t}$, and $B^{n+6}_{t}$, which have long exposures and thus encode more motion information over wider acquisition times than $B^{n}_{t}$. As illustrated in Figure 3, the difference between the input image and a more blurred image effectively encodes information both for sharp regions with small motion and for blurred regions with large motion.

Note that more blurred images are available during training but not during testing. Thus, these more blurred images are ideal and must be predicted during testing. In addition, for the other operations of unsharp masking (e.g., scaling), we propose to rely on the power of DNNs (see the next section).
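For reference, classical unsharp masking can be sketched as follows; `sigma` and `amount` are hand-tuned hypothetical parameters. MB2D replaces the Gaussian-blurred image with the long-exposure frames of (1) and lets the deblurring DNN learn how to exploit them instead of using a fixed scaling.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(image, sigma=2.0, amount=1.0):
    """Classical unsharp masking: subtract a more blurred (here Gaussian-filtered)
    version of the image and add the scaled difference back to boost edges.
    `image` is a float array of shape (H, W, C) in [0, 1]."""
    blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))  # blur H and W only
    mask = image - blurred                                     # high-frequency residual
    return np.clip(image + amount * mask, 0.0, 1.0)
```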

3.2 More Blurred Images for Deblurring

We empirically validated our conjecture that more blurred images help deblurring. We performed single image deblurring with the original input image $B^{11}_{t}$ and/or with the ideal more blurred images $B^{13}_{t}$, $B^{15}_{t}$, and so on from the GoPro dataset [13]. We denote the set of input images $B^{11}_{t}, B^{13}_{t}, \ldots, B^{19}_{t}$ by $B^{\{11,13,\ldots,19\}}_{t}$. A modified U-Net [18] was trained with the input blurred image and/or the ideal more blurred images for the same ground-truth sharp frame. Single image deblurring results with the ideal more blurred image sets are summarized in Table 1. Set 1 corresponds to usual single image deblurring, and Sets 2 to 5 correspond to our image deblurring with ideal more blurred images with long exposures.

Compared to the original performance with the input image $B^{11}_{t}$, deblurring with additional more blurred images yielded significantly improved performance. As more blurred images with longer exposures were used for deblurring, better PSNR was observed up to $n=17$, where a dramatic performance boost of 3.33 dB was achieved compared to the original deblurring with the $n=11$ input. However, beyond $n=17$, additional information from more blurred images did not further improve deblurring performance. We chose to use 3 more blurred images.

Table 1: Deblurring results with the input and/or ideal more blurred images on the GoPro validation dataset. Sets $1,\ldots,5$ denote $B^{\{11\}}_{t}$, $B^{\{11,13\}}_{t}$, …, $B^{\{11,13,\ldots,19\}}_{t}$, respectively.
Set 1 2 3 4 5
PSNR(dB) 29.14 31.82 32.39 32.47 32.39
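A minimal sketch of the input construction behind Table 1 is shown below: the (ideal) more blurred frames are simply concatenated with the input along the channel axis before being fed to an encoder-decoder network. `TinyUNet` is only a placeholder for the modified U-Net [18], whose exact architecture is not reproduced here and is an assumption.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Placeholder encoder-decoder; stands in for the modified U-Net [18]."""
    def __init__(self, in_ch, out_ch=3, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 3, padding=1))
    def forward(self, x):
        return self.body(x)

# Set 1 uses only B^{11}_t (3 channels); Set 4 stacks B^{11,13,15,17}_t (12 channels).
b_11 = torch.rand(1, 3, 256, 256)                      # input blurred frame
more = [torch.rand(1, 3, 256, 256) for _ in range(3)]  # ideal more blurred frames
x = torch.cat([b_11, *more], dim=1)                    # (1, 12, 256, 256)
sharp_hat = TinyUNet(in_ch=x.shape[1])(x)              # estimated sharp frame
```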

4 Multi-Blurring To Deblurring (MB2D)

Figure 4 illustrates our proposed MB2D framework, which contains the MB step with MBRNN and the D step with MSDR. Our proposed MBRNN in Figure 4(a) aims to progressively predict more blurred images $\hat{B}^{\{n+2,n+4,n+6\}}_{t}$ from the input blurred frame and its neighboring video frames $B^{n}_{\{t-1,t,t+1\}}$, where $B^{n}_{\{t-1,t,t+1\}}$ denotes the set of $B^{n}_{t-1}$, $B^{n}_{t}$, $B^{n}_{t+1}$. Our proposed MSDR in Figure 4(b) uses the input blurred image $B^{n}_{t}$, the predicted more blurred images $\hat{B}^{\{n+2,n+4,n+6\}}_{t}$, as well as the connecting recurrent feature map (CRFM) from MBRNN for multi-scale video deblurring that estimates the sharp latent image $B^{1}_{t}$.

Figure 5: (b-d) Examples of multi-blurring from MBRNN. $\Delta$ is the absolute difference map between the input center frame (a) and an estimated more blurred image, showing progressively estimated differences obtained by appending small motion blurs to the center frame. (e) Spectral densities of the MBRNN outputs, showing the progressive degradation of frequency components in more blurred images.

4.1 Multi-Blurring Recurrent Neural Network

Ideal more blurred images significantly improved deblurring performance as shown in Table 1, but they are available only during training, not during testing. We conjecture that these more blurred images, corresponding to up to $n+6$ high-speed video frames, can be predicted from the input image $B_{t}^{n}$ and its neighboring images $B_{t-1}^{n}$, $B_{t+1}^{n}$, which together correspond to $3n$ high-speed video frames. Thus, the goal of our proposed MBRNN is to progressively predict the multi-blurring images $B^{\{n+2,n+4,n+6\}}_{t}$ (available during training) from $B_{\{t-1,t,t+1\}}^{n}$ (available during both training and testing). Our MBRNN is responsible for the MB step of our proposed MB2D.

MBRNN model: A modified U-Net [18] was used as the baseline network of our proposed MBRNN; it performs progressive blurring, analogous to the network in [16] for progressive deblurring. Our proposed MBRNN that yields more blurred images is modeled as:

\{\hat{\mathrm{B}}^{n+2k}_{t},\mathrm{F}^{k}\}=\mathrm{Net_{B}}(\mathrm{B}^{n}_{\{t-1,t,t+1\}},\hat{\mathrm{B}}^{n+2k-2}_{t},\mathrm{F}^{k-1})   (2)

where $\mathrm{Net_{B}}$ is the baseline network, $\hat{\mathrm{B}}^{n+2k}_{t}$ is the estimated more blurred image at the $k$th iteration ($k=1,2,3$), and $\mathrm{F}^{k}$ is the connecting recurrent feature map passed from the encoder of MBRNN to the decoder of MBRNN. Figure 4(a) illustrates how MBRNN progressively appends small amounts of motion blur to the center (or reference) frame by exploiting the estimated more blurred image and the intermediate feature map from the previous iteration. Figure 5 shows examples of estimated more blurred images from MBRNN: small motion blurs (b, c, d) are progressively added to the center frame (a), emphasizing fast-moving objects. This phenomenon is also observed quantitatively as decreasing spectral densities (e), which is the exact opposite of progressive deblurring in [16].
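The recurrence in (2) can be sketched as the loop below, where `net_b` stands for any module implementing $\mathrm{Net_{B}}$ with the signature of (2); initializing the previous output with the center frame $B^{n}_{t}$ (i.e., $\hat{B}^{n+2k-2}_{t}$ at $k=1$) and the recurrent state with `None` are assumptions.

```python
def multi_blur(frames, net_b, num_iters=3):
    """Sketch of the MBRNN recurrence in Eq. (2).
    frames: (B, 9, H, W) tensor, the concatenation of B^n_{t-1}, B^n_t, B^n_{t+1}.
    net_b:  a module with signature net_b(frames, prev_output, prev_feat) ->
            (more_blurred, feat); channel sizes and the handling of the initial
            recurrent state (None here) are assumptions.
    Returns the estimates \\hat{B}^{n+2}_t, \\hat{B}^{n+4}_t, \\hat{B}^{n+6}_t and
    the per-iteration recurrent feature maps F^1, F^2, F^3."""
    prev = frames[:, 3:6]          # center frame B^n_t plays the role of \hat{B}^{n+0}_t
    feat = None                    # initial recurrent feature map
    outputs, feats = [], []
    for _ in range(num_iters):     # k = 1, 2, 3
        prev, feat = net_b(frames, prev, feat)
        outputs.append(prev)       # \hat{B}^{n+2k}_t
        feats.append(feat)         # F^k (later concatenated into the CRFM)
    return outputs, feats
```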

MBRNN loss: We trained our proposed MBRNN using a simple L1 loss. Given $N$ multi-blurring ground-truth images $\mathrm{B}^{n+2k}_{t}$ synthesized from high-speed video frames, the loss function for MBRNN is:

\mathcal{L}=\sum_{t=1}^{N}\sum_{k=1}^{3}\|\hat{\mathrm{B}}^{n+2k}_{t}-\mathrm{B}^{n+2k}_{t}\|_{1}   (3)

where $t$ is a frame index and $\hat{\mathrm{B}}^{n+2k}_{t}$ is the output of MBRNN.
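A per-sample version of (3) could look as follows, with the sum over frames $t$ handled by mini-batching; using the mean absolute error per image instead of an unnormalized L1 norm is a common normalization choice and an assumption here.

```python
import torch

def mbrnn_loss(predicted, targets):
    """Sketch of Eq. (3): L1 distance between the three estimated more blurred
    images and their synthesized ground truths, summed over k = 1, 2, 3
    (mean absolute error per image, an assumption)."""
    return sum(torch.mean(torch.abs(p - g)) for p, g in zip(predicted, targets))
```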

4.2 Multi-Scale Video Deblurring with Connecting Recurrent Feature Map

Our proposed MB2D consists of the MB step and the D step for more blurring and deblurring, respectively. It is possible to use MBRNN (MB step) with existing video deblurring methods (D step) for improved performance (see the next section for results). While this approach connects the MB and D steps at the image level, we propose to connect the two steps at the feature level as well for a further performance boost.

MSDR model: Figure 4(b) illustrates the schematic of the D step using the predicted more blurred images and the connecting recurrent feature map from MBRNN. Our proposed multi-scale video deblurring network with connecting recurrent feature map (MSDR) exploits the information from MBRNN, i.e., the predicted more blurred images $\hat{B}^{\{n+2,n+4,n+6\}}_{t}$ and the connecting recurrent feature map $F^{(s)}$, in a multi-scale framework for deblurring with down-sampling. Our proposed MSDR with connecting recurrent feature map is modeled as:

\hat{\mathrm{I}}^{(s)}=\mathrm{Net_{D}}(\mathrm{B}_{t}^{n,(s)},\hat{\mathrm{B}}^{\mathbf{m},(s)}_{t},Up(\hat{\mathrm{I}}^{(s+1)}),\mathrm{F}^{(s)})   (4)

where $s$ is a down-scaling index ($s=1,2,3$ for $\times 1, \times 2, \times 4$ downsampling, respectively), $\hat{\mathrm{B}}^{\mathbf{m},(s)}_{t}$ is the set of $\hat{\mathrm{B}}^{n+2,(s)}_{t}$, $\hat{\mathrm{B}}^{n+4,(s)}_{t}$, $\hat{\mathrm{B}}^{n+6,(s)}_{t}$ such that $\hat{\mathrm{B}}^{n+2,(s)}_{t}$ is the downsampled image of $\hat{\mathrm{B}}^{n+2}_{t}$ at scale index $s$, $\hat{\mathrm{I}}^{(s)}$ is the restored sharp image at scale index $s$, $\mathrm{F}^{(s)}$ denotes the downsampled connecting recurrent feature map from MBRNN, $\mathrm{Net_{D}}$ is the baseline network modified from $\mathrm{Net_{B}}$ for the multi-scale framework (see the supplementary material for details), and $Up$ is bilinear up-sampling. Our MSDR iteratively restores the latent sharp image from the coarsest scale to the original scale.
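The coarse-to-fine recursion in (4) can be sketched as below; `net_d` stands for $\mathrm{Net_{D}}$, and the bilinear rescaling of the inputs/feature map as well as the choice of the downsampled input as the initial coarse estimate are assumptions.

```python
import torch.nn.functional as F

def multi_scale_deblur(b_input, more_blurred, crfm, net_d):
    """Sketch of the MSDR recursion in Eq. (4).
    b_input:      (B, 3, H, W), the blurred input B^n_t
    more_blurred: (B, 9, H, W), concatenation of \\hat{B}^{n+2,n+4,n+6}_t
    crfm:         (B, C, H, W), connecting recurrent feature map from MBRNN
    net_d:        shared module net_d(b, mb, prev_estimate, feat) -> sharp estimate
    H and W are assumed divisible by 4."""
    def down(x, factor):
        return x if factor == 1 else F.interpolate(
            x, scale_factor=1.0 / factor, mode='bilinear', align_corners=False)

    estimate = None
    for factor in (4, 2, 1):                           # s = 3, 2, 1 in the paper
        b_s, mb_s, f_s = down(b_input, factor), down(more_blurred, factor), down(crfm, factor)
        prev = b_s if estimate is None else F.interpolate(
            estimate, scale_factor=2, mode='bilinear', align_corners=False)  # Up(I^(s+1))
        estimate = net_d(b_s, mb_s, prev, f_s)         # \hat{I}^{(s)}
    return estimate                                    # full-resolution sharp image
```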

Connecting recurrent feature map: As illustrated in Figure 4, the recurrent feature maps $\mathrm{F}$ are extracted from the decoder of MBRNN at every iteration and then concatenated before being fed into our MSDR. We call the concatenated feature map the "connecting recurrent feature map (CRFM)"; it bridges MBRNN and MSDR at the feature level for the best possible deblurring performance of our proposed MB2D concept. CRFM seems to carry more information about blur around the edges of moving objects than the more blurred images alone, leading to a further improvement in video deblurring performance (see the 0.64 dB PSNR improvement from our proposed CRFM in Table 2).
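Concretely, a minimal way to form the CRFM from the per-iteration MBRNN feature maps is channel-wise concatenation; the ordering and any further fusion layers are assumptions not specified in the text.

```python
import torch

def build_crfm(recurrent_feats):
    """Concatenate the MBRNN decoder feature maps F^1, F^2, F^3 (each of shape
    (B, C, H, W)) along the channel axis to form the CRFM fed to MSDR."""
    return torch.cat(recurrent_feats, dim=1)           # (B, 3C, H, W)
```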

5 Experiments

Datasets The GoPro dataset [13] consists of 34,874 sharp images (33 videos) of size 1280$\times$720, with 22 videos for training and 11 videos for testing. By integrating successive sharp images from a high-speed camera, Nah [13] synthesized 2,103 blurred images for training and 1,111 blurred images for testing with sharp ground-truth images. We further generated and added more blurred images to the GoPro dataset (called M-GoPro) for training.

The Su dataset [20] consists of 6,708 images (71 videos) of size 1920$\times$1080 or 1280$\times$720 captured by several high-speed cameras, where 61 videos are for training and 10 videos are for testing. Following the method of synthesizing motion blur with interpolated frames using optical flow [20], we further generated and added more blurred images to the Su dataset (called M-Su) for training.

Implementation details All implementations were done with PyTorch, and the Adam optimizer with learning rate $2\times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$, and batch size 16 was used for training. Data augmentation for training included random cropping to 256$\times$256, random vertical / horizontal flips, and 90-degree rotation. For Tables 1 to 5, the total number of iterations was $10^{5}$; for Table 6, it was $10^{6}$. For Table 7, the network trained for Table 6 was fine-tuned for an additional $3\times 10^{5}$ iterations. All experiments were conducted on an NVIDIA Titan V. Run time was measured with batch size 1, excluding data-loading time. MBRNN was trained first, and then MSDR was trained with the pre-trained MBRNN.
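A sketch of the optimizer and augmentation described above is given below; `model` stands for MBRNN or MSDR, and the 0.5 flip/rotation probabilities are assumptions.

```python
import torch

def make_optimizer(model):
    """Adam with the hyperparameters listed above."""
    return torch.optim.Adam(model.parameters(), lr=2e-4,
                            betas=(0.9, 0.999), eps=1e-8)

def augment(blurred, sharp, size=256):
    """Random size x size crop, random horizontal/vertical flips and a random
    90-degree rotation, applied identically to the blurred input(s) and the
    sharp target (both (..., H, W) tensors)."""
    h, w = blurred.shape[-2:]
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    blurred = blurred[..., top:top + size, left:left + size]
    sharp = sharp[..., top:top + size, left:left + size]
    if torch.rand(1).item() < 0.5:                     # horizontal flip
        blurred, sharp = blurred.flip(-1), sharp.flip(-1)
    if torch.rand(1).item() < 0.5:                     # vertical flip
        blurred, sharp = blurred.flip(-2), sharp.flip(-2)
    if torch.rand(1).item() < 0.5:                     # 90-degree rotation
        blurred = torch.rot90(blurred, 1, dims=(-2, -1))
        sharp = torch.rot90(sharp, 1, dims=(-2, -1))
    return blurred, sharp
```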

Table 2: Ablation studies on the number of input frames (NIF) (1 or 3 frames), pre-processing (Preproc) with optical flow (OF) or our MBRNN (MB), and our CRFM.
NIF Preproc CRFM PSNR SSIM Param(M)
(a) 1 - - 29.29 0.894 2.6
(b) 3 - - 29.55 0.896 2.6
(c) 3 OF - 29.63 0.901 12.0
(d) 3 MB X 30.37 0.911 5.2
(e) 3 MB O 30.94 0.922 5.4
(f) 3 - - 30.20 0.909 5.9
(g) 3 - - 30.48 0.914 10.5

5.1 Ablation Studies for Our MB2D

We performed ablation studies to demonstrate the effectiveness of the proposed MB2D approach for several components: the number of input frames (NIF), pre-processing (Preproc) with optical flow (OF) or our MBRNN, and the connecting recurrent feature map (CRFM). We chose the modified U-Net as the deblurring network without multi-scale (MS) or multi-temporal (MT) frameworks. The results are summarized in Table 2.

Table 2(a),(b) correspond to single image deblurring (with 1 input frame) and video deblurring (with 3 input frames), respectively, showing the improved performance of (b) over (a) by using more information from neighboring frames. Table 2(b),(c),(d) correspond to different pre-processing methods for video deblurring: using optical flow to align adjacent frames (c), with a significantly increased number of parameters (12.0M), improved deblurring performance by 0.08 dB over (b) without pre-processing, whereas our proposed MBRNN alone (d) improved performance by 0.82 dB with a modestly increased number of parameters (5.2M). Table 2(e) shows an additional performance improvement of 0.57 dB over (d) when using CRFM from MBRNN for deblurring, without a large increase in network parameters.

We also performed additional studies on the number of network parameters for (b). As observed in Table 2(f),(g), increasing the number of network parameters improved deblurring performance, but our proposed MB2D (especially with CRFM) yielded better performances than the deblurring networks with similar numbers of parameters. Thus, our proposed MB2D yielded improved performance not simply because of increased network size, but because of our proposed approach with more blurred images.

Table 3: Performance comparisons for existing video deblurring approaches (one-stage (OS), multi-temporal (MT), multi-scale (MS)) with/without MBRNN (MB) and CRFM.
MB CRFM PSNR SSIM Param(M)
(h) OS X - 29.29 0.894 2.6
(i) OS O X 30.37 0.911 5.2
(j) OS O O 30.94 0.922 5.4
(k) MT X - 29.95 0.906 2.8
(l) MT O X 30.59 0.914 5.4
(m) MT O O 30.83 0.920 5.6
(n) MS X - 29.65 0.901 2.6
(o) MS O X 30.62 0.916 5.2
(p) MS O O 31.19 0.926 5.4

Our MB2D concept can be used with various single image / video deblurring approaches such as one-stage (OS) [12, 11, 1], multi-temporal (MT) [16], and multi-scale (MS) [13, 23, 6] methods, where the MS and MT approaches achieve high performance with a small number of parameters due to parameter sharing over scales or iterations. We performed another ablation study on these approaches with our proposed MBRNN pre-processing; the results are summarized in Table 3.

Table 3(h),(k),(n) show the results of the baseline OS, MT, and MS approaches with the same DNN. In this case, the MT approach yielded the best performance among the three, by up to 0.66 dB. However, with our proposed MBRNN, the MS approach yielded the best performance among all approaches, by up to 1.08 dB, as shown in Table 3(i),(l),(o). Further performance improvements were observed with our proposed CRFM from MBRNN, and the MS approach with both of them yielded the best performance among all methods, by up to 1.54 dB. As argued in [16], the MS approach tends to destroy the high-frequency details needed for good deblurring by using down-sampled images / features, but our proposed MBRNN and CRFM appear to compensate for the degraded details, yielding excellent video deblurring performance. Thus, we chose MS for the D step of our MB2D.

Table 4: Ablation study with different input frames to MBRNN.
1 frame 3 frames
Output PSNR SSIM PSNR SSIM
$\hat{\mathrm{B}}_{t}^{n+2}$ 44.73 0.994 46.63 0.995
$\hat{\mathrm{B}}_{t}^{n+4}$ 40.89 0.987 43.86 0.992
$\hat{\mathrm{B}}_{t}^{n+6}$ 38.54 0.980 42.63 0.990

5.2 Role of Neighboring Frames for MBRNN

To validate the necessity of adjacent frames for providing motion information to generate more blurred images, we performed an ablation study with 1 frame (without neighboring frames) and 3 frames (with neighboring frames) as input to MBRNN. Note that a single frame is generated from $n$ high-speed video frames, while 3 frames cover $3n$ high-speed video frames, which is usually wider than the more blurred images generated from $n+2$ to $n+6$ high-speed video frames. The results are summarized in Table 4, showing significant performance differences between using a single frame and using neighboring frames for MBRNN. The largest difference was observed for the more blurred image with $n+6$ high-speed video frames, and even the smallest difference was still significant at 1.9 dB. Note that MBRNN is computationally efficient (0.02 sec) compared to the optical flow network (0.076 sec) [22] that we used for Su [20] instead of its MATLAB implementation.

5.3 Multi-Blurring Step for Existing Methods

Table 5: Performance comparisons of existing methods (Tao [23], Zhang [26], Su [20], Pan [15]) and their MB2D versions, where a method name without a citation denotes our MB2D version of the corresponding existing method (e.g., Tao for Tao [23]). OF is optical flow and MB is our MBRNN.
Approach Preproc PSNR SSIM Param(M)
Tao [23] - 29.52 0.899 6.9
Tao MB 31.12 0.925 9.7
Zhang [26] - 29.91 0.905 5.4
Zhang MB 30.85 0.920 8.7
Su [20] OF 30.11 0.911 25.4
Su MB 30.63 0.915 18.6
Pan [15] OF 30.40 0.908 16.19
Pan MB 30.97 0.923 9.4

Our proposed MB2D is an alternative approach to exploiting neighboring frames, and its MBRNN can be applied to other existing video deblurring methods with mild modifications, such as multi-channel inputs with the predicted more blurred images and CRFM. Here, we investigate the feasibility of using our proposed MB2D (especially MBRNN) with other existing SOTA methods: Tao [23], Zhang [26], Su [20], and Pan [15]. We implemented our MB2D versions of them, denoted Tao, Zhang, Su, and Pan (without citations) in Table 5. Note that for Tao [23] and Zhang [26], single image deblurring methods were converted into video deblurring methods with our MB2D by adding MBRNN. For Su [20] and Pan [15], we replaced the optical flow network (PWC-Net [22]) with our MBRNN, which decreased the network parameter size by 6.8M. Table 5 shows that our proposed MB2D substantially improved the deblurring performance of all 4 existing methods, by at least 0.52 dB and up to 1.6 dB.

Table 6: Benchmark quantitative results on the GoPro test dataset [13]: PSNR (dB), SSIM, parameter size (Param, in millions), and run time (seconds).
Method PSNR SSIM Param(M) Time
Kim [9] 26.82 0.825 0.92 0.13
Su [20] 27.31 0.826 16.67 2.38
EDVR [24] 26.83 0.843 - -
Nah [14] 29.97 0.895 - -
STFAN [30] 28.59 0.861 5.37 0.15
Zhong [29] 31.07 0.902 1.76 0.54
Pan [15] 31.67 0.928 16.19 1.73
Ours 32.16 0.953 5.42 0.27
Table 7: Benchmark quantitative results on the Su test dataset [20]: PSNR (dB), SSIM, parameter size (Param, in millions), and run time (seconds).
Method PSNR SSIM Param(M) Time
Kim [9] 29.95 0.869 0.92 0.13
Su [20] 30.01 0.888 16.67 2.38
EDVR [24] 28.51 0.864 - -
Nah [14] 30.80 0.899 - -
STFAN [30] 31.15 0.905 5.37 0.15
Pan [15] 32.13 0.927 16.19 1.73
Ours 32.34 0.947 5.42 0.27
Figure 6: Qualitative results evaluated on the GoPro test dataset [13] for (e) our proposed MB2D method and other SOTA methods: (b) Kim [9], (c) STFAN [30], and (d) Pan [15] for the given input blurred image (a) and its sharp ground truth image (f). Our proposed MB2D yielded visually excellent deblurring results for fast moving persons and objects.
Figure 7: Qualitative results evaluated on the Su test dataset [20] for (e) our proposed MB2D method and other SOTA methods: (b) Kim [9], (c) STFAN [30], and (d) Pan [15] for the given input blurred image (a) and its sharp ground truth image (f). Our proposed MB2D yielded visually excellent deblurring results for fast moving persons and objects.

5.4 Benchmark Results

We evaluated our proposed MB2D with MBRNN and MSDR on two popular benchmark video deblurring datasets: the GoPro test dataset [13] and the Su test dataset [20]. We use the reported quantitative performance in PSNR (dB) and SSIM for the SOTA methods Pan [15], Zhong [29], and Nah [14]. Parameter size and run time were measured using publicly available code.

Tables 6 and 7 show that our proposed MB2D significantly outperformed all existing SOTA video deblurring methods in terms of PSNR and SSIM on the GoPro and Su test datasets, respectively. Note that compared to the most recent SOTA method, Pan [15], our proposed MB2D with MBRNN and MSDR substantially outperformed it with about 3 times fewer network parameters (5.42M vs. 16.19M) and more than 6 times faster computation (0.27 seconds vs. 1.73 seconds).

Figures 6 and 7 show examples of video deblurring using a few SOTA methods and our proposed MB2D on the GoPro and Su test datasets, respectively. Our proposed MB2D deblurs fast-moving objects and people better than existing methods such as Kim [9], STFAN [30], and Pan [15], thanks to its efficient exploitation of neighboring video frames via more blurring.

6 Conclusion

We proposed MB2D, a novel approach to exploiting neighboring frames for efficient video deblurring. We showed that using more blurred images as additional inputs significantly improves performance, proposed MBRNN to synthesize them from neighboring frames, which substantially improved the performance of existing deblurring methods, and finally proposed MSDR with a novel CRFM that outperforms SOTA methods in a fast and memory-efficient manner.

Acknowledgments

This work was supported partly by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A1B05035810) and a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).

References

  • [1] Raied Aljadaany, Dipan K Pal, and Marios Savvides. Douglas-Rachford networks: Learning both the image prior and data fidelity terms for blind image deconvolution. In CVPR, pages 10235–10244, 2019.
  • [2] Jia Chen, Lu Yuan, Chi-Keung Tang, and Long Quan. Robust dual motion deblurring. In CVPR, pages 1–8, 2008.
  • [3] Sunghyun Cho, Hojin Cho, Yu-Wing Tai, and Seungyong Lee. Registration based non-uniform motion deblurring. In CGF, volume 31, pages 2183–2192, 2012.
  • [4] Sunghyun Cho, Jue Wang, and Seungyong Lee. Video deblurring for hand-held cameras using patch-based synthesis. TOG, 31(4):1–9, 2012.
  • [5] Mauricio Delbracio and Guillermo Sapiro. Hand-held video deblurring via efficient Fourier aggregation. TIP, 1(4):270–283, 2015.
  • [6] Hongyun Gao, Xin Tao, Xiaoyong Shen, and Jiaya Jia. Dynamic scene deblurring with parameter selective sharing and nested skip connections. In CVPR, pages 3848–3856, 2019.
  • [7] Sung Hee Park and Marc Levoy. Gyro-based multi-image deconvolution for removing handshake blur. In CVPR, pages 3366–3373, 2014.
  • [8] Tae Hyun Kim and Kyoung Mu Lee. Generalized video deblurring for dynamic scenes. In CVPR, pages 5426–5434, 2015.
  • [9] Tae Hyun Kim, Kyoung Mu Lee, Bernhard Scholkopf, and Michael Hirsch. Online video deblurring via dynamic temporal blending network. In ICCV, pages 4038–4047, 2017.
  • [10] Anil K Jain. Fundamentals of digital image processing. Prentice-Hall, Inc., 1989.
  • [11] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. DeblurGAN: blind motion deblurring using conditional adversarial networks. In CVPR, pages 8183–8192, 2018.
  • [12] Orest Kupyn, Tetiana Martyniuk, Junru Wu, and Zhangyang Wang. DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In ICCV, pages 8878–8887, 2019.
  • [13] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, pages 3883–3891, 2017.
  • [14] Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In CVPR, pages 8102–8111, 2019.
  • [15] Jinshan Pan, Haoran Bai, and Jinhui Tang. Cascaded deep video deblurring using temporal sharpness prior. In CVPR, pages 3043–3051, 2020.
  • [16] Dongwon Park, Dong Un Kang, Jisoo Kim, and Se Young Chun. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. ECCV, 2020.
  • [17] Andrea Polesel, Giovanni Ramponi, and V John Mathews. Image enhancement via adaptive unsharp masking. TIP, 9(3):505–510, 2000.
  • [18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
  • [19] Filip Sroubek and Peyman Milanfar. Robust multichannel blind deconvolution via fast alternating minimization. TIP, 21(4):1687–1700, 2011.
  • [20] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In CVPR, pages 1279–1288, 2017.
  • [21] Maitreya Suin, Kuldeep Purohit, and AN Rajagopalan. Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In CVPR, pages 3606–3615, 2020.
  • [22] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, pages 8934–8943, 2018.
  • [23] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, pages 8174–8182, 2018.
  • [24] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. EDVR: video restoration with enhanced deformable convolutional networks. In CVPRW, 2019.
  • [25] Haichao Zhang and Lawrence Carin. Multi-shot imaging: joint alignment, deblurring and resolution-enhancement. In CVPR, pages 2925–2932, 2014.
  • [26] Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep stacked hierarchical multi-patch network for image deblurring. In CVPR, pages 5978–5986, 2019.
  • [27] Haichao Zhang, David Wipf, and Yanning Zhang. Multi-image blind deblurring using a coupled adaptive sparse prior. In CVPR, pages 1051–1058, 2013.
  • [28] Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Bjorn Stenger, Wei Liu, and Hongdong Li. Deblurring by realistic blurring. In CVPR, pages 2737–2746, 2020.
  • [29] Zhihang Zhong, Ye Gao, Yinqiang Zheng, and Bo Zheng. Efficient spatio-temporal recurrent neural network for video deblurring. ECCV, 2020.
  • [30] Shangchen Zhou, Jiawei Zhang, Jinshan Pan, Haozhe Xie, Wangmeng Zuo, and Jimmy Ren. Spatio-temporal filter adaptive network for video deblurring. In CVPR, pages 2482–2491, 2019.
  • [31] Xiang Zhu, Filip Šroubek, and Peyman Milanfar. Deconvolving PSFs for a better motion deblurring using multiple images. In ECCV, pages 636–647, 2012.