
A Constrained Deformable Convolutional Network for Efficient Single Image Dynamic Scene Blind Deblurring with Spatially-Variant Motion Blur Kernels Estimation

Shu Tang This work was supported in part by the National Natural Science Foundation of China under Grant No. 61601070, Grant 61501074, the Key Project of Science and Technology Research of Chongqing Education Commission under Grant No. KJZD-K201800603, the Major Project of Science and Technology Research of Chongqing Education Commission under Grant No. KJZD-M201900602, the Foundation Research and Advanced Exploration Project of Chongqing under Grant No. cstc2018jcyjAX0432, the Special General Program of Technology Innovation and Application Development of Chongqing under Grant No. cstc2020jscx-msxmX0135. Corresponding author: [email protected] Chongqing Key Laboratory of Computer Network and Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China Yang Wu Chongqing Key Laboratory of Computer Network and Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China Hongxing Qin Corresponding author: [email protected] Chongqing University, Chongqing 400044, China Xianzhong Xie Chongqing Key Laboratory of Computer Network and Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China Shuli Yang Chongqing Key Laboratory of Computer Network and Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China Jing Wang Chongqing Key Laboratory of Computer Network and Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Abstract

Most existing deep-learning-based single image dynamic scene blind deblurring (SIDSBD) methods design deep networks to directly remove the spatially-variant motion blurs from one inputted motion blurred image, without blur kernel estimation. Recently, a reblurring training strategy has been shown to significantly boost the deblurring performance of deep-learning-based video blind deblurring (DLVBD) methods on motion blurred videos. For the DLVBD methods, the success of the reblurring training strategy mainly stems from the estimation of optical flows between two or more consecutive frames, which are used to estimate/model the spatially-variant motion blur kernels of the motion blurred frames and consequently guide the video deblurring. However, this strategy does not hold for SIDSBD, as we only have one observed motion blurred image, without any previous or next frames. In this paper, inspired by the Projective Motion Path Blur (PMPB) model and the deformable convolution, we propose a novel constrained deformable convolutional network (CDCN) for efficient single image dynamic scene blind deblurring, which simultaneously achieves accurate spatially-variant motion blur kernel estimation and high-quality image restoration from only one observed motion blurred image. In our proposed CDCN, we first construct a novel multi-scale multi-level multi-input multi-output (MSML-MIMO) encoder-decoder architecture for more powerful feature extraction ability. Second, different from the DLVBD methods that use multiple consecutive frames, a novel constrained deformable convolution reblurring (CDCR) strategy is proposed: the deformable convolution is first applied to the blurred features of the inputted single motion blurred image to learn the sampling points of the motion blur kernel of each pixel, similar to the estimation of the motion density function of the camera shake in the PMPB model, and then a novel PMPB-based reblurring loss function is proposed to constrain the convergence of the learned sampling points, which makes the learned sampling points better match the relative motion trajectory of each pixel and improves the accuracy of the spatially-variant motion blur kernel estimation. Extensive experiments show that our method not only estimates spatially-variant motion blur kernels accurately, but also produces better deblurring results than the state-of-the-art SIDSBD methods in terms of both qualitative evaluation and quantitative metrics.

Index Terms:
Single image dynamic scene blind deblurring, reblurring training strategy, spatially-variant motion blur kernels estimation, deformable convolution, position-based constraint.

I INTRODUCTION

Blind deblurring of a single motion blurred image, whose goal is to recover a clear image from its motion blurry version, is a severely ill-posed inverse problem. Especially in real dynamic scenes, many factors, such as the multidimensional relative motion between the scene and the imaging device during the exposure time, the noise, and the depth variation, make this inverse problem even more challenging. To tackle such an inverse problem, numerous optimization-based methods have been developed to model the blur process and regularize the solution space, and numerous deep-learning-based methods have been developed to learn the mapping function between clear and blurry image pairs.

For the optimization-based methods, the key to success is to correctly model the formation process of image blur: the convolution operation for the uniform motion blur[1, 2, 3, 4, 5], the efficient filter flow and the Projective Motion Path Blur (PMPB) model for the spatially-variant motion blur[6, 7, 8, 9, 10, 11, 12], and so on. Among these models, the PMPB model, which takes a motion blurred image as the result of integrating all intermediate images the camera “sees” along the trajectory of the relative motion, has been shown to be one of the best motion blur models for spatially-variant motion blurs. However, in the optimization-based methods, the PMPB model requires computing all possible combinations of all motion spaces, which inevitably leads to tremendous computational cost and hence can only model small-size 3-dimensional camera shake.
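For reference, the PMPB model can be stated compactly as follows; this is a simplified illustration in our own notation, where $\mathcal{H}_{t}$ is assumed to denote the homography induced by the $t$-th camera pose and $w_{t}$ the fraction of the exposure time spent at that pose:

$B=\sum_{t=1}^{T}w_{t}\,\mathcal{H}_{t}(S)+n,\qquad w_{t}\ge 0,\quad\sum_{t=1}^{T}w_{t}=1,$

where $S$ is the latent sharp image, $B$ is the observed blurred image, and $n$ is the noise.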

Recently, deep-learning-based methods have achieved significant improvement in single image dynamic scene blind deblurring (SIDSBD). Most existing deep-learning-based SIDSBD methods focus on learning the regression relation between a motion blurry input image and the corresponding clear image in an end-to-end manner, skipping the estimation of motion blur kernels[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Nevertheless, in recent years, a reblurring training strategy has been widely applied to deep-learning-based video blind deblurring (DLVBD) methods and has significantly boosted their deblurring performance[32, 33, 34, 35]. The success of the reblurring training strategy mainly stems from the estimation or modeling of the spatially-variant motion blur kernels of each motion blurred frame, which is achieved by using the optical flows between two or more consecutive frames of the motion blurred video. These observations set us thinking: are the estimation of spatially-variant motion blur kernels and the corresponding reblurring training strategy also beneficial for SIDSBD? Obviously, the strategy used by the DLVBD methods does not hold for the SIDSBD method, as we only have one observed motion blurred image, without any previous or next frames. Therefore, innovative approaches that can dig out informative blurred features and consequently achieve accurate spatially-variant motion blur kernel estimation from a single motion blurred image need to be explored.

In this paper, we propose a novel constrained deformable convolutional network (CDCN) for accurate spatially-variant motion blur kernel estimation and higher quality image restoration from only one observed motion blurred image. Specifically, inspired by the PMPB model and the deformable convolution, we propose a novel constrained deformable convolution reblurring (CDCR) strategy, in which the deformable convolution is first used to learn the sampling points of the spatially-variant motion blur kernels of the inputted single motion blurred image, similar to the estimation of the motion density function of the camera shake in the PMPB model, and then a novel PMPB-based reblurring loss function is proposed to constrain the convergence of the learned sampling points, which optimizes the estimation of the spatially-variant motion blur kernels. To the best of our knowledge, our CDCN is the first deep-learning-based SIDSBD method that can accurately estimate spatially-variant motion blur kernels from only one single motion blurred image without the optical flow. Our proposed network can be trained in an end-to-end manner. In summary, the main contributions of our proposed CDCN are listed as follows:

1) We propose a CDCR strategy, which achieves accurate spatially-variant motion blur kernel estimation from only one single motion blurred image without the optical flow, by constraining the convergence of the sampling points of the spatially-variant motion blur kernels with the PMPB-based reblurring loss function.

2) A small convolutional neural network (CNN) with one SoftMax layer, several PReLU layers and several convolutional layers is constructed for predicting the inverse kernel of each estimated motion blur kernel. The predicted inverse kernels are directly applied to blurred features of the inputted single motion blurred image to generate deblurred features, which can enhance the restoration ability of the decoder and achieve better deblurring performance for SIDSBD.

3) We construct a novel multi-scale multi-level multi-input multi-output (MSML-MIMO) encoder-decoder architecture by combining the multi-input multi-output strategy with our early research work (i.e. the multi-scale channel attention network, MSCAN[27]), which can further enhance the feature extraction ability of the network and consequently facilitate accurate estimation of spatially-variant motion blur kernels and higher quality image restoration.

II RELATED WORK

Numerous optimization-based and deep-learning-based deblurring methods have been proposed in the literature. Due to space limitations, here we focus on the works related to our method.

II-A The Optimization-based Approach

Agrawal et al.[36] pointed out that the reason why deconvolution is a highly ill-posed problem is that there are many null values in the frequency domain of the point spread function (PSF). By changing the exposure time of each frame in a video, a series of frames was used to fill the null values in the PSF frequency domain of the blurred frame, so that the motion blur of the blurred frame became an invertible process, thereby solving the problem of deblurring objects moving at a uniform speed. Many researchers considered that an image blurred by camera shake can be viewed as the result of integrating all intermediate images the camera “sees” along the trajectory of the camera shake; therefore, the so-called projective motion path blur (PMPB) model was proposed to model the spatially-variant blur. Gupta et al.[6] proposed a PMPB-based motion density function (MDF) to describe the exposure time spent along the three-dimensional motion trajectory of the camera, which was used to estimate the spatially-variant motion blur kernels. Tai et al.[7] regarded the blurred image as the integration of a series of clear scenes that went through a sequence of planar projective transformations, and proposed a PMPB-based RL algorithm, which can incorporate many regularization priors to improve the deblurred results. Harmeling et al.[8] proposed a space-variant blind deblurring method based on filter flow by studying the type of camera jitter, and designed an experimental device to record the space-variant PSF corresponding to the blur while taking the blurred image. Hirsch et al.[9] combined the PMPB model with the Efficient Filter Flow (EFF), and proposed an efficient algorithm that can deal with non-uniform blur caused by camera shake. Whyte et al.[10] assumed that camera rotation was the only significant source of camera shake blur, and proposed a PMPB-based parameterized geometric model to remove the non-uniform camera rotation blur. Xu et al.[37] proposed a hierarchical estimation framework based on region trees to estimate the blur kernel step by step, and redesigned a spatially varying PSF estimation algorithm based on shock filtering invariance for non-uniform deblurring. Hu et al.[11] argued that non-uniform blur is caused not only by camera shake but also by changes in scene depth, and therefore proposed a method that simultaneously estimates scene depth and removes non-uniform blur. Sheng et al.[12] proposed a PMPB-based depth-aware motion blur model with a given depth image; a PatchMatch-based depth filling method was used to fix the empty holes in the depth image, and deblurring and depth filling were performed iteratively to refine the results. Bai et al.[38] observed that a coarse enough image down-sampled from a blurry observation is very close to a low-resolution version of the latent sharp image; based on this observation, they proposed a coarse-to-fine progressive single-image blind deblurring algorithm. Ulyanov et al.[28] showed that an elaborate UNet is sufficient to capture the statistical prior of a single image for low-level learning tasks. Inspired by[28], Ren et al.[39] proposed a self-supervised blind deblurring method for a single uniformly blurred image, which combined deep models with the maximum a posteriori (MAP) framework and constructed two generative networks, one for latent image restoration and one for blur kernel estimation.

From the above discussion we can see that most optimization-based methods[36, 6, 7, 8, 9, 10, 37, 11, 12, 38, 39] can only handle either small-size camera shake without the movement of objects, or rigid object motion without camera shake. Although the PMPB model can model spatially-variant motion blurs very well, it suffers from tremendous computational cost and hence can only model small-size 3-dimensional camera shake[6, 7, 8, 10, 11, 12]. Therefore, the optimization-based methods are not suitable for the real-world complex dynamic scene deblurring problem, which contains camera shake, multiple rigid or non-rigid object motions, and different scene depths simultaneously.

II-B The Deep-Learning-Based Approach

Lately, impressive progress has been made in SIDSBD by using deep-learning-based single image blind deblurring methods. Xu et al.[40] proposed an end-to-end CNN-based non-blind deblurring network to learn the deconvolution operation for disk and motion blurs. Ramakrishnan et al.[13] proposed a generative adversarial network (GAN) whose generator consisted of global skip links and dense connections, and obtained the restored image from the inputted blurred image directly, without blur kernel estimation. Nah et al.[14] designed a multi-scale end-to-end deblurring network and proposed an improved residual block for SIDSBD; again, there was no blur kernel estimation. The DeblurGAN model proposed by Kupyn et al.[15] greatly improved the structural similarity (SSIM) values and the subjective visual effects through a well-designed adversarial loss and content loss. Tao et al.[16] proposed a scale-recurrent network (SRN), which applied the ResBlock to the encoder-decoder module and restored sharp images of different resolutions gradually. Zhang et al.[17] proposed a spatially-variant recurrent neural network (RNN) for spatially-variant blurs, in which different weights were learned for different pixels. Inspired by the spatial pyramid matching (SPM), Zhang et al.[18] proposed a deep multi-patch hierarchical network (DMPHN) to achieve end-to-end non-uniform deblurring. The DMPHN made the inputs at different levels have the same spatial resolution, so the residual manner could be introduced between levels. Gao et al.[19] proposed a nested skip connection structure to replace the conventional residual connection, and a parameter selective sharing strategy between different scales for SIDSBD. Cai et al.[20] believed that appropriate image priors and regularization terms could improve the deblurring performance; therefore, they inserted an extreme channel prior into a CNN-based blind deblurring network and proposed an extreme channel prior embedded network (ECPeNet) for SIDSBD. Yuan et al.[21] introduced blur kernel estimation into deep-learning-based SIDSBD and proposed a spatially variant deconvolution network (SVDN). In their proposed SVDN, the deformable convolution was first used to learn the sampling points of the blur kernels, and then the optical flows between the blurred image and its nearby frames were used to guide the learning of the sampling points. Zamir et al.[22] combined the characteristics of the encoder-decoder network with the single-scale network and designed a multi-stage network structure, in which not only was the attention mechanism introduced into each stage of the network, but an information exchange strategy between different stages was also proposed. Cho et al.[23] proposed a multi-input multi-output encoder-decoder structure and an asymmetric feature fusion strategy to fuse multi-scale features effectively. Purohit et al.[24] argued that different regions of a blurred image have different degrees of degradation, so they first designed a positioning network to identify the degraded regions, and then the learned degradation features were used to guide the recovery network for adaptive deblurring. Wang et al.[25] combined the UNet and the Transformer, and proposed a general U-shaped transformer for image restoration.
In their proposed network, a block-based self-attention and a learnable multi-scale recovery modulator were proposed to capture local features and to estimate the modulation parameters of each window, respectively. In order to reduce the computational overhead of the Transformer on low-level visual tasks, Zamir et al.[26] proposed a multi-dconv head transposed attention (MDTA) module, which calculates attention in the channel dimension rather than the pixel dimension. In addition, a gated-dconv feed-forward network (GDFN) was proposed to capture the local information of images.

Besides SIDSBD, deep-learning-based video blind deblurring (DLVBD) methods have also been greatly developed recently and have achieved significant improvement in video motion blind deblurring. Su et al.[41] proposed an end-to-end video deblurring network based on the encoder-decoder architecture, and collected real-world motion blurred video datasets using high frame rate cameras. Chen et al.[32] used an optical flow network to estimate the optical flows between consecutive frames, and the estimated optical flows were used to estimate the spatially-variant motion blur kernels. The estimated blur kernels were then used to fine-tune the proposed deblurring network through a reblurring self-supervised loss, which boosted its deblurring performance. Zhang et al.[33] proved that, in a motion blurred video, the pixel-wise blur kernel can be represented by the pixel-wise optical flow; therefore, they used the optical flows to model the spatially-variant motion blur kernels, which were used to generate the weights of the RNNs for video blind deblurring. Bai et al.[34] estimated the blur kernel from a low-resolution and uniformly blurred video directly, and used the estimated blur kernel to achieve self-supervised high-resolution video restoration. Wang et al.[35] first estimated the average magnitude of the optical flows from several clear consecutive frames, and then the estimated average magnitude was used to learn the pixel-wise motion blur level of each motion blurred frame; finally, the learned motion blur levels were utilized as guidance for effective deep video deblurring.

From the above discussion we can see that, on the one hand, most existing deep-learning-based single image dynamic scene blind deblurring (SIDSBD) methods design deep networks to directly remove the spatially-variant motion blurs from one inputted motion blurred image, without blur kernel estimation[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. On the other hand, the estimation of the spatially-variant motion blur kernels and the corresponding reblurring training strategy have been widely applied to deep-learning-based video blind deblurring (DLVBD) methods and have proved their effectiveness[32, 33, 34, 35]. Therefore, we wonder whether the estimation of spatially-variant motion blur kernels and the corresponding reblurring training strategy are also beneficial for SIDSBD. So, in this paper, we mainly focus on the estimation of the spatially-variant motion blur kernels and the corresponding reblurring training strategy for SIDSBD.

III THE PROPOSED CDCN

Our proposed CDCN for SIDSBD is illustrated in Fig. 1. As we discussed above, the main contributions of our proposed CDCN are a novel MSML-MIMO encoder-decoder architecture for more powerful feature extraction ability and a CDCR strategy for accurate spatially-variant motion blur kernel estimation and the corresponding inverse kernel prediction. Therefore, in this section, we first discuss the MSML-MIMO encoder-decoder architecture and then the CDCR strategy in detail.

Figure 1: Our proposed constrained deformable convolutional network (CDCN).

III-A The MSML-MIMO Encoder-Decoder Architecture

Our proposed MSML-MIMO encoder-decoder architecture is illustrated in Fig. 1, where $B_{ijk}$, $S_{i1k}$ and $R_{i2k}$ ($i\in\{1,2,3\}$, $j\in\{1,2\}$, $k\in\{1,2,3\}$) represent the blurred image, the recovered image and the residual image at different scales, levels and orders, respectively. In detail, $B_{ijk}$ denotes the observed blurred image input into the $k$-th encoder block (EnBlock) at the $j$-th level of the $i$-th scale, $S_{i1k}$ denotes the restored image inferred by the $k$-th decoder block (DeBlock) at the 1st level of the $i$-th scale, and $R_{i2k}$ denotes the residual image generated by the $k$-th DeBlock at the 2nd level of the $i$-th scale. $B_{1jk}$, $B_{2jk}$ and $B_{3jk}$ are a sequence of blurry images downsampled from the observed original full-resolution blurred image at different scales. $EB_{ijk}$, $DB_{ijk}$ and $CDCR_{ij}$ denote the $k$-th EnBlock at the $j$-th level of the $i$-th scale, the $k$-th DeBlock at the $j$-th level of the $i$-th scale, and the CDCR module at the $j$-th level of the $i$-th scale, respectively.

As shown in Fig. 1, our proposed MSML-MIMO encoder-decoder architecture includes three scales, each scale consists of two levels, and each level contains three EnBlocks, three DeBlocks, and a CDCR module. Except for $EB_{ij1}$, the first convolution of $EB_{ij2}$ and $EB_{ij3}$ is a strided convolution with a stride of 2 for downsampling, and the stride of the transposed convolution in $DB_{ij1}$ and $DB_{ij2}$ is 2 for upsampling. For $DB_{i21}$ and $DB_{i22}$, two additional $3\times 3$ convolution layers with 3 output channels are used to obtain the intermediate residual images at different scales. Similarly, for $DB_{i11}$ and $DB_{i12}$, two additional $3\times 3$ convolution layers with 3 output channels are used to obtain the intermediate restored images at different scales.

As illustrated in Fig. 1, the deblurring process of CDCN starts at the second (lower) level of the third scale. At the second level of each scale, the inputs of $EB_{321}$, $EB_{221}$ and $EB_{121}$ are $EB_{321}^{in}=B_{321}\copyright B_{321}$, $EB_{221}^{in}=B_{221}\copyright S_{313}^{\uparrow}$ and $EB_{121}^{in}=B_{121}\copyright S_{213}^{\uparrow}$, respectively. The input of $EB_{32k}$ consists of the output of $EB_{32(k-1)}$ and $B_{32k}$ ($k\in\{2,3\}$), and the input of $EB_{i2k}$ consists of the output of $EB_{i2(k-1)}$ and the concatenation of $B_{i2k}$ and $S_{(i+1)1(4-k)}^{\uparrow}$ ($i\in\{1,2\}$, $k\in\{2,3\}$). At the first (upper) level of each scale, the input of $EB_{i11}$ is $EB_{i11}^{in}=B_{i11}\oplus R_{i23}$, while the input of $EB_{i1k}$ consists of the output of $EB_{i1(k-1)}$ and the sum of $B_{i1k}$ and $R_{i2(4-k)}$ ($i\in\{1,2\}$, $k\in\{2,3\}$). Here $EB_{ijk}^{in}$ and $S_{ijk}^{\uparrow}$ denote the input of $EB_{ijk}$ and the upsampled version of $S_{ijk}$ (the green arrow in Fig. 1), respectively.

For the DeBlocks, the input of $DB_{ij1}$ is the output of the $CDCR_{ij}$ module, and the input of $DB_{ijk}$ ($k\in\{2,3\}$) is the concatenation of $DB_{ij(k-1)}^{out}$, $EB_{ij1}^{out}$, $EB_{ij2}^{out}$ and $EB_{ij3}^{out}$ obtained by using the resize operation (the earthy yellow arrow in Fig. 1), which can be formulated as:

$DB_{ij2}^{in}=DB_{ij1}^{out}\copyright EB_{ij1}^{out\downarrow}\copyright EB_{ij2}^{out}\copyright EB_{ij3}^{out\uparrow}$ (1)
$DB_{ij3}^{in}=DB_{ij2}^{out}\copyright EB_{ij1}^{out}\copyright EB_{ij2}^{out\uparrow}\copyright EB_{ij3}^{out\uparrow}$ (2)

where $EB_{ijk}^{out}$ and $DB_{ijk}^{out}$ denote the outputs of $EB_{ijk}$ and $DB_{ijk}$, respectively. The CDCR module is used to estimate the spatially-variant motion blur kernels, predict the corresponding inverse kernels, and output the deblurred features. At the second level of each scale, the input of $CDCR_{i2}$ is $EB_{i23}^{out}$, and at the first (upper) level of each scale, the input of $CDCR_{i1}$ is $CDCR_{i1}^{in}=EB_{i13}^{out}\oplus CDCR_{i2}^{out}$, where $CDCR_{ij}^{in}$ and $CDCR_{ij}^{out}$ denote the input and the output of the $CDCR_{ij}$ module, respectively.
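To make the resize-and-concatenate fusion of Equations (1) and (2) concrete, a minimal PyTorch-style sketch is given below. The function name and the use of bilinear interpolation for the resize operation are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn.functional as F

def fuse_level_features(db_prev_out, eb_outs, target_hw):
    # Resize the three EnBlock outputs of a level to the spatial size expected by the
    # current DeBlock (downsampling or upsampling as in Eqs. (1)-(2)), then concatenate
    # them with the previous DeBlock output along the channel dimension.
    resized = [F.interpolate(eb, size=target_hw, mode="bilinear", align_corners=False)
               for eb in eb_outs]
    return torch.cat([db_prev_out] + resized, dim=1)

# Example for Eq. (1): DB_ij2 takes DB_ij1's output together with EB_ij1 (downsampled),
# EB_ij2 (unchanged) and EB_ij3 (upsampled), all brought to EB_ij2's resolution.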

From the above analyses and Fig. 1 we can see that, first, our MSML-MIMO encoder-decoder architecture conducts the residual manner between levels. Beyond our early work in [27], it does so not only once but three times, between the three DeBlocks of the second level and the corresponding three EnBlocks of the first level within the same scale. Second, similar to the rich residual connections between levels, for each scale we conduct the intermediate supervision not only once but apply multiple intermediate supervisions to all DeBlocks of the first level of each scale, and except for scale 1, all intermediate supervision results of the previous scale are used to guide the image restoration at the current scale. Finally, different from most conventional coarse-to-fine image blind deblurring networks, which fuse different depths of features only between EnBlocks and DeBlocks with the same spatial resolution, our proposed MSML-MIMO encoder-decoder architecture fuses features from different spatial resolutions within each level: the second and third DeBlocks of each level take the outputs of all EnBlocks within the same level as inputs and merge the different-resolution features using the concatenation operation and convolutional layers. Therefore, because of the utilization and fusion of more information flows and informative features, our proposed CDCN possesses more powerful feature extraction ability, which facilitates accurate estimation of spatially-variant motion blur kernels and higher quality image restoration for SIDSBD.

III-B THE CDCR STRATEGY

Figure 2: Our proposed constrained deformable convolution reblurring (CDCR) strategy.

As illustrated in Fig. 2, our proposed CDCR strategy consists of a CDCR module and a PMPB-based reblurring loss function, and is responsible for learning the sampling points of the spatially-variant motion blur kernels and predicting the sampling points of their corresponding inverse kernels from one single motion blurred image. In this subsection, we discuss the CDCR module and the PMPB-based reblurring loss function in detail.

As shown in Fig. 2, the proposed CDCR module is a small CNN, which consists of one SoftMax layer, two PReLU layers, three regular convolutional layers and one deformable convolution. On the one hand, the deformable convolution has been shown to make the spatial sampling locations focus efficiently on the image content of interest[42, 43]. On the other hand, the PMPB model, which takes a motion blurred image as the result of integrating all intermediate images the camera “sees” along the trajectory of the relative motion, is one of the best motion blur models for modeling spatially-variant motion blurs. Therefore, inspired by the deformable convolution and the PMPB model, in our proposed CDCR module we first use a regular convolution to learn the spatial sampling locations and the corresponding weights of the trajectory of the relative motion for each pixel, which is similar to the estimation of the motion density function of the camera shake in the PMPB model. The learning of the spatial sampling locations and the corresponding weights can be formulated as:

$offset_{bk},\,weight_{bk}=Sep(Conv_{1}(f_{in}))$ (3)
$weight_{bk}=SoftMax(weight_{bk})$ (4)

where $f_{in}$ denotes the input features. For $CDCR_{i2}$, $f_{in}$ is the blurred features extracted by the EnBlocks, i.e. $EB_{i23}^{out}$; for $CDCR_{i1}$, $f_{in}=EB_{i13}^{out}+CDCR_{i2}^{out}$. $Conv_{1}(\cdot)$ denotes the regular convolution operation with $3N$ output channels, where $N$ denotes the number of sampling points of the blur kernel. $Sep(\cdot)$ denotes the separation operation, which divides the $3N$ channels into $2N$ channels for $offset_{bk}$ and $N$ channels for $weight_{bk}$. Here $offset_{bk}=[offset_{bk,1},offset_{bk,2},\ldots,offset_{bk,N}]$ and $offset_{bk,n}=(offset_{bk,n,x},offset_{bk,n,y})$, $n\in\{1,2,\ldots,N\}$, where $offset_{bk,n,x}$ and $offset_{bk,n,y}$ denote the $x$-coordinate and the $y$-coordinate of the $n$-th sampling point of the motion blur kernel for each pixel, respectively. Therefore, $offset_{bk}$ gives the spatial locations of the $N$ sampling points, which approximate the trajectory of the relative motion of each pixel (i.e. the shape of the motion blur kernel). And $weight_{bk}=[weight_{bk,1},weight_{bk,2},\ldots,weight_{bk,N}]$, where $weight_{bk,n}$ denotes the fraction of the exposure time (FET) spent at the $n$-th sampling point of the motion blur kernel for each pixel. Based on the principle of the PMPB model, the $SoftMax$ operation is applied to $weight_{bk}$ for energy conservation. Then, we propose a PMPB-based reblurring loss function to constrain the accuracy of $offset_{bk}$ and $weight_{bk}$:

$L_{reblur}=\sum_{i=1}^{3}\sum_{j=1}^{2}\Big(\frac{1}{M}\sum_{m=1}^{M}\|(B_{reblur_{ij3}})^{m}-(B_{ij3})^{m}\|_{2}^{2}\Big)$ (5)

where $(B_{ij3})^{m}$ denotes the $m$-th observed blurred image at the 3rd EnBlock of the $j$-th level of the $i$-th scale, and $(B_{reblur_{ij3}})^{m}=\sum_{n=1}^{N}warp((offset_{bk,n})_{ij3}^{m},(S_{GT_{i11}})^{m})\otimes(weight_{bk,n})_{ij3}^{m}$ denotes the $m$-th reconstructed blurred image obtained with the PMPB model at the 3rd EnBlock of the $j$-th level of the $i$-th scale. $(S_{GT_{i11}})^{m}$ is the $m$-th ground truth sharp image at the 1st DeBlock of the 1st level of the $i$-th scale, and $(offset_{bk,n})_{ij3}^{m}$ and $(weight_{bk,n})_{ij3}^{m}$ denote the learned $offset_{bk,n}$ and $weight_{bk,n}$ for $(B_{ij3})^{m}$. $\|\cdot\|_{2}^{2}$ and $\otimes$ denote the squared $L_{2}$ norm and the element-wise multiplication, respectively, and $warp(\cdot)$ denotes the warp operation using bilinear interpolation.

From Equation (5) we can see that our proposed PMPB-based reblurring loss function constrains the solution spaces of the learned $offset_{bk}$ and $weight_{bk}$, which makes the learned sampling points fit the trajectory of the relative motion of each pixel well and achieves accurate spatially-variant motion blur kernel estimation from a single motion blurred image.
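For clarity, a minimal PyTorch-style sketch of Equations (3)-(5) is shown below. The module and function names, the use of grid_sample for the bilinear warp, and the interpretation of the offset channels as per-pixel (x, y) displacements in pixels are illustrative assumptions rather than the exact implementation; the offsets and the sharp image are assumed to share the same spatial resolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurKernelSampler(nn.Module):
    # Sketch of Eqs. (3)-(4): one regular convolution predicts 3N channels, split into
    # 2N per-pixel offsets (x/y of each sampling point) and N weights; SoftMax keeps
    # the fraction-of-exposure-time weights summing to one (energy conservation).
    def __init__(self, in_channels, num_points):
        super().__init__()
        self.n = num_points
        self.conv1 = nn.Conv2d(in_channels, 3 * num_points, kernel_size=3, padding=1)

    def forward(self, f_in):
        out = self.conv1(f_in)                                               # Conv_1(f_in)
        offset_bk, weight_bk = torch.split(out, [2 * self.n, self.n], dim=1) # Sep()
        return offset_bk, torch.softmax(weight_bk, dim=1)                    # Eq. (4)

def pmpb_reblur(sharp, offset_bk, weight_bk):
    # Sketch of the reblurring term inside Eq. (5): every sampling point warps the
    # ground-truth sharp image by its learned per-pixel offset (bilinear warp), and
    # the warped copies are blended with the learned FET weights.
    b, _, h, w = sharp.shape
    n = weight_bk.shape[1]
    xs = torch.arange(w, device=sharp.device).float().view(1, 1, w).expand(b, h, w)
    ys = torch.arange(h, device=sharp.device).float().view(1, h, 1).expand(b, h, w)
    reblurred = torch.zeros_like(sharp)
    for i in range(n):
        dx = offset_bk[:, 2 * i]                                   # per-pixel x offset
        dy = offset_bk[:, 2 * i + 1]                               # per-pixel y offset
        gx = 2.0 * (xs + dx) / (w - 1) - 1.0                       # normalize to [-1, 1]
        gy = 2.0 * (ys + dy) / (h - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                       # (B, H, W, 2)
        warped = F.grid_sample(sharp, grid, mode="bilinear", align_corners=True)
        reblurred = reblurred + warped * weight_bk[:, i:i + 1]     # element-wise FET weighting
    return reblurred

def reblur_loss(reblurred, blurred):
    # The L2 term of Eq. (5) for one (scale, level) pair; summing over scales/levels
    # and averaging over the batch is omitted here.
    return F.mse_loss(reblurred, blurred)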

Then, after the learning of the spatially-variant motion blur kernels, we apply two PReLU layers and two regular convolutional layers to $offset_{bk}$ and $weight_{bk}$ for predicting the inverse kernels, which can be formulated as:

$Conv_{2}^{out}=Conv_{2}(Pre(offset_{bk})\copyright weight_{bk})$ (6)
$offset_{inbk},\,weight_{inbk}=Sep(Conv_{3}(Pre(Conv_{2}^{out})))$ (7)

where $Pre(\cdot)$ denotes the PReLU activation function and $\copyright$ denotes the concatenation operation. $Conv_{3}(\cdot)$ denotes the regular convolution operation with $3M$ output channels, where $M$ denotes the number of sampling points of the inverse kernel. Again, $Sep(\cdot)$ divides the $3M$ channels into $2M$ channels for $offset_{inbk}$ and $M$ channels for $weight_{inbk}$. Similar to $offset_{bk}$, $offset_{inbk}$ gives the spatial locations of the $M$ sampling points of the inverse kernel for each pixel, which approximate the shape of the inverse kernel. And similar to $weight_{bk}$, $weight_{inbk}$ gives the weights of the $M$ sampling points of the inverse kernel for each pixel, but without the $SoftMax$ operation.

Finally, the predicted inverse kernels are directly applied to $f_{in}$ to generate the deblurred features by using one deformable convolution.
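A corresponding sketch of Equations (6) and (7) is given below; the hidden channel width and kernel sizes are assumptions, and the final deformable convolution that applies the predicted inverse kernels to $f_{in}$ is only indicated in a comment (it can be emulated with the same grid_sample-based gathering used in the reblurring sketch above, with weight_inbk left unnormalized).

import torch
import torch.nn as nn

class InverseKernelPredictor(nn.Module):
    # Sketch of Eqs. (6)-(7): the learned blur-kernel offsets/weights are mapped through
    # two PReLU layers and two regular convolutions to the M sampling points (offset_inbk)
    # and weights (weight_inbk) of the per-pixel inverse kernel. No SoftMax is applied
    # to weight_inbk.
    def __init__(self, n_points, m_points, hidden_channels=64):  # hidden width is an assumption
        super().__init__()
        self.m = m_points
        self.pre1 = nn.PReLU()
        self.conv2 = nn.Conv2d(3 * n_points, hidden_channels, kernel_size=3, padding=1)
        self.pre2 = nn.PReLU()
        self.conv3 = nn.Conv2d(hidden_channels, 3 * m_points, kernel_size=3, padding=1)

    def forward(self, offset_bk, weight_bk):
        x = torch.cat((self.pre1(offset_bk), weight_bk), dim=1)  # Eq. (6): Pre(offset_bk) (c) weight_bk
        x = self.conv3(self.pre2(self.conv2(x)))                 # Eq. (7): 3M output channels
        offset_inbk, weight_inbk = torch.split(x, [2 * self.m, self.m], dim=1)
        # The deblurred features are then obtained by one deformable convolution that
        # samples f_in at offset_inbk and weights the samples by weight_inbk.
        return offset_inbk, weight_inbk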

III-C THE LOSS FUNCTION

The loss function of our CDCN consists of the multi-scale content loss, the multi-scale frequency reconstruction loss and the PMPB-based reblurring loss:

$L_{CDCN}=L_{content}+\lambda L_{fr}+L_{reblur}$ (8)
$L_{content}=\sum_{i=1}^{3}\sum_{k=1}^{3}\Big(\frac{1}{M}\sum_{m=1}^{M}\|(S_{i1k})^{m}-(S_{GT_{i1k}})^{m}\|_{1}\Big)$ (9)
$L_{fr}=\sum_{i=1}^{3}\sum_{k=1}^{3}\Big(\frac{1}{M}\sum_{m=1}^{M}\|F((S_{i1k})^{m})-F((S_{GT_{i1k}})^{m})\|_{1}\Big)$ (10)

where $(S_{i1k})^{m}$ and $(S_{GT_{i1k}})^{m}$ denote the $m$-th recovered image and the $m$-th ground truth sharp image at the $k$-th DeBlock of the 1st level of the $i$-th scale, respectively. $\lambda$ is set to 0.1 in our experiments. For the parameters of our proposed CDCN, we propose an inter-scale parameter sharing scheme: the parameters of the levels that have the same multi-patch model are shared across scales, whereas the parameters of the levels that belong to the same scale are independent.
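A sketch of the overall objective in Equations (8)-(10) for a single DeBlock output is shown below; summation over all scales and DeBlocks is omitted for brevity, and we assume that $F(\cdot)$ in Equation (10) denotes the 2D Fourier transform, as is common for frequency reconstruction losses.

import torch
import torch.nn.functional as F

def cdcn_loss_single(restored, sharp_gt, l_reblur, lam=0.1):
    # Eq. (9): L1 content loss between the restored image and the ground truth.
    l_content = F.l1_loss(restored, sharp_gt)
    # Eq. (10): L1 loss between the (assumed) FFT spectra of the restored image and GT.
    fr_restored = torch.view_as_real(torch.fft.fft2(restored))
    fr_gt = torch.view_as_real(torch.fft.fft2(sharp_gt))
    l_fr = F.l1_loss(fr_restored, fr_gt)
    # Eq. (8): total loss with lambda = 0.1, plus the PMPB-based reblurring term.
    return l_content + lam * l_fr + l_reblur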

III-D The Differences to multi-input multi-output U-net (MIMO-UNet)

The multi-input multi-output strategy and the fusion of different-resolution features are also introduced in MIMO-UNet for SIDSBD. However, there are two main differences between MIMO-UNet and our CDCN. The first difference is the network architecture. In MIMO-UNet, Cho et al.[23] adopt a traditional single-scale coarse-to-fine strategy, where an input blurred image is encoded and decoded only once, and the multiple output deblurred images cannot be used to guide the image restoration of other scales. In contrast, our CDCN consists of three scales, each containing two levels, so a current-scale blurred image is encoded and decoded twice. And, because of the multi-scale and multi-level architecture, in our CDCN all intermediate deblurred results can be used to guide the image restoration at the next scale. The second difference is the estimation of the spatially-variant motion blur kernels. In MIMO-UNet, Cho et al.[23] directly remove the spatially-variant motion blurs from one inputted motion blurred image, without blur kernel estimation. Compared with MIMO-UNet, our CDCN can accurately estimate spatially-variant motion blur kernels from only one single motion blurred image without the optical flow.

III-E The Differences to Spatially Variant Deconvolution Network (SVDN)

Another related work to our CDCN is SVDN, where the deformable convolution is utilized to learn the sampling points of the blur kernels. Although our CDCN also adopts the deformable convolution, there are three main differences between SVDN and our CDCN. First, for the learning of the sampling points, SVDN uses the bi-directional optical flows to approximate the blur kernels, and makes the spatial distribution of the deformable sampling points close to the optical flows by using a distance-based loss function; therefore, SVDN is still essentially a multi-frame-based blur kernel estimation method. On the contrary, our CDCN learns the sampling points of the spatially-variant motion blur kernels from only one inputted motion blurred image by using a regular convolution, without any optical flow; therefore, our CDCN is essentially a single-image-based blur kernel estimation method, which is well suited to SIDSBD. Second, compared with SVDN, our CDCN proposes a PMPB-based reblurring loss function to constrain the accuracy of the learned sampling points, which makes the learned sampling points fit the trajectory of the relative motion of each pixel well and achieves accurate spatially-variant motion blur kernel estimation from a single motion blurred image. Third, for the deconvolution operation, SVDN uses two deformable convolutions to obtain the deblurred features, whereas our CDCN generates the deblurred features with only one deformable convolution and still achieves better deblurring performance for SIDSBD.

IV EXPERIMENTS

To compare our method with the state-of-the-art SIDSBD methods and demonstrate its effectiveness, extensive experiments are performed on a PC with four NVIDIA GeForce RTX 3090 GPUs and an Intel Core i9-10980XE CPU, using the PyTorch 1.9.0 library.

IV-A The Datasets and Implementation Details

In this paper, we train our model on the GoPro[14] dataset and then test it on the GoPro and HIDE[44] test datasets, respectively. For the GoPro dataset, we use 2103 image pairs for training and the remaining 1111 pairs for testing, which is the same as [14]. For the HIDE dataset, we use 2025 pairs for testing, which is the same as [44].

For the training of our CDCN, Adam[45] with $\beta_{1}=0.9$, $\beta_{2}=0.999$ and $\varepsilon=1e{-}8$ is used as the optimizer, and the network is trained for 2000 epochs, which is sufficient for convergence. The learning rate is initially set to $1e{-}4$ and decreased by a factor of 0.5 every 200 epochs. For every training iteration, we randomly sample eight images. $N$ and $M$ are set to 3 and 7, respectively. The number of residual blocks in each EnBlock and each DeBlock is 8. Unless otherwise specified, all other convolutional layers use $3\times 3$ convolution kernels by default. For the quantitative metrics, we use the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) to evaluate the performance of our method. For data augmentation, each patch is horizontally flipped with a probability of 0.5.
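The optimization schedule above can be reproduced roughly as follows; here model, train_loader and compute_cdcn_loss are placeholders rather than the released implementation.

import torch

# Placeholders: "model" is the CDCN, "train_loader" yields batches of eight blurred/sharp
# pairs, and "compute_cdcn_loss" evaluates Eq. (8) over all scales and levels.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for epoch in range(2000):
    for blurred, sharp in train_loader:
        optimizer.zero_grad()
        loss = compute_cdcn_loss(model, blurred, sharp)
        loss.backward()
        optimizer.step()
    scheduler.step()  # halve the learning rate every 200 epochs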

IV-B Ablation Experiments

TABLE I: ABLATION STUDY OF THE PROPOSED PMPB-BASED REBLURRING LOSS, THE CDCR MODULE AND THE MSML-MIMO ARCHITECTURE. THE PSNR AND SSIM ARE OBTAINED BY AVERAGING 1,111 GOPRO TESTING IMAGES
Models PSNR SSIM
CDCN-NoPMPBReBlur 32.08 0.952
CDCN-NoCDCR 31.87 0.948
CDCN-1level 31.98 0.951
CDCN-NoMIMO 32.18 0.954
CDCN 32.59 0.958
Figure 3: Visualization of the sampling points with the PMPB-based reblurring constraint ((a)-(c)) and without the PMPB-based reblurring constraint ((d)-(f)).
Figure 4: The subjective visual evaluation results of (a) CDCN-NoPMPBReBlur, (b) CDCN-NoCDCR, (c) CDCN-1level, (d) CDCN-NoMIMO and (e) our proposed CDCN.

As we discussed above, the main contributions of our proposed CDCN are: the PMPB-based reblurring loss function for constraining the convergence of the sampling points of the spatially-variant motion blur kernels, the CDCR module for predicting the inverse kernels and generating deblurred features, and the MSML-MIMO architecture for more powerful feature extraction ability. In this subsection, we test these three parts individually via experiments to better understand the effectiveness of each part.

1) Evaluation of the PMPB-based reblurring loss: As shown in Equation (5), the proposed PMPB-based reblurring loss function constrains the solution spaces of $offset_{bk}$ and $weight_{bk}$, and makes the learned sampling points fit the trajectory of the relative motion of each pixel well. In order to verify the effectiveness of the PMPB-based reblurring loss, we compare our proposed CDCN with a model named CDCN-NoPMPBReBlur, which removes Equation (5) from the proposed CDCN (i.e. there is no constraint on $offset_{bk}$ and $weight_{bk}$); in other words, the values of $offset_{bk}$ and $weight_{bk}$ are completely self-learned. The CDCN-NoPMPBReBlur model shares the same framework as Fig. 2 and is trained on the same GoPro training dataset as the proposed CDCN model. Table I shows that without the PMPB-based reblurring constraint, the PSNR and SSIM of the CDCN-NoPMPBReBlur model drop by 0.51 dB and 0.006, respectively.

Fig. 3 shows comparisons of the distributions of the sampling points with and without the PMPB-based reblurring constraint. On the one hand, from Fig. 3b we can see that the light and shadow on the ground provide clear information about the relative movement trajectory, and our proposed PMPB-based reblurring loss constrains the learned sampling points to fit the trajectory of the relative motion of each pixel very well: the shape formed by the red points is very similar to that of the light and shadow (please see Figs. 3a, 3b and 3c and the zoomed-in regions). On the other hand, from Fig. 3c we can see that our proposed PMPB-based reblurring loss also makes the sampling points change with the level of the blur: different blurred regions have different sampling point distributions, which are consistent with the degrees of the blurs (please see the left, middle and right regions of Fig. 3c and the zoomed-in region). However, the CDCN-NoPMPBReBlur model cannot learn any information about the spatially-variant motion blur kernels (please see Figs. 3d, 3e and 3f, and the zoomed-in regions).

2) Evaluation of the CDCR module: As shown in Fig. 2, we use two PReLU layers and two regular convolutional layers to predict the inverse kernels from the learned $offset_{bk}$ and $weight_{bk}$, and then the predicted inverse kernels are used to generate the deblurred features via a deformable convolution. To demonstrate the effectiveness of the CDCR module, we compare our proposed CDCN with a model named CDCN-NoCDCR, which removes $Conv_{2}(\cdot)$, $Conv_{3}(\cdot)$, the two PReLU layers, and the final deformable convolution from the CDCR module, and keeps only $Conv_{1}(\cdot)$ and the $SoftMax$ for calculating the PMPB-based reblurring loss. Therefore, in the CDCN-NoCDCR model, $DB_{i21}^{in}=EB_{i23}^{out}$ and $DB_{i11}^{in}=EB_{i23}^{out}\oplus EB_{i13}^{out}$. The CDCN-NoCDCR model is trained on the same GoPro training dataset as the proposed CDCN model. As can be seen from Table I, without the prediction of the inverse kernels, the PSNR and SSIM of the CDCN-NoCDCR model are reduced by 0.72 dB and 0.01, respectively.

3) Evaluation of the MSML-MIMO architecture: To demonstrate the advantage of our MSML-MIMO architecture, we compare our proposed CDCN with two baseline models: the CDCN-1level model, where each scale contains only one level, and the CDCN-NoMIMO model, where only $B_{i11}$ and $B_{i21}$ are considered as the inputs of the $i$-th scale and only $S_{i13}$ is considered as the output of the $i$-th scale. The CDCN-1level and CDCN-NoMIMO models are trained on the same GoPro training dataset as the proposed CDCN model. From Table I we can see that, on the one hand, compared with our CDCN model, the CDCN-1level model cannot conduct the residual operation between levels, and its PSNR and SSIM are reduced by 0.61 dB and 0.007, respectively. On the other hand, without more input information and intermediate restoration results, the PSNR and SSIM of the CDCN-NoMIMO model are reduced by 0.41 dB and 0.004, respectively.

Fig. 4 illustrates the subjective visual evaluation results of the CDCN-NoPMPBReBlur, CDCN-NoCDCR, CDCN-1level and CDCN-NoMIMO models and our proposed CDCN. We can see that our proposed CDCN obtains the highest quality restored image, with sharper and clearer edges. Table I and Fig. 4 demonstrate that all contributions play important roles in estimating accurate blur kernels and recovering high quality images.

IV-C The Comparisons With the State-of-the-Art SIDSBD Methods on the Synthetic Benchmark Datasets

TABLE II: QUANTITATIVE COMPARISONS OF ALL THE MODELS. ALL MODELS ARE TRAINED ONLY ON THE GOPRO[14] TRAINING IMAGE PAIRS AND DIRECTLY APPLIED TO THE GOPRO AND HIDE[44] TESTING DATASETS.
Method GOPRO HIDE
PSNR SSIM PSNR SSIM
Xu et al.[5] 21.00 0.741 - -
DeblurGAN[15] 28.70 0.858 24.51 0.871
Nah et al.[14] 29.08 0.914 25.73 0.874
Zhang et al.[17] 29.19 0.931 - -
DeblurGAN-v2[46] 29.55 0.934 26.61 0.875
SRN[16] 30.26 0.934 28.36 0.915
Gao et al.[19] 30.90 0.935 29.11 0.913
DBGAN[29] 31.10 0.942 28.94 0.915
MT-RNN[30] 31.15 0.945 29.15 0.918
DMPHN[18] 31.20 0.940 29.09 0.924
MSCAN[27] 31.24 0.945 29.63 0.921
Suin et al.[31] 31.85 0.948 29.98 0.930
SPAIR[24] 32.06 0.953 30.29 0.931
MIMO-UNet+[23] 32.45 0.957 29.99 0.930
CDCN 32.59 0.958 30.55 0.935
Figure 5: The qualitative evaluation comparisons of all the methods on the GoPro testing dataset: (a) blurry image, (b) Xu et al.[5], (c) Nah et al.[14], (d) DeblurGAN[15], (e) SRN[16], (f) Gao et al.[19], (g) Cai et al.[20], (h) MSCAN[27], (i) MIMO-UNet+[23], (j) our CDCN.
Figure 6: The qualitative evaluation comparisons of all the methods on the GoPro testing dataset: (a) blurry image, (b) Xu et al.[5], (c) Nah et al.[14], (d) DeblurGAN[15], (e) SRN[16], (f) Gao et al.[19], (g) Cai et al.[20], (h) MSCAN[27], (i) MIMO-UNet+[23], (j) our CDCN.

For the synthetic benchmark dataset comparisons, we use two synthetic datasets, GoPro[14] and HIDE[44], and compare our method with fourteen state-of-the-art SIDSBD methods (Xu et al.[5], DeblurGAN[15], Nah et al.[14], Zhang et al.[17], DeblurGAN-v2[46], SRN[16], Gao et al.[19], DBGAN[29], MT-RNN[30], DMPHN[18], MSCAN[27], Suin et al.[31], SPAIR[24] and MIMO-UNet+[23]). For a fair comparison, the models [15], [14], [17], [46], [16], [19], [29], [30], [18], [27], [31], [24] and [23] are trained on the 2103 GoPro training image pairs. Because the method [5] is an optimization-based method, we use the executable program provided by Xu et al. for the comparison. Table II shows the mean PSNR and mean SSIM values of all models on the 1111 GoPro testing image pairs and the 2025 HIDE testing image pairs. From Table II we can see that our CDCN significantly outperforms all the compared methods and attains the highest mean PSNR and mean SSIM values on both the GoPro and the HIDE testing images. In summary, our CDCN produces better deblurring results than the state-of-the-art SIDSBD methods in terms of the quantitative metrics.

Because of the space limitation, here we only use two GoPro testing images (Fig. 5 and Fig. 6) to illustrate the qualitative evaluation comparisons of the models [5], [15], [14], [16], [19], [20], [27], [23] and our CDCN model. From Fig. 5 and Fig. 6 we can see that the images deblurred by the models [5], [15], [14], [16], [19], [20], [27] and [23] suffer from one or more flaws of different degrees: blur, distortion and deformation. By contrast, our CDCN obtains the highest quality restored images: it not only removes various artifacts effectively, but also recovers sharper edges and more details (please see Figs. 5b-5i and Figs. 6b-6i and the corresponding zoomed-in regions). Table II, Fig. 5 and Fig. 6 demonstrate the superiority of our method on the synthetic benchmark datasets. More experimental results on the synthetic benchmark datasets are available at https://github.com/wuyang1002431655/CDCN.

IV-D The Comparisons with the state-of-the-art methods on the real blurred images

Figure 7: The qualitative evaluation comparisons of all the methods on the real blurred images: (a) blurred image, (b) SRN[16], (c) Gao et al.[19], (d) MIMO-UNet+[23], (e) our CDCN.
Figure 8: The qualitative evaluation comparisons of all the methods on the real blurred images: (a) blurred image, (b) SRN[16], (c) Gao et al.[19], (d) MIMO-UNet+[23], (e) our CDCN.

In addition to the synthetic blurred images, we also use real blurred images to further demonstrate the effectiveness of our method. For the real blurred images, we compare our method with the methods [16], [19] and [23], and all methods are still trained on the 2103 GoPro training image pairs. Fig. 7 and Fig. 8 illustrate the qualitative evaluation comparisons of the models [16], [19], [23] and our CDCN model on two real blurred images. From Fig. 7 and Fig. 8 we can see that the methods [16], [19] and [23] cannot recover the real blurred images well, and their deblurred images contain many flaws: ringing artifacts, blur, distortion and deformation. By contrast, our CDCN model obtains higher quality restored images, with less blur, less deformation, hardly any distortion or ringing artifacts, and sharper edges and more details (please see Figs. 7b-7e and Figs. 8b-8e and the corresponding zoomed-in regions). Figs. 7 and 8 demonstrate that our method still achieves higher quality restoration results for real blurred images. More experimental results on real blurred images are available at https://github.com/wuyang1002431655/CDCN.

V CONCLUSION

In this paper, we propose a novel constrained deformable convolutional network (CDCN) for accurate spatially-variant motion blur kernel estimation and high-quality image restoration. Inspired by the PMPB model and the deformable convolution, a novel CDCR strategy is proposed to achieve accurate spatially-variant motion blur kernel estimation from only one single motion blurred image without the optical flow, by introducing a PMPB-based reblurring loss function that makes the learned sampling points fit the trajectory of the relative motion of each pixel well. Then, a novel MSML-MIMO encoder-decoder architecture is constructed, which possesses more powerful feature extraction ability by utilizing and fusing more information flows and informative features. Extensive experiments on both the synthetic benchmark datasets and real blurred images show that our method produces better deblurring results than the state-of-the-art SIDSBD methods in terms of both qualitative evaluation and quantitative metrics. Researching more general and more powerful constraint terms and incorporating them into the PMPB-based reblurring loss function for more accurate blur kernel estimation, and extending our CDCN to other types of blurred images (e.g. defocus blurred images), are our future works.

VI ACKNOWLEDGMENT

The authors would like to thank the editor and all the reviewers. In addition, the authors thank Xu et al., Nah et al., Tao et al., Gao et al., Kupyn et al., Cho et al., and others for the source code or models they provided.

References

  • [1] J. Pan, D. Sun, H. Pfister, and M.-H. Yang, “Blind image deblurring using dark channel prior,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1628–1636.
  • [2] J. Pan, Z. Hu, Z. Su, and M.-H. Yang, “$l_0$-regularized intensity and gradient prior for deblurring text images and beyond,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 2, pp. 342–355, 2016.
  • [3] S. Tang, X. Xie, M. Xia, L. Luo, P. Liu, and Z. Li, “Spatial-scale-regularized blur kernel estimation for blind image deblurring,” Signal Processing: Image Communication, vol. 68, pp. 138–154, 2018.
  • [4] C. Wang, L. Sun, P. Cui, J. Zhang, and S. Yang, “Analyzing image deblurring through three paradigms,” IEEE Transactions on Image Processing, vol. 21, no. 1, pp. 115–129, 2011.
  • [5] L. Xu, S. Zheng, and J. Jia, “Unnatural l0 sparse representation for natural image deblurring,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 1107–1114.
  • [6] A. Gupta, N. Joshi, C. Lawrence Zitnick, M. Cohen, and B. Curless, “Single image deblurring using motion density functions,” in European conference on computer vision.   Springer, 2010, pp. 171–184.
  • [7] Y.-W. Tai, P. Tan, and M. S. Brown, “Richardson-lucy deblurring for scenes under a projective motion path,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1603–1618, 2010.
  • [8] S. Harmeling, H. Michael, and B. Schölkopf, “Space-variant single-image blind deconvolution for removing camera shake,” Advances in Neural Information Processing Systems, vol. 23, pp. 829–837, 2010.
  • [9] M. Hirsch, C. J. Schuler, S. Harmeling, and B. Schölkopf, “Fast removal of non-uniform camera shake,” in 2011 International Conference on Computer Vision.   IEEE, 2011, pp. 463–470.
  • [10] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce, “Non-uniform deblurring for shaken images,” International journal of computer vision, vol. 98, no. 2, pp. 168–186, 2012.
  • [11] Z. Hu, L. Xu, and M.-H. Yang, “Joint depth estimation and camera shake removal from single blurry image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2893–2900.
  • [12] B. Sheng, P. Li, X. Fang, P. Tan, and E. Wu, “Depth-aware motion deblurring using loopy belief propagation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 955–969, 2020.
  • [13] S. Ramakrishnan, S. Pachori, A. Gangopadhyay, and S. Raman, “Deep generative filter for motion deblurring,” in Proceedings of the IEEE international conference on computer vision workshops, 2017, pp. 2993–3000.
  • [14] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3883–3891.
  • [15] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “Deblurgan: Blind motion deblurring using conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8183–8192.
  • [16] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, “Scale-recurrent network for deep image deblurring,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8174–8182.
  • [17] J. Zhang, J. Pan, J. Ren, Y. Song, L. Bao, R. W. Lau, and M.-H. Yang, “Dynamic scene deblurring using spatially variant recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2521–2529.
  • [18] H. Zhang, Y. Dai, H. Li, and P. Koniusz, “Deep stacked hierarchical multi-patch network for image deblurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5978–5986.
  • [19] H. Gao, X. Tao, X. Shen, and J. Jia, “Dynamic scene deblurring with parameter selective sharing and nested skip connections,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3848–3856.
  • [20] J. Cai, W. Zuo, and L. Zhang, “Dark and bright channel prior embedded network for dynamic scene deblurring,” IEEE Transactions on Image Processing, vol. 29, pp. 6885–6897, 2020.
  • [21] Y. Yuan, W. Su, and D. Ma, “Efficient dynamic scene deblurring using spatially variant deconvolution network with optical flow guided training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3555–3564.
  • [22] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14821–14831.
  • [23] S.-J. Cho, S.-W. Ji, J.-P. Hong, S.-W. Jung, and S.-J. Ko, “Rethinking coarse-to-fine approach in single image deblurring,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4641–4650.
  • [24] K. Purohit, M. Suin, A. Rajagopalan, and V. N. Boddeti, “Spatially-adaptive image restoration using distortion-guided networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2309–2319.
  • [25] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17683–17693.
  • [26] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5728–5739.
  • [27] S. Wan, S. Tang, X. Xie, J. Gu, R. Huang, B. Ma, and L. Luo, “Deep convolutional-neural-network-based channel attention for single image dynamic scene blind deblurring,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 8, pp. 2994–3009, 2021.
  • [28] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9446–9454.
  • [29] K. Zhang, W. Luo, Y. Zhong, L. Ma, B. Stenger, W. Liu, and H. Li, “Deblurring by realistic blurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2737–2746.
  • [30] D. Park, D. U. Kang, J. Kim, and S. Y. Chun, “Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training,” in European Conference on Computer Vision.   Springer, 2020, pp. 327–343.
  • [31] M. Suin, K. Purohit, and A. Rajagopalan, “Spatially-attentive patch-hierarchical network for adaptive motion deblurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3606–3615.
  • [32] H. Chen, J. Gu, O. Gallo, M.-Y. Liu, A. Veeraraghavan, and J. Kautz, “Reblur2deblur: Deblurring videos via self-supervised learning,” in 2018 IEEE International Conference on Computational Photography (ICCP).   IEEE, 2018, pp. 1–9.
  • [33] J. Zhang, J. Pan, D. Wang, S. Zhou, X. Wei, F. Zhao, J. Liu, and J. Ren, “Deep dynamic scene deblurring from optical flow,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [34] H. Bai and J. Pan, “Self-supervised deep blind video super-resolution,” arXiv preprint arXiv:2201.07422, 2022.
  • [35] Y. Wang, Y. Lu, Y. Gao, L. Wang, Z. Zhong, Y. Zheng, and A. Yamashita, “Efficient video deblurring guided by motion magnitude,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • [36] A. Agrawal, Y. Xu, and R. Raskar, “Invertible motion blur in video,” in ACM SIGGRAPH 2009 papers, 2009, pp. 1–8.
  • [37] L. Xu and J. Jia, “Depth-aware motion deblurring,” in 2012 IEEE International Conference on Computational Photography (ICCP).   IEEE, 2012, pp. 1–8.
  • [38] Y. Bai, H. Jia, M. Jiang, X. Liu, X. Xie, and W. Gao, “Single-image blind deblurring using multi-scale latent structure prior,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 2033–2045, 2020.
  • [39] D. Ren, K. Zhang, Q. Wang, Q. Hu, and W. Zuo, “Neural blind deconvolution using deep priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3341–3350.
  • [40] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” Advances in neural information processing systems, vol. 27, 2014.
  • [41] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang, “Deep video deblurring for hand-held cameras,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1279–1288.
  • [42] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
  • [43] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More deformable, better results,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9308–9316.
  • [44] Z. Shen, W. Wang, J. Shen, H. Ling, T. Xu, and L. Shao, “Human-aware motion deblurring,” in IEEE International Conference on Computer Vision, 2019.
  • [45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [46] O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang, “Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8878–8887.
Shu Tang received the M.E. degree from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2007, and the Ph.D. degree from Chongqing University, China, in 2013. He is currently an associate professor with the College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, China. His research interests include signal processing, image processing, and computer vision. Email: [email protected].
Yang Wu received the bachelor's degree in engineering from Nanyang Institute of Technology, China, in 2019. He is currently pursuing the master's degree at Chongqing University of Posts and Telecommunications, China. His research interests include computer vision and deep learning. Email: [email protected].
Hongxing Qin is currently a professor at Chongqing University, Chongqing, China. He received the Ph.D. degree in pattern recognition from Shanghai Jiaotong University in 2008. He worked as a postdoctoral researcher at Rutgers, The State University of New Jersey, from 2008 to 2009. His research interests include computer graphics, digital geometry processing, medical image processing, and visualization.
Xianzhong Xie, born in 1966, received the Ph.D. degree in communication and information systems from Xidian University, China, in 2000. He is currently a professor and the Director of the Chongqing Key Laboratory of Computer Network and Communications Technology at Chongqing University of Posts and Telecommunications, China. His research interests include MIMO precoding, cognitive radio networks, and cooperative communications. He is the principal author of five books on cooperative communications, 3G, MIMO, cognitive radio, and TDD technology. He has published more than 100 journal papers and 30 papers in international conferences. Email: [email protected].
Shuli Yang is currently pursuing the doctorate degree at Chongqing University of Posts and Telecommunications, China. Her research interests include image super-resolution reconstruction and deep learning. Email: [email protected].
Jing Wang received the bachelor's degree in engineering from the Chongqing University of Posts and Telecommunications, Chongqing, China, in 2019, where she is currently studying with the School of Computer Science and Technology. Her research interest is image deblurring.