
Continuous Facial Motion Deblurring

Tae Bok Lee1, Sujy Han1, and Yong Seok Heo 1,2
1Department of Artificial Intelligence, Ajou University, South Korea
2Department of Electrical and Computer Engineering, Ajou University, South Korea
{dolphin0104, tn0502wl, ysheo}@ajou.ac.kr
Abstract

We introduce a novel framework for continuous facial motion deblurring that restores the continuous sharp moments latent in a single motion-blurred face image via a moment control factor. Although a motion-blurred image is the accumulated signal of continuous sharp moments during the exposure time, most existing single image deblurring approaches aim to restore a fixed number of frames using multiple networks and training stages. To address this problem, we propose a continuous facial motion deblurring network based on GAN (CFMD-GAN), a novel framework that restores the continuous moments latent in a single motion-blurred face image with a single network and a single training stage. To stabilize the network training, we train the generator to restore continuous moments in the order determined by our facial motion-based reordering (FMR) process, which utilizes domain-specific knowledge of the face. Moreover, we propose an auxiliary regressor that helps our generator produce more accurate images by estimating continuous sharp moments. Furthermore, we introduce a control-adaptive (ContAda) block that performs spatially deformable convolution and channel-wise attention as functions of the control factor. Extensive experiments on the 300VW dataset demonstrate that the proposed framework generates various numbers of continuous output frames by simply varying the moment control factor. Compared with recent single-to-single image deblurring networks trained on the same 300VW training set, the proposed method shows superior performance in restoring the central sharp frame in terms of perceptual metrics, including LPIPS, FID, and ArcFace identity distance. The proposed method also outperforms the existing single-to-video deblurring method in both qualitative and quantitative comparisons. On the 300VW test set, the proposed framework reaches 33.14 dB PSNR and 0.93 SSIM for the recovery of 7 sharp frames.

1 Introduction

Refer to caption
Figure 1: Comparison of single-to-video deblurring network architectures. The proposed method can restore continuous sharp motion of the face with a single network. (a) Jin et al. [34], (b) Purohit et al. [59], (c) Argaw et al. [1], (d) Zhang et al. [86], and (e) proposed CFMD.
[Figure 2 video panels: (a) Input blur image, (b) GT (7 Fr), (c) Jin et al. [34] (7 Fr), (d) GT FMR (7 Fr), (e) Ours (7 Fr), (f) Ours (51 Fr)]
Figure 2: Exemplar deblurring results. “GT” denotes the ground-truth sharp frames in the 300VW dataset [65]. “# Fr” in parentheses denotes the number of frames. The results in (e) and (f) are outputs of the same network. By adjusting the control factor value, our single network can restore any number of sharp movements from a given blurry face image. This figure contains videos that are best viewed using Adobe Reader.

Facial motion deblurring for a single image is a specific but critical branch of image deblurring, aimed at restoring a sharp image latent in a motion-blurred face image. Besides being visually unpleasant, blurry face images also degrade the performance of many facial-related computer vision tasks such as face detection [73, 87, 62], face recognition [75, 14], facial emotion recognition [80, 91], and face medical image segmentation [63]. Therefore, face deblurring studies in computer vision and image processing have received much attention.

Recently, deep neural networks (DNNs) have become widespread in image restoration fields [41, 88, 18, 15]. Among them, remarkable success has been achieved in single image face deblurring [67, 12, 11, 69, 81, 46, 36]. Most of these methods recover only a single sharp image from a motion-blurred facial image. However, motion-blurred images are the integration of continuous sharp moments during the exposure time [30, 18]. Thus, recovering such aggregated sharp moments from the blurred image can be considered the ideal goal of single image deblurring.

Several methods [34, 59, 86, 1] have been proposed to restore sharp sequences from a blurry image. However, most of these methods have several drawbacks. First, the temporal ordering problem is extremely challenging, because it is difficult to uniquely define the temporal order of the motion of an object in a blurry image [34, 59, 1]. For this reason, most existing methods fail to extract the accurate temporal order. This temporal ambiguity of the underlying motion in blurry images remains an unsolved issue [1]. Second, as shown in Fig. 1, most existing models aim to restore only a fixed number of frames, owing to their architectural design or training strategies. Jin et al. [34] proposed a cascaded architecture consisting of four deblurring networks. As depicted in Fig. 1a, each network is assigned to restore neighboring frames using the outputs from the previous networks. Thus, this method requires a large number of networks according to the number of output frames to be extracted. Purohit et al. [59] proposed using a recurrent neural network (RNN) so that various numbers of frames can be handled without architectural changes (Fig. 1b). They first extracted the middle frame using a pre-trained deblurring network and then extracted nine frames using an RNN. However, their model is fixed to restore the entire sequence with nine frames, which is the predefined number of iterations of the RNN in the training phase. Argaw et al. [1] proposed a single encoder-multiple decoder architecture trained in a single training step. However, as shown in Fig. 1c, this architecture requires as many decoders as output frames. Recently, Zhang et al. [86] have shown promising results by restoring 42 frames from a blurry image. They trained three generative adversarial networks (GANs) by repeating the reblurring and deblurring processes (Fig. 1d). However, they restore a fixed number of frames and require multiple training steps.

To address the problems described above, as shown in Fig. 1e, we propose a facial motion-based reordering (FMR) process and a continuous facial motion deblurring network based on GAN (CFMD-GAN), a novel framework for restoring the continuous moments latent in a single motion-blurred face image with a single network and a single training stage.

To alleviate the difficulty of resolving temporal ambiguity, we estimate the reordered frames instead of estimating the frames in the original temporal order. To this end, we apply a facial motion-based reordering (FMR) process, which reorders frames in the dataset based on the position of the left eye in the face (e.g., from the top-left to the bottom-right position) [72]. This reordering process helps stabilize network training.

In addition, we introduce CFMD-GAN, which restores sharp moments by varying a continuous moment control factor, allowing frames to be estimated under a continuous scenario. This approach is primarily inspired by conditional GANs (cGANs) [49, 54, 51, 4, 85], which are effective for training generators to synthesize diverse and realistic data conditioned on interpretable information, such as class labels. In our case, a single image deblurring network serves as the generator, and the conditional information for sharp image generation is the moment control factor. However, we have found that there are two main challenges in effectively incorporating cGANs into a single image deblurring framework. First, most existing cGANs are primarily developed for image synthesis conditioned on discrete labels (e.g., class labels) [16]. In contrast, we aim to restore the output images conditioned on a continuous control factor. Unlike most cGANs [54, 20, 37, 38] that use an auxiliary classifier for discrete class labels, we propose an auxiliary regressor to estimate the continuous control factor. This allows the proposed deblurring network to learn image deblurring as a function of the continuous control factor. Second, an effective network module is required to incorporate the control factor into the deblurring network. Most existing single image deblurring approaches directly learn image-to-image mapping functions without the use of a control factor. Recently, DNNs-based controllable image restoration models [27, 40, 5] have been extensively studied. Generally, these methods use a channel-wise attention module as a function of the control factor to resolve Gaussian blurs and noise in static scenes. However, in our task, spatially-variant blurs in dynamic scenes must be considered. To this end, we present a control-adaptive (ContAda) block to effectively incorporate a control factor into recent deblurring architectures. The proposed block learns the modulation weights using a spatially deformable convolution and channel-wise attention as functions of the control factor.

Extensive experiments show that the proposed CFMD-GAN restores continuous sharp moments latent in a blurry face image using a single network and a single training process. Fig. 2 exemplifies our results and compares our method with a previous method [34].

The main contributions of this study are summarized as follows.

  • We introduce the FMR process to stabilize the network training. It allows the network to utilize rich and accurate information of the ground-truth frames corresponding to the control factor during training.

  • We propose a CFMD-GAN for continuous facial motion deblurring that restores continuous sharp frames latent in a single motion-blurred face image via a moment control factor.

  • We present a ContAda block to learn the feature modulation weights of the deblurring network using spatially deformable convolution and channel-wise attention as functions of the control factor.

2 Related Works

In this section, we briefly review recent single image deblurring methods and conditional GANs, which are closely related to the present work.

2.1 Single Image Deblurring

Traditionally, the motion-blur process is formulated as the accumulation of continuous sharp moments that occur during exposure [32, 18]. By mimicking this, large-scale deblurring datasets [18, 53, 70, 68] have been proposed that synthesize a blurry image by averaging consecutive sharp frames. By leveraging such datasets, DNNs-based methods have become widespread for single image deblurring. In the following, we review existing DNNs-based single image deblurring methods in three categories.

2.1.1 Single-to-Single, General Deblurring

Single-to-single image deblurring aims to restore a single sharp image when a blurry image of a general scene is given. Earlier studies [6, 71, 19] estimated the blur kernel using DNNs and obtained the resulting image using deconvolution methods. Chakrabarti et al. [6] proposed a network that predicts the complex Fourier coefficients of a deconvolution filter and applies the predicted deconvolution filter to the input patch. Sun et al. [71] proposed a deep learning approach that estimates motion blur kernels from local patches using a Markov random field model. Gong et al. [19] developed a DNN to predict the motion flow from blurred images, which was used to recover deblurred images. Without estimating the deconvolution kernel, Nah et al. [18] utilized a coarse-to-fine network to directly restore a sharp image using their synthesized large-scale dynamic scene blur dataset. Following the success of [18], variants of coarse-to-fine networks have been proposed, such as multi-recurrent networks [74, 56], multi-patch networks [84], and efficient multi-scale networks [10]. Concretely, Tao et al. [74] designed a scale-recurrent network that shares network parameters across scales. Zhang et al. [84] cascaded a multi-patch network to restore sharp images based on different patches. In addition, Cho et al. [10] reduced computational costs by utilizing a U-Net [61]-based architecture that takes multi-scale input images and produces multi-scale outputs.

2.1.2 Single-to-Single, Face Deblurring

Face deblurring is a domain-specific task of single image deblurring that aims to obtain a sharp face from a blurry face image. Most existing methods utilize strong prior knowledge of the face, such as reference faces [55, 23], face landmarks [12, 11], face sketches [47], multi-task embedding [69], 3D face models [60], facial parsing maps [67, 81, 46], and deep feature priors [36]. Specifically, Shen et al. [67] proposed to estimate the facial parsing map from the blurry face and then utilize it for restoring the sharp image. To avoid side effects caused by incorrect parsing maps, Yasarla et al. [81] utilized an uncertainty-based multi-stream architecture. Lee et al. [46] proposed restoring the face progressively from large components, such as skin, to small components, such as the eyes and nose. More recently, Jung et al. [36] utilized the rich information of feature maps extracted from a deep neural network pre-trained on face images.

However, all single-to-single deblurring methods, including the general and facial image domains, focus on restoring only one of the many moments accumulated in the blurred image. Unlike these methods, the proposed method restores various numbers of moments from a blurred image.

2.1.3 Single-to-Video, General Deblurring

Instead of restoring a single output image, single-to-video deblurring aims to predict multiple sharp frames from a single blurred image. In the pioneering work of Jin et al. [34], a sequentially cascaded architecture consisting of multiple networks trained with a corresponding number of training steps was utilized. In their method, each network is assigned to predict pre-specified frames among all sharp frames. Thus, this method requires changing the number of networks based on the desired number of output frames and training them from scratch. Purohit et al. [59] proposed a recurrent neural network (RNN)-based method trained in two stages. In the first stage, they trained a video autoencoder to learn motion and frame generation from sharp frames. This keeps the network size independent of the number of output frames. However, the model still has to be retrained whenever the number of output frames changes. The method proposed by Zhang et al. [86] was one of the first attempts to restore continuous frames. Their method extracts a total of 42 sharp frames from a blurry image by cascading three GANs trained in three stages. However, this approach is limited to restoring a fixed number of frames. Instead of training the entire model in multiple stages, Argaw et al. [1] proposed a single framework that can be trained in an end-to-end manner. They proposed a feature transformer network consisting of a single encoder and multiple decoders, where each decoder is specified to output a specific frame. Thus, this method still requires changing the number of decoders when the number of output frames changes.

In short, existing studies are inherently limited in restoring only a fixed number of frames, owing to their rigorous architectural design or training strategies. In contrast, the proposed method differs in that 1) it restores continuous sharp frames beyond a fixed number, 2) a single deblurring network with a single training step is utilized, and 3) the proposed method can be trained in an end-to-end manner.

2.2 Conditional Generative Adversarial Networks

Generative Adversarial Networks (GANs) [22] are among the most widely used frameworks in image generation and have been extensively studied over the past few years. Conditional GANs (cGANs) [49] are variants of GANs that synthesize realistic and diverse images using conditional information, such as class labels. Depending on how the framework incorporates the data and class labels, most cGANs can be categorized into classifier-based cGANs [54, 20, 37, 38] and projection-based cGANs [51, 50, 4, 25]. Classifier-based cGANs utilize conditional information (class labels) by training an additional classifier as well as a standard GAN discriminator. Meanwhile, projection-based cGANs propose a projection discriminator that takes an inner product between the embedded class labels and the feature vector extracted from the data.

The proposed method draws inspiration from existing cGANs. To the best of our knowledge, this is the first attempt to apply continuous conditional information to the deblurring task.

3 Preliminaries

Generative Adversarial Networks (GANs) [22] are a well-established method for mimicking the probability distribution of real data by playing a min-max game between a generator $G$ and a discriminator $D$. Whereas $G$ learns to fool $D$ by generating realistic samples, $D$ learns to classify whether given samples are true data (real) or generated data (fake). Their objective $V(G,D)$ is formulated as follows.

\min_{G}\max_{D} V(G,D) = \mathbb{E}_{x\sim p(x)}[\log(D(x))] + \mathbb{E}_{z\sim p(z)}[\log(1-D(G(z)))],   (1)

where $p(x)$ denotes the real data distribution, and $p(z)$ denotes a pre-defined distribution, e.g., a Gaussian distribution. A key property of GANs is that a well-trained $G$ successfully captures the data manifold even if there are missing data in the training set [21, 17, 43].

Conditional GANs (cGANs) [49, 54, 51] are an extended GAN framework developed for conditional image synthesis. Given a pair of an image $x$ and a class label $c$ sampled from the joint distribution of the real dataset, $(x,c)\sim p(x,c)$, the goal of $G$ is to learn class-conditional image synthesis by utilizing $c$ as an additional input along with $z$. Let $p_{G}(x|c)$ denote the generative distribution specified by $G(z,c)$, and let $p_{G}(x,c):=p_{G}(x|c)p(c)$. The objective of generic cGANs [49], $V_{\text{cGAN}}(G,D)$, minimizes the Jensen-Shannon Divergence (JSD) between $p(x,c)$ and $p_{G}(x,c)$ as

\min_{G}\max_{D} V_{\text{cGAN}}(G,D) = \mathbb{E}_{(x,c)\sim p(x,c)}[\log(D(x,c))] + \mathbb{E}_{z\sim p(z),\,c\sim p(c)}[\log(1-D(G(z,c),c))].   (2)

As one of the most representative classifier-based cGANs, AC-GAN [54] introduces an auxiliary classifier $Q$ to provide feedback on the class-conditional image synthesis of $G$. In AC-GAN, $D$ and $Q$ share all weights of the feature extractor, except for the final output layer. Let $p_{Q}(c|x)$ denote the conditional distribution induced by the classifier $Q$. Then, their loss $V_{\text{AC-GAN}}(G,Q,D)$ can be expressed as follows:

\min_{G,Q}\max_{D} V_{\text{AC-GAN}}(G,Q,D) = \mathbb{E}_{(x,c)\sim p(x,c)}[\log(D(x))] + \mathbb{E}_{z\sim p(z),\,c\sim p(c)}[\log(1-D(G(z,c)))] - \lambda_{c}\underbrace{\mathbb{E}_{(x,c)\sim p(x,c)}[\log(p_{Q}(c|x))]}_{\text{(a)}} - \lambda_{c}\underbrace{\mathbb{E}_{(x,c)\sim p_{G}(x,c)}[\log(p_{Q}(c|x))]}_{\text{(b)}},   (3)

where $\lambda_{c}$ is the balancing weight between the GAN loss and the auxiliary classification losses. In Eq. 3, the first two lines are loss functions similar to those of the original GANs (Eq. 1), where $D$ serves as a binary classifier that distinguishes between real and fake samples. Terms (a) and (b) represent the auxiliary classification losses that enable $Q$ to determine the class labels of the input samples. Through this auxiliary classifier, AC-GAN can perform class-conditional image synthesis.

Refer to caption
Figure 3: An overview of our CFMD-GAN framework. Given a single motion-blurred face image, the proposed generator restores multiple sharp moments by varying a moment control factor. Subsequently, the proposed auxiliary regressor in the discriminator helps the generator learn to estimate more accurate results during training.
Refer to caption
Figure 4: Facial motion-based reordering process (FMR). We rearrange the original sequence based on the position of the left eye, i.e., from top-left to bottom-right.
Refer to caption
Figure 5: The architecture of our generator, consisting of a mapping network and a deblurring network. In the deblurring network, the proposed control-adaptive block incorporates features of the control factor and features of the blurred image.

4 Proposed Method

In this section, we first introduce the facial motion-based reordering (FMR) process, which is proposed to mitigate the temporal ambiguity problem by utilizing human face information (Sec. 4.1). Next, a detailed explanation of the key components of the proposed CFMD-GAN, which recovers the continuous moments latent in a blurry face image via a moment control factor, is provided (Sec. 4.2). Lastly, we introduce the training objectives of the proposed model (Sec. 4.3).

4.1 Facial Motion-based Reordering

One of the main challenges in restoring multiple images from a single blurred image is to resolve the temporal (sequence) ambiguity of sharp moments. A motion-blurred image is the averaged result of a continuous sharp sequence during the exposure time [32, 18]. As averaging destroys the information of the temporal order [34, 86, 1], reconstructing the original sequence of sharp moments is non-trivial. For example, suppose a blurry facial image and its corresponding original sharp sequence are given, as shown in Fig. 4. The problem is that the same blurry image can be obtained even if the face moves in a reverse or shuffled order during the exposure time. Owing to this ill-posed nature of the temporal ambiguity, finding the underlying sequence of the blurry image remains an unsolved issue [1]. In this regard, previous studies [34, 59, 1] have found that temporal ambiguity causes unstable training of the network because it is difficult to uniquely define the temporal sequence of object movements.

To alleviate this, we leverage the information of the human face to apply effective yet strong constraints. In a recent study on face landmark detection, Sun et al. [72] proposed defining the intensity of facial motion as the movement of the left eye during the time unit. Inspired by this, we devised a facial motion-based reordering (FMR) that enables the network to restore sharp face images in a generalized order based on the position of the left eye.

Specifically, as depicted in Fig. 4, FMR is a motion-based reordering process of the ground-truth (GT) sequence in a training dataset consisting of a single facial motion per video clip. Let $\mathbf{S_{t}}$ be a time-ordered set of GT frames sampled from a high-frame-rate facial video, which is denoted by

\mathbf{S_{t}} = \{s[i]\in\mathbb{R}^{H\times W\times 3}\;|\;i\in[1,N]\},   (4)

where $i$ denotes the frame index and $N$ the total number of frames. Then, a blurry image $b\in\mathbb{R}^{H\times W\times 3}$ can be approximated by averaging these GT frames as follows:

b \simeq g\Big(\frac{1}{N}\sum\nolimits_{i=1}^{N}s[i]\Big),   (5)

where $g(\cdot)$ denotes the camera response function [18]. We rearrange each $s[i]$ according to the position of the left eye $(x,y)$ in $s[i]$ (we utilize the public landmark detector provided by OpenCV [3] to obtain the position of the left eye in the face image) so that the frame in which the eye is at the top-left position comes first and the frame in which the eye is at the bottom-right position comes last. Concretely, the proposed FMR process rearranges the sharp sequence according to the following criteria: (c1) The order is primarily determined by the ascending order of $x$ values. This generalizes the erratic movement of the face as a left-to-right movement. (c2) If there are frames with the same $x$ value, those frames are sorted in ascending order of $y$ values. This also regularizes the direction of facial motion to a top-to-bottom movement. (c3) When frames have the same $(x,y)$, they are sorted in ascending temporal order.

Following the above procedure, we further transform the frame index $i$ into a continuous motion index value $u\in[0,1]$ by applying $u=\frac{i-1}{N}$. Then, we can denote this reordered set $\mathbf{S_{r}}$ as follows:

\mathbf{S_{r}} = \{s(u)\in\mathbb{R}^{H\times W\times 3}\;|\;u\in[0,1]\}.   (6)

Note that the real number $u$ becomes a moment control factor in the proposed framework.
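
As a concrete illustration, the following minimal Python sketch implements the reordering criteria (c1)-(c3) and the index-to-control-factor mapping $u=\frac{i-1}{N}$. The helper detect_left_eye is a hypothetical stand-in for a facial landmark detector (e.g., an OpenCV-based detector, as used in our implementation); everything else follows the description above.

def fmr_reorder(frames, detect_left_eye):
    """Reorder sharp frames by left-eye position and attach control factors.

    frames: time-ordered list of H x W x 3 frames from one video clip.
    detect_left_eye: callable frame -> (x, y) left-eye position (hypothetical helper).
    """
    keyed = [(detect_left_eye(f) + (t,), f) for t, f in enumerate(frames)]
    # (c1) ascending x, (c2) ascending y, (c3) ascending temporal index
    keyed.sort(key=lambda item: item[0])

    n = len(frames)
    # Map the reordered index i (1-based) to the continuous control factor u = (i - 1) / N.
    return [((i - 1) / n, frame) for i, (_, frame) in enumerate(keyed, start=1)]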

In this study, the network learns to restore the facial motion-based order in $\mathbf{S_{r}}$. It should be noted that this reordered sequence does not match the temporal sequence. Instead, the proposed framework restores all possible sharp moments latent in a blurry facial image. The FMR process gives the frames in the sequence $\mathbf{S_{r}}$ a regularity of facial motion, which helps stabilize network training. The effects of the FMR are analyzed in Sec. 5.

4.2 Continuous Facial Motion Deblurring GAN

Inspired by the success of AC-GAN [54], the proposed continuous facial motion deblurring framework CFMD-GAN consists of a generator $G$ and a discriminator $D$ with an auxiliary regressor $Q$. An overview of CFMD-GAN is depicted in Fig. 3. Given a blurry face image and a control factor, $G$ serves as a deblurring network that performs conditional image restoration. Unlike most single image deblurring methods that only recover a single deblurred image from a single blurry image, the proposed $G$ is a function that restores a deblurred image conditioned on a control factor. That is, $G$ predicts continuous sharp moments latent in a blurry image by changing the value of the control factor. To support this, $D$ learns to predict 1) whether images are real or fake [64] and 2) a regression of the control factor at the additional output layer $Q$.

4.2.1 Overall Pipeline of Generator

Given a blurry face image $b\in\mathbb{R}^{H\times W\times 3}$ and a moment control factor $u\in[0,1]$ as the condition, $G$ generates a restored face image $\hat{s}(u)\in\mathbb{R}^{H\times W\times 3}$, which is defined as

\hat{s}(u) = G(b,u).   (7)

Specifically, the proposed $G$ comprises two parts: a mapping network $G_{\text{M}}$ and a deblurring network $G_{\text{R}}$. First, $G_{\text{M}}$ translates the moment control factor $u\in[0,1]$ into the feature control factor $u_{f}\in\mathbb{R}^{H\times W\times 64}$. Second, $G_{\text{R}}$ incorporates $u_{f}$ with features extracted from $b$ and then outputs the final deblurred face image $\hat{s}(u)$. In the proposed deblurring network, we design a ContAda block so that $G$ can focus on important spatial locations and channels of features extracted from $b$ according to $u_{f}$.

Refer to caption
Figure 6: The structure of the proposed control-adaptive (ContAda) block.

Mapping Network. In recent GAN studies [39, 66, 26, 92], an additional mapping network has proven to provide more disentangled semantics for the generator than directly using input codes. Inspired by this, we adopt a mapping network $G_{\text{M}}$ that outputs the feature map control factor $u_{f}\in\mathbb{R}^{H\times W\times 64}$ from the given moment control factor $u\in[0,1]$ as

u_{f} = G_{\text{M}}(u).   (8)

As shown in Fig. 5, $G_{\text{M}}$ first expands $u$ into a 2-dimensional matrix $u_{\text{2D}}\in\mathbb{R}^{H\times W}$, where each position is filled with $u$. Then, $G_{\text{M}}$ outputs $u_{f}$ from $u_{\text{2D}}$ through several convolutional layers. Similar to [39], we design $G_{\text{M}}$ consisting of eight layers, each of which includes $1\times 1$ convolutions and a leaky ReLU [48].
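
A minimal PyTorch sketch of $G_{\text{M}}$ under the description above (scalar $u$ broadcast to an $H\times W$ map, followed by eight $1\times 1$ convolution + leaky ReLU layers) is given below. The intermediate channel width other than the final 64 and the LeakyReLU slope are our assumptions, not values taken from the paper.

import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, out_channels=64, num_layers=8):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(num_layers):
            layers += [nn.Conv2d(in_ch, out_channels, kernel_size=1),
                       nn.LeakyReLU(0.2, inplace=True)]   # slope 0.2 is an assumption
            in_ch = out_channels
        self.body = nn.Sequential(*layers)

    def forward(self, u, height, width):
        # u: (B,) tensor of moment control factors in [0, 1]
        u_2d = u.view(-1, 1, 1, 1).expand(-1, 1, height, width)  # broadcast u to an H x W map
        return self.body(u_2d)                                   # u_f: (B, 64, H, W)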

Deblurring Network. As mentioned earlier, the deblurring network $G_{\text{R}}$ generates a restored image $\hat{s}(u)\in\mathbb{R}^{H\times W\times 3}$ from the blurry face image $b\in\mathbb{R}^{H\times W\times 3}$ and the feature map control factor $u_{f}\in\mathbb{R}^{H\times W\times 64}$, as

\hat{s}(u) = G_{\text{R}}(b,u_{f}).   (9)

In this work, we employ the high-level structure of MIMO-UNet [10], which has exhibited impressive performance in the single image deblurring field. Specifically, as shown in Fig. 5, MIMO-UNet is based on an encoder-decoder architecture and comprises three encoder blocks ($\mathrm{EB_{1}}$, $\mathrm{EB_{2}}$, and $\mathrm{EB_{3}}$) and three decoder blocks ($\mathrm{DB_{1}}$, $\mathrm{DB_{2}}$, and $\mathrm{DB_{3}}$). Each of these encoder and decoder blocks contains eight modified residual blocks [74]. Unlike the original MIMO-UNet, the network developed in this study can focus on important spatial positions and channels of the feature map depending on the control factor by replacing the residual blocks with the proposed ContAda blocks. Note that SCM, FAM, and AFF are modules used in the original MIMO-UNet that represent the shallow convolutional module, feature attention module, and asymmetric feature fusion module, respectively. The details of each module, including the high-level architecture, can be found in [10]. In the following section, we discuss the proposed control-adaptive (ContAda) block.

4.2.2 Control-Adaptive Block

Applying the building blocks widely used in existing single image deblurring networks (e.g., variants of residual blocks [28]) to the proposed continuous facial motion deblurring poses a major challenge. Standard convolution-based layers have an inherent drawback in modeling geometric transformations. This drawback stems from the fact that a convolutional unit samples the input feature map at fixed spatial locations [13, 93, 94]. To alleviate this, deformable convolution [13, 93] has exhibited promising results in object detection by learning offsets of the convolution grid to adjust the receptive field dynamically. Inspired by this, several motion deblurring studies [76, 58, 82] applied a deformable convolution module to handle the complex and various latent movements in a given blurred image [76, 58]. However, these methods are still inadequate for our task because they are unable to focus on adaptive positions of the feature maps depending on the control factor.

To this end, as shown in Fig. 6, we propose a Control-Adaptive (ContAda) block that comprises a control-adaptive deformable convolution (CADC) module and a control-adaptive channel-attention (CACA) module. Let $F_{\mathrm{Im}}\in\mathbb{R}^{H_{n}\times W_{n}\times{C}_{n}}$ denote an input feature map of the ContAda block extracted from the input blurred image $b\in\mathbb{R}^{H\times W\times{3}}$. Here, $H_{n}$, $W_{n}$, and ${C}_{n}$ represent the height, width, and number of channels in the $n^{th}$ encoder/decoder block, respectively. The ContAda block starts with a $3\times 3$ convolutional layer and a LeakyReLU to extract the initial feature map $F_{o}\in\mathbb{R}^{H_{n}\times W_{n}\times{C}_{n}}$. Meanwhile, the feature control factor $u_{f}\in\mathbb{R}^{H\times W\times{C}}$, which is the output of the mapping network $G_{\text{M}}$, is reshaped to $u^{(n)}_{f}\in\mathbb{R}^{H_{n}\times W_{n}\times{C}_{n}}$ using bilinear interpolation and a $1\times 1$ convolutional layer. Then, $u^{(n)}_{f}$ is concatenated with $F_{o}$ along the channel dimension and reshaped into $F_{u}\in\mathbb{R}^{H_{n}\times W_{n}\times{C}_{n}}$ by applying a $1\times 1$ convolutional layer. $F_{u}$ is utilized as the input feature for the CADC and CACA modules. In the following, we describe CADC and CACA in detail.

The Control-Adaptive Deformable Convolution (CADC) module is based on deformable convolution [13, 93], which enhances the ability of the network to model spatial variations. Unlike [13, 93], where deformable offsets and attention weights are solely determined by internal information regarding the features of the input image, the proposed CADC learns the offsets and attention weights from the combined features of the control factor and the image. Let $K$ denote the number of sampling locations of a convolutional kernel. We denote the weight and pre-specified offset for the $k^{th}$ location as $w_{k}$ and $p_{k}$, respectively. For example, a $3\times 3$ convolutional kernel of dilation 1 has 9 sampling locations ($K=9$) and $p_{k}\in\{(-1,-1),(-1,0),\ldots,(1,1)\}$. Let $F_{u}(p)$ and $F_{dc}(p)$ denote the features at location $p$ of the input feature map $F_{u}$ and the output feature map $F_{dc}$, respectively. Accordingly, the proposed CADC can be formulated as

F_{dc}(p) = \sum\limits_{k=1}^{K} w_{k}\cdot F_{u}(p+p_{k}+\Delta p_{k})\cdot\Delta m_{k},   (10)

where $\Delta p_{k}$ and $\Delta m_{k}$ denote the learned offset and the attention weight scalar for the $k^{th}$ location, respectively. As shown in Fig. 6, $\Delta p_{k}$ and $\Delta m_{k}$ are determined by separate convolutional layers. The output of the sampling offsets branch has $2K$ channels, corresponding to $\{\Delta p_{k}\}^{K}_{k=1}$. The output of the attention weights branch has $K$ channels, corresponding to $\{\Delta m_{k}\}^{K}_{k=1}$, and each $\Delta m_{k}$ is constrained to the range $[0,1]$ by a sigmoid function. Following [93], the initial values of $\Delta p_{k}$ and $\Delta m_{k}$ are set to 0 and 0.5, respectively.

The Control-Adaptive Channel Attention (CACA) module is mainly motivated by [8, 31, 90], which benefit from applying a channel-wise attention mechanism to convolutional layers. In short, both CADC and CACA can be considered attention functions of two variables: features extracted from the blurry image and those extracted from the control factor. They are complementary in that CADC performs spatial attention to select important geometric properties of features, whereas CACA focuses on significant semantic and contextual attributes [8, 90]. Given $F_{u}$, as shown in Fig. 6, global average pooling is applied to transform channel-wise information into channel descriptors, following [90]. Subsequently, we obtain channel-wise attention weights from two $1\times 1$ convolutional layers and a sigmoid function. The learned attention weights are multiplied element-wise with $F_{dc}$, the output of the CADC.
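
The following PyTorch sketch summarizes one possible implementation of the ContAda block, assuming a $3\times 3$ kernel ($K=9$), torchvision's modulated deformable convolution (torchvision >= 0.9 for the mask argument), and a residual connection around the block; the channel widths, the reduction ratio in CACA, and the LeakyReLU slope are our assumptions.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ContAdaBlock(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        k = kernel_size * kernel_size                       # K sampling locations
        self.head = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)    # concat(F_o, u_f^(n)) -> F_u
        # CADC: offsets (2K channels) and attention weights (K channels) predicted from F_u
        self.offset = nn.Conv2d(channels, 2 * k, 3, padding=1)
        self.mask = nn.Conv2d(channels, k, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, kernel_size, padding=1)
        nn.init.zeros_(self.offset.weight); nn.init.zeros_(self.offset.bias)   # offsets start at 0
        nn.init.zeros_(self.mask.weight); nn.init.zeros_(self.mask.bias)       # sigmoid(0) = 0.5
        # CACA: channel attention from globally pooled F_u (reduction ratio 4 assumed)
        self.caca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels // 4, 1),
                                  nn.LeakyReLU(0.2, inplace=True),
                                  nn.Conv2d(channels // 4, channels, 1),
                                  nn.Sigmoid())

    def forward(self, f_im, u_f_n):
        # f_im: image features F_Im; u_f_n: control-factor features already resized to F_Im's shape
        f_o = self.head(f_im)
        f_u = self.fuse(torch.cat([f_o, u_f_n], dim=1))
        f_dc = self.dcn(f_u, self.offset(f_u), torch.sigmoid(self.mask(f_u)))  # Eq. (10)
        return f_im + f_dc * self.caca(f_u)                  # residual connection assumed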

4.2.3 Discriminator

As shown in Fig. 3, the proposed discriminator $D$ is based on the U-Net-structured discriminator [64] with an auxiliary regressor. In our framework, $G$ receives as inputs a blurred face image $b$ and a control factor $u$, and outputs an image $\hat{s}(u)=G(b,u)$. Following [33], the discriminator $D$ takes as inputs a blurred face image and the corresponding sharp face image. Here, a face image is either a real sharp image ${s}(u)$ drawn from the training dataset or a restored image $\hat{s}(u)$ from $G$. Then, $D$ provides three types of outputs from the encoder output layer $D_{enc}$, the decoder output layer $D_{dec}$, and the auxiliary regression layer $Q$.

Following [64], $D_{enc}$ determines whether the global input context is real or fake. Similarly, the final outputs of $D_{dec}$ are used to classify whether the local context of the input is sampled from real or fake data. Meanwhile, the proposed $Q$ provides a regression value for the estimated control factor. Instead of predicting a single scalar value of $u$, our $Q$ outputs $\hat{u}_{2D}\in\mathbb{R}^{H\times W}$ and is trained to estimate the ground-truth control factor map ${u}_{2D}\in\mathbb{R}^{H\times W}$.
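
A schematic PyTorch sketch of the discriminator interface is given below; the backbone depth, channel widths, and the omission of U-Net skip connections are simplifications on our part, and only the three heads ($D_{enc}$, $D_{dec}$, and $Q$) follow the description above. Applying a sigmoid to the $Q$ output so that $\hat{u}_{2D}\in[0,1]$ is also our assumption.

import torch
import torch.nn as nn

class UNetDiscriminator(nn.Module):
    def __init__(self, base=64):
        super().__init__()
        self.enc = nn.Sequential(                      # downsampling encoder (6-channel input: blur + sharp)
            nn.Conv2d(6, base, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(base, 2 * base, 4, 2, 1), nn.LeakyReLU(0.2, True))
        self.enc_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(2 * base, 1, 1))       # D_enc: global real/fake logit
        self.dec = nn.Sequential(                      # upsampling decoder (skip connections omitted)
            nn.ConvTranspose2d(2 * base, base, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.ConvTranspose2d(base, base, 4, 2, 1), nn.LeakyReLU(0.2, True))
        self.dec_head = nn.Conv2d(base, 1, 3, padding=1)               # D_dec: per-pixel real/fake logits
        self.q_head = nn.Conv2d(base, 1, 3, padding=1)                 # Q: per-pixel control-factor map

    def forward(self, blur, sharp):
        feat = self.enc(torch.cat([blur, sharp], dim=1))
        dec = self.dec(feat)
        return self.enc_head(feat), self.dec_head(dec), torch.sigmoid(self.q_head(dec))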

4.3 Model Objectives

Following [22], $D$ and $G$ are optimized alternately using the loss functions described as follows.

4.3.1 Discriminator Loss

To estimate the global and per-pixel probability distributions, the encoder loss $\mathcal{L}_{D_{enc}}$ and decoder loss $\mathcal{L}_{D_{dec}}$ are formulated as follows:

\mathcal{L}_{D_{enc}} = -\log D_{enc}(b,s(u)) + \log D_{enc}(b,G(b,u)),   (11)
\mathcal{L}_{D_{dec}} = \frac{1}{WH}\sum\limits_{i,j}^{W,H}\Big(-\log[D_{dec}(b,s(u))]_{(i,j)} + \log[D_{dec}(b,G(b,u))]_{(i,j)}\Big).

Here, $[D_{dec}(\cdot)]_{(i,j)}$ represents the decision of the discriminator decoder at pixel coordinate $(i,j)$.

To ensure that the restored image is an accurate moment of the blurry image, the auxiliary regression loss $\mathcal{L}_{Q}$ is defined by

\mathcal{L}_{Q} = \frac{1}{WH}\sum\limits_{i,j}^{W,H}\Big(\left\|u_{\text{2D}}-Q(b,s(u))\right\|_{2}^{2} + \left\|u_{\text{2D}}-Q(b,G(b,u))\right\|_{2}^{2}\Big).   (12)

The total loss of $D$ is formulated as the sum of the above objectives:

\mathcal{L}_{D} = \mathcal{L}_{D_{enc}} + \mathcal{L}_{D_{dec}} + \lambda_{Q}\mathcal{L}_{Q},   (13)

where $\lambda_{Q}$ denotes a weight parameter, which is empirically set to 0.05.
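
A sketch of Eqs. (11)-(13) in PyTorch is shown below, written with the logits convention (binary cross-entropy with logits replacing the explicit log terms) and assuming the schematic discriminator interface sketched above, which returns the encoder logit, the decoder logit map, and $\hat{u}_{2D}$.

import torch
import torch.nn.functional as F

def discriminator_loss(disc, gen, b, s_u, u, lambda_q=0.05):
    """b: blurry images, s_u: GT moments s(u), u: control factors, all batched tensors."""
    u_2d = u.view(-1, 1, 1, 1).expand_as(s_u[:, :1])            # ground-truth control-factor map
    with torch.no_grad():
        fake = gen(b, u)                                        # G(b, u), no gradient to G here
    enc_r, dec_r, q_r = disc(b, s_u)
    enc_f, dec_f, q_f = disc(b, fake)

    bce = F.binary_cross_entropy_with_logits
    loss_enc = bce(enc_r, torch.ones_like(enc_r)) + bce(enc_f, torch.zeros_like(enc_f))   # Eq. (11), encoder
    loss_dec = bce(dec_r, torch.ones_like(dec_r)) + bce(dec_f, torch.zeros_like(dec_f))   # Eq. (11), decoder
    loss_q = F.mse_loss(q_r, u_2d) + F.mse_loss(q_f, u_2d)      # Eq. (12)
    return loss_enc + loss_dec + lambda_q * loss_q              # Eq. (13)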

4.3.2 Generator Loss

Auxiliary Regression Loss. To accurately restore the output image conditioned on the control factor, an auxiliary regression loss $\mathcal{L}_{ar}$ is optimized as follows:

\mathcal{L}_{ar} = \frac{1}{WH}\sum\limits_{i,j}^{W,H}\left\|u_{\text{2D}}-Q(b,G(b,u))\right\|_{2}^{2}.   (14)

Adversarial Loss. We use the U-Net discriminator to ensure that the generated image is indistinguishable from real data in both global and local contexts. The adversarial loss $\mathcal{L}_{adv}$ is formulated as follows:

\mathcal{L}_{adv} = -\Big(\log D_{enc}(b,G(b,u)) + \frac{1}{WH}\sum\limits_{i,j}^{W,H}\log[D_{dec}(b,G(b,u))]_{(i,j)}\Big).   (15)

Pixel-wise Loss. To restore accurate pixel intensities, following [83], we employ the Charbonnier loss [7] to minimize the pixel-wise distance between a ground-truth moment and a restored image as follows:

\mathcal{L}_{pix} = \sum\limits_{n=1}^{3}\frac{1}{W_{n}H_{n}}\sum\limits_{i,j}^{W_{n},H_{n}}\sqrt{\left\|s(u)_{n}-G(b,u)_{n}\right\|^{2}+\varepsilon^{2}},   (16)

where $n$ indexes the multi-scale levels. $H_{n}$ and $W_{n}$ represent the height and width of the output image at the corresponding $n^{th}$ level, respectively. Following [83], $\varepsilon$ is set to $10^{-3}$.

Perceptual Loss. Furthermore, we use a perceptual loss to obtain perceptually satisfactory images. Similar to [35], LPIPS [89] is employed as the perceptual loss.

\mathcal{L}_{per} = \sum\limits_{l}^{M}\omega^{l}\left\|\phi^{l}(s(u))-\phi^{l}(G(b,u))\right\|_{2}^{2}.   (17)

Here, $\phi(\cdot)$ is a feature extractor, $\omega$ denotes a learned vector to measure the LPIPS score, and the total score is averaged over $M$ layers.

Overall, the total loss of $G$ combines the aforementioned loss functions:

\mathcal{L}_{G} = \lambda_{ar}\mathcal{L}_{ar} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{pix}\mathcal{L}_{pix} + \lambda_{per}\mathcal{L}_{per},   (18)

where $\lambda_{ar}$, $\lambda_{adv}$, $\lambda_{pix}$, and $\lambda_{per}$ denote the balancing weights, empirically set to 0.05, 0.1, 1, and 0.01, respectively.
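
A companion sketch of the generator objective in Eq. (18), under the same assumed discriminator interface, is given below; for brevity the Charbonnier term is written for a single output scale (the paper sums Eq. (16) over three scales), and the lpips package is used for the perceptual term (it expects inputs scaled to [-1, 1]).

import torch
import torch.nn.functional as F
import lpips                                   # LPIPS [89]; pip install lpips

lpips_fn = lpips.LPIPS(net='alex')             # feature extractor phi with learned weights omega

def generator_loss(disc, gen, b, s_u, u,
                   lam_ar=0.05, lam_adv=0.1, lam_pix=1.0, lam_per=0.01, eps=1e-3):
    fake = gen(b, u)
    enc_f, dec_f, q_f = disc(b, fake)
    u_2d = u.view(-1, 1, 1, 1).expand_as(q_f)

    loss_ar = F.mse_loss(q_f, u_2d)                                               # Eq. (14)
    loss_adv = F.binary_cross_entropy_with_logits(enc_f, torch.ones_like(enc_f)) + \
               F.binary_cross_entropy_with_logits(dec_f, torch.ones_like(dec_f))  # Eq. (15)
    loss_pix = torch.sqrt((fake - s_u) ** 2 + eps ** 2).mean()                    # Charbonnier, Eq. (16), one scale
    loss_per = lpips_fn(fake, s_u).mean()                                         # LPIPS, Eq. (17)
    return lam_ar * loss_ar + lam_adv * loss_adv + lam_pix * loss_pix + lam_per * loss_per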

5 Experiments

5.1 Experimental Setup

5.1.1 Dataset

We use the 300VW dataset [65], which consists of a large number of high-quality facial videos recorded in the wild. Each video has a duration of about one minute at 25-30 fps. Following the face deblurring study by Ren et al. [60], the training and test datasets are extracted from 83 videos and 9 videos, respectively. Each blurry image is synthesized by averaging various numbers (5-13) of consecutive sharp frames, as in recent motion deblurring studies [18, 60]. Thus, the test set consists of a total of 13,058 blurred images and 116,188 sharp frames. The details of the number of test images are listed in Table 1.
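
The blur synthesis follows Eq. (5); a minimal sketch is shown below, where the camera response function $g(\cdot)$ is assumed to be the identity for simplicity.

import numpy as np

def synthesize_blur(sharp_frames):
    """Average N (here 5-13) consecutive sharp frames (uint8, H x W x 3) into one blurry image."""
    stack = np.stack(sharp_frames).astype(np.float32)
    blur = stack.mean(axis=0)                 # Eq. (5) with g(.) taken as the identity
    return np.clip(blur, 0, 255).astype(np.uint8)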

Table 1: Configuration of facial motion deblurring testset synthesized using 300VW dataset [65].

# of averaged frames    # of blurred images    # of sharp images
5                       2753                   13765
7                       2677                   18739
9                       2605                   23445
11                      2530                   27830
13                      2493                   32409
Total                   13058                  116188

5.1.2 Implementation Details

The proposed framework is implemented with PyTorch [57] and trained on NVIDIA TITAN RTX GPUs. We train our networks using the Adam optimizer [42] with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The initial learning rate is set to $1\times 10^{-4}$ and is decayed exponentially by a factor of 0.99 every epoch. For data augmentation, we randomly scale the image by a factor between 1.0 and 1.5 and then randomly crop it to a spatial size of $256\times 256\times 3$. During training, we set the batch size to 8 and train our model for 200 epochs.
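
For reference, the optimizer and schedule described above can be set up as follows; the data-loader interface and the alternating update of $D$ and $G$ are left schematic.

import torch

def train(generator, discriminator, train_loader, epochs=200):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.9, 0.999))
    sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.99)   # decay by 0.99 per epoch
    sched_d = torch.optim.lr_scheduler.ExponentialLR(opt_d, gamma=0.99)

    for _ in range(epochs):
        for b, s_u, u in train_loader:        # blurry image, GT moment s(u), control factor u
            # alternate the discriminator_loss / generator_loss updates sketched above
            ...
        sched_g.step()
        sched_d.step()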

5.1.3 Evaluation Metrics

For quantitative evaluation, we measure PSNR and SSIM [79], which are traditionally used for image quality assessment. We also report two learning-based perceptual quality metrics, FID [29] and LPIPS [89]. Moreover, we employ the ArcFace [14] model to measure the facial identity distance between the ground truth (GT) and the resulting image, as in [77].

Table 2: Quantitative comparison of single-to-single general deblurring methods. The best and the second best results are highlighted in bold and underline, respectively.

Methods           PSNR (↑)   SSIM (↑)   LPIPS (↓)   FID (↓)    ArcFace (↓)
Nah et al. [18]   31.4144    0.9232     0.0935      13.9722    1.1250
SRN [74]          32.1485    0.9249     0.0930      11.6292    1.1488
DMPHN [84]        33.1797    0.9284     0.0847      13.0407    1.0338
DMPHN*            33.8182    0.9345     0.0916      14.3071    1.0126
MIMO [10]         34.0372    0.9350     0.0795      7.8606     1.0205
MIMO*             34.8496    0.9401     0.0794      7.2459     0.9918
CFMD-GAN          34.2684    0.9362     0.0697      5.1448     0.9338

[Figure 7 image panels: (f) Input, (g) DMPHN* [84], (h) MIMO* [10], (i) CFMD-GAN (ours), (j) Ground truth]
Figure 7: Qualitative comparisons of single-to-single general deblurring methods. Zoom in for the best view.

5.2 Comparisons with the state-of-the-arts

To the best of our knowledge, the proposed method is the first attempt at single-to-video face deblurring. Hence, we conduct extensive and faithful comparisons with state-of-the-art methods in single image deblurring. Specifically, the proposed CFMD-GAN is compared with single-to-single (s2s) general deblurring (i.e., Nah et al. [18], SRN [74], DMPHN [84], MIMO [10]), s2s face deblurring (i.e., Shen et al. [67], UMSN [81], MSPL [46]), and single-to-video (s2v) general deblurring (i.e., Jin et al. [34]). To facilitate fair comparisons, we retrain the existing methods using the same training dataset used in this study. The retrained models are marked with asterisks (*). All experiments are performed using the official codes provided by the authors.

5.2.1 Single-to-Single General Deblurring

In this comparison, we evaluate the performance of center-frame prediction, as most s2s general methods are proposed to restore the center frame. For the proposed method, the control factor is set to $u=0.5$ to obtain the center-frame results. Table 2 reports the comparisons of s2s general deblurring methods. Despite the significant improvements in the performance of the retrained DMPHN* and MIMO* compared to the original DMPHN and MIMO, our CFMD-GAN shows the best results in LPIPS, FID, and ArcFace distance, and the second best in PSNR and SSIM.
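
A usage sketch for this setting: the center frame corresponds to $u=0.5$, and an arbitrary number of moments is obtained from the same trained generator by sweeping $u$ over $[0,1]$ (the generator interface assumed here matches the earlier sketches).

import torch

def restore_moments(generator, blur, num_frames=7):
    """blur: (B, 3, H, W) blurry face images; returns a list of restored moments."""
    generator.eval()
    frames = []
    with torch.no_grad():
        for u in torch.linspace(0.0, 1.0, steps=num_frames):
            u_batch = torch.full((blur.size(0),), float(u), device=blur.device)  # one u per image
            frames.append(generator(blur, u_batch))
    return frames                                             # num_frames tensors of shape (B, 3, H, W)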

As investigated in recent GAN-based restoration studies [45, 2, 78, 9, 35, 77, 24], PSNR and SSIM may be lower because the GAN-based model tends to generate fake yet realistic details and textures [24]. This effect of GANs can be clearly observed in the visual comparisons in Fig. 7. Compared with other methods, the proposed CFMD-GAN restores more realistic textures and finer details of facial components, such as the eyes, nose, and eyelids. Based on these results, we can confirm that the proposed model can predict a more accurate center frame than the other methods.

Table 3: Quantitative comparison of single-to-single face deblurring methods. The best and the second best results are highlighted in bold and underline, respectively.

Methods            PSNR (↑)   SSIM (↑)   LPIPS (↓)   FID (↓)    ArcFace (↓)
Shen et al. [67]   23.1795    0.6873     0.2310      78.3630    2.2112
UMSN [81]          27.0050    0.8276     0.1460      39.4150    1.4908
UMSN*              30.5884    0.9140     0.0833      16.8517    1.2716
MSPL-GAN [46]      28.2286    0.8936     0.1092      27.9441    1.2947
MSPL-GAN*          34.2711    0.9359     0.0638      10.3597    1.0983
CFMD-GAN128        33.8475    0.9379     0.04910     6.5449     1.0246

[Figure 8 image panels: (g) Input, (h) Shen et al. [67], (i) UMSN* [81], (j) MSPL-GAN* [46], (k) CFMD-GAN128 (ours), (l) Ground truth]
Figure 8: Qualitative comparisons of single-to-single face deblurring methods. Zoom in for the best view.

5.2.2 Single-to-Single Face Deblurring

Most existing s2s face deblurring methods [67, 81, 46] are developed to remove spatially-uniform blurs. However, our training and test datasets contain spatially-variant blurs. Besides, their models only handle input images of size $128\times 128\times 3$. For these reasons, we downsample our dataset to $128\times 128\times 3$ and use it to retrain UMSN [81], MSPL [46], and our model (termed CFMD-GAN128). The retrained models, UMSN* and MSPL*, are trained to predict the center frame, similar to the s2s general deblurring approaches. Note that we do not retrain Shen et al. [67] because they do not release the training code.

Table 3 and Fig. 8 provide the quantitative and qualitative comparisons of the s2s face deblurring methods, respectively. In this experiment, the proposed method achieves significantly better performance on SSIM, LPIPS, FID, and ArcFace than the existing face deblurring methods. For PSNR, our method achieves the second-best result. The method of Shen et al. [67] fails to restore plausible results because it is not trained to remove spatially-variant blurs, as shown in Fig. 8. Although the retrained models (UMSN* and MSPL*) show improved performance, they are still inferior to CFMD-GAN.

Table 4: Quantitative comparison of single-to-video deblurring methods. “# of GT” indicates the number of GT frames per single blurry image, “# of pairs” is the total number of test GT frames, and “ALL” represents the entire results of the 300VW test set. Note that all the results of CFMD-GAN are measured with the same model. The best results are highlighted in bold.

Methods           # of GT   # of pairs   PSNR (↑)   SSIM (↑)   LPIPS (↓)   FID (↓)    ArcFace (↓)
Jin et al. [34]   7         18739        29.2407    0.8754     0.1471      25.6946    1.1574
CFMD-GAN          7         18739        33.1360    0.9336     0.0691      3.4238     0.9078
CFMD-GAN          5         13765        34.6556    0.9498     0.0538      2.9080     0.7548
CFMD-GAN          9         23445        32.0300    0.9192     0.0823      4.0663     1.0384
CFMD-GAN          11        27830        31.1256    0.9060     0.0939      4.7120     1.1466
CFMD-GAN          13        32409        30.3970    0.8949     0.1041      5.3640     1.2367
CFMD-GAN          ALL       116188       31.8474    0.9153     0.0857      4.0948     1.0650

[Figure 9 video panels, two examples: (f) Input, (g) GT, (h) Jin et al. [34], (i) CFMD-GAN 7 frames, (j) CFMD-GAN 51 frames]
Figure 9: Qualitative comparisons of single-to-video deblurring methods. This figure contains videos that are best viewed using Adobe Reader.

5.2.3 Single-to-Video General Deblurring

For s2v general deblurring, we compare our method with Jin et al. [34], which officially released their test model. Since this method is strictly fixed to extract seven sequential frames from a single blurry image, we compare the results only for blurry images averaged from seven sharp frames. Note that none of the s2v deblurring methods [34, 59, 86, 1] have released their training codes, and [34] is the only work that provides test code.

Table 4 reports quantitative comparisons with Jin et al. [34] and detailed results of our model according to the number of GT frames. The model of Jin et al. [34] is limited to predicting only a fixed number of frames once trained. However, it is worth noting that the proposed single model can predict various numbers of output frames without additional network changes or training processes. The visual comparisons in Fig. 9 illustrate this difference.

[Figure 10 video panels: (e) Input, (f) CFMD-GAN 11 frames; rows show examples from the REDS and Lai datasets]
Figure 10: Qualitative results of the proposed CFMD-GAN on the REDS dataset [52] (1st row) and the Lai dataset [44] (2nd and 3rd rows). This figure contains videos that are best viewed using Adobe Reader.

5.3 Analysis on CFMD-GAN

5.3.1 Evaluation on Other Test Datasets

Since our model is trained and evaluated with synthetically blurred images from the 300VW dataset [65], we verify how our model performs on other motion-blur benchmark datasets, such as REDS [52] and Lai et al. [44]. The REDS dataset is generated from 120 fps videos, synthesizing blurry frames by merging consecutive frames. The Lai dataset contains real-blur images for which GT images do not exist. We manually crop the facial regions of images in the REDS validation set and the Lai dataset.

Fig. 10 shows that our method restores satisfactory images on these recent benchmark deblurring datasets. In the 1st row of Fig. 10, we can see that our method produces not only a sharp face but also the background that was occluded by the face in the previous frame. For the real-blurred images in the 2nd row of Fig. 10, our model restores plausible results containing consecutive frames. Our framework can provide any sharp moment that a user wants from a single motion-blurred face image.

Table 5: Ablations on the proposed ContAda block. The best results are highlighted in bold.

CADC   CACA   FMR   PSNR (↑)   SSIM (↑)   LPIPS (↓)   FID (↓)   ArcFace (↓)
✓             ✓     33.5546    0.9263     0.0814      7.0030    1.0457
       ✓      ✓     34.0271    0.9279     0.0810      6.8322    1.0389
✓      ✓            33.4478    0.9201     0.0880      9.6110    1.1547
✓      ✓      ✓     34.2684    0.9362     0.0697      5.1448    0.9338

5.3.2 Ablation Study

In Table 5, we evaluate the impact of the proposed ContAda block, which consists of the control-adaptive deformable convolution (CADC) and control-adaptive channel attention (CACA) modules. With the CADC module, the proposed method can focus on the spatially important sampling points of the feature maps according to the feature map control factor. Notably, using only the CACA module improves the average PSNR by about 0.5 dB compared to using only the CADC module. This demonstrates that the channel attention plays a more important role in the proposed model. More importantly, using both CADC and CACA achieves the best results. This indicates that both spatial and channel-wise modulations are required for continuous facial motion deblurring. Furthermore, we conduct an ablation study to investigate the contribution of FMR to the network training. The 3rd row in Table 5 indicates that without FMR, the performance of the model drops drastically when it learns the original temporal order.

6 Conclusion

In this study, we introduce CFMD-GAN, a novel framework for continuous facial motion deblurring with a single network and a single training process. We apply facial motion-based reordering (FMR) to mitigate the difficulty of temporal ordering by utilizing domain-specific facial information, which ensures a stable learning process for the framework. We devise an auxiliary regressor to learn continuous motion deblurring by integrating the concept of conditional GANs into a single image deblurring framework. In addition, we propose a control-adaptive (ContAda) block that focuses on deformable locations and important channels according to the control factor. In our extensive experiments, we demonstrate that the proposed method outperforms state-of-the-art methods in facial image deblurring. The proposed framework can provide the continuous sharp moments that users want to obtain from a single motion-blurred facial image. Since the proposed method restores facial motion in the order given by FMR, it may be limited in predicting the accurate temporal order of the facial motion. However, we believe that the proposed method will serve as a basis for future studies on continuous facial motion deblurring. In addition, incorporating various facial priors is a promising direction for future research to further improve restoration quality.

References

  • [1] Dawit Mureja Argaw, Junsik Kim, Francois Rameau, Chaoning Zhang, and In So Kweon. Restoration of Video Frames from a Single Blurred Image with Motion Understanding. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Worksh. (CVPRW), June 2021.
  • [2] Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, and Lihi Zelnik-Manor. The 2018 pirm challenge on perceptual image super-resolution. In Proc. Eur. Conf. Comput. Vis. Worksh. (ECCVW), pages 0–0, Sept. 2018.
  • [3] G. Bradski. The OpenCV Library. Dr. Dobb’s J. Softw. Tools, 2000.
  • [4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Int. Conf. Learn. Represent., May 2019.
  • [5] Haoming Cai, Jingwen He, Yu Qiao, and Chao Dong. Toward Interactive Modulation for Photo-Realistic Image Restoration. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 294–303, June 2021.
  • [6] Ayan Chakrabarti. A neural approach to blind motion deblurring. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 221–235. Springer, Oct. 2016.
  • [7] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In IEEE Int. Conf. Image Process., volume 2, pages 168–172. IEEE, Nov. 1994.
  • [8] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 5659–5667, June 2017.
  • [9] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. Fsrnet: End-to-end learning face super-resolution with facial priors. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2492–2501, June 2018.
  • [10] Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Rethinking Coarse-to-Fine Approach in Single Image Deblurring. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 4641–4650, Oct. 2021.
  • [11] Grigorios G Chrysos, Paolo Favaro, and Stefanos Zafeiriou. Motion deblurring of faces. Int. J. Comput. Vis., 127(6-7):801–823, Mar. 2019.
  • [12] Grigorios G Chrysos and Stefanos Zafeiriou. Deep face deblurring. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Worksh. (CVPRW), pages 69–78, July 2017.
  • [13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 764–773, Oct. 2017.
  • [14] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 4690–4699, June 2019.
  • [15] Sadia Din, Anand Paul, and Awais Ahmad. Lightweight deep dense Demosaicking and Denoising using convolutional neural networks. Multimed. Tools. Appl., 79(45):34385–34405, Aug. 2020.
  • [16] Xin Ding, Yongwei Wang, Zuheng Xu, William J Welch, and Z. Jane Wang. CcGAN: continuous conditional generative adversarial networks for image generation. In Int. Conf. Learn. Represent., May 2021.
  • [17] Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):692–705, Apr. 2016.
  • [18] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), July 2017.
  • [19] Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian Reid, Chunhua Shen, Anton Van Den Hengel, and Qinfeng Shi. From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2319–2328, June 2017.
  • [20] Mingming Gong, Yanwu Xu, Chunyuan Li, Kun Zhang, and Kayhan Batmanghelich. Twin auxiliary classifiers gan. Adv. Neural Inf. Process. Syst. (NIPS), 32:1328, Dec. 2019.
  • [21] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, May 2016.
  • [22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Adv. Neural Inf. Process. Syst. (NIPS), pages 2672–2680, June 2014.
  • [23] Klemen Grm, Walter J Scheirer, and Vitomir Štruc. Face hallucination using cascaded super-resolution and identity priors. IEEE Trans. Image Process., 29(1):2150–2165, Oct. 2019.
  • [24] Jinjin Gu, Haoming Cai, Chao Dong, Jimmy S Ren, Yu Qiao, Shuhang Gu, and Radu Timofte. NTIRE 2021 challenge on perceptual image quality assessment. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Worksh. (CVPRW), pages 677–690, June 2021.
  • [25] Ligong Han, Martin Renqiang Min, Anastasis Stathopoulos, Yu Tian, Ruijiang Gao, Asim Kadav, and Dimitris N Metaxas. Dual projection generative adversarial networks for conditional image generation. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 14438–14447, Oct. 2021.
  • [26] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. arXiv preprint arXiv:2004.02546, Dec. 2020.
  • [27] Jingwen He, Chao Dong, and Yu Qiao. Interactive multi-dimension modulation with dynamic controllable residual learning for image restoration. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 53–68. Springer, Nov. 2020.
  • [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 770–778, June 2016.
  • [29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. (NIPS), 30, Dec. 2017.
  • [30] Michael Hirsch, Christian J Schuler, Stefan Harmeling, and Bernhard Schölkopf. Fast removal of non-uniform camera shake. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 463–470. IEEE, Jan. 2011.
  • [31] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 7132–7141, June 2018.
  • [32] Tae Hyun Kim, Byeongjoo Ahn, and Kyoung Mu Lee. Dynamic scene deblurring. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 3160–3167, Dec. 2013.
  • [33] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1125–1134, June 2017.
  • [34] Meiguang Jin, Givi Meishvili, and Paolo Favaro. Learning to extract a video sequence from a single motion-blurred image. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 6334–6342, June 2018.
  • [35] Younghyun Jo, Sejong Yang, and Seon Joo Kim. Investigating loss functions for extreme super-resolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 424–425, June 2020.
  • [36] Soo Hyun Jung, Tae Bok Lee, and Yong Seok Heo. Deep Feature Prior Guided Face Deblurring. In Proc. IEEE Winter Conf. Appl. Comput. Vis., pages 3531–3540, Feb. 2022.
  • [37] Minguk Kang and Jaesik Park. Contragan: Contrastive learning for conditional image generation. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2020.
  • [38] Minguk Kang, Woohyeon Shim, Minsu Cho, and Jaesik Park. Rebooting ACGAN: Auxiliary Classifier GANs with Stable Training. arXiv preprint arXiv:2111.01118, Nov. 2021.
  • [39] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 4401–4410, June 2019.
  • [40] Heewon Kim, Sungyong Baik, Myungsub Choi, Janghoon Choi, and Kyoung Mu Lee. Searching for Controllable Image Restoration Networks. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 14234–14243, Oct. 2021.
  • [41] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1646–1654, June 2016.
  • [42] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd Int. Conf. Learn. Represent. (ICLR), May 2015.
  • [43] Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher. Semi-supervised learning with gans: Manifold invariance with improved inference. Adv. Neural Inf. Process. Syst. (NIPS), 30, Dec. 2017.
  • [44] W. Lai, J. Huang, Z. Hu, N. Ahuja, and M. Yang. A Comparative Study for Single Image Blind Deblurring. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1701–1709, June 2016.
  • [45] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 4681–4690, June 2017.
  • [46] Tae Bok Lee, Soo Hyun Jung, and Yong Seok Heo. Progressive Semantic Face Deblurring. IEEE Access, 8:223548–223561, Oct. 2020.
  • [47] Songnan Lin, Jiawei Zhang, Jinshan Pan, Yicun Liu, Yongtian Wang, Jing Chen, and Jimmy Ren. Learning to Deblur Face Images via Sketch Synthesis. AAAI, 34:11523–11530, Apr. 2020.
  • [48] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proc. Int. Conf. Mach. Learn., volume 30, page 3. Citeseer, 2013.
  • [49] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, Nov. 2014.
  • [50] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, Feb. 2018.
  • [51] Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. In Int. Conf. Learn. Represent., Jan. 2018.
  • [52] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Worksh. (CVPRW), June 2019.
  • [53] Mehdi Noroozi, Paramanand Chandramouli, and Paolo Favaro. Motion deblurring in the wild. In German Conf. on Pattern Recognit. (GCPR), pages 65–77. Springer, Aug. 2017.
  • [54] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In Proc. Int. Conf. Mach. Learn., pages 2642–2651. PMLR, Aug. 2017.
  • [55] Jinshan Pan, Zhe Hu, Zhixun Su, and Ming-Hsuan Yang. Deblurring face images with exemplars. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 47–62. Springer, Sept. 2014.
  • [56] Dongwon Park, Dong Un Kang, Jisoo Kim, and Se Young Chun. Multi-Temporal Recurrent Neural Networks For Progressive Non-Uniform Single Image Deblurring With Incremental Temporal Training. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 327–343. Springer, Oct. 2020.
  • [57] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. (NIPS), 32:8026–8037, June 2019.
  • [58] Kuldeep Purohit and AN Rajagopalan. Region-adaptive dense network for efficient motion deblurring. In AAAI, volume 34, pages 11882–11889, Feb. 2020.
  • [59] Kuldeep Purohit, Anshul Shah, and AN Rajagopalan. Bringing alive blurred moments. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 6830–6839, June 2019.
  • [60] Wenqi Ren, Jiaolong Yang, Senyou Deng, David Wipf, Xiaochun Cao, and Xin Tong. Face Video Deblurring using 3D Facial Priors. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 9388–9397, Oct. 2019.
  • [61] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Int. Conf. Med. Image Comput. Comput. Assist. Interv. (MICCAI), pages 234–241. Springer, May 2015.
  • [62] Faisal Saeed, Muhammad Jamal Ahmed, Malik Junaid Gul, Kim Jeong Hong, Anand Paul, and Muthu Subash Kavitha. A robust approach for industrial small-object detection using an improved faster regional convolutional neural network. Sci. Rep., 11(1):1–13, Dec. 2021.
  • [63] Karshiev Sanjar, Olimov Bekhzod, Jaeil Kim, Jaesoo Kim, Anand Paul, and Jeonghong Kim. Improved U-net: fully convolutional network model for skin-lesion segmentation. Appl. Sci., 10(10):3658, May 2020.
  • [64] Edgar Schonfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 8207–8216, June 2020.
  • [65] Jie Shen, Stefanos Zafeiriou, Grigoris G Chrysos, Jean Kossaifi, Georgios Tzimiropoulos, and Maja Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 50–58, Dec. 2015.
  • [66] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 9243–9252, June 2020.
  • [67] Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, and Ming-Hsuan Yang. Deep semantic face deblurring. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 8260–8269, June 2018.
  • [68] Ziyi Shen, Wenguan Wang, Xiankai Lu, Jianbing Shen, Haibin Ling, Tingfa Xu, and Ling Shao. Human-aware motion deblurring. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 5572–5581, Oct. 2019.
  • [69] Ziyi Shen, Tingfa Xu, Jizhou Zhang, Jie Guo, and Shenwang Jiang. A multi-task approach to face deblurring. Eurasip J. Wirel. Commun. Netw., 2019(1):1–11, Jan. 2019.
  • [70] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1279–1288, Nov. 2017.
  • [71] Jian Sun, Wenfei Cao, Zongben Xu, and Jean Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 769–777, June 2015.
  • [72] Keqiang Sun, Wayne Wu, Tinghao Liu, Shuo Yang, Quan Wang, Qiang Zhou, Zuochang Ye, and Chen Qian. Fab: A robust facial landmark detection framework for motion-blurred videos. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pages 5462–5471, Oct. 2019.
  • [73] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3476–3483, June 2013.
  • [74] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 8174–8182, June 2018.
  • [75] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 5265–5274, June 2018.
  • [76] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Worksh. (CVPRW), June 2019.
  • [77] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 9168–9178, June 2021.
  • [78] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proc. Eur. Conf. Comput. Vis. Worksh. (ECCVW), Sept. 2018.
  • [79] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, Apr. 2004.
  • [80] Huiyuan Yang, Umur Ciftci, and Lijun Yin. Facial expression recognition by de-expression residue learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2168–2177, June 2018.
  • [81] Rajeev Yasarla, Federico Perazzi, and Vishal M Patel. Deblurring face images using uncertainty guided multi-stream semantic networks. IEEE Trans. Image Process., Apr. 2020.
  • [82] Yuan Yuan, Wei Su, and Dandan Ma. Efficient dynamic scene deblurring using spatially variant deconvolution network with optical flow guided training. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3555–3564, June 2020.
  • [83] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-Stage Progressive Image Restoration. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 14821–14831, June 2021.
  • [84] Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep Stacked Hierarchical Multi-Patch Network for Image Deblurring. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 5978–5986, June 2019.
  • [85] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Proc. Int. Conf. Mach. Learn., pages 7354–7363. PMLR, June 2019.
  • [86] Kaihao Zhang, Wenhan Luo, Björn Stenger, Wenqi Ren, Lin Ma, and Hongdong Li. Every Moment Matters: Detail-Aware Networks to Bring a Blurry Image Alive. In Proc. 28th ACM Int. Conf. Multimedia, pages 384–392, Oct. 2020.
  • [87] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett., 23(10):1499–1503, Aug. 2016.
  • [88] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process., 26(7):3142–3155, Feb. 2017.
  • [89] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 586–595, June 2018.
  • [90] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 286–301, Sept. 2018.
  • [91] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial expression recognition to interpersonal relation prediction. Int. J. Comput. Vis., 126(5):550–569, May 2018.
  • [92] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 592–608. Springer, Aug. 2020.
  • [93] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 9308–9316, 2019.
  • [94] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, Mar. 2020.