
DeMFI: Deep Joint Deblurring and Multi-Frame Interpolation
with Flow-Guided Attentive Correlation and Recursive Boosting

Jihyong Oh     Munchurl Kim
Korea Advanced Institute of Science and Technology
{jhoh94, mkimee}@kaist.ac.kr
Corresponding author.
Abstract

In this paper, we propose a novel joint deblurring and multi-frame interpolation (DeMFI) framework, called DeMFI-Net, which accurately converts blurry videos of lower frame rate to sharp videos of higher frame rate based on a flow-guided attentive-correlation-based feature bolstering (FAC-FB) module and recursive boosting (RB), in terms of multi-frame interpolation (MFI). DeMFI-Net jointly performs deblurring and MFI: its baseline version performs feature-flow-based warping with the FAC-FB module to obtain a sharp-interpolated frame as well as to deblur the two center-input frames. Its extended version further improves the joint task performance based on pixel-flow-based warping with GRU-based RB. Our FAC-FB module effectively gathers the distributed blurry pixel information over the blurry input frames in the feature domain to improve the overall joint performance, and it is computationally efficient since its attentive correlation is computed only at the flow-guided positions. As a result, our DeMFI-Net achieves state-of-the-art (SOTA) performance on diverse datasets with significant margins over the recent SOTA methods, for both deblurring and MFI. All source codes including pretrained DeMFI-Net are publicly available at https://github.com/JihyongOh/DeMFI.

1 Introduction

Video frame interpolation (VFI) converts a low frame rate (LFR) video to a high frame rate (HFR) one between given consecutive input frames, thereby providing a visually better motion-smoothed video which is favorably perceived by human visual systems (HVS) [24, 25]. Therefore, it is widely used for diverse applications, such as adaptive streaming [52], slow motion generation [18, 2, 30, 28, 37, 44] and space-time super resolution [22, 51, 15, 50, 53, 21, 54, 55, 9].

Figure 1: PSNR profiles of multi-frame interpolation results (×8) for blurry input frames on three diverse datasets: Adobe240, YouTube240 and GoPro (HD). Our DeMFI-Net consistently shows the best performance across all time instances.

On the other hand, motion blur is inevitably induced by either camera shake [1, 58] or object motion [33, 59] due to the accumulation of light during the exposure period [14, 16, 49] when capturing videos. Therefore, eliminating the motion blur, called deblurring, is essential for synthesizing sharp intermediate frames while increasing the temporal resolution. The discrete degradation model for blurriness is generally formulated as follows [20, 29, 45, 19, 40, 41, 13]:

\mathbf{B} := \{B_{2i}\}_{i=0,1,\dots} = \left\{ \frac{1}{2\tau+1} \sum_{j=iK-\tau}^{iK+\tau} S_j \right\}_{i=0,1,\dots},   (1)

where $S_j$, $\mathbf{B}$, $K$ and $2\tau+1$ denote the latent sharp frame at time $j$ in HFR, the observed blurry frames at LFR, the factor that reduces the HFR frame rate to LFR, and the exposure time period, respectively. However, only a few studies have addressed video frame interpolation under blur degradation, namely the joint deblurring and frame interpolation problem. To handle this problem effectively, five works [19, 40, 41, 61, 13] have shown that a joint approach is much better than cascading the two separate tasks of deblurring and VFI, which may lead to sub-optimal solutions. However, the methods [19, 40, 41, 13] simply perform center-frame interpolation (CFI) between two blurry input frames. This implies that they can only produce intermediate frames at times of a power of 2 in a recursive manner. As a result, prediction errors are accumulatively propagated to the later-interpolated frames. Also, these methods cannot produce interpolated frames at arbitrary target time instances that are not at times of a power of 2.
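To make the formation model concrete, the following minimal sketch averages $2\tau+1$ consecutive HFR sharp frames around every $K$-th frame as in Eq. 1; the function name and the skipping of centers without a full exposure window are our own assumptions for illustration, and the defaults follow the Adobe240 setting used later ($K=8$, $\tau=5$).

```python
import numpy as np

def synthesize_blurry_frames(sharp_frames, K=8, tau=5):
    """Synthesize LFR blurry frames from HFR sharp frames per Eq. 1 (a sketch).

    sharp_frames: list of HxWx3 float arrays S_j at the high frame rate.
    Each blurry frame B_{2i} averages the 2*tau+1 sharp frames centered at j = i*K.
    """
    blurry = []
    i = 0
    while i * K + tau < len(sharp_frames):
        if i * K - tau < 0:  # skip centers without a full exposure window (assumption)
            i += 1
            continue
        window = sharp_frames[i * K - tau : i * K + tau + 1]
        blurry.append(np.mean(np.stack(window, axis=0), axis=0))
        i += 1
    return blurry
```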

Figure 2: Overview of our DeMFI-Net framework.

To overcome these limitations and to improve quality in terms of multi-frame interpolation (MFI) with a temporal up-scaling factor $\times M$, we propose a novel framework for joint deblurring and multi-frame interpolation, called DeMFI-Net, which accurately generates sharp-interpolated frames at arbitrary time $t$ based on a flow-guided attentive-correlation-based feature bolstering (FAC-FB) module and recursive boosting (RB). Since using a pretrained optical flow estimator is not optimal for blurry input frames and is computationally heavy, our DeMFI-Net is designed to learn self-induced feature-flows ($f_F$) and pixel-flows ($f_P$) for warping the given blurry input frames to synthesize a sharp-interpolated frame at arbitrary time $t$, without any help of pretrained optical flow networks.

Direct estimation of flows to jointly deblur and interpolate the intermediate frame at arbitrary $t$ from the blurry input frames is a very challenging task. To handle it effectively, our DeMFI-Net divides and conquers the joint task as a two-stage problem, as shown in Fig. 2:

  • (i) The first stage (baseline version, denoted as DeMFI-Netbs) jointly performs deblurring and MFI based on feature-flow-based warping and blending (FWB) by learning $f_F$ to obtain a sharp-interpolated frame at $t \in (0,1)$ as well as to deblur the two center-input frames ($B_0$, $B_1$) at $t=0,1$ from four blurry input frames ($B_{-1}$, $B_0$, $B_1$, $B_2$); and

  • (ii) The second stage (recursive boosting, denoted as DeMFI-Netrb) further boosts the joint performance based on pixel-flow-based warping and blending (PWB) by iteratively updating $f_P$ with the help of GRU-based RB. It fully exploits the output of DeMFI-Netbs by adopting residual learning. It is trained with a recursive boosting loss that allows the number of recursive iterations to be properly regulated at inference time according to runtime or computational constraints, even after training is finished.

It should be noted that (1) the FWB of DeMFI-Netbs is a warping and blending operation in feature-domain where the resulting learned features tend to be more sharply constructed from the blurry inputs; and (2) the following PWB of DeMFI-Netrb can be precisely performed in pixel-domain for the output of DeMFI-Netbs via the residual learning to boost the performance of the joint task.

The blurry input frames implicitly contain abundant useful latent information due to the accumulation of light [14, 16, 49], as also shown in Eq. 1. Motivated by this, we propose a novel flow-guided attentive-correlation-based feature bolstering (FAC-FB) module that can effectively bolster the source feature $F_0$ (or $F_1$) by extracting useful information in the feature domain from its counterpart feature $F_1$ (or $F_0$) under the guidance of the self-induced flow $f_{01}$ (or $f_{10}$). By doing so, the pixel information distributed over the four blurry input frames can be effectively gathered into the corresponding features of the two center-input frames, which can then be utilized to restore sharp-interpolated frames and the two deblurred center-input frames.

In the performance evaluation, DeMFI-Netbs outperforms the previous SOTA methods on three benchmark datasets that include both diverse real-world scenes and larger-sized blurry videos. The final DeMFI-Netrb further pushes its capability for MFI with large margins, showing a strong generalization of our DeMFI-Net framework as shown in Fig. 1. Extensive experiments with diverse ablation studies demonstrate the effectiveness of our framework. All source codes including pretrained DeMFI-Net are publicly available at https://github.com/JihyongOh/DeMFI.

2 Related Works

2.1 Center-Frame Interpolation (CFI)

The VFI methods on CFI only interpolate a center-frame between two consecutive sharp input frames. Since the interpolation is fixedly targeted at the center time position, they tend not to rely on optical flow networks. SepConv [32] generates dynamically separable filters to handle motions efficiently. CAIN [6] employs a channel attention module to extract motion information effectively without explicit estimation of motion. FeFlow [12] adopts deformable convolution [8] in the center frame generator to replace optical flows. AdaCoF [26] handles a complex motion by introducing a warping module in a generalized form.

However, all the above methods only perform CFI for a two-times (×2) increase in frame rate, not for arbitrary time t. This tends to limit the performance when applied to MFI because they must be applied recursively after each center frame is interpolated, which causes error propagation into later-interpolated frames.

2.2 Multi-Frame Interpolation (MFI)

To effectively synthesize an intermediate frame at arbitrary time t, many VFI methods on MFI for sharp input frames adopt a flow-estimation-based warping operation. Super-SloMo [18] jointly combines occlusion maps and approximated intermediate flows to synthesize the intermediate frame. Quadratic video frame interpolation [56, 27] adopts an acceleration-aware approximation of the flows in quadratic form to better handle nonlinear motion. DAIN [2] proposes a flow projection layer to delicately approximate the flows according to depth information. SoftSplat [31] directly performs forward warping of the feature maps of input frames with learning-based softmax weights for the occluded regions. ABME [36] proposes asymmetric bilateral motion estimation based on a bilateral cost volume [35]. XVFI [44] introduces a recursive multi-scale shared structure to effectively capture large motion. However, all the above methods handle the MFI problem for sharp input frames, and may not work well for blurry input frames.

2.3 Joint Deblurring and Frame Interpolation

The previous studies on the joint deblurring and frame interpolation tasks [19, 40, 41, 61, 13] have consistently shown that the joint approaches are much better than the simple cascades of two separately pretrained networks of deblurring and VFI. TNTT [19] first extracts several clear keyframes which are then subsequently used to generate intermediate sharp frames by adopting a jointly optimized cascaded scheme. It takes an approximate recurrent approach by unfolding and distributing the extraction of the frames over multiple processing stages. BIN [40] adopts a ConvLSTM-based [43] recurrent pyramid framework to effectively propagate the temporal information over time. Its extended version with a larger model size, called PRF [41], simultaneously yields the deblurred input frames and temporally center-frame at once. ALANET [13] employs the combination of both self- and cross-attention modules to adaptively fuse features in latent spaces, thus allowing for robustness and improvement in the joint task performances.

However, all of the above four joint methods simply perform CFI for blurry input frames, so their performance is limited for MFI in the joint task. On the other hand, UTI-VFI [61] can interpolate sharp frames at arbitrary time $t$ in a two-stage manner. It first extracts deblurred key-state frames at both the start and end times of the camera exposures, and then warps them to arbitrary time $t$. However, its performance necessarily depends on the quality of the flows obtained by a pretrained optical flow network, which also increases the complexity of the overall network (+8.75M parameters).

Distinguished from all the above methods, our proposed framework elaborately learns self-induced $f_F$ and $f_P$ to effectively warp the given blurry input frames for synthesizing a sharp-interpolated frame at arbitrary time, without any pretrained optical flow network. As a result, our method not only outperforms the previous SOTA methods in structure-related metrics but also shows higher temporal consistency of visual quality for diverse datasets.

Figure 3: Overall DeMFI-Net including both baseline version and recursive boosting.

3 Proposed Method : DeMFI-Net

3.1 Design Considerations

Our network, DeMFI-Net, aims to jointly interpolate a sharp intermediate frame at arbitrary time $t$ and deblur the blurry input frames. Most of the previous SOTA methods [19, 41, 40, 13] only consider CFI ($\times 2$) and need to perform it recursively at powers of 2 for MFI ($\times M$) between two consecutive input frames. Here, it should be noted that the later-interpolated frames must be sequentially created based on their previously-interpolated frames. Therefore, errors inherently propagate into later-interpolated frames, which often have lower visual quality.

Our DeMFI-Net is designed to interpolate intermediate frames at multiple time instances without dependency among them, so the error propagation problem is avoided. That is, the multiple intermediate frames can be generated in parallel. To synthesize an intermediate frame at time $t \in (0,1)$ directly, we adopt a warping operation, widely used in VFI research [18, 56, 2, 27, 44], which interpolates frames by backward warping [17] with estimated flows from time $t$ to 0 and 1, respectively. However, direct usage of a pretrained optical flow network is not optimal for blurry frames and is computationally heavy. So our DeMFI-Net is devised to learn self-induced flows for robust warping in both the feature and pixel domains. Furthermore, to effectively handle the joint task of deblurring and interpolation, we take a divide-and-conquer approach and design DeMFI-Net in a two-stage manner: a baseline version (DeMFI-Netbs) and a recursive boosting version (DeMFI-Netrb), as shown in Fig. 2. DeMFI-Netbs first performs feature-flow-based warping and blending (FWB) to produce the deblurred input frames and a sharp-interpolated frame at the given $t$. Then the output of DeMFI-Netbs is further improved in DeMFI-Netrb by performing pixel-flow-based warping and blending (PWB). DeMFI-Netbs and DeMFI-Netrb are described in more detail in the following subsections.

3.2 DeMFI-Netbs

Fig. 3 (a) shows the architecture of DeMFI-Netbs, which first takes four consecutive blurry input frames ($B_{-1}$, $B_0$, $B_1$, $B_2$). Then a feature flow residual dense backbone (FF-RDB) module follows, which is similar to the backbone network of [41, 40] and is described in the Appendices. Its modified 133 $(=64\times 2+2\times 2+1)$ output channels are composed of $64\times 2$ channels for two feature maps ($F_0'$, $F_1'$) followed by tanh functions, $2\times 2$ channels for two bidirectional feature-domain flows ($f_{01}$, $f_{10}$), and 1 channel for an occlusion map logit ($o_{t0}$).

$t$-Alignment. To fully exploit the bidirectional flows ($f_{01}$, $f_{10}$) extracted from the four blurry inputs, the intermediate flows $f_{0t}$ (or $f_{1t}$) from time 0 (or 1) to time $t$ are linearly approximated as $f_{0t}=t\cdot f_{01}$ (or $f_{1t}=(1-t)\cdot f_{10}$). Then we apply complementary flow reversal (CFR) [44] to $f_{0t}$ and $f_{1t}$ to finally approximate $f_{t0}$ and $f_{t1}$. Finally, we obtain the $t$-aligned feature $F_t$ by applying the backward warping operation ($W_b$) [17] to the features $F_0'$, $F_1'$, followed by a blending operation with the occlusion map. This is called feature-flow-based warping and blending (FWB), which is depicted by the green box in Fig. 3 (a). The $t$-aligned feature $F_t$ is computed as follows:

F_t = \mathrm{FWB}(F_0', F_1', f_{t0}, f_{t1}, o_{t0}) = \frac{(1-t)\cdot\bar{o}_{t0}\cdot W_b(F_0', f_{t0}) + t\cdot\bar{o}_{t1}\cdot W_b(F_1', f_{t1})}{(1-t)\cdot\bar{o}_{t0} + t\cdot\bar{o}_{t1}},   (2)

where $\bar{o}_{t0}=\sigma(o_{t0})$, $\bar{o}_{t1}=1-\bar{o}_{t0}$, and $\sigma$ is the sigmoid activation function.
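For illustration, a minimal PyTorch sketch of the backward warping $W_b$ and the occlusion-weighted blending of Eq. 2 is given below; the (x, y) ordering of the flow channels and the small epsilon added to the denominator are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Backward-warp feat (B,C,H,W) with flow (B,2,H,W) given in pixels."""
    B, _, H, W = feat.shape
    gy, gx = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing='ij')
    grid_x = gx.unsqueeze(0) + flow[:, 0]          # assumed x-displacement channel
    grid_y = gy.unsqueeze(0) + flow[:, 1]          # assumed y-displacement channel
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack((2.0 * grid_x / (W - 1) - 1.0,
                        2.0 * grid_y / (H - 1) - 1.0), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)

def fwb(F0, F1, f_t0, f_t1, o_t0_logit, t):
    """Feature-flow-based warping and blending of Eq. 2 (a sketch)."""
    o_t0 = torch.sigmoid(o_t0_logit)               # \bar{o}_{t0}
    o_t1 = 1.0 - o_t0                              # \bar{o}_{t1}
    num = (1 - t) * o_t0 * backward_warp(F0, f_t0) + t * o_t1 * backward_warp(F1, f_t1)
    den = (1 - t) * o_t0 + t * o_t1 + 1e-8         # small eps for stability (assumption)
    return num / den
```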

FAC-FB Module. Since the pixel information is spread over the blurry input frames due to the accumulation of light [14, 16, 49] as in Eq. 1, we propose a novel FAC-FB module that can effectively bolster the source feature $F_0'$ (or $F_1'$) by extracting useful information in the feature domain from its counterpart feature $F_1'$ (or $F_0'$) under the guidance of the self-induced flow $f_{01}$ (or $f_{10}$). The FAC-FB module in Fig. 3 (b) first encodes the two feature maps ($F_0$, $F_1$) by passing the outputs ($F_0'$, $F_1'$) of the FF-RDB module through its five residual blocks (ResB's). The cascade ($\mathrm{ResB}^{\times 5}$) of the five ResB's is shared for $F_0'$ and $F_1'$.

After obtaining $F_0$ and $F_1$, the flow-guided attentive correlation (FAC) in Fig. 3 (b) computes the attentive correlation of $F_0$ with respect to the positions of its counterpart feature $F_1$ pointed to by the self-induced flow $f_{01}$. The FAC on $F_0$ with respect to $F_1$ guided by $f_{01}$ is calculated as:

\mathrm{FAC}_{01}(F_0, F_1, f_{01})(\mathbf{x}) = \Big[\textstyle\sum_{cw} \mathrm{Conv_1}(F_0(\mathbf{x})) \odot \mathrm{Conv_1}(F_1(\mathbf{x}+f_{01}(\mathbf{x})))\Big] \cdot \mathrm{Conv_1}(F_1(\mathbf{x}+f_{01}(\mathbf{x}))),   (3)

where $F_1(\mathbf{x}+f_{01}(\mathbf{x}))$ is computed by bilinear sampling at a feature location $\mathbf{x}$. $\odot$, $\sum_{cw}$ and $\mathrm{Conv_i}$ denote element-wise multiplication, channel-wise summation and an $i\times i$-sized convolution filter, respectively. The square bracket in Eq. 3 becomes a single-channel scaling map, which is then stretched along the channel axis and element-wise multiplied with $\mathrm{Conv_1}(F_1(\mathbf{x}+f_{01}(\mathbf{x})))$. We block backpropagation to the flows in FAC for stable learning. Finally, the FAC-FB module produces the bolstered feature $F_0^b$ for $F_0$ as:

F_0^b = w_{01}\cdot F_0 + (1-w_{01})\cdot \underbrace{\mathrm{Conv_1}(\mathrm{FAC}_{01})}_{\equiv E_0},   (4)

where $w_{01}$ is a single channel of spatially-variant learnable weights that is dynamically generated from the embedded $\mathrm{FAC}_{01}$ via $\mathrm{Conv_1}$ (denoted as $E_0$) and $F_0$ according to $w_{01}=(\sigma\circ\mathrm{Conv_3}\circ\mathrm{ReLU}\circ\mathrm{Conv_3})([E_0, F_0])$. $[\cdot]$ denotes concatenation along the channel axis. Similarly, $\mathrm{FAC}_{10}$ and $F_1^b$ can be computed for $F_1$ with respect to $F_0$ by $f_{10}$. The FAC-FB module allows DeMFI-Net to effectively gather the distributed blurry pixel information over the blurry input frames in the feature domain to improve the joint performance. The FAC is computationally efficient because its attentive correlation is computed only at the focused locations pointed to by the flows. Also, all filter weights in the FAC-FB module are shared for both $F_0'$ and $F_1'$.
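The sketch below illustrates Eqs. 3 and 4 in PyTorch under stated assumptions: the channel width, the hidden size of the weight branch producing $w_{01}$, and the bilinear-sampling helper are our own choices; only the overall structure (pointwise correlation at flow-guided positions, blocked flow gradients, and weighted blending) follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class FAC(nn.Module):
    """Flow-guided attentive correlation and feature bolstering (Eqs. 3-4), a sketch."""
    def __init__(self, ch=64):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)   # Conv1 on the source feature F0
        self.k = nn.Conv2d(ch, ch, 1)   # Conv1 on the flow-sampled counterpart F1
        self.e = nn.Conv2d(ch, ch, 1)   # Conv1 producing E_0 from FAC_01
        self.w = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())

    @staticmethod
    def _sample(feat, flow):
        # bilinear sampling of feat at x + flow(x); flow in pixels, (x, y) order assumed
        B, _, H, W = feat.shape
        gy, gx = torch.meshgrid(torch.arange(H, device=feat.device),
                                torch.arange(W, device=feat.device), indexing='ij')
        grid = torch.stack((2.0 * (gx + flow[:, 0]) / (W - 1) - 1.0,
                            2.0 * (gy + flow[:, 1]) / (H - 1) - 1.0), dim=-1)
        return Fn.grid_sample(feat, grid, align_corners=True)

    def forward(self, F0, F1, f01):
        F1_s = self._sample(F1, f01.detach())                      # F1(x + f01(x)), flow grad blocked
        corr = (self.q(F0) * self.k(F1_s)).sum(1, keepdim=True)    # channel-wise sum in Eq. 3
        E0 = self.e(corr * self.k(F1_s))                           # E_0 = Conv1(FAC_01)
        w01 = self.w(torch.cat([E0, F0], dim=1))                   # spatially-variant blending weight
        return w01 * F0 + (1 - w01) * E0                           # bolstered feature F_0^b (Eq. 4)
```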

Refine Module. After the FAC-FB module in Fig. 3 (a), $F_0^b$, $F_1^b$, $f_{t0}$, $f_{t1}$ and $o_{t0}$ are refined via the U-Net-based [39] Refine Module (RM) as $[F_0^r, F_1^r, f_{t0}^r, f_{t1}^r, o_{t0}^r] = \mathrm{RM}(\mathbf{Agg}^1) + [F_0^b, F_1^b, f_{t0}, f_{t1}, o_{t0}]$, where $\mathbf{Agg}^1$ is the aggregation of $[F_0^b, F_t, F_1^b, f_{t0}, f_{t1}, o_{t0}, f_{01}, f_{10}]$ in concatenated form. Then, we get the refined feature $F_t^r$ at time $t$ by $F_t^r = \mathrm{FWB}(F_0^r, F_1^r, f_{t0}^r, f_{t1}^r, o_{t0}^r)$, similar to Eq. 2. Here, we define a composite symbol at time $t$ as the combination of the two feature-flows and the occlusion map logit, $\mathbf{f_F} \equiv [f_{t0}^r, f_{t1}^r, o_{t0}^r]$, to be used in recursive boosting.

Decoder I ($D_1$). $D_1$ is composed of $\mathrm{ResB}^{\times 5}$ and is intentionally designed to have one function: to decode a feature $F_j$ at a time index $j$ into a sharp frame $S_j$. $D_1$ is shared for all three features ($F_0^r$, $F_t^r$, $F_1^r$). It should be noted that $D_1$ decodes $F_0^r$, $F_t^r$ and $F_1^r$ into sharp frames $S_0^r$, $S_t^r$ and $S_1^r$, respectively, to which the L1 reconstruction loss ($\mathcal{L}_{D_1}^r$) (Eq. 9) is applied. It is reminded that the architecture from the input layer to $D_1$ constitutes our baseline version, called DeMFI-Netbs. Although DeMFI-Netbs outperforms the previous SOTA methods, its extension with recursive boosting, called DeMFI-Netrb, can further improve the performance.
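As an illustration of the shared decoder, a sketch is given below; the internal layout of each ResB and the final projection layer to RGB are assumptions, since only the $\mathrm{ResB}^{\times 5}$ structure and the weight sharing across $F_0^r$, $F_t^r$, $F_1^r$ are specified.

```python
import torch.nn as nn

class ResB(nn.Module):
    """A plain residual block; the exact layer layout inside each ResB is an assumption."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class DecoderD1(nn.Module):
    """Decoder I: five ResB's followed by a projection to an RGB frame (a sketch)."""
    def __init__(self, ch=64):
        super().__init__()
        self.resb5 = nn.Sequential(*[ResB(ch) for _ in range(5)])
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)   # output-layer choice is an assumption

    def forward(self, feat):            # the same weights decode F_0^r, F_t^r and F_1^r
        return self.to_rgb(self.resb5(feat))
```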

3.3 DeMFI-Netrb

Since we have already obtained the sharp frames $S_0^r$, $S_t^r$, $S_1^r$ as the output of DeMFI-Netbs, they can be further sharpened based on learned pixel-flows through recursive boosting via residual learning. It is known that feature-flows ($\mathbf{f_F}$) and pixel-flows ($\mathbf{f_P}$) have similar characteristics [26, 12]. Therefore, the $\mathbf{f_F}$ obtained from DeMFI-Netbs is used as the initial $\mathbf{f_P}$ for recursive boosting. For this, we design a GRU [5]-based recursive boosting that progressively updates $\mathbf{f_P}$ to perform PWB for the two sharp frames at $t=0,1$ ($S_0^r$, $S_1^r$) and accordingly boosts the quality of the sharp intermediate frame at $t$ via residual learning, which has been widely adopted for effective deblurring [57, 10, 38, 34, 4]. Fig. 3 (c) shows the $i$-th recursive boosting (RB) of DeMFI-Netrb, which is composed of the Booster Module and Decoder II ($D_2$).

Booster Module. The Booster Module iteratively updates $\mathbf{f_P}$ to perform PWB for $S_0^r$, $S_1^r$ obtained from DeMFI-Netbs. It is composed of a Mixer and a GRU-based Booster (GB), and at the $i$-th recursive boosting it takes a recurrent hidden state ($F_{i-1}^{rec}$) and $\mathbf{f_P}^{i-1}$, as well as an aggregation of several components in the form of $\mathbf{Agg}^2 = [S_0^r, S_t^r, S_1^r, B_{-1}, B_0, B_1, B_2, f_{01}, f_{10}, \mathbf{f_F}]$, as input to yield two outputs: $F_i^{rec}$ and $\mathbf{\Delta}_{i-1}$, which is added to $\mathbf{f_P}^{i-1}$. Note that $\mathbf{f_P}^0 = \mathbf{f_F}$ and $\mathbf{Agg}^2$ does not depend on the $i$-th recursive boosting. The updating process is given as follows:

M_{i-1} = \mathrm{Mixer}([\mathbf{Agg}^2, \mathbf{f_P}^{i-1}])   (5)
[F_i^{rec}, \mathbf{\Delta}_{i-1}] = \mathrm{GB}([F_{i-1}^{rec}, M_{i-1}])   (6)
\mathbf{f_P}^i = \mathbf{f_P}^{i-1} + \mathbf{\Delta}_{i-1},   (7)

where the initial feature $F_0^{rec}$ is obtained as a 64-channel feature via channel reduction of the 192-channel $\mathrm{Conv_1}([F_0^r, F_t^r, F_1^r])$. More details on the Mixer and the updating process of GB are provided in the Appendices.
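A compact sketch of the update loop of Eqs. 5-7 is shown below; `mixer` and `gb` are hypothetical callables standing in for the Mixer and the GRU-based Booster, and `agg2` is the channel-wise concatenation $\mathbf{Agg}^2$.

```python
import torch

def recursive_boosting(agg2, f_P0, F0_rec, mixer, gb, n_tst=3):
    """Iteratively update the pixel-flows f_P (Eqs. 5-7); a sketch with stand-in modules."""
    f_P, F_rec, flow_per_step = f_P0, F0_rec, []
    for _ in range(n_tst):                              # N_tst recursions at inference
        M = mixer(torch.cat([agg2, f_P], dim=1))        # Eq. 5
        F_rec, delta = gb(F_rec, M)                     # Eq. 6
        f_P = f_P + delta                               # Eq. 7
        flow_per_step.append(f_P)
    return flow_per_step, F_rec
```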

Decoder II ($D_2$). $D_2$ in Fig. 3 (c) is composed of $\mathrm{ResB}^{\times 5}$. It fully exploits the abundant information of $\mathbf{Agg}_i^3 = [S_0^r, S_t^{r,i}, S_1^r, B_{-1}, B_0, B_1, B_2, f_{01}, f_{10}, \mathbf{f_F}, \mathbf{f_P}^i, F_i^{rec}]$ to finally generate the refined outputs $[S_0^i, S_t^i, S_1^i] = D_2(\mathbf{Agg}_i^3) + [S_0^r, S_t^{r,i}, S_1^r]$ via residual learning, where $S_t^{r,i} = \mathrm{PWB}(S_0^r, S_1^r, \mathbf{f_P}^i)$ is computed using only the updated $\mathbf{f_P}^i$ after the $i$-th recursive boosting, to enforce the flows to be better boosted.

Loss Functions. The total loss function $\mathcal{L}_{total}$ is given as:

\mathcal{L}_{total} = \mathcal{L}_{D_1}^r + \underbrace{\textstyle\sum_{i=1}^{N_{trn}} \mathcal{L}_{D_2}^i}_{\text{recursive boosting loss}}   (8)
\mathcal{L}_{D_1}^r = \big(\textstyle\sum_{j\in(0,t,1)} \lVert S_j^r - GT_j \rVert_1\big)/3   (9)
\mathcal{L}_{D_2}^i = \big(\textstyle\sum_{j\in(0,t,1)} \lVert S_j^i - GT_j \rVert_1\big)/3,   (10)

where $GT_j$ and $N_{trn}$ denote the ground-truth sharp frame at time $j$ and the total number of recursive boostings for training, respectively. We denote by DeMFI-Netrb($N_{trn}$, $N_{tst}$) the DeMFI-Netrb that is trained with $N_{trn}$ and tested with $N_{tst}$ recursive boostings. The second term on the right-hand side of Eq. 8 is called the recursive boosting loss. It should be noted that DeMFI-Netrb is jointly trained with the architecture of DeMFI-Netbs in an end-to-end manner using Eq. 8 without any complex learning schedule. Note that DeMFI-Netbs is trained with only Eq. 9 from scratch.
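The total loss of Eqs. 8-10 can be sketched as follows; the dictionary-based bookkeeping of the frames is an implementation assumption for illustration.

```python
import torch.nn.functional as F

def demfi_loss(S_r, S_boosted_list, GT):
    """Total loss of Eq. 8: L1 reconstruction on the D1 outputs (Eq. 9) plus the
    recursive boosting loss summed over N_trn recursions (Eq. 10); a sketch.

    S_r:            dict with keys 0, 't', 1 -> D1 outputs S_0^r, S_t^r, S_1^r
    S_boosted_list: list of N_trn dicts with the D2 outputs S_j^i of each recursion i
    GT:             dict with the sharp ground-truth frames GT_0, GT_t, GT_1
    """
    keys = [0, 't', 1]
    loss = sum(F.l1_loss(S_r[k], GT[k]) for k in keys) / 3           # Eq. 9
    for S_i in S_boosted_list:                                        # Eq. 10, summed over i
        loss = loss + sum(F.l1_loss(S_i[k], GT[k]) for k in keys) / 3
    return loss
```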

On the other hand, the design of the Booster Module was partially inspired by the work [48] and is carefully modified here for the more complex process of DeMFI: (i) due to the absence of ground truth for the pixel-flows from $t$ to 0 and 1, self-induced pixel-flows are instead learned by adopting $D_2$ (Decoder II) and the recursive boosting loss; (ii) $\mathbf{f_P}$ does not need to be learned precisely; its role is to improve the final joint performance of sharpening $S_0^r$, $S_t^r$, $S_1^r$ via PWB and $D_2$, as shown in Fig. 3 (c). So, unlike [48], we do not block any backpropagation to $\mathbf{f_P}$ at each recursive boosting, to fully focus on boosting the performance.

4 Experiment Results

4.1 Implementation Details

Training Dataset. To train our network, we use the Adobe240 dataset [45], which contains 120 videos of 1,280×720 resolution at 240 fps. We follow the blurry formation setting of [40, 41, 13] by averaging 11 consecutive frames at a stride of 8 frames over time to synthesize blurry frames captured with a long exposure, which finally generates blurry frames at 30 fps with $K=8$ and $\tau=5$ in Eq. 1. The resulting blurry frames are downsized to 640×352 as done in [40, 41, 13].

Training Strategy. Each training sample is composed of four consecutive blurry input frames ($B_{-1}$, $B_0$, $B_1$, $B_2$) and three sharp target frames ($GT_0$, $GT_t$, $GT_1$), where $t$ is randomly chosen as a multiple of $1/8$ with $0<t<1$. The filter weights of DeMFI-Net are initialized by the Xavier method [11] and the mini-batch size is set to 2. DeMFI-Net is trained for a total of 420K iterations (7,500 epochs) using the Adam optimizer [23] with the initial learning rate set to $10^{-4}$ and reduced by a factor of 2 at the 3,750-th, 6,250-th and 7,250-th epochs. The total numbers of recursive boostings are empirically set to $N_{trn}=5$ for training and $N_{tst}=3$ for testing. We construct each training sample on the fly by randomly cropping a 256×256-sized patch from the blurry and clean frames, which is randomly flipped in both spatial and temporal directions for data augmentation. Training takes about five days for DeMFI-Netbs and two weeks for DeMFI-Netrb using a single GPU with PyTorch on an NVIDIA DGX™ platform.
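A sketch of the on-the-fly sample construction described above is given below; the indexing convention of the HFR sharp frames and the use of horizontal flipping for the spatial direction are assumptions.

```python
import random
import torch

def make_training_sample(blurry, sharp_hfr, K=8, patch=256):
    """Build one training sample on the fly (a sketch of the described strategy).

    blurry:    four consecutive blurry frames [B_-1, B_0, B_1, B_2] as (3,H,W) tensors
    sharp_hfr: HFR sharp frames between B_0 and B_1, indexed so that
               sharp_hfr[0] = GT_0 and sharp_hfr[K] = GT_1 (indexing is an assumption)
    """
    k = random.randint(1, K - 1)                   # t = k/8 with 0 < t < 1
    t = k / K
    targets = [sharp_hfr[0], sharp_hfr[k], sharp_hfr[K]]   # GT_0, GT_t, GT_1

    _, H, W = blurry[0].shape                      # random 256x256 crop
    y, x = random.randint(0, H - patch), random.randint(0, W - patch)
    crop = lambda f: f[:, y:y + patch, x:x + patch]
    blurry, targets = [crop(f) for f in blurry], [crop(f) for f in targets]

    if random.random() < 0.5:                      # spatial flip (horizontal, assumed)
        blurry = [torch.flip(f, dims=[-1]) for f in blurry]
        targets = [torch.flip(f, dims=[-1]) for f in targets]
    if random.random() < 0.5:                      # temporal flip: order reversed, t -> 1 - t
        blurry, targets, t = blurry[::-1], targets[::-1], 1.0 - t
    return blurry, targets, t
```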

4.2 Comparison to Previous SOTA Methods

We mainly compare our DeMFI-Net with five previous joint SOTA methods: TNTT [19], UTI-VFI [61], BIN [40], PRF [41] (a larger-sized version of BIN) and ALANET [13], all of which adopt joint learning for deblurring and VFI. They have all reported better performance than cascades of separately trained VFI [18, 3, 2] and deblurring [47, 51] networks. It should be noted that the four methods TNTT, BIN, PRF and ALANET simply perform CFI ($\times 2$), not at arbitrary $t$ but at the center time $t=0.5$. So, they have to perform MFI ($\times 8$) recursively based on previously interpolated frames, which propagates interpolation errors into later-interpolated frames. For the experiments, we compare them in the two aspects of CFI and MFI. For MFI performance, temporal consistency is measured as the pixel-wise difference of motions in terms of tOF [7, 44] (the lower, the better) over all 7 interpolated frames and the two deblurred center frames of each blurry test sequence (scene). We also retrain UTI-VFI with the same blurry formation setting on Adobe240 for a fair comparison, denoted as UTI-VFI*.

| Method | Rt (s) | #P (M) | Deblurring PSNR | Deblurring SSIM | CFI (×2) PSNR | CFI (×2) SSIM | Average PSNR | Average SSIM |
|---|---|---|---|---|---|---|---|---|
| $B_0$, $B_1$ | - | - | 28.68 | 0.8584 | - | - | - | - |
| SloMo [18] | - | 39.6 | - | - | 27.52 | 0.8593 | - | - |
| MEMC [3] | - | 70.3 | - | - | 30.83 | 0.9128 | - | - |
| DAIN [2] | - | 24.0 | - | - | 31.03 | 0.9172 | - | - |
| SRN [47]+[18] | 0.27 | 47.7 | 29.42 | 0.8753 | 27.22 | 0.8454 | 28.32 | 0.8604 |
| SRN [47]+[3] | 0.22 | 78.4 | | | 28.25 | 0.8625 | 28.84 | 0.8689 |
| SRN [47]+[2] | 0.79 | 32.1 | | | 27.83 | 0.8562 | 28.63 | 0.8658 |
| EDVR [51]+[18] | 0.42 | 63.2 | 32.76 | 0.9335 | 27.79 | 0.8671 | 30.28 | 0.9003 |
| EDVR [51]+[3] | 0.27 | 93.9 | | | 30.22 | 0.9058 | 31.49 | 0.9197 |
| EDVR [51]+[2] | 1.13 | 47.6 | | | 30.28 | 0.9070 | 31.52 | 0.9203 |
| UTI-VFI [61] | 0.80 | 43.3 | 28.73 | 0.8656 | 29.00 | 0.8690 | 28.87 | 0.8673 |
| UTI-VFI* | 0.80 | 43.3 | 31.02 | 0.9168 | 32.67 | 0.9347 | 31.84 | 0.9258 |
| TNTT [19] | 0.25 | 10.8 | 29.40 | 0.8734 | 29.24 | 0.8754 | 29.32 | 0.8744 |
| BIN [40] | 0.28 | 4.68 | 32.67 | 0.9236 | 32.51 | 0.9280 | 32.59 | 0.9258 |
| PRF [41] | 0.76 | 11.4 | 33.33 | 0.9319 | 33.31 | 0.9372 | 33.32 | 0.9346 |
| ALANET [13] | - | - | 33.71 | 0.9329 | 32.98 | 0.9362 | 33.34 | 0.9355 |
| DeMFI-Netbs | 0.38 | 5.96 | 33.83 | 0.9377 | 33.93 | 0.9441 | 33.88 | 0.9409 |
| DeMFI-Netrb(1,1) | 0.51 | 7.41 | 34.06 | 0.9401 | 34.35 | 0.9471 | 34.21 | 0.9436 |
| DeMFI-Netrb(5,3) | 0.61 | 7.41 | 34.19 | 0.9410 | 34.49 | 0.9486 | 34.34 | 0.9448 |

RED: best performance, BLUE: second best performance. Rt: runtime on 640×352-sized frames (s); UTI-VFI*: retrained version; #P: number of parameters (M); ALANET: no source code available for testing.
Table 1: Quantitative comparisons on Adobe240fps [45] for deblurring and center-frame interpolation (×2).

Test Dataset. We use three datasets for evaluation: (i) the Adobe240 dataset [45], (ii) the YouTube240 dataset and (iii) the GoPro (HD) dataset (CC BY 4.0 license) [29], which contains large dynamic object motions and camera shakes. For YouTube240, we directly selected 60 YouTube videos of 1,280×720 resolution at 240 fps, chosen to include extreme scenes captured by diverse devices. They were then resized to 640×352 as done in [40, 41, 13]. Adobe240 contains 8 videos of 1,280×720 resolution at 240 fps and was also resized to 640×352, comprising 1,303 blurry input frames in total. On the other hand, GoPro has 11 videos with a total of 1,500 blurry input frames, but we used the original size of 1,280×720 for an extended evaluation at a larger resolution. All test datasets are also temporally downsampled to 30 fps with the blurring as in [40, 41, 13].

| Joint Method | Adobe240 [45] Deblurring PSNR/SSIM | Adobe240 MFI (×8) PSNR/SSIM | Adobe240 Average PSNR/SSIM/tOF | YouTube240 Deblurring PSNR/SSIM | YouTube240 MFI (×8) PSNR/SSIM | YouTube240 Average PSNR/SSIM/tOF | GoPro (HD) [29] Deblurring PSNR/SSIM | GoPro MFI (×8) PSNR/SSIM | GoPro Average PSNR/SSIM/tOF |
|---|---|---|---|---|---|---|---|---|---|
| UTI-VFI [61] | 28.73/0.8657 | 28.66/0.8648 | 28.67/0.8649/0.578 | 28.61/0.8891 | 28.64/0.8900 | 28.64/0.8899/0.585 | 25.66/0.8085 | 25.63/0.8148 | 25.64/0.8140/0.716 |
| UTI-VFI* | 31.02/0.9168 | 32.30/0.9292 | 32.13/0.9278/0.445 | 30.40/0.9055 | 31.76/0.9183 | 31.59/0.9167/0.517 | 28.51/0.8656 | 29.73/0.8873 | 29.58/0.8846/0.558 |
| TNTT [19] | 29.40/0.8734 | 29.45/0.8765 | 29.45/0.8761/0.559 | 29.59/0.8891 | 29.77/0.8901 | 29.75/0.8899/0.549 | 26.48/0.8085 | 26.68/0.8148 | 26.65/0.8140/0.754 |
| PRF [41] | 33.33/0.9319 | 28.99/0.8774 | 29.53/0.8842/0.882 | 32.37/0.9199 | 29.11/0.8919 | 29.52/0.8954/0.771 | 30.27/0.8866 | 25.68/0.8053 | 26.25/0.8154/1.453 |
| DeMFI-Netbs | 33.83/0.9377 | 33.79/0.9410 | 33.79/0.9406/0.473 | 32.90/0.9251 | 32.79/0.9262 | 32.80/0.9260/0.469 | 30.54/0.8935 | 30.78/0.9019 | 30.75/0.9008/0.538 |
| DeMFI-Netrb(1,1) | 34.06/0.9401 | 34.15/0.9440 | 34.14/0.9435/0.460 | 33.17/0.9266 | 33.22/0.9291 | 33.21/0.9288/0.459 | 30.63/0.8961 | 31.10/0.9073 | 31.04/0.9059/0.512 |
| DeMFI-Netrb(5,3) | 34.19/0.9410 | 34.29/0.9454 | 34.28/0.9449/0.457 | 33.31/0.9282 | 33.33/0.9300 | 33.33/0.9298/0.461 | 30.82/0.8991 | 31.25/0.9102 | 31.20/0.9088/0.500 |

Table 2: Quantitative comparisons of joint methods on the Adobe240 [45], YouTube240 and GoPro (HD) [29] datasets for deblurring and multi-frame interpolation (×8).

Quantitative Comparison. Table 1 shows quantitative performance comparisons with the previous SOTA methods, including cascades of deblurring and VFI methods, on Adobe240 in terms of deblurring and CFI (×2). Most results of the previous methods in Table 1 are taken from [40, 41, 13], except those of UTI-VFI (pretrained, newly tested), UTI-VFI* (retrained, newly tested) and DeMFI-Nets (ours). Please note that all runtimes (Rt) in Table 1 were measured on 640×352-sized frames in the setting of [40, 41] with one NVIDIA RTX™ GPU. As shown in Table 1, our proposed DeMFI-Netbs and DeMFI-Netrb clearly outperform all the previous methods with large margins in both deblurring and CFI performance, and the numbers of model parameters (#P) of our methods are the second- and third-smallest, with a smaller Rt compared to PRF. In particular, DeMFI-Netrb(5,3) outperforms ALANET by 1 dB and 0.0093 in terms of PSNR and SSIM, respectively, for the average performance of deblurring and CFI, and especially by 1.51 dB and 0.0124 on average for the center-interpolated frames, which is attributed to our warping-based framework with self-induced flows. Furthermore, even our DeMFI-Netbs is superior to all previous methods that are dedicatedly trained for CFI.

Figure 4: Visual comparisons for MFI results on YouTube240 for our and joint SOTA methods. Best viewed in zoom.

Table 2 shows quantitative comparisons of the joint methods on the three test datasets in terms of deblurring and MFI (×8). As shown in Table 2, all three versions of DeMFI-Net significantly outperform the previous joint methods, which shows a good generalization of our DeMFI-Net framework. Fig. 1 shows PSNR profiles for the MFI results (×8). As shown, the CFI methods such as TNTT and PRF tend to synthesize worse intermediate frames than the methods that interpolate at arbitrary time, such as UTI-VFI and our DeMFI-Net. This is because error propagation accumulates recursively due to the inaccurate interpolations of the CFI methods, which has also been observed in VFI for sharp input frames [44]. Although UTI-VFI can interpolate frames at arbitrary $t$ by adopting PWB combined with QVI [56], its performance inevitably depends on the quality of $f_P$ obtained by PWC-Net [46], where adopting a pretrained network brings a disadvantage in terms of both Rt and #P (+8.75M). It is worthwhile to note that our method also shows the best performance in terms of temporal consistency (tOF), thanks to the self-induced flows used in interpolating sharp frames at arbitrary time $t$.

Figure 5: Visual comparisons for MFI results on GoPro (HD) for our and joint SOTA methods. Best viewed in zoom.

Qualitative Comparison. Figs. 4 and 5 show visual comparisons of deblurring and VFI performance on the YouTube240 and GoPro datasets, respectively. As shown, the blurriness is easily visible between $B_0$ and $B_1$, which is challenging for VFI. Our DeMFI-Nets show better generalized performance for the extreme scenes (Fig. 4) and larger-sized videos (Fig. 5), also in terms of temporal consistency. Due to page limits, more visual comparisons at larger sizes are provided in the Appendices for all three test datasets. The deblurring and MFI (×8) results of all the SOTA methods are also publicly available at https://github.com/JihyongOh/DeMFI. Please note that obtaining the MFI (×8) results of the SOTA methods is laborious but worthwhile.

4.3 Ablation Studies

To analyze the effectiveness of each component in our framework, we perform ablation experiments. Table 3 shows the results of ablation experiments for FAC and RB (Fig. 3 (b)) with $N_{trn}=1$ and $N_{tst}=1$ for simplicity.

| Method | Rt (s) | #P (M) | Adobe240 [45] PSNR | Adobe240 [45] SSIM | YouTube240 PSNR | YouTube240 SSIM |
|---|---|---|---|---|---|---|
| (a) w/o RB, w/o FAC ($F_0^b=F_0$) | 0.32 | 5.87 | 33.30 | 0.9361 | 32.54 | 0.9230 |
| (b) w/o RB, $f=0$ | 0.38 | 5.96 | 33.64 | 0.9393 | 32.74 | 0.9237 |
| (c) w/o RB (DeMFI-Netbs) | 0.38 | 5.96 | 33.79 | 0.9406 | 32.80 | 0.9260 |
| (d) w/o FAC ($F_0^b=F_0$) | 0.45 | 7.32 | 33.73 | 0.9391 | 32.93 | 0.9260 |
| (e) $f=0$ | 0.51 | 7.41 | 34.08 | 0.9428 | 33.15 | 0.9279 |
| (f) DeMFI-Netrb(1,1) | 0.51 | 7.41 | 34.14 | 0.9435 | 33.21 | 0.9288 |

Table 3: Ablation experiments on RB and FAC ($F_0^b=F_0$) in terms of the total average of deblurring and MFI (×8).
Figure 6: Effect of FAC. The green boxes show blurrier patches that are more attentive in the counterpart feature based on flow-guidance to effectively bolster the source feature.

FAC. Comparing method (f) to (d) and (c) to (a) in Table 3, it is noticed that FAC can effectively improve the overall joint performance both without and with RB, at the cost of a little more runtime (+0.06 s) and a small number of additional parameters (+0.09M). Fig. 6 qualitatively shows the effect of FAC for DeMFI-Netrb(1,1) (f). Brighter positions with green boxes in the rightmost column indicate the important regions of $E_1$ after Eq. 3 and $\mathrm{Conv_1}$. The green boxes show blurrier patches that are more attended in the counterpart feature based on $f_{10}$ to complementally reinforce the source feature $F_1$. On the other hand, less focused regions such as backgrounds with less blur have relatively smaller $E$ after FAC. In summary, FAC bolsters the source feature by complementing it with the important blurry regions in the counterpart feature pointed to by the flow guidance. We also show the effectiveness of FAC without flow guidance by training with $f=0$. As shown in Table 3, this yields performance higher than without FAC but lower than with flow-guided FAC, as expected. Therefore, we conclude that FAC works very effectively under the self-induced flow guidance to bolster the center features and improve the performance of the joint task.

Recursive Boosting. Comparing method (d) to (a), (e) to (b) and (f) to (c) in Table 3, it can be seen that RB consistently yields improved final joint results. Fig. 7 shows that $\mathbf{f_F}$ and $\mathbf{f_P}$ have a similar tendency in flow characteristics. Furthermore, the $\mathbf{f_P}$ updated from $\mathbf{f_F}$ appears sharper for performing PWB in the pixel domain, which may help our divide-and-conquer approach effectively handle the joint task based on warping operations. It is noted that even our weakest variant (a) (without both RB and FAC) outperforms the second-best joint method (UTI-VFI*), as shown in Tables 2 and 3, on both Adobe240 and YouTube240.

Figure 7: Self-induced flows for both features ($\mathbf{f_F}$) and images ($\mathbf{f_P}$) at $t=7/8$ of DeMFI-Netrb(1,1) show a similar tendency. They do not have to be accurate, but help improve the final joint performance.
| $N_{trn}$ | Dataset | $N_{tst}=1$ ($R_t$=0.51) | $N_{tst}=3$ ($R_t$=0.61) | $N_{tst}=5$ ($R_t$=0.68) |
|---|---|---|---|---|
| 1 | Adobe240 [45] | 34.14/0.9435 | 28.47/0.8695 | 25.99/0.8136 |
| 1 | YouTube240 | 33.21/0.9288 | 29.01/0.8845 | 26.56/0.8406 |
| 3 | Adobe240 [45] | 34.21/0.9439 | 34.21/0.9440 | 34.16/0.9437 |
| 3 | YouTube240 | 33.27/0.9290 | 33.27/0.9291 | 33.23/0.9289 |
| 5 | Adobe240 [45] | 34.27/0.9446 | 34.28/0.9449 | 34.27/0.9448 |
| 5 | YouTube240 | 33.32/0.9296 | 33.33/0.9298 | 33.33/0.9297 |

Entries are PSNR (dB)/SSIM. RED: best performance of each row; #P = 7.41M for all variants.
Table 4: Ablation study on $N_{trn}$ and $N_{tst}$ of DeMFI-Netrb.

# of Recursive Boostings $N$. To inspect the relationship between $N_{trn}$ and $N_{tst}$ for RB, we train three variants of DeMFI-Netrb with $N_{trn}=1,3,5$ as shown in Table 4. Since the weight parameters in RB are shared across recursive boostings, all the variants have the same #P = 7.41M and each column in Table 4 has the same runtime $R_t$. The performance is generally boosted by increasing $N_{trn}$, where each recursion is driven by the recursive boosting loss that enforces the recursively updated flows $\mathbf{f_P}^i$ to better focus on synthesizing $S_t^{r,i}$ via PWB. It should be noted that the overall performance is better when $N_{tst} \leq N_{trn}$, and drops otherwise. So, we can properly regulate $N_{tst}$ according to $R_t$ or computational constraints, even after training with $N_{trn}$ is finished. That is, under the same runtime constraint of each $R_t$ column at test time, we can also select the model trained with a larger $N_{trn}$ to generate better results. On the other hand, we found that further increasing $N_{trn}$ does not bring additional benefits due to the saturated performance of DeMFI-Netrb.

5 Conclusion

We propose a novel joint deblurring and multi-frame interpolation framework, called DeMFI-Net, based on our novel flow-guided attentive-correlation-based feature bolstering (FAC-FB) module and recursive boosting (RB), which learns self-induced feature- and pixel-domain flows without any help of pretrained optical flow networks. The FAC-FB module effectively enriches the source feature by extracting attentive correlation from the counterpart feature at the positions pointed to by the self-induced flow, which finally improves the results of the joint task. Our DeMFI-Net achieves state-of-the-art performance on diverse datasets with significant margins over the previous SOTA methods for both deblurring and multi-frame interpolation (MFI).

Limitations. Extreme conditions such as tiny objects, low-light conditions and large motions make the joint task very challenging. We also provide detailed visual results of failure cases in the Appendices.

Acknowledgement. This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00419, Intelligent High Realistic Visual Processing for Smart Broadcasting Media).

References

  • [1] Yuval Bahat, Netalee Efrat, and Michal Irani. Non-uniform blind deblurring by reblurring. In ICCV, pages 3286–3294, 2017.
  • [2] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In CVPR, pages 3703–3712, 2019.
  • [3] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE transactions on pattern analysis and machine intelligence, 2019.
  • [4] Zhixiang Chi, Yang Wang, Yuanhao Yu, and Jin Tang. Test-time fast adaptation for dynamic scene deblurring via meta-auxiliary learning. In CVPR, pages 9137–9146, 2021.
  • [5] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • [6] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, pages 10663–10671, 2020.
  • [7] Mengyu Chu, Xie You, Mayer Jonas, Leal-Taixé Laura, and Thuerey Nils. Learning temporal coherence via self-supervision for gan-based video generation. ACM ToG, 39(4):75–1, 2020.
  • [8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In CVPR, pages 764–773, 2017.
  • [9] Saikat Dutta, Nisarg A Shah, and Anurag Mittal. Efficient space-time video super resolution using low-resolution flow and mask upsampling. In CVPR, pages 314–323, 2021.
  • [10] Hongyun Gao, Xin Tao, Xiaoyong Shen, and Jiaya Jia. Dynamic scene deblurring with parameter selective sharing and nested skip connections. In CVPR, pages 3848–3856, 2019.
  • [11] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
  • [12] Shurui Gui, Chaoyue Wang, Qihua Chen, and Dacheng Tao. Featureflow: Robust video interpolation via structure-to-texture generation. In CVPR, pages 14004–14013, 2020.
  • [13] Akash Gupta, Abhishek Aich, and Amit K Roy-Chowdhury. Alanet: Adaptive latent attention network for joint video deblurring and interpolation. In ACMMM, pages 256–264, 2020.
  • [14] Ankit Gupta, Neel Joshi, C Lawrence Zitnick, Michael Cohen, and Brian Curless. Single image deblurring using motion density functions. In ECCV, pages 171–184. Springer, 2010.
  • [15] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Space-time-aware multi-resolution video enhancement. In CVPR, pages 2859–2868, 2020.
  • [16] Stefan Harmeling, Hirsch Michael, and Bernhard Schölkopf. Space-variant single-image blind deconvolution for removing camera shake. NeurIPS, 23:829–837, 2010.
  • [17] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NeurIPS, pages 2017–2025, 2015.
  • [18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, pages 9000–9008, 2018.
  • [19] Meiguang Jin, Zhe Hu, and Paolo Favaro. Learning to extract flawless slow motion from blurry videos. In CVPR, pages 8112–8121, 2019.
  • [20] Meiguang Jin, Givi Meishvili, and Paolo Favaro. Learning to extract a video sequence from a single motion-blurred image. In CVPR, June 2018.
  • [21] Jaeyeon Kang, Younghyun Jo, Seoung Wug Oh, Peter Vajda, and Seon Joo Kim. Deep space-time video upsampling networks. In ECCV, pages 701–717. Springer, 2020.
  • [22] Soo Ye Kim, Jihyong Oh, and Munchurl Kim. Fisr: Deep joint frame interpolation and super-resolution with a multi-scale temporal loss. In AAAI, pages 11278–11286, 2020.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [24] Yoshihiko Kuroki, Tomohiro Nishi, Seiji Kobayashi, Hideki Oyaizu, and Shinichi Yoshimura. A psychophysical study of improvements in motion-image quality by using high frame rates. Journal of the Society for Information Display, 15(1):61–68, 2007.
  • [25] Yoshihiko Kuroki, Haruo Takahashi, Masahiro Kusakabe, and Ken-ichi Yamakoshi. Effects of motion image stimuli with normal and high frame rates on eeg power spectra: comparison with continuous motion image stimuli. Journal of the Society for Information Display, 22(4):191–198, 2014.
  • [26] Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In CVPR, pages 5316–5325, 2020.
  • [27] Yihao Liu, Liangbin Xie, Li Siyao, Wenxiu Sun, Yu Qiao, and Chao Dong. Enhanced quadratic video interpolation. In ECCV, pages 41–56. Springer, 2020.
  • [28] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In CVPR, pages 4463–4471, 2017.
  • [29] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, pages 3883–3891, 2017.
  • [30] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In CVPR, pages 1701–1710, 2018.
  • [31] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, pages 5437–5446, 2020.
  • [32] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, pages 261–270, 2017.
  • [33] Jinshan Pan, Deqing Sun, Hanspeter Pfister, and Ming-Hsuan Yang. Blind image deblurring using dark channel prior. In CVPR, pages 1628–1636, 2016.
  • [34] Dongwon Park, Dong Un Kang, Jisoo Kim, and Se Young Chun. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In ECCV, pages 327–343. Springer, 2020.
  • [35] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In ECCV, 2020.
  • [36] Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021.
  • [37] Tomer Peleg, Pablo Szekely, Doron Sabo, and Omry Sendik. Im-net for high resolution video frame interpolation. In CVPR, pages 2398–2407, 2019.
  • [38] Kuldeep Purohit and AN Rajagopalan. Region-adaptive dense network for efficient motion deblurring. In AAAI, volume 34, pages 11882–11889, 2020.
  • [39] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [40] Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, and Zhiyong Gao. Blurry video frame interpolation. In CVPR, pages 5114–5123, 2020.
  • [41] Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, and Zhiyong Gao. Video frame interpolation and enhancement via pyramid recurrent framework. IEEE Transactions on Image Processing, 30:277–292, 2020.
  • [42] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874–1883, 2016.
  • [43] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NeurIPS, 2015.
  • [44] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. Xvfi: extreme video frame interpolation. In ICCV, 2021.
  • [45] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In CVPR, pages 1279–1288, 2017.
  • [46] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In CVPR, pages 8934–8943, 2018.
  • [47] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, pages 8174–8182, 2018.
  • [48] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419. Springer, 2020.
  • [49] Jacob Telleen, Anne Sullivan, Jerry Yee, Oliver Wang, Prabath Gunawardane, Ian Collins, and James Davis. Synthetic shutter speed imaging. In Computer Graphics Forum, volume 26, pages 591–598. Wiley Online Library, 2007.
  • [50] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In CVPR, pages 3360–3369, 2020.
  • [51] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In CVPRW, pages 0–0, 2019.
  • [52] Jiyan Wu, Chau Yuen, Ngai-Man Cheung, Junliang Chen, and Chang Wen Chen. Modeling and optimization of high frame rate video transmission over wireless networks. IEEE Transactions on Wireless Communications, 15(4):2713–2726, 2015.
  • [53] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In CVPR, pages 3370–3379, 2020.
  • [54] Zeyu Xiao, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Space-time video super-resolution using temporal profiles. In ACM MM, pages 664–672, 2020.
  • [55] Gang Xu, Jun Xu, Zhen Li, Liang Wang, Xing Sun, and Ming-Ming Cheng. Temporal modulation network for controllable space-time video super-resolution. In CVPR, pages 6388–6397, 2021.
  • [56] Xiangyu Xu, Li Siyao, Wenxiu Sun, Qian Yin, and Ming-Hsuan Yang. Quadratic video interpolation. In NeurIPS, pages 1647–1656, 2019.
  • [57] Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep stacked hierarchical multi-patch network for image deblurring. In CVPR, pages 5978–5986, 2019.
  • [58] Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Wei Liu, and Hongdong Li. Adversarial spatio-temporal learning for video deblurring. IEEE Transactions on Image Processing, 28(1):291–301, 2018.
  • [59] Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Bjorn Stenger, Wei Liu, and Hongdong Li. Deblurring by realistic blurring. In CVPR, pages 2737–2746, 2020.
  • [60] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, pages 2472–2481, 2018.
  • [61] Youjian Zhang, Chaoyue Wang, and Dacheng Tao. Video frame interpolation without temporal priors. NeurIPS, 33, 2020.

Appendix A Details of Architecture for DeMFI-Net

A.1 DeMFI-Netbs

A.1.1 Feature Flow Residual Dense Backbone (FF-RDB) Module

The feature flow residual dense backbone (FF-RDB) module first takes four consecutive blurry input frames ($B_{-1}$, $B_0$, $B_1$, $B_2$). It is similar to the backbone network of [41, 40], and the number of output channels is modified to 133 $(=64\times 2+2\times 2+1)$. As shown in Fig. 8 (a), it consists of one DownShuffle layer, one UpShuffle layer [42], six convolutional layers, and twelve residual dense blocks [60], each composed of four $\mathrm{Conv_3}$'s, one $\mathrm{Conv_1}$, and four ReLU functions as in Fig. 8 (b). All the hierarchical features obtained by the residual dense blocks are concatenated for the successive network modules. The 133 output channels are composed of $64\times 2$ channels for two feature maps ($F_0'$, $F_1'$) followed by tanh activation functions, $2\times 2$ channels for two bidirectional feature-domain flows ($f_{01}$, $f_{10}$), and 1 channel for an occlusion map logit ($o_{t0}$).
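For illustration, one residual dense block with the described four Conv3-ReLU stages and a Conv1 fusion can be sketched as follows; the growth rate and the exact fusion layout are assumptions.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """One residual dense block: four Conv3-ReLU pairs with dense connections,
    fused by a Conv1 (a sketch; the growth rate is an assumption)."""
    def __init__(self, ch=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(ch + i * growth, growth, 3, padding=1) for i in range(4)])
        self.fuse = nn.Conv2d(ch + 4 * growth, ch, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))  # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))                # local residual learning
```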

Figure 8: Architecture of Feature Flow Residual Dense Backbone (FF-RDB) Module based on Residual Dense Block [60]. It is modified from [40, 41] and DownShuffle layer distributes the motion information into channel axis [40, 41].
Figure 9: Architecture of the U-Net-based Refine Module (RM). NNupsample denotes nearest-neighbor upsampling.

A.1.2 U-Net-based Refine Module (RM)

The U-Net-based [39] Refine Module (RM) takes $\mathbf{Agg}^1$ as input to refine $F_0^b$, $F_1^b$, $f_{t0}$, $f_{t1}$ and $o_{t0}$ in a residual learning manner as $[F_0^r, F_1^r, f_{t0}^r, f_{t1}^r, o_{t0}^r] = \mathrm{RM}(\mathbf{Agg}^1) + [F_0^b, F_1^b, f_{t0}, f_{t1}, o_{t0}]$, where $\mathbf{Agg}^1$ is the aggregation of $[F_0^b, F_t, F_1^b, f_{t0}, f_{t1}, o_{t0}, f_{01}, f_{10}]$ in concatenated form.

A.2 DeMFI-Netrb

A.2.1 Booster Module

The Booster Module iteratively updates $\mathbf{f_P}$ to perform PWB for $S_0^r$, $S_1^r$ obtained from DeMFI-Netbs. It is composed of a Mixer and a GRU-based Booster (GB), and at the $i$-th recursive boosting it takes a recurrent hidden state ($F_{i-1}^{rec}$) and $\mathbf{f_P}^{i-1}$, as well as an aggregation of several components in the form of $\mathbf{Agg}^2 = [S_0^r, S_t^r, S_1^r, B_{-1}, B_0, B_1, B_2, f_{01}, f_{10}, \mathbf{f_F}]$, as input to yield two outputs: $F_i^{rec}$ and $\mathbf{\Delta}_{i-1}$, which is added to $\mathbf{f_P}^{i-1}$. Note that $\mathbf{f_P}^0 = \mathbf{f_F}$ and $\mathbf{Agg}^2$ does not depend on the $i$-th recursive boosting. The updating process is given as follows:

M_{i-1} = \mathrm{Mixer}([\mathbf{Agg}^2, \mathbf{f_P}^{i-1}])   (11)
[F_i^{rec}, \mathbf{\Delta}_{i-1}] = \mathrm{GB}([F_{i-1}^{rec}, M_{i-1}])   (12)
\mathbf{f_P}^i = \mathbf{f_P}^{i-1} + \mathbf{\Delta}_{i-1},   (13)

where the initial feature $F_0^{rec}$ is obtained as a 64-channel feature via channel reduction of the 192-channel $\mathrm{Conv_1}([F_0^r, F_t^r, F_1^r])$. More details about the Mixer and the updating process of GB are described in the following subsections.

A.2.2 Mixer

The first component in the Booster Module is called the Mixer. As shown in Fig. 10, the Mixer first passes $\mathbf{Agg}^2$ and $\mathbf{f_P}^{i-1}$ through two independent sets of convolution layers, each of the form $\mathrm{Conv_7}-\mathrm{ReLU}-\mathrm{Conv_3}-\mathrm{ReLU}$, and then yields $M_{i-1}$ via $\mathrm{Conv_3}-\mathrm{ReLU}-\mathrm{Conv_3}-\mathrm{ReLU}$ applied to the concatenated outputs of the two sets. $M_{i-1}$ is subsequently used in the GRU-based Booster (GB), as described in the following subsection.
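A sketch of the Mixer under the described layer layout is given below; the channel widths and the 5-channel size of $\mathbf{f_P}$ (two flows plus the occlusion logit) are our reading of the text rather than confirmed hyperparameters.

```python
import torch
import torch.nn as nn

class Mixer(nn.Module):
    """Blends Agg^2 with the current pixel-flows f_P^{i-1} (Fig. 10), a sketch."""
    def __init__(self, agg_ch, fp_ch=5, ch=64):
        super().__init__()
        self.enc_agg = nn.Sequential(nn.Conv2d(agg_ch, ch, 7, padding=3), nn.ReLU(),
                                     nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.enc_fp = nn.Sequential(nn.Conv2d(fp_ch, ch, 7, padding=3), nn.ReLU(),
                                    nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.out = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, agg2, f_p):
        # independent Conv7-ReLU-Conv3-ReLU branches, then fusion of the concatenation
        return self.out(torch.cat([self.enc_agg(agg2), self.enc_fp(f_p)], dim=1))
```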

A.2.3 GRU-based Booster (GB)

The GRU-based Booster (GB) takes both M_{i-1} and F_{i-1}^{rec} as inputs to produce an updated F_{i}^{rec}, which is subsequently used to make \mathbf{\Delta}_{i-1} that is added to \mathbf{f_{P}}^{i-1}. GB adopts a gated activation unit based on the GRU cell [5], replacing the fully connected layers with two separable convolutions of 1\times 5 (\mathrm{Conv_{1\times 5}}) and 5\times 1 (\mathrm{Conv_{5\times 1}}) as in [48] to efficiently increase the receptive field. The detailed process in GB operates as follows:

z_{i}^{1\times 5}=\sigma(\mathrm{Conv_{1\times 5}}([F_{i-1}^{rec},M_{i-1}]))   (14)
r_{i}^{1\times 5}=\sigma(\mathrm{Conv_{1\times 5}}([F_{i-1}^{rec},M_{i-1}]))   (15)
\hat{F}_{i}^{rec,1\times 5}=\mathrm{tanh}(\mathrm{Conv_{1\times 5}}([r_{i}^{1\times 5}\odot F_{i-1}^{rec},M_{i-1}]))   (16)
F_{i}^{rec,1\times 5}=(1-z_{i}^{1\times 5})\odot F_{i-1}^{rec}+z_{i}^{1\times 5}\odot\hat{F}_{i}^{rec,1\times 5}   (17)
z_{i}^{5\times 1}=\sigma(\mathrm{Conv_{5\times 1}}([F_{i}^{rec,1\times 5},M_{i-1}]))   (18)
r_{i}^{5\times 1}=\sigma(\mathrm{Conv_{5\times 1}}([F_{i}^{rec,1\times 5},M_{i-1}]))   (19)
\hat{F}_{i}^{rec,5\times 1}=\mathrm{tanh}(\mathrm{Conv_{5\times 1}}([r_{i}^{5\times 1}\odot F_{i}^{rec,1\times 5},M_{i-1}]))   (20)
F_{i}^{rec}=(1-z_{i}^{5\times 1})\odot F_{i}^{rec,1\times 5}+z_{i}^{5\times 1}\odot\hat{F}_{i}^{rec,5\times 1}   (21)
\mathbf{\Delta}_{i-1}=(\mathrm{Conv_{3}}\circ\mathrm{ReLU}\circ\mathrm{Conv_{3}})(F_{i}^{rec}).   (22)

Please note that Eqs. 21 and 22 produce the final outputs (F_{i}^{rec}, \mathbf{\Delta}_{i-1}) of the Booster Module, as shown in Fig. 3 (c) in the main paper, indicated by blue arrows.
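For concreteness, a minimal PyTorch-style sketch of such a separable-convolution GRU booster is given below. The class name SepConvGRUBooster, the hidden width, and the channel count of \mathbf{\Delta}_{i-1} are assumptions for illustration, not the exact implementation.

import torch
import torch.nn as nn

class SepConvGRUBooster(nn.Module):
    # 1x5 / 5x1 separable-convolution GRU gates (as in [48]) followed by a
    # Conv3-ReLU-Conv3 head that predicts Delta.
    def __init__(self, hidden=64, c_in=64, delta_ch=4):
        super().__init__()
        c = hidden + c_in
        self.z1 = nn.Conv2d(c, hidden, (1, 5), padding=(0, 2))
        self.r1 = nn.Conv2d(c, hidden, (1, 5), padding=(0, 2))
        self.h1 = nn.Conv2d(c, hidden, (1, 5), padding=(0, 2))
        self.z2 = nn.Conv2d(c, hidden, (5, 1), padding=(2, 0))
        self.r2 = nn.Conv2d(c, hidden, (5, 1), padding=(2, 0))
        self.h2 = nn.Conv2d(c, hidden, (5, 1), padding=(2, 0))
        self.head = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, delta_ch, 3, padding=1))  # Eq. (22); delta_ch is assumed

    def _gru(self, h, m, z_conv, r_conv, h_conv):
        x = torch.cat([h, m], dim=1)
        z = torch.sigmoid(z_conv(x))                               # update gate, Eqs. (14)/(18)
        r = torch.sigmoid(r_conv(x))                               # reset gate, Eqs. (15)/(19)
        h_hat = torch.tanh(h_conv(torch.cat([r * h, m], dim=1)))   # candidate, Eqs. (16)/(20)
        return (1 - z) * h + z * h_hat                             # Eqs. (17)/(21)

    def forward(self, F_rec_prev, M):
        h = self._gru(F_rec_prev, M, self.z1, self.r1, self.h1)    # horizontal (1x5) pass
        F_rec = self._gru(h, M, self.z2, self.r2, self.h2)         # vertical (5x1) pass
        return F_rec, self.head(F_rec)                             # (F_i^rec, Delta_{i-1})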

Figure 10: Architecture of the Mixer in the Booster Module. It is designed to blend the two sources of information, \mathbf{Agg}^{2} and \mathbf{f_{P}}^{i-1}.

Appendix B Additional Qualitative Comparison Results

Figs. 11, 12, 13, 14 and 15 show extensive visual comparisons of the deblurring and MFI (\times 8) performance on all three test datasets. For better visibility, we generally show cropped patches for each scene. Since the number of blurry input frames differs for each method, the average of the two blurry center-input frames (B_{0}, B_{1}) is shown in the figures. As can be seen, severe blurriness is clearly visible between the two center-input frames (B_{0}, B_{1}), which is very challenging for VFI.

Our DeMFI-Nets, especially DeMFI-Netrb, better synthesize textures or patterns (1st/2nd scenes of Fig. 11, Fig. 12, 1st scene of Fig. 15), precisely generate thin poles (3rd scene of Fig. 11) and fast-moving objects (2nd/3rd scenes of Fig. 14), and effectively capture letters (Fig. 12, Fig. 13, 1st scene of Fig. 14, 2nd/3rd/4th scenes of Fig. 15), on which all the previous methods tend to fail.

In particular, CFI methods such as TNTT and PRF have more difficulty interpolating sharp frames at time indices 2/8 and 6/8 than at 4/8 (the center time instance) within each scene, because they can only produce intermediate frames at times that are powers of 2 in a recursive manner; as a result, the prediction errors are accumulatively propagated to the later interpolated frames, as illustrated by the sketch below. On the other hand, our DeMFI-Net framework adopts a self-induced flow-based warping methodology trained in an end-to-end manner, which leads to temporally consistent sharp intermediate frames generated from blurry input frames. The deblurring and MFI (\times 8) results of all the SOTA methods are also publicly available at https://github.com/JihyongOh/DeMFI for easier comparison. Please note that obtaining the MFI (\times 8) results for the SOTA methods is laborious but worthwhile.
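The following illustrative sketch (not the code of any compared method; cfi stands for a generic center-frame interpolator) makes the recursion explicit: the frames at t=2/8 and t=6/8 are computed from the already-interpolated t=4/8 frame, so its errors propagate to them and to the later levels.

def recursive_cfi_x8(cfi, s0, s1):
    # x8 interpolation composed only of center-frame interpolation calls.
    s4 = cfi(s0, s1)                      # t = 4/8
    s2 = cfi(s0, s4)                      # t = 2/8 (depends on s4)
    s6 = cfi(s4, s1)                      # t = 6/8 (depends on s4)
    s1_, s3 = cfi(s0, s2), cfi(s2, s4)    # t = 1/8, 3/8 (depend on s2, s4)
    s5, s7 = cfi(s4, s6), cfi(s6, s1)     # t = 5/8, 7/8 (depend on s4, s6)
    return [s1_, s2, s3, s4, s5, s6, s7]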

Appendix C Limitations: Failure Cases

Fig. 16 shows failure cases such as tiny objects (1st scene), a low-light condition (2nd scene) and large motion (3rd scene), which make the joint task very challenging. First, in the case of splashed tiny objects with blurriness, it is very hard to capture their intricate motions from the afterimages of the objects, so all the methods fail to delicately synthesize frames close to the GT. Second, in the low-light condition, it is hard to distinguish the boundaries of the objects (green arrows) and to detect tiny objects such as fast-falling coffee beans (dotted green line), which deteriorates the overall performance of all the methods. Lastly, large and complex motion with blurriness due to camera shaking also makes it hard for all the methods to precisely synthesize the final frames. We hope these kinds of failure cases will motivate researchers toward further challenging studies.

Appendix D Visual Comparison with Video

We provide a visual comparison video for TNTT [19], UTI-VFI* (retrained ver.) [61], PRF [41] (a larger-sized version of [40]) and DeMFI-Netrb (5,3) (ours), all of which adopt joint learning for deblurring and VFI. The video at https://www.youtube.com/will-be-updated shows several multi-frame interpolated (\times 8) results played at 30fps for slow motion, synthesized from blurry input frames of 30fps. All the results of the methods are resized so that they can be played simultaneously on a single screen. Please take into account that the YouTube240 test dataset contains extreme motion with blurriness.

TNTT generally synthesizes blurry visual results, and PRF tends to show temporal inconsistency for MFI (\times 8). These two joint methods simply perform CFI, not interpolation at an arbitrary time t. Therefore, they must be applied recursively after each center frame is interpolated for MFI, which causes error propagation into the later-interpolated frames. Although UTI-VFI* shows better visual results than the above two CFI joint methods, it tends to produce some artifacts, especially for large motion with blurriness and tiny objects such as splashes of water. This tendency is attributed to error accumulation caused by its dependency on the quality of f_{P}, which is inevitably obtained from the pretrained PWC-Net [46]; adopting a pretrained network also brings a disadvantage in terms of both Rt and #P (+8.75M). On the other hand, our DeMFI-Net framework is based on self-induced feature- and pixel-domain flows without any help from pretrained optical flow networks, which finally leads to better-interpolated sharp frames.

Figure 11: Visual comparisons for MFI results on Adobe240. Best viewed in zoom.
Figure 12: Visual comparisons for MFI results on Adobe240. Best viewed in zoom.
Figure 13: Visual comparisons for MFI results on GoPro (HD). Best viewed in zoom.
Figure 14: Visual comparisons for MFI results on YouTube240. Best viewed in zoom.
Figure 15: Visual comparisons for MFI results on GoPro (HD). Best viewed in zoom.
Figure 16: Failure cases; tiny objects, low-light condition and large motion. Best viewed in zoom.