
Unifying Motion Deblurring and Frame Interpolation with Events

Xiang Zhang, Lei Yu (corresponding author)
Wuhan University, Wuhan, China.
{xiangz, ly.wd}@whu.edu.cn
Abstract

Slow shutter speed and long exposure time of frame-based cameras often cause visual blur and loss of inter-frame information, degenerating the overall quality of captured videos. To this end, we present a unified framework of event-based motion deblurring and frame interpolation for blurry video enhancement, where the extremely low latency of events is leveraged to alleviate motion blur and facilitate intermediate frame prediction. Specifically, the mapping relation between blurry frames and sharp latent images is first predicted by a learnable double integral network, and a fusion network is then proposed to refine the coarse results via utilizing the information from consecutive blurry inputs and the concurrent events. By exploring the mutual constraints among blurry frames, latent images, and event streams, we further propose a self-supervised learning framework to enable network training with real-world blurry videos and events. Extensive experiments demonstrate that our method compares favorably against the state-of-the-art approaches and achieves remarkable performance on both synthetic and real-world datasets. Codes are available at https://github.com/XiangZ-0/EVDI.

1 Introduction

The research was partially supported by the National Natural Science Foundation of China under Grant 61871297, the Natural Science Foundation of Hubei Province, China under Grant 2021CFB467, the Fundamental Research Funds for the Central Universities under Grant 2042020kf0019, and the National Natural Science Foundation of China Enterprise Innovation Development Key Project under Grant U19B2004.

Highly dynamic scenes, e.g., fast-moving targets or non-linear motions, pose challenges for high-quality video generation as the captured frame is often blurred and target information is missing between consecutive frames [29]. Existing frame-based methods attempt to tackle these problems by developing motion deblurring [11], frame interpolation [1] or blurry video enhancement techniques [10, 25]. However, it is difficult for frame-based deblurring methods to predict sharp latent frames from severely blurred videos because of motion ambiguities and the erasure of intensity textures [11]. Besides, current frame-based interpolation approaches generally assume the motion between neighboring frames to be linear [1], which is not always valid in real-world scenarios especially when encountering non-linear motions, thus often leading to incorrect predictions.

Figure 1: Illustrative examples of video deblurring and interpolation via the state-of-the-art deblurring approach LEVS [11], interpolation approach Time Lens [30] and our EVDI method.

Recent works have revealed the advantages of event cameras [5] in motion deblurring and frame interpolation. On one hand, the output of an event camera inherently embeds precise motion and sharp edges [2] since it reports asynchronous event data with extremely low latency (on the order of $\mu s$) [13, 5], which is effective in alleviating motion blur [22, 21, 14, 31, 34]. On the other hand, an event camera is able to record almost continuous brightness changes to compensate for the missing information between consecutive frames, making it feasible to recover accurate intermediate frames even under non-linear motions [14, 30]. However, existing works generally treat motion deblurring and frame interpolation as separate tasks, while motion blur and missing inter-frame information strongly co-occur in real scenes and thus need to be considered simultaneously. In real-world scenarios, the aforementioned methods face two main challenges.

  • Limitations of Separate Tasks: The performance of interpolation methods [30] is often highly dependent on the quality of the reference frames, and it is difficult to interpolate clear results when the reference frames are degraded by motion blur. For the deblurring task, most methods [31, 34] focus on recovering sharp images inside the exposure time of the blurry inputs, neglecting the latent images between blurry frames (see Fig. 1).

  • Data Inconsistency: Most previous works employ well-labeled synthetic datasets for supervised [31, 30] or semi-supervised learning [34], which often causes a performance drop in real scenes due to the inconsistency between synthetic and real-world data [34].

In this paper, we present a unified framework of Event-based Video Deblurring and Interpolation (EVDI) for blurry video enhancement. The proposed method consists of two main modules: a learnable double integral (LDI) network and a fusion network. The LDI network is designed to automatically predict the mapping relation between blurry frames and sharp latent images from the corresponding events, where the timestamp of the latent image can be chosen arbitrarily inside the exposure time of the blurry frames (deblurring task) or between consecutive blurry frames (interpolation task). The fusion network receives the coarse reconstructions of the latent images and generates a refined result by utilizing all the information from the consecutive blurry frames and the event streams. For training, we take advantage of the mutual constraints among blurry frames, sharp latent images and event streams, and propose a fully self-supervised learning framework to help the network fit the distribution of real-world data without the need for ground-truth images.

The main contributions of this paper are three-fold:

  • We present a unified framework of event-based video deblurring and interpolation that generates arbitrarily high frame-rate sharp videos from blurry inputs.

  • We propose a fully self-supervised framework to enable network training in real-world scenarios without any labeled data.

  • Experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art results while maintaining an efficient network design.

2 Related Work

2.1 Frame Interpolation

Existing frame-based interpolation methods can be roughly categorized into two classes: warping-based and kernel-based approaches. Warping-based approaches [9, 18, 35, 1] generally combine optical flow [8, 28] with image warping to predict intermediate frames, and several techniques have been proposed to enhance the interpolation performance, e.g., forward warping [18], spatial transformer networks [35], and depth information [1]. However, these methods often assume linear motion and brightness constancy between two reference frames, and thus fail to handle arbitrary motions. Rather than warping reference frames with optical flow, kernel-based methods [19, 20] model frame interpolation as local convolution on the reference frames, where the kernel is directly estimated from the input frames. Although kernel-based methods are more robust to complex motions and brightness changes, their scalability is often limited by the fixed sizes of the convolution kernels.

The common challenge of frame-based interpolation is the missing information between reference frames, which can be alleviated by leveraging the extremely low latency of events. A recent approach [30] combines the merits of frames and events and achieves excellent interpolation results even under non-linear motions, but its performance is also closely tied to the quality of the reference frames, and thus it cannot be directly applied to blurry video enhancement.

2.2 Motion Deblurring

One of the most popular frame-based deblurring strategies is to employ neural networks to learn the blur features and predict sharp images from blurry inputs [11, 7, 37, 17]. Several techniques have been developed to exploit the temporal information inside blurry frames, including a dynamic temporal blending mechanism [7], spatio-temporal filter adaptive networks [37], and intra-frame iterations [17]. Recent works have also revealed the potential of events in motion deblurring. Event streams inherently embed motion information and sharp edges, which can be exploited to tackle the temporal ambiguity and texture erasure caused by motion blur. Pioneering event-based methods achieve motion deblurring by relating blurry frames, sharp latent images and the corresponding events according to the physical event generation model [22, 21], but their performance is often degraded by the imperfection of physical circuits, e.g., intrinsic camera noise. To alleviate this, learning-based approaches [31, 34] have been proposed to fit the distribution of event data, achieving better deblurring performance.

However, most deblurring methods focus solely on restoring sharp latent images inside the exposure time of blurry frames, while the information between blurry frames is also important in practical applications, motivating the combination of deblurring and interpolation.

2.3 Joint Deblurring and Interpolation

Previous frame-based methods have approached the joint deblurring and interpolation task [10, 25]. The work of [10] performs frame interpolation based on keyframes pre-processed by a deblurring module, while the work of [25] treats deblurring and interpolation as a unified task and achieves better enhancement performance. Among event-based methods, LEDVDI [14] is the closest related work, but it falls into the cascaded scheme as it performs deblurring and interpolation in separate stages. Besides, all the aforementioned methods require supervised training on synthetic datasets, limiting their performance in real-world scenarios due to data inconsistency.

Exploiting the information from both frames and events, our method achieves blurry video enhancement without distinguishing the deblurring and interpolation tasks. Moreover, a self-supervised learning framework is proposed to enable network training with real events and blurry videos, guaranteeing the performance in real-world scenarios.

3 Problem Statement

Capturing videos of highly dynamic scenes often suffers from blurry artifacts, and Blurry Video Enhancement (BVE) therefore plays an important role for visual perception. Existing frame-based methods often struggle to achieve BVE due to motion ambiguity and the loss of inter-frame information, while this can be effectively mitigated with the aid of events. Given two consecutive blurry frames $B_{i},B_{i+1}$ captured within the exposure times $\mathcal{T}_{i},\mathcal{T}_{i+1}$ and the corresponding event stream $\mathcal{E}_{i+1}^{i}$ triggered inside $\mathcal{T}_{i+1}^{i}$, where $\mathcal{T}_{i+1}^{i}\triangleq\mathcal{T}_{i}\cup\mathcal{T}_{i\rightarrow i+1}\cup\mathcal{T}_{i+1}$ with $\mathcal{T}_{i\rightarrow i+1}$ indicating the time interval between $B_{i}$ and $B_{i+1}$, the task of EVDI is to achieve BVE directly from the blurry inputs, i.e.,

L(t)=\operatorname{EVDI}(t;B_{i},B_{i+1},\mathcal{E}_{i+1}^{i}),\quad\forall t\in\mathcal{T}_{i+1}^{i},   (1)

where $L(t)$ indicates the latent image at arbitrary time $t\in\mathcal{T}_{i+1}^{i}$. According to Eq. (1), EVDI degrades to Motion Deblurring (MD) when $t\in\mathcal{T}_{i}$ or $\mathcal{T}_{i+1}$, and to Frame Interpolation (FI) when $t\in\mathcal{T}_{i\rightarrow i+1}$. Thus, EVDI is more general than MD and FI, and provides a unified formulation of the BVE task.

EVDI vs. Frame Interpolation. The conventional FI task aims at recovering the intermediate latent images $\{L(t)\}_{t\in\mathcal{T}_{i\rightarrow i+1}}$ from sharp reference frames $I_{i},\ I_{i+1}$. Provided the concurrent event stream $\mathcal{E}_{i\rightarrow i+1}$ emitted within $\mathcal{T}_{i\rightarrow i+1}$, we have

L(t)=\operatorname{Interp}(t;I_{i},I_{i+1},\mathcal{E}_{i\rightarrow i+1}),\quad t\in\mathcal{T}_{i\rightarrow i+1},   (2)

where $\operatorname{Interp}(\cdot)$ represents an FI operator. Most FI methods [9, 1, 30] are designed to restore inter-frame latent images from high-quality (sharp and clear) reference frames $I_{i},\ I_{i+1}$, while EVDI directly accepts blurry inputs, which is more challenging than conventional FI.

EVDI vs. Motion Deblurring. MD aims at reconstructing the sharp latent images $\{L(t)\}_{t\in\mathcal{T}_{i}}$ from the corresponding blurry frame $B_{i}$. Provided the concurrent event stream $\mathcal{E}_{i}$ triggered within $\mathcal{T}_{i}$, we have

L(t)=\operatorname{Deblur}(t;B_{i},\mathcal{E}_{i}),\quad t\in\mathcal{T}_{i},   (3)

where $\operatorname{Deblur}(\cdot)$ indicates an MD operator. Existing deblurring methods [4, 27] mainly focus on recovering the latent frames inside the exposure time $\mathcal{T}_{i}$, while EVDI is able to predict latent images both inside the exposure time $\mathcal{T}_{i}$ (or $\mathcal{T}_{i+1}$) and between blurry frames, i.e., within $\mathcal{T}_{i\rightarrow i+1}$, as shown in Fig. 1.

Ideally, EVDI can approach the BVE task by unifying MD and FI as in Eq. (1). However, to efficiently realize EVDI in real-world scenarios, challenges still exist.

  • MD and FI should be simultaneously addressed in a unified framework to fulfill the EVDI. Previous attempts for BVE [10, 14] employ a cascaded scheme that performs frame interpolation after deblurring, but this approach often propagates deblurring error to the interpolation stage, leading to sub-optimal results.

  • Existing related methods are generally developed within a supervised learning framework [14, 10, 25], where the supervision is usually provided by synthetic blurry images and events. Thus, the performance might degrade in real scenes due to the distribution gap between synthetic and real-world data.

4 Method

In this work, we propose to approximate the optimal EVDI model with trainable neural networks, and develop a self-supervised learning framework by exploiting the mutual constraints among blurry frames, sharp latent frames and event streams.

4.1 Unified Deblurring and Interpolation

We first review the physical generation model of events, which are triggered whenever the log-scale brightness change exceeds the event threshold $c>0$, i.e.,

\operatorname{log}(L(t,\mathbf{x}))-\operatorname{log}(L(\tau,\mathbf{x}))=p\cdot c,   (4)

where $L(t,\mathbf{x})$ and $L(\tau,\mathbf{x})$ denote the instantaneous intensity at times $t$ and $\tau$ at the pixel position $\mathbf{x}$, and the polarity $p\in\{+1,-1\}$ indicates the direction of the brightness change. With the aid of events, we can formulate the following relation (the pixel position is omitted for readability):

L(t)=L(f)\operatorname{exp}\left(c\int_{f}^{t}e(s)ds\right),   (5)

where $L(t)$ and $L(f)$ are the latent images at time instants $t$ and $f$, and $e(t)\triangleq p\cdot\delta(t-\tau)$ denotes the continuous representation of events with $\delta(\cdot)$ indicating the Dirac function. On the other hand, a blurry image can be formulated as the average of the latent images within the exposure time [3], i.e.,

B=\frac{1}{T}\int_{t\in\mathcal{T}}L(t)dt   (6)

with $T$ denoting the duration of the exposure period $\mathcal{T}$. Combining Eq. (5) and Eq. (6), one can obtain

L(f)=\frac{B}{E(f,\mathcal{T})},\quad\text{with}   (7)
E(f,\mathcal{T})=\frac{1}{T}\int_{t\in\mathcal{T}}\operatorname{exp}\left(c\int_{f}^{t}e(s)ds\right)dt   (8)

representing the relation between the blurry frame $B$ and the latent image $L(f)$ from the perspective of events, which is also known as the event-based double integral (EDI) [22].
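As a concrete illustration, the double integral of Eq. (8) can be evaluated numerically from a discrete per-pixel event list. The sketch below is a minimal reference implementation assuming a known threshold c and events given as (timestamp, polarity) pairs; it is not the learned LDI network introduced later.

```python
import numpy as np

def edi_double_integral(events, f, t_start, T, c, n_steps=1000):
    """Numerically evaluate E(f, T) of Eq. (8) for a single pixel.

    events  : iterable of (timestamp, polarity) pairs at this pixel
    f       : timestamp of the latent image L(f)
    t_start : starting time t_s of the exposure interval
    T       : exposure duration
    c       : event threshold (assumed known in this sketch)
    """
    ts = np.array([t for t, _ in events])
    ps = np.array([p for _, p in events], dtype=np.float64)
    acc = 0.0
    for t in np.linspace(t_start, t_start + T, n_steps):
        # inner integral of e(s) from f to t: signed sum of events in between
        if t >= f:
            inner = ps[(ts >= f) & (ts < t)].sum()
        else:
            inner = -ps[(ts >= t) & (ts < f)].sum()
        acc += np.exp(c * inner)
    return acc / n_steps  # time average over the exposure period

# Coarse deblurring then follows Eq. (7): L(f) ~ B / E(f, T) per pixel.
```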

4.1.1 Feasibility Analysis

Previous works [22, 21] focus on exploiting Eq. (7) to restore sharp latent images inside the exposure period $\mathcal{T}$, while this formulation can also be extended to recover latent frames at arbitrary times outside $\mathcal{T}$ (please see the supplementary material for the proof). However, directly applying Eq. (7) for unified deblurring and interpolation often meets the following obstacles. First, the computation of $E(f,\mathcal{T})$ requires knowledge of the event threshold $c$, which is critical to the recovery performance [22] but hard to estimate accurately due to its temporal instability. Second, real-world events are noisy due to the non-ideality of physical sensors [5], e.g., limited read-out bandwidth, and thus often lead to degraded results, especially for long-term integrals of events where $E(f,\mathcal{T})$ is severely contaminated by noise. Hence, we propose to employ learning-based approaches to fit the statistics of real-world events.

Figure 2: Example of the pre-processing operation. The left event subset undergoes time shift, flip and polarity reversal, i.e., $\mathcal{R}(\cdot)$, since $t_{s}-f<0$, and the right subset is processed by the time-shift operator $\mathcal{S}(\cdot)$, as $t_{s}+T-f\geq 0$.

4.1.2 Network Architecture

Our network receives a latent image timestamp $f\in\mathcal{T}_{i+1}^{i}$, two consecutive blurry frames $B_{i},B_{i+1}$ and the corresponding event stream $\mathcal{E}_{i+1}^{i}$ as input, and outputs a sharp latent image $L(f)$. There are two main modules in our network: a learnable double integral (LDI) network and a fusion network, where the LDI network learns to approximate the double integral behavior of Eq. (8) and the fusion network is designed to refine the results generated from the blurry images and the LDI outputs, as shown in Fig. 3.

LDI Network. Suppose the LDI network is trained to approximate a specific case $E(0,\mathcal{T}_{[0,T]})\approx\operatorname{LDI}(\mathcal{E}_{[0,T]})$, i.e.,

\operatorname{LDI}(\mathcal{E}_{[0,T]})\approx\frac{1}{T}\int_{0}^{T}\operatorname{exp}\left(c\int_{0}^{t}e(s)ds\right)dt,   (9)

where $\mathcal{T}_{[0,T]}$ indicates the time interval from $0$ to $T>0$ and $\mathcal{E}_{[0,T]}$ is the corresponding event stream. We now consider the more general case of $E(f,\mathcal{T})$, which can be written as

E(f,\mathcal{T})=\frac{1}{T}\int_{t_{s}}^{f}\operatorname{exp}\left(c\int_{f}^{t}e(s)ds\right)dt
  +\frac{1}{T}\int_{f}^{t_{s}+T}\operatorname{exp}\left(c\int_{f}^{t}e(s)ds\right)dt,   (10)

where $t_{s}$ indicates the starting time of $\mathcal{T}$. Applying the substitutions $t^{\prime}=t-f$ and $s^{\prime}=s-f$ to Eq. (10), we have

E(f,\mathcal{T})=-\frac{1}{T}\int_{0}^{t_{s}-f}\operatorname{exp}\left(c\int_{0}^{t^{\prime}}e(s^{\prime}+f)ds^{\prime}\right)dt^{\prime}
  +\frac{1}{T}\int_{0}^{t_{s}+T-f}\operatorname{exp}\left(c\int_{0}^{t^{\prime}}e(s^{\prime}+f)ds^{\prime}\right)dt^{\prime}
  =w_{1}G(\mathcal{E}_{[f,t_{s}]})+w_{2}G(\mathcal{E}_{[f,t_{s}+T]}),   (11)

where $w_{1}=(f-t_{s})/T$ and $w_{2}=(t_{s}+T-f)/T$ are weights and $G(\cdot)$ is a general formula defined as

G(\mathcal{E}_{[f,t_{r}]})=\frac{1}{t_{r}-f}\int_{0}^{t_{r}-f}\operatorname{exp}\left(c\int_{0}^{t}e(s+f)ds\right)dt   (12)

and $t_{r}$ denotes the reference time. Based on the above definition, we can calculate $E(f,\mathcal{T})$ by approximating Eq. (12) with the LDI network, i.e., Eq. (9). For the case of $t_{r}-f\geq 0$, $G(\cdot)$ can be directly approximated by

G(\mathcal{E}_{[f,t_{r}]})\approx\operatorname{LDI}(\mathcal{S}(\mathcal{E}_{[f,t_{r}]}))   (13)

with $\mathcal{S}(\mathcal{E}_{[f,t_{r}]})\triangleq\{e(t+f),\ t\in[0,t_{r}-f]\}$ representing the time-shift event operator. For the case of $t_{r}-f<0$,

G(\mathcal{E}_{[f,t_{r}]})=\frac{1}{f-t_{r}}\int_{0}^{f-t_{r}}\operatorname{exp}\left(c\int_{0}^{t}-e(-s+f)ds\right)dt\approx\operatorname{LDI}(\mathcal{R}(\mathcal{E}_{[f,t_{r}]})),   (14)

where $\mathcal{R}(\mathcal{E}_{[f,t_{r}]})\triangleq\{-e(-t+f),\ t\in[0,f-t_{r}]\}$ indicates the event operator composed of time shift, flip and polarity reversal, as shown in Fig. 2. For simplicity, we define a unified pre-processing operator $\mathcal{P}(\cdot)$ as follows.

\mathcal{P}(\mathcal{E}_{[f,t_{r}]})=\begin{cases}\mathcal{S}(\mathcal{E}_{[f,t_{r}]})&\text{ if }t_{r}\geq f,\\ \mathcal{R}(\mathcal{E}_{[f,t_{r}]})&\text{ if }t_{r}<f.\end{cases}   (15)

Thus, Eq. (11) can be reformulated as

E(f,\mathcal{T})\approx w_{1}\operatorname{LDI}(\mathcal{P}(\mathcal{E}_{[f,t_{s}]}))+w_{2}\operatorname{LDI}(\mathcal{P}(\mathcal{E}_{[f,t_{s}+T]})),   (16)

meaning that an arbitrary $E(f,\mathcal{T})$ can be approximated by a weighted combination of LDI outputs, where the LDI network only needs to be trained once to fit the case of Eq. (9).
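To make the event re-parameterization concrete, the following sketch implements the pre-processing operator $\mathcal{P}(\cdot)$ of Eq. (15) and the weighted combination of Eq. (16) for a per-pixel event list. The function names and the `ldi` callable are illustrative placeholders under the stated formulas, not the released implementation.

```python
import numpy as np

def preprocess(events, f, t_r):
    """P(.) of Eq. (15): shift events to start at 0; additionally flip the
    time axis and reverse polarities when the reference time t_r precedes f."""
    ts = np.array([t for t, _ in events])
    ps = np.array([p for _, p in events], dtype=np.float64)
    if t_r >= f:                              # S(.): time shift only
        mask = (ts >= f) & (ts <= t_r)
        return list(zip(ts[mask] - f, ps[mask]))
    mask = (ts >= t_r) & (ts <= f)            # R(.): shift + flip + polarity reversal
    return list(zip(f - ts[mask], -ps[mask]))

def approx_E(ldi, events, f, t_s, T):
    """Eq. (16): E(f, T) as a weighted sum of two LDI evaluations,
    with weights w1, w2 as defined below Eq. (11)."""
    w1 = (f - t_s) / T
    w2 = (t_s + T - f) / T
    return (w1 * ldi(preprocess(events, f, t_s))
            + w2 * ldi(preprocess(events, f, t_s + T)))
```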

For the input of the LDI network, we introduce a spatio-temporal event representation. With a pre-defined number of bins $N$, we evenly divide the interval from $t=0$ to $t=T_{i+1}^{i}$ into $N$ temporal bins, where $T_{i+1}^{i}$ denotes the total duration of $\mathcal{T}_{i+1}^{i}$. We then accumulate the events pre-processed by $\mathcal{P}(\cdot)$ inside each temporal bin, and form a $2N\times H\times W$ tensor as the LDI input, with $2$, $H$ and $W$ indicating the event polarities, image height and width, respectively. This event representation enables a flexible choice of the target timestamp $f$ while maintaining a fixed input format, which allows the network to restore the latent image $L(f)$ at arbitrary $f\in\mathcal{T}_{i+1}^{i}$ without any network modification or re-training.
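A minimal sketch of this spatio-temporal representation is given below, assuming events are provided as (t, x, y, p) tuples already pre-processed by $\mathcal{P}(\cdot)$; the exact channel ordering (positive/negative polarity per bin) is an assumption.

```python
import torch

def events_to_voxel(events, duration, N, H, W):
    """Accumulate pre-processed events into a (2N, H, W) tensor:
    N temporal bins, with positive and negative polarities kept apart."""
    vox = torch.zeros(2 * N, H, W)
    for t, x, y, p in events:
        b = min(int(t / duration * N), N - 1)   # temporal bin index
        ch = 2 * b + (0 if p > 0 else 1)        # polarity channel within the bin
        vox[ch, y, x] += 1.0
    return vox
```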

Figure 3: Data flow of the proposed method. With two consecutive blurry frames $B_{i},B_{i+1}$ and the concurrent events $\mathcal{E}_{i+1}^{i}$, we first split the events into four subsets according to the target timestamp $f$, and then feed the events pre-processed by $\mathcal{P}(\cdot)$ to four weight-sharing LDI networks. Following that, three coarse results $L_{i}(f),L_{i+1}(f),L_{i+1}^{i}(f)$ are generated based on $B_{i},B_{i+1}$ and the corresponding event double integrals $E(f,\mathcal{T}_{i}),E(f,\mathcal{T}_{i+1})$, and the final result $\bar{L}(f)$ is produced by refining them with the fusion network.

Fusion Network. After obtaining $E(f,\mathcal{T})$ from the LDI network, the latent image $L(f)$ can be coarsely restored by Eq. (7). We denote the latent images reconstructed from $B_{i}$ and $B_{i+1}$ as $L_{i}(f)$ and $L_{i+1}(f)$, respectively, and manually generate an extra result $L^{i}_{i+1}(f)$ by

L^{i}_{i+1}(f)=\omega(f)L_{i}(f)+(1-\omega(f))L_{i+1}(f),   (17)

where the weighting function $\omega(f)$ with $f\in[0,T_{i+1}^{i}]$ is defined as

\omega(f)=\begin{cases}1&\text{ if }f\in\mathcal{T}_{i},\\ 1-\frac{f}{T_{i+1}^{i}}&\text{ if }f\in\mathcal{T}_{i\rightarrow i+1},\\ 0&\text{ if }f\in\mathcal{T}_{i+1},\end{cases}   (18)

since we observe that such a weighted reconstruction is helpful for frame interpolation. Finally, our fusion network receives $L_{i}(f),L_{i+1}(f),L^{i}_{i+1}(f),E(f,\mathcal{T}_{i}),E(f,\mathcal{T}_{i+1})$ and produces the final latent image $\bar{L}(f)$, as illustrated in Fig. 3.
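The coarse reconstruction and blending step can be sketched as follows, assuming `E_i` and `E_i1` are the LDI outputs $E(f,\mathcal{T}_{i})$ and $E(f,\mathcal{T}_{i+1})$, that `f` is measured from the start of $\mathcal{T}_{i+1}^{i}$, and that the interval arguments and the small epsilon are implementation assumptions.

```python
def omega(f, T_i, T_gap, T_total):
    """Weighting function of Eq. (18); T_i is the exposure length of B_i,
    T_gap the inter-frame gap, T_total the length of the joint interval."""
    if f <= T_i:                   # f inside the exposure of B_i
        return 1.0
    if f >= T_i + T_gap:           # f inside the exposure of B_{i+1}
        return 0.0
    return 1.0 - f / T_total       # f between the two blurry frames

def coarse_latents(B_i, B_i1, E_i, E_i1, w, eps=1e-6):
    """Eq. (7) applied to both blurry frames, plus the blend of Eq. (17)."""
    L_i = B_i / (E_i + eps)        # coarse latent from B_i
    L_i1 = B_i1 / (E_i1 + eps)     # coarse latent from B_{i+1}
    L_mix = w * L_i + (1.0 - w) * L_i1
    return L_i, L_i1, L_mix        # fed to the fusion network with E_i, E_i1
```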

(Figure 4 panels: Left blurry and Right blurry inputs; results of LEVS, EDI, LEDVDI, eSL-Net, RED, and Ours.)
Figure 4: Qualitative comparisons of the deblurring task on the RBE dataset. Details are zoomed in for a better view.
Table 1: Quantitative comparisons of the proposed method to the state-of-the-arts on the deblurring task. Note that LEDVDI only produces 6 frames for each blurry frame while the others output 7 frames. The number of network parameters (#Param.) is also provided.
Method | GoPro (PSNR↑ / SSIM↑ / LPIPS↓) | HQF (PSNR↑ / SSIM↑ / LPIPS↓) | #Param.
LEVS [11] | 20.84 / 0.5473 / 0.1111 | 20.08 / 0.5629 / 0.0998 | 18.21M
EDI [22] | 21.29 / 0.6402 / 0.1104 | 19.65 / 0.5909 / 0.1173 | -
LEDVDI [14] | 25.38 / 0.8567 / 0.0280 | 22.58 / 0.7472 / 0.0578 | 4.996M
eSL-Net [31] | 17.80 / 0.5655 / 0.1141 | 21.36 / 0.6659 / 0.0644 | 0.188M
RED [34] | 25.14 / 0.8587 / 0.0425 | 24.48 / 0.7572 / 0.0475 | 9.762M
EVDI (Ours) | 30.40 / 0.9058 / 0.0144 | 24.77 / 0.7664 / 0.0423 | 0.393M

4.2 Self-supervised Learning Framework

The proposed self-supervised learning framework consists of three different losses which are formulated based on the mutual constraints among blurry frames, sharp latent images and event streams.

Blurry-event Loss. The double integral of events $E(f,\mathcal{T})$ corresponds to the mapping relation between blurry frames and sharp latent images. For multiple blurry inputs, we propose to formulate the consistency between the latent images reconstructed from different blurry frames, e.g., $L_{i}(f)=L_{i+1}(f)$. Considering the quantization error which might be accumulated in $E(f,\mathcal{T})$, we rewrite the consistency as

\frac{B_{i}}{E(f,\mathcal{T}_{i})}\approx\frac{B_{i+1}}{E(f,\mathcal{T}_{i+1})},   (19)

where $E(f,\mathcal{T}_{i})$ and $E(f,\mathcal{T}_{i+1})$ are generated by the LDI network. We convert Eq. (19) to the logarithmic domain and rewrite it as the blurry-event loss $\mathcal{L}_{B\text{-}E}$,

\mathcal{L}_{B\text{-}E}=\|(\tilde{B}_{i+1}-\tilde{B}_{i})-(\tilde{E}(f,\mathcal{T}_{i+1})-\tilde{E}(f,\mathcal{T}_{i}))\|_{1},   (20)

where the tilde denotes the logarithm, e.g., $\tilde{B}_{i}=\operatorname{log}(B_{i})$. With the blurry-event constraint, the LDI network can learn to perform the event double integral by utilizing the brightness difference between blurry frames.
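A minimal PyTorch sketch of the blurry-event loss of Eq. (20), assuming the blurry frames and LDI outputs are positive-valued tensors; the small epsilon is an assumption for numerical stability.

```python
import torch

def blurry_event_loss(B_i, B_i1, E_i, E_i1, eps=1e-6):
    """Eq. (20): consistency between the blurry frames and the event
    double integrals, expressed in the logarithmic domain."""
    lhs = torch.log(B_i1 + eps) - torch.log(B_i + eps)
    rhs = torch.log(E_i1 + eps) - torch.log(E_i + eps)
    return torch.abs(lhs - rhs).mean()   # L1 distance
```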

Blurry-sharp Loss. Given the reconstructed latent images $\bar{L}(t)$ with $t\in\mathcal{T}_{i}$, the blurring process of Eq. (6) can be reformulated in its discrete version, i.e.,

\bar{B}_{i}=\frac{1}{T}\int_{t\in\mathcal{T}_{i}}\bar{L}(t)dt\approx\frac{1}{M}\sum_{m=0}^{M-1}\bar{L}_{i}[m],   (21)

where $\bar{L}_{i}[m]$ indicates the $m$-th latent image inside the exposure time of $B_{i}$ and $M$ is the total number of reconstructions. Previous attempts reduce the discretization error by assuming linear [3] or piece-wise linear [34] motion between latent frames and interpolating more intermediate frames, but this assumption might be violated in real-world scenarios, especially in the case of complex non-linear motions. In contrast, we restore all $\bar{L}_{i}[m]$ with our network to exploit the real motion embedded in the event streams, and formulate the blurry-sharp loss $\mathcal{L}_{B\text{-}S}$ between the re-blurred images $\bar{B}_{i},\bar{B}_{i+1}$ and the original blurry inputs $B_{i},B_{i+1}$ as

\mathcal{L}_{B\text{-}S}=\|\bar{B}_{i}-B_{i}\|_{1}+\|\bar{B}_{i+1}-B_{i+1}\|_{1},   (22)

which guarantees the brightness consistency by learning from the blurry inputs.
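The re-blurring constraint of Eqs. (21)-(22) can be sketched as below, where `latents_i` denotes the M latent images reconstructed by the network inside the exposure of $B_i$; the stacking convention is an assumption.

```python
import torch

def blurry_sharp_loss(latents_i, latents_i1, B_i, B_i1):
    """Eqs. (21)-(22): re-blur the reconstructed latent frames by
    averaging them and compare against the original blurry inputs."""
    B_hat_i = torch.stack(latents_i, dim=0).mean(dim=0)     # Eq. (21) for B_i
    B_hat_i1 = torch.stack(latents_i1, dim=0).mean(dim=0)   # Eq. (21) for B_{i+1}
    return (torch.abs(B_hat_i - B_i).mean()
            + torch.abs(B_hat_i1 - B_i1).mean())
```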

Sharp-event Loss. Apart from the above constraints, the relation between sharp latent images and events can also be leveraged to supervise the reconstruction of consecutive latent frames. Based on Eq. (5), we have

\mathcal{N}(\Delta\tilde{L})=\mathcal{N}(J),   (23)

where $\Delta\tilde{L}\triangleq\tilde{L}(t)-\tilde{L}(f)$, $J\triangleq\int_{f}^{t}e(s)ds$, and $\mathcal{N}(\cdot)$ is the min/max normalization operator adopted in [23]. Therefore, we can avoid the estimation of the threshold $c$ and formulate the sharp-event loss $\mathcal{L}_{S\text{-}E}$ as

\mathcal{L}_{S\text{-}E}=\|\mathcal{M}(\mathcal{N}(\Delta\tilde{L}))-\mathcal{M}(\mathcal{N}(J))\|_{1},   (24)

where $\mathcal{M}(\cdot)$ denotes a pixel-wise masking operator with $\mathcal{M}(\cdot)=0$ only at pixels where no events are triggered. Finally, the total self-supervised objective can be summarized as follows.

\mathcal{L}=\alpha\mathcal{L}_{B\text{-}E}+\beta\mathcal{L}_{B\text{-}S}+\gamma\mathcal{L}_{S\text{-}E},   (25)

with $\alpha,\beta,\gamma$ denoting the balancing parameters.
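A sketch of the sharp-event loss of Eq. (24) and the total objective of Eq. (25), assuming `J` is the per-pixel signed event sum between f and t, that min/max normalization is applied per sample, and that masking the difference at event-free pixels is equivalent to the described operator; averaging instead of summing the L1 terms is an implementation choice.

```python
import torch

def minmax_norm(x, eps=1e-6):
    """Per-sample min/max normalization N(.)"""
    return (x - x.min()) / (x.max() - x.min() + eps)

def sharp_event_loss(L_t, L_f, J, eps=1e-6):
    """Eq. (24): compare the normalized log-intensity change with the
    normalized event integral, masked to pixels that fired events."""
    delta = torch.log(L_t + eps) - torch.log(L_f + eps)
    mask = (J != 0).float()
    return torch.abs(mask * (minmax_norm(delta) - minmax_norm(J))).mean()

def total_loss(l_be, l_bs, l_se, alpha, beta, gamma):
    """Eq. (25): weighted sum of the three self-supervised terms."""
    return alpha * l_be + beta * l_bs + gamma * l_se
```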

(Figure 5 panels: Left blurry and Right blurry inputs; results of Jin, BIN, DAIN, LEDVDI, Time Lens, and Ours.)
Figure 5: Qualitative comparisons of the interpolation task on the RBE dataset. Details are zoomed in for a better view.
Table 2: Quantitative results on the interpolation task. We compute PSNR and SSIM on the reconstruction results of the skipped frames, and use the official models provided by the authors for comparison. The column #Param. indicates the number of network parameters.
Method | 1 frame skip: GoPro (PSNR↑ / SSIM↑) | 1 frame skip: HQF (PSNR↑ / SSIM↑) | 3 frame skips: GoPro (PSNR↑ / SSIM↑) | 3 frame skips: HQF (PSNR↑ / SSIM↑) | #Param.
Jin [10] | 20.47 / 0.5244 | 20.48 / 0.5958 | 19.50 / 0.4730 | 18.78 / 0.5160 | 10.81M
BIN [25] | 19.54 / 0.4645 | 18.25 / 0.4576 | - / - | - / - | 11.44M
DAIN [1] | 20.89 / 0.5297 | 20.97 / 0.5980 | 20.48 / 0.5102 | 20.46 / 0.5848 | 24.03M
EDI [22] | 18.72 / 0.5059 | 16.62 / 0.4266 | 18.49 / 0.4862 | 16.58 / 0.4219 | -
LEDVDI [14] | 24.42 / 0.8198 | 19.24 / 0.6034 | 23.57 / 0.7992 | 18.57 / 0.5651 | 4.996M
Time Lens [30] | 21.56 / 0.5809 | 21.21 / 0.6090 | 21.47 / 0.5870 | 20.96 / 0.6060 | 79.20M
EVDI (Ours) | 29.17 / 0.8797 | 23.09 / 0.6929 | 28.77 / 0.8731 | 22.24 / 0.6670 | 0.393M

5 Experiments and Analysis

5.1 Experimental Settings

Datasets. We evaluate the proposed method with three different datasets, including synthetic and real-world ones.

GoPro: We build a purely synthetic dataset based on the REDS dataset [16]. We first downsample and crop the images to size $160\times 320$ and then increase the frame rate by interpolating 7 images between consecutive frames using RIFE [6]. Finally, we generate both blurry frames and events from the high frame-rate sequences, where the blurry frames are obtained by averaging a specific number of sharp images and the events are simulated by ESIM [24].
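The blurry-frame synthesis follows the averaging model of Eq. (6); a minimal sketch, assuming the sharp frames are already aligned high frame-rate images and that the window size is whatever the experiment specifies (e.g., 49 frames in the deblurring setting below):

```python
import numpy as np

def synthesize_blurry(sharp_frames):
    """Average a window of consecutive sharp frames into one blurry frame,
    following the blur model of Eq. (6)."""
    return np.mean(np.stack(sharp_frames, axis=0), axis=0)
```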

HQF: The HQF dataset [26] contains real-world events and high-quality frames captured simultaneously by a DAVIS240C camera, where the images are minimally blurred. We up-convert the frame rate and synthesize blurry frames in the same manner as for the GoPro dataset, forming a semi-synthetic blurry video dataset.

RBE: The RBE dataset [34] employs a DAVIS346 camera to collect real-world blurry videos and the corresponding event streams, which can be used for training with the proposed self-supervised learning framework and verifying the performance of our method in real-world scenarios.

Implementation details. We implement the LDI network with 5 convolution layers and the fusion network with 6 convolution layers, 2 residual blocks and 1 CBAM [33] block, forming a lightweight network architecture (detailed in the supplementary material). Our network is implemented in PyTorch and trained on NVIDIA GeForce RTX 2080 Ti GPUs with a batch size of 4 by default. The Adam optimizer [12] is employed together with the SGDR [15] schedule, where the restart period is set to 100 epochs (i.e., the learning rate is reset every 100 epochs). We set the number of temporal bins to $N=16$ for the LDI inputs and randomly crop the images to $128\times 128$ patches for training. The training process is divided into two stages: we first train our model in the deblurring setting with the weighting factors $[\alpha,\beta,\gamma]=[512,1,1\times 10^{-1}]$ and an initial learning rate of $1\times 10^{-3}$ for 100 epochs, and then continue training under the setting of unified deblurring and interpolation with the weighting factors $[\alpha,\beta,\gamma]=[128,1,1\times 10^{-1}]$ and an initial learning rate of $1\times 10^{-4}$ for another 100 epochs. We train a model for each dataset and evaluate it on the corresponding dataset, which is convenient as we do not need ground-truth images for supervision.
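A rough PyTorch configuration matching the stated optimizer and restart schedule is sketched below; the placeholder model and the mapping of SGDR to CosineAnnealingWarmRestarts with T_0=100 are assumptions based on the description above, not the released training script.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the EVDI model (assumption).
model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))

# Stage 1: deblurring setting, initial learning rate 1e-3, with an
# SGDR-style cosine schedule that restarts every 100 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=100)
alpha, beta, gamma = 512.0, 1.0, 0.1   # stage-1 loss weights

# Stage 2 (after 100 epochs): unified deblurring + interpolation,
# learning rate reset to 1e-4 and alpha lowered to 128.
```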

5.2 Results of Deblurring

For the deblurring setting, we synthesize 1 blurry image from 49 frames on the GoPro and HQF datasets and evaluate the performance by restoring 7 original frames (before up-converting the frame rate) per blurry image. We compare with the state-of-the-art frame-based deblurring approach LEVS [11] and the event-based methods EDI [22], LEDVDI [14], eSL-Net [31] and RED [34], and evaluate the results with the PSNR, SSIM [32] and LPIPS [36] metrics.

(Figure 6 panels: GT, w/ $\mathcal{L}_{B\text{-}S}$, w/ $\mathcal{L}_{B\text{-}E}$, w/ $\mathcal{L}_{S\text{-}E}$, w/ All.)
Figure 6: Visual results of EVDI trained with different losses.

As demonstrated in Tab. 1, the proposed method achieves remarkable deblurring results compared to the state-of-the-art methods. The performance of LEVS is limited in highly dynamic scenes, e.g., Fig. 4, due to motion ambiguity. Among event-based approaches, the model-driven method EDI provides performance comparable to LEVS by exploiting the precise motion embedded in events. LEDVDI further extends this advantage on the GoPro dataset by introducing learning-based techniques, but this overwhelming performance is not maintained on the HQF dataset due to the inconsistency between datasets. RED achieves the most competitive results with semi-supervised learning, but it still sacrifices performance to balance different data distributions. Our EVDI method tackles this problem by fitting the particular data distribution with the self-supervised framework, and thus achieves the best performance on each dataset. Meanwhile, our model contains only 0.393M network parameters, which is an order of magnitude smaller than the other methods except eSL-Net. Note that eSL-Net requires 122.8G FLOPs to infer a $160\times 320$ image due to its recursive structure, while our model needs only 13.45G FLOPs, maintaining overall efficiency.

5.3 Results of Interpolation

For the interpolation task, we collect 97 consecutive frames (i.e., 13 original frames before up-converting) as one set of input, and synthesize 1 blurry image from the 41 frames at each end, leaving 1 original latent frame in the middle for evaluation (denoted as 1 frame skip). Similarly, we design another case of 3 frame skips by synthesizing each blurry image from 33 frames and leaving 3 original middle frames. The frame-based interpolation methods Jin's work [10], BIN [25] and DAIN [1], and the event-based approaches EDI [22], LEDVDI [14] and Time Lens [30] are compared.

The results in Fig. 5 and Tab. 2 demonstrate the difficulty of blurry video interpolation for frame-based approaches. The optical flow used in DAIN often projects motion blur into the interpolation results, as shown in Fig. 5. Jin's work employs a cascaded scheme for deblurring and interpolation, which tends to propagate the deblurring error to the interpolation stage. Although BIN performs joint deblurring and interpolation, it struggles to synthesize accurate intermediate frames due to the missing information between frames. For event-based methods, LEDVDI and Time Lens are able to correctly estimate the intermediate frames using the precise motion embedded in events. However, the performance of Time Lens is highly dependent on the quality of the reference frames, and LEDVDI often suffers a performance drop when inferring on other datasets due to data inconsistency. The proposed EVDI method utilizes both frames and events to guarantee interpolation quality, and tackles the inconsistency problem by learning on the target scenarios with the self-supervised framework, thus producing better results.

Table 3: Ablation study of the proposed self-supervised framework and the fusion network. We train these models under the setting of 1 frame skip on the GoPro dataset but evaluate them by computing metrics on all frames of the test set, including the reconstruction results within and between blurry frames, for a comprehensive analysis of unified deblurring and interpolation.
$\mathcal{L}_{B\text{-}S}$ | $\mathcal{L}_{B\text{-}E}$ | $\mathcal{L}_{S\text{-}E}$ | Fusion | PSNR / SSIM / LPIPS
✓ |   |   | ✓ | 22.66 / 0.6769 / 0.0954
  | ✓ |   | ✓ | 9.152 / -0.0631 / 0.3213
  |   | ✓ | ✓ | 6.847 / 0.0192 / 0.7840
✓ | ✓ |   | ✓ | 29.97 / 0.8998 / 0.0182
✓ |   | ✓ | ✓ | 28.07 / 0.8734 / 0.0274
✓ | ✓ | ✓ |   | 29.36 / 0.8924 / 0.0221
✓ | ✓ | ✓ | ✓ | 30.15 / 0.9026 / 0.0162

5.4 Ablation Study

We study the importance of each loss in our self-supervised framework and investigate the contribution of the fusion network. The following conclusions are drawn:

Necessity of loss combination. As depicted in Fig. 6, the blurry-sharp loss $\mathcal{L}_{B\text{-}S}$ contributes to brightness consistency but cannot produce sharp results. The blurry-event loss $\mathcal{L}_{B\text{-}E}$ and the sharp-event loss $\mathcal{L}_{S\text{-}E}$ are able to deal with motion ambiguity by gaining supervision from blurry frames and events, respectively, but do not constrain brightness. With the combination of the loss functions, brightness inconsistency and motion ambiguity can be addressed simultaneously by taking the complementary advantages of $\mathcal{L}_{B\text{-}S}$ and $\mathcal{L}_{B\text{-}E},\mathcal{L}_{S\text{-}E}$.

Importance of information fusion. Although $\mathcal{L}_{B\text{-}E}$ and $\mathcal{L}_{S\text{-}E}$ are both capable of tackling motion ambiguity, their supervision comes from different information sources: $\mathcal{L}_{B\text{-}E}$ exploits the blurry frames $B_{i},B_{i+1}$ to supervise the estimation of $E(f,\mathcal{T}_{i}),E(f,\mathcal{T}_{i+1})$, while $\mathcal{L}_{S\text{-}E}$ utilizes events to constrain the generation of $\bar{L}(f)$. Hence, combining $\mathcal{L}_{B\text{-}E}$ and $\mathcal{L}_{S\text{-}E}$ achieves the best performance, as demonstrated in Tab. 3. Moreover, the fusion network further improves the results by fusing the information from different blurry frames and events.

6 Conclusion

This paper introduces a unified framework of event-based video deblurring and interpolation that generates high frame-rate sharp videos from low frame-rate blurry inputs. By analyzing the mutual constraints among blurry frames, sharp latent images and events, a self-supervised learning framework is also proposed to enable network training in real-world scenarios without any labeled data. Evaluation on both synthetic and real-world datasets demonstrates that our method compares favorably against the state-of-the-art approaches while maintaining an efficient network design, showing potential for practical applications.

References

  • [1] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In CVPR, pages 3703–3712, 2019.
  • [2] Ryad Benosman, Charles Clercq, Xavier Lagorce, Sio-Hoi Ieng, and Chiara Bartolozzi. Event-based visual flow. IEEE Transactions on Neural Networks and Learning Systems, 25(2):407–417, 2013.
  • [3] Huaijin Chen, Jinwei Gu, Orazio Gallo, Ming-Yu Liu, Ashok Veeraraghavan, and Jan Kautz. Reblur2deblur: Deblurring videos via self-supervised learning. In 2018 IEEE International Conference on Computational Photography (ICCP), pages 1–9. IEEE, 2018.
  • [4] Senyou Deng, Wenqi Ren, Yanyang Yan, Tao Wang, Fenglong Song, and Xiaochun Cao. Multi-scale separable network for ultra-high-definition video deblurring. In ICCV, pages 14030–14039, October 2021.
  • [5] Guillermo Gallego, Tobi Delbruck, Garrick Michael Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Jorg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey. IEEE TPAMI, 2020.
  • [6] Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. Rife: Real-time intermediate flow estimation for video frame interpolation. arXiv preprint arXiv:2011.06294, 2020.
  • [7] Tae Hyun Kim, Kyoung Mu Lee, Bernhard Scholkopf, and Michael Hirsch. Online video deblurring via dynamic temporal blending network. In ICCV, pages 4038–4047, 2017.
  • [8] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, pages 2462–2470, 2017.
  • [9] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, pages 9000–9008, 2018.
  • [10] Meiguang Jin, Zhe Hu, and Paolo Favaro. Learning to extract flawless slow motion from blurry videos. In CVPR, pages 8112–8121, 2019.
  • [11] Meiguang Jin, Givi Meishvili, and Paolo Favaro. Learning to extract a video sequence from a single motion-blurred image. In CVPR, pages 6334–6342, 2018.
  • [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [13] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128$\times$128 120 dB 15 $\mu$s latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
  • [14] Songnan Lin, Jiawei Zhang, Jinshan Pan, Zhe Jiang, Dongqing Zou, Yongtian Wang, Jing Chen, and Jimmy Ren. Learning event-driven video deblurring and interpolation. In ECCV, pages 695–710. Springer, 2020.
  • [15] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. In ICLR, 2017.
  • [16] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In CVPRW, pages 1974–1984, 2019.
  • [17] Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In CVPR, pages 8102–8111, 2019.
  • [18] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, pages 5437–5446, 2020.
  • [19] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, pages 670–679, 2017.
  • [20] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, pages 261–270, 2017.
  • [21] Liyuan Pan, Richard Hartley, Cedric Scheerlinck, Miaomiao Liu, Xin Yu, and Yuchao Dai. High frame rate video reconstruction based on an event camera. IEEE TPAMI, 2020.
  • [22] Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In CVPR, pages 6820–6829, 2019.
  • [23] Federico Paredes-Vallés and Guido CHE de Croon. Back to event basics: Self-supervised learning of image reconstruction for event cameras via photometric constancy. In CVPR, pages 3446–3455, 2021.
  • [24] Henri Rebecq, Daniel Gehrig, and Davide Scaramuzza. Esim: an open event camera simulator. In Conference on Robot Learning, pages 969–982. PMLR, 2018.
  • [25] Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, and Zhiyong Gao. Blurry video frame interpolation. In CVPR, pages 5114–5123, 2020.
  • [26] Timo Stoffregen, Cedric Scheerlinck, Davide Scaramuzza, Tom Drummond, Nick Barnes, Lindsay Kleeman, and Robert Mahony. Reducing the sim-to-real gap for event cameras. In ECCV, pages 534–549. Springer, 2020.
  • [27] Maitreya Suin and A. N. Rajagopalan. Gated spatio-temporal attention-guided video deblurring. In CVPR, pages 7802–7811, June 2021.
  • [28] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In CVPR, pages 8934–8943, 2018.
  • [29] Jacob Telleen, Anne Sullivan, Jerry Yee, Oliver Wang, Prabath Gunawardane, Ian Collins, and James Davis. Synthetic shutter speed imaging. In Comput. Graph. Forum, volume 26, pages 591–598. Wiley Online Library, 2007.
  • [30] Stepan Tulyakov, Daniel Gehrig, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, and Davide Scaramuzza. Time lens: Event-based video frame interpolation. In CVPR, pages 16155–16164, 2021.
  • [31] Bishan Wang, Jingwei He, Lei Yu, Gui-Song Xia, and Wen Yang. Event enhanced high-quality image recovery. In ECCV, pages 155–171. Springer, 2020.
  • [32] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In IEEE Asilomar Conf. Sign. Syst. Comput., volume 2, pages 1398–1402, 2003.
  • [33] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In ECCV, pages 3–19, 2018.
  • [34] Fang Xu, Lei Yu, Bishan Wang, Wen Yang, Gui-Song Xia, Xu Jia, Zhendong Qiao, and Jianzhuang Liu. Motion deblurring with real events. In ICCV, pages 2583–2592, 2021.
  • [35] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. IJCV, 127(8):1106–1125, 2019.
  • [36] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
  • [37] Shangchen Zhou, Jiawei Zhang, Jinshan Pan, Haozhe Xie, Wangmeng Zuo, and Jimmy Ren. Spatio-temporal filter adaptive network for video deblurring. In ICCV, pages 2482–2491, 2019.