Super-Resolving Blurry Images with Events

Chi Zhang, Mingyuan Lin, Xiang Zhang, Chenxu Jiang, Lei Yu This work was partially supported by the National Natural Science Foundation of China under Grants 62271354 and 61871297.C. Zhang, M. Lin, C. Jiang, and L. Yu are with the School of Electronic Information, Wuhan University, Wuhan 430072. X. Zhang is with the Department of Computer Science, ETH Zurich, Switzerland.Corresponding author: Lei Yu ([email protected]).

Abstract

Super-resolution from motion-blurred images poses a significant challenge due to the combined effects of motion blur and low spatial resolution. To address this challenge, this paper introduces an Event-based Blurry Super Resolution Network (EBSR-Net), which leverages the high temporal resolution of events to mitigate motion blur and improve high-resolution image prediction. Specifically, we propose a multi-scale center-surround event representation to fully capture motion and texture information inherent in events. Additionally, we design a symmetric cross-modal attention module to fully exploit the complementarity between blurry images and events. Furthermore, we introduce an intermodal residual group composed of several residual dense Swin Transformer blocks, each incorporating multiple Swin Transformer layers and a residual connection, to extract global context and facilitate inter-block feature aggregation. Extensive experiments show that our method compares favorably against state-of-the-art approaches and achieves remarkable performance.

Index Terms:

Motion Deblurring, Super-Resolution, Event Camera

I Introduction

Motion blurs often lead to significant performance degradation in Super Resolution (SR) tasks, characterized by motion ambiguities and texture erasure, which poses a substantial challenge for downstream tasks, e.g., autonomous driving [1], visual detection and tracking [2, 3], and visual SLAM [4, 5].

While promising progress has been reported in image SR over the past decade [6, 7, 8, 9], few studies have addressed scenarios involving blurry textures and diverse motion patterns. Consequently, they often lose effectiveness when handling motion-blurred images in real-world dynamic scenarios. Despite decades of separate investigation into the problems of image SR [6, 7, 8] and motion deblurring [10, 11], each yielding promising results, simply integrating a motion deblurring module into an image SR architecture may either exacerbate artifacts or compromise detailed information [12, 13] due to cascading errors. Compared to traditional cascading methods, recent advancements in single-image SR from motion-blurred images have revealed that simultaneous resolution of motion ambiguities can significantly enhance effectiveness [14, 15, 16], despite the inherent ill-posed nature of this task [17]. While kernel-based methods have shown promise in addressing motion-blurred image SR under the assumption of uniform motion [13, 18, 19], real-world scenarios often present non-uniform motions, such as those involving non-rigid or moving objects, challenging this assumption. To tackle this issue, various strategies have emerged, including motion flow estimation from video sequences [9, 20] and the use of end-to-end deep neural networks [21, 22, 23, 24, 25]. However, these approaches are often specialized for specific domains, like faces [21, 22] or text [23], or heavily reliant on the performance of the deblurring submodule [24, 25], limiting their applicability to general image SR tasks involving natural scenes with complex motions.

Recently, several studies have highlighted the advantages of event cameras in Motion-blurred Image Super-Resolution (MSR) in scenes with complex motions [26, 27, 28]. These studies report asynchronous event data with extremely low latency (in the order of $\mu$ s), which proves effective in recovering accurate sharp details even under non-linear motions and preserving high-resolution information with its high temporal resolution. However, existing methods often experience performance degradation in more complex scenarios due to limitations imposed by sparse coding [26, 28], as well as accumulated errors in the multi-stage training procedure [27].

In this paper, to address the aforementioned issue, we introduce a novel Event-based Blurry Super Resolution Network (EBSR-Net), a one-stage architecture aimed at directly recovering HR sharp image from an motion-blurred LR image across diverse scenarios. We revisit and formulate the MSR task in Sec. II-A, exploring how events can be leveraged to mitigate the ill-posed problem. In Sec. II-B, we first introduce a novel Multi-scale Center-surround Event Representation (MCER) module to fully exploit intra-frame motion information for extracting multi-scale textures embedded in events. Then, a Symmetric Cross-Modal Attention (SCMA) module is presented to effectively attend to cross-modal features for subsequent tasks through symmetric querying between frames and events. Furthermore, we design an Intermodal Residual Group (IRG) module consisting of several residual dense Swin Transformer layers and a residual dense connection to facilitate inter-block feature aggregation and extract global context. Overall, the contributions of this paper are three-fold:

1.

We propose a novel event-based approach, i.e., EBSR-Net, for the single blurry image SR, harnessing cross-modal information between blurry frames and events to reconstruct HR sharp images across diverse scenarios within a one-stage architecture.
2.

We propose an innovative event representation method, i.e., MCER, which comprehensively captures intra-frame motion information through a multi-scale center-surround structure in the temporal domain.
3.

We employ SCMA and IGR modules to achieve effective image restoration by symmetrically querying multimodal feature and facilitating inter-block feature aggregation.

II Methods

II-A Problem Formulation

Due to the imperfection of image sensors, the captured image $B$ may suffer from non-negligible quality degeneration including motion blur and low spatial resolution, which can be related to the high-quality (sharp and high spatial resolution) latent image $L$ as follows:

\begin{split}B&=\frac{1}{T}\int_{t\in\mathcal{T}}I(t)dt,\\ I&=D^{\downarrow}(L),\end{split}

(1)

where $\mathcal{T}\triangleq[0,T]$ denotes the exposure interval of $B$ , $T$ is the duration of $\mathcal{T}$ , and $D^{\downarrow}(\cdot)$ represents the down-sampled operator to obtain Low Resolution (LR) sharp image $I$ from $L$ . Thus the Motion-blurred image Super Resolution (MSR) can be formulated as:

L=\operatorname{MSR}(B).

(2)

It is obvious that the task of recovering a High-Resolution (HR) sharp image $L$ from a single LR blurry image $B$ poses a severe ill-posed problem. While significant progress has been reported in MSR techniques [13, 18, 23, 23], these approaches often are specialized for specific domains (e.g., face or text) or rely heavily on the strong assumption of uniform motion, thereby limiting their applicability in natural scenarios with complex motions.

Recently, many methods [26, 28, 27] utilize events to tackle MSR task due to their low latency. These events are triggered whenever the log-scale brightness change exceeds the event threshold $c>0$ , i.e.,

\operatorname{log}(I(t,\mathbf{x}))-\operatorname{log}(I(\tau,\mathbf{x}))=p\cdot c,

(3)

where $I(t,\mathbf{x})$ and $I(\tau,\mathbf{x})$ denote the instantaneous intensity at time $t$ and $\tau$ at the pixel position $\mathbf{x}$ , and polarity $p\in\{+1,-1\}$ indicates the direction of brightness changes. Hence, the Event-based MSR (E-MSR) task can be represented as:

L=\operatorname{E-MSR}(B,\mathcal{E}_{\mathcal{T}}),

(4)

where $\mathcal{E}_{\mathcal{T}}\triangleq{(\mathbf{x}_{i},p_{i},t_{i})}_{t_{i}\in\mathcal{T}}$ denotes the emitted event stream during the exposure interval of $B$ . However, existing approaches [28, 27] often suffer performance degradation on more complex scenarios owing to the limitations imposed by sparse coding [28], and accumulated errors in the multi-stage training procedure [27]. To address these problems, we introduce a novel Event-based Blurry Super Resolution Network (EBSR-Net), a one-stage architecture aimed at directly recovering $L$ from $B$ and concurrent event stream $\mathcal{E}_{\mathcal{T}}$ across various challenging scenarios.

Refer to caption — Figure 1: (a) illustrates the details of the proposed Multi-scale Center-surround Event Representation (MCER). Here, $\Delta t$ controls the exposure interval, determining the period utilized for quantizing the event representation.

II-B Overall Architecture

The overall architecture of our proposed Event-based Blurry Super Resolution Network (EBSR-Net) is illustrated in Fig. 2 (a), which is a deep network mainly consisting of a Multi-scale Center-surround Event Representation (MCER) module, a Symmetric Cross-Modal Attention (SCMA) module, and an Intermodal Residual Group (IRG) module.

Multi-scale Center-surround Event Representation. The blur degree of a motion-blurred image often varies significantly due to the diverse speeds of the camera or object motion during exposure. To ensure robustness across various scenes with complex motion, we introduce the novel Multi-scale Center-surround Event Representation (MCER) module. This module comprehensively captures intra-frame motion at multiple temporal scales, thereby strengthening the resilience of the event representation. Specifically, we encode event streams $\mathcal{E}_{\mathcal{T}}$ into window-dependent representation frames $E_{\Delta t}(f)$ , denoted as

	$\displaystyle E_{\Delta t}(f)$	$\displaystyle=\operatorname{MCER}(\mathcal{E}_{\mathcal{N}_{\Delta{t}}(f)}),\quad\text{with}$		(5)
	$\displaystyle\mathcal{N}_{\Delta{t}}(f)$	$\displaystyle\triangleq\{t^{\prime}\mid\left\|f-t^{\prime}\right\|\leq{\frac{\Delta{t}}{2}},\forall t^{\prime}\in{\mathcal{T}}\},$		(6)

where $f$ represents the middle point of the exposure time $T$ , $\Delta t$ determines the length of the interval. According to Eqs. 5 and 6, the event representation results with different $\Delta t$ encode motion information across multiple temporal scales. Additionally, we employ Event Count Map and Timesurface approaches [29] to quantize intra-frame motion information.

Symmetric Cross-Modal Attention. Compared to existing simple fusion methods between frames and events [27, 29], our proposed Symmetric Cross-Modal Attention (SCMA) module fully exploit the complementary characteristics of multimodal information. Through symmetric querying of multimodal data, it extract adaptively enhanced features for subsequent tasks. As illustrated in Fig. 2 (b), SCMA module takes the feature of blurry image $F_{b}$ and events $F_{e}$ as inputs, yielding symmetric fusion features $F_{s}$ . These features are obtained by encoders consisting of conventional convolutional layers and receiving blurry image $B$ and event map $E_{\Delta t}$ .

Unlike conventional self-attention blocks that typically compute queries, keys, and values exclusively from either the frame or event branch of the network, our SCMA leverages multimodal information between frames and events. Queries are calculated from both images and events, with keys and values obtained from the opposite modality. SCMA consists of two parallel self-attention structures and a combination operation, formulated as:

\begin{split}\operatorname{Att(Q_{b},K_{e},V_{e})}&=V_{e}\operatorname{Softmax}(\frac{Q_{b}^{{}^{\prime}}K_{e}}{\sqrt{d_{k}}}),\\ \operatorname{Att(Q_{e},K_{b},V_{b})}&=V_{b}\operatorname{Softmax}(\frac{Q_{e}^{{}^{\prime}}K_{b}}{\sqrt{d_{k}}}),\end{split}

(7)

where $(\cdot)^{{}^{\prime}}$ represents the transpose operator. Note that $Q$ , $K$ , and $V$ , are produced through operations involving normalization and 1 $\times$ 1 convolutional layers. Specifically, $Q_{b}$ and $Q_{e}$ are derived from $F_{b}$ and $F_{e}$ respectively, similar to $K$ and $V$ . Additionally, the adaptively symmetric fusion-based output $F_{s}$ can be calculated by

F_{s}=\operatorname{Conv_{1\times 1}}([\tilde{F}_{b\rightarrow e},\tilde{F}_{e\rightarrow b}]),

(8)

where $\tilde{F}_{b\rightarrow e}$ and $\tilde{F}_{e\rightarrow b}$ represent intermediate features from the $Q_{b}$ and $Q_{e}$ branches, respectively. These features are generated by reshaping the results that combine attention and original features followed by a Multi-Layer Perceptron (MLP) layer.

Intermodal Residual Group. The SCMA module aims to extract multimodal information but lacks effective deep features for subsequent tasks. To address this, we introduce the Intermodal Residual Group (IRG) module, which fully exploits deep intermodal information through a meticulously designed group of Residual Dense Swin-Transformer Blocks (RDSTB) [6, 30]. The IRG module, depicted in Fig. 2 (a) and (c), comprises four RDSRBs and two 3 $\times$ 3 convolutional layers. Each RDSRB module contains four residual dense blocks of Swin Transformer Layer (STL). Explicitly, the first RDSTB structure in the IRG module is formulated as:

F_{n}(d)=\operatorname{STL}([F_{n}(d-1),F_{n}(d-2),...,F_{s})]),

(9)

where $F_{i}$ represents the intermediate feature maps of RDSTBs, $d$ denotes the number of STL layers, and $\operatorname{STL}$ [30] is based on the standard multi-head self-attention and the original transformer layer [31]. Additionally, the output $F_{r}$ of the first RDSRB module can be obtained by

F_{r}=\operatorname{Conv_{3\times 3}}(F_{n}(4))+F_{s}.

(10)

The final result $F_{i}$ of the IRG module is computed using three additional RDSTBs and two 3 $\times$ 3 convolutional layers. Consequently, the recovery HR sharp image $\bar{L}$ can be estimated by

\bar{L}=\operatorname{Dec}(F_{r})+\operatorname{BL}(B),

(11)

where $\operatorname{Dec}$ denotes a decoder operator, and $\operatorname{BL}$ represents the bilinear upsampling operation.

II-C Loss Function

We supervise the overall architecture by using $L_{1}$ loss and perceptual similarity loss $\mathcal{L}_{per}$ [32] for better visual quality, which can be formulated as

\mathcal{L}_{\text{total}}=\alpha{\|\bar{L}-L\|}_{1}+\beta\mathcal{L}_{\text{per}}(\bar{L},L),

(12)

where $\alpha$ and $\beta$ are the balancing parameters.

III Experiment

III-A Datasets and Implementation

The proposed EBSR-Net model, implemented using PyTorch on an NVIDIA GeForce RTX 3090, is trained separately on the training sets of both the GoPro [33] and REDS [34] datasets, and we use simulated events and blurry image [29]. Subsequently, evaluations are conducted separately on the testing sets of these two datasets. Furthermore, we use the ADAM optimizer [35] with an initial learning rate of $10^{-4}$ , and the exponential term decays by $0.98$ for every $5$ epoch. The weighting factors $\alpha$ and $\beta$ in Eq. 12 are set to 1 and 0.1, respectively. We use Structural SIMilarity (SSIM) [36] and Peak Signal to Noise Ratio (PSNR) as performance metrics.

III-B Quantitative and Qualitative Evaluation

We compare our EBSR-Net to the state-of-the-art Motion Deblurring (MD) methods including MPR [37], RED [29], and EF [38], and Image Super-Resolution (ISR) approaches, including SwIR [6], CAT [7], and DAT [8], on both GoPro [33] and REDS [34] datasets. These comparison methods are categorized into two stages: MD methods are evaluated first, followed by the ISR methods, denoted as MPR+DAT, and so on. Additionally, a one-stage architecture, e.g., eSL-Net [26], is directly utilized for comparison by utilizing its official code with default parameters. According to the quantitative results shown in Tab. I, our EFSR-Net outperforms state-of-the-art methods by a large margin, achieving an average improvement of 4.12/5.70 dB and 0.0788/0.1607 in PSNR and SSIM on the GoPro and REDS datasets respectively. Meanwhile, our one-stage model only contains 7.3M network parameters, which is much smaller than other methods except eSL-Net. Note that eSL-Net requires 122.8G FLOPs to infer a 160 $\times$ 320 image due to its recursive structure, while our model only needs 41.2G FLOPs, maintaining the overall efficiency.

TABLE I: Quantitative comparisons on the GoPro [33] and REDS [34] datasets. Bold and Underlined numbers represent the best and second performance respectively, and OS denotes the One Stage (OS) methods.

Methods	GoPro [33]	REDS [34]	#Param.	OS	Events
Methods	PSNR $\uparrow$ /SSIM $\uparrow$	PSNR $\uparrow$ /SSIM $\uparrow$	#Param.	OS	Events
MPR+DAT	26.86/0.8237	19.48/0.5336	34.9M	✗	✗
MPR+CAT	27.03/0.8266	19.62/0.5381	36.7M	✗	✗
MPR+SwIR	26.87/0.8229	19.57/0.5354	32.0M	✗	✗
RED+DAT	26.91/0.8164	21.67/0.5998	24.5M	✗	✓
RED+CAT	26.74/0.8166	21.62/0.5989	26.3M	✗	✓
RED+SwIR	26.69/0.8134	21.58/0.5992	21.6M	✗	✓
EF+DAT	26.88/0.8262	22.75/0.7043	23.3M	✗	✓
EF+CAT	27.02/0.8285	21.50/0.6931	25.1M	✗	✓
EF+SwIR	26.81/0.8253	21.49/0.6806	20.4M	✗	✓
eSL-Net	26.01/0.7818	19.82/0.5386	0.1M	✓	✓
EBSR-Net	30.90/0.8969	26.61/0.7629	7.3M	✓	✓

The qualitative comparisons are shown in Fig. 3. The results estimated by the two-stage cascade architecture, e.g., RED [29]+DAT [8], suffer artifacts and distortions owing to the accumulated errors, leading to significant degradation of the overall quality. Furthermore, one-stage methods like eSL-Net [26] exhibit performance degradation in complex scenarios, attributed to limitations imposed by sparse coding. In contrast, our EBSR-Net achieves accurate reconstructions closely resembling the HR ground-truth sharp images.

TABLE II: The Ablation Experimental Results of our EBSR-Net.

Models	MCER	SCMA	IRG	GoPro [33]		REDS [34]
Models	MCER	SCMA	IRG	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$
#0	✓			27.07	0.8165	24.04	0.6756
#1		✓		27.66	0.8336	23.87	0.6725
#2			✓	27.42	0.8247	22.45	0.6232
#3	✓	✓		29.67	0.8810	25.02	0.7181
#4	✓		✓	27.79	0.8433	24.07	0.6810
#5		✓	✓	28.07	0.8486	24.34	0.6992
#6	✓	✓	✓	30.90	0.8969	26.61	0.7629

III-C Ablation Study

In order to verify the effectiveness of the key components in our EBSR-Net, ablation experiments of the MCER, SCMA, and IRG modules are conducted, as shown in Tab. II. All models are trained on the same experimental environment and equipment, and we replace the modules with corresponding convolutional layers for a fair comparison. Specifically, removal of the MCER (Case 1, 2, 5), SCMA (Case 0, 2, 4), and IRD (Case 0, 1, 3) modules respectively lead to a degradation of 3.12/3.28/2.91 dB in PSRN and 0.0795/0.0858/0.0722 in SSIM. Furthermore, comparing (b) with (e) and (f) with (h), (c) with (f) and (g) with (h), and (d) with (g) and (e) with (h) demonstrate that EBSR-Net with the SCMA, IRG, and MCER modules respectively give sharper results than network without them. The improvement in both quantitative and qualitative results validates the effectiveness of the proposed three modules.

IV Conclusion

In this letter, we present EBSR-Net, an event-based one-stage architecture for recovering HR sharp images from the LR blurry images. We introduce an innovative event representation method, i.e., MCER, which comprehensively captures intra-frame motion information through a multi-scale center-surround structure in the temporal domain. The SCMA and IGR modules are presented to achieve effective image restoration by symmetrically querying multimodal feature and facilitating inter-block feature aggregation. Extensive experimental results show that our method compares favorably against state-of-the-art methods and achieves remarkable performance.

References

[1] Q. Wang, T. Han, Z. Qin, J. Gao, and X. Li, “Multitask attention network for lane detection and fitting,” IEEE TNNLS, vol. 33, no. 3, pp. 1066–1078, 2022.
[2] Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, and X. You, “Few-shot object detection: Research advances and challenges,” Information Fusion, vol. 107, p. 102307, 2024.
[3] Z. Wu, J. Wen, Y. Xu, J. Yang, X. Li, and D. Zhang, “Enhanced spatial feature learning for weakly supervised object detection,” IEEE TNNLS, vol. 35, no. 1, pp. 961–972, 2024.
[4] Y. Wu, L. Wang, L. Zhang, Y. Bai, Y. Cai, S. Wang, and Y. Li, “Improving autonomous detection in dynamic environments with robust monocular thermal slam system,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 203, pp. 265–284, 2023.
[5] Y. Ge, L. Zhang, Y. Wu, and D. Hu, “Pipo-slam: Lightweight visual-inertial slam with preintegration merging theory and pose-only descriptions of multiple view geometry,” IEEE Transactions on Robotics, vol. 40, pp. 2046–2059, 2024.
[6] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in ICCV, 2021, pp. 1833–1844.
[7] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yuan et al., “Cross aggregation transformer for image restoration,” NeurIPS, vol. 35, pp. 25 478–25 490, 2022.
[8] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yang, and F. Yu, “Dual aggregation transformer for image super-resolution,” in ICCV, 2023, pp. 12 312–12 321.
[9] H. Park and K. Mu Lee, “Joint estimation of camera pose, depth, deblurring, and super-resolution from a blurred image sequence,” in ICCV, 2017, pp. 4613–4621.
[10] G. Han, M. Wang, H. Zhu, and C. Lin, “Mpdnet: An underwater image deblurring framework with stepwise feature refinement module,” Engineering Applications of Artificial Intelligence, vol. 126, p. 106822, 2023.
[11] H. Jung, Y. Kim, H. Jang, N. Ha, and K. Sohn, “Multi-task learning framework for motion estimation and dynamic scene deblurring,” IEEE TIP, vol. 30, pp. 8170–8183, 2021.
[12] A. Singh, F. Porikli, and N. Ahuja, “Super-resolving noisy images,” in CVPR, 2014, pp. 2846–2853.
[13] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018, pp. 3262–3271.
[14] N. Fang and Z. Zhan, “High-resolution optical flow and frame-recurrent network for video super-resolution and deblurring,” Neurocomputing, vol. 489, pp. 128–138, 2022.
[15] J. Liang, K. Zhang, S. Gu, L. Van Gool, and R. Timofte, “Flow-based kernel prior with application to blind super-resolution,” in CVPR, 2021, pp. 10 601–10 610.
[16] W. Niu, K. Zhang, W. Luo, and Y. Zhong, “Blind motion deblurring super-resolution: When dynamic spatio-temporal learning meets static image understanding,” IEEE TIP, vol. 30, pp. 7101–7111, 2021.
[17] S. Nah, S. Son, S. Lee, R. Timofte, and K. M. Lee, “Ntire 2021 challenge on image deblurring,” in CVPR, 2021, pp. 149–165.
[18] J. Pan, H. Bai, J. Dong, J. Zhang, and J. Tang, “Deep blind video super-resolution,” in ICCV, 2021, pp. 4811–4820.
[19] J.-S. Yun, M. H. Kim, H.-I. Kim, and S. B. Yoo, “Kernel adaptive memory network for blind video super-resolution,” Expert Systems with Applications, vol. 238, p. 122252, 2024.
[20] H. Bai and J. Pan, “Self-supervised deep blind video super-resolution,” IEEE TPAMI, 2024.
[21] X. Li, W. Zuo, and C. C. Loy, “Learning generative structure prior for blind text image super-resolution,” in CVPR, 2023, pp. 10 103–10 113.
[22] J. Chen, B. Li, and X. Xue, “Scene text telescope: Text-focused scene image super-resolution,” in CVPR, 2021, pp. 12 026–12 035.
[23] X. Li, C. Chen, X. Lin, W. Zuo, and L. Zhang, “From face to natural image: Learning real degradation for blind image super-resolution,” in ECCV. Springer, 2022, pp. 376–392.
[24] D. Zhang, Z. Liang, and J. Shao, “Joint image deblurring and super-resolution with attention dual supervised network,” Neurocomputing, vol. 412, pp. 187–196, 2020.
[25] T. Barman and B. Deka, “A deep learning-based joint image super-resolution and deblurring framework,” IEEE Transactions on Artificial Intelligence, 2023.
[26] B. Wang, J. He, L. Yu, G.-S. Xia, and W. Yang, “Event enhanced high-quality image recovery,” in ECCV. Springer, 2020, pp. 155–171.
[27] J. Han, Y. Yang, C. Zhou, C. Xu, and B. Shi, “Evintsr-net: Event guided multiple latent frames reconstruction and super-resolution,” in ICCV, 2021, pp. 4882–4891.
[28] L. Yu, B. Wang, X. Zhang, H. Zhang, W. Yang, J. Liu, and G.-S. Xia, “Learning to super-resolve blurry images with events,” IEEE TPAMI, 2023.
[29] F. Xu, L. Yu, B. Wang, W. Yang, G.-S. Xia, X. Jia, Z. Qiao, and J. Liu, “Motion deblurring with real events,” in ICCV, 2021, pp. 2583–2592.
[30] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021, pp. 10 012–10 022.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
[33] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, 2017, pp. 3883–3891.
[34] S. Nah, S. Baik, S. Hong, G. Moon, S. Son, R. Timofte, and K. Mu Lee, “Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study,” in CVPRW, 2019, pp. 1974–1984.
[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.
[37] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in CVPR, 2021.
[38] L. Sun, C. Sakaridis, J. Liang, Q. Jiang, K. Yang, P. Sun, Y. Ye, K. Wang, and L. V. Gool, “Event-based fusion for motion deblurring with cross-modal attention,” in ECCV. Springer, 2022, pp. 412–428.