Super-Resolving Blurry Images with Events
Abstract
Super-resolution from motion-blurred images poses a significant challenge due to the combined effects of motion blur and low spatial resolution. To address this challenge, this paper introduces an Event-based Blurry Super Resolution Network (EBSR-Net), which leverages the high temporal resolution of events to mitigate motion blur and improve high-resolution image prediction. Specifically, we propose a multi-scale center-surround event representation to fully capture motion and texture information inherent in events. Additionally, we design a symmetric cross-modal attention module to fully exploit the complementarity between blurry images and events. Furthermore, we introduce an intermodal residual group composed of several residual dense Swin Transformer blocks, each incorporating multiple Swin Transformer layers and a residual connection, to extract global context and facilitate inter-block feature aggregation. Extensive experiments show that our method compares favorably against state-of-the-art approaches and achieves remarkable performance.
Index Terms:
Motion Deblurring, Super-Resolution, Event Camera

I Introduction
Motion blur often leads to significant performance degradation in Super Resolution (SR) tasks: it introduces motion ambiguities and erases texture, posing a substantial challenge for downstream tasks such as autonomous driving [1], visual detection and tracking [2, 3], and visual SLAM [4, 5].
While promising progress has been reported in image SR over the past decade [6, 7, 8, 9], few studies have addressed scenarios involving blurry textures and diverse motion patterns. Consequently, existing SR methods often lose effectiveness when handling motion-blurred images in real-world dynamic scenes. Although image SR [6, 7, 8] and motion deblurring [10, 11] have been investigated separately for decades, each yielding promising results, simply inserting a motion deblurring module into an image SR architecture may either exacerbate artifacts or compromise detailed information [12, 13] due to cascading errors. Compared to such cascading methods, recent advances in single-image SR from motion-blurred images have shown that resolving motion ambiguities jointly with SR can significantly enhance effectiveness [14, 15, 16], despite the inherently ill-posed nature of this task [17]. While kernel-based methods have shown promise in addressing motion-blurred image SR under the assumption of uniform motion [13, 18, 19], real-world scenarios often involve non-uniform motions, such as those caused by non-rigid or moving objects, which violate this assumption. To tackle this issue, various strategies have emerged, including motion flow estimation from video sequences [9, 20] and end-to-end deep neural networks [21, 22, 23, 24, 25]. However, these approaches are often specialized for specific domains, such as faces [21, 22] or text [23], or rely heavily on the performance of the deblurring submodule [24, 25], limiting their applicability to general image SR for natural scenes with complex motions.
Recently, several studies have highlighted the advantages of event cameras for Motion-blurred Image Super-Resolution (MSR) in scenes with complex motions [26, 27, 28]. Event cameras provide asynchronous event data with extremely low latency (on the order of microseconds), which proves effective in recovering accurate sharp details even under non-linear motions and in preserving high-resolution information thanks to their high temporal resolution. However, existing methods often suffer performance degradation in more complex scenarios due to the limitations imposed by sparse coding [26, 28], as well as accumulated errors in the multi-stage training procedure [27].
In this paper, to address the aforementioned issues, we introduce a novel Event-based Blurry Super Resolution Network (EBSR-Net), a one-stage architecture aimed at directly recovering an HR sharp image from a motion-blurred LR image across diverse scenarios. We revisit and formulate the MSR task in Sec. II-A, exploring how events can be leveraged to mitigate the ill-posed problem. In Sec. II-B, we first introduce a novel Multi-scale Center-surround Event Representation (MCER) module to fully exploit intra-frame motion information and extract the multi-scale textures embedded in events. Then, a Symmetric Cross-Modal Attention (SCMA) module is presented to effectively attend to cross-modal features for subsequent tasks through symmetric querying between frames and events. Furthermore, we design an Intermodal Residual Group (IRG) module consisting of several residual dense Swin Transformer blocks and a residual connection to facilitate inter-block feature aggregation and extract global context. Overall, the contributions of this paper are three-fold:
1. We propose a novel event-based approach, i.e., EBSR-Net, for single blurry image SR, harnessing cross-modal information between blurry frames and events to reconstruct HR sharp images across diverse scenarios within a one-stage architecture.
2. We propose an innovative event representation method, i.e., MCER, which comprehensively captures intra-frame motion information through a multi-scale center-surround structure in the temporal domain.
3. We employ the SCMA and IRG modules to achieve effective image restoration by symmetrically querying multimodal features and facilitating inter-block feature aggregation.
II Methods
II-A Problem Formulation
Due to the imperfection of image sensors, the captured image $\mathbf{B}$ may suffer from non-negligible quality degradation, including motion blur and low spatial resolution, which can be related to the high-quality (sharp and high-spatial-resolution) latent image $\mathbf{H}(t)$ as follows:

$$\mathbf{B} = \frac{1}{T}\int_{\mathcal{T}} \mathcal{D}\big(\mathbf{H}(t)\big)\,\mathrm{d}t \qquad (1)$$

where $\mathcal{T}$ denotes the exposure interval of $\mathbf{B}$, $T$ is the duration of $\mathcal{T}$, and $\mathcal{D}(\cdot)$ represents the down-sampling operator that produces the Low Resolution (LR) sharp image from $\mathbf{H}(t)$. Thus the Motion-blurred image Super Resolution (MSR) task can be formulated as:

$$\hat{\mathbf{H}}(t) = \mathrm{MSR}(\mathbf{B}), \quad t \in \mathcal{T}. \qquad (2)$$
It is obvious that recovering a High-Resolution (HR) sharp image from a single LR blurry image is a severely ill-posed problem. While significant progress has been reported in MSR techniques [13, 18, 23], these approaches are often specialized for specific domains (e.g., face or text) or rely heavily on the strong assumption of uniform motion, thereby limiting their applicability in natural scenarios with complex motions.
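For intuition, the following NumPy sketch discretizes Eq. (1): the blurry LR observation is obtained by averaging down-sampled sharp latent frames over the exposure window. The block-averaging stand-in for the down-sampling operator, the 4× scale factor, and the 13 synthetic frames are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def downsample(img, scale=4):
    """A simple stand-in for the down-sampling operator D(.): block averaging."""
    h, w = img.shape
    return img[:h - h % scale, :w - w % scale] \
        .reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def blur_lr_from_latent(latent_frames, scale=4):
    """Simulate Eq. (1): temporal mean of down-sampled sharp latent frames."""
    lr = np.stack([downsample(f, scale) for f in latent_frames], axis=0)
    return lr.mean(axis=0)  # discretized (1/T) * integral over the exposure

# toy usage: 13 synthetic HR latent frames of size 256x256
frames = [np.random.rand(256, 256).astype(np.float32) for _ in range(13)]
blurry_lr = blur_lr_from_latent(frames)  # shape (64, 64)
```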
Recently, many methods [26, 28, 27] utilize events to tackle the MSR task due to their low latency. An event is triggered at a pixel whenever the log-scale brightness change exceeds the event threshold $c$, i.e.,

$$p = \begin{cases} +1, & \log \mathbf{I}(\mathbf{x}, t) - \log \mathbf{I}(\mathbf{x}, t - \Delta t) \geq c, \\ -1, & \log \mathbf{I}(\mathbf{x}, t) - \log \mathbf{I}(\mathbf{x}, t - \Delta t) \leq -c, \end{cases} \qquad (3)$$

where $\mathbf{I}(\mathbf{x}, t)$ and $\mathbf{I}(\mathbf{x}, t - \Delta t)$ denote the instantaneous intensity at times $t$ and $t - \Delta t$ at the pixel position $\mathbf{x}$, and the polarity $p$ indicates the direction of brightness changes. Hence, the Event-based MSR (E-MSR) task can be represented as:

$$\hat{\mathbf{H}}(t) = \mathrm{EMSR}(\mathbf{B}, \mathcal{E}), \quad t \in \mathcal{T}, \qquad (4)$$
where $\mathcal{E}$ denotes the event stream emitted during the exposure interval of $\mathbf{B}$. However, existing approaches [28, 27] often suffer performance degradation in more complex scenarios owing to the limitations imposed by sparse coding [28] and accumulated errors in the multi-stage training procedure [27]. To address these problems, we introduce a novel Event-based Blurry Super Resolution Network (EBSR-Net), a one-stage architecture aimed at directly recovering $\hat{\mathbf{H}}(t)$ from $\mathbf{B}$ and the concurrent event stream $\mathcal{E}$ across various challenging scenarios.
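As a toy illustration of the event model in Eq. (3), the sketch below emits events from a sequence of latent frames whenever the per-pixel log-intensity change exceeds a threshold; the threshold value, the frame-based sampling, and the reference-reset strategy are assumptions of this sketch, not specifics of any particular event camera.

```python
import numpy as np

def simulate_events(frames, timestamps, c=0.2, eps=1e-3):
    """Toy event generator following Eq. (3): an event fires whenever the
    log-intensity change at a pixel exceeds the threshold c (assumed value)."""
    log_ref = np.log(frames[0] + eps)              # per-pixel reference log intensity
    events = []                                    # list of (x, y, t, p) tuples
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_cur = np.log(frame + eps)
        diff = log_cur - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= c)
        for y, x in zip(ys, xs):
            p = 1 if diff[y, x] > 0 else -1        # polarity: direction of change
            events.append((int(x), int(y), float(t), p))
            log_ref[y, x] = log_cur[y, x]          # reset reference once an event fires
    return events

# toy usage on synthetic latent frames spanning the exposure interval
frames = [np.random.rand(64, 64).astype(np.float32) for _ in range(13)]
events = simulate_events(frames, np.linspace(0.0, 1.0, 13))
```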

II-B Overall Architecture
The overall architecture of our proposed Event-based Blurry Super Resolution Network (EBSR-Net) is illustrated in Fig. 2 (a), which is a deep network mainly consisting of a Multi-scale Center-surround Event Representation (MCER) module, a Symmetric Cross-Modal Attention (SCMA) module, and an Intermodal Residual Group (IRG) module.
Multi-scale Center-surround Event Representation. The blur degree of a motion-blurred image often varies significantly due to the diverse speeds of the camera or object motion during exposure. To ensure robustness across various scenes with complex motion, we introduce the novel Multi-scale Center-surround Event Representation (MCER) module. This module comprehensively captures intra-frame motion at multiple temporal scales, thereby strengthening the resilience of the event representation. Specifically, we encode the event stream $\mathcal{E}$ into window-dependent representation frames $\mathbf{E}_i$, denoted as

$$\mathcal{E}_i = \{ e_k \in \mathcal{E} \mid |t_k - t_m| \leq \tau_i / 2 \}, \qquad (5)$$

$$\mathbf{E}_i = \mathcal{R}(\mathcal{E}_i), \quad i = 1, \dots, N, \qquad (6)$$

where $t_m$ represents the middle point of the exposure time $\mathcal{T}$, $\tau_i$ determines the length of the $i$-th interval, and $\mathcal{R}(\cdot)$ denotes the event encoding operator. According to Eqs. 5 and 6, the event representation results with different $\tau_i$ encode motion information across multiple temporal scales. Additionally, we employ the Event Count Map and Time Surface approaches [29] to quantize intra-frame motion information.
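As a rough illustration of how such window-dependent representation frames might be formed, the sketch below bins events into nested windows centered at the exposure midpoint and encodes each window as an event count map. The scale factors and the count-map-only encoding are assumptions of this sketch; the actual MCER module also uses Time Surfaces [29].

```python
import numpy as np

def event_count_map(events, height, width):
    """Accumulate signed polarities of the given events into a 2-D map."""
    m = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:
        m[y, x] += p
    return m

def mcer(events, t_mid, exposure, height, width, scales=(1.0, 0.5, 0.25)):
    """Sketch of the center-surround idea: bin events in nested windows of
    length tau_i = scale_i * exposure, all centered at the midpoint t_mid."""
    reps = []
    for s in scales:
        tau = s * exposure                                    # window length tau_i
        win = [e for e in events if abs(e[2] - t_mid) <= tau / 2.0]
        reps.append(event_count_map(win, height, width))
    return np.stack(reps, axis=0)                             # (num_scales, H, W)

# toy usage with the (x, y, t, p) events from the previous sketch
rep = mcer(events, t_mid=0.5, exposure=1.0, height=64, width=64)
```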

Symmetric Cross-Modal Attention. Compared to existing simple fusion methods between frames and events [27, 29], our proposed Symmetric Cross-Modal Attention (SCMA) module fully exploits the complementary characteristics of multimodal information. Through symmetric querying of the multimodal data, it adaptively extracts enhanced features for subsequent tasks. As illustrated in Fig. 2 (b), the SCMA module takes the features $\mathbf{F}_B$ of the blurry image and $\mathbf{F}_E$ of the events as inputs, yielding the symmetric fusion feature $\mathbf{F}_{fus}$. The input features are obtained by encoders consisting of conventional convolutional layers that receive the blurry image $\mathbf{B}$ and the event representation $\mathbf{E}$.
Unlike conventional self-attention blocks that typically compute queries, keys, and values exclusively from either the frame or event branch of the network, our SCMA leverages multimodal information between frames and events. Queries are calculated from both images and events, with keys and values obtained from the opposite modality. SCMA consists of two parallel self-attention structures and a combination operation, formulated as:
$$\mathbf{A}_B = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}_B \mathbf{K}_E^{\top}}{\sqrt{d}}\right)\mathbf{V}_E, \quad \mathbf{A}_E = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}_E \mathbf{K}_B^{\top}}{\sqrt{d}}\right)\mathbf{V}_B, \qquad (7)$$

where $\top$ represents the transpose operator. Note that $\mathbf{Q}_B$, $\mathbf{K}_B$, $\mathbf{V}_B$ and $\mathbf{Q}_E$, $\mathbf{K}_E$, $\mathbf{V}_E$ are produced through operations involving normalization and 1×1 convolutional layers. Specifically, $\mathbf{Q}_B$ and $\mathbf{Q}_E$ are derived from $\mathbf{F}_B$ and $\mathbf{F}_E$ respectively, similar to ($\mathbf{K}_B$, $\mathbf{V}_B$) and ($\mathbf{K}_E$, $\mathbf{V}_E$). Additionally, the adaptively symmetric fusion-based output $\mathbf{F}_{fus}$ can be calculated by

$$\mathbf{F}_{fus} = \mathcal{C}\big(\mathbf{F}'_B, \mathbf{F}'_E\big), \qquad (8)$$

where $\mathbf{F}'_B$ and $\mathbf{F}'_E$ represent intermediate features from the image and event branches, respectively, and $\mathcal{C}(\cdot)$ denotes the combination operation. These intermediate features are generated by reshaping the results that combine the attention outputs and the original features, followed by a Multi-Layer Perceptron (MLP) layer.
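The PyTorch sketch below captures the symmetric-querying pattern of Eqs. (7) and (8): each modality's queries attend to the other modality's keys and values, and the two branches are then combined. It substitutes nn.MultiheadAttention and a linear fusion for the 1×1-convolution-based projections and the combination operation of the actual module, so the layer choices and sizes should be read as assumptions.

```python
import torch
import torch.nn as nn

class SymmetricCrossModalAttention(nn.Module):
    """Minimal sketch of the SCMA idea: each modality queries the other."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm_b = nn.LayerNorm(dim)
        self.norm_e = nn.LayerNorm(dim)
        # image queries attend to event keys/values, and vice versa
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp_b = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.mlp_e = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fuse = nn.Linear(2 * dim, dim)  # combination of the two branches

    def forward(self, f_b, f_e):
        # f_b, f_e: (batch, tokens, dim) features of the blurry image and the events
        b, e = self.norm_b(f_b), self.norm_e(f_e)
        a_b, _ = self.attn_b(query=b, key=e, value=e)   # Q from image, K/V from events
        a_e, _ = self.attn_e(query=e, key=b, value=b)   # Q from events, K/V from image
        f_b2 = f_b + self.mlp_b(a_b)                    # residual + MLP per branch
        f_e2 = f_e + self.mlp_e(a_e)
        return self.fuse(torch.cat([f_b2, f_e2], dim=-1))  # fused feature

# usage: 16 tokens with 64 channels from each modality
x_img = torch.randn(2, 16, 64)
x_evt = torch.randn(2, 16, 64)
fused = SymmetricCrossModalAttention(64)(x_img, x_evt)   # (2, 16, 64)
```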
Intermodal Residual Group. The SCMA module extracts multimodal information but does not provide sufficiently deep features for subsequent tasks. To address this, we introduce the Intermodal Residual Group (IRG) module, which fully exploits deep intermodal information through a meticulously designed group of Residual Dense Swin-Transformer Blocks (RDSTBs) [6, 30]. The IRG module, depicted in Fig. 2 (a) and (c), comprises four RDSTBs and two 3×3 convolutional layers. Each RDSTB contains four Swin Transformer Layers (STLs) with residual dense connections. Explicitly, the first RDSTB structure in the IRG module is formulated as:
$$\mathbf{F}_{1,j} = \mathrm{STL}_{1,j}(\mathbf{F}_{1,j-1}), \quad j = 1, \dots, J, \qquad (9)$$

where $\mathbf{F}_{1,j}$ represents the intermediate feature maps of the first RDSTB ($\mathbf{F}_{1,0}$ being its input feature $\mathbf{F}_{fus}$), $J$ denotes the number of STL layers, and $\mathrm{STL}(\cdot)$ [30] is based on the standard multi-head self-attention of the original transformer layer [31]. Additionally, the output $\mathbf{F}_1$ of the first RDSTB module can be obtained by

$$\mathbf{F}_1 = \mathrm{Conv}(\mathbf{F}_{1,J}) + \mathbf{F}_{1,0}. \qquad (10)$$
The final result $\mathbf{F}_{IRG}$ of the IRG module is computed using three additional RDSTBs and two 3×3 convolutional layers. Consequently, the recovered HR sharp image can be estimated by
$$\hat{\mathbf{H}} = \mathrm{Dec}(\mathbf{F}_{IRG}) + \mathrm{Up}(\mathbf{B}), \qquad (11)$$

where $\mathrm{Dec}(\cdot)$ denotes a decoder operator, and $\mathrm{Up}(\cdot)$ represents the bilinear upsampling operation.
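To make the wiring concrete, here is a minimal PyTorch sketch of the residual dense pattern suggested by Eqs. (9) and (10), stacked into a group as in Fig. 2 (c). A plain TransformerEncoderLayer stands in for the Swin Transformer Layer, and the dense concatenation scheme, layer counts, and omission of the convolutional layers are simplifying assumptions.

```python
import torch
import torch.nn as nn

class RDSTB(nn.Module):
    """Sketch of a residual dense Swin-Transformer block: several transformer
    layers with dense (concatenating) connections and a residual path."""

    def __init__(self, dim, num_layers=4, heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
             for _ in range(num_layers)])
        # dense connections: each layer also sees all previous outputs
        self.reduce = nn.ModuleList(
            [nn.Linear(dim * (i + 1), dim) for i in range(num_layers)])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        feats = [x]
        for layer, reduce in zip(self.layers, self.reduce):
            inp = reduce(torch.cat(feats, dim=-1))   # aggregate earlier features
            feats.append(layer(inp))
        return x + self.proj(feats[-1])              # residual connection, cf. Eq. (10)

class IRG(nn.Module):
    """Sketch of the Intermodal Residual Group: a stack of RDSTBs with an
    outer residual connection; the convolutional layers of Fig. 2 are omitted."""

    def __init__(self, dim, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(*[RDSTB(dim) for _ in range(num_blocks)])

    def forward(self, x):
        return x + self.blocks(x)

feat = torch.randn(2, 16, 64)          # fused SCMA feature as (batch, tokens, channels)
deep = IRG(64)(feat)                   # deep intermodal feature for the decoder
```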
II-C Loss Function
We supervise the overall architecture by using a pixel-wise reconstruction loss $\mathcal{L}_{rec}$ and the perceptual similarity loss $\mathcal{L}_{per}$ [32] for better visual quality, which can be formulated as

$$\mathcal{L} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{per}, \qquad (12)$$

where $\lambda_1$ and $\lambda_2$ are the balancing parameters.
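A minimal sketch of Eq. (12) using the LPIPS implementation of [32] is given below; the pixel-wise term is assumed to be an L1 loss, which the excerpt does not specify, and the weights follow Sec. III-A.

```python
import torch
import lpips  # LPIPS perceptual similarity of Zhang et al. [32], pip install lpips

perceptual = lpips.LPIPS(net='alex')  # expects (N, 3, H, W) tensors scaled to [-1, 1]

def total_loss(pred, target, lam1=1.0, lam2=0.1):
    """Sketch of Eq. (12): weighted pixel-wise term plus perceptual term."""
    rec = torch.nn.functional.l1_loss(pred, target)   # pixel-wise term (assumed L1)
    per = perceptual(pred, target).mean()             # perceptual similarity term
    return lam1 * rec + lam2 * per
```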
III Experiment

III-A Datasets and Implementation
The proposed EBSR-Net model, implemented in PyTorch on an NVIDIA GeForce RTX 3090, is trained separately on the training sets of the GoPro [33] and REDS [34] datasets, using simulated events and blurry images following [29]. Evaluations are then conducted separately on the testing sets of these two datasets. We use the ADAM optimizer [35], and the learning rate decays exponentially every epoch. The weighting factors $\lambda_1$ and $\lambda_2$ in Eq. 12 are set to 1 and 0.1, respectively. We use Structural SIMilarity (SSIM) [36] and Peak Signal-to-Noise Ratio (PSNR) as performance metrics.
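For reference, PSNR and SSIM of this form can be computed with scikit-image as in the sketch below; the [0, 1] value range and the library choice are assumptions, not details from the paper.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """Compute the PSNR/SSIM metrics used in Tab. I (float RGB images in [0, 1])."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
    return psnr, ssim

# toy usage with a slightly perturbed copy as the "prediction"
gt = np.random.rand(128, 128, 3).astype(np.float32)
pred = np.clip(gt + 0.01 * np.random.randn(128, 128, 3).astype(np.float32), 0, 1)
print(evaluate(pred, gt))
```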
III-B Quantitative and Qualitative Evaluation
We compare our EBSR-Net with state-of-the-art Motion Deblurring (MD) methods, including MPR [37], RED [29], and EF [38], and Image Super-Resolution (ISR) approaches, including SwinIR [6], CAT [7], and DAT [8], on both the GoPro [33] and REDS [34] datasets. The comparison methods are combined into two-stage cascades in which an MD method is applied first, followed by an ISR method, denoted as MPR+DAT, and so on. Additionally, a one-stage architecture, i.e., eSL-Net [26], is directly used for comparison via its official code with default parameters. According to the quantitative results shown in Tab. I, our EBSR-Net outperforms state-of-the-art methods by a large margin, achieving an average improvement of 4.12/5.70 dB in PSNR and 0.0788/0.1607 in SSIM on the GoPro and REDS datasets, respectively. Meanwhile, our one-stage model contains only 7.3M network parameters, far fewer than all compared methods except eSL-Net. Note that eSL-Net requires 122.8G FLOPs to infer a 160×320 image due to its recursive structure, while our model needs only 41.2G FLOPs, maintaining overall efficiency.
Methods | GoPro [33] PSNR/SSIM | REDS [34] PSNR/SSIM | #Param. | OS | Events
MPR+DAT | 26.86/0.8237 | 19.48/0.5336 | 34.9M | ✗ | ✗ |
MPR+CAT | 27.03/0.8266 | 19.62/0.5381 | 36.7M | ✗ | ✗ |
MPR+SwinIR | 26.87/0.8229 | 19.57/0.5354 | 32.0M | ✗ | ✗
RED+DAT | 26.91/0.8164 | 21.67/0.5998 | 24.5M | ✗ | ✓ |
RED+CAT | 26.74/0.8166 | 21.62/0.5989 | 26.3M | ✗ | ✓ |
RED+SwinIR | 26.69/0.8134 | 21.58/0.5992 | 21.6M | ✗ | ✓
EF+DAT | 26.88/0.8262 | 22.75/0.7043 | 23.3M | ✗ | ✓ |
EF+CAT | 27.02/0.8285 | 21.50/0.6931 | 25.1M | ✗ | ✓ |
EF+SwinIR | 26.81/0.8253 | 21.49/0.6806 | 20.4M | ✗ | ✓
eSL-Net | 26.01/0.7818 | 19.82/0.5386 | 0.1M | ✓ | ✓ |
EBSR-Net | 30.90/0.8969 | 26.61/0.7629 | 7.3M | ✓ | ✓ |
The qualitative comparisons are shown in Fig. 3. The results estimated by two-stage cascade architectures, e.g., RED [29]+DAT [8], suffer from artifacts and distortions owing to accumulated errors, leading to significant degradation of overall quality. Furthermore, one-stage methods such as eSL-Net [26] exhibit performance degradation in complex scenarios, attributed to the limitations imposed by sparse coding. In contrast, our EBSR-Net achieves accurate reconstructions closely resembling the HR ground-truth sharp images.

III-C Ablation Study
To verify the effectiveness of the key components of our EBSR-Net, ablation experiments on the MCER, SCMA, and IRG modules are conducted, as shown in Tab. II. All models are trained in the same experimental environment on the same equipment, and we replace the ablated modules with corresponding convolutional layers for a fair comparison. Specifically, removal of the MCER (Case 1, 2, 5), SCMA (Case 0, 2, 4), and IRG (Case 0, 1, 3) modules leads to degradations of 3.12/3.28/2.91 dB in PSNR and 0.0795/0.0858/0.0722 in SSIM, respectively. Furthermore, comparing (b) with (e) and (f) with (h), (c) with (f) and (g) with (h), and (d) with (g) and (e) with (h) demonstrates that EBSR-Net with the SCMA, IRG, and MCER modules, respectively, gives sharper results than the network without them. The improvement in both quantitative and qualitative results validates the effectiveness of the three proposed modules.
IV Conclusion
In this letter, we present EBSR-Net, an event-based one-stage architecture for recovering HR sharp images from LR blurry images. We introduce an innovative event representation method, i.e., MCER, which comprehensively captures intra-frame motion information through a multi-scale center-surround structure in the temporal domain. The SCMA and IRG modules are presented to achieve effective image restoration by symmetrically querying multimodal features and facilitating inter-block feature aggregation. Extensive experimental results show that our method compares favorably against state-of-the-art methods and achieves remarkable performance.
References
- [1] Q. Wang, T. Han, Z. Qin, J. Gao, and X. Li, “Multitask attention network for lane detection and fitting,” IEEE TNNLS, vol. 33, no. 3, pp. 1066–1078, 2022.
- [2] Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, and X. You, “Few-shot object detection: Research advances and challenges,” Information Fusion, vol. 107, p. 102307, 2024.
- [3] Z. Wu, J. Wen, Y. Xu, J. Yang, X. Li, and D. Zhang, “Enhanced spatial feature learning for weakly supervised object detection,” IEEE TNNLS, vol. 35, no. 1, pp. 961–972, 2024.
- [4] Y. Wu, L. Wang, L. Zhang, Y. Bai, Y. Cai, S. Wang, and Y. Li, “Improving autonomous detection in dynamic environments with robust monocular thermal slam system,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 203, pp. 265–284, 2023.
- [5] Y. Ge, L. Zhang, Y. Wu, and D. Hu, “Pipo-slam: Lightweight visual-inertial slam with preintegration merging theory and pose-only descriptions of multiple view geometry,” IEEE Transactions on Robotics, vol. 40, pp. 2046–2059, 2024.
- [6] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in ICCV, 2021, pp. 1833–1844.
- [7] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yuan et al., “Cross aggregation transformer for image restoration,” NeurIPS, vol. 35, pp. 25478–25490, 2022.
- [8] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yang, and F. Yu, “Dual aggregation transformer for image super-resolution,” in ICCV, 2023, pp. 12312–12321.
- [9] H. Park and K. Mu Lee, “Joint estimation of camera pose, depth, deblurring, and super-resolution from a blurred image sequence,” in ICCV, 2017, pp. 4613–4621.
- [10] G. Han, M. Wang, H. Zhu, and C. Lin, “Mpdnet: An underwater image deblurring framework with stepwise feature refinement module,” Engineering Applications of Artificial Intelligence, vol. 126, p. 106822, 2023.
- [11] H. Jung, Y. Kim, H. Jang, N. Ha, and K. Sohn, “Multi-task learning framework for motion estimation and dynamic scene deblurring,” IEEE TIP, vol. 30, pp. 8170–8183, 2021.
- [12] A. Singh, F. Porikli, and N. Ahuja, “Super-resolving noisy images,” in CVPR, 2014, pp. 2846–2853.
- [13] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018, pp. 3262–3271.
- [14] N. Fang and Z. Zhan, “High-resolution optical flow and frame-recurrent network for video super-resolution and deblurring,” Neurocomputing, vol. 489, pp. 128–138, 2022.
- [15] J. Liang, K. Zhang, S. Gu, L. Van Gool, and R. Timofte, “Flow-based kernel prior with application to blind super-resolution,” in CVPR, 2021, pp. 10601–10610.
- [16] W. Niu, K. Zhang, W. Luo, and Y. Zhong, “Blind motion deblurring super-resolution: When dynamic spatio-temporal learning meets static image understanding,” IEEE TIP, vol. 30, pp. 7101–7111, 2021.
- [17] S. Nah, S. Son, S. Lee, R. Timofte, and K. M. Lee, “Ntire 2021 challenge on image deblurring,” in CVPR, 2021, pp. 149–165.
- [18] J. Pan, H. Bai, J. Dong, J. Zhang, and J. Tang, “Deep blind video super-resolution,” in ICCV, 2021, pp. 4811–4820.
- [19] J.-S. Yun, M. H. Kim, H.-I. Kim, and S. B. Yoo, “Kernel adaptive memory network for blind video super-resolution,” Expert Systems with Applications, vol. 238, p. 122252, 2024.
- [20] H. Bai and J. Pan, “Self-supervised deep blind video super-resolution,” IEEE TPAMI, 2024.
- [21] X. Li, W. Zuo, and C. C. Loy, “Learning generative structure prior for blind text image super-resolution,” in CVPR, 2023, pp. 10103–10113.
- [22] J. Chen, B. Li, and X. Xue, “Scene text telescope: Text-focused scene image super-resolution,” in CVPR, 2021, pp. 12026–12035.
- [23] X. Li, C. Chen, X. Lin, W. Zuo, and L. Zhang, “From face to natural image: Learning real degradation for blind image super-resolution,” in ECCV. Springer, 2022, pp. 376–392.
- [24] D. Zhang, Z. Liang, and J. Shao, “Joint image deblurring and super-resolution with attention dual supervised network,” Neurocomputing, vol. 412, pp. 187–196, 2020.
- [25] T. Barman and B. Deka, “A deep learning-based joint image super-resolution and deblurring framework,” IEEE Transactions on Artificial Intelligence, 2023.
- [26] B. Wang, J. He, L. Yu, G.-S. Xia, and W. Yang, “Event enhanced high-quality image recovery,” in ECCV. Springer, 2020, pp. 155–171.
- [27] J. Han, Y. Yang, C. Zhou, C. Xu, and B. Shi, “Evintsr-net: Event guided multiple latent frames reconstruction and super-resolution,” in ICCV, 2021, pp. 4882–4891.
- [28] L. Yu, B. Wang, X. Zhang, H. Zhang, W. Yang, J. Liu, and G.-S. Xia, “Learning to super-resolve blurry images with events,” IEEE TPAMI, 2023.
- [29] F. Xu, L. Yu, B. Wang, W. Yang, G.-S. Xia, X. Jia, Z. Qiao, and J. Liu, “Motion deblurring with real events,” in ICCV, 2021, pp. 2583–2592.
- [30] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021, pp. 10012–10022.
- [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
- [32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
- [33] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, 2017, pp. 3883–3891.
- [34] S. Nah, S. Baik, S. Hong, G. Moon, S. Son, R. Timofte, and K. Mu Lee, “Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study,” in CVPRW, 2019, pp. 1974–1984.
- [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- [36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.
- [37] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in CVPR, 2021.
- [38] L. Sun, C. Sakaridis, J. Liang, Q. Jiang, K. Yang, P. Sun, Y. Ye, K. Wang, and L. V. Gool, “Event-based fusion for motion deblurring with cross-modal attention,” in ECCV. Springer, 2022, pp. 412–428.