
Super-Resolving Blurry Images with Events

Chi Zhang, Mingyuan Lin, Xiang Zhang, Chenxu Jiang, Lei Yu This work was partially supported by the National Natural Science Foundation of China under Grants 62271354 and 61871297.C. Zhang, M. Lin, C. Jiang, and L. Yu are with the School of Electronic Information, Wuhan University, Wuhan 430072. X. Zhang is with the Department of Computer Science, ETH Zurich, Switzerland.Corresponding author: Lei Yu ([email protected]).
Abstract

Super-resolution from motion-blurred images poses a significant challenge due to the combined effects of motion blur and low spatial resolution. To address this challenge, this paper introduces an Event-based Blurry Super Resolution Network (EBSR-Net), which leverages the high temporal resolution of events to mitigate motion blur and improve high-resolution image prediction. Specifically, we propose a multi-scale center-surround event representation to fully capture motion and texture information inherent in events. Additionally, we design a symmetric cross-modal attention module to fully exploit the complementarity between blurry images and events. Furthermore, we introduce an intermodal residual group composed of several residual dense Swin Transformer blocks, each incorporating multiple Swin Transformer layers and a residual connection, to extract global context and facilitate inter-block feature aggregation. Extensive experiments show that our method compares favorably against state-of-the-art approaches and achieves remarkable performance.

Index Terms:
Motion Deblurring, Super-Resolution, Event Camera

I Introduction

Motion blur often leads to significant performance degradation in Super Resolution (SR) tasks, introducing motion ambiguities and texture erasure that pose substantial challenges for downstream tasks, e.g., autonomous driving [1], visual detection and tracking [2, 3], and visual SLAM [4, 5].

While promising progress has been reported in image SR over the past decade [6, 7, 8, 9], few studies have addressed scenarios involving blurry textures and diverse motion patterns. Consequently, existing SR methods often lose effectiveness when handling motion-blurred images in real-world dynamic scenarios. Although the problems of image SR [6, 7, 8] and motion deblurring [10, 11] have been investigated separately for decades, each yielding promising results, simply integrating a motion deblurring module into an image SR architecture may either exacerbate artifacts or compromise detailed information [12, 13] due to cascading errors. Compared to such cascading methods, recent advances in single-image SR from motion-blurred images have shown that resolving motion ambiguities jointly with SR can significantly enhance effectiveness [14, 15, 16], despite the inherently ill-posed nature of this task [17]. While kernel-based methods have shown promise in addressing motion-blurred image SR under the assumption of uniform motion [13, 18, 19], real-world scenarios often present non-uniform motions, such as those involving non-rigid or moving objects, challenging this assumption. To tackle this issue, various strategies have emerged, including motion flow estimation from video sequences [9, 20] and end-to-end deep neural networks [21, 22, 23, 24, 25]. However, these approaches are often specialized for specific domains, such as faces [21, 22] or text [23], or rely heavily on the performance of the deblurring submodule [24, 25], limiting their applicability to general image SR tasks involving natural scenes with complex motions.

Recently, several studies have highlighted the advantages of event cameras for Motion-blurred Image Super-Resolution (MSR) in scenes with complex motions [26, 27, 28]. Event cameras provide asynchronous event data with extremely low latency (on the order of μs), whose high temporal resolution proves effective in recovering accurate sharp details even under non-linear motions and in preserving high-resolution information. However, existing methods often experience performance degradation in more complex scenarios due to limitations imposed by sparse coding [26, 28], as well as accumulated errors in the multi-stage training procedure [27].

In this paper, to address the aforementioned issues, we introduce a novel Event-based Blurry Super Resolution Network (EBSR-Net), a one-stage architecture aimed at directly recovering an HR sharp image from a motion-blurred LR image across diverse scenarios. We revisit and formulate the MSR task in Sec. II-A, exploring how events can be leveraged to mitigate the ill-posed problem. In Sec. II-B, we first introduce a novel Multi-scale Center-surround Event Representation (MCER) module to fully exploit intra-frame motion information and extract the multi-scale textures embedded in events. Then, a Symmetric Cross-Modal Attention (SCMA) module is presented to effectively attend to cross-modal features for subsequent tasks through symmetric querying between frames and events. Furthermore, we design an Intermodal Residual Group (IRG) module consisting of several Residual Dense Swin Transformer Blocks (RDSTBs) with residual connections to facilitate inter-block feature aggregation and extract global context. Overall, the contributions of this paper are three-fold:

  1. We propose a novel event-based approach, i.e., EBSR-Net, for single blurry image SR, harnessing cross-modal information between blurry frames and events to reconstruct HR sharp images across diverse scenarios within a one-stage architecture.

  2. We propose an innovative event representation method, i.e., MCER, which comprehensively captures intra-frame motion information through a multi-scale center-surround structure in the temporal domain.

  3. We employ SCMA and IRG modules to achieve effective image restoration by symmetrically querying multimodal features and facilitating inter-block feature aggregation.

II Methods

II-A Problem Formulation

Due to the imperfection of image sensors, the captured image $B$ may suffer from non-negligible quality degeneration, including motion blur and low spatial resolution, which can be related to the high-quality (sharp and high spatial resolution) latent image $L$ as follows:

$B = \frac{1}{T}\int_{t\in\mathcal{T}} I(t)\,dt, \qquad I = D^{\downarrow}(L),$   (1)

where $\mathcal{T}\triangleq[0,T]$ denotes the exposure interval of $B$, $T$ is the duration of $\mathcal{T}$, and $D^{\downarrow}(\cdot)$ represents the down-sampling operator that produces the Low Resolution (LR) sharp image $I$ from $L$. Thus the Motion-blurred image Super Resolution (MSR) task can be formulated as:

$L = \operatorname{MSR}(B).$   (2)

It is obvious that recovering a High-Resolution (HR) sharp image $L$ from a single LR blurry image $B$ is a severely ill-posed problem. While significant progress has been reported in MSR techniques [13, 18, 23], these approaches are often specialized for specific domains (e.g., faces or text) or rely heavily on the strong assumption of uniform motion, thereby limiting their applicability in natural scenarios with complex motions.
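To make the degradation model in Eq. (1) concrete, the following minimal PyTorch sketch synthesizes a blurry LR observation from a stack of latent HR frames by downsampling each frame and averaging over the exposure window. The frame count, scale factor, and the use of bicubic downsampling are illustrative assumptions, not the exact simulation pipeline of the datasets used later.

import torch
import torch.nn.functional as F

def synthesize_blurry_lr(hr_frames: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """hr_frames: (N, C, H, W) sharp HR frames L(t) sampled within the exposure interval."""
    # I = D↓(L): spatially downsample each latent frame (bicubic is an assumption)
    lr_frames = F.interpolate(hr_frames, scale_factor=1.0 / scale,
                              mode="bicubic", align_corners=False)
    # B = (1/T) ∫ I(t) dt: approximate the temporal integral by the frame average
    return lr_frames.mean(dim=0, keepdim=True)

# Example: 13 latent 256x256 frames -> one 64x64 blurry LR image B
B = synthesize_blurry_lr(torch.rand(13, 3, 256, 256), scale=4)  # (1, 3, 64, 64)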

Recently, many methods [26, 28, 27] utilize events to tackle the MSR task owing to their low latency. Events are triggered whenever the log-scale brightness change exceeds the event threshold $c>0$, i.e.,

$\log(I(t,\mathbf{x})) - \log(I(\tau,\mathbf{x})) = p \cdot c,$   (3)

where $I(t,\mathbf{x})$ and $I(\tau,\mathbf{x})$ denote the instantaneous intensity at times $t$ and $\tau$ at the pixel position $\mathbf{x}$, and the polarity $p\in\{+1,-1\}$ indicates the direction of the brightness change. Hence, the Event-based MSR (E-MSR) task can be represented as:

$L = \operatorname{E\text{-}MSR}(B, \mathcal{E}_{\mathcal{T}}),$   (4)

where $\mathcal{E}_{\mathcal{T}}\triangleq\{(\mathbf{x}_i, p_i, t_i)\}_{t_i\in\mathcal{T}}$ denotes the event stream emitted during the exposure interval of $B$. However, existing approaches [28, 27] often suffer performance degradation in more complex scenarios owing to the limitations imposed by sparse coding [28] and accumulated errors in the multi-stage training procedure [27]. To address these problems, we introduce a novel Event-based Blurry Super Resolution Network (EBSR-Net), a one-stage architecture aimed at directly recovering $L$ from $B$ and the concurrent event stream $\mathcal{E}_{\mathcal{T}}$ across various challenging scenarios.
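As a sanity check on the event model in Eq. (3), the sketch below emits events from a sequence of intensity frames whenever the log-brightness change at a pixel crosses the threshold c. It is a simplified frame-based simulator: timestamps are quantized to frame times and the reference intensity is reset wherever an event fires, which is an assumption rather than the simulator used in Sec. III.

import torch

def simulate_events(frames: torch.Tensor, times, c: float = 0.2):
    """frames: (N, H, W) intensities in [0, 1]; times: list of N timestamps.
    Returns a list of events (x, y, p, t) following Eq. (3)."""
    log_ref = torch.log(frames[0] + 1e-6)   # last log intensity that triggered an event
    events = []
    for k in range(1, frames.shape[0]):
        log_cur = torch.log(frames[k] + 1e-6)
        diff = log_cur - log_ref
        for p in (+1, -1):                  # polarity of the brightness change
            ys, xs = torch.nonzero(p * diff >= c, as_tuple=True)
            events += [(int(x), int(y), p, times[k]) for x, y in zip(xs, ys)]
        log_ref = torch.where(diff.abs() >= c, log_cur, log_ref)  # reset fired pixels
    return events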

Figure 1: (a) illustrates the details of the proposed Multi-scale Center-surround Event Representation (MCER). Here, $\Delta t$ controls the exposure interval, determining the period utilized for quantizing the event representation.

II-B Overall Architecture

The overall architecture of our proposed Event-based Blurry Super Resolution Network (EBSR-Net) is illustrated in Fig. 2 (a), which is a deep network mainly consisting of a Multi-scale Center-surround Event Representation (MCER) module, a Symmetric Cross-Modal Attention (SCMA) module, and an Intermodal Residual Group (IRG) module.

Multi-scale Center-surround Event Representation. The blur degree of a motion-blurred image often varies significantly due to the diverse speeds of the camera or object motion during exposure. To ensure robustness across various scenes with complex motion, we introduce the novel Multi-scale Center-surround Event Representation (MCER) module. This module comprehensively captures intra-frame motion at multiple temporal scales, thereby strengthening the resilience of the event representation. Specifically, we encode the event stream $\mathcal{E}_{\mathcal{T}}$ into window-dependent representation frames $E_{\Delta t}(f)$, denoted as

$E_{\Delta t}(f) = \operatorname{MCER}(\mathcal{E}_{\mathcal{N}_{\Delta t}(f)}),$ with   (5)
$\mathcal{N}_{\Delta t}(f) \triangleq \{t^{\prime} \mid |f - t^{\prime}| \leq \frac{\Delta t}{2},\ \forall t^{\prime} \in \mathcal{T}\},$   (6)

where $f$ represents the middle point of the exposure time $T$, and $\Delta t$ determines the length of the window. According to Eqs. (5) and (6), event representations with different $\Delta t$ encode motion information across multiple temporal scales. Additionally, we employ the Event Count Map and Time Surface approaches [29] to quantize intra-frame motion information.
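The following sketch illustrates the center-surround windows of Eqs. (5) and (6): for each temporal scale $\Delta t$, only events within $\Delta t/2$ of the mid-exposure time $f$ are kept and quantized into per-polarity event count maps, which are stacked over scales. The window sizes are illustrative, and the Time Surface channels used alongside the count maps are omitted for brevity.

import torch

def mcer_count_maps(events, f: float, deltas, height: int, width: int) -> torch.Tensor:
    """events: iterable of (x, y, p, t); deltas: list of window lengths Δt."""
    maps = []
    for dt in deltas:                            # one representation per temporal scale
        count = torch.zeros(2, height, width)    # channels: positive / negative polarity
        for (x, y, p, t) in events:
            if abs(f - t) <= dt / 2:             # membership in N_Δt(f), Eq. (6)
                count[0 if p > 0 else 1, y, x] += 1
        maps.append(count)
    return torch.cat(maps, dim=0)                # stack scales along the channel axis

# e.g., three temporal scales centered at the mid-exposure time f
# E = mcer_count_maps(events, f=0.5, deltas=[0.25, 0.5, 1.0], height=180, width=320)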

Figure 2: (a) illustrates the overview of our proposed EBSR-Net. (b) and (c) provide details of the Symmetric Cross-Modal Attention (SCMA) module and the Residual Dense Swin Transformer Block (RDSTB), respectively. (d) presents the details of the Swin Transformer Layer (STL).

Symmetric Cross-Modal Attention. Compared to existing simple fusion methods between frames and events [27, 29], our proposed Symmetric Cross-Modal Attention (SCMA) module fully exploits the complementary characteristics of multimodal information. Through symmetric querying of multimodal data, it extracts adaptively enhanced features for subsequent tasks. As illustrated in Fig. 2 (b), the SCMA module takes the blurry-image feature $F_b$ and the event feature $F_e$ as inputs, yielding symmetrically fused features $F_s$. The input features are produced by encoders composed of conventional convolutional layers that receive the blurry image $B$ and the event map $E_{\Delta t}$.

Unlike conventional self-attention blocks that typically compute queries, keys, and values exclusively from either the frame or event branch of the network, our SCMA leverages multimodal information between frames and events. Queries are calculated from both images and events, with keys and values obtained from the opposite modality. SCMA consists of two parallel self-attention structures and a combination operation, formulated as:

$\operatorname{Att}(Q_b, K_e, V_e) = V_e \operatorname{Softmax}\left(\frac{Q_b^{\prime} K_e}{\sqrt{d_k}}\right), \qquad \operatorname{Att}(Q_e, K_b, V_b) = V_b \operatorname{Softmax}\left(\frac{Q_e^{\prime} K_b}{\sqrt{d_k}}\right),$   (7)

where $(\cdot)^{\prime}$ represents the transpose operator. Note that $Q$, $K$, and $V$ are produced through operations involving normalization and 1×1 convolutional layers. Specifically, $Q_b$ and $Q_e$ are derived from $F_b$ and $F_e$ respectively, and likewise for $K$ and $V$. Additionally, the adaptively symmetric fusion-based output $F_s$ can be calculated by

$F_s = \operatorname{Conv}_{1\times 1}([\tilde{F}_{b\rightarrow e}, \tilde{F}_{e\rightarrow b}]),$   (8)

where $\tilde{F}_{b\rightarrow e}$ and $\tilde{F}_{e\rightarrow b}$ represent the intermediate features from the $Q_b$ and $Q_e$ branches, respectively. These features are generated by reshaping the results that combine the attention outputs with the original features, followed by a Multi-Layer Perceptron (MLP) layer.
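A minimal PyTorch sketch of the symmetric querying in Eqs. (7) and (8) is given below: queries from each modality attend to keys and values of the other, and the two attended features are concatenated and fused by a 1×1 convolution. The head count, the token-wise LayerNorm, and the omission of the MLP and reshaping details are simplifying assumptions relative to the module in Fig. 2 (b).

import torch
import torch.nn as nn

class SymmetricCrossModalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm_b, self.norm_e = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn_b2e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_e2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)   # fusion of Eq. (8)

    def forward(self, f_b: torch.Tensor, f_e: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_b.shape
        tb = self.norm_b(f_b.flatten(2).transpose(1, 2))     # (B, HW, C) image tokens
        te = self.norm_e(f_e.flatten(2).transpose(1, 2))     # (B, HW, C) event tokens
        f_be, _ = self.attn_b2e(tb, te, te)                  # Q from image, K/V from events
        f_eb, _ = self.attn_e2b(te, tb, tb)                  # Q from events, K/V from image
        f_be = f_be.transpose(1, 2).reshape(b, c, h, w)
        f_eb = f_eb.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([f_be, f_eb], dim=1))

# F_s = SymmetricCrossModalAttention(64)(F_b, F_e)   # F_b, F_e: (B, 64, H, W)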

Intermodal Residual Group. The SCMA module extracts multimodal information but does not provide the deep features required for subsequent reconstruction. To address this, we introduce the Intermodal Residual Group (IRG) module, which fully exploits deep intermodal information through a meticulously designed group of Residual Dense Swin Transformer Blocks (RDSTBs) [6, 30]. The IRG module, depicted in Fig. 2 (a) and (c), comprises four RDSTBs and two 3×3 convolutional layers. Each RDSTB contains four Swin Transformer Layers (STLs) with residual dense connections. Explicitly, the first RDSTB in the IRG module is formulated as:

$F_n(d) = \operatorname{STL}([F_n(d-1), F_n(d-2), \ldots, F_s]),$   (9)

where $F_n(\cdot)$ represents the intermediate feature maps of the RDSTB, $d$ denotes the index of the STL layer, and $\operatorname{STL}$ [30] is based on standard multi-head self-attention and the original transformer layer [31]. Additionally, the output $F_r$ of the first RDSTB can be obtained by

$F_r = \operatorname{Conv}_{3\times 3}(F_n(4)) + F_s.$   (10)

The final output $F_i$ of the IRG module is computed using three additional RDSTBs and two 3×3 convolutional layers. Consequently, the recovered HR sharp image $\bar{L}$ can be estimated by

$\bar{L} = \operatorname{Dec}(F_r) + \operatorname{BL}(B),$   (11)

where $\operatorname{Dec}$ denotes a decoder operator, and $\operatorname{BL}$ represents the bilinear upsampling operation.
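The residual dense aggregation of Eqs. (9) and (10) can be sketched as follows, with the Swin Transformer Layer of [30] stubbed out by a 3×3 convolution plus GELU so the snippet stays self-contained; in the actual RDSTB, each stage is an STL with shifted-window self-attention.

import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Sketch of one RDSTB: densely connected stages plus a residual back to F_s."""
    def __init__(self, dim: int, num_layers: int = 4):
        super().__init__()
        # stage d receives F_s together with the d previous stage outputs
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d((d + 1) * dim, dim, 3, padding=1), nn.GELU())
            for d in range(num_layers)
        )
        self.proj = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        feats = [f_s]
        for layer in self.layers:                     # Eq. (9), with a conv stand-in for STL
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.proj(feats[-1]) + f_s             # Eq. (10): residual connection to F_s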

II-C Loss Function

We supervise the overall architecture using the $L_1$ loss and the perceptual similarity loss $\mathcal{L}_{\text{per}}$ [32] for better visual quality, which can be formulated as

$\mathcal{L}_{\text{total}} = \alpha\|\bar{L} - L\|_1 + \beta\,\mathcal{L}_{\text{per}}(\bar{L}, L),$   (12)

where $\alpha$ and $\beta$ are the balancing parameters.
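A sketch of the objective in Eq. (12) is shown below, assuming the perceptual term is computed with the publicly available LPIPS package, which implements the perceptual similarity of [32]; the specific feature backbone ("vgg" here) and the [0, 1] input range are assumptions.

import torch
import lpips

perceptual = lpips.LPIPS(net="vgg")   # perceptual similarity metric of [32]

def total_loss(pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    l1 = torch.mean(torch.abs(pred - target))               # ||L̄ - L||_1 term
    per = perceptual(pred * 2 - 1, target * 2 - 1).mean()    # LPIPS expects [-1, 1] inputs
    return alpha * l1 + beta * per                           # Eq. (12) with α=1, β=0.1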

III Experiment

Figure 3: Qualitative comparisons of our EBSR-Net with the state-of-the-art methods on the GoPro and the REDS datasets. Six samples are arranged from left to right and top to bottom, with the first three samples from the GoPro dataset and the last three from the REDS dataset.

III-A Datasets and Implementation

The proposed EBSR-Net model is implemented in PyTorch on an NVIDIA GeForce RTX 3090 and trained separately on the training sets of the GoPro [33] and REDS [34] datasets, using simulated events and blurry images [29]. Evaluations are then conducted separately on the testing sets of these two datasets. Furthermore, we use the ADAM optimizer [35] with an initial learning rate of $10^{-4}$, decayed exponentially by a factor of 0.98 every 5 epochs. The weighting factors $\alpha$ and $\beta$ in Eq. (12) are set to 1 and 0.1, respectively. We use Structural SIMilarity (SSIM) [36] and Peak Signal-to-Noise Ratio (PSNR) as performance metrics.
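For reference, the optimization schedule described above maps to the following PyTorch setup; the stand-in model, epoch count, and data pipeline are placeholders rather than the released training code.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)        # placeholder stand-in for EBSR-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# with one scheduler step per epoch, the learning rate decays by 0.98 every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.98)

for epoch in range(100):                     # epoch count is an assumption
    # ... one training epoch over GoPro/REDS mini-batches, minimizing Eq. (12) ...
    scheduler.step()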

III-B Quantitative and Qualitative Evaluation

We compare our EBSR-Net with state-of-the-art Motion Deblurring (MD) methods, including MPR [37], RED [29], and EF [38], and Image Super-Resolution (ISR) approaches, including SwinIR [6], CAT [7], and DAT [8], on both the GoPro [33] and REDS [34] datasets. These comparison methods are combined into two-stage cascades: an MD method is applied first, followed by an ISR method, denoted as MPR+DAT, and so on. Additionally, a one-stage architecture, i.e., eSL-Net [26], is directly used for comparison via its official code with default parameters. According to the quantitative results shown in Tab. I, our EBSR-Net outperforms state-of-the-art methods by a large margin, achieving an average improvement of 4.12/5.70 dB in PSNR and 0.0788/0.1607 in SSIM on the GoPro and REDS datasets, respectively. Meanwhile, our one-stage model contains only 7.3M network parameters, far fewer than all other methods except eSL-Net. Note that eSL-Net requires 122.8G FLOPs to infer a 160×320 image due to its recursive structure, while our model only needs 41.2G FLOPs, maintaining overall efficiency.

TABLE I: Quantitative comparisons on the GoPro [33] and REDS [34] datasets. Bold and underlined numbers represent the best and second-best performance, respectively; OS denotes one-stage methods.
Methods | GoPro [33] PSNR↑/SSIM↑ | REDS [34] PSNR↑/SSIM↑ | #Param. | OS | Events
MPR+DAT | 26.86/0.8237 | 19.48/0.5336 | 34.9M | |
MPR+CAT | 27.03/0.8266 | 19.62/0.5381 | 36.7M | |
MPR+SwinIR | 26.87/0.8229 | 19.57/0.5354 | 32.0M | |
RED+DAT | 26.91/0.8164 | 21.67/0.5998 | 24.5M | | ✓
RED+CAT | 26.74/0.8166 | 21.62/0.5989 | 26.3M | | ✓
RED+SwinIR | 26.69/0.8134 | 21.58/0.5992 | 21.6M | | ✓
EF+DAT | 26.88/0.8262 | 22.75/0.7043 | 23.3M | | ✓
EF+CAT | 27.02/0.8285 | 21.50/0.6931 | 25.1M | | ✓
EF+SwinIR | 26.81/0.8253 | 21.49/0.6806 | 20.4M | | ✓
eSL-Net | 26.01/0.7818 | 19.82/0.5386 | 0.1M | ✓ | ✓
EBSR-Net | 30.90/0.8969 | 26.61/0.7629 | 7.3M | ✓ | ✓

The qualitative comparisons are shown in Fig. 3. The results estimated by two-stage cascade architectures, e.g., RED [29]+DAT [8], suffer from artifacts and distortions owing to accumulated errors, leading to significant degradation of the overall quality. Furthermore, one-stage methods like eSL-Net [26] exhibit performance degradation in complex scenarios, attributed to the limitations imposed by sparse coding. In contrast, our EBSR-Net achieves accurate reconstructions closely resembling the HR ground-truth sharp images.

Figure 4: Qualitative ablations of each module of EBSR-Net.
TABLE II: The Ablation Experimental Results of our EBSR-Net.
Models | MCER | SCMA | IRG | GoPro [33] PSNR↑/SSIM↑ | REDS [34] PSNR↑/SSIM↑
#0 | | | | 27.07/0.8165 | 24.04/0.6756
#1 | | | | 27.66/0.8336 | 23.87/0.6725
#2 | | | | 27.42/0.8247 | 22.45/0.6232
#3 | | | | 29.67/0.8810 | 25.02/0.7181
#4 | | | | 27.79/0.8433 | 24.07/0.6810
#5 | | | | 28.07/0.8486 | 24.34/0.6992
#6 | ✓ | ✓ | ✓ | 30.90/0.8969 | 26.61/0.7629

III-C Ablation Study

In order to verify the effectiveness of the key components of our EBSR-Net, ablation experiments on the MCER, SCMA, and IRG modules are conducted, as shown in Tab. II. All models are trained in the same experimental environment on the same hardware, and we replace the ablated modules with corresponding convolutional layers for a fair comparison. Specifically, removal of the MCER (Cases 1, 2, 5), SCMA (Cases 0, 2, 4), and IRG (Cases 0, 1, 3) modules respectively leads to a degradation of 3.12/3.28/2.91 dB in PSNR and 0.0795/0.0858/0.0722 in SSIM. Furthermore, in Fig. 4, comparing (b) with (e) and (f) with (h), (c) with (f) and (g) with (h), and (d) with (g) and (e) with (h) demonstrates that EBSR-Net with the SCMA, IRG, and MCER modules, respectively, gives sharper results than the network without them. The improvement in both quantitative and qualitative results validates the effectiveness of the proposed three modules.

IV Conclusion

In this letter, we present EBSR-Net, an event-based one-stage architecture for recovering HR sharp images from LR blurry images. We introduce an innovative event representation method, i.e., MCER, which comprehensively captures intra-frame motion information through a multi-scale center-surround structure in the temporal domain. The SCMA and IRG modules are presented to achieve effective image restoration by symmetrically querying multimodal features and facilitating inter-block feature aggregation. Extensive experimental results show that our method compares favorably against state-of-the-art methods and achieves remarkable performance.

References

  • [1] Q. Wang, T. Han, Z. Qin, J. Gao, and X. Li, “Multitask attention network for lane detection and fitting,” IEEE TNNLS, vol. 33, no. 3, pp. 1066–1078, 2022.
  • [2] Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, and X. You, “Few-shot object detection: Research advances and challenges,” Information Fusion, vol. 107, p. 102307, 2024.
  • [3] Z. Wu, J. Wen, Y. Xu, J. Yang, X. Li, and D. Zhang, “Enhanced spatial feature learning for weakly supervised object detection,” IEEE TNNLS, vol. 35, no. 1, pp. 961–972, 2024.
  • [4] Y. Wu, L. Wang, L. Zhang, Y. Bai, Y. Cai, S. Wang, and Y. Li, “Improving autonomous detection in dynamic environments with robust monocular thermal slam system,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 203, pp. 265–284, 2023.
  • [5] Y. Ge, L. Zhang, Y. Wu, and D. Hu, “Pipo-slam: Lightweight visual-inertial slam with preintegration merging theory and pose-only descriptions of multiple view geometry,” IEEE Transactions on Robotics, vol. 40, pp. 2046–2059, 2024.
  • [6] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in ICCV, 2021, pp. 1833–1844.
  • [7] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yuan et al., “Cross aggregation transformer for image restoration,” NeurIPS, vol. 35, pp. 25478–25490, 2022.
  • [8] Z. Chen, Y. Zhang, J. Gu, L. Kong, X. Yang, and F. Yu, “Dual aggregation transformer for image super-resolution,” in ICCV, 2023, pp. 12312–12321.
  • [9] H. Park and K. Mu Lee, “Joint estimation of camera pose, depth, deblurring, and super-resolution from a blurred image sequence,” in ICCV, 2017, pp. 4613–4621.
  • [10] G. Han, M. Wang, H. Zhu, and C. Lin, “Mpdnet: An underwater image deblurring framework with stepwise feature refinement module,” Engineering Applications of Artificial Intelligence, vol. 126, p. 106822, 2023.
  • [11] H. Jung, Y. Kim, H. Jang, N. Ha, and K. Sohn, “Multi-task learning framework for motion estimation and dynamic scene deblurring,” IEEE TIP, vol. 30, pp. 8170–8183, 2021.
  • [12] A. Singh, F. Porikli, and N. Ahuja, “Super-resolving noisy images,” in CVPR, 2014, pp. 2846–2853.
  • [13] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018, pp. 3262–3271.
  • [14] N. Fang and Z. Zhan, “High-resolution optical flow and frame-recurrent network for video super-resolution and deblurring,” Neurocomputing, vol. 489, pp. 128–138, 2022.
  • [15] J. Liang, K. Zhang, S. Gu, L. Van Gool, and R. Timofte, “Flow-based kernel prior with application to blind super-resolution,” in CVPR, 2021, pp. 10601–10610.
  • [16] W. Niu, K. Zhang, W. Luo, and Y. Zhong, “Blind motion deblurring super-resolution: When dynamic spatio-temporal learning meets static image understanding,” IEEE TIP, vol. 30, pp. 7101–7111, 2021.
  • [17] S. Nah, S. Son, S. Lee, R. Timofte, and K. M. Lee, “Ntire 2021 challenge on image deblurring,” in CVPR, 2021, pp. 149–165.
  • [18] J. Pan, H. Bai, J. Dong, J. Zhang, and J. Tang, “Deep blind video super-resolution,” in ICCV, 2021, pp. 4811–4820.
  • [19] J.-S. Yun, M. H. Kim, H.-I. Kim, and S. B. Yoo, “Kernel adaptive memory network for blind video super-resolution,” Expert Systems with Applications, vol. 238, p. 122252, 2024.
  • [20] H. Bai and J. Pan, “Self-supervised deep blind video super-resolution,” IEEE TPAMI, 2024.
  • [21] X. Li, W. Zuo, and C. C. Loy, “Learning generative structure prior for blind text image super-resolution,” in CVPR, 2023, pp. 10103–10113.
  • [22] J. Chen, B. Li, and X. Xue, “Scene text telescope: Text-focused scene image super-resolution,” in CVPR, 2021, pp. 12026–12035.
  • [23] X. Li, C. Chen, X. Lin, W. Zuo, and L. Zhang, “From face to natural image: Learning real degradation for blind image super-resolution,” in ECCV. Springer, 2022, pp. 376–392.
  • [24] D. Zhang, Z. Liang, and J. Shao, “Joint image deblurring and super-resolution with attention dual supervised network,” Neurocomputing, vol. 412, pp. 187–196, 2020.
  • [25] T. Barman and B. Deka, “A deep learning-based joint image super-resolution and deblurring framework,” IEEE Transactions on Artificial Intelligence, 2023.
  • [26] B. Wang, J. He, L. Yu, G.-S. Xia, and W. Yang, “Event enhanced high-quality image recovery,” in ECCV. Springer, 2020, pp. 155–171.
  • [27] J. Han, Y. Yang, C. Zhou, C. Xu, and B. Shi, “Evintsr-net: Event guided multiple latent frames reconstruction and super-resolution,” in ICCV, 2021, pp. 4882–4891.
  • [28] L. Yu, B. Wang, X. Zhang, H. Zhang, W. Yang, J. Liu, and G.-S. Xia, “Learning to super-resolve blurry images with events,” IEEE TPAMI, 2023.
  • [29] F. Xu, L. Yu, B. Wang, W. Yang, G.-S. Xia, X. Jia, Z. Qiao, and J. Liu, “Motion deblurring with real events,” in ICCV, 2021, pp. 2583–2592.
  • [30] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021, pp. 10012–10022.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
  • [32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
  • [33] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, 2017, pp. 3883–3891.
  • [34] S. Nah, S. Baik, S. Hong, G. Moon, S. Son, R. Timofte, and K. Mu Lee, “Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study,” in CVPRW, 2019, pp. 1974–1984.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.
  • [37] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” in CVPR, 2021.
  • [38] L. Sun, C. Sakaridis, J. Liang, Q. Jiang, K. Yang, P. Sun, Y. Ye, K. Wang, and L. V. Gool, “Event-based fusion for motion deblurring with cross-modal attention,” in ECCV. Springer, 2022, pp. 412–428.