Look-Around Before You Leap: High-Frequency Injected Transformer for Image Restoration
Abstract
Transformer-based approaches have achieved superior performance in image restoration, since they can model long-term dependencies well. However, the limitation in capturing local information restricts their capacity to remove degradations. While existing approaches attempt to mitigate this issue by incorporating convolutional operations, the core component in Transformer, i.e., self-attention, which serves as a low-pass filter, could unintentionally dilute or even eliminate the acquired local patterns. In this paper, we propose HIT, a simple yet effective High-frequency Injected Transformer for image restoration. Specifically, we design a window-wise injection module (WIM), which incorporates abundant high-frequency details into the feature map, to provide reliable references for restoring high-quality images. We also develop a bidirectional interaction module (BIM) to aggregate features at different scales using a mutually reinforced paradigm, resulting in spatially and contextually improved representations. In addition, we introduce a spatial enhancement unit (SEU) to preserve essential spatial relationships that may be lost due to the computations carried out across channel dimensions in the BIM. Extensive experiments on 9 tasks (real noise, real rain streak, raindrop, motion blur, moiré, shadow, snow, haze, and low-light condition) demonstrate that HIT with linear computational complexity performs favorably against the state-of-the-art methods. The source code and pre-trained models will be available at https://github.com/joshyZhou/HIT.
Keywords:
Image Restoration · High-frequency Information · Transformer
1 Introduction
Image restoration aims to recover clear images by removing undesired degradation from input [49]. Significant progress has been made due to the use of kinds of convolutional neural network (CNN) architectures [11, 55, 93]. However, the CNN-based methods are limited in modeling global contexts, which tends to negatively impact high-quality image restoration.
A recently proposed architecture, i.e., the Transformer [74], has attracted much attention from the vision and learning communities owing to its remarkable ability to capture long-range relations among distant pixels. Recent approaches [80, 9, 88] successfully apply Transformers to image restoration by reducing the quadratic complexity of vanilla Vision Transformers (ViT) [17], i.e., $\mathcal{O}(N^2)$, where $N$ is the number of pixels (tokens). However, the limited capability of Transformers in aggregating local information [59, 18] remains a grand challenge. Note that high-quality image restoration requires modeling both global and local information, since clear images typically contain global structures and rich local details. Employing global dependencies alone is clearly insufficient for removing degradations: the lack of high-frequency local information, which provides reliable references (e.g., edges and textures), hinders these models from recovering fine details.
To mitigate this issue, a common solution is to introduce convolutional operations and learn convolution-like features [80, 88, 73]. By incorporating depth-wise convolution into the Feed-Forward Network (FFN) of each Transformer block, local interactions within the feature map are enabled. The convolutional operation performs a weighted sum over neighborhoods, yielding feature representations focused on local correlations. When self-attention is applied to the convolved feature map, the attention weights redistribute importance among pixels of the entire feature map, potentially reducing the emphasis on local patterns. In other words, the fine local patterns initially captured by the convolution layer may inadvertently be diluted or even eliminated, especially in deep layers of the network. As a result, it is hard to learn the desired fine high-frequency information within a Transformer-based architecture due to its low-pass filter nature [58]. To illustrate this, we utilize an attribution method, i.e., Integrated Gradients [72], to identify which pixels contribute to the final prediction. As shown in Figure 1 (c), local details and structures in highly textured regions, such as the numbers and the bicycle (zoomed-in yellow and orange boxes), are not sufficiently deblurred by Uformer [80]. When analyzing the corresponding attribution area in Figure 1 (e), we notice a lack of activation among the neighboring pixels around the characters and the bike. This demonstrates that the diluted local cues can negatively impact the restoration of high-frequency details.
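For reference, the sketch below shows how such an attribution map can be computed with Integrated Gradients [72] in PyTorch; it is a minimal illustration under our own assumptions (a black baseline, a box-shaped target region, and the helper name `integrated_gradients`), not the exact analysis script used for Figure 1.

```python
import torch

def integrated_gradients(model, x, target_box, steps=32):
    """Approximate Integrated Gradients [72] for a restoration network.

    model:      network mapping a degraded image (1, 3, H, W) to its restored output
    x:          degraded input image, shape (1, 3, H, W)
    target_box: (y0, y1, x0, x1) region of the output whose restoration we attribute
    """
    baseline = torch.zeros_like(x)                      # black image as the reference point
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline) # point on the straight path
        point.requires_grad_(True)
        out = model(point)
        y0, y1, x0, x1 = target_box
        score = out[..., y0:y1, x0:x1].sum()            # scalar summary of the target region
        total_grad += torch.autograd.grad(score, point)[0]
    # Riemann approximation of the path integral, scaled by the input difference
    attribution = (x - baseline) * total_grad / steps
    return attribution.abs().sum(dim=1)                 # per-pixel attribution map, (1, H, W)
```

Regions with weak attribution around fine structures, as in Figure 1 (e), indicate that neighboring pixels contribute little to the restored values there.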
In this paper, we propose a Transformer-based approach that models local correlations for better image restoration. The key idea of our HIT is high-frequency injection into the Transformer via the proposed window-wise injection module (WIM). Different from existing methods that require a single model to capture both high-frequency and low-frequency information, we partition the learning targets into two categories: a CNN handles low-level fine details, while the Transformer tackles long-range dependencies. More specifically, we deploy a CNN-based extractor to generate high-frequency features, thanks to its high-pass filter-like convolution operator [58] and residual learning [54], and subsequently inject the extracted features into the Transformer in a window-wise fashion. In this way, our approach obtains plentiful high-frequency information while allowing the Transformer architecture to concentrate on modeling long-range relationships. Meanwhile, the global image contexts in the hierarchical features [92] extracted by the CNN play a complementary role: they provide a comprehensive picture of the image and alleviate the potential loss of global structure caused by the window split strategy. As shown in Figure 1 (f), compared to existing methods, e.g., [80], HIT effectively enhances high-frequency information without sacrificing the large receptive field benefit of the Transformer. Moreover, to prevent the most useful high-frequency information from being diluted by the subsequent repeated self-attention mechanism, which serves as a low-pass filter [58], we tailor two schemes. The first is to remove the attention mechanism from the encoder, leaving the FFN alone to handle the information flow. The other is to develop a bidirectional interaction module (BIM) that guides feature integration and avoids the loss of content information in the decoder. The proposed BIM facilitates a two-way exchange of information between features at different scales, enabling each feature to benefit from the other's complementary characteristics. This bidirectional process calculates cross-attention from high-resolution features with fine details to semantically rich low-resolution representations and then in reverse, resulting in a spatially and contextually improved representation. Furthermore, we introduce a spatial enhancement unit (SEU) to preserve spatial information. The calculation in BIM, which is carried out across channel dimensions, could inadvertently lead to a loss of spatial context. In response, our SEU performs a convolution operation on the value projection in self-attention and complements the aggregated feature of BIM with crucial spatial relations.
With the proposed modules, our HIT explores rich high-frequency information while retaining linear complexity. We perform comprehensive experiments on 9 popular image restoration tasks, including image denoising, deraining, raindrop removal, deblurring, demoiréing, deshadowing, desnowing, dehazing, and low-light image enhancement. Extensive experimental results show the effectiveness of our model. The main contributions of this work are threefold. (1) We propose an effective model, i.e., HIT, which leverages a CNN-based extractor to capture fine details while ensuring that the Transformer focuses on modeling global context. This distinctive design enhances high-frequency information while maintaining the large receptive field benefit of the Transformer, thus facilitating high-quality image restoration. (2) We develop a window-wise injection module (WIM) to integrate high-frequency information into separate windows of the feature map. To ensure that the most useful local cues are preserved in deep layers, a bidirectional interaction module (BIM) is used to achieve spatially and semantically improved representations, in which a spatial enhancement unit (SEU) preserves crucial spatial details. (3) We evaluate the proposed HIT on various tasks, showing that it achieves favorable performance.
2 Related Work
Image Restoration. Over the past decades, CNN-based methods [56, 19] have offered a preferable solution to the image restoration task compared to traditional approaches [16, 20]. By learning an optimal mapping function from low-quality images to high-quality ones, CNN-based architectures achieve impressive performance on various restoration tasks, including image denoising [22, 79, 9], deblurring [32, 55, 98], deraining [29, 13, 21], demoiréing [85, 97, 99], etc. Since the introduction of CNNs [70, 15], a surge of approaches has considered deeper and wider architecture designs [66, 43] to explore global cues and further improve performance. Meanwhile, some works [69, 88] introduce spatial and channel attention mechanisms to obtain better performance by forcing the model to focus more on relevant information. More architecture designs can be found in NTIRE challenge reports [53, 50] and recent surveys [95, 35]. Besides, some works explore all-in-one image restoration [61, 44], which is out of the scope of this work.
Recently, since the Transformer [74] has achieved great success in various natural language processing tasks, many works attempt to apply it to computer vision tasks [100, 13, 26]. Specifically, for image restoration, IPT [9] first utilizes the vanilla Transformer as the backbone and obtains competitive results. Nevertheless, concerns remain, since it relies heavily on large-scale training data to fit its large number of parameters. Stripformer [73] designs novel intra- and inter-strip attention to form a token-efficient Transformer. Restormer [88] models global relations across channel dimensions to reduce complexity. However, Stripformer still incurs high computational complexity, while the channel-wise attention in Restormer may lose necessary spatial information. On the other hand, some works [80, 41, 39] leverage a window-based strategy [47] to achieve linear complexity. Even though these works have made clear improvements, the insufficiency of local details still limits the performance of Transformers.
Frequency Components based Image Restoration. Apart from mining relations in the spatial domain, some works attempt to design networks that solve various degradation removal tasks from a frequency perspective. Specifically, transformation tools such as the Fourier transform or wavelets are employed by some works [25, 102, 12] to decompose features into different frequency bands. Unfortunately, few of them aim to use high-frequency information to improve Transformers. Effectively exploring high-frequency information so that Transformer-based methods can model both high-frequency local details and low-frequency non-local structures for better image restoration is non-trivial, as the self-attention mechanism does not model high-frequency information well [58]. Different from these methods, our HIT employs a WIM with a split-align-and-fuse strategy to emphasize the indispensable role of local details in the feature maps, which are fused along the channel dimension. Besides, we remove the self-attention mechanism in the encoder of the model to prevent the fine details from being diluted, and develop a BIM to deliver spatially and contextually improved representations for the decoder.

3 Proposed Method
Figure 2 shows an overview of our HIT model. It comprises a U-shaped architecture with two proposed modules: the window-wise injection module (WIM), which is described in detail in Section 3.2, and the bidirectional interaction module (BIM), which is presented in Section 3.3.
3.1 Overall Pipeline
Given a degraded image $I \in \mathbb{R}^{H \times W \times 3}$, HIT first adopts a convolution layer to extract the low-level feature $X_0 \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. The low-level feature is then fed into the WIM, where rich local features are encoded into the network (see Section 3.2 for details). Next, the feature processed by the WIM is fed into the 4-level encoder-decoder parts, which output the refined feature $X_d$. Each level of the encoder and decoder adopts a window-based Transformer block design [47], in which a Feed-Forward Network (FFN), following pioneering works [40], is paired with window-based multi-head self-attention (W-MSA); in the encoder, however, the self-attention is removed so that the injected high-frequency details are not diluted (see Section 3.3), and only the FFN processes the information flow. To be specific, in the encoder, the input feature is progressively processed by each block to generate the intermediate feature $X_l$ at the $l$-th depth, which is formulated as:
$$X_l' = \mathrm{LN}(X_{l-1}), \qquad X_l = \mathrm{FFN}(X_l') + X_{l-1}, \tag{1}$$
where LN denotes layer normalization. Afterward, a convolution layer is used to down-sample the feature map.
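A minimal sketch of such an attention-free encoder block is given below, assuming a locally-enhanced FFN in the spirit of [40]; the expansion ratio, kernel size, and activation are our assumptions rather than the paper's exact settings.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Encoder block of Eq. (1): layer normalization followed by an FFN with a
    residual connection; the self-attention branch is deliberately removed."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depth-wise conv for locality [40]
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):                                            # x: (B, C, H, W)
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)     # LN over the channel dimension
        return x + self.ffn(y)                                       # X_l = FFN(LN(X_{l-1})) + X_{l-1}
```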
In the decoder, each level contains a Transformer block similar to the encoder part, except for the convolution layer and the proposed BIM (see Section 3.3 for details). Specifically, the convolution layer in the decoder performs feature up-sampling. Here, the input feature to the Transformer block at the $l$-th stage of the decoder can be represented as:
$$\hat{X}_l = \mathrm{Conv}\big([\,X_l^{up},\ X_l^{BIM}\,]\big), \tag{2}$$
where $[\cdot\,,\cdot]$ denotes concatenation, Conv is the $1\times1$ convolution, $X_l^{up}$ is the up-sampled feature, and $X_l^{BIM}$ is the output feature of BIM.
After the 4-level encoder-decoder architecture, we obtain the refined feature $X_d$. A convolution layer is then applied to generate a residual image $R \in \mathbb{R}^{H \times W \times 3}$. Finally, the restored image is obtained by adding the degraded image: $\hat{I} = I + R$. Similar to [80], we adopt the commonly used Charbonnier loss [6] to train the whole network: $\mathcal{L}(\hat{I}, \bar{I}) = \sqrt{\|\hat{I} - \bar{I}\|^2 + \epsilon^2}$, where $\bar{I}$ denotes the ground-truth image and $\epsilon$ is a small constant, set as in [80].
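The residual formulation and the Charbonnier objective above can be summarized by the following sketch; the value of $\epsilon$ and the reduction over the batch are assumptions in the spirit of [80].

```python
import torch

def charbonnier_loss(restored, target, eps=1e-3):
    """Charbonnier loss [6]: a smooth, robust variant of the L1 penalty.
    eps is an assumed small constant."""
    diff = restored - target
    return torch.sqrt(diff * diff + eps * eps).mean()

def restore(model, degraded):
    """HIT predicts a residual image R that is added back to the degraded input I."""
    residual = model(degraded)        # R, shape (B, 3, H, W)
    return degraded + residual        # restored image I_hat = I + R
```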
3.2 Window-wise Injection Module
Unlike existing methods that adopt either a convolution-based [59, 10, 42] or an attention-based [14] fusion paradigm, as shown in Figure 2 (a), we develop a window-wise injection module (WIM) that uses a split-align-and-fuse strategy to encode high-frequency information into the feature map along the channel dimension.
Split: First, a pre-trained CNN feature extractor (e.g., a ResNet with the final fully connected layer of the original design removed, so that the final feature map serves as the representation) is applied to the degraded image $I$ to generate a feature representation with abundant local correlations, denoted as $F_h$. Next, the input feature $X_0$ and $F_h$ are divided into non-overlapping windows of the same size, which results in the separated versions $\{X_0^i\}$ and $\{F_h^i\}$, where the $i$-th window features are denoted as $X_0^i$ and $F_h^i$, respectively. Align: Window features sharing the same index $i$ are concatenated along the channel dimension, generating an intermediate tensor. Fuse: The processed features are then sent into an adaptive average pooling layer (AAP) for consolidation. Overall, the WIM process can be denoted as:
$$\{X_0^i\} = \mathrm{WP}(X_0), \quad \{F_h^i\} = \mathrm{WP}(F_h), \quad X_w^i = \mathrm{AAP}\big([\,X_0^i,\ F_h^i\,]\big), \tag{3}$$
where WP represents the window partition strategy [47], $[\cdot\,,\cdot]$ denotes channel-wise concatenation, and $X_w$ is the reshaped output feature of WIM, obtained by merging the window features $\{X_w^i\}$.
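The split-align-and-fuse procedure of Eq. (3) can be sketched as follows; the channel-last layout, the prior resizing of the CNN feature to the input resolution, and the use of 1-D adaptive pooling over channels are our assumptions.

```python
import torch
import torch.nn.functional as F

def window_partition(x, win):
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win, win, C)      # (num_windows * B, win, win, C)

def wim(x, f_high, win=8):
    """Window-wise injection: split both features into windows, align windows with
    the same index by channel concatenation, and fuse back to the original width."""
    C = x.shape[-1]
    xw = window_partition(x, win)                                    # split the Transformer feature
    fw = window_partition(f_high, win)                               # split the high-frequency CNN feature
    merged = torch.cat([xw, fw], dim=-1)                             # align: concat windows of the same index
    fused = F.adaptive_avg_pool1d(merged.flatten(1, 2), C)           # fuse: pool channels 2C -> C
    return fused.view(-1, win, win, C)
```

Here `f_high` is assumed to have been resized to the same spatial resolution as `x` before partitioning.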
3.3 Bidirectional Interaction Module
To prevent the injected high-frequency information from being diluted by the low-pass filter-like self-attention mechanism, we tailor two specific schemes. First, we cancel the self-attention in the encoder so that these local details can flow forward. Second, we design a bidirectional interaction module (BIM) so that useful local relations are preserved in the decoder, as illustrated in Figure 2 (b).
Given the input feature maps $X_1$ and $X_2$ at two different scales, we first resize $X_2$ to match the spatial size of $X_1$ and obtain $\hat{X}_2$. Next, we generate linear projections $Q_1$, $K_1$, $V_1$ from $X_1$, and $Q_2$, $K_2$, $V_2$ from $\hat{X}_2$. We estimate the cross-scale attention by:
$$\mathrm{Attn}_1 = \mathrm{SoftMax}\!\big(Q_2 K_1^{\top}/\alpha\big)\,V_1, \qquad \mathrm{Attn}_2 = \mathrm{SoftMax}\!\big(Q_1 K_2^{\top}/\alpha\big)\,V_2, \tag{4}$$
where $Q_1, K_1, V_1$ and $Q_2, K_2, V_2$ are reshaped to $\mathbb{R}^{\hat{C} \times \hat{H}\hat{W}}$ so that the $\hat{C} \times \hat{C}$ attention maps are computed across the channel dimension, SoftMax represents the softmax activation, and $\alpha$ is the learnable scaling factor. Here we use the transposed scaled-dot-product attention to reduce the computational cost, following [88]. We keep the calculation paradigm of [74], i.e., the queries come from one feature while the keys and values come from the other, to encourage rich interactions among all pixels of the two feature maps. We then fuse the two estimated attentions so that the aggregated feature can better explore cross-scale information:
$$X_{\mathrm{BIM}} = \mathrm{Conv}\big([\,\mathrm{Attn}_1,\ \mathrm{Attn}_2\,]\big), \tag{5}$$
where $X_{\mathrm{BIM}}$ is the aggregated output feature and Conv is the final convolution layer of BIM.
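A minimal sketch of the bidirectional, transposed cross-attention in Eqs. (4)-(5) is given below; single-head attention, bilinear resizing, and the normalization of queries and keys follow the style of [88] and are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BIM(nn.Module):
    """Bidirectional interaction: queries of one scale attend to keys/values of the
    other scale, with attention computed across channels for linear spatial cost."""
    def __init__(self, dim):
        super().__init__()
        self.qkv1 = nn.Conv2d(dim, dim * 3, 1)
        self.qkv2 = nn.Conv2d(dim, dim * 3, 1)
        self.alpha = nn.Parameter(torch.ones(2))              # learnable scaling factors
        self.proj = nn.Conv2d(dim * 2, dim, 1)                # final fusion convolution, Eq. (5)

    def cross_attn(self, q, k, v, alpha):
        B, C, H, W = q.shape
        q, k, v = (t.flatten(2) for t in (q, k, v))           # reshape to (B, C, HW)
        attn = (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)) / alpha  # (B, C, C)
        return (attn.softmax(dim=-1) @ v).view(B, C, H, W)

    def forward(self, x1, x2):
        x2 = F.interpolate(x2, size=x1.shape[-2:], mode="bilinear", align_corners=False)  # resize X_2
        q1, k1, v1 = self.qkv1(x1).chunk(3, dim=1)
        q2, k2, v2 = self.qkv2(x2).chunk(3, dim=1)
        a1 = self.cross_attn(q2, k1, v1, self.alpha[0])       # queries from X_2, keys/values from X_1
        a2 = self.cross_attn(q1, k2, v2, self.alpha[1])       # queries from X_1, keys/values from X_2
        return self.proj(torch.cat([a1, a2], dim=1))          # fuse the two attended features
```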

Spatial Enhancement Unit. By calculating cross-covariance across channels, BIM keeps the complexity of the attention operation linear with respect to the spatial size. Such channel-wise attention mechanisms [88], however, may lose crucial spatial information. To address this issue, we design a unit to retain the indispensable spatial relations, namely the spatial enhancement unit (SEU), as shown in Figure 3.
Taking $V_1$ as an example, the SEU operates on the transposed version $\hat{V}_1$ of the value by:
$$\mathrm{SEU}(\hat{V}_1) = \sigma\big(\mathrm{DWConv}(\hat{V}_1)\big), \tag{6}$$
where DWConv is the $3\times3$ depth-wise convolution, $\sigma$ is the activation function, and $\hat{V}_1$ is the transposed version of $V_1$, which is estimated from $X_1$. Then, we update $\mathrm{Attn}_1$ with $\mathrm{SEU}(\hat{V}_1)$ by:
$$\widehat{\mathrm{Attn}}_1 = \mathrm{Attn}_1 + \mathrm{SEU}(\hat{V}_1), \tag{7}$$
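A sketch of SEU and of the update in Eq. (7) is shown below; the choice of GELU for $\sigma$ and the tensor shapes are assumptions.

```python
import torch.nn as nn

class SEU(nn.Module):
    """Spatial enhancement unit: a 3x3 depth-wise convolution on the (spatially
    reshaped) value, run in parallel with the channel-wise attention of BIM."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # DWConv in Eq. (6)
        self.act = nn.GELU()                                         # sigma (assumed to be GELU)

    def forward(self, v):                                            # v: value reshaped to (B, C, H, W)
        return self.act(self.dwconv(v))

# Eq. (7), illustrative usage: the attended feature is complemented with spatial cues.
# attn_1 = attn_1 + seu(v_1_spatial)
```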
Complexity Analysis. Given a feature of size $\hat{H} \times \hat{W} \times \hat{C}$, the computational complexity of BIM is:
$$\Omega(\mathrm{BIM}) = \mathcal{O}\big(\hat{H}\hat{W}\hat{C}^2\big). \tag{8}$$
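For intuition, consider an illustrative feature of size $\hat{H} = \hat{W} = 64$ and $\hat{C} = 32$ (numbers chosen purely for illustration). The channel-wise attention in BIM then costs on the order of $\hat{H}\hat{W}\hat{C}^2 = 64 \cdot 64 \cdot 32^2 \approx 4.2 \times 10^{6}$ multiply-accumulate operations, whereas vanilla spatial attention would cost on the order of $(\hat{H}\hat{W})^2 \hat{C} = (64 \cdot 64)^2 \cdot 32 \approx 5.4 \times 10^{8}$; the cost of BIM thus grows linearly rather than quadratically with the number of pixels.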
4 Experiments
In this section, we begin with the experimental setup. Then we demonstrate the effectiveness of our method on five image restoration tasks (i.e., denoising, deraining, low-light image enhancement, dehazing, and deblurring). After that, we conduct ablation studies to verify the design contribution of each component. Due to the limited space, more numerical and visual results (e.g., deshadowing on ISTD [76], desnowing on Snow100K [46], demoiréing on TIP18 [71], and raindrop removal on AGAN-Data [63]) and detailed experimental settings are reported in the supplementary materials.
4.1 Experimental Setup
Metrics. We use the PSNR and SSIM metrics to evaluate the quality of each restored image when its ground-truth image is available. Specifically, the metrics are computed in the RGB color space for most cases, while for deraining they are calculated on the Y channel of the YCbCr color space, following existing works [80, 75]. In the reported results, the best and second-best scores are highlighted and underlined. For the evaluated methods, we report the results from their papers if provided (e.g., DIL [38] for image denoising); otherwise, we retrain the models with their publicly available code or evaluate them with their pre-trained models (e.g., NAFNet [11], trained on GoPro [52] and evaluated on RealBlur [67] for image deblurring). For the others, we use the results reported in [83, 80, 21, 27] (e.g., the result of Restormer [88] on SPAD [78] for image deraining is from [83]).
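The evaluation protocol can be sketched as follows; the BT.601 luma conversion is the standard one, while the helper names and NumPy-based implementation are ours.

```python
import numpy as np

def rgb_to_y(img):
    """Y (luma) channel of YCbCr (BT.601) for an RGB image with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(restored, target, on_y=False, data_range=255.0):
    """PSNR in RGB space by default; set on_y=True for the deraining protocol."""
    if on_y:
        restored, target = rgb_to_y(restored), rgb_to_y(target)
    mse = np.mean((restored.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```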
Method | SIDD [1] PSNR | SIDD [1] SSIM | DND [60] PSNR | DND [60] SSIM | Average PSNR | Average SSIM
---|---|---|---|---|---|---
SDAP [57] | 37.53 | 0.936 | 38.56 | 0.940 | 38.05 | 0.938 |
RIDNet [3] | 38.71 | 0.914 | 39.26 | 0.953 | 38.99 | 0.934 |
IPT [9] | 39.10 | 0.954 | 39.62 | 0.952 | 39.36 | 0.953 |
VDN [86] | 39.28 | 0.909 | 39.38 | 0.952 | 39.33 | 0.931 |
MalleNet [31] | 39.56 | 0.941 | 39.21 | 0.949 | 39.39 | 0.945 |
MSANet [23] | 39.56 | 0.912 | 39.65 | 0.955 | 39.61 | 0.934 |
VIRNet [87] | 39.64 | 0.958 | 39.83 | 0.954 | 39.74 | 0.956 |
MPRNet [90] | 39.71 | 0.958 | 39.80 | 0.954 | 39.76 | 0.956 |
MIRNet-v2 [91] | 39.84 | 0.959 | 39.86 | 0.955 | 39.85 | 0.957 |
DIL [38] | 39.92 | 0.939 | 39.03 | 0.955 | 39.48 | 0.947 |
HIT-T (Ours) | 39.62 | 0.958 | 39.93 | 0.956 | 39.78 | 0.957 |
HIT-B (Ours) | 39.94 | 0.960 | 40.00 | 0.956 | 39.97 | 0.958 |
Architecture Variants. By setting different numbers of feature channels C and Transformer blocks in the 4-level encoder-decoder architecture, we build two variants of HIT: 1) HIT-T sets the number of feature channels to 16 and the numbers of Transformer blocks to [2, 2, 2, 2]; 2) HIT-B sets the number of feature channels to 32 and the numbers of Transformer blocks to [1, 2, 8, 8]. In all experiments, the split window size is 8, and the Transformer blocks use the same number of attention heads as [80].
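The two variants can be summarized in a small configuration sketch (the field names below are ours, used only for illustration):

```python
# Hypothetical configuration summary of the two HIT variants described above.
HIT_VARIANTS = {
    "HIT-T": {"channels": 16, "blocks_per_level": [2, 2, 2, 2], "window_size": 8},
    "HIT-B": {"channels": 32, "blocks_per_level": [1, 2, 8, 8], "window_size": 8},
}
```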
Implementation Details. We train the models using the AdamW optimizer [48] with the recommended parameter settings from [80]. The learning rate is gradually decreased from its initial value using the cosine decay strategy. For data augmentation, we randomly apply horizontal and vertical flips to the training samples. We also adopt the progressive learning strategy, similar to [88, 73]. The code is provided in the supplementary materials to ensure the reproducibility of our results.
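A minimal sketch of this training setup is given below; the concrete learning-rate value, number of epochs, betas, and weight decay are placeholders, not the paper's exact settings.

```python
import torch

def build_training(model, base_lr=2e-4, num_epochs=300):
    """AdamW with cosine learning-rate decay, as described above.
    base_lr and num_epochs are placeholder values."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=base_lr, betas=(0.9, 0.999), weight_decay=0.02
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    return optimizer, scheduler
```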
SPAD [78]
Method | PSNR | SSIM
---|---|---
RESCAN [37] | 38.11 | 0.9797 |
SPANet [78] | 40.24 | 0.9811 |
RCDNet [75] | 43.36 | 0.9831 |
SPAIR [62] | 44.10 | 0.9872 |
Fu et al. [21] | 45.03 | 0.9907 |
Uformer [80] | 46.13 | 0.9913 |
Restormer [88] | 46.25 | 0.9911 |
SCD-Former [26] | 46.89 | 0.9941 |
IDT [83] | 47.34 | 0.9929 |
DRSformer [13] | 48.53 | 0.9924 |
HIT-T (Ours) | 47.16 | 0.9926 |
HIT-B (Ours) | 49.16 | 0.9940 |
SMID [7]
Method | PSNR | SSIM
---|---|---
KinD [96] | 22.18 | 0.634 |
EnlightenGAN [30] | 22.62 | 0.674 |
RetineNet [81] | 22.83 | 0.684 |
DeepUPE [77] | 23.91 | 0.690 |
SID [8] | 24.78 | 0.718 |
RUAS [45] | 25.88 | 0.744 |
Restormer [88] | 26.97 | 0.758 |
Uformer [80] | 27.20 | 0.792 |
SNR-Net [84] | 28.49 | 0.805 |
Retinexformer [5] | 29.15 | 0.815 |
HIT-T (Ours) | 29.16 | 0.813 |
HIT-B (Ours) | 29.37 | 0.821 |
4.2 Image Denoising
We compare HIT with ten state-of-the-art (SOTA) denoising methods: SDAP [57], RIDNet [3], IPT [9], VDN [86], MalleNet [31], MSANet [23], VIRNet [87], MPRNet [90], MIRNet-v2 [91], and DIL [38]. Table 1 shows the quantitative results on the SIDD [1] and DND [60] benchmarks. Note that HIT-B, trained on the SIDD dataset, not only obtains better performance (39.94 dB) on the same dataset than the SOTA (e.g., DIL [38]), but also yields a clear gain (0.97 dB) on the DND dataset, demonstrating its better generalization capability. Figure 4 shows that HIT-B effectively removes noise while preserving image details.
4.3 Image Deraining
We compare HIT with ten SOTA deraining methods: RESCAN [37], SPANet [78], RCDNet [75], SPAIR [62], Fu et al. [21], Uformer [80], Restormer [88], SCD-Former [26], IDT [83], and DRSformer [13]. Table 2 shows the quantitative results of HIT-B on the SPAD [78] benchmark. HIT-B achieves a performance boost of 4.13 dB over the recent approach of Fu et al. [21] and 0.63 dB over the previous best method DRSformer [13]. Figure 5 shows that HIT-B restores a visually better image.
4.4 Low-Light Image Enhancement
We compare HIT with ten SOTA methods for low-light image enhancement: KinD [96], EnlightenGAN [30], RetinexNet [81], DeepUPE [77], SID [8], RUAS [45], Restormer [88], Uformer [80], SNR-Net [84], and Retinexformer [5]. We report quantitative results on the SMID [7] dataset in Table 3. Our HIT-B obtains the best performance among all the compared methods in terms of both PSNR and SSIM. In particular, HIT-B achieves a performance gain of 0.22 dB over the previous best Retinex-based method, Retinexformer [5].
Method | DCP [28] | SGID [4] | AOD-Net [34] | FFA-Net [64] | Uformer [80] | Restormer [88] | AECR-Net [82] | Fourmer [101] | DeHamer [24] | MB-TaylorFormer [65] | HIT-T (Ours) | HIT-B (Ours)
---|---|---|---|---|---|---|---|---|---|---|---|---
PSNR | 10.06 | 13.09 | 13.14 | 14.39 | 15.22 | 15.78 | 15.80 | 15.95 | 16.62 | 16.66 | 15.93 | 17.06 |
SSIM | 0.39 | 0.52 | 0.41 | 0.45 | 0.43 | 0.55 | 0.47 | 0.49 | 0.56 | 0.56 | 0.50 | 0.56 |
Method | RealBlur-R [67] PSNR | RealBlur-R [67] SSIM | RealBlur-J [67] PSNR | RealBlur-J [67] SSIM | Average PSNR | Average SSIM
---|---|---|---|---|---|---
IR-SDE [51] | 32.56 | 0.909 | 23.19 | 0.691 | 27.89 | 0.800 |
NAFNet [11] | 33.63 | 0.944 | 26.33 | 0.856 | 29.98 | 0.900 |
FFTformer [33] | 33.66 | 0.948 | 25.71 | 0.851 | 29.69 | 0.900 |
CODE [98] | 33.81 | 0.939 | 26.25 | 0.801 | 30.03 | 0.870 |
GRL-B [39] | 33.97 | 0.944 | 26.40 | 0.816 | 30.19 | 0.880 |
HIT-T (Ours) | 35.23 | 0.946 | 28.36 | 0.855 | 31.81 | 0.901 |
HIT-B (Ours) | 36.19 | 0.956 | 28.69 | 0.870 | 32.44 | 0.913 |
4.5 Image Dehazing
We compare HIT with ten SOTA dehazing methods, including DCP [28], SGID [4], AOD-Net [34], FFA-Net [64], Uformer [80], AECR-Net [82], Restormer [88], Fourmer [101], DeHamer [24], and MB-TaylorFormer [65]. As shown in Table 4, HIT-B achieves the best PSNR score. Specifically, HIT-B obtains a performance gain of 0.4 dB over the previous best method MB-TaylorFormer [65], and 1.11 dB over the recent method Fourmer [101].
4.6 Image Deblurring
We provide comparisons with five state-of-the-art techniques, including IR-SDE [51], NAFNet [11], FFTformer [33], CODE [98], and GRL-B [39], on the RealBlur [67] benchmark in Table 5. Specifically, HIT-B surpasses the recent methods FFTformer [33] and GRL-B [39] by 2.75 dB and 2.25 dB on average, respectively. It should be noted that our model is trained only on the GoPro dataset, yet HIT-B achieves competitive performance on the RealBlur benchmark, indicating its better generalization capability.
4.7 Model Efficiency
We present a detailed analysis of HIT against six state-of-the-art architectures designed for image deraining, including MPRNet [90], SwinIR [41], Uformer [80], Restormer [88], IDT [83], and DRSformer [13]. As shown in Table 6, HIT-B achieves the best PSNR among all the considered methods. It is clear that our model achieves a good trade-off between restoration ability and computational cost. Uformer [80] and Restormer [88] adopt the common solution of incorporating convolutional operations to enrich local features, and they achieve a clear performance improvement over SwinIR [41]. However, since the core self-attention mechanism in the Transformer hinders these models from adequately capturing local patterns, such a strategy is insufficient for obtaining as much local information as ours. HIT leverages much richer high-frequency information, aided by the proposed modules, and performs better than all the compared methods.
5 Ablation Studies
To better understand the effect of each component, we provide ablation studies and train all baseline variants on the SPAD [78] dataset for fair comparison. GMACs are calculated with an input size of 256×256. The conclusions drawn from image deraining hold for other tasks. Due to the limited space, more detailed analyses and discussions are included in the supplementary material.
Window-wise Injection Module (WIM). To demonstrate the effectiveness of the proposed WIM, we conduct in-depth ablation experiments by comparing it with components proposed for similar fusion purposes in other computer vision tasks. Specifically, we consider (b) the Feature Coupling Unit (FCU) from [59], (c) TransUNet from [10], (d) MobileFormer from [14], and (e) TCM from [42] as replacements for WIM in our experiments.
As shown in Table 7, using the FCU (Table 7b) leads to a substantial performance drop of 2.66 dB. When WIM is replaced with TransUNet (Table 7c), we observe an even larger decline of 3.63 dB. This can be attributed to the fact that the core component (i.e., Batch Normalization) in these designs is less suitable for image restoration [52]. In addition, adopting MobileFormer (Table 7d), which is also an unsuitable component here, causes a drop of 1.64 dB; this decrease may be caused by its significantly reduced computational capacity, designed for deployment on mobile devices. Besides, although the model equipped with TCM (Table 7e) achieves the best scores among the considered variants, a clear performance gap (0.58 dB) with our WIM remains. Compared to these methods, it is noteworthy that WIM is the only one that splits input features into windows, aligning them explicitly with the downstream Transformer branch. This design choice allows the high-frequency details preserved within each window to be exploited by the subsequent attention mechanism. In conclusion, even though these related works introduce various modules to combine CNN and Transformer blocks, few of them consider the indispensable role of high-frequency information for image restoration. Moreover, we present visual comparisons in Figure 7 to demonstrate the effectiveness of the WIM. It improves the model's ability to handle degradation patterns, resulting in a cleaner restored image and more details in the residual image.
Ablation | Variant | Param | GMACs | PSNR
---|---|---|---|---
None | (a) HIT-T (Ours) | 17.29M | 15.76G | 42.98
WIM | (b) FCU [59] | 17.29M | 15.76G | 40.32
WIM | (c) TransUNet [10] | 17.30M | 16.40G | 39.35
WIM | (d) MobileFormer [14] | 17.30M | 20.70G | 41.34
WIM | (e) TCM [42] | 17.29M | 15.80G | 42.40
BIM | (f) Concat [68] | 16.81M | 13.61G | 41.79
BIM | (g) AFF [15] | 17.12M | 15.18G | 41.84
BIM | (h) MDTA [88] | 17.29M | 15.90G | 42.69
SEU | (i) w/o SEU | 17.22M | 15.51G | 41.04
SEU | (j) DPE [36] | 17.22M | 15.54G | 40.48
Bidirectional Interaction Module (BIM). To validate the effectiveness of BIM, we compare it with three baselines: (f) a simple concatenation operation; (g) the asymmetric feature fusion (AFF) in [15]; (h) canceling the cross-query paradigm in BIM, which degrades to the MDTA [88] style. Specifically, BIM (Table 7a), AFF (Table 7g), and the MDTA version of BIM (Table 7h) all take advantage of features at different scales to improve the representations, and they enjoy a clear performance benefit over the simple concatenation (Table 7f). Since features from different scales can emphasize distinct semantic information, directly merging them may lead to semantic conflicts, which can be aggravated as the number of features increases. Compared to handling all available features (>2) in AFF (Table 7g), coping with features from only two scales potentially introduces less semantic confusion (Table 7a and Table 7h). Furthermore, BIM introduces a bidirectional query paradigm to semantically align the features and enhance their representations. In brief, all these dedicated designs make BIM a simple yet effective solution for aggregating features at different scales. In addition, our visualization of the feature map processed by BIM (specifically, the output of the final convolution layer in BIM), shown in Figure 7, clearly illustrates its efficacy. In our case, the final convolution layer in BIM is regarded as the target layer, and the feature map is visualized at its real size. Features at different scales are explored to facilitate the discriminative capability of the deeper attention layers.
Spatial Enhancement Unit (SEU). To evaluate the effectiveness of SEU, we conduct ablation studies by (i) removing SEU and (j) replacing it with DPE [36]. We observe a clear performance degradation when SEU is not used in BIM (Table 7i), and replacing SEU with DPE [36] also leads to a significant reduction (Table 7j). These results demonstrate that SEU plays a necessary role in retaining key spatial information by executing in parallel with the attention calculation, whereas DPE precedes this step. DPE adopts a convolution layer like SEU and produces spatially enhanced features; however, these semantically different features with enhanced spatial information may make it harder for the model to learn satisfactory representations. Meanwhile, the spatial information could be lost during the subsequent channel-wise attention computation within BIM, which leads to a significant performance drop (Table 7j).
Extension to new Baseline. To further demonstrate the effectiveness of our approach in injecting high-frequency information within the Transformer for image restoration, we conduct extension experiments. Specifically, we implement WIM and BIM on the recent work IDT [83] for image deraining. As shown in Table 8, we observe a clear improvement in both PSNR and SSIM metrics (e.g., 0.57 dB on PSNR metric). With the assistance of our modules, the IDT model surpasses the previous SOTA method Uformer-B [80].
6 Conclusion
In this paper, we present a new Transformer-based model (HIT) with injected high-frequency information. We develop the window-wise injection module and the bidirectional interaction module to help the Transformer benefit from crucial local cues when analyzing degraded patterns. We show that HIT can handle a variety of image restoration tasks, including denoising, deraining, deblurring, demoiréing, raindrop removal, dehazing, desnowing, deshadowing, and low-light enhancement, and performs competitively in terms of computational cost and accuracy.
References
- [1] Abdelhamed, A., Lin, S., Brown, M.S.: A high-quality denoising dataset for smartphone cameras. In: CVPR (2018)
- [2] Ancuti, C.O., Ancuti, C., Sbert, M., Timofte, R.: Dense-haze: A benchmark for image dehazing with dense-haze and haze-free images. In: ICIP (2019)
- [3] Anwar, S., Barnes, N.: Real image denoising with feature attention. In: ICCV (2019)
- [4] Bai, H., Pan, J., Xiang, X., Tang, J.: Self-guided image dehazing using progressive feature fusion. TIP (2022)
- [5] Cai, Y., Bian, H., Lin, J., Wang, H., Timofte, R., Zhang, Y.: Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In: ICCV (2023)
- [6] Charbonnier, P., Blanc-Feraud, L., Aubert, G., Barlaud, M.: Two deterministic half-quadratic regularization algorithms for computed imaging. In: ICIP (1994)
- [7] Chen, C., Chen, Q., Do, M.N., Koltun, V.: Seeing motion in the dark. In: ICCV (2019)
- [8] Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: CVPR (2018)
- [9] Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: CVPR (2021)
- [10] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
- [11] Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. In: ECCV (2022)
- [12] Chen, W.T., Fang, H.Y., Hsieh, C.L., Tsai, C.C., Chen, I.H., Ding, J.J., Kuo, S.Y.: All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In: ICCV (2021)
- [13] Chen, X., Li, H., Li, M., Pan, J.: Learning a sparse transformer network for effective image deraining. In: CVPR (2023)
- [14] Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-former: Bridging mobilenet and transformer. In: CVPR (2022)
- [15] Cho, S.J., Ji, S.W., Hong, J.P., Jung, S.W., Ko, S.J.: Rethinking coarse-to-fine approach in single image deblurring. In: ICCV (2021)
- [16] Cho, S., Lee, S.: Fast motion deblurring. ACM TOG (2009)
- [17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
- [18] d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: Convit: Improving vision transformers with soft convolutional inductive biases. In: ICML (2021)
- [19] Fan, Y., Yu, J., Mei, Y., Zhang, Y., Fu, Y., Liu, D., Huang, T.S.: Neural sparse representation for image restoration. In: NeurIPS (2020)
- [20] Fergus, R., Singh, B., Hertzmann, A., Roweis, S.T., Freeman, W.T.: Removing camera shake from a single photograph. ACM TOG (2006)
- [21] Fu, X., Xiao, J., Zhu, Y., Liu, A., Wu, F., Zha, Z.J.: Continual image deraining with hypergraph convolutional networks. TPAMI (2023)
- [22] Gao, S., Zhuang, X.: Rank-one network: An effective framework for image restoration. TPAMI (2022)
- [23] Gou, Y., Hu, P., Lv, J., Zhou, J.T., Peng, X.: Multi-scale adaptive network for single image denoising. In: NeurIPS (2022)
- [24] Guo, C.L., Yan, Q., Anwar, S., Cong, R., Ren, W., Li, C.: Image dehazing transformer with transmission-aware 3d position embedding. In: CVPR (2022)
- [25] Guo, S., Yong, H., Zhang, X., Ma, J., Zhang, L.: Spatial-frequency attention for image denoising. arXiv preprint arXiv:2302.13598 (2023)
- [26] Guo, Y., Xiao, X., Chang, Y., Deng, S., Yan, L.: From sky to the ground: A large-scale benchmark and simple baseline towards real rain removal. In: ICCV (2023)
- [27] He, B., Wang, C., Shi, B., Duan, L.: Mop moiré patterns using mopnet. In: ICCV (2019)
- [28] He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. TPAMI (2010)
- [29] Huang, H., Luo, M., He, R.: Memory uncertainty learning for real-world single image deraining. TPAMI (2023)
- [30] Jiang, Y., Gong, X., Liu, D., Cheng, Y., Fang, C., Shen, X., Yang, J., Zhou, P., Wang, Z.: Enlightengan: Deep light enhancement without paired supervision. TIP (2021)
- [31] Jiang, Y., Wronski, B., Mildenhall, B., Barron, J.T., Wang, Z., Xue, T.: Fast and high quality image denoising via malleable convolution. In: ECCV (2022)
- [32] Ke, R., Schönlieb, C.B.: Unsupervised image restoration using partially linear denoisers. TPAMI (2022)
- [33] Kong, L., Dong, J., Ge, J., Li, M., Pan, J.: Efficient frequency domain-based transformers for high-quality image deblurring. In: CVPR (2023)
- [34] Li, B., Peng, X., Wang, Z., Xu, J., Feng, D.: Aod-net: All-in-one dehazing network. In: ICCV (2017)
- [35] Li, C., Guo, C., Han, L., Jiang, J., Cheng, M.M., Gu, J., Loy, C.C.: Low-light image and video enhancement using deep learning: A survey. TPAMI (2022)
- [36] Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unifying convolution and self-attention for visual recognition. TPAMI (2023)
- [37] Li, X., Wu, J., Lin, Z., Liu, H., Zha, H.: Recurrent squeeze-and-excitation context aggregation net for single image deraining. In: ECCV (2018)
- [38] Li, X., Li, B., Jin, X., Lan, C., Chen, Z.: Learning distortion invariant representation for image restoration from a causality perspective. In: CVPR (2023)
- [39] Li, Y., Fan, Y., Xiang, X., Demandolx, D., Ranjan, R., Timofte, R., Gool, V.L.: Efficient and explicit modelling of image hierarchies for image restoration. In: CVPR (2023)
- [40] Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
- [41] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: ICCV Workshops (2021)
- [42] Liu, J., Sun, H., Katto, J.: Learned image compression with mixed transformer-cnn architectures. In: CVPR (2023)
- [43] Liu, J., Yan, M., Zeng, T.: Surface-aware blind image deblurring. TPAMI (2021)
- [44] Liu, L., Xie, L., Zhang, X., Yuan, S., Chen, X., Zhou, W., Li, H., Tian, Q.: Tape: Task-agnostic prior embedding for image restoration. In: ECCV (2022)
- [45] Liu, R., Ma, L., Zhang, J., Fan, X., Luo, Z.: Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In: CVPR (2021)
- [46] Liu, Y.F., Jaw, D.W., Huang, S.C., Hwang, J.N.: Desnownet: Context-aware deep network for snow removal. TIP (2018)
- [47] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
- [48] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [49] Lu, Y., Lin, Y., Wu, H., Luo, Y., Zheng, X., Wang, L.: All one needs to know about priors for deep image restoration and enhancement: A survey. arXiv preprint arXiv:2206.02070 (2022)
- [50] Lugmayr, A., Danelljan, M., Timofte, R., Kim, K.w., Kim, Y., Lee, J.y., Li, Z., Pan, J., Shim, D., Song, K.U., Tang, J., Wang, C., Zhao, Z.: Ntire 2022 challenge on learning the super-resolution space. In: CVPR Workshops (2022)
- [51] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Image restoration with mean-reverting stochastic differential equations. In: ICML (2023)
- [52] Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: CVPR (2017)
- [53] Nah, S., Son, S., Lee, S., Timofte, R., Lee, K.M.: Ntire 2021 challenge on image deblurring. In: CVPR Workshops (2021)
- [54] Pan, J., Liu, S., Sun, D., Zhang, J., Liu, Y., Ren, J., Li, Z., Tang, J., Lu, H., Tai, Y.W., et al.: Learning dual convolutional neural networks for low-level vision. In: CVPR (2018)
- [55] Pan, J., Sun, D., Pfister, H., Yang, M.H.: Deblurring images via dark channel prior. TPAMI (2017)
- [56] Pan, X., Zhan, X., Dai, B., Lin, D., Loy, C.C., Luo, P.: Exploiting deep generative prior for versatile image restoration and manipulation. TPAMI (2022)
- [57] Pan, Y., Liu, X., Liao, X., Cao, Y., Ren, C.: Random sub-samples generation for self-supervised real image denoising. In: ICCV (2023)
- [58] Park, N., Kim, S.: How do vision transformers work? In: ICLR (2022)
- [59] Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., Ye, Q.: Conformer: Local features coupling global representations for visual recognition. In: ICCV (2021)
- [60] Plotz, T., Roth, S.: Benchmarking denoising algorithms with real photographs. In: CVPR (2017)
- [61] Potlapalli, V., Zamir, S.W., Khan, S., Khan, F.S.: Promptir: Prompting for all-in-one blind image restoration. arXiv preprint arXiv:2306.13090 (2023)
- [62] Purohit, K., Suin, M., Rajagopalan, A., Boddeti, V.N.: Spatially-adaptive image restoration using distortion-guided networks. In: ICCV (2021)
- [63] Qian, R., Tan, R.T., Yang, W., Su, J., Liu, J.: Attentive generative adversarial network for raindrop removal from a single image. In: CVPR (2018)
- [64] Qin, X., Wang, Z., Bai, Y., Xie, X., Jia, H.: Ffa-net: Feature fusion attention network for single image dehazing. In: AAAI (2020)
- [65] Qiu, Y., Zhang, K., Wang, C., Luo, W., Li, H., Jin, Z.: Mb-taylorformer: Multi-branch efficient transformer expanded by taylor formula for image dehazing. In: ICCV (2023)
- [66] Ren, W., Zhang, J., Pan, J., Liu, S., Ren, J.S., Du, J., Cao, X., Yang, M.H.: Deblurring dynamic scenes via spatially varying recurrent neural networks. TPAMI (2022)
- [67] Rim, J., Lee, H., Won, J., Cho, S.: Real-world blur dataset for learning and benchmarking deblurring algorithms. In: ECCV (2020)
- [68] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
- [69] Song, X., Zhou, D., Li, W., Dai, Y., Shen, Z., Zhang, L., Li, H.: Tusr-net: Triple unfolding single image dehazing with self-regularization and dual feature to pixel attention. TIP (2023)
- [70] Sun, J., Cao, W., Xu, Z., Ponce, J.: Learning a convolutional neural network for non-uniform motion blur removal. In: CVPR (2015)
- [71] Sun, Y., Yu, Y., Wang, W.: Moiré photo restoration using multiresolution convolutional neural networks. TIP (2018)
- [72] Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: ICML (2017)
- [73] Tsai, F.J., Peng, Y.T., Lin, Y.Y., Tsai, C.C., Lin, C.W.: Stripformer: Strip transformer for fast image deblurring. In: ECCV (2022)
- [74] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
- [75] Wang, H., Xie, Q., Zhao, Q., Meng, D.: A model-driven deep neural network for single image rain removal. In: CVPR (2020)
- [76] Wang, J., Li, X., Yang, J.: Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In: CVPR (2018)
- [77] Wang, R., Zhang, Q., Fu, C.W., Shen, X., Zheng, W.S., Jia, J.: Underexposed photo enhancement using deep illumination estimation. In: CVPR (2019)
- [78] Wang, T., Yang, X., Xu, K., Chen, S., Zhang, Q., Lau, R.W.: Spatial attentive single-image deraining with a high quality real rain dataset. In: CVPR (2019)
- [79] Wang, W., Wen, F., Yan, Z., Liu, P.: Optimal transport for unsupervised denoising learning. TPAMI (2023)
- [80] Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer for image restoration. In: CVPR (2022)
- [81] Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light enhancement. In: BMVC (2018)
- [82] Wu, H., Qu, Y., Lin, S., Zhou, J., Qiao, R., Zhang, Z., Xie, Y., Ma, L.: Contrastive learning for compact single image dehazing. In: CVPR (2021)
- [83] Xiao, J., Fu, X., Liu, A., Wu, F., Zha, Z.J.: Image de-raining transformer. TPAMI (2022)
- [84] Xu, X., Wang, R., Fu, C.W., Jia, J.: Snr-aware low-light image enhancement. In: CVPR (2022)
- [85] Yue, H., Cheng, Y., Mao, Y., Cao, C., Yang, J.: Recaptured screen image demoiréing in raw domain. TMM (2022)
- [86] Yue, Z., Yong, H., Zhao, Q., Meng, D., Zhang, L.: Variational denoising network: Toward blind noise modeling and removal. In: NeurIPS (2019)
- [87] Yue, Z., Yong, H., Zhao, Q., Zhang, L., Meng, D., Wong, K.Y.K.: Deep variational network toward blind image restoration. TPAMI (2024)
- [88] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: CVPR (2022)
- [89] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Cycleisp: Real image restoration via improved data synthesis. In: CVPR (2020)
- [90] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage progressive image restoration. In: CVPR (2021)
- [91] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Learning enriched features for fast image restoration and enhancement. TPAMI (2023)
- [92] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV (2014)
- [93] Zhang, K., Li, Y., Zuo, W., Zhang, L., Van Gool, L., Timofte, R.: Plug-and-play image restoration with deep denoiser prior. TPAMI (2021)
- [94] Zhang, K., Luo, W., Zhong, Y., Ma, L., Stenger, B., Liu, W., Li, H.: Deblurring by realistic blurring. In: CVPR (2020)
- [95] Zhang, K., Ren, W., Luo, W., Lai, W.S., Stenger, B., Yang, M.H., Li, H.: Deep image deblurring: A survey. IJCV (2022)
- [96] Zhang, Y., Zhang, J., Guo, X.: Kindling the darkness: A practical low-light image enhancer. In: ACMMM (2019)
- [97] Zhang, Y., Lin, M., Li, X., Liu, H., Wang, G., Chao, F., Shuai, R., Wen, Y., Chen, X., Ji, R.: Real-time image demoireing on mobile devices. In: ICLR (2023)
- [98] Zhao, H., Gou, Y., Li, B., Peng, D., Lv, J., Peng, X.: Comprehensive and delicate: An efficient transformer for image restoration. In: CVPR (2023)
- [99] Zheng, B., Yuan, S., Yan, C., Tian, X., Zhang, J., Sun, Y., Liu, L., Leonardis, A., Slabaugh, G.: Learning frequency domain priors for image demoireing. TPAMI (2022)
- [100] Zheng, C., Zhang, Y., Gu, J., Zhang, Y., Kong, L., Yuan, X.: Cross aggregation transformer for image restoration. In: NeurIPS (2022)
- [101] Zhou, M., Huang, J., Guo, C.L., Li, C.: Fourmer: an efficient global modeling paradigm for image restoration. In: ICML (2023)
- [102] Zou, W., Jiang, M., Zhang, Y., Chen, L., Lu, Z., Wu, Y.: Sdwnet: A straight dilated network with wavelet transformation for image deblurring. In: ICCV (2021)