SRFormerV2: Taking a Closer Look at Permuted Self-Attention for Image Super-Resolution
Abstract
Previous works have shown that increasing the window size of Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve performance, but the computational overhead also grows considerably as the window size increases. In this paper, we present SRFormer, a simple but novel method that enjoys the benefit of large-window self-attention while introducing even less computational burden. The core of our SRFormer is the permuted self-attention (PSA), which strikes an appropriate balance between the channel and spatial information for self-attention. Without any bells and whistles, our SRFormer achieves a 33.86dB PSNR score on the Urban100 dataset, which is 0.46dB higher than that of SwinIR while using fewer parameters and computations. In addition, we also attempt to scale up the model by further enlarging the window size and the channel number to explore the potential of Transformer-based models. Experiments show that our scaled model, named SRFormerV2, further improves the results and achieves state-of-the-art performance. We hope our simple and effective approach can serve as a useful reference for future research in super-resolution model design. The homepage is https://z-yupeng.github.io/SRFormer/.
Index Terms:
Super-resolution, vision transformer, permuted self-attention, window size
1 Introduction
Single image super-resolution (SR) aims to restore a high-quality image from its degraded low-resolution version. Exploring efficient and effective super-resolution algorithms has been a hot research topic in computer vision, with a variety of applications [2, 3, 4]. Since the pioneering works [5, 6, 7, 8, 9, 10], CNN-based methods have long been the mainstream for image super-resolution. These methods mostly take advantage of residual learning [11, 8, 10, 6, 12, 13], dense connections [14, 15, 16], or channel attention [17, 18] to construct network architectures, making great contributions to the development of super-resolution models.
Despite the success of CNN-based models in super-resolution, recent works [19, 20, 21, 22] have shown that Transformer-based models perform better. They observe that self-attention, which builds pairwise relationships, is a more effective way to produce high-quality super-resolution images than convolutions. One representative work among them is SwinIR [20], which introduces the Swin Transformer [23] to image super-resolution and greatly improves upon CNN-based models on various benchmarks. Later, a variety of works, such as SwinFIR [21], ELAN [22], and HAT [24], further developed SwinIR and used Transformers to design different network architectures for SR.

The methods above reveal that properly enlarging the window for the shifted window self-attention in SwinIR can result in a clear performance gain (see Fig. 1). However, the computational burden becomes an important issue as the window size grows. In addition, Transformer-based methods rely on self-attention and require larger channel numbers than previous CNN-based methods [17, 15, 6]. To explore efficient and effective super-resolution algorithms, a natural question is: how would the performance change if we reduce the channel number and increase the window size?
Motivated by the above question, in our conference version of this work [1], we present permuted self-attention (PSA), an efficient way to build pairwise relationships within large windows (e.g., ). The intention is to enable more pixels to participate in the attention map computation while introducing no extra computational burden. To this end, we propose to shrink the channel dimensions of the key and value matrices and adopt a permutation operation to convey part of the spatial information into the channel dimension. In this way, despite the channel reduction, there is no loss of spatial information, and each attention head is also allowed to keep a proper number of channels to produce expressive attention maps [25]. In addition, we also improve the original feed-forward network (FFN) by adding a depth-wise convolution between the two linear layers, which we found helps in high-frequency component recovery.
Given the proposed PSA, we construct a novel network for SR in the conference version, termed SRFormer. Benefiting from the proposed PSA, our SRFormer achieves good performance on five widely used datasets. Notably, our SRFormer trained on only the DIV2K dataset [10] achieves a 33.86dB PSNR score on the challenging Urban100 dataset [26]. This result is much higher than those of the popular SwinIR (33.40dB) and ELAN (33.44dB). A similar phenomenon can be observed on the other scale factors. In addition, we perform experiments using a light version of our SRFormer, which performs better than previous lightweight SR models on all benchmarks.
Despite the good performance of our SRFormer, its scaling ability has not been well explored. In this paper, we further investigate the scaling ability of SRFormer and conduct a series of scaling operations. To our knowledge, thanks to PSA's efficient computation of window attention, we are the first to implement self-attention computation within such large windows. Experimental results demonstrate that conducting attention calculations within larger windows can further improve performance, showing our method's excellent scaling capability. Through a series of scaling operations, our new model, named SRFormerV2, is able to surpass the state-of-the-art HAT method [24] with lower computational cost and fewer parameters. To sum up, the contributions of this paper can be summarized as follows:
• We propose a novel permuted self-attention mechanism for image super-resolution, which enjoys large-window self-attention by transferring spatial information into the channel dimension. By leveraging it, we are able to implement the self-attention mechanism within very large windows at an acceptable time complexity for SR.
• We build a new Transformer-based super-resolution network, dubbed SRFormer, based on the proposed PSA and an FFN improved from the frequency perspective (ConvFFN).
• Based on SRFormer, we further explore the scaling capability of the model at the macro structural level and consider the use of both large- and small-window PSA. The outcome of the above investigations is the new SRFormerV2, which surpasses previous state-of-the-art models with fewer parameters and lower computational cost.

2 Related Work
In this section, we briefly review the literature on image super-resolution. We first introduce the CNN-based methods and then describe the recently popular Transformer-based models.
2.1 CNN-Based Image Super-Resolution
Since SRCNN [5] first introduced CNNs into image super-resolution (SR), many CNN-based SR models have emerged. DRCN [27] and DRRN [11] introduce recursive convolutional networks to increase the depth of the network without increasing the number of parameters. Some early CNN-based methods [28, 5, 27, 11] interpolate the low-resolution (LR) image as input, which results in computationally expensive feature extraction. To accelerate SR inference, FSRCNN [29] extracts features at the LR scale and conducts an upsampling operation at the end of the network. This pipeline with pixel-shuffle upsampling [9] has been widely used in later works [22, 17, 20]. LapSRN [30] and DBPN [31] perform upsampling while extracting features to learn the correlation between LR and HR images. There are also works [8, 14, 32, 33] that use GANs [34] to generate realistic textures during reconstruction. MemNet [28], RDN [15], and HAN [35] efficiently aggregate intermediate features to enhance the quality of the reconstructed images. Non-local attention [36] has also been extensively explored in SR to better model long-range dependencies. Methods of this type include CS-NL [37], NLSA [38], SAN [39], IGNN [40], etc.
2.2 Vision Transformers
Transformers have recently shown great potential in a variety of vision tasks, including image classification [41, 42, 43, 44, 45], object detection [46, 47, 48], semantic segmentation [49, 50, 51], image restoration [52, 20, 19], etc. Among these, the most representative work is the Vision Transformer (ViT) [41], which shows that Transformers can outperform convolutional neural networks at feature encoding. The application of Transformers in low-level vision mainly falls into two categories: generation [53, 54, 55, 56] and restoration. The restoration tasks can be further divided into video restoration [57, 58, 59, 60, 61] and image restoration [19, 52, 62, 63].
Image super-resolution is an important task in image restoration. It needs to preserve the structural information of the input, which poses a great challenge for Transformer-based model design. IPT [19] is a large pre-trained model based on the Transformer encoder-decoder structure and has been applied to super-resolution, denoising, and deraining. Based on the Swin Transformer encoder [23], SwinIR [20] performs self-attention within local windows during feature extraction and achieves very strong performance. ELAN [22] simplifies the architecture of SwinIR and computes self-attention in different window sizes to collect correlations between long-range pixels.
Our SRFormer is also based on the Transformer. Unlike the methods above that directly leverage self-attention to build models, our SRFormer focuses on the self-attention mechanism itself. We intend to study how to compute self-attention in a large window to improve the performance of SR models without increasing the parameters and computational cost.
3 Method
3.1 Overall Architecture
The overall architecture of our SRFormer and SRFormerV2 is shown in Fig. 2, consisting of three parts: a pixel embedding layer, a feature encoder, and a high-resolution image reconstruction layer. Following previous works [20, 22], the pixel embedding layer is a single convolution that transforms the low-resolution RGB image into feature embeddings, which are then sent into the feature encoder with a hierarchical structure. The encoder consists of several permuted self-attention groups, each of which contains several permuted self-attention blocks followed by a convolution, and a final convolution is added at the end of the encoder. The sum of the encoder output and the initial embeddings is fed into the reconstruction layer, which contains a convolution and a sub-pixel convolutional layer [9] to reconstruct the high-resolution image. We compute the L1 loss between the reconstructed high-resolution image and the ground-truth HR image to optimize our models.
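To make the pipeline concrete, below is a minimal PyTorch sketch of this three-part structure, with the PSA groups abstracted into generic convolutional blocks; the layer widths, block count, and reconstruction head are illustrative assumptions rather than the exact SRFormer configuration.

```python
import torch
import torch.nn as nn


class SRPipeline(nn.Module):
    """Toy pipeline: pixel embedding -> residual feature encoder -> pixel-shuffle head."""

    def __init__(self, dim: int = 180, scale: int = 2, num_blocks: int = 4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 3, padding=1)              # pixel embedding layer
        blocks = [nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
                  for _ in range(num_blocks)]                     # stand-in for PSA groups
        self.encoder = nn.Sequential(*blocks, nn.Conv2d(dim, dim, 3, padding=1))
        self.reconstruct = nn.Sequential(                         # HR reconstruction layer
            nn.Conv2d(dim, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):
        feat = self.embed(lr)
        feat = feat + self.encoder(feat)                          # residual over the encoder
        return self.reconstruct(feat)


model = SRPipeline(scale=2)
sr = model(torch.randn(1, 3, 48, 48))
print(sr.shape)                                                   # torch.Size([1, 3, 96, 96])
loss = nn.L1Loss()(sr, torch.randn(1, 3, 96, 96))                 # L1 training objective
```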

3.2 Permuted Self-Attention Block
The core of our SRFormer and SRFormerV2 is the permuted self-attention block (PAB), which consists of a permuted self-attention (PSA) layer and a convolutional feed-forward network (ConvFFN).
Permuted self-attention. As shown in Fig. 3(b), given an input feature map X with C channels and a token reduction factor r, we first split X into non-overlapping square windows with side length S. Then, we use three linear layers to obtain the query, key, and value matrices Q, K, and V:
\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V. \tag{1} \]
Here, Q keeps the same channel dimension C as X, while K and V compress the channel dimension to C/r². After that, to enable more tokens to get involved in the self-attention calculation without increasing the computational cost, we propose to permute the spatial tokens in K and V into the channel dimension, attaining permuted tokens K_P and V_P.
We use Q together with the shrunken K_P and V_P to perform the self-attention operation. In this way, the window size for K_P and V_P is reduced to S/r × S/r, but their channel dimension remains C, which guarantees the expressiveness of the attention map generated by each attention head [25]. The formulation of the proposed PSA can be written as follows:
\[ \mathrm{PSA}(Q, K_P, V_P) = \mathrm{Softmax}\!\left(\frac{Q K_P^{\top}}{\sqrt{d}} + B\right) V_P, \tag{2} \]
where B is an aligned relative position embedding that can be obtained by interpolating the original one defined in [23], since the window size of Q does not match that of K_P, and √d is a scaling factor as defined in [41]. Note that the above equation can easily be converted to the multi-head version by splitting the channels into multiple groups.

Our PSA transfers spatial information to the channel dimension. It follows two key design principles: i) We do not downsample the tokens first as done in [49, 44] but allow each token to participate in the self-attention computation independently, which enables more representative attention maps. We discuss more variants of our PSA in Sec. 3.3 and show more results in our experiment section. ii) In contrast to the original self-attention illustrated in Fig. 3(a), PSA can be conducted in a large window using even fewer computations than SwinIR with its default window while attaining better performance.
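For clarity, the snippet below is a minimal PyTorch sketch of PSA for the tokens of a single large window under the notation above: Q keeps the full channel dimension C, while K and V are compressed to C/r² and then permuted so that each r×r spatial neighborhood folds into the channel dimension. The relative position embedding B is omitted for brevity, and the module names and hyper-parameters are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn


class PermutedSelfAttention(nn.Module):
    """PSA for one large window of tokens (relative position bias B omitted)."""

    def __init__(self, dim: int = 180, window_size: int = 24, r: int = 2, heads: int = 6):
        super().__init__()
        assert window_size % r == 0 and dim % (r * r) == 0 and dim % heads == 0
        self.dim, self.win, self.r, self.heads = dim, window_size, r, heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)                  # Q keeps the full channel dim C
        self.to_kv = nn.Linear(dim, 2 * dim // (r * r))  # K, V compressed to C / r^2
        self.proj = nn.Linear(dim, dim)

    def _permute(self, t):
        # fold each r x r spatial neighborhood into the channel dimension:
        # (B, S*S, C/r^2) -> (B, (S/r)^2, C)
        b = t.shape[0]
        s, r, c = self.win, self.r, self.dim
        t = t.reshape(b, s // r, r, s // r, r, c // (r * r))
        return t.permute(0, 1, 3, 2, 4, 5).reshape(b, (s // r) ** 2, c)

    def forward(self, x):
        # x: (B * num_windows, S*S, C) tokens of one window
        b, n, c = x.shape
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim=-1)            # (B, S^2, C/r^2) each
        k, v = self._permute(k), self._permute(v)        # (B, (S/r)^2, C)

        h, d = self.heads, c // self.heads
        q = q.view(b, n, h, d).transpose(1, 2)           # (B, h, S^2, d)
        k = k.view(b, -1, h, d).transpose(1, 2)          # (B, h, (S/r)^2, d)
        v = v.view(b, -1, h, d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (B, h, S^2, (S/r)^2)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


psa = PermutedSelfAttention()
print(psa(torch.randn(4, 24 * 24, 180)).shape)           # torch.Size([4, 576, 180])
```

Note that the attention map has shape S² × (S/r)², so every token in the large window still issues a query, while only the permuted keys and values are shrunk, which is what keeps the cost manageable.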
ConvFFN. Previous works have demonstrated that self-attention can be viewed as a low-pass filter [64, 65]. To better restore high-frequency information, a convolution is often added at the end of each group of Transformer blocks, as done in SwinIR [20]. Different from SwinIR, in our PAB, we propose to add a local depth-wise convolution branch between the two linear layers of the FFN block to assist in encoding more details. We name the new block ConvFFN. We empirically find that this operation adds nearly no computation but can compensate for the loss of high-frequency information caused by self-attention, as shown in Fig. 4, where we plot the power spectra of the feature maps produced by our SRFormer with FFN and with ConvFFN. Comparing the two, we can see that ConvFFN clearly increases high-frequency information and hence yields better results, as listed in Tab. I.
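A minimal sketch of such a ConvFFN block is given below: a standard two-layer FFN with an extra depth-wise convolution branch between the linear layers. The expansion ratio and the kernel size here are assumptions for illustration; the ablation in Table II compares different kernel sizes.

```python
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    """FFN with a depth-wise convolution branch between the two linear layers."""

    def __init__(self, dim: int, expansion: int = 2, kernel_size: int = 5):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,         # local branch
                                padding=kernel_size // 2, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        # x: (B, h*w, dim) token sequence; (h, w) is the spatial resolution
        x = self.act(self.fc1(x))
        b, n, c = x.shape
        spatial = x.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.dwconv(spatial).flatten(2).transpose(1, 2)      # add local details
        return self.fc2(x)


ffn = ConvFFN(dim=180)
tokens = torch.randn(2, 64 * 64, 180)          # a 64x64 feature map with 180 channels
print(ffn(tokens, 64, 64).shape)               # torch.Size([2, 4096, 180])
```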
3.3 Large-Window Self-Attention Variants
To provide guidance for the design of large-window self-attention and demonstrate the advantage of our PSA, here, we introduce another two large-window self-attention variants. The quantitative comparisons and analysis can be found in our experiment section.
Token Reduction. The first way to introduce large-window self-attention while avoiding an increase in computational cost is to reduce the number of tokens, as done in [49]. Let r and S be the reduction factor and the window size. Given an input X, we can adopt a depth-wise convolution with kernel size and stride r to reduce the number of key and value tokens in each window to (S/r)², while the query keeps all S² tokens. The query and the reduced key are used to compute the attention scores, and the matrix multiplication between the attention scores and the reduced value yields an output with the same number of tokens as the query.
Token Sampling. The second way to achieve large-window self-attention is to randomly sample tokens from each window for the key and value according to a given sampling ratio. Given the input, the query shares the same shape with it, but the numbers of key and value tokens are reduced according to the sampling ratio. In this way, as long as the ratio is fixed, the computational cost increases only linearly as the window size gets larger. A drawback of token sampling is that randomly selecting a portion of tokens loses the structural information of the scene content, which is essential for image super-resolution. We will show more numerical results in our experiment section.
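To make the two variants concrete, the sketch below shows only how the key/value tokens are shrunk in each case; the helper names, the untrained strided depth-wise convolution, and the tensor shapes are illustrative assumptions, not the exact implementations evaluated in Table III.

```python
import torch
import torch.nn as nn


def token_reduction(kv: torch.Tensor, window: int, r: int) -> torch.Tensor:
    """Shrink key/value tokens with a strided depth-wise conv (kernel = stride = r)."""
    b, n, c = kv.shape                                   # n = window * window
    reduce = nn.Conv2d(c, c, kernel_size=r, stride=r, groups=c)   # untrained, shapes only
    kv = kv.transpose(1, 2).reshape(b, c, window, window)
    return reduce(kv).flatten(2).transpose(1, 2)         # (b, (window // r) ** 2, c)


def token_sampling(kv: torch.Tensor, ratio: float) -> torch.Tensor:
    """Randomly keep a fraction of the key/value tokens in each window."""
    b, n, c = kv.shape
    idx = torch.randperm(n)[: max(1, int(n * ratio))]    # same random subset per batch
    return kv[:, idx, :]


kv = torch.randn(4, 24 * 24, 180)
print(token_reduction(kv, window=24, r=2).shape)         # torch.Size([4, 144, 180])
print(token_sampling(kv, ratio=0.25).shape)              # torch.Size([4, 144, 180])
```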

4 SRFormerV2: Scaling the SRFormer
In this section, we further explore the potential of our SRFormer by scaling it up as shown in Fig. 5 and present a new version, SRFormerV2. We intend to figure out two things: i) Whether PSA with a larger window can further boost the performance; ii) How to balance the channel and spatial information, as well as the local and global information to maximize the effect of PSA.
Scaling the window size. The original SRFormer employs permuted self-attention to perform SR reconstruction efficiently. By compressing the channel information and transferring spatial information into the channel dimension, PSA enjoys large-window self-attention with fewer computations, boosting the performance over previous Transformer-based models. It is therefore natural to investigate whether even larger windows could further lift the performance.
To figure this out, we conduct more experiments that further enlarge the window size, as shown in Tab. IV. We find that on low-resolution test sets, such as Set5 and B100, the performance no longer improves once the window size is scaled up beyond a certain point. However, on high-resolution test sets, such as Urban100 and Manga109, notable performance improvements can still be achieved even when we reach the largest window size allowed by our computational resources. This finding provides strong evidence for the significance of larger attention windows in image super-resolution, which is inspiring for super-resolution backbone design. It is worth noting that, by leveraging PSA, our computation remains reasonable even at the largest window size, being more efficient and effective than the recent state-of-the-art model HAT. In the end, we choose a window size that strikes a balance between computational cost and performance.
Another advantage we observe when increasing the window size is superiority in the structural similarity index measure (SSIM). As shown in Tab. V, our SRFormer almost always achieves the highest SSIM values, even in cases where its peak signal-to-noise ratio (PSNR) values are similar to those of the compared models. SSIM [66] measures the differences between the structural properties of image regions. Instead of simply computing the absolute error between images as PSNR does, SSIM takes into account known properties of the human visual system.
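For reference, the standard single-scale SSIM between two image patches x and y, as defined in [66], compares local luminance, contrast, and structure statistics:

\[ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \]

where μ_x and μ_y are local means, σ_x² and σ_y² are local variances, σ_xy is the local covariance, and C_1, C_2 are small constants that stabilize the division.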
Scaling in the channel aspect. Another scaling exploration is to enlarge the channel number. The original PSA compresses the channel number, so some channel information is sacrificed. By scaling the channel number, we hope to compensate for this loss. As shown in the channel-aspect bracket of Fig. 5, two explorations are carried out: i) We enlarge the channel number to preserve more channel information. Increasing the number of channels allows more information to be used during inference, which results in significant performance gains, though it also increases the computational load. By combining this with the window-size scaling described earlier, we achieve significant improvements over the original SRFormer. ii) We reduce the degree of channel compression. Our permuted self-attention compresses the channel dimension and transfers spatial information into it; by increasing the number of channels kept after compression, we can reduce the loss of channel information. This brings some improvements, but it is not adopted in SRFormerV2 after weighing the performance gain against the computational overhead.
Aggregating Global and Local information. Based on PSA, we are able to facilitate a broad range of information interchange at a very small computational cost. However, as shown in [67], local information also plays a significant role in enhancing SR performance. During this exploration, we hope to introduce local spatial information into our large-window self-attention SR network, so we rethink the macro architecture of the network and attempt to involve more local information. Following previous SR models [20], all attention windows in SRFormer have the same size. However, thanks to PSA, our attention window is particularly large, which is conducive to aggregating global information. To further improve performance, we consider whether we could also introduce some blocks with small attention windows to help capture local information. Therefore, prior to every two permuted self-attention blocks (abbreviated as PAB, shown in Fig. 2), we insert a block with a small window size to aggregate local spatial information before gathering global information, as sketched below. We find that such an approach is particularly useful for enhancing performance.
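A toy sketch of this macro arrangement is given below: one small-window block precedes every two large-window blocks within a group. The PAB is abstracted into a plain configuration record, and the concrete window sizes are placeholders chosen for illustration, not the values used in SRFormerV2.

```python
from dataclasses import dataclass


@dataclass
class PABConfig:
    dim: int
    window_size: int


def build_group(num_pairs: int, dim: int, small_win: int = 8, large_win: int = 24):
    """One small-window PAB before every two large-window PABs."""
    blocks = []
    for _ in range(num_pairs):
        blocks.append(PABConfig(dim, small_win))          # aggregate local information first
        blocks.extend([PABConfig(dim, large_win)] * 2)    # then gather global information
    return blocks


print([b.window_size for b in build_group(2, 180)])       # [8, 24, 24, 8, 24, 24]
```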
By combining the aforementioned enhancements as shown in Fig. 5, we upgrade SRFormer to SRFormerV2, which achieves a new state of the art. With a similar computational cost, it clearly outperforms the previous state-of-the-art HAT. We also design a version with even less computational cost, named SRFormerV2-S, which still achieves performance comparable to that of HAT.
Method | Window size | Params | MACs | SET5 [68] PSNR | SSIM | SET14 [69] PSNR | SSIM | B100 [70] PSNR | SSIM | Urban100 [26] PSNR | SSIM | Manga109 [71] PSNR | SSIM
SwinIR [20] | 11.75M | 2868G | 38.24 | 0.9615 | 33.94 | 0.9212 | 32.39 | 0.9023 | 33.09 | 0.9373 | 39.34 | 0.9784 | |
11.82M | 3107G | 38.30 | 0.9617 | 34.04 | 0.9220 | 32.42 | 0.9026 | 33.28 | 0.9381 | 39.44 | 0.9788 | ||
11.91M | 3441G | 38.32 | 0.9618 | 34.00 | 0.9212 | 32.44 | 0.9030 | 33.40 | 0.9394 | 39.53 | 0.9791 | ||
SRFormer w/o ConvFFN | 9.97M | 2381G | 38.23 | 0.9615 | 34.00 | 0.9216 | 32.37 | 0.9023 | 32.99 | 0.9367 | 39.30 | 0.9786 | |
9.99M | 2465G | 38.25 | 0.9616 | 33.98 | 0.9209 | 32.38 | 0.9022 | 33.09 | 0.9371 | 39.42 | 0.9789 | ||
10.06M | 2703G | 38.30 | 0.9618 | 34.08 | 0.9225 | 32.43 | 0.9030 | 33.38 | 0.9397 | 39.44 | 0.9786 | ||
SRFormer | 10.31M | 2419G | 38.22 | 0.9614 | 34.08 | 0.9220 | 32.38 | 0.9025 | 33.08 | 0.9372 | 39.13 | 0.9780 | |
10.33M | 2502G | 38.31 | 0.9617 | 34.10 | 0.9217 | 32.43 | 0.9026 | 33.26 | 0.9385 | 39.36 | 0.9785 | ||
10.40M | 2741G | 38.33 | 0.9618 | 34.13 | 0.9228 | 32.44 | 0.9030 | 33.51 | 0.9405 | 39.49 | 0.9788 |
ConvFFN | Urban100 [26] PSNR | SSIM | Manga109 [71] PSNR | SSIM
w/o Depth-wise Conv | 33.38 | 0.9397 | 39.44 | 0.9786 |
Depth-wise Conv | 33.42 | 0.9398 | 39.34 | 0.9787 |
Depth-wise Conv | 33.51 | 0.9405 | 39.49 | 0.9788 |
Method | Params | MACs | Window size | r | PSNR | SSIM
SwinIR [20] | 11.75M | 2868G | 8 | - | 33.09 | 0.9373 |
Token Reduction | 11.78M | 2471G | 16 | 2 | 33.09 | 0.9372
Token Reduction | 11.85M | 2709G | 24 | 2 | 33.24 | 0.9387 |
Token Sampling | 11.91M | 2465G | 16 | 2 | 32.38 | 0.9312 |
Token Sampling | 12.18M | 2703G | 24 | 2 | 32.34 | 0.9305 |
PSA | 9.99M | 2465G | 16 | 2 | 33.09 | 0.9371 |
PSA | 9.67M | 2390G | 24 | 3 | 33.09 | 0.9376 |
PSA | 10.06M | 2703G | 24 | 2 | 33.38 | 0.9397 |
Window Size | MACs | Set5 | B100 | Urban100 | Manga109 |
2741G | 38.53 | 32.59 | 34.09 | 40.07 | |
3377G | 38.58 | 32.61 | 34.39 | 40.21 | |
3562G | 38.58 | 32.61 | 34.52 | 40.27 |
5 Experiments
In this section, we conduct experiments on the classical, lightweight, and real-world image SR tasks, compare our SRFormer and SRFormerV2 with existing state-of-the-art methods, and perform an ablation analysis of the proposed method.
5.1 Experimental Setup
Datasets and Evaluation. The choice of training datasets remains the same as that of the compared models. For classical image SR, we use DIV2K [10] and DF2K (DIV2K [10] + Flickr2K [72]) to train two versions of SRFormer; SRFormerV2 is also trained on DF2K. For lightweight image SR, we use DIV2K [10] to train SRFormer-light. For real-world SR, we use DF2K and OST [73]. For testing, we mainly evaluate our method on five benchmark datasets: Set5 [68], Set14 [69], BSD100 [70], Urban100 [26], and Manga109 [71]. A self-ensemble strategy is additionally applied to SRFormerV2 to further improve performance; this variant is marked as SRFormerV2+. The experimental results are evaluated in terms of PSNR and SSIM, which are calculated on the Y channel of the YCbCr space.
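As a reference for this protocol, the snippet below sketches how PSNR can be computed on the Y channel of YCbCr. The BT.601 conversion coefficients and the border-cropping convention are assumptions about common SR evaluation practice, not a description of the exact evaluation script used here.

```python
import numpy as np


def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB -> Y channel of YCbCr (BT.601, range [16, 235])."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0


def psnr_y(sr: np.ndarray, hr: np.ndarray, crop: int = 4) -> float:
    """PSNR between two uint8 RGB images on the Y channel, cropping a border."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if crop > 0:
        y_sr = y_sr[crop:-crop, crop:-crop]
        y_hr = y_hr[crop:-crop, crop:-crop]
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)


hr = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
print(psnr_y(hr, hr))   # inf for identical images
```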
Implementation Details. For the classical image SR task, we set the PAB group number, PAB number, channel number, and attention head number to 6, 6, 180, and 6, respectively. When training SRFormer on DIV2K [10], the patch size, window size, and reduction factor are set to , 24, and 2, respectively; when training SRFormer on DF2K [10, 72], they are , 22, and 2, respectively. SRFormerV2 follows the same training settings. For the lightweight image SR task, we set the PAB group number, PAB number, channel number, window size, reduction factor, and attention head number to 4, 6, 60, 16, 2, and 6, respectively. We randomly rotate images and randomly flip them horizontally for data augmentation. We adopt the Adam [74] optimizer to train the model for 500k iterations, and the learning rate is reduced by half at fixed iteration milestones.
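As an illustration of this kind of training setup, the snippet below configures an Adam optimizer with a step-halving learning-rate schedule and random flip/rotation augmentation in PyTorch. Every hyper-parameter value shown (learning rate, betas, milestones) is a placeholder assumption and not the paper's exact setting.

```python
import random
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # stand-in for the SR network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.MultiStepLR(     # halve the lr at chosen milestones
    optimizer, milestones=[250_000, 400_000, 450_000, 475_000], gamma=0.5)


def augment(lr_patch: torch.Tensor, hr_patch: torch.Tensor):
    """Random horizontal flip and random 90-degree rotation applied to both patches."""
    if random.random() < 0.5:
        lr_patch, hr_patch = torch.flip(lr_patch, [-1]), torch.flip(hr_patch, [-1])
    k = random.randint(0, 3)
    return torch.rot90(lr_patch, k, [-2, -1]), torch.rot90(hr_patch, k, [-2, -1])
```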
5.2 Ablation Study
Impact of window size in PSA. Permuted self-attention provides an efficient and effective way to enlarge the window size. To investigate the impact of different window sizes on model performance, we conduct three groups of experiments and report the results in Table I. The first group is the vanilla SwinIR [20] with three increasing window sizes. In the second group, we use only the PSA in our SRFormer without the ConvFFN and vary the window size in the same way to observe the performance difference. In the third group, we use our full SRFormer with the same three window sizes to explore the performance change. The results show that a larger window size consistently yields better performance in all three groups of experiments. In addition, the parameters and MACs of our SRFormer with the largest window are even fewer than those of the original SwinIR with its smallest window. To balance performance and MACs, we set the window size to 24 in SRFormer and 16 in SRFormer-light.
Impact of kernel size of ConvFFN. We introduce ConvFFN in Sec. 3.2, which aims to encode more local information without adding many computations. To explore which kernel size brings the best performance improvement, we compare depth-wise convolutions with two different kernel sizes and report the results in Table II. Given that the depth-wise convolution has little effect on the number of parameters and MACs, we do not list them in the table. One kernel size clearly leads to the best results, and we therefore adopt it in our ConvFFN.
Large-window self-attention variants. In Sec. 3.3, we introduce another two large-window self-attention variants. We summarize the results in Table III. Though token reduction can slightly improve SwinIR when using a large window, the number of parameters does not decrease and the performance gain is lower than ours. We argue that it is because directly applying downsampling operations to the key and value results in spatial information loss. For token sampling, the performance is even worse than the original SwinIR. We believe the reason is that dropping out some tokens severely breaks the image content structure.
Scale | Method | Training Dataset | SET5 [68] PSNR | SSIM | SET14 [69] PSNR | SSIM | B100 [70] PSNR | SSIM | Urban100 [26] PSNR | SSIM | Manga109 [71] PSNR | SSIM
SR | EDSR [10] | DIV2K | 38.11 | 0.9602 | 33.92 | 0.9195 | 32.32 | 0.9013 | 32.93 | 0.9351 | 39.10 | 0.9773 |
RCAN [17] | DIV2K | 38.27 | 0.9614 | 34.12 | 0.9216 | 32.41 | 0.9027 | 33.34 | 0.9384 | 39.44 | 0.9786 | |
SAN [39] | DIV2K | 38.31 | 0.9620 | 34.07 | 0.9213 | 32.42 | 0.9028 | 33.10 | 0.9370 | 39.32 | 0.9792 | |
IGNN [40] | DIV2K | 38.24 | 0.9613 | 34.07 | 0.9217 | 32.41 | 0.9025 | 33.23 | 0.9383 | 39.35 | 0.9786 | |
HAN [35] | DIV2K | 38.27 | 0.9614 | 34.16 | 0.9217 | 32.41 | 0.9027 | 33.35 | 0.9385 | 39.46 | 0.9785 | |
NLSA [38] | DIV2K | 38.34 | 0.9618 | 34.08 | 0.9231 | 32.43 | 0.9027 | 33.42 | 0.9394 | 39.59 | 0.9789 | |
SwinIR [20] | DIV2K | 38.35 | 0.9620 | 34.14 | 0.9227 | 32.44 | 0.9030 | 33.40 | 0.9393 | 39.60 | 0.9792 | |
ELAN [22] | DIV2K | 38.36 | 0.9620 | 34.20 | 0.9228 | 32.45 | 0.9030 | 33.44 | 0.9391 | 39.62 | 0.9793 | |
SRFormer (ours) | DIV2K | 38.45 | 0.9622 | 34.21 | 0.9236 | 32.51 | 0.9038 | 33.86 | 0.9426 | 39.69 | 0.9786 | |
IPT [19] | ImageNet | 38.37 | - | 34.43 | - | 32.48 | - | 33.76 | - | - | - | |
SwinIR [20] | DF2K | 38.42 | 0.9623 | 34.46 | 0.9250 | 32.53 | 0.9041 | 33.81 | 0.9427 | 39.92 | 0.9797 | |
EDT [75] | DF2K | 38.45 | 0.9624 | 34.57 | 0.9258 | 32.52 | 0.9041 | 33.80 | 0.9425 | 39.93 | 0.9800 | |
HAT [24] | DF2K | 38.63 | 0.9630 | 34.86 | 0.9274 | 32.62 | 0.9053 | 34.45 | 0.9466 | 40.26 | 0.9809 |
SRFormer (ours) | DF2K | 38.51 | 0.9627 | 34.44 | 0.9253 | 32.57 | 0.9046 | 34.09 | 0.9449 | 40.07 | 0.9802 | |
SRFormerV2 (ours) | DF2K | 38.63 | 0.9632 | 34.90 | 0.9276 | 32.64 | 0.9056 | 34.59 | 0.9477 | 40.40 | 0.9811 | |
SRFormerV2+ (ours) | DF2K | 38.68 | 0.9633 | 34.93 | 0.9278 | 32.67 | 0.9059 | 34.77 | 0.9485 | 40.51 | 0.9813 | |
SR | EDSR [10] | DIV2K | 34.65 | 0.9280 | 30.52 | 0.8462 | 29.25 | 0.8093 | 28.80 | 0.8653 | 34.17 | 0.9476 |
RCAN [17] | DIV2K | 34.74 | 0.9299 | 30.65 | 0.8482 | 29.32 | 0.8111 | 29.09 | 0.8702 | 34.44 | 0.9499 | |
SAN [39] | DIV2K | 34.75 | 0.9300 | 30.59 | 0.8476 | 29.33 | 0.8112 | 28.93 | 0.8671 | 34.30 | 0.9494 | |
IGNN [40] | DIV2K | 34.72 | 0.9298 | 30.66 | 0.8484 | 29.31 | 0.8105 | 29.03 | 0.8696 | 34.39 | 0.9496 | |
HAN [35] | DIV2K | 34.75 | 0.9299 | 30.67 | 0.8483 | 29.32 | 0.8110 | 29.10 | 0.8705 | 34.48 | 0.9500 | |
NLSA [38] | DIV2K | 34.85 | 0.9306 | 30.70 | 0.8485 | 29.34 | 0.8117 | 29.25 | 0.8726 | 34.57 | 0.9508 | |
SwinIR [20] | DIV2K | 34.89 | 0.9312 | 30.77 | 0.8503 | 29.37 | 0.8124 | 29.29 | 0.8744 | 34.74 | 0.9518 | |
ELAN [22] | DIV2K | 34.90 | 0.9313 | 30.80 | 0.8504 | 29.38 | 0.8124 | 29.32 | 0.8745 | 34.73 | 0.9517 | |
SRFormer (ours) | DIV2K | 34.94 | 0.9318 | 30.81 | 0.8518 | 29.41 | 0.8142 | 29.52 | 0.8786 | 34.78 | 0.9524 | |
IPT [19] | ImageNet | 34.81 | - | 30.85 | - | 29.38 | - | 29.49 | - | - | - | |
SwinIR [20] | DF2K | 34.97 | 0.9318 | 30.93 | 0.8534 | 29.46 | 0.8145 | 29.75 | 0.8826 | 35.12 | 0.9537 | |
EDT [75] | DF2K | 34.97 | 0.9316 | 30.89 | 0.8527 | 29.44 | 0.8142 | 29.72 | 0.8814 | 35.13 | 0.9534 | |
HAT [24] | DF2K | 35.07 | 0.9329 | 31.08 | 0.8555 | 29.54 | 0.8167 | 30.23 | 0.8896 | 35.53 | 0.9552 |
SRFormer (ours) | DF2K | 35.02 | 0.9323 | 30.94 | 0.8540 | 29.48 | 0.8156 | 30.04 | 0.8865 | 35.26 | 0.9543 | |
SRFormerV2 (ours) | DF2K | 35.05 | 0.9331 | 31.10 | 0.8562 | 29.55 | 0.8176 | 30.46 | 0.8924 | 35.59 | 0.9557 | |
SRFormerV2+ (ours) | DF2K | 35.14 | 0.9335 | 31.17 | 0.8569 | 29.59 | 0.8181 | 30.62 | 0.8941 | 35.73 | 0.9563 | |
SR | EDSR [10] | DIV2K | 32.46 | 0.8968 | 28.80 | 0.7876 | 27.71 | 0.7420 | 26.64 | 0.8033 | 31.02 | 0.9148 |
RCAN [17] | DIV2K | 32.63 | 0.9002 | 28.87 | 0.7889 | 27.77 | 0.7436 | 26.82 | 0.8087 | 31.22 | 0.9173 | |
SAN [39] | DIV2K | 32.64 | 0.9003 | 28.92 | 0.7888 | 27.78 | 0.7436 | 26.79 | 0.8068 | 31.18 | 0.9169 | |
IGNN [40] | DIV2K | 32.57 | 0.8998 | 28.85 | 0.7891 | 27.77 | 0.7434 | 26.84 | 0.8090 | 31.28 | 0.9182 | |
HAN [35] | DIV2K | 32.64 | 0.9002 | 28.90 | 0.7890 | 27.80 | 0.7442 | 26.85 | 0.8094 | 31.42 | 0.9177 | |
NLSA [38] | DIV2K | 32.59 | 0.9000 | 28.87 | 0.7891 | 27.78 | 0.7444 | 26.96 | 0.8109 | 31.27 | 0.9184 | |
SwinIR [20] | DIV2K | 32.72 | 0.9021 | 28.94 | 0.7914 | 27.83 | 0.7459 | 27.07 | 0.8164 | 31.67 | 0.9226 | |
ELAN [22] | DIV2K | 32.75 | 0.9022 | 28.96 | 0.7914 | 27.83 | 0.7459 | 27.13 | 0.8167 | 31.68 | 0.9226 | |
SRFormer (ours) | DIV2K | 32.81 | 0.9029 | 29.01 | 0.7919 | 27.85 | 0.7472 | 27.20 | 0.8189 | 31.75 | 0.9237 | |
IPT [19] | ImageNet | 32.64 | - | 29.01 | - | 27.82 | - | 27.26 | - | - | - | |
SwinIR [20] | DF2K | 32.92 | 0.9044 | 29.09 | 0.7950 | 27.92 | 0.7489 | 27.45 | 0.8254 | 32.03 | 0.9260 | |
EDT [75] | DF2K | 32.82 | 0.9031 | 29.09 | 0.7939 | 27.91 | 0.7483 | 27.46 | 0.8246 | 32.03 | 0.9254 | |
HAT [24] | DF2K | 33.04 | 0.9056 | 29.23 | 0.7973 | 28.00 | 0.7517 | 27.97 | 0.8368 | 32.48 | 0.9292 |
SRFormer (ours) | DF2K | 32.93 | 0.9041 | 29.08 | 0.7953 | 27.94 | 0.7502 | 27.68 | 0.8311 | 32.21 | 0.9271 | |
SRFormerV2 (ours) | DF2K | 33.06 | 0.9066 | 29.21 | 0.7978 | 27.98 | 0.7522 | 28.04 | 0.8391 | 32.52 | 0.9300 | |
SRFormerV2+ (ours) | DF2K | 33.16 | 0.9069 | 29.28 | 0.7991 | 28.03 | 0.7530 | 28.17 | 0.8414 | 32.75 | 0.9313 |
Further scaling the window size in SRFormerV2. Here, we further explore the performance of SRFormer with the larger attention windows discussed in Sec. 4. Building upon the baseline window size, we implement attention mechanisms with two larger window sizes. As shown in Tab. IV, even for the largest window size we could afford, we still observe performance improvements. For low-resolution test images, increasing the window size up to 36 results in similar performance, and further increasing it brings no gain due to the small input size.
Scale | Method | Training Dataset | Params | MACs | SET5 [68] PSNR | SSIM | SET14 [69] PSNR | SSIM | B100 [70] PSNR | SSIM | Urban100 [26] PSNR | SSIM | Manga109 [71] PSNR | SSIM
SR | EDSR-baseline [10] | DIV2K | 1370K | 316G | 37.99 | 0.9604 | 33.57 | 0.9175 | 32.16 | 0.8994 | 31.98 | 0.9272 | 38.54 | 0.9769 |
IMDN [76] | DIV2K | 694K | 158.8G | 38.00 | 0.9605 | 33.63 | 0.9177 | 32.19 | 0.8996 | 32.17 | 0.9283 | 38.88 | 0.9774 | |
LAPAR-A [77] | DF2K | 548K | 171G | 38.01 | 0.9605 | 33.62 | 0.9183 | 32.19 | 0.8999 | 32.10 | 0.9283 | 38.67 | 0.9772 | |
LatticeNet [78] | DIV2K | 756K | 169.5G | 38.15 | 0.9610 | 33.78 | 0.9193 | 32.25 | 0.9005 | 32.43 | 0.9302 | - | - | |
ESRT [79] | DIV2K | 751K | - | 38.03 | 0.9600 | 33.75 | 0.9184 | 32.25 | 0.9001 | 32.58 | 0.9318 | 39.12 | 0.9774 | |
SwinIR-light [20] | DIV2K | 910K | 244G | 38.14 | 0.9611 | 33.86 | 0.9206 | 32.31 | 0.9012 | 32.76 | 0.9340 | 39.12 | 0.9783 | |
ELAN [22] | DIV2K | 621K | 203G | 38.17 | 0.9611 | 33.94 | 0.9207 | 32.30 | 0.9012 | 32.76 | 0.9340 | 39.11 | 0.9782 | |
SRFormer-light | DIV2K | 853K | 236G | 38.23 | 0.9613 | 33.94 | 0.9209 | 32.36 | 0.9019 | 32.91 | 0.9353 | 39.28 | 0.9785 | |
SR | EDSR-baseline [10] | DIV2K | 1555K | 160G | 34.37 | 0.9270 | 30.28 | 0.8417 | 29.09 | 0.8052 | 28.15 | 0.8527 | 33.45 | 0.9439 |
IMDN [76] | DIV2K | 703K | 71.5G | 34.36 | 0.9270 | 30.32 | 0.8417 | 29.09 | 0.8046 | 28.17 | 0.8519 | 33.61 | 0.9445 | |
LAPAR-A [77] | DF2K | 594K | 114G | 34.36 | 0.9267 | 30.34 | 0.8421 | 29.11 | 0.8054 | 28.15 | 0.8523 | 33.51 | 0.9441 | |
LatticeNet [78] | DIV2K | 765K | 76.3G | 34.53 | 0.9281 | 30.39 | 0.8424 | 29.15 | 0.8059 | 28.33 | 0.8538 | - | - | |
ESRT [79] | DIV2K | 751K | - | 34.42 | 0.9268 | 30.43 | 0.8433 | 29.15 | 0.8063 | 28.46 | 0.8574 | 33.95 | 0.9455 | |
SwinIR-light [20] | DIV2K | 918K | 111G | 34.62 | 0.9289 | 30.54 | 0.8463 | 29.20 | 0.8082 | 28.66 | 0.8624 | 33.98 | 0.9478 | |
ELAN [22] | DIV2K | 629K | 90.1G | 34.61 | 0.9288 | 30.55 | 0.8463 | 29.21 | 0.8081 | 28.69 | 0.8624 | 34.00 | 0.9478 | |
SRFormer-light | DIV2K | 861K | 105G | 34.67 | 0.9296 | 30.57 | 0.8469 | 29.26 | 0.8099 | 28.81 | 0.8655 | 34.19 | 0.9489 | |
SR | EDSR-baseline [10] | DIV2K | 1518K | 114G | 32.09 | 0.8938 | 28.58 | 0.7813 | 27.57 | 0.7357 | 26.04 | 0.7849 | 30.35 | 0.9067 |
IMDN [76] | DIV2K | 715K | 40.9G | 32.21 | 0.8948 | 28.58 | 0.7811 | 27.56 | 0.7353 | 26.04 | 0.7838 | 30.45 | 0.9075 | |
LAPAR-A [77] | DF2K | 659K | 94G | 32.15 | 0.8944 | 28.61 | 0.7818 | 27.61 | 0.7366 | 26.14 | 0.7871 | 30.42 | 0.9074 | |
LatticeNet [78] | DIV2K | 777K | 43.6G | 32.30 | 0.8962 | 28.68 | 0.7830 | 27.62 | 0.7367 | 26.25 | 0.7873 | - | - | |
ESRT [79] | DIV2K | 751K | - | 32.19 | 0.8947 | 28.69 | 0.7833 | 27.69 | 0.7379 | 26.39 | 0.7962 | 30.75 | 0.9100 | |
SwinIR-light [20] | DIV2K | 930K | 63.6G | 32.44 | 0.8976 | 28.77 | 0.7858 | 27.69 | 0.7406 | 26.47 | 0.7980 | 30.92 | 0.9151 | |
ELAN [22] | DIV2K | 640K | 54.1G | 32.43 | 0.8975 | 28.78 | 0.7858 | 27.69 | 0.7406 | 26.54 | 0.7982 | 30.92 | 0.9150 | |
SRFormer-light | DIV2K | 873K | 62.8G | 32.51 | 0.8988 | 28.82 | 0.7872 | 27.73 | 0.7422 | 26.67 | 0.8032 | 31.17 | 0.9165 |
5.3 Classical Image Super-Resolution
To evaluate the performance of SRFormer and SRFormerV2 on the classical super-resolution task, we compare them with a series of state-of-the-art CNN-based and Transformer-based SR methods: RCAN [17], RDN [15], SAN [39], IGNN [40], HAN [35], NLSA [38], IPT [19], SwinIR [20], EDT [75], ELAN [22], and HAT [24].
Quantitative comparison. The quantitative comparison for classical image SR is shown in Table V. Note that the number of parameters and MACs of SRFormer are lower than those of SwinIR [20] (see Tab. VII for details), which makes the comparison fair. It can be clearly seen that SRFormer achieves the best performance on almost all five benchmark datasets for all scale factors. Since calculating self-attention within large windows allows more information to be aggregated over a large area, our SRFormer performs much better on high-resolution test sets, such as Urban100 and Manga109. In particular, for SR training with DIV2K, our SRFormer achieves a 33.86dB PSNR score on the Urban100 dataset, which is 0.46dB higher than SwinIR while using fewer parameters and computations. This strongly supports that our SRFormer is effective and efficient. Through the series of enhancements to SRFormer described in Sec. 4, the upgraded SRFormerV2 achieves even better results, surpassing the previous state-of-the-art model HAT.
[Figure: visual comparison of lightweight SR models on Urban100 img_024, Urban100 img_067, and B100 img_78004, showing HR, Bicubic, CARN [80], IDN [81], IMDN [76], EDSR-baseline [10], LAPAR-A [77], LatticeNet [78], SwinIR-light [20], and SRFormer-light.]
Qualitative comparison. We show qualitative comparisons of our SRFormer and the advanced SRFormerV2 with recent state-of-the-art methods in Fig. 6. In each example of Fig. 6, one can clearly observe that SRFormer restores crisper and more detailed textures and edges, whereas the previous models produce blurred or low-quality textures. The qualitative comparison shows that our SRFormer can restore better high-resolution images from low-resolution inputs. Through the series of enhancements described in Sec. 4, SRFormerV2 further enhances image clarity and achieves much better results than all previous works.
5.4 Lightweight Image Super-Resolution
To demonstrate our model's scalability and provide further evidence of SRFormer's efficiency and effectiveness, we train SRFormer-light and compare it with a list of state-of-the-art lightweight SR methods: EDSR-baseline [10], CARN [80], IMDN [76], LAPAR-A [77], LatticeNet [78], ESRT [79], SwinIR-light [20], and ELAN [22].
Quantitative comparison. The quantitative comparisons of lightweight image SR models are shown in Tab. VI. Following previous works [78, 80], we report the MACs of all models by upscaling a low-resolution image to the same target resolution on all scales. Our SRFormer-light achieves the best performance on all five benchmark datasets for all scale factors. It outperforms SwinIR-light [20] in PSNR by clear margins on the Urban100 and Manga109 datasets with even fewer parameters and MACs. The results indicate that, despite its simplicity, our permuted self-attention is a more effective way to encode spatial information.
Qualitative comparison. We compare our SRFormer with state-of-the-art lightweight image SR models in Fig. 7. Notably, for all examples in Fig. 7, SRFormer-light is the only model that restores the main structures with less blurring and fewer artifacts. This strongly demonstrates that the light version of SRFormer also restores edges and textures better than other methods.
5.5 Real-World Image Super-Resolution
To test the performance of SRFormer under a broader range of image degradation scenarios, we follow SwinIR [20] and retrain our SRFormer using the same degradation model as BSRGAN [82], with results shown in Fig. 8. SRFormer still produces more realistic and visually pleasing textures without artifacts when faced with real-world images, which demonstrates the robustness of our method.
5.6 Model Size Comparison
While achieving state-of-the-art performance, our SRFormer is also efficient. In Tab. VII, we compare the parameters and MACs of our SRFormer with those of recent state-of-the-art models. Following previous works [78, 80], we report the MACs by upscaling a low-resolution image to a fixed target resolution. One can clearly see that our SRFormer exceeds SwinIR [20] in performance with fewer parameters and MACs.
In SRFormerV2, we explore the scaling capabilities of SRFormer. Compared with HAT, our SRFormerV2 has similar parameters and computational load but clearly exceeds HAT in performance. We also train a smaller scaled model, named SRFormerV2-S, which follows the advanced operations in Fig. 5 but has 180 channels. It requires much fewer computational resources than HAT yet still performs better, as shown in Tab. VII. To further verify the effectiveness of PSA, we also replace the window attention in HAT with PSA using a large attention window, which also enhances the performance of HAT on both the Urban100 and Manga109 datasets.
Method | Params | MACs | Urban100 | Manga109 |
SwinIR [20] | 11.80M | 2.9T | 33.81 | 39.92 |
SRFormer | 10.52M | 2.7T | 34.09 | 40.07 |
HAT [24] | 20.6M | 5.9T | 34.45 | 40.26 |
HAT + our PSA | 18.9M | 5.6T | 34.57 | 40.32 |
SRFormerV2-S | 15.0M | 4.8T | 34.56 | 40.35 |
SRFormerV2 | 21.9M | 6.1T | 34.59 | 40.40 |

5.7 LAM Comparison
To observe the range of pixels utilized by our PSA for super-resolution (SR) reconstruction, we compare SRFormer with SwinIR using LAM [84] on multiple challenging examples. Given a specified region in the SR image, LAM analyzes the contribution of each pixel in the input image to the reconstruction of this region. The results are shown in Fig. 9. From the first two examples, we can see that SRFormer performs reconstruction based on almost all pixels of the input because similar textures are repeated across the whole image. For the last example, an even more striking phenomenon can be observed: since the image is left-right symmetric, SRFormer spans long distances and uses pixels on the right side for the reconstruction of the left region. This shows that SRFormer can extract useful features from almost the entire image when inferring the SR image in this case. These results indicate that SRFormer infers SR images using a significantly wider range of pixels than SwinIR [20].
6 Conclusion
In this paper, we propose PSA, a self-attention mechanism that can efficiently build pairwise correlations within large windows. Based on our PSA, we design a simple yet effective Transformer-based model for single image super-resolution, called SRFormer. Owing to the extremely large attention window and high-frequency information enhancement, SRFormer performs excellently on classical, lightweight, and real-world SR tasks. To further investigate the potential of PSA, we conduct a scaling study and propose SRFormerV2, achieving new state-of-the-art performance. We hope our permuted self-attention can serve as a paradigm of large-window self-attention and a useful tool for future research in super-resolution model design.
References
- [1] Y. Zhou, Z. Li, C.-L. Guo, S. Bai, M.-M. Cheng, and Q. Hou, “Srformer: Permuted self-attention for single image super-resolution,” in Int. Conf. Comput. Vis., 2023.
- [2] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim, “Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [3] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, “Edvr: Video restoration with enhanced deformable convolutional networks,” in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2019.
- [4] S. Anwar, S. Khan, and N. Barnes, “A deep journey into super-resolution: A survey,” ACM Computing Surveys, 2020.
- [5] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Eur. Conf. Comput. Vis., 2014.
- [6] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
- [7] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [8] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017.
- [9] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
- [10] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2017.
- [11] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017.
- [12] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Trans. Pattern Anal. Mach. Intell., 2021.
- [13] J. Li, F. Fang, K. Mei, and G. Zhang, “Multi-scale residual network for image super-resolution,” in Eur. Conf. Comput. Vis., 2018.
- [14] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Eur. Conf. Comput. Vis. Worksh., 2018.
- [15] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [16] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Int. Conf. Comput. Vis., 2017.
- [17] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Eur. Conf. Comput. Vis., 2018.
- [18] Y. Yang and Y. Qi, “Image super-resolution via channel attention and spatial graph convolutional network,” Pattern Recognition, 2021.
- [19] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021.
- [20] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Int. Conf. Comput. Vis. Worksh., 2021.
- [21] D. Zhang, F. Huang, S. Liu, X. Wang, and Z. Jin, “Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution,” arXiv:2208.11247, 2022.
- [22] X. Zhang, H. Zeng, S. Guo, and L. Zhang, “Efficient long-range attention network for image super-resolution,” arXiv:2203.06697, 2022.
- [23] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Int. Conf. Comput. Vis., 2021.
- [24] X. Chen, X. Wang, J. Zhou, and C. Dong, “Activating more pixels in image super-resolution transformer,” arXiv:2205.04437, 2022.
- [25] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, “Going deeper with image transformers,” in Int. Conf. Comput. Vis., 2021.
- [26] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015.
- [27] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
- [28] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in Int. Conf. Comput. Vis., 2017.
- [29] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Eur. Conf. Comput. Vis., 2016.
- [30] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017.
- [31] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [32] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “Ranksrgan: Generative adversarial networks with ranker for image super-resolution,” in Int. Conf. Comput. Vis., 2019.
- [33] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in Int. Conf. Comput. Vis., 2021.
- [34] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv:1406.2661, 2014.
- [35] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang, X. Cao, and H. Shen, “Single image super-resolution via a holistic attention network,” in Eur. Conf. Comput. Vis., 2020.
- [36] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [37] Y. Mei, Y. Fan, Y. Zhou, L. Huang, T. S. Huang, and H. Shi, “Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020.
- [38] Y. Mei, Y. Fan, and Y. Zhou, “Image super-resolution with non-local sparse attention,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021.
- [39] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019.
- [40] S. Zhou, J. Zhang, W. Zuo, and C. C. Loy, “Cross-scale internal graph neural network for image super-resolution,” in Adv. Neural Inform. Process. Syst., 2020.
- [41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Int. Conf. Learn. Represent., 2020.
- [42] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv:2012.12877, 2020.
- [43] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in Int. Conf. Comput. Vis., 2021.
- [44] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Int. Conf. Comput. Vis., 2021.
- [45] L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan, “Volo: Vision outlooker for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 2022.
- [46] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Eur. Conf. Comput. Vis., 2020.
- [47] Z. Sun, S. Cao, Y. Yang, and K. M. Kitani, “Rethinking transformer-based set prediction for object detection,” in Int. Conf. Comput. Vis., 2021.
- [48] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu, “You only look at one sequence: Rethinking transformer in vision through object detection,” in Adv. Neural Inform. Process. Syst., 2021.
- [49] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” in Adv. Neural Inform. Process. Syst., 2021.
- [50] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021.
- [51] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in Int. Conf. Comput. Vis., 2021.
- [52] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [53] Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two transformers can make one strong gan,” arXiv:2102.07074, 2021.
- [54] K. Lee, H. Chang, L. Jiang, H. Zhang, Z. Tu, and C. Liu, “Vitgan: Training gans with vision transformers,” arXiv:2107.04589, 2021.
- [55] B. Zhang, S. Gu, B. Zhang, J. Bao, D. Chen, F. Wen, Y. Wang, and B. Guo, “Styleswin: Transformer-based gan for high-resolution image generation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [56] Y. Deng, F. Tang, W. Dong, C. Ma, X. Pan, L. Wang, and C. Xu, “Stytr2: Image style transfer with transformers,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [57] L. Lu, R. Wu, H. Lin, J. Lu, and J. Jia, “Video frame interpolation with transformer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [58] R. Liu, H. Deng, Y. Huang, X. Shi, L. Lu, W. Sun, X. Wang, J. Dai, and H. Li, “Fuseformer: Fusing fine-grained information in transformers for video inpainting,” in Int. Conf. Comput. Vis., 2021.
- [59] J. Ren, Q. Zheng, Y. Zhao, X. Xu, and C. Li, “Dlformer: Discrete latent transformer for video inpainting,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [60] C. Liu, H. Yang, J. Fu, and X. Qian, “Learning trajectory-aware transformer for video super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [61] Z. Geng, L. Liang, T. Ding, and I. Zharkov, “Rstt: Real-time spatial temporal transformer for space-time video super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [62] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [63] C.-L. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, “Image dehazing transformer with transmission-aware 3d position embedding,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [64] N. Park and S. Kim, “How do vision transformers work?” in Int. Conf. Learn. Represent., 2022.
- [65] P. Wang, W. Zheng, T. Chen, and Z. Wang, “Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice,” arXiv:2203.05962, 2022.
- [66] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, 2004.
- [67] F. Kong, M. Li, S. Liu, D. Liu, J. He, Y. Bai, F. Chen, and L. Fu, “Residual local feature network for efficient super-resolution,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
- [68] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012.
- [69] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces, 2010.
- [70] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Int. Conf. Comput. Vis., 2001.
- [71] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, 2017.
- [72] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, “Ntire 2017 challenge on single image super-resolution: Methods and results,” in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2017.
- [73] X. Wang, K. Yu, C. Dong, and C. C. Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [74] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
- [75] W. Li, X. Lu, J. Lu, X. Zhang, and J. Jia, “On efficient transformer and image pre-training for low-level vision,” arXiv:2112.10175, 2022.
- [76] Z. Hui, X. Gao, Y. Yang, and X. Wang, “Lightweight image super-resolution with information multi-distillation network,” in ACM Int. Conf. Multimedia, 2019.
- [77] W. Li, K. Zhou, L. Qi, N. Jiang, J. Lu, and J. Jia, “Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond,” in Adv. Neural Inform. Process. Syst., 2020.
- [78] X. Luo, Y. Xie, Y. Zhang, Y. Qu, C. Li, and Y. Fu, “Latticenet: Towards lightweight image super-resolution with lattice block,” in Eur. Conf. Comput. Vis., 2020.
- [79] Z. Lu, H. Liu, J. Li, and L. Zhang, “Efficient transformer for single image super-resolution,” arXiv:2108.11084, 2021.
- [80] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in Eur. Conf. Comput. Vis., 2018.
- [81] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018.
- [82] K. Zhang, J. Liang, L. V. Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in Int. Conf. Comput. Vis., 2021.
- [83] X. Ji, Y. Cao, Y. Tai, C. Wang, J. Li, and F. Huang, “Real-world super-resolution via kernel estimation and noise injection,” in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2020.
- [84] J. Gu and C. Dong, “Interpreting super-resolution networks with local attribution maps,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021.
Yupeng Zhou is currently working toward the PhD degree in the College of Computer Science, Nankai University, under the supervision of Prof. Qibin Hou. He received his bachelor’s degree from Shandong University in 2022. His research interests include computer vision and deep learning, with a particular focus on image/video restoration and generation.
Zhen Li is currently working toward the PhD degree in the College of Computer Science, Nankai University, under the co-supervision of Prof. Ming-Ming Cheng and Prof. Xiu-Li Shao. He received his MS degree from Sichuan University in 2019. His research interests include computer vision and deep learning, particularly focusing on image/video restoration and enhancement, generation and editing, etc.
Chun-Le Guo received the Ph.D. degree from Tianjin University, China, under the supervision of Prof. Ji-Chang Guo. He was a Visiting Ph.D. Student with the School of Electronic Engineering and Computer Science, Queen Mary University of London (QMUL), U.K. He was a Research Associate with the Department of Computer Science, City University of Hong Kong (CityU of HK). He was a Postdoctoral Researcher with Prof. Ming-Ming Cheng at Nankai University. He is currently an Associate Professor with Nankai University. His research interests include image processing, computer vision, and deep learning.
Li Liu received her Ph.D. degree in information and communication engineering from the National University of Defense Technology, China, in 2012. She joined the faculty at the National University of Defense Technology in 2012. During her Ph.D. study, she spent more than two years as a Visiting Student at the University of Waterloo, Canada, from 2008 to 2010. From 2015 to 2016, she spent ten months visiting the Multimedia Laboratory at the Chinese University of Hong Kong. From 2016 to 2018, she was a senior researcher of the CMVS at the University of Oulu, Finland. Dr. Liu was co-chair of nine International Workshops at several major venues, including CVPR, ICCV, and ECCV; she served as the leading guest editor of special issues for IEEE TPAMI and IJCV. She also served as Area Chair for ICME 2020, 2021, 2022, and ACCV 2020, 2022. She currently serves as Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Geoscience and Remote Sensing, and Pattern Recognition. Her current research interests include computer vision, pattern recognition, and machine learning. Her papers currently have over 14000 citations on Google Scholar. She is a senior member of the IEEE.
Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012. He then spent two years as a research fellow with Prof. Philip Torr at Oxford. He is now a professor at Nankai University, leading the Media Computing Lab. His research interests include computer graphics, computer vision, and image processing. He has received research awards including the National Science Fund for Distinguished Young Scholars and the ACM China Rising Star Award. He is on the editorial boards of IEEE TPAMI and IEEE TIP.
Qibin Hou received his Ph.D. degree from the School of Computer Science, Nankai University. He then worked at the National University of Singapore as a research fellow. Now, he is an associate professor at the School of Computer Science, Nankai University. He has published more than 30 papers in top conferences and journals, including T-PAMI, CVPR, ICCV, NeurIPS, etc. His research interests include deep learning and computer vision.