
Exploring Sparsity in Image Super-Resolution for Efficient Inference

Longguang Wang1    Xiaoyu Dong2,3    Yingqian Wang1    Xinyi Ying1    Zaiping Lin1    Wei An1    Yulan Guo1∗
1National University of Defense Technology      2The University of Tokyo       3RIKEN AIP
{wanglongguang15,yulan.guo}@nudt.edu.cn
Abstract

Current CNN-based super-resolution (SR) methods process all locations equally, with computational resources being uniformly assigned in space. However, since the missing details in low-resolution (LR) images mainly exist in regions of edges and textures, fewer computational resources are required for flat regions. Therefore, existing CNN-based methods involve redundant computation in flat regions, which increases their computational cost and limits their applications on mobile devices. In this paper, we explore the sparsity in image SR to improve the inference efficiency of SR networks. Specifically, we develop a Sparse Mask SR (SMSR) network that learns sparse masks to prune redundant computation. Within our SMSR, spatial masks learn to identify "important" regions while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately localized and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41%/33%/27% of FLOPs being reduced for ×2/3/4 SR. Code is available at: https://github.com/LongguangWang/SMSR.

1 Introduction

The goal of single image super-resolution (SR) is to recover a high-resolution (HR) image from a single low-resolution (LR) observation. Due to the powerful feature representation and model fitting capabilities of deep neural networks, CNN-based SR methods have achieved significant performance improvements over traditional ones. Recently, many efforts have been made towards real-world applications, including few-shot SR [38, 39], blind SR [12, 49, 42], and scale-arbitrary SR [15, 43]. With the popularity of intelligent edge devices (such as smartphones and VR glasses), performing SR on these devices is in high demand. Since the resources of edge devices are limited (for example, the computational performance of the Kirin 990 and the RTX 2080Ti is 0.9 and 13.4 TFLOPS, respectively), efficient SR is crucial for applications on these devices.

Figure 1: Trade-off between PSNR performance, number of parameters, and FLOPs. Results are achieved on Set5 for ×2 SR.

Since the pioneering work of SRCNN [8], deeper networks have been extensively studied for image SR. In VDSR [19], the SR network was first deepened to 20 layers. Then, a very deep and wide architecture with over 60 layers was introduced in EDSR [29]. Later, Zhang et al. further increased the network depth to over 100 and 400 layers in RDN [51] and RCAN [50], respectively. Although a deep network usually improves SR performance, it also leads to high computational cost and limits applications on mobile devices. To address this problem, several efforts have been made to reduce model size through information distillation [17] and efficient feature reuse [2]. Nevertheless, these networks still involve redundant computation. Compared to an HR image, the missing details in its LR counterpart mainly exist in regions of edges and textures. Consequently, fewer computational resources are required in flat regions. However, these CNN-based SR methods process all locations equally, resulting in redundant computation within flat regions.

In this paper, we explore the sparsity in image SR to improve inference efficiency of SR networks. We first study the intrinsic sparsity of the image SR task and then investigate the feature sparsity in existing SR networks. To fully exploit the sparsity for efficient inference, we propose a sparse mask SR (SMSR) network to dynamically skip redundant computation at a fine-grained level. Our SMSR learns spatial masks to identify “important” regions (e.g., edge and texture regions) and uses channel masks to mark redundant channels in those “unimportant” regions. These two kinds of masks work jointly to accurately localize redundant computation. During network training, we soften these binary masks using the Gumbel softmax trick to make them differentiable. During inference, we use sparse convolution to skip redundant computation. It is demonstrated that our SMSR can effectively localize and prune redundant computation to achieve better efficiency while producing promising results (Fig. 1).

Our main contributions can be summarized as follows: 1) We develop an SMSR network to dynamically skip redundant computation for efficient image SR. In contrast to existing works that focus on lightweight network designs, we explore a different route by pruning redundant computation to improve inference efficiency. 2) We propose to localize redundant computation by learning spatial and channel masks. These two kinds of masks work jointly for fine-grained localization of redundant computation. 3) Experimental results show that our SMSR achieves state-of-the-art performance with better inference efficiency. For example, our SMSR outperforms previous methods on Set14 for ×2 SR with a significant speedup on mobile devices (Table 2).

Figure 2: Absolute difference between $I^{SR}_{Bicubic}$, $I^{SR}_{RCAN}$ and $I^{HR}$ in the luminance channel.
Figure 3: Visualization of feature maps after the ReLU layer in the first backbone block of RCAN. Note that, sparsity is defined as the ratio of zeros in the corresponding channels.

2 Related Work

In this section, we first review several major works for CNN-based single image SR. Then, we discuss CNN acceleration techniques related to our work, including adaptive inference and network pruning.

Single Image SR. CNN-based methods have dominated the research of single image SR due to their strong representation and fitting capabilities. Dong et al. [8] first introduced a three-layer network to learn an LR-to-HR mapping for single image SR. Then, a deep network with 20 layers was proposed in VDSR [19]. Recently, deeper networks have been extensively studied for image SR. Lim et al. [29] proposed a very deep and wide network (namely, EDSR) by cascading modified residual blocks. Zhang et al. [51] further combined residual learning and dense connections to build RDN with over 100 layers. Although these networks achieve state-of-the-art performance, their high computational cost and memory footprint limit their applications on mobile devices.

To address this problem, several lightweight networks were developed [22, 17, 2]. Specifically, distillation blocks were proposed for feature learning in IDN [17], while a cascading mechanism was introduced to encourage efficient feature reuse in CARN [2]. Different from these manually designed networks, Chu et al. [6] developed a compact architecture using neural architecture search (NAS). Recently, Lee et al. [24] introduced a distillation framework that leverages the knowledge learned by powerful teacher SR networks to boost the performance of lightweight student SR networks. Although these lightweight SR networks successfully reduce the model size, redundant computation is still involved and hinders them from achieving better computational efficiency. While many existing works focus on compact architecture designs, few efforts have been made to exploit the redundancy in SR networks for efficient inference.

Adaptive Inference. Adaptive inference techniques [44, 37, 36, 11, 26] have attracted increasing interest since they can adapt the network structure according to the input. One active branch of adaptive inference techniques dynamically selects an inference path at the level of layers. Specifically, Wu et al. [45] proposed a BlockDrop approach for ResNets to dynamically drop several residual blocks for efficiency. Mullapudi et al. [36] proposed HydraNets with multiple branches and used a gating approach to dynamically choose a subset of them at test time. Another popular branch is early stopping techniques that skip the computation at a location whenever it is deemed to be unnecessary [46]. On top of ResNets, Figurnov et al. [9] proposed a spatially adaptive computation time (SACT) mechanism to stop computation at a spatial position when the features become "good enough". Liu et al. [31] introduced adaptive inference for SR by producing a map of local network depth to adapt the number of convolutional layers applied at different locations. However, these adaptive inference methods only focus on spatial redundancy without considering redundancy in the channel dimension.

Network Pruning. Network pruning [13, 32, 33] is widely used to remove a set of redundant parameters for network acceleration. As a popular branch of network pruning methods, structured pruning approaches are usually used to prune the network at the level of channels and even layers [25, 32, 33, 14]. Specifically, Li et al. [25] used the $L_1$ norm to measure the importance of different filters and then pruned less important ones. Liu et al. [32] imposed a sparsity constraint on the scaling factors of the batch normalization layers and identified channels with lower scaling factors as less informative ones. Different from these static structured pruning methods, Lin et al. [30] conducted runtime neural network pruning according to the input image. Recently, Gao et al. [10] introduced a feature boosting and suppression method to dynamically prune unimportant channels at inference time. Nevertheless, these network pruning methods treat all spatial locations equally without taking their different importance into consideration.

3 Sparsity in Image Super-Resolution

In this section, we first illustrate the intrinsic sparsity of the single image SR task and then investigate the feature sparsity in state-of-the-art SR networks.

Given an HR image $I^{HR}$ and its LR version $I^{LR}$ (e.g., ×4 downsampled), we super-resolve $I^{LR}$ using Bicubic interpolation and RCAN to obtain $I^{SR}_{Bicubic}$ and $I^{SR}_{RCAN}$, respectively. Figure 2 shows the absolute difference between $I^{SR}_{Bicubic}$, $I^{SR}_{RCAN}$ and $I^{HR}$ in the luminance channel. It can be observed from Fig. 2(b) that $I^{SR}_{Bicubic}$ is "good enough" for flat regions, with noticeable missing details in only a small proportion of regions (~17% of pixels with $|I^{HR}-I^{SR}_{Bicubic}|>0.1$). That is, the SR task is intrinsically sparse in the spatial domain. Compared to Bicubic interpolation, RCAN performs better in edge regions while achieving comparable performance in flat regions (Fig. 2(c)). Although RCAN focuses on recovering high-frequency details in edge regions (Fig. 2(d)), those flat regions are equally processed at the same time. Consequently, redundant computation is involved.
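The spatial sparsity discussed above can be quantified with a few lines of code. The snippet below is a minimal sketch of our own (not from the paper) that measures the fraction of pixels whose luminance error exceeds a threshold, assuming the SR and HR images are given as H×W×3 RGB arrays in [0, 1]; the function name and the BT.601 luma weights are our choices.

```python
import numpy as np

def residual_sparsity(sr, hr, threshold=0.1):
    """Fraction of pixels whose luminance error exceeds `threshold`.

    `sr` and `hr` are H x W x 3 RGB arrays in [0, 1]; the error is
    measured in the luminance channel, as in Fig. 2.
    """
    weights = np.array([0.299, 0.587, 0.114])  # BT.601 luma approximation
    error = np.abs(hr @ weights - sr @ weights)
    return float((error > threshold).mean())
```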

Figure 3 illustrates the feature maps after the ReLU layer in a backbone block of RCAN. It can be observed that the spatial sparsity varies significantly across channels. Moreover, a considerable number of channels are quite sparse (sparsity $\geq 0.8$), with only edge and texture regions being activated. That is, computation in those flat regions is redundant since these regions are not activated after the ReLU layer. In summary, RCAN activates only a few channels for "unimportant" regions (e.g., flat regions) and more channels for "important" regions (e.g., edge regions). More results achieved with different SR networks and backbone blocks are provided in the supplemental material.
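The per-channel sparsity reported in Fig. 3 (the ratio of zeros after the ReLU) can be computed as follows; this is a small sketch of our own, assuming the feature map is available as a (C, H, W) tensor.

```python
import torch

def channel_sparsity(feature: torch.Tensor) -> torch.Tensor:
    """Per-channel ratio of zero activations in a post-ReLU feature map.

    `feature` has shape (C, H, W); a value close to 1 means the channel is
    activated only at a few (typically edge/texture) locations.
    """
    c = feature.size(0)
    return (feature.view(c, -1) == 0).float().mean(dim=1)
```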

Motivated by these observations, we learn sparse masks to localize and skip redundant computation for efficient inference. Specifically, our spatial masks dynamically identify "important" regions while the channel masks mark redundant channels in those "unimportant" regions. Compared to network pruning methods [10, 30, 14], we take region redundancy into consideration and only prune channels for "unimportant" regions. Different from adaptive inference networks [37, 27], we further investigate the redundancy in the channel dimension to localize redundant computation at a finer-grained level.

Figure 4: An overview of our SMSR network.

4 Our SMSR Network

Our SMSR network uses sparse mask modules (SMMs) to prune redundant computation for efficient image SR. Within each SMM, spatial and channel masks are first generated to localize redundant computation, as shown in Fig. 4. Then, the redundant computation is dynamically skipped using $L$ densely-connected sparse mask convolutions. Since only the necessary computation is performed, our SMSR can achieve better efficiency while maintaining comparable performance.

4.1 Sparse Mask Generation

Figure 5: An illustration of sparse mask generation and sparse mask convolution.

1) Training Phase

Spatial Mask. The goal of the spatial mask is to identify "important" regions in feature maps (i.e., 0 for "unimportant" regions and 1 for "important" ones). To make the binary spatial mask learnable, we use the Gumbel softmax distribution to approximate the one-hot distribution [18]. Specifically, the input feature $F\in\mathbb{R}^{C\times H\times W}$ is first fed to an hourglass block to produce $F^{spa}\in\mathbb{R}^{2\times H\times W}$, as shown in Fig. 5(a). Then, the Gumbel softmax trick is used to obtain a softened spatial mask $M_{k}^{spa}\in\mathbb{R}^{H\times W}$:

$$M_{k}^{spa}[x,y]=\frac{\exp\big(\big(F^{spa}[1,x,y]+G_{k}^{spa}[1,x,y]\big)/\tau\big)}{\sum_{i=1}^{2}\exp\big(\big(F^{spa}[i,x,y]+G_{k}^{spa}[i,x,y]\big)/\tau\big)}, \qquad (1)$$

where $x,y$ are the vertical and horizontal indices, $G_{k}^{spa}\in\mathbb{R}^{2\times H\times W}$ is a Gumbel noise tensor with all elements following the ${\rm Gumbel}(0,1)$ distribution, and $\tau$ is a temperature parameter. When $\tau\rightarrow\infty$, samples from the Gumbel softmax distribution become uniform. That is, all elements in $M_{k}^{spa}$ are 0.5. When $\tau\rightarrow 0$, samples from the Gumbel softmax distribution become one-hot. That is, $M_{k}^{spa}$ becomes binary. In practice, we start at a high temperature and anneal to a small one to obtain binary spatial masks.
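A minimal PyTorch sketch of the spatial mask branch is given below. The hourglass block is simplified to a strided convolution followed by interpolation, and the class name and layer sizes are our own assumptions; only the Gumbel softmax step follows Eq. 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMask(nn.Module):
    """Sketch of the spatial mask branch (Eq. 1) with a simplified hourglass."""

    def __init__(self, channels=64):
        super().__init__()
        self.hourglass = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2, 3, padding=1),
        )

    def forward(self, x, tau=1.0, hard=False):
        # two logits per location, upsampled back to the input resolution
        logits = F.interpolate(self.hourglass(x), size=x.shape[-2:], mode='nearest')
        if hard:
            # inference: argmax gives a binary mask (channel 0 = "important")
            return (logits.argmax(dim=1, keepdim=True) == 0).float()
        # training: softened mask via the Gumbel softmax trick (Eq. 1)
        gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
        soft = F.softmax((logits + gumbel) / tau, dim=1)
        return soft[:, :1]  # probability of the "important" class
```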

Channel Mask. In addition to spatial masks, channel masks are used to mark redundant channels in those "unimportant" regions (i.e., 0 for redundant channels and 1 for preserved ones). Here, we also use the Gumbel softmax trick to produce binary channel masks. For the $l^{\rm th}$ convolutional layer in the $k^{\rm th}$ SMM, we feed an auxiliary parameter $S_{k,l}\in\mathbb{R}^{2\times C}$ to a Gumbel softmax layer to generate a softened channel mask $M_{k,l}^{ch}\in\mathbb{R}^{C}$:

$$M_{k,l}^{ch}[c]=\frac{\exp\big(\big(S_{k,l}[1,c]+G_{k,l}^{ch}[1,c]\big)/\tau\big)}{\sum_{i=1}^{2}\exp\big(\big(S_{k,l}[i,c]+G_{k,l}^{ch}[i,c]\big)/\tau\big)}, \qquad (2)$$

where $c$ is the channel index and $G_{k,l}^{ch}\in\mathbb{R}^{2\times C}$ is a Gumbel noise tensor. In our experiments, $S_{k,l}$ is initialized using random values drawn from a Gaussian distribution ${\rm N}(0,1)$.
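The channel mask is even simpler, since $S_{k,l}$ is a free parameter rather than a function of the input. The sketch below follows Eq. 2; the class name and the convention that index 0 marks preserved ("dense") channels are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMask(nn.Module):
    """Sketch of the channel mask for one layer (Eq. 2)."""

    def __init__(self, channels=64):
        super().__init__()
        self.s = nn.Parameter(torch.randn(2, channels))  # S_{k,l} ~ N(0, 1)

    def forward(self, tau=1.0, hard=False):
        if hard:
            # inference: 1 for preserved ("dense") channels, 0 for pruned ones
            return (self.s.argmax(dim=0) == 0).float()
        gumbel = -torch.log(-torch.log(torch.rand_like(self.s) + 1e-20) + 1e-20)
        return F.softmax((self.s + gumbel) / tau, dim=0)[0]  # shape (C,)
```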

Table 1: Comparative results achieved on Set14 by our SMSR with different settings for ×2 SR.

| Model | Spatial Mask | Channel Mask | Conv | #Params. | Sparsity | FLOPs | PSNR | SSIM |
|---|---|---|---|---|---|---|---|---|
| 1 | | | Vanilla | 926K | 0 | 1.00× | 33.65 | 0.9180 |
| 2 | | ✓ | Vanilla | 587K | 0.46 | 0.60× | 33.53 | 0.9169 |
| 3 | ✓ | | Sparse | 985K | 0.42 | 0.65× | 33.60 | 0.9176 |
| 4 (Ours) | ✓ | ✓ | Sparse | 985K | 0.46 | 0.61× | 33.64 | 0.9179 |

Sparsity Regularization. Based on the spatial and channel masks, we define a sparsity term $\eta_{k,l}$:

$$\eta_{k,l}=\frac{1}{C\times H\times W}\sum_{c,x,y}\Big(\big(1-M_{k,l}^{ch}[c]\big)\times M_{k}^{spa}[x,y]+M_{k,l}^{ch}[c]\times I[x,y]\Big), \qquad (3)$$

where $I\in\mathbb{R}^{H\times W}$ is a tensor with all ones. Note that, $\eta_{k,l}$ represents the ratio of activated locations in the output feature maps. To encourage the output features to be more sparse with fewer locations being activated, we further introduce a sparsity regularization loss:

$$L_{reg}=\frac{1}{K\times L}\sum_{k,l}\eta_{k,l}, \qquad (4)$$

where $K$ is the number of SMMs and $L$ is the number of sparse mask convolutional layers within each SMM.
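In code, Eqs. 3 and 4 reduce to a few mean operations. The sketch below is ours, assuming the softened masks of all SMMs are collected in nested lists, with spatial masks of shape (N, 1, H, W) and channel masks of shape (C,); averaging over the batch is a simplification on our side.

```python
import torch

def sparsity_loss(spatial_masks, channel_masks):
    """L_reg of Eq. 4: average activation ratio eta_{k,l} over all SMMs/layers.

    spatial_masks[k]: (N, 1, H, W) mask of the k-th SMM.
    channel_masks[k][l]: (C,) mask of the l-th layer in the k-th SMM
    (1 = preserved channel, 0 = pruned channel).
    """
    terms = []
    for m_spa, layer_masks in zip(spatial_masks, channel_masks):
        for m_ch in layer_masks:
            dense = m_ch.mean()                        # preserved channels: active everywhere
            sparse = (1 - m_ch).mean() * m_spa.mean()  # pruned channels: "important" pixels only
            terms.append(dense + sparse)               # eta_{k,l} of Eq. 3
    return torch.stack(terms).mean()
```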

Training Strategy. During the training phase, the temperature parameter $\tau$ in the Gumbel softmax layers is annealed using the schedule $\tau=\max(0.4,\,1-\frac{t}{T_{temp}})$, where $t$ is the number of epochs and $T_{temp}$ is empirically set to 500 in our experiments. As $\tau$ gradually decreases, the Gumbel softmax distribution is forced to approach a one-hot distribution to produce binary spatial and channel masks.
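The annealing schedule is a one-liner; the helper below simply mirrors the formula above (the function name is ours).

```python
def gumbel_temperature(epoch, t_temp=500, tau_min=0.4):
    """tau = max(0.4, 1 - t / T_temp), annealed as training progresses."""
    return max(tau_min, 1.0 - epoch / t_temp)
```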

2) Inference Phase

During training, the Gumbel softmax distributions are forced to approach one-hot distributions as $\tau$ decreases. Therefore, we replace the Gumbel softmax layers with argmax layers after training to obtain binary spatial and channel masks, as shown in Fig. 5(c).

4.2 Sparse Mask Convolution

1) Training Phase

To enable backpropagation of gradients at all locations, we do not explicitly perform sparse convolution during training. Instead, we multiply the results of a vanilla "dense" convolution with the predicted spatial and channel masks, as shown in Fig. 5(b). Specifically, the input feature $F$ is first multiplied with $M^{ch}_{k,l-1}$ and $(1-M^{ch}_{k,l-1})$ to obtain $F^{D}$ and $F^{S}$, respectively. That is, channels with "dense" and "sparse" feature maps in $F$ are separated. Next, $F^{D}$ and $F^{S}$ are passed to two convolutions with shared weights. The resulting features are then multiplied with different combinations of $(1-M^{ch}_{k,l})$, $M^{ch}_{k,l}$ and $M^{spa}_{k}$ to activate different parts of the features. Finally, all these features are summed up to generate the output feature $F^{out}$. Thanks to the Gumbel softmax trick used in mask generation, gradients at all locations can be preserved to optimize the kernel weights of the convolutional layers.
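One plausible reading of Fig. 5(b) is sketched below: a single shared convolution (bias omitted for simplicity) is applied to the "dense" and "sparse" parts of the input, and four mask combinations gate its outputs. The class name and the exact gating pattern are our interpretation, not the authors' released code.

```python
import torch
import torch.nn as nn

class SparseMaskConvTrain(nn.Module):
    """Training-phase sparse mask convolution (one reading of Fig. 5(b))."""

    def __init__(self, channels=64):
        super().__init__()
        # shared weights for both branches; bias omitted to keep the sketch simple
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x, m_spa, m_ch_in, m_ch_out):
        # m_spa: (N, 1, H, W); m_ch_in / m_ch_out: (C,) masks of the previous /
        # current layer (1 = "dense" channel, 0 = "sparse" channel)
        c_in = m_ch_in.view(1, -1, 1, 1)
        c_out = m_ch_out.view(1, -1, 1, 1)
        f_dense = x * c_in          # F^D: channels valid everywhere
        f_sparse = x * (1 - c_in)   # F^S: channels valid only in "important" regions
        y_dense = self.conv(f_dense)
        y_sparse = self.conv(f_sparse)
        return (y_dense * c_out                    # D2D: kept everywhere
                + y_dense * (1 - c_out) * m_spa    # D2S: "important" regions only
                + y_sparse * c_out * m_spa         # S2D
                + y_sparse * (1 - c_out) * m_spa)  # S2S
```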

2) Inference Phase

During the inference phase, sparse convolution is performed based on the predicted spatial and channel masks, as shown in Fig. 5(d). Taking the $l^{\rm th}$ layer in the $k^{\rm th}$ SMM as an example, its kernel is first split into four sub-kernels according to $M_{k,l-1}^{ch}$ and $M_{k,l}^{ch}$ to obtain four convolutions. Meanwhile, the input feature $F$ is split into $F^{D}$ and $F^{S}$ based on $M_{k,l-1}^{ch}$. Then, $F^{D}$ is fed to convolutions ➀ and ➁ to produce $F^{D2D}$ and $F^{D2S}$, while $F^{S}$ is fed to convolutions ➂ and ➃ to produce $F^{S2D}$ and $F^{S2S}$. Note that, $F^{D2D}$ is produced by a vanilla "dense" convolution while $F^{D2S}$, $F^{S2D}$ and $F^{S2S}$ are generated by sparse convolutions with only "important" regions (marked by $M_{k}^{spa}$) being computed. Finally, the features obtained from these four branches are summed and concatenated to produce the output feature $F^{out}$. Using sparse mask convolution, computation for redundant channels within those "unimportant" regions can be skipped for efficient inference.
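At inference time the same computation can be expressed with four sub-kernels. The sketch below emulates the three sparse branches with dense convolutions followed by spatial masking, so it reproduces the outputs but not the speedup; a real implementation would gather only the positions selected by $M_{k}^{spa}$ (e.g., via an im2col-style kernel). The function name and signature are ours.

```python
import torch
import torch.nn.functional as F

def sparse_mask_conv_infer(x, weight, m_spa, m_ch_in, m_ch_out):
    """Inference-phase sketch of Fig. 5(d).

    x: (N, C_in, H, W), weight: (C_out, C_in, 3, 3),
    m_spa: (N, 1, H, W) binary, m_ch_in: (C_in,), m_ch_out: (C_out,).
    """
    d_in, s_in = m_ch_in.bool(), ~m_ch_in.bool()
    d_out, s_out = m_ch_out.bool(), ~m_ch_out.bool()

    f_d, f_s = x[:, d_in], x[:, s_in]      # split input channels
    w_dd = weight[d_out][:, d_in]          # sub-kernel 1: dense-in  -> dense-out
    w_ds = weight[s_out][:, d_in]          # sub-kernel 2: dense-in  -> sparse-out
    w_sd = weight[d_out][:, s_in]          # sub-kernel 3: sparse-in -> dense-out
    w_ss = weight[s_out][:, s_in]          # sub-kernel 4: sparse-in -> sparse-out

    f_d2d = F.conv2d(f_d, w_dd, padding=1)          # vanilla "dense" convolution
    f_d2s = F.conv2d(f_d, w_ds, padding=1) * m_spa  # "important" regions only
    f_s2d = F.conv2d(f_s, w_sd, padding=1) * m_spa
    f_s2s = F.conv2d(f_s, w_ss, padding=1) * m_spa

    out = x.new_zeros(x.size(0), weight.size(0), x.size(2), x.size(3))
    out[:, d_out] = f_d2d + f_s2d          # "dense" output channels
    out[:, s_out] = f_d2s + f_s2s          # "sparse" output channels
    return out
```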

4.3 Discussion

Different from many recent works that use lightweight network designs [17, 2, 6] or knowledge distillation [24] for efficient SR, we speed up SR networks by pruning redundant computation. Previous adaptive inference and network pruning methods focus on redundant computation in the spatial and channel dimensions independently. Directly applying these approaches cannot fully exploit the redundancy in SR networks and suffers a notable performance drop, as demonstrated in Sec. 5.2. In contrast, our SMSR provides a unified framework that considers redundancy in both the spatial and channel dimensions. It is demonstrated that our spatial and channel masks are well compatible with each other and enable our SMSR to obtain fine-grained localization of redundant computation.

Table 2: Comparative results achieved on Set14 by our SMSR with different sparsities for ×2 SR.

| Model | Conv | $\lambda_0$ | Sparsity | #Params. | FLOPs | Memory | Time (GPU) | Time (CPU) | Time (Kirin 990) | Time (Kirin 810) | PSNR | SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| baseline | Vanilla | 0 | 0 | 926K | 1.00× | 1.00× | 1.00× | 1.00× | 1.00× | 1.00× | 33.65 | 0.9180 |
| 5 | Sparse | 0.1 | 0.46 | 985K | 0.61× | 0.89× | 1.22× | 0.79× | 0.64× | 0.57× | 33.64 | 0.9179 |
| 6 | Sparse | 0.2 | 0.64 | 985K | 0.46× | 0.87× | 1.11× | 0.73× | 0.55× | 0.50× | 33.61 | 0.9174 |
| 7 | Sparse | 0.3 | 0.73 | 985K | 0.38× | 0.85× | 1.04× | 0.68× | 0.54× | 0.45× | 33.52 | 0.9169 |
| IDN [17] | - | - | - | 553K | 0.57× | 0.91× | 1.04× | 0.73× | 0.71× | 0.60× | 33.30 | 0.9148 |
| CARN [2] | - | - | - | 1592K | 0.99× | 1.01× | 1.00× | 0.89× | 0.96× | 1.15× | 33.52 | 0.9166 |
| FALSR-A [6] | - | - | - | 1021K | 1.04× | 2.02× | 1.11× | 1.05× | 1.02× | 0.92× | 33.55 | 0.9168 |
Figure 6: Visualization of sparse masks. Blue and green regions in $M^{ch}$ represent channels with "dense" and "sparse" feature maps, respectively. In $M^{spa}$, "important" locations are shown in yellow.
Figure 7: Comparison of sparsities achieved in different SMMs on butterfly for different scale factors.
Figure 8: Comparison between learning-based masks (red circles) and gradient-based masks (yellow and green circles) on Set14.

5 Experiments

5.1 Implementation Details

We used 800 training images and 100 validation images from the DIV2K dataset [1] as training and validation sets. For evaluation, we used five benchmark datasets including Set5 [4], Set14 [48], B100 [34], Urban100 [16], and Manga109 [35]. Peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) were used as evaluation metrics to measure SR performance. Following the evaluation protocol in [50, 51], we cropped borders and calculated the metrics in the luminance channel.
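As a reference, the evaluation protocol can be sketched as follows; the exact luminance conversion and border size vary slightly across codebases, so the BT.601 coefficients and the scale-sized crop below are assumptions on our side.

```python
import numpy as np

def psnr_y(sr, hr, scale):
    """PSNR in the luminance channel with `scale` border pixels cropped.

    `sr` and `hr` are H x W x 3 RGB arrays in [0, 255].
    """
    weights = np.array([0.299, 0.587, 0.114])  # BT.601 luma approximation
    y_sr = (sr @ weights)[scale:-scale, scale:-scale]
    y_hr = (hr @ weights)[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```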

During training, 16 LR patches of size 96×96 and their corresponding HR patches were randomly cropped. Data augmentation was then performed through random rotation and flipping. We set $C=64$, $L=4$, $K=5$ for our SMSR. We used the Adam method [21] with $\beta_1=0.9$ and $\beta_2=0.999$ for optimization. The initial learning rate was set to $2\times 10^{-4}$ and halved after every 200 epochs. The training was stopped after 1000 epochs. The overall loss for training is defined as $L=L_{SR}+\lambda L_{reg}$, where $L_{SR}$ is the $L_1$ loss between SR results and HR images, and $L_{reg}$ is defined in Eq. 4. To maintain training stability, we used a warmup strategy $\lambda=\lambda_0\times\min(\frac{t}{T_{warm}},1)$, where $t$ is the number of epochs, $T_{warm}$ is empirically set to 50, and $\lambda_0$ is set to 0.1.
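Putting the pieces together, the overall objective with the warm-up schedule can be written as below; this reuses the sparsity_loss sketch from Sec. 4.1, and the function and variable names are ours.

```python
import torch.nn.functional as F

def training_loss(sr, hr, spatial_masks, channel_masks, epoch,
                  lambda_0=0.1, t_warm=50):
    """L = L_SR + lambda * L_reg with lambda = lambda_0 * min(t / T_warm, 1)."""
    l_sr = F.l1_loss(sr, hr)                   # L1 loss between SR and HR images
    lam = lambda_0 * min(epoch / t_warm, 1.0)  # warm-up schedule
    return l_sr + lam * sparsity_loss(spatial_masks, channel_masks)
```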

5.2 Model Analysis

We first conduct experiments to demonstrate the effectiveness of sparse masks. Then, we investigate the effect of sparsity and visualize sparse masks for discussion. Finally, we compare our learning-based masks with heuristic ones.

Effectiveness of Sparse Masks. To demonstrate the effectiveness of our sparse masks, we first introduced variant 1 by removing both spatial and channel masks. Then, we developed variants 2 and 3 by adding channel masks and spatial masks, respectively. Comparative results are shown in Table 1. Without spatial and channel masks, all locations and all channels are processed equally. Therefore, variant 1 has a high computational cost. Using channel masks, redundant channels are pruned at all spatial locations. Therefore, variant 2 can be considered as a pruned version of variant 1. Although variant 2 has fewer parameters and FLOPs, it suffers a notable performance drop (33.53 vs. 33.65) since the beneficial information in the "important" regions of these pruned channels is discarded. With only spatial masks, variant 3 suffers from a conflict between efficiency and performance since redundant computation in the channel dimension cannot be well handled. Consequently, its FLOPs is reduced at the cost of a performance drop (33.60 vs. 33.65). Using both spatial and channel masks, our SMSR can effectively localize and skip redundant computation at a finer-grained level to reduce FLOPs by 39% while maintaining comparable performance (33.64 vs. 33.65).

Effect of Sparsity. To investigate the effect of sparsity, we retrained our SMSR with larger $\lambda_0$ to encourage higher sparsity. An Nvidia RTX 2080Ti, an Intel i9-9900K, and Kirin 990/810 chips were used as the GPU, CPU, and mobile processor platforms for evaluation. For a fair comparison of memory consumption and inference time, all convolutional layers in the backbones of the different networks were implemented using im2col-based [5] convolutions, since different implementation methods (e.g., Winograd [23] and FFT [41]) have different computational costs. Comparative results are presented in Table 2.

As $\lambda_0$ increases, our SMSR produces higher sparsities, with more FLOPs and memory consumption being reduced. Further, our network also achieves a significant speedup on the CPU and mobile processors. Due to irregular and fragmented memory access patterns, sparse convolution cannot make full use of the characteristics of general GPUs (e.g., memory coalescing) and relies on specialized designs to improve memory locality and cache hit rate for acceleration [47]. Therefore, the advantage of our SMSR cannot be fully exploited on GPUs without specific optimization. Compared to other state-of-the-art methods, our SMSR (variant 5) obtains better performance with lower memory consumption and shorter inference time on mobile processors. This clearly demonstrates the great potential of our SMSR for applications on mobile devices.

Table 4: Comparative results achieved for ×2/3/4 SR. PSNR/SSIM results of previous works are directly copied from the corresponding papers. FLOPs is computed based on HR images with a resolution of 720p (1280×720). For SMSR, average sparsities on all datasets (0.49/0.39/0.33 for ×2/3/4 SR) are used to calculate FLOPs, with full FLOPs being shown in brackets. Best and second best results are highlighted and underlined.
| Model | Scale | #Params | FLOPs | Set5 | Set14 | B100 | Urban100 | Manga109 |
|---|---|---|---|---|---|---|---|---|
| Bicubic | ×2 | - | - | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403 | 30.80/0.9339 |
| SRCNN [8] | ×2 | 57K | 52.7G | 36.66/0.9542 | 32.45/0.9067 | 31.36/0.8879 | 29.50/0.8946 | 35.60/0.9663 |
| VDSR [19] | ×2 | 665K | 612.6G | 37.53/0.9590 | 33.05/0.9130 | 31.90/0.8960 | 30.77/0.9140 | 37.22/0.9750 |
| DRCN [20] | ×2 | 1774K | 9788.7G | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133 | 37.55/0.9732 |
| LapSRN [22] | ×2 | 813K | 29.9G | 37.52/0.9591 | 33.08/0.9130 | 31.08/0.8950 | 30.41/0.9101 | 37.27/0.9740 |
| MemNet [40] | ×2 | 677K | 623.9G | 37.78/0.9597 | 33.28/0.9142 | 32.08/0.8978 | 31.31/0.9195 | 37.72/0.9740 |
| SRFBN-S [28] | ×2 | 282K | 574.4G | 37.78/0.9597 | 33.35/0.9156 | 32.00/0.8970 | 31.41/0.9207 | 38.06/0.9757 |
| IDN [17] | ×2 | 553K | 127.7G | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196 | 38.01/0.9749 |
| CARN [2] | ×2 | 1592K | 222.8G | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | 38.36/0.9765 |
| FALSR-A [6] | ×2 | 1021K | 234.7G | 37.82/0.9595 | 33.55/0.9168 | 32.12/0.8987 | 31.93/0.9256 | -/- |
| SMSR | ×2 | 985K | 131.6G (224.1G) | 38.00/0.9601 | 33.64/0.9179 | 32.17/0.8990 | 32.19/0.9284 | 38.76/0.9771 |
| Bicubic | ×3 | - | - | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349 | 26.95/0.8556 |
| SRCNN [8] | ×3 | 57K | 52.7G | 32.75/0.9090 | 29.30/0.8215 | 28.41/0.7863 | 26.24/0.7989 | 30.48/0.9117 |
| VDSR [19] | ×3 | 665K | 612.6G | 33.67/0.9210 | 29.78/0.8320 | 28.83/0.7990 | 27.14/0.8290 | 32.01/0.9340 |
| DRCN [20] | ×3 | 1774K | 9788.7G | 33.82/0.9226 | 29.76/0.8311 | 28.80/0.7963 | 27.14/0.8279 | 32.24/0.9343 |
| MemNet [40] | ×3 | 677K | 623.9G | 34.09/0.9248 | 30.01/0.8350 | 28.96/0.8001 | 27.56/0.8376 | 32.51/0.9369 |
| SRFBN-S [28] | ×3 | 375K | 686.4G | 34.20/0.9255 | 30.10/0.8372 | 28.96/0.8010 | 27.66/0.8415 | 33.02/0.9404 |
| IDN [17] | ×3 | 553K | 57.0G | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359 | 32.71/0.9381 |
| CARN [2] | ×3 | 1592K | 118.8G | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 | 33.50/0.9440 |
| SMSR | ×3 | 993K | 67.8G (100.5G) | 34.40/0.9270 | 30.33/0.8412 | 29.10/0.8050 | 28.25/0.8536 | 33.68/0.9445 |
| Bicubic | ×4 | - | - | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577 | 24.89/0.7866 |
| SRCNN [8] | ×4 | 57K | 52.7G | 30.48/0.8628 | 27.50/0.7513 | 26.90/0.7101 | 24.52/0.7221 | 27.58/0.8555 |
| VDSR [19] | ×4 | 665K | 612.6G | 31.35/0.8830 | 28.02/0.7680 | 27.29/0.7260 | 25.18/0.7540 | 28.83/0.8870 |
| DRCN [20] | ×4 | 1774K | 9788.7G | 31.53/0.8854 | 28.02/0.7670 | 27.23/0.7233 | 25.18/0.7524 | 28.93/0.8854 |
| LapSRN [22] | ×4 | 813K | 149.4G | 31.54/0.8850 | 28.19/0.7720 | 27.32/0.7270 | 25.21/0.7560 | 29.09/0.8900 |
| MemNet [40] | ×4 | 677K | 623.9G | 31.74/0.8893 | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630 | 29.42/0.8942 |
| SRFBN-S [28] | ×4 | 483K | 852.9G | 31.98/0.8923 | 28.45/0.7779 | 27.44/0.7313 | 25.71/0.7719 | 29.91/0.9008 |
| IDN [17] | ×4 | 553K | 32.3G | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632 | 29.41/0.8942 |
| CARN [2] | ×4 | 1592K | 90.9G | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 | 30.47/0.9084 |
| SMSR | ×4 | 1006K | 41.6G (57.2G) | 32.12/0.8932 | 28.55/0.7808 | 27.55/0.7351 | 26.11/0.7868 | 30.54/0.9085 |
Table 3: Comparison between learning-based masks and gradient-based masks. Results are achieved on Set14 for ×2 SR.

| $M^{spa}$ | #Params. | $\alpha$ | Sparsity | PSNR | SSIM |
|---|---|---|---|---|---|
| Gradient-based | 926K | 30 | 0.51 | 33.48 | 0.9163 |
| Gradient-based | 926K | 30 | 0.62 | 33.42 | 0.9155 |
| Gradient-based | 926K | 30 | 0.72 | 33.33 | 0.9151 |
| Gradient-based | 926K | 50 | 0.50 | 33.45 | 0.9162 |
| Gradient-based | 926K | 50 | 0.61 | 33.39 | 0.9153 |
| Gradient-based | 926K | 50 | 0.71 | 33.30 | 0.9150 |
| Learning-based | 985K | - | 0.46 | 33.64 | 0.9179 |
| Learning-based | 985K | - | 0.64 | 33.61 | 0.9174 |
| Learning-based | 985K | - | 0.73 | 33.52 | 0.9169 |

Visualization of Sparse Masks. We visualize the sparse masks generated in the first SMM for ×2 SR in Fig. 6. More results are provided in the supplemental material. It can be seen that locations around edges and textures in $M^{spa}$ are considered as "important" ones, which is consistent with our observations in Sec. 3. Moreover, we can see that there are more sparse channels (i.e., green regions in $M^{ch}$) in deep layers than in shallow layers. This means that a subset of the channels in shallow layers is informative enough for "unimportant" regions and our network progressively focuses more on "important" regions as the depth increases. Overall, our spatial and channel masks work jointly for fine-grained localization of redundant computation.

We further investigate the sparsities achieved by our SMMs for different scale factors. Specifically, we feed an LR image (×2 downsampled) to the ×2/3/4 SMSR networks and compare the sparsities in their SMMs. As shown in Fig. 7, the sparsities decrease for larger scale factors in most SMMs. Since more details need to be reconstructed for larger scale factors, more locations are marked as "important" ones (with sparsities being decreased).

Learning-based Masks vs. Heuristic Masks. As regions of edges are usually identified as important ones in our spatial masks (Fig. 6), another straightforward choice is to use heuristic masks. KernelGAN [3] follows this idea by identifying regions with large gradients as important ones when applying ZSSR [38] and uses a masked loss to focus on these regions. To demonstrate the effectiveness of the learning-based masks in our SMSR, we introduced a variant with gradient-induced masks. Specifically, we consider locations with gradients larger than a threshold $\alpha$ as important ones and keep the spatial mask fixed within the network. The performance of this variant is compared to our SMSR in Table 3. Compared to learning-based masks, the variant with gradient-based masks suffers a notable performance drop at comparable sparsity (e.g., 33.52 vs. 33.33/33.30). Further, we can see from Fig. 8 that learning-based masks help our SMSR achieve a better trade-off between SR performance and computational efficiency. Using fixed heuristic masks, it is difficult to obtain fine-grained localization of redundant computation. In contrast, learning-based masks enable our SMSR to accurately localize redundant computation and produce better results.
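For completeness, a gradient-based heuristic mask of the kind used in this ablation could look like the sketch below; the paper does not specify the gradient operator, so NumPy's central differences and the function name are our assumptions.

```python
import numpy as np

def gradient_mask(lr_y, alpha):
    """Heuristic spatial mask: locations with gradient magnitude above `alpha`
    are treated as "important" (cf. Table 3). `lr_y` is an (H, W) luminance array."""
    gy, gx = np.gradient(lr_y.astype(np.float64))
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return (magnitude > alpha).astype(np.float32)
```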

Figure 9: Visual comparison on the Urban100 dataset for ×4 SR.
Figure 10: Visual comparison on a real-world image.

5.3 Comparison with State-of-the-art Methods

We compare our SMSR with nine state-of-the-art methods, including SRCNN [8], VDSR [19], DRCN [20], LapSRN [22], MemNet [40], SRFBN-S [28], IDN [17], CARN [2], and FALSR-A [6]. As this paper focuses on lightweight SR networks (<2M parameters), several recent works with large models (e.g., EDSR [29] (~40M), RCAN [50] (~15M), and SAN [7] (~15M)) are not included for comparison. Quantitative results are presented in Table 4 and visual results are shown in Figs. 9 and 10.

Quantitative Results. As shown in Table 4, our SMSR outperforms the state-of-the-art methods on most datasets. For example, our SMSR achieves much better performance than CARN for ×2 SR, with the number of parameters and FLOPs being reduced by 38% and 41%, respectively. With a comparable model size, our SMSR performs favorably against FALSR-A and achieves better inference efficiency in terms of FLOPs (131.6G vs. 234.7G). With comparable computational complexity in terms of FLOPs (131.6G vs. 127.7G), our SMSR achieves much higher PSNR values than IDN. Using sparse masks to skip redundant computation, our SMSR reduces FLOPs by 41%/33%/27% for ×2/3/4 SR while maintaining state-of-the-art performance. We further show the trade-off between performance, number of parameters, and FLOPs in Fig. 1. We can see that our SMSR achieves the best PSNR performance with low computational cost.

Qualitative Results. Figure 9 compares the qualitative results achieved on Urban100. Compared to other methods, our SMSR produces better visual results with fewer artifacts, such as the lattices in img_004 and the stripes on the building in img_033. We further tested our SMSR on a real-world image to demonstrate its effectiveness. As shown in Fig. 10, our SMSR achieves better perceptual quality while other methods suffer from notable artifacts.

6 Conclusion

In this paper, we explore the sparsity in image SR to improve the inference efficiency of SR networks. Specifically, we develop a sparse mask SR network to prune redundant computation. Our spatial and channel masks work jointly to localize redundant computation at a fine-grained level such that our network can effectively reduce computational cost while maintaining comparable performance. Extensive experiments demonstrate that our network achieves state-of-the-art performance with a significant FLOPs reduction and a speedup on mobile devices.

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful suggestions. Xiaoyu Dong is supported by the RIKEN Junior Research Associate Program. Part of this work was done when she was a master's student at HEU.

References

  • [1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPRW, pages 1122–1131, 2017.
  • [2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, pages 252–268, 2018.
  • [3] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-gan. In NeurIPS, pages 284–293, 2019.
  • [4] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pages 1–10, 2012.
  • [5] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In IWFHR, 2006.
  • [6] Xiangxiang Chu, Bo Zhang, Hailong Ma, Ruijun Xu, Jixiang Li, and Qingyuan Li. Fast, accurate and lightweight super-resolution with neural architecture search. In ICPR, 2020.
  • [7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, 2019.
  • [8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199, 2014.
  • [9] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P. Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, pages 1790–1799, 2017.
  • [10] Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert D. Mullins, and Cheng-Zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In ICLR, 2019.
  • [11] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, pages 9224–9232, 2018.
  • [12] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong. Blind super-resolution with iterative kernel correction. In CVPR, 2019.
  • [13] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, pages 1135–1143, 2015.
  • [14] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In CVPR, pages 4340–4349, 2019.
  • [15] Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Jian Sun, and Tieniu Tan. Meta-SR: A magnification-arbitrary network for super-resolution. In CVPR, 2019.
  • [16] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
  • [17] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In CVPR, 2018.
  • [18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
  • [19] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646–1654, 2016.
  • [20] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, pages 1637–1645, 2016.
  • [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [22] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pages 5835–5843, 2017.
  • [23] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In CVPR, pages 4013–4021, 2016.
  • [24] Wonkyung Lee, Junghyup Lee, Dohyung Kim, and Bumsub Ham. Learning with privileged information for efficient image super-resolution. In ECCV, 2020.
  • [25] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
  • [26] Hao Li, Hong Zhang, Xiaojuan Qi, Ruigang Yang, and Gao Huang. Improved techniques for training adaptive deep networks. In ICCV, pages 1891–1900, 2019.
  • [27] Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR, pages 6459–6468, 2017.
  • [28] Zhen Li, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, and Wei Wu. Feedback network for image super-resolution. In CVPR, pages 3867–3876, 2018.
  • [29] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPR, 2017.
  • [30] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In NeurIPS, pages 2181–2191, 2017.
  • [31] Ming Liu, Zhilu Zhang, Liya Hou, Wangmeng Zuo, and Lei Zhang. Deep adaptive inference networks for single image super-resolution. In ECCVW, 2020.
  • [32] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, pages 2755–2763, 2017.
  • [33] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, pages 5068–5076, 2017.
  • [34] David Martin, Charless Fowlkes, Doron Tal, Jitendra Malik, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • [35] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools Appl., 76(20):21811–21838, 2017.
  • [36] Ravi Teja Mullapudi, William R. Mark, Noam Shazeer, and Kayvon Fatahalian. Hydranets: Specialized dynamic architectures for efficient inference. In CVPR, pages 8080–8089, 2018.
  • [37] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun. Sbnet: Sparse blocks network for fast inference. In CVPR, pages 8711–8720, 2018.
  • [38] Assaf Shocher, Nadav Cohen, and Michal Irani. "Zero-shot" super-resolution using deep internal learning. In CVPR, 2018.
  • [39] Jae Woong Soh, Sunwoo Cho, and Nam Ik Cho. Meta-transfer learning for zero-shot super-resolution. In CVPR, 2020.
  • [40] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Memnet: A persistent memory network for image restoration. In ICCV, pages 4549–4557, 2017.
  • [41] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. In ICLR, 2015.
  • [42] Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In CVPR, 2021.
  • [43] Longguang Wang, Yingqian Wang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning for scale-arbitrary super-resolution from scale-specific networks. arXiv, 2020.
  • [44] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In ECCV, volume 11217, pages 420–436, 2018.
  • [45] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogério Schmidt Feris. Blockdrop: Dynamic inference paths in residual networks. In CVPR, pages 8817–8826, 2018.
  • [46] Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, and Stephen Lin. Spatially adaptive inference with stochastic feature sampling and interpolation. In ECCV, 2020.
  • [47] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In ISCA, 2017.
  • [48] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, volume 6920, pages 711–730, 2010.
  • [49] Kai Zhang, Luc Van Gool, and Radu Timofte. Deep unfolding network for image super-resolution. In CVPR, 2020.
  • [50] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, pages 1646–1654, 2018.
  • [51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, pages 2472–2481, 2018.