[1]\fnmDepeng \surDang

1]\orgdivSchool of Artificial Intelligence, \orgnameBeijing Normal University, \orgaddress\cityBeijing, \postcode100000, \countryChina

A Mountain-Shaped Single-Stage Network for Accurate Image Restoration

\fnmHu \surGao [email protected] \fnmJing \surYang [email protected] \fnmYing \surZhang [email protected] \fnmNing \surWang [email protected] \fnmJingfan \surYang [email protected] [email protected] [

Abstract

Image restoration is the task of aiming to obtain a high-quality image from a corrupt input image, such as deblurring and deraining. In image restoration, it is typically necessary to maintain a complex balance between spatial details and contextual information. Although a multi-stage network can optimally balance these competing goals and achieve significant performance, this also increases the system’s complexity. In this paper, we propose a mountain-shaped single-stage design base on a simple U-Net architecture, which removes or replaces unnecessary nonlinear activation functions to achieve the above balance with low system complexity. Specifically, we propose a feature fusion middleware (FFM) mechanism as an information exchange component between the encoder-decoder architectural levels. It seamlessly integrates upper-layer information into the adjacent lower layer, sequentially down to the lowest layer. Finally, all information is fused into the original image resolution manipulation level. This preserves spatial details and integrates contextual information, ensuring high-quality image restoration. In addition, we propose a multi-head attention middle block (MHAMB) as a bridge between the encoder and decoder to capture more global information and surpass the limitations of the receptive field of CNNs. Extensive experiments demonstrate that our approach, named as M3SNet, outperforms previous state-of-the-art models while using less than half the computational costs, for several image restoration tasks, such as image deraining and deblurring.

keywords:

Image restoration, Single-stage, Feature fusion middleware, Multi-head attention middle block

1 Introduction

Refer to caption — Figure 1: Visualized results of M3SNet on various image restoration tasks. Left: degraded image. Right: the predicted result of M3SNet. From top to bottom: image deblurring, and image deraining task respectively.

Image degradation is a common issue that occurs during image acquisition due to a variety of factors such as camera limitations, environmental conditions, and human factors. For instance, smartphone cameras with narrow apertures, small sensors, and limited dynamic range can produce blurred and noisy images due to device shaking caused by body movements. Similarly, images captured in adverse weather conditions can be affected by haze and rain. Most classical image restoration tasks can be formulated as:

\mathbf{L}=\mathbf{D}(\mathbf{H})+\gamma

(1)

where $\mathbf{L}$ denotes an observed low-quality image, $\mathbf{H}$ refers to its corresponding high-quality image, and $\mathbf{D}(\cdot),\gamma$ indicate the degradation function and the noise during the imaging and transmission processes, respectively. This formulation can signify different image restoration tasks when $\mathbf{D}(\cdot)$ varies. For example, if $\mathbf{D(H)}=\mathbf{H}t-At+A$ , it corresponds to the deraining or dehazing task, where $A$ is the global atmospheric light and $t$ is the transmission matrix defined as:

t=e^{-\alpha d}

(2)

where $\alpha$ is the scattering coefficient of the atmosphere, and $d$ is the distance between the object and the camera.

Image restoration aims to recover the high-quality clean image $\mathbf{H}$ from its degraded image $\mathbf{L}$ . It is a highly ill-posed problem as there are many candidates for any original input. In order to restrict the infinite feasible candidates space to natural images, traditional methods [1, 2, 3, 4, 5, 6, 7] explicitly design appropriately priors for the given kind of restoration problem, such as domain-relevant priors and task-relevant priors. Then, the potential high-quality image can be obtained by solving a maximum a posteriori (MAP) problem:

\mathbf{\hat{H}}=\underset{\mathbf{H}}{\operatorname{arg\,max}}\log P(\mathbf{L}|\mathbf{H})+\log P(\mathbf{H})

(3)

where $P(\mathbf{L}|\mathbf{H})$ represents the probability of observing the degraded image $\mathbf{L}$ given the clean image $\mathbf{H}$ , and $P(\mathbf{H})$ represents the prior distribution of the clean image $\mathbf{H}$ . This can also be expressed as a constrained maximum likelihood estimation:

\mathbf{\hat{H}}=\underset{\mathbf{H}}{\operatorname{arg\,min}}\ \|\mathbf{L}-\mathbf{D(H)}\|^{2}+\lambda\Psi(\mathbf{H})

(4)

where fidelity term $|\mathbf{L}-\mathbf{D(H)}|^{2}$ serves as an approximation for the likelihood $P(\mathbf{L}|\mathbf{H})$ , while the regularization term $\lambda\Psi(\mathbf{H})$ represents either the priors of the latent image $\mathbf{H}$ or the constraints on the solution. The aim is to express the fidelity of the reconstructed image to the original input while simultaneously considering the prior knowledge or constraints imposed on the solution

While designing effective priors for image restoration can be challenging and may not be generalizable. With large-scale data, deep models such as Convolutional Neural Networks(CNNs) [8, 9, 10, 11, 12, 13, 14] and Transformer [15, 16, 17, 18] have emerged as the preferred choice due to their ability to implicitly learn more general priors by capturing natural image statistics and achieving state-of-the-art (SOTA) performance in image restoration. The performance gain of these deep learning models over conventional restoration approaches is primarily attributed to their model design, which includes numerous network modules and functional units for image restoration, such as recursive residual learning [19], transformer [16, 15, 18], encoder-decoders [20, 12, 13], multi-scale models [21, 22], and generative models [23, 24, 25].

Nevertheless, most of these models for low-level vision problems are based on a single-stage design, which ignores the interactions that exist between spatial details and contextualized information. To address this limitation, [8, 9, 10, 11] proposes a multi-stage architecture in which contextualized features are first learned through an encoder-decoder architecture and subsequently integrated with a high-resolution branch to preserve local information. Despite its good performance, this method requires refining the results from the previous stage in the later stage, leading to a high level of system complexity.

Based on the information presented, a natural question that comes to mind is whether it is feasible to use a single-stage architecture to reduce system complexity and achieve the same balance between spatial details and contextualized information as a multi-stage architecture while maintaining the SOTA performance. To achieve this objective, we propose a mountain-shaped single-stage image restoration architecture, called M3SNet, with several key components. 1). We utilize NAFNet [12] as the baseline architecture and concentrate on modifying the network model to attain multi-stage functionality. By emitting the information transfer between the multi-stage and eliminating the nonlinear activation function from the network structure, we are able to reduce the system’s complexity. 2). A feature fusion middleware (FFM) mechanism has been added to facilitate multi-scale information fusion between encoder and decoder blocks from different layers, resulting in the acquisition of more contextual information. Additionally, this approach enables manipulation of the original image resolution, thereby aiding in the preservation of spatial details. 3). A multi-head attention middle block (MHAMB) is the bridge between the encoder and decoder that surpass the limitations of the receptive field of CNNs and capture more global information.

The main contributions of this work are:

1.

A novel single-stage approach capable of generating outputs that are contextually enriched and spatially accurate similar to a multi-stage architecture. Our architecture reduces system complexity due to its single-stage design eliminates the need for information to be passed between stages.
2.

A feature fusion middleware mechanism that enables the exchange of information across multiple scales while preserving the fine details from the input image to the output image.
3.

A multi-head attention middle block captures more global information.
4.

We demonstrate the versatility of M3SNet by setting new state-of-the-art on 6 synthetic and real-world datasets for various restoration tasks (image deraining and deblurring) while maintaining low complexity (see Fig.2). Further, we provide detailed analysis, qualitative results, and generalization tests.

2 Related Work

Image degradation is a common occurrence caused by camera equipment and a variety of environmental factors. Depending on the specific degradation phenomenon, different image restoration tasks are proposed, e.g., deblurring and deraining. Early image restoration work was mainly based on manually crafting some prior knowledge, such as total variation and self-similarity [1, 2, 3, 4, 5, 6, 7]. With the rise of deep learning, data-driven methods like CNN [26, 27, 28, 29, 8, 30] and Transformer [31, 32, 16, 17, 18] have become the dominant approach for image restoration due to their impressive performance. These methods can be categorized as either single-stage or multi-stage based on their architectural design.

2.1 Single-Stage Architecture.

In recent years, the majority of image restoration research has focused on single-stage architecture. Among these architectures, the encoder-decoder based U-Net [12, 13, 18, 33, 25, 8, 34, 35] and dual network structure [36, 37, 38, 39, 40] are mainly included.

Encoder-Decoder Approaches. In recent years, encoder-decoder have gained great attention from researchers in the field of image restoration thanks to its ability to capture multi-scale information. To construct an effective and efficient Transformer-based architecture for image restoration, [18] introduce a novel locally-enhanced window and multi-scale restoration modulator to create a hierarchical encoder-decoder network. [30] utilize selective kernel feature fusion to realize the information exchange of different scales and information aggregation based on attention. [22] develops a simple yet effective boosted decoder to progressively restore the haze-free image by incorporating the strengthen-operate-subtract boosting strategy in the decoder. By eliminating or substituting the nonlinear activation function, [12] establishes a simple baseline that yields measurable outcomes while requiring fewer computing resources.

Dual Network Approaches. The Dual Networks architecture is designed with two parallel branches that separately estimate the structure and detail components of the target signals from the input. These components are then combined to reconstruct the final results according to the specific task formulation module. This architecture was first proposed by [41] and has since inspired a lot of subsequent work, including in the areas of image dehazing [36, 37, 38], image deraining [42], image denoising [39], and image super-resolution/deblurring [40]. The Dual Networks approach has proven to be effective in addressing various low-level vision problems, by enabling a better separation of the structure and detail information, leading to improved performance in terms of both accuracy and computational efficiency. Furthermore, the flexibility of this architecture makes it adaptable to different types of data, making it a popular choice in the field of image restoration.

Despite the significant achievements made by these networks, it remains a challenge to effectively balance these competing goals of preserving spatial details and contextualized information while recovering images.

2.2 Multi-Stage Architecture.

The multi-stage networks are shown to be more effective than their single-stage counterparts in high level vision problems [43, 44, 45, 46]. In recent years, there have been some attempts [47, 48, 11, 10, 8, 49, 50] to apply multi-stage networks to image restoration. They aims to break down the image restoration process into several manageable stages, enabling the use of lightweight subnetworks to progressively restore clear images. This approach facilitates the capture of both spatial details and contextualized information by individual subnetworks at each stage. To prevent the production of suboptimal results that may arise from using the same subnetwork at each stage, a supervisory attention mechanism was proposed along with the adoption of distinct subnetwork structures [8]. Additionally, [49] present a novel self-supervised event-guided deep hierarchical Multi-patch Network to handle blurry images and videos through fine-to-coarse hierarchical localized representations. Nevertheless, this approach elevates the complexity of the system as refining the previous stage’s results is required in subsequent stages.

3 Method

Our primary goal is to create a single-stage network architecture that can efficiently handle the challenging task of image restoration by balancing the need for spatial details and context information, all while using fewer computational resources. The M3SNet is built upon a U-Net architecture, as shown in Figure 3. As is apparent from the figure, in contrast to the traditional U-Net network, we have inverted the architecture and introduced two key components: (a) the feature fusion middleware (FFM) and (b) the multi-head attention middle block (MHAMB). The model’s architecture takes on a mountain-like shape, and we liken the image restoration process to climbing a mountain.

Overall Pipeline. Given a degraded image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$ , M3SNet first applies a $3\times 3$ convolutional layer to extract shallow feature maps $\mathbf{F_{0}}\in\mathbb{R}^{H\times W\times C}$ ( $H,W,C$ are the feature map height, width, and channel number, respectively). Next these shallow features $\mathbf{F_{0}}$ pass through $4-level$ encoder-decoder and one multi-head attention middle block, yielding deep features $\mathbf{F_{DF}}\in\mathbb{R}^{H\times W\times C}$ . Each layer contains multiple feature fusion middleware between the encoder and decoder to capture multi-scale information and retain spatial details. Finally, we apply convolution to deep features $\mathbf{F_{DF}}$ and generate a residual image $\mathbf{R}\in R^{H\times W\times 3}$ to which degraded image is added to obtain the restored image: $\mathbf{\hat{I}}=\mathbf{R}+\mathbf{I}$ . We optimize the proposed network using PSNR loss :

PSNR=10\cdot log_{10}\cdot\frac{(2^{n}-1)^{2}}{||\mathbf{\hat{I}}-\mathbf{\dot{I}}||^{2}}

(5)

where $\mathbf{\dot{I}}$ denotes the ground-truth image.

3.1 Feature Fusion Middleware (FFM)

By incorporating an encoder and decoder network in the initial stage, followed by a network operating at the original image input resolution in the final stage, the multi-stage network can produce high-quality images with accurate spatial details and reliable contextual information. However, the latter stage of this process requires revising the results of the previous stage, which adds a slight level of complexity to the system. While a single-stage network has relatively less complexity, it may struggle to balance spatial details and context information effectively. Therefore, we are exploring a middleware mechanism for feature fusion that enables a simple single-stage U-Net architecture to achieve the same functionality as a multi-stage architecture, without sacrificing smoothness of expression or compromising on the original meaning.

As shown in Fig. 4(a), the feature fusion middleware(FFM) is a nonlinear activation-free block (NAFBlock) with upsample and feature fusion. We have introduced the FFM between the encoder-decoder architectural levels to integrate upper-layer information into adjacent lower layers. This integration takes place sequentially, from the highest layer down to the lowest, until all information is fused into the original image resolution manipulation level. This approach enhances the network’s capacity to capture and fuse multi-scale features, ranging from simple patterns at low levels, such as corner or edge/color connections, to more complex higher-level features, such as significant variations and specific objects. As a result, this structure preserves spatial details while integrating contextual information, ultimately ensuring high-quality image restoration. Formally, let $\mathbf{FE_{i}}\in\mathbb{R}^{\frac{H}{i^{2}}\times\frac{W}{i^{2}}\times{i^{2}}C}$ be the output in the $i$ -th level encoder $(i=1,2,3,4)$ . At each level, the feature fusion information $FFM_{s,i}$ is given as:

FFM_{1,i}=H_{Naf_{1,i}}(FE_{i}\oplus UP(FE_{i+1}))

(6)

FFM_{s,i}=H_{Naf_{s,i}}(UP(FFM_{s-1,i+1})\oplus FFM_{s-1,i})

(7)

where $\oplus$ denote the element-wise addition, $UP(\cdot)$ represents the up-sampling operation and $H_{Naf_{s,i}}(\cdot)$ is the $s$ -th FFM in the $i$ -th level.

This design offers two benefits. Firstly, the feature fusion middleware integrates multi-scale information, allowing the network model to capture abundant context information. Secondly, the feature fusion middleware in the $1_{th}$ layer operates on the original image resolution, without employing any subsampling operation, thereby enabling the network model to acquire detailed spatial information of high-resolution features.

NAFBlock. NAFBlock is a variant of the U-Net network that simplifies the system by replacing or removing the nonlinear activation function. Fig. 4(b) illustrates the process of obtaining an output $Y$ from an input $X$ using Layer Normalization, Convolution, Simple Gate, and Simplified Channel Attention. Express as follows:

X_{1}=X+C_{1}(SCA(SG(C_{3}(C_{1}(LN(X))))))

(8)

Y=X_{1}+C_{1}(SG(C_{1}(LN(X_{1}))))

(9)

SG=X_{f1}\times X_{f2}

(10)

where $C_{1}$ is the $1\times 1$ convolution, $C_{3}$ is the $3\times 3$ depth-wise convolution, GAP is the adaptive average pool, $X_{f1},X_{f2}\in\ R^{H\times W\times\frac{C}{2}}$ are obtained by dividing $X_{f3}$ into channel dimensions, and $SCA(\cdot)$ is shown in Fig. 4(c).

Finally, the depth features $\mathbf{F_{DF}}$ are obtained through this single-stage architecture, as demonstrated below:

FD_{i}=H_{Naf_{\textrm{-1},i}}(FD_{i+1}+FFM_{\textrm{-1},i})

(11)

F_{DF}=FD_{1}

(12)

where $FD_{i}$ is the output in the $i$ -th level decoder, and - $1$ indicates that this is the last feature fusion middleware at this level.

3.2 Multi-Head Attention Middle Block (MHAMB)

The transformer model [18, 16, 17] has gained popularity in image restoration tasks due to its capability to capture global information, as evidenced by its increasing usage in recent years. The computation on a global scale results in a quadratic complexity in relation to the number of tokens as shown in Eq. 15, rendering it inadequate for the representation of high-resolution images.

\mathcal{O}_{MSA}=4hwC^{2}+2(hw)^{2}C

(13)

To alleviate this issue, [17, 16, 32] etc., proposed various methods to reduce complexity. In this paper, we propose a multi-head attention middle block (MHAMB) as the bridge of the encoder-decoder, shown in Fig. 4(d).MHAMB utilizes global self-attention to process and integrate the feature map information that is generated by the last layer of the encoder. This approach is particularly efficient in handling large images because convolution downsamples space, while attention can effectively process smaller resolutions for better performance. From the last layer of the encoder output $FE_{4}$ , our MHAMB first apply a $1\times 1$ convolution and a $3\times 3$ depth-wise convolution to generate the $query,key$ and $value$ matrices $\mathbf{Q}\in\mathbb{R}^{H\times W\times C},\mathbf{K}\in\mathbb{R}^{H\times W\times C}$ and $\mathbf{V}\in\mathbb{R}^{H\times W\times C}$ as follows:

		$\displaystyle Q=C_{3}(C_{1}(FE_{4}))$		(14)
		$\displaystyle K=C_{3}(C_{1}(FE_{4}))$
		$\displaystyle V=C_{3}(C_{1}(FE_{4}))$

Then, we reshape $\mathbf{Q}$ , $\mathbf{K}$ , $\mathbf{V}$ to $\mathbf{\hat{Q}}\in\mathbb{R}^{(HW)\times\frac{C}{h}\times h}$ , $\mathbf{\hat{K}}\in\mathbb{R}^{(HW)\times\frac{C}{h}\times h}$ , $\mathbf{\hat{V}}\in\mathbb{R}^{(HW)\times\frac{C}{h}\times h}$ , where $h$ is the number of head. Next, the attention matrix is thus computed by the self-attention mechanism as :

Attention(\hat{Q},\hat{K},\hat{V})=SoftMax(\frac{\hat{Q}\hat{K}}{\beta})\hat{V}

(15)

where $\beta$ is a learning scaling parameter used to adjust the magnitude of the dot product of $\hat{Q}$ and $\hat{K}$ prior to the application of the softmax function. Finally, we reshape the attention matrix back to its original dimensions of $\mathbb{R}^{H\times W\times C}$ and apply a $1\times 1$ convolution. The resulting output is then added to $FE_{4}$ and passed to the decoder. This allows us to capture more information and enhance the overall performance of the model as shown in the following experiment.

4 Experiments

We evaluate the proposed M3SNet on benchmark datasets for two image restoration tasks, including (a) image deraining, and (b) image deblurring. Fig. 5 displays some of the images predicted by our method. Our model recovered clearer images that were close to the ground truth.

4.1 Datasets and Evaluation Protocol

We use PSNR and SSIM as quality assessment metrics. To report the reduction in error for each method relative to the best-performing method, we convert PSNR to RMSE ( $RMSE\propto\sqrt{10^{-PSNR/10}}$ ) and SSIM to DSSIM (DSSIM = (1 - SSIM)/2).

Image Deraining. As shown in Table. 1. Our derain model is trained on a collection of 13,712 clean-rain image pairs obtained from multiple datasets [51, 52, 53, 54]. We assess the model’s performance on various test sets, including Test100 [53], Test1200 [54], Rain100H [52], and Rain100L [52].

Image Deblurring.As shown in Table. 2. To perform image deblurring, we utilize the GoPro [55] dataset, which consists of 2,103 image pairs for training and 1,111 pairs for evaluation. Additionally, we assess the generalizability of our approach by applying the GoPro-trained model directly to the test images of the HIDE dataset. The HIDE dataset is designed specifically for human-aware motion deblurring, and its test set comprises 2,025 images.

Table 1: Dataset description for image deraining.

Datasets	Train Samples	Test Samples	Testset Rename
Rain14000 [51]	11200	0	-
Rain1800 [52]	1800	0	-
Rain12 [56]	12	0	-
Rain800 [53]	700	98	Test100
Rain1200 [54]	0	1200	Test1200
Rain100H [52]	0	100	Rain100H
Rain100L [52]	0	100	Rain100L

Table 2: Dataset description for image deblurring.

Datasets	Train Samples	Test Samples	Testset Rename
GoPro [55]	2130	1111	GoPro
HIDE [57]	0	2025	HIDE

4.2 Implementation Details

We train the proposed M3SNet without any pre-training and separate models for different image restoration tasks. We utilize the following block configurations in our network for each level: $[1,1,1,28]$ blocks for the encoder, $[1,1,1,1]$ blocks for the decoder, $[2,2,1,0]$ blocks for the FFM. And one MHAMB for the bridge of the encoder and decoder. We train models with Adam [58] optimizer( $\beta_{1}=0.9,\beta_{2}=0.999$ ) and PSNR loss for $5\times 10^{5}$ iterations with the initial learning rate $1\times 10^{-3}$ gradually reduced to $1\times 10^{-7}$ with the cosine annealing [59]. We extract patches of size $256\times 256$ from training images, and the batch size is set to $32$ . We adopt TLC [60] to solve the issue of performance degradation caused by training on patched images and testing on the full image. For data augmentation, we perform horizontal and vertical flips.

4.3 Image Deraining Results

In our image deraining task, we compute the PSNR/SSIM scores using the Y channel in the YCbCr color space, which is consistent with previous works such as[8, 61, 62]. Our method has been demonstrated to outperform existing approaches significantly and consistently, as presented in Table 3. Specifically, our method achieves a remarkable improvement of 0.93 dB and a $10.2\%$ error reduction on average across all datasets when compared to the best CNN-based method, SPAIR [62]. Moreover, we can achieve up to 2.76 dB improvement over HINet [9] on individual datasets, such as Rain100L.

In addition to quantitative evaluations, Fig. 6 presents qualitative results that demonstrate the effectiveness of our M3SNet in removing rain streaks of various orientations and magnitudes while preserving the structural content of the images.

Table 3: Image deraining results. The best and second best scores are highlighted and underlined. Our M3SNet is better than the state-of-the-art by 0.93 dB.

	Test100 [63]		Test1200 [61]		Rain100H [64]		Rain100L [64]		Average
Methods	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$		SSIM $\uparrow$
DerainNet [65]	22.77	0.810	23.38	0.835	14.92	0.592	27.03	0.884	22.48	(73.0%)	0.796	(63.7%)
SEMI [66]	22.35	0.788	26.05	0.822	16.56	0.486	25.03	0.842	22.88	(71.7%)	0.744	(71.1%)
DIDMDN [54]	22.56	0.818	29.65	0.901	17.35	0.524	25.23	0.741	24.58	(65.6%)	0.770	(67.8%)
UMRL [67]	24.41	0.829	30.55	0.910	26.01	0.832	29.18	0.923	28.02	(48.8%)	0.880	(38.3%)
RESCAN [11]	25.00	0.835	30.51	0.882	26.36	0.786	29.80	0.881	28.59	(45.4%)	0.857	(48.3%)
PreNet [10]	24.81	0.851	31.36	0.911	26.77	0.858	32.44	0.950	29.42	(39.9%)	0.897	(28.2%)
MSPFN [61]	27.50	0.876	32.39	0.916	28.66	0.860	32.40	0.933	30.75	(29.9%)	0.903	(23.7%)
MPRNet [8]	30.27	0.897	32.91	0.916	30.41	0.890	36.40	0.965	32.73	(12.0%)	0.921	(6.3%)
SPAIR [62]	30.35	0.909	33.04	0.922	30.95	0.892	36.93	0.969	32.91	(10.2%)	0.926	(0.0%)
HINet [9]	30.29	0.906	33.05	0.919	30.65	0.894	37.28	0.970	32.81	(11.2%)	0.922	(5.1%)
M3SNet-32	31.29	0.903	33.46	0.924	30.64	0.892	39.62	0.984	33.75	(1.0%)	0.926	(0.0%)
M3SNet-64	31.25	0.901	33.52	0.925	30.54	0.889	40.04	0.985	33.84	(0.0%)	0.925	(1.3%)

4.4 Image Deblurring Result

The performance evaluation of image deblurring approaches on the GoPro [55] and HIDE [57] datasets is presented in Table 4. Our M3SNet outperformed other methods, with a performance gain of 0.09dB when averaging across all datasets [55, 57]. Compared to our baseline network NAFNet [12], we improve 0.04 dB and 0.05 dB at 32 widths and 64 widths, respectively. It is worth noting that even though our network is trained solely on the GoPro Dataset, it still achieves state-of-the-art results (31.49 dB in PSNR) on the HIDE dataset. This demonstrates its impressive generalization capability.

Figure 7 displays some of the deblurred images produced by our method. Our model recovered clearer images that were closer to the ground truth than those by others.

Table 4: Image deblurring results. The proposed M3SNet is trained only on the GoPro dataset but achieves a 0.09 dB improvement over the state of the art on the average of the effects on both datasets.

	GoPro [55]		HIDE [57]		Average
Methods	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$	PSNR $\uparrow$		SSIM $\uparrow$
DeblurGAN [23]	$28.70$	$0.858$	$24.51$	$0.871$	26.61	(49.9%)	0.865	(70.4%)
Nah $etal.$ [55]	$29.08$	$0.914$	$25.73$	$0.874$	27.41	(45.1%)	0.894	(62.3%)
DeblurGAN-v2 [25]	29.55	0.934	26.61	0.875	28.08	(40.7%)	0.905	(57.9%)
SRN [47]	30.26	0.934	28.36	0.915	29.31	(31.7%)	0.925	(46.7%)
Gao $etal.$ [68]	30.90	0.935	29.11	0.913	30.01	(26.0%)	0.924	(47.4%)
DBGAN [24]	31.10	0.942	28.94	0.915	30.02	(25.9%)	0.929	(43.7%)
MT-RNN [69]	31.15	0.945	29.15	0.918	30.15	(24.8%)	0.932	(41.2%)
DMPHN [50]	31.20	0.940	29.09	0.924	30.15	(24.8%)	0.932	(41.2%)
Suin $etal.$ [70]	31.85	0.948	29.98	0.930	30.92	(17.8%)	0.939	(34.4%)
SPAIR [62]	32.06	0.953	30.29	0.931	31.18	(15.3%)	0.942	(21.0%)
MIMO-UNet++ [33]	32.45	0.957	29.99	0.930	31.22	(14.9%)	0.944	(28.6%)
MPRNet [8]	32.66	0.959	30.96	0.939	31.81	(8.9%)	0.949	(21.6%)
MPRNet-local [8]	33.31	0.964	31.19	0.945	32.25	(4.2%)	0.955	(11.1%)
Restormer [16]	32.92	0.961	31.22	0.942	32.07	(6.1%)	0.952	(16.7%)
Restormer-local [16]	33.57	0.966	31.49	0.945	32.53	(1.0%)	0.956	(9.1%)
Uformer [18]	32.97	0.967	30.83	0.952	31.90	(8.0%)	0.960	(0.0%)
HINet [9]	32.71	-	-	-	-	-
HINet-local [9]	33.08	0.962	-	-	-	-
NAFNet-32 [12]	32.87	0.961	-	-	-	-
NAFNet-64 [12]	33.69	0.967	-	-	-	-
M3SNet-32 (ours)	32.91	0.965	30.92	0.948	31.92	(7.8%)	0.957	(6.9%)
M3SNet-64 (ours)	33.74	0.967	31.49	0.951	32.62	(0.0%)	0.959	(2.4%)

4.5 Ablation Studies

The ablation studies are conducted on image deblurring (GoPro [55]) to analyze the impact of each of our model components. It is worth noting that we use NAFNet-32 [12] as the baseline network to demonstrate the effectiveness by adding our proposed component. Table. 5 shows the effectiveness of our M3SNet. Next, we describe the impact of each component.

FFM We demonstrate the effectiveness of the proposed feature fusion middleware by adding it to NAFNet-32 [12]. As the Table. 5 shows that the FFM added in each level of encoder-decoder leads to improved performance (32.90 dB) as compared to the baseline. This indicates that our proposed FFM can help the model capture more multi-scale information while also preserving spatial detail information.

MHAMB As the Table. 5 shows a tiny improvement in PSNR from 30.87 dB to 30.88 dB after we add a multi-head attention middle block (MHAMB) to capture more global information in the topmost level.

Adding both of these components improves the performance by a large margin from 32.87 dB to 32.91 dB in PSNR.

Table 5: Ablation study on individual components of the proposed M3SNet.

	NAFNet-32	+ FFM	+ MHAMB	+ FFM & MHAMB
PSNR	32.87	32.90	32.88	32.91
SSIM	0.9606	0.9632	0.9606	0.9644

4.6 Resource Efficient

Table 6: The evaluation of model computational complexity. This is conducted with an input size of

256\times 256

, on an NVIDIA 1060 GPU.

Method	PSNR	Params(M)	MACs(G)
MIMO-UNet++ [33]	$32.68$	$16.1$	1235
MPRNet [8]	$32.66$	20.1	778.2
MPRNet-local [8]	$33.31$	20.1	778.2
HINet [9]	$32.77$	88.7	170.7
Restormer [16]	$32.92$	26.13	140
Restormer-local [16]	$33.57$	26.13	140
Uformer [18]	$32.97$	50.88	89.5
NAFNet-32 [12]	$32.87$	17.1	32
NAFNet-64 [12]	$33.69$	68	65
M3SNet-32 (ours)	$32.91$	16.7	37
M3SNet-64 (ours)	$33.74$	66.3	146

Deep learning models have become increasingly complex in order to achieve higher accuracy. However, larger models require more resources and may not be practical in certain contexts. Therefore, there is a need to design lightweight image restoration models that can achieve high accuracy. In our work, we design a mountain-shaped single-stage network. This architecture optimizes the balance between spatial details and contextual information while minimizing the computational resources required to restore images.

Our M3SNet has been shown to outperform other models, as demonstrated in Table 6. Despite having 0.6M higher parameters than MIMO-unet++ [33], our proposed M3SNet-32 still achieves better performance, while using significantly fewer computational resources, with MACs approximately 40 times smaller than that of MIMO-unet++. Although the MACs of our model are higher than Uformer [18] and NAFNet [12], our model parameters are smaller, yet still perform better. Considering all factors, including model parameters, MACs, and performance, our model is the optimal choice.

5 Conclusion

In this paper, we present a single-stage network with a mountain-shaped structure that effectively captures multi-scale feature information and minimizes the computational resources required for image restoration. Our design is guided by the principle of balancing the competing goals of contextual information and spatial details while recovering images. To this end, we propose a feature fusion middleware mechanism that enables seamless information exchange between the encoder-decoder architecture’s different levels. This approach smoothly combines upper-layer information with adjacent lower-layer information and eventually integrates all information to the original image resolution manipulation level. To overcome the limitations of CNNs’ receptive fields and capture more global information, we utilize a multi-head attention middle block as the bridge of our encoder-decoder architecture. Furthermore, to maintain computational efficiency and lightweight model size, we replace or remove nonlinear activation functions and instead use multiplication. Our extensive experiments on multiple benchmark datasets demonstrate that our M3SNet model significantly outperforms existing methods while utilizing low computational resources.

References

\bibcommenthead
Rudin et al. [1992] Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D Nonlinear Phenomena (1992)
Song and Mumford [1997] Song, C.Z., Mumford, D.: Prior learning and gibbs reaction-diffusion. TPAMI 19(11), 1236–1250 (1997)
Perona and Malik [2002] Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. TPAMI 12(7), 629–639 (2002)
Roth and Black [2005] Roth, S., Black, M.J.: Fields of experts: A framework for learning image priors. In: CVPR (2005)
Kim and Kwon [2010] Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. TPAMI 32(6), 1127 (2010)
Dong et al. [2011] Dong, W., Zhang, L., Shi, G., Wu, X.: Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. TIP 20(7), 1838–1857 (2011)
He et al. [2011] He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. TPAMI (2011)
Zamir et al. [2021] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.-H., Shao, L.: Multi-stage progressive image restoration. In: CVPR (2021)
Chen et al. [2021] Chen, L., Lu, X., Zhang, J., Chu, X., Chen, C.: Hinet: Half instance normalization network for image restoration. In: CVPR (2021)
Ren et al. [2019] Ren, D., Zuo, W., Hu, Q., Zhu, P.F., Meng, D.: Progressive image deraining networks: A better and simpler baseline. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3932–3941 (2019)
Li et al. [2018] Li, X., Wu, J., Lin, Z., Liu, H., Zha, H.: Recurrent squeeze-and-excitation context aggregation net for single image deraining. In: European Conference on Computer Vision (2018)
Chen et al. [2022] Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. arXiv preprint arXiv:2204.04676 (2022)
Chu et al. [2022] Chu, X., Chen, L., Yu, W.: Nafssr: Stereo image super-resolution using nafnet. In: CVPR (2022)
Pan et al. [2022] Pan, J., Sun, D., Zhang, J., Tang, J., Yang, J., Tai, Y.W., Yang, M.H.: Dual convolutional neural networks for low-level vision. IJCV (2022)
Zhang et al. [2023] Zhang, J., Zhang, Y., Gu, J., Zhang, Y., Kong, L., Yuan, X.: Accurate image restoration with attention retractable transformer. In: ICLR (2023)
Zamir et al. [2022] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.-H.: Restormer: Efficient transformer for high-resolution image restoration. In: CVPR (2022)
Tsai et al. [2022] Tsai, F.-J., Peng, Y.-T., Lin, Y.-Y., Tsai, C.-C., Lin, C.-W.: Stripformer: Strip transformer for fast image deblurring. In: ECCV (2022)
Wang et al. [2022] Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer for image restoration. In: CVPR (2022)
Wang et al. [2018] Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Loy, C.C., Qiao, Y., Tang, X.: Esrgan: Enhanced super-resolution generative adversarial networks. In: ECCV Workshops (2018)
Ronneberger et al. [2015] Ronneberger, O., P.Fischer, Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)
Kim et al. [2022] Kim, K., Lee, S., Cho, S.: Mssnet: Multi-scale-stage network for single image deblurring. ArXiv abs/2202.09652 (2022)
Dong et al. [2020] Dong, H., Pan, J., Xiang, L., Hu, Z., Zhang, X., Wang, F., Yang, M.-H.: Multi-scale boosted dehazing network with dense feature fusion. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2154–2164 (2020). https://doi.org/10.1109/CVPR42600.2020.00223
Kupyn et al. [2017] Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D., Matas, J.: Deblurgan: Blind motion deblurring using conditional adversarial networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8183–8192 (2017)
Zhang et al. [2020] Zhang, K., Luo, W., Zhong, Y., Ma, L., Stenger, B., Liu, W., Li, H.: Deblurring by realistic blurring. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2734–2743 (2020)
Kupyn et al. [2019] Kupyn, O., Martyniuk, T., Wu, J., Wang, Z.: Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 8877–8886 (2019)
Anwar and Barnes [2020] Anwar, S., Barnes, N.: Densely residual laplacian super-resolution. TPAMI (2020)
Zhang et al. [2018] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: ECCV (2018)
Zhang et al. [2020] Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image restoration. TPAMI (2020)
Dudhane et al. [2022] Dudhane, A., Zamir, S.W., Khan, S., Khan, F.S., Yang, M.-H.: Burst image restoration and enhancement. In: CVPR (2022)
Zamir et al. [2022] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.-H., Shao, L.: Learning enriched features for fast image restoration and enhancement. TPAMI (2022)
Conde et al. [2022] Conde, M.V., Choi, U.-J., Burchi, M., Timofte, R.: Swin2SR: Swinv2 transformer for compressed image super-resolution and restoration. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2022)
Liang et al. [2021] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. arXiv preprint arXiv:2108.10257 (2021)
Cho et al. [2021] Cho, S.J., Ji, S.W., Hong, J.P., Jung, S.W., Ko, S.J.: Rethinking coarse-to-fine approach in single image deblurring. In: ICCV (2021)
Yue et al. [2020] Yue, Z., Zhao, Q., Zhang, L., Meng, D.: Dual adversarial network: Toward real-world noise removal and noise generation. In: ECCV, (2020)
Zhang et al. [2020] Zhang, K., Li, Y., Zuo, W., Zhang, L., Gool, L.V., Timofte, R.: Plug-and-play image restoration with deep denoiser prior. TPAMI 44, 6360–6376 (2020)
Zhu et al. [2018] Zhu, H., Xi, P., Chandrasekhar, V., Li, L., Lim, J.H.: Dehazegan: When image dehazing meets differential programming. In: Twenty-Seventh International Joint Conference on Artificial Intelligence IJCAI-18 (2018)
Guo et al. [2019] Guo, T., Li, X., Cherukuri, V., Monga, V.: Dense scene information estimation network for dehazing. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2019)
Yang et al. [2019] Yang, A., Wang, H., Ji, Z., Pang, Y., Shao, L.: Dual-path in dual-path network for single image dehazing. In: Twenty-Eighth International Joint Conference on Artificial Intelligence IJCAI-19 (2019)
Tian et al. [2021] Tian, C., Xu, Y., Zuo, W., Du, B., Lin, C.-W., Zhang, D.: Designing and training of a dual cnn for image denoising. Knowledge-Based Systems 226, 106949 (2021)
Singh et al. [2020] Singh, V., Ramnath, K., Mittal, A.: Refining high-frequencies for sharper super-resolution and deblurring. Computer Vision and Image Understanding (2020)
Pan et al. [2018] Pan, J., Liu, S., Sun, D., Zhang, J., Liu, Y., Ren, J., Li, Z., Tang, J., Lu, H., Tai, Y.W.a.: Learning dual convolutional neural networks for low-level vision. In: CVPR (2018)
Siyuan et al. [2018] Siyuan, L.I., Ren, W., Zhang, J., Yu, J., Guo, X.: Fast Single Image Rain Removal via a Deep Decomposition-Composition Network. Computer Vision and Image Understanding (2018)
Li et al. [2019] Li, W., Wang, Z., Yin, B., Peng, Q., Du, Y., Xiao, T., Yu, G., Lu, H., Wei, Y., Sun, J.: Rethinking on multi-stage networks for human pose estimation. ArXiv abs/1901.00148 (2019)
Cheng et al. [2019] Cheng, B., Chen, L.-C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T., Hwu, W.-m.W., Shi, H.: Spgnet: Semantic prediction guidance for scene parsing. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 5217–5227 (2019)
Ghosh et al. [2018] Ghosh, P., Yao, Y., Davis, L.S., Divakaran, A.: Stacked spatio-temporal graph convolutional networks for action segmentation. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 565–574 (2018)
Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1 (2020) https://doi.org/%****␣sn-article.bbl␣Line␣725␣****10.1109/TPAMI.2020.3021756
Tao et al. [2018] Tao, X., Gao, H., Wang, Y., Shen, X., Wang, J., Jia, J.: Scale-recurrent network for deep image deblurring. CVPR (2018)
Fu et al. [2018] Fu, X., Liang, B., Huang, Y., Ding, X., Paisley, J.: Lightweight pyramid networks for image deraining. IEEE Transactions on Neural Networks and Learning Systems (2018)
Zhang et al. [2022] Zhang, H., Zhang, L., Dai, Y., Li, H., Koniusz, P.: Event-guided multi-patch network with self-supervision for non-uniform motion deblurring. International Journal of Computer Vision, 1–18 (2022)
Zhang et al. [2019] Zhang, H., Dai, Y., Li, H., Koniusz, P.: Deep stacked hierarchical multi-patch network for image deblurring. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Fu et al. [2017] Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., Paisley, J.: Removing rain from single images via a deep detail network. In: CVPR (2017)
Yang et al. [2017] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. CVPR (2017)
Zhang et al. [2017] Zhang, H., Sindagi, V.A., Patel, V.M.: Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology 30, 3943–3956 (2017)
Zhang and Patel [2018] Zhang, H., Patel, V.M.: Density-aware single image de-raining using a multi-stream dense network. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 695–704 (2018)
Nah et al. [2016] Nah, S., Kim, T.H., Lee, K.M.: Deep multi-scale convolutional neural network for dynamic scene deblurring. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 257–265 (2016)
Li et al. [2016] Li, Y., Tan, R.T., Guo, X., Lu, J., Brown, M.S.: Rain streak removal using layer priors. In: CVPR (2016)
Shen et al. [2019] Shen, Z., Wang, W., Lu, X., Shen, J., Ling, H., Xu, T., Shao, L.: Human-aware motion deblurring. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 5571–5580 (2019)
Kingma and Ba [2014] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. Computer Science (2014)
Loshchilov and Hutter [2016] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts (2016)
Chu et al. [2021] Chu, X., Chen, L., Chen, C., Lu, X.: Improving image restoration by revisiting global information aggregation. In: ECCV (2021)
Jiang et al. [2020] Jiang, K., Wang, Z., Yi, P., Chen, C., Huang, B., Luo, Y., Ma, J., Jiang, J.: Multi-scale progressive fusion network for single image deraining. CVPR (2020)
Purohit et al. [2021] Purohit, K., Suin, M., Rajagopalan, A.N., Boddeti, V.N.: Spatially-adaptive image restoration using distortion-guided networks. CoRR abs/2108.08617 (2021)
Zhang et al. [2017] Zhang, H., Sindagi, V.A., Patel, V.M.: Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology 30, 3943–3956 (2017)
Yang et al. [2016] Yang, W., Tan, R.T., Feng, J., Liu, J., Guo, Z., Yan, S.: Deep joint rain detection and removal from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1685–1694 (2016)
Fu et al. [2016] Fu, X., Huang, J., Ding, X., Liao, Y., Paisley, J.W.: Clearing the skies: A deep network architecture for single-image rain removal. TIP 26, 2944–2956 (2016)
Wei et al. [2018] Wei, W., Meng, D., Zhao, Q., Xu, Z.: Semi-supervised cnn for single image rain removal. ArXiv abs/1807.11078 (2018)
Yasarla and Patel [2019] Yasarla, R., Patel, V.M.: Uncertainty guided multi-scale residual learning-using a cycle spinning cnn for single image de-raining. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8397–8406 (2019)
Gao et al. [2019] Gao, H., Tao, X., Shen, X., Jia, J.: Dynamic scene deblurring with parameter selective sharing and nested skip connections. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3843–3851 (2019)
Park et al. [2019] Park, D., Kang, D.U., Kim, J., Chun, S.Y.: Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In: ECCV (2019)
Suin et al. [2020] Suin, M., Purohit, K., Rajagopalan, A.N.: Spatially-attentive patch-hierarchical network for adaptive motion deblurring. CVPR, 3603–3612 (2020)