Structural Residual Learning
for Single Image Rain Removal
Abstract
To alleviate the adverse effect of rain streaks in image processing tasks, CNN-based single image rain removal methods have recently been proposed. However, the performance of these deep learning methods largely relies on the range of rain shapes covered by the pre-collected training rainy-clean image pairs. This makes them prone to overfitting the training samples, so that they cannot generalize well to practical rainy images with complex and diverse rain streaks. To address this generalization issue, this study proposes a new network architecture that enforces the output residual of the network to possess intrinsic rain structures. Such a structural residual setting guarantees that the rain layer extracted by the network finely complies with the prior knowledge of general rain streaks, and thus regularizes sound rain shapes to be well extracted from rainy images in both the training and prediction stages. This general regularization naturally leads to both better training accuracy and better testing generalization, even for rain configurations unseen in training. Such superiority is comprehensively substantiated, both visually and quantitatively, by experiments on synthetic and real datasets, as compared with current state-of-the-art methods.
Index Terms:
Generalization performance, single image deraining, deep learning, multi-scale learning, convolutional sparse coding, interpretability.
I Introduction
Images captured on rainy days often suffer from noticeable degradation of scene visibility, which adversely affects the performance of subsequent image processing tasks, such as video surveillance [1, 2] and intelligent vehicles [3]. Removing rain streaks from a rainy image has thus become a necessary step for a wide range of practical applications, and has attracted much research attention recently [4, 5].

Many single image deraining methods have been proposed in recent decades. Some early attempts exploited different filters, such as the guided filter [6, 7] and the smoothing filter [8], to decompose a rainy image into a low frequency part (LFP) and a high frequency part (HFP), and then restored the rain-free image by combining the LFP with the texture details. Recently, many researchers further explored the physical properties of rain and background layers and formulated them into different prior terms for single image rain removal. For example, the Gaussian mixture model (GMM) [9] and discriminative sparse coding (DSC) [10] have been used to model rain layers. Besides, joint convolutional analysis and synthesis sparse representation (JCAS) [11] is a typical instance that describes rain and background layers with concise mathematical models. These methods have been shown to be effective in certain scenarios. Such prior-model-based techniques, however, still lack the ability to flexibly adapt to rainy images with complicated rain shapes and background scenes. Besides, these methods are generally time-consuming due to their inevitable iterative optimization computations, which is unfriendly to practical users.
Very recently, driven by the success of deep learning (DL) techniques in low-level vision tasks, related techniques have also been employed in the deraining task. They routinely require pre-collecting abundant training samples of rainy-clean image pairs to learn the non-linear mapping from a rainy image to its rain-free counterpart in an end-to-end manner. The most representative methods along this research line include DerainNet [12], the deep detail network (DDN) [13], the deep joint rain detection and removal network (JORDER_E) [14, 15], the recurrent squeeze-and-excitation context aggregation network (RESCAN) [16], the progressive image deraining network (PReNet) [17], and the spatial attentive network (SPANet) [18].
Albeit achieving success in certain contexts, current DL methods for single image deraining still exhibit evident limitations. Specifically, their effectiveness largely relies on the quality and quantity of the pre-collected training samples, which are composed of a large number of rainy-clean image pairs simulating the network inputs and outputs. The training images should cover as wide a range of rain shapes as possible, so as to include those potentially occurring in the testing stage. On the one hand, the non-linear mapping represented by the deep network architecture brings strong fitting capability for approximating the abundant rain structures contained in training images, beyond conventional model-based approaches. On the other hand, however, such a complex network expression also tends to produce redundancy in expressing rain streaks in training samples and excessive flexibility when adapting to newly input rainy images. A DL model achieving good training accuracy might not finely extract rains from some practical rainy images with complex rain streaks unseen in the training stage (e.g., extracted rain layers containing evident background scenes, as shown in our experiments). How to alleviate this generalization issue has become a core problem in current single image deraining research [18].
Against this issue, we build a specific network architecture, called the structural residual network (SRNet), by enforcing its output residual to be finely accordant with the prior expressions of general rains. Such a structural residual setting forms a strong regularizer for rain layer extraction, guaranteeing the rationality of output rain streaks even for unseen rain shapes differing from those contained in the training samples, and is thus expected to achieve better generalization in the testing stage.
Specifically, inspired by previous research on rain structure modeling, we summarize the following prior knowledge for representing general rain streaks, and employ it to build our network architecture. Firstly, as explored in [11] and [19], rain streaks always repeatedly appear at different locations over a rainy image with similar local patterns, such as shape, thickness, and direction. Such knowledge can be delivered by a convolutional representation under a set of local filters across the entire image, whose structures are expected to capture the local repetitive patterns inside rains. Secondly, an evident prior of rains is that they are always sparsely scattered over an image [10, 20]. This knowledge can be expressed by the sparsity of the feature maps (which can be seen as coefficients) convolved with those local filters. In our network, each feature map of the last layer is yielded by an embedded encoder-decoder subnetwork, in which the joint MaxPooling and MaxUnpooling operation naturally leads to the sparsity of all last-layer feature maps. Thirdly, another common prior used for describing rain streaks is their multi-scale characteristic [21], referring to the fact that rain streaks at different distances appear with different sizes in a rainy image. We thus construct the output of our network as an integration of multiple encoder-decoder subnetworks by utilizing dilated convolutions [22] with different dilation factors. This facilitates a rational expression for extracted rain layers at different scales. The architecture is demonstrated in Fig. 1 for easy observation.
Note that such a structural residual setting keeps the rain layer extracted by the network compliant with the prior knowledge underlying general rains, and thus functions like a conventional regularizer, rectifying the output rain shapes so that they are well extracted from rainy images in both the training and prediction stages. This naturally leads to both good training accuracy and testing generalization capability, even for unseen rain configurations. Such superiority of the proposed network is substantiated by comprehensive experiments implemented on synthetic and real datasets, especially regarding its fine generalization capability. Furthermore, ablation studies are provided to verify the necessity of all modules involved in our network.
The paper is organized as follows. Section II briefly reviews related works. Section III presents the proposed structural residual single image deraining network as well as the network training details. Comprehensive experiments are shown in Section IV and the paper is finally concluded.
II Related Work
In this section, we briefly review the current developments in video deraining and single image deraining methods.
II-A Video Deraining Methods
Garg and Nayar [23, 24] initially discussed the visual effects of raindrops on imaging systems and proposed a video deraining method, capturing the dynamics of raindrops through a space-time correlation model and describing the photometry of rain via a motion blur model. Later, researchers further investigated the physical properties of rain streaks for video deraining, such as temporal-chromatic [25, 26] and spatio-temporal frequency characteristics [27, 28].
In the past years, more intrinsic prior structures of rain streaks and background scenes in rainy videos have been analyzed and formulated for deraining algorithm design. For example, Chen et al. [29] investigated the non-local similarity and repeatability of rain streaks and provided a low-rank model. To remove rain and snow from a video, Kim et al. [30] adopted a low-rank matrix completion strategy. To handle heavy rain streaks and dynamic scenes, Ren et al. [31] decomposed rain streaks into two classes, sparse ones and dense ones, and detected them with different models. Recently, Wei et al. [32] stochastically encoded rain streaks as a patch-based mixture of Gaussians. Li et al. [21] further explored the prior structure of rain streaks and showed that they exhibit repetitive local patterns and multi-scale characteristics. Based on this observation, the authors proposed a multi-scale convolutional sparse coding model that achieves state-of-the-art deraining performance, with the background layer in a video also finely extracted.
Very recently, deep learning techniques have achieved great success in various low-level vision tasks, such as image denoising [33, 34], image super-resolution [35, 36, 37], and image deblurring [38, 39]. Recent years have also witnessed the development of deep learning in the deraining task [40]. To remove rains from videos, Liu et al. [41] provided a hybrid rain model and constructed a multi-task learning network architecture that successively accomplished rain degradation classification, rain removal, and background restoration. Further, the authors developed a dynamic routing residue recurrent network [42] for handling dynamically detected rainy videos. Compared with these video deraining methods, the single image deraining task is much more challenging in practice due to the lack of temporal information.
II-B Single Image Deraining Methods
For the single image deraining task, Xu et al. [6] considered the chromatic property of rain streaks and achieved a coarse rain-free image via a guided filter. Kim et al. [43] analyzed the geometric property of rains to detect rain streak regions, and then reconstructed the derained result by executing nonlocal means filtering on the detected regions.
Later on, researchers resorted to utilizing domain knowledge to encode rains for helping the deraining task [44, 45]. For example, based on morphological component analysis, Fu et al. [46] considered single image deraining as a signal decomposition problem; the authors then executed a bilateral filter, dictionary learning, and sparse coding to acquire the LFP and HFP for obtaining the derained result. Afterwards, Luo et al. [10] adopted a screen blend model and proposed to utilize highly discriminative codes over a learned dictionary to sparsely approximate the rain and background layers. To represent multiple orientations and scales of rain streaks, Li et al. [9] designed a GMM-based patch prior. By employing the sparsity and gradient statistics of rain and background layers, Zhu et al. [20] formulated three regularization terms to progressively extract rain streaks. Afterwards, Gu et al. [11] utilized analysis sparse representation to represent large-scale image structures and synthesis sparse representation to describe fine-scale image textures. Meanwhile, Zhang et al. [19] proposed to learn a set of sparsity-based and low-rank-based convolutional filters for describing background and rain layers, respectively. Albeit achieving good performance in certain scenarios, these prior-model-based methods require time-consuming iterative inference and often cannot finely fit the diverse rain shapes in real rainy images, due to their relatively simple and subjective prior assumptions.
More recently, CNN-based DL methods have been proposed for the task [16, 47, 18]. Fu et al. [12] first designed DerainNet to predict the clean image. To make the training process easier, the authors [13] further proposed DDN to remove rain content in the HFP. Later, Zhang et al. [47] proposed a conditional generative adversarial deraining network. Considering the complicatedness of rain streaks, the authors [48] further incorporated a residual-aware classification process and designed a rain-density-aware multi-stream dense network. Furthermore, Yang et al. [14] reformulated the commonly-used rain model and developed a multi-task architecture to jointly learn the binary rain streak map, the appearance of rain streaks, and the clean background. To achieve better visual quality, the authors further introduced a detail preservation step [15]. Very recently, by repeatedly unfolding a shallow ResNet with a recurrent layer, Ren et al. [17] presented a simple baseline network, PReNet. To make networks better reflect the prior knowledge of rain structures, some beneficial attempts have also been made. For example, Mu et al. [49] proposed to implicitly describe prior structures by data-dependent networks and then formulated them into bi-layer optimization iterations. To alleviate the hard-to-collect-training-sample and overfitting-to-training-sample issues, Wei et al. [50] formulated the rain layer prior as a GMM and trained the backbone, DDN, in a semi-supervised manner. Despite their good performance, these DL methods embed such beneficial rain priors either inside the network architecture [49] or in the loss used to train the network [50]. The high flexibility of the non-linear network mapping might produce outputs deviating from practical rain shapes, especially in testing stages with rain configurations unseen in the training samples; their generalization performance can thus be negatively influenced. This is the main motivation of this study: to make the network extract, from a rainy image, a rain layer that complies with the intrinsic rain structures explored in previous research.

III Structural Residual Network for Single Image Rain Removal
III-A Basic Rainy Image Model
An input rainy image is denoted as $\mathcal{O}\in\mathbb{R}^{h\times w}$, where $h$ and $w$ denote its height and width, respectively. The commonly-used rainy image model is:

$$\mathcal{O}=\mathcal{B}+\mathcal{R},\qquad(1)$$

where $\mathcal{B}$ and $\mathcal{R}$ represent the background layer and the rain layer of the image, respectively. The goal of DL-based single image derainers is to design rational network architectures to learn a non-linear mapping function from an input rainy image $\mathcal{O}$ to its background layer $\mathcal{B}$ or to the residual rain layer $\mathcal{R}$.
Our aim is to construct a network architecture whose output residual is capable of sufficiently representing the prior rain structures explored by previous investigations, so as to rectify a rational and interpretable network prediction even for unseen and complicated rainy image inputs. Specifically, three known priors have been considered in this network design. The first is the local-repetitive-pattern prior proposed in [11] and [19]: rain streaks always repeatedly appear at different locations over a rainy image with similar local patterns, such as shape, thickness, and direction. The second is the coefficient-sparsity prior, which describes the fact that rain streaks are always sparsely scattered over an image [10, 20]. The third is the multi-scale prior [21], which refers to the phenomenon that rain streaks pictured at different distances from the camera appear with different sizes in a rainy image. Based on this, the single image rain model (1) can be more elaborately formulated as:
$$\mathcal{O}=\mathcal{B}+\sum_{s=1}^{3}\mathcal{R}_{s}=\mathcal{B}+\mathcal{R}_{1}+\mathcal{R}_{2}+\mathcal{R}_{3},\qquad(2)$$

where $\mathcal{R}_{1}$, $\mathcal{R}_{2}$, and $\mathcal{R}_{3}$ denote the rain layers separated at three different scales, respectively.
We then introduce how to construct a deep network with carefully designed structural residual output finely conveying all the aforementioned prior knowledge of rain streaks.
III-B Structural Residual Network Architecture
In this section, we construct a single image deraining network to learn the structural residual output. As shown in Fig. 2, the proposed network, called the structural residual network (SRNet), mainly consists of three parallel sub-networks with similar structures but different dilation factors (DF), to extract the rain layer at different scales. The rain-removed result is estimated by subtracting the entire residual rain layer $\mathcal{R}=\mathcal{R}_{1}+\mathcal{R}_{2}+\mathcal{R}_{3}$ from the input $\mathcal{O}$. The design details are described as follows.
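To make this data flow concrete, the following is a minimal PyTorch sketch of the structural residual design; the class name, argument names, and the shape-compatible stand-in modules are hypothetical, and the real shallow extractor and encoder-decoder branches of Fig. 2 are elided.

```python
import torch
import torch.nn as nn

class SRNetSketch(nn.Module):
    """Hypothetical top-level sketch: a shallow feature extractor followed by
    three parallel branches, one per rain-layer scale (DF = 1, 2, 3)."""
    def __init__(self, shallow, branches):
        super().__init__()
        self.shallow = shallow                   # e.g., conv + two Resblocks
        self.branches = nn.ModuleList(branches)  # encoder-decoder subnetworks

    def forward(self, O):
        feats = self.shallow(O)
        # each branch predicts one scale of the rain layer; their sum is the
        # structural residual R = R_1 + R_2 + R_3 of Eq. (2)
        R = sum(branch(feats) for branch in self.branches)
        return O - R, R                          # derained image, rain layer

# toy usage with shape-compatible stand-in modules
net = SRNetSketch(
    nn.Conv2d(3, 32, 3, padding=1),
    [nn.Conv2d(32, 3, 3, padding=d, dilation=d) for d in (1, 2, 3)],
)
B_hat, R = net(torch.randn(1, 3, 100, 100))
```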



III-B1 Convolutional Sparse Coding Structure for Rain Streaks
As displayed in Fig. 2(a), the SRNet first adopts one convolution layer and two Resblocks to extract shallow features. Consistent with the model (2), three encoder-decoder subnetworks are constructed by exploiting dilated convolutions [22] with different DFs to extract $\mathcal{R}_{1}$, $\mathcal{R}_{2}$, and $\mathcal{R}_{3}$ at relatively small, medium, and large scales, respectively. Here we take the extraction of the small-scale rain layer $\mathcal{R}_{1}$ as an example and illustrate the encoder-decoder network structure.
As shown in Fig. 2(b), the proposed encoder-decoder subnetwork is symmetrically composed of an encoder part and a decoder part. In the encoder part, two consecutive Resblocks obtain deeper features from the shallow feature input, and each Resblock is followed by one MaxPooling layer for downsampling. Symmetrically, two extra Resblocks are stacked in the decoder part, each followed by one MaxUnpooling layer for upsampling the preceding feature activations. The MaxUnpooling layer exploits the pooling indices computed in the corresponding MaxPooling layer to execute non-linear upsampling, as shown in Fig. 3. With the symmetrical combination of MaxPooling/MaxUnpooling layers and a feature fusion operation with successive convolution layers, we obtain a sparse rain feature map, as displayed in Fig. 2(b). Such sparsity finely represents the aforementioned coefficient-sparsity prior, i.e., the sparse locations of rain streaks across the image. Subsequently, by imposing a convolution layer on this sparse feature map, the small-scale rain layer $\mathcal{R}_{1}$ can be extracted. Such a convolutional coding manner naturally delivers the local-repetitive-pattern prior of rain shapes over the entire image.
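As a standalone illustration of why the joint pooling design yields sparse maps (a toy, not the full subnetwork; kernel and stride 2 are assumed), PyTorch's MaxUnpool2d scatters each pooled activation back to the position recorded by the paired MaxPool2d and zeros out everything else:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 32, 100, 100)
pooled, indices = pool(x)           # encoder: keep maxima and their positions
restored = unpool(pooled, indices)  # decoder: non-linear upsampling via indices

# each 2x2 window keeps a single non-zero entry, so ~75% of the map is zero
print((restored == 0).float().mean().item())
```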
As shown in [36] and [53], incorporating skip or dense connection strategies into a deeper model can improve the understanding and representation of image contents. Hence, to fully utilize the shallow features and propagate long-distance spatial context information, we introduce global residual learning [37] into the encoder-decoder network, shown as skip connection 3 in Fig. 2(b). Besides, we incorporate diverse local residual learning, including the skip connection inside each Resblock and the skip connections between the encoder part and the decoder part, shown as 1 and 2 in Fig. 2(b). In this way, we aim to learn stronger feature representations for better extracting rain structures, while also making the training process easier [51]. The contributions of the MaxUnpooling layers and these skip connection settings to the final performance of our network are validated in the following ablation experiments.

III-B2 Multi-Scale Structure for Rain Streaks
To represent the multi-scale configuration of rain streaks, we simply resort to dilated convolution, which weights pixels with a step size of DF and enlarges the receptive field without losing resolution [22]. Specifically, we repeatedly construct the proposed encoder-decoder subnetwork with the same kernel size but three different DFs (1, 2, and 3), as displayed in Fig. 2(a). The three branches thus have their own receptive fields to capture information at different scales. Fig. 4 demonstrates the effect of this multi-scale specification by comparing the rain-removed images restored by our network with and without dilated convolutions. One can easily observe that with dilated convolution, the SRNet acquires a larger receptive field, which helps remove long rain streaks as well as preserve large-scale image details.
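For reference, parallel dilation branches of this kind can be written in PyTorch as follows (the channel width 32 is an arbitrary placeholder). Setting the padding equal to the dilation factor keeps the spatial size fixed while a 3x3 kernel's effective receptive field grows to 3x3, 5x5, and 7x7:

```python
import torch.nn as nn

# identical kernel size, different dilation factors; padding = DF preserves
# resolution, so the branches differ only in receptive field
multi_scale = nn.ModuleList([
    nn.Conv2d(32, 32, kernel_size=3, padding=df, dilation=df)
    for df in (1, 2, 3)
])
```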
Here, we present a visualization of the structural residual learned by the proposed SRNet to express the rain layer, as shown in Fig. 5, which demonstrates the rain layers and sparse rain feature maps produced by our network on two testing images. It is easy to see obvious differences among the feature maps of the three branches, reflecting their multi-scale characteristics; they approximately illustrate the directions, positions, and shapes of rain streaks, implying the convolutional coding mechanism of the proposed method. In such a manner, the proposed SRNet is expected to capture rational rain streaks complying with these pre-known prior structures.
III-C Objective Function
As the sensitivity of the human visual system depends on local luminance, contrast, and structure, image quality can be quantitatively measured by the structural similarity (SSIM) index [54]; the larger the SSIM value, the better the image quality. To train the SRNet, similar to [17], we simply adopt the negative SSIM as the objective function:

$$\mathcal{L}=-\,\mathrm{SSIM}\big(\mathcal{O}-\hat{\mathcal{R}},\;\mathcal{B}\big),\qquad(3)$$

where $\hat{\mathcal{R}}$ is the rain layer extracted by the network, $\mathcal{O}-\hat{\mathcal{R}}$ is the derained result, and $\mathcal{B}$ is the ground-truth background.
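One possible implementation of this loss, assuming the third-party pytorch_msssim package as the SSIM routine (a common choice; the authors' original code may use a different implementation):

```python
import torch
from pytorch_msssim import ssim  # pip install pytorch-msssim (assumed choice)

def negative_ssim_loss(derained, gt):
    # Eq. (3): maximize SSIM between the derained output and the ground truth
    # by minimizing its negative; inputs are assumed to lie in [0, 1]
    return -ssim(derained, gt, data_range=1.0)

loss = negative_ssim_loss(torch.rand(2, 3, 100, 100),
                          torch.rand(2, 3, 100, 100))
```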
III-D Training Details
We use PyTorch [52] to implement the proposed SRNet, on a PC equipped with an Intel(R) Core(TM) i7-8700K CPU at 3.70GHz and one Nvidia GeForce GTX 1080Ti GPU. Since we mainly care about the effectiveness of the network architecture for rain removal, instead of relying on extra tricks such as designing complicated loss functions and tuning hyper-parameters, we directly adopt the parameter settings of the latest baseline network, PReNet [17]. Specifically, the patch size is 100×100 and the Adam optimizer [55] is used with a batch size of 18. The initial learning rate follows PReNet and is divided by 5 after reaching 30, 50, and 80 epochs, with 100 epochs in total. It is worth mentioning that these parameter settings are kept the same for all datasets in the subsequent experiments, which shows the favorable robustness and generality of our method. Besides, to balance the trade-off between running time and deraining performance on synthetic and real datasets, we simply adopt a default depth and width setting in our experiments, where the depth denotes the number of symmetrical downsampling/upsampling operations, namely, the number of Resblock+MaxPooling (Resblock+MaxUnpooling) modules in Fig. 2(b).
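In PyTorch, this schedule amounts to Adam plus a MultiStepLR decay (dividing by 5 equals gamma=0.2). The sketch below reuses the toy net and negative_ssim_loss from the earlier sketches, sets the initial rate to PReNet's published 1e-3 (an assumption about the exact value elided above), and uses a dummy loader standing in for real batches of 18 rainy/clean 100×100 patch pairs:

```python
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # PReNet's reported rate
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 50, 80], gamma=0.2)  # divide lr by 5 at these epochs

# dummy stand-in for batches of 18 rainy/clean 100x100 patch pairs
loader = [(torch.rand(18, 3, 100, 100), torch.rand(18, 3, 100, 100))]

for epoch in range(100):
    for rainy, clean in loader:
        optimizer.zero_grad()
        derained, _ = net(rainy)
        loss = negative_ssim_loss(derained, clean)
        loss.backward()
        optimizer.step()
    scheduler.step()
```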

IV Experimental Results
In this section, the effectiveness of the SRNet is verified via comprehensive experiments implemented on synthetic and real datasets, comparing with current state-of-the-art methods, including model-based ones: DSC [10] (https://sites.google.com/view/taixiangjiang/%E9%A6%96%E9%A1%B5/state-of-the-art-methods), GMM [9] (http://yu-li.github.io/), JCAS [11] (https://sites.google.com/site/shuhanggu/home); and DL-based ones: Clear [12] (https://xueyangfu.github.io/projects/tip2017.html), DDN [13] (https://xueyangfu.github.io/projects/cvpr2017.html), RESCAN [16] (https://github.com/XiaLiPKU/RESCAN), PReNet [17] (https://github.com/csdwren/PReNet), SPANet [18] (https://stevewongv.github.io/derain-project.html), JORDER_E [15] (https://github.com/flyywh), and SIRR [50] (https://github.com/wwzjer/Semi-supervised-IRR).
IV-A Experiments on Synthetic Data
Synthetic Datasets. We adopt four benchmark datasets: Rain100L [15], Rain100H [15], Rain1400 [13], and Rain12 [9]. Rain100L has only one type of rain streaks and consists of 200 image pairs for training and 100 for evaluation. With five types of rain streak directions, Rain100H is more challenging and contains 1800 image pairs for training and 100 for testing. Rain1400 contains 14000 rainy images synthesized from 1000 clean images with 14 kinds of rain streaks, of which 12600 rainy images are used as training samples and 1400 as testing samples. Like [17], the model trained on Rain100L is utilized to evaluate Rain12, which only includes 12 image pairs. For SIRR, we adopt the 147 real rainy images from [50] as unsupervised training samples.
Performance Metrics. Since the ground truths of the synthetic datasets are available, we provide quantitative comparisons based on two commonly used metrics: peak signal-to-noise ratio (PSNR) [56] and SSIM. As the human visual system is sensitive to the Y channel of a color image in YCbCr space, we compute PSNR/SSIM based on the luminance channel, using the Matlab computation code released by [15].
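A sketch of this luminance-channel protocol (our own re-implementation for illustration; the coefficients follow the BT.601 convention used by Matlab's rgb2ycbcr, which we assume matches the released evaluation code):

```python
import numpy as np

def rgb_to_y(img):
    """Luminance channel of an 8-bit RGB image (H x W x 3, values in [0, 255])."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0  # BT.601 luma transform

def psnr_y(pred, gt):
    """PSNR computed on the Y channel only, as in the comparisons above."""
    mse = np.mean((rgb_to_y(pred.astype(np.float64)) -
                   rgb_to_y(gt.astype(np.float64))) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```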
Evaluation on Rain Removal. From the derained results shown in Fig. 6, it is easy to see that the traditional model-based DSC, GMM, and JCAS methods leave distinct rain streaks, and the DL-based Clear, DDN, and SIRR also leave evident rain marks. Besides, apart from RESCAN, PReNet, and the proposed SRNet, the other competing methods obviously blur image textures. From the comparison of extracted rain layers, the result of SRNet contains less background texture. Fig. 7 further compares the deraining results on another test image with diverse rain patterns. As displayed, among all comparison methods, SRNet most evidently removes rain streaks and restores the background image.





Evaluation on Detail Preservation. We select two more difficult samples from Rain100H and Rain1400, respectively, to evaluate the deraining ability of different methods. As depicted in Fig. 8 and Fig. 9, complicated rain patterns adversely degrade the rain removal performance of most methods. However, the proposed SRNet still has advantages over rain removal and detail preservation, and achieves best PSNR and SSIM scores.
Table I reports the quantitative results of all competing methods on the synthetic datasets with diverse and complicated rain types. It is clear that, owing to the strong fitting ability of CNNs, the DL-based methods generally outperform the conventional prior-model-based ones. Comparatively, the proposed SRNet achieves the best average PSNR and SSIM values on Rain1400 with its 14 kinds of rain types, reflecting the better robustness and generality of SRNet.
Table I: Average PSNR/SSIM comparison on the synthetic datasets.
Datasets | Rain100L | Rain100H | Rain1400 | Rain12
Metrics | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM
Input | 26.90 | 0.8384 | 13.56 | 0.3709 | 25.24 | 0.8097 | 30.14 | 0.8555
DSC [10] | 27.34 | 0.8494 | 13.77 | 0.3199 | 27.88 | 0.8394 | 30.07 | 0.8664
GMM [9] | 29.05 | 0.8717 | 15.23 | 0.4498 | 27.78 | 0.8585 | 32.14 | 0.9145
JCAS [11] | 28.54 | 0.8524 | 14.62 | 0.4510 | 26.20 | 0.8471 | 33.10 | 0.9305
Clear [12] | 30.24 | 0.9344 | 15.33 | 0.7421 | 26.21 | 0.8951 | 31.24 | 0.9353
DDN [13] | 32.38 | 0.9258 | 22.85 | 0.7250 | 28.45 | 0.8888 | 34.04 | 0.9330
RESCAN [16] | 38.52 | 0.9812 | 29.62 | 0.8720 | 32.03 | 0.9314 | 36.43 | 0.9519
PReNet [17] | 37.45 | 0.9790 | 30.11 | 0.9053 | 32.55 | 0.9459 | 36.66 | 0.9610
SPANet [18] | 35.33 | 0.9694 | 25.11 | 0.8332 | 29.85 | 0.9148 | 35.85 | 0.9572
SIRR [50] | 32.37 | 0.9258 | 22.47 | 0.7164 | 28.44 | 0.8893 | 34.02 | 0.9347
JORDER_E [15] | 38.56 | 0.9827 | 30.50 | 0.8967 | 32.00 | 0.9347 | 36.69 | 0.9618
SRNet | 37.37 | 0.9780 | 30.05 | 0.9060 | 32.88 | 0.9487 | 36.71 | 0.9626
Table II: PSNR/SSIM comparison on the real SPA-Data benchmark.
Methods | Input | DSC | GMM | JCAS | Clear | DDN
PSNR | 34.15 | 34.95 | 34.30 | 34.95 | 32.66 | 34.70
SSIM | 0.9269 | 0.9416 | 0.9428 | 0.9453 | 0.9420 | 0.9343
Methods | RESCAN | PReNet | SPANet | SIRR | JORDER_E | SRNet
PSNR | 34.70 | 35.08 | 35.13 | 34.85 | 34.34 | 35.31
SSIM | 0.9376 | 0.9424 | 0.9443 | 0.9357 | 0.9382 | 0.9448


IV-B Experiments on Real Data
Real Datasets and Performance Metrics. For real applications, evaluating the generalization ability of a single image derainer is the key point. In this section, we evaluate the deraining performance of all competing methods on real rainy images, based on two real datasets from [18] and [50]: the former, denoted SPA-Data, contains 1000 image pairs for evaluation, and the other, called Internet-Data, consists of 147 rainy images without ground truths. For the experiments on SPA-Data, we compute PSNR and SSIM based on the Y channel, while for Internet-Data we can only provide visual comparisons. Specifically, for each DL-based comparison method, we obtain three pre-trained models based on the three synthetic datasets Rain100L, Rain100H, and Rain1400, respectively. For SPA-Data, we select for each method the model with the best PSNR/SSIM on SPA-Data to predict the generalization results. For Internet-Data, like previous works, the reported rain-removed result is the one with the best visual quality among the predictions of the three models.
Evaluation on SPA-Data. Fig. 10 shows the derained results of all comparison methods on a real rainy image from SPA-Data. As seen, when dealing with images with complex rain shapes, the traditional model-based DSC, GMM, and JCAS do not work well, which can be explained by the fact that their specific prior assumptions cannot always accurately represent the complicated rain distribution. The DL-based competing methods also leave obvious rain streaks and even blur image details, such as PReNet. Nevertheless, compared with the other methods, the proposed SRNet achieves better generalization performance.
To objectively evaluate the generalization ability of these methods, we provide the quantitative rain removal comparison on SPA-Data in Table II. As reported, the most competitive baseline methods on the synthetic datasets, such as RESCAN and JORDER_E, do not perform as well as the traditional model-based JCAS on this real dataset, possibly due to the overfitting-to-the-training-samples issue. It is worth mentioning that in this case JCAS obtains a good SSIM, as the method utilizes the complementarity of analysis and synthesis sparse representations to finely describe texture structures. Although SPANet has a relatively inferior deraining effect on the synthetic datasets, its generalization performance is competitive. Observing the comparison results presented in Table I and Table II, it is easy to see that, among all comparison methods, the proposed SRNet is a competitive single image derainer, as it achieves satisfactory rain removal performance on both synthetic and real datasets. This confirms the effectiveness of the structural residual design of our network architecture and the fine generalization ability of SRNet.
Evaluation on Internet-Data. Furthermore, we select another two hard samples from Internet-Data with complicated rain distributions. Fig. 11 and Fig. 12 demonstrate the corresponding generalization comparison results. Similar to the experimental analysis on SPA-Data, the traditional model-based methods leave severe rains, and the other DL-based methods still leave obvious rain marks and blur image details. Comparatively, the proposed SRNet shows evidently superior performance in rain removal and detail preservation.
IV-C Rain Layer Extraction Evaluation on SRNet
Here we select several rainy images with different rain shapes from the aforementioned datasets to further evaluate the effectiveness of the proposed SRNet in extracting the rain layer. As shown in Fig. 13, when dealing with complicated rainy images with diverse rain streaks (light/heavy density, short/long shape, different directions), the proposed method can finely extract the rain layer and preserve background details well.
IV-D Ablation Studies
The Effect of Training Datasets. Based on some typical real rainy images, Fig. 14 shows the derained results predicted by SRNet trained on the three synthetic datasets Rain100L, Rain100H, and Rain1400. For each input rainy image, we use a box to indicate the visually best recovery result. It can be observed that the model trained on Rain100L performs well for light/thin rain streaks, the model trained on Rain100H is applicable to heavy/accumulated rain types, and the model trained on Rain1400 is suitable for long and thin ones.
Network Modules Analysis. Here we quantitatively analyze the importance of the different modules of the proposed SRNet, as listed in Table III. We train all the variants on Rain100L and compare their generalization ability on SPA-Data. V1, V2, V3, and V4 denote four variants that only use the small-scale branch (DF=1) in Fig. 2(a). V1 and V2 are basic ResNet-like networks, consisting only of Resblocks; V3 and V4 include symmetrical downsampling and upsampling layers following the corresponding Resblocks. SRNet is the default network, and SRNet-WS is a variant of SRNet that adopts weight sharing among the three parallel branches, which shows that SRNet can notably reduce its network parameters by weight sharing without substantial degradation in generalization performance. It is worth mentioning that even V4, which only adopts the small scale, outperforms most other competing methods in real deraining performance as shown in Table II, which corroborates the rationality of the proposed encoder-decoder network. The pairwise comparisons between variants and the corresponding functioning modules are summarized in Table IV.
Table III: Ablation comparison on SPA-Data (all variants trained on Rain100L).
Methods | V1 | V2 | V3 | V4 | SRNet-WS | SRNet
Multi-Scale | | | | | ✓ | ✓
Resblock w/o pooling | ✓ | ✓ | | | |
Encoder-Decoder Net | | | ✓ | ✓ | ✓ | ✓
Bilinear upsampling | | | ✓ | | |
MaxUnpooling | | | | ✓ | ✓ | ✓
Global residual learning | | ✓ | ✓ | ✓ | ✓ | ✓
Weight sharing | | | | | ✓ |
w/o Weight sharing | | | | | | ✓
SPA-Data PSNR | 35.06 | 35.15 | 35.20 | 35.23 | 35.22 | 35.31
SPA-Data SSIM | 0.9414 | 0.9437 | 0.9438 | 0.9441 | 0.9444 | 0.9448
Table IV: Pairwise comparisons of variants and the corresponding effective modules.
Versus | Winner | Effective Module
V1 vs V2 | V2 | Global residual learning
V2 vs V3 | V3 | Encoder-Decoder Net
V3 vs V4 | V4 | MaxUnpooling
V4 vs SRNet | SRNet | Multi-Scale
IV-E Extension to Real Video Deraining
Finally, we evaluate the deraining ability of the SRNet on real rainy videos without ground truth, by comparing with current state-of-the-art video deraining methods, including the traditional model-based methods of Garg et al. [24], Kim et al. [30], Jiang et al. [57], Ren et al. [31], Wei et al. [32], and Li et al. [21], and the DL-based method of Liu et al. [41].
Fig. 15 and Fig. 16 present the visual comparison results on two real rainy frames: one captured by a surveillance system on a street with complex moving objects, and the other obtained at night. From Fig. 15, it is easy to see that the model-based video deraining methods, including Garg et al.'s, Kim et al.'s, Jiang et al.'s, Wei et al.'s, and Li et al.'s, cause different degrees of artifacts at the location of the moving car, while the DL-based method of Liu et al. leaves obvious rain streaks. Even without utilizing temporal information across frames, the proposed single image deraining method, SRNet, still performs well in removing rain streaks and preserving background textures. From Fig. 16, SRNet has the ability to detect obvious rain streaks. Although some model-based video deraining methods can remove more rain streaks than SRNet, they need many frames and thus have unfavorable real-time performance. It is noteworthy that, compared with the DL-based video deraining method of Liu et al., SRNet also preserves relatively more image details, like trunks.


V Conclusions and Future Work
In this paper, we have taken into account the intrinsic prior structures of rain streaks and designed a specific structural residual network for removing the rain layer from a single image. By visualizing the extracted rain layers and sparse rain feature maps on synthetic and real test images, we have validated the working mechanism underlying the proposed structural residual network. Comprehensive experiments have demonstrated the superiority of the proposed method over current state-of-the-art single image derainers in terms of both robustness and generalization capability.
Although we have fairly evaluated the generalization performance of all competing methods on SPA-Data, this dataset relies on human judgments and cannot cover all the complicated rain patterns in the real world. Besides, collecting a large number of real rainy/clean image pairs is extremely time-consuming and cumbersome. To alleviate these problems, we will make further efforts to extend this network design idea to unsupervised, or at least semi-supervised, scenarios in our future investigations.
References
- [1] M. S. Shehata, J. Cai, W. M. Badawy, T. W. Burr, M. S. Pervez, R. J. Johannesson, and A. Radmanesh, “Video-based automatic incident detection for smart roads: The outdoor environmental challenges regarding false alarms,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 2, pp. 349–360, 2008.
- [2] C. H. Bahnsen and T. B. Moeslund, “Rain removal in traffic surveillance: Does it matter?” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 8, pp. 2802–2819, 2018.
- [3] C. E. Smith, C. Richards, S. Brandt, and N. Papanikolopoulos, “Visual tracking for intelligent vehicle-highway systems,” IEEE Transactions on Vehicular Technology, vol. 45, no. 4, pp. 744–759, 1996.
- [4] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo, and X. Cao, “Single image deraining: A comprehensive benchmark analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3838–3847.
- [5] R. Yasarla and V. M. Patel, “Confidence measure guided single image de-raining,” IEEE Transactions on Image Processing, vol. 29, pp. 4544–4555, 2020.
- [6] J. Xu, W. Zhao, P. Liu, and X. Tang, “Removing rain and snow in a single image using guided filter,” in IEEE International Conference on Computer Science and Automation Engineering, vol. 2, 2012, pp. 304–307.
- [7] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 6, pp. 1397–1409, 2010.
- [8] X. Ding, L. Chen, X. Zheng, H. Yue, and D. Zeng, “Single image rain and snow removal via guided l0 smoothing filter,” Multimedia Tools and Applications, vol. 75, no. 5, pp. 2697–2712, 2016.
- [9] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown, “Rain streak removal using layer priors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2736–2744.
- [10] Y. Luo, Y. Xu, and H. Ji, “Removing rain from a single image via discriminative sparse coding,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3397–3405.
- [11] S. Gu, D. Meng, W. Zuo, and L. Zhang, “Joint convolutional analysis and synthesis sparse representation for single image layer separation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1708–1716.
- [12] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley, “Clearing the skies: A deep network architecture for single-image rain removal,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2944–2956, 2017.
- [13] X. Fu, J. Huang, D. Zeng, H. Yue, X. Ding, and J. Paisley, “Removing rain from single images via a deep detail network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3855–3863.
- [14] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint rain detection and removal from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1357–1366.
- [15] W. Yang, R. T. Tan, J. Feng, J. Liu, S. Yan, and Z. Guo, “Joint rain detection and removal from a single image with contextualized deep networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2019.
- [16] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha, “Recurrent squeeze-and-excitation context aggregation net for single image deraining,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 254–269.
- [17] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng, “Progressive image deraining networks: a better and simpler baseline,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3937–3946.
- [18] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. Lau, “Spatial attentive single-image deraining with a high quality real rain dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 270–12 279.
- [19] Z. He and V. M. Patel, “Convolutional sparse and low-rank coding-based rain streak removal,” in IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 1259–1267.
- [20] L. Zhu, C. W. Fu, D. Lischinski, and P. A. Heng, “Joint bi-layer optimization for single-image rain streak removal,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2526–2534.
- [21] M. Li, Q. Xie, Q. Zhao, W. Wei, S. Gu, J. Tao, and D. Meng, “Video rain streak removal by multiscale convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6644–6653.
- [22] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
- [23] K. Garg and S. K. Nayar, “Detection and removal of rain from videos,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. I–I.
- [24] ——, “Vision and rain,” International Journal of Computer Vision, vol. 75, no. 1, pp. 3–27, 2007.
- [25] X. Zhang, H. Li, Y. Qi, W. K. Leow, and T. K. Ng, “Rain removal in video by combining temporal and chromatic properties,” in IEEE International Conference on Multimedia and Expo, 2006, pp. 461–464.
- [26] W.-J. Park and K.-H. Lee, “Rain removal using kalman filter in video,” in International Conference on Smart Manufacturing Application, 2008, pp. 494–497.
- [27] P. C. Barnum, S. Narasimhan, and T. Kanade, “Analysis of rain and snow in frequency space,” International journal of computer vision, vol. 86, no. 2-3, p. 256, 2010.
- [28] A. Tripathi and S. Mukhopadhyay, “Video post processing: low-latency spatiotemporal approach for detection and removal of rain,” IET image processing, vol. 6, no. 2, pp. 181–196, 2012.
- [29] Y. L. Chen and C. T. Hsu, “A generalized low-rank appearance model for spatio-temporally correlated rain streaks,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1968–1975.
- [30] J.-H. Kim, J.-Y. Sim, and C.-S. Kim, “Video deraining and desnowing using temporal correlation and low-rank matrix completion,” IEEE Transactions on Image Processing, vol. 24, no. 9, pp. 2658–2670, 2015.
- [31] W. Ren, J. Tian, H. Zhi, A. Chan, and Y. Tang, “Video desnowing and deraining based on matrix decomposition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4210–4219.
- [32] W. Wei, L. Yi, Q. Xie, Q. Zhao, D. Meng, and Z. Xu, “Should we encode rain streaks in video as deterministic or stochastic?” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2516–2525.
- [33] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
- [34] K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
- [35] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2015.
- [36] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
- [37] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2790–2798.
- [38] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, “Scale-recurrent network for deep image deblurring,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8174–8182.
- [39] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “Deblurgan: Blind motion deblurring using conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8183–8192.
- [40] C. Jie, C. H. Tan, J. Hou, L. P. Chau, and L. He, “Robust video content alignment and compensation for rain removal in a cnn framework,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6286–6295.
- [41] J. Liu, W. Yang, S. Yang, and Z. Guo, “Erase or fill? deep joint recurrent rain removal and reconstruction in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3233–3242.
- [42] ——, “D3r-net: Dynamic routing residue recurrent network for video rain removal,” IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 699–712, 2018.
- [43] J. H. Kim, C. Lee, J. Y. Sim, and C. S. Kim, “Single-image deraining using an adaptive nonlocal means filter,” in IEEE International Conference on Image Processing, 2014, pp. 914–917.
- [44] L. W. Kang, C. W. Lin, and Y. H. Fu, “Automatic single-image-based rain streaks removal via image decomposition,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1742–1755, 2012.
- [45] Y. Wang, S. Liu, C. Chen, and B. Zeng, “A hierarchical approach for rain or snow removing in a single color image,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3936–3950, 2017.
- [46] Y.-H. Fu, L.-W. Kang, C.-W. Lin, and C.-T. Hsu, “Single-frame-based rain removal via image decomposition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 1453–1456.
- [47] H. Zhang, V. Sindagi, and V. M. Patel, “Image de-raining using a conditional generative adversarial network,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
- [48] H. Zhang and V. M. Patel, “Density-aware single image de-raining using a multi-stream dense network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 695–704.
- [49] P. Mu, J. Chen, R. Liu, X. Fan, and Z. Luo, “Learning bilevel layer priors for single image rain streaks removal,” IEEE Signal Processing Letters, vol. 26, no. 2, pp. 307–311, 2019.
- [50] W. Wei, D. Meng, Q. Zhao, Z. Xu, and Y. Wu, “Semi-supervised transfer learning for image rain removal,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3877–3886.
- [51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [52] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
- [53] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2261–2269.
- [54] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
- [55] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [56] Q. Huynh-Thu and M. Ghanbari, “Scope of validity of psnr in image/video quality assessment,” Electronics Letters, vol. 44, no. 13, pp. 800–801, 2008.
- [57] T.-X. Jiang, T.-Z. Huang, X.-L. Zhao, L.-J. Deng, and Y. Wang, “A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors,” in Proceedings of the ieee conference on computer vision and pattern recognition, 2017, pp. 4057–4066.