
Jaehoon [email protected]
Minjung [email protected]
Jini [email protected]
Sunok [email protected]

Hyundai Motor Company, Seoul, Korea
Artificial Intelligence, Korea Aerospace University, Goyang, Korea

A Prototype Unit for Image De-raining
using Time-Lapse Data

Abstract

We address the challenge of single-image de-raining, a task that involves recovering rain-free background information from a single rain image. While recent advancements have utilized real-world time-lapse data for training, enabling the estimation of consistent backgrounds and realistic rain streaks, these methods often suffer from high computational cost and memory consumption, limiting their applicability in real-world scenarios. In this paper, we introduce a novel solution: the Rain Streak Prototype Unit (RsPU). The RsPU efficiently encodes rain streak-relevant features as real-time prototypes derived from time-lapse data, eliminating the need for excessive memory resources. Our de-raining network combines encoder-decoder networks with the RsPU, allowing us to learn and encapsulate diverse rain streak-relevant features as concise prototypes, employing an attention-based approach. To ensure the effectiveness of our approach, we propose a feature prototype loss encompassing cohesion and divergence components. This loss function captures both the compactness and diversity aspects of the prototypical rain streak features within the RsPU. We evaluate our method on various de-raining benchmarks, accompanied by comprehensive ablation studies, and show that it achieves competitive results on diverse rain images compared to state-of-the-art methods.

1 Introduction

Image de-raining is an important task in computer vision, as rain streaks hinder visibility and degrade the robustness of most outdoor vision systems. As an essential pre-processing step, it has been widely applied in many tasks such as object detection [Li et al.(2019)Li, Araujo, Ren, Wang, Tokuda, Junior, Cesar-Junior, Zhang, Guo, and Cao, Fu et al.(2019)Fu, Liang, Huang, Ding, and Paisley], semantic segmentation [Cho et al.(2020)Cho, Kim, Min, and Sohn, Jiang et al.(2020)Jiang, Wang, Yi, Chen, Huang, Luo, Ma, and Jiang], autonomous driving [Huang et al.(2021)Huang, Yu, and He, Guo et al.(2021)Guo, Sun, Juefei-Xu, Ma, Xie, Feng, Liu, and Zhao], and surveillance systems [Li et al.(2021a)Li, Cao, Zhao, Zhang, and Meng, Li et al.(2021b)Li, Ren, Wang, Araujo, Tokuda, Junior, Cesar-Jr, Wang, and Cao].

Many existing approaches to single-image-based de-raining employ deep Convolutional Neural Networks (CNNs) [Fu et al.(2017a)Fu, Huang, Ding, Liao, and Paisley, Fu et al.(2017b)Fu, Huang, Zeng, Huang, Ding, and Paisley, Yang et al.(2017)Yang, Tan, Feng, Liu, Guo, and Yan, Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu, Li et al.(2018b)Li, Wu, Lin, Liu, and Zha, Yang et al.(2019a)Yang, Liu, Yang, and Guo, Wang et al.(2020)Wang, Xie, Zhao, and Meng]. Approaches utilizing transformers [Chen et al.(2023)Chen, Li, Li, and Pan, Valanarasu et al.(2021)Valanarasu, Yasarla, and Patel, Zamir et al.(2022)Zamir, Arora, Khan, Hayat, Khan, and Yang, Tu et al.(2022)Tu, Talebi, Zhang, Yang, Milanfar, Bovik, and Li] have also been employed for single-image de-raining and have shown further improvements. These methods are trained in a supervised manner using synthetic datasets containing both rain and rain-free images. However, the dissimilarities between synthetic and real rain images are apparent. As shown in Fig. 1, real rain streaks exhibit complex and diverse patterns in terms of size, scale, direction, and density. Consequently, methods trained on synthetic data struggle to handle real rain images effectively [Yang et al.(2020)Yang, Tan, Wang, Fang, and Liu, Huang et al.(2021)Huang, Yu, and He, Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau]. Semi-supervised methods [Wei et al.(2019)Wei, Meng, Zhao, Xu, and Wu, Yasarla et al.(2020)Yasarla, Sindagi, and Patel, Huang et al.(2021)Huang, Yu, and He] use synthetic and unpaired real rain data for training, yet they remain mainly reliant on synthetic data, leading to performance degradation when applied to real rain images.

Figure 1: De-raining results of a state-of-the-art method and the proposed method on a real rainy image. Our method produces better de-raining results than state-of-the-art methods on the real rainy image.

To address the requirement for rain-clean image pairs, SPANet [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau] and TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] introduced real rain datasets characterized by consistent backgrounds and varying rain streaks. TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] focused on estimating unchanging backgrounds while accommodating temporally changing rain streaks. Recently, MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn] leveraged an external memory network to capture rain streak features across real rain datasets, and showed remarkable results on real rainy images. In practice, single-image de-raining often needs to be executed on devices with limited computing power and memory. However, the heavy parameters of [Cho et al.(2020)Cho, Kim, Min, and Sohn] and the external memory network of [Cho et al.(2022)Cho, Kim, and Sohn] hinder practical applicability.

Building on leveraging real rain datasets, this paper proposes a novel learning framework centered around a memory-efficient representation of rain streak features, named the Rain-streak Prototype Unit (RsPU). By encoding prototypical rain streak features using an attention mechanism, the RsPU effectively captures the essence of rain streaks. The process involves applying an attention operator on the encoder’s encoded features, assigning rain streak weights to pixels, and generating a rain streak attention map that shapes the prototypical rain streak features. These prototypes are formed by aggregating local encoding vectors guided by rain streak weights. Multiple attention operators are employed to generate diverse prototype candidates. We propose a feature prototype loss that combines cohesion and divergence components to enhance the distinctiveness of the prototypical rain streak features. The cohesion loss minimizes intra-class variability to facilitate the clustering of similar rain streaks, while the divergence loss enforces diversity among prototypes.

We summarize our contributions as follows: i) the introduction of the Rain-streak Prototype Unit (RsPU) for encoding diverse rain-streak features as prototypes without additional memory consumption, ii) the proposal of a feature prototype loss to enhance the discriminative capabilities of prototypical rain-streak features, and iii) the demonstration of state-of-the-art performance across various de-raining benchmarks, along with the adaptability of our method in real-world scenarios.

2 Related Work

2.1 Single Image De-raining

Since obtaining paired real datasets is challenging, most methods have used paired synthetic datasets. Fu et al. [Fu et al.(2017a)Fu, Huang, Ding, Liao, and Paisley] first proposed a deep learning-based method with a multi-layer CNN to extract and remove rain streaks, and later presented the deep detail network (DDN) [Fu et al.(2017b)Fu, Huang, Zeng, Huang, Ding, and Paisley], which learns a mapping function from a rainy image to the clean image. JORDER [Yang et al.(2017)Yang, Tan, Feng, Liu, Guo, and Yan] attempted to jointly handle image de-raining and rain region detection, and was extended with contextual dilated networks (JORDER-E [Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu]). Many attempts have been made in terms of network architecture design, including residual blocks [He et al.(2016)He, Zhang, Ren, and Sun], squeeze-and-excitation [Li et al.(2018b)Li, Wu, Lin, Liu, and Zha], and recurrent networks [Ren et al.(2019)Ren, Zuo, Hu, Zhu, and Meng]. Several methods were proposed to improve computational efficiency by designing lightweight networks in a cascade manner [Fan et al.(2018)Fan, Wu, Fu, Hunag, and Ding] and in a Laplacian pyramid framework [Fu et al.(2019)Fu, Liang, Huang, Ding, and Paisley]. In addition, some useful priors, such as multi-scale priors [Yasarla and Patel(2019), Jiang et al.(2020)Jiang, Wang, Yi, Chen, Huang, Luo, Ma, and Jiang], a bi-level layer prior [Mu et al.(2018)Mu, Chen, Liu, Fan, and Luo], wavelet transforms [Yang et al.(2019a)Yang, Liu, Yang, and Guo], and a dictionary learning mechanism [Wang et al.(2020)Wang, Xie, Zhao, and Meng], were also embedded into deep learning-based methods for representing rain streaks. Moreover, semi-supervised learning approaches [Wei et al.(2019)Wei, Meng, Zhao, Xu, and Wu, Yasarla et al.(2020)Yasarla, Sindagi, and Patel, Huang et al.(2021)Huang, Yu, and He] were proposed to leverage synthetic and unpaired real rain data more effectively. Recent Transformer-based methods [Valanarasu et al.(2021)Valanarasu, Yasarla, and Patel, Zamir et al.(2022)Zamir, Arora, Khan, Hayat, Khan, and Yang, Tu et al.(2022)Tu, Talebi, Zhang, Yang, Milanfar, Bovik, and Li] have also emerged in the domain of image de-raining. Nonetheless, the reliance on synthetic data for training still prevails in the aforementioned methods.

Taking a different approach, TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] introduced a de-raining network relying solely on a real rain dataset, where both the camera and scene remain static except for time-varying rain streaks. MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn] extended this concept [Cho et al.(2020)Cho, Kim, Min, and Sohn] by incorporating an external memory network, albeit at the cost of additional memory usage. Regrettably, both methods encounter limitations in terms of practical real-world applications due to their demanding computational requirements. In contrast, our proposed method introduces a novel attention operator, making it more amenable to real-world scenarios.

2.2 Attention mechanism

Attention mechanisms, designed to prioritize relevant regions of an image while filtering out irrelevant ones, have attracted significant attention in many research fields. In the context of a vision system, an attention mechanism functions akin to a dynamic selection process, updating feature weights adaptively based on the significance of the input. Leveraging attention mechanisms, prototype networks are devised to generate embeddings that cluster multiple points around a central prototype representation for each class. This approach involves learning class-specific prototypes through feature space averaging. This concept has found wide application in computer vision tasks like few-shot classification [Snell et al.(2017)Snell, Swersky, and Zemel] and semantic segmentation [Liu et al.(2020)Liu, Zhang, Zhang, and He].

In recent developments, MOSS [Huang et al.(2021)Huang, Yu, and He] and MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn] introduced the utilization of memory networks to store prototypical rain streak-related features, albeit at the expense of substantial memory consumption. In contrast, our contribution involves the adoption of a prototype network with integrated attention mechanisms, eliminating the need for additional memory usage.

Figure 2: The overall framework of our method.

3 Proposed Method

We show an overview of our framework in Fig. 2. For training, we use a set of time-lapse data $\mathbf{X}=\{X_{n}\}_{n=1,...,N}$ with $X_{n}=\{X_{t,n}\}_{t=1,...,T}$, where $N$ is the number of time-lapse sequences and $T$ is the length of each sequence; $n$ and $t$ denote the scene index and the time, respectively. The objective is to infer a de-rained image $Y_{t,n}$ for each $X_{t,n}$ through the proposed method. Note that, for simplicity, the framework repeats the same process for each $t$ and $n$ in the RsPU, so the subscripts $t$ and $n$ are omitted.
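To make the data layout concrete, the following is a minimal PyTorch sketch of how such time-lapse training pairs could be organized. The class name, the in-memory tensor layout, and the random sampling of two frames per scene are illustrative assumptions for the sketch, not the authors' exact pipeline.

```python
import torch
from torch.utils.data import Dataset

class TimeLapseDataset(Dataset):
    """Sketch of the time-lapse data layout X = {X_{t,n}}.

    `scenes` is assumed to be a list of N tensors, each of shape (T, C, H, W),
    holding the frames of one static scene with time-varying rain.
    """
    def __init__(self, scenes):
        self.scenes = scenes

    def __len__(self):
        return len(self.scenes)

    def __getitem__(self, n):
        frames = self.scenes[n]
        # Sample two different times w, v of the same scene; the consistency
        # losses in Sec. 3.2 are defined over such pairs.
        w, v = torch.randperm(frames.shape[0])[:2].tolist()
        return frames[w], frames[v]
```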

3.1 Rain-streak Prototype Unit (RsPU)

The Rain-streak Prototype Unit (RsPU) operates by learning and capturing rain-streak relevant features in the form of multiple prototypes. This concept bears resemblance to memory-guided methods [Cho et al.(2022)Cho, Kim, and Sohn, Huang et al.(2021)Huang, Yu, and He]. While these methods calculate the similarity between queries and memory items held in external networks that store rain streak-related features, they incur additional memory consumption. In contrast, our RsPU employs self-attention mechanisms to generate prototypical rain streak features internally, eliminating the need for extra memory usage.

We denote by $\mathbf{x}$ the feature of a rain image $X$ extracted by the de-raining network. The de-raining network takes the rainy image $X$ as input and produces the extracted features $\mathbf{x}$ of size $H\times W\times C$, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively. We denote by $\mathbf{x}^{k}\in\mathbb{R}^{C}$ $(k=1,...,K)$, where $K=H\times W$, the individual encoding vectors of size $1\times 1\times C$ in the extracted features $\mathbf{x}$. A set of attention mapping functions $\{\mathcal{A}_{m}:\mathbb{R}^{C}\rightarrow\mathbb{R}^{1}\}^{M}_{m=1}$ (implemented as fully connected layers to generate multiple attention maps and form the prototype candidates) is employed to assign rain streak weights to the encoding vectors, $w^{k,m}\in\mathcal{W}^{m}=\mathcal{A}_{m}(\mathbf{x})$. At each pixel location, the rain streak weight measures the probability of finding rain streaks in the corresponding encoding vector. We denote by $\mathcal{W}^{m}\in\mathbb{R}^{H\times W\times 1}$ the $m$-th rain streak map generated by the $m$-th attention function. Then one unique prototype $\mathbf{p}^{m}$ is derived as an aggregation of the $K$ encoding vectors with normalized rain streak weights as:

$\mathbf{p}^{m}=\sum_{k=1}^{K}\frac{w^{k,m}}{\sum_{k^{\prime}=1}^{K}w^{k^{\prime},m}}\,\mathbf{x}^{k}.$ (1)

Similarly, $M$ prototypes are derived from the multiple attention functions to form the candidate set of prototypes, $\mathcal{P}=\{\mathbf{p}^{m}\}^{M}_{m=1}$.
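As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of prototype generation. Realizing the $M$ attention mapping functions as a single $1\times 1$ convolution and using a sigmoid to keep the weights non-negative are assumptions made for the sketch, not details specified above.

```python
import torch
import torch.nn as nn

class PrototypeGeneration(nn.Module):
    """Sketch of Eq. (1): form M prototypes from an encoded feature map."""
    def __init__(self, channels: int, num_prototypes: int):
        super().__init__()
        # M attention mapping functions A_m, realized here as one 1x1 conv
        # with M output channels (applied pixel-wise, like fully connected layers).
        self.attn = nn.Conv2d(channels, num_prototypes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features from the de-raining encoder.
        B, C, H, W = x.shape
        w = torch.sigmoid(self.attn(x)).flatten(2)    # (B, M, K): non-negative rain streak weights (assumption)
        w = w / (w.sum(dim=-1, keepdim=True) + 1e-6)  # normalize over the K = H*W locations, as in Eq. (1)
        feats = x.flatten(2).transpose(1, 2)          # (B, K, C): encoding vectors x^k
        return torch.bmm(w, feats)                    # (B, M, C): prototypes p^m
```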

Input encoding vectors $\mathbf{x}^{k}$ $(k=1,...,K)$ from the encoding map of the de-raining network are used as queries to retrieve rain streak-relevant features from the candidate set of prototypes, producing a rain streak encoding $\hat{\mathbf{x}}\in\mathbb{R}^{H\times W\times C}$. Each rain streak encoding vector is obtained by

$\hat{\mathbf{x}}^{k}=\sum_{m=1}^{M}\alpha^{k,m}\,\mathbf{p}^{m},$ (2)

where $\alpha^{k,m}$ denotes the relevance score between the $k$-th encoding vector $\mathbf{x}^{k}$ and the $m$-th prototype $\mathbf{p}^{m}$.

Using a channel-wise summation, we aggregate the retrieved rain streak encoding and the original encoding $\mathbf{x}$ as the final output. The output encoding of the RsPU is fed to the decoder to estimate the rain streak $\hat{R}$. Note that we aim to enrich the encoded features so that various real rain streaks are estimated more accurately, while the consistent background information in the time-lapse data is suppressed.
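The retrieval step of Eq. (2) and the final aggregation can be sketched as follows. Computing the relevance scores $\alpha^{k,m}$ as a softmax over query-prototype dot products is a common choice assumed here, since the exact similarity function is not spelled out above.

```python
import torch

def rspu_read(x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (2) and the final channel-wise aggregation.

    x:          (B, C, H, W) encoded features (the queries).
    prototypes: (B, M, C) prototype candidates from the generation step.
    """
    B, C, H, W = x.shape
    queries = x.flatten(2).transpose(1, 2)                               # (B, K, C)
    # Relevance scores alpha^{k,m}: softmax over dot products (assumed form).
    alpha = torch.softmax(queries @ prototypes.transpose(1, 2), dim=-1)  # (B, K, M)
    x_hat = (alpha @ prototypes).transpose(1, 2).reshape(B, C, H, W)     # Eq. (2), reshaped back
    return x + x_hat  # aggregate the retrieved rain streak encoding with the original encoding
```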

3.2 Loss Functions

For learning our framework, as in previous works [Cho et al.(2020)Cho, Kim, Min, and Sohn, Cho et al.(2022)Cho, Kim, and Sohn], we utilize several loss functions including background consistency, cross consistency, and self-consistency loss. Moreover, we propose a novel feature prototype loss, enabling prototype learning for enhanced prototypical rain streak features.

Feature Prototype Loss

The feature prototype loss is designed to reduce intra-rain-streak variations while simultaneously enlarging inter-rain-streak differences. In contrast to conventional prototype losses that optimize over positive and negative pairs, the proposed loss computes gradients and backpropagates based on the overall distances among the prototypes within the RsPU. The feature prototype loss consists of two terms, a cohesion loss $\mathcal{L}_{coh}$ and a divergence loss $\mathcal{L}_{div}$, formulated as:

$\mathcal{L}_{fea}=\mathcal{L}_{coh}+\lambda_{a}\mathcal{L}_{div},$ (3)

where $\lambda_{a}$ is a weighting factor. The cohesion loss encourages the rain streak-relevant features to be gathered into compact prototypes. It penalizes the mean $L_{2}$ distance between the extracted features $\mathbf{x}$ from the de-raining network and their most relevant prototypes as:

$\mathcal{L}_{coh}=\frac{1}{K}\sum_{k=1}^{K}\left\|\mathbf{x}^{k}-\mathbf{p}^{*}\right\|_{2},$ (4)
$\mathrm{s.t.}\;\;*=\underset{m\in[1,M]}{\mathrm{argmax}}\;\alpha^{k,m},$ (5)

where $\alpha^{k,m}$ is the relevance score. (The $\mathrm{argmax}$ is not involved in back-propagation and is only used to obtain the index of the most relevant prototype.)

To promote the diversity among prototype items by pushing the learned prototypes away from each other, we propose the divergence term $\mathcal{L}_{div}$, defined with a margin $\delta$ as:

$\mathcal{L}_{div}=\frac{1}{M(M-1)}\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\left[-\left\|\mathbf{p}^{m}-\mathbf{p}^{m^{\prime}}\right\|_{2}+\delta\right]_{+}.$ (6)

With this feature prototype loss, the network captures the differentiated rain streak features in the RsPU. This loss is well-tailored to the RsPU in that the prototypes are encouraged to encode compact and diverse rain streak features.
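A minimal PyTorch sketch of Eqs. (3)-(6) is given below; the tensor shapes and the exclusion of the $m=m^{\prime}$ terms (whose distance is trivially zero) are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def feature_prototype_loss(x, prototypes, alpha, delta=1.0, lambda_a=0.1):
    """Sketch of the feature prototype loss, Eqs. (3)-(6).

    x:          (B, K, C) encoding vectors.
    prototypes: (B, M, C) prototype candidates.
    alpha:      (B, K, M) relevance scores.
    """
    B, K, C = x.shape
    M = prototypes.shape[1]

    # Cohesion, Eqs. (4)-(5): L2 distance to the most relevant prototype;
    # the argmax indices carry no gradient, matching the note above.
    idx = alpha.argmax(dim=-1)                                                  # (B, K)
    nearest = torch.gather(prototypes, 1, idx.unsqueeze(-1).expand(-1, -1, C))  # (B, K, C)
    loss_coh = (x - nearest).norm(dim=-1).mean()

    # Divergence, Eq. (6): hinge on pairwise prototype distances with margin delta;
    # the m = m' terms are excluded here since their distance is zero.
    dist = torch.cdist(prototypes, prototypes)                                  # (B, M, M)
    off_diag = ~torch.eye(M, dtype=torch.bool, device=x.device)
    loss_div = F.relu(delta - dist)[:, off_diag].sum(dim=-1).mean() / (M * (M - 1))

    return loss_coh + lambda_a * loss_div
```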

Background Consistency Loss

This loss encourages the generation of consistent background images between the input images [Cho et al.(2020)Cho, Kim, Min, and Sohn, Cho et al.(2022)Cho, Kim, and Sohn]. We use the following equation to compute the loss.

$\mathcal{L}_{b}=\sum_{n\in N}\sum_{\{w,v\}\in T}\sum_{i}\left\|\hat{Y}_{w,n}(i)-\hat{Y}_{v,n}(i)\right\|_{1},$ (7)

where $\hat{Y}_{w,n}$ is a background image estimated from $X_{w,n}$ and $\left\|\cdot\right\|_{1}$ denotes the $L_{1}$ distance. $\{w,v\}$ represents two different times in $T$. $\hat{Y}_{w,n}(i)$ and $\hat{Y}_{v,n}(i)$ are the values at pixel $i$ of the images $\hat{Y}_{w,n}$ and $\hat{Y}_{v,n}$, respectively.

Cross Consistency Loss

This loss helps the network approximate the overall structure of the scene layout [Lettry et al.(2018b)Lettry, Vanhoey, and Van Gool, Lettry et al.(2018a)Lettry, Vanhoey, and Van Gool] and is obtained by

$\mathcal{L}_{c}=\sum_{n\in N}\sum_{\{w,v\}\in T}\sum_{i}\left\|X_{w,n}(i)-\hat{Y}_{v,n}(i)\right\|_{1}.$ (8)

This loss helps the network achieve good results in the early stage of training.

Self Consistency Loss

This loss enforces that the sum of the estimated $\hat{Y}$ and $\hat{R}$ reconstructs the input image $X$ [Cho et al.(2020)Cho, Kim, Min, and Sohn, Cho et al.(2022)Cho, Kim, and Sohn] and is obtained by

$\mathcal{L}_{s}=\sum_{n\in N}\sum_{w\in T}\sum_{i}\left\|X_{w,n}(i)-\left(\hat{Y}_{w,n}(i)+\hat{R}_{w,n}(i)\right)\right\|_{1}.$ (9)

This loss acts as a regularizer.
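For clarity, the three consistency losses of Eqs. (7)-(9) can be sketched for a single scene and a single time pair $\{w,v\}$ as follows; summation over scenes and pairs is assumed to be handled outside this function.

```python
import torch

def consistency_losses(x_w, x_v, y_hat_w, y_hat_v, r_hat_w, r_hat_v):
    """Sketch of Eqs. (7)-(9) for one scene and one time pair {w, v}.

    x_*     : rain inputs X_{w,n}, X_{v,n}, each of shape (C, H, W)
    y_hat_* : estimated backgrounds
    r_hat_* : estimated rain streaks
    """
    # Background consistency, Eq. (7): the two estimated backgrounds should match.
    l_b = (y_hat_w - y_hat_v).abs().sum()
    # Cross consistency, Eq. (8): a rain input vs. the background estimated at the other time.
    l_c = (x_w - y_hat_v).abs().sum()
    # Self consistency, Eq. (9): background + rain streaks should reconstruct the input.
    l_s = (x_w - (y_hat_w + r_hat_w)).abs().sum() + (x_v - (y_hat_v + r_hat_v)).abs().sum()
    return l_b, l_c, l_s
```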

Total Loss

During the entire training process of the proposed network, the total loss is formulated as follows:

$\mathcal{L}_{tot}=\mathcal{L}_{b}+\lambda_{c}\mathcal{L}_{c}+\lambda_{s}\mathcal{L}_{s}+\lambda_{f}\mathcal{L}_{fea},$ (10)

where $\lambda_{c}$, $\lambda_{s}$, and $\lambda_{f}$ are hyper-parameters determined by empirical experiments.

3.3 Network Architectures

Our de-raining network framework closely adheres to the widely utilized encoder-decoder architecture, a prevailing choice in the realm of single image de-raining [Cho et al.(2020)Cho, Kim, Min, and Sohn, Yasarla et al.(2020)Yasarla, Sindagi, and Patel, Zhang and Patel(2018), Cho et al.(2022)Cho, Kim, and Sohn]. In this paradigm, all convolutional layers employ a $3\times 3$ kernel size. The encoder employs max-pooling layers with a $2\times 2$ kernel size and stride, effectively reducing feature dimensions by a factor of 2. The input to the encoder is a rain image, denoted as $X$.

The decoder takes as input the retrieved prototypical rain streak features from the RsPU and the encoded features $\mathbf{x}$, and produces the initial rain streak information. In the decoder, each layer comprises $3\times 3$ deconvolution and convolution layers followed by ReLU, and is connected to the encoder using skip connections. The deconvolution layer, implemented with transposed convolutional layers, has an upscaling factor of 2. As a result, the de-rained image is obtained by subtracting the estimated rain streaks from the input rain image. The details of the network architectures are described in the supplementary material.
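A compact sketch of such an encoder-decoder with the RsPU at the bottleneck is shown below. The number of levels, the channel widths, and the exact insertion point of the RsPU are placeholders, not the paper's configuration (which is detailed in the supplementary material).

```python
import torch
import torch.nn as nn

class DerainEncoderDecoder(nn.Module):
    """Sketch of the encoder-decoder with the RsPU at the bottleneck."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.MaxPool2d(2),                    # 2x2 max-pooling, stride 2
                                  nn.Conv2d(ch, ch * 2, 3, padding=1), nn.ReLU(inplace=True))
        self.rspu = nn.Identity()                                     # placeholder for the RsPU of Sec. 3.1
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)         # deconvolution, upscaling factor 2
        self.dec = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        f = self.rspu(e2)                                     # prototype-enhanced features
        rain = self.dec(torch.cat([self.up(f), e1], dim=1))   # skip connection to the encoder
        return x - rain   # de-rained image = input minus estimated rain streaks
```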

4 Experiment

We compare the proposed method against several state-of-the-art image de-raining methods, including JORDER-E [Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu], DDN [Fu et al.(2017b)Fu, Huang, Zeng, Huang, Ding, and Paisley], PReNet [Ren et al.(2019)Ren, Zuo, Hu, Zhu, and Meng], SPANet [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau], RCDNet [Wang et al.(2020)Wang, Xie, Zhao, and Meng], MPRNet [Zamir et al.(2021)Zamir, Arora, Khan, Hayat, Khan, Yang, and Shao], SIRR [Wei et al.(2019)Wei, Meng, Zhao, Xu, and Wu], NLEDN [Li et al.(2018a)Li, He, Zhang, Chang, Dong, and Lin], and MOSS [Huang et al.(2021)Huang, Yu, and He]. Additionally, we compare the proposed method with state-of-the-art Vision Transformer-based methods such as Restormer [Zamir et al.(2022)Zamir, Arora, Khan, Hayat, Khan, and Yang], MAXIM [Tu et al.(2022)Tu, Talebi, Zhang, Yang, Milanfar, Bovik, and Li], and DRSformer [Chen et al.(2023)Chen, Li, Li, and Pan], and with time-lapse data-based methods, including TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] and MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn]. We used publicly available codes and pre-trained models provided by the authors to produce the de-raining results.

Datasets

We used the time-lapse benchmark provided by TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] to train the proposed method. It provides rain image pairs, each comprising 2 images sampled from 30 images, from the 186 total scenes. For a fair comparison, we used the identical experimental setup as the previous works [Cho et al.(2020)Cho, Kim, Min, and Sohn, Cho et al.(2022)Cho, Kim, and Sohn]. Since our framework exploits the consistent background information in time-lapse data, we use only time-lapse data during training, without ground truth.

For testing, we conducted experiments on real and synthetic datasets. For the real dataset, we used RealDataset, which provides real rain images with realistic ground-truth background images generated by a semi-automatic algorithm [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau], thus enabling quantitative evaluation. We also obtained real-world rain images from the Internet and previous studies [Zhang and Patel(2018), Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau, Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu]. For the synthetic datasets, following [Cho et al.(2022)Cho, Kim, and Sohn, Cho et al.(2020)Cho, Kim, Min, and Sohn], we used Rain14000 [Fu et al.(2017b)Fu, Huang, Zeng, Huang, Ding, and Paisley], Rain12000 [Zhang and Patel(2018)], and Rain100 [Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu].

Metrics

Following previous de-raining works [Yang et al.(2020)Yang, Tan, Wang, Fang, and Liu], we adopted two metrics for quantitative evaluation on the paired synthetic data: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).
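For reference, these two metrics can be computed as in the following sketch; the data range of 255 and the use of scikit-image's structural_similarity with channel_axis (available in recent versions) are assumptions about the evaluation setup.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR between two images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """SSIM via scikit-image; channel_axis=-1 assumes H x W x 3 color inputs."""
    return structural_similarity(pred, gt, data_range=max_val, channel_axis=-1)
```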

Implementation Details

The proposed networks were trained using the PyTorch library on an Nvidia RTX TITAN GPU. As described in TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] and MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn], we did not apply any additional data augmentation, such as horizontal and vertical flips, since large-scale real rain data are provided. The input images were resized to a resolution of 256 $\times$ 256 and normalized to the range [-1, 1]. During training, the learning rate was fixed to $10^{-4}$ and the batch size was set to 16. We set the height $H$ and width $W$ of the extracted feature map $\mathbf{x}$, the number of feature channels $C$, and the number of rain streak map candidates $M$ to 256, 256, 128, and 20, respectively. For the feature prototype loss, $\lambda_{a}$ and $\delta$ are set to 0.1 and 1, respectively. For the total loss hyper-parameters, we empirically set $\lambda_{c}=0.1$, $\lambda_{s}=0.001$, and $\lambda_{f}=0.1$.
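Putting the reported hyper-parameters together, a minimal training configuration could look like the sketch below; the optimizer choice (Adam) and the reuse of the model sketched in Sec. 3.3 are assumptions, since the paper does not state them.

```python
import torch

# Hyper-parameters reported above.
H = W = 256          # spatial size of the extracted feature map x
C, M = 128, 20       # feature channels and number of prototype candidates
lr, batch_size = 1e-4, 16
lambda_a, delta = 0.1, 1.0
lambda_c, lambda_s, lambda_f = 0.1, 0.001, 0.1

# `DerainEncoderDecoder` refers to the architecture sketch in Sec. 3.3;
# Adam is an assumption, as the paper does not name the optimizer.
model = DerainEncoderDecoder().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

def total_loss(l_b, l_c, l_s, l_fea):
    # Eq. (10)
    return l_b + lambda_c * l_c + lambda_s * l_s + lambda_f * l_fea
```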

Figure 3: Qualitative results on real rain images. From left to right: input image, JORDER-E, MPRNet, RCDNet, MemoryNet, MAXIM, Restormer, and ours.
Figure 4: Qualitative results on RealDataset [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau]. (a) Input, (b) MemoryNet, (c) JORDER-E, (d) RCDNet, (e) Restormer, (f) MAXIM, (g) DRSformer, and (h) ours.

4.1 Comparison with State-of-the-Art

Real Rain Analysis

To evaluate the generalization ability on real rain images, we conducted a qualitative comparison against all competing state-of-the-art methods. In the first row of Fig. 3, while existing methods struggle to remove long and thin real rain streaks, our method effectively handles various types of real rain streaks. The second row shows that our method outperforms the competing methods in discriminating rain streaks from background information, even in rain accompanied by fog. This is because existing methods heavily rely on synthetic data, which cannot cover the various types of real rain streaks. Additionally, Fig. 4 shows de-rained results on RealDataset [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau]. Our method removes rain streaks more reliably than the other methods thanks to the RsPU and real-world time-lapse data. Additional qualitative results are shown in the supplementary material.

Synthetic Rain Analysis

Table 1 shows the quantitative results of recent studies, from CNN-based methods to Transformer-based methods, on various benchmarks. Although our method does not use any ground-truth data for training, it achieves comparable and even superior performance compared to state-of-the-art methods. This indicates that leveraging a real rain dataset effectively extracts useful rain streak features, even on synthetic rain datasets. We believe that our framework, which intelligently combines the real dataset and the RsPU, effectively estimates the prototypical rain streak features.

Method GT T-L Synthetic: Rain14000 (PSNR SSIM) Rain12000 (PSNR SSIM) Rain100 (PSNR SSIM) Real-world: RealDataset (PSNR SSIM)
DDN [Fu et al.(2017b)Fu, Huang, Zeng, Huang, Ding, and Paisley] Yes No 28.45 0.889 30.97 0.912 34.68 0.967 36.16 0.946
DID [Zhang and Patel(2018)] Yes No 26.17 0.887 31.30 0.921 35.40 0.962 28.96 0.941
MPRNet [Zamir et al.(2021)Zamir, Arora, Khan, Hayat, Khan, Yang, and Shao] Yes No 33.64 0.938 32.91 0.916 36.40 0.965 40.12 0.984
PReNet [Ren et al.(2019)Ren, Zuo, Hu, Zhu, and Meng] Yes No 32.55 0.946 33.17 0.942 37.80 0.981 40.16 0.982
SIRR [Wei et al.(2019)Wei, Meng, Zhao, Xu, and Wu] Yes No 28.44 0.889 30.57 0.910 34.75 0.969 35.31 0.941
JORDER-E [Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu] Yes No 32.00 0.935 33.98 0.950 38.59 0.983 40.78 0.981
NLEDN [Li et al.(2018a)Li, He, Zhang, Chang, Dong, and Lin] Yes No 29.79 0.897 33.16 0.919 36.57 0.975 40.12 0.984
RCDNet [Wang et al.(2020)Wang, Xie, Zhao, and Meng] Yes No 30.66 0.921 31.99 0.921 40.17 0.988 41.47 0.983
MOSS [Huang et al.(2021)Huang, Yu, and He] Yes No 31.22 0.932 32.87 0.932 37.67 0.974 40.01 0.971
SPANet [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau] Yes Yes 29.85 0.912 33.04 0.949 35.79 0.965 40.24 0.981
Restormer [Zamir et al.(2022)Zamir, Arora, Khan, Hayat, Khan, and Yang] Yes No 34.18 0.944 33.19 0.926 38.99 0.978 41.12 0.985
MAXIM [Tu et al.(2022)Tu, Talebi, Zhang, Yang, Milanfar, Bovik, and Li] Yes No 33.80 0.943 32.37 0.922 38.06 0.977 39.15 0.978
TimeLapsNet [Cho et al.(2020)Cho, Kim, Min, and Sohn] No Yes 33.73 0.941 33.25 0.935 37.89 0.980 38.54 0.989
MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn] No Yes 34.02 0.945 34.55 0.949 38.45 0.981 41.56 0.989
Ours No Yes 34.53 0.945 35.25 0.951 39.75 0.981 42.16 0.990
Table 1: Quantitative comparison of single image de-raining on the synthetic datasets and RealDataset. GT and T-L denote using paired ground truth and time-lapse data, respectively. The best result is shown in bold, and the second-best is underlined.

Analysis of Run Time and Parameters

Method NLEDN JORDER-E RCDNet MAXIM Restormer DRSformer TimeLapsNet MemoryNet Ours
GPU run time (s) 1.17 1.74 0.85 4.12 6.32 6.58 3.87 0.82 0.52
Params. 1.01 M 4.17 M 3.17 M 14.1 M 26.12 M 33.7 M 10.2 M 0.81 M 0.61 M
Table 2: Comparison of run time (s) and the number of parameters.

We compare the running time and the number of parameters in Table 2. To evaluate the running time, we used 100 images with a size of 1000 $\times$ 1000. It is clear that our method has a GPU runtime comparable to other CNN-based methods [Li et al.(2018a)Li, He, Zhang, Chang, Dong, and Lin, Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu] and is significantly faster than several Transformer-based methods [Chen et al.(2023)Chen, Li, Li, and Pan, Zamir et al.(2022)Zamir, Arora, Khan, Hayat, Khan, and Yang, Tu et al.(2022)Tu, Talebi, Zhang, Yang, Milanfar, Bovik, and Li]. This shows that our network extracts effective representations while remaining executable on devices with limited computing power and memory. In addition, our method is faster than MemoryNet [Cho et al.(2022)Cho, Kim, and Sohn], which requires an external memory network, thanks to the proposed RsPU that requires no additional memory.
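For reproducibility, the comparison protocol described above could be measured with a sketch like the following; the warm-up pass and the use of randomly generated inputs are assumptions about the timing setup.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def gpu_runtime(model: torch.nn.Module, n_images: int = 100, size: int = 1000) -> float:
    """Average per-image GPU inference time on size x size inputs."""
    model.eval().cuda()
    x = torch.randn(1, 3, size, size, device='cuda')
    with torch.no_grad():
        model(x)                      # warm-up pass
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_images):
            model(x)
        torch.cuda.synchronize()
    return (time.time() - start) / n_images
```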

4.2 Ablation Study

Study of the feature prototype loss

To verify the effectiveness of the feature prototype loss $\mathcal{L}_{fea}$, we conducted experiments with different combinations of the loss functions. Consistent with previous studies [Cho et al.(2020)Cho, Kim, Min, and Sohn, Cho et al.(2022)Cho, Kim, and Sohn], we observe that the performance improves as each loss is added. For quantitative evaluation, Table 3 (Left) shows the performance of our model trained with the proposed loss functions on RealDataset. The model trained with $\mathcal{L}_{fea}$ achieves better results than the model trained without it. The experimental results demonstrate the advantage of the proposed loss.

We show the de-rained image and the estimated rain streaks learned with and without $\mathcal{L}_{fea}$ in Fig. 5. Our model trained without $\mathcal{L}_{fea}$ is not effective at discriminating rain streaks from the background, as shown in Fig. 5 (b). We achieve improved performance with $\mathcal{L}_{coh}$, but rain streaks still remain on the face of the Spider-Man figure in Fig. 5 (c). Our model trained with $\mathcal{L}_{fea}$ effectively discriminates rain streaks from the background in Fig. 5 (d). This demonstrates that $\mathcal{L}_{fea}$ helps to discriminate rain streaks effectively and yields improved de-raining performance.

(a) Input, (b) $\mathcal{L}_{b}+\mathcal{L}_{c}+\mathcal{L}_{s}$, (c) $\mathcal{L}_{b}+\mathcal{L}_{c}+\mathcal{L}_{s}+\mathcal{L}_{coh}$, (d) $\mathcal{L}_{b}+\mathcal{L}_{c}+\mathcal{L}_{s}+\mathcal{L}_{fea}$
Figure 5: Visualization of de-raining results with respect to the loss functions.
$\mathcal{L}_{b}$ $\mathcal{L}_{c}$ $\mathcal{L}_{s}$ $\mathcal{L}_{fea}$ PSNR SSIM
39.92 0.981
40.97 0.982
40.81 0.982
41.53 0.985
41.82 0.987
42.16 0.990
$\mathcal{A}$ $\mathcal{B}$ $\mathcal{C}$
De-rainingNet
Siamese
RsPU
PSNR 39.82 40.53 42.16
SSIM 0.980 0.983 0.990
Table 3: (Left) Ablation study of loss functions. (Right) Ablation study of the network architecture.
Figure 6: Comparison of the proposed method against state-of-the-art methods on real rain images accompanied by haze.

Study of Network Architectures

To validate the effectiveness of our network, we evaluated different combinations of components on RealDataset, as shown in Table 3 (Right). Compared to the primary de-raining network and its Siamese variant, our model with the RsPU yields significantly improved de-raining performance. This verifies that the self-attention mechanism is more effective at extracting informative rain streak features and estimating de-rained images than static convolutional layers.

5 Other Weather Conditions

We conducted experiments on heavy rain images, which are often accompanied by haze effects, using RainCityscapes. DAF-Net proposed a rain imaging model with rain streaks and haze. Although DAF-Net removes haze, rain streaks still remain. We applied our network to estimate de-rained images and compared them with state-of-the-art methods. Fig. 6 shows that existing methods tend to leave rain streaks. SIRR generates a corrupted background. MemoryNet fails to preserve background information (e.g., the logo of the car is erased). While existing methods fail to remove the rain streaks and generate over-smoothed results, our method outperforms them even in the presence of haze.

6 Conclusion

In summary, our paper introduces the Rain-streak Prototype Unit (RsPU) for efficient single-image de-raining. The RsPU utilizes an attention-based approach to encode diverse rain streak features as compact prototypes, overcoming memory limitations while capturing real rain complexities. Additionally, our proposed feature prototype loss enhances the discriminative power of these prototypes through a combination of cohesion and divergence components. Extensive evaluations demonstrate that our method achieves superior performance compared to existing techniques, highlighting its potential to advance rain removal capabilities for real-world applications.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2021R1C1C2005202). (Corresponding author: Sunok Kim)

References

  • [Chen et al.(2023)Chen, Li, Li, and Pan] Xiang Chen, Hao Li, Mingqiang Li, and Jinshan Pan. Learning a sparse transformer network for effective image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5896–5905, 2023.
  • [Cho et al.(2020)Cho, Kim, Min, and Sohn] Jaehoon Cho, Seungryong Kim, Dongbo Min, and Kwanghoon Sohn. Single image deraining using time-lapse data. IEEE Transactions on Image Processing, 29:7274–7289, 2020.
  • [Cho et al.(2022)Cho, Kim, and Sohn] Jaehoon Cho, Seungryong Kim, and Kwanghoon Sohn. Memory-guided image de-raining using time-lapse data. IEEE Transactions on Image Processing, 31:4090–4103, 2022.
  • [Fan et al.(2018)Fan, Wu, Fu, Hunag, and Ding] Zhiwen Fan, Huafeng Wu, Xueyang Fu, Yue Hunag, and Xinghao Ding. Residual-guide feature fusion network for single image deraining. arXiv preprint arXiv:1804.07493, 2018.
  • [Fu et al.(2017a)Fu, Huang, Ding, Liao, and Paisley] Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017a.
  • [Fu et al.(2017b)Fu, Huang, Zeng, Huang, Ding, and Paisley] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855–3863, 2017b.
  • [Fu et al.(2019)Fu, Liang, Huang, Ding, and Paisley] Xueyang Fu, Borong Liang, Yue Huang, Xinghao Ding, and John Paisley. Lightweight pyramid networks for image deraining. IEEE transactions on neural networks and learning systems, 31(6):1794–1807, 2019.
  • [Guo et al.(2021)Guo, Sun, Juefei-Xu, Ma, Xie, Feng, Liu, and Zhao] Qing Guo, Jingyang Sun, Felix Juefei-Xu, Lei Ma, Xiaofei Xie, Wei Feng, Yang Liu, and Jianjun Zhao. Efficientderain: Learning pixel-wise dilation filtering for high-efficiency single-image deraining. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1487–1495, 2021.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [Huang et al.(2021)Huang, Yu, and He] Huaibo Huang, Aijing Yu, and Ran He. Memory oriented transfer learning for semi-supervised image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7732–7741, 2021.
  • [Jiang et al.(2020)Jiang, Wang, Yi, Chen, Huang, Luo, Ma, and Jiang] Kui Jiang, Zhongyuan Wang, Peng Yi, Chen Chen, Baojin Huang, Yimin Luo, Jiayi Ma, and Junjun Jiang. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8346–8355, 2020.
  • [Lettry et al.(2018a)Lettry, Vanhoey, and Van Gool] Louis Lettry, Kenneth Vanhoey, and Luc Van Gool. Deep unsupervised intrinsic image decomposition by siamese training. arXiv preprint arXiv:1803.00805, 2018a.
  • [Lettry et al.(2018b)Lettry, Vanhoey, and Van Gool] Louis Lettry, Kenneth Vanhoey, and Luc Van Gool. Unsupervised deep single-image intrinsic decomposition using illumination-varying image sequences. In Computer Graphics Forum, volume 37, pages 409–419. Wiley Online Library, 2018b.
  • [Li et al.(2018a)Li, He, Zhang, Chang, Dong, and Lin] Guanbin Li, Xiang He, Wei Zhang, Huiyou Chang, Le Dong, and Liang Lin. Non-locally enhanced encoder-decoder network for single image de-raining. In Proceedings of the 26th ACM international conference on Multimedia, pages 1056–1064, 2018a.
  • [Li et al.(2021a)Li, Cao, Zhao, Zhang, and Meng] Minghan Li, Xiangyong Cao, Qian Zhao, Lei Zhang, and Deyu Meng. Online rain/snow removal from surveillance videos. IEEE Transactions on Image Processing, 30:2029–2044, 2021a.
  • [Li et al.(2019)Li, Araujo, Ren, Wang, Tokuda, Junior, Cesar-Junior, Zhang, Guo, and Cao] Siyuan Li, Iago Breno Araujo, Wenqi Ren, Zhangyang Wang, Eric K Tokuda, Roberto Hirata Junior, Roberto Cesar-Junior, Jiawan Zhang, Xiaojie Guo, and Xiaochun Cao. Single image deraining: A comprehensive benchmark analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3838–3847, 2019.
  • [Li et al.(2021b)Li, Ren, Wang, Araujo, Tokuda, Junior, Cesar-Jr, Wang, and Cao] Siyuan Li, Wenqi Ren, Feng Wang, Iago Breno Araujo, Eric K Tokuda, Roberto Hirata Junior, Roberto M Cesar-Jr, Zhangyang Wang, and Xiaochun Cao. A comprehensive benchmark analysis of single image deraining: Current challenges and future perspectives. International Journal of Computer Vision, 129(4):1301–1322, 2021b.
  • [Li et al.(2018b)Li, Wu, Lin, Liu, and Zha] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 254–269, 2018b.
  • [Liu et al.(2020)Liu, Zhang, Zhang, and He] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 142–158. Springer, 2020.
  • [Mu et al.(2018)Mu, Chen, Liu, Fan, and Luo] Pan Mu, Jian Chen, Risheng Liu, Xin Fan, and Zhongxuan Luo. Learning bilevel layer priors for single image rain streaks removal. IEEE Signal Processing Letters, 26(2):307–311, 2018.
  • [Ren et al.(2019)Ren, Zuo, Hu, Zhu, and Meng] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3937–3946, 2019.
  • [Snell et al.(2017)Snell, Swersky, and Zemel] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • [Tu et al.(2022)Tu, Talebi, Zhang, Yang, Milanfar, Bovik, and Li] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5769–5780, 2022.
  • [Valanarasu et al.(2021)Valanarasu, Yasarla, and Patel] Jeya Maria Jose Valanarasu, Rajeev Yasarla, and Vishal M Patel. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. arXiv preprint arXiv:2111.14813, 2021.
  • [Wang et al.(2020)Wang, Xie, Zhao, and Meng] Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng. A model-driven deep neural network for single image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3103–3112, 2020.
  • [Wang et al.(2019)Wang, Yang, Xu, Chen, Zhang, and Lau] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12270–12279, 2019.
  • [Wei et al.(2019)Wei, Meng, Zhao, Xu, and Wu] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3877–3886, 2019.
  • [Yang et al.(2017)Yang, Tan, Feng, Liu, Guo, and Yan] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1357–1366, 2017.
  • [Yang et al.(2019a)Yang, Liu, Yang, and Guo] Wenhan Yang, Jiaying Liu, Shuai Yang, and Zongming Guo. Scale-free single image deraining via visibility-enhanced recurrent wavelet learning. IEEE Transactions on Image Processing, 28(6):2948–2961, 2019a.
  • [Yang et al.(2019b)Yang, Tan, Feng, Guo, Yan, and Liu] Wenhan Yang, Robby T Tan, Jiashi Feng, Zongming Guo, Shuicheng Yan, and Jiaying Liu. Joint rain detection and removal from a single image with contextualized deep networks. IEEE transactions on pattern analysis and machine intelligence, 42(6):1377–1393, 2019b.
  • [Yang et al.(2020)Yang, Tan, Wang, Fang, and Liu] Wenhan Yang, Robby T Tan, Shiqi Wang, Yuming Fang, and Jiaying Liu. Single image deraining: From model-based to data-driven and beyond. IEEE Transactions on pattern analysis and machine intelligence, 2020.
  • [Yasarla and Patel(2019)] Rajeev Yasarla and Vishal M Patel. Uncertainty guided multi-scale residual learning-using a cycle spinning cnn for single image de-raining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8405–8414, 2019.
  • [Yasarla et al.(2020)Yasarla, Sindagi, and Patel] Rajeev Yasarla, Vishwanath A Sindagi, and Vishal M Patel. Syn2real transfer learning for image deraining using gaussian processes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2726–2736, 2020.
  • [Zamir et al.(2021)Zamir, Arora, Khan, Hayat, Khan, Yang, and Shao] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14821–14831, 2021.
  • [Zamir et al.(2022)Zamir, Arora, Khan, Hayat, Khan, and Yang] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.
  • [Zhang and Patel(2018)] He Zhang and Vishal M Patel. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 695–704, 2018.