Two-Stage Single Image Reflection Removal
with Reflection-Aware Guidance
Abstract
Removing undesired reflections from an image captured through a glass surface is a very challenging problem with many practical application scenarios. To improve reflection removal, cascaded deep models have usually been adopted to estimate the transmission in a progressive manner. However, most existing methods are still limited in exploiting the result of the prior stage for guiding transmission estimation. In this paper, we present a novel two-stage network with reflection-aware guidance (RAGNet) for single image reflection removal (SIRR). To be specific, the reflection layer is first estimated, since it is generally much simpler than the transmission layer and relatively easier to estimate. A reflection-aware guidance (RAG) module is then elaborated for better exploiting the estimated reflection when predicting the transmission layer. By incorporating feature maps from the estimated reflection and the observation, RAG can be used (i) to mitigate the effect of reflection in the observation, and (ii) to generate a mask for partial convolution that mitigates the effect of deviating from the linear combination hypothesis. A dedicated mask loss is further presented for reconciling the contributions of encoder and decoder features. Experiments on five commonly used datasets demonstrate the quantitative and qualitative superiority of our RAGNet in comparison to the state-of-the-art SIRR methods. The source code and pre-trained model are available at https://github.com/liyucs/RAGNet.
[Figure 1 omitted. Panels: Observation | Predicted Reflection | Visualization of … | Ours; CEILNet [4] | BDN [39] | Lei et al. [21] | IBCLN [14].]
1 Introduction
Reflections are commonly observed in images taken through glass, where the light reflected by the glass (\ie, reflection) and the light from the objects behind the glass (\ie, transmission) are both captured by the camera. Undesired reflections inevitably degrade image visual quality and hinder many subsequent vision applications such as autonomous driving and entertainment, making reflection removal a challenging and practically valuable problem in low-level vision [20, 4, 35]. To mitigate the ill-posedness of reflection removal, early studies usually resort to multiple input images captured under different illumination conditions, focal lengths, polarizer angles, or camera viewpoints. In contrast, single image reflection removal (SIRR) [41, 39, 14] corresponds to a more common and practically more significant setting, and has received considerable interest in recent years.
Following the formulation in [35, 41], the input image, \ie, an observation $I$, can be viewed as a linear combination of a transmission layer $T$ and a reflection layer $R$, \ie, $I = T + R$. The goal of SIRR is to recover the transmission layer $T$ from the observation $I$. Unfortunately, the intrinsic ill-posedness of SIRR makes it very challenging to learn a direct mapping from the input image $I$ to the transmission $T$. To ease the difficulty of training, cascaded deep models are usually adopted to mitigate the uncertainty of transmission estimation [4, 39, 14]. For example, [4] first estimates the edge map and then the transmission, while [39] and [14] progressively refine the estimation of the reflection and transmission layers. However, two issues remain open: (i) what should be estimated in the prior stage, and (ii) how to exploit the result of the prior stage for guiding succeeding transmission estimation, which leaves room for further improving SIRR.
In this paper, we take a step forward to address the above issues, and present a novel two-stage network with reflection-aware guidance (RAGNet). To be specific, our RAGNet estimates the reflection layer $\hat{R}$ in the first stage. Then, it takes both the estimated reflection $\hat{R}$ and the observation $I$ as input to predict the transmission layer $T$. Fig. 1 shows an example of an input image as well as the reflection estimated by our RAGNet. The reasons for estimating the reflection layer in the first stage can be explained from three aspects. First, the reflection layer is generally much simpler than the transmission layer and is relatively easier to estimate. Second, the estimated reflection is beneficial to transmission estimation. For example, the difference between the observation and the estimated reflection, \ie, $I - \hat{R}$, can serve as a reasonable initialization of the transmission. Third, the linear combination may not hold true for real-world observations, especially images with heavy and bright reflections (see Fig. 1). Moreover, the illumination of the estimated reflection is helpful in finding the regions that violate the linear combination hypothesis.
Furthermore, we elaborate a reflection-aware guidance (RAG) module for better exploiting the estimated reflection in the second stage. As summarized in Table 1, the output of the prior stage is concatenated with the original observation as the input of the subsequent stage in [39, 4]. Li et al. [14] further applied a recurrent layer (\ie, LSTM [10]) to exploit deep features across stages. In comparison, the second stage of RAGNet includes an encoder for the estimated reflection and an encoder-decoder for the original observation. For each block of the decoder, an RAG module is elaborated to better leverage the estimated reflection from two aspects. On the one hand, the encoder feature map of the reflection is utilized to suppress that of the observation, and their difference is then concatenated with the decoder feature map of the observation. On the other hand, the encoder feature maps of the reflection and observation are incorporated with the decoder feature map to generate a mask indicating the extent to which the transmission is corrupted by heavy reflection. The mask then collaborates with partial convolution to mitigate the effect of deviating from the linear combination hypothesis, and a dedicated mask loss is further presented for reconciling the contributions of encoder and decoder features.
Extensive experiments are conducted on five commonly used real-world datasets to evaluate our RAGNet. With the RAG module, the estimated reflection is effective in guiding the transmission estimation and alleviating the effect of violating the linear combination hypothesis. In comparison with the state-of-the-art models [4, 41, 39, 35, 14], our RAGNet performs favorably in terms of both quantitative metrics and visual quality. In general, the main contributions of this work are summarized as follows:
- A SIRR network, RAGNet, is presented, which involves two stages, \ie, first estimating the reflection and then predicting the transmission. Its rationale is also explicated to differentiate it from existing cascaded SIRR models.
- For better leveraging the estimated reflection in the second stage, the reflection-aware guidance (RAG) module is introduced to guide the transmission estimation and alleviate the effect of violating the linear combination hypothesis.
- Experiments on five popular datasets show that our RAGNet performs favorably against the state-of-the-art methods quantitatively and qualitatively.
2 Related Work
In this section, we briefly review two categories of relevant methods on reflection removal tasks, \ie, multiple image reflection removal and single image reflection removal.
2.1 Multiple Image Reflection Removal
Due to the ill-posed nature of the problem, multiple images have been exploited to provide extra information for reflection removal. Agrawal et al. [1] used two separate images captured with and without flash as input, and Fu et al. [5] further took high frequency illumination into consideration. Schechner et al. [25] treated the input image as a superimposition of two layers and proposed novel blur kernels to eliminate mutual penetration of the separated layers, using images taken with different focal lengths as a clue. Polarization cameras have also been used to take advantage of the polarization properties of light [24, 36, 21]. In particular, based on the linear hypothesis in raw RGB space, Lei et al. [21] adopted a simple two-stage model but took multiple polarized images as input. In these methods, the behavioral difference between the reflection layer and the transmission layer under different settings serves as an effective clue for reflection removal.
Unlike the aforementioned methods, whose input images are captured from a fixed position, taking images from different viewpoints is another popular way to exploit multiple image information in reflection removal [38, 31, 15, 8, 24, 6, 29, 40, 30, 9, 27]. In particular, Punnappurath et al. [20] leveraged the recently developed dual-pixel (DP) cameras, where images captured from two sub-aperture views are readily accessible. Although approaches in the multi-input paradigm can obtain impressive results, the need to prepare qualified input images with specially designed hardware or extra human effort makes them inconvenient, and these prerequisites cannot be satisfied in many practical scenarios, which has boosted the recent prevalence of single image reflection removal.
2.2 Single Image Reflection Removal
As the name implies, SIRR methods merely require a single image as input. Traditional methods employ various priors observed in real-world images to reduce the ill-posedness, such as the gradient sparsity prior [13, 32, 2, 23], smoothness prior [16, 34], and ghosting cues [26]. Unfortunately, the representation and generalization abilities of such methods are limited, and they are prone to produce poor results in the absence of discriminative hints, which is often the case when relying on handcrafted priors.
As a remedy, learning-based methods have been proposed to solve the SIRR problem by leveraging the capacity of deep convolutional neural networks (CNNs), mainly focusing on data synthesis, loss functions, and network design. Due to the tremendous effort required to collect sufficient well-aligned real image pairs for model training, Fan et al. [4] generated reflection layers via smoothing and illumination intensity decay, which were then used to synthesize the observation by linear combination. Zhang et al. [41] further adapted the intensity decay to suit reflections with a higher illuminance level, and applied random vignetting to simulate different camera views. Besides, several loss functions have been particularly designed to leverage the characteristics of reflections. CEILNet [4] constrains the gradient maps of the predicted transmission layer to avoid blurry output. Zhang et al. [41] proposed an exclusion loss that minimizes the correlation between the gradient maps of the estimated transmission and reflection layers, based on the observation that the edges in these two layers are unlikely to overlap. Li et al. [14] took advantage of the synthesis model to reconstruct the observation, which is constrained to approximate the input. As to network structure, most recent works follow two main schemes. On the one hand, [41] and [35] extract input features via a pre-trained VGG-19 [28] model and stack them together for subsequent usage, which exploits multi-scale features and is tolerant of mild misalignment between real-world training pairs. On the other hand, the encoder-decoder framework is exploited in [4, 35, 39, 14], where skip connections are usually integrated for better feature utilization.
Considering the difficulty of directly estimating the transmission layer, cascaded models have been widely adopted in the hope of finding a better reflection removal pathway. Fan et al. [4] first predicted potential edge maps, and then generated the transmission layer given these edge maps. Yang et al. [39] alternated between reflection and transmission estimation, while Li et al. [14] further incorporated a long short-term memory (LSTM) module [10] to generate the reflection and transmission layers in a progressive manner. Such multi-stage methods are closely related to our proposed method; however, as further discussed in Sec. 3.1, they are still limited in exploiting the result from the prior stage.
We note that the MIRR method [21] also uses a two-stage network, but takes raw polarized images with different polarization angles as input. Moreover, it adopts the linear hypothesis, and its concatenation-based fusion is limited in exploiting the estimated reflection for the subsequent stage. Empirically, the adjusted MIRR retrained with a single unpolarized image as input is still limited in removing heavy and bright reflections (see Fig. 1).
3 Proposed Method
In this section, we first revisit the limitations of existing multi-stage SIRR methods. Then, the overall architecture and the design to exploit reflection-aware guidance are illustrated in detail. Finally, the dedicated mask loss and the learning objective are presented.
3.1 Limitations of Existing Multi-Stage Methods
Given an observation $I$, CEILNet [4] predicts a potential edge map for the transmission layer, where a reflection smoothness prior is assumed to distinguish whether the image gradients belong to the transmission layer or the reflection layer. Unfortunately, such a smoothness prior is often violated in real scenarios, and the resulting inaccurate edge maps lead to degraded performance by providing wrong information to the second stage.
BDN [39] and IBCLN [14] resort to predicting the reflection layer in the prior stage. However, a simple concatenation is insufficient to make full use of the predicted reflection layer and is limited in suppressing heavy and bright reflections; thus BDN [39] often fails to handle the reflections in difficult regions (see Fig. 1). IBCLN [14] takes the input image and a black image as the initial estimates of the transmission and reflection, and iteratively refines them with an LSTM [10] module to exploit cross-stage information, which significantly promotes SIRR performance. Nevertheless, such a scheme makes the method less efficient and heavily reliant on the generation quality of the prior stages, and errors also accumulate across stages, leading to undesired artifacts in some scenarios (\eg, zoom in on Fig. 1 for a clear view).
3.2 Overall Architecture Design
Considering the aforementioned limitations, we build a two-stage model, where a reflection-aware guidance (RAG) module is introduced in the second stage to better exploit the reflection layer predicted in the first stage. Specifically, we use a plain U-Net [22] model as the first-stage subnetwork, which takes the observation $I$ as input and predicts the reflection layer $\hat{R}$. The structure of the second-stage subnetwork is shown in Fig. 2; $\hat{R}$ is fed into it together with the observation $I$ to generate the transmission layer $\hat{T}$. In particular, the second-stage subnetwork is composed of two encoders (one for the observation and one for the estimated reflection) and one decoder, and an RAG module is elaborated in each block of the decoder. Kindly refer to the supplementary material for detailed structure configurations.
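To make the two-stage data flow concrete, the following PyTorch-style sketch chains the two stages; the class names and the tiny stand-in subnetworks are illustrative assumptions, not the released implementation, and in RAGNet the second stage actually consumes the observation and the estimated reflection through separate encoders rather than a plain concatenation.

```python
import torch
import torch.nn as nn


class TinyStage(nn.Module):
    """Stand-in for a real subnetwork; a single conv keeps the sketch runnable."""

    def __init__(self, in_ch, out_ch=3):
        super().__init__()
        self.body = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return self.body(x)


class TwoStageSIRR(nn.Module):
    """Observation I -> reflection estimate (stage 1) -> transmission estimate (stage 2)."""

    def __init__(self):
        super().__init__()
        self.stage_r = TinyStage(3)   # first stage: observation -> reflection
        self.stage_t = TinyStage(6)   # second stage: (observation, reflection) -> transmission

    def forward(self, obs):
        r_hat = self.stage_r(obs)
        # RAGNet feeds obs and r_hat to two separate encoders; the stand-in stage
        # is single-input, so the two tensors are simply stacked here.
        t_hat = self.stage_t(torch.cat([obs, r_hat], dim=1))
        return r_hat, t_hat


if __name__ == "__main__":
    obs = torch.rand(1, 3, 224, 224)
    r_hat, t_hat = TwoStageSIRR()(obs)
    print(r_hat.shape, t_hat.shape)
```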
[Figure omitted: panels labeled "Input" and "Predicted".]
3.3 Reflection-Aware Guidance
As shown in Fig. 1, reflections are spatially nonuniform, and their location and intensity can be roughly predicted in the first stage. The estimated reflection can then be incorporated with the observation for succeeding transmission estimation. However, the estimated reflection may be inaccurate. Furthermore, the linear combination hypothesis may not hold true in regions with high reflection intensities, where few informative features of the transmission are retained in the observation (see Fig. 1). Therefore, estimating the transmission in such regions is more challenging, and is closer to an inpainting task. Taking these factors into account, we elaborate an RAG module in each decoder block for (i) generating features complementary to the decoder and (ii) predicting a mask that indicates the heavy reflection regions for guiding image inpainting.
Given the observation encoder feature, the reflection encoder feature, and the decoder feature, the RAG module is designed as shown in Fig. 2. Using the two encoder features, a difference feature $F_{dif}$ is generated via a subtraction operation to better suppress the reflections; it serves as a complement and enhancement to the decoder feature. Furthermore, taking the observation, reflection, and decoder features as input, the RAG module generates a mask $M$ via two convolution layers followed by a sigmoid operation. The mask $M$ is the concatenation of two parts, $M_{dif}$ and $M_{dec}$, which correspond to the difference feature and the decoder feature, respectively. With such a mask, the model is aware of the regions deviating from the linear combination hypothesis, and recovers the transmission layer in such areas mainly by relying on the surrounding decoder features in the spirit of image inpainting. The mask is visualized in Fig. 3.
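The sketch below illustrates one possible realization of the RAG block described above: a difference feature obtained by subtraction and a two-convolution mask head followed by a sigmoid. Channel sizes, the exact inputs of the mask head, and the layout of the output mask are assumptions made for illustration and may differ from the released code.

```python
import torch
import torch.nn as nn


class RAGBlock(nn.Module):
    """Illustrative reflection-aware guidance block for one decoder level.

    Inputs: observation encoder feature f_obs, reflection encoder feature f_ref,
    and decoder feature f_dec, all with `ch` channels and the same spatial size.
    Outputs: the feature passed to the partial convolution and its per-channel mask.
    """

    def __init__(self, ch):
        super().__init__()
        self.mask_head = nn.Sequential(          # two convolutions + sigmoid, as in the text
            nn.Conv2d(3 * ch, 2 * ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_obs, f_ref, f_dec):
        f_dif = f_obs - f_ref                    # suppress reflection by subtraction
        mask = self.mask_head(torch.cat([f_obs, f_ref, f_dec], dim=1))
        feat = torch.cat([f_dif, f_dec], dim=1)  # complement the decoder feature with F_dif
        return feat, mask                        # mask = [M_dif, M_dec], one value per channel
```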
Considering the similarity between inpainting and reflection removal in heavy reflection regions, we deploy a partial convolution [18], which was first introduced in image inpainting methods [17, 37], alongside each RAG module to better exploit the mask. Since the mask generated by the RAG module ranges from 0 to 1, the partial convolution in this paper can be formulated as
$$F_{out} = \mathbf{W} \ast (F \odot M) \cdot \frac{1}{\overline{M}} + b, \qquad (1)$$
where $\mathbf{W}$ and $b$ represent the weight and bias of the partial convolution, and $\ast$ and $\odot$ denote convolution and entry-wise product, respectively. $\overline{M}$ contains the average values of $M$ in neighborhood regions, which can be efficiently calculated by an average pooling operation.
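A minimal sketch of such a partial convolution with a continuous mask is given below. It follows the structure of Eq. (1); sharing one renormalization map across output channels and the handling of the bias are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftPartialConv2d(nn.Module):
    """Partial convolution with a soft (0-1) mask in the spirit of Eq. (1)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, eps=1e-6):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(1, out_ch, 1, 1))
        self.kernel_size = kernel_size
        self.eps = eps

    def forward(self, feat, mask):
        # W * (F ⊙ M): convolve the masked feature
        masked = self.conv(feat * mask)
        # local average of the mask, computed with an average pooling whose
        # window matches the convolution kernel (averaged over channels here)
        mask_avg = F.avg_pool2d(mask.mean(dim=1, keepdim=True),
                                self.kernel_size, stride=1,
                                padding=self.kernel_size // 2)
        # renormalize, then add the bias
        return masked / (mask_avg + self.eps) + self.bias
```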
Datasets | CEILNet [4] | Zhang et al. [41] | BDN [39] | ERRNet [35] | Lei et al. [21] | IBCLN [14] | Ours |
PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | |
SIR2 Solid (200) | 23.37 / 0.875 | 22.68 / 0.879 | 22.73 / 0.853 | 24.85 / 0.894 | 23.81 / 0.882 | 24.88 / 0.893 | 26.15 / 0.903 |
SIR2 Postcard (199) | 20.09 / 0.786 | 16.81 / 0.797 | 20.71 / 0.857 | 21.99 / 0.874 | 21.48 / 0.873 | 23.39 / 0.875 | 23.67 / 0.879 |
SIR2 Wild (55) | 21.07 / 0.805 | 21.52 / 0.832 | 22.34 / 0.821 | 24.16 / 0.847 | 23.84 / 0.866 | 24.71 / 0.886 | 25.52 / 0.880 |
Real 20 (20) | 18.87 / 0.692 | 22.55 / 0.788 | 18.81 / 0.737 | 23.19 / 0.817 | 22.35 / 0.793 | 22.04 / 0.772 | 22.95 / 0.793 |
Average (474) | 21.54 / 0.822 | 20.08 / 0.835 | 21.67 / 0.846 | 23.50 / 0.877 | 22.77 / 0.873 | 24.11 / 0.880 | 24.90 / 0.886 |
3.4 Mask Loss
Furthermore, we design a novel mask loss to better exploit the mask when training RAGNet. As shown in Fig. 1, for areas with heavy reflections, the observation can hardly provide informative features. Therefore, we encourage the values of $M_{dif}$ to approach 0 in such areas, \ie,
$$\mathcal{L}_{m_0} = \sum_{l} \big\| M^{(l)}_{dif} \big[ \hat{R}^{(l)} > \delta_h \big] \big\|_1, \qquad (2)$$
where $(l)$ denotes the $l$-th layer, $\|\cdot\|_1$ denotes the $\ell_1$ norm, $M^{(l)}_{dif}[\cdot]$ represents the part of $M^{(l)}_{dif}$ that meets the condition in the square brackets, and $\delta_h$ is the threshold that delimits heavy reflection areas.
However, with $\mathcal{L}_{m_0}$ only, $M_{dif}$ may be optimized towards a trivial all-zero solution, which diminishes the effect of $F_{dif}$. Therefore, $M_{dif}$ is supposed to approach 1 in areas with few reflections, which serves as a regularization term to avoid the trivial solution and forces the partial convolution to exploit both the difference feature and the decoder feature, since both of them are reliable in such areas, \ie,
$$\mathcal{L}_{m_1} = \sum_{l} \big\| 1 - M^{(l)}_{dif} \big[ \hat{R}^{(l)} < \delta_w \big] \big\|_1, \qquad (3)$$
where $\delta_w$ denotes the threshold for regions with few reflections.
Note that we do not constrain the mask values in the remaining regions, which are automatically optimized for better transmission estimation. Therefore, the mask loss is formulated as
$$\mathcal{L}_{mask} = \alpha_0 \mathcal{L}_{m_0} + \alpha_1 \mathcal{L}_{m_1}, \qquad (4)$$
where the weights $\alpha_0$ and $\alpha_1$ are set empirically. The effectiveness of the mask loss will be further corroborated in the ablation studies.
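The mask loss of Eqs. (2)-(4) could be sketched as follows. The thresholds, the weights, and the way the stage-one reflection estimate is resized to each layer are placeholders rather than the paper's actual settings.

```python
import torch
import torch.nn.functional as F


def mask_loss(masks, r_hat, delta_h=0.5, delta_w=0.1, alpha_0=1.0, alpha_1=1.0):
    """Illustrative mask loss: push M_dif toward 0 in heavy-reflection regions
    and toward 1 in weak-reflection regions, leaving other regions unconstrained.

    `masks` is a list of per-layer M_dif masks from the RAG modules; `r_hat` is
    the stage-one reflection estimate.  Threshold and weight values are made up.
    """
    loss = r_hat.new_zeros(())
    for m in masks:
        # bring the reflection estimate to the resolution of this mask
        r = F.interpolate(r_hat, size=m.shape[-2:], mode="bilinear", align_corners=False)
        r = r.mean(dim=1, keepdim=True)            # per-pixel reflection intensity
        heavy = (r > delta_h).float()              # regions violating the linear model
        weak = (r < delta_w).float()               # regions with little reflection
        loss = loss + alpha_0 * (m * heavy).abs().mean() \
                    + alpha_1 * ((1.0 - m) * weak).abs().mean()
    return loss
```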
3.5 Learning Objective
[Figure omitted: qualitative ablation results with panels Input | RAGNetI→T | w/o … | w/o … | w/o … | w/o … | RAGNet.]
To train our RAGNet, several commonly used loss functions are combined, including the reconstruction loss, perceptual loss [11], exclusion loss [41], and adversarial loss [7].
Reconstruction loss. With synthetic image pairs, we are able to minimize the pixel-wise difference between the network outputs $\hat{T}$, $\hat{R}$ and the corresponding ground-truths $T$, $R$,
$$\mathcal{L}_{rec} = \| \hat{T} - T \|_1 + \| \hat{R} - R \|_1. \qquad (5)$$
Perceptual loss. Given a pre-trained VGG-19 [28] model $\phi$, we minimize the difference between $\{\hat{T}, \hat{R}\}$ and $\{T, R\}$ in the selected feature layers,
$$\mathcal{L}_{per} = \sum_{j} \lambda_j \big( \| \phi_j(\hat{T}) - \phi_j(T) \|_1 + \| \phi_j(\hat{R}) - \phi_j(R) \|_1 \big), \qquad (6)$$
where $j$ indexes the selected convolutional layers of the VGG-19 model, and the weights $\lambda_j$ are used to balance different layers.
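A sketch of such a VGG-19 perceptual loss is shown below; the selected layer indices, their weights, and the use of an L1 distance are placeholder assumptions, and the inputs are assumed to be normalized as the VGG model expects.

```python
import torch
import torch.nn as nn
from torchvision import models


class VGGPerceptualLoss(nn.Module):
    """L1 distance between features of a frozen VGG-19 at a few selected layers."""

    def __init__(self, layer_ids=(2, 7, 12, 21, 30), weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        # older torchvision versions use vgg19(pretrained=True) instead
        vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.weights = dict(zip(layer_ids, weights))

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layer_ids:
                loss = loss + self.weights[idx] * torch.mean(torch.abs(x - y))
            if idx >= max(self.layer_ids):
                break
        return loss
```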
Exclusion loss. Following [41], the exclusion loss is formulated as
$$\mathcal{L}_{excl} = \sum_{n=1}^{N} \big\| \tanh\!\big(\lambda_T |\nabla \hat{T}^{\downarrow n}|\big) \odot \tanh\!\big(\lambda_R |\nabla \hat{R}^{\downarrow n}|\big) \big\|_F, \qquad (7)$$
where $\lambda_T$ and $\lambda_R$ denote normalization factors, $\nabla \hat{T}$ and $\nabla \hat{R}$ are the gradients of $\hat{T}$ and $\hat{R}$, and $\|\cdot\|_F$ is the Frobenius norm. $\hat{T}^{\downarrow n}$ and $\hat{R}^{\downarrow n}$ represent the down-sampled versions of $\hat{T}$ and $\hat{R}$, where $\hat{T}^{\downarrow 1}$ and $\hat{R}^{\downarrow 1}$ are the original inputs. In practice, the down-sampling factor between adjacent scales is set to 2, and $\lambda_T$ and $\lambda_R$ are computed following [41].
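The snippet below is a rough sketch of the multi-scale exclusion loss. The normalization factors are derived here from mean gradient magnitudes and the scale loop is simplified, which may differ from the exact scheme in [41].

```python
import torch
import torch.nn.functional as F


def _grad(img):
    """Horizontal and vertical finite-difference gradients."""
    gx = img[:, :, :, 1:] - img[:, :, :, :-1]
    gy = img[:, :, 1:, :] - img[:, :, :-1, :]
    return gx, gy


def exclusion_loss(t_hat, r_hat, levels=3, eps=1e-6):
    """Penalize correlated gradients of the predicted transmission and reflection
    across several down-sampled scales (illustrative normalization)."""
    loss = 0.0
    for _ in range(levels):
        tx, ty = _grad(t_hat)
        rx, ry = _grad(r_hat)
        # balance the two layers by their mean gradient magnitudes
        ax = (rx.abs().mean() + eps) / (tx.abs().mean() + eps)
        ay = (ry.abs().mean() + eps) / (ty.abs().mean() + eps)
        loss = loss + (torch.tanh(tx.abs() * ax) * torch.tanh(rx.abs())).norm() \
                    + (torch.tanh(ty.abs() * ay) * torch.tanh(ry.abs())).norm()
        # move to the next (coarser) scale
        t_hat = F.interpolate(t_hat, scale_factor=0.5, mode="bilinear", align_corners=False)
        r_hat = F.interpolate(r_hat, scale_factor=0.5, mode="bilinear", align_corners=False)
    return loss / levels
```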
Adversarial Loss. An adversarial loss is adopted to further enhance the visual quality of the output images. We treat the whole RAGNet as the generator and additionally build a multi-layer convolutional discriminator $D$, whose parameters are updated via
$$\mathcal{L}_{D} = -\mathbb{E}\big[\log D(T)\big] - \mathbb{E}\big[\log\big(1 - D(\hat{T})\big)\big], \qquad (8)$$
while the parameters of the generator are optimized by
$$\mathcal{L}_{adv} = -\mathbb{E}\big[\log D(\hat{T})\big]. \qquad (9)$$
Taking the above loss functions into account, the learning objective for training our RAGNet can be formulated as
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{excl}\mathcal{L}_{excl} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{mask}\mathcal{L}_{mask}, \qquad (10)$$
where the trade-off parameters $\lambda_{per}$, $\lambda_{excl}$, $\lambda_{adv}$, and $\lambda_{mask}$ are set empirically.
4 Experiments
4.1 Implementation Details
Training Data. Following previous works, the proposed RAGNet is trained with both synthetic and real-world images, and we use the same data synthesis protocol as ERRNet [35]. Specifically, image pairs are chosen from the PASCAL VOC dataset [3], and the two images of each pair are used to generate the transmission $T$ and the reflection $R$, respectively. In each iteration, we select a pair of images, randomly scale their shorter sides, and then crop two patches to synthesize the input $I$. For real-world data, Zhang et al. [41] collected 110 input-transmission pairs, and we follow the official split where 90 image pairs are used for training. Since the ground-truth reflection is unavailable, the corresponding reconstruction and perceptual losses on the reflection are omitted when training with real-world images.
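As a rough illustration of linear-combination synthesis, the snippet below blurs and attenuates a reflection image before adding it to a transmission image. The kernel size, sigma, and decay factor are made-up values, and ERRNet's full protocol includes further steps (e.g., handling over-exposed regions) that are not reproduced here.

```python
import torch
from torchvision.transforms.functional import gaussian_blur


def synthesize_observation(t_img, r_img, sigma=3.0, decay=0.7):
    """Toy synthesis: I = T + decayed, blurred R, clipped to the valid range.

    `t_img` and `r_img` are (3, H, W) tensors in [0, 1] cropped from two images.
    """
    r_blur = gaussian_blur(r_img, kernel_size=11, sigma=sigma) * decay
    obs = torch.clamp(t_img + r_blur, 0.0, 1.0)
    return obs, t_img, r_blur  # observation, transmission GT, reflection GT
```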
Evaluation Datasets. Five commonly used real-world datasets are exploited for evaluation, including 20 image pairs (Real 20) from [41], 45 images (Real 45) from [4], and three subsets from SIR2 [33]: i) SIR2 Wild with 55 wild scene images, ii) SIR2 Solid with 200 controlled scene images of a set of daily-life objects, and iii) SIR2 Postcard with 199 controlled scene images of postcards, obtained by using one postcard as the transmission and another as the reflection. Note that we only report quantitative results on Real 20 and the three SIR2 subsets, since the ground-truth transmission of Real 45 is unavailable.
Implementation Details. The model is optimized by the Adam [12] optimizer with a fixed learning rate. For stable training, we first train the first-stage subnetwork for 50 epochs, and then the whole model is jointly trained for another 100 epochs. All experiments are conducted in the PyTorch [19] environment on a PC with an Nvidia RTX 2080Ti GPU. The source code and pre-trained model are available at https://github.com/liyucs/RAGNet.
[Figure omitted: qualitative ablation on masks, with panels Input | RAGNetF | RAGNetF∘M | single-channel mask variant | RAGNet.]
Model | Real 20 | SIR2 Wild |
PSNR / SSIM | PSNR / SSIM | |
RAGNetI→T | 20.73 / 0.753 | 24.37 / 0.871 |
w/o | 20.99 / 0.758 | 24.57 / 0.865 |
w/o | 21.32 / 0.764 | 25.09 / 0.878 |
w/o | 22.08 / 0.769 | 25.09 / 0.877 |
w/o | 22.11 / 0.773 | 25.12 / 0.876 |
RAGNet | 22.95 / 0.793 | 25.52 / 0.880 |
4.2 Comparison with State-of-the-arts
We compare the proposed RAGNet with five state-of-the-art SIRR methods, \ie, CEILNet [4], Zhang et al. [41], BDN [39], ERRNet [35], and IBCLN [14], as well as the adjusted MIRR method of Lei et al. [21], which takes only the observation as the network input. We finetune these models on our training dataset and report the better result between the finetuned version and the released one. For Lei et al. [21], we adjust the network input and retrain the model on our training set.
The PSNR and SSIM metrics of all competing methods are reported in Table 2. Note that the results on Real 45 are omitted due to the lack of ground-truth; instead, a user study is conducted on Real 45 to evaluate the reflection removal quality, and its results also show the superiority of our RAGNet (details are given in the supplementary material). It can be seen that the proposed RAGNet achieves the best PSNR on three of the four datasets, and surpasses all other methods in average quantitative performance by a large margin (0.8 dB in PSNR). The quantitative gain against other deep cascaded models [4, 39, 14, 21] also indicates the benefit of appropriate reflection-aware guidance.
Fig. 4 shows the qualitative results and the corresponding ground-truth on three SIR2 subsets and Real 20 dataset, and the results on Real 45 dataset are given in Fig. 5. One can see that CEILNet may fail when the reflections have clear edges, where the smoothness assumption is violated (\eg, the second row of Fig. 4). Zhang et al. [41] is prone to over-processing, which generates dim results with color aberration in most cases, while BDN [39] tends to generate brighter output images than input ones. Moreover, the reflections seem insufficiently removed in the results of ERRNet [35] and IBCLN [14]. In general, our method outperforms the competing methods, and removes the reflection areas more accurately and thoroughly. Please refer to the supplementary material for more qualitative results.
5 Ablation Study
In this section, we perform ablation studies on two outdoor real-world datasets, \ie, Real 20 and SIR2 Wild.
5.1 Two-stage Architecture
Our RAGNet is designed as a two-stage network, which takes the reflection as an intermediate result to ease the reflection removal procedure. However, one may wonder whether a well-designed one-stage network is sufficient for SIRR. To show the necessity and superiority of the two-stage architecture, we additionally design a single-stage variant, namely RAGNetI→T, which directly estimates the transmission layer from the input.
According to Fig. 2, the generation of the transmission in RAGNet requires features from two sources, \ie, the encoders and the decoder. We expect $M_{dif}$ to approach 0 when the reflection is heavy, as in Eqn. (2), so that only the decoder feature provides information for recovering heavy reflection areas. When designing RAGNetI→T, we retain the functionality of the skip connections and the decoder: the former provides auxiliary information via subtraction operations, while the latter focuses on gradually recovering the transmission. The structure of the one-stage model is provided in the supplementary material. The results in Table 3 and Fig. 6 show that the one-stage variant RAGNetI→T suffers a large performance degradation, indicating that the two-stage architecture is essential for our RAGNet.
Model | Real 20 | SIR2 Wild |
PSNR / SSIM | PSNR / SSIM | |
RAGNetF | 20.93 / 0.750 | 24.47 / 0.875 |
RAGNetF∘M | 21.60 / 0.772 | 24.86 / 0.875 |
RAGNet | 21.69 / 0.774 | 24.69 / 0.871 |
RAGNet | 22.95 / 0.793 | 25.52 / 0.880 |
5.2 RAG Module
Difference Features. To evaluate the effectiveness of the difference features, we replace $F_{dif}$ with the observation encoder feature and keep the other parts of the model unchanged, \ie, the subtraction operation is discarded. As shown in Fig. 6, without $F_{dif}$, the model performs poorly in suppressing the reflection, leading to an obvious performance drop in Table 3.
Masks. In order to verify the settings about masks, we conduct ablation studies on three variants: (i) RAGNetF: the masks are totally discarded, and the partial convolution is accordingly replaced by vanilla convolution; (ii) RAGNetF∘M: the masks are directly multiplied with the features, in other words, the renormalization operation in the partial convolutions is removed; (iii) a single-channel variant of RAGNet: a unified one-channel mask is predicted for the difference feature and the decoder feature, respectively, instead of the per-channel masks.
Table 4 shows the quantitative results of the ablation studies on masks. It can be seen that our RAGNet surpasses the variants RAGNetF and RAGNetF∘M in both PSNR and SSIM, indicating the effectiveness of the mask and the partial convolution. Furthermore, the comparison with the single-channel variant shows that predicting a per-channel mask better suits the SIRR task in our framework.
Fig. 7 shows the qualitative results of the ablation studies. It can be seen that RAGNetF, RAGNetF∘M, and the single-channel variant all fail to remove the heavy reflections, especially in complex environments (\eg, in the second row of Fig. 7, the light on the ground is very similar to the reflections). RAGNet outperforms the others by recovering the heavy reflection areas with contextual information, making the output visually pleasing. Moreover, by integrating partial convolution, the reflection removal performance is also enhanced for images with mild reflections.
5.3 Mask Loss
For the proposed mask loss, we conduct three experiments, \ie, discarding $\mathcal{L}_{m_0}$, $\mathcal{L}_{m_1}$, and both of them, respectively. Table 3 and Fig. 6 show that the mask loss improves the performance of our RAGNet by a large margin. Furthermore, one can see that both terms of the specifically designed mask loss are essential for RAGNet, and greatly boost the quantitative performance and visual quality. The thresholds $\delta_h$ and $\delta_w$ in Eqns. (2) and (3) are discussed in the supplementary material.
6 Conclusion
In this paper, we investigated the two-stage framework for the single image reflection removal (SIRR) task, and presented RAGNet to exploit reflection-aware guidance via the RAG module. The RAG module suppresses the reflection by using the difference between observation and reflection features, while a mask is generated to indicate the extent to which the transmission is corrupted by heavy reflection and collaborates with partial convolution to mitigate the effect of deviating from the linear combination hypothesis. A mask loss is accordingly designed for reconciling the contributions of encoder and decoder features. Extensive experiments on five widely used real-world datasets indicate that the proposed RAGNet outperforms the state-of-the-art methods by a large margin.
References
- [1] Amit Agrawal, Ramesh Raskar, Shree K. Nayar, and Yuanzhen Li. Removing photography artifacts using gradient projection and flash-exposure sampling. In ACM SIGGRAPH, pages 828–835, 2005.
- [2] Nikolaos Arvanitopoulos, Radhakrishna Achanta, and Sabine Susstrunk. Single image reflection suppression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4498–4506, 2017.
- [3] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [4] Qingnan Fan, Jiaolong Yang, Gang Hua, Baoquan Chen, and David Wipf. A generic deep architecture for single image reflection removal and image smoothing. In IEEE International Conference on Computer Vision, pages 3238–3247, 2017.
- [5] Ying Fu, Antony Lam, Imari Sato, Takahiro Okabe, and Yoichi Sato. Separating reflective and fluorescent components using high frequency illumination in the spectral domain. In IEEE International Conference on Computer Vision, pages 457–464, 2013.
- [6] Kun Gai, Zhenwei Shi, and Changshui Zhang. Blind separation of superimposed moving images using image statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):19–32, 2012.
- [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [8] Xiaojie Guo, Xiaochun Cao, and Yi Ma. Robust separation of reflection from multiple images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2187–2194, 2014.
- [9] Byeong-Ju Han and Jae-Young Sim. Reflection removal using low-rank matrix completion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5438–5446, 2017.
- [10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [11] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711, 2016.
- [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [13] Anat Levin and Yair Weiss. User assisted separation of reflections from a single image using a sparsity prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9):1647–1654, 2007.
- [14] Chao Li, Yixiao Yang, Kun He, Stephen Lin, and John E. Hopcroft. Single image reflection removal through cascaded refinement. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3565–3574, 2020.
- [15] Yu Li and Michael S. Brown. Exploiting reflection change for automatic reflection removal. In IEEE International Conference on Computer Vision, pages 2432–2439, 2013.
- [16] Yu Li and Michael S. Brown. Single image layer separation using relative smoothness. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–2759, 2014.
- [17] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In European Conference on Computer Vision, pages 85–100, 2018.
- [18] Guilin Liu, Kevin J Shih, Ting-Chun Wang, Fitsum A Reda, Karan Sapra, Zhiding Yu, Andrew Tao, and Bryan Catanzaro. Partial convolution based padding. arXiv preprint arXiv:1811.11718, 2018.
- [19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
- [20] Abhijith Punnappurath and Michael S. Brown. Reflection removal using a dual-pixel sensor. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1556–1565, 2019.
- [21] Chenyang Lei, Xuhua Huang, Mengdi Zhang, Qiong Yan, Wenxiu Sun, and Qifeng Chen. Polarized reflection removal with perfect alignment in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1750–1758, 2020.
- [22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 234–241, 2015.
- [23] Tushar Sandhan and Jin Young Choi. Anti-glare: Tightly constrained optimization for eyeglass reflection removal. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1241–1250, 2017.
- [24] Bernard Sarel and Michal Irani. Separating transparent layers through layer information exchange. In European Conference on Computer Vision, pages 328–341, 2004.
- [25] Yoav Y. Schechner, Nahum Kiryati, and Ronen Basri. Separation of transparent layers using focus. International Journal of Computer Vision, 39(1):25–39, 2000.
- [26] YiChang Shih, Dilip Krishnan, Fredo Durand, and William T. Freeman. Reflection removal using ghosting cues. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3193–3201, 2015.
- [27] Christian Simon and In Kyu Park. Reflection removal for in-vehicle black box videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4231–4239, 2015.
- [28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [29] Sudipta N Sinha, Johannes Kopf, Michael Goesele, Daniel Scharstein, and Richard Szeliski. Image-based rendering for scenes with reflections. ACM Transactions on Graphics, 31(4):1–10, 2012.
- [30] Chao Sun, Shuaicheng Liu, Taotao Yang, Bing Zeng, Zhengning Wang, and Guanghui Liu. Automatic reflection removal using gradient intensity and motion cues. In ACM International Conference on Multimedia, pages 466–470, 2016.
- [31] Richard Szeliski, Shai Avidan, and Padmanabhan Anandan. Layer extraction from multiple images containing reflections and transparency. In IEEE Conference on Computer Vision and Pattern Recognition, pages 246–253, 2000.
- [32] Renjie Wan, Boxin Shi, Ling-Yu Duan, Ah-Hwee Tan, Wen Gao, and Alex C Kot. Region-aware reflection removal with unified content and gradient priors. IEEE Transactions on Image Processing, 27(6):2927–2941, 2018.
- [33] Renjie Wan, Boxin Shi, Ling-Yu Duan, Ah-Hwee Tan, and Alex C. Kot. Benchmarking single-image reflection removal algorithms. In IEEE International Conference on Computer Vision, pages 3922–3930, 2017.
- [34] Renjie Wan, Boxin Shi, Ah-Hwee Tan, and Alex C. Kot. Depth of field guided reflection removal. In IEEE International Conference on Image Processing, pages 21–25, 2016.
- [35] Kaixuan Wei, Jiaolong Yang, Ying Fu, David Wipf, and Hua Huang. Single image reflection removal exploiting misaligned training data and network enhancements. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8178–8187, 2019.
- [36] Patrick Wieschollek, Orazio Gallo, Jinwei Gu, and Jan Kautz. Separating reflection and transmission images in the wild. In European Conference on Computer Vision, pages 89–104, 2018.
- [37] Chaohao Xie, Shaohui Liu, Chao Li, Ming-Ming Cheng, Wangmeng Zuo, Xiao Liu, Shilei Wen, and Errui Ding. Image inpainting with learnable bidirectional attention maps. In IEEE International Conference on Computer Vision, pages 8858–8867, 2019.
- [38] Tianfan Xue, Michael Rubinstein, Ce Liu, and William T Freeman. A computational approach for obstruction-free photography. ACM Transactions on Graphics, 34(4):1–11, 2015.
- [39] Jie Yang, Dong Gong, Lingqiao Liu, and Qinfeng Shi. Seeing deeply and bidirectionally: A deep learning approach for single image reflection removal. In European Conference on Computer Vision, pages 654–669, 2018.
- [40] Jiaolong Yang, Hongdong Li, Yuchao Dai, and Robby T. Tan. Robust optical flow estimation of double-layer images under transparency or reflection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1419, 2016.
- [41] Xuaner Zhang, Ren Ng, and Qifeng Chen. Single image reflection separation with perceptual losses. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4786–4794, 2018.
Supplemental Materials
A Network Structure
Our RAGNet is composed of two components, \ie, a first-stage subnetwork to predict the reflection layer and a second-stage subnetwork to generate the transmission layer on the basis of the estimated reflection.
Both subnetworks follow the structural configuration of U-Net. In particular, all the encoders of the two subnetworks share the same network structure, while the decoders differ slightly in the use of partial convolution. The detailed structures are shown in Table A.
Encoder | Decoder | ||||
Layer | Output size | Filter | Layer | Output size | Filter |
Conv, ReLU | Transposed Conv | ||||
Conv, ReLU | Concat, , ReLU | ||||
MaxPool | - | [Conv, ReLU] 3 | |||
Conv, ReLU | |||||
Conv, ReLU | Transposed Conv | ||||
Conv, ReLU | Concat, , ReLU | ||||
MaxPool | - | [Conv, ReLU] 3 | |||
Conv, ReLU | |||||
Conv, ReLU | Transposed Conv | ||||
[Conv, ReLU] 3 | Concat, , ReLU | ||||
MaxPool | - | Conv, ReLU | |||
Conv, ReLU | |||||
Conv, ReLU | Transposed Conv | ||||
[Conv, ReLU] 3 | Concat, , ReLU | ||||
MaxPool | - | Conv, ReLU | |||
[Conv, ReLU] 4 | Conv, ReLU | ||||
Conv, Sigmoid |
B One-stage Structure
The structure of the single-stage network RAGNetI→T used in the ablation study is shown in Fig. A. RAGNetI→T predicts the transmission layer $\hat{T}$ directly from the observation $I$.
C Running Time and Number of Parameters
RAGNetI→T has 63.5M parameters, which is slightly fewer than Zhang et al. (77.6M) and BDN (75.2M), yet achieves comparable performance according to Tables 2 and 3 in the main text, while the full RAGNet (including both subnetworks) has 130.9M parameters. We also note that RAGNet has a running time of 0.15 s per image (540×400) on an Nvidia RTX 2080Ti GPU, which is comparable to Zhang et al. and faster than the other competing methods (see Table D).
D Ablation Study on the Thresholds $\delta_h$ and $\delta_w$
$\delta_h$ and $\delta_w$ are thresholds that define the intensity levels of reflection, \ie, areas with reflection intensity higher than $\delta_h$ are treated as heavy reflection areas, while areas with reflection intensity lower than $\delta_w$ are treated as weak reflection areas. We conduct an ablation study on $\delta_h$ and $\delta_w$ on the Real 20 and SIR2 test datasets with two groups of candidate values. Considering the PSNR and SSIM performance exhibited in Fig. B, the final thresholds are chosen accordingly.
[Figure B omitted: PSNR / SSIM under different threshold settings.]
Method | CEILNet [4] | Zhang [41] | BDN [39] | ERRNet [35] | IBCLN [14] | RAGNetI→T | RAGNet |
Time (s) | 0.26 | 0.16 | 0.22 | 0.27 | 0.22 | 0.08 | 0.15 |
Parameter (M) | 2.2 | 77.6 | 75.2 | 86.7 | 21.2 | 63.5 | 130.9 |
Datasets | SIR2 Solid (200) | SIR2 Postcard (199) | SIR2 Wild (55) | Real 20 (20) | Nature (20) | Average (494) |
PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | PSNR / SSIM | |
RAGNet | 26.16 / 0.907 | 24.42 / 0.883 | 25.99 / 0.890 | 22.48 / 0.786 | 22.95 / 0.792 | 25.16 / 0.886 |
Reflection Intensity | CEILNet [4] | Zhang [41] | BDN [39] | ERRNet [35] | IBCLN [14] | Ours |
Weak | 21.51 | 19.38 | 21.86 | 23.79 | 24.28 | 25.18 |
Strong | 21.01 | 20.29 | 21.14 | 21.32 | 21.43 | 23.40 |
E Artifacts in IBCLN [14]
F Reflection Removal Results
Fig. D and Fig. E show more reflection removal results of all competing methods (\ie, CEILNet [4], Zhang et al. [41], BDN [39], ERRNet [35], IBCLN [14], and our RAGNet). Fig. D shows results on Real 45, where the ground-truth is unavailable. Fig. E shows results on the SIR2 Solid, SIR2 Postcard, SIR2 Wild, and Real 20 test datasets, with three images from each dataset.
G Results on Images with Fully Covering Reflection
Fig. F further shows the results on four images with fully covering reflection. It can be seen that RAGNet performs equally well in this scenario.
H User Study on Real45
A user study on the Real 45 dataset is conducted to evaluate the quality of reflection removal. Our method is compared with three other methods, including ERRNet [35] and IBCLN [14], which perform the second and third best in terms of PSNR / SSIM metrics, and Zhang et al. [41], which performs well on the Real 20 dataset. We randomly select 20 images from the Real 45 dataset. 30 participants are invited for the user study, and each of them is given 20 questions. Each question contains the original input image and four choices, which are the results of the four candidate methods. The choices are arranged in a random order for a fair comparison. The users are instructed to choose the best result in terms of reflection removal ability and image quality. The results are reported in Table E. It can be seen that our RAGNet has a much higher probability of being chosen as the best result among the four methods.
Method | Zhang et al. [41] | ERRNet [35] | IBCLN [14] | Ours |
Real 45 (selected as best, %) | 30.3 | 10.5 | 9.0 | 50.17 |
I Performance with Additional Training Data
We retrain our model by including 200 additional training image pairs from [14] and report the results in Table C. Except for Real 20, the inclusion of additional training images is beneficial to network performance on all the other datasets.
[Figure G omitted: input examples where the ground-truth reflection covers the whole image but is only partially visible in the input.]
J Detection of Reflection
To verify whether the reflection regions can be successfully detected by our network, we compare the similarity between the estimated reflection $\hat{R}$ and the residual $I - T$. As a result, the average PSNR is 24.01 dB over the four test sets (the three SIR2 subsets and Real 20). It should be noted that the given ground-truth reflection in SIR2 generally covers the whole image but is only partially visible in the input (see Fig. G). Therefore, we use $I - T$ instead of the ground-truth reflection $R$ for evaluating the detection of reflection.
K Ability in Removing Strong and Weak Reflection
While our RAGNet is designed with strong reflection regions in mind, it also works well on regions with weak reflection. To verify this point, we calculate the PSNR separately in the weak and strong reflection regions on the four test sets (the SIR2 subsets and the Real 20 dataset). As shown in Fig. 3, the elements of $M$ tend to be 0 in strong reflection regions and 1 in weak reflection regions. Therefore, we can separately evaluate the network performance by calculating the PSNR in regions with weak reflection intensity based on $M$. Specifically, given the estimated transmission $\hat{T}$ and the ground-truth transmission $T$, the PSNR for weak reflection regions is calculated as:
$$\mathrm{PSNR}_{weak} = 10 \log_{10} \frac{V_{max}^2}{\mathrm{MSE}_{weak}}, \qquad (11)$$
where
$$\mathrm{MSE}_{weak} = \frac{\sum_{p} M_p\, \big(\hat{T}_p - T_p\big)^2}{\sum_{p} M_p}, \qquad (12)$$
where $V_{max}$ is the maximum possible pixel value and $p$ indexes the pixels. Accordingly, the PSNR for strong reflection regions can be calculated by replacing $M$ with $1 - M$. The results are shown in Table D. It can be seen that: 1) for weak / no reflection regions, our method achieves a 0.9 dB PSNR gain; 2) for heavy reflection regions, the PSNR of our method exceeds that of the others by 1.9 dB.
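A region-restricted PSNR in the spirit of Eqs. (11)-(12) can be computed as in the sketch below; binarizing the mask into a weak-reflection region map is assumed to be done beforehand.

```python
import torch


def region_psnr(t_hat, t_gt, region, max_val=1.0, eps=1e-12):
    """PSNR restricted to the pixels selected by a binary `region` map.

    `region` is 1 for pixels in, e.g., weak-reflection areas; passing
    (1 - region) gives the PSNR of the complementary strong-reflection areas.
    """
    region = region.expand_as(t_hat)
    mse = ((t_hat - t_gt) ** 2 * region).sum() / (region.sum() + eps)
    return 10.0 * torch.log10(max_val ** 2 / (mse + eps))
```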
[Figure H omitted: network performance given reflection estimates from first-stage models trained for different numbers of epochs.]
L Robustness Against the Quality of the Estimated Reflection
Empirically, a higher-quality $\hat{R}$ yields better network performance. Nonetheless, RAGNet also exhibits considerable robustness against the quality of $\hat{R}$ within a certain range. To show this, we individually train the first-stage subnetwork for different numbers of epochs and feed its predictions to the second stage. As shown in Fig. H, only slight fluctuations of network performance are observed given $\hat{R}$ generated at different epochs, indicating the robustness of RAGNet against $\hat{R}$.