Unsupervised Shadow Removal Using Target-Consistency Generative Adversarial Network
Abstract
In this paper, we develop a simple yet effective target-consistency generative adversarial network (TC-GAN) for the shadow removal task in an unsupervised manner. In contrast to the bidirectional mapping of cycle-consistency GAN based methods for unsupervised shadow removal, TC-GAN aims to learn a unidirectional mapping that translates shadow images into shadow-free ones. With the proposed target-consistency constraint, which is designed to connect two dual GAN-based sub-networks, both the correlation between the input shadow image and the output shadow-free image and the realism of the recovered shadow-free image are strictly constrained. Extensive comparison experiments show that TC-GAN outperforms the state-of-the-art unsupervised shadow removal methods by 14.9% in terms of FID and 31.5% in terms of KID. Remarkably, TC-GAN also achieves performance comparable to supervised shadow removal methods.
1 Introduction
Recognizing and removing shadows from natural images is important for many computer vision tasks such as object detection, recognition and tracking [39, 32, 33, 16, 31]. In principle, shadow removal can be formulated as an image restoration problem, where a shadow-free image is recovered either by element-wise addition of the input shadow image and a negative residual generated from the shadow image, as in Equation (1) [30, 4, 45], or by a linear transformation of the shadow image with a shadow matte (or shadow factors) [25, 28, 1].
$$I_{free} = I_s \oplus \mathcal{R}(I_s) \quad \text{or} \quad I_{free} = \mathcal{T}(I_s, M), \tag{1}$$

where $I_s$ is the input shadow image, $\mathcal{R}(I_s)$ is the estimated (negative) shadow residual, $\oplus$ denotes element-wise addition, and $\mathcal{T}(\cdot, M)$ is a linear transformation parameterized by the shadow matte $M$.
In practice, because of environmental uncertainties, such as varying shadow depths cast by objects of different materials under different illumination conditions, and the visual similarity between shadow regions and some objects in the image, shadow removal is quite challenging and has long been a fundamental research topic.

Traditional shadow removal methods usually localize and remove shadows by designing hand-crafted color and illumination features based on prior assumptions, e.g., consistent illumination within the shadow region [34, 42, 51, 41]. Recently, deep learning based shadow removal approaches have achieved remarkable performance [4, 47, 28, 19]. In these methods, the shadow matte or mask and the removal operation are learned from paired shadow removal datasets in a supervised manner. In particular, generative adversarial learning [18], which learns a non-linear mapping network by restricting the output shadow-free images to be indistinguishable from the non-shadow label images, has proven effective for the shadow removal task [4, 37, 8]. However, the problem with this kind of supervised learning is that high quality paired shadow and non-shadow images are difficult and costly to obtain in practice. Comparatively, unpaired natural scene images containing shadows and non-shadow images are easy to acquire. This allows us to collect a large number of shadow images and mismatched non-shadow images covering a variety of scenes for unsupervised shadow removal learning. However, without corresponding shadow-free labels for supervision, unsupervised shadow removal is much more difficult, and few existing works have achieved satisfactory results so far, as illustrated by the results of CycleGAN [21] and Mask-ShadowGAN [48] in Figure 3.

Inspired by the recent unsupervised shadow removal work [48] that casts the unsupervised shadow removal task as an image-to-image translation problem based on a cycle-consistent adversarial network, in this work, we present a target-consistency generative adversarial network (TC-GAN) that aims at translating the input shadow image into a shadow-free image from unpaired data in an unsupervised manner. Unlike the bidirectional mapping (the two opposite mappings in Figure 1 (a)) of the cycle-consistency constraint-based model for unsupervised learning in [48], which requires the generated shadow mask to reconstruct the image from the non-shadow domain back to the shadow domain, TC-GAN learns a one-sided mapping network to directly translate the shadow image into a shadow-free image, as shown in Figure 1 (b).
In particular, TC-GAN draws inspiration from two facts. The first is that multiple shadow-discrepant images may correspond to the same shadow-free image. In other words, if the generator receives two (or more) shadow-discrepant images (or features) that correspond to one shadow-free image, the resulting shadow-free outputs should be consistent. Here, we propose a target consistency constraint to supervise the content consistency between the two outputs. The question, then, is how to obtain two shadow-discrepant images (or features) for a given input shadow image. In this work, we explicitly design a set of shadow transform encoders (STE) in the generative process of TC-GAN to transform the input shadow image into two feature representations (illustrated in Figure 1), which learn to encode different intrinsic compositions of the non-shadow features and shadow residuals. The second fact is that different non-shadow natural images share some common illumination and brightness characteristics, for example, the general absence of dramatic brightness variations. This motivates us to adapt the discriminators in TC-GAN to learn these hidden common features from the large-scale non-shadow natural images available in the unsupervised shadow removal dataset.
To realize the overall process, we design TC-GAN as a dual generative adversarial network. Since each discriminator distinguishes a different real non-shadow image from the shadow-free image generated by its corresponding generator, the two generative adversarial sub-networks have different adversarial optimization objectives. Together with the target consistency constraint during training, the whole network guarantees that: (i) different feature embeddings can be generated from the input shadow image, and (ii) the decoded shadow-free images from the two generators not only have consistent content but also resemble real shadow-free images. Finally, a simple model selection module selects the better shadow-free result from the two generators.
The proposed unidirectional mapping generative adversarial network solves the unsupervised shadow removal problem from a new perspective. As extensive comparison experiments on the challenging unsupervised shadow removal dataset show, TC-GAN achieves the best performance in terms of both quantitative and qualitative comparison over state-of-the-art unsupervised shadow removal approaches. Furthermore, on the widely used supervised shadow removal dataset, TC-GAN even obtains results comparable to recent strong supervised methods.

2 Related Work
2.1 Shadow Removal
Natural scene shadow removal has been widely studied. Early methods usually focused on physically modeling the shadow image and estimating the model parameters in terms of gradient domain manipulation [9, 11, 2], shadow illumination and color transfer [51, 12, 10, 7], and accurate shadow matting [34, 42, 44]. Recently, driven by shadow removal datasets with paired shadow and shadow-free images [28, 19], deep learning-based methods have been widely used to learn the mapping between shadow images and their corresponding non-shadow images [4, 37, 29, 20]. Qu et al.[28] propose an end-to-end multi-context embedding network that integrates high-level semantic context for shadow removal. Zhang et al.[30] use a unified end-to-end framework to explore residual and illumination cues for shadow removal. For learning shadow detection and removal jointly, Wang et al.[19] propose a stacked conditional generative adversarial network, and Hu et al.[47] present a deep neural network for shadow detection and removal that analyzes the spatial image context in a direction-aware manner.
To date, only a few methods have explored unsupervised shadow removal on unpaired data, and the only unsupervised dataset was proposed in [48]. In particular, Le and Samaras [26] use shadow and non-shadow patches cropped from the shadow images and train a physical shadow formulation-based model via an adversarial framework. Hu et al.[48] present Mask-ShadowGAN, which learns shadow removal from unpaired training data. The method makes use of the cycle consistency of CycleGAN [21], and, guided by a binary mask generated in the generative process, facilitates the bidirectional mapping that transforms the shadow image into a shadow-free image.
2.2 Unsupervised Adversarial Learning
Unsupervised shadow removal can be regarded as a domain mapping problem. Recent domain mapping works have used adversarial learning [18] to solve unpaired domain mapping [23, 53, 50]. Generative adversarial networks [18] are powerful adversarial learning models, which learn a generator and a discriminator jointly such that the generator produces realistic images that confuse the discriminator [35, 49, 46]. The cycle consistency constraint based methods [21, 24, 54] learn a bijective mapping by enforcing cycle consistency between the reconstructed data and the input, producing convincing mappings from the source domain to the target domain without paired supervision. Many subsequent methods have been proposed to improve the cycle-consistency constraint [3, 36, 22, 14, 6]. Furthermore, for one-sided unsupervised domain mapping, Benaim and Wolf [43] propose to maintain the distances between samples within domains, and Fu et al.[17] develop a geometry-consistent generative adversarial network to perform image-to-image translation.
3 TC-GAN for Unsupervised Shadow Removal
Without paired shadow and non-shadow images for supervision, the non-linear mapping from the shadow domain to the non-shadow domain cannot be learned directly from the unpaired dataset. The method in [48] adopts the CycleGAN structure and renders a backward mapping with the help of the mask generated in the forward pass, as shown in Figure 1 (a), using the input shadow image as the reference to supervise the reconstructed image. Unlike this CycleGAN based unsupervised shadow removal, which depends on the predicted shadow mask, TC-GAN is a unidirectional mapping network that directly learns the shadow-to-shadow-free mapping within a generative adversarial learning framework. Motivated mainly by the fact that if two different shadow images were captured by shading one non-shadow image differently, the shadow-free images recovered from them should be the same, we build TC-GAN as a dual generator-discriminator GAN structure that embeds this assumption into network learning. The overall architecture of TC-GAN is shown in Figure 2.
3.1 Generative Network in TC-GAN
For an input shadow image $I_s$, we design two generators, $G_1$ and $G_2$, to learn two separate mappings, deriving two shadow-free images $\hat{I}_1$ and $\hat{I}_2$ from $I_s$, respectively. On the one hand, in order to confine $\hat{I}_1$ and $\hat{I}_2$ to be content consistent, we propose a target consistency constraint acting on the outputs of the two generators. On the other hand, the two generators are required to produce two different representations of $I_s$ to mimic the assumption that two different shadow-discrepant images correspond to the same non-shadow image.
The two generators are mainly composed of two encoder-decoders with a dual network structure, which we term shadow residual generators (the two corresponding sub-networks in Figure 2). The two sub-networks are expected to generate two different shadow residual images from $I_s$. In particular, the encoder in each generator is a shadow transform encoder (denoted by $E_1$ and $E_2$ in Figure 2). During network training, each STE learns to transform $I_s$ from the image domain into a feature composition that encodes the inherent relations between the non-shadow features and the shadow residual features of $I_s$. The subsequent shadow residual decoders are designed to decode these feature representations from the feature domain into shadow residual images layer by layer. Finally, by applying element-wise addition between $I_s$ and each of the two residuals, the two separate shadow-free images $\hat{I}_1$ and $\hat{I}_2$ are recovered.
In order to ensure that $G_1$ and $G_2$ produce two different shadow-free images from $I_s$, we intentionally initialize the dual sub-networks with different parameters, and the differences are maintained throughout training by using different adversarial optimization objectives. Ideally, $\hat{I}_1$ and $\hat{I}_2$ would be exactly the same image under the target consistency constraint; however, since they are produced by generators trained with different adversarial objectives, we build a model selection module (MSM) to make a selection between the two shadow-free images. The MSM is a pre-trained shadow/non-shadow binary classifier, which is attached to the end of the generative network of TC-GAN and used only in the inference phase. By comparing the output probabilities of the two classes, it selects whichever of the two generated images ($\hat{I}_1$ and $\hat{I}_2$) is judged with higher confidence to be a real non-shadow image as the final output.
To summarize, the overall process of the generative network of TC-GAN can be formulated as:
$$\hat{I}_1 = I_s \oplus R_1\big(E_1(I_s)\big), \qquad \hat{I}_2 = I_s \oplus R_2\big(E_2(I_s)\big), \tag{2}$$

where $E_i$ and $R_i$ denote the shadow transform encoder and the shadow residual decoder of generator $G_i$ ($i = 1, 2$), and $\oplus$ is the element-wise addition.
In the model implementation, we build the shadow transform encoders with three stride-2 convolution layers and nine residual blocks [54], and the shadow residual decoders with three stride-2 deconvolution layers. Each layer in both the encoders and the decoders is followed by an instance normalization and a ReLU activation function. The MSM is built from four stride-2 convolution layers, each using the same normalization and activation function as the encoders.
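For concreteness, the following PyTorch sketch assembles one shadow residual generator from the layer counts described above (three stride-2 convolutions plus nine residual blocks in the STE, three stride-2 deconvolutions in the decoder, instance normalization and ReLU after each layer). The class names, kernel sizes and channel widths are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

class ShadowResidualGenerator(nn.Module):
    """Shadow transform encoder + shadow residual decoder; output is the input plus the residual."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        enc, ch = [], in_ch
        # Shadow transform encoder: three stride-2 convolutions followed by nine residual blocks.
        for out_ch in (base, base * 2, base * 4):
            enc += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                    nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        enc += [ResidualBlock(ch) for _ in range(9)]
        self.encoder = nn.Sequential(*enc)
        # Shadow residual decoder: three stride-2 deconvolutions back to image resolution.
        dec = []
        for out_ch in (base * 2, base, in_ch):
            dec += [nn.ConvTranspose2d(ch, out_ch, 4, stride=2, padding=1),
                    nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.decoder = nn.Sequential(*dec)

    def forward(self, shadow_img):
        residual = self.decoder(self.encoder(shadow_img))
        return shadow_img + residual  # element-wise addition recovers the shadow-free image
```

Following the layer description literally, every layer here ends in ReLU; a practical implementation might use a bounded activation (e.g., Tanh) on the last decoder layer so the residual can both darken and brighten pixels.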
3.2 Discriminative Network in TC-GAN
Similar to the generative network, the discriminative network contains two discriminators with the same network structure, corresponding to the two generators $G_1$ and $G_2$. Each discriminator contains four stride-2 convolutional layers, each followed by an instance normalization and a Leaky ReLU activation function (slope 0.2).
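A matching sketch of one discriminator is given below; the names and kernel sizes are again assumptions, and the final 1-channel convolution that turns the features into a real/fake score map is our addition, since the text only specifies the four stride-2 layers.

```python
import torch.nn as nn

class ShadowFreeDiscriminator(nn.Module):
    """Four stride-2 convolutions with InstanceNorm and LeakyReLU(0.2), plus a score layer."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in (base, base * 2, base * 4, base * 8):
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(out_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 4, padding=1)]  # patch-wise real/fake score map (our addition)
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)
```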
In GAN based image-to-image translation approaches for unpaired data, image data from the same domain are believed to share some common characteristics. Motivated by this assumption and the fact that images from the non-shadow domain agree on aspects such as consistent illumination and brightness, we train each discriminator in the dual discriminative network adversarially by feeding it a randomly selected sample from the non-shadow image set as the real non-shadow reference (denoted by $y_1$ and $y_2$ for the two discriminators in Figure 2). In addition, to encourage the two generators to learn individually, we intentionally choose two different real non-shadow images from the dataset, i.e., $y_1 \neq y_2$, for the two discriminators.
3.3 Target Consistency Constraint
As mentioned before, the target-consistency constraint is proposed to confine the correlation between $\hat{I}_1$ and $\hat{I}_2$. Specifically, given an input sample $I_s$ drawn from the shadow image data distribution $p_{data}(S)$, the outputs of the two generators $G_1$ and $G_2$ are restricted to be as close as possible by applying the target-consistency constraint:
$$\mathcal{L}_{tc}(G_1, G_2) = \mathbb{E}_{I_s \sim p_{data}(S)}\Big[\big\| \hat{I}_1 - \hat{I}_2 \big\|_1\Big]. \tag{3}$$
The key role of the target consistency constraint is to connect the two sub-GAN networks ($G_1$-$D_1$ and $G_2$-$D_2$) for the overall learning of TC-GAN. Beyond that, it drives an adversarial interplay between the consistency of $\hat{I}_1$ and $\hat{I}_2$ and the distinctiveness of the two sub-networks, in addition to the adversarial constraint within each sub-network. Furthermore, the target-consistency loss restricts the output shadow-free targets directly, whereas the binary-mask-guided cycle-consistency constraint in [48] can induce random disturbances, leading to ambiguous textures in the recovered shadow-free image, as shown in Figure 3 (c).
3.4 Full Objective
For training each sub-GAN, we use the generic adversarial constraint [18] so that the generator and discriminator jointly optimize the adversarial loss. Specifically, let $p_{data}(S)$ and $p_{data}(N)$ represent the data distributions of the shadow domain $S$ and the non-shadow domain $N$, respectively. The adversarial loss for each sub-GAN can then be expressed as:
$$\mathcal{L}_{adv}^{i}(G_i, D_i) = \mathbb{E}_{y_i \sim p_{data}(N)}\big[\log D_i(y_i)\big] + \mathbb{E}_{I_s \sim p_{data}(S)}\big[\log\big(1 - D_i(\hat{I}_i)\big)\big], \quad i = 1, 2, \tag{4}$$
where $y_1$ and $y_2$ are the two different real non-shadow images used to supervise the restoration of realistic shadow-free images.
In addition, following most GAN-based image-to-image translation works on unpaired data [52], we further use an identity loss to guide the generators with real non-shadow images, with the purposes of (i) preserving the consistency between the generated shadow-free image and the input image in the non-shadow areas in terms of color and other appearance features [21], and (ii) accelerating convergence for both generators. The identity losses of $G_1$ and $G_2$ on the non-shadow data $y_1$ and $y_2$ are defined as:
$$\mathcal{L}_{idt}^{i}(G_i) = \mathbb{E}_{y_i \sim p_{data}(N)}\Big[\big\| G_i(y_i) - y_i \big\|_1\Big], \quad i = 1, 2. \tag{5}$$
In summary, the final loss function of TC-GAN is a weighted sum of the adversarial losses of both sub-GANs, the target-consistency loss, and the identity losses:
$$\mathcal{L}_{TC\text{-}GAN} = \lambda_{adv}\big(\mathcal{L}_{adv}^{1} + \mathcal{L}_{adv}^{2}\big) + \lambda_{tc}\,\mathcal{L}_{tc} + \lambda_{idt}\big(\mathcal{L}_{idt}^{1} + \mathcal{L}_{idt}^{2}\big), \tag{6}$$
where $\lambda_{adv}$, $\lambda_{tc}$ and $\lambda_{idt}$ are trade-off hyperparameters weighting the contributions of the adversarial, target-consistency and identity losses, respectively. In this work, we empirically set the three weights to 1, 40, and 5, respectively, in all experiments. Furthermore, the negative log-likelihood objective in the adversarial loss is replaced by a least-squares loss, which has been shown to be more stable and effective than the original log-likelihood formulation [49]. Finally, we optimize the whole network in a minimax manner.
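To make the interaction of the losses concrete, the following sketch computes one generator objective and one discriminator objective under Equations (3)-(6), using the least-squares form of the adversarial loss mentioned above. We assume the reported weights 1, 40 and 5 correspond to the adversarial, target-consistency and identity terms in that order, and that both consistency and identity terms use an L1 distance; the function names are ours.

```python
import torch
import torch.nn.functional as F

def generator_loss(G1, G2, D1, D2, shadow_img, nonshadow_1, nonshadow_2,
                   lambda_adv=1.0, lambda_tc=40.0, lambda_idt=5.0):
    fake1, fake2 = G1(shadow_img), G2(shadow_img)   # two shadow-free candidates
    pred1, pred2 = D1(fake1), D2(fake2)
    # Least-squares adversarial terms: each generator tries to make its discriminator output 1.
    adv = F.mse_loss(pred1, torch.ones_like(pred1)) + F.mse_loss(pred2, torch.ones_like(pred2))
    # Target-consistency term (Eq. 3): the two shadow-free outputs should agree.
    tc = F.l1_loss(fake1, fake2)
    # Identity terms (Eq. 5): a real non-shadow image should pass through unchanged.
    idt = F.l1_loss(G1(nonshadow_1), nonshadow_1) + F.l1_loss(G2(nonshadow_2), nonshadow_2)
    return lambda_adv * adv + lambda_tc * tc + lambda_idt * idt

def discriminator_loss(D, real_nonshadow, fake_shadow_free):
    # Least-squares GAN: real non-shadow image -> 1, generated image (detached) -> 0.
    pred_real = D(real_nonshadow)
    pred_fake = D(fake_shadow_free.detach())
    return 0.5 * (F.mse_loss(pred_real, torch.ones_like(pred_real)) +
                  F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
```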
3.5 Implementation Details
During training, we use the Adam solver [38] with a base learning rate of 0.0002 and first and second momentum values of 0.5 and 0.999 for network optimization. The parameters of all generators and discriminators are randomly initialized from a zero-mean Gaussian distribution with a standard deviation of 0.02. The network is trained for a total of 200 epochs with a mini-batch size of one. In addition, we fix the learning rate for the first 100 epochs and gradually reduce it to zero with a linear decay over the remaining epochs. All images are first re-scaled to 286×286 and then randomly cropped to 256×256 as inputs. Lastly, we build our model with the PyTorch library and train it on a computer with a single NVIDIA GTX 1080Ti GPU.
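The optimization setup above can be summarized in a short sketch (helper names are our own): Adam with the stated learning rate and momentum values, zero-mean Gaussian weight initialization, and a learning rate held constant for 100 epochs and then linearly decayed to zero. In practice, the generators and discriminators would each receive such an optimizer/scheduler pair, and `init_weights` would be applied via `net.apply(init_weights)`.

```python
import itertools
import torch

def init_weights(module, std=0.02):
    # Zero-mean Gaussian initialization with standard deviation 0.02, as described above.
    if isinstance(module, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)

def build_optimizer_and_scheduler(networks, total_epochs=200, decay_start=100):
    # Adam with lr 2e-4 and momentum terms (0.5, 0.999).
    params = itertools.chain(*(net.parameters() for net in networks))
    optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        # Constant for the first 100 epochs, then linear decay to zero.
        if epoch < decay_start:
            return 1.0
        return 1.0 - (epoch - decay_start) / float(total_epochs - decay_start)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```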
4 Experiments
In this work, we mainly evaluate the proposed TC-GAN on the unsupervised shadow removal dataset. However, in order to further validate the method with annotated data and enable more comparisons, we also train and evaluate on a supervised shadow removal dataset.
4.1 Shadow Removal Datasets
The supervised dataset: ISTD [19] is a widely used dataset for supervised shadow removal. It contains 1330 training triplets and 540 testing triplets of shadow image, non-shadow image and binary shadow mask. The shadow images are acquired by artificially masking some shadowless regions of the non-shadow natural images. However, such artificially shadowed images cover only limited shadow patterns, and visible differences between the actual shadowless images and their non-shadow labels can be induced, as demonstrated in [48].
The unsupervised dataset: The USR [48] dataset is the first and, to date, the only unsupervised shadow removal dataset. It contains 2445 shadow images (1956 for training and 489 for testing) and 1770 non-shadow images. Note that the 1770 non-shadow images are not correlated with any of the 2445 shadow images. Moreover, the USR dataset is captured from thousands of different outdoor scenes shaded by various types of objects, covering more scenes and shadow patterns than other existing datasets.
| Methods | Stage-I Rating | Stage-II Rating | FID [15] | KID [5] | RMSE |
|---|---|---|---|---|---|
| GAN-alone [40] | 2.85±1.65 | 2.31±1.96 | 108.4338 | 2.9889±0.4872 | 20.5989 |
| DistanceGAN [43] | 3.04±1.73 | 2.67±2.18 | 103.8114 | 2.7214±0.4888 | 14.5482 |
| CycleGAN [21] | 4.36±2.02 | 3.94±1.82 | 85.9442 | 2.4699±0.4858 | 12.7868 |
| U-GAT-IT [22] | 4.12±2.21 | 3.99±1.58 | 95.6695 | 2.7124±0.4447 | 10.8764 |
| AttentionGAN [14] | 4.21±2.51 | 3.28±2.78 | 95.1523 | 2.8461±0.4816 | 4.3253 |
| Mask-ShadowGAN [48] | 4.51±1.49 | 3.46±1.57 | 85.0228 | 2.4096±0.4568 | 12.8541 |
| TC-GAN (ours) | 5.91±1.35 | 6.66±1.62 | 72.3522 | 1.6503±0.4548 | 10.5271 |
4.2 Experiments on USR Dataset
Because of its difficulty and the long-standing lack of suitable data, unsupervised shadow removal has rarely been investigated. Mask-ShadowGAN [48] is the state-of-the-art unpaired shadow removal method and has shown good performance on the USR dataset. In addition to it, we compare our TC-GAN with several recent unsupervised image-to-image translation models, all trained on the USR dataset using the officially provided source code and hyper-parameters. Specifically, the compared models are: GAN-alone (the Pix2Pix [40] model without paired supervision), DistanceGAN [43], CycleGAN [21], U-GAT-IT [22] and AttentionGAN [14]. Note that, for a fair comparison, we make all generators produce a shadow residual as described in Section 3.1, which is then added to the input shadow image to obtain the shadow-free output.
4.2.1 Qualitative Evaluation
Following the qualitative evaluation on the USR dataset in Mask-ShadowGAN [48], we conduct a user study by publishing a rating task on a crowdsourcing platform to evaluate shadow removal performance. To further compare the visual results in the non-shadow areas, we conduct a two-stage subjective experiment with 10 participants of different ages and genders. In Stage-I, each participant is given 10 randomly selected shadow images from the USR testing set. We then apply all compared methods to generate a total of 70 shadow-free images and ask the participant to rate all 70 shadow-free images on a scale from 1 (bad quality) to 10 (good quality). In Stage-II, we randomly select another 10 shadow images and use the same settings as Stage-I, except that we additionally provide the participants with the original shadow images corresponding to the 70 shadow-free images for reference. In general, Stage-I aims to evaluate the quality of the shadow-free images generated by the different methods, while Stage-II evaluates the visual effect of the different shadow removal results with respect to the input shadow images, i.e., the consistency of the non-shadow areas between the recovered shadow-free image and the original shadow image in terms of illumination, color, etc.

In total, we obtain 100 ratings (10 participants × 10 images) per method for each of Stage-I and Stage-II. We calculate the mean and standard deviation of these ratings and summarize the results in the second and third columns of Table 1, where higher ratings indicate better performance. As can be seen, our TC-GAN achieves the highest ratings in both Stage-I and Stage-II. Moreover, although the cycle-consistency based models (CycleGAN, U-GAT-IT, AttentionGAN and Mask-ShadowGAN) improve upon the vanilla GAN, they are not as good as our TC-GAN, which is based on the target consistency constraint. In addition, since the overall framework of TC-GAN is driven to maintain the correlation between the input shadow image and the generated shadow-free image through learning, it also shows a clear advantage in the Stage-II ratings.
4.2.2 Quantitative Evaluation
To provide a quantitative evaluation, which is not considered in Mask-ShadowGAN [48], we further calculate the FID [15] and KID [5] scores for all compared models. FID and KID are two commonly used evaluation metrics for image-to-image translation on unpaired data, measuring the similarity between the real image domain and the transformed image domain; lower values indicate better generated results. Besides, in order to evaluate the correlation between the input shadow image and the generated shadow-free target, we also measure the RMSE between them over the shadowless areas. Specifically, we use the BDRAR method [27] with its official pre-trained model to obtain the shadow mask for each shadow image. A smaller RMSE score represents a better match between the shadow image and the output shadow-free image, meaning that less variation is introduced in the non-shadow regions.
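As an illustration of how this non-shadow-area RMSE could be computed (names and value conventions are our assumptions; the paper does not specify the exact procedure beyond using the BDRAR mask):

```python
import numpy as np

def nonshadow_rmse(shadow_img, shadow_free_img, shadow_mask):
    """shadow_img, shadow_free_img: HxWx3 arrays; shadow_mask: HxW binary mask, 1 = shadow."""
    nonshadow = shadow_mask == 0
    diff = (shadow_img[nonshadow].astype(np.float64) -
            shadow_free_img[nonshadow].astype(np.float64))
    return float(np.sqrt(np.mean(diff ** 2)))
```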
The quantitative evaluation results of all compared methods are shown in the last three columns of Table 1. Moreover, we present the visual comparison of the shadow removal results of all methods on the USR dataset in Figure 4. Compared with all other methods, our TC-GAN obtains the smallest FID and KID scores. It should be noted that since AttentionGAN directly copies the non-shadow pixels of the input shadow image to the output shadow-free image, it obtains a relatively lower RMSE score. However, its worse results in terms of subjective rating, FID, KID and visual quality indicate that AttentionGAN fails to remove shadows effectively. In addition, it can be observed from Figure 4 that, due to the absence of sufficient constraints, the images generated by GAN-alone and DistanceGAN contain many random perturbations and artifacts. Columns 4 to 6 of Figure 4 show the results of the cycle-consistency based models. We observe that the outputs of these models retain some information from the shadow regions, leading to failures in shadow removal. We believe the reason is that the cycle-consistency constraint assumes that the generated shadow-free image can be reconstructed back to the input shadow image without additional information; as a result, the generated shadow-free image has to retain some shadow information. Mask-ShadowGAN facilitates the reverse mapping by using the generated shadow mask. However, because the binary mask cannot provide intensity and color information of the shadow area, the shadow-free images generated by Mask-ShadowGAN show large appearance deviations from the real non-shadow ones.
Comparatively, as shown in both Table 1 and Figure 4, the proposed TC-GAN not only achieves the best qualitative and quantitative scores, but also completely removes shadow regions while keeping a high correlation between the non-shadow areas of the recovered shadow-free image and the input image.

4.3 Experiment on ISTD Dataset
To further evaluate our method, we also compare TC-GAN with supervised methods on the supervised ISTD dataset. Several baseline and advanced shadow removal methods are included in the comparison, namely Gong et al.[13], Guo et al.[42], Yang et al.[41], ST-CGAN [19] and DSC [47]. We obtain their results and source code directly from the official websites, and to make a fair comparison, we re-scale all testing images to 256 × 256 via cubic spline interpolation. In addition, since TC-GAN does not use the shadow mask information provided by the ISTD dataset, while many supervised methods use it as guidance to recover the shadow-free image, we do not compare with additional state-of-the-art methods that use the full ISTD annotations for training. We evaluate the performance of all methods by calculating the RMSE between the generated shadow-free image and the annotated ground truth over several validation areas: (i) the shadow area (S), (ii) the non-shadow area (N), and (iii) the entire image (A). In addition, considering that the non-shadow labels in the ISTD dataset contain noise, we further measure the RMSE between the generated shadow-free image and the input shadow image in the non-shadow area (N-I).
| Method | S | N | A | N-I |
|---|---|---|---|---|
| Gong et al.[13] | 14.98 | 7.29 | 8.53 | - |
| Guo et al.[42] | 18.95 | 7.46 | 9.3 | - |
| Yang et al.[41] | 18.92 | 14.83 | 15.63 | - |
| ST-CGAN [19] | 10.33 | 6.93 | 7.43 | 7.45 |
| DSC [47] | 9.22 | 6.39 | 6.67 | 6.61 |
| TC-GAN | 11.49 | 5.91 | 6.85 | 6.29 |
For training TC-GAN on the ISTD dataset, we disrupt the pairing between shadow images and non-shadow images, and use random cropping to avoid pixel-level matching between any shadow image and non-shadow image in the dataset. We summarize all evaluation results in Table 2. As can be seen, our TC-GAN achieves performance comparable to the supervised methods, especially in the non-shadow area (N-I), and, importantly, it is trained without paired supervision.
Figure 5 illustrates the visual comparison of shadow removal results on the ISTD dataset. It can be seen that, due to the noise in the training labels, the supervised methods often cannot remove shadows perfectly (the first and second rows) or keep the non-shadow areas unchanged (the third and fourth rows). Comparatively, TC-GAN effectively removes the shadows while retaining the content and illumination of the non-shadow areas, even though paired data are not used for training.
5 Conclusion
In this work, we propose a novel target consistency generative adversarial network, named TC-GAN, for unsupervised shadow removal. TC-GAN uses a target consistency constraint to learn a unidirectional mapping from the shadow domain to the shadow-free domain. The overall network is designed as a dual GAN structure, in which both the correlation between the generated shadow-free target and the input shadow image and the realism of the generated shadow-free image are strictly constrained throughout training. The proposed TC-GAN is compared with many state-of-the-art unsupervised and supervised shadow removal methods in terms of both quantitative and qualitative evaluation, and the results demonstrate the superior performance and good recovery quality of our method. In the future, we plan to apply TC-GAN to more shadow-related tasks, such as shadow detection and shadow removal in video.
References
- [1] Benish Amin, M Mohsin Riaz, and Abdul Ghafoor. Automatic shadow detection and removal using image matting. Signal Processing, 170:107415, 2020.
- [2] Mohan Ankit, Tumblin Jack, and Choudhury Prasun. Editing soft shadows in a digital photograph. IEEE Computer Graphics and Applications, 27(2):23–31, 2007.
- [3] Anoosheh Asha, Agustsson Eirikur, Timofte Radu, and Van Gool Luc. Combogan: Unrestrained scalability for image domain translation. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 783–790, 2018.
- [4] Ding Bin, Long Chengjiang, Zhang Ling, and Xiao Chunxia. Argan: Attentive recurrent generative adversarial network for shadow detection and removal. In Int. Conf. Comput. Vis., pages 10213–10222, 2019.
- [5] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
- [6] Yang Chao, Kim Taehwan, Wang Ruizhe, Peng Hao, and Kuo C-C Jay. Show, attend, and translate: Unsupervised image translation with self-regularization and attention. IEEE Trans. Image Process., 28(10):4845–4856, 2019.
- [7] Xiao Chunxia, She Ruiyun, Xiao Donglin, and Ma Kwan-Liu. Fast shadow removal using adaptive multi-scale illumination transfer. In Computer Graphics Forum, volume 32, pages 207–218, 2013.
- [8] Xiaodong Cun, Chi-Man Pun, and Cheng Shi. Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting gan. arXiv preprint arXiv:1911.08718, 2019.
- [9] Finlayson Graham D, Hordley Steven D, Lu Cheng, and Drew Mark S. On the removal of shadows from images. IEEE Trans. Pattern Anal. Mach. Intell., 28(1):59–68, 2005.
- [10] Reinhard Erik, Adhikhmin Michael, Gooch Bruce, and Shirley Peter. Color transfer between images. IEEE Computer graphics and applications, 21(5):34–41, 2001.
- [11] Liu Feng and Gleicher Michael. Texture-consistent shadow removal. In Eur. Conf. Comput. Vis., pages 437–450, 2008.
- [12] Khan Salman H, Bennamoun Mohammed, Sohel Ferdous, and Togneri Roberto. Automatic shadow detection and removal from a single image. IEEE Trans. Pattern Anal. Mach. Intell., 38(3):431–446, 2015.
- [13] Gong Han and Cosker Darren. Interactive shadow removal and ground truth for variable scene categories. In Brit. Mach. Vis. Conf., pages 1–11, 2014.
- [14] Tang Hao, Liu Hong, Xu Dan, Torr Philip HS, and Sebe Nicu. Attentiongan: Unpaired image-to-image translation using attention-guided generative adversarial networks. arXiv preprint arXiv:1911.11897, 2019.
- [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., volume 30, pages 6626–6637. Curran Associates, Inc., 2017.
- [16] Gang Hua, Chengjiang Long, Ming Yang, and Yan Gao. Collaborative active visual recognition from crowds: A distributed ensemble approach. IEEE Trans. Pattern Anal. Mach. Intell., 40(3):582–594, 2017.
- [17] Fu Huan, Gong Mingming, Wang Chaohui, Batmanghelich Kayhan, Zhang Kun, and Tao Dacheng. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2427–2436, 2019.
- [18] Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. Generative adversarial nets. In Adv. Neural Inform. Process. Syst., pages 2672–2680, 2014.
- [19] Wang Jifeng, Li Xiang, and Yang Jian. Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1788–1797, 2018.
- [20] Wei Jinjiang, Long Chengjiang, Zou Hua, and Xiao Chunxia. Shadow inpainting and removal using generative adversarial networks with slice convolutions. In Computer Graphics Forum, volume 38, pages 381–392, 2019.
- [21] Zhu Jun-Yan, Park Taesung, Isola Phillip, and Efros Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Int. Conf. Comput. Vis., pages 2223–2232, 2017.
- [22] Kim Junho, Kim Minjae, Kang Hyeonwoo, and Lee Kwanghee. U-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
- [23] Johnson Justin, Alahi Alexandre, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In Eur. Conf. Comput. Vis., pages 694–711, 2016.
- [24] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
- [25] Hieu Le and Dimitris Samaras. Shadow removal via shadow image decomposition. In Proceedings of the IEEE International Conference on Computer Vision, pages 8578–8587, 2019.
- [26] Hieu Le and Dimitris Samaras. From shadow segmentation to shadow removal. arXiv preprint arXiv:2008.00267, 2020.
- [27] Zhu Lei, Deng Zijun, Hu Xiaowei, Fu Chi-Wing, Xu Xuemiao, Qin Jing, and Heng Pheng-Ann. Bidirectional feature pyramid network with recurrent attention residual modules for shadow detection. In Eur. Conf. Comput. Vis., pages 121–136, 2018.
- [28] Qu Liangqiong, Tian Jiandong, He Shengfeng, Tang Yandong, and Lau Rynson WH. Deshadownet: A multi-context embedding deep network for shadow removal. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4067–4075, 2017.
- [29] Yun-Hsuan Lin, Wen-Chin Chen, and Yung-Yu Chuang. Bedsr-net: A deep shadow removal network from a single document image. In IEEE Conf. Comput. Vis. Pattern Recog., June 2020.
- [30] Zhang Ling, Long Chengjiang, Zhang Xiaolong, and Xiao Chunxia. Ris-gan: Explore residual and illumination with generative adversarial networks for shadow removal. arXiv preprint arXiv:1911.09178, 2019.
- [31] Chengjiang Long and Gang Hua. Multi-class multi-annotator active learning with robust gaussian process for visual recognition. In Int. Conf. Comput. Vis., pages 2839–2847, 2015.
- [32] Chengjiang Long, Xiaoyu Wang, Gang Hua, Ming Yang, and Yuanqing Lin. Accurate object detection with location relaxation and regionlets re-localization. In ACCV, pages 260–275. Springer, 2014.
- [33] Wenhan Luo, Peng Sun, Fangwei Zhong, Wei Liu, Tong Zhang, and Yizhou Wang. End-to-end active object tracking and its real-world deployment via reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell., 42(6):1317–1332, 2019.
- [34] Gryka Maciej, Terry Michael, and Brostow Gabriel J. Learning to remove soft shadows. ACM Trans. Graph., 34(5):1–15, 2015.
- [35] Arjovsky Martin, Chintala Soumith, and Bottou Léon. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- [36] Li Minjun, Huang Haozhi, Ma Lin, Liu Wei, Zhang Tong, and Jiang Yugang. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. In Eur. Conf. Comput. Vis., pages 184–199, 2018.
- [37] Sidorov Oleksii. Conditional gans for multi-illuminant color constancy: Revolution or yet another approach? In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 0–0, 2019.
- [38] Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [39] Hung Ngoc Phan, Long Hoang Pham, Nhat Minh Chung, and Synh Viet-Uyen Ha. Improved shadow removal algorithm for vehicle classification in traffic surveillance system. In IEEE Int. Conf. Comput. Commun. Technol., pages 1–6, 2020.
- [40] Isola Phillip, Zhu Jun-Yan, Zhou Tinghui, and Efros Alexei A. Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1125–1134, 2017.
- [41] Yang Qingxiong, Tan Kar-Han, and Ahuja Narendra. Shadow removal using bilateral filtering. IEEE Trans. Image Process., 21(10):4361–4368, 2012.
- [42] Guo Ruiqi, Dai Qieyun, and Hoiem Derek. Paired regions for shadow detection and removal. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2956–2967, 2012.
- [43] Benaim Sagie and Wolf Lior. One-sided unsupervised domain mapping. In Adv. Neural Inform. Process. Syst., pages 752–762, 2017.
- [44] Wu Tai-Pang, Tang Chi-Keung, Brown Michael S, and Shum Heung-Yeung. Natural shadow matting. ACM Trans. Graph., 26(2):8–es, 2007.
- [45] Jin Tang, Qing Luo, Fan Guo, Zhihu Wu, Xiaoming Xiao, and Yan Gao. Sdrnet: An end-to-end shadow detection and removal network. Signal Processing: Image Communication, page 115832, 2020.
- [46] Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec, and Chen Xi. Improved techniques for training gans. In Adv. Neural Inform. Process. Syst., pages 2234–2242, 2016.
- [47] Hu Xiaowei, Zhu Lei, Fu Chi-Wing, Qin Jing, and Heng Pheng-Ann. Direction-aware spatial context features for shadow detection. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7454–7462, 2018.
- [48] Hu Xiaowei, Jiang Yitong, Fu Chi-Wing, and Heng Pheng-Ann. Mask-shadowgan: Learning to remove shadows from unpaired data. In Int. Conf. Comput. Vis., pages 2472–2481, 2019.
- [49] Mao Xudong, Li Qing, Xie Haoran, Lau Raymond YK, Wang Zhen, and Paul Smolley Stephen. Least squares generative adversarial networks. In Int. Conf. Comput. Vis., pages 2794–2802, 2017.
- [50] Li Ya, Tian Xinmei, Gong Mingming, Liu Yajing, Liu Tongliang, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Eur. Conf. Comput. Vis., pages 624–639, 2018.
- [51] Vicente Tomas F Yago, Hoai Minh, and Samaras Dimitris. Leave-one-out kernel optimization for shadow detection and removal. IEEE Trans. Pattern Anal. Mach. Intell., 40(3):682–695, 2017.
- [52] Taigman Yaniv, Polyak Adam, and Wolf Lior. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
- [53] Zhang Yexun, Zhang Ya, and Cai Wenbin. Separating style and content for generalized style transfer. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8447–8455, 2018.
- [54] Yi Zili, Zhang Hao, Tan Ping, and Gong Minglun. Dualgan: Unsupervised dual learning for image-to-image translation. In Int. Conf. Comput. Vis., pages 2849–2857, 2017.
1 Discussion about Shadow Transform Encoders (STE)
In the proposed target-consistency generative adversarial network (TC-GAN), the two shadow transform encoders (STE) in the two corresponding generators attempt to encode the same shadow image into different feature combinations so as to satisfy the target consistency assumption for the shadow removal problem. In practice, we do not explicitly constrain the feature combinations of the two shadow transform encoders to be different, as this is relatively complex and not conducive to stable convergence during training; instead, we provide different adversarial optimization objectives for the two generators at each training iteration by feeding the two discriminators with different real non-shadow images. To verify that the two STEs have learned different feature combinations from the input shadow image, we visualize some feature maps produced by the two STEs for the same shadow-domain input in Figure 1. Note that the full feature map of each STE contains 256 channels; we present the activation maps of the first 10 channels for clarity. As can be seen from Figure 1, the feature maps obtained by the two STEs differ significantly in the information activated in each channel, and, as expected, the input shadow image has been transformed into different feature combinations of the shadow or non-shadow semantic objects. During the overall training process, although the generated features and the residuals decoded by the shadow residual decoders differ, two consistent shadow-free images corresponding to the same shadow input can be generated through the joint optimization of the adversarial generative constraint, the target consistency constraint and the identity loss.

| Selection Strategy | Fixed network 1 | Fixed network 2 | MSM |
|---|---|---|---|
| FID | 75.1297 | 74.9462 | 72.3522 |
2 Discussion about Model Selection Module (MSM)
Our TC-GAN contains two completely symmetric mapping networks from the shadow domain to the non-shadow domain. In theory, the two mapping networks should perform almost identically on the shadow removal task, but for a specific sample, the outputs of the two networks may differ slightly. Therefore, in the testing phase, we need to choose one of the two output shadow-free images as the final shadow-free target of TC-GAN, which is the role of the model selection module (MSM). In fact, different purposes usually correspond to different model selection strategies. For example, for a hardware- or time-sensitive application, the output of one of the mapping networks can be fixed directly as the final output and the other discarded, which halves the number of parameters and the inference time. In this work, we propose a simple and reasonable model selection strategy. We add to the model selection module a shadow/non-shadow binary classifier pre-trained on the shadow removal dataset, whose output scalar represents the probability that the input image is a real non-shadow image. The network structure of the model selection classifier is the same as that of the discriminator, which gives it sufficient classification ability. The classifier is trained in advance and reaches an accuracy of 90% on the testing set. During the testing phase, for each sample we choose, from the outputs of the two mapping networks, the one with the higher probability of being a real non-shadow image according to the model selection classifier as the final output of TC-GAN. In the experiments, we found that using the model selection module improves performance slightly compared to directly fixing one of the mapping networks. Specifically, taking the FID metric on the USR unpaired dataset as an example, as shown in Table 1, the FID scores of the two individual mapping networks are 75.1297 and 74.9462, respectively, while the score with the model selection module is 72.3522, an improvement of 3.46%.
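A sketch of this selection step is given below (function names are ours): the pre-trained classifier scores both candidates, and the one judged more likely to be a real non-shadow image is returned.

```python
import torch

@torch.no_grad()
def select_output(classifier, candidate_1, candidate_2):
    """classifier(x) is assumed to return the probability that x is a real non-shadow image."""
    p1 = classifier(candidate_1).mean()  # reduce a possible patch-wise score map to a scalar
    p2 = classifier(candidate_2).mean()
    return candidate_1 if p1 >= p2 else candidate_2
```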


3 More Examples of Visual Comparisons
In Figure 2, we provide more examples of visual comparisons on the USR dataset. It can be seen that our TC-GAN achieves better performance than the other unsupervised image-to-image translation and unpaired shadow removal approaches, both in the effectiveness of shadow removal and in the preservation of non-shadow regions. However, due to the lack of supervision between the shadow image and a corresponding non-shadow label, TC-GAN may fail on shadow images with complex scenes. Three typical scenarios are shown in Figure 3. First, TC-GAN may restore the shadow region to the wrong color when the shadow area covers a variety of backgrounds with large color differences: as shown in the first row of Figure 3, TC-GAN mistakenly restores the shadow region to green because it mistakes the background of the shadow region for a lawn. Second, when the shadows in the image are very similar to objects in the non-shadow area, TC-GAN may change the non-shadow region, as shown in the second row of Figure 3. Third, when objects in the non-shadow region are darker than the shadows, TC-GAN is likely to remove them as well: as shown in the last row of Figure 3, TC-GAN mistakenly treats the manhole cover as a shadow and tries to remove it. In general, we find that when dealing with these complex scenes, the other methods are either unable to remove the shadow or suffer from the same problems as TC-GAN, which points to a direction in which unsupervised shadow removal methods can be further improved.