
An Unsupervised Approach towards Varying Human Skin Tone Using Generative Adversarial Networks

Debapriya Roy, Diganta Mukherjee and Bhabatosh Chanda Indian Statistical Institute
Kolkata, India
Email: [email protected], [email protected], [email protected]
Abstract

With the increasing popularity of augmented and virtual reality, retailers are now focusing more on customer satisfaction to increase sales. Although augmented reality is not a new concept, it has gained much-needed attention over the past few years. Our present work is targeted in this direction and may be used to enhance user experience in various virtual and augmented reality based applications. We propose a model to change the skin tone of a person. Given any input image of a person or a group of persons, along with a value indicating the desired change of skin color towards fairness or darkness, the method changes the skin tone of the persons in the image. The method is unsupervised and also unconstrained in terms of pose, illumination, number of persons in the image, etc. The goal of this work is to reduce the time and effort that changing skin tone generally requires in existing applications (e.g., Photoshop), whether used by professionals or novices. To establish the efficacy of the method we compare our results with those of a popular photo editor and with an existing benchmark method for human attribute manipulation. Rigorous experiments on different datasets show the effectiveness of this method in terms of synthesizing perceptually convincing outputs.

I Introduction

Augmented reality (AR) is an interactive experience of a real-world environment in which the objects residing in the real world are enhanced by computer-generated augmentations [1, 2]. Though the concept of AR is not new, with the advent of machine learning and deep learning in computer vision AR got its much-needed push into the mainstream. Various retailers, such as Burberry, 1-800-Flowers.com, ASOS, Nike and Benjamin Moore, infuse augmented reality (AR) into their apps to help online customers make more informed purchase decisions [3]. For example, Nike Fit is an app launched by the sports retail brand Nike that scans a customer's foot dimensions using a smartphone camera and suggests a shoe size. During CES 2018 (organized by the Consumer Technology Association) a dressing room app was launched by GAP, a clothing retail company, in which shoppers can select the cloth of their choice along with one of five body types to visualize what an outfit will look like on them. All this shows that retailers are now focusing on new ways to enhance users' shopping experience. Our work is a new approach in this direction.

Figure 1: Illustrating the objective of the present work. The results with different skin tones are shown on the right ($2^{nd}$ to $4^{th}$). The axis line above indicates the values of the skin tone control variable. Values along the negative direction of the axis indicate darkness, while those along the positive direction indicate fairness. The amount of skin color change is proportional to the absolute value of the variable.

When buying clothes online we see the image of a model wearing the cloth and decide whether to buy based on how it looks on the model. However, setting aside the question of size, buyers are often left unsure how the cloth will look on them. The reason may be that the models are generally slim, fair skinned, tall, etc., which means the seller creates an ideal situation in which the product looks best. Another case may be buying a cloth for a friend who is much fairer or darker than the model. There can be various such cases where the buyer wishes to see the cloth on a model with the features of their choice. The same concept applies not only to clothes but to various accessories as well. This work addresses the skin tone aspect of this problem: varying the skin tone of a person in an image. Such an application may also be helpful for fashion designers.

Existing skin tone changing applications are generally targeted at faces. A well-known image editing tool is Photoshop, which is quite popular in this line of work. However, these applications require a lot of human intervention, and obtaining a good result relies heavily on the user's expertise and effort. Over the last few years, especially after the advent of the generative adversarial network (GAN), various works related to face attribute manipulation [38, 11, 16, 35, 32, 19] have been proposed. The objective of most of these works is to translate facial images among multiple groups, where each group represents one attribute, e.g., beard, sunglasses, white hair, smile, gender, etc. Not only are these works mostly constrained to attributes of facial images, they also use annotations to divide the data into different groups, which requires human intervention. In this work, we attempt an unsupervised approach to change the skin tone of a person in an image. To the best of our knowledge no previous work has been done to vary skin color. Although various works have been done on skin detection [28, 23, 9, 34, 39, 8, 29, 22, 31, 21, 26], their objective is to detect the skin pixels in a given image. Our objective is different in the sense that we not only detect skin pixels but also vary the color of those pixels to synthesize a new image of the input person with varied skin tone (Fig. 1). In fact, the contribution of this paper lies in the idea of changing the skin color. While varying the color of skin pixels might sound like an easy task, the main challenge lies in keeping the output perceptually realistic, as otherwise the purpose is lost.

We take an adversarial learning [15] based approach to solve this problem. We employ a generative adversarial network (GAN) [15] to learn a generative model with our proposed skin color distance based loss function. The idea of this loss is to perturb the color of skin pixels based on a control variable, where the amount of perturbation depends on the absolute value of this variable. The main challenge here is to vary the skin tone within the permissible range of human skin tones, which is addressed by this loss function; otherwise the realism of the result is lost, which is not desirable. To detect the skin / non-skin pixels we employ a skin detector network that is trained in a supervised way. We call our method unsupervised as it does not require any annotation in terms of skin tone.

In summary, this paper makes the following contributions:

  • We propose a method to synthesize images of a person over a varying scale of skin tone, where the tone can vary from dark to fair.

  • Inspired by the concept of the perceptual loss function, we propose a skin color distance based loss function which is used to train a GAN for the current purpose.

  • This is an unsupervised method, hence does not require any skin tone related annotations.

To the best of our knowledge this is the first method to vary the skin tone of a person smoothly over a continuous interval. In the rest of the paper a brief literature survey is presented in Sec. II; this is followed by the methodology section (Sec. III), where we discuss the idea and the workflow of the proposed method in detail. To show the effectiveness of the method we present a detailed qualitative and quantitative study in Sec. IV. Finally, we conclude in Sec. V.

II Related Works

The objective of this work is to smoothly vary human skin tone over a continuous range. Although various works related to human attribute manipulation have been proposed earlier, the idea of varying skin tone smoothly is relatively new in the literature. Below we discuss some existing methods related to human attribute manipulation to provide an overview of the existing literature in the current problem context. We also discuss some existing skin / non-skin segmentation methods.

Earlier methods have explored ways to manipulate attributes of human faces in images, e.g., beard to no-beard looks, changing the hair color or pattern, smiling to non-smiling face, etc. A method for skin beautification is proposed in [11], where the authors use a bilateral filter to smooth the detected flawed skin areas and integrate the rectified regions with the original image using Poisson image cloning. After the advent of GANs, attribute manipulation related works have received more attention. Among the recently proposed methods, Shen et al. [32] attempt this problem using a GAN based approach. The authors employ two generators as image transformation networks and one discriminator. The generators are responsible for the attribute manipulation and its dual operation. The idea of this work is to learn the residual image, i.e., the difference between the images after and before manipulation. However, this method needs labelled data corresponding to each attribute type. A weakly supervised attribute manipulation framework is proposed in [35]. It employs a GAN trained with a perceptual content loss and two adversarial losses to ensure global consistency of the image along with the effect of the desired attribute. This method is weakly supervised as it learns the attribute from a set of images with that attribute in common. Another method [38] implements a multi-task learning framework based on GAN. It translates images among multiple groups, where each group characterises one attribute. AttGAN [16] proposes a GAN based facial attribute editing framework. It employs an encoder that encodes the input image into a latent representation, which is decoded along with the desired binary attribute vector to synthesize the final image with the desired attributes. A classifier is employed to constrain the attributes of the synthesized image to be as desired. SC-FEGAN [19] generates images as per user provided inputs, e.g., in the form of a free-form mask, sketch or color; this is also based on GAN. The generator is an encoder-decoder framework with gated convolutional layers. The speciality of this network is its ability to reconstruct large erased portions of an image. The methods above are all multi-attribute image manipulation works, and all of them are either supervised or weakly supervised, requiring some kind of annotation related to the specific attribute. In contrast, we attempt the problem of learning human skin tone manipulation in an unsupervised setting, where the attribute, represented as a value, can be varied over a continuous range.

Coming to the problem of classifying human skin and non-skin pixels, various methods have been proposed in the literature. Nguyen-Trang et al. [28] proposed an approach based on a Bayesian classifier which detects skin pixels using a first posterior probability threshold and "skin candidate" pixels using a second posterior probability threshold. A connected component algorithm is used to find all the connected components containing the "skin candidate" pixels, and all candidate pixels in a component are classified as skin pixels if the component contains at least one skin pixel. Methods based on criteria specific to different color models (e.g., RGB, HSV, YCbCr) have been proposed in [23, 31]. However, color model based approaches that do not consider context might misclassify pixels in some cases; for example, clothing areas with colors close to skin color may be classified as skin. Zuo et al. [39] proposed a deep neural network based approach for skin detection, in which recurrent neural network (RNN) layers are integrated into fully convolutional networks (FCNs). The reason for incorporating RNN layers is to capture semantic contextual dependencies, thus overcoming the limitation of convolutional neural networks (CNNs), which capture only local features. In this paper we employ a neural network based skin detection method as a preprocessing step which contributes to our final goal of learning to vary skin color. We do not go into a detailed experimental study of our skin detection method as this would divert from the main objective of this paper.

III Methodology

We propose a method to synthesize, from a given input image, a new image in which the skin tone of the persons in the image is changed according to a desired control variable. Towards this objective we first segment the pixels of the given image into two classes, skin and non-skin. This is achieved by our skin segmentation network (Sec. III-A), which is a Convolutional Neural Network (CNN). The next part of this work, the final image synthesizer (Sec. III-B), employs a Conditional Generative Adversarial Network (cGAN) [25]. It takes the image as input, along with the value of a conditional variable, and synthesizes a new image with the skin tone of the persons in the image changed in accordance with the value of the variable.

Figure 2: Demonstration of the change of skin color on an image with persons of different skin tones. The values within the brackets indicate the value of the skin color control variable $z$. Observe that our method retains the relative difference of skin color.
Figure 3: Demonstration of the results of different stages of the proposed method; ($2^{nd}$ column) results of skin / non-skin segmentation, ($3^{rd}$ column) demonstration of the predicted mean skin tone. For better understanding of the prediction accuracy of the skin tone, the predicted skin color is filled into the skin areas of the source image. ($4^{th}$, $5^{th}$ columns) Results for different values of $z$.

III-A Skin Segmentation

The objective of the skin segmentation network is twofold. Given an image as input, it first classifies each pixel of the image into two classes - skin and non-skin - and then estimates an RGB value indicating the skin tone. The method predicts one skin tone irrespective of the number of persons present in the image. During training all the images used contain a single person, so no confusion arises; it might, however, be confusing to relate this to images with multiple persons, which can occur during testing. To maintain better flow, we explain this in detail later.

The skin detection network consists of two sub-networks. The first half of the network is dedicated to the skin segmentation objective. This part of the network is trained using the following loss function,

$$L_{seg}=L_{c}(\hat{x}_{seg},x_{seg})+L_{p}(\hat{x}_{seg},x_{seg})+L_{s}(\hat{x}_{seg},x_{seg}), \qquad (1)$$

where $L_{c}(\cdot,\cdot)$ represents the count loss, and $L_{p}(\cdot,\cdot)$ and $L_{s}(\cdot,\cdot)$ represent the perceptual [20] and SSIM [36] losses, respectively, between the ground truth (say, $x_{seg}$) and the predicted segmentation result (say, $\hat{x}_{seg}$).

The count loss measures the absolute difference between the number of skin pixels in the predicted and ground truth images. Since neural networks in general cannot predict the binarized output required in segmentation problems, we enforce this constraint using this loss. It is observed that this loss improves the result in terms of predicting close-to-binarized output.

SSIM [36] is a popular image metric that measures the structural similarity between two images. However, since the loss is minimized during training, instead of SSIM we employ DSSIM, which is related to SSIM as DSSIM$(\cdot,\cdot) = 1 -$ SSIM$(\cdot,\cdot)$. Throughout the paper we refer to the DSSIM loss as the SSIM loss.

The VGG perceptual loss [20] is an $L_2$ loss between the features of the generated and ground truth images, obtained from different layers of a pre-trained classification network (VGG-19) trained on ImageNet [12]. Instead of exactly matching the pixel values of the generated and ground truth images, this loss matches their feature representations. This encourages the network to produce images that are perceptually similar to their corresponding target images.
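To make the segmentation objective concrete, a minimal sketch of Eq. (1) is given below, assuming a PyTorch setup; the VGG-19 layer choice, the equal weighting of the three terms and the soft (differentiable) pixel count are our illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor for the perceptual loss (ImageNet weights assumed).
_vgg = vgg19(pretrained=True).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad = False

def count_loss(pred_seg, gt_seg):
    # L_c: absolute difference between the (soft) skin-pixel counts, normalized.
    return torch.abs(pred_seg.sum() - gt_seg.sum()) / gt_seg.numel()

def perceptual_loss(pred, gt):
    # L_p: L2 distance between VGG-19 feature maps (one layer used for brevity).
    if pred.shape[1] == 1:                      # replicate masks to 3 channels for VGG
        pred, gt = pred.repeat(1, 3, 1, 1), gt.repeat(1, 3, 1, 1)
    return F.mse_loss(_vgg(pred), _vgg(gt))

def dssim_loss(pred, gt, ssim_fn):
    # L_s: DSSIM = 1 - SSIM; `ssim_fn` is any differentiable SSIM implementation.
    return 1.0 - ssim_fn(pred, gt)

def segmentation_loss(pred_seg, gt_seg, ssim_fn):
    # Eq. (1): L_seg = L_c + L_p + L_s (equal weights assumed).
    return (count_loss(pred_seg, gt_seg)
            + perceptual_loss(pred_seg, gt_seg)
            + dssim_loss(pred_seg, gt_seg, ssim_fn))
```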

To train the other half of this network, which is responsible for the skin color estimation, we use the MS-SSIM [37] loss, a variant of SSIM [36]. As the method is unsupervised, we do not use any ground truth skin color annotations to train this network. Instead, we fill the detected skin areas with the predicted skin color and minimize the structural dissimilarity between the input and the manipulated image. The reason for designing such an objective is to learn the skin color by maximizing the perceptual similarity between the input and the predicted image in an unsupervised way. Some results can be seen in Fig. 3. The importance of this half of the network lies in learning to extract the skin color from the person image itself, which is useful for the next part of the proposed work.
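The unsupervised colour objective can be sketched as follows: the predicted skin region is painted with the estimated colour and the structural dissimilarity against the input is minimised. The `ms_ssim` call assumes the external `pytorch-msssim` package; the exact compositing is our assumption.

```python
import torch
from pytorch_msssim import ms_ssim  # differentiable MS-SSIM (external package assumed)

def skin_color_loss(image, skin_mask, pred_color):
    """image: (B,3,H,W) in [0,1]; skin_mask: (B,1,H,W); pred_color: (B,3)."""
    # Paint the detected skin area with the predicted mean colour,
    # keeping the non-skin pixels of the original image unchanged.
    color_map = pred_color.view(-1, 3, 1, 1).expand_as(image)
    composite = skin_mask * color_map + (1.0 - skin_mask) * image
    # Minimise the multi-scale structural dissimilarity (1 - MS-SSIM).
    return 1.0 - ms_ssim(composite, image, data_range=1.0)
```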

III-B Synthesizing Images with Varying Human Skin Tone

This section elaborates the main contribution of this work. The problem we attempt here is to synthesize images of varying human skin tone given a source image as input. We formulate this as a conditional image generation problem, where the source image, along with its skin segmentation (obtained from the skin segmentation network discussed in the previous section) and a control variable $z$, is given as input to a cGAN. $z$ controls the amount of change of skin tone. The value $z = 0$ indicates no change of skin color, while values below and above zero indicate the amount of change towards darkness and fairness of the skin, respectively. Therefore $z$ plays the role of a skin tone regulator.
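The paper does not specify how $z$ is injected into the generator; one simple possibility, shown purely as an illustrative sketch, is to append it as a constant-valued extra channel alongside the image and the skin mask.

```python
import torch

def build_generator_input(image, skin_seg, z):
    """image: (B,3,H,W); skin_seg: (B,1,H,W); z: (B,) scalar skin-tone controls."""
    b, _, h, w = image.shape
    # Broadcast the scalar control variable to a constant-valued feature map.
    z_map = z.view(b, 1, 1, 1).expand(b, 1, h, w)
    # Conditional input: source image + skin mask + skin-tone control channel.
    return torch.cat([image, skin_seg, z_map], dim=1)  # (B,5,H,W)
```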

We employ a CNN as our generator network in the cGAN, and the discriminator network is a patchGAN discriminator [18], which is also a CNN. The objective of the generator is to learn a distribution over the input data set. It essentially builds a mapping, with some conditional information, from a prior noise distribution to the data space, while the discriminator predicts the probability of an input, given the conditional information, coming from the data distribution rather than the generator distribution. In general a discriminator predicts a single probability value for a given input image. However, inspired by [18], we employ a patchGAN discriminator, which, in contrast to a classical CNN based discriminator, classifies each patch of the image, where the patch size is much smaller than the input image size; this implies that pixels separated by more than a patch diameter are modelled independently. As discussed in [18] such a discriminator acts as a texture/style loss and plays a significant role in keeping better texture in the generated image.
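For reference, a patchGAN discriminator in the style of [18] can be sketched as below; the channel widths follow the common pix2pix convention and the five input channels match the conditional input assumed in the previous sketch, both of which are assumptions on our part.

```python
import torch.nn as nn

def patchgan_discriminator(in_channels=5, base=64):
    # Each output unit scores one receptive-field-sized patch of the input as
    # real or fake, instead of producing a single score for the whole image.
    def block(cin, cout, stride):
        return [nn.Conv2d(cin, cout, 4, stride, 1),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True)]
    layers = [nn.Conv2d(in_channels, base, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
    layers += block(base, base * 2, 2)
    layers += block(base * 2, base * 4, 2)
    layers += block(base * 4, base * 8, 1)
    layers += [nn.Conv2d(base * 8, 1, 4, 1, 1)]  # one logit per patch
    return nn.Sequential(*layers)
```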

The interesting part of this cGAN lies in the design of the loss function. Let us denote the generator network as $f_{g}(\cdot)$, the input image as $x$, and $\hat{x}_{seg}$ as the skin segmentation output corresponding to $x$ (we denote the complement of $\hat{x}_{seg}$ as $\hat{x}'_{seg}$, representing the predicted non-skin regions). We formulate the objective function in the following way,

$$L_{cGAN}=l^{1}+l^{2}+\lambda(m\times z+l^{3}-\epsilon)+L_{ADV}. \qquad (2)$$

where, considering $\hat{x}_{z=0}=f_{g}(x,z=0,\hat{x}_{seg})$ and $\hat{x}_{z\neq 0}=f_{g}(x,z\neq 0,\hat{x}_{seg})$, we define,

$$\begin{split}l^{1}&=L_{p}(\hat{x}_{z=0},x)\\ l^{2}&=L_{p}(\hat{x}_{z=0}\times\hat{x}'_{seg},\;x\times\hat{x}'_{seg})\\ l^{3}&=\log(0.5-L_{color})\\ L_{color}&=L_{p}^{color}(\hat{x}_{z\neq 0},x).\end{split} \qquad (3)$$

Here $\lambda$, $m$ and $\epsilon$ are parameters, and $L_{ADV}$ denotes the adversarial loss. The function $L_{p}(\cdot,\cdot)$ indicates the VGG perceptual loss, and $L_{p}^{color}(\cdot,\cdot)$ indicates a loss similar in concept to the perceptual loss, but where the underlying network is the skin color estimation network discussed in the previous section. The idea behind this loss is to ensure that the skin color of the image synthesized by the cGAN with $z = 0$ is close to that of the original image (ensured by the term $l^{1}$), while that with a changing value of $z$ should differ from the original image (ensured by the term $l^{3}$), preserving the details in regions other than the skin (ensured by the term $l^{2}$). The term $\lambda\times m\times z$ ensures that the change in skin color goes in accordance with the value and sign of $z$. During training of the cGAN, we sample random values of $z$ from a specified distribution (the noise prior).
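Putting Eqs. (2) and (3) together, the generator objective could be assembled as in the sketch below; the perceptual losses $L_p$ and $L_p^{color}$ and the adversarial term are assumed to be supplied as callables, and the hyper-parameter values follow Sec. IV-B2.

```python
import torch

def cgan_generator_loss(x, x_hat_z0, x_hat_z, non_skin_mask, z,
                        L_p, L_p_color, L_adv,
                        lam=0.003, m=0.004, eps=0.0002):
    """Sketch of Eq. (2): L_cGAN = l1 + l2 + lambda*(m*z + l3 - eps) + L_ADV."""
    l1 = L_p(x_hat_z0, x)                                   # identity when z = 0
    l2 = L_p(x_hat_z0 * non_skin_mask, x * non_skin_mask)   # preserve non-skin regions
    L_color = L_p_color(x_hat_z, x)                         # skin-colour feature distance
    l3 = torch.log(0.5 - L_color)                           # assumes L_color < 0.5
    return l1 + l2 + lam * (m * z + l3 - eps) + L_adv
```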

Observe from Fig. 2 that this method works on images containing people with different skin tones. This resolves the possible confusion regarding the single skin tone predicted by the skin tone estimator sub-network in Sec. III-A. We want to clarify that the objective of that part of the network is to learn features related to skin color, which are used in training the cGAN. For images containing multiple persons, the estimated skin color value indicates an average skin tone of all the persons present in the image.

Figure 4: Results of our method. The source person images are shown in the 1st column, and the next column shows the skin pixel segmentation results. The following columns show the results for different values of $z$. Walking along the axis from the negative to the positive direction increases the fairness of the skin.

IV Experiments

IV-A Dataset

We evaluate our model on the following datasets: the Category and Attribute Prediction Benchmark and In Shop Cloth Retrieval datasets of DeepFashion [24], and MPV [13]. DeepFashion is a large scale fashion dataset where images are taken under a variety of illumination conditions, clothing categories, poses and genders. DeepFashion contains multiple subsets of images, each with different types of annotations. In this work we use two subsets; however, our method is applicable to the other subsets as well. The In Shop Cloth Retrieval dataset contains in total 52,712 images of multiple views of each person (front, side, back and full), while the Category and Attribute Prediction Benchmark dataset contains 289,222 clothing images, where the images are mostly of models wearing the clothing. The MPV dataset contains in total 35,687 images of multiple views of each person. Our model is trained on the Category and Attribute Prediction Benchmark dataset of DeepFashion. For the quantitative analysis we test our model on 2000 randomly selected images from each of the mentioned datasets. For each test image, the result is obtained with values of $z$ drawn from $N(0, 0.05)$. The qualitative analysis is done on the In Shop Cloth Retrieval dataset of DeepFashion, with the cGAN trained on the Category and Attribute Prediction Benchmark dataset. For the skin detection part of the work we train the model on the LIP [14] and MPV datasets, where LIP is a popular dataset for human parsing and MPV also contains annotations related to different body parts.

IV-B Implementation Details

IV-B1 Skin Detection Network

The skin detection sub-network is an hourglass network [27], followed by six convolution and five pooling layers for the skin color estimation sub-network. The pooling and convolution layers are placed alternately; the numbers of filters for the convolutions are 18, 8, 2, 3, 3, 3, with stride 2 and ReLU activation for all layers except the last convolution layer. The skin detection and skin color estimation sub-networks are two different networks with two different purposes, and are therefore trained separately. The hourglass network is a CNN (Convolutional Neural Network) that captures features at various scales and is effective for analyzing spatial relationships among different parts of the input. Multiple hourglass networks can be stacked together with intermediate supervision to make the model deeper; however, for our purpose a stack size of 1 is found to be sufficient.
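One possible reading of this description is sketched below: six convolutions with 18, 8, 2, 3, 3, 3 filters interleaved with five pooling layers, ReLU on all but the last convolution, and a final global pooling that collapses the 3-channel output into a single RGB estimate. The placement of the stride-2 downsampling and the final pooling are our assumptions.

```python
import torch.nn as nn

def skin_color_estimator():
    # Six convolutions (18, 8, 2, 3, 3, 3 filters) alternated with five poolings.
    filters, layers, cin = [18, 8, 2, 3, 3, 3], [], 3
    for i, cout in enumerate(filters):
        layers.append(nn.Conv2d(cin, cout, kernel_size=3, padding=1))
        if i < len(filters) - 1:                # ReLU on all but the last convolution
            layers.append(nn.ReLU(inplace=True))
            layers.append(nn.MaxPool2d(2))      # stride-2 downsampling (assumed here)
        cin = cout
    layers.append(nn.AdaptiveAvgPool2d(1))      # collapse to a single (R, G, B) estimate
    layers.append(nn.Flatten())
    return nn.Sequential(*layers)
```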

IV-B2 cGAN for Varying Skin Tone

We use an hourglass network for the generator, and the architecture of the patchGAN discriminator is the same as in its original paper [18]. The values $\lambda=0.003$, $m=0.004$ and $\epsilon=0.0002$ are chosen experimentally.

All the networks are trained with the Adam optimizer with learning rate 0.0006 and $\beta_{1},\beta_{2}$ = 0.5, 0.999.
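These settings translate directly into the following optimiser configuration (a sketch; the two modules below are placeholders standing in for the hourglass generator and the patchGAN discriminator).

```python
import torch
import torch.nn as nn

generator = nn.Conv2d(5, 3, 3, padding=1)      # placeholder for the hourglass generator
discriminator = nn.Conv2d(5, 1, 4)             # placeholder for the patchGAN discriminator

opt_g = torch.optim.Adam(generator.parameters(), lr=0.0006, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.0006, betas=(0.5, 0.999))
```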

IV-C Quantitative Analysis

For the quantitative analysis we report scores on the following metrics: Inception Score (IS) [30], Fréchet Inception Distance (FID) [17] and SSIM [36]. Note that both IS and FID are evaluation measures for GANs which measure the perceptual quality of synthesized outputs; FID is the more preferred metric [17]. We report the SSIM score on the training set of images, since ground truth is not available for the test set. Here ground truth implies images of the same person in different skin colors. We also report the Kolmogorov-Smirnov test [10] statistic, which is a goodness-of-fit test.

Figure 5: Demonstration of the result on an in-the-wild image (image taken from [4]). The values within the brackets indicate the value of the skin color control variable $z$. Observe that our segmentation module is not constrained by background clutter or the presence of multiple persons in the image. It is also worth noticing that the results of the cGAN are perceptually convincing in terms of skin tone variation.

The Inception Score (IS) measures the classifiability and diversity of the generated images, which are classified using the Inception v3 model [33] to predict class probabilities. IS is defined as,

$$IS(\cdot)=\exp\Big(\underset{\textbf{a}\sim p_{g}}{\mathbb{E}}\big[D_{KL}(p(b|\textbf{a})\,||\,p(b))\big]\Big), \qquad (4)$$

where a is a generated image sampled from the learned distribution $p_{g}$, $\mathbb{E}$ is the expectation over the set of generated images, $p(b|\textbf{a})$ is the conditional class ($b$) distribution estimated for image a using the Inception model [33] pre-trained on ImageNet [12], and $p(b) = \underset{\textbf{a}\sim p_{g}}{\mathbb{E}}\,p(b|\textbf{a})$ is the marginal distribution. $D_{KL}$ is the KL-divergence between the conditional class distribution and the marginal class distribution. IS has a lowest value of 1, and its highest value equals the number of classes in the Inception model. As we use the Inception v3 model, which has 1000 classes, IS has a maximum value of 1000.
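Given the Inception v3 class probabilities of the generated images, Eq. (4) can be estimated as in the sketch below (`probs` is an N x 1000 array of softmax outputs; obtaining it from the model is omitted for brevity).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, 1000) Inception v3 softmax outputs for the generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal class distribution
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # KL(p(b|a) || p(b)) per image
    return float(np.exp(kl.sum(axis=1).mean()))              # exp of the mean KL divergence
```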

Although IS is observed to correlate well with human perception [30], it does not consider real data at all, and therefore it cannot estimate how well the generator approximates the real data distribution. To address the issues of IS, another metric for evaluating the performance of a GAN has been proposed; we discuss it below.

The Fréchet Inception Distance (FID) is a measure of similarity between two sets of images. It extracts the features embedded in both the real and the generated images from a layer of the Inception v3 model pre-trained on ImageNet [12]. Considering the embeddings as continuous multivariate Gaussians, the mean and covariance are estimated for both the generated data ($\mu_{g}$, $\sigma_{g}$) and the real data ($\mu_{r}$, $\sigma_{r}$). The FID is then calculated as $\left\lVert\mu_{r}-\mu_{g}\right\rVert_{2}^{2}+Tr(\sigma_{r}+\sigma_{g}-2(\sigma_{r}\sigma_{g})^{1/2})$. A lower value of FID is better.
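Given Inception features of the real and generated images, the FID formula can be evaluated as below (SciPy's matrix square root is used; feature extraction is omitted).

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_gen):
    """feat_*: (N, D) Inception v3 activations for real and generated images."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sig_r = np.cov(feat_real, rowvar=False)
    sig_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(sig_r @ sig_g)
    if np.iscomplexobj(covmean):               # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sig_r + sig_g - 2.0 * covmean))
```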

The Kolmogorov-Smirnov test (KS test) is a goodness-of-fit test that measures the compatibility of random samples with some theoretical probability distribution function. In other words, it is a non-parametric test statistic which takes the largest absolute difference between two cumulative distribution functions as a measure of disagreement. Let us consider $F_{obs}$ as the empirical distribution function of the data and $F_{exp}$ as the cumulative distribution function (CDF) associated with the null hypothesis; then the KS test statistic is defined by,

$$D_{n}=\sup_{t}\,|F_{exp}(t)-F_{obs}(t)|. \qquad (5)$$

Since the KS test statistic is a measure of distance, a lower value indicates better distribution similarity (the maximum value of the statistic is 1 and the minimum is 0). The classical KS test is based on one dimensional data; however, since we are dealing with images, we use the 2D variant of the KS test.
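As an illustration, the one-dimensional two-sample statistic can be computed directly with SciPy as below; the 2D variant used in the paper requires a dedicated implementation and is not shown, and the sample arrays here are placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder samples standing in for original and predicted mean skin tones,
# flattened to one dimension purely for illustration.
rng = np.random.default_rng(0)
original_tones = rng.random(2000)
predicted_tones = rng.random(2000)

statistic, p_value = ks_2samp(original_tones, predicted_tones)
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```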

TABLE I: Values of Inception Score (IS), Fréchet Inception Distance (FID) and SSIM on the results for different datasets.
Dataset    IS↑    FID↓    SSIM↑
In Shop    3.21 ± 0.17    38.33    0.93
Category-and-Attribute    3.58 ± 0.19    36.19    0.95
MPV    3.03 ± 0.23    42.56    0.92
TABLE II: Values of the Kolmogorov-Smirnov test (KS test) statistic along with the corresponding p-values on the results for different datasets.
Dataset    KS statistic↓    P-Value↑
DeepFashion (Category-and-Attribute) 0.0249 0.5545
MPV 0.0450 0.0837

We present the values of IS, FID and SSIM in Table I and the values of the KS test statistic in Table II. The SSIM values are based on the results of the cGAN generator with $z = 0$. The IS and FID scores suggest that our method synthesizes good quality images, which can also be verified visually from the results presented in the qualitative analysis section. Note that the IS values are low; this is because the problem concerns human images only, so there is no class diversity in the generated images, where class diversity relates to the number of classes of the Inception v3 model. The values of the KS statistic in Table II (at the 5% level of significance) show that the generator's distribution is quite similar to the original data distribution, as the difference between the distributions of original and predicted skin tones is statistically insignificant. This suggests that the predicted skin tones lie within the range of feasible human skin colors.

IV-D Qualitative Analysis

We present our qualitative results in this section. Fig. 3 shows the results of skin segmentation and the estimated skin color. It is observed that the predicted mean skin color is visually quite similar to the original skin tone. The final results on varying skin tone are shown in Fig. 4. It is worth noticing that the color of the skin varies with the control variable: the skin becomes fairer as the value of $z$ increases and darker as $z$ decreases.

Figure 6: Comparative study between the results of our method and those of SC-FEGAN. The results of SC-FEGAN are taken from the paper [19] itself. It can be observed that our results look much more realistic compared to those of SC-FEGAN.

It is observed experimentally that values of $z$ within $\pm 0.13$ give perceptually coherent images on average. We also present a result on an in-the-wild image (Fig. 5). This shows the method is unconstrained by background clutter and by variation in the skin colors of different persons in the image.

We present a visual comparison with the results of SC-FEGAN [19] in Fig. 6. SC-FEGAN is a method for attribute manipulation which cannot be directly applied to problems like skin tone change; however, it can reconstruct faces from free-form input, e.g., a sketch of the face with color details. In Fig. 6 we present a visual comparison of our results with the results of SC-FEGAN generated from free-form inputs. Note that in our method the idea is to make the skin relatively fairer or darker, hence we do not explicitly provide any skin color as input. Here, for the purpose of better visual comparison, we generate two results showing relatively darker and fairer skin tone reconstructions of the input faces. It is observed from the figures that our results look visually much more convincing in terms of realism than those of SC-FEGAN.

Figure 7: Comparative study between the results of our method and those generated by a professional in the Photoshop photo editor [5]. It can be noticed that our method makes a consistent, gradual change of skin color from its original color, producing comparable results.

In Fig. 7 we present a visual comparison between our results and the result generated by a professional in the Photoshop photo editor. Here the editor has made the person's skin tone fairer; for the purpose of comparison we also synthesize our results towards fairness. Notice the difference between the skin color in our results and that in the image generated by the editor. This happens because our method changes the tone gradually while keeping coherence with the skin tone of the previous image. It is also interesting that, as shown in [6], it took almost 9 minutes to generate the result in Photoshop, whereas our results can be generated in much less time, subject to the deep learning framework and GPU used.

Coming to the limitations of this method, we would like to mention that it is sensitive to irregularities in the skin segmentation. As observed from Fig. 8, improper segmentation may result in uneven skin tone (leftmost and rightmost persons).

Figure 8: Demonstration of limitations of the proposed method (image taken from [7]). Improper segmentation has caused uneven skin tone on the leftmost and the rightmost persons. [Please zoom in for details.]

V Conclusion

This paper presents a method to synthesize new images of a person with varied skin tone, where the amount of variation can be controlled by a control variable. To achieve this objective we first train a skin segmentation network which segments the skin and non-skin pixels. This is followed by a generative adversarial network which takes as input the source image, along with the skin segmentation result and the value of the corresponding control variable, and synthesizes a new image with the skin color changed according to the value of the control variable. Experiments on different datasets, as well as comparisons with the results of a popular photo editor and a benchmark attribute manipulation work, establish the usefulness of the method. It is verifiable from both the qualitative and quantitative analysis that this method generates perceptually convincing results.

References

  • [1] https://en.wikipedia.org/wiki/Augmented_reality.
  • [2] https://www.technologyreview.com/2019/10/23/238473/augmented-reality-in-retail-virtual-try-before-you-buy/.
  • [3] https://retailwire.com/discussion/.
  • [4] https://p4t6u7k5.stackpathcdn.com/wp-content/uploads/shutterstock_605616227.jpg.
  • [5] https://designpanoply.com/blog/how-to-change-a-persons-skin-color-from-dark-to-light-in-photoshop.
  • [6] https://www.youtube.com/watch?v=pO7gq_2BvZw.
  • [7] https://i1.wp.com/cbtpsychology.com/wp-content/uploads/2018/10/Therapy.jpg.
  • [8] Hani K Al-Mohair, Junita Mohamad Saleh, and Shahrel Azmin Suandi. Hybrid human skin detection using neural network and k-means clustering technique. Applied Soft Computing, 33:337–347, 2015.
  • [9] Jason Brand and John S Mason. A comparative assessment of three approaches to pixel-level human skin-detection. In Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, volume 1, pages 1056–1059. IEEE, 2000.
  • [10] I. M. Chakravarti, R. G. Laha, and J. Roy. Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, 1967.
  • [11] Chih-Wei Chen, Da-Yuan Huang, and Chiou-Shann Fuh. Automatic skin color beautification. In International Conference on Arts and Technology, pages 157–164. Springer, 2009.
  • [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [13] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE International Conference on Computer Vision, pages 9026–9035, 2019.
  • [14] Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 932–940, 2017.
  • [15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [16] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11):5464–5478, 2019.
  • [17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [19] Youngjoo Jo and Jongyoul Park. Sc-fegan: Face editing generative adversarial network with user’s sketch and color. In Proceedings of the IEEE International Conference on Computer Vision, pages 1745–1753, 2019.
  • [20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [21] Praveen Kakumanu, Sokratis Makrogiannis, and Nikolaos Bourbakis. A survey of skin-color modeling and detection methods. Pattern recognition, 40(3):1106–1122, 2007.
  • [22] Yusuke Kanzawa, Yoshikatsu Kimura, and Takashi Naito. Human skin detection by visible and near-infrared imaging. In IAPR Conference on Machine Vision Applications, volume 12, pages 14–22. Citeseer, 2011.
  • [23] Seema Kolkur, D Kalbande, P Shimpi, C Bapat, and Janvi Jatakia. Human skin detection using rgb, hsv and ycbcr color models. arXiv preprint arXiv:1708.02694, 2017.
  • [24] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016.
  • [25] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [26] Sinan Naji, Hamid A Jalab, and Sameem A Kareem. A survey on skin detection in colored images. Artificial Intelligence Review, 52(2):1041–1087, 2019.
  • [27] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
  • [28] Thao Nguyen-Trang. A new efficient approach to detect skin in color image using bayesian classifier and connected component algorithm. Mathematical Problems in Engineering, 2018, 2018.
  • [29] Abel S Nunez and Michael J Mendenhall. Detection of human skin in near infrared hyperspectral imagery. In IGARSS 2008-2008 IEEE International Geoscience and Remote Sensing Symposium, volume 2, pages II–621. IEEE, 2008.
  • [30] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • [31] Khamar Basha Shaik, P Ganesan, V Kalist, BS Sathish, and J Merlin Mary Jenitha. Comparative study of skin color detection and segmentation in hsv and ycbcr color space. Procedia Computer Science, 57(12):41–48, 2015.
  • [32] Wei Shen and Rujie Liu. Learning residual images for face attribute manipulation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4030–4038, 2017.
  • [33] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [34] Wei Ren Tan, Chee Seng Chan, Pratheepan Yogarajah, and Joan Condell. A fusion approach for efficient human skin detection. IEEE Transactions on Industrial Informatics, 8(1):138–147, 2011.
  • [35] Yilin Wang, Suhang Wang, Guojun Qi, Jiliang Tang, and Baoxin Li. Weakly supervised facial attribute manipulation via deep adversarial network. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 112–121. IEEE, 2018.
  • [36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [37] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  • [38] Jichao Zhang, Yezhi Shu, Songhua Xu, Gongze Cao, Fan Zhong, Meng Liu, and Xueying Qin. Sparsely grouped multi-task generative adversarial networks for facial attribute manipulation. In Proceedings of the 26th ACM international conference on Multimedia, pages 392–401, 2018.
  • [39] Haiqiang Zuo, Heng Fan, Erik Blasch, and Haibin Ling. Combining convolutional and recurrent neural networks for human skin detection. IEEE Signal Processing Letters, 24(3):289–293, 2017.