
^1 Korea University, Seoul, Korea   ^2 Army Research Laboratory, Adelphi, MD, USA

CAFE-GAN: Arbitrary Face Attribute Editing
with Complementary Attention Feature

Jeong-gi Kwak^1, David K. Han^2, Hanseok Ko^1
Abstract

The goal of face attribute editing is to alter a facial image according to given target attributes such as hair color, mustache, gender, etc. It belongs to the image-to-image domain transfer problem, with each set of attributes considered a distinct domain. Several works on the multi-domain transfer problem have focused on facial attribute editing using Generative Adversarial Networks (GANs). These methods have reported some successes, but they also produce unintended changes in facial regions, meaning the generator alters regions unrelated to the specified attributes. To address this problem, we propose a novel GAN model designed to edit only the parts of a face pertinent to the target attributes, based on the concept of the Complementary Attention Feature (CAFE). CAFE identifies the facial regions to be transformed by considering both the target attributes and their “complementary attributes”, which we define as those attributes absent in the input facial image. In addition, we introduce a complementary feature matching loss that helps train the generator to utilize the spatial information of attributes. Effectiveness of the proposed method is demonstrated by analysis and a comparison study with state-of-the-art methods.

Keywords:
Face Attribute Editing, GAN, Complementary Attention Feature, Complementary Feature Matching

1 Introduction

Since the advent of GAN by Goodfellow et al. [8], its applications have expanded rapidly into a variety of areas, and many variants of GAN have emerged. Conditional GAN (CGAN) [25], one of these variants, adds an input as a condition on how the synthetic output should be generated. An area of CGAN receiving particular attention in the media is “Deep Fake”, in which an input image is transformed into a new image of a different nature while key elements of the original image are retained and transposed [11, 25]. GAN-based style transfer is often the method of choice for transferring an input from one domain to another in the output, such as generating a hypothetical painting of a well-known artist from a photograph. CycleGAN [40] has become a popular method for image domain transfer because it uses a cycle-consistency loss that requires no paired training data, and thus its training is unsupervised.

Nevertheless, single-domain translation models [31, 37], including CycleGAN, are incapable of learning transfer to multiple domains; in these approaches, multiple models are required to match the number of target domains. One such multi-domain transfer problem is the manipulation of one’s facial characteristics. The goal of facial attribute editing is to convert specific attributes of a face, such as turning a face wearing eyeglasses into a face without eyeglasses. Other attributes include local properties, e.g., beard, as well as global properties, e.g., face aging. Obviously this requires multiple domain transfer models if the single-domain transfer concept is used, and the number of required models grows with the number of attribute combinations since these facial attributes are mostly independent. Even for a relatively small number of attributes, single-domain transfer approaches would require a significantly high number of separate models. A model such as CycleGAN, therefore, would become impractical.

Figure 1: Face editing results of AttGAN [10], StarGAN [6], STGAN [21] and our model given the target attribute Blond hair. While AttGAN, StarGAN and STGAN deliver blond hair, they also create unwanted changes (e.g., halo, different hair style) in the resulting images.

To address the multi-domain transfer problem, StarGAN [6] and AttGAN [10] have been proposed, introducing a target vector of multiple attributes as an additional input. These target-attribute-vector-based GAN models have produced some impressive images, but they often result in unintended changes, meaning the generator alters regions unrelated to the specified attributes, as shown in Fig. 1. This stems from these models being driven to achieve higher objective scores in classifying attributes at the expense of affecting areas unrelated to the target attribute. Some methods [31, 37] have used a strategy of adding only the values of attribute-specific regions to the input image, but these methods have achieved limited success. Hence, changing only the pertinent regions remains an important but challenging problem. Their limitation stems from considering only structural improvements in the generator; a more effective approach may be possible by exploring the decision-making process of the discriminator.

Visual explanation [7, 30, 34, 39], which is known to be effective for interpreting Convolutional Neural Networks (CNNs) by highlighting response areas critical to recognition, is considered here to address the problem. Our model is mainly motivated by the Attention Branch Network (ABN) [7], which extends response-based visual explanation to an attention mechanism. In ABN, the attention branch takes mid-level features and extracts attention feature maps. The attention feature maps are then downsampled through a global average pooling (GAP) layer [20] and subsequently used as class probabilities. However, the problem with response-based visual explanation methods is that they can only extract attention feature maps for attributes already present in the image. Thus, these methods are effective only for manipulating existing attributes, such as removing a beard or changing hair color.

To address this issue, we propose a method of identifying the regions of attributes that are not present in an input image, via a novel concept called the Complementary Attention FEature (CAFE). By forming a complementary attribute vector, CAFE identifies the regions to be transformed according to the target attributes even when the input image lacks those attributes. With CAFE, our discriminator can generate and exploit spatial attention feature maps for all attributes. We demonstrate CAFE’s effectiveness in generating plausible and realistic images for both local and global attributes.

Our contributions are as follows:

  • We present a novel approach for facial attribute editing designed to only edit the parts of a face pertinent to the target attributes based on the proposed concept of Complementary Attention FEature (CAFE).

  • We introduce a complementary feature matching loss designed to aid in training the generator for synthesizing images with given attributes rendered accurately and in appropriate facial regions.

  • We demonstrate effectiveness of CAFE in both local as well as global attribute transformation with both qualitative and quantitative results.

2 Related Work

2.0.1 Generative Adversarial Networks.

Since Goodfellow et al. [8] proposed the Generative Adversarial Network (GAN), GAN-based generative models have attracted significant attention because of their realistic output. However, the original formulation of GAN suffers from training instability and mode collapse. Numerous studies have addressed these problems in various ways, such as formulating alternative objective functions [2, 9, 24] or developing modified architectures [12, 13, 29]. Several conditional methods [11, 25] have extended GAN to image-to-image translation. CycleGAN [40] proposed the use of a cycle-consistency loss to overcome the lack of paired training data. Advancing from single-domain transfer methods, StarGAN [6] can handle image translation among multiple domains. These developments have enabled GANs to deliver remarkable results in various tasks such as style transfer [3], super-resolution [16, 18], and many other real-world applications [1, 26, 28].

2.0.2 Face Attribute Editing.

The goal of face attribute editing is to transform the input facial image according to a given set of target attributes. Several methods have been proposed with a Deep Feature Interpolation (DFI) scheme [4, 5, 35]: by shifting the deep features of an input image in the direction of the target attributes, a decoder takes the interpolated features and outputs an edited image. Although they produce some impressive results, these methods require a pre-trained encoder such as the VGG network [32] and are weak at multi-attribute editing. Recently, GAN-based frameworks have become the dominant form of face attribute manipulation. A slew of studies on single-attribute editing [14, 19, 22, 31, 37, 40] have been conducted, but these methods cannot handle the manipulation of multiple attributes with a unified model. Several efforts [17, 27] have extended to arbitrary attribute editing but achieved limited image quality. Other methods [6, 10, 38] have shown remarkable results in altering multiple attributes by taking the target attribute vector as an input to their generator or by adopting an additional network. STGAN [21] and RelGAN [36] further improved face editing by using the difference between the target and source attribute vectors to focus on the selected attributes. However, these methods still suffer from changes to irrelevant regions. SaGAN [37] exploits spatial attention to address the problem, but it is only effective for editing local attributes such as adding a mustache.

2.0.3 Interpreting CNN.

Several studies [7, 30, 33, 34, 39] have visualized the decision making of CNNs by highlighting important regions. Gradient-based visual explanation methods [30, 33, 39] have been widely used because they are applicable to pre-trained models. Nonetheless, these methods are inappropriate for providing spatial information to our discriminator because they require back-propagation to obtain attention maps and cannot be trained jointly with the discriminator. In addition to gradient-based methods, several response-based methods [7, 39] have been proposed for visual explanation; they obtain attention maps using only the responses of feed-forward propagation. ABN [7] combines visual explanation and an attention mechanism by introducing an attention branch, so we adopt ABN to guide attention features in our model. However, applying ABN in our discriminator is problematic because it can visualize only attributes present in the input image. We combine ABN and arbitrary face attribute editing by introducing the concept of the complementary attention feature to address the difficulty of localizing facial regions when the input image does not contain the target attribute.

Figure 2: Overview of our model. On the left is the generator $G$, which edits the source image according to the given target attribute vector; $G$ consists of $G_{enc}$ and $G_{dec}$. The discriminator $D$ takes both the source and the edited image as input; $D$ consists of a feature extractor, $D_{att}$, $D_{adv}$ and $D_{cls}$.

3 CAFE-GAN

This section presents our proposed CAFE-GAN, a framework for arbitrary face attribute editing. Fig. 2 shows an overview of our model, which consists of a generator $G$ and a discriminator $D$. We first describe the discriminator, which recognizes attribute-relevant regions via the concept of the complementary attention feature (CAFE), and then describe how the generator learns to reflect the spatial information through complementary feature matching. Finally, we present our training implementation in detail.

3.1 Discriminator

The discriminator $D$ takes as input both real images and fake images modified by $G$. $D$ consists of three main parts, i.e., $D_{att}$, $D_{adv}$, and $D_{cls}$, as illustrated on the right side of Fig. 2. Unlike other arbitrary attribute editing methods [6, 10, 21], a spatial attention mechanism is applied to the mid-level features $f$ in our discriminator. $D_{att}$ plays the major role in applying the attention mechanism with the novel concept of complementary feature maps. $D_{att}$ consists of an attention branch (AB) and a complementary attention branch (CAB), each of which generates $k$ attention maps, where $k$ is the number of attributes. $M=\{M_{1},\dots,M_{k}\}$, the collection of attention maps from AB, covers the important regions of attributes that exist in the input image, while $M^{c}=\{M^{c}_{1},\dots,M^{c}_{k}\}$ from CAB covers the causative regions of attributes that do not exist. These attention maps are applied to the mid-level features by the attention mechanism as

$f^{\prime}_{i}=f\cdot M_{i},$ (1)
$f^{\prime\prime}_{i}=f\cdot M^{c}_{i},$ (2)

where $M_{i}$ and $M^{c}_{i}$ are the $i$-th attention maps from AB and CAB, respectively, and $(\cdot)$ denotes element-wise multiplication.

We now describe $D_{att}$ in detail, as illustrated in Fig. 3. As explained above, we adopt an Attention Branch (AB) to identify attribute-relevant regions following ABN [7]. AB takes mid-level features of an input image and generates $h\times w\times k$ attention features (AF), denoted by $A$, with a $1\times 1\times k$ convolution layer. Here, $k$ denotes the number of channels in $A$, which equals the number of attributes, and $h$ and $w$ denote the height and width of the feature map, respectively. AB outputs $k$ attention maps $M_{1},\dots,M_{k}$ with $1\times 1\times k$ convolution layers and a sigmoid layer. It also outputs an activation for each attribute class by global average pooling (GAP) [20]: the $h\times w\times k$ attention feature map $A$ is converted to a $1\times 1\times k$ feature map by GAP, which produces a probability score for each class through a sigmoid layer. The probability score is compared with the source label $v_{s}$ using a cross-entropy loss, so that $D$ learns to minimize classification errors when a real image (source image) is given as input. The attention loss of AB is therefore defined as

$\mathcal{L}_{D_{AB}}=-\mathbb{E}_{x}\sum_{i=1}^{k}\left[v_{s}^{(i)}\log{D}_{AB}^{(i)}(x)+(1-v_{s}^{(i)})\log(1-D_{AB}^{(i)}(x))\right],$ (3)

where $x$ is a real image, $v_{s}^{(i)}$ denotes the $i$-th value of the source attribute vector, and $D_{AB}^{(i)}(x)$ denotes the $i$-th probability score output by AB. The values of each channel in $A$ are therefore directly related to the activation of the corresponding attribute. AB can extract $A$, which represents spatial information about the attributes contained in the input image. However, $A$ does not include information about attributes that are not present in the image, because $A_{i}$, the $i$-th channel of feature map $A$, has no response if the $i$-th attribute is not in the input image.
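As a concrete illustration, the following is a minimal PyTorch-style sketch of AB; the module name, the use of a single $1\times 1$ convolution for the attention features and another for the attention maps, and the loss helper are our own assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """Sketch of AB: mid-level features f -> attention features A (k channels),
    attention maps M via 1x1 conv + sigmoid, and per-attribute probabilities
    via global average pooling (GAP) + sigmoid."""

    def __init__(self, in_channels: int, k: int):
        super().__init__()
        self.to_af = nn.Conv2d(in_channels, k, kernel_size=1)   # A: (B, k, h, w)
        self.to_maps = nn.Conv2d(k, k, kernel_size=1)           # M_1, ..., M_k

    def forward(self, f: torch.Tensor):
        A = self.to_af(f)                                        # attention features (AF)
        M = torch.sigmoid(self.to_maps(A))                       # attention maps in [0, 1]
        prob = torch.sigmoid(F.adaptive_avg_pool2d(A, 1).flatten(1))  # GAP -> (B, k)
        return A, M, prob


def ab_attention_loss(prob: torch.Tensor, v_s: torch.Tensor) -> torch.Tensor:
    """Eq. (3): multi-label cross-entropy between AB probabilities and the
    source attribute vector v_s (a float tensor of 0/1 labels)."""
    return F.binary_cross_entropy(prob, v_s)
```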

Figure 3: Details on the structure of the attention branch (AB) and the complementary attention branch (CAB) in $D_{att}$.

This aspect does not matter for classification models like ABN, which only need to activate the channel corresponding to an attribute present in the input image. For handling arbitrary attributes, however, a generative model must be able to predict the relevant spatial region even when the input image does not possess the attribute. Hence, existing visual explanation methods cannot be applied directly to the discriminator of an attribute editing model.

To address this problem, we develop the concept of complementary attention and implement it by integrating a Complementary Attention Branch (CAB). The idea of CAB is intuitive: it extracts complementary attention features (CAFE), denoted by $A^{c}$, which represent the causative regions of attributes that are not present in the image. For example, CAB detects the lower part of the face for the attribute Beard if the beard is not in the input image. To achieve this inverse class activation, we exploit the complementary attribute vector $\bar{v_{s}}$ as the target for the probability scores derived from $A^{c}$, where $\bar{v_{s}}$ is given by

$\bar{v_{s}}=1-v_{s},$ (4)

hence the attention loss of CAB is formulated as

$\mathcal{L}_{D_{CAB}}=-\mathbb{E}_{x}\sum_{i=1}^{k}\left[(1-v_{s}^{(i)})\log{D}_{CAB}^{(i)}(x)+v_{s}^{(i)}\log(1-D_{CAB}^{(i)}(x))\right],$ (5)

where $D_{CAB}^{(i)}(x)$ denotes the $i$-th probability score output by CAB. CAB is designed to generate a set of attention maps $M^{c}$ from $A^{c}$ for the attention mechanism. Therefore $A^{c}$ should contain spatial information that helps $D_{cls}$ classify attributes; in other words, $A^{c}$ represents the causative regions of non-existing attributes. With AB and CAB, our model extracts attention feature maps for all attributes because $A$ and $A^{c}$ are complementary: for any $i$-th attribute, if $A_{i}$ has no response, $A_{i}^{c}$ does, and vice versa.
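A corresponding sketch for the CAB loss of Eqs. (4)-(5) and the attention mechanism of Eqs. (1)-(2) follows; broadcasting a single attention map over all feature channels is our implementation assumption, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def cab_attention_loss(prob_cab: torch.Tensor, v_s: torch.Tensor) -> torch.Tensor:
    """Eq. (5): cross-entropy between CAB probabilities and the complementary
    attribute vector bar(v_s) = 1 - v_s of Eq. (4)."""
    return F.binary_cross_entropy(prob_cab, 1.0 - v_s)

def apply_attention(f: torch.Tensor, maps: torch.Tensor, i: int) -> torch.Tensor:
    """Eqs. (1)-(2): modulate mid-level features f (B, C, h, w) with the i-th
    attention map (from M or M^c, shape (B, k, h, w)) by element-wise
    multiplication, broadcast over the C feature channels."""
    return f * maps[:, i:i + 1]
```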

The two groups of attention maps $M$ and $M^{c}$, the outputs of AB and CAB respectively, activate on different attributes: $M$ concerns attributes present in the input image while $M^{c}$ concerns absent ones. After the attention mechanism, the transformed features are forwarded to two multi-attribute classifiers in $D_{cls}$; classifier 1 and classifier 2 predict the correct labels of the image from $f^{\prime}$ and $f^{\prime\prime}$, respectively. Each classifier outputs the probability of each attribute and is trained with a cross-entropy loss. The discriminator thus learns to classify the real image $x$ under the two different attention mechanisms, i.e.,

$\mathcal{L}_{D_{cls}}=-\mathbb{E}_{x}\sum_{n=1,2}\sum_{i=1}^{k}\left[v_{s}^{(i)}\log{D}_{cls_{n}}^{(i)}(x)+(1-v_{s}^{(i)})\log(1-D_{cls_{n}}^{(i)}(x))\right],$ (6)

where $D_{cls_{1}}$ and $D_{cls_{2}}$ stand for the two classifiers using the collections of attention maps $M=\{M_{1},\dots,M_{k}\}$ and $M^{c}=\{M_{1}^{c},\dots,M_{k}^{c}\}$, respectively. CAFE can thus represent the spatial information of non-existent attributes because CAB must generate attention maps that improve the classifiers' performance while reacting to non-existent attributes through GAP.

$D$ has another branch, $D_{adv}$, which distinguishes the real image $x$ from the fake image $y$ in order to guarantee visually realistic output through adversarial learning. In particular, we employ the adversarial loss of WGAN-GP [9], hence the adversarial loss of $D$ is given as

$\mathcal{L}_{D_{adv}}=\mathbb{E}_{x}(D_{adv}(x))-\mathbb{E}_{y}(D_{adv}(y))-\lambda_{gp}\mathbb{E}_{\hat{x}}\left[(\lVert\nabla_{\hat{x}}D_{adv}(\hat{x})\rVert_{2}-1)^{2}\right],$ (7)

where $\hat{x}$ is a weighted sum of a real and a fake sample with a randomly selected weight $\alpha\in[0,1]$.
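For reference, a minimal sketch of Eq. (7) is shown below; the gradient-penalty interpolation follows Gulrajani et al. [9], while the penalty weight and the function signature are assumptions. The full discriminator objective in Eq. (15) uses the negative of this quantity.

```python
import torch

def d_adv_loss(D_adv, x_real: torch.Tensor, x_fake: torch.Tensor,
               lambda_gp: float = 10.0) -> torch.Tensor:
    """Eq. (7): Wasserstein critic objective with gradient penalty evaluated
    at x_hat, a random interpolation between real and fake samples."""
    alpha = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D_adv(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return D_adv(x_real).mean() - D_adv(x_fake).mean() - lambda_gp * gp
```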

Figure 4: Details of the proposed complementary feature matching with an example. $G$ learns to match the AF and CAFE of the edited image for the attributes to be changed, i.e., To old and To bald, to the CAFE and AF of the source image, respectively. Conversely, $G$ learns to match the AF and CAFE of the edited image to the AF and CAFE of the source image for the attributes not to be changed.

3.2 Generator

The generator $G$ takes both the source image $x$ and the target attribute label $v_{t}$ as input and transforms $x$ into $y$, denoted by $y=G(x,v_{t})$. The goal of $G$ is to generate an image $y$ with attributes according to $v_{t}$ while maintaining the identity of $x$. $G$ consists of two components: an encoder $G_{enc}$ and a decoder $G_{dec}$. Given the source image $x$, $G_{enc}$ encodes the image into a latent representation $z$. The decoder then generates a fake image $y$ from the latent feature $z$ and the target attribute vector $v_{t}$. Following [21], we compute a difference attribute vector $v_{d}$ between the source and the target attribute vectors and use it as an input to our decoder:

$v_{d}=v_{t}-v_{s},$ (8)

Thus, the process can be expressed as

$z=G_{enc}(x),$ (9)
$y=G_{dec}(z,v_{d}).$ (10)
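The encoding-decoding process of Eqs. (8)-(10) can be sketched as follows, with the encoder and decoder bodies (and the skip connections discussed next) omitted; the class name and passing $v_d$ as a second decoder argument are implementation assumptions.

```python
import torch
import torch.nn as nn

class CAFEGenerator(nn.Module):
    """Sketch of G = (G_enc, G_dec): the decoder receives the latent feature z
    and the difference attribute vector v_d = v_t - v_s (Eq. 8)."""

    def __init__(self, enc: nn.Module, dec: nn.Module):
        super().__init__()
        self.enc, self.dec = enc, dec

    def forward(self, x: torch.Tensor, v_s: torch.Tensor, v_t: torch.Tensor):
        v_d = v_t - v_s          # difference attribute vector, Eq. (8)
        z = self.enc(x)          # Eq. (9)
        return self.dec(z, v_d)  # Eq. (10)
```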

In addition, we adopt the skip connection methodology used in STGAN [21] between $G_{enc}$ and $G_{dec}$ to minimize the loss of fine-scale information due to downsampling. $D$ then takes both the source image $x$ and the edited image $y$ as input. $G$ aims to generate an image that $D_{cls_{1}}$ and $D_{cls_{2}}$ classify as having the target attributes, hence the classification loss of $G$ is defined as

$\mathcal{L}_{G_{cls}}=-\mathbb{E}_{y}\sum_{n=1,2}\sum_{i=1}^{k}\left[v_{t}^{(i)}\log{D}_{cls_{n}}^{(i)}(y)+(1-v_{t}^{(i)})\log(1-D_{cls_{n}}^{(i)}(y))\right],$ (11)

where $v_{t}^{(i)}$ denotes the $i$-th value of the target attribute vector.

Although $D_{att}$ in our discriminator can obtain spatial information about all attributes, it is necessary to ensure that $G$ has the ability to change the regions relevant to the given target attributes. The attention feature maps of the source image and the edited image should differ for the attributes that are changed, while the rest should stay the same. We therefore propose a novel complementary matching method as illustrated in Fig. 4. For attributes that are not to be changed, the attention feature maps of the edited image should be the same as those of the source image: $G$ learns to match the AF of the edited image to the AF of the source image (black arrows in Fig. 4), and likewise for CAFE. When a given target attribute differs from the source image, $G$ learns to match the AF of the edited image to the CAFE of the source image (red arrows in Fig. 4), and likewise the CAFE of the edited image to the AF of the source image. Let $\{A^{(x)},A^{c(x)}\}$ and $\{A^{(y)},A^{c(y)}\}$ denote the sets of AF and CAFE from the real image $x$ and the fake image $y$, respectively. Complementary matching is conducted for the changed attributes, and thus the complementary matching loss is defined as

$\mathcal{L}_{CM}=\mathbb{E}_{(x,y)}\sum_{i=1}^{k}\frac{1}{N_{i}}\left[\lVert A_{i}^{(x)}-P_{i}\rVert_{1}+\lVert A_{i}^{c(x)}-Q_{i}\rVert_{1}\right],\quad\text{where}\ \{P_{i},Q_{i}\}=\begin{cases}\{A_{i}^{(y)},A_{i}^{c(y)}\}&\text{if}\ |v_{d}^{(i)}|=0,\\ \{A_{i}^{c(y)},A_{i}^{(y)}\}&\text{if}\ |v_{d}^{(i)}|=1,\end{cases}$ (12)

where $k$ is the number of attributes and $N_{i}$ denotes the number of elements in the $i$-th feature map. $A_{i}^{(x)/(y)}$ and $A_{i}^{c(x)/(y)}$ denote the $i$-th channel of $A^{(x)/(y)}$ and $A^{c(x)/(y)}$ respectively, and $v_{d}^{(i)}$ denotes the $i$-th value of the difference attribute vector $v_{d}$.
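A vectorized sketch of Eq. (12) is given below; the tensors are assumed to be batched $(B,k,h,w)$ AF/CAFE maps, and the swap between changed and unchanged attributes is implemented with a mask derived from $v_d$.

```python
import torch

def complementary_matching_loss(A_x, Ac_x, A_y, Ac_y, v_d):
    """Eq. (12). A_x/Ac_x and A_y/Ac_y: AF/CAFE of the source and edited
    images, shape (B, k, h, w); v_d: difference attribute vector, shape (B, k).
    Unchanged attributes (|v_d|=0) match AF-AF and CAFE-CAFE; changed
    attributes (|v_d|=1) match AF-CAFE and CAFE-AF."""
    changed = (v_d.abs() > 0).float()[:, :, None, None]
    P = (1.0 - changed) * A_y + changed * Ac_y
    Q = (1.0 - changed) * Ac_y + changed * A_y
    n = A_x.size(2) * A_x.size(3)                        # N_i: elements per map
    per_attr = ((A_x - P).abs() + (Ac_x - Q).abs()).sum(dim=(2, 3)) / n
    return per_attr.sum(dim=1).mean()                    # sum over attributes, mean over batch
```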

For adversarial training, we also adopt the generator adversarial loss used in WGAN-GP [9], i.e.,

$\mathcal{L}_{G_{adv}}=\mathbb{E}_{x,v_{d}}[D_{adv}(G(x,v_{d}))],$ (13)

Although the generator can edit face attributes with $\mathcal{L}_{G_{cls}}$ and generate realistic images with $\mathcal{L}_{G_{adv}}$, it should also preserve the identity of the image. Therefore, $G$ should reconstruct the source image when the difference attribute vector is zero. We adopt a pixel-level reconstruction loss, i.e.,

$\mathcal{L}_{rec}=\mathbb{E}_{x}[\lVert x-G(x,\mathbf{0})\rVert_{1}],$ (14)

where we use the $\ell_{1}$ loss for the sharpness of the reconstructed image and $\mathbf{0}$ denotes the zero vector.

3.3 Model Objective

Finally, the full objective to train the discriminator $D$ is formulated as

$\mathcal{L}_{D}=-\mathcal{L}_{D_{adv}}+\lambda_{att}\mathcal{L}_{D_{att}}+\lambda_{D_{cls}}\mathcal{L}_{D_{cls}},$ (15)

and that for the generator $G$ is formulated as

$\mathcal{L}_{G}=\mathcal{L}_{G_{adv}}+\lambda_{CM}\mathcal{L}_{CM}+\lambda_{G_{cls}}\mathcal{L}_{G_{cls}}+\lambda_{rec}\mathcal{L}_{rec},$ (16)

where $\lambda_{att},\lambda_{D_{cls}},\lambda_{CM},\lambda_{G_{cls}}$ and $\lambda_{rec}$ are hyper-parameters that control the relative importance of the terms.
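As a minimal sketch, the two objectives can be assembled as below; treating $\mathcal{L}_{D_{att}}$ as the sum of the AB and CAB attention losses is our reading of Eqs. (3) and (5), and the default weights are those reported later in Section 4.1.

```python
def discriminator_objective(L_D_adv, L_D_AB, L_D_CAB, L_D_cls,
                            lambda_att: float = 1.0, lambda_D_cls: float = 1.0):
    """Eq. (15); assumes L_D_att = L_D_AB + L_D_CAB."""
    return -L_D_adv + lambda_att * (L_D_AB + L_D_CAB) + lambda_D_cls * L_D_cls

def generator_objective(L_G_adv, L_CM, L_G_cls, L_rec,
                        lambda_CM: float = 1.0, lambda_G_cls: float = 10.0,
                        lambda_rec: float = 100.0):
    """Eq. (16)."""
    return L_G_adv + lambda_CM * L_CM + lambda_G_cls * L_G_cls + lambda_rec * L_rec
```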

4 Experiments

In this section, we first explain our experimental setup and then present qualitative and quantitative comparisons of our model with state-of-the-art models. Finally, we demonstrate the effectiveness of CAFE with visualization results and an ablation study. Experiments not included in this paper can be found in the supplementary material.

4.1 Experimental Setup

We use the CelebFaces Attributes (CelebA) dataset [23], which consists of 202,599 facial images of celebrities. Each image is annotated with 40 binary attribute labels and cropped to $178\times 218$. We crop each image to $170\times 170$ and resize it to $128\times 128$. For comparison, we choose the same attributes used in the state-of-the-art models [6, 10, 21]. In our experiments, the coefficients of the objective functions in Eqs. (15) and (16) are set to $\lambda_{att}=\lambda_{D_{cls}}=\lambda_{CM}=1$, $\lambda_{G_{cls}}=10$, and $\lambda_{rec}=100$. We adopt the ADAM [15] solver with $\beta_{1}=0.5$ and $\beta_{2}=0.999$; the learning rate is set initially to 0.0002 and decays to 0.0001 after 100 epochs.
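A sketch of this setup in PyTorch is given below; the use of a center crop and of torchvision transforms are assumptions about preprocessing details the text does not fully specify, and the function name is hypothetical.

```python
import torch
from torchvision import transforms

# Preprocessing: 178x218 CelebA images -> 170x170 crop -> 128x128 resize.
preprocess = transforms.Compose([
    transforms.CenterCrop(170),
    transforms.Resize(128),
    transforms.ToTensor(),
])

def make_optimizers(G: torch.nn.Module, D: torch.nn.Module, lr: float = 2e-4):
    """ADAM with beta1=0.5, beta2=0.999; the learning rate starts at 2e-4 and
    is decayed to 1e-4 after 100 epochs (decay schedule handled elsewhere)."""
    betas = (0.5, 0.999)
    opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=betas)
    opt_D = torch.optim.Adam(D.parameters(), lr=lr, betas=betas)
    return opt_G, opt_D
```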

Figure 5: Qualitative comparison with arbitrary facial attribute editing models. The rows from top to bottom are the results of AttGAN [10], StarGAN [6], STGAN [21] and our model. Please zoom in to see more details.

4.2 Qualitative Result

First, we conduct a qualitative analysis by comparing our approach with three state-of-the-art methods in arbitrary face editing, i.e., AttGAN [10], StarGAN [6] and STGAN [21]. The results are shown in Fig. 5. Each column represents a different attribute manipulation and each row shows the qualitative results of one of the compared methods. The source image is placed at the leftmost position of each row, and we analyze results for single as well as multiple attributes. AttGAN [10] and StarGAN [6] perform reasonably on attributes such as Add bangs. However, they tend to edit irrelevant regions for attributes such as Blond hair or Pale skin, and they also produce blurry images for attributes such as To bald. STGAN [21] improves manipulation ability by modifying the structure of the generator, but it also presents unnatural images for some attributes like To bald and Add bangs. In addition, unwanted changes are inevitable for some attributes such as Blond hair. As shown in Fig. 5, our model can successfully convert local attributes like Eyeglasses as well as global attributes like To female and To old. The last three columns show results of multi-attribute editing; Hair color in the last column denotes Black $\leftrightarrow$ Brown hair. It can be seen that our model delivers more natural and better-edited images than the other approaches.

Figure 6: Qualitative comparison with SaGAN [37], which adopts spatial attention in the generator. Please zoom in to see more details.

We also compare our model with SaGAN [37], which adopts the spatial attention concept. As shown in Fig. 6, SaGAN does not edit global attributes like To male and To old well. Even when the target attribute is localized, like Eyeglasses, it performs poorly. In contrast, our model computes spatial information at the feature level rather than the pixel level, and as a result it produces well-edited and natural results for both local and global attributes. In addition, note that SaGAN is a single-attribute editing model, requiring one model to be trained per attribute.

Table 1: Comparison of attribute classification accuracy. Numbers indicate the classification accuracy (%) for each attribute.
Method   Bald   Bangs   Blond h.   Musta.   Gender   Pale s.   Aged   Open m.   Avg.
AttGAN [10] 23.67 91.08 41.51 21.78 82.85 86.28 65.69 96.91 63.72
STGAN [21] 59.76 95.48 79.98 42.10 92.70 97.81 85.86 98.65 81.54
CAFE-GAN 79.03 98.59 88.14 40.13 95.22 98.20 88.61 97.15 85.64

4.3 Quantitative Result

We present quantitative results by comparing with two of the three models compared earlier [10, 21]. In the absence of ground truth, the success of arbitrary attribute editing can be determined by a well-trained multi-label classifier, i.e., a well-edited image should be classified into the target domain. For a fair comparison, we adopt the classifier that was used to evaluate the attribute generation score of STGAN [21] and AttGAN [10], using the official code, which contains the network architecture and the weights of the well-trained model. We exclude StarGAN [6] in this section because its official code provides only a few attributes. For each attribute, 2,000 edited images are evaluated, with source images drawn from the test set of the CelebA dataset. We measure the classification accuracy of each attribute of the edited images and list the results in Table 1, where “Blond h.” and “Open m.” denote Blond hair and Open mouth, respectively. While our model shows competitive scores on average across many attributes, it also delivers overwhelming improvements over the other models for specific attributes such as Bald. For attributes such as Mustache and Open mouth, STGAN performs slightly better than our model.

4.4 Analysis of CAFE

This section presents an analysis of the proposed method. We first visualize our attention features and then conduct an ablation study to demonstrate the effectiveness of CAFE.

4.4.1 Visualization of CAFE.

Fig. 7 shows visualization results of AF and CAFE to examine whether they activate the related regions correctly. Because the man on the left in Fig. 7 has no bangs, AF rarely activates, while CAFE activates and highlights the related region correctly. The result on the right shows AF correctly activating the region relevant to bangs while CAFE does not. Since CAFE only captures features of attributes absent from the image, not activating the region of a present attribute is the correct response. The remainder of the figure demonstrates that CAFE accurately lights up the regions complementary to the given attributes. For global attributes like Young and Male, both AF and CAFE respond correctly. Although AF and CAFE do not localize attribute-relevant regions at the pixel level, since they operate at the feature level, they highlight the corresponding regions accurately for each given attribute. As such, our model edits both global and local attributes better than other methods.

Figure 7: Visualization results of the attention features (AF) and the complementary attention features (CAFE).

4.4.2 Ablation Study.

To demonstrate the effectiveness of the proposed method, we evaluate our model with key components excluded one by one. We compare our full model with two variants, i.e., (i) CAFE-GAN w.o. CM: excluding the complementary matching loss ($\mathcal{L}_{CM}$) from the training process, and (ii) CAFE-GAN w.o. CAB: removing the complementary attention branch (CAB) from our discriminator. As shown in Fig. 8, without the complementary matching loss the generator has difficulty determining where to change, even though the discriminator can still extract AF and CAFE; some results from the model without CM show unwanted and over-emphasized changes. Excluding CAB leads to artifacts and unwanted changes in generated images because the discriminator has only limited spatial information about the attributes contained in the input image. We also measure the classification accuracy of each attribute with the pre-trained classifier used in Section 4.3; the results are listed in Table 2. In the absence of CM or CAB, the classification accuracy decreases for all attributes except Open mouth, and the model without CAB shows the lowest accuracy.

5 Conclusion

We introduced CAFE-GAN, which is based on complementary attention features and an attention mechanism for facial attribute editing. CAFE restricts facial editing to the parts of the image pertinent to the specified target attributes by exploiting the discriminator's ability to locate the spatial regions germane to those attributes. Performance of CAFE-GAN was compared with state-of-the-art methods via qualitative and quantitative studies. The proposed approach demonstrated improved performance over the state-of-the-art methods in most target attributes, and achieved significantly better results in some.

Figure 8: Qualitative results of CAFE-GAN variants.
Table 2: Attribute classification accuracy of ablation study.
Method   Bald   Bangs   Blond h.   Musta.   Gender   Pale s.   Aged   Open m.   Avg.
Ours 79.03 98.59 88.14 40.13 95.22 98.20 88.61 97.15 85.64
w.o. CM 61.68 97.46 87.01 39.78 85.93 92.38 86.39 97.56 81.02
w.o. CAB 32.43 91.45 70.41 36.13 81.93 92.22 65.95 94.72 70.66

Acknowledgment Authors (Jeong-gi Kwak and Hanseok Ko) of Korea University are supported by a National Research Foundation (NRF) grant funded by the MSIP of Korea (number 2019R1A2C2009480). David Han’s contribution is supported by the US Army Research Laboratory.

References

  • [1] Ak, K.E., Lim, J.H., Tham, J.Y., Kassim, A.A.: Attribute manipulation generative adversarial networks for fashion images. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
  • [2] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv:1701.07875 (2017)
  • [3] Chang, H., Lu, J., Yu, F., Finkelstein, A.: PairedCycleGAN: Asymmetric style transfer for applying and removing makeup. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • [4] Chen, Y.C., Lin, H., Shu, M., Li, R., Tao, X., Shen, X., Ye, Y., Jia, J.: Facelet-bank for fast portrait manipulation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • [5] Chen, Y.C., Shen, X., Lin, Z., Lu, X., Pao, I., Jia, J., et al.: Semantic component decomposition for face attribute manipulation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [6] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • [7] Fukui, H., Hirakawa, T., Yamashita, T., Fujiyoshi, H.: Attention branch network: Learning of attention mechanism for visual explanation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [8] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)
  • [9] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
  • [10] He, Z., Zuo, W., Kan, M., Shan, S., Chen, X.: AttGAN: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing (TIP) (2017)
  • [11] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [12] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation (2017)
  • [13] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [14] Kim, T., Kim, B., Cha, M., Kim, J.: Unsupervised visual attribute transfer with reconfigurable generative adversarial networks (2017)
  • [15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
  • [16] Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep laplacian pyramid networks for fast and accurate super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [17] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.: Fader networks: Manipulating images by sliding attributes. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
  • [18] Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [19] Li, M., Zuo, W., Zhang, D.: Deep identity-aware transfer of facial attributes (2016)
  • [20] Lin, M., Chen, Q., Yan, S.: Network in network. In: The International Conference on Learning Representations (ICLR) (2014)
  • [21] Liu, M., Ding, Y., Xia, M., Liu, X., Ding, E., Zuo, W., Wen, S.: STGAN: A unified selective transfer network for arbitrary image attribute editing. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [22] Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
  • [23] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • [24] Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
  • [25] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
  • [26] Park, J., Han, D.K., Ko, H.: Fusion of heterogeneous adversarial networks for single image dehazing. IEEE Transactions on Image Processing (TIP) (2020)
  • [27] Perarnau, G., Van De Weijer, J., Raducanu, B., Álvarez, J.M.: Invertible conditional GANs for image editing. arXiv:1611.06355 (2016)
  • [28] Pumarola, A., Agudo, A., Martinez, A.M., Sanfeliu, A., Moreno-Noguer, F.: Ganimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European conference on computer vision (ECCV). pp. 818–833 (2018)
  • [29] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 (2015)
  • [30] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
  • [31] Shen, W., Liu, R.: Learning residual images for face attribute manipulation (2017)
  • [32] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
  • [33] Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: SmoothGrad: removing noise by adding noise (2017)
  • [34] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net (2014)
  • [35] Upchurch, P., Gardner, J., Pleiss, G., Pless, R., Snavely, N., Bala, K., Weinberger, K.: Deep feature interpolation for image content changes. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [36] Wu, P.W., Lin, Y.J., Chang, C.H., Chang, E.Y., Liao, S.W.: Relgan: Multi-domain image-to-image translation via relative attributes. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
  • [37] Zhang, G., Kan, M., Shan, S., Chen, X.: Generative adversarial network with spatial attention for face attribute editing. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
  • [38] Zhao, B., Chang, B., Jie, Z., Sigal, L.: Modular generative adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
  • [39] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [40] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)