
Crossing-Domain Generative Adversarial Networks for Unsupervised Multi-Domain Image-to-Image Translation

Xuewen Yang, Stony Brook University, [email protected]; Dongliang Xie, Beijing University of Posts and Telecommunications, [email protected]; and Xin Wang, Stony Brook University, [email protected]
(2018)
Abstract.

State-of-the-art techniques in Generative Adversarial Networks (GANs) have shown remarkable success in image-to-image translation from a peer domain $X$ to a domain $Y$ using paired image data. However, obtaining abundant paired data is a non-trivial and expensive process in the majority of applications. When there is a need to translate images across $n$ domains, if the training is performed between every two domains, the complexity of the training increases quadratically. Moreover, training with data from only two domains at a time cannot benefit from the data of other domains, which prevents the extraction of more useful features and hinders the progress of this research area. In this work, we propose a general framework for unsupervised image-to-image translation across multiple domains, which can translate images from a domain $X$ to any other domain without requiring direct training between the two domains involved. A byproduct of the framework is a reduction of computing time and computing resources, since it needs less training time than training the domains in pairs as is done in state-of-the-art works. Our proposed framework consists of a pair of encoders along with a pair of GANs, which learn high-level features across different domains to generate diverse and realistic samples. Our framework shows competitive results on many image-to-image tasks compared with state-of-the-art techniques.

Generative Adversarial Nets, latent code, generative models, image-to-image translation
GAN, Image-to-image translation, Unsupervised learning, Neural networks
ccs: Computing methodologies: Image representations; Neural networks; Unsupervised learning

1. Introduction

In this work, we define multi-domain as multiple datasets, or several subsets of one dataset, that are applied to complete the same task but have different statistical biases. For example, images taken at the Alps in the summer and in the winter are considered two different domains, while faces with hair and faces with eyeglasses form another two different domains. Under this domain definition, for faces with black hair and faces with yellow hair, the black hair and yellow hair are two different attributes of the same domain. In multi-domain learning, each sample $\bm{x}$ is drawn from a domain-$d$-specific distribution $\bm{x}\sim p_{d}(\bm{x})$ and has a label $y\in\{0,1\}$, with $y=1$ signifying that $\bm{x}$ is from domain $d$ and $y=0$ signifying that $\bm{x}$ is not from domain $d$.

Image-to-image translation is the task of learning to map images from one domain to another, e.g., mapping grayscale images to color images (Cao et al., 2017), mapping images of low resolution to images of high resolution (Ledig et al., 2017), changing the seasons of scenery images (Zhu et al., 2017), and reconstructing photos from edge maps (Isola et al., 2017). The most significant improvement in this research field came with the application of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Liu and Tuzel, 2016).

Image-to-image translation can be performed in a supervised (Isola et al., 2017) or unsupervised (Zhu et al., 2017) way, with the unsupervised one becoming more popular since it does not require collecting ground-truth pairs of samples. Despite the quick progress of research on image-to-image translation, state-of-the-art results for unsupervised translation are still not satisfactory. In addition, existing research generally focuses on image-to-image translation between two domains, which is limited by two drawbacks. First, the translation task is specific to the two domains, and the model has to be retrained when there is a need to perform image translation between another pair of similar domains. Second, it cannot benefit from the features of multiple domains to improve the training quality. We take CycleGAN (Zhu et al., 2017), the most representative work in this research field, as an example to illustrate the first limitation. The translation between two image domains $X$ and $Y$ is achieved with two generators, $G_{X\rightarrow Y}$ and $G_{Y\rightarrow X}$. However, this model is inefficient for multi-domain image translation: to derive mappings across all $n$ domains, it has to train $n(n-1)$ generators, as shown in Fig. 1a.

Figure 1. Image-to-image translation across 4 domains. (a) CycleGAN needs $4\times 3$ generators. (b) Our model needs only 2 encoder-generator pairs. At every iteration, we randomly pick two domains and sample two batches of training data from them to train the model. The encoders $E_{X}$ and $E_{Y}$ first encode the domain information into a latent code $\bm{z}$, and the generators $G_{X}$ and $G_{Y}$ then generate samples of the two domains.

To enable more efficient multi-domain image translation with unsupervised learning, where image pairing across domains is not predefined, we propose the Crossing-Domain GAN (CD-GAN), a multi-domain encoding generative adversarial network that consists of a pair of encoders and a pair of generative adversarial networks (GANs). We would like the encoders to efficiently encode the information of all domains into a high-level feature space through an encoding process; images of different domains are then translated by decoding these high-level features through a decoding process. CD-GAN achieves this goal with the integrated use of three techniques. First, the two encoders are constrained by a weight-sharing scheme, where the two encoders (and likewise the two generators) share the same weights at both the highest-level and the lowest-level layers. This ensures that the two encoders encode common high-level semantics as well as low-level details to form the feature space, from which the generators can decode the high-level semantics and low-level details correctly to generate images of different domains. Second, we use a selected or existing label to guide the generator in generating images of the corresponding domain from the learnt high-level features. Third, we propose an efficient training algorithm that jointly trains the model across domains by randomly selecting two domains to train on at each iteration.

Different from (Liu and Tuzel, 2016), where only the weights at the high-level layers of the generators are shared, in CD-GAN we propose the concurrent sharing of the lowest-level and the highest-level layers at both the encoders and the generators to improve the quality of image translation between any two domains. Sharing the highest layers between the two encoders enables more flexible cross-domain image translation, while sharing the lowest layers across domains improves the training quality by taking advantage of transfer learning across domains.

The contributions of our work are as follows:

  • We propose CD-GAN that learns mappings across multiple domains using only two encoder-generator pairs.

  • We propose the concurrent use of weight sharing at the highest-level and lowest-level layers of both the encoders and the generators to ensure that CD-GAN generates images with sufficient high-level semantics and low-level details across all domains.

  • We leverage domain labels to enable conditional GAN training, which greatly improves the performance of the model.

  • We introduce a cross-domain training algorithm that efficiently and sufficiently trains the model by randomly taking samples from two of the domains at a time. CD-GAN can fully exploit data from all domains to improve the training quality for each individual domain.

Our experimental results demonstrate that when trained on more than two domains, our method achieves the same quality of image translation between any two domains as directly training for translation between the pair, while requiring much less training time, and it can generate better-quality images for a given amount of training time. We also show how CD-GAN can be successfully applied to a variety of unsupervised multi-domain image-to-image translation problems.

The remainder of this paper is organized as follows. Section 2 reviews the relevant research on image-to-image translation problems. Section 3 describes our model and training method in detail. Section 4 presents our evaluation metrics, experimental methodology, and the evaluation results of the model’s accuracy and efficiency on different datasets. Finally, we discuss some limitations of our work and conclude in Section 5.

2. Related Work

2.1. Generative Adversarial Networks (GANs)

GANs (Goodfellow et al., 2014) were introduced to model a data distribution using independent latent variables. Let $\bm{x}\sim p(\bm{x})$ be a random variable representing the observed data and $\bm{z}\sim p(\bm{z})$ be a latent variable. The observed variable is assumed to be generated by the latent variable, i.e., $\bm{x}\sim p_{\bm{\theta}}(\bm{x}|\bm{z})$, where $p_{\bm{\theta}}(\bm{x}|\bm{z})$ can be explicitly represented by a generator in GANs. GANs are built on top of neural networks, and can be trained with gradient-descent-based algorithms.

The GAN model is composed of a discriminator $D_{\bm{\phi}}$, along with the generator $G_{\bm{\theta}}$. The training involves a min-max game between the two networks. The discriminator $D_{\bm{\phi}}$ is trained to differentiate ‘fake’ samples produced by the generator $G_{\bm{\theta}}$ from the ‘real’ samples drawn from the true data distribution $p(\bm{x})$. The generator is trained to synthesize samples that fool the discriminator into mistaking the generated samples for genuine ones. Both can be implemented using neural networks.

In the training phase, the discriminator parameters $\bm{\phi}$ are updated first, followed by the update of the generator parameters $\bm{\theta}$. The objective function is given by:

(1) $\min_{\bm{\theta}}\max_{\bm{\phi}}V(D,G)=\mathbb{E}_{\bm{x}\sim p(\bm{x})}[\log D_{\bm{\phi}}(\bm{x})]+\mathbb{E}_{\bm{z}\sim p(\bm{z})}[\log(1-D_{\bm{\phi}}(G_{\bm{\theta}}(\bm{z})))]$

Samples can be generated by sampling $\bm{z}\sim p(\bm{z})$ and then computing $\hat{\bm{x}}=G_{\bm{\theta}}(\bm{z})$, where $p(\bm{z})$ is a prior distribution, for example a multivariate Gaussian.
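To make the min-max game of Eq. (1) concrete, the following sketch (our own PyTorch-style illustration, not code released with this paper; `G_theta`, `D_phi`, the optimizers, and the latent dimension are assumed to be defined elsewhere, and `D_phi` is assumed to output probabilities in (0,1)) performs one alternating update: the discriminator step first, then the generator step.

```python
import torch

def gan_training_step(G_theta, D_phi, x_real, opt_D, opt_G, z_dim=128):
    """One alternating update of Eq. (1): D is updated first, then G."""
    batch = x_real.size(0)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))),
    # i.e., minimize the negated objective.
    z = torch.randn(batch, z_dim)                  # z ~ p(z), a Gaussian prior
    x_fake = G_theta(z).detach()                   # block gradients into G
    d_loss = -(torch.log(D_phi(x_real)).mean()
               + torch.log(1.0 - D_phi(x_fake)).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: minimize log(1 - D(G(z))).
    z = torch.randn(batch, z_dim)
    g_loss = torch.log(1.0 - D_phi(G_theta(z))).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```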

2.2. Image-To-Image Translation

The image-to-image translation problem is an image generation task in which, given an input image $\bm{x}$ of domain $X$, the model maps it into a corresponding output image $\bm{y}$ of another domain $Y$. It learns a mapping between two domains given sufficient training data (Isola et al., 2017). Early works on image-to-image translation mainly focused on tasks where the training data of domain $X$ are similar to the data of domain $Y$ (Gupta et al., 2012; Liu et al., 2008), and the results were often unrealistic and not diverse.

In recent years, deep generative models have shown increasing capability of synthesizing diverse, realistic images that capture both the fine-grained details and the global coherence of natural images (Gregor et al., 2015; Radford et al., 2015; Kingma and Welling, 2014). With Generative Adversarial Networks (GANs) (Isola et al., 2017; Zhu et al., 2017; Kim et al., 2017), recent studies have already taken significant steps in image-to-image translation. In (Isola et al., 2017), the authors use a conditional GAN for different image-to-image translation tasks, such as synthesizing photos from label maps and reconstructing objects from edge maps. However, this method requires input-output image pairs for training, which are in general not available in image-to-image translation problems. For situations where such training pairs are not given, the authors of (Zhu et al., 2017) proposed CycleGAN to tackle unsupervised image-to-image translation. With a pair of generators $G$ and $F$, the model not only learns a mapping $G:X\rightarrow Y$ using an adversarial loss, but also constrains this mapping with an inverse mapping $F:Y\rightarrow X$. It further introduces a cycle consistency loss to enforce $F(G(X))\approx X$, and vice versa. In settings where paired training data are not available, the authors showed promising qualitative results. The authors of (Kim et al., 2017) and (Yi et al., 2017) use a similar idea to solve unsupervised image-to-image translation tasks.

These approaches only tackle the problem of translating images between two domains, and they have two major drawbacks. First, when applied to $n$ domains, these approaches need $n(n-1)$ generators to complete the task, which is computationally inefficient. Training all the models would either take a significant amount of time on a single GPU or require substantial hardware and computing resources when run over multiple GPUs. Second, as each model is trained with only two datasets, the training cannot benefit from the data of other domains.

Our work is inspired by multimodal learning (Ngiam et al., 2011), which shows that data features can be better extracted using one modality if multiple modalities are present at feature learning time. The intuition of our method is that if we can encode the information of different domains together into a high-level feature space, it becomes possible to decode the high-level features to build images of different domains. In this work, rather than generating images from random noise, we incorporate an encoding process into a GAN model. Image-to-image translation is achieved by first encoding real images into high-level features, and then generating images of different domains from these high-level features through a decoding process. The encoding and decoding processes are constrained by a weight-sharing technique in which both the highest layer and the lowest layer are shared across the two encoders as well as the two generators. Sharing the high-level layers ensures that the generated images are semantically correct, while sharing the low-level layers ensures that important low-level features are captured and transferred between domains. Our model is trained end-to-end using data from all $n$ domains.

3. Cross-Domain Generative Adversarial Network

To conduct unsupervised multi-domain image-to-image translation, a direct approach is to train a CycleGAN for every two domains. While this approach is straightforward, it is inefficient, as the number of training models increases quadratically with the number of domains. If we have $n$ domains, we have to train $n(n-1)$ generators, as shown in Fig. 1a. In addition, since each model only utilizes data from two domains $X$ and $Y$, the training cannot benefit from the useful features of other domains.

To tackle these two problems, a possible way is to encode the useful information of all domains into common high-level features, and then to decode the high-level features into images of different domains. Inspired by work on multimodal learning (Ngiam et al., 2011), where training data come from multiple modalities, we propose to build a multi-domain image translation model that encodes the information of multiple domains into a set $Z$ of high-level features, and then uses the features in $Z$ to reconstruct data of different domains or to perform image-to-image translation. An overview of the model applied to 4 domains is shown in Fig. 1b, where only one model is used.

In this section, we first present our proposed CD-GAN model, then describe how image translation can be performed across domains, and finally introduce our cross-domain training method.

3.1. CD-GAN with Double Layer Sharing

We first describe how to apply our model to multi-domain image-to-image translation in general, and then illustrate it using two domains as an example. As shown in Fig. 2a, our proposed CD-GAN model consists of a pair of encoders followed by a pair of GANs. Taking domains $X$ and $Y$ as an example, the two encoders $E_{X}$ and $E_{Y}$ encode the domain information from $X$ and $Y$ into a set $Z$ of high-level features. Then, from a high-level feature $\bm{z}$ in the space $Z$, we can generate images that fall into domain $X$ or $Y$. The generated images are evaluated by the corresponding discriminators $D_{X}$ and $D_{Y}$ to see whether they look real and cannot be identified as generated ones. For example, following the red arrows, the input image $\bm{x}$ is first encoded into a high-level feature $\bm{z}_{x}$, then $\bm{z}_{x}$ is decoded to generate the image $\hat{\bm{y}}$. The image $\hat{\bm{y}}$ is the translated image in domain $Y$. A similar process exists for image $\bm{y}$.

Our model is also constrained by a reconstruction process, shown in Fig. 2b. For example, following the red arrows, the input image $\bm{x}$ is first encoded into a high-level feature $\bm{z}_{x}$, then $\bm{z}_{x}$ is decoded to generate the image $\bm{x}'$, which is a reconstruction of the input image. A similar process exists for image $\bm{y}$.

Learning with deep neural networks involves hierarchical feature representation. To support flexible cross-domain image translation and to improve the training quality, we propose double-layer sharing, where the highest-level and the lowest-level layers of the two encoders share the same weights, as do those of the two generators. By enforcing the layers that decode high-level features in the GANs to share weights, the images generated by different generators can have some common high-level semantics. The layers that decode low-level details then map the high-level features to images in the individual domains.

Sharing the weights of the low-level layers has the benefit of transferring low-level features from one domain to the other, thus making the translated images closer to real images in the respective domains. Besides, sharing layers reduces the complexity of the model, making it more resistant to overfitting.
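One way to realize this double-layer sharing is to build the two encoders around literally shared first and last modules, so that their parameters are the same tensors. The sketch below is our own minimal PyTorch illustration; the layer shapes are placeholders and do not reproduce the exact architecture reported later in Table 2.

```python
import torch.nn as nn

class SharedEncoders(nn.Module):
    """Two domain encoders that share their lowest- and highest-level layers."""
    def __init__(self):
        super().__init__()
        # Lowest-level layer: shared, captures low-level details common to all domains.
        self.shared_low = nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3)
        # Middle layers: domain-specific.
        self.mid_X = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True))
        self.mid_Y = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True))
        # Highest-level layer: shared, encodes high-level semantics common to all domains.
        self.shared_high = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)

    def encode_X(self, x):
        return self.shared_high(self.mid_X(self.shared_low(x)))

    def encode_Y(self, y):
        return self.shared_high(self.mid_Y(self.shared_low(y)))
```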

3.2. Conditional Image Generation

In state-of-the-art techniques such as CycleGAN, each domain is handled by a specific generator, so there is no need to inform the generator which domain the input image should be translated to. In our model, however, multiple domains share two generators. For an input image, we have to include an auxiliary variable to guide the generation of an image for a specific domain. The only information we have is the domain labels. To make use of this information, the inputs of the model are not images $\bm{x}$, $\bm{y}$, but image-label pairs $(\bm{x},\bm{l}_{y})$ and $(\bm{y},\bm{l}_{x})$, where the labels $\bm{l}_{y}$ and $\bm{l}_{x}$ inform the generators which domain to generate an image for. These pairs are not the same as the image pairs $(\bm{x},\bm{y})$ of supervised image-to-image translation tasks. Thus, no matter which domain the input images come from, the model can always generate images of a domain of interest.
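The paper does not pin down the exact conditioning mechanism, so as one plausible reading, the sketch below (ours) broadcasts a one-hot domain label spatially and concatenates it with the latent code before it enters the generator.

```python
import torch

def condition_on_label(z, label_idx, n_domains):
    """Concatenate a spatially broadcast one-hot domain label with the latent code z.

    z: latent feature map of shape (B, C, H, W).
    label_idx: (B,) long tensor of target-domain indices.
    """
    b, _, h, w = z.shape
    one_hot = torch.zeros(b, n_domains, device=z.device)
    one_hot.scatter_(1, label_idx.unsqueeze(1), 1.0)             # (B, n_domains)
    label_map = one_hot.view(b, n_domains, 1, 1).expand(b, n_domains, h, w)
    return torch.cat([z, label_map], dim=1)                      # input to G(z, l)
```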

Figure 2. The proposed CD-GAN model. (a) The translation mappings: the input image $\bm{x}$ is first encoded as a latent code $\bm{z}_{x}$ through $E_{X}(\bm{x})$, which is then decoded into a translated image $\hat{\bm{y}}$ through $G_{Y}(\bm{z}_{x},\bm{l}_{y})$. The process is indicated by red arrows. There is a similar process for the image $\bm{y}$. $D_{X}$ and $D_{Y}$ are adversarial discriminators for the respective domains that evaluate whether the translated images are realistic. (b) The reconstruction mappings: the input image $\bm{x}$ is first encoded as a latent code $\bm{z}_{x}$ through $E_{X}(\bm{x})$, which is then decoded into a reconstructed image $\bm{x}'$ through the generator $G_{X}(\bm{z}_{x},\bm{l}_{x})$. The process is indicated by red arrows. A similar process exists for image $\bm{y}$. Note: the dashed lines indicate that the two layers share the same parameters.

We denote the data distributions as $\bm{x}\sim p(\bm{x})$ and $\bm{y}\sim p(\bm{y})$. As illustrated in Fig. 2, our model includes four mappings: two translation mappings $X\rightarrow Z\rightarrow Y$ and $Y\rightarrow Z\rightarrow X$, and two reconstruction mappings $X\rightarrow Z\rightarrow X$ and $Y\rightarrow Z\rightarrow Y$. The translation mappings constrain the model with a GAN loss, while the reconstruction mappings constrain the model with a reconstruction loss. To further constrain the auxiliary variable, we introduce a classification loss by applying a classifier that assigns real or generated images to different domains. The intuition is that if an image is generated under the guidance of the auxiliary variable, it should be correctly classified into the domain specified by that variable. Next, we introduce these losses in more detail.

GAN Losses Following the translation mapping $X\rightarrow Z\rightarrow Y$, we can translate an image $\bm{x}$ from domain $X$ to $\hat{\bm{y}}$ of domain $Y$ using $\bm{z}_{x}=E_{X}(\bm{x})$, $\hat{\bm{y}}=G_{Y}(\bm{z}_{x},\bm{l}_{y})$. To improve the quality of the generated samples, we apply an adversarial loss. We express the objective as:

(2) $\mathcal{L}_{GAN_{Y}}=\mathbb{E}_{\bm{y}\sim p(\bm{y})}\log(D_{Y}(\bm{y}))+\mathbb{E}_{\bm{x}\sim p(\bm{x})}\log(1-D_{Y}(G_{Y}(E_{X}(\bm{x}),\bm{l}_{y})))$

where $G_{Y}$ tries to generate images $\hat{\bm{y}}=G_{Y}(\bm{z}_{x},\bm{l}_{y})$ that look similar to images from domain $Y$, while $D_{Y}$ aims to distinguish between translated samples $\hat{\bm{y}}$ and real samples $\bm{y}$. The corresponding adversarial loss for $Y\rightarrow Z\rightarrow X$ is

(3) $\mathcal{L}_{GAN_{X}}=\mathbb{E}_{\bm{x}\sim p(\bm{x})}\log(D_{X}(\bm{x}))+\mathbb{E}_{\bm{y}\sim p(\bm{y})}\log(1-D_{X}(G_{X}(E_{Y}(\bm{y}),\bm{l}_{x})))$

The total GAN loss is:

(4) $\mathcal{L}_{GAN}=\mathcal{L}_{GAN_{X}}+\mathcal{L}_{GAN_{Y}}$
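Written as code, Eqs. (2)-(4) amount to the following sketch (ours; the encoders, generators, and discriminators are assumed to be modules defined elsewhere, and the discriminators are assumed to output probabilities):

```python
import torch

def gan_losses(E_X, E_Y, G_X, G_Y, D_X, D_Y, x, y, l_x, l_y):
    """Adversarial losses of Eqs. (2)-(4), written from the discriminators' viewpoint."""
    y_fake = G_Y(E_X(x), l_y)        # translate x from domain X into domain Y
    x_fake = G_X(E_Y(y), l_x)        # translate y from domain Y into domain X
    loss_gan_Y = (torch.log(D_Y(y)).mean()
                  + torch.log(1.0 - D_Y(y_fake)).mean())
    loss_gan_X = (torch.log(D_X(x)).mean()
                  + torch.log(1.0 - D_X(x_fake)).mean())
    return loss_gan_X + loss_gan_Y   # Eq. (4): maximized by D, minimized by E and G
```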

Reconstruction Loss The reconstruction mappings $X\rightarrow Z\rightarrow X$ and $Y\rightarrow Z\rightarrow Y$ encourage the model to encode enough information from each domain into the high-level feature space $Z$ so that the input can be reconstructed by the generators. The reconstruction process of domain $X$ is $\bm{z}_{x}=E_{X}(\bm{x})$, $\bm{x}'=G_{X}(\bm{z}_{x},\bm{l}_{x})$. A similar reconstruction process exists for domain $Y$. With the $l_{2}$ distance as the loss function, the reconstruction loss is:

(5) $\mathcal{L}_{rec}=\mathbb{E}_{\bm{x}\sim p(\bm{x})}(||\bm{x}-G_{X}(E_{X}(\bm{x}),\bm{l}_{x})||_{2})+\mathbb{E}_{\bm{y}\sim p(\bm{y})}(||\bm{y}-G_{Y}(E_{Y}(\bm{y}),\bm{l}_{y})||_{2})$

Latent Consistency Loss With only the above losses, the encoding part is not well constrained, so we constrain it using a latent consistency loss. Although $\bm{x}$ is translated to $\hat{\bm{y}}$, which is in domain $Y$, $\hat{\bm{y}}$ is still semantically similar to $\bm{x}$. Thus, in the latent space $Z$, the high-level feature of $\bm{x}$ should be close to that of $\hat{\bm{y}}$. Similarly, the high-level feature of $\bm{y}$ in domain $Y$ should be close to that of $\hat{\bm{x}}$ in domain $X$. The latent consistency loss is the following:

(6) $\mathcal{L}_{lcl}=\mathbb{E}_{\bm{x}\sim p(\bm{x})}(||E_{X}(\bm{x})-E_{Y}(G_{Y}(E_{X}(\bm{x}),\bm{l}_{y}))||)+\mathbb{E}_{\bm{y}\sim p(\bm{y})}(||E_{Y}(\bm{y})-E_{X}(G_{X}(E_{Y}(\bm{y}),\bm{l}_{x}))||)$

Classification Loss We consider the $n$ domains as $n$ categories in a classification problem. We use a network $C$, an auxiliary classifier built on top of the general discriminator $D$, to measure whether a sample (real or generated) belongs to a specific fine-grained category. The output of the classifier $C$ represents the posterior probability $P(c|\bm{x})$. Specifically, there are four classification terms: for the real data $\bm{x}$, $\bm{y}$, and for the generated data $\hat{\bm{x}}$, $\hat{\bm{y}}$. For image-label pairs $(\bm{x},\bm{l}_{x})$ and $(\bm{y},\bm{l}_{y})$ with $\bm{l}_{x}\sim p(\bm{l}_{x})$ and $\bm{l}_{y}\sim p(\bm{l}_{y})$, our goal is to translate $\bm{x}$ to $\hat{\bm{y}}$ with label $\bm{l}_{y}$, and to translate $\bm{y}$ to $\hat{\bm{x}}$ with label $\bm{l}_{x}$. The classification loss sums the four terms:

(7) $\mathcal{L}_{c}=-\mathbb{E}_{\bm{x}\sim p(\bm{x}),\bm{l}_{x}\sim p(\bm{l}_{x})}[\log P(\bm{l}_{x}|\bm{x})]-\mathbb{E}_{\bm{y}\sim p(\bm{y}),\bm{l}_{y}\sim p(\bm{l}_{y})}[\log P(\bm{l}_{y}|\bm{y})]-\mathbb{E}_{\bm{x}\sim p(\bm{x}),\bm{l}_{y}\sim p(\bm{l}_{y})}[\log P(\bm{l}_{y}|G_{Y}(E_{X}(\bm{x}),\bm{l}_{y}))]-\mathbb{E}_{\bm{y}\sim p(\bm{y}),\bm{l}_{x}\sim p(\bm{l}_{x})}[\log P(\bm{l}_{x}|G_{X}(E_{Y}(\bm{y}),\bm{l}_{x}))]$

This loss can be used to optimize the discriminators $D_{X}$, $D_{Y}$, the generators $G_{X}$, $G_{Y}$, and the encoders $E_{X}$, $E_{Y}$.

Cycle Consistency Loss Although minimizing the GAN losses ensures that $G_{Y}(E_{X}(\bm{x}),\bm{l}_{y})$ produces a sample $\hat{\bm{y}}$ similar to samples drawn from $Y$, the model can still be unstable and prone to failure. To tackle this problem, we further constrain our model with a cycle-consistency loss (Zhu et al., 2017): mapping from domain $X$ to domain $Y$ and then back to domain $X$ should reproduce the original sample, i.e., $G_{X}(E_{Y}(G_{Y}(E_{X}(\bm{x}),\bm{l}_{y})),\bm{l}_{x})\approx\bm{x}$ and $G_{Y}(E_{X}(G_{X}(E_{Y}(\bm{y}),\bm{l}_{x})),\bm{l}_{y})\approx\bm{y}$. Thus, the cycle-consistency loss is:

(8) $\mathcal{L}_{cyc}=\mathbb{E}_{\bm{x}\sim p(\bm{x})}[||G_{X}(E_{Y}(G_{Y}(E_{X}(\bm{x}),\bm{l}_{y})),\bm{l}_{x})-\bm{x}||]+\mathbb{E}_{\bm{y}\sim p(\bm{y})}[||G_{Y}(E_{X}(G_{X}(E_{Y}(\bm{y}),\bm{l}_{x})),\bm{l}_{y})-\bm{y}||]$

Final Objective of CD-GAN To sum up, the goal of our approach is to minimize the following objective:

(9) $\mathcal{L}(E,G,D)=\mathcal{L}_{GAN}+\alpha_{0}\mathcal{L}_{rec}+\alpha_{1}\mathcal{L}_{lcl}+\alpha_{2}\mathcal{L}_{c}+\alpha_{3}\mathcal{L}_{cyc}$

where $E$, $G$, and $D$ denote the encoders $E_{X}$, $E_{Y}$, the generators $G_{X}$, $G_{Y}$, and the discriminators $D_{X}$, $D_{Y}$, and $\alpha_{0}$, $\alpha_{1}$, $\alpha_{2}$, $\alpha_{3}$ control the relative importance of the losses. As with a regular GAN, training the model involves solving a min-max problem, where $E_{X}$, $E_{Y}$, $G_{X}$, and $G_{Y}$ aim to minimize the objective, while $D_{X}$ and $D_{Y}$ aim to maximize it:

(10) $E^{\ast},G^{\ast}=\arg\min_{E,G}\max_{D}\mathcal{L}(E,G,D)$
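To make the full objective of Eq. (9) concrete, the sketch below (our own illustration, not the authors' implementation) assembles the reconstruction, latent consistency, classification, and cycle terms for one batch. It assumes the classifier `C` returns log-probabilities over the $n$ domains, that `l_x` and `l_y` are (B, 1) long tensors of domain indices also accepted by the generators for conditioning, and it uses the $l_1$ norm where Eqs. (6) and (8) leave the norm unspecified.

```python
import torch
import torch.nn.functional as F

def cd_gan_objective(E_X, E_Y, G_X, G_Y, C, x, y, l_x, l_y,
                     gan_loss, a0=10.0, a1=0.1, a2=0.1, a3=10.0):
    """Assemble Eq. (9) for one batch; `gan_loss` is the adversarial term of Eq. (4)."""
    z_x, z_y = E_X(x), E_Y(y)
    y_hat, x_hat = G_Y(z_x, l_y), G_X(z_y, l_x)          # translations
    x_rec, y_rec = G_X(z_x, l_x), G_Y(z_y, l_y)          # reconstructions

    # Eq. (5): l2 reconstruction loss.
    L_rec = F.mse_loss(x_rec, x) + F.mse_loss(y_rec, y)
    # Eq. (6): latent consistency between a sample and its translation.
    L_lcl = (E_Y(y_hat) - z_x).abs().mean() + (E_X(x_hat) - z_y).abs().mean()
    # Eq. (7): negative log-likelihood of the correct domain for real and fake images.
    L_c = -(C(x).gather(1, l_x).mean() + C(y).gather(1, l_y).mean()
            + C(y_hat).gather(1, l_y).mean() + C(x_hat).gather(1, l_x).mean())
    # Eq. (8): cycle consistency X -> Y -> X and Y -> X -> Y.
    L_cyc = ((G_X(E_Y(y_hat), l_x) - x).abs().mean()
             + (G_Y(E_X(x_hat), l_y) - y).abs().mean())

    return gan_loss + a0 * L_rec + a1 * L_lcl + a2 * L_c + a3 * L_cyc
```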

3.3. Cross-Domain Training

Our proposed model has two encoder-generator pairs, but we have data from $n$ domains. To train the model on samples from all domains equally, we introduce a cross-domain training algorithm. As shown in Fig. 1b, there are 4 domains. At each iteration, we randomly select two domains $R$ and $S$ and feed training data from these two domains into the model. At the next iteration, we might take another two domains $P$ and $Q$ and perform the same training. In this way the model is trained on data samples from all 4 domains at every epoch, over several iterations. The training procedure is shown in Algorithm 1. Cross-domain training ensures that the model learns a generic feature representation of all domains by training it equally on the independent domains.

Algorithm 1 Joint domain training on CD-GAN using mini-batch stochastic gradient descent
  Require: Training samples from $n$ domains
  Initialize $\bm{\theta}_{E}^{X}$, $\bm{\theta}_{E}^{Y}$, $\bm{\theta}_{G}^{X}$, $\bm{\theta}_{G}^{Y}$, $\bm{\theta}_{D}^{X}$, and $\bm{\theta}_{D}^{Y}$ with the shared network connection weights set to the same values.
  while Training loss has not converged do
     Randomly draw two domains $X$ and $Y$ from the $n$ domains
     Randomly draw $N$ samples from the two domains, $\{\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{N};\bm{y}_{1},\bm{y}_{2},\ldots,\bm{y}_{N}\}$
     Get the domain labels of the samples from the two domains, $\{\bm{l}_{X}^{i},\bm{l}_{Y}^{i}\}_{i=1}^{N}$
     (1) Update $D_{X},D_{Y}$ with fixed $G_{X},G_{Y},E_{X},E_{Y}$
     Generate fake samples from the real ones:
     $\hat{\bm{x}}_{i}=G_{X}(E_{Y}(\bm{y}_{i}),\bm{l}_{x}^{i})$, $\hat{\bm{y}}_{i}=G_{Y}(E_{X}(\bm{x}_{i}),\bm{l}_{y}^{i})$, $i=1\ldots N$
     Update $\bm{\theta}_{D}=(\bm{\theta}_{D}^{X},\bm{\theta}_{D}^{Y})$ according to the following gradients:
     $\nabla_{\theta_{D}}\Big[\frac{1}{N}\sum_{i=1}^{N}\big[-\log D_{X}(\bm{x}_{i})-\log(1-D_{X}(\hat{\bm{x}}_{i}))-\log D_{Y}(\bm{y}_{i})-\log(1-D_{Y}(\hat{\bm{y}}_{i}))+\alpha_{2}[\log P(\bm{l}_{x}|\bm{x}_{i})+\log P(\bm{l}_{y}|\bm{y}_{i})]\big]\Big]$
     (2) Update $E_{X},E_{Y},G_{X},G_{Y}$ with fixed $D_{X},D_{Y}$
     Update $\bm{\theta}_{E,G}=(\bm{\theta}_{E}^{X},\bm{\theta}_{E}^{Y},\bm{\theta}_{G}^{X},\bm{\theta}_{G}^{Y})$ according to the following gradients:
     $\nabla_{\theta_{E,G}}\Big[\frac{1}{N}\sum_{i=1}^{N}\big[\log(1-D_{X}(\hat{\bm{x}}_{i}))+\log(1-D_{Y}(\hat{\bm{y}}_{i}))+||\bm{x}_{i}-G_{X}(E_{X}(\bm{x}_{i}),\bm{l}_{x}^{i})||_{2}+||\bm{y}_{i}-G_{Y}(E_{Y}(\bm{y}_{i}),\bm{l}_{y}^{i})||_{2}+||E_{X}(\bm{x}_{i})-E_{Y}(\hat{\bm{y}}_{i})||+||E_{Y}(\bm{y}_{i})-E_{X}(\hat{\bm{x}}_{i})||+\log P(\bm{l}_{x}|\hat{\bm{x}}_{i})+\log P(\bm{l}_{y}|\hat{\bm{y}}_{i})+\alpha\big[||\bm{x}_{i}-G_{X}(E_{Y}(\hat{\bm{y}}_{i}),\bm{l}_{x}^{i})||+||\bm{y}_{i}-G_{Y}(E_{X}(\hat{\bm{x}}_{i}),\bm{l}_{y}^{i})||\big]\big]\Big]$
  end while
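A compressed view of Algorithm 1's outer loop is sketched below (ours; `loaders` is an assumed dictionary mapping each domain name to a data loader, and `update_discriminators` / `update_encoders_generators` stand in for steps (1) and (2) of the algorithm).

```python
import random

def cross_domain_training(loaders, update_discriminators, update_encoders_generators,
                          n_epochs=100):
    """Outer loop of Algorithm 1: each iteration trains on a random pair of domains."""
    domains = list(loaders.keys())
    for _ in range(n_epochs):
        iters = {d: iter(loaders[d]) for d in domains}
        for _ in range(min(len(loaders[d]) for d in domains)):
            d_X, d_Y = random.sample(domains, 2)        # randomly draw two domains
            x_batch = next(iters[d_X], None)
            y_batch = next(iters[d_Y], None)
            if x_batch is None or y_batch is None:
                break                                   # a domain ran out of samples
            # Step (1): update D_X, D_Y with E and G fixed.
            update_discriminators(x_batch, y_batch, d_X, d_Y)
            # Step (2): update E_X, E_Y, G_X, G_Y with D fixed.
            update_encoders_generators(x_batch, y_batch, d_X, d_Y)
```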

4. Experiments

In this section, we conduct experiments over three datasets to compare our proposed model with reference models in terms of image translation quality and efficiency.

4.1. Datasets

To evaluate the scalability and effectiveness of our model, we test it on a variety of multi-domain image-to-image translation tasks using the following datasets:

Alps Seasons dataset (Anoosheh et al., 2017) is collected from images on Flickr. The images are categorized into four seasons based on the provided timestamp of when each image was taken. It consists of four categories: Spring, Summer, Fall, and Winter. The training data consist of 6053 images across the four seasons, while the test data consist of 400 images.

Painters dataset (Zhu et al., 2017) includes painting images of four artists Monet, Van Gogh, Cezanne, and Ukiyo-e. We use 2851 images as the training set, and 200 images as the test set.

CelebA dataset (Liu et al., 2015) contains ten thousand identities, each of which has twenty images, i.e., two hundred thousand images in total. Each image in CelebA is annotated with 40 face attributes. We resize the initial $178\times 218$ images to $256\times 256$. We randomly select 4000 images as the test set and use all remaining images as training data.

We run all experiments on an Ubuntu system with an Intel i7-6850K CPU and a single NVIDIA GTX 1080Ti GPU.

4.2. Reference Models

We compare the performance of our proposed CD-GAN with that of four reference models:

CycleGAN (Zhu et al., 2017) This method trains two generators $G:X\rightarrow Y$ and $F:Y\rightarrow X$ in parallel. It not only applies a standard GAN loss for each of $X$ and $Y$, but also applies forward and backward cycle consistency losses, which ensure that an image $\bm{x}$ from domain $X$ translated to an image of domain $Y$ can be translated back to domain $X$, and vice versa.

DualGAN (Yi et al., 2017) This method uses a dual-GAN mechanism, which consists of a primal GAN and a dual GAN. The primal GAN learns to translate images from domain $X$ to domain $Y$, while the dual GAN learns to invert the task. Images from either domain can be translated and then reconstructed, so a reconstruction loss can be used to train the model.

UNIT (Liu et al., 2017) This method consists of two VAE-GANs with a fully shared latent space. To complete the task of image-to-image translation between $n$ domains, it needs to be trained $\frac{n(n-1)}{2}$ times.

DB (Hui et al., [n. d.]) This method addresses the multi-domain image-to-image translation problem by introducing $n$ domain-specific encoders/decoders to learn a universal shared latent space.

4.3. Evaluation Metrics

Evaluating the quality of synthesized images is challenging (Salimans et al., 2016). Recent works have used pre-trained semantic classifiers to measure the realism and discriminability of the generated images. The idea is that if the generated images look close to real ones, classifiers trained on real images will be able to classify the synthesized images correctly as well. Following (Zhang et al., 2016; Isola et al., 2017; Wang and Gupta, 2016), we use classification accuracy to evaluate the models quantitatively. For each experiment, we generate a sufficient number of images of different domains, then use a classifier pre-trained on the training dataset to classify them into different domains and compute the classification accuracy.
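Operationally, the metric is the ordinary top-1 accuracy of the pre-trained classifier on generated images, measured against the domain each image was synthesized for. A minimal sketch (ours; `classifier` is assumed to return per-domain logits and `generated` to yield batches of images with their target-domain indices):

```python
import torch

@torch.no_grad()
def classification_accuracy(classifier, generated):
    """Fraction of generated images that the pre-trained classifier assigns
    to the domain they were synthesized for."""
    correct, total = 0, 0
    for images, target_domains in generated:
        preds = classifier(images).argmax(dim=1)
        correct += (preds == target_domains).sum().item()
        total += target_domains.numel()
    return correct / max(total, 1)
```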

4.4. Network Architecture and Implementation

Designing the architecture is always a difficult task (Radford et al., 2015). To obtain a proper model architecture, we adopt the discriminator architecture from (Isola et al., 2017), which has proven effective in most image-to-image generation tasks; it has 6 convolutional layers. We keep the discriminator architecture fixed and vary the architectures of the encoders and generators. Following the design of the generator architectures in (Isola et al., 2017), we use two types of layers: regular convolutional layers and basic residual blocks (He et al., 2016). Since the encoding process is the inverse of the decoding process, we use the same layers for both but put them in the inverse order. The only difference is the first layer of the encoder and the last layer of the generator: we apply 64 channels (corresponding to different filters) for the first layer of the encoders, but 3 channels for the last layer of the generators since the output images have only 3 RGB channels. We gradually change the number of convolutional layers and the number of residual blocks until we get a satisfactory architecture; no weight sharing is applied at this stage. The performance of the different architectures, evaluated on the Painters dataset, is shown in Fig. 3. When the model has 3 regular convolutional layers and 4 basic residual blocks, it performs best. We keep this architecture fixed for the other datasets.

Figure 3. Classification accuracy for varying numbers of residual blocks and convolutional layers.

We then vary the number of weight-sharing layers in the encoders and the generators from 1 to 4. Sharing 1 layer means sharing the highest layer and the lowest layer of the encoder pair; sharing 2 layers means sharing the highest and lowest two layers. The same sharing scheme applies to the generator pair (not including the output layer). The results are shown in Table 1. We found that sharing 1 layer is enough to achieve good performance.

Table 1. Classification accuracy on number of shared layers in encoders and generators.
# of shared layers acc. % (Painters) acc. % (Alps Seasons)
0 49.75 29.95
1 52.54 33.78
2 52.81 33.54
3 51.13 33.06

In summary, for the testbed evaluation, we use two encoders, each consisting of 3 convolutional layers and 4 basic residual blocks. The generators are composed of 4 basic residual blocks and 3 fractional-strided convolutional layers. The discriminators consist of a stack of 6 convolutional layers. We use LeakyReLU for nonlinearity. The two encoders share the same parameters on their layers 1 and 7, while the two generators share the same parameters on layers 1 and 6, where layer 6 is the lowest-level layer before the output layer. The details of the networks are given in Table 2; we fix this architecture for all the evaluations.

Table 2. Network architecture for the multi-domain unsupervised image-to-image translation experiments. $cXkYsZ$ denotes a Convolution-InstanceNorm-ReLU layer with $X$ filters, kernel size $Y$, and stride $Z$. $Rm$ denotes a residual block that contains two $3\times 3$ convolutional layers with the same number of filters on both layers. $un$ denotes a $3\times 3$ fractional-strided-Convolution-InstanceNorm-ReLU layer with $n$ filters and stride $\frac{1}{2}$. $n_{d}$ denotes the number of domains. $Y$ and $N$ denote whether the layer is shared or not.
Layer Encoders Generators Discriminators
1 c64k7s1(Y) R256(Y) c64k3s2(N)
2 c128k3s2(N) R256(N) c128k3s2(N)
3 c256k3s2(N) R256(N) c256k3s2(N)
4 R256(N) R256(N) c512k3s2(N)
5 R256(N) u256(N) c1024k3s2(N)
6 R256(N) u128(Y) c(1+n_d)k2s1(N)
7 R256(Y) u3(N)
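Read as PyTorch modules, the encoder column of Table 2 corresponds roughly to the following sketch (our interpretation of the table, not released code; the padding choices and the internal layout of the residual block are assumptions).

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k, s):
    """cXkYsZ: Convolution-InstanceNorm-ReLU with out_ch filters, kernel k, stride s."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True))

class ResidualBlock(nn.Module):
    """Rm: two 3x3 convolutions with m filters and an additive skip connection."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(conv_block(ch, ch, 3, 1),
                                  nn.Conv2d(ch, ch, 3, 1, 1),
                                  nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

# Encoder column of Table 2: c64k7s1, c128k3s2, c256k3s2, then four R256 blocks
# (layer 1 and layer 7 are the ones shared between the two encoders).
encoder = nn.Sequential(
    conv_block(3, 64, 7, 1),                     # layer 1 (shared)
    conv_block(64, 128, 3, 2),                   # layer 2
    conv_block(128, 256, 3, 2),                  # layer 3
    *[ResidualBlock(256) for _ in range(4)])     # layers 4-7 (layer 7 shared)
```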

We use ADAM (Kingma and Ba, 2014) for training, with the learning rate set to 0.0001 and momentums set to 0.5 and 0.999. Each mini-batch consists of one image from domain $X$ and one image from domain $Y$. Our model has several hyper-parameters; the default values are $\alpha_{0}=10$, $\alpha_{1}=0.1$, $\alpha_{2}=0.1$, and $\alpha_{3}=10$. The hyper-parameters of the baselines are set to the values suggested by their authors.
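For reference, the stated optimizer settings and default loss weights map onto the following configuration (a sketch under the assumption that the six networks are standard PyTorch modules; the function name is ours).

```python
import itertools
import torch

def build_optimizers(E_X, E_Y, G_X, G_Y, D_X, D_Y):
    """ADAM with learning rate 1e-4 and momentums (betas) 0.5 and 0.999, as stated above."""
    opt_EG = torch.optim.Adam(
        itertools.chain(E_X.parameters(), E_Y.parameters(),
                        G_X.parameters(), G_Y.parameters()),
        lr=1e-4, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(
        itertools.chain(D_X.parameters(), D_Y.parameters()),
        lr=1e-4, betas=(0.5, 0.999))
    return opt_EG, opt_D

# Default loss weights of Eq. (9).
alpha_0, alpha_1, alpha_2, alpha_3 = 10.0, 0.1, 0.1, 10.0
```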

4.5. Quantitative Results

We evaluate our model on different datasets and compare it with baseline models.

4.5.1. Comparison on Painters Dataset

To compare the proposed model with the baseline models on the Painters dataset, we first train the state-of-the-art VGG-11 model (Simonyan and Zisserman, 2014) on the training data, obtaining a classifier with 94.5% accuracy. We then score the synthesized images by their classification accuracy against the domain labels they were synthesized for. We generate around 4000 images every 5 hours of training, and the classification accuracies are shown in Fig. 4.

Figure 4. The classification accuracy on Painters dataset. The 7 models are the proposed model with the lowest and the highest layer sharing, the lowest layer sharing only, the highest layer sharing only, CycleGAN, DualGAN, UNIT, and DB.

Our model achieves the highest classification accuracy of 52.5% when using both highest-layer and lowest-layer sharing, while requiring less training time than the other reference models to reach its peak.

4.5.2. Comparison on Alps Seasons Dataset

We train the VGG-11 model on the training data of the Alps Seasons dataset, obtaining a classifier with 85.5% accuracy. We then classify the images generated by our model; the classification accuracies are shown in Fig. 5.

Figure 5. The classification accuracy on the Alps Seasons dataset. The 7 models are the proposed model with lowest- and highest-level layer sharing, lowest-level layer sharing only, highest-level layer sharing only, CycleGAN, DualGAN, UNIT, and DB.

Similar to Fig. 4, our model achieves the highest classification accuracy of 33.8%, while requiring less training time than the baseline models to reach its peak.

4.6. Analysis of the loss function

We compare ablations of our full loss. As the GAN loss and the cycle consistency loss are critical for training unsupervised image-to-image translation, we keep these two losses as the baseline model and run ablation experiments to assess the importance of the other losses.

Table 3. Ablation study: classification accuracy on the Painters and Alps Seasons datasets for different losses. The following abbreviations are used: R: reconstruction loss, LCL: latent consistency loss, C: classification loss.
Loss acc.% (Painters) acc. % (Alps Seasons)
Baseline 35.23 20.81
Baseline + R 36.86 21.59
Baseline + LCL 44.42 25.05
Baseline + C 43.63 24.01
Baseline + R + LCL 45.79 27.19
Baseline + R + C 44.82 26.63
Baseline + LCL + C 50.74 32.51
Baseline + R + LCL + C 52.54 33.78

As shown in Table 3, the reconstruction loss $R$ is the least important, with a relative accuracy improvement of about 4.6% on the Painters dataset and 3.7% on the Alps Seasons dataset. The latent consistency loss $LCL$ brings the model a relative accuracy improvement of 26.1% on the Painters dataset and 20.4% on the Alps Seasons dataset. The classification loss $C$ improves the accuracy by 23.8% on the Painters dataset and 15.4% on the Alps Seasons dataset.

4.7. Qualitative Results

We demonstrate our model on three unsupervised multi-domain image-to-image translation tasks.

Painting style transfer (Fig. 6) We train our model on the Painters dataset and use it to generate images of size $256\times 256$. The model can transfer the painting style of a specific painter to the other painters, e.g., transferring the images of Cezanne into images in the styles of the other three painters Monet, Ukiyo-e, and Van Gogh. In Fig. 7, we also compare our model with the other reference models given the same test image.

Figure 6. Painters translation results. The original images are displayed within a dashed square. The other images are generated according to the styles of different painters.
Figure 7. Painters translation results compared with the reference models, given the same test images. The original images are displayed within a dashed square.

Season transfer (Fig. 8) The model is trained on the Alps Seasons dataset. We use the trained model to generate images of different seasons, e.g., generating an image of summer from an image of spring and vice versa. In Fig. 9, we also compare our model with the other reference models given the same test image.

Figure 8. Alps Seasons translation results. The original images are displayed within a dashed square. The other images are generated according to different seasons.
Figure 9. Alps Seasons translation results compared with the reference models, given the same test images. The original images are displayed within a dashed square.

Attribute-based face translation (Fig. 10) We train the model on the CelebA dataset for attribute-based face translation tasks. We choose 4 attributes: black hair, blond hair, brown hair, and gender. We then use our model to generate images with these attributes, e.g., translating an image of a man with black hair into one of a man with blond hair, or translating a man into a woman.

Figure 10. Attribute-based face translation results. The original images are displayed within a dashed square. The other images are generated according to different face attributes.

5. Conclusion

In this paper, we propose the Cross-Domain Generative Adversarial Network (CD-GAN), a novel and scalable model for unsupervised multi-domain image-to-image translation. We show its capability of translating images from one domain to many other domains on several datasets. It still has some limitations. First, training can be unstable due to the well-known difficulty of training GAN models. Second, the diversity of the generated images is constrained by the cycle consistency loss. We plan to address these two problems in future work.

References

  • (1)
  • Anoosheh et al. (2017) A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. 2017. ComboGAN: Unrestrained Scalability for Image Domain Translation. ArXiv e-prints (Dec. 2017). arXiv:cs.CV/1712.06909
  • Cao et al. (2017) Yun Cao, Zhiming Zhou, Weinan Zhang, and Yong Yu. 2017. Unsupervised Diverse Colorization via Generative Adversarial Networks. In ECML/PKDD (1) (Lecture Notes in Computer Science), Vol. 10534. Springer, 151–166.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
  • Gregor et al. (2015) K. Gregor, I. Danihelka, A. Graves, D. Jimenez Rezende, and D. Wierstra. 2015. DRAW: A Recurrent Neural Network For Image Generation. ArXiv e-prints (Feb. 2015). arXiv:cs.CV/1502.04623
  • Gupta et al. (2012) Raj Kumar Gupta, Alex Yong-Sang Chia, Deepu Rajan, Ee Sin Ng, and Huang Zhiyong. 2012. Image Colorization Using Similar Images. In Proceedings of the 20th ACM International Conference on Multimedia (MM ’12). ACM, New York, NY, USA, 369–378. https://doi.org/10.1145/2393347.2393402
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770–778.
  • Hui et al. ([n. d.]) L. Hui, X. Li, J. Chen, H. He, C. Gong, and J. Yang. [n. d.]. Unsupervised Multi-Domain Image Translation with Domain-Specific Encoders/Decoders. ArXiv e-prints ([n. d.]). arXiv:1712.02050
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 5967–5976.
  • Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, International Convention Centre, Sydney, Australia, 1857–1865. http://proceedings.mlr.press/v70/kim17a.html
  • Kingma and Ba (2014) D. P. Kingma and J. Ba. 2014. Adam: A Method for Stochastic Optimization. ArXiv e-prints (Dec. 2014). arXiv:cs.LG/1412.6980
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
  • Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. 105–114. https://doi.org/10.1109/CVPR.2017.19
  • Liu et al. (2017) Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised Image-to-Image Translation Networks. In Advances in Neural Information Processing Systems 30.
  • Liu and Tuzel (2016) Ming-Yu Liu and Oncel Tuzel. 2016. Coupled Generative Adversarial Networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 469–477. http://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks.pdf
  • Liu et al. (2008) Xiaopei Liu, Liang Wan, Yingge Qu, Tien-Tsin Wong, Stephen Lin, Chi-Sing Leung, and Pheng-Ann Heng. 2008. Intrinsic Colorization. ACM Trans. Graph. 27, 5, Article 152 (Dec. 2008), 9 pages. https://doi.org/10.1145/1409060.1409105
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 3730–3738. https://doi.org/10.1109/ICCV.2015.425
  • Ngiam et al. (2011) Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML’11). Omnipress, USA, 689–696. http://dl.acm.org/citation.cfm?id=3104482.3104569
  • Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ArXiv e-prints (Nov. 2015). arXiv:cs.LG/1511.06434
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. 2016. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 2234–2242. http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf
  • Simonyan and Zisserman (2014) K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv e-prints (Sept. 2014). arXiv:cs.CV/1409.1556
  • Wang and Gupta (2016) X. Wang and A. Gupta. 2016. Generative Image Modeling using Style and Structure Adversarial Networks. ArXiv e-prints (March 2016). arXiv:cs.CV/1603.05631
  • Yi et al. (2017) Z. Yi, H. Zhang, P. Tan, and M. Gong. 2017. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In 2017 IEEE International Conference on Computer Vision (ICCV). 2868–2876. https://doi.org/10.1109/ICCV.2017.310
  • Zhang et al. (2016) R. Zhang, P. Isola, and A. A. Efros. 2016. Colorful Image Colorization. ArXiv e-prints (March 2016). arXiv:cs.CV/1603.08511
  • Zhu et al. (2017) J. Y. Zhu, T. Park, P. Isola, and A. A. Efros. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In 2017 IEEE International Conference on Computer Vision (ICCV). 2242–2251. https://doi.org/10.1109/ICCV.2017.244