Progressive Energy-Based Cooperative Learning for Multi-Domain Image-to-Image Translation

Weinan Song (University of California, Los Angeles), Yaxuan Zhu (University of California, Los Angeles), Lei He (Eastern Institute for Advanced Study, Ningbo, China; University of California, Los Angeles), Ying Nian Wu (University of California, Los Angeles), Jianwen Xie (Akool Research)

Abstract

This paper studies a novel energy-based cooperative learning framework for multi-domain image-to-image translation. The framework consists of four components: descriptor, translator, style encoder, and style generator. The descriptor is a multi-head energy-based model that represents a multi-domain image distribution. Given an image from a source domain and a style code, the translator generates an output image in the target domain; the style code can be inferred by the style encoder from a reference image or generated by the style generator from random noise. To train our framework, we propose a likelihood-based multi-domain cooperative learning algorithm that jointly trains the multi-domain descriptor and the diversified image generator (comprising the translator, style encoder, and style generator) via multi-domain MCMC teaching. In this scheme, the descriptor guides the diversified image generator to shift its probability density toward the data distribution, while the diversified image generator uses its randomly translated images to initialize the descriptor’s Langevin dynamics for efficient sampling. We also introduce two regularization strategies, one for the descriptor and one for the translator, that significantly improve the cooperative learning. To further enhance efficiency and scalability, we propose a progressive cooperative learning strategy to train our framework. Strong empirical results verify the effectiveness of our energy-based image translation framework.

1 Introduction

The task of image-to-image translation primarily involves learning mappings between different visual domains. This learning framework carries immense application value in the field of generative artificial intelligence, facilitating the development of various creative products for artificial intelligence-generated content (AIGC). In this context, a “domain” refers to a collection of images belonging to a visually distinctive category, such as the gender of a person or an animal species. Within each domain, every image exhibits a unique appearance, encompassing image-specific elements such as hairstyle and makeup, commonly referred to as its “style”. An ideal image-to-image translation framework should be able to handle multiple domains, efficiently process high-resolution images, and provide diverse synthesis (i.e., one-to-many mapping) when translating to each target domain. By leveraging the representation power of an energy-based model and the sampling efficiency of a latent variable model, the Generative Cooperative Network [32], also known as CoopNets, and its variants [34, 36] have achieved impressive results in numerous computer vision tasks, such as image generation [32, 34, 36], visual salient object detection [37], supervised image-to-image translation [35], and unsupervised image-to-image translation [33]. However, while the cross-domain translation framework CycleCoopNets [33] has demonstrated success in unpaired image-to-image translation, it can only learn the relation between two domains at a time. Such an approach has limited scalability to multiple domains, as a separate model must be trained for each pair of domains. Besides, cooperative learning still faces challenges when translating high-resolution images, because the translation process involves sampling from the energy-based model via Langevin dynamics, which can be difficult to apply in high-resolution image spaces.

To tackle these challenges in current cooperative learning (or, more generally, energy-based learning) for multi-domain unsupervised image-to-image translation, this paper proposes a novel cooperative learning framework, PMD-CoopNets, to ensure scalability, flexibility, stability, and efficiency when applying an energy-based framework to image-to-image translation.

To be specific, PMD-CoopNets consists of four components: descriptor, translator, style generator, and style encoder. (1) The descriptor is a multi-head energy-based model that represents a multi-domain image distribution, where each head of the energy function corresponds to one image domain. (2) The style generator is a multi-head latent variable model responsible for generating domain-specific style codes. It achieves this by transforming a Gaussian latent code into style codes. Each head in the style generator corresponds to one specific domain. (3) The style encoder extracts domain-specific style codes from an input image using a multi-head encoder. Each head of the encoder corresponds to a specific domain. (4) The translator is a style-controlled mapping, which takes an image and a style code as input and transforms the image into a translated image that reflects the desired style indicated by the style code. The style code can be obtained either from the style generator or from the style encoder. The style generator, style encoder, and style-controlled translator together constitute a diversified image generator.

As for the learning, the multi-domain descriptor and the diversified image generator engage in a cooperative game, where the multi-domain descriptor guides the diversified image generator in aligning its mapping toward the target domains using MCMC teaching, while the image generator expedites the descriptor’s MCMC sampling process by providing a good initialization. Specifically, to enforce a meaningful latent space of style codes, we train the style-controlled translator and style encoder by reconstructing style codes that are randomly generated from the style generator. To ensure that the translated image preserves the domain-invariant properties of the input source image, we train the translator with a cycle consistency loss. To enforce one-to-many translation output, we regularize the translator via a diversity-sensitive loss, such that, given an identical source image, different style codes lead to sufficiently diversified translated outputs.

Additionally, we propose to improve the cooperative learning algorithm by incorporating loss terms that regularize the behaviors of the components in our framework. First, we put an $l_{2}$ regularization on the output of the energy function of the descriptor to limit the magnitude of the energy values. Second, to accelerate and stabilize the teaching process provided by the descriptor’s MCMC, we propose to use the energy function to regularize the output of the translator. These regularization techniques significantly improve the performance of the cooperative learning.

To enhance efficiency, stability, and scalability, we present a progressive cooperative learning algorithm for our model. Our approach involves gradual expansion of all four components, initially operating on simpler low-resolution images. As the cooperative training proceeds, new layers are added to each component, enabling the model to handle more challenging high-resolution images. This progressive growth strategy significantly accelerates and stabilizes both training and sampling processes at higher resolutions. Moreover, it offers the flexibility and convenience to scale up the resolution of any pre-trained PMD-CoopNets.

We demonstrate the effectiveness of our proposed multi-domain translation model on the CelebA-HQ [17] and AFHQ [4] datasets. The translated examples exhibit high fidelity and are comparable to those of GAN-based multi-domain translation models in the task of image-to-image translation. Furthermore, our progressive learning strategy improves the efficiency and stability of the original training process, particularly when translating high-resolution images. Our contributions are listed below:

• We propose a novel energy-based cooperative learning framework for multi-domain image-to-image translation. We build a single multi-head energy-based model to represent probability distributions of multiple domains, and train it with a translator, a style encoder, and a style generator in a cooperative manner.

• We present a novel progressive learning algorithm to optimize the training efficiency of our framework. Our approach adopts a progressive growth strategy, advancing all components from low resolution to high resolution. It yields a significant reduction in the total number of MCMC steps required for training and sampling from the high-resolution model.

• We propose regularization strategies to stabilize the cooperative learning, which include an energy-based regularization loss for the translator and an $l_{2}$ regularization loss for limiting the magnitude of the energy values of the descriptor. Significant performance gains are obtained from these regularizations.

• We demonstrate strong empirical results on CelebA-HQ and AFHQ datasets to verify the proposed energy-based framework. Our method obtains state-of-the-art performance among existing energy-based image translation models.

2 Related Work

Energy-based Learning

Training energy-based models (EBMs) [43, 21, 15] involves maximizing the likelihood of the observed data by adjusting the model’s energy function parameters, which typically requires Markov chain Monte Carlo (MCMC) sampling to evaluate the intractable gradient [31, 27, 5]. Contrastive divergence (CD) [14, 6] is an efficient approximation algorithm for training energy-based models by initializing the MCMC chains with observed data. [27] uses a noise-initialized non-convergent short-run MCMC to train an EBM, and obtains a valid flow-like generator trained with moment matching estimation. [9] defines a sequence of conditional EBMs and forms a denoising diffusion process. To avoid MCMC, [8] brings in normalizing flow and trains an EBM by flow contrastive estimation. Learning an amortized sampler [19, 32, 12, 20, 30, 10] for EBMs is also an alternative strategy. Our method has a single multi-head EBM to represent multi-domain data distribution, and the image-to-image translator serves as a multi-domain amortized sampler for the EBMs.

Cooperative Learning

Cooperative learning for energy-based models with MCMC teaching was first proposed in [32], where the authors utilize an energy-based model as the descriptor and a latent variable model as the generator, so that the two speed up each other’s learning by maximum likelihood. During each training iteration, the descriptor generates samples by finite-step MCMC sampling initialized with the generator’s outputs, and these samples are used for maximum likelihood estimation. Simultaneously, the descriptor’s sampling results are used to directly supervise the generator, which is called MCMC teaching. Further research in [42] shows that this cooperative learning method can also provide a good starting point for adversarial models with small computational overhead. Additionally, the model can be extended to image-to-image translation [33] with two pairs of descriptor and generator, or used in saliency prediction [38] by introducing a conditional latent variable model.

Progressive Learning

The proposed idea of progressive cooperative learning is closely connected to the research conducted by [41], which involves the incremental growth of a single EBM. The multi-grid EBM framework [7] trains a series of EBMs simultaneously at various resolutions. The sampling process is conducted sequentially, starting from low resolution and gradually progressing to higher resolutions, leveraging the lower resolution as a foundation for subsequent higher-resolution sampling. In contrast, our method, which combines the growth of an EBM with three mapping networks, introduces a more challenging and complex progressive learning strategy. It is important to note that while there are several progressive learning frameworks based on Generative Adversarial Networks (GANs), our approach falls within the domain of energy-based learning. We need to carefully consider MCMC sampling when progressively expanding the energy function, as it plays a crucial role in both bottom-up energy mapping and top-down image generation.

3 Proposed Framework

Figure 1: Diagram of energy-based cooperative learning for multi-domain image-to-image translation. The framework consists of a style generator, a style encoder, a translator, and a descriptor. The first three components (i.e., style generator, style encoder, and translator) form a diversified image generator. Given an input source image, the translator can transform it into a target domain, which is specified by a style code. The style code can be obtained by sampling from the domain-specific style generator or extracted from a reference image by the style encoder. The descriptor is a multi-domain image distribution, which guides the translation such that the translated images match the observed images in the target domain in terms of statistical properties. All components are trained simultaneously in a cooperative learning scheme. The descriptor learns from the multi-domain training images by maximizing the data likelihood, while utilizing MCMC teaching to guide the training of the diversified image generator.

Suppose we have unpaired images from multiple domains A, B, C, $\cdots$ that share some high-level features, such as expressions in face images. Our goal is to learn a conditional generative model that maps an image into a target domain, which may be the same as the source domain, with specific features. To achieve this, we propose a generative model that consists of four components, i.e., descriptor, style encoder, style generator, and translator. The latter three form a diversified translator, which is trained with the descriptor in a cooperative learning manner. Let $x$ be an observed image and $y$ be its domain label. We also use $y^{\prime}$ to denote the label of the target domain.

3.1 Multi-Domain Descriptor

The multi-domain descriptor is a multi-head energy-based model that specifies the probability distribution of each domain by

p_{y}(x;\theta)\propto\exp[D_{y}(x;\theta)],

(1)

where $\theta$ denotes the parameters of the multi-head energy function $D$ . For notational simplicity, we use $D_{y}(\cdot)$ to denote the negative energy for domain $y$ . The descriptor is learned by multi-domain maximum likelihood estimation, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution $p_{\text{data}}(x,y)$ and the model $p_{y}(x;\theta)$ . The gradient of the objective for learning the descriptor is given by

\displaystyle\begin{split}&\nabla_{\theta}\mathcal{L}_{\text{ebm}}(\theta)\\ &=-\mathbb{E}_{p_{\text{data}}(x,y)}\{\nabla_{\theta}D_{y}(x;\theta)-\mathbb{E}_{p_{y}(x^{\prime};\theta)}\left[\nabla_{\theta}D_{y}\left(x^{\prime};\theta\right)\right]\},\end{split}

(2)

where $\mathbb{E}_{p_{y}(x^{\prime};\theta)}$ denotes the expectation with respect to the EBM, and we use $x^{\prime}$ to distinguish it from the random variable $x$ in $\mathbb{E}_{p_{\text{data}}(x,y)}$ in the same equation. Suppose we observe a batch of training examples $\{(x_{i},y_{i})\}_{i=1}^{n}$ , which are assumed to be drawn from $p_{\text{data}}(x,y)$ . The gradient in Eq.(2) can be approximated by

\nabla_{\theta}\mathcal{L}_{\text{ebm}}(\theta)\approx-\nabla_{\theta}\left[\frac{1}{n}\sum_{i=1}^{n}D_{y_{i}}(x_{i};\theta)-\frac{1}{n}\sum_{i=1}^{n}D_{y_{i}}(\tilde{x}_{i};\theta)\right],

(3)

where for each observed domain $y_{i}$ , we use Langevin dynamics to obtain the corresponding synthesized example $\tilde{x}_{i}$ as a sample from $p_{y_{i}}(x;\theta)$ . With a specified step size $\delta$ , Langevin dynamics is performed by iterating the following step:

\tilde{x}_{\tau+1}=\tilde{x}_{\tau}+\delta\nabla_{x}D_{y}\left(\tilde{x}_{\tau};\theta\right)+\sqrt{2\delta}U_{\tau},U_{\tau}\sim\mathcal{N}(0,I),

(4)

where $\tau$ indexes time step and $\tilde{x}_{\tau=0}$ is initialized by the output of a style-controlled image-to-image translator, which is presented in Section 3.2. A good initialization improves the efficiency of Langevin dynamics. To stabilize the EBM training, we also add an $l_{2}$ regularization on the energy outputs of both training examples and synthesized examples, which is

\mathcal{L}_{\text{energy}}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\|D_{y_{i}}(x_{i};\theta)\|^{2}+\frac{1}{n}\sum_{i=1}^{n}\|D_{y_{i}}(\tilde{x}_{i};\theta)\|^{2}.

(5)
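As an illustration of the Langevin update in Eq.(4), the following minimal Python sketch runs the iteration on a toy one-dimensional negative energy $D(x)=-(x-\mu)^{2}/2$ , whose model density is a unit-variance Gaussian centered at $\mu$ . The toy energy, the function names, and the step-size values are assumptions chosen for illustration; in the framework itself $D_{y}$ is a neural network and the chain is initialized by the translator output.

```python
import math
import random


def grad_neg_energy(x, mu):
    """Gradient of the toy negative energy D(x) = -(x - mu)^2 / 2 (an assumption)."""
    return -(x - mu)


def langevin(x0, mu, step_size, num_steps, rng, add_noise=True):
    """Iterate Eq. (4): x <- x + delta * dD/dx + sqrt(2 * delta) * U, with U ~ N(0, 1)."""
    x = x0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0) if add_noise else 0.0
        x = x + step_size * grad_neg_energy(x, mu) + math.sqrt(2.0 * step_size) * noise
    return x


rng = random.Random(0)
# Every chain starts from the same point, standing in for a translator output.
samples = [langevin(0.0, mu=3.0, step_size=0.01, num_steps=500, rng=rng)
           for _ in range(200)]
sample_mean = sum(samples) / len(samples)
```

With the noise term switched off, the update reduces to gradient ascent on $D$ . An initial point that is already close to a mode needs fewer steps to reach it, which is the role the translator output plays as an initializer.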

3.2 Diversified Image Generator

Multi-Domain Style Generator

Given a latent variable $z$ and a domain label $y$ , the multi-domain style generator can produce a domain-specific style code $c$ by

c_{y}=G_{y}(z;\beta)+\epsilon,\hskip 5.69054pt\epsilon\sim\mathcal{N}(0,I),z\sim\mathcal{N}(0,I),

(6)

where $\epsilon$ is an observation residual and $z$ follows a Gaussian prior distribution. $G$ is a multilayer perceptron (MLP) with multiple output branches to produce style codes for multiple domains. The distribution of style code $c$ conditioned on a domain $y$ is given by $p_{y}(c;\beta)=\int p_{y}(c|z;\beta)p(z)dz$ , which is more informative than the prior distribution $p(z)$ to capture the underlying style space. The domain-specific style code $c_{y}$ is directly used in the translator, which is presented in Section 3.2, for specifying the style and the target domain of the translated image.
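To make the multi-head structure concrete, the sketch below implements a tiny style generator with one shared hidden layer and one linear output branch per domain. The class name, the single hidden layer, the hidden width of 32, and the random initialization are assumptions for illustration; only the latent size (16) and style-code size (64) follow the values reported in Section 4.1. The residual $\epsilon$ of Eq.(6) is omitted, so the sketch returns the deterministic part $G_{y}(z;\beta)$ .

```python
import random


class MultiHeadStyleGenerator:
    """Toy G: a shared hidden layer followed by one linear output head per domain."""

    def __init__(self, latent_dim, hidden_dim, style_dim, num_domains, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights; a trained model would learn these.
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.shared = init(hidden_dim, latent_dim)
        self.heads = [init(style_dim, hidden_dim) for _ in range(num_domains)]

    def __call__(self, z, y):
        # Shared layer with ReLU, then only the head of domain y is evaluated.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in self.shared]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.heads[y]]


rng = random.Random(1)
generator = MultiHeadStyleGenerator(latent_dim=16, hidden_dim=32, style_dim=64, num_domains=3)
z = [rng.gauss(0.0, 1.0) for _ in range(16)]
style_domain0 = generator(z, y=0)
style_domain1 = generator(z, y=1)
```

The same latent code $z$ yields a different style code for each domain label, because only the selected output head differs between the two calls.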

Style Encoder

The style encoder $E$ is a multi-head bottom-up network that takes as input an image $x$ and its corresponding domain label $y$ and then outputs a domain-specific style code $c=E_{y}(x;\phi)$ , where $\phi$ are parameters and $E_{y}(\cdot)$ denotes the output of $E$ that corresponds to domain $y$ .

Style-Controlled Image-to-Image Translator

To achieve a one-to-many translation between domains, we build a style-controlled image-to-image translator. It is a conditioned encoder-decoder $T$ that takes as input a source reference image $x$ and a domain-specific style code $c_{y}$ and outputs a translated image in target domain $y$ , which is given by

x_{y}=T(x,c_{y};\alpha)+\epsilon,\hskip 5.69054pt\epsilon\sim\mathcal{N}(0,I),c_{y}\sim p_{y}(c;\beta),

(7)

where $\alpha$ denotes the parameters of the neural network $T$ . The randomness in the translated image, given a reference image and the target domain, arises from the stochastic nature of the style codes, which follow the distribution $p_{y}(c;\beta)$ defined by the style generator. The translator $T$ and the style generator $G$ form a diversified translator. They are trained by the MCMC teaching loss [32], which is

\mathcal{L}_{\text{teach}}(\alpha,\beta)=\mathbb{E}_{z,y,x}[\|\tilde{x}_{z,y,x}-T(x,G_{y}(z;\beta);\alpha)\|^{2}],

(8)

where $\tilde{x}_{z,y,x}$ denotes the Langevin synthesis from the descriptor, which is initialized by the translator output $T(x,G_{y}(z;\beta);\alpha)$ . That is, we set $\tilde{x}_{z,y,x,\tau=0}\leftarrow T(x,G_{y}(z;\beta);\alpha)$ and run the Langevin dynamics in Eq.(4) to obtain $\tilde{x}_{z,y,x}$ . Let $M_{\theta}q_{\alpha,\beta}(x)$ be the marginal distribution obtained by running Markov transition $M_{\theta}$ from $q(x;\alpha,\beta)$ . At learning step $t+1$ , the gradient of the MCMC teaching loss in Eq.(8) is the gradient of $\text{KL}(M_{\theta^{(t)}}q_{\alpha^{(t)},\beta^{(t)}}||q_{\alpha,\beta})$ , where $q_{\alpha,\beta}$ seeks to be the stationary distribution of $M_{\theta}$ , i.e., minimizing $\text{KL}(p_{\theta}||q_{\alpha,\beta})$ . The effects of the MCMC teaching loss include: (i) $q$ can chase $p$ toward $p_{\text{data}}$ for MLE; (ii) $q$ can serve as a good MCMC initializer for $p$ for efficient MCMC sampling. To ensure diverse translator outputs, we regularize $T$ by minimizing the negative diversity-sensitive loss

\displaystyle\begin{split}\mathcal{L}_{\text{diverse}}(\alpha)=-\mathbb{E}_{z_{1},z_{2},y,x}[\|&T(x,G_{y}(z_{1};\beta);\alpha)\\ &-T(x,G_{y}(z_{2};\beta);\alpha)\|_{1}].\end{split}

(9)

Since the translator is learned from unpaired data domains, to ensure that the translated image $T(x,c;\alpha)$ preserves the domain-invariant features of the source image $x$ , we adopt the cycle consistency loss:

\mathcal{L}_{\text{cycle}}(\alpha)=\mathbb{E}_{z,y,x,y^{\prime}}[\|x-x_{cycle}\|_{1}],

(10)

where $x_{cycle}=T(T(x,G_{y^{\prime}}(z;\beta);\alpha),E_{y}(x;\phi);\alpha)$ . To ensure any style code that is applied to the translated image can be retrieved back from the translated image by the style encoder, we also have a style code reconstruction loss

\displaystyle\begin{split}&\mathcal{L}_{\text{style}}(\alpha,\phi)\\ &=\mathbb{E}_{z,y^{\prime},x}[\|G_{y^{\prime}}(z;\beta)-E_{y^{\prime}}(T(x,G_{y^{\prime}}(z;\beta);\alpha);\phi)\|_{1}].\end{split}

(11)
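The four loss terms in Eqs.(8)-(11) can be sketched as single-sample estimates, as below. The function `translator_loss_terms`, the helper names, and the toy callables `T`, `G`, and `E` (an "image" is a two-element list holding a content value and a style value) are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the networks used in the paper. The Langevin-revised image `x_tilde` is passed in as a fixed target.

```python
def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def translator_loss_terms(x, y, y_tgt, z1, z2, x_tilde, T, G, E):
    """Single-sample estimates of Eqs. (8)-(11) for source (x, y) and target y_tgt."""
    c1 = G(z1, y_tgt)
    c2 = G(z2, y_tgt)
    x_hat = T(x, c1)
    return {
        "teach": squared_l2(x_tilde, x_hat),      # Eq. (8)
        "diverse": -l1(x_hat, T(x, c2)),          # Eq. (9)
        "cycle": l1(x, T(x_hat, E(x, y))),        # Eq. (10)
        "style": l1(c1, E(x_hat, y_tgt)),         # Eq. (11)
    }


# Toy components (assumptions): image = [content, style].
def G(z, y):
    return [z[0] + 10.0 * y]      # style code depends on noise and domain


def T(x, c):
    return [x[0], c[0]]           # keep the content, replace the style


def E(x, y):
    return [x[1]]                 # read the style back from the image


terms = translator_loss_terms(
    x=[1.0, 0.5], y=0, y_tgt=1, z1=[0.2], z2=[-0.3], x_tilde=[1.0, 10.0],
    T=T, G=G, E=E,
)
```

In this toy setting the cycle and style terms are exactly zero, because the translator preserves the content value and the encoder recovers the injected style code, while the diversity term is negative whenever two style codes produce different outputs.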

To further stabilize the cooperative training and accelerate the MCMC teaching effect, we propose to add the following energy-based regularization on the translator,

\mathcal{L}_{\text{mode}}(\alpha,\beta)=-\mathbb{E}_{z,y^{\prime},x}[D_{y^{\prime}}(T(x,G_{y^{\prime}}(z;\beta);\alpha);\theta)],

(12)

which can shift the translator mapping toward the low energy modes of the energy function.

3.3 Cooperative Learning of Descriptor and Translator

The full objective function of the descriptor is $\mathcal{L}_{\text{descriptor}}=\mathcal{L}_{\text{ebm}}+\lambda_{\text{energy}}\mathcal{L}_{\text{energy}}$ , and the full objective function of the translator is $\mathcal{L}_{\text{translator}}=\mathcal{L}_{\text{teach}}+\lambda_{\text{diverse}}\mathcal{L}_{\text{diverse}}+\lambda_{\text{cycle}}\mathcal{L}_{\text{cycle}}+\lambda_{\text{style}}\mathcal{L}_{\text{style}}+\lambda_{\text{mode}}\mathcal{L}_{\text{mode}}$ , where $\lambda_{\text{energy}}$ , $\lambda_{\text{diverse}}$ , $\lambda_{\text{cycle}}$ , $\lambda_{\text{style}}$ , and $\lambda_{\text{mode}}$ are hyperparameters. At each learning iteration, the cooperative learning algorithm performs the following steps in order: (1) Generate an initial translated image via $\hat{x}=T(x,G_{y}(z))$ ; (2) Revise $\hat{x}$ by the Langevin dynamics in Eq.(4) to obtain $\tilde{x}$ ; (3) Update the parameters $\theta$ of the descriptor by minimizing $\mathcal{L}_{\text{descriptor}}$ ; (4) Update the parameters $\alpha,\phi,\beta$ of the translator by minimizing $\mathcal{L}_{\text{translator}}$ .
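The four alternating steps can be illustrated on a deliberately simplified one-dimensional problem. In the sketch below, everything is an assumption made for illustration: there is a single domain and no style code, the descriptor is $D(x;\theta)=-(x-\theta)^{2}/2$ , the generator is $T(z;\alpha)=\alpha+z$ , and all regularization terms are dropped. It is meant to show the flow of MCMC teaching, not to reproduce the multi-domain model.

```python
import random


def cooperative_toy(data, num_iters=500, batch=64, K=10, delta=0.1, lr=0.1, seed=0):
    """Toy cooperative learning with D(x; theta) = -(x - theta)^2 / 2 and T(z; alpha) = alpha + z."""
    rng = random.Random(seed)
    theta, alpha = 0.0, 0.0
    for _ in range(num_iters):
        x_obs = [rng.choice(data) for _ in range(batch)]
        z = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        # (1) Initial synthesis from the generator.
        x_hat = [alpha + zi for zi in z]
        # (2) Revise by K Langevin steps, Eq. (4); here dD/dx = theta - x.
        x_tilde = list(x_hat)
        for _ in range(K):
            x_tilde = [xi + delta * (theta - xi) + (2.0 * delta) ** 0.5 * rng.gauss(0.0, 1.0)
                       for xi in x_tilde]
        # (3) Descriptor step: descend the minibatch estimate of Eq. (2).
        #     With dD/dtheta = x - theta, it equals mean(x_tilde) - mean(x_obs).
        grad_theta = sum(x_tilde) / batch - sum(x_obs) / batch
        theta -= lr * grad_theta
        # (4) Generator step: descend the teaching loss mean((x_tilde - (alpha + z))^2),
        #     treating x_tilde as a fixed target.
        grad_alpha = -2.0 * sum(xt - xh for xt, xh in zip(x_tilde, x_hat)) / batch
        alpha -= lr * grad_alpha
    return theta, alpha


data_rng = random.Random(42)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]  # observed data centered at 2
theta, alpha = cooperative_toy(data)
```

In this toy run the descriptor parameter is pulled toward the data mean by the difference between observed and synthesized statistics, and the generator parameter follows the descriptor because its targets are the Langevin-revised samples.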

Algorithm 1 Progressive Cooperative Learning

Input: Multi-resolution data $\{(x^{(s)}_{i},y^{(s)}_{i}), i=1,...,N; s=1,...,S\}$
Output: Model $E^{(S)},T^{(S)},D^{(S)},G$
$E^{(0)}\leftarrow\varnothing,D^{(0)}\leftarrow\varnothing,T^{(0)}\leftarrow\varnothing$
for $s=1,\cdots,S$ do
    $m\leftarrow 0$
    if $s=1$ then $\omega\leftarrow 1$ else $\omega\leftarrow 0$ end if
    $E^{(s,\omega)}\leftarrow\text{expand}(E^{(s-1)})$
    $D^{(s,\omega)}\leftarrow\text{expand}(D^{(s-1)})$
    $T^{(s,\omega)}\leftarrow\text{expand}(T^{(s-1)})$
    while ( $m\leq N$ ) do
        Sample $(x,y)$ and $y^{\prime}$
        Sample $z\sim\mathcal{N}(0,I)$
        $c\leftarrow E_{y}^{(s,\omega)}(x)$
        $c\leftarrow G_{y^{\prime}}(z)$
        $\hat{x}\leftarrow T^{(s,\omega)}(x,c)$
        Revise $\hat{x}$ to obtain $\tilde{x}$ by a $K$ -step Langevin dynamics in Eq. (4).
        Update descriptor $D^{(s,\omega)}$ with $\mathcal{L}_{\text{descriptor}}$
        Update translator $\{E^{(s,\omega)},T^{(s,\omega)},G\}$ with $\mathcal{L}_{\text{translator}}$
        $m\leftarrow m+n^{(s)}$
        if $s\neq 1$ then $\omega\leftarrow\min(1,m/N)$ else $\omega\leftarrow 1$ end if
    end while
end for

3.4 Progressive Cooperative Learning

The update of both the descriptor and the translator relies on the cooperative generation of MCMC synthesized examples, denoted as $\tilde{x}$ . To significantly improve training efficiency, we propose a progressive learning strategy for our cooperative learning framework. The algorithm gradually enhances the model resolution from low to high, while maintaining cooperative learning across all components at each resolution. The underlying motivation behind this strategy is that learning and sampling from a low-resolution data domain is much more efficient. By leveraging a pre-trained low-resolution model as a foundation, we can efficiently learn the next scale of the model, rather than starting from scratch. When expanding the current model to the next scale, each component’s network structure undergoes modifications. New layers are added to handle higher-resolution image inputs or outputs, while incompatible old layers are removed. The newly added layers are trained together with the remaining parameters. To ensure a smooth transition and prevent gradient exposure due to the addition of expanding blocks in each component, we propose to retain partial effects of the parameters that need to be removed while incorporating the effects of the newly added parameters. Throughout each resolution of learning, the impact of the removed parameters gradually diminishes until it becomes zero. Figure 2 illustrates the expanding strategy of each component at every level of resolution. Here, $\omega$ represents a transition factor that starts from 0 and increases to 1, controlling the percentage of effects from the parameters to be removed (depicted as dark grey boxes with dashed boundaries) and the parameters to be added (depicted as light grey boxes). For a complete description of the proposed progressive cooperative learning algorithm, please refer to Algorithm 1.
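The transition schedule can be sketched in a few lines. The function `transition_factor` follows the update rule for $\omega$ in Algorithm 1. The function `blend` is an assumption: the text only states that $\omega$ controls the share of the removed and added parameters, and a linear interpolation between the two paths is one common way to realize such a fade-in.

```python
def transition_factor(stage, images_seen, images_per_stage):
    """Transition factor of Algorithm 1: fixed at 1 for the first stage, otherwise min(1, m / N)."""
    if stage == 1:
        return 1.0
    return min(1.0, images_seen / images_per_stage)


def blend(old_path, new_path, omega):
    """Assumed linear fade-in: weight omega on the new layers, 1 - omega on the layers to be removed."""
    return [(1.0 - omega) * o + omega * n for o, n in zip(old_path, new_path)]


# Halfway through the second stage, both paths contribute equally.
omega = transition_factor(stage=2, images_seen=50, images_per_stage=100)
mixed = blend([1.0, 2.0], [3.0, 6.0], omega)
```

At the start of a new stage the output equals that of the old path, so the expanded model initially behaves like the pre-trained lower-resolution one; once $\omega$ reaches 1 only the new layers remain in effect.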

4 Experiment

4.1 Experiment Settings

Dataset and Evaluation Metrics

To demonstrate the performance of our proposed multi-domain image-to-image translation framework, we test it on the CelebA-HQ [17] and AFHQ [4] datasets and compare it with several baselines. We use M and F to refer to the male and female domains in CelebA-HQ, and C, D, and W to refer to the cat, dog, and wild animal domains in AFHQ. We only use the images and the corresponding domain labels from the datasets in our experiments. We evaluate the quality of translated images using the Fréchet Inception Distance (FID) [13] and the Kernel Inception Distance (KID) [2], which are widely used to measure the distance between the population of translated images and the population of original images in the target domain. A small FID or KID indicates that the translated distribution is close to the target distribution.
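For reference, the two metrics are commonly defined as follows. The notation is introduced here and does not appear elsewhere in the paper: $\mu_{r},\Sigma_{r}$ and $\mu_{g},\Sigma_{g}$ are the mean and covariance of Inception features of real and translated images, $F_{r}$ and $F_{g}$ are the two feature sets, and $d$ is the feature dimension. The exact feature extractor and any scaling applied to the reported values are not specified in this section.

```latex
\text{FID}=\|\mu_{r}-\mu_{g}\|_{2}^{2}
  +\operatorname{Tr}\left(\Sigma_{r}+\Sigma_{g}-2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right),
\qquad
\text{KID}=\widehat{\text{MMD}}_{k}^{2}(F_{r},F_{g}),\quad
k(u,v)=\left(\tfrac{1}{d}u^{\top}v+1\right)^{3}.
```

FID compares Gaussian fits to the two feature populations, whereas KID is an unbiased estimate of the squared maximum mean discrepancy under a cubic polynomial kernel and does not assume Gaussian features.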

Network Architecture

We show the details of the network structures in Fig. 4. We use dotted lines to represent the layers that will be dropped during scaling up in the Descriptor, Translator, and Style Encoder. Since the Style Generator samples from a Gaussian distribution, its architecture remains the same during progression. For each DownBlock/UpBlock module at level $l$ , we set the numbers of input and output channels to be $\max(512,2^{5+l})$ and $\max(512,2^{6+l})$ , respectively. For the FC layers in the Style Generator, we set the input channel of the top FC layer to 16 and that of the rest to 512. The size of the style code is set to 64 for both experiments.

Progressive Training

We start training our model at a resolution of 64 $\times$ 64, and then scale it up to 128 $\times$ 128 and 256 $\times$ 256. Example results after each progression are shown in Figure 5 for human face generation and in Figure 6 for animal face generation. The generation quality consistently improves after each step of progression. We run 16 Langevin steps for MCMC sampling at the beginning and decrease the number by 4 after each progression. The hyperparameters $\lambda_{\text{energy}}$ , $\lambda_{\text{diverse}}$ , $\lambda_{\text{cycle}}$ , $\lambda_{\text{style}}$ , and $\lambda_{\text{mode}}$ are all set to 1.
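Read literally, the schedule above gives the following number of Langevin steps per resolution. The helper name `langevin_steps` is hypothetical, and the mapping assumes that "after each progression" means one decrement of 4 per resolution increase.

```python
def langevin_steps(stage_index, initial_steps=16, decrement=4):
    """Number of MCMC steps at a given progression stage (0 = first resolution)."""
    return initial_steps - decrement * stage_index


resolutions = [64, 128, 256]
schedule = {res: langevin_steps(i) for i, res in enumerate(resolutions)}
```

Under this reading, the highest resolution uses half as many sampling steps as the lowest one, which is where the savings in MCMC cost at high resolution come from.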

Table 1: Evaluation on the CelebA-HQ dataset for two-domain human face generation and the AFHQ dataset for three-domain animal face generation.

Method	Reference CelebA-HQ FID	Reference CelebA-HQ KID	Reference AFHQ FID	Reference AFHQ KID	Diverse CelebA-HQ FID	Diverse CelebA-HQ KID	Diverse AFHQ FID	Diverse AFHQ KID
MUNIT [16]	107.1	-	223.9	-	31.4	-	41.5	-
DRIT [23]	53.3	-	114.8	-	52.1	-	95.6	-
MSGAN [25]	39.6	-	69.8	-	33.1	-	61.4	-
StarGAN2 [4]	23.8	12.1	19.8	6.1	13.7	4.1	16.2	9.1
Liu [24]	26.7	16.8	51.7	28.6	17.8	11.0	26.0	7.0
TUNIT [1]	173.7	193.7	223.0	187.7	128.0	122.0	116.1	99.7
SwapAE [29]	25.4	17.8	61.2	28.8	-	-	-	-
CLUIT [22]	28.9	18.1	22.6	10.5	-	-	-	-
StyleMapGAN [18]	28.8	25.1	64.3	51.3	24.3	15.2	32.8	18.7
CycleCoop [33]	-	-	-	-	131.0	124.7	-	-
EM-LAST [11]	-	-	-	-	48.8	22.9	41.5	17.0
Ours	21.0	7.7	21.0	7.9	32.9	21.9	31.8	16.9

Table 2: Ablation study on the CelebA-HQ and AFHQ datasets at 64 $\times$ 64 resolution.

Removed Item	CelebA-HQ Reference	CelebA-HQ Diverse	AFHQ Reference	AFHQ Diverse	Avg
baseline	15.1	14.3	12.4	19.6	15.4
Remove $\mathcal{L}_{\text{diverse}}$	16.3	17.1	36.4	35.2	26.3
Remove $\mathcal{L}_{\text{cycle}}$	111.0	127.3	NA	NA	119.2
Remove $\mathcal{L}_{\text{energy}}$	134.5	40.7	208.5	97.6	120.3
Remove $\mathcal{L}_{\text{mode}}$	NA	NA	277.2	217.6	247.4
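The Avg column appears to be the mean over the entries that are not NA. This reading is an inference from the numbers rather than something the table states, and the short check below confirms that it reproduces the reported averages up to rounding. The dictionary keys are labels chosen here for readability.

```python
# FID values from Table 2; None marks an NA entry.
rows = {
    "baseline": [15.1, 14.3, 12.4, 19.6],
    "remove L_diverse": [16.3, 17.1, 36.4, 35.2],
    "remove L_cycle": [111.0, 127.3, None, None],
    "remove L_energy": [134.5, 40.7, 208.5, 97.6],
    "remove L_mode": [None, None, 277.2, 217.6],
}
reported_avg = {
    "baseline": 15.4,
    "remove L_diverse": 26.3,
    "remove L_cycle": 119.2,
    "remove L_energy": 120.3,
    "remove L_mode": 247.4,
}


def mean_over_valid(values):
    """Average only the entries that are not NA."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


averages = {name: mean_over_valid(vals) for name, vals in rows.items()}
max_gap = max(abs(averages[name] - reported_avg[name]) for name in rows)
```

Because rows with NA entries are averaged over fewer settings, their Avg values are not directly comparable with those of the fully populated rows.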

Table 3: Evaluation on specific domain translations by FID score.

Method	C $\rightarrow$ D	W $\rightarrow$ D	M $\rightarrow$ F
ILVR [3]	74.4	75.3	46.1
SDEdit [26]	74.2	68.5	49.4
CUT [28]	76.2	92.9	31.9
C2F-EBM [40]	55.1	-	-
EM-LAST [11]	69.4	72.5	47.8
EGSDE [39]	51.0	50.4	30.6
Ours (Diverse)	55.1	54.5	26.8
Ours (Reference)	42.2	41.8	16.1

Table 4: Complexity analysis of different models. Training time is measured in s/batch, and inference time (for both reference-guided and diverse generation) in ms/batch.

Method	Ours	StarGAN2	CycleCoop
Param (M)	87.7	87.7	108.4
Training (s/batch)	2.7	1.0	2.7
Reference (ms/batch)	87.08	86.86	NA
Diverse (ms/batch)	20.17	21.22	9.55

4.2 Diverse Image Generation

In this experiment, we use style codes that are randomly sampled from the style generator to generate diverse translated images. Examples of generation results for human faces on the CelebA-HQ dataset and animal faces on the AFHQ dataset can be seen in Figure 3. For each source image shown in the first row, we generate multiple outputs using random Gaussian noise. The qualitative results verify the diversity of the translated results from a source input image. We observe that, given a source image, our model can not only generate diverse translated images but also produce high-quality images that retain the same attributes (e.g., expression) as the source image.

4.3 Translation with Reference Image

We perform image-to-image translation by providing a reference image. We first adopt the style encoder to extract the style code from the provided reference image, and then apply the style code to the translator. Figure 8 shows some qualitative results, where we take images in the first row as source images and images in the first column as reference images. The translation results are shown in the middle. Comparing the results displayed in each row, we observe that the human face in the source image is clearly changed to the gender and appearance of the face in the reference image, while the facial expression remains consistent with that in the source image.

4.4 Quantitative Comparison

We also compare our translation results quantitatively with baseline methods based on adversarial learning, score matching, or EBMs. Style codes are obtained either from the style encoder, by randomly selecting reference images in different domains, or from the style generator, by sampling from a Gaussian distribution. For each source image in the validation dataset, we obtain ten translated images for each target domain to compute the FID. Results for both human and animal face translation are shown in Table 1. We also compare our results with some pair-wise translation models on specific domain transfers and summarize the results in Table 3. Our model significantly outperforms existing cooperative learning methods, with the additional ability of guidance by reference images, and reaches performance comparable to GAN-based methods.

4.5 Comparison with Adversarial Learning

We compare the generation results of the EBM (ours) and the GAN (StarGAN2) in human and animal face translation with reference images and show the results in Figure 7. Consistent with the quantitative results in Table 1, our model generates high-resolution translation results comparable to those of GAN-based models.

4.6 Complexity Analysis

We compare the computational cost of the methods on a single NVIDIA A100 GPU and show the results in Table 4. We set the training batch size to 8 and the inference batch size to 32, at a resolution of 256 $\times$ 256, for all methods. CycleCoop is an EBM baseline that handles only two domains. StarGANv2 is the GAN-based baseline.

4.7 Ablation Study

We conduct an ablation study to evaluate the importance of each individual component proposed in our paper. In Table 2, we report the model performance in terms of FID after removing each key loss term ($\mathcal{L}_{\text{diverse}}$, $\mathcal{L}_{\text{cycle}}$, $\mathcal{L}_{\text{energy}}$, or $\mathcal{L}_{\text{mode}}$) from the objective function of our framework. We train our model at a 64 $\times$ 64 resolution on the CelebA-HQ and AFHQ datasets without using the progressive learning strategy. We show results of image translation using style codes obtained from both the style encoder and the style generator, and report the average performance. NA means that the model fails to learn and cannot generate meaningful results. The newly added regularization strategies for the descriptor and the translator, i.e., $\mathcal{L}_{\text{energy}}$ and $\mathcal{L}_{\text{mode}}$, are essential for stabilizing the cooperative training. In particular, the energy-based regularization loss $\mathcal{L}_{\text{mode}}$ plays an important role in ensuring that the translator can quickly catch up with the descriptor toward the data distribution during the cooperative training. $\mathcal{L}_{\text{energy}}$ yields a performance gain by limiting the magnitude of the energy values. The performance also drops significantly when the cycle-consistency loss $\mathcal{L}_{\text{cycle}}$ is removed, which indicates that it is a key objective for the unpaired cross-domain image translation task.
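The ablation protocol, removing one loss term at a time, can be written down as a small configuration sketch. The code below is only illustrative: it assumes, for simplicity, that the four terms enter a single weighted sum with unit weights, whereas in the actual framework the descriptor and translator terms belong to different updates. The helper names and the base loss are hypothetical.

```python
LOSS_TERMS = ("diverse", "cycle", "energy", "mode")

def ablation_configs(terms=LOSS_TERMS):
    """Build the full setting plus one setting per removed loss term."""
    configs = {"full": {t: 1.0 for t in terms}}
    for removed in terms:
        # A weight of 0.0 disables the corresponding term.
        configs[f"w/o {removed}"] = {t: 0.0 if t == removed else 1.0 for t in terms}
    return configs

def regularized_objective(base_loss, loss_values, weights):
    """Base loss plus the weighted loss terms (unit weights are assumed)."""
    return base_loss + sum(weights[t] * loss_values[t] for t in weights)
```

Each configuration would correspond to one separately trained model whose FID is then reported.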

5 Conclusion

We present PMD-CoopNets, a novel approach that combines energy-based learning, MCMC sampling, cooperative learning, and progressive learning for unpaired multi-domain image-to-image translation. Our method includes a multi-head energy-based model as a descriptor, capturing the multi-domain image distribution, and a diversified image-to-image translator for cross-domain one-to-many mapping. To train both the descriptor and translator, we introduce a multi-domain MCMC teaching algorithm. Additionally, we propose progressive learning to enhance the scalability and efficiency of our model. Experimental results demonstrate that our approach achieves comparable performance to adversarial learning frameworks and sets a new benchmark in energy-based image-to-image translation methods.

References

Baek et al. [2021] Kyungjune Baek, Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Hyunjung Shim. Rethinking the truly unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14154–14163, 2021.
Bińkowski et al. [2018] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
Choi et al. [2021] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
Choi et al. [2020] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
Du and Mordatch [2019] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, (NeurIPS), 2019.
Du et al. [2021] Yilun Du, Shuang Li, Joshua B. Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy-based models. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021.
Gao et al. [2018] Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9155–9164, 2018.
Gao et al. [2020] Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020.
Gao et al. [2021] Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P. Kingma. Learning energy-based models by diffusion recovery likelihood. In The ninth International Conference on Learning Representations, ICLR, 2021.
Grathwohl et al. [2021] Will Sussman Grathwohl, Jacob Jin Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, and David Duvenaud. No MCMC for me: Amortized sampling for fast and stable training of energy-based models. In The ninth International Conference on Learning Representations, ICLR, 2021.
Han et al. [2022] Giwoong Han, Jinhong Min, and Sung Won Han. Em-last: Effective multidimensional latent space transport for an unpaired image-to-image translation with an energy-based model. IEEE Access, 10:72839–72849, 2022.
Han et al. [2019] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8670–8679, 2019.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Hinton [2002] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
Hinton [2012] Geoffrey E. Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the Trade - Second Edition, pages 599–619. Springer, 2012.
Huang et al. [2018] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pages 172–189, 2018.
Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
Kim et al. [2021] Hyunsu Kim, Yunjey Choi, Junho Kim, Sungjoo Yoo, and Youngjung Uh. Exploiting spatial dimensions of latent in gan for real-time image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 852–861, 2021.
Kim and Bengio [2016] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
Kumar et al. [2019] Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
LeCun et al. [2006] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
Lee et al. [2021] Hanbit Lee, Jinseok Seol, and Sang-goo Lee. Contrastive learning for unsupervised image-to-image translation. arXiv preprint arXiv:2105.03117, 2021.
Lee et al. [2018] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pages 35–51, 2018.
Liu et al. [2021] Yahui Liu, Enver Sangineto, Yajing Chen, Linchao Bao, Haoxian Zhang, Nicu Sebe, Bruno Lepri, Wei Wang, and Marco De Nadai. Smoothing the disentangled latent style space for unsupervised image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10785–10794, 2021.
Mao et al. [2019] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1429–1437, 2019.
Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
Nijkamp et al. [2019] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run mcmc toward energy-based model. Advances in Neural Information Processing Systems, 32, 2019.
Park et al. [2020a] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 319–345. Springer, 2020a.
Park et al. [2020b] Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman, Alexei Efros, and Richard Zhang. Swapping autoencoder for deep image manipulation. Advances in Neural Information Processing Systems, 33:7198–7211, 2020b.
Xiao et al. [2021] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. Vaebm: A symbiosis between variational autoencoders and energy-based models. In The ninth International Conference on Learning Representations, ICLR, 2021.
Xie et al. [2016] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. In International Conference on Machine Learning (ICML), 2016.
Xie et al. [2018] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence, 42:27–45, 2018.
Xie et al. [2021a] Jianwen Xie, Zilong Zheng, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu. Learning cycle-consistent cooperative networks via alternating mcmc teaching for unsupervised cross-domain translation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10430–10440, 2021a.
Xie et al. [2021b] Jianwen Xie, Zilong Zheng, and Ping Li. Learning energy-based model with variational auto-encoder as amortized sampler. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), pages 10441–10451, 2021b.
Xie et al. [2022a] Jianwen Xie, Zilong Zheng, Xiaolin Fang, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of fast thinking initializer and slow thinking solver for conditional learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(8):3957–3973, 2022a.
Xie et al. [2022b] Jianwen Xie, Yaxuan Zhu, Jun Li, and Ping Li. A tale of two flows: Cooperative learning of langevin flow and normalizing flow toward energy-based model. In International Conference on Learning Representations (ICLR), 2022b.
Zhang et al. [2022a] Jing Zhang, Jianwen Xie, Zilong Zheng, and Nick Barnes. Energy-based generative cooperative saliency prediction. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), pages 3280–3290, 2022a.
Zhang et al. [2022b] Jing Zhang, Jianwen Xie, Zilong Zheng, and Nick Barnes. Energy-based generative cooperative saliency prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3280–3290, 2022b.
Zhao et al. [2022] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems, 35:3609–3623, 2022.
Zhao et al. [2021a] Yang Zhao, Jianwen Xie, and Ping Li. Learning energy-based generative models via coarse-to-fine expanding and sampling. In International Conference on Learning Representations, 2021a.
Zhao et al. [2021b] Yang Zhao, Jianwen Xie, and Ping Li. Learning energy-based generative models via coarse-to-fine expanding and sampling. In The ninth International Conference on Learning Representations, ICLR, 2021b.
Zhao et al. [2023] Yang Zhao, Jianwen Xie, and Ping Li. Coopinit: Initializing generative adversarial networks via cooperative learning. arXiv preprint arXiv:2303.11649, 2023.
Zhu et al. [1998] Song Chun Zhu, Yingnian Wu, and David Mumford. Filters, random fields and maximum entropy (frame): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27:107–126, 1998.