
Wasserstein Geodesic Generator for Conditional Distributions

Young-geun Kim ([email protected])
Department of Biostatistics and Department of Psychiatry, Columbia University, USA

Kyungbok Lee ([email protected])
Department of Statistics, Seoul, Republic of Korea

Youngwon Choi ([email protected])
Department of Statistics, Seoul, Republic of Korea

Joong-Ho Won ([email protected])
Department of Statistics, Seoul, Republic of Korea

Myunghee Cho Paik ([email protected])
Department of Statistics, Seoul, Republic of Korea
Shepherd23 Inc., Republic of Korea
Abstract

Generating samples given a specific label requires estimating conditional distributions. We derive a tractable upper bound of the Wasserstein distance between conditional distributions to lay the theoretical groundwork to learn conditional distributions. Based on this result, we propose a novel conditional generation algorithm where conditional distributions are fully characterized by a metric space defined by a statistical distance. We employ optimal transport theory to propose the Wasserstein geodesic generator, a new conditional generator that learns the Wasserstein geodesic. The proposed method learns both conditional distributions for observed domains and optimal transport maps between them. The conditional distributions given unobserved intermediate domains are on the Wasserstein geodesic between conditional distributions given two observed domain labels. The proposed method generates the Wasserstein geodesic under some conditions. Experiments on face images with light conditions as domain labels demonstrate the efficacy of the proposed method.

Keywords: Generative model, Optimal transport, Conditional generation, Wasserstein geodesic, Wasserstein barycenter

1 Introduction

Conditional generation is the task of constructing synthetic samples following target distributions given by specific domain labels such as age, emotion, and gender. Important applications include class-conditional image generation (Odena et al., 2017; Bao et al., 2017), age progression (Antipov et al., 2017; Wang et al., 2018), text-to-image synthesis (Reed et al., 2016; Zhang et al., 2017a), and data augmentation (Frid-Adar et al., 2018; Shao et al., 2019).

Most conditional generation methods are extended from outstanding image generative models such as variational autoencoders (VAEs) (Kingma and Welling, 2014), generative adversarial networks (GANs) (Goodfellow et al., 2014), and adversarial autoencoders (AAEs) (Makhzani et al., 2015). They model the distribution of images by transforming latent variables with deep neural networks. State-of-the-art conditional generative methods include conditional VAE (cVAE) (Sohn et al., 2015), conditional GAN (cGAN) (Mirza and Osindero, 2014), and conditional AAE (cAAE) (Makhzani et al., 2015). The main extension from image generative models is to concatenate domain labels into the latent variable so that the generator is a function of the latent variable as well as the domain label. A detailed review of current methods is provided in Section 2.

Various conditional generative models have demonstrated realistic results for observed domains and have also been applied to generate samples for unobserved intermediate domains. For example, in the age progression literature, models trained with images of people in their 20s and 50s can be applied to generate synthetic samples for unobserved intermediate domains such as the 30s and 40s. Existing methods pass intermediate domain values through deep neural networks and presume that the generated data arise from the conditional distribution given the intermediate domain value. We anticipate that conditional distributions change smoothly over domain values. However, existing methods do not guarantee that conditional distributions change smoothly over the in-between regions where data are unobserved. Moreover, a theoretical framework describing paths on the space of conditional distributions indexed by domain label values has not been provided.

The Wasserstein geodesic is the shortest path between two distributions in terms of the Wasserstein distance. We propose a novel conditional generator that learns the Wasserstein geodesic, named the Wasserstein geodesic generator. The proposed method is composed of two elements of the Wasserstein geodesic, the conditional distributions given observed domains and the optimal transport maps between them, so that the conditional distributions reside in the Wasserstein space, a metric space defined by the Wasserstein distance. The two elements are the vertices and edges of the Wasserstein geodesic in the space of conditional distributions, respectively. For the vertices, we propose a novel notion of conditional sub-coupling for conditional generation, and adopt it to derive a tractable upper bound of the expected Wasserstein distance between the target and model conditional distributions. For the edges, the proposed method learns the optimal transport map with respect to (w.r.t.) the metric on the feature space specified by encoder networks. We prove that the conditional distributions given unobserved intermediate domain labels constitute the constant-speed Wasserstein geodesic between the observed domains. Our work is the first to propose conditional distributions, given both observed and unobserved domains, that are fully characterized by a metric space w.r.t. a statistical distance.

Our contributions are summarized as follows:

  • We propose a novel conditional generator that learns the Wasserstein geodesic, named the Wasserstein geodesic generator. Our work is the first that can generate samples whose conditional distributions are fully characterized by the Wasserstein space.

  • We lay a theoretical groundwork for learning conditional distributions with the Wasserstein distance by deriving a tractable upper bound of the Wasserstein distance between conditional distributions.

  • We employ optimal transport maps between conditional distributions given two observed domains to construct the Wasserstein geodesic between the observed points in the space of conditional distributions.

  • We show that the proposed distribution approximates the Wasserstein barycenter in scenarios with multiple observed distributions. It becomes the Wasserstein barycenter when the distributions of representations are identical across observed domains.

  • Experiments on face images with light conditions as domain labels demonstrate the efficacy of the proposed method.

The remainder of the paper is organized as follows. In Section 2, we review related works including conditional generative models. Section 3 presents theoretical results to derive a tractable upper bound of the expected Wasserstein distance between conditional distributions and Section 4 introduces the proposed method. Section 5 presents experimental results on a face image dataset with light conditions as domain labels. All proofs of theoretical results are provided in Appendix A.

2 Related Works

This section reviews related works on conditional generation, Data-to-Data Translation, and Wasserstein geometry. Our approach utilizes conditional generative models to learn observed conditional distributions, which serve as the vertices in the distribution space. We leverage Data-to-Data Translation techniques to learn intermediate paths between these observed conditional distributions, effectively establishing edges between the vertices in the distribution space. Additionally, we employ the properties of the Wasserstein space to comprehensively characterize this entire process.

The term conditional generation encompasses various meanings, sometimes leading to confusion. In the context of our work, conditional generation refers to a specific process of transforming latent variables and domain labels to generate samples following conditional distributions given domain label values. We make a clear distinction between Data-to-Data Translation and conditional generative models. This distinction is made to clarify the differences between conditioning on representations and domain labels and conditioning on other observed data. It further emphasizes their orthogonal roles: conditional generative models focus on learning vertices (the distributions associated with specific domain labels), while Data-to-Data Translation concentrates on learning edges (the transitions between these distributions) in the Wasserstein space. In some works in the disentangled representation learning literature (Chen et al., 2016; Higgins et al., 2017; Makhzani et al., 2015), domain labels are not available and pseudo-labels are introduced to imitate domain labels.

2.1 Conditional Generative Model

Most conditional generative models are extended from image generation methods. We first review three eminent image generative models: VAEs, GANs, and AAEs. To synthesize realistic data, all three methods aim to learn a generator that transforms latent factors, which follow a user-specified prior distribution. VAEs consist of encoder and decoder networks. They model the joint likelihood of latent factors and their transformations, feed-forwarded by decoder networks, and seek the maximum likelihood estimator for the marginal distribution of observations. Due to the intractability of the likelihood with nonlinear decoder networks, VAEs employ variational inference (Bishop, 2006) to maximize the evidence lower bound, using encoder networks to approximate the distribution of latent variables given observations. In contrast, GANs are composed of discriminator and generator networks. The generator networks in GANs serve the same function as decoders in VAEs, transforming latent factors to generate data. However, GANs introduce discriminator networks to form an adversarial loss for generator training. AAEs similarly employ discriminator and generator networks, together with encoders. Unlike GANs, the discriminator of AAEs aims to classify encoded results and latent variables drawn from the prior distribution. Training AAEs can be interpreted as minimizing the 1-Wasserstein distance between distributions of real data and generation results, a special case of Wasserstein autoencoders (WAEs) (Tolstikhin et al., 2018).

Statistical distances employed in forming training objectives of generative models are pivotal to the quality of generation results. Both VAEs and GANs utilize the f-divergence (Csiszár, 1964) to learn distributions of images. As an alternative statistical distance, Arjovsky et al. (2017) showed that the Wasserstein distance has advantages over the f-divergence when the supports of data distributions lie on a low-dimensional manifold, as is the case for image data. The Wasserstein distance yields differentiable losses, whereas f-divergences may be ill-defined or yield non-differentiable losses. Possibly due to these advantages, Wasserstein distance-based approaches, including AAEs, WAEs, and Wasserstein GANs (Arjovsky et al., 2017), have often outperformed f-divergence-based approaches.

The main extension from image generative models to conditional models is to concatenate domain labels with the latent variable. In cVAE, cGAN, and cAAE, domain labels are incorporated into the encoder networks of VAE, the discriminator and generator of GAN, and the decoder of AAE, respectively. To enhance the visual quality and diversity of generation results, Kameoka et al. (2018), Odena et al. (2017), and Zhao et al. (2018) introduce auxiliary classifiers that match the observed and predicted domain labels of conditional generation results. Conditional generative models can be used to generate data for unobserved domains. This is accomplished by inserting unobserved domain label values into trained models, a technique demonstrated in zero-shot learning approaches (Xian et al., 2018; Chao et al., 2016) aimed at boosting classification performance for unseen classes. However, this approach faces limitations. First, it assumes that latent variables for unobserved domains follow patterns similar to those for observed domains. Second, the generated data distribution for unobserved domains has not been justified by a metric space on distributions. Intuitively, the unobserved intermediate distribution serves as the centroid of the observed distributions, but this property has not been discussed. In contrast, our approach constructs geodesics in the Wasserstein space for unobserved domains, and the proposed distribution is the Wasserstein barycenter when the distributions of representations are identical across observed domains. This result is achieved without assuming the homogeneity of representations across both observed and unobserved domains, distinguishing our approach from existing works.

We employ conditional generative models to learn data distributions for observed domain labels, the observed vertices in the Wasserstein space. In Section 3, we present theoretical results that provide a tractable objective for minimizing the formulated Wasserstein distances between conditional distributions.

2.2 Data-to-Data Translation

Conditional generative models find a transformation from latent variables and domain labels to data, effectively learning conditional distributions. In contrast, another line of work, which we refer to as Data-to-Data Translation (Kim et al., 2017; Choi et al., 2018; Zhu et al., 2017), operates under a different paradigm. In Data-to-Data Translation, the source and target domains are predefined, and the goal is to find a transport map from the source data to the target data. Typical examples include unpaired translations, such as converting daytime scenes to nighttime scenes, where corresponding pairs of daytime and nighttime scenes for the same location are not required during training. Prominent methods within this field include multi-modal translations across different data domains (Xu et al., 2018; Isola et al., 2017), such as Text-to-Image Translation, as demonstrated by DALL-E (Ramesh et al., 2021), and multi-domain Image-to-Image Translation, as exemplified by StarGAN (Choi et al., 2018).

CycleGAN (Zhu et al., 2017) is a pioneering work in unpaired Image-to-Image Translation. It minimizes the adversarial loss between target data and translated source data, with a cycle consistency loss encouraging an inverse relation between the translation maps of the two domains. In this case, the transformation is encouraged to match the distributions of the target data and the converted source data, but it does not learn conditional distributions given domain labels, since real data from the source domain are required to construct target data. Liu et al. (2017) introduce a latent variable model to learn conditional distributions. However, the properties of conditional distributions given unobserved intermediate domains have not been discussed in the Data-to-Data Translation literature.

Our proposed method leverages Data-to-Data Translation techniques to find optimal transport maps between observed conditional distributions, thereby defining edges between observed vertices in the Wasserstein space. In Section 4, we propose to generate intermediate data from edges, Wasserstein geodesics, and extend it to approximate the centroid, which is the Wasserstein barycenter of observed distributions.

2.3 Wasserstein Space

The Wasserstein space, a metric space of distributions endowed with the Wasserstein distance, has found widespread applications in various fields of generative models. We review related works in image processing, domain adaptation, and data augmentation. Other important applications of Wasserstein space include density matching (Cisneros-Velarde and Bullo, 2020), distribution alignment (Zhou et al., 2022), online learning (Korotin et al., 2021), and Bayesian inference (Srivastava et al., 2018).

In image processing, most applications focus on transporting point clouds (Cuturi, 2013) or specific features such as texture (Rabin et al., 2012), colors (Rabin et al., 2014), and shapes (Solomon et al., 2015) from a source image to a target image. While these approaches have shown remarkable results, they typically require training a model or solving an optimization problem for each pair of source and target images. Taking a different route, Mroueh (2020) introduces universal style transfer that employs autoencoders. The method exploits the Wasserstein geodesic of Gaussian measures, using features extracted by encoder networks. However, in Mroueh's framework, distributions of generated samples for unobserved intermediate domains cannot be characterized by Wasserstein spaces. On a related note, Korotin et al. (2019) put forth a generative model to solve the dual form of the 2-Wasserstein distance between two distributions of images, which they apply to image style transfer tasks.

In domain adaptation, Xie et al. (2019) propose latent variable models that use a single representation to generate images for multiple domains while minimizing the transportation costs between them. However, this method requires modality-specific generators and cannot generate intermediate distributions. The Wasserstein Barycenter Transport (WBT) (Montesuma and Mboula, 2021) is closely related to our work. The WBT targets the Wasserstein barycenter of multiple observed source distributions to generate unobserved intermediate domains, but it requires pairs of observations from all source domains and solves an optimization problem for every generation.

In data augmentation, several recent works utilize the Wasserstein space. Bespalov et al. (2022) propose to augment landmark coordinates of facial images with the Wasserstein barycenter, but their method requires computing Wasserstein distances between all pairs of images to oversample landmark data. Zhu et al. (2023) augment data from the Wasserstein barycenter of image distributions to learn robust classifiers, but this method assumes Gaussianity of the conditional distributions. The work by Fan and Alvarez-Melis (2023) is closely related to ours. Their approach synthesizes data for unobserved domains by applying linear combinations of optimal transport maps between datasets, essentially generating data from the generalized Wasserstein geodesic of observed data distributions. Despite the merits of generalized geodesics, such as convexity and impressive performance in transfer learning tasks, the method employs the optimal transport dataset distance (Alvarez-Melis and Fusi, 2020), which depends on classification labels from each domain. Additionally, the generalized geodesic differs from the Wasserstein barycenter, and the method uses an alternative transportation cost, the $(2,\nu)$-transport metric (Craig, 2016), in optimization.

Existing methods typically rely on strong assumptions, such as the data following Gaussian distributions, or require solving optimization problems each time data are generated. Other assumptions include the existence of the Wasserstein distance, optimal transport map, Wasserstein geodesics, and Wasserstein barycenters with Euclidean distances on the data space, implying that the distribution of high-dimensional data is continuous. In contrast, our work fills this gap by generating and justifying intermediate, unobserved distributions without the aforementioned assumptions.

3 Theoretical Results on Wasserstein Distance between Conditional Distributions

3.1 Basic Notations

We provide basic notations as follows. Random variables, their realizations, and their supports are denoted by capital, small, and calligraphic capital letters, respectively. The real data, generated samples, and domain labels are denoted by $X$, $\tilde{X}$, and $C$, respectively. We denote the set of distributions defined on a given support $\mathcal{X}$ by $\mathcal{P}(\mathcal{X})$ and the conditional distribution of $X$ given $Y$ by $\mathbb{P}_{X|Y}$.

For any metric $d$ on $\mathcal{X}$ and probability measures $\mathbb{P}_{X}$ and $\mathbb{P}_{Y}$ in $\mathcal{P}(\mathcal{X})$, the $p$-Wasserstein distance between $\mathbb{P}_{X}$ and $\mathbb{P}_{Y}$ w.r.t. $d$ is denoted by $W_{p}(\mathbb{P}_{X},\mathbb{P}_{Y};d):=\big(\underset{\pi\in\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})}{\inf}\int d^{p}(x,y)\,d\pi(x,y)\big)^{1/p}$, where $p\in[1,\infty)$ and $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$ is the set of all couplings of $\mathbb{P}_{X}$ and $\mathbb{P}_{Y}$. For brevity, we omit $d$, the metric on $\mathcal{X}$, in the Wasserstein distance when there is no confusion. We assume compactness and convexity of $\mathcal{X}$ to ensure that the Wasserstein space $(\mathcal{P}(\mathcal{X}),W_{p})$ is a geodesic space in which every two points can be connected by a constant-speed geodesic (Santambrogio, 2015).
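As a concrete illustration (not part of the original paper), for one-dimensional empirical distributions with the same number of points and uniform weights, the $p$-Wasserstein distance reduces to matching sorted samples. The following NumPy sketch computes it under these simplifying assumptions.

import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """p-Wasserstein distance between two 1-D empirical distributions
    with equal sample sizes and uniform weights: in one dimension the
    optimal coupling matches sorted samples, so W_p^p is the average
    of |x_(i) - y_(i)|^p over the order statistics."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)

# Example: two Gaussian samples whose means differ by 2.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
y = rng.normal(2.0, 1.0, size=1000)
print(wasserstein_p_1d(x, y, p=2))  # approximately 2.0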

3.2 Distances between Conditional Distributions

Generating samples given a specific label requires learning conditional distributions given domain labels. In this section, we formulate distances between target and model conditional distributions and derive a tractable upper bound of the Wasserstein distance between conditional distributions.

Denoting the latent variable independent of domain labels by $Z\sim\mathbb{P}_{Z}$, the distribution of the observed domain labels by $\mathbb{P}_{C}$, and the conditional generator by $\text{Gen}:\mathcal{Z}\times\mathcal{C}\to\mathcal{X}$, the model conditional distribution can be expressed as $\mathbb{P}_{\text{Gen}(Z,C)|C}(\cdot|c)$ where $c\in\mathcal{C}$. To learn $\mathbb{P}_{X|C}(\cdot|c)$, we formulate a class of distances between conditional distributions as

\int\mathcal{D}(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{\text{Gen}(Z,C)|C}(\cdot|c))\,d\mathbb{P}_{C}(c), \quad (1)

where $\mathcal{D}$ is a measure between distributions. Various statistical distances can be considered for $\mathcal{D}$; when we choose the Kullback-Leibler divergence, Equation (1) can be minimized by maximizing the expectation of the variational lower bound of the conditional log-likelihood over the distribution of domain labels. For the case of the Jensen-Shannon divergence, adversarial learning with a discriminator and generator incorporating domain labels minimizes Equation (1).

To retain these advantages over f-divergences, we focus on Equation (1) equipped with Wasserstein distances to learn the Wasserstein geodesic. Since no previous work has formulated or derived properties of the Wasserstein distance between conditional distributions given domain labels, in the next section we derive an upper bound of Equation (1) that has a tractable representation.

3.3 A Tractable Upper Bound of Wasserstein Distance between Conditional Distributions

In this section, we lay a theoretical groundwork by deriving a tractable upper bound of the expected Wasserstein distance between conditional distributions.

We first propose a new set of couplings for conditional generation, conditional sub-coupling.

Definition 1

For any $\mathbb{P}_{(X,C)}$ and $\mathbb{P}_{(Y,C)}$, we define the conditional sub-coupling as the set of all probability measures expressed as $\int\pi^{*}(\cdot|c)\,d\mathbb{P}_{C}(c)$ for some $\{\pi^{*}(\cdot|c)\}_{c\in\mathcal{C}}$ where $\pi^{*}(\cdot|c)\in\Pi(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))$. The conditional sub-coupling is denoted by $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$.

The conditional sub-coupling is the set of all probability measures induced by couplings of conditional distributions. It is nonempty and equals $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$ if $(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))=(\mathbb{P}_{X},\mathbb{P}_{Y})$ for all $c\in\mathcal{C}$. The following example provides cases where the conditional sub-coupling is a proper subset of $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$. Let $N(\mu_{1},\mu_{2},\sigma_{1},\sigma_{2},\rho)$ denote the bivariate Gaussian distribution with mean $(\mu_{1},\mu_{2})^{T}$ and covariance $\begin{pmatrix}\sigma_{1}^{2}&\rho\sigma_{1}\sigma_{2}\\ \rho\sigma_{1}\sigma_{2}&\sigma_{2}^{2}\end{pmatrix}$.

Example 1

Let $\mathbb{P}_{(X,C)}$ be $N(\mu_{X},\mu_{C},\sigma_{X},\sigma_{C},\rho_{XC})$ and $\mathbb{P}_{(Y,C)}$ be $N(\mu_{Y},\mu_{C},\sigma_{Y},\sigma_{C},\rho_{YC})$. Then, $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})\setminus\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$ includes $N(\mu_{X},\mu_{Y},\sigma_{X},\sigma_{Y},\rho^{*})$ if and only if $\lvert\rho^{*}-\rho_{XC}\rho_{YC}\rvert>\sqrt{(1-\rho_{XC}^{2})(1-\rho_{YC}^{2})}$.

Further discussions and proofs about the conditional sub-coupling are provided in Appendix A. With the conditional sub-coupling, we derive an upper bound of the expected $p$-Wasserstein distance in the following theorem.

Theorem 2

Let $\mathbb{P}_{(X,C)}$ and $\mathbb{P}_{(Y,C)}$ be distributions in $\mathcal{P}(\mathcal{X}\times\mathcal{C})$. For any metric $d$ on $\mathcal{X}$ and $p\in[1,\infty)$ defining $W_{p}$,

\bigg(\int W_{p}^{p}(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))\,d\mathbb{P}_{C}(c)\bigg)^{1/p}\leq\bigg(\underset{\pi^{*}\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})}{\inf}\int d^{p}(x,y)\,d\pi^{*}(x,y)\bigg)^{1/p}.

(2)

That is, the minimum transport cost over the conditional sub-coupling is an upper bound of the expected Wasserstein distance between conditional distributions.

We show a tractable representation of the upper bound in the following theorem. We denote by $\mathcal{Q}$ the set of all $\mathbb{Q}_{Z|X,C}$ satisfying $\mathbb{P}_{(Z,C)}(z,c)=\big(\int\mathbb{Q}_{Z|X,C}(z|x,c)\,d\mathbb{P}_{X|C}(x|c)\big)\mathbb{P}_{C}(c)$; the bracketed term can be considered as an aggregate posterior (Makhzani et al., 2015).

Theorem 3

Let $\mathbb{P}_{(X,C)}$ and $\mathbb{P}_{(Z,C)}$ be distributions in $\mathcal{P}(\mathcal{X}\times\mathcal{C})$ and $\mathcal{P}(\mathcal{Z}\times\mathcal{C})$, respectively. For any metric space $(\mathcal{X},d)$, $p\in[1,\infty)$, and generator $\text{Gen}:\mathcal{Z}\times\mathcal{C}\rightarrow\mathcal{X}$,

\underset{\pi^{*}\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(\text{Gen}(Z,C),C)}|\mathbb{P}_{C})}{\inf}\int d^{p}(x,\tilde{x})\,d\pi^{*}(x,\tilde{x})=\underset{\mathbb{Q}_{Z|X,C}\in\mathcal{Q}}{\inf}\int d^{p}(x,\text{Gen}(z,c))\,d\mathbb{Q}_{Z|X,C}(z|x,c)\,d\mathbb{P}_{(X,C)}(x,c).

That is, the upper bound of the Wasserstein distance between conditional distributions, the RHS of Equation (2), can be expressed as the infimum of the reconstruction error over encoders $\mathbb{Q}_{Z|X,C}\in\mathcal{Q}$. Note that the integrand on the LHS of Equation (2) depends on the conditioning value $c$ and requires evaluating the Wasserstein distance for every realization $c$, which is infeasible. In contrast, the derived representation can be computed by solving a stochastic optimization problem. When the terms related to the domain label $C$ are removed, Theorem 3 reduces to the representation of the Wasserstein distance between marginal distributions provided by Tolstikhin et al. (2018).
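For intuition, the right-hand side of Theorem 3 can be approximated on mini-batches once the encoder and generator are parameterized by neural networks. The PyTorch-style sketch below (ours, not the authors' code) computes the Monte Carlo estimate of the reconstruction term; the modules enc and gen are assumed to be user-defined, and the constraint that the aggregate posterior matches the prior is enforced separately (see Section 4.4).

import torch

def conditional_recon_loss(x, c, enc, gen, p=2):
    """Mini-batch estimate of the reconstruction term in Theorem 3,
    E_{(X,C)} E_{Z|X,C} [ d^p(X, Gen(Z, C)) ], with d the Euclidean metric.
    enc(x, c) returns a latent code z and gen(z, c) a reconstruction of x;
    both are user-supplied networks."""
    z = enc(x, c)                 # z drawn from (or computed by) Q(Z | X=x, C=c)
    x_recon = gen(z, c)           # push the code and the label through the generator
    # d^p(x, x_recon), averaged over the mini-batch (x flattened to vectors)
    return torch.mean(torch.sum((x - x_recon).flatten(1) ** 2, dim=1) ** (p / 2))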

4 Proposed method

4.1 Motivation

For a motivating example, suppose data come from one of two observed domains whose label values are $c_{0}$ and $c_{1}$. Existing methods in the conditional generation literature have considered $\text{Gen}(Z,(1-t)c_{0}+tc_{1})$ as intermediate samples (Zhang et al., 2017b). However, without a strong assumption such as a linear structure, the interplay between $\mathbb{P}_{X|C}(\cdot|c_{0})$, $\mathbb{P}_{X|C}(\cdot|c_{1})$, and $\mathbb{P}_{\text{Gen}(Z,C)|C}(\cdot|(1-t)c_{0}+tc_{1})$ is difficult to formalize.

A desirable property of generated samples for unobserved intermediate domains is that their conditional distributions change smoothly from one observed domain to another. The next section proposes a new conditional generator that constructs samples from distributions on the constant-speed geodesic in the Wasserstein space.

Definition 4

(Constant-speed geodesic) (Santambrogio, 2015) For any $\mathbb{P}$ and $\mathbb{Q}$ on a Wasserstein space $(\mathcal{P}(\mathcal{X}),W_{p})$, a parameterized curve $w:[0,1]\to\mathcal{P}(\mathcal{X})$ is called the constant-speed geodesic from $\mathbb{P}$ to $\mathbb{Q}$ in $W_{p}$ if $w(0)=\mathbb{P}$, $w(1)=\mathbb{Q}$, and $W_{p}(w(t),w(s))=|t-s|\,W_{p}(w(0),w(1))$ for any $t,s\in[0,1]$.

That is, a constant-speed geodesic in a Wasserstein space is a parameterized curve whose speed equals the Wasserstein distance between its endpoints. Our method yields the conditional distribution given an unobserved intermediate domain label as an interpolation point between the conditional distributions given observed domain labels in the Wasserstein space. Unlike existing methods, the generated distributions are fully characterized by the Wasserstein space $(\mathcal{P}(\mathcal{X}),W_{p})$.
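To make the notion concrete, for one-dimensional empirical distributions the constant-speed geodesic is displacement interpolation: optimally matched samples are interpolated linearly. The sketch below is our own illustration and assumes equal sample sizes and uniform weights.

import numpy as np

def geodesic_samples_1d(x0, x1, t):
    """Samples from the point w(t) on the constant-speed 2-Wasserstein geodesic
    between two 1-D empirical distributions: in one dimension the optimal
    transport map matches sorted samples, so w(t) is the law of
    (1 - t) * x0_sorted + t * x1_sorted."""
    x0_sorted = np.sort(np.asarray(x0, dtype=float))
    x1_sorted = np.sort(np.asarray(x1, dtype=float))
    return (1.0 - t) * x0_sorted + t * x1_sorted

rng = np.random.default_rng(0)
x0 = rng.normal(-2.0, 0.5, size=500)
x1 = rng.normal(3.0, 1.5, size=500)
midpoint = geodesic_samples_1d(x0, x1, t=0.5)
print(midpoint.mean(), midpoint.std())  # roughly mean 0.5 and standard deviation 1.0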

4.2 Conditional Generator for Learning Wasserstein Geodesics

This section proposes the Wasserstein geodesic generator, a novel conditional generator for learning Wasserstein geodesics. Our method learns the conditional distributions given observed domains and the optimal transport maps between them to construct the Wasserstein geodesic. The proposed method consists of three networks: an encoder $\text{Enc}:\mathcal{X}\times\mathcal{C}\to\mathcal{Z}$, a generator $\text{Gen}:\mathcal{Z}\times\mathcal{C}\to\mathcal{X}$, and a transport map $T:\mathcal{X}\times\mathcal{C}^{2}\to\mathcal{X}$.

We first define the optimal transport map, and then provide the proposed method.

Definition 5

(Optimal transport map) (Santambrogio, 2015) A map $T:\mathcal{X}\to\mathcal{X}$ is an optimal transport map from $\mathbb{P}_{X}$ to $\mathbb{P}_{Y}$ w.r.t. $\text{Cost}:\mathcal{X}^{2}\to\mathbb{R}$ if $T$ is a solution of the Monge-Kantorovich transportation problem,

\underset{T}{\text{minimize}}\quad \int\text{Cost}(x,T(x))\,d\mathbb{P}_{X}(x)
\text{subject to}\quad \mathbb{P}_{Y}=\mathbb{P}_{T(X)}.

The optimal transport map is the map that yields the minimum transportation cost. The optimal transport map exists uniquely if the cost function is the $p$-th power of the $l_{2}$-distance, denoted by $\|\cdot\|$, with $p>1$, and the measures are absolutely continuous on compact domains. The minimum transportation cost attained by the optimal transport map is $W^{p}_{p}(\mathbb{P}_{X},\mathbb{P}_{Y})$.

We now present the Wasserstein geodesic generator. The proposed method postulates an encoder, generator, and transport map satisfying the following conditions (A1) through (A3) to generate intermediate samples. Here, $d_{\text{Enc}}((x_{1},c_{1}),(x_{2},c_{2})):=\lVert(\text{Enc}(x_{1},c_{1}),c_{1})-(\text{Enc}(x_{2},c_{2}),c_{2})\rVert$ is a metric on $\mathcal{X}\times\mathcal{C}$ induced by the Euclidean norm on $\mathcal{Z}\times\mathcal{C}$.

  1. (A1)

    (One-to-one mapping between $\mathcal{X}\times\mathcal{C}$ and $\mathcal{Z}\times\mathcal{C}$) For any $(x,c)\in\mathcal{X}\times\mathcal{C}$ and $(z,c)\in\mathcal{Z}\times\mathcal{C}$, $\text{Gen}(\text{Enc}(x,c),c)=x$ and $\text{Enc}(\text{Gen}(z,c),c)=z$.

  2. (A2)

    (Absolutely continuous representations) For any $c\in\mathcal{C}$, $\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c)$ is absolutely continuous, is defined on a compact set, and has finite second moments.

  3. (A3)

    (Optimal transportation) For any observed domain labels $c_{m}$ and $c_{m^{\prime}}$, $(T(\cdot,\cdot,c_{m^{\prime}}),c_{m^{\prime}}):\mathcal{X}\times\{c_{m}\}\to\mathcal{X}\times\{c_{m^{\prime}}\}$ is the optimal transport map from $\mathbb{P}_{(X,c_{m})|C}(\cdot|c_{m})$ to $\mathbb{P}_{(X,c_{m^{\prime}})|C}(\cdot|c_{m^{\prime}})$ w.r.t. $d^{p}_{\text{Enc}}$.

Note that condition (A1) concerns the inverse relation between the encoder and generator for fixed domain labels, (A2) guarantees the existence and uniqueness of optimal transport maps between observed conditional distributions w.r.t. $d^{p}_{\text{Enc}}$, and (A3) builds paths between observed conditional distributions with optimal transport maps.

Lemma 6

Suppose the encoder and generator satisfy conditions (A1) and (A2). Then, for any $c$ and $c^{\prime}\in\mathcal{C}$,

W_{p}(\mathbb{P}_{(X,C)|C}(\cdot|c),\mathbb{P}_{(X,C)|C}(\cdot|c^{\prime});d_{\text{Enc}})=W_{p}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c^{\prime});\lVert\cdot\rVert).

The encoder and generator define a distance-preserving (isometric) mapping between two Wasserstein spaces, allowing us to connect their geometric structures, including geodesics and barycenters. With Lemma 6 and optimal transport maps, we can first generate geodesics on $(\mathcal{P}(\mathcal{Z}\times\mathcal{C}),W_{p}(\cdot,\cdot;\lVert\cdot\rVert))$ and then map them to geodesics on $(\mathcal{P}(\mathcal{X}\times\mathcal{C}),W_{p}(\cdot,\cdot;d_{\text{Enc}}))$ to generate unobserved intermediate conditional distributions.

Theorem 7

Suppose the encoder, generator, and transport map satisfy conditions (A1) through (A3). For any two observed domain labels $c_{0}$ and $c_{1}$, their convex combination $c_{t}:=(1-t)c_{0}+tc_{1}$, and $X_{0}\sim\mathbb{P}_{X|C}(\cdot|c_{0})$, the latent interpolation of $X_{0}$ and its transported result $T(X_{0},c_{0},c_{1})$ can be expressed as $\text{Gen}((1-t)\text{Enc}(X_{0},c_{0})+t\,\text{Enc}(T(X_{0},c_{0},c_{1}),c_{1}),c_{t})$. Then, the curve of distributions of the latent interpolation results is the constant-speed geodesic from $\mathbb{P}_{(X,C)|C}(\cdot|c_{0})$ to $\mathbb{P}_{(X,C)|C}(\cdot|c_{1})$ in $W_{p}(\cdot,\cdot;d_{\text{Enc}})$.

The conditional distributions of the samples generated by the proposed method constitute the Wasserstein geodesic, yielding the minimum transportation cost quantified by $d_{\text{Enc}}$ between the conditional distributions of observed domains.
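As a usage sketch (not part of the paper), and assuming trained modules enc, gen, and trans that play the roles of Enc, Gen, and T and satisfy (A1) through (A3), intermediate samples along the geodesic of Theorem 7 can be generated as follows.

import torch

@torch.no_grad()
def geodesic_sample(x0, c0, c1, t, enc, gen, trans):
    """Generate a sample from the point w(t) on the Wasserstein geodesic
    between the conditional distributions given c0 and c1 (Theorem 7)."""
    x1 = trans(x0, c0, c1)            # transport x0 to the domain with label c1
    z0 = enc(x0, c0)                  # encode the source sample
    z1 = enc(x1, c1)                  # encode its transported version
    zt = (1.0 - t) * z0 + t * z1      # latent interpolation
    ct = (1.0 - t) * c0 + t * c1      # interpolated domain label
    return gen(zt, ct)                # decode the intermediate sample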

4.3 Generation from the Wasserstein Barycenter with Wasserstein Geodesic Generator

This section extends the Wasserstein geodesic generator to scenarios involving multiple observed conditional distributions. We explain how the proposed distribution approximates the centroid of the observed conditional distributions, with an interpretable upper bound on the approximation error. Furthermore, we show that the proposed distribution is the Wasserstein barycenter under some conditions.

We first define the Wasserstein barycenter, the centroid of distributions within the Wasserstein space.

Definition 8

(Wasserstein barycenter) (Agueh and Carlier, 2011) For any $M$ distributions $(\mathbb{P}_{m})_{m=1}^{M}$ defined on a Wasserstein space $(\mathcal{P}(\mathcal{X}),W_{2})$ and non-negative real numbers $(\alpha_{m})_{m=1}^{M}$, the Wasserstein barycenter of $(\mathbb{P}_{m})_{m=1}^{M}$ with weights $(\alpha_{m})_{m=1}^{M}$ is the unique solution of $\underset{\mathbb{P}}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{m})$.

Specifically, when $M=2$, the constant-speed geodesic $w(t)$ in Definition 4 is the Wasserstein barycenter of the two distributions with weights $(1-t,t)$. In this context of defining a centroid between two distributions, the Wasserstein geodesic is referred to as McCann's interpolant (McCann, 1995). Wasserstein barycenters have been acknowledged as an effective solution for aggregating high-dimensional distributions (Korotin et al., 2022) across various applications, including data augmentation (Huguet et al., 2022; Bespalov et al., 2021; Zhu et al., 2023) and domain adaptation (Montesuma and Mboula, 2021).

The Wasserstein barycenter has several advantages in generating unobserved conditional distributions. First, it provides smooth and stable transitions between observed distributions, which is essential for synthesizing data for new, unobserved domains. This allows the generated data to inherit characteristics from observed conditional distributions without abrupt changes. Second, it minimizes the average optimal transportation cost to the observed distributions, thus minimally altering observed data to synthesize unobserved data. Last, the Wasserstein barycenter can be employed to infer the characteristics of unobserved domains, e.g., estimating the conditional average treatment effect in clinical trials (Huguet et al., 2022). Despite these advantages, the computational complexity of estimating Wasserstein barycenters remains a significant bottleneck (Cuturi, 2013; Lin et al., 2020).
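As a small illustration of the concept (ours, not from the paper), in the special one-dimensional case the 2-Wasserstein barycenter has a closed form: its quantile function is the weighted average of the input quantile functions, so averaging sorted samples suffices. The sketch assumes equal sample sizes and uniform weights on points.

import numpy as np

def barycenter_samples_1d(samples, weights):
    """Samples from the 2-Wasserstein barycenter of 1-D empirical distributions:
    average the sorted samples (i.e., the empirical quantile functions)
    with the given barycenter weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    sorted_samples = [np.sort(np.asarray(s, dtype=float)) for s in samples]
    return sum(w * s for w, s in zip(weights, sorted_samples))

rng = np.random.default_rng(0)
p1 = rng.normal(-1.0, 1.0, size=400)
p2 = rng.normal(0.0, 2.0, size=400)
p3 = rng.normal(4.0, 0.5, size=400)
bary = barycenter_samples_1d([p1, p2, p3], weights=[1, 1, 1])
print(bary.mean(), bary.std())  # roughly mean 1.0 and standard deviation (1 + 2 + 0.5) / 3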

We extend the Wasserstein geodesic generator to generate unobserved intermediate distributions from multiple observed distributions. In Theorem 10, we establish an interpretable bound on the error incurred when approximating the Wasserstein barycenter with our proposed distribution. We further show that, under condition (A4) on the homogeneity of representations across observed domains, the proposed distribution is the Wasserstein barycenter. We begin with a lemma that elucidates how this homogeneity affects the average squared Wasserstein distances.

Lemma 9

Suppose the encoder and generator satisfy conditions (A1) and (A2). Then, for any $M$ observed domain labels $(c_{m})_{m=1}^{M}$ and their convex combination $\bar{c}:=\sum_{m=1}^{M}\alpha_{m}c_{m}$, the quantity $\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{X}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})$ is equal to

\underset{\mathbb{P}\in\mathcal{P}(\mathcal{Z})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m});\lVert\cdot\rVert)+\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}. \quad (3)


In Equation (3), the first term is the Wasserstein variance (Martinet et al., 2022) of $(\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m}))_{m=1}^{M}$, while the second term is the variance of $(c_{m})_{m=1}^{M}$ w.r.t. the weights $(\alpha_{m})_{m=1}^{M}$. When multiple convex combinations are possible, we select the combination that minimizes the variance $\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}$. In the special case where $M=2$ and $c$ is univariate, this is equivalent to using the two nearest observed distributions to generate intermediate distributions. The Wasserstein variance quantifies the homogeneity of representations across observed conditional distributions and is zero if and only if the distributions of representations from all observed domains are identical, that is, the following condition holds.

  • (A4)

    (Homogeneous representations across observed domains) For any two observed domain labels $c_{m}$ and $c_{m^{\prime}}$, $\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m})=\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m^{\prime}})$.

For any $M$ observed domain labels $(c_{m})_{m=1}^{M}$, their convex combination $\bar{c}:=\sum_{m=1}^{M}\alpha_{m}c_{m}$, and $X_{1}\sim\mathbb{P}_{X|C}(\cdot|c_{1})$, the latent interpolation of $X_{1}$ and its transported results $(T(X_{1},c_{1},c_{m}))_{m=2}^{M}$ can be expressed as $\tilde{X}(\alpha_{1},\dots,\alpha_{M}):=\text{Gen}\big(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}),\bar{c}\big)$. In the subsequent theorem, we derive an upper bound for the difference between the average squared Wasserstein distances attained by our proposed distribution and by the Wasserstein barycenter. Additionally, when we further assume the homogeneity of representations, condition (A4), the proposed method generates the Wasserstein barycenter of the observed conditional distributions.

Theorem 10

Suppose the encoder, generator, and transport map satisfy conditions (A1) through (A3). Then,

\begin{split}&\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P}_{(\tilde{X}(\alpha_{1},\dots,\alpha_{M}),\bar{c})},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})-\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{X}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})\\ &\leq\frac{1}{2}\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}\int d^{2}_{\text{Enc}}\big((T(x_{1},c_{1},c_{m^{\prime}}),c_{m^{\prime}}),(T(T(x_{1},c_{1},c_{m}),c_{m},c_{m^{\prime}}),c_{m^{\prime}})\big)\,d\mathbb{P}_{X|C}(x_{1}|c_{1})\end{split}

(4)

holds. When we further assume condition (A4), the upper bound is zero and $\mathbb{P}_{(\tilde{X}(\alpha_{1},\dots,\alpha_{M}),\bar{c})}$ is the Wasserstein barycenter of $(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}))_{m=1}^{M}$ w.r.t. weights $(\alpha_{m})_{m=1}^{M}$.

To the best of the authors' knowledge, this is the first result that justifies latent interpolation from an optimal transport point of view without resorting to Gaussian or univariate assumptions. The upper bound is half of the average squared distance between $T(X_{1},c_{1},c_{m^{\prime}})$ and $T(T(X_{1},c_{1},c_{m}),c_{m},c_{m^{\prime}})$. The two coincide in the univariate scenario but not necessarily in other cases, which makes the upper bound non-negative. Although we detail results for the 2-Wasserstein distance, for which both existence and uniqueness of barycenters are extensively examined, similar results hold for general $p$-Wasserstein distances. If (A1) through (A4) hold and $\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{X}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{p}(\mathbb{P},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})$ has solutions (called Fréchet means w.r.t. $W_{p}$), then the proposed distribution is a solution. Note that condition (A4) is weaker than the following condition.

  • (A5)

    (Homogeneous representations) $\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c)$ is constant w.r.t. $c\in\mathcal{C}$; equivalently, $\text{Enc}(X,C)\perp\!\!\!\perp C$.

Most existing conditional generative models, including cVAE, cGAN, and cAAE, assume (A5). When the unobserved domains possess patterns distinctly different from the observed ones, this condition may not be satisfied. For example, in face frontalization (Huang et al., 2017), if we have facial images captured from the right and left angles and aim to synthesize frontal facial images, the frontal images might exhibit unique features, such as symmetrical facial structures, clear visibility of both eyes, and a full view of facial landmarks like the nose bridge and forehead. In the following theorem, under (A5), we show that the generation result follows the true conditional distribution.

Theorem 11

Suppose the encoder, generator, and transport map satisfy conditions (A1) through (A5). Then, $\mathbb{P}_{\tilde{X}(\alpha_{1},\dots,\alpha_{M})}=\mathbb{P}_{X|C=\bar{c}}$.

Algorithm 1 Conditional generation with the Wasserstein geodesic generator
1: Input: The pair of encoder, generator, and transport map $(\text{Enc},\text{Gen},T)$ satisfying (A1) through (A4), observed domain labels $(c_{m})_{m=1}^{M}$, and their convex combination $\bar{c}:=\sum_{m=1}^{M}\alpha_{m}c_{m}$.
2: Output: $\tilde{x}(\alpha_{1},\dots,\alpha_{M})$ following the Wasserstein barycenter of $(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}))_{m=1}^{M}$ w.r.t. weights $(\alpha_{m})_{m=1}^{M}$
3: # Conditional generation on an observed distribution (generation from a vertex)
4: Sample $z$ from $\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1})$
5: $\tilde{x}_{1}=\text{Gen}(z,c_{1})$
6: # Translation to other observed domains (moving to other observed vertices through edges)
7: $\tilde{x}_{1\to m}=T(\tilde{x}_{1},c_{1},c_{m})$
8: # Latent interpolation (finding the centroid)
9: $\tilde{z}(\alpha_{1},\dots,\alpha_{M})=\sum_{m=1}^{M}\alpha_{m}\text{Enc}(\tilde{x}_{1\to m},c_{m})$
10: $\tilde{x}(\alpha_{1},\dots,\alpha_{M})=\text{Gen}(\tilde{z}(\alpha_{1},\dots,\alpha_{M}),\bar{c})$

In summary, when (A1) through (A3) hold, we can construct distributions that change smoothly between observed conditional distributions by generating geodesics. When we further assume (A4), we can generate data from the Wasserstein barycenter without observations, as described in Algorithm 1. Furthermore, the proposed distribution is the true conditional distribution when (A5) holds.
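For readers who prefer code, the following is a minimal Python rendering of Algorithm 1 (an illustrative sketch, not the authors' released implementation). It assumes trained modules enc, gen, and trans for Enc, Gen, and T, and a callable sample_z1 that draws latent codes for the observed domain c_1.

import torch

@torch.no_grad()
def wasserstein_barycenter_sample(cs, alphas, enc, gen, trans, sample_z1):
    """Algorithm 1: generate a sample approximately following the Wasserstein
    barycenter of the observed conditional distributions given the labels cs.
    cs: list of observed domain labels [c_1, ..., c_M]; alphas: weights summing to one."""
    c_bar = sum(a * c for a, c in zip(alphas, cs))   # convex combination of labels
    z = sample_z1()                                  # step 4: sample a latent code for c_1
    x1 = gen(z, cs[0])                               # step 5: generate from the vertex c_1
    # steps 7 and 9: transport to every observed domain and average the encodings
    z_bar = sum(a * enc(trans(x1, cs[0], c), c) for a, c in zip(alphas, cs))
    return gen(z_bar, c_bar)                         # step 10: decode the centroid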

4.4 Implementation

Training the proposed method amounts to learning networks satisfying conditions (A1) through (A4), and consists of two steps. The first step learns the encoder and generator pair, and the second step learns the transport map with the learned encoder. These two steps learn the vertices and edges of the geodesic, respectively.

For the first step, motivated by Theorem 3 and condition (A1), we minimize the reconstruction error with two penalty terms,

\begin{split}&\int\|x-\text{Gen}(\text{Enc}(x,c),c)\|^{p}\,d\mathbb{P}_{X|C}(x|c)\,d\mathbb{P}_{C}(c)\\ &+\lambda_{\text{MatchLatent}}\,\mathcal{D}_{\text{JS}}\bigg(\mathbb{P}_{Z}(z)\mathbb{P}_{C}(c),\Big(\int\delta_{z}(\text{Enc}(x,c))\,d\mathbb{P}_{X|C}(x|c)\Big)\mathbb{P}_{C}(c)\bigg)\\ &+\lambda_{\text{ReconLatent}}\int\|(z,c)-(\text{Enc}(\text{Gen}(z,c),c),c)\|\,d\mathbb{P}_{(Z,C)}(z,c),\end{split} \quad (5)

where the first term is the reconstruction error, the second term enforces the constraint $\mathcal{Q}$ on the encoder network in the representation of the upper bound derived in Theorem 3, and the last term enforces condition (A1). We replace $\mathbb{Q}_{Z|X,C}$ with the deterministic encoder Enc and $\mathcal{D}$ with $\mathcal{D}_{\text{JS}}$, and set $\mathbb{P}_{(Z,C)}=\mathbb{P}_{Z}\mathbb{P}_{C}$ to learn a Wasserstein geodesic along which the information independent of domain labels is minimally changed. In implementation, we apply a GAN for the second term and use interpolations of encoded values for the last term.
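A schematic PyTorch version of the first-step objective in Equation (5) is given below. It is a sketch under our own simplifications rather than the released code: the Jensen-Shannon penalty is replaced by its standard GAN surrogate with a latent discriminator latent_disc, sample_prior draws from the prior of Z, and lam_match and lam_recon_latent stand for the penalty weights.

import torch
import torch.nn.functional as F

def first_step_losses(x, c, enc, gen, latent_disc, sample_prior,
                      p=2, lam_match=1.0, lam_recon_latent=1.0):
    """Mini-batch losses mirroring Equation (5): data reconstruction, adversarial
    matching of encoded latents to the prior, and latent reconstruction for (A1)."""
    # (i) reconstruction error in data space
    z_enc = enc(x, c)
    recon = torch.mean(torch.sum((x - gen(z_enc, c)).flatten(1) ** 2, dim=1) ** (p / 2))

    # (ii) GAN surrogate of the Jensen-Shannon term: the discriminator separates
    # prior codes from encoded codes, both paired with the domain label c
    z_prior = sample_prior(x.size(0))
    d_real = latent_disc(z_prior, c)
    d_fake = latent_disc(z_enc.detach(), c)
    disc_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    enc_adv = F.binary_cross_entropy_with_logits(latent_disc(z_enc, c),
                                                 torch.ones_like(d_fake))

    # (iii) latent reconstruction enforcing Enc(Gen(z, c), c) = z in (A1)
    latent_recon = torch.mean(torch.norm(z_prior - enc(gen(z_prior, c), c), dim=1))

    enc_gen_loss = recon + lam_match * enc_adv + lam_recon_latent * latent_recon
    return enc_gen_loss, disc_loss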

For the second step, with the encoder learned in the first step, we solve Monge-Kantorovich transportation problems w.r.t. $d^{p}_{\text{Enc}}$,

\underset{T(\cdot,c_{0},c_{1})}{\text{minimize}}\quad \int d_{\text{Enc}}^{p}((x_{0},c_{0}),(T(x_{0},c_{0},c_{1}),c_{1}))\,d\mathbb{P}_{X|C}(x_{0}|c_{0})
\text{subject to}\quad \mathbb{P}_{(X,c_{1})|C}(\cdot|c_{1})=\mathbb{P}_{(T(X,c_{0},c_{1}),c_{1})|C}(\cdot|c_{0})

for all observed domain label values $c_{0}$ and $c_{1}$. The objective is

\begin{split}&\int d_{\text{Enc}}^{p}((x_{0},c_{0}),(T(x_{0},c_{0},c_{1}),c_{1}))\,d\mathbb{P}_{X|C}(x_{0}|c_{0})\,d\mathbb{P}_{C}(c_{0})\,d\mathbb{P}_{C}(c_{1})\\ &+\lambda_{\text{MatchData}}\,W_{p}(\mathbb{P}_{(X,C)}\mathbb{P}_{C},\mathbb{P}_{(T(X_{0},C_{0},C_{1}),C_{1},C_{0})};\|\cdot\|)\\ &+\lambda_{\text{Cycle}}\int\|x_{0}-T(T(x_{0},c_{0},c_{1}),c_{1},c_{0})\|\,d\mathbb{P}_{X|C}(x_{0}|c_{0})\,d\mathbb{P}_{C}(c_{0})\,d\mathbb{P}_{C}(c_{1}).\end{split} \quad (6)

Here, $\mathbb{P}_{(T(X_{0},C_{0},C_{1}),C_{1},C_{0})}$ is the distribution of $(T(X_{0},C_{0},C_{1}),C_{1},C_{0})$, where $C_{0}$ and $C_{1}$ are independent samples from $\mathbb{P}_{C}$ and $X_{0}$ is sampled from $\mathbb{P}_{X|C}(\cdot|C_{0})$. The first term is the transportation cost measured by $d_{\text{Enc}}^{p}$, and the second term enforces the constraint on the conditional distributions of transported data. Note that the second term is zero if and only if $\mathbb{P}_{(X,c_{1})|C}(\cdot|c_{1})=\mathbb{P}_{(T(X,c_{0},c_{1}),c_{1})|C}(\cdot|c_{0})$ for all $(c_{0},c_{1})$ sampled from $\mathbb{P}_{C}\mathbb{P}_{C}$. The last term is the cycle consistency loss, which encourages the transport map from one domain to another and its reverse to be inverses of each other. The cycle consistency loss was proposed in the Data-to-Data Translation literature (Zhu et al., 2017; Choi et al., 2018) from a heuristic point of view to avoid mode collapse, but in our method it enforces the transport map to satisfy the inverse relation of optimal transport maps between the two domains. We show in the following theorem that the minimizer of the objective in Equation (6) is unique and is the optimal transport map between observed domains.

Theorem 12

Let $T_{c_{0}\to c_{1}}^{{\dagger}}$ be the optimal transport map from the conditional distribution given domain label $c_{0}$ to that given $c_{1}$ w.r.t. $d^{p}_{\text{Enc}}$. Then, $T^{{\dagger}}(x_{0},c_{0},c_{1}):=T_{c_{0}\to c_{1}}^{{\dagger}}(x_{0})$, with probability $1$ w.r.t. $\mathbb{P}_{X|C}(x_{0}|c_{0})$ for all $c_{0},c_{1}\in\mathcal{C}$, is the unique minimizer of the objective in Equation (6).

In implementation, we apply WGAN with gradient penalty (Gulrajani et al., 2017) for the second term, employ an auxiliary regressor to enhance the visual quality and diversity of generated samples, and add a reconstruction error, $\int d(x,T(x,c,c))\,d\mathbb{P}_{(X,C)}(x,c)$, for regularization.
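The sketch below mirrors a mini-batch version of the transport-map objective in Equation (6), again under our own assumptions rather than the released code: the frozen first-step encoder defines the cost, the distribution-matching term is replaced by a WGAN-style critic score as described above (the critic and its gradient-penalty update are assumed to be handled elsewhere), and lam_match_data and lam_cycle stand for the penalty weights.

import torch

def transport_map_loss(x0, c0, c1, enc, trans, critic,
                       p=2, lam_match_data=1.0, lam_cycle=10.0):
    """Mini-batch surrogate of Equation (6) for training the transport map T."""
    x01 = trans(x0, c0, c1)                           # transport x0 from c0 to c1

    # (i) transportation cost d_Enc^p between (x0, c0) and (T(x0, c0, c1), c1)
    feat0 = torch.cat([enc(x0, c0), c0], dim=1)
    feat1 = torch.cat([enc(x01, c1), c1], dim=1)
    cost = torch.mean(torch.norm(feat0 - feat1, dim=1) ** p)

    # (ii) adversarial surrogate for matching the transported and target distributions
    match = -torch.mean(critic(x01, c1, c0))

    # (iii) cycle consistency encouraging T(., c1, c0) to invert T(., c0, c1)
    cycle = torch.mean(torch.norm((x0 - trans(x01, c1, c0)).flatten(1), dim=1))

    return cost + lam_match_data * match + lam_cycle * cycle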

With the learned encoder, generator, and transport map, we can generate samples whose conditional distributions lie on the Wasserstein geodesic along which the information independent of domain labels is minimally changed.

4.5 Relationship with Other Methods

If either of the two steps of our training algorithm is dropped and an appropriate modification is made, our algorithm reduces to either cAAE or CycleGAN.

Suppose the second step is dropped. The first step alone learns the encoder and generator pair by minimizing the objective in Equation (5). If we remove $\mathbb{P}_{C}$ in the second term and drop the third term, the objective in Equation (5) reduces to that of cAAE. Thus it is the $\mathbb{P}_{C}$ in the second term that enables the proposed algorithm to learn features independent of domain labels.

Now suppose the first step is removed. Since the second step optimizes the objective in Equation (6), which includes $d_{\text{Enc}}$, it inherently depends on the first step. However, if we redefine the distance so that it does not depend on the encoder, the second step can operate independently. Without $d_{\text{Enc}}$, one no longer learns a transport map that minimally changes the content independent of the conditioning variables. If we replace $d_{\text{Enc}}$ in the first term with the $l_{2}$ norm and $W_{p}$ in the second term with the Jensen-Shannon divergence, and if there are only two observed domain labels $c_{0}$ and $c_{1}$, the objective in Equation (6) reduces to that of CycleGAN. Inheriting the results from Theorem 12, we can establish the following corollary.

Corollary 13

Let $T_{c_{i}\to c_{j}}^{{\dagger}}$ be the optimal transport map from the conditional distribution given domain label $c_{i}$ to that given $c_{j}$ w.r.t. $\|\cdot\|$, where $(i,j)=(0,1)$ or $(1,0)$. Then, $T^{{\dagger}}(x_{i},c_{i},c_{j}):=T_{c_{i}\to c_{j}}^{{\dagger}}(x_{i})$, with probability $1$ w.r.t. $\mathbb{P}_{X|C}(x_{i}|c_{i})$ for $(i,j)=(0,1),(1,0)$, is the unique minimizer of the objective of CycleGAN.

Lu et al. (2019) specifically point out that, theoretically, there is no claim on the detailed properties of the mapping established by CycleGAN. On the other hand, Korotin et al. (2019) used the cycle consistency term to promote inverse relations between optimal transport maps, albeit with an objective based on the dual form of the 2-Wasserstein distance, which distinguishes it from CycleGAN's objective. Because CycleGAN can be viewed as a special case of our proposed method, it can now be interpreted in terms of optimal transport theory, an interpretation that had not been established.


Figure 1: Conditional generation results by the proposed method (left) and cAAE (right). The proposed method produces face images with clearer eyes, noses, and mouths than baselines. For each method, the leftmost and rightmost columns show generation results for observed domains and intermediate columns show results for unobserved intermediate domains.


Figure 2: Comparison of transportation results by various methods. Left: A visualization of transportation results. The proposed method gradually casts a shadow reflecting the three-dimensional structure of the nose and mouth in the face, which makes the outcomes visually sharper and more plausible than the baselines. For the first three rows, the middle column shows the real image and the other columns show transportation results. Middle: Box plots of the FID ($\downarrow$) scores from various transport maps. The proposed method outperforms the baselines. The means of the FID scores of CycleGAN, StarGAN, and the proposed method, with standard deviations, are 174.3 (28.9), 109.7 (9.7), and 74.2 (10.9), respectively. Right: Surfaces of the FID ($\downarrow$) scores from various transport maps as functions of azimuth and elevation. In all combinations of light conditions, the proposed method outperforms the baselines.

5 Experiments

5.1 Experimental Setting

We conduct experiments on the Extended Yale Face Database B (Extended Yale-B) (Georghiades et al., 2001) dataset. The Extended Yale-B dataset consists of face images from 29 subjects, 10 poses, and 64 light conditions. We consider the light conditions as domain labels. For each light condition, the azimuth and elevation of the light source are provided. The total number of images is about 15,000.

We split the Extended Yale-B dataset by subject to construct the training, validation, and test sets. The numbers of subjects are 21, 4, and 4 for training, validation, and test, respectively. For data pre-processing, we apply the face detection algorithm proposed by Viola and Jones (2001) to crop the face region; we use the official code in OpenCV (Bradski, 2000). The range of images is scaled to [0,1], and a horizontal flip with probability 0.5 is applied during training.
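A minimal pre-processing sketch along these lines is shown below; it is our own illustration and assumes the standard Haar cascade shipped with OpenCV rather than the authors' exact detector settings.

import cv2
import numpy as np

def crop_face(image_path, out_size=64):
    """Detect a face with the Viola-Jones (Haar cascade) detector in OpenCV,
    crop it, resize it, and scale pixel values to [0, 1]."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face found; skip this image
    x, y, w, h = faces[0]                             # take the first detection
    crop = cv2.resize(gray[y:y + h, x:x + w], (out_size, out_size))
    return crop.astype(np.float32) / 255.0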

For baselines, we consider cAAE, CycleGAN, and StarGAN (Choi et al., 2018). As described in Section 4.5, our encoder and generator pair can reduce to cAAE and our transport map can reduce to CycleGAN. StarGAN is a state-of-the-art Data-to-Data Translation method, and we add it as a baseline for the transport map. The published algorithms of CycleGAN and StarGAN are inapplicable to continuous domain labels, so we add the source and target light conditions as an input to CycleGAN and change the auxiliary classifier of StarGAN to a regressor for our experiments.

In all methods, architectures of the encoder and generator networks are adopted from DCGAN (Radford et al., 2015), and the architecture of the transport map is adopted from StarGAN. Architectures are modified to concatenate light conditions to latent variables, and we control the size of the networks for a fair comparison. For both the proposed method and baselines, we train the encoder and generator pair for 100,000100,000 iterations with batch size of 3232, and train the transport map for 50,00050,000 iterations with batch size of 1616. We use the Adam (Kingma and Ba, 2014) optimizer and set the initial learning rate to 0.00020.0002 and to linearly decrease to 0 for the encoder and generator pair and to 0.00010.0001 for the transport map. Implementation details including architectures are provided in Appendix B. 333The implementation code is provided at the following link:
http://github.com/kyg0910/Wasserstein-Geodesic-Generator-for-Conditional-Distributions.
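
For concreteness, the optimizer schedule described above can be sketched in PyTorch as follows; the Adam momentum parameters are left at their defaults since they are not stated here, and the network parameters are placeholders.

```python
# Sketch of the optimizer setup: Adam with an initial learning rate that
# decays linearly to zero over training. Only the schedule logic is shown.
import torch

def make_optimizer(params, lr_init, total_iters):
    opt = torch.optim.Adam(params, lr=lr_init)
    # linear decay from lr_init to 0 over total_iters updates
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda it: max(0.0, 1.0 - it / total_iters))
    return opt, sched

# e.g., encoder/generator pair: lr 2e-4 for 100,000 iterations;
#       transport map:         lr 1e-4 for 50,000 iterations
```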

For evaluation, we use the Fréchet inception distance (FID) (Heusel et al., 2017), a dissimilarity measure between distributions that utilizes features of Inception-v3 (Szegedy et al., 2016). We evaluate the FID between the distributions of real and generated face images; lower values are better.
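
For reference, given Inception-v3 features of real and generated images, the FID reduces to the Fréchet distance between two Gaussians fitted to the features; a minimal sketch (omitting the feature extraction itself) is given below.

```python
# Minimal sketch of the FID computation given Inception-v3 features of
# real and generated images (arrays of shape [n, d]).
import numpy as np
from scipy import linalg

def fid(feat_real, feat_fake):
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov1 = np.cov(feat_real, rowvar=False)
    cov2 = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard small imaginary parts from sqrtm
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))
```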

5.2 Results

We compare the proposed method and the baselines on three tasks: (i) conditional generation for unobserved intermediate domains, (ii) Data-to-Data Translation, and (iii) latent interpolation with real data and their translation results. Figures 1, 2, and 3 present results for tasks (i), (ii), and (iii), respectively. Tasks (ii) and (iii) evaluate transport maps when an input image is given. We reiterate that CycleGAN and StarGAN cannot generate samples, while the proposed method can.

Figure 1 presents the conditional generation results for unobserved intermediate domains. We compare the proposed method and cAAE. The proposed method produces face images with clearer eyes, noses, and mouths than the baseline. For each method, the leftmost and rightmost columns show generation results for observed domains whose (azimuth, elevation) values are (0, 0) and (70, 45), respectively, and the intermediate columns show results for unobserved intermediate domains. For the proposed method, as described in Section 4.2, we first generate samples in the leftmost column, then transport the samples to the domain of the rightmost column, and finally apply latent interpolation where $t$ increases from 0 to 1 at equal intervals. For cAAE, we fix the latent variable $Z$ for each row and interpolate the domain label $C$ with equal spacing.
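
A minimal sketch of this generation procedure for intermediate domains is given below; Enc, Gen, and T stand for the trained encoder, generator, and transport map, and their call signatures are assumptions of this sketch that follow the paper's notation.

```python
# Sketch of conditional generation for unobserved intermediate domains:
# generate in the first observed domain, transport to the second observed
# domain, then interpolate latent codes and domain labels.
import numpy as np

def generate_intermediate(z, c0, c1, Enc, Gen, T, n_steps=7):
    x0 = Gen(z, c0)                    # sample in the first observed domain
    x1 = T(x0, c0, c1)                 # transport it to the second observed domain
    z0, z1 = Enc(x0, c0), Enc(x1, c1)  # encode both endpoints
    outputs = []
    for t in np.linspace(0.0, 1.0, n_steps):
        z_t = (1 - t) * z0 + t * z1                          # latent interpolation
        c_t = (1 - t) * np.asarray(c0) + t * np.asarray(c1)  # label interpolation
        outputs.append(Gen(z_t, c_t))
    return outputs
```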

Figure 2 compares the transportation results of the proposed method, CycleGAN, and StarGAN. The leftmost panel visualizes the transportation results. The proposed method gradually casts a shadow reflecting the three-dimensional structure of the nose and mouth in the face, which makes the outcomes visually sharper and more plausible than those of the baselines. The bottom row shows face images from various azimuths for a fixed subject, pose, and elevation. The elevation is 0 and the azimuth values are shown at the bottom. For the first three rows, the middle column shows the real data from the fourth row and the other columns show transportation results by the various methods for the observed domains corresponding to each column. Each method translates the real data in the middle column to the other observed domains. The middle panel shows box plots of the FID scores from the various transport maps. Lower values are better, and the proposed method outperforms the baselines. The means (standard deviations) of the FID scores of CycleGAN, StarGAN, and the proposed method are 174.3 (28.9), 109.7 (9.7), and 74.2 (10.9), respectively. To calculate the FID scores, we transport real face images from the (azimuth, elevation) of (0, 0) to the other observed domains and evaluate the FID between the transportation results and real images for every domain. The absolute values of azimuth and elevation are restricted to at most 50 and 35, respectively, to remove outliers from extreme light conditions. The rightmost panel presents the surfaces of the FID scores from the various transport maps as functions of azimuth and elevation. In all combinations of light conditions, the proposed method outperforms the baselines. To draw the plot, we estimate scores for unobserved domains by interpolating the scores from adjacent observed domains.
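
One way to produce such a surface is to interpolate the FID scores observed on the grid of (azimuth, elevation) pairs; the sketch below uses scipy.interpolate.griddata as one possible tool, which is an assumption rather than the exact plotting code.

```python
# Sketch of interpolating FID scores evaluated at observed
# (azimuth, elevation) pairs to a dense surface for plotting.
import numpy as np
from scipy.interpolate import griddata

def fid_surface(azimuths, elevations, scores, resolution=100):
    az_grid, el_grid = np.meshgrid(
        np.linspace(min(azimuths), max(azimuths), resolution),
        np.linspace(min(elevations), max(elevations), resolution))
    surface = griddata(
        points=np.stack([np.asarray(azimuths), np.asarray(elevations)], axis=1),
        values=np.asarray(scores),
        xi=(az_grid, el_grid), method="linear")
    return az_grid, el_grid, surface
```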


Figure 3: A visualization of latent interpolation with the real data and their translation results by various methods. The bottom row shows real images from two observed domains. From the first to third rows, the leftmost column shows the ground-truth, the rightmost column shows transportation results of the ground-truth, and intermediate columns show latent interpolation results for unobserved intermediate domains.

Figure 3 presents the results of latent interpolation on unobserved intermediate domains with a real image and its translation result. We compare the proposed method, CycleGAN, and StarGAN. We fix the encoder and generator pair for all methods and only change the transport map. The bottom row shows the ground-truth images of a fixed subject and pose from two observed domains with (azimuth, elevation) of (0, 0) for the leftmost and (70, 0) for the rightmost. In the top three rows, the leftmost column shows the ground-truth, the rightmost column shows the transportation results of the ground-truth by the various transport maps, and the intermediate columns show latent interpolation results for unobserved intermediate domains. As in Figure 2, the outputs of the proposed method are visually sharper and more plausible than those of the baselines.

6 Conclusion

We established a theoretical framework for the space of conditional distributions given domain labels and proposed the Wasserstein geodesic generator, a novel conditional generator that learns the Wasserstein geodesic. We derived a tractable upper bound of the Wasserstein distance between conditional distributions to learn those given observed domains, and applied optimal transport maps between them to generate a constant-speed Wasserstein geodesic of the conditional distributions given unobserved intermediate domains. Our work is the first to generate samples whose conditional distributions are fully characterized by a metric space w.r.t. a statistical distance. Experiments on face images with light conditions as domain labels demonstrate the efficacy of the proposed method, with visually more plausible results and better FID scores than the baselines.


Acknowledgments and Disclosure of Funding

The work of Young-geun Kim was supported by the National Research Foundation (NRF) of Korea under Grant NRF-2020R1A2C1A01011950 and by the National Institute of Mental Health of U.S. under Grant R01MH124106. The work of Kyungbok Lee and Myunghee Cho Paik was supported by the NRF under Grant NRF-2020R1A2C1A01011950.

Appendix A Theoretical Results with Proofs

A.1 Further Discussions and Proofs about the Conditional Sub-coupling

In this section, we discuss properties of the conditional sub-coupling $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$ and provide proofs. The following proposition describes properties of the conditional sub-coupling and the corresponding properties of the upper bound of the expected Wasserstein distance between conditional distributions in Theorem 2.

Proposition 14

For any $\mathbb{P}_{(X,C)}$ and $\mathbb{P}_{(Y,C)}$, $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$ is nonempty, included in $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$, and equal to $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$ if $(\mathbb{P}_{X|C},\mathbb{P}_{Y|C})=(\mathbb{P}_{X},\mathbb{P}_{Y})$. These imply that the RHS in Theorem 2,

$$\bigg(\underset{\pi^{*}\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})}{\inf}\int d^{p}(x,y)\,d\pi^{*}(x,y)\bigg)^{1/p}, \qquad (7)$$

is finite, greater than or equal to $W_{p}(\mathbb{P}_{X},\mathbb{P}_{Y})$, and equal to $W_{p}(\mathbb{P}_{X},\mathbb{P}_{Y})$ if $(\mathbb{P}_{X|C},\mathbb{P}_{Y|C})=(\mathbb{P}_{X},\mathbb{P}_{Y})$.

Proof  First, the conditional sub-coupling $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$ contains $\int\mathbb{P}_{X|C}(\cdot|c)\times\mathbb{P}_{Y|C}(\cdot|c)\,d\mathbb{P}_{C}(c)$, so it is nonempty. Next, for any $\pi^{*}_{(X,Y)}\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$, there is $\{\pi^{*}_{(X,Y)|C}(\cdot|c)\}_{c\in\mathcal{C}}$ such that $\pi^{*}_{(X,Y)|C}(\cdot|c)\in\Pi(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))$ and $\pi^{*}_{(X,Y)}(\cdot)=\int\pi^{*}_{(X,Y)|C}(\cdot|c)\,d\mathbb{P}_{C}(c)$. This implies $\int\pi^{*}_{(X,Y)}(x,y)\,dx=\int\pi^{*}_{(X,Y)|C}(x,y|c)\,dx\,d\mathbb{P}_{C}(c)=\int\mathbb{P}_{Y|C}(y|c)\,d\mathbb{P}_{C}(c)=\mathbb{P}_{Y}(y)$ and, similarly, $\int\pi^{*}_{(X,Y)}(x,y)\,dy=\mathbb{P}_{X}(x)$. Thus, $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})\subseteq\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$, which implies that Equation (7) is greater than or equal to $W_{p}(\mathbb{P}_{X},\mathbb{P}_{Y})$. Finally, if $(\mathbb{P}_{X|C},\mathbb{P}_{Y|C})=(\mathbb{P}_{X},\mathbb{P}_{Y})$, then for any $\pi\in\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$ we have $\pi(\cdot)=\int\pi(\cdot)\,d\mathbb{P}_{C}(c)$ and $\pi\in\Pi(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))$ for all $c\in\mathcal{C}$. Thus, $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})\subseteq\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$, which concludes the proof.
The following proposition provides a detailed discussion about the relation between $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$ and $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$, including non-Gaussian cases, by deriving a necessary condition for a coupling from $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$ to be in the conditional sub-coupling.

Proposition 15

For two given distributions $\mathbb{P}_{(X,C)}$ and $\mathbb{P}_{(Y,C)}$ and a distribution $\pi\in\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})$, we denote the covariance matrix of $\pi$ by $\begin{pmatrix}\Sigma_{XX}&\Sigma^{\pi}\\ (\Sigma^{\pi})^{T}&\Sigma_{YY}\end{pmatrix}$. Then,

$$\begin{pmatrix}\Sigma_{XX}&\Sigma^{\pi}&\Sigma_{XC}\\ (\Sigma^{\pi})^{T}&\Sigma_{YY}&\Sigma_{YC}\\ \Sigma_{XC}^{T}&\Sigma_{YC}^{T}&\Sigma_{CC}\end{pmatrix}\text{ is positive semi-definite} \qquad (8)$$

if $\pi\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$. Furthermore, when $\mathbb{P}_{(X,C)}$, $\mathbb{P}_{(Y,C)}$, and $\pi$ are multivariate Gaussian distributions, Equation (8) holds if and only if $\pi\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$.

Proof  By the definition of $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$, for any $\pi^{*}\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$, there is $\{\pi^{*}(\cdot|c)\}_{c\in\mathcal{C}}$ such that $\pi^{*}(\cdot)=\int\pi^{*}(\cdot|c)\,d\mathbb{P}_{C}(c)$ and $\pi^{*}(\cdot|c)\in\Pi(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))$ for all $c\in\mathcal{C}$. This implies the existence of random variables $(X^{*},Y^{*},C^{*})$ such that $(X^{*},Y^{*})|C^{*}=c\sim\pi^{*}(\cdot|c)$ for all $c\in\mathcal{C}$ and $C^{*}\sim\mathbb{P}_{C}$. Now, we have $(X^{*},Y^{*})\sim\pi^{*}$, $(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}$, and $(Y^{*},C^{*})\sim\mathbb{P}_{(Y,C)}$, so the covariance matrix of $(X^{*},Y^{*},C^{*})$ is $\begin{pmatrix}\Sigma_{XX}&\Sigma^{\pi}&\Sigma_{XC}\\ (\Sigma^{\pi})^{T}&\Sigma_{YY}&\Sigma_{YC}\\ \Sigma_{XC}^{T}&\Sigma_{YC}^{T}&\Sigma_{CC}\end{pmatrix}$, which should be positive semi-definite. For the final statement in the proposition, when $\mathbb{P}_{(X,C)}$, $\mathbb{P}_{(Y,C)}$, and $\pi$ are multivariate Gaussian distributions and Equation (8) holds, we can define $(X^{**},Y^{**},C^{**})$ following the multivariate Gaussian distribution $N\Big(\begin{pmatrix}\mu_{x}\\ \mu_{y}\\ \mu_{c}\end{pmatrix},\begin{pmatrix}\Sigma_{XX}&\Sigma^{\pi}&\Sigma_{XC}\\ (\Sigma^{\pi})^{T}&\Sigma_{YY}&\Sigma_{YC}\\ \Sigma_{XC}^{T}&\Sigma_{YC}^{T}&\Sigma_{CC}\end{pmatrix}\Big)$, where $\mu_{x}$, $\mu_{y}$, and $\mu_{c}$ are the means of $\mathbb{P}_{X}$, $\mathbb{P}_{Y}$, and $\mathbb{P}_{C}$, respectively. Since the distribution of $(X^{**},Y^{**})|C^{**}=c$ is in $\Pi(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))$ for all $c\in\mathcal{C}$ and $(X^{**},Y^{**})\sim\pi$, the proof is concluded.
That is, all the probability measures $\pi$ that cannot be utilized to define a distribution of $(X^{*},Y^{*},C^{*})$ whose marginals on $(X^{*},Y^{*})$, $(X^{*},C^{*})$, and $(Y^{*},C^{*})$ are $\pi$, $\mathbb{P}_{(X,C)}$, and $\mathbb{P}_{(Y,C)}$, respectively, are excluded from the conditional sub-coupling. We now restate Example 1 and provide the corresponding proof.

Example 1. Let $\mathbb{P}_{(X,C)}$ be $N(\mu_{X},\mu_{C},\sigma_{X},\sigma_{C},\rho_{XC})$ and $\mathbb{P}_{(Y,C)}$ be $N(\mu_{Y},\mu_{C},\sigma_{Y},\sigma_{C},\rho_{YC})$. Then, $\Pi(\mathbb{P}_{X},\mathbb{P}_{Y})\setminus\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$ includes $N(\mu_{X},\mu_{Y},\sigma_{X},\sigma_{Y},\rho^{*})$ if and only if $\lvert\rho^{*}-\rho_{XC}\rho_{YC}\rvert>\sqrt{(1-\rho_{XC}^{2})(1-\rho_{YC}^{2})}$.

Proof  By Proposition 15, it is sufficient to solve

$$\begin{vmatrix}\sigma_{X}^{2}&\rho^{*}\sigma_{X}\sigma_{Y}&\rho_{XC}\sigma_{X}\sigma_{C}\\ \rho^{*}\sigma_{X}\sigma_{Y}&\sigma_{Y}^{2}&\rho_{YC}\sigma_{Y}\sigma_{C}\\ \rho_{XC}\sigma_{X}\sigma_{C}&\rho_{YC}\sigma_{Y}\sigma_{C}&\sigma_{C}^{2}\end{vmatrix}=-\sigma_{X}^{2}\sigma_{Y}^{2}\sigma_{C}^{2}\big((\rho^{*})^{2}-2\rho_{XC}\rho_{YC}\rho^{*}+(\rho_{XC}^{2}+\rho_{YC}^{2}-1)\big)<0.$$

It is a quadratic inequality with respect to $\rho^{*}$, and the solution is $\rho^{*}<\rho_{XC}\rho_{YC}-\sqrt{(1-\rho_{XC}^{2})(1-\rho_{YC}^{2})}$ or $\rho^{*}>\rho_{XC}\rho_{YC}+\sqrt{(1-\rho_{XC}^{2})(1-\rho_{YC}^{2})}$.
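
The condition in Example 1 can also be checked numerically: with unit variances, the matrix in Proposition 15 is positive semi-definite exactly when $\lvert\rho^{*}-\rho_{XC}\rho_{YC}\rvert\leq\sqrt{(1-\rho_{XC}^{2})(1-\rho_{YC}^{2})}$. The following sketch verifies this for one choice of correlations; the particular values are illustrative.

```python
# Numerical sanity check of Example 1 / Proposition 15 with unit variances:
# the 3x3 matrix is positive semi-definite iff
# |rho* - rho_XC * rho_YC| <= sqrt((1 - rho_XC^2) * (1 - rho_YC^2)).
import numpy as np

def is_psd(rho_star, rho_xc, rho_yc, tol=1e-10):
    m = np.array([[1.0, rho_star, rho_xc],
                  [rho_star, 1.0, rho_yc],
                  [rho_xc, rho_yc, 1.0]])
    return np.linalg.eigvalsh(m).min() >= -tol

def bound(rho_xc, rho_yc):
    return np.sqrt((1 - rho_xc ** 2) * (1 - rho_yc ** 2))

rho_xc, rho_yc = 0.6, 0.3
for rho_star in np.linspace(-0.99, 0.99, 199):
    inside = abs(rho_star - rho_xc * rho_yc) <= bound(rho_xc, rho_yc)
    assert is_psd(rho_star, rho_xc, rho_yc) == inside
```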

A.2 Proofs of Theoretical Results

A.2.1 Proof of Theorem 2.

For any $\pi^{*}\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$, there is $\{\pi^{*}(\cdot|c)\}_{c\in\mathcal{C}}$ such that $\pi^{*}(\cdot|c)\in\Pi(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))$ and $\pi^{*}(\cdot)=\int\pi^{*}(\cdot|c)\,d\mathbb{P}_{C}(c)$. By the definition of the Wasserstein distance, $W_{p}^{p}(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))\leq\int d^{p}(x,y)\,d\pi^{*}(x,y|c)$. This implies that $\int W_{p}^{p}(\mathbb{P}_{X|C}(\cdot|c),\mathbb{P}_{Y|C}(\cdot|c))\,d\mathbb{P}_{C}(c)\leq\int d^{p}(x,y)\,d\big(\int\pi^{*}(\cdot|c)\,d\mathbb{P}_{C}(c)\big)(x,y)=\int d^{p}(x,y)\,d\pi^{*}(x,y)$. Now, taking the infimum over all of $\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(Y,C)}|\mathbb{P}_{C})$ concludes the proof.

A.2.2 Proof of Theorem 3.

We first provide a lemma to prove Theorem 3.

Lemma 16

For any two distributions $\mathbb{P}_{(X,C)}$ and $\mathbb{P}_{(Z,C)}$ and $G:\mathcal{Z}\times\mathcal{C}\rightarrow\mathcal{X}$, let $\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G)$ be the set of all probability measures $\pi$ such that there exists $(X^{*},\tilde{X}^{*},Z^{*},C^{*})$ satisfying $(X^{*},\tilde{X}^{*})\sim\pi$, $(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}$, $\tilde{X}^{*}=G(Z^{*},C^{*})$, and $(Z^{*},C^{*})\sim\mathbb{P}_{(Z,C)}$. Then, $\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G)=\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(G(Z,C),C)}|\mathbb{P}_{C})$.

Proof  First, we prove 𝒬~((X,C),(Z,C),G)Π((X,C),(G(Z,C),C)|C)\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G)\subseteq\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(G(Z,C),C)}|\mathbb{P}_{C}). By definition, for any π𝒬~((X,C),(Z,C),G)\pi\in\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G), there exists (X,X~,Z,C)(X^{*},\tilde{X}^{*},Z^{*},C^{*}) satisfying (X,X~)π(X^{*},\tilde{X}^{*})\sim\pi, (X,C)(X,C)(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}, X~=G(Z,C)\tilde{X}^{*}=G(Z^{*},C^{*}), and (Z,C)(Z,C)(Z^{*},C^{*})\sim\mathbb{P}_{(Z,C)}. Since X|CX|C(|C=C)X^{*}|C^{*}\sim\mathbb{P}_{X|C}(\cdot|C=C^{*}) and X~|CG(Z,C)|C(|C=C)\tilde{X}^{*}|C^{*}\sim\mathbb{P}_{G(Z,C)|C}(\cdot|C=C^{*}), the distribution of (X,X~)|C(X^{*},\tilde{X}^{*})|C^{*} is an element of Π(X|C(|C=C),G(Z,C)|C(|C=C))\Pi(\mathbb{P}_{X|C}(\cdot|C=C^{*}),\mathbb{P}_{G(Z,C)|C}(\cdot|C=C^{*})). Since CCC^{*}\sim\mathbb{P}_{C}, it is shown that πΠ((X,C),(G(Z,C),C)|C)\pi\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(G(Z,C),C)}|\mathbb{P}_{C}).

Next, we prove 𝒬~((X,C),(Z,C),G)Π((X,C),(G(Z,C),C)|C)\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G)\supseteq\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(G(Z,C),C)}|\mathbb{P}_{C}). For any πΠ((X,C),(G(Z,C),C)|C)\pi\in\Pi(\mathbb{P}_{(X,C)},\mathbb{P}_{(G(Z,C),C)}|\mathbb{P}_{C}), there exists (X,X~,C)(X^{*},\tilde{X}^{*},C^{*}) such that (X,X~)π(X^{*},\tilde{X}^{*})\sim\pi, X|CX|C(|C=C)X^{*}|C^{*}\sim\mathbb{P}_{X|C}(\cdot|C=C^{*}), X~|CG(Z,C)|C(|C=C)\tilde{X}^{*}|C^{*}\sim\mathbb{P}_{G(Z,C)|C}(\cdot|C=C^{*}), and CCC^{*}\sim\mathbb{P}_{C}. Since X~|CG(Z,C)|C(|C=C)\tilde{X}^{*}|C^{*}\sim\mathbb{P}_{G(Z,C)|C}(\cdot|C=C^{*}), there exists ZZ^{*} such that Z|CZ|C(|C=C)Z^{*}|C^{*}\sim\mathbb{P}_{Z|C}(\cdot|C=C^{*}) and X~=G(Z,C)\tilde{X}^{*}=G(Z^{*},C^{*}). As (X,X~,Z,C)(X^{*},\tilde{X}^{*},Z^{*},C^{*}) satisfies (X,X~)π(X^{*},\tilde{X}^{*})\sim\pi, (X,C)(X,C)(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}, X~=G(Z,C)\tilde{X}^{*}=G(Z^{*},C^{*}), and (Z,C)(Z,C)(Z^{*},C^{*})\sim\mathbb{P}_{(Z,C)}, it is shown that π𝒬~((X,C),(Z,C),G)\pi\in\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G), which concludes the proof.  
Now, we prove Theorem 3. Showing that infπ𝒬~((X,C),(Z,C),G)dp(x,x~)𝑑π(x,x~)\underset{\pi\in\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G)}{\inf}\int d^{p}(x,\tilde{x})d\pi(x,\tilde{x}) is equal to infZ|X,C𝒬dp(x,Gen(z,c))𝑑Z|X,C(z|x,c)𝑑(X,C)(x,c)\underset{\mathbb{Q}_{Z|X,C}\in\mathcal{Q}}{\inf}\int d^{p}(x,\text{Gen}(z,c))d\mathbb{Q}_{Z|X,C}(z|x,c)d\mathbb{P}_{(X,C)}(x,c) is sufficient by Lemma 6 where 𝒬\mathcal{Q} denotes the set of all Z|X,C\mathbb{Q}_{Z|X,C} satisfying (Z,C)(z,c)=(Z|X,C(z|x,c)𝑑X|C(x|c))C(c)\mathbb{P}_{(Z,C)}(z,c)=\big{(}\int\mathbb{Q}_{Z|X,C}(z|x,c)d\mathbb{P}_{X|C}(x|c)\big{)}\mathbb{P}_{C}(c).

First, we show that LHS \geq RHS. For any π𝒬~((X,C),(Z,C),G)\pi\in\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G), there exists (X,X~,Z,C)(X^{*},\tilde{X}^{*},Z^{*},C^{*}) satisfying (X,X~)π(X^{*},\tilde{X}^{*})\sim\pi, (X,C)(X,C)(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}, X~=G(Z,C)\tilde{X}^{*}=G(Z^{*},C^{*}), and (Z,C)(Z,C)(Z^{*},C^{*})\sim\mathbb{P}_{(Z,C)}. This implies that

dp(x,x~)𝑑π(x,x~)=𝔼(X,X~,Z,C)dp(x,x~)=𝔼(Z,C)𝔼(X,X~)|Z,C(dp(X,X~)|Z,C)=𝔼(Z,C)𝔼(X,G(Z,C))|Z,C(dp(X,G(Z,C))|Z,C)=𝔼(Z,C)𝔼X|Z,C(dp(X,G(Z,C))|Z,C)=𝔼(X,C)𝔼Z|X,C(dp(X,G(Z,C))|X,C)=dp(x,Gen(z,c))𝑑Z|X,C(z|x,c)𝑑(X,C)(x,c).\begin{split}\int d^{p}(x,\tilde{x})d\pi(x,\tilde{x})&=\mathbb{E}_{(X^{*},\tilde{X}^{*},Z^{*},C^{*})}d^{p}(x^{*},\tilde{x}^{*})\\ &=\mathbb{E}_{(Z^{*},C^{*})}\mathbb{E}_{(X^{*},\tilde{X}^{*})|Z^{*},C^{*}}\big{(}d^{p}(X^{*},\tilde{X}^{*})|Z^{*},C^{*}\big{)}\\ &=\mathbb{E}_{(Z^{*},C^{*})}\mathbb{E}_{(X^{*},G(Z^{*},C^{*}))|Z^{*},C^{*}}\big{(}d^{p}(X^{*},G(Z^{*},C^{*}))|Z^{*},C^{*}\big{)}\\ &=\mathbb{E}_{(Z^{*},C^{*})}\mathbb{E}_{X^{*}|Z^{*},C^{*}}\big{(}d^{p}(X^{*},G(Z^{*},C^{*}))|Z^{*},C^{*}\big{)}\\ &=\mathbb{E}_{(X^{*},C^{*})}\mathbb{E}_{Z^{*}|X^{*},C^{*}}\big{(}d^{p}(X^{*},G(Z^{*},C^{*}))|X^{*},C^{*}\big{)}\\ &=\int d^{p}(x^{*},\text{Gen}(z^{*},c^{*}))d\mathbb{Q}_{Z^{*}|X^{*},C^{*}}(z^{*}|x^{*},c^{*})d\mathbb{P}_{(X^{*},C^{*})}(x^{*},c^{*}).\end{split}

We denote the distribution of Z|X,CZ^{*}|X^{*},C^{*} by Z|X,C\mathbb{Q}_{Z^{*}|X^{*},C^{*}}. Since Z|CZ|C(|C=C)Z^{*}|C^{*}\sim\mathbb{P}_{Z|C}(\cdot|C=C^{*}) and X|CX|C(|C=C)X^{*}|C^{*}\sim\mathbb{P}_{X|C}(\cdot|C=C^{*}), Z|C(z|c)=Z|X,C(z|x,c)𝑑X|C(x|c)\mathbb{P}_{Z|C}(z^{*}|c^{*})=\int\mathbb{Q}_{Z^{*}|X^{*},C^{*}}(z^{*}|x^{*},c^{*})d\mathbb{P}_{X|C}(x^{*}|c^{*}) for all x𝒳x^{*}\in\mathcal{X}, z𝒵z^{*}\in\mathcal{Z}, and c𝒞c^{*}\in\mathcal{C}. Thus, Z|X,C𝒬\mathbb{Q}_{Z^{*}|X^{*},C^{*}}\in\mathcal{Q}. This and (X,C)(X,C)(X^{*},C^{*})\sim\mathbb{P}_{(X,C)} imply dp(x,x~)𝑑π(x,x~)infZ|X,C𝒬dp(x,Gen(z,c))𝑑Z|X,C(z|x,c)𝑑(X,C)(x,c)\int d^{p}(x,\tilde{x})d\pi(x,\tilde{x})\geq\underset{\mathbb{Q}_{Z|X,C}\in\mathcal{Q}}{\inf}\int d^{p}(x,\text{Gen}(z,c))d\mathbb{Q}_{Z|X,C}(z|x,c)d\mathbb{P}_{(X,C)}(x,c). It concludes LHS \geq RHS.

Next, we show LHS \leq RHS. For any Z|X,C𝒬\mathbb{Q}_{Z|X,C}\in\mathcal{Q}, there exists (X,Z,C)(X^{*},Z^{*},C^{*}) satisfying Z|X,CZ|X,C(|X=X,C=C)Z^{*}|X^{*},C^{*}\sim\mathbb{Q}_{Z|X,C}(\cdot|X=X^{*},C=C^{*}), (X,C)(X,C)(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}, and (Z,C)(Z,C)(Z^{*},C^{*})\sim\mathbb{P}_{(Z,C)}. We denote X~:=G(Z,C)\tilde{X}^{*}:=G(Z^{*},C^{*}) and the distribution of (X,X~)(X^{*},\tilde{X}^{*}) by π\pi^{*}. Then,

dp(x,Gen(z,c))𝑑Z|X,C(z|x,c)𝑑(X,C)(x,c)=𝔼(X,C)𝔼Z|X,C(dp(X,G(Z,C))|X,C)=𝔼(Z,C)𝔼X|Z,C(dp(X,G(Z,C))|Z,C)=𝔼(Z,C)𝔼(X,X~)|Z,C(dp(X,X~)|Z,C)=𝔼(X,X~,Z,C)dp(X,X~)=dp(x,x~)𝑑π(x,x~).\begin{split}\int d^{p}(x^{*},\text{Gen}(z^{*},c^{*}))d\mathbb{Q}_{Z|X,C}(z^{*}|x^{*},c^{*})d\mathbb{P}_{(X,C)}(x^{*},c^{*})&=\mathbb{E}_{(X^{*},C^{*})}\mathbb{E}_{Z^{*}|X^{*},C^{*}}\big{(}d^{p}(X^{*},G(Z^{*},C^{*}))|X^{*},C^{*}\big{)}\\ &=\mathbb{E}_{(Z^{*},C^{*})}\mathbb{E}_{X^{*}|Z^{*},C^{*}}\big{(}d^{p}(X^{*},G(Z^{*},C^{*}))|Z^{*},C^{*}\big{)}\\ &=\mathbb{E}_{(Z^{*},C^{*})}\mathbb{E}_{(X^{*},\tilde{X}^{*})|Z^{*},C^{*}}\big{(}d^{p}(X^{*},\tilde{X}^{*})|Z^{*},C^{*}\big{)}\\ &=\mathbb{E}_{(X^{*},\tilde{X}^{*},Z^{*},C^{*})}d^{p}(X^{*},\tilde{X}^{*})\\ &=\int d^{p}(x^{*},\tilde{x}^{*})d\pi^{*}(x,\tilde{x}^{*}).\end{split}

Here, π𝒬~((X,C),(Z,C),G)\pi^{*}\in\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G) because (X,X~,Z,C)(X^{*},\tilde{X}^{*},Z^{*},C^{*}) satisfies (X,X~)π(X^{*},\tilde{X}^{*})\sim\pi^{*}, (X,C)(X,C)(X^{*},C^{*})\sim\mathbb{P}_{(X,C)}, X~=G(Z,C)\tilde{X}^{*}=G(Z^{*},C^{*}), and (Z,C)(Z,C)(Z^{*},C^{*})\sim\mathbb{P}_{(Z,C)}. Thus, we have infπ𝒬~((X,C),(Z,C),G)dp(x,x~)𝑑π(x,x~)dp(x,Gen(z,c))𝑑Z|X,C(z|x,c)𝑑(X,C)(x,c)\underset{\pi\in\tilde{\mathcal{Q}}(\mathbb{P}_{(X,C)},\mathbb{P}_{(Z,C)},G)}{\inf}\int d^{p}(x,\tilde{x})d\pi(x,\tilde{x})\leq\int d^{p}(x,\text{Gen}(z,c))d\mathbb{Q}_{Z|X,C}(z|x,c)d\mathbb{P}_{(X,C)}(x,c), which concludes the proof.

A.2.3 Proof of Lemma 6.

For any $\mathbb{Q}_{(X,C,X^{\prime},C^{\prime})}$ s.t. $\mathbb{Q}_{(X,C)}=\mathbb{P}_{(X,C)|C}(\cdot|c)$ and $\mathbb{Q}_{(X^{\prime},C^{\prime})}=\mathbb{P}_{(X,C)|C}(\cdot|c^{\prime})$, by the definition of the Wasserstein distance,

$$\begin{split}\int d^{p}_{\text{Enc}}((x,c),(x^{\prime},c^{\prime}))\,d\mathbb{Q}_{(X,C,X^{\prime},C^{\prime})}(x,c,x^{\prime},c^{\prime})&=\int\lVert(\text{Enc}(x,c),c)-(\text{Enc}(x^{\prime},c^{\prime}),c^{\prime})\rVert^{p}\,d\mathbb{Q}_{(X,C,X^{\prime},C^{\prime})}(x,c,x^{\prime},c^{\prime})\\ &\geq W^{p}_{p}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c^{\prime});\lVert\cdot\rVert).\end{split}$$

This implies that $W_{p}(\mathbb{P}_{(X,C)|C}(\cdot|c),\mathbb{P}_{(X,C)|C}(\cdot|c^{\prime});d_{\text{Enc}})$ is greater than or equal to $W_{p}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c^{\prime});\lVert\cdot\rVert)$. Similarly, for any $\mathbb{Q}_{(Z,C,Z^{\prime},C^{\prime})}$ s.t. $\mathbb{Q}_{(Z,C)}=\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c)$ and $\mathbb{Q}_{(Z^{\prime},C^{\prime})}=\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c^{\prime})$,

$$\begin{split}\int\lVert(z,c)-(z^{\prime},c^{\prime})\rVert^{p}\,d\mathbb{Q}_{(Z,C,Z^{\prime},C^{\prime})}(z,c,z^{\prime},c^{\prime})&=\int d^{p}_{\text{Enc}}((\text{Gen}(z,c),c),(\text{Gen}(z^{\prime},c^{\prime}),c^{\prime}))\,d\mathbb{Q}_{(Z,C,Z^{\prime},C^{\prime})}(z,c,z^{\prime},c^{\prime})\\ &\geq W^{p}_{p}(\mathbb{P}_{(X,C)|C}(\cdot|c),\mathbb{P}_{(X,C)|C}(\cdot|c^{\prime});d_{\text{Enc}})\end{split}$$

holds by $(A1)$ and the definition of the Wasserstein distance, which implies $W_{p}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c^{\prime});\lVert\cdot\rVert)\geq W_{p}(\mathbb{P}_{(X,C)|C}(\cdot|c),\mathbb{P}_{(X,C)|C}(\cdot|c^{\prime});d_{\text{Enc}})$.

A.2.4 Proof of Theorem 7.

By Lemma 6, it is sufficient to show that the distribution of ((1t)Enc(X0,c0)+tEnc(T(X0,c0,c1),c1),(1t)c0+tc1)((1-t)\text{Enc}(X_{0},c_{0})+t\text{Enc}(T(X_{0},c_{0},c_{1}),c_{1}),(1-t)c_{0}+tc_{1}), denoted by w(t)w(t), is the constant-speed Wasserstein geodesic from Enc(X,C)|C(|c0)\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{0}) to Enc(X,C)|C(|c1)\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1}) in Wp(,;)W_{p}(\cdot,\cdot;\lVert\cdot\rVert). First, by (A2)(A2), w(0):=(Enc(X0,c0),c0)=(Enc(X,c0),c0)(|c0)=(Enc(X,C),C)|C(|c0)w(0):=\mathbb{P}_{(\text{Enc}(X_{0},c_{0}),c_{0})}=\mathbb{P}_{(\text{Enc}(X,c_{0}),c_{0})}(\cdot|c_{0})=\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{0}). By (A3)(A3), (T(X,C,c1),c1)|C(|c0)=(X,C)|C(|c1)\mathbb{P}_{(T(X,C,c_{1}),c_{1})|C}(\cdot|c_{0})=\mathbb{P}_{(X,C)|C}(\cdot|c_{1}), which implies (Enc(T(X,C,c1),c1),c1)|C(|c0)=(Enc(X,c1),C)|C(|c1)=(Enc(X,C),C)|C(|c1)\mathbb{P}_{(\text{Enc}(T(X,C,c_{1}),c_{1}),c_{1})|C}(\cdot|c_{0})=\mathbb{P}_{(\text{Enc}(X,c_{1}),C)|C}(\cdot|c_{1})=\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{1}). This and (A2)(A2) imply that w(1):=(Enc(T(X0,c0,c1),c1),c1)=(Enc(T(X,C,c1),c1),c1)|C(|c0)=(Enc(X,C),C)|C(|c1)w(1):=\mathbb{P}_{(\text{Enc}(T(X_{0},c_{0},c_{1}),c_{1}),c_{1})}=\mathbb{P}_{(\text{Enc}(T(X,C,c_{1}),c_{1}),c_{1})|C}(\cdot|c_{0})=\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{1}). Next, for any ttt\leq t^{\prime}, by the definition of the Wasserstein distance, (A3)(A3), and Lemma 6, we have

Wp(w(t),w(t);)(((1t)Enc(x0,c0)+tEnc(T(x0,c0,c1),c1),c~t)((1t)Enc(x0,c0)+tEnc(T(x0,c0,c1),c1),c~t)pdX|C(x0|c0))1/p=|tt|((Enc(x0,c0),c0)(Enc(T(x0,c0,c1),c1),c1)p𝑑X|C(x0|c0))1/p=|tt|(dEncp((x0,c0),(T(x0,c0,c1),c1))𝑑X|C(x0|c0))1/p=|tt|Wp((X,C)|C(|c0),(X,C)|C(|c1);dEnc)=|tt|Wp((Enc(X,C),C)|C(|c0),(Enc(X,C),C)|C(|c1);).\begin{aligned} &W_{p}(w(t),w(t^{\prime});\lVert\cdot\rVert)&\leq&\Big{(}\int\lVert((1-t)\text{Enc}(x_{0},c_{0})+t\text{Enc}(T(x_{0},c_{0},c_{1}),c_{1}),\tilde{c}_{t})-\\ &&&((1-t^{\prime})\text{Enc}(x_{0},c_{0})+t^{\prime}\text{Enc}(T(x_{0},c_{0},c_{1}),c_{1}),\tilde{c}_{t^{\prime}})\rVert^{p}d\mathbb{P}_{X|C}(x_{0}|c_{0})\Big{)}^{1/p}\\ &&=&|t-t^{\prime}|\Big{(}\int\|(\text{Enc}(x_{0},c_{0}),c_{0})-(\text{Enc}(T(x_{0},c_{0},c_{1}),c_{1}),c_{1})\|^{p}d\mathbb{P}_{X|C}(x_{0}|c_{0})\Big{)}^{1/p}\\ &&=&|t-t^{\prime}|\Big{(}\int d^{p}_{\text{Enc}}((x_{0},c_{0}),(T(x_{0},c_{0},c_{1}),c_{1}))d\mathbb{P}_{X|C}(x_{0}|c_{0})\Big{)}^{1/p}\\ &&=&|t-t^{\prime}|W_{p}(\mathbb{P}_{(X,C)|C}(\cdot|c_{0}),\mathbb{P}_{(X,C)|C}(\cdot|c_{1});d_{\text{Enc}})\\ &&=&|t-t^{\prime}|W_{p}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{0}),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{1});\lVert\cdot\rVert).&\end{aligned}

Here, the last term is |tt|Wp(w(0),w(1);)|t-t^{\prime}|W_{p}(w(0),w(1);\lVert\cdot\rVert) and it implies

Wp(w(0),w(1);)Wp(w(0),w(t);+Wp(w(t),w(t);+Wp(w(t),w(1);)(t+(tt)+(1t))Wp(w(0),w(1);)=Wp(w(0),w(1);).\begin{split}W_{p}(w(0),w(1);\lVert\cdot\rVert)&\leq W_{p}(w(0),w(t);\lVert\cdot\rVert+W_{p}(w(t),w(t^{\prime});\lVert\cdot\rVert+W_{p}(w(t^{\prime}),w(1);\lVert\cdot\rVert)\\ &\leq(t+(t^{\prime}-t)+(1-t^{\prime}))W_{p}(w(0),w(1);\lVert\cdot\rVert)\\ &=W_{p}(w(0),w(1);\lVert\cdot\rVert).\end{split}

Thus, equalities hold and it concludes the proof.

A.2.5 Proof of Lemma 9.

By Lemma 6, showing that infc𝒞𝒫(𝒵×{c})m=1MαmW22(,(Enc(X,C),C)|C(|cm);)\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{Z}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m});\lVert\cdot\rVert) is equal to inf𝒫(𝒵)m=1MαmW22(,Enc(X,C)|C(|cm);)+m=1Mαmc¯cm2\underset{\mathbb{P}\in\mathcal{P}(\mathcal{Z})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m});\lVert\cdot\rVert)+\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2} is sufficient. By Proposition 4.2 in Agueh and Carlier (2011), for any measures (m)m=1M(\mathbb{P}_{m})_{m=1}^{M} vanishing on small sets,444A measure defined on DD-dimensional spaces is said to vanish on small sets if it vanishes on (D1)(D-1)-rectifiable sets (Gangbo and Święch, 1998; Agueh and Carlier, 2011).

infm=1MαmW22(,m;)=infπΠ(1,,M)m=1Mαmxmm=1Mαmxm2dπ(x1,,xM)\underset{\mathbb{P}}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{m};\lVert\cdot\rVert)=\underset{\pi\in\Pi(\mathbb{P}_{1},\dots,\mathbb{P}_{M})}{\inf}\int\sum_{m=1}^{M}\alpha_{m}\lVert x_{m}-\sum_{m=1}^{M}\alpha_{m}x_{m}\rVert^{2}d\pi(x_{1},\dots,x_{M}) (9)

and X=m=1MαmXm\mathbb{P}_{X^{*}}=\mathbb{P}_{\sum_{m=1}^{M}\alpha_{m}X^{*}_{m}} hold where Π(1,,M)\Pi(\mathbb{P}_{1},\dots,\mathbb{P}_{M}) is the set of joint distributions whose marginals are (m)m=1M(\mathbb{P}_{m})_{m=1}^{M}, X\mathbb{P}_{X^{*}} is the unique solution of the LHS, and (X1,,XM)\mathbb{P}_{(X^{*}_{1},\dots,X^{*}_{M})} is the unique solution of the RHS. By Equation (9),

infc𝒞𝒫(𝒵×{c})m=1MαmW22(,(Enc(X,C),C)|C(|cm);)=inf(Z1,,ZM)Π(Enc(X,C)|C(|c1),,Enc(X,C)|C(|cM))m=1Mαm(zm,cm)m=1Mαm(zm,cm)2d(Z1,,ZM)(z1,,zM)=inf(Z1,,ZM)Π(Enc(X,C)|C(|c1),,Enc(X,C)|C(|cM))m=1Mαmzmm=1Mαmzm2d(Z1,,ZM)(z1,,zM)+m=1Mαmc¯cm2=inf𝒫(𝒵)m=1MαmW22(,Enc(X,C)|C(|cm);)+m=1Mαmc¯cm2.\begin{split}&\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{Z}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m});\lVert\cdot\rVert)\\ &=\underset{\mathbb{Q}_{(Z_{1},\dots,Z_{M})}\in\Pi(\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1}),\dots,\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{M}))}{\inf}\int\sum_{m=1}^{M}\alpha_{m}\lVert(z_{m},c_{m})-\sum_{m=1}^{M}\alpha_{m}(z_{m},c_{m})\rVert^{2}d\mathbb{Q}_{(Z_{1},\dots,Z_{M})}(z_{1},\dots,z_{M})\\ &=\underset{\mathbb{Q}_{(Z_{1},\dots,Z_{M})}\in\Pi(\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1}),\dots,\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{M}))}{\inf}\int\sum_{m=1}^{M}\alpha_{m}\lVert z_{m}-\sum_{m=1}^{M}\alpha_{m}z_{m}\rVert^{2}d\mathbb{Q}_{(Z_{1},\dots,Z_{M})}(z_{1},\dots,z_{M})\\ &\quad+\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}\\ &=\underset{\mathbb{P}\in\mathcal{P}(\mathcal{Z})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m});\lVert\cdot\rVert)+\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}.\end{split}

A.2.6 Proof of Theorem 10.

We first derive a lemma for Theorem 10.

Lemma 17

Suppose the encoder, generator, and transport map satisfy conditions (A1)(A1) through (A3)(A3). Then,

(1/2)m=1Mm=1MαmαmW22((X,C)|C(|cm),(X,C)|C(|cm);dEnc)infc𝒞𝒫(𝒳×{c})m=1MαmW22(,(X,C)|C(|cm);dEnc)m=1MαmW22((X~(α1,,αM),c¯),(X,C)|C(|cm);dEnc)(1/2)m=1Mm=1MαmαmdEnc2((T(x1,c1,cm),cm),(T(x1,c1,cm),cm))𝑑X|C(x1|c1)\begin{split}&(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}W^{2}_{2}(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}),\mathbb{P}_{(X,C)|C}(\cdot|c_{m^{\prime}});d_{\text{Enc}})\\ &\leq\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{X}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})\\ &\leq\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P}_{(\tilde{X}(\alpha_{1},\dots,\alpha_{M}),\bar{c})},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})\\ &\leq(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}\int d^{2}_{\text{Enc}}((T(x_{1},c_{1},c_{m}),c_{m}),(T(x_{1},c_{1},c_{m^{\prime}}),c_{m^{\prime}}))d\mathbb{P}_{X|C}(x_{1}|c_{1})\end{split}

(10)

holds. When we further suppose the condition (A4)(A4), (X~(α1,,αM),c¯)\mathbb{P}_{(\tilde{X}(\alpha_{1},\dots,\alpha_{M}),\bar{c})} is the Wasserstein barycenter of ((X,C)|C(|cm))m=1M(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}))_{m=1}^{M} w.r.t. weights (αm)m=1M(\alpha_{m})_{m=1}^{M} and all the terms in Equation (4) are equal to m=1Mαmc¯cm2\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}.

Proof  First, we derive Equation (10). The second inequality is trivial by the definition of the infimum. For the first inequality, by Lemma 6, the definition of the Wasserstein distance, and Equation (9),

(1/2)m=1Mm=1MαmαmW22((Enc(X,C),C)|C(|cm),(Enc(X,C),C)|C(|cm);)(1/2)m=1Mm=1Mαmαm(zm,cm)(zm,cm)2𝑑(Zm,Zm)(zm,cm,zm,cm)\begin{split}&(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}W^{2}_{2}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m}),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m^{\prime}});\lVert\cdot\rVert)\\ &\leq(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}\int\lVert(z_{m},c_{m})-(z_{m^{\prime}},c_{m^{\prime}})\rVert^{2}d\mathbb{Q}^{*}_{(Z_{m},Z_{m^{\prime}})}(z_{m},c_{m},z_{m^{\prime}},c_{m^{\prime}})\end{split}

where (Z1,,ZM)\mathbb{Q}^{*}_{(Z_{1},\dots,Z_{M})} is the unique solution of infπΠ(Enc(X,C)|C(|c1),,Enc(X,C)|C(|cM))m=1Mαmzmm=1Mαmzm2π(z1,,zM)\underset{\pi\in\Pi(\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1}),\dots,\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{M}))}{\inf}\int\sum_{m=1}^{M}\alpha_{m}\lVert z_{m}-\sum_{m=1}^{M}\alpha_{m}z_{m}\rVert^{2}\pi(z_{1},\dots,z_{M}). Since

(1/2)m=1Mm=1Mαmαmamam2=m=1Mαmamm=1Mαmam2(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}\lVert a_{m}-a_{m^{\prime}}\rVert^{2}=\sum_{m=1}^{M}\alpha_{m}\lVert a_{m}-\sum_{m=1}^{M}\alpha_{m}a_{m}\rVert^{2} (11)

holds for any sequence (am)m=1M(a_{m})_{m=1}^{M}, we have that (1/2)m=1Mm=1Mαmαm(zm,cm)(zm,cm)2𝑑(Zm,Zm)(zm,cm,zm,cm)(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}\int\lVert(z_{m},c_{m})-(z_{m^{\prime}},c_{m^{\prime}})\rVert^{2}d\mathbb{Q}^{*}_{(Z_{m},Z_{m^{\prime}})}(z_{m},c_{m},z_{m^{\prime}},c_{m^{\prime}}) is equal to m=1Mαmzmm=1Mαmzm2𝑑(Z1,,ZM)(z1,,zM)+m=1Mαmc¯cm2\sum_{m=1}^{M}\alpha_{m}\int\lVert z_{m}-\sum_{m=1}^{M}\alpha_{m}z_{m}\rVert^{2}d\mathbb{Q}^{*}_{(Z_{1},\dots,Z_{M})}(z_{1},\dots,z_{M})+\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}. The first term in the RHS equals to infπΠ(Enc(X,C)|C(|c1),,Enc(X,C)|C(|cM))m=1Mαmzmm=1Mαmzm2π(z1,,zM)\underset{\pi\in\Pi(\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1}),\dots,\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{M}))}{\inf}\int\sum_{m=1}^{M}\alpha_{m}\lVert z_{m}-\sum_{m=1}^{M}\alpha_{m}z_{m}\rVert^{2}\pi(z_{1},\dots,z_{M}). This and Lemmas 6 and 9 imply

(1/2)m=1Mm=1MαmαmW22((Enc(X,C),C)|C(|cm),(Enc(X,C),C)|C(|cm);)infc𝒞𝒫(𝒵×{c})m=1MαmW22(,(Enc(X,C),C)|C(|cm);)m=1MαmW22((m=1MαmEnc(T(X1,c1,cm),cm),c¯),c¯),(Enc(X,C),C)|C(|cm);),\begin{split}&(1/2)\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}W^{2}_{2}(\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m}),\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m^{\prime}});\lVert\cdot\rVert)\\ &\leq\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{Z}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m});\lVert\cdot\rVert)\\ &\leq\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P}_{(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}),\bar{c}),\bar{c})},\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m});\lVert\cdot\rVert),\end{split}

which concludes the first inequality by Lemma 6. For the last inequality, by the definition of the Wasserstein distance, W22((m=1MαmEnc(T(X1,c1,cm),cm),c¯),c¯),(Enc(X,C),C)|C(|cm);)(m=1MαmEnc(T(X1,c1,cm),cm),c¯)(T(x1,c1,cm),cm)2dX|C(x1|c1)W^{2}_{2}(\mathbb{P}_{(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}),\bar{c}),\bar{c})},\mathbb{P}_{(\text{Enc}(X,C),C)|C}(\cdot|c_{m});\lVert\cdot\rVert)\leq\int\lVert(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}),\bar{c})-(T(x_{1},c_{1},c_{m^{\prime}}),c_{m^{\prime}})\rVert^{2}d\mathbb{P}_{X|C}(x_{1}|c_{1}). This, Equation (11), and Lemma 6 conclude the proof for the last inequality.

Next, we derive that the distribution of the latent interpolation result is the Wasserstein barycenter when (A4)(A4) holds. Since the Wasserstein variance of (Enc(X,C)|C(|cm))m=1M(\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m}))_{m=1}^{M} is zero, by Lemmas 6 and 9, it is sufficient to show that

m=1MαmW22(Gen(m=1MαmEnc(T(X1,c1,cm),cm),c¯),(X,C)|C(|cm);dEnc)=m=1Mαmc¯cm2.\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P}_{\text{Gen}(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}),\bar{c})},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})=\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}. (12)

By (A4)(A4), the optimal transportation cost from Enc(X,C)|C(|c1)\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{1}) to Enc(X,C)|C(|cm)\mathbb{P}_{\text{Enc}(X,C)|C}(\cdot|c_{m}) is zero, which implies that Enc(X1,c1)=Enc(T(X1,c1,cm),cm)\text{Enc}(X_{1},c_{1})=\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}) almost surely for all mm. Thus, Gen(m=1MαmEnc(T(X1,c1,cm),cm))=Gen(Enc(X1,c1),cm)=Gen(Enc(X,C),cm)|C(|c1)=Gen(Enc(X,C),cm)|C(|cm)=X|C(|cm)\mathbb{P}_{\text{Gen}(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}))}=\mathbb{P}_{\text{Gen}(\text{Enc}(X_{1},c_{1}),c_{m})}=\mathbb{P}_{\text{Gen}(\text{Enc}(X,C),c_{m})|C}(\cdot|c_{1})=\mathbb{P}_{\text{Gen}(\text{Enc}(X,C),c_{m})|C}(\cdot|c_{m})=\mathbb{P}_{X|C}(\cdot|c_{m}) for all mm. This implies that the LHS in Equation (12) equals to m=1MαmW22((X,c¯)|C(|cm),(X,cm)|C(|cm);dEnc)=m=1Mαmc¯cm2\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P}_{(X,\bar{c})|C}(\cdot|c_{m}),\mathbb{P}_{(X,c_{m})|C}(\cdot|c_{m});d_{\text{Enc}})=\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2}, which concludes the proof. For general pp-Wasserstein distances, m=1MαmWp2(Gen(m=1MαmEnc(T(X1,c1,cm),cm),c¯),(X,C)|C(|cm);dEnc)=m=1MαmWp2((X,c¯)|C(|cm),(X,cm)|C(|cm);dEnc)=m=1Mαmc¯cm2\sum_{m=1}^{M}\alpha_{m}W^{2}_{p}(\mathbb{P}_{\text{Gen}(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}),\bar{c})},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})=\sum_{m=1}^{M}\alpha_{m}W^{2}_{p}(\mathbb{P}_{(X,\bar{c})|C}(\cdot|c_{m}),\mathbb{P}_{(X,c_{m})|C}(\cdot|c_{m});d_{\text{Enc}})=\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2} holds.

Last, we derive that all the terms in Equation (10) are equal to m=1Mαmc¯cm2\sum_{m=1}^{M}\alpha_{m}\lVert\bar{c}-c_{m}\rVert^{2} when (A4)(A4) holds. Since Gen(m=1MαmEnc(T(X1,c1,cm),cm))=X|C(|cm)\mathbb{P}_{\text{Gen}(\sum_{m=1}^{M}\alpha_{m}\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}))}=\mathbb{P}_{X|C}(\cdot|c_{m}) for all mm, derived in the above paragraph, implies that (X,C)|C(|cm)=(X,cm)|C(|cm)\mathbb{P}_{(X,C)|C}(\cdot|c_{m^{\prime}})=\mathbb{P}_{(X,c_{m^{\prime}})|C}(\cdot|c_{m}), the lower bound becomes (1/2)m=1MmMαmαmcmcm2(1/2)\sum_{m=1}^{M}\sum_{m}^{M}\alpha_{m}\alpha_{m^{\prime}}\lVert c_{m}-c_{m^{\prime}}\rVert^{2}. Similarly, the Enc(X1,c1)=Enc(T(X1,c1,cm),cm)\text{Enc}(X_{1},c_{1})=\text{Enc}(T(X_{1},c_{1},c_{m}),c_{m}) for all mm implies that the upper bound becomes (1/2)m=1MmMαmαmcmcm2(1/2)\sum_{m=1}^{M}\sum_{m}^{M}\alpha_{m}\alpha_{m^{\prime}}\lVert c_{m}-c_{m^{\prime}}\rVert^{2}. Now, Equation (11) concludes the proof.  
Now, we show Theorem 10. By Lemma 17, we have

m=1MαmW22((X~(α1,,αM),c¯),(X,C)|C(|cm);dEnc)infc𝒞𝒫(𝒳×{c})m=1MαmW22(,(X,C)|C(|cm);dEnc)12m=1Mm=1Mαmαm(dEnc2((T(x1,c1,cm),cm),(T(x1,c1,cm),cm))dX|C(x1|c1)W22((X,C)|C(|cm),(X,C)|C(|cm);dEnc)).\begin{split}&\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P}_{(\tilde{X}(\alpha_{1},\dots,\alpha_{M}),\bar{c})},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})-\underset{\mathbb{P}\in\cup_{c\in\mathcal{C}}\mathcal{P}(\mathcal{X}\times\{c\})}{\inf}\sum_{m=1}^{M}\alpha_{m}W^{2}_{2}(\mathbb{P},\mathbb{P}_{(X,C)|C}(\cdot|c_{m});d_{\text{Enc}})\\ &\leq\frac{1}{2}\sum_{m=1}^{M}\sum_{m^{\prime}=1}^{M}\alpha_{m}\alpha_{m^{\prime}}\big{(}\int d^{2}_{\text{Enc}}((T(x_{1},c_{1},c_{m}),c_{m}),(T(x_{1},c_{1},c_{m^{\prime}}),c_{m^{\prime}}))d\mathbb{P}_{X|C}(x_{1}|c_{1})\\ &\qquad\qquad\qquad\qquad\qquad-W^{2}_{2}(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}),\mathbb{P}_{(X,C)|C}(\cdot|c_{m^{\prime}});d_{\text{Enc}})\big{)}.\end{split}

By (A3), dEnc2((T(x1,c1,cm),cm),(T(T(x1,c1,cm),cm,cm))dX|C(x1|c1)\int d^{2}_{\text{Enc}}((T(x_{1},c_{1},c_{m}),c_{m}),(T(T(x_{1},c_{1},c_{m}),c_{m},c_{m^{\prime}}))d\mathbb{P}_{X|C}(x_{1}|c_{1}) is equal to W22((X,C)|C(|cm),(X,C)|C(|cm);dEnc)W^{2}_{2}(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}),\mathbb{P}_{(X,C)|C}(\cdot|c_{m^{\prime}});d_{\text{Enc}}), which concludes the proof of Equation (4). The upper bound is zero and the distribution of latent interpolation result is the Wasserstein barycenter by Lemma 17.

A.2.7 Proof of Theorem 11.

By $(A5)$, the data generation structure can be expressed as $X=\text{Gen}(Z,C)$ where $Z:=\text{Enc}(X,C)$ is independent of $C$. The Wasserstein barycenter of $(\mathbb{P}_{(X,C)|C}(\cdot|c_{m}))_{m=1}^{M}=(\mathbb{P}_{(\text{Gen}(Z,C),C)|C}(\cdot|c_{m}))_{m=1}^{M}$ w.r.t. $W_{2}(\cdot,\cdot;d_{\text{Enc}})$ is the same as that of $(\mathbb{P}_{(Z,c_{m})})_{m=1}^{M}$ w.r.t. $W_{2}(\cdot,\cdot;\lVert\cdot\rVert)$, which implies that $\mathbb{P}_{X|C}(\cdot|\bar{c})$ is the Wasserstein barycenter. Now, Theorem 10 concludes the proof.

A.2.8 Proof of Theorem 12.

We denote the first, second, and third terms in objective (4) as follows.

  • $L_{\text{TransportCost}}(T):=\int d_{\text{Enc}}^{p}((x_{0},c_{0}),(T(x_{0},c_{0},c_{1}),c_{1}))\,d\mathbb{P}_{X|C}(x_{0}|c_{0})\,d\mathbb{P}_{C}(c_{0})\,d\mathbb{P}_{C}(c_{1})$

  • $L_{\text{MatchData}}(T):=W_{p}(\mathbb{P}_{(X,C)}\mathbb{P}_{C},\mathbb{P}_{(T(X_{0},C_{0},C_{1}),C_{1},C_{0})};\|\cdot\|)$

  • $L_{\text{Cycle}}(T):=\int\|x_{0}-T(T(x_{0},c_{0},c_{1}),c_{1},c_{0})\|\,d\mathbb{P}_{X|C}(x_{0}|c_{0})\,d\mathbb{P}_{C}(c_{0})\,d\mathbb{P}_{C}(c_{1})$

By the definition of the optimal transport map, $L_{\text{MatchData}}(T^{\dagger})=L_{\text{Cycle}}(T^{\dagger})=0$ and $L_{\text{TransportCost}}(T^{\dagger})\leq L_{\text{TransportCost}}(T)$ for all $T:\mathcal{X}\times\mathcal{C}^{2}\to\mathcal{X}$. Thus, $T^{\dagger}$ is a minimizer. Let $T$ be a minimizer of objective (4). Then, $L_{\text{MatchData}}(T)=L_{\text{Cycle}}(T)=0$ and $L_{\text{TransportCost}}(T)=L_{\text{TransportCost}}(T^{\dagger})$. By the definition of the optimal transport map, this implies $T(x_{0},c_{0},c_{1})=T_{c_{0}\to c_{1}}^{\dagger}(x_{0})$ with probability $1$ w.r.t. $\mathbb{P}_{X|C}(x_{0}|c_{0})$ for all $c_{0},c_{1}\in\mathcal{C}$. Thus, $T=T^{\dagger}$, which concludes that $T^{\dagger}$ is the unique minimizer.

Appendix B Details on Experiments

B.1 Implementation Details

Table 1: Architectures of encoder and generator networks.
Encoder Generator
Conv. with kernel 11x11, filter size 128, stride 1, padding 5 ConvTran. with kernel 4x4, filter size 1024, stride 1, padding 0
BatchNormalization BatchNormalization
LeakyReLU with slope 0.2 LeakyReLU with slope 0.2
Conv. with kernel 6x6, filter size 128, stride 2, padding 2 ConvTran. with kernel 4x4, filter size 1024, stride 2, padding 1
BatchNormalization BatchNormalization
LeakyReLU with slope 0.2 LeakyReLU with slope 0.2
Conv. with kernel 4x4, filter size 256, stride 2, padding 1 ConvTran. with kernel 4x4, filter size 512, stride 2, padding 1
BatchNormalization BatchNormalization
LeakyReLU with slope 0.2 LeakyReLU with slope 0.2
Conv. with kernel 4x4, filter size 512, stride 2, padding 1 ConvTran. with kernel 4x4, filter size 256, stride 2, padding 1
BatchNormalization BatchNormalization
LeakyReLU with slope 0.2 LeakyReLU with slope 0.2
Conv. with kernel 4x4, filter size 1024, stride 2, padding 1 ConvTran. with kernel 4x4, filter size 128, stride 2, padding 1
BatchNormalization BatchNormalization
LeakyReLU with slope 0.2 LeakyReLU with slope 0.2
Conv. with kernel 4x4, filter size 1024, stride 2, padding 1 ConvTran. with kernel 6x6, filter size 128, stride 2, padding 2
BatchNormalization BatchNormalization
LeakyReLU with slope 0.2 LeakyReLU with slope 0.2
Conv. with kernel 4x4, filter size 128, stride 1, padding 0 ConvTran. with kernel 11x11, filter size 1, stride 1, padding 5
Sigmoid
Table 2: The architecture of the transport map network.
Transport map Residual block
Conv. with kernel 11x11, filter size 64, stride 1, padding 5 Conv. with kernel 3x3, filter size 256, stride 1, padding 1
BatchNormalization BatchNormalization
ReLU ReLU
Conv. with kernel 4x4, filter size 128, stride 2, padding 1 Conv. with kernel 3x3, filter size 256, stride 1, padding 1
BatchNormalization BatchNormalization
ReLU
Conv. with kernel 4x4, filter size 256, stride 2, padding 1
BatchNormalization
ReLU
Residual block 1
Residual block 2
Residual block 3
Residual block 4
Residual block 5
Residual block 6
ConvTrans. with kernel 4x4, filter size 128, stride 2, padding 1
BatchNormalization
ReLU
ConvTrans. with kernel 4x4, filter size 64, stride 2, padding 1
BatchNormalization
ReLU
ConvTrans. with kernel 11x11, filter size 1, stride 1, padding 5
Sigmoid


Figure 4: Conditional generation results by the proposed method (left) and cAAE (right). The proposed method produces face images with clearer eyes, noses, and mouths than baselines. For each method, the leftmost and rightmost columns show generation results for observed domains and intermediate columns show results for unobserved intermediate domains.

Table 3: Architectures of discriminator for generator, discriminator for transport map, and auxiliary regressor.
Discriminator for generator Discriminator for transport map Auxiliary regressor
Linear with filter size 512 Conv. with kernel 4x4, filter size 64, stride 2, padding 1 Conv. with kernel 4x4, filter size 64, stride 2, padding 1
ReLU LeakyReLU with slope 0.01 LeakyReLU with slope 0.01
Linear with filter size 512 Conv. with kernel 4x4, filter size 128, stride 2, padding 1 Conv. with kernel 4x4, filter size 128, stride 2, padding 1
ReLU LeakyReLU with slope 0.01 LeakyReLU with slope 0.01
Linear with filter size 512 Conv. with kernel 4x4, filter size 256, stride 2, padding 1 Conv. with kernel 4x4, filter size 256, stride 2, padding 1
ReLU LeakyReLU with slope 0.01 LeakyReLU with slope 0.01
Linear with filter size 512 Conv. with kernel 4x4, filter size 512, stride 2, padding 1 Conv. with kernel 4x4, filter size 512, stride 2, padding 1
ReLU LeakyReLU with slope 0.01 LeakyReLU with slope 0.01
Linear with filter size 1 Conv. with kernel 4x4, filter size 1024, stride 2, padding 1 Conv. with kernel 4x4, filter size 1024, stride 2, padding 1
Sigmoid LeakyReLU with slope 0.01 LeakyReLU with slope 0.01
Conv. with kernel 4x4, filter size 2048, stride 2, padding 1 Conv. with kernel 4x4, filter size 2048, stride 2, padding 1
LeakyReLU with slope 0.01 LeakyReLU with slope 0.01
Conv. with kernel 3x3, filter size 1, stride 1, padding 1 Conv. with kernel 3x3, filter size 2, stride 1, padding 1
Average pooling with kernel 2x2, stride 2 Average pooling with kernel 2x2, stride 2

Our method consists of three main networks, the encoder, generator, and transport map, and three auxiliary networks, the discriminator for the generator, the discriminator for the transport map, and the auxiliary regressor. The architectures of the encoder and generator networks are adopted from DCGAN (Radford et al., 2015), and the architecture of the transport map is adopted from StarGAN (Choi et al., 2018). The architectures are modified to concatenate light conditions to latent variables. Table 1 shows the architectures of the encoder and generator networks. Conv and ConvTran denote convolutional layers and convolutional transpose layers, respectively. Table 2 shows the architecture of the transport map network. We apply skip connections from input features in the intermediate convolutional layers to the corresponding convolutional transpose layers. Table 3 shows the architectures of the discriminator for the generator, the discriminator for the transport map, and the auxiliary regressor.
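
As an illustration, the residual block in Table 2 can be sketched in PyTorch as follows; the two 3x3 convolutions with 256 filters and batch normalization come from the table, while the identity skip connection is assumed to follow the usual StarGAN-style residual design.

```python
# Minimal sketch of the residual block in Table 2: two 3x3 convolutions with
# 256 filters, stride 1, padding 1, each followed by batch normalization,
# with ReLU after the first convolution and an identity skip connection.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)  # add the block input back to its output
```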

We control the size of the networks of the baselines for a fair comparison. For conditional AAE (cAAE, Makhzani et al., 2015), the architectures are the same as ours except that the encoder and the discriminator input only the latent variable. For CycleGAN (Zhu et al., 2017), the architectures are the same as ours. For StarGAN, the architectures are the same as ours except that the transport map inputs only the source data and target domain labels. The dimension of the latent variable is 128. For both the proposed method and the baselines, we train the encoder and generator pair for 100,000 iterations with a batch size of 32, and train the transport map for 50,000 iterations with a batch size of 16. We use the Adam (Kingma and Ba, 2014) optimizer; the initial learning rate is 0.0002 for the encoder and generator pair and 0.0001 for the transport map, and it linearly decreases to 0. In the first stage of training the encoder and generator pair, we update the encoder and generator once every 5 iterations while updating the discriminator for the generator every iteration. In the second stage of training the transport map, we update the transport map and auxiliary regressor once every 5 iterations while updating the discriminator for the transport map every iteration. For data pre-processing, we apply the face detection algorithm proposed by Viola and Jones (2001) to crop the face region. The image resolution is scaled to 128, the pixel range is scaled to $[0,1]$, and a horizontal flip with probability 0.5 is applied during training.
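
The alternating update schedule of the first training stage can be sketched as follows; d_step and eg_step are placeholder functions that perform one update of the discriminator and of the encoder and generator pair, respectively, and the data iterator is assumed to yield (image, light condition) batches indefinitely.

```python
# Sketch of the alternating update schedule in the first training stage:
# the discriminator for the generator is updated every iteration, while the
# encoder and generator are updated once every n_critic = 5 iterations.
def train_encoder_generator(batches, num_iters, d_step, eg_step, n_critic=5):
    for it in range(num_iters):
        x, c = next(batches)       # batches: an (infinite) iterator of (image, label)
        d_step(x, c)               # one discriminator update
        if (it + 1) % n_critic == 0:
            eg_step(x, c)          # one encoder/generator update every n_critic steps
```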

We denote losses as follows and provide values of coefficients for losses.

  • $L_{Recon}(\text{Enc},\text{Gen}):=\int\|x-\text{Gen}(\text{Enc}(x,c),c)\|^{p}\,d\mathbb{P}_{X|C}(x|c)\,d\mathbb{P}_{C}(c)$

  • $L_{MatchLatent}(\text{Enc},\text{Gen}):=\mathcal{D}_{\text{JS}}\Big(\mathbb{P}_{Z}(z)\mathbb{P}_{C}(c),\big(\int\delta_{z}(\text{Enc}(x,c))\,d\mathbb{P}_{X|C}(x|c)\big)\mathbb{P}_{C}(c)\Big)$

  • $L_{ReconLatent}(\text{Enc},\text{Gen}):=\int\|(z,c)-(\text{Enc}(\text{Gen}(z,c),c),c)\|\,d\mathbb{P}_{(Z,C)}(z,c)$

  • $L_{\text{TransportCost}}(T):=\int d_{\text{Enc}}^{p}((x_{0},c_{0}),(T(x_{0},c_{0},c_{1}),c_{1}))\,d\mathbb{P}_{X|C}(x_{0}|c_{0})\,d\mathbb{P}_{C}(c_{0})\,d\mathbb{P}_{C}(c_{1})$

  • $L_{\text{MatchData}}(T):=W_{p}(\mathbb{P}_{(X,C)}\mathbb{P}_{C},\mathbb{P}_{(T(X_{0},C_{0},C_{1}),C_{1},C_{0})};\|\cdot\|)$

  • $L_{\text{Cycle}}(T):=\int\|x_{0}-T(T(x_{0},c_{0},c_{1}),c_{1},c_{0})\|\,d\mathbb{P}_{X|C}(x_{0}|c_{0})\,d\mathbb{P}_{C}(c_{0})\,d\mathbb{P}_{C}(c_{1})$

The coefficients of $L_{Recon}(\text{Enc},\text{Gen})$, $L_{MatchLatent}(\text{Enc},\text{Gen})$, $L_{ReconLatent}(\text{Enc},\text{Gen})$, $L_{\text{MatchData}}(T)$, and $L_{\text{Cycle}}(T)$ are 100.0, 1.0, 0.1, 1.0, and 5.0, respectively. For $L_{\text{TransportCost}}(T)$, we consider coefficients in $\{0.1, 1.0, 10.0, 100.0\}$ and choose the model yielding the best validation FID score. The coefficients of the gradient penalty loss, the regression loss, and the reconstruction error in the second stage, $\int d(x,T(x,c,c))\,d\mathbb{P}_{(X,C)}(x,c)$, are 100.0, 1.0, and 10.0, respectively. For CycleGAN, the coefficient of the identity mapping loss is 1.0. We extend the definition of $d_{\text{Enc}}$ by introducing a hyperparameter $\epsilon$, giving $d_{\text{Enc}}((x_{1},c_{1}),(x_{2},c_{2})):=\big(\lVert\text{Enc}(x_{1},c_{1})-\text{Enc}(x_{2},c_{2})\rVert^{2}+\epsilon\lVert c_{1}-c_{2}\rVert^{2}\big)^{1/2}$, to balance the distances on $\mathcal{Z}$ and on $\mathcal{C}$.
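
The sketch below shows how the stated coefficients combine the losses and how the extended $d_{\text{Enc}}$ with the balancing factor $\epsilon$ can be computed; the loss values, the encoder, the chosen $\epsilon$, and the selected transport-cost coefficient are placeholders rather than the exact implementation.

```python
# Sketch of the loss weighting and the extended encoder-induced distance.
import torch

def d_enc(enc, x1, c1, x2, c2, eps=1.0):
    # Extended d_Enc with the balancing hyperparameter eps (eps=1.0 is a placeholder).
    z1, z2 = enc(x1, c1), enc(x2, c2)
    sq = torch.sum((z1 - z2) ** 2, dim=1) + eps * torch.sum((c1 - c2) ** 2, dim=1)
    return torch.sqrt(sq)

def first_stage_loss(l_recon, l_match_latent, l_recon_latent):
    # Coefficients 100.0, 1.0, and 0.1 as stated above.
    return 100.0 * l_recon + 1.0 * l_match_latent + 0.1 * l_recon_latent

def second_stage_loss(l_transport_cost, l_match_data, l_cycle, lam_transport):
    # lam_transport is chosen from {0.1, 1.0, 10.0, 100.0} by validation FID;
    # the coefficients of the matching and cycle terms are 1.0 and 5.0.
    return lam_transport * l_transport_cost + 1.0 * l_match_data + 5.0 * l_cycle
```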

B.2 Further Results

Figure 4 presents further conditional generation results for unobserved intermediate domains. As in Figure 1, the proposed method produces face images with clearer eyes, noses, and mouths than the baseline.



Figure 5: A visualization of latent interpolation with the real data and their translation results by various methods. The bottom row shows real images from two observed domains. From the first to third rows, the leftmost column shows the ground-truth, the rightmost column shows transportation results of the ground-truth, and intermediate columns show latent interpolation results for unobserved intermediate domains.

Figure 5 presents further results of latent interpolation on unobserved intermediate domains with a real image and its translation result. The bottom row shows the ground-truth images of a fixed subject and pose from two observed domains with (azimuth, elevation) of (0, 0) for the leftmost and (70, 45) for the rightmost. As in Figure 3 of the manuscript, the outputs of the proposed method are visually sharper and more plausible than those of the baselines.

B.3 Computing Infrastructure

We use about one hundred CPU cores and ten GPUs (five GeForce GTX 1080, two TITAN X, and three TITAN V) for the experiments. A full training run of the proposed method requires about 15 GPU hours for the encoder and generator pair and 14 GPU hours for the transport map.

