
Solution of physics-based inverse problems using conditional generative adversarial networks with full gradient penalty

Deep Ray, Javier Murgoitio-Esandi, Agnimitra Dasgupta, Assad A. Oberai
Abstract

The solution of probabilistic inverse problems for which the corresponding forward problem is constrained by physical principles is challenging. This is especially true if the dimension of the inferred vector is large and the prior information about it is in the form of a collection of samples. In this work, a novel deep learning based approach is developed and applied to solving these types of problems. The approach utilizes samples of the inferred vector drawn from the prior distribution and a physics-based forward model to generate training data for a conditional Wasserstein generative adversarial network (cWGAN). The cWGAN learns the probability distribution for the inferred vector conditioned on the measurement and produces samples from this distribution. The cWGAN developed in this work differs from earlier versions in that its critic is required to be 1-Lipschitz with respect to both the inferred and the measurement vectors and not just the former. This leads to a loss term with the full (and not partial) gradient penalty. It is shown that this rather simple change leads to a stronger notion of convergence for the conditional density learned by the cWGAN and a more robust and accurate sampling strategy. Through numerical examples it is shown that this change also translates to better accuracy when solving inverse problems. The numerical examples considered include illustrative problems where the true distribution and/or statistics are known, and a more complex inverse problem motivated by applications in biomechanics.

keywords:
PDE-based inverse problems, Bayesian inference, conditional generative adversarial networks, generative models, deep learning, uncertainty quantification
Affiliations:

[label1] Department of Mathematics, University of Maryland, College Park, Maryland 20742, USA
[label2] Aerospace and Mechanical Engineering Department, University of Southern California, Los Angeles, California 90089, USA

1 Introduction

Inverse problems arise in many fields of engineering and science. Typically, they involve recovering the input fields or parameters in a system given noisy measurements of its response. In contrast to direct problems, where one wishes to determine the response given the input parameters, inverse problems are notoriously hard to solve. This is because they are often ill-posed in the sense of Hadamard [1] and involve measurements that are corrupted by noise, while their solution often requires injecting some prior information or beliefs regarding the inferred parameters or fields.

The challenges described above can be addressed by working with a probabilistic version of the inverse problem where the input and response vectors/fields are treated as random vectors /processes. In this manuscript, we refer to the input vector as the vector of the parameters to be inferred, or the inferred vector for short. In the probabilistic framework, the solution of an inverse problem transforms to characterizing the conditional probability distribution of the inferred vector conditioned on a measurement of the response; this distribution is known as the posterior distribution. Using Bayes’ rule, the posterior distribution can be expressed (up to a multiplicative constant) as the product of a marginal distribution for the inferred vector with the conditional distribution for the measurement given the inferred parameters [2]. The former is replaced by a prior distribution for the parameters to be inferred, and the latter leads to the likelihood term, which is determined by the model for the noise in the measurement and the forward map that links the parameters to the response. Once the probabilistic inverse problem has been framed, a method like Markov Chain Monte Carlo (MCMC) [3, 4] may be applied to sample from the posterior distribution; these samples help approximate the posterior distribution that may otherwise be analytically intractable [5]. The steady state of the Markov chain leads to samples that are drawn from the desired conditional distribution for the input parameters.

While the MCMC method and its variants can be used to solve some probabilistic inverse problems, their efficacy diminishes under several conditions. These include problems where the dimension of the inferred vector becomes large (greater than approximately 50, for example), the so-called curse of dimensionality [6, 7, 8]. They also include problems where prior information about the parameters to be inferred cannot be expressed in terms of a simple probability distribution and is instead determined through samples. Finally, some variants of the MCMC method with improved convergence properties, like the Hamiltonian Monte Carlo (HMC) method [9, 10], rely on computing the derivative of the forward map with respect to the parameters. In cases where the forward map is implemented through a complex legacy code, computing this derivative is not feasible and these methods cannot be used.

Recently, there has been a growing interest in utilizing novel machine and deep learning algorithms to address the challenges described above. Most of these methods first consider samples of the parameters to be inferred derived from a prior distribution and utilize the forward map and a model for noise to generate corresponding samples of the measurement vector. Thereafter, they use this dataset to train algorithms for sampling the conditional distribution for the parameters given the measured response. One such class of methods that utilize deep neural networks to map the response vector to the vector of parameters, is referred to as Bayesian Neural Networks (BNNs) [11, 12, 13, 14]. In these networks, in addition to the input and output vectors, the network weights are also treated as random variables. Thereafter, Bayes’ rule is applied to derive an expression for the posterior distribution of the network parameters (weights and biases) given the training data. Then, variational inference is used to derive an approximation for the posterior density of the network parameters. Once this density is known, multiple realizations of the network parameters are sampled and the resulting networks are applied to the same measured response to yield an ensemble of inferred vectors. The performance of BNNs is limited by the approximations that are typically incurred to make variational inference tractable, which include assuming mixture models for the weights and selecting a small subset of the network weights as stochastic [15, 16].

Another class of deep learning algorithms utilize Generative Adversarial Networks (GANs) [17] to learn and then sample from the posterior distribution. GANs are generative models that learn a reduced-order representation of a probability distribution from its samples and then efficiently generate samples from it. GANs comprise of two networks: a generator and a critic. The generator network maps low-dimensional vectors drawn from a simple probability distribution to the output domain, while the critic maps samples from the output domain to a scalar. For the Wasserstein GAN [18, 19], under suitable assumptions, it can be shown that the probability distribution of the samples produced by the trained generator converges to the true distribution (from which the training samples are drawn) in the Wasserstein-1 distance. This is achieved by imposing a 1-Lipschitz constraint on the critic. In [20], a WGAN was used to learn the prior distribution of the inferred vector, and then used in conjunction with the HMC method to learn the posterior conditional distribution. While this approach reduced the dimension of the underlying inverse problem by mapping it to the low-dimensional latent space of the WGAN, it still required the use of an HMC method to sample from the posterior distribution.

In contrast to the approach described above, conditional GANs [21] can be used to approximate the posterior distribution directly. Adler and Öktem [22] proposed a conditional WGAN (cWGAN) that aims to directly learn the posterior distribution, and used it to improve the results of an inverse problem in medical imaging. Ray et al. [23] improved the architecture of cWGAN and applied it to solve complex inverse problems motivated by physics-driven applications. Kovachki et al. [24] proposed the Monotone GAN (MGAN) to solve inverse problems. This method differs from the cWGAN in several ways. First, it uses a latent space whose dimension is the same as that of the inferred vector, whereas the cWGAN typically uses a latent space of reduced dimension. Second, it relies on block triangular transformations to map the latent space to the space of the inferred vector, whereas a cWGAN uses a U-NET [25] architecture. Finally, it requires the generator that transforms the latent space to the space of inferred parameters to be monotonic, whereas the cWGAN imposes a 1-Lipschitz condition on the critic and not the generator.

The cWGAN proposed by Adler and Öktem [22] uses a critic network that is 1-Lipschitz with respect to the inferred vector only. The 1-Lipschitz constraint is implemented weakly through a gradient penalty term, and the gradient is computed with respect to the inferred vector. Herein, we refer to this as Partial-GP. As a result of the partial 1-Lipschitz constraint being imposed on the critic, the Kantorovich-Rubinstein duality (see [26], for example) can be invoked to show that the critic for the cWGAN with Partial-GP belongs to the space of functions that can only be optimized to approximate the Wasserstein-1 distance between distributions of the inferred vector. Based on this, Adler and Öktem [22] develop a training objective function for the cWGAN that aims to reduce the averaged Wasserstein-1 distance between the true posterior distribution and the one induced by the cWGAN; the average is taken over all possible values of the measurement vector. This does not necessarily mean that the distribution of the inferred vector conditioned on any one particular value of the measurement vector will converge to the true posterior distribution, even if the critic and generator networks are sufficiently expressive and are perfectly trained.

In this work, we improve upon the cWGAN with Partial-GP [22] with a novel cWGAN formulation which differs from the original in one critical aspect. In our approach, we require the critic network to be a 1-Lipschitz function of all its inputs, that is, of both the inferred and measurement vectors. Therefore, the crucial algorithmic difference is that the gradient must be computed with respect to both arguments of the critic, as opposed to just the inferred vector for Partial-GP. Remarkably, this small change in the implementation has a significant effect on the properties of the final network. Because the critic is 1-Lipschitz with respect to its entire input argument, we can show, via the Kantorovich-Rubinstein duality, that the critic in the cWGAN belongs to the space of functions that can be optimized to approximate the Wasserstein-1 distance between distributions of the joint random vector containing both the inferred variable and the measurements. As a result, we show that the generator of a cWGAN that is perfectly trained with the same training objective as [22] will converge to the true conditional distribution of the inferred vector for any measurement vector.

Additionally, the convergence guarantees of the proposed cWGAN with Full-GP allow us to develop a new evaluation strategy where the cWGAN is evaluated at measurement values obtained by perturbing the actual measurement vector with samples from a zero-mean Gaussian distribution that has a minute variance. We show that this new evaluation strategy may lead to improved performance of cWGANs. Further, we study the behavior of the proposed cWGAN with Full-GP on a suite of illustrative examples and its ability to approximate the joint distribution and the posterior distribution. We also apply the proposed cWGANs to two challenging inverse problems arising in inverse heat conduction and elastography [27] that help establish its efficacy in solving physics-based inverse problems.

The format of the remainder of this manuscript is as follows. In the following section, we introduce the mathematical background required to develop and analyze the new cWGAN with Full-GP. In Section 3, we describe how the Full-GP cWGAN is trained and then used to solve a probabilistic inverse problem. In Section 4, we perform a detailed analysis of the convergence of the conditional density generated by the Full-GP cWGAN with an increasing number of trainable generator parameters. We also compare it with the corresponding analysis for the cWGAN with Partial-GP. In Section 5, we apply the Full-GP and Partial-GP cWGAN networks to inverse problems of increasing complexity, and evaluate their ability to recover the true posterior distribution for cases where it is known. We observe that the Full-GP cWGAN performs better. We end with conclusions in Section 6.

2 Background

2.1 Problem setup

Consider the following measurement model

{\bm{y}}=\mathcal{F}({\bm{x}};\bm{\eta}) (1)

where \mathcal{F} is the forward model/process, {\bm{x}}\in\mathbb{R}^{N_{X}} is some input field to the forward model, while {\bm{y}}\in\mathbb{R}^{N_{Y}} is the corresponding output/measured field of the model corrupted by some measurement noise \bm{\eta}. Our goal is to solve the inverse problem of reconstructing/inferring {\bm{x}} given some noisy measurement {\bm{y}}. In the probabilistic setting, necessary for carrying out Bayesian inference, the inferred field {\bm{x}} and the measured field {\bm{y}} are modeled as random variables {\bm{X}} and {\bm{Y}}, respectively. Let \mu_{{\bm{X}}{\bm{Y}}} be the joint probability measure corresponding to the pair of random variables ({\bm{X}},{\bm{Y}}), with marginal measures \mu_{\bm{X}} and \mu_{\bm{Y}}. Then, given a realization {\bm{Y}}=\hat{{\bm{y}}}, we wish to approximate the conditional measure \mu_{{\bm{X}}|{\bm{Y}}}(.|\hat{{\bm{y}}}) and efficiently draw samples from it in order to evaluate useful conditional statistics.

2.2 Conditional GANs

Conditional generative adversarial networks (cGANs) [21] can be used to learn conditional probability measures. Typically, cGANs comprise two deep neural networks, a generator {\bm{g}} and a critic d. The generator is a mapping given by

{\bm{g}}:\mathbb{R}^{N_{Z}}\times\mathbb{R}^{N_{Y}}\rightarrow\mathbb{R}^{N_{X}},\quad{\bm{g}}:({\bm{z}},{\bm{y}})\mapsto{\bm{x}} (2)

where {\bm{z}}\in\mathbb{R}^{N_{Z}} is a realization of the latent random variable {\bm{Z}} with measure \mu_{\bm{Z}}. The latent measure is chosen to be simple, such as a multidimensional Gaussian, to make it easy to sample from. The generator can be interpreted as a mapping that, given a {\bm{y}}\sim\mu_{\bm{Y}}, can generate 'fake' samples of the inferred field from the push-forward measure \mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}={\bm{g}}(.,{\bm{y}})_{\#}\mu_{\bm{Z}}. The critic, on the other hand, is given by the mapping

d:\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}\rightarrow\mathbb{R},\quad d:({\bm{x}},{\bm{y}})\mapsto r (3)

whose role is to distinguish between paired samples ({\bm{x}},{\bm{y}}) drawn from the true joint measure \mu_{{\bm{X}}{\bm{Y}}} and the fake/generated joint measure \mu^{{\bm{g}}}_{{\bm{X}}{\bm{Y}}}=\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}\mu_{\bm{Y}}.

In a conditional Wasserstein GAN (cWGAN) [22], a variant of cGAN, both the generator and critic networks are trained in an adversarial manner using the objective ‘loss’ function

\mathcal{L}({\bm{g}},d) = \mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu_{\bm{Y}}}\left[\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})\right]\right] (4)
 = \mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})\right],

and solving the minmax problem

g^{*},d^{*}=\underset{{\bm{g}}}{\text{argmin}}\ \underset{d}{\text{argmax}}\ \mathcal{L}({\bm{g}},d). (5)

3 Conditional WGAN with Full-GP

In this work, we propose a cWGAN model with Full-GP which, unlike the original cWGAN proposed by Adler and Öktem [22], imposes a 1-Lipschitz constraint on the critic d with respect to its entire argument ({\bm{x}},{\bm{y}}). Let us denote by Lip the set of 1-Lipschitz functions on \mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}. If we assume that the maximization in (5) is over d\in\text{Lip}, then we can show by the Kantorovich–Rubinstein duality argument [26] that

d^{*}({\bm{g}}) = \underset{d\in\text{Lip}}{\text{argmax}}\ \mathcal{L}({\bm{g}},d)=W_{1}(\mu_{{\bm{X}}{\bm{Y}}},\mu^{\bm{g}}_{{\bm{X}}{\bm{Y}}}), (6)
{\bm{g}}^{*} = \underset{{\bm{g}}}{\text{argmin}}\ \mathcal{L}({\bm{g}},d^{*}({\bm{g}}))=\underset{{\bm{g}}}{\text{argmin}}\ W_{1}(\mu_{{\bm{X}}{\bm{Y}}},\mu^{\bm{g}}_{{\bm{X}}{\bm{Y}}}),

where W_{1} is the Wasserstein-1 distance defined on the space of joint probability measures on \mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}. In practice, the 1-Lipschitz constraint on the critic can be imposed by augmenting the loss function with a gradient penalty term [19] when training the critic, i.e., by solving

d^{*}({\bm{g}})=\underset{d\in\text{Lip}}{\text{argmax}}\ \left[\mathcal{L}({\bm{g}},d)-\lambda\mathcal{G}\mathcal{P}\right] (7)

where \lambda>0 is a hyper-parameter which appropriately weights the penalty term. We use the following penalty term:

\mathcal{G}\mathcal{P}=\mathbb{E}_{\delta\sim\mathcal{U}(0,1)}\left[(\|\nabla d(\bm{h}({\bm{x}},{\bm{y}},{\bm{z}},\delta),{\bm{y}})\|_{2}-1)^{2}\right], (8)

where \mathcal{U}(0,1) denotes the uniform distribution on [0,1], \nabla denotes the gradient of the critic with respect to both its arguments, and

\bm{h}({\bm{x}},{\bm{y}},{\bm{z}},\delta)=\delta{\bm{x}}+(1-\delta){\bm{g}}({\bm{z}},{\bm{y}}). (9)

The gradient penalty term in Eq. 8 imposes a 1-Lipschitz constraint on the critic with respect to both its arguments {\bm{x}} and {\bm{y}}. Using this modified penalty term is crucial to ensure that solving the minmax problem is equivalent to minimizing the W_{1} distance between the true joint measure \mu_{{\bm{X}}{\bm{Y}}} and the generated joint measure \mu^{\bm{g}}_{{\bm{X}}{\bm{Y}}}. This in turn leads to an alternative analysis of (weak) convergence to the true conditional measure \mu_{{\bm{X}}|{\bm{Y}}}, which is described in Section 4.
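The snippet below is a minimal sketch of how the Full-GP term of Eq. 8 can be computed, with the Partial-GP term of Eq. 10 included for contrast. It assumes PyTorch (the paper does not prescribe a framework), assumes that {\bm{x}} and {\bm{y}} are flattened into batched vectors, and the function and variable names are illustrative rather than the authors' implementation.

```python
import torch

def full_gradient_penalty(critic, x_true, y, x_gen):
    """Penalize deviations of ||grad_{(x,y)} d|| from 1 at points interpolated
    between true and generated samples (Eqs. 8-9)."""
    delta = torch.rand(x_true.size(0), 1, device=x_true.device)   # delta ~ U(0,1)
    h = delta * x_true + (1.0 - delta) * x_gen                    # Eq. 9
    h = h.requires_grad_(True)
    y = y.detach().requires_grad_(True)
    d_out = critic(h, y)
    grads = torch.autograd.grad(
        outputs=d_out, inputs=[h, y],
        grad_outputs=torch.ones_like(d_out),
        create_graph=True)
    # Full-GP: norm of the gradient with respect to BOTH arguments of the critic
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

def partial_gradient_penalty(critic, x_true, y, x_gen):
    """Partial-GP variant of Adler and Oktem: gradient w.r.t. x only (Eq. 10)."""
    delta = torch.rand(x_true.size(0), 1, device=x_true.device)
    h = (delta * x_true + (1.0 - delta) * x_gen).requires_grad_(True)
    d_out = critic(h, y)
    grad_x, = torch.autograd.grad(
        outputs=d_out, inputs=h,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True)
    return ((grad_x.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
```

The only implementation difference between the two penalties is the list of inputs passed to the automatic differentiation call, which mirrors the replacement of \partial_{1}d by \nabla d discussed above.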

Remark 3.1.

In the original cWGAN model [22], the penalty term was chosen to constrain the critic to be 1-Lipschitz only with respect to its first argument {\bm{x}}. More precisely, Adler and Öktem [22] use the following penalty term:

\mathcal{G}\mathcal{P}=\mathbb{E}_{\delta\sim\mathcal{U}(0,1)}\left[(\|\partial_{1}d(\bm{h}({\bm{x}},{\bm{y}},{\bm{z}},\delta),{\bm{y}})\|_{2}-1)^{2}\right], (10)

where \partial_{1}d(.,.) denotes the derivative with respect to the first argument of the critic. Note that the derivative operator \nabla d(.,.) in Eq. 8 has been replaced by \partial_{1}d(.,.) in Eq. 10.

3.1 Training

To train the proposed cWGAN with Full-GP, we assume access to a dataset \mathcal{S}=\{({\bm{x}}^{(i)},{\bm{y}}^{(i)}):1\leq i\leq N_{s}\} of paired samples from the true joint measure \mu_{{\bm{X}}{\bm{Y}}}. In practice, this can be constructed by first generating samples \{{\bm{x}}^{(i)}:1\leq i\leq N_{s}\} from a measure \mu_{\bm{X}} (based on prior knowledge about the inferred field), and then generating the corresponding {\bm{y}}^{(i)}'s using the (noisy) forward model from Eq. 1. Alternatively, such paired data might be available from experiments.

Next, by replacing the expectations in the objective function Eq. 4 with empirical averages, and choosing suitable architectures for the generator and critic (along with other hyper-parameters), the minmax problem Eq. 6 is solved to find the optimal critic d^{*} and generator {\bm{g}}^{*}. This involves using a gradient-based optimizer, such as Adam [28], to optimize the trainable parameters of the two networks. A computationally tractable strategy is to solve the minmax problem iteratively. Typically, N_{\mathrm{max}}\approx 4{-}10 maximization steps are taken to update the trainable parameters of the critic, followed by a single minimization step for the generator. The training is terminated after a certain number of epochs (loops over the training dataset) are completed.
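A minimal sketch of this alternating update scheme is given below, again assuming PyTorch and reusing the full_gradient_penalty function from the earlier sketch; the optimizer settings, n_critic, lambda_gp and other hyper-parameters shown here are assumptions, not the values used in the paper.

```python
import torch

def train_cwgan(generator, critic, loader, n_epochs=100, n_critic=5,
                lambda_gp=10.0, n_z=3, device="cpu"):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.5, 0.9))
    for epoch in range(n_epochs):
        for x, y in loader:                      # paired samples from S
            x, y = x.to(device), y.to(device)
            # --- N_max maximization steps for the critic ---
            for _ in range(n_critic):
                z = torch.randn(x.size(0), n_z, device=device)
                x_gen = generator(z, y).detach()
                loss_d = -(critic(x, y).mean() - critic(x_gen, y).mean()) \
                         + lambda_gp * full_gradient_penalty(critic, x, y, x_gen)
                opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # --- single minimization step for the generator ---
            z = torch.randn(x.size(0), n_z, device=device)
            loss_g = -critic(generator(z, y), y).mean()
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return generator, critic
```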

3.2 Evaluation

Consider the expectation of a continuous and bounded functional q\in C_{b}(\mathbb{R}^{N_{X}}) with respect to the true posterior conditional measure \mu_{{\bm{X}}|{\bm{Y}}}(.|{\bm{y}})

k({\bm{y}};q):=\mathbb{E}_{\mu_{X|Y}}\left[q({\bm{X}})|{\bm{y}}\right]=\int_{\mathbb{R}^{N_{X}}}q({\bm{x}})\text{d}\mu_{{\bm{X}}|{\bm{Y}}}({\bm{x}}|{\bm{y}}) (11)

for a given {\bm{y}}\sim\mu_{\bm{Y}}. For instance, choosing q({\bm{x}})=x_{i} and q({\bm{x}})=(x_{i}-\bar{x}_{i})^{2} in Eq. 11 for 1\leq i\leq N_{X} leads to the evaluation of the mean \bar{x}_{i} and variance \sigma_{x,i}^{2} of each component of {\bm{X}}. The conditional expectation in Eq. 11 can be approximated using the trained generator {\bm{g}}^{*} of the cWGAN with the following algorithm: given \hat{{\bm{y}}}\sim\mu_{\bm{Y}}, q\in C_{b}(\mathbb{R}^{N_{X}}), sample size K and \sigma\geq 0,

  1. Draw K random samples \{\widetilde{{\bm{y}}}^{(i)}:1\leq i\leq K\} from \mathcal{N}(\hat{{\bm{y}}},\sigma^{2}\bm{I}), i.e., the Gaussian measure centered at \hat{{\bm{y}}} with covariance \sigma^{2}\bm{I}.

  2. Generate K samples \{{\bm{z}}^{(i)}:1\leq i\leq K\} from \mu_{\bm{Z}}.

  3. Approximate the conditional expectation (11) using the empirical average

     k(\hat{y};q)\approx\frac{1}{K}\sum_{i=1}^{K}q\big{(}{\bm{g}}^{*}({\bm{z}}^{(i)},\widetilde{{\bm{y}}}^{(i)})\big{)}. (12)

The motivation behind using Eq. 12 to approximate the conditional expectations (and the choice of \sigma) is given by Theorem 4.2 in Section 4 (see the discussion following its proof).
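A minimal sketch of the three-step procedure above is given next, assuming PyTorch and a trained generator with the interface of Eq. 2; the function name and default arguments are illustrative.

```python
import torch

@torch.no_grad()
def posterior_statistic(generator, y_hat, q, K=1000, sigma=0.0, n_z=3):
    """Approximate k(y_hat; q) = E[q(X) | y_hat] via the empirical average of Eq. 12."""
    # Step 1: perturb the measurement with samples from N(y_hat, sigma^2 I)
    y_tilde = y_hat.unsqueeze(0) + sigma * torch.randn(K, *y_hat.shape)
    # Step 2: draw K latent samples from mu_Z (standard Gaussian latent measure)
    z = torch.randn(K, n_z)
    # Step 3: empirical average of q over the generated samples
    x_gen = generator(z, y_tilde)
    return q(x_gen).mean(dim=0)

# Example usage: component-wise posterior mean and variance for a measurement y_hat
# mean = posterior_statistic(g_star, y_hat, lambda x: x, K=800, sigma=0.05)
# var  = posterior_statistic(g_star, y_hat, lambda x: (x - mean) ** 2, K=800, sigma=0.05)
```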

4 Convergence analysis of the cWGAN with Full-GP

In this section, we analyze the convergence of the cWGAN with Full-GP and show how the optimal generator can be used to approximate the posterior statistics of Eq. 11 in accordance with Eq. 12. This result, which forms the main result of this section, is contained in Theorem 4.1 for \sigma=0 and in Theorem 4.2 for \sigma>0. The proofs of these theorems rely on three distinct assumptions. The first (Assumption 4.1) is a statement of the convergence of a standard (not conditional) WGAN with an increasing number of learnable parameters. This is also implicitly assumed in the original WGAN papers [18, 19]. The second assumption (Assumption 4.2) requires that the true and learned conditional expectations are bounded. Finally, the third (Assumption 4.3) invokes the absolute continuity of the marginal measure of the measurement vector, which ensures the existence of a bounded marginal density.

Let \mathcal{G}_{n} be the family of generator networks such that any network {\bm{g}}\in\mathcal{G}_{n} has n trainable parameters. We begin with our first assumption about the convergence of a standard WGAN with increasing capacity of the networks.

Assumption 4.1 (WGAN convergence).

We assume that for every n\in\mathbb{Z}_{+}, we can find a minimizing generator {\bm{g}}^{*}_{n}\in\mathcal{G}_{n} and a critic d^{*}_{n}=d^{*}({\bm{g}}_{n})\in\text{Lip} such that d^{*}_{n} solves the maximization problem in Eq. 6. In other words, by defining

w_{n}:=\mathcal{L}({\bm{g}}^{*}_{n},d^{*}_{n}({\bm{g}}^{*}_{n}))=W_{1}(\mu_{{\bm{X}}{\bm{Y}}},\mu^{{\bm{g}}^{*}_{n}}_{{\bm{X}}{\bm{Y}}}),

for any other {\bm{g}}_{n}\in\mathcal{G}_{n}, we have

W_{1}(\mu_{{\bm{X}}{\bm{Y}}},\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}})=\mathcal{L}({\bm{g}}_{n},d^{*}({\bm{g}}_{n}))\geq w_{n}.

Further, the sequence of minimizing generators \{{\bm{g}}^{*}_{n}\} leads to the following convergence

\lim_{n\rightarrow\infty}w_{n}=0.

Under Assumption 4.1, the sequence of measures \mu^{{\bm{g}}^{*}_{n}}_{{\bm{X}}{\bm{Y}}} will converge to \mu_{{\bm{X}}{\bm{Y}}} in the W_{1} metric, and hence weakly [26], i.e.,

\lim_{n\rightarrow\infty}\mathbb{E}_{\mu^{{\bm{g}}^{*}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell({\bm{X}},{\bm{Y}})\right]=\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell({\bm{X}},{\bm{Y}})\right]\quad\forall\ \ell\in C_{b}(\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}). (13)

Next, we show how an appropriate choice of \ell\in C_{b}(\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}) leads to approximating the conditional expectation (11). Given a generator {\bm{g}}, we define the generated conditional expectations for any q\in C_{b}(\mathbb{R}^{N_{X}}) as

k^{\bm{g}}({\bm{y}};q)=\mathbb{E}_{\mu^{\bm{g}}_{X|Y}}\left[q({\bm{X}})|{\bm{y}}\right]=\int_{\mathbb{R}^{N_{X}}}q({\bm{x}})\text{d}\mu^{\bm{g}}_{{\bm{X}}|{\bm{Y}}}({\bm{x}}|{\bm{y}}). (14)

We require the conditional expectations to be finite and computable.

Assumption 4.2 (Bounded conditional expectation).

For any q\in C_{b}(\mathbb{R}^{N_{X}}), we assume that:

  1. \|k({\bm{y}};q)\|_{L^{\infty}(\mathbb{R}^{N_{Y}})}<\infty; and

  2. there exists a positive integer N_{b} (which may depend on q) such that

     \|k^{{\bm{g}}^{*}_{n}}({\bm{y}};q)\|_{L^{\infty}(\mathbb{R}^{N_{Y}})}\leq C_{q}<\infty

     for all optimized generators {\bm{g}}^{*}_{n} with n\geq N_{b}. In other words, for a given q, we assume a bound that is uniform over all optimal generators beyond a certain network size.

We also make the following assumption about the true marginal \mu_{\bm{Y}} to simplify the exposition.

Assumption 4.3 (Absolute continuity).

We assume that \mu_{\bm{Y}} is absolutely continuous with respect to the Lebesgue measure on \mathbb{R}^{N_{Y}}, which ensures the existence of a density p_{\bm{Y}} satisfying

\text{d}\mu_{\bm{Y}}({\bm{y}})=p_{\bm{Y}}({\bm{y}})\text{d}{\bm{y}}.

Furthermore, we assume that \|p_{\bm{Y}}\|_{L^{\infty}(\mathbb{R}^{N_{Y}})}<\infty.

Now, let \hat{{\bm{y}}}\sim\mu_{\bm{Y}} be given for which p_{\bm{Y}}(\hat{{\bm{y}}})\neq 0. Let \mu_{{\bm{Y}}_{\sigma}}:=\mathcal{N}(\hat{{\bm{y}}},\sigma^{2}\bm{I}) be the Gaussian measure on \mathbb{R}^{N_{Y}} with density

p_{\sigma}({\bm{y}})=\frac{1}{(2\pi\sigma^{2})^{N_{Y}/2}}\exp\left(-\|{\bm{y}}-\hat{{\bm{y}}}\|^{2}_{L^{2}(\mathbb{R}^{N_{Y}})}/2\sigma^{2}\right). (15)

The following standard result shows how the Dirac measure can be approximated using a Gaussian.

Lemma 4.1.

Let f\in L^{\infty}(\mathbb{R}^{N_{Y}}) and \mu_{{\bm{Y}}_{\sigma}}:=\mathcal{N}(\hat{{\bm{y}}},\sigma^{2}\bm{I}). Then

\lim_{\sigma\rightarrow 0}\mathbb{E}_{\mu_{{\bm{Y}}_{\sigma}}}\left[f({\bm{Y}})\right]=f(\hat{{\bm{y}}}). (16)
Proof.

Note that

\mathbb{E}_{\mu_{{\bm{Y}}_{\sigma}}}\left[f({\bm{Y}})\right]=\int_{\mathbb{R}^{N_{Y}}}f({\bm{y}})\frac{1}{(2\pi\sigma^{2})^{N_{Y}/2}}\exp\left(-\|{\bm{y}}-\hat{{\bm{y}}}\|^{2}_{L^{2}(\mathbb{R}^{N_{Y}})}/2\sigma^{2}\right)\text{d}{\bm{y}}.

Applying the change of variables \bm{s}=({\bm{y}}-\hat{{\bm{y}}})/\sigma, we get

\mathbb{E}_{\mu_{{\bm{Y}}_{\sigma}}}\left[f({\bm{Y}})\right]=\int_{\mathbb{R}^{N_{Y}}}f(\sigma\bm{s}+\hat{{\bm{y}}})\frac{1}{(2\pi)^{N_{Y}/2}}\exp\left(-\|\bm{s}\|^{2}_{L^{2}(\mathbb{R}^{N_{Y}})}/2\right)\text{d}\bm{s}.

Since the integrand converges a.e. to f(\hat{{\bm{y}}})\frac{1}{(2\pi)^{N_{Y}/2}}\exp\left(-\|\bm{s}\|^{2}_{L^{2}(\mathbb{R}^{N_{Y}})}/2\right) as \sigma\rightarrow 0 and

\int_{\mathbb{R}^{N_{Y}}}\frac{1}{(2\pi)^{N_{Y}/2}}\exp\left(-\|\bm{s}\|^{2}_{L^{2}(\mathbb{R}^{N_{Y}})}/2\right)\text{d}\bm{s}=1

we get (16) by an application of the dominated convergence theorem. ∎

For the rest of the theoretical discussion, we omit the superscript *, and it is understood that {\bm{g}}_{n} denotes the optimal generator in accordance with Assumption 4.1. We are now ready to state and prove the main results about approximating Eq. 11.

Theorem 4.1.

Let \hat{{\bm{y}}}\sim\mu_{\bm{Y}} be given for which p_{\bm{Y}}(\hat{{\bm{y}}})\neq 0. Let Assumptions 4.1, 4.2 and 4.3 hold. Then, given \epsilon>0 and q\in C_{b}(\mathbb{R}^{N_{X}}), there exists a positive integer \widetilde{N}:=\widetilde{N}(\hat{{\bm{y}}},q,\epsilon)\geq N_{b} such that

\left|k(\hat{{\bm{y}}};q)-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)\right|<\epsilon\quad\forall\ n\geq\widetilde{N}. (17)

Note that N_{b} is as defined in Assumption 4.2.

Proof.

Consider the Gaussian measure \mu_{{\bm{Y}}_{\sigma}} whose density p_{\sigma} is given by Eq. 15. Let \ell_{\sigma}({\bm{x}},{\bm{y}})=q({\bm{x}})p_{\sigma}({\bm{y}}), which clearly belongs to C_{b}(\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}). By Assumption 4.3, we have

\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right] = \int_{\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}}q({\bm{x}})p_{\sigma}({\bm{y}})\text{d}\mu_{{\bm{X}}{\bm{Y}}}({\bm{x}},{\bm{y}}) (18)
 = \int_{\mathbb{R}^{N_{Y}}}\int_{\mathbb{R}^{N_{X}}}q({\bm{x}})p_{\sigma}({\bm{y}})\text{d}\mu_{{\bm{X}}|{\bm{Y}}}({\bm{x}}|{\bm{y}})\text{d}\mu_{{\bm{Y}}}({\bm{y}})
 = \int_{\mathbb{R}^{N_{Y}}}k({\bm{y}};q)p_{\sigma}({\bm{y}})\text{d}\mu_{{\bm{Y}}}({\bm{y}})
 = \int_{\mathbb{R}^{N_{Y}}}k({\bm{y}};q)p_{\sigma}({\bm{y}})p_{\bm{Y}}({\bm{y}})\text{d}{\bm{y}}
 = \mathbb{E}_{\mu_{{\bm{Y}}_{\sigma}}}\left[k({\bm{Y}};q)p_{\bm{Y}}({\bm{Y}})\right].

Then, using Assumptions 4.2 and 4.3, and Lemma 4.1 in Eq. 18 leads to

\lim_{\sigma\rightarrow 0}\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right]=k(\hat{{\bm{y}}};q)p_{\bm{Y}}(\hat{{\bm{y}}}). (19)

Thus, there exists \widetilde{\sigma}_{1}=\widetilde{\sigma}_{1}(\epsilon,q,\hat{{\bm{y}}}) such that

\left|\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right]-k(\hat{{\bm{y}}};q)\right|<\frac{\epsilon}{3}\quad\forall\ \sigma\leq\widetilde{\sigma}_{1}. (20)

Similarly, we can show that

\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right] = \mathbb{E}_{\mu_{{\bm{Y}}_{\sigma}}}\left[k^{{\bm{g}}_{n}}({\bm{Y}};q)p_{\bm{Y}}({\bm{Y}})\right],

and for any optimal generator {\bm{g}}_{n} with n\geq N_{b}

\lim_{\sigma\rightarrow 0}\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right]=k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)p_{\bm{Y}}(\hat{{\bm{y}}}). (21)

In other words, there exists \widetilde{\sigma}_{2}=\widetilde{\sigma}_{2}(\epsilon,q,\hat{{\bm{y}}}) such that for n\geq N_{b}

\left|\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right]-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)\right|<\frac{\epsilon}{3}\quad\forall\ \sigma\leq\widetilde{\sigma}_{2}. (22)

Note that \widetilde{\sigma}_{2} will be independent of {\bm{g}}_{n} because of the uniform bound assumption in Assumption 4.2 for n\geq N_{b}.

We note that by choosing \sigma=\widetilde{\sigma}:=\min(\widetilde{\sigma}_{1},\widetilde{\sigma}_{2}), both inequalities 20 and 22 hold simultaneously. For this \widetilde{\sigma} and the corresponding \ell_{\widetilde{\sigma}}, we have the weak convergence estimate Eq. 13 due to Assumption 4.1. Thus, there exists a positive integer \widetilde{N}_{1}:=\widetilde{N}_{1}(\widetilde{\sigma})=\widetilde{N}_{1}(\hat{{\bm{y}}},q,\epsilon) such that

|\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\widetilde{\sigma}}({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\widetilde{\sigma}}({\bm{X}},{\bm{Y}})\right]|<\frac{\epsilon p_{\bm{Y}}(\hat{{\bm{y}}})}{3}\quad\forall\ n\geq\widetilde{N}_{1}. (23)

Further, by setting \widetilde{N}:=\max(N_{b},\widetilde{N}_{1}), we have, using inequalities 20, 22 and 23, that for all n\geq\widetilde{N}

\left|k(\hat{{\bm{y}}};q)-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)\right| \leq \left|k(\hat{{\bm{y}}};q)-\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\widetilde{\sigma}}({\bm{X}},{\bm{Y}})\right]\right|
 +\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\left|\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\widetilde{\sigma}}({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\widetilde{\sigma}}({\bm{X}},{\bm{Y}})\right]\right|
 +\left|\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\widetilde{\sigma}}({\bm{X}},{\bm{Y}})\right]-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)\right|
 < \frac{\epsilon}{3}+\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\frac{\epsilon p_{\bm{Y}}(\hat{{\bm{y}}})}{3}+\frac{\epsilon}{3}=\epsilon. ∎

We construct a sequence of joint measures with elements

\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}({\bm{x}},{\bm{y}})=\mu^{{\bm{g}}_{n}}_{{\bm{X}}|{\bm{Y}}}({\bm{x}}|{\bm{y}})\mu_{{\bm{Y}}_{\sigma}}({\bm{y}}), (24)

which differs from both \mu_{{\bm{X}}{\bm{Y}}} and \mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}. We now prove a robustness result which shows that we can obtain error estimates similar to those of Theorem 4.1 even if we inject a controlled amount of noise into the given measurement {\bm{y}} in accordance with \mu_{{\bm{Y}}_{\sigma}}.

Theorem 4.2.

Let \hat{{\bm{y}}}\sim\mu_{\bm{Y}} be given for which p_{\bm{Y}}(\hat{{\bm{y}}})\neq 0. Let Assumptions 4.1, 4.2 and 4.3 hold. Then, given \epsilon>0 and q\in C_{b}(\mathbb{R}^{N_{X}}), there exists a positive real number \bar{\sigma}:=\bar{\sigma}(\hat{{\bm{y}}},q,\epsilon) and a positive integer \bar{N}:=\bar{N}(\bar{\sigma})\geq N_{b} such that

\left|k(\hat{{\bm{y}}};q)-\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\bar{\sigma}}}}\left[q({\bm{X}})\right]\right|<\epsilon\quad\forall\ n\geq\bar{N}, (25)

where \hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\bar{\sigma}}} is defined according to Eq. 24.

Proof.

Following the proof of Theorem 4.1, we can find a \bar{\sigma}_{1}=\bar{\sigma}_{1}(\epsilon,q,\hat{{\bm{y}}}) and a \bar{\sigma}_{2}=\bar{\sigma}_{2}(\epsilon,q,\hat{{\bm{y}}}) such that

\left|\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right]-k(\hat{{\bm{y}}};q)\right|<\frac{\epsilon}{4}\quad\forall\ \sigma\leq\bar{\sigma}_{1},

and

\left|\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\sigma}({\bm{X}},{\bm{Y}})\right]-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)\right|<\frac{\epsilon}{4}\quad\forall\ \sigma\leq\bar{\sigma}_{2}\ \ \text{and}\ \ n\geq N_{b}. (27)

We also have

\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}}\left[q({\bm{X}})\right] = \int_{\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}}q({\bm{x}})\text{d}\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}({\bm{x}},{\bm{y}})
 = \int_{\mathbb{R}^{N_{Y}}}\int_{\mathbb{R}^{N_{X}}}q({\bm{x}})\text{d}\mu^{{\bm{g}}_{n}}_{{\bm{X}}|{\bm{Y}}}({\bm{x}}|{\bm{y}})\text{d}\mu_{{\bm{Y}}_{\sigma}}({\bm{y}})
 = \int_{\mathbb{R}^{N_{Y}}}k^{{\bm{g}}_{n}}({\bm{y}};q)\text{d}\mu_{{\bm{Y}}_{\sigma}}({\bm{y}})
 = \mathbb{E}_{\mu_{{\bm{Y}}_{\sigma}}}\left[k^{{\bm{g}}_{n}}({\bm{Y}};q)\right].

By Assumption 4.2 and Lemma 4.1, we have for n\geq N_{b}

\lim_{\sigma\rightarrow 0}\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}}\left[q({\bm{X}})\right]=k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q). (28)

Thus, there exists \bar{\sigma}_{3}:=\bar{\sigma}_{3}(\epsilon,q,\hat{{\bm{y}}}) such that for n\geq N_{b}

|\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}}\left[q({\bm{X}})\right]-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)|<\frac{\epsilon}{4}\quad\forall\ \sigma\leq\bar{\sigma}_{3}. (29)

We set \sigma=\bar{\sigma}:=\min(\bar{\sigma}_{1},\bar{\sigma}_{2},\bar{\sigma}_{3}) and consider the corresponding \ell_{\bar{\sigma}}. Thus, by Eq. 13, there exists a positive integer \bar{N}_{1}=\bar{N}_{1}(\bar{\sigma}) such that

|\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\bar{\sigma}}({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\bar{\sigma}}({\bm{X}},{\bm{Y}})\right]|<\frac{\epsilon p_{\bm{Y}}(\hat{{\bm{y}}})}{4}\quad\forall\ n\geq\bar{N}_{1}. (30)

By setting \bar{N}=\max(\bar{N}_{1},N_{b}), we have, using inequalities 26, 27, 29 and 30, that for all n\geq\bar{N}

\left|k(\hat{{\bm{y}}};q)-\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\bar{\sigma}}}}\left[q({\bm{X}})\right]\right| \leq \left|k(\hat{{\bm{y}}};q)-\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\bar{\sigma}}({\bm{X}},{\bm{Y}})\right]\right|
 +\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\left|\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\bar{\sigma}}({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\bar{\sigma}}({\bm{X}},{\bm{Y}})\right]\right|
 +\left|\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\mathbb{E}_{\mu^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}}}\left[\ell_{\bar{\sigma}}({\bm{X}},{\bm{Y}})\right]-k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)\right|
 +\left|k^{{\bm{g}}_{n}}(\hat{{\bm{y}}};q)-\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\bar{\sigma}}}}\left[q({\bm{X}})\right]\right|
 < \frac{\epsilon}{4}+\frac{1}{p_{\bm{Y}}(\hat{{\bm{y}}})}\frac{\epsilon p_{\bm{Y}}(\hat{{\bm{y}}})}{4}+\frac{\epsilon}{4}+\frac{\epsilon}{4}=\epsilon. ∎

Note that

\mathbb{E}_{\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}}\left[q({\bm{X}})\right] = \int_{\mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}}}q({\bm{x}})\text{d}\hat{\mu}^{{\bm{g}}_{n}}_{{\bm{X}}{\bm{Y}}_{\sigma}}({\bm{x}},{\bm{y}}) (31)
 = \int_{\mathbb{R}^{N_{Y}}}\int_{\mathbb{R}^{N_{X}}}q({\bm{x}})\text{d}\mu^{{\bm{g}}_{n}}_{{\bm{X}}|{\bm{Y}}}({\bm{x}}|{\bm{y}})\text{d}\mu_{{\bm{Y}}_{\sigma}}({\bm{y}})
 = \int_{\mathbb{R}^{N_{Y}}}\int_{\mathbb{R}^{N_{Z}}}q({\bm{g}}_{n}({\bm{z}},{\bm{y}}))\text{d}\mu_{\bm{Z}}({\bm{z}})\text{d}\mu_{{\bm{Y}}_{\sigma}}({\bm{y}}).

Taking the Monte Carlo approximation of Eq. 31 leads to the approximation formula Eq. 12 in our algorithm, with the choice of \sigma motivated by the proof of Theorem 4.2. In particular, to get a reasonable approximation of the conditional expectation, we need to pick a \sigma=\bar{\sigma} small enough to ensure that inequalities 26, 27 and 29 are satisfied, and a generator large (expressive) enough to ensure that inequality 30 is satisfied.

4.1 Comparison with the Partial-GP approach of [22]

We briefly discuss the theoretical inferences of the Partial-GP approach considered for cWGANs in [22], and how it differs from the Full-GP approach. Let us denote by \text{Lip}_{X} the space of 1-Lipschitz functions on \mathbb{R}^{N_{X}}, and by \text{Lip}^{\prime}_{X} the space of functions measurable on \mathbb{R}^{N_{X}}\times\mathbb{R}^{N_{Y}} that are 1-Lipschitz in \mathbb{R}^{N_{X}} for all {\bm{y}}. Clearly, \text{Lip}\subset\text{Lip}^{\prime}_{X}, and d\in\text{Lip}^{\prime}_{X} implies d(.,{\bm{y}})\in\text{Lip}_{X} for any {\bm{y}}\in\Omega_{Y}. The authors of [22] solve the maximization problem (7) with (10) but over the larger space \text{Lip}^{\prime}_{X}:

d^{*}({\bm{g}}) = \underset{d\in\text{Lip}^{\prime}_{X}}{\text{argmax}}\ \left[\mathcal{L}({\bm{g}},d)-\lambda\mathcal{G}\mathcal{P}\right] (32)
 = \underset{d\in\text{Lip}^{\prime}_{X}}{\text{argmax}}\ \big{\{}\mathbb{E}_{\mu_{{\bm{X}}{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})\right]-\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})\right]\big{\}}
 = \underset{d\in\text{Lip}^{\prime}_{X}}{\text{argmax}}\ \mathbb{E}_{\mu_{\bm{Y}}}\left[\mathbb{E}_{\mu_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})|{\bm{Y}}\right]-\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})|{\bm{Y}}\right]\right].

The authors then proceed to prove that the \underset{d\in\text{Lip}_{X}}{\text{argmax}} and \mathbb{E}_{\mu_{\bm{Y}}} operators in Eq. 32 may be interchanged under the following technical assumption.

Assumption 4.4.

Given a generator {\bm{g}} and \epsilon>0, there exists a measurable mapping D:({\bm{x}},{\bm{y}})\mapsto\hat{d}_{\bm{y}}({\bm{x}}) such that for all {\bm{y}}\in\Omega_{Y} it holds that \hat{d}_{\bm{y}}\in\text{Lip}_{X} and

\mathbb{E}_{\mu_{{\bm{X}}|{\bm{Y}}}}\left[\hat{d}_{\bm{y}}({\bm{X}})|{\bm{y}}\right]-\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}}\left[\hat{d}_{\bm{y}}({\bm{X}})|{\bm{y}}\right]>\max\limits_{d\in\text{Lip}^{\prime}_{X}}\left\{\mathbb{E}_{\mu_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})|{\bm{y}}\right]-\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})|{\bm{y}}\right]\right\}-\epsilon.

We note that Assumption 4.4 may not be easy to verify, nor is it easy to interpret the mapping D.

Thereafter, by an application of the Kantorovich–Rubinstein duality argument, they obtain

d^{*}({\bm{g}}) = \mathbb{E}_{\mu_{\bm{Y}}}\left[\underset{d\in\text{Lip}^{\prime}_{X}}{\text{argmax}}\ \left\{\mathbb{E}_{\mu_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})|{\bm{Y}}\right]-\mathbb{E}_{\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}}}\left[d({\bm{X}},{\bm{Y}})|{\bm{Y}}\right]\right\}\right] (33)
 = \mathbb{E}_{\mu_{\bm{Y}}}\left[W_{1}(\mu_{{\bm{X}}|{\bm{Y}}},\mu^{{\bm{g}}}_{{\bm{X}}|{\bm{Y}}})\right],

which is the 'expected' (mean) W_{1} metric with respect to \mu_{\bm{Y}} between the true conditional measure and the learned conditional measure. Thus, finding the optimal generator {\bm{g}}^{*} corresponds to minimizing this 'expected' (mean) W_{1} metric between the conditionals with respect to the true measure \mu_{\bm{Y}}. However, finding a sequence of generators {\bm{g}}_{n}^{*} such that

\lim_{n\rightarrow\infty}\mathbb{E}_{\mu_{\bm{Y}}}\left[W_{1}(\mu_{{\bm{X}}|{\bm{Y}}},\mu^{{\bm{g}}_{n}^{*}}_{{\bm{X}}|{\bm{Y}}})\right]=0 (34)

does not guarantee weak convergence of the measures \mu^{{\bm{g}}_{n}^{*}}_{{\bm{X}}|{\bm{Y}}} to \mu_{{\bm{X}}|{\bm{Y}}}. In fact, by the properties of convergence in L^{1}(\mu_{\bm{Y}}), Eq. 34 would only guarantee the existence of a sub-sequence of measures \mu^{{\bm{g}}^{*}_{n_{k}}}_{{\bm{X}}|{\bm{Y}}} such that \lim_{k\rightarrow\infty}W_{1}(\mu_{{\bm{X}}|{\bm{Y}}},\mu^{{\bm{g}}_{n_{k}}^{*}}_{{\bm{X}}|{\bm{Y}}})=0, which in turn would lead to the weak-convergence result

\lim_{k\rightarrow\infty}\mathbb{E}_{\mu^{{\bm{g}}^{*}_{n_{k}}}_{{\bm{X}}|{\bm{Y}}}}\left[q({\bm{X}})|\hat{{\bm{y}}}\right]=\mathbb{E}_{\mu_{{\bm{X}}|{\bm{Y}}}}\left[q({\bm{X}})|\hat{{\bm{y}}}\right]\quad\forall\ q\in C_{b}(\mathbb{R}^{N_{X}}).

In contrast to Partial-GP, Full-GP provides an alternate route to prove weak convergence. The primary difference stems from formulating the maximization Eq. 7 over \text{Lip}\subset\text{Lip}^{\prime}_{X}. First, this allows us to consider the weak convergence of the learnt joint measures \mu^{{\bm{g}}_{n}^{*}}_{{\bm{X}}{\bm{Y}}} to the true joint measure \mu_{{\bm{X}}{\bm{Y}}} without requiring a technical assumption like Assumption 4.4 or a sub-sequential argument. Then, by using the fact that a Gaussian measure closely resembles a Dirac measure in the asymptotic limit of diminishing variance (Lemma 4.1), we are able to derive error estimates for approximating the conditional expectation in Eq. 11 by sampling with the trained generator (Theorems 4.1 and 4.2).

5 Numerical experiments

In this section, several numerical experiments are presented to analyze the performance of the proposed cWGANs with Full-GP. The experiments include: one-dimensional problems where the target measure is known; an inverse heat conduction problem where the reference statistics of the target measure are available [23]; and a complex inverse problem in elastography where the target measure is unknown. For the first two problems, where the target statistics are available and we have a reliable reference to estimate error, we use both the Partial-GP [22] (see Eq. 10) and Full-GP (see Eq. 8) approaches to compare the two methods.

5.1 Illustrative examples

We first analyze the performance of the proposed methodology using one-dimensional random variables X and Y (the paired random variable (X,Y) being two-dimensional) where the target probability measures are available. These are defined as follows:

\text{Tanh + $\Gamma$}: \quad x=\operatorname{tanh}(y)+\gamma, \text{ where } \gamma\sim\Gamma(1,0.3) \text{ and } y\sim U(-2,2) (35)
\text{Bimodal}: \quad x=(y+w)^{1/3}, \text{ where } y\sim\mathcal{N}(0,1) \text{ and } w\sim\mathcal{N}(0,1) (36)
\text{Swissroll}: \quad x=0.1t\sin(t)+0.1w,\ y=0.1t\cos(t)+0.1v,\ t=3\pi/2(1+2h), \text{ where } h\sim U(0,1),\ w\sim\mathcal{N}(0,1) \text{ and } v\sim\mathcal{N}(0,1) (37)

Here \mathcal{N}, U and \Gamma refer to the Normal, Uniform, and Gamma distributions, respectively. Column 1 of Fig. 1 shows the true joint distribution \mu_{XY} corresponding to Eqs. 35, 36 and 37. We consider the problem of learning the distribution of x conditioned on y. For all three problems, we construct a training dataset composed of 2,000 samples of x and y. For this set of problems, we compare the performance of cWGANs with Partial-GP and Full-GP. For the cWGAN with Full-GP we consider two cases: \sigma=0 (i.e., k(\hat{y};q) is estimated as \sum_{i=1}^{K}q({\bm{g}}^{*}(z_{i},\hat{y}))/K for a given y=\hat{y}), and an optimally chosen \sigma (i.e., k(\hat{y};q) estimated using Eq. 12). The optimal value of \sigma is chosen such that the batch Wasserstein-1 distance, \tilde{W}_{1}, between \mu^{{\bm{g}}}_{X|Y}(\cdot|\hat{y}) and \mu_{X|Y}(\cdot|\hat{y}) attains a minimum; we estimate the \tilde{W}_{1} distance using the open source POT package [29] and 10^4 realizations. Moreover, \sigma is scaled by \sigma_{y} such that \sigma/\sigma_{y}=\sigma^{*}, where \sigma_{y} is the standard deviation of the random variable Y. In practice, it is easier to work with \sigma^{*} rather than \sigma since the former is a dimensionless, scaled quantity that can be compared across different examples.
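A minimal NumPy sketch of how the paired training samples for Eqs. 35-37 can be drawn is given below. The sample count matches the 2,000 used here, while the random seed and the interpretation of \Gamma(1,0.3) as a shape-scale pair are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_s = 2000

def tanh_gamma(n):
    y = rng.uniform(-2.0, 2.0, size=n)
    x = np.tanh(y) + rng.gamma(shape=1.0, scale=0.3, size=n)   # gamma ~ Gamma(1, 0.3)
    return x, y

def bimodal(n):
    y = rng.standard_normal(n)
    w = rng.standard_normal(n)
    x = np.cbrt(y + w)                                         # x = (y + w)^(1/3)
    return x, y

def swissroll(n):
    h = rng.uniform(0.0, 1.0, size=n)
    t = 1.5 * np.pi * (1.0 + 2.0 * h)
    x = 0.1 * t * np.sin(t) + 0.1 * rng.standard_normal(n)
    y = 0.1 * t * np.cos(t) + 0.1 * rng.standard_normal(n)
    return x, y

datasets = {name: fn(N_s) for name, fn in
            [("tanh_gamma", tanh_gamma), ("bimodal", bimodal), ("swissroll", swissroll)]}
```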

In the cWGANs, for both Full-GP and Partial-GP, we use multilayer perceptrons (MLPs) for the critic and generator networks and perform a hyper-parameter search (including for the gradient penalty weight λ\lambda) to determine the optimal configuration for each method. We provide further description of the problem set-up and additional details necessary to reproduce the results in Section A1.
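For concreteness, the following is a minimal PyTorch sketch of MLP generator and critic networks of the kind referred to above; the widths, depths and activations shown are assumptions and not the configurations reported in Section A1.

```python
import torch
import torch.nn as nn

class MLPGenerator(nn.Module):
    """g: (z, y) -> x, cf. Eq. 2."""
    def __init__(self, n_z=1, n_y=1, n_x=1, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_z + n_y, width), nn.LeakyReLU(0.2),
            nn.Linear(width, width), nn.LeakyReLU(0.2),
            nn.Linear(width, n_x))

    def forward(self, z, y):
        return self.net(torch.cat([z, y], dim=-1))

class MLPCritic(nn.Module):
    """d: (x, y) -> r, cf. Eq. 3."""
    def __init__(self, n_x=1, n_y=1, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_x + n_y, width), nn.LeakyReLU(0.2),
            nn.Linear(width, width), nn.LeakyReLU(0.2),
            nn.Linear(width, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1))
```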

Figure 1: The leftmost column shows histograms generated from 10^6 samples from the true joint measure \mu_{XY}. The other columns show histograms of 10^5 samples generated from the generated conditional measures \mu^{g}_{X|Y}(\cdot|\hat{y}) and, in red, the outline of histograms of samples from the true conditional measure \mu_{X|Y}(\cdot|\hat{y})

Fig. 1 shows the results when a cWGAN with Full-GP or Partial-GP attempts to learn the conditional measure \mu_{X|Y}(\cdot|\hat{y}) for the problems defined in Eqs. 35, 36 and 37, corresponding to two separate values of \hat{y}. Columns 2-7 show histograms of the samples drawn from \mu^{g}_{X|Y}(\cdot|\hat{y}) and, in red, the outline of the histogram of samples drawn from the true conditional distribution \mu_{X|Y}(\cdot|\hat{y}). Columns 2 and 3 correspond to the cWGAN with Partial-GP, columns 4 and 5 correspond to Full-GP with \sigma=0, and columns 6 and 7 correspond to Full-GP with optimally chosen \sigma=\sigma^{*}\sigma_{y}. Although the difference between the histograms for the distribution of x conditioned on y obtained using the various approaches is minor, we observe that in all three cases the Full-GP approach yields histograms of the conditional distribution that are closer to their corresponding reference. The histograms obtained using cWGANs with Full-GP where \sigma/\sigma_{y}=\sigma^{*} appear to be the best match to the reference. This improvement in performance may be attributed to the smoothing that occurs as a consequence of adding noise to the input y. Also note that, due to the limited quantity of training data (2,000 training samples), it is not possible to approximate the target measure very closely.

We quantify the performance of each method using the following two metrics: the batch Wasserstein-1 distance \tilde{W}_{1}, which is the same metric that is employed to choose \sigma^{*}, and the relative L_{2} error between estimates of the target and generated joint distributions. We define the relative L_{2} error as follows:

\text{Relative $L_{2}$ error}=\int_{X\times Y}(\hat{p}_{XY}-\hat{p}^{g}_{XY})^{2}\text{d}x\text{d}y\;\;\bigg{/}\int_{X\times Y}(\hat{p}_{XY})^{2}\text{d}x\text{d}y,

where \hat{p}_{XY} and \hat{p}^{g}_{XY} denote estimates of the true and generated joint distributions, respectively. We estimate the relative L_{2} errors using 10^6 realizations.
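A minimal sketch of how these two metrics can be computed is shown below, using the 1-D solver of the POT package for \tilde{W}_{1} and histogram estimates of the joint densities for the relative L_{2} error; the bin count is an assumption.

```python
import numpy as np
import ot  # Python Optimal Transport (POT) package

def batch_w1(samples_true, samples_gen):
    """Batch Wasserstein-1 distance between 1-D samples of the conditional measures."""
    return ot.wasserstein_1d(samples_true, samples_gen, p=1)

def relative_l2(xy_true, xy_gen, bins=100):
    """Relative L2 error between histogram estimates of the true and generated joints."""
    rng_x = (min(xy_true[:, 0].min(), xy_gen[:, 0].min()),
             max(xy_true[:, 0].max(), xy_gen[:, 0].max()))
    rng_y = (min(xy_true[:, 1].min(), xy_gen[:, 1].min()),
             max(xy_true[:, 1].max(), xy_gen[:, 1].max()))
    p_true, _, _ = np.histogram2d(xy_true[:, 0], xy_true[:, 1], bins=bins,
                                  range=[rng_x, rng_y], density=True)
    p_gen, _, _ = np.histogram2d(xy_gen[:, 0], xy_gen[:, 1], bins=bins,
                                 range=[rng_x, rng_y], density=True)
    return np.sum((p_true - p_gen) ** 2) / np.sum(p_true ** 2)
```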

Table 1 shows the minimum error achieved over four independent runs for each method and joint measure. For all three problems, cWGANs with Full-GP outperform cWGANs with Partial-GP on both metrics. Another observation that can be made from Table 1 is that the joint distributions induced by cWGANs with Full-GP when \sigma\neq 0 have smaller \tilde{W}_{1} distances compared to the case where \sigma=0, which is expected since this is the same metric that we use to choose \sigma; however, the influence of \sigma on the relative L_{2} error is negligible.

Table 1: Error metrics between the true joint measure and the approximation obtained using different cWGANs for the illustrative examples

Example      | Partial-GP                 | Full-GP (\sigma/\sigma_{y}=0) | Full-GP (\sigma/\sigma_{y}=\sigma^{*})
             | \tilde{W}_{1} / rel. L_{2} | \tilde{W}_{1} / rel. L_{2}    | \tilde{W}_{1} / rel. L_{2}
Tanh+\Gamma  | 0.0429 / 0.192             | 0.0427 / 0.131                | 0.0390 / 0.134
Bimodal      | 0.0509 / 0.201             | 0.0509 / 0.161                | 0.0471 / 0.163
Swissroll    | 0.0477 / 0.360             | 0.0483 / 0.284                | 0.0399 / 0.281

Fig. 2 shows the variation in the \tilde{W}_{1} distance between \mu_{X|Y}(\cdot|\hat{y}) and \mu^{g}_{X|Y}(\cdot|\hat{y}) (for the same instances of \hat{y} as shown in Fig. 1) as the non-dimensional quantity \sigma/\sigma_{y} is varied. The blue curve corresponds to Full-GP with \sigma=0 and the red curve corresponds to Partial-GP. Fig. 2 shows that the change in the \tilde{W}_{1} distance is negligible for \sigma/\sigma_{y}<0.1, which may explain the limited influence of \sigma. Therefore, in a practical problem where the true conditional density is not known, \sigma need not be optimized as long as a small value is chosen. Further, we observe that the \tilde{W}_{1} distance between \mu_{X|Y}^{g}(\cdot|\hat{y}) and \mu_{X|Y}(\cdot|\hat{y}) using the Partial-GP method is greater than or similar to that using Full-GP for all problems at both instances of \hat{y} considered here.

Figure 2: Batch Wasserstein-1 distance, \tilde{W}_{1}, vs. \sigma for cWGAN models with Full-GP. For reference, the \tilde{W}_{1} distance for cWGANs with Partial-GP is shown using a dashed red line

5.2 Inverse heat conduction

We consider the following two-dimensional heat conduction problem

\frac{\partial u({\bm{s}},t)}{\partial t}-\kappa\Delta u({\bm{s}},t) = 0, \qquad \forall\ ({\bm{s}},t)\in[0,2\pi]^{2}\times(0,T) (38)
u({\bm{s}},0) = u_{0}({\bm{s}}), \qquad \forall\ {\bm{s}}\in[0,2\pi]^{2} (39)
u({\bm{s}},t) = 0, \qquad \forall\ ({\bm{s}},t)\in\partial[0,2\pi]^{2}\times(0,T) (40)

where u denotes the temperature field, \kappa denotes the conductivity (which is set to 0.64), and u_{0} denotes the initial temperature field. Given a noisy measurement of the temperature field at the final time T=1, we wish to infer the initial condition u_{0}. We remark that this inverse problem is severely ill-posed due to the significant loss of information via diffusion [30].

For the Bayesian framework, we assume the following prior on the initial data

u_{0}({\bm{s}})=\begin{cases}2+2\frac{(s_{1}-\xi_{1})}{(\xi_{3}-\xi_{1})}&\quad\text{if }s_{1}\in[\xi_{1},\xi_{3}],\ s_{2}\in[\xi_{4},\xi_{2}],\\ 0&\quad\text{otherwise},\end{cases} (41)

with $\xi_1, \xi_4$ sampled from the uniform distribution on $[0.2, 0.4]$, and $\xi_2, \xi_3$ sampled uniformly from $[0.6, 0.8]$. The prior distribution defined by Eq. 41 corresponds to a linearly varying temperature field over a rectangular region of random size and position, with a zero background outside the rectangle. For the present inverse problem, we represent the discrete initial temperature field (to be inferred) on a $28 \times 28$ Cartesian grid by the random variable $\bm{X}$. The final temperature field is computed by solving Eqs. (38), (39) and (40) using a central-space, backward-time finite difference scheme on the same grid. We then add noise sampled from $\mathcal{N}(\bm{0}, \bm{I})$ to generate a realization of the noisy measurement variable $\bm{Y}$. Figure 3 depicts a few of the paired samples $(\bm{x}, \bm{y})$ that form the dataset.
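The listing below is a minimal sketch (not the authors' code) of this data-generation pipeline: it samples the prior of Eq. (41), evolves the field with an implicit (backward-time, central-space) finite-difference scheme, and adds unit-variance Gaussian noise. The interpretation of the $\xi_i$ as fractions of the domain length and the number of time steps are assumptions made purely for illustration.

```python
import numpy as np
from scipy.sparse import identity, kron, diags
from scipy.sparse.linalg import splu

N, L, kappa, T, n_steps = 28, 2 * np.pi, 0.64, 1.0, 50
h, dt = L / (N + 1), T / n_steps
s = np.linspace(h, L - h, N)                 # interior grid points (zero Dirichlet BC)
S1, S2 = np.meshgrid(s, s, indexing="ij")

# 1D second-difference operator and the 2D Laplacian via Kronecker products
D = diags([1, -2, 1], [-1, 0, 1], shape=(N, N)) / h**2
Lap = kron(identity(N), D) + kron(D, identity(N))
solver = splu((identity(N * N) - dt * kappa * Lap).tocsc())   # backward-Euler step

def sample_pair(rng):
    xi1, xi4 = rng.uniform(0.2, 0.4, size=2) * L   # assumed: xi_i are fractions of L
    xi2, xi3 = rng.uniform(0.6, 0.8, size=2) * L
    inside = (S1 >= xi1) & (S1 <= xi3) & (S2 >= xi4) & (S2 <= xi2)
    x = np.where(inside, 2 + 2 * (S1 - xi1) / (xi3 - xi1), 0.0)   # prior sample, Eq. (41)
    u = x.ravel()
    for _ in range(n_steps):                       # implicit time stepping to t = T
        u = solver.solve(u)
    y = u.reshape(N, N) + rng.standard_normal((N, N))            # noisy measurement
    return x, y

rng = np.random.default_rng(0)
dataset = [sample_pair(rng) for _ in range(10)]    # 10,000 pairs in the paper
```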

Figure 3: Samples from the rectangular prior dataset for the inverse heat equation
Figure 4: Test $(\bm{x},\bm{y})$ sample for the inverse heat equation
Figure 5: Comparing the mean and SD obtained with the Partial-GP and Full-GP models to the reference fields for the inverse heat equation
Table 2: $L^2$ error in posterior statistics obtained using different types of cWGANs for the inverse heat equation
Posterior statistic | Partial-GP | Full-GP ($\sigma/\sigma_Y = 0$) | Full-GP ($\sigma/\sigma_Y = 0.31$)
Mean | 0.441 | 0.420 | 0.406
SD | 0.460 | 0.454 | 0.435

Based on the experiments performed in [23], we consider cWGAN models for the Partial-GP and Full-GP approaches with $N_Z = 3$. Details of the GAN architectures are available in Appendix A2. Both cWGANs are trained on a dataset of 10,000 sample pairs. After a thorough sweep over the gradient-penalty parameter $\lambda$, we found $\lambda = 10$ to be optimal for the Partial-GP approach and $\lambda = 0.1$ for the Full-GP approach. Both cWGANs are tested on the unseen pair shown in Figure 4. Following [23], the reference posterior pixel-wise mean and standard deviation (SD) of $\bm{x}$ given $\bm{y}$ for this test sample (depicted in the first column of Figure 5) are calculated using Monte Carlo sampling of the random variables $\xi_i$. As can be seen in Figure 5, the pixel-wise statistics are captured better with the Full-GP approach than with the Partial-GP approach. This is corroborated by the $L^2$ error of these statistics listed in Table 2, with the smallest errors observed for $\sigma/\sigma_Y = 0.31$, where $\sigma$ is the sampling parameter for $\mu_{\bm{Y}_\sigma}$ and $\sigma_Y = 1.02$ is the mean pixel-wise standard deviation over the $\bm{y}$ samples in the training set. Note that the $L^2$ error here is evaluated as

$\|\text{Mean}_{\text{pred}} - \text{Mean}_{\text{ref}}\|_{2} = \frac{1}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \left([\text{Mean}_{\text{pred}}]_{i,j} - [\text{Mean}_{\text{ref}}]_{i,j}\right)^2,$

and similarly for the standard deviation. We also perform a sweep over $\sigma$ to examine how this sampling parameter influences the errors. As shown in Figure 6, the errors are comparable for $\sigma/\sigma_Y < 0.4$ while remaining smaller than those observed with the Partial-GP approach. As expected, the errors increase significantly for larger values of the sampling parameter.
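For completeness, the short sketch below shows how the pixel-wise posterior statistics and the error metric defined above can be computed from generated samples; the array shapes and function names are illustrative.

```python
import numpy as np

def posterior_stats(samples):
    """Pixel-wise mean and SD from posterior samples of shape (n, NX, NY)."""
    return samples.mean(axis=0), samples.std(axis=0)

def l2_error(pred, ref):
    """Error metric used above: mean of the squared pixel-wise differences."""
    return np.mean((pred - ref) ** 2)
```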

Figure 6: $L^2$ errors in the posterior mean and SD with the Full-GP approach as $\sigma$ is varied. The dashed horizontal lines indicate the errors with the Partial-GP model

5.3 Inverse Helmholtz equation

Lastly, we use cWGANs with Full-GP to solve an inverse problem governed by the Helmholtz equation, which is extensively used to model the propagation of waves in elastography applications [31, 32]. In such applications, measured wave fields are used to infer the mechanical properties of the propagation medium. The forward relation is given by the Helmholtz equation:

$-\omega^2(\hat{\mathbf{u}}_R + i\hat{\mathbf{u}}_I) - \nabla\cdot\left(\frac{G}{\rho}(1 + i\alpha\omega)\,\nabla(\hat{\mathbf{u}}_R + i\hat{\mathbf{u}}_I)\right) = 0, \qquad \forall\ (\hat{\mathbf{u}}_R, \hat{\mathbf{u}}_I, G) \in [0,1]^2 \times [0,1]^2 \times [0,1]^2,$   (42)

where $\omega$ denotes the wave propagation frequency, $\hat{\mathbf{u}}_R$ and $\hat{\mathbf{u}}_I$ denote the real and imaginary components of the wave amplitude field at frequency $\omega$, $G$ denotes the shear modulus field, $\alpha$ is the wave dissipation coefficient, and $\rho$ denotes the density. The physical domain of interest is 1.75 mm $\times$ 1.75 mm. However, we model a larger domain that contains the domain of interest in order to allow for wave dissipation and avoid reflections: the left edge is padded by 2.6 mm, 1.75 mm is added to the top and bottom edges, and the right edge is not padded. On the right boundary we impose $\hat{\bm{u}}_R = 0.02$ mm and $\hat{\bm{u}}_I = 0$; on all other boundaries of the expanded domain we impose $\hat{\bm{u}}_R = \hat{\bm{u}}_I = 0$. Further, we assume that $\alpha = 5 \times 10^{-5}$ and $\rho = 1000$ kg/m$^3$ are constant over the entire physical domain. The inverse problem we wish to solve consists of inferring the shear modulus $G$ from the wave amplitude fields $\hat{\mathbf{u}}_R$ and $\hat{\mathbf{u}}_I$.

Figure 7: A realization of the optic nerve head (ONH) samples from the prior measure showing the (1) sclera, (2) lamina cribrosa, (3) pia mater, (4) optic nerve, and (5) retina

In this particular example, we simulate the case in which the tissue displacement within a human optic nerve head (ONH) is measured using ultrasound waves. Fig. 7 shows the typical geometry of an ONH considered in this example and delineates its various parts. The prior distribution $\mu_{\bm{X}}$ controls the geometry of the ONH and its shear modulus field $G$. Table 3 lists the 16 parameters that make up the prior distribution; their values are adapted from the literature [33, 34, 35, 36].

Table 3: Random variables comprising the prior measure for the optic nerve head (ONH)
Parameter | Distribution/value
Width of lamina cribrosa (mm) | $\mathcal{U}(1.1, 2.7)$
Thickness of lamina cribrosa (mm) | $\mathcal{U}(0.16, 0.44)$
Radius of lamina cribrosa (mm) | $\mathcal{U}(1.0, 5.0)$
Thickness of the sclera (mm) | $\mathcal{U}(0.45, 1.15)$
Radius of the sclera and retina (mm) | $\mathcal{U}(1.0, 5.0)$
Thickness of retina (mm) | $\mathcal{U}(0.20, 0.40)$
Width of optic nerve (mm) | $\mathcal{U}(0.20, 0.40)$
Radius of optic nerve (mm) | $\mathcal{U}(1.65, 3.65)$
Thickness of pia mater (mm) | $\mathcal{U}(0.06, 0.10)$
Optic nerve shear modulus (kPa) | $\mathcal{N}(9.8, 3.34^2)$*
Sclera shear modulus (kPa) | $\mathcal{N}(125, 5^2)$*
Pia mater shear modulus (kPa) | $\mathcal{N}(125, 50^2)$*
Retina shear modulus (kPa) | $\mathcal{N}(9.8, 3.34^2)$*
Lamina cribrosa shear modulus (kPa) | $\mathcal{N}(73.1, 46.9^2)$*
Background shear modulus (kPa) | 0.1 (constant)
Rotation of the geometry (rad) | $\mathcal{U}(-\pi/12, \pi/12)$
* In this example, the normal distributions $\mathcal{N}(0,\varsigma^{2})$ are truncated to have support on $(0, 2\varsigma]$

Similar to the previous example, we discretize the shear modulus field $G$ over a $64 \times 64$ Cartesian grid and denote it by $\bm{X}$. The real and imaginary wave amplitude fields, $\hat{\mathbf{u}}_R$ and $\hat{\mathbf{u}}_I$, are also represented on $64 \times 64$ Cartesian grids. We solve the Helmholtz equation in Eq. 42 using FEniCS [37] and add isotropic Gaussian noise to the measurements; the added noise has a standard deviation equal to 4% of the maximum of $\hat{\mathbf{u}}_R$ and $\hat{\mathbf{u}}_I$ across all training samples. We denote the noisy $\hat{\mathbf{u}}_R$ and $\hat{\mathbf{u}}_I$ by $\bm{y}_1$ and $\bm{y}_2$, respectively; the random variable $\bm{Y}$ comprises $\bm{y}_1$ and $\bm{y}_2$. Finally, we construct a training dataset consisting of 12,000 $(\bm{x}, \bm{y})$ pairs; three such pairs are shown in Fig. 8. Reference posterior statistics, which would allow for a fair comparison of cWGANs, are not available for this problem, so we perform a qualitative study of the performance of the cWGAN with Full-GP. We use the same hyper-parameters for the cWGANs as in the inverse heat conduction example; see Appendix A for more details.
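As an illustration, the noisy measurements can be assembled as in the sketch below, which scales the noise to 4% of the maximum wave amplitude over the training set (taken here as the maximum absolute value, an assumption) and stacks the two noisy fields to form $\bm{y}$; array names and shapes are placeholders.

```python
import numpy as np

def make_measurements(u_R, u_I, rng, noise_frac=0.04):
    """u_R, u_I: wave amplitude fields of shape (n_samples, 64, 64).
    Returns y of shape (n_samples, 2, 64, 64) with isotropic Gaussian noise
    whose SD is noise_frac times the maximum amplitude over all samples."""
    sigma_noise = noise_frac * max(np.abs(u_R).max(), np.abs(u_I).max())
    y1 = u_R + sigma_noise * rng.standard_normal(u_R.shape)   # noisy real part
    y2 = u_I + sigma_noise * rng.standard_normal(u_I.shape)   # noisy imaginary part
    return np.stack([y1, y2], axis=1)
```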

Figure 8: Realizations from the prior distribution of $\bm{X}$ for the ONH shear modulus field (column 1), the corresponding wave amplitudes (columns 2 and 3), and realizations of $\bm{Y}$ (columns 4 and 5)

Figure 9 shows the predictions of the cWGAN for the three test samples shown in Figure 8. For each test sample, columns 1 and 2 show the noisy $\hat{\bm{u}}_R$ and $\hat{\bm{u}}_I$ measurements, respectively; column 3 shows the ground truth shear modulus $\bm{x}$; columns 4 and 5 show the corresponding posterior mean; and columns 6 and 7 show the posterior standard deviation. Results are reported for two values of the sampling parameter for $\mu_{\bm{Y}_\sigma}$: $\sigma = 0$ (columns 4 and 6 of Fig. 9) and $\sigma/\sigma_Y = 0.31$ (columns 5 and 7 of Fig. 9); the latter value of $\sigma/\sigma_Y$ is reused from the inverse heat conduction problem, and $\sigma_Y = 0.105$ is the mean pixel-wise standard deviation over the $\bm{y}$ samples in the training set. The posterior statistics are estimated using 3,000 samples. The posterior mean estimated using the cWGAN with Full-GP is similar to the ground truth for all three test samples and for both values of $\sigma$ considered here. Moreover, we observe large posterior standard deviations in areas where we expect the predictions to be uncertain, such as the edges of the ONH. However, the predicted posterior standard deviation for $\sigma/\sigma_Y = 0.31$ is comparatively larger than that for $\sigma = 0$. This may be attributed to the additional sampling variance injected into $\bm{y}$ when $\sigma/\sigma_Y = 0.31$.

Figure 9: Predictions from the cWGAN with Full-GP. Columns 1 & 2 are the measurements $\bm{y}$ and column 3 is the true shear modulus $\bm{x}$; columns 4 and 6 show the component-wise mean and standard deviation, respectively, of 3,000 samples obtained from the Full-GP cWGAN with $\sigma/\sigma_y = 0$; columns 5 and 7 are the corresponding results obtained with $\sigma/\sigma_y = 0.31$

6 Conclusions and Outlook

In this manuscript, a novel version of the cWGAN is developed for solving probabilistic inverse problems in which the forward operator is constrained by a physics-based model. The approach is particularly useful for problems where the prior information about the inferred vector is available only through samples, and the forward operator is known only as a black box. The cWGAN developed in this work differs from earlier versions in that its critic is required to be 1-Lipschitz with respect to both the inferred and the measurement vectors, and not just the former. This leads to a loss term with a gradient penalty for the critic that is computed with respect to all of its arguments. This simple change has a significant impact on the cWGAN. It allows us to prove that the conditional distribution learned by the cWGAN converges weakly to the true conditional distribution. It also leads to a new sampling strategy, wherein the output of the generator is computed for measurements sampled from a Gaussian distribution tightly centered around the true measurement. Through numerical examples it is demonstrated that the new cWGAN is more accurate than its predecessor, and that the new sampling strategy further improves its performance. The application of this approach to a challenging inverse problem motivated by applications in biomechanics is also described.
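For concreteness, the listing below gives a minimal PyTorch-style sketch of a penalty of this type, i.e., a gradient penalty on the critic computed with respect to both the inferred and the measurement inputs. The interpolation scheme (the standard WGAN-GP recipe applied to the inferred vector) and the function signature are illustrative assumptions and not necessarily the exact form used in this work.

```python
import torch

def full_gradient_penalty(critic, x_true, x_gen, y, lam):
    """Penalize deviations of the norm of the FULL gradient of the critic
    d(x, y), taken with respect to both x and y, from unity."""
    eps = torch.rand(x_true.size(0), *([1] * (x_true.dim() - 1)), device=x_true.device)
    x_hat = (eps * x_true + (1 - eps) * x_gen.detach()).requires_grad_(True)
    y_hat = y.clone().detach().requires_grad_(True)
    d = critic(x_hat, y_hat)
    grads = torch.autograd.grad(d.sum(), [x_hat, y_hat], create_graph=True)
    # norm of the gradient with respect to x and y jointly
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```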

The ideas described in this manuscript may be extended in several interesting directions. These include: (i) their application to more challenging and complex inference problems; (ii) their application to probabilistic operator networks [38, 39, 40] that map functions of one class to another (initial state to final state, or vice versa); and (iii) the use of more complex network architectures, such as transformers and self-attention networks [41], in the generator and critic networks of the cWGAN.

7 Acknowledgments

The authors acknowledge support from ARO, USA grant W911NF2010050 and from the Airbus Institute for Engineering Research at USC. The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California, USA for providing computing resources that have contributed to the research results reported within this publication.

References

  • Hadamard [1902] J. Hadamard, Sur les problèmes aux dérivées partielles et leur signification physique, Princeton University Bulletin (1902) 49–52.
  • Dashti and Stuart [2017] M. Dashti, A. M. Stuart, The Bayesian approach to inverse problems, in: Handbook of Uncertainty Quantification, Springer, 2017, pp. 311–428.
  • Chib [2001] S. Chib, Markov chain Monte Carlo methods: computation and inference, Handbook of econometrics 5 (2001) 3569–3649.
  • Brooks et al. [2011] S. Brooks, A. Gelman, G. Jones, X.-L. Meng, Handbook of Markov chain Monte Carlo, CRC press, 2011.
  • Liu [2001] J. S. Liu, Monte Carlo strategies in scientific computing, volume 75, Springer, 2001.
  • Gelman et al. [1997] A. Gelman, W. R. Gilks, G. O. Roberts, Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability 7 (1997) 110–120.
  • Cui et al. [2016] T. Cui, K. J. Law, Y. M. Marzouk, Dimension-independent likelihood-informed mcmc, Journal of Computational Physics 304 (2016) 109–137.
  • Huijser et al. [2015] D. Huijser, J. Goodman, B. J. Brewer, Properties of the affine invariant ensemble sampler in high dimensions, arXiv preprint arXiv:1509.02230 (2015).
  • Neal et al. [2011] R. M. Neal, et al., MCMC using Hamiltonian dynamics, Handbook of Markov chain Monte Carlo 2 (2011) 2.
  • Betancourt [2017] M. Betancourt, A conceptual introduction to Hamiltonian Monte Carlo, arXiv preprint arXiv:1701.02434 (2017).
  • Murphy [2023] K. P. Murphy, Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023. URL: http://probml.github.io/book2.
  • Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural network, in: International Conference on Machine Learning, PMLR, 2015, pp. 1613–1622.
  • Jospin et al. [2022] L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, M. Bennamoun, Hands-on Bayesian neural networks—A tutorial for deep learning users, IEEE Computational Intelligence Magazine 17 (2022) 29–48.
  • Lampinen and Vehtari [2001] J. Lampinen, A. Vehtari, Bayesian approach for neural networks—review and case studies, Neural Networks 14 (2001) 257–274.
  • Fortuin [2022] V. Fortuin, Priors in Bayesian deep learning: A review, International Statistical Review 90 (2022) 563–591.
  • Tran et al. [2022] B.-H. Tran, S. Rossi, D. Milios, M. Filippone, All you need is a good functional prior for Bayesian deep learning, Journal of Machine Learning Research 23 (2022) 1–56.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • Arjovsky et al. [2017] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, PMLR, 2017, pp. 214–223.
  • Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of wasserstein gans, in: Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
  • Patel et al. [2022] D. V. Patel, D. Ray, A. A. Oberai, Solution of physics-based Bayesian inverse problems with deep generative priors, 2022.
  • Mirza and Osindero [2014] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014. URL: https://arxiv.org/abs/1411.1784. arXiv:1411.1784.
  • Adler and Öktem [2018] J. Adler, O. Öktem, Deep Bayesian inversion, arXiv preprint arXiv:1811.05910 (2018). URL: https://arxiv.org/abs/1811.05910.
  • Ray et al. [2022] D. Ray, H. Ramaswamy, D. V. Patel, A. A. Oberai, The efficacy and generalizability of conditional GANs for posterior inference in physics-based inverse problems, Numerical Algebra, Control and Optimization (2022). URL: https://doi.org/10.3934/naco.2022038. doi:10.3934/naco.2022038.
  • Kovachki et al. [2020] N. Kovachki, R. Baptista, B. Hosseini, Y. Marzouk, Conditional sampling with monotone GANs, arXiv preprint arXiv:2006.06755 (2020).
  • Ronneberger et al. [2015] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, Springer, 2015, pp. 234–241.
  • Villani [2008] C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften, Springer Berlin Heidelberg, 2008.
  • Barbone and Oberai [2010] P. E. Barbone, A. A. Oberai, A review of the mathematical and computational foundations of biomechanical imaging, Computational Modeling in Biomechanics (2010) 375–408.
  • Kingma and Ba [2017] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. URL: https://arxiv.org/abs/1412.6980. arXiv:1412.6980.
  • Flamary et al. [2021] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, T. Vayer, POT: Python optimal transport, Journal of Machine Learning Research 22 (2021) 1–8. URL: http://jmlr.org/papers/v22/20-451.html.
  • Engl et al. [2000] H. W. Engl, M. Hanke, A. Neubauer, Regularization of Inverse Problems, Springer Dordrecht, 2000.
  • McLaughlin and Renzi [2006] J. McLaughlin, D. Renzi, Shear wave speed recovery in transient elastography and supersonic imaging using propagating fronts, Inverse Problems 22 (2006) 681.
  • Zhang et al. [2012] Y. Zhang, A. A. Oberai, P. E. Barbone, I. Harari, Solution of the time-harmonic viscoelastic inverse problem with interior data in two dimensions, International Journal for Numerical Methods in Engineering 92 (2012) 1100–1116.
  • Sigal [2009] I. A. Sigal, Interactions between geometry and mechanical properties on the optic nerve head, Investigative ophthalmology & visual science 50 (2009) 2785–2795.
  • Şahan et al. [2019] M. H. Şahan, A. Doğan, M. İnal, M. Alpua, N. Asal, Evaluation of the optic nerve by strain and shear wave elastography in patients with migraine, Journal of Ultrasound in Medicine 38 (2019) 1153–1161.
  • Qian et al. [2021] X. Qian, R. Li, G. Lu, L. Jiang, H. Kang, K. K. Shung, M. S. Humayun, Q. Zhou, Ultrasonic elastography to assess biomechanical properties of the optic nerve head and peripapillary sclera of the eye, Ultrasonics 110 (2021) 106263.
  • Zhang et al. [2020] L. Zhang, M. R. Beotra, M. Baskaran, T. A. Tun, X. Wang, S. A. Perera, N. G. Strouthidis, T. Aung, C. Boote, M. J. Girard, In vivo measurements of prelamina and lamina cribrosa biomechanical properties in humans, Investigative Ophthalmology & Visual Science 61 (2020) 27–27.
  • Alnæs et al. [2015] M. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, G. N. Wells, The FEniCS project version 1.5, Archive of Numerical Software 3 (2015).
  • Lu et al. [2021] L. Lu, P. Jin, G. Pang, Z. Zhang, G. E. Karniadakis, Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators, Nature machine intelligence 3 (2021) 218–229.
  • Li et al. [2020] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, A. Anandkumar, Fourier neural operator for parametric partial differential equations, arXiv preprint arXiv:2010.08895 (2020).
  • Patel et al. [2022] D. V. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, A. A. Oberai, Variationally mimetic operator networks, arXiv preprint arXiv:2209.12871 (2022).
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

Appendix A GAN architectures and experimental setup

The generator and critic architectures for the experiments considered in Section 5 are described in this section. Other key hyper-parameters are listed in Table 1. All networks were trained using the Adam optimizer with parameters $\beta_1 = 0.5$ and $\beta_2 = 0.9$.

Table 1: Hyper-parameters associated with the cWGANs for various numerical examples (the Tanh+$\Gamma$, Bimodal, and Swissroll columns correspond to the illustrative examples)
Hyperparameter | Tanh+$\Gamma$ | Bimodal | Swissroll | Inverse heat conduction | Inverse Helmholtz equation
Training samples | 2,000 | 2,000 | 2,000 | 10,000 | 12,000
$N_X$ | 1 | 1 | 1 | $28 \times 28$ | $64 \times 64$
$N_Y$ | 1 | 1 | 1 | $28 \times 28$ | $64 \times 64$
$N_Z$ | 1 | 1 | 1 | 3 | 50
Batch size | 2,000 | 2,000 | 2,000 | 50 | 50
Activation function | tanh | tanh | tanh | LReLU(0.1) | ELU
$N_{\text{max}}$ | 20 | 20 | 20 | 4 | 4
Max epochs | 600,000 | 600,000 | 600,000 | 500 | 4,000
Learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ | $10^{-3}$ | $10^{-3}$

A1 Illustrative examples

The generator and critic architectures chosen for these problems are multilayer perceptrons (MLPs) with four hidden layers comprising 128, 256, 64, and 32 units, respectively.
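A minimal PyTorch sketch of such an MLP is given below; it is illustrative rather than the code used in this work. The input dimensions follow the illustrative examples ($N_X = N_Y = N_Z = 1$), and the assumption that the generator acts on the concatenated $(y, z)$ while the critic acts on the concatenated $(x, y)$ is made for concreteness.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, widths=(128, 256, 64, 32)):
    """Four hidden layers with tanh activations and a linear output layer."""
    layers, d = [], in_dim
    for w in widths:
        layers += [nn.Linear(d, w), nn.Tanh()]
        d = w
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

NX = NY = NZ = 1
generator = mlp(NY + NZ, NX)      # input: concatenated (y, z), output: x
critic = mlp(NX + NY, 1)          # input: concatenated (x, y), output: scalar score

# Adam optimizers with the hyper-parameters listed in Table 1
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.5, 0.9))
```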

A2 Inverse heat conduction

The generator and critic architectures are identical to those considered for the inverse heat conduction problem in [23]. In particular, a residual-block-based U-Net architecture is used for the generator, with the latent information injected at every level of the U-Net through conditional instance normalization. The critic comprises residual-block-based convolutional layers followed by a couple of fully connected layers. See [23] for precise details of the architectures and for the benefits of using conditional instance normalization, especially when the generator input $\bm{y}$ is an image while the latent variable is a vector.

A3 Inverse Helmholtz problem

For this problem, we used the same generator and critic architectures as for the inverse heat conduction problem. The only differences are the input dimensions and the use of the ELU activation function in place of LReLU(0.1). We used the same hyper-parameters as in the inverse heat conduction problem (see Table 1).