
A Mathematical Framework for Learning Probability Distributions

Hongkang Yang ([email protected]), Program in Applied and Computational Mathematics, Princeton University. We thank Weinan E for helpful discussions.
Abstract

The modeling of probability distributions, specifically generative modeling and density estimation, has become an immensely popular subject in recent years by virtue of its outstanding performance on sophisticated data such as images and texts. Nevertheless, a theoretical understanding of its success is still incomplete. One mystery is the paradox between memorization and generalization: in theory, training drives the model exactly to the empirical distribution of the finite samples, whereas in practice, the trained model can generate new samples or estimate the likelihood of unseen samples. Likewise, the overwhelming diversity of distribution learning models calls for a unified perspective on this subject. This paper provides a mathematical framework from which all the well-known models can be derived based on simple principles. To demonstrate its efficacy, we present a survey of our results on the approximation error, training error and generalization error of these models, which can all be established within this framework. In particular, the aforementioned paradox is resolved by proving that these models enjoy implicit regularization during training, so that the generalization error at early stopping avoids the curse of dimensionality. Furthermore, we provide some new results on landscape analysis and the mode collapse phenomenon.

1 Introduction

The popularity of machine learning models in recent years is largely attributable to their remarkable versatility in solving highly diverse tasks with good generalization power. Underlying this diversity is the ability of the models to learn various mathematical objects such as functions, probability distributions, dynamical systems, actions and policies, and often a sophisticated architecture or training scheme is a composition of these modules. Besides fitting functions, learning probability distributions is arguably the most widely-adopted task and constitutes a great portion of the field of unsupervised learning. Its applications range from the classical density estimation [119, 115] which is important for scientific computing [13, 118, 70], to generative modeling with superb performance in image synthesis and text composition [15, 94, 16, 96], and also to pretraining tasks such as masked reconstruction that are crucial for large-scale models [28, 16, 25].

Despite the impressive performance of machine learning models in learning probability measures, this subject is less understood than the learning of functions or supervised learning. Specifically, there are several mysteries:

1. Unified framework. There are numerous types of models for representing and estimating distributions, making it difficult to gain a unified perspective for model design and comparison. One traditional categorization includes five model classes: the generative adversarial networks (GAN) [39, 7], variational autoencoders (VAE) [64], normalizing flows (NF) [112, 95], autoregressive models [92, 85], and diffusion models [104, 107]. Within each class, there are further variations that complicate the picture, such as the choice of integral probability metrics for GANs and the choice of architectures for normalizing flows that enable likelihood computations. Ideally, instead of a phenomenological categorization, one would prefer a simple theoretical framework that can derive all these models in a straightforward manner based on a few principles.

2. Memorization and curse of dimensionality. Perhaps the greatest difference between learning functions and learning probability distributions is that, conceptually, the solution of the latter problem must be trivial. On one hand, since the target distribution $P_{*}$ can be arbitrarily complicated, any useful model must satisfy the property of universal convergence, namely the modeled distribution can be trained to converge to any given distribution (e.g. Section 5 will show that this property holds for several models). On the other hand, the target $P_{*}$ is unknown in practice and only a finite sample set $\{\mathbf{x}_{i}\}_{i=1}^{n}$ is given (with the empirical distribution denoted by $P_{*}^{(n)}$). As a result, the modeled distribution $P_{t}$ can only be trained with $P_{*}^{(n)}$ and inevitably exhibits memorization, i.e.

\lim_{t\to\infty}P_{t}=P_{*}^{(n)}

Hence, training results in a trivial solution and does not provide us with anything beyond the samples we already have. This is different from regression where the global minimizer (interpolating solution) can still generalize well [36].

One related problem is the curse of dimensionality, which becomes more severe when estimating distributions instead of functions. In general, the distance between the hidden target and the empirical distribution scales badly with the dimension $d$: for any absolutely continuous $P_{*}$ and any $\delta>0$ [120]

W_{2}(P_{*},P_{*}^{(n)})~{}\gtrsim~{}n^{-\frac{1}{d-\delta}}

where $W_{2}$ is the Wasserstein metric. This slow convergence sets a limit on the performance of all possible models: for instance, the following worst-case lower bound [103]

\inf_{A}\sup_{P_{*}}~{}\mathbb{E}_{\{X_{i}\}}\big{[}W_{2}^{2}\big{(}P_{*},A(\{X_{i}\}_{i=1}^{n})\big{)}\big{]}^{1/2}~{}\gtrsim~{}n^{-\frac{1}{d}}

where $P_{*}$ is any distribution supported on $[0,1]^{d}$ and $A$ is any estimator, i.e. a mapping from every $n$-sample set $\{X_{i}\}_{i=1}^{n}\sim P_{*}$ to an estimated distribution $A(\{X_{i}\})$. Hence, to achieve a generalization error of $\epsilon$, an astronomical sample size of $\Omega(\epsilon^{-d})$ could be necessary in high dimensions.

These theoretical difficulties form a seeming paradox with the empirical success of distribution learning models, for instance, models that can generate novel and highly-realistic images [15, 62, 94] and texts [92, 16].

3. Training and mode collapse. The training of distribution learning models is known to be more delicate than the training of supervised learning models, and exhibits several novel forms of failure. For instance, for the GAN model, one common issue is mode collapse [100, 66, 77, 90], when a positive amount of the mass of $P_{t}$ becomes concentrated at a single point, e.g. an image generator could consistently output the same image. Another issue is mode dropping [129], when $P_{t}$ fails to cover some of the modes of $P_{*}$. In addition, training may suffer from oscillation and divergence [91, 19]. These problems are the main obstacle to global convergence, but the underlying mechanism remains largely obscure.

The goal of this paper is to provide some insights into these mysteries from a mathematical point of view. Specifically,

  1.

    We establish a unified theoretical framework from which all the major distribution learning models can be derived. The diversity of these models is largely determined by two simple factors, the distribution representation and loss type. This formulation greatly facilitates our analysis of the approximation error, training error and generalization error of these models.

  2.

    We survey our previous results on generalization error, and resolve the paradox between memorization and generalization. As illustrated in Figure 1, even though the model eventually converges to the global minimizer, which is the memorization solution, the training trajectory comes very close to the hidden target distribution. With early stopping or a regularized loss, the generalization error scales as

    W_{2}(P_{*},P_{t})\text{ or }\text{KL}(P_{*}\|P_{t})\lesssim n^{-\alpha}

    for some constant $\alpha>0$ instead of a dimension-dependent exponent such as $\alpha/d$. Thereby, the model escapes from the curse of dimensionality.

  3.

    We discuss our previous results on the rates of global convergence for some of the models. For the other models, we establish new results on landscape and critical points, and identify two mechanisms that can lead to mode collapse.

Figure 1: Generalization error during training.

This paper is structured as follows. Section 2 presents a sketch of the popular distribution learning models. Section 3 introduces our theoretical framework and the derivations of the models. Section 4 establishes the universal approximation theorems. Section 5 analyzes the memorization phenomenon. Section 6 discusses the generalization error of several representative distribution learning models. Section 7 analyzes the training process, loss landscape and mode collapse. All proofs are contained in Section 9. Section 8 concludes this paper with discussion on the remaining mysteries.

We close the introduction with a discussion of related works.

Mathematical framework: A framework for supervised learning has been proposed by [35, 32] with a focus on function representations, namely function spaces that can be discretized into neural networks. This framework is helpful for the analysis of supervised learning models, in particular, the estimation of generalization errors [33, 31, 36] that avoid the curse of dimensionality, and the determination of global convergence [24, 97]. Similarly, our framework for distribution learning emphasizes the function representation, as well as the new factor of distribution representation, and bounds the generalization error through analogous arguments. Meanwhile, there are frameworks that characterize distribution learning from other perspectives, for instance, statistical inference [54, 108], graphical models and rewards [130, 12], energy functions [134] and biological neurons [88].

Generalization ability: The section on generalization reviews our previous results on generalization error estimates for the potential-based model [126], GAN [127] and the normalizing flow with stochastic interpolants [125]. The mechanism is that function representations defined by integral transforms or expectations [34, 35] enjoy small Rademacher complexity and thus escape from the curse of dimensionality. Earlier works [33, 31, 36, 37] used this mechanism to bound the generalization error of supervised learning models. Our analysis combines this mechanism with the training process to show that early-stopping solutions generalize well, and is related to concepts from the supervised learning literature such as the frequency principle [124, 123] and slow deterioration [76].

Training and convergence: The additional factor of distribution representation further complicates the loss landscape, and makes training more difficult to analyze, especially for the class of “free generators” that will be discussed later. The model that has attracted the most attention is GAN, and convergence has only been established in simplified settings [79, 121, 38, 10, 127] or for local saddle points [81, 47, 73]. In practice, GAN training is known to suffer from failures such as mode collapse and divergence [9, 20, 79]. Although these issues can be mitigated using regularizations and tricks [44, 66, 77, 90, 50], the mechanism underlying this training instability is not well understood.

2 Model Overview

This section offers a quick sketch of the prominent models for learning probability distributions, while their derivations will be presented in Section 3. These models are commonly grouped into five categories: the generative adversarial networks (GAN), autoregressive models, variational autoencoders (VAE), normalizing flows (NF), and diffusion models. We assume access to samples drawn from the target distribution $P_{*}$, and the task is to train a model to be able to generate more samples from the distribution (generative modeling) or compute its density function (density estimation).

1. GAN. The generative adversarial networks [39] model a distribution by transport

P=\text{law}(X),\quad X=G(Z),\quad Z\sim\mathbb{P}  (1)

The map $G:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is known as the generator and $\mathbb{P}$ is a base distribution that is easy to sample (e.g. the unit Gaussian). To solve for a generator such that $P=P_{*}$, the earliest GAN model considers the following optimization problem [39]

\min_{G}\max_{D}\int\log\Big{(}\frac{e^{D(\mathbf{x})}}{1+e^{D(\mathbf{x})}}\Big{)}dP_{*}(\mathbf{x})+\int\log\Big{(}\frac{1}{1+e^{D(G(\mathbf{x}))}}\Big{)}d\mathbb{P}(\mathbf{x})

where $D:\mathbb{R}^{d}\to\mathbb{R}$ is known as the discriminator, and losses of this min-max type are known as adversarial losses. A well-known variant is the WGAN [7] defined by

\min_{G}\max_{\|\theta\|_{\infty}\leq 1}\int D_{\theta}(\mathbf{x})dP_{*}(\mathbf{x})-\int D_{\theta}\big{(}G(\mathbf{x})\big{)}d\mathbb{P}(\mathbf{x})

where the discriminator $D_{\theta}$ is a neural network with parameter $\theta$, which is bounded in the $l^{\infty}$ norm. A survey of GAN models and their other variants is given by [43].
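To make the adversarial structure concrete, below is a minimal training sketch of the WGAN objective with weight clipping, written in PyTorch. The network sizes, learning rates, clipping threshold and the two-dimensional toy target are illustrative assumptions, not choices prescribed by the original papers.

```python
# Minimal WGAN sketch with weight clipping (all hyperparameters illustrative).
import torch
import torch.nn as nn

d, d_z, clip = 2, 2, 0.1                                              # data dim, latent dim, l-inf bound
G = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d))    # generator G
D = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))      # discriminator D_theta
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)

x_data = torch.randn(10_000, d) @ torch.tensor([[1.0, 0.5], [0.0, 1.0]])  # placeholder samples from P_*

for step in range(2000):
    x = x_data[torch.randint(len(x_data), (128,))]      # minibatch from P_*^{(n)}
    z = torch.randn(128, d_z)                           # minibatch from the base distribution
    # ascend  E_{P_*}[D_theta] - E_P[D_theta(G(z))]  in the discriminator
    loss_D = -(D(x).mean() - D(G(z).detach()).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    for p in D.parameters():                            # enforce the l-infinity bound on theta
        p.data.clamp_(-clip, clip)
    if step % 5 == 0:                                   # descend the same objective in the generator
        z = torch.randn(128, d_z)
        loss_G = -D(G(z)).mean()
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

The clamp after each discriminator step is one simple way to implement the $l^{\infty}$ constraint (here rescaled to a smaller threshold); gradient penalties are a common alternative.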

2. VAE. The variational autoencoder proposed by [64] uses a randomized generator and its approximate inverse, known as the decoder and encoder, and we denote them by the conditional distributions $P(\cdot|\mathbf{z})$ and $Q(\cdot|\mathbf{x})$. Similar to (1), the distribution is modeled by $P=\int P(\cdot|\mathbf{z})d\mathbb{P}(\mathbf{z})$ and can be sampled by $X\sim P(\cdot|Z),~Z\sim\mathbb{P}$. VAE considers the following optimization problem

\min_{P(\cdot|\mathbf{z})}\min_{Q(\cdot|\mathbf{x})}\iint-\log P(\mathbf{x}|\mathbf{z})dQ(\mathbf{z}|\mathbf{x})+\text{KL}\big{(}Q(\cdot|\mathbf{x})\big{\|}\mathbb{P}\big{)}dP_{*}(\mathbf{x})

where KL is the Kullback–Leibler divergence. To simplify computation, $\mathbb{P}$ is usually set to be the unit Gaussian $\mathcal{N}$, and the decoder $P(\cdot|\mathbf{z})$ and encoder $Q(\cdot|\mathbf{x})$ are parametrized as diagonal Gaussians [64]. For instance, consider

P(\cdot|\mathbf{z})=\mathcal{N}\big{(}G(\mathbf{z}),s^{2}I\big{)},\quad Q(\cdot|\mathbf{x})=\mathcal{N}\big{(}F(\mathbf{x}),v^{2}I\big{)}

where $G,F$ are parametrized functions and $s,v>0$ are scalars. Then, we have

\int-\log P(\mathbf{x}|\mathbf{z})dQ(\mathbf{z}|\mathbf{x})=\int-\log P\big{(}\mathbf{x}\big{|}F(\mathbf{x})+v\omega\big{)}d\mathcal{N}(\omega)
=\int\frac{\big{\|}\mathbf{x}-G\big{(}F(\mathbf{x})+v\omega\big{)}\big{\|}^{2}}{2s^{2}}d\mathcal{N}(\omega)+\frac{d}{2}\log(2\pi s^{2})
\text{KL}\big{(}Q(\cdot|\mathbf{x})\big{\|}\mathbb{P}\big{)}=\frac{\|F(\mathbf{x})\|^{2}}{2}+\frac{d}{2}\big{(}v^{2}-\log v^{2}-1\big{)}

Up to constant, the VAE loss becomes

\min_{G,s}\min_{F,v}\iint\frac{\big{\|}\mathbf{x}-G\big{(}F(\mathbf{x})+v\omega\big{)}\big{\|}^{2}}{2s^{2}}d\mathcal{N}(\omega)+\frac{\|F(\mathbf{x})\|^{2}}{2}dP_{*}(\mathbf{x})+d(\log s-\log v)+\frac{d}{2}v^{2}
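The sketch below spells out this Gaussian VAE loss, drawing from the encoder via the reparametrization $\mathbf{z}=F(\mathbf{x})+v\omega$ with $\omega\sim\mathcal{N}$. The networks, the scalar log-parametrized scales and the placeholder data are illustrative assumptions.

```python
# Sketch of the Gaussian VAE objective with scalar scales s, v (illustrative setup).
import torch
import torch.nn as nn

d, d_z = 2, 2                                                          # the paper takes latent dim = d
G = nn.Sequential(nn.Linear(d_z, 64), nn.Tanh(), nn.Linear(64, d))     # decoder mean G(z)
F = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, d_z))     # encoder mean F(x)
log_s = torch.zeros((), requires_grad=True)                            # decoder scale s (log-parametrized)
log_v = torch.zeros((), requires_grad=True)                            # encoder scale v
opt = torch.optim.Adam(list(G.parameters()) + list(F.parameters()) + [log_s, log_v], lr=1e-3)

x_data = torch.randn(10_000, d)                                        # placeholder samples from P_*

for step in range(2000):
    x = x_data[torch.randint(len(x_data), (128,))]
    s, v = log_s.exp(), log_v.exp()
    omega = torch.randn(128, d_z)                                      # reparametrization noise ~ N(0, I)
    recon = ((x - G(F(x) + v * omega)) ** 2).sum(1) / (2 * s ** 2)     # ||x - G(F(x)+v w)||^2 / (2 s^2)
    kl = 0.5 * (F(x) ** 2).sum(1) + 0.5 * d_z * (v ** 2 - 2 * log_v - 1)
    loss = (recon + kl).mean() + d * log_s                             # matches the loss above up to constants
    opt.zero_grad(); loss.backward(); opt.step()
```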

3. Autoregressive. Consider sequential data $\mathbf{X}=[\mathbf{x}_{1},\dots\mathbf{x}_{l}]$ such as text and audio. The autoregressive models represent a distribution $P$ through the factorization $P(\mathbf{X})=\prod_{i=1}^{l}P(\mathbf{x}_{i}|\mathbf{x}_{<i})$, and can be sampled by iteratively drawing $X_{i}\sim P(X_{i}|X_{<i})$. These models minimize the loss

-\int\sum_{i=1}^{l}\log P(\mathbf{x}_{i}|\mathbf{x}_{<i})dP_{*}(\mathbf{X})

There are several approaches to the parametrization of the conditional distributions $P(\mathbf{x}_{i}|\mathbf{x}_{<i})$, depending on how the variable-length input $\mathbf{x}_{<i}$ is processed. The common options are Transformer networks [116, 92], recurrent networks [87], autoregressive networks [68, 87] and causal convolutions [86]. For instance, consider Gaussian distributions parametrized by a recurrent network

P(\cdot|\mathbf{x}_{<i})=\mathcal{N}\big{(}m(\mathbf{h}_{i}),s^{2}(\mathbf{h}_{i})I\big{)},\quad\mathbf{h}_{i}=f(\mathbf{h}_{i-1},\mathbf{x}_{i-1})

where $m,s,f$ are parametrized functions, and $\mathbf{h}$ is the hidden feature. Since

-\log P(\mathbf{x}_{i}|\mathbf{x}_{<i})=\frac{\|\mathbf{x}_{i}-m(\mathbf{h}_{i})\|^{2}}{2s(\mathbf{h}_{i})^{2}}+\frac{d}{2}\log\big{(}2\pi s(\mathbf{h}_{i})^{2}\big{)}

the loss is equal, up to constant, to

\min_{m,s,f,\mathbf{h}_{1}}\int\sum_{i=1}^{l}\frac{\|\mathbf{x}_{i}-m(\mathbf{h}_{i})\|^{2}}{2s(\mathbf{h}_{i})^{2}}+d\log s(\mathbf{h}_{i})~{}dP_{*}(\mathbf{X})
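A minimal sketch of this Gaussian autoregressive loss is given below, with $f$ taken to be a GRU cell, the initial hidden state $\mathbf{h}_{1}$ learned directly, and a scalar scale head; the sequence data and all sizes are placeholders.

```python
# Sketch of the Gaussian autoregressive (recurrent) loss above (illustrative setup).
import torch
import torch.nn as nn

d, d_h, L = 2, 32, 10                       # data dim, hidden dim, sequence length
f = nn.GRUCell(d, d_h)                      # recurrence h_i = f(h_{i-1}, x_{i-1})
m = nn.Linear(d_h, d)                       # mean head m(h_i)
log_s = nn.Linear(d_h, 1)                   # log of the scalar scale s(h_i)
h1 = torch.zeros(1, d_h, requires_grad=True)     # learnable initial hidden state h_1
params = list(f.parameters()) + list(m.parameters()) + list(log_s.parameters()) + [h1]
opt = torch.optim.Adam(params, lr=1e-3)

X_data = torch.randn(1000, L, d)            # placeholder sequences from P_*

for step in range(500):
    X = X_data[torch.randint(len(X_data), (64,))]
    nll = 0.0
    for i in range(L):
        # h_1 is learned directly; h_i = f(h_{i-1}, x_{i-1}) for i >= 2
        h = h1.expand(64, d_h) if i == 0 else f(X[:, i - 1], h)
        s = log_s(h).exp().squeeze(1)
        nll = nll + ((X[:, i] - m(h)) ** 2).sum(1) / (2 * s ** 2) + d * torch.log(s)
    loss = nll.mean()
    opt.zero_grad(); loss.backward(); opt.step()
```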

4. NF. The normalizing flows proposed by [112, 111] use a generator $G$ as in (1), similar to GAN and VAE. The optimization problem is given by

\min_{G}-\int\log\det\nabla G^{-1}(\mathbf{x})+\log\mathbb{P}(G^{-1}(\mathbf{x}))dP_{*}(\mathbf{x})

where $\mathbb{P}=\mathcal{N}$ is set to be the unit Gaussian and $\det\nabla G^{-1}$ is the Jacobian determinant of $G^{-1}$. To enable the calculation of these terms, the earliest approach [112, 111, 95] considers only the inverse $F=G^{-1}$ and models it by a composition of simple maps $F=F_{1}\circ\dots\circ F_{T}$ such that each $\det\nabla F_{\tau}$ is easy to compute. The modeled distribution (1) cannot be sampled, but can serve as a density estimator, with the density given by

P(\mathbf{x})=\mathcal{N}(\mathbf{x}_{0})\prod_{\tau=1}^{T}\det\nabla F_{\tau}(\mathbf{x}_{\tau-1}),\quad\mathbf{x}_{\tau-1}:=F_{\tau}\circ\dots\circ F_{T}(\mathbf{x})

Up to constant, the loss becomes

\min_{F_{1},\dots F_{T}}\int\frac{\|\mathbf{x}_{0}\|^{2}}{2}+\sum_{\tau=1}^{T}-\log\det\nabla F_{\tau}(\mathbf{x}_{\tau-1})~{}dP_{*}(\mathbf{x})

A later approach [29, 63, 89, 55] models the generator by $G=G_{T}\circ\dots\circ G_{1}$ such that each $G_{\tau}$ is designed to be easily invertible with $\nabla G_{\tau}$ being a triangular matrix. Then, the loss becomes

\min_{G_{1},\dots G_{T}}\int\frac{\|\mathbf{x}_{0}\|^{2}}{2}+\sum_{\tau=1}^{T}\text{Tr}\big{[}\log\nabla G_{\tau}(\mathbf{x}_{\tau-1})\big{]}~{}dP_{*}(\mathbf{x}),\quad\mathbf{x}_{\tau-1}:=G_{\tau}^{-1}\circ\dots\circ G_{T}^{-1}(\mathbf{x})

Another approach [30, 21, 40] defines $G$ as a continuous-time flow, i.e. the solution to an ordinary differential equation (ODE)

G=G_{0}^{1},\quad G_{s}^{\tau}(\mathbf{x}_{s})=\mathbf{x}_{\tau},\quad\frac{d}{d\tau}\mathbf{x}_{\tau}=V(\mathbf{x}_{\tau},\tau)  (2)

for some time-dependent velocity field $V$. Then, the loss becomes

\min_{V}\int\frac{\|\mathbf{x}_{0}\|^{2}}{2}+\int_{0}^{1}\text{Tr}\big{[}\nabla_{\mathbf{x}}V(\mathbf{x}_{\tau},\tau)\big{]}d\tau dP_{*}(\mathbf{x}),\quad\mathbf{x}_{\tau}:=(G_{\tau}^{1})^{-1}(\mathbf{x})

A survey on normalizing flows is given by [65].
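As an illustration of the continuous-time formulation, the sketch below integrates a learned velocity field backwards from the data with an explicit Euler scheme and accumulates the exact trace of its Jacobian, which is affordable for small $d$ (larger problems typically use a stochastic trace estimator). All sizes, step counts and the placeholder data are illustrative assumptions.

```python
# Sketch of the continuous-time normalizing-flow loss: NLL = |x_0|^2/2 + int_0^1 Tr[grad V] dtau.
import torch
import torch.nn as nn

d, K = 2, 20                                                            # data dim, Euler steps
V = nn.Sequential(nn.Linear(d + 1, 64), nn.Tanh(), nn.Linear(64, d))    # velocity field V(x, tau)
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
x_data = torch.randn(5000, d) + 2.0                                     # placeholder samples from P_*

def divergence(v, x):
    # Tr[grad_x V], one coordinate at a time (exact; fine for small d)
    return sum(torch.autograd.grad(v[:, i].sum(), x, create_graph=True)[0][:, i] for i in range(d))

for step in range(500):
    x_tau = x_data[torch.randint(len(x_data), (128,))].requires_grad_(True)
    logdet = torch.zeros(x_tau.shape[0])
    for k in reversed(range(K)):                     # integrate from tau = 1 back to tau = 0
        tau = torch.full((x_tau.shape[0], 1), (k + 1) / K)
        v = V(torch.cat([x_tau, tau], dim=1))
        logdet = logdet + divergence(v, x_tau) / K   # accumulate int Tr[grad V] dtau
        x_tau = x_tau - v / K                        # explicit Euler step backward in time
    loss = (0.5 * (x_tau ** 2).sum(1) + logdet).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```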

5. Monge-Ampère flow. A model that is closely related to the normalizing flows is the Monge-Ampère flow [131]. It is parametrized by a time-dependent potential function $\phi_{\tau},~\tau\in[0,1]$, and defines a generator by the ODE

G=G_{0}^{1},\quad G_{s}^{\tau}(\mathbf{x}_{s})=\mathbf{x}_{\tau},\quad\frac{d}{d\tau}\mathbf{x}_{\tau}=\nabla\phi_{\tau}(\mathbf{x}_{\tau})

such that the flow is driven by a gradient field. The model minimizes the following loss

\min_{\phi}\int\frac{\|\mathbf{x}_{0}\|^{2}}{2}+\int_{0}^{1}\Delta\phi_{\tau}(\mathbf{x}_{\tau})d\tau dP_{*}(\mathbf{x}),\quad\mathbf{x}_{\tau}:=(G_{\tau}^{1})^{-1}(\mathbf{x})

where $\Delta\phi=\sum_{i=1}^{d}\partial_{i}^{2}\phi$ is the Laplacian.

6. Diffusion. The diffusion models [104, 106, 49] define the generator by a reverse-time SDE (stochastic differential equation)

G(X_{T})=X_{0},\quad X_{T}\sim\mathbb{P},\quad dX_{\tau}=-\frac{\beta_{\tau}}{2}\big{(}X_{\tau}+2\mathbf{s}(X_{\tau},\tau)\big{)}d\tau+\sqrt{\beta_{\tau}}d\overline{W}_{\tau}

where $\mathbf{s}:\mathbb{R}^{d}\times[0,T]\to\mathbb{R}^{d}$ is a time-dependent velocity field known as the score function [56], $\beta_{\tau}>0$ is some noise scale, and $\overline{W}_{\tau}$ is a reverse-time Wiener process. The modeled distribution is sampled by solving this SDE backwards in time from $T$ to $0$. The score function $\mathbf{s}$ is learned from the optimization problem

\min_{\mathbf{s}}\int_{0}^{T}\frac{\lambda_{\tau}}{2}\iint\Big{\|}\mathbf{s}\big{(}e^{-\frac{1}{2}\int_{0}^{\tau}\beta_{s}ds}\mathbf{x}_{0}+\sqrt{1-e^{-\int_{0}^{\tau}\beta_{s}ds}}\omega,\tau\big{)}+\frac{\omega}{\sqrt{1-e^{-\int_{0}^{\tau}\beta_{s}ds}}}\Big{\|}^{2}d\mathcal{N}(\omega)dP_{*}(\mathbf{x}_{0})d\tau

where $\mathcal{N}$ is the unit Gaussian and $\lambda_{\tau}>0$ is any weight. Besides the reverse-time SDE, another way to sample from the model is to solve the following reverse-time ODE, which yields the same distribution [107]

G(X_{T})=X_{0},\quad X_{T}\sim\mathbb{P},\quad\frac{d}{d\tau}X_{\tau}=-\frac{\beta_{\tau}}{2}\big{(}X_{\tau}+\mathbf{s}(X_{\tau},\tau)\big{)}

A survey on diffusion models is given by [128].
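For concreteness, the sketch below implements the score-matching objective above in the special case of a constant noise scale $\beta_{\tau}\equiv\beta$ and weight $\lambda_{\tau}\equiv 1$, with the triple integral estimated by sampling $\mathbf{x}_{0}\sim P_{*}^{(n)}$, $\omega\sim\mathcal{N}$ and $\tau$ uniformly. The network, data and hyperparameters are illustrative placeholders.

```python
# Denoising score-matching sketch with constant beta and lambda_tau = 1 (illustrative).
import torch
import torch.nn as nn

d, T, beta = 2, 1.0, 8.0
s_net = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))   # score model s(x, tau)
opt = torch.optim.Adam(s_net.parameters(), lr=1e-3)
x_data = torch.randn(10_000, d) * 0.3 + 1.0                                  # placeholder samples from P_*

for step in range(2000):
    x0 = x_data[torch.randint(len(x_data), (256,))]
    tau = torch.rand(256, 1) * (T - 1e-2) + 1e-2  # avoid tau ~ 0 where the target blows up
    a = torch.exp(-0.5 * beta * tau)              # e^{-(1/2) int_0^tau beta ds} for constant beta
    sigma = torch.sqrt(1 - a ** 2)                # sqrt(1 - e^{-int_0^tau beta ds})
    omega = torch.randn_like(x0)
    x_tau = a * x0 + sigma * omega                # draw from the noising kernel given x0
    loss = 0.5 * ((s_net(torch.cat([x_tau, tau], 1)) + omega / sigma) ** 2).sum(1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```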

7. NF interpolant. Finally, we introduce a model called the normalizing flow with stochastic interpolants [2, 75]. It is analogous to the diffusion models, and yet is conceptually simpler. This model learns a velocity field $V:\mathbb{R}^{d}\times[0,1]\to\mathbb{R}^{d}$ from the optimization problem

\min_{V}\int_{0}^{1}\iint\big{\|}V\big{(}(1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big{)}-(\mathbf{x}_{1}-\mathbf{x}_{0})\big{\|}^{2}d\mathbb{P}(\mathbf{x}_{0})dP_{*}(\mathbf{x}_{1})d\tau

Then, the generator is defined through the ODE (2) and the modeled distribution is sampled by (1).
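The sketch below trains this interpolant regression and then pushes base samples through the learned flow (2) with an Euler scheme; the data, network and step counts are illustrative placeholders.

```python
# Stochastic-interpolant sketch: regress V on the line velocities, then integrate the ODE.
import torch
import torch.nn as nn

d = 2
V = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))   # velocity field V(x, tau)
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
x_data = torch.randn(10_000, d) * 0.5 + torch.tensor([2.0, -1.0])        # placeholder samples from P_*

for step in range(2000):
    x1 = x_data[torch.randint(len(x_data), (256,))]   # x_1 ~ P_*
    x0 = torch.randn(256, d)                          # x_0 ~ P (unit Gaussian base)
    tau = torch.rand(256, 1)
    x_tau = (1 - tau) * x0 + tau * x1                 # linear interpolant
    loss = ((V(torch.cat([x_tau, tau], 1)) - (x1 - x0)) ** 2).sum(1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: solve dx/dtau = V(x, tau) from tau = 0 to 1, starting from Z ~ P.
with torch.no_grad():
    x = torch.randn(1000, d)
    for k in range(100):
        tau = torch.full((1000, 1), k / 100)
        x = x + V(torch.cat([x, tau], 1)) / 100       # approximate samples from P_*
```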

3 Framework

Previously, a mathematical framework for supervised learning was proposed by [35], which was effective for estimating the approximation and generalization errors of supervised learning models [34, 33, 31]. In particular, it helped to understand how neural network-based models manage to avoid the curse of dimensionality. The framework characterizes the models by four factors: the function representation (abstract function spaces built from integral transformations), the loss, the training scheme, and the discretization (e.g. how the continuous representations are discretized into neural networks with finite neurons).

This section presents a similar framework that unifies models for learning probability distributions. We focus on two factors: the distribution representation, which is a new factor and determines how distributions are parametrized by abstract functions, and the loss type, which specifies which metric or topology is imposed upon the distributions. A sketch of the categorization is given in Table 1. We show that the diverse families of distribution learning models can be simultaneously derived from this framework. The theoretical results in the latter sections, in particular the generalization error estimates, are also built upon this mathematical foundation.

Representation \ Loss type | Density                  | Expectation | Regression
Potential                  | bias potential model     | feasible    | unknown
Free generator             | NF, VAE, autoregressive  | GAN         | unknown
Fixed generator            | upper bound              | feasible    | diffusion, NF interpolant, OT
Table 1: Categorization of distribution learning models based on distribution representation (rows) and loss type (columns), with representative models. See Section 3.4 for a detailed description. Our theoretical results will focus on the marked categories.

3.1 Background

The basic task is to estimate a probability distribution given i.i.d. samples $\{\mathbf{x}_{i}\}_{i=1}^{n}$. We denote this unknown target distribution by $P_{*}$ and the empirical distribution by

P_{*}^{(n)}=\frac{1}{n}\sum_{i=1}^{n}\delta_{\mathbf{x}_{i}}

The underlying space is assumed to be the Euclidean space $\mathbb{R}^{d}$. To estimate a distribution may have several meanings depending on its usage: e.g. to obtain a random variable $X\sim P_{*}$, to estimate the density function $P_{*}(\mathbf{x})$, or to compute expectations $\int fdP_{*}$. The first task is known as generative modeling and the second as density estimation; these two problems are the focus of this paper, while the third task can be solved by either of them.

Figure 2: The vertical and horizontal perspectives on probability distributions. Left: the distribution $P_{0}=\mathcal{N}(-2,1)$ is transformed to $P_{1}=\mathcal{N}(2,1)$ by reweighing, and the distance $d(P_{0},P_{1})$ is measured by the difference between densities. Right: $P_{0}$ is transformed to $P_{1}$ by transport, and the distance is measured by the displacement of mass.

There are two general approaches to modeling a distribution, which can be figuratively termed “vertical” and “horizontal”; an illustration is given in Figure 2. Given a base distribution $\mathbb{P}$, the vertical approach reweighs the density of $\mathbb{P}$ to approximate the density of the target $P_{*}$, while the horizontal approach transports the mass of $\mathbb{P}$ towards the location of $P_{*}$. When a modeled distribution $P$ is obtained and a distance $d(P,P_{*})$ is needed to compute either the training loss or the test error, the vertical approach measures the difference between the densities of $P$ and $P_{*}$ at each location, and is exemplified by the KL divergence

\text{KL}(P_{*}\|P):=\int\log\frac{P_{*}(\mathbf{x})}{P(\mathbf{x})}dP_{*}(\mathbf{x})  (3)

while the horizontal approach measures the distance between the “particles” of $P$ and $P_{*}$, and is exemplified by the 2-Wasserstein metric [61, 117]

W_{2}(P,P_{*}):=\inf_{\pi}\Big{(}\int\|\mathbf{x}_{0}-\mathbf{x}_{1}\|^{2}d\pi(\mathbf{x}_{0},\mathbf{x}_{1})\Big{)}^{1/2}  (4)

where $\pi$ is any coupling between $P,P_{*}$ (i.e. a joint distribution on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ whose marginal distributions are $P,P_{*}$). We will see that the vertical and horizontal approaches largely determine the distribution representation and loss type.

Finally, consider the operator $\text{law}$,

P=\text{law}(X)

which maps a random variable $X$ to its distribution $P$. Similarly, given a random path $\{X_{\tau},\tau\in[0,T]\}$, we obtain a path $P_{\tau}=\text{law}(X_{\tau})$ in the distribution space. In general, there can be infinitely many random variables that are mapped to the same distribution: e.g. let $P$ and $\mathbb{P}$ be uniform over $[0,1]$; for any $k\in\mathbb{N}$, the random variable $X_{k}=kZ~\mathrm{mod}~1$ with $Z\sim\mathbb{P}$ satisfies $P=\text{law}(X_{k})$. One drawback of this non-uniqueness is that, for many generative models, there can be plenty of global minima, which makes the loss landscape non-convex and may lead to training failures; as we will show, this is inherent in mode collapse. One benefit is that, if the task is to learn some time-dependent distribution $P_{\tau}$, one can select from the infinitely many possible random paths $X_{\tau}$ the one that is the easiest to compute, and thereby define a convenient loss.

Regarding notation: for any measurable subset $\Omega$ of $\mathbb{R}^{d}$, denote by $\mathcal{P}(\Omega)$ the space of probability measures over $\Omega$, by $\mathcal{P}_{2}(\Omega)$ the subspace of measures with finite second moments, and by $\mathcal{P}_{ac}(\Omega)$ the subspace of absolutely continuous measures (i.e. those with density functions). Denote by $\text{sprt}P$ the support of a distribution $P$. Given any two measures $m_{0},m_{1}$, we denote by $m_{0}\times m_{1}$ the product measure over the product space. We denote by $t$ the training time and by $\tau$ the time that parametrizes flows.

3.2 Distribution representation

Since machine learning models at the basic level are good at learning functions, the common approach to learning distributions is to parametrize distributions by functions. There are three common approaches:

1. Potential function. Given any base distribution $\mathbb{P}$, define the modeled distribution $P$ by

P=\frac{1}{Z}e^{-V}\mathbb{P},\quad Z=\int e^{-V}d\mathbb{P}  (5)

where $V$ is a potential function and $Z$ is the normalization constant. This parametrization is sometimes known as the Boltzmann distribution or the exponential family.

2. Free generator. Given any measurable function $G:\mathbb{R}^{d}\to\mathbb{R}^{d}$, define the modeled distribution $P$ by

P=\text{law}(X),\quad X=G(Z),\quad Z\sim\mathbb{P}

$P$ is known as the transported measure or pushforward measure and is denoted by $P=G\#\mathbb{P}$, while $G$ is called the generator or transport map. Equivalently, $P$ is defined as the measure that satisfies

P(A)=\mathbb{P}(G^{-1}(A))  (6)

for all measurable sets $A$.

The name “free generator” is used to emphasize that the task only specifies the target distribution $P_{*}$ to estimate, and we are free to choose any generator from the possibly infinite set of solutions $\{G~{}|~{}P_{*}=G\#\mathbb{P}\}$.

There are several common extensions to the generator. First, $G$ can be modeled as a random function, such that $G(\mathbf{z})\sim P(\cdot|\mathbf{z})$ for some conditional distribution $P(\cdot|\mathbf{z})$. Second, $G$ can be induced by a flow. Let $V:\mathbb{R}^{d}\times[0,T]\to\mathbb{R}^{d}$ be a Lipschitz velocity field, and define $G$ as the unique solution to the ODE

G=G_{T},\quad G_{\tau}(\mathbf{x})=\mathbf{x}_{\tau},\quad\mathbf{x}_{0}=\mathbf{x},\quad\frac{d}{d\tau}\mathbf{x}_{\tau}=V(\mathbf{x}_{\tau},\tau)  (7)

where $G_{\tau}$ is the flow map. Furthermore, if we define the interpolant distributions $P_{\tau}=G_{\tau}\#\mathbb{P}$, then they form a (weak) solution to the continuity equation

\partial_{\tau}P_{\tau}+\nabla\cdot(V_{\tau}P_{\tau})=0  (8)

Specifically, for any smooth test function $\phi$,

\int\phi(\mathbf{x})d(\partial_{\tau}P_{\tau})(\mathbf{x})=\frac{d}{d\tau}\int\phi(\mathbf{x})dP_{\tau}(\mathbf{x})=\frac{d}{d\tau}\int\phi\big{(}G_{\tau}(\mathbf{x})\big{)}d\mathbb{P}(\mathbf{x})
=\int V\big{(}G_{\tau}(\mathbf{x}),\tau\big{)}\cdot\nabla\phi\big{(}G_{\tau}(\mathbf{x})\big{)}d\mathbb{P}(\mathbf{x})
=\int V(\mathbf{x},\tau)\cdot\nabla\phi(\mathbf{x})dP_{\tau}(\mathbf{x})=-\int\phi(\mathbf{x})d\big{(}\nabla\cdot(V_{\tau}P_{\tau})\big{)}(\mathbf{x})

Third, one can restrict to a subset of the possibly infinite set of solutions $\{G~{}|~{}P_{*}=G\#\mathbb{P}\}$, specifically generators that are gradients of potential functions, $\{G=\nabla\psi\}$. By Brenier’s theorem [14, 117], such a potential function exists under very general conditions. Similarly, one can restrict the velocity fields in (7) to time-dependent gradient fields

V_{\tau}=\nabla\phi_{\tau}  (9)

By Theorem 5.51 of [117], in general there exists a potential function $\phi_{\tau}$ such that the flow $G$ induced by $\nabla\phi_{\tau}$ satisfies $P_{*}=G\#\mathbb{P}$. Specifically, the interpolant distribution $P_{\tau}=G_{\tau}\#\mathbb{P}$ is the Wasserstein geodesic that goes from $\mathbb{P}$ to $P_{*}$. Finally, it is interesting to note that there is also a heuristic argument from [2] that justifies the restriction to gradient fields: given any velocity field $V_{\tau}$ with the interpolant distribution $P_{\tau}$ generated by (7), consider the equation

\nabla\cdot(P_{\tau}\nabla\phi_{\tau})=\nabla\cdot(P_{\tau}V_{\tau})

By the theory of elliptic PDEs, the solution $\phi_{\tau}$ exists. It follows from (8) that we can always replace a velocity field by a gradient field that induces the same interpolant distribution $P_{\tau}$.

3. Fixed generator. Contrary to the free generator, another approach is to choose a specific coupling between the base and target distributions, $\pi\in\Pi(\mathbb{P},P_{*})$, where

\Pi(\mathbb{P},P_{*})=\Big{\{}\pi\in\mathcal{P}(\mathbb{R}^{d}\times\mathbb{R}^{d})~{}\Big{|}~{}\int\pi d\mathbf{x}_{0}=\mathbb{P},~{}\int\pi d\mathbf{x}_{1}=P_{*}\Big{\}}

and the generator $G$ is represented as the conditional distribution $\pi(\cdot|\mathbf{x}_{0})$.

One can further extend $\pi$ into a random path $\{X_{\tau},\tau\in[0,T]\}$ such that $\pi=\text{law}(X_{0},X_{T})$. Then, analogous to the construction (7), $G$ can be represented as the ODE or SDE that drives the trajectories $X_{\tau}$. Thanks to the non-uniqueness of $\text{law}$, one can further consider the interpolant distributions $P_{\tau}=\text{law}(X_{\tau})$ and solve for the velocity field $V$ in the continuity equation (8). Then, $G$ can be represented as the solution to the ODE (7) with velocity $V$.

Currently, models of this category fall into one of two extremes:

Fully deterministic: For some measurable function $G$,

\pi(\mathbf{x}_{0},\mathbf{x}_{1})=\delta_{G(\mathbf{x}_{0})}(\mathbf{x}_{1})\mathbb{P}(\mathbf{x}_{0})  (10)

The generator $G$ is usually set to be the optimal transport map from $\mathbb{P}$ to $P_{*}$. The idea is simple in one dimension: we sort the “particles” of $\mathbb{P}$ and $P_{*}$ and match them according to this ordering. This monotonicity in $\mathbb{R}$ can be generalized to cyclic monotonicity in higher dimensions [117]. Couplings $\pi\in\Pi(\mathbb{P},P_{*})$ that are cyclically monotone are exactly the optimal transport plans with respect to the squared Euclidean metric [117], namely the minimizers of (4). Then, Brenier’s theorem [14, 117] implies that, under general conditions, the problem (4) has a unique solution, which has the form (10), and furthermore the generator is a gradient field $G=\nabla\psi$ of a convex function $\psi$.
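The one-dimensional case can be made explicit in a few lines: sorting the two sample sets realizes the monotone coupling, and the matched pairs give an empirical estimate of $W_{2}^{2}$. The target distribution below is only for illustration.

```python
# One-dimensional optimal transport by sorting (illustrative distributions).
import numpy as np

n = 1000
z = np.random.randn(n)                 # samples from the base P
x = np.random.exponential(size=n)      # samples from the target P_*

z_sorted = np.sort(z)                  # the monotone map sends the k-th smallest z
x_sorted = np.sort(x)                  # to the k-th smallest x
w2_squared = np.mean((z_sorted - x_sorted) ** 2)   # empirical W_2^2 of this coupling
print(w2_squared)
```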

Fully random: The coupling is simply the product measure

\pi=\mathbb{P}\times P_{*}  (11)

At first sight, this choice seems trivial and intractable, but the trick is to choose an appropriate random path $X_{\tau}$ such that either the dynamics of $X_{\tau}$ or the continuity equation (8) is easy to solve.

One of the simplest constructions, proposed by [75, 2], is to use the linear interpolation

X_{\tau}=(1-\tau)X_{0}+\tau X_{1},\quad(X_{0},X_{1})\sim\pi,~{}\tau\in[0,1]  (12)

Then, to solve for the target velocity field in (8), define a joint distribution $M_{*}$ over $\mathbb{R}^{d}\times[0,1]$ by

\int\phi(\mathbf{x},\tau)dM_{*}(\mathbf{x},\tau)=\int_{0}^{1}\int\phi(\mathbf{x},\tau)dP_{\tau}d\tau=\int_{0}^{1}\iint\phi\big{(}(1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big{)}d\mathbb{P}(\mathbf{x}_{0})dP_{*}(\mathbf{x}_{1})d\tau  (13)

for any test function $\phi$. Similarly, define the current density $J_{*}$, a vector-valued measure, by

\int\mathbf{f}(\mathbf{x},\tau)\cdot dJ_{*}(\mathbf{x},\tau)=\int_{0}^{1}\iint(\mathbf{x}_{1}-\mathbf{x}_{0})\cdot\mathbf{f}\big{(}(1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big{)}d\mathbb{P}(\mathbf{x}_{0})dP_{*}(\mathbf{x}_{1})d\tau  (14)

for any test function $\mathbf{f}$. Then, we can define a velocity field $V_{*}$ by the Radon-Nikodym derivative

V_{*}=\frac{dJ_{*}}{dM_{*}}  (15)

Each $V_{*}(\mathbf{x},\tau)$ is the weighted average of the velocities of the random lines (12) that pass through the point $(\mathbf{x},\tau)$ in spacetime. As shown in [2, 125], under general assumptions, $V_{*}$ is the solution to the continuity equation (8) and satisfies

G_{*}\#\mathbb{P}=P_{*}

where $G_{*}$ is the generator defined by the flow (7) of $V_{*}$.

A more popular construction by [104, 106, 49] uses the diffusion process

X_{0}\sim P_{*},\quad dX_{\tau}=-\frac{\beta_{\tau}}{2}X_{\tau}d\tau+\sqrt{\beta_{\tau}}dW_{\tau}  (16)

where $\beta_{\tau}>0$ is a non-decreasing function that represents the noise scale. Consider the coupling $\pi_{\tau}=\text{law}(X_{\tau},X_{0})$. The conditional distribution $\pi_{\tau}(\cdot|\mathbf{x}_{0})$ is an isotropic Gaussian [107]

\pi_{\tau}(\cdot|\mathbf{x}_{0})=\mathcal{N}\Big{(}e^{-\frac{1}{2}\int_{0}^{\tau}\beta_{s}ds}\mathbf{x}_{0},~{}(1-e^{-\int_{0}^{\tau}\beta_{s}ds})I\Big{)}  (17)

and the interpolant distributions $P_{\tau}=\text{law}(X_{\tau})$ are given by

P_{\tau}=\int\pi_{\tau}(\cdot|\mathbf{x}_{0})dP_{*}(\mathbf{x}_{0})  (18)

Then,

\text{KL}(\pi\|\pi_{\tau})=\int\ln\frac{d\pi}{d\pi_{\tau}}d\pi=\iint\ln\frac{\mathcal{N}(\mathbf{x})}{\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})}d\mathcal{N}(\mathbf{x})dP_{*}(\mathbf{x}_{0})
=\int\text{KL}\big{(}\mathcal{N}~{}\big{\|}~{}\mathcal{N}(e^{-\frac{1}{2}\int_{0}^{\tau}\beta_{s}ds}\mathbf{x}_{0},(1-e^{-\int_{0}^{\tau}\beta_{s}ds})I)\big{)}dP_{*}(\mathbf{x}_{0})
=\frac{e^{-\int_{0}^{\tau}\beta_{s}ds}}{1-e^{-\int_{0}^{\tau}\beta_{s}ds}}\Big{(}\int\|\mathbf{x}_{0}\|^{2}dP_{*}(\mathbf{x}_{0})+d\Big{)}+d\ln(1-e^{-\int_{0}^{\tau}\beta_{s}ds})

where the last line follows from the formula for the KL divergence between multivariate Gaussians. Let $\mathbb{P}$ be the unit Gaussian $\mathcal{N}$. It follows that if $P_{*}$ has finite second moments, then the coupling $\pi_{\tau}$ converges to the product measure $\pi=\mathbb{P}\times P_{*}$ exponentially fast.

By choosing $T$ sufficiently large, we have $P_{T}\approx\mathbb{P}$. Then, a generative model can be defined by sampling $X_{T}\sim\mathbb{P}$ and then going through a reverse-time process $X_{0}=G(X_{T})$ to approximate the target $P_{*}$. One approach is to implement the following reverse-time SDE [6]

X_{T}\sim\mathbb{P},\quad dX_{\tau}=-\frac{\beta_{\tau}}{2}(X_{\tau}+2\nabla_{\mathbf{x}}\log P_{\tau}(X_{\tau}))d\tau+\sqrt{\beta_{\tau}}d\overline{W}_{\tau}

which is solved from time $T$ to $0$, and $\overline{W}$ is the reverse-time Wiener process. This backward SDE is equivalent to the forward SDE (16) in the sense that, if $P_{T}=\mathbb{P}$, then they induce the same distribution of paths $\{X_{\tau},\tau\in[0,T]\}$ [6], and in particular $P_{*}=\text{law}(X_{0})$. (An analysis that accounts for the approximation error between $P_{T}$ and $\mathbb{P}$ is given by [105].) The gradient field $\nabla_{\mathbf{x}}\log P_{\tau}$ is known as the score function [56], which is modeled by a velocity field $\mathbf{s}:\mathbb{R}^{d}\times[0,T]\to\mathbb{R}^{d}$, and then the generator $G$ can be defined as the following random function

G(\mathbf{x}_{T})=X_{0},\quad X_{T}=\mathbf{x}_{T},\quad dX_{\tau}=-\frac{\beta_{\tau}}{2}(X_{\tau}+2\mathbf{s}(X_{\tau},\tau))d\tau+\sqrt{\beta_{\tau}}d\overline{W}_{\tau}  (19)

Another approach is to implement the following reverse-time ODE [107]

X_{T}\sim\mathbb{P},\quad\frac{d}{d\tau}X_{\tau}=V(X_{\tau},\tau),\quad V(\mathbf{x},\tau)=-\frac{\beta_{\tau}}{2}\big{(}\mathbf{x}+\nabla_{\mathbf{x}}\log P_{\tau}(\mathbf{x})\big{)}

This $V$ is the solution to the continuity equation (8), and we similarly have $P_{*}=\text{law}(X_{0})$ if $P_{T}=\mathbb{P}$. Then, the generator $G$ can be defined as a deterministic function

G(\mathbf{x}_{T})=\mathbf{x}_{0},\quad\frac{d}{d\tau}\mathbf{x}_{\tau}=-\frac{\beta_{\tau}}{2}(\mathbf{x}_{\tau}+\mathbf{s}(\mathbf{x}_{\tau},\tau))  (20)
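Sampling through the deterministic generator (20) then amounts to integrating the reverse-time ODE, e.g. with an explicit Euler scheme. In the sketch below, s_net stands for an already-trained score model, and the constant noise scale is an assumption.

```python
# Reverse-time ODE sampler for the deterministic generator, constant beta (illustrative).
import torch

def sample_reverse_ode(s_net, n=1000, d=2, T=1.0, beta=8.0, steps=200):
    """Integrate dx/dtau = -(beta/2)(x + s(x, tau)) from tau = T down to 0."""
    x = torch.randn(n, d)                       # X_T ~ P (unit Gaussian base)
    dt = T / steps
    with torch.no_grad():
        for k in reversed(range(steps)):
            tau = torch.full((n, 1), (k + 1) * dt)
            drift = -0.5 * beta * (x + s_net(torch.cat([x, tau], 1)))
            x = x - drift * dt                  # stepping backward in time: dtau = -dt
    return x                                    # approximate samples from P_*
```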

4. Mixture. Finally, we remark that it is possible to use a combination of these representations. For instance, [82] uses a normalizing flow model reweighed by a Boltzmann distribution, which is helpful for sampling from distributions with multiple modes while maintaining accurate density estimation. Another possibility is that one can first train a model with fixed generator representation as a stable initialization, and then finetune the trained generator as a free generator (e.g. using GAN loss) to improve sample quality.

3.3 Loss type

There are numerous ways to define a metric or divergence on the space of probability measures, which greatly contribute to the diversity of distribution learning models. One requirement, however, is that since the target distribution $P_{*}$ is replaced by its samples $P_{*}^{(n)}$ during training, the term $P_{*}$ almost always appears in the loss as an expectation.

The commonly used losses belong to three categories:

1. Density-based loss. The modeled distribution $P$ participates in the loss as a density function. The default choice is the KL divergence (3), which is equivalent up to constant to the negative log-likelihood (NLL)

L(P)=-\int\log P(\mathbf{x})dP_{*}(\mathbf{x})  (21)

In fact, we can show that NLL is in a sense the only possible density-based loss.

Proposition 3.1.

Let $L$ be any loss function on $\mathcal{P}_{ac}(\mathbb{R}^{d})$ that has the form

L(P)=\int f\big{(}P(\mathbf{x})\big{)}dP_{*}(\mathbf{x})

where $f$ is some $C^{1}$ function on $(0,+\infty)$. If for any $P_{*}\in\mathcal{P}_{ac}(\mathbb{R}^{d})$, the loss $L$ is minimized by $P_{*}$, then

f(p)=c\log p+c^{\prime}

for $c\leq 0$ and $c^{\prime}\in\mathbb{R}$. The converse is obvious.

Besides the KL divergence, there are several other well-known divergences in statistics, such as the Jensen–Shannon divergence and the $\chi^{2}$ divergence. Although they are infeasible by Proposition 3.1, certain weakened versions of these divergences can still be used, as will be discussed later.

2. Expectation-based loss. The modeled distribution participates in the loss through expectations. Since both $P_{*}$ and $P$ can be seen as linear operators over test functions, it is natural to define the loss as a dual norm

L(P)=\sup_{\|D\|\leq 1}\int D(\mathbf{x})d(P-P_{*})(\mathbf{x})  (22)

where $\|\cdot\|$ is some user-specified functional norm. The test function $D$ is often called the discriminator, and such a loss is called an adversarial loss. If $\|\cdot\|$ is a Hilbert space norm, then the loss can also be defined by

L(P)=\sup_{D}\int D(\mathbf{x})d(P-P_{*})(\mathbf{x})-\|D\|^{2}  (23)

There are several classical examples of adversarial losses: If $\|\cdot\|$ is the $C_{0}$ norm, then (22) becomes the total variation norm $\|P-P_{*}\|_{\text{TV}}$. If $\|\cdot\|$ is the Lipschitz semi-norm, then (22) becomes the 1-Wasserstein metric by the Kantorovich-Rubinstein theorem [60, 117]. If $\|\cdot\|$ is the RKHS norm with some kernel $k$, then (22) becomes the maximum mean discrepancy (MMD) [41], and (23) is the squared MMD:

L(P)=\frac{1}{2}\iint k(\mathbf{x},\mathbf{x}^{\prime})~{}d(P_{*}-P)(\mathbf{x})d(P_{*}-P)(\mathbf{x}^{\prime})

which gives rise to the moment matching network [72].
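The squared MMD also admits a simple plug-in estimator from two sample sets. The sketch below uses a Gaussian kernel with an arbitrary bandwidth and the biased V-statistic form, purely for illustration.

```python
# Plug-in estimate of the squared MMD with a Gaussian kernel (biased V-statistic, illustrative).
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """(1/2) iint k d(P_* - P) d(P_* - P), estimated from x ~ P_* and y ~ P."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return 0.5 * (k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

x = np.random.randn(500, 2) + 1.0     # samples from P_*
y = np.random.randn(500, 2)           # samples from the model P
print(mmd2(x, y))
```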

In practice, the discriminator $D$ is usually parametrized by a neural network, denoted by $D_{\theta}$ with parameter $\theta$. One common choice of the norm $\|\cdot\|$ is simply the $l^{\infty}$ norm on $\theta$

L(P)=\sup_{\|\theta\|_{\infty}\leq 1}\int D_{\theta}~{}d(P-P_{*})  (24)

This formulation gives rise to the WGAN model [7], and the $l^{\infty}$ bound can be conveniently implemented by weight clipping. The loss (24) and its variants are generally known as neural network distances [8, 133, 45, 37].

The strength of the metric (e.g. the fineness of its topology) is proportional to the size of the normed space of $\|\cdot\|$, or inversely proportional to the strength of $\|\cdot\|$. Once some global regularity, such as a Lipschitz bound, is imposed on the space, the dual norm $L$ becomes continuous with respect to the underlying geometry (e.g. the $W_{1}$ metric), and is no longer permutation invariant like NLL (21) or total variation. If we further restrict $D$ to certain sparse subspaces of the Lipschitz functions, in particular neural networks, then $L$ becomes insensitive to the “high frequency” parts of $\mathcal{P}(\mathbb{R}^{d})$. As we will demonstrate in Section 6, this property is the source of good generalization.

Note that there are some variants of the GAN loss that resemble (23) but whose norms $\|D\|$ are so weak that the dual norms are no longer well-defined. For instance, the loss with $L^{2}$ penalty from [122]

L(P)=\sup_{\theta}\int D_{\theta}~{}d(P-P_{*})-\|D_{\theta}\|_{L^{2}(P+P_{*})}^{2}

or the loss with Lipschitz penalty $\|1-\|\nabla D_{\theta}\|\|_{L^{2}(P)}^{2}$ from [44]. By [127, Proposition 5], in general we have $L(P)=\infty$. Nevertheless, if we consider one-time-scale training such that $P$ and $D$ are trained with similar learning rates, then this blow-up can be avoided [127].

Beyond the dual norms (22), one can also consider divergences. Although Proposition 3.1 rules out divergences other than the KL divergence, one can consider weakened versions of the dual (variational) forms of these divergences. For instance, given any parametrized discriminator $D_{\theta}$, the Jensen–Shannon divergence can be bounded below by [39]

\text{JS}(P,P_{*})=\frac{1}{2}\text{KL}\Big{(}P\Big{\|}\frac{P+P_{*}}{2}\Big{)}+\frac{1}{2}\text{KL}\Big{(}P_{*}\Big{\|}\frac{P+P_{*}}{2}\Big{)}
=\sup_{q:\mathbb{R}^{d}\to[0,1]}\int\log q(\mathbf{x})dP(\mathbf{x})+\int\log(1-q(\mathbf{x}))dP_{*}(\mathbf{x})+2\ln 2
\geq\sup_{\theta}\int\log\Big{(}\frac{e^{D_{\theta}(\mathbf{x})}}{1+e^{D_{\theta}(\mathbf{x})}}\Big{)}dP(\mathbf{x})+\int\log\Big{(}\frac{1}{1+e^{D_{\theta}(\mathbf{x})}}\Big{)}dP_{*}(\mathbf{x})+2\ln 2  (25)

This lower bound gives rise to the earliest version of GAN [39]. GANs based on other divergences have been studied in [83, 78].

3. Regression loss. The regression loss is used exclusively by the fixed generator representation discussed in Section 3.2. If a target generator $G_{*}$ has been specified, then we simply use the $L^{2}$ loss over the base distribution $\mathbb{P}$

L(G)=\frac{1}{2}\|G-G_{*}\|_{L^{2}(\mathbb{P})}^{2}  (26)

If a target velocity field $V_{*}$ has been specified, then the loss is integrated over the interpolant distributions $P_{\tau}$

L(V)=\frac{1}{2}\int_{0}^{T}\big{\|}V(\cdot,\tau)-V_{*}(\cdot,\tau)\big{\|}_{L^{2}(P_{\tau})}^{2}d\tau  (27)

or equivalently, we use the $L^{2}(M_{*})$ loss with the joint distribution $M_{*}$ defined by (13).

3.4 Combination

Having discussed the distribution representations and loss types, we can now combine them to derive the distribution learning models in Table 1. Our focus will be on the four highlighted classes in the table.

Density + Potential. Since Proposition 3.1 indicates that the negative log-likelihood (NLL) (21) is the only feasible density-based loss, we simply insert the potential-based representation (5) into NLL, and obtain a loss in the potential function $V$,

L(V)=\int VdP_{*}+\ln\int e^{-V}d\mathbb{P}  (28)

This formulation gives rise to the bias-potential model [115, 13], also known as variationally enhanced sampling.
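A sketch of training this loss from samples is given below: the expectation under $P_{*}$ is replaced by the empirical $P_{*}^{(n)}$, and the log-partition term is estimated by a simple (biased, but consistent) Monte Carlo average over the base distribution. The network, data and batch sizes are illustrative placeholders.

```python
# Bias-potential loss (28) with a Monte Carlo estimate of the log-partition term (illustrative).
import math
import torch
import torch.nn as nn

d = 2
V = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 1))   # potential V(x)
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
x_data = torch.randn(10_000, d) * 0.5 + 1.0                         # placeholder samples from P_*

for step in range(2000):
    x = x_data[torch.randint(len(x_data), (256,))]                  # expectation under P_*^{(n)}
    z = torch.randn(4096, d)                                        # Monte Carlo samples from P = N(0, I)
    log_Z = torch.logsumexp(-V(z).squeeze(1), 0) - math.log(4096)   # estimates ln E_P[e^{-V}]
    loss = V(x).mean() + log_Z
    opt.zero_grad(); loss.backward(); opt.step()
```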

Density + Free generator. In order to insert the transport representation $P=G\#\mathbb{P}$ into NLL (21), we need to be able to compute the density $P(\mathbf{x})$. For simple cases, such as when $\mathbb{P}$ is Gaussian and $G$ is affine, the density $P(\mathbf{x})$ has a closed form expression. Yet, in realistic scenarios when $P$ needs to satisfy the universal approximation property and thus has a complicated form, one has to rely on indirect calculations. There are three common approaches:

  1.

    Change of variables (for normalizing flows): If $G$ is a $C^{1}$ diffeomorphism, the density of $P$ is given by the change of variables formula

    P(\mathbf{x})=\det\nabla G^{-1}(\mathbf{x})~{}\mathbb{P}(G^{-1}(\mathbf{x}))

    Usually $\mathbb{P}$ is set to the unit Gaussian $\mathcal{N}$. Then, the NLL loss (21) becomes

    L(G)=-\int\log\det\nabla G^{-1}(\mathbf{x})+\log\mathbb{P}(G^{-1}(\mathbf{x}))dP_{*}(\mathbf{x})
    =\int\log\det\nabla G\big{(}G^{-1}(\mathbf{x})\big{)}+\frac{1}{2}\|G^{-1}(\mathbf{x})\|^{2}dP_{*}(\mathbf{x})+\text{constant}

    If $G$ is modeled by a flow $\{G_{\tau},\tau\in[0,1]\}$ (7) with velocity field $V$, then its Jacobian satisfies

    \frac{d}{d\tau}\det\nabla G_{\tau}(\mathbf{x}_{0})=\frac{d}{d\tau}\det\nabla\Big{(}\mathbf{x}_{0}+\int_{0}^{\tau}V\big{(}G_{s}(\mathbf{x}_{0}),s\big{)}ds\Big{)}
    =\frac{d}{d\tau}\det\Big{(}I+\int_{0}^{\tau}\nabla V\big{(}G_{s}(\mathbf{x}_{0}),s\big{)}\nabla G_{s}(\mathbf{x}_{0})ds\Big{)}
    =\text{Tr}\Big{[}\big{(}\nabla V(G_{\tau}(\mathbf{x}_{0}),\tau)\nabla G_{\tau}(\mathbf{x}_{0})\big{)}\big{(}\nabla G_{\tau}(\mathbf{x}_{0})\big{)}^{-1}\Big{]}\det\nabla G_{\tau}(\mathbf{x}_{0})
    =\text{Tr}\big{[}\nabla V(G_{\tau}(\mathbf{x}_{0}),\tau)\big{]}\det\nabla G_{\tau}(\mathbf{x}_{0})

    It follows that

    \log\det\nabla G(\mathbf{x}_{0})=\int_{0}^{1}\text{Tr}\big{[}\nabla V(G_{\tau}(\mathbf{x}_{0}),\tau)\big{]}d\tau

    and this is known as Abel’s formula [113]. Hence, we obtain the loss of the normalizing flow model [112, 21]

    L(V)=\iint_{0}^{1}\text{Tr}\big{[}\nabla V(\mathbf{x}_{\tau},\tau)\big{]}d\tau+\frac{1}{2}\|\mathbf{x}_{0}\|^{2}dP_{*}(\mathbf{x}_{1}),\quad\mathbf{x}_{\tau}:=G_{\tau}(G^{-1}(\mathbf{x}_{1}))  (29)

    Moreover, if the velocity field is defined by a gradient field $V(\cdot,\tau)=\nabla\phi_{\tau}$ as discussed in (9), then the loss has the simpler form

    L(\phi)=\iint_{0}^{1}\Delta\phi_{\tau}(\mathbf{x}_{\tau})d\tau+\frac{1}{2}\|\mathbf{x}_{0}\|^{2}dP_{*}(\mathbf{x}_{1})

    which leads to the Monge-Ampère flow model [131].

    One potential shortcoming of NF is that diffeomorphisms might not be suitable for the generator when the target distribution $P_{*}$ is singular, e.g. concentrated on low-dimensional manifolds, which is expected for real data such as images. To approximate $P_{*}$, the generator $G$ needs to shrink the mass of $\mathbb{P}$ onto negligible sets, and thus $G^{-1}$ blows up. As $G^{-1}$ is involved in the loss, this can cause the training process to be unstable.

  2.

    Variational lower bound (for VAE): Unlike NF, the variational autoencoders do not require the generator to be invertible, and instead use its posterior distribution. The generator can be generalized to allow for random output, and we define the conditional distribution

    P(\cdot|\mathbf{z})=\text{law}(X),\quad X=G(\mathbf{z})

    The generalized inverse can be defined as the conditional distribution $Q_{*}(\cdot|\mathbf{x})$ that satisfies

    P(\mathbf{x}|\mathbf{z})\mathbb{P}(\mathbf{z})=P(\mathbf{x})Q_{*}(\mathbf{z}|\mathbf{x}),\quad P(\mathbf{x})=\int P(\mathbf{x}|\mathbf{z})d\mathbb{P}(\mathbf{z})

    in the distribution sense. If the generator is deterministic, i.e. $P(\cdot|\mathbf{z})=\delta_{G(\mathbf{z})}$, and invertible, then $Q_{*}(\cdot|\mathbf{x})$ is simply $\delta_{G^{-1}(\mathbf{x})}$. It follows that the KL divergence (3) can be written as

    \text{KL}(P_{*}\|P)=\int\log\frac{P_{*}(\mathbf{x})}{P(\mathbf{x})}+\text{KL}\big{(}Q_{*}(\cdot|\mathbf{x})\big{\|}Q_{*}(\cdot|\mathbf{x})\big{)}dP_{*}(\mathbf{x})
    =\min_{Q(\cdot|\mathbf{x})}\iint\log\frac{P_{*}(\mathbf{x})}{P(\mathbf{x})}+\text{KL}\big{(}Q(\cdot|\mathbf{x})\big{\|}Q_{*}(\cdot|\mathbf{x})\big{)}dP_{*}(\mathbf{x})
    =\min_{Q(\cdot|\mathbf{x})}\iint\log\frac{P_{*}(\mathbf{x})Q(\mathbf{z}|\mathbf{x})}{P(\mathbf{x})Q_{*}(\mathbf{z}|\mathbf{x})}dQ(\mathbf{z}|\mathbf{x})dP_{*}(\mathbf{x})
    =\min_{Q(\cdot|\mathbf{x})}\iint\log\frac{P_{*}(\mathbf{x})Q(\mathbf{z}|\mathbf{x})}{P(\mathbf{x}|\mathbf{z})\mathbb{P}(\mathbf{z})}dQ(\mathbf{z}|\mathbf{x})dP_{*}(\mathbf{x})
    =\min_{Q(\cdot|\mathbf{x})}\text{KL}\big{(}P_{*}Q(\cdot|\mathbf{x})~{}\big{\|}~{}P(\cdot|\mathbf{z})\mathbb{P}\big{)}

    This is an example of the variational lower bound [64], and the NLL loss (21) now becomes

    \min_{P(\cdot|\mathbf{z})}\min_{Q(\cdot|\mathbf{x})}\iint-\log P(\mathbf{x}|\mathbf{z})dQ(\mathbf{z}|\mathbf{x})+\text{KL}\big{(}Q(\mathbf{z}|\mathbf{x})\big{\|}\mathbb{P}(\mathbf{z})\big{)}dP_{*}(\mathbf{x})

    which is the loss of VAE. To make the problem more tractable, the decoder $P(\cdot|\mathbf{z})$ and encoder $Q(\cdot|\mathbf{x})$ are usually parametrized by diagonal Gaussian distributions [64]:

    P(\cdot|\mathbf{z})=\mathcal{N}\big{(}G(\mathbf{z}),\text{diag}(e^{\mathbf{s}(\mathbf{z})})\big{)},\quad Q(\cdot|\mathbf{x})=\mathcal{N}\big{(}F(\mathbf{x}),\text{diag}(e^{\mathbf{v}(\mathbf{x})})\big{)}

    where $G,F,\mathbf{s},\mathbf{v}$ are parametrized functions $\mathbb{R}^{d}\to\mathbb{R}^{d}$ such as neural networks, and $\exp$ is taken entry-wise. Using the formula for the KL divergence between Gaussians

    \text{KL}\big{(}\mathcal{N}(m_{0},\Sigma_{0})\big{\|}\mathcal{N}(m_{1},\Sigma_{1})\big{)}=\frac{1}{2}\Big{[}\log\frac{\det\Sigma_{1}}{\det\Sigma_{0}}-d+\text{Tr}[\Sigma_{1}^{-1}\Sigma_{0}]+(m_{1}-m_{0})^{T}\Sigma_{1}^{-1}(m_{1}-m_{0})\Big{]}

    we can show that, up to constant, the VAE loss equals

    \min_{G,\mathbf{s}}\min_{F,\mathbf{v}}~{}\frac{1}{2}\iint\frac{\big{\|}\mathbf{x}-G\big{(}F(\mathbf{x})+e^{\mathbf{v}(\mathbf{x})}\odot\omega\big{)}\big{\|}^{2}}{e^{\mathbf{s}(F(\mathbf{x})+e^{\mathbf{v}(\mathbf{x})}\odot\omega)}}+\sum_{i=1}^{d}\mathbf{s}_{i}\big{(}F(\mathbf{x})+e^{\mathbf{v}(\mathbf{x})}\odot\omega\big{)}d\mathcal{N}(\omega)
    +\|F(\mathbf{x})\|^{2}+\sum_{i=1}^{d}e^{\mathbf{v}_{i}(\mathbf{x})}-\mathbf{v}_{i}(\mathbf{x})~{}dP_{*}(\mathbf{x})

    where $\odot$ is the entry-wise product. This loss resembles the classical autoencoder [1, 102]

    \min_{F,G}\int\frac{\|\mathbf{x}-G(F(\mathbf{x}))\|^{2}}{2}dP_{*}(\mathbf{x})

    and thus $P(\cdot|\mathbf{z}),Q(\cdot|\mathbf{x})$ are referred to as the decoder and encoder.

  3.

    Factorization (for autoregressive model): To model a distribution PP over sequential data 𝐗=[𝐱1,𝐱l]\mathbf{X}=[\mathbf{x}_{1},\dots\mathbf{x}_{l}], one can choose a generator GG that is capable of processing variable-length inputs [𝐱1,𝐱i][\mathbf{x}_{1},\dots\mathbf{x}_{i}], such as the Transformer network [116] or recurrent networks [99], and define the distribution by

    P(𝐗)\displaystyle P(\mathbf{X}) =i=1lP(𝐱i|𝐱1,𝐱i1)\displaystyle=\prod_{i=1}^{l}P(\mathbf{x}_{i}|\mathbf{x}_{1},\dots\mathbf{x}_{i-1})
    P(|𝐱1,𝐱i1)\displaystyle P(\cdot|\mathbf{x}_{1},\dots\mathbf{x}_{i-1}) =law(Xi),XiG(Z|𝐱1,𝐱i1),Z\displaystyle=\text{law}(X_{i}),\quad X_{i}\sim G(Z|\mathbf{x}_{1},\dots\mathbf{x}_{i-1}),\quad Z\sim\mathbb{P}

    Then, NLL (21) is reduced to

-\int\log P(\mathbf{X})\,dP_{*}(\mathbf{X})=-\int\sum_{i=1}^{l}\log P(\mathbf{x}_{i}|\mathbf{x}_{1},\dots\mathbf{x}_{i-1})\,dP_{*}(\mathbf{X})

    Usually, each P(|𝐱1,𝐱i1)P(\cdot|\mathbf{x}_{1},\dots\mathbf{x}_{i-1}) has a simple parametrization such as Gaussian or Softmax so that logP(𝐱i|𝐱1,𝐱i1)\log P(\mathbf{x}_{i}|\mathbf{x}_{1},\dots\mathbf{x}_{i-1}) is tractable [86, 92].
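As an illustration, here is a minimal sketch (NumPy) of the NLL of a single sequence under a discrete autoregressive model; logits_fn is a hypothetical callable (e.g. a Transformer) that maps a prefix to unnormalized scores over the vocabulary.

import numpy as np

def autoregressive_nll(tokens, logits_fn):
    # -log P(X) = -sum_i log P(x_i | x_1, ..., x_{i-1}) for a discrete sequence
    nll = 0.0
    for i in range(len(tokens)):
        scores = logits_fn(tokens[:i])                        # condition on the prefix
        log_probs = scores - np.log(np.sum(np.exp(scores)))   # log-softmax
        nll -= log_probs[tokens[i]]
    return nll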

Expectation + Free generator. By the definition (6) of the transport representation P=G#P=G\#\mathbb{P},

f(𝐱)𝑑P(𝐱)=f(G(𝐳))𝑑(𝐳)\int f(\mathbf{x})dP(\mathbf{x})=\int f(G(\mathbf{z}))d\mathbb{P}(\mathbf{z})

for all measurable functions f. Then, the classical GAN loss (25) becomes

\min_{G}\max_{D}\int\log\Big(\frac{e^{D(G(\mathbf{z}))}}{1+e^{D(G(\mathbf{z}))}}\Big)d\mathbb{P}(\mathbf{z})+\int\log\Big(\frac{1}{1+e^{D(\mathbf{x})}}\Big)dP_{*}(\mathbf{x}) (30)

Similarly, the WGAN loss (24) becomes

\min_{G}\max_{\|\theta\|_{\infty}\leq 1}\int D_{\theta}(G(\mathbf{z}))\,d\mathbb{P}(\mathbf{z})-\int D_{\theta}(\mathbf{x})\,dP_{*}(\mathbf{x}) (31)
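For concreteness, the following sketch (NumPy, with a hypothetical generator G and discriminator D returning a scalar logit) estimates the objectives (30) and (31) from finite batches of latent samples z drawn from the base distribution and data samples x drawn from the target.

import numpy as np

def gan_objective(D, G, z_batch, x_batch):
    # Monte Carlo estimate of the classical GAN objective (30)
    d_fake = np.array([D(G(z)) for z in z_batch])
    d_real = np.array([D(x) for x in x_batch])
    # log sigmoid(d_fake) + log(1 - sigmoid(d_real)), written in a numerically stable way
    return np.mean(-np.logaddexp(0.0, -d_fake)) + np.mean(-np.logaddexp(0.0, d_real))

def wgan_objective(D, G, z_batch, x_batch):
    # Monte Carlo estimate of the WGAN objective (31) for a fixed critic D
    return np.mean([D(G(z)) for z in z_batch]) - np.mean([D(x) for x in x_batch])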

Regression + Fixed generator. For the case with fully deterministic coupling (10), a target generator GG_{*} is provided by numerical optimal transport, and then fitted by a parametrized function GG with the regression loss (26). This formulation leads to the generative model [132]. (Moreover, a few models with some technical variations [4, 5, 98] are related to this category, but for simplicity we do not describe them here.)

For the case with fully random coupling (11), we fit either the score function logPτ\nabla\log P_{\tau} from (18) or the velocity field VV_{*} from (15) using the regression loss (27). Note that the targets (18, 15) are both defined by expectations and thus the loss (27) cannot be computed directly. Thanks to the linearity of expectation, we can expand the loss to make the computation tractable.

Model the score function logPτ\nabla\log P_{\tau} by a velocity field 𝐬:d×[0,T]d\mathbf{s}:\mathbb{R}^{d}\times[0,T]\to\mathbb{R}^{d} and let λτ>0\lambda_{\tau}>0 be a user-specified weight. The regression loss can be written as

\displaystyle L(\mathbf{s}):=\frac{1}{2}\int_{0}^{T}\lambda_{\tau}\|\mathbf{s}(\cdot,\tau)-\nabla\log P_{\tau}\|_{L^{2}(P_{\tau})}^{2}\,d\tau
\displaystyle=\int_{0}^{T}\lambda_{\tau}\int\frac{1}{2}\|\mathbf{s}(\mathbf{x},\tau)\|^{2}-\mathbf{s}(\mathbf{x},\tau)\cdot\nabla\log P_{\tau}(\mathbf{x})~dP_{\tau}(\mathbf{x})\,d\tau+C
\displaystyle=\int_{0}^{T}\lambda_{\tau}\int\frac{1}{2}\|\mathbf{s}(\mathbf{x},\tau)\|^{2}\,dP_{\tau}(\mathbf{x})-\lambda_{\tau}\int\mathbf{s}(\mathbf{x},\tau)\cdot\nabla P_{\tau}(\mathbf{x})\,d\mathbf{x}~d\tau+C
\displaystyle=\int_{0}^{T}\lambda_{\tau}\iint\frac{1}{2}\|\mathbf{s}(\mathbf{x},\tau)\|^{2}\,d\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{0})
\displaystyle\quad-\lambda_{\tau}\int\mathbf{s}(\mathbf{x},\tau)\cdot\nabla\int\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{0})\,d\mathbf{x}~d\tau+C
\displaystyle=\int_{0}^{T}\lambda_{\tau}\iint\frac{1}{2}\|\mathbf{s}(\mathbf{x},\tau)\|^{2}-\mathbf{s}(\mathbf{x},\tau)\cdot\nabla\log\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})~d\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{0})\,d\tau+C
\displaystyle=\int_{0}^{T}\frac{\lambda_{\tau}}{2}\iint\|\mathbf{s}(\mathbf{x},\tau)-\nabla\log\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})\|^{2}\,d\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{0})\,d\tau+C

Since the conditional distribution πτ(𝐱|𝐱0)\pi_{\tau}(\mathbf{x}|\mathbf{x}_{0}) is the isotropic Gaussian (17), this loss is straightforward to evaluate. Thus, we obtain the loss of the score-based diffusion models [104, 106, 49, 107]

L(\mathbf{s})=\int_{0}^{T}\frac{\lambda_{\tau}}{2}\iint\Big\|\mathbf{s}\big(e^{-\frac{1}{2}\int_{0}^{\tau}\beta_{s}ds}\mathbf{x}_{0}+\sqrt{1-e^{-\int_{0}^{\tau}\beta_{s}ds}}\,\omega,\tau\big)+\frac{\omega}{\sqrt{1-e^{-\int_{0}^{\tau}\beta_{s}ds}}}\Big\|^{2}d\mathcal{N}(\omega)\,dP_{*}(\mathbf{x}_{0})\,d\tau (32)
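As a sketch of how (32) is evaluated in practice, the following NumPy snippet computes a one-sample Monte Carlo estimate of the integrand, assuming for simplicity a constant noise schedule \beta_{s}\equiv\beta (an assumption); the score model s is a hypothetical callable s(x, tau).

import numpy as np

def dsm_loss_sample(score, x0, tau, beta=1.0, weight=1.0, rng=np.random.default_rng(0)):
    # One-sample Monte Carlo estimate of the denoising score-matching loss (32)
    a = np.exp(-0.5 * beta * tau)            # mean scaling e^{-(1/2) int_0^tau beta_s ds}
    sigma = np.sqrt(1.0 - a ** 2)            # std of the Gaussian perturbation kernel (17)
    omega = rng.standard_normal(x0.shape)
    x_tau = a * x0 + sigma * omega           # perturbed sample
    # the target score of the perturbation kernel is -omega / sigma
    return 0.5 * weight * np.sum((score(x_tau, tau) + omega / sigma) ** 2)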

Similarly, for the velocity field VV_{*} (15), using the definitions (13, 14) of the joint distribution MM_{*} and current density JJ_{*}, we can write the regression loss as

\displaystyle L(V):=\frac{1}{2}\int_{0}^{1}\|V(\cdot,\tau)-V_{*}(\cdot,\tau)\|_{L^{2}(P_{\tau})}^{2}\,d\tau
\displaystyle=\int\frac{1}{2}\|V(\mathbf{x},\tau)\|^{2}-V(\mathbf{x},\tau)\cdot V_{*}(\mathbf{x},\tau)~dM_{*}(\mathbf{x},\tau)+C
\displaystyle=\int\frac{1}{2}\|V(\mathbf{x},\tau)\|^{2}\,dM_{*}(\mathbf{x},\tau)-\int V\cdot dJ_{*}+C
\displaystyle=\int_{0}^{1}\iint\frac{1}{2}\big\|V\big((1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big)\big\|^{2}\,d\mathbb{P}(\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{1})\,d\tau
\displaystyle\quad-\int_{0}^{1}\iint(\mathbf{x}_{1}-\mathbf{x}_{0})\cdot V\big((1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big)\,d\mathbb{P}(\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{1})\,d\tau+C
\displaystyle=\frac{1}{2}\int_{0}^{1}\iint\big\|V\big((1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big)-(\mathbf{x}_{1}-\mathbf{x}_{0})\big\|^{2}\,d\mathbb{P}(\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{1})\,d\tau+C (33)

Thus, we obtain the loss of normalizing flow with stochastic interpolants [2, 75, 125].
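A one-sample Monte Carlo estimate of (33) is equally simple; the sketch below (NumPy) draws \tau uniformly on [0,1] and regresses a hypothetical velocity model V onto the difference \mathbf{x}_{1}-\mathbf{x}_{0} at the linear interpolant.

import numpy as np

def interpolant_loss_sample(V, x0, x1, rng=np.random.default_rng(0)):
    # One-sample Monte Carlo estimate of the interpolant regression loss (33)
    # x0 : sample from the base distribution, x1 : sample from the target distribution
    tau = rng.uniform(0.0, 1.0)
    x_tau = (1.0 - tau) * x0 + tau * x1      # linear interpolant
    target = x1 - x0                         # target velocity
    return 0.5 * np.sum((V(x_tau, tau) - target) ** 2)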

Other classes. Finally, we briefly remark on the rest of the classes in Table 1. For the combination “Density + Fixed generator”, it has been shown by [105] that the regression loss upper bounds the KL divergence. Specifically, if in the loss LL (32) we set the weight by λτ=βτ\lambda_{\tau}=\beta_{\tau} where βτ\beta_{\tau} is the noise scale in the SDE (16), then given any score function 𝐬\mathbf{s},

KL(PG𝐬#)L(𝐬)+KL(PT)\text{KL}(P_{*}\|G_{\mathbf{s}}\#\mathbb{P})\leq L(\mathbf{s})+\text{KL}(P_{T}\|\mathbb{P}) (34)

where G𝐬G_{\mathbf{s}} is the generator defined by the reverse-time SDE (19) with the score 𝐬\mathbf{s}. The result also holds for G𝐬G_{\mathbf{s}} defined by the reverse-time ODE (20) under a self-consistency assumption: let (G𝐬)1τ(G_{\mathbf{s}})_{1}^{\tau} be the reverse-time flow map of (20), then

𝐬(𝐱,τ)=log((G𝐬)1τ#)(𝐱)\mathbf{s}(\mathbf{x},\tau)=\nabla\log((G_{\mathbf{s}})_{1}^{\tau}\#\mathbb{P})(\mathbf{x})

The combinations “Expectation + Potential” and “Expectation + Fixed generator” are feasible, but we are not aware of representative models. The combinations “Regression + Potential” and “Regression + Free generator” do not seem feasible, since there is no clear target to regress against.

Remark 1 (Empirical loss).

As discussed in Section 3.3, the loss is almost always an expectation in the target distribution PP_{*}. Indeed, one can check that all the loss functions introduced in this section can be written in the abstract form

L(f)=F(f,𝐱)𝑑P(𝐱)L(f)=\int F(f,\mathbf{x})dP_{*}(\mathbf{x})

where ff is the parameter function and FF depends on the model. Thus, if only a finite sample set of PP_{*} is available, as is usually the case in practice, one can define the empirical loss

L^{(n)}(f)=\int F(f,\mathbf{x})\,dP^{(n)}_{*}(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}F(f,\mathbf{x}_{i}) (35)

where P(n)P_{*}^{(n)} is the empirical distribution.
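In code, the empirical loss (35) is simply a sample average of the per-example term; a minimal sketch (NumPy, with a hypothetical per-example functional F):

import numpy as np

def empirical_loss(F, f, samples):
    # Empirical loss (35): average of the per-sample term F(f, x_i) over the data set
    return np.mean([F(f, x) for x in samples])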

3.5 Function representation

Having parametrized the distributions and losses by abstract functions, the next step is to parametrize these functions by machine learning models such as neural networks. There is much freedom in this choice, such that any parametrization used in supervised learning should be applicable to most of the functions we have discussed. These include the generators and discriminators of GANs, the means and variances of the decoder and encoder of VAE, the potential function of the bias potential model, the score function of score-based diffusion models, and the velocity field of normalizing flows with stochastic interpolants. Some interesting applications are given by [27, 92, 58, 96].

One exception is the generator GG of the normalizing flows (29), which needs to be invertible with tractable Jacobian. As mentioned in Section 2, one approach is to parametrize GG as a sequence of invertible blocks whose Jacobians have closed-form formula (Example designs can be found in [29, 63, 89, 55]). Another approach is to represent GG as a flow, approximate this flow with numerical schemes, and solve for the traces Tr[V]\text{Tr}[\nabla V] in (29) (Examples of numerical schemes are given by [21, 131, 40]).

For the theoretical analysis in the rest of this paper, we need to fix a function representation. Since our focus is on the phenomena that are unique to learning distributions (e.g. memorization), we keep the function representation as simple as possible, while satisfying the minimum requirement of the universal approximation property among distributions (and thus the capacity for memorization). Specifically, we use the random feature functions [93, 35, 126].

Definition 1 (Random feature functions).

Let (d,k)\mathcal{H}(\mathbb{R}^{d},\mathbb{R}^{k}) be the space of functions that can be expressed as

f_{\mathbf{a}}(\mathbf{x})=\mathbb{E}_{\rho(\mathbf{w},b)}\big[\mathbf{a}(\mathbf{w},b)~\sigma(\mathbf{w}\cdot\mathbf{x}+b)\big] (36)

where ρ𝒫(d+1)\rho\in\mathcal{P}(\mathbb{R}^{d+1}) is a fixed parameter distribution and 𝐚L2(ρ,k)\mathbf{a}\in L^{2}(\rho,\mathbb{R}^{k}) is a parameter function. For simplicity, we use the notation \mathcal{H} when the input and output dimensions d,kd,k are clear.

Definition 2 (RKHS norm).

For any subset Ωd\Omega\subseteq\mathbb{R}^{d}, consider the quotient space

(Ω)=/{f𝐚𝟎 on Ω}\mathcal{H}(\Omega)=\mathcal{H}/\{f_{\mathbf{a}}\equiv\mathbf{0}\text{ on }\Omega\}

with the norm

\|f\|_{\mathcal{H}(\Omega)}=\inf\big\{\|\mathbf{a}\|_{L^{2}(\rho)}~\big|~f=f_{\mathbf{a}}\text{ on }\Omega\big\}

We use the notation \|\cdot\|_{\mathcal{H}} if \Omega is clear from context. By [26, 93], \mathcal{H}(\Omega) is a Hilbert space and \|\cdot\|_{\mathcal{H}(\Omega)} coincides with the norm of the reproducing kernel Hilbert space (RKHS) induced by the kernel

k(𝐱,𝐱)=𝔼ρ[σ(𝐰𝐱+b)σ(𝐰𝐱+b)]k(\mathbf{x},\mathbf{x}^{\prime})=\mathbb{E}_{\rho}\big{[}\sigma(\mathbf{w}\cdot\mathbf{x}+b)\sigma(\mathbf{w}\cdot\mathbf{x}^{\prime}+b)\big{]}

Furthermore, given any distribution 𝒫(Ω)\mathbb{P}\in\mathcal{P}(\Omega), we can define the following integral operator K:L2()L2()K:L^{2}(\mathbb{P})\to L^{2}(\mathbb{P}),

K(f)(\mathbf{x})=\int k(\mathbf{x},\mathbf{x}^{\prime})f(\mathbf{x}^{\prime})\,d\mathbb{P}(\mathbf{x}^{\prime}) (37)
Definition 3 (Time-dependent random feature function).

Given any V(d+1,d)V\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}), one can define a flow by

\displaystyle G_{V}(\mathbf{x}_{0})=\mathbf{x}_{1},\quad\frac{d}{d\tau}\mathbf{x}_{\tau}=V(\mathbf{x}_{\tau},\tau)

Define the flow-induced norm

V=expV\|V\|_{\mathcal{F}}=\exp\|V\|_{\mathcal{H}}

Our results adopt either of the following settings:

Assumption 3.1.

Assume that the activation σ\sigma is ReLU σ(x)=max(x,0)\sigma(x)=\max(x,0). Assume that the parameter distribution ρ\rho is supported on the l1l^{1} sphere {𝐰1+|b|=1}\{\|\mathbf{w}\|_{1}+|b|=1\} and has a positive and continuous density over this sphere.

Assumption 3.2.

Assume that the activation σ\sigma is sigmoid σ(x)=ex1+ex\sigma(x)=\frac{e^{x}}{1+e^{x}}. Assume that ρ\rho has a positive and continuous density function over d+1\mathbb{R}^{d+1} and also bounded variance

(𝐰2+b2)𝑑ρ(𝐰,b)1\int(\|\mathbf{w}\|^{2}+b^{2})d\rho(\mathbf{w},b)\leq 1

Given either assumption, the universal approximation theorems [51, 109] imply that the space (K,k)\mathcal{H}(K,\mathbb{R}^{k}) is dense among the continuous functions C(K,k)C(K,\mathbb{R}^{k}) with respect to the C0C_{0} norm for any compact subset KdK\subseteq\mathbb{R}^{d}. Also, by Lemma 9.1, (d,k)\mathcal{H}(\mathbb{R}^{d},\mathbb{R}^{k}) is dense in L2(P,k)L^{2}(P,\mathbb{R}^{k}) for any distribution P𝒫(d)P\in\mathcal{P}(\mathbb{R}^{d}).

The random feature functions (36) can be seen as a simplified form of neural networks, e.g. if we replace the parameter distribution ρ\rho by a finite sample set {(𝐰i,bi)}i=1m\{(\mathbf{w}_{i},b_{i})\}_{i=1}^{m}, then (36) becomes a 2-layer network with mm neurons and frozen first layer weights. Similarly, for the flow GVG_{V} in Definition 3, if the ODE is replaced by a forward Euler scheme, then GVG_{V} becomes a deep residual network whose layers share similar weights. Beyond the random feature functions, one can extend the analysis to the Barron functions [34, 11] and flow-induced functions [31], which are the continuous representations of 2-layer networks and residual networks.
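To make this connection concrete, below is a minimal sketch (NumPy) of the finite-sample version of (36), i.e. a 2-layer network with m frozen random features, and of a forward Euler discretization of the flow G_V from Definition 3, which takes the form of a residual network; all names are illustrative.

import numpy as np

def random_feature_fn(x, W, b, A, sigma=lambda u: np.maximum(u, 0.0)):
    # Finite-sample version of (36): f(x) = (1/m) sum_j A[j] * sigma(W[j].x + b[j])
    # W (m x d) and b (m,) are frozen random features sampled from rho; A (m x k) is trainable
    return A.T @ sigma(W @ x + b) / len(b)

def flow_generator(x0, V, n_steps=50):
    # Forward Euler discretization of the flow G_V in Definition 3
    x, dt = np.array(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * V(x, i * dt)                # residual update x <- x + dt * V(x, tau)
    return x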

3.6 Training rule

The training of distribution learning models is very similar to training supervised learning models, such that one chooses from the many algorithms for gradient descent and optimizes the function parameters. One exception is the GANs, whose losses are min-max problems of the form (30, 31) and are usually solved by performing gradient descent on the generator and gradient ascent on the discriminator [39].

For the theoretical analysis in this paper, we use the continuous time gradient descent. Specifically, given any loss L(f)L(f) over L2(,k)L^{2}(\mathbb{P},\mathbb{R}^{k}) for some 𝒫(d)\mathbb{P}\in\mathcal{P}(\mathbb{R}^{d}), we parametrize ff by the random feature function f𝐚f_{\mathbf{a}} from Definition 1 and denote the loss by L(𝐚)=L(f𝐚)L(\mathbf{a})=L(f_{\mathbf{a}}). Given any initialization 𝐚0L2(ρ,k)\mathbf{a}_{0}\in L^{2}(\rho,\mathbb{R}^{k}), we define the trajectory {𝐚t,t0}\{\mathbf{a}_{t},t\geq 0\} by the dynamics

\frac{d}{dt}\mathbf{a}_{t}=-\nabla_{\mathbf{a}}L\big|_{\mathbf{a}_{t}}=-\int\nabla_{f}L(\mathbf{x})~\sigma(\mathbf{w}\cdot\mathbf{x}+b)\,d\mathbb{P}(\mathbf{x}) (38)

It follows that the function ft=f𝐚tf_{t}=f_{\mathbf{a}_{t}} evolves by

ddtft=𝔼ρ(𝐰,b)[ddt𝐚t(𝐰,b)σ(𝐰𝐱+b)]=K(fL)\frac{d}{dt}f_{t}=\mathbb{E}_{\rho(\mathbf{w},b)}\Big{[}\frac{d}{dt}\mathbf{a}_{t}(\mathbf{w},b)~{}\sigma(\mathbf{w}\cdot\mathbf{x}+b)\Big{]}=-K(\nabla_{f}L) (39)

where KK is the integral operator defined in (37). Similarly, given the empirical loss L(n)L^{(n)} (35), we define the empirical training trajectory

ddtft(n)=K(fL(n))\frac{d}{dt}f_{t}^{(n)}=-K(\nabla_{f}L^{(n)}) (40)

By default, we use the initialization:

𝐚0=𝐚0(n)𝟎\mathbf{a}_{0}=\mathbf{a}^{(n)}_{0}\equiv\mathbf{0} (41)

or equivalently f0=f0(n)𝟎f_{0}=f^{(n)}_{0}\equiv\mathbf{0}.
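For a concrete instance of the gradient flow (38), the sketch below (NumPy) performs discrete-time gradient descent on the coefficient matrix A of a finite random-feature model, for the simple regression loss L(f)=\frac{1}{2}\|f-f_{*}\|^{2}_{L^{2}(\mathbb{P})} estimated on samples X from \mathbb{P}, with the zero initialization (41); the setup mirrors Example 1 (kernel regression) in Section 7.1.

import numpy as np

def train_random_features(X, f_target, W, b, lr=0.5, n_steps=1000,
                          sigma=lambda u: np.maximum(u, 0.0)):
    # Discrete-time version of the gradient flow (38) for L(f) = 1/2 ||f - f_*||^2,
    # with the base distribution replaced by the sample matrix X (n x d)
    Phi = sigma(X @ W.T + b)                    # n x m frozen feature matrix
    Y = np.stack([f_target(x) for x in X])      # n x k target values
    m = W.shape[0]
    A = np.zeros((m, Y.shape[1]))               # zero initialization (41)
    for _ in range(n_steps):
        residual = Phi @ A / m - Y              # f_a(x_i) - f_*(x_i)
        A -= lr * Phi.T @ residual / (m * len(X))   # gradient of the loss w.r.t. A
    return A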

3.7 Test error

For our theoretical analysis, given a modeled distribution PP and target distribution PP_{*}, we measure the test error by either the Wasserstein metric W2(P,P)W_{2}(P_{*},P) or KL-divergence KL(PP)\text{KL}(P_{*}\|P). As discussed in Section 1, W2W_{2} exhibits the curse of dimensionality, while KL is stronger than W2W_{2}. Thus, they are capable of detecting memorization and can distinguish the solutions that generalize well.

In addition, one advantage of the W2W_{2} metric is that it can be related to the regression loss.

Proposition 3.2 (Proposition 21 of [127]).

Given any base distribution 𝒫2,ac(d)\mathbb{P}\in\mathcal{P}_{2,ac}(\mathbb{R}^{d}) and any target distribution P𝒫2(d)P_{*}\in\mathcal{P}_{2}(\mathbb{R}^{d}), for any GL2(,d)G\in L^{2}(\mathbb{P},\mathbb{R}^{d})

W2(P,G#)=inf{GGL2()|GL2(,d),P=G#}W_{2}(P_{*},G\#\mathbb{P})=\inf\big{\{}\|G-G_{*}\|_{L^{2}(\mathbb{P})}~{}|~{}G_{*}\in L^{2}(\mathbb{P},\mathbb{R}^{d}),~{}P_{*}=G_{*}\#\mathbb{P}\big{\}}

So effectively, the W2W_{2} test error is the L2L^{2} error with the closest target generator.

One remark is that these test losses are only applicable to theoretical analysis. In practice, we only have a finite sample set from PP_{*}, and the curse of dimensionality becomes an obstacle to meaningful evaluation.

4 Universal Approximation Theorems

It is not surprising that distribution learning models equipped with neural networks satisfy the universal approximation property in the space of probability distributions. The significance is that the models in general have the capacity for memorization. This section confirms that the universal approximation property holds for all three distribution representations introduced in Section 3.2. Since our results are proved with the random feature functions (Definitions 1 and 3), they hold for more expressive function parametrizations such as 2-layer and deep neural networks.

For the free generator representation, the following result is straightforward.

Proposition 4.1.

Given either Assumption 3.1 or 3.2, for any base distribution 𝒫2,ac(d)\mathbb{P}\in\mathcal{P}_{2,ac}(\mathbb{R}^{d}), the set of distributions generated by the random feature functions (d,d)\mathcal{H}(\mathbb{R}^{d},\mathbb{R}^{d}) are dense with respect to the W2W_{2} metric. Specifically, for any P𝒫2(d)P_{*}\in\mathcal{P}_{2}(\mathbb{R}^{d}),

infGW2(P,G#)=0\inf_{G\in\mathcal{H}}W_{2}(P_{*},G\#\mathbb{P})=0

In particular, G#G\#\mathbb{P} can approximate the empirical distribution P(n)P_{*}^{(n)}.

4.1 Potential representation

Consider the potential-based representation (5). Let KdK\subseteq\mathbb{R}^{d} be any compact set with positive Lebesgue measure, let 𝒫ac(K)C(K)\mathcal{P}_{ac}(K)\cap C(K) be the space of distributions with continuous density functions, and let the base distribution \mathbb{P} be uniform over KK.

Proposition 4.2.

Given either Assumption 3.1 or 3.2, the set of probability distributions

𝒫={1ZeV|V(d,)}\mathcal{P}_{\mathcal{H}}=\Big{\{}\frac{1}{Z}e^{-V}\mathbb{P}~{}\Big{|}~{}V\in\mathcal{H}(\mathbb{R}^{d},\mathbb{R})\Big{\}}

is dense in

  • 𝒫(K)\mathcal{P}(K) under the Wasserstein metric WpW_{p} (1p<1\leq p<\infty),

  • 𝒫ac(K)\mathcal{P}_{ac}(K) under the total variation norm TV\|\cdot\|_{\text{TV}},

  • 𝒫ac(K)C(K)\mathcal{P}_{ac}(K)\cap C(K) under KL divergence.

4.2 Flow-based free generator

For the normalizing flows, we have seen in Section 3.4 the two common approaches for modeling the generator G, i.e. the continuous-time flow (2) or a composition of simple diffeomorphisms. Both approaches have an apparent limitation: they do not satisfy the universal approximation property among functions. Since G is always a diffeomorphism, it cannot approximate, for instance, functions that are non-injective or orientation-reversing (such as x\mapsto|x| and x\mapsto-x). Hence, the approach of Proposition 4.1 is not applicable.

Nevertheless, to transport probability distributions, it is sufficient to restrict to specific kinds of generators, for instance gradient fields \nabla\phi according to Brenier's theorem [14, 117]. Using flows induced by random feature functions, we have the following result.

Proposition 4.3.

Given Assumption 3.2, for any base distribution 𝒫2,ac(d)\mathbb{P}\in\mathcal{P}_{2,ac}(\mathbb{R}^{d}), the following set of distributions is dense in 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}) with respect to the W2W_{2} metric

𝒫G={GV#|V(d+1,d)}\mathcal{P}_{G}=\big{\{}G_{V}\#\mathbb{P}~{}\big{|}~{}V\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d})\big{\}}

where GVG_{V} is given by Definition 3.

4.3 Fixed generator

For the fixed generator representation, we analyze the normalizing flow with stochastic interpolants (15) instead of the score-based diffusion models (16), since the former has a simpler formulation. As the target velocity field VV_{*} has been specified, we show a stronger result than Propositions 4.1 and 4.3, such that we can simultaneously bound the W2W_{2} test error and the L2L^{2} training loss.

Proposition 4.4 (Proposition 3.2 of [125]).

Given Assumption 3.2, assume that the base distribution \mathbb{P} is compactly-supported and has C2C^{2} density. For any target distribution PP_{*} that is compactly-supported, let VV_{*} be the velocity field (15). Then, for any ϵ>0\epsilon>0, there exists a velocity field Vϵ(d+1,d)V_{\epsilon}\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}) with induced generator Gϵ=GVϵG_{\epsilon}=G_{V_{\epsilon}} given by Definition 3, such that

\displaystyle W_{2}(P_{*},G_{\epsilon}\#\mathbb{P})<\epsilon
\displaystyle\|V_{*}-V_{\epsilon}\|_{L^{2}(M_{*})}=\sqrt{2}\sqrt{L(V_{\epsilon})-L(V_{*})}<\epsilon

where MM_{*} is the joint distribution (13) and LL is the loss (33).

5 Memorization

The previous section has shown that the distribution learning models, from all known distribution representations, satisfy the universal approximation property. In particular, they are capable of approximating the empirical distribution P(n)P_{*}^{(n)} and thus have the potential for memorization. This section confirms that memorization is inevitable for some of the models. Specifically, we survey our results on the universal convergence property, that is, the ability of a model to converge to any given distribution during training. We believe that this property holds for other models as well, and it should be satisfied by any desirable model for learning probability distributions.

5.1 Bias-potential model

Recall that the bias-potential model is parametrized by potential functions VV (5) and minimizes the loss LL (28). For any compact set KdK\subseteq\mathbb{R}^{d} with positive Lebesgue measure, let the base distribution \mathbb{P} be uniform over KK. Parametrize VV by random feature functions (d,)\mathcal{H}(\mathbb{R}^{d},\mathbb{R}), and define the training trajectory VtV_{t} by continuous time gradient descent (39) on LL with any initialization V0V_{0}\in\mathcal{H}. Denote the modeled distribution by Pt=1ZeVtP_{t}=\frac{1}{Z}e^{-V_{t}}\mathbb{P}. Similarly, let Vt(n)V_{t}^{(n)} be the training trajectory on the empirical loss (35) and denote its modeled distribution by Pt(n)P_{t}^{(n)}.

Proposition 5.1 (Lemma 3.8 of [126]).

Given Assumption 3.1, for any target distribution P𝒫(K)P_{*}\in\mathcal{P}(K), if PtP_{t} has only one weak limit, then PtP_{t} converges weakly to PP_{*}

limtW2(P,Pt)=0\lim_{t\to\infty}W_{2}(P_{*},P_{t})=0
Corollary 5.2 (Memorization (Proposition 3.7 of [126])).

Given Assumption 3.1, the training trajectory Pt(n)P_{t}^{(n)} can only converge to the empirical distribution P(n)P_{*}^{(n)}. Moreover, both the test error and the norm of the potential function diverge

limtKL(PPt(n))=limtVt(n)=\lim_{t\to\infty}\text{KL}(P_{*}\|P_{t}^{(n)})=\lim_{t\to\infty}\|V_{t}^{(n)}\|_{\mathcal{H}}=\infty

In the setting of Proposition 5.1, a limit point always exists, but we have to exclude the possibility of more than one limit point, e.g. the trajectory P_{t} may converge to a limit cycle. We believe that a more refined analysis can prove that such an exotic scenario cannot happen.

5.2 GAN discriminator

As will be demonstrated in Section 7, the training and convergence of models with the free generator representation are in general difficult to analyze. Thus, we consider the simplified GAN model from [127], in which the representation G\#\mathbb{P} is replaced by a density function P.

Consider the GAN loss (23), and parametrize the discriminator by D(d,)D\in\mathcal{H}(\mathbb{R}^{d},\mathbb{R}). Equivalently, we set the penalty term D\|D\| to be the RKHS norm \|\cdot\|_{\mathcal{H}} and the loss (23) becomes

\displaystyle L(P)=\sup_{D}\int D(\mathbf{x})\,d(P-P_{*})(\mathbf{x})-\|D\|^{2}_{\mathcal{H}}
\displaystyle=\max_{a}\iint a(\mathbf{w},b)\sigma(\mathbf{w}\cdot\mathbf{x}+b)\,d\rho(\mathbf{w},b)\,d(P-P_{*})(\mathbf{x})-\|a\|^{2}_{L^{2}(\rho)}
\displaystyle=\frac{1}{2}\iint k(\mathbf{x},\mathbf{x}^{\prime})\,d(P-P_{*})(\mathbf{x})\,d(P-P_{*})(\mathbf{x}^{\prime}) (42)

where kk is the kernel function from Definition 2. This loss is an instance of the maximum mean discrepancy [41]. Model the density PP as a function in L2([0,1]d)L^{2}([0,1]^{d}), and define the training trajectory PtP_{t} by continuous time gradient descent

\frac{d}{dt}P_{t}=-\nabla_{P}L(P_{t})=-k*(P_{t}-P_{*}) (43)

where (kP)(𝐱)=k(𝐱,𝐱)𝑑P(𝐱)(k*P)(\mathbf{x})=\int k(\mathbf{x},\mathbf{x}^{\prime})dP(\mathbf{x}^{\prime}). Let ΠΔ\Pi_{\Delta} be the nearest point projection from L2([0,1]d)L^{2}([0,1]^{d}) to the convex subset 𝒫([0,1]d)L2([0,1]d)\mathcal{P}([0,1]^{d})\cap L^{2}([0,1]^{d}). We measure the test error by W2(P,ΠΔ(P))W_{2}(P_{*},\Pi_{\Delta}(P)).
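On a one-dimensional grid, one explicit Euler step of the flow (43) can be sketched as follows (NumPy; kernel is a hypothetical callable that broadcasts over arrays, and the densities are represented by their values on a uniform grid):

import numpy as np

def mmd_flow_step(P, P_star, grid, kernel, dt=0.1):
    # One explicit Euler step of dP/dt = -k * (P - P_*) on a uniform 1D grid
    h = grid[1] - grid[0]                         # grid spacing
    K = kernel(grid[:, None], grid[None, :])      # kernel matrix k(x_i, x_j)
    drift = K @ (P - P_star) * h                  # quadrature for (k * (P - P_*))(x_i)
    return P - dt * drift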

Proposition 5.3 (Lemma 13 of [127]).

Given Assumption 3.1, for any target distribution P𝒫([0,1]d)P_{*}\in\mathcal{P}([0,1]^{d}) and any initialization P0L2([0,1]d)P_{0}\in L^{2}([0,1]^{d}), the distribution ΠΔ(Pt)\Pi_{\Delta}(P_{t}) converges weakly to PP_{*}

limtW2(P,ΠΔ(Pt))=0\lim_{t\to\infty}W_{2}\big{(}P_{*},\Pi_{\Delta}(P_{t})\big{)}=0

6 Generalization Error

Although Sections 4 and 5 have demonstrated that distribution learning models have the capacity for memorization, this section shows that solutions with good generalization are still achievable. For the four classes highlighted in Table 1, we show that their models escape from the curse of dimensionality with either early stopping or regularization. Specifically, their generalization errors scale as O(n^{-\alpha}), where the exponents \alpha are absolute constants, instead of dimension-dependent rates such as O(n^{-\alpha/d}).

These results depend on either of the two forms of regularizations

  • Implicit regularization: As depicted in Figure 3 (Left), the training trajectory PtP_{t} comes very close to the hidden target distribution PP_{*} before eventually turning towards the empirical distribution P(n)P_{*}^{(n)}

  • Explicit regularization: Analogous to the above picture, we consider some regularized loss L(n)+R(λ)L^{(n)}+R(\lambda) with strength λ0\lambda\geq 0. With an appropriate regularization strength, the minimizer PλP_{\lambda} becomes very close to the hidden target PP_{*}.

The mechanism underlying both scenarios is that the function representations of the models are insensitive to the sampling error PP(n)P_{*}-P_{*}^{(n)}. Thus, we resolve the seeming paradox between good generalization and the inevitable memorization.

Without a good function representation, this behavior cannot be guaranteed. For instance, as argued in [127], if a distribution PtP_{t} is trained by Wasserstein gradient flow (i.e. without any function parametrization) on the empirical loss

L(n)(P)=W2(P(n),P)L^{(n)}(P)=W_{2}(P_{*}^{(n)},P)

and if the initialization P0PP_{0}\neq P_{*} is in 𝒫2,ac(d)\mathcal{P}_{2,ac}(\mathbb{R}^{d}), then the training trajectory PtP_{t} follows the W2W_{2} geodesic that connects P0P_{0} and P(n)P_{*}^{(n)}. Since the Wasserstein manifold has positive curvature [3], the geodesic in general can never come close to the hidden target PP_{*}, as depicted in Figure 3 (Right).

Figure 3: Left: Implicit regularization enables P^{(n)}_{t} to stay close to P_{t} and thus approximate P_{*} better than P_{*}^{(n)}. Right: Wasserstein gradient flow on the W_{2} loss.

In the following five subsections, we survey our results on three models that have rigorous proofs, and then analyze two models with heuristic calculations.

6.1 Bias-potential model

We start with the bias-potential model (28) since it enjoys the most convexity and thus the arguments are the most transparent.

Consider the domain Ω=[0,1]d\Omega=[0,1]^{d} with base distribution 𝒫(Ω)\mathbb{P}\in\mathcal{P}(\Omega). Let Vt,Vt(n)V_{t},V^{(n)}_{t}\in\mathcal{H} be potential functions trained on the population loss LL (28) and the empirical loss L(n)L^{(n)} (35) respectively, using continuous time gradient descent (39). Denote their induced distributions (5) by Pt,Pt(n)P_{t},P^{(n)}_{t}.

Theorem 6.1 (Theorem 3.3 of [126]).

Given Assumption 3.1, assume that the target distribution PP_{*} has the form (5) with a potential function VV_{*}\in\mathcal{H}. For any δ>0\delta>0, with probability 1δ1-\delta over the sampling of P(n)P_{*}^{(n)},

\text{KL}\big(P_{*}\|P_{t}^{(n)}\big)\leq\frac{\|V_{*}\|^{2}_{\mathcal{H}}}{2t}+\frac{8\sqrt{2\log 2d}+2\sqrt{2\log(2/\delta)}}{\sqrt{n}}\,t (44)
Corollary 6.2.

Given the condition of Theorem 6.1, if we choose an early-stopping time TT such that

T=Θ(V(nlogd)1/4)T=\Theta\Big{(}\|V_{*}\|_{\mathcal{H}}\big{(}\frac{n}{\log d}\big{)}^{1/4}\Big{)}

then the test error satisfies

KL(PPT(n))V(logdn)1/4\text{KL}\big{(}P_{*}\|P_{T}^{(n)}\big{)}\lesssim\|V_{*}\|_{\mathcal{H}}\Big{(}\frac{\log d}{n}\Big{)}^{1/4}

Hence, the generalization error escapes from the curse of dimensionality.
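The early-stopping time in Corollary 6.2 comes from balancing the two terms of (44):

\frac{\|V_{*}\|_{\mathcal{H}}^{2}}{2t}\asymp\sqrt{\frac{\log d}{n}}\,t\quad\Longrightarrow\quad T\asymp\|V_{*}\|_{\mathcal{H}}\Big(\frac{n}{\log d}\Big)^{1/4},\qquad\text{KL}\big(P_{*}\|P_{T}^{(n)}\big)\lesssim\|V_{*}\|_{\mathcal{H}}\Big(\frac{\log d}{n}\Big)^{1/4}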

The two terms in the upper bound (44) are the training error and generalization gap. The former is a consequence of convexity, while the latter follows from the observation that the landscapes of LL and L(n)L^{(n)} differ very little

δLL(n)δaL2(ρ)\displaystyle\Big{\|}\frac{\delta L-L^{(n)}}{\delta a}\Big{\|}_{L^{2}(\rho)} supV1|V(𝐱)d(PP(n))(𝐱)|\displaystyle\leq\sup_{\|V\|_{\mathcal{H}}\leq 1}\Big{|}\int V(\mathbf{x})d(P_{*}-P_{*}^{(n)})(\mathbf{x})\Big{|}
Radn({V1})+log1/δn\displaystyle\lesssim Rad_{n}\big{(}\{\|V\|_{\mathcal{H}}\leq 1\}\big{)}+\frac{\sqrt{\log 1/\delta}}{\sqrt{n}}

where the RadnRad_{n} term is the Rademacher complexity and scales as O(1/n)O(1/\sqrt{n}). Then, Vt,Vt(n)V_{t},V^{(n)}_{t} remain close during training

VtVt(n)0tδLL(n)δaL2(ρ)tn\|V_{t}-V_{t}^{(n)}\|_{\mathcal{H}}\lesssim\int_{0}^{t}\Big{\|}\frac{\delta L-L^{(n)}}{\delta a}\Big{\|}_{L^{2}(\rho)}\lesssim\frac{t}{\sqrt{n}}

which confirms the depiction in Figure 3 (Left).

In the meantime, the bias potential model also generalizes well in the explicit regularization setting. Here we consider the Ivanov and Tikhonov regularizations [57, 114, 84].

Proposition 6.3 (Proposition 3.9 of [126]).

Given Assumption 3.1, assume that the target PP_{*} is generated by a potential VV_{*}\in\mathcal{H}. Let VR(n)V_{R}^{(n)} be the minimizer of the regularized loss

minVRL(n)(V)\min_{\|V\|_{\mathcal{H}}\leq R}L^{(n)}(V)

where RR is any constant such that RVR\geq\|V_{*}\|_{\mathcal{H}}. For any δ>0\delta>0, with probability 1δ1-\delta over the sampling of P(n)P_{*}^{(n)}, the distribution PR(n)P^{(n)}_{R} generated by the potential VR(n)V^{(n)}_{R} satisfies

KL(PPR(n))82log2d+22log(2/δ)nR\text{KL}(P_{*}\|P^{(n)}_{R})\leq\frac{8\sqrt{2\log 2d}+2\sqrt{2\log(2/\delta)}}{\sqrt{n}}R
Proposition 6.4.

Given the condition of Proposition 6.3, for any δ>0\delta>0, let Vλ(n)V_{\lambda}^{(n)} be the minimizer to the regularized loss

minVL(n)(V)+λnV,λ42log2d+2log(2/δ)\min_{V\in\mathcal{H}}L^{(n)}(V)+\frac{\lambda}{\sqrt{n}}\|V\|_{\mathcal{H}},\quad\lambda\geq 4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}

With probability 1-\delta over the sampling of P_{*}^{(n)}, the distribution P^{(n)}_{\lambda} generated by the potential V^{(n)}_{\lambda} satisfies

KL(PPλ(n))2λVn\text{KL}(P_{*}\|P^{(n)}_{\lambda})\leq\frac{2\lambda\|V_{*}\|_{\mathcal{H}}}{\sqrt{n}}

6.2 Normalizing flow with stochastic interpolants

Consider the normalizing flow with stochastic interpolants (33), and model the velocity field VV and generator GVG_{V} by Definition 3. Denote by Vt,Vt(n)V_{t},V_{t}^{(n)} the training trajectories on the population and empirical losses (33, 35) using gradient flow (39, 40). Denote the generated distributions by Pt=GVt#P_{t}=G_{V_{t}}\#\mathbb{P} and Pt(n)=GVt(n)#P^{(n)}_{t}=G_{V^{(n)}_{t}}\#\mathbb{P}.

First, we bound the generalization gap.

Theorem 6.5 (Theorem 3.4 of [125]).

Given Assumption 3.2, for any compactly-supported base and target distributions \mathbb{P} and PP_{*}, if the velocity field VV_{*} (15) satisfies V(d+1,d)V_{*}\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}), then with probability 1δ1-\delta over the sampling of P(n)P_{*}^{(n)},

W2(Pt,Pt(n))V(1+2ln2+2ln(2/δ))(43Rt3/2+2Rt)+R2+2tnW_{2}(P_{t},P_{t}^{(n)})\leq\|V_{*}\|_{\mathcal{F}}\frac{(1+\sqrt{2\ln 2}+\sqrt{2\ln(2/\delta)})(\frac{4}{3}Rt^{3/2}+2Rt)+\sqrt{R^{2}+2}~{}t}{\sqrt{n}}

where RR is the radius

R=sup{𝐱|𝐱sprtsprtP}R=\sup\big{\{}\|\mathbf{x}\|~{}\big{|}~{}\mathbf{x}\in\text{sprt}\mathbb{P}\cup\text{sprt}P_{*}\big{\}}

Next, to estimate the generalization error, we need a sharper norm to bound the training error.

Definition 4 (Proposition 2.4 of [125]).

Given any distribution M\in\mathcal{P}(\mathbb{R}^{d+1}), let K be the integral operator (37) over L^{2}(M). Given Assumption 3.2, Proposition 2.4 of [125] implies that K is a symmetric, positive, compact operator, and thus has an eigendecomposition with positive eigenvalues \{\lambda_{i}\}_{i=1}^{\infty} and eigenfunctions \{\phi_{i}\}_{i=1}^{\infty}, which form an orthonormal basis of L^{2}(M). Define the subspace \mathcal{H}^{2}(M)\subseteq L^{2}(M) with the following norm

V2(M)=i=1𝐯i2λi2\|V\|_{\mathcal{H}^{2}(M)}=\sum_{i=1}^{\infty}\frac{\|\mathbf{v}_{i}\|^{2}}{\lambda_{i}^{2}}

where the coefficients 𝐯i\mathbf{v}_{i} are from the decomposition V=i𝐯iϕiV=\sum_{i}\mathbf{v}_{i}\phi_{i}. Furthermore, we have V(sprtM)V2(M)\|V\|_{\mathcal{H}(\text{sprt}M)}\leq\|V\|_{\mathcal{H}^{2}(M)} and thus VexpV2(M)\|V\|_{\mathcal{F}}\leq\exp\|V\|_{\mathcal{H}^{2}(M)}.

Theorem 6.6 (Theorem 3.7 of [125]).

Given Assumption 3.2, assume that the base distribution \mathbb{P} is compactly-supported and has a C2C^{2} density. Let PP_{*} be any compactly-supported target distribution such that the velocity VV_{*} (15) satisfies V2(M)V_{*}\ \in\mathcal{H}^{2}(M_{*}), where MM_{*} is the joint distribution (13). Then,

\displaystyle W_{2}(P_{*},P^{(n)}_{t})\leq\frac{\|V_{*}\|_{\mathcal{F}}\|V_{*}\|_{\mathcal{H}^{2}(M_{*})}}{2\sqrt{t}}
\displaystyle\quad\quad+\|V_{*}\|_{\mathcal{F}}\,\frac{(1+\sqrt{2\ln 2}+\sqrt{2\ln(2/\delta)})(\frac{4}{3}Rt^{3/2}+2Rt)+\sqrt{R^{2}+2}~t}{\sqrt{n}}

where the radius R=sup{𝐱|𝐱sprtsprtP}R=\sup\big{\{}\|\mathbf{x}\|~{}\big{|}~{}\mathbf{x}\in\text{sprt}\mathbb{P}\cup\text{sprt}P_{*}\big{\}}.

In short, the generalization error scales as

W2(P,Pt(n))1t+1nt3/2W_{2}(P_{*},P^{(n)}_{t})\lesssim\frac{1}{\sqrt{t}}+\frac{1}{\sqrt{n}}~{}t^{3/2}

and thus with early-stopping T=Θ(n1/4)T=\Theta(n^{1/4}), the model escapes from the curse of dimensionality

W_{2}(P_{*},P^{(n)}_{T})\lesssim n^{-1/8}

The condition V2(M)V_{*}\in\mathcal{H}^{2}(M_{*}) may seem strict, so we present the following corollary that holds for general target distributions.

Corollary 6.7 (Corollary 3.8 of [125]).

Given Assumption 3.2, assume that the base distribution \mathbb{P} is compactly-supported and has C2C^{2} density. Let PP_{*} be any compactly-supported target distribution. For any ϵ>0\epsilon>0, there exists a distribution Mϵ𝒫(d+1)M_{\epsilon}\in\mathcal{P}(\mathbb{R}^{d+1}) and a velocity field Vϵ2(Mϵ)V_{\epsilon}\in\mathcal{H}^{2}(M_{\epsilon}) such that

W2(P,Pt(n))\displaystyle W_{2}(P_{*},P^{(n)}_{t}) <VV2(Mϵ)2t+ϵ(Vϵt3/2+1)\displaystyle<\frac{\|V_{*}\|_{\mathcal{F}}\|V_{*}\|_{\mathcal{H}^{2}(M_{\epsilon})}}{2\sqrt{t}}+\epsilon~{}\big{(}\|V_{\epsilon}\|_{\mathcal{F}}~{}t^{3/2}+1\big{)}
+V(1+2ln2+2ln(2/δ))(43Rt3/2+2Rt)+R2+2tn\displaystyle\quad\quad+\|V_{*}\|_{\mathcal{F}}\frac{(1+\sqrt{2\ln 2}+\sqrt{2\ln(2/\delta)})(\frac{4}{3}Rt^{3/2}+2Rt)+\sqrt{R^{2}+2}~{}t}{\sqrt{n}}

6.3 GAN

Similar to Section 5.2, we consider the simplified GAN model such that the modeled distribution is represented by a density function PP. We show that the discriminator DD alone is sufficient to enable good generalization.

With the discriminator modeled by D(d,)D\in\mathcal{H}(\mathbb{R}^{d},\mathbb{R}), the GAN loss (23) becomes the maximum mean discrepancy LL in (42). Consider the training trajectory PtL2([0,1]d)P_{t}\in L^{2}([0,1]^{d}) defined by (43). Similarly, define the empirical loss L(n)L^{(n)} and empirical training trajectory Pt(n)P^{(n)}_{t} by

\displaystyle L^{(n)}(P)=\frac{1}{2}\iint k(\mathbf{x},\mathbf{x}^{\prime})\,d(P-P^{(n)}_{*})(\mathbf{x})\,d(P-P^{(n)}_{*})(\mathbf{x}^{\prime})
\displaystyle\frac{d}{dt}P^{(n)}_{t}=-\nabla_{P}L^{(n)}(P^{(n)}_{t})=-k*(P^{(n)}_{t}-P^{(n)}_{*})

Fix some initialization P0=P0(n)L2([0,1]d)P_{0}=P^{(n)}_{0}\in L^{2}([0,1]^{d}). We measure the test error by W2(P,ΠΔ(P))W_{2}(P_{*},\Pi_{\Delta}(P)), where ΠΔ\Pi_{\Delta} is the nearest point projection onto 𝒫([0,1]d)L2([0,1]d)\mathcal{P}([0,1]^{d})\cap L^{2}([0,1]^{d}).

Theorem 6.8 (Theorem 2 of [127]).

Given Assumption 3.1, for any target density function PP_{*} such that PP0P_{*}-P_{0}\in\mathcal{H}, with probability 1δ1-\delta over the sampling of P(n)P_{*}^{(n)},

W2(P,ΠΔ(Pt(n)))dPP0t+d42log2d+2log(2/δ)ntW_{2}\big{(}P_{*},\Pi_{\Delta}(P_{t}^{(n)})\big{)}\leq\sqrt{d}\frac{\|P_{*}-P_{0}\|_{\mathcal{H}}}{\sqrt{t}}+\sqrt{d}~{}\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}t

It follows that with an early stopping time T=Θ(n1/3)T=\Theta(n^{1/3}), the generalization error scales as O(n1/6)O(n^{-1/6}) and escapes from the curse of dimensionality.

Part of this result can be extended to the usual case with a generator. As shown in [127], if we consider the GAN loss L(G)=L(G#)L(G)=L(G\#\mathbb{P}), then the generator is insensitive to the difference between the landscapes of the population loss LL and empirical loss L(n)L^{(n)},

δLδGδL(n)δGL2()=k(PP(n))L2()1n\Big{\|}\frac{\delta L}{\delta G}-\frac{\delta L^{(n)}}{\delta G}\Big{\|}_{L^{2}(\mathbb{P})}=\|\nabla k*(P_{*}-P_{*}^{(n)})\|_{L^{2}(\mathbb{P})}\lesssim\frac{1}{\sqrt{n}}

The mechanism is that the sampling error PP(n)P_{*}-P^{(n)}_{*} is damped by DD.

Furthermore, we can model the generator as a random feature function G\in\mathcal{H}(\mathbb{R}^{d},\mathbb{R}^{d}), and consider the population trajectory G_{t} and the empirical trajectory G^{(n)}_{t}, which are trained respectively on L and L^{(n)} by continuous time gradient descent (39, 40) with the same initialization. If the activation \sigma is C^{1}, then the difference G_{t}-G^{(n)}_{t} grows slowly at t=0

ddt(GtGt(n))(𝐳)|t=0=k(k(PP(n)))\frac{d}{dt}(G_{t}-G^{(n)}_{t})(\mathbf{z})\big{|}_{t=0}=k*\nabla\big{(}k*(P_{*}-P^{(n)}_{*})\big{)}

Note that the sampling error PP(n)P_{*}-P^{(n)}_{*} is damped twice by both GG and DD.

Finally, we remark that the generalization error of GANs has been studied in several related works using other kinds of simplified models, for instance when the generator G is a linear map [38], a one-layer network (without hidden layer, of the form G(\mathbf{x})=\sigma(A\mathbf{x}) for A\in\mathbb{R}^{d\times d}) [121, 69], or a polynomial of bounded degree [71].

6.4 Score-based diffusion model

As a further demonstration of the techniques from the previous sections, this section presents an informal estimation of the generalization error of the score-based diffusion model (32). By heuristic calculations, we derive a bound in the implicit regularization setting that resembles Theorem 6.6.

Model the score function by 𝐬(d+1,d)\mathbf{s}\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}). Let 𝐬t,𝐬t(n)\mathbf{s}_{t},\mathbf{s}^{(n)}_{t} be the training trajectories on the population loss LL (32) and empirical loss L(n)L^{(n)} (35) using gradient flow (39, 40) with zero initialization 𝐬0=𝐬0(n)𝟎\mathbf{s}_{0}=\mathbf{s}^{(n)}_{0}\equiv\mathbf{0}. Model the generators Gt,Gt(n)G_{t},G_{t}^{(n)} by the reverse-time SDE (19) with scores 𝐬t,𝐬t(n)\mathbf{s}_{t},\mathbf{s}^{(n)}_{t}. Denote the generated distributions by Pt=Gt#P_{t}=G_{t}\#\mathbb{P} and Pt(n)=Gt(n)#P_{t}^{(n)}=G^{(n)}_{t}\#\mathbb{P}.

For any target distribution PP_{*}, denote the target score function by 𝐬=logPτ\mathbf{s}_{*}=\nabla\log P_{\tau} with PτP_{\tau} given by (18). By inequality (34),

KL(PPt(n))L(𝐬t(n))L(𝐬)+KL(PT)\text{KL}(P_{*}\|P_{t}^{(n)})\leq L(\mathbf{s}^{(n)}_{t})-L(\mathbf{s}_{*})+\text{KL}(P_{T}\|\mathbb{P})

Assume that 𝐬\mathbf{s}_{*}\in\mathcal{H}. Then, by convexity (see for instance [126, Proposition 3.1])

L(𝐬t)L(𝐬)𝐬2tL(\mathbf{s}_{t})-L(\mathbf{s}_{*})\lesssim\frac{\|\mathbf{s}_{*}\|_{\mathcal{H}}^{2}}{t}

Meanwhile, by the growth rate bound 𝐬t,𝐬t(n)t\|\mathbf{s}_{t}\|_{\mathcal{H}},\|\mathbf{s}^{(n)}_{t}\|_{\mathcal{H}}\lesssim\sqrt{t} from [125, Proposition 5.3],

L(\mathbf{s}^{(n)}_{t})-L(\mathbf{s}_{t})\lesssim(\|\mathbf{s}^{(n)}_{t}\|_{C_{0}}+\|\mathbf{s}_{t}\|_{C_{0}})\,\|\mathbf{s}_{t}-\mathbf{s}^{(n)}_{t}\|_{\mathcal{H}}\lesssim\sqrt{t}~\|\mathbf{s}_{t}-\mathbf{s}^{(n)}_{t}\|_{\mathcal{H}}

Then, using a calculation analogous to the proof of [125, Theorem 3.4]

𝐬t𝐬t(n)Radn({𝐬t})tt3/2n\|\mathbf{s}_{t}-\mathbf{s}^{(n)}_{t}\|_{\mathcal{H}}\lesssim Rad_{n}\big{(}\{\|\mathbf{s}\|_{\mathcal{H}}\leq\sqrt{t}\}\big{)}t\lesssim\frac{t^{3/2}}{\sqrt{n}}

Combining these inequalities, we obtain

KL(PPt(n))𝐬2t+t2n+KL(PT)\text{KL}(P_{*}\|P_{t}^{(n)})\lesssim\frac{\|\mathbf{s}_{*}\|_{\mathcal{H}}^{2}}{t}+\frac{t^{2}}{\sqrt{n}}+\text{KL}(P_{T}\|\mathbb{P})

Hence, if we ignore the approximation error KL(PT)\text{KL}(P_{T}\|\mathbb{P}) due to finite TT, the generalization error with early stopping scales as O(n1/6)O(n^{-1/6}) and escapes from the curse of dimensionality.

6.5 Normalizing flow

This section presents an informal estimation of the generalization error of the normalizing flow model (29). We conjecture an upper bound in the explicit regularization setting that resembles Proposition 6.3.

Let the velocity field be modeled by V(d+1,d)V\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}), let GVG_{V} be the flow map from Definition 3, and define the reverse-time flow map for τ[0,1]\tau\in[0,1],

FV(𝐱1,τ)=𝐱τ,ddτ𝐱τ=V(𝐱τ,τ)F_{V}(\mathbf{x}_{1},\tau)=\mathbf{x}_{\tau},\quad\frac{d}{d\tau}\mathbf{x}_{\tau}=V(\mathbf{x}_{\tau},\tau)

Let L,L(n)L,L^{(n)} be the population and empirical losses (29, 35). Let the base distribution be the unit Gaussian =𝒩\mathbb{P}=\mathcal{N}. For any target distribution P𝒫2(d)P_{*}\in\mathcal{P}_{2}(\mathbb{R}^{d}) such that P=GV#P_{*}=G_{V_{*}}\#\mathbb{P} for some VV_{*}\in\mathcal{H}, and for any RVR\geq\|V_{*}\|_{\mathcal{F}}, consider the problem with explicit regularization:

minVRL(n)(V)\min_{\|V\|_{\mathcal{F}}\leq R}L^{(n)}(V)

Let VR(n)V^{(n)}_{R} be a minimizer. It follows that

\displaystyle L(V^{(n)}_{R})\leq L^{(n)}(V^{(n)}_{R})+\sup_{\|V\|_{\mathcal{F}}\leq R}L(V)-L^{(n)}(V)
\displaystyle\leq L^{(n)}(V_{*})+\sup_{\|V\|_{\mathcal{F}}\leq R}L(V)-L^{(n)}(V)
\displaystyle\leq L(V_{*})+2\sup_{\|V\|_{\mathcal{F}}\leq R}L(V)-L^{(n)}(V)

Then,

\displaystyle\quad\sup_{\|V\|_{\mathcal{F}}\leq R}L(V)-L^{(n)}(V)
\displaystyle\leq\sup_{\|V\|_{\mathcal{F}}\leq R}\iint_{0}^{1}\text{Tr}\big[\nabla V\big(F_{V}(\mathbf{x}_{\tau},\tau),\tau\big)\big]\,d\tau\,d(P_{*}-P^{(n)}_{*})(\mathbf{x})
\displaystyle\quad+\sup_{\|V\|_{\mathcal{F}}\leq R}\iint_{0}^{1}\frac{1}{2}\|F_{V}(\mathbf{x},1)\|^{2}\,d(P_{*}-P^{(n)}_{*})(\mathbf{x})

Denote the two terms by the random variables A,BA,B. Using the techniques of [31, Theorem 2.11] and [46, Theorem 3.3] for bounding the Rademacher complexity of flow-induced functions, one can try to bound the following expectations

\displaystyle\mathbb{E}[A]\lesssim\frac{R}{\sqrt{n}}\,\mathbb{E}\big[\max_{1\leq i\leq n}\|X_{i}\|~\big|~X_{i}\sim^{i.i.d.}P_{*}\big]\lesssim\frac{R^{2}}{\sqrt{n}}
\displaystyle\mathbb{E}[B]\lesssim\frac{R^{2}}{\sqrt{n}}\,\mathbb{E}\big[\max_{1\leq i\leq n}\|X_{i}\|^{2}~\big|~X_{i}\sim^{i.i.d.}P_{*}\big]\lesssim\frac{R^{4}}{\sqrt{n}}

Meanwhile, for the random fluctuations A𝔼[A],B𝔼[B]A-\mathbb{E}[A],B-\mathbb{E}[B], one can apply the extension of McDiarmid’s inequality to sub-Gaussian random variables [67], and try to show that, with probability 1δ1-\delta over the sampling of P(n)P_{*}^{(n)},

\displaystyle A-\mathbb{E}[A]\lesssim\frac{R^{2}\sqrt{\log 1/\delta}}{\sqrt{n}},\qquad B-\mathbb{E}[B]\lesssim\frac{R^{4}\sqrt{\log 1/\delta}}{\sqrt{n}}

Combining these inequalities, one can conjecture that the solution PR(n)=GVR(n)#P^{(n)}_{R}=G_{V^{(n)}_{R}}\#\mathbb{P} satisfies

KL(PPR(n))=L(VR(n))L(V)1+log1/δnR4\text{KL}(P_{*}\|P^{(n)}_{R})=L(V^{(n)}_{R})-L(V_{*})\lesssim\frac{1+\sqrt{\log 1/\delta}}{\sqrt{n}}R^{4}

7 Training

This section studies the training behavior of distribution learning models, and illustrates the differences between the three distribution representations discussed in Section 3.2. On one hand, we survey our results on the global convergence rates of models with the potential representation and fixed generator representation. On the other hand, we present new results on the landscape of models with the free generator representation, and analyze the mode collapse phenomenon of GANs.

7.1 Potential and fixed generator

Models with these two representations are easier to analyze since their losses are usually convex over abstract functions. Specifically, this section considers the bias-potential model (28) and normalizing flow with stochastic interpolants (33):

\displaystyle L(V)=\int V\,dP_{*}+\ln\int e^{-V}d\mathbb{P}
\displaystyle L(V)=\frac{1}{2}\int_{0}^{1}\iint\big\|V\big((1-\tau)\mathbf{x}_{0}+\tau\mathbf{x}_{1},\tau\big)-(\mathbf{x}_{1}-\mathbf{x}_{0})\big\|^{2}\,d\mathbb{P}(\mathbf{x}_{0})\,dP_{*}(\mathbf{x}_{1})\,d\tau

If we choose a convex function representation for VV, then the optimization problem becomes convex.

To estimate the rate of convergence, one approach is to bound the test error by a stronger norm. The following toy example shows how to bound the L2L^{2} error by the RKHS norm \|\cdot\|_{\mathcal{H}}.

Example 1 (Kernel regression).

Fix any base distribution 𝒫(d)\mathbb{P}\in\mathcal{P}(\mathbb{R}^{d}) and assume that the activation σ\sigma is bounded. For any target function ff_{*}\in\mathcal{H}, consider the regression loss L(f)=12ffL2()2L(f)=\frac{1}{2}\|f-f_{*}\|_{L^{2}(\mathbb{P})}^{2}. Parametrize ff by faf_{a}\in\mathcal{H} and train the parameter function aa by continuous time gradient descent (38) with initialization a0a\equiv 0. Then,

L(ft)f22tL(f_{t})\leq\frac{\|f_{*}\|_{\mathcal{H}}^{2}}{2t}
Proof one.

Denote the loss by L(a)=L(fa)L(a)=L(f_{a}). Since σ\sigma is bounded, LL has continuous Fréchet derivative in aL2(ρ)a\in L^{2}(\rho), so the gradient descent (38) is well-defined. Choose aL2(ρ)a_{*}\in L^{2}(\rho) such that f=faf_{*}=f_{a_{*}} and f=aL2(ρ)\|f_{*}\|_{\mathcal{H}}=\|a_{*}\|_{L^{2}(\rho)}. Define the Lyapunov function

E(t)=t~\big(L(a_{t})-L(a_{*})\big)+\frac{1}{2}\|a_{*}-a_{t}\|^{2}_{L^{2}(\rho)}

Then,

\displaystyle\frac{d}{dt}E(t)=\big(L(a_{t})-L(a_{*})\big)+t\cdot\frac{d}{dt}L(a_{t})+\big\langle a_{t}-a_{*},~\frac{d}{dt}a_{t}\big\rangle_{L^{2}(\rho)}
\displaystyle\leq\big(L(a_{t})-L(a_{*})\big)-\big\langle a_{t}-a_{*},~\nabla L(a_{t})\big\rangle_{L^{2}(\rho)}

By convexity, for any a0,a1a_{0},a_{1},

L(a0)+a1a0,L(a0)L(a1)L(a_{0})+\langle a_{1}-a_{0},~{}\nabla L(a_{0})\rangle\leq L(a_{1})

Hence, \frac{d}{dt}E\leq 0, so t\,\big(L(a_{t})-L(a_{*})\big)\leq E(t)\leq E(0)=\frac{1}{2}\|a_{*}\|^{2}_{L^{2}(\rho)}=\frac{1}{2}\|f_{*}\|^{2}_{\mathcal{H}}. Since L(a_{*})=0, this yields L(f_{t})\leq\frac{\|f_{*}\|^{2}_{\mathcal{H}}}{2t}. ∎

Proof two.

Since ata_{t} evolves by (38), ftf_{t} evolves by (39):

\frac{d}{dt}f_{t}=-K(f_{t}-f_{*}) (45)

where KK is the integral operator (37) over L2()L^{2}(\mathbb{P}). Since KK is symmetric, positive semidefinite and compact, there exists an eigendecomposition with non-negative eigenvalues {λi}i=1\{\lambda_{i}\}_{i=1}^{\infty} and eigenfunctions {ϕi}i=1\{\phi_{i}\}_{i=1}^{\infty} that form an orthonormal basis of L2()L^{2}(\mathbb{P}). Consider the decomposition

ft=i=1ctiϕi,f=i=1ciϕif_{t}=\sum_{i=1}^{\infty}c^{i}_{t}\phi_{i},\quad f_{*}=\sum_{i=1}^{\infty}c_{*}^{i}\phi_{i}

It is known that the RKHS norm satisfies [93, 26]

f2=i=1(ci)2λi\|f_{*}\|_{\mathcal{H}}^{2}=\sum_{i=1}^{\infty}\frac{(c_{*}^{i})^{2}}{\lambda_{i}}

Since c0i=0c^{i}_{0}=0 by assumption, (45) implies that cti=(1eλit)cic^{i}_{t}=(1-e^{-\lambda_{i}t})c^{i}_{*}. Hence,

\displaystyle L(f_{t})=\frac{1}{2}\sum_{i=1}^{\infty}(c^{i}_{t}-c^{i}_{*})^{2}=\frac{1}{2}\sum_{i=1}^{\infty}(c_{*}^{i})^{2}e^{-2\lambda_{i}t}
\displaystyle\leq\frac{1}{2}\sum_{i=1}^{\infty}\frac{(c_{*}^{i})^{2}}{\lambda_{i}}\sup_{\lambda\geq 0}\lambda e^{-2\lambda t}=\frac{\|f_{*}\|_{\mathcal{H}}^{2}}{4et}
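A quick numerical check of this spectral picture is sketched below (NumPy); the eigenvalue decay \lambda_{i}=i^{-2} and the choice of target coefficients are arbitrary assumptions made only to ensure a finite RKHS norm.

import numpy as np

rng = np.random.default_rng(0)
lam = 1.0 / np.arange(1, 201) ** 2          # assumed eigenvalues of K
c_star = lam * rng.standard_normal(200)     # target coefficients with finite RKHS norm
rkhs_norm_sq = np.sum(c_star ** 2 / lam)    # ||f_*||_H^2 = sum_i (c_*^i)^2 / lambda_i

for t in [1e1, 1e2, 1e3]:
    c_t = (1.0 - np.exp(-lam * t)) * c_star      # spectral solution of (45)
    loss = 0.5 * np.sum((c_t - c_star) ** 2)
    print(t, loss, rkhs_norm_sq / (4 * np.e * t))   # loss stays below the bound ||f_*||^2/(4et)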

Using similar arguments, we have the following bounds on the test error of models trained on the population loss.

Proposition 7.1 (Bias-potential, Proposition 3.1 of [126]).

Given the setting of Theorem 6.1, the distribution PtP_{t} generated by the potential VtV_{t} satisfies

KL(PPt)V22t\text{KL}(P_{*}\|P_{t})\leq\frac{\|V_{*}\|^{2}_{\mathcal{H}}}{2t}
Proposition 7.2 (NF, Proposition 3.5 of [125]).

Given the setting of Theorem 6.6, the distribution Pt=GVt#P_{t}=G_{V_{t}}\#\mathbb{P} generated by the trajectory VtV_{t} satisfies

W2(P,Pt)VV2(M)2tW_{2}(P_{*},P_{t})\leq\frac{\|V_{*}\|_{\mathcal{F}}\|V_{*}\|_{\mathcal{H}^{2}(M_{*})}}{2\sqrt{t}}
Corollary 7.3 (NF, Corollary 3.6 of [125]).

Given the setting of Corollary 6.7, let PP_{*} be any compactly-supported target distribution and VV_{*} be the target velocity field (15). For any ϵ>0\epsilon>0, there exists a distribution Mϵ𝒫(d+1)M_{\epsilon}\in\mathcal{P}(\mathbb{R}^{d+1}) and velocity field Vϵ2(Mϵ)V_{\epsilon}\in\mathcal{H}^{2}(M_{\epsilon}) such that

W2(P,Pt)<VV2(Mϵ)2t+ϵ(Vϵt3/2+1)W_{2}(P_{*},P_{t})<\frac{\|V_{*}\|_{\mathcal{F}}\|V_{*}\|_{\mathcal{H}^{2}(M_{\epsilon})}}{2\sqrt{t}}+\epsilon~{}\big{(}\|V_{\epsilon}\|_{\mathcal{F}}~{}t^{3/2}+1\big{)}

It seems probable that the O(ϵt3/2)O(\epsilon t^{3/2}) term can be strengthened to O(ϵ)O(\epsilon), which would imply universal convergence, i.e. convergence to any target distribution.

7.2 Free generator: Landscape

Models with the free generator representation are more difficult to analyze, since the loss L(G)L(G) is not convex in GG. For instance, it is straightforward to check that if \mathbb{P} and PP_{*} are uniform over [0,1][0,1], then the solution set {G|G#=P}\{G_{*}~{}|~{}G_{*}\#\mathbb{P}=P_{*}\}, or equivalently the set of minimizers of LL, is an infinite and non-convex subset of L2()L^{2}(\mathbb{P}), and thus LL is non-convex.

Despite the non-convexity, there is still hope for establishing global convergence through a careful analysis of the critical points. For instance, one can conjecture that there are no spurious local minima and that the saddle points can be easily avoided. This section offers two results towards this intuition: a characterization of critical points for general loss functions of the form L(G#)L(G\#\mathbb{P}), and a toy example such that global convergence can be determined from initialization.

In general, no matter how we parametrize the generator, the modeled distribution Pt=Gt#P_{t}=G_{t}\#\mathbb{P} satisfies the continuity equation during training: Assume that G(𝐱,θ)G(\mathbf{x},\theta) is C1C^{1} in 𝐱d\mathbf{x}\in\mathbb{R}^{d} and θΘ\theta\in\Theta, where Θ\Theta is some Hilbert space. Then, given any path θt\theta_{t} that is C1C^{1} in tt, for any smooth test function ϕ\phi,

\displaystyle\frac{d}{dt}\int\phi\,dP_{t}=\frac{d}{dt}\int\phi\big(G(\mathbf{x},\theta_{t})\big)\,d\mathbb{P}(\mathbf{x})=\int\nabla\phi\big(G(\mathbf{x},\theta_{t})\big)\cdot\nabla_{\theta}G(\mathbf{x},\theta_{t})\dot{\theta}_{t}\,d\mathbb{P}(\mathbf{x})
\displaystyle=\int\nabla\phi(\mathbf{x})\cdot V_{t}(\mathbf{x})\,dP_{t}(\mathbf{x})=-\int\phi\,d\nabla\cdot(V_{t}P_{t})

where the velocity field VtV_{t} is defined by

𝐟(𝐱)Vt(𝐱)𝑑Pt(𝐱)=𝐟(G(𝐱,θt))θG(𝐱,θt)θ˙t𝑑(𝐱)\int\mathbf{f}(\mathbf{x})\cdot V_{t}(\mathbf{x})~{}dP_{t}(\mathbf{x})=\int\mathbf{f}\big{(}G(\mathbf{x},\theta_{t})\big{)}\cdot\nabla_{\theta}G(\mathbf{x},\theta_{t})\dot{\theta}_{t}~{}d\mathbb{P}(\mathbf{x})

for any test function 𝐟L2(Pt,d)\mathbf{f}\in L^{2}(P_{t},\mathbb{R}^{d}). Thus, PtP_{t} is a weak solution to the continuity equation

tPt+(VtPt)=0\partial_{t}P_{t}+\nabla\cdot(V_{t}P_{t})=0

In particular, the equation implies that no matter how GG is parametrized, the “particles” of PtP_{t} during training can only move continuously without jumps or teleportation.

For abstraction, it is helpful to consider the Wasserstein gradient flow [101, 3]: For any loss function LL over 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}) and any initialization P0𝒫2(d)P_{0}\in\mathcal{P}_{2}(\mathbb{R}^{d}), define the training trajectory PtP_{t} by

tPt+(PtδLδP(Pt))=0\partial_{t}P_{t}+\nabla\cdot\Big{(}P_{t}\nabla\frac{\delta L}{\delta P}(P_{t})\Big{)}=0

where δPL\delta_{P}L is the first variation, which satisfies

ddϵL(P+ϵ(QP))|ϵ=0=δLδP(Pt)d(QP)\frac{d}{d\epsilon}L\big{(}P+\epsilon(Q-P)\big{)}\big{|}_{\epsilon=0}=\int\frac{\delta L}{\delta P}(P_{t})d(Q-P)

for any bounded and compactly-supported density function QQ.

The Wasserstein gradient flow abstracts away the parametrization θGθ\theta\mapsto G_{\theta} of the generator, and thus simplifies the analysis of critical points. To relate to our problem, the following result shows that the Wasserstein gradient flow often shares the same critical points as the parametrized loss L(θ)=L(Gθ#)L(\theta)=L(G_{\theta}\#\mathbb{P}), and thus we are allowed to study the former instead.

Proposition 7.4 (Comparison of critical points).

Given any loss LL over 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}), assume that the first variation δPL\delta_{P}L exists, is C1C^{1}, and 𝐱δPL(P)L2(P,d)\nabla_{\mathbf{x}}\delta_{P}L(P)\in L^{2}(P,\mathbb{R}^{d}) for all P𝒫2(d)P\in\mathcal{P}_{2}(\mathbb{R}^{d}). Define the set of critical points

CP={P𝒫2(d)|𝐱δPL(P)(𝐱)=𝟎 for P almost all 𝐱}CP=\big{\{}P\in\mathcal{P}_{2}(\mathbb{R}^{d})~{}\big{|}~{}\nabla_{\mathbf{x}}\delta_{P}L(P)(\mathbf{x})=\mathbf{0}\text{ for }P\text{ almost all }\mathbf{x}\big{\}}

Given any base distribution 𝒫2(k)\mathbb{P}\in\mathcal{P}_{2}(\mathbb{R}^{k}) and any generator Gθ:kdG_{\theta}:\mathbb{R}^{k}\to\mathbb{R}^{d} parametrized by θΘ\theta\in\Theta, where Θ\Theta is a Hilbert space. Assume that Gθ(𝐱)G_{\theta}(\mathbf{x}) is Lipschitz in 𝐱\mathbf{x} for any θ\theta, and C1C^{1} in θ\theta for any 𝐱\mathbf{x}, and that the gradient θGθ\nabla_{\theta}G_{\theta} at any θ\theta is a continuous linear operator ΘL2(,d)\Theta\to L^{2}(\mathbb{P},\mathbb{R}^{d}). Define the set of generated distributions 𝒫Θ={Gθ#,θΘ}\mathcal{P}_{\Theta}=\{G_{\theta}\#\mathbb{P},~{}\theta\in\Theta\} and the set of critical points

CPΘ={Gθ#|θΘ,θL(Gθ#)=0}CP_{\Theta}=\big{\{}G_{\theta}\#\mathbb{P}~{}\big{|}~{}\theta\in\Theta,~{}\nabla_{\theta}L(G_{\theta}\#\mathbb{P})=0\big{\}}

Then, CP𝒫ΘCPΘCP\cap\mathcal{P}_{\Theta}\subseteq CP_{\Theta}. Furthermore, if the parametrization GθG_{\theta} is the random feature functions G𝐚(k,d)G_{\mathbf{a}}\in\mathcal{H}(\mathbb{R}^{k},\mathbb{R}^{d}), and either Assumption 3.1 or 3.2 holds, then

CPΘ=CP𝒫ΘCP_{\Theta}=CP\cap\mathcal{P}_{\Theta}

Below is an example use case of this proposition.

Example 2.

Consider the GAN loss L(P)L(P) in (42) induced by discriminators that are random feature functions (36). Assume that the target distribution P𝒫2,ac(d)P_{*}\in\mathcal{P}_{2,ac}(\mathbb{R}^{d}) is radially symmetric, the parameter distribution ρ\rho in (36) is radially symmetric in 𝐰\mathbf{w} conditioned on any bb, and that the activation σ\sigma is C1C^{1}. Consider the point mass P=δ𝟎P=\delta_{\mathbf{0}}. Then,

\displaystyle\nabla\frac{\delta L}{\delta P}(P)(\mathbf{0})=\nabla\int k(\mathbf{x},\mathbf{x}^{\prime})\,d(P-P_{*})(\mathbf{x}^{\prime})\Big|_{\mathbf{x}=\mathbf{0}}
\displaystyle=\iint\mathbf{w}\,\sigma^{\prime}(\mathbf{w}\cdot\mathbf{0}+b)\,\sigma(\mathbf{w}\cdot\mathbf{x}^{\prime}+b)\,d(P-P_{*})(\mathbf{x}^{\prime})\,d\rho(\mathbf{w},b)
\displaystyle=\frac{1}{2}\iint\mathbf{w}\,\sigma^{\prime}(b)\,\sigma(\mathbf{w}\cdot\mathbf{x}^{\prime}+b)\,d(P-P_{*})(\mathbf{x}^{\prime})\,d\rho(\mathbf{w}|b)\,d\rho(b)
\displaystyle\quad+\frac{1}{2}\iint(-\mathbf{w})\,\sigma^{\prime}(b)\,\sigma\big((-\mathbf{w})\cdot(-\mathbf{x}^{\prime})+b\big)\,d(P-P_{*})(\mathbf{x}^{\prime})\,d\rho(\mathbf{w}|b)\,d\rho(b)
\displaystyle=\mathbf{0}

Thus, δPL(P)=𝟎\nabla\delta_{P}L(P)=\mathbf{0} for PP almost all 𝐱\mathbf{x}, and PCPP\in CP. It follows from Proposition 7.4 that PP is a critical point (PCPΘP\in CP_{\Theta}) for any parametrized generator GθG_{\theta} that can express the constant zero function.
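The following Monte Carlo sketch checks this cancellation numerically on one concrete (assumed) instance of the setting of Example 2: P_{*} is the standard Gaussian (radially symmetric), \rho draws \mathbf{w}\sim N(0,I_{d}) independently of b (so \mathbf{w} is radially symmetric conditioned on b), and \sigma=\tanh, which is C^{1}. These choices are only for illustration.

```python
import numpy as np

# Estimate  grad dL/dP(delta_0)(0)
#   = E_{(w,b)~rho}[ w * sigma'(b) * ( sigma(b) - E_{x'~P_*}[sigma(w.x' + b)] ) ]
# under the assumed symmetric choices above; it should vanish up to Monte Carlo error.

rng = np.random.default_rng(1)
d, n_params, n_target = 5, 20_000, 200

w = rng.normal(size=(n_params, d))                   # radially symmetric in w
b = rng.normal(size=n_params)
x_target = rng.normal(size=(n_target, d))            # samples of P_*

sigma, dsigma = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2

term_P = sigma(b)                                    # sigma(w . 0 + b)
term_Pstar = sigma(x_target @ w.T + b).mean(axis=0)  # E_{P_*}[sigma(w . x' + b)]
grad = (w * (dsigma(b) * (term_P - term_Pstar))[:, None]).mean(axis=0)

print("estimated gradient at the point mass:", grad)   # ~ 0 in every coordinate
```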

A probability measure PP is called singular if PP cannot be expressed as a density function (P𝒫(d)𝒫ac(d)P\in\mathcal{P}(\mathbb{R}^{d})-\mathcal{P}_{ac}(\mathbb{R}^{d})). The critical points of an expectation-based loss are often singular distributions such as δ𝟎\delta_{\mathbf{0}} from the previous example. The following toy model shows that the critical points may consist of only global minima and saddle points that are singular.

Consider a one-dimensional setting. Model the generator by any function GL2(,)G\in L^{2}(\mathbb{P},\mathbb{R}), where the base distribution \mathbb{P} is conveniently set to be uniform over [0,1][0,1]. Given any target distribution P𝒫2,ac()P_{*}\in\mathcal{P}_{2,ac}(\mathbb{R}) and any initialization G0L2(,)G_{0}\in L^{2}(\mathbb{P},\mathbb{R}), consider the dynamics

ddtGt(z)=x𝑑πt(x|Gt(z))Gt(z)\frac{d}{dt}G_{t}(z)=\int x~{}d\pi_{t}\big{(}x\big{|}G_{t}(z)\big{)}-G_{t}(z) (46)

where \pi_{t}(\cdot|\cdot) is the conditional distribution of the optimal transport plan \pi_{t}\in\mathcal{P}(\mathbb{R}\times\mathbb{R}) between P_{t}=G_{t}\#\mathbb{P} and P_{*}. By Brenier's theorem [117], since P_{*} is absolutely continuous, the optimal transport plan is unique.

Note that if PtP_{t} is also absolutely continuous, then Brennier’s theorem implies that the transport plan is deterministic: π(|x0)=δϕ(x0)\pi(\cdot|x_{0})=\delta_{\nabla\phi(x_{0})} for some convex potential ϕ\phi. Since the first variation of W22(P,P)W_{2}^{2}(P,P_{*}) is exactly the potential ϕ\phi [101], the dynamics (46) when restricted to {G|G#𝒫ac()}\{G~{}|~{}G\#\mathbb{P}\in\mathcal{P}_{ac}(\mathbb{R})\} becomes equivalent to the gradient flow on the loss

L(G)=W22(G#,P)L(G)=W_{2}^{2}(G\#\mathbb{P},P_{*})

Since the dynamics (46) is not everywhere differentiable, we consider stationary points instead of critical points, and extend the definition of saddle points: a stationary point G\in L^{2}(\mathbb{P},\mathbb{R}) is a generalized saddle point if for any \epsilon>0, there exists a perturbation h with \|h\|_{L^{2}(\mathbb{P})}<\epsilon such that L(G+h)<L(G).

Proposition 7.5.

For any target distribution P𝒫2,ac()P_{*}\in\mathcal{P}_{2,ac}(\mathbb{R}), the stationary points of the dynamics (46) consist only of global minima and generalized saddle points. If GG is a generalized saddle point, then G#G\#\mathbb{P} is a singular distribution, and there exists xx such that (G#)({x})>0(G\#\mathbb{P})(\{x\})>0. Moreover, global convergence holds

limtW2(P,Pt)=0\lim_{t\to\infty}W_{2}(P_{*},P_{t})=0

if and only if the initialization P0𝒫2,ac()P_{0}\in\mathcal{P}_{2,ac}(\mathbb{R}).

This toy example confirms the intuition that, although the loss L(G) is nonconvex, global convergence is still achievable. Moreover, all saddle points share one feature: part of the mass has collapsed onto a single point.
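The following sketch discretizes the dynamics (46) with finitely many particles, assuming for illustration that P_{*} is the standard Gaussian and that the conditional mean in (46) is approximated by quantile matching against a large sample of P_{*}. It is only a numerical stand-in for the continuum statement of Proposition 7.5.

```python
import numpy as np

# Particle discretization of (46): with n generator particles G_t(z_i),
# the optimal plan matches the i-th smallest particle to the i-th quantile
# block of P_*, so the conditional mean is approximated by the mean of an
# empirical sample of P_* over that block.

rng = np.random.default_rng(2)
n, m_per_block, dt, steps = 200, 100, 0.1, 300

target = np.sort(rng.normal(size=n * m_per_block))       # empirical P_*
block_means = target.reshape(n, m_per_block).mean(axis=1)

G = 5.0 + np.sort(rng.uniform(size=n))                    # spread-out initialization

for _ in range(steps):
    order = np.argsort(G)
    drift = np.empty_like(G)
    drift[order] = block_means - G[order]                 # m_t(G_t(z)) - G_t(z)
    G = G + dt * drift

# Quantiles of G_t # P should roughly agree with those of P_* at convergence.
print("W2-like error:", np.sqrt(np.mean((np.sort(G) - block_means) ** 2)))
```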

7.3 Free generator: Mode collapse

The previous section presented a general study of the critical points of models with free generator, while this section focuses on a particular training failure that is common to GANs: the mode collapse phenomenon. Mode collapse refers to the situation where the generator G_{t} during training maps a positive amount of the mass of \mathbb{P} onto the same point [100, 66, 77, 90], and it has been identified as a primary obstacle to GAN convergence. This characterization is analogous to the saddle points from Proposition 7.5, which are also singular distributions. Although in the setting of Proposition 7.5 the singular distributions can be avoided and the toy model enjoys global convergence, how mode collapse occurs in practice remains a mystery.

To provide some insight into the mode collapse phenomenon, we demonstrate with toy examples two mechanisms that can lead to mode collapse.

Denote by U[x,y]U[x,y] the uniform distribution over the interval [x,y][x,y]. Let the base and target distributions be =P=U[0,1]\mathbb{P}=P_{*}=U[0,1]. Model the generator by G(x)=axG(x)=ax, and discriminator by D(x)=bϕ(x)D(x)=b\phi(x) for some ϕ\phi to be specified. Consider the following GAN loss based on (23)

\min_{a}\max_{b}L(a,b)=\int D~d(G\#\mathbb{P}-P_{*})-\frac{c}{2}|b|^{2}

where c0c\geq 0 is the strength of regularization. Train a,ba,b by continuous time gradient flow

ddtat=aL(at,bt),ddtbt=bL(at,bt)\frac{d}{dt}a_{t}=-\partial_{a}L(a_{t},b_{t}),\quad\frac{d}{dt}b_{t}=\partial_{b}L(a_{t},b_{t})

with initialization a_{0}>0 and b_{0}=0. Mode collapse happens when G_{t}\#\mathbb{P}=U[0,a_{t}] degenerates into a singular distribution, i.e. when a_{t}=0.

Case one. This example shows that a non-differentiable discriminator can lead to mode collapse. Set \phi(x)=|x|=\text{ReLU}(x)+\text{ReLU}(-x). Restricting to the half space \{(a,b),~a>0\}, the loss and training dynamics become

L(a,b)\displaystyle L(a,b) =a12bcb22\displaystyle=\frac{a-1}{2}b-\frac{cb^{2}}{2}
ddtat\displaystyle\frac{d}{dt}a_{t} =b2,ddtbt=a12cb\displaystyle=-\frac{b}{2},\quad\frac{d}{dt}b_{t}=\frac{a-1}{2}-cb

Assume that c[0,1)c\in[0,1). Then, the unique solution is given by

at\displaystyle a_{t} =1+(a01)ect/2(cos(1c2t/2)+c1c2sin(1c2t/2))\displaystyle=1+(a_{0}-1)e^{-ct/2}\Big{(}\cos\big{(}\sqrt{1-c^{2}}~{}t/2\big{)}+\frac{c}{\sqrt{1-c^{2}}}\sin\big{(}\sqrt{1-c^{2}}~{}t/2\big{)}\Big{)}
\displaystyle b_{t} =\frac{a_{0}-1}{\sqrt{1-c^{2}}}e^{-ct/2}\sin\big(\sqrt{1-c^{2}}~t/2\big)

At time T=2π1c2T=\frac{2\pi}{\sqrt{1-c^{2}}}, we have

aT=1(a01)ecπ/1c2a_{T}=1-(a_{0}-1)e^{-c\pi/\sqrt{1-c^{2}}}

Thus, if a0>1+ecπ/1c2a_{0}>1+e^{c\pi/\sqrt{1-c^{2}}}, then the trajectory ata_{t} must hit the line {a=0}\{a=0\} at some t(0,T)t\in(0,T), and thus mode collapse happens.

This process is depicted in Figure 4 (Left). In general, one can conjecture that mode collapse may occur at the locations where Dt\nabla D_{t} is discontinuous.
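A few numerical runs of this piecewise dynamics confirm the picture; the initial values and regularization strengths below are illustrative choices, not the sharp thresholds of the analysis above.

```python
# Integrate a' = -b/2,  b' = (a - 1)/2 - c*b   (valid on the half space a > 0)
# and report whether the trajectory reaches the line {a = 0} (mode collapse).

def collapses(a0, c, dt=1e-3, t_max=60.0):
    a, b = a0, 0.0
    for _ in range(int(t_max / dt)):
        a, b = a - dt * (b / 2), b + dt * ((a - 1) / 2 - c * b)
        if a <= 0.0:
            return True
    return False

for a0, c in ((2.5, 0.0), (2.5, 0.05), (2.5, 0.3)):
    # sufficient condition from the text: a0 > 1 + exp(c*pi/sqrt(1-c^2))
    print(f"a0={a0}, c={c}: collapse = {collapses(a0, c)}")
```

For small c the trajectory crosses a=0 within the first oscillation, while for larger c the damping keeps a_{t} positive, matching the description of Figure 4 (Left).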

Figure 4: Mode collapse during GAN training. Left: Case 1. Right: Case 2. The horizontal and vertical axes are the parameters aa and bb. The red curves are trained with small regularization cc and end up in mode collapse a=0a=0, while the blue curves have larger cc and converge to the stationary point (1,0)(1,0). The initialization is (2.5,0)(2.5,0) and the learning rate for Case 2 is γ=0.1\gamma=0.1.

Case two. This example shows that the accumulation of numerical error due to the finite learning rate can lead to mode collapse. This result is analogous to [79] which shows that if PtP_{t} is a point mass, numerical error can cause it to diverge to infinity.

Set ϕ=12x2\phi=\frac{1}{2}x^{2}. The loss and training dynamics become

L(a,b)\displaystyle L(a,b) =a216bcb22\displaystyle=\frac{a^{2}-1}{6}b-\frac{cb^{2}}{2}
ddtat\displaystyle\frac{d}{dt}a_{t} =ab3,ddtbt=a216cb\displaystyle=-\frac{ab}{3},\quad\frac{d}{dt}b_{t}=\frac{a^{2}-1}{6}-cb

Given a learning rate γ>0\gamma>0, consider the discretized training dynamics

a(k+1)γ=akγγakγbkγ3,b(k+1)γ=bkγ+γ(akγ216cbkγ)a_{(k+1)\gamma}=a_{k\gamma}-\gamma\frac{a_{k\gamma}b_{k\gamma}}{3},\quad b_{(k+1)\gamma}=b_{k\gamma}+\gamma\Big{(}\frac{a_{k\gamma}^{2}-1}{6}-cb_{k\gamma}\Big{)}

for every k\in\mathbb{N}.
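The following sketch iterates this discretized dynamics for a few illustrative values of c (with \gamma=0.1 and initialization (a_{0},b_{0})=(2.5,0), both assumptions for illustration), reporting whether a_{k\gamma} collapses; the energy H introduced below is printed when no collapse occurs.

```python
import math

def simulate(c, gamma=0.1, steps=200_000, a0=2.5, b0=0.0):
    a, b = a0, b0
    for k in range(steps):
        a, b = a - gamma * a * b / 3, b + gamma * ((a * a - 1) / 6 - c * b)
        if a <= 1e-12:                       # overshoot/underflow: mode collapse
            return f"collapsed after {k + 1} steps"
    H = b * b / 2 + (a * a - 1) / 4 - math.log(a) / 2
    return f"no collapse, final (a, b) = ({a:.3f}, {b:.3f}), energy H = {H:.3g}"

for c in (0.0, 0.0005, 0.2):
    print(f"c = {c}:", simulate(c))
```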

In general, given a discrete dynamics in the form of θ(k+1)γ=θkγ+γf(θkγ)\theta_{(k+1)\gamma}=\theta_{k\gamma}+\gamma f(\theta_{k\gamma}), one can fit the sequence θkγ\theta_{k\gamma} by the continuous time solution of an ODE ddtθt=f~(θt)\frac{d}{dt}\theta_{t}=\tilde{f}(\theta_{t}), where f~=f+γh\tilde{f}=f+\gamma h for some function hh. Assume that the two trajectories match at each t=kγt=k\gamma,

f(θkγ)\displaystyle f(\theta_{k\gamma}) =γ1(θ(k+1)γθkγ)=γ10γf~(θkγ+s)𝑑s=01(f+γh)(θ(k+s)γ)𝑑s\displaystyle=\gamma^{-1}(\theta_{(k+1)\gamma}-\theta_{k\gamma})=\gamma^{-1}\int_{0}^{\gamma}\tilde{f}(\theta_{k\gamma+s})ds=\int_{0}^{1}(f+\gamma h)(\theta_{(k+s)\gamma})ds
=01f(θkγ)+sγf(θkγ)f(θkγ)+γh(θkγ)+O(γ2)ds\displaystyle=\int_{0}^{1}f(\theta_{k\gamma})+s\gamma\nabla f(\theta_{k\gamma})\cdot f(\theta_{k\gamma})+\gamma h(\theta_{k\gamma})+O(\gamma^{2})ds
=f(θkγ)+γ2f(θkγ)f(θkγ)+γh(θkγ)+O(γ2)\displaystyle=f(\theta_{k\gamma})+\frac{\gamma}{2}\nabla f(\theta_{k\gamma})\cdot f(\theta_{k\gamma})+\gamma h(\theta_{k\gamma})+O(\gamma^{2})

It follows that

f~=fγ2ff+O(γ2)\tilde{f}=f-\frac{\gamma}{2}\nabla f\cdot f+O(\gamma^{2})

Plugging in θt=[at,bt]\theta_{t}=[a_{t},b_{t}], we obtain the following approximate ODE

ddtat\displaystyle\frac{d}{dt}a_{t} =ab3+γ(ab218+a(a21)36cab6)\displaystyle=-\frac{ab}{3}+\gamma\Big{(}-\frac{ab^{2}}{18}+\frac{a(a^{2}-1)}{36}-\frac{cab}{6}\Big{)}
ddtbt\displaystyle\frac{d}{dt}b_{t} =a216cb+γ(a2b18+c(a21)12c2b2)\displaystyle=\frac{a^{2}-1}{6}-cb+\gamma\Big{(}\frac{a^{2}b}{18}+\frac{c(a^{2}-1)}{12}-\frac{c^{2}b}{2}\Big{)}
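As a quick sanity check on this modified-equation step, the following sketch compares the discrete iterates against accurately integrated solutions of the original and the modified ODE over a fixed horizon; for small enough \gamma one expects the modified ODE to track the iterates more closely. The parameter values are illustrative.

```python
import math

gamma, c, T = 0.2, 0.05, 10.0

def f(a, b):                      # original vector field
    return -a * b / 3, (a * a - 1) / 6 - c * b

def f_mod(a, b):                  # vector field with the O(gamma) correction above
    da, db = f(a, b)
    da += gamma * (-a * b * b / 18 + a * (a * a - 1) / 36 - c * a * b / 6)
    db += gamma * (a * a * b / 18 + c * (a * a - 1) / 12 - c * c * b / 2)
    return da, db

def euler(field, a, b, dt, steps):
    for _ in range(steps):
        da, db = field(a, b)
        a, b = a + dt * da, b + dt * db
    return a, b

a0, b0 = 2.5, 0.0
discrete = euler(f, a0, b0, gamma, int(T / gamma))       # the training iterates
ode      = euler(f, a0, b0, 1e-4, int(T / 1e-4))         # original ODE, fine grid
mod      = euler(f_mod, a0, b0, 1e-4, int(T / 1e-4))     # modified ODE, fine grid

err = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
print("distance to original ODE:", err(discrete, ode))
print("distance to modified ODE:", err(discrete, mod))
```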

Define the energy function

H(a,b)=b22+a214loga2H(a,b)=\frac{b^{2}}{2}+\frac{a^{2}-1}{4}-\frac{\log a}{2}

Then

ddtH(at,bt)=cb2+γ((a2+1)b236+(a21)272c2b22)\frac{d}{dt}H(a_{t},b_{t})=-cb^{2}+\gamma\Big{(}\frac{(a^{2}+1)b^{2}}{36}+\frac{(a^{2}-1)^{2}}{72}-\frac{c^{2}b^{2}}{2}\Big{)}

Thus, if cmin(117,γ144)c\leq\min(\frac{1}{17},\frac{\gamma}{144}),

ddtH(at,bt)γ72[(a2+1)b2+(a21)2]>0\frac{d}{dt}H(a_{t},b_{t})\geq\frac{\gamma}{72}\big{[}(a^{2}+1)b^{2}+(a^{2}-1)^{2}\big{]}>0

The energy HH is strictly convex and its only minimizer is the stationary point (a,b)=(1,0)(a,b)=(1,0) where H=0H=0. Assume that the initialization (a0,b0)(1,0)(a_{0},b_{0})\neq(1,0) and a0>0a_{0}>0. Since the energy is non-decreasing, H(at,bt)H(a0,b0)>0H(a_{t},b_{t})\geq H(a_{0},b_{0})>0, and thus the trajectory will never enter the set S0={(a,b),H(a,b)<H(a0,b0)}S_{0}=\{(a,b),~{}H(a,b)<H(a_{0},b_{0})\}. Since S0S_{0} is an open set that contains (1,0)(1,0), there exists r>0r>0 such that the open ball Br((1,0))S0B_{r}((1,0))\subseteq S_{0}. Since at0a_{t}\geq 0, the trajectory (at,bt)(a_{t},b_{t}) satisfies

ddtH(at,bt)γ72[(a2+1)b2+(a+1)2(a1)2]γr272\frac{d}{dt}H(a_{t},b_{t})\geq\frac{\gamma}{72}\big{[}(a^{2}+1)b^{2}+(a+1)^{2}(a-1)^{2}\big{]}\geq\frac{\gamma r^{2}}{72}

and thus

H(at,bt)H(a0,b0)+γr272tH(a_{t},b_{t})\geq H(a_{0},b_{0})+\frac{\gamma r^{2}}{72}t

Consider the four subsets S_{1}=\{b\geq 0,~a>1\}, S_{2}=\{b>0,~0<a\leq 1\}, S_{3}=\{b\leq 0,~0<a<1\}, S_{4}=\{b<0,~a\geq 1\} that partition the space \{(a,b)~|~(a,b)\neq(1,0),~a>0\}. Then, the trajectory (a_{t},b_{t}) repeatedly moves from S_{1} to S_{2} to S_{3} to S_{4} and back to S_{1}. In particular, it crosses the line \{0<a<1,b=0\} at times t_{1},t_{2}\dots that go to infinity. It follows that

inft0at\displaystyle\inf_{t\geq 0}a_{t} infiatiinfiexp[2(logati2)]\displaystyle\leq\inf_{i\in\mathbb{N}}a_{t_{i}}\leq\inf_{i\in\mathbb{N}}\exp\Big{[}-2\Big{(}\frac{-\log a_{t_{i}}}{2}\Big{)}\Big{]}
infiexp[2H(ati,bti)]\displaystyle\leq\inf_{i\in\mathbb{N}}\exp\big{[}-2H(a_{t_{i}},b_{t_{i}})\big{]}
infiexp[2(H(a0,b0)+γr272ti)]\displaystyle\leq\inf_{i\in\mathbb{N}}\exp\Big{[}-2\big{(}H(a_{0},b_{0})+\frac{\gamma r^{2}}{72}t_{i}\big{)}\Big{]}
=0\displaystyle=0

Hence, ata_{t} converges to 0 exponentially fast, and then numerical underflow would lead to mode collapse. This process is depicted in Figure 4 (Right).

In summary, these two examples indicate two possible causes for mode collapse. On one hand, if the discriminator is non-smooth, then the gradient field \nabla D_{t} may squeeze the distribution G_{t}\#\mathbb{P} at the places where \nabla D_{t} is discontinuous and thus form a singular distribution. On the other hand, the numerical error due to the finite learning rate may amplify the oscillatory training dynamics of G_{t} and D_{t}, and if this error is stronger than the regularization on D_{t}, then the oscillations grow until the generated distribution collapses.

Note that, however, if we consider two-time-scale training such that btb_{t} is always the maximizer, then the loss becomes proportional to L(a)=(|a|1)2L(a)=(|a|-1)^{2} or (a21)2(a^{2}-1)^{2}, and training converges to the global minimum as long as a00a_{0}\neq 0.

8 Discussion

This paper studied three aspects of the machine learning of probability distributions.

First, we proposed a mathematical framework for generative models and density estimators that allows for a unified perspective on these models. Abstractly speaking, the diversity of model designs mainly arises from one factor, the vertical or horizontal way of modeling discussed in Section 3.1. When applied to distribution representation, it leads to the options of potential representation and transport representation, and the latter ramifies into the free generator and fixed generator depending on whether a probabilistic coupling can be fixed. Similarly, when applied to the loss, it leads to three loss types, depending on whether the difference is measured vertically or horizontally, with or without a fixed target. Then, the rest of the design process is to try to realize each category in Table 1 by satisfying the various constraints of implementation, for instance, whether to compute the density of G\#\mathbb{P} directly or indirectly, and which random path is chosen to achieve the product coupling \mathbb{P}\times P_{*}. Thereby, all the major models are derived. By isolating the factors of distribution representation, loss type and function representation, this framework allows for a more transparent study of training and landscape (which depend more on distribution representation and loss type) and generalization error (which depends more on function representation).

Second, we studied the seeming conflict between the memorization phenomenon and the generalization ability of the models, and reviewed our results that resolve this conflict. On one hand, we confirmed that the models satisfy the universal approximation property (some models even enjoy universal convergence) and thus memorization is inevitable. On the other hand, function representations defined by expectations are insensitive to the sampling error PP(n)P_{*}-P_{*}^{(n)}, so that the training trajectory tends to approximate the hidden target distribution before eventually diverging towards memorization. In particular, our results established generalization error bounds of the form O(nα)O(n^{-\alpha}) with α=14,16,18\alpha=\frac{1}{4},\frac{1}{6},\frac{1}{8} for several models. There should be room for improvement, but for now we are content that the models can escape from the curse of dimensionality. Considering that this generalization ability is mostly an effect of the function representations, it seems reasonable to expect that good generalization is enjoyed by all models regardless of the choice of distribution representation and loss type.

Third, we discussed the training dynamics and loss landscape. For the potential and fixed generator representations, the convexity of their distribution parametrizations and loss functions enables the estimation of the rates of global convergence. For the free generator representation, although the loss is non-convex, we demonstrated that the critical points have tractable forms, and also identified two general mechanisms common to the min-max training of GANs that can provably lead to mode collapse. It seems worthwhile to devote more effort to the design of models with the fixed generator representation, since they are as expressive as the free generator models while their convexity greatly eases training. It is not clear at this moment whether the product coupling \mathbb{P}\times P_{*} is the best choice for the fixed generators, and whether the diffusion SDE and linear interpolants are the most effective random paths, so there is much to explore.

To conclude, we list a few interesting topics that have not been covered in this paper:

  • Unstructured data: Our analysis was conducted only in the Euclidean space, whereas most of the applications of distribution learning models involve unstructured data such as images, texts and molecules. One thing of practical importance is that the performance of generative models for unstructured data is judged by human perception or perceptual loss [59], which can greatly differ from the Euclidean metric and thus the W2W_{2} metric. To train the model to have higher fidelity, one approach is to use the adversarial loss of GANs such that the hidden features of the discriminators can be seen as an embedding space that captures fidelity. A related approach is to rely on a pretrained feature embedding as in [96, 53].

  • Prior knowledge: For supervised tasks involving unstructured data, it is often helpful to instill prior knowledge from humans into the models through self-supervised pretraining. A well-known example is the approximate invariance of image classification with respect to color distortion and cropping, and models that are pretrained to be insensitive to these augmentations can achieve higher test accuracy after training [22, 42, 18]. It could be beneficial to try to boost distribution learning models by prior knowledge. One example is given by [131] such that a generative model for sampling a thermodynamic system is designed to respect the spatial symmetry of the system.

  • Conditional generation: In practice, people are more interested in estimating conditional distributions, e.g. generate images conditioned on text descriptions [94, 96]. Incorporating a context variable can be done by simply allowing an additional input variable in the parameter functions [80], but it can also be accomplished with tricks that minimize additional training [107].

  • Factor discovery: For generative models, instead of using G#G\#\mathbb{P} as a blackbox for sampling, it could be useful to train GG in a way such that \mathbb{P} has semantic meaning, e.g. for image synthesis, given a random image G(Z),ZG(Z),Z\sim\mathbb{P} of a face, one coordinate of ZZ may control hair style and another may control head orientation. This unsupervised task is known as factor discovery [110], and some solutions are provided by [23, 48, 62] with application to semantic photo editing [74].

  • Density distillation: The basic setting considered in this paper is to estimate a target distribution given a sample set; yet, another task common to scientific computing is to train a generative model to sample from a distribution 1ZeV\frac{1}{Z}e^{-V} given a potential function VV. One popular approach [70, 131, 82, 17] is to use a modified normalizing flow with the reverse KL-divergence.

9 Proofs

9.1 Loss function

Proof of Proposition 3.1.

For any P,P𝒫ac(d)P_{*},P\in\mathcal{P}_{ac}(\mathbb{R}^{d}), the assumption on global minimum implies that

\lim_{t\to 0^{+}}\frac{1}{t}\big(L((1-t)P_{*}+tP)-L(P_{*})\big)=\int f^{\prime}(P_{*}(\mathbf{x}))\big(P(\mathbf{x})-P_{*}(\mathbf{x})\big)dP_{*}(\mathbf{x})\geq 0

Define the map g(p)=pf(p)g(p)=pf^{\prime}(p). Then,

𝔼P[g(P(𝐱))]𝔼P[g(P(𝐱))]\mathbb{E}_{P}[g(P_{*}(\mathbf{x}))]\geq\mathbb{E}_{P_{*}}[g(P_{*}(\mathbf{x}))]

Assume for contradiction that g is nonconstant on (0,\infty); then there exist a,b>0 such that g(a)<g(b). Let A,B\subseteq\mathbb{R}^{d} be two disjoint hyperrectangles with volumes 1/(2a) and 1/(2b). Define P as the uniform distribution over A, and P_{*} as

P=a𝟏A+b𝟏BP_{*}=a\mathbf{1}_{A}+b\mathbf{1}_{B}

Then the above inequality is violated. It follows that gg is constant and ff has the form clogp+cc\log p+c^{\prime}. Finally, the assumption on global minimum implies that c0c\leq 0. ∎
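The following few lines of Python work out one concrete (assumed) instance of this counterexample in d=1, confirming that the displayed inequality is violated when g is nonconstant.

```python
# Take f(p) = p, so g(p) = p f'(p) = p is nonconstant, and pick a = 1 < b = 2
# with g(a) < g(b).  Let A = [0, 1/(2a)) and B = [1, 1 + 1/(2b)) be disjoint
# intervals, P_* the density a*1_A + b*1_B, and P the uniform distribution on A.
a, b = 1.0, 2.0
g = lambda p: p * 1.0                           # g(p) = p f'(p) with f(p) = p

vol_A, vol_B = 1 / (2 * a), 1 / (2 * b)
mass_A, mass_B = a * vol_A, b * vol_B           # each is 1/2, so P_* has total mass 1

E_P = g(a)                                      # P is supported on A, where P_* = a
E_Pstar = mass_A * g(a) + mass_B * g(b)         # = (g(a) + g(b)) / 2

print("masses of P_* on A and B:", mass_A, mass_B)
print("E_P[g(P_*)] =", E_P, "  E_{P_*}[g(P_*)] =", E_Pstar)
print("inequality E_P >= E_{P_*} violated:", E_P < E_Pstar)
```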

9.2 Universal approximation theorems

Proof of Proposition 4.1.

By Brenier's theorem ([14] and [117, Theorem 2.12]), since both \mathbb{P} and P_{*} have finite second moments and \mathbb{P} is absolutely continuous, there exists an L^{2}(\mathbb{P},\mathbb{R}^{d}) function G_{*} such that G_{*}\#\mathbb{P}=P_{*}. By Lemma 9.1, there exists a sequence \{G_{n}\}_{n=1}^{\infty}\subset\mathcal{H} that converges to G_{*} in L^{2}(\mathbb{P}) norm. Hence,

\lim_{n\to\infty}W_{2}(P_{*},G_{n}\#\mathbb{P})=\lim_{n\to\infty}W_{2}(G_{*}\#\mathbb{P},G_{n}\#\mathbb{P})\leq\lim_{n\to\infty}\|G_{*}-G_{n}\|_{L^{2}(\mathbb{P})}=0

The preceding proof uses the following lemma, which is a slight extension of the classical universal approximation theorem [52, 51] to cases with possibly unbounded base distributions \mathbb{P}. Such an extension is needed since in practice \mathbb{P} is usually set to be the unit Gaussian.

Lemma 9.1.

Given either Assumption 3.1 or 3.2, for any 𝒫(d)\mathbb{P}\in\mathcal{P}(\mathbb{R}^{d}) and any kk\in\mathbb{N}, the space (d,k)\mathcal{H}(\mathbb{R}^{d},\mathbb{R}^{k}) is dense in L2(,k)L^{2}(\mathbb{P},\mathbb{R}^{k}).

Proof.

It suffices to consider the case with output dimension k=1k=1. Denote \mathcal{H} by σ\mathcal{H}_{\sigma} to emphasize the choice of the activation σ\sigma.

Given Assumption 3.1, the activation σ\sigma is ReLU. Define

σ(x)=σ(x+1+ϵ)2σ(x+ϵ)+σ(x1+ϵ)dh(ϵ)\sigma_{*}(x)=\int\sigma(x+1+\epsilon)-2\sigma(x+\epsilon)+\sigma(x-1+\epsilon)dh(\epsilon)

where hh is a continuous distribution supported in [0.1,0.1][-0.1,0.1]. Since ReLU is homogeneous, σ\sigma_{*} can be expressed by (36) with some bounded parameter function aa, and thus σσ(,)\sigma_{*}\in\mathcal{H}_{\sigma}(\mathbb{R},\mathbb{R}). Similarly, given Assumption 3.2, the activation σ\sigma is sigmoid. Define

σ(x)=σ(x+1+ϵ)σ(x1+ϵ)dh(ϵ)\sigma_{*}(x)=\int\sigma(x+1+\epsilon)-\sigma(x-1+\epsilon)dh(\epsilon)

Since the parameter distribution ρ\rho is bounded below by some positive constant over the ball B3B_{3}, the function σ\sigma_{*} can be expressed by (36) with bounded parameter function aa, and thus σσ(,)\sigma_{*}\in\mathcal{H}_{\sigma}(\mathbb{R},\mathbb{R}).

Hence, we always have σσ(,)\sigma_{*}\in\mathcal{H}_{\sigma}(\mathbb{R},\mathbb{R}) and it is integrable (σL1()<\|\sigma_{*}\|_{L^{1}(\mathbb{R})}<\infty). Define the subspace σcσ\mathcal{H}^{c}_{\sigma_{*}}\subseteq\mathcal{H}_{\sigma_{*}} of functions faf_{a} whose parameter functions aa are compactly-supported. Since σcσ\mathcal{H}^{c}_{\sigma_{*}}\subseteq\mathcal{H}_{\sigma}, it suffices to show that σc\mathcal{H}^{c}_{\sigma_{*}} is dense in L2()L^{2}(\mathbb{P}). Without loss of generality, we denote σ\sigma_{*} by σ\sigma. Since σ\sigma is L1L^{1}, its Fourier transform σ^\hat{\sigma} is well-defined. Since σ\sigma is not constant zero, there exists a constant c0c\neq 0 such that σ^(c)0\hat{\sigma}(c)\neq 0. Perform scaling if necessary, assume that σ^(1)=1\hat{\sigma}(1)=1.
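The following short numerical check illustrates the ReLU construction above (with the mollification by h dropped for simplicity): the combination of shifted ReLUs is the triangular hat function, which is integrable and has a nonvanishing Fourier transform, as the proof requires.

```python
import numpy as np

# sigma_*(x) = ReLU(x+1) - 2 ReLU(x) + ReLU(x-1) is the triangular hat function.

relu = lambda t: np.maximum(t, 0.0)
sigma_star = lambda t: relu(t + 1) - 2 * relu(t) + relu(t - 1)

x = np.linspace(-3.0, 3.0, 60001)
dx = x[1] - x[0]
values = sigma_star(x)

l1_norm = np.abs(values).sum() * dx                       # = 1, the area of the hat
xi = 0.25                                                 # an arbitrary test frequency
fourier = (values * np.exp(-2j * np.pi * xi * x)).sum() * dx

print("L1 norm of sigma_*:", l1_norm)
print("|Fourier transform at xi = 0.25|:", abs(fourier))  # nonzero
```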

It suffices to approximate the subspace Cc(d)C_{c}^{\infty}(\mathbb{R}^{d}), which is dense in L2()L^{2}(\mathbb{P}). Fix any fCcf\in C_{c}^{\infty}. Its Fourier transform f^\hat{f} is integrable. Then, Step 1 of the proof of [52, Theorem 3.1] implies that,

f(𝐱)=dσ(𝐰𝐱+b)m(𝐰,b)𝑑𝐰𝑑b,m(𝐰,b)=Re[f^(𝐰)e2πib]f(\mathbf{x})=\int_{\mathbb{R}}\int_{\mathbb{R}^{d}}\sigma(\mathbf{w}\cdot\mathbf{x}+b)m(\mathbf{w},b)~{}d\mathbf{w}db,\quad m(\mathbf{w},b)=\text{Re}\big{[}\hat{f}(\mathbf{w})e^{-2\pi ib}\big{]}

For any R>0R>0, define the signed distribution mRm_{R} by

mR(𝐰,b)=Re[f^(𝐰)e2πib]𝟏[R,R]d+1(𝐰,b)m_{R}(\mathbf{w},b)=\text{Re}\big{[}\hat{f}(\mathbf{w})e^{-2\pi ib}\big{]}\mathbf{1}_{[-R,R]^{d+1}}(\mathbf{w},b)

Define the function fRf_{R} by

fR(𝐱)\displaystyle f_{R}(\mathbf{x}) =σ(𝐰𝐱+b)𝑑mR(𝐰,b)\displaystyle=\int\sigma(\mathbf{w}\cdot\mathbf{x}+b)dm_{R}(\mathbf{w},b)
=aR(𝐰,b)σ(𝐰𝐱+b)𝑑ρ(𝐰,b)\displaystyle=\int a_{R}(\mathbf{w},b)\sigma(\mathbf{w}\cdot\mathbf{x}+b)d\rho(\mathbf{w},b)

If Assumption 3.2 holds, then define the parameter function aRa_{R} by the compactly supported function

aR(𝐰,b)=mR(𝐰,b)ρ(𝐰,b)a_{R}(\mathbf{w},b)=\frac{m_{R}(\mathbf{w},b)}{\rho(\mathbf{w},b)}

Then,

fR\displaystyle\|f_{R}\|_{\mathcal{H}} aRL2(ρ)fL1(d)min𝐰,b[R,R]d+1ρ(𝐰,b)<\displaystyle\leq\|a_{R}\|_{L^{2}(\rho)}\leq\frac{\|f\|_{L^{1}(\mathbb{R}^{d})}}{\sqrt{\min_{\mathbf{w},b\in[-R,R]^{d+1}}\rho(\mathbf{w},b)}}<\infty

If Assumption 3.1 holds, then define aRa_{R} by the following function over the l1l^{1} sphere {𝐰1+|b|=1}\{\|\mathbf{w}\|_{1}+|b|=1\}

aR(𝐰,b)=0mR(λ𝐰,λb)𝑑λρ(𝐰,b)a_{R}(\mathbf{w},b)=\frac{\int_{0}^{\infty}m_{R}(\lambda\mathbf{w},\lambda b)d\lambda}{\rho(\mathbf{w},b)}

Then,

fR\displaystyle\|f_{R}\|_{\mathcal{H}} aRL2(ρ)fL1(d)min𝐰+|b|=1ρ(𝐰,b)<\displaystyle\leq\|a_{R}\|_{L^{2}(\rho)}\leq\frac{\|f\|_{L^{1}(\mathbb{R}^{d})}}{\sqrt{\min_{\|\mathbf{w}\|+|b|=1}\rho(\mathbf{w},b)}}<\infty

Thus, we always have fRσc(d,)f_{R}\in\mathcal{H}^{c}_{\sigma}(\mathbb{R}^{d},\mathbb{R}).

The approximation error is bounded by

ffRL2()2\displaystyle\|f-f_{R}\|_{L^{2}(\mathbb{P})}^{2} hR(𝐱,𝐰,b,𝐰,b)𝑑P(𝐱)𝑑𝐰𝑑𝐰𝑑b𝑑b\displaystyle\leq\iint\iint\int h_{R}(\mathbf{x},\mathbf{w},b,\mathbf{w}^{\prime},b^{\prime})dP(\mathbf{x})d\mathbf{w}d\mathbf{w}^{\prime}dbdb^{\prime}
\displaystyle h_{R}(\mathbf{x},\mathbf{w},b,\mathbf{w}^{\prime},b^{\prime}) =|\sigma(\mathbf{w}\cdot\mathbf{x}+b)||\hat{f}(\mathbf{w})|\mathbf{1}_{\mathbb{R}^{d+1}-[-R,R]^{d+1}}(\mathbf{w},b)
\displaystyle\quad\cdot|\sigma(\mathbf{w}^{\prime}\cdot\mathbf{x}+b^{\prime})||\hat{f}(\mathbf{w}^{\prime})|\mathbf{1}_{\mathbb{R}^{d+1}-[-R,R]^{d+1}}(\mathbf{w}^{\prime},b^{\prime})

Note that 0hRh00\leq h_{R}\leq h_{0}, and hR0h_{R}\to 0 pointwise, and h0h_{0} is integrable:

h0(𝐱,𝐰,b,𝐰,b)𝑑P(𝐱)𝑑𝐰𝑑𝐰𝑑b𝑑bσL1()2f^L1(d)2<\iint\iint\int h_{0}(\mathbf{x},\mathbf{w},b,\mathbf{w}^{\prime},b^{\prime})dP(\mathbf{x})d\mathbf{w}d\mathbf{w}^{\prime}dbdb^{\prime}\leq\|\sigma\|_{L^{1}(\mathbb{R})}^{2}\|\hat{f}\|_{L^{1}(\mathbb{R}^{d})}^{2}<\infty

Hence, the dominated convergence theorem implies that \lim_{R\to\infty}\|f-f_{R}\|_{L^{2}(\mathbb{P})}=0, which completes the proof. ∎

Proof of Proposition 4.2.

If Assumption 3.1 holds, then [109] implies that \mathcal{H} is dense in C(K) with respect to the supremum norm. Otherwise, Assumption 3.2 holds, and [52] implies that \mathcal{H} is dense in C(K). Thus, we can apply Proposition 2.1 of [126] to conclude the proof. ∎

Proof of Proposition 4.3.

It suffices to approximate the compactly-supported distributions, which are dense with respect to the W2W_{2} metric. Fix any compactly-supported distribution PP. Choose R>0R>0 such that the support is contained in BRB_{R}. Assume that the base distribution \mathbb{P} has full support over d\mathbb{R}^{d}. The case without full support will be discussed in the end.

By Brenier's theorem [117, Theorem 2.12], there exists a convex function \psi over \mathbb{R}^{d} such that \nabla\psi\#\mathbb{P}=P. For any \delta>0, we can define the mollified function \psi_{\delta}

\psi_{\delta}(\mathbf{x})=h_{\delta}*\psi=\delta^{-d}\int h\Big(\frac{\mathbf{y}}{\delta}\Big)\psi(\mathbf{x}-\mathbf{y})d\mathbf{y}=\delta^{-d}\int h\Big(\frac{\mathbf{x}-\mathbf{y}}{\delta}\Big)\psi(\mathbf{y})d\mathbf{y}

where hh is a mollifier (i.e. hh is CC^{\infty}, non-negative, supported in the unit ball, and h=1\int h=1). Then, ψδ\psi_{\delta} is CC^{\infty} and convex, and

limδ0W2(P,ψδ#)limδ0ψhδψL2()=0\lim_{\delta\to 0}W_{2}(P,\nabla\psi_{\delta}\#\mathbb{P})\leq\lim_{\delta\to 0}\|\nabla\psi-h_{\delta}*\nabla\psi\|_{L^{2}(\mathbb{P})}=0

Thus, without loss of generality, we can assume that ψ\psi is CC^{\infty}.

Define the time-dependent transport map for τ[0,1]\tau\in[0,1]

Tτ(𝐱)=(1τ)𝐱+τψ(𝐱)=[(1τ)𝐱22+τψ]T_{\tau}(\mathbf{x})=(1-\tau)\mathbf{x}+\tau\nabla\psi(\mathbf{x})=\nabla\Big{[}(1-\tau)\frac{\|\mathbf{x}\|^{2}}{2}+\tau\psi\Big{]}

which is CC^{\infty} in 𝐱\mathbf{x} and τ\tau, and define the distributions Pτ=Tτ#P_{\tau}=T_{\tau}\#\mathbb{P}, which are known as McCann interpolation. Then, define the vector field

V_{\tau}=(\nabla\psi-\mathrm{id})\circ T_{\tau}^{-1}

The Jacobian of TτT_{\tau} is positive definite for τ[0,1)\tau\in[0,1)

𝐱Tτ=Hess[(1τ)𝐱22+τψ](1τ)I\nabla_{\mathbf{x}}T_{\tau}=\text{Hess}\Big{[}(1-\tau)\frac{\|\mathbf{x}\|^{2}}{2}+\tau\psi\Big{]}\geq(1-\tau)I

so the inverse function theorem implies that the inverse Tτ1T^{-1}_{\tau} exists and is CC^{\infty} over (𝐱,τ)d×[0,1)(\mathbf{x},\tau)\in\mathbb{R}^{d}\times[0,1), with

\nabla_{\mathbf{x}}T^{-1}_{\tau}\leq\frac{1}{1-\tau}I,\quad\partial_{\tau}T^{-1}_{\tau}(\mathbf{x})=-\nabla_{\mathbf{x}}T^{-1}_{\tau}(\mathbf{x})\cdot\partial_{\tau}T_{\tau}(T^{-1}_{\tau}(\mathbf{x}))

Since limδ0+W2(P1δ,P)=0\lim_{\delta\to 0^{+}}W_{2}(P_{1-\delta},P)=0, we can replace the approximation target PP by the sequence {P11/n}\{P_{1-1/n}\}, restrict τ\tau to the interval [0,11/n][0,1-1/n], and thus assume without loss of generality that sup𝐱Tτ1<\sup\nabla_{\mathbf{x}}T^{-1}_{\tau}<\infty. It follows that Tτ1T_{\tau}^{-1} and VτV_{\tau} are CC^{\infty} over d×[0,1]\mathbb{R}^{d}\times[0,1]. By Picard-Lindelöf theorem, the ODE 𝐱˙τ=Vτ(𝐱τ)\dot{\mathbf{x}}_{\tau}=V_{\tau}(\mathbf{x}_{\tau}) has unique solution locally, and it is straightforward to check that 𝐱t=Tt(𝐱0)\mathbf{x}_{t}=T_{t}(\mathbf{x}_{0}) is exactly the solution.
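The following 1-d sketch verifies numerically that x_{\tau}=T_{\tau}(x_{0}) solves the ODE driven by V_{\tau}=(\nabla\psi-\mathrm{id})\circ T_{\tau}^{-1}, for one illustrative smooth convex potential \psi (an assumption for the example); the inverse map T_{\tau}^{-1} is computed by bisection.

```python
import numpy as np

# Illustrative convex potential psi(x) = x^2/2 + log(1 + e^x), so that
# grad psi(x) = x + s(x) with s the logistic sigmoid, and
# T_tau(x) = (1 - tau) x + tau grad psi(x) = x + tau s(x).

s = lambda x: 1.0 / (1.0 + np.exp(-x))
grad_psi = lambda x: x + s(x)
T = lambda tau, x: (1 - tau) * x + tau * grad_psi(x)

def T_inverse(tau, y, iters=60):
    lo, hi = y - tau - 1.0, y + 1.0                 # bracket: 0 <= tau * s(x) <= tau
    for _ in range(iters):                          # bisection on T(tau, x) = y
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if T(tau, mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def velocity(tau, y):
    x = T_inverse(tau, y)
    return grad_psi(x) - x                          # (grad psi - id) o T_tau^{-1}

x0, n_steps = -0.7, 2000
y, dtau = x0, 1.0 / n_steps
for k in range(n_steps):                            # forward Euler in tau
    y = y + dtau * velocity(k * dtau, y)

print("ODE endpoint:", y, "  direct map T_1(x0):", T(1.0, x0))
```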

For any r>R+1r>R+1 and any ϵ(0,1)\epsilon\in(0,1), the universal approximation theorem [51] implies that there exists a random feature function Vr,ϵ(d+1,d)V_{r,\epsilon}\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}) such that

VVr,ϵC0(Br×[0,1])<ϵ\|V-V_{r,\epsilon}\|_{C^{0}(B_{r}\times[0,1])}<\epsilon (47)

Denote its flow map (7) by GτG_{\tau}. For any 𝐱0Br\mathbf{x}_{0}\in B_{r}, let 𝐱τ,𝐲τ\mathbf{x}_{\tau},\mathbf{y}_{\tau} denote the solutions to the ODEs

𝐱˙τ=V(𝐱τ,τ),𝐲˙τ=Vr,ϵ(𝐲τ,τ),𝐱0=𝐲0\dot{\mathbf{x}}_{\tau}=V(\mathbf{x}_{\tau},\tau),\quad\dot{\mathbf{y}}_{\tau}=V_{r,\epsilon}(\mathbf{y}_{\tau},\tau),\quad\mathbf{x}_{0}=\mathbf{y}_{0}

Recall that Vτ(𝐱τ)V_{\tau}(\mathbf{x}_{\tau}) is simply the vector 𝐱1𝐱0\mathbf{x}_{1}-\mathbf{x}_{0}, where 𝐱1sprtPBR\mathbf{x}_{1}\in\text{sprt}P\subseteq B_{R}. Then, for any 𝐱BrBR+1\mathbf{x}\in B_{r}-B_{R+1} and τ[0,1]\tau\in[0,1],

𝐱𝐱V(𝐱,τ)1,𝐱𝐱Vr,ϵ(𝐱,τ)1+ϵ\frac{\mathbf{x}}{\|\mathbf{x}\|}\cdot V(\mathbf{x},\tau)\leq-1,\quad\frac{\mathbf{x}}{\|\mathbf{x}\|}\cdot V_{r,\epsilon}(\mathbf{x},\tau)\leq-1+\epsilon

and this remains true despite the surgeries we performed on VV (the mollification of ψ\psi and the restriction of tt to [0,11/n][0,1-1/n]). It follows that the solutions 𝐱τ,𝐲τ\mathbf{x}_{\tau},\mathbf{y}_{\tau} are contained in BrB_{r}. By Theorem 2.8 of [113],

𝐱1𝐲1VVr,ϵC0(Br×[0,1])eVC0(Br×[0,1])1VC0(Br×[0,1])<ϵexpVC0(Br×[0,1])\|\mathbf{x}_{1}-\mathbf{y}_{1}\|\leq\|V-V_{r,\epsilon}\|_{C^{0}(B_{r}\times[0,1])}\frac{e^{\|\nabla V\|_{C^{0}(B_{r}\times[0,1])}}-1}{\|\nabla V\|_{C^{0}(B_{r}\times[0,1])}}<\epsilon\exp\|\nabla V\|_{C^{0}(B_{r}\times[0,1])}

Thus

W2(T1#|Br,G1#|Br)<ϵexpVC0(Br×[0,1])W_{2}\big{(}T_{1}\#\mathbb{P}|_{B_{r}},G_{1}\#\mathbb{P}|_{B_{r}}\big{)}<\epsilon\exp\|\nabla V\|_{C^{0}(B_{r}\times[0,1])}

Meanwhile, for any m>0m>0, define the random feature function gmg_{m}\in\mathcal{H}

gm(𝐱,τ)=m𝐧σ(m(𝐧𝐱(r+1)))d(h1/mU𝕊d1)(𝐧)g_{m}(\mathbf{x},\tau)=m\int\mathbf{n}~{}\sigma\big{(}m(\mathbf{n}\cdot\mathbf{x}-(r+1))\big{)}d(h_{1/m}*U\mathbb{S}_{d-1})(\mathbf{n})

where U\mathbb{S}_{d-1} is the uniform distribution over the unit sphere and h_{1/m} is a mollifier. Since \sigma is sigmoid, as m\to\infty, we have \|g_{m}(\mathbf{x})\|\to 0 uniformly over B_{r} while \|g_{m}(\mathbf{x})\|\to\infty uniformly over \mathbb{R}^{d}-B_{r+2}. Replace V_{r,\epsilon} by V_{r,\epsilon}-g_{m} with m sufficiently large that (47) continues to hold, while for all \mathbf{x}\in\mathbb{R}^{d}-B_{r+2} and t\in[0,1]

\frac{\mathbf{x}}{\|\mathbf{x}\|}\cdot V_{r,\epsilon}(\mathbf{x},t)<0

It follows that G1(𝐱0)𝐱0\|G_{1}(\mathbf{x}_{0})\|\leq\|\mathbf{x}_{0}\| for all 𝐱0dBr+2\mathbf{x}_{0}\in\mathbb{R}^{d}-B_{r+2}. Thus

W2(T1#|dBr,G1#|dBr)<dBr+2(𝐱+r+2)2𝑑(𝐱)W_{2}\big{(}T_{1}\#\mathbb{P}|_{\mathbb{R}^{d}-B_{r}},G_{1}\#\mathbb{P}|_{\mathbb{R}^{d}-B_{r}}\big{)}<\int_{\mathbb{R}^{d}-B_{r+2}}(\|\mathbf{x}\|+r+2)^{2}d\mathbb{P}(\mathbf{x})

Hence,

limrlimϵ0W2(P,G1#)\displaystyle\quad\lim_{r\to\infty}\lim_{\epsilon\to 0}W_{2}(P,G_{1}\#\mathbb{P})
limrlimϵ0W2(T1#|Br,G1#|Br)+W2(T1#|dBr,G1#|dBr)\displaystyle\leq\lim_{r\to\infty}\lim_{\epsilon\to 0}W_{2}\big{(}T_{1}\#\mathbb{P}|_{B_{r}},G_{1}\#\mathbb{P}|_{B_{r}}\big{)}+W_{2}\big{(}T_{1}\#\mathbb{P}|_{\mathbb{R}^{d}-B_{r}},G_{1}\#\mathbb{P}|_{\mathbb{R}^{d}-B_{r}}\big{)}
limrlimϵ0ϵexpVC0(Br×[0,1])+dBr+2(𝐱+r+2)2𝑑(𝐱)\displaystyle\leq\lim_{r\to\infty}\lim_{\epsilon\to 0}\epsilon\exp\|\nabla V\|_{C^{0}(B_{r}\times[0,1])}+\int_{\mathbb{R}^{d}-B_{r+2}}(\|\mathbf{x}\|+r+2)^{2}d\mathbb{P}(\mathbf{x})
limrdBr+2(𝐱+r+2)2𝑑(𝐱)\displaystyle\leq\lim_{r\to\infty}\int_{\mathbb{R}^{d}-B_{r+2}}(\|\mathbf{x}\|+r+2)^{2}d\mathbb{P}(\mathbf{x})
0\displaystyle\leq 0

We conclude that PP is a limit point of 𝒢#\mathcal{G}\#\mathbb{P}.

Finally, consider the general case when the support of the base distribution \mathbb{P} is not necessarily d\mathbb{R}^{d}. For any ϵ(0,1)\epsilon\in(0,1), define ϵ=(1ϵ)+ϵ𝒩\mathbb{P}_{\epsilon}=(1-\epsilon)\mathbb{P}+\epsilon\mathcal{N}, where 𝒩\mathcal{N} is the unit Gaussian distribution. Choose a velocity field Vϵ(d+1,d)V_{\epsilon}\in\mathcal{H}(\mathbb{R}^{d+1},\mathbb{R}^{d}) such that its flow GτG_{\tau} satisfies W2(P,G1#ϵ)<ϵW_{2}(P,G_{1}\#\mathbb{P}_{\epsilon})<\epsilon. Then

W2(P,G1#)\displaystyle W_{2}(P,G_{1}\#\mathbb{P}) W2(P,G1#ϵ)+W2(G1#ϵ,G1#)\displaystyle\leq W_{2}(P,G_{1}\#\mathbb{P}_{\epsilon})+W_{2}(G_{1}\#\mathbb{P}_{\epsilon},G_{1}\#\mathbb{P})
<ϵ+W2(G1#ϵ𝒩,G1#ϵ)\displaystyle<\epsilon+W_{2}(G_{1}\#\epsilon\mathcal{N},G_{1}\#\epsilon\mathbb{P})
ϵ+W2(ϵG1#𝒩,ϵδ𝟎)+W2(ϵδ0,ϵG1#)\displaystyle\leq\epsilon+W_{2}(\epsilon G_{1}\#\mathcal{N},\epsilon\delta_{\mathbf{0}})+W_{2}(\epsilon\delta_{\textbf{0}},\epsilon G_{1}\#\mathbb{P})
ϵ+ϵ(𝐱2𝑑𝒩(𝐱)+𝐱2𝑑(𝐱))\displaystyle\leq\epsilon+\epsilon\Big{(}\sqrt{\int\|\mathbf{x}\|^{2}d\mathcal{N}(\mathbf{x})}+\sqrt{\int\|\mathbf{x}\|^{2}d\mathbb{P}(\mathbf{x})}\Big{)}

Taking ϵ0\epsilon\to 0 completes the proof. ∎

9.3 Generalization error

Proof of Proposition 6.4.

This proof is a slight modification of the proof of Proposition 6.3 (Proposition 3.9 of [126]). By equation (17) of [126], with probability 1δ1-\delta over the sampling of P(n)P_{*}^{(n)},

|L(a)L(n)(a)|42log2d+2log(2/δ)naL2(ρ)|L(a)-L^{(n)}(a)|\leq\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}\|a\|_{L^{2}(\rho)} (48)

Since the regularized loss is strongly convex in aa, the minimizer aλ(n)a_{\lambda}^{(n)} exists and is unique. Then

L(aλ(n))\displaystyle L(a^{(n)}_{\lambda}) L(n)(aλ(n))+42log2d+2log(2/δ)naλ(n)L2(ρ)\displaystyle\leq L^{(n)}(a^{(n)}_{\lambda})+\frac{4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}\|a^{(n)}_{\lambda}\|_{L^{2}(\rho)}
L(n)(aλ(n))+λnaλ(n)L2(ρ)\displaystyle\leq L^{(n)}(a^{(n)}_{\lambda})+\frac{\lambda}{\sqrt{n}}\|a^{(n)}_{\lambda}\|_{L^{2}(\rho)}
L(n)(a)+λnaL2(ρ)\displaystyle\leq L^{(n)}(a_{*})+\frac{\lambda}{\sqrt{n}}\|a_{*}\|_{L^{2}(\rho)}
\displaystyle\leq L(a_{*})+\frac{\lambda+4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}}{\sqrt{n}}\|a_{*}\|_{L^{2}(\rho)}

where the first and last inequalities follow from (48), the second from the assumption that \lambda\geq 4\sqrt{2\log 2d}+\sqrt{2\log(2/\delta)}, and the third from the fact that a_{\lambda}^{(n)} is the global minimizer of the regularized empirical loss.

Hence,

KL(PPλ(n))=L(aλ(n))L(a)2λaL2(ρ)n\text{KL}(P_{*}\|P^{(n)}_{\lambda})=L(a^{(n)}_{\lambda})-L(a_{*})\leq\frac{2\lambda\|a_{*}\|_{L^{2}(\rho)}}{\sqrt{n}}
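The bound (48) itself is quoted from [126]; the following generic Monte Carlo sketch only illustrates the underlying n^{-1/2} sampling-error scaling for expectations of random features, with an assumed Gaussian target and \tanh activation (the constants are not those of (48)).

```python
import numpy as np

# Sup-gap over a fixed set of random features between empirical and population
# expectations of sigma(w.x + b); it shrinks roughly like 1/sqrt(n).

rng = np.random.default_rng(3)
d, n_features = 10, 50
w = rng.normal(size=(n_features, d))
b = rng.normal(size=n_features)
sigma = np.tanh

x_pop = rng.normal(size=(100_000, d))                    # large proxy sample of P_*
population_mean = sigma(x_pop @ w.T + b).mean(axis=0)

for n in (100, 1_000, 10_000):
    gaps = []
    for _ in range(20):
        x_n = rng.normal(size=(n, d))                    # a size-n sample P_*^{(n)}
        gap = np.max(np.abs(sigma(x_n @ w.T + b).mean(axis=0) - population_mean))
        gaps.append(gap)
    print(f"n={n:6d}:  sup-gap ~ {np.mean(gaps):.4f},"
          f"  sqrt(n) * gap ~ {np.sqrt(n) * np.mean(gaps):.3f}")
```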

9.4 Training and loss landscape

Proof of Proposition 7.4.

Since GθG_{\theta} is Lipschitz for any θΘ\theta\in\Theta, the set of generated distributions 𝒫Θ𝒫2(d)\mathcal{P}_{\Theta}\subseteq\mathcal{P}_{2}(\mathbb{R}^{d}), so the loss L(θ)=L(Gθ#)L(\theta)=L(G_{\theta}\#\mathbb{P}) is well-defined. Denote the evaluation of any linear operator ll over the Hilbert space Θ\Theta by l,θ\langle l,\theta\rangle. By assumption, for any θ0,θ1Θ\theta_{0},\theta_{1}\in\Theta, the velocity field

\big\langle\nabla_{\theta}G_{\theta_{0}},\theta_{1}\big\rangle\in L^{2}(\mathbb{P},\mathbb{R}^{d})

Thus, we can define a linear operator Vθ:ΘL2(Gθ#,d)V_{\theta}:\Theta\to L^{2}(G_{\theta}\#\mathbb{P},\mathbb{R}^{d}) by

\displaystyle\int f(\mathbf{x})\cdot\big\langle V_{\theta_{0}},\theta_{1}\big\rangle(\mathbf{x})~d(G_{\theta_{0}}\#\mathbb{P})(\mathbf{x}):=\int f(G_{\theta_{0}}(\mathbf{x}))\cdot\big\langle\nabla_{\theta}G_{\theta_{0}},\theta_{1}\big\rangle(\mathbf{x})~d\mathbb{P}(\mathbf{x})

where fL2(Gθ0#,d)f\in L^{2}(G_{\theta_{0}}\#\mathbb{P},\mathbb{R}^{d}) is any test function. This operator is continuous since VθopθGθop<\|V_{\theta}\|_{op}\leq\|\nabla_{\theta}G_{\theta}\|_{op}<\infty.

For any perturbation hΘh\in\Theta,

ddϵL(θ+ϵh)|ϵ=0\displaystyle\frac{d}{d\epsilon}L(\theta+\epsilon h)\big{|}_{\epsilon=0} =limϵ01ϵ(L(Gθ+ϵh#)L(Gθ#))\displaystyle=\lim_{\epsilon\to 0}\frac{1}{\epsilon}\big{(}L(G_{\theta+\epsilon h}\#\mathbb{P})-L(G_{\theta}\#\mathbb{P})\big{)}
=limϵ01ϵδLδP|Gθ+ϵh#d(Gθ+ϵh#Gθ#)\displaystyle=\lim_{\epsilon\to 0}\frac{1}{\epsilon}\int\frac{\delta L}{\delta P}\big{|}_{G_{\theta+\epsilon h}\#\mathbb{P}}d(G_{\theta+\epsilon h}\#\mathbb{P}-G_{\theta}\#\mathbb{P})
=limϵ01ϵδLδP(Gθ+ϵh(𝐱))δLδP(Gθ(𝐱))d(𝐱)\displaystyle=\lim_{\epsilon\to 0}\frac{1}{\epsilon}\int\frac{\delta L}{\delta P}\big{(}G_{\theta+\epsilon h}(\mathbf{x})\big{)}-\frac{\delta L}{\delta P}\big{(}G_{\theta}(\mathbf{x})\big{)}d\mathbb{P}(\mathbf{x})
=limϵ01ϵδLδP(Gθ(𝐱))(Gθ+ϵh(𝐱)Gθ(𝐱))𝑑(𝐱)\displaystyle=\lim_{\epsilon\to 0}\frac{1}{\epsilon}\int\nabla\frac{\delta L}{\delta P}\big{(}G_{\theta}(\mathbf{x})\big{)}\cdot\big{(}G_{\theta+\epsilon h}(\mathbf{x})-G_{\theta}(\mathbf{x})\big{)}d\mathbb{P}(\mathbf{x})
=δLδP(Gθ(𝐱))θGθ,h(𝐱)𝑑(𝐱)\displaystyle=\int\nabla\frac{\delta L}{\delta P}\big{(}G_{\theta}(\mathbf{x})\big{)}\cdot\big{\langle}\nabla_{\theta}G_{\theta},h\big{\rangle}(\mathbf{x})~{}d\mathbb{P}(\mathbf{x})
=δLδP(𝐱)Vθ,h(𝐱)d(Gθ#)(𝐱)\displaystyle=\int\nabla\frac{\delta L}{\delta P}(\mathbf{x})\cdot\big{\langle}V_{\theta},h\big{\rangle}(\mathbf{x})~{}d(G_{\theta}\#\mathbb{P})(\mathbf{x})

Hence, the loss LL is differentiable in θ\theta and the derivative is the following operator

θL(θ),=δLδP(𝐱)Vθ,(𝐱)d(Gθ#)(𝐱)\langle\nabla_{\theta}L(\theta),\cdot\rangle=\int\nabla\frac{\delta L}{\delta P}(\mathbf{x})\cdot\big{\langle}V_{\theta},\cdot\big{\rangle}(\mathbf{x})~{}d(G_{\theta}\#\mathbb{P})(\mathbf{x})

For any θΘ\theta\in\Theta such that Gθ#CPG_{\theta}\#\mathbb{P}\in CP,

θL(θ),=𝟎Vθ,(𝐱)d(Gθ#)(𝐱)=0\langle\nabla_{\theta}L(\theta),\cdot\rangle=\int\mathbf{0}\cdot\big{\langle}V_{\theta},\cdot\big{\rangle}(\mathbf{x})~{}d(G_{\theta}\#\mathbb{P})(\mathbf{x})=0

Thus, CP𝒫ΘCPΘCP\cap\mathcal{P}_{\Theta}\subseteq CP_{\Theta}.

For the second claim, we first verify that the random feature functions GθG_{\theta} satisfy the smoothness assumptions: For any 𝐚L2(ρ,d)\mathbf{a}\in L^{2}(\rho,\mathbb{R}^{d}) and 𝐱,𝐱k\mathbf{x},\mathbf{x}^{\prime}\in\mathbb{R}^{k},

G𝐚(𝐱)G𝐚(𝐱)\displaystyle\|G_{\mathbf{a}}(\mathbf{x})-G_{\mathbf{a}}(\mathbf{x}^{\prime})\| 𝐚(𝐰,b)|σ(𝐰𝐱+b)σ(𝐰𝐱+b)|𝑑ρ(𝐰,b)\displaystyle\leq\int\|\mathbf{a}(\mathbf{w},b)\|\big{|}\sigma(\mathbf{w}\cdot\mathbf{x}+b)-\sigma(\mathbf{w}\cdot\mathbf{x}^{\prime}+b)\big{|}d\rho(\mathbf{w},b)
𝐚L2(ρ)σLip𝐰2𝑑ρ(𝐰,b)𝐱𝐱\displaystyle\leq\|\mathbf{a}\|_{L^{2}(\rho)}\|\sigma\|_{\text{Lip}}\sqrt{\int\|\mathbf{w}\|^{2}d\rho(\mathbf{w},b)}\|\mathbf{x}-\mathbf{x}^{\prime}\|
C𝐚L2(ρ)𝐱𝐱\displaystyle\leq C\|\mathbf{a}\|_{L^{2}(\rho)}\|\mathbf{x}-\mathbf{x}^{\prime}\|

where C=dC=\sqrt{d} if Assumption 3.1 holds and C=1C=1 if Assumption 3.2 holds. Thus, G𝐚G_{\mathbf{a}} is Lipschitz. Meanwhile, since the parametrization 𝐚G𝐚\mathbf{a}\mapsto G_{\mathbf{a}} is linear

\langle\nabla_{\mathbf{a}}G_{\mathbf{a}}(\mathbf{x}),\mathbf{a}^{\prime}\rangle=G_{\mathbf{a}^{\prime}}(\mathbf{x})

Then, for any 𝐚,𝐚L2(ρ,d)\mathbf{a},\mathbf{a}^{\prime}\in L^{2}(\rho,\mathbb{R}^{d}) and any 𝐱k\mathbf{x}\in\mathbb{R}^{k}

𝐚G𝐚(𝐱),𝐚\displaystyle\|\langle\nabla_{\mathbf{a}}G_{\mathbf{a}}(\mathbf{x}),\mathbf{a}^{\prime}\rangle\| =G𝐚(𝐱)𝐚L2(ρ)(|σ(0)|+𝐰2+|b|2dρ(𝐰,b)𝐱2+1)\displaystyle=\|G_{\mathbf{a}^{\prime}}(\mathbf{x})\|\leq\|\mathbf{a}^{\prime}\|_{L^{2}(\rho)}\Big{(}|\sigma(0)|+\sqrt{\int\|\mathbf{w}\|^{2}+|b|^{2}d\rho(\mathbf{w},b)}\sqrt{\|\mathbf{x}\|^{2}+1}\Big{)}
C𝐚L2(ρ)𝐱2+1\displaystyle\leq C\|\mathbf{a}^{\prime}\|_{L^{2}(\rho)}\sqrt{\|\mathbf{x}\|^{2}+1}
𝐚G𝐚(),𝐚L2()\displaystyle\|\langle\nabla_{\mathbf{a}}G_{\mathbf{a}}(\cdot),\mathbf{a}^{\prime}\rangle\|_{L^{2}(\mathbb{P})} =G𝐚L2()C(1+𝐱2𝑑(𝐱))𝐚L2(ρ)\displaystyle=\|G_{\mathbf{a}^{\prime}}\|_{L^{2}(\mathbb{P})}\leq C\Big{(}1+\sqrt{\int\|\mathbf{x}\|^{2}d\mathbb{P}(\mathbf{x})}\Big{)}\|\mathbf{a}^{\prime}\|_{L^{2}(\rho)}

where C=d+1C=\sqrt{d+1} if Assumption 3.1 holds and C=1C=1 if Assumption 3.2 holds. Thus, G𝐚(𝐱)G_{\mathbf{a}}(\mathbf{x}) is C1C^{1} in 𝐚\mathbf{a} for any 𝐱\mathbf{x}, and 𝐚G𝐚\nabla_{\mathbf{a}}G_{\mathbf{a}} is a continuous operator L2(ρ,d)L2(,d)L^{2}(\rho,\mathbb{R}^{d})\to L^{2}(\mathbb{P},\mathbb{R}^{d}).
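The following sketch spot-checks this Lipschitz estimate on a finite random-feature generator (an assumed finite-sum stand-in for (36)) with \sigma=\tanh; it only illustrates that the empirical Lipschitz ratio stays below the bound derived above, up to sampling of the features.

```python
import numpy as np

# G_a(x) = (1/m) * sum_j a_j * sigma(w_j . x + b_j),   sigma = tanh (Lip = 1).
# Bound from the estimate above (Cauchy-Schwarz over features):
#   ||G_a(x) - G_a(x')|| <= ||a||_{L^2(rho)} * sqrt(E||w||^2) * ||x - x'||.

rng = np.random.default_rng(4)
k, d, m = 4, 3, 5000                          # input dim k, output dim d, features m

w = rng.normal(size=(m, k)) / np.sqrt(k)      # so that E||w||^2 = 1
b = rng.normal(size=m)
a = rng.normal(size=(m, d))

G = lambda x: (np.tanh(w @ x + b) @ a) / m
a_norm = np.sqrt((a ** 2).sum(axis=1).mean())                 # empirical ||a||_{L^2(rho)}
bound_const = a_norm * np.sqrt((w ** 2).sum(axis=1).mean())

ratios = []
for _ in range(200):
    x, xp = rng.normal(size=k), rng.normal(size=k)
    ratios.append(np.linalg.norm(G(x) - G(xp)) / np.linalg.norm(x - xp))
print("max empirical Lipschitz ratio:", max(ratios), "  bound:", bound_const)
```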

Consider any 𝐚L2(ρ,d)\mathbf{a}\in L^{2}(\rho,\mathbb{R}^{d}) such that G𝐚#CPΘG_{\mathbf{a}}\#\mathbb{P}\in CP_{\Theta}. Then, for any 𝐚L2(ρ,d)\mathbf{a}^{\prime}\in L^{2}(\rho,\mathbb{R}^{d}),

0\displaystyle 0 =δLδP(G𝐚(𝐱))𝐚G𝐚,𝐚(𝐱)𝑑(𝐱)\displaystyle=\int\nabla\frac{\delta L}{\delta P}\big{(}G_{\mathbf{a}}(\mathbf{x})\big{)}\cdot\big{\langle}\nabla_{\mathbf{a}}G_{\mathbf{a}},\mathbf{a}^{\prime}\big{\rangle}(\mathbf{x})~{}d\mathbb{P}(\mathbf{x})
=𝐚(𝐰,b)δLδP(G𝐚(𝐱))σ(𝐰𝐱+b)𝑑(𝐱)𝑑ρ(𝐰,b)\displaystyle=\int\mathbf{a}^{\prime}(\mathbf{w},b)\cdot\int\nabla\frac{\delta L}{\delta P}\big{(}G_{\mathbf{a}}(\mathbf{x})\big{)}\sigma(\mathbf{w}\cdot\mathbf{x}+b)~{}d\mathbb{P}(\mathbf{x})d\rho(\mathbf{w},b)

Since this holds for every \mathbf{a}^{\prime}\in L^{2}(\rho,\mathbb{R}^{d}), the inner integral vanishes for \rho-almost every (\mathbf{w},b). Since \sigma is Lipschitz, the inner integral is moreover continuous in (\mathbf{w},b), so for any (\mathbf{w},b)\in\text{sprt}\rho, we have

δLδP(G𝐚(𝐱))σ(𝐰𝐱+b)𝑑(𝐱)=𝟎\int\nabla\frac{\delta L}{\delta P}\big{(}G_{\mathbf{a}}(\mathbf{x})\big{)}\sigma(\mathbf{w}\cdot\mathbf{x}+b)~{}d\mathbb{P}(\mathbf{x})=\mathbf{0}

Thus, if either Assumption 3.1 or 3.2 holds, the equality holds for all (\mathbf{w},b)\in\mathbb{R}^{k+1}. It then follows from the universal approximation theorem [52] that \nabla\delta_{P}L\big(G_{\mathbf{a}}(\mathbf{x})\big)=\mathbf{0} for \mathbb{P} almost all \mathbf{x}, or equivalently, \nabla\delta_{P}L(\mathbf{x})=\mathbf{0} for G_{\mathbf{a}}\#\mathbb{P} almost all \mathbf{x}. Hence, the reverse inclusion CP_{\Theta}\subseteq CP\cap\mathcal{P}_{\Theta} holds. ∎

Proof of Proposition 7.5.

For any GL2(,d)G\in L^{2}(\mathbb{P},\mathbb{R}^{d}), let π\pi be the optimal transport plan between G#G\#\mathbb{P} and PP_{*}. Define the function

m(x)=x𝑑π(x|x)m(x)=\int x^{\prime}~{}d\pi\big{(}x^{\prime}\big{|}x\big{)} (49)

Then, G is a stationary point if and only if G=m\circ G. If G is a stationary point and G\#\mathbb{P} is absolutely continuous, then \pi is concentrated on the graph of \nabla\phi for some convex function \phi by Brenier's theorem [117]. Then, m=\nabla\phi, and G\#\mathbb{P}=(m\circ G)\#\mathbb{P}=\nabla\phi\#(G\#\mathbb{P})=P_{*}, so G is a global minimum.

It follows that if G is a stationary point but not a global minimum, then P=G\#\mathbb{P} must be singular. Moreover, P must contain an atom: if P had no atoms, its cumulative distribution function would be continuous, so the optimal plan \pi between P and P_{*} would be induced by the deterministic monotone map m, and the stationarity condition G=m\circ G would give P=m\#P=P_{*}, contradicting the singularity of P. Hence, there exists x_{*}\in\text{sprt}P such that P(\{x_{*}\})>0. Let S_{+},S_{-} be a disjoint partition of G^{-1}(\{x_{*}\}) such that \mathbb{P}(S_{+})=\mathbb{P}(S_{-})=P(\{x_{*}\})/2. Since P_{*} is absolutely continuous, Brenier's theorem implies that the transport plan \pi(x_{0},x_{1}) has the conditional distribution \pi(\cdot|x_{1})=\delta_{\nabla\psi(x_{1})} for some convex potential \psi. Since \nabla\psi is non-decreasing, there exists an interval [x_{-},x_{+}] such that \nabla\psi maps [x_{-},x_{+}] to \{x_{*}\} and P_{*}([x_{-},x_{+}])=P(\{x_{*}\}). Choose x_{o}\in(x_{-},x_{+}) such that P_{*}([x_{-},x_{o}])=P_{*}((x_{o},x_{+}])=P(\{x_{*}\})/2. Define the means

m=2P({x})xxox𝑑P(x),m+=2P({x})xox+x𝑑P(x)m_{-}=\frac{2}{P(\{x_{*}\})}\int_{x_{-}}^{x_{o}}x~{}dP_{*}(x),\quad m_{+}=\frac{2}{P(\{x_{*}\})}\int_{x_{o}}^{x_{+}}x~{}dP_{*}(x)

Then, x<m<xo<m+<x+x_{-}<m_{-}<x_{o}<m_{+}<x_{+} and x=(m+m+)/2x_{*}=(m_{-}+m_{+})/2. For any ϵ\epsilon, define the function

h=\epsilon\big((m_{+}-x_{*})\mathbf{1}_{S_{+}}+(m_{-}-x_{*})\mathbf{1}_{S_{-}}\big)

and the map

xsprtP,F(x)={(1ϵ)x+ϵm if x[x,xo](1ϵ)x+ϵm+ if x(xo,x+]ψ(x) else\forall x\in\text{sprt}P_{*},\quad F(x)=\begin{cases}(1-\epsilon)x_{*}+\epsilon m_{-}\text{ if }x\in[x_{-},x_{o}]\\ (1-\epsilon)x_{*}+\epsilon m_{+}\text{ if }x\in(x_{o},x_{+}]\\ \nabla\psi(x)\text{ else}\end{cases}

Then, for any ϵ(0,1)\epsilon\in(0,1),

L(G)L(G+h)\displaystyle\quad L(G)-L(G+h)
12|xψ(x)|2|xF(x)|2dP(x)\displaystyle\geq\frac{1}{2}\int|x-\nabla\psi(x)|^{2}-|x-F(x)|^{2}dP_{*}(x)
=12xxo|xx|2|x(1ϵ)xϵm|2dP(x)\displaystyle=\frac{1}{2}\int_{x_{-}}^{x_{o}}|x-x_{*}|^{2}-|x-(1-\epsilon)x_{*}-\epsilon m_{-}|^{2}dP_{*}(x)
+12xox+|xx|2|x(1ϵ)xϵm+|2dP(x)\displaystyle\quad+\frac{1}{2}\int_{x_{o}}^{x_{+}}|x-x_{*}|^{2}-|x-(1-\epsilon)x_{*}-\epsilon m_{+}|^{2}dP_{*}(x)
\displaystyle\geq\frac{\epsilon^{2}}{2}\int_{x_{-}}^{x_{o}}|x-x_{*}|^{2}-|x-m_{-}|^{2}dP_{*}(x)+\frac{\epsilon^{2}}{2}\int_{x_{o}}^{x_{+}}|x-x_{*}|^{2}-|x-m_{+}|^{2}dP_{*}(x)
>0\displaystyle>0

Thus, GG is a generalized saddle point.

For any initialization G0L2()G_{0}\in L^{2}(\mathbb{P}), let GtG_{t} be a trajectory defined by the dynamics (46). Let mtm_{t} be the function (49) defined by the optimal transport plan πt\pi_{t} between Pt=Gt#P_{t}=G_{t}\#\mathbb{P} and PP_{*}. Then, the dynamics (46) can be written as

ddtGt=mtGtGt\frac{d}{dt}G_{t}=m_{t}\circ G_{t}-G_{t}

By cyclic monotonicity [117], the coupling πt\pi_{t} must be monotone: for any (x0,x1),(x0,x1)sprtπt(x_{0},x_{1}),(x_{0}^{\prime},x_{1}^{\prime})\in\text{sprt}\pi_{t},

(x0x0)(x1x1)0(x_{0}-x_{0}^{\prime})(x_{1}-x_{1}^{\prime})\geq 0

Thus, the trajectory GtG_{t} is order-preserving

G_{0}(z)<G_{0}(z^{\prime})\implies G_{t}(z)<G_{t}(z^{\prime})

It follows that the function

F(z):=P_{t}((-\infty,G_{t}(z)])=\int\mathbf{1}_{G_{t}(z^{\prime})\leq G_{t}(z)}d\mathbb{P}(z^{\prime})=\int\mathbf{1}_{G_{0}(z^{\prime})\leq G_{0}(z)}d\mathbb{P}(z^{\prime})

is constant in tt.

Since P_{*} is absolutely continuous, its cumulative distribution function F_{*}(x)=P_{*}((-\infty,x]) is a continuous, non-decreasing function from \mathbb{R} to [0,1]. For any p\in[0,1], define its generalized inverse by

F_{*}^{-1}(p)=\inf\{x\in\mathbb{R}~|~F_{*}(x)\geq p\}

It follows that for any zz, the support of πt(|Gt(z))\pi_{t}(\cdot|G_{t}(z)) must lie in the closed interval

I(z)\displaystyle I(z) =[F1(Pt((,Gt(z)))),F1(Pt((,Gt(z)]))]\displaystyle=\Big{[}F_{*}^{-1}\big{(}P_{t}((-\infty,G_{t}(z)))\big{)},~{}F_{*}^{-1}\big{(}P_{t}((-\infty,G_{t}(z)])\big{)}\Big{]}
=[F1(limϵ0+F(zϵ)),F1(F(z))]\displaystyle=\Big{[}F_{*}^{-1}\big{(}\lim_{\epsilon\to 0^{+}}F(z-\epsilon)\big{)},~{}F_{*}^{-1}(F(z))\Big{]}

It follows that for any z\in\text{sprt}\mathbb{P}, the conditional distribution \pi_{t}(\cdot|G_{t}(z)) is exactly P_{*} conditioned on the subset I(z). Hence, the function

mtGt(z)={I(z)x𝑑P(x)P(I(z)) if |I(z)|>0F1(F(z)) elsem_{t}\circ G_{t}(z)=\begin{cases}\frac{\int_{I(z)}xdP_{*}(x)}{P_{*}(I(z))}\text{ if }|I(z)|>0\\ F_{*}^{-1}(F(z))\text{ else}\end{cases}

is constant in tt. It follows that the trajectory GtG_{t} satisfies

ddtGt=m0G0Gt\frac{d}{dt}G_{t}=m_{0}\circ G_{0}-G_{t}

and thus

Gt=etG0+(1et)m0G0G_{t}=e^{-t}G_{0}+(1-e^{-t})m_{0}\circ G_{0}

It follows that P_{t} converges to m_{0}\#P_{0}. If P_{0}\in\mathcal{P}_{ac}, then as discussed above, m_{0} is exactly the optimal transport map from P_{0} to P_{*}, and thus we have global convergence. Else, there exists some x_{0} such that P_{0}(\{x_{0}\})>0, so (m_{0}\#P_{0})(\{m_{0}(x_{0})\})\geq P_{0}(\{x_{0}\})>0, and thus m_{0}\#P_{0} is not absolutely continuous. Since P_{*}\in\mathcal{P}_{ac}, we have m_{0}\#P_{0}\neq P_{*}. ∎

References

  • [1] Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. A learning algorithm for boltzmann machines. Cognitive science 9, 1 (1985), 147–169.
  • [2] Albergo, M. S., and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571 (2022).
  • [3] Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [4] An, D., Guo, Y., Lei, N., Luo, Z., Yau, S.-T., and Gu, X. AE-OT: a new generative model based on extended semi-discrete optimal transport. ICLR 2020 (2019).
  • [5] An, D., Guo, Y., Zhang, M., Qi, X., Lei, N., and Gu, X. AE-OT-GAN: Training gans from data specific latent distribution. In European Conference on Computer Vision (2020), Springer, pp. 548–564.
  • [6] Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications 12, 3 (1982), 313–326.
  • [7] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
  • [8] Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017).
  • [9] Arora, S., Risteski, A., and Zhang, Y. Do GANs learn the distribution? Some theory and empirics. In International Conference on Learning Representations (2018).
  • [10] Balaji, Y., Sajedi, M., Kalibhat, N. M., Ding, M., Stöger, D., Soltanolkotabi, M., and Feizi, S. Understanding overparameterization in generative adversarial networks. arXiv preprint arXiv:2104.05605 (2021).
  • [11] Barron, A. R., and Sheu, C.-H. Approximation of density functions by sequences of exponential families. The Annals of Statistics (1991), 1347–1369.
  • [12] Bengio, Y., Deleu, T., Hu, E. J., Lahlou, S., Tiwari, M., and Bengio, E. GFlowNet foundations. arXiv preprint arXiv:2111.09266 (2021).
  • [13] Bonati, L., Zhang, Y.-Y., and Parrinello, M. Neural networks-based variationally enhanced sampling. In Proceedings of the National Academy of Sciences (2019), vol. 116, pp. 17641–17647.
  • [14] Brenier, Y. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics 44, 4 (1991), 375–417.
  • [15] Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018).
  • [16] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.
  • [17] Cao, Y., and Vanden-Eijnden, E. Learning optimal flows for non-equilibrium importance sampling. arXiv preprint arXiv:2206.09908 (2022).
  • [18] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 9650–9660.
  • [19] Chavdarova, T., and Fleuret, F. SGAN: An alternative training of generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 9407–9415.
  • [20] Che, T., Li, Y., Jacob, A., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136 (2016).
  • [21] Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in neural information processing systems (2018), pp. 6571–6583.
  • [22] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning (2020), PMLR, pp. 1597–1607.
  • [23] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems 29 (2016).
  • [24] Chizat, L., and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems (2018), pp. 3036–3046.
  • [25] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
  • [26] Cucker, F., and Smale, S. On the mathematical foundations of learning. Bulletin of the American mathematical society 39, 1 (2002), 1–49.
  • [27] Denton, E., Chintala, S., Arthur, S., and Fergus, R. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in neural information processing systems (2015), pp. 1486–1494.
  • [28] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • [29] Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. arXiv preprint arXiv:1605.08803 (2016).
  • [30] E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5, 1 (2017), 1–11.
  • [31] E, W., Ma, C., and Wang, Q. A priori estimates of the population risk for residual networks. arXiv preprint arXiv:1903.02154 1, 7 (2019).
  • [32] E, W., Ma, C., Wojtowytsch, S., and Wu, L. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t, 2020.
  • [33] E, W., Ma, C., and Wu, L. A priori estimates for two-layer neural networks. arXiv preprint arXiv:1810.06397 (2018).
  • [34] E, W., Ma, C., and Wu, L. Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039 (2019).
  • [35] E, W., Ma, C., and Wu, L. Machine learning from a continuous viewpoint. arXiv preprint arXiv:1912.12777 (2019).
  • [36] E, W., Ma, C., and Wu, L. On the generalization properties of minimum-norm solutions for over-parameterized neural network models. arXiv preprint arXiv:1912.06987 (2019).
  • [37] E, W., and Wojtowytsch, S. Kolmogorov width decay and poor approximators in machine learning: Shallow neural networks, random feature models and neural tangent kernels. arXiv preprint arXiv:2005.10807 (2020).
  • [38] Feizi, S., Farnia, F., Ginart, T., and Tse, D. Understanding GANs in the LQG setting: Formulation, generalization and stability. IEEE Journal on Selected Areas in Information Theory 1, 1 (2020), 304–311.
  • [39] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems (2014), pp. 2672–2680.
  • [40] Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367 (2018).
  • [41] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research 13, 1 (2012), 723–773.
  • [42] Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
  • [43] Gui, J., Sun, Z., Wen, Y., Tao, D., and Ye, J. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data Engineering (2021).
  • [44] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In Advances in neural information processing systems (2017).
  • [45] Gulrajani, I., Raffel, C., and Metz, L. Towards GAN benchmarks which require generalization. arXiv preprint arXiv:2001.03653 (2020).
  • [46] Han, J., Hu, R., and Long, J. A class of dimensionality-free metrics for the convergence of empirical measures. arXiv preprint arXiv:2104.12036 (2021).
  • [47] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems (2017), pp. 6626–6637.
  • [48] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (2017).
  • [49] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
  • [50] Hoang, Q., Nguyen, T. D., Le, T., and Phung, D. MGAN: Training generative adversarial nets with multiple generators. In International conference on learning representations (2018).
  • [51] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 2 (1991), 251–257.
  • [52] Hornik, K., Stinchcombe, M., and White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks 3, 5 (1990), 551–560.
  • [53] Hou, X., Shen, L., Sun, K., and Qiu, G. Deep feature consistent variational autoencoder. In 2017 IEEE winter conference on applications of computer vision (WACV) (2017), IEEE, pp. 1133–1141.
  • [54] Hu, Z., Yang, Z., Salakhutdinov, R., and Xing, E. P. On unifying deep generative models. arXiv preprint arXiv:1706.00550 (2017).
  • [55] Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. arXiv preprint arXiv:1804.00779 (2018).
  • [56] Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 4 (2005).
  • [57] Ivanov, A. The theory of approximate methods and their applications to the numerical solution of singular integral equations, vol. 2. Springer Science & Business Media, 1976.
  • [58] Jiang, Y., Chang, S., and Wang, Z. TransGAN: Two pure transformers can make one strong GAN, and that can scale up. Advances in Neural Information Processing Systems 34 (2021), 14745–14758.
  • [59] Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (2016), Springer, pp. 694–711.
  • [60] Kantorovich, L., and Rubinstein, G. S. On a space of totally additive functions. Vestnik Leningrad. Univ 13 (1958), 52–59.
  • [61] Kantorovich, L. V. Mathematical methods of organizing and planning production. Management science 6, 4 (1960), 366–422.
  • [62] Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), pp. 4401–4410.
  • [63] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems (2016), pp. 4743–4751.
  • [64] Kingma, D. P., and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
  • [65] Kobyzev, I., Prince, S. J., and Brubaker, M. A. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence 43, 11 (2020), 3964–3979.
  • [66] Kodali, N., Abernethy, J., Hays, J., and Kira, Z. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215 (2017).
  • [67] Kontorovich, A. Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning (2014), PMLR, pp. 28–36.
  • [68] Larochelle, H., and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), pp. 29–37.
  • [69] Lei, Q., Lee, J. D., Dimakis, A. G., and Daskalakis, C. SGD learns one-layer networks in WGANs, 2020.
  • [70] Li, S.-H., and Wang, L. Neural network renormalization group. Physical review letters 121, 26 (2018), 260601.
  • [71] Li, Y., and Dou, Z. Making method of moments great again?–how can GANs learn distributions. arXiv preprint arXiv:2003.04033 (2020).
  • [72] Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In International Conference on Machine Learning (2015), pp. 1718–1727.
  • [73] Lin, T., Jin, C., and Jordan, M. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning (2020), PMLR, pp. 6083–6093.
  • [74] Ling, H., Kreis, K., Li, D., Kim, S. W., Torralba, A., and Fidler, S. EditGAN: High-precision semantic image editing. Advances in Neural Information Processing Systems 34 (2021), 16331–16345.
  • [75] Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022).
  • [76] Ma, C., Wu, L., and E, W. The slow deterioration of the generalization error of the random feature model. In Mathematical and Scientific Machine Learning (2020), PMLR, pp. 373–389.
  • [77] Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., and Yang, M.-H. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 1429–1437.
  • [78] Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Smolley, S. P. On the effectiveness of least squares generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41, 12 (2018), 2947–2960.
  • [79] Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International conference on machine learning (2018), PMLR, pp. 3481–3490.
  • [80] Mirza, M., and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
  • [81] Nagarajan, V., and Kolter, J. Z. Gradient descent GAN optimization is locally stable. arXiv preprint arXiv:1706.04156 (2017).
  • [82] Noé, F., Olsson, S., Köhler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 365, 6457 (2019), eaaw1147.
  • [83] Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems (2016), pp. 271–279.
  • [84] Oneto, L., Ridella, S., and Anguita, D. Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Machine Learning 103, 1 (2016), 103–136.
  • [85] Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L., Stimberg, F., et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In International conference on machine learning (2018), pp. 3918–3926.
  • [86] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
  • [87] Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016).
  • [88] Ororbia, A., and Kifer, D. The neural coding framework for learning generative models. Nature communications 13, 1 (2022), 1–14.
  • [89] Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems (2017), pp. 2338–2347.
  • [90] Pei, S., Da Xu, R. Y., Xiang, S., and Meng, G. Alleviating mode collapse in GAN via diversity penalty module. arXiv preprint arXiv:2108.02353 (2021).
  • [91] Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
  • [92] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training (2018).
  • [93] Rahimi, A., and Recht, B. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing (2008), IEEE, pp. 555–561.
  • [94] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning (2021), PMLR, pp. 8821–8831.
  • [95] Rezende, D. J., and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770 (2015).
  • [96] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 10684–10695.
  • [97] Rotskoff, G. M., and Vanden-Eijnden, E. Trainability and accuracy of neural networks: An interacting particle system approach. stat 1050 (2019), 30.
  • [98] Rout, L., Korotin, A., and Burnaev, E. Generative modeling with optimal transport maps. arXiv preprint arXiv:2110.02999 (2021).
  • [99] Salehinejad, H., Sankar, S., Barfett, J., Colak, E., and Valaee, S. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078 (2017).
  • [100] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in neural information processing systems (2016), pp. 2234–2242.
  • [101] Santambrogio, F. Optimal transport for applied mathematicians. Birkhäuser, New York, 2015.
  • [102] Schmidhuber, J. Deep learning in neural networks: An overview. Neural networks 61 (2015), 85–117.
  • [103] Singh, S., and Póczos, B. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855 (2018).
  • [104] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (2015), PMLR, pp. 2256–2265.
  • [105] Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems 34 (2021), 1415–1428.
  • [106] Song, Y., and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019).
  • [107] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
  • [108] Su, J. Variational inference: A unified framework of generative models and some revelations. arXiv preprint arXiv:1807.05936 (2018).
  • [109] Sun, Y., Gilbert, A., and Tewari, A. On the approximation properties of random ReLU features. arXiv preprint arXiv:1810.04374 (2018).
  • [110] Tabak, E. G., and Trigila, G. Explanation of variability and removal of confounding factors from data through optimal transport. Communications on Pure and Applied Mathematics 71, 1 (2018), 163–199.
  • [111] Tabak, E. G., and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66, 2 (2013), 145–164.
  • [112] Tabak, E. G., and Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences 8, 1 (2010), 217–233.
  • [113] Teschl, G. Ordinary Differential Equations and Dynamical Systems. AMS, 2012, ch. 3.4, pp. 83–84.
  • [114] Tikhonov, A. N., and Arsenin, V. Y. Solutions of ill-posed problems. V. H. Winston & Sons, Washington, D.C.: John Wiley & Sons, New York, 1977.
  • [115] Valsson, O., and Parrinello, M. Variational approach to enhanced sampling and free energy calculations. Physical review letters 113 (2014).
  • [116] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • [117] Villani, C. Topics in optimal transportation. No. 58. American Mathematical Soc., 2003.
  • [118] Wang, D., Wang, Y., Chang, J., Zhang, L., Wang, H., et al. Efficient sampling of high-dimensional free energy landscapes using adaptive reinforced dynamics. Nature Computational Science 2, 1 (2022), 20–29.
  • [119] Wang, Z., and Scott, D. W. Nonparametric density estimation for high-dimensional data—algorithms and applications. Wiley Interdisciplinary Reviews: Computational Statistics 11, 4 (2019), e1461.
  • [120] Weed, J., and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25, 4A (2019), 2620–2648.
  • [121] Wu, S., Dimakis, A. G., and Sanghavi, S. Learning distributions generated by one-layer ReLU networks. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8107–8117.
  • [122] Xu, K., Li, C., Zhu, J., and Zhang, B. Understanding and stabilizing GANs’ training dynamics using control theory. In International Conference on Machine Learning (2020), PMLR, pp. 10566–10575.
  • [123] Xu, Z.-Q. J., Zhang, Y., and Luo, T. Overview frequency principle/spectral bias in deep learning. arXiv preprint arXiv:2201.07395 (2022).
  • [124] Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., and Ma, Z. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523 (2019).
  • [125] Yang, H. Generalization error of normalizing flows with stochastic interpolants. arXiv preprint (to appear) (2022).
  • [126] Yang, H., and E, W. Generalization and memorization: The bias potential model. In Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference (16–19 Aug 2022), J. Bruna, J. Hesthaven, and L. Zdeborova, Eds., vol. 145 of Proceedings of Machine Learning Research, PMLR, pp. 1013–1043.
  • [127] Yang, H., and E, W. Generalization error of GAN from the discriminator’s perspective. Research in the Mathematical Sciences 9, 1 (2022), 1–31.
  • [128] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Shao, Y., Zhang, W., Cui, B., and Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. arXiv preprint arXiv:2209.00796 (2022).
  • [129] Yazici, Y., Foo, C.-S., Winkler, S., Yap, K.-H., and Chandrasekhar, V. Empirical analysis of overfitting and mode drop in GAN training. In 2020 IEEE International Conference on Image Processing (ICIP) (2020), IEEE, pp. 1651–1655.
  • [130] Zhang, D., Chen, R. T., Malkin, N., and Bengio, Y. Unifying generative models with GFlowNets. arXiv preprint arXiv:2209.02606 (2022).
  • [131] Zhang, L., E, W., and Wang, L. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188 (2018).
  • [132] Zhang, O., Lin, R.-S., and Gou, Y. Optimal transport based generative autoencoders. arXiv preprint arXiv:1910.07636 (2019).
  • [133] Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771 (2017).
  • [134] Zhao, J., Mathieu, M., and LeCun, Y. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016).