
Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples

Yixing Zhang Department of Electrical and Computer Engineering, Duke University    Xiuyuan Cheng Department of Mathematics, Duke University    Galen Reeves Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University
Abstract

The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d+2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space.

A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.

1 Introduction

There has been significant interest in optimal transport (OT) distances for data analysis, motivated by applications in statistics and machine learning ranging from computer graphics and image processing Solomon et al. (2014); Ryu et al. (2018); Li et al. (2018) to deep learning Courty et al. (2016); Shen et al. (2017); Bhushan Damodaran et al. (2018); see Peyré and Cuturi (2019). The OT cost between probability measures $P$ and $Q$ with cost function $c(x,y)$ is defined as

\[
\mathcal{T}(P,Q)\coloneqq\inf_{\pi\in\Pi(P,Q)}\int c(x,y)\,d\pi(x,y), \tag{1}
\]

where $\Pi(P,Q)$ is the set of all probability measures whose marginals are $P$ and $Q$. Of central importance to applications in statistics and machine learning is the rate at which the empirical measure $P_n$ of an iid sample approximates the true underlying distribution $P$. In this regard, one of the main challenges for OT distances is that the rate of convergence suffers from the curse of dimensionality: the number of samples $n$ needs to grow exponentially with the dimension $d$ of the data Fournier and Guillin (2015).

On a closely related note, OT also suffers from computational issues, particularly in high-dimensional settings. To address both statistical and computational limitations, recent work has focused on regularized versions of OT, including entropy regularization Cuturi (2013) and Gaussian-smoothed optimal transport (GOT) Goldfeld et al. (2020b). Entropy-regularized OT has attracted intensive theoretical interest Feydy et al. (2019); Klatt et al. (2020); Bigot et al. (2019), as well as an abundance of algorithmic developments Gerber and Maggioni (2017); Abid and Gower (2018); Chakrabarty and Khanna (2020). In comparison, GOT is less understood both in theory and in computation. The goal of the current paper is thus to deepen the theoretical analysis of GOT under more general settings, so as to lay a theoretical foundation for computational study and potential applications.

In particular, we consider distributions that satisfy only a bounded moment condition and general settings involving dependent samples. For the special case of sub-gamma distributions, we show a phase transition depending on the dimension $d$ and with respect to the scale parameter of the sub-gamma distribution. Going beyond the case of iid samples, our convergence rate covers dependent samples as long as a condition on the pairwise dependence quantified by the covariance of the kernel-space feature map is satisfied. A key step in our analysis is to establish a novel connection between the GOT distance and a family of kernel MMD distances, which can be of independent interest. In the kernel MMD upper bound, the kernel is neither bounded nor translation invariant, and is determined by both the cost function of OT and the amount of Gaussian smoothing. The theoretical findings are supported by numerical experiments.

To summarize our contribution, we provide an overview of the main theoretical results in the next subsection, and then close the introduction with a detailed review of related work. After introducing notation and needed preliminaries in Section 2, we derive upper bounds on GOT using the kernel MMD of a new two-moment kernel in Section 3, which leads to the convergence rate results in Section 4 and numerical results in Section 5. All proofs are in the appendix.

1.1 Overview of Main Results

In this paper, we focus on the OT cost $\mathcal{T}_p(P,Q)$ associated with the cost function $c_p(x,y)=\|x-y\|^p$ for $p>0$ and $c_0(x,y)=1_{\{x\neq y\}}$. The total variation distance is given by $\mathcal{T}_0(P,Q)$ and the $p$-Wasserstein distance is given by $\mathcal{T}_p(P,Q)$ if $p\in(0,1]$ and $(\mathcal{T}_p(P,Q))^{1/p}$ if $p>1$ Villani (2003).

The minimax convergence rate of $\mathcal{T}_p(P,P_n)$ was established by (Fournier and Guillin, 2015, Theorem 1), who showed that if $P$ has a moment strictly greater than $2p$, then

\[
\mathbb{E}\left[\mathcal{T}_p(P,P_n)\right]\asymp\begin{cases} n^{-\frac{1}{2}}, & p>d/2\\ n^{-\frac{1}{2}}\log n, & p=d/2\\ n^{-\frac{p}{d}}, & p\in(0,d/2).\end{cases} \tag{2}
\]

Unfortunately, this means that the sample complexity increases exponentially with the dimension for $d>2p$.

Recently, Goldfeld et al. (2020b) showed that one way to overcome the curse of dimensionality is to consider the Gaussian-smoothed OT distance, defined as

\[
\mathcal{T}_p^{(\sigma)}(P,Q)\coloneqq\mathcal{T}_p(P\ast\mathcal{N}_\sigma,\,Q\ast\mathcal{N}_\sigma),
\]

where $\mathcal{N}_\sigma$ denotes the iid Gaussian measure with mean zero and variance $\sigma^2$. Under the assumption that $P$ is sub-Gaussian with constant $v$, they proved an upper bound on the convergence rate that is independent of the dimension:

\[
\mathbb{E}\left[\mathcal{T}^{(\sigma)}_p(P,P_n)\right]\leq\frac{C_{d,p,\sigma,v}}{\sqrt{n}},\quad p\in\{0,1,2\}.
\]

The precise form of the constant $C_{d,p,\sigma,v}$ is provided for $p\in\{0,1\}$ but not for the case $p=2$, unless $P$ is also assumed to have bounded support. Ensuing work by Goldfeld et al. (2020a) established the same convergence rate for $p=1$ under the relaxed assumption that $P$ has a finite moment greater than $2d+2$.

Metric properties of GOT were studied by Goldfeld and Greenewald (2020), who showed that $\mathcal{T}_1^{(\sigma)}(P,Q)$ is a metric on the space of probability measures with finite first moment and that the sequence of optimal couplings converges in the $\sigma\to 0$ limit to the optimal coupling for the unsmoothed Wasserstein distance. Their arguments depend only on the pointwise convergence of the characteristic functions under Gaussian smoothing, and thus also apply to the case of general $p$ considered in this paper.

One of the main contributions of this paper is to prove an upper bound on the convergence rate for all orders of $p$ and under more general assumptions on $P$. Specifically, we prove the following result:

Theorem 1.

Let $P_n$ be the empirical measure of $n$ iid samples from a probability measure $P$ on $\mathbb{R}^d$ that satisfies $(\int\|x\|^s\,dP(x))^{1/s}\leq m$ for some $s>d+2p$. There exists a positive constant $C_{d,p,s}$ such that for all $\sigma>0$,

\[
\mathbb{E}\left[\mathcal{T}^{(\sigma)}_p(P,P_n)\right]\leq C_{d,p,s}\,\frac{\sigma^p}{\sqrt{n}}\left(1+\frac{m}{\sigma}\right)^{\frac{d}{2}+p}. \tag{3}
\]

This result brings the GOT framework in line with the general setting studied by (Fournier and Guillin, 2015, Theorem 1), and shows that the benefits obtained by smoothing extend beyond the special cases of small $p$ and well-controlled tails. To help interpret this result it is important to keep in mind that for $p>1$, the Wasserstein distance is given by the $p$-th root of the GOT. As for the tightness of the bound, there are two regimes worth considering, namely when $\sigma\to 0$ as $n\to\infty$ and when $\sigma$ is fixed. In the former case, the dependence on $\sigma$ seems to be nearly tight. In Section 4.1, we show that if $\sigma\asymp n^{-\frac{1}{d+2p}}$ then Theorem 1 implies an upper bound on the unsmoothed convergence rate

\[
\mathbb{E}\left[\mathcal{T}_p(P,P_n)\right]\leq C_{d,p,s}\,m^p\,n^{-\frac{p}{d+2p}}. \tag{4}
\]

Notice that for $d\ll 2p$ and $d\gg 2p$ this recovers the minimax convergence rate given in (2).

The main technical step in our approach is to establish a novel connection between GOT and a family of kernel MMD distances (Theorem 2). We then show how a particular member of this family, which we call the ‘two-moment’ kernel, defines a metric on the space of probability measures with finite moments strictly greater than $p+d/2$ (Theorem 3).

In addition to Theorem 1, we also provide further results that elucidate the role of the dimension as well as the tail behavior of the underlying distribution (Theorem 6). Furthermore, we address the setting of dependent samples and provide an example illustrating how the connection with MMD can be used to go beyond the usual assumptions involving mixing conditions for stationary processes. Finally, we provide some numerical experiments that support our theory.

1.2 Comparison with Previous Work

The convergence of OT distances continues to be an active area of research Singh and Póczos (2018); Niles-Weed and Berthet (2019); Lei et al. (2020). Building upon the work of Cuturi (2013), a recent line of work has focused on entropy-regularized OT defined by

\[
S_\epsilon(P,Q)\coloneqq\inf_{\pi\in\Pi(P,Q)}\int c(x,y)\,d\pi(x,y)+\epsilon D(\pi\|P\otimes Q)
\]

where $D(\mu\|\nu)=\int\log\frac{d\mu}{d\nu}\,d\mu$ is the relative entropy between probability measures $\mu$ and $\nu$. The addition of the regularization term facilitates numerical approximation using the Sinkhorn algorithm. The amount of regularization interpolates between OT in the $\epsilon\to 0$ limit and the kernel MMD in the $\epsilon\to\infty$ limit; see Feydy et al. (2019). In contrast to the Gaussian-smoothed Wasserstein distance, entropy-regularized OT is not a metric since it does not satisfy the triangle inequality. Convergence rates for entropy-regularized OT were obtained by Genevay et al. (2019) under the assumption of bounded support and more recently by Mena and Niles-Weed (2019) under the assumption of sub-Gaussian tails. Further properties have been studied by Luise et al. (2018) and Klatt et al. (2020).

There has also been work focusing on the sliced Wasserstein distance, which is obtained by averaging the one-dimensional Wasserstein distance over the unit sphere Rabin et al. (2011); Bonneel et al. (2015). While the sliced Wasserstein distance is equivalent to the Wasserstein distance in the sense that convergence in one metric implies convergence in the other, the rates of convergence need not be the same. See Section 1.2 of Goldfeld et al. (2020a) for further discussion.

Going beyond convergence rates for empirical measures, properties of smoothed empirical measures have been studied in a variety of contexts, including the high noise limit Chen and Niles-Weed (2020) and applications to the estimation of mutual information in deep networks Goldfeld et al. (2018). Finally, we note that there has also been some work on convergence with dependent samples by Fournier and Guillin (2015), who focus on OT distance, and also by Young and Dunson (2019), who consider a closely related entropy estimation problem.

2 Preliminaries

Let $\mathcal{P}(\mathbb{R}^d)$ be the space of Borel probability measures on $\mathbb{R}^d$ and let $\mathcal{P}_s(\mathbb{R}^d)$ be the space of probability measures with finite $s$-th moment, i.e., $\int\|x\|^s\,dP(x)<\infty$. The Gaussian measure on $\mathbb{R}^d$ with mean zero and covariance $\sigma^2 I_d$ is denoted by $\mathcal{N}_\sigma$. The convolution of probability measures $P$ and $Q$ is denoted by $P\ast Q$. The Gamma function is given by $\Gamma(z)=\int_0^\infty x^{z-1}e^{-x}\,dx$ for $z>0$. We use $C$ to denote a generic positive real number, and the value of $C$ may change from place to place.

Kernel MMD.

A symmetric function $k\colon\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}$ is said to be a positive-definite kernel on $\mathbb{R}^d$ if and only if for every $x_1,\dots,x_n\in\mathbb{R}^d$, the symmetric matrix $(k(x_i,x_j))_{i,j=1}^n$ is positive semidefinite. A positive-definite kernel $k$ can be used to define a distance on probability measures known as the RKHS MMD Anderson et al. (1994); Gretton et al. (2005); Smola et al. (2007); Gretton et al. (2012). Let $\mathcal{P}^k(\mathbb{R}^d)$ be the space of probability measures such that $\int\sqrt{k(x,x)}\,dP(x)<\infty$. For $P,Q\in\mathcal{P}^k(\mathbb{R}^d)$ the kernel MMD distance is defined as

\[
\gamma^2_k(P,Q)=\iint k(x,y)\,d(P(x)-Q(x))\,d(P(y)-Q(y)).
\]

The distance $\gamma_k(P,Q)$ is a pseudo-metric in general. A kernel $k$ is said to be characteristic to a set $\mathcal{Q}\subseteq\mathcal{P}$ of probability measures if and only if $\gamma_k$ is a metric on $\mathcal{Q}$ Gretton et al. (2012); Sriperumbudur et al. (2010). An alternative representation is given by

\[
\gamma^2_k(P,Q)=\mathbb{E}\left[k(X,X')\right]+\mathbb{E}\left[k(Y,Y')\right]-2\,\mathbb{E}\left[k(X,Y)\right], \tag{5}
\]

where $X,X'\sim P$ iid and $Y,Y'\sim Q$ iid. The kernel MMD distance was shown to be equivalent to the energy distance in Sejdinovic et al. (2013), and a variant form used in practice is the kernel mean embedding statistic Muandet et al. (2017); Chwialkowski et al. (2015); Jitkrittum et al. (2016). Another appealing theoretical property of the kernel MMD distance is its representation via the spectral decomposition of the kernel Epps and Singleton (1986); Fernández et al. (2008), which gives rise to estimation consistency as well as practical algorithms for computing it Zhao and Meng (2015). Kernel MMD has been widely applied in data analysis and machine learning, including independence testing Fukumizu et al. (2008); Zhang et al. (2012) and generative modeling Li et al. (2015, 2017).
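To make the representation (5) concrete, the following is a minimal sketch of the plug-in estimator of $\gamma_k^2(P,Q)$ from two finite samples. The Gaussian RBF kernel, bandwidth, and sample sizes below are illustrative assumptions, not the two-moment kernel introduced in Section 3.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_squared(X, Y, kernel=gaussian_kernel):
    """Plug-in estimate of gamma_k^2(P, Q) from samples X ~ P and Y ~ Q, cf. (5)."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

# Example: two Gaussian samples with different means.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 3))
Y = rng.normal(0.5, 1.0, size=(500, 3))
print(mmd_squared(X, Y))
```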

Magnitude of multivariate Gaussian.

Our results also depend on some properties of the noncentral chi distribution. Let $Z\sim\mathcal{N}(\mu,I_d)$ be a Gaussian vector with mean $\mu$ and identity covariance. The random variable $X=\|Z\|$ has a noncentral chi distribution with parameter $u=\|\mu\|$. The density is given by

\[
g_{d,u}(x)=\frac{e^{-(x^2+u^2)/2}\,x^{d-1}}{2^{\frac{d}{2}-1}}\sum_{k=0}^{\infty}\frac{(ux/2)^{2k}}{k!\,\Gamma\left(\frac{d+2k}{2}\right)}. \tag{6}
\]

The $s$-th moment of this distribution is denoted by $M_{d,u}(s)=\int x^s\,g_{d,u}(x)\,dx$. This function is an even polynomial (in $u$) of degree $s$ whenever $s$ is an even integer (see Appendix C.1). The special case $u=0$ corresponds to the (central) chi distribution and is given by $M_d(s)\coloneqq 2^{\frac{s}{2}}\Gamma(\tfrac{d+s}{2})/\Gamma(\tfrac{d}{2})$.
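These moments are easy to evaluate numerically. The sketch below computes the central moment $M_d(s)$ from the closed form above and checks the noncentral moment $M_{d,u}(s)$ by a Monte Carlo average; the sample size and the test values are illustrative assumptions (the exact polynomial evaluation of $M_{d,u}(s)$ is discussed in Appendix C.1).

```python
import numpy as np
from scipy.special import gammaln

def M_central(d, s):
    """M_d(s) = 2^{s/2} Gamma((d+s)/2) / Gamma(d/2), s-th moment of the chi distribution."""
    return np.exp(0.5 * s * np.log(2) + gammaln((d + s) / 2) - gammaln(d / 2))

def M_noncentral_mc(d, u, s, n=200_000, seed=0):
    """Monte Carlo estimate of M_{d,u}(s) = E[||Z||^s], Z ~ N(mu, I_d) with ||mu|| = u."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d); mu[0] = u            # any mean with norm u gives the same law of ||Z||
    Z = rng.standard_normal((n, d)) + mu
    return np.mean(np.linalg.norm(Z, axis=1) ** s)

print(M_central(3, 2))             # equals d = 3 when s = 2
print(M_noncentral_mc(3, 0.0, 2))  # should be close to 3
```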

3 Upper Bounds on GOT

In this section, we show that GOT is bounded from above by a family of kernel MMD distances. It is assumed throughout that $d$ is a positive integer, $p\in(0,\infty)$, and $\sigma\in(0,\infty)$.

3.1 General Bound via Kernel MMD

Consider the feature map $\psi_x\colon\mathbb{R}^d\to[0,\infty)$ defined by

\[
\psi_x(z)\coloneqq\frac{\sqrt{\omega_d}}{2^{\frac{d+p}{2}}}\,\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}\,\phi\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}\right), \tag{7}
\]

where $\omega_d=2\pi^{d/2}/\Gamma(d/2)$ is the surface area of the unit sphere in $\mathbb{R}^d$, $\phi(u)=(2\pi)^{-d/2}\exp(-\frac{1}{2}\|u\|^2)$ is the standard Gaussian density on $\mathbb{R}^d$, and $f$ is a probability density function on $[0,\infty)$ that satisfies

\[
f(x)\geq a\,x^{d+2p-1}\exp(-bx^2), \tag{8}
\]

for some $a>0$ and $b\in(0,1/2)$. This feature map defines a positive semidefinite kernel $k$ according to

\[
k(x,y)=\langle\psi_x,\psi_y\rangle.
\]

After some straightforward manipulations (see Appendix A), one finds that $k(x,y)$ is finite on $\mathbb{R}^d\times\mathbb{R}^d$ and can be expressed as the product of a Gaussian kernel and a term that depends only on $\|x+y\|$. Specifically,

\[
k(x,y)=\exp\left(-\frac{\|x-y\|^2}{4\sigma^2}\right) I_f\!\left(\frac{\|x+y\|}{\sqrt{2}\,\sigma}\right), \tag{9}
\]

where

\[
I_f(u)\coloneqq\frac{\omega_d}{2^{d+p}(2\pi)^{\frac{d}{2}}}\int_0^{\infty}\frac{x^{d-1+2p}}{f(x)}\,g_{d,u}(x)\,dx, \tag{10}
\]

and $g_{d,u}(x)$ is the density of the noncentral chi distribution given in (6). Note that this kernel is not shift invariant because of the term $I_f(u)$.

The next result shows that the MMD defined by this kernel provides an upper bound on the GOT.

Theorem 2.

Let $k$ be defined as in (9). For any $P,Q\in\mathcal{P}(\mathbb{R}^d)$ such that $\int\sqrt{k(x,x)}\,dP(x)$ and $\int\sqrt{k(x,x)}\,dQ(x)$ are finite, the MMD defined by $k$ provides an upper bound on the GOT:

\[
\mathcal{T}^{(\sigma)}_p(P,Q)\leq 2^{\max(p-1,0)}\,\sigma^p\,\gamma_k(P,Q).
\]

The significance of Theorem 2 is twofold. From the perspective of GOT, it provides a natural connection between the role of Gaussian smoothing and the normalization of the kernel. From the perspective of MMD, Theorem 2 describes a family of kernels that metrize convergence in distribution as well as convergence in $p$-th moments.

Similar to the analysis of convergence rates in previous work Goldfeld and Greenewald (2020); Goldfeld et al. (2020b), the proof of Theorem 2 builds upon the fact that $\mathcal{T}_p(P,Q)$ can be upper bounded by a weighted total variation distance (Villani, 2008, Theorem 6.13). The novelty of Theorem 2 is that it establishes an explicit relationship with the kernel MMD and also provides a much broader class of upper bounds parameterized by the density $f$.

3.2 A ‘Two-moment’ Kernel

One potential limitation of Theorem 2 is that for a particular choice of density $f$, the requirement that $\sqrt{k(x,x)}$ is integrable might not be satisfied for probability measures of interest. For example, the convergence rates in Goldfeld and Greenewald (2020) and Goldfeld et al. (2020b) can be obtained as a corollary of Theorem 2 by choosing $f$ to be the density of the generalized gamma distribution. However, the inverse of this density grows faster than exponentially, and as a consequence, the resulting bound can be applied only to the case of sub-Gaussian distributions.

The main idea underlying the approach in this section is that choosing a density with heavier tails leads to an upper bound that holds for a larger class of probability measures. Motivated by the functional inequalities appearing in Reeves (2020), we consider the following density function, which belongs to the family of generalized beta-prime distributions:

\[
f(x)=\frac{\epsilon}{2\pi x}\left(\left(\frac{x}{\lambda}\right)^{-\epsilon}+\left(\frac{x}{\lambda}\right)^{\epsilon}\right)^{-1}.
\]

For this special choice, the function $I_f(u)$ can be expressed as the weighted sum of two moments of the noncentral chi distribution. Starting with (9) and simplifying terms leads to the following:

Definition 1.

The two-moment kernel $k\colon\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is defined as

\[
k(x,y)=\alpha_{d,p}\exp\left(-\frac{\|x-y\|^2}{4\sigma^2}\right) J\!\left(\frac{\|x+y\|}{\sqrt{2}\,\sigma}\right) \tag{11}
\]

for all $\epsilon\in(0,d+2p]$ and $\lambda\in(0,\infty)$, where

\[
\alpha_{d,p}\coloneqq\frac{(2\pi)\,2^{-(p+d)}\,2^{-d/2}}{\Gamma(\frac{d}{2})},
\qquad
J(u)\coloneqq\frac{\lambda^{\epsilon}M_{d,u}(d+2p-\epsilon)+\lambda^{-\epsilon}M_{d,u}(d+2p+\epsilon)}{2\epsilon}.
\]

In this expression, $M_{d,u}(s)$ denotes the $s$-th moment of the noncentral chi distribution with $d$ degrees of freedom and parameter $u$.

A useful property of the two-moment kernel is that it satisfies the upper bound

\[
k(x,y)\leq C_{d,p,\epsilon,\lambda}\left(1+\|x\|^{d+2p+\epsilon}+\|y\|^{d+2p+\epsilon}\right),
\]

where the constant depends only on $(d,p,\epsilon,\lambda)$ (see Appendix A for details). As a consequence:

Theorem 3.

Fix any $s>p+d/2$. For all $0<\epsilon<\min(d+2p,\,2s-2p-d)$ and $\lambda\in(0,\infty)$, the MMD defined by the two-moment kernel is a metric on the space $\mathcal{P}_s(\mathbb{R}^d)$ of probability measures with finite $s$-th moment. Furthermore, for all $P,Q\in\mathcal{P}_s(\mathbb{R}^d)$,

\[
\mathcal{T}_p^{(\sigma)}(P,Q)\leq 2^{\max(p-1,0)}\,\sigma^p\,\gamma_k(P,Q).
\]
Remark 1.

If $d+2p\pm\epsilon$ are even integers then $J(u)$ is an even polynomial of degree $s=d+2p+\epsilon$ with non-negative coefficients. For example, if $d=3$, $p=1$, and $(\epsilon,\lambda)=(1,1)$, then

\[
J(u)=60+\frac{115}{2}u^2+11u^4+\frac{1}{2}u^6. \tag{12}
\]

Methods for efficient numerical approximation of $J(u)$ are provided in Appendix C.1.
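For the special case of Remark 1, the kernel can be evaluated directly from (11) and (12). The following is a minimal sketch of such an evaluation; the smoothing level $\sigma=1$ and the test points are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi

d, p, sigma = 3, 1, 1.0            # dimension, order, smoothing level (illustrative values)
alpha = (2 * pi) * 2 ** (-(p + d)) * 2 ** (-d / 2) / gamma(d / 2)   # alpha_{d,p} from Definition 1

def J(u):
    """Even polynomial from (12), valid for d=3, p=1, (eps, lambda) = (1, 1)."""
    return 60 + 115 / 2 * u**2 + 11 * u**4 + 0.5 * u**6

def two_moment_kernel(x, y):
    """Two-moment kernel (11): Gaussian factor times J(||x+y|| / (sqrt(2) sigma))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    gauss = np.exp(-np.sum((x - y) ** 2) / (4 * sigma**2))
    return alpha * gauss * J(np.linalg.norm(x + y) / (np.sqrt(2) * sigma))

print(two_moment_kernel([0.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
```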

4 Convergence Rate

We now turn our attention to the fundamental question of how well the empirical measure of iid samples approximates the true underlying distribution. Let $S_1,\dots,S_n\in\mathbb{R}^d$ be a sequence of $n$ independent samples with common distribution $P$. The empirical measure $P_n$ is the (random) probability measure on $\mathbb{R}^d$ that places probability mass $1/n$ at each sample point:

\[
P_n:=\frac{1}{n}\sum_{i=1}^{n}\delta_{S_i}, \tag{13}
\]

where $\delta_x$ denotes the point mass distribution at $x$.

Distributional properties of the kernel MMD between $P$ and $P_n$ have been studied extensively Gretton et al. (2012). For the purposes of this paper, we will focus on the expected difference between these distributions. As a straightforward consequence of (5), one obtains an exact expression for the expectation of the squared MMD:

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]=\frac{\mathbb{E}[k(X,X)]-\mathbb{E}[k(X,X')]}{n}, \tag{14}
\]

where $X,X'$ are independent draws from $P$. Note that the numerator can also be expressed as $\mathbb{E}\left[\gamma_k^2(P,P_1)\right]$, i.e., the squared MMD under $n=1$ sample. By Jensen’s inequality, the first moment satisfies

\[
\mathbb{E}\left[\gamma_k(P,P_n)\right]\leq\frac{\sqrt{\mathbb{E}[k(X,X)]}}{\sqrt{n}}. \tag{15}
\]

In the following, we focus on the two-moment kernel given in (11) and study how $\mathbb{E}[k(X,X)]$ depends on $P$ and the parameters $(p,\sigma)$.

We note that because $\gamma_k$ satisfies the triangle inequality, all of the results provided here extend naturally to the two-sample setting where the goal is to approximate the distance $\gamma_k(P,Q)$ based on the empirical measures $P_n$ and $Q_m$.

4.1 Finite Moment Condition

We begin with an upper bound on $\mathbb{E}[k(X,X)]$ that holds whenever $\|X\|$ has a moment greater than $d+2p$.

Theorem 4.

Let $X\in\mathbb{R}^d$ be a random vector that satisfies $(\mathbb{E}\left[\|X\|^s\right])^{1/s}\leq m$ for some $s>d+2p$. If $k$ is the two-moment kernel given in (11) with parameters $0<\epsilon\leq\min(d+2p,\,s-d-2p)$ and $\lambda=\sqrt{d+2p+\epsilon}+\sqrt{2}\,m/\sigma$, then

\[
\mathbb{E}[k(X,X)]\leq\frac{\alpha_{d,p}}{\epsilon}\Big(\sqrt{2d+2p+\epsilon}+\frac{\sqrt{2}\,m}{\sigma}\Big)^{d+2p}.
\]

In view of (15), Theorem 1 follows as an immediate corollary.

Small noise limit and unsmoothed OT.

It is instructive to consider the implications of Theorem 1 in the limit as $\sigma$ converges to zero. By two applications of the triangle inequality and the fact that $\mathcal{T}_p(Q,Q\ast\mathcal{N}_\sigma)\leq\sigma^p M_d(p)$ for every $Q\in\mathcal{P}$, one finds that, for any $\sigma\in[0,\infty)$, the (unsmoothed) OT distance can be upper bounded according to

\[
\mathcal{T}_p(P,P_n)\leq C_{d,p}\left(\mathcal{T}^{(\sigma)}_p(P,P_n)+\sigma^p\right). \tag{16}
\]

Combining (16) with Theorem 1 and then evaluating at $\sigma=m\,n^{-\frac{1}{d+2p}}$ leads to the (unsmoothed) convergence rate given in (4).
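For completeness, here is a sketch of the substitution, absorbing all constants that depend only on $(d,p,s)$ into $C'_{d,p,s}$:
\[
\begin{aligned}
\mathbb{E}\left[\mathcal{T}_p(P,P_n)\right]
&\leq C_{d,p}\left(C_{d,p,s}\,\frac{\sigma^p}{\sqrt{n}}\Big(1+\frac{m}{\sigma}\Big)^{\frac{d}{2}+p}+\sigma^p\right)
&&\text{by (16) and (3)}\\
&\leq C_{d,p}\left(C_{d,p,s}\,2^{\frac{d}{2}+p}\,\frac{m^p\,n^{-\frac{p}{d+2p}}}{\sqrt{n}}\,n^{\frac{d/2+p}{d+2p}}+m^p\,n^{-\frac{p}{d+2p}}\right)
&&\text{since } \sigma=m\,n^{-\frac{1}{d+2p}}\text{ gives } 1+\tfrac{m}{\sigma}\leq 2\,n^{\frac{1}{d+2p}}\\
&= C'_{d,p,s}\,m^p\,n^{-\frac{p}{d+2p}},
&&\text{since } n^{-\frac{1}{2}}\,n^{\frac{d/2+p}{d+2p}}=1.
\end{aligned}
\]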

4.2 Sub-gamma Condition

Next, we provide a refined bound for distributions satisfying a sub-gamma tail condition.

Definition 2.

A random vector $X\in\mathbb{R}^d$ is said to be sub-gamma with variance parameter $v>0$ and scale parameter $b\geq 0$ if

\[
\mathbb{E}\left[\exp(\alpha^{\top}X)\right]\leq\exp\left(\frac{v\|\alpha\|^2}{2(1-b\|\alpha\|)}\right) \tag{17}
\]

for all $\alpha\in\mathbb{R}^d$ such that $\|\alpha\|\leq 1/b$. If this condition holds with $b=0$ then $X$ is said to be sub-Gaussian with variance parameter $v$.

Properties of sub-Gaussian and sub-gamma distributions have been studied extensively; see, e.g., Boucheron et al. (2013). In particular, if $X_1$ and $X_2$ are independent sub-gamma random vectors with parameters $(v_1,b_1)$ and $(v_2,b_2)$, respectively, then $X_1+X_2$ is sub-gamma with parameters $(v_1+v_2,\max(b_1,b_2))$.

The next result provides an upper bound on the moments of the magnitude of a sub-gamma vector. Although there is a rich literature on this topic, we were unable to find a previous statement of this result, and so it may be of independent interest.

Theorem 5.

Let $X\in\mathbb{R}^d$ be a sub-gamma random vector with parameters $(v,b)$. For all $s\in[0,\infty)$

\[
\mathbb{E}\left[\|X\|^s\right]\leq\sqrt{2e}\left(\sqrt{v}+\sqrt{s}\,b\right)^{s}M_d(s), \tag{18}
\]

where $M_d(s):=2^{\frac{s}{2}}\Gamma(\frac{d+s}{2})/\Gamma(\frac{d}{2})$ is the $s$-th moment of the chi distribution with $d$ degrees of freedom. Furthermore, $M_d(s)\leq\overline{M}_d(s)$ where

\[
\overline{M}_d(s):=\left(\frac{d+s}{e}\right)^{\frac{s}{2}}\left(\frac{d+s}{d}\right)^{\frac{d-1}{2}}. \tag{19}
\]
Theorem 6.

Let $X\in\mathbb{R}^d$ be a sub-gamma random vector with parameters $(v,b)$. Let $r=d+2p$ and for $|t|\leq d+2p$ define

\[
m(t)\coloneqq\left(\sqrt{\sigma^2+2v}+\sqrt{r+t}\,b\right)^{r+t}\overline{M}_d(r+t),
\]

where $\overline{M}_d(s)$ is defined in (19). If $k$ is the two-moment kernel defined in (11) with parameters $\epsilon=\sqrt{d}$ and $\lambda=(m(\epsilon)/m(-\epsilon))^{1/(2\epsilon)}$, then

\[
\mathbb{E}\left[k(X,X)\right]\leq C\,(d+p)^{p}\Big(\sqrt{1+\frac{2v}{\sigma^2}}+\frac{\sqrt{d+2p}\,b}{\sigma}\Big)^{d+2p}, \tag{20}
\]

where $C=\sqrt{2\pi}\,e^{7/8}<6.02$.

In view of (15), Theorem 6 gives an upper bound on the convergence rate of $\gamma_k(P,P_n)$ with an explicit dependence on the parameters $(d,p,\sigma,v,b)$. For the special case of a sub-Gaussian distribution ($b=0$) and $p\in\{0,1\}$, this bound recovers the results in Goldfeld and Greenewald (2020). Going beyond the setting of sub-Gaussian distributions (i.e., $b>0$), this bound quantifies the dependence on the dimension $d$ and the scale parameter $b$.

Phase transition in scale parameter.

In the high-dimensional setting where $d$ increases with $n$, Theorem 6 exhibits two distinct regimes depending on the behavior of the scale parameter. Suppose that $(p,\sigma,v)$ are fixed while $b$ scales with $d$. If $b=O(d^{-\delta})$ for some $\delta>1/2$, then $\mathbb{E}[k(X,X)]$ grows at most exponentially with the dimension:

\[
\frac{\log\mathbb{E}[k(X,X)]}{d}\leq\frac{1}{2}\log\left(1+\frac{2v}{\sigma^2}\right)+o(1). \tag{21}
\]

Conversely, if $b=\Omega(d^{-\delta})$ for $\delta<1/2$, then the upper bound increases faster than exponentially:

\[
\frac{\log\mathbb{E}[k(X,X)]}{d}\leq\frac{2(1-\delta)\log d}{2}+O(1).
\]

The following example exhibits a specific sub-gamma distribution which shows that the upper bound in Theorem 6 is tight with respect to the scaling of the dimension $d$ and the scale parameter $b$. Full details of this example are provided in Appendix C.2.

Example 1.

Suppose that $X=\sqrt{U}Z$ where $Z\sim\mathcal{N}(0,I_d)$ is a standard Gaussian vector and $U$ is an independent Gamma random variable with shape parameter $1/(2b^2)$ and scale parameter $2b^2$. Then $X$ satisfies the sub-gamma condition with parameters $(1,b)$, and so the upper bound in Theorem 6 applies. Moreover, for every pair $(\epsilon,\lambda)$, the expectation of the two-moment kernel satisfies the following lower bound

\[
\begin{aligned}
\mathbb{E}\left[k(X,X)\right]
&\geq\frac{\alpha_{d,p}}{\epsilon}\left(\frac{\sqrt{2}\,b}{\sigma}\right)^{r}\left(M_{b^{-2}}(r-\epsilon)\,M_{b^{-2}}(r+\epsilon)\right)^{1/2}\\
&\quad\times\left(M_d(r-\epsilon)\,M_d(r+\epsilon)\right)^{1/2}.
\end{aligned}\tag{22}
\]

The bounds on $\mathbb{E}\left[k(X,X)\right]$ given in (20) and (22) are shown in Figure 1 as a function of $\delta$ (where $b=d^{-\delta}$) for various values of $d$. The plot demonstrates a phase transition at the critical value $\delta=0.5$. Further computational results on this example are given in Section 5.1.
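As a sanity check on the construction in Example 1, the following is a minimal sketch that draws samples of $X=\sqrt{U}Z$ and compares an empirical moment of $\|X\|$ against the bound of Theorem 5 with $v=1$; the dimension, scale parameter, and moment order below are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def sample_subgamma(n, d, b, rng):
    """Example 1: X = sqrt(U) * Z with U ~ Gamma(shape=1/(2 b^2), scale=2 b^2) and Z ~ N(0, I_d)."""
    U = rng.gamma(shape=1.0 / (2 * b**2), scale=2 * b**2, size=n)
    Z = rng.standard_normal((n, d))
    return np.sqrt(U)[:, None] * Z

def chi_moment(d, s):
    """M_d(s) = 2^{s/2} Gamma((d+s)/2) / Gamma(d/2)."""
    return np.exp(0.5 * s * np.log(2) + gammaln((d + s) / 2) - gammaln(d / 2))

d, b, s = 3, 0.5, 6                       # illustrative dimension, scale parameter, moment order
rng = np.random.default_rng(0)
X = sample_subgamma(200_000, d, b, rng)
empirical = np.mean(np.linalg.norm(X, axis=1) ** s)
bound = np.sqrt(2 * np.e) * (np.sqrt(1.0) + np.sqrt(s) * b) ** s * chi_moment(d, s)  # Theorem 5, v = 1
print(empirical, bound)                   # the empirical moment should not exceed the bound
```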

Figure 1: Bounds on $\frac{1}{d}\log\mathbb{E}\left[k(X,X)\right]$ for a distribution satisfying the sub-gamma condition with parameters $(1,d^{-\delta})$, as a function of $\delta$ for various $d$. In all cases, $p=1$, $\sigma=0.1$, and $k$ is the ‘two-moment’ kernel described in Theorem 6. The upper bounds (solid line) are given by the right-hand side of (20). The lower bounds (dashed line) are given by the right-hand side of (22) evaluated at $\epsilon=\sqrt{d}$.

4.3 Dependent Samples

Motivated by applications involving Markov chain Monte Carlo, there is significant interest in understanding the rate of convergence when there is dependence in the samples. Within the literature, this question is often addressed by focusing on stationary sequences satisfying certain mixing conditions Peligrad (1986). The basic idea is that if the dependence between $S_i$ and $S_j$ decays rapidly as $|i-j|$ increases, then the effect of the dependence is negligible.

To go beyond the usual mixing conditions, a particularly useful property of the kernel MMD distance is that the second moment of $\gamma_k(P,P_n)$ depends only on the pairwise correlation in the samples. This perspective is useful for settings where there may not be a natural notion of time.

To make things precise, suppose that $S_1,\dots,S_n\in\mathbb{R}^d$ is a collection of (possibly dependent) samples with identical distribution $P$. For each $i\neq j$ let $Q_{ij}$ denote the law of $(S_i,S_j)$. Starting with (5), the expectation of the squared MMD can now be decomposed as

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]=\frac{1}{n}\,\mathbb{E}\left[\gamma^2_k(P,P_1)\right]+\frac{1}{n^2}\sum_{i\neq j}r_{ij}, \tag{23}
\]

where $r_{ij}\coloneqq\mathbb{E}_{Q_{ij}}[k(S_i,S_j)]-\mathbb{E}_{P\otimes P}[k(S_i,S_j')]$. Notice that the first term in this decomposition is the second moment under independent samples. The second term is non-negative and depends only on the pairwise dependence, i.e., the difference between $Q_{ij}$ and the probability measure obtained by the product of its marginals.

In some cases, it is natural to argue that only a small number of the terms $r_{ij}$ are nonzero. More generally, it is desirable to provide guarantees in terms of measures of dependence that do not rely on the particular choice of kernel. The next result provides such a bound in terms of the Hellinger distance.

Lemma 7.

If $\mathbb{E}_P[k^2(X,X)]^{1/2}<C_{k,P}$, then

\[
r_{ij}\leq\sqrt{2}\,C_{k,P}\,d_H(Q_{ij},P\otimes P),
\]

where $d_H$ denotes the Hellinger distance.

The following example is inspired by the random feature kernel interpretation of neural networks Rahimi and Recht (2008).


Figure 2: Upper panel: (Left) The UB (solid line) and LB (dashed line) of $\mathbb{E}\left[k(X,X)\right]$ given by (20) and (22), plotted as functions of $d$ up to 500 and for various values of $\delta$ (shown in 4 colors), where $X$ is the sub-gamma random variable in $\mathbb{R}^d$ of Example 1, $b=d^{-\delta}$, and the kernel function $k$ has bandwidth $\sigma=1$. (Right) Empirical estimates of $\mathbb{E}[\hat{\Delta}_\gamma^2]$ (cross markers), shown together with the UB and LB of $\mathbb{E}\left[k(X,X)\right]$ as in the left plot but over a range of $d$ up to 30. The values are computed with $n=400$ samples of $X$, and are averaged over 200 Monte-Carlo replicas. Bottom panel: Same plots with $\sigma=4$, $n=100$, and averaged over 100 Monte-Carlo replicas. In all the plots, the quantities are shown as $\log$ values divided by $d$ for better demonstration.
Example 2.

Let $\{Z(\alpha)\,:\,\alpha\in\mathbb{R}^N\}$ be an $\mathbb{R}^d$-valued Gaussian process with mean zero and covariance function $\operatorname{Cov}(Z(\alpha),Z(\beta))=\langle\alpha,\beta\rangle I_d$. Suppose that samples from $P$ are generated according to $S_i=T(Z(\alpha_i))$ where $\alpha_1,\dots,\alpha_n$ are points on the unit sphere and $T\colon\mathbb{R}^d\to\mathbb{R}^d$ is a function that maps a standard Gaussian vector into a vector with distribution $P$. Because the Hellinger distance is non-increasing under the mapping given by $T$, it can be verified that

\[
d_H(Q_{ij},P\otimes P)\leq\sqrt{d}\,|\langle\alpha_i,\alpha_j\rangle|.
\]

Thus, by (23) and Lemma 7, there exists a constant $C_{k,P}$ depending only on $k$ and $P$ such that

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]\leq C_{k,P}\Big(\frac{1}{n}+\frac{1}{n^2}\sum_{i\neq j}|\langle\alpha_i,\alpha_j\rangle|\Big). \tag{24}
\]

This inequality holds for any set of points $\{\alpha_i\}$. To gain insight into the typical scaling behavior when the samples are nearly orthogonal on average, let us suppose that the $\{\alpha_i\}$ are drawn independently from the uniform distribution on the sphere. Then $\mathbb{E}\left[|\langle\alpha_i,\alpha_j\rangle|^2\right]=1/N$ and, by standard concentration arguments, one finds that the following upper bound holds with high probability when $N$ is large:

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]\leq C_{k,P}\Big(\frac{1}{n}+\sqrt{\frac{\log N}{N}}\Big). \tag{25}
\]
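A minimal sketch of the sampling mechanism in Example 2 is given below; it also computes the pairwise-dependence term appearing in (24). Taking $T$ to be the identity (so that $P$ is the standard Gaussian) is an illustrative assumption, as are the values of $n$, $N$, and $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 100, 50, 5                       # illustrative sample size, feature dimension, data dimension

# Points alpha_i drawn uniformly on the unit sphere S^{N-1}.
A = rng.standard_normal((n, N))
A /= np.linalg.norm(A, axis=1, keepdims=True)

# Z(alpha_i) with Cov(Z(alpha), Z(beta)) = <alpha, beta> I_d: each coordinate is <alpha_i, W_k>.
W = rng.standard_normal((N, d))
Z = A @ W                                  # row i is Z(alpha_i)
S = Z                                      # T = identity here, i.e. P is the standard Gaussian (illustrative)

# Pairwise-dependence term from (24): (1/n^2) * sum_{i != j} |<alpha_i, alpha_j>|.
G = np.abs(A @ A.T)
dep_term = (G.sum() - np.trace(G)) / n**2
print(dep_term, 1.0 / n)                   # compare the dependence term with the iid term 1/n
```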

5 Numerical Results

In this section, we compute the upper bounds of the GOT distance provided by the empirical kernel MMD distance.

Figure 3: Empirical values of $\gamma_k^2(P_n,P_n')$ for dependent samples in Example 2 in Section 4.3. (Left) Mean and standard deviation computed over 200 realizations of the $S_i$'s, plotted against increasing values of $N$ and for different $n$. (Middle) The $\log_{10}$ of the mean value in the left plot. (Right) Log-log plot of the estimated limiting values of the mean in the left plot, obtained by averaging over the range $N\in[80,100]$, vs. $n$, showing a slope close to 1.

5.1 Bounds under Sub-gamma Condition

The sub-gamma condition allows us to address distributions that do not satisfy the sub-Gaussian condition. Upper bounds on the convergence rate for this class of distributions follow from Theorem 3 combined with Theorem 6.

For sub-gamma $X$ as in Example 1, $\mathbb{E}\left[k(X,X)\right]$ has an upper bound (UB) and a lower bound (LB) given by (20) and (22), respectively. Here, in addition to the theoretical UB and LB shown in Figure 1, we also compute an estimate of $\mathbb{E}\left[n\gamma_k^2(P,P_n)\right]$ approximated by the two-sample estimator. Let $P_n$ and $P_n'$ be the empirical measures defined as in (13) for two independent iid samples $\{X_i\}_{i=1}^n$ and $\{X_i'\}_{i=1}^n$ of the sub-gamma random vector of Example 1. Define $\hat{\Delta}_\gamma^2\coloneqq\frac{n}{2}\gamma_k^2(P_n,P_n')$; then $\mathbb{E}[\hat{\Delta}_\gamma^2]=\mathbb{E}\left[k(X,X)\right]-\mathbb{E}\left[k(X,X')\right]=\mathbb{E}\left[n\gamma_k^2(P,P_n)\right]$. Because the empirical squared kernel MMD distance $\gamma_k^2(P_n,P_n')$ can be computed numerically from the two samples of sub-gamma random vectors, we can estimate the expectation of $\hat{\Delta}_\gamma^2$ by an empirical average over Monte-Carlo replicas. Detailed numerical techniques to compute the two-moment kernel and the experimental setup are provided in the appendix.
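A minimal sketch of this two-sample estimate is given below, written with the closed-form two-moment kernel of Remark 1 ($d=3$, $p=1$, $(\epsilon,\lambda)=(1,1)$) for concreteness; this kernel choice, along with $\sigma$, $b$, the sample size, and the number of replicas, is an illustrative assumption and differs from the settings used in the figures.

```python
import numpy as np
from math import gamma, pi

d, p, sigma, b = 3, 1, 1.0, 0.5            # illustrative parameters
alpha = (2 * pi) * 2 ** (-(p + d)) * 2 ** (-d / 2) / gamma(d / 2)

def kern(X, Y):
    """Two-moment kernel matrix, using the closed-form J(u) of Remark 1 (d=3, p=1, eps=lam=1)."""
    diff = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    u = np.linalg.norm(X[:, None, :] + Y[None, :, :], axis=-1) / (np.sqrt(2) * sigma)
    J = 60 + 115 / 2 * u**2 + 11 * u**4 + 0.5 * u**6
    return alpha * np.exp(-diff / (4 * sigma**2)) * J

def delta_gamma_sq(X, Xp):
    """Two-sample statistic (n/2) * gamma_k^2(P_n, P_n'); its mean equals E[k(X,X)] - E[k(X,X')]."""
    n = X.shape[0]
    g2 = kern(X, X).mean() + kern(Xp, Xp).mean() - 2 * kern(X, Xp).mean()
    return n / 2 * g2

rng = np.random.default_rng(0)
def sample(n):
    """Example 1 sub-gamma sampler: X = sqrt(U) * Z."""
    U = rng.gamma(1 / (2 * b**2), 2 * b**2, size=n)
    return np.sqrt(U)[:, None] * rng.standard_normal((n, d))

print(np.mean([delta_gamma_sq(sample(200), sample(200)) for _ in range(50)]))
```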

The results are shown in Figure 2, where we set the parameter $b$ controlling the shape and scale of $X$ as $b=d^{-\delta}$ with $\delta\in\{0,0.25,0.5,0.75\}$. The data dimension $d$ takes multiple values, and the kernel bandwidth $\sigma$ takes the values 1 and 4. In both cases, the left plot shows the same information as Figure 1 in view of increasing $d$, so as to be compared to the right plot. The right plot focuses on the case of small $d$, where the empirical estimates of $\mathbb{E}[\hat{\Delta}_\gamma^2]$ respect the UB. Notably, these values also lie between the UB and LB (recall that the LB applies to $\mathbb{E}\left[k(X,X)\right]$ but not to $\mathbb{E}[\hat{\Delta}_\gamma^2]$) for both values of $\sigma$, and approach the LB when $\sigma=1$. This shows that our theoretical UB captures an important component of the kernel MMD distances for this example of $X$.

5.2 Dependent Samples via Gaussian Process

We generate dependent samples following a Gaussian process $\{Z(\alpha)\}_{\alpha\in S^N}$ as in Example 2 in Section 4.3, and numerically compute the values of $\gamma_k^2(P,P_n)$ via its two-sample version $\gamma_k^2(P_n,P_n')$, using the same kernel as in the first experiment. Theoretically, when $n$ is large, $\gamma_k^2(P,P_n)$ is expected to concentrate at its mean value as in (24), and then, since the $\alpha_i$ are uniformly sampled on the $N$-sphere, we also expect concentration at the value in (25).

We set $n\in\{30,50,70,100\}$ and vary $N$ from 5 to 100. The data are in dimension $d=5$, the kernel parameters are $\sigma=0.5$, $c=1$, $p=1$, and for each value of $n$ and $N$, $\gamma_k^2(P_n,P_n')$ is computed for 100 realizations of the dependent random variables $S_i$, conditioned on one realization of the vectors $\alpha_i$. The results are shown in Figure 3, where for small values of $N$, the computed approximate values of $\mathbb{E}\left[\gamma_k^2(P,P_n)\right]$ are decreasing because they are dominated by the contribution from the dependence in the samples, namely the part that is upper-bounded by the second term depending on $\alpha$ in (24). As $N$ increases, these values converge to a certain positive value which decreases as $n$ increases.

Specifically, we take the average of the mean values of $\gamma_k^2(P_n,P_n')$ over $N\in[80,100]$ as an estimate of the limiting values, and Figure 3 (right) shows that these values decay as $n^{-1}$, as they correspond to the first term in (24). These numerical results show the two competing factors as predicted by the analysis in Section 4.3.

6 Conclusion and Discussion

The paper proves new convergence rates of GOT under general settings. Our results require only a finite moment condition, and in the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and show a phase transition with respect to the scale parameter. Furthermore, our results cover the setting of dependent samples, where convergence is proved requiring only a condition on pairwise sample dependence expressed by the kernel. Throughout our analysis, the main theoretical technique is to establish an upper bound using a kernel MMD where the kernel is called a “two-moment” kernel due to its special properties. The kernel depends on the cost function of the OT as well as the Gaussian smoothing used in GOT.

For the tightness of the kernel MMD upper bound, as has been pointed out in the comment beneath Theorem 1, our result shows that the convergence rate of $n^{-1/2}$ is tight in some regimes where $\sigma\to 0$ with $n\to\infty$, and the result matches the minimax rate in the unsmoothed OT setting. Alternatively, if $\sigma$ is bounded away from zero then it may be possible to obtain a better rate of convergence. For example, Proposition 6 in Goldfeld et al. (2020b) shows that if the pair $(P,\sigma)$ satisfies an additional chi-square divergence condition, then $\mathcal{T}_2^{(\sigma)}(P,P_n)$ converges at rate $1/n$, which is faster than the general upper bound of $1/\sqrt{n}$ appearing in our paper. In this direction, pinning down the exact convergence rate in terms of regularity conditions on $P$ remains an interesting open question for future work. In addition, it would be interesting to investigate the relationship between the Gaussian smoothing used in this paper and the multiscale representation of $\mathcal{T}_p$ in terms of partitions of the support of $P$, which was used in the analysis in Fournier and Guillin (2015) as well as the related work Weed et al. (2019).

In practice, the tightness of the kernel MMD upper bound also depends on the choice of kernel, which can be optimized for the data distribution and the level of smoothing in GOT. The question of whether the kernel MMD provides a useful alternative to the OT distance in applications is worthy of further investigation. Finally, another important direction of future work is to study computational methods and applications of the GOT approach, particularly in high-dimensional spaces. Currently, no specialized algorithm for estimating GOT from finite samples has been developed, except for the direct method of applying any OT algorithm, e.g., entropy OT (Sinkhorn), to data with additive Gaussian noise Goldfeld and Greenewald (2020). Progress on the computational side will also enable various applications of GOT, e.g., the evaluation of generative models in machine learning.

References

  • Abid and Gower (2018) Brahim Khalil Abid and Robert M Gower. Greedy stochastic algorithms for entropy-regularized optimal transport problems. arXiv preprint arXiv:1803.01347, 2018.
  • Anderson et al. (1994) Niall H Anderson, Peter Hall, and D Michael Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41–54, 1994.
  • Bhushan Damodaran et al. (2018) Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 447–463, 2018.
  • Bigot et al. (2019) Jérémie Bigot, Elsa Cazelles, Nicolas Papadakis, et al. Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics, 13(2):5120–5150, 2019.
  • Bonneel et al. (2015) Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
  • Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • Chakrabarty and Khanna (2020) Deeparnab Chakrabarty and Sanjeev Khanna. Better and simpler error analysis of the sinkhorn–knopp algorithm for matrix scaling. Mathematical Programming, pages 1–13, 2020.
  • Chen and Niles-Weed (2020) Hong-Bin Chen and Jonathan Niles-Weed. Asymptotics of smoothed wasserstein distances. arXiv preprint arXiv:2005.00738, 2020.
  • Chwialkowski et al. (2015) Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981–1989, 2015.
  • Courty et al. (2016) Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2016.
  • Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • Epps and Singleton (1986) TW Epps and Kenneth J Singleton. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177–203, 1986.
  • Fernández et al. (2008) V Alba Fernández, MD Jiménez Gamero, and J Muñoz García. A test for the two-sample problem based on empirical characteristic functions. Computational statistics & data analysis, 52(7):3730–3748, 2008.
  • Feydy et al. (2019) Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.
  • Fournier and Guillin (2015) Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707–738, 2015. doi: 10.1007/s00440-014-0583-7.
  • Fukumizu et al. (2008) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, pages 489–496, 2008.
  • Genevay et al. (2019) Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1574–1583. PMLR, 2019.
  • Gerber and Maggioni (2017) Samuel Gerber and Mauro Maggioni. Multiscale strategies for computing optimal transport. The Journal of Machine Learning Research, 18(1):2440–2471, 2017.
  • Goldfeld and Greenewald (2020) Ziv Goldfeld and Kristjan Greenewald. Gaussian-smoothed optimal transport: Metric structure and statistical efficiency. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020.
  • Goldfeld et al. (2018) Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. Estimating information flow in deep neural networks. arXiv preprint arXiv:1810.05728, 2018.
  • Goldfeld et al. (2020a) Ziv Goldfeld, Kristjan Greenewald, and Kengo Kato. Asymptotic guarantees for generative modeling based on the smooth wasserstein distance. Advances in Neural Information Processing Systems, 33, 2020a.
  • Goldfeld et al. (2020b) Ziv Goldfeld, Kristjan Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66(7):4368–4391, July 2020b.
  • Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Jitkrittum et al. (2016) Wittawat Jitkrittum, Zoltán Szabó, Kacper P Chwialkowski, and Arthur Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pages 181–189, 2016.
  • Johnson et al. (1995) Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous univariate distributions. John Wiley & Sons, Ltd, 1995.
  • Klatt et al. (2020) Marcel Klatt, Carla Tameling, and Axel Munk. Empirical regularized optimal transport: Statistical theory and applications. SIAM Journal on Mathematics of Data Science, 2(2):419–443, 2020.
  • Lei et al. (2020) Jing Lei et al. Convergence and concentration of empirical measures under wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020.
  • Li et al. (2017) Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • Li et al. (2018) Wuchen Li, Ernest K Ryu, Stanley Osher, Wotao Yin, and Wilfrid Gangbo. A parallel method for earth mover’s distance. Journal of Scientific Computing, 75(1):182–197, 2018.
  • Li et al. (2015) Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
  • Luise et al. (2018) Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of sinkhorn approximation for learning with wasserstein distance. In Advances in Neural Information Processing Systems, pages 5859–5870, 2018.
  • Mena and Niles-Weed (2019) Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
  • Muandet et al. (2017) K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel Mean Embedding of Distributions: A Review and Beyond. 2017.
  • Niles-Weed and Berthet (2019) Jonathan Niles-Weed and Quentin Berthet. Minimax estimation of smooth densities in wasserstein distance, 2019.
  • Peligrad (1986) Magda Peligrad. Recent advances in the central limit theorem and its weak invariance principle for mixing sequences of random variables (a survey). In Dependence in probability and statistics, pages 193–223. Springer, 1986.
  • Peyré and Cuturi (2019) Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • Rabin et al. (2011) Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
  • Rahimi and Recht (2008) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
  • Reeves (2020) Galen Reeves. A two-moment inequality with applications to rényi entropy and mutual information. Entropy, 22(11), 2020.
  • Ryu et al. (2018) Ernest K Ryu, Yongxin Chen, Wuchen Li, and Stanley Osher. Vector and matrix optimal mass transport: theory, algorithm, and applications. SIAM Journal on Scientific Computing, 40(5):A3675–A3698, 2018.
  • Sejdinovic et al. (2013) Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.
  • Shen et al. (2017) Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. arXiv preprint arXiv:1707.01217, 2017.
  • Singh and Póczos (2018) Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance, 2018. [Online]. Available https://arxiv.org/abs/1802.08855.
  • Smola et al. (2007) Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
  • Solomon et al. (2014) Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Earth mover’s distances on discrete surfaces. ACM Transactions on Graphics (TOG), 33(4):67, 2014.
  • Sriperumbudur et al. (2010) Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010.
  • Villani (2003) Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Weed et al. (2019) Jonathan Weed, Francis Bach, et al. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
  • Young and Dunson (2019) Alexander L Young and David B Dunson. Consistent entropy estimation for stationary time series. arXiv preprint arXiv:1904.05850, 2019.
  • Zhang et al. (2012) Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.
  • Zhao and Meng (2015) Ji Zhao and Deyu Meng. Fastmmd: Ensemble of circular discrepancy for efficient two-sample test. Neural computation, 2015.

Appendix A Upper Bounds on GOT

This section provides proofs of the upper bounds on the GOT provided in Section 3. For the convenience of the reader we repeat some of the necessary definitions. Recall that the feature map ψx:d[0,)\psi_{x}\colon\mathbb{R}^{d}\to[0,\infty) is defined by

ψx(z)\displaystyle\psi_{x}(z) :=ωd2d+p2zd1+2p2f(z)ϕ(z2xσ),\displaystyle:=\frac{\sqrt{\omega_{d}}}{2^{\frac{d+p}{2}}}\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}}\right), (A.1)

where ωd=2πd/2/Γ(d/2)\omega_{d}=2\pi^{d/2}/\Gamma(d/2) is the volume of the unit sphere in d\mathbb{R}^{d}, ϕ(u)=(2π)d/2exp(12u2)\phi(u)=(2\pi)^{-d/2}\exp(-\frac{1}{2}\|u\|^{2}) is the standard Gaussian density on d\mathbb{R}^{d}, and ff is a probability density function on [0,)[0,\infty) that satisfies

f(x)axd+2p1exp(bx2),\displaystyle f(x)\geq ax^{d+2p-1}\exp(-bx^{2}), (A.2)

for some a>0a>0 and b(0,1/2)b\in(0,1/2).

Lemma A.1.

The feature map in (A.1) defines a positive semidefinite kernel k:d×d[0,)k\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,\infty) according to

k(x,y)dψx(z)ψy(z)𝑑z.\displaystyle k(x,y)\coloneqq\int_{\mathbb{R}^{d}}\psi_{x}(z)\psi_{y}(z)\,dz. (A.3)

Furthermore, this kernel can also be expressed as

k(x,y)\displaystyle k(x,y) =exp(xy24σ2)If(x+y2σ),\displaystyle=\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)I_{f}\mathopen{}\mathclose{{}\left(\frac{\|x+y\|}{\sqrt{2}\sigma}}\right), (A.4)

where

If(u)\displaystyle I_{f}(u) ωd2d+p(2π)d20xd1+2pf(x)gd,u(x)𝑑x,\displaystyle\coloneqq\frac{\omega_{d}}{2^{d+p}(2\pi)^{\frac{d}{2}}}\int_{0}^{\infty}\frac{x^{d-1+2p}}{f(x)}g_{d,u}\mathopen{}\mathclose{{}\left(x}\right)\,dx, (A.5)

and gd,u(x)g_{d,u}(x) is the density of Z\|Z\| when Z𝒩(μ,Id)Z\sim\mathcal{N}(\mu,I_{d}) with μ=u\|\mu\|=u.

Proof.

First we establish that ψx\psi_{x} is square integrable. By the assumed lower bound in (A.2) and the fact that ϕ2(y/2)=(2π)d/2ϕ(y)\phi^{2}(y/\sqrt{2})=(2\pi)^{-d/2}\phi(y), we can write

d|ψx(z)|2𝑑z\displaystyle\int_{\mathbb{R}^{d}}|\psi_{x}(z)|^{2}\,dz Cd,padexp(bz2)ϕ(z2xσ)𝑑z.\displaystyle\leq\frac{C_{d,p}}{a}\int_{\mathbb{R}^{d}}\exp\mathopen{}\mathclose{{}\left(b\|z\|^{2}}\right)\phi\Big{(}z-\frac{\sqrt{2}x}{\sigma}\Big{)}\,dz. (A.6)

This integral is the moment generating function of the non-central chi-square distribution with dd degrees of freedom and non-centrality parameter 2x2/σ22\|x\|^{2}/\sigma^{2} evaluated at bb. Under the assumption b<1/2b<1/2, this integral is finite.
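Explicitly, for W\sim\mathcal{N}(\mu,I_{d}) with \mu=\sqrt{2}x/\sigma, completing the square in each coordinate gives

\int\exp(b\|z\|^{2})\,\phi(z-\mu)\,dz=\mathbb{E}\big[e^{b\|W\|^{2}}\big]=(1-2b)^{-d/2}\exp\Big(\frac{b\|\mu\|^{2}}{1-2b}\Big),

which is finite if and only if b<1/2b<1/2.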

To establish the form given in (A.4) we can expand the squares to obtain:

ϕ(z2xσ)ϕ(z2yσ)\displaystyle\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}}\right)\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{y}{\sigma}}\right) =(2π)d2exp(xy24σ2)ϕ(zx+y2σ).\displaystyle=(2\pi)^{-\frac{d}{2}}\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)\phi\mathopen{}\mathclose{{}\left(z-\frac{x+y}{\sqrt{2}\sigma}}\right).

Since the first factor does not depend on zz, it follows that

k(x,y)\displaystyle k(x,y) =ωd2d+p(2π)d2exp(xy24σ2)dzd1+2pf(z)ϕ(zx+y2σ)𝑑z.\displaystyle=\frac{\omega_{d}}{2^{d+p}(2\pi)^{\frac{d}{2}}}\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)\int_{\mathbb{R}^{d}}\frac{\|z\|^{d-1+2p}}{f(\|z\|)}\phi\mathopen{}\mathclose{{}\left(z-\frac{x+y}{\sqrt{2}\sigma}}\right)\,dz.

In this case, we recognize the integral as the expectation of d1+2p/f()\|\cdot\|^{d-1+2p}/f(\cdot) under the non-central chi distribution with dd degrees of freedom and non-centrality parameter u=x+y/(2σ)u=\|x+y\|/(\sqrt{2}\sigma), whose density is gd,ug_{d,u}; this yields (A.4) and (A.5). ∎

A.1 Proof of Theorem 2

The following result is an immediate consequence of (Villani, 2008, Theorem 6.13) adapted to the notation of this paper.

Lemma A.2 ((Villani, 2008, Theorem 6.13)).

For any P,Q𝒫p(d)P,Q\in\mathcal{P}_{p}(\mathbb{R}^{d}),

𝒯p(P,Q)2max(p1,0)xpd|PQ|(x),\displaystyle\mathcal{T}_{p}(P,Q)\leq 2^{\max(p-1,0)}\int\|x\|^{p}\,d|P-Q|(x), (A.7)

where |PQ||P-Q| denotes the absolute variation of the signed measure PQP-Q.

To proceed, let pσ(z)=dϕσ(zx)𝑑P(x)p_{\sigma}(z)=\int_{\mathbb{R}^{d}}\phi_{\sigma}(z-x)\,dP(x) and qσ(z)=dϕσ(zx)𝑑Q(x)q_{\sigma}(z)=\int_{\mathbb{R}^{d}}\phi_{\sigma}(z-x)\,dQ(x) denote the probability density functions of P𝒩σP\ast\mathcal{N}_{\sigma} and Q𝒩σQ\ast\mathcal{N}_{\sigma}, respectively. By Lemma A.2, the OT distance between P𝒩σP\ast\mathcal{N}_{\sigma} and Q𝒩σQ\ast\mathcal{N}_{\sigma} is bounded from above by the weighted total variation distance:

𝒯p(σ)(P,Q)2max(p1,0)zp|pσ(z)qσ(z)|𝑑z.\displaystyle\mathcal{T}^{(\sigma)}_{p}(P,Q)\leq 2^{\max(p-1,0)}\int\|z\|^{p}\mathopen{}\mathclose{{}\left|p_{\sigma}(z)-q_{\sigma}(z)}\right|\,dz. (A.8)

In the following we will show that 2max(p1,0)σpγk(P,Q)2^{\max(p-1,0)}\sigma^{p}\gamma_{k}(P,Q) provides an upper bound on the right-hand side of (A.8). To proceed, recall that the kernel MMD can be expressed as

γk2(P,Q)=𝔼[k(X,X)]+𝔼[k(Y,Y)]2𝔼[k(X,Y)],\displaystyle\gamma^{2}_{k}(P,Q)=\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]+\mathbb{E}\mathopen{}\mathclose{{}\left[k(Y,Y^{\prime})}\right]-2\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,Y)}\right], (A.9)

where X,XX,X^{\prime} are iid PP and Y,YY,Y^{\prime} are iid QQ. The assumptions k(x,x)dP(x)<\int\sqrt{k(x,x)}\,dP(x)<\infty and k(x,x)dQ(x)<\int\sqrt{k(x,x)}\,dQ(x)<\infty ensure that these expectations are finite, and so, by Fubini’s theorem, we can interchange the order of integration:

𝔼[k(X,Y)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,Y)}\right] =k(x,y)𝑑P(x)𝑑Q(y)=(ψx(z)𝑑P(x))(ψx(z)𝑑Q(x))𝑑z.\displaystyle=\int k(x,y)\,dP(x)\,dQ(y)=\int\mathopen{}\mathclose{{}\left(\int\psi_{x}(z)\,dP(x)}\right)\mathopen{}\mathclose{{}\left(\int\psi_{x}(z)\,dQ(x)}\right)\,dz.

For each zdz\in\mathbb{R}^{d}, it follows that

ψx(z)𝑑P(x)\displaystyle\int\psi_{x}(z)\,dP(x) =ωd2d+p2zd1+2p2f(z)dϕ(z2xσ)𝑑P(x)\displaystyle=\frac{\sqrt{\omega_{d}}}{2^{\frac{d+p}{2}}}\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}\int_{\mathbb{R}^{d}}\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}}\right)\,dP(x)
=σdωd2d+p2zd1+2p2f(z)pσ(σz2),\displaystyle=\frac{\sigma^{d}\sqrt{\omega_{d}}}{2^{\frac{d+p}{2}}}\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}p_{\sigma}\mathopen{}\mathclose{{}\left(\frac{\sigma z}{\sqrt{2}}}\right),

and this leads to

𝔼[k(X,Y)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,Y)}\right] =σ2dωd2d+pzd1+2pf(z)pσ(σz2)qσ(σz2)𝑑z\displaystyle=\frac{\sigma^{2d}\omega_{d}}{2^{d+p}}\int\frac{\|z\|^{d-1+2p}}{f(\|z\|)}p_{\sigma}\mathopen{}\mathclose{{}\left(\frac{\sigma z}{\sqrt{2}}}\right)q_{\sigma}\mathopen{}\mathclose{{}\left(\frac{\sigma z}{\sqrt{2}}}\right)\,dz
=σ2pz2prσ(z)pσ(z)qσ(z)𝑑z,\displaystyle=\sigma^{-2p}\int\frac{\|z\|^{2p}}{r_{\sigma}(\|z\|)}p_{\sigma}(z)q_{\sigma}(z)\,dz,

where rσ(x)2σf(2σx)/(ωdxd1)r_{\sigma}(x)\coloneqq\frac{\sqrt{2}}{\sigma}f(\frac{\sqrt{2}}{\sigma}x)/(\omega_{d}x^{d-1}). Combining this expression with (A.9) leads to

γk2(P,Q)=σ2pz2prσ(z)(pσ(z)qσ(z))2𝑑z.\displaystyle\gamma^{2}_{k}(P,Q)=\sigma^{-2p}\int\frac{\|z\|^{2p}}{r_{\sigma}(\|z\|)}\mathopen{}\mathclose{{}\left(p_{\sigma}(z)-q_{\sigma}(z)}\right)^{2}\,dz.

Finally, we note that zrσ(z)z\mapsto r_{\sigma}(\|z\|) is a probability density function on d\mathbb{R}^{d} (it is non-negative and integrates to one) and so by Jensen’s inequality and the convexity of the square,

γk2(P,Q)\displaystyle\gamma^{2}_{k}(P,Q) σ2p(zp|pσ(z)qσ(z)|𝑑z)2.\displaystyle\geq\sigma^{-2p}\mathopen{}\mathclose{{}\left(\int\|z\|^{p}\mathopen{}\mathclose{{}\left|p_{\sigma}(z)-q_{\sigma}(z)}\right|\,dz}\right)^{2}.

In view of (A.8), this establishes the desired result.

A.2 Proof of Theorem 3

The fact that the kernel MMD provides an upper bound on 𝒯p(σ)(P,Q)\mathcal{T}_{p}^{(\sigma)}(P,Q) follows directly from Theorem 2. All that remains to be shown is that k(x,x)\sqrt{k(x,x)} is integrable with respect to any probability measure with a finite ss-th moment, where s=(d+2p+ϵ)/2s=(d+2p+\epsilon)/2. To this end, we note that by the triangle inequality and the elementary inequality (a+b)r2max(r1,0)(ar+br)(a+b)^{r}\leq 2^{\max(r-1,0)}(a^{r}+b^{r}),

Md,u(r)2max(r1,0)(Md(r)+ur),\displaystyle M_{d,u}(r)\leq 2^{\max(r-1,0)}\mathopen{}\mathclose{{}\left(M_{d}(r)+u^{r}}\right),

for all r0r\geq 0. Under the assumptions on ϵ\epsilon, we have 0d+2pϵ<d+2p+ϵ2s0\leq d+2p-\epsilon<d+2p+\epsilon\leq 2s and so there exists a constant Cd,p,ϵ,λC_{d,p,\epsilon,\lambda} such that

k(x,y)Cd,p,ϵ,λ(1+x2s+y2s).\displaystyle k(x,y)\leq C_{d,p,\epsilon,\lambda}\mathopen{}\mathclose{{}\left(1+\|x\|^{2s}+\|y\|^{2s}}\right).

Thus, the existence of finite ss-th moment is sufficient to ensure that k(x,x)\sqrt{k(x,x)} is integrable.

Appendix B Convergence Rate

This section provides proofs for the results in Section 4 of the main text as well as Theorem 1. To simplify the notation, we define r=d+2pr=d+2p and let Y=(2/σ)X+ZY=(\sqrt{2}/\sigma)X+Z where Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}).

Let us first consider some properties of 𝔼[k(X,X)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]. Since the two-moment kernel satisfies k(x,x)=αd,pJ((2/σ)x)k(x,x)=\alpha_{d,p}\,J((\sqrt{2}/\sigma)\|x\|), it follows from the definition of J()J(\cdot) that

𝔼[k(X,X)]=αd,p2ϵ(λϵ𝔼[Yrϵ]+λϵ𝔼[Yr+ϵ]).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]=\frac{\alpha_{d,p}}{2\epsilon}\mathopen{}\mathclose{{}\left(\lambda^{\epsilon}\,\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r-\epsilon}}\right]+\lambda^{-\epsilon}\,\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]}\right). (A.10)

Suppose that there exist numbers MM_{-} and M+M_{+} such that

𝔼[Yrϵ]M,𝔼[Yr+ϵ]M+.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r-\epsilon}}\right]\leq M_{-},\quad\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]\leq M_{+}. (A.11)

Choosing

λ=(M+/M)1/(2ϵ),\displaystyle\lambda=(M_{+}/M_{-})^{1/(2\epsilon)}, (A.12)

leads to

𝔼[k(X,X)]αd,pϵMM+.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\alpha_{d,p}}{\epsilon}\sqrt{M_{-}M_{+}}. (A.13)

In other words, optimizing the choice of λ\lambda results in an upper bound on 𝔼[k(X,X)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right] that depends on only the geometric mean of the upper bounds on 𝔼[Yr±ϵ]\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r\pm\epsilon}}\right].
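For completeness, the optimizing choice of \lambda follows from a one-line calculation: setting the derivative of \lambda\mapsto\lambda^{\epsilon}M_{-}+\lambda^{-\epsilon}M_{+} to zero gives

\frac{d}{d\lambda}\big(\lambda^{\epsilon}M_{-}+\lambda^{-\epsilon}M_{+}\big)=\epsilon\lambda^{\epsilon-1}M_{-}-\epsilon\lambda^{-\epsilon-1}M_{+}=0\iff\lambda^{2\epsilon}=M_{+}/M_{-},

and at this value the sum equals 2\sqrt{M_{-}M_{+}}, which yields (A.13).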

Lemma A.3.

Let XdX\in\mathbb{R}^{d} be a random vector satisfying

(𝔼[Xs])1sm(s),\displaystyle\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]}\right)^{\frac{1}{s}}\leq m(s), (A.14)

for some function m(s)m(s) for s1s\geq 1. Then, if rϵ1r-\epsilon\geq 1, (A.11) holds with

M±\displaystyle M_{\pm} =((M¯d(r±ϵ))1r±ϵ+2σm(r±ϵ))r±ϵ,\displaystyle=\mathopen{}\mathclose{{}\left(\mathopen{}\mathclose{{}\left(\overline{M}_{d}(r\pm\epsilon)}\right)^{\frac{1}{r\pm\epsilon}}+\frac{\sqrt{2}}{\sigma}m(r\pm\epsilon)}\right)^{r\pm\epsilon}, (A.15)

where M¯d(s)=(d+s)(d+s1)/2d(d1)/2es/2\overline{M}_{d}(s)=(d+s)^{(d+s-1)/2}d^{-(d-1)/2}e^{-s/2}.

Proof.

This result follows from Minkowski’s inequality, which gives

(𝔼[Ys])1s\displaystyle\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{s}}\right]}\right)^{\frac{1}{s}} (𝔼[Zs])1s+2σ(𝔼[Xs])1s\displaystyle\leq\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Z\|^{s}}\right]}\right)^{\frac{1}{s}}+\frac{\sqrt{2}}{\sigma}\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]}\right)^{\frac{1}{s}}

for all s1s\geq 1, together with the upper bound on Md(s)=𝔼[Zs]M_{d}(s)=\mathbb{E}\mathopen{}\mathclose{{}\left[\|Z\|^{s}}\right] from Theorem 5. ∎

B.1 Proof of Theorem 1

The result follows immediately by combining Theorem 4 and Equation (11) in the main text.

B.2 Proof of Theorem 4

By Lyapunov’s inequality and Minkowski’s inequality, it follows that for t{r±ϵ}t\in\{r\pm\epsilon\},

(𝔼[Yt])1t\displaystyle\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{t}}\right]}\right)^{\frac{1}{t}} (𝔼[Yr+ϵ])1r+ϵ\displaystyle\leq\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]}\right)^{\frac{1}{r+\epsilon}}
(𝔼[Zr+ϵ])1r+ϵ+2σ(𝔼[Xr+ϵ])1r+ϵ\displaystyle\leq\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Z\|^{r+\epsilon}}\right]}\right)^{\frac{1}{r+\epsilon}}+\frac{\sqrt{2}}{\sigma}\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{r+\epsilon}}\right]}\right)^{\frac{1}{r+\epsilon}}
d+r+ϵ+2mσ,\displaystyle\leq\sqrt{d+r+\epsilon}+\frac{\sqrt{2}m}{\sigma},

where the last step holds because Md(q)(d+q)q/2M_{d}(q)\leq(d+q)^{q/2} and the assumption (𝔼[Xs])1sm\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]}\right)^{\frac{1}{s}}\leq m. Thus, choosing λ\lambda according to (A.12), which here equals d+r+ϵ+2m/σ\sqrt{d+r+\epsilon}+\sqrt{2}m/\sigma, the bound in (A.13) becomes

𝔼[k(X,X)]αd,pϵ(d+r+ϵ+2mσ)r.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\alpha_{d,p}}{\epsilon}\mathopen{}\mathclose{{}\left(\sqrt{d+r+\epsilon}+\frac{\sqrt{2}m}{\sigma}}\right)^{r}.

Recalling that r=d+2pr=d+2p gives the stated result.

B.3 Proof of Theorem 5

Lemma A.4.

Let XdX\in\mathbb{R}^{d} be a sub-gamma random vector with parameters (v,b)(v,b). For all s[0,)s\in[0,\infty) and λ(0,1/b)\lambda\in(0,1/b),

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] 2π2s2Γ(s+12)(sλe)sexp(λ2v2(1λb))Md(s),\displaystyle\leq\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\exp\mathopen{}\mathclose{{}\left(\frac{\lambda^{2}v}{2(1-\lambda b)}}\right)M_{d}(s), (A.16)

where Md(s)2s2Γ(d+s2)/Γ(d2)M_{d}(s)\coloneqq 2^{\frac{s}{2}}\Gamma(\frac{d+s}{2})/\Gamma(\frac{d}{2}). In particular, if λ=((sb)2+4vssb)/(2v)\lambda=(\sqrt{(sb)^{2}+4vs}-sb)/(2v), then

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] (v+(sb2)2+sb2)s2π2s2Γ(s+12)(se)s2Md(s).\displaystyle\leq\mathopen{}\mathclose{{}\left(\sqrt{v+\mathopen{}\mathclose{{}\left(\frac{\sqrt{s}b}{2}}\right)^{2}}+\frac{\sqrt{s}b}{2}}\right)^{s}\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{e}}\right)^{\frac{s}{2}}M_{d}(s). (A.17)
Proof.

Let Y=ZXY=Z^{\top}X where Z=(Z1,,Zd)Z=(Z_{1},\dots,Z_{d}) is independent of XX and distributed uniformly on the unit sphere in d\mathbb{R}^{d}. Since ZZ is orthogonally invariant, it may be assumed that X=(X,0,,0)X=(\|X\|,0,\dots,0) and thus YY is equal in distribution to Z1XZ_{1}\|X\|. Therefore,

𝔼[|Y|s]=𝔼[|Z1|s]𝔼[Xs].\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[|Y|^{s}}\right]=\mathbb{E}\mathopen{}\mathclose{{}\left[|Z_{1}|^{s}}\right]\,\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right].

The variable Z1Z_{1} has density function

f(z)=Γ(d2)πΓ(d12)(1z2)(d3)/2,z[1,1],\displaystyle f(z)=\frac{\Gamma(\frac{d}{2})}{\sqrt{\pi}\Gamma(\frac{d-1}{2})}(1-z^{2})^{(d-3)/2},\quad z\in[-1,1],

and so the moments are given by

𝔼[|Z1|s]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[|Z_{1}|^{s}}\right] =2Γ(d2)πΓ(d12)01zs(1z2)(d3)/2𝑑z=Γ(d2)Γ(s+12)πΓ(d+s2).\displaystyle=\frac{2\Gamma(\frac{d}{2})}{\sqrt{\pi}\Gamma(\frac{d-1}{2})}\int_{0}^{1}z^{s}(1-z^{2})^{(d-3)/2}\,dz=\frac{\Gamma(\frac{d}{2})\Gamma(\frac{s+1}{2})}{\sqrt{\pi}\Gamma(\frac{d+s}{2})}.

To bound the absolute moments of YY we use the basic inequality uexp(u1)u\leq\exp(u-1) with u=λ|y|/su=\lambda|y|/s, which leads to

|y|s(sλe)sexp(λ|y|)(sλe)s(eλy+eλy),\displaystyle|y|^{s}\leq\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\exp(\lambda|y|)\leq\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}(e^{\lambda y}+e^{-\lambda y}),

for all s,λ(0,)s,\lambda\in(0,\infty). Noting that YY is equal in distribution to Y-Y and then using the sub-gamma assumption along with the fact that ZZ is a unit vector yields

𝔼[|Y|s]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[|Y|^{s}}\right] 2(sλe)s𝔼[exp(λY)]\displaystyle\leq 2\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\lambda Y)}\right]
=2(sλe)s𝔼[exp(λZX)]\displaystyle=2\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\lambda Z^{\top}X)}\right]
2(sλe)sexp(λ2v2(1λb)).\displaystyle\leq 2\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\exp\mathopen{}\mathclose{{}\left(\frac{\lambda^{2}v}{2(1-\lambda b)}}\right).

Combining the above displays yields (A.16).

Finally, under the specified value of λ\lambda it follows that

λ2v1λb=s,sλ\displaystyle\frac{\lambda^{2}v}{1-\lambda b}=s,\qquad\frac{\sqrt{s}}{\lambda} =v+(sb2)2+sb2\displaystyle=\sqrt{v+\mathopen{}\mathclose{{}\left(\frac{\sqrt{s}b}{2}}\right)^{2}}+\frac{\sqrt{s}b}{2}

and plugging this expression back into the bound gives (A.17). ∎

Theorem 5 now follows as a corollary of Lemma A.4. Starting with (A.17) and using the basic inequality a2+b2a+b\sqrt{a^{2}+b^{2}}\leq a+b leads to

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] (v+sb)s2π2s2Γ(s+12)(se)s2Md(s).\displaystyle\leq\mathopen{}\mathclose{{}\left(\sqrt{v}+\sqrt{s}b}\right)^{s}\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{e}}\right)^{\frac{s}{2}}M_{d}(s).

To simplify the expressions involving the Gamma functions we use the lower bound logΓ(z)(z12)logzz+12log(2π)\log\Gamma(z)\geq(z-\frac{1}{2})\log z-z+\frac{1}{2}\log(2\pi) for z>0z>0, which leads to

2π2s2Γ(s+12)(se)s2\displaystyle\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{e}}\right)^{\frac{s}{2}} 2e(ss+1)s2.\displaystyle\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\frac{s}{s+1}}\right)^{\frac{s}{2}}.

Combining this bound with the expression above yields

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] 2e(v+sb)s(ss+1)s2Md(s)\displaystyle\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\sqrt{v}+\sqrt{s}b}\right)^{s}\mathopen{}\mathclose{{}\left(\frac{s}{s+1}}\right)^{\frac{s}{2}}M_{d}(s)
2e(v+sb)sMd(s).\displaystyle\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\sqrt{v}+\sqrt{s}b}\right)^{s}M_{d}(s).

This completes the proof of Theorem 5.

B.4 Proof of Theorem 6

Since Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}) is sub-gamma with parameters (1,0)(1,0) it follows that Y=(2/σ)X+ZY=(\sqrt{2}/\sigma)X+Z is sub-gamma with parameters (1+2v/σ2,2b/σ)(1+2v/\sigma^{2},\sqrt{2}b/\sigma). For t>rt>-r we can apply Theorem 5 to obtain

𝔼[Yr+t]2e(1+2v/σ2+r+t2b/σ)r+tM¯d(r+t)=2eσr+tm(r+t).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+t}}\right]\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\sqrt{1+2v/\sigma^{2}}+\sqrt{r+t}\sqrt{2}b/\sigma}\right)^{r+t}\overline{M}_{d}(r+t)=\frac{\sqrt{2e}}{\sigma^{r+t}}m(r+t).

Under the specified value of λ=(m(ϵ)/m(ϵ))1/(2ϵ)\lambda=(m(\epsilon)/m(-\epsilon))^{1/(2\epsilon)}, it then follows from (A.13) that

𝔼[k(X,X)]2eαd,pσrϵm(ϵ)m(ϵ).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\sqrt{2e}\alpha_{d,p}}{\sigma^{r}\epsilon}\sqrt{m(-\epsilon)m(\epsilon)}. (A.18)

To proceed, let (v,b)=(σ2+2v,2b)(v^{\prime},b^{\prime})=(\sigma^{2}+2v,\sqrt{2}b) and consider the decomposition

log(m(ϵ)m(ϵ))\displaystyle\log(m(-\epsilon)m(\epsilon)) =2logm(0)+A+B,\displaystyle=2\log m(0)+A+B,

where

A\displaystyle A (rϵ)log(v+rϵb)+(r+ϵ)log(v+r+ϵb)2rlog(v+rb)\displaystyle\coloneqq(r-\epsilon)\log\mathopen{}\mathclose{{}\left(\sqrt{v^{\prime}}+\sqrt{r-\epsilon}\,b^{\prime}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(\sqrt{v^{\prime}}+\sqrt{r+\epsilon}\,b^{\prime}}\right)-2r\log(\sqrt{v^{\prime}}+\sqrt{r}b^{\prime})
B\displaystyle B logM¯d(rϵ)+logM¯d(r+ϵ)2logM¯d(r).\displaystyle\coloneqq\log\overline{M}_{d}(r-\epsilon)+\log\overline{M}_{d}(r+\epsilon)-2\log\overline{M}_{d}(r).

Using the basic inequalities 1+x1x/2\sqrt{1+x}-1\leq x/2 and log(1+x)x\log(1+x)\leq x, the term AA can be bounded from above as follows:

A\displaystyle A =(rϵ)log(1+(rϵr)bv+rb)+(r+ϵ)log(1+(r+ϵr)bv+rb)\displaystyle=(r-\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{(\sqrt{r-\epsilon}-\sqrt{r})b^{\prime}}{\sqrt{v^{\prime}}+\sqrt{r}b^{\prime}}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{(\sqrt{r+\epsilon}-\sqrt{r})b^{\prime}}{\sqrt{v^{\prime}}+\sqrt{r}b^{\prime}}}\right)
(rϵ)log(1+rϵrr)+(r+ϵ)log(1+r+ϵrr)\displaystyle\leq(r-\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{\sqrt{r-\epsilon}-\sqrt{r}}{\sqrt{r}}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{\sqrt{r+\epsilon}-\sqrt{r}}{\sqrt{r}}}\right)
(rϵ)log(1ϵ2r)+(r+ϵ)log(1+ϵ2r)\displaystyle\leq(r-\epsilon)\log\mathopen{}\mathclose{{}\left(1-\frac{\epsilon}{2r}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{\epsilon}{2r}}\right)
(rϵ)ϵ2r+(r+ϵ)ϵ2r\displaystyle\leq-(r-\epsilon)\frac{\epsilon}{2r}+(r+\epsilon)\frac{\epsilon}{2r}
=ϵ2r.\displaystyle=\frac{\epsilon^{2}}{r}.

Similarly, one finds that

B\displaystyle B =d+rϵ12log(1ϵd+r)+d+r+ϵ12log(1+ϵd+r)ϵ2d+r.\displaystyle=\frac{d+r-\epsilon-1}{2}\log\mathopen{}\mathclose{{}\left(1-\frac{\epsilon}{d+r}}\right)+\frac{d+r+\epsilon-1}{2}\log\mathopen{}\mathclose{{}\left(1+\frac{\epsilon}{d+r}}\right)\leq\frac{\epsilon^{2}}{d+r}.

Combining these bounds with the fact that rdr\geq d leads to

m(ϵ)m(ϵ)\displaystyle\sqrt{m(-\epsilon)m(\epsilon)} (v+rb)rM¯d(r)exp(3ϵ24d).\displaystyle\leq(\sqrt{v^{\prime}}+\sqrt{r}b^{\prime})^{r}\overline{M}_{d}(r)\exp\mathopen{}\mathclose{{}\left(\frac{3\epsilon^{2}}{4d}}\right).

Plugging this inequality back into (A.18) yields

𝔼[k(X,X)]2eαd,pϵ(1+2v/σ2+2rb/σ)rM¯d(r)exp(3ϵ24d).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\sqrt{2e}\alpha_{d,p}}{\epsilon}(\sqrt{1+2v/\sigma^{2}}+\sqrt{2r}b/\sigma)^{r}\overline{M}_{d}(r)\exp\mathopen{}\mathclose{{}\left(\frac{3\epsilon^{2}}{4d}}\right). (A.19)

Finally, by the lower bound logΓ(z)(z12)logzz+12log(2π)\log\Gamma(z)\geq(z-\frac{1}{2})\log z-z+\frac{1}{2}\log(2\pi) and the basic inequality (1+p/d)dep(1+p/d)^{d}\leq e^{p} for p,d0p,d\geq 0 we can write

αd,pM¯d(r)\displaystyle\alpha_{d,p}\overline{M}_{d}(r) π(d+p)d+pepdddd+pπ(d+p)pd.\displaystyle\leq\frac{\sqrt{\pi}(d+p)^{d+p}}{e^{p}d^{d}}\frac{d}{\sqrt{d+p}}\leq\sqrt{\pi}(d+p)^{p}\sqrt{d}.

Hence,

𝔼[k(X,X)]2πe(d+p)pd(1+2v/σ2+2rb/σ)rexp(3ϵ24d)ϵ.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\sqrt{2\pi e}(d+p)^{p}\sqrt{d}\mathopen{}\mathclose{{}\left(\sqrt{1+2v/\sigma^{2}}+\sqrt{2r}b/\sigma}\right)^{r}\frac{\exp\mathopen{}\mathclose{{}\left(\frac{3\epsilon^{2}}{4d}}\right)}{\epsilon}.

This bound holds for all ϵ(0,r]\epsilon\in(0,r]. Evaluating with ϵ=d\epsilon=\sqrt{d} and recalling that r=d+2pr=d+2p gives the stated result.

B.5 Proof of Lemma 7

Note that QijQ_{ij} is absolutely continuous with respect to PPP\otimes P and let λij=dQij/d(PP)\lambda_{ij}=dQ_{ij}/d(P\otimes P) denote the Radon-Nikodym derivative. Then

rij\displaystyle r_{ij} =k(λij1)d(PP)\displaystyle=\int k(\lambda_{ij}-1)d(P\otimes P)
=k(λij+1)(λij1)d(PP)\displaystyle=\int k(\sqrt{\lambda_{ij}}+1)(\sqrt{\lambda_{ij}}-1)d(P\otimes P)
2(k2𝑑Qij+k2d(PP))dH(Qij,PP),\displaystyle\leq\sqrt{2}\mathopen{}\mathclose{{}\left(\sqrt{\int k^{2}dQ_{ij}}+\sqrt{\int k^{2}d(P\otimes P)}}\right)d_{H}(Q_{ij},P\otimes P),

where the last step follows from the Cauchy-Schwarz and Minkowski inequalities, and we have used the fact that dH2(Qij,PP)=12(λij1)2d(PP)d^{2}_{H}(Q_{ij},P\otimes P)=\frac{1}{2}\int(\sqrt{\lambda_{ij}}-1)^{2}d(P\otimes P).

Next, since k2(x,y)k(x,x)k(y,y)k^{2}(x,y)\leq k(x,x)k(y,y) for any positive semidefinite kernel, and since both marginals of QijQ_{ij} are equal to PP, it follows from the Cauchy-Schwarz inequality that

k2(x,y)𝑑Qij(x,y)k(x,x)k(y,y)𝑑Qij(x,y)k2(x,x)𝑑P(x),\displaystyle\int k^{2}(x,y)dQ_{ij}(x,y)\leq\int k(x,x)k(y,y)dQ_{ij}(x,y)\leq\int k^{2}(x,x)dP(x),

and thus the stated result follows from the assumption 𝔼P[k2(X,X)]Ck,P2\mathbb{E}_{P}\mathopen{}\mathclose{{}\left[k^{2}(X,X)}\right]\leq C_{k,P}^{2}.

Appendix C Experimental Details and Additional Results

In this section, we provide details of the experiments in Section 5 of the main text and additional numerical results. Our experiments are based on the two-moment kernel given in Definition 1.

C.1 Numerical Computation of the Two-Moment Kernel

To evaluate the two-moment kernel given in Definition 1 we need to numerically compute the function Md,u(s)M_{d,u}(s), which is the ss-th moment of the non-central chi-distribution with dd degrees of freedom and parameter uu. For all s0s\geq 0, this function can be written as a Poisson mixture of the (central) moments according to

Md,u(s)\displaystyle M_{d,u}(s) =k=0u2kexp(12u2)2kk!Md+2k,0(s).\displaystyle=\sum_{k=0}^{\infty}\frac{u^{2k}\exp(-\frac{1}{2}u^{2})}{2^{k}k!}M_{d+2k,0}(s). (A.20)

This series can be approximated efficiently by retaining only the terms with ku2/2k\approx u^{2}/2.
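A minimal Python sketch of this truncation (not taken from the paper's implementation; the truncation level and the use of scipy.special.gammaln for numerical stability are implementation choices) is:

import numpy as np
from scipy.special import gammaln

def central_chi_log_moment(d, s):
    # log of M_{d,0}(s) = 2^{s/2} Gamma((d+s)/2) / Gamma(d/2)
    return 0.5 * s * np.log(2.0) + gammaln((d + s) / 2.0) - gammaln(d / 2.0)

def noncentral_chi_moment(d, u, s):
    # s-th moment of the non-central chi distribution via the truncated Poisson
    # mixture (A.20); the weights concentrate around k ~ u^2 / 2.
    rate = 0.5 * u ** 2
    if rate == 0.0:
        return float(np.exp(central_chi_log_moment(d, s)))
    kmax = int(np.ceil(rate + 10.0 * np.sqrt(rate) + 10.0))
    k = np.arange(kmax + 1)
    log_weights = k * np.log(rate) - rate - gammaln(k + 1.0)
    return float(np.sum(np.exp(log_weights + central_chi_log_moment(d + 2 * k, s))))

# Sanity check: M_{d,u}(2) is the mean of the non-central chi-square, namely d + u^2,
# e.g. noncentral_chi_moment(5, 2.0, 2) is approximately 9.0.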

Alternatively, if s=2s=2\ell where \ell is an integer, then Md,u(2)M_{d,u}(2\ell) is the \ell-th moment of the chi-square distribution with dd degrees of freedom and non-centrality parameter u2u^{2}. The integer moments of this distribution can be obtained by differentiating the moment generating function. An explicit formula is given by (Johnson et al., 1995, pg. 448)

Md,u(2)=2Γ(+d/2)j=0(j)(u2/2)jΓ(j+d/2).\displaystyle M_{d,u}(2\ell)=2^{\ell}\Gamma(\ell+d/2)\sum_{j=0}^{\ell}\binom{\ell}{j}\frac{(u^{2}/2)^{j}}{\Gamma(j+d/2)}. (A.21)

Here we see that Md,u(2)M_{d,u}(2\ell) is a degree \ell polynomial in u2u^{2}.
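For even orders the finite sum (A.21) can be evaluated exactly, which also provides a cross-check of the truncated series above; a minimal sketch:

from math import comb
from scipy.special import gamma

def noncentral_chi_even_moment(d, u, ell):
    # M_{d,u}(2*ell): the ell-th moment of the non-central chi-square with d degrees
    # of freedom and non-centrality parameter u^2, per (A.21).
    total = sum(comb(ell, j) * (u ** 2 / 2.0) ** j / gamma(j + d / 2.0) for j in range(ell + 1))
    return 2.0 ** ell * gamma(ell + d / 2.0) * total

# Example: noncentral_chi_even_moment(5, 2.0, 1) equals 9.0, the mean d + u^2.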

Accordingly, for any tuple (d,p,σ,λ,ϵ)(d,p,\sigma,\lambda,\epsilon) such that d+2p±ϵd+2p\pm\epsilon are even integers, the two-moment kernel defined in (11) can be expressed as

k(x,y)=exp(xy24σ2)=0Lc(x+y2σ)2,\displaystyle k(x,y)=\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)\sum_{\ell=0}^{L}c_{\ell}\mathopen{}\mathclose{{}\left(\frac{\|x+y\|}{\sqrt{2}\sigma}}\right)^{2\ell}, (A.22)

where L=(d+2p+ϵ)/2L=(d+2p+\epsilon)/2 and the coefficients c0,,cLc_{0},\dots,c_{L} are given by

cαd,pϵ2Γ(+d/2)[λϵ2LϵΓ(Lϵ+d/2)(Lϵ)𝟏{Lϵ}+λϵ2LΓ(L+d/2)(L)],\displaystyle c_{\ell}\coloneqq\frac{\alpha_{d,p}}{\epsilon 2^{\ell}\Gamma(\ell+d/2)}\mathopen{}\mathclose{{}\left[\lambda^{\epsilon}2^{L-\epsilon}\Gamma(L-\epsilon+d/2)\binom{L-\epsilon}{\ell}\bm{1}_{\{\ell\leq L-\epsilon\}}+\lambda^{-\epsilon}2^{L}\Gamma(L+d/2)\binom{L}{\ell}}\right], (A.23)

with αd,p(2π)2(p+d)2d/2/Γ(d/2)\alpha_{d,p}\coloneqq(2\pi)2^{-(p+d)}2^{-d/2}/\Gamma(d/2).
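Putting the pieces together, a hedged sketch of evaluating the kernel numerically is given below. It assumes the two-moment kernel has the form implied by (A.4) and (A.10), namely the Gaussian factor exp(-‖x-y‖²/(4σ²)) multiplied by (α_{d,p}/(2ϵ))(λ^ϵ M_{d,u}(r-ϵ)+λ^{-ϵ}M_{d,u}(r+ϵ)) with u=‖x+y‖/(√2 σ) and r=d+2p. Since Definition 1 of the main text is not reproduced here, the normalization constant alpha and the moment routine Mdu (for example, noncentral_chi_moment above) are supplied by the caller and should be treated as assumptions.

import numpy as np

def two_moment_kernel(x, y, p, sigma, lam, eps, alpha, Mdu):
    # Assumed form: Gaussian factor times a two-point combination of non-central
    # chi moments evaluated at u = ||x + y|| / (sqrt(2) * sigma); see (A.4) and (A.10).
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d = x.shape[0]
    r = d + 2 * p
    u = np.linalg.norm(x + y) / (np.sqrt(2.0) * sigma)
    J = (lam ** eps * Mdu(d, u, r - eps) + lam ** (-eps) * Mdu(d, u, r + eps)) / (2.0 * eps)
    return float(np.exp(-np.linalg.norm(x - y) ** 2 / (4.0 * sigma ** 2)) * alpha * J)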

C.2 Details for Example 1

We now consider Example 1, a specific example of a sub-gamma distribution which shows that the upper bound in Theorem 6 is tight with respect to the scaling of the dimension dd and the scale parameter bb. Specifically, let X=UZX=\sqrt{U}Z where Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}) is a standard Gaussian vector and UU is an independent Gamma random variable with shape parameter 1/(2b2)1/(2b^{2}) and scale parameter 2b22b^{2}.

Lemma A.5.

For αd\alpha\in\mathbb{R}^{d} such that α<1/b\|\alpha\|<1/b, it holds that

log𝔼[exp(αX)]=12b2log(1α2b2).\displaystyle\log\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\alpha^{\top}X)}\right]=-\frac{1}{2b^{2}}\log\mathopen{}\mathclose{{}\left(1-\|\alpha\|^{2}b^{2}}\right). (A.24)

In particular, this means that XX is a sub-gamma random vector with parameters (1,b)(1,b). Furthermore, for s>max{b2,d}s>\max\{-b^{-2},-d\},

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] =bsMb2(s)Md(s).\displaystyle=b^{s}M_{b^{-2}}(s)M_{d}(s). (A.25)
Proof.

Observe that αX=UαZ\alpha^{\top}X=\sqrt{U}\alpha^{\top}Z where αZ𝒩(0,α2)\alpha^{\top}Z\sim\mathcal{N}(0,\|\alpha\|^{2}). Hence

𝔼[exp(αX)]=𝔼[exp(α22U)].\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\alpha^{\top}X)}\right]=\mathbb{E}\mathopen{}\mathclose{{}\left[\exp\mathopen{}\mathclose{{}\left(\frac{\|\alpha\|^{2}}{2}U}\right)}\right].

Recognizing the right-hand side as the moment generating function of the Gamma distribution evaluated at α2/2\|\alpha\|^{2}/2 and taking logarithms yields (A.24). To see that this distribution satisfies the sub-gamma condition, we use the basic inequality log(1x)x/(1x)x/(1x)-\log(1-x)\leq x/(1-x)\leq x/(1-\sqrt{x}) for all x(0,1)x\in(0,1).

The expression for the moments follows immediately from the independence of UU and ZZ and the fact that U/b2U/b^{2} has a Gamma distribution with shape parameter b2/2b^{-2}/2 and scale parameter 22, which implies that U/b\sqrt{U}/b has a chi distribution with b2b^{-2} degrees of freedom. ∎
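As a quick numerical sanity check of (A.25) (not part of the paper's experiments), one can sample X=√U Z directly and compare the empirical moment with the closed form; the parameter values below are illustrative choices.

import numpy as np
from scipy.special import gamma

def chi_moment(nu, s):
    # M_nu(s) = 2^{s/2} Gamma((nu + s)/2) / Gamma(nu/2)
    return 2.0 ** (s / 2.0) * gamma((nu + s) / 2.0) / gamma(nu / 2.0)

rng = np.random.default_rng(0)
d, b, s, n = 5, 0.5, 3, 200_000
U = rng.gamma(shape=1.0 / (2 * b ** 2), scale=2 * b ** 2, size=n)
Z = rng.standard_normal((n, d))
X = np.sqrt(U)[:, None] * Z

empirical = np.mean(np.linalg.norm(X, axis=1) ** s)
exact = b ** s * chi_moment(1.0 / b ** 2, s) * chi_moment(d, s)
# empirical and exact should agree up to Monte-Carlo error (both are roughly 15.0 here).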

Since XX satisfies the sub-gamma condition with parameters (1,b)(1,b) the upper bound in Theorem 6 applies. Alternatively, for each pair (ϵ,λ)(\epsilon,\lambda) we can consider the exact expression for 𝔼[k(X,X)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right] given in (A.10) where r=d+2pr=d+2p and

Y=(2σ2U+1)1/2Z.Y=\mathopen{}\mathclose{{}\left(\frac{2}{\sigma^{2}}U+1}\right)^{1/2}Z.

Since (A.10) holds for whichever λ\lambda defines the kernel, the right-hand side is bounded from below by its minimum over λ\lambda, which yields

𝔼[k(X,X)]αd,pϵ(𝔼[Yrϵ]𝔼[Yr+ϵ])1/2.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\geq\frac{\alpha_{d,p}}{\epsilon}\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r-\epsilon}}\right]\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]}\right)^{1/2}. (A.26)

To get a lower bound on the moments, we use

𝔼[Ys]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{s}}\right] (2σ)s𝔼[Xs].\displaystyle\geq\mathopen{}\mathclose{{}\left(\frac{\sqrt{2}}{\sigma}}\right)^{s}\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]. (A.27)

Combining the above displays leads to (22). Using Stirling’s approximation logΓ(z)=(z12)logzz+12log(2π)+o(1)\log\Gamma(z)=(z-\frac{1}{2})\log z-z+\frac{1}{2}\log(2\pi)+o(1) as zz\to\infty it can be verified that the minimum of this lower bound with respect to ϵ\epsilon satisfies the same scaling behavior with respect to dd as the upper bound in Theorem 6. Namely, the bound is exponential in dd if δ1/2\delta\geq 1/2 and superexponential in dd if δ<1/2\delta<1/2.

C.3 Experiments in Section 5.1

In this experiment, p=1p=1, the random variable XdX\in\mathbb{R}^{d} is generated according to the distribution in Example 1, and the kernel bandwidth σ\sigma takes values 11 and 44. The parameters (λ,ϵ)(\lambda,\epsilon) of the two-moment kernel are specified as in Theorem 6 with parameters (1,b)(1,b), and k(x,y)k(x,y) can be computed as in Appendix C.1.

In the Monte-Carlo computation of the average of Δ^γ2\hat{\Delta}_{\gamma}^{2} (the right column of Figure 2), 2n2n samples of XX are partitioned into two independent datasets {Xi}i=1n\{X_{i}\}_{i=1}^{n} and {Xi}i=1n\{X_{i}^{\prime}\}_{i=1}^{n}, each having nn samples. The kernel MMD (squared) distance has the empirical estimator Gretton et al. (2012)

γk2(Pn,Pn)=1n2i,j=1nk(Xi,Xj)+1n2i,j=1nk(Xi,Xj)2n2i,j=1nk(Xi,Xj),\gamma^{2}_{k}(P_{n},P_{n}^{\prime})=\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j})+\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i}^{\prime},X_{j}^{\prime})-\frac{2}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j}^{\prime}),

and then, by definition,

𝔼[γk2(Pn,Pn)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\gamma^{2}_{k}(P_{n},P_{n}^{\prime})}\right] =2(1n𝔼[k(X,X)]+(11n)𝔼[k(X,X)])2𝔼[k(X,X)]\displaystyle=2\mathopen{}\mathclose{{}\left(\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]+(1-\frac{1}{n})\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]}\right)-2\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]
=2n(𝔼[k(X,X)]𝔼[k(X,X)]).\displaystyle=\frac{2}{n}(\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]-\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]).

Recall that

γk2(P,Pn)=k(x,x)𝑑P(x)𝑑P(x)+1n2i,j=1nk(Xi,Xj)2ni=1nk(x,Xi)𝑑P(x),\gamma^{2}_{k}(P,P_{n})=\int\int k(x,x^{\prime})dP(x)dP(x^{\prime})+\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j})-\frac{2}{n}\sum_{i=1}^{n}\int k(x,X_{i})dP(x),

and then

𝔼[γk2(P,Pn)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\gamma^{2}_{k}(P,P_{n})}\right] =𝔼[k(X,X)]+1n𝔼[k(X,X)]+(11n)𝔼[k(X,X)]2𝔼[k(X,X)]\displaystyle=\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]+\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]+(1-\frac{1}{n})\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]-2\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]
=1n(𝔼[k(X,X)]𝔼[k(X,X)]).\displaystyle=\frac{1}{n}(\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]-\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]).

Thus, if we define

Δ^γ2n2γk2(Pn,Pn),\hat{\Delta}_{\gamma}^{2}\coloneqq\frac{n}{2}\gamma_{k}^{2}(P_{n},P_{n}^{\prime}),

the expectation of Δ^γ2\hat{\Delta}_{\gamma}^{2} equals 𝔼[k(X,X)]𝔼[k(X,X)]=𝔼[nγk2(P,Pn)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]-\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]=\mathbb{E}\mathopen{}\mathclose{{}\left[n\gamma^{2}_{k}(P,P_{n})}\right].
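For reference, a minimal sketch of this Monte-Carlo procedure is given below; kernel denotes any symmetric positive semidefinite kernel function of two vectors (for instance, the two-moment kernel sketch in Appendix C.1), and the estimator is the plug-in form displayed above, including the diagonal terms.

import numpy as np

def mmd_squared(X, Xp, kernel):
    # Plug-in estimate of gamma_k^2(P_n, P_n') as in the display above.
    Kxx = np.array([[kernel(a, b) for b in X] for a in X])
    Kyy = np.array([[kernel(a, b) for b in Xp] for a in Xp])
    Kxy = np.array([[kernel(a, b) for b in Xp] for a in X])
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def delta_hat_squared(samples, kernel):
    # hat{Delta}_gamma^2 = (n/2) * gamma_k^2(P_n, P_n') after splitting 2n samples into halves.
    n = samples.shape[0] // 2
    return (n / 2.0) * mmd_squared(samples[:n], samples[n:2 * n], kernel)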

C.4 Experiments in Section 5.2

In this experiment, d=5d=5, p=1p=1, σ=1/2\sigma=1/2 and the parameters (ϵ,λ)(\epsilon,\lambda) of the two-moment kernel are specified as in Theorem 6 with parameters (1,0)(1,0). Figure 3 in the main text plots the values of γk2(Pn,Pn)\gamma_{k}^{2}\mathopen{}\mathclose{{}\left(P_{n},P^{\prime}_{n}}\right) as a function of increasing NN and for various values of nn. Figure A.1 plots γk2(Pn,Pn)\gamma_{k}^{2}\mathopen{}\mathclose{{}\left(P_{n},P^{\prime}_{n}}\right) as a function of increasing nn and for various values of NN. Note that in this setting, the typical correlation between samples is of magnitude 1/N1/\sqrt{N}, and thus the overall dependence is not negligible when NN is relatively small compared to nn.

Figure A.1: Empirical values of γk2(Pn,Pn)\gamma_{k}^{2}(P_{n},P_{n}^{\prime}) as a function of nn for various values of NN for dependent samples in Example 2, that is, the same experiment as in Figure 3. (Left) Mean and standard deviation averaged over 100 realizations. (Middle) The log10\log_{10} of the mean value in the left plot. (Right) The log-log plot of the mean value in the left plot.