
Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples

Yixing Zhang Department of Electrical and Computer Engineering, Duke University    Xiuyuan Cheng Department of Mathematics, Duke University    Galen Reeves Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University
Abstract

The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating the GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d+2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space.

A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.

1 Introduction

There has been significant interest in optimal transport (OT) distances for data analysis, motivated by applications in statistics and machine learning ranging from computer graphics and image processing Solomon et al. (2014); Ryu et al. (2018); Li et al. (2018) to deep learning Courty et al. (2016); Shen et al. (2017); Bhushan Damodaran et al. (2018); see Peyré and Cuturi (2019). The OT cost between probability measures $P$ and $Q$ with cost function $c(x,y)$ is defined as

\[
\mathcal{T}(P,Q)\coloneqq\inf_{\pi\in\Pi(P,Q)}\int c(x,y)\,d\pi(x,y), \tag{1}
\]

where $\Pi(P,Q)$ is the set of all probability measures whose marginals are $P$ and $Q$. Of central importance to applications in statistics and machine learning is the rate at which the empirical measure $P_n$ of an iid sample approximates the true underlying distribution $P$. In this regard, one of the main challenges for OT distances is that the rate of convergence suffers from the curse of dimensionality: the number of samples $n$ needs to grow exponentially with the dimension $d$ of the data Fournier and Guillin (2015).

On a closely related note, OT also suffers from computational issues, particularly in high-dimensional settings. To address both statistical and computational limitations, recent work has focused on regularized versions of OT, including entropy regularization Cuturi (2013) and Gaussian-smoothed optimal transport (GOT) Goldfeld et al. (2020b). Entropy-regularized OT has attracted intensive theoretical interest Feydy et al. (2019); Klatt et al. (2020); Bigot et al. (2019), as well as an abundance of algorithmic developments Gerber and Maggioni (2017); Abid and Gower (2018); Chakrabarty and Khanna (2020). In comparison, GOT is less understood both in theory and in computation. The goal of the current paper is thus to deepen the theoretical analysis of GOT under more general settings, so as to lay a theoretical foundation for computational study and potential applications.

In particular, we consider distributions that satisfy only a bounded moment condition and general settings involving dependent samples. For the special case of sub-gamma distributions, we show a phase transition depending on the dimension $d$ and with respect to the scale parameter of the sub-gamma distribution. Going beyond the case of iid samples, our convergence rate covers dependent samples as long as a condition on the pairwise dependence quantified by the covariance of the kernel-space feature map is satisfied. A key step in our analysis is to establish a novel connection between the GOT distance and a family of kernel MMD distances, which can be of independent interest. In the kernel MMD upper bound, the kernel is neither bounded nor translation invariant, and is determined by both the cost function of OT and the amount of Gaussian smoothing. The theoretical findings are supported by numerical experiments.

To summarize our contribution, we provide an overview of the main theoretical results in the next subsection, and then close the introduction with a detailed review of related work. After introducing notation and needed preliminaries in Section 2, we derive upper bounds on GOT using the kernel MMD of a new two-moment kernel in Section 3, which leads to the convergence rate results in Section 4 and numerical results in Section 5. All proofs are in the appendix.

1.1 Overview of Main Results

In this paper, we focus on the OT cost $\mathcal{T}_p(P,Q)$ associated with the cost function $c_p(x,y)=\|x-y\|^p$ for $p>0$ and $c_0(x,y)=1_{\{x\neq y\}}$. The total variation distance is given by $\mathcal{T}_0(P,Q)$ and the $p$-Wasserstein distance is given by $\mathcal{T}_p(P,Q)$ if $p\in(0,1]$ and $(\mathcal{T}_p(P,Q))^{1/p}$ if $p>1$ Villani (2003).

The minimax convergence rate of $\mathcal{T}_p(P,P_n)$ was established by (Fournier and Guillin, 2015, Theorem 1), who showed that if $P$ has a moment strictly greater than $2p$, then

\[
\mathbb{E}\left[\mathcal{T}_p(P,P_n)\right]\asymp\begin{cases} n^{-\frac{1}{2}}, & p>d/2\\ n^{-\frac{1}{2}}\log n, & p=d/2\\ n^{-\frac{p}{d}}, & p\in(0,d/2).\end{cases} \tag{2}
\]

Unfortunately, this means that the sample complexity increases exponentially with the dimension for $d>2p$.

Recently, Goldfeld et al. (2020b) showed that one way to overcome the curse of dimensionality is to consider the Gaussian-smoothed OT distance, defined as

\[
\mathcal{T}_p^{(\sigma)}(P,Q)\coloneqq\mathcal{T}_p(P\ast\mathcal{N}_\sigma,\,Q\ast\mathcal{N}_\sigma),
\]

where $\mathcal{N}_\sigma$ denotes the iid Gaussian measure with mean zero and variance $\sigma^2$. Under the assumption that $P$ is sub-Gaussian with constant $v$, they proved an upper bound on the convergence rate that is independent of the dimension:

\[
\mathbb{E}\left[\mathcal{T}^{(\sigma)}_p(P,P_n)\right]\leq\frac{C_{d,p,\sigma,v}}{\sqrt{n}},\quad p\in\{0,1,2\}.
\]

The precise form of the constant $C_{d,p,\sigma,v}$ is provided for $p\in\{0,1\}$ but not for the case $p=2$, unless $P$ is also assumed to have bounded support. Ensuing work by Goldfeld et al. (2020a) established the same convergence rate for $p=1$ under the relaxed assumption that $P$ has a finite moment greater than $2d+2$.

Metric properties of GOT were studied by Goldfeld and Greenewald (2020), who showed that $\mathcal{T}_1^{(\sigma)}(P,Q)$ is a metric on the space of probability measures with finite first moment and that the sequence of optimal couplings converges in the $\sigma\to 0$ limit to the optimal coupling for the unsmoothed Wasserstein distance. Their arguments depend only on the pointwise convergence of the characteristic functions under Gaussian smoothing, and thus also apply to the case of general $p$ considered in this paper.

One of the main contributions of this paper is to prove an upper bound on the convergence rate for all orders of $p$ and under more general assumptions on $P$. Specifically, we prove the following result:

Theorem 1.

Let $P_n$ be the empirical measure of $n$ iid samples from a probability measure $P$ on $\mathbb{R}^d$ that satisfies $(\int\|x\|^s\,dP(x))^{1/s}\leq m$ for some $s>d+2p$. There exists a positive constant $C_{d,p,s}$ such that for all $\sigma>0$,

\[
\mathbb{E}\left[\mathcal{T}^{(\sigma)}_p(P,P_n)\right]\leq C_{d,p,s}\,\frac{\sigma^p}{\sqrt{n}}\left(1+\frac{m}{\sigma}\right)^{\frac{d}{2}+p}. \tag{3}
\]

This result brings the GOT framework in line with the general setting studied by (Fournier and Guillin, 2015, Theorem 1), and shows that the benefits obtained by smoothing extend beyond the special cases of small $p$ and well-controlled tails. To help interpret this result it is important to keep in mind that for $p>1$, the Wasserstein distance is given by the $p$-th root of the GOT. As for the tightness of the bound, there are two regimes worth considering, namely when $\sigma\to 0$ as $n\to\infty$ and when $\sigma$ is fixed. In the former case, the dependence on $\sigma$ seems to be nearly tight. In Section 4.1, we show that if $\sigma\asymp n^{-\frac{1}{d+2p}}$ then Theorem 1 implies an upper bound on the unsmoothed convergence rate

\[
\mathbb{E}\left[\mathcal{T}_p(P,P_n)\right]\leq C_{d,p,s}\,m^p\,n^{-\frac{p}{d+2p}}. \tag{4}
\]

Notice that for $d\ll 2p$ and $d\gg 2p$ this recovers the minimax convergence rate given in (2).

The main technical step in our approach is to establish a novel connection between GOT and a family of kernel MMD distances (Theorem 2). We then show how a particular member of this family, which we call the ‘two-moment’ kernel, defines a metric on the space of probability measures with finite moments strictly greater than $p+d/2$ (Theorem 3).

In addition to Theorem 1, we also provide further results that elucidate the role of the dimension as well as the tail behavior of the underlying distribution (Theorem 6). Furthermore, we address the setting of dependent samples and provide an example illustrating how the connection with MMD can be used to go beyond the usual assumptions involving mixing conditions for stationary processes. Finally, we provide some numerical experiments that support our theory.

1.2 Comparison with Previous Work

The convergence of OT distances continues to be an active area of research Singh and Póczos (2018); Niles-Weed and Berthet (2019); Lei et al. (2020). Building upon the work of Cuturi (2013), a recent line of work has focused on entropy-regularized OT defined by

\[
S_\epsilon(P,Q)\coloneqq\inf_{\pi\in\Pi(P,Q)}\int c(x,y)\,d\pi(x,y)+\epsilon D(\pi\|P\otimes Q)
\]

where $D(\mu\|\nu)=\int\log\frac{d\mu}{d\nu}\,d\mu$ is the relative entropy between probability measures $\mu$ and $\nu$. The addition of the regularization term facilitates numerical approximation using the Sinkhorn algorithm. The amount of regularization interpolates between OT in the $\epsilon\to 0$ limit and the kernel MMD in the $\epsilon\to\infty$ limit; see Feydy et al. (2019). In contrast to the Gaussian-smoothed Wasserstein distance, entropy-regularized OT is not a metric since it does not satisfy the triangle inequality. Convergence rates for entropy-regularized OT were obtained by Genevay et al. (2019) under the assumption of bounded support and more recently by Mena and Niles-Weed (2019) under the assumption of sub-Gaussian tails. Further properties have been studied by Luise et al. (2018) and Klatt et al. (2020).

There has also been work focusing on the sliced Wasserstein distance, which is obtained by averaging the one-dimensional Wasserstein distance over the unit sphere Rabin et al. (2011); Bonneel et al. (2015). While the sliced Wasserstein distance is equivalent to the Wasserstein distance in the sense that convergence in one metric implies convergence in the other, the rates of convergence need not be the same. See Section 1.2 of Goldfeld et al. (2020a) for further discussion.

Going beyond convergence rates for empirical measures, properties of smoothed empirical measures have been studied in a variety of contexts, including the high noise limit Chen and Niles-Weed (2020) and applications to the estimation of mutual information in deep networks Goldfeld et al. (2018). Finally, we note that there has also been some work on convergence with dependent samples by Fournier and Guillin (2015), who focus on OT distance, and also by Young and Dunson (2019), who consider a closely related entropy estimation problem.

2 Preliminaries

Let $\mathcal{P}(\mathbb{R}^d)$ be the space of Borel probability measures on $\mathbb{R}^d$ and let $\mathcal{P}_s(\mathbb{R}^d)$ be the space of probability measures with finite $s$-th moment, i.e., $\int\|x\|^s\,dP(x)<\infty$. The Gaussian measure on $\mathbb{R}^d$ with mean zero and covariance $\sigma^2 I_d$ is denoted by $\mathcal{N}_\sigma$. The convolution of probability measures $P$ and $Q$ is denoted by $P\ast Q$. The Gamma function is given by $\Gamma(z)=\int_0^\infty x^{z-1}e^{-x}\,dx$ for $z>0$. We use $C$ to denote a generic positive real number, and the value of $C$ may change from place to place.

Kernel MMD.

A symmetric function $k\colon\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}$ is said to be a positive-definite kernel on $\mathbb{R}^d$ if and only if for every $x_1,\dots,x_n\in\mathbb{R}^d$, the symmetric matrix $(k(x_i,x_j))_{i,j=1}^n$ is positive semidefinite. A positive-definite kernel $k$ can be used to define a distance on probability measures known as the RKHS MMD Anderson et al. (1994); Gretton et al. (2005); Smola et al. (2007); Gretton et al. (2012). Let $\mathcal{P}^k(\mathbb{R}^d)$ be the space of probability measures such that $\int\sqrt{k(x,x)}\,dP(x)<\infty$. For $P,Q\in\mathcal{P}^k(\mathbb{R}^d)$ the kernel MMD distance is defined as

\[
\gamma^2_k(P,Q)=\iint k(x,y)\,d(P(x)-Q(x))\,d(P(y)-Q(y)).
\]

The distance $\gamma_k(P,Q)$ is a pseudo-metric in general. A kernel $k$ is said to be characteristic to a set $\mathcal{Q}\subseteq\mathcal{P}$ of probability measures if and only if $\gamma_k$ is a metric on $\mathcal{Q}$ Gretton et al. (2012); Sriperumbudur et al. (2010). An alternative representation is given by

\[
\gamma^2_k(P,Q)=\mathbb{E}\left[k(X,X')\right]+\mathbb{E}\left[k(Y,Y')\right]-2\,\mathbb{E}\left[k(X,Y)\right], \tag{5}
\]

where $X,X'\sim P$ iid and $Y,Y'\sim Q$ iid. The kernel MMD distance was shown to be equivalent to the energy distance in Sejdinovic et al. (2013), and a variant form used in practice is the kernel mean embedding statistic Muandet et al. (2017); Chwialkowski et al. (2015); Jitkrittum et al. (2016). Another appealing theoretical property of the kernel MMD distance is its representation via the spectral decomposition of the kernel Epps and Singleton (1986); Fernández et al. (2008), which gives rise to estimation consistency as well as practical algorithms for computing it Zhao and Meng (2015). Kernel MMD has been widely applied in data analysis and machine learning, including independence testing Fukumizu et al. (2008); Zhang et al. (2012) and generative modeling Li et al. (2015, 2017).
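To make the representation (5) concrete, the following is a minimal sketch of the plug-in estimator of $\gamma_k^2(P,Q)$ from two finite samples. The Gaussian RBF kernel, bandwidth, and sample sizes below are illustrative assumptions, not the two-moment kernel introduced in Section 3.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_squared(X, Y, kernel=gaussian_kernel):
    """Plug-in estimate of gamma_k^2(P, Q) from samples X ~ P and Y ~ Q, cf. (5)."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

# Example: two Gaussian samples with different means.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 3))
Y = rng.normal(0.5, 1.0, size=(500, 3))
print(mmd_squared(X, Y))
```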

Magnitude of multivariate Gaussian.

Our results also depend on some properties of the noncentral chi distribution. Let $Z\sim\mathcal{N}(\mu,I_d)$ be a Gaussian vector with mean $\mu$ and identity covariance. The random variable $X=\|Z\|$ has a noncentral chi distribution with parameter $u=\|\mu\|$. The density is given by

\[
g_{d,u}(x)=\frac{e^{-(x^2+u^2)/2}\,x^{d-1}}{2^{\frac{d}{2}-1}}\sum_{k=0}^{\infty}\frac{(ux/2)^{2k}}{k!\,\Gamma\left(\frac{d+2k}{2}\right)}. \tag{6}
\]

The $s$-th moment of this distribution is denoted by $M_{d,u}(s)=\int x^s\,g_{d,u}(x)\,dx$. This function is an even polynomial (in $u$) of degree $s$ whenever $s$ is an even integer (see Appendix C.1). The special case $u=0$ corresponds to the (central) chi distribution and is given by $M_d(s)\coloneqq 2^{\frac{s}{2}}\Gamma(\tfrac{d+s}{2})/\Gamma(\tfrac{d}{2})$.
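These moments are easy to evaluate numerically. The sketch below computes the central moment $M_d(s)$ from the closed form above and checks the noncentral moment $M_{d,u}(s)$ by a Monte Carlo average; the sample size and the test values are illustrative assumptions (the exact polynomial evaluation of $M_{d,u}(s)$ is discussed in Appendix C.1).

```python
import numpy as np
from scipy.special import gammaln

def M_central(d, s):
    """M_d(s) = 2^{s/2} Gamma((d+s)/2) / Gamma(d/2), s-th moment of the chi distribution."""
    return np.exp(0.5 * s * np.log(2) + gammaln((d + s) / 2) - gammaln(d / 2))

def M_noncentral_mc(d, u, s, n=200_000, seed=0):
    """Monte Carlo estimate of M_{d,u}(s) = E[||Z||^s], Z ~ N(mu, I_d) with ||mu|| = u."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d); mu[0] = u            # any mean with norm u gives the same law of ||Z||
    Z = rng.standard_normal((n, d)) + mu
    return np.mean(np.linalg.norm(Z, axis=1) ** s)

print(M_central(3, 2))             # equals d = 3 when s = 2
print(M_noncentral_mc(3, 0.0, 2))  # should be close to 3
```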

3 Upper Bounds on GOT

In this section, we show that GOT is bounded from above by a family of kernel MMD distances. It is assumed throughout that $d$ is a positive integer, $p\in(0,\infty)$, and $\sigma\in(0,\infty)$.

3.1 General Bound via Kernel MMD

Consider the feature map $\psi_x\colon\mathbb{R}^d\to[0,\infty)$ defined by

\[
\psi_x(z)\coloneqq\frac{\sqrt{\omega_d}}{2^{\frac{d+p}{2}}}\,\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}\,\phi\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}\right), \tag{7}
\]

where $\omega_d=2\pi^{d/2}/\Gamma(d/2)$ is the surface area of the unit sphere in $\mathbb{R}^d$, $\phi(u)=(2\pi)^{-d/2}\exp(-\frac{1}{2}\|u\|^2)$ is the standard Gaussian density on $\mathbb{R}^d$, and $f$ is a probability density function on $[0,\infty)$ that satisfies

\[
f(x)\geq a\,x^{d+2p-1}\exp(-bx^2), \tag{8}
\]

for some $a>0$ and $b\in(0,1/2)$. This feature map defines a positive semidefinite kernel $k$ according to

\[
k(x,y)=\langle\psi_x,\psi_y\rangle.
\]

After some straightforward manipulations (see Appendix A), one finds that $k(x,y)$ is finite on $\mathbb{R}^d\times\mathbb{R}^d$ and can be expressed as the product of a Gaussian kernel and a term that depends only on $\|x+y\|$. Specifically,

\[
k(x,y)=\exp\left(-\frac{\|x-y\|^2}{4\sigma^2}\right) I_f\!\left(\frac{\|x+y\|}{\sqrt{2}\,\sigma}\right), \tag{9}
\]

where

\[
I_f(u)\coloneqq\frac{\omega_d}{2^{d+p}(2\pi)^{\frac{d}{2}}}\int_0^{\infty}\frac{x^{d-1+2p}}{f(x)}\,g_{d,u}(x)\,dx, \tag{10}
\]

and $g_{d,u}(x)$ is the density of the noncentral chi distribution given in (6). Note that this kernel is not shift invariant because of the term $I_f(u)$.

The next result shows that the MMD defined by this kernel provides an upper bound on the GOT.

Theorem 2.

Let $k$ be defined as in (9). For any $P,Q\in\mathcal{P}(\mathbb{R}^d)$ such that $\int\sqrt{k(x,x)}\,dP(x)$ and $\int\sqrt{k(x,x)}\,dQ(x)$ are finite, the MMD defined by $k$ provides an upper bound on the GOT:

\[
\mathcal{T}^{(\sigma)}_p(P,Q)\leq 2^{\max(p-1,0)}\,\sigma^p\,\gamma_k(P,Q).
\]

The significance of Theorem 2 is twofold. From the perspective of GOT, it provides a natural connection between the role of Gaussian smoothing and the normalization of the kernel. From the perspective of MMD, Theorem 2 describes a family of kernels that metrize convergence in distribution as well as convergence in $p$-th moments.

Similar to the analysis of convergence rates in previous work Goldfeld and Greenewald (2020); Goldfeld et al. (2020b), the proof of Theorem 2 builds upon the fact that $\mathcal{T}_p(P,Q)$ can be upper bounded by a weighted total variation distance (Villani, 2008, Theorem 6.13). The novelty of Theorem 2 is that it establishes an explicit relationship with the kernel MMD and also provides a much broader class of upper bounds parameterized by the density $f$.

3.2 A ‘Two-moment’ Kernel

One potential limitation of Theorem 2 is that for a particular choice of density $f$, the requirement that $\sqrt{k(x,x)}$ is integrable might not be satisfied for probability measures of interest. For example, the convergence rates in Goldfeld and Greenewald (2020) and Goldfeld et al. (2020b) can be obtained as a corollary of Theorem 2 by choosing $f$ to be the density of the generalized gamma distribution. However, the inverse of this density grows faster than exponentially, and as a consequence, the resulting bound can be applied only to the case of sub-Gaussian distributions.

The main idea underlying the approach in this section is that choosing a density with heavier tails leads to an upper bound that holds for a larger class of probability measures. Motivated by the functional inequalities appearing in Reeves (2020), we consider the following density function, which belongs to the family of generalized beta-prime distributions:

\[
f(x)=\frac{\epsilon}{2\pi x}\left(\left(\frac{x}{\lambda}\right)^{-\epsilon}+\left(\frac{x}{\lambda}\right)^{\epsilon}\right)^{-1}.
\]

For this special choice, the function $I_f(u)$ can be expressed as the weighted sum of two moments of the noncentral chi distribution. Starting with (9) and simplifying terms leads to the following:

Definition 1.

The two-moment kernel $k\colon\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is defined as

\[
k(x,y)=\alpha_{d,p}\exp\left(-\frac{\|x-y\|^2}{4\sigma^2}\right) J\!\left(\frac{\|x+y\|}{\sqrt{2}\,\sigma}\right) \tag{11}
\]

for all $\epsilon\in(0,d+2p]$ and $\lambda\in(0,\infty)$, where

\[
\alpha_{d,p}\coloneqq\frac{(2\pi)\,2^{-(p+d)}\,2^{-d/2}}{\Gamma(\frac{d}{2})},
\qquad
J(u)\coloneqq\frac{\lambda^{\epsilon}M_{d,u}(d+2p-\epsilon)+\lambda^{-\epsilon}M_{d,u}(d+2p+\epsilon)}{2\epsilon}.
\]

In this expression, $M_{d,u}(s)$ denotes the $s$-th moment of the noncentral chi distribution with $d$ degrees of freedom and parameter $u$.

A useful property of the two-moment kernel is that it satisfies the upper bound

\[
k(x,y)\leq C_{d,p,\epsilon,\lambda}\left(1+\|x\|^{d+2p+\epsilon}+\|y\|^{d+2p+\epsilon}\right),
\]

where the constant depends only on $(d,p,\epsilon,\lambda)$ (see Appendix A for details). As a consequence:

Theorem 3.

Fix any $s>p+d/2$. For all $0<\epsilon<\min(d+2p,\,2s-2p-d)$ and $\lambda\in(0,\infty)$, the MMD defined by the two-moment kernel is a metric on the space $\mathcal{P}_s(\mathbb{R}^d)$ of probability measures with finite $s$-th moment. Furthermore, for all $P,Q\in\mathcal{P}_s(\mathbb{R}^d)$,

\[
\mathcal{T}_p^{(\sigma)}(P,Q)\leq 2^{\max(p-1,0)}\,\sigma^p\,\gamma_k(P,Q).
\]
Remark 1.

If $d+2p\pm\epsilon$ are even integers then $J(u)$ is an even polynomial of degree $s=d+2p+\epsilon$ with non-negative coefficients. For example, if $d=3$, $p=1$, and $(\epsilon,\lambda)=(1,1)$, then

\[
J(u)=60+\frac{115}{2}u^2+11u^4+\frac{1}{2}u^6. \tag{12}
\]

Methods for efficient numerical approximation of $J(u)$ are provided in Appendix C.1.
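For the special case of Remark 1, the kernel can be evaluated directly from (11) and (12). The following is a minimal sketch of such an evaluation; the smoothing level $\sigma=1$ and the test points are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi

d, p, sigma = 3, 1, 1.0            # dimension, order, smoothing level (illustrative values)
alpha = (2 * pi) * 2 ** (-(p + d)) * 2 ** (-d / 2) / gamma(d / 2)   # alpha_{d,p} from Definition 1

def J(u):
    """Even polynomial from (12), valid for d=3, p=1, (eps, lambda) = (1, 1)."""
    return 60 + 115 / 2 * u**2 + 11 * u**4 + 0.5 * u**6

def two_moment_kernel(x, y):
    """Two-moment kernel (11): Gaussian factor times J(||x+y|| / (sqrt(2) sigma))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    gauss = np.exp(-np.sum((x - y) ** 2) / (4 * sigma**2))
    return alpha * gauss * J(np.linalg.norm(x + y) / (np.sqrt(2) * sigma))

print(two_moment_kernel([0.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
```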

4 Convergence Rate

We now turn our attention to the fundamental question of how well the empirical measure of iid samples approximates the true underlying distribution. Let $S_1,\dots,S_n\in\mathbb{R}^d$ be a sequence of $n$ independent samples with common distribution $P$. The empirical measure $P_n$ is the (random) probability measure on $\mathbb{R}^d$ that places probability mass $1/n$ at each sample point:

\[
P_n:=\frac{1}{n}\sum_{i=1}^{n}\delta_{S_i}, \tag{13}
\]

where $\delta_x$ denotes the point mass distribution at $x$.

Distributional properties of the kernel MMD between $P$ and $P_n$ have been studied extensively Gretton et al. (2012). For the purposes of this paper, we will focus on the expected difference between these distributions. As a straightforward consequence of (5), one obtains an exact expression for the expectation of the squared MMD:

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]=\frac{\mathbb{E}[k(X,X)]-\mathbb{E}[k(X,X')]}{n}, \tag{14}
\]

where $X,X'$ are independent draws from $P$. Note that the numerator can also be expressed as $\mathbb{E}\left[\gamma_k^2(P,P_1)\right]$, i.e., the squared MMD under $n=1$ sample. By Jensen’s inequality, the first moment satisfies

\[
\mathbb{E}\left[\gamma_k(P,P_n)\right]\leq\frac{\sqrt{\mathbb{E}[k(X,X)]}}{\sqrt{n}}. \tag{15}
\]

In the following, we focus on the two-moment kernel given in (11) and study how $\mathbb{E}[k(X,X)]$ depends on $P$ and the parameters $(p,\sigma)$.

We note that because $\gamma_k$ satisfies the triangle inequality, all of the results provided here extend naturally to the two-sample setting where the goal is to approximate the distance $\gamma_k(P,Q)$ based on the empirical measures $P_n$ and $Q_m$.

4.1 Finite Moment Condition

We begin with an upper bound on $\mathbb{E}[k(X,X)]$ that holds whenever $\|X\|$ has a moment greater than $d+2p$.

Theorem 4.

Let $X\in\mathbb{R}^d$ be a random vector that satisfies $(\mathbb{E}\left[\|X\|^s\right])^{1/s}\leq m$ for some $s>d+2p$. If $k$ is the two-moment kernel given in (11) with parameters $0<\epsilon\leq\min(d+2p,\,s-d-2p)$ and $\lambda=\sqrt{d+2p+\epsilon}+\sqrt{2}\,m/\sigma$, then

\[
\mathbb{E}[k(X,X)]\leq\frac{\alpha_{d,p}}{\epsilon}\Big(\sqrt{2d+2p+\epsilon}+\frac{\sqrt{2}\,m}{\sigma}\Big)^{d+2p}.
\]

In view of (15), Theorem 1 follows as an immediate corollary.

Small noise limit and unsmoothed OT.

It is instructive to consider the implications of Theorem 1 in the limit as $\sigma$ converges to zero. By two applications of the triangle inequality and the fact that $\mathcal{T}_p(Q,Q\ast\mathcal{N}_\sigma)\leq\sigma^p M_d(p)$ for every $Q\in\mathcal{P}$, one finds that, for any $\sigma\in[0,\infty)$, the (unsmoothed) OT distance can be upper bounded according to

\[
\mathcal{T}_p(P,P_n)\leq C_{d,p}\left(\mathcal{T}^{(\sigma)}_p(P,P_n)+\sigma^p\right). \tag{16}
\]

Combining (16) with Theorem 1 and then evaluating at $\sigma=m\,n^{-\frac{1}{d+2p}}$ leads to the (unsmoothed) convergence rate given in (4).
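For completeness, here is a sketch of the substitution, absorbing all constants that depend only on $(d,p,s)$ into $C'_{d,p,s}$:
\[
\begin{aligned}
\mathbb{E}\left[\mathcal{T}_p(P,P_n)\right]
&\leq C_{d,p}\left(C_{d,p,s}\,\frac{\sigma^p}{\sqrt{n}}\Big(1+\frac{m}{\sigma}\Big)^{\frac{d}{2}+p}+\sigma^p\right)
&&\text{by (16) and (3)}\\
&\leq C_{d,p}\left(C_{d,p,s}\,2^{\frac{d}{2}+p}\,\frac{m^p\,n^{-\frac{p}{d+2p}}}{\sqrt{n}}\,n^{\frac{d/2+p}{d+2p}}+m^p\,n^{-\frac{p}{d+2p}}\right)
&&\text{since } \sigma=m\,n^{-\frac{1}{d+2p}}\text{ gives } 1+\tfrac{m}{\sigma}\leq 2\,n^{\frac{1}{d+2p}}\\
&= C'_{d,p,s}\,m^p\,n^{-\frac{p}{d+2p}},
&&\text{since } n^{-\frac{1}{2}}\,n^{\frac{d/2+p}{d+2p}}=1.
\end{aligned}
\]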

4.2 Sub-gamma Condition

Next, we provide a refined bound for distributions satisfying a sub-gamma tail condition.

Definition 2.

A random vector $X\in\mathbb{R}^d$ is said to be sub-gamma with variance parameter $v>0$ and scale parameter $b\geq 0$ if

\[
\mathbb{E}\left[\exp(\alpha^{\top}X)\right]\leq\exp\left(\frac{v\|\alpha\|^2}{2(1-b\|\alpha\|)}\right) \tag{17}
\]

for all $\alpha\in\mathbb{R}^d$ such that $\|\alpha\|\leq 1/b$. If this condition holds with $b=0$ then $X$ is said to be sub-Gaussian with variance parameter $v$.

Properties of sub-Gaussian and sub-gamma distributions have been studied extensively; see, e.g., Boucheron et al. (2013). In particular, if $X_1$ and $X_2$ are independent sub-gamma random vectors with parameters $(v_1,b_1)$ and $(v_2,b_2)$, respectively, then $X_1+X_2$ is sub-gamma with parameters $(v_1+v_2,\max(b_1,b_2))$.

The next result provides an upper bound on the moments of the magnitude of a sub-gamma vector. Although there is a rich literature on this topic, we were unable to find a previous statement of this result, and so it may be of independent interest.

Theorem 5.

Let $X\in\mathbb{R}^d$ be a sub-gamma random vector with parameters $(v,b)$. For all $s\in[0,\infty)$

\[
\mathbb{E}\left[\|X\|^s\right]\leq\sqrt{2e}\left(\sqrt{v}+\sqrt{s}\,b\right)^{s}M_d(s), \tag{18}
\]

where $M_d(s):=2^{\frac{s}{2}}\Gamma(\frac{d+s}{2})/\Gamma(\frac{d}{2})$ is the $s$-th moment of the chi distribution with $d$ degrees of freedom. Furthermore, $M_d(s)\leq\overline{M}_d(s)$ where

\[
\overline{M}_d(s):=\left(\frac{d+s}{e}\right)^{\frac{s}{2}}\left(\frac{d+s}{d}\right)^{\frac{d-1}{2}}. \tag{19}
\]
Theorem 6.

Let $X\in\mathbb{R}^d$ be a sub-gamma random vector with parameters $(v,b)$. Let $r=d+2p$ and for $|t|\leq d+2p$ define

\[
m(t)\coloneqq\left(\sqrt{\sigma^2+2v}+\sqrt{r+t}\,b\right)^{r+t}\overline{M}_d(r+t),
\]

where $\overline{M}_d(s)$ is defined in (19). If $k$ is the two-moment kernel defined in (11) with parameters $\epsilon=\sqrt{d}$ and $\lambda=(m(\epsilon)/m(-\epsilon))^{1/(2\epsilon)}$, then

\[
\mathbb{E}\left[k(X,X)\right]\leq C\,(d+p)^{p}\Big(\sqrt{1+\frac{2v}{\sigma^2}}+\frac{\sqrt{d+2p}\,b}{\sigma}\Big)^{d+2p}, \tag{20}
\]

where $C=\sqrt{2\pi}\,e^{7/8}<6.02$.

In view of (15), Theorem 6 gives an upper bound on the convergence rate of $\gamma_k(P,P_n)$ with an explicit dependence on the parameters $(d,p,\sigma,v,b)$. For the special case of a sub-Gaussian distribution ($b=0$) and $p\in\{0,1\}$, this bound recovers the results in Goldfeld and Greenewald (2020). Going beyond the setting of sub-Gaussian distributions (i.e., $b>0$), this bound quantifies the dependence on the dimension $d$ and the scale parameter $b$.

Phase transition in scale parameter.

In the high-dimensional setting where $d$ increases with $n$, Theorem 6 exhibits two distinct regimes depending on the behavior of the scale parameter. Suppose that $(p,\sigma,v)$ are fixed while $b$ scales with $d$. If $b=O(d^{-\delta})$ for some $\delta>1/2$, then $\mathbb{E}[k(X,X)]$ grows at most exponentially with the dimension:

\[
\frac{\log\mathbb{E}[k(X,X)]}{d}\leq\frac{1}{2}\log\left(1+\frac{2v}{\sigma^2}\right)+o(1). \tag{21}
\]

Conversely, if $b=\Omega(d^{-\delta})$ for $\delta<1/2$, then the upper bound increases faster than exponentially:

\[
\frac{\log\mathbb{E}[k(X,X)]}{d}\leq\frac{2(1-\delta)\log d}{2}+O(1).
\]

The following example exhibits a specific sub-gamma distribution which shows that the upper bound in Theorem 6 is tight with respect to the scaling of the dimension $d$ and the scale parameter $b$. Full details of this example are provided in Appendix C.2.

Example 1.

Suppose that $X=\sqrt{U}Z$ where $Z\sim\mathcal{N}(0,I_d)$ is a standard Gaussian vector and $U$ is an independent Gamma random variable with shape parameter $1/(2b^2)$ and scale parameter $2b^2$. Then $X$ satisfies the sub-gamma condition with parameters $(1,b)$, and so the upper bound in Theorem 6 applies. Moreover, for every pair $(\epsilon,\lambda)$, the expectation of the two-moment kernel satisfies the following lower bound

\[
\begin{aligned}
\mathbb{E}\left[k(X,X)\right]
&\geq\frac{\alpha_{d,p}}{\epsilon}\left(\frac{\sqrt{2}\,b}{\sigma}\right)^{r}\left(M_{b^{-2}}(r-\epsilon)\,M_{b^{-2}}(r+\epsilon)\right)^{1/2}\\
&\quad\times\left(M_d(r-\epsilon)\,M_d(r+\epsilon)\right)^{1/2}.
\end{aligned}\tag{22}
\]

The bounds on $\mathbb{E}\left[k(X,X)\right]$ given in (20) and (22) are shown in Figure 1 as a function of $\delta$ (where $b=d^{-\delta}$) for various values of $d$. The plot demonstrates a phase transition at the critical value $\delta=0.5$. Further computational results on this example are given in Section 5.1.
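As a sanity check on the construction in Example 1, the following is a minimal sketch that draws samples of $X=\sqrt{U}Z$ and compares an empirical moment of $\|X\|$ against the bound of Theorem 5 with $v=1$; the dimension, scale parameter, and moment order below are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def sample_subgamma(n, d, b, rng):
    """Example 1: X = sqrt(U) * Z with U ~ Gamma(shape=1/(2 b^2), scale=2 b^2) and Z ~ N(0, I_d)."""
    U = rng.gamma(shape=1.0 / (2 * b**2), scale=2 * b**2, size=n)
    Z = rng.standard_normal((n, d))
    return np.sqrt(U)[:, None] * Z

def chi_moment(d, s):
    """M_d(s) = 2^{s/2} Gamma((d+s)/2) / Gamma(d/2)."""
    return np.exp(0.5 * s * np.log(2) + gammaln((d + s) / 2) - gammaln(d / 2))

d, b, s = 3, 0.5, 6                       # illustrative dimension, scale parameter, moment order
rng = np.random.default_rng(0)
X = sample_subgamma(200_000, d, b, rng)
empirical = np.mean(np.linalg.norm(X, axis=1) ** s)
bound = np.sqrt(2 * np.e) * (np.sqrt(1.0) + np.sqrt(s) * b) ** s * chi_moment(d, s)  # Theorem 5, v = 1
print(empirical, bound)                   # the empirical moment should not exceed the bound
```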

Figure 1: Bounds on $\frac{1}{d}\log\mathbb{E}\left[k(X,X)\right]$ for a distribution satisfying the sub-gamma condition with parameters $(1,d^{-\delta})$, as a function of $\delta$ for various $d$. In all cases, $p=1$, $\sigma=0.1$, and $k$ is the ‘two-moment’ kernel described in Theorem 6. The upper bounds (solid line) are given by the right-hand side of (20). The lower bounds (dashed line) are given by the right-hand side of (22) evaluated at $\epsilon=\sqrt{d}$.

4.3 Dependent Samples

Motivated by applications involving Markov chain Monte Carlo, there is significant interest in understanding the rate of convergence when there is dependence in the samples. Within the literature, this question is often addressed by focusing on stationary sequences satisfying certain mixing conditions Peligrad (1986). The basic idea is that if the dependence between $S_i$ and $S_j$ decays rapidly as $|i-j|$ increases, then the effect of the dependence is negligible.

To go beyond the usual mixing conditions, a particularly useful property of the kernel MMD distance is that the second moment of $\gamma_k(P,P_n)$ depends only on the pairwise correlation in the samples. This perspective is useful for settings where there may not be a natural notion of time.

To make things precise, suppose that $S_1,\dots,S_n\in\mathbb{R}^d$ is a collection of (possibly dependent) samples with identical distribution $P$. For each $i\neq j$ let $Q_{ij}$ denote the law of $(S_i,S_j)$. Starting with (5), the expectation of the squared MMD can now be decomposed as

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]=\frac{1}{n}\,\mathbb{E}\left[\gamma^2_k(P,P_1)\right]+\frac{1}{n^2}\sum_{i\neq j}r_{ij}, \tag{23}
\]

where $r_{ij}\coloneqq\mathbb{E}_{Q_{ij}}[k(S_i,S_j)]-\mathbb{E}_{P\otimes P}[k(S_i,S_j')]$. Notice that the first term in this decomposition is the second moment under independent samples. The second term is non-negative and depends only on the pairwise dependence, i.e., the difference between $Q_{ij}$ and the probability measure obtained by the product of its marginals.

In some cases, it is natural to argue that only a small number of the terms $r_{ij}$ are nonzero. More generally, it is desirable to provide guarantees in terms of measures of dependence that do not rely on the particular choice of kernel. The next result provides such a bound in terms of the Hellinger distance.

Lemma 7.

If $\mathbb{E}_P[k^2(X,X)]^{1/2}<C_{k,P}$, then

\[
r_{ij}\leq\sqrt{2}\,C_{k,P}\,d_H(Q_{ij},P\otimes P),
\]

where $d_H$ denotes the Hellinger distance.

The following example is inspired by the random feature kernel interpretation of neural networks Rahimi and Recht (2008).


Figure 2: Upper panel: (Left) The UB (solid line) and LB (dashed line) of $\mathbb{E}\left[k(X,X)\right]$ given by (20) and (22), plotted as functions of $d$ up to 500 and for various values of $\delta$ (shown in 4 colors), where $X$ is the sub-gamma random variable in $\mathbb{R}^d$ of Example 1, $b=d^{-\delta}$, and the kernel function $k$ has bandwidth $\sigma=1$. (Right) Empirical estimates of $\mathbb{E}[\hat{\Delta}_\gamma^2]$ (cross markers), shown together with the UB and LB of $\mathbb{E}\left[k(X,X)\right]$ as in the left plot but over a range of $d$ up to 30. The values are computed with $n=400$ samples of $X$, and are averaged over 200 Monte-Carlo replicas. Bottom panel: Same plots with $\sigma=4$, $n=100$, and averaged over 100 Monte-Carlo replicas. In all the plots, the quantities are shown as $\log$ values divided by $d$ for better demonstration.
Example 2.

Let $\{Z(\alpha)\,:\,\alpha\in\mathbb{R}^N\}$ be an $\mathbb{R}^d$-valued Gaussian process with mean zero and covariance function $\operatorname{Cov}(Z(\alpha),Z(\beta))=\langle\alpha,\beta\rangle I_d$. Suppose that samples from $P$ are generated according to $S_i=T(Z(\alpha_i))$ where $\alpha_1,\dots,\alpha_n$ are points on the unit sphere and $T\colon\mathbb{R}^d\to\mathbb{R}^d$ is a function that maps a standard Gaussian vector into a vector with distribution $P$. Because the Hellinger distance is non-increasing under the mapping given by $T$, it can be verified that

\[
d_H(Q_{ij},P\otimes P)\leq\sqrt{d}\,|\langle\alpha_i,\alpha_j\rangle|.
\]

Thus, by (23) and Lemma 7, there exists a constant $C_{k,P}$ depending only on $k$ and $P$ such that

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]\leq C_{k,P}\Big(\frac{1}{n}+\frac{1}{n^2}\sum_{i\neq j}|\langle\alpha_i,\alpha_j\rangle|\Big). \tag{24}
\]

This inequality holds for any set of points $\{\alpha_i\}$. To gain insight into the typical scaling behavior when the samples are nearly orthogonal on average, let us suppose that the $\{\alpha_i\}$ are drawn independently from the uniform distribution on the sphere. Then $\mathbb{E}\left[|\langle\alpha_i,\alpha_j\rangle|^2\right]=1/N$ and, by standard concentration arguments, one finds that the following upper bound holds with high probability when $N$ is large:

\[
\mathbb{E}\left[\gamma^2_k(P,P_n)\right]\leq C_{k,P}\Big(\frac{1}{n}+\sqrt{\frac{\log N}{N}}\Big). \tag{25}
\]
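A minimal sketch of the sampling mechanism in Example 2 is given below; it also computes the pairwise-dependence term appearing in (24). Taking $T$ to be the identity (so that $P$ is the standard Gaussian) is an illustrative assumption, as are the values of $n$, $N$, and $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 100, 50, 5                       # illustrative sample size, feature dimension, data dimension

# Points alpha_i drawn uniformly on the unit sphere S^{N-1}.
A = rng.standard_normal((n, N))
A /= np.linalg.norm(A, axis=1, keepdims=True)

# Z(alpha_i) with Cov(Z(alpha), Z(beta)) = <alpha, beta> I_d: each coordinate is <alpha_i, W_k>.
W = rng.standard_normal((N, d))
Z = A @ W                                  # row i is Z(alpha_i)
S = Z                                      # T = identity here, i.e. P is the standard Gaussian (illustrative)

# Pairwise-dependence term from (24): (1/n^2) * sum_{i != j} |<alpha_i, alpha_j>|.
G = np.abs(A @ A.T)
dep_term = (G.sum() - np.trace(G)) / n**2
print(dep_term, 1.0 / n)                   # compare the dependence term with the iid term 1/n
```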

5 Numerical Results

In this section, we compute the upper bounds of the GOT distance provided by the empirical kernel MMD distance.

Figure 3: Empirical values of $\gamma_k^2(P_n,P_n')$ for dependent samples in Example 2 in Section 4.3. (Left) Mean and standard deviation computed over 200 realizations of the $S_i$'s, plotted against increasing values of $N$ and for different $n$. (Middle) The $\log_{10}$ of the mean value in the left plot. (Right) Log-log plot of the estimated limiting values of the mean in the left plot, obtained by averaging over the range $N\in[80,100]$, vs. $n$, showing a slope close to 1.

5.1 Bounds under Sub-gamma Condition

The sub-gamma condition allows us to address distributions that do not satisfy the sub-Gaussian condition. Upper bounds on the convergence rate for this class of distributions follow from Theorem 3 combined with Theorem 6.

For sub-gamma $X$ as in Example 1, $\mathbb{E}\left[k(X,X)\right]$ has an upper bound (UB) and a lower bound (LB) given by (20) and (22), respectively. Here, in addition to the theoretical UB and LB shown in Figure 1, we also compute an estimate of $\mathbb{E}\left[n\gamma_k^2(P,P_n)\right]$ approximated by the two-sample estimator. Let $P_n$ and $P_n'$ be the empirical measures defined as in (13) for two independent iid samples $\{X_i\}_{i=1}^n$ and $\{X_i'\}_{i=1}^n$ of the sub-gamma random vector of Example 1. Define $\hat{\Delta}_\gamma^2\coloneqq\frac{n}{2}\gamma_k^2(P_n,P_n')$; then $\mathbb{E}[\hat{\Delta}_\gamma^2]=\mathbb{E}\left[k(X,X)\right]-\mathbb{E}\left[k(X,X')\right]=\mathbb{E}\left[n\gamma_k^2(P,P_n)\right]$. Because the empirical squared kernel MMD distance $\gamma_k^2(P_n,P_n')$ can be computed numerically from the two samples of sub-gamma random vectors, we can estimate the expectation of $\hat{\Delta}_\gamma^2$ by an empirical average over Monte-Carlo replicas. Detailed numerical techniques to compute the two-moment kernel and the experimental setup are provided in the appendix.
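A minimal sketch of this two-sample estimate is given below, written with the closed-form two-moment kernel of Remark 1 ($d=3$, $p=1$, $(\epsilon,\lambda)=(1,1)$) for concreteness; this kernel choice, along with $\sigma$, $b$, the sample size, and the number of replicas, is an illustrative assumption and differs from the settings used in the figures.

```python
import numpy as np
from math import gamma, pi

d, p, sigma, b = 3, 1, 1.0, 0.5            # illustrative parameters
alpha = (2 * pi) * 2 ** (-(p + d)) * 2 ** (-d / 2) / gamma(d / 2)

def kern(X, Y):
    """Two-moment kernel matrix, using the closed-form J(u) of Remark 1 (d=3, p=1, eps=lam=1)."""
    diff = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    u = np.linalg.norm(X[:, None, :] + Y[None, :, :], axis=-1) / (np.sqrt(2) * sigma)
    J = 60 + 115 / 2 * u**2 + 11 * u**4 + 0.5 * u**6
    return alpha * np.exp(-diff / (4 * sigma**2)) * J

def delta_gamma_sq(X, Xp):
    """Two-sample statistic (n/2) * gamma_k^2(P_n, P_n'); its mean equals E[k(X,X)] - E[k(X,X')]."""
    n = X.shape[0]
    g2 = kern(X, X).mean() + kern(Xp, Xp).mean() - 2 * kern(X, Xp).mean()
    return n / 2 * g2

rng = np.random.default_rng(0)
def sample(n):
    """Example 1 sub-gamma sampler: X = sqrt(U) * Z."""
    U = rng.gamma(1 / (2 * b**2), 2 * b**2, size=n)
    return np.sqrt(U)[:, None] * rng.standard_normal((n, d))

print(np.mean([delta_gamma_sq(sample(200), sample(200)) for _ in range(50)]))
```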

The results are shown in Figure 2, where we set the parameter $b$ controlling the shape and scale of $X$ as $b=d^{-\delta}$ with $\delta\in\{0,0.25,0.5,0.75\}$. The data dimension $d$ takes multiple values, and the kernel bandwidth $\sigma$ takes the values 1 and 4. In both cases, the left plot shows the same information as Figure 1 in view of increasing $d$, so as to be compared to the right plot. The right plot focuses on the case of small $d$, where the empirical estimates of $\mathbb{E}[\hat{\Delta}_\gamma^2]$ respect the UB. Notably, these values also lie between the UB and LB (recall that the LB applies to $\mathbb{E}\left[k(X,X)\right]$ but not to $\mathbb{E}[\hat{\Delta}_\gamma^2]$) for both values of $\sigma$, and approach the LB when $\sigma=1$. This shows that our theoretical UB captures an important component of the kernel MMD distances for this example of $X$.

5.2 Dependent Samples via Gaussian Process

We generate dependent samples following a Gaussian process $\{Z(\alpha)\}_{\alpha\in S^N}$ as in Example 2 in Section 4.3, and numerically compute the values of $\gamma_k^2(P,P_n)$ via its two-sample version $\gamma_k^2(P_n,P_n')$, using the same kernel as in the first experiment. Theoretically, when $n$ is large, $\gamma_k^2(P,P_n)$ is expected to concentrate at its mean value as in (24), and then, since the $\alpha_i$ are uniformly sampled on the $N$-sphere, we also expect concentration at the value in (25).

We set $n\in\{30,50,70,100\}$ and vary $N$ from 5 to 100. The data are in dimension $d=5$, the kernel parameters are $\sigma=0.5$, $c=1$, $p=1$, and for each value of $n$ and $N$, $\gamma_k^2(P_n,P_n')$ is computed for 100 realizations of the dependent random variables $S_i$, conditioned on one realization of the vectors $\alpha_i$. The results are shown in Figure 3, where for small values of $N$, the computed approximate values of $\mathbb{E}\left[\gamma_k^2(P,P_n)\right]$ are decreasing because they are dominated by the contribution from the dependence in the samples, namely the part that is upper-bounded by the second term depending on $\alpha$ in (24). As $N$ increases, these values converge to a certain positive value which decreases as $n$ increases.

Specifically, we take the average of the mean values of $\gamma_k^2(P_n,P_n')$ over $N\in[80,100]$ as an estimate of the limiting values, and Figure 3 (right) shows that these values decay as $n^{-1}$, as they correspond to the first term in (24). These numerical results show the two competing factors as predicted by the analysis in Section 4.3.

6 Conclusion and Discussion

The paper proves new convergence rates of GOT under general settings. Our results require only a finite moment condition, and in the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and show a phase transition with respect to the scale parameter. Furthermore, our results cover the setting of dependent samples, where convergence is proved requiring only a condition on pairwise sample dependence expressed by the kernel. Throughout our analysis, the main theoretical technique is to establish an upper bound using a kernel MMD where the kernel is called a “two-moment” kernel due to its special properties. The kernel depends on the cost function of the OT as well as the Gaussian smoothing used in GOT.

For the tightness of the kernel MMD upper bound, as has been pointed out in the comment beneath Theorem 1, our result shows that the convergence rate of $n^{-1/2}$ is tight in some regimes where $\sigma\to 0$ with $n\to\infty$, and the result matches the minimax rate in the unsmoothed OT setting. Alternatively, if $\sigma$ is bounded away from zero then it may be possible to obtain a better rate of convergence. For example, Proposition 6 in Goldfeld et al. (2020b) shows that if the pair $(P,\sigma)$ satisfies an additional chi-square divergence condition, then $\mathcal{T}_2^{(\sigma)}(P,P_n)$ converges at rate $1/n$, which is faster than the general upper bound of $1/\sqrt{n}$ appearing in our paper. In this direction, pinning down the exact convergence rate in terms of regularity conditions on $P$ remains an interesting open question for future work. In addition, it would be interesting to investigate the relationship between the Gaussian smoothing used in this paper and the multiscale representation of $\mathcal{T}_p$ in terms of partitions of the support of $P$, which was used in the analysis in Fournier and Guillin (2015) as well as the related work Weed et al. (2019).

In practice, the tightness of the kernel MMD upper bound also depends on the choice of kernel, which can be optimized for the data distribution and the level of smoothing in GOT. The question of whether the kernel MMD provides a useful alternative to the OT distance in applications is worthy of further investigation. Finally, another important direction of future work is to study computational methods and applications of the GOT approach, particularly in high-dimensional spaces. Currently, no specialized algorithm for estimating GOT from finite samples has been developed, except for the direct method of applying any OT algorithm, e.g., entropy OT (Sinkhorn), to data with additive Gaussian noise Goldfeld and Greenewald (2020). Progress on the computational side will also enable various applications of GOT, e.g., the evaluation of generative models in machine learning.

References

  • Abid and Gower (2018) Brahim Khalil Abid and Robert M Gower. Greedy stochastic algorithms for entropy-regularized optimal transport problems. arXiv preprint arXiv:1803.01347, 2018.
  • Anderson et al. (1994) Niall H Anderson, Peter Hall, and D Michael Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41–54, 1994.
  • Bhushan Damodaran et al. (2018) Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 447–463, 2018.
  • Bigot et al. (2019) Jérémie Bigot, Elsa Cazelles, Nicolas Papadakis, et al. Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics, 13(2):5120–5150, 2019.
  • Bonneel et al. (2015) Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
  • Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • Chakrabarty and Khanna (2020) Deeparnab Chakrabarty and Sanjeev Khanna. Better and simpler error analysis of the sinkhorn–knopp algorithm for matrix scaling. Mathematical Programming, pages 1–13, 2020.
  • Chen and Niles-Weed (2020) Hong-Bin Chen and Jonathan Niles-Weed. Asymptotics of smoothed wasserstein distances. arXiv preprint arXiv:2005.00738, 2020.
  • Chwialkowski et al. (2015) Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981–1989, 2015.
  • Courty et al. (2016) Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2016.
  • Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • Epps and Singleton (1986) TW Epps and Kenneth J Singleton. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177–203, 1986.
  • Fernández et al. (2008) V Alba Fernández, MD Jiménez Gamero, and J Muñoz García. A test for the two-sample problem based on empirical characteristic functions. Computational statistics & data analysis, 52(7):3730–3748, 2008.
  • Feydy et al. (2019) Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.
  • Fournier and Guillin (2015) Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707–738, 2015. doi: 10.1007/s00440-014-0583-7.
  • Fukumizu et al. (2008) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, pages 489–496, 2008.
  • Genevay et al. (2019) Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1574–1583. PMLR, 2019.
  • Gerber and Maggioni (2017) Samuel Gerber and Mauro Maggioni. Multiscale strategies for computing optimal transport. The Journal of Machine Learning Research, 18(1):2440–2471, 2017.
  • Goldfeld and Greenewald (2020) Ziv Goldfeld and Kristjan Greenewald. Gaussian-smoothed optimal transport: Metric structure and statistical efficiency. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020.
  • Goldfeld et al. (2018) Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy. Estimating information flow in deep neural networks. arXiv preprint arXiv:1810.05728, 2018.
  • Goldfeld et al. (2020a) Ziv Goldfeld, Kristjan Greenewald, and Kengo Kato. Asymptotic guarantees for generative modeling based on the smooth wasserstein distance. Advances in Neural Information Processing Systems, 33, 2020a.
  • Goldfeld et al. (2020b) Ziv Goldfeld, Kristjan Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66(7):4368–4391, July 2020b.
  • Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Jitkrittum et al. (2016) Wittawat Jitkrittum, Zoltán Szabó, Kacper P Chwialkowski, and Arthur Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pages 181–189, 2016.
  • Johnson et al. (1995) Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous univariate distributions. John Wiley & Sons, Ltd, 1995.
  • Klatt et al. (2020) Marcel Klatt, Carla Tameling, and Axel Munk. Empirical regularized optimal transport: Statistical theory and applications. SIAM Journal on Mathematics of Data Science, 2(2):419–443, 2020.
  • Lei et al. (2020) Jing Lei et al. Convergence and concentration of empirical measures under wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020.
  • Li et al. (2017) Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
  • Li et al. (2018) Wuchen Li, Ernest K Ryu, Stanley Osher, Wotao Yin, and Wilfrid Gangbo. A parallel method for earth mover’s distance. Journal of Scientific Computing, 75(1):182–197, 2018.
  • Li et al. (2015) Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
  • Luise et al. (2018) Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of sinkhorn approximation for learning with wasserstein distance. In Advances in Neural Information Processing Systems, pages 5859–5870, 2018.
  • Mena and Niles-Weed (2019) Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
  • Muandet et al. (2017) K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel Mean Embedding of Distributions: A Review and Beyond. 2017.
  • Niles-Weed and Berthet (2019) Jonathan Niles-Weed and Quentin Berthet. Minimax estimation of smooth densities in wasserstein distance, 2019.
  • Peligrad (1986) Magda Peligrad. Recent advances in the central limit theorem and its weak invariance principle for mixing sequences of random variables (a survey). In Dependence in probability and statistics, pages 193–223. Springer, 1986.
  • Peyré and Cuturi (2019) Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • Rabin et al. (2011) Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
  • Rahimi and Recht (2008) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
  • Reeves (2020) Galen Reeves. A two-moment inequality with applications to rényi entropy and mutual information. Entropy, 22(11), 2020.
  • Ryu et al. (2018) Ernest K Ryu, Yongxin Chen, Wuchen Li, and Stanley Osher. Vector and matrix optimal mass transport: theory, algorithm, and applications. SIAM Journal on Scientific Computing, 40(5):A3675–A3698, 2018.
  • Sejdinovic et al. (2013) Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.
  • Shen et al. (2017) Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. arXiv preprint arXiv:1707.01217, 2017.
  • Singh and Póczos (2018) Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance, 2018. [Online]. Available https://arxiv.org/abs/1802.08855.
  • Smola et al. (2007) Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
  • Solomon et al. (2014) Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Earth mover’s distances on discrete surfaces. ACM Transactions on Graphics (TOG), 33(4):67, 2014.
  • Sriperumbudur et al. (2010) Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010.
  • Villani (2003) Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Weed et al. (2019) Jonathan Weed, Francis Bach, et al. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
  • Young and Dunson (2019) Alexander L Young and David B Dunson. Consistent entropy estimation for stationary time series. arXiv preprint arXiv:1904.05850, 2019.
  • Zhang et al. (2012) Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.
  • Zhao and Meng (2015) Ji Zhao and Deyu Meng. Fastmmd: Ensemble of circular discrepancy for efficient two-sample test. Neural computation, 2015.

Appendix A Upper Bounds on GOT

This section provides proofs of the upper bounds on the GOT provided in Section 3. For the convenience of the reader we repeat some of the necessary definitions. Recall that the feature map ψx:d[0,)\psi_{x}\colon\mathbb{R}^{d}\to[0,\infty) is defined by

ψx(z)\displaystyle\psi_{x}(z) :=ωd2d+p2zd1+2p2f(z)ϕ(z2xσ),\displaystyle:=\frac{\sqrt{\omega_{d}}}{2^{\frac{d+p}{2}}}\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}}\right), (A.1)

where ωd=2πd/2/Γ(d/2)\omega_{d}=2\pi^{d/2}/\Gamma(d/2) is the volume of the unit sphere in d\mathbb{R}^{d}, ϕ(u)=(2π)d/2exp(12u2)\phi(u)=(2\pi)^{-d/2}\exp(-\frac{1}{2}\|u\|^{2}) is the standard Gaussian density on d\mathbb{R}^{d}, and ff is a probability density function on [0,)[0,\infty) that satisfies

f(x)axd+2p1exp(bx2),\displaystyle f(x)\geq ax^{d+2p-1}\exp(-bx^{2}), (A.2)

for some a>0a>0 and b(0,1/2)b\in(0,1/2).

Lemma A.1.

The feature map in (A.1) defines a positive semidefinite kernel k:d×d[0,)k\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,\infty) according to

k(x,y)dψx(z)ψy(z)𝑑z.\displaystyle k(x,y)\coloneqq\int_{\mathbb{R}^{d}}\psi_{x}(z)\psi_{y}(z)\,dz. (A.3)

Furthermore, this kernel can also be expressed as

k(x,y)\displaystyle k(x,y) =exp(xy24σ2)If(x+y2σ),\displaystyle=\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)I_{f}\mathopen{}\mathclose{{}\left(\frac{\|x+y\|}{\sqrt{2}\sigma}}\right), (A.4)

where

If(u)\displaystyle I_{f}(u) ωd2d+p(2π)d20xd1+2pf(x)gd,u(x)𝑑x,\displaystyle\coloneqq\frac{\omega_{d}}{2^{d+p}(2\pi)^{\frac{d}{2}}}\int_{0}^{\infty}\frac{x^{d-1+2p}}{f(x)}g_{d,u}\mathopen{}\mathclose{{}\left(x}\right)\,dx, (A.5)

and gd,u(x)g_{d,u}(x) is the density of Z\|Z\| when Z𝒩(μ,Id)Z\sim\mathcal{N}(\mu,I_{d}) with μ=u\|\mu\|=u.

Proof.

First we establish that ψx\psi_{x} is square integrable. By the assumed lower bound in (A.2) and the fact that ϕ2(y/2)=(2π)d/2ϕ(y)\phi^{2}(y/\sqrt{2})=(2\pi)^{-d/2}\phi(y), we can write

d|ψx(z)|2𝑑z\displaystyle\int_{\mathbb{R}^{d}}|\psi_{x}(z)|^{2}\,dz Cd,padexp(bz2)ϕ(z2xσ)𝑑z.\displaystyle\leq\frac{C_{d,p}}{a}\int_{\mathbb{R}^{d}}\exp\mathopen{}\mathclose{{}\left(b\|z\|^{2}}\right)\phi\Big{(}z-\frac{\sqrt{2}x}{\sigma}\Big{)}\,dz. (A.6)

This integral is the moment generating function of the non-central chi-square distribution with dd degrees of freedom and non-centrality parameter 2x2/σ22\|x\|^{2}/\sigma^{2} evaluated at bb. Under the assumption b<1/2b<1/2, this integral is finite.
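Explicitly, for W\sim\mathcal{N}(\mu,I_{d}) with \mu=\sqrt{2}x/\sigma, completing the square in each coordinate gives

\int\exp(b\|z\|^{2})\,\phi(z-\mu)\,dz=\mathbb{E}\big[e^{b\|W\|^{2}}\big]=(1-2b)^{-d/2}\exp\Big(\frac{b\|\mu\|^{2}}{1-2b}\Big),

which is finite if and only if b<1/2b<1/2.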

To establish the form given in (A.4) we can expand the squares to obtain:

ϕ(z2xσ)ϕ(z2yσ)\displaystyle\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}}\right)\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{y}{\sigma}}\right) =(2π)d2exp(xy24σ2)ϕ(zx+y2σ).\displaystyle=(2\pi)^{-\frac{d}{2}}\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)\phi\mathopen{}\mathclose{{}\left(z-\frac{x+y}{\sqrt{2}\sigma}}\right).

Since the first factor does not depend on zz, it follows that

k(x,y)\displaystyle k(x,y) =ωd2d+p(2π)d2exp(xy24σ2)dzd1+2pf(z)ϕ(zx+y2σ)𝑑z.\displaystyle=\frac{\omega_{d}}{2^{d+p}(2\pi)^{\frac{d}{2}}}\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)\int_{\mathbb{R}^{d}}\frac{\|z\|^{d-1+2p}}{f(\|z\|)}\phi\mathopen{}\mathclose{{}\left(z-\frac{x+y}{\sqrt{2}\sigma}}\right)\,dz.

In this case, we recognize the integral as the expectation of d1+2p/f()\|\cdot\|^{d-1+2p}/f(\cdot) under the non-central chi distribution with dd degrees of freedom and non-centrality parameter u=x+y/(2σ)u=\|x+y\|/(\sqrt{2}\sigma), whose density is gd,ug_{d,u}; this yields (A.4) and (A.5). ∎

A.1 Proof of Theorem 2

The following result is an immediate consequence of (Villani, 2008, Theorem 6.13) adapted to the notation of this paper.

Lemma A.2 ((Villani, 2008, Theorem 6.13)).

For any P,Q𝒫p(d)P,Q\in\mathcal{P}_{p}(\mathbb{R}^{d}),

𝒯p(P,Q)2max(p1,0)xpd|PQ|(x),\displaystyle\mathcal{T}_{p}(P,Q)\leq 2^{\max(p-1,0)}\int\|x\|^{p}\,d|P-Q|(x), (A.7)

where |PQ||P-Q| denotes the absolute variation of the signed measure PQP-Q.

To proceed, let pσ(z)=dϕσ(zx)𝑑P(x)p_{\sigma}(z)=\int_{\mathbb{R}^{d}}\phi_{\sigma}(z-x)\,dP(x) and qσ(z)=dϕσ(zx)𝑑Q(x)q_{\sigma}(z)=\int_{\mathbb{R}^{d}}\phi_{\sigma}(z-x)\,dQ(x) denote the probability density functions of P𝒩σP\ast\mathcal{N}_{\sigma} and Q𝒩σQ\ast\mathcal{N}_{\sigma}, respectively. By Lemma A.2, the OT distance between P𝒩σP\ast\mathcal{N}_{\sigma} and Q𝒩σQ\ast\mathcal{N}_{\sigma} is bounded from above by the weighted total variation distance:

𝒯p(σ)(P,Q)2max(p1,0)zp|pσ(z)qσ(z)|𝑑z.\displaystyle\mathcal{T}^{(\sigma)}_{p}(P,Q)\leq 2^{\max(p-1,0)}\int\|z\|^{p}\mathopen{}\mathclose{{}\left|p_{\sigma}(z)-q_{\sigma}(z)}\right|\,dz. (A.8)

In the following we will show that 2max(p1,0)σpγk(P,Q)2^{\max(p-1,0)}\sigma^{p}\gamma_{k}(P,Q) provides an upper bound on the right-hand side of (A.8). To proceed, recall that the kernel MMD can be expressed as

γk2(P,Q)=𝔼[k(X,X)]+𝔼[k(Y,Y)]2𝔼[k(X,Y)],\displaystyle\gamma^{2}_{k}(P,Q)=\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]+\mathbb{E}\mathopen{}\mathclose{{}\left[k(Y,Y^{\prime})}\right]-2\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,Y)}\right], (A.9)

where X,XX,X^{\prime} are iid PP and Y,YY,Y^{\prime} are iid QQ. The assumptions k(x,x)dP(x)<\int\sqrt{k(x,x)}\,dP(x)<\infty and k(x,x)dQ(x)<\int\sqrt{k(x,x)}\,dQ(x)<\infty ensure that these expectations are finite, and so, by Fubini’s theorem, we can interchange the order of integration:

𝔼[k(X,Y)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,Y)}\right] =k(x,y)𝑑P(x)𝑑Q(y)=(ψx(z)𝑑P(x))(ψx(z)𝑑Q(x))𝑑z.\displaystyle=\int k(x,y)\,dP(x)\,dQ(y)=\int\mathopen{}\mathclose{{}\left(\int\psi_{x}(z)\,dP(x)}\right)\mathopen{}\mathclose{{}\left(\int\psi_{x}(z)\,dQ(x)}\right)\,dz.

For each zdz\in\mathbb{R}^{d}, it follows that

ψx(z)𝑑P(x)\displaystyle\int\psi_{x}(z)\,dP(x) =ωd2d+p2zd1+2p2f(z)dϕ(z2xσ)𝑑P(x)\displaystyle=\frac{\sqrt{\omega_{d}}}{2^{\frac{d+p}{2}}}\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}\int_{\mathbb{R}^{d}}\phi\mathopen{}\mathclose{{}\left(\frac{z}{\sqrt{2}}-\frac{x}{\sigma}}\right)\,dP(x)
=σdωd2d+p2zd1+2p2f(z)pσ(σz2),\displaystyle=\frac{\sigma^{d}\sqrt{\omega_{d}}}{2^{\frac{d+p}{2}}}\frac{\|z\|^{\frac{d-1+2p}{2}}}{\sqrt{f(\|z\|)}}p_{\sigma}\mathopen{}\mathclose{{}\left(\frac{\sigma z}{\sqrt{2}}}\right),

and this leads to

𝔼[k(X,Y)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,Y)}\right] =σ2dωd2d+pzd1+2pf(z)pσ(σz2)qσ(σz2)𝑑z\displaystyle=\frac{\sigma^{2d}\omega_{d}}{2^{d+p}}\int\frac{\|z\|^{d-1+2p}}{f(\|z\|)}p_{\sigma}\mathopen{}\mathclose{{}\left(\frac{\sigma z}{\sqrt{2}}}\right)q_{\sigma}\mathopen{}\mathclose{{}\left(\frac{\sigma z}{\sqrt{2}}}\right)\,dz
=σ2pz2prσ(z)pσ(z)qσ(z)𝑑z,\displaystyle=\sigma^{-2p}\int\frac{\|z\|^{2p}}{r_{\sigma}(\|z\|)}p_{\sigma}(z)q_{\sigma}(z)\,dz,

where rσ(x)2σf(2σx)/(ωdxd1)r_{\sigma}(x)\coloneqq\frac{\sqrt{2}}{\sigma}f(\frac{\sqrt{2}}{\sigma}x)/(\omega_{d}x^{d-1}). Combining this expression with (A.9) leads to

γk2(P,Q)=σ2pz2prσ(z)(pσ(z)qσ(z))2𝑑z.\displaystyle\gamma^{2}_{k}(P,Q)=\sigma^{-2p}\int\frac{\|z\|^{2p}}{r_{\sigma}(\|z\|)}\mathopen{}\mathclose{{}\left(p_{\sigma}(z)-q_{\sigma}(z)}\right)^{2}\,dz.

Finally, we note that zrσ(z)z\mapsto r_{\sigma}(\|z\|) is a probability density function on d\mathbb{R}^{d} (it is non-negative and integrates to one) and so by Jensen’s inequality and the convexity of the square,

γk2(P,Q)\displaystyle\gamma^{2}_{k}(P,Q) σ2p(zp|pσ(z)qσ(z)|𝑑z)2.\displaystyle\geq\sigma^{-2p}\mathopen{}\mathclose{{}\left(\int\|z\|^{p}\mathopen{}\mathclose{{}\left|p_{\sigma}(z)-q_{\sigma}(z)}\right|\,dz}\right)^{2}.

In view of (A.8), this establishes the desired result.

A.2 Proof of Theorem 3

The fact that the kernel MMD provides an upper bound on 𝒯p(σ)(P,Q)\mathcal{T}_{p}^{(\sigma)}(P,Q) follows directly from Theorem 2. All that remains to be shown is that k(x,x)\sqrt{k(x,x)} is integrable with respect to any probability measure with a finite ss-th moment, where s=(d+2p+ϵ)/2s=(d+2p+\epsilon)/2. To this end, we note that by the triangle inequality and the elementary inequality (a+b)r2max(r1,0)(ar+br)(a+b)^{r}\leq 2^{\max(r-1,0)}(a^{r}+b^{r}),

Md,u(r)2max(r1,0)(Md(r)+ur),\displaystyle M_{d,u}(r)\leq 2^{\max(r-1,0)}\mathopen{}\mathclose{{}\left(M_{d}(r)+u^{r}}\right),

for all r0r\geq 0. Under the assumptions on ϵ\epsilon, we have 0d+2pϵ<d+2p+ϵ2s0\leq d+2p-\epsilon<d+2p+\epsilon\leq 2s and so there exists a constant Cd,p,ϵ,λC_{d,p,\epsilon,\lambda} such that

k(x,y)Cd,p,ϵ,λ(1+x2s+y2s).\displaystyle k(x,y)\leq C_{d,p,\epsilon,\lambda}\mathopen{}\mathclose{{}\left(1+\|x\|^{2s}+\|y\|^{2s}}\right).

Thus, the existence of finite ss-th moment is sufficient to ensure that k(x,x)\sqrt{k(x,x)} is integrable.

Appendix B Convergence Rate

This section provides proofs for the results in Section 4 of the main text as well as Theorem 1. To simplify the notation, we define r=d+2pr=d+2p and let Y=(2/σ)X+ZY=(\sqrt{2}/\sigma)X+Z where Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}).

Let us first consider some properties of 𝔼[k(X,X)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]. Since the two-moment kernel satisfies k(x,x)=αd,pJ((2/σ)x)k(x,x)=\alpha_{d,p}\,J((\sqrt{2}/\sigma)\|x\|), it follows from the definition of J()J(\cdot) that

𝔼[k(X,X)]=αd,p2ϵ(λϵ𝔼[Yrϵ]+λϵ𝔼[Yr+ϵ]).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]=\frac{\alpha_{d,p}}{2\epsilon}\mathopen{}\mathclose{{}\left(\lambda^{\epsilon}\,\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r-\epsilon}}\right]+\lambda^{-\epsilon}\,\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]}\right). (A.10)

Suppose that there exist numbers MM_{-} and M+M_{+} such that

𝔼[Yrϵ]M,𝔼[Yr+ϵ]M+.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r-\epsilon}}\right]\leq M_{-},\quad\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]\leq M_{+}. (A.11)

Choosing

λ=(M+/M)1/(2ϵ),\displaystyle\lambda=(M_{+}/M_{-})^{1/(2\epsilon)}, (A.12)

leads to

𝔼[k(X,X)]αd,pϵMM+.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\alpha_{d,p}}{\epsilon}\sqrt{M_{-}M_{+}}. (A.13)

In other words, optimizing the choice of λ\lambda results in an upper bound on 𝔼[k(X,X)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right] that depends on only the geometric mean of the upper bounds on 𝔼[Yr±ϵ]\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r\pm\epsilon}}\right].
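For completeness, the optimizing choice of \lambda follows from a one-line calculation: setting the derivative of \lambda\mapsto\lambda^{\epsilon}M_{-}+\lambda^{-\epsilon}M_{+} to zero gives

\frac{d}{d\lambda}\big(\lambda^{\epsilon}M_{-}+\lambda^{-\epsilon}M_{+}\big)=\epsilon\lambda^{\epsilon-1}M_{-}-\epsilon\lambda^{-\epsilon-1}M_{+}=0\iff\lambda^{2\epsilon}=M_{+}/M_{-},

and at this value the sum equals 2\sqrt{M_{-}M_{+}}, which yields (A.13).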

Lemma A.3.

Let XdX\in\mathbb{R}^{d} be a random vector satisfying

(𝔼[Xs])1sm(s),\displaystyle\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]}\right)^{\frac{1}{s}}\leq m(s), (A.14)

for some function m(s)m(s) for s1s\geq 1. Then, if rϵ1r-\epsilon\geq 1, (A.11) holds with

M±\displaystyle M_{\pm} =((M¯d(r±ϵ))1r±ϵ+2σm(r±ϵ))r±ϵ,\displaystyle=\mathopen{}\mathclose{{}\left(\mathopen{}\mathclose{{}\left(\overline{M}_{d}(r\pm\epsilon)}\right)^{\frac{1}{r\pm\epsilon}}+\frac{\sqrt{2}}{\sigma}m(r\pm\epsilon)}\right)^{r\pm\epsilon}, (A.15)

where M¯d(s)=(d+s)(d+s1)/2d(d1)/2es/2\overline{M}_{d}(s)=(d+s)^{(d+s-1)/2}d^{-(d-1)/2}e^{-s/2}.

Proof.

This result follows from Minkowski’s inequality, which gives

(𝔼[Ys])1s\displaystyle\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{s}}\right]}\right)^{\frac{1}{s}} (𝔼[Zs])1s+2σ(𝔼[Xs])1s\displaystyle\leq\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Z\|^{s}}\right]}\right)^{\frac{1}{s}}+\frac{\sqrt{2}}{\sigma}\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]}\right)^{\frac{1}{s}}

for all s1s\geq 1, together with the upper bound on Md(s)=𝔼[Zs]M_{d}(s)=\mathbb{E}\mathopen{}\mathclose{{}\left[\|Z\|^{s}}\right] from Theorem 5. ∎

B.1 Proof of Theorem 1

The result follows immediately by combining Theorem 4 and Equation (11) in the main text.

B.2 Proof of Theorem 4

By Lyapunov’s inequality and Minkowski’s inequality, it follows that for t{r±ϵ}t\in\{r\pm\epsilon\},

(𝔼[Yt])1t\displaystyle\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{t}}\right]}\right)^{\frac{1}{t}} (𝔼[Yr+ϵ])1r+ϵ\displaystyle\leq\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]}\right)^{\frac{1}{r+\epsilon}}
(𝔼[Zr+ϵ])1r+ϵ+2σ(𝔼[Xr+ϵ])1r+ϵ\displaystyle\leq\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Z\|^{r+\epsilon}}\right]}\right)^{\frac{1}{r+\epsilon}}+\frac{\sqrt{2}}{\sigma}\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{r+\epsilon}}\right]}\right)^{\frac{1}{r+\epsilon}}
d+r+ϵ+2mσ,\displaystyle\leq\sqrt{d+r+\epsilon}+\frac{\sqrt{2}m}{\sigma},

where the last step holds because Md(q)(d+q)q/2M_{d}(q)\leq(d+q)^{q/2} and the assumption (𝔼[Xs])1sm\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]}\right)^{\frac{1}{s}}\leq m. Thus, choosing λ\lambda according to (A.12), which here equals d+r+ϵ+2m/σ\sqrt{d+r+\epsilon}+\sqrt{2}m/\sigma, the bound in (A.13) becomes

𝔼[k(X,X)]αd,pϵ(d+r+ϵ+2mσ)r.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\alpha_{d,p}}{\epsilon}\mathopen{}\mathclose{{}\left(\sqrt{d+r+\epsilon}+\frac{\sqrt{2}m}{\sigma}}\right)^{r}.

Recalling that r=d+2pr=d+2p gives the stated result.

B.3 Proof of Theorem 5

Lemma A.4.

Let XdX\in\mathbb{R}^{d} be a sub-gamma random vector with parameters (v,b)(v,b). For all s[0,)s\in[0,\infty) and λ(0,1/b)\lambda\in(0,1/b),

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] 2π2s2Γ(s+12)(sλe)sexp(λ2v2(1λb))Md(s),\displaystyle\leq\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\exp\mathopen{}\mathclose{{}\left(\frac{\lambda^{2}v}{2(1-\lambda b)}}\right)M_{d}(s), (A.16)

where Md(s)2s2Γ(d+s2)/Γ(d2)M_{d}(s)\coloneqq 2^{\frac{s}{2}}\Gamma(\frac{d+s}{2})/\Gamma(\frac{d}{2}). In particular, if λ=((sb)2+4vssb)/(2v)\lambda=(\sqrt{(sb)^{2}+4vs}-sb)/(2v), then

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] (v+(sb2)2+sb2)s2π2s2Γ(s+12)(se)s2Md(s).\displaystyle\leq\mathopen{}\mathclose{{}\left(\sqrt{v+\mathopen{}\mathclose{{}\left(\frac{\sqrt{s}b}{2}}\right)^{2}}+\frac{\sqrt{s}b}{2}}\right)^{s}\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{e}}\right)^{\frac{s}{2}}M_{d}(s). (A.17)
Proof.

Let Y=ZXY=Z^{\top}X where Z=(Z1,,Zd)Z=(Z_{1},\dots,Z_{d}) is independent of XX and distributed uniformly on the unit sphere in d\mathbb{R}^{d}. Since ZZ is orthogonally invariant, it may be assumed that X=(X,0,,0)X=(\|X\|,0,\dots,0) and thus YY is equal in distribution to Z1XZ_{1}\|X\|. Therefore,

𝔼[|Y|s]=𝔼[|Z1|s]𝔼[Xs].\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[|Y|^{s}}\right]=\mathbb{E}\mathopen{}\mathclose{{}\left[|Z_{1}|^{s}}\right]\,\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right].

The variable Z1Z_{1} has density function

f(z)=Γ(d2)πΓ(d12)(1z2)(d3)/2,z[1,1],\displaystyle f(z)=\frac{\Gamma(\frac{d}{2})}{\sqrt{\pi}\Gamma(\frac{d-1}{2})}(1-z^{2})^{(d-3)/2},\quad z\in[-1,1],

and so the moments are given by

𝔼[|Z1|s]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[|Z_{1}|^{s}}\right] =2Γ(d2)πΓ(d12)01zs(1z2)(d3)/2𝑑z=Γ(d2)Γ(s+12)πΓ(d+s2).\displaystyle=\frac{2\Gamma(\frac{d}{2})}{\sqrt{\pi}\Gamma(\frac{d-1}{2})}\int_{0}^{1}z^{s}(1-z^{2})^{(d-3)/2}\,dz=\frac{\Gamma(\frac{d}{2})\Gamma(\frac{s+1}{2})}{\sqrt{\pi}\Gamma(\frac{d+s}{2})}.

To bound the absolute moments of YY we use the basic inequality uexp(u1)u\leq\exp(u-1) with u=λ|y|/su=\lambda|y|/s, which leads to

|y|s(sλe)sexp(λ|y|)(sλe)s(eλy+eλy),\displaystyle|y|^{s}\leq\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\exp(\lambda|y|)\leq\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}(e^{\lambda y}+e^{-\lambda y}),

for all s,λ(0,)s,\lambda\in(0,\infty). Noting that YY is equal in distribution to Y-Y and then using the sub-gamma assumption along with the fact that ZZ is a unit vector yields

𝔼[|Y|s]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[|Y|^{s}}\right] 2(sλe)s𝔼[exp(λY)]\displaystyle\leq 2\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\lambda Y)}\right]
=2(sλe)s𝔼[exp(λZX)]\displaystyle=2\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\lambda Z^{\top}X)}\right]
2(sλe)sexp(λ2v2(1λb)).\displaystyle\leq 2\mathopen{}\mathclose{{}\left(\frac{s}{\lambda e}}\right)^{s}\exp\mathopen{}\mathclose{{}\left(\frac{\lambda^{2}v}{2(1-\lambda b)}}\right).

Combining the above displays yields (A.16).

Finally, under the specified value of λ\lambda it follows that

λ2v1λb=s,sλ\displaystyle\frac{\lambda^{2}v}{1-\lambda b}=s,\qquad\frac{\sqrt{s}}{\lambda} =v+(sb2)2+sb2\displaystyle=\sqrt{v+\mathopen{}\mathclose{{}\left(\frac{\sqrt{s}b}{2}}\right)^{2}}+\frac{\sqrt{s}b}{2}

and plugging this expression back into the bound gives (A.17). ∎

Theorem 5 now follows as a corollary of Lemma A.4. Starting with (A.17) and using the basic inequality a2+b2a+b\sqrt{a^{2}+b^{2}}\leq a+b leads to

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] (v+sb)s2π2s2Γ(s+12)(se)s2Md(s).\displaystyle\leq\mathopen{}\mathclose{{}\left(\sqrt{v}+\sqrt{s}b}\right)^{s}\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{e}}\right)^{\frac{s}{2}}M_{d}(s).

To simplify the expressions involving the Gamma functions we use the lower bound logΓ(z)(z12)logzz+12log(2π)\log\Gamma(z)\geq(z-\frac{1}{2})\log z-z+\frac{1}{2}\log(2\pi) for z>0z>0, which leads to

2π2s2Γ(s+12)(se)s2\displaystyle\frac{2\sqrt{\pi}}{2^{\frac{s}{2}}\Gamma(\frac{s+1}{2})}\mathopen{}\mathclose{{}\left(\frac{s}{e}}\right)^{\frac{s}{2}} 2e(ss+1)s2.\displaystyle\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\frac{s}{s+1}}\right)^{\frac{s}{2}}.

Combining this bound with the expression above yields

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] 2e(v+sb)s(ss+1)s2Md(s)\displaystyle\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\sqrt{v}+\sqrt{s}b}\right)^{s}\mathopen{}\mathclose{{}\left(\frac{s}{s+1}}\right)^{\frac{s}{2}}M_{d}(s)
2e(v+sb)sMd(s).\displaystyle\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\sqrt{v}+\sqrt{s}b}\right)^{s}M_{d}(s).

This completes the proof of Theorem 5.

B.4 Proof of Theorem 6

Since Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}) is sub-gamma with parameters (1,0)(1,0) it follows that Y=(2/σ)X+ZY=(\sqrt{2}/\sigma)X+Z is sub-gamma with parameters (1+2v/σ2,2b/σ)(1+2v/\sigma^{2},\sqrt{2}b/\sigma). For t>rt>-r we can apply Theorem 5 to obtain

𝔼[Yr+t]2e(1+2v/σ2+r+t2b/σ)r+tM¯d(r+t)=2eσr+tm(r+t).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+t}}\right]\leq\sqrt{2e}\mathopen{}\mathclose{{}\left(\sqrt{1+2v/\sigma^{2}}+\sqrt{r+t}\sqrt{2}b/\sigma}\right)^{r+t}\overline{M}_{d}(r+t)=\frac{\sqrt{2e}}{\sigma^{r+t}}m(r+t).

Under the specified value of λ=(m(ϵ)/m(ϵ))1/(2ϵ)\lambda=(m(\epsilon)/m(-\epsilon))^{1/(2\epsilon)}, it then follows from (A.13) that

𝔼[k(X,X)]2eαd,pσrϵm(ϵ)m(ϵ).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\sqrt{2e}\alpha_{d,p}}{\sigma^{r}\epsilon}\sqrt{m(-\epsilon)m(\epsilon)}. (A.18)

To proceed, let (v,b)=(σ2+2v,2b)(v^{\prime},b^{\prime})=(\sigma^{2}+2v,\sqrt{2}b) and consider the decomposition

log(m(ϵ)m(ϵ))\displaystyle\log(m(-\epsilon)m(\epsilon)) =2logm(0)+A+B,\displaystyle=2\log m(0)+A+B,

where

A\displaystyle A (rϵ)log(v+rϵb)+(r+ϵ)log(v+r+ϵb)2rlog(v+rb)\displaystyle\coloneqq(r-\epsilon)\log\mathopen{}\mathclose{{}\left(\sqrt{v^{\prime}}+\sqrt{r-\epsilon}\,b^{\prime}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(\sqrt{v^{\prime}}+\sqrt{r+\epsilon}\,b^{\prime}}\right)-2r\log(\sqrt{v^{\prime}}+\sqrt{r}b^{\prime})
B\displaystyle B logM¯d(rϵ)+logM¯d(r+ϵ)2logM¯d(r).\displaystyle\coloneqq\log\overline{M}_{d}(r-\epsilon)+\log\overline{M}_{d}(r+\epsilon)-2\log\overline{M}_{d}(r).

Using the basic inequalities 1+x1x/2\sqrt{1+x}-1\leq x/2 and log(1+x)x\log(1+x)\leq x, the term AA can be bounded from above as follows:

A\displaystyle A =(rϵ)log(1+(rϵr)bv+rb)+(r+ϵ)log(1+(r+ϵr)bv+rb)\displaystyle=(r-\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{(\sqrt{r-\epsilon}-\sqrt{r})b^{\prime}}{\sqrt{v^{\prime}}+\sqrt{r}b^{\prime}}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{(\sqrt{r+\epsilon}-\sqrt{r})b^{\prime}}{\sqrt{v^{\prime}}+\sqrt{r}b^{\prime}}}\right)
(rϵ)log(1+rϵrr)+(r+ϵ)log(1+r+ϵrr)\displaystyle\leq(r-\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{\sqrt{r-\epsilon}-\sqrt{r}}{\sqrt{r}}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{\sqrt{r+\epsilon}-\sqrt{r}}{\sqrt{r}}}\right)
(rϵ)log(1ϵ2r)+(r+ϵ)log(1+ϵ2r)\displaystyle\leq(r-\epsilon)\log\mathopen{}\mathclose{{}\left(1-\frac{\epsilon}{2r}}\right)+(r+\epsilon)\log\mathopen{}\mathclose{{}\left(1+\frac{\epsilon}{2r}}\right)
(rϵ)ϵ2r+(r+ϵ)ϵ2r\displaystyle\leq-(r-\epsilon)\frac{\epsilon}{2r}+(r+\epsilon)\frac{\epsilon}{2r}
=ϵ2r.\displaystyle=\frac{\epsilon^{2}}{r}.

Similarly, one finds that

B\displaystyle B =d+rϵ12log(1ϵd+r)+d+r+ϵ12log(1+ϵd+r)ϵ2d+r.\displaystyle=\frac{d+r-\epsilon-1}{2}\log\mathopen{}\mathclose{{}\left(1-\frac{\epsilon}{d+r}}\right)+\frac{d+r+\epsilon-1}{2}\log\mathopen{}\mathclose{{}\left(1+\frac{\epsilon}{d+r}}\right)\leq\frac{\epsilon^{2}}{d+r}.

Combining these bounds with the fact that rdr\geq d leads to

m(ϵ)m(ϵ)\displaystyle\sqrt{m(-\epsilon)m(\epsilon)} (v+rb)rM¯d(r)exp(3ϵ24d).\displaystyle\leq(\sqrt{v^{\prime}}+\sqrt{r}b^{\prime})^{r}\overline{M}_{d}(r)\exp\mathopen{}\mathclose{{}\left(\frac{3\epsilon^{2}}{4d}}\right).

Plugging this inequality back into (A.18) yields

𝔼[k(X,X)]2eαd,pϵ(1+2v/σ2+2rb/σ)rM¯d(r)exp(3ϵ24d).\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\frac{\sqrt{2e}\alpha_{d,p}}{\epsilon}(\sqrt{1+2v/\sigma^{2}}+\sqrt{2r}b/\sigma)^{r}\overline{M}_{d}(r)\exp\mathopen{}\mathclose{{}\left(\frac{3\epsilon^{2}}{4d}}\right). (A.19)

Finally, by the lower bound logΓ(z)(z12)logzz+12log(2π)\log\Gamma(z)\geq(z-\frac{1}{2})\log z-z+\frac{1}{2}\log(2\pi) and the basic inequality (1+p/d)dep(1+p/d)^{d}\leq e^{p} for p,d0p,d\geq 0 we can write

αd,pM¯d(r)\displaystyle\alpha_{d,p}\overline{M}_{d}(r) π(d+p)d+pepdddd+pπ(d+p)pd.\displaystyle\leq\frac{\sqrt{\pi}(d+p)^{d+p}}{e^{p}d^{d}}\frac{d}{\sqrt{d+p}}\leq\sqrt{\pi}(d+p)^{p}\sqrt{d}.

Hence,

𝔼[k(X,X)]2πe(d+p)pd(1+2v/σ2+2rb/σ)rexp(3ϵ24d)ϵ.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\leq\sqrt{2\pi e}(d+p)^{p}\sqrt{d}\mathopen{}\mathclose{{}\left(\sqrt{1+2v/\sigma^{2}}+\sqrt{2r}b/\sigma}\right)^{r}\frac{\exp\mathopen{}\mathclose{{}\left(\frac{3\epsilon^{2}}{4d}}\right)}{\epsilon}.

This bound holds for all ϵ(0,r]\epsilon\in(0,r]. Evaluating with ϵ=d\epsilon=\sqrt{d} and recalling that r=d+2pr=d+2p gives the stated result.

B.5 Proof of Lemma 7

Note that QijQ_{ij} is absolutely continuous with respect to PPP\otimes P and let λij=dQij/d(PP)\lambda_{ij}=dQ_{ij}/d(P\otimes P) denote the Radon-Nikodym derivative. Then

rij\displaystyle r_{ij} =k(λij1)d(PP)\displaystyle=\int k(\lambda_{ij}-1)d(P\otimes P)
=k(λij+1)(λij1)d(PP)\displaystyle=\int k(\sqrt{\lambda_{ij}}+1)(\sqrt{\lambda_{ij}}-1)d(P\otimes P)
2(k2𝑑Qij+k2d(PP))dH(Qij,PP),\displaystyle\leq\sqrt{2}\mathopen{}\mathclose{{}\left(\sqrt{\int k^{2}dQ_{ij}}+\sqrt{\int k^{2}d(P\otimes P)}}\right)d_{H}(Q_{ij},P\otimes P),

where the last step follows from the Cauchy-Schwarz and Minkowski inequalities, and we have used the fact that dH2(Qij,PP)=12(λij1)2d(PP)d^{2}_{H}(Q_{ij},P\otimes P)=\frac{1}{2}\int(\sqrt{\lambda_{ij}}-1)^{2}d(P\otimes P).

Next, since k2(x,y)k(x,x)k(y,y)k^{2}(x,y)\leq k(x,x)k(y,y) for any positive semidefinite kernel, and since both marginals of QijQ_{ij} are equal to PP, it follows from the Cauchy-Schwarz inequality that

k2(x,y)𝑑Qij(x,y)k(x,x)k(y,y)𝑑Qij(x,y)k2(x,x)𝑑P(x),\displaystyle\int k^{2}(x,y)dQ_{ij}(x,y)\leq\int k(x,x)k(y,y)dQ_{ij}(x,y)\leq\int k^{2}(x,x)dP(x),

and thus the stated result follows from the assumption 𝔼P[k2(X,X)]Ck,P2\mathbb{E}_{P}\mathopen{}\mathclose{{}\left[k^{2}(X,X)}\right]\leq C_{k,P}^{2}.

Appendix C Experimental Details and Additional Results

In this section, we provide details of the experiments in Section 5 of the main text and additional numerical results. Our experiments are based on the two-moment kernel given in Definition 1.

C.1 Numerical Computation of the Two-Moment Kernel

To evaluate the two-moment kernel given in Definition 1 we need to numerically compute the function Md,u(s)M_{d,u}(s), which is the ss-th moment of the non-central chi-distribution with dd degrees of freedom and parameter uu. For all s0s\geq 0, this function can be written as a Poisson mixture of the (central) moments according to

Md,u(s)\displaystyle M_{d,u}(s) =k=0u2kexp(12u2)2kk!Md+2k,0(s).\displaystyle=\sum_{k=0}^{\infty}\frac{u^{2k}\exp(-\frac{1}{2}u^{2})}{2^{k}k!}M_{d+2k,0}(s). (A.20)

This series can be approximated efficiently by retaining only the terms with ku2/2k\approx u^{2}/2.
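A minimal Python sketch of this truncation (not taken from the paper's implementation; the truncation level and the use of scipy.special.gammaln for numerical stability are implementation choices) is:

import numpy as np
from scipy.special import gammaln

def central_chi_log_moment(d, s):
    # log of M_{d,0}(s) = 2^{s/2} Gamma((d+s)/2) / Gamma(d/2)
    return 0.5 * s * np.log(2.0) + gammaln((d + s) / 2.0) - gammaln(d / 2.0)

def noncentral_chi_moment(d, u, s):
    # s-th moment of the non-central chi distribution via the truncated Poisson
    # mixture (A.20); the weights concentrate around k ~ u^2 / 2.
    rate = 0.5 * u ** 2
    if rate == 0.0:
        return float(np.exp(central_chi_log_moment(d, s)))
    kmax = int(np.ceil(rate + 10.0 * np.sqrt(rate) + 10.0))
    k = np.arange(kmax + 1)
    log_weights = k * np.log(rate) - rate - gammaln(k + 1.0)
    return float(np.sum(np.exp(log_weights + central_chi_log_moment(d + 2 * k, s))))

# Sanity check: M_{d,u}(2) is the mean of the non-central chi-square, namely d + u^2,
# e.g. noncentral_chi_moment(5, 2.0, 2) is approximately 9.0.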

Alternatively, if s=2s=2\ell where \ell is an integer, then Md,u(2)M_{d,u}(2\ell) is the \ell-th moment of the chi-square distribution with dd degrees of freedom and non-centrality parameter u2u^{2}. The integer moments of this distribution can be obtained by differentiating the moment generating function. An explicit formula is given by (Johnson et al., 1995, pg. 448)

Md,u(2)=2Γ(+d/2)j=0(j)(u2/2)jΓ(j+d/2).\displaystyle M_{d,u}(2\ell)=2^{\ell}\Gamma(\ell+d/2)\sum_{j=0}^{\ell}\binom{\ell}{j}\frac{(u^{2}/2)^{j}}{\Gamma(j+d/2)}. (A.21)

Here we see that Md,u(2)M_{d,u}(2\ell) is a degree \ell polynomial in u2u^{2}.
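For even orders the finite sum (A.21) can be evaluated exactly, which also provides a cross-check of the truncated series above; a minimal sketch:

from math import comb
from scipy.special import gamma

def noncentral_chi_even_moment(d, u, ell):
    # M_{d,u}(2*ell): the ell-th moment of the non-central chi-square with d degrees
    # of freedom and non-centrality parameter u^2, per (A.21).
    total = sum(comb(ell, j) * (u ** 2 / 2.0) ** j / gamma(j + d / 2.0) for j in range(ell + 1))
    return 2.0 ** ell * gamma(ell + d / 2.0) * total

# Example: noncentral_chi_even_moment(5, 2.0, 1) equals 9.0, the mean d + u^2.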

Accordingly, for any tuple (d,p,σ,λ,ϵ)(d,p,\sigma,\lambda,\epsilon) such that d+2p±ϵd+2p\pm\epsilon are even integers, the two-moment kernel defined in (11) can be expressed as

k(x,y)=exp(xy24σ2)=0Lc(x+y2σ)2,\displaystyle k(x,y)=\exp\mathopen{}\mathclose{{}\left(-\frac{\|x-y\|^{2}}{4\sigma^{2}}}\right)\sum_{\ell=0}^{L}c_{\ell}\mathopen{}\mathclose{{}\left(\frac{\|x+y\|}{\sqrt{2}\sigma}}\right)^{2\ell}, (A.22)

where L=(d+2p+ϵ)/2L=(d+2p+\epsilon)/2 and the coefficients c0,,cLc_{0},\dots,c_{L} are given by

cαd,pϵ2Γ(+d/2)[λϵ2LϵΓ(Lϵ+d/2)(Lϵ)𝟏{Lϵ}+λϵ2LΓ(L+d/2)(L)],\displaystyle c_{\ell}\coloneqq\frac{\alpha_{d,p}}{\epsilon 2^{\ell}\Gamma(\ell+d/2)}\mathopen{}\mathclose{{}\left[\lambda^{\epsilon}2^{L-\epsilon}\Gamma(L-\epsilon+d/2)\binom{L-\epsilon}{\ell}\bm{1}_{\{\ell\leq L-\epsilon\}}+\lambda^{-\epsilon}2^{L}\Gamma(L+d/2)\binom{L}{\ell}}\right], (A.23)

with αd,p(2π)2(p+d)2d/2/Γ(d/2)\alpha_{d,p}\coloneqq(2\pi)2^{-(p+d)}2^{-d/2}/\Gamma(d/2).
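Putting the pieces together, a hedged sketch of evaluating the kernel numerically is given below. It assumes the two-moment kernel has the form implied by (A.4) and (A.10), namely the Gaussian factor exp(-‖x-y‖²/(4σ²)) multiplied by (α_{d,p}/(2ϵ))(λ^ϵ M_{d,u}(r-ϵ)+λ^{-ϵ}M_{d,u}(r+ϵ)) with u=‖x+y‖/(√2 σ) and r=d+2p. Since Definition 1 of the main text is not reproduced here, the normalization constant alpha and the moment routine Mdu (for example, noncentral_chi_moment above) are supplied by the caller and should be treated as assumptions.

import numpy as np

def two_moment_kernel(x, y, p, sigma, lam, eps, alpha, Mdu):
    # Assumed form: Gaussian factor times a two-point combination of non-central
    # chi moments evaluated at u = ||x + y|| / (sqrt(2) * sigma); see (A.4) and (A.10).
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d = x.shape[0]
    r = d + 2 * p
    u = np.linalg.norm(x + y) / (np.sqrt(2.0) * sigma)
    J = (lam ** eps * Mdu(d, u, r - eps) + lam ** (-eps) * Mdu(d, u, r + eps)) / (2.0 * eps)
    return float(np.exp(-np.linalg.norm(x - y) ** 2 / (4.0 * sigma ** 2)) * alpha * J)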

C.2 Details for Example 1

We now consider Example 1, a specific example of a sub-gamma distribution which shows that the upper bound in Theorem 6 is tight with respect to the scaling of the dimension dd and the scale parameter bb. Specifically, let X=UZX=\sqrt{U}Z where Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}) is a standard Gaussian vector and UU is an independent Gamma random variable with shape parameter 1/(2b2)1/(2b^{2}) and scale parameter 2b22b^{2}.

Lemma A.5.

For αd\alpha\in\mathbb{R}^{d} such that α<1/b\|\alpha\|<1/b, it holds that

log𝔼[exp(αX)]=12b2log(1α2b2).\displaystyle\log\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\alpha^{\top}X)}\right]=-\frac{1}{2b^{2}}\log\mathopen{}\mathclose{{}\left(1-\|\alpha\|^{2}b^{2}}\right). (A.24)

In particular, this means that XX is a sub-gamma random vector with parameters (1,b)(1,b). Furthermore, for s>max{b2,d}s>\max\{-b^{-2},-d\},

𝔼[Xs]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right] =bsMb2(s)Md(s).\displaystyle=b^{s}M_{b^{-2}}(s)M_{d}(s). (A.25)
Proof.

Observe that αX=UαZ\alpha^{\top}X=\sqrt{U}\alpha^{\top}Z where αZ𝒩(0,α2)\alpha^{\top}Z\sim\mathcal{N}(0,\|\alpha\|^{2}). Hence

𝔼[exp(αX)]=𝔼[exp(α22U)].\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\exp(\alpha^{\top}X)}\right]=\mathbb{E}\mathopen{}\mathclose{{}\left[\exp\mathopen{}\mathclose{{}\left(\frac{\|\alpha\|^{2}}{2}U}\right)}\right].

Recognizing the right-hand side as the moment generating function of the Gamma distribution evaluated at α2/2\|\alpha\|^{2}/2 and taking logarithms yields (A.24). To see that this distribution satisfies the sub-gamma condition, we use the basic inequality log(1x)x/(1x)x/(1x)-\log(1-x)\leq x/(1-x)\leq x/(1-\sqrt{x}) for all x(0,1)x\in(0,1).

The expression for the moments follows immediately from the independence of UU and ZZ and the fact that U/b2U/b^{2} has a Gamma distribution with shape parameter b2/2b^{-2}/2 and scale parameter 22, which implies that U/b\sqrt{U}/b has a chi distribution with b2b^{-2} degrees of freedom. ∎
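As a quick numerical sanity check of (A.25) (not part of the paper's experiments), one can sample X=√U Z directly and compare the empirical moment with the closed form; the parameter values below are illustrative choices.

import numpy as np
from scipy.special import gamma

def chi_moment(nu, s):
    # M_nu(s) = 2^{s/2} Gamma((nu + s)/2) / Gamma(nu/2)
    return 2.0 ** (s / 2.0) * gamma((nu + s) / 2.0) / gamma(nu / 2.0)

rng = np.random.default_rng(0)
d, b, s, n = 5, 0.5, 3, 200_000
U = rng.gamma(shape=1.0 / (2 * b ** 2), scale=2 * b ** 2, size=n)
Z = rng.standard_normal((n, d))
X = np.sqrt(U)[:, None] * Z

empirical = np.mean(np.linalg.norm(X, axis=1) ** s)
exact = b ** s * chi_moment(1.0 / b ** 2, s) * chi_moment(d, s)
# empirical and exact should agree up to Monte-Carlo error (both are roughly 15.0 here).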

Since XX satisfies the sub-gamma condition with parameters (1,b)(1,b) the upper bound in Theorem 6 applies. Alternatively, for each pair (ϵ,λ)(\epsilon,\lambda) we can consider the exact expression for 𝔼[k(X,X)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right] given in (A.10) where r=d+2pr=d+2p and

Y=(2σ2U+1)1/2Z.Y=\mathopen{}\mathclose{{}\left(\frac{2}{\sigma^{2}}U+1}\right)^{1/2}Z.

Since (A.10) holds for whichever λ\lambda defines the kernel, the right-hand side is bounded from below by its minimum over λ\lambda, which yields

𝔼[k(X,X)]αd,pϵ(𝔼[Yrϵ]𝔼[Yr+ϵ])1/2.\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]\geq\frac{\alpha_{d,p}}{\epsilon}\mathopen{}\mathclose{{}\left(\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r-\epsilon}}\right]\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{r+\epsilon}}\right]}\right)^{1/2}. (A.26)

To get a lower bound on the moments, we use

𝔼[Ys]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\|Y\|^{s}}\right] (2σ)s𝔼[Xs].\displaystyle\geq\mathopen{}\mathclose{{}\left(\frac{\sqrt{2}}{\sigma}}\right)^{s}\mathbb{E}\mathopen{}\mathclose{{}\left[\|X\|^{s}}\right]. (A.27)

Combining the above displays leads to (22). Using Stirling’s approximation logΓ(z)=(z12)logzz+12log(2π)+o(1)\log\Gamma(z)=(z-\frac{1}{2})\log z-z+\frac{1}{2}\log(2\pi)+o(1) as zz\to\infty it can be verified that the minimum of this lower bound with respect to ϵ\epsilon satisfies the same scaling behavior with respect to dd as the upper bound in Theorem 6. Namely, the bound is exponential in dd if δ1/2\delta\geq 1/2 and superexponential in dd if δ<1/2\delta<1/2.

C.3 Experiments in Section 5.1

In this experiment, p=1p=1, the random variable XdX\in\mathbb{R}^{d} is generated according to the distribution in Example 1, and the kernel bandwidth σ\sigma takes values 11 and 44. The parameters (λ,ϵ)(\lambda,\epsilon) of the two-moment kernel are specified as in Theorem 6 with parameters (1,b)(1,b), and k(x,y)k(x,y) can be computed as in Appendix C.1.

In the Monte-Carlo computation of the average of Δ^γ2\hat{\Delta}_{\gamma}^{2} (the right column of Figure 2), 2n2n samples of XX are partitioned into two independent datasets {Xi}i=1n\{X_{i}\}_{i=1}^{n} and {Xi}i=1n\{X_{i}^{\prime}\}_{i=1}^{n}, each having nn samples. The kernel MMD (squared) distance has the empirical estimator Gretton et al. (2012)

γk2(Pn,Pn)=1n2i,j=1nk(Xi,Xj)+1n2i,j=1nk(Xi,Xj)2n2i,j=1nk(Xi,Xj),\gamma^{2}_{k}(P_{n},P_{n}^{\prime})=\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j})+\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i}^{\prime},X_{j}^{\prime})-\frac{2}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j}^{\prime}),

and then, by definition,

𝔼[γk2(Pn,Pn)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\gamma^{2}_{k}(P_{n},P_{n}^{\prime})}\right] =2(1n𝔼[k(X,X)]+(11n)𝔼[k(X,X)])2𝔼[k(X,X)]\displaystyle=2\mathopen{}\mathclose{{}\left(\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]+(1-\frac{1}{n})\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]}\right)-2\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]
=2n(𝔼[k(X,X)]𝔼[k(X,X)]).\displaystyle=\frac{2}{n}(\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]-\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]).

Recall that

γk2(P,Pn)=k(x,x)𝑑P(x)𝑑P(x)+1n2i,j=1nk(Xi,Xj)2ni=1nk(x,Xi)𝑑P(x),\gamma^{2}_{k}(P,P_{n})=\int\int k(x,x^{\prime})dP(x)dP(x^{\prime})+\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(X_{i},X_{j})-\frac{2}{n}\sum_{i=1}^{n}\int k(x,X_{i})dP(x),

and then

𝔼[γk2(P,Pn)]\displaystyle\mathbb{E}\mathopen{}\mathclose{{}\left[\gamma^{2}_{k}(P,P_{n})}\right] =𝔼[k(X,X)]+1n𝔼[k(X,X)]+(11n)𝔼[k(X,X)]2𝔼[k(X,X)]\displaystyle=\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]+\frac{1}{n}\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]+(1-\frac{1}{n})\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]-2\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]
=1n(𝔼[k(X,X)]𝔼[k(X,X)]).\displaystyle=\frac{1}{n}(\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]-\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]).

Thus, if we define

Δ^γ2n2γk2(Pn,Pn),\hat{\Delta}_{\gamma}^{2}\coloneqq\frac{n}{2}\gamma_{k}^{2}(P_{n},P_{n}^{\prime}),

the expectation of Δ^γ2\hat{\Delta}_{\gamma}^{2} equals 𝔼[k(X,X)]𝔼[k(X,X)]=𝔼[nγk2(P,Pn)]\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X)}\right]-\mathbb{E}\mathopen{}\mathclose{{}\left[k(X,X^{\prime})}\right]=\mathbb{E}\mathopen{}\mathclose{{}\left[n\gamma^{2}_{k}(P,P_{n})}\right].
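For reference, a minimal sketch of this Monte-Carlo procedure is given below; kernel denotes any symmetric positive semidefinite kernel function of two vectors (for instance, the two-moment kernel sketch in Appendix C.1), and the estimator is the plug-in form displayed above, including the diagonal terms.

import numpy as np

def mmd_squared(X, Xp, kernel):
    # Plug-in estimate of gamma_k^2(P_n, P_n') as in the display above.
    Kxx = np.array([[kernel(a, b) for b in X] for a in X])
    Kyy = np.array([[kernel(a, b) for b in Xp] for a in Xp])
    Kxy = np.array([[kernel(a, b) for b in Xp] for a in X])
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def delta_hat_squared(samples, kernel):
    # hat{Delta}_gamma^2 = (n/2) * gamma_k^2(P_n, P_n') after splitting 2n samples into halves.
    n = samples.shape[0] // 2
    return (n / 2.0) * mmd_squared(samples[:n], samples[n:2 * n], kernel)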

C.4 Experiments in Section 5.2

In this experiment, d=5d=5, p=1p=1, σ=1/2\sigma=1/2 and the parameters (ϵ,λ)(\epsilon,\lambda) of the two-moment kernel are specified as in Theorem 6 with parameters (1,0)(1,0). Figure 3 in the main text plots the values of γk2(Pn,Pn)\gamma_{k}^{2}\mathopen{}\mathclose{{}\left(P_{n},P^{\prime}_{n}}\right) as a function of increasing NN and for various values of nn. Figure A.1 plots γk2(Pn,Pn)\gamma_{k}^{2}\mathopen{}\mathclose{{}\left(P_{n},P^{\prime}_{n}}\right) as a function of increasing nn and for various values of NN. Note that in this setting, the typical correlation between samples is of magnitude 1/N1/\sqrt{N}, and thus the overall dependence is not negligible when NN is relatively small compared to nn.

Figure A.1: Empirical values of γk2(Pn,Pn)\gamma_{k}^{2}(P_{n},P_{n}^{\prime}) as a function of nn for various values of NN for dependent samples in Example 2, that is, the same experiment as in Figure 3. (Left) Mean and standard deviation averaged over 100 realizations. (Middle) The log10\log_{10} of the mean value in the left plot. (Right) The log-log plot of the mean value in the left plot.