
Limit distribution theory for smooth $p$-Wasserstein distances

Ziv Goldfeld School of Electrical and Computer Engineering, Cornell University. [email protected] Kengo Kato Department of Statistics and Data Science, Cornell University. [email protected] Sloan Nietert Department of Computer Science, Cornell University. [email protected] and Gabriel Rioux Center for Applied Mathematics, Cornell University. [email protected]
Abstract.

The Wasserstein distance is a metric on a space of probability measures that has seen a surge of applications in statistics, machine learning, and applied mathematics. However, statistical aspects of Wasserstein distances are bottlenecked by the curse of dimensionality, whereby the number of data points needed to accurately estimate them grows exponentially with dimension. Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality, giving rise to a parametric convergence rate in any dimension, while preserving the Wasserstein metric and topological structure. To facilitate valid statistical inference, in this work, we develop a comprehensive limit distribution theory for the empirical smooth Wasserstein distance. The limit distribution results leverage the functional delta method after embedding the domain of the Wasserstein distance into a certain dual Sobolev space, characterizing its Hadamard directional derivative for the dual Sobolev norm, and establishing weak convergence of the smooth empirical process in the dual space. To estimate the distributional limits, we also establish consistency of the nonparametric bootstrap. Finally, we use the limit distribution theory to study applications to generative modeling via minimum distance estimation with the smooth Wasserstein distance, showing asymptotic normality of optimal solutions for the quadratic cost.

Key words and phrases:
Dual Sobolev space, Gaussian smoothing, functional delta method, limit distribution theory, Wasserstein distance
1991 Mathematics Subject Classification:
60F05, 62E20, and 62G09
Z. Goldfeld is supported by the NSF CRII grant CCF-1947801, in part by the 2020 IBM Academic Award, and in part by the NSF CAREER Award CCF-2046018. K. Kato is partially supported by the NSF grants DMS-1952306 and DMS-2014636. S. Nietert is supported by the NSF Graduate Research Fellowship under Grant DGE-1650441.

1. Introduction

1.1. Overview

The Wasserstein distance is an instance of the Kantorovich optimal transport problem [Kan42], which defines a metric on a space of probability measures. Specifically, for $1\leq p<\infty$, the $p$-Wasserstein distance between two Borel probability measures $\mu$ and $\nu$ on $\mathbb{R}^{d}$ with finite $p$th moments is defined by

\mathsf{W}_{p}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\left[\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}|x-y|^{p}\,d\pi(x,y)\right]^{1/p}, (1)

where $\Pi(\mu,\nu)$ is the set of couplings (or transportation plans) of $\mu$ and $\nu$. The Wasserstein distance has seen a surge of applications in statistics, machine learning, and applied mathematics, ranging from generative modeling [ACB17, GAA+17, TBGS18], image recognition [RTG00, SL11], and domain adaptation [CFT14, CFTR16] to robust optimization [GK16, MEK18, BMZ21] and partial differential equations [JKO98, San17]. The widespread applicability of the Wasserstein distance is driven by an array of desirable properties, including its metric structure ($\mathsf{W}_{p}$ metrizes weak convergence plus convergence of $p$th moments), a convenient dual form, robustness to support mismatch, and the rich geometry it induces on a space of probability measures. We refer to [Vil03, Vil08, AGS08, San15] as standard references on optimal transport theory.

However, statistical aspects of Wasserstein distances are bottlenecked by the curse of dimensionality, whereby the number of data points needed to accurately estimate them grows exponentially with dimension. Specifically, for the empirical distribution $\hat{\mu}_{n}$ of $n$ independent observations from a distribution $\mu$ on $\mathbb{R}^{d}$, it is known that $\operatorname{\mathbb{E}}\big[\mathsf{W}_{p}(\hat{\mu}_{n},\mu)\big]$ scales as $n^{-1/d}$ for $d>2p$ under moment conditions [DSS13, BLG14, FG15, WB19a, Lei20]. This slow rate renders performance guarantees in terms of $\mathsf{W}_{p}$ all but vacuous when $d$ is large. It is also a roadblock towards a more delicate statistical analysis concerning limit distributions, bootstrap, and valid inference.

Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality of empirical $\mathsf{W}_{p}$ [GGNP20, GG20, GGK20, SGK21, NGK21]. For $\sigma>0$, the smooth $p$-Wasserstein distance is defined as $\mathsf{W}_{p}^{(\sigma)}(\mu,\nu):=\mathsf{W}_{p}(\mu*\gamma_{\sigma},\nu*\gamma_{\sigma})$, where $*$ denotes convolution and $\gamma_{\sigma}=N(0,\sigma^{2}I_{d})$ is the isotropic Gaussian distribution with variance parameter $\sigma^{2}$. For sufficiently sub-Gaussian $\mu$, [GGNP20] showed that the expected smooth distance between $\hat{\mu}_{n}$ and $\mu$ exhibits the parametric convergence rate, i.e., $\operatorname{\mathbb{E}}\big[\mathsf{W}^{(\sigma)}_{1}(\hat{\mu}_{n},\mu)\big]=O(n^{-1/2})$ in any dimension. This is a significant departure from the $n^{-1/d}$ rate in the unsmoothed case. [GG20] further showed that $\mathsf{W}^{(\sigma)}_{1}$ maintains the metric and topological structure of $\mathsf{W}_{1}$ and is able to approximate it within a $\sigma\sqrt{d}$ gap. The structural properties and fast empirical convergence rates were later extended to $p>1$ in [NGK21]. Other follow-up works explored relations between $\mathsf{W}_{p}^{(\sigma)}$ and maximum mean discrepancies [ZCR21], analyzed its rate of decay as $\sigma\to\infty$ [CNW21], and adopted it as a performance metric for nonparametric mixture model estimation [HMS21].
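For intuition, the definition suggests a simple Monte Carlo proxy for $\mathsf{W}_{p}^{(\sigma)}$ from data: perturb each sample point with independent $N(0,\sigma^{2}I_{d})$ noise, which yields samples from $\mu*\gamma_{\sigma}$ and $\nu*\gamma_{\sigma}$, and compute the empirical Wasserstein distance between the noisy samples. The sketch below is an illustration only (not a procedure taken from the works cited above); it works in dimension $d=1$, where $\mathsf{W}_{p}$ between equal-size empirical measures reduces to an average of $p$th-power gaps between order statistics, and the sample sizes and parameter values are arbitrary.

    import numpy as np

    def smooth_wasserstein_1d(x, y, p=2.0, sigma=1.0, seed=0):
        """Monte Carlo sketch of W_p^(sigma) between two 1-d samples x, y of equal
        length: add N(0, sigma^2) noise (sampling from mu*gamma_sigma and
        nu*gamma_sigma) and use the sorted-sample formula for W_p in d = 1."""
        rng = np.random.default_rng(seed)
        x_s = np.sort(x + sigma * rng.standard_normal(x.shape))
        y_s = np.sort(y + sigma * rng.standard_normal(y.shape))
        # In d = 1 the optimal coupling matches order statistics, so W_p^p between
        # the noisy empirical measures is approximated by the mean p-th power gap
        # between sorted noisy samples.
        return np.mean(np.abs(x_s - y_s) ** p) ** (1.0 / p)

    # toy usage: two Gaussian samples of equal size
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=2000)
    y = rng.normal(0.5, 1.0, size=2000)
    print(smooth_wasserstein_1d(x, y, p=2.0, sigma=0.5))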

A limit distribution theory for $\mathsf{W}^{(\sigma)}_{1}$ was developed in [GGK20, SGK21], where the scaled empirical distance $\sqrt{n}\,\mathsf{W}^{(\sigma)}_{1}(\hat{\mu}_{n},\mu)$ was shown to converge in distribution to the supremum of a tight Gaussian process in every dimension $d$ under mild moment conditions. This result relies on the dual formulation of $\mathsf{W}_{1}$ as an integral probability metric (IPM) over the class of 1-Lipschitz functions. Gaussian smoothing shrinks the function class to that of 1-Lipschitz functions convolved with a Gaussian density, which is shown to be $\mu$-Donsker in every dimension, thereby yielding the limit distribution. Extending these results to empirical $\mathsf{W}_{p}^{(\sigma)}$ with $p>1$, however, requires substantially new ideas due to the lack of an IPM structure. Consequently, works exploring $\mathsf{W}_{p}^{(\sigma)}$ with $p>1$, such as [ZCR21, NGK21], did not contain limit distribution results for it, and this question remained largely open.

The present paper closes this gap and provides a comprehensive limit distribution theory for empirical $\mathsf{W}_{p}^{(\sigma)}$ with $p>1$. Our main limit distribution results are summarized in the following theorem, where the 'null' refers to the case $\mu=\nu$, while the 'alternative' corresponds to $\mu\neq\nu$. In all that follows, the dimension $d\geq 1$ is arbitrary.

Theorem 1.1 (Main results).

Let $1<p<\infty$, and let $\mu,\nu$ be Borel probability measures on $\mathbb{R}^{d}$ with finite $p$th moments. Let $\hat{\mu}_{n}=n^{-1}\sum_{i=1}^{n}\delta_{X_{i}}$ and $\hat{\nu}_{n}=n^{-1}\sum_{i=1}^{n}\delta_{Y_{i}}$ be the empirical distributions of independent observations $X_{1},\ldots,X_{n}\sim\mu$ and $Y_{1},\ldots,Y_{n}\sim\nu$. Suppose that $\mu$ satisfies Condition (4) ahead (which requires $\mu$ to be sub-Gaussian).

  (i)

    (One-sample null case) We have

    \sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)\stackrel{d}{\to}\sup_{\substack{\varphi\in C_{0}^{\infty}:\\ \|\varphi\|_{\dot{H}^{1,q}(\mu*\gamma_{\sigma})}\leq 1}}G_{\mu}(\varphi),

    where $G_{\mu}=(G_{\mu}(\varphi))_{\varphi\in C_{0}^{\infty}}$ is a centered Gaussian process whose paths are linear and continuous with respect to (w.r.t.) the Sobolev seminorm $\|\varphi\|_{\dot{H}^{1,q}(\mu*\gamma_{\sigma})}:=\|\nabla\varphi\|_{L^{q}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}$. Here $q$ is the conjugate index of $p$, i.e., $1/p+1/q=1$.

  (ii)

    (Two-sample null case) If $\nu=\mu$, then we have

    \sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})\stackrel{d}{\to}\sup_{\substack{\varphi\in C_{0}^{\infty}:\\ \|\varphi\|_{\dot{H}^{1,q}(\mu*\gamma_{\sigma})}\leq 1}}\big[G_{\mu}(\varphi)-G_{\mu}^{\prime}(\varphi)\big],

    where $G_{\mu}^{\prime}$ is an independent copy of $G_{\mu}$.

  (iii)

    (One-sample alternative case) If $\nu\neq\mu$ and $\nu$ is sub-Weibull, then we have

    \sqrt{n}\big(\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu)-\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big)\stackrel{d}{\to}N\left(0,\frac{\operatorname{Var}_{\mu}(g*\phi_{\sigma})}{p^{2}\big[\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big]^{2(p-1)}}\right),

    where $g$ is an optimal transport potential from $\mu*\gamma_{\sigma}$ to $\nu*\gamma_{\sigma}$ for $\mathsf{W}_{p}^{p}$, and $\phi_{\sigma}(x)=(2\pi\sigma^{2})^{-d/2}e^{-|x|^{2}/(2\sigma^{2})}$ is the Gaussian density.

  (iv)

    (Two-sample alternative case) If $\nu\neq\mu$ and $\nu$ satisfies Condition (4), then we have

    \sqrt{n}\big(\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})-\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big)\stackrel{d}{\to}N\left(0,\frac{\operatorname{Var}_{\mu}(g*\phi_{\sigma})+\operatorname{Var}_{\nu}(g^{c}*\phi_{\sigma})}{p^{2}\big[\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big]^{2(p-1)}}\right),

    where $g^{c}$ is the $c$-transform of $g$ for the cost function $c(x,y)=|x-y|^{p}$.

Parts (i) and (ii) show that the null limit distributions are non-Gaussian. On the other hand, Parts (iii) and (iv) establish asymptotic normality of empirical $\mathsf{W}_{p}^{(\sigma)}$ under the alternative. Notably, these results have the correct centering, $\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)$, which enables us to construct confidence intervals for $\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)$.
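For illustration, Part (iv) yields the usual Wald-type interval: writing $\hat{v}_{n}$ for any consistent estimate of the limiting variance (for instance, a bootstrap-based estimate in the spirit of the consistency results discussed below; the notation $\hat{v}_{n}$ is introduced only for this illustration),

\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})\pm z_{1-\alpha/2}\sqrt{\hat{v}_{n}/n}

is an asymptotically valid $(1-\alpha)$-confidence interval for $\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)$, where $z_{1-\alpha/2}$ denotes the standard normal quantile.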

The proof strategy for Theorem 1.1 differs from existing approaches to limit distribution theory for empirical $\mathsf{W}_{p}$ for general distributions. In fact, an analog of Theorem 1.1 is not known to hold for classic $\mathsf{W}_{p}$ in this generality, except for the special case where $\mu,\nu$ are discrete (see the literature review below for details). The key insight is to regard $\mathsf{W}_{p}^{(\sigma)}$ as a functional defined on a subset of a certain dual Sobolev space. We show that the smooth empirical process converges weakly in the dual Sobolev space and that $\mathsf{W}_{p}^{(\sigma)}$ is Hadamard (directionally) differentiable w.r.t. the dual Sobolev norm. We then employ the extended functional delta method [Sha90, Rö04] to obtain the limit distribution of one- and two-sample empirical $\mathsf{W}_{p}^{(\sigma)}$ under both the null and the alternative.

The limit distributions in Theorem 1.1 are non-pivotal in the sense that they depend on the population distributions $\mu$ and $\nu$, which are unknown in practice. To facilitate statistical inference using $\mathsf{W}_{p}^{(\sigma)}$, we employ the bootstrap to estimate the limit distributions and prove its consistency for each case of Theorem 1.1. Under the alternative, the consistency follows from the linearity of the Hadamard derivative. Under the null, where the Hadamard (directional) derivative is nonlinear, the bootstrap consistency is not obvious but still holds. This is somewhat surprising in light of [Düm93, FS19], where it is demonstrated that the bootstrap, in general, fails to be consistent for functionals whose Hadamard directional derivatives are nonlinear (cf. Proposition 1 in [Düm93] or Corollary 3.1 in [FS19]). Nevertheless, our application of the bootstrap differs from [Düm93, FS19], so there is no contradiction, and the specific structure of the Hadamard derivative of $\mathsf{W}_{p}^{(\sigma)}$ allows us to establish consistency under the null (see the discussion after Proposition 3.8 for more details). These bootstrap consistency results enable constructing confidence intervals for $\mathsf{W}_{p}^{(\sigma)}$ and using it to test the equality of distributions.

As an application of the limit theory, we study implicit generative modeling under the minimum distance estimation (MDE) framework [Wol57, Pol80, PS80]. MDE extends the maximum-likelihood principle beyond the KL divergence and applies to models supported on low-dimensional manifolds [ACB17] (where the KL divergence is not well-defined), as well as to cases where the likelihood function is intractable [GMR93]. For MDE with $\mathsf{W}_{p}^{(\sigma)}$, we establish limit distribution results for the optimal solution and the smooth $p$-Wasserstein error. Our results hold for arbitrary dimension, again contrasting the classic case, where analogous distributional limits for MDE with $\mathsf{W}_{p}$ are known only for $p=d=1$ [BJGR19]. Remarkably, when $p=2$, the Hilbertian structure of the underlying dual Sobolev space allows showing asymptotic normality of the MDE solution.

1.2. Literature review

Analysis of empirical Wasserstein distances, or more generally empirical optimal transport distances, has been an active research area in the statistics and probability theory literature. In particular, significant attention has been devoted to rates of convergence and exact asymptotics [Dud69, AKT84, Tal92, Tal94, DY95, BdMM02, BGV07, DSS13, BB13, BLG14, FG15, WB19a, AST19, Led19, BL21, Lei20, CRL+20, MNW21, DGS21, MBNWW21]. As noted before, the empirical Wasserstein distance suffers from the curse of dimensionality, namely, $\operatorname{\mathbb{E}}[\mathsf{W}_{p}(\hat{\mu}_{n},\mu)]=O(n^{-1/d})$ whenever $d>2p$. This rate is known to be sharp in general [Dud69]. The recent work of [CRL+20, MNW21] discovered that the rate can be improved under the alternative, namely, $\operatorname{\mathbb{E}}\big[\big|\mathsf{W}_{p}(\hat{\mu}_{n},\nu)-\mathsf{W}_{p}(\mu,\nu)\big|\big]=O(n^{-\alpha/d})$ for $d\geq 5$ if $\nu\neq\mu$, where $\alpha=p$ for $1\leq p<2$ and $\alpha=2$ for $2\leq p<\infty$. Their insight is to use the duality formula for $\mathsf{W}_{p}^{p}$ and exploit regularity of optimal transport potentials. [MNW21] also derive matching minimax lower bounds up to log factors under some technical conditions.

Another central problem that has seen rapid development is limit distribution theory for empirical Wasserstein distances. However, except for the two special cases discussed next, to the best of our knowledge, there is no proven analog of our Theorem 1.1 for classic Wasserstein distances, i.e., a comprehensive limit distribution theory for empirical $\mathsf{W}_{p}$ that holds for general $d$ and $p$. The first case for which the limit distribution is well understood is $d=1$. Then, $\mathsf{W}_{p}$ reduces to the $L^{p}$ distance between quantile functions for $1\leq p<\infty$, and further simplifies to the $L^{1}$ distance between distribution functions when $p=1$. Building on such explicit expressions, [dBGM99] and [dBGU05] derived null limit distributions in $d=1$ for $p=1$ and $p=2$, respectively. More recently, under the alternative ($\mu\neq\nu$), [DBGL19] derived a central limit theorem (CLT) for $d=1$ and $p\geq 2$. The second case where a limit distribution theory for empirical $\mathsf{W}_{p}$ is available is when $\mu,\nu$ are discrete. If the distributions are finitely discrete, i.e., $\mu=\sum_{j=1}^{m}r_{j}\delta_{x_{j}}$ and $\nu=\sum_{j=1}^{k}s_{j}\delta_{y_{j}}$ for two simplex vectors $r=(r_{1},\dots,r_{m})$ and $s=(s_{1},\dots,s_{k})$, then $\mathsf{W}_{p}(\mu,\nu)$ can be seen as a function of those simplex vectors $r$ and $s$. Leveraging this, [SM18] applied the delta method to obtain limit distributions for empirical $\mathsf{W}_{p}$ in the finitely discrete case. An extension to countably infinite supports was provided in [TSM19], while [dBGSL21a] treated the semidiscrete case where $\mu$ is finitely discrete but $\nu$ is general.

Except for these two special cases, limit distributions for Wasserstein distances are less understood. To avoid repetition, we focus our discussion here on the one-sample case. In [dBL19], a CLT for $\sqrt{n}\big(\mathsf{W}_{2}^{2}(\hat{\mu}_{n},\nu)-\operatorname{\mathbb{E}}\big[\mathsf{W}_{2}^{2}(\hat{\mu}_{n},\nu)\big]\big)$ is derived in any dimension, but the limit Gaussian distribution degenerates to $0$ when $\mu=\nu$; see also [dBGSL21b] for an extension to general $1<p<\infty$. Notably, the centering constant there is the expected empirical Wasserstein distance $\operatorname{\mathbb{E}}\big[\mathsf{W}_{2}^{2}(\hat{\mu}_{n},\nu)\big]$, which in general cannot be replaced with the (more natural) population distance $\mathsf{W}_{2}^{2}(\mu,\nu)$. The recent preprint [MBNWW21] addressed this gap and established a CLT for $\sqrt{n}\big(\mathsf{W}_{2}^{2}(\tilde{\mu}_{n},\nu)-\mathsf{W}_{2}^{2}(\mu,\nu)\big)$ for a wavelet-based estimator $\tilde{\mu}_{n}$ of $\mu$, assuming that the ambient space is $[0,1]^{d}$ and that $\mu,\nu$ are absolutely continuous w.r.t. the Lebesgue measure with smooth and strictly positive densities. Following arguments similar to [dBL19], they first derive a CLT for $\sqrt{n}\big(\mathsf{W}_{2}^{2}(\tilde{\mu}_{n},\nu)-\operatorname{\mathbb{E}}\big[\mathsf{W}_{2}^{2}(\tilde{\mu}_{n},\nu)\big]\big)$ and then use the strict positivity of the densities and higher-order regularity of optimal transport potentials to control the bias term as $\operatorname{\mathbb{E}}\big[\mathsf{W}_{2}^{2}(\tilde{\mu}_{n},\nu)\big]-\mathsf{W}_{2}^{2}(\mu,\nu)=o(n^{-1/2})$.

Finally, we note that our proof techniques differ from the aforementioned arguments for classic $\mathsf{W}_{p}$. Specifically, as opposed to the two-step approach of [MBNWW21] described above, we directly prove asymptotic normality of $\sqrt{n}\big(\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu)-\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big)$ under the alternative. Their derivation does not apply to our case even when $p=2$, since their bias bound requires that the densities of $\mu$ and $\nu$ be bounded away from zero on their (compact) supports, which fails to hold after the Gaussian convolution. Our argument also differs from that of [SM18, TSM19], even though they also rely on the functional delta method. Specifically, since we do not assume that $\mu,\nu$ are discrete, $\mathsf{W}_{p}^{(\sigma)}$ cannot be parameterized by simplex vectors, and hence the application of the functional delta method is nontrivial. Very recently, an independent work [HKSM22] used the extended functional delta method for the supremum functional [CCR20] to derive limit distributions for classic $\mathsf{W}_{p}$, with $p\geq 2$, for compactly supported distributions under the alternative in dimensions $d\leq 3$.

1.3. Organization

The rest of the paper is organized as follows. In Section 2, we collect background material on Wasserstein distances, smooth Wasserstein distances, and dual Sobolev spaces. In Section 3, we prove Theorem 1.1 and explore the validity of the bootstrap for empirical $\mathsf{W}_{p}^{(\sigma)}$. Section 4 presents applications of our limit distribution theory to MDE with $\mathsf{W}_{p}^{(\sigma)}$. Proofs for Sections 3 and 4 can be found in Section 5. Section 6 provides concluding remarks and discusses future research directions. Finally, the Appendix contains additional proofs.

1.4. Notation

Let $|\cdot|$ and $\langle\cdot,\cdot\rangle$ denote the Euclidean norm and inner product, respectively. Let $B(x,r)=\{y\in\mathbb{R}^{d}:|y-x|\leq r\}$ denote the closed ball with center $x$ and radius $r$. We use $dx$ for the Lebesgue measure on $\mathbb{R}^{d}$, while $\delta_{x}$ denotes the Dirac measure at $x\in\mathbb{R}^{d}$. Given a finite signed Borel measure $\ell$ on $\mathbb{R}^{d}$, we identify $\ell$ with the linear functional $f\mapsto\ell(f):=\int f\,d\ell$. Let $\lesssim$ denote inequalities up to numerical constants. For any $a,b\in\mathbb{R}$, we use the shorthands $a\vee b=\max\{a,b\}$ and $a\wedge b=\min\{a,b\}$.

For a topological space $S$, $\mathcal{B}(S)$ and $\mathcal{P}(S)$ denote, respectively, the Borel $\sigma$-field on $S$ and the class of Borel probability measures on $S$. We write $\mathcal{P}:=\mathcal{P}(\mathbb{R}^{d})$ and, for $1\leq p<\infty$, use $\mathcal{P}_{p}$ to denote the subset of $\mu\in\mathcal{P}$ with finite $p$th moment $\int_{\mathbb{R}^{d}}|x|^{p}d\mu(x)<\infty$. For $\mu,\nu\in\mathcal{P}$, we write $\mu\ll\nu$ for the absolute continuity of $\mu$ w.r.t. $\nu$, and use $d\mu/d\nu$ for the corresponding Radon-Nikodym derivative. Let $\mu*\nu$ denote the convolution of $\mu,\nu\in\mathcal{P}$. Likewise, the convolution of two measurable functions $f,g:\mathbb{R}^{d}\to\mathbb{R}$ is denoted by $f*g$. We write $\gamma_{\sigma}=N(0,\sigma^{2}I_{d})$ for the centered Gaussian distribution on $\mathbb{R}^{d}$ with covariance matrix $\sigma^{2}I_{d}$, and use $\phi_{\sigma}(x)=(2\pi\sigma^{2})^{-d/2}e^{-|x|^{2}/(2\sigma^{2})}$ ($x\in\mathbb{R}^{d}$) for the corresponding density. We write $\mu\otimes\nu$ for the product measure of $\mu,\nu\in\mathcal{P}$. Let $\stackrel{w}{\to}$, $\stackrel{d}{\to}$, and $\stackrel{\operatorname{\mathbb{P}}}{\to}$ denote weak convergence of probability measures, convergence in distribution of random variables, and convergence in probability, respectively. When necessary, convergence in distribution is understood in the sense of Hoffmann-Jørgensen (cf. Chapter 2 in [vdVW96]).

Throughout the paper, we assume that $(X_{1},Y_{1}),(X_{2},Y_{2}),\dots$ are the coordinate projections of the product probability space $\prod_{i=1}^{\infty}\big(\mathbb{R}^{2d},\mathcal{B}(\mathbb{R}^{2d}),\mu\otimes\nu\big)$. To generate auxiliary random variables, we extend the probability space as $(\Omega,\mathcal{A},\operatorname{\mathbb{P}})=\left[\prod_{i=1}^{\infty}\big(\mathbb{R}^{2d},\mathcal{B}(\mathbb{R}^{2d}),\mu\otimes\nu\big)\right]\times\big([0,1],\mathcal{B}([0,1]),\mathrm{Leb}\big)$, where $\mathrm{Leb}$ denotes the Lebesgue measure on $[0,1]$. For $\beta\in(0,2]$, let $\psi_{\beta}(t)=e^{t^{\beta}}-1$ for $t\geq 0$, and recall that the corresponding Orlicz (quasi-)norm of a real-valued random variable $\xi$ is defined as $\|\xi\|_{\psi_{\beta}}:=\inf\{C>0:\operatorname{\mathbb{E}}[\psi_{\beta}(|\xi|/C)]\leq 1\}$. A Borel probability measure $\mu\in\mathcal{P}$ is called $\beta$-sub-Weibull if $\||X|\|_{\psi_{\beta}}<\infty$ for $X\sim\mu$. We say that $\mu$ is sub-Weibull if it is $\beta$-sub-Weibull for some $\beta\in(0,2]$. Finally, $\mu$ is sub-Gaussian if it is $2$-sub-Weibull.
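As a simple illustration (a standard fact, included here only for concreteness), any compactly supported distribution is sub-Gaussian: if $|X|\leq M$ almost surely, then

\operatorname{\mathbb{E}}\big[\psi_{2}(|X|/C)\big]=\operatorname{\mathbb{E}}\big[e^{|X|^{2}/C^{2}}\big]-1\leq e^{M^{2}/C^{2}}-1\leq 1\quad\text{whenever }C\geq M/\sqrt{\log 2},

so that $\||X|\|_{\psi_{2}}\leq M/\sqrt{\log 2}<\infty$.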

For an open set $\mathcal{O}$ in a Euclidean space, $C_{0}^{\infty}(\mathcal{O})$ denotes the space of compactly supported, infinitely differentiable, real functions on $\mathcal{O}$. We write $C_{0}^{\infty}=C_{0}^{\infty}(\mathbb{R}^{d})$ and define $\dot{C}_{0}^{\infty}=\{f+a:f\in C_{0}^{\infty},a\in\mathbb{R}\}$. For any $p\in[1,\infty)$ and $\mu\in\mathcal{P}(\mathbb{R}^{d})$, let $L^{p}(\mu;\mathbb{R}^{k})$ denote the space of measurable maps $f:\mathbb{R}^{d}\to\mathbb{R}^{k}$ such that $\|f\|_{L^{p}(\mu;\mathbb{R}^{k})}=(\int_{\mathbb{R}^{d}}|f|^{p}d\mu)^{1/p}<\infty$; when $k=1$ we use the shorthand $L^{p}(\mu)=L^{p}(\mu;\mathbb{R}^{1})$. Recall that $\big(L^{p}(\mu;\mathbb{R}^{k}),\|\cdot\|_{L^{p}(\mu;\mathbb{R}^{k})}\big)$ is a Banach space. Finally, for a subset $A$ of a topological space $S$, let $\overline{A}^{S}$ denote the closure of $A$ in $S$; if the space $S$ is clear from the context, we simply write $\overline{A}$ for the closure.

2. Background

2.1. Wasserstein distances and their smooth variants

Recall that, for $1\leq p<\infty$, the $p$-Wasserstein distance $\mathsf{W}_{p}(\mu,\nu)$ between $\mu,\nu\in\mathcal{P}_{p}$ is defined in (1). Some basic properties of $\mathsf{W}_{p}$ are (cf. e.g., [Vil03, AGS08, Vil08, San15]): (i) the infimum in the definition of $\mathsf{W}_{p}$ is attained, i.e., there exists a coupling $\pi^{\star}\in\Pi(\mu,\nu)$ such that $\mathsf{W}_{p}^{p}(\mu,\nu)=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}|x-y|^{p}d\pi^{\star}(x,y)$, and the optimal coupling $\pi^{\star}$ is unique when $p>1$ and $\mu\ll dx$; (ii) $\mathsf{W}_{p}$ is a metric on $\mathcal{P}_{p}$; and (iii) convergence in $\mathsf{W}_{p}$ is equivalent to weak convergence plus convergence of $p$th moments: $\mathsf{W}_{p}(\mu_{n},\mu)\to 0$ if and only if $\mu_{n}\stackrel{w}{\to}\mu$ and $\int|x|^{p}d\mu_{n}(x)\to\int|x|^{p}d\mu(x)$.

The proof of the limit distribution for empirical $\mathsf{W}_{p}^{(\sigma)}$ under the alternative hinges on duality theory for $\mathsf{W}_{p}$, which we summarize below. For a function $g:\mathbb{R}^{d}\to[-\infty,\infty)$ and a cost function $c:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$, the $c$-transform of $g$ is defined by

g^{c}(y)=\inf_{x\in\mathbb{R}^{d}}\big[c(x,y)-g(x)\big],\quad y\in\mathbb{R}^{d}.

A function $g:\mathbb{R}^{d}\to[-\infty,\infty)$, not identically $-\infty$, is called $c$-concave if $g=f^{c}$ for some function $f:\mathbb{R}^{d}\to[-\infty,\infty)$.
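As a concrete example (standard, and included here only for illustration), for the quadratic cost $c(x,y)=|x-y|^{2}$ the $c$-transform is a shifted Legendre transform:

g^{c}(y)=\inf_{x\in\mathbb{R}^{d}}\big[|x-y|^{2}-g(x)\big]=|y|^{2}-\sup_{x\in\mathbb{R}^{d}}\big[\langle x,2y\rangle-\big(|x|^{2}-g(x)\big)\big],

so $y\mapsto|y|^{2}-g^{c}(y)$ is the convex conjugate of $x\mapsto|x|^{2}-g(x)$ evaluated at $2y$; in particular, $|\cdot|^{2}-g$ is convex and lower semicontinuous whenever $g$ is $c$-concave for this cost.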

Lemma 2.1 (Duality for $\mathsf{W}_{p}$).

Let $1\leq p<\infty$, $\mu,\nu\in\mathcal{P}_{p}$, and set the cost function to $c(x,y)=|x-y|^{p}$.

  (i)

    (Theorem 5.9 in [Vil08]; Theorem 6.1.5 in [AGS08]) The following duality holds:

    \mathsf{W}_{p}^{p}(\mu,\nu)=\sup_{g\in L^{1}(\mu)}\left[\int_{\mathbb{R}^{d}}g\,d\mu+\int_{\mathbb{R}^{d}}g^{c}\,d\nu\right], (2)

    and there is at least one $c$-concave function $g\in L^{1}(\mu)$ that attains the supremum in (2); we call this $g$ an optimal transport potential from $\mu$ to $\nu$ for $\mathsf{W}_{p}^{p}$.

  (ii)

    (Theorem 3.3 in [GM96]) Let $1<p<\infty$, suppose that $g:\mathbb{R}^{d}\to[-\infty,\infty)$ is $c$-concave, and take $K$ as the convex hull of $\{x:g(x)>-\infty\}$. Then $g$ is locally Lipschitz on the interior of $K$.

  (iii)

    (Corollary 2.7 in [dBGSL21b]) If $1<p<\infty$ and $\mu\ll dx$ is supported on an open connected set $A$, then the optimal transport potential from $\mu$ to $\nu$ for $\mathsf{W}_{p}^{p}$ is unique on $A$ up to additive constants, i.e., if $g_{1}$ and $g_{2}$ are optimal transport potentials, then there exists $C\in\mathbb{R}$ such that $g_{1}(x)=g_{2}(x)+C$ for all $x\in A$.

The smooth Wasserstein distance convolves the distributions with an isotropic Gaussian kernel. Gaussian convolution levels out local irregularities in the distributions, while largely preserving the structure of classic $\mathsf{W}_{p}$. Recalling that $\gamma_{\sigma}=N(0,\sigma^{2}I_{d})$, the smooth $p$-Wasserstein distance is defined as follows.

Definition 2.1 (Smooth Wasserstein distance).

Let $1\leq p<\infty$ and $\sigma\geq 0$. For $\mu,\nu\in\mathcal{P}_{p}$, the smooth $p$-Wasserstein distance between $\mu$ and $\nu$ with smoothing parameter $\sigma$ is

\mathsf{W}_{p}^{(\sigma)}(\mu,\nu):=\mathsf{W}_{p}(\mu*\gamma_{\sigma},\nu*\gamma_{\sigma}).

The smooth Wasserstein distance was studied in [GGNP20, GG20, GGK20, SGK21, NGK21] in terms of structural properties and empirical convergence rates. We recall two basic properties: (i) $\mathsf{W}_{p}^{(\sigma)}$ is a metric on $\mathcal{P}_{p}$ that generates the same topology as classic $\mathsf{W}_{p}$; (ii) for $\mu,\nu\in\mathcal{P}_{p}$ and $0\leq\sigma_{1}\leq\sigma_{2}<\infty$, we have $\mathsf{W}_{p}^{(\sigma_{2})}(\mu,\nu)\leq\mathsf{W}_{p}^{(\sigma_{1})}(\mu,\nu)\leq\mathsf{W}_{p}^{(\sigma_{2})}(\mu,\nu)+C_{p,d}\sqrt{\sigma_{2}^{2}-\sigma_{1}^{2}}$ for a constant $C_{p,d}$ that depends only on $p,d$. In particular, $\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)$ is continuous and monotonically non-increasing in $\sigma\in[0,+\infty)$ with $\lim_{\sigma\downarrow 0}\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)=\mathsf{W}_{p}(\mu,\nu)$. See [NGK21] for additional structural results, including an explicit expression for $C_{p,d}$ and weak convergence of smooth optimal couplings. For empirical convergence, it was shown in [NGK21] that, under appropriate moment assumptions, $\mathbb{E}\big[\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)\big]=O(n^{-1/2})$ for $p>1$ in any dimension $d$. Versions of this result for $p=1$ and $p=2$ were derived earlier in [GGNP20, GGK20, SGK21].

2.2. Sobolev spaces and their duals

Our proof strategy for the limit distribution results is to regard $\mathsf{W}_{p}$ as a functional defined on a subset of a certain dual Sobolev space. We will show that the smooth empirical process converges weakly in the dual Sobolev space and that $\mathsf{W}_{p}$ is Hadamard (directionally) differentiable w.r.t. the dual Sobolev norm. Given these, the limit distributions in Theorem 1.1 follow via the functional delta method. Here we briefly discuss (homogeneous) Sobolev spaces and their duals.

Definition 2.2 (Homogeneous Sobolev spaces and their duals).

Let $\rho\in\mathcal{P}$ and $1\leq p<\infty$.

  (i)

    For a differentiable function $f:\mathbb{R}^{d}\to\mathbb{R}$, let

    \|f\|_{\dot{H}^{1,p}(\rho)}:=\|\nabla f\|_{L^{p}(\rho;\mathbb{R}^{d})}=\left(\int_{\mathbb{R}^{d}}|\nabla f|^{p}d\rho\right)^{1/p}

    be the Sobolev seminorm. We define the homogeneous Sobolev space $\dot{H}^{1,p}(\rho)$ as the completion of $\dot{C}_{0}^{\infty}$ w.r.t. $\|\cdot\|_{\dot{H}^{1,p}(\rho)}$.

  (ii)

    Let $q$ be the conjugate index of $p$, i.e., $1/p+1/q=1$. Let $\dot{H}^{-1,p}(\rho)$ denote the topological dual of $\dot{H}^{1,q}(\rho)$. The dual Sobolev norm $\|\cdot\|_{\dot{H}^{-1,p}(\rho)}$ (dual to $\|\cdot\|_{\dot{H}^{1,q}(\rho)}$) of a continuous linear functional $\ell:\dot{H}^{1,q}(\rho)\to\mathbb{R}$ is defined by

    \|\ell\|_{\dot{H}^{-1,p}(\rho)}=\sup\left\{\ell(f):f\in\dot{C}_{0}^{\infty},\|f\|_{\dot{H}^{1,q}(\rho)}\leq 1\right\}.

The restriction $f\in\dot{C}_{0}^{\infty}$ can be replaced with $f\in C_{0}^{\infty}$ in the definition of the dual norm $\|\cdot\|_{\dot{H}^{-1,p}(\rho)}$, since $\ell(f+a)=\ell(f)$ for any $\ell\in\dot{H}^{-1,p}(\rho)$.
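To make the dual norm concrete, consider the illustrative special case $d=1$ (not used in the sequel): take $\ell=\mu_{1}-\mu_{0}$ for probability measures with distribution functions $F_{1},F_{0}$, and suppose $\rho$ has a strictly positive density $f_{\rho}$. For $f\in\dot{C}_{0}^{\infty}$, integration by parts gives $\ell(f)=\int f^{\prime}(x)\big(F_{0}(x)-F_{1}(x)\big)dx$, and Hölder's inequality (saturated by a suitable choice of $f^{\prime}$) then yields

\|\mu_{1}-\mu_{0}\|_{\dot{H}^{-1,p}(\rho)}=\left(\int_{\mathbb{R}}\left|\frac{F_{1}(x)-F_{0}(x)}{f_{\rho}(x)}\right|^{p}f_{\rho}(x)\,dx\right)^{1/p},

so in one dimension the dual Sobolev norm is a weighted $L^{p}$ distance between distribution functions.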

We have defined the homogeneous Sobolev space $\dot{H}^{1,p}(\rho)$ as the completion of $\dot{C}_{0}^{\infty}$ w.r.t. $\|\cdot\|_{\dot{H}^{1,p}(\rho)}$. It is not immediately clear that the space so constructed is a function space over $\mathbb{R}^{d}$. Below we present an explicit construction of $\dot{H}^{1,p}(\rho)$ when $d\rho/d\kappa$ is bounded away from zero for some reference measure $\kappa\gg dx$ satisfying the $p$-Poincaré inequality. To that end, we first define the Poincaré inequality.

Definition 2.3 (Poincaré inequality).

For $1\leq p<\infty$, a probability measure $\kappa\in\mathcal{P}$ is said to satisfy the $p$-Poincaré inequality if there exists a finite constant $\mathsf{C}$ such that

\|\varphi-\kappa(\varphi)\|_{L^{p}(\kappa)}\leq\mathsf{C}\|\nabla\varphi\|_{L^{p}(\kappa;\mathbb{R}^{d})},\quad\forall\varphi\in C_{0}^{\infty}.

The smallest constant satisfying the above is denoted by $\mathsf{C}_{p}(\kappa)$.

The standard Poincaré inequality refers to the $2$-Poincaré inequality. It is known that any log-concave distribution (i.e., a distribution $\kappa$ of the form $d\kappa=e^{-V}dx$ for some convex function $V:\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$; cf. [LV07, SW14]) satisfies the $p$-Poincaré inequality for any $1\leq p<\infty$ [Bob99, Mil09]. In particular, the Gaussian distribution $\gamma_{\sigma}$ satisfies every $p$-Poincaré inequality (see also [Bog98, Corollary 1.7.3]).
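For instance, in the most familiar case $p=2$, the Gaussian Poincaré inequality reads

\operatorname{Var}_{\gamma_{\sigma}}(\varphi)\leq\sigma^{2}\int_{\mathbb{R}^{d}}|\nabla\varphi|^{2}d\gamma_{\sigma},\quad\varphi\in C_{0}^{\infty},

so that $\mathsf{C}_{2}(\gamma_{\sigma})\leq\sigma$; we record this only as a concrete instance of Definition 2.3.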

Remark 2.1 (Explicit construction of $\dot{H}^{1,p}(\rho)$).

Suppose that there exists a reference measure $\kappa\in\mathcal{P}$, with $\kappa\gg dx$, that satisfies the $p$-Poincaré inequality. Assume that $d\rho/d\kappa\geq c$ for some constant $c>0$ (in our applications, $\rho=\gamma_{\sigma}$ or $\mu*\gamma_{\sigma}$ for some $\mu\in\mathcal{P}_{p}$; in either case, the stated assumption is satisfied with $\kappa=\gamma_{\sigma}$ or $\gamma_{\sigma/\sqrt{2}}$, respectively). Let $\mathcal{C}=\{f\in\dot{C}_{0}^{\infty}:\kappa(f)=0\}$. Then, $\|\cdot\|_{\dot{H}^{1,p}(\rho)}$ is a proper norm on $\mathcal{C}$, and the map $\iota:f\mapsto\nabla f$ is an isometry from $(\mathcal{C},\|\cdot\|_{\dot{H}^{1,p}(\rho)})$ into $(L^{p}(\rho;\mathbb{R}^{d}),\|\cdot\|_{L^{p}(\rho;\mathbb{R}^{d})})$. Let $V$ be the closure of $\iota\mathcal{C}$ in $L^{p}(\rho;\mathbb{R}^{d})$ under $\|\cdot\|_{L^{p}(\rho;\mathbb{R}^{d})}$. The inverse map $\iota^{-1}:\iota\mathcal{C}\to\mathcal{C}$ can be extended to $V$ as follows. For any $g\in V$, choose $f_{n}\in\mathcal{C}$ such that $\|\nabla f_{n}-g\|_{L^{p}(\rho;\mathbb{R}^{d})}\to 0$. Since $\nabla f_{n}$ is Cauchy in $L^{p}(\rho;\mathbb{R}^{d})$ and thus in $L^{p}(\kappa;\mathbb{R}^{d})$ (as $\|\cdot\|_{L^{p}(\kappa;\mathbb{R}^{d})}\lesssim\|\cdot\|_{L^{p}(\rho;\mathbb{R}^{d})}$), $f_{n}$ is Cauchy in $L^{p}(\kappa)$ by the $p$-Poincaré inequality, so $\|f_{n}-f\|_{L^{p}(\kappa)}\to 0$ for some $f\in L^{p}(\kappa)$. Set $\iota^{-1}g=f$ and extend $\|\cdot\|_{\dot{H}^{1,p}(\rho)}$ by $\|f\|_{\dot{H}^{1,p}(\rho)}=\lim_{n\to\infty}\|f_{n}\|_{\dot{H}^{1,p}(\rho)}=\|g\|_{L^{p}(\rho;\mathbb{R}^{d})}$. The space $(\iota^{-1}V,\|\cdot\|_{\dot{H}^{1,p}(\rho)})$ is a Banach space of functions over $\mathbb{R}^{d}$. Finally, the homogeneous Sobolev space $\dot{H}^{1,p}(\rho)$ can be constructed as $\dot{H}^{1,p}(\rho)=\big\{f+a:a\in\mathbb{R},f\in\iota^{-1}V\big\}$ with $\|f+a\|_{\dot{H}^{1,p}(\rho)}=\|f\|_{\dot{H}^{1,p}(\rho)}$.

The next lemma summarizes some basic results about the space $\dot{H}^{-1,p}(\rho)$ and $\dot{H}^{-1,p}(\rho)$-valued random variables that we use in the sequel. The proof can be found in Section A.1.

Lemma 2.2.

Let $1<p<\infty$ and $\rho\in\mathcal{P}$. The dual space $\dot{H}^{-1,p}(\rho)$ is a separable Banach space. The Borel $\sigma$-field on $\dot{H}^{-1,p}(\rho)$ coincides with the cylinder $\sigma$-field (the smallest $\sigma$-field that makes the coordinate projections, $\dot{H}^{-1,p}(\rho)\ni\ell\mapsto\ell(f)\in\mathbb{R}$, measurable).

Consider a stochastic process $Y=(Y(f))_{f\in\dot{H}^{1,q}(\rho)}$ indexed by $\dot{H}^{1,q}(\rho)$, i.e., $\omega\mapsto Y(f,\omega)$ is measurable for each $f\in\dot{H}^{1,q}(\rho)$. The process can be thought of as a map from $\Omega$ into $\dot{H}^{-1,p}(\rho)$ as long as $Y$ has paths in $\dot{H}^{-1,p}(\rho)$, i.e., for each fixed $\omega\in\Omega$, the map $f\mapsto Y(f,\omega)$ is continuous and linear. The fact that the Borel $\sigma$-field on $\dot{H}^{-1,p}(\rho)$ coincides with the cylinder $\sigma$-field guarantees that a stochastic process indexed by $\dot{H}^{1,q}(\rho)$ with paths in $\dot{H}^{-1,p}(\rho)$ is Borel measurable as a map from $\Omega$ into $\dot{H}^{-1,p}(\rho)$.

2.3. $\mathsf{W}_{p}$ and the dual Sobolev norm

In Section 3, we will explore limit distributions for empirical $\mathsf{W}_{p}^{(\sigma)}$. One of the key technical ingredients there is a comparison of the Wasserstein distance with a certain dual Sobolev norm, which we present next.

Proposition 2.1 (Comparison between $\mathsf{W}_{p}$ and the dual Sobolev norm; Theorem 5.26 in [DNS09]).

Let $1<p<\infty$, and suppose that $\mu_{0},\mu_{1}\in\mathcal{P}_{p}$ with $\mu_{0},\mu_{1}\ll\rho$ for some reference measure $\rho\in\mathcal{P}$. Denote their respective densities by $f_{i}=d\mu_{i}/d\rho$, $i=0,1$. If $f_{0}$ or $f_{1}$ is bounded from below by some $c>0$, then

\mathsf{W}_{p}(\mu_{0},\mu_{1})\leq p\,c^{-1/q}\,\|\mu_{1}-\mu_{0}\|_{\dot{H}^{-1,p}(\rho)}. (3)

Proposition 2.1 follows directly from Theorem 5.26 of [DNS09]. Similar comparison inequalities appear in [Pey18, Led19, WB19b]. We include a self-contained proof of Proposition 2.1 in Section A.2, as some elements of the proof are key to our derivation of the null limit distribution for empirical $\mathsf{W}_{p}^{(\sigma)}$. The proof builds on the Benamou-Brenier dynamic formulation of optimal transport [BB00], which shows that $\mathsf{W}_{p}(\mu_{0},\mu_{1})$ is bounded from above by the length of any absolutely continuous path from $\mu_{0}$ to $\mu_{1}$ in $(\mathcal{P}_{p},\mathsf{W}_{p})$. The dual Sobolev norm emerges as a bound on the length of the linear interpolation $t\mapsto t\mu_{1}+(1-t)\mu_{0}$.
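For reference, the dynamic (Benamou-Brenier) formulation invoked here states, in one common form, that

\mathsf{W}_{p}^{p}(\mu_{0},\mu_{1})=\inf\left\{\int_{0}^{1}\int_{\mathbb{R}^{d}}|v_{t}|^{p}\,d\mu_{t}\,dt:\partial_{t}\mu_{t}+\nabla\cdot(v_{t}\mu_{t})=0,\ \mu_{t}\big|_{t=0}=\mu_{0},\ \mu_{t}\big|_{t=1}=\mu_{1}\right\},

where the infimum runs over suitably regular (distributional) solutions of the continuity equation; bounding the action of the linear interpolation $\mu_{t}=t\mu_{1}+(1-t)\mu_{0}$ via Hölder's inequality is what produces the dual Sobolev norm on the right-hand side of (3).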

3. Limit distribution theory

The goal of this section is to establish Theorem 1.1. The proof relies on two key steps: (i) establish weak convergence of the smooth empirical process $\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}$ in the dual Sobolev space $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$; and (ii) regard $\mathsf{W}_{p}^{(\sigma)}$ as a functional defined on a subset of $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$ and characterize its Hadamard directional derivative w.r.t. the corresponding dual Sobolev norm. Given (i) and (ii), the limit distribution results follow from the functional delta method, and the asymptotic normality under the alternative further follows from linearity of the Hadamard directional derivative.

3.1. Preliminaries

Throughout this section, we fix $1<p<\infty$, take $q$ as the conjugate index of $p$, and let $\sigma>0$. For $\mu,\nu\in\mathcal{P}_{p}$, let $X_{1},\ldots,X_{n}\sim\mu$ and $Y_{1},\ldots,Y_{n}\sim\nu$ be independent observations, and denote the associated empirical distributions by $\hat{\mu}_{n}:=n^{-1}\sum_{i=1}^{n}\delta_{X_{i}}$ and $\hat{\nu}_{n}:=n^{-1}\sum_{i=1}^{n}\delta_{Y_{i}}$, respectively.

3.1.1. Weak convergence of smooth empirical process in dual Sobolev spaces

The first building block of our limit distribution results is the following weak convergence of the smoothed empirical process $\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}$ in $\dot{H}^{-1,p}(\gamma_{\sigma})$ and $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$.

Proposition 3.1 (Weak convergence of smooth empirical process).

Suppose that $X\sim\mu$ satisfies

\int_{0}^{\infty}e^{\frac{\theta r^{2}}{2\sigma^{2}}}\sqrt{\operatorname{\mathbb{P}}(|X|>r)}\,dr<\infty\quad\text{for some }\theta>p-1. (4)

Then, the smoothed empirical process $\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}$ converges in distribution as $n\to\infty$ both in $\dot{H}^{-1,p}(\gamma_{\sigma})$ and in $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$. The limit process in each case is a centered Gaussian process, indexed by $\dot{H}^{1,q}(\gamma_{\sigma})$ or $\dot{H}^{1,q}(\mu*\gamma_{\sigma})$, respectively, with covariance function $(f_{1},f_{2})\mapsto\operatorname{Cov}_{\mu}(f_{1}*\phi_{\sigma},f_{2}*\phi_{\sigma})$. Here $\operatorname{Cov}_{\mu}$ denotes the covariance under $\mu$.

The proof of Proposition 3.1 relies on the prior work [NGK21] by a subset of the authors, where it was shown that the smoothed function class $\mathcal{F}*\phi_{\sigma}=\{f*\phi_{\sigma}:f\in\mathcal{F}\}$ with $\mathcal{F}=\{f\in\dot{C}_{0}^{\infty}:\|f\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\leq 1\}$ is $\mu$-Donsker. The weak convergence in $\dot{H}^{-1,p}(\gamma_{\sigma})$ then follows from an argument similar to Lemma 1 in [Nic09]. This, in turn, implies weak convergence in $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$ when $\mu$ has mean zero, since in that case $\dot{H}^{-1,p}(\gamma_{\sigma})$ is continuously embedded into $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$. To account for non-centered distributions, we use a reduction to the mean-zero case via translation. See also Remark 5.1 for an alternative proof for $p=2$ that relies on the CLT in Hilbert spaces.

Inspection of the proof of Proposition 3.1 shows that Condition (4) implies

\int_{\mathbb{R}^{d}}e^{(p-1)|x|^{2}/\sigma^{2}}d\mu(x)<\infty,

which requires $\mu$ to be sub-Gaussian. It is not difficult to see that Condition (4) is satisfied if $\mu$ is compactly supported or sub-Gaussian with $\||X|\|_{\psi_{2}}<\sigma/\sqrt{p-1}$ for $X\sim\mu$.
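For the latter claim, a short verification using only the standard Orlicz tail bound runs as follows: if $K:=\||X|\|_{\psi_{2}}<\sigma/\sqrt{p-1}$, then $\operatorname{\mathbb{P}}(|X|>r)\leq 2e^{-r^{2}/K^{2}}$, and hence

\int_{0}^{\infty}e^{\frac{\theta r^{2}}{2\sigma^{2}}}\sqrt{\operatorname{\mathbb{P}}(|X|>r)}\,dr\leq\sqrt{2}\int_{0}^{\infty}\exp\left(\frac{r^{2}}{2}\Big(\frac{\theta}{\sigma^{2}}-\frac{1}{K^{2}}\Big)\right)dr<\infty

for any $\theta\in(p-1,\sigma^{2}/K^{2})$, an interval that is nonempty precisely when $K<\sigma/\sqrt{p-1}$, so Condition (4) holds.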

A natural question is whether a condition in the spirit of (4) is necessary for the conclusion of Proposition 3.1 to hold. Indeed, we show that some form of sub-Gaussianity is necessary for the smooth empirical process to converge to zero in $\dot{H}^{-1,p}(\gamma_{\sigma})$.

Proposition 3.2 (Necessity of sub-Gaussian condition).

The following hold.

  (i)

    If $(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\to 0$ in $\dot{H}^{-1,p}(\gamma_{\sigma})$ as $n\to\infty$ a.s., then $\int_{\mathbb{R}^{d}}e^{\theta|x|^{2}/(2\sigma^{2})}d\mu(x)<\infty$ for any $\theta<p-1$.

  (ii)

    Conversely, if $\int_{\mathbb{R}^{d}}e^{(p-1)|x|^{2}/(2\sigma^{2})}d\mu(x)<\infty$, then $(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\to 0$ in $\dot{H}^{-1,p}(\gamma_{\sigma})$ as $n\to\infty$ a.s.

3.1.2. Functional delta method

Another ingredient of our limit distribution results is the (extended) functional delta method [Sha91, Düm93, Rö04, FS19]. Let $D$ be a normed space and $\Phi:\Xi\subset D\to\mathbb{R}$ be a function. Following [Sha90, Rö04], we say that $\Phi$ is Hadamard directionally differentiable at $\theta\in\Xi$ if there exists a map $\Phi^{\prime}_{\theta}:T_{\Xi}(\theta)\to\mathbb{R}$ such that

\lim_{n\to\infty}\frac{\Phi(\theta+t_{n}h_{n})-\Phi(\theta)}{t_{n}}=\Phi^{\prime}_{\theta}(h)

for any $h\in T_{\Xi}(\theta)$, $t_{n}\downarrow 0$, and $h_{n}\to h$ in $D$ such that $\theta+t_{n}h_{n}\in\Xi$. Here $T_{\Xi}(\theta)$ is the tangent cone to $\Xi$ at $\theta$, defined as

T_{\Xi}(\theta):=\left\{h\in D:h=\lim_{n\to\infty}\frac{\theta_{n}-\theta}{t_{n}}\ \text{for some }\theta_{n}\to\theta\text{ in }\Xi\text{ and }t_{n}\downarrow 0\right\}.

The tangent cone $T_{\Xi}(\theta)$ is closed, and if $\Xi$ is convex, then $T_{\Xi}(\theta)$ coincides with the closure in $D$ of $\{(\tilde{\theta}-\theta)/t:\tilde{\theta}\in\Xi,t>0\}$ (cf. Proposition 4.2.1 in [AF09]). The derivative $\Phi_{\theta}^{\prime}$ is positively homogeneous (i.e., $\Phi_{\theta}^{\prime}(ch)=c\Phi_{\theta}^{\prime}(h)$ for any $c\geq 0$) and continuous, but need not be linear.
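A simple example to keep in mind for the null case is the norm map: for $\Phi(\theta)=\|\theta\|$ on a normed space $D$ (with $\Xi=D$), $\Phi$ is Hadamard directionally differentiable at $\theta=0$ with $\Phi_{0}^{\prime}(h)=\|h\|$, since

\frac{\Phi(0+t_{n}h_{n})-\Phi(0)}{t_{n}}=\|h_{n}\|\to\|h\|,

a positively homogeneous but nonlinear derivative; this is exactly the structure of the derivative appearing in Proposition 3.3 below.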

Lemma 3.1 (Extended functional delta method [Sha91, Düm93, Rö04, FS19]).

Let $D$ be a normed space and $\Phi:\Xi\subset D\to\mathbb{R}$ be a function that is Hadamard directionally differentiable at $\theta\in\Xi$ with derivative $\Phi_{\theta}^{\prime}:T_{\Xi}(\theta)\to\mathbb{R}$. Let $T_{n}:\Omega\to\Xi$ be maps such that $r_{n}(T_{n}-\theta)\stackrel{d}{\to}T$ for some $r_{n}\to\infty$ and a Borel measurable map $T:\Omega\to D$ with values in $T_{\Xi}(\theta)$. Then, $r_{n}\big(\Phi(T_{n})-\Phi(\theta)\big)\stackrel{d}{\to}\Phi_{\theta}^{\prime}(T)$. Further, if $\Xi$ is convex, then we have the expansion $r_{n}\big(\Phi(T_{n})-\Phi(\theta)\big)=\Phi_{\theta}^{\prime}\big(r_{n}(T_{n}-\theta)\big)+o_{\operatorname{\mathbb{P}}}(1)$.

Remark 3.1 (Choice of domain $\Xi$).

The domain $\Xi$ is arbitrary as long as it contains the ranges of $T_{n}$ for all $n$, and the tangent cone $T_{\Xi}(\theta)$ contains the range of the limit variable $T$.

3.2. Limit distributions under the null ($\mu=\nu$)

We shall apply the extended functional delta method to derive the limit distributions of $\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)$ and $\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})$ as $n\to\infty$, namely, proving Parts (i) and (ii) of Theorem 1.1. To set up the problem over a (real) vector space, we regard $\rho\mapsto\mathsf{W}_{p}^{(\sigma)}(\rho,\mu)=\mathsf{W}_{p}(\rho*\gamma_{\sigma},\mu*\gamma_{\sigma})$ as a map $h\mapsto\mathsf{W}_{p}(\mu*\gamma_{\sigma}+h,\mu*\gamma_{\sigma})$ defined on a set of finite signed Borel measures. The comparison result from Proposition 2.1 implies that the latter map is Lipschitz in $\|\cdot\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}$, and Proposition 3.1 shows that $\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}$ is weakly convergent in $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$. These facts suggest choosing the ambient space to be $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$.

To cover the one- and two-sample cases in a unified manner, consider the same map but in two variables. Take $D_{\mu}=\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$, set $\Xi_{\mu}:=D_{\mu}\cap\big\{h=(\rho-\mu)*\gamma_{\sigma}:\rho\in\mathcal{P}_{p}\big\}$, and define the function $\Phi:\Xi_{\mu}\times\Xi_{\mu}\subset D_{\mu}\times D_{\mu}\to\mathbb{R}$ as

\Phi(h_{1},h_{2}):=\mathsf{W}_{p}(\mu*\gamma_{\sigma}+h_{1},\mu*\gamma_{\sigma}+h_{2}),\quad(h_{1},h_{2})\in\Xi_{\mu}\times\Xi_{\mu}.

We endow $D_{\mu}\times D_{\mu}$ with a product norm (e.g., $\|h_{1}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\|h_{2}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}$). Since the set $\Xi_{\mu}$ (and thus $\Xi_{\mu}\times\Xi_{\mu}$) is convex, the tangent cone $T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0)$ coincides with the closure in $D_{\mu}\times D_{\mu}$ of $\{(h_{1},h_{2})/t:(h_{1},h_{2})\in\Xi_{\mu}\times\Xi_{\mu},t>0\}$. We next verify that $\Phi$ is Hadamard directionally differentiable at $(0,0)$.

Proposition 3.3 (Hadamard directional derivative of $\mathsf{W}_{p}$ under the null).

Let $1<p<\infty$ and $\mu\in\mathcal{P}_{p}$. Then, the map $\Phi:(h_{1},h_{2})\mapsto\mathsf{W}_{p}(\mu*\gamma_{\sigma}+h_{1},\mu*\gamma_{\sigma}+h_{2})$, $\Xi_{\mu}\times\Xi_{\mu}\subset D_{\mu}\times D_{\mu}\to\mathbb{R}$, is Hadamard directionally differentiable at $(h_{1},h_{2})=(0,0)$ with derivative $\Phi^{\prime}_{(0,0)}(h_{1},h_{2})=\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}$, i.e., for any $(h_{1},h_{2})\in T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0)$, $t_{n}\downarrow 0$, and $(h_{n,1},h_{n,2})\to(h_{1},h_{2})$ in $D_{\mu}\times D_{\mu}$ such that $(t_{n}h_{n,1},t_{n}h_{n,2})\in\Xi_{\mu}\times\Xi_{\mu}$, we have

\lim_{n\to\infty}\frac{\Phi(t_{n}h_{n,1},t_{n}h_{n,2})}{t_{n}}=\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}.

Proposition 3.3 follows from the next Gâteaux differentiability result for $\mathsf{W}_{p}$, which may be of independent interest, combined with the Lipschitz continuity of $\Phi$ w.r.t. $\|\cdot\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}$ (cf. Proposition 2.1).

Lemma 3.2 (Gâteaux directional derivative of $\mathsf{W}_{p}$).

Let $\mu\in\mathcal{P}_{p}$ and let $h_{i}\in\dot{H}^{-1,p}(\mu)$, $i=1,2$, be finite signed Borel measures with total mass $0$ such that $h_{i}\ll\mu$ and $\mu+h_{i}\in\mathcal{P}_{p}$. Then,

\frac{d}{dt^{+}}\mathsf{W}_{p}(\mu+th_{1},\mu+th_{2})\Big|_{t=0}=\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\mu)},

where $d/dt^{+}$ denotes the right derivative.

Remark 3.2 (Comparison with Exercise 22.20 in [Vil08]).

Exercise 22.20 in [Vil08] states that (in our notation)

\lim_{\epsilon\downarrow 0}\frac{\mathsf{W}_{2}\big((1+\epsilon h)\mu,\mu\big)}{\epsilon}=\|h\mu\|_{\dot{H}^{-1,2}(\mu)}, (5)

for any sufficiently regular function $h$ with $\int h\,d\mu=0$ (here $h\mu$ is understood as the signed measure $h\,d\mu$). Theorem 7.26 in [Vil03] provides a proof of the one-sided inequality that the liminf of the left-hand side above is at least $\|h\mu\|_{\dot{H}^{-1,2}(\mu)}$, when $\mu\in\mathcal{P}_{2}$ satisfies $\mu\ll dx$ and $h$ is bounded. The subsequent Remark 7.27 states that "We shall not consider the converse of this inequality, which requires more assumptions and more effort." However, we could not find references that establish rigorous conditions, applicable to our problem, under which the derivative formula (5) holds. Lemma 3.2 provides a rigorous justification for this formula and extends it to general $p>1$.

Given these preparations, the proof of Theorem 1.1 Parts (i) and (ii) is immediate.

Proof of Theorem 1.1, Parts (i) and (ii).

Let $G_{\mu}$ denote the weak limit of $\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}$ in $\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$; cf. Proposition 3.1. Recall that $D_{\mu}=\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$ is separable (cf. Lemma 2.2), so $(T_{n,1},T_{n,2}):=\big((\hat{\mu}_{n}-\mu)*\gamma_{\sigma},(\hat{\nu}_{n}-\mu)*\gamma_{\sigma}\big)$ is a Borel measurable map from $\Omega$ into the product space $D_{\mu}\times D_{\mu}$ [vdVW96, Lemma 1.4.1]. Since $T_{n,1}$ and $T_{n,2}$ are independent, by Example 1.4.6 in [vdVW96] and Proposition 3.1, $(T_{n,1},T_{n,2})\stackrel{d}{\to}(G_{\mu},G_{\mu}^{\prime})$ in $D_{\mu}\times D_{\mu}$, where $G_{\mu}^{\prime}$ is an independent copy of $G_{\mu}$. Since $(T_{n,1},T_{n,2})\in T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0)$ and $T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0)$ is closed in $D_{\mu}\times D_{\mu}$, we see that $(G_{\mu},G_{\mu}^{\prime})\in T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0)$ by the portmanteau theorem.

Applying the functional delta method (Lemma 3.1) and Proposition 3.3, we conclude that

\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})=\sqrt{n}\big(\Phi(T_{n,1},T_{n,2})-\Phi(0,0)\big)\stackrel{d}{\to}\Phi_{(0,0)}^{\prime}(G_{\mu},G_{\mu}^{\prime})=\|G_{\mu}-G_{\mu}^{\prime}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}.

Likewise, we also have

\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)=\sqrt{n}\big(\Phi(T_{n,1},0)-\Phi(0,0)\big)\stackrel{d}{\to}\Phi_{(0,0)}^{\prime}(G_{\mu},0)=\|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}.

This completes the proof. ∎

3.3. Limit distributions under the alternative ($\mu\neq\nu$)

3.3.1. One-sample case

We start from the simpler situation where $\nu$ is known and prove Part (iii) of Theorem 1.1. Our proof strategy is to first establish asymptotic normality of the $p$th power of $\mathsf{W}_{p}^{(\sigma)}$, from which Part (iii) follows by applying the delta method for $s\mapsto s^{1/p}$. For notational convenience, define

\mathsf{S}^{(\sigma)}_{p}(\mu,\nu):=\big[\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big]^{p},

for which one-sample asymptotic normality under the alternative is stated next.

Proposition 3.4.

Suppose that $\mu\in\mathcal{P}$ satisfies Condition (4), $\nu\in\mathcal{P}$ is sub-Weibull, and $\mu\neq\nu$. Let $g$ be an optimal transport potential from $\mu*\gamma_{\sigma}$ to $\nu*\gamma_{\sigma}$ for $\mathsf{W}_{p}^{p}$. Then, we have

\sqrt{n}\big(\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}_{n},\nu)-\mathsf{S}^{(\sigma)}_{p}(\mu,\nu)\big)\stackrel{d}{\to}N\big(0,\operatorname{Var}_{\mu}(g*\phi_{\sigma})\big).

We again use the functional delta method to prove this proposition, but with a slightly different setting. Set $D_{\mu}=\dot{H}^{-1,p}(\mu*\gamma_{\sigma})$ as before, and consider the function $\Psi:\Lambda_{\mu}\subset D_{\mu}\to\mathbb{R}$ defined by

\Psi(h):=\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma}+h,\nu*\gamma_{\sigma}),\quad h\in\Lambda_{\mu},

where

\Lambda_{\mu}:=D_{\mu}\cap\big\{h=(\rho-\mu)*\gamma_{\sigma}:\rho\in\mathcal{P}\text{ is sub-Weibull}\big\}. (6)

As long as $\mu$ is sub-Weibull (recall that Condition (4) requires $\mu$ to be sub-Gaussian), the set $\Lambda_{\mu}$ contains $0$. This set is also convex, and so the tangent cone $T_{\Lambda_{\mu}}(0)$ coincides with the closure in $D_{\mu}$ of $\{h/t:h\in\Lambda_{\mu},t>0\}$. The corresponding Hadamard directional derivative of $\mathsf{W}_{p}^{p}$ is given next.

Proposition 3.5 (Hadamard directional derivative of $\mathsf{W}_{p}^{p}$ w.r.t. one argument).

Let $1<p<\infty$, and suppose that $\mu,\nu\in\mathcal{P}$ are sub-Weibull. Let $g$ be an optimal transport potential from $\mu*\gamma_{\sigma}$ to $\nu*\gamma_{\sigma}$ for $\mathsf{W}_{p}^{p}$, which is uniquely determined up to additive constants (see Lemma 2.1 (iii)). Then

  (i)

    $g\in\dot{H}^{1,q}(\mu*\gamma_{\sigma})$, where $q$ is the conjugate index of $p$; and

  (ii)

    the map $\Psi:\Lambda_{\mu}\subset D_{\mu}\to\mathbb{R}$, $h\mapsto\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma}+h,\nu*\gamma_{\sigma})$, is Hadamard directionally differentiable at $h=0$ with derivative $\Psi_{0}^{\prime}(h)=h(g)$, i.e., for any $h\in T_{\Lambda_{\mu}}(0)$, $t_{n}\downarrow 0$, and $h_{n}\to h$ in $D_{\mu}$ such that $t_{n}h_{n}\in\Lambda_{\mu}$, we have

    \lim_{n\to\infty}\frac{\Psi(t_{n}h_{n})-\Psi(0)}{t_{n}}=h(g). (7)

As in the null case, Part (ii) of Proposition 3.5 follows from the following Gâteaux differentiability result for $\mathsf{W}_{p}^{p}$, combined with local Lipschitz continuity of $\Psi$ w.r.t. $\|\cdot\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}$.

Lemma 3.3 (Gâteaux directional derivative of $\mathsf{W}_{p}^{p}$ w.r.t. one argument).

Let $1<p<\infty$ and let $\mu,\nu,\rho\in\mathcal{P}$ be sub-Weibull. Let $g$ be an optimal transport potential from $\mu*\gamma_{\sigma}$ to $\nu$. Then

\frac{d}{dt^{+}}\mathsf{W}_{p}^{p}\big((\mu+t(\rho-\mu))*\gamma_{\sigma},\nu\big)\Big|_{t=0}=\int_{\mathbb{R}^{d}}g\,d\big((\rho-\mu)*\gamma_{\sigma}\big),

where the integral on the right-hand side is well-defined and finite.

Remark 3.3 (Comparisons with Theorem 8.4.7 in [AGS08] and Theorem 5.24 in [San15]).

Theorem 8.4.7 in [AGS08] derives the following differentiabiliy result for 𝖶pp\mathsf{W}_{p}^{p}. Let μt:I(𝒫p,𝖶p)\mu_{t}:I\to(\mathcal{P}_{p},\mathsf{W}_{p}) be an absolutely continuous curve for some open interval II, and let vtv_{t} be an “optimal” velocity field satisfying the continuity equation for μt\mu_{t} (see Theorem 8.4.7 in [AGS08] for the precise meaning). Then, for any ν𝒫p\nu\in\mathcal{P}_{p}, we have that

ddt𝖶pp(μt,ν)=d×dp|xy|p2xy,vt(x)𝑑πt(x,y)\frac{d}{dt}\mathsf{W}_{p}^{p}(\mu_{t},\nu)=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}p|x-y|^{p-2}\langle x-y,v_{t}(x)\rangle d\pi_{t}(x,y) (8)

for almost every (a.e.) tIt\in I, where πtΠ(μt,ν)\pi_{t}\in\Pi(\mu_{t},\nu) is an optimal coupling for 𝖶p(μt,ν)\mathsf{W}_{p}(\mu_{t},\nu). See also Theorem 5.24 in [San15]. Since (8) only holds for a.e. tIt\in I, while we need the (right) differentiability at a specific point, the result of [AGS08, Theorem 8.4.7] (or [San15, Theorem 5.24]) does not directly apply to our problem. We overcome this difficulty by establishing regularity of optimal transport potentials (see Lemma 5.3 ahead), for which Gaussian smoothing plays an essential role.

We are now ready to prove Proposition 3.4; Part (iii) of Theorem 1.1 then follows by combining it with the delta method for the map s\mapsto s^{1/p}.

Proof of Proposition 3.4.

By Proposition 3.1, Tn:=(μ^nμ)γσΛμT_{n}:=(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\in\Lambda_{\mu} and nTndGμ\sqrt{n}T_{n}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu} in DμD_{\mu}. Also GμTΛμ(0)G_{\mu}\in T_{\Lambda_{\mu}}(0) with probability one by the portmanteau theorem. Applying the functional delta method (Lemma 3.1) and Proposition 3.5, we have

n(𝖲p(σ)(μ^n,ν)𝖲p(σ)(μ,ν))=n(Ψ(Tn)Ψ(0))dGμ(g)N(0,Varμ(gϕσ)),\sqrt{n}\big{(}\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}_{n},\nu)-\mathsf{S}^{(\sigma)}_{p}(\mu,\nu)\big{)}=\sqrt{n}\big{(}\Psi(T_{n})-\Psi(0)\big{)}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu}(g)\sim N\big{(}0,\operatorname{Var}_{\mu}(g*\phi_{\sigma})\big{)},

as desired. ∎
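
For concreteness, the delta-method step reads as follows. Write s_{0}:=\mathsf{S}_{p}^{(\sigma)}(\mu,\nu)=\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma},\nu*\gamma_{\sigma})>0 (positivity holds because Gaussian smoothing is injective and \mu\neq\nu) and recall that \mathsf{W}_{p}^{(\sigma)}=\big{(}\mathsf{S}_{p}^{(\sigma)}\big{)}^{1/p}. Since s\mapsto s^{1/p} is differentiable at s_{0} with derivative p^{-1}s_{0}^{1/p-1}, Proposition 3.4 and the delta method yield

\sqrt{n}\big{(}\mathsf{W}^{(\sigma)}_{p}(\hat{\mu}_{n},\nu)-\mathsf{W}^{(\sigma)}_{p}(\mu,\nu)\big{)}\stackrel{d}{\to}N\big{(}0,\,p^{-2}s_{0}^{2(1-p)/p}\operatorname{Var}_{\mu}(g*\phi_{\sigma})\big{)}.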

3.3.2. Two-sample case

Finally, we consider the two-sample case and prove the following, from which Part (iv) of Theorem 1.1 follows.

Proposition 3.6.

Let 1<p<1<p<\infty. Suppose that μ,ν𝒫\mu,\nu\in\mathcal{P} satisfy Condition (4) and νμ\nu\neq\mu. Let gg be an optimal transport potential from μγσ\mu*\gamma_{\sigma} to νγσ\nu*\gamma_{\sigma} for 𝖶pp\mathsf{W}_{p}^{p}. Then, we have

n(𝖲p(σ)(μ^n,ν^n)𝖲p(σ)(μ,ν))dN(0,Varμ(gϕσ)+Varν(gcϕσ)).\sqrt{n}\big{(}\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}_{n},\hat{\nu}_{n})-\mathsf{S}^{(\sigma)}_{p}(\mu,\nu)\big{)}\stackrel{{\scriptstyle d}}{{\to}}N\big{(}0,\operatorname{Var}_{\mu}(g*\phi_{\sigma})+\operatorname{Var}_{\nu}(g^{c}*\phi_{\sigma})\big{)}.

Set Dμ=H˙1,p(μγσ)D_{\mu}=\dot{H}^{-1,p}(\mu*\gamma_{\sigma}) and Dν=H˙1,p(νγσ)D_{\nu}=\dot{H}^{-1,p}(\nu*\gamma_{\sigma}). Consider the function Υ:Λμ×ΛνDμ×Dν\Upsilon:\Lambda_{\mu}\times\Lambda_{\nu}\subset D_{\mu}\times D_{\nu}\to\mathbb{R} defined by

Υ(h1,h2):=𝖶pp(μγσ+h1,νγσ+h2),(h1,h2)Λμ×Λν,\Upsilon(h_{1},h_{2}):=\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma}+h_{1},\nu*\gamma_{\sigma}+h_{2}),\quad(h_{1},h_{2})\in\Lambda_{\mu}\times\Lambda_{\nu},

where Λμ\Lambda_{\mu} is given in (6) and Λν\Lambda_{\nu} is defined analogously. Here we endow Dμ×DνD_{\mu}\times D_{\nu} with a product norm (e.g. h1H˙1,p(μγσ)+h2H˙1,p(νγσ)\|h_{1}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\|h_{2}\|_{\dot{H}^{-1,p}(\nu*\gamma_{\sigma})}).

We note that if gg is an optimal transport potential from μγσ\mu*\gamma_{\sigma} to νγσ\nu*\gamma_{\sigma}, then gcg^{c} is an optimal transport potential from νγσ\nu*\gamma_{\sigma} to μγσ\mu*\gamma_{\sigma}, as gcc=gg^{cc}=g. With this in mind, Proposition 3.5 immediately yields the following proposition.

Proposition 3.7 (Hadamard directional derivative of 𝖶pp\mathsf{W}_{p}^{p} w.r.t. two arguments).

Let 1<p<1<p<\infty, and suppose that μ,ν𝒫\mu,\nu\in\mathcal{P} are sub-Weibull. Let gg be an optimal transport potential from μγσ\mu*\gamma_{\sigma} to νγσ\nu*\gamma_{\sigma} for 𝖶pp\mathsf{W}_{p}^{p}. Then, (g,gc)H˙1,q(μγσ)×H˙1,q(νγσ)(g,g^{c})\in\dot{H}^{1,q}(\mu*\gamma_{\sigma})\times\dot{H}^{1,q}(\nu*\gamma_{\sigma}), and the map Υ:Λμ×ΛνDμ×Dν,(h1,h2)𝖶pp(μγσ+h1,νγσ+h2)\Upsilon:\Lambda_{\mu}\times\Lambda_{\nu}\subset D_{\mu}\times D_{\nu}\to\mathbb{R},(h_{1},h_{2})\mapsto\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma}+h_{1},\nu*\gamma_{\sigma}+h_{2}), is Hadamard directionally differentiable at (h1,h2)=(0,0)(h_{1},h_{2})=(0,0) with derivative Υ(0,0)(h1,h2)=h1(g)+h2(gc)\Upsilon_{(0,0)}^{\prime}(h_{1},h_{2})=h_{1}(g)+h_{2}(g^{c}) for (h1,h2)TΛμ×Λν(0,0)(h_{1},h_{2})\in T_{\Lambda_{\mu}\times\Lambda_{\nu}}(0,0).

Given Proposition 3.7, the proof of Proposition 3.6 is analogous to that of Proposition 3.4, and is thus omitted for brevity. As before, Part (iv) of Theorem 1.1 follows via the delta method for ss1/ps\mapsto s^{1/p}.

3.4. Bootstrap

The limit distributions in Theorem 1.1 are non-pivotal, as they depend on the population distributions μ\mu and/or ν\nu, which are unknown in practice. To overcome this and facilitate statistical inference using 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}, we apply the bootstrap to estimate the limit distributions of empirical 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}.

We start from the one-sample case. Given the data X1,,XnX_{1},\dots,X_{n}, let X1B,,XnBX_{1}^{B},\dots,X_{n}^{B} be an independent sample from μ^n\hat{\mu}_{n}, and set μ^nB:=n1i=1nδXiB\hat{\mu}_{n}^{B}:=n^{-1}\sum_{i=1}^{n}\delta_{X_{i}^{B}} as the bootstrap empirical distribution. Let B\operatorname{\mathbb{P}}^{B} denote the conditional probability given X1,X2,X_{1},X_{2},\dots. The next proposition shows that the bootstrap consistently estimates the limit distribution of empirical 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} under both the null and the alternative.

Proposition 3.8 (Bootstrap consistency: one-sample case).

Suppose that μ\mu satisfies Condition (4).

  1. (i)

    (Null case) We have

    supt0|B(n𝖶p(σ)(μ^nB,μ^n)t)(GμH˙1,p(μγσ)t)|0.\sup_{t\geq 0}\left|\operatorname{\mathbb{P}}^{B}\left(\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\mu}_{n})\leq t\right)-\operatorname{\mathbb{P}}\left(\|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t\right)\right|\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0. (9)
  2. (ii)

    (Alternative case) Assume in addition that ν\nu is sub-Weibull with νμ\nu\neq\mu. Let 𝔳12\mathfrak{v}_{1}^{2} denote the asymptotic variance of n(𝖶p(σ)(μ^n,ν)𝖶p(σ)(μ,ν))\sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu)-\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big{)} given in Part (iii) of Theorem 1.1. Then, we have

    supt|B(n(𝖶p(σ)(μ^nB,ν)𝖶p(σ)(μ^n,ν))t)(N(0,𝔳12)t)|0.\sup_{t\in\mathbb{R}}\left|\operatorname{\mathbb{P}}^{B}\left(\sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\nu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu)\big{)}\leq t\right)-\operatorname{\mathbb{P}}\big{(}N(0,\mathfrak{v}_{1}^{2})\leq t\big{)}\right|\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0.

Part (ii) of the proposition is not surprising given that the Hadamard directional derivative of the function \Psi in Proposition 3.5 is \Psi_{0}^{\prime}(h)=h(g), which is linear in h. Part (i) is less obvious since the function h_{1}\mapsto\Phi(h_{1},0) from Proposition 3.3 has a nonlinear Hadamard directional derivative, \Phi_{(0,0)}^{\prime}(h_{1},0)=\|h_{1}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}. Recall that [Düm93, Proposition 1] and [FS19, Corollary 3.1] show that the bootstrap is inconsistent for functionals with nonlinear derivatives, but these results do not contradict Part (i) of Proposition 3.8, since our application of the bootstrap differs from theirs. For instance, [Düm93, Proposition 1] specialized to our setting states that the conditional law of \sqrt{n}\big{(}\Phi(\hat{\mu}_{n}^{B}-\mu,0)-\Phi(\hat{\mu}_{n}-\mu,0)\big{)}=\sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n}^{B},\mu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)\big{)} does not converge weakly to the law of \|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} in probability. Heuristically, \sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu) is nonnegative, while \sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n}^{B},\mu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)\big{)} can be negative, so the conditional law of the latter cannot mimic the distribution of the former. Further, when \mu is unknown, the conditional law of \sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n}^{B},\mu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)\big{)} is infeasible to compute. The correct bootstrap analog of \mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu) is \mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n}^{B},\hat{\mu}_{n})=\Phi(\hat{\mu}_{n}^{B}-\mu,\hat{\mu}_{n}-\mu), and the proof of Proposition 3.8 shows that it can be approximated by \|\hat{\mu}_{n}^{B}-\mu-(\hat{\mu}_{n}-\mu)\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}=\|\hat{\mu}_{n}^{B}-\hat{\mu}_{n}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}, whose conditional law (after scaling by \sqrt{n}) converges weakly to the law of \|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} in probability.
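
To make the correct bootstrap analog concrete, here is a minimal numerical sketch of the one-sample null bootstrap in (9), for d=1 and p=2. The helper smooth_wp_1d is a hypothetical, illustration-only proxy for \mathsf{W}_{p}^{(\sigma)}: it mimics the Gaussian convolutions by adding N(0,\sigma^{2}) noise to resampled points and then evaluates the one-dimensional order-statistics formula for \mathsf{W}_{p}; it is not the estimator analyzed in this paper, and the sample size, number of bootstrap replicates, and Monte Carlo size are arbitrary.

```python
import numpy as np

def smooth_wp_1d(x, y, sigma, p=2, m=2000, rng=None):
    """Crude Monte Carlo proxy for W_p^(sigma) between empirical measures in d = 1.

    The convolution with gamma_sigma is mimicked by adding N(0, sigma^2) noise to
    m resampled points from each sample; in one dimension, W_p between two
    empirical measures of equal size is computed from sorted samples.
    """
    rng = np.random.default_rng(rng)
    xs = rng.choice(x, size=m) + sigma * rng.standard_normal(m)
    ys = rng.choice(y, size=m) + sigma * rng.standard_normal(m)
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
n, sigma, B = 500, 1.0, 200
x = rng.standard_normal(n)                   # sample X_1, ..., X_n from mu

# bootstrap draws of sqrt(n) * W_p^(sigma)(mu_hat^B, mu_hat): resample from mu_hat
boot = np.empty(B)
for b in range(B):
    xB = rng.choice(x, size=n)               # bootstrap sample from mu_hat_n
    boot[b] = np.sqrt(n) * smooth_wp_1d(xB, x, sigma, rng=rng)

# conditional quantiles of the null bootstrap statistic, cf. (9)
print(np.quantile(boot, [0.90, 0.95, 0.99]))
```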

Next, consider the two-sample case. In addition to X1B,,XnBX_{1}^{B},\dots,X_{n}^{B} and μ^nB\hat{\mu}_{n}^{B}, given Y1,,YnY_{1},\dots,Y_{n}, let Y1B,,YnBY_{1}^{B},\dots,Y_{n}^{B} be an independent sample from ν^n\hat{\nu}_{n}, and set ν^nB:=n1i=1nδYiB\hat{\nu}_{n}^{B}:=n^{-1}\sum_{i=1}^{n}\delta_{Y_{i}^{B}}. Slightly abusing notation, we reuse B\operatorname{\mathbb{P}}^{B} for the conditional probability given (X1,Y1),(X2,Y2),(X_{1},Y_{1}),(X_{2},Y_{2}),\dots.

Proposition 3.9 (Bootstrap consistency: two-sample under the alternative).

Suppose that μ\mu and ν\nu satisfy Condition (4) and μν\mu\neq\nu. Let 𝔳22\mathfrak{v}_{2}^{2} denote the asymptotic variance of n(𝖶p(σ)(μ^n,ν^n)𝖶p(σ)(μ,ν))\sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})-\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)\big{)} given in Part (iv) of Theorem 1.1. Then, we have

supt|B(n(𝖶p(σ)(μ^nB,ν^nB)𝖶p(σ)(μ^n,ν^n))t)(N(0,𝔳22)t)|0.\sup_{t\in\mathbb{R}}\left|\operatorname{\mathbb{P}}^{B}\left(\sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\nu}_{n}^{B})-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})\big{)}\leq t\right)-\operatorname{\mathbb{P}}\left(N(0,\mathfrak{v}_{2}^{2})\leq t\right)\right|\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0.
Example 3.1 (Confidence interval for 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}).

Consider constructing a confidence interval for \mathsf{W}_{p}^{(\sigma)}(\mu,\nu). For \alpha\in(0,1), let \hat{\zeta}_{\alpha} denote the conditional \alpha-quantile of \mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\nu}_{n}^{B}) given the data. Then, by Proposition 3.9 above and Lemma 23.3 in [vdV98], the interval

[2𝖶p(σ)(μ^n,ν^n)ζ^1α/2,2𝖶p(σ)(μ^n,ν^n)ζ^α/2],\left[2\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})-\hat{\zeta}_{1-\alpha/2},2\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})-\hat{\zeta}_{\alpha/2}\right],

contains 𝖶p(σ)(μ,ν)\mathsf{W}_{p}^{(\sigma)}(\mu,\nu) with probability approaching 1α1-\alpha.
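
As an illustration, the basic bootstrap interval above can be assembled as follows (again d=1 and p=2, with the same hypothetical noise-injection proxy smooth_wp_1d for \mathsf{W}_{p}^{(\sigma)} as in the earlier sketch; the toy data and all tuning constants are for illustration only).

```python
import numpy as np

def smooth_wp_1d(x, y, sigma, p=2, m=2000, rng=None):
    # illustration-only proxy for W_p^(sigma): noise injection + 1-d sorting
    rng = np.random.default_rng(rng)
    xs = rng.choice(x, size=m) + sigma * rng.standard_normal(m)
    ys = rng.choice(y, size=m) + sigma * rng.standard_normal(m)
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
n, sigma, B, alpha = 500, 1.0, 500, 0.05
x = rng.standard_normal(n)                # sample from mu
y = rng.standard_normal(n) + 0.5          # sample from nu (mu != nu)

T = smooth_wp_1d(x, y, sigma, rng=rng)    # W_p^(sigma)(mu_hat_n, nu_hat_n)
zeta = np.empty(B)
for b in range(B):                        # two-sample bootstrap (Proposition 3.9)
    xB = rng.choice(x, size=n)
    yB = rng.choice(y, size=n)
    zeta[b] = smooth_wp_1d(xB, yB, sigma, rng=rng)   # W_p^(sigma)(mu_hat^B, nu_hat^B)

lo, hi = np.quantile(zeta, [alpha / 2, 1 - alpha / 2])
ci = (2 * T - hi, 2 * T - lo)             # basic bootstrap interval from Example 3.1
print(ci)
```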

For the two-sample case under the null, instead of separately sampling bootstrap draws from μ^n\hat{\mu}_{n} and ν^n\hat{\nu}_{n} (see Remark 3.4 below), we use the pooled empirical distribution ρ^n=(2n)1i=1n(δXi+δYi)\hat{\rho}_{n}=(2n)^{-1}\sum_{i=1}^{n}(\delta_{X_{i}}+\delta_{Y_{i}}) (cf. Chapter 3.7 in [vdVW96]). Given (X1,Y1),,(Xn,Yn)(X_{1},Y_{1}),\dots,(X_{n},Y_{n}), let Z1B,,Z_{1}^{B},\dots, Z2nBZ_{2n}^{B} be an independent sample from ρ^n\hat{\rho}_{n}, and set

ρ^n,1B=1ni=1nδZiBandρ^n,2B=1ni=n+12nδZiB.\hat{\rho}_{n,1}^{B}=\frac{1}{n}\sum_{i=1}^{n}\delta_{Z_{i}^{B}}\quad\text{and}\quad\hat{\rho}_{n,2}^{B}=\frac{1}{n}\sum_{i=n+1}^{2n}\delta_{Z_{i}^{B}}.

The following proposition shows that this two-sample bootstrap is consistent for the null limit distribution of empirical 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}.

Proposition 3.10 (Bootstrap consistency: two-sample under the null).

Suppose that μ\mu and ν\nu satisfy Condition (4). Then, for ρ=(μ+ν)/2\rho=(\mu+\nu)/2, we have

supt0|B(n𝖶p(σ)(ρ^n,1B,ρ^n,2B)t)(GρGρH˙1,p(ργσ)t)|0,\sup_{t\geq 0}\left|\operatorname{\mathbb{P}}^{B}\left(\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\rho}^{B}_{n,1},\hat{\rho}_{n,2}^{B})\leq t\right)-\operatorname{\mathbb{P}}\left(\|G_{\rho}-G_{\rho}^{\prime}\|_{\dot{H}^{-1,p}(\rho*\gamma_{\sigma})}\leq t\right)\right|\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0,

where GρG_{\rho}^{\prime} is an independent copy of GρG_{\rho}. In particular, if μ=ν\mu=\nu, then

supt0|B(n𝖶p(σ)(ρ^n,1B,ρ^n,2B)t)(GμGμH˙1,p(μγσ)t)|0.\sup_{t\geq 0}\left|\operatorname{\mathbb{P}}^{B}\left(\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\rho}^{B}_{n,1},\hat{\rho}_{n,2}^{B})\leq t\right)-\operatorname{\mathbb{P}}\left(\|G_{\mu}-G_{\mu}^{\prime}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t\right)\right|\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0.
Remark 3.4 (Inconsistency of naive bootstrap).

One may consider using 𝖶p(σ)(μ^nB,ν^nB)\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\nu}_{n}^{B}) (rather than 𝖶p(σ)(ρ^n,1B,ρ^n,2B)\mathsf{W}_{p}^{(\sigma)}(\hat{\rho}^{B}_{n,1},\hat{\rho}_{n,2}^{B})) to approximate the distribution of 𝖶p(σ)(μ^n,ν^n)\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n}), but this bootstrap is not consistent. Indeed, from the proof of Proposition 3.10, we may deduce that, if μ=ν\mu=\nu, then n𝖶p(σ)(μ^nB,ν^nB)\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\nu}_{n}^{B}) is expanded as

n(μ^nBμ^n)γσn(ν^nBν^n)γσ+n(μ^nν^n)γσH˙1,p(μγσ)+o(1),\big{\|}\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}-\sqrt{n}(\hat{\nu}_{n}^{B}-\hat{\nu}_{n})*\gamma_{\sigma}+\sqrt{n}(\hat{\mu}_{n}-\hat{\nu}_{n})*\gamma_{\sigma}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+o_{\operatorname{\mathbb{P}}}(1),

which converges in distribution to Gμ1Gμ2+Gμ3Gμ4H˙1,p(μγσ)\|G_{\mu}^{1}-G_{\mu}^{2}+G_{\mu}^{3}-G_{\mu}^{4}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} unconditionally, where Gμ1,,Gμ4G_{\mu}^{1},\dots,G_{\mu}^{4} are independent copies of GμG_{\mu}. This implies that the conditional law of n𝖶p(σ)(μ^nB,ν^nB)\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\nu}_{n}^{B}) does not converge weakly to the law of GμGμH˙1,p(μγσ)\|G_{\mu}-G_{\mu}^{\prime}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} in probability.

Example 3.2 (Testing the equality of distributions).

Consider testing the equality of distributions, i.e., H0:μ=νH_{0}:\mu=\nu against H1:μνH_{1}:\mu\neq\nu. We shall use n𝖶p(σ)(μ^n,ν^n)\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n}) as a test statistic and reject H0H_{0} if n𝖶p(σ)(μ^n,ν^n)>c\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})>c for some critical value cc. Proposition 3.10 implies that, if we choose c=c^1αc=\hat{c}_{1-\alpha} to be the conditional (1α)(1-\alpha)-quantile of n𝖶p(σ)(ρ^n,1B,ρ^n,2B)\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\rho}^{B}_{n,1},\hat{\rho}_{n,2}^{B}) given the data, then the resulting test is asymptotically of level α\alpha,

limn(n𝖶p(σ)(μ^n,ν^n)>c^1α)=αifμ=ν.\lim_{n\to\infty}\operatorname{\mathbb{P}}\left(\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})>\hat{c}_{1-\alpha}\right)=\alpha\quad\text{if}\ \mu=\nu.

Here α(0,1)\alpha\in(0,1) is the nominal level. To see that the test is consistent, note that if μν\mu\neq\nu, then 𝖶p(σ)(μ^n,ν^n)𝖶p(σ)(μ,ν)𝖶p(σ)(μ^n,μ)𝖶p(σ)(ν^n,ν)𝖶p(σ)(μ,ν)/2\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\hat{\nu}_{n})\geq\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\nu}_{n},\nu)\geq\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)/2 with probability approaching one, while c^1α=O(1)\hat{c}_{1-\alpha}=O_{\operatorname{\mathbb{P}}}(1) by Proposition 3.10.
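
A minimal sketch of the resulting pooled-bootstrap test (d=1, p=2, with the same hypothetical smooth_wp_1d proxy for \mathsf{W}_{p}^{(\sigma)} as in the sketches above) is given below; the critical value \hat{c}_{1-\alpha} is the conditional (1-\alpha)-quantile of \sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\rho}^{B}_{n,1},\hat{\rho}_{n,2}^{B}). All numerical choices here are illustrative assumptions.

```python
import numpy as np

def smooth_wp_1d(x, y, sigma, p=2, m=2000, rng=None):
    # illustration-only proxy for W_p^(sigma): noise injection + 1-d sorting
    rng = np.random.default_rng(rng)
    xs = rng.choice(x, size=m) + sigma * rng.standard_normal(m)
    ys = rng.choice(y, size=m) + sigma * rng.standard_normal(m)
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)) ** p) ** (1.0 / p)

rng = np.random.default_rng(2)
n, sigma, B, alpha = 500, 1.0, 500, 0.05
x = rng.standard_normal(n)                 # sample from mu
y = rng.standard_normal(n)                 # sample from nu (here mu = nu)

stat = np.sqrt(n) * smooth_wp_1d(x, y, sigma, rng=rng)

pooled = np.concatenate([x, y])            # rho_hat_n puts mass 1/(2n) on each point
null_draws = np.empty(B)
for b in range(B):
    z = rng.choice(pooled, size=2 * n)     # 2n i.i.d. draws from the pooled empirical measure
    null_draws[b] = np.sqrt(n) * smooth_wp_1d(z[:n], z[n:], sigma, rng=rng)

c = np.quantile(null_draws, 1 - alpha)     # bootstrap critical value (Proposition 3.10)
print("reject H0:", stat > c)
```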

Testing the equality of distributions using Wasserstein distances was considered in [RTC17], but their theoretical analysis is focused on the d=1d=1 case, partly because of the lack of null limit distribution results for empirical 𝖶p\mathsf{W}_{p} in higher dimensions. We overcome this obstacle by using the smooth Wasserstein distance.

4. Minimum distance estimation with 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}

We consider the application of our limit distribution theory to MDE with 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}. Given an independent sample X1,,XnX_{1},\dots,X_{n} from a distribution μ𝒫\mu\in\mathcal{P}, MDE aims to learn a generative model from a parametric family {νθ}θΘ𝒫\{\nu_{\theta}\}_{\theta\in\Theta}\subset\mathcal{P} that approximates μ\mu under some statistical divergence. We use 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} as the proximity measure and the empirical distribution μ^n\hat{\mu}_{n} as an estimate for μ\mu, which leads to the following MDE problem

infθΘ𝖶p(σ)(μ^n,νθ).\inf_{\theta\in\Theta}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta}).
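
For intuition, a toy instance of this MDE problem might look as follows: a one-dimensional Gaussian location family \nu_{\theta}=N(\theta,1), a grid search over \theta, and the hypothetical smooth_wp_1d proxy for \mathsf{W}_{p}^{(\sigma)} from the sketches above, with \nu_{\theta} itself replaced by an i.i.d. sample for simplicity. None of these choices are prescribed by the paper; the sketch only illustrates the shape of the optimization.

```python
import numpy as np

def smooth_wp_1d(x, y, sigma, p=2, m=2000, rng=None):
    # illustration-only proxy for W_p^(sigma): noise injection + 1-d sorting
    rng = np.random.default_rng(rng)
    xs = rng.choice(x, size=m) + sigma * rng.standard_normal(m)
    ys = rng.choice(y, size=m) + sigma * rng.standard_normal(m)
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)) ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
n, sigma = 1000, 1.0
theta_star = 0.3
x = theta_star + rng.standard_normal(n)          # sample from mu = N(theta*, 1)

thetas = np.linspace(-1.0, 1.0, 81)              # candidate parameters
losses = []
for theta in thetas:
    y_theta = theta + rng.standard_normal(n)     # surrogate sample from nu_theta = N(theta, 1)
    losses.append(smooth_wp_1d(x, y_theta, sigma, rng=rng))
theta_hat = thetas[int(np.argmin(losses))]       # approximate MDE solution
print(theta_hat)
```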

MDE with classic 𝖶1\mathsf{W}_{1} is called the Wasserstein GAN, which continues to underlie state-of-the-art methods in generative modeling [ACB17, GAA+17]. MDE with 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} was previously examined for p=1p=1 in [GGK20] and for p>1p>1 in [NGK21]. Specifically, [NGK21] established measurability, consistency, and parametric convergence rates for MDE with 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} for p>1p>1, but did not derive limit distribution results. We will expand on this prior work by providing limit distributions for the 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} MDE problem.

Analogously to the conditions of Theorem 4 in [GGK20], we assume the following.

Assumption 1.

Let 1<p<1<p<\infty, and assume that the following conditions hold. (i) The distribution μ\mu satisfies Condition (4). (ii) The parameter space Θd0\Theta\subset\mathbb{R}^{d_{0}} is compact with nonempty interior. (iii) The map θνθ\theta\mapsto\nu_{\theta} is continuous w.r.t. the weak topology. (iv) There exists a unique θ\theta^{\star} in the interior of Θ\Theta such that νθ=μ\nu_{\theta^{\star}}=\mu. (v) There exists a neighborhood N0N_{0} of θ\theta^{\star} such that (νθνθ)γσH˙1,p(μγσ)\left(\nu_{\theta}-\nu_{\theta^{\star}}\right)*\gamma_{\sigma}\in\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right) for every θN0\theta\in N_{0}. (vi) The map N0θ(νθνθ)γσH˙1,p(μγσ)N_{0}\ni\theta\mapsto\left(\nu_{\theta}-\nu_{\theta^{\star}}\right)*\gamma_{\sigma}\in\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right) is norm differentiable with non–singular derivative 𝔇\mathfrak{D} at θ\theta^{\star}. That is, there exists 𝔇=(𝔇1,,𝔇d0)(H˙1,p(μγσ))d0\mathfrak{D}=\left(\mathfrak{D}_{1},\dots,\mathfrak{D}_{d_{0}}\right)\in\big{(}\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)\big{)}^{d_{0}}, where 𝔇1,,𝔇d0\mathfrak{D}_{1},\dots,\mathfrak{D}_{d_{0}} are linearly independent elements of H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right), such that

(νθνθ)γσθθ,𝔇H˙1,p(μγσ)=o(|θθ|),\norm{(\nu_{\theta}-\nu_{\theta^{\star}})*\gamma_{\sigma}-\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}=o(\absolutevalue{\theta-\theta^{\star}}),

as θθ\theta\to\theta^{\star} in N0N_{0}, where t,𝔇=i=1d0ti𝔇i\left\langle t,\mathfrak{D}\right\rangle=\sum_{i=1}^{d_{0}}t_{i}\mathfrak{D}_{i} for t=(t1,,td0)d0t=(t_{1},\dots,t_{d_{0}})\in\mathbb{R}^{d_{0}}.

We derive limit distributions for the optimal value function and MDE solution, following the methodology of [Pol80, BJGR19, GGK20].

Theorem 4.1 (Limit distributions for MDE with 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}).

Suppose that Assumption 1 holds. Let \mathbb{G}_{n}^{(\sigma)}:=\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} be the smooth empirical process, and let G_{\mu} denote its weak limit in \dot{H}^{-1,p}(\mu*\gamma_{\sigma}); cf. Proposition 3.1. Then, the following hold.

  1. (i)

    We have infθΘn𝖶p(σ)(μ^n,νθ)dinftd0Gμt,𝔇H˙1,p(μγσ)\inf_{\theta\in\Theta}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})\stackrel{{\scriptstyle d}}{{\to}}\inf_{t\in\mathbb{R}^{d_{0}}}\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}.

  2. (ii)

    Let (θ^n)n(\hat{\theta}_{n})_{n\in\mathbb{N}} be a sequence of measurable estimators satisfying

    𝖶p(σ)(μ^n,νθ^n)infθΘ𝖶p(σ)(μ^n,νθ)+o(n1/2).{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\hat{\theta}_{n}})\leq\inf_{\theta\in\Theta}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})+o_{\mathbb{P}}(n^{-1/2}).

    Then, provided that argmintd0Gμt,𝔇H˙1,p(μγσ)\operatorname{argmin}_{t\in\mathbb{R}^{d_{0}}}\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} is almost surely unique, we have n(θ^nθ)dargmintd0Gμt,𝔇H˙1,p(μγσ)\sqrt{n}(\hat{\theta}_{n}-\theta^{\star})\stackrel{{\scriptstyle d}}{{\to}}\operatorname{argmin}_{t\in\mathbb{R}^{d_{0}}}\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}.

In general, it is nontrivial to verify that argmintd0Gμt,𝔇H˙1,p(μγσ)\operatorname{argmin}_{t\in\mathbb{R}^{d_{0}}}\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} is almost surely unique. However, for p=2p=2, the Hilbertian structure of H˙1,2(μγσ)\dot{H}^{-1,2}(\mu*\gamma_{\sigma}) guarantees this uniqueness. Let EE be an isometric isomorphism between H˙1,2(μγσ)\dot{H}^{-1,2}(\mu*\gamma_{\sigma}) and a closed subspace of L2(μγσ;d)L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d}); cf. Lemma 5.1. Setting G¯μ:=E(Gμ)\underline{G\mkern-3.0mu}\mkern 3.0mu_{\mu}:=E(G_{\mu}) and 𝔇¯=(𝔇¯1,,𝔇¯d0):=(E(𝔇1),,E(𝔇d0))\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu=\left(\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{1},\dots,\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{d_{0}}\right):=\big{(}E(\mathfrak{D}_{1}),\dots,E(\mathfrak{D}_{d_{0}})\big{)}, we have

Gμt,𝔇H˙1,2(μγσ)=G¯μt,𝔇¯L2(μγσ;d).\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,2}(\mu*\gamma_{\sigma})}=\norm{\underline{G\mkern-3.0mu}\mkern 3.0mu_{\mu}-\left\langle t,\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu\right\rangle}_{L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}.

The unique minimizer in tt of the above display is given by

\hat{t}_{\mu}=\left[\big{(}\left\langle\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{j},\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{k}\right\rangle_{L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}\big{)}_{1\leq j,k\leq d_{0}}\right]^{-1}\big{(}\left\langle\underline{G\mkern-3.0mu}\mkern 3.0mu_{\mu},\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{j}\right\rangle_{L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}\big{)}_{j=1}^{d_{0}}. (10)
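
Indeed, (10) is simply the solution of the normal equations for the orthogonal projection of \underline{G\mkern-3.0mu}\mkern 3.0mu_{\mu} onto the span of \underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{1},\dots,\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{d_{0}}:

\sum_{k=1}^{d_{0}}\left\langle\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{j},\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{k}\right\rangle_{L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}(\hat{t}_{\mu})_{k}=\left\langle\underline{G\mkern-3.0mu}\mkern 3.0mu_{\mu},\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{j}\right\rangle_{L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d})},\quad j=1,\dots,d_{0},

and the Gram matrix on the left-hand side is invertible because \underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{1},\dots,\underline{\mathfrak{D}\mkern-3.0mu}\mkern 3.0mu_{d_{0}} are linearly independent (E being an isometry).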

Since G¯μ\underline{G\mkern-3.0mu}\mkern 3.0mu_{\mu} is a centered Gaussian random variable in L2(μγσ;d)L^{2}(\mu*\gamma_{\sigma};\mathbb{R}^{d}), t^μ\hat{t}_{\mu} is a mean–zero Gaussian vector in d0\mathbb{R}^{d_{0}}.

Corollary 4.1 (Asymptotic normality for MDE solutions when p=2p=2).

Consider the setting of Theorem 4.1 Part (ii) and let p=2p=2. Then n(θ^nθ)dt^μ\sqrt{n}(\hat{\theta}_{n}-\theta^{\star})\stackrel{{\scriptstyle d}}{{\to}}\hat{t}_{\mu}, the mean–zero Gaussian vector in (10).

Without assuming the uniqueness of argmintd0Gμt,𝔇H˙1,p(μγσ)\operatorname{argmin}_{t\in\mathbb{R}^{d_{0}}}\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}, limit distributions for MDE solutions can be stated in terms of set-valued random variables. Consider the set of approximate minimizers

Θ^n:={θΘ:𝖶p(σ)(μ^n,νθ)infθΘ𝖶p(σ)(μ^n,νθ)+n1/2λn},\hat{\Theta}_{n}:=\left\{\theta\in\Theta:{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})\leq\inf_{\theta^{\prime}\in\Theta}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta^{\prime}})+n^{-1/2}\lambda_{n}\right\}, (11)

where λn\lambda_{n} is any nonnegative sequence with λn=o(1)\lambda_{n}=o_{\operatorname{\mathbb{P}}}(1). We will show that Θ^nθ+n1/2Kn\hat{\Theta}_{n}\subset\theta^{\star}+n^{-1/2}K_{n} with inner probability approaching one for some sequence KnK_{n} of random, convex, and compact sets; cf. [Pol80, Section 2]. To describe the sets KnK_{n}, for any β0\beta\geq 0 and hH˙1,p(μγσ)h\in\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right), define

K(h,β):={td0:ht,𝔇H˙1,p(μγσ)inftd0ht,𝔇H˙1,p(μγσ)+β}𝔎,K(h,\beta):=\left\{t\in\mathbb{R}^{d_{0}}:\norm{h-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq\inf_{t^{\prime}\in\mathbb{R}^{d_{0}}}\norm{h-\left\langle t^{\prime},\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\beta\right\}\in\mathfrak{K},

where 𝔎\mathfrak{K} is the class of compact, convex, and nonempty subsets of d0\mathbb{R}^{d_{0}} endowed with the Hausdorff topology. That is, the topology induced by the Hausdorff metric dH(K1,K2):=inf{δ>0:K2K1δ,K1K2δ}d_{H}(K_{1},K_{2}):=\inf\left\{\delta>0:K_{2}\subset K_{1}^{\delta},K_{1}\subset K_{2}^{\delta}\right\}, where Kδ:=xK{yd0:xyδ}K^{\delta}:=\bigcup_{x\in K}\left\{y\in\mathbb{R}^{d_{0}}:\norm{x-y}\leq\delta\right\}. Lemma 7.1 in [Pol80] shows that the map hK(h,β)h\mapsto K(h,\beta) is measurable from H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right) into 𝔎\mathfrak{K} for any β0\beta\geq 0.

Proposition 4.1 (Limit distribution for set of approximate minimizers).

Under Assumption 1, there exists a sequence of nonnegative real numbers \beta_{n}\downarrow 0 such that (i) \mathbb{P}_{*}\big{(}\hat{\Theta}_{n}\subset\theta^{\star}+n^{-1/2}K(\mathbb{G}_{n}^{(\sigma)},\beta_{n})\big{)}\to 1, where \mathbb{P}_{*} denotes inner probability; and (ii) K\big{(}\mathbb{G}_{n}^{(\sigma)},\beta_{n}\big{)}\stackrel{d}{\to}K(G_{\mu},0) as \mathfrak{K}-valued random variables.

The proof of this proposition is an adaptation of that of Theorem 7.2 in [Pol80]. For completeness, a self-contained argument is provided in Appendix A.3.

5. Remaining proofs

5.1. Proofs for Section 3.1.1

We fix some notation. For a nonempty set SS, let (S)\ell^{\infty}(S) denote the space of bounded real functions on SS endowed with the sup-norm ,S=supsS||\|\cdot\|_{\infty,S}=\sup_{s\in S}|\cdot|. The space ((S),,S)(\ell^{\infty}(S),\|\cdot\|_{\infty,S}) is a Banach space.

5.1.1. Proof of Proposition 3.1

We divide the proof into three steps. In Steps 1 and 2, we will establish weak convergence of n(μ^nμ)γσ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} in H˙1,p(γσ)\dot{H}^{-1,p}(\gamma_{\sigma}). Step 3 is devoted to weak convergence of n(μ^nμ)γσ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} in H˙1,p(μγσ)\dot{H}^{-1,p}(\mu*\gamma_{\sigma}).

Step 1. Observe that

((μ^nμ)γσ)(f)=(μ^nμ)(fϕσ).\big{(}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\big{)}(f)=(\hat{\mu}_{n}-\mu)(f*\phi_{\sigma}). (12)

Consider the function classes

={fC˙0:fH˙1,q(γσ)1}andϕσ={fϕσ:f}.\mathcal{F}=\big{\{}f\in\dot{C}_{0}^{\infty}:\|f\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\leq 1\big{\}}\quad\text{and}\quad\mathcal{F}*\phi_{\sigma}=\big{\{}f*\phi_{\sigma}:f\in\mathcal{F}\big{\}}.

The proof of Theorem 3 in [NGK21] shows that the function class ϕσ\mathcal{F}*\phi_{\sigma} is μ\mu-Donsker. For completeness, we provide an outline of the argument. Since for any constant aa\in\mathbb{R} and any function ff\in\mathcal{F}, (μ^nμ)(fϕσ)=(μ^nμ)((fa)ϕσ)(\hat{\mu}_{n}-\mu)(f*\phi_{\sigma})=(\hat{\mu}_{n}-\mu)\big{(}(f-a)*\phi_{\sigma}\big{)}, it suffices to show that 0ϕσ\mathcal{F}_{0}*\phi_{\sigma} with 0:={f:γσ(f)=0}\mathcal{F}_{0}:=\{f\in\mathcal{F}:\gamma_{\sigma}(f)=0\} is μ\mu-Donsker. To this end, we will apply Theorem 1 in [vdV96] or its simple adaptation, Lemma 8 in [NGK21].

Fix any η(0,1)\eta\in(0,1). We first observe that, for any f0f\in\mathcal{F}_{0} and any multi-index k=(k1,,kd)0dk=(k_{1},\dots,k_{d})\in\mathbb{N}_{0}^{d}, we have

|k(fϕσ)(x)|(𝖢q(γσ)σk¯+1)exp((p1)|x|22σ2(1η))\big{|}\partial^{k}(f*\phi_{\sigma})(x)\big{|}\lesssim(\mathsf{C}_{q}(\gamma_{\sigma})\vee\sigma^{-\bar{k}+1})\exp\left(\frac{(p-1)|x|^{2}}{2\sigma^{2}(1-\eta)}\right) (13)

up to constants independent of f,xf,x, and σ\sigma, where k¯=j=1dkj\bar{k}=\sum_{j=1}^{d}k_{j}. Here k=1k1dkd\partial^{k}=\partial_{1}^{k_{1}}\cdots\partial_{d}^{k_{d}} is the differential operator and 𝖢q(γσ)\mathsf{C}_{q}(\gamma_{\sigma}) is the qq-Poincaré constant for the Gaussian measure γσ\gamma_{\sigma}. To see this, observe that

(fϕσ)(x)=dϕσ(xy)ϕσ(y)f(y)ϕσ(y)𝑑y.(f*\phi_{\sigma})(x)=\int_{\mathbb{R}^{d}}\frac{\phi_{\sigma}(x-y)}{\phi_{\sigma}(y)}f(y)\phi_{\sigma}(y)dy.

Applying Hölder’s inequality and using the fact that fLq(γσ)𝖢q(γσ)fH˙1,q(γσ)𝖢q(γσ)\|f\|_{L^{q}(\gamma_{\sigma})}\leq\mathsf{C}_{q}(\gamma_{\sigma})\|f\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\leq\mathsf{C}_{q}(\gamma_{\sigma}) (recall that γσ(f)=0\gamma_{\sigma}(f)=0), we obtain

|(fϕσ)(x)|𝖢q(γσ)[dϕσp(xy)ϕσp1(y)𝑑y]1/p.|(f*\phi_{\sigma})(x)|\leq\mathsf{C}_{q}(\gamma_{\sigma})\left[\int_{\mathbb{R}^{d}}\frac{\phi_{\sigma}^{p}(x-y)}{\phi_{\sigma}^{p-1}(y)}dy\right]^{1/p}.

A direct calculation further shows that

dϕσp(xy)ϕσp1(y)𝑑y=exp(p(p1)|x|22σ2),\int_{\mathbb{R}^{d}}\frac{\phi_{\sigma}^{p}(x-y)}{\phi^{p-1}_{\sigma}(y)}dy=\exp\left(\frac{p(p-1)|x|^{2}}{2\sigma^{2}}\right),

which implies

|(fϕσ)(x)|𝖢q(γσ)exp((p1)|x|22σ2),|(f*\phi_{\sigma})(x)|\leq\mathsf{C}_{q}(\gamma_{\sigma})\exp\left(\frac{(p-1)|x|^{2}}{2\sigma^{2}}\right),

establishing (13) when k¯=0\bar{k}=0. Derivative bounds follow similarly; see [NGK21] for details.
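
For the reader's convenience, the Gaussian integral used above follows by completing the square in the exponent: since p|x-y|^{2}-(p-1)|y|^{2}=|y-px|^{2}-p(p-1)|x|^{2},

\int_{\mathbb{R}^{d}}\frac{\phi_{\sigma}^{p}(x-y)}{\phi_{\sigma}^{p-1}(y)}dy=e^{\frac{p(p-1)|x|^{2}}{2\sigma^{2}}}\int_{\mathbb{R}^{d}}(2\pi\sigma^{2})^{-d/2}e^{-\frac{|y-px|^{2}}{2\sigma^{2}}}dy=\exp\left(\frac{p(p-1)|x|^{2}}{2\sigma^{2}}\right).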

Next, we construct a cover {𝒳j}j=1\{\mathcal{X}_{j}\}_{j=1}^{\infty} of d\mathbb{R}^{d}. Let Br=B(0,r)B_{r}=B(0,r). For δ>0\delta>0 fixed and r=2,3,r=2,3,\dots, let {x1(r),,xNr(r)}\{x_{1}^{(r)},\dots,x_{N_{r}}^{(r)}\} be a minimal δ\delta-net of BrδB(r1)δB_{r\delta}\setminus B_{(r-1)\delta}. Set x1(1)=0x_{1}^{(1)}=0 with N1=1N_{1}=1. It is not difficult to see from a volumetric argument that Nr=O(rd1)N_{r}=O(r^{d-1}). Set 𝒳j=B(xj(r),δ)\mathcal{X}_{j}=B(x_{j}^{(r)},\delta) for j=k=1r1Nk+1,,k=1rNkj=\sum_{k=1}^{r-1}N_{k}+1,\dots,\sum_{k=1}^{r}N_{k}. By construction, {𝒳j}j=1\{\mathcal{X}_{j}\}_{j=1}^{\infty} forms a cover of d\mathbb{R}^{d} with diameter 2δ2\delta. Set α=d/2+1\alpha=\lfloor d/2\rfloor+1 and Mj=supf0maxk¯αsupxint(𝒳j)|k(fϕσ)(x)|M_{j}=\sup_{f\in\mathcal{F}_{0}}\max_{\bar{k}\leq\alpha}\sup_{x\in\mathrm{int}(\mathcal{X}_{j})}|\partial^{k}(f*\phi_{\sigma})(x)|. By Theorem 1 in [vdV96] combined with Theorem 2.7.1 in [vdVW96] (or their simple adaptation; cf. Lemma 8 in [NGK21]), 0ϕσ\mathcal{F}_{0}*\phi_{\sigma} is μ\mu-Donsker if j=1Mjμ(𝒳j)1/2<\sum_{j=1}^{\infty}M_{j}\mu(\mathcal{X}_{j})^{1/2}<\infty. By inequality (13),

maxk=1r1Nk+1jk=1rNjMjσd/2exp((p1)r2δ22σ2(1η))\max_{\sum_{k=1}^{r-1}N_{k}+1\leq j\leq\sum_{k=1}^{r}N_{j}}M_{j}\lesssim\sigma^{-\lfloor d/2\rfloor}\exp\left(\frac{(p-1)r^{2}\delta^{2}}{2\sigma^{2}(1-\eta)}\right)

up to constants independent of rr and σ\sigma. Hence, j=1Mjμ(𝒳j)1/2\sum_{j=1}^{\infty}M_{j}\mu(\mathcal{X}_{j})^{1/2} is finite if

r=1rd1exp((p1)r2δ22σ2(1η))(|X|>(r1)δ)<.\sum_{r=1}^{\infty}r^{d-1}\exp\left(\frac{(p-1)r^{2}\delta^{2}}{2\sigma^{2}(1-\eta)}\right)\sqrt{\operatorname{\mathbb{P}}(|X|>(r-1)\delta)}<\infty.

By Riemann approximation, the sum on the left-hand side above can be bounded by

δd11td1exp((p1)t22σ2(1η))(|X|>t2δ)𝑑t,\delta^{-d-1}\int_{1}^{\infty}t^{d-1}\exp\left(\frac{(p-1)t^{2}}{2\sigma^{2}(1-\eta)}\right)\sqrt{\operatorname{\mathbb{P}}(|X|>t-2\delta)}dt,

which is finite under our assumption by choosing η\eta and δ\delta sufficiently small, and absorbing td1t^{d-1} into the exponential function.

Step 2. Let 𝒰={fH˙1,q(γσ):fH˙1,q(γσ)1}\mathcal{U}=\{f\in\dot{H}^{1,q}(\gamma_{\sigma}):\|f\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\leq 1\}. Recall from Remark 2.1 that H˙1,q(γσ)Lq(γσ)\dot{H}^{1,q}(\gamma_{\sigma})\subset L^{q}(\gamma_{\sigma}). From Step 1, we know that ϕσ\mathcal{F}*\phi_{\sigma} is μ\mu-Donsker. The same conclusion holds with \mathcal{F} replaced by 𝒰\mathcal{U}. This can be verified as follows. From the proof of (13) when k¯=0\bar{k}=0, we see that for f1,f2H˙1,q(γσ)f_{1},f_{2}\in\dot{H}^{1,q}(\gamma_{\sigma}) with γσ\gamma_{\sigma}-mean zero,

|(f1ϕσ)(x)(f2ϕσ)(x)|𝖢q(γσ)f1f2H˙1,q(γσ)exp((p1)|x|22σ2),xd.\big{|}(f_{1}*\phi_{\sigma})(x)-(f_{2}*\phi_{\sigma})(x)\big{|}\leq\mathsf{C}_{q}(\gamma_{\sigma})\|f_{1}-f_{2}\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\exp\left(\frac{(p-1)|x|^{2}}{2\sigma^{2}}\right),\quad\forall\,x\in\mathbb{R}^{d}.

Since the exponential function on the right-hand side is square-integrable w.r.t. μ\mu under Condition (4) and 0\mathcal{F}_{0} is dense in 𝒰0:={f𝒰:γσ(f)=0}\mathcal{U}_{0}:=\{f\in\mathcal{U}:\gamma_{\sigma}(f)=0\} for H˙1,q(γσ)\|\cdot\|_{\dot{H}^{1,q}(\gamma_{\sigma})} by construction (cf. Remark 2.1), we see that

\mathcal{U}_{0}*\phi_{\sigma}\subset\big{\{}g:\exists\,g_{m}\in\mathcal{F}_{0}*\phi_{\sigma}\ \text{such that $g_{m}\to g$ pointwise and in $L^{2}(\mu)$}\big{\}}.

Thus, by Theorem 2.10.2 in [vdVW96], \mathcal{U}_{0}*\phi_{\sigma} (or equivalently, \mathcal{U}*\phi_{\sigma}) is \mu-Donsker. Since the map \ell^{\infty}(\mathcal{U}*\phi_{\sigma})\ni L\mapsto(L(f*\phi_{\sigma}))_{f\in\mathcal{U}}\in\ell^{\infty}(\mathcal{U}) is an isometric isomorphism onto its image, in view of (12), we have \sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\stackrel{d}{\to}G_{\mu}^{\circ} in \ell^{\infty}(\mathcal{U}) for some tight Gaussian process G_{\mu}^{\circ}.

Let lin(𝒰)\operatorname{lin}^{\infty}(\mathcal{U}) denote all bounded real functionals LL on 𝒰\mathcal{U}, such that L(0)=0L(0)=0 and

L(αf+(1α)g)=αL(f)+(1α)L(g),0α1,f,g𝒰.L(\alpha f+(1-\alpha)g)=\alpha L(f)+(1-\alpha)L(g),\quad 0\leq\alpha\leq 1,\ f,g\in\mathcal{U}.

Equip lin(𝒰)\operatorname{lin}^{\infty}(\mathcal{U}) with the norm ,𝒰=supf𝒰||\|\cdot\|_{\infty,\mathcal{U}}=\sup_{f\in\mathcal{U}}|\cdot|. Each element in lin(𝒰)\operatorname{lin}^{\infty}(\mathcal{U}) extends uniquely to the corresponding element in H˙1,p(γσ)\dot{H}^{-1,p}(\gamma_{\sigma}), and the extension, denoted by ι:lin(𝒰)H˙1,p(γσ)\iota:\operatorname{lin}^{\infty}(\mathcal{U})\to\dot{H}^{-1,p}(\gamma_{\sigma}), is isometrically isomorphic. This follows from the same argument as the proof of Lemma 1 in [Nic09]. We omit the details.

Since lin(𝒰)\operatorname{lin}^{\infty}(\mathcal{U}) is a closed subspace of (𝒰)\ell^{\infty}(\mathcal{U}) and n(μ^nμ)γσ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} has paths in lin(𝒰)\operatorname{lin}^{\infty}(\mathcal{U}), we see that Gμlin(𝒰)G_{\mu}^{\circ}\in\operatorname{lin}^{\infty}(\mathcal{U}) with probability one by the portmanteau theorem and n(μ^nμ)γσdGμ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu}^{\circ} in lin(𝒰)\operatorname{lin}^{\infty}(\mathcal{U}). Now, since n(μ^nμ)γσ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} is a (random) signed measure that is bounded on 𝒰\mathcal{U} with probability one, we can regard n(μ^nμ)γσ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} as a random variable with values in H˙1,p(γσ)\dot{H}^{-1,p}(\gamma_{\sigma}). Conclude that n(μ^nμ)γσdιGμ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\stackrel{{\scriptstyle d}}{{\to}}\iota\circ G_{\mu}^{\circ} in H˙1,p(γσ)\dot{H}^{-1,p}(\gamma_{\sigma}) by the continuous mapping theorem. For notational convenience, redefine GμG_{\mu}^{\circ} by ιGμ\iota\circ G_{\mu}^{\circ}. The limit variable Gμ=(Gμ(f))fH˙1,q(γσ)G_{\mu}^{\circ}=(G_{\mu}^{\circ}(f))_{f\in\dot{H}^{1,q}(\gamma_{\sigma})} is a centered Gaussian process with covariance function Cov(Gμ(f),Gμ(g))=Covμ(fϕσ,gϕσ)\operatorname{Cov}(G^{\circ}_{\mu}(f),G^{\circ}_{\mu}(g)\big{)}=\operatorname{Cov}_{\mu}(f*\phi_{\sigma},g*\phi_{\sigma}).

Step 3. We will show that n(μ^nμ)γσ\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} converges in distribution to a centered Gaussian process in H˙1,p(μγσ)\dot{H}^{-1,p}(\mu*\gamma_{\sigma}). For XμX\sim\mu with a=𝔼[X]a=\operatorname{\mathbb{E}}[X], let μa\mu^{-a} denote the distribution of XaX-a, and let μ^na=n1i=1nδ(Xia)\hat{\mu}_{n}^{-a}=n^{-1}\sum_{i=1}^{n}\delta_{(X_{i}-a)}. It is not difficult to see that μa\mu^{-a} satisfies Condition (4). Applying the result of Step 2 with μ\mu replaced by μa\mu^{-a}, we have n(μ^naμa)γσdGμa\sqrt{n}(\hat{\mu}_{n}^{-a}-\mu^{-a})*\gamma_{\sigma}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu^{-a}}^{\circ} in H˙1,p(γσ)\dot{H}^{-1,p}(\gamma_{\sigma}). Since H˙1,q(γσ)H˙1,q(μaγσ)\|\cdot\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\lesssim\|\cdot\|_{\dot{H}^{1,q}(\mu^{-a}*\gamma_{\sigma})} (as d(μaγσ)/dγσe𝔼μ[|Xa|2]/(2σ2)d(\mu^{-a}*\gamma_{\sigma})/d\gamma_{\sigma}\geq e^{-\operatorname{\mathbb{E}}_{\mu}[|X-a|^{2}]/(2\sigma^{2})} by Jensen’s inequality), we have H˙1,p(μaγσ)H˙1,p(γσ)\|\cdot\|_{\dot{H}^{-1,p}(\mu^{-a}*\gamma_{\sigma})}\lesssim\|\cdot\|_{\dot{H}^{-1,p}(\gamma_{\sigma})}, i.e., the continuous embedding H˙1,p(γσ)H˙1,p(μaγσ)\dot{H}^{-1,p}(\gamma_{\sigma})\hookrightarrow\dot{H}^{-1,p}(\mu^{-a}*\gamma_{\sigma}) holds. Thus n(μ^naμa)γσd(Gμa(f))fH˙1,q(μaγσ)\sqrt{n}(\hat{\mu}_{n}^{-a}-\mu^{-a})*\gamma_{\sigma}\stackrel{{\scriptstyle d}}{{\to}}(G_{\mu^{-a}}^{\circ}(f))_{f\in\dot{H}^{1,q}(\mu^{-a}*\gamma_{\sigma})} in H˙1,p(μaγσ)\dot{H}^{-1,p}(\mu^{-a}*\gamma_{\sigma}).

Observe that for φC0\varphi\in C_{0}^{\infty},

φ(+a)H˙1,q(μaγσ)q=d|φ(+a)|qd(μaγσ)=d|φ|qd(μγσ)=φH˙1,q(μγσ)q.\begin{split}\|\varphi(\cdot+a)\|_{\dot{H}^{1,q}(\mu^{-a}*\gamma_{\sigma})}^{q}&=\int_{\mathbb{R}^{d}}|\nabla\varphi(\cdot+a)|^{q}d(\mu^{-a}*\gamma_{\sigma})\\ &=\int_{\mathbb{R}^{d}}|\nabla\varphi|^{q}d(\mu*\gamma_{\sigma})=\|\varphi\|_{\dot{H}^{1,q}(\mu*\gamma_{\sigma})}^{q}.\end{split}

Thus, the map \tau_{a}:\dot{H}^{-1,p}(\mu^{-a}*\gamma_{\sigma})\to\dot{H}^{-1,p}(\mu*\gamma_{\sigma}), defined by \tau_{a}(h)(f)=h(f(\cdot+a)), is continuous (indeed, an isometric isomorphism). Conclude that

n(μ^nμ)γσ=τa(n(μ^naμa)γσ)dτaGμa=:GμinH˙1,p(μγσ).\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}=\tau_{a}\big{(}\sqrt{n}(\hat{\mu}_{n}^{-a}-\mu^{-a})*\gamma_{\sigma}\big{)}\stackrel{{\scriptstyle d}}{{\to}}\tau_{a}G^{\circ}_{\mu^{-a}}=:G_{\mu}\quad\text{in}\ \dot{H}^{-1,p}(\mu*\gamma_{\sigma}).

The limit variable Gμ=(Gμ(f))fH˙1,q(μγσ)=(Gμa(f(+a)))fH˙1,q(μγσ)G_{\mu}=(G_{\mu}(f))_{f\in\dot{H}^{1,q}(\mu*\gamma_{\sigma})}=(G_{\mu^{-a}}^{\circ}(f(\cdot+a)))_{f\in\dot{H}^{1,q}(\mu*\gamma_{\sigma})} is a centered Gaussian process with covariance function

\begin{split}\operatorname{Cov}(G_{\mu}(f),G_{\mu}(g))&=\operatorname{Cov}\big{(}G^{\circ}_{\mu^{-a}}(f(\cdot+a)),G^{\circ}_{\mu^{-a}}(g(\cdot+a))\big{)}\\ &=\operatorname{Cov}_{\mu^{-a}}\big{(}f(\cdot+a)*\phi_{\sigma},g(\cdot+a)*\phi_{\sigma}\big{)}\\ &=\operatorname{Cov}_{\mu^{-a}}\big{(}f*\phi_{\sigma}(\cdot+a),g*\phi_{\sigma}(\cdot+a)\big{)}\\ &=\operatorname{Cov}_{\mu}(f*\phi_{\sigma},g*\phi_{\sigma}).\end{split}

This completes the proof. ∎

Remark 5.1 (Alternative proof for p=2p=2).

Observe that (μ^nμ)γσ=n1i=1n(δXiμ)γσ=n1i=1nZi(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}=n^{-1}\sum_{i=1}^{n}(\delta_{X_{i}}-\mu)*\gamma_{\sigma}=n^{-1}\sum_{i=1}^{n}Z_{i} with Zi=(δXiμ)γσZ_{i}=(\delta_{X_{i}}-\mu)*\gamma_{\sigma}, and that Z1,Z2,Z_{1},Z_{2},\dots are i.i.d. random variables with values in H˙1,p(γσ)\dot{H}^{-1,p}(\gamma_{\sigma}) (cf. (13)). Since H˙1,2(γσ)\dot{H}^{-1,2}(\gamma_{\sigma}) is isometrically isomorphic to a closed subspace of L2(γσ;d)L^{2}(\gamma_{\sigma};\mathbb{R}^{d}) (see Lemma 5.1 ahead), we may apply the CLT in the Hilbert space to derive a limit distribution for n(μ^nμ)γσ=n1/2i=1nZi\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}=n^{-1/2}\sum_{i=1}^{n}Z_{i} in H˙1,2(γσ)\dot{H}^{-1,2}(\gamma_{\sigma}). Let E:H˙1,2(γσ)Lp(γσ;d)E:\dot{H}^{-1,2}(\gamma_{\sigma})\to L^{p}(\gamma_{\sigma};\mathbb{R}^{d}) be the linear isometry given in Lemma 5.1 ahead and Z¯i=E(Zi)\underline{Z}_{i}=E(Z_{i}) be the corresponding L2(γσ;d)L^{2}(\gamma_{\sigma};\mathbb{R}^{d})-valued random variables. Since L2(γσ;d)L^{2}(\gamma_{\sigma};\mathbb{R}^{d}) is a Hilbert space, n1/2i=1nZin^{-1/2}\sum_{i=1}^{n}Z_{i} obeys the CLT if 𝔼[Z¯1L2(γσ;d)2]=𝔼[Z1H˙1,2(γσ)2]<\operatorname{\mathbb{E}}\big{[}\|\underline{Z}_{1}\|^{2}_{L^{2}(\gamma_{\sigma};\mathbb{R}^{d})}\big{]}=\operatorname{\mathbb{E}}\big{[}\|Z_{1}\|_{\dot{H}^{-1,2}(\gamma_{\sigma})}^{2}\big{]}<\infty, which is satisfied under Condition (4). Indeed, for p=2p=2, it is not difficult to see that the CLT in H˙1,2(γσ)\dot{H}^{-1,2}(\gamma_{\sigma}) holds for n1/2i=1nZin^{-1/2}\sum_{i=1}^{n}Z_{i} under a slightly weaker moment condition, namely, de|x|2/σ2𝑑μ(x)<\int_{\mathbb{R}^{d}}e^{|x|^{2}/\sigma^{2}}d\mu(x)<\infty.

5.1.2. Proof of Proposition 3.2

Part (i). Let

𝔉={fϕσ:fH˙1,q(γσ),fLq(γσ)1,fH˙1,q(γσ)C}\mathfrak{F}=\{f*\phi_{\sigma}:f\in\dot{H}^{1,q}(\gamma_{\sigma}),\|f\|_{L^{q}(\gamma_{\sigma})}\leq 1,\|f\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\leq C\}

for some sufficiently large but fixed constant CC. It is not difficult to see that limn(μ^nμ)γσH˙1,p(γσ)=0\lim_{n\to\infty}\|(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\|_{\dot{H}^{-1,p}(\gamma_{\sigma})}=0 a.s. if and only if 𝔉\mathfrak{F} is μ\mu-Glivenko-Cantelli.

Suppose first that 𝔉\mathfrak{F} is μ\mu-Glivenko-Cantelli. Let F𝔉F_{\mathfrak{F}} denote the minimal envelope for 𝔉\mathfrak{F}, i.e., F𝔉(x)=supf𝔉|f(x)|F_{\mathfrak{F}}(x)=\sup_{f\in\mathfrak{F}}|f(x)|. By Theorem 3.7.14 in [GN16], F𝔉F_{\mathfrak{F}} must be μ\mu-integrable. We shall bound F𝔉F_{\mathfrak{F}} from below. Fix any xdx\in\mathbb{R}^{d}. Consider

φx(y)=gxp1(y)gxLp(γσ)p1withgx(y)=ϕσ(xy)ϕσ(y)=e|x|2/(2σ2)+x,y/σ2,yd.\varphi_{x}(y)=\frac{g_{x}^{p-1}(y)}{\|g_{x}\|_{L^{p}(\gamma_{\sigma})}^{p-1}}\quad\text{with}\quad g_{x}(y)=\frac{\phi_{\sigma}(x-y)}{\phi_{\sigma}(y)}=e^{-|x|^{2}/(2\sigma^{2})+\langle x,y\rangle/\sigma^{2}},\quad y\in\mathbb{R}^{d}.

Observe that yφx(y)=((p1)x/σ2)φx(y)\nabla_{y}\varphi_{x}(y)=\big{(}(p-1)x/\sigma^{2}\big{)}\varphi_{x}(y) and thus φxH˙1,q(γσ)=(p1)|x|/σ2\|\varphi_{x}\|_{\dot{H}^{1,q}(\gamma_{\sigma})}=(p-1)|x|/\sigma^{2}. Thus, for φ~x=φx/(1+|x|)\tilde{\varphi}_{x}=\varphi_{x}/(1+|x|), we have φ~xLq(γσ)1,φ~xH˙1,q(γσ)(p1)/σ2\|\tilde{\varphi}_{x}\|_{L^{q}(\gamma_{\sigma})}\leq 1,\|\tilde{\varphi}_{x}\|_{\dot{H}^{1,q}(\gamma_{\sigma})}\leq(p-1)/\sigma^{2}, and

(φ~xϕσ)(x)=11+|x|gxLp(γσ)=11+|x|e(p1)|x|2/(2σ2).(\tilde{\varphi}_{x}*\phi_{\sigma})(x)=\frac{1}{1+|x|}\|g_{x}\|_{L^{p}(\gamma_{\sigma})}=\frac{1}{1+|x|}e^{(p-1)|x|^{2}/(2\sigma^{2})}.

Also, from Proposition 1.5.2 in [Bog98], we see that φ~xH˙1,q(γσ)\tilde{\varphi}_{x}\in\dot{H}^{1,q}(\gamma_{\sigma}). Conclude that, as long as C(p1)/σ2C\geq(p-1)/\sigma^{2},

F𝔉(x)11+|x|e(p1)|x|2/(2σ2).F_{\mathfrak{F}}(x)\geq\frac{1}{1+|x|}e^{(p-1)|x|^{2}/(2\sigma^{2})}.

Now, since the left-hand side is \mu-integrable, so is the right-hand side, whence \int_{\mathbb{R}^{d}}e^{\theta|x|^{2}/(2\sigma^{2})}d\mu(x)<\infty for any \theta<p-1.

Part (ii). Conversely, suppose that de(p1)|x|2/(2σ2)𝑑μ(x)<\int_{\mathbb{R}^{d}}e^{(p-1)|x|^{2}/(2\sigma^{2})}d\mu(x)<\infty, which ensures that F𝔉F_{\mathfrak{F}} is μ\mu-integrable from (13). From the proof of Proposition 3.1, for any M>0M>0, we see that the restricted function class {f𝟙F𝔉M:f𝔉}\{f\mathbbm{1}_{F_{\mathfrak{F}}\leq M}:f\in\mathfrak{F}\} is μ\mu-Donsker and thus μ\mu-Glivenko-Cantelli (cf. Theorem 3.7.14 in [GN16]). Since the envelope function F𝔉F_{\mathfrak{F}} is μ\mu-integrable, we conclude that 𝔉\mathfrak{F} is μ\mu-Glivenko-Cantelli; cf. the proof of Theorem 3.7.14 in [GN16]. ∎

5.2. Proofs for Section 3.2

Recall that 1<p<1<p<\infty and qq is its conjugate index, i.e., 1/p+1/q=11/p+1/q=1.

5.2.1. Proof of Lemma 3.2

One of the main ingredients of the proof of Lemma 3.2 is Theorem 8.3.1 in [AGS08], which is stated next (see also the Benamou-Brenier formula [BB00]).

Theorem 5.1 (Theorem 8.3.1 in [AGS08]).

Let II be an open interval, and let ItμtI\ni t\mapsto\mu_{t} be a continuous curve in 𝒫p(d)\mathcal{P}_{p}(\mathbb{R}^{d}) (equipped with 𝖶p\mathsf{W}_{p}) such that for some Borel vector field d×I(x,t)vt(x)d\mathbb{R}^{d}\times I\ni(x,t)\mapsto v_{t}(x)\in\mathbb{R}^{d}, the continuity equation

tμt+(vtμt)=0ind×I\partial_{t}\mu_{t}+\nabla\cdot(v_{t}\mu_{t})=0\quad\text{in}\ \ \mathbb{R}^{d}\times I (14)

holds in the distributional sense, i.e.,

Id(tφ(x,t)+vt(x),xφ(x,t))𝑑μt(x)𝑑t=0,φC0(d×I).\int_{I}\int_{\mathbb{R}^{d}}(\partial_{t}\varphi(x,t)+\langle v_{t}(x),\nabla_{x}\varphi(x,t)\rangle)d\mu_{t}(x)dt=0,\quad\forall\varphi\in C_{0}^{\infty}(\mathbb{R}^{d}\times I).

If vtLp(μt;d)L1(I)\|v_{t}\|_{L^{p}(\mu_{t};\mathbb{R}^{d})}\in L^{1}(I), then 𝖶p(μa,μb)abvtLp(μt;d)𝑑t\mathsf{W}_{p}(\mu_{a},\mu_{b})\leq\int_{a}^{b}\|v_{t}\|_{L^{p}(\mu_{t};\mathbb{R}^{d})}dt for all a<ba<b with a,bIa,b\in I.

For a vector field v:ddv:\mathbb{R}^{d}\to\mathbb{R}^{d}, define

jp(v):={|v|p2vifv00otherwise.j_{p}(v):=\begin{cases}|v|^{p-2}v&\text{if}\ v\neq 0\\ 0&\text{otherwise}\end{cases}.

Observe that w=jp(v)w=j_{p}(v) if and only if v=jq(w)v=j_{q}(w), and for any ρ𝒫\rho\in\mathcal{P},

jp(v)Lq(ρ;d)q=vLp(ρ;d)p=djp(v),v𝑑ρ.\|j_{p}(v)\|_{L^{q}(\rho;\mathbb{R}^{d})}^{q}=\|v\|_{L^{p}(\rho;\mathbb{R}^{d})}^{p}=\int_{\mathbb{R}^{d}}\langle j_{p}(v),v\rangle d\rho.
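
Both identities follow from |j_{p}(v)|=|v|^{p-1} together with (p-1)q=p:

\|j_{p}(v)\|_{L^{q}(\rho;\mathbb{R}^{d})}^{q}=\int_{\mathbb{R}^{d}}|v|^{(p-1)q}d\rho=\int_{\mathbb{R}^{d}}|v|^{p}d\rho\quad\text{and}\quad\langle j_{p}(v),v\rangle=|v|^{p-2}\langle v,v\rangle=|v|^{p}.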

We will also use the following lemma.

Lemma 5.1.

Let ρ𝒫\rho\in\mathcal{P} be a reference measure. For any hH˙1,p(ρ)h\in\dot{H}^{-1,p}(\rho), there exists a unique vector field E=E(h)Lp(ρ;d)E=E(h)\in L^{p}(\rho;\mathbb{R}^{d}) such that

{dφ,E𝑑ρ=h(φ),φC0,jp(E){φ:φC0}¯Lq(ρ;d).\begin{cases}&\int_{\mathbb{R}^{d}}\langle\nabla\varphi,E\rangle d\rho=h(\varphi),\quad\forall\varphi\in C_{0}^{\infty},\\ &j_{p}(E)\in\overline{\{\nabla\varphi:\varphi\in C_{0}^{\infty}\}}^{L^{q}(\rho;\mathbb{R}^{d})}.\end{cases} (15)

The map hE(h)h\mapsto E(h) is homogeneous (i.e., E(ah)=aE(h)E(ah)=aE(h) for all aa\in\mathbb{R} and hH˙1,p(ρ)h\in\dot{H}^{-1,p}(\rho)) and such that E(h)Lp(ρ;d)=hH˙1,p(ρ)\|E(h)\|_{L^{p}(\rho;\mathbb{R}^{d})}=\|h\|_{\dot{H}^{-1,p}(\rho)} for all hH˙1,p(ρ)h\in\dot{H}^{-1,p}(\rho). If p=2p=2, then the map hE(h)h\mapsto E(h) is a linear isometry from H˙1,2(ρ)\dot{H}^{-1,2}(\rho) into L2(ρ;d)L^{2}(\rho;\mathbb{R}^{d}).

The proof of Lemma 5.1 in turn relies on the following existence result for minimizers in reflexive Banach spaces. We provide its proof for the sake of completeness.

Lemma 5.2.

Let (V,)(V,\|\cdot\|) be a reflexive real Banach space, and let J:V{+}J:V\to\mathbb{R}\cup\{+\infty\} (J+J\not\equiv+\infty) be weakly lower semicontinuous (i.e., J(v)lim infnJ(vn)J(v)\leq\liminf_{n}J(v_{n}) for any vnvv_{n}\to v weakly) and coercive (i.e., J(v)J(v)\to\infty as v\|v\|\to\infty). Then there exists v0Vv_{0}\in V such that J(v0)=infvVJ(v)J(v_{0})=\inf_{v\in V}J(v).

Proof of Lemma 5.2.

Let vnVv_{n}\in V be such that J(vn)infvVJ(v)=:J¯J(v_{n})\to\inf_{v\in V}J(v)=:\underline{J}. By coercivity, vnv_{n} is bounded, so by reflexivity and the Banach-Alaoglu theorem, there exists a weakly convergent subsequence vnkv_{n_{k}} such that vnkv0v_{n_{k}}\to v_{0} weakly. Since JJ is weakly lower semicontinuous, we conclude J(v0)lim infkJ(vnk)=J¯J(v_{0})\leq\liminf_{k}J(v_{n_{k}})=\underline{J}. ∎

We turn to the proof of Lemma 5.1, which is inspired by the first part of the proof of Theorem 8.3.1 in [AGS08].

Proof of Lemma 5.1.

Let VV denote the closure in Lq(ρ;d)L^{q}(\rho;\mathbb{R}^{d}) of the subspace V0={φ:φC0}V_{0}=\{\nabla\varphi:\varphi\in C_{0}^{\infty}\}. Endowing VV with Lq(ρ;d)\|\cdot\|_{L^{q}(\rho;\mathbb{R}^{d})} gives a reflexive Banach space because any closed subspace of a reflexive Banach space is reflexive. Define the linear functional L:V0L:V_{0}\to\mathbb{R} by L(φ):=h(φ)L(\nabla\varphi):=h(\varphi). To see that LL is well-defined, observe that

|h(φ)|φH˙1,q(ρ)hH˙1,p(ρ)=φLq(ρ;d)hH˙1,p(ρ).\begin{split}|h(\varphi)|&\leq\|\varphi\|_{\dot{H}^{1,q}(\rho)}\|h\|_{\dot{H}^{-1,p}(\rho)}\\ &=\|\nabla\varphi\|_{L^{q}(\rho;\mathbb{R}^{d})}\|h\|_{\dot{H}^{-1,p}(\rho)}.\end{split}

This also shows that LL can be extended to a bounded linear functional on VV.

Consider the optimization problem

minvVJ(v)withJ(v):=1qd|v|q𝑑ρL(v).\min_{v\in V}J(v)\quad\text{with}\ \ J(v):=\frac{1}{q}\int_{\mathbb{R}^{d}}|v|^{q}d\rho-L(v). (16)

The functional JJ is finite, weakly lower semicontinuous, and coercive. By Lemma 5.2 there exists a solution v0v_{0} to the optimization problem (16). Further, the functional JJ is Gâteaux differentiable with derivative

J(v;w):=limt0J(v+tw)J(v)t=dw,jq(v)𝑑ρL(w).J^{\prime}(v;w):=\lim_{t\to 0}\frac{J(v+tw)-J(v)}{t}=\int_{\mathbb{R}^{d}}\langle w,j_{q}(v)\rangle d\rho-L(w).

Since v_{0} minimizes J over the linear space V, we have J^{\prime}(v_{0};w)=0 for all w\in V. Thus, for E=j_{q}(v_{0}), we have \int_{\mathbb{R}^{d}}\langle\nabla\varphi,E\rangle d\rho=L(\nabla\varphi) for all \varphi\in C_{0}^{\infty} and j_{p}(E)=v_{0}\in V.

To show uniqueness of EE, pick another vector field ELp(ρ;d)E^{\prime}\in L^{p}(\rho;\mathbb{R}^{d}) satisfying (15). Then, jp(E)Vj_{p}(E^{\prime})\in V satisfies J(jp(E);w)=0J^{\prime}(j_{p}(E^{\prime});w)=0 for all wVw\in V, so by convexity of JJ, jp(E)j_{p}(E^{\prime}) is another optimal solution to (16). However, since JJ is strictly convex, the optimal solution to (16) is unique, so that jp(E)=jp(E)j_{p}(E^{\prime})=j_{p}(E), i.e., E=EE^{\prime}=E.

Now, the map hE(h)h\mapsto E(h) is homogeneous, as aE(h)aE(h) clearly satisfies the first equation in (15) for hh replaced with ahah and jp(aE(h))=|a|p2ajp(E(h))Vj_{p}(aE(h))=|a|^{p-2}aj_{p}(E(h))\in V. Further, as jp(E(h)){φ:φC0}¯Lq(ρ;d)j_{p}(E(h))\in\overline{\{\nabla\varphi:\varphi\in C_{0}^{\infty}\}}^{L^{q}(\rho;\mathbb{R}^{d})} by construction, it also satisfies

E(h)Lp(ρ;d)=sup{dφ,E(h)𝑑ρ:φC0,φLq(ρ;d)1}=hH˙1,p(ρ).\|E(h)\|_{L^{p}(\rho;\mathbb{R}^{d})}=\sup\left\{\int_{\mathbb{R}^{d}}\langle\nabla\varphi,E(h)\rangle d\rho:\varphi\in C_{0}^{\infty},\|\nabla\varphi\|_{L^{q}(\rho;\mathbb{R}^{d})}\leq 1\right\}=\|h\|_{\dot{H}^{-1,p}(\rho)}.

Finally, if p=2p=2, then j2(v)=vj_{2}(v)=v, so it is clear that the map hE(h)h\mapsto E(h) is linear. ∎

We are now ready to prove Lemma 3.2.

Proof of Lemma 3.2.

Let μt=μ+th1\mu_{t}=\mu+th_{1} and νt=μ+th2\nu_{t}=\mu+th_{2} for t[0,1]t\in[0,1]. For notational convenience, let h=h1h2Dμ{finite signed Borel measures}h=h_{1}-h_{2}\in D_{\mu}\cap\{\text{finite signed Borel measures}\}. We will first show that

lim inft0𝖶p(μt,νt)th1h2H˙1,p(μ0).\liminf_{t\downarrow 0}\frac{\mathsf{W}_{p}(\mu_{t},\nu_{t})}{t}\geq\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\mu_{0})}.

The proof is inspired by Theorem 7.26 in [Vil03]. Observe that for any φC0\varphi\in C_{0}^{\infty} and t>0t>0,

h(φ)=dφ𝑑h=dφd(μtνtt)=1tdφd(μtνt).h(\varphi)=\int_{\mathbb{R}^{d}}\varphi dh=\int_{\mathbb{R}^{d}}\varphi d\left(\frac{\mu_{t}-\nu_{t}}{t}\right)=\frac{1}{t}\int_{\mathbb{R}^{d}}\varphi d(\mu_{t}-\nu_{t}).

Let πtΠ(μt,νt)\pi_{t}\in\Pi(\mu_{t},\nu_{t}) be an optimal coupling for 𝖶pp(μt,νt)\mathsf{W}_{p}^{p}(\mu_{t},\nu_{t}), i.e., 𝖶pp(μt,νt)=|xy|p𝑑πt(x,y)\mathsf{W}_{p}^{p}(\mu_{t},\nu_{t})=\iint|x-y|^{p}d\pi_{t}(x,y). Then

1tdφd(μtνt)=1td×d{φ(x)φ(y)}𝑑πt(x,y).\frac{1}{t}\int_{\mathbb{R}^{d}}\varphi d(\mu_{t}-\nu_{t})=\frac{1}{t}\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\{\varphi(x)-\varphi(y)\}d\pi_{t}(x,y).

Since φ\varphi is smooth and compactly supported, there exists a constant C=Cφ,p<C=C_{\varphi,p}<\infty such that

φ(x)φ(y)φ(y),xy+C|xy|2p,x,yd.\varphi(x)-\varphi(y)\leq\langle\nabla\varphi(y),x-y\rangle+C|x-y|^{2\wedge p},\quad\forall x,y\in\mathbb{R}^{d}.

Indeed, for p2p\geq 2, we can take C=C1:=supxd2φ(x)op/2C=C_{1}:=\sup_{x\in\mathbb{R}^{d}}\|\nabla^{2}\varphi(x)\|_{\mathrm{op}}/2 (here op\|\cdot\|_{\mathrm{op}} denotes the operator norm for matrices). For 1<p<21<p<2, we have

φ(x)φ(y)φ(y),xy+C1C22p|xy|p,x,yS:=supp(φ)\varphi(x)-\varphi(y)\leq\langle\nabla\varphi(y),x-y\rangle+C_{1}C_{2}^{2-p}|x-y|^{p},\ \forall x,y\in S:=\operatorname{supp}(\varphi)

with C2:=sup{|xy|:x,yS}C_{2}:=\sup\{|x-y|:x,y\in S\}. Here supp(φ)\operatorname{supp}(\varphi) denotes the support of φ\varphi, supp(φ):={φ0}¯\operatorname{supp}(\varphi):=\overline{\{\varphi\neq 0\}}. If xSx\in S and d(y,S):=inf{|yz|:zS}>1d(y,S):=\inf\{|y-z|:z\in S\}>1, then φ(x)/|xy|pφ\varphi(x)/|x-y|^{p}\leq\|\varphi\|_{\infty}, so that we have

φ(x)φ(y)=φ(x)(φC1C32p)|xy|p,xS,ySc\varphi(x)-\varphi(y)=\varphi(x)\leq(\|\varphi\|_{\infty}\vee C_{1}C_{3}^{2-p})|x-y|^{p},\ \forall x\in S,y\in S^{c}

with C3:=sup{|xy|:xS,d(y,S)1}<C_{3}:=\sup\{|x-y|:x\in S,d(y,S)\leq 1\}<\infty. Finally, if d(x,S)>1d(x,S)>1 and ySy\in S, then

φ(y)φ(y),xy|xy|pφ|xy|p+φ|xy|1pφ+φ,\frac{-\varphi(y)-\langle\nabla\varphi(y),x-y\rangle}{|x-y|^{p}}\leq\|\varphi\|_{\infty}|x-y|^{-p}+\|\nabla\varphi\|_{\infty}|x-y|^{1-p}\leq\|\varphi\|_{\infty}+\|\nabla\varphi\|_{\infty},

so that we have

φ(x)φ(y)=φ(y)φ(y),xy+((φ+φ)C1C42p)|xy|p,xSc,yS\begin{split}&\varphi(x)-\varphi(y)=-\varphi(y)\\ &\qquad\leq\langle\nabla\varphi(y),x-y\rangle+\left((\|\varphi\|_{\infty}+\|\nabla\varphi\|_{\infty})\bigvee C_{1}C_{4}^{2-p}\right)|x-y|^{p},\ \forall x\in S^{c},y\in S\end{split}

with C4:=sup{|xy|:d(x,S)1,yS}<C_{4}:=\sup\{|x-y|:d(x,S)\leq 1,y\in S\}<\infty.

Now, we have

1td×d{φ(x)φ(y)}𝑑πt(x,y)1t{d×dφ(y),xy𝑑πt(x,y)+Cd×d|xy|2p𝑑πt(x,y)}1t[d×dφ(y),xy𝑑πt(x,y)+C{d×d|xy|p𝑑πt(x,y)}2/(2p)]=1t{d×dφ(y),xy𝑑πt(x,y)+C𝖶p2p(μt,νt)}.\begin{split}&\frac{1}{t}\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\{\varphi(x)-\varphi(y)\}d\pi_{t}(x,y)\\ &\leq\frac{1}{t}\left\{\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\langle\nabla\varphi(y),x-y\rangle d\pi_{t}(x,y)+C\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}|x-y|^{2\wedge p}d\pi_{t}(x,y)\right\}\\ &\leq\frac{1}{t}\left[\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\langle\nabla\varphi(y),x-y\rangle d\pi_{t}(x,y)+C\left\{\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}|x-y|^{p}d\pi_{t}(x,y)\right\}^{2/(2\vee p)}\right]\\ &=\frac{1}{t}\left\{\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\langle\nabla\varphi(y),x-y\rangle d\pi_{t}(x,y)+C\mathsf{W}_{p}^{2\wedge p}(\mu_{t},\nu_{t})\right\}.\end{split}

Applying Proposition 2.1 with ρ=μ\rho=\mu, we know that 𝖶p(μt,νt)𝖶p(μt,μ)+𝖶p(μ,νt)pt(h1H˙1,p(μ)+h2H˙1,p(μ))=O(t)\mathsf{W}_{p}(\mu_{t},\nu_{t})\leq\mathsf{W}_{p}(\mu_{t},\mu)+\mathsf{W}_{p}(\mu,\nu_{t})\leq pt(\|h_{1}\|_{\dot{H}^{-1,p}(\mu)}+\|h_{2}\|_{\dot{H}^{-1,p}(\mu)})=O(t) as t0t\downarrow 0, so that 𝖶p2p(μt,νt)=O(t2p)=o(t)\mathsf{W}_{p}^{2\wedge p}(\mu_{t},\nu_{t})=O(t^{2\wedge p})=o(t) as t0t\downarrow 0. Further, by Hölder’s inequality, with qq being the conjugate index of pp, we have

d×dφ(y),xy𝑑πt(x,y)φLq(νt;d){d×d|xy|p𝑑πt(x,y)}1/p=𝖶p(μt,νt).\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\langle\nabla\varphi(y),x-y\rangle d\pi_{t}(x,y)\leq\|\nabla\varphi\|_{L^{q}(\nu_{t};\mathbb{R}^{d})}\underbrace{\left\{\iint_{\mathbb{R}^{d}\times\mathbb{R}^{d}}|x-y|^{p}d\pi_{t}(x,y)\right\}^{1/p}}_{=\mathsf{W}_{p}(\mu_{t},\nu_{t})}.

Here

φLq(νt;d)q=d|φ|q𝑑μ+td|φ|q𝑑h2=φLq(μ;d)q+O(t),t0.\|\nabla\varphi\|_{L^{q}(\nu_{t};\mathbb{R}^{d})}^{q}=\int_{\mathbb{R}^{d}}|\nabla\varphi|^{q}d\mu+t\int_{\mathbb{R}^{d}}|\nabla\varphi|^{q}dh_{2}=\|\nabla\varphi\|_{L^{q}(\mu;\mathbb{R}^{d})}^{q}+O(t),\ t\downarrow 0.

Conclude that

h(φ)φLq(μ;d)lim inft0𝖶p(μt,νt)t,h(\varphi)\leq\|\nabla\varphi\|_{L^{q}(\mu;\mathbb{R}^{d})}\liminf_{t\downarrow 0}\frac{\mathsf{W}_{p}(\mu_{t},\nu_{t})}{t},

that is,

lim inft0𝖶p(μt,νt)tsup{h(φ):φC0,φLq(μ;d)1}=hH˙1,p(μ).\liminf_{t\downarrow 0}\frac{\mathsf{W}_{p}(\mu_{t},\nu_{t})}{t}\geq\sup\left\{h(\varphi):\varphi\in C_{0}^{\infty},\|\nabla\varphi\|_{L^{q}(\mu;\mathbb{R}^{d})}\leq 1\right\}=\|h\|_{\dot{H}^{-1,p}(\mu)}.

To prove the reverse inequality, let 𝔥E(𝔥)\mathfrak{h}\mapsto E(\mathfrak{h}) be the map from H˙1,p(μ)\dot{H}^{-1,p}(\mu) into Lp(μ;d)L^{p}(\mu;\mathbb{R}^{d}) given in Lemma 5.1. Let ft1=dμt/dμ=1+tdh1/dμf_{t}^{1}=d\mu_{t}/d\mu=1+tdh_{1}/d\mu. Since μ1=μ+h1\mu_{1}=\mu+h_{1} is a probability measure, we have 1+dh1/dμ01+dh_{1}/d\mu\geq 0, i.e., dh1/dμ1dh_{1}/d\mu\geq-1, so that ft11/2f_{t}^{1}\geq 1/2 for t[0,1/2]t\in[0,1/2]. Likewise, ft2:=dνt/dμ1/2f_{t}^{2}:=d\nu_{t}/d\mu\geq 1/2 for t[0,1/2]t\in[0,1/2].

Fix t\in[0,1/2] and consider the curve \rho_{s}=(1-s)\mu_{t}+s\nu_{t}=\mu_{t}-sth for s\in[0,1]. Then \rho_{s} satisfies the continuity equation (14) with v_{s}=E(-th)/\big{(}(1-s)f_{t}^{1}+sf_{t}^{2}\big{)}. By Theorem 5.1 (Theorem 8.3.1 in [AGS08]), we have

𝖶p(μt,νt)01vsLp(ρs;d)𝑑s=01(d|E(th)|p[(1s)ft1+sft2]p1𝑑μ)1/p𝑑s.\mathsf{W}_{p}(\mu_{t},\nu_{t})\leq\int_{0}^{1}\|v_{s}\|_{L^{p}(\rho_{s};\mathbb{R}^{d})}ds=\int_{0}^{1}\left(\int_{\mathbb{R}^{d}}\frac{|E(-th)|^{p}}{\big{[}(1-s)f_{t}^{1}+sf_{t}^{2}\big{]}^{p-1}}d\mu\right)^{1/p}ds.

Since E(th)=tE(h)E(-th)=-tE(h) by homogeneity and 1/2fti11/2\leq f_{t}^{i}\to 1 as t0t\downarrow 0, the dominated convergence theorem yields that, as t0t\downarrow 0,

𝖶p(μt,νt)t01(d|E(h)|p[(1s)ft1+sft2]p1𝑑μ)1/p𝑑s=E(h)Lp(μ;d)+o(1)=hH˙1,p(μ)+o(1).\begin{split}\frac{\mathsf{W}_{p}(\mu_{t},\nu_{t})}{t}&\leq\int_{0}^{1}\left(\int_{\mathbb{R}^{d}}\frac{|E(h)|^{p}}{\big{[}(1-s)f_{t}^{1}+sf_{t}^{2}\big{]}^{p-1}}d\mu\right)^{1/p}ds\\ &=\|E(h)\|_{L^{p}(\mu;\mathbb{R}^{d})}+o(1)\\ &=\|h\|_{\dot{H}^{-1,p}(\mu)}+o(1).\end{split}

This completes the proof. ∎

5.2.2. Proof of Proposition 3.3

Pick arbitrary (h_{1},h_{2})\in T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0), t_{n}\downarrow 0, and (h_{n,1},h_{n,2})\to(h_{1},h_{2}) in D_{\mu}\times D_{\mu} such that (t_{n}h_{n,1},t_{n}h_{n,2})\in\Xi_{\mu}\times\Xi_{\mu}. By density, for any \epsilon>0, there exist c>0 and \rho_{i}\in\mathcal{P}_{p} for i=1,2 such that \|h_{i}-\tilde{h}_{i}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}<\epsilon for \tilde{h}_{i}=c(\rho_{i}-\mu)*\gamma_{\sigma}. By scaling, Lemma 3.2 holds with (h_{1},h_{2}) replaced by (\tilde{h}_{1},\tilde{h}_{2}). Assume without loss of generality that n is large enough such that \|h_{n,i}-h_{i}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}<\epsilon for i=1,2 and ct_{n}\leq 1/2. The density of \mu*\gamma_{\sigma}+t_{n}\tilde{h}_{i}=\big{(}(1-ct_{n})\mu+ct_{n}\rho_{i}\big{)}*\gamma_{\sigma} w.r.t. \mu*\gamma_{\sigma} satisfies

d(μγσ+tnh~i)d(μγσ)(1ctn)12,i=1,2.\frac{d(\mu*\gamma_{\sigma}+t_{n}\tilde{h}_{i})}{d(\mu*\gamma_{\sigma})}\geq(1-ct_{n})\geq\frac{1}{2},\ i=1,2.

Thus, by Proposition 2.1, we have

|Φ(tnhn,1,tnhn,2)tnΦ(tnh~1,tnh~2)tn|i=12𝖶p(μγσ+tnhn,i,μγσ+tnh~i)tni=12hn,ih~iH˙1,p(μγσ)i=12(hn,ihiH˙1,p(μγσ)+hih~iH˙1,p(μγσ))<4ϵ.\begin{split}&\left|\frac{\Phi(t_{n}h_{n,1},t_{n}h_{n,2})}{t_{n}}-\frac{\Phi(t_{n}\tilde{h}_{1},t_{n}\tilde{h}_{2})}{t_{n}}\right|\\ &\leq\sum_{i=1}^{2}\frac{\mathsf{W}_{p}(\mu*\gamma_{\sigma}+t_{n}h_{n,i},\mu*\gamma_{\sigma}+t_{n}\tilde{h}_{i})}{t_{n}}\\ &\lesssim\sum_{i=1}^{2}\|h_{n,i}-\tilde{h}_{i}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\\ &\leq\sum_{i=1}^{2}\big{(}\|h_{n,i}-h_{i}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\|h_{i}-\tilde{h}_{i}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\big{)}\\ &<4\epsilon.\end{split}

Further,

|h1h2H˙1,p(μγσ)h~1h~2H˙1,p(μγσ)|i=12hih~iH˙1,p(μγσ)<2ϵ.\big{|}\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}-\|\tilde{h}_{1}-\tilde{h}_{2}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\big{|}\leq\sum_{i=1}^{2}\|h_{i}-\tilde{h}_{i}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}<2\epsilon.

Thus, using the result of Lemma 3.2, we conclude that

lim supn|Φ(tnhn,1,tnhn,2)tnh1h2H˙1,p(μγσ)|ϵ.\limsup_{n\to\infty}\left|\frac{\Phi(t_{n}h_{n,1},t_{n}h_{n,2})}{t_{n}}-\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\right|\lesssim\epsilon.

Since ϵ>0\epsilon>0 is arbitrary, we obtain the desired conclusion. ∎

5.3. Proofs for Section 3.3

5.3.1. Proof of Lemma 3.3

The proof of Lemma 3.3 relies on the following technical lemma concerning regularity of optimal transport potentials. Recall that any locally Lipschitz function on d\mathbb{R}^{d} is differentiable a.e. by the Rademacher theorem (cf. [EG18]). Here and in what follows a.e. is taken w.r.t. the Lebesgue measure.

Lemma 5.3 (Regularity of optimal transport potential).

Let 1<p<1<p<\infty. Suppose that μ𝒫p\mu\in\mathcal{P}_{p} and ν𝒫\nu\in\mathcal{P} is β\beta-sub-Weibull for some β(0,2]\beta\in(0,2]. Let gg be an optimal transport potential from μγσ\mu*\gamma_{\sigma} to ν\nu for 𝖶pp\mathsf{W}_{p}^{p}. Then there exists a constant CC that depends only on p,d,σ,βp,d,\sigma,\beta, upper bounds on 𝔼μ[|X|]\operatorname{\mathbb{E}}_{\mu}[|X|] and |Y|ψβ\||Y|\|_{\psi_{\beta}} for YνY\sim\nu, and a lower bound on ϕσ𝑑μ\int\phi_{\sigma}d\mu, such that

{g is locally Lipschitz,|g(x)g(0)|C(1+|x|2pβ)|x|,xd,|g(x)|C(1+|x|2pβ)for a.e.xd.\begin{cases}&\text{$g$ is locally Lipschitz},\\ &|g(x)-g(0)|\leq C(1+|x|^{\frac{2p}{\beta}})|x|,\ \forall x\in\mathbb{R}^{d},\\ &|\nabla g(x)|\leq C(1+|x|^{\frac{2p}{\beta}})\ \text{for a.e.}\ x\in\mathbb{R}^{d}.\end{cases}

The proof of Lemma 5.3 borrows ideas from Lemmas 9 and 10 and Theorem 11 in the recent work by [MNW21], which in turn build on [GM96, CF21].

Proof of Lemma 5.3.

By Theorem 11 in [MNW21], there exists a constant C1C_{1} depending only on p,d,βp,d,\beta and an upper bound on |Y|ψβ\||Y|\|_{\psi_{\beta}} for YνY\sim\nu, such that

supycg(x)|y|C1{(|x|+1)pp1supy:|xy|2[log(1(μγσ)(By))]pβ(p1)},xd,\sup_{y\in\partial^{c}g(x)}|y|\leq C_{1}\left\{(|x|+1)^{\frac{p}{p-1}}\bigvee\sup_{y:|x-y|\leq 2}\left[\log\left(\frac{1}{(\mu*\gamma_{\sigma})(B_{y})}\right)\right]^{\frac{p}{\beta(p-1)}}\right\},\quad x\in\mathbb{R}^{d},

where cg(x)={yd:c(z,y)g(z)c(x,y)g(x),zd}\partial^{c}g(x)=\{y\in\mathbb{R}^{d}:c(z,y)-g(z)\geq c(x,y)-g(x),\ \forall z\in\mathbb{R}^{d}\} is the cc–superdifferential of gg at xx for the cost function c(x,y)=|xy|pc(x,y)=|x-y|^{p}, and By=B(y,1)={xd:|xy|1}B_{y}=B(y,1)=\{x\in\mathbb{R}^{d}:|x-y|\leq 1\}.

Next, by Proposition 2 in [PW16], μγσ\mu*\gamma_{\sigma} has Lebesgue density fμf_{\mu} that is (c1,c2)(c_{1},c_{2})-regular with c1=3/σ2c_{1}=3/\sigma^{2} and c2=4𝔼μ[|X|]/σ2c_{2}=4\operatorname{\mathbb{E}}_{\mu}[|X|]/\sigma^{2}, i.e.,

|logfμ(x)|c1|x|+c2,xd.\big{|}\nabla\log f_{\mu}(x)\big{|}\leq c_{1}|x|+c_{2},\quad\forall x\in\mathbb{R}^{d}.

From the proof of Lemma 10 in [MNW21], we have

fμ(x)ec22fμ(0)e(1+c1)|x|2,xd.f_{\mu}(x)\geq e^{-c_{2}^{2}}f_{\mu}(0)e^{-(1+c_{1})|x|^{2}},\quad\forall x\in\mathbb{R}^{d}. (17)
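
For completeness, (17) can be recovered from the regularity bound above by integrating \nabla\log f_{\mu} along the segment from 0 to x; a minimal sketch, using c_{2}|x|\leq c_{2}^{2}/2+|x|^{2}/2:

\log f_{\mu}(x)-\log f_{\mu}(0)=\int_{0}^{1}\langle\nabla\log f_{\mu}(sx),x\rangle\,ds\geq-\int_{0}^{1}(c_{1}s|x|+c_{2})|x|\,ds=-\frac{c_{1}}{2}|x|^{2}-c_{2}|x|\geq-\frac{c_{2}^{2}}{2}-\frac{1+c_{1}}{2}|x|^{2},

and exponentiating gives a bound at least as strong as (17).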

Thus, whenever |xy|2|x-y|\leq 2,

(μγσ)(By)=Byfμ(z)𝑑zinfzByfμ(z)×By𝑑zc3ec22fμ(0)e2(1+c1)(|x|2+9),(\mu*\gamma_{\sigma})(B_{y})=\int_{B_{y}}f_{\mu}(z)dz\geq\inf_{z\in B_{y}}f_{\mu}(z)\times\int_{B_{y}}dz\geq c_{3}e^{-c_{2}^{2}}f_{\mu}(0)e^{-2(1+c_{1})(|x|^{2}+9)},

where c3c_{3} is a constant that depends only on dd. Conclude that there exists a constant C2C_{2} depending only on p,d,σ,βp,d,\sigma,\beta, upper bounds on 𝔼μ[|X|]\operatorname{\mathbb{E}}_{\mu}[|X|] and |Y|ψβ\||Y|\|_{\psi_{\beta}} for YνY\sim\nu, and a lower bound on fμ(0)f_{\mu}(0), such that

supycg(x)|y|C2(1+|x|2pβ(p1)),xd.\sup_{y\in\partial^{c}g(x)}|y|\leq C_{2}(1+|x|^{\frac{2p}{\beta(p-1)}}),\quad\forall x\in\mathbb{R}^{d}.

The rest of the proof mirrors the latter half of the proof of Lemma 9 in [MNW21]. Since gL1(μγσ)g\in L^{1}(\mu*\gamma_{\sigma}) and μγσ\mu*\gamma_{\sigma} is equivalent to the Lebesgue measure (i.e., μγσdx\mu*\gamma_{\sigma}\ll dx and dxμγσdx\ll\mu*\gamma_{\sigma}), g(x)>g(x)>-\infty for a.e. xdx\in\mathbb{R}^{d}. Since any open convex set in d\mathbb{R}^{d} agrees with the interior of its closure (cf. Proposition 6.2.10 in [Dud02]), the convex hull of {x:g(x)>}\{x:g(x)>-\infty\} agrees with d\mathbb{R}^{d}. Thus, by Lemma 2.1 (ii) (Theorem 3.3 in [GM96]), gg is locally Lipschitz on d\mathbb{R}^{d}. Further, by Proposition C.4 in [GM96], cg(x)\partial^{c}g(x) is nonempty for all xdx\in\mathbb{R}^{d}. For any xdx\in\mathbb{R}^{d} and ycg(x)y\in\partial^{c}g(x),

g(x)=c(x,y)gc(y).g(x)=c(x,y)-g^{c}(y).

Thus, for any xdx^{\prime}\in\mathbb{R}^{d},

g(x)g(x)c(x,y)gc(y)[c(x,y)gc(y)]=c(x,y)c(x,y)=|xy|p|xy|pp(|xy|p1|xy|p1)|xx|C3[1+(|x||x|)2pβ]|xx|,\begin{split}g(x^{\prime})-g(x)&\leq c(x^{\prime},y)-g^{c}(y)-[c(x,y)-g^{c}(y)]\\ &=c(x^{\prime},y)-c(x,y)\\ &=|x^{\prime}-y|^{p}-|x-y|^{p}\\ &\leq p(|x-y|^{p-1}\vee|x^{\prime}-y|^{p-1})|x-x^{\prime}|\\ &\leq C_{3}\big{[}1+(|x|\vee|x^{\prime}|)^{\frac{2p}{\beta}}\big{]}|x-x^{\prime}|,\end{split}

where C3C_{3} depends only on p,β,C2p,\beta,C_{2}. Interchanging xx and xx^{\prime}, we conclude that

|g(x)g(x)|C3[1+(|x||x|)2pβ]|xx|,x,xd,|g(x)-g(x^{\prime})|\leq C_{3}\big{[}1+(|x|\vee|x^{\prime}|)^{\frac{2p}{\beta}}\big{]}|x-x^{\prime}|,\ x,x^{\prime}\in\mathbb{R}^{d}, (18)

which implies the desired conclusion. ∎
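
For completeness, the elementary bound |x'-y|^{p}-|x-y|^{p}\leq p(|x-y|^{p-1}\vee|x'-y|^{p-1})|x-x'| used in the display preceding (18) (and again, in a similar form, in the proof of Proposition 3.5 below) is a consequence of the mean value theorem; a minimal sketch: for a,b\geq 0,

|b^{p}-a^{p}|=p\,\xi^{p-1}|b-a|\leq p\,(a\vee b)^{p-1}|b-a|\quad\text{for some }\xi\in[a\wedge b,\,a\vee b],

applied with a=|x-y| and b=|x'-y|, together with the reverse triangle inequality ||x'-y|-|x-y||\leq|x-x'|.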

Proof of Lemma 3.3.

Let μt=(μ+t(ρμ))γσ=(1t)μγσ+tργσ\mu_{t}=(\mu+t(\rho-\mu))*\gamma_{\sigma}=(1-t)\mu*\gamma_{\sigma}+t\rho*\gamma_{\sigma} for t[0,1]t\in[0,1], and let gtg_{t} be an optimal transport potential from μt\mu_{t} to ν\nu. Without loss of generality, we may normalize gtg_{t} in such a way that gt(0)=0g_{t}(0)=0 for t[0,1]t\in[0,1].

We will apply Lemma 5.3 with (μ,ν)(\mu,\nu) replaced with ((1t)μ+tρ,ν)\big{(}(1-t)\mu+t\rho,\nu\big{)} for t[0,1/2]t\in[0,1/2]. It is not difficult to see that, as long as t[0,1/2]t\in[0,1/2],

𝔼(1t)μ+tρ[|X|]𝔼μ[|X|]+𝔼ρ[|X|]anddϕσd((1t)μ+tρ)12dϕσ𝑑μ.\operatorname{\mathbb{E}}_{(1-t)\mu+t\rho}[|X|]\leq\operatorname{\mathbb{E}}_{\mu}[|X|]+\operatorname{\mathbb{E}}_{\rho}[|X|]\quad\text{and}\quad\int_{\mathbb{R}^{d}}\phi_{\sigma}d\big{(}(1-t)\mu+t\rho\big{)}\geq\frac{1}{2}\int_{\mathbb{R}^{d}}\phi_{\sigma}d\mu.

Thus, by Lemma 5.3, there exist constants CC and KK independent of tt such that for every t[0,1/2]t\in[0,1/2],

{gt is locally Lipschitz,|gt(x)|C(1+|x|K)|x|,xd,|gt(x)|C(1+|x|K)for a.e.xd.\begin{cases}&\text{$g_{t}$ is locally Lipschitz},\\ &|g_{t}(x)|\leq C(1+|x|^{K})|x|,\ \forall x\in\mathbb{R}^{d},\\ &|\nabla g_{t}(x)|\leq C(1+|x|^{K})\ \text{for a.e.}\ x\in\mathbb{R}^{d}.\end{cases}

By duality (Lemma 2.1 (i)), we have with h=(ρμ)γσh=(\rho-\mu)*\gamma_{\sigma},

𝖶pp(μt,ν)dg0𝑑μt+dg0c𝑑ν=dg0𝑑μ0+dg0c𝑑ν+tdg0𝑑h=𝖶pp(μ0,ν)+tdg0𝑑h,\begin{split}\mathsf{W}_{p}^{p}(\mu_{t},\nu)&\geq\int_{\mathbb{R}^{d}}g_{0}d\mu_{t}+\int_{\mathbb{R}^{d}}g_{0}^{c}d\nu\\ &=\int_{\mathbb{R}^{d}}g_{0}d\mu_{0}+\int_{\mathbb{R}^{d}}g_{0}^{c}d\nu+t\int_{\mathbb{R}^{d}}g_{0}dh\\ &=\mathsf{W}_{p}^{p}(\mu_{0},\nu)+t\int_{\mathbb{R}^{d}}g_{0}dh,\end{split}

so that

lim inft0𝖶pp(μt,ν)𝖶pp(μ0,ν)tdg0𝑑h.\liminf_{t\downarrow 0}\frac{\mathsf{W}_{p}^{p}(\mu_{t},\nu)-\mathsf{W}_{p}^{p}(\mu_{0},\nu)}{t}\geq\int_{\mathbb{R}^{d}}g_{0}dh.

Second, by construction,

𝖶pp(μt,ν)=dgt𝑑μt+dgtc𝑑ν=dgt𝑑μ0+dgtc𝑑ν+tdgt𝑑hdg0𝑑μ0+dg0c𝑑ν+tdgt𝑑h=𝖶pp(μ0,ν)+tdgt𝑑h.\begin{split}\mathsf{W}_{p}^{p}(\mu_{t},\nu)&=\int_{\mathbb{R}^{d}}g_{t}d\mu_{t}+\int_{\mathbb{R}^{d}}g_{t}^{c}d\nu\\ &=\int_{\mathbb{R}^{d}}g_{t}d\mu_{0}+\int_{\mathbb{R}^{d}}g_{t}^{c}d\nu+t\int_{\mathbb{R}^{d}}g_{t}dh\\ &\leq\int_{\mathbb{R}^{d}}g_{0}d\mu_{0}+\int_{\mathbb{R}^{d}}g_{0}^{c}d\nu+t\int_{\mathbb{R}^{d}}g_{t}dh\\ &=\mathsf{W}_{p}^{p}(\mu_{0},\nu)+t\int_{\mathbb{R}^{d}}g_{t}dh.\end{split}

Pick any tn0t_{n}\downarrow 0. Since μ0=μγσdx\mu_{0}=\mu*\gamma_{\sigma}\ll dx, μ0\mu_{0} has full support d\mathbb{R}^{d}, and μtnwμ0\mu_{t_{n}}\stackrel{{\scriptstyle w}}{{\to}}\mu_{0}, we have by Theorem 3.4 in [dBGSL21b] that there exists some sequence of constants ana_{n} such that gtnang0g_{t_{n}}-a_{n}\to g_{0} pointwise. Since we have normalized gtg_{t} in such a way that gt(0)=0g_{t}(0)=0, we have an0a_{n}\to 0, i.e., gtng0g_{t_{n}}\to g_{0} pointwise. Further, since |gt(x)|C(1+|x|K)|x||g_{t}(x)|\leq C(1+|x|^{K})|x| for all t[0,1/2]t\in[0,1/2], the dominated convergence theorem yields that

dgtn𝑑hdg0𝑑h.\int_{\mathbb{R}^{d}}g_{t_{n}}dh\to\int_{\mathbb{R}^{d}}g_{0}dh.

Conclude that

lim supn𝖶pp(μtn,ν)𝖶pp(μ0,ν)tndg0𝑑h.\limsup_{n\to\infty}\frac{\mathsf{W}_{p}^{p}(\mu_{t_{n}},\nu)-\mathsf{W}_{p}^{p}(\mu_{0},\nu)}{t_{n}}\leq\int_{\mathbb{R}^{d}}g_{0}dh.

This completes the proof. ∎

5.3.2. Proof of Proposition 3.5

Part (i). We first note that H˙1,q(μγσ)\dot{H}^{1,q}(\mu*\gamma_{\sigma}) is a function space over d\mathbb{R}^{d}. To see this, observe that if we choose a reference measure κ\kappa to be an isotropic Gaussian distribution with sufficiently small variance parameter, then the relative density d(μγσ)/dκd(\mu*\gamma_{\sigma})/d\kappa is bounded away from zero. Indeed, for κ=γσ/2\kappa=\gamma_{\sigma/\sqrt{2}}, we have

d(μγσ)dγσ/2(x)=2d/2de|xy|2/(2σ2)+|x|2/σ2𝑑μ(y)=2d/2de|x+y|2/(2σ2)|y|2/σ2𝑑μ(y)2d/2e𝔼μ[|X|2]/σ2\begin{split}\frac{d(\mu*\gamma_{\sigma})}{d\gamma_{\sigma/\sqrt{2}}}(x)&=2^{-d/2}\int_{\mathbb{R}^{d}}e^{-|x-y|^{2}/(2\sigma^{2})+|x|^{2}/\sigma^{2}}d\mu(y)\\ &=2^{-d/2}\int_{\mathbb{R}^{d}}e^{|x+y|^{2}/(2\sigma^{2})-|y|^{2}/\sigma^{2}}d\mu(y)\\ &\geq 2^{-d/2}e^{-\operatorname{\mathbb{E}}_{\mu}[|X|^{2}]/\sigma^{2}}\end{split}

by Jensen’s inequality, which guarantees that H˙1,q(μγσ)\dot{H}^{1,q}(\mu*\gamma_{\sigma}) is a function space over d\mathbb{R}^{d} in view of Remark 2.1.

By regularity of gg from Lemma 5.3, we know that gg is locally Lipschitz and gLq(μγσ)gLq(μγσ;d)<\|g\|_{L^{q}(\mu*\gamma_{\sigma})}\vee\|\nabla g\|_{L^{q}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}<\infty (the latter alone does not automatically guarantee gH˙1,q(μγσ)g\in\dot{H}^{1,q}(\mu*\gamma_{\sigma})). As in Proposition 1.5.2 in [Bog98], choose a sequence ζjC0\zeta_{j}\in C_{0}^{\infty} with the following property:

0ζj1,ζj(x)=1if |x|j,supj,x|ζj(x)|<.0\leq\zeta_{j}\leq 1,\ \zeta_{j}(x)=1\ \text{if $|x|\leq j$},\ \sup_{j,x}|\nabla\zeta_{j}(x)|<\infty.
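
For concreteness, one admissible choice (any fixed bump function works) is \zeta_{j}(x)=\zeta(x/j) for a single \zeta\in C_{0}^{\infty} with 0\leq\zeta\leq 1 and \zeta\equiv 1 on the unit ball, since then

\nabla\zeta_{j}(x)=j^{-1}\nabla\zeta(x/j),\qquad\sup_{j,x}|\nabla\zeta_{j}(x)|\leq\sup_{x}|\nabla\zeta(x)|<\infty.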

Let φj=ζjg\varphi_{j}=\zeta_{j}g. Each φj\varphi_{j} belongs to the ordinary Sobolev (1,q)(1,q)-space w.r.t. the Lebesgue measure, so φj\nabla\varphi_{j} can be approximated by gradients of C0C_{0}^{\infty} functions under Lq(dx;d)\|\cdot\|_{L^{q}(dx;\mathbb{R}^{d})} (cf. [Ada75], Corollary 3.23). Since μγσ\mu*\gamma_{\sigma} has a bounded Lebesgue density, this shows that φjH˙1,q(μγσ)\varphi_{j}\in\dot{H}^{1,q}(\mu*\gamma_{\sigma}). Now,

φjgLq(μγσ;d)(ζj)gLq(μγσ;d)+(ζj1)gLq(μγσ;d)0\|\nabla\varphi_{j}-\nabla g\|_{L^{q}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}\leq\|(\nabla\zeta_{j})g\|_{L^{q}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}+\|(\zeta_{j}-1)\nabla g\|_{L^{q}(\mu*\gamma_{\sigma};\mathbb{R}^{d})}\to 0

as jj\to\infty, implying that gH˙1,q(μγσ)g\in\dot{H}^{1,q}(\mu*\gamma_{\sigma}).

Part (ii). Pick any hTΛμ(0),tn0h\in T_{\Lambda_{\mu}}(0),t_{n}\downarrow 0, and hnhh_{n}\to h in DμD_{\mu} such that tnhnΛμt_{n}h_{n}\in\Lambda_{\mu}. For any ϵ>0\epsilon>0, there exist some constant c>0c>0 and sub-Weibull ρ𝒫\rho\in\mathcal{P} such that hh~H˙1,p(μγσ)<ϵ\|h-\tilde{h}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}<\epsilon for h~=c(ρμ)γσ\tilde{h}=c(\rho-\mu)*\gamma_{\sigma}.

Observe that

|Ψ(tnhn)Ψ(tnh~)|=|𝖶pp(μγσ+tnhn,νγσ)𝖶pp(μγσ+tnh~,νγσ)|p(𝖶pp1(μγσ+tnhn,νγσ)𝖶pp1(μγσ+tnh~,νγσ))×𝖶p(μγσ+tnhn,μγσ+tnh~).\begin{split}&|\Psi(t_{n}h_{n})-\Psi(t_{n}\tilde{h})|=|\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma}+t_{n}h_{n},\nu*\gamma_{\sigma})-\mathsf{W}_{p}^{p}(\mu*\gamma_{\sigma}+t_{n}\tilde{h},\nu*\gamma_{\sigma})|\\ &\leq p\big{(}\mathsf{W}_{p}^{p-1}(\mu*\gamma_{\sigma}+t_{n}h_{n},\nu*\gamma_{\sigma})\vee\mathsf{W}_{p}^{p-1}(\mu*\gamma_{\sigma}+t_{n}\tilde{h},\nu*\gamma_{\sigma})\big{)}\\ &\quad\times\mathsf{W}_{p}(\mu*\gamma_{\sigma}+t_{n}h_{n},\mu*\gamma_{\sigma}+t_{n}\tilde{h}).\end{split}

Assume that nn is large enough so that ctn1/2ct_{n}\leq 1/2 and hnhH˙1,p(μγσ)<ϵ\|h_{n}-h\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}<\epsilon. The density of μγσ+tnh~=((1ctn)μ+ctnρ)γσ\mu*\gamma_{\sigma}+t_{n}\tilde{h}=\big{(}(1-ct_{n})\mu+ct_{n}\rho\big{)}*\gamma_{\sigma} w.r.t. μγσ\mu*\gamma_{\sigma} is

d(μγσ+tnh~)d(μγσ)1ctn12.\frac{d(\mu*\gamma_{\sigma}+t_{n}\tilde{h})}{d(\mu*\gamma_{\sigma})}\geq 1-ct_{n}\geq\frac{1}{2}.

Thus, by Proposition 2.1,

𝖶p(μγσ+tnhn,μγσ+tnh~)tnhnh~H˙1,p(μγσ)<2tnϵ.\mathsf{W}_{p}(\mu*\gamma_{\sigma}+t_{n}h_{n},\mu*\gamma_{\sigma}+t_{n}\tilde{h})\lesssim t_{n}\|h_{n}-\tilde{h}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}<2t_{n}\epsilon.

Also, by Proposition 2.1,

𝖶p(μγσ+tnh~,νγσ)𝖶p(μγσ+tnh~,μγσ)+𝖶p(μγσ,νγσ)tnh~H˙1,p(μγσ)+𝖶p(μγσ,νγσ)=O(1).\begin{split}\mathsf{W}_{p}(\mu*\gamma_{\sigma}+t_{n}\tilde{h},\nu*\gamma_{\sigma})&\leq\mathsf{W}_{p}(\mu*\gamma_{\sigma}+t_{n}\tilde{h},\mu*\gamma_{\sigma})+\mathsf{W}_{p}(\mu*\gamma_{\sigma},\nu*\gamma_{\sigma})\\ &\lesssim t_{n}\|\tilde{h}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\mathsf{W}_{p}(\mu*\gamma_{\sigma},\nu*\gamma_{\sigma})=O(1).\end{split}

Likewise, 𝖶p(μγσ+tnhn,νγσ)=O(1)\mathsf{W}_{p}(\mu*\gamma_{\sigma}+t_{n}h_{n},\nu*\gamma_{\sigma})=O(1). Conclude that

lim supn|Ψ(tnhn)Ψ(tnh~)|/tnϵ.\limsup_{n\to\infty}|\Psi(t_{n}h_{n})-\Psi(t_{n}\tilde{h})|/t_{n}\lesssim\epsilon.

Further, |h(g)h~(g)|gH˙1,q(μγσ)hh~H˙1,p(μγσ)ϵ|h(g)-\tilde{h}(g)|\leq\|g\|_{\dot{H}^{1,q}(\mu*\gamma_{\sigma})}\|h-\tilde{h}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\lesssim\epsilon. Combining this with Lemma 3.3, we conclude that

lim supn|Ψ(tnhn)Ψ(0)tnh(g)|ϵ.\limsup_{n\to\infty}\left|\frac{\Psi(t_{n}h_{n})-\Psi(0)}{t_{n}}-h(g)\right|\lesssim\epsilon.

This completes the proof. ∎

5.4. Proofs for Section 3.4

5.4.1. Proof of Proposition 3.8

We first prove the following lemma. We note that the empirical distributions μ^nB\hat{\mu}_{n}^{B} and μ^n\hat{\mu}_{n} are finitely discrete, so n(μ^nBμ^n)γσ\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma} defines a random variable with values in H˙1,p(μγσ)\dot{H}^{-1,p}(\mu*\gamma_{\sigma}) (cf. (13) and Step 3 of the proof of Proposition 3.1). Let 𝔏nB=𝔏nB(X1,,Xn)\mathfrak{L}_{n}^{B}=\mathfrak{L}_{n}^{B}(X_{1},\dots,X_{n}) denote its (regular) conditional law given the data (which exists as H˙1,p(μγσ)\dot{H}^{-1,p}(\mu*\gamma_{\sigma}) is a separable Banach space; cf. Chapter 11 in [Dud02]).

Lemma 5.4.

Suppose that μ\mu satisfies Condition (4). Then, we have 𝔏nBwGμ1\mathfrak{L}_{n}^{B}\stackrel{{\scriptstyle w}}{{\to}}\operatorname{\mathbb{P}}\circ G_{\mu}^{-1} almost surely.

Proof of Lemma 5.4.

From the proof of Proposition 3.1, the function class 𝒰ϕσ\mathcal{U}*\phi_{\sigma} is μ\mu-Donsker with a μ\mu-square integrable envelope. The rest of the proof follows by applying the Giné-Zinn theorem for the bootstrap (cf. Theorem 3.6.2 in [vdVW96]) and repeating the arguments in Steps 2 and 3 in the proof of Proposition 3.1. ∎

Proof of Proposition 3.8.

Part (i). Assume without loss of generality that μ\mu is not a point mass. We first note that the limit variable GμH˙1,p(μγσ)\|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} has a continuous distribution function. This can be verified, e.g., analogously to the proof of Lemma 1 in [SGK21]. Thus, it suffices to prove the convergence in probability (9) for each fixed t0t\geq 0 (cf. Problem 23.1 in [vdV98]).

Let Tn=(μ^nμ)γσT_{n}=(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} and TnB=(μ^nBμ)γσT_{n}^{B}=(\hat{\mu}^{B}_{n}-\mu)*\gamma_{\sigma}. By Proposition 3.1 and Lemma 5.4, we know that

(nTnB,nTn)=(n(μ^nBμ^n)γσ+n(μ^nμ)γσ,n(μ^nμ)γσ)d(Gμ+Gμ,Gμ)inH˙1,p(μγσ)×H˙1,p(μγσ)\begin{split}\big{(}\sqrt{n}T_{n}^{B},\sqrt{n}T_{n}\big{)}&=\big{(}\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}+\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma},\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma}\big{)}\\ &\stackrel{{\scriptstyle d}}{{\to}}\big{(}G_{\mu}^{\prime}+G_{\mu},G_{\mu}\big{)}\quad\text{in}\quad\dot{H}^{-1,p}(\mu*\gamma_{\sigma})\times\dot{H}^{-1,p}(\mu*\gamma_{\sigma})\end{split}

unconditionally, where GμG_{\mu}^{\prime} is an independent copy of GμG_{\mu} (cf. Theorem 2.2 in [Kos08]). Thus, by Proposition 3.3 and the second claim of the functional delta method (Lemma 3.1), we see that

n𝖶p(σ)(μ^nB,μ^n)=nΦ(TnB,Tn)=Φ(0,0)(nTnB,nTn)+Rn=n(TnBTn)H˙1,p(μγσ)+Rn=n(μ^nBμ^n)γσH˙1,p(μγσ)+Rn.\begin{split}\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\mu}_{n})&=\sqrt{n}\Phi(T_{n}^{B},T_{n})=\Phi_{(0,0)}^{\prime}(\sqrt{n}T_{n}^{B},\sqrt{n}T_{n})+R_{n}\\ &=\|\sqrt{n}(T_{n}^{B}-T_{n})\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+R_{n}\\ &=\|\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+R_{n}.\end{split}

Here Rn=o(1)R_{n}=o_{\operatorname{\mathbb{P}}}(1) unconditionally. Choose ϵn0\epsilon_{n}\to 0 such that (|Rn|>ϵn)0\operatorname{\mathbb{P}}(|R_{n}|>\epsilon_{n})\to 0. By Markov’s inequality, we have B(|Rn|>ϵn)0\operatorname{\mathbb{P}}^{B}(|R_{n}|>\epsilon_{n})\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0. By Lemma 5.4 and the continuous mapping theorem, we also have

supt0|B(n(μ^nBμ^n)γσH˙1,p(μγσ)t)(GμH˙1,p(μγσ)t)|0.\sup_{t\geq 0}\left|\operatorname{\mathbb{P}}^{B}\left(\|\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t\right)-\operatorname{\mathbb{P}}\left(\|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t\right)\right|\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}0.

Thus, for each t0t\geq 0,

B(n𝖶p(σ)(μ^nB,μ^n)t)B(n(μ^nBμ^n)γσH˙1,p(μγσ)t+ϵn)+B(|Rn|>ϵn)=(GμH˙1,p(μγσ)t+ϵn)+o(1)=(GμH˙1,p(μγσ)t)+o(1).\begin{split}&\operatorname{\mathbb{P}}^{B}\left(\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\hat{\mu}_{n})\leq t\right)\\ &\leq\operatorname{\mathbb{P}}^{B}\left(\|\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t+\epsilon_{n}\right)+\operatorname{\mathbb{P}}^{B}(|R_{n}|>\epsilon_{n})\\ &=\operatorname{\mathbb{P}}\left(\|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t+\epsilon_{n}\right)+o_{\operatorname{\mathbb{P}}}(1)\\ &=\operatorname{\mathbb{P}}\left(\|G_{\mu}\|_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq t\right)+o_{\operatorname{\mathbb{P}}}(1).\end{split}

The reverse inequality follows similarly.

Part (ii). The argument is analogous to Part (i). Observe that, by Proposition 3.5,

n(𝖲p(σ)(μ^nB,ν)𝖲p(σ)(μ^n,ν))=n(Ψ(TnB)Ψ(Tn))=Ψ0(nTnB)Ψ0(nTn)+o(1)=n(TnBTn)(g)+o(1)=n((μ^nBμ^n)γσ)(g)+o(1).\begin{split}\sqrt{n}\big{(}\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}^{B}_{n},\nu)-\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}_{n},\nu)\big{)}&=\sqrt{n}\big{(}\Psi(T_{n}^{B})-\Psi(T_{n})\big{)}\\ &=\Psi_{0}^{\prime}(\sqrt{n}T_{n}^{B})-\Psi_{0}^{\prime}(\sqrt{n}T_{n})+o_{\operatorname{\mathbb{P}}}(1)\\ &=\sqrt{n}(T_{n}^{B}-T_{n})(g)+o_{\operatorname{\mathbb{P}}}(1)\\ &=\sqrt{n}\big{(}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}\big{)}(g)+o_{\operatorname{\mathbb{P}}}(1).\end{split}

Taking the ppth root and applying the delta method, we have

n(𝖶p(σ)(μ^nB,ν)𝖶p(σ)(μ^n,ν))=1p[𝖶p(σ)(μ,ν)]p1n((μ^nBμ^n)γσ)(g)+o(1).\sqrt{n}\big{(}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}^{B}_{n},\nu)-\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu)\big{)}=\frac{1}{p[\mathsf{W}_{p}^{(\sigma)}(\mu,\nu)]^{p-1}}\cdot\sqrt{n}\big{(}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}\big{)}(g)+o_{\operatorname{\mathbb{P}}}(1).

The rest of the proof is completely analogous to Part (i). ∎

Proof of Proposition 3.9.

By Lemma 5.4 and Example 1.4.6 in [vdVW96], the conditional law of (n(μ^nBμ^n)γσ,n(ν^nBν^n)γσ)\big{(}\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma},\sqrt{n}(\hat{\nu}_{n}^{B}-\hat{\nu}_{n})*\gamma_{\sigma}\big{)} given the data converges weakly to the law of (Gμ,Gν)(G_{\mu},G_{\nu}) in H˙1,p(μγσ)×H˙1,p(νγσ)\dot{H}^{-1,p}(\mu*\gamma_{\sigma})\times\dot{H}^{-1,p}(\nu*\gamma_{\sigma}) almost surely, where GμG_{\mu} and GνG_{\nu} are independent. By Theorem 2.2 in [Kos08], for Tn,1B=(μ^nBμ)γσT_{n,1}^{B}=(\hat{\mu}_{n}^{B}-\mu)*\gamma_{\sigma} and Tn,2B=(ν^nBν)γσT_{n,2}^{B}=(\hat{\nu}_{n}^{B}-\nu)*\gamma_{\sigma}, we have

(nTn,1B,nTn,2B)=(n(μ^nBμ^n)γσ+n(μ^nμ)γσ,n(ν^nBν^n)γσ+n(ν^nν)γσ)d(Gμ+Gμ,Gν+Gν)inH˙1,p(μγσ)×H˙1,p(νγσ)\begin{split}&\big{(}\sqrt{n}T_{n,1}^{B},\sqrt{n}T_{n,2}^{B}\big{)}\\ &=\big{(}\sqrt{n}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}+\sqrt{n}(\hat{\mu}_{n}-\mu)*\gamma_{\sigma},\sqrt{n}(\hat{\nu}_{n}^{B}-\hat{\nu}_{n})*\gamma_{\sigma}+\sqrt{n}(\hat{\nu}_{n}-\nu)*\gamma_{\sigma}\big{)}\\ &\stackrel{{\scriptstyle d}}{{\to}}\big{(}G_{\mu}+G_{\mu}^{\prime},G_{\nu}+G_{\nu}^{\prime}\big{)}\quad\text{in}\quad\dot{H}^{-1,p}(\mu*\gamma_{\sigma})\times\dot{H}^{-1,p}(\nu*\gamma_{\sigma})\end{split}

unconditionally, where Gμ,GνG_{\mu}^{\prime},G_{\nu}^{\prime} are copies of Gμ,GνG_{\mu},G_{\nu}, respectively, and Gμ,Gμ,Gν,GνG_{\mu},G_{\mu}^{\prime},G_{\nu},G_{\nu}^{\prime} are independent. Thus, by Proposition 3.7 and Lemma 3.1, for Tn,1=(μ^nμ)γσT_{n,1}=(\hat{\mu}_{n}-\mu)*\gamma_{\sigma} and Tn,2=(ν^nν)γσT_{n,2}=(\hat{\nu}_{n}-\nu)*\gamma_{\sigma}, we have

n(𝖲p(σ)(μ^nB,ν^nB)𝖲p(σ)(μ^n,ν^n))=n(Υ(Tn,1B,Tn,2B)Υ(Tn,1,Tn,2))=Υ(0,0)(nTn,1B,nTn,2B)Υ(0,0)(nTn,1,nTn,2)+o(1)=n(Tn,1BTn,1)(g)+n(Tn,2BTn,2)(gc)+o(1)=n((μ^nBμ^n)γσ)(g)+n((ν^nBν^n)γσ)(gc)+o(1).\begin{split}&\sqrt{n}\big{(}\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}^{B}_{n},\hat{\nu}_{n}^{B})-\mathsf{S}^{(\sigma)}_{p}(\hat{\mu}_{n},\hat{\nu}_{n})\big{)}\\ &=\sqrt{n}\big{(}\Upsilon(T_{n,1}^{B},T_{n,2}^{B})-\Upsilon(T_{n,1},T_{n,2})\big{)}\\ &=\Upsilon_{(0,0)}^{\prime}(\sqrt{n}T_{n,1}^{B},\sqrt{n}T_{n,2}^{B})-\Upsilon_{(0,0)}^{\prime}(\sqrt{n}T_{n,1},\sqrt{n}T_{n,2})+o_{\operatorname{\mathbb{P}}}(1)\\ &=\sqrt{n}(T_{n,1}^{B}-T_{n,1})(g)+\sqrt{n}(T_{n,2}^{B}-T_{n,2})(g^{c})+o_{\operatorname{\mathbb{P}}}(1)\\ &=\sqrt{n}\big{(}(\hat{\mu}_{n}^{B}-\hat{\mu}_{n})*\gamma_{\sigma}\big{)}(g)+\sqrt{n}\big{(}(\hat{\nu}_{n}^{B}-\hat{\nu}_{n})*\gamma_{\sigma}\big{)}(g^{c})+o_{\operatorname{\mathbb{P}}}(1).\end{split}

The rest of the proof is analogous to Proposition 3.8 Part (ii). ∎

Proof of Proposition 3.10.

It is not difficult to see that ρ\rho satisfies Condition (4) and 2n(ρ^nρ)γσdGρ\sqrt{2n}(\hat{\rho}_{n}-\rho)*\gamma_{\sigma}\stackrel{{\scriptstyle d}}{{\to}}G_{\rho} in H˙1,p(ργσ)\dot{H}^{-1,p}(\rho*\gamma_{\sigma}). By Theorem 3.7.7 and Example 1.4.6 in [vdVW96], the conditional law of (n(ρ^n,1Bρ^n)γσ,n(ρ^n,2Bρ^n)γσ)\big{(}\sqrt{n}(\hat{\rho}_{n,1}^{B}-\hat{\rho}_{n})*\gamma_{\sigma},\sqrt{n}(\hat{\rho}_{n,2}^{B}-\hat{\rho}_{n})*\gamma_{\sigma}\big{)} given the data converges weakly to the law of (Gρ,Gρ)(G_{\rho},G_{\rho}^{\prime}) in H˙1,p(ργσ)×H˙1,p(ργσ)\dot{H}^{-1,p}(\rho*\gamma_{\sigma})\times\dot{H}^{-1,p}(\rho*\gamma_{\sigma}) almost surely, where GρG_{\rho}^{\prime} is an independent copy of GρG_{\rho}. Thus, arguing as in the proof of Proposition 3.9, for Tn,jB=(ρ^n,jBρ)γσT_{n,j}^{B}=(\hat{\rho}_{n,j}^{B}-\rho)*\gamma_{\sigma} (j=1,2j=1,2), we have

(nTn,1B,nTn,2B)d(Gρ1+Gρ3/2,Gρ2+Gρ3/2)inH˙1,p(ργσ)×H˙1,p(ργσ)(\sqrt{n}T_{n,1}^{B},\sqrt{n}T_{n,2}^{B})\stackrel{{\scriptstyle d}}{{\to}}(G_{\rho}^{1}+G_{\rho}^{3}/\sqrt{2},G_{\rho}^{2}+G_{\rho}^{3}/\sqrt{2})\quad\text{in}\quad\dot{H}^{-1,p}(\rho*\gamma_{\sigma})\times\dot{H}^{-1,p}(\rho*\gamma_{\sigma})

unconditionally, where Gρ1,Gρ2,Gρ3G_{\rho}^{1},G_{\rho}^{2},G_{\rho}^{3} are independent copies of GρG_{\rho} (the common term n(ρ^nρ)γσ\sqrt{n}(\hat{\rho}_{n}-\rho)*\gamma_{\sigma} contributes the same copy Gρ3/2G_{\rho}^{3}/\sqrt{2} to both coordinates). Define Φ\Phi by replacing μ\mu with ρ\rho in Section 3.2. Then, by Proposition 3.3 and the second claim of the functional delta method (Lemma 3.1), we see that

n𝖶p(σ)(ρ^n,1B,ρ^n,2B)=nΦ(Tn,1B,Tn,2B)=Φ(0,0)(nTn,1B,nTn,2B)+o(1)=n(Tn,1BTn,2B)H˙1,p(ργσ)+o(1)=n(ρ^n,1Bρ^n)γσn(ρ^n,2Bρ^n)γσH˙1,p(ργσ)+o(1).\begin{split}\sqrt{n}\mathsf{W}_{p}^{(\sigma)}(\hat{\rho}^{B}_{n,1},\hat{\rho}_{n,2}^{B})&=\sqrt{n}\Phi(T_{n,1}^{B},T_{n,2}^{B})=\Phi_{(0,0)}^{\prime}(\sqrt{n}T_{n,1}^{B},\sqrt{n}T_{n,2}^{B})+o_{\operatorname{\mathbb{P}}}(1)\\ &=\|\sqrt{n}(T_{n,1}^{B}-T_{n,2}^{B})\|_{\dot{H}^{-1,p}(\rho*\gamma_{\sigma})}+o_{\operatorname{\mathbb{P}}}(1)\\ &=\|\sqrt{n}(\hat{\rho}_{n,1}^{B}-\hat{\rho}_{n})*\gamma_{\sigma}-\sqrt{n}(\hat{\rho}_{n,2}^{B}-\hat{\rho}_{n})*\gamma_{\sigma}\|_{\dot{H}^{-1,p}(\rho*\gamma_{\sigma})}+o_{\operatorname{\mathbb{P}}}(1).\end{split}

The rest of the proof is analogous to Proposition 3.8 Part (i). ∎

5.5. Proof of Theorem 4.1

5.5.1. Preliminary lemmas

Recall the notation Ξμ\Xi_{\mu} and DμD_{\mu} that appeared in Section 3.2.

Lemma 5.5.

Let μ𝒫p\mu\in\mathcal{P}_{p} for 1<p<1<p<\infty. Under Assumption 1, the map

(h,θ)Ξμ×N0𝖶p(μγσ+h,νθγσ)(h,\theta)\in\Xi_{\mu}\times N_{0}\mapsto{\sf W}_{p}\left(\mu*\gamma_{\sigma}+h,\nu_{\theta}*\gamma_{\sigma}\right)

is Hadamard directionally differentiable at (0,θ)(0,\theta^{\star}) with derivative

(h,θ)Ξμ×N0hθ,𝔇H˙1,p(μγσ).(h,\theta)\in\Xi_{\mu}\times N_{0}\mapsto\norm{h-\left\langle\theta,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}.

Furthermore, the expansion

𝖶p(μγσ+h,νθγσ)=hθθ,𝔇H˙1,p(μγσ)+r(h,θθ),{\sf W}_{p}(\mu*\gamma_{\sigma}+h,\nu_{\theta}*\gamma_{\sigma})=\norm{h-\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}+r(h,\theta-\theta^{\star}),

holds, with remainder rr satisfying r(th,t(θθ))=o(t)r(th,t(\theta-\theta^{\star}))=o(t) as t0t\downarrow 0 uniformly w.r.t. (h,θ)(h,\theta) varying in KΞμ×N0K\subset\Xi_{\mu}\times N_{0}, a compact subset of Dμ×d0D_{\mu}\times\mathbb{R}^{d_{0}}.

Proof.

Consider the map ψ:(h,θ)Ξμ×N0(h,(νθνθ)γσ)Ξμ×Ξμ\psi:(h,\theta)\in\Xi_{\mu}\times N_{0}\mapsto\big{(}h,(\nu_{\theta}-\nu_{\theta^{\star}})*\gamma_{\sigma}\big{)}\in\Xi_{\mu}\times\Xi_{\mu}. The norm differentiability condition, Assumption 1 (vi), establishes Fréchet (hence Hadamard) directional differentiability of ψ\psi at (0,θ)(0,\theta^{\star}) with

ψ(0,θ)(h,θ)=(h,θ,𝔇)TΞμ×Ξμ(0,0).\psi^{\prime}_{(0,\theta^{\star})}(h,\theta)=(h,\left\langle\theta,\mathfrak{D}\right\rangle)\in T_{\Xi_{\mu}\times\Xi_{\mu}}(0,0).

The chain rule for Hadamard directional derivatives paired with Proposition 3.3 yields

(Φψ)(0,θ)(h,θ)=Φψ(0,θ)ψ(0,θ)(h,θ)=Φ(0,0)(h,θ,𝔇)=hθ,𝔇H˙1,p(μγσ).(\Phi\circ\psi)^{\prime}_{(0,\theta^{\star})}(h,\theta)=\Phi^{\prime}_{\psi(0,\theta^{\star})}\circ\psi^{\prime}_{(0,\theta^{\star})}(h,\theta)=\Phi^{\prime}_{(0,0)}(h,\left\langle\theta,\mathfrak{D}\right\rangle)=\norm{h-\left\langle\theta,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}.

The final assertion follows from compact directional differentiability of the composition [Sha90]. ∎

Lemma 5.6.

Assume the setting of Lemma 5.5.

  1. (i)

    There exists a neighborhood N1N_{1} of θ\theta^{\star} with N1¯N0\overline{N_{1}}\subset N_{0} such that

    𝖶p(σ)(μ^n,νθ)C2|θθ|𝖶p(σ)(μ^n,μ),θN1¯,{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})\geq\frac{C}{2}\absolutevalue{\theta-\theta^{\star}}-{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu),\quad\forall\theta\in\overline{N_{1}},

    where C>0C>0 is such that t,𝔇H˙1,p(μγσ)C|t|\norm{\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}\geq C\absolutevalue{t} for every td0t\in\mathbb{R}^{d_{0}}.

  2. (ii)

    Let ξn=O(1)\xi_{n}=O_{\mathbb{P}}(1) and Θn:={θN1¯:n|θθ|ξn}\Theta_{n}:=\left\{\theta\in\overline{N_{1}}:\sqrt{n}\absolutevalue{\theta-\theta^{\star}}\leq\xi_{n}\right\}; then, uniformly in θΘn\theta\in\Theta_{n},

    n𝖶p(σ)(μ^n,νθ)=𝔾n(σ)nθθ,𝔇H˙1,p(μγσ)+o(1).\sqrt{n}{\sf W}_{p}^{(\sigma)}\left(\hat{\mu}_{n},\nu_{\theta}\right)=\big{\|}\mathbb{G}_{n}^{(\sigma)}-\sqrt{n}\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}+o_{\mathbb{P}}(1).
Proof.

Part (i). Assumption 1 (vi) guarantees that there exists a constant C>0C>0 such that θθ,𝔇H˙1,p(μγσ)C|θθ|\norm{\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}\geq C\absolutevalue{\theta-\theta^{\star}} for every θN0\theta\in N_{0}. Let N1N_{1} be an open ball of radius r¯\bar{r} centered at θ\theta^{\star} whose closure is contained in N0N_{0}; then there exists t0>0t_{0}>0 such that, for every tt0t\leq t_{0}, the remainder term rr of Lemma 5.5 satisfies t1|r(0,t(θθ))|Cr¯/2t^{-1}\absolutevalue{r(0,t(\theta-{\theta}^{\star}))}\leq C\bar{r}/2 for every θN1\theta\in\partial N_{1}. Hence, |r(0,θθ)|(C/2)|θθ|\absolutevalue{r(0,\theta-\theta^{\star})}\leq(C/2)\absolutevalue{\theta-\theta^{\star}} for every θN1¯\theta\in\overline{N_{1}}. The triangle inequality yields, for any θN¯1\theta\in\overline{N}_{1},

𝖶p(σ)(μ^n,νθ)\displaystyle{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta}) 𝖶p(σ)(μ,νθ)𝖶p(σ)(μ^n,μ),\displaystyle\geq{\sf W}_{p}^{(\sigma)}(\mu,\nu_{\theta})-{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu),
=θθ,𝔇H˙1,p(μγσ)+r(0,θθ)𝖶p(σ)(μ^n,μ),\displaystyle=\norm{\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}+r(0,\theta-\theta^{\star})-{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu),
C2|θθ|𝖶p(σ)(μ^n,μ).\displaystyle\geq\frac{C}{2}\absolutevalue{\theta-\theta^{\star}}-{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu).

Part (ii). Since 𝔾n(σ)dGμ\mathbb{G}_{n}^{(\sigma)}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu} in H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right) and GμG_{\mu} is tight, the sequence 𝔾n(σ)\mathbb{G}_{n}^{(\sigma)} is uniformly tight. Pick any ϵ,δ>0\epsilon,\delta>0. By uniform tightness, there exists a compact set KϵH˙1,p(μγσ)K_{\epsilon}\subset\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right) such that (𝔾n(σ)Kϵ)1ϵ/2\operatorname{\mathbb{P}}(\mathbb{G}_{n}^{(\sigma)}\in K_{\epsilon})\geq 1-\epsilon/2 for every nn\in\mathbb{N}. Further, since ξn=O(1)\xi_{n}=O_{\mathbb{P}}(1), there exists Mϵ>0M_{\epsilon}>0 such that (|ξn|Mϵ)1ϵ/2\operatorname{\mathbb{P}}(\absolutevalue{\xi_{n}}\leq M_{\epsilon})\geq 1-\epsilon/2 for every nn\in\mathbb{N}. Define the event An,ϵ={𝔾n(σ)Kϵ}{|ξn|Mϵ}A_{n,\epsilon}=\big{\{}\mathbb{G}_{n}^{(\sigma)}\in K_{\epsilon}\big{\}}\cap\big{\{}\absolutevalue{\xi_{n}}\leq M_{\epsilon}\big{\}}. Observe that (An,ϵ)1ϵ\operatorname{\mathbb{P}}(A_{n,\epsilon})\geq 1-\epsilon for every nn\in\mathbb{N}. Then, on this event An,ϵA_{n,\epsilon}, it holds that ΘnΘn,ϵ:={θN1¯:n|θθ|Mϵ}\Theta_{n}\subset\Theta_{n,\epsilon}:=\left\{\theta\in\overline{N_{1}}:\sqrt{n}\absolutevalue{\theta-\theta^{\star}}\leq M_{\epsilon}\right\}. Since Θn,ϵ\Theta_{n,\epsilon} is compact, we have, for every θΘn,ϵ\theta\in\Theta_{n,\epsilon},

n𝖶p(σ)(μ^n,νθ)=𝔾n(σ)nθθ,𝔇H˙1,p(μγσ)+nr(n1/2𝔾n(σ),θθ).\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})=\big{\|}\mathbb{G}_{n}^{(\sigma)}-\sqrt{n}\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}+\sqrt{n}r(n^{-1/2}\mathbb{G}_{n}^{(\sigma)},\theta-\theta^{\star}).

Set ζn:=supθΘn,ϵ|nr(n1/2𝔾n(σ),θθ)|\zeta_{n}:=\sup_{\theta\in\Theta_{n,\epsilon}}|\sqrt{n}r(n^{-1/2}\mathbb{G}_{n}^{(\sigma)},\theta-\theta^{\star})|. Then, on the event An,ϵA_{n,\epsilon},

|ζn|suphKϵ,|u|Mϵn|r(n1/2h,n1/2u)|,\absolutevalue{\zeta_{n}}\leq\sup_{h\in K_{\epsilon},|u|\leq M_{\epsilon}}\sqrt{n}\big{|}r(n^{-1/2}h,n^{-1/2}u)\big{|},

and the right hand side can be made less than δ\delta for nn sufficiently large. Hence, for every sufficiently large nn,

(|ζn|>δ)\displaystyle\operatorname{\mathbb{P}}\left(\absolutevalue{\zeta_{n}}>\delta\right) =({|ζn|>δ}An,ϵ)+({|ζn|>δ}An,ϵc)\displaystyle=\operatorname{\mathbb{P}}\left(\left\{\absolutevalue{\zeta_{n}}>\delta\right\}\cap A_{n,\epsilon}\right)+\operatorname{\mathbb{P}}\left(\left\{\absolutevalue{\zeta_{n}}>\delta\right\}\cap A_{n,\epsilon}^{c}\right)
({|ζn|>δ}An,ϵ)+(An,ϵc)\displaystyle\leq\operatorname{\mathbb{P}}\left(\left\{\absolutevalue{\zeta_{n}}>\delta\right\}\cap A_{n,\epsilon}\right)+\operatorname{\mathbb{P}}\left(A_{n,\epsilon}^{c}\right)
(𝔾n(σ)Kϵ)+(|ξn|>Mϵ)\displaystyle\leq\operatorname{\mathbb{P}}\left(\mathbb{G}_{n}^{(\sigma)}\notin K_{\epsilon}\right)+\operatorname{\mathbb{P}}\left(\absolutevalue{\xi_{n}}>M_{\epsilon}\right)
ϵ,\displaystyle\leq\epsilon,

that is, ζn=o(1)\zeta_{n}=o_{\mathbb{P}}(1). This implies the desired result. ∎

5.5.2. Proof of Theorem 4.1

Part (i). Given the above lemmas, the proof follows closely [Pol80, Theorem 4.2] or [GGK20, Appendix B.4]. We first note that, under Assumption 1, there exists a sequence of measurable estimators θ^n\hat{\theta}_{n} such that 𝖶p(σ)(μ^n,νθ^n)=infθΘ𝖶p(σ)(μ^n,νθ)\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\hat{\theta}_{n}})=\inf_{\theta\in\Theta}\mathsf{W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta}) and θ^nθ\hat{\theta}_{n}\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}\theta^{\star}. This follows from a small modification to the proof of Theorems 2 and 3 in [GGK20]. Thus, for any neighborhood NN of θ\theta^{\star},

infθΘ𝖶p(σ)(μ^n,νθ)=infθN𝖶p(σ)(μ^n,νθ)\inf_{\theta\in\Theta}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})=\inf_{\theta\in N}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})

with probability approaching one.

By Assumption 1 (vi), there exists C>0C>0 such that t,𝔇H˙1,p(μγσ)C|t|\norm{\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}\geq C\absolutevalue{t} for every td0t\in\mathbb{R}^{d_{0}}. Thus, by Lemma 5.6 (i), there exists a neighborhood N1N_{1} of θ\theta^{\star} with N1¯N0\overline{N_{1}}\subset N_{0} such that

𝖶p(σ)(μ^n,νθ)C2|θθ|𝖶p(σ)(μ^n,μ),θN1¯.{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})\geq\frac{C}{2}\absolutevalue{\theta-\theta^{\star}}-{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu),\quad\forall\theta\in\overline{N_{1}}.

Set Θn:={θΘ:n|θθ|ξn}\Theta_{n}:=\left\{\theta\in\Theta:\sqrt{n}\absolutevalue{\theta-\theta^{\star}}\leq\xi_{n}\right\} with ξn:=(4/C)𝔾n(σ)H˙1,p(μγσ)=O(1)\xi_{n}:=(4/C)\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}=O_{\operatorname{\mathbb{P}}}(1). By Lemma 5.6 (ii), the expansion

n𝖶p(σ)(μ^n,νθ)=𝔾n(σ)nθθ,𝔇H˙1,p(μγσ)+o(1),\sqrt{n}{\sf W}_{p}^{(\sigma)}\left(\hat{\mu}_{n},\nu_{\theta}\right)=\big{\|}\mathbb{G}_{n}^{(\sigma)}-\sqrt{n}\left\langle\theta-\theta^{\star},\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}+o_{\mathbb{P}}(1), (19)

holds uniformly in θN1Θn\theta\in N_{1}\cap\Theta_{n}. Then, for arbitrary θN1Θnc\theta\in N_{1}\cap\Theta_{n}^{c},

𝖶p(σ)(μ^n,νθ)\displaystyle{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta}) >C2ξnn𝖶p(σ)(μ^n,μ)\displaystyle>\frac{C}{2}\frac{\xi_{n}}{\sqrt{n}}-{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)
=𝖶p(σ)(μ^n,μ)+o(n1/2),\displaystyle={\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)+o_{\mathbb{P}}(n^{-1/2}),

so that

infθN1Θnc𝖶p(σ)(μ^n,νθ)>𝖶p(σ)(μ^n,μ)+o(n1/2)infθN1Θn𝖶p(σ)(μ^n,νθ)+o(n1/2).\begin{split}\inf_{\theta\in N_{1}\cap\Theta_{n}^{c}}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})&>{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)+o_{\mathbb{P}}(n^{-1/2})\\ &\geq\inf_{\theta\in N_{1}\cap\Theta_{n}}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})+o_{\mathbb{P}}(n^{-1/2}).\end{split}

This shows that infθΘ𝖶p(σ)(μ^n,νθ)=infθN1Θn𝖶p(σ)(μ^n,νθ)+o(n1/2)\inf_{\theta\in\Theta}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})=\inf_{\theta\in N_{1}\cap\Theta_{n}}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})+o_{\mathbb{P}}(n^{-1/2}).

Now, reparametrizing by t=n(θθ)t=\sqrt{n}(\theta-\theta^{\star}) and setting Tn:={td0:|t|ξn,θ+t/nN1}T_{n}:=\big{\{}t\in\mathbb{R}^{d_{0}}:\absolutevalue{t}\leq\xi_{n},\theta^{\star}+t/\sqrt{n}\in N_{1}\big{\}} in (19), we have

infθN1Θnn𝖶p(σ)(μ^n,νθ)=inftTn𝔾n(σ)t,𝔇H˙1,p(μγσ)+o(1).\inf_{\theta\in N_{1}\cap\Theta_{n}}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})=\inf_{t\in T_{n}}\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+o_{\mathbb{P}}(1).

Set 𝔤n:=𝔾n(σ),𝔇H˙1,p(μγσ)\mathfrak{g}_{n}:=\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle\cdot,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}. For any td0t\in\mathbb{R}^{d_{0}} such that |t|>ξn\absolutevalue{t}>\xi_{n}, we have

𝔤n(t)C|t|𝔾n(σ)H˙1,p(μγσ)>3𝔤n(0)3inf|t|ξn𝔤n(t).\mathfrak{g}_{n}(t)\geq C\absolutevalue{t}-\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}>3\mathfrak{g}_{n}(0)\geq 3\inf_{\absolutevalue{t^{\prime}}\leq\xi_{n}}\mathfrak{g}_{n}(t^{\prime}).

Since {td0:|t|ξn}Tn\left\{t\in\mathbb{R}^{d_{0}}:\absolutevalue{t}\leq\xi_{n}\right\}\subset T_{n} with probability approaching one (as ξn=O(1)\xi_{n}=O_{\mathbb{P}}(1)), we have inftTn𝔤n(t)=inftd0𝔤n(t)\inf_{t\in T_{n}}\mathfrak{g}_{n}(t)=\inf_{t\in\mathbb{R}^{d_{0}}}\mathfrak{g}_{n}(t) with probability approaching one. Conclude that

infθΘn𝖶p(σ)(μ^n,νθ)=inftd0𝔾n(σ)t,𝔇H˙1,p(μγσ)+o(1).\inf_{\theta\in\Theta}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})=\inf_{t\in\mathbb{R}^{d_{0}}}\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+o_{\mathbb{P}}(1).

Finally, since the map hH˙1,p(μγσ)inftd0ht,𝔇H˙1,p(μγσ)h\in\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)\mapsto\inf_{t\in\mathbb{R}^{d_{0}}}\norm{h-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)} is continuous, the continuous mapping theorem yields

infθΘn𝖶p(σ)(μ^n,νθ)dinftd0Gμt,𝔇H˙1,p(μγσ).\inf_{\theta\in\Theta}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})\stackrel{{\scriptstyle d}}{{\to}}\inf_{t\in\mathbb{R}^{d_{0}}}\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}.

This completes the proof of Part (i).

Part (ii). From the proof of Theorem 3 in [GGK20], it is not difficult to show that θ^nθ\hat{\theta}_{n}\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}\theta^{\star}. Thus, θ^nN1\hat{\theta}_{n}\in N_{1} with probability approaching one, where N1N_{1} is the neighborhood of θ\theta^{\star} given in the proof of Part (i), so that by the definition of θ^n\hat{\theta}_{n} and Lemma 5.6 (i),

infθΘn𝖶p(σ)(μ^n,νθ)=O(1)+o(1)n𝖶p(σ)(μ^n,νθ^n)C2n|θ^nθ|n𝖶p(σ)(μ^n,μ)=O(1),\underbrace{\inf_{\theta\in\Theta}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})}_{=O_{\mathbb{P}}(1)}+o_{\mathbb{P}}(1)\geq\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\hat{\theta}_{n}})\geq\frac{C}{2}\sqrt{n}|\hat{\theta}_{n}-\theta^{\star}|-\underbrace{\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)}_{=O_{\mathbb{P}}(1)}, (20)

with probability tending to one. This implies that n|θ^nθ|=O(1)\sqrt{n}|\hat{\theta}_{n}-\theta^{\star}|=O_{\mathbb{P}}(1). Let 𝕄n(t):=𝔾n(σ)t,𝔇H˙1,p(μγσ)\mathbb{M}_{n}(t):=\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)} and 𝕄(t):=Gμt,𝔇H˙1,p(μγσ)\mathbb{M}(t):=\norm{G_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}. Observe that 𝕄n\mathbb{M}_{n} and 𝕄\mathbb{M} are convex in tt. Again, from the proof of Part (i), since t^n:=n(θ^nθ)=O(1)\hat{t}_{n}:=\sqrt{n}(\hat{\theta}_{n}-\theta^{\star})=O_{\mathbb{P}}(1), we have

n𝖶p(σ)(μ^n,νθ^n)=𝕄n(t^n)+o(1).\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\hat{\theta}_{n}})=\mathbb{M}_{n}(\hat{t}_{n})+o_{\mathbb{P}}(1).

Hence,

𝕄n(t^n)=n𝖶p(σ)(μ^n,νθ^n)+o(1)infθΘn𝖶p(σ)(μ^n,νθ)+o(1)=inftd0𝕄n(t)+o(1).\begin{split}\mathbb{M}_{n}(\hat{t}_{n})&=\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\hat{\theta}_{n}})+o_{\mathbb{P}}(1)\\ &\leq\inf_{\theta\in\Theta}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta})+o_{\mathbb{P}}(1)\\ &=\inf_{t\in\mathbb{R}^{d_{0}}}\mathbb{M}_{n}(t)+o_{\mathbb{P}}(1).\end{split}

Since 𝔾n(σ)dGμ\mathbb{G}_{n}^{(\sigma)}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu} in H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right), (𝕄n(t1),,𝕄n(tk))d(𝕄(t1),,𝕄(tk))(\mathbb{M}_{n}(t_{1}),\dots,\mathbb{M}_{n}(t_{k}))\stackrel{{\scriptstyle d}}{{\to}}(\mathbb{M}(t_{1}),\dots,\mathbb{M}(t_{k})) for any finite set of points (ti)i=1kd0(t_{i})_{i=1}^{k}\subset\mathbb{R}^{d_{0}} by the continuous mapping theorem. Applying Theorem 1 in [Kat09] (or Lemma 6 in [GGK20]) yields t^ndargmintd0𝕄(t)\hat{t}_{n}\stackrel{{\scriptstyle d}}{{\to}}\operatorname{argmin}_{t\in\mathbb{R}^{d_{0}}}\mathbb{M}(t). ∎
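
When p=2, the limiting minimization above can be made explicit, since \dot{H}^{-1,2}(\mu*\gamma_{\sigma}) is a Hilbert space. Writing \langle t,\mathfrak{D}\rangle=\sum_{i=1}^{d_{0}}t_{i}\mathfrak{D}_{i} and assuming the Gram matrix A below is nonsingular (which follows from the lower bound \|\langle t,\mathfrak{D}\rangle\|_{\dot{H}^{-1,2}(\mu*\gamma_{\sigma})}\geq C|t| used above), a minimal sketch of why the limit of \hat{t}_{n} is Gaussian, with \langle\cdot,\cdot\rangle on the right-hand side denoting the inner product of this Hilbert space:

\operatorname*{argmin}_{t\in\mathbb{R}^{d_{0}}}\Big\|G_{\mu}-\sum_{i=1}^{d_{0}}t_{i}\mathfrak{D}_{i}\Big\|_{\dot{H}^{-1,2}(\mu*\gamma_{\sigma})}=A^{-1}b,\qquad A_{ij}=\langle\mathfrak{D}_{i},\mathfrak{D}_{j}\rangle_{\dot{H}^{-1,2}(\mu*\gamma_{\sigma})},\quad b_{i}=\langle G_{\mu},\mathfrak{D}_{i}\rangle_{\dot{H}^{-1,2}(\mu*\gamma_{\sigma})},

which is linear in G_{\mu} and hence a centered Gaussian vector, consistent with the asymptotic normality for the quadratic cost discussed in the concluding remarks.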

6. Concluding remarks

In this paper, we have developed a comprehensive limit distribution theory for empirical 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} that covers general 1<p<1<p<\infty and d1d\geq 1, under both the null and the alternative. Our proof technique leveraged the extended functional delta method, which required two main ingredients: (i) convergence of the smooth empirical process in an appropriate normed vector space; and (ii) characterization of the Hadamard directional derivative of 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} w.r.t. the norm. We have identified the dual Sobolev space H˙1,p(μγ)\dot{H}^{-1,p}(\mu*\gamma) as the normed space of interest and established the items above to obtain the limit distribution results. Linearity of the Hadamard directional derivative under the alternative enabled establishing the asymptotic normality of the empirical (scaled) 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}.

To facilitate statistical inference using 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}, we have established the consistency of the nonparametric bootstrap. The limit distribution theory was used to study generative modeling via 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} MDE. We have derived limit distributions for the optimal solutions and the corresponding smooth Wasserstein error, and obtained Gaussian limits when p=2p=2 by leveraging the Hilbertian structure of the corresponding dual Sobolev space. Our statistical study, together with the appealing metric and topological structure of 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)} [GG20, NGK21], suggests that the smooth Wasserstein framework is compatible with high-dimensional learning and inference.

An important direction for future research is the efficient computation of 𝖶p(σ)\mathsf{W}_{p}^{(\sigma)}. While standard methods for computing 𝖶p\mathsf{W}_{p} are applicable in the smooth case (by sampling the Gaussian noise), it is desirable to find computational techniques that make use of the structure induced by the convolution with a known smooth kernel. Another appealing direction is to establish Berry-Esseen type bounds for the limit distributions in Theorem 1.1. Of particular interest is to explore how parameters such as dd and σ\sigma affect the accuracy of the limit distributions in Theorem 1.1. [SGK21] addressed a similar problem for empirical 𝖶1(σ)\mathsf{W}_{1}^{(\sigma)} under the one-sample null case, but their proof relies substantially on the IPM structure of 𝖶1\mathsf{W}_{1} and finite sample Gaussian approximation techniques developed by [CCK14, CCK16]. These techniques do not apply to p>1p>1, and thus new ideas, such as the linearization arguments herein, are required to develop Berry-Esseen type bounds for p>1p>1.
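
As an illustration of the sampling-based approach mentioned above (and of the bootstrap procedure discussed in the previous paragraph), the following Python sketch estimates the smooth p-Wasserstein distance between two equal-size samples by adding one Gaussian draw to each data point and solving the resulting assignment problem with an off-the-shelf solver. It is only a minimal illustration under these simplifying choices, not the method of this paper; all function names, the number of bootstrap replicates, and the single-noise-draw-per-point scheme are ours.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def smooth_wp(x, y, sigma, p=2, rng=None):
    # Plug-in estimate of the smooth p-Wasserstein distance: smooth each empirical
    # measure by adding one N(0, sigma^2 I) draw per data point, then compute W_p
    # between the two equal-weight point clouds via an optimal assignment
    # (requires x and y to contain the same number of points).
    rng = np.random.default_rng(rng)
    xs = x + sigma * rng.standard_normal(x.shape)
    ys = y + sigma * rng.standard_normal(y.shape)
    cost = cdist(xs, ys) ** p                  # cost matrix |x_i - y_j|^p
    row, col = linear_sum_assignment(cost)     # optimal permutation coupling
    return cost[row, col].mean() ** (1.0 / p)

def bootstrap_null_quantile(x, sigma, p=2, n_boot=200, level=0.95, rng=None):
    # Nonparametric bootstrap in the spirit of Proposition 3.8 (i): resample the
    # data with replacement and recompute sqrt(n) times the smooth distance between
    # the resampled and the original empirical measures.
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    stats = [np.sqrt(n) * smooth_wp(x[rng.integers(0, n, size=n)], x, sigma, p, rng)
             for _ in range(n_boot)]
    return np.quantile(stats, level)

rng = np.random.default_rng(0)
x = rng.standard_normal((300, 5))         # n = 300 samples from mu in dimension 5
y = rng.standard_normal((300, 5)) + 0.3   # samples from a shifted nu
print(smooth_wp(x, y, sigma=1.0))               # point estimate of the smooth distance
print(bootstrap_null_quantile(x, sigma=1.0))    # bootstrap 95% quantile under the null

More refined schemes could average several noise draws per data point or exploit the known Gaussian kernel directly, which is precisely the computational question raised above.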

Appendix A Additional proofs

A.1. Proof of Lemma 2.2

Completeness of H˙1,p(ρ)\dot{H}^{-1,p}(\rho) is immediate. To prove separability, let hE(h)h\mapsto E(h) be the map from H˙1,p(ρ)\dot{H}^{-1,p}(\rho) into Lp(ρ;d)L^{p}(\rho;\mathbb{R}^{d}) given in Lemma 5.1. Then, from the first equation in (15) and Hölder’s inequality, it holds that h1h2H˙1,p(ρ)E(h1)E(h2)Lp(ρ;d)\|h_{1}-h_{2}\|_{\dot{H}^{-1,p}(\rho)}\leq\|E(h_{1})-E(h_{2})\|_{L^{p}(\rho;\mathbb{R}^{d})} for all h1,h2H˙1,p(ρ)h_{1},h_{2}\in\dot{H}^{-1,p}(\rho). Since Lp(ρ;d)L^{p}(\rho;\mathbb{R}^{d}) is separable, we can choose a countable dense subset W0W_{0} of the range of EE. As the map hE(h)h\mapsto E(h) is injective, the set {E1(w):wW0}\{E^{-1}(w):w\in W_{0}\} is (countable and) dense in H˙1,p(ρ)\dot{H}^{-1,p}(\rho).

Finally, the Borel σ\sigma-field contains the cylinder σ\sigma-field as the coordinate projections are Borel measurable. By separability of H˙1,p(ρ)\dot{H}^{-1,p}(\rho), to show that the cylinder σ\sigma-field contains the Borel σ\sigma-field, it suffices to show that any closed ball in H˙1,p(ρ)\dot{H}^{-1,p}(\rho) is measurable relative to the cylinder σ\sigma-field. For any H˙1,p(ρ)\ell\in\dot{H}^{-1,p}(\rho) and r>0r>0, we have

{hH˙1,p(ρ):hH˙1,p(ρ)r}=φC0:φH˙1,q(ρ)1{hH˙1,p(ρ):|(h)(φ)|r}.\big{\{}h\in\dot{H}^{-1,p}(\rho):\|h-\ell\|_{\dot{H}^{-1,p}(\rho)}\leq r\big{\}}=\bigcap_{\begin{subarray}{c}\varphi\in C_{0}^{\infty}:\\ \|\varphi\|_{\dot{H}^{1,q}(\rho)}\leq 1\end{subarray}}\big{\{}h\in\dot{H}^{-1,p}(\rho):|(h-\ell)(\varphi)|\leq r\big{\}}.

Since H˙1,q(ρ)\dot{H}^{1,q}(\rho) (which is isometrically isomorphic to a closed subspace of Lq(ρ;d)L^{q}(\rho;\mathbb{R}^{d})) is separable, the intersection on the right-hand side can be replaced with the intersection over countably many functions. This shows that the H˙1,p(ρ)\dot{H}^{-1,p}(\rho)-ball on the left-hand side is measurable relative to the cylinder σ\sigma-field. ∎

A.2. Proof of Proposition 2.1

It suffices to prove the proposition when μ1μ0H˙1,p(ρ)<\|\mu_{1}-\mu_{0}\|_{\dot{H}^{-1,p}(\rho)}<\infty. Consider the curve μt=(1t)μ0+tμ1\mu_{t}=(1-t)\mu_{0}+t\mu_{1} for t[0,1]t\in[0,1], and let E:ddE:\mathbb{R}^{d}\to\mathbb{R}^{d} be the Borel vector field given in Lemma 5.1 for h=μ1μ0h=\mu_{1}-\mu_{0}. Let ftf_{t} denote the density of μt\mu_{t} w.r.t. ρ\rho, i.e., ft=(1t)f0+tf1f_{t}=(1-t)f_{0}+tf_{1}. By our lower bound of cc on one of f0f_{0} or f1f_{1}, we have ft>0f_{t}>0 for all tt in the open interval I=(0,1)I=(0,1), so the vector field E/ftE/f_{t} is well-defined for tIt\in I. Then the continuity equation (14) holds with vt=E/ftv_{t}=E/f_{t} and this choice of II.

Now suppose that f1cf_{1}\geq c (f0cf_{0}\geq c case is similar). Then,

vtLp(μt;d)p=d|E|pftp1𝑑ρcp+1tp+1ELp(ρ;d)p,\|v_{t}\|_{L^{p}(\mu_{t};\mathbb{R}^{d})}^{p}=\int_{\mathbb{R}^{d}}\frac{|E|^{p}}{f_{t}^{p-1}}d\rho\leq c^{-p+1}t^{-p+1}\|E\|_{L^{p}(\rho;\mathbb{R}^{d})}^{p},

so that 𝖶p(μa,μb)c1/qELp(ρ;d)abt1+1/p𝑑t\mathsf{W}_{p}(\mu_{a},\mu_{b})\leq c^{-1/q}\|E\|_{L^{p}(\rho;\mathbb{R}^{d})}\int_{a}^{b}t^{-1+1/p}dt for 0<a<b<10<a<b<1. Taking a0a\to 0 and b1b\to 1, we conclude that 𝖶p(μ0,μ1)c1/qpELp(ρ;d)\mathsf{W}_{p}(\mu_{0},\mu_{1})\leq c^{-1/q}p\,\|E\|_{L^{p}(\rho;\mathbb{R}^{d})}.

In view of (15), the desired conclusion follows once we verify

ELp(ρ;d)=sup{dφ,E𝑑ρ:φC0,φLq(ρ)1},\|E\|_{L^{p}(\rho;\mathbb{R}^{d})}=\sup\left\{\int_{\mathbb{R}^{d}}\langle\nabla\varphi,E\rangle d\rho:\varphi\in C_{0}^{\infty},\|\nabla\varphi\|_{L^{q}(\rho)}\leq 1\right\},

but this follows from the fact that jp(E){φ:φC0}¯Lq(ρ;d)j_{p}(E)\in\overline{\{\nabla\varphi:\varphi\in C_{0}^{\infty}\}}^{L^{q}(\rho;\mathbb{R}^{d})}. ∎

A.3. Proof of Proposition 4.1

The proof adopts the approach of [Pol80, Theorem 7.2]. Let N1N_{1} be the neighborhood of θ\theta^{\star} appearing in the proof of Theorem 4.1 Part (i). Set

Tn:={td0:|t|C1(4𝔾n(σ)H˙1,p(μγσ)+2λn)}.T_{n}:=\left\{t\in\mathbb{R}^{d_{0}}:\absolutevalue{t}\leq C^{-1}(4\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+2\lambda_{n})\right\}.

Since 𝔾n(σ)H˙1,p(μγσ)=n𝖶p(σ)(μ^n,μ)+o(1)\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}=\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\mu)+o_{\mathbb{P}}(1) (cf. Lemma 5.6), Θ^nθ+n1/2Tn\hat{\Theta}_{n}\subset\theta^{\star}+n^{-1/2}T_{n} with inner probability tending to one (cf. equation (20)).

Let Γn:=suptTn|n𝖶p(σ)(μ^n,νθ+n1/2t)𝔾n(σ)t,𝔇H˙1,p(μγσ)|\Gamma_{n}:=\sup_{t\in T_{n}}\big{|}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta^{\star}+n^{-1/2}t})-\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\big{|}. From the proof of Theorem 4.1 Part (i), we see that Γn=o(1)\Gamma_{n}=o_{\operatorname{\mathbb{P}}}(1). Also, since 𝔾n(σ)dGμ\mathbb{G}_{n}^{(\sigma)}\stackrel{{\scriptstyle d}}{{\to}}G_{\mu} in H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right), by the Skorohod-Dudley-Wichura construction, there exist versions 𝔾~n(σ)\tilde{\mathbb{G}}_{n}^{(\sigma)} and G~μ\tilde{G}_{\mu} of 𝔾n(σ)\mathbb{G}_{n}^{(\sigma)} and GμG_{\mu} (i.e., 𝔾~n(σ)\tilde{\mathbb{G}}_{n}^{(\sigma)} and G~μ\tilde{G}_{\mu} have the same distributions as 𝔾n(σ)\mathbb{G}_{n}^{(\sigma)} and GμG_{\mu}, respectively, as H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)-valued random variables) such that 𝔾~n(σ)G~μ\tilde{\mathbb{G}}_{n}^{(\sigma)}\to\tilde{G}_{\mu} in H˙1,p(μγσ)\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right) almost surely. Choose δn,ϵn,γn0\delta_{n},\epsilon_{n},\gamma_{n}\downarrow 0 in such a way that

(λn>δn)0,(𝔾~n(σ)G~μH˙1,p(μγσ)>ϵn)0,(Γn>γn)0.\mathbb{P}(\lambda_{n}>\delta_{n})\to 0,\quad\mathbb{P}\left(\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\tilde{G}_{\mu}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}>\epsilon_{n}\right)\to 0,\quad\mathbb{P}(\Gamma_{n}>\gamma_{n})\to 0.

Set βn:=max{2γn+δn,2ϵn}\beta_{n}:=\max\left\{2\gamma_{n}+\delta_{n},2\epsilon_{n}\right\}.

Part (ii). We first show that K(𝔾n(σ),βn)dK(Gμ,0)K(\mathbb{G}_{n}^{(\sigma)},\beta_{n})\stackrel{{\scriptstyle d}}{{\to}}K(G_{\mu},0). To this end, it suffices to show that K(𝔾~n(σ),βn)K(G~μ,0)K(\tilde{\mathbb{G}}_{n}^{(\sigma)},\beta_{n})\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}K(\tilde{G}_{\mu},0). Observe that for any h,hH˙1,p(μγσ)h,h^{\prime}\in\dot{H}^{-1,p}(\mu*\gamma_{\sigma}),

|inftd0ht,𝔇H˙1,p(μγσ)inftd0ht,𝔇H˙1,p(μγσ)|\displaystyle\absolutevalue{\inf_{t\in\mathbb{R}^{d_{0}}}\norm{h-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}-\inf_{t\in\mathbb{R}^{d_{0}}}\norm{h^{\prime}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}}
suptd0|ht,𝔇H˙1,p(μγσ)ht,𝔇H˙1,p(μγσ)|\displaystyle\leq\sup_{t\in\mathbb{R}^{d_{0}}}\absolutevalue{\norm{h-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}-\norm{h^{\prime}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}}
hhH˙1,p(μγσ).\displaystyle\leq{\norm{h-h^{\prime}}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}}.

Thus, when 𝔾~n(σ)G~μH˙1,p(μγσ)ϵn\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\tilde{G}_{\mu}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq\epsilon_{n} and tK(Gμ,0)t\in K(G_{\mu},0),

𝔾~n(σ)t,𝔇H˙1,p(μγσ)G~μt,𝔇H˙1,p(μγσ)+𝔾~n(σ)G~μH˙1,p(μγσ)\displaystyle\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq\norm{\tilde{G}_{\mu}-\left\langle t,\mathfrak{D}\right\rangle}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\tilde{G}_{\mu}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}
=inftd0G~μt,𝔇H˙1,p(μγσ)+𝔾~n(σ)G~μH˙1,p(μγσ)\displaystyle\qquad=\inf_{t^{\prime}\in\mathbb{R}^{d_{0}}}\big{\|}\tilde{G}_{\mu}-\left\langle t^{\prime},\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\tilde{G}_{\mu}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}
inftd0𝔾~n(σ)t,𝔇H˙1,p(μγσ)+2ϵn.\displaystyle\qquad\leq\inf_{t^{\prime}\in\mathbb{R}^{d_{0}}}\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\left\langle t^{\prime},\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+2\epsilon_{n}.

This implies that, when 𝔾~n(σ)G~μH˙1,p(μγσ)ϵn\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\tilde{G}_{\mu}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}\leq\epsilon_{n}, which occurs with probability approaching one, it holds that K(G~μ,0)K(𝔾~n(σ),βn)K(\tilde{G}_{\mu},0)\subset K(\tilde{\mathbb{G}}_{n}^{(\sigma)},\beta_{n}).

Now, for any α>0\alpha>0 and ωΩ\omega\in\Omega, there exists λ0(ω)\lambda_{0}(\omega) such that K(G~μ(ω),λ)K(G~μ(ω),0)αK(\tilde{G}_{\mu}(\omega),\lambda)\subset K(\tilde{G}_{\mu}(\omega),0)^{\alpha} for every λλ0(ω)\lambda\leq\lambda_{0}(\omega), since for any sequence κn0\kappa_{n}\downarrow 0, K(G~μ(ω),κn)K(\tilde{G}_{\mu}(\omega),\kappa_{n}) is a decreasing sequence of compact sets whose intersection is contained in the open set K(G~μ(ω),0)αK(\tilde{G}_{\mu}(\omega),0)^{\alpha}, and hence K(G~μ(ω),κn)K(G~μ(ω),0)αK(\tilde{G}_{\mu}(\omega),\kappa_{n})\subset K(\tilde{G}_{\mu}(\omega),0)^{\alpha} for nn sufficiently large. It follows that if κn0\kappa_{n}\to 0 almost surely, then (K(G~μ,κn)K(G~μ,0)α)1\operatorname{\mathbb{P}}(K(\tilde{G}_{\mu},\kappa_{n})\subset K(\tilde{G}_{\mu},0)^{\alpha})\to 1. Applying this result with κn=βn+2𝔾~n(σ)G~μH˙1,p(μγσ)\kappa_{n}=\beta_{n}+2\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\tilde{G}_{\mu}\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}, we have that K(𝔾~n(σ),βn)K(\tilde{\mathbb{G}}_{n}^{(\sigma)},\beta_{n}) is contained in

{t:𝔾~n(σ)t,𝔇H˙1,p(μγσ)inftd0𝔾~n(σ)t,𝔇H˙1,p(μγσ)+λn}\displaystyle\Big{\{}t:\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}\leq\inf_{t\in\mathbb{R}^{d_{0}}}\big{\|}\tilde{\mathbb{G}}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}+\lambda_{n}\Big{\}}
K(G~μ,0)α,\displaystyle\subset K(\tilde{G}_{\mu},0)^{\alpha},

where the final inclusion holds with probability tending to one. Conclude that K(𝔾~n(σ),βn)K(G~μ,0)K(\tilde{\mathbb{G}}_{n}^{(\sigma)},\beta_{n})\stackrel{{\scriptstyle\operatorname{\mathbb{P}}}}{{\to}}K(\tilde{G}_{\mu},0) as desired.

Part (i). Finally, suppose that Γnγn\Gamma_{n}\leq\gamma_{n} and λnδn\lambda_{n}\leq\delta_{n}. Observe that if tTnct\in T_{n}^{c}, then

𝔾n(σ)t,𝔇H˙1,p(μγσ)\displaystyle\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} C|t|𝔾n(σ)H˙1,p(μγσ)\displaystyle\geq C\absolutevalue{t}-\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}
>3𝔾n(σ)H˙1,p(μγσ)+λn\displaystyle>3\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)}+\lambda_{n}
𝔾n(σ)H˙1,p(μγσ),\displaystyle\geq\big{\|}\mathbb{G}_{n}^{(\sigma)}\big{\|}_{\dot{H}^{-1,p}\left(\mu*\gamma_{\sigma}\right)},

so that inftd0𝔾n(σ)t,𝔇H˙1,p(μγσ)=inftTn𝔾n(σ)t,𝔇H˙1,p(μγσ)\inf_{t\in\mathbb{R}^{d_{0}}}\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}=\inf_{t\in T_{n}}\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}. By repeated applications of the triangle inequality, for tTnn(Θ^nθ)t\in T_{n}\cap\sqrt{n}(\hat{\Theta}_{n}-\theta^{\star}), we have

inftTn𝔾n(σ)t,𝔇H˙1,p(μγσ)\displaystyle\inf_{t^{\prime}\in T_{n}}\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t^{\prime},\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})} inftTnn𝖶p(σ)(μ^n,νθ+n1/2t)γn\displaystyle\geq\inf_{t^{\prime}\in T_{n}}\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta^{\star}+n^{-1/2}t^{\prime}})-\gamma_{n}
n𝖶p(σ)(μ^n,νθ+n1/2t)δnγn\displaystyle\geq\sqrt{n}{\sf W}_{p}^{(\sigma)}(\hat{\mu}_{n},\nu_{\theta^{\star}+n^{-1/2}t})-\delta_{n}-\gamma_{n}
𝔾n(σ)t,𝔇H˙1,p(μγσ)βn.\displaystyle\geq\big{\|}\mathbb{G}_{n}^{(\sigma)}-\left\langle t,\mathfrak{D}\right\rangle\big{\|}_{\dot{H}^{-1,p}(\mu*\gamma_{\sigma})}-\beta_{n}.

Conclude that

([Tnn(Θ^nθ)]K(𝔾n(σ),βn))1.\operatorname{\mathbb{P}}_{*}\Big{(}\big{[}T_{n}\cap\sqrt{n}(\hat{\Theta}_{n}-\theta^{\star})\big{]}\subset K(\mathbb{G}_{n}^{(\sigma)},\beta_{n})\Big{)}\to 1.

Since Θ^nθ+n1/2Tn\hat{\Theta}_{n}\subset\theta^{\star}+n^{-1/2}T_{n} with inner probability tending to one, we obtain the desired conclusion. This completes the proof. ∎

References

  • [ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou, Wasserstein generative adversarial networks, International Conference on Machine Learning, 2017.
  • [Ada75] Robert A. Adams, Sobolev spaces, Academic Press, New York-London, 1975.
  • [AF09] Jean-Pierre Aubin and Hélène Frankowska, Set-valued analysis, Springer Science & Business Media, 2009.
  • [AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient flows: in metric spaces and in the space of probability measures, Springer Science & Business Media, 2008.
  • [AKT84] Miklós Ajtai, János Komlós, and Gábor Tusnády, On optimal matchings, Combinatorica 4 (1984), no. 4, 259–264.
  • [AST19] Luigi Ambrosio, Federico Stra, and Dario Trevisan, A PDE approach to a 2-dimensional matching problem, Probability Theory and Related Fields 173 (2019), no. 1, 433–477.
  • [BB00] Jean-David Benamou and Yann Brenier, A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem, Numerische Mathematik 84 (2000), no. 3, 375–393.
  • [BB13] Franck Barthe and Charles Bordenave, Combinatorial optimization over two random point sets, Séminaire de Probabilités XLV, Springer, 2013, pp. 483–535.
  • [BdMM02] J. H. Boutet de Monvel and O. C. Martin, Almost sure convergence of the minimum bipartite matching functional in Euclidean space, Combinatorica 22 (2002), no. 4, 523–530.
  • [BGV07] François Bolley, Arnaud Guillin, and Cédric Villani, Quantitative concentration inequalities for empirical measures on non-compact spaces, Probability Theory and Related Fields 137 (2007), no. 3-4, 541–593.
  • [BJGR19] Espen Bernton, Pierre E. Jacob, Mathieu Gerber, and Christian P. Robert, On parameter estimation with the Wasserstein distance, Information and Inference: A Journal of the IMA 8 (2019), no. 4, 657–676.
  • [BL21] Sergey G. Bobkov and Michel Ledoux, A simple Fourier analytic proof of the AKT optimal matching theorem, The Annals of Applied Probability 31 (2021), no. 6, 2567–2584.
  • [BLG14] Emmanuel Boissard and Thibaut Le Gouic, On the mean speed of convergence of empirical and occupation measures in Wasserstein distance, Annales de l’IHP Probabilités et Statistiques 50 (2014), no. 2, 539–563.
  • [BMZ21] Jose Blanchet, Karthyek Murthy, and Fan Zhang, Optimal transport-based distributionally robust optimization: Structural properties and iterative schemes, Mathematics of Operations Research (2021).
  • [Bob99] Sergey G. Bobkov, Isoperimetric and Analytic Inequalities for Log-Concave Probability Measures, The Annals of Probability 27 (1999), no. 4, 1903–1921.
  • [Bog98] Vladimir I. Bogachev, Gaussian measures, no. 62, American Mathematical Society, 1998.
  • [CCK14] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato, Gaussian approximation of suprema of empirical processes, The Annals of Statistics 42 (2014), 1564–1597.
  • [CCK16] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato, Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings, Stochastic Processes and their Applications 126 (2016), no. 12, 3632–3651.
  • [CCR20] J. Cárcamo, A. Cuevas, and L.-A. Rodríguez, Directional differentiability for supremum-type functionals: statistical applications, Bernoulli 26 (2020), no. 3, 2143–2175.
  • [CF21] Maria Colombo and Max Fathi, Bounds on optimal transport maps onto log-concave measures, Journal of Differential Equations 271 (2021), 1007–1022.
  • [CFT14] Nicolas Courty, Rémi Flamary, and Devis Tuia, Domain adaptation with regularized optimal transport, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 274–289.
  • [CFTR16] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy, Optimal transport for domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016), no. 9, 1853–1865.
  • [CNW21] Hong-Bin Chen and Jonathan Niles-Weed, Asymptotics of smoothed Wasserstein distances, Potential Analysis (2021), 1–25.
  • [CRL+20] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré, Faster Wasserstein distance estimation with the Sinkhorn divergence, International Conference on Neural Information Processing Systems, 2020.
  • [DBGL19] Eustasio Del Barrio, Paula Gordaliza, and Jean-Michel Loubes, A central limit theorem for $L_{p}$ transportation cost on the real line with application to fairness assessment in machine learning, Information and Inference: A Journal of the IMA 8 (2019), no. 4, 817–849.
  • [dBGM99] Eustasio del Barrio, Evarist Giné, and Carlos Matrán, Central limit theorems for the Wasserstein distance between the empirical and the true distributions, The Annals of Probability (1999), 1009–1071.
  • [dBGSL21a] Eustasio del Barrio, Alberto González-Sanz, and Jean-Michel Loubes, A central limit theorem for semidiscrete Wasserstein distances, arXiv:2105.11721 (2021).
  • [dBGSL21b] by same author, Central limit theorems for general transportation costs, arXiv:2102.06379 (2021).
  • [dBGU05] Eustasio del Barrio, Evarist Giné, and Frederic Utzet, Asymptotics for $L_{2}$ functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances, Bernoulli 11 (2005), no. 1, 131–189.
  • [dBL19] Eustasio del Barrio and Jean-Michel Loubes, Central limit theorems for empirical transportation cost in general dimension, The Annals of Probability 47 (2019), no. 2, 926–951.
  • [DGS21] Nabarun Deb, Promit Ghosal, and Bodhisattva Sen, Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections, arXiv:2107.01718 (2021).
  • [DNS09] Jean Dolbeault, Bruno Nazaret, and Giuseppe Savaré, A new class of transport distances between measures, Calculus of Variations and Partial Differential Equations 34 (2009), no. 2, 193–231.
  • [DSS13] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt, Constructive quantization: Approximation by empirical measures, Annales de l’Institut Henri Poincaré Probabilités et Statistiques 49 (2013), no. 4, 1183–1203.
  • [Dud69] Richard M. Dudley, The speed of mean Glivenko-Cantelli convergence, The Annals of Mathematical Statistics 40 (1969), no. 1, 40–50.
  • [Dud02] by same author, Real analysis and probability, Cambridge University Press, 2002.
  • [Düm93] Lutz Dümbgen, On nondifferentiable functions and the bootstrap, Probability Theory and Related Fields 95 (1993), no. 1, 125–140.
  • [DY95] V. Dobrić and J. E. Yukich, Asymptotics for transportation cost in high dimensions, Journal of Theoretical Probability 8 (1995), no. 1, 97–118.
  • [EG18] Lawrence C. Evans and Ronald F. Gariepy, Measure theory and fine properties of functions, Routledge, 2018.
  • [FG15] Nicolas Fournier and Arnaud Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields 162 (2015), 707–738.
  • [FS19] Zheng Fang and Andres Santos, Inference on directionally differentiable functions, The Review of Economic Studies 86 (2019), 377–412.
  • [GAA+17] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville, Improved training of Wasserstein GANs, International Conference on Neural Information Processing Systems, 2017.
  • [GG20] Ziv Goldfeld and Kristjan Greenewald, Gaussian-smoothed optimal transport: Metric structure and statistical efficiency, International Conference on Artificial Intelligence and Statistics, 2020.
  • [GGK20] Ziv Goldfeld, Kristjan Greenewald, and Kengo Kato, Asymptotic guarantees for generative modeling based on the smooth Wasserstein distance, International Conference on Neural Information Processing Systems, 2020.
  • [GGNP20] Ziv Goldfeld, Kristjan H. Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy, Convergence of smoothed empirical measures with applications to entropy estimation, IEEE Transactions on Information Theory 66 (2020), no. 7, 4368–4391.
  • [GK16] Rui Gao and Anton J. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, arXiv:1604.02199 (2016).
  • [GM96] Wilfrid Gangbo and Robert J. McCann, The geometry of optimal transportation, Acta Mathematica 177 (1996), no. 2, 113–161.
  • [GMR93] Christian Gourieroux, Alain Monfort, and Eric Renault, Indirect inference, Journal of Applied Econometrics 8 (1993), no. S1, S85–S118.
  • [GN16] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge University Press, 2016.
  • [HKSM22] Shayan Hundrieser, Marcel Klatt, Thomas Staudt, and Axel Munk, A unifying approach to distributional limits for empirical optimal transport, arXiv:2202.12790 (2022).
  • [HMS21] Fang Han, Zhen Miao, and Yandi Shen, Nonparametric mixture MLEs under Gaussian-smoothed optimal transport distance, arXiv:2112.02421 (2021).
  • [JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of the Fokker–Planck equation, SIAM Journal on Mathematical Analysis 29 (1998), no. 1, 1–17.
  • [Kan42] L. V. Kantorovich, On the translocation of masses, USSR Academy of Science (Doklady Akademii Nauk USSR), vol. 37, 1942, pp. 199–201.
  • [Kat09] Kengo Kato, Asymptotics for argmin processes: Convexity arguments, Journal of Multivariate Analysis 100 (2009), no. 8, 1816–1829.
  • [Kos08] Michael R. Kosorok, Bootstrapping the Grenander estimator, Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen, Institute of Mathematical Statistics, 2008, pp. 282–292.
  • [Led19] Michel Ledoux, On optimal matching of Gaussian samples, Journal of Mathematical Sciences 238 (2019), 495–522.
  • [Lei20] Jing Lei, Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces, Bernoulli 26 (2020), no. 1, 767–798.
  • [LV07] László Lovász and Santosh Vempala, The geometry of logconcave functions and sampling algorithms, Random Structures and Algorithms 30 (2007), no. 3, 307–358.
  • [MBNWW21] Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman, Plugin estimation of smooth optimal transport maps, arXiv:2107.12364 (2021).
  • [MEK18] Peyman Mohajerin Esfahani and Daniel Kuhn, Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations, Mathematical Programming 171 (2018), 115–166.
  • [Mil09] Emanuel Milman, On the role of convexity in isoperimetry, spectral gap and concentration, Inventiones Mathematicae 177 (2009), no. 1, 1–43.
  • [MNW21] Tudor Manole and Jonathan Niles-Weed, Sharp convergence rates for empirical optimal transport with smooth costs, arXiv:2106.13181 (2021).
  • [NGK21] Sloan Nietert, Ziv Goldfeld, and Kengo Kato, Smooth $p$-Wasserstein distance: structure, empirical approximation, and statistical applications, International Conference on Machine Learning, 2021.
  • [Nic09] Richard Nickl, On convergence and convolutions of random signed measures, Journal of Theoretical Probability 22 (2009), no. 1, 38–56.
  • [Pey18] Rémi Peyre, Comparison between $W_{2}$ distance and $\dot{H}^{-1}$ norm, and localization of Wasserstein distance, ESAIM: Control, Optimisation and Calculus of Variations 24 (2018), no. 4, 1489–1501.
  • [Pol80] David Pollard, The minimum distance method of testing, Metrika 27 (1980), no. 1, 43–70.
  • [PS80] William C. Parr and William R. Schucany, Minimum distance and robust estimation, Journal of the American Statistical Association 75 (1980), no. 371, 616–624.
  • [PW16] Yury Polyanskiy and Yihong Wu, Wasserstein continuity of entropy and outer bounds for interference channels, IEEE Transactions on Information Theory 62 (2016), 3992–4002.
  • [Röm04] Werner Römisch, Delta method, infinite dimensional, Encyclopedia of Statistical Sciences, Wiley, 2004.
  • [RTC17] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi, On Wasserstein two-sample testing and related families of nonparametric tests, Entropy 19 (2017).
  • [RTG00] Y. Rubner, C. Tomasi, and L. J. Guibas, The earth mover’s distance as a metric for image retrieval, International Journal of Computer Vision 40 (2000), no. 2, 99–121.
  • [San15] F. Santambrogio, Optimal transport for applied mathematicians, Birkhäuser, 2015.
  • [San17] Filippo Santambrogio, Euclidean, metric, and Wasserstein gradient flows: An overview, Bulletin of Mathematical Sciences 7 (2017), no. 1, 87–154.
  • [SGK21] Ritwik Sadhu, Ziv Goldfeld, and Kengo Kato, Limit distribution theory for the smooth 1-Wasserstein distance with applications, arXiv:2107.13494 (2021).
  • [Sha90] Alexander Shapiro, On concepts of directional differentiability, Journal of Optimization Theory and Applications 66 (1990), 477–487.
  • [Sha91] by same author, Asymptotic analysis of stochastic programs, Annals of Operations Research 30 (1991), 169–186.
  • [SL11] Roman Sandler and Michael Lindenbaum, Nonnegative matrix factorization with earth mover’s distance metric for image analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011), no. 8, 1590–1602.
  • [SM18] Max Sommerfeld and Axel Munk, Inference for empirical Wasserstein distances on finite spaces, Journal of the Royal Statistical Society Series B 80 (2018), 219–238.
  • [SW14] Adrien Saumard and Jon A. Wellner, Log-concavity and strong log-concavity: A review, Statistics Surveys 8 (2014), 45–114.
  • [Tal92] Michel Talagrand, Matching random samples in many dimensions, The Annals of Applied Probability (1992), 846–856.
  • [Tal94] by same author, The transportation cost from the uniform measure to the empirical measure in dimension $\geq 3$, The Annals of Probability (1994), 919–959.
  • [TBGS18] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf, Wasserstein auto-encoders, International Conference on Learning Representations, 2018.
  • [TSM19] Carla Tameling, Max Sommerfeld, and Axel Munk, Empirical optimal transport on countable metric spaces: distributional limits and statistical applications, The Annals of Applied Probability 29 (2019), 2744–2781.
  • [vdV96] Aad van der Vaart, New Donsker classes, The Annals of Probability 24 (1996), no. 4, 2128–2140.
  • [vdV98] Aad W. van der Vaart, Asymptotic statistics, Cambridge University Press, 1998.
  • [vdVW96] Aad W. van der Vaart and Jon A. Wellner, Weak convergence and empirical processes: With applications to statistics, Springer, 1996.
  • [Vil03] Cédric Villani, Topics in optimal transportation, American Mathematical Society, 2003.
  • [Vil08] Cédric Villani, Optimal transport: Old and new, Springer, 2008.
  • [WB19a] Jonathan Weed and Francis Bach, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance, Bernoulli 25 (2019), no. 4A, 2620–2648.
  • [WB19b] Jonathan Weed and Quentin Berthet, Estimation of smooth densities in Wasserstein distance, Conference on Learning Theory, 2019.
  • [Wol57] Jacob Wolfowitz, The minimum distance method, The Annals of Mathematical Statistics (1957), 75–88.
  • [ZCR21] Yixing Zhang, Xiuyuan Cheng, and Galen Reeves, Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples, Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021, pp. 2422–2430.