Limit distribution theory for smooth $p$-Wasserstein distances
Abstract.
The Wasserstein distance is a metric on a space of probability measures that has seen a surge of applications in statistics, machine learning, and applied mathematics. However, statistical aspects of Wasserstein distances are bottlenecked by the curse of dimensionality, whereby the number of data points needed to accurately estimate them grows exponentially with dimension. Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality, giving rise to a parametric convergence rate in any dimension, while preserving the Wasserstein metric and topological structure. To facilitate valid statistical inference, in this work, we develop a comprehensive limit distribution theory for the empirical smooth Wasserstein distance. The limit distribution results leverage the functional delta method after embedding the domain of the Wasserstein distance into a certain dual Sobolev space, characterizing its Hadamard directional derivative for the dual Sobolev norm, and establishing weak convergence of the smooth empirical process in the dual space. To estimate the distributional limits, we also establish consistency of the nonparametric bootstrap. Finally, we use the limit distribution theory to study applications to generative modeling via minimum distance estimation with the smooth Wasserstein distance, showing asymptotic normality of optimal solutions for the quadratic cost.
Key words and phrases:
Dual Sobolev space, Gaussian smoothing, functional delta method, limit distribution theory, Wasserstein distance
1991 Mathematics Subject Classification:
60F05, 62E20, and 62G09
1. Introduction
1.1. Overview
The Wasserstein distance is an instance of the Kantorovich optimal transport problem [Kan42], which defines a metric on a space of probability measures. Specifically, for $p \geq 1$, the $p$-Wasserstein distance between two Borel probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ with finite $p$th moments is defined by
(1) $W_p(\mu,\nu) := \inf_{\pi \in \Pi(\mu,\nu)} \left( \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\pi(x,y) \right)^{1/p},$
where $\Pi(\mu,\nu)$ is the set of couplings (or transportation plans) of $\mu$ and $\nu$. The Wasserstein distance has seen a surge of applications in statistics, machine learning, and applied mathematics, ranging from generative modeling [ACB17, GAA+17, TBGS18], image recognition [RTG00, SL11], and domain adaptation [CFT14, CFTR16] to robust optimization [GK16, MEK18, BMZ21] and partial differential equations [JKO98, San17]. The widespread applicability of the Wasserstein distance is driven by an array of desirable properties, including its metric structure ($W_p$ metrizes weak convergence plus convergence of $p$th moments), a convenient dual form, robustness to support mismatch, and the rich geometry it induces on a space of probability measures. We refer to [Vil03, Vil08, AGS08, San15] as standard references on optimal transport theory.
However, statistical aspects of Wasserstein distances are bottlenecked by the curse of dimensionality, whereby the number of data points needed to accurately estimate them grows exponentially with dimension. Specifically, for the empirical distribution $\mu_n := n^{-1}\sum_{i=1}^n \delta_{X_i}$ of $n$ independent observations from a distribution $\mu$ on $\mathbb{R}^d$, it is known that $\mathbb{E}[W_p(\mu_n,\mu)]$ scales as $n^{-1/d}$ for $d \geq 3$ under moment conditions [DSS13, BLG14, FG15, WB19a, Lei20]. This slow rate renders performance guarantees in terms of $W_p$ all but vacuous when $d$ is large. It is also a roadblock towards a more delicate statistical analysis concerning limit distributions, bootstrap, and valid inference.
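To make the rate gap concrete, here is a small numerical sketch (ours, not from the paper): for uniform empirical measures on $n$ points each, $W_1$ can be computed exactly by solving a linear assignment problem on the pairwise cost matrix. The helper name `empirical_w1` and the sample sizes are our own choices.

```python
# Illustration of the curse of dimensionality for empirical W_1 (not from
# the paper). For two empirical measures with n atoms each and uniform
# weights, W_1 is exactly the optimal assignment cost averaged over atoms.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def empirical_w1(x, y):
    """Exact W_1 between uniform empirical measures on the rows of x and y."""
    cost = cdist(x, y)                  # pairwise Euclidean costs
    r, c = linear_sum_assignment(cost)  # optimal one-to-one coupling
    return cost[r, c].mean()

rng = np.random.default_rng(0)
n = 200
for d in (1, 10):
    x = rng.uniform(size=(n, d))
    y = rng.uniform(size=(n, d))
    print(d, empirical_w1(x, y))  # distance is far larger when d = 10
```

With both samples drawn from the uniform distribution on the cube, the empirical distance decays like $n^{-1/2}$ when $d = 1$ but only like $n^{-1/d}$ in higher dimension, which the printed values reflect.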
Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality of empirical $W_p$ [GGNP20, GG20, GGK20, SGK21, NGK21]. For $\sigma > 0$, the smooth $p$-Wasserstein distance is defined as $W_p^{(\sigma)}(\mu,\nu) := W_p(\mu * \mathcal{N}_\sigma, \nu * \mathcal{N}_\sigma)$, where $*$ denotes convolution and $\mathcal{N}_\sigma := \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$ is the isotropic Gaussian distribution with variance parameter $\sigma^2$. For sufficiently sub-Gaussian $\mu$, [GGNP20] showed that the expected smooth distance between $\mu_n$ and $\mu$ exhibits the parametric convergence rate, i.e., $\mathbb{E}[W_1^{(\sigma)}(\mu_n,\mu)] = O(n^{-1/2})$ in any dimension. This is a significant departure from the $n^{-1/d}$ rate in the unsmoothed case. [GG20] further showed that $W_1^{(\sigma)}$ maintains the metric and topological structure of $W_1$ and is able to approximate it within a gap that shrinks with $\sigma$. The structural properties and fast empirical convergence rates were later extended to $W_p^{(\sigma)}$ in [NGK21]. Other follow-up works explored relations between $W_p^{(\sigma)}$ and maximum mean discrepancies [ZCR21], analyzed its rate of decay as $\sigma \to \infty$ [CNW21], and adopted it as a performance metric for nonparametric mixture model estimation [HMS21].
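The definition lends itself to simple simulation (a sketch under our own conventions, not code from the paper): adding independent $\mathcal{N}(0,\sigma^2)$ noise to samples from $\mu$ produces samples from $\mu * \mathcal{N}_\sigma$, and in one dimension the empirical $W_1$ is available in closed form. For two unit-variance Gaussians shifted by $0.5$, both $W_1$ and $W_1^{(\sigma)}$ equal $0.5$, since convolving both measures with the same Gaussian preserves the shift.

```python
# Sketch (ours): estimating the smooth 1-Wasserstein distance in dimension 1
# by noise injection. Adding N(0, sigma^2) noise to samples from mu yields
# samples from mu * N_sigma; scipy computes the exact 1D W_1 between the
# resulting empirical measures.
import numpy as np
from scipy.stats import wasserstein_distance

def smooth_w1_1d(x, y, sigma, rng):
    xs = x + rng.normal(0.0, sigma, size=x.shape)  # samples from mu * N_sigma
    ys = y + rng.normal(0.0, sigma, size=y.shape)  # samples from nu * N_sigma
    return wasserstein_distance(xs, ys)            # exact W_1 in 1D

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)   # mu = N(0, 1)
y = rng.normal(0.5, 1.0, size=5000)   # nu = N(0.5, 1): W_1(mu, nu) = 0.5
print(smooth_w1_1d(x, y, sigma=1.0, rng=rng))  # close to 0.5
```

The estimate concentrates around the population value at the parametric rate, in line with the $n^{-1/2}$ behavior discussed above.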
A limit distribution theory for $W_1^{(\sigma)}$ was developed in [GGK20, SGK21], where the scaled empirical distance $\sqrt{n}\,W_1^{(\sigma)}(\mu_n,\mu)$ was shown to converge in distribution to the supremum of a tight Gaussian process in every dimension under mild moment conditions. This result relies on the dual formulation of $W_1$ as an integral probability metric (IPM) over the class of 1-Lipschitz functions. Gaussian smoothing shrinks the function class to that of 1-Lipschitz functions convolved with a Gaussian density, which is shown to be $\mu$-Donsker in every dimension, thereby yielding the limit distribution. Extending these results to empirical $W_p^{(\sigma)}$ with $p > 1$, however, requires substantially new ideas due to the lack of an IPM structure. Consequently, works exploring $W_p^{(\sigma)}$ with $p > 1$, such as [ZCR21, NGK21], did not contain limit distribution results for it and this question remained largely open.
The present paper closes this gap and provides a comprehensive limit distribution theory for empirical $W_p^{(\sigma)}$ with $p > 1$. Our main limit distribution results are summarized in the following theorem, where the ‘null’ refers to when $\mu = \nu$, while the ‘alternative’ corresponds to $\mu \neq \nu$. In all that follows, the dimension $d$ is arbitrary.
Theorem 1.1 (Main results).
Let $p \in (1,\infty)$, $\sigma > 0$, and $\mu, \nu$ be Borel probability measures on $\mathbb{R}^d$ with finite $p$th moments. Let $\mu_n$ and $\nu_n$ be the empirical distributions of independent observations $X_1,\dots,X_n \sim \mu$ and $Y_1,\dots,Y_n \sim \nu$. Suppose that $\mu$ satisfies Condition (4) ahead (which requires $\mu$ to be sub-Gaussian).
-
(i)
(One-sample null case) We have
where is a centered Gaussian process whose paths are linear and continuous with respect to (w.r.t.) the Sobolev seminorm . Here is the conjugate index of , i.e., .
-
(ii)
(Two-sample null case) If $\mu = \nu$, then we have
where $G'_\mu$ is an independent copy of $G_\mu$.
-
(iii)
(One-sample alternative case) If $\mu \neq \nu$ and $\nu$ is sub-Weibull, then we have
where $\varphi$ is an optimal transport potential from $\mu * \mathcal{N}_\sigma$ to $\nu * \mathcal{N}_\sigma$ for $W_p$, and $\phi_\sigma$ is the Gaussian density of $\mathcal{N}_\sigma$.
-
(iv)
(Two-sample alternative case) If $\mu \neq \nu$ and $\nu$ satisfies Condition (4), then we have
where $\varphi^c$ is the $c$-transform of $\varphi$ for the cost function $c$ from Lemma 2.1.
Parts (i) and (ii) show that the null limit distributions are non-Gaussian. On the other hand, Parts (iii) and (iv) establish asymptotic normality of empirical $W_p^{(\sigma)}$ under the alternative. Notably, these results have the correct centering, namely the population distance $W_p^{(\sigma)}(\mu,\nu)$, which enables us to construct confidence intervals for $W_p^{(\sigma)}(\mu,\nu)$.
The proof strategy for Theorem 1.1 differs from existing approaches to limit distribution theory for empirical $W_p$ for general distributions. In fact, an analog of Theorem 1.1 is not known to hold for classic $W_p$ in this generality, except for the special case where the distributions are discrete (see the literature review below for details). The key insight is to regard $W_p^{(\sigma)}$ as a functional defined on a subset of a certain dual Sobolev space. We show that the smooth empirical process $\sqrt{n}(\mu_n - \mu) * \mathcal{N}_\sigma$ converges weakly in the dual Sobolev space and that the functional is Hadamard (directionally) differentiable w.r.t. the dual Sobolev norm. We then employ the extended functional delta method [Sha90, Rö04] to obtain the limit distribution of one- and two-sample empirical $W_p^{(\sigma)}$ under both the null and the alternative.
The limit distributions in Theorem 1.1 are non-pivotal in the sense that they depend on the population distributions $\mu$ and $\nu$, which are unknown in practice. To facilitate statistical inference using $W_p^{(\sigma)}$, we employ the bootstrap to estimate the limit distributions and prove its consistency for each case of Theorem 1.1. Under the alternative, the consistency follows from the linearity of the Hadamard derivative. Under the null, where the Hadamard (directional) derivative is nonlinear, the bootstrap consistency is not obvious but still holds. This is somewhat surprising in light of [Düm93, FS19], where it is demonstrated that the bootstrap, in general, fails to be consistent for functionals whose Hadamard directional derivatives are nonlinear (cf. Proposition 1 in [Düm93] or Corollary 3.1 in [FS19]). Nevertheless, our application of the bootstrap differs from [Düm93, FS19], so there is no contradiction, and the specific structure of the Hadamard derivative of $W_p^{(\sigma)}$ allows us to establish consistency under the null (see the discussion after Proposition 3.8 for more details). These bootstrap consistency results enable constructing confidence intervals for $W_p^{(\sigma)}(\mu,\nu)$ and using it to test the equality of distributions.
As an application of the limit theory, we study implicit generative modeling under the minimum distance estimation (MDE) framework [Wol57, Pol80, PS80]. MDE extends the maximum-likelihood principle beyond the KL divergence and applies to models supported on low-dimensional manifolds [ACB17] (in which case the KL divergence is not well-defined), as well as to cases where the likelihood function is intractable [GMR93]. For MDE with $W_p^{(\sigma)}$, we establish limit distribution results for the optimal solution and the smooth $p$-Wasserstein error. Our results hold in arbitrary dimension, again contrasting the classic case where analogous distributional limits for MDE with $W_p$ are known only for $d = 1$ [BJGR19]. Remarkably, when $p = 2$, the Hilbertian structure of the underlying dual Sobolev space allows showing asymptotic normality of the MDE solution.
1.2. Literature review
Analysis of empirical Wasserstein distances, or more generally empirical optimal transport distances, has been an active research area in the statistics and probability theory literature. In particular, significant attention was devoted to rates of convergence and exact asymptotics [Dud69, AKT84, Tal92, Tal94, DY95, BdMM02, BGV07, DSS13, BB13, BLG14, FG15, WB19a, AST19, Led19, BL21, Lei20, CRL+20, MNW21, DGS21, MBNWW21]. As noted before, the empirical Wasserstein distance suffers from the curse of dimensionality, namely, $\mathbb{E}[W_p(\mu_n,\mu)] \asymp n^{-1/d}$ whenever $d \geq 3$. This rate is known to be sharp in general [Dud69]. The recent work of [CRL+20, MNW21] discovered that the rate of estimating $W_p(\mu,\nu)$ can be improved under the alternative $\mu \neq \nu$, provided the population densities are sufficiently smooth; the attainable exponent then depends on the smoothness and the dimension. Their insight is to use the duality formula for $W_p$ and exploit regularity of optimal transport potentials. [MNW21] also derive matching minimax lower bounds up to log factors under some technical conditions.
Another central problem that has seen rapid development is limit distribution theory for empirical Wasserstein distances. However, except for the two special cases discussed next, to the best of our knowledge, there is no proven analog of our Theorem 1.1 for classic Wasserstein distances, i.e., a comprehensive limit distribution theory for empirical $W_p$ that holds for general $p$ and $d$. The first case for which the limit distribution is well understood is when $d = 1$. Then, $W_p$ reduces to the $L^p$ distance between quantile functions, and further simplifies to the $L^1$ distance between distribution functions when $p = 1$. Building on such explicit expressions, [dBGM99] and [dBGU05] derived null limit distributions in $d = 1$ for $p = 1$ and $p = 2$, respectively. More recently, under the alternative ($\mu \neq \nu$), [DBGL19] derived a central limit theorem (CLT) for general $p$ when $d = 1$. The second case where a limit distribution theory for empirical $W_p$ is available is when the distributions are discrete. If the distributions are finitely discrete, i.e., $\mu = \sum_i a_i \delta_{x_i}$ and $\nu = \sum_j b_j \delta_{y_j}$ for two simplex vectors $a$ and $b$, then $W_p(\mu,\nu)$ can be seen as a function of those simplex vectors $a$ and $b$. Leveraging this, [SM18] applied the delta method to obtain limit distributions for empirical $W_p$ in the finitely discrete case. An extension to countably infinite supports was provided in [TSM19], while [dBGSL21a] treated the semidiscrete case where one of the distributions is finitely discrete but the other is general.
Except for these two special cases, limit distributions for Wasserstein distances are less understood. To avoid repetition, we focus our discussion here on the one-sample case. In [dBL19], a CLT for $\sqrt{n}\big(W_2^2(\mu_n,\nu) - \mathbb{E}[W_2^2(\mu_n,\nu)]\big)$ is derived in any dimension, but the limit Gaussian distribution degenerates to $0$ when $\mu = \nu$; see also [dBGSL21b] for an extension to general $p$. Notably, the centering constant there is the expected empirical Wasserstein distance $\mathbb{E}[W_2^2(\mu_n,\nu)]$, which in general cannot be replaced with the (more natural) population distance $W_2^2(\mu,\nu)$. The recent preprint [MBNWW21] addressed this gap and established a CLT with the population centering for a wavelet-based estimator of $W_2^2(\mu,\nu)$, assuming that $\mu, \nu$ are absolutely continuous w.r.t. the Lebesgue measure with smooth and strictly positive densities on compact supports. Following arguments similar to [dBL19], they first derive a CLT with the expected-value centering and then use the strict positivity of the densities and higher-order regularity of optimal transport potentials to control the bias term as $n \to \infty$.
Finally, we note that our proof techniques differ from the aforementioned arguments for classic $W_p$. Specifically, as opposed to the two-step approach of [MBNWW21] described above, we directly prove asymptotic normality for empirical $W_p^{(\sigma)}$ under the alternative. Their derivation does not apply to our case even when $p = 2$, since their bias bound requires that the densities of $\mu$ and $\nu$ be bounded away from zero on their (compact) supports, which fails to hold after the Gaussian convolution. Our argument also differs from that of [SM18, TSM19], even though they also rely on the functional delta method. Specifically, since we do not assume that $\mu, \nu$ are discrete, they cannot be parameterized by simplex vectors, and hence the application of the functional delta method is nontrivial. Very recently, the independent work [HKSM22] used the extended functional delta method for the supremum functional [CCR20] to derive limit distributions for classic $W_p$, with $p > 1$, for compactly supported distributions under the alternative in low dimensions.
1.3. Organization
The rest of the paper is organized as follows. In Section 2, we collect background material on Wasserstein distances, smooth Wasserstein distances, and dual Sobolev spaces. In Section 3, we prove Theorem 1.1 and explore the validity of the bootstrap for empirical $W_p^{(\sigma)}$. Section 4 presents applications of our limit distribution theory to MDE with $W_p^{(\sigma)}$. Proofs for Sections 3 and 4 can be found in Section 5. Section 6 provides concluding remarks and discusses future research directions. Finally, the Appendix contains additional proofs.
1.4. Notation
Let $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm and inner product, respectively. Let $B(x,r)$ denote the closed ball with center $x$ and radius $r$. We use $\mathrm{Leb}$ for the Lebesgue measure on $\mathbb{R}^d$, while $\delta_x$ denotes the Dirac measure at $x$. Given a finite signed Borel measure $\ell$ on $\mathbb{R}^d$, we identify $\ell$ with the linear functional $f \mapsto \ell(f) := \int f \, d\ell$. Let $\lesssim$ denote inequalities up to some numerical constants. For any $a, b \in \mathbb{R}$, we use the shorthands $a \vee b := \max(a,b)$ and $a \wedge b := \min(a,b)$.
For a topological space $S$, $\mathcal{B}(S)$ and $\mathcal{P}(S)$ denote, respectively, the Borel $\sigma$-field on $S$ and the class of Borel probability measures on $S$. We write $\mathcal{B} := \mathcal{B}(\mathbb{R}^d)$ and $\mathcal{P} := \mathcal{P}(\mathbb{R}^d)$, and for $p \geq 1$ use $\mathcal{P}_p$ to denote the subset of $\mathcal{P}$ with finite $p$th moment. For $\mu, \nu \in \mathcal{P}$, we write $\nu \ll \mu$ for the absolute continuity of $\nu$ w.r.t. $\mu$, and use $d\nu/d\mu$ for the corresponding Radon-Nikodym derivative. Let $\mu * \nu$ denote the convolution of $\mu$ and $\nu$. Likewise, the convolution of two measurable functions $f$ and $g$ is denoted by $f * g$. We write $\mathcal{N}(0,\Sigma)$ for the centered Gaussian distribution on $\mathbb{R}^d$ with covariance matrix $\Sigma$, and use $\phi_\sigma$ for the density of $\mathcal{N}_\sigma = \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$. We write $\mu \otimes \nu$ for the product measure of $\mu$ and $\nu$. Let $\Rightarrow$, $\xrightarrow{d}$, and $\xrightarrow{p}$ denote weak convergence of probability measures, convergence in distribution of random variables, and convergence in probability, respectively. When necessary, convergence in distribution is understood in the sense of Hoffmann-Jørgensen (cf. Chapter 2 in [vdVW96]).
Throughout the paper, we assume that the observations are the coordinate projections of a product probability space $(\Omega, \mathcal{A}, \mathbb{P})$. To generate auxiliary random variables, we extend the probability space as $(\Omega \times [0,1], \mathcal{A} \otimes \mathcal{B}([0,1]), \mathbb{P} \otimes \mathrm{Leb}_{[0,1]})$, where $\mathrm{Leb}_{[0,1]}$ denotes the Lebesgue measure on $[0,1]$. For $\alpha > 0$, let $\psi_\alpha(t) := e^{t^\alpha} - 1$ for $t \geq 0$, and recall that the corresponding Orlicz (quasi-)norm of a real-valued random variable $\xi$ is defined as $\|\xi\|_{\psi_\alpha} := \inf\{ C > 0 : \mathbb{E}[\psi_\alpha(|\xi|/C)] \leq 1 \}$. A Borel probability measure $\mu$ is called $\alpha$-sub-Weibull if $\big\| \|X\| \big\|_{\psi_\alpha} < \infty$ for $X \sim \mu$. We say that $\mu$ is sub-Weibull if it is $\alpha$-sub-Weibull for some $\alpha > 0$. Finally, $\mu$ is sub-Gaussian if it is $2$-sub-Weibull.
For an open set $U$ in a Euclidean space, $C_c^\infty(U)$ denotes the space of compactly supported, infinitely differentiable, real functions on $U$. We write $C_c^\infty := C_c^\infty(\mathbb{R}^d)$. For any $\gamma \in \mathcal{P}$ and $1 \leq s < \infty$, let $L^s(\gamma)$ denote the space of measurable maps $f: \mathbb{R}^d \to \mathbb{R}$ such that $\|f\|_{L^s(\gamma)} := (\int |f|^s \, d\gamma)^{1/s} < \infty$; when $\gamma = \mathrm{Leb}$ we use the shorthand $L^s$. Recall that $L^s(\gamma)$ is a Banach space. Finally, for a subset $A$ of a topological space $S$, let $\mathrm{cl}_S(A)$ denote the closure of $A$ in $S$; if the space is clear from the context, then we simply write $\mathrm{cl}(A)$ for the closure.
2. Background
2.1. Wasserstein distances and their smooth variants
Recall that, for $p \geq 1$, the $p$-Wasserstein distance between $\mu, \nu \in \mathcal{P}_p$ is defined in (1). Some basic properties of $W_p$ are (cf., e.g., [Vil03, AGS08, Vil08, San15]): (i) the infimum in the definition of $W_p$ is attained, i.e., there exists a coupling $\pi^\star \in \Pi(\mu,\nu)$ such that $W_p(\mu,\nu)^p = \int \|x - y\|^p \, d\pi^\star(x,y)$, and the optimal coupling is unique when $p > 1$ and $\mu$ is absolutely continuous; (ii) $W_p$ is a metric on $\mathcal{P}_p$; and (iii) convergence in $W_p$ is equivalent to weak convergence plus convergence of $p$th moments: $W_p(\mu_n,\mu) \to 0$ if and only if $\mu_n \Rightarrow \mu$ and $\int \|x\|^p \, d\mu_n \to \int \|x\|^p \, d\mu$.
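To illustrate properties (i)-(iii) numerically (our own sketch, not from the paper): in one dimension the optimal coupling is the monotone (quantile) coupling, so $W_p$ between empirical measures with equally many atoms reduces to matching sorted samples, and the metric axioms can be checked directly.

```python
# Sketch (ours): in dimension 1, W_p between two uniform empirical measures
# with n atoms each is obtained by matching sorted samples, since the
# monotone (quantile) coupling is optimal.
import numpy as np

def w_p_1d(x, y, p):
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
x, y, z = (rng.normal(loc, 1.0, size=500) for loc in (0.0, 1.0, 3.0))
# Property (ii): W_p is a metric, so the triangle inequality holds exactly
# for these genuine Wasserstein distances between empirical measures.
assert w_p_1d(x, z, 2) <= w_p_1d(x, y, 2) + w_p_1d(y, z, 2) + 1e-9
```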
The proof of the limit distribution for empirical $W_p^{(\sigma)}$ under the alternative hinges on duality theory for $W_p$, which we summarize below. For a function $\varphi: \mathbb{R}^d \to \mathbb{R} \cup \{-\infty\}$ and a cost function $c: \mathbb{R}^d \times \mathbb{R}^d \to [0,\infty)$, the $c$-transform of $\varphi$ is defined by
$\varphi^c(y) := \inf_{x \in \mathbb{R}^d} \big\{ c(x,y) - \varphi(x) \big\}, \quad y \in \mathbb{R}^d.$
A function $\varphi$, not identically $-\infty$, is called $c$-concave if $\varphi = \psi^c$ for some function $\psi$.
Lemma 2.1 (Duality for ).
Let $p > 1$, $\mu, \nu \in \mathcal{P}_p$, and set the cost function to $c(x,y) := \|x - y\|^p$.
- (i)
(Kantorovich duality) $W_p(\mu,\nu)^p = \sup_{\varphi\ c\text{-concave}} \int \varphi \, d\mu + \int \varphi^c \, d\nu$, and the supremum is attained by some $c$-concave $\varphi$; any such maximizer is called an optimal transport potential from $\mu$ to $\nu$ for $W_p$.
-
(ii)
(Theorem 3.3 in [GM96]) Let , suppose that is -concave, and take as the convex hull of . Then is locally Lipschitz on the interior of .
-
(iii)
(Corollary 2.7 in [dBGSL21b]) If $\mu$ is absolutely continuous and supported on an open connected set $U$, then the optimal transport potential from $\mu$ to $\nu$ for $W_p$ is unique on $U$ up to additive constants, i.e., if $\varphi_1$ and $\varphi_2$ are optimal transport potentials, then there exists a constant $a \in \mathbb{R}$ such that $\varphi_1 = \varphi_2 + a$ on $U$.
The smooth Wasserstein distance convolves the distributions with an isotropic Gaussian kernel. Gaussian convolution levels out local irregularities in the distributions, while largely preserving the structure of classic $W_p$. Recalling that $\mathcal{N}_\sigma = \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$, the smooth $p$-Wasserstein distance is defined as follows.
Definition 2.1 (Smooth Wasserstein distance).
Let $p \geq 1$ and $\sigma > 0$. For $\mu, \nu \in \mathcal{P}_p$, the smooth $p$-Wasserstein distance between $\mu$ and $\nu$ with smoothing parameter $\sigma$ is
$W_p^{(\sigma)}(\mu,\nu) := W_p(\mu * \mathcal{N}_\sigma, \nu * \mathcal{N}_\sigma).$
The smooth Wasserstein distance was studied in [GGNP20, GG20, GGK20, SGK21, NGK21] for structural properties and empirical convergence rates. We recall two basic properties: (i) $W_p^{(\sigma)}$ is a metric on $\mathcal{P}_p$ that generates the same topology as classic $W_p$; (ii) for $0 \leq \sigma_1 \leq \sigma_2$ and $\mu, \nu \in \mathcal{P}_p$, the gap between $W_p^{(\sigma_1)}(\mu,\nu)$ and $W_p^{(\sigma_2)}(\mu,\nu)$ is at most a constant multiple of $\sqrt{\sigma_2^2 - \sigma_1^2}$, for a constant that depends only on $p$ and $d$. In particular, $W_p^{(\sigma)}(\mu,\nu)$ is continuous and monotonically non-increasing in $\sigma$ with $\lim_{\sigma \downarrow 0} W_p^{(\sigma)}(\mu,\nu) = W_p(\mu,\nu)$. See [NGK21] for additional structural results, including an explicit expression in special cases and weak convergence of smooth optimal couplings. For empirical convergence, it was shown in [NGK21] that $\mathbb{E}[W_p^{(\sigma)}(\mu_n,\mu)] \lesssim n^{-1/2}$ under appropriate moment assumptions on $\mu$ in any dimension $d$. Versions of this result for special values of $p$ were derived earlier in [GGNP20, GGK20, SGK21].
2.2. Sobolev spaces and their duals
Our proof strategy for the limit distribution results is to regard $W_p^{(\sigma)}$ as a functional defined on a subset of a certain dual Sobolev space. We will show that the smooth empirical process $\sqrt{n}(\mu_n - \mu) * \mathcal{N}_\sigma$ converges weakly in the dual Sobolev space and that the functional is Hadamard (directionally) differentiable w.r.t. the dual Sobolev norm. Given these, the limit distributions in Theorem 1.1 follow via the functional delta method. Here we briefly discuss (homogeneous) Sobolev spaces and their duals.
Definition 2.2 (Homogeneous Sobolev spaces and their duals).
Let $s \in (1,\infty)$ and $\gamma \in \mathcal{P}(\mathbb{R}^d)$.
-
(i)
For a differentiable function $f: \mathbb{R}^d \to \mathbb{R}$, let
$\|f\|_{\dot{H}^{1,s}(\gamma)} := \Big( \int_{\mathbb{R}^d} \|\nabla f\|^s \, d\gamma \Big)^{1/s}$
be the Sobolev seminorm. We define the homogeneous Sobolev space $\dot{H}^{1,s}(\gamma)$ by the completion of $C_c^\infty$ w.r.t. $\|\cdot\|_{\dot{H}^{1,s}(\gamma)}$.
-
(ii)
Let $s'$ be the conjugate index of $s$, i.e., $1/s + 1/s' = 1$. Let $\dot{H}^{-1,s'}(\gamma)$ denote the topological dual of $\dot{H}^{1,s}(\gamma)$. The dual Sobolev norm (dual to $\|\cdot\|_{\dot{H}^{1,s}(\gamma)}$) of a continuous linear functional $\ell$ is defined by
$\|\ell\|_{\dot{H}^{-1,s'}(\gamma)} := \sup\big\{ \ell(f) : f \in C_c^\infty, \ \|f\|_{\dot{H}^{1,s}(\gamma)} \leq 1 \big\}.$
The restriction $f \in C_c^\infty$ can be replaced with $f \in \dot{H}^{1,s}(\gamma)$ in the definition of the dual norm, since $C_c^\infty$ is dense in $\dot{H}^{1,s}(\gamma)$ by construction.
We have defined the homogeneous Sobolev space $\dot{H}^{1,s}(\gamma)$ as the completion of $C_c^\infty$ w.r.t. the seminorm $\|\cdot\|_{\dot{H}^{1,s}(\gamma)}$. It is not immediately clear that the so-constructed space is a function space over $\mathbb{R}^d$. Below we present an explicit construction of $\dot{H}^{1,s}(\gamma)$ when the density of $\gamma$ w.r.t. a reference measure satisfying the $s$-Poincaré inequality is bounded away from zero. To that end, we first define the Poincaré inequality.
Definition 2.3 (Poincaré inequality).
For $s \in [1,\infty)$, a probability measure $\gamma \in \mathcal{P}(\mathbb{R}^d)$ is said to satisfy the $s$-Poincaré inequality if there exists a finite constant $C > 0$ such that
$\int_{\mathbb{R}^d} \Big| f - \int f \, d\gamma \Big|^s d\gamma \leq C \int_{\mathbb{R}^d} \|\nabla f\|^s \, d\gamma, \quad \forall f \in C_c^\infty.$
The smallest constant $C$ satisfying the above is denoted by $C_s(\gamma)$.
The standard Poincaré inequality refers to the $2$-Poincaré inequality. It is known that any log-concave distribution (i.e., a distribution of the form $e^{-V(x)}\,dx$ for some convex function $V$; cf. [LV07, SW14]) satisfies the $s$-Poincaré inequality for any $s \in [1,\infty)$ [Bob99, Mil09]. In particular, the Gaussian distribution satisfies every $s$-Poincaré inequality (see also [Bog98, Corollary 1.7.3]).
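For orientation, the classical Gaussian case can be made explicit (a standard fact, not a statement from the paper):

```latex
% Standard 2-Poincare inequality for the isotropic Gaussian measure
% \gamma = \mathcal{N}(0, \sigma^2 \mathrm{I}_d): for every f \in C_c^\infty,
\int \Big( f - \int f \, d\gamma \Big)^{2} d\gamma
  \;\le\; \sigma^{2} \int \|\nabla f\|^{2} \, d\gamma ,
% with equality for linear f, so the 2-Poincare constant is exactly \sigma^2.
```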
Remark 2.1 (Explicit construction of ).
Suppose that there exists a reference measure , with , that satisfies the -Poincaré inequality. Assume that for some constant (in our applications, or for some ; in either case, the stated assumption is satisfied with or ). Let . Then, is a proper norm on , and the map is an isometry from into . Let be the closure of in under . The inverse map can be extended to as follows. For any , choose such that . Since is Cauchy in and thus in (as ), is Cauchy in by the -Poincaré inequality, so for some . Set and extend by . The space is a Banach space of functions over . Finally, the homogeneous Sobolev space can be constructed as with .
The next lemma summarizes some basic results about the space and -valued random variables that we use in the sequel. The proof can be found in Section A.1.
Lemma 2.2.
Let $s \in (1,\infty)$, with conjugate index $s'$, and let $\gamma \in \mathcal{P}(\mathbb{R}^d)$. The dual space $\dot{H}^{-1,s'}(\gamma)$ is a separable Banach space. The Borel $\sigma$-field on $\dot{H}^{-1,s'}(\gamma)$ coincides with the cylinder $\sigma$-field (the smallest $\sigma$-field that makes the coordinate projections, $\ell \mapsto \ell(f)$ for $f \in \dot{H}^{1,s}(\gamma)$, measurable).
Consider a stochastic process $L = (L(f))_{f \in \dot{H}^{1,s}(\gamma)}$ indexed by $\dot{H}^{1,s}(\gamma)$, i.e., $L(f)$ is measurable for each $f$. The process $L$ can be thought of as a map into $\dot{H}^{-1,s'}(\gamma)$ as long as $L$ has paths in $\dot{H}^{-1,s'}(\gamma)$, i.e., for each fixed sample point, the map $f \mapsto L(f)$ is continuous and linear. The fact that the Borel $\sigma$-field on $\dot{H}^{-1,s'}(\gamma)$ coincides with the cylinder $\sigma$-field guarantees that a stochastic process indexed by $\dot{H}^{1,s}(\gamma)$ with paths in $\dot{H}^{-1,s'}(\gamma)$ is Borel measurable as a map into $\dot{H}^{-1,s'}(\gamma)$.
2.3. $W_p$ and the dual Sobolev norm
In Section 3, we will explore limit distributions for empirical $W_p^{(\sigma)}$. One of the key technical ingredients there is a comparison of the Wasserstein distance with a certain dual Sobolev norm, which we present next.
Proposition 2.1 (Comparison between $W_p$ and dual Sobolev norm; Theorem 5.26 in [DNS09]).
Let $p \in (1,\infty)$, and suppose that $\mu, \nu \ll \gamma$ for some reference measure $\gamma \in \mathcal{P}(\mathbb{R}^d)$. Denote their respective densities by $f_\mu := d\mu/d\gamma$, $f_\nu := d\nu/d\gamma$. If $f_\mu$ or $f_\nu$ is bounded from below by some $\eta > 0$, then
(3) $W_p(\mu,\nu) \leq p\, \eta^{\frac{1}{p}-1}\, \|\mu - \nu\|_{\dot{H}^{-1,p}(\gamma)}.$
Proposition 2.1 follows directly from Theorem 5.26 of [DNS09]. Similar comparison inequalities appear in [Pey18, Led19, WB19b]. We include a self-contained proof of Proposition 2.1 in Section A.2, as some elements of the proof are key to our derivation of the null limit distribution for empirical $W_p^{(\sigma)}$. The proof builds on the Benamou-Brenier dynamic formulation of optimal transport [BB00], which shows that $W_p(\mu,\nu)$ is bounded from above by the length of any absolutely continuous path from $\mu$ to $\nu$ in $\mathcal{P}_p$. The dual Sobolev norm emerges as a bound on the length of the linear interpolation $\mu_t := (1-t)\mu + t\nu$, $t \in [0,1]$.
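Schematically, the argument can be summarized as follows (our own sketch of the standard Benamou-Brenier bound, in slightly simplified notation):

```latex
% Benamou--Brenier: for any path (\mu_t)_{t \in [0,1]} from \mu to \nu
% solving the continuity equation \partial_t \mu_t + \nabla\cdot(v_t\mu_t)=0,
W_p(\mu, \nu) \;\le\; \int_0^1 \| v_t \|_{L^p(\mu_t)} \, dt .
% Along the linear interpolation \mu_t = (1-t)\mu + t\nu, a velocity field
% v_t solving the continuity equation can be chosen so that
% \|v_t\|_{L^p(\mu_t)} is controlled, via the density lower bound \eta,
% by the dual Sobolev norm of \mu - \nu, which yields inequality (3).
```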
3. Limit distribution theory
The goal of this section is to establish Theorem 1.1. The proof relies on two key steps: (i) establish weak convergence of the smooth empirical process $\sqrt{n}(\mu_n - \mu) * \mathcal{N}_\sigma$ in the dual Sobolev space; and (ii) regard $W_p^{(\sigma)}$ as a functional defined on a subset of the dual Sobolev space and characterize its Hadamard directional derivative w.r.t. the corresponding dual Sobolev norm. Given (i) and (ii), the limit distribution results follow from the functional delta method, and the asymptotic normality under the alternative further follows from linearity of the Hadamard directional derivative.
3.1. Preliminaries
Throughout this section, we fix $p \in (1,\infty)$, take $q := p/(p-1)$ as the conjugate index of $p$, and let $\sigma > 0$. For $n \in \mathbb{N}$, let $X_1,\dots,X_n \sim \mu$ and $Y_1,\dots,Y_n \sim \nu$ be independent observations and denote the associated empirical distributions by $\mu_n$ and $\nu_n$, respectively.
3.1.1. Weak convergence of smooth empirical process in dual Sobolev spaces
The first building block of our limit distribution results is the following weak convergence of the smoothed empirical process in the relevant dual Sobolev spaces.
Proposition 3.1 (Weak convergence of smooth empirical process).
Suppose that satisfies
(4) |
Then, the smoothed empirical process $\sqrt{n}(\mu_n - \mu) * \mathcal{N}_\sigma$ converges in distribution as $n \to \infty$ in both of the dual Sobolev spaces above. The limit process in each case is a centered Gaussian process, indexed by the corresponding Sobolev space, with covariance function $(f,g) \mapsto \mathrm{Cov}_\mu\big((f * \phi_\sigma)(X_1), (g * \phi_\sigma)(X_1)\big)$. Here $\mathrm{Cov}_\mu$ denotes the covariance under $\mu$.
The proof of Proposition 3.1 relies on the prior work [NGK21] by a subset of the authors, where it was shown that the relevant smoothed function class is $\mu$-Donsker. The weak convergence in the first space then follows from an argument similar to Lemma 1 in [Nic09]. This, in turn, implies weak convergence in the second space when $\mu$ has mean zero, since in that case the first space is continuously embedded into the second. To account for non-centered distributions, we use a reduction to the mean zero case via translation. See also Remark 5.1 for an alternative proof for $p = 2$ that relies on the CLT in the Hilbert space.
Inspection of the proof of Proposition 3.1 shows that Condition (4) implies
which requires to be sub-Gaussian. It is not difficult to see that Condition (4) is satisfied if is compactly supported or sub-Gaussian with for .
A natural question is whether a condition in the spirit of (4) is necessary for the conclusion of Proposition 3.1 to hold. Indeed, we show that some form of sub-Gaussianity is necessary for the smooth empirical process to converge to zero in the dual Sobolev norm.
Proposition 3.2 (Necessity of sub-Gaussian condition).
The following hold.
-
(i)
If in as a.s., then for any .
-
(ii)
Conversely, if , then in as a.s.
3.1.2. Functional delta method
Another ingredient of our limit distribution results is the (extended) functional delta method [Sha91, Düm93, Rö04, FS19]. Let $\mathfrak{D}$ be a normed space, $\Theta \subset \mathfrak{D}$, and $\Phi: \Theta \to \mathbb{R}$ be a function. Following [Sha90, Rö04], we say that $\Phi$ is Hadamard directionally differentiable at $\theta \in \Theta$ if there exists a map $\Phi'_\theta: T_\Theta(\theta) \to \mathbb{R}$ such that
$\lim_{n \to \infty} \frac{\Phi(\theta + t_n h_n) - \Phi(\theta)}{t_n} = \Phi'_\theta(h)$
for any $h \in T_\Theta(\theta)$, $t_n \downarrow 0$, and $h_n \to h$ in $\mathfrak{D}$ such that $\theta + t_n h_n \in \Theta$ for all $n$. Here $T_\Theta(\theta)$ is the tangent cone to $\Theta$ at $\theta$ defined as
$T_\Theta(\theta) := \Big\{ h \in \mathfrak{D} : h = \lim_{n \to \infty} \tfrac{\theta_n - \theta}{t_n} \ \text{for some sequences } \theta_n \in \Theta,\ \theta_n \to \theta,\ t_n \downarrow 0 \Big\}.$
The tangent cone $T_\Theta(\theta)$ is closed, and if $\Theta$ is convex, then $T_\Theta(\theta)$ coincides with the closure in $\mathfrak{D}$ of $\bigcup_{t > 0} t^{-1}(\Theta - \theta)$ (cf. Proposition 4.2.1 in [AF09]). The derivative $\Phi'_\theta$ is positively homogeneous (i.e., $\Phi'_\theta(\lambda h) = \lambda \Phi'_\theta(h)$ for any $\lambda > 0$ and $h \in T_\Theta(\theta)$) and continuous, but need not be linear.
Lemma 3.1 (Extended functional delta method [Sha91, Düm93, Rö04, FS19]).
Let $\mathfrak{D}$ be a normed space, $\Theta \subset \mathfrak{D}$, and $\Phi: \Theta \to \mathbb{R}$ be a function that is Hadamard directionally differentiable at $\theta \in \Theta$ with derivative $\Phi'_\theta: T_\Theta(\theta) \to \mathbb{R}$. Let $\hat{\theta}_n$ be maps with values in $\Theta$ such that $r_n(\hat{\theta}_n - \theta) \xrightarrow{d} X$ for some constants $r_n \to \infty$ and a Borel measurable map $X$ with values in $T_\Theta(\theta)$. Then, $r_n(\Phi(\hat{\theta}_n) - \Phi(\theta)) \xrightarrow{d} \Phi'_\theta(X)$. Further, if $\Theta$ is convex, then we have the expansion $r_n(\Phi(\hat{\theta}_n) - \Phi(\theta)) = \Phi'_\theta(r_n(\hat{\theta}_n - \theta)) + o_{\mathbb{P}}(1)$.
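A textbook-style finite-dimensional example (ours, not from the paper) illustrates how a nonlinear Hadamard directional derivative produces non-Gaussian limits:

```latex
% Take \mathfrak{D} = \Theta = \mathbb{R}^2 and
% \Phi(\theta) = \max(\theta_1, \theta_2). At any \theta with
% \theta_1 = \theta_2, for t_n \downarrow 0 and h_n \to h,
\frac{\Phi(\theta + t_n h_n) - \Phi(\theta)}{t_n}
  = \max(h_{n,1}, h_{n,2}) \;\longrightarrow\; \max(h_1, h_2),
% so \Phi'_\theta(h) = \max(h_1, h_2) is positively homogeneous and
% continuous but not linear. If \sqrt{n}(\hat\theta_n - \theta) converges
% in distribution to a Gaussian vector Z, Lemma 3.1 yields
\sqrt{n}\big( \Phi(\hat\theta_n) - \Phi(\theta) \big)
  \;\xrightarrow{d}\; \max(Z_1, Z_2),
% a non-Gaussian limit, mirroring the null cases of Theorem 1.1.
```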
Remark 3.1 (Choice of domain ).
The domain $\Theta$ is arbitrary as long as it contains the ranges of $\hat{\theta}_n$ for all $n$, and the tangent cone $T_\Theta(\theta)$ contains the range of the limit variable $X$.
3.2. Limit distributions under the null ($\mu = \nu$)
We shall apply the extended functional delta method to derive the limit distributions of $W_p^{(\sigma)}(\mu_n,\mu)$ and $W_p^{(\sigma)}(\mu_n,\nu_n)$ under the null as $n \to \infty$, namely proving Parts (i) and (ii) of Theorem 1.1. To set up the problem over a (real) vector space, we regard $W_p^{(\sigma)}$ as a map defined on a set of finite signed Borel measures. The comparison result from Proposition 2.1 implies that the latter map is Lipschitz w.r.t. the dual Sobolev norm, and Proposition 3.1 shows that the smooth empirical process is weakly convergent in the dual Sobolev space. These suggest choosing the ambient space to be that dual Sobolev space.
To cover the one- and two-sample cases in a unified manner, consider the same map but in two variables. Take , set , and define the function as
We endow with a product norm (e.g., ). Since the set (and thus ) is convex, the tangent cone coincides with the closure in of . We next verify that is Hadamard directionally differentiable at .
Proposition 3.3 (Hadamard directional derivative of under the null).
Let and . Then, the map , is Hadamard directionally differentiable at with derivative , i.e., for any , and in such that , we have
Proposition 3.3 follows from the next Gâteaux differentiability result for , which may be of independent interest, combined with Lipschitz continuity of w.r.t. (cf. Proposition 2.1).
Lemma 3.2 (Gâteaux directional derivative of ).
Let and be finite signed Borel measures with total mass such that and . Then,
where denotes the right derivative.
Remark 3.2 (Comparison with Exercise 22.20 in [Vil08]).
Exercise 22.20 in [Vil08] states that (in our notation)
(5) |
for any sufficiently regular function with ( is understood as a signed measure ). Theorem 7.26 in [Vil03] provides a proof of the one-sided inequality that the liminf of the left-hand side above is at least , when satisfies and is bounded. The subsequent Remark 7.27 states that “We shall not consider the converse of this inequality, which requires more assumptions and more effort.” However, we could not find references that establish rigorous conditions applicable to our problem under which the derivative formula (5) holds. Lemma 3.2 provides a rigorous justification for this formula and extends it to general .
Given these preparations, the proof of Theorem 1.1 Parts (i) and (ii) is immediate.
Proof of Theorem 1.1, Parts (i) and (ii).
Let denote the weak limit of in ; cf. Proposition 3.1. Recall that is separable (cf. Lemma 2.2), so is a Borel measurable map from into the product space [vdVW96, Lemma 1.4.1]. Since and are independent, by Example 1.4.6 in [vdVW96] and Proposition 3.1, in , where is an independent copy of . Since and is closed in , we see that by the portmanteau theorem.
Applying the functional delta method (Lemma 3.1) and Proposition 3.3, we conclude that
Likewise, we also have
This completes the proof. ∎
3.3. Limit distributions under the alternative
3.3.1. One-sample case
We start from the simpler situation where is known and prove Part (iii) of Theorem 1.1. Our proof strategy is to first establish asymptotic normality of the th power of , from which Part (iii) follows by applying the delta method for . For notational convenience, define
for which one-sample asymptotic normality under the alternative is stated next.
Proposition 3.4.
Suppose that $\mu$ satisfies Condition (4), $\nu$ is sub-Weibull, and $\mu \neq \nu$. Let $\varphi$ be an optimal transport potential from $\mu * \mathcal{N}_\sigma$ to $\nu * \mathcal{N}_\sigma$ for $W_p$. Then, we have
We again use the functional delta method to prove this proposition, but with a slightly different setting. Set as before, and consider the function defined by
where
(6) |
As long as is sub-Weibull (recall that Condition (4) requires to be sub-Gaussian), the set contains . This set is also convex, and so the tangent cone coincides with the closure in of . The corresponding Hadamard directional derivative of is given next.
Proposition 3.5 (Hadamard directional derivative of w.r.t. one argument).
Let , and suppose that are sub-Weibull. Let be an optimal transport potential from to for , which is uniquely determined up to additive constants (see Lemma 2.1 (iii)). Then
-
(i)
, where is the conjugate index of ; and
-
(ii)
the map , is Hadamard directionally differentiable at with derivative , i.e., for any , and in such that , we have
(7)
As in the null case, Part (ii) of Proposition 3.5 follows from the following Gâteaux differentiability result for , combined with local Lipschitz continuity of w.r.t. .
Lemma 3.3 (Gâteaux directional derivative of w.r.t. one argument).
Let and be sub-Weibull. Let be an optimal transport potential from to . Then
where the integral on the right-hand side is well-defined and finite.
Remark 3.3 (Comparisons with Theorem 8.4.7 in [AGS08] and Theorem 5.24 in [San15]).
Theorem 8.4.7 in [AGS08] derives the following differentiability result for $W_p$. Let $(\mu_t)_{t \in I}$ be an absolutely continuous curve in $\mathcal{P}_p$ for some open interval $I \subset \mathbb{R}$, and let $(v_t)_{t \in I}$ be an “optimal” velocity field satisfying the continuity equation $\partial_t \mu_t + \nabla \cdot (v_t \mu_t) = 0$ for $(\mu_t)_{t \in I}$ (see Theorem 8.4.7 in [AGS08] for the precise meaning). Then, for any $\nu \in \mathcal{P}_p$, we have that
(8) $\frac{d}{dt} \frac{W_p^p(\mu_t,\nu)}{p} = \int \|x - y\|^{p-2} \langle x - y, v_t(x) \rangle \, d\pi_t(x,y)$
for almost every (a.e.) $t \in I$, where $\pi_t$ is an optimal coupling for $(\mu_t, \nu)$. See also Theorem 5.24 in [San15]. Since (8) only holds for a.e. $t$, while we need the (right) differentiability at a specific point, the result of [AGS08, Theorem 8.4.7] (or [San15, Theorem 5.24]) does not directly apply to our problem. We overcome this difficulty by establishing regularity of optimal transport potentials (see Lemma 5.3 ahead), for which Gaussian smoothing plays an essential role.
We are now ready to prove Proposition 3.4, which, combined with the delta method for the map , yields Part (iii) of Theorem 1.1.
Proof of Proposition 3.4.
By Proposition 3.1, and in . Also with probability one by the portmanteau theorem. Applying the functional delta method (Lemma 3.1) and Proposition 3.5, we have
as desired. ∎
3.3.2. Two-sample case
Finally, we consider the two-sample case and prove the following, from which Part (iv) of Theorem 1.1 follows.
Proposition 3.6.
Let . Suppose that satisfy Condition (4) and . Let be an optimal transport potential from to for . Then, we have
Set and . Consider the function defined by
where is given in (6) and is defined analogously. Here we endow with a product norm (e.g. ).
We note that if is an optimal transport potential from to , then is an optimal transport potential from to , as . With this in mind, Proposition 3.5 immediately yields the following proposition.
Proposition 3.7 (Hadamard directional derivative of w.r.t. two arguments).
Let , and suppose that are sub-Weibull. Let be an optimal transport potential from to for . Then, , and the map , is Hadamard directionally differentiable at with derivative for .
Given Proposition 3.7, the proof of Proposition 3.6 is analogous to that of Proposition 3.4, and is thus omitted for brevity. As before, Part (iv) of Theorem 1.1 follows via the delta method for .
3.4. Bootstrap
The limit distributions in Theorem 1.1 are non-pivotal, as they depend on the population distributions and/or , which are unknown in practice. To overcome this and facilitate statistical inference using , we apply the bootstrap to estimate the limit distributions of empirical .
We start from the one-sample case. Given the data , let be an independent sample from , and set as the bootstrap empirical distribution. Let denote the conditional probability given . The next proposition shows that the bootstrap consistently estimates the limit distribution of empirical under both the null and the alternative.
Proposition 3.8 (Bootstrap consistency: one-sample case).
Part (ii) of the proposition is not surprising given that the Hadamard directional derivative of the function in Proposition 3.5 is , which is linear in . Part (i) is less obvious since the function from Proposition 3.3 has a nonlinear Hadamard directional derivative, . Recall that [Düm93, Proposition 1] and [FS19, Corollary 3.1] show that the bootstrap is inconsistent for functionals with nonlinear derivatives, but these results do not contradict Part (i) of Proposition 3.8 since our application of the bootstrap differs from theirs. For instance, [Düm93, Proposition 1] specialized to our setting states that the conditional law of does not converge weakly to in probability. Heuristically, is nonnegative while can be negative, so the conditional law of the latter cannot mimic the distribution of the former. Further, when is unknown, the conditional law of is infeasible. The correct bootstrap analog for is , and the proof of Proposition 3.8 shows that it can be approximated by , whose conditional law (after scaling) converges weakly to in probability.
Next, consider the two-sample case. In addition to and , given , let be an independent sample from , and set . Slightly abusing notation, we reuse for the conditional probability given .
Proposition 3.9 (Bootstrap consistency: two-sample under the alternative).
Example 3.1 (Confidence interval for ).
Consider constructing confidence intervals for . For , let denote the conditional -quantile of given the data. Then, by Proposition 3.9 above and Lemma 23.3 in [vdV98], the interval
contains with probability approaching .
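To make Example 3.1 concrete, the following sketch builds a basic (reverse-percentile) bootstrap confidence interval for a smoothed 1-D Wasserstein distance. All ingredients beyond the paper's text are our own illustrative choices: the Monte Carlo smoothing device (adding Gaussian noise to the samples before computing the 1-D distance), the parameters `sigma`, `n_boot`, `alpha`, and the specific Gaussian populations.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
sigma, n, m, n_boot, alpha = 1.0, 500, 500, 200, 0.1

x = rng.normal(0.0, 1.0, n)   # sample from mu
y = rng.normal(0.5, 1.0, m)   # sample from nu (mu != nu, so the alternative holds)

def smooth_w1(a, b, reps=10):
    # Approximate the smoothed distance W1(a * N_sigma, b * N_sigma) by adding
    # N(0, sigma^2) noise to the samples and averaging over noise draws.
    # This sampling approximation is an assumption of this sketch.
    return np.mean([wasserstein_distance(a + rng.normal(0, sigma, a.size),
                                         b + rng.normal(0, sigma, b.size))
                    for _ in range(reps)])

w_hat = smooth_w1(x, y)

# Nonparametric bootstrap: resample each dataset with replacement, recompute.
boot = np.array([smooth_w1(rng.choice(x, n), rng.choice(y, m))
                 for _ in range(n_boot)])
q_lo, q_hi = np.quantile(boot - w_hat, [alpha / 2, 1 - alpha / 2])
ci = (w_hat - q_hi, w_hat - q_lo)   # basic bootstrap confidence interval
print(ci)
```

The interval mirrors the construction in Example 3.1: conditional quantiles of the centered bootstrap statistic are pivoted around the point estimate.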
For the two-sample case under the null, instead of separately sampling bootstrap draws from and (see Remark 3.4 below), we use the pooled empirical distribution (cf. Chapter 3.7 in [vdVW96]). Given , let be an independent sample from , and set
The following proposition shows that this two-sample bootstrap is consistent for the null limit distribution of empirical .
Proposition 3.10 (Bootstrap consistency: two-sample under the null).
Suppose that and satisfy Condition (4). Then, for , we have
where is an independent copy of . In particular, if , then
Remark 3.4 (Inconsistency of naive bootstrap).
One may consider using (rather than ) to approximate the distribution of , but this bootstrap is not consistent. Indeed, from the proof of Proposition 3.10, we may deduce that, if , then is expanded as
which converges in distribution to unconditionally, where are independent copies of . This implies that the conditional law of does not converge weakly to the law of in probability.
Example 3.2 (Testing the equality of distributions).
Consider testing the equality of distributions, i.e., against . We shall use as a test statistic and reject if for some critical value . Proposition 3.10 implies that, if we choose to be the conditional -quantile of given the data, then the resulting test is asymptotically of level ,
Here is the nominal level. To see that the test is consistent, note that if , then with probability approaching one, while by Proposition 3.10.
Testing the equality of distributions using Wasserstein distances was considered in [RTC17], but their theoretical analysis is focused on the case, partly because of the lack of null limit distribution results for empirical in higher dimensions. We overcome this obstacle by using the smooth Wasserstein distance.
4. Minimum distance estimation with
We consider the application of our limit distribution theory to MDE with . Given an independent sample from a distribution , MDE aims to learn a generative model from a parametric family that approximates under some statistical divergence. We use as the proximity measure and the empirical distribution as an estimate for , which leads to the following MDE problem
MDE with classic is called the Wasserstein GAN, which continues to underlie state-of-the-art methods in generative modeling [ACB17, GAA+17]. MDE with was previously examined for in [GGK20] and for in [NGK21]. Specifically, [NGK21] established measurability, consistency, and parametric convergence rates for MDE with for , but did not derive limit distribution results. We will expand on this prior work by providing limit distributions for the MDE problem.
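A toy instance of the MDE problem above: fitting a location parameter by minimizing a Monte Carlo surrogate of the smoothed 1-D Wasserstein distance between the model and the empirical distribution. The grid search, the common-random-numbers trick, and all constants are our own assumptions for a self-contained sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
sigma, n = 1.0, 2000
data = rng.normal(0.7, 1.0, n)          # observations; true location is 0.7

# Common random numbers: fix all noise draws once so the objective is a
# deterministic function of theta and the grid search is stable.
data_noise = rng.normal(0, sigma, n)
model_base = rng.normal(0, 1.0, n)      # N(theta, 1) draws are theta + model_base
model_noise = rng.normal(0, sigma, n)

def objective(theta):
    # Surrogate for the smoothed W1 distance between N(theta, 1) and data.
    return wasserstein_distance(data + data_noise,
                                theta + model_base + model_noise)

grid = np.linspace(0.0, 1.5, 61)
theta_hat = grid[np.argmin([objective(t) for t in grid])]
print(theta_hat)
```

With a well-specified location family, the minimizer concentrates near the true parameter, consistent with the parametric rates established in [NGK21].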
Analogously to the conditions of Theorem 4 in [GGK20], we assume the following.
Assumption 1.
Let , and assume that the following conditions hold. (i) The distribution satisfies Condition (4). (ii) The parameter space is compact with nonempty interior. (iii) The map is continuous w.r.t. the weak topology. (iv) There exists a unique in the interior of such that . (v) There exists a neighborhood of such that for every . (vi) The map is norm differentiable with non–singular derivative at . That is, there exists , where are linearly independent elements of , such that
as in , where for .
We derive limit distributions for the optimal value function and MDE solution, following the methodology of [Pol80, BJGR19, GGK20].
Theorem 4.1 (Limit distributions for MDE with ).
Suppose that Assumption 1 holds. Let be the smooth empirical process, and its weak limit in ; cf. Proposition 3.1. Then, the following hold.
-
(i)
We have .
-
(ii)
Let be a sequence of measurable estimators satisfying
Then, provided that is almost surely unique, we have .
In general, it is nontrivial to verify that is almost surely unique. However, for , the Hilbertian structure of guarantees this uniqueness. Let be an isometric isomorphism between and a closed subspace of ; cf. Lemma 5.1. Setting and , we have
The unique minimizer in of the above display is given by
(10)
Since is a centered Gaussian random variable in , is a mean–zero Gaussian vector in .
Corollary 4.1 (Asymptotic normality for MDE solutions when ).
Consider the setting of Theorem 4.1 Part (ii) and let . Then , the mean–zero Gaussian vector in (10).
Without assuming the uniqueness of , limit distributions for MDE solutions can be stated in terms of set-valued random variables. Consider the set of approximate minimizers
(11)
where is any nonnegative sequence with . We will show that with inner probability approaching one for some sequence of random, convex, and compact sets; cf. [Pol80, Section 2]. To describe the sets , for any and , define
where is the class of compact, convex, and nonempty subsets of endowed with the Hausdorff topology, that is, the topology induced by the Hausdorff metric , where . Lemma 7.1 in [Pol80] shows that the map is measurable from into for any .
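For finite point clouds, the Hausdorff metric used here can be computed directly from pairwise distances; the following self-contained sketch (our own illustration, not from the paper) implements the definition d_H(A, B) = max(sup_{a in A} d(a, B), sup_{b in B} d(b, A)).

```python
import numpy as np

def hausdorff(A, B):
    # Pairwise Euclidean distances: D[i, j] = |A[i] - B[j]|.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # max over A of dist-to-B, and max over B of dist-to-A.
    return max(D.min(axis=1).max(), D.min(axis=0).max())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 1.0]])
print(hausdorff(A, B))  # -> 1.0
```

Note that the metric is symmetric and vanishes exactly when the two compact sets coincide, which is what makes it a genuine metric on the class of compact sets.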
Proposition 4.1 (Limit distribution for set of approximate minimizers).
Under Assumption 1, there exists a sequence of nonnegative real numbers such that (i) , where denotes inner probability; and (ii) as –valued random variables.
5. Remaining proofs
5.1. Proofs for Section 3.1.1
We fix some notation. For a nonempty set , let denote the space of bounded real functions on endowed with the sup-norm . The space is a Banach space.
5.1.1. Proof of Proposition 3.1
We divide the proof into three steps. In Steps 1 and 2, we will establish weak convergence of in . Step 3 is devoted to weak convergence of in .
Step 1. Observe that
(12)
Consider the function classes
The proof of Theorem 3 in [NGK21] shows that the function class is -Donsker. For completeness, we provide an outline of the argument. Since for any constant and any function , , it suffices to show that with is -Donsker. To this end, we will apply Theorem 1 in [vdV96] or its simple adaptation, Lemma 8 in [NGK21].
Fix any . We first observe that, for any and any multi-index , we have
(13)
up to constants independent of , and , where . Here is the differential operator and is the -Poincaré constant for the Gaussian measure . To see this, observe that
Applying Hölder’s inequality and using the fact that (recall that ), we obtain
A direct calculation further shows that
which implies
establishing (13) when . Derivative bounds follow similarly; see [NGK21] for details.
Next, we construct a cover of . Let . For fixed and , let be a minimal -net of . Set with . It is not difficult to see from a volumetric argument that . Set for . By construction, forms a cover of with diameter . Set and . By Theorem 1 in [vdV96] combined with Theorem 2.7.1 in [vdVW96] (or their simple adaptation; cf. Lemma 8 in [NGK21]), is -Donsker if . By inequality (13),
up to constants independent of and . Hence, is finite if
By Riemann approximation, the sum on the left-hand side above can be bounded by
which is finite under our assumption by choosing and sufficiently small, and absorbing into the exponential function.
Step 2. Let . Recall from Remark 2.1 that . From Step 1, we know that is -Donsker. The same conclusion holds with replaced by . This can be verified as follows. From the proof of (13) when , we see that for with -mean zero,
Since the exponential function on the right-hand side is square-integrable w.r.t. under Condition (4) and is dense in for by construction (cf. Remark 2.1), we see that
Thus, by Theorem 2.10.2 in [vdVW96], (or equivalently, ) is -Donsker. Since the map is an isometric isomorphism, in view of (12), we have in for some tight Gaussian process .
Let denote all bounded real functionals on , such that and
Equip with the norm . Each element in extends uniquely to the corresponding element in , and the extension, denoted by , is an isometric isomorphism. This follows from the same argument as the proof of Lemma 1 in [Nic09]. We omit the details.
Since is a closed subspace of and has paths in , we see that with probability one by the portmanteau theorem and in . Now, since is a (random) signed measure that is bounded on with probability one, we can regard as a random variable with values in . Conclude that in by the continuous mapping theorem. For notational convenience, redefine by . The limit variable is a centered Gaussian process with covariance function .
Step 3. We will show that converges in distribution to a centered Gaussian process in . For with , let denote the distribution of , and let . It is not difficult to see that satisfies Condition (4). Applying the result of Step 2 with replaced by , we have in . Since (as by Jensen’s inequality), we have , i.e., the continuous embedding holds. Thus in .
Observe that for ,
Thus, the map , defined by , is continuous (indeed, an isometric isomorphism). Conclude that
The limit variable is a centered Gaussian process with covariance function
This completes the proof. ∎
Remark 5.1 (Alternative proof for ).
Observe that with , and that are i.i.d. random variables with values in (cf. (13)). Since is isometrically isomorphic to a closed subspace of (see Lemma 5.1 ahead), we may apply the CLT in the Hilbert space to derive a limit distribution for in . Let be the linear isometry given in Lemma 5.1 ahead and be the corresponding -valued random variables. Since is a Hilbert space, obeys the CLT if , which is satisfied under Condition (4). Indeed, for , it is not difficult to see that the CLT in holds for under a slightly weaker moment condition, namely, .
5.1.2. Proof of Proposition 3.2
Part (i). Let
for some sufficiently large but fixed constant . It is not difficult to see that a.s. if and only if is -Glivenko-Cantelli.
Suppose first that is -Glivenko-Cantelli. Let denote the minimal envelope for , i.e., . By Theorem 3.7.14 in [GN16], must be -integrable. We shall bound from below. Fix any . Consider
Observe that and thus . Thus, for , we have , and
Also, from Proposition 1.5.2 in [Bog98], we see that . Conclude that, as long as ,
Now, the left-hand side is -integrable, so that for any .
Part (ii). Conversely, suppose that , which ensures that is -integrable from (13). From the proof of Proposition 3.1, for any , we see that the restricted function class is -Donsker and thus -Glivenko-Cantelli (cf. Theorem 3.7.14 in [GN16]). Since the envelope function is -integrable, we conclude that is -Glivenko-Cantelli; cf. the proof of Theorem 3.7.14 in [GN16]. ∎
5.2. Proofs for Section 3.2
Recall that and is its conjugate index, i.e., .
5.2.1. Proof of Lemma 3.2
One of the main ingredients of the proof of Lemma 3.2 is Theorem 8.3.1 in [AGS08], which is stated next (see also the Benamou-Brenier formula [BB00]).
Theorem 5.1 (Theorem 8.3.1 in [AGS08]).
Let be an open interval, and let be a continuous curve in (equipped with ) such that for some Borel vector field , the continuity equation
(14)
holds in the distributional sense, i.e.,
If , then for all with .
For a vector field , define
Observe that if and only if , and for any ,
We will also use the following lemma.
Lemma 5.1.
Let be a reference measure. For any , there exists a unique vector field such that
(15)
The map is homogeneous (i.e., for all and ) and such that for all . If , then the map is a linear isometry from into .
The proof of Lemma 5.1 in turn relies on the following existence result of optimal solutions in Banach spaces. We provide its proof for the sake of completeness.
Lemma 5.2.
Let be a reflexive real Banach space, and let () be weakly lower semicontinuous (i.e., for any weakly) and coercive (i.e., as ). Then there exists such that .
Proof of Lemma 5.2.
Let be such that . By coercivity, is bounded, so by reflexivity and the Banach-Alaoglu theorem, there exists a weakly convergent subsequence such that weakly. Since is weakly lower semicontinuous, we conclude . ∎
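Lemma 5.2 is the classical direct method of the calculus of variations. As a hedged finite-dimensional illustration (the lemma itself concerns reflexive Banach spaces), consider the convex, coercive functional J(v) = ½‖v‖² − ⟨b, v⟩: its unique minimizer solves the first-order condition ∇J(v) = v − b = 0, mirroring the Gâteaux-derivative argument in the proof of Lemma 5.1. The functional and the vector `b` below are our own toy choices.

```python
import numpy as np
from scipy.optimize import minimize

# J is strictly convex and coercive, so a unique minimizer exists (Lemma 5.2
# in finite dimensions) and is characterized by grad J(v) = v - b = 0.
b = np.array([1.0, -2.0, 0.5])
J = lambda v: 0.5 * v @ v - b @ v

res = minimize(J, x0=np.zeros(3))
print(np.round(res.x, 4))   # numerically close to b
```

The numerical minimizer agrees with the closed-form solution v = b, which is the finite-dimensional analogue of the unique optimal vector field in Lemma 5.1.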
We turn to the proof of Lemma 5.1, which is inspired by the first part of the proof of Theorem 8.3.1 in [AGS08].
Proof of Lemma 5.1.
Let denote the closure in of the subspace . Endowing with gives a reflexive Banach space because any closed subspace of a reflexive Banach space is reflexive. Define the linear functional by . To see that is well-defined, observe that
This also shows that can be extended to a bounded linear functional on .
Consider the optimization problem
(16)
The functional is finite, weakly lower semicontinuous, and coercive. By Lemma 5.2 there exists a solution to the optimization problem (16). Further, the functional is Gâteaux differentiable with derivative
Thus, for , we have for all and .
To show uniqueness of , pick another vector field satisfying (15). Then, satisfies for all , so by convexity of , is another optimal solution to (16). However, since is strictly convex, the optimal solution to (16) is unique, so that , i.e., .
Now, the map is homogeneous, as clearly satisfies the first equation in (15) for replaced with and . Further, as by construction, it also satisfies
Finally, if , then , so it is clear that the map is linear. ∎
We are now ready to prove Lemma 3.2.
Proof of Lemma 3.2.
Let and for . For notational convenience, let . We will first show that
The proof is inspired by Theorem 7.26 in [Vil03]. Observe that for any and ,
Let be an optimal coupling for , i.e., . Then
Since is smooth and compactly supported, there exists a constant such that
Indeed, for , we can take (here denotes the operator norm for matrices). For , we have
with . Here denotes the support of , . If and , then , so that we have
with . Finally, if and , then
so that we have
with .
Now, we have
Applying Proposition 2.1 with , we know that as , so that as . Further, by Hölder’s inequality, with being the conjugate index of , we have
Here
Conclude that
that is,
To prove the reverse inequality, let be the map from into given in Lemma 5.1. Let . Since is a probability measure, we have , i.e., , so that for . Likewise, for .
5.2.2. Proof of Proposition 3.3
Pick arbitrary , and in such that . By density, for any , there exist and for such that for . By scaling, Lemma 3.2 holds with replaced by . Assume without loss of generality that is large enough such that for and . The density of w.r.t. is
Thus, by Proposition 2.1, we have
Further,
Thus, using the result of Lemma 3.2, we conclude that
Since is arbitrary, we obtain the desired conclusion. ∎
5.3. Proofs for Section 3.3
5.3.1. Proof of Lemma 3.3
The proof of Lemma 3.3 relies on the following technical lemma concerning regularity of optimal transport potentials. Recall that any locally Lipschitz function on is differentiable a.e. by the Rademacher theorem (cf. [EG18]). Here and in what follows, a.e. is taken w.r.t. the Lebesgue measure.
Lemma 5.3 (Regularity of optimal transport potential).
Let . Suppose that and is -sub-Weibull for some . Let be an optimal transport potential from to for . Then there exists a constant that depends only on , upper bounds on and for , and a lower bound on , such that
The proof of Lemma 5.3 borrows ideas from Lemmas 9 and 10 and Theorem 11 in the recent work by [MNW21], which in turn build on [GM96, CF21].
Proof of Lemma 5.3.
By Theorem 11 in [MNW21], there exists a constant depending only on and an upper bound on for , such that
where is the –superdifferential of at for the cost function , and .
Next, by Proposition 2 in [PW16], has Lebesgue density that is -regular with and , i.e.,
From the proof of Lemma 10 in [MNW21], we have
(17)
Thus, whenever ,
where is a constant that depends only on . Conclude that there exists a constant depending only on , upper bounds on and for , and a lower bound on , such that
The rest of the proof mirrors the latter half of the proof of Lemma 9 in [MNW21]. Since and is equivalent to the Lebesgue measure (i.e., and ), for a.e. . Since any open convex set in agrees with the interior of its closure (cf. Proposition 6.2.10 in [Dud02]), the convex hull of agrees with . Thus, by Lemma 2.1 (ii) (Theorem 3.3 in [GM96]), is locally Lipschitz on . Further, by Proposition C.4 in [GM96], is nonempty for all . For any and ,
Thus, for any ,
where depends only on . Interchanging and , we conclude that
(18)
which implies the desired conclusion. ∎
Proof of Lemma 3.3.
Let for , and let be an optimal transport potential from to . Without loss of generality, we may normalize in such a way that for .
We will apply Lemma 5.3 with replaced with for . It is not difficult to see that, as long as ,
Thus, by Lemma 5.3, there exist constants and independent of such that for every ,
Second, by construction,
Pick any . Since , has full support , and , we have by Theorem 3.4 in [dBGSL21b] that there exists some sequence of constants such that pointwise. Since we have normalized in such a way that , we have , i.e., pointwise. Further, since for all , the dominated convergence theorem yields that
Conclude that
This completes the proof. ∎
5.3.2. Proof of Proposition 3.5
Part (i). We first note that is a function space over . To see this, observe that if we choose a reference measure to be an isotropic Gaussian distribution with sufficiently small variance parameter, then the relative density is bounded away from zero. Indeed, for , we have
by Jensen’s inequality, which guarantees that is a function space over in view of Remark 2.1.
By regularity of from Lemma 5.3, we know that is locally Lipschitz and (the latter alone does not automatically guarantee ). As in Proposition 1.5.2 in [Bog98], choose a sequence with the following property:
Let . Each belongs to the ordinary Sobolev -space w.r.t. the Lebesgue measure, so can be approximated by gradients of functions under (cf. [Ada75, Corollary 3.23]). Since has a bounded Lebesgue density, this shows that . Now,
as , implying that .
Part (ii). Pick any , and in such that . For any , there exist some constant and sub-Weibull such that for .
5.4. Proofs for Section 3.4
5.4.1. Proof of Proposition 3.8
We first prove the following lemma. We note that the empirical distributions and are finitely discrete, so defines a random variable with values in (cf. (13) and Step 3 of the proof of Proposition 3.1). Let denote its (regular) conditional law given the data (which exists as is a separable Banach space; cf. Chapter 11 in [Dud02]).
Lemma 5.4.
Suppose that satisfies Condition (4). Then, we have almost surely.
Proof of Lemma 5.4.
From the proof of Proposition 3.1, the function class is -Donsker with a -square integrable envelope. The rest of the proof follows from the Giné-Zinn theorem for the bootstrap (cf. Theorem 3.6.2 in [vdVW96]) and repeating the arguments in Steps 2 and 3 in the proof of Proposition 3.1. ∎
Proof of Proposition 3.8.
Part (i). Assume without loss of generality that is not a point mass. We first note that the limit variable has a continuous distribution function. This can be verified, e.g., analogously to the proof of Lemma 1 in [SGK21]. Thus, it suffices to prove the convergence in probability (9) for each fixed (cf. Problem 23.1 in [vdV98]).
Let and . By Proposition 3.1 and Lemma 5.4, we know that
unconditionally, where is an independent copy of (cf. Theorem 2.2 in [Kos08]). Thus, by Proposition 3.5 and the second claim of the functional delta method (Lemma 3.1), we see that
Here unconditionally. Choose such that . By Markov’s inequality, we have . By Lemma 5.4 and the continuous mapping theorem, we also have
Thus, for each ,
The reverse inequality follows similarly.
Part (ii). The argument is analogous to Part (i). Observe that, by Proposition 3.5,
Taking the th root and applying the delta method, we have
The rest of the proof is completely analogous to Part (i). ∎
Proof of Proposition 3.9.
By Lemma 5.4 and Example 1.4.6 in [vdVW96], the conditional law of given the data converges weakly to the law of in almost surely, where and are independent. By Theorem 2.2 in [Kos08], for and , we have
unconditionally, where are copies of , respectively, and are independent. Thus, by Proposition 3.7 and Lemma 3.1, for and , we have
The rest of the proof is analogous to Proposition 3.8 Part (ii). ∎
Proof of Proposition 3.10.
It is not difficult to see that satisfies Condition (4) and in . By Theorem 3.7.7 and Example 1.4.6 in [vdVW96], the conditional law of given the data converges weakly to the law of in almost surely, where is an independent copy of . Thus, arguing as in the proof of Proposition 3.9, for (), we have
unconditionally, where are independent copies of . Define by replacing with in Section 3.2. Then, by Proposition 3.3 and the second claim of the functional delta method (Lemma 3.1), we see that
The rest of the proof is analogous to Proposition 3.8 Part (i). ∎
5.5. Proof of Theorem 4.1
5.5.1. Preliminary lemmas
Recall the notation and that appeared in Section 3.2.
Lemma 5.5.
Let for . Under Assumption 1, the map
is Hadamard directionally differentiable at with derivative
Furthermore, the expansion
holds, with remainder satisfying as uniformly w.r.t. varying in , a compact subset of .
Proof.
Consider the map . The norm differentiability condition, Assumption 1 (vi), establishes Fréchet (hence Hadamard) directional differentiability of at with
The chain rule for Hadamard directional derivatives paired with Proposition 3.3 yields
The final assertion follows from compact directional differentiability of the composition [Sha90]. ∎
Lemma 5.6.
Assume the setting of Lemma 5.5.
-
(i)
There exists a neighborhood of with such that
where is such that for every .
-
(ii)
Let and ; then, uniformly in ,
Proof.
Part (i). Assumption 1 (vi) guarantees that there exists a constant such that for every . Let be an open ball of radius centered at whose closure is contained in ; then there exists such that, for every , the remainder term of Lemma 5.5 satisfies for every . Hence, for every . The triangle inequality yields, for any ,
Part (ii). Since in and is tight, the sequence is uniformly tight. Pick any . By uniform tightness, there exists a compact set such that for every . Further, since , there exists such that for every . Define the event . Observe that for every . Then, on this event , it holds that . Since is compact, we have, for every ,
Set . Then, on the event ,
and the right-hand side can be made less than for sufficiently large. Hence, for every sufficiently large ,
that is, . This implies the desired result. ∎
5.5.2. Proof of Theorem 4.1
Part (i). Given the above lemmas, the proof closely follows [Pol80, Theorem 4.2] and [GGK20, Appendix B.4]. We first note that, under Assumption 1, there exists a sequence of measurable estimators such that and . This follows from a small modification to the proof of Theorems 2 and 3 in [GGK20]. Thus, for any neighborhood of ,
with probability approaching one.
By Assumption 1 (vi), there exists such that for every . Thus, by Lemma 5.6 (i), there exists a neighborhood of with such that
Set with . By Lemma 5.6 (ii), the expansion
(19)
holds uniformly in . Then, for arbitrary ,
so that
This shows that .
Now, reparametrizing by and setting in (19), we have
Set . For any such that , we have
Since with probability approaching one (as ), we have with probability approaching one. Conclude that
Finally, since the map is continuous, the continuous mapping theorem yields
This completes the proof of Part (i).
Part (ii). From the proof of Theorem 3 in [GGK20], it is not difficult to show that . Thus, with probability approaching one, where is the neighborhood of given in the proof of Part (i), so that by the definition of and Lemma 5.6 (ii),
(20)
with probability tending to one. This implies that . Let and . Observe that and are convex in . Again, from the proof of Part (i), since , we have
Hence,
Since in , for any finite set of points by the continuous mapping theorem. Applying Theorem 1 in [Kat09] (or Lemma 6 in [GGK20]) yields . ∎
6. Concluding remarks
In this paper, we have developed a comprehensive limit distribution theory for empirical that covers general and , under both the null and the alternative. Our proof technique leveraged the extended functional delta method, which required two main ingredients: (i) convergence of the smooth empirical process in an appropriate normed vector space; and (ii) characterization of the Hadamard directional derivative of w.r.t. the norm. We have identified the dual Sobolev space as the normed space of interest and established the items above to obtain the limit distribution results. Linearity of the Hadamard directional derivative under the alternative enabled us to establish asymptotic normality of the (scaled) empirical .
To facilitate statistical inference using , we have established the consistency of the nonparametric bootstrap. The limit distribution theory was used to study generative modeling via MDE. We have derived limit distributions for the optimal solutions and the corresponding smooth Wasserstein error, and obtained Gaussian limits when by leveraging the Hilbertian structure of the corresponding dual Sobolev space. Our statistical study, together with the appealing metric and topological structure of [GG20, NGK21], suggests that the smooth Wasserstein framework is compatible with high-dimensional learning and inference.
An important direction for future research is the efficient computation of . While standard methods for computing are applicable in the smooth case (by sampling the Gaussian noise), it is desirable to find computational techniques that make use of the structure induced by the convolution with a known smooth kernel. Another appealing direction is to establish Berry-Esseen type bounds for the limit distributions in Theorem 1.1. Of particular interest is to explore how parameters such as and affect the accuracy of the limit distributions in Theorem 1.1. [SGK21] addressed a similar problem for empirical under the one-sample null case, but their proof relies substantially on the IPM structure of and finite sample Gaussian approximation techniques developed by [CCK14, CCK16]. These techniques do not apply to , and thus new ideas, such as the linearization arguments herein, are required to develop Berry-Esseen type bounds for .
Appendix A Additional proofs
A.1. Proof of Lemma 2.2
Completeness of is immediate. To prove separability, let be the map from into given in Lemma 5.1. Then, from the first equation in (15) and Hölder’s inequality, it holds that for all . Since is separable, we can choose a countable dense subset of the range of . As the map is injective, the set is (countable and) dense in .
Finally, the Borel -field contains the cylinder -field as the coordinate projections are Borel measurable. By separability of , to show that the cylinder -field contains the Borel -field, it suffices to show that any closed ball in is measurable relative to the cylinder -field. For any and , we have
Since (which is isometrically isomorphic to a closed subspace of ) is separable, the intersection on the right-hand side can be replaced with the intersection over countably many functions. This shows that the -ball on the left-hand side is measurable relative to the cylinder -field. ∎
A.2. Proof of Proposition 2.1
It suffices to prove the proposition when . Consider the curve for , and let be the Borel vector field given in Lemma 5.1 for . Let denote the density of w.r.t. , i.e., . By our lower bound of on one of or , we have for all in the open interval , so the vector field is well-defined for . Then the continuity equation (14) holds with and this choice of .
Now suppose that (the case is similar). Then,
so that for . Taking and , we conclude that .
In view of (15), the desired conclusion follows once we verify
but this follows from the fact that . ∎
A.3. Proof of Proposition 4.1
The proof adopts the approach of [Pol80, Theorem 7.2]. Let be the neighborhood of appearing in the proof of Theorem 4.1 Part (i). Set
Since (cf. Lemma 5.6), with inner probability tending to one (cf. equation (20)).
Let . From the proof of Theorem 4.1 Part (i), we see that . Also, since in , by the Skorohod-Dudley-Wichura construction, there exist versions and of and (i.e., and have the same distributions as and , respectively, as -valued random variables) such that in almost surely. Choose in such a way that
Set .
Part (ii). We first show that . To this end, it suffices to show that . Observe that for any ,
Thus, when and ,
This implies that, when , which occurs with probability approaching one, it holds that .
Now, for any and , there exists such that for every , since for any sequence , is a decreasing sequence of compact sets whose intersection is contained in the open set , and hence for sufficiently large. It follows that if almost surely, then . Applying this result with , we have that is contained in
where the final inclusion holds with probability tending to one. Conclude that as desired.
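The compactness step above is elementary and worth isolating. In generic notation, with $K$ compact, $U$ open, $K \subset U$, closed enlargements $K^{\varepsilon} := \{x : d(x, K) \le \varepsilon\}$, and assuming, as in the proof, that these enlargements are compact:

```latex
% For \varepsilon_n decreasing to 0, the enlargements decrease to K:
\bigcap_{n \ge 1} K^{\varepsilon_n} = K \subset U,
% so the K^{\varepsilon_n} form a decreasing sequence of compact sets whose
% intersection lies in the open set U; hence K^{\varepsilon_n} \subset U
% for all sufficiently large n.
```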
Part (i). Finally, suppose that and . Observe that if , then
so that . By repeated applications of the triangle inequality, for , we have
Conclude that
Since with inner probability tending to one, we obtain the desired conclusion. This completes the proof. ∎
References
- [ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou, Wasserstein generative adversarial networks, International Conference on Machine Learning, 2017.
- [Ada75] Robert A. Adams, Sobolev spaces, Academic Press, New York-London, 1975.
- [AF09] Jean-Pierre Aubin and Hélène Frankowska, Set-valued analysis, Springer Science & Business Media, 2009.
- [AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient flows: in metric spaces and in the space of probability measures, Springer Science & Business Media, 2008.
- [AKT84] Miklós Ajtai, János Komlós, and Gábor Tusnády, On optimal matchings, Combinatorica 4 (1984), no. 4, 259–264.
- [AST19] Luigi Ambrosio, Federico Stra, and Dario Trevisan, A PDE approach to a 2-dimensional matching problem, Probability Theory and Related Fields 173 (2019), no. 1, 433–477.
- [BB00] Jean-David Benamou and Yann Brenier, A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem, Numerische Mathematik 84 (2000), no. 3, 375–393.
- [BB13] Franck Barthe and Charles Bordenave, Combinatorial optimization over two random point sets, Séminaire de Probabilités XLV, Springer, 2013, pp. 483–535.
- [BdMM02] J. H. Boutet de Monvel and O. C. Martin, Almost sure convergence of the minimum bipartite matching functional in Euclidean space, Combinatorica 22 (2002), no. 4, 523–530.
- [BGV07] François Bolley, Arnaud Guillin, and Cédric Villani, Quantitative concentration inequalities for empirical measures on non-compact spaces, Probability Theory and Related Fields 137 (2007), no. 3-4, 541–593.
- [BJGR19] Espen Bernton, Pierre E. Jacob, Mathieu Gerber, and Christian P. Robert, On parameter estimation with the Wasserstein distance, Information and Inference. A Journal of the IMA 8 (2019), no. 4, 657–676.
- [BL21] Sergey G. Bobkov and Michel Ledoux, A simple Fourier analytic proof of the AKT optimal matching theorem, The Annals of Applied Probability 31 (2021), no. 6, 2567–2584.
- [BLG14] Emmanuel Boissard and Thibaut Le Gouic, On the mean speed of convergence of empirical and occupation measures in Wasserstein distance, Annales de l’IHP Probabilités et Statistiques 50 (2014), no. 2, 539–563.
- [BMZ21] Jose Blanchet, Karthyek Murthy, and Fan Zhang, Optimal transport-based distributionally robust optimization: Structural properties and iterative schemes, Mathematics of Operations Research (2021).
- [Bob99] Sergey G. Bobkov, Isoperimetric and Analytic Inequalities for Log-Concave Probability Measures, The Annals of Probability 27 (1999), no. 4, 1903–1921.
- [Bog98] Vladimir I. Bogachev, Gaussian measures, no. 62, American Mathematical Society, 1998.
- [CCK14] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato, Gaussian approximation of suprema of empirical processes, The Annals of Statistics 42 (2014), 1564–1597.
- [CCK16] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato, Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings, Stochastic Processes and their Applications 126 (2016), no. 12, 3632–3651.
- [CCR20] J. Cárcamo, A. Cuevas, and L.-A. Rodríguez, Directional differentiability for supremum-type functionals: statistical applications, Bernoulli 26 (2020), no. 3, 2143–2175.
- [CF21] Maria Colombo and Max Fathi, Bounds on optimal transport maps onto log-concave measures, Journal of Differential Equations 271 (2021), 1007–1022.
- [CFT14] Nicolas Courty, Rémi Flamary, and Devis Tuia, Domain adaptation with regularized optimal transport, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 274–289.
- [CFTR16] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy, Optimal transport for domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016), no. 9, 1853–1865.
- [CNW21] Hong-Bin Chen and Jonathan Niles-Weed, Asymptotics of smoothed Wasserstein distances, Potential Analysis (2021), 1–25.
- [CRL+20] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré, Faster Wasserstein distance estimation with the Sinkhorn divergence, International Conference on Neural Information Processing Systems, 2020.
- [DBGL19] Eustasio Del Barrio, Paula Gordaliza, and Jean-Michel Loubes, A central limit theorem for transportation cost on the real line with application to fairness assessment in machine learning, Information and Inference: A Journal of the IMA 8 (2019), no. 4, 817–849.
- [dBGM99] Eustasio del Barrio, Evarist Giné, and Carlos Matrán, Central limit theorems for the Wasserstein distance between the empirical and the true distributions, The Annals of Probability (1999), 1009–1071.
- [dBGSL21a] Eustasio del Barrio, Alberto González-Sanz, and Jean-Michel Loubes, A central limit theorem for semidiscrete Wasserstein distances, arXiv:2105.11721 (2021).
- [dBGSL21b] by same author, Central limit theorems for general transportation costs, arXiv:2102.06379 (2021).
- [dBGU05] Eustasio del Barrio, Evarist Giné, and Frederic Utzet, Asymptotics for functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances, Bernoulli 11 (2005), no. 1, 131–189.
- [dBL19] Eustasio del Barrio and Jean-Michel Loubes, Central limit theorems for empirical transportation cost in general dimension, The Annals of Probability 47 (2019), no. 2, 926–951.
- [DGS21] Nabarun Deb, Promit Ghosal, and Bodhisattva Sen, Rates of estimation of optimal transport maps using plug-in estimators via barycentric projections, arXiv:2107.01718 (2021).
- [DNS09] Jean Dolbeault, Bruno Nazaret, and Giuseppe Savaré, A new class of transport distances between measures, Calculus of Variations and Partial Differential Equations 34 (2009), no. 2, 193–231.
- [DSS13] Steffen Dereich, Michael Scheutzow, and Reik Schottstedt, Constructive quantization: Approximation by empirical measures, Annales de l’Institut Henri Poincaré Probabilités et Statistiques 49 (2013), no. 4, 1183–1203.
- [Dud69] Richard M. Dudley, The speed of mean Glivenko-Cantelli convergence, The Annals of Mathematical Statistics 40 (1969), no. 1, 40–50.
- [Dud02] by same author, Real analysis and probability, Cambridge University Press, 2002.
- [Düm93] Lutz Dümbgen, On nondifferentiable functions and the bootstrap, Probability Theory and Related Fields 95 (1993), no. 1, 125–140.
- [DY95] V. Dobrić and J. E. Yukich, Asymptotics for transportation cost in high dimensions, Journal of Theoretical Probability 8 (1995), no. 1, 97–118.
- [EG18] Lawrence C. Evans and Ronald F. Gariepy, Measure theory and fine properties of functions, Routledge, 2018.
- [FG15] Nicolas Fournier and Arnaud Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields 162 (2015), 707–738.
- [FS19] Zheng Fang and Andres Santos, Inference on directionally differentiable functions, The Review of Economic Studies 86 (2019), 377–412.
- [GAA+17] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville, Improved training of Wasserstein GANs, International Conference on Neural Information Processing Systems, 2017.
- [GG20] Ziv Goldfeld and Kristjan Greenewald, Gaussian-smoothed optimal transport: Metric structure and statistical efficiency, International Conference on Artificial Intelligence and Statistics, 2020.
- [GGK20] Ziv Goldfeld, Kristjan Greenewald, and Kengo Kato, Asymptotic guarantees for generative modeling based on the smooth Wasserstein distance, International Conference on Neural Information Processing Systems, 2020.
- [GGNP20] Ziv Goldfeld, Kristjan H. Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy, Convergence of smoothed empirical measures with applications to entropy estimation, IEEE Transactions on Information Theory 66 (2020), no. 7, 1489–1501.
- [GK16] Rui Gao and Anton J. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, arXiv:1604.02199 (2016).
- [GM96] Wilfrid Gangbo and Robert J. McCann, The geometry of optimal transportation, Acta Mathematica 177 (1996), no. 2, 113–161.
- [GMR93] Christian Gourieroux, Alain Monfort, and Eric Renault, Indirect inference, Journal of Applied Econometrics 8 (1993), no. S1, S85–S118.
- [GN16] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge University Press, 2016.
- [HKSM22] Shayan Hundrieser, Marcel Klatt, Thomas Staudt, and Axel Munk, A unifying approach to distributional limits for empirical optimal transport, arXiv:2202.12790 (2022).
- [HMS21] Fang Han, Zhen Miao, and Yandi Shen, Nonparametric mixture MLEs under Gaussian-smoothed optimal transport distance, arXiv:2112.02421 (2021).
- [JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of the Fokker–Planck equation, SIAM journal on mathematical analysis 29 (1998), no. 1, 1–17.
- [Kan42] L. V. Kantorovich, On the translocation of masses, Doklady Akademii Nauk USSR 37 (1942), 199–201.
- [Kat09] Kengo Kato, Asymptotics for argmin processes: Convexity arguments, Journal of Multivariate Analysis 100 (2009), no. 8, 1816–1829.
- [Kos08] Michael R. Kosorok, Bootstrapping the Grenander estimator, Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen, Institute of Mathematical Statistics, 2008, pp. 282–292.
- [Led19] Michel Ledoux, On optimal matching of Gaussian samples, Journal of Mathematical Sciences 238 (2019), 495–522.
- [Lei20] Jing Lei, Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces, Bernoulli 26 (2020), no. 1, 767–798.
- [LV07] László Lovász and Santosh Vempala, The geometry of logconcave functions and sampling algorithms, Random Structures and Algorithms 30 (2007), no. 3, 307–358.
- [MBNWW21] Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman, Plugin estimation of smooth optimal transport maps, arXiv:2107.12364 (2021).
- [MEK18] Peyman Mohajerin Esfahani and Daniel Kuhn, Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations, Mathematical Programming 171 (2018), 115–166.
- [Mil09] Emanuel Milman, On the role of convexity in isoperimetry, spectral gap and concentration, Inventiones Mathematicae 177 (2009), no. 1, 1–43.
- [MNW21] Tudor Manole and Jonathan Niles-Weed, Sharp convergence rates for empirical optimal transport with smooth costs, arXiv:2106.13181 (2021).
- [NGK21] Sloan Nietert, Ziv Goldfeld, and Kengo Kato, Smooth p-Wasserstein distance: structure, empirical approximation, and statistical applications, International Conference on Machine Learning, 2021.
- [Nic09] Richard Nickl, On convergence and convolutions of random signed measures, Journal of Theoretical Probability 22 (2009), no. 1, 38–56.
- [Pey18] Rémi Peyre, Comparison between distance and norm, and localization of Wasserstein distance, ESAIM: Control, Optimisation and Calculus of Variations 24 (2018), no. 4, 1489–1501.
- [Pol80] David Pollard, The minimum distance method of testing, Metrika 27 (1980), no. 1, 43–70.
- [PS80] William C. Parr and William R. Schucany, Minimum distance and robust estimation, Journal of the American Statistical Association 75 (1980), no. 371, 616–624.
- [PW16] Yury Polyanskiy and Yihong Wu, Wasserstein continuity of entropy and outer bounds for interference channels, IEEE Transactions on Information Theory 62 (2016), 3992–4002.
- [Röm04] Werner Römisch, Delta method, infinite dimensional, Encyclopedia of Statistical Sciences, Wiley, 2004.
- [RTC17] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi, On Wasserstein two-sample testing and related families of nonparametric tests, Entropy 19 (2017).
- [RTG00] Y. Rubner, C. Tomasi, and L. J. Guibas, The earth mover’s distance as a metric for image retrieval, International Journal of Computer Vision 40 (2000), no. 2, 99–121.
- [San15] F. Santambrogio, Optimal transport for applied mathematicians, Birkhäuser, 2015.
- [San17] Filippo Santambrogio, Euclidean, metric, and Wasserstein gradient flows: An overview, Bulletin of Mathematical Sciences 7 (2017), no. 1, 87–154.
- [SGK21] Ritwik Sadhu, Ziv Goldfeld, and Kengo Kato, Limit distribution theory for the smooth 1-Wasserstein distance with applications, arXiv:2107.13494 (2021).
- [Sha90] Alexander Shapiro, On concepts of directional differentiability, Journal of Optimization Theory and Applications 66 (1990), 477–487.
- [Sha91] by same author, Asymptotic analysis of stochastic programs, Annals of Operations Research 30 (1991), 169–186.
- [SL11] Roman Sandler and Michael Lindenbaum, Nonnegative matrix factorization with earth mover’s distance metric for image analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011), no. 8, 1590–1602.
- [SM18] Max Sommerfeld and Axel Munk, Inference for empirical Wasserstein distances on finite spaces, Journal of the Royal Statistical Society Series B 80 (2018), 219–238.
- [SW14] Adrien Saumard and Jon A. Wellner, Log-concavity and strong log-concavity: A review, Statistics Surveys 8 (2014), 45–114.
- [Tal92] Michel Talagrand, Matching random samples in many dimensions, The Annals of Applied Probability (1992), 846–856.
- [Tal94] by same author, The transportation cost from the uniform measure to the empirical measure in dimension ≥ 3, The Annals of Probability (1994), 919–959.
- [TBGS18] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf, Wasserstein auto-encoders, International Conference on Learning Representations, 2018.
- [TSM19] Carla Tameling, Max Sommerfeld, and Axel Munk, Empirical optimal transport on countable metric spaces: distributional limits and statistical applications, The Annals of Applied Probability 29 (2019), 2744–2781.
- [vdV96] Aad van der Vaart, New Donsker classes, The Annals of Probability 24 (1996), no. 4, 2128–2140.
- [vdV98] Aad. W. van der Vaart, Asymptotic statistics, Cambridge University Press, 1998.
- [vdVW96] Aad W. van der Vaart and Jon A. Wellner, Weak convergence and empirical processes: With applications to statistics, Springer, 1996.
- [Vil03] Cédric Villani, Topics in optimal transportation, American Mathematical Society, 2003.
- [Vil08] Cédric Villani, Optimal transport: Old and new, Springer, 2008.
- [WB19a] Jonathan Weed and Francis Bach, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance, Bernoulli 25 (2019), no. 4A, 2620–2648.
- [WB19b] Jonathan Weed and Quentin Berthet, Estimation of smooth densities in Wasserstein distance, Conference on Learning Theory, 2019.
- [Wol57] Jacob Wolfowitz, The minimum distance method, The Annals of Mathematical Statistics (1957), 75–88.
- [ZCR21] Yixing Zhang, Xiuyuan Cheng, and Galen Reeves, Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples, Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021, pp. 2422–2430.