
Inference via robust optimal transportation: theory and methods

Yiming Ma (Department of Statistics and Finance, School of Management, University of Science and Technology of China), Hang Liu (International Institute of Finance, School of Management, University of Science and Technology of China), Davide La Vecchia (University of Geneva), Matthieu Lerasle (CREST, CNRS, École polytechnique, GENES, ENSAE Paris, Institut Polytechnique de Paris, 91120 Palaiseau, France)
Abstract

Optimal transportation theory and the related p-Wasserstein distance (W_{p}, p\geq 1) are widely applied in statistics and machine learning. In spite of their popularity, inference based on these tools has some issues. For instance, it is sensitive to outliers and it may not even be defined when the underlying model has infinite moments. To cope with these problems, first we consider a robust version of the primal transportation problem and show that it defines the robust Wasserstein distance, W^{(\lambda)}, depending on a tuning parameter \lambda>0. Second, we illustrate the link between W_{1} and W^{(\lambda)} and study its key measure-theoretic aspects. Third, we derive some concentration inequalities for W^{(\lambda)}. Fourth, we use W^{(\lambda)} to define minimum distance estimators; we provide their statistical guarantees and we illustrate how to apply the derived concentration inequalities for a data-driven selection of \lambda. Fifth, we provide the dual form of the robust optimal transportation problem and we apply it to machine learning problems (generative adversarial networks and domain adaptation). Numerical exercises provide evidence of the benefits yielded by our novel methods.

MSC subject classifications: 62F10, 62F12, 62F35, 6208, 68T09.

Keywords: Concentration inequalities, Minimum distance estimation, Robust Wasserstein distance, Robust generative adversarial networks.

arXiv: 2303.110540

1 Introduction

1.1 Related literature

Optimal transportation (OT) dates back to the work of [43], who considered the problem of finding the optimal way to move given piles of sand to fill up given holes of the same total volume. Monge’s problem remained open until it was revisited and solved by Kantorovich, who characterized the optimal transportation plan in relation to the economic problem of optimal allocation of resources. We refer to [57], [53] for a book-length presentation in mathematics, to [47] for a data science view, and to [46] for a statistical perspective.

Within the OT setting, the Wasserstein distance (Wp,p1W_{p},p\geq 1) is defined as the minimum cost of transferring the probability mass from a source distribution to a target distribution, for a given transfer cost function. Nowadays, OT theory and WpW_{p} are of fundamental importance for many scientific areas, including statistics, econometrics, information theory and machine learning. For instance, in machine learning, OT based inference is a popular tool in generative models; see e.g. [3], [23], [45] and references therein. Similarly, OT techniques are widely-applied for problems related to domain adaptation; see [15] and [4]. In statistics and econometrics, estimation methods based on WpW_{p} are related to minimum distance estimation (MDE); see [5] and [8]. MDE is a research area in continuous development and the estimators based on this technique are called minimum distance estimators. They are linked to M-estimators and may be robust. We refer to [29], [55] and [7] for a book-length statistical introduction. As far as econometrics is concerned, we refer to [34] and [30]. The extant approaches for MDE are usually defined via Kullback-Leibler (KL) divergence, Cressie-Read power divergence, total variation (TV) distance, Hellinger distance, to mention a few. Some of those approaches (e.g. Hellinger distance) require kernel density estimation of the underlying probability density function. Some others (e.g. KL divergence) have poor performance when the considered distributions do not have a common support. Compared to other notions of distance or divergence, WpW_{p} avoids the mentioned drawbacks, e.g. it can be applied when the considered distributions do not share the same support and inference is conducted in a generative fashion—namely, a sampling mechanism is considered, in which non-linear functions map a low dimensional latent random vector to a larger dimensional space; see [23]. This explains why WpW_{p} is a popular inference tool; see e.g. [15, 22, 35, 12, 47]; see [25] and [37] for a recent overview of technology transfers among different areas.

In spite of their appealing theoretical properties, OT and W_{p} have two important inference aspects to deal with: (i) implementation cost and (ii) lack of robustness. (i) Computing the solution to the OT problem, and hence W_{p}, can be demanding. To circumvent this problem, [16] proposes the use of an entropy-regularized version of the OT problem. The resulting divergence is called the Sinkhorn divergence; see [2] for further details. [23] illustrate the use of this divergence for inference in generative models, via MDE. The proposed estimation method is attractive in terms of performance and speed of computation. However (to the best of our knowledge), statistical guarantees for the minimum Sinkhorn divergence estimators have not been provided. (ii) Robustness is becoming a crucial aspect of many complex statistical and machine learning applications. The reason is essentially two-fold. First, large datasets (e.g. records of images or large-scale data collected over the internet) can be cluttered with outlying values (e.g. due to recording device failures). Second, every statistical model represents only an approximation to reality. Thus, inference procedures that remain informative even in the presence of small deviations from the assumed statistical (generative) model are in high demand, in statistics and in machine learning. Robust statistics deals with this problem. We refer to [29] and [51] for a book-length discussion on robustness principles.

Other papers have already mentioned the robustness issues of WpW_{p} and OT; see e.g. [1], [26],[27],[28] [18], [44], [60] and [50]. In the OT literature, [13] proposes to deal with outliers via unbalanced OT, namely allowing the source and target distribution to be non-standard probability distributions. This sensitivity is an undesired consequence of exactly satisfying the marginal constraints in OT transportation problem: intuitively, using Monge’s formulation, the piles of sand and the holes need to have the same total volume, thus the optimal transportation plan is affected by the outlying values inducing a high transportation cost. Relaxing the exact marginal constraints in a way that the transportation plan does not assign large weights to outliers yields the unbalanced OT theory: a mass variation is allowed in the OT problem (intuitively, the piles of sand and the holes do not need to have the same volume) and outlying values can be ignored by the transportation plan. Working along the lines of this approach, [4] define a robust optimal transportation (ROBOT) problem and study its applicability to machine learning. In the same spirit, starting from OT formulation with a regularization term which relaxes the marginal constraints, [44] derive a novel robust OT problem, that they label as ROBOT and they apply it to machine learning. Moreover, building on the Minimum Kantorovich Estimator (MKE) of [5], Mukherjee et al. propose a class of estimators obtained through minimizing the ROBOT and illustrate numerically their performance. In the same spirit, recently, [45] defined a notion of outlier-robust Wasserstein distance which allows for an ϵ\epsilon-percentage of outlier mass to be removed from contaminated distributions.

1.2 Our contributions

Working at the intersection between mathematics, probability, statistics, machine learning and econometrics, we study the theoretical, methodological and computational aspects of ROBOT. Our contributions to these different research areas can be summarized as follows.

We prove that the primal ROBOT problem of [44] defines a robust distance that we call the robust Wasserstein distance W^{(\lambda)}, which depends on a tuning parameter \lambda controlling the level of robustness. We study the measure-theoretic properties of W^{(\lambda)}: we prove that it induces a metric space and establish the separability and completeness of that space. Moreover, we derive concentration inequalities for the robust Wasserstein distance and illustrate their practical applicability for the selection of \lambda in the presence of outliers.

These theoretical (in mathematics and probability) findings are instrumental to another contribution (in statistics and econometrics) of this paper: the development of a novel inferential theory based on robust Wasserstein distance. Making use of convex analysis [49] and of the techniques in [8], we prove that the minimum robust Wasserstein distance estimators exist (almost surely) and are consistent (at the reference model, possibly misspecified). Numerical experiments show that the robust Wasserstein distance estimators remain reliable even in the presence of local departures from the postulated statistical model. These results complement the methodological investigations already available in the literature on minimum distance estimation (in particular the MKE of [5] and [8]) and on ROBOT ([4] and [44]). Indeed, beside the numerical evidence available in the mentioned papers, there are (to the best of our knowledge) no statistical guarantees on the estimators obtained by the method of [44] and [4]. Moreover, our results extend the use of OT to conduct reliable inference on random variables with infinite moments.

Another contribution is the derivation of the dual form of ROBOT for application to machine learning problems. This result not only completes the theory in [44], but also provides the stepping stone for the implementation of ROBOT: we explain how our duality can be applied to define robust Wasserstein Generative Adversarial Networks (RWGAN). Numerical exercises provide evidence that RWGAN have better performance than the extant Wasserstein GAN (WGAN) models. Moreover, we illustrate how our ROBOT can be applied to domain adaptation, another popular machine learning problem; see Section 5.4 and Appendix .4.1.

Finally, as far as the approach in [45] is concerned, we remark that their notion of robust Wasserstein distance has a similar spirit to our W^{(\lambda)} (both are based on a total variation penalization of the original OT problem, but yield different duality formulations) and, similarly to our results in §5.3, it is proved to be useful in generative adversarial networks (GAN). In spite of these similarities, our investigation goes deeper into the probabilistic and statistical aspects related to inference methods based on W^{(\lambda)}. In this regard, some key differences between [45] and our paper are that Nietert et al.: (a) do not study concentration inequalities for their distance; (b) do not provide indications on how to select and control the \epsilon-percentage of outlier mass ignored by the OT solution (for a numerical illustration of this aspect in the setting of GAN, see Section 5.3); (c) do not define the estimators related to the minimization of their outlier-robust Wasserstein distance; (d) use regularization in the dual and a constraint in the primal, whereas ROBOT does the reverse: this simplifies our derivation of the dual problem. In the next sections we take care of all these and other aspects for the ROBOT, providing the theoretical and the applied statistician with a novel toolkit for robust inference via OT.

2 Optimal transport

2.1 Classical OT

We briefly recall the fundamental notions of OT as formulated by Kantorovich; more details are available in Appendix .1. Let \mathcal{X} and \mathcal{Y} denote two Polish spaces, and let \mathcal{P}(\mathcal{X}) represent the set of all probability measures on \mathcal{X}. Let \Pi(\mu,\nu) denote the set of all joint probability measures with marginals \mu\in\mathcal{P(X)} and \nu\in\mathcal{P(Y)}. Kantorovich's problem aims at finding a joint distribution \pi\in\Pi(\mu,\nu) which minimizes the expectation of the coupling between X and Y in terms of the cost function c. This problem can be formulated as

\inf\left\{\int_{\mathcal{X\times Y}}c(x,y)\,d\pi(x,y):\pi\in\Pi(\mu,\nu)\right\}. (1)

A solution to Kantorovich's problem (KP) (1) is called an optimal transport plan. Note that the problem is convex, and its solution exists under some mild assumptions on c, e.g. lower semicontinuity; see e.g. [57, Chapter 4].

The KP in (1) has a dual form, the Kantorovich dual (KD) problem

\sup\left\{\int_{\mathcal{Y}}\phi\, d\nu-\int_{\mathcal{X}}\psi\, d\mu:\phi\in C_{b}(\mathcal{Y}),\psi\in C_{b}(\mathcal{X})\right\}\quad\text{s.t.}\quad\phi(y)-\psi(x)\leq c(x,y),

for all (x,y), where C_{b}(\mathcal{X}) is the set of bounded continuous functions on \mathcal{X}. According to Th. 5.10 in [57], if the function c is lower semicontinuous, there exists a solution to the dual problem such that the solutions to KD and KP coincide (no duality gap). In this case, the solution is

\phi(y)=\inf\limits_{x\in\mathcal{X}}[\psi(x)+c(x,y)]\quad\text{and}\quad\psi(x)=\sup\limits_{y\in\mathcal{Y}}[\phi(y)-c(x,y)],

where the functions \phi and \psi are called c-concave and c-convex, respectively, and \phi (resp. \psi) is called the c-transform of \psi (resp. \phi). When c is a metric on \mathcal{X}, the OT problem becomes \sup\left\{\int_{\mathcal{X}}\psi\, d\nu-\int_{\mathcal{X}}\psi\, d\mu:\psi\,\text{is 1-Lipschitz continuous}\right\}, which is the Kantorovich-Rubinstein duality. Now, let (\mathcal{X},d) denote a complete metric space equipped with a metric d:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}, and let \mu and \nu be two probability measures on \mathcal{X}. Solving the optimal transport problem in (1), with the cost function c(x,y)=d^{p}(x,y), yields a distance, called the Wasserstein distance of order p (p\geq 1):

W_{p}(\mu,\nu)=\left(\inf\int_{\mathcal{X\times X}}(d(x,y))^{p}\,d\pi(x,y)\right)^{1/p}. (2)

With regard to (2), we emphasize that the ability to lift the ground distance d(x,y) is one of the perks of W_{p} and makes it a suitable tool in statistics and machine learning; see Remark 2.18 in [47]. Interestingly, this desirable feature becomes a negative aspect as far as robustness is concerned. Intuitively, OT embeds the geometry of the distributions: when the underlying distribution is contaminated by outliers, the marginal constraints force OT to transport outlying values, inducing an undesirable extra cost, which in turn entails large changes in W_{p}. In more detail, let \delta_{x_{0}} be the point mass measure centered at x_{0} (a point in the sample space), and let d(x,y) be the ground distance. We introduce the O\cup I framework, in which the majority of the data (X_{i})_{i\in I} are inliers, that is, i.i.d. informative data with distribution \mu, whilst a few data (X_{i})_{i\in O} are outliers, meaning that they are not i.i.d. with distribution \mu. We refer to [39] (and references therein) for more detail on O\cup I and for a discussion of its connection to the Huber gross-error neighbourhood. Then, for \varepsilon>0, we have W_{p}(\mu,(1-\varepsilon)\mu+\varepsilon\delta_{x_{0}})\rightarrow\infty as x_{0}\rightarrow\infty. Thus, W_{p} can be arbitrarily inflated by a small number of large observations and it is sensitive to outliers. For instance, let us consider W_{p}, with p=1,2. In Figures 1(a) and 1(b), we display the scatter plots for two bivariate distributions: the distribution in panel (b) has some abnormal points (contaminated sample) compared to the distribution in panel (a) (clean sample). Although only a small fraction of the sample data is contaminated, both W_{1} and W_{2} display a large increase. Anticipating some of the results that we derive in the next pages, we compute our W^{(\lambda)} (see §3.1) and notice that its value remains roughly the same in both panels. We refer to §5.1 for additional insights on the resistance to outliers of W^{(\lambda)}.

Figure 1: Wasserstein distance (W_{1} and W_{2}) and robust Wasserstein distance (W^{(\lambda)}, with \lambda=3) between two bivariate distributions. Panel (a): W_{1}=3.0, W_{2}=3.0 and W^{(\lambda)}=2.9 (\lambda=3). Panel (b): W_{1}=5.3, W_{2}=6.6 and W^{(\lambda)}=3.3 (\lambda=3). The scatter plot of data in panel (a) represents a sample from the reference model: cross points (blue) are generated from \mathcal{N}\left(\binom{-1}{-1},I_{2}\right), points (orange) are generated from \mathcal{N}\left(\binom{1}{1},I_{2}\right). The plot in panel (b) contains some outliers: cross points (blue) are generated from 0.8\mathcal{N}\left(\binom{-1}{-1},I_{2}\right)+0.2\mathcal{N}\left(\binom{-9}{-9},I_{2}\right), points (orange) are generated from \mathcal{N}\left(\binom{1}{1},I_{2}\right). The marginal distributions are plotted on the x- and y-axes.

2.2 Robust OT (ROBOT)

We recall the primal formulation of the ROBOT problem, as defined in [44], who introduce a modified version of the Kantorovich OT problem and build on the notion of unbalanced OT. Specifically, they consider a TV-regularized OT re-formulation

\begin{array}{ll}&\min_{\pi,s}\iint c(x,y)\pi(x,y)\,dx\,dy+\lambda\|s\|_{\mathrm{TV}}\\ \text{s.t.}&\int\pi(x,y)\,dy=\mu(x)+s(x)\geq 0\\ &\int\pi(x,y)\,dx=\nu(y)\\ &\int s(x)\,dx=0,\end{array} (3)

where \lambda>0 is a regularization parameter. In (3), the original source measure \mu is modified by adding s (nevertheless, the first and last constraints ensure that \mu+s is still a valid probability measure). Intuitively, having \mu(x)+s(x)=0 means that x\in{\cal X} has a strong impact on the OT problem and hence can be labelled as an outlier. The outlier is eliminated from the sample, since the probability measure \mu+s at this point is zero.

[44] prove that a simplified, computationally efficient formulation equivalent to (3) is

\inf\left\{\int_{\mathcal{X\times X}}c_{\lambda}(x,y)\,d\pi(x,y):\pi\in\Pi(\mu,\nu)\right\}, (4)

which has a functional form similar to (1), but the cost function c is replaced by the trimmed cost function c_{\lambda}=\min\left\{c,2\lambda\right\}, which is bounded from above by 2\lambda. Following [44], we call formulations (3) and (4) ROBOT. Besides the primal formulation, in the following theorem we derive the dual form of the ROBOT, which is not available in [44]. We need this duality for the development of our inferential approach and, for ease of reference, we state it in the form of a theorem—the proof can be obtained from Th. 5.10 in [57] and is available in Appendix .2.

Theorem 2.1.

Let \mu,\nu be probability measures in \mathcal{P(X)}, with c_{\lambda}(x,y) as in (6). Then the c-transform of a c-convex function \psi(x) is itself, i.e. \psi^{c}(x)=\psi(x). Moreover, the dual form of ROBOT is related to the Kantorovich potential \psi, which is a solution to

\sup\left\{\int_{\mathcal{X}}\psi\, d\mu-\int_{\mathcal{X}}\psi\, d\nu:\psi\in C_{b}(\mathcal{X})\right\}, (5)

where \psi satisfies |\psi(x)-\psi(y)|\leq d(x,y) and \mathrm{range}(\psi)\leq 2\lambda.

Note that the difference between the dual form of OT and that of ROBOT is that the latter imposes a restriction on the range of \psi. Thanks to this, we may use ROBOT in theoretical proofs and inference problems where the dual form of OT is already applied, with the advantage that ROBOT remains stable even in the presence of outliers. In the next sections, we illustrate these ideas and begin the construction of our inference methodology by proving that ROBOT induces a robust distance, which will be the pivotal tool of our approach.
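Anticipating the machine learning applications of §5.3, the sketch below shows one possible way (our own illustration, not necessarily the architecture used in our experiments) to impose the two constraints of Th. 2.1 on a neural-network critic \psi: a \lambda\tanh(\cdot) output layer guarantees \mathrm{range}(\psi)\leq 2\lambda, while the 1-Lipschitz condition is enforced only approximately, via WGAN-style weight clipping. The class and method names are hypothetical.

```python
# Hedged sketch: a critic psi compatible with the dual constraints of Th. 2.1.
# range(psi) <= 2*lambda is exact (tanh output); Lipschitz control is approximate.
import torch
import torch.nn as nn

class RobustCritic(nn.Module):
    def __init__(self, dim_in, lam, hidden=64):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # tanh keeps psi(x) in (-lam, lam), hence range(psi) <= 2*lam
        return self.lam * torch.tanh(self.net(x))

    def clip_weights(self, c=0.01):
        # crude Lipschitz control, borrowed from WGAN weight clipping
        for p in self.net.parameters():
            p.data.clamp_(-c, c)

# The dual objective (5) is then estimated on mini-batches X ~ mu and Y ~ nu as
# critic(X).mean() - critic(Y).mean(), and maximized over the critic parameters.
critic = RobustCritic(dim_in=2, lam=3.0)
scores = critic(torch.randn(8, 2))
```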

3 Robust Wasserstein distance W^{(\lambda)}

3.1 Key notions

Setting c(x,y)=d(x,y), (2) implies that the corresponding minimal cost W_{1}(\mu,\nu) is a metric between two probability measures \mu and \nu. A similar result can be obtained also in the ROBOT setting of (4), if the modified cost function takes the form

c_{\lambda}(x,y):=\min\left\{d(x,y),2\lambda\right\}. (6)
Lemma 3.1.

Let d(\cdot,\cdot) denote a metric on {\cal X}. Then c_{\lambda}(x,y) in (6) is also a metric on {\cal X}.

Based on Lemma 3.1, we define the robust Wasserstein distance and prove that it is a metric on \mathcal{P(X)}.

Theorem 3.2.

For c_{\lambda}(x,y),\ x,y\in{\cal X}, defined in (6),

W^{(\lambda)}(\mu,\nu):=\inf\limits_{\pi\in\Pi(\mu,\nu)}\left\{\int_{\mathcal{X\times X}}c_{\lambda}(x,y)\,d\pi(x,y)\right\} (7)

is a metric on \mathcal{P(X)}, and we call it the robust Wasserstein distance.

It is worth noting that W_{p} is well defined only if the probability measures have finite p-th order moments. In contrast, thanks to the boundedness of c_{\lambda}, our W^{(\lambda)} in (7) can be applied to all probability measures, even those with infinite moments of any order. Making use of W^{(\lambda)}, we introduce the following

Definition 3.3 (Robust Wasserstein space).

A robust Wasserstein space is an ordered pair (\mathcal{P(X)},W^{(\lambda)}), where \mathcal{P(X)} is the set of all probability measures on a complete separable metric space \mathcal{X}, and W^{(\lambda)}, defined in Th. 3.2, is a (finite) distance on \mathcal{P(X)}.
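To make the definition concrete, the following minimal sketch (our own illustration, not the authors' code) approximates W^{(\lambda)} between two empirical measures by clipping the ground-cost matrix at 2\lambda, as suggested by formulation (4), and then solving a standard discrete OT problem. It assumes the POT package (`ot`); the helper names robust_w and w1 and the sample sizes are ours. On samples mimicking Figure 1, it should reproduce the qualitative behaviour reported there: W_1 inflates under contamination while W^{(\lambda)} barely moves.

```python
# Minimal sketch: W^(lambda) between empirical measures via a clipped cost matrix.
import numpy as np
import ot  # Python Optimal Transport (POT)

def robust_w(X, Y, lam):
    """Approximate W^(lambda) between the empirical measures of samples X and Y."""
    a, b = ot.unif(len(X)), ot.unif(len(Y))        # uniform weights
    M = ot.dist(X, Y, metric='euclidean')          # ground distance d(x, y)
    return ot.emd2(a, b, np.minimum(M, 2 * lam))   # trimmed cost c_lambda = min(d, 2*lambda)

def w1(X, Y):
    """Classical W_1 for comparison (cost c = d)."""
    a, b = ot.unif(len(X)), ot.unif(len(Y))
    return ot.emd2(a, b, ot.dist(X, Y, metric='euclidean'))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    clean = rng.normal([-1, -1], 1.0, size=(n, 2))             # inliers
    target = rng.normal([1, 1], 1.0, size=(n, 2))
    outliers = rng.normal([-9, -9], 1.0, size=(n // 5, 2))
    contaminated = np.vstack([clean[: n - n // 5], outliers])  # ~20% contamination
    for label, X in [("clean", clean), ("contaminated", contaminated)]:
        print(label, "W1 =", round(w1(X, target), 2),
              " W^(lambda) =", round(robust_w(X, target, lam=3.0), 2))
    # As lambda grows, W^(lambda) approaches W_1 (cf. Th. 3.4 below).
```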

3.2 Some measure theoretic properties

In this section we state, briefly, some preliminary results. The proofs (which are based on [57]) of these results are available in Appendix .2, to which we refer the reader interested in the mathematical details. Here, we provide essentially the key aspects, which we prefer to state in the form of theorems and corollary for the ease of reference in the theoretical developments available in the next sections.

Equipped with Th. 3.2 and Definition 3.3, it is interesting to connect W_{1} to W^{(\lambda)}, via the characterization of its limiting behavior as \lambda\rightarrow\infty. Since W^{(\lambda)} is continuous and monotonically non-decreasing with respect to \lambda, one can prove (via the dominated convergence theorem) that its limit coincides with the Wasserstein distance of order p=1. Thus:

Theorem 3.4.

For any probability measures \mu and \nu in {\cal P}({\cal X}), W^{(\lambda)} is continuous and monotonically non-decreasing with respect to \lambda\in[0,\infty). Moreover, if W_{1}(\mu,\nu) exists, we have

\lim_{\lambda\rightarrow\infty}W^{(\lambda)}(\mu,\nu)=W_{1}(\mu,\nu).

Th. 3.4 has a two-fold implication: first, W^{(\lambda)}\leq W_{1}; second, for large values of the regularization parameter, W^{(\lambda)} and W_{1} behave similarly. Another important property of W^{(\lambda)} is that weak convergence is entirely characterized by convergence in the robust Wasserstein distance. More precisely, we have

Theorem 3.5.

Let \left(\mathcal{X},d\right) be a Polish space. Then W^{(\lambda)} metrizes the weak convergence in \mathcal{P(X)}. In other words, if \left(\mu_{k}\right)_{k\in\mathbb{N}} is a sequence of measures in \mathcal{P(X)} and \mu is another measure in \mathcal{P(X)}, then weak convergence of \left(\mu_{k}\right)_{k\in\mathbb{N}} to \mu in \mathcal{P(X)} and W^{(\lambda)}\left(\mu_{k},\mu\right)\rightarrow 0 are equivalent.

The next result follows immediately from Theorem 3.5.

Corollary 3.6 (Continuity of W(λ)W^{(\lambda)}).

For a Polish space (\mathcal{X},d), suppose that \mu_{k} (resp. \nu_{k}) converges weakly to \mu (resp. \nu) in \mathcal{P(X)} as k\rightarrow\infty. Then

W^{(\lambda)}(\mu_{k},\nu_{k})\rightarrow W^{(\lambda)}(\mu,\nu).

Finally, we prove the separability and completeness of the robust Wasserstein space, when \mathcal{X} is separable and complete. To this end, we state

Theorem 3.7.

Let \mathcal{X} be a complete separable metric space. Then the space \mathcal{P(X)}, metrized by the robust Wasserstein distance W^{(\lambda)}, is complete and separable. Moreover, the set of finitely supported measures with rational coefficients is dense in \mathcal{P(X)}.

We remark that our dual form in Th. 2.1 unveils a connection between W^{(\lambda)} and the Bounded Lipschitz (BL) metric discussed in [19, p.41 and p.42]: the two metrics coincide for a suitable selection of \lambda. In this regard, we emphasize that, since the BL metric metrizes the weak-star topology, Th. 3.2 and Th. 3.5 are in line with the results in Section 3 of [19].

3.3 Some novel probabilistic aspects

The previous results offer the mathematical toolkit needed to derive some concentration inequalities for W^{(\lambda)}, building on the extant results for W_{1}. To illustrate this aspect, let us consider the case of the Talagrand transportation inequality T_{p}. We recall that, using W_{p} as a distance between probability measures, transportation inequalities state that, given \alpha>0, a probability measure \mu on \mathcal{X} satisfies T_{p}(\alpha) if the inequality

W_{p}(\mu,\nu)\leq\sqrt{2H(\mu|\nu)/\alpha}

holds for any probability measure \nu, with H(\mu|\nu) being the Kullback-Leibler divergence. For p=1, it has been proven that T_{1}(\alpha) is equivalent to the existence of a square-exponential moment; see e.g. [10]. Now, note that, from Th. 3.4, we have W^{(\lambda)}\leq W_{1}: we can immediately claim that if T_{1}(\alpha) holds, then we have

W^{(\lambda)}(\mu,\nu)\leq\sqrt{2H(\mu\mid\nu)/\alpha}.

Thus, we state

Theorem 3.8.

Let \mu be a probability measure on \mathbb{R}^{d}, which satisfies a T_{1}(\alpha) inequality, and let \{X_{1},...,X_{n}\} be a random sample of independent variables, all distributed according to \mu; let also \hat{\mu}_{n}:=n^{-1}\sum_{i=1}^{n}\delta_{X_{i}} be the associated empirical measure, where \delta_{x} is the Dirac distribution with mass on x\in\mathbb{R}^{d}. Then, for any d^{\prime}>d and \alpha^{\prime}<\alpha, there exists some constant n_{0}, depending only on \alpha^{\prime},d^{\prime} and some square-exponential moment of \mu, such that for any \varepsilon>0 and

n\geq n_{0}\max\left(\varepsilon^{-\left(d^{\prime}+2\right)},1\right),

we have

{\mathbb{P}}\left[W^{(\lambda)}\left(\mu,\hat{\mu}_{n}\right)>\varepsilon\right]\leq\exp\{-0.5\alpha^{\prime}n\varepsilon^{2}\}.

The proof follows along the same lines as the proof of Th. 2.1 in Bolley et al., combined with Theorem 2.1, which yields

\exp\{-0.5\alpha^{\prime}n\varepsilon^{2}\}\geq{\mathbb{P}}\left[W_{1}\left(\mu,\hat{\mu}_{n}\right)>\varepsilon\right]\geq{\mathbb{P}}\left[W^{(\lambda)}(\mu,\hat{\mu}_{n})>\varepsilon\right]. (8)

We remark that the condition n\geq n_{0}\max(\varepsilon^{-\left(d^{\prime}+2\right)},1) implies that Th. 3.8 has an asymptotic nature. Nevertheless, Bolley et al. prove that

{\mathbb{P}}\left[W_{1}\left(\mu,\hat{\mu}_{n}\right)>\varepsilon\right]\leq C(\epsilon)\exp\{-0.5\alpha^{\prime}n\varepsilon^{2}\}

holds for any n, with C(\epsilon) being a (large) constant depending on \epsilon. The argument in (8) implies

{\mathbb{P}}\left[W^{(\lambda)}\left(\mu,\hat{\mu}_{n}\right)>\varepsilon\right]\leq C(\epsilon)\exp\{-0.5\alpha^{\prime}n\varepsilon^{2}\}

for any n\in\mathbb{N}. This yields a concentration inequality which is valid for any sample size.

We obtained the results in Th. 3.8 using an upper bound via W_{1}. The simplicity of our argument comes with a cost: for Th. 3.8 to hold, the underlying measure \mu needs to satisfy some conditions required for W_{1}, e.g. some moment restrictions. To obtain bounds that do not need these stringent conditions, in the next theorem we derive novel non-asymptotic upper bounds working directly on W^{(\lambda)}. The techniques that we apply to obtain our new results put under the spotlight the role of \lambda in controlling the concentration in the O\cup I framework.

Theorem 3.9.

Assume that X_{1},\ldots,X_{n} are generated according to the O\cup I setting and let \mu denote the common distribution of the inliers. Let \hat{\mu}_{n}^{(I)}=|I|^{-1}\sum_{i\in I}\delta_{X_{i}} denote the empirical distribution based on the inliers. Then, for any t>0 and when O=\varnothing, we have

\mathbb{P}\Big(|W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})-{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})]|>\sigma\sqrt{\tfrac{2t}{|I|}}+\tfrac{4\lambda t}{|I|}\Big)\leqslant 2\exp(-t), (9)

where the variance term is

\sigma^{2}=|I|^{-1}\sum_{i\in I}{\rm E}[\min(d^{2}(X_{i},X_{i}^{\prime}),(2\lambda)^{2})]

and X_{i}^{\prime} is an independent copy of X_{i}. When O\neq\varnothing, we have that, for any t>0,

\mathbb{P}\Big(\Big|W^{(\lambda)}(\mu,\hat{\mu}_{n})-\tfrac{|I|}{n}{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})]\Big|>\sigma\sqrt{\tfrac{|I|}{n}}\sqrt{\tfrac{2t}{n}}+\tfrac{4\lambda t}{n}+\tfrac{2\lambda|O|}{n}\Big)\leqslant 2\exp(-t). (10)

Some remarks are in order. (i) In the case where O=\varnothing, (9) holds without any assumption on \mu, as the variance term \sigma^{2} is always bounded, even without assuming a finite second moment of X_{i}. This strongly relaxes the exponential square moment assumption required for W_{1}(\mu,\hat{\mu}_{n}) in Th. 3.8.

(ii) (10) illustrates that the presence of contamination entails the extra term 2\lambda|O|/n: one can make use of the tuning parameter \lambda to mitigate the influence of outlying values.

(iii) In (10), it is always possible to replace the term

{|I|}\,{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})]/n

by either {\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})] or {\rm E}[W^{(\lambda)}(\mu,\hat{\mu}^{\prime}_{n})], where \hat{\mu}_{n}^{\prime}=n^{-1}\sum_{i=1}^{n}\delta_{X_{i}^{\prime}} and X_{1}^{\prime},\ldots,X_{n}^{\prime} are i.i.d. with common distribution \mu. This can be proved by noticing that it is possible to bound W^{(\lambda)}(\mu,\hat{\mu}_{n}) exploiting the fact that c_{\lambda} is bounded by 2\lambda, so Th. 2.1 implies

\frac{|I|}{n}W^{(\lambda)}(\mu,\hat{\mu}_{n}^{(I)})-\frac{2\lambda|O|}{n}\leqslant W^{(\lambda)}(\mu,\hat{\mu}_{n})\leqslant\frac{|I|}{n}W^{(\lambda)}(\mu,\hat{\mu}_{n}^{(I)})+\frac{2\lambda|O|}{n}. (11)

(iv) Delving more into the link between (10) and (11), consider that, for any \psi\in C_{b}(\mathcal{X}),

\int_{\mathcal{X}}\psi\, d\hat{\mu}_{n}-\int_{\mathcal{X}}\psi\, d\mu
=\frac{1}{n}\sum_{i=1}^{n}\Big\{\psi(X_{i})-\int_{\mathcal{X}}\psi\, d\mu\Big\}
=\frac{1}{n}\Big(\sum_{i\in I}\Big\{\psi(X_{i})-\int_{\mathcal{X}}\psi\, d\mu\Big\}+\sum_{i\in O}\Big\{\psi(X_{i})-\int_{\mathcal{X}}\psi\, d\mu\Big\}\Big)
=\frac{1}{n}\Big(|I|\Big\{\int_{\mathcal{X}}\psi\, d\hat{\mu}^{(I)}_{n}-\int_{\mathcal{X}}\psi\, d\mu\Big\}+\sum_{i\in O}\Big\{\psi(X_{i})-\int_{\mathcal{X}}\psi\, d\mu\Big\}\Big).

As \text{range}(\psi)\leqslant 2\lambda, for any i\in O, it holds that

\Big|\psi(X_{i})-\int_{\mathcal{X}}\psi\, d\mu\Big|\leqslant 2\lambda.

Therefore, we have, for any \psi\in C_{b}(\mathcal{X}),

\Big|\Big\{\int_{\mathcal{X}}\psi\, d\hat{\mu}_{n}-\int_{\mathcal{X}}\psi\, d\mu\Big\}-\frac{|I|}{n}\Big\{\int_{\mathcal{X}}\psi\, d\hat{\mu}^{(I)}_{n}-\int_{\mathcal{X}}\psi\, d\mu\Big\}\Big|\leqslant\frac{2\lambda|O|}{n}.

Taking the suprema over \psi in C_{b}(\mathcal{X}) yields (11). From (11) and the triangle inequality, we have that (adding and subtracting \frac{|I|}{n}W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n}))

\Big|W^{(\lambda)}(\mu,\hat{\mu}_{n})-\frac{|I|}{n}{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})]\Big|\leqslant\frac{|I|}{n}\Big|W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})-{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}^{(I)}_{n})]\Big|+\frac{2\lambda|O|}{n}.

Inequality (10) follows by plugging (9) into this last bound. Finally, combining (10) and (11), we obtain

\mathbb{P}\Big(\big|W^{(\lambda)}(\mu,\hat{\mu}_{n})-{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})]\big|>\sigma\sqrt{\tfrac{|I|}{n}}\sqrt{\tfrac{2t}{n}}+\tfrac{4\lambda t}{n}+\tfrac{4\lambda|O|}{n}\Big)\leqslant 2\exp(-t). (12)

Besides the concentration inequalities in Th. 3.9, we consider the problem of providing an upper bound for {\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n}^{(I)})]. Following [9], we call this quantity the mean rate of convergence. For the sake of exposition, we assume that all data are inliers, that is, X_{1},\ldots,X_{n} are i.i.d. with common distribution \mu—the case with outliers can be obtained using (11). As in Th. 3.8, we notice that it is always possible to use W^{(\lambda)}\leqslant W_{1} from Th. 3.4 and derive bounds for {\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})] applying the results for {\rm E}[W_{1}(\mu,\hat{\mu}_{n})]; see e.g. [9], [21, 40, 20]. Besides this option, here we derive novel results for {\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})]. In what follows, we assume that \mu is supported in B(0,K), the Euclidean ball of \mathbb{R}^{d} with radius K, and that the reference distance is the Euclidean distance in \mathbb{R}^{d}. Then we state:

Theorem 3.10.

Let \mu denote a measure on B(0,K), the Euclidean ball of \mathbb{R}^{d} with radius K. Let X_{1},\ldots,X_{n} denote an i.i.d. sample from \mu and \hat{\mu}_{n} denote the empirical measure. Then, we have

{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})]\leqslant\begin{cases}C\sqrt{\frac{K(K\wedge\lambda)}{n}}&\text{ when }d=1,\\ \frac{CK}{\sqrt{n}}\log\big(\frac{\sqrt{n}}{K}\big)&\text{ when }d=2,\\ \frac{CK}{n^{1/d}}&\text{ when }d\geqslant 3.\end{cases}

The bound is obtained using (i) the duality formula derived in Th. 2.1, (ii) an extension of Dudley's inequality in the spirit of [9], and (iii) a computation of the entropy number of a set of 1-Lipschitz functions defined on the ball B(0,K) and taking values in [-\lambda,\lambda]. The technical details are available in Appendix .2. Here, we remark on the key role of \lambda when d=1 (univariate case), which allows us to bound the mean rate of convergence. For d\geq 2, our bounds could also have been obtained from [9]. However, the bounds obtained in that way can be substantially larger than our upper bounds when \lambda\ll K: Th. 3.10 provides a refinement (related to the boundedness of c_{\lambda}) of the extant bounds; see also Proposition 9 in [45] for a similar result.

4 Inference via minimum W^{(\lambda)} estimation

4.1 Estimation method

Let us consider a probability space (\Omega,\mathcal{F},{\rm P}). On this probability space, we define random variables taking values in \mathcal{X}\subset\mathbb{R}^{d}, d\geq 1, endowed with the Borel \sigma-algebra. We observe n\in\mathbb{N} i.i.d. data points \{X_{1},\ldots,X_{n}\}, which are distributed according to \mu_{\star}^{(n)}\in\mathcal{P}\left(\mathcal{X}^{n}\right). A parametric statistical model on \mathcal{X}^{n} is denoted by \{\mu_{\theta}^{(n)}\}_{\theta\in\Theta} and it is a collection of probability distributions indexed by a parameter \theta of dimension d_{\theta}. The parameter space is \Theta\subset\mathbb{R}^{d_{\theta}}, d_{\theta}\geq 1, which is equipped with a distance \rho_{\Theta}. We let Z_{1:n} represent the observations from \mu_{\theta}^{(n)}. For every \theta\in\Theta, the sequence \left(\hat{\mu}_{\theta,n}\right)_{n\geq 1} of random probability measures on \mathcal{X} converges (in some sense) to a distribution \mu_{\theta}\in\mathcal{P}(\mathcal{X}), where

\hat{\mu}_{\theta,n}=n^{-1}\sum_{i=1}^{n}\delta_{Z_{i}}

with Z_{1:n}\sim\mu_{\theta}^{(n)}. Similarly, we will assume that the empirical measure \hat{\mu}_{n} converges (in some sense) to some distribution \mu_{\star}\in\mathcal{P}(\mathcal{X}) as n\rightarrow\infty. We say that the model is well specified if there exists \theta_{\star}\in\Theta such that \mu_{\star}\equiv\mu_{\theta_{\star}}; otherwise, it is misspecified. Parameters are identifiable: \mu_{\theta_{1}}=\mu_{\theta_{2}} implies \theta_{1}=\theta_{2}.

Our estimation method relies on selecting the parametric model in \{\mu_{\theta}^{(n)}\}_{\theta\in\Theta} which is the closest, in robust Wasserstein distance, to the true model \mu_{\star}. Thus, minimum W^{(\lambda)} estimation refers to the minimization, over the parameter \theta\in\Theta, of the robust Wasserstein distance between the empirical distribution \hat{\mu}_{n} and the reference model distribution \mu_{\theta}. This is similar to the approach described in [5] and [8], who derive minimum Kantorovich estimators by making use of W_{p}.

More formally, the minimum robust Wasserstein estimator (MRWE) is defined as

\hat{\theta}_{n}^{\lambda}=\underset{\theta\in\Theta}{\operatorname{argmin}}\,W^{(\lambda)}\left(\hat{\mu}_{n},\mu_{\theta}\right). (13)

When there is no explicit expression for the probability measure characterizing the parametric model (e.g. in complex generative models; see [23] and §5.2 for some examples), the computation of the MRWE can be difficult. To cope with this issue, in lieu of (13), we propose using the minimum expected robust Wasserstein estimator (MERWE) defined as

\hat{\theta}_{n,m}^{\lambda}=\underset{\theta\in\Theta}{\operatorname{argmin}}\,{\rm E}_{m}\left[W^{(\lambda)}\left(\hat{\mu}_{n},\hat{\mu}_{\theta,m}\right)\right], (14)

where the expectation {\rm E}_{m} is taken over the distribution \mu^{(m)}_{\theta}. To implement the MERWE one can rely on Monte Carlo methods and approximate {\rm E}_{m}[W^{(\lambda)}(\hat{\mu}_{n},\hat{\mu}_{\theta,m})] numerically. Replacing the robust Wasserstein distance with W_{p} in (14), one obtains the minimum Wasserstein estimator (MWE) and the minimum expected Wasserstein estimator (MEWE), studied in [8].

4.2 Statistical guarantees

Intuitively, the consistency of the MRWE and MERWE can be conceptualized as follows. We expect the empirical measure to converge to \mu_{\star}, in the sense that W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})\rightarrow 0 as n\rightarrow\infty; see Th. 3.5. Therefore, the minimizer of \theta\mapsto W^{(\lambda)}(\hat{\mu}_{n},\mu_{\theta}) should converge to

\theta^{\ast}=\arg\min_{\theta\in\Theta} W^{(\lambda)}(\mu_{\star},\mu_{\theta}),

assuming its existence and uniqueness. The same can be said for the MERWE, provided that m\rightarrow\infty. If the reference parametric model is correctly specified (e.g. no data contamination), \theta^{\ast} is the limiting object of interest and it is the minimizer of W^{(\lambda)}(\mu_{\star},\mu_{\theta}). In the case of model misspecification (e.g. wrong parametric form of \mu_{\theta} and/or presence of data contamination), \theta^{\ast} is not necessarily the parameter that minimizes the KL divergence between the empirical measure and the measure characterizing the reference model. We emphasize that this is at odds with the standard misspecification theory (see e.g. [59]) and it is due to the fact that we replace the KL divergence (which yields non-robust estimators) with our novel robust Wasserstein distance.

To formalize these arguments, we introduce the following set of assumptions, which are standard in the literature on MKE; see [8].

Assumption 4.1.

The data-generating process is such that W^{(\lambda)}\left(\hat{\mu}_{n},\mu_{\star}\right)\rightarrow 0, {\rm P}-almost surely, as n\rightarrow\infty.

Assumption 4.2.

The map \theta\mapsto\mu_{\theta} is continuous in the sense that \rho_{\Theta}\left(\theta_{n},\theta\right)\rightarrow 0 implies that \mu_{\theta_{n}} converges to \mu_{\theta} weakly as n\rightarrow\infty.

Assumption 4.3.

For some ε>0\varepsilon>0,

B_{\star}(\varepsilon)=\left\{\theta\in\Theta:W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right)\leq\varepsilon_{\star}+\varepsilon\right\},

with \varepsilon_{\star}=\inf_{\theta\in\Theta}W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right), is a bounded set.

Then we state the following

Theorem 4.4 (Existence of MRWE).

Under Assumptions 4.1, 4.2 and 4.3, there exists a set {\cal A}\subset\Omega with {\rm P}({\cal A})=1 such that, for all \omega\in{\cal A},

\inf_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right)\rightarrow\inf_{\theta\in\Theta}W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right)

and there exists n(\omega) such that, for all n\geq n(\omega), the sets

\operatorname{argmin}_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right)

are non-empty and form a bounded sequence with

\limsup_{n\rightarrow\infty}\underset{\theta\in\Theta}{\operatorname{argmin}}\,W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right)\subset\underset{\theta\in\Theta}{\operatorname{argmin}}\,W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right).

To prove Th. 4.4 we need to show that the sequence of functions \theta\mapsto W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right) epi-converges (see Definition .5 in Appendix .2) to \theta\mapsto W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right). The result follows from [49]. We remark that Th. 4.4 generalizes the results of [5] and [8], where the model is assumed to be well specified ([5]) and moments of order p\geq 1 are needed ([8]).

Moving along the same lines as in Th. 2.1 in [8], we may prove the measurability of MRWE; see Th. .6 in Appendix A. These results for the MRWE provide the stepping stone to derive similar theorems for the MERWE θ^n,mλ\hat{\theta}_{n,m}^{\lambda}. To this end, the following assumptions are needed.

Assumption 4.5.

For any m\geq 1, if \rho_{\Theta}\left(\theta_{n},\theta\right)\rightarrow 0, then \mu_{\theta_{n}}^{(m)} converges to \mu_{\theta}^{(m)} weakly as n\rightarrow\infty.

Assumption 4.6.

If \rho_{\Theta}\left(\theta_{n},\theta\right)\rightarrow 0, then {\rm E}_{n}W^{(\lambda)}\left(\mu_{\theta_{n}},\hat{\mu}_{\theta_{n},n}\right)\rightarrow 0 as n\rightarrow\infty.

Theorem 4.7 (Existence of MERWE).

Under Assumptions 4.1, 4.2, 4.3, 4.5 and 4.6, there exists a set {\cal A}\subset\Omega with {\rm P}({\cal A})=1 such that, for all \omega\in{\cal A},

\inf_{\theta\in\Theta}{\rm E}_{m(n)}\left[W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\hat{\mu}_{\theta,m(n)}\right)\right]\rightarrow\inf_{\theta\in\Theta}W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right)

and there exists n(\omega) such that, for all n\geq n(\omega), the sets

\operatorname{argmin}_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\hat{\mu}_{\theta,m(n)}\right)

are non-empty and form a bounded sequence with

\limsup_{n\rightarrow\infty}\underset{\theta\in\Theta}{\operatorname{argmin}}\,{\rm E}_{m(n)}\left[W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\hat{\mu}_{\theta,m(n)}\right)\right]\subset\underset{\theta\in\Theta}{\operatorname{argmin}}\,W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right).

For a generic function f, let us define

\varepsilon\text{-}\operatorname{argmin}_{x}f:=\left\{x:f(x)\leq\varepsilon+\inf_{x}f\right\}.

Then, we state the following

Theorem 4.8 (Measurability of MERWE).

Suppose that \Theta is a \sigma-compact Borel measurable subset of \mathbb{R}^{d_{\theta}}. Under Assumption 4.5, for any n\geq 1, m\geq 1 and \varepsilon>0, there exists a Borel measurable function \hat{\theta}_{n,m}:\Omega\rightarrow\Theta that satisfies

\hat{\theta}_{n,m}^{\lambda}(\omega)\in\operatorname{argmin}_{\theta\in\Theta}{\rm E}_{m}\left[W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\hat{\mu}_{\theta,m}\right)\right],

if this set is non-empty, and otherwise

\hat{\theta}_{n,m}^{\lambda}(\omega)\in\varepsilon\text{-}\operatorname{argmin}_{\theta\in\Theta}{\rm E}_{m}\left[W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\hat{\mu}_{\theta,m}\right)\right].

Considering the case where the data and n are fixed, the next result shows that the MERWE converges to the MRWE as m\rightarrow\infty. In the next assumption, the observed empirical distribution is kept fixed and \varepsilon_{n}=\inf_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n},\mu_{\theta}\right).

Assumption 4.9.

For some \varepsilon>0, the set

B_{n}(\varepsilon)=\left\{\theta\in\Theta:W^{(\lambda)}\left(\hat{\mu}_{n},\mu_{\theta}\right)\leq\varepsilon_{n}+\varepsilon\right\}

is bounded.

Theorem 4.10 (MERWE converges to MRWE as mm\rightarrow\infty).

Under Assumptions 4.2, 4.5, 4.6 and 4.9,

\inf_{\theta\in\Theta}{\rm E}_{m}\left[W^{(\lambda)}\left(\hat{\mu}_{n},\hat{\mu}_{\theta,m}\right)\right]\rightarrow\inf_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n},\mu_{\theta}\right),

and there exists an \tilde{m} such that, for all m\geq\tilde{m}, the sets

\operatorname{argmin}_{\theta\in\Theta}{\rm E}_{m}W^{(\lambda)}\left(\hat{\mu}_{n},\hat{\mu}_{\theta,m}\right)

are non-empty and form a bounded sequence with

\limsup_{m\rightarrow\infty}\underset{\theta\in\Theta}{\operatorname{argmin}}\,{\rm E}_{m}\left[W^{(\lambda)}\left(\hat{\mu}_{n},\hat{\mu}_{\theta,m}\right)\right]\subset\underset{\theta\in\Theta}{\operatorname{argmin}}\,W^{(\lambda)}\left(\hat{\mu}_{n},\mu_{\theta}\right).

5 Numerical experiments

5.1 Sensitivity to outliers

To complement the insights obtained by looking at Figure 1, we build on the notion of the sensitivity curve, an empirical tool that illustrates the stability of statistical functionals; see [54] and [29] for a book-length presentation. We consider a similar tool to study the sensitivity of W_{1} and W^{(\lambda)} in finite samples, both interpreted as functionals of the empirical measure. In more detail, we fix the standard normal distribution as the reference model (for this distribution both W^{(\lambda)} and W_{1} are well defined) and use it to generate a sample of size n=1000. We denote the resulting sample by \mathbf{X}_{n}:=(X_{1},\ldots,X_{n}), whose empirical measure is \hat{\mu}_{n}. Then, we replace X_{n} with a value x\in\mathbb{R} and form a new set of sample points, denoted by \mathbf{X}_{n}(x), whose empirical measure is \hat{\mu}^{(x)}_{n}. We let T_{n} represent the robust Wasserstein distance between an n-dimensional sample and \mathbf{X}_{n}. So, T_{n}(\mathbf{X}_{n}) is the empirical robust Wasserstein distance between \mathbf{X}_{n} and itself, thus it is equal to 0, while T_{n}(\mathbf{X}_{n}(x)) is the empirical robust Wasserstein distance between \mathbf{X}_{n}(x) and \mathbf{X}_{n}. Finally, for different values of x and \lambda, we compute (numerically)

\Delta(x,W^{(\lambda)})=n\left[T_{n}\left(X_{1},\ldots,X_{n-1},x\right)-T_{n}\left(X_{1},\ldots,X_{n}\right)\right]=nW^{(\lambda)}(\hat{\mu}_{n},\hat{\mu}^{(x)}_{n}).

A similar procedure is applied to obtain \Delta(x,W_{1}). We display the results in Figure 2. For each value of \lambda, the plots illustrate that \Delta(x,W^{(\lambda)}) first grows and then remains flat (at the value 2\lambda) even for very large values of |x|. In contrast, for |x|\to\infty, \Delta(x,W_{1}) diverges, in line with the evidence from (and the comments on) Figure 1. In addition, we notice that, as \lambda\rightarrow\infty, the plot of \Delta(x,W^{(\lambda)}) becomes more and more similar to the one of \Delta(x,W_{1}): this aspect is in line with Th. 3.4 and it is reminiscent of the behaviour of the Huber loss function ([32]) for location estimation, which converges to the quadratic loss (maximum likelihood estimation) as the constant tuning the robustness goes to infinity. However, an important remark is in order: the cost c_{\lambda} yields, in the language of robust statistics, the so-called "hard rejection": it bounds the influence of outlying values (to be contrasted with the behavior of the Huber loss, which downweights outliers to preserve efficiency at the reference model); see [50] for a similar comment.
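The sensitivity curve just described can be reproduced with a few lines of code. The sketch below is our own illustration (not the authors' implementation): it assumes the POT package, computes W^{(\lambda)} via the clipped ground cost as before, and evaluates \Delta(x,W^{(\lambda)}) on a coarse grid of replacement values x.

```python
# Minimal sketch of the sensitivity curve Delta(x, W^(lambda)) = n * W^(lambda)(mu_hat_n, mu_hat_n^(x)).
import numpy as np
import ot  # Python Optimal Transport (POT)

def robust_w(X, Y, lam):
    a, b = ot.unif(len(X)), ot.unif(len(Y))
    M = ot.dist(X.reshape(-1, 1), Y.reshape(-1, 1), metric='euclidean')
    return ot.emd2(a, b, np.minimum(M, 2 * lam))

rng = np.random.default_rng(1)
n = 1000
X = rng.standard_normal(n)                     # reference sample X_n from N(0, 1)
grid = np.linspace(-50, 50, 21)                # coarse grid of replacement values x
for lam in (3, 4, 5):
    delta = []
    for x in grid:
        Xx = np.concatenate([X[:-1], [x]])     # replace X_n by the value x
        delta.append(n * robust_w(X, Xx, lam))
    # Delta(x, W^(lambda)) flattens at 2*lambda for large |x| (cf. Figure 2)
    print(f"lambda={lam}: max Delta = {max(delta):.2f}")
```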

Figure 2: The continuous line represents \Delta(x,W_{1}). The dashed (red), dot-dashed (blue) and dotted (green) lines represent \Delta(x,W^{(\lambda)}) with \lambda=3,4,5, respectively.

5.2 Estimation of location

Let us consider the statistical problem of inference on univariate location models. We study the robustness of the MERWE, comparing its performance to that of the MEWE based on minimizing W_{1}, under different degrees of data contamination and for different underlying data-generating models (with finite and infinite moments). Before delving into the numerical exercise, let us give some numerical details. We recall that the MERWE aims at minimizing the function

\theta\mapsto{\rm{E}}_{m}\left[W^{(\lambda)}\left(\hat{\mu}_{n},\hat{\mu}_{\theta,m}\right)\right].

With the same notation as in §4.1, suppose we generate k replications Z_{1:m}^{(i)}, i=1,\ldots,k, independently from a reference model \mu_{\theta}^{(m)}, and let \hat{\mu}_{\theta,m}^{(i)} denote the empirical probability measure of Z_{1:m}^{(i)}. Then the loss function

L_{\text{MERWE}}=k^{-1}\sum_{i=1}^{k}W^{(\lambda)}\left(\hat{\mu}_{n},\hat{\mu}_{\theta,m}^{(i)}\right)

is a natural estimator of {\rm E}_{m}[W^{(\lambda)}(\hat{\mu}_{n},\hat{\mu}_{\theta,m})], since the former converges almost surely to the latter as k\rightarrow\infty. We note that the algorithmic complexity in m is super-linear while the complexity in k is linear: the incremental cost of increasing k is lower than that of increasing m. Moreover, a key aspect for the implementation of our estimator is the need to specify \lambda. In the next numerical exercises we set \lambda=5: this value yields a good performance (in terms of Mean-Squared Error, MSE) of our estimators across the Monte Carlo settings, which consider different models and levels of contamination. In §6, we describe a data-driven procedure for selecting \lambda.
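A minimal sketch of this Monte Carlo objective is given below; it is our own simplified illustration, not the authors' implementation. It assumes the POT package, uses a plain Gaussian location model as the simulator, and minimizes L_MERWE over a coarse grid of candidate locations rather than with a proper optimizer; the function names are ours.

```python
# Hedged sketch: the Monte Carlo loss L_MERWE and a grid-search minimizer.
import numpy as np
import ot  # Python Optimal Transport (POT)

def robust_w1d(x, y, lam):
    a, b = ot.unif(len(x)), ot.unif(len(y))
    M = ot.dist(x.reshape(-1, 1), y.reshape(-1, 1), metric='euclidean')
    return ot.emd2(a, b, np.minimum(M, 2 * lam))

def merwe_loss(theta, x_obs, simulate, m=1000, k=20, lam=5.0, rng=None):
    """L_MERWE(theta) = k^{-1} sum_i W^(lambda)(mu_hat_n, mu_hat_{theta,m}^{(i)})."""
    rng = rng or np.random.default_rng()
    return np.mean([robust_w1d(x_obs, simulate(theta, m, rng), lam) for _ in range(k)])

# Toy example: location of a Gaussian model theta -> N(theta, 1).
rng = np.random.default_rng(2)
x_obs = rng.normal(0.0, 1.0, size=100)
simulate = lambda theta, m, rng: rng.normal(theta, 1.0, size=m)
grid = np.linspace(-1.0, 1.0, 21)
losses = [merwe_loss(t, x_obs, simulate, rng=rng) for t in grid]
print("MERWE estimate of the location:", grid[int(np.argmin(losses))])
```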

In our numerical investigation, we first consider a reference model which is a sum of log-normal random variables having finite moments. For a given L\geq 1, \gamma\in\mathbb{R} and \sigma>0, we have X=\sum_{l=1}^{L}\exp\left(Z_{l}\right), where Z_{1},\ldots,Z_{L} are sampled independently from \mathcal{N}\left(\gamma,\sigma^{2}\right). Suppose we are interested in estimating the location parameter \gamma. We investigate the performance of the MERWE and MEWE under different scenarios: namely, in the presence of outliers, for different sample sizes and contamination values. Specifically, we consider the model

X_{i}^{(n)}=\sum_{l=1}^{L}\exp(Z_{i_{l}}^{(n)}),\quad i=1,\ldots,n,

where Z_{i_{l}}^{(n)}\sim\mathcal{N}(\gamma,\sigma^{2}) for i=1,\ldots,n_{1} (clean part of the sample) and Z_{i_{l}}^{(n)}\sim\mathcal{N}(\gamma+\eta,\sigma^{2}) for i=n_{1}+1,\ldots,n (contaminated part of the sample). Therefore, in each sample, there are n-n_{1} outliers of size \eta.
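For concreteness, a short sketch (ours; the function name and defaults are illustrative) of this contaminated data-generating process is given below.

```python
# Sketch of the O u I contaminated log-normal model: X_i = sum_l exp(Z_il),
# with a fraction eps of the observations having their Z's shifted by eta.
import numpy as np

def sample_contaminated(n, L=10, gamma=0.0, sigma=1.0, eps=0.1, eta=4.0, rng=None):
    rng = rng or np.random.default_rng()
    n1 = int(round((1 - eps) * n))                            # clean observations
    Z_clean = rng.normal(gamma, sigma, size=(n1, L))
    Z_out = rng.normal(gamma + eta, sigma, size=(n - n1, L))  # shifted (outlying) part
    Z = np.vstack([Z_clean, Z_out])
    return np.exp(Z).sum(axis=1)                              # X_i = sum_l exp(Z_il)

x = sample_contaminated(n=100, eps=0.2, eta=4.0, rng=np.random.default_rng(3))
```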

In our simulation experiments, we set L=10, \gamma=0 and \sigma=1. To implement the MERWE and MEWE of \gamma, we choose m=1000, k=20 and \lambda=5. The bias and MSE of the estimators, based on 1000 replications, are displayed in Table 1, for various sample sizes n, contamination sizes \eta and proportions of contamination \varepsilon. The table illustrates the superior performance (both in terms of bias and MSE) of the MERWE with respect to the MEWE. In small samples (n=100), the MERWE has smaller bias and MSE than the MEWE in all settings. Similar results are obtained for moderate and large sample sizes (n=200 and n=1000). Interestingly, for n=1000, MERWE and MEWE have similar performance when \varepsilon=0 (no contamination), whilst the MERWE still has smaller MSE for \varepsilon>0. This implies that the MERWE maintains good efficiency with respect to the MEWE at the reference model.

                          n=100                       n=200                       n=1000
                    BIAS         MSE            BIAS         MSE            BIAS         MSE
SETTINGS         MERWE  MEWE  MERWE  MEWE    MERWE  MEWE  MERWE  MEWE    MERWE  MEWE  MERWE  MEWE
ε=0.1, η=1       0.049  0.092 0.003  0.010   0.042  0.093 0.002  0.012   0.037  0.086 0.002  0.008
ε=0.1, η=4       0.035  0.090 0.002  0.012   0.029  0.097 0.001  0.016   0.013  0.100 ≈0     0.018
ε=0.2, η=1       0.071  0.157 0.008  0.028   0.086  0.178 0.009  0.033   0.081  0.172 0.007  0.031
ε=0.2, η=4       0.046  0.204 0.003  0.045   0.035  0.203 0.002  0.043   0.017  0.195 ≈0     0.038
ε=0              0.036  0.034 0.002  0.002   0.022  0.022 0.001  0.001   0.012  0.010 ≈0     ≈0
Table 1: The bias and MSE of the MERWE and MEWE of \gamma for different n, \varepsilon and \eta.

Now, let us turn to the case of random variables having infinite moments. [60] recently illustrates that minimum Wasserstein distance estimators do not perform well for heavy-tailed distributions and suggests avoiding minimum Wasserstein distance inference when the underlying distribution has an infinite p-th moment, with p\geq 1. The theoretical developments of Section 3.1 show that our MERWE does not suffer from the same criticism. In the next MC experiment, we illustrate the good performance of the MERWE when the data-generating process does not admit a finite first moment. We consider the problem of estimating the location parameter of a symmetric \alpha-stable distribution; see e.g. [52] for a book-length presentation. A stable distribution is characterized by four parameters (\alpha,\beta,\gamma,\delta): \alpha is the index parameter, \beta is the skewness parameter, \gamma is the scale parameter and \delta is the location parameter. It is worth noting that stable distributions have undefined variance for \alpha<2, and undefined mean for \alpha\leq 1. We consider three parameter settings:

  (1) (α, β, γ, δ) = (0.5, 0, 1, 0), which represents a heavy-tailed distribution with undefined mean;

  (2) (α, β, γ, δ) = (1, 0, 1, 0), which is the standard Cauchy distribution, having undefined moments of order p ≥ 1;

  (3) (α, β, γ, δ) = (1.1, 0, 1, 0), representing a distribution having a finite mean.

In each MC experiment, we estimate the location parameter, while the other parameters are assumed to be known. In addition, we consider contaminated heavy-tailed data, where a proportion (1ε)(1-\varepsilon) of the nn observations is generated from the α\alpha-stable distribution with parameters (α,β,γ,δ)(\alpha,\beta,\gamma,\delta) and the remaining proportion ε\varepsilon (the outliers) comes from the distribution with parameters (α,β,γ,δ+η)(\alpha,\beta,\gamma,\delta+\eta), where η\eta is the size of the outliers. We set m=1000m=1000, n=100n=100, k=20k=20 and λ=5\lambda=5, and we repeat the experiment 1000 times for each distribution and estimator. We display the results in Table 2. For the stable distributions, the MEWE has larger bias and MSE than those yielded by the MERWE. This aspect is particularly evident for the distributions with undefined first moment, namely the Cauchy distribution and the stable distribution with α=0.5\alpha=0.5. These experiments, complementing the ones available in [60], illustrate that while the MEWE (which is not well-defined in the considered setting) entails large bias and MSE values, the MERWE is well-defined and performs well even for stable distributions with infinite moments.
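The contaminated α-stable samples described above can be drawn, for instance, with SciPy, as in the following minimal sketch (the function name and defaults are illustrative; note that SciPy's levy_stable has its own default parameterization of (α, β, location, scale), which should be checked against the convention of [52]).

```python
import numpy as np
from scipy.stats import levy_stable

def sample_contaminated_stable(n, alpha, beta=0.0, gamma=1.0, delta=0.0,
                               eps=0.1, eta=4.0, seed=None):
    """(1 - eps) * n observations from the stable law with location delta,
    eps * n observations (outliers) from the same law with location delta + eta."""
    rng = np.random.default_rng(seed)
    n_out = int(round(eps * n))
    clean = levy_stable.rvs(alpha, beta, loc=delta, scale=gamma,
                            size=n - n_out, random_state=rng)
    outliers = levy_stable.rvs(alpha, beta, loc=delta + eta, scale=gamma,
                               size=n_out, random_state=rng)
    return np.concatenate([clean, outliers])

x = sample_contaminated_stable(n=100, alpha=1.0, eps=0.1, eta=4.0, seed=0)  # Cauchy case
```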

SETTINGS        Cauchy                     Stable (α=0.5)               Stable (α=1.1)
                BIAS         MSE           BIAS         MSE             BIAS         MSE
ε=0.1, η=1      0.084/1.531  0.011/3.628   0.088/3.179  0.011/13.731    0.090/0.658  0.011/1.030
ε=0.1, η=4      0.205/1.529  0.047/3.656   0.164/3.174  0.034/13.706    0.207/0.746  0.048/1.051
ε=0.2, η=1      0.181/1.502  0.038/3.602   0.171/3.155  0.037/12.840    0.181/0.676  0.038/0.942
ε=0.2, η=4      0.459/1.820  0.224/4.691   0.384/3.140  0.166/12.713    0.485/1.072  0.245/1.802
ε=0             0.046/1.551  0.003/3.740   0.045/3.118  0.003/12.600    0.042/0.613  0.003/0.894
Table 2: The bias and MSE of the MERWE and MEWE of the location parameter for various contaminated α-stable distributions; each cell reports MERWE / MEWE. Simulation setting: m=1000, k=20.

5.3 Generative Adversarial Networks (GAN)

Synthetic data. We propose two RWGAN deep learning models: both approaches are based on ROBOT. The first one is derived directly from the dual form in Th. 2.1, while the second one is derived from (3). We compare these two methods with the routinely-applied Wasserstein GAN (WGAN) and with the robust WGAN introduced by [4]. To provide some details on our new procedures, we recall that GAN is a deep learning method, first proposed by [24], and one of the most popular machine learning approaches for unsupervised learning on complex distributions. A GAN includes a generator and a discriminator, both neural networks, and the procedure is based on a mutual game between them: the generator creates fake samples that are as realistic as possible in order to deceive the discriminator, while the discriminator becomes increasingly powerful at detecting the fake samples. The ultimate goal of this procedure is a generator able to produce high-quality samples resembling the original sample. To illustrate the connections with the ROBOT, let us consider the following GAN architecture; more details are available in the Supplementary Material (Appendix .4.2). Let XPr{X}\sim{\rm P_{r}} and ZPz{Z}\sim{\rm P}_{z}, where Pr\rm P_{r} and Pz{\rm P}_{z} denote the distribution of the reference sample and of a random sample which is the input of the generator, respectively. Then, denote by GθG_{\theta} the function applied by the generator (it transforms ZZ to create fake samples, which are the output of a statistical model Pθ{\rm P}_{\theta}, indexed by the finite dimensional parameter θ\theta) and by DϑD_{\vartheta} the function applied by the discriminator (indexed by a finite dimensional parameter ϑ\vartheta, it outputs an estimate of the probability that its input is a real sample). The objective function is

minθmaxϑ{E[logDϑ(X)]+E[log(1Dϑ(Gθ(Z)))]},\min_{\theta}\max_{\vartheta}\left\{{\rm E}[\log D_{\vartheta}({X})]+{\rm E}[\log(1-D_{\vartheta}(G_{\theta}{(Z)}))]\right\}, (15)

where Dϑ(X)D_{\vartheta}(X) represents the probability that XX came from the data rather than from PθP_{\theta}. The GAN mechanism trains DϑD_{\vartheta} to maximize the probability of assigning the correct label to both training examples and fake samples. Simultaneously, it trains GθG_{\theta} to minimize log(1Dϑ(Gθ(Z)))\log(1-D_{\vartheta}(G_{\theta}(Z))). In other words, we would like to train GθG_{\theta} such that Pθ\mathrm{P_{\theta}} is very close (in some distance/divergence) to Pr\mathrm{P_{r}}.

Despite its popularity, GAN has some inference issues. For instance, during training, the generator may collapse to a setting where it always produces the same samples, and the procedure may face a fast vanishing gradient problem. Thus, training GANs is a delicate and unstable numerical and inferential task; see [3]. To overcome these problems, Arjovsky et al. propose the so-called Wasserstein GAN (WGAN), based on the Wasserstein distance of order 11. The main idea is still based on mutual learning, but rather than using a discriminator to predict the probability that a generated image is real or fake, the WGAN replaces DϑD_{\vartheta} with a function fξf_{\xi} (it corresponds to ψ\psi in the Kantorovich-Rubenstein duality), indexed by a parameter ξ\xi, which is called a “critic”, namely a function that evaluates the realness or fakeness of a given sample. In mathematical form, the WGAN objective function is

minθsupfξL1{E[fξ(X)]E[fξ(Gθ(Z))]},\min_{\theta}\sup_{\|f_{\xi}\|_{L}\leq 1}\left\{{\rm E}[f_{\xi}({X})]-{\rm E}[f_{\xi}(G_{\theta}({Z}))]\right\}, (16)

Also in this new formulation, the task is still to train GθG_{\theta} in such a way that Pθ\mathrm{P_{\theta}} is very close (now, in Wasserstein distance) to Pr\mathrm{P_{r}}. [3] explain how the WGAN is connected to minimum distance estimation. We remark that (16) has the same form as the Kantorovich-Rubenstein duality: we apply our dual form of ROBOT in Th. 2.1 to the WGAN to obtain a new objective function

minθsupfξL1,range(fξ)2λ{E[fξ(X)]E[fξ(Gθ(Z))]}.\min_{\theta}\sup_{\|f_{\xi}\|_{L}\leq 1,{\rm range}(f_{\xi})\leq 2\lambda}\left\{{\rm E}[f_{\xi}({X})]-{\rm E}[f_{\xi}(G_{\theta}({Z}))]\right\}. (17)

The central idea is to train an RWGAN by minimizing the robust Wasserstein distance (actually, using the dual form) between real and generated data. To this end, we define a novel RWGAN model based on (17). The algorithm for training this RWGAN model is very similar to the one for the WGAN available in [3], to which we refer for the implementation. We label this robust GAN model as RWGAN-1. Besides this model, we propose another approach derived from (3), where we use a new neural network to represent the modified distribution and add a penalty term to (16) in order to control the modification. Because of space constraints, the details of this procedure are provided in Algorithm 3 in Appendix .4.2. We label this new RWGAN model as RWGAN-2. Different from [4]’s robust GAN, which uses the χ2\chi^{2}-divergence to constrain the modification of the distribution, our RWGAN-2 does this by making use of the TV. In the sequel, we write RWGAN-B for the robust GAN of [4], which we implement using the same setup as Balaji et al. We remark that RWGAN-1 is computationally less complex than RWGAN-2 and RWGAN-B. Indeed, RWGAN-2 and RWGAN-B make use of regularization terms, so an additional neural network is needed to represent the modification of the distribution. In contrast, RWGAN-1 has a simpler structure: its implementation requires only a small modification of the activation function in the generator network; see Appendix .4.2.
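To fix ideas, the following minimal PyTorch sketch shows one possible way to impose the additional constraint in (17): the critic output is squashed by λ·tanh, so that range(f_ξ) ≤ 2λ, while the 1-Lipschitz constraint is approximated by weight clipping as in the original WGAN of [3]. The network size and the clipping constant are illustrative choices and are not necessarily those used in Appendix .4.2.

```python
import torch
import torch.nn as nn

class BoundedCritic(nn.Module):
    """Critic f_xi with output in (-lam, lam), i.e. range(f_xi) <= 2 * lam."""
    def __init__(self, dim_in, lam, hidden=128):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # lam * tanh keeps the critic values in (-lam, lam)
        return self.lam * torch.tanh(self.net(x))

def critic_loss(critic, x_real, x_fake):
    # negative of the dual objective in (17): the critic maximizes E[f(X)] - E[f(G(Z))]
    return -(critic(x_real).mean() - critic(x_fake).mean())

def clip_weights(critic, c=0.01):
    # crude 1-Lipschitz enforcement via weight clipping, as in the original WGAN
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```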

To investigate the robustness of RWGAN-1 and RWGAN-2 we rely on synthetic data. We consider reference samples generated from a simple model, containing nn points in total, with nn1n-n_{1} outliers, whose data generation process is

Xi1(n)U(0,1),Xi2(n)=Xi1(n)+1,Xi(n)=(Xi1(n),Xi2(n)),i=1,2,,n1,Xi(n)=(Xi1(n),Xi2(n)+η),i=n1+1,n1+2,,n,\begin{array}[]{cc}&X_{i_{1}}^{(n)}\sim\mathrm{U}(0,1),X_{i_{2}}^{(n)}=X_{i_{1}}^{(n)}+1,\\ &{X}_{i}^{(n)}=(X_{i_{1}}^{(n)},X_{i_{2}}^{(n)}),i=1,2,\ldots,n_{1},\\ &{X}_{i}^{(n)}=(X_{i_{1}}^{(n)},X_{i_{2}}^{(n)}+\eta),i=n_{1}+1,n_{1}+2,\ldots,n,\end{array} (18)

with η\eta representing the size of outliers. We set n=1000n=1000 and try four different settings by changing the values of ε=(nn1)/n\varepsilon=(n-n_{1})/n and η\eta. As is common in the GAN literature, our generative adversarial models are obtained using the Python library Pytorch; see Appendix .4.2. We display the results in Figure 3, where we compare WGAN, RWGAN-1, RWGAN-2, RWGAN-B, RWGAN-N (based on the outlier-robust Wasserstein distance W1(ϵ)W_{1}^{(\epsilon)} of [45], with ϵ=0.07\epsilon=0.07 and ϵ=0.25\epsilon=0.25, as in the Nietert et al. paper) and RWGAN-D (based on the BL distance, as in [19]). To measure the distance between the data simulated by the generator and the input data, we report the Wasserstein distance of order 1. For visual comparison, we display the cloud of clean data points (blue triangles) and the cloud of GAN generated points (red squares). The plots reveal that WGAN is greatly affected by outliers. Differently, RWGAN-2, RWGAN-B, RWGAN-N and RWGAN-D are able to generate data roughly consistent with the uncontaminated distribution in most of the settings. Nevertheless, they still produce some obvious abnormal points, especially when the proportion and size of outliers increase. In particular, we notice that RWGAN-D performs well in the first two contamination settings (ε=0.1\varepsilon=0.1, η=2\eta=2 and ε=0.1\varepsilon=0.1, η=3\eta=3) but it breaks down (namely, it generates points which are very different from the uncontaminated data) in the third and fourth contamination settings (ε=0.2\varepsilon=0.2, η=2\eta=2 and ε=0.2\varepsilon=0.2, η=3\eta=3), where the amount and the size of the contamination increase and negatively affect the GAN performance. By contrast, RWGAN-1 performs better than its competitors and generates data that agree with the uncontaminated distribution, even when the proportion and size of outliers are large. Several repetitions of the experiment lead to the same conclusions.
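For reproducibility, the reference sample in (18) can be drawn as in the following minimal NumPy sketch (the function name and defaults are illustrative and not taken from the paper's code).

```python
import numpy as np

def sample_gan_reference(n=1000, eps=0.1, eta=2.0, seed=0):
    """Points on the segment y = x + 1, with an eps fraction shifted upward by eta."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.0, 1.0, size=n)
    x2 = x1 + 1.0
    n1 = int(round((1.0 - eps) * n))
    x2[n1:] += eta                      # the last n - n1 points are the outliers
    return np.column_stack([x1, x2])
```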

To conclude this experiment, we remark that RWGAN-N is very sensitive to the specified value of ϵ\epsilon. For instance, in the first contamination setting, comparing the W1W_{1} between the clean data and the data generated by RWGAN-N with ϵ=0.07\epsilon=0.07 to the W1W_{1} of RWGAN-1 illustrates that our method (whose λ\lambda has been selected using the procedure described in Section 6) performs better. Increasing ϵ\epsilon to 0.25 improves the RWGAN-N performance, making its W1W_{1} closer to that of RWGAN-1. This is an important methodological consideration, which has a practical implication for the application of RWGAN-N: a selection criterion for the ϵ\epsilon hyper-parameter needs to be proposed, discussed and studied, similarly to what we do in Section 6. To the best of our knowledge, such a criterion is not available in the literature.

[Figure 3: scatter plots omitted in this conversion; the empirical W1 between the clean reference points and the points generated by each model is reported below, by column setting.]

                     ε=0.1, η=2   ε=0.1, η=3   ε=0.2, η=2   ε=0.2, η=3
WGAN                 0.5864       0.5229       0.5646       0.6518
RWGAN-1              0.0514       0.0495       0.0470       0.0471
RWGAN-2              0.1560       0.2312       0.2938       0.3954
RWGAN-B              0.1102       0.0983       0.3229       0.1206
RWGAN-N (ϵ=0.07)     0.1932       0.2077       0.3911       0.3821
RWGAN-N (ϵ=0.25)     0.1164       0.1985       0.1381       0.2056
RWGAN-D              0.0607       0.1446       0.5962       0.2949

Figure 3: Blue triangles: 200 data points sampled from the reference distribution (uncontaminated observations). Red squares: 200 data points generated from WGAN (1st row), RWGAN-1 (2nd row), RWGAN-2 (3rd row), RWGAN-B (4th row), RWGAN-N with ϵ=0.07 (5th row), RWGAN-N with ϵ=0.25 (6th row) and RWGAN-D (7th row). The empirical Wasserstein distance of order p=1 between triangles and squares is provided. (1st column: ε=0.1, η=2; 2nd column: ε=0.1, η=3; 3rd column: ε=0.2, η=2; 4th column: ε=0.2, η=3.)

Fashion-MNIST real data example. We illustrate the performance of RWGAN-1 and RWGAN-2 by comparing them with the WGAN and RWGAN-B in analysing the Fashion-MNIST real dataset, which contains 70000 grayscale images of apparel items, including T-shirts, jeans, pullovers, skirts, coats, etc. Each image has 28×2828\times 28 pixels, and each pixel takes a value from 0 to 255, indicating its darkness or lightness. We generate outlying images by taking the negative of images already available in the dataset—a negative image is a total inversion, in which light areas appear dark and vice versa; in the Fashion-MNIST dataset, each pixel of a negative image equals 255 minus the value of the corresponding pixel of the normal picture. By varying the number of negative images we define different levels of contamination: we consider 0, 3000, and 6000 outlying images. For the sake of visualization, in Figure 4(a) and 4(b) we show normal and negative images, respectively. Our goal is to obtain a generator which can produce images of the same style as Figure 4(a), even if it is trained on a sample containing both normal images and outliers. A more detailed explanation of how image generation works via Pytorch is available in Appendix .4.2.
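As an illustration, the outlying (negative) images can be obtained as in the following minimal sketch (assuming torchvision; after the usual scaling of pixels to [0, 1], taking the negative amounts to computing 1 minus the pixel value, and the loading details are illustrative rather than the authors' implementation).

```python
import torch
from torchvision import datasets, transforms

# Load Fashion-MNIST and contaminate it with "negative" (inverted) images.
train = datasets.FashionMNIST(root="data", train=True, download=True,
                              transform=transforms.ToTensor())
images = torch.stack([img for img, _ in train])      # (60000, 1, 28, 28), values in [0, 1]

n_out = 3000                                         # contamination level: 0, 3000 or 6000
idx = torch.randperm(len(images))[:n_out]
contaminated = images.clone()
contaminated[idx] = 1.0 - contaminated[idx]          # negative images used as outliers
```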

In Figure 5, we display images generated by the WGAN, RWGAN-1, RWGAN-2 and RWGAN-B under the different levels of contamination—we select λ\lambda making use of the data-driven procedure described in Section 6. We see that all of the GANs generate similar images when there are no outliers in the reference sample (1st row). When there are 3000 outliers in the reference sample (2nd row), the GANs generate some negative images. However, a simple visual inspection of the plots shows that the WGAN produces many obvious negative images, while RWGAN-B, RWGAN-1 and RWGAN-2 generate fewer outlying images than the WGAN. Remarkably, RWGAN-1 produces the smallest number of negative images among the considered methods. Similar conclusions hold for the higher contamination level with 6000 outlying images.

[Figure 4: example images omitted in this conversion; panel (a) shows a normal image, panel (b) an outlying (negative) image.]
Figure 4: Style of clean reference sample images and outlier images. Panel (a) contains pictures that are regarded as the right style. Panel (b) contains negative pictures (outliers).

To quantify the accuracy of these GANs, we train a convolutional neural network (CNN) to classify normal images and negative images, using a training set of 60000 normal images and 60000 negative images. The resulting CNN is 100% accurate on a test set of size 20000. Then, we use this CNN to classify 1000 images generated by the four GANs. In Table 3, we report the frequencies of normal images detected by the CNN. The results are consistent with the visual analysis of Figure 5: RWGAN-1 produces the smallest number of negative images at every contamination level.
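The classifier used for this check can be as simple as the following sketch; the architecture below is our own illustrative choice, since the actual CNN used for Table 3 is described only at this level of generality in the text.

```python
import torch
import torch.nn as nn

class NormalVsNegativeCNN(nn.Module):
    """Binary classifier: label 1 for normal images, 0 for negative images."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.head = nn.Linear(32 * 7 * 7, 2)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

@torch.no_grad()
def frequency_of_normal(cnn, generated):
    """Fraction of generated images classified as normal, as reported in Table 3."""
    return (cnn(generated).argmax(dim=1) == 1).float().mean().item()
```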

[Figure 5: generated images omitted in this conversion.]
Figure 5: 64 images generated from WGAN (1st row), RWGAN-1 (2nd row), RWGAN-2 (3rd row), RWGAN-B (4th row). (1st column: no outliers; 2nd column: 3000 outliers; 3rd column: 6000 outliers.)
Contamination     WGAN    RWGAN-1   RWGAN-2   RWGAN-B
0 outliers        1.000   1.000     1.000     1.000
3000 outliers     0.941   0.982     0.953     0.945
6000 outliers     0.890   0.975     0.913     0.897
Table 3: Frequency of normal images output by each GAN model, under different contamination levels (the results are based on the classification by a trained CNN model).

5.4 Domain adaptation

Methodology. We consider an application of ROBOT to domain adaptation (DA), a popular machine learning problem which deals with training and test samples having different probability distributions. DA has a wide range of applications in natural language processing ([48]), text analysis ([17]) and medical data analysis ([31]).

The DA problem can be stated as follows. Let

{(Xis,Yis);1iNs}\{({X}_{i}^{\rm s},Y_{i}^{\rm s});1\leq i\leq N_{\rm s}\}

denote an i.i.d. sequence of source domain (training sample) and

{(Xjt,Yjt);1jNt}\{({X}_{j}^{\rm t},Y_{j}^{\rm t});1\leq j\leq N_{\rm t}\}

denote an i.i.d. sequence of target domain (test sample), which have joint distributions Ps{\rm P_{s}} and Pt{\rm P_{t}}, respectively. Interestingly, Ps{\rm P_{s}} and Pt{\rm P_{t}} may be two different distributions: discrepancies (also known as drift) in these data distributions depend on the type of application considered—e.g. in image analysis, domain drifts can be due to changing lighting conditions, recording device failures, or to changes in the backgrounds.

In the context of DA, we are interested in estimating the labels Yjt,j=1,,Nt,Y_{j}^{\rm t},j=1,\ldots,N_{\rm t}, from observations

{(Xis,Yis);1iNs}\{({X}_{i}^{\rm s},Y_{i}^{\rm s});1\leq i\leq N_{\rm s}\}

and {Xjt;1jNt}\{{X}_{j}^{\rm t};1\leq j\leq N_{\rm t}\}. Therefore, the goal is to develop a learning method which is able to transfer knowledge from the source domain to the target domain, taking into account that the training and test samples may have different distributions.

To achieve this inference goal, [14] assume that there is a transformation 𝒯\mathcal{T} between Ps{\rm P}_{\rm s} and Pt{\rm P}_{\rm t}, and estimate the map 𝒯\mathcal{T} via OT. To elaborate further, let us consider that, in practice, the population distribution of the source sample is unknown and it is usually estimated by its empirical counterpart

P^s(Ns)=Ns1i=1NsδXis,Yis.\widehat{\rm P}_{\rm s}^{(N_{\rm s})}={N_{\rm s}}^{-1}\sum_{i=1}^{N_{\rm s}}\delta_{{X}_{i}^{\rm s},Y_{i}^{\rm s}}.

As far as the target domain is concerned, suppose we can obtain a function ff to predict the unobserved response variable YtY^{\rm t}. Thus, the empirical distribution of the target sample is

P^t;f(Nt)=Nt1j=1NtδXjt,f(Xjt).\widehat{{\rm P}}_{{\rm t};f}^{(N_{\rm t})}=N_{\rm t}^{-1}\sum_{j=1}^{N_{\rm t}}\delta_{{{X}}_{j}^{\rm t},f({{X}}_{j}^{\rm t})}.

The central idea of Courty et al. (2014) relies on transforming the data to make the source and target distributions similar (namely, close in some distance), and on using the label information available in the source domain to learn a classifier in the transformed domain, which in turn can be exploited to obtain the labels in the target domain.

More formally, the DA problem is written as an optimization problem which aims at finding the function ff that minimizes the transport cost between the estimated source distribution P^s(Ns)\widehat{\rm P}_{\rm s}^{(N_{\rm s})} and the estimated joint target distribution P^t;f(Nt)\widehat{{\rm P}}_{{\rm t};f}^{(N_{\rm t})}. To set up this problem, [15] propose to use the cost function

D(x1,y1;x2,y2)=αd(x1,x2)+L(y1,y2),{D}\left({x}_{1},y_{1};{x}_{2},y_{2}\right)=\alpha{d}\left({x}_{1},{x}_{2}\right)+{L}\left(y_{1},y_{2}\right),

which defines a joint cost combining the distance between the features, d(x1,x2)d\left({x}_{1},{x}_{2}\right), and a measure L{L} of the discrepancy between y1y_{1} and y2y_{2}, where (as recommended in the Courty et al. code) α=0.3\alpha=0.3. Then, the optimization problem, formulated in terms of the 1-Wasserstein distance, is

minfW1(P^s(Ns),P^t;f(Nt))minf,𝚷ijD(Xis,Yis;Xjt,f(Xjt))𝚷ij(Ns,Nt),\min_{f}W_{1}\left(\hat{\mathrm{P}}_{\rm s}^{(N_{\rm s})},\hat{\mathrm{P}}_{{\rm t};f}^{(N_{\rm t})}\right)\equiv\min_{f,\mathbf{\Pi}}\sum_{ij}{D}\left({X}_{i}^{\rm s},Y_{i}^{\rm s};{X}_{j}^{\rm t},f({X}_{j}^{\rm t})\right)\mathbf{\Pi}_{ij}^{(N_{\rm s},N_{\rm t})}, (19)

which, in the case of squared loss function LL (see [15]), becomes

minfNt1j=1NtY^jf(Xjt)2,whereY^j=Nti=1Ns𝚷ij(Ns,Nt)Yis;\min_{f}N_{\rm t}^{-1}\sum_{j=1}^{N_{\rm t}}\left\|\hat{Y}_{j}-f(X_{j}^{\rm t})\right\|^{2},\ \text{where}\ \hat{Y}_{j}=N_{\rm t}\sum_{i=1}^{N_{\rm s}}\mathbf{\Pi}_{ij}^{(N_{\rm s},N_{\rm t})}Y_{i}^{\rm s};

Courty et al. (2014) propose to solve this problem using (a regularized version of) OT. Since OT can be sensitive to outliers, in the same spirit as [15] but with the additional aim of reducing the impact of outlying values, we propose to use ROBOT instead of OT in the DA problem. The basic intuition is that ROBOT yields a matrix 𝚷(Ns,Nt)\mathbf{\Pi}^{(N_{\rm s},N_{\rm t})} which is able (by construction) to identify and downweight anomalous values, yielding a robust Y^j\hat{Y}_{j} for every jj. We provide the key mathematical and implementation details in Algorithm 1.

Data: source sample {(X_i^s, Y_i^s), 1 ≤ i ≤ N_s}, target features {X_j^t, 1 ≤ j ≤ N_t}, robust regularization parameter λ, parametric model f_θ (indexed by a parameter θ).
Result: model f_θ.
1. Estimate θ using the source sample.
2. While θ has not converged, do:
3.    Let C_ij^(λ, N_s, N_t) = min{ D(X_i^s, Y_i^s; X_j^t, f_θ(X_j^t)), 2λ }.
4.    Based on the cost matrix (C_ij^(λ, N_s, N_t))_{1 ≤ i ≤ N_s, 1 ≤ j ≤ N_t}, compute the optimal transport plan matrix Π^(N_s, N_t), and collect the set of all indices I = {(i, j) : D(X_i^s, Y_i^s; X_j^t, f_θ(X_j^t)) ≥ 2λ, 1 ≤ i ≤ N_s, 1 ≤ j ≤ N_t}.
5.    Set s(i) = −Σ_{j=1}^{N_t} Π_ij^(N_s, N_t) 1{(i, j) ∈ I}, and find H, the set of all indices i for which s(i) + 1/N_s = 0.
6.    Remove the ℓ-th row of Π^(N_s, N_t) for all ℓ ∈ H and normalize the matrix over columns to form a new matrix Π^(N'_s, N_t).
7.    Let L_θ = N_t^{-1} Σ_{j=1}^{N_t} ||Ŷ_j − f_θ(X_j^t)||², where Ŷ_j = N_t Σ_{i=1}^{N'_s} Π_ij^(N'_s, N_t) Y_i^s.
8.    Update θ by minimizing L_θ.
9. End while.
Algorithm 1: ROBUST DOMAIN ADAPTATION
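For concreteness, steps 3–4 of Algorithm 1 can be sketched as follows, assuming the POT library (a standard Python package for optimal transport, not necessarily the one used by the authors) and a squared loss for L; the row removal and re-normalization of steps 5–7 are omitted in this minimal illustration.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def robust_plan(Xs, Ys, Xt, Yt_pred, lam, alpha=0.3):
    """Truncated joint cost and optimal transport plan (steps 3-4 of Algorithm 1).

    Xs, Xt: (N_s, d) and (N_t, d) feature arrays; Ys: (N_s,) source labels;
    Yt_pred: (N_t,) current predictions f_theta(X_t); lam: robustness parameter.
    """
    Ns, Nt = len(Xs), len(Xt)
    D = alpha * ot.dist(Xs, Xt, metric='euclidean') + (Ys[:, None] - Yt_pred[None, :]) ** 2
    C_lam = np.minimum(D, 2.0 * lam)                 # truncated cost matrix C^(lambda)
    a, b = np.full(Ns, 1.0 / Ns), np.full(Nt, 1.0 / Nt)
    Pi = ot.emd(a, b, C_lam)                         # optimal plan for the truncated cost
    trimmed = D >= 2.0 * lam                         # index set I of step 4
    return Pi, trimmed
```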

Numerical exercise. The next Monte Carlo experiment illustrates the use and the performance of Algorithm 1 on synthetic data. We consider NsN_{\rm s} source data points (training sample) generated from

X_i^(N_s) ∼ 𝒩(−2, 1),   i = 1, 2, …, [N_s/2],
X_i^(N_s) ∼ 𝒩(2, 1),    i = [N_s/2]+1, [N_s/2]+2, …, N_s,
Y_i^(N_s) = sin(X_i^(N_s)/2) + Z_i^(N_s),        i = 1, 2, …, [9N_s/10],
Y_i^(N_s) = sin(X_i^(N_s)/2) + 2 + Z_i^(N_s),    i = [9N_s/10]+1, [9N_s/10]+2, …, N_s,
Z_i^(N_s) ∼ 𝒩(0, 0.1²),     (20)

and the NtN_{\rm t} target points (test sample) are generated from

X_j^(N_t) ∼ 𝒩(−1, 1),   j = 1, 2, …, [N_t/2],
X_j^(N_t) ∼ 𝒩(2, 1),    j = [N_t/2]+1, [N_t/2]+2, …, N_t,
Y_j^(N_t) = sin((X_j^(N_t) − 2)/2) + Z_j^(N_t),   j = 1, 2, …, N_t,
Z_j^(N_t) ∼ 𝒩(0, 0.1²).     (21)

We notice that the source and target domains have similar distributions, but there is a shift, called drift in the DA literature, between them. In the training sample, outliers in the response variable are due to the shift of size 2 added to the Yi(Ns)Y_{i}^{(N_{\rm s})} for i=[9Ns/10]+1,,Nsi=[9N_{\rm s}/10]+1,\ldots,N_{\rm s}. Moreover, mimicking the problem of leverage points in linear regression (see [29, Ch. 6]), our simulation design is such that half of the Xi(Ns)X_{i}^{(N_{\rm s})}s are generated from a 𝒩(−2,1)\mathcal{N}(-2,1), whilst the other half is obtained from a shifted/drifted Gaussian 𝒩(2,1)\mathcal{N}(2,1); the feature space in the test sample contains a similar pattern but with a different drift.
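A minimal NumPy sketch of the data-generating processes (20)–(21) is given below; the function names and seeds are illustrative and not part of the original code.

```python
import numpy as np

def sample_source(Ns=200, seed=0):
    """Source sample (20): two feature clusters; the last 10% of the responses
    are shifted by 2 (outliers)."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([rng.normal(-2, 1, Ns // 2), rng.normal(2, 1, Ns - Ns // 2)])
    Y = np.sin(X / 2) + rng.normal(0, 0.1, Ns)
    Y[9 * Ns // 10:] += 2.0
    return X, Y

def sample_target(Nt=200, seed=1):
    """Target sample (21): drifted features and a shifted regression curve."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([rng.normal(-1, 1, Nt // 2), rng.normal(2, 1, Nt - Nt // 2)])
    Y = np.sin((X - 2) / 2) + rng.normal(0, 0.1, Nt)
    return X, Y
```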

In our simulation exercise, we set Ns=200N_{\rm s}=200 and Nt=200N_{\rm t}=200 and we compare three methods for estimating ff: the first one is the Kernel Ridge Regression (KRR); the second one is the method proposed by [14], which is a combination of OT and KRR; the third one is our method, which is based on a modification of the second method, as obtained using ROBOT instead of OT. To the best of our knowledge, this application of ROBOT to DA is new to the literature.

In Figure 6 we show the results. The outliers in the training sample are clearly visible as a cluster of points far apart from the bulk of the data. The plot illustrates that the curve obtained via KRR fits reasonably well only the majority of the data in the source domain, but it looks inadequate for the target domain. This is due to the drift between the target and source distributions: by design, the KRR approach cannot capture this drift, since it takes into account only the source domain information. The method in [14] yields a curve that is closer to the target data than the KRR curve but, because of the outliers in the Yi(Ns)Y_{i}^{(N_{\rm s})}, it does not perform well when the feature takes on negative values or values larger than four. Differently from its competitors, our method yields an estimated curve that is not unduly influenced by the outlying values and that fits the target distribution better than the other methods.

[Figure 6: plot omitted in this conversion.]
Figure 6: Scatter plots of the source and target samples, and functions estimated by different approaches. The grey-blue points represent the source samples, while the orange points represent the target samples. The gray, blue and red solid curves are obtained by KRR, the method of [14] and our method, respectively.

6 Choice of λ\lambda via concentration inequalities

The application of W(λ)W^{(\lambda)} requires the specification of λ\lambda. In the literature on robust statistics, many selection criteria are based on first-order asymptotic theory. For instance, [41] and [38] propose to select the tuning constant of Huber-type M-estimators so as to achieve a given asymptotic efficiency at the reference model (namely, without considering the contamination). Unfortunately, these types of criteria hinge on the asymptotic notion of influence function, which is not available for our estimators defined in Section 4; see [6] for a related discussion on MKE. [44] propose a criterion which needs access to a clean data set and do not develop its theoretical underpinning. To fill this gap in the literature, we introduce a data-driven, non-asymptotic selection criterion for λ\lambda. Moreover, it does not require the computation of θ^nλ\hat{\theta}^{\lambda}_{n} for each λ\lambda, which makes its implementation fast.

To start with, let us notice that the term W(λ)(μθ^nλ,μ)W^{(\lambda)}(\mu_{\hat{\theta}^{\lambda}_{n}},\mu_{\star}) expresses the robust Wasserstein distance between the fitted model (μθ^nλ\mu_{\hat{\theta}^{\lambda}_{n}}) and the actual data generating measure (μ\mu_{\star}). Intuitively, outlying values induce bias in the estimates, entailing abrupt changes in the distribution of θ^nλ\hat{\theta}^{\lambda}_{n}. This, in turn, entails changes in W(λ)(μθ^nλ,μ)W^{(\lambda)}(\mu_{\hat{\theta}^{\lambda}_{n}},\mu_{\star}). Therefore, the stability of the estimates and the stability of W(λ)(μθ^nλ,μ)W^{(\lambda)}(\mu_{\hat{\theta}^{\lambda}_{n}},\mu_{\star}) are tied together.

This consideration is the stepping stone of our selection criterion for λ\lambda. To elaborate further, repeated applications of the triangle inequality yield

W(λ)(μθ^nλ,μ)\displaystyle W^{(\lambda)}(\mu_{\hat{\theta}^{\lambda}_{n}},\mu_{\star}) W(λ)(μθ^nλ,μ^n)+W(λ)(μ^n,μ)W(λ)(μθ,μ)+2W(λ)(μ^n,μ)\displaystyle\leq W^{(\lambda)}(\mu_{\hat{\theta}^{\lambda}_{n}},\hat{\mu}_{n})+W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})\leq W^{(\lambda)}(\mu_{\theta^{*}},\mu_{\star})+2W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})
W(λ)(μθ,μ)+2E[W(λ)(μ^n,μ)]+2|W(λ)(μ^n,μ)E[W(λ)(μ^n,μ)]|,\displaystyle\leq W^{(\lambda)}(\mu_{\theta^{*}},\mu_{\star})+2{\rm E}[W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})]+2|W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})-{\rm E}[W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})]|, (22)

which implies that the term W(λ)(μ^n,μ)E[W(λ)(μ^n,μ)]W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})-{\rm E}[W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star})] has an influence on the behavior of W(λ)(μθ^nλ,μ)W^{(\lambda)}(\mu_{\hat{\theta}^{\lambda}_{n}},\mu_{\star}). Therefore, we argue that, to reduce the impact of outliers on the distribution of W(λ)(μ^n,μ)W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star}) (hence, on the distribution of θ^nλ\hat{\theta}^{\lambda}_{n}), we need to control the concentration of W(λ)(μ^n,μ)W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star}) about its mean. To study this quantity, we make use of (9) and (10) and we define a data-driven and non-asymptotic (i.e. depending on the finite nn, on the estimated σ\sigma and on the guessed |O||O|) selection criterion for λ\lambda.

We refer to Appendix .3.2 for the mathematical details; here, we simply state the basic intuition. Our idea hinges on the fact that λ\lambda controls the impact that the outliers have on the concentration bounds of W(λ)(μ^n,μ)W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star}). Thus, one may think of selecting the value(s) of λ\lambda ensuring a desired level of stability in the distribution of W(λ)(μ^n,μ)W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star}). Operationally, we encode this stability in the ratio between the quantiles of W(λ)(μ^n,μ)W^{(\lambda)}(\hat{\mu}_{n},\mu_{\star}) in the presence and in the absence of contamination (see the quantity 𝒬\mathcal{Q} in Appendix .3.2). Then, we measure the impact that different values of λ\lambda have on this ratio by computing its derivative w.r.t. λ\lambda (see the quantity 𝒬/λ{\partial\mathcal{Q}}/{\partial\lambda} in Appendix .3.2). Finally, we specify a tolerance for the changes in this derivative and we identify the value(s) of λ\lambda yielding the desired level of stability.

The described procedure allows us to identify a (range of values of) λ\lambda which is neither too close to zero nor too large. This is a suitable and sensible achievement. Indeed, on the one hand, values of λ\lambda too close to zero imply that W(λ)W^{(\lambda)} trims too many observations, which likely include inliers in addition to most outliers. On the other hand, too large values of λ\lambda imply that W(λ)W^{(\lambda)} trims a very limited number of outliers; see Appendix .3.1 for a numerical illustration of this aspect in terms of MSE as a function of λ\lambda. Additional methodological aspects of the selection procedure and its numerical illustration are deferred to Appendix .3.1 and Appendix .3.2.
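Purely as an illustration of this idea (the precise definitions of 𝒬 and of the tolerance are those of Appendix .3.2, not the simplified ones below), the following schematic Monte Carlo sketch scans a grid of λ values, estimates a quantile of W^(λ)(μ̂_n, μ⋆) with and without contamination, and inspects the finite-difference derivative of their ratio. The sampling functions, the use of the POT library and the quantile level are our own illustrative choices.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def robust_w(x, y, lam):
    """Empirical W^(lambda) between two univariate samples (truncated cost)."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    C = np.minimum(np.abs(x[:, None] - y[None, :]), 2.0 * lam)
    return ot.emd2(a, b, C)

def quantile_ratio(sample_clean, sample_contam, sample_ref, lams, n_rep=200, q=0.95):
    """Schematic Q(lambda): ratio of the q-quantiles of W^(lambda)(mu_hat_n, mu_star)
    under contaminated and clean sampling; sample_ref() approximates mu_star."""
    Q = []
    for lam in lams:
        w_clean = [robust_w(sample_clean(), sample_ref(), lam) for _ in range(n_rep)]
        w_cont = [robust_w(sample_contam(), sample_ref(), lam) for _ in range(n_rep)]
        Q.append(np.quantile(w_cont, q) / np.quantile(w_clean, q))
    return np.asarray(Q)

def stable_lambdas(lams, Q, tol=0.05):
    """Values of lambda at which the finite-difference derivative of Q is below
    the chosen tolerance; one then picks a value inside this range."""
    lams = np.asarray(lams)
    dQ = np.gradient(Q, lams)
    return lams[np.abs(dQ) <= tol]
```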

Appendix

This Appendix complements the results available in the main text. Specifically, Appendix .1 provides some key theoretical results of optimal transport—the reader familiar with optimal transport theory may skip Appendix .1. All proofs are collected in Appendix .2. Appendix .4, as a complement to §5, provides more details and additional numerical results of a few applications of ROBOT—see Appendix .4.1 for an application of ROBOT to detect outliers in the context of parametric estimation in a linear regression model. In Appendix .4.2, we provide more details about implementation of RWGAN. Appendix .3 contains the methodological aspects of λ\lambda selection. All our numerical results can be reproduced using the (Python and R) code available upon request to the corresponding author.

.1 Some key theoretical details about optimal transport

This section is supplementary to Section 2.1. For further details about optimal transport, we refer the reader to Villani (2003, 2009).

Measure transportation theory dates back to the celebrated work [43], in which Monge formulated a mathematical problem that in modern language can be expressed as follows. Given two probability measures μ𝒫(𝒳)\mu\in\mathcal{P(X)}, ν𝒫(𝒴)\nu\in\mathcal{P(Y)} and a cost function c:𝒳×𝒴[0,+]c:\mathcal{X\times Y}\rightarrow[0,+\infty], Monge’s problem consists in solving the minimization problem

inf{𝒳c(x,𝒯(x))𝑑μ(x):𝒯#μ=ν}\inf\left\{{\int_{\mathcal{X}}c(x,\mathcal{T}(x))d\mu(x):\mathcal{T}\#\mu=\nu}\right\} (23)

(in the measure transportation terminology, 𝒯#μ=ν\mathcal{T}\#\mu=\nu means that 𝒯\mathcal{T} pushes μ\mu forward to ν\nu). The solution to Monge’s problem (MP) (23) is called the optimal transport map. Informally, one says that 𝒯\mathcal{T} transports the mass represented by the measure μ\mu to the mass represented by the measure ν\nu. The optimal transport map which solves problem (23) (for a given cc) hence naturally yields the minimal cost of transportation.

It should be stressed that Monge’s formulation has two undesirable features. First, for some measures, no solution to (23) exists. Consider, for instance, the case in which μ\mu is a Dirac measure while ν\nu is not: then there is no transport map which transports mass between μ\mu and ν\nu. Second, since the set {μ,𝒯(ν)}\{\mu,{\cal T}(\nu)\} of all measurable transport maps is non-convex, solving problem (23) is algorithmically challenging.

Monge’s problem was revisited by [33], in which a more flexible and computationally feasible formulation was proposed. The intuition behind Kantorovich’s problem is that mass can be disassembled and combined freely. Let Π(μ,ν)\Pi(\mu,\nu) denote the set of all joint probability measures with marginals μ𝒫(𝒳)\mu\in\mathcal{P(X)} and ν𝒫(𝒴)\nu\in\mathcal{P(Y)}. Kantorovich’s problem aims at finding a joint distribution πΠ(μ,ν)\pi\in\Pi(\mu,\nu) which minimizes the expected cost of the coupling between XX and YY in terms of the cost function cc, and it can be formulated as

inf{𝒳×𝒴c(x,y)𝑑π(x,y):πΠ(μ,ν)}.\inf\left\{\int_{\mathcal{X\times Y}}c(x,y)d\pi(x,y):\pi\in\Pi(\mu,\nu)\right\}. (24)

A solution to Kantorovich’s problem (KP) (24) is called an optimal transport plan. Note that Kantorovich’s problem is more general than Monge’s one, since it allows for mass splitting. Moreover, unlike {μ,𝒯(ν)}\{\mu,{\cal T}(\nu)\}, the set Π(μ,ν)\Pi(\mu,\nu) is convex, and the solution to (24) exists under some mild assumptions on cc, e.g., lower semicontinuity (see [57, Chapter 4]). [11] established the relationship between the optimal transport plan and the optimal transport map when the cost function is the squared Euclidean distance. More specifically, in that case the optimal transport plan π\pi can be expressed as (Id,𝒯)#μ(\mathrm{Id},\mathcal{T})_{\#}\mu.

The dual form (KD) of Kantorovich’s primal minimization problem (24) is to solve the maximization problem

sup{𝒴ϕ𝑑ν𝒳ψ𝑑μ:ϕCb(𝒴),ψCb(𝒳)}s.t.ϕ(y)ψ(x)c(x,y),(x,y),\begin{array}[]{ll}&\sup\left\{\int_{\mathcal{Y}}\phi d\nu-\int_{\mathcal{X}}\psi d\mu:\phi\in C_{b}(\mathcal{Y}),\psi\in C_{b}(\mathcal{X})\right\}\\ &{\rm s.t.}\quad\phi(y)-\psi(x)\leq c(x,y),\enspace\forall(x,y),\end{array} (25)

where Cb(𝒳)C_{b}(\mathcal{X}) is the set of bounded continuous functions on 𝒳\mathcal{X}. According to Theorem 5.10 in [57], if the cost function cc is lower semicontinuous, there exists a solution to the dual problem such that the optimal values of KD and KP coincide, namely there is no duality gap. In this case, the solution takes the form

ϕ(y)=infx𝒳[ψ(x)+c(x,y)]andψ(x)=supy𝒴[ϕ(y)c(x,y)],\phi(y)=\inf\limits_{x\in\mathcal{X}}[\psi(x)+c(x,y)]\quad\text{and}\quad\psi(x)=\sup\limits_{y\in\mathcal{Y}}[\phi(y)-c(x,y)], (26)

where the functions ϕ\phi and ψ\psi are called cc-concave and cc-convex, respectively, and ϕ\phi (resp. ψ\psi) is called the cc-transform of ψ\psi (resp. ϕ\phi). In the special case where cc is a metric on 𝒳\mathcal{X}, equation (25) simplifies to

sup{𝒴ψ𝑑ν𝒳ψ𝑑μ:ψis 1-Lipschitz continuous},\sup\left\{\int_{\mathcal{Y}}\psi d\nu-\int_{\mathcal{X}}\psi d\mu:\psi\,\text{is 1-Lipschitz continuous}\right\}, (27)

This case is commonly known as Kantorovich-Rubenstein duality.

Another important concept in measure transportation theory is the Wasserstein distance. Let (𝒳,d)(\mathcal{X},d) denote a complete metric space equipped with a metric d:𝒳×𝒳d:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}, and let μ\mu and ν\nu be two probability measures on 𝒳\mathcal{X}. Solving the optimal transport problem in (24) with the cost function c(x,y)=dp(x,y)c(x,y)=d^{p}(x,y) induces a distance between μ\mu and ν\nu, called the Wasserstein distance. More specifically, the Wasserstein distance of order pp (p1)(p\geq 1) is defined by (2).
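For readers who prefer a computational view, the following minimal sketch (assuming the POT library, which is one standard Python implementation of the discrete problems above) contrasts the empirical W1 with the empirical robust distance obtained by truncating the cost at 2λ, as in the definition of cλ: the truncation caps the contribution of far-away (outlying) mass.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                                   # sample from mu
y = np.concatenate([rng.normal(size=(180, 1)),
                    rng.normal(loc=8.0, size=(20, 1))])         # contaminated sample from nu

a = np.full(len(x), 1.0 / len(x))
b = np.full(len(y), 1.0 / len(y))
C = ot.dist(x, y, metric='euclidean')                           # cost c(x, y) = |x - y|

lam = 2.0
W1 = ot.emd2(a, b, C)                                           # classical W_1
W_lam = ot.emd2(a, b, np.minimum(C, 2 * lam))                   # robust W^(lambda)
print(W1, W_lam)                                                # W_lam is less inflated by outliers
```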

.2 Proofs

Proof of Theorem 2.1. First, if ψ(y)ψ(x)c(x,y)\psi(y)-\psi(x)\leq c(x,y), then ψ(y)c(x,y)ψ(x)\psi(y)-c(x,y)\leq\psi(x) and hence

supy𝒳(ψ(y)c(x,y))ψ(x),\sup\limits_{y\in\mathcal{X}}(\psi(y)-c(x,y))\leq\psi(x),

where the supremum on the left-hand side is reached at y=xy=x. Therefore, ψ(x)\psi(x) is c-convex and ψ(x)=ψc(x)\psi(x)=\psi^{c}(x). Combining this with Theorem 5.10 in [57] completes the proof. ∎

Proof of Lemma 3.1. We will prove that cλc_{\lambda} satisfies the axioms of a distance. Let x,y,zx,y,z be three points in the space 𝒳\mathcal{X}. First, the symmetry of cλ(x,y)c_{\lambda}(x,y), that is, cλ(x,y)=cλ(y,x)c_{\lambda}(x,y)=c_{\lambda}(y,x), follows immediately from

min{c(x,y),2λ}=min{c(y,x),2λ}.\min\{c(x,y),2\lambda\}=\min\{c(y,x),2\lambda\}.

Next, it is obvious that cλ(x,x)=0c_{\lambda}(x,x)=0. Assuming that cλ(x,y)=0c_{\lambda}(x,y)=0, then

min{c(x,y),2λ}=c(x,y)=0,\min\{c(x,y),2\lambda\}=c(x,y)=0,

which entails that x=yx=y since c(x,y)c(x,y) is a metric. By definition, we also have cλ(x,y)0c_{\lambda}(x,y)\geq 0 for all x,y𝒳x,y\in\mathcal{X}.

Finally, note that

cλ(x,z)+cλ(y,z)≥2λ≥min{c(x,y),2λ}=cλ(x,y)c_{\lambda}(x,z)+c_{\lambda}(y,z)\geq 2\lambda\geq\min\{c(x,y),2\lambda\}=c_{\lambda}(x,y)

when c(x,z)c(x,z) or c(y,z)>2λc(y,z)>2\lambda, and

cλ(x,z)+cλ(y,z)=c(x,z)+c(y,z)c(x,y)cλ(x,y)c_{\lambda}(x,z)+c_{\lambda}(y,z)=c(x,z)+c(y,z)\geq c(x,y)\geq c_{\lambda}(x,y)

when c(x,z)2λc(x,z)\leq 2\lambda and c(y,z)2λc(y,z)\leq 2\lambda. Therefore, we conclude that cλc_{\lambda} is a metric on 𝒳\mathcal{X}. ∎

Proof of Theorem 3.2. We prove that W(λ)W^{(\lambda)} satisfies the axioms of a distance. First, W(λ)(μ,ν)0W^{(\lambda)}(\mu,\nu)\geq 0 is obvious. Conversely, let μ,ν\mu,\nu be two probability measures such that W(λ)(μ,ν)=0W^{(\lambda)}(\mu,\nu)=0; then there exists at least one optimal transport plan π\pi [57]. Since the total cost of π\pi is zero and cλc_{\lambda} is a metric (Lemma 3.1), the support of π\pi is contained in the diagonal (y=x)(y=x). Thus, for all continuous and bounded functions ψ\psi,

ψ𝑑μ=ψ(x)𝑑π(x,y)=ψ(y)𝑑π(x,y)=ψ𝑑ν,\int\psi d\mu=\int\psi(x)d\pi(x,y)=\int\psi(y)d\pi(x,y)=\int\psi d\nu,

which implies μ=ν\mu=\nu.

Next, let μ1,μ2\mu_{1},\mu_{2} and μ3\mu_{3} be three probability measures on 𝒳\mathcal{X}, and let π12Π(μ1,μ2)\pi_{12}\in\Pi(\mu_{1},\mu_{2}), π23Π(μ2,μ3)\pi_{23}\in\Pi(\mu_{2},\mu_{3}) be optimal transport plans. By the Gluing Lemma, there exists a probability measure π\pi in 𝒫(𝒳×𝒳×𝒳)\mathcal{P}(\mathcal{X}\times\mathcal{X}\times\mathcal{X}) whose marginals on the corresponding pairs of coordinates are π12\pi_{12} and π23\pi_{23}; denote by π13\pi_{13} its marginal on the first and third coordinates, which belongs to Π(μ1,μ3)\Pi(\mu_{1},\mu_{3}). Using the triangle inequality in Lemma 3.1, we obtain

W(λ)(μ1,μ3)\displaystyle W^{(\lambda)}(\mu_{1},\mu_{3}) 𝒳×𝒳cλ(x1,x3)𝑑π13(x1,x3)\displaystyle\leq\int_{\mathcal{X}\times\mathcal{X}}c_{\lambda}(x_{1},x_{3})d\pi_{13}(x_{1},x_{3})
=𝒳×𝒳×𝒳cλ(x1,x3)𝑑π(x1,x2,x3)\displaystyle=\int_{\mathcal{X}\times\mathcal{X}\times\mathcal{X}}c_{\lambda}(x_{1},x_{3})d\pi(x_{1},x_{2},x_{3})
𝒳×𝒳×𝒳[cλ(x1,x2)+cλ(x2,x3)]𝑑π(x1,x2,x3)\displaystyle\leq\int_{\mathcal{X}\times\mathcal{X}\times\mathcal{X}}[c_{\lambda}(x_{1},x_{2})+c_{\lambda}(x_{2},x_{3})]d\pi(x_{1},x_{2},x_{3})
=𝒳×𝒳cλ(x1,x2)𝑑π12(x1,x2)+𝒳×𝒳cλ(x2,x3)𝑑π23(x2,x3)\displaystyle=\int_{\mathcal{X}\times\mathcal{X}}c_{\lambda}(x_{1},x_{2})d\pi_{12}(x_{1},x_{2})+\int_{\mathcal{X}\times\mathcal{X}}c_{\lambda}(x_{2},x_{3})d\pi_{23}(x_{2},x_{3})
=W(λ)(μ1,μ2)+W(λ)(μ2,μ3)\displaystyle=W^{(\lambda)}(\mu_{1},\mu_{2})+W^{(\lambda)}(\mu_{2},\mu_{3})

Moreover, the symmetry is obvious since

W(λ)(μ,ν)=infπΠ(μ,ν)(cλ(x,y)𝑑π(x,y))=W(λ)(ν,μ).W^{(\lambda)}(\mu,\nu)=\inf\limits_{\pi\in\Pi(\mu,\nu)}\left(\int c_{\lambda}(x,y)d\pi(x,y)\right)=W^{(\lambda)}(\nu,\mu).

Finally, note that W(λ)W^{(\lambda)} is finite since

W(λ)(μ,ν)=infπΠ(μ,ν)∫cλ(x,y)𝑑π(x,y)≤infπΠ(μ,ν)∫2λ𝑑π(x,y)=2λ<+∞.W^{(\lambda)}(\mu,\nu)=\inf\limits_{\pi\in\Pi(\mu,\nu)}\int c_{\lambda}(x,y)d\pi(x,y)\leq\inf\limits_{\pi\in\Pi(\mu,\nu)}\int 2\lambda d\pi(x,y)=2\lambda<+\infty.

In conclusion, W(λ)W^{(\lambda)} is a finite distance on 𝒫(𝒳)\mathcal{P(X)}. ∎

Proof of Theorem 3.4. First, we prove that W(λ)W^{(\lambda)} is monotonically increasing with respect to λ\lambda. Assume that λ2>λ1>0\lambda_{2}>\lambda_{1}>0; then cλ2cλ1c_{\lambda_{2}}\geq c_{\lambda_{1}}. For =1,2\ell=1,2, let πλ\pi_{\lambda_{\ell}} be an optimal transport plan for the cost function cλc_{\lambda_{\ell}}, so that cλ𝑑πλ\int c_{\lambda_{\ell}}d\pi_{\lambda_{\ell}} attains the minimum. Since cλ2cλ1c_{\lambda_{2}}\geq c_{\lambda_{1}} and πλ1\pi_{\lambda_{1}} is optimal for cλ1c_{\lambda_{1}}, we have cλ2𝑑πλ2cλ1𝑑πλ2cλ1𝑑πλ1\int c_{\lambda_{2}}d\pi_{\lambda_{2}}\geq\int c_{\lambda_{1}}d\pi_{\lambda_{2}}\geq\int c_{\lambda_{1}}d\pi_{\lambda_{1}}.

Then we prove the continuity of W(λ)W^{(\lambda)}. Fixing ε>0\varepsilon>0 and letting 2λ22λ1<ε2\lambda_{2}-2\lambda_{1}<\varepsilon, we have

|W(λ2)W(λ1)|\displaystyle\left|W^{(\lambda_{2})}-W^{(\lambda_{1})}\right| =|cλ2𝑑πλ2cλ1𝑑πλ1|\displaystyle=\left|\int c_{\lambda_{2}}d\pi_{\lambda_{2}}-\int c_{\lambda_{1}}d\pi_{\lambda_{1}}\right|
|cλ2𝑑πλ1cλ1𝑑πλ1|\displaystyle\leq\left|\int c_{\lambda_{2}}d\pi_{\lambda_{1}}-\int c_{\lambda_{1}}d\pi_{\lambda_{1}}\right|
=|(cλ2cλ1)𝑑πλ1|\displaystyle=\left|\int(c_{\lambda_{2}}-c_{\lambda_{1}})d\pi_{\lambda_{1}}\right|
2λ22λ1ε.\displaystyle\leq 2\lambda_{2}-2\lambda_{1}\leq\varepsilon.

Therefore, W(λ)W^{(\lambda)} is continuous.

Now we turn to the last part of Theorem 3.4. Note that if W1W_{1} exists, we have W(λ)W1W^{(\lambda)}\leq W_{1} for any λ[0,)\lambda\in[0,\infty). Since W(λ)W^{(\lambda)} is increasing with respect to λ\lambda, its limit W:=limλW(λ)(μ,ν)W^{*}:=\lim_{\lambda\rightarrow\infty}W^{(\lambda)}(\mu,\nu) exists. It remains to prove W=W1W^{*}=W_{1}. Since W1<+W_{1}<+\infty, for any fixed ε>0\varepsilon>0, there exists a number RR such that c>Rc𝑑π<ε,\int_{c>R}cd\pi<\varepsilon, where π\pi is the optimal transport plan corresponding to the cost function cc. Let 2λ>R2\lambda>R; then W(λ)>cRc𝑑πW^{(\lambda)}>\int_{c\leq R}cd\pi and |W(λ)W1|<ε|W^{(\lambda)}-W_{1}|<\varepsilon. Finally, since ε\varepsilon is arbitrarily small, we have W=W1W^{*}=W_{1}. ∎

In order to prove Theorem 3.5, we need a lemma which shows that a Cauchy sequence in robust Wasserstein distance is tight.

Lemma .1.

Let 𝒳\mathcal{X} be a Polish space and (μk)k(\mu_{k})_{k\in\mathbb{N}} be a Cauchy sequence in (𝒫(𝒳),W(λ))(\mathcal{P(X)},W^{(\lambda)}). Then (μk)k(\mu_{k})_{k\in\mathbb{N}} is tight.

Proof of Lemma .1. Because (μk)k(\mu_{k})_{k\in\mathbb{N}} is a Cauchy sequence, we have

W(λ)(μk,μl)0W^{(\lambda)}(\mu_{k},\mu_{l})\rightarrow 0

as k,l.k,l\rightarrow\infty. Hence, for ε>0\varepsilon>0, there is an NN such that

W(λ)(μk,μl)ε2fork,lNW^{(\lambda)}(\mu_{k},\mu_{l})\leq\varepsilon^{2}\quad\text{for}\ \ k,l\geq N (28)

Then for any kk\in\mathbb{N}, there is a jNj\leq N such that W(λ)(μk,μj)ε2W^{(\lambda)}(\mu_{k},\mu_{j})\leq\varepsilon^{2} (if k>Nk>N, take j=Nj=N and apply (28); if kNk\leq N, we just let j=kj=k).

Since the finite set {μ1,μ2,,μN}\left\{\mu_{1},\mu_{2},\dots,\mu_{N}\right\} is tight, there is a compact set KεK_{\varepsilon} such that μj[𝒳\Kε]<ε\mu_{j}[\mathcal{X}\backslash K_{\varepsilon}]<\varepsilon for all j{1,,N}j\in\left\{1,\dots,N\right\}. By compactness, KεK_{\varepsilon} can be covered by a finite number of small balls, that is, KεB(x1,ε)B(xm,ε)K_{\varepsilon}\subset B(x_{1},\varepsilon)\cup\dots\cup B(x_{m},\varepsilon) for a fixed integer mm.

Now write

Uε:=B(x1,ε)B(xm,ε),U~ε:={x𝒳:d(x,U)<ε}B(x1,2ε)B(xm,2ε),ϕ(x):=(1d(x,U)ε)+.\begin{array}[]{cc}U_{\varepsilon}:=B(x_{1},\varepsilon)\cup\dots\cup B(x_{m},\varepsilon),\\ \tilde{U}_{\varepsilon}:=\left\{x\in\mathcal{X}:d(x,U)<\varepsilon\right\}\subset B(x_{1},2\varepsilon)\cup\dots\cup B(x_{m},2\varepsilon),\\ \phi(x):=\left(1-\dfrac{d(x,U)}{\varepsilon}\right)_{+}.\end{array}

Note that 𝟙Uεϕ𝟙U~ε\mathbbm{1}_{U_{\varepsilon}}\leq\phi\leq\mathbbm{1}_{\tilde{U}_{\varepsilon}} and ϕ\phi is (1/ε)(1/\varepsilon)-Lipschitz. By Theorem 2.1, we find that, for jNj\leq N, for 2λε1\dfrac{2\lambda}{\varepsilon}\geq 1 (which is not restrictive, because we need ε\varepsilon as small as possible) and for arbitrary kk,

μk[U~ε]\displaystyle\mu_{k}\left[\tilde{U}_{\varepsilon}\right] ϕ𝑑μk\displaystyle\geq\int\phi d\mu_{k}
=ϕ𝑑μj+(ϕ𝑑μkϕ𝑑μj)\displaystyle=\int\phi d\mu_{j}+\left(\int\phi d\mu_{k}-\int\phi d\mu_{j}\right)
(1)ϕ𝑑μjW(λ)(μk,μj)ε\displaystyle\mathop{\geq}\limits_{(1)}\int\phi d\mu_{j}-\frac{W^{(\lambda)}\left(\mu_{k},\mu_{j}\right)}{\varepsilon}
μj[Uε]W(λ)(μk,μj)ε.\displaystyle\geq\mu_{j}[U_{\varepsilon}]-\frac{W^{(\lambda)}\left(\mu_{k},\mu_{j}\right)}{\varepsilon}.

Inequality (1)(1) holds since

ϕ𝑑μkϕ𝑑μj\displaystyle\int\phi d\mu_{k}-\int\phi d\mu_{j}\leq suprange(f)1|f(x1)f(x2)|1ϵd(x1,x2){f𝑑μkf𝑑μj}\displaystyle\sup_{\mathrm{range}(f)\leq 1\atop|f(x_{1})-f(x_{2})|\leq\frac{1}{\epsilon}d(x_{1},x_{2})}\left\{\int fd\mu_{k}-\int fd\mu_{j}\right\}
\displaystyle\leq suprange(f)2λϵ|f(x1)f(x2)|1ϵd(x1,x2){f𝑑μkf𝑑μj}\displaystyle\sup_{\mathrm{range}(f)\leq\frac{2\lambda}{\epsilon}\atop|f(x_{1})-f(x_{2})|\leq\frac{1}{\epsilon}d(x_{1},x_{2})}\left\{\int fd\mu_{k}-\int fd\mu_{j}\right\}
=\displaystyle= suprange(f)2λ|f(x1)f(x2)|d(x1,x2)1ϵ{f𝑑μkf𝑑μj}\displaystyle\sup_{\mathrm{range}(f)\leq 2\lambda\atop|f(x_{1})-f(x_{2})|\leq d(x_{1},x_{2})}\frac{1}{\epsilon}\left\{\int fd\mu_{k}-\int fd\mu_{j}\right\}
=W(λ)(μk,μj)ε.\displaystyle=\frac{W^{(\lambda)}\left(\mu_{k},\mu_{j}\right)}{\varepsilon}.

Note that μj[Uε]μj[Kε]1ε\mu_{j}[U_{\varepsilon}]\geq\mu_{j}[K_{\varepsilon}]\geq 1-\varepsilon if jNj\leq N, and, for each kk, we can find j=j(k)Nj=j(k)\leq N such that W(λ)(μk,μj)ε2W^{(\lambda)}\left(\mu_{k},\mu_{j}\right)\leq\varepsilon^{2}. We therefore have

μk[U~ε]1εε2ε=12ε.\mu_{k}\left[\tilde{U}_{\varepsilon}\right]\geq 1-\varepsilon-\frac{\varepsilon^{2}}{\varepsilon}=1-2\varepsilon.

Now we have shown the following: For each ε>0\varepsilon>0 there is a finite family (xi)1im\left(x_{i}\right)_{1\leq i\leq m} such that all measures μk\mu_{k} give mass at least 12ε1-2\varepsilon to the set Zε:=B(xi,2ε)Z_{\varepsilon}:=\cup B\left(x_{i},2\varepsilon\right).

For \ell\in\mathbb{N}, we can find (xi)1im()\left(x_{i}\right)_{1\leq i\leq m(\ell)} such that

μk[𝒳\1im()B(xi,2ε)]2ε.\mu_{k}\left[\mathcal{X}\backslash\bigcup_{1\leq i\leq m(\ell)}B\left(x_{i},2^{-\ell}\varepsilon\right)\right]\leq 2^{-\ell}\varepsilon.

Now we let

Sε:=1p1im(p)B(xi,2pε)¯.S_{\varepsilon}:=\bigcap_{1\leq p\leq\infty}\bigcup_{1\leq i\leq m(p)}\overline{B\left(x_{i},2^{-p}\varepsilon\right)}.

It is clear that μk[𝒳\Sε]ε\mu_{k}[\mathcal{X}\backslash S_{\varepsilon}]\leq\varepsilon.

By construction, SεS_{\varepsilon} is closed and it can be covered by finitely many balls of arbitrarily small radius, so it is also totally bounded. We note that, in a complete metric space, a closed and totally bounded set is compact. In conclusion, SεS_{\varepsilon} is compact. The result then follows. ∎

Proof of Theorem 3.5. First, we prove that μk\mu_{k} converges to μ\mu in 𝒫(𝒳){\cal P}(\mathcal{X}) if W(λ)(μk,μ)0W^{(\lambda)}\left(\mu_{k},\mu\right)\rightarrow 0. By Lemma .1, the sequence (μk)k\left(\mu_{k}\right)_{k\in\mathbb{N}} is tight, so there is a subsequence (μk)\left(\mu_{k^{\prime}}\right) such that μk\mu_{k^{\prime}} converges weakly to some probability measure μ~\widetilde{\mu}. Let hh be the solution of the dual form of ROBOT for μ~\widetilde{\mu} and μ\mu. Then, by Theorem 2.1,

W(λ)(μ~,μ)\displaystyle W^{(\lambda)}(\widetilde{\mu},\mu) =h𝑑μ~h𝑑μ\displaystyle=\int hd\widetilde{\mu}-\int hd\mu
=(1)limk{h𝑑μkh𝑑μ}\displaystyle\mathop{=}\limits_{(1)}\lim\limits_{k^{\prime}\rightarrow\infty}\left\{\int hd\mu_{k^{\prime}}-\int hd\mu\right\}
limksuprange(f)2λ|f(x1)f(x2)|d(x1,x2){f𝑑μkf𝑑μ}\displaystyle\leq\lim\limits_{k^{\prime}\rightarrow\infty}\sup_{\mathrm{range}(f)\leq 2\lambda\atop|f(x_{1})-f(x_{2})|\leq d(x_{1},x_{2})}\left\{\int fd\mu_{k^{\prime}}-\int fd\mu\right\}
=limkW(λ)(μ,μk)\displaystyle=\lim\limits_{k^{\prime}\rightarrow\infty}W^{(\lambda)}(\mu,\mu_{k^{\prime}})
=(2)limkW(λ)(μ,μk)=0,\displaystyle\mathop{=}\limits_{(2)}\lim\limits_{k\rightarrow\infty}W^{(\lambda)}(\mu,\mu_{k})=0,

where (1)(1) holds since the subsequence (μk)(\mu_{k^{\prime}}) converges weakly to μ~\tilde{\mu} and (2)(2) holds since (μk)(\mu_{k^{\prime}}) is a subsequence of (μk)(\mu_{k}). Therefore, μ~=μ\tilde{\mu}=\mu, and the whole sequence (μk)k\left(\mu_{k}\right)_{k\in\mathbb{N}} has to converge to μ\mu weakly.

Conversely, suppose (μk)k\left(\mu_{k}\right)_{k\in\mathbb{N}} converges to μ\mu weakly. By Prokhorov’s theorem, (μk)\left(\mu_{k}\right) forms a tight sequence; also, μ\mu is tight. Let (πk)(\pi_{k}) be a sequence representing the optimal plan for μ\mu and μk\mu_{k}. By Lemma 4.4 in [57], the sequence (πk)\left(\pi_{k}\right) is tight in P(𝒳×𝒳)P(\mathcal{X}\times\mathcal{X}). So, up to the extraction of a subsequence, denoted by (πk1)\left(\pi_{k_{1}}\right), we may assume that πk1\pi_{k_{1}} converges to π\pi weakly in P(𝒳×𝒳)P(\mathcal{X}\times\mathcal{X}). Since each πk1\pi_{k_{1}} is optimal, Theorem 5.205.20 in [57] guarantees that π\pi is an optimal coupling of μ\mu and μ\mu, so this is the coupling π=(Id,Id)#μ\pi=(\mathrm{Id},\mathrm{Id})_{\#}\mu. In fact, this is independent of the extracted subsequence. So πk\pi_{k} converges to π\pi weakly. Finally,

limkW(λ)(μ,μk)=limkcλ𝑑πk=cλ𝑑π=0.\lim\limits_{k\rightarrow\infty}W^{(\lambda)}(\mu,\mu_{k})=\lim\limits_{k\rightarrow\infty}\int c_{\lambda}d\pi_{k}=\int c_{\lambda}d\pi=0.

Proof of Theorem 3.7. First, we prove completeness. Let (μk)k(\mu_{k})_{k\in\mathbb{N}} be a Cauchy sequence. By Lemma .1, (μk)k(\mu_{k})_{k\in\mathbb{N}} is tight and there is a subsequence (μk)(\mu_{k^{\prime}}) which converges to a measure μ\mu in 𝒫(𝒳)\mathcal{P(X)}. Therefore, the continuity of W(λ)W^{(\lambda)} (Corollary 3.6) entails that

limlW(λ)(μ,μl)=limk,lW(λ)(μk,μl)=0.\lim\limits_{l\rightarrow\infty}W^{(\lambda)}(\mu,\mu_{l})=\lim\limits_{k^{\prime},l\rightarrow\infty}W^{(\lambda)}(\mu_{k^{\prime}},\mu_{l})=0.

This proves completeness.

Next, we prove separability by using a rational step function approximation. Notice that any μ\mu in 𝒫(𝒳)\mathcal{P(X)} is tight, so for ε>0\varepsilon>0 we can find a compact set 𝒦ε𝒳\mathcal{K}_{\varepsilon}\subset\mathcal{X} such that 𝒳\𝒦ε2λ𝑑με.\int_{\mathcal{X\textbackslash K_{\varepsilon}}}2\lambda d\mu\leq\varepsilon. Letting 𝒟\mathcal{D} be a countable dense set in 𝒳\mathcal{X}, we can find a finite family of balls (B(xk,ε))1kN,xk𝒟(B(x_{k},\varepsilon))_{1\leq k\leq N},x_{k}\in\mathcal{D}, which covers 𝒦ε\mathcal{K}_{\varepsilon}. Moreover, letting

Bk=B(xk,ε)\j<kB(xj,ε),B_{k}=B(x_{k},\varepsilon)\text{\textbackslash}\cup_{j<k}B(x_{j},\varepsilon),

then 1kNBk\cup_{1\leq k\leq N}{B_{k}} still covers the 𝒦ε\mathcal{K}_{\varepsilon}, and BkB_{k} and BB_{\ell} are disjoint for kk\neq\ell. Then by defining a step function f(x):=k=1Nxk𝟙xBkf(x):=\sum_{k=1}^{N}x_{k}\mathbbm{1}_{x\in B_{k}}, we can easily find

cλ(x,f(x))𝑑μ(x)ε𝒦ε𝑑μ(x)+𝒳\𝒦ε2λ𝑑μ(x)2ε.\int c_{\lambda}(x,f(x))d\mu(x)\leq\varepsilon\int_{\mathcal{K}_{\varepsilon}}d\mu(x)+\int_{\mathcal{X\textbackslash K_{\varepsilon}}}2\lambda d\mu(x)\leq 2\varepsilon.

Moreover, f#μf_{\#}\mu can be written as j=1Najδxj\sum_{j=1}^{N}a_{j}\delta_{x_{j}}. Next, we prove that j=1Najδxj\sum_{j=1}^{N}a_{j}\delta_{x_{j}} can be approximated to arbitrary accuracy by another step function j=1Nbjδxj\sum_{j=1}^{N}b_{j}\delta_{x_{j}} with rational coefficients (bj)1jN(b_{j})_{1\leq j\leq N}. Specifically,

W^{(\lambda)}\left(\sum_{j=1}^{N}a_{j}\delta_{x_{j}},\sum_{j=1}^{N}b_{j}\delta_{x_{j}}\right)\leq 2\lambda\cdot\frac{1}{2}\sum_{j=1}^{N}|a_{j}-b_{j}|\leq 4\lambda\sum_{j=1}^{N}|a_{j}-b_{j}|,

since a coupling that leaves the common mass min(a_j, b_j) at each x_j and transports the remaining mass (whose total amount is (1/2)Σ_{j=1}^{N}|a_j − b_j|) in an arbitrary way pays at most 2λ per unit of transported mass, because c_λ ≤ 2λ.

Therefore, we can replace aja_{j} with some well-chosen rational coefficients bjb_{j}, and μ\mu can be approximated by bjδxj, 0jN\sum b_{j}\delta_{x_{j}},\,0\leq j\leq N with arbitrary precision. In conclusion, the set of finitely supported measures with rational coefficients is dense in 𝒫(𝒳)\mathcal{P(X)}. ∎

Proof of Theorem 3.9. The proof derives from [40, Theorem 5.1]. To apply this result, we have to check that Condition (11) p 779 in [40] holds. For this, let X1,,XnX_{1}^{\prime},\ldots,X_{n}^{\prime} denote i.i.d. data with common distribution μ\mu, independent from X1,,XnX_{1},\ldots,X_{n}. Denote by I={i1,,i|I|}I=\{i_{1},\ldots,i_{|I|}\}, fix j{1,,|I|}j\in\{1,\ldots,|I|\} and let X=(Xj)jIX=(X_{j})_{j\in I}, X=(Xi1,,Xij,,Xi|I|)X^{\prime}=(X_{i_{1}},\ldots,X_{i_{j}}^{\prime},\ldots,X_{i_{|I|}}) (the data XijX_{i_{j}} in the original dataset of inliers has been replaced by an independent copy XijX_{i_{j}}^{\prime}). Let μ~n(I)=|I|1(iI{ij}δXi+δXij)\tilde{\mu}_{n}^{(I)}=|I|^{-1}(\sum_{i\in I\setminus\{i_{j}\}}\delta_{X_{i}}+\delta_{X_{i_{j}}^{\prime}}) denote the associated empirical measure. Let us define the function f(X)=W(λ)(μ,μ^n(I))f(X)=W^{(\lambda)}(\mu,\hat{\mu}_{n}^{(I)}). We have, by the triangular inequality established in Theorem 3.2:

|f(X)f(X)|W(λ)(μ^n(I),μ~n(I)).|f(X)-f(X^{\prime})|\leqslant W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\tilde{\mu}^{(I)}_{n})\enspace.

By the duality formula given in Theorem 2.1, we have moreover

W(λ)(μ^n(I),μ~n(I))supψΨψ(Xij)ψ(Xij)|I|min(d(Xij,Xij),2λ)|I|.W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\tilde{\mu}^{(I)}_{n})\leqslant\sup_{\psi\in\Psi}\frac{\psi(X_{i_{j}})-\psi(X_{i_{j}}^{\prime})}{|I|}\leqslant\frac{\min(d(X_{i_{j}},X_{i_{j}}^{\prime}),2\lambda)}{|I|}\enspace.

Therefore, we deduce that, for any k2k\geqslant 2,

E[W(λ)(μ^n(I),μ~n(I))k](2λ)k2|I|kE[min(d2(Xij,Xij),(2λ)2)].{\rm E}[W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\tilde{\mu}^{(I)}_{n})^{k}]\leqslant\frac{(2\lambda)^{k-2}}{|I|^{k}}{\rm E}[\min(d^{2}(X_{i_{j}},X_{i_{j}}^{\prime}),(2\lambda)^{2})]\enspace.

Thus, the function ff satisfies Condition (11) p 779 in [40] with M=2λ/|I|M=2\lambda/|I| and σj2=E[min(d2(Xij,Xij),(2λ)2)]/|I|\sigma_{j}^{2}=\sqrt{{\rm E}[\min(d^{2}(X_{i_{j}},X_{i_{j}}^{\prime}),(2\lambda)^{2})]}/|I|, thus by Theorem 5.1 in [40], we deduce that, for any t>0t>0,

(W(λ)(μ^n(I),μ)E[W(λ)(μ^n(I),μ)]>t)exp(|I|t22σ2+4λt).\mathbb{P}(W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\mu^{*})-{\rm E}[W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\mu^{*})]>t)\leqslant\exp\bigg{(}-\frac{|I|t^{2}}{2\sigma^{2}+4\lambda t}\bigg{)}\enspace.

Next, letting

f(X)=W(λ)(μ,μ^n(I))f(X)=-W^{(\lambda)}(\mu,\hat{\mu}_{n}^{(I)})

and repeating the above arguments yield

(W(λ)(μ^n(I),μ)E[W(λ)(μ^n(I),μ)]<t)exp(|I|t22σ2+4λt).\mathbb{P}(W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\mu^{*})-{\rm E}[W^{(\lambda)}(\hat{\mu}^{(I)}_{n},\mu^{*})]<-t)\leqslant\exp\bigg{(}-\frac{|I|t^{2}}{2\sigma^{2}+4\lambda t}\bigg{)}\enspace.

Inequality (9) then follows from piecing together the above two inequalities.

Inequality (10) is obtained by plugging the first inequality into inequality (11). ∎

Proof of Theorem 3.10. The proof is divided into three steps. We start by using the duality formula derived in Theorem 2.1 and prove a slight extension of Dudley's inequality, in the spirit of [9], to reduce the bound to the computation of the entropy numbers of the set Ψ\Psi of 11-Lipschitz functions defined on B(0,K)B(0,K) and taking values in [λ,λ][-\lambda,\lambda]. Then we compute these entropy numbers using a slight extension of the computation done, for example, in [56, Exercise 8.2.8]. Finally, we put these ingredients together to derive our final results.

Step 1: Reduction to bounding entropy. Using the dual formulation of W(λ)W^{(\lambda)}, we get that

W(λ)(μ,μ^n)=supψΨ1ni=1nψ(Xi)E[ψ(Xi)].W^{(\lambda)}(\mu,\hat{\mu}_{n})=\sup_{\psi\in\Psi}\frac{1}{n}\sum_{i=1}^{n}\psi(X_{i})-{\rm E}[\psi(X_{i})]\enspace.

Fix first ψ\psi and ψ\psi^{\prime} in Ψ\Psi. The random variables

Yi(ψ,ψ)=ψ(Xi)E[ψ(Xi)]{ψ(Xi)E[ψ(Xi)]}Y_{i}(\psi,\psi^{\prime})=\psi(X_{i})-{\rm E}[\psi(X_{i})]-\{\psi^{\prime}(X_{i})-{\rm E}[\psi^{\prime}(X_{i})]\}

are independent, centered and take values in [(4λ4K2ψψ),4λ4K2ψψ][-(4\lambda\wedge 4K\wedge 2\|\psi-\psi^{\prime}\|_{\infty}),4\lambda\wedge 4K\wedge 2\|\psi-\psi^{\prime}\|_{\infty}]. Therefore, by Hoeffding's inequality, the process (Yψ)ψΨ(Y_{\psi})_{\psi\in\Psi} defined by the variables

Yψ=1ni=1nψ(Xi)E[ψ(Xi)],Y_{\psi}=\frac{1}{n}\sum_{i=1}^{n}\psi(X_{i})-{\rm E}[\psi(X_{i})],

has increments YψYψY_{\psi}-Y_{\psi^{\prime}} satisfying

\forall s\in\mathbb{R},\qquad{\rm E}\big[\exp\big(s(Y_{\psi}-Y_{\psi^{\prime}})\big)\big]\leqslant\exp\bigg(\frac{s^{2}(\delta^{2}\wedge\|\psi-\psi^{\prime}\|_{\infty}^{2})}{n}\bigg)\enspace, (29)

where, here and in the rest of the proof, δ=2(λK)\delta=\sqrt{2}(\lambda\wedge K). Introducing the truncated sup-distance

d,δ(f,g)=fgδ,d_{\infty,\delta}(f,g)=\|f-g\|_{\infty}\wedge\delta,

we have just proved that the process YψY_{\psi} has sub-Gaussian increments with respect to the distance d,δd_{\infty,\delta}.

At this point, we use a chaining argument to bound the expectation E[supψΨYψ]{\rm E}[\sup_{\psi\in\Psi}Y_{\psi}]. Let k0k_{0} denote the largest kk such that 2k2δ2^{-k}\geqslant\sqrt{2}\delta. As 2k02^{-k_{0}} is larger than the diameter of Ψ\Psi w.r.t. d,δd_{\infty,\delta}, we define Ψk0={0}\Psi_{k_{0}}=\{0\}, and all functions ψΨ\psi\in\Psi satisfy d,δ(ψ,0)=d,δ(ψ,Ψk0)2k0d_{\infty,\delta}(\psi,0)=d_{\infty,\delta}(\psi,\Psi_{k_{0}})\leqslant 2^{-k_{0}}. For any k>k0k>k_{0}, denote by Ψk\Psi_{k} a 2k2^{-k}-net of Ψ\Psi with minimal cardinality, so that |Ψk||Ψk+1||\Psi_{k}|\leqslant|\Psi_{k+1}|. For any k>k0k>k_{0} and ψΨ\psi\in\Psi, let πk(ψ)Ψk\pi_{k}(\psi)\in\Psi_{k} be such that ψπk(ψ)2k\|\psi-\pi_{k}(\psi)\|_{\infty}\leqslant 2^{-k}. We have πk0(ψ)=0\pi_{k_{0}}(\psi)=0 for any ψΨ\psi\in\Psi, so we can write, for any k>k0k>k_{0},

ψ=ψπk0(ψ)=ψπk(ψ)+=k0+1kπ(ψ)π1(ψ).\psi=\psi-\pi_{k_{0}}(\psi)=\psi-\pi_{k}(\psi)+\sum_{\ell=k_{0}+1}^{k}\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi)\enspace. (30)

Writing αn(ψ)=ψdμ^nψdμ\alpha_{n}(\psi)=\int\psi{\rm d}\hat{\mu}_{n}-\int\psi{\rm d}\mu, we thus have

αn(ψ)=αn(ψπk(ψ))+=k0+1kαn(π(ψ)π1(ψ)).\alpha_{n}(\psi)=\alpha_{n}(\psi-\pi_{k}(\psi))+\sum_{\ell=k_{0}+1}^{k}\alpha_{n}(\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi))\enspace. (31)

We bound both terms separately. For any function gg, |αn(g)|2g|\alpha_{n}(g)|\leqslant 2\|g\|_{\infty}, so supψΨ|αn(ψπk(ψ))|21k\sup_{\psi\in\Psi}|\alpha_{n}(\psi-\pi_{k}(\psi))|\leqslant 2^{1-k}. On the other hand, for any ψ\psi and \ell, by (29),

\forall s\in\mathbb{R},\qquad{\rm E}\bigg[\exp\big(s\alpha_{n}(\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi))\big)\bigg]\leqslant\exp\bigg(\frac{s^{2}(\delta^{2}\wedge\|\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi)\|_{\infty}^{2})}{n}\bigg)\enspace.

For any >k0\ell>k_{0},

π(ψ)π1(ψ)π(ψ)ψ+ψπ1(ψ)22,\|\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi)\|_{\infty}\leqslant\|\pi_{\ell}(\psi)-\psi\|_{\infty}+\|\psi-\pi_{\ell-1}(\psi)\|_{\infty}\leqslant 2^{2-\ell}\enspace,

we deduce that

s,E[exp(sαn(π(ψ)π1(ψ)))]exp(16s2n22).\forall s\in\mathbb{R},\qquad{\rm E}\bigg{[}\exp\big{(}s\alpha_{n}(\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi))\big{)}\bigg{]}\leqslant\exp\bigg{(}\frac{16s^{2}}{n2^{2\ell}}\bigg{)}\enspace. (32)

Now we use the following classical result.

Lemma .2.

If Z1,,ZmZ_{1},\ldots,Z_{m} are random variables satisfying

s,E[exp(sZi)]exp(s2κ2),\forall s\in\mathbb{R},\qquad{\rm E}[\exp(sZ_{i})]\leqslant\exp(s^{2}\kappa^{2})\enspace, (33)

then

E[maxiZi]2κlogm.{\rm E}[\max_{i}Z_{i}]\leqslant 2\kappa\sqrt{\log m}\enspace.
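For completeness, here is a short sketch of the standard argument behind Lemma .2: by Jensen's inequality and (33), for any s>0s>0,

{\rm E}[\max_{i}Z_{i}]\leqslant\frac{1}{s}\log{\rm E}\Big[\exp\Big(s\max_{i}Z_{i}\Big)\Big]\leqslant\frac{1}{s}\log\Big(\sum_{i=1}^{m}{\rm E}[\exp(sZ_{i})]\Big)\leqslant\frac{\log m}{s}+s\kappa^{2}\enspace,

and choosing s=logm/κs=\sqrt{\log m}/\kappa yields the claimed bound 2κlogm2\kappa\sqrt{\log m}.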

We apply Lemma .2 to the random variables Zψ,=αn(π(ψ)π1(ψ))Z_{\psi,\ell}=\alpha_{n}(\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi)). There are fewer than |Ψ||Ψ1||\Psi_{\ell}||\Psi_{\ell-1}| such variables, and all of them satisfy (33) with κ=4/(n2)\kappa=4/(\sqrt{n}2^{\ell}) thanks to (32). Therefore, Lemma .2 and |Ψ1||Ψ||\Psi_{\ell-1}|\leqslant|\Psi_{\ell}| imply that

{\rm E}\bigg[\sup_{\psi\in\Psi}\alpha_{n}(\pi_{\ell}(\psi)-\pi_{\ell-1}(\psi))\bigg]\leqslant\frac{8}{\sqrt{n}2^{\ell}}\sqrt{\log\big(|\Psi_{\ell}||\Psi_{\ell-1}|\big)}\leqslant\frac{8}{\sqrt{n}2^{\ell}}\sqrt{2\log\big(|\Psi_{\ell}|\big)}\enspace.

We conclude that, for any k>k0k>k_{0}, it holds that

E[W(λ)(μ,μ^n)]21k+82n=k0+1klog(|Ψ|)2.{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})]\leqslant 2^{1-k}+\frac{8\sqrt{2}}{\sqrt{n}}\sum_{\ell=k_{0}+1}^{k}\frac{\sqrt{\log\big{(}|\Psi_{\ell}|\big{)}}}{2^{\ell}}\enspace. (34)

To exploit this bound, we now have to bound from above the entropy numbers |Ψ||\Psi_{\ell}| of Ψ\Psi.

Step 2: Bounding the entropy of Ψ\Psi. Let ϵ(0,Kλ)\epsilon\in(0,K\wedge\lambda) and let NN denote an ϵ\epsilon-net of B(0,K)B(0,K) with size |N|(3K/ϵ)d|N|\leqslant(3K/\epsilon)^{d} (for a proof that such nets exist, we refer, for example, to [56, Corollary 4.2.13]): for any xB(0,K)x\in B(0,K), there exists yNy\in N such that xyϵ\|x-y\|\leqslant\epsilon. Let also GG denote an ϵ\epsilon-grid of [λ,λ][-\lambda,\lambda] with step size ϵ\epsilon and |G|3λ/ϵ|G|\leqslant 3\lambda/\epsilon. We define the set \mathcal{F} of all functions f:B(0,K)[λ,λ]f:B(0,K)\to[-\lambda,\lambda] such that

x,yN,f(x)G,|f(x)f(y)|xy,\forall x,y\in N,\qquad f(x)\in G,\qquad|f(x)-f(y)|\leqslant\|x-y\|\enspace,

and ff linearly interpolates between points in NN. Starting from f(0)f(0) and moving recursively to its neighbors, we see that

||3λϵ3|N|,|\mathcal{F}|\leqslant\frac{3\lambda}{\epsilon}3^{|N|}\enspace,

where the first term counts the number of choices for f(0)f(0) and 33 is an upper bound on the number of choices for f(x)f(x) given the value of its neighbor f(x)f(x^{\prime}) due to the constraint |f(x)f(y)|xy|f(x)-f(y)|\leqslant\|x-y\|. This shows that there exists a numerical constant CC such that

||3λϵexp((CKϵ)d).|\mathcal{F}|\leqslant\frac{3\lambda}{\epsilon}\exp\bigg{(}\bigg{(}\frac{CK}{\epsilon}\bigg{)}^{d}\bigg{)}\enspace.

By construction, Ψ\mathcal{F}\subset\Psi. Fix now ψΨ\psi\in\Psi. By construction of \mathcal{F}, there exists ff\in\mathcal{F} such that |f(x)ψ(x)|ϵ|f(x)-\psi(x)|\leqslant\epsilon for any xNx\in N. Moreover, for any yB(0,K)y\in B(0,K), there exists xNx\in N such that xyϵ\|x-y\|\leqslant\epsilon. As both ψ\psi and ff are 11-Lipschitz, we have therefore that

|f(y)ψ(y)||f(y)f(x)|+|f(x)ψ(x)|+|ψ(x)ψ(y)|3ϵ.|f(y)-\psi(y)|\leqslant|f(y)-f(x)|+|f(x)-\psi(x)|+|\psi(x)-\psi(y)|\leqslant 3\epsilon\enspace.

We have thus established that there exists a numerical constant CC such that

ϵKλ,N(Ψ,d,ϵ)9λϵexp((CKϵ)d).\forall\epsilon\leqslant K\wedge\lambda,\qquad N(\Psi,d_{\infty},\epsilon)\leqslant\frac{9\lambda}{\epsilon}\exp\bigg{(}\bigg{(}\frac{CK}{\epsilon}\bigg{)}^{d}\bigg{)}\enspace. (35)

Step 3: Conclusion of the proof. Plugging the estimate (35) into the chaining bound (34) yields, for any k>k0k>k_{0},

{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})] \leqslant 2^{1-k}+\frac{8\sqrt{2}}{\sqrt{n}}\sum_{\ell=k_{0}+1}^{k}\frac{\sqrt{\log\big(9\cdot 2^{\ell}\lambda\big)}+\big(C2^{\ell}K\big)^{d/2}}{2^{\ell}}
\leqslant 2^{1-k}+\frac{C}{\sqrt{n}}\bigg(\delta+(CK)^{d/2}\sum_{\ell=k_{0}+1}^{k}2^{\ell(d-2)/2}\bigg)\enspace.

Here, the behavior of the last sum is different when d=1d=1 and when d2d\geqslant 2.

When d=1d=1, the sum is convergent, so we can choose k=+k=+\infty and the bound becomes

E[W(λ)(μ,μ^n)]\displaystyle{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})] Cn(δ+(CK)1/22k0/2)\displaystyle\leqslant\frac{C}{\sqrt{n}}\big{(}\delta+(CK)^{1/2}2^{-k_{0}/2}\big{)}
CKδn.\displaystyle\leqslant C\sqrt{\frac{K\delta}{n}}\enspace.

When d2d\geqslant 2, the sum is no longer convergent. For d=2d=2, we have

E[W(λ)(μ,μ^n)]\displaystyle{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})] 21k+Cn(δ+Kk).\displaystyle\leqslant 2^{1-k}+\frac{C}{\sqrt{n}}\big{(}\delta+Kk\big{)}\enspace.

Taking kk such that 2kK/n2^{-k}\asymp K/\sqrt{n} yields, as δ/nK/n\delta/\sqrt{n}\leqslant K/\sqrt{n},

E[W(λ)(μ,μ^n)]\displaystyle{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})] CKnlog(nK).\displaystyle\leqslant\frac{CK}{\sqrt{n}}\log\bigg{(}\frac{\sqrt{n}}{K}\bigg{)}\enspace. (36)

Finally, for any d3d\geqslant 3, we have

E[W(λ)(μ,μ^n)]21k+Cn(δ+(CK)d/22k(d/21)).{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})]\leqslant 2^{1-k}+\frac{C}{\sqrt{n}}\bigg{(}\delta+(CK)^{d/2}2^{k(d/2-1)}\bigg{)}\enspace.

Optimizing over kk yields the choice 2k=n1/d/(CK)2^{k}=n^{1/d}/(CK) and thus, as δ/nK/n1/d\delta/\sqrt{n}\leqslant K/n^{1/d},

E[W(λ)(μ,μ^n)]CKn1/d.{\rm E}[W^{(\lambda)}(\mu,\hat{\mu}_{n})]\leqslant\frac{CK}{n^{1/d}}\enspace. (37) ∎

Proof of Theorem 4.4. We first prove the following two lemmas.

Lemma .3.

Let (θn)n1(\theta_{n})_{n\geq 1} be a sequence in Θ\Theta and θΘ\theta\in\Theta. Suppose that ρΘ(θn,θ)\rho_{\Theta}(\theta_{n},\theta) \rightarrow 0 implies that μθn\mu_{\theta_{n}} converges weakly to μθ\mu_{\theta}. Then the map (θ,μ)W(λ)(μθ,μ)(\theta,\mu)\mapsto W^{(\lambda)}(\mu_{\theta},\mu) is continuous, where (θ,μ)Θ×𝒫(𝒳)(\theta,\mu)\in\Theta\times\mathcal{P(X)}.

Proof.

The result follows directly from Corollary 3.6. ∎

Lemma .4.

The function (ν,μ(m))EW(λ)(ν,μ^m)\left(\nu,\mu^{(m)}\right)\mapsto{\rm E}W^{(\lambda)}\left(\nu,\hat{\mu}_{m}\right) is continuous with respect to weak convergence. Furthermore, if ρΘ(θn,θ)0\rho_{\Theta}\left(\theta_{n},\theta\right)\rightarrow 0 implies that μθn(m)\mu_{\theta_{n}}^{(m)} converges weakly to μθ(m)\mu_{\theta}^{(m)}, then the map (ν,θ)(\nu,\theta)\mapsto EW(λ)(ν,μ^θ,m){\rm E}W^{(\lambda)}\left(\nu,\hat{\mu}_{\theta,m}\right) is continuous.

Proof.

The proof moves along the same lines as the proof of Lemma A2 in [8]. It is worth noting that 0W(λ)2λ0\leq W^{(\lambda)}\leq 2\lambda, so we can use the dominated convergence theorem instead of Fatou's lemma. Because of the continuity of the robust Wasserstein distance, we have

EW(λ)(ν,μ^m)=ElimkW(λ)(νk,μ^k,m)=limkEW(λ)(νk,μ^k,m).{\rm E}W^{(\lambda)}\left(\nu,\hat{\mu}_{m}\right)={\rm E}\lim_{k\rightarrow\infty}W^{(\lambda)}\left(\nu_{k},\hat{\mu}_{k,m}\right)=\lim_{k\rightarrow\infty}{\rm E}W^{(\lambda)}\left(\nu_{k},\hat{\mu}_{k,m}\right).

This implies (ν,θ)EW(λ)(ν,μ^θ,m)\left(\nu,\theta\right)\mapsto{\rm E}W^{(\lambda)}\left(\nu,\hat{\mu}_{\theta,m}\right) is continuous. ∎

In order to prove Theorem 4.4, we introduce the concept of epi-convergence as follows.

Definition .5.

We say that a sequence of functions fn:Θf_{n}:\Theta\rightarrow\mathbb{R} epi-converges to f:Θf:\Theta\rightarrow\mathbb{R} if, for all θΘ\theta\in\Theta,

{lim infnfn(θn)f(θ) for every sequence θnθlim supnfn(θn)f(θ) for some sequence θnθ\begin{cases}\liminf_{n\rightarrow\infty}f_{n}\left(\theta_{n}\right)\geq f(\theta)&\text{ for every sequence }\theta_{n}\rightarrow\theta\\ \limsup_{n\rightarrow\infty}f_{n}\left(\theta_{n}\right)\leq f(\theta)&\text{ for some sequence }\theta_{n}\rightarrow\theta\end{cases}

For any ν𝒫(𝒳)\nu\in\mathcal{P}(\mathcal{X}), the continuity of the map θ\theta\mapsto W(λ)(ν,μθ)W^{(\lambda)}\left(\nu,\mu_{\theta}\right) follows from Lemma .3, via Assumption 4.2. Next, by definition of the infimum, the set B(ε)B_{\star}(\varepsilon) with the ε\varepsilon of Assumption 4.3 is non-empty. Moreover, since θW(λ)(μ,μθ)\theta\mapsto W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right) is continuous, the set argminθΘW(λ)(μ,μθ)\operatorname{argmin}_{\theta\in\Theta}W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right) is non-empty and the set B(ε)B_{\star}(\varepsilon) is closed. Then, by Assumption 4.3, B(ε)B_{\star}(\varepsilon) is a compact set.

The next step is to prove that the sequence of functions θW(λ)(μ^n(ω),μθ)\theta\mapsto W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right) epi-converges to θW(λ)(μ,μθ)\theta\mapsto W^{(\lambda)}\left(\mu_{\star},\mu_{\theta}\right), by applying Proposition 7.29 of [48]. Then we obtain the results by Proposition 7.29 and Theorem 7.31 of [48]. These steps are similar to [8] and are hence omitted. ∎

Moreover, we state

Theorem .6 (Measurability of MRWE).

Suppose that Θ\Theta is a σ\sigma-compact Borel measurable subset of dθ\mathbb{R}^{d_{\theta}}. Then under Assumption 4.2, for any n1n\geq 1 and ε>0\varepsilon>0, there exists a Borel measurable function θ^nλ:ΩΘ\hat{\theta}_{n}^{\lambda}:\Omega\rightarrow\Theta that satisfies θ^nλ(ω)argminθΘW(λ)(μ^n(ω),μθ),\hat{\theta}_{n}^{\lambda}(\omega)\in\operatorname{argmin}_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right), if this set is non-empty, otherwise, θ^nλ(ω)ε-argminθΘW(λ)(μ^n(ω),μθ)\hat{\theta}_{n}^{\lambda}(\omega)\in\varepsilon\text{-}\operatorname{argmin}_{\theta\in\Theta}W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right).

Th. .6 implies that, for any n1n\geq 1, there is a measurable function θ^nλ\hat{\theta}_{n}^{\lambda} that coincides with (or is very close to) the minimizer of W(λ)(μ^n(ω),μθ)W^{(\lambda)}\left(\hat{\mu}_{n}(\omega),\mu_{\theta}\right).

Proofs of Theorems 4.7, 4.8, 4.10 and .6. The proofs of these theorems follow by moving along the same lines as the proofs of Theorems 2.4, 2.5, 2.6 and 2.2 in [8]. ∎

.3 The choice of λ\lambda

.3.1 Some considerations based on RWGAN

Here we focus on the applications of RWGAN and MERWE and illustrate, via Monte Carlo experiments, how λ\lambda affects the results. First, we recall the experiment in § 5.2: MERWE of location for a sum of log-normals. We keep all other conditions unchanged and vary only the parameter λ\lambda to see how the estimates change. We set n=200n=200, ε=0.1\varepsilon=0.1 and η=4\eta=4, and the other parameters are set as in § 5.2. The result is shown in Figure 7. We observe that the MSE is large when λ\lambda is small. The reason is that a small λ\lambda truncates the cost matrix too aggressively, making the robust Wasserstein distance unable to distinguish the differences between distributions. When λ\lambda is large, there is a negligible difference between the Wasserstein distance and the robust Wasserstein distance, resulting in almost the same MSE for MERWE and MEWE. When λ\lambda is moderate, within a large range (e.g., from exp(2)\exp(2) to exp(5.5)\exp(5.5)), the MERWE outperforms the MEWE.

Refer to caption
Figure 7: log(MSE)\log(\rm MSE) of MERWE based on 1000 replications for different values of log(λ)\log(\lambda). The horizontal (red) line represents the log(MSE)\log(\rm MSE) of MEWE.

We consider a novel way to study parameter selection based on our Theorem 2.1, which implies that modifying the constraint on range(ψ)\mathrm{range}(\psi) is equivalent to changing the parameter λ\lambda in equation (4). This suggests that we can use RWGAN-1 to study how to select λ\lambda.

We consider the same synthetic data as in § .4.2 and set n=1000,ε=0.1n=1000,\varepsilon=0.1 and η=2\eta=2. To quantify the training quality of RWGAN-1, we take the Wasserstein distance of order 1 between the clean data and the data generated by RWGAN-1. Specifically, we draw 1000 samples from model (18) and generate 1000 samples from the model trained by RWGAN-1 for different values of the parameter λ\lambda. Then, we calculate the empirical Wasserstein distance of order 1 between them; a minimal sketch of this computation is given after Figure 8. The plot of the empirical Wasserstein distance against log(λ)\log(\lambda) is shown in Figure 8. The Wasserstein distance first decreases and then increases, reaching its minimum at log(λ)=2\log(\lambda)=-2. Also, we notice that over a large range of λ\lambda values (i.e., from exp(4)\exp(-4) to exp(0)\exp(0)), the Wasserstein distance is small: this illustrates that RWGAN-1 has a good outlier detection ability for a choice of the penalization parameter within this range.

Refer to caption
Figure 8: Plot of the empirical Wasserstein distance of order 1 between the clean data and data generated from RWGAN-1 for different values of log(λ)\log(\lambda).
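For concreteness, the following is a minimal sketch (not the code used in the paper) of how the empirical Wasserstein distance of order 1 between the clean sample and the generated sample can be computed; it assumes the POT (Python Optimal Transport) package and samples stored as (n, d) arrays.

import numpy as np
import ot  # POT (Python Optimal Transport); its use here is an assumption of this sketch

def empirical_w1(x_clean, x_gen):
    # empirical W1 between two samples, each with uniform weights
    a = np.full(len(x_clean), 1.0 / len(x_clean))
    b = np.full(len(x_gen), 1.0 / len(x_gen))
    M = ot.dist(x_clean, x_gen, metric='euclidean')  # pairwise ground costs
    return ot.emd2(a, b, M)                          # optimal transport cost = W1

# usage: compute empirical_w1(clean_samples, rwgan1_samples) for each value of lambda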

Heuristically, a large λ\lambda means that the truncated cost function is only slightly different from the non-truncated one. Therefore, RWGAN-1 will be greatly affected by outliers. When λ\lambda is small, the cost function is severely truncated, which means that we are modifying too much of the distribution. This deviates from the main purpose of the GAN (i.e., PrPθ\mathrm{P_{r}}\approx\mathrm{P_{\theta}}). Hence, the distribution of the final outputs generated by RWGAN-1 becomes very different from the reference one.

Both experiments suggest that, to achieve good performance of ROBOT, one could choose a moderate λ\lambda within a large range. Actually, the range of a “good” λ\lambda depends on the size and proportion of outliers: when the size and proportion are close to zero, both moderate and large λ\lambda work well, since only slight or no truncation of the cost function is needed; when the size and proportion are large, a strong truncation is needed, so this range becomes smaller. Generally, we prefer to choose a slightly larger λ\lambda, as we found in the experiments that, when we use a larger λ\lambda, the performance of ROBOT is at least no worse than that of OT.

.3.2 Selection based on concentration inequality

Methodology. In the Monte Carlo setting of Section 5.2, Figure 7 in Appendix .3.1 illustrates that MERWE has good performance (in terms of MSE) when λ\lambda is within the range [exp(2),exp(5)][\exp(2),\exp(5)]. The estimation of the MSE represents, in principle, a valuable criterion that one can apply to select λ\lambda. However, while this idea is perfectly fine in the case of synthetic data, it is hard to apply in real data examples, where the true underlying parameter is not known. To cope with this issue, [58] propose to use a pilot estimator which can yield an estimated MSE. The same argument can also be considered for our estimator, but this entails the need for selecting a pilot estimator: a task that may not be easy for some complex models, including the models considered in this paper. Therefore, we propose and investigate an alternative approach, which yields a range of values of λ\lambda similar to the one obtained by computing the MSE, but does not require such a pilot estimator.

To start with, let us consider Th. 3.9. At the reference model, namely in the absence of contamination (i.e. |O|=0|O|=0 and |I|=n|I|=n), the mean concentration in (9) implies that |Wλ(μ^n,μ)E[Wλ(μ^n,μ)]|\left|W^{\lambda}(\hat{\mu}_{n},\mu^{\ast})-{\rm E}[W^{\lambda}(\hat{\mu}_{n},\mu^{\ast})]\right| takes on values larger than the threshold

σ2tn+4λtn\sigma\sqrt{\frac{2t}{n}}+\frac{4\lambda t}{n} (38)

with probability bounded by 2exp(t)2\exp(-t). In the presence of contamination, from (12) we have that the probability that the deviations of Wλ(μ^n,μ)W^{\lambda}(\hat{\mu}_{n},\mu^{\ast}) from its mean are larger than the threshold

σ|I|n2tn+4λtn+4λ|O|n\sigma\sqrt{\frac{|I|}{n}}\sqrt{\frac{2t}{n}}+\frac{4\lambda t}{n}+\frac{4\lambda|O|}{n} (39)

is still bounded by 2exp(t)2\exp(-t).

Then, to select λ\lambda we propose the following criterion, which compares the changes in (38) and (39) due to different values of λ\lambda. In more detail, for a fixed t>0t>0, we assume a contamination level τ\tau, which represents the statistician's guess about the actual level of data contamination. Then we compare the thresholds at different values of λ\lambda. To this end, we compute the ratio between (38) and (39) and define the quotient

𝒬(n,τ,λ,t):=σ2tn+4λtnσ1τ2tn+4λtn+4τλ.\mathcal{Q}(n,\tau,\lambda,t):=\frac{\sigma\sqrt{\frac{2t}{n}}+\frac{4\lambda t}{n}}{\sigma\sqrt{1-\tau}\sqrt{\frac{2t}{n}}+\frac{4\lambda t}{n}+4\tau\lambda}. (40)

Assuming that σ\sigma is a twice differentiable function of λ\lambda, one can check that 𝒬\mathcal{Q} decreases monotonically in λ\lambda. Indeed, the partial derivative with respect to λ\lambda of 𝒬(n,τ,λ,t)\mathcal{Q}(n,\tau,\lambda,t) is

𝒬λ=Cn,τ,t(λ)(λσ(λ)σ(λ)),\frac{\partial\mathcal{Q}}{\partial\lambda}=C_{n,\tau,t}(\lambda)\left(\lambda\sigma^{\prime}(\lambda)-\sigma(\lambda)\right), (41)

where

Cn,τ,t(λ)=42tσ(λ)((n+t)τtσ(λ)2nτ(ttσ(λ)2n+nτtσ(λ)2n))tσ(λ)2nτtσ(λ)2n(4n(1τ)λ+4tλ+2nτtσ(λ)2n)2C_{n,\tau,t}(\lambda)=\frac{4\sqrt{2}t\sigma(\lambda)\left((n+t)\sqrt{\frac{\tau t\sigma(\lambda)^{2}}{n}}-\tau\left(t\sqrt{\frac{t\sigma(\lambda)^{2}}{n}}+n\sqrt{\frac{\tau t\sigma(\lambda)^{2}}{n}}\right)\right)}{\sqrt{\frac{t\sigma(\lambda)^{2}}{n}}\sqrt{\frac{\tau t\sigma(\lambda)^{2}}{n}}\left(4n(1-\tau)\lambda+4t\lambda+\sqrt{2}n\sqrt{\frac{\tau t\sigma(\lambda)^{2}}{n}}\right)^{2}}

is positive when λ>0\lambda>0. Moreover, let us denote by σ\sigma^{\prime} and σ′′\sigma^{\prime\prime} the first and second derivatives of σ\sigma with respect to λ\lambda. The monotonicity of 𝒬\mathcal{Q} thus depends on the sign of γ(λ):=λσ(λ)σ(λ)\gamma(\lambda):=\lambda\sigma^{\prime}(\lambda)-\sigma(\lambda). Now, notice that γ=0\gamma=0 when λ=0\lambda=0 and that γ=λσ′′(λ)\gamma^{\prime}=\lambda\sigma^{\prime\prime}(\lambda). Moreover, from the definition of σ\sigma, σ\sigma is increasing in λ\lambda and σ′′<0\sigma^{\prime\prime}<0 when λ>0\lambda>0. So γ<0\gamma^{\prime}<0 when λ>0\lambda>0 and hence γ<0\gamma<0 when λ>0\lambda>0. This means that 𝒬\mathcal{Q} decreases monotonically.

Next we explain how 𝒬/λ{\partial\mathcal{Q}}/{\partial\lambda} can be applied to define a data-driven criterion for the selection of λ\lambda. We remark that, for most models, the computation of 𝒬\mathcal{Q} and of 𝒬/λ{\partial\mathcal{Q}}/{\partial\lambda} needs to be performed numerically. To illustrate this aspect and to elaborate further on the selection of λ\lambda, we consider the same Monte Carlo setting of Section 5.2, where ε=|O|/n\varepsilon={|O|}/{n} is the actual contamination level. In Appendix .3.1, we found through numerical experiments that MERWE has good performance for λ\lambda within the range [exp(2),exp(5)][\exp(2),\exp(5)]. Hence, in our numerical analysis, we consider this range as a benchmark for evaluating the λ\lambda selection procedure based on 𝒬(n,τ,λ,t)\mathcal{Q}(n,\tau,\lambda,t). Moreover, we need to specify a value for the contamination level and fix tt. In our numerical analysis we set τ=0.1\tau=0.1 and t=1t=1. We remark that the specified τ\tau coincides with the percentage of outliers ε\varepsilon: later on we discuss the case where τε\tau\neq\varepsilon. Moreover, we point out that different values of tt simply imply that the proposed selection criterion concerns the stability of other quantiles of the distribution of Wλ(μ^n,μ)W^{\lambda}(\hat{\mu}_{n},\mu^{\ast}).

In Figure 9, we display the plot of 𝒬\mathcal{Q} with respect to λ\lambda: 𝒬\mathcal{Q} decreases monotonically, with a slope that decreases as λ\lambda\to\infty. The plot suggests that, when λ\lambda is small, 𝒬\mathcal{Q} decreases fast for small increments in the value of λ\lambda. This is due to the fact that, for small λ\lambda, a small change of λ\lambda has a large influence on the transport cost c(λ)c^{(\lambda)} and hence on the values of W(λ)W^{(\lambda)}. On the contrary, when λ\lambda is sufficiently large, larger values of the trimming parameter do not have a large influence on the transport cost c(λ)c^{(\lambda)}. As a result, looking at the (estimated) slope of the (estimated) 𝒬\mathcal{Q} allows one to identify a range of λ\lambda values which avoids both small values (for which W(λ)W^{(\lambda)} trims too many observations and is maximally concentrated about its mean) and large values (for which W(λ)W^{(\lambda)} trims too few observations and is minimally concentrated about its mean). This is the same information available in Figure 7 but, unlike the computation which yields the MSE in that figure, our method requires neither knowledge of the true value of the model parameter nor a pilot estimator.

Implementation. Operationally, to obtain the desired range of λ\lambda values, one needs to create a sequence of {λk}\{\lambda_{k}\} with

0<λ1<λ2<<λN,N>1.0<\lambda_{1}<\lambda_{2}<\ldots<\lambda_{N},\,N>1.

Then, one computes the sequence of absolute slopes

AS(n,τ,λk,t)=|𝒬(n,τ,λk+1,t)𝒬(n,τ,λk,t)|λk+1λk,k1,\mathrm{AS}{(n,\tau,\lambda_{k},t)}=\frac{|{\mathcal{Q}}{(n,\tau,\lambda_{k+1},t)}-{\mathcal{Q}}{(n,\tau,\lambda_{k},t)}|}{\lambda_{k+1}-\lambda_{k}},k\geq 1,

at different values of λk\lambda_{k}. We remark that, to compute AS\mathrm{AS}, we need to estimate σ2\sigma^{2}. To this end, a robust estimator of the variance should be applied. In our numerical exercises, we find it numerically convenient to fix XiX_{i}^{\prime} at X¯rob\bar{X}_{\mathrm{rob}}, the geometric median of the real-valued variable XX. Then, we replace

E[min(d2(Xi,Xi),(2λ)2)]\rm{E}\left[\min\left(d^{2}\left(X_{i},X_{i}^{\prime}\right),(2\lambda)^{2}\right)\right]

by

(1/n)i=1nmin{d2(Xi,X¯rob),(2λ)2}.({1}/{n})\sum_{i=1}^{n}\min\left\{d^{2}\left(X_{i},\bar{X}_{\mathrm{rob}}\right),(2\lambda)^{2}\right\}.

Once the different values of AS\mathrm{AS} are available, one needs to make use of them to select λ\lambda. To this end, we propose to follow the idea of Algorithm 1 in [36] and choose

λ:=min{λk:AS(n,τ,λi,t)ι,for all λiλk},\lambda^{\star}:=\min\left\{\lambda_{k}:{\mathrm{AS}}{(n,\tau,\lambda_{i},t)}\leq\iota,\text{for all }\lambda_{i}\geq\lambda_{k}\right\},

where ι\iota is a user-specified value. According to our numerical experience, it is sensible to specify values of ι\iota which are close to zero: they typically define estimators whose performance remains stable across different Monte Carlo settings, while avoiding trimming too many observations. For the sake of illustration, in Figure 10 we display the estimated AS\mathrm{AS} related to the 𝒬\mathcal{Q} of Figure 9: we see that its maximum value is around 0.05. We propose to set ι=0.001\iota=0.001, a value which is slightly larger than zero (namely, the value that AS\mathrm{AS} takes when λ\lambda is large). This yields λ26\lambda^{\star}\approx 26 (lnλ3.25\ln\lambda^{\star}\approx 3.25), which is within the range [exp(2),exp(5)][\exp(2),\exp(5)] where MERWE produces a small MSE, as displayed in Figure 7. A reassuring aspect emerges from Figure 10: in case one decides to set slightly larger values of ι\iota (e.g., values between 0.0010.001 and 0.0080.008), the selection procedure is still able to provide a value of λ\lambda^{\star} within the benchmark range [exp(2),exp(5)][\exp(2),\exp(5)].

Essentially, we conclude that any value of ι\iota which is greater than zero and smaller than the maximum of AS\mathrm{AS} selects a λ\lambda^{\star} which in turn yields a good performance (in terms of MSE) of our estimator. A minimal numerical sketch of the whole selection procedure is given after Figure 10.

Refer to caption
Figure 9: The curve (black) of 𝒬\mathcal{Q} with respect to λ\lambda. The coordinates of the vertical dashed lines (red) are exp(2)\exp(2) and exp(5)\exp(5), respectively.
Refer to caption
Figure 10: Plot of the AS\mathrm{AS} curve (black) as a function of λ\lambda. The coordinates of the vertical dashed lines (red) are exp(2)\exp(2) and exp(5)\exp(5), respectively, and the coordinate of horizontal dashed line (blue) is 0.0010.001.
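For the sake of illustration, the following is a minimal numerical sketch (not the authors' code) of the selection procedure, under the assumptions stated above: scalar data, the geometric median reduced to the ordinary median, and σ estimated via the empirical replacement displayed above; the grid of λ values and ι are user-specified.

import numpy as np

def sigma_hat(x, lam):
    # sqrt of (1/n) sum_i min{ d^2(X_i, X_rob), (2 lambda)^2 }, X_rob = median for scalar data
    x_rob = np.median(x)
    return np.sqrt(np.mean(np.minimum((x - x_rob) ** 2, (2.0 * lam) ** 2)))

def Q(n, tau, lam, t, sig):
    # the quotient (40), with sigma replaced by its robust estimate
    num = sig * np.sqrt(2.0 * t / n) + 4.0 * lam * t / n
    den = sig * np.sqrt(1.0 - tau) * np.sqrt(2.0 * t / n) + 4.0 * lam * t / n + 4.0 * tau * lam
    return num / den

def select_lambda(x, lambdas, tau=0.1, t=1.0, iota=0.001):
    n = len(x)
    q = np.array([Q(n, tau, lam, t, sigma_hat(x, lam)) for lam in lambdas])
    AS = np.abs(np.diff(q)) / np.diff(lambdas)   # absolute slopes between consecutive lambda_k
    for k in range(len(AS)):
        if np.all(AS[k:] <= iota):               # AS stays below iota from lambda_k onwards
            return lambdas[k]
    return lambdas[-1]

# usage (hypothetical grid): lam_star = select_lambda(x, np.exp(np.linspace(0.0, 6.0, 60)))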

Remark. We mention three interesting methodological aspects related to this selection procedure. First, the proposed criterion is based on the computation of AS\mathrm{AS} for different λk\lambda_{k} and does not entail the need for computing the corresponding MDE θ^nλ\hat{\theta}^{\lambda}_{n}. This aspect makes our criterion easy to implement and fast. Second, one may wonder what the distribution of λ\lambda^{\ast} is when many samples from the same OIO\cup I setting are available. Put another way, given many samples and selecting λ\lambda in each sample by the aforementioned procedure, how does λ\lambda^{\ast} change? Third, an interesting question related to the proposed criterion is: what happens to λ\lambda^{\ast} if the statistician specifies a contamination level τ\tau that is different from the true contamination level ε\varepsilon? As far as the last two questions are concerned, we remark that Figure 10 is based on one sample and cannot answer them. Thus, to gain deeper understanding, we keep the same setting as in the Monte Carlo experiment (which yielded Figure 10) and we simulate 1000 samples, specifying τ=0.05,0.1,0.2\tau=0.05,0.1,0.2 and selecting λ\lambda^{\star} by the aforementioned procedure. In Figure 11 we display the kernel density estimates of the selected values.

The plots illustrate that, for each τ\tau value, the kernel density estimate is rather concentrated: for instance, if τ=0.1\tau=0.1 the support of the estimated density goes from about 3 to about 3.4. Moreover, looking at the estimated densities when τ=0.05\tau=0.05 (underestimation of the number of outliers) and τ=0.2\tau=0.2 (overestimation of the number of outliers), we see that, even if τε\tau\neq\varepsilon, the selection criterion still yields values of λ\lambda which are within the benchmark range [exp(2),exp(5)][\exp(2),\exp(5)]. Similar considerations hold for other user-specified values of tt in the thresholds (38) and (39).

Refer to caption
Figure 11: Dashed line (red), solid line (black), dotted line (green) represent the Gaussian kernel density functions (bandwidth = 0.01670.0167) of log(λ)\log(\lambda^{\star}) for τ=0.05,0.1,0.2\tau=0.05,0.1,0.2, respectively.

.4 Additional numerical exercises

This section, as a complement to Section 5, provides more details and additional numerical results for a few applications of ROBOT in statistical estimation and machine learning tasks.

.4.1 ROBOT-based estimation via outlier detection (pre-processing)

Methodology. Many routinely-applied methods of model estimation can be significantly affected by outliers. Here we show how ROBOT can be applied for robust model estimation.

Recall that formulation (3) is devised in order to detect and remove outliers. One may therefore exploit this feature and estimate the model parameter in a robust way. Below, we explain the outlier detection procedure in detail.

Before delving into the details of the procedure, we give a cautionary remark. Although theoretically possible and numerically effective, the use of ROBOT for pre-processing does not give the same statistical guarantees for the parameter estimates as those yielded by MRWE and MERWE (see §4).

In this regard, we mention two key inference issues. (i) There is a risk of a masking effect: one outlier may hide another, so if one is removed another may appear. As a result, it becomes unclear, and rather subjective, when to stop the pre-processing procedure. (ii) Inference obtained via M-estimation on the cleaned sample is conditional on the outlier detection: the distribution of the resulting estimates is unknown. For further discussion see [42, Section 4.3].

Because of these aspects, we prefer the use of MERWE, which does not need any (subjective) pre-processing and is defined by an automatic procedure which is, by design, resistant to outliers and whose statistical properties can be studied.

Assume we want to estimate the data distribution through some parametric model gθg_{\theta} indexed by a parameter θ\theta. For the sake of exposition, let us think of a regression model, where θ\theta characterizes the conditional mean, to which we add an innovation term.

Now, suppose the innovations are i.i.d. copies of a random variable having a normal distribution. The following ROBOT-based procedure can be applied to eliminate the effect of outliers: (i) estimate θ\theta based on the observations and compute the variance σ~2\tilde{\sigma}^{2} of the residual term; (ii) solve the ROBOT problem between the residual term and a normal distribution whose variance is σ~2\tilde{\sigma}^{2} to detect the outliers; (iii) remove the detected outliers and update the estimate of θ\theta; (iv) calculate the variance σ~2\tilde{\sigma}^{2} of the updated residuals; (v) repeat (ii)–(iv) until θ\theta converges or the maximum number of iterations is reached. More details are given in Algorithm 2, and a minimal code sketch is provided after it. For convenience, we still denote the output of Algorithm 2 as gθg_{\theta}.

Data: observations {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n}, robust regularization parameter λ\lambda, number of iterations MM, number of generated samples nn, parametric model gθg_{\theta} (indexed by a parameter θ\theta)
Result: Model gθg_{\theta}
1 estimate θ\theta based on the observations and compute the residuals and their sample standard deviation σ~\tilde{\sigma};
2 while M\ell\leq M do
3       generate samples 𝐙~(n)={Z~j}j=1n\tilde{\mathbf{Z}}^{(n)}=\{\tilde{Z}_{j}\}_{j=1}^{n} from 𝒩(0,σ~2)\mathcal{N}(0,\tilde{\sigma}^{2}), and let 𝐙^(n)={Zi^;1in}\hat{\mathbf{Z}}^{(n)}=\left\{\hat{Z_{i}};1\leq i\leq n\right\}, where Zi^=Yigθ(Xi)\hat{Z_{i}}=Y_{i}-g_{\theta}(X_{i});
4       calculate the cost matrix 𝐂:=(Cij)1i,jn\mathbf{C}:=(C_{ij})_{1\leq i,j\leq n} between 𝐙^(n)\hat{\mathbf{Z}}^{(n)} and 𝐙~(n)\tilde{\mathbf{Z}}^{(n)};
5       collect all the indices ={(i,j):Cij2λ}\mathcal{I}=\{(i,j):C_{ij}\geq 2\lambda\}, and let Cij(λ)=2λC^{(\lambda)}_{ij}=2\lambda for (i,j)(i,j)\in\mathcal{I} and Cij(λ)=CijC^{(\lambda)}_{ij}=C_{ij} otherwise;
6       calculate the transport matrix 𝚷(n,n)\mathbf{\Pi}^{(n,n)} between 𝐙^(n)\hat{\mathbf{Z}}^{(n)} and 𝐙~(n)\tilde{\mathbf{Z}}^{(n)} based on the modified cost matrix 𝐂(λ):=(Cij(λ))1i,jn\mathbf{C}^{(\lambda)}:=(C^{(\lambda)}_{ij})_{1\leq i,j\leq n};
7       set s(i)=j=1n𝚷(n,n)(i,j)𝟙(i,j)s(i)=-\sum_{j=1}^{n}\mathbf{\Pi}^{(n,n)}(i,j)\mathbbm{1}_{(i,j)\in\mathcal{I}};
8       find \mathcal{H}, the set of all the indices where s(i)+1/n=0s(i)+1/n=0;
9       use {(Xi,Yi)}i\{(X_{i},Y_{i})\}_{i\notin\mathcal{H}} to update θ\theta and the residuals;
10       update the sample standard deviation σ~\tilde{\sigma} of the residuals;
11       +1\ell\leftarrow\ell+1;
12      
13 end while
Algorithm 2 ROBOT ESTIMATION VIA PRE-PROCESSING
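A minimal sketch of Algorithm 2 for a linear regression model is given below (this is not the authors' implementation: it assumes a squared-distance cost, scalar residuals, and exploits the fact that, for uniform weights and equal sample sizes, an optimal transport plan can be taken to be a permutation, computed here with scipy's linear_sum_assignment).

import numpy as np
from scipy.optimize import linear_sum_assignment

def robot_ols(X, Y, lam=1.0, n_iter=10, rng=np.random.default_rng(0)):
    n = len(Y)
    keep = np.arange(n)                       # indices currently treated as clean
    for _ in range(n_iter):
        # lines 1 and 9 of Algorithm 2: (re)estimate theta = (alpha, beta) by OLS on kept points
        A = np.column_stack([X[keep], np.ones(len(keep))])
        alpha, beta = np.linalg.lstsq(A, Y[keep], rcond=None)[0]
        resid = Y - (alpha * X + beta)        # residuals for all observations
        sigma = resid[keep].std()
        # line 3: reference sample from N(0, sigma^2)
        Z_tilde = rng.normal(0.0, sigma, size=n)
        # lines 4-5: cost matrix (squared distance, an assumption) and truncation at 2*lambda
        C = (resid[:, None] - Z_tilde[None, :]) ** 2
        trunc = C >= 2.0 * lam
        C_lam = np.where(trunc, 2.0 * lam, C)
        # line 6: optimal plan for the truncated cost (a permutation for uniform weights)
        rows, cols = linear_sum_assignment(C_lam)
        # lines 7-8: an observation is flagged when all of its mass uses truncated entries
        outliers = np.zeros(n, dtype=bool)
        outliers[rows] = trunc[rows, cols]
        keep = np.where(~outliers)[0]
    return alpha, beta, np.where(outliers)[0]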

To illustrate the ROBOT-based estimation procedure, consider the following linear regression model

XU(0,10),Y=αX+β+Z,Z𝒩(0,σ2).\begin{array}[]{cc}&X\sim\mathrm{U}(0,10),\\ &Y=\alpha X+\beta+Z,\ Z\sim\mathcal{N}(0,\sigma^{2}).\end{array} (42)

The observations are contaminated by additive outliers and are generated from the model

Xi(n)U(0,10),Yi(n)=αXi(n)+β+Zi(n)+𝟙(i=h)ϵi(n),Zi(n)𝒩(0,σ2) and ϵi(n)𝒩(η+Xi(n),1),\begin{array}[]{cc}&X_{i}^{(n)}\sim\mathrm{U}(0,10),\\ &Y_{i}^{(n)}=\alpha X_{i}^{(n)}+\beta+Z_{i}^{(n)}+\mathbbm{1}_{(i=h)}\epsilon_{i}^{(n)},\\ &Z_{i}^{(n)}\sim\mathcal{N}(0,\sigma^{2})\text{\> and \>}\epsilon_{i}^{(n)}\sim\mathcal{N}(\eta+X_{i}^{(n)},1),\end{array}

with i=1,2,,ni=1,2,\ldots,n, where hh is the location of the additive outliers, η\eta is a parameter controlling the size of the contamination, and 𝟙(.)\mathbbm{1}_{(.)} denotes the indicator function. We consider the case where outliers are added with probability ε\varepsilon.

In this toy experiment, we set n=1000n=1000, σ=1\sigma=1, α=1\alpha=1 and β=1\beta=1, and we change the value of η\eta and the proportion ε\varepsilon of abnormal points to all points to investigate the impact of the size and proportion of the outliers on different methodologies. Each experiment is repeated 100 times.
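As a usage illustration, a small sketch of the contaminated data-generating mechanism above is the following (parameter values are placeholders; robot_ols refers to the sketch given after Algorithm 2).

import numpy as np

def simulate(n=1000, alpha=1.0, beta=1.0, sigma=1.0, eta=2.0, eps=0.1,
             rng=np.random.default_rng(1)):
    X = rng.uniform(0.0, 10.0, size=n)
    Y = alpha * X + beta + rng.normal(0.0, sigma, size=n)
    is_out = rng.random(n) < eps                        # additive outliers with probability eps
    Y = Y + is_out * rng.normal(eta + X, 1.0, size=n)   # epsilon_i ~ N(eta + X_i, 1)
    return X, Y, is_out

# X, Y, _ = simulate(); alpha_hat, beta_hat, flagged = robot_ols(X, Y, lam=1.0)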

We combine ordinary least squares (OLS) with ROBOT (we set λ=1\lambda=1) through Algorithm 2 to get a robust estimate of α\alpha and β\beta, and we compare this robust approach with the OLS. Boxplots of the slope and intercept estimates based on the OLS and ROBOT, for different values of η\eta and ε\varepsilon, are displayed in Figures 12(a) and 12(b), respectively.

Refer to caption
(a) boxplots of slope (α\alpha) estimates in different situations
Refer to caption
(b) boxplots of intercept (β\beta) estimates in different situations
Figure 12: We show the estimated effect of parameters through boxplots. The red lines represent the true values of the model parameters.

For the estimation of α\alpha, Figure 12(a) shows that the OLS is greatly affected by the outliers: both the bias and variance increase rapidly with the size and proportion of the outliers. The ROBOT, on the contrary, shows great robustness against the outliers, with much smaller bias and variance. For the estimation of β\beta (Figure 12(b)), ROBOT and OLS have similar performance for η=2\eta=2, but ROBOT outperforms OLS when the contamination size is large.

The purpose of our outlier-robust estimation procedure is to keep the trained model close to the uncontaminated model (42), so that it fits better the data generated from (42). We therefore compute the mean squared error (MSE) with respect to the data generated from (42), in order to compare the performance of the OLS and ROBOT.

In the simulations, we set n=1000n=1000, σ=1\sigma=1, α=1\alpha=1, β=1\beta=1 and change the contamination size η\eta. Each experiment is repeated 100100 times. In Figures 13(a) and 13(b), we plot the averaged MSE of the OLS and ROBOT against η\eta for ε=0.1\varepsilon=0.1 and ε=0.2\varepsilon=0.2, respectively. Even a quick inspection of Figures 13(a) and 13(b) demonstrates the better performance of ROBOT compared to the OLS. Specifically, in both figures, the OLS curve always lies above the ROBOT curve, and the latter yields a much smaller MSE than the former for large η\eta. A closer look shows that, when the proportion of outliers is small, the MSE curve of ROBOT is close to 1, which is the variance of the innovation term. The results show that ROBOT has good robustness against outliers of all sizes.

Thanks to its ability to detect outliers, ROBOT thus fits the uncontaminated distribution better than the OLS. Next, we use Figures 14(a) and 14(b) to show that Algorithm 2 enables us to detect outliers accurately and hence makes the estimation procedure more robust. More specifically, many points in Figure 14(a) (after 1 iteration) are classified incorrectly; after 10 iterations (Figure 14(b)), only a few outliers close to the uncontaminated distribution remain undetected.

Refer to caption
(a) MSE (ε=0.1\varepsilon=0.1)
Refer to caption
(b) MSE (ε=0.2\varepsilon=0.2)
Figure 13: We use the MSE computed on uncontaminated data to show how well the estimate fits the true distribution. The left panel shows the MSE for different values of the scale parameter η\eta when the proportion of outliers is ε=0.1\varepsilon=0.1. The red line represents ROBOT-based estimation, while the blue line represents the OLS. The right panel shows the result for ε=0.2\varepsilon=0.2.
Refer to caption
(a) iteration=1
Refer to caption
(b) iteration=10
Figure 14: Outlier detection for ROBOT-based estimation: the left panel shows the result after 1 iteration of Algorithm 2, and the right panel the result after 10 iterations. Circle points (red) and square points (green) represent, respectively, the outliers and the uncontaminated points that are correctly classified; pentagon points (blue) and inverted triangle points (black) represent the outliers and the uncontaminated points that are incorrectly classified. (Simulation settings: n=500n=500, ε=0.2\varepsilon=0.2 and η=1\eta=1.)

.4.2 RWGAN

RWGAN-2. Here we provide the key steps needed to implement RWGAN-2, which is derived directly from (3). Since (3) allows the distribution of the reference samples to be modified, with a penalty term limiting the degree of the modification, RWGAN-2 can be implemented as follows.

Letting XPr{X}^{\prime}\sim\mathrm{P}_{{\rm r}^{\prime}}, where Pr\mathrm{P}_{{\rm r}^{\prime}} is the modified reference distribution, and keeping the other notation as in (16), the objective function of RWGAN-2 is defined as

minθminPrsupfξL1,range(fξ)2λ{E[fξ(X)]E[fξ(Gθ(Z))]+λmPrPrTV},\min_{\theta}\min_{\mathrm{P}_{{\rm r}^{\prime}}}\sup_{\|f_{\xi}\|_{L}\leq 1,{\rm range}(f_{\xi})\leq 2\lambda}\left\{{\rm E}[f_{\xi}({X}^{\prime})]-{\rm E}[f_{\xi}(G_{\theta}({Z}))]+\lambda_{\rm m}\|\mathrm{P}_{{\rm r}}-\mathrm{P}_{{\rm r}^{\prime}}\|_{\rm TV}\right\}, (43)

where λm\lambda_{\rm m} is the penalty parameter controlling the modification. The algorithm for implementing RWGAN-2 is summarized in Algorithm 3, and a minimal training sketch is given after it. Note that in Algorithm 3 we introduce a new neural network UωU_{\omega} to represent the modified reference distribution, which corresponds to Pr\mathrm{P_{r^{\prime}}} in the objective function (43). Then we follow (43) and use standard steps to update the generator, the discriminator and the modified distribution in turn. A few remarks: (i) from the objective function, the loss function of the discriminator is E[fξ(Gθ(Z))]E[fξ(X)]{\rm E}[f_{\xi}(G_{\theta}({Z}))]-{\rm E}[f_{\xi}({X}^{\prime})], the loss function of the modified distribution is E[fξ(X)]+λmPrPrTV{\rm E}[f_{\xi}({X}^{\prime})]+\lambda_{\rm m}\|\mathrm{P}_{{\rm r}}-\mathrm{P}_{{\rm r}^{\prime}}\|_{\rm TV}, and the loss function of the generator is E[fξ(Gθ(Z))]-{\rm E}[f_{\xi}(G_{\theta}({Z}))]; (ii) we clip the absolute values of the discriminator parameters ξ\xi to no more than a fixed constant cc at each update: this is a simple and efficient way to satisfy the Lipschitz condition in WGAN; (iii) we adopt the Root Mean Squared Propagation (RMSProp) optimization method, which is recommended by [3] for WGAN.

Data: observations {X1,X2,,Xn}\{{X}_{1},{X}_{2},\ldots,{X}_{n}\}, critic fξf_{\xi} (indexed by a parameter ξ\xi), generator GθG_{\theta} (indexed by a parameter θ\theta), modified distribution UωU_{\omega} (indexed by a parameter ω\omega), learning rate α\alpha, modification penalty parameter λm\lambda_{\rm m}, batch size NbatchN_{\rm batch}, number of iterations NiterN_{\rm iter}, noise distribution Pz\mathrm{P}_{{z}}, clipping parameter cc
Result: generation model GθG_{\theta}
1 while iNiteri\leq N_{\rm iter} do
2       sample a batch of reference samples {Xi}i=1Nbatch\{{X}_{i}\}_{i=1}^{N_{\rm batch}};
3       sample a batch of i.i.d. noises {Zi}i=1NbatchPz\{{Z}_{i}\}_{i=1}^{N_{\rm batch}}\sim\mathrm{P}_{{z}};
4       let Lξ=1Nbatchifξ(Gθ(Zi))iUω(Xi)fξ(Xi){L}_{\xi}=\frac{1}{N_{\rm batch}}\sum_{i}f_{\xi}(G_{\theta}({Z}_{i}))-\sum_{i}U_{\omega}({X}_{i})f_{\xi}({X}_{i});
5       update ξξαRMSProp(ξ,ξLξ)\xi\xleftarrow{}\xi-\alpha\cdot RMSProp(\xi,\nabla_{\xi}{L}_{\xi});
6       ξclip(ξ,c,c)\xi\xleftarrow{}\mathrm{clip}(\xi,-c,c);
7       let Lθ=1Nbatchifξ(Gθ(Zi)){L}_{\theta}=-\frac{1}{N_{\rm batch}}\sum_{i}f_{\xi}(G_{\theta}({Z}_{i}));
8       update θθαRMSProp(θ,θLθ)\theta\xleftarrow{}\theta-\alpha\cdot RMSProp(\theta,\nabla_{\theta}{L}_{\theta});
9       let Lω=iUω(Xi)fξ(Xi)+λmi|Uω(Xi)1/Nbatch|{L}_{\omega}=\sum_{i}U_{\omega}({X}_{i})f_{\xi}({X}_{i})+\lambda_{\rm m}\sum_{i}|U_{\omega}({X}_{i})-1/N_{\rm batch}|;
10       update ωωαRMSProp(ω,ωLω)\omega\xleftarrow{}\omega-\alpha\cdot RMSProp(\omega,\nabla_{\omega}{L}_{\omega});
11      
12 end while
Algorithm 3 RWGAN-2
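To make the update rules concrete, here is a minimal PyTorch sketch of one RWGAN-2 training step following Algorithm 3 (not the authors' code; the network sizes, the learning rate and the Softplus parametrization of UωU_{\omega} are illustrative assumptions).

import torch
import torch.nn as nn

m, latent_dim, batch, lr, c, lam_m = 2, 2, 64, 5e-5, 0.01, 10.0

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, m))        # generator G_theta
f = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, 1))                 # critic f_xi
U = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())  # modified weights U_omega

opt_f = torch.optim.RMSprop(f.parameters(), lr=lr)
opt_G = torch.optim.RMSprop(G.parameters(), lr=lr)
opt_U = torch.optim.RMSprop(U.parameters(), lr=lr)

def train_step(x):                                   # x: (batch, m) reference samples
    z = torch.randn(batch, latent_dim)
    # critic update: L_xi = mean f(G(z)) - sum U(x) f(x)
    loss_f = f(G(z).detach()).mean() - (U(x).detach() * f(x)).sum()
    opt_f.zero_grad()
    loss_f.backward()
    opt_f.step()
    for p in f.parameters():                         # weight clipping, as in WGAN
        p.data.clamp_(-c, c)
    # generator update: L_theta = -mean f(G(z))
    loss_G = -f(G(z)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    # modified-distribution update: L_omega = sum U(x) f(x) + lam_m * sum |U(x) - 1/batch|
    u = U(x)
    loss_U = (u * f(x).detach()).sum() + lam_m * (u - 1.0 / batch).abs().sum()
    opt_U.zero_grad()
    loss_U.backward()
    opt_U.step()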

Some details. If we assume that the reference sample is mm-dimensional, the generator is a neural network with mm input nodes and mm output nodes; the critic is a neural network with mm input nodes and one output node. Also, in RWGAN-2 and RWGAN-B, we need an additional neural network (with mm input nodes and one output node) to represent the modified distribution. In both the synthetic-data example and the Fashion-MNIST example, all generative adversarial models are implemented in 𝖯𝗒𝗍𝗈𝗋𝖼𝗁\mathsf{Pytorch} (an open source Python machine learning library, https://pytorch.org/), to which we refer for additional computational details.

In RWGAN-1, as in WGAN, we only need two neural networks, representing the generator and the critic, but we make a small modification to the activation function of the output node of the critic: (i) we choose a bounded activation function, and (ii) we add a location parameter to it. This ensures that fξf_{\xi} satisfies the condition range(fξ)2λ\mathrm{range}(f_{\xi})\leq 2\lambda in (17); a minimal sketch of such an output layer is given below. RWGAN-1 thus has a simple structure similar to WGAN, while still showing good robustness to outliers. We can also see this from the objective functions of RWGAN-1 (17) and RWGAN-2 (43): RWGAN-2 needs to update one more set of parameters ω\omega than RWGAN-1, which means more computational complexity.
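A minimal sketch of such a bounded critic output layer (our own illustrative parametrization, not necessarily the one used in the paper) is the following: a bounded activation (tanh) scaled by λ, plus a learnable location shift, so that the output range has length at most 2λ.

import torch
import torch.nn as nn

class BoundedOutput(nn.Module):
    def __init__(self, lam):
        super().__init__()
        self.lam = lam
        self.b = nn.Parameter(torch.zeros(1))        # location parameter
    def forward(self, x):
        return self.lam * torch.tanh(x) + self.b     # values in [b - lam, b + lam]

# critic = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, 1), BoundedOutput(lam))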

In addition, we make use of Fashion-MNIST to provide an example of how the generator GθG_{\theta} works on real data. We choose a standard multivariate normal random variable (with the same dimension as the reference sample) to obtain the random input sample. An image in Fashion-MNIST has 28×2828\times 28 pixels, and we think of it as a 784-dimensional vector. Hence, we choose a 784784-dimensional standard multivariate normal random variable ZZ as the random input of the generator GθG_{\theta}, from which we output a vector of the same 784 dimensions (corresponding to the values of all the pixels of a grayscale image) to generate one fake image. Put another way, we can regard each image as a sample point, so that all the images in Fashion-MNIST constitute an empirical reference distribution P^r\hat{\mathrm{P}}_{\rm r}; our goal is then to train GθG_{\theta} such that Pθ\mathrm{P}_{\theta} is very close to P^r\hat{\mathrm{P}}_{\rm r}.

References

  • [1] {barticle}[author] \bauthor\bsnmAlvarez-Esteban, \bfnmPedro César\binitsP. C., \bauthor\bsnmDel Barrio, \bfnmEustasio\binitsE., \bauthor\bsnmCuesta-Albertos, \bfnmJuan Antonio\binitsJ. A. and \bauthor\bsnmMatran, \bfnmCarlos\binitsC. (\byear2008). \btitleTrimmed comparison of distributions. \bjournalJournal of the American Statistical Association \bvolume103 \bpages697–704. \endbibitem
  • [2] {barticle}[author] \bauthor\bsnmAmari, \bfnmShun-ichi\binitsS.-i., \bauthor\bsnmKarakida, \bfnmRyo\binitsR. and \bauthor\bsnmOizumi, \bfnmMasafumi\binitsM. (\byear2018). \btitleInformation geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem. \bjournalInformation Geometry \bvolume1 \bpages13–37. \endbibitem
  • [3] {binproceedings}[author] \bauthor\bsnmArjovsky, \bfnmMartin\binitsM., \bauthor\bsnmChintala, \bfnmSoumith\binitsS. and \bauthor\bsnmBottou, \bfnmLéon\binitsL. (\byear2017). \btitleWasserstein generative adversarial networks. In \bbooktitleInternational conference on machine learning \bpages214–223. \bpublisherPMLR. \endbibitem
  • [4] {barticle}[author] \bauthor\bsnmBalaji, \bfnmYogesh\binitsY., \bauthor\bsnmChellappa, \bfnmRama\binitsR. and \bauthor\bsnmFeizi, \bfnmSoheil\binitsS. (\byear2020). \btitleRobust optimal transport with applications in generative modeling and domain adaptation. \bjournalAdvances in Neural Information Processing Systems \bvolume33 \bpages12934–12944. \endbibitem
  • [5] {barticle}[author] \bauthor\bsnmBassetti, \bfnmFederico\binitsF., \bauthor\bsnmBodini, \bfnmAntonella\binitsA. and \bauthor\bsnmRegazzini, \bfnmEugenio\binitsE. (\byear2006). \btitleOn minimum Kantorovich distance estimators. \bjournalStatistics & probability letters \bvolume76 \bpages1298–1302. \endbibitem
  • [6] {barticle}[author] \bauthor\bsnmBassetti, \bfnmFederico\binitsF. and \bauthor\bsnmRegazzini, \bfnmEugenio\binitsE. (\byear2006). \btitleAsymptotic properties and robustness of minimum dissimilarity estimators of location-scale parameters. \bjournalTheory of Probability & Its Applications \bvolume50 \bpages171–186. \endbibitem
  • [7] {bbook}[author] \bauthor\bsnmBasu, \bfnmAyanendranath\binitsA., \bauthor\bsnmShioya, \bfnmHiroyuki\binitsH. and \bauthor\bsnmPark, \bfnmChanseok\binitsC. (\byear2011). \btitleStatistical inference: the minimum distance approach. \bpublisherCRC press. \endbibitem
  • [8] {barticle}[author] \bauthor\bsnmBernton, \bfnmEspen\binitsE., \bauthor\bsnmJacob, \bfnmPierre E\binitsP. E., \bauthor\bsnmGerber, \bfnmMathieu\binitsM. and \bauthor\bsnmRobert, \bfnmChristian P\binitsC. P. (\byear2019). \btitleOn parameter estimation with the Wasserstein distance. \bjournalInformation and Inference: A Journal of the IMA \bvolume8 \bpages657–676. \endbibitem
  • [9] {barticle}[author] \bauthor\bsnmBoissard, \bfnmEmmanuel\binitsE. and \bauthor\bsnmLe Gouic, \bfnmThibaut\binitsT. (\byear2014). \btitleOn the mean speed of convergence of empirical and occupation measures in Wasserstein distance. \bvolume50 \bpages539–563. \endbibitem
  • [10] {barticle}[author] \bauthor\bsnmBolley, \bfnmFrançois\binitsF., \bauthor\bsnmGuillin, \bfnmArnaud\binitsA. and \bauthor\bsnmVillani, \bfnmCédric\binitsC. (\byear2007). \btitleQuantitative concentration inequalities for empirical measures on non-compact spaces. \bjournalProbability Theory and Related Fields \bvolume137 \bpages541–593. \endbibitem
  • [11] {barticle}[author] \bauthor\bsnmBrenier, \bfnmYann\binitsY. (\byear1987). \btitleDécomposition polaire et réarrangement monotone des champs de vecteurs. \bjournalCR Acad. Sci. Paris Sér. I Math. \bvolume305 \bpages805–808. \endbibitem
  • [12] {binproceedings}[author] \bauthor\bsnmCarriere, \bfnmMathieu\binitsM., \bauthor\bsnmCuturi, \bfnmMarco\binitsM. and \bauthor\bsnmOudot, \bfnmSteve\binitsS. (\byear2017). \btitleSliced Wasserstein kernel for persistence diagrams. In \bbooktitleInternational conference on machine learning \bpages664–673. \bpublisherPMLR. \endbibitem
  • [13] {bphdthesis}[author] \bauthor\bsnmChizat, \bfnmLénaïc\binitsL. (\byear2017). \btitleUnbalanced optimal transport: Models, numerical methods, applications, \btypePhD thesis, \bpublisherUniversité Paris sciences et lettres. \endbibitem
  • [14] {barticle}[author] \bauthor\bsnmCourty, \bfnmNicolas\binitsN., \bauthor\bsnmFlamary, \bfnmRémi\binitsR., \bauthor\bsnmHabrard, \bfnmAmaury\binitsA. and \bauthor\bsnmRakotomamonjy, \bfnmAlain\binitsA. (\byear2017). \btitleJoint distribution optimal transportation for domain adaptation. \bjournalAdvances in Neural Information Processing Systems \bvolume30. \endbibitem
  • [15] {binproceedings}[author] \bauthor\bsnmCourty, \bfnmNicolas\binitsN., \bauthor\bsnmFlamary, \bfnmRémi\binitsR. and \bauthor\bsnmTuia, \bfnmDevis\binitsD. (\byear2014). \btitleDomain adaptation with regularized optimal transport. In \bbooktitleJoint European Conference on Machine Learning and Knowledge Discovery in Databases \bpages274–289. \bpublisherSpringer. \endbibitem
  • [16] {barticle}[author] \bauthor\bsnmCuturi, \bfnmMarco\binitsM. (\byear2013). \btitleSinkhorn distances: lightspeed computation of optimal transport. \bjournalAdvances in neural information processing systems \bvolume26. \endbibitem
  • [17] {barticle}[author] \bauthor\bsnmDaumé III, \bfnmHal\binitsH. (\byear2009). \btitleFrustratingly easy domain adaptation. \bjournalarXiv preprint arXiv:0907.1815. \endbibitem
  • [18] {barticle}[author] \bauthor\bparticledel \bsnmBarrio, \bfnmEustasio\binitsE., \bauthor\bsnmSanz, \bfnmAlberto Gonzalez\binitsA. G. and \bauthor\bsnmHallin, \bfnmMarc\binitsM. (\byear2022). \btitleNonparametric Multiple-Output Center-Outward Quantile Regression. \bjournalarXiv preprint arXiv:2204.11756. \endbibitem
  • [19] {barticle}[author] \bauthor\bsnmDudley, \bfnmRichard Mansfield\binitsR. M. (\byear1969). \btitleThe speed of mean Glivenko-Cantelli convergence. \bjournalThe Annals of Mathematical Statistics \bvolume40 \bpages40–50. \endbibitem
  • [20] {barticle}[author] \bauthor\bsnmFournier, \bfnmNicolas\binitsN. (\byear2022). \btitleConvergence of the empirical measure in expected Wasserstein distance: non asymptotic explicit bounds in Rd. \bjournalarXiv preprint arXiv:2209.00923. \endbibitem
  • [21] {barticle}[author] \bauthor\bsnmFournier, \bfnmNicolas\binitsN. and \bauthor\bsnmGuillin, \bfnmArnaud\binitsA. (\byear2015). \btitleOn the rate of convergence in Wasserstein distance of the empirical measure. \bjournalProbability theory and related fields \bvolume162 \bpages707–738. \endbibitem
  • [22] {barticle}[author] \bauthor\bsnmFrogner, \bfnmCharlie\binitsC., \bauthor\bsnmZhang, \bfnmChiyuan\binitsC., \bauthor\bsnmMobahi, \bfnmHossein\binitsH., \bauthor\bsnmAraya, \bfnmMauricio\binitsM. and \bauthor\bsnmPoggio, \bfnmTomaso A\binitsT. A. (\byear2015). \btitleLearning with a Wasserstein loss. \bjournalAdvances in neural information processing systems \bvolume28. \endbibitem
  • [23] {binproceedings}[author] \bauthor\bsnmGenevay, \bfnmAude\binitsA., \bauthor\bsnmPeyré, \bfnmGabriel\binitsG. and \bauthor\bsnmCuturi, \bfnmMarco\binitsM. (\byear2018). \btitleLearning generative models with sinkhorn divergences. In \bbooktitleInternational Conference on Artificial Intelligence and Statistics \bpages1608–1617. \bpublisherPMLR. \endbibitem
  • [24] {binproceedings}[author] \bauthor\bsnmGoodfellow, \bfnmIan\binitsI., \bauthor\bsnmJean, \bfnmPouget-Abadie\binitsP.-A., \bauthor\bsnmMehdi, \bfnmMirza\binitsM., \bauthor\bsnmBing, \bfnmXu\binitsX., \bauthor\bsnmDavid, \bfnmWarde-Farley\binitsW.-F., \bauthor\bsnmSherjil, \bfnmOzair\binitsO. and \bauthor\bsnmCourville Aaron, \bfnmC\binitsC. (\byear2014). \btitleGenerative adversarial networks. In \bbooktitleProceedings of the 27th international conference on neural information processing systems \bvolume2 \bpages2672–2680. \endbibitem
  • [25] {barticle}[author] \bauthor\bsnmHallin, \bfnmMarc\binitsM. (\byear2022). \btitleMeasure transportation and statistical decision theory. \bjournalAnnual Review of Statistics and Its Application \bvolume9 \bpages401–424. \endbibitem
  • [26] {barticle}[author] \bauthor\bsnmHallin, \bfnmMarc\binitsM., \bauthor\bsnmLa Vecchia, \bfnmDavide\binitsD. and \bauthor\bsnmLiu, \bfnmHang\binitsH. (\byear2022). \btitleCenter-outward R-estimation for semiparametric VARMA models. \bjournalJournal of the American Statistical Association \bvolume117 \bpages925–938. \endbibitem
  • [27] {barticle}[author] \bauthor\bsnmHallin, \bfnmMarc\binitsM., \bauthor\bsnmLa Vecchia, \bfnmDavide\binitsD. and \bauthor\bsnmLiu, \bfnmHang\binitsH. (\byear2023). \btitleRank-based testing for semiparametric VAR models: a measure transportation approach. \bjournalBernoulli \bvolume29 \bpages229–273. \endbibitem
  • [28] {barticle}[author] \bauthor\bsnmHallin, \bfnmMarc\binitsM. and \bauthor\bsnmLiu, \bfnmHang\binitsH. (\byear2023). \btitleCenter-outward Rank-and Sign-based VARMA Portmanteau Tests: Chitturi, Hosking, and Li–McLeod revisited. \bjournalEconometrics and Statistics. \endbibitem
  • [29] {bbook}[author] \bauthor\bsnmHampel, \bfnmFrank R\binitsF. R., \bauthor\bsnmRonchetti, \bfnmElvezio M\binitsE. M., \bauthor\bsnmRousseeuw, \bfnmPeter\binitsP. and \bauthor\bsnmStahel, \bfnmWerner A\binitsW. A. (\byear1986). \btitleRobust statistics: the approach based on influence functions. \bpublisherWiley-Interscience; New York. \endbibitem
  • [30] {bbook}[author] \bauthor\bsnmHayashi, \bfnmFumio\binitsF. (\byear2011). \btitleEconometrics. \bpublisherPrinceton University Press. \endbibitem
  • [31] {barticle}[author] \bauthor\bsnmHu, \bfnmYipeng\binitsY., \bauthor\bsnmJacob, \bfnmJoseph\binitsJ., \bauthor\bsnmParker, \bfnmGeoffrey JM\binitsG. J., \bauthor\bsnmHawkes, \bfnmDavid J\binitsD. J., \bauthor\bsnmHurst, \bfnmJohn R\binitsJ. R. and \bauthor\bsnmStoyanov, \bfnmDanail\binitsD. (\byear2020). \btitleThe challenges of deploying artificial intelligence models in a rapidly evolving pandemic. \bjournalNature Machine Intelligence \bvolume2 \bpages298–300. \endbibitem
  • [32] {bincollection}[author] \bauthor\bsnmHuber, \bfnmPeter J\binitsP. J. (\byear1992). \btitleRobust estimation of a location parameter. In \bbooktitleBreakthroughs in statistics \bpages492–518. \bpublisherSpringer. \endbibitem
  • [33] {barticle}[author] \bauthor\bsnmKantorovich, \bfnmLV\binitsL. (\byear1942). \btitleOn the translocation of masses, CR Dokl. \bjournalAcad. Sci. URSS \bvolume37 \bpages191–201. \endbibitem
  • [34] Kitamura, Y. and Stutzer, M. (1997). An information-theoretic alternative to generalized method of moments estimation. Econometrica: Journal of the Econometric Society, 861–874.
  • [35] Kolouri, S., Park, S. R., Thorpe, M., Slepcev, D. and Rohde, G. K. (2017). Optimal mass transport: signal processing and machine learning applications. IEEE Signal Processing Magazine 34, 43–59.
  • [36] La Vecchia, D., Camponovo, L. and Ferrari, D. (2015). Robust heart rate variability analysis by generalized entropy minimization. Computational Statistics & Data Analysis 82, 137–151.
  • [37] La Vecchia, D., Ronchetti, E. and Ilievski, A. (2022). On some connections between Esscher's tilting, saddlepoint approximations, and optimal transportation: a statistical perspective. Statistical Science 1, 1–22.
  • [38] La Vecchia, D., Ronchetti, E. and Trojani, F. (2012). Higher-order infinitesimal robustness. Journal of the American Statistical Association 107, 1546–1557.
  • [39] Lecué, G. and Lerasle, M. (2020). Robust machine learning by median-of-means: theory and practice. The Annals of Statistics 48, 906–931. doi:10.1214/19-AOS1828.
  • [40] Lei, J. (2020). Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 26, 767–798. doi:10.3150/19-BEJ1151.
  • [41] Mancini, L., Ronchetti, E. and Trojani, F. (2005). Optimal conditionally unbiased bounded-influence inference in dynamic location and scale models. Journal of the American Statistical Association 100, 628–641.
  • [42] Maronna, R. A., Martin, D. R. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
  • [43] Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. Mém. Math. Phys. Acad. Royale Sci., 666–704.
  • [44] Mukherjee, D., Guha, A., Solomon, J. M., Sun, Y. and Yurochkin, M. (2021). Outlier-robust optimal transport. In International Conference on Machine Learning, 7850–7860. PMLR.
  • [45] Nietert, S., Goldfeld, Z. and Cummings, R. (2022). Outlier-robust optimal transport: duality, structure, and statistical analysis. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 11691–11719. PMLR.
  • [46] Panaretos, V. M. and Zemel, Y. (2020). An Invitation to Statistics in Wasserstein Space. Springer Nature.
  • [47] Peyré, G. and Cuturi, M. (2019). Computational optimal transport: with applications to data science. Foundations and Trends® in Machine Learning 11, 355–607.
  • [48] Ramponi, A. and Plank, B. (2020). Neural unsupervised domain adaptation in NLP: a survey. arXiv preprint arXiv:2006.00632.
  • [49] Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational Analysis, vol. 317. Springer Science & Business Media.
  • [50] Ronchetti, E. (2022). Robustness aspects of optimal transport. Draft paper.
  • [51] Ronchetti, E. M. and Huber, P. J. (2009). Robust Statistics. John Wiley & Sons.
  • [52] Samorodnitsky, G. and Taqqu, M. S. (2017). Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Routledge.
  • [53] Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser, New York.
  • [54] Tukey, J. W. (1977). Exploratory Data Analysis (1970–71: preliminary edition). Addison-Wesley, Massachusetts.
  • [55] Van der Vaart, A. W. (2000). Asymptotic Statistics, vol. 3. Cambridge University Press.
  • [56] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47. Cambridge University Press.
  • [57] Villani, C. (2009). Optimal Transport: Old and New, vol. 338. Springer.
  • [58] Warwick, J. and Jones, M. C. (2005). Choosing a robustness tuning parameter. Journal of Statistical Computation and Simulation 75, 581–588.
  • [59] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica: Journal of the Econometric Society, 1–25.
  • [60] Yatracos, Y. G. (2022). Limitations of the Wasserstein MDE for univariate data. Statistics and Computing 32, 1–11.