On Generalization and Regularization via Wasserstein Distributionally Robust Optimization
Abstract
Wasserstein distributionally robust optimization (DRO) has gained prominence in operations research and machine learning as a powerful method for achieving solutions with favorable out-of-sample performance. Two compelling explanations for its success are the generalization bounds derived from Wasserstein DRO and its equivalence to regularization schemes commonly used in machine learning. However, existing results on generalization bounds and regularization equivalence are largely limited to settings where the Wasserstein ball is of a specific type, and the decision criterion takes certain forms of expected functions. In this paper, we show that generalization bounds and regularization equivalence can be obtained in a significantly broader setting, where the Wasserstein ball is of a general type and the decision criterion accommodates any form, including general risk measures. This not only addresses important machine learning and operations management applications but also expands to general decision-theoretical frameworks previously unaddressed by Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a by-product, we show that Wasserstein DRO coincides with the recent max-sliced Wasserstein DRO for any decision criterion under affine decision rules – resulting in both being efficiently solvable as convex programs via our general regularization results. These general assurances provide a strong foundation for expanding the application of Wasserstein DRO across diverse domains of data-driven decision problems.
Key-words: distributionally robust optimization, Wasserstein metrics, finite-sample guarantees, regularization
1 Introduction
Stochastic optimization problems of the form
(1)
naturally arise in many machine learning (ML) and operations research (OR) applications, where denotes a binary random variable, taking values from , and is a random vector in . The function , parameterized by the decision variable , can generally be interpreted as an affine decision rule. In ML, the binary random variable often represents outcomes in classification problems. When is constant, the formulation (1) applies to a broad range of regression problems in ML and risk minimization problems in OR. The function represents a measure of risk, mapping a random variable to a real number that quantifies its risk. The notion of risk in this paper is broadly defined as the undesirability of a random variable ; that is, a larger value of indicates that is less preferable.
In this paper, we study the Wasserstein distributionally robust counterpart of (1), specifically:
(2)
where denotes a type- Wasserstein ball of radius centred at a reference distribution (c.f. Kuhn et al. (2019)). The superscript in indicates that the measure depends on a joint distribution of . This problem has been extensively studied and applied in various fields when takes the form of an expected value function, i.e., for some (see e.g., Kuhn et al. (2019); Shafieezadeh-Abadeh et al. (2019); Gao et al. (2017); Gao (2022)). This setup reveals two key findings that underscore the potential strength of the Wasserstein distributionally robust optimization (DRO) model in achieving strong out-of-sample performance. First, the model enjoys generalization bounds under mild assumptions (Shafieezadeh-Abadeh et al. (2019)). More notably, Gao (2022) shows that when the data-generating distribution satisfies transportation inequalities, -generalization bounds for the type- Wasserstein ball with can be established, free from the curse of dimensionality. Second, the model has an exact regularization equivalent to the nominal problem (1) with an added regularizer on the decision variable , when the Wasserstein ball is of type-1 and the loss function is Lipschitz continuous (Shafieezadeh-Abadeh et al. (2019)), as well as when the ball is of type-2 and the loss function is quadratic (Blanchet et al. (2019)). This connection to the classical regularization scheme, prevalent in machine learning, offers a compelling interpretation of the DRO model and has spurred considerable interest in its application within ML and OR (see e.g., Blanchet and Kang (2021), Blanchet et al. (2019), Chen and Paschalidis (2018), Gao et al. (2017), Gao et al. (2022)).
It remains largely unclear, however, whether the generalization bounds established for the specific case of the expected value function can be achieved in the broader setting of (2) – namely, for a general measure and any type- Wasserstein ball with – without suffering from the curse of dimensionality. In particular, the result of Gao (2022), aside from the limitations imposed by the transportation inequality assumption on the underlying distribution and the restriction to , fundamentally relies on the unbiased property of the sample average – a property that does not extend to risk measures such as Conditional Value-at-Risk (CVaR) (Rockafellar and Uryasev (2002)). It is also worth noting that, even in the literature on ML, the question of how to obtain generalization bounds, possibly through the classical regularization scheme, for the nominal problem (1) has largely been left open (c.f. Shalev-Shwartz and Ben-David (2014)). In this paper, we shed new light on the efficacy of Wasserstein DRO in broad out-of-sample tests by showing that the Wasserstein DRO model (2) can circumvent the curse of dimensionality for virtually any decision criterion under affine decision rules. As a crucial insight, we uncover that the Wasserstein DRO remarkably coincides with the recent max-sliced Wasserstein DRO (Olea et al. (2022)) for any decision criterion, even though the ball defined in the former is a far less conservative choice – specifically, it is smaller and tighter than the one defined in the latter. Additionally, we extend our results to general decision rules under the light-tail assumption by leveraging the measure concentration property of a Wasserstein ball.
Another focus of this work is to investigate whether the key equivalence of Wasserstein DRO to classical regularization in ML can be extended to a more general setting in (2). To answer this, we first pay particular attention to the setting where the Wasserstein ball is of any type, i.e., , and the measure is an expected function with a loss function having a growth order . We show that for many loss functions arising from practical applications, it is possible to establish an exact relation between the Wasserstein DRO model (2) and the classical regularization scheme in more general terms. Notably, we prove that no loss functions beyond those we identify satisfy this equivalence, offering a definitive answer to the extent to which Wasserstein DRO can be interpreted and solved from a regularization perspective. Our results also reveal the tractability of solving Wasserstein DRO for many non-Lipschitz continuous loss functions and higher-order Wasserstein balls (). While Sim et al. (2021) approach such problems via approximation due to reformulation challenges, we show that exact solutions are possible through regularization reformulations. Moreover, we extend this equivalence to settings where goes beyond expected value functions. Previously, this exact relation was known only for variance (Blanchet et al. (2022)) or distortion risk measures (Wozabal (2014)). We expand it to a significantly broader class of measures, encompassing higher moment coherent risk measures (Krokhmal (2007)), criteria emerging more recently from OR and ML (Rockafellar and Uryasev (2013), Rockafellar et al. (2008), Gotoh and Uryasev (2017)), and notably, general decision-theoretical models such as the celebrated rank-dependent expected utility (RDEU) (Quiggin (1982)).
Related Work
From the perspective of generalization bounds. A series of works by Blanchet et al. (2019), Blanchet and Kang (2021), and Blanchet et al. (2022) takes a different approach to tackling the curse of dimensionality. They study the classical setting of Wasserstein DRO, where the measure is an expected function, and propose a radius selection rule for Wasserstein balls. They show that the rule can be used to build a confidence region for the optimal solution, and that the radius can be chosen to decay at a square-root rate as the sample size goes to infinity. Although this allows the curse of dimensionality to be bypassed, the bounds derived from the rule are valid only in an asymptotic sense. Blanchet et al. (2022) also take this approach to obtain generalization bounds for mean-variance portfolio selection problems. In contrast, the generalization bounds established in this paper, like those in Shafieezadeh-Abadeh et al. (2019), Chen and Paschalidis (2018), and Gao (2022), break the curse of dimensionality in a non-asymptotic sense for finite samples.
From the perspective of the equivalency between Wasserstein DRO and regularization. While the focus of this work is on studying the exact equivalency between Wasserstein DRO and regularization, there is an active stream of works studying the asymptotic equivalence in the setting where the measure is an expected function (see Gao et al. (2017); Blanchet et al. (2019); Blanchet et al. (2022); Volpi et al. (2018); Bartl et al. (2020)). In particular, Gao et al. (2022) introduce the notion of variation regularization and show that for a broad class of loss functions, Wasserstein DRO is asymptotically equivalent to a variation regularization problem.
Recap of Major Contributions and Practical Implications
1. We underscore the fundamental gap in achieving -generalization bounds for diverse decision criteria in Wasserstein DRO and establish such guarantees, for the first time, by leveraging the projection set of the Wasserstein ball, revealing a broad equivalence with the recent max-sliced Wasserstein DRO for any decision criterion.
2. By proving the precise conditions under which Wasserstein DRO has an exact regularization equivalent, we provide key insights into a wide range of OR/MS applications, highlighting when a simpler, intuitive, and computationally efficient regularization approach suffices to achieve the generalization guarantees of Wasserstein DRO – and when solving the full Wasserstein DRO formulation is necessary.
3. By extending generalization bounds beyond affine decision rules, we uncover important OR/MS applications, such as foundational two-stage stochastic programs and the increasingly popular feature-based newsvendor problem, with generalization guarantees that scale effectively across diverse decision criteria.
2 Wasserstein Distributionally Robust Optimization Model
Let be an atomless probability space. A random vector is a measurable mapping from to , . Denote by the distribution of under . For , let be the set of all random variables with a finite th moment, and be the set of all distributions on with finite th moments in each component. Let be the set of all bounded random variables, and be the set of all distributions on with bounded support. Let denote the Hölder conjugate of , i.e., . We use to represent the essential supremum of the random variable , i.e., . Recall that for any two distributions , the type- Wasserstein metric is defined as
(3)
where is a metric on . The set denotes the set of all joint distributions of and with marginal distributions and , respectively. In this paper, we consider with and apply the type- Wasserstein metric (3) with , defined by the following additively separable form
(4)
where is any given norm on with its dual norm defined by . The function satisfies if and otherwise. Thus, the function (4) assigns an infinitely large cost to any discrepancy in , i.e., , and reduces to a general norm on when there is no discrepancy in , i.e., . Using this norm, for and , we define the ball of distributions as
(5)
which we refer to as the type- Wasserstein ball throughout this paper.
In the remainder of this paper, we first demonstrate that the DRO problem (2), using the above definition of the Wasserstein ball:
(6)
can achieve generalization guarantees for any measure and type- Wasserstein ball. We then elucidate this guarantee for the widely used affine decision rules by exploring the connection between the optimization model (6) and the classical regularization scheme in Section 4. This framework covers various problem types, including:
• Classification:
• Regression:
• Risk minimization:
where in the cases of regression and risk minimization. These formulations accommodate numerous machine learning and operations research applications not yet addressed in the existing literature on Wasserstein DRO. Table 1 highlights examples considered in this paper.
| Applications | Measure | Formulation | References |
| Classification | HO hinge loss; HO SVM; -SVM | | Lee and Mangasarian (2001), Suykens and Vandewalle (1999), Schölkopf et al. (2000); Gotoh and Uryasev (2017) |
| Regression | HO regression; HO -insensitive; -SVR | | the well-known least-square regression; Drucker et al. (1997); Lee et al. (2005); Schölkopf et al. (1998) |
| Risk minimization | LPM; CVaR-Deviation; HM risk measure | | Bawa (1975); Fishburn (1977); Chen et al. (2011); Rockafellar et al. (2008); Krokhmal (2007) |
Notes: HO: higher-order; SVM: support vector machine; SVR: support vector regression; LPM: lower partial moments; HM: higher moment; CVaR: conditional Value-at-Risk, , , where is the left-quantile function of . The range of parameters: ; ; ; ; .
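To make the CVaR entries in Table 1 concrete, the following minimal sketch estimates CVaR at level α from an i.i.d. sample of losses. It assumes the standard convention CVaR_α(X) = (1/(1−α)) ∫_α^1 q_X(s) ds (larger values are worse), which matches the left-quantile-based definition referenced in the notes above; the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np

def empirical_cvar(x, alpha):
    """Estimate CVaR_alpha of a loss sample, assuming the convention
    CVaR_alpha(X) = (1/(1-alpha)) * integral_alpha^1 q_X(s) ds,
    i.e., the average of the worst (1-alpha) fraction of outcomes."""
    x = np.sort(np.asarray(x, dtype=float))            # empirical left-quantile function
    n = len(x)
    upper = np.arange(1, n + 1) / n                    # right endpoint i/n of the i-th atom
    weights = np.clip(upper - alpha, 0.0, 1.0 / n)     # mass of atom i that falls in (alpha, 1]
    return float(weights @ x) / (1.0 - alpha)

rng = np.random.default_rng(0)
losses = rng.standard_normal(100_000)
print(empirical_cvar(losses, 0.95))   # roughly 2.06 for a standard normal loss
```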
3 Generalization Bounds
Wasserstein DRO is typically applied in a data-driven setting, where the data-generating distribution of random variables is only partially observed through sample data points for , independently drawn from . In this context, the empirical distribution is used as the reference distribution in Wasserstein DRO, where and denotes the point-mass at . The central question is whether the (Wasserstein) in-sample risk
can provide an upper confidence bound on the out-of-sample risk
across all possible decisions , with a radius that decays at a rate scaling gracefully with the dimensionality of the random vector – in other words, generalization bounds that overcome the curse of dimensionality.
Current methods face significant limitations. While generalization bounds based on the measure concentration properties of type- Wasserstein balls offer flexibility in accommodating general risk measures , they are constrained by the curse of dimensionality inherent in high-dimensional Wasserstein balls (see e.g., Esfahani and Kuhn (2018)). Approaches such as Gao (2022) address this issue by focusing on expected value functions, i.e., , leveraging the concentration properties of sample averages. However, extending these methods beyond expected value functions proves challenging, as most risk measures lack analogous concentration properties.
We take a novel perspective to tackle this challenge by focusing on the structural properties of the projection set:
(7)
where , and . We uncover that this set coincides with two particularly revealing one-dimensional sets: a one-dimensional Wasserstein ball and the projection set of a high-dimensional max-sliced Wasserstein ball. To formalize this, for , we first recall the definition of a type- Wasserstein ball on with the metric being a norm:
(8)
and a type- max-sliced Wasserstein ball (Olea et al. (2022)):
(9)
Here, is the norm defined in (4) on . Without loss of generality, and to be consistent with the definition of the classical one-dimensional Wasserstein metric, for , i.e., on , we assume that is the absolute-value norm.
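It is also worth noting why the one-dimensional ball (8) is computationally convenient: on the real line with the absolute-value cost, the optimal transport plan in (3) couples quantiles, so the type-p distance between two equally weighted empirical distributions of the same size reduces to matching order statistics. The sketch below illustrates this standard fact; it is not part of the paper's formal development.

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """Type-p Wasserstein distance between two equally weighted empirical
    distributions on the real line (absolute-value cost). In one dimension the
    quantile (order-statistic) coupling is optimal, so sorting suffices."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "this sketch assumes equal sample sizes and uniform weights"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
print(wasserstein_p_1d(z, z + 0.5, p=2))   # a pure shift by 0.5 gives distance 0.5
```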
With these definitions in place, we can now present the following equivalence results.
Proposition 1.
The equivalence established above has profound implications for both Wasserstein DRO and max-sliced Wasserstein DRO. It shows that the two frameworks coincide for any decision criterion under affine decision rules, both reducing to the problem of minimizing worst-case risk over a one-dimensional type- Wasserstein ball – an inherently more tractable problem, as detailed in Section 4. Notably, this equivalence reveals that, although the Wasserstein distance is a significantly stronger metric than the max-sliced Wasserstein distance, i.e., the type- Wasserstein ball is a much less conservative choice of an uncertainty set than its max-sliced counterpart (Olea et al. (2022)), Wasserstein DRO can still achieve the same generalization bounds. This insight directly leads to the following generalization bounds for general measures , free from the curse of dimensionality.
Theorem 1.
Given , and , if for some , then we have
where and
The decay rate of the radius is approximately of the order . While this rate is unaffected by dimensionality, its dependency on the type of Wasserstein ball, specifically the parameter , is noteworthy. This dependency, often overlooked in the literature compared to the curse of dimensionality, is potentially concerning, as it suggests that the radius may shrink very slowly for higher-order Wasserstein balls. This issue arises from the use of stronger metrics – specifically, norms raised to higher powers – when defining any ball constructed similarly to the Wasserstein ball, including the max-sliced Wasserstein ball (Olea et al. (2022)). Intuitively, a higher-order Wasserstein ball is smaller than a lower-order one, requiring a larger radius to ensure the data-generating distribution is covered with high probability. In the extreme case of a type- Wasserstein ball, it would never contain a data-generating distribution with unbounded support, regardless of the radius.
We circumvent this issue, which we refer to as the “curse of the order ”, by recognizing that ensuring high-probability coverage of the data-generating distribution, while sufficient, is not necessary for developing generalization bounds. In fact, we show that even for the type- Wasserstein ball – where the probability of covering a data-generating distribution with unbounded support is effectively zero – the worst-case risk from the type- ball can still bound the out-of-sample risk of the data-generating distribution with high probability. This guarantee requires only the following mild assumption about the measure .
Assumption 1.
There exists such that
For any measures satisfying Assumption 1, one can leverage the generalization bounds for the type-1 Wasserstein ball to derive similar bounds for the type- Wasserstein ball, with a radius scaled by a constant . These bounds apply to any type- Wasserstein ball, for , resulting in generalization bounds where the decay rate of the radius is no longer dependent on the order .
Theorem 2.
The core idea presented in this section can be extended beyond affine decision rules, as discussed in greater detail in Section 6. Here, we further illustrate the generality of Assumption 1 by providing additional examples.
Example 1 (Expected function).
Let , where is quasi-convex and Lipschitz continuous with Lipschitz constant , and satisfies, for any differentiable point , . Since is quasi-convex, there exists such that is decreasing on and increasing otherwise, and thus is differentiable almost everywhere. It follows that
This, together with by the Lipschitz continuity of , implies satisfies Assumption 1 with . In particular, the function , defined as with , , satisfies Assumption 1 with .
Example 2 (Risk measure).
Suppose that satisfies monotonicity and translation invariance, i.e., whenever , and for all and . Such functionals are referred to as monetary risk measures (see, e.g., Föllmer and Schied (2016)). Further, assume that is Lipschitz continuous with respect to the -norm, i.e., there exists such that for all . Note that for any , This, together with , yields that satisfies Assumption 1 with . In particular, CVaR, as defined in Table 1 by , , satisfies Assumption 1 with , where is the quantile function of .
4 A Regularization Perspective
Regularization generally refers to any means applied to avoid overfitting and enhance generalization. In this regard, with the generalization guarantees established in the previous section, the Wasserstein DRO model (6) is well justified as a general regularization model. In this section, we provide further insights into the regularization effect of model (6) on the decision variable , offering a practically useful alternative interpretation. Specifically, we show that the previously observed equivalence between the Wasserstein DRO model and regularized empirical optimization in ML holds across much broader settings, while precisely identifying the boundary of its validity. This equivalence significantly broadens the range of both Wasserstein and max-sliced Wasserstein DRO problems that can now be solved efficiently through their regularization counterparts. Proposition 1 plays a key role in facilitating this analysis. By (10) in Proposition 1, the Wasserstein DRO model (6) can be recast as
(12)
where , and is a Wasserstein ball on defined by (8).
4.1 The case of expected function
When is an expected function, the Wasserstein DRO model (12):
(13)
is equivalent to the regularized model:
(14)
for (the type-1 Wasserstein ball), where is the Lipschitz constant of . For higher-order Wasserstein balls () or non-Lipschitz loss functions, this relationship is less understood, except for and (Blanchet et al. (2019)). It remains an open question whether (13) can be tractably solved in these cases. Higher-order Wasserstein balls are less conservative than type-1 Wasserstein balls and offer practical appeal.
We show that an equivalence with a regularized model (14) exists for if and only if is linear or takes the form of an absolute-value function.
Theorem 3.
Let be a convex function. For , suppose that for all . Then the following statements are equivalent.
(i) There exists such that for any , and , it holds that
(15)
(ii) The function takes one of the following two forms, multiplied by :
(a) or with some ;
(b) with some .
This result, which holds for any type- Wasserstein ball, is somewhat surprising as the Wasserstein DRO model is equivalent to the same regularized model, regardless of the order . This equivalence occurs when the slope of the loss function takes values only from a constant and its negative. It further indicates that no regularized model in the form of (14) can be derived for any other loss function . In other words, if there is an equivalence between the Wasserstein DRO model (13) and a regularized model for another loss function, the regularized model must take a different form from (14). Before discussing other forms of regularization, it is worth noting the following application of this result in regression.
Example 3 (Regression).
- (Least absolute deviation (LAD) regression) Applying in Theorem 3 and setting , , and , we arrive at the distributionally robust counterpart of the least absolute deviation regression, i.e., It is equivalent to for any .
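As a hedged illustration of this equivalence, the following sketch solves the regularized counterpart of the distributionally robust LAD regression in cvxpy, assuming the penalty is the radius times a dual norm of the coefficient vector; the ℓ2 norm and the intercept below are illustrative choices, since the paper's exact display is not reproduced here.

```python
import numpy as np
import cvxpy as cp

def dr_lad_regression(X, y, eps):
    """Regularized counterpart of the distributionally robust LAD regression:
    minimize (1/n) * sum_i |y_i - x_i'beta - b| + eps * ||beta||_2.
    The l2 penalty (dual of an l2 transport cost on x) and the intercept are
    illustrative; the paper allows any norm."""
    n, d = X.shape
    beta, b = cp.Variable(d), cp.Variable()
    objective = cp.sum(cp.abs(y - X @ beta - b)) / n + eps * cp.norm(beta, 2)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value, b.value
```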
We now explore whether an alternative form of regularization on exists that is equivalent to the Wasserstein DRO model (13). Specifically, we focus on cases where the loss function is not Lipschitz continuous, such as the higher-order loss functions arising from the examples presented in Table 1. It is known that the worst-case expectation problem based on the type-1 Wasserstein ball can become unbounded when the loss function is not Lipschitz continuous. To address cases beyond Lipschitz continuity, we consider the following formulation of the Wasserstein DRO model (13):
(16)
where denotes the function raised to the power of , matching the order of the type- Wasserstein ball. Below, we present an alternative equivalency relationship and specify the exact cases where it holds.
Theorem 4.
Let be a Lipschitz continuous and convex function. For any , the following statements are equivalent.
(i) There exists such that for any , and , it holds that
(17)
(ii) The function takes one of the following four forms, multiplied by :
(a) with some ;
(b) with some ;
(c) with some and ;
(d) with some and .
This result is also an “impossibility” theorem, providing a definitive answer to whether the equivalence relationship can hold more broadly for other loss functions. It settles any attempts to establish this equivalence for other convex Lipschitz continuous functions . This theorem is fundamentally important to the study of Wasserstein DRO, especially given ongoing efforts to understand its relationship with a classical regularization perspective. It shows precisely the limits of how far this perspective can be extended.
Remark 1.
As previously discussed, Wasserstein DRO and empirical risk regularization are two widely used and competing approaches for data-driven decision-making in ML and OR/MS. Wasserstein DRO is valued for its universality, offering simple generalization guarantees under minimal assumptions, such as not relying on notions of hypothesis complexity (Shafieezadeh-Abadeh et al. (2019)), while empirical risk regularization is predominantly embraced by practitioners as a simple yet powerful heuristic, despite its desirable theoretical properties (Shafieezadeh-Abadeh et al. (2019); Abu-Mostafa et al. (2012)). While we have expanded the universality of Wasserstein DRO by establishing generalization guarantees across diverse decision criteria, free from the curse of dimensionality, a natural managerial question arises: can simpler, more intuitive heuristics, such as empirical risk regularization, achieve the same generalization guarantees? Our equivalency results provide a positive answer for many applications by specifying the exact form of regularization required. Conversely, our impossibility theorem identifies when Wasserstein DRO becomes essential for decision problems beyond these cases. To underscore the importance of our regularization results for OR/MS applications, we present below an extensive series of examples, including those highlighted in Table 1 and the celebrated rank-dependent expected utility (RDEU) model.
Remark 2.
It is worth noting that Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) provided a regularized version of (13) and (16) based on a duality scheme. They showed that when is proper, upper-semicontinuous and satisfies for some , then
(18)
where . While this formula can be applied to general loss functions and used to verify the implication (ii) ⇒ (i) in both Theorems 3 and 4, it offers limited insight into its connection with the empirical nominal problem and typically requires additional analysis to establish a regularized empirical risk minimization formulation. In contrast, our Theorems 3 and 4 directly leverage the structure of regularized empirical risk minimization to clearly identify the necessary and sufficient conditions for its existence, where the necessary "impossibility" would be difficult to derive through general duality arguments alone.
We detail in Appendix B.1 that the regularized models (15) and (17) can be established based on (18) when is given in (ii) of Theorems 3 and 4, respectively. In particular, for the in (ii) of Theorem 3, although the regularization in (18) appears to depend on , we demonstrate that this regularization can, in fact, be reduced to a formulation independent of (Theorem 3). As will be shown, the calculation is lengthy, making the derivation from (18) somewhat cumbersome.
Remark 3.
Our regularization results offer deeper and more practically significant insights than Proposition 3.9 in Shafieezadeh-Abadeh et al. (2023), particularly regarding the required regularization term. While Proposition 3.9 suggests the need for a term of the form in (18), with the regularization raised to the power , this can create computational challenges, especially as . In contrast, our regularization result reveals that the regularization term is actually independent of , namely . This insight significantly enhances the computational tractability of solving the regularization problem. Moreover, the regularized formulations (17) in Theorem 4 are all convex programs in , with complexity comparable to nominal problems. In contrast, the right-hand side of (18) is nonconvex in and , and its objective function is more costly to evaluate. Thus, Theorem 4 can also be viewed as enabling a convex program solution to (18). Finally, the equivalency between Wasserstein DRO and max-sliced Wasserstein DRO established in the previous section implies that max-sliced Wasserstein DRO can likewise be solved efficiently via convex programs for the cases covered in Theorem 4.
Example 4 (Classification).
- (Higher-order hinge loss) Applying and setting , we have that the following classification problem with a higher-order hinge loss is equivalent to the regularization problem
- (Higher-order SVM) Applying and setting and , we have that the higher-order SVM classification problem is equivalent to the regularization problem
Example 5 (Regression).
- (Higher-order measure) Applying and setting and , we have that the regression with a higher-order measure is equivalent to the regularization problem
- (higher-order -insensitive measure) Applying and setting and , we have that the following regression problem with a higher-order -insensitive measure is equivalent to the regularization problem
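For the quadratic special case (order 2 with the squared loss), the known equivalence (Blanchet et al. (2019); see also Olea et al. (2022)) yields a square-root-LASSO type objective. The sketch below solves one such instance, using an ℓ1 penalty that corresponds to an ℓ∞ transport cost on the inputs; the exact relation between the penalty multiplier and the Wasserstein radius follows the references and is not reproduced here.

```python
import numpy as np
import cvxpy as cp

def sqrt_lasso(X, y, eps):
    """Square-root-LASSO form of the order-2 Wasserstein DRO regression with the
    squared loss (cf. Blanchet et al. (2019); Olea et al. (2022)):
    minimize sqrt((1/n) * sum_i (y_i - x_i'beta)^2) + eps * ||beta||_1.
    The l1 penalty corresponds to an l_inf transport cost on x; other costs
    give the corresponding dual norm."""
    n, d = X.shape
    beta = cp.Variable(d)
    rmse = cp.norm(y - X @ beta, 2) / np.sqrt(n)
    cp.Problem(cp.Minimize(rmse + eps * cp.norm(beta, 1))).solve()
    return beta.value
```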
Example 6 (Risk minimization).
- (Lower partial moments) Applying and setting , we have that the risk minimization with lower partial moments is equivalent to the regularization problem
Even though Theorem 4 highlights the impossibility of establishing an equivalence relation for a more general loss function , it can still be applied more broadly as a powerful foundation for deriving alternative equivalence relations for a richer family of measures. Specifically, there is a large family of measures that can be generally expressed in the following two forms:
(19)
for some loss function .
We show in the appendix (see Lemma 8) that for a wide range of loss functions , the following switching of and is valid
where
This, combined with Theorem 4, leads to the following.
Corollary 1.
For any and , let and be defined by (19) with . Take and .
Corollary 1 can accommodate cases where variance is used as the measure. Variance is not listed as an example in Section 2 because it has already been studied in Blanchet et al. (2022). However, our method for deriving the equivalency relation is different from, and significantly more general than, the approach in Blanchet et al. (2022). The following example illustrates the broader applicability of our approach.
Example 7.
The following example demonstrates how Corollary 1 can be applied to derive the regularization counterpart for higher moment risk measures in risk minimization, a previously unknown result.
Example 8 (Risk minimization). For , setting (i.e., with ), we have that the following problem of minimizing higher moment risk measures is equivalent to the regularization problem
4.2 The case of non-expected function with distortion
A powerful perspective in the risk measure literature, especially when studying non-expected functions such as Conditional Value-at-Risk (CVaR), is to view these functions as equivalent to taking an expectation with respect to a distorted probability distribution. The concept of distorted expectation is central to foundational theories, including Yaari’s dual utility theory (Yaari (1987)), Choquet Expected Utility (Schmeidler (1989)), and various business applications. In this section, we adopt this perspective to study the problem (13) with a distorted expectation:
(20)
where
is an "expectation" taken with respect to a distortion function.[1] Here, represents the left-quantile function of , i.e., for , and . The problem (20) encompasses (13) as a special case, since when , , and accommodates CVaR, since when , . A comprehensive list of risk measures with distorted expectation representations can be found in Wang et al. (2020) and Cai et al. (2023).

[1] Strictly speaking, a distortion function should additionally satisfy and be increasing, so that can be considered as a distorted expectation. However, our results do not hinge on these requirements.
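For an empirical distribution, a distorted expectation reduces to a weighted sum of order statistics, with weights given by increments of the distortion function. The sketch below assumes the convention ρ_h(X) = ∫_0^1 q_X(s) dh(s) with h increasing, h(0) = 0 and h(1) = 1, under which h(s) = max(0, (s − α)/(1 − α)) recovers CVaR_α and h(s) = s recovers the ordinary expectation; since the paper's displayed formula is not reproduced here, this orientation of the distortion is an assumption.

```python
import numpy as np

def distorted_expectation(x, h):
    """Distorted expectation of an empirical sample x, assuming the convention
    rho_h(X) = integral_0^1 q_X(s) dh(s) with h increasing, h(0) = 0, h(1) = 1
    (a convex h places more weight on the worst outcomes)."""
    x = np.sort(np.asarray(x, dtype=float))   # empirical quantiles, ascending
    n = len(x)
    grid = np.arange(n + 1) / n
    weights = np.diff(h(grid))                # h(i/n) - h((i-1)/n) for each atom
    return float(weights @ x)

alpha = 0.9
cvar_h = lambda s: np.maximum(0.0, (s - alpha) / (1.0 - alpha))   # recovers CVaR_alpha
mean_h = lambda s: s                                              # recovers the expectation
rng = np.random.default_rng(1)
sample = rng.standard_normal(50_000)
print(distorted_expectation(sample, cvar_h), distorted_expectation(sample, mean_h))
```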
It is intriguing to consider how the problem (20) might be solved, particularly whether a regularization counterpart, similar to that of the expected function (13), remains available. Our findings are surprisingly general: for any problem (20) with an increasing and convex distortion function , a regularization counterpart always exists for the type-1 Wasserstein ball. This similarity to the classical regularization result for (13) is particularly notable, given the broader applicability of (20). This insight even extends to the general type- Wasserstein ball. As we show below, the necessary and sufficient conditions for a regularization counterpart to exist are remarkably similar for both (20) and (13) in the general type- case.
Before presenting the result, we introduce the notation that for a function with and , and . For a convex distortion function with , we denote by the left-derivative of on , noting that may be infinity.
Theorem 5.
For , let be an increasing and convex distortion function satisfying and , and let be a convex function.
For , if , then for any , and , we have
For , if for all , then the following statements are equivalent.
(i) For any , and , there exists such that
(ii) The function takes one of the following two forms, multiplied by :
(a) or with some ;
(b) with some .
Interestingly, the “exact” class of loss functions that admit a regularization counterpart is the same as in the previous section, despite distorted expectations being significantly more general than expected functions. This result is particularly significant from a decision-theoretical perspective, as the formulation (20) closely resembles the celebrated rank-dependent expected utility (RDEU) model, known for resolving the paradox in expected utility (Quiggin (1982)). Given its importance, we highlight below the application of Theorem 5 in RDEU.
Example 9 (Rank-Dependent Expected Utility (RDEU)). The decision criterion of RDEU admits the form , where is a distortion function and is a disutility function. For decision makers who are risk-averse (Chew et al. (1987)), i.e., and are increasing convex with , we have
for any convex loss function with , , and .
We further demonstrate the practical applicability of the above theorem by showing how it can be directly applied to derive the regularization counterpart for the -support vector regression example.
Example 10 (Regression).
- (-support vector regression) For , let , . Applying and setting and , we have from Theorem 5 that the -support vector regression is equivalent to the regularization problem
Measures like the CVaR-Deviation example in risk minimization, as shown below, require a convex distortion function , which is not covered by the theorem above. We now show that when the loss function in (20) is linear, a regularization counterpart can also be found for any convex distortion function .
Proposition 2.
For , let be a convex function with and . Then for any , and , we have
Note first that applying Proposition 2 to classification problems is fairly straightforward, namely yielding the following equivalency:
This equivalency is readily applicable to specific cases, such as the -support vector machine.
Example 11 (Classification).
- (-support vector machine) Setting , with we have that the classification problem with -support vector machine is equivalent to the regularization problem
Proposition 2 can be further applied to general deviation measures defined by a distorted expectation, such as the CVaR-Deviation example in risk minimization. Observe that holds for any distortion function , where for . Moreover, is convex whenever is convex. Thus, we apply Proposition 2 and arrive at the following equivalency:
This leads us to our final example.
Example 12 (Risk minimization).
- (CVaR-Deviation) Setting , with we have that the risk minimization with CVaR-Deviation is equivalent to the regularization problem
Remark 4.
We offer some comments here on the recent work by Chu et al. (2024), which was released well after our work had been made available online. While Chu et al. (2024) initially set out to establish an equivalency relationship for a general setting of Wasserstein DRO – considering a general loss function and cost function as defined in (3) – their actual results do not extend much beyond our settings and findings in Section 4.1 (expected functions and extensions) and are limited to discussing sufficient conditions for the equivalency. Although some examples presented in Table 2 of Chu et al. (2024) are not covered in Section 4.1, we attribute this largely to our focus on cases that admit a final convex optimization formulation, rather than a limitation of our analysis. Indeed, we believe our analysis is foundational, reaching the very core of the equivalency problem. This is evident in our ability to prove not only sufficient conditions but also necessary ones for both the case of expected functions and general distorted expectations – an achievement that appears beyond the reach of Chu et al. (2024), as it inevitably hinges on Lemma 4 provided in our analysis. To substantiate our claim that our analysis is not limited to the examples presented in this paper, we demonstrate in Proposition 3 in Appendix B.2 how to generalize Theorem 5 (the case ) to non-convex loss functions , including the truncated pinball loss from Chu et al. (2024) as a special case.[2] As another example, it is not difficult to confirm that our findings in Proposition 1, Lemma 4, and all the results in Section 4 can be readily extended to a general Hilbert space , with the norm induced by its inner product. This extension covers the case of nonparametric scalar-on-function linear regression from Chu et al. (2024), where takes the forms outlined in our Theorem 4.[3]

[2] In Chu et al. (2024), the center of the Wasserstein ball is assumed to be the empirical distribution, which has a finite set of vectors as its support. Consequently, Proposition 3 is directly applicable.

[3] The expression can be interpreted as the inner product between and on , where is the space of all square integrable functions on . In this setting, the decision rule is the inner product for each decision vector, which is affine, and the Wasserstein ball , similar in form to (5), is now encompassed within the set of probability measures on .
5 Numerical Demonstration of Generalization Bounds
In this section, we seek to provide a further demonstration of the generalization result obtained in this paper. In particular, we numerically illustrate the decay rate of the Wasserstein radius for measures that are non-expected functions. We closely follow the experimental setup presented in Shafieezadeh-Abadeh et al. (2019), where the authors investigate the scaling behaviour of the smallest Wasserstein radius for the synthetic threenorm classification problem (Breiman (1996)). While Shafieezadeh-Abadeh et al. (2019) use the classical support vector machine for classification, in which the measure is an expected function defined by a hinge loss function , we employ the -support vector machine, where is represented by CVaR. Specifically, we consider the distributionally robust counterpart of the standard -support vector machine formulation,
(21)
By choosing the -norm as the norm on the input space in the transport cost and setting , as in Shafieezadeh-Abadeh et al. (2019), and following Proposition 2 or Example 11, we have the following regularized form of the Wasserstein robust -support vector machine problem:
(22)
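A hedged sketch of one plausible instantiation of (22) is given below: the -SVM criterion is written as the CVaR of the negative margin via the Rockafellar-Uryasev variational formula, and the Wasserstein robustification appears as an ℓ1 penalty on the weight vector (the dual of the ℓ∞ transport cost chosen above). The exact constants and normalization of (22) are not reproduced here, so the formulation should be treated as illustrative.

```python
import numpy as np
import cvxpy as cp

def dr_nu_svm(X, y, nu, eps):
    """One plausible reading of the regularized nu-SVM (22):
    minimize CVaR_{1-nu}( -y_i (w'x_i + b) ) + eps * ||w||_1,
    with CVaR expressed through its Rockafellar-Uryasev variational form.
    The l1 penalty assumes the l_inf transport cost on the inputs; constants
    and normalization are illustrative."""
    n, d = X.shape
    w, b, tau = cp.Variable(d), cp.Variable(), cp.Variable()
    margins = cp.multiply(y, X @ w + b)                       # y_i (w'x_i + b)
    cvar = tau + cp.sum(cp.pos(-margins - tau)) / (nu * n)    # CVaR_{1-nu} of the negative margin
    cp.Problem(cp.Minimize(cvar + eps * cp.norm(w, 1))).solve()
    return w.value, b.value
```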
The experiment is based on an output drawn uniformly from and a -dimensional input from a multivariate normal distribution. Specifically, if , then is drawn from a standard multivariate normal distribution shifted by or with equal probabilities, where . If , then is drawn from a standard multivariate normal distribution shifted by . We consider different sizes of training samples in the set as well as test samples. Each training sample size involves simulation trials.
Throughout the experiment, the search space of radius is chosen as . Similar to Shafieezadeh-Abadeh et al. (2019), we use the following three approaches to choose the smallest Wasserstein radius:
• Cross validation: For a set of training samples, denoted as , we partition them into subsets. We use one subset as the validation dataset and combine the remaining subsets as the training dataset. This results in pairs of validation and training datasets. We choose the radius in such that the average validation error over these pairs is the smallest. This operation is repeated across all trials, and we then report the average of the radii. (A code sketch of this selection rule follows the list.)
• Optimal radius: In each trial, we choose the radius in that has the smallest testing error and then report the average of the radii across all trials.
• Generalization bound: We choose the smallest radius such that the optimal value of (22) exceeds the value of the nominal problem on the test samples in at least of all 100 trials.
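The following sketch implements the cross-validation rule described above for a grid of candidate radii, using the misclassification rate as the validation error; the fold count, error metric, and the fitting routine passed in are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def select_radius_by_cv(X, y, radii, fit, n_folds=5, seed=0):
    """Grid-search the Wasserstein radius by K-fold cross-validation: for each
    candidate radius, average the validation misclassification rate over the
    folds and keep the minimizer. `fit(X, y, eps)` is any routine returning
    (w, b), e.g. lambda X, y, eps: dr_nu_svm(X, y, nu=0.5, eps=eps)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    avg_err = []
    for eps in radii:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w, b = fit(X[trn], y[trn], eps)
            errs.append(np.mean(np.sign(X[val] @ w + b) != y[val]))
        avg_err.append(np.mean(errs))
    return radii[int(np.argmin(avg_err))]
```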
In our initial experiments, we set . Figure 1 displays the selected Wasserstein radii in relation to various sample sizes for the three approaches above. Note that while the first two approaches determine the radius by averaging the radii induced by all simulation trials (potentially leading to a radius not necessarily in the set ), the third approach specifically selects a radius from based on the percentage criteria, utilizing the testing samples across all 100 trials. One can observe that the Wasserstein radii of all three approaches decay approximately as , which is consistent with the theoretical generalization bound of Theorem 1.
[Figure 1: Selected Wasserstein radii versus training sample size for the three selection approaches.]
It is natural to wonder whether varying choices of could affect the scaling behaviour. This becomes particularly intriguing when considering higher values of . However, the literature on the -support vector machine suggests that the optimal solution of (21) can become degenerate (i.e., equal to zero) when surpasses a specific threshold. We indeed observe that the experimental setup of Shafieezadeh-Abadeh et al. (2019) only allows us to generate meaningful solutions for lower values of . To address the potential degeneracy issue at higher values, we modified the method for generating simulated data. Specifically, for each sample , if , then is drawn from a standard multivariate normal distribution shifted by or with equal probabilities, where , whereas if , then is drawn from a standard multivariate normal distribution shifted by . Figure 2 contains three panels, each displaying the selected radii for a distinct level of : . Notably, the decay rate of the radius is remarkably similar across different levels of , approximately as . This is in line with our theoretical finding that the rate in general does not hinge on the specification of the measure .
[Figure 2: Selected Wasserstein radii versus training sample size, one panel for each level of .]
6 Beyond Affine Decision Rules
While our results thus far have focused on affine decision rules, the underlying principles of our approach to establishing generalization bounds for a broad class of measures can be extended to a wider range of decision rules. In this section, we illustrate this extension by deriving generalization bounds for the following Wasserstein DRO problem with a decision rule denoted by :
where with and is the type- Wasserstein ball defined by (5). Throughout this section, we assume that for some , with representing any given norm on .
The key to establishing generalization bounds applicable to a broad class of measures continues to lie in focusing on the projection set, now expressed in a more general form:
(23)
Although the exact equivalence of this set to a one-dimensional Wasserstein ball or the projection set of a max-sliced Wasserstein ball, as shown in Proposition 1, no longer holds in the general case, we can still establish generalization bounds by instead identifying a one-dimensional Wasserstein ball dominated by , i.e.
(24)
for some increasing function , where . Indeed, if this dominance relationship (24) can be established, one can then leverage the measure concentration property of a one-dimensional Wasserstein ball to determine the radius of this one-dimensional ball such that
with , where and , defined in Section 3, are the data-generating distribution and empirical distribution, respectively. This leads to the confidence bound:[4]
(25)
for a fixed .[5] Extending this to generalization bounds – i.e., the union bound of the above – is achievable using covering number techniques. We leave the exact technical details and derivations to Appendix C, and present here the final generalization bounds along with the necessary assumptions.

[4] One may wish to further gauge how tight (i.e., small) the Wasserstein in-sample risk, , is – e.g., relative to the empirical in-sample risk – when the radius is sufficiently small. To this end, we note that, under Assumption 5 and the condition that is Lipschitz continuous with a Lipschitz constant (a condition satisfied in the affine case), the relationship holds.

[5] Notably, Aolaritei et al. (2022) also studied the set inclusion property between the projection of a Wasserstein ball and a lower-dimensional Wasserstein ball. However, their focus was on establishing the existence of a lower-dimensional Wasserstein ball as a superset of the projection (i.e., the converse of (24)), which cannot be used to derive generalization bounds based on (Wasserstein) in-sample risk, i.e. .
Assumption 2.
There exists such that
Assumption 3.
The decision rule is continuous for each , and there exists a strictly increasing convex function such that for every , and ,
Assumption 4.
There exist and such that
Assumption 5.
There exist and such that
Assumption 2 is a fairly standard light-tail condition, necessary for invoking the measure concentration property of a one-dimensional Wasserstein ball; see Proposition 4 in Appendix C.4 for a sufficient condition for Assumption 2 for a class of decision rules . Assumption 3, however, is more notable, as it successfully characterizes decision rules that ensure the set inclusion (24) holds. Simply put, decision rules must exhibit sufficiently increasing and decreasing behaviour; otherwise, a Wasserstein ball dominated by the projection set would not exist. Assumptions 4 and 5 are regularity conditions required for constructing union bounds through covering number techniques.
Theorem 6.
The decay rate here, , maintains an order independent of dimensionality, mirroring the rate reported in Gao (2022). Although our rate may not be as rapid, due to its dependence on the function and the parameter , it is derived under less restrictive distributional assumptions and is valid for any type- Wasserstein ball. Most importantly, it accommodates a significantly broader class of measures . A notable class of applications for these bounds includes any regression problem and decision rule, with a rate that, as demonstrated below, is independent of the function .
Example 13 (Regression).
In the regression problem, the data-generating distribution satisfies and the decision rule has the form: , , where is the first component of and represents the remaining components, i.e., we use to predict the output . Suppose that the norm on satisfies . Then, Assumption 3 holds for any with . Indeed, we have for any and , and Hence, any decision rule for a regression problem satisfies Assumption 3 with .
We further illustrate the application of this example in the context of operations management. Specifically, the feature-based newsvendor problem (Zhang et al. (2024)), also known as the big-data newsvendor (Ban and Rudin (2018)), is a special case of regression problems.
Example 14 (Feature-based newsvendor).
In the newsvendor problem, the decision maker seeks the optimal order quantity for a product facing uncertain demand , with holding costs of and back-order costs of . In the world of big data, the decision maker has access to a feature vector , which can be leveraged to enhance ordering decisions based on this additional information. In this setting, the decision rule is defined as with , and the data-driven distributionally robust feature-based newsvendor problem has the form
Applying Example 13, we know that any decision rule of the feature-based newsvendor problem satisfies Assumption 3 with .
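As a hedged illustration, the empirical (nominal) feature-based newsvendor with a linear ordering rule can be written as the convex program below; the optional norm penalty is a regularized surrogate in the spirit of Section 4, and all symbols (holding cost, back-order cost, the linear rule) are illustrative stand-ins for the paper's notation.

```python
import numpy as np
import cvxpy as cp

def feature_based_newsvendor(X, d, h_cost, b_cost, eps=0.0):
    """Empirical feature-based newsvendor with a linear ordering rule q = w'x + w0:
    minimize (1/n) * sum_i [ h*(q_i - d_i)^+ + b*(d_i - q_i)^+ ]  (+ eps * ||w||_1).
    The optional norm penalty is an illustrative regularized surrogate;
    symbols are stand-ins for the paper's notation."""
    n, p = X.shape
    w, w0 = cp.Variable(p), cp.Variable()
    q = X @ w + w0
    cost = h_cost * cp.pos(q - d) + b_cost * cp.pos(d - q)
    cp.Problem(cp.Minimize(cp.sum(cost) / n + eps * cp.norm(w, 1))).solve()
    return w.value, w0.value
```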
The above generalization bound can be extended to an even broader class of decision rules when the measure satisfies monotonicity, i.e., whenever , a natural property in problems such as risk minimization. Specifically, we consider the following data-driven Wasserstein DRO problem:
where satisfies monotonicity, represents the empirical distribution of the component , and is defined by (8). In these cases, Assumption 3 can be relaxed, requiring only that each decision rule exhibits sufficiently increasing behavior.
Assumption 6.
The decision rule is continuous, and there exists a strictly increasing convex function such that for every ,
Theorem 7.
In particular, Assumption 6 accommodates non-linear decision rules, including quadratic and max-affine functions (cf. Esfahani and Kuhn (2018) and Example 16), which can be bounded below.
Example 15 (Quadratic function).
Define the decision variable as , where , , and is a positive semi-definite matrix. Let be the feasible set of such that , where is the spectral norm of and is defined as the largest absolute value of all eigenvalues. Define the norm on as Let for and . For this specific case, we first show that satisfies Assumption 4 with and Assumption 6 with . For and , we have
Hence, Assumption 4 holds with . For any , and , it follows from the orthogonal decomposition of , i.e., where is an orthogonal matrix and is a diagonal matrix, that
Hence, Assumption 6 holds with . Further, if we assume that and there exists such that , then one can verify that Assumption 2 holds by Proposition 4 in Appendix C.4.
To further emphasize its implications for operations management, we next illustrate how Theorem 7 applies to two-stage stochastic programs.
Example 16 (Two-stage stochastic program).
In the context of two-stage stochastic programs with right-hand-side uncertainty, the recourse function is defined as (Esfahani and Kuhn (2018)): where with , denotes the first-stage decision variable, and represents the recourse decision. As shown in Esfahani and Kuhn (2018), this function can be reformulated as a max-affine function where , are the vertices of the dual feasible set . Suppose . We show that for satisfies Assumption 4 with and and Assumption 6 with . For , we have
Hence, Assumption 4 holds with and . For any , and , we have
Hence, Assumption 6 holds with . Further, if we assume that and there exists such that with , then one can verify that Assumption 2 holds by Proposition 4 in Appendix C.4. We note that extending this discussion to cases where the matrix also represents a first-stage decision is straightforward, and verifying this extension closely mirrors our existing analysis, thus it is omitted here for brevity.
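To make the recourse structure concrete, the sketch below evaluates a generic second-stage LP with right-hand-side uncertainty; by LP duality its value equals the max-affine function over the vertices of the dual feasible set, as described in this example. All problem data in the sketch are illustrative assumptions rather than the paper's specification.

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(q, W, T, h_of_xi, x, xi):
    """Value of a generic second-stage LP with right-hand-side uncertainty,
    Q(x, xi) = min_y { q'y : W y >= h(xi) - T x, y >= 0 }.
    By LP duality this equals the max-affine function over the vertices of
    {v >= 0 : W'v <= q}, as described in Example 16. All data are illustrative."""
    rhs = h_of_xi(xi) - T @ x
    # linprog solves min c'y s.t. A_ub y <= b_ub, so flip the inequality sign
    res = linprog(c=q, A_ub=-W, b_ub=-rhs, bounds=[(0, None)] * len(q), method="highs")
    return res.fun if res.success else np.inf   # +inf if the second stage is infeasible
```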
Finally, we show that for problems such as regression or risk minimization, the “curse of the order ” – i.e., the dependence of the decay rate on – can be overcome for any non-linear decision rules covered by Assumption 6.
Theorem 8.
7 Conclusion
As ML and OR/MS applications increasingly adopt diverse decision criteria, identifying methods that ensure robust, data-driven solutions with strong generalization guarantees across a wide range of criteria is essential. This paper establishes the universality of Wasserstein DRO in achieving such guarantees. We show that Wasserstein DRO attains generalization bounds comparable to max-sliced Wasserstein DRO for any decision criterion under affine decision rules, even though the Wasserstein ball is a significantly less conservative ambiguity set. Furthermore, we extend these guarantees to general criteria and decision rules under light-tail assumptions, avoiding the curse of dimensionality. Our projection-based analysis of the Wasserstein ball offers key insights into why these guarantees require minimal reliance on specific properties of decision criteria.
Beyond developing generalization guarantees, our work emphasizes their practical significance for OR/MS problems. We demonstrate that these guarantees can often be achieved through regularization equivalents, as detailed in Theorems 3, 4, and 5, all of which are efficiently solvable as convex programs with complexity comparable to nominal problems. Importantly, our impossibility theorems identify precisely when solving the full Wasserstein DRO problem becomes necessary, defining the boundaries where simpler regularization approaches no longer suffice. These results deepen the theoretical foundation of Wasserstein DRO while charting a clear course for developing more efficient methods to address full Wasserstein DRO formulations.
References
- Abu-Mostafa et al. (2012) Abu-Mostafa, Y. S., Magdon-Ismail, M. and Lin, H. T. (2012). Learning from data. New York: AMLBook.
- Aolaritei et al. (2022) Aolaritei, L., Lanzetti, N., Chen, H. and Dörfler, F. (2022). Distributional uncertainty propagation via optimal transport. arXiv:2205.00343.
- Ban and Rudin (2018) Ban, G.-Y. and Rudin, C. (2018). The big data newsvendor: Practical insights from machine learning. Operations Research, 67(1), 90–108.
- Bartl et al. (2020) Bartl, D., Drapeau, S., Obloj, J. and Wiesel, J. (2020). Robust uncertainty sensitivity analysis. arXiv:2006.12022.
- Bawa (1975) Bawa, V. S. (1975). Optimal rules for ordering uncertain prospects. Journal of Financial Economics, 2(1), 95–1.
- Blanchet et al. (2022) Blanchet, J., Chen, L. and Zhou, X. (2022). Distributionally robust mean-variance portfolio selection with Wasserstein distances. Management Science, 68(9), 6382–6410.
- Blanchet and Kang (2021) Blanchet, J. and Kang, Y. (2021). Sample out-of-sample inference based on Wasserstein distance. Operations Research, 69(3), 985–1013.
- Blanchet et al. (2019) Blanchet, J., Kang, Y. and Murthy, K. (2019). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3), 830–857.
- Blanchet et al. (2022) Blanchet, J., Murthy, K. and Si, N. (2022). Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2), 295–315.
- Breiman (1996) Breiman, L. (1996). Bias, variance, and arcing classifiers (Technical Report 460). Statistics Department, University of California.
- Cai et al. (2023) Cai, J., Li, J. Y. M. and Mao, T. (2023). Distributionally robust optimization under distorted expectations. Operations Research, forthcoming.
- Chen and Paschalidis (2018) Chen, R. and Paschalidis, I. C. (2018). A Robust Learning Approach for Regression Models Based on Distributionally Robust Optimization. Journal of Machine Learning Research, 19(1), 1–48.
- Chen et al. (2011) Chen, L., He, S. and Zhang, S. (2011). Tight Bounds for Some Risk Measures, with Applications to Robust Portfolio Selection. Operations Research, 59(4), 847–865.
- Chew et al. (1987) Chew, H. C., Karni, E., and Safra, Z. (1987). Risk aversion in the theory of expected utility with rank dependent probabilities. Journal of Economic theory, 42(2), 370–381.
- Chu et al. (2024) Chu, H., Lin, M. and Toh, K. C. (2024). Wasserstein distributionally robust optimization and its tractable regularization formulations. arXiv:2402.03942.
- Drucker et al. (1997) Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. and Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 28(7), 779–784.
- Esfahani and Kuhn (2018) Esfahani, P. M. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1), 115–166.
- Fishburn (1977) Fishburn, P. C. (1977). Mean-risk analysis with risk associated with below target returns. American Economic Review, 67(2), 116–126.
- Gao (2022) Gao, R. (2022). Finite-Sample Guarantees for Wasserstein Distributionally Robust Optimization: Breaking the Curse of Dimensionality. Operations Research, 71(6), 2291–2306.
- Gao et al. (2017) Gao, R., Chen, X. and Kleywegt, A. J. (2017). Distributional robustness and regularization in statistical learning. arXiv:1712.06050.
- Gao et al. (2022) Gao, R., Chen, X. and Kleywegt, A. J. (2022). Wasserstein Distributionally Robust Optimization and Variation Regularization. Operations Research, 72(3), 1177–1191.
- Gotoh and Uryasev (2017) Gotoh, J. and Uryasev, S. (2017). Support vector machines based on convex risk functions and general norms. Annals of Operations Research, 249, 301–328.
- Krokhmal (2007) Krokhmal, P. (2007). Higher moment coherent risk measures. Quantitative Finance, 7(4), 373–387.
- Kuhn et al. (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A. and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: theory and applications in machine learning. INFORMS TutORials in Operations Research, 130–166.
- Lee and Mangasarian (2001) Lee, Y. J. and Mangasarian, O. L. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20, 5–22.
- Lee et al. (2005) Lee, Y. J., Hsieh, W. F. and Huang, C. M. (2005). $\varepsilon$-SSVR: A smooth support vector machine for $\varepsilon$-insensitive regression. IEEE Transactions on Knowledge and Data Engineering, 17(5), 678–685.
- Mohri et al. (2018) Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
- Olea et al. (2022) Olea, J. L. M., Rush, C., Velez, A. and Wiesel, J. (2022). The out-of-sample prediction error of the square-root-LASSO and related estimators. arXiv:2211.07608.
- Quiggin (1982) Quiggin, J. (1982). A theory of anticipated utility. Journal of Economic Behavior & Organization, 3(4), 323–343.
- Rockafellar and Uryasev (2002) Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26(7), 1443–1471.
- Rockafellar et al. (2008) Rockafellar, R. T., Uryasev, S. and Zabarankin, M. (2008). Risk tuning with generalized linear regression. Mathematics of Operations Research, 33(3), 712–729.
- Rockafellar and Uryasev (2013) Rockafellar, R. T. and Uryasev, S. (2013). The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2), 33–53.
- Schölkopf et al. (1998) Schölkopf, B., Bartlett, P., Smola, A. and Williamson, R. (1998). Support vector regression with automatic accuracy control. In ICANN 98; Springer: Heidelberg, Germany, 111–116.
- Schölkopf et al. (2000) Schölkopf, B., Smola, A. J., Williamson, R. C. and Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12(5), 1207–1245.
- Schmeidler (1989) Schmeidler, D. (1989). Subjective probability and expected utility without additivity. Econometrica, 57(3), 571–587.
- Shafieezadeh-Abadeh et al. (2023) Shafieezadeh-Abadeh, S., Aolaritei, L., Dörfler, F. and Kuhn, D. (2023). New perspectives on regularization and computation in optimal transport-based distributionally robust optimization. arXiv:2303.03900.
- Shafieezadeh-Abadeh et al. (2019) Shafieezadeh-Abadeh, S., Kuhn, D. and Esfahani, P. M. (2019). Regularization via mass transportation. Journal of Machine Learning Research, 20, 1–68.
- Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge.
- Sim et al. (2021) Sim, M., Zhao, L. and Zhou, M. (2021). Tractable robust supervised learning models. SSRN:3981205.
- Suykens and Vandewalle (1999) Suykens, J. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
- Volpi et al. (2018) Volpi, R., Namkoong, H., Sener, O., Duchi, J. C., Murino, V. and Savarese, S. (2018). Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 5334–5344.
- Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-asymptotic Viewpoint. Volume 48 (Cambridge University Press).
- Wang et al. (2020) Wang, R., Wei, Y. and Willmot, G. E. (2020). Characterization, robustness and aggregation of signed Choquet integrals. Mathematics of Operations Research, 45(3), 993–1015.
- Wozabal (2014) Wozabal, D. (2014). Robustifying convex risk measures for linear portfolios: A nonparametric approach. Operations Research, 62(6), 1302–1315.
- Yaari (1987) Yaari, M.E. (1987). The dual theory of choice under risk. Econometrica, 55(1), 95–115.
- Zhang et al. (2024) Zhang, L., Yang, J. and Gao, R. (2024). Optimal robust policy for feature-based newsvendor. Management Science, 70(4), 2315–2329.
Appendix A Proofs of Section 3
Before presenting the proofs of Section 3, we first introduce a lemma inspired by the optimal coupling (see, e.g., Theorem 4.1 of Villani (2009)); we thank a referee for bringing this to our attention. This lemma will be used throughout the proofs in the appendix.
For , a function is a cost function if for all . For a given cost function , define for two distributions on .
Lemma 1.
Let with be a lower semicontinuous cost function. Given two distributions on and , there exists such that . As a result, we have for ,
(26) |
Proof. Note that by the optimal coupling (see e.g., Theorem 4.1 of Villani (2009)), there exists and such that Since , by Theorem 8.17 of Kallenberg (2021), there exists such that , and thus, This completes the proof.
Consider the type- Wasserstein balls and defined by (5) and (8), respectively. By Lemma 1, for , it holds that
The two formulas above will be used frequently throughout the appendix, and we may apply them directly without referencing Lemma 1.
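The projection step in Lemma 1 is easy to explore numerically. Below is a minimal sketch (purely illustrative, not part of the proof) that, for a fixed affine decision rule, compares two joint empirical samples through the type-1 Wasserstein distance of their one-dimensional projections; the synthetic data, the chosen direction, and the use of scipy.stats.wasserstein_distance are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D type-1 Wasserstein distance

rng = np.random.default_rng(0)

# Illustrative joint samples (y, x) with y in {-1, +1} and x in R^3 (assumed data).
n, d = 200, 3
x1 = rng.normal(size=(n, d));           y1 = rng.choice([-1.0, 1.0], size=n)
x2 = rng.normal(0.3, 1.2, size=(n, d)); y2 = rng.choice([-1.0, 1.0], size=n)

# A fixed affine decision rule, projected as y * (<beta, x> + beta0) (assumed form).
beta, beta0 = np.array([1.0, -0.5, 0.2]), 0.1
proj1 = y1 * (x1 @ beta + beta0)
proj2 = y2 * (x2 @ beta + beta0)

# Type-1 Wasserstein distance between the one-dimensional projected distributions.
print("W1 of projections:", round(wasserstein_distance(proj1, proj2), 4))
```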
Proof of Proposition 1. Note that
where the first equality is due to Lemma 1, the second equality follows from the definition in (4) which implies almost surely for any satisfying , and the third equality follows from as . Additionally, it follows from Lemma 1 that
it holds that obviously. To see the converse inclusion, note that for any , there exists such that and Taking , we have and
Hence, we have , and thus . Therefore, we have . This completes the proof of (10). To show (11), denote by
We first verify that . With the aid of (10), we can verify this similarly to the proof of Theorem 5 in Mao et al. (2022). For completeness, we give a proof of here. The case of is trivial, and we assume now . Note that
(27) |
where the last equality follows from the definition in (4) and Lemma 1. On one hand, for , there exists such that and . Thus, we have
(28) |
where we use Hölder’s inequality in the first inequality. This means , and hence, . On the other hand, for , using Lemma 1 yields that there exists such that . It follows from the definition of dual norm that there exists such that and . Define
It holds that , and thus,
It follows from (A) that . Noting that and , we have . Hence, we conclude . This completes the proof of .
It remains to prove that . For , denote
and thus, For and using (A), there exists such that and . It holds that
This implies , and hence, . Suppose now . There exists such that and . It follows that
where the second inequality follows from . This implies
Hence, we have , which yields . This completes the proof. ∎
The proof of Theorem 1 relies on our projection result in Proposition 1 and a concentration result for the max-sliced Wasserstein distance in Theorem 3 of Olea et al. (2022), which is given as follows.
Lemma 2 (Theorem 3 of Olea et al. (2022)).
Let and . Suppose that satisfies for some and is defined in Theorem 1. Then, we have
where is the empirical distribution of
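As a rough numerical companion to Lemma 2, the sketch below estimates a max-sliced type-1 Wasserstein distance between an empirical sample and a large reference sample standing in for the underlying distribution, by maximizing the projected one-dimensional distance over randomly drawn unit directions; the Gaussian model, the sample sizes, and the crude random search over directions are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
d, n, n_ref, n_dirs = 5, 300, 20_000, 500

sample = rng.normal(size=(n, d))          # empirical sample (assumed Gaussian model)
reference = rng.normal(size=(n_ref, d))   # large sample standing in for the true law

def max_sliced_w1(a, b, n_dirs, rng):
    """Crude estimate: maximize the projected 1-D W1 distance over random unit directions."""
    best = 0.0
    for _ in range(n_dirs):
        u = rng.normal(size=a.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, wasserstein_distance(a @ u, b @ u))
    return best

print("estimated max-sliced W1:", round(max_sliced_w1(sample, reference, n_dirs, rng), 4))
```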
Proof of Theorem 1. Denote by , , where , and . It is clear that is an empirical distribution of . It holds that
where the second equality follows from by (ii) of Proposition 1 and we have used Lemma 2 in the last inequality. This completes the proof.
Proof of Theorem 2.
Let . By the definition of , it follows that
where the second inequality follows from (11) in Proposition 1, which states that
In particular, we have for any
(29) | ||||
(30) |
Therefore, we have
(31) |
where the first inequality is due to , the second inequality is due to Assumption 1, and the two equalities come from (29) and (30), respectively. Combining (A) and Theorem 1 with type- Wasserstein ball yields the desired result.
Appendix B Proofs of Section 4
For ease of exposition, we introduce the following notation. For a random variable , we define . We use to represent the indicator function, i.e., if , and otherwise. The sign function on is defined as
To prove the results in Section 4, we need the following auxiliary lemmas. The first lemma follows immediately from Proposition 1.
Lemma 3.
Given , we have
where and is the distribution of .
Based on Lemma 3, we present the following lemma, which will be used in the proofs of the main results.
Lemma 4.
Let and . For , the following statements are equivalent.
-
(i)
For any , and ,
(32) -
(ii)
For any and ,
(33)
Proof. (i) $\Rightarrow$ (ii): For , take with , and let be the distribution of . It holds that , and thus, the left-hand side of (32) reduces to
where the second equality follows from Lemma 3. Note that the right-hand side of (32) reduces to
Combining the above two equations with (32) yields (33). This completes the proof of the implication (i) $\Rightarrow$ (ii).
Remark 5.
B.1 Proofs of Section 4.1
Lemma 5.
Let , and . Define . We have for all .
Proof. Using Chebyshev's inequality, we have which implies for all . Hence, for any , it holds that
where the first inequality follows from Hölder’s inequality. This completes the proof. ∎
Proof of Theorem 3. Throughout the proof, we assume without loss of generality. By Lemma 4, it suffices to show that (ii) holds if and only if the following (i)’ holds.
-
(i)’
For any and , we have (33) holds, i.e.,
(34)
To see (ii)(i)’, note that and thus,
and
The inequality reduces to equality if . Hence, we complete the proof of (ii) $\Rightarrow$ (i)'.
To see (i)’(ii), we first show that . Otherwise, by that a convex function has derivative almost everywhere, there exists such that . If , then we have
If , then we have
Those two cases both yield a contradiction to (34).
Next, we aim to verify the following fact
(35) |
This will complete the proof since a convex function that satisfies (35) must be one of the forms of and with . To see (35), assume by contradiction that there exists such that . We first consider the case . In this case, we have
where the strict inequality follows from and , which yields a contradiction. Suppose now . Define
We have that for all by the convexity of , and thus,
where the second inequality follows from . By defining
(36) |
we can rewrite as with
Since and , we know that . Hence, it holds that
(37) |
where the second inequality follows from Lemma 5. Further note that
(38) |
where the first inequality follows from the definition of , and the second one holds because implies . Combining (37) and (B.1), we conclude that
This yields a contradiction. Hence, (35) is verified, and thus we complete the proof.
To prove Theorem 4, we begin by introducing two auxiliary lemmas. Specifically, Lemma 7 will be employed in the proof of Theorem 4, and Lemma 6 serves to establish Lemma 7.
Lemma 6.
For , and , the following problem has a positive optimal value.
(39) | ||||
Proof. Note that Problem (39) has two moment constraints. By Smith (1995), the optimal value of Problem (39) equals that of
(40) | ||||
where is the set of all random variables that take at most three values. Let be a feasible solution of Problem (40). Denote by , and let be a random variable such that
(41) |
with the conditions that , and . It is straightforward to check that and for all . It follows from Theorem 3.A.1 of Shaked and Shanthikumar (2007) that . (For two random variables and , is said to be smaller than in the concave order, denoted by , if for any concave function .) This implies that since is a concave function on . Therefore, is also a feasible solution of Problem (40). In addition, we have
Hence, to solve Problem (40), it suffices to consider all random variables whose distributions take the form of (41), i.e., with , and . The corresponding problem can be represented as follows:
(42) |
Note that whenever is a feasible solution of Problem (42), and thus, Problem (42) is equivalent to
Further, letting yields the following equivalent problem
(43) |
Let be a feasible solution of Problem (43). It holds that and and . Substituting into the second constraint of Problem (43) yields
which implies . Hence, the optimal value of Problem (43) is no less than which is a positive value. This completes the proof. ∎
Lemma 7.
Let , and . For , the following statements hold.
-
(i)
If , then .
-
(ii)
If and , then there exists that only depends on such that . In particular, if is an integer, then .
Proof. (i) Suppose that and denote by . It holds that and
where we have used Jensen’s inequality in the first inequality by noting that is concave on .
(ii) The case that is an integer can be verified by the following direct calculations:
where the inequality holds because implies for .
For the general case , it suffices to verify that the optimal value of the following problem is smaller than :
(44) |
We first assert that Problem (44) is equivalent to
(45) |
To see this, let be a feasible solution of Problem (44) with . Let be a uniform random variable on and for , define
It is straightforward to check that for all . Moreover, it holds that
(46) |
Note that the mapping is continuous as is continuous, and if . Combining with (46), there exists such that . Moreover, one can check that for any , for all . It follows from Theorem 2.5 of Bäuerle and Müller (2006) that . Hence, we have as is a concave function on . Therefore, Problem (44) is equivalent to Problem (45), which is further equivalent to
(47) |
where is a strictly concave function on . It remains to verify that the optimal value of Problem (47) is strictly less than .
To see this, let be a feasible solution of Problem (47). Define
where and satisfy with . Note that by Lemma 6, there exists that only depends on such that . By definition of and , this implies and . Therefore, we have
(48) |
Further note that and also note that is concave on . This implies
where the second inequality is due to the concavity of , and (48); and the strict inequality is due to and the strict concavity of . Noting that is independent of the random variable , this completes the proof.
∎
Proof of Theorem 4.
(ii) $\Rightarrow$ (i): By Lemma 4, it suffices to show that (33) holds for any and . For , we have
(49) |
If , then all the inequalities are equalities, and the maximizer can be chosen as with some such that . If a.s., then we take such that has distribution , and where is the left-quantile function of for all . We have and
Combining with (B.1), we verify that (33) holds with for all a.s. This completes the proof of the case . For , the proof is similar to that of . For , we have
(50) |
If , then one can check that all inequalities are equalities, and the maximizer can be chosen as with some such that . If a.s., then we have . Taking a sequence as shown in the case of , i.e., with distribution , it holds that for large enough such that ,
Combining with (B.1), we have that (33) holds with for all a.s. This completes the proof of the case . For , we have
where all inequalities can hold as equalities, and the maximizer can be chosen as for some such that . Hence, we conclude that (33) holds with and complete the proof of (ii) $\Rightarrow$ (i).
(i) $\Rightarrow$ (ii): The proof of this direction is in the same spirit as that of Theorem 3. Assume without loss of generality that . By Lemma 4, we have for any and ,
(51) |
By the same arguments as in the proof of Theorem 3, we can show that . Next, assume that is differentiable at when we use the notation , and we show the following facts.
(52) |
This will complete the proof since one can check that (52) implies that has one of the forms of , , and with . To see (52), we assume by contradiction that there exists such that and (note that implies ). Define
and it holds that for all as is nonnegative and convex. Further, noting that and , we have and
(53) |
Therefore,
(54) | ||||
(55) | ||||
where (54) holds because is nonnegative with , and (55) follows from (53). Recalling and defined by (36) in the proof of Theorem 3, we can rewrite with
By Lemma 5 and the definition of , we have
(56) |
It holds that
(57) |
where the strict inequality follows from (56) and Statement (ii) of Lemma 7 by noting that . For , we have
(58) |
where the second inequality follows from Statement (i) of Lemma 7, and the third inequality is due to the definition of . Combining (57) and (B.1), we have
which yields a contradiction to (51). Hence, (52) holds, which completes the proof.
As outlined in Remark 2, we detail below the steps to verify that for the function listed in (ii) of Theorems 3 and 4, the regularization version given by Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) can be reformulated as the corresponding regularization version in our framework. To be specific, Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) states that for that is proper, upper-semicontinuous and satisfies for some and all ,
where and is the dual norm of . When is given by (ii) of Theorem 3, we have
(59) |
when is given by (ii) of Theorem 4, we have
(60) |
Proof. We only give the proof of (59) for the case for some and the proof of (60) for for some and , as the other cases can be proved similarly.
In the case that , the function reduces to , where for and . We first show
(61) |
by considering the following two cases:
-
(i)
If , then we have where
For , by calculating the derivative of for , one can verify that
if , and
otherwise. For , since is increasing on , we have
Similarly, by calculating the derivative of for , we have
Note that . Comparing , and , it is straightforward to check that and , and hence,
-
(ii)
If , then by similar analysis, we obtain
Combining the above two cases, we have (61) holds. Substituting (61) into the left-hand side of (59) yields
This completes the proof of (59).
In the case that , that is, , we have
where and
It is straightforward to check that if . Hence, the constraint of can be replaced by . Let now and denote by . By considering the first-order condition, we have
and
where we use the convention that if . Comparing , and , we have
Denote by . Substituting the above equation into the left-hand side of (60) and noting that the constraint of can be replaced by yield
This completes the proof of (60). ∎
The proof of Corollary 1 relies on the following lemma and the regularization results in Theorem 4. Recall that
Lemma 8.
For any , and , the following two statements hold.
-
(i)
Suppose that is nonnegative on , and convex in with for all , and Lipschitz continuous in for all with a uniform Lipschitz constant, i.e., there exists such that
Then we have
-
(ii)
Suppose that is convex in with and for all , and Lipschitz continuous in for all with a uniform Lipschitz constant. Then we have
Proof. (i) First, we show three facts below. (a) is concave in for all ; (b) is convex in for all ; (c) for all . The fact (a) is trivial. For , and , it holds that
where the first step holds because is nonnegative, and the second follows from the triangle inequality. This implies (b). To see (c), it is obvious that . Note that Combining with and the convexity of in , we have . Hence, we conclude the proof of (c). Using (b) and (c), the set of all minimizers of the problem is a closed interval. Denote by . We will show that is a subset of a compact set. For any and , let and , and we have
(62) |
where the first and the third inequalities follow from the triangle inequality and Hölder’s inequality, respectively, and we have used the definition of the Wasserstein ball in the last step. Hence, it holds that
(63) |
Note that as . There exists such that for all . This, combined with (B.1), implies that
(64) |
Applying (63) and (64), we have . Using a minimax theorem (see e.g., Sion (1958)), it holds that
The converse direction is trivial. Hence, we complete the proof.
(ii) The proof is similar to (i). ∎
Proof of Corollary 1. (i) For the case with or in Theorem 4, one can verify that satisfies the conditions in Lemma 8 (ii). Hence, the equivalency result follows immediately from Lemma 8 and Theorem 4. For the case with or in Theorem 4, we assume without loss of generality that and . In this case, we have
B.2 Proofs of Section 4.2
To prove the results in Section 4.2, we list the definition of comonotonicity and some basic properties of distortion functionals as follows; see, e.g., Wang et al. (2020).
Definition 1 (Comonotonicity).
Two random variables and are said to be comonotonic if is distributionally equivalent to , where denotes the left-quantile function of , and is a random variable uniformly distributed on the interval (see e.g., Dhaene et al. (2002) for a discussion of comonotonic random variables).
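The quantile-based construction in Definition 1 is straightforward to reproduce on simulated data. The following sketch builds a comonotonic pair from a common uniform random variable and checks that Expected Shortfall (used here only as a concrete example of a functional that is additive for comonotonic risks, in line with property (iv) of Lemma 9 below) is approximately additive on the pair; the chosen distributions and the empirical Expected Shortfall formula are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Comonotonic pair: both variables are increasing transforms (left-quantiles) of one uniform U.
U = rng.uniform(size=n)
X = np.quantile(rng.normal(1.0, 2.0, size=n), U)      # quantile function of a normal sample
Y = np.quantile(rng.lognormal(0.0, 0.5, size=n), U)   # quantile function of a lognormal sample

def expected_shortfall(z, alpha=0.9):
    """Average of the worst (1 - alpha) fraction of outcomes (larger = less preferable)."""
    q = np.quantile(z, alpha)
    return z[z >= q].mean()

lhs = expected_shortfall(X + Y)
rhs = expected_shortfall(X) + expected_shortfall(Y)
print(f"ES(X + Y) = {lhs:.4f},  ES(X) + ES(Y) = {rhs:.4f}")  # approximately equal
```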
Lemma 9.
Let be a distortion function. The following statements hold.
-
(i)
is monotone, i.e., for a.s. if is increasing.
-
(ii)
is translation invariant, i.e., for any and .
-
(iii)
is positively homogeneous, i.e., for any and .
-
(iv)
is subadditive, i.e., for any and if is convex. The equality holds when and are comonotonic.
Proof of Theorem 5. We first consider the case . Assume without loss of generality that . By Lemma 4, it suffices to prove that for any and ,
To see it, first note that
(65) |
where the first inequality follows from and the monotonicity of , the second inequality follows from the subadditivity of , and the last inequality is due to Hölder’s inequality. Let us now verify the other direction. Suppose that where is the left-derivative of . Denote by a uniform random variable on such that and are comonotonic. For and , define
(66) |
One can check that . Moreover, we have
(67) |
where the inequality follows from the dual representation of (see e.g., Theorem 4.79 of Föllmer and Schied (2016)). Noting that must be uniformly bounded on for all as and are comonotonic, there exists such that on for all as is convex. Therefore, we have
where the inequality follows from the convexity of and on , and the convergence is due to . Note that is left continuous on (see e.g., Proposition A.4 of Föllmer and Schied (2016)). It holds that . Therefore, letting and combining with (B.2), we conclude that
Hence, we have verified the other direction for the case of . If , the proof is similar by letting . Hence, we complete the proof of the case .
Let us now focus on the case . In the rest of the proof, we assume and without loss of generality. By Lemma 4, it suffices to show that (ii) holds if and only if the following (i)’ holds.
-
(i)’
For any and , we have (33) holds, i.e.,
(68)
To see (ii) (i)’, we note that , where for . It is easy to see that is convex whenever is convex. Hence, (a) is a special case of Proposition 2, and we choose to omit its proof here for brevity. In the case of (b), we assume without loss of generality that for . Note that is an increasing and convex distortion function. By Lemma 9, we have for any with ,
(69) |
where the second inequality follows from Hölder’s inequality. Suppose that a.s. where is a uniform random variable on . Define One can verify that and . Therefore, we conclude that (68) holds. Hence, we complete the proof of (i)’.
To see (i)’ (ii), note that is increasing with . It holds that satisfies monotonicity and translation invariance, i.e., for all and . We first show that by contradiction. Otherwise, by that a convex function has derivative almost everywhere, there exists such that . In this case, define with quantile function as and one can easily check that and . Hence,
If , then define , and we have and . Hence,
Those two cases both yield a contradiction to (68).
Next, we aim to show
(70) |
as this will complete the proof since a convex function that satisfies (70) must be one of the forms of and with . To show (70), we assume by contradiction that there exists such that . If , then we have
where the strict inequality follows from and , which yields a contradiction. Suppose now . To yield a contradiction, it suffices to prove that
(71) |
Let be a uniformly distributed random variable, and define
We have that and the random variables in are all comonotonic. Define
(72) |
It follows from Chebyshev’s inequality that
(73) |
Define
We have that for all as is convex. Further, it holds that as and . Therefore,
(74) |
where the second inequality follows from , the fourth equality is due to the distribution-invariance of , and the fifth equality holds because implies that and are comonotonic. For , define
(75) |
We can rewrite with
Below we aim to demonstrate that for by selecting an appropriate . Note that
(76) |
where the second inequality follows from Hölder’s inequality and the definition of in (75), the last inequality is due to (73), and the last equality follows from the definition of in (72). Denote by
Recalling the definition of in (72) again, we have and , and hence, whenever . For , we have
where the first inequality follows from Hölder’s inequality. Hence, we have that . Combining with (B.2), we have verified (71). This completes the proof.
To substantiate our claim that our analysis is not limited to the examples presented in this paper, we present the following proposition to demonstrate that the arguments in the proof of Theorem 5 can be applied to establish the equivalence between Wasserstein DRO and regularization for a class of non-convex loss functions, when the support of the reference distribution is a finite set.
Proposition 3.
For , let be given in Theorem 5. If the support of is finite, and with satisfies for all or for all , then, for and
Proof. Without loss of generality, assume . We only give the proof for the case for , as the other case is similar. By Remark 5, it suffices to show that (33) holds with and where . For and , define and as (66). First note that both (B.2) and (B.2) hold for . Noting that the support of is finite, we denote it by . For , it holds that
Letting and combining this with (B.2), we conclude that
Hence, (33) holds with and , and thus we complete the proof. ∎
Proof of Proposition 2. By Lemma 4, it suffices to prove that
(77) |
By Lemma 9, we have for any ,
(78) |
where the second inequality follows from Hölder’s inequality. Suppose that a.s., where is a uniform random variable on (see Lemma A.28 of Föllmer and Schied (2016) for the existence of ). We show (77) by considering the following two cases.
-
(a)
If , then and , where the second equality follows from the fact that is increasing. We first assume that . Define a sequence such that . For all , it holds that , and and are comonotonic. Hence, we have
(79) Combining (78) and ((a)), we have verified (77). Assume now . We can construct a sequence as , and then follow the same analysis as the previous argument to verify the result.
- (b)
Combining (a) and (b), we complete the proof of (77), and thus complete the proof.
Appendix C Proofs of Section 6
Before presenting the proofs of the results in Section 6, we first highlight some key elements required in our analysis for the non-affine case. As stated in Section 6, while our analysis for the non-affine case continues to rely on a projection set perspective, the equivalence between the projection sets of a Wasserstein ball and a max-sliced Wasserstein ball no longer holds. Instead, we explore the possibility of first establishing a confidence bound for a fixed decision by identifying a set inclusion relationship with a one-dimensional Wasserstein ball and leveraging its measure concentration property (Lemma 10). We then extend these results to derive generalization bounds that hold uniformly across all using covering number techniques (see e.g., Gao (2022) and Section 3.5 of Mohri et al. (2018)). For , a -cover of , denoted by , is a subset of such that for each , there exists satisfying , where is a norm on . The covering number of with respect to is the smallest cardinality of such a -cover of .
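To make the covering-number device concrete, the following sketch constructs a $\delta$-cover of (a sampled approximation of) a Euclidean ball by a greedy procedure and compares its cardinality with the standard bound $(1+2r/\delta)^d$ invoked later via Example 5.8 of Wainwright (2019); the radius, the dimension, and the greedy construction are illustrative assumptions, not the cover used in the proofs.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, delta = 2, 1.0, 0.25

# A fine random cloud approximating the Euclidean ball of radius r in R^d.
m = 20_000
directions = rng.normal(size=(m, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = r * rng.uniform(size=(m, 1)) ** (1.0 / d)   # uniform-in-volume radii
pts = directions * radii

# Greedy delta-cover: keep a point, discard everything within distance delta, repeat.
cover = []
remaining = pts
while len(remaining) > 0:
    center = remaining[0]
    cover.append(center)
    remaining = remaining[np.linalg.norm(remaining - center, axis=1) > delta]

print("greedy cover size:", len(cover))
print("standard bound (1 + 2r/delta)^d:", (1 + 2 * r / delta) ** d)
```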
Lemma 10 (Theorem 2 of Fournier and Guillin (2015)).
Let and . For , suppose that for some and is the empirical distribution based on the independent sample drawn from . Then, we have
where
(80) |
and are positive constants that only depend on , , and .
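The measure-concentration phenomenon behind Lemma 10 can also be visualized directly. The sketch below simulates the one-dimensional type-1 Wasserstein distance between an empirical measure and a large reference sample standing in for the true distribution as the sample size grows; the Gaussian model and the Monte Carlo averaging are assumptions made purely to illustrate the decay in the sample size.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
reference = rng.normal(size=100_000)   # large sample standing in for the true distribution

for n in [50, 200, 800, 3200, 12800]:
    dists = [wasserstein_distance(rng.normal(size=n), reference) for _ in range(20)]
    print(f"n = {n:6d}   average W1(empirical, reference) = {np.mean(dists):.4f}")
```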
With these tools in place, we now proceed to the proofs for the generalization results in Section 6.
C.1 Proof of Theorem 6
We proceed in three steps. First, we verify the following set inclusion result:
(81) |
under Assumption 3, where and is the projection set defined by (23) in Section 6:
(82) |
Second, using the set inclusion result (81) and the concentration result for the one-dimensional Wasserstein ball (Lemma 10), we establish a confidence bound for a fixed . Finally, we apply covering number techniques to derive union bounds from the confidence bounds established in the second step, overcoming the challenge posed by the non-finiteness of the set .
Step 1: Proving the set inclusion result (81). We first give a lemma which will be used in the proof of the set inclusion result (81).
Lemma 11.
Let and be an increasing function with . If is convex (resp. concave), then the function is convex (resp. concave) on .
Proof. We provide the proof of convexity only for the case where is twice differentiable, as the concavity case is analogous and the general case can be established by approximating with a twice differentiable convex function. Denote by , . By standard calculations, we have
where we have denoted by , the inequality follows from and , means that and have the same sign, and comes from . By and the convexity of , we have that , that is, . It then follows that , that is, is convex. This completes the proof. ∎
Proof. For , let . There exists such that and . Our aim is to construct a random vector such that and as this implies that . Denote by and we have
(83) |
We first assert that there exist measurable mappings and such that
To see this, we apply a measurable selection theorem (Theorem 3.5 in Rieder (1978)) to show the existence of ; that of is similar. Define
where and . Further define
Let be the projection set of on the first argument. The first three conditions in Theorem 3.5 of Rieder (1978) are trivial in our setting. To see the last condition, for and , we have
Note that is a compact set and the continuity of implies that is a closed set. Hence, we conclude that is a compact set, which verifies the last condition in Theorem 3.5 of Rieder (1978). This implies the existence of a measurable maximizer . Denote by and . We define as
Since and are measurable, is measurable. Note that , and hence,
(84) |
By Assumption 3, we have
and
Note that is continuous. There exists a random variable taking values on such that
(85) |
Define . It follows from (85) that
(86) |
Moreover, we have
where the first inequality is due to (84); the second inequality follows from Jensen’s inequality and the concavity of the function by Lemma 11; and the last inequality follows from (83). Hence, . Combining this with (86) yields . This completes the proof. ∎
Step 2: Establishing confidence bounds for fixed . Based on Lemma 12 and applying Lemma 10, we are able to derive the following confidence bounds under Assumptions 2 and 3.
Lemma 13.
Proof.
Step 3: Union bounds for . Now we are ready to prove union bounds in Theorem 6.
Proof of Theorem 6. Note that for any and , we have
(87) |
where the three inequalities follow from Assumption 5, Assumption 4 and the triangle inequality for the -norm, respectively. Then, it holds that
(88) |
where the second inequality follows from (C.1), and the last inequality holds because for any with , which further implies
For and , it holds that implies . Hence, we have
(89) |
where the second inequality follows from Chebyshev’s inequality. Let and be an -cover of with respect to the norm . Denote by , and , where is defined by (80) in Lemma 10. It holds that
(90) | |||
(91) | |||
where the first inequality follows from (C.1), the last inequality is due to Lemma 13, and the second inequality holds because if the event in (90) happens with some , then there exists such that and
where the first and the last inequalities follow from (C.1) and (C.1), respectively, which implies that the event in (91) happens. Hence, we have
(92) |
We recall that with . Note that is contained in where is a unit ball of . For any , we have (see Example 5.8 of Wainwright (2019))
Let be defined in Theorem 6. Noting that is increasing and is decreasing, we obtain
Combining with (92), we complete the proof.
C.2 Proof of Theorem 7
The proof of Theorem 7 closely follows the three-step structure used for Theorem 6, with a key difference in the first step. Specifically, establishing the inclusion relation (81) becomes challenging because Assumption 6 is weaker than Assumption 3. However, as shown in the first step of the proof below, the “full” inclusion relation (81) is unnecessary; instead, only a partial inclusion relation is required. Specifically, the worst-case risk evaluated over the full Wasserstein ball coincides with that evaluated over a subset of the Wasserstein ball, and this subset can be shown to belong to the projection set. This enables the derivation of confidence bounds for fixed in the second step, and subsequently, the generalization bounds using the same covering number arguments as in Theorem 6.
Step 1. Proving partial set inclusion and coincidence of worst-case risk. For and , define a subset of a one-dimensional Wasserstein ball as
(93) |
where denotes the usual stochastic order (see Shaked and Shanthikumar (2007)), also known as first-order stochastic dominance, and is defined as: for two distributions on , if for all . We aim to show the following set inclusion result:
(94) |
where , is defined by (93), and
(95) |
Proof. Let . For , we have and . Let be a uniform random variable on such that almost surely (see Lemma A.28 of Föllmer and Schied (2016) for the existence of ). Let and . Since , we have . Moreover, it holds that
(96) |
where the inequality is due to . Using the similar arguments in the proof of Lemma 12 where a measurable selection theorem in Rieder (1978) has been used, we know that there exists a measurable mapping such that
Using Assumption 6, we have
Note that is continuous. There exists a random variable taking values on such that
(97) |
Define . By (97), we have . By , we have
where the second inequality follows from Jensen’s inequality and the concavity of the mapping by Lemma 11, and the last inequality is due to (96). Hence, we have , which implies . This completes the proof.∎
Below we demonstrate that the worst-case values of a monotone risk mapping are identical under and for any and .
Lemma 15.
Let and suppose that satisfies monotonicity. For and , we have
(98) |
Proof. Note that . It suffices to show
(99) |
For any , define for . It holds that for , and thus,
where the last inequality is due to . This means . This, together with , that is, , implies . By the monotonicity of , we have as . Therefore, we have (99) holds, and thus, we complete the proof. ∎
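The usual stochastic order used in this step is simple to check empirically. The sketch below compares two empirical distribution functions on a grid and confirms that a monotone functional (an empirical Expected Shortfall, chosen here only as an illustrative example) respects the order, mirroring the monotonicity argument in Lemma 15; the location-shifted Gaussian samples and the tolerance used in the check are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
sample1 = rng.normal(0.0, 1.0, size=n)   # sample from F1
sample2 = rng.normal(0.5, 1.0, size=n)   # sample from F2; F1 is dominated by F2

# Empirical check of first-order stochastic dominance: the CDF of F2 lies below that of F1.
grid = np.linspace(-4.0, 5.0, 200)
cdf1 = np.searchsorted(np.sort(sample1), grid, side="right") / n
cdf2 = np.searchsorted(np.sort(sample2), grid, side="right") / n
print("F1 <=_st F2 on the grid (up to sampling noise):", bool(np.all(cdf2 <= cdf1 + 1e-2)))

def expected_shortfall(z, alpha=0.9):
    q = np.quantile(z, alpha)
    return z[z >= q].mean()

# A monotone functional preserves the order: ES under F1 should not exceed ES under F2.
print("ES(F1) =", round(expected_shortfall(sample1), 3),
      " ES(F2) =", round(expected_shortfall(sample2), 3))
```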
Step 2: Establishing confidence bounds for fixed . Similar to Lemma 13 in Step 2 for Theorem 6, we can obtain the confidence bounds for fixed as stated in the following lemma.
Lemma 16.
Proof. Denote by and , where . Note that by Lemma 14, we have
where and Therefore, we have
(100) |
where the equality follows from Lemma 15 as is monotone. It then follows that
where the first inequality follows from (100). Substituting into the above inequalities, we have
where the last inequality follows from Lemma 10, similar to the proof of Lemma 13. This completes the proof. ∎
Step 3: Union bounds for . Now we are ready to prove the last step, union bounds, for Theorem 7.
C.3 Proof of Theorem 8
Proof of Theorem 8. Note that for
where the second inequality follows from Assumption 1 and the last inequality is due to the Lipschitz continuity of with respect to -norm, which implies . By symmetry, we conclude that
Hence, Assumption 5 holds with and being replaced by . Further, let , and it holds for any that
(101) |
where the inequality is due to Assumption 1. Then, we have the following chain of inequalities:
(102) |
where the first inequality is due to , the second inequality follows from Lemma 14 which states that , the second equality is due to Lemma 15, and the third inequality follows from (C.3). For , combining (C.3) with Lemma 10 yields the following confidence bound:
where is defined by (80). Note that Assumptions 2 and 3 hold, Assumption 4 holds with , and Assumption 5 holds with and being replaced by . The rest of the proof is similar to that of Theorem 6 by applying the covering number arguments.
C.4 A sufficient condition for Assumption 2
Below we present a sufficient condition for Assumption 2 that is more straightforward to verify.
Proposition 4.
Proposition 4 shows that Assumption 2 can be verified through a combination of a union-bound condition (103) on and a light-tail condition (104) on the data-generating distribution . It is worth noting that (103) holds if, when , Assumption 4 holds on and . To see this, note that by Assumption 4, we have
and thus, (103) holds with , and .
References
- Bäuerle and Müller (2006) Bäuerle, N. and Müller, A. (2006). Stochastic orders and risk measures: Consistency and bounds. Insurance: Mathematics and Economics, 38(1), 132–148.
- Bobkov and Ledoux (2019) Bobkov, S. and Ledoux, M. (2019). One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances. American Mathematical Society.
- Dhaene et al. (2002) Dhaene, J., Denuit, M., Goovaerts, M. J., Kaas, R. and Vyncke, D. (2002). The concept of comonotonicity in actuarial science and finance: Theory. Insurance: Mathematics and Economics, 31(1), 3–33.
- Föllmer and Schied (2016) Föllmer, H. and Schied, A. (2016). Stochastic Finance. An Introduction in Discrete Time. Fourth Edition. Walter de Gruyter, Berlin.
- Fournier and Guillin (2015) Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3), 707–738.
- Kallenberg (2021) Kallenberg, O. (2021). Foundations of Modern Probability. Third Edition. New York: Springer.
- Mao et al. (2022) Mao, T., Wang, R. and Wu, Q. (2022). Model aggregation for risk evaluation and robust optimization. arXiv:2201.06370.
- Rieder (1978) Rieder, U. (1978). Measurable selection theorems for optimization problems. manuscripta mathematica, 24(1), 115–131.
- Shaked and Shanthikumar (2007) Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer New York.
- Sion (1958) Sion, M. (1958). On general minimax theorems. Pacific Journal of Mathematics, 8(1), 171–176.
- Smith (1995) Smith, J. E. (1995). Generalized Chebychev inequalities: theory and applications in decision analysis. Operations Research, 43(5), 807–825.
- Villani (2009) Villani, C. (2009). Optimal Transport: Old and New. Berlin: Springer.
- Wang et al. (2020) Wang, R., Wei, Y. and Willmot, G. E. (2020). Characterization, robustness and aggregation of signed Choquet integrals. Mathematics of Operations Research, 45(3), 993–1015.