
On Generalization and Regularization via Wasserstein Distributionally Robust Optimization

Qinyu Wu Department of Statistics and Finance, University of Science and Technology of China, China. E-mail: [email protected]    Jonathan Yu-Meng Li Telfer School of Management, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada. E-mail: [email protected]    Tiantian Mao Department of Statistics and Finance, University of Science and Technology of China, China. E-mail: [email protected]
Abstract

Wasserstein distributionally robust optimization (DRO) has gained prominence in operations research and machine learning as a powerful method for achieving solutions with favorable out-of-sample performance. Two compelling explanations for its success are the generalization bounds derived from Wasserstein DRO and its equivalence to regularization schemes commonly used in machine learning. However, existing results on generalization bounds and regularization equivalence are largely limited to settings where the Wasserstein ball is of a specific type, and the decision criterion takes certain forms of expected functions. In this paper, we show that generalization bounds and regularization equivalence can be obtained in a significantly broader setting, where the Wasserstein ball is of a general type and the decision criterion accommodates any form, including general risk measures. This not only addresses important machine learning and operations management applications but also expands to general decision-theoretical frameworks previously unaddressed by Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a by-product, we show that Wasserstein DRO coincides with the recent max-sliced Wasserstein DRO for any decision criterion under affine decision rules – resulting in both being efficiently solvable as convex programs via our general regularization results. These general assurances provide a strong foundation for expanding the application of Wasserstein DRO across diverse domains of data-driven decision problems.

Key-words: distributionally robust optimization, Wasserstein metrics, finite-sample guarantees, regularization

1 Introduction

Stochastic optimization problems of the form

\inf_{\bm{\beta}\in\mathcal{D}}\rho(Y\cdot\bm{\beta}^{\top}\mathbf{X}) \qquad (1)

naturally arise in many machine learning (ML) and operations research (OR) applications, where $Y$ denotes a binary random variable, taking values from $\{-1,1\}$, and $\mathbf{X}$ is a random vector in $\mathbb{R}^{n}$. The function $f_{\bm{\beta}}(\mathbf{X})=\bm{\beta}^{\top}\mathbf{X}$, parameterized by the decision variable $\bm{\beta}\in\mathcal{D}\subseteq\mathbb{R}^{n}$, can generally be interpreted as an affine decision rule. In ML, the binary random variable $Y$ often represents outcomes in classification problems. When $Y$ is constant, the formulation (1) applies to a broad range of regression problems in ML and risk minimization problems in OR. The function $\rho$ represents a measure of risk, mapping a random variable to a real number that quantifies its risk. The notion of risk in this paper is broadly defined as the undesirability of a random variable $Z$; that is, a larger value of $\rho(Z)$ indicates that $Z$ is less preferable.

In this paper, we study the Wasserstein distributionally robust counterpart of (1), specifically:

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(F_{0},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}), \qquad (2)

where ${\cal B}_{p}(F_{0},\varepsilon)$ denotes a type-$p$ Wasserstein ball of radius $\varepsilon$ centred at a reference distribution $F_{0}$ (c.f. Kuhn et al. (2019)). The superscript $F$ in $\rho^{F}$ indicates that the measure depends on a joint distribution $F$ of $(Y,\mathbf{X})$. This problem has been extensively studied and applied in various fields when $\rho^{F}$ takes the form of an expected value function, i.e., $\rho^{F}(Z)=\mathbb{E}^{F}[\ell(Z)]$ for some $\ell:{\mathbb{R}}\rightarrow{\mathbb{R}}$ (see e.g., Kuhn et al. (2019); Shafieezadeh-Abadeh et al. (2019); Gao et al. (2017); Gao (2022)). This setup reveals two key findings that underscore the potential strength of the Wasserstein distributionally robust optimization (DRO) model in achieving strong out-of-sample performance. First, the model enjoys generalization bounds under mild assumptions (Shafieezadeh-Abadeh et al. (2019)). More notably, Gao (2022) shows that when the data-generating distribution satisfies transportation inequalities, $O(N^{-1/2})$-generalization bounds for the type-$p$ Wasserstein ball ${\cal B}_{p}(F_{0},\varepsilon)$ with $p\in[1,2]$ can be established, free from the curse of dimensionality. Second, the model has an exact regularization equivalent to the nominal problem (1) with an added regularizer on the decision variable $\bm{\beta}$, when the Wasserstein ball is of type-1 and the loss function $\ell$ is Lipschitz continuous (Shafieezadeh-Abadeh et al. (2019)), as well as when the ball is of type-2 and the loss function $\ell$ is quadratic (Blanchet et al. (2019)). This connection to the classical regularization scheme, prevalent in machine learning, offers a compelling interpretation of the DRO model and has spurred considerable interest in its application within ML and OR (see e.g., Blanchet and Kang (2021), Blanchet et al. (2019), Chen and Paschalidis (2018), Gao et al. (2017), Gao et al. (2022)).

It remains largely unclear, however, whether the generalization bounds established for the specific case of the expected value function can be achieved in the broader setting of (2) – namely, for a general measure $\rho$ and any type-$p$ Wasserstein ball ${\cal B}_{p}(F_{0},\varepsilon)$ with $p\in[1,\infty]$ – without suffering from the curse of dimensionality. In particular, the result of Gao (2022), aside from the limitations imposed by the transportation inequality assumption on the underlying distribution and the restriction to $p\in[1,2]$, fundamentally relies on the unbiased property of the sample average – a property that does not extend to risk measures such as Conditional Value-at-Risk (CVaR) (Rockafellar and Uryasev (2002)). It is also worth noting that, even in the literature on ML, the question of how to obtain generalization bounds, possibly through the classical regularization scheme, for the nominal problem (1) has largely been left open (c.f. Shalev-Shwartz and Ben-David (2014)). In this paper, we shed new light on the efficacy of Wasserstein DRO in broad out-of-sample tests by showing that the Wasserstein DRO model (2) can circumvent the curse of dimensionality for virtually any decision criterion under affine decision rules. As a crucial insight, we uncover that Wasserstein DRO remarkably coincides with the recent max-sliced Wasserstein DRO (Olea et al. (2022)) for any decision criterion, even though the ball defined in the former is a far less conservative choice – specifically, it is smaller and tighter than the one defined in the latter. Additionally, we extend our results to general decision rules under the light-tail assumption by leveraging the measure concentration property of a Wasserstein ball.

Another focus of this work is to investigate whether the key equivalence of Wasserstein DRO to classical regularization in ML can be extended to a more general setting in (2). To answer this, we first pay particular attention to the setting where the Wasserstein ball ${\cal B}_{p}(F_{0},\varepsilon)$ is of any type, i.e., $p\in[1,\infty]$, and the measure $\rho$ is an expected function $\rho^{F}(Z)=\mathbb{E}^{F}[\ell(Z)]$ with a loss function $\ell$ having a growth order $p$. We show that for many loss functions $\ell$ arising from practical applications, it is possible to establish an exact relation between the Wasserstein DRO model (2) and the classical regularization scheme in more general terms. Notably, we prove that no loss functions beyond those we identify satisfy this equivalence, offering a definitive answer to the extent to which Wasserstein DRO can be interpreted and solved from a regularization perspective. Our results also reveal the tractability of solving Wasserstein DRO for many non-Lipschitz continuous loss functions $\ell$ and higher-order Wasserstein balls ($p>1$). While Sim et al. (2021) approach such problems via approximation due to reformulation challenges, we show that exact solutions are possible through regularization reformulations. Moreover, we extend this equivalence to settings where $\rho$ goes beyond expected value functions. Previously, this exact relation was known only for variance (Blanchet et al. (2022)) or distortion risk measures (Wozabal (2014)). We expand it to a significantly broader class of measures, encompassing higher moment coherent risk measures (Krokhmal (2007)), criteria emerging more recently from OR and ML (Rockafellar and Uryasev (2013), Rockafellar et al. (2008), Gotoh and Uryasev (2017)), and notably, general decision-theoretical models such as the celebrated rank-dependent expected utility (RDEU) (Quiggin (1982)).

Related Work

From the perspective of generalization bounds. A series of works by Blanchet et al. (2019), Blanchet and Kang (2021), Blanchet et al. (2022) take a different approach to tackle the curse of dimensionality. They study the classical setting of Wasserstein DRO, where the measure $\rho$ is an expected function, and propose a radius selection rule for Wasserstein balls. They show the rule can be applied to build a confidence region of the optimal solution, and the radius can be chosen in the square-root order as the sample size goes to infinity. Although this allows for bypassing the curse of dimensionality, the bounds derived from the rule are only valid in the asymptotic sense. Blanchet et al. (2022) also take this approach to obtain generalization bounds for mean-variance portfolio selection problems. On the other hand, the generalization bounds established in this paper, like those in Shafieezadeh-Abadeh et al. (2019), Chen and Paschalidis (2018), and Gao (2022), break the curse of dimensionality in a non-asymptotic sense for finite samples.

From the perspective of the equivalency between Wasserstein DRO and regularization. While the focus of this work is on studying the exact equivalency between Wasserstein DRO and regularization, there is an active stream of works studying the asymptotic equivalence in the setting where the measure $\rho$ is an expected function (see Gao et al. (2017); Blanchet et al. (2019); Blanchet et al. (2022); Volpi et al. (2018); Bartl et al. (2020)). In particular, Gao et al. (2022) introduce the notion of variation regularization and show that for a broad class of loss functions, Wasserstein DRO is asymptotically equivalent to a variation regularization problem.

Recap of Major Contributions and Practical Implications

  1. We underscore the fundamental gap in achieving $O(N^{-1/2})$-generalization bounds for diverse decision criteria in Wasserstein DRO and novelly establish such guarantees by leveraging the projection set of the Wasserstein ball, revealing a broad equivalence with the recent max-sliced Wasserstein DRO for any decision criterion.

  2. By proving the precise conditions under which Wasserstein DRO has an exact regularization equivalent, we provide key insights into a wide range of OR/MS applications, highlighting when a simpler, intuitive, and computationally efficient regularization approach suffices to achieve the generalization guarantees of Wasserstein DRO – and when solving the full Wasserstein DRO formulation is necessary.

  3. By extending generalization bounds beyond affine decision rules, we uncover important OR/MS applications, such as foundational two-stage stochastic programs and the increasingly popular feature-based newsvendor problem, with generalization guarantees that scale effectively across diverse decision criteria.

2 Wasserstein Distributionally Robust Optimization Model

Let $(\Omega,\mathcal{A},\mathbb{P})$ be an atomless probability space. A random vector $\bm{\xi}$ is a measurable mapping from $\Omega$ to $\mathbb{R}^{n+1}$, $n\in\mathbb{N}$. Denote by $F_{\bm{\xi}}$ the distribution of $\bm{\xi}$ under $\mathbb{P}$. For $p\geqslant 1$, let $L^{p}:=L^{p}(\Omega,\mathcal{A},\mathbb{P})$ be the set of all random variables with a finite $p$th moment, and $\mathcal{M}_{p}(\Xi)$ be the set of all distributions on $\Xi\subseteq\mathbb{R}^{n+1}$ with finite $p$th moments in each component. Let $L^{\infty}:=L^{\infty}(\Omega,\mathcal{A},\mathbb{P})$ be the set of all bounded random variables, and $\mathcal{M}_{\infty}(\Xi)$ be the set of all distributions on $\Xi\subseteq\mathbb{R}^{n+1}$ with bounded support. Let $q$ denote the Hölder conjugate of $p$, i.e., $1/p+1/q=1$. We use $\mathrm{ess\mbox{-}sup}\,Z$ to represent the essential supremum of the random variable $Z$, i.e., $\mathrm{ess\mbox{-}sup}\,Z=\inf\{t:F_{Z}(t)\geqslant 1\}$. Recall that for any two distributions $F_{1},\,F_{2}\in\mathcal{M}_{p}(\Xi)$, the type-$p$ Wasserstein metric is defined as

W_{d,p}\left(F_{1},F_{2}\right)=\inf_{\pi\in\Pi(F_{1},F_{2})}\big(\mathbb{E}^{\pi}[d(\bm{\xi}_{1},\bm{\xi}_{2})^{p}]\big)^{1/p}, \qquad (3)

where $d(\cdot,\cdot):\Xi\times\Xi\to\mathbb{R}_{+}\cup\{\infty\}$ is a metric on $\Xi$. The set $\Pi(F_{1},F_{2})$ denotes the set of all joint distributions of $\bm{\xi}_{1}$ and $\bm{\xi}_{2}$ with marginal distributions $F_{1}$ and $F_{2}$, respectively. In this paper, we consider $\bm{\xi}=(Y,{\bf X})\in\Xi$ with $\Xi:=\{-1,1\}\times\mathbb{R}^{n}\subseteq\mathbb{R}^{n+1}$ and apply the type-$p$ Wasserstein metric (3) with $d(\bm{\xi}_{1},\bm{\xi}_{2})=d((Y_{1},\mathbf{X}_{1}),(Y_{2},\mathbf{X}_{2}))$, defined by the following additively separable form

d((Y_{1},\mathbf{X}_{1}),(Y_{2},\mathbf{X}_{2}))=\|\mathbf{X}_{1}-\mathbf{X}_{2}\|+\Theta(Y_{1}-Y_{2}), \qquad (4)

where $\|\cdot\|$ is any given norm on $\mathbb{R}^{n}$ with its dual norm $\|\cdot\|_{*}$ defined by $\|\mathbf{y}\|_{*}=\sup_{\|\mathbf{x}\|\leqslant 1}\mathbf{x}^{\top}\mathbf{y}$. The function $\Theta:\mathbb{R}\rightarrow\{0,\infty\}$ satisfies $\Theta(s)=0$ if $s=0$ and $\Theta(s)=\infty$ otherwise. Thus, the function (4) assigns an infinitely large cost to any discrepancy in $Y$, i.e., $Y_{1}-Y_{2}\neq 0$, and reduces to a general norm on $\mathbf{X}$ when there is no discrepancy in $Y$, i.e., $Y_{1}-Y_{2}=0$. Using this metric, for $F_{0}\in\mathcal{M}_{p}(\Xi)$ and $\varepsilon\geqslant 0$, we define the ball of distributions $\overline{{\cal B}}_{p}(F_{0},\varepsilon)$ as

\overline{{\cal B}}_{p}(F_{0},\varepsilon)=\left\{F\in\mathcal{M}_{p}(\Xi):W_{d,p}(F,F_{0})\leqslant\varepsilon\right\}, \qquad (5)

which we refer to as the type-$p$ Wasserstein ball throughout this paper.
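
As a concrete illustration of the metric (3) in the one-dimensional case that repeatedly appears below via projection, the type-$p$ Wasserstein distance between two equally sized, equally weighted empirical distributions on $\mathbb{R}$ reduces to matching sorted order statistics. A minimal numerical sketch (our own illustration; the function name and the synthetic data are ours):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Type-p Wasserstein distance between the empirical distributions of two
    equally sized samples on the real line (uniform weights): the optimal
    coupling simply matches sorted order statistics."""
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "this sketch assumes equal sample sizes"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
y = rng.normal(0.5, 1.0, size=1000)
print(wasserstein_1d(x, y, p=1), wasserstein_1d(x, y, p=2))
```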

In the remainder of this paper, we first demonstrate that the DRO problem (2), using the above definition of the Wasserstein ball:

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{{\cal B}}_{p}(F_{0},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}), \qquad (6)

can achieve generalization guarantees for any measure $\rho$ and type-$p$ Wasserstein ball. We then elucidate this guarantee for the widely used affine decision rules by exploring the connection between the optimization model (6) and the classical regularization scheme in Section 4. This framework covers various problem types, including:

  • Classification:

    $\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{{\cal B}}_{p}(F_{0},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X})$,
  • Regression:

    $\inf_{\bm{\beta}_{r}\in\mathcal{D}}\sup_{F\in\overline{{\cal B}}_{p}(F_{0},\varepsilon)}\rho^{F}((1,-\bm{\beta}_{r})^{\top}\mathbf{X})$,
  • Risk minimization:

    $\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{{\cal B}}_{p}(F_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{X})$,

where $F_{0}(\{Y=1\})=1$ in the cases of regression and risk minimization. These formulations accommodate numerous machine learning and operations research applications not yet addressed in the existing literature on Wasserstein DRO. Table 1 highlights examples considered in this paper.

Table 1: Examples of measures $\rho$

Classification
  - HO hinge loss: $\mathbb{E}[(1-Z)_{+}^{s}]$ (Lee and Mangasarian (2001), $s=2$)
  - HO SVM: $\mathbb{E}[|1-Z|^{s}]$ (Suykens and Vandewalle (1999), $s=2$)
  - $\nu$-SVM: ${\rm CVaR}_{\alpha}(-Z)$ (Schölkopf et al. (2000); Gotoh and Uryasev (2017))

Regression
  - HO regression: $\mathbb{E}[|Z|^{s}]$ (the well-known least-squares regression for $s=2$)
  - HO $c$-insensitive: $\mathbb{E}[(|Z|-r)^{s}_{+}]$ (Drucker et al. (1997), $s=1$; Lee et al. (2005), $s=2$)
  - $\nu$-SVR: ${\rm CVaR}_{\alpha}(|Z|)$ (Schölkopf et al. (1998))

Risk minimization
  - LPM: $\mathbb{E}[(Z-c)_{+}^{s}]$ (Bawa (1975); Fishburn (1977); Chen et al. (2011))
  - CVaR-Deviation: ${\rm CVaR}_{\alpha}(Z-\mathbb{E}[Z])$ (Rockafellar et al. (2008))
  - HM risk measure: $\inf_{t\in\mathbb{R}}\{t+k(\mathbb{E}[(Z-t)_{+}^{s}])^{1/s}\}$ (Krokhmal (2007))

Notes: HO: higher-order; SVM: support vector machine; SVR: support vector regression; LPM: lower partial moments; HM: higher moment; CVaR: conditional Value-at-Risk, ${\rm CVaR}_{\alpha}(Z)=\int_{\alpha}^{1}F_{Z}^{-1}(t)\mathrm{d}t/(1-\alpha)$, $Z\in L^{1}$, where $F_{Z}^{-1}$ is the left-quantile function of $Z$. The range of parameters: $s\geqslant 1$; $\alpha\in(0,1)$; $r\geqslant 0$; $c\in\mathbb{R}$; $k\geqslant 1$.

3 Generalization Bounds

Wasserstein DRO is typically applied in a data-driven setting, where the data-generating distribution $F^{*}$ of random variables $(Y,\mathbf{X})\in\Xi=\{-1,1\}\times{\mathbb{R}}^{n}$ is only partially observed through sample data points $(\widehat{y}_{i},\widehat{\mathbf{x}}_{i})$ for $i=1,...,N$, independently drawn from $F^{*}$. In this context, the empirical distribution $\widehat{F}_{N}$ is used as the reference distribution $F_{0}\in\mathcal{M}_{p}(\Xi)$ in Wasserstein DRO, where $\widehat{F}_{N}:=\frac{1}{N}\sum_{i=1}^{N}\delta_{(\widehat{y}_{i},\widehat{\mathbf{x}}_{i})}$ and $\delta_{\mathbf{x}}$ denotes the point-mass at $\mathbf{x}$. The central question is whether the (Wasserstein) in-sample risk

\sup_{F\in\overline{{\cal B}}_{p}(\widehat{F}_{N},\varepsilon_{N})}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}),

can provide an upper confidence bound on the out-of-sample risk

\rho^{F^{*}}(Y\cdot\bm{\beta}^{\top}\mathbf{X})

across all possible decisions $\bm{\beta}\in{\cal D}$, with a radius $\varepsilon_{N}$ that decays at a rate scaling gracefully with the dimensionality of the random vector $(Y,\mathbf{X})$ – in other words, generalization bounds that overcome the curse of dimensionality.

Current methods face significant limitations. While generalization bounds based on the measure concentration properties of type-$p$ Wasserstein balls offer flexibility in accommodating general risk measures $\rho$, they are constrained by the curse of dimensionality inherent in high-dimensional Wasserstein balls (see e.g., Esfahani and Kuhn (2018)). Approaches such as Gao (2022) address this issue by focusing on expected value functions, i.e., $\rho^{F}(Z)=\mathbb{E}^{F}[\ell(Z)]$, leveraging the concentration properties of sample averages. However, extending these methods beyond expected value functions proves challenging, as most risk measures lack analogous concentration properties.

We take a novel perspective to tackle this challenge by focusing on the structural properties of the projection set:

\overline{\mathcal{B}}_{p|\bm{\beta}}(F_{0},\varepsilon):=\{F_{Y\cdot\bm{\beta}^{\top}\mathbf{X}}:F_{(Y,\mathbf{X})}\in\overline{\mathcal{B}}_{p}(F_{0},\varepsilon)\}, \qquad (7)

where $F_{0}\in\mathcal{M}_{p}(\Xi)$, $\bm{\beta}\in\mathbb{R}^{n}$ and $\varepsilon\geqslant 0$. We uncover that this set coincides with two particularly revealing one-dimensional sets: a one-dimensional Wasserstein ball and the projection set of a high-dimensional max-sliced Wasserstein ball. To formalize this, for $G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n})$, we first recall the definition of a type-$p$ Wasserstein ball on $\mathbb{R}^{n}$ with the metric $d(\cdot,\cdot)$ being a norm:

{\cal B}_{p}(G_{0},\varepsilon)=\left\{G\in\mathcal{M}_{p}(\mathbb{R}^{n}):W_{p}(G,G_{0})\leqslant\varepsilon\right\}~~{\rm with}~~W_{p}(G,G_{0}):=\inf_{\pi\in\Pi(G,G_{0})}\left(\mathbb{E}^{\pi}[\|\bm{\xi}_{1}-\bm{\xi}_{2}\|^{p}]\right)^{1/p}, \qquad (8)

and a type-$p$ max-sliced Wasserstein ball (Olea et al. (2022)):

{\cal B}_{p}^{\rm ms}(G_{0},\varepsilon)=\left\{G\in\mathcal{M}_{p}(\mathbb{R}^{n}):\sup_{\bm{\beta}:\|\bm{\beta}\|_{*}=1}\inf_{\pi\in\Pi(G,G_{0})}\left(\mathbb{E}^{\pi}[|\bm{\beta}^{\top}\bm{\xi}_{1}-\bm{\beta}^{\top}\bm{\xi}_{2}|^{p}]\right)^{1/p}\leqslant\varepsilon\right\}. \qquad (9)

Here, $\|\cdot\|$ is the norm defined in (4) on $\mathbb{R}^{n}$. Without loss of generality, and to be consistent with the definition of the classic one-dimensional Wasserstein metric, for $n=1$, i.e., on $\mathbb{R}$, we assume that $\|\cdot\|=|\cdot|$ is the absolute-value norm.

With these definitions in place, we can now present the following equivalence results.

Proposition 1.

Suppose that $p\in[1,\infty]$, $\varepsilon\geqslant 0$, $\bm{\beta}\in\mathbb{R}^{n}$, $F_{0}\in\mathcal{M}_{p}(\Xi)$ and $(Y_{0},\mathbf{X}_{0})\sim F_{0}$. Then, we have

\{F_{Y{\bf X}}:F_{(Y,{\bf X})}\in\overline{\cal B}_{p}(F_{0},\varepsilon)\}={\cal B}_{p}(F_{Y_{0}{\bf X}_{0}},\varepsilon) \qquad (10)

and

\overline{\mathcal{B}}_{p|\bm{\beta}}(F_{0},\varepsilon)=\mathcal{B}_{p}\left(F_{Y_{0}\cdot\bm{\beta}^{\top}{\bf X}_{0}},\varepsilon\|\bm{\beta}\|_{*}\right)=\{F_{\bm{\beta}^{\top}\mathbf{Z}}:F_{\bf Z}\in\mathcal{B}_{p}^{\rm ms}(F_{Y_{0}\mathbf{X}_{0}},\varepsilon)\}, \qquad (11)

where $\overline{\mathcal{B}}_{p|\bm{\beta}}$, ${\cal B}_{p}$, and $\mathcal{B}_{p}^{\rm ms}$ are defined by (7), (8), and (9), respectively.

The equivalence established above has profound implications for both Wasserstein DRO and max-sliced Wasserstein DRO. It shows that the two frameworks coincide for any decision criterion $\rho$ under affine decision rules, both reducing to the problem of minimizing worst-case risk over a one-dimensional type-$p$ Wasserstein ball – an inherently more tractable problem, as detailed in Section 4. Notably, this equivalence reveals that, although the Wasserstein distance is a significantly stronger metric than the max-sliced Wasserstein distance, i.e., the type-$p$ Wasserstein ball is a much less conservative choice of an uncertainty set than its max-sliced counterpart (Olea et al. (2022)), Wasserstein DRO can still achieve the same generalization bounds. This insight directly leads to the following generalization bounds for general measures $\rho$, free from the curse of dimensionality.

Theorem 1.

Given $p\in[1,\infty)$, $\eta\in(0,1)$ and $\mathcal{D}\subseteq\mathbb{R}^{n}$, if $\Gamma:=\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{s}]<\infty$ for some $s>2p$, then we have

\mathbb{P}\left(\rho^{F^{*}}(Y\cdot\bm{\beta}^{\top}\mathbf{X})\leqslant\sup_{F\in\overline{{\cal B}}_{p}(\widehat{F}_{N},\varepsilon_{p,N}(\eta))}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}),~~\forall\bm{\beta}\in\mathcal{D}\right)\geqslant 1-\eta,

where $\varepsilon_{p,N}(\eta)^{p}=C\log(2N+1)^{p/s}/\sqrt{N}$ and

C=2^{p}p\left(180\sqrt{n+2}+\sqrt{2\log\left(\frac{3}{\eta}\right)}+\sqrt{\frac{3\Gamma}{\eta}}\frac{8}{s/2-p}\sqrt{\log\left(\frac{24}{\eta}\right)+2(n+2)}\right).

The decay rate of the radius is approximately of the order $N^{-1/(2p)}$. While this rate is unaffected by dimensionality, its dependency on the type of Wasserstein ball, specifically the parameter $p$, is noteworthy. This dependency, often overlooked in the literature compared to the curse of dimensionality, is potentially concerning, as it suggests that the radius may shrink very slowly for higher-order Wasserstein balls. This issue arises from the use of stronger metrics – specifically, norms raised to higher powers – when defining any ball constructed similarly to the Wasserstein ball, including the max-sliced Wasserstein ball (Olea et al. (2022)). Intuitively, a higher-order Wasserstein ball is smaller than a lower-order one, requiring a larger radius to ensure the data-generating distribution is covered with high probability. In the extreme case of a type-$\infty$ Wasserstein ball, it would never contain a data-generating distribution with unbounded support, regardless of the radius.
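
The radius prescribed by Theorem 1 is fully explicit, so its decay in $N$ and its dependence on $p$ can be inspected directly. A small sketch that evaluates $\varepsilon_{p,N}(\eta)$ from the formulas above (the function name and the illustrative parameter values are ours):

```python
import math

def radius_theorem1(p, N, eta, n, s, Gamma):
    """Radius eps_{p,N}(eta) from Theorem 1; requires s > 2p and eta in (0, 1)."""
    assert s > 2 * p and 0 < eta < 1
    C = (2 ** p) * p * (
        180 * math.sqrt(n + 2)
        + math.sqrt(2 * math.log(3 / eta))
        + math.sqrt(3 * Gamma / eta) * (8 / (s / 2 - p))
        * math.sqrt(math.log(24 / eta) + 2 * (n + 2))
    )
    eps_to_p = C * math.log(2 * N + 1) ** (p / s) / math.sqrt(N)  # this is eps^p
    return eps_to_p ** (1.0 / p)

# Decay in N for p = 1 and p = 2 (parameter values are purely illustrative).
for N in (10**2, 10**4, 10**6):
    print(N, radius_theorem1(1, N, 0.05, n=10, s=5, Gamma=3.0),
          radius_theorem1(2, N, 0.05, n=10, s=5, Gamma=3.0))
```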

We circumvent this issue, which we refer to as the “curse of the order $p$”, by recognizing that ensuring high-probability coverage of the data-generating distribution, while sufficient, is not necessary for developing generalization bounds. In fact, we show that even for the type-$\infty$ Wasserstein ball – where the probability of covering a data-generating distribution with unbounded support is effectively zero – the worst-case risk from the type-$\infty$ ball can still bound the out-of-sample risk of the data-generating distribution with high probability. This guarantee requires only the following mild assumption about the measure $\rho$.

Assumption 1.

There exists $\lambda\geqslant 1$ such that

\sup_{|V|\leqslant\lambda\varepsilon}\rho(Z+V)\geqslant\sup_{\mathbb{E}[|V|]\leqslant\varepsilon}\rho(Z+V),~~\forall Z\in L^{1},~\varepsilon\geqslant 0.

For any measure satisfying Assumption 1, one can leverage the generalization bounds for the type-1 Wasserstein ball to derive similar bounds for the type-$\infty$ Wasserstein ball, with a radius scaled by a constant $\lambda$. These bounds apply to any type-$p$ Wasserstein ball, for $p\in[1,\infty]$, resulting in generalization bounds where the decay rate of the radius is no longer dependent on the order $p$.

Theorem 2.

Let $p\in[1,\infty]$, $\eta\in(0,1)$ and $\mathcal{D}\subseteq\mathbb{R}^{n}$. Suppose that $\Gamma:=\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{s}]<\infty$ for some $s>2$ and Assumption 1 holds. Then, we have

\mathbb{P}\left(\rho^{F^{*}}(Y\cdot\bm{\beta}^{\top}\mathbf{X})\leqslant\sup_{F\in\overline{{\cal B}}_{p}(\widehat{F}_{N},\lambda\varepsilon_{1,N}(\eta))}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}),~~\forall\bm{\beta}\in\mathcal{D}\right)\geqslant 1-\eta,

where $\varepsilon_{1,N}(\eta)$ is defined in Theorem 1.

The core idea presented in this section can be extended beyond affine decision rules, as discussed in greater detail in Section 6. Here, we further illustrate the generality of Assumption 1 by providing additional examples.

Example 1 (Expected function).

Let $\rho(Z)=\mathbb{E}[\ell(Z)]$, where $\ell:\mathbb{R}\to\mathbb{R}$ is quasi-convex and Lipschitz continuous with Lipschitz constant ${\rm Lip}(\ell)$, and satisfies $|\ell^{\prime}(z)|\geqslant M>0$ at every differentiable point $z\in\mathbb{R}$. Since $\ell$ is quasi-convex, there exists $\alpha\in[-\infty,\infty]$ such that $\ell$ is decreasing on $(-\infty,\alpha)$ and increasing otherwise, and thus $\ell$ is differentiable almost everywhere. It follows that

\sup_{|V|\leqslant\varepsilon}\mathbb{E}[\ell(Z+V)]=\mathbb{E}[\ell(Z-\varepsilon)\mathds{1}_{\{Z<\alpha\}}]+\mathbb{E}[\ell(Z+\varepsilon)\mathds{1}_{\{Z\geqslant\alpha\}}]
\geqslant\mathbb{E}[(\ell(Z)+M\varepsilon)\mathds{1}_{\{Z<\alpha\}}]+\mathbb{E}[(\ell(Z)+M\varepsilon)\mathds{1}_{\{Z\geqslant\alpha\}}]=\mathbb{E}[\ell(Z)]+M\varepsilon.

This, together with $\sup_{\mathbb{E}[|V|]\leqslant\varepsilon}\mathbb{E}[\ell(Z+V)]\leqslant\sup_{\mathbb{E}[|V|]\leqslant\varepsilon}\mathbb{E}[\ell(Z)+{\rm Lip}(\ell)|V|]\leqslant\mathbb{E}[\ell(Z)]+{\rm Lip}(\ell)\varepsilon$ by the Lipschitz continuity of $\ell$, implies that $\rho$ satisfies Assumption 1 with $\lambda={\rm Lip}(\ell)/M$. In particular, the function $\ell$, defined as $\ell(x)=a(x-c)_{+}+b(x-c)_{-}$ with $a>0$, $b,c\in\mathbb{R}$, satisfies Assumption 1 with $\lambda=\max\{a,|b|\}/\min\{a,|b|\}$.

Example 2 (Risk measure).

Suppose that $\rho:L^{1}\to\mathbb{R}$ satisfies monotonicity and translation invariance, i.e., $\rho(Z_{1})\leqslant\rho(Z_{2})$ whenever $Z_{1}\leqslant Z_{2}$, and $\rho(Z+m)=\rho(Z)+m$ for all $Z\in L^{1}$ and $m\in\mathbb{R}$. Such functionals are referred to as monetary risk measures (see, e.g., Föllmer and Schied (2016)). Further, assume that $\rho$ is Lipschitz continuous with respect to the $L^{1}$-norm, i.e., there exists $M>0$ such that $|\rho(Z_{1})-\rho(Z_{2})|\leqslant M\mathbb{E}[|Z_{1}-Z_{2}|]$ for all $Z_{1},Z_{2}\in L^{1}$. Note that for any $\varepsilon>0$, $\sup_{\mathbb{E}[|V|]\leqslant\varepsilon}\rho(Z+V)\leqslant\sup_{\mathbb{E}[|V|]\leqslant\varepsilon}\{\rho(Z)+M\mathbb{E}[|V|]\}=\rho(Z)+M\varepsilon$. This, together with $\sup_{|V|\leqslant\varepsilon}\rho(Z+V)\geqslant\rho(Z+\varepsilon)=\rho(Z)+\varepsilon$, yields that $\rho$ satisfies Assumption 1 with $\lambda=M$. In particular, CVaR, as defined in Table 1 by ${\rm CVaR}_{\alpha}(Z)=\int_{\alpha}^{1}F_{Z}^{-1}(t)\mathrm{d}t/(1-\alpha)$, $Z\in L^{1}$, satisfies Assumption 1 with $\lambda=1/(1-\alpha)$, where $F_{Z}^{-1}$ is the quantile function of $Z$.

4 A Regularization Perspective

Regularization generally refers to any means applied to avoid overfitting and enhance generalization. In this regard, with the generalization guarantees established in the previous section, the Wasserstein DRO model (6) is well justified as a general regularization model. In this section, we provide further insights into the regularization effect of model (6) on the decision variable $\bm{\beta}$, offering a practically useful alternative interpretation. Specifically, we show that the previously observed equivalence between the Wasserstein DRO model and regularized empirical optimization in ML holds across much broader settings, while precisely identifying the boundary of its validity. This equivalence significantly broadens the range of both Wasserstein and max-sliced Wasserstein DRO problems that can now be solved efficiently through their regularization counterparts. Proposition 1 plays a key role in facilitating this analysis. By (10) in Proposition 1, the Wasserstein DRO model (6) can be recast as

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{Z}), \qquad (12)

where $G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n})$, and $\mathcal{B}_{p}(G_{0},\varepsilon)$ is a Wasserstein ball on $\mathbb{R}^{n}$ defined by (8).

4.1 The case of expected function

When $\rho$ is an expected function, the Wasserstein DRO model (12):

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[\ell(\bm{\beta}^{\top}\mathbf{Z})], \qquad (13)

is equivalent to the regularized model:

\inf_{\bm{\beta}\in\mathcal{D}}\left\{\mathbb{E}^{G_{0}}[\ell(\bm{\beta}^{\top}\mathbf{Z})]+{\rm Lip}(\ell)\,\varepsilon\|\bm{\beta}\|_{*}\right\}, \qquad (14)

for $p=1$ (the type-1 Wasserstein ball), where ${\rm Lip}(\ell)$ is the Lipschitz constant of $\ell$. For higher-order Wasserstein balls ($p>1$) or non-Lipschitz loss functions, this relationship is less understood, except for $p=2$ and $\ell(x)=x^{2}$ (Blanchet et al. (2019)). It remains an open question whether (13) can be tractably solved in these cases. Higher-order Wasserstein balls are less conservative than type-1 Wasserstein balls and offer practical appeal.
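
For instance, in the classical type-1 case, the regularized model (14) with the hinge loss (Lipschitz with ${\rm Lip}(\ell)=1$) is a small convex program. A minimal sketch using cvxpy on synthetic data (the data, the choice of the Euclidean norm, and the solver are our assumptions, not part of the paper):

```python
import cvxpy as cp
import numpy as np

# Synthetic classification data (illustrative): rows of X are features, y in {-1, +1}.
rng = np.random.default_rng(1)
N, n, eps = 200, 5, 0.1
X = rng.normal(size=(N, n))
y = np.sign(X @ np.ones(n) + 0.3 * rng.normal(size=N))

# Regularized model (14) with the hinge loss ell(z) = (1 - z)_+, Lip(ell) = 1,
# the Euclidean norm (self-dual), and G_0 the empirical distribution of Y*X.
beta = cp.Variable(n)
margins = cp.multiply(y, X @ beta)                     # Y * beta^T X per sample
objective = cp.sum(cp.pos(1 - margins)) / N + eps * cp.norm(beta, 2)
cp.Problem(cp.Minimize(objective)).solve()
print(beta.value)
```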

We show that an equivalence with a regularized model of the form (14) exists for $p>1$ if and only if $\ell$ is linear or takes the form of an absolute-value function.

Theorem 3.

Let $\ell:\mathbb{R}\to\mathbb{R}$ be a convex function. For $p\in(1,\infty]$, suppose that $\mathbb{E}[|\ell(Z)|]<\infty$ for all $Z\in L^{p}$. Then the following statements are equivalent.

  • (i)

    There exists $C>0$ such that for any $G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n})$, $\varepsilon\geqslant 0$ and $\mathcal{D}\subseteq\mathbb{R}^{n}$, it holds that

    \inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[\ell(\bm{\beta}^{\top}\mathbf{Z})]=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\mathbb{E}^{G_{0}}[\ell(\bm{\beta}^{\top}\mathbf{Z})]+C\varepsilon\|\bm{\beta}\|_{*}\right\}. \qquad (15)
  • (ii)

    The function $\ell$ takes one of the following two forms, multiplied by $C$:

    • (a)

      $\ell_{1}(x)=x+b$ or $\ell_{1}(x)=-x+b$ with some $b\in\mathbb{R}$;

    • (b)

      $\ell_{2}(x)=|x-b_{1}|+b_{2}$ with some $b_{1},b_{2}\in\mathbb{R}$.

This result, which holds for any type-$p$ Wasserstein ball, is somewhat surprising as the Wasserstein DRO model is equivalent to the same regularized model, regardless of the order $p$. This equivalence occurs when the slope of the loss function $\ell$ takes values only from a constant and its negative. It further indicates that no regularized model in the form of (14) can be derived for any other loss function $\ell$. In other words, if there is an equivalence between the Wasserstein DRO model (13) and a regularized model for another loss function, the regularized model must take a different form from (14). Before discussing other forms of regularization, it is worth noting the following application of this result in regression.

Example 3.

(Regression)

- (Least absolute deviation (LAD) regression) Applying $\ell_{2}(x)=C|x-b_{1}|+b_{2}$ in Theorem 3 and setting $b_{1}=0$, $b_{2}=0$, and $C=1$, we arrive at the distributionally robust counterpart of the least absolute deviation regression, i.e., $\inf_{\bm{\beta}_{r}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|]$. It is equivalent to $\inf_{\bm{\beta}_{r}\in\mathcal{D}}\{\mathbb{E}^{G_{0}}[|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|]+\varepsilon\|(1,-\bm{\beta}_{r})\|_{*}\}$ for any $p\geqslant 1$.
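
A sketch of the regularized LAD formulation from the example above, again with cvxpy and synthetic data (our illustration; the Euclidean norm, so that $\|\cdot\|_{*}$ is also the Euclidean norm, is an assumption):

```python
import cvxpy as cp
import numpy as np

# Illustrative data: the first coordinate of each observation plays the role of the
# response and the remaining n-1 coordinates are regressors, matching (1, -beta_r)^T X.
rng = np.random.default_rng(2)
N, n, eps = 300, 6, 0.05
Z = rng.normal(size=(N, n - 1))
resp = Z @ np.arange(1, n) + 0.5 * rng.standard_t(df=3, size=N)
X = np.column_stack([resp, Z])

beta_r = cp.Variable(n - 1)
residuals = X[:, 0] - X[:, 1:] @ beta_r
# Regularized LAD: empirical mean absolute deviation + eps * ||(1, -beta_r)||_2,
# valid for every order p of the Wasserstein ball by Example 3.
reg = eps * cp.norm(cp.hstack([np.ones(1), -beta_r]), 2)
cp.Problem(cp.Minimize(cp.sum(cp.abs(residuals)) / N + reg)).solve()
print(beta_r.value)
```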

We now explore whether an alternative form of regularization on $\bm{\beta}$ exists that is equivalent to the Wasserstein DRO model (13). Specifically, we focus on cases where the loss function $\ell$ is not Lipschitz continuous, such as higher-order loss functions arising from the examples presented in Table 1. It is known that the worst-case expectation problem based on the type-1 Wasserstein ball can become unbounded when the loss function is not Lipschitz continuous. To address cases beyond Lipschitz continuity, we consider the following formulation of the Wasserstein DRO model (13):

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z})], \qquad (16)

where $\ell^{p}$ denotes the function $\ell$ raised to the power of $p>1$, matching the order of the type-$p$ Wasserstein ball. Below, we present an alternative equivalency relationship and specify the exact cases where it holds.

Theorem 4.

Let $\ell:\mathbb{R}\to\mathbb{R}_{+}$ be a Lipschitz continuous and convex function. For any $p\in(1,\infty)$, the following statements are equivalent.

  • (i)

    There exists $C>0$ such that for any $G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n})$, $\varepsilon\geqslant 0$ and $\mathcal{D}\subseteq\mathbb{R}^{n}$, it holds that

    \inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z})]=\inf_{\bm{\beta}\in\mathcal{D}}\left(\left(\mathbb{E}^{G_{0}}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z})]\right)^{1/p}+C\varepsilon\|\bm{\beta}\|_{*}\right)^{p}. \qquad (17)
  • (ii)

    The function $\ell$ takes one of the following four forms, multiplied by $C$:

    • (a)

      $\ell_{1}(x)=(x-b)_{+}$ with some $b\in\mathbb{R}$;

    • (b)

      $\ell_{2}(x)=(x-b)_{-}$ with some $b\in\mathbb{R}$;

    • (c)

      $\ell_{3}(x)=(|x-b_{1}|-b_{2})_{+}$ with some $b_{1}\in\mathbb{R}$ and $b_{2}\geqslant 0$;

    • (d)

      $\ell_{4}(x)=|x-b_{1}|+b_{2}$ with some $b_{1}\in\mathbb{R}$ and $b_{2}>0$.

This result is also an “impossibility” theorem, providing a definitive answer to whether the equivalence relationship can hold more broadly for other loss functions. It settles any attempts to establish this equivalence for other convex Lipschitz continuous functions $\ell:\mathbb{R}\rightarrow\mathbb{R}_{+}$. This theorem is fundamentally important to the study of Wasserstein DRO, especially given ongoing efforts to understand its relationship with a classical regularization perspective. It shows precisely the limits of how far this perspective can be extended.

Remark 1.

As previously discussed, Wasserstein DRO and empirical risk regularization are two widely used and competing approaches for data-driven decision-making in ML and OR/MS. Wasserstein DRO is valued for its universality, offering simple generalization guarantees under minimal assumptions, such as not relying on notions of hypothesis complexity (Shafieezadeh-Abadeh et al. (2019)), while empirical risk regularization is predominantly embraced by practitioners as a simple yet powerful heuristic, despite its desirable theoretical properties (Shafieezadeh-Abadeh et al. (2019); Abu-Mostafa et al. (2012)). While we have expanded the universality of Wasserstein DRO by establishing generalization guarantees across diverse decision criteria, free from the curse of dimensionality, a natural managerial question arises: can simpler, more intuitive heuristics, such as empirical risk regularization, achieve the same generalization guarantees? Our equivalency results provide a positive answer for many applications by specifying the exact form of regularization required. Conversely, our impossibility theorem identifies when Wasserstein DRO becomes essential for decision problems beyond these cases. To underscore the importance of our regularization results for OR/MS applications, we present below an extensive series of examples, including those highlighted in Table 1 and the celebrated rank-dependent expected utility (RDEU) model.

Remark 2.

It is worth noting that Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) provided a regularized version of (13) and (16) based on a duality scheme. They showed that when $\ell$ is proper, upper-semicontinuous and satisfies $\ell(x)\leqslant C(1+|x|^{p})$, $x\in\mathbb{R}$, for some $C>0$, then

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[\ell(\bm{\beta}^{\top}\mathbf{Z})]=\inf_{\bm{\beta}\in\mathcal{D},\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[\ell_{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda)]+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}, \qquad (18)

where $\ell_{p}(x,\lambda)=\sup_{y\in\mathbb{R}}\{\ell(y)-\lambda|y-x|^{p}\}$. While this formula can be applied to general loss functions and used to verify the implication (ii) $\Rightarrow$ (i) in both Theorems 3 and 4, it offers limited insight into its connection with the empirical nominal problem and typically requires additional analysis to establish a regularized empirical risk minimization formulation. In contrast, our Theorems 3 and 4 directly leverage the structure of regularized empirical risk minimization to clearly identify the necessary and sufficient conditions for its existence, where the necessary “impossibility” would be difficult to derive through general duality arguments alone.

We detail in Appendix B.1 that the regularized models (15) and (17) can be established based on (18) when $\ell$ is given in (ii) of Theorems 3 and 4, respectively. In particular, for the $\ell$ in (ii) of Theorem 3, although the regularization in (18) appears to depend on $p$, we demonstrate that this regularization can, in fact, be reduced to a formulation independent of $p$ (Theorem 3). As will be shown, the calculation is lengthy, making the derivation from (18) somewhat cumbersome.

Remark 3.

Our regularization results offer deeper and more practically significant insights than Proposition 3.9 in Shafieezadeh-Abadeh et al. (2023), particularly regarding the required regularization term. While Proposition 3.9 suggests the need for a term of the form $\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}$ in (18), with the regularization raised to the power $p$, this can create computational challenges, especially as $p\rightarrow\infty$. In contrast, our regularization result reveals that the regularization term is actually independent of $p$, namely $C\varepsilon\|\bm{\beta}\|_{*}$. This insight significantly enhances the computational tractability of solving the regularization problem. Moreover, the regularized formulations (17) in Theorem 4 are all convex programs in $\bm{\beta}$, with complexity comparable to nominal problems. In contrast, the right-hand side of (18) is nonconvex in $\bm{\beta}$ and $\lambda$, and its objective function $\ell_{p}(x,\lambda)$ is more costly to evaluate. Thus, Theorem 4 can also be viewed as enabling a convex program solution to (18). Finally, the equivalency between Wasserstein DRO and max-sliced Wasserstein DRO established in the previous section implies that max-sliced Wasserstein DRO can likewise be solved efficiently via convex programs for the cases covered in Theorem 4.

Example 4.

(Classification)

- (Higher-order hinge loss) Applying $\ell_{2}(x)=(x-b)_{-}$ and setting $b=1$, we have that the following classification problem with a higher-order hinge loss $\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{\cal B}_{p}(F_{0},\varepsilon)}\mathbb{E}^{F}[(1-Y\cdot\bm{\beta}^{\top}\mathbf{X})_{+}^{p}]$ is equivalent to the regularization problem $\inf_{\bm{\beta}\in\mathcal{D}}((\mathbb{E}^{F_{0}}[(1-Y\cdot\bm{\beta}^{\top}\mathbf{X})_{+}^{p}])^{1/p}+\varepsilon\|\bm{\beta}\|_{*})^{p}$.

- (Higher-order SVM) Applying $\ell_{3}(x)=(|x-b_{1}|-b_{2})_{+}$ and setting $b_{1}=1$ and $b_{2}=0$, we have that the higher-order SVM classification problem $\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{\cal B}_{p}(F_{0},\varepsilon)}\mathbb{E}^{F}[|1-Y\cdot\bm{\beta}^{\top}\mathbf{X}|^{p}]$ is equivalent to the regularization problem $\inf_{\bm{\beta}\in\mathcal{D}}((\mathbb{E}^{F_{0}}[|1-Y\cdot\bm{\beta}^{\top}\mathbf{X}|^{p}])^{1/p}+\varepsilon\|\bm{\beta}\|_{*})^{p}$.
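
Because the outer $p$-th power in these regularized objectives is monotone, it suffices to minimize the base expression, which is a convex (DCP-compliant) program. A sketch for the higher-order hinge loss with $p=2$ (synthetic data and the use of cvxpy with the Euclidean norm are our choices):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
N, n, p, eps = 200, 5, 2, 0.1
X = rng.normal(size=(N, n))
y = np.sign(X @ np.ones(n) + 0.3 * rng.normal(size=N))

beta = cp.Variable(n)
margins = cp.multiply(y, X @ beta)
# (E_{F0}[(1 - Y beta^T X)_+^p])^{1/p} written as a scaled L^p norm of the losses.
hinge_lp = cp.norm(cp.pos(1 - margins), p) / N ** (1 / p)
# Minimizing the base of the p-th power is enough, since x -> x^p is monotone on R_+.
cp.Problem(cp.Minimize(hinge_lp + eps * cp.norm(beta, 2))).solve()
print(beta.value)
```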

Example 5.

(Regression)

- (Higher-order measure) Applying $\ell_{3}(x)=(|x-b_{1}|-b_{2})_{+}$ and setting $b_{1}=0$ and $b_{2}=0$, we have that the regression with a higher-order measure $\inf_{\bm{\beta}_{r}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|^{p}]$ is equivalent to the regularization problem $\inf_{\bm{\beta}_{r}\in\mathcal{D}}\left(\left(\mathbb{E}^{G_{0}}[|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|^{p}]\right)^{1/p}+\varepsilon\|(1,-\bm{\beta}_{r})\|_{*}\right)^{p}$.

- (Higher-order $c$-insensitive measure) Applying $\ell_{3}(x)=(|x-b_{1}|-b_{2})_{+}$ and setting $b_{1}=0$ and $b_{2}=c$, we have that the following regression problem with a higher-order $c$-insensitive measure $\inf_{\bm{\beta}_{r}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[(|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|-c)_{+}^{p}]$ is equivalent to the regularization problem $\inf_{\bm{\beta}_{r}\in\mathcal{D}}((\mathbb{E}^{G_{0}}[(|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|-c)_{+}^{p}])^{1/p}+\varepsilon\|(1,-\bm{\beta}_{r})\|_{*})^{p}$.

Example 6.

(Risk minimization)

- (Lower partial moments) Applying $\ell_{1}(x)=(x-b)_{+}$ and setting $b=c$, we have that the risk minimization with lower partial moments $\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{X}-c)_{+}^{p}]$ is equivalent to the regularization problem $\inf_{\bm{\beta}\in\mathcal{D}}((\mathbb{E}^{G_{0}}[(\bm{\beta}^{\top}\mathbf{X}-c)_{+}^{p}])^{1/p}+\varepsilon\|\bm{\beta}\|_{*})^{p}$.

Even though Theorem 4 highlights the impossibility of establishing an equivalence relation for a more general loss function $\ell$, Theorem 4 can still be applied more broadly as a powerful foundation to derive alternative equivalence relations for a richer family of measures. Specifically, there is a large family of measures that can be generally expressed in the following two forms:

\mathcal{V}^{F}(Z)=\inf_{t\in\mathbb{R}}\mathbb{E}^{F}[\ell^{p}(Z,t)]~~~{\rm and}~~~\rho^{F}(Z)=\inf_{t\in\mathbb{R}}\left\{t+\left(\mathbb{E}^{F}[\ell^{p}(Z,t)]\right)^{1/p}\right\} \qquad (19)

for some loss function $\ell$.

We show in the appendix (see Lemma 8) that for a wide range of loss functions $\ell$, the following switching of $\sup$ and $\inf$ is valid:

\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\inf_{t\in\mathbb{R}}\pi_{i,\ell}(F,t)=\inf_{t\in\mathbb{R}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\pi_{i,\ell}(F,t),~~i=1,2,

where

\pi_{1,\ell}(F,t)=\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t)]~~{\rm and}~~\pi_{2,\ell}(F,t)=t+\left(\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t)]\right)^{1/p}.

This, combined with Theorem 4, leads to the following.

Corollary 1.

For any $p\in[1,\infty)$ and $c>0$, let $\mathcal{V}$ and $\rho$ be defined by (19) with $\ell(z,t)=c\ell(z-t)$. Take $G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n})$ and $\varepsilon\geqslant 0$.

(i) If $\ell$ is $\ell_{3}$ or $\ell_{4}$ in Theorem 4, then

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathcal{V}^{F}(\bm{\beta}^{\top}\mathbf{Z})=\inf_{\bm{\beta}\in\mathcal{D}}\left(\left(\mathcal{V}^{G_{0}}(\bm{\beta}^{\top}\mathbf{Z})\right)^{1/p}+c\varepsilon\|\bm{\beta}\|_{*}\right)^{p}.

If $\ell$ is $\ell_{1}$ or $\ell_{2}$ in Theorem 4, then $\mathcal{V}^{F}(\bm{\beta}^{\top}\mathbf{Z})=0$ for any $\bm{\beta}\in\mathbb{R}^{n}$ and $F\in\mathcal{M}_{p}(\mathbb{R}^{n})$.

(ii) If $c>1$ and $\ell$ is one of $\ell_{1},\ell_{3},\ell_{4}$ in Theorem 4 or $\ell(z,t)=c(|z|-t)_{+}$, then it holds that

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{Z})=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho^{G_{0}}(\bm{\beta}^{\top}\mathbf{Z})+c\varepsilon\|\bm{\beta}\|_{*}\right\}.

If $\ell=\ell_{2}$ in Theorem 4, then $\rho^{F}(\bm{\beta}^{\top}\mathbf{Z})=-\infty$ for any $\bm{\beta}\in\mathbb{R}^{n}$ and $F\in\mathcal{M}_{p}(\mathbb{R}^{n})$.

Corollary 1 can accommodate cases where variance is used as the measure. Variance is not listed as an example in Section 2 because it has already been studied in Blanchet et al. (2022). However, our method for deriving the equivalency relation is different from, and significantly more general than, the approach in Blanchet et al. (2022). The following example illustrates the broader applicability of our approach.

Example 7.

(Blanchet et al. (2022)) When $p=2$ and $\ell(z,t)=|z-t|$ (i.e., $\ell_{3}(z-t)$ with $b_{1}=0$ and $b_{2}=0$), we have $\mathcal{V}^{F}={\rm\mathbb{V}ar}^{F}$, where ${\rm\mathbb{V}ar}$ represents the variance. That is, ${\rm\mathbb{V}ar}^{F}(\bm{\beta}^{\top}\mathbf{X})=\inf_{t\in\mathbb{R}}\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{X}-t)^{2}]$. Applying Corollary 1 (i) yields $\sup_{F\in\mathcal{B}_{2}(G_{0},\varepsilon)}{\rm\mathbb{V}ar}^{F}(\bm{\beta}^{\top}\mathbf{X})=(({\rm\mathbb{V}ar}^{G_{0}}(\bm{\beta}^{\top}\mathbf{X}))^{1/2}+\varepsilon\|\bm{\beta}\|_{*})^{2}$.

The following example demonstrates how Corollary 1 can be applied to derive the regularization counterpart for higher moment risk measures in risk minimization, a previously unknown result.

Example 8.

(Risk minimization) For $c>1$, setting $\ell(z,t)=c(z-t)_{+}$ (i.e., $c\ell_{1}(z-t)$ with $b=0$), we have that the following problem of minimizing higher moment risk measures $\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\inf_{t\in\mathbb{R}}\{t+c(\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{X}-t)_{+}^{p}])^{1/p}\}$ is equivalent to the regularization problem $\inf_{\bm{\beta}\in\mathcal{D},t\in\mathbb{R}}\{t+c(\mathbb{E}^{G_{0}}[(\bm{\beta}^{\top}\mathbf{X}-t)_{+}^{p}])^{1/p}+c\varepsilon\|\bm{\beta}\|_{*}\}$.
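
The regularized problem in Example 8 is jointly convex in $(\bm{\beta},t)$. A sketch with cvxpy, in which the data, the Euclidean norm, and the simplex constraint standing in for the feasible set $\mathcal{D}$ are our assumptions:

```python
import cvxpy as cp
import numpy as np

# Illustrative instance: columns of X are (random) losses of n positions.
rng = np.random.default_rng(4)
N, n, p, c, eps = 500, 4, 3, 2.0, 0.05
X = rng.normal(0.01, 0.1, size=(N, n))

beta, t = cp.Variable(n), cp.Variable()
tail = cp.norm(cp.pos(X @ beta - t), p) / N ** (1 / p)   # (E[(beta^T X - t)_+^p])^{1/p}
objective = t + c * tail + c * eps * cp.norm(beta, 2)
constraints = [beta >= 0, cp.sum(beta) == 1]               # simplex stand-in for D
cp.Problem(cp.Minimize(objective), constraints).solve()
print(beta.value, t.value)
```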

4.2 The case of non-expected function with distortion

A powerful perspective in the risk measure literature, especially when studying non-expected functions such as Conditional Value-at-Risk (CVaR), is to view these functions as equivalent to taking an expectation with respect to a distorted probability distribution. The concept of distorted expectation is central to foundational theories, including Yaari’s dual utility theory (Yaari (1987)), Choquet Expected Utility (Schmeidler (1989)), and various business applications. In this section, we adopt this perspective to study the problem (13) with a distorted expectation:

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\rho_{h}^{F}(\ell(\bm{\beta}^{\top}\mathbf{Z})), \qquad (20)

where

\rho_{h}^{F}(Z)=\int_{0}^{1}F^{-1}(s)\mathrm{d}h(s)

is an “expectation” taken with respect to a distortion function $h:[0,1]\to\mathbb{R}$. (Strictly speaking, a distortion function $h$ should additionally satisfy $h(0)=1-h(1)=0$ and be increasing, so that $\rho_{h}^{F}$ can be considered as a distorted expectation; however, our results do not hinge on these requirements.) Here, $F^{-1}$ represents the left-quantile function of $F$, i.e., $F^{-1}(s)=\inf\{x:F(x)\geqslant s\}$ for $s\in(0,1]$, and $F^{-1}(0)=\inf\{x:F(x)>0\}$. The problem (20) encompasses (13) as a special case, since $\rho_{h}^{F}(Z)=\mathbb{E}^{F}[Z]$ when $h(s)=s$, $s\in[0,1]$, and accommodates CVaR, since $\rho_{h}^{F}(Z)={\rm CVaR}_{\alpha}^{F}(Z)$ when $h(s)=(s-\alpha)_{+}/(1-\alpha)$, $s\in[0,1]$. A comprehensive list of risk measures with distorted expectation representations can be found in Wang et al. (2020) and Cai et al. (2023).
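
For an empirical distribution, the distorted expectation above is a finite sum against increments of $h$, since the left-quantile function is piecewise constant. A short numerical sketch (our own illustration; the CVaR distortion is used only as a check):

```python
import numpy as np

def distorted_expectation(sample, h):
    """Distorted expectation rho_h of the empirical distribution of `sample`:
    integrate the (piecewise-constant) left-quantile function against dh, i.e.
    sum_i x_(i) * [h(i/N) - h((i-1)/N)] over sorted order statistics x_(i)."""
    x = np.sort(np.asarray(sample, dtype=float))
    N = len(x)
    grid = np.arange(N + 1) / N
    weights = h(grid[1:]) - h(grid[:-1])
    return float(np.sum(x * weights))

alpha = 0.9
cvar_distortion = lambda s: np.maximum(s - alpha, 0.0) / (1 - alpha)
z = np.random.default_rng(5).normal(size=10**5)
# For h(s) = (s - alpha)_+ / (1 - alpha) this is the empirical CVaR_alpha;
# for a standard normal sample it should be close to 1.75.
print(distorted_expectation(z, cvar_distortion))
```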

It is intriguing to consider how the problem (20) might be solved, particularly whether a regularization counterpart, similar to that of the expected function (13), remains available. Our findings are surprisingly general: for any problem (20) with an increasing and convex distortion function $h$, a regularization counterpart always exists for the type-1 Wasserstein ball. This similarity to the classical regularization result for (13) is particularly notable, given the broader applicability of (20). This insight even extends to the general type-$p$ Wasserstein ball. As we show below, the necessary and sufficient conditions for a regularization counterpart to exist are remarkably similar for both (20) and (13) in the general type-$p$ case.

Before presenting the result, we introduce the notation $\|g\|_{a}=(\int_{A}|g(x)|^{a}\mathrm{d}x)^{1/a}$ for a function $g:A\to\mathbb{R}$ with $A\subseteq\mathbb{R}$ and $a\in[1,\infty)$, and $\|g\|_{\infty}=\sup_{x\in A}|g(x)|$. For a convex distortion function $h:[0,1]\to\mathbb{R}$ with $\lim_{x\to 1-}h(x)=h(1)$, we denote by $h^{\prime}_{-}$ the left-derivative of $h$ on $(0,1]$, noting that $h^{\prime}_{-}(1)$ may be infinity.

Theorem 5.

For $p\in[1,\infty]$, let $h:[0,1]\to\mathbb{R}$ be an increasing and convex distortion function satisfying $\lim_{x\to 1-}h(x)=h(1)$ and $\|h^{\prime}_{-}\|_{q}\in(0,\infty)$, and let $\ell:\mathbb{R}\to\mathbb{R}$ be a convex function.

For $p=1$, if ${\rm Lip}(\ell)<\infty$, then for any $G_{0}\in\mathcal{M}_{1}(\mathbb{R}^{n})$, $\varepsilon\geqslant 0$ and $\mathcal{D}\subseteq\mathbb{R}^{n}$, we have

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{1}(G_{0},\varepsilon)}\rho_{h}^{F}(\ell(\bm{\beta}^{\top}\mathbf{Z}))=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho_{h}^{G_{0}}(\ell(\bm{\beta}^{\top}\mathbf{Z}))+{\rm Lip}(\ell)\|h^{\prime}_{-}\|_{\infty}\varepsilon\|\bm{\beta}\|_{*}\right\}.

For $p\in(1,\infty]$, if $\rho_{h}(|\ell(Z)|)<\infty$ for all $Z\in L^{p}$, then the following statements are equivalent.

  • (i)

    For any $G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n})$, $\varepsilon\geqslant 0$ and $\mathcal{D}\subseteq\mathbb{R}^{n}$, there exists $C>0$ such that

    \inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\rho_{h}^{F}(\ell(\bm{\beta}^{\top}\mathbf{Z}))=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho_{h}^{G_{0}}(\ell(\bm{\beta}^{\top}\mathbf{Z}))+C\|h^{\prime}_{-}\|_{q}\varepsilon\|\bm{\beta}\|_{*}\right\}.
  • (ii)

    The function $\ell$ takes one of the following two forms, multiplied by $C$:

    • (a)

      $\ell_{1}(x)=x+b$ or $\ell_{1}(x)=-x+b$ with some $b\in\mathbb{R}$;

    • (b)

      $\ell_{2}(x)=|x-b_{1}|+b_{2}$ with some $b_{1},b_{2}\in\mathbb{R}$.

Interestingly, the “exact” class of loss functions $\ell$ that admit a regularization counterpart is the same as in the previous section, despite distorted expectations being significantly more general than expected functions. This result is particularly significant from a decision-theoretical perspective, as the formulation (20) closely resembles the celebrated rank-dependent expected utility (RDEU) model, known for resolving the paradox in expected utility (Quiggin (1982)). Given its importance, we highlight below the application of Theorem 5 in RDEU.

Example 9.

(Rank-Dependent Expected Utility (RDEU)) The decision criterion of RDEU admits the form $V_{u,h}(Z)=\rho_{h}(u(Z))$, where $\rho_{h}$ is a distorted expectation with distortion function $h$ and $u:\mathbb{R}\to\mathbb{R}$ is a dis-utility function. For decision makers who are risk-averse (Chew et al. (1987)), i.e., $h$ and $u$ are increasing convex with ${\rm Lip}(u)\in(0,\infty)$, we have

\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{1}(G_{0},\varepsilon)}V_{u,h}^{F}(\ell(\bm{\beta}^{\top}\mathbf{X}))=\inf_{\bm{\beta}\in\mathcal{D}}\left\{V_{u,h}^{G_{0}}(\ell(\bm{\beta}^{\top}\mathbf{X}))+{\rm Lip}(u)\cdot{\rm Lip}(\ell)\|h^{\prime}_{-}\|_{\infty}\varepsilon\|\bm{\beta}\|_{*}\right\},

for any convex loss function \ell with Lip()(0,){\rm Lip}(\ell)\in(0,\infty), G01(n)G_{0}\in\mathcal{M}_{1}(\mathbb{R}^{n}), ε0\varepsilon\geqslant 0 and 𝒟n\mathcal{D}\subseteq\mathbb{R}^{n}.
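To make the criterion concrete, the following is a minimal sketch, assuming only NumPy, of how the empirical value of V_{u,h}(Z)=\rho_{h}(u(Z)) can be computed on a sample by writing the distorted expectation as a weighted sum of order statistics; the distortion h, the disutility u, and the sample below are illustrative choices of ours and not taken from the paper.

```python
import numpy as np

def distorted_expectation(z, h):
    """Empirical distorted expectation: sum_i z_(i) * (h(i/N) - h((i-1)/N)), where
    z_(1) <= ... <= z_(N) are the sorted sample values and h is a distortion function
    on [0, 1]; this is the discrete version of int_0^1 F^{-1}(t) dh(t)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    weights = np.diff(h(np.arange(n + 1) / n))   # h(i/N) - h((i-1)/N), i = 1, ..., N
    return float(weights @ z)

def rdeu(z, u, h):
    """Rank-dependent expected disutility V_{u,h}(Z) = rho_h(u(Z)) on a sample of Z."""
    return distorted_expectation(u(np.asarray(z, dtype=float)), h)

# Illustrative (hypothetical) choices: a CVaR-type distortion and a convex increasing disutility.
alpha = 0.8
h = lambda s: np.clip((s - alpha) / (1.0 - alpha), 0.0, 1.0)   # h(s) = (s - alpha)_+ / (1 - alpha)
u = lambda z: np.maximum(z, 0.0) + 0.5 * z                      # piecewise linear, Lip(u) = 1.5

rng = np.random.default_rng(0)
print(rdeu(rng.normal(size=10_000), u, h))
```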

We further demonstrate the practical applicability of the above theorem by showing how it can be directly applied to derive the regularization counterpart for the ν\nu-support vector regression example.

Example 10.

(Regression)

- (ν-support vector regression) For \alpha\in[0,1), let h(s)=(s-\alpha)_{+}/(1-\alpha), s\in[0,1]. Applying \ell_{2}(x)=C|x-b_{1}|+b_{2} and setting C=1 and b_{1}=b_{2}=0, we have from Theorem 5 that the ν-support vector regression \inf_{\bm{\beta}_{r}\in\mathcal{D}}\sup_{F\in{\mathcal{B}}_{p}(G_{0},\varepsilon)}{\rm CVaR}_{\alpha}^{F}(|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|) is equivalent to the regularization problem \inf_{\bm{\beta}_{r}\in\mathcal{D}}\{{\rm CVaR}_{\alpha}^{G_{0}}(|(1,-\bm{\beta}_{r})^{\top}\mathbf{X}|)+\varepsilon(1-\alpha)^{-1/p}\|(1,-\bm{\beta}_{r})\|_{*}\}.
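To illustrate how this regularization counterpart can be handled as a convex program, here is a minimal sketch assuming cvxpy and NumPy, with the center G_{0} taken to be an empirical distribution on synthetic data; the choice p=1, the ∞-norm on the input space (so that the dual norm is the 1-norm), and the standard Rockafellar–Uryasev reformulation of the empirical CVaR term are assumptions made purely for the illustration.

```python
import cvxpy as cp
import numpy as np

# Hypothetical regression data: y is the response (the first coordinate of X in the text),
# and Z holds the remaining features; all names and values here are illustrative.
rng = np.random.default_rng(0)
N, n = 200, 5
Z = rng.normal(size=(N, n))
y = Z @ np.ones(n) + 0.1 * rng.normal(size=N)

alpha, eps, p = 0.5, 0.05, 1          # with p = 1, (1 - alpha)^(-1/p) = 1 / (1 - alpha)
beta = cp.Variable(n)
t = cp.Variable()                     # Rockafellar-Uryasev auxiliary variable for CVaR

residual = cp.abs(y - Z @ beta)                              # |(1, -beta_r)^T X|
cvar = t + cp.sum(cp.pos(residual - t)) / ((1 - alpha) * N)  # CVaR_alpha under the empirical G_0
# Assuming the infinity-norm on the input space, the dual norm is the 1-norm and
# ||(1, -beta_r)||_1 = 1 + ||beta_r||_1.
reg = eps * (1 - alpha) ** (-1.0 / p) * (1.0 + cp.norm(beta, 1))

cp.Problem(cp.Minimize(cvar + reg)).solve()
print(beta.value)
```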

Measures like the CVaR-Deviation in the risk minimization example below involve a convex distortion function h that is not necessarily increasing, a case not covered by the theorem above. We now show that when the loss function \ell in (20) is linear, a regularization counterpart can also be found for any convex distortion function h.

Proposition 2.

For p[1,]p\in[1,\infty], let h:[0,1]h:[0,1]\to\mathbb{R} be a convex function with limx0+h(x)=h(0)\lim_{x\to 0+}h(x)=h(0) and limx1h(x)=h(1)\lim_{x\to 1-}h(x)=h(1). Then for any G0p(n)G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}), ε0\varepsilon\geqslant 0 and 𝒟n\mathcal{D}\subseteq\mathbb{R}^{n}, we have

inf𝜷𝒟supFp(G0,ε)ρhF(𝜷𝐙)=inf𝜷𝒟{ρhG0(𝜷𝐙)+hqε𝜷}.\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\rho_{h}^{F}(\bm{\beta}^{\top}\mathbf{Z})=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho_{h}^{G_{0}}(\bm{\beta}^{\top}\mathbf{Z})+\|h^{\prime}_{-}\|_{q}\varepsilon\|\bm{\beta}\|_{*}\right\}.

Note first that applying Proposition 2 to classification problems is fairly straightforward, namely yielding the following equivalency:

inf𝜷𝒟supF¯p(F0,ε)ρhF(Y𝜷𝐗)=inf𝜷𝒟{ρhF0(Y𝜷𝐗)+hqε𝜷}.\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{\mathcal{B}}_{p}(F_{0},\varepsilon)}\rho_{h}^{F}(-Y\cdot\bm{\beta}^{\top}\mathbf{X})=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho_{h}^{F_{0}}(-Y\cdot\bm{\beta}^{\top}\mathbf{X})+\|h^{\prime}_{-}\|_{q}\varepsilon\|\bm{\beta}\|_{*}\right\}.

This equivalency is readily applicable to specific cases, such as the ν\nu-support vector machine.

Example 11.

(Classification)

- (ν\nu-support vector machine) Setting h(s)=(sα)+/(1α)h(s)=(s-\alpha)_{+}/(1-\alpha), s[0,1]s\in[0,1] with α[0,1),\alpha\in[0,1), we have that the classification problem with ν\nu-support vector machine inf𝜷𝒟supF¯p(F0,ε)CVaRαF(Y𝜷𝐗)\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{\cal B}_{p}(F_{0},\varepsilon)}{\rm CVaR}_{\alpha}^{F}(-Y\cdot\bm{\beta}^{\top}\mathbf{X}) is equivalent to the regularization problem inf𝜷𝒟{CVaRαF0(Y𝜷𝐗)+ε(1α)1/p𝜷}.\inf_{\bm{\beta}\in\mathcal{D}}\{{\rm CVaR}_{\alpha}^{F_{0}}(-Y\cdot\bm{\beta}^{\top}\mathbf{X})+\varepsilon(1-\alpha)^{-1/p}\|\bm{\beta}\|_{*}\}.

Proposition 2 can be further applied to general deviation measures defined by a distorted expectation, such as the CVaR-Deviation example in risk minimization. Observe that ρh(Z𝔼[Z])=ρh~(Z)\rho_{h}(Z-\mathbb{E}[Z])=\rho_{\widetilde{h}}(Z) holds for any distortion function hh, where h~(s):=h(s)+(h(0)h(1))s{\widetilde{h}}(s):=h(s)+(h(0)-h(1))s for s[0,1]s\in[0,1]. Moreover, h~{\widetilde{h}} is convex whenever hh is convex. Thus, we apply Proposition 2 and arrive at the following equivalency:

inf𝜷𝒟supFp(G0,ε)ρhF(𝜷𝐙𝔼F[𝜷𝐙])=inf𝜷𝒟{ρhG0(𝜷𝐙𝔼G0[𝜷𝐙])+h+h(0)h(1)qε𝜷}.\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\rho_{h}^{F}(\bm{\beta}^{\top}\mathbf{Z}-\mathbb{E}^{F}[\bm{\beta}^{\top}\mathbf{Z}])=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho^{G_{0}}_{h}(\bm{\beta}^{\top}\mathbf{Z}-\mathbb{E}^{G_{0}}[\bm{\beta}^{\top}\mathbf{Z}])+\|h^{\prime}_{-}+h(0)-h(1)\|_{q}\varepsilon\|\bm{\beta}\|_{*}\right\}.

This leads us to our final example.

Example 12.

(Risk minimization)

- (CVaR-Deviation) Setting h(s)=(sα)+/(1α)h(s)=(s-\alpha)_{+}/(1-\alpha), s[0,1]s\in[0,1] with α[0,1),\alpha\in[0,1), we have that the risk minimization with CVaR-Deviation inf𝜷𝒟supFp(G0,ε)CVaRαF(𝜷𝐗𝔼F[𝜷𝐗])\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}{\rm CVaR}_{\alpha}^{F}(\bm{\beta}^{\top}\mathbf{X}-\mathbb{E}^{F}[\bm{\beta}^{\top}\mathbf{X}]) is equivalent to the regularization problem inf𝜷𝒟{CVaRαG0(𝜷𝐗𝔼G0[𝜷𝐗])+ε(α+αq(1α)1q)1/q𝜷}.\inf_{\bm{\beta}\in\mathcal{D}}\{{\rm CVaR}_{\alpha}^{G_{0}}(\bm{\beta}^{\top}\mathbf{X}-\mathbb{E}^{G_{0}}[\bm{\beta}^{\top}\mathbf{X}])+\varepsilon(\alpha+\alpha^{q}(1-\alpha)^{1-q})^{1/q}\|\bm{\beta}\|_{*}\}.
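As a quick sanity check of the constant appearing in this example, the following sketch, assuming NumPy, compares the closed form (\alpha+\alpha^{q}(1-\alpha)^{1-q})^{1/q} with a direct numerical evaluation of \|h^{\prime}_{-}+h(0)-h(1)\|_{q} for h(s)=(s-\alpha)_{+}/(1-\alpha); the values of \alpha and q are arbitrary illustrative choices.

```python
import numpy as np

alpha, q = 0.9, 2.0
# h(s) = (s - alpha)_+ / (1 - alpha), so h'_-(s) = 0 for s <= alpha and 1/(1 - alpha) for s > alpha,
# while h(0) - h(1) = -1; the constant is || h'_- + h(0) - h(1) ||_q on (0, 1).
s = np.linspace(0.0, 1.0, 1_000_001)
integrand = np.abs(np.where(s > alpha, 1.0 / (1.0 - alpha), 0.0) - 1.0) ** q
numerical = np.trapz(integrand, s) ** (1.0 / q)
closed_form = (alpha + alpha ** q * (1.0 - alpha) ** (1.0 - q)) ** (1.0 / q)
print(numerical, closed_form)   # should agree up to discretization error
```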

Remark 4.

We offer some comments here on the recent work by Chu et al. (2024), which was released well after our work had been made available online. While Chu et al. (2024) initially set out to establish an equivalency relationship for a general setting of Wasserstein DRO – considering a general loss function \ell and cost function d as defined in (3) – their actual results do not extend much beyond our settings and findings in Section 4.1 (expected functions and extensions) and are limited to discussing sufficient conditions for the equivalency. Although some examples presented in Table 2 of Chu et al. (2024) are not covered in Section 4.1, we attribute this largely to our focus on cases that admit a final convex optimization formulation, rather than to a limitation of our analysis. Indeed, we believe our analysis is foundational, reaching the very core of the equivalency problem. This is evident in our ability to prove not only sufficient conditions but also necessary ones for both the case of expected functions and that of general distorted expectations – an achievement that appears beyond the reach of Chu et al. (2024), as it inevitably hinges on Lemma 4 provided in our analysis. To substantiate our claim that our analysis is not limited to the examples presented in this paper, we demonstrate in Proposition 3 in Appendix B.2 how to generalize Theorem 5 (the case p=1) to non-convex loss functions \ell, including the truncated pinball loss from Chu et al. (2024) as a special case. (In Chu et al. (2024), the center of the Wasserstein ball is assumed to be the empirical distribution, which has a finite set of vectors as its support; consequently, Proposition 3 is directly applicable.) As another example, it is not difficult to confirm that our findings in Proposition 1, Lemma 4, and all the results in Section 4 can be readily extended to a general Hilbert space \mathcal{H}, with the norm induced by its inner product. This extension covers the case of nonparametric scalar-on-function linear regression \mathbb{E}[\ell^{p}(Y-\int_{0}^{1}\mathbf{X}(t)\bm{\beta}(t)\mathrm{d}t)] from Chu et al. (2024), where \ell takes the forms outlined in our Theorem 4. (The expression y-\int_{0}^{1}\mathbf{x}(t)\bm{\beta}(t)\mathrm{d}t can be interpreted as the inner product between (y,\mathbf{x}) and (1,-\bm{\beta}) on \mathcal{H}:=\mathbb{R}\times\mathcal{L}^{2}([0,1]), where \mathcal{L}^{2}([0,1]) is the space of all square-integrable functions on [0,1]. In this setting, the decision rule is the inner product \langle(y,\mathbf{x}),(1,-\bm{\beta})\rangle for each decision vector \bm{\beta}, which is affine, and the Wasserstein ball \overline{\mathcal{B}}_{p}(F_{0},\varepsilon), similar in form to (5), is now encompassed within the set of probability measures on \mathcal{H}.)

5 Numerical Demonstration of Generalization Bounds

In this section, we seek to provide a further demonstration of the generalization result obtained in this paper. In particular, we numerically illustrate the decay rate of the Wasserstein radius for measures \rho that are not expected functions. We closely follow the experimental setup presented in Shafieezadeh-Abadeh et al. (2019), where the authors investigate the scaling behaviour of the smallest Wasserstein radius for the synthetic threenorm classification problem (Breiman (1996)). While Shafieezadeh-Abadeh et al. (2019) use the classical support vector machine for classification, in which the measure \rho is an expected function defined by a hinge loss function \ell, we employ the ν-support vector machine, where \rho is represented by CVaR. Specifically, we consider the distributionally robust counterpart of the standard ν-support vector machine formulation,

min𝜷nCVaRαF^N(Y𝜷𝐗)+12𝜷22.\displaystyle\min_{\bm{\beta}\in\mathbb{R}^{n}}~{}~{}{\rm CVaR}_{\alpha}^{\widehat{F}_{N}}(-Y\cdot\bm{\beta}^{\top}\mathbf{X})+\frac{1}{2}\|\bm{\beta}\|_{2}^{2}. (21)

By choosing the \infty-norm as the norm \|\cdot\| on the input space in the transport cost and setting p=1, as in Shafieezadeh-Abadeh et al. (2019), and following Proposition 2 or Example 11, we obtain the following regularized form of the Wasserstein robust ν-support vector machine problem:

min𝜷nCVaRαF^N(Y𝜷𝐗)+12𝜷22+ε𝜷11α.\displaystyle\min_{\bm{\beta}\in\mathbb{R}^{n}}~{}~{}{\rm CVaR}_{\alpha}^{\widehat{F}_{N}}(-Y\cdot\bm{\beta}^{\top}\mathbf{X})+\frac{1}{2}\|\bm{\beta}\|_{2}^{2}+\frac{\varepsilon\|\bm{\beta}\|_{1}}{1-\alpha}. (22)
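The following is a minimal sketch of how (22) can be solved as a convex program, assuming cvxpy and NumPy; the placeholder data and the Rockafellar–Uryasev reformulation of the empirical CVaR term are our own illustrative choices, not part of the original experiments.

```python
import cvxpy as cp
import numpy as np

# Placeholder training data (labels in {-1, +1}, 20-dimensional inputs); see below for the
# actual data-generating process used in the experiments.
rng = np.random.default_rng(0)
N, n = 100, 20
X = rng.normal(size=(N, n))
y = rng.choice([-1.0, 1.0], size=N)

alpha, eps = 0.2, 0.05
beta = cp.Variable(n)
t = cp.Variable()                                    # Rockafellar-Uryasev auxiliary variable

loss = cp.multiply(-y, X @ beta)                     # -Y * beta^T X under the empirical distribution
cvar = t + cp.sum(cp.pos(loss - t)) / ((1 - alpha) * N)
objective = cvar + 0.5 * cp.sum_squares(beta) + eps * cp.norm(beta, 1) / (1 - alpha)

cp.Problem(cp.Minimize(objective)).solve()
print(beta.value)
```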

The experiment is based on an output y drawn uniformly from \{-1,1\} and a 20-dimensional input \mathbf{x} drawn from a multivariate normal distribution. Specifically, if y=-1, then \mathbf{x} is drawn from a standard multivariate normal distribution shifted by (c,\dots,c) or (-c,\dots,-c) with equal probabilities, where c=2/\sqrt{20}. If y=1, then \mathbf{x} is drawn from a standard multivariate normal distribution shifted by (c,-c,\dots,c,-c). We consider training sample sizes N in the set \{10,\dots,90\}\cup\{100,\dots,900\}\cup\{1000,\dots,10000\}, as well as 10^{5} test samples. Each training sample size involves 100 simulation trials.
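For concreteness, here is a minimal sketch of this data-generating process, assuming NumPy; the function and variable names are ours and only intended to mirror the description above.

```python
import numpy as np

def threenorm_sample(N, n=20, rng=None):
    """Draw N samples (y, x) according to the setup described above (a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    c = 2.0 / np.sqrt(n)
    y = rng.choice([-1.0, 1.0], size=N)
    x = rng.normal(size=(N, n))
    alternating = c * np.where(np.arange(n) % 2 == 0, 1.0, -1.0)   # (c, -c, ..., c, -c)
    for i in range(N):
        if y[i] == -1.0:
            x[i] += rng.choice([-1.0, 1.0]) * c * np.ones(n)       # (c, ..., c) or (-c, ..., -c)
        else:
            x[i] += alternating
    return y, x

y_train, X_train = threenorm_sample(200, rng=np.random.default_rng(1))
```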

Throughout the experiment, the search space of the radius is chosen as \mathcal{E}=\{a\cdot 10^{-b}:~a\in\{1,\dots,10\},~b\in\{1,2,3\}\}\cup\{1.1,1.2,\dots,2\}. As in Shafieezadeh-Abadeh et al. (2019), we use the following three approaches to choose the smallest Wasserstein radius:

  • Cross validation: For a set of N training samples, denoted by \{(y_{i},\mathbf{x}_{i})\}_{i=1}^{N}, we partition them into k=5 subsets. We use one subset as the validation dataset and combine the remaining k-1 subsets as the training dataset, which results in k pairs of validation and training datasets. We choose the radius in \mathcal{E} for which the average validation error over these k pairs is smallest (a minimal sketch of this selection step is given after the list). This operation is repeated across all 100 trials, and we then report the average of the radii.

  • Optimal radius: In each trial, we choose the radius in \mathcal{E} that has the smallest testing error and then report the average of the radii across all 100100 trials.

  • Generalization bound: We choose the smallest radius such that the optimal value of (22) exceeds the value of the nominal problem on the test samples in at least 95%95\% of all 100 trials.
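The following is a minimal sketch of the cross-validation selection described in the first bullet, assuming NumPy and scikit-learn; solve_dr_svm is a hypothetical wrapper around the convex program (22) sketched earlier, and the validation error is taken to be the misclassification rate of the linear classifier sign(\bm{\beta}^{\top}\mathbf{x}).

```python
import numpy as np
from sklearn.model_selection import KFold

def validation_error(beta, y, X):
    """Misclassification rate of the linear classifier sign(beta^T x)."""
    return float(np.mean(np.sign(X @ beta) != y))

def cv_radius(y, X, radii, solve_dr_svm, k=5):
    """Return the radius in `radii` with the smallest average k-fold validation error;
    `solve_dr_svm(y, X, eps)` is a hypothetical wrapper returning the optimal beta of (22)."""
    errors = np.zeros(len(radii))
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        for j, eps in enumerate(radii):
            beta = solve_dr_svm(y[train_idx], X[train_idx], eps)
            errors[j] += validation_error(beta, y[val_idx], X[val_idx]) / k
    return radii[int(np.argmin(errors))]
```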

In our initial experiments, we set \alpha=0.2. Figure 1 displays the selected Wasserstein radii in relation to various sample sizes N for the three approaches above. Note that while the first two approaches determine the radius by averaging the radii induced by all simulation trials (potentially leading to a radius that is not necessarily in the set \mathcal{E}), the third approach specifically selects a radius from \mathcal{E} based on the percentage criterion, using the test samples across all 100 trials. One can observe that the Wasserstein radii of all three approaches decay approximately as 1/\sqrt{N}, which is consistent with the theoretical generalization bound of Theorem 1.

Figure 1: Wasserstein radius versus the number of training samples, with α=0.2\alpha=0.2

It is natural to wonder whether varying choices of \alpha could affect the scaling behaviour. This becomes particularly intriguing when considering higher values of \alpha. However, the literature on the ν-support vector machine suggests that the optimal solution of (21) can become degenerate (i.e., equal to zero) when \alpha surpasses a specific threshold. We indeed observe that the experimental setup of Shafieezadeh-Abadeh et al. (2019) only allows us to generate meaningful solutions for lower values of \alpha. To address the potential degeneracy issue at higher values of \alpha, we modified the method for generating simulated data. Specifically, for each sample (y,\mathbf{x})\in\{-1,1\}\times\mathbb{R}^{20}, if y=-1, then \mathbf{x} is drawn from a standard multivariate normal distribution shifted by (-c,\dots,-c) or (-2c,\dots,-2c) with equal probabilities, where c=1/3, whereas if y=1, then \mathbf{x} is drawn from a standard multivariate normal distribution shifted by (c,2c,\dots,c,2c). Figure 2 contains three panels, each displaying the selected radii for a distinct level of \alpha: \alpha=0.2, 0.5, or 0.8. Notably, the decay rate of the radius is remarkably similar across the different levels of \alpha, approximately 1/\sqrt{N}. This is in line with our theoretical finding that the rate in general does not hinge on the specification of the measure \rho.

Figure 2: Wasserstein radius versus the number of training samples, with different α\alpha level (Left: α=0.2\alpha=0.2; Middle: α=0.5\alpha=0.5; Right: α=0.8\alpha=0.8).

6 Beyond Affine Decision Rules

While our results thus far have focused on affine decision rules, the underlying principles of our approach to establishing generalization bounds for a broad class of measures ρ\rho can be extended to a wider range of decision rules. In this section, we illustrate this extension by deriving generalization bounds for the following Wasserstein DRO problem with a decision rule denoted by f𝜷f_{\bm{\beta}}:

inf𝜷𝒟supF¯p(F0,ε)ρF(Yf𝜷(𝐗)),\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{\cal B}_{p}(F_{0},\varepsilon)}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X})),

where F0p(Ξ)F_{0}\in\mathcal{M}_{p}(\Xi) with Ξ={1,1}×n\Xi=\{-1,1\}\times\mathbb{R}^{n} and ¯p(F0,ε)\overline{\cal B}_{p}(F_{0},\varepsilon) is the type-pp Wasserstein ball defined by (5). Throughout this section, we assume that 𝒟m\mathcal{D}\subseteq\mathbb{R}^{m} for some mm\in\mathbb{N}, with 𝒟\|\cdot\|_{\mathcal{D}} representing any given norm on 𝒟\mathcal{D}.

The key to establishing generalization bounds applicable to a broad class of measures continues to lie in focusing on the projection set, now expressed in a more general form:

¯p|f𝜷(F0,ε)={FYf𝜷(𝐗):F(Y,𝐗)¯p(F0,ε)}.\displaystyle\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon)=\{F_{Y\cdot f_{\bm{\beta}}(\mathbf{X})}:F_{(Y,\mathbf{X})}\in\overline{\mathcal{B}}_{p}(F_{0},\varepsilon)\}. (23)

Although the exact equivalence of this set to a one-dimensional Wasserstein ball or the projection set of a max-sliced Wasserstein ball, as shown in Proposition 1, no longer holds in the general case, we can still establish generalization bounds by instead identifying a one-dimensional Wasserstein ball dominated by ¯p|f𝜷(F0,ε)\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon), i.e.

¯p|f𝜷(F0,ε)p(FY0f𝜷(𝐗0),g(ε)),\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon)\supseteq\mathcal{B}_{p}(F_{Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)), (24)

for some increasing function g:++g:\mathbb{R}_{+}\to\mathbb{R}_{+}, where (Y0,𝐗0)F0(Y_{0},\mathbf{X}_{0})\sim F_{0}. Indeed, if this dominance relationship (24) can be established, one can then leverage the measure concentration property of a one-dimensional Wasserstein ball to determine the radius of this one-dimensional ball εN\varepsilon_{N} such that

(FYf𝜷(𝐗)¯p|f𝜷(F^N,g1(εN)))(FYf𝜷(𝐗)p(FY0f𝜷(𝐗0),εN))1η\mathbb{P}\left(F_{Y^{*}\cdot f_{\bm{\beta}}(\mathbf{X}^{*})}\in\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(\widehat{F}_{N},g^{-1}(\varepsilon_{N}))\right)\geqslant\mathbb{P}\left(F_{Y^{*}\cdot f_{\bm{\beta}}(\mathbf{X}^{*})}\in\mathcal{B}_{p}(F_{Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})},\varepsilon_{N})\right)\geqslant 1-\eta

with (Y^{*},\mathbf{X}^{*})\sim F^{*}, where F^{*} and \widehat{F}_{N}, defined in Section 3, are the data-generating distribution and the empirical distribution, respectively. This leads to the confidence bound:

(ρF(Yf𝜷(𝐗))supF¯p(F^N,g1(εN))ρF(Yf𝜷(𝐗)))1η,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in\overline{{\cal B}}_{p}(\widehat{F}_{N},g^{-1}(\varepsilon_{N}))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\right)\geqslant 1-\eta, (25)

for a fixed \bm{\beta}. One may wish to further gauge how tight (i.e., small) the Wasserstein in-sample risk \sup_{F\in\overline{\mathcal{B}}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X})) is – e.g., relative to the empirical in-sample risk – when the radius \varepsilon is sufficiently small. To this end, we note that, under Assumption 5 and the condition that f_{\bm{\beta}} is Lipschitz continuous with Lipschitz constant L_{\bm{\beta}} (a condition satisfied in the affine case), the relationship \sup_{F\in\overline{\mathcal{B}}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\rho^{\widehat{F}_{N}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))+ML_{\bm{\beta}}\varepsilon holds. Notably, Aolaritei et al. (2022) also studied the set inclusion property between the projection of a Wasserstein ball and a lower-dimensional Wasserstein ball; however, their focus was on establishing the existence of a lower-dimensional Wasserstein ball as a superset of the projection (i.e., the converse of (24)), which cannot be used to derive generalization bounds based on the (Wasserstein) in-sample risk \sup_{F\in\overline{\mathcal{B}}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X})). Extending the pointwise bound (25) to generalization bounds – i.e., a union bound of the above over \bm{\beta}\in\mathcal{D} – is achievable using covering number techniques. We leave the exact technical details and derivations to Appendix C, and present here the final generalization bounds along with the necessary assumptions.

Assumption 2.

There exists a>pa>p such that

A:=sup𝜷𝒟𝔼F[exp(|f𝜷(𝐗)|a)]<.\displaystyle A:=\sup_{\bm{\beta}\in\mathcal{D}}\mathbb{E}^{F^{*}}[\exp(|f_{\bm{\beta}}(\mathbf{X})|^{a})]<\infty.
Assumption 3.

The decision rule f𝛃:nf_{\bm{\beta}}:\mathbb{R}^{n}\to\mathbb{R} is continuous for each 𝛃𝒟\bm{\beta}\in\mathcal{D}, and there exists a strictly increasing convex function g:++g:\mathbb{R}_{+}\to\mathbb{R}_{+} such that for every 𝛃𝒟\bm{\beta}\in\mathcal{D}, 𝐱n,\mathbf{x}\in\mathbb{R}^{n}, and ε0\varepsilon\geqslant 0,

sup𝐲εf𝜷(𝐱+𝐲)f𝜷(𝐱)g(ε)andf𝜷(𝐱)inf𝐲εf𝜷(𝐱+𝐲)g(ε).\displaystyle\sup_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x})\geqslant g(\varepsilon)~{}~{}{and}~{}~{}f_{\bm{\beta}}(\mathbf{x})-\inf_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})\geqslant g(\varepsilon).
Assumption 4.

There exist a1,a20a_{1},a_{2}\geqslant 0 and r[1,p]r\in[1,p] such that

|f𝜷(𝐱)f𝜷~(𝐱)|(a1𝐱r+a2)𝜷𝜷~𝒟,𝐱nand𝜷,𝜷~𝒟.\displaystyle|f_{\bm{\beta}}(\mathbf{x})-f_{\widetilde{\bm{\beta}}}(\mathbf{x})|\leqslant(a_{1}\|\mathbf{x}\|^{r}+a_{2})\|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}},~{}~{}\forall\,\mathbf{x}\in\mathbb{R}^{n}~{}{\rm and}~{}\bm{\beta},\widetilde{\bm{\beta}}\in\mathcal{D}.
Assumption 5.

There exist M>0M>0 and k[1,p]k\in[1,p] such that

|ρ(Z1)ρ(Z2)|M(𝔼[|Z1Z2|k])1/k,Z1,Z2Lp.\displaystyle|\rho(Z_{1})-\rho(Z_{2})|\leqslant M\left(\mathbb{E}[|Z_{1}-Z_{2}|^{k}]\right)^{1/k},~{}~{}\forall\,Z_{1},Z_{2}\in L^{p}.

Assumption 2 is a fairly standard light-tail condition, necessary for invoking the measure concentration property of a one-dimensional Wasserstein ball; see Proposition 4 in Appendix C.4 for a sufficient condition for Assumption 2 for a class of decision rules {f𝜷}𝜷𝒟\{f_{\bm{\beta}}\}_{\bm{\beta}\in\mathcal{D}}. Assumption 3, however, is more notable, as it successfully characterizes decision rules that ensure the set inclusion (24) holds. Simply put, decision rules must exhibit sufficiently increasing and decreasing behaviour; otherwise, a Wasserstein ball dominated by the projection set would not exist. Assumptions 4 and 5 are regularity conditions required for constructing union bounds through covering number techniques.

Theorem 6.

For p[1,)p\in[1,\infty), η(0,1)\eta\in(0,1) and 𝒟m\mathcal{D}\subseteq\mathbb{R}^{m} with some mm\in\mathbb{N}, assume that Assumptions 2 and 3 hold, Assumptions 4 and 5 hold with rkprk\leqslant p, and U𝒟:=sup𝛃𝒟𝛃𝒟<U_{\mathcal{D}}:=\sup_{\bm{\beta}\in\mathcal{D}}\|\bm{\beta}\|_{\mathcal{D}}<\infty. Then we have

(ρF(Yf𝜷(𝐗))supF¯p(F^N,εN)ρF(Yf𝜷(𝐗))+τN,𝜷𝒟)1η1N,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N})}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))+\tau_{N},~{}~{}\forall~{}\bm{\beta}\in\mathcal{D}\right)\geqslant 1-\eta-\frac{1}{N},

where τN=M[(2r1+1)a1(𝔼F[𝐗rk])1/k+2r1a1(𝕍arF(𝐗rk))1/(2k)+2r1a1εNr+2a2]/N\tau_{N}={M[(2^{r-1}+1)a_{1}(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}])^{1/k}+2^{r-1}a_{1}({\rm\mathbb{V}ar}^{F^{*}}(\|\mathbf{X}\|^{rk}))^{1/(2k)}+2^{r-1}a_{1}\varepsilon_{N}^{r}+2a_{2}]}/{N},

εN=g1(εp,N)withεp,N={(log(c1/η)+mlog(1+2U𝒟N)c2N)1/(2p)ifNlog(c1/η)+mlog(1+2U𝒟N)c2,(log(c1/η)+mlog(1+2U𝒟N)c2N)1/aifN<log(c1/η)+mlog(1+2U𝒟N)c2,\displaystyle\varepsilon_{N}=g^{-1}(\varepsilon_{p,N})~{}~{}{\rm with}~{}~{}\varepsilon_{p,N}=\begin{cases}\left(\frac{\log(c_{1}/\eta)+m\log(1+2U_{\mathcal{D}}N)}{c_{2}N}\right)^{1/(2p)}&\text{if}~{}N\geqslant\frac{\log(c_{1}/\eta)+m\log(1+2U_{\mathcal{D}}N)}{c_{2}},\\ \left(\frac{\log(c_{1}/\eta)+m\log(1+2U_{\mathcal{D}}N)}{c_{2}N}\right)^{1/a}&\text{if}~{}N<\frac{\log(c_{1}/\eta)+m\log(1+2U_{\mathcal{D}}N)}{c_{2}},\end{cases}

and c1,c2c_{1},c_{2} are positive constants that only depend on pp, aa and AA.

The decay rate here, g1((m/N)1/(2p))g^{-1}\left((m/N)^{1/(2p)}\right), maintains an order independent of dimensionality, mirroring the (m/N)1/2(m/N)^{1/2} rate reported in Gao (2022). Although our rate may not be as rapid, due to its dependence on the function gg and the parameter pp, it is derived under less restrictive distributional assumptions and is valid for any type-pp Wasserstein ball. Most importantly, it accommodates a significantly broader class of measures ρ\rho. A notable class of applications for these bounds includes any regression problem and decision rule, with a rate that, as demonstrated below, is independent of the function gg.

Example 13 (Regression).

In the regression problem, the data-generating distribution F^{*} satisfies F^{*}(\{Y=1\})=1 and the decision rule has the form f_{\bm{\beta}}(\mathbf{x})=x_{1}-h_{\bm{\beta}}(\mathbf{x}_{(2,:)}), \mathbf{x}\in\mathbb{R}^{n}, where x_{1} is the first component of \mathbf{x} and \mathbf{x}_{(2,:)} collects the remaining components; that is, h_{\bm{\beta}}(\mathbf{x}_{(2,:)}) is used to predict the output x_{1}. Suppose that the norm on \mathbb{R}^{n} satisfies \|(1,0,\dots,0)\|=1. Then Assumption 3 holds for any h_{\bm{\beta}} with g(\varepsilon)=\varepsilon. Indeed, for any \mathbf{x}\in\mathbb{R}^{n} and \varepsilon\geqslant 0, we have \sup_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x})\geqslant f_{\bm{\beta}}(\mathbf{x}+(\varepsilon,0,\dots,0))-f_{\bm{\beta}}(\mathbf{x})=\varepsilon and f_{\bm{\beta}}(\mathbf{x})-\inf_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})\geqslant f_{\bm{\beta}}(\mathbf{x})-f_{\bm{\beta}}(\mathbf{x}-(\varepsilon,0,\dots,0))=\varepsilon. Hence, any decision rule of this form satisfies Assumption 3 with g(\varepsilon)=\varepsilon.

We further illustrate the application of this example in the context of operations management. Specifically, the feature-based newsvendor problem (Zhang et al. (2024)), also known as the big-data newsvendor (Ban and Rudin (2018)), is a special case of the regression problem in Example 13.

Example 14 (Feature-based newsvendor).

In the newsvendor problem, the decision maker seeks the optimal order quantity for a product facing uncertain demand X_{1}, with a holding cost of a>0 and a back-order cost of b>0. In the world of big data, the decision maker has access to a feature vector \mathbf{x}_{(2,:)}\in\mathbb{R}^{n-1}, which can be leveraged to make better-informed ordering decisions. In this setting, we define the decision rule as f_{\bm{\beta}}(\mathbf{x})=x_{1}-h_{\bm{\beta}}(\mathbf{x}_{(2,:)}) with \bm{\beta}\in\mathcal{D}, and the data-driven distributionally robust feature-based newsvendor problem takes the form

inf𝜷𝒟supF¯p(F^N,ε)ρF(b(f𝜷(𝐗))++a(f𝜷(𝐗))).\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\overline{\mathcal{B}}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(b(f_{\bm{\beta}}(\mathbf{X}))_{+}+a(f_{\bm{\beta}}(\mathbf{X}))_{-}).

Applying Example 13, we know that any decision rule of the feature-based newsvendor problem satisfies Assumption 3 with g(\varepsilon)=\varepsilon.
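For illustration, the following is a minimal sketch, assuming cvxpy and NumPy, of the nominal (empirical) feature-based newsvendor problem with a linear ordering rule and with \rho taken, purely for simplicity, to be the expectation; the data, parameter values, and the choice of \rho are placeholders of ours rather than part of the paper's formulation.

```python
import cvxpy as cp
import numpy as np

# Hypothetical data: `demand` plays the role of X_1 and `features` the role of X_{(2,:)}.
rng = np.random.default_rng(0)
N, n = 300, 6
features = rng.normal(size=(N, n - 1))
demand = 2.0 + features @ rng.normal(size=n - 1) + 0.3 * rng.normal(size=N)

a, b = 1.0, 4.0                                # holding and back-order costs
beta = cp.Variable(n - 1)
intercept = cp.Variable()
order = features @ beta + intercept            # linear ordering rule h_beta(x_(2,:))
shortfall = demand - order                     # f_beta(x) = x_1 - h_beta(x_(2,:))

# Empirical objective with rho taken (for illustration) to be the expectation.
cost = cp.sum(b * cp.pos(shortfall) + a * cp.pos(-shortfall)) / N
cp.Problem(cp.Minimize(cost)).solve()
print(intercept.value, beta.value)
```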

The above generalization bound can be extended to an even broader class of decision rules when the measure ρ\rho satisfies monotonicity, i.e., ρ(Z1)ρ(Z2)\rho(Z_{1})\leqslant\rho(Z_{2}) whenever Z1Z2Z_{1}\leqslant Z_{2}, a natural property in problems such as risk minimization. Specifically, we consider the following data-driven Wasserstein DRO problem:

inf𝜷𝒟supFp(F^𝐗,N,ε)ρF(f𝜷(𝐗)),\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(f_{\bm{\beta}}(\mathbf{X})),

where ρ\rho satisfies monotonicity, F^𝐗,N=1Ni=1Nδ𝐱^i\widehat{F}_{\mathbf{X},N}=\frac{1}{N}\sum_{i=1}^{N}\delta_{\widehat{\mathbf{x}}_{i}} represents the empirical distribution of the component 𝐗\mathbf{X}, and p(F^𝐗,N,ε){\cal B}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon) is defined by (8). In these cases, Assumption 3 can be relaxed, requiring only that each decision rule f𝜷f_{\bm{\beta}} exhibits sufficiently increasing behavior.

Assumption 6.

The decision rule f𝛃:nf_{\bm{\beta}}:\mathbb{R}^{n}\to\mathbb{R} is continuous, and there exists a strictly increasing convex function g:++g:\mathbb{R}_{+}\to\mathbb{R}_{+} such that for every 𝛃𝒟\bm{\beta}\in\mathcal{D},

sup𝐲εf𝜷(𝐱+𝐲)f𝜷(𝐱)g(ε),𝐱n,ε0.\displaystyle\sup_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x})\geqslant g(\varepsilon),~{}~{}\forall\mathbf{x}\in\mathbb{R}^{n},~{}\varepsilon\geqslant 0.
Theorem 7.

For p[1,)p\in[1,\infty), η(0,1)\eta\in(0,1) and 𝒟m\mathcal{D}\subseteq\mathbb{R}^{m} with some mm\in\mathbb{N}, assume that Assumptions 2 and 6 hold, Assumptions 4 and 5 hold with rkprk\leqslant p, U𝒟:=sup𝛃𝒟𝛃𝒟<U_{\mathcal{D}}:=\sup_{\bm{\beta}\in\mathcal{D}}\|\bm{\beta}\|_{\mathcal{D}}<\infty, and ρ:Lp\rho:L^{p}\to\mathbb{R} satisfies monotonicity. Then,

(ρF(f𝜷(𝐗))supFp(F^𝐗,N,εN)ρF(f𝜷(𝐗))+τN,𝜷𝒟)1η1N,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{{\bf X},N},\,\varepsilon_{N})}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))+\tau_{N},~{}~{}\forall~{}\bm{\beta}\in\mathcal{D}\right)\geqslant 1-\eta-\frac{1}{N},

where τN=M[(2r1+1)a1(𝔼F[𝐗rk])1/k+2r1a1(𝕍arF(𝐗rk))1/(2k)+2r1a1εNr+2a2]/N\tau_{N}={M[(2^{r-1}+1)a_{1}(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}])^{1/k}+2^{r-1}a_{1}({\rm\mathbb{V}ar}^{F^{*}}(\|\mathbf{X}\|^{rk}))^{1/(2k)}+2^{r-1}a_{1}\varepsilon_{N}^{r}+2a_{2}]}/{N} and εN=g1(εp,N)\varepsilon_{N}=g^{-1}(\varepsilon_{p,N}) with εp,N\varepsilon_{p,N} defined in Theorem 6.

In particular, Assumption 6 accommodates non-linear decision rules, including quadratic and max-affine functions (cf. Esfahani and Kuhn (2018) and Example 16), which may be bounded from below and therefore need not satisfy Assumption 3.

Example 15 (Quadratic function).

Define the decision variable as 𝜷=(Σ,𝐚,b)\bm{\beta}=(\Sigma,\mathbf{a},b), where 𝐚n\mathbf{a}\in\mathbb{R}^{n}, bb\in\mathbb{R}, and Σn×n\Sigma\in\mathbb{R}^{n\times n} is a positive semi-definite matrix. Let 𝒟\mathcal{D} be the feasible set of 𝜷\bm{\beta} such that M:=inf𝜷𝒟ΣS>0M:=\inf_{\bm{\beta}\in\mathcal{D}}\|\Sigma\|_{S}>0, where ΣS\|\Sigma\|_{S} is the spectral norm of Σ\Sigma and is defined as the largest absolute value of all eigenvalues. Define the norm on 𝒟\mathcal{D} as 𝜷𝒟=ΣS+𝐚2+|b|with𝜷=(Σ,𝐚,b)𝒟.\|\bm{\beta}\|_{\mathcal{D}}=\|\Sigma\|_{S}+\|\mathbf{a}\|_{2}+|b|~{}{\rm with}~{}\bm{\beta}=(\Sigma,\mathbf{a},b)\in\mathcal{D}. Let f𝜷(𝐱)=𝐱Σ𝐱+𝐚𝐱+bf_{\bm{\beta}}(\mathbf{x})=\mathbf{x}^{\top}\Sigma\mathbf{x}+\mathbf{a}^{\top}\mathbf{x}+b for xnx\in\mathbb{R}^{n} and 𝜷𝒟\bm{\beta}\in\mathcal{D}. For this specific case, we first show that {f𝜷}𝜷𝒟\{f_{\bm{\beta}}\}_{\bm{\beta}\in\mathcal{D}} satisfies Assumption 4 with a1=a2=r=2a_{1}=a_{2}=r=2 and Assumption 6 with g(ε)=Mε2g(\varepsilon)=M\varepsilon^{2}. For 𝜷=(Σ,𝐚,b)𝒟\bm{\beta}=(\Sigma,\mathbf{a},b)\in\mathcal{D} and 𝜷~=(Σ~,𝐚~,b~)𝒟\widetilde{\bm{\beta}}=(\widetilde{\Sigma},\widetilde{\mathbf{a}},\widetilde{b})\in\mathcal{D}, we have

|f𝜷(𝐱)f𝜷~(𝐱)|\displaystyle|f_{\bm{\beta}}(\mathbf{x})-f_{\widetilde{\bm{\beta}}}(\mathbf{x})| |𝐱(ΣΣ~)𝐱|+|(𝐚𝐚~)𝐱|+|bb~|\displaystyle\leqslant|\mathbf{x}^{\top}(\Sigma-\widetilde{\Sigma})\mathbf{x}|+|(\mathbf{a}-\widetilde{\mathbf{a}})^{\top}\mathbf{x}|+|b-\widetilde{b}|
𝐱22ΣΣ~S+𝐱2𝐚𝐚~2+|bb~|2(𝐱22+1)𝜷𝜷~𝒟.\displaystyle\leqslant\|\mathbf{x}\|_{2}^{2}\|\Sigma-\widetilde{\Sigma}\|_{S}+\|\mathbf{x}\|_{2}\|\mathbf{a}-\widetilde{\mathbf{a}}\|_{2}+|b-\widetilde{b}|\leqslant 2(\|\mathbf{x}\|_{2}^{2}+1)\|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}}.

Hence, Assumption 4 holds with a1=a2=r=2a_{1}=a_{2}=r=2. For any 𝐱n\mathbf{x}\in\mathbb{R}^{n}, ε0\varepsilon\geqslant 0 and 𝜷=(Σ,𝐚,b)𝒟\bm{\beta}=(\Sigma,\mathbf{a},b)\in\mathcal{D}, it follows from the orthogonal decomposition of Σ\Sigma, i.e., Σ=QΛQ\Sigma=Q\Lambda Q^{\top} where QQ is an orthogonal matrix and Λ\Lambda is a diagonal matrix, that

sup𝐲2εf𝜷(𝐱+𝐲)f𝜷(𝐱)\displaystyle\sup_{\|\mathbf{y}\|_{2}\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x}) =sup𝐲2ε{𝐲Σ𝐲+(2𝐱Σ+𝐚)𝐲}\displaystyle=\sup_{\|\mathbf{y}\|_{2}\leqslant\varepsilon}\{\mathbf{y}^{\top}\Sigma\mathbf{y}+(2\mathbf{x}^{\top}\Sigma+\mathbf{a}^{\top})\mathbf{y}\}
=sup𝐲2ε{(Q𝐲)Λ(Q𝐲)+(2𝐱QΛQ+𝐚)𝐲}\displaystyle=\sup_{\|\mathbf{y}\|_{2}\leqslant\varepsilon}\left\{(Q^{\top}\mathbf{y})^{\top}\Lambda(Q^{\top}\mathbf{y})+(2\mathbf{x}^{\top}Q\Lambda Q^{\top}+\mathbf{a}^{\top})\mathbf{y}\right\}
=sup𝐳2ε{𝐳Λ𝐳+(2𝐱QΛ+𝐚Q)𝐳}ΣSε2Mε2.\displaystyle=\sup_{\|\mathbf{z}\|_{2}\leqslant\varepsilon}\left\{\mathbf{z}^{\top}\Lambda\mathbf{z}+(2\mathbf{x}^{\top}Q\Lambda+\mathbf{a}^{\top}Q)\mathbf{z}\right\}\geqslant\|\Sigma\|_{S}\varepsilon^{2}\geqslant M\varepsilon^{2}.

Hence, Assumption 6 holds with g(ε)=Mε2g(\varepsilon)=M\varepsilon^{2}. Further, if we assume that U𝒟:=sup𝒟𝜷𝒟<U_{\mathcal{D}}:=\sup_{\mathcal{D}}\|\bm{\beta}\|_{\mathcal{D}}<\infty and there exists a>pa>p such that 𝔼F[exp(22a1U𝒟a𝐗2a)]<\mathbb{E}^{F^{*}}[\exp(2^{2a-1}U_{\mathcal{D}}^{a}\|\mathbf{X}\|^{2a})]<\infty, then one can verify that Assumption 2 holds by Proposition 4 in Appendix C.4.
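The lower bound \sup_{\|\mathbf{y}\|_{2}\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x})\geqslant\|\Sigma\|_{S}\varepsilon^{2} derived above can also be checked numerically. The following is a minimal sketch, assuming NumPy and the Euclidean norm on the input space: perturbing \mathbf{x} by \pm\varepsilon v along a top eigenvector v of \Sigma, with the sign chosen so that the linear term is non-negative, already yields an increase of at least \|\Sigma\|_{S}\varepsilon^{2}; all matrices and vectors are random illustrative data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 5, 0.7
A = rng.normal(size=(n, n))
Sigma = A @ A.T                                   # random positive semi-definite matrix
a, b = rng.normal(size=n), rng.normal()
x = rng.normal(size=n)

f = lambda z: z @ Sigma @ z + a @ z + b

eigvals, eigvecs = np.linalg.eigh(Sigma)
v = eigvecs[:, -1]                                # unit eigenvector of the largest eigenvalue
s = 1.0 if (2 * Sigma @ x + a) @ v >= 0 else -1.0 # sign making the linear term non-negative
y = s * eps * v                                   # a feasible perturbation with ||y||_2 = eps

print(f(x + y) - f(x), eigvals[-1] * eps ** 2)    # the first value is at least the second
```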

To further emphasize its implications for operations management, we next illustrate how Theorem 7 applies to two-stage stochastic programs.

Example 16 (Two-stage stochastic program).

In the context of two-stage stochastic programs with right-hand-side uncertainty, the recourse function is defined as (Esfahani and Kuhn (2018)): f𝜷(𝐱)=infy{qy|WyH𝐱+𝜷},f_{\bm{\beta}}(\mathbf{x})=\inf_{y}\left\{q^{\top}y\;|\;Wy\geqslant H\mathbf{x}+\bm{\beta}\right\}, where 𝜷𝒟m\bm{\beta}\in\mathcal{D}\subseteq\mathbb{R}^{m} with 𝜷𝒟=𝜷\|\bm{\beta}\|_{\mathcal{D}}=\|\bm{\beta}\|, denotes the first-stage decision variable, and yy represents the recourse decision. As shown in Esfahani and Kuhn (2018), this function can be reformulated as a max-affine function f𝜷(𝐱)=max1kK(vkH𝐱+vk𝜷),f_{\bm{\beta}}(\mathbf{x})=\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\bm{\beta}), where vkmv_{k}\in\mathbb{R}^{m}, k=1,,Kk=1,\ldots,K are the vertices of the dual feasible set {θ0|Wθ=q}\left\{\theta\geqslant 0\;|\;W^{\top}\theta=q\right\}. Suppose M:=min1kKHvk>0M:=\min_{1\leqslant k\leqslant K}\|H^{\top}v_{k}\|_{*}>0. We show that f𝜷f_{\bm{\beta}} for 𝜷𝒟\bm{\beta}\in\mathcal{D} satisfies Assumption 4 with a1=0a_{1}=0 and a2=max1kKvka_{2}=\max_{1\leqslant k\leqslant K}\|v_{k}\|_{*} and Assumption 6 with g(ε)=Mεg(\varepsilon)=M\varepsilon. For 𝜷,𝜷~𝒟\bm{\beta},\,\widetilde{\bm{\beta}}\in\mathcal{D}, we have

|f𝜷(𝐱)f𝜷~(𝐱)|\displaystyle|f_{\bm{\beta}}(\mathbf{x})-f_{\widetilde{\bm{\beta}}}(\mathbf{x})| =|max1kK(vkH𝐱+vk𝜷)max1kK(vkH𝐱+vk𝜷~)|\displaystyle=\left|\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\bm{\beta})-\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\widetilde{\bm{\beta}})\right|
max1kK|(vkH𝐱+vk𝜷)(vkH𝐱+vk𝜷~)|\displaystyle\leqslant\max_{1\leqslant k\leqslant K}\left|(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\bm{\beta})-(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\widetilde{\bm{\beta}})\right|
=max1kK|vk(𝜷𝜷~)|max1kKvk𝜷𝜷~.\displaystyle=\max_{1\leqslant k\leqslant K}\left|v_{k}^{\top}(\bm{\beta}-\widetilde{\bm{\beta}})\right|\leqslant\max_{1\leqslant k\leqslant K}\|v_{k}\|_{*}\|\bm{\beta}-\widetilde{\bm{\beta}}\|.

Hence, Assumption 4 holds with a1=0a_{1}=0 and a2=max1kKvka_{2}=\max_{1\leqslant k\leqslant K}\|v_{k}\|_{*}. For any 𝐱n\mathbf{x}\in\mathbb{R}^{n}, ε0\varepsilon\geqslant 0 and 𝜷𝒟\bm{\beta}\in\mathcal{D}, we have

sup𝐲εf𝜷(𝐱+𝐲)f𝜷(𝐱)\displaystyle\sup_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x}) =sup𝐲εmax1kK(vkH(𝐱+𝐲)+vk𝜷)max1kK(vkH𝐱+vk𝜷)\displaystyle=\sup_{\|\mathbf{y}\|\leqslant\varepsilon}\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H(\mathbf{x}+\mathbf{y})+v_{k}^{\top}\bm{\beta})-\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\bm{\beta})
=max1kKsup𝐲ε(vkH(𝐱+𝐲)+vk𝜷)max1kK(vkH𝐱+vk𝜷)\displaystyle=\max_{1\leqslant k\leqslant K}\sup_{\|\mathbf{y}\|\leqslant\varepsilon}(v_{k}^{\top}H(\mathbf{x}+\mathbf{y})+v_{k}^{\top}\bm{\beta})-\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\bm{\beta})
=max1kK(vkH𝐱+εHvk+vk𝜷)max1kK(vkH𝐱+vk𝜷)\displaystyle=\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+\varepsilon\|H^{\top}v_{k}\|_{*}+v_{k}^{\top}\bm{\beta})-\max_{1\leqslant k\leqslant K}(v_{k}^{\top}H\mathbf{x}+v_{k}^{\top}\bm{\beta})
min1kKεHvk=Mε.\displaystyle\geqslant\min_{1\leqslant k\leqslant K}\varepsilon\|H^{\top}v_{k}\|_{*}=M\varepsilon.

Hence, Assumption 6 holds with g(ε)=Mεg(\varepsilon)=M\varepsilon. Further, if we assume that U𝒟:=sup𝜷𝒟𝜷𝒟<U_{\mathcal{D}}:=\sup_{\bm{\beta}\in\mathcal{D}}\|\bm{\beta}\|_{\mathcal{D}}<\infty and there exists a>pa>p such that 𝔼F[exp(2a1Ra𝐗a)]<\mathbb{E}^{F^{*}}[\exp(2^{a-1}R^{a}\|\mathbf{X}\|^{a})]<\infty with R:=max1kKHvkR:=\max_{1\leqslant k\leqslant K}\|H^{\top}v_{k}\|_{*}, then one can verify that Assumption 2 holds by Proposition 4 in Appendix C.4. We note that extending this discussion to cases where the matrix HH also represents a first-stage decision is straightforward, and verifying this extension closely mirrors our existing analysis, thus it is omitted here for brevity.
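The corresponding bound \sup_{\|\mathbf{y}\|\leqslant\varepsilon}f_{\bm{\beta}}(\mathbf{x}+\mathbf{y})-f_{\bm{\beta}}(\mathbf{x})\geqslant M\varepsilon for the max-affine recourse can likewise be checked numerically. The following is a minimal sketch, assuming NumPy and the Euclidean norm on the input space (so that \|\cdot\|_{*} is also the Euclidean norm); the matrix H, the vectors standing in for the dual vertices v_{k}, and the first-stage decision \bm{\beta} are random illustrative data rather than an actual two-stage instance.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, K, eps = 3, 4, 6, 0.5
H = rng.normal(size=(m, n))
V = rng.normal(size=(K, m))            # rows stand in for the dual vertices v_k (illustrative only)
beta = rng.normal(size=m)
x = rng.normal(size=n)

f = lambda z: np.max(V @ (H @ z) + V @ beta)       # max-affine recourse function

# sup_{||y||_2 <= eps} v_k^T H (x + y) = v_k^T H x + eps * ||H^T v_k||_2, so perturbing x along
# H^T v_k for the maximizing k attains an increase of at least M * eps.
G = V @ H                                          # row k equals (H^T v_k)^T
lift = G @ x + eps * np.linalg.norm(G, axis=1) + V @ beta
k_star = int(np.argmax(lift))
y_star = eps * G[k_star] / np.linalg.norm(G[k_star])
M = np.min(np.linalg.norm(G, axis=1))

print(f(x + y_star) - f(x), M * eps)               # the first value is at least the second
```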

Finally, we show that for problems such as regression or risk minimization, the “curse of the order pp” – i.e., the dependence of the decay rate on pp – can be overcome for any non-linear decision rules covered by Assumption 6.

Theorem 8.

For p[1,]p\in[1,\infty], η(0,1)\eta\in(0,1) and 𝒟m\mathcal{D}\subseteq\mathbb{R}^{m} with some mm\in\mathbb{N}, assume that Assumptions 1 and 6 hold, Assumption 2 holds with a>1a>1, Assumption 4 holds with r=1r=1 and U𝒟:=sup𝛃𝒟𝛃𝒟<U_{\mathcal{D}}:=\sup_{\bm{\beta}\in\mathcal{D}}\|\bm{\beta}\|_{\mathcal{D}}<\infty, and ρ:L1\rho:L^{1}\to\mathbb{R} satisfies monotonicity and |ρ(Z1)ρ(Z2)|Mess-sup|Z1Z2||\rho(Z_{1})-\rho(Z_{2})|\leqslant M\mathrm{ess\mbox{-}sup}|Z_{1}-Z_{2}| for Z1,Z2L1Z_{1},Z_{2}\in L^{1}. Then,

(ρF(f𝜷(𝐗))supFp(F^𝐗,N,εN)ρF(f𝜷(𝐗))+τN,𝜷𝒟)1η1N,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{{\bf X},N},\varepsilon_{N})}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))+\tau_{N},~{}~{}\forall~{}\bm{\beta}\in\mathcal{D}\right)\geqslant 1-\eta-\frac{1}{N},

where τN=λM[a1𝔼F[𝐗]+a1(𝕍arF(𝐗))1/2+a1εN+2a2]/N\tau_{N}={\lambda M[a_{1}\mathbb{E}^{F^{*}}[\|\mathbf{X}\|]+a_{1}({\rm\mathbb{V}ar}^{F^{*}}\!\!(\|\mathbf{X}\|))^{1/2}+a_{1}\varepsilon_{N}+2a_{2}]}/{N} and εN=g1(λε1,N)\varepsilon_{N}=g^{-1}(\lambda\varepsilon_{1,N}) with ε1,N\varepsilon_{1,N} defined in Theorem 6.

We note that Assumption 1 and the Lipschitz continuity with respect to the LL^{\infty}-norm together imply Assumption 5 (with k=1k=1). Hence, Assumption 5 is omitted in Theorem 8 (see Appendix C.3 for details).

7 Conclusion

As ML and OR/MS applications increasingly adopt diverse decision criteria, identifying methods that ensure robust, data-driven solutions with strong generalization guarantees across a wide range of criteria is essential. This paper establishes the universality of Wasserstein DRO in achieving such guarantees. We show that Wasserstein DRO attains generalization bounds comparable to max-sliced Wasserstein DRO for any decision criterion under affine decision rules, even though the Wasserstein ball is a significantly less conservative ambiguity set. Furthermore, we extend these guarantees to general criteria and decision rules under light-tail assumptions, avoiding the curse of dimensionality. Our projection-based analysis of the Wasserstein ball offers key insights into why these guarantees require minimal reliance on specific properties of decision criteria.

Beyond developing generalization guarantees, our work emphasizes their practical significance for OR/MS problems. We demonstrate that these guarantees can often be achieved through regularization equivalents, as detailed in Theorems 3, 4, and 5, all of which are efficiently solvable as convex programs with complexity comparable to nominal problems. Importantly, our impossibility theorems identify precisely when solving the full Wasserstein DRO problem becomes necessary, defining the boundaries where simpler regularization approaches no longer suffice. These results deepen the theoretical foundation of Wasserstein DRO while charting a clear course for developing more efficient methods to address full Wasserstein DRO formulations.

References

  • Abu-Mostafa et al. (2012) Abu-Mostafa, Y. S., Magdon-Ismail, M. and Lin, H. T. (2012). Learning from data. New York: AMLBook.
  • Aolaritei et al. (2022) Aolaritei, L., Lanzetti, N., Chen, H., and Dörfler, F. (2022). Distributional uncertainty propagation via optimal transport. arXiv:2205.00343.
  • Ban and Rudin (2018) Ban, G.-Y. and Rudin, C. (2018). The big data newsvendor: practical insights from machine learning. Operations Research, 67(1), 90–108.
  • Bartl et al. (2020) Bartl, D., Drapeau, S., Obloj, J. and Wiesel, J. (2020). Robust uncertainty sensitivity analysis. arXiv:2006.12022.
  • Bawa (1975) Bawa, V. S. (1975). Optimal rules for ordering uncertain prospects. Journal of Financial Economics, 2(1), 95–121.
  • Blanchet et al. (2022) Blanchet, J., Chen, L. and Zhou, X. (2022). Distributionally robust mean-variance portfolio selection with Wasserstein distances. Management Science, 68(9), 6382–6410.
  • Blanchet and Kang (2021) Blanchet, J. and Kang, Y. (2021). Sample out-of-sample inference based on Wasserstein distance. Operations Research, 69(3), 985–1013.
  • Blanchet et al. (2019) Blanchet, J., Kang, Y. and Murthy, K. (2019). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3), 830–857.
  • Blanchet et al. (2022) Blanchet, J., Murthy, K. and Si, N. (2022). Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2), 295–315.
  • Breiman (1996) Breiman, L. (1996). Bias, variance, and arcing classifiers (Technical Report 460). Statistics Department, University of California.
  • Cai et al. (2023) Cai, J., Li, J. Y. M. and Mao, T. (2023). Distributionally robust optimization under distorted expectations. Operations Research, forthcoming.
  • Chen and Paschalidis (2018) Chen, R. and Paschalidis, I. C. (2018). A Robust Learning Approach for Regression Models Based on Distributionally Robust Optimization. Journal of Machine Learning Research, 19(1), 1–48.
  • Chen et al. (2011) Chen, L., He, S. and Zhang, S. (2011). Tight Bounds for Some Risk Measures, with Applications to Robust Portfolio Selection. Operations Research, 59(4), 847–865.
  • Chew et al. (1987) Chew, H. C., Karni, E., and Safra, Z. (1987). Risk aversion in the theory of expected utility with rank dependent probabilities. Journal of Economic theory, 42(2), 370–381.
  • Chu et al. (2024) Chu, H., Lin, M. and Toh, K. C. (2024). Wasserstein distributionally robust optimization and its tractable regularization formulations. arXiv:2402.03942.
  • Drucker et al. (1997) Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. and Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 28(7), 779–784.
  • Esfahani and Kuhn (2018) Esfahani, P. M. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1), 115–166.
  • Fishburn (1977) Fishburn, P. C. (1977). Mean-risk analysis with risk associated with below target returns. American Economic Review, 67(2), 116–126.
  • Gao (2022) Gao, R. (2022). Finite-Sample Guarantees for Wasserstein Distributionally Robust Optimization: Breaking the Curse of Dimensionality. Operations Research, 71(6), 2291–2306.
  • Gao et al. (2017) Gao, R., Chen, X. and Kleywegt, A. J. (2017). Distributional robustness and regularization in statistical learning. arXiv:1712.06050.
  • Gao et al. (2022) Gao, R., Chen, X. and Kleywegt, A. J. (2022). Wasserstein Distributionally Robust Optimization and Variation Regularization. Operations Research, 72(3), 1177–1191.
  • Gotoh and Uryasev (2017) Gotoh, J. and Uryasev, S. (2017). Support vector machines based on convex risk functions and general norms. Annals of Operations Research, 249, 301–328.
  • Krokhmal (2007) Krokhmal, P. (2007). Higher moment coherent risk measures. Quantitative Finance, 7(4), 373–387.
  • Kuhn et al. (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A. and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: theory and applications in machine learning. INFORMS TutORials in Operations Research, 130–166.
  • Lee and Mangasarian (2001) Lee, Y.-J. and Mangasarian, O. L. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20, 5–22.
  • Lee et al. (2005) Lee, Y. J., Hsieh, W. F. and Huang, C. M. (2005). ε\varepsilon-SSVR: A smooth support vector machine for ε\varepsilon-insensitive regression. IEEE Transactions on Knowledge and Data Engineering, 17(5), 678–685.
  • Mohri et al. (2018) Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
  • Olea et al. (2022) Olea, J. L. M., Rush, C., Velez, A. and Wiesel, J. (2022). The out-of-sample prediction error of the square-root-LASSO and related estimators. arXiv:2211.07608.
  • Quiggin (1982) Quiggin, J. (1982). A theory of anticipated utility. Journal of Economic Behavior & Organization, 3(4), 323–343.
  • Rockafellar and Uryasev (2002) Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26(7), 1443–1471.
  • Rockafellar et al. (2008) Rockafellar, R. T., Uryasev, S. and Zabarankin, M. (2008). Risk tuning with generalized linear regression. Mathematics of Operations Research, 33(3), 712–729.
  • Rockafellar and Uryasev (2013) Rockafellar, R. T. and Uryasev, S. (2013). The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2), 33–53.
  • Schölkopf et al. (1998) Schölkopf, B., Bartlett, P., Smola, A. and Williamson, R. (1998). Support vector regression with automatic accuracy control. In ICANN 98; Springer: Heidelberg, Germany, 111–116.
  • Schölkopf et al. (2000) Schölkopf, B., Smola, A. J., Williamson, R. C. and Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12(5), 1207–1245.
  • Schmeidler (1989) Schmeidler, D. (1989). Subjective probability and expected utility without additivity. Econometrica, 57(3), 571–587.
  • Shafieezadeh-Abadeh et al. (2023) Shafieezadeh-Abadeh, S., Aolaritei, L., Dörfler, F. and Kuhn, D. (2023). New perspectives on regularization and computation in optimal transport-based distributionally robust optimization. arXiv:2303.03900.
  • Shafieezadeh-Abadeh et al. (2019) Shafieezadeh-Abadeh, S., Kuhn, D. and Esfahani, P. M. (2019). Regularization via mass transportation. Journal of Machine Learning Research, 20, 1–68.
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge.
  • Sim et al. (2021) Sim, M., Zhao, L. and Zhou, M. (2021). Tractable robust supervised learning models. SSRN:3981205.
  • Suykens and Vandewalle (1999) Suykens, J. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
  • Volpi et al. (2018) Volpi, R., Namkoong, H., Sener, O., Duchi, J. C., Murino, V. and Savarese, S. (2018). Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 5334–5344.
  • Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-asymptotic Viewpoint. Volume 48 (Cambridge University Press).
  • Wang et al. (2020) Wang, R., Wei, Y. and Willmot, G. E. (2020). Characterization, robustness and aggregation of signed Choquet integrals. Mathematics of Operations Research, 45(3), 993–1015.
  • Wozabal (2014) Wozabal, D. (2014). Robustifying convex risk measures for linear portfolios: A nonparametric approach. Operations Research, 62(6), 1302–1315.
  • Yaari (1987) Yaari, M.E. (1987). The dual theory of choice under risk. Econometrica, 55(1), 95–115.
  • Zhang et al. (2024) Zhang, L., Yang, J. and Gao, R. (2024). Optimal robust policy for feature-based newsvendor. Management Science, 70(4), 2315–2329.

Appendix A A Proofs of Section 3

Before presenting the proofs, we first introduce a lemma inspired by the optimal coupling (see, e.g., Theorem 4.1 of Villani (2009)); we thank a referee for bringing this to our attention. The lemma will be used throughout the proofs in the appendix.

For mm\in\mathbb{N}, a function c:m×m{+}c:\mathbb{R}^{m}\times\mathbb{R}^{m}\to\mathbb{R}\cup\{+\infty\} is a cost function if c(𝐱,𝐲)=c(𝐲,𝐱)0c(\mathbf{x},\mathbf{y})=c(\mathbf{y},\mathbf{x})\geqslant 0 for all 𝐱,𝐲m\mathbf{x},\mathbf{y}\in\mathbb{R}^{m}. For a given cost function cc, define Wc(F,F0)=infπΠ(F,F0)𝔼π[c(𝝃,𝝃0)]W_{c}(F,F_{0})=\inf_{\pi\in\Pi(F,F_{0})}\mathbb{E}^{\pi}[c(\bm{\xi},\bm{\xi}_{0})] for two distributions F,F0F,F_{0} on m\mathbb{R}^{m}.

Lemma 1.

Let c:m×m{+}c:\mathbb{R}^{m}\times\mathbb{R}^{m}\to\mathbb{R}\cup\{+\infty\} with mm\in\mathbb{N} be a lower semicontinuous cost function. Given two distributions F,F0F,F_{0} on m\mathbb{R}^{m} and 𝐗0F0\mathbf{X}_{0}\sim F_{0}, there exists 𝐗F\mathbf{X}\sim F such that Wc(F,F0)=𝔼[c(𝐗,𝐗0)]W_{c}(F,F_{0})=\mathbb{E}[c(\mathbf{X},\mathbf{X}_{0})]. As a result, we have for ε0\varepsilon\geqslant 0,

{G:Wc(G,F0)ε}={F𝐗:𝔼[c(𝐗,𝐗0)]ε}.\displaystyle\left\{G:W_{c}(G,F_{0})\leqslant\varepsilon\right\}=\left\{F_{\mathbf{X}}:\mathbb{E}[c(\mathbf{X},\mathbf{X}_{0})]\leqslant\varepsilon\right\}. (26)
Proof.

Note that by the optimal coupling (see, e.g., Theorem 4.1 of Villani (2009)), there exist \mathbf{Y}_{0}\sim F_{0} and \mathbf{Y}\sim F such that W_{c}(F,F_{0})=\mathbb{E}[c(\mathbf{Y},\mathbf{Y}_{0})]. Since \mathbf{X}_{0}\buildrel\mathrm{d}\over{=}\mathbf{Y}_{0}, by Theorem 8.17 of Kallenberg (2021), there exists \mathbf{X}\buildrel\mathrm{d}\over{=}\mathbf{Y} (so that \mathbf{X}\sim F) such that (\mathbf{X},\mathbf{X}_{0})\buildrel\mathrm{d}\over{=}(\mathbf{Y},\mathbf{Y}_{0}), and thus \mathbb{E}[c(\mathbf{X},\mathbf{X}_{0})]=\mathbb{E}[c(\mathbf{Y},\mathbf{Y}_{0})]=W_{c}(F,F_{0}). This completes the proof. \Box

Consider the type-pp Wasserstein balls ¯p\overline{\mathcal{B}}_{p} and p\mathcal{B}_{p} defined by (5) and (8), respectively. By Lemma 1, we have for (Y0,𝐗0)F0(Y_{0},\mathbf{X}_{0})\sim F_{0}, it holds that

¯p(F0,ε)={F(Y,𝐗):𝔼[d((Y,𝐗),(Y0,𝐗0))p]εp}andp(F𝐗0,ε)={F𝐗:𝔼[𝐗𝐗0p]εp}.\displaystyle\overline{\cal B}_{p}(F_{0},\varepsilon)=\{F_{(Y,\mathbf{X})}:\mathbb{E}[d((Y,\mathbf{X}),(Y_{0},\mathbf{X}_{0}))^{p}]\leqslant\varepsilon^{p}\}~{}~{}{\rm and}~{}~{}{\cal B}_{p}(F_{{\bf X}_{0}},\varepsilon)=\{F_{\mathbf{X}}:\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}\}.

The two formulas above will be used frequently throughout the appendix, and we may apply them directly without referencing Lemma 1.

Proof of Proposition 1. Note that

{FY𝐗:F(Y,𝐗)¯p(F0,ε)}\displaystyle\{F_{Y\bf X}:F_{(Y,{\bf X})}\in\overline{\cal B}_{p}(F_{0},\varepsilon)\} ={FY𝐗:𝔼[d((Y,𝐗),(Y0,𝐗0))p]εp}\displaystyle=\left\{F_{Y\bf X}:\mathbb{E}[d((Y,\mathbf{X}),(Y_{0},\mathbf{X}_{0}))^{p}]\leqslant\varepsilon^{p}\right\}
={FY0𝐗:𝔼[𝐗𝐗0p]εp}\displaystyle=\left\{F_{Y_{0}\bf X}:\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}\right\}
={FY0𝐗:𝔼[Y0𝐗Y0𝐗0p]εp}=:(1),\displaystyle=\left\{F_{Y_{0}\bf X}:\mathbb{E}[\|Y_{0}\mathbf{X}-Y_{0}\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}\right\}=:\mathcal{B}_{(1)},

where the first equality is due to Lemma 1, the second equality follows from the definition in (4) which implies Y=Y0Y=Y_{0} almost surely for any (Y,𝐗)(Y,{\bf X}) satisfying 𝔼[d((Y,𝐗),(Y0,𝐗0))p]εp\mathbb{E}[d((Y,\mathbf{X}),(Y_{0},\mathbf{X}_{0}))^{p}]\leqslant\varepsilon^{p}, and the third equality follows from 𝐗𝐗0=Y0𝐗Y0𝐗0\|\mathbf{X}-\mathbf{X}_{0}\|=\|Y_{0}\mathbf{X}-Y_{0}\mathbf{X}_{0}\| as |Y0|=1|Y_{0}|=1. Additionally, it follows from Lemma 1 that

p(FY0𝐗𝟎,ε)={F𝐙:𝔼[𝐙Y0𝐗𝟎p]εp}=:(2),\displaystyle{\cal B}_{p}(F_{Y_{0}\bf X_{0}},\varepsilon)=\left\{F_{\bf Z}:\mathbb{E}[\|{\bf Z}-Y_{0}{\bf X_{0}}\|^{p}]\leqslant\varepsilon^{p}\right\}=:\mathcal{B}_{(2)},

and clearly \mathcal{B}_{(1)}\subseteq\mathcal{B}_{(2)}. To see the converse inclusion, note that for any F\in\mathcal{B}_{(2)}, there exists \mathbf{Z} such that \mathbf{Z}\sim F and \mathbb{E}[\|\mathbf{Z}-Y_{0}\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}. Taking (Y,\mathbf{X})=(Y_{0},\mathbf{Z}/Y_{0}), we have Y=Y_{0} and

𝔼[Y0𝐗Y0𝐗0p]=𝔼[𝐙Y0𝐗0p]εp.\mathbb{E}[\|Y_{0}{\bf X}-Y_{0}{\bf X}_{0}\|^{p}]=\mathbb{E}\left[\left\|{\bf Z}-Y_{0}{\bf X}_{0}\right\|^{p}\right]\leqslant\varepsilon^{p}.

Hence, we have F(1)F\in\mathcal{B}_{(1)}, and thus (2)(1)\mathcal{B}_{(2)}\subseteq\mathcal{B}_{(1)}. Therefore, we have (1)=(2)\mathcal{B}_{(1)}=\mathcal{B}_{(2)}. This completes the proof of (10). To show (11), denote by

1=¯p|𝜷(F0,ε),2=p(FY0𝜷𝐗𝟎,ε𝜷)and3={F𝜷𝐙:F𝐙pms(FY0𝐗0,ε)}.\displaystyle\mathcal{B}_{1}=\overline{\mathcal{B}}_{p|\bm{\beta}}(F_{0},\varepsilon),~{}~{}\mathcal{B}_{2}=\mathcal{B}_{p}\left(F_{Y_{0}\cdot\bm{\beta}^{\top}\bf X_{0}},\varepsilon\|\bm{\beta}\|_{*}\right)~{}~{}{\rm and}~{}~{}\mathcal{B}_{3}=\{F_{\bm{\beta}^{\top}\mathbf{Z}}:F_{\bf Z}\in\mathcal{B}_{p}^{\rm ms}(F_{Y_{0}\mathbf{X}_{0}},\varepsilon)\}.

We first verify that \mathcal{B}_{1}=\mathcal{B}_{2}. With the aid of (10), this can be verified similarly to the proof of Theorem 5 in Mao et al. (2022); for completeness, we give a proof of \mathcal{B}_{1}=\mathcal{B}_{2} here. The case \|\bm{\beta}\|=0 is trivial, so we now assume \|\bm{\beta}\|>0. Note that

1\displaystyle\mathcal{B}_{1} ={FY𝜷𝐗:F(Y,𝐗)¯p(F0,ε)}\displaystyle=\{F_{Y\cdot\bm{\beta}^{\top}\mathbf{X}}:F_{(Y,\mathbf{X})}\in\overline{\cal B}_{p}(F_{0},\varepsilon)\}
={FY𝜷𝐗:𝔼[d((Y,𝐗),(Y0,𝐗0))p]εp}={FY0𝜷𝐗:𝔼[𝐗𝐗0p]εp},\displaystyle=\left\{F_{Y\cdot\bm{\beta}^{\top}\mathbf{X}}:\mathbb{E}[d((Y,\mathbf{X}),(Y_{0},\mathbf{X}_{0}))^{p}]\leqslant\varepsilon^{p}\right\}=\left\{F_{Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}}:\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}\right\}, (27)

where the last equality follows from the definition in (4) and Lemma 1. On one hand, for F1F\in\mathcal{B}_{1}, there exists 𝐗\mathbf{X} such that Y0𝜷𝐗FY_{0}\cdot\bm{\beta}^{\top}\mathbf{X}\sim F and 𝔼[𝐗𝐗0p]εp\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}. Thus, we have

𝔼[|Y0𝜷𝐗Y0𝜷𝐗0|p]\displaystyle\mathbb{E}[|Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}-Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}_{0}|^{p}] =𝔼[|𝜷𝐗𝜷𝐗0|p]𝜷p𝔼[𝐗𝐗0p]εp𝜷p,\displaystyle=\mathbb{E}[|\bm{\beta}^{\top}\mathbf{X}-\bm{\beta}^{\top}\mathbf{X}_{0}|^{p}]\leqslant\|\bm{\beta}\|_{*}^{p}\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}, (28)

where we use Hölder’s inequality in the first inequality. This means F=FY0𝜷𝐗2F=F_{Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}}\in\mathcal{B}_{2}, and hence, 12\mathcal{B}_{1}\subseteq\mathcal{B}_{2}. On the other hand, for F2F\in\mathcal{B}_{2}, using Lemma 1 yields that there exists ZFZ\sim F such that 𝔼[|ZY0𝜷𝐗𝟎|p]εp𝜷p\mathbb{E}[|Z-Y_{0}\cdot\bm{\beta}^{\top}{\bf X_{0}}|^{p}]\leqslant\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}. It follows from the definition of dual norm that there exists 𝜷0n\bm{\beta}_{0}\in\mathbb{R}^{n} such that 𝜷0=1\|\bm{\beta}_{0}\|=1 and 𝜷=𝜷𝜷0\|\bm{\beta}\|_{*}=\bm{\beta}^{\top}\bm{\beta}_{0}. Define

T=ZY0𝜷𝐗𝟎and𝐗=𝐗0+𝜷0TY0𝜷.T=Z-Y_{0}\cdot\bm{\beta}^{\top}{\bf X_{0}}~{}~{}\mbox{and}~{}~{}{\mathbf{X}}={\mathbf{X}_{0}}+\frac{\bm{\beta}_{0}T}{Y_{0}\|\bm{\beta}\|_{*}}.

It holds that 𝔼[|T|p]εp𝜷p\mathbb{E}[|T|^{p}]\leqslant\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}, and thus,

𝔼[𝐗𝐗0p]=𝔼[𝜷0TY0𝜷p]=𝔼[|T|p]𝜷pεp.\displaystyle\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]=\mathbb{E}\left[\left\|\frac{\bm{\beta}_{0}T}{Y_{0}\|\bm{\beta}\|_{*}}\right\|^{p}\right]=\frac{\mathbb{E}[|T|^{p}]}{\|\bm{\beta}\|_{*}^{p}}\leqslant\varepsilon^{p}.

It follows from (27) that F_{Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}}\in\mathcal{B}_{1}. Noting that Z=Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X} and Z\sim F, we have F\in\mathcal{B}_{1}. Hence, we conclude \mathcal{B}_{2}\subseteq\mathcal{B}_{1}. This completes the proof of \mathcal{B}_{1}=\mathcal{B}_{2}.

It remains to prove that 132\mathcal{B}_{1}\subseteq\mathcal{B}_{3}\subseteq\mathcal{B}_{2}. For G1,G2p(n)G_{1},G_{2}\in\mathcal{M}_{p}(\mathbb{R}^{n}), denote

Wpms(G1,G2):=sup𝜸:𝜸=1infπΠ(G1,G2)(𝔼π[|𝜸𝝃1𝜸𝝃2|p])1/p,\displaystyle{W}_{p}^{\rm ms}(G_{1},G_{2}):=\sup_{\bm{\gamma}:\|\bm{\gamma}\|_{*}=1}\inf_{\pi\in\Pi(G_{1},G_{2})}\left(\mathbb{E}^{\pi}[|\bm{\gamma}^{\top}\bm{\xi}_{1}-\bm{\gamma}^{\top}\bm{\xi}_{2}|^{p}]\right)^{1/p},

and thus, \mathcal{B}_{3}=\{F_{\bm{\beta}^{\top}\mathbf{Z}}:{W}_{p}^{\rm ms}(F_{\mathbf{Z}},F_{Y_{0}\mathbf{X}_{0}})\leqslant\varepsilon\}. For F\in\mathcal{B}_{1}, using (27), there exists \mathbf{X} such that Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}\sim F and \mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}. It holds that

Wpms(FY0𝐗,FY0𝐗0)p\displaystyle{W}_{p}^{\rm ms}(F_{Y_{0}\mathbf{X}},F_{Y_{0}\mathbf{X}_{0}})^{p} =sup𝜸:𝜸=1infπΠ(FY0𝐗,FY0𝐗0)𝔼π[|𝜸𝝃1𝜸𝝃2|p]\displaystyle=\sup_{\bm{\gamma}:\|\bm{\gamma}\|_{*}=1}\inf_{\pi\in\Pi(F_{Y_{0}\mathbf{X}},F_{Y_{0}\mathbf{X}_{0}})}\mathbb{E}^{\pi}[|\bm{\gamma}^{\top}\bm{\xi}_{1}-\bm{\gamma}^{\top}\bm{\xi}_{2}|^{p}]
sup𝜸:𝜸=1𝔼[|Y0𝜸(𝐗𝐗0)|p]\displaystyle\leqslant\sup_{\bm{\gamma}:\|\bm{\gamma}\|_{*}=1}\mathbb{E}[|Y_{0}\cdot\bm{\gamma}^{\top}(\mathbf{X}-\mathbf{X}_{0})|^{p}]
sup𝜸:𝜸=1𝜸p𝔼[𝐗𝐗0p]\displaystyle\leqslant\sup_{\bm{\gamma}:\|\bm{\gamma}\|_{*}=1}\|\bm{\gamma}\|_{*}^{p}\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]
=𝔼[𝐗𝐗0p]εp.\displaystyle=\mathbb{E}[\|\mathbf{X}-\mathbf{X}_{0}\|^{p}]\leqslant\varepsilon^{p}.

This implies F=FY0𝜷𝐗3F=F_{Y_{0}\cdot\bm{\beta}^{\top}\mathbf{X}}\in\mathcal{B}_{3}, and hence, 13\mathcal{B}_{1}\subseteq\mathcal{B}_{3}. Suppose now F3F\in\mathcal{B}_{3}. There exists 𝐙\mathbf{Z} such that 𝜷𝐙F\bm{\beta}^{\top}\mathbf{Z}\sim F and Wpms(F𝐙,FY0𝐗0)ε{W}_{p}^{\rm ms}(F_{\mathbf{Z}},F_{Y_{0}\mathbf{X}_{0}})\leqslant\varepsilon. It follows that

εp\displaystyle\varepsilon^{p} sup𝜸:𝜸=1infπΠ(F𝐙,FY0𝐗0)𝔼π[|𝜸𝝃1𝜸𝝃2|p]\displaystyle\geqslant\sup_{\bm{\gamma}:\|\bm{\gamma}\|_{*}=1}\inf_{\pi\in\Pi\left(F_{\mathbf{Z}},F_{Y_{0}\mathbf{X}_{0}}\right)}\mathbb{E}^{\pi}[|\bm{\gamma}^{\top}\bm{\xi}_{1}-\bm{\gamma}^{\top}\bm{\xi}_{2}|^{p}]
infπΠ(F𝐙,FY0𝐗0)1𝜷p𝔼π[|𝜷𝝃1𝜷𝝃2|p]\displaystyle\geqslant\inf_{\pi\in\Pi\left(F_{\mathbf{Z}},F_{Y_{0}\mathbf{X}_{0}}\right)}\frac{1}{\|\bm{\beta}\|_{*}^{p}}\mathbb{E}^{\pi}[|{\bm{\beta}}^{\top}\bm{\xi}_{1}-{\bm{\beta}}^{\top}\bm{\xi}_{2}|^{p}]
1𝜷pinfπΠ(F𝜷𝐙,FY0𝜷𝐗0)𝔼π[|ξ1ξ2|p],\displaystyle\geqslant\frac{1}{\|\bm{\beta}\|_{*}^{p}}\inf_{\pi\in\Pi\left(F_{{\bm{\beta}}^{\top}\mathbf{Z}},F_{Y_{0}\cdot{\bm{\beta}}^{\top}\mathbf{X}_{0}}\right)}\mathbb{E}^{\pi}[|\xi_{1}-\xi_{2}|^{p}],

where the second inequality follows from 𝜷/𝜷{𝜸:𝜸=1}\bm{\beta}/\|\bm{\beta}\|_{*}\in\{\bm{\gamma}:\|\bm{\gamma}\|_{*}=1\}. This implies

Wp(F𝜷𝐙,FY0𝜷𝐗0)=infπΠ(F𝜷𝐙,FY0𝜷𝐗0)(𝔼π[|ξ1ξ2|p])1/p𝜷ε.\displaystyle W_{p}\left(F_{{\bm{\beta}}^{\top}\mathbf{Z}},F_{Y_{0}\cdot{\bm{\beta}}^{\top}\mathbf{X}_{0}}\right)=\inf_{\pi\in\Pi\left(F_{{\bm{\beta}}^{\top}\mathbf{Z}},F_{Y_{0}\cdot{\bm{\beta}}^{\top}\mathbf{X}_{0}}\right)}\left(\mathbb{E}^{\pi}[|\xi_{1}-\xi_{2}|^{p}]\right)^{1/p}\leqslant\|\bm{\beta}\|_{*}\varepsilon.

Hence, we have F=F𝜷𝐙2F=F_{\bm{\beta}^{\top}\mathbf{Z}}\in\mathcal{B}_{2}, which yields 32\mathcal{B}_{3}\subseteq\mathcal{B}_{2}. This completes the proof. ∎
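
For p=1, the chain of inequalities above can also be illustrated numerically. The sketch below is our own addition: it relies on scipy.stats.wasserstein_distance (which computes the order-1 distance between one-dimensional samples), uses the Euclidean norm so that \|\bm{\gamma}\|_{*}=\|\bm{\gamma}\|, approximates the supremum over directions by a grid, and omits Y_{0} since |\bm{\gamma}^{\top}Y_{0}(\mathbf{X}-\mathbf{X}_{0})|=|\bm{\gamma}^{\top}(\mathbf{X}-\mathbf{X}_{0})| when Y_{0}\in\{-1,1\}.

```python
import numpy as np
from scipy.stats import wasserstein_distance   # order-1 Wasserstein distance in one dimension

rng = np.random.default_rng(1)
N, eps = 20_000, 0.3
X0 = rng.normal(size=(N, 2))
D = rng.normal(size=(N, 2))
D *= eps / np.mean(np.linalg.norm(D, axis=1))  # paired perturbation with E||X - X0|| = eps
X = X0 + D

# Max-sliced W1 over a grid of unit directions gamma (Euclidean norm, so ||gamma||_* = 1).
angles = np.linspace(0.0, np.pi, 181)
ms_w1 = max(
    wasserstein_distance(X @ np.array([np.cos(a), np.sin(a)]),
                         X0 @ np.array([np.cos(a), np.sin(a)]))
    for a in angles
)
print(ms_w1, "<=", eps)   # the max-sliced distance stays below the perturbation budget
```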

The proof of Theorem 1 relies on our projection result in Proposition 1 and a concentration result for the max-sliced Wasserstein distance in Theorem 3 of Olea et al. (2022), which is given as follows.

Lemma 2 (Theorem 3 of Olea et al. (2022)).

Let p\geqslant 1 and \eta\in(0,1). Suppose that \widetilde{F}^{*}\in\mathcal{M}_{p}(\mathbb{R}^{n}) satisfies \Gamma:=\mathbb{E}^{\widetilde{F}^{*}}[\|\mathbf{X}\|^{s}]<\infty for some s>2p, and let \varepsilon_{p,N}(\eta) be as defined in Theorem 1. Then, we have

(F~pms(F~N,εp,N(η)))1η,\displaystyle\mathbb{P}\left(\widetilde{F}^{*}\in{\cal B}_{p}^{\rm ms}\big{(}\widetilde{F}_{N},\varepsilon_{p,N}(\eta)\big{)}\right)\geqslant 1-\eta,

where \widetilde{F}_{N} is the empirical distribution of an i.i.d. sample of size N drawn from \widetilde{F}^{*}.

Proof of Theorem 1. Denote by F~N=1Ni=1Nδy^i𝐱^i\widetilde{F}_{N}=\frac{1}{N}\sum_{i=1}^{N}\delta_{\widehat{y}_{i}\widehat{\mathbf{x}}_{i}}, F𝜷=FY𝜷𝐗F^{*}_{\bm{\beta}}=F_{Y^{*}\cdot\bm{\beta}^{\top}\mathbf{X}^{*}}, where (Y,𝐗)F(Y^{*},\mathbf{X}^{*})\sim F^{*}, and εN=εp,N(η)\varepsilon_{N}=\varepsilon_{p,N}(\eta). It is clear that F~N\widetilde{F}_{N} is an empirical distribution of FY𝐗F_{Y^{*}\mathbf{X}^{*}}. It holds that

(ρF(Y𝜷𝐗)supF¯p(F^N,εN)ρF(Y𝜷𝐗),𝜷𝒟)\displaystyle\mathbb{P}\left(\rho^{F^{*}}(Y\cdot\bm{\beta}^{\top}\mathbf{X})\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N})}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}),~{}\forall\bm{\beta}\in\mathcal{D}\right)
=(ρF𝜷(Z)supF¯p|𝜷(F^N,εN)ρF(Z),𝜷𝒟)\displaystyle=\mathbb{P}\left(\rho^{F^{*}_{\bm{\beta}}}(Z)\leqslant\sup_{F\in\overline{\mathcal{B}}_{p|\bm{\beta}}(\widehat{F}_{N},\varepsilon_{N})}\rho^{F}(Z),~{}\forall\bm{\beta}\in\mathcal{D}\right)
(F𝜷¯p|𝜷(F^N,εN),𝜷𝒟)\displaystyle\geqslant\mathbb{P}\left(F^{*}_{\bm{\beta}}\in\overline{\mathcal{B}}_{p|\bm{\beta}}(\widehat{F}_{N},\varepsilon_{N}),~{}\forall\bm{\beta}\in\mathcal{D}\right)
\displaystyle=\mathbb{P}\left(F^{*}_{\bm{\beta}}\in\{F_{\bm{\beta}^{\top}\bf Z}:F_{\bf Z}\in{\cal B}_{p}^{\rm ms}(\widetilde{F}_{N},\varepsilon_{N})\},~{}\forall\bm{\beta}\in\mathcal{D}\right)
(FY𝐗pms(F~N,εN))1η,\displaystyle\geqslant\mathbb{P}\left(F_{Y^{*}\mathbf{X}^{*}}\in{\cal B}_{p}^{\rm ms}\big{(}\widetilde{F}_{N},\varepsilon_{N}\big{)}\right)\geqslant 1-\eta,

where the second equality follows from ¯p|𝜷(F^N,εN)={F𝜷𝐙:F𝐙pms(F~N,εN)}\overline{\mathcal{B}}_{p|\bm{\beta}}(\widehat{F}_{N},\varepsilon_{N})=\{F_{\bm{\beta}^{\top}\bf Z}:F_{\bf Z}\in{\cal B}_{p}^{\rm ms}(\widetilde{F}_{N},\varepsilon_{N})\} by (ii) of Proposition 1 and we have used Lemma 2 in the last inequality. This completes the proof.

Proof of Theorem 2.

Let (Y^N,𝐗^N)F^N(\widehat{Y}_{N},\widehat{\mathbf{X}}_{N})\sim\widehat{F}_{N}. By the definition of ¯p|𝜷(F^N,ε)\overline{\mathcal{B}}_{p|\bm{\beta}}(\widehat{F}_{N},\varepsilon), it follows that

supF¯p(F^N,ε)ρF(Y𝜷𝐗)\displaystyle\sup_{F\in\overline{{\cal B}}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}) =supF¯p|𝜷(F^N,ε)ρF(Z)\displaystyle=\sup_{F\in\overline{\mathcal{B}}_{p|\bm{\beta}}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Z)
=sup{ρ(Z):(𝔼[|ZY^N𝜷𝐗^N|p])1/pε𝜷}\displaystyle=\sup\left\{\rho(Z):\left(\mathbb{E}[|Z-\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}|^{p}]\right)^{1/p}\leqslant\varepsilon\|\bm{\beta}\|_{*}\right\}
=sup{ρ(Y^N𝜷𝐗^N+V):(𝔼[|V|p])1/pε𝜷},\displaystyle=\sup\left\{\rho\left(\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}+V\right):(\mathbb{E}[|V|^{p}])^{1/p}\leqslant\varepsilon\|\bm{\beta}\|_{*}\right\},

where the second equality follows from (11) in Proposition 1, which states that

¯p|𝜷(F^N,ε)=p(FY^N𝜷𝐗^N,ε𝜷)={FZ:(𝔼[|ZY^N𝜷𝐗^N|p])1/pε𝜷}.\overline{\mathcal{B}}_{p|\bm{\beta}}(\widehat{F}_{N},\varepsilon)=\mathcal{B}_{p}(F_{\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}},\varepsilon\|\bm{\beta}\|_{*})=\left\{F_{Z}:\left(\mathbb{E}[|Z-\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}|^{p}]\right)^{1/p}\leqslant\varepsilon\|\bm{\beta}\|_{*}\right\}.

In particular, we have for any ε>0\varepsilon>0

supF¯(F^N,ε)ρF(Y𝜷𝐗)\displaystyle\sup_{F\in\overline{{\cal B}}_{\infty}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}) =sup{ρ(Y^N𝜷𝐗^N+V):|V|ε𝜷}\displaystyle=\sup\left\{\rho\left(\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}+V\right):|V|\leqslant\varepsilon\|\bm{\beta}\|_{*}\right\} (29)
supF¯1(F^N,ε)ρF(Y𝜷𝐗)\displaystyle\sup_{F\in\overline{{\cal B}}_{1}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}) =sup{ρ(Y^N𝜷𝐗^N+V):𝔼[|V|]ε𝜷}.\displaystyle=\sup\left\{\rho\left(\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}+V\right):\mathbb{E}[|V|]\leqslant\varepsilon\|\bm{\beta}\|_{*}\right\}. (30)

Therefore, we have

supF¯p(F^N,λε)ρF(Y𝜷𝐗)\displaystyle\sup_{F\in\overline{{\cal B}}_{p}(\widehat{F}_{N},\lambda\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}) supF¯(F^N,λε)ρF(Y𝜷𝐗)\displaystyle\geqslant\sup_{F\in\overline{{\cal B}}_{\infty}(\widehat{F}_{N},\lambda\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X})
=sup{ρ(Y^N𝜷𝐗^N+V):|V|λε𝜷}\displaystyle=\sup\left\{\rho(\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}+V):|V|\leqslant\lambda\varepsilon\|\bm{\beta}\|_{*}\right\}
sup{ρ(Y^N𝜷𝐗^N+V):𝔼[|V|]ε𝜷}\displaystyle\geqslant\sup\left\{\rho(\widehat{Y}_{N}\cdot\bm{\beta}^{\top}\widehat{\mathbf{X}}_{N}+V):\mathbb{E}[|V|]\leqslant\varepsilon\|\bm{\beta}\|_{*}\right\}
=supF¯1(F^N,ε)ρF(Y𝜷𝐗),𝜷𝒟,ε0,\displaystyle=\sup_{F\in\overline{{\cal B}}_{1}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot\bm{\beta}^{\top}\mathbf{X}),~{}~{}\forall\bm{\beta}\in\mathcal{D},~{}\varepsilon\geqslant 0, (31)

where the first inequality is due to \overline{{\cal B}}_{\infty}(\widehat{F}_{N},\lambda\varepsilon)\subseteq\overline{{\cal B}}_{p}(\widehat{F}_{N},\lambda\varepsilon), the second inequality is due to Assumption 1, and the two equalities come from (29) and (30), respectively. Combining (31) and Theorem 1 with the type-1 Wasserstein ball yields the desired result.

Appendix B Proofs of Section 4

For ease of exposition, we introduce the following notation. For a random variable Z, we define \|Z\|_{p}=(\mathbb{E}[|Z|^{p}])^{1/p}. We use \mathds{1}_{A} to represent the indicator function, i.e., \mathds{1}_{A}(\omega)=1 if \omega\in A, and \mathds{1}_{A}(\omega)=0 otherwise. The sign function on \mathbb{R} is defined as

sign(x)=𝟙(,0)(x)+𝟙[0,)(x).{\rm sign}(x)=-\mathds{1}_{(-\infty,0)}(x)+\mathds{1}_{[0,\infty)}(x).

To prove the results in Section 4, we need the following auxiliary lemmas. The first lemma follows immediately from Proposition 1.

Lemma 3.

Given 𝛃n\bm{\beta}\in\mathbb{R}^{n}, we have

supFp(G0,ε)ρF(𝜷𝐙)=supFp(F𝜷𝐙0,ε𝜷)ρF(Z),\displaystyle\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{Z})=\sup_{F\in\mathcal{B}_{p}\left(F_{\bm{\beta}^{\top}\mathbf{Z}_{0}},\varepsilon\|\bm{\beta}\|_{*}\right)}\rho^{F}(Z),

where 𝐙0G0\mathbf{Z}_{0}\sim G_{0} and F𝛃𝐙0F_{\bm{\beta}^{\top}\mathbf{Z}_{0}} is the distribution of 𝛃𝐙0\bm{\beta}^{\top}\mathbf{Z}_{0}.

Based on Lemma 3, we present the following lemma, which will be used in the proofs of the main results.

Lemma 4.

Let p[1,]p\in[1,\infty] and C>0C>0. For ρ:Lp\rho:L^{p}\to\mathbb{R}, the following statements are equivalent.

  • (i)

    For any G0p(n)G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}), ε0\varepsilon\geqslant 0 and 𝒟n\mathcal{D}\subseteq\mathbb{R}^{n},

    inf𝜷𝒟supFp(G0,ε)ρF(𝜷𝐙)=inf𝜷𝒟{ρG0(𝜷𝐙)+Cε𝜷}.\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{Z})=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho^{G_{0}}(\bm{\beta}^{\top}\mathbf{Z})+C\varepsilon\|\bm{\beta}\|_{*}\right\}. (32)
  • (ii)

    For any ZLpZ\in L^{p} and ε0\varepsilon\geqslant 0,

    supVpερ(Z+V)=ρ(Z)+Cε.\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho(Z+V)=\rho(Z)+C\varepsilon. (33)
Proof.

Proof. (i) \Rightarrow (ii): For ZLpZ\in L^{p}, take 𝒟={𝜷0}\mathcal{D}=\{\bm{\beta}_{0}\} with 𝜷0=(1,0,,0)/(1,0,,0)\bm{\beta}_{0}=(1,0,\dots,0)/\|(1,0,\dots,0)\|_{*}, and let G0G_{0} be the distribution of (Z,0,,0)(Z,0,\dots,0). It holds that 𝜷0=1\|\bm{\beta}_{0}\|_{*}=1, and thus, the left-hand side of (32) reduces to

inf𝜷𝒟supFp(G0,ε)ρF(𝜷𝐙)=supFp(G0,ε)ρF(𝜷0𝐙)=supFp(FZ,ε)ρF(Y)=supYZpερ(Y)=supVpερ(Z+V),\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{Z})=\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}_{0}^{\top}\mathbf{Z})=\sup_{F\in\mathcal{B}_{p}(F_{Z},\varepsilon)}\rho^{F}(Y)=\sup_{\|Y-Z\|_{p}\leqslant\varepsilon}\rho(Y)=\sup_{\|V\|_{p}\leqslant\varepsilon}\rho(Z+V),

where the second equality follows from Lemma 3. Note that the right-hand side of (32) reduces to

inf𝜷𝒟{ρG0(𝜷𝐙)+Cε𝜷}=ρ(Z)+Cε.\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho^{G_{0}}(\bm{\beta}^{\top}\mathbf{Z})+C\varepsilon\|\bm{\beta}\|_{*}\right\}=\rho(Z)+C\varepsilon.

Combining the above two equations with (32) yields (33). This completes the proof of the implication (i) \Rightarrow (ii).

(ii) \Rightarrow (i): For G0p(n)G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}), ε0\varepsilon\geqslant 0 and 𝒟n\mathcal{D}\subseteq\mathbb{R}^{n}, let 𝐙0G0\mathbf{Z}_{0}\sim G_{0}. We have for any 𝜷𝒟\bm{\beta}\in\mathcal{D},

supFp(G0,ε)ρF(𝜷𝐙)\displaystyle\sup_{F\in{\cal B}_{p}(G_{0},\varepsilon)}\rho^{F}(\bm{\beta}^{\top}\mathbf{Z}) =supFp(F𝜷𝐙0,ε𝜷)ρF(Z)\displaystyle=\sup_{F\in\mathcal{B}_{p}\left(F_{\bm{\beta}^{\top}\mathbf{Z}_{0}},\varepsilon\|\bm{\beta}\|_{*}\right)}\rho^{F}(Z)
=supZ𝜷𝐙0pε𝜷ρ(Z)\displaystyle=\sup_{\|Z-\bm{\beta}^{\top}\mathbf{Z}_{0}\|_{p}\leqslant\varepsilon\|\bm{\beta}\|_{*}}\rho(Z)
=supVpε𝜷ρ(𝜷𝐙0+V)=ρ(𝜷𝐙0)+Cε𝜷,\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon\|\bm{\beta}\|_{*}}\rho(\bm{\beta}^{\top}\mathbf{Z}_{0}+V)=\rho(\bm{\beta}^{\top}\mathbf{Z}_{0})+C\varepsilon\|\bm{\beta}\|_{*},

where the first equality follows from Lemma 3 and the last equality is due to (33). Hence, (32) holds immediately, and thus, we complete the proof.∎

Remark 5.

In Lemma 4, neither the distribution G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}) nor the set \mathcal{D}\subseteq\mathbb{R}^{n} is fixed. If we fix G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}) and \mathcal{D}\subseteq\mathbb{R}^{n}, then by similar arguments, we can show that (32) holds for all \varepsilon\geqslant 0 if (33) holds with Z=\bm{\beta}^{\top}\mathbf{Z}_{0} for all \bm{\beta}\in\mathcal{D} and \varepsilon\geqslant 0, where \mathbf{Z}_{0}\sim G_{0}.

B.1 Proofs of Section 4.1

We present the following lemma, which will be used in the proofs of Theorems 3 and 4.

Lemma 5.

Let p(1,)p\in(1,\infty), ε>0\varepsilon>0 and η(0,ε]\eta\in(0,\varepsilon]. Define 𝒱={VLp:Vpε,𝔼[|V|𝟙{|V|2ε}]η}\mathcal{V}=\{V\in L^{p}:\|V\|_{p}\leqslant\varepsilon,~{}\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]\leqslant\eta\}. We have 𝔼[|V|]ε2p/q+η\mathbb{E}[|V|]\leqslant\varepsilon 2^{-p/q}+\eta for all V𝒱V\in\mathcal{V}.

Proof.

Proof. Using Chebyshev's inequality, we have (2\varepsilon)^{p}\mathbb{P}(|V|>2\varepsilon)\leqslant\|V\|_{p}^{p}, which implies \mathbb{P}(|V|>2\varepsilon)\leqslant 2^{-p} whenever \|V\|_{p}\leqslant\varepsilon. Hence, for any V\in\mathcal{V}, it holds that

𝔼[|V|]\displaystyle\mathbb{E}[|V|] =𝔼[|V|𝟙{|V|>2ε}]+𝔼[|V|𝟙{|V|2ε}]Vp((|V|>2ε))1/q+ηε2p/q+η,\displaystyle=\mathbb{E}\left[|V|\mathds{1}_{\{|V|>2\varepsilon\}}\right]+\mathbb{E}\left[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}\right]\leqslant\|V\|_{p}(\mathbb{P}(|V|>2\varepsilon))^{1/q}+\eta\leqslant\varepsilon 2^{-p/q}+\eta,

where the first inequality follows from Hölder’s inequality. This completes the proof. ∎
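
Lemma 5 can be illustrated with a two-point distribution. The sketch below is our own addition (here q=p/(p-1) denotes the conjugate exponent; the spike location 3\varepsilon and the parameter triples are arbitrary choices, and the probability r is chosen so that \|V\|_{p}=\varepsilon exactly, so both constraints defining \mathcal{V} hold).

```python
import numpy as np

def lemma5_check(p, eps, eta, spike):
    """Two-point V: value `spike` (> 2*eps) with prob. r, value eta (< 2*eps) otherwise,
    with r chosen so that E|V|^p = eps^p; then E[|V| 1{|V| <= 2 eps}] = eta*(1-r) <= eta."""
    q = p / (p - 1.0)
    r = (eps ** p - eta ** p) / (spike ** p - eta ** p)
    e_abs = spike * r + eta * (1.0 - r)                 # E|V|
    bound = eps * 2.0 ** (-p / q) + eta                 # bound claimed in Lemma 5
    return e_abs, bound

for p, eps, eta in [(2.0, 1.0, 0.1), (1.5, 0.7, 0.05), (3.0, 2.0, 0.4)]:
    e_abs, bound = lemma5_check(p, eps, eta, spike=3.0 * eps)
    print(f"p={p}: E|V| = {e_abs:.4f} <= {bound:.4f}: {e_abs <= bound}")
```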

Proof of Theorem 3. Throughout the proof, we assume C=1C=1 without loss of generality. By Lemma 4, it suffices to show that (ii) holds if and only if the following (i)’ holds.

  • (i)’

    For any Z\in L^{p} and \varepsilon\geqslant 0, (33) holds, i.e.,

    supVpε𝔼[(Z+V)]=𝔼[(Z)]+ε,ZLp,ε0.\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell(Z+V)]=\mathbb{E}[\ell(Z)]+\varepsilon,~{}~{}\forall Z\in L^{p},~{}\varepsilon\geqslant 0. (34)

To see (ii)\Rightarrow(i)’, note that 1(Z+V)=1(Z)±V\ell_{1}(Z+V)=\ell_{1}(Z)\pm V and thus,

supVpε𝔼[1(Z+V)]=𝔼[1(Z)]+supVpε𝔼[±V]=𝔼[1(Z)]+ε\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell_{1}(Z+V)]=\mathbb{E}[\ell_{1}(Z)]+\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\pm V]=\mathbb{E}[\ell_{1}(Z)]+\varepsilon

and

supVpε𝔼[2(Z+V)]\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell_{2}(Z+V)] =supVpε𝔼[|Z+Vb1|+b2]\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[|Z+V-b_{1}|+b_{2}]
supVpε{𝔼[|Zb1|+b2]+𝔼[|V|]}=𝔼[2(Z)]+ε.\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\mathbb{E}[|Z-b_{1}|+b_{2}]+\mathbb{E}[|V|]\right\}=\mathbb{E}[\ell_{2}(Z)]+\varepsilon.

The inequality reduces to equality if V=εsign(Zb1)V=\varepsilon\ {\rm sign}(Z-b_{1}). Hence, we complete the proof of (ii)\Rightarrow(i)’.
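
The maximizer V=\varepsilon\,{\rm sign}(Z-b_{1}) identified above is easy to confirm by simulation. The following sketch is our own addition, with \ell_{2}(x)=|x-b_{1}|+b_{2} as used above; the Gaussian sample and the values of p, \varepsilon, b_{1}, b_{2} are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, eps, b1, b2 = 200_000, 3.0, 0.25, 0.3, 1.0
Z = rng.normal(size=N)

ell2 = lambda x: np.abs(x - b1) + b2
V = eps * np.sign(Z - b1)          # |V| = eps a.s. for a continuous Z, so ||V||_p = eps
print(np.mean(np.abs(V) ** p) ** (1 / p))                       # ~ eps
print(np.mean(ell2(Z + V)), "vs", np.mean(ell2(Z)) + eps)       # the two sides coincide
```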

To see (i)’\Rightarrow(ii), we first show that {\rm Lip}(\ell)\leqslant 1. Otherwise, since a convex function is differentiable almost everywhere, there exists x such that |\ell^{\prime}(x)|>1. If \ell^{\prime}(x)>1, then we have

supVpε𝔼[(x+V)]\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell(x+V)] (x+ε)(x)+|(x)|ε>(x)+ε.\displaystyle\geqslant\ell(x+\varepsilon)\geqslant\ell(x)+|\ell^{\prime}(x)|\varepsilon>\ell(x)+\varepsilon.

If (x)<1\ell^{\prime}(x)<-1, then we have

supVpε𝔼[(x+V)]\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell(x+V)] (xε)(x)(x)ε>(x)+ε.\displaystyle\geqslant\ell(x-\varepsilon)\geqslant\ell(x)-\ell^{\prime}(x)\varepsilon>\ell(x)+\varepsilon.

Both cases yield a contradiction to (34).

Next, we aim to verify the following fact

|(x)|=1forallxaslongasisdifferentiableatx.\displaystyle|\ell^{\prime}(x)|=1~{}{\rm{for~{}all}}~{}x\in\mathbb{R}~{}{\rm as~{}long~{}as}~{}\ell~{}{\rm is~{}differentiable~{}at}~{}x. (35)

This will complete the proof since a convex function satisfying (35) must take one of the forms \ell_{1} or \ell_{2} with C=1. To see (35), assume by contradiction that there exists x_{0}\in\mathbb{R} such that |\ell^{\prime}(x_{0})|<1. We first consider the case p=\infty. In this case, we have

supVε𝔼[(x0+V)]=max{(x0ε),(x0+ε)}<(x0)+ε,\displaystyle\sup_{\|V\|_{\infty}\leqslant\varepsilon}\mathbb{E}[\ell(x_{0}+V)]=\max\{\ell(x_{0}-\varepsilon),\ell(x_{0}+\varepsilon)\}<\ell(x_{0})+\varepsilon,

where the strict inequality follows from |(x0)|<1|\ell^{\prime}(x_{0})|<1 and Lip()1{\rm Lip}(\ell)\leqslant 1, which yields a contradiction. Suppose now p(1,)p\in(1,\infty). Define

k=max{|(x0+2ε)(x0)2ε|,|(x0)(x02ε)2ε|}.k=\max\left\{\left|\frac{\ell(x_{0}+2\varepsilon)-\ell(x_{0})}{2\varepsilon}\right|,\left|\frac{\ell(x_{0})-\ell(x_{0}-2\varepsilon)}{2\varepsilon}\right|\right\}.

We have that |(x0+v)(x0)|k|v||\ell(x_{0}+v)-\ell(x_{0})|\leqslant k|v| for all v[2ε,2ε]v\in[-2\varepsilon,2\varepsilon] by the convexity of \ell, and thus,

supVpε𝔼[(x0+V)]\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell(x_{0}+V)] =supVpε{𝔼[(x0+V)𝟙{|V|2ε}]+𝔼[(x0+V)𝟙{|V|>2ε}]}\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\{\mathbb{E}[\ell(x_{0}+V)\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]+\mathbb{E}[\ell(x_{0}+V)\mathds{1}_{\{|V|>2\varepsilon\}}]\}
\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\{\mathbb{E}[(\ell(x_{0})+k|V|)\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]+\mathbb{E}[\ell(x_{0}+V)\mathds{1}_{\{|V|>2\varepsilon\}}]\}
supVpε{𝔼[((x0)+k|V|)𝟙{|V|2ε}]+𝔼[((x0)+|V|)𝟙{|V|>2ε}]}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\{\mathbb{E}[(\ell(x_{0})+k|V|)\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]+\mathbb{E}[(\ell(x_{0})+|V|)\mathds{1}_{\{|V|>2\varepsilon\}}]\}
=(x0)+supVpε{𝔼[|V|](1k)𝔼[|V|𝟙{|V|2ε}]}=:I,\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon}\{\mathbb{E}[|V|]-(1-k)\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]\}=:I,

where the second inequality follows from Lip()1{\rm Lip}(\ell)\leqslant 1. By defining

𝒱1={VLp:Vpε,𝔼[|V|𝟙{|V|2ε}](12p/q)ε2},𝒱2={V:Vpε}𝒱1,\displaystyle\mathcal{V}_{1}=\left\{V\in L^{p}:\|V\|_{p}\leqslant\varepsilon,~{}\mathbb{E}\left[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}\right]\leqslant\frac{(1-2^{-p/q})\varepsilon}{2}\right\},~{}~{}~{}\mathcal{V}_{2}=\{V:\|V\|_{p}\leqslant\varepsilon\}\setminus\mathcal{V}_{1}, (36)

we can rewrite II as I=max{I1,I2}I=\max\{I_{1},I_{2}\} with

Ii=(x0)+supV𝒱i{𝔼[|V|](1k)𝔼[|V|𝟙{|V|2ε}]},i=1,2.I_{i}=\ell(x_{0})+\sup_{V\in\mathcal{V}_{i}}\{\mathbb{E}[|V|]-(1-k)\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]\},~{}~{}i=1,2.

Since |(x0)|<1|\ell^{\prime}(x_{0})|<1 and Lip()1{\rm Lip}(\ell)\leqslant 1, we know that k<1k<1. Hence, it holds that

I1\displaystyle I_{1} (x0)+supV𝒱1𝔼[|V|](x0)+(1+2p/q)ε2<(x0)+ε,\displaystyle\leqslant\ell(x_{0})+\sup_{V\in\mathcal{V}_{1}}\mathbb{E}[|V|]\leqslant\ell(x_{0})+\frac{(1+2^{-p/q})\varepsilon}{2}<\ell(x_{0})+\varepsilon, (37)

where the second inequality follows from Lemma 5. Further note that

I2\displaystyle I_{2} (x0)+supV𝒱2{𝔼[|V|](1k)(12p/q)ε2}\displaystyle\leqslant\ell(x_{0})+\sup_{V\in\mathcal{V}_{2}}\left\{\mathbb{E}[|V|]-(1-k)\frac{(1-2^{-p/q})\varepsilon}{2}\right\}
(x0)+ε(1k)(12p/q)ε2\displaystyle\leqslant\ell(x_{0})+\varepsilon-(1-k)\frac{(1-2^{-p/q})\varepsilon}{2}
<(x0)+ε,\displaystyle<\ell(x_{0})+\varepsilon, (38)

where the first inequality follows from the definition of \mathcal{V}_{2}, and the second one holds because \|V\|_{p}\leqslant\varepsilon implies \mathbb{E}[|V|]\leqslant\varepsilon. Combining (37) and (38), we conclude that

supVpε𝔼[(x0+V)]max{I1,I2}<(x0)+ε.\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell(x_{0}+V)]\leqslant\max\{I_{1},I_{2}\}<\ell(x_{0})+\varepsilon.

This yields a contradiction. Hence, (35) is verified, and thus we complete the proof.

To prove Theorem 4, we begin by introducing two auxiliary lemmas. Specifically, Lemma 7 will be employed in the proof of Theorem 4, and Lemma 6 serves to establish Lemma 7.

Lemma 6.

For p(1,)p\in(1,\infty), ε>0\varepsilon>0 and η(0,ε)\eta\in(0,\varepsilon), the following problem has a positive optimal value.

min\displaystyle\min 𝔼[(Zε)+]\displaystyle~{}~{}\mathbb{E}[(Z-\varepsilon)_{+}] (39)
s.t.\displaystyle{\rm s.t.} 𝔼[Z]=ε,(𝔼[Z1/p])pεη,Z0.\displaystyle~{}~{}\mathbb{E}[Z]=\varepsilon,~{}~{}(\mathbb{E}[Z^{1/p}])^{p}\leqslant\varepsilon-\eta,~{}~{}Z\geqslant 0.
Proof.

Proof. Note that Problem (39) has two moment constraints. By (1995), the optimal value of Problem (39) equals that of

min\displaystyle\min 𝔼[(Zε)+]\displaystyle~{}~{}\mathbb{E}[(Z-\varepsilon)_{+}] (40)
s.t.\displaystyle{\rm s.t.} 𝔼[Z]=ε,(𝔼[Z1/p])pεη,Z0,Z𝒳3,\displaystyle~{}~{}\mathbb{E}[Z]=\varepsilon,~{}~{}(\mathbb{E}[Z^{1/p}])^{p}\leqslant\varepsilon-\eta,~{}~{}Z\geqslant 0,~{}~{}Z\in\mathcal{X}_{3},

where 𝒳3\mathcal{X}_{3} is the set of all random variables that take at most three values. Let ZZ be a feasible solution of Problem (40). Denote by z¯=ess-supZ\overline{z}=\mathrm{ess\mbox{-}sup}Z, and let ZZ^{*} be a random variable such that

Z(p1+p2)δε+qδz¯+(1(p1+p2+q))δ0\displaystyle Z^{*}\sim(p_{1}+p_{2})\delta_{\varepsilon}+q\delta_{\overline{z}}+(1-(p_{1}+p_{2}+q))\delta_{0} (41)

with the conditions that p_{1}\varepsilon=\mathbb{E}[Z\mathds{1}_{\{Z<\varepsilon\}}], p_{2}\varepsilon+\overline{z}q=\mathbb{E}[Z\mathds{1}_{\{Z\geqslant\varepsilon\}}] and p_{2}=\mathbb{P}(Z\geqslant\varepsilon)-q. It is straightforward to check that \mathbb{E}[Z^{*}]=\mathbb{E}[Z]=\varepsilon and \mathbb{E}[(t-Z)_{+}]\leqslant\mathbb{E}[(t-Z^{*})_{+}] for all t\in\mathbb{R}. It follows from Theorem 3.A.1 of Shaked and Shanthikumar (2007) that Z^{*}\leqslant_{\rm cv}Z, where for two random variables Z_{1} and Z_{2}, Z_{1} is said to be smaller than Z_{2} in the concave order, denoted by Z_{1}\leqslant_{\rm cv}Z_{2}, if \mathbb{E}[\phi(Z_{1})]\leqslant\mathbb{E}[\phi(Z_{2})] for any concave function \phi. This implies that \mathbb{E}[(Z^{*})^{1/p}]\leqslant\mathbb{E}[Z^{1/p}] since x\mapsto x^{1/p} is a concave function on \mathbb{R}_{+}. Therefore, Z^{*} is also a feasible solution of Problem (40). In addition, we have

𝔼[(Zε)+]=𝔼[Z𝟙{Zε}]ε(Zε)=p2ε+z¯qε(p2+q)=q(z¯ε)=𝔼[(Zε)+].\displaystyle\mathbb{E}[(Z-\varepsilon)_{+}]=\mathbb{E}[Z\mathds{1}_{\{Z\geqslant\varepsilon\}}]-\varepsilon\mathbb{P}(Z\geqslant\varepsilon)=p_{2}\varepsilon+\overline{z}q-\varepsilon(p_{2}+q)=q(\overline{z}-\varepsilon)=\mathbb{E}[(Z^{*}-\varepsilon)_{+}].

Hence, to solve Problem (40), it suffices to consider all random variables whose distributions take the form of (41), i.e., p1δ0+p2δε+p3δxp_{1}\delta_{0}+p_{2}\delta_{\varepsilon}+p_{3}\delta_{x} with p1,p2,p30p_{1},p_{2},p_{3}\geqslant 0, p1+p2+p3=1p_{1}+p_{2}+p_{3}=1 and xεx\geqslant\varepsilon. The corresponding problem can be represented as follows:

minpi0,x\displaystyle\min_{p_{i}\geqslant 0,{x}} (xε)p3s.t.εp2+xp3=ε,ε1/pp2+x1/pp3(εη)1/p,xε,p1+p2+p3=1.\displaystyle~{}~{}({x}-\varepsilon)p_{3}~{}~{}~{}{\rm s.t.}~{}~{}\varepsilon p_{2}+{x}p_{3}=\varepsilon,~{}~{}\varepsilon^{1/p}p_{2}+{x}^{1/p}p_{3}\leqslant(\varepsilon-\eta)^{1/p},~{}~{}{x}\geqslant\varepsilon,~{}~{}p_{1}+p_{2}+p_{3}=1. (42)

Note that (xε)p3=ε(1p2)εp3=εp1({x}-\varepsilon)p_{3}=\varepsilon(1-p_{2})-\varepsilon p_{3}=\varepsilon p_{1} whenever (p1,p2,p3,x)(p_{1},p_{2},p_{3},x) is a feasible solution of Problem (42), and thus, Problem (42) is equivalent to

minpi0,x\displaystyle\min_{p_{i}\geqslant 0,{x}} εp1s.t.εp2+xp3=ε,ε1/pp2+x1/pp3(εη)1/p,xε,p1+p2+p3=1.\displaystyle~{}~{}\varepsilon p_{1}~{}~{}~{}{\rm s.t.}~{}~{}\varepsilon p_{2}+{x}p_{3}=\varepsilon,~{}~{}\varepsilon^{1/p}p_{2}+{x}^{1/p}p_{3}\leqslant(\varepsilon-\eta)^{1/p},~{}~{}{x}\geqslant\varepsilon,~{}~{}p_{1}+p_{2}+p_{3}=1.

Further, letting x=λεx=\lambda\varepsilon yields the following equivalent problem

minpi0,λ\displaystyle\min_{p_{i}\geqslant 0,\lambda} εp1s.t.p2+λp3=1,p2+λ1/pp3(1ηε)1/p,λ1,p1+p2+p3=1.\displaystyle~{}~{}\varepsilon p_{1}~{}~{}~{}{\rm s.t.}~{}~{}p_{2}+\lambda p_{3}=1,~{}~{}p_{2}+{\lambda}^{1/p}p_{3}\leqslant\left(1-\frac{\eta}{\varepsilon}\right)^{1/p},~{}~{}{\lambda}\geqslant 1,~{}~{}p_{1}+p_{2}+p_{3}=1. (43)

Let (p1,p2,p3,λ)(p_{1},p_{2},p_{3},\lambda) be a feasible solution of Problem (43). It holds that p1=(λ1)p3p_{1}=(\lambda-1)p_{3} and p2=1λp3p_{2}=1-\lambda p_{3} and λ1\lambda\geqslant 1. Substituting p2=1λp3p_{2}=1-\lambda p_{3} into the second constraint of Problem (43) yields

(1ηε)1/p1λp3+λ1/pp3=(λ1)p3p3+1+λ1/pp3=p1+(λ1/p1)p3+1p1+1,\displaystyle\left(1-\frac{\eta}{\varepsilon}\right)^{1/p}\geqslant 1-\lambda p_{3}+\lambda^{1/p}p_{3}=-(\lambda-1)p_{3}-p_{3}+1+\lambda^{1/p}p_{3}=-p_{1}+(\lambda^{1/p}-1)p_{3}+1\geqslant-p_{1}+1,

which implies p_{1}\geqslant 1-(1-\eta/\varepsilon)^{1/p}. Hence, the optimal value of Problem (43) is no less than \varepsilon-\varepsilon(1-\eta/\varepsilon)^{1/p}, which is positive. This completes the proof. ∎
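
Problem (43) is small enough to check by brute force. The following sketch is our own addition: it fixes arbitrary values of p, \varepsilon and \eta, uses the observations above that the objective equals \varepsilon(\lambda-1)p_{3} and that the second constraint forces p_{3}\geqslant(1-(1-\eta/\varepsilon)^{1/p})/(\lambda-\lambda^{1/p}) when \lambda>1, and confirms that the optimal value over a grid of \lambda stays above \varepsilon-\varepsilon(1-\eta/\varepsilon)^{1/p}.

```python
import numpy as np

p, eps, eta = 2.5, 1.0, 0.3
target = eps - eps * (1.0 - eta / eps) ** (1.0 / p)    # lower bound derived in the proof

lams = np.linspace(1.0 + 1e-6, 200.0, 100_000)
p3_min = (1.0 - (1.0 - eta / eps) ** (1.0 / p)) / (lams - lams ** (1.0 / p))
feasible = p3_min <= 1.0 / lams                        # p2 = 1 - lam * p3 must stay >= 0
vals = np.where(feasible, eps * (lams - 1.0) * p3_min, np.inf)
print(vals.min(), ">=", target, bool(vals.min() >= target - 1e-9))
```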

Lemma 7.

Let p(1,)p\in(1,\infty), t,ε>0t,\,\varepsilon>0 and η(0,ε)\eta\in(0,\varepsilon). For VLpV\in L^{p}, the following statements hold.

  • (i)

    If Vpε\|V\|_{p}\leqslant\varepsilon, then 𝔼[(|V|+t)p](ε+t)p\mathbb{E}[(|V|+t)^{p}]\leqslant(\varepsilon+t)^{p}.

  • (ii)

    If Vpε\|V\|_{p}\leqslant\varepsilon and 𝔼[|V|]εη\mathbb{E}[|V|]\leqslant\varepsilon-\eta, then there exists Δ>0\Delta>0 that only depends on p,t,ε,ηp,t,\varepsilon,\eta such that 𝔼[(|V|+t)p](ε+t)pΔ\mathbb{E}[(|V|+t)^{p}]\leqslant(\varepsilon+t)^{p}-\Delta. In particular, if pp is an integer, then 𝔼[(|V|+t)p](ε+t)pptp1η\mathbb{E}[(|V|+t)^{p}]\leqslant(\varepsilon+t)^{p}-pt^{p-1}\eta.

Proof.

Proof. (i) Suppose that Vpε\|V\|_{p}\leqslant\varepsilon and denote by Z=|V|pZ=|V|^{p}. It holds that 𝔼[Z]εp\mathbb{E}[Z]\leqslant\varepsilon^{p} and

𝔼[(|V|+t)p]=𝔼[(Z1/p+t)p]((𝔼[Z])1/p+t)p(ε+t)p,\displaystyle\mathbb{E}[(|V|+t)^{p}]=\mathbb{E}[(Z^{1/p}+t)^{p}]\leqslant((\mathbb{E}[Z])^{1/p}+t)^{p}\leqslant(\varepsilon+t)^{p},

where we have used Jensen’s inequality in the first inequality by noting that x(x1/p+t)px\mapsto(x^{1/p}+t)^{p} is concave on +\mathbb{R}_{+}.

(ii) The case that pp is an integer can be verified by the following direct calculations:

𝔼[(|V|+t)p]\displaystyle\mathbb{E}[(|V|+t)^{p}] =𝔼[i=0p(pi)tpi|V|i]\displaystyle=\mathbb{E}\left[\sum_{i=0}^{p}{p\choose i}t^{p-i}|V|^{i}\right]
=i1(pi)tpi𝔼[|V|i]+ptp1𝔼[|V|]\displaystyle=\sum_{i\neq 1}{p\choose i}t^{p-i}\mathbb{E}[|V|^{i}]+pt^{p-1}\mathbb{E}[|V|]
i1(pi)tpiεi+ptp1(εη)\displaystyle\leqslant\sum_{i\neq 1}{p\choose i}t^{p-i}\varepsilon^{i}+pt^{p-1}(\varepsilon-\eta)
=i=0p(pi)tpiεiptp1η=(ε+t)pptp1η,\displaystyle=\sum_{i=0}^{p}{p\choose i}t^{p-i}\varepsilon^{i}-pt^{p-1}\eta=(\varepsilon+t)^{p}-pt^{p-1}\eta,

where the inequality holds because Vpε\|V\|_{p}\leqslant\varepsilon implies Viε\|V\|_{i}\leqslant\varepsilon for i=1,2,,pi=1,2,\dots,p.
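
The integer-p bound can be spot-checked by simulation. The sketch below is our own addition: V is a rescaled half-normal variable (an arbitrary choice) with \|V\|_{p}=\varepsilon, and \eta is taken to be the slack \varepsilon-\mathbb{E}|V| that this V actually achieves, so the hypotheses of statement (ii) hold by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
p, t, eps, N = 3, 0.7, 1.0, 400_000                 # integer p
V = np.abs(rng.normal(size=N))
V *= eps / np.mean(V ** p) ** (1 / p)               # ||V||_p = eps
eta = eps - np.mean(V)                              # slack, so E|V| <= eps - eta holds
lhs = np.mean((V + t) ** p)
rhs = (eps + t) ** p - p * t ** (p - 1) * eta
print(lhs, "<=", rhs, bool(lhs <= rhs))
```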

For general p, it suffices to verify that the optimal value of the following problem is smaller than (\varepsilon+t)^{p}:

sup𝔼[(Z+t)p]s.t.𝔼[Zp]εp,𝔼[Z]εη,Z0.\displaystyle\sup\,\mathbb{E}[(Z+t)^{p}]~{}~{}{\rm s.t.}~{}~{}\mathbb{E}[Z^{p}]\leqslant\varepsilon^{p},~{}~{}\mathbb{E}[Z]\leqslant\varepsilon-\eta,~{}~{}Z\geqslant 0. (44)

We first assert that Problem (44) is equivalent to

sup𝔼[(Z+t)p]s.t.𝔼[Zp]=εp,𝔼[Z]εη,Z0.\displaystyle\sup\,\mathbb{E}[(Z+t)^{p}]~{}~{}{\rm s.t.}~{}~{}\mathbb{E}[Z^{p}]=\varepsilon^{p},~{}~{}\mathbb{E}[Z]\leqslant\varepsilon-\eta,~{}~{}Z\geqslant 0. (45)

To see this, let ZZ be a feasible solution of Problem (44) with 𝔼[Z]>0\mathbb{E}[Z]>0. Let UU be a uniform random variable on (0,1)(0,1) and for α(0,1)\alpha\in(0,1), define

Zα=(FZ1(U)+f(α))𝟙{Uα},α[0,1),wheref(α)=0αFZ1(s)ds1α.\displaystyle Z_{\alpha}=(F_{Z}^{-1}(U)+f(\alpha))\mathds{1}_{\{U\geqslant\alpha\}},~{}~{}\alpha\in[0,1),~{}~{}{\rm where}~{}f(\alpha)=\frac{\int_{0}^{\alpha}F_{Z}^{-1}(s)\mathrm{d}s}{1-\alpha}.

It is straightforward to check that 𝔼[Zα]=𝔼[Z]\mathbb{E}[Z_{\alpha}]=\mathbb{E}[Z] for all α[0,1)\alpha\in[0,1). Moreover, it holds that

𝔼[(Zα)p]𝔼[f(α)p𝟙{Uα}]=(0αFZ1(s)ds)p(1α)p1asα1.\displaystyle\mathbb{E}[(Z_{\alpha})^{p}]\geqslant\mathbb{E}[f(\alpha)^{p}\mathds{1}_{\{U\geqslant\alpha\}}]=\frac{\left(\int_{0}^{\alpha}F_{Z}^{-1}(s)\mathrm{d}s\right)^{p}}{(1-\alpha)^{p-1}}\to\infty~{}~{}{\rm as}~{}\alpha\uparrow 1. (46)

Note that the mapping \alpha\mapsto\mathbb{E}[(Z_{\alpha})^{p}] is continuous as f(\alpha) is continuous, and \mathbb{E}[(Z_{\alpha})^{p}]=\mathbb{E}[Z^{p}]\leqslant\varepsilon^{p} if \alpha=0. Combining with (46), there exists \alpha^{*}\in[0,1) such that \mathbb{E}[Z_{\alpha^{*}}^{p}]=\varepsilon^{p}. Moreover, one can check that for any \alpha\in[0,1), \int_{\gamma}^{1}F_{Z}^{-1}(s)\mathrm{d}s\leqslant\int_{\gamma}^{1}F_{Z_{\alpha}}^{-1}(s)\mathrm{d}s for all \gamma\in(0,1). It follows from Theorem 2.5 of Bäuerle and Müller (2006) that Z_{\alpha}\leqslant_{\rm cv}Z. Hence, we have \mathbb{E}[(Z+t)^{p}]\leqslant\mathbb{E}[(Z_{\alpha^{*}}+t)^{p}] as x\mapsto-(x+t)^{p} is a concave function on \mathbb{R}_{+}. Therefore, Problem (44) is equivalent to Problem (45), which is further equivalent to

sup𝔼[g(Z)]s.t.𝔼[Z]=εp,𝔼[Z1/p]εη,Z0,\displaystyle\sup\,\mathbb{E}[g(Z)]~{}~{}{\rm s.t.}~{}~{}\mathbb{E}[Z]=\varepsilon^{p},~{}~{}\mathbb{E}[Z^{1/p}]\leqslant\varepsilon-\eta,~{}~{}Z\geqslant 0, (47)

where g(x)=(x1/p+t)pg(x)=(x^{1/p}+t)^{p} is a strictly concave function on +\mathbb{R}_{+}. It remains to verify that the optimal value of Problem (47) is strictly less than g(εp)g(\varepsilon^{p}).

To see this, let ZZ be a feasible solution of Problem (47). Define

Z=z1𝟙{Z<εp}+z2𝟙{Zεp},Z^{*}=z_{1}\mathds{1}_{\{Z<\varepsilon^{p}\}}+z_{2}\mathds{1}_{\{Z\geqslant\varepsilon^{p}\}},

where z1=𝔼[Z|Z<εp]z_{1}=\mathbb{E}[Z|Z<\varepsilon^{p}] and z2=𝔼[Z|Zεp]z_{2}=\mathbb{E}[Z|Z\geqslant\varepsilon^{p}] satisfy z1λ+z2(1λ)=εpz_{1}\lambda+z_{2}(1-\lambda)=\varepsilon^{p} with λ=(Z<εp)\lambda=\mathbb{P}(Z<\varepsilon^{p}). Note that by Lemma 6, there exists Δ>0\Delta>0 that only depends on p,t,ε,ηp,t,\varepsilon,\eta such that 𝔼[(εpZ)+]=𝔼[(Zεp)+]Δ\mathbb{E}[(\varepsilon^{p}-Z)_{+}]=\mathbb{E}[(Z-\varepsilon^{p})_{+}]\geqslant\Delta. By definition of z1z_{1} and z2z_{2}, this implies Δ(εpz1)(Z<εp)εpz1\Delta\leqslant(\varepsilon^{p}-z_{1})\mathbb{P}(Z<\varepsilon^{p})\leqslant\varepsilon^{p}-z_{1} and Δ(z2εp)(Zεp)z2εp\Delta\leqslant(z_{2}-\varepsilon^{p})\mathbb{P}(Z\geqslant\varepsilon^{p})\leqslant z_{2}-\varepsilon^{p}. Therefore, we have

z1εpΔandz2εp+Δ.\displaystyle z_{1}\leqslant\varepsilon^{p}-\Delta~{}~{}{\rm and}~{}~{}z_{2}\geqslant\varepsilon^{p}+\Delta. (48)

Further note that ZcvZZ\leqslant_{\rm cv}Z^{*} and also note that gg is concave on +\mathbb{R}_{+}. This implies

𝔼[g(Z)]𝔼[g(Z)]=λg(z1)+(1λ)g(z2)g(εpΔ)+g(εp+Δ)2<g(εp),\displaystyle\mathbb{E}[g(Z)]\leqslant\mathbb{E}[g(Z^{*})]=\lambda g(z_{1})+(1-\lambda)g(z_{2})\leqslant\frac{g(\varepsilon^{p}-\Delta)+g(\varepsilon^{p}+\Delta)}{2}<g(\varepsilon^{p}),

where the second inequality is due to the concavity of gg, λz1+(1λ)z2=(εpΔ+εp+Δ)/2\lambda z_{1}+(1-\lambda)z_{2}=(\varepsilon^{p}-\Delta+\varepsilon^{p}+\Delta)/2 and (48); and the strict inequality is due to Δ>0\Delta>0 and the strict concavity of gg. Noting that Δ\Delta is independent of the random variable ZZ, this completes the proof.

Proof of Theorem 4.

(ii) \Rightarrow (i): By Lemma 4, it suffices to show that (33) holds for any ZLpZ\in L^{p} and ε0\varepsilon\geqslant 0. For =1\ell=\ell_{1}, we have

supVpε1(Z+V)p\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell_{1}(Z+V)\|_{p} =supVpε(Z+Vb1)+p\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\|(Z+V-b_{1})_{+}\|_{p}
supVpε(Zb1)++|V|p\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\|(Z-b_{1})_{+}+|V|\|_{p}
supVpε{(Zb1)+p+Vp}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\|(Z-b_{1})_{+}\|_{p}+\|V\|_{p}\right\}
(Zb1)+p+ε=1(Z)p+ε.\displaystyle\leqslant\|(Z-b_{1})_{+}\|_{p}+\varepsilon=\|\ell_{1}(Z)\|_{p}+\varepsilon. (49)

If (Z>b1)>0\mathbb{P}(Z>b_{1})>0, then all the inequalities are equalities, and the maximizer can be chosen as V=λ(Zb1)+V=\lambda(Z-b_{1})_{+} with some λ>0\lambda>0 such that Vp=ε\|V\|_{p}=\varepsilon. If Zb1Z\leqslant b_{1} a.s., then we take {Vn}n\{V_{n}\}_{n\in\mathbb{N}} such that VnV_{n} has distribution (11/np)δ0+(1/np)δnε(1-1/n^{p})\delta_{0}+(1/n^{p})\delta_{n\varepsilon}, and {Vn=nε}{ZFZ1(11/np)}\{V_{n}=n\varepsilon\}\subseteq\{Z\geqslant F^{-1}_{Z}(1-1/n^{p})\} where FZ1F_{Z}^{-1} is the left-quantile function of ZZ for all nn. We have Vnp=ε\|V_{n}\|_{p}=\varepsilon and

supVpε1(Z+V)p\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell_{1}(Z+V)\|_{p} (Z+Vnb1)+p\displaystyle\geqslant\|(Z+V_{n}-b_{1})_{+}\|_{p}
1n(nε+FZ1(11np)b1)+ε=1(Z)p+ε,asn.\displaystyle\geqslant\frac{1}{n}\left(n\varepsilon+F_{Z}^{-1}\left(1-\frac{1}{n^{p}}\right)-b_{1}\right)_{+}\to\varepsilon=\|\ell_{1}(Z)\|_{p}+\varepsilon,~{}~{}{\rm as~{}}n\to\infty.

Combining this with (49), we verify that (33) holds with \ell_{1} when Z\leqslant b_{1} a.s. This completes the proof of the case \ell=\ell_{1}. For \ell=\ell_{2}, the proof is similar to that of \ell_{1}. For \ell=\ell_{3}, we have

supVpε3(Z+V)p\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell_{3}(Z+V)\|_{p} =supVpε(|Z+Vb1|b2)+p\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\|(|Z+V-b_{1}|-b_{2})_{+}\|_{p}
supVpε(|Zb1|b2+|V|)+p\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\|(|Z-b_{1}|-b_{2}+|V|)_{+}\|_{p}
supVpε(|Zb1|b2)++|V|p\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\|(|Z-b_{1}|-b_{2})_{+}+|V|\|_{p}
supVpε{(|Zb1|b2)+p+Vp}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\|(|Z-b_{1}|-b_{2})_{+}\|_{p}+\|V\|_{p}\right\}
(|Zb1|b2)+p+ε=3(Z)p+ε.\displaystyle\leqslant\|(|Z-b_{1}|-b_{2})_{+}\|_{p}+\varepsilon=\|\ell_{3}(Z)\|_{p}+\varepsilon. (50)

If (|Zb1|>b2)>0\mathbb{P}(|Z-b_{1}|>b_{2})>0, then one can check that all inequalities are equalities, and the maximizer can be chosen as V=λ(|Zb1|b2)+sign(Zb1)V=\lambda(|Z-b_{1}|-b_{2})_{+}{\rm sign}(Z-b_{1}) with some λ0\lambda\geqslant 0 such that Vp=ε\|V\|_{p}=\varepsilon. If |Zb1|b2|Z-b_{1}|\leqslant b_{2} a.s., then we have Z[b1b2,b1+b2]Z\in[b_{1}-b_{2},b_{1}+b_{2}]. Taking a sequence {Vn}n\{V_{n}\}_{n\in\mathbb{N}} as shown in the case of 1\ell_{1}, i.e., VnV_{n} with distribution (11/np)δ0+(1/np)δnε(1-1/n^{p})\delta_{0}+(1/n^{p})\delta_{n\varepsilon}, it holds that for large enough nn such that nεmax{b1,b2}n\varepsilon\geqslant\max\{b_{1},b_{2}\},

supVpε3(Z+V)p\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell_{3}(Z+V)\|_{p} (|Z+Vnb1|b2)+p\displaystyle\geqslant\|(|Z+V_{n}-b_{1}|-b_{2})_{+}\|_{p}
(𝔼[(|Zb1|b2)+p𝟙{Vn=0}]+𝔼[(|Zb1+nε|b2)+p𝟙{Vn=nε}])1/p\displaystyle\geqslant\left(\mathbb{E}[\left(|Z-b_{1}|-b_{2}\right)_{+}^{p}\mathds{1}_{\{V_{n}=0\}}]+\mathbb{E}[\left(|Z-b_{1}+n\varepsilon|-b_{2}\right)_{+}^{p}\mathds{1}_{\{V_{n}=n\varepsilon\}}]\right)^{1/p}
(𝔼[(nεb2b2)+p𝟙{Vn=nε}])1/p\displaystyle\geqslant\left(\mathbb{E}[\left(n\varepsilon-b_{2}-b_{2}\right)_{+}^{p}\mathds{1}_{\{V_{n}=n\varepsilon\}}]\right)^{1/p}
=(ε2b2n)+ε=3(Z)p+ε,asn.\displaystyle=\left(\varepsilon-\frac{2b_{2}}{n}\right)_{+}\to\varepsilon=\|\ell_{3}(Z)\|_{p}+\varepsilon,~{}~{}{\rm as}~{}n\to\infty.

Combining this with (50), we have that (33) holds with \ell_{3} when |Z-b_{1}|\leqslant b_{2} a.s. This completes the proof of the case \ell_{3}. For \ell=\ell_{4}, we have

supVpε4(Z+V)p\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell_{4}(Z+V)\|_{p} =supVpε|Z+Vb1|+b2p\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\||Z+V-b_{1}|+b_{2}\|_{p}
supVpε|Zb1|+b2+|V|p\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\||Z-b_{1}|+b_{2}+|V|\|_{p}
supVpε{|Zb1|+b2p+Vp}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\||Z-b_{1}|+b_{2}\|_{p}+\|V\|_{p}\right\}
\displaystyle\leqslant\||Z-b_{1}|+b_{2}\|_{p}+\varepsilon=\|\ell_{4}(Z)\|_{p}+\varepsilon,

where all inequalities hold as equalities for the maximizer V=\lambda(|Z-b_{1}|+b_{2}){\rm sign}(Z-b_{1}), with \lambda\geqslant 0 chosen such that \|V\|_{p}=\varepsilon. Hence, we conclude that (33) holds with \ell=\ell_{4} and complete the proof of (ii) \Rightarrow (i).
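
The maximizer used for \ell_{4} can be confirmed in the same way. The sketch below is our own addition (with \ell_{4}(x)=|x-b_{1}|+b_{2}; the Student-t sample and the parameter values are arbitrary choices, and \lambda is scaled so that \|V\|_{p}=\varepsilon).

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, eps, b1, b2 = 200_000, 2.0, 0.4, -0.2, 0.5
Z = rng.standard_t(df=5, size=N)

ell4 = lambda x: np.abs(x - b1) + b2
lam = eps / np.mean(ell4(Z) ** p) ** (1 / p)
V = lam * ell4(Z) * np.sign(Z - b1)      # maximizer from the proof; ell4(Z+V) = (1+lam)*ell4(Z)
print(np.mean(np.abs(V) ** p) ** (1 / p))                               # ~ eps
print(np.mean(ell4(Z + V) ** p) ** (1 / p),
      "vs", np.mean(ell4(Z) ** p) ** (1 / p) + eps)                     # the two sides agree
```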

(i) \Rightarrow (ii): The proof of this direction is in the same spirit as that of Theorem 3. Assume without loss of generality that C=1. By Lemma 4, we have for any Z\in L^{p} and \varepsilon\geqslant 0,

supVpε(Z+V)p=(Z)p+ε.\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell(Z+V)\|_{p}=\|\ell(Z)\|_{p}+\varepsilon. (51)

By the same arguments as in the proof of Theorem 3, we can show that {\rm Lip}(\ell)\leqslant 1. Next, whenever we write \ell^{\prime}(x), we implicitly assume that \ell is differentiable at x; we now show the following facts.

If|(x)|>0,then|(x)|=1.If(x)=0,then(x)=0.\displaystyle{\rm If}~{}|\ell^{\prime}(x)|>0,~{}{\rm then}~{}|\ell^{\prime}(x)|=1.~{}~{}~{}{\rm If}~{}\ell^{\prime}(x)=0,~{}{\rm then}~{}\ell(x)=0. (52)

This will complete the proof since one can check that (52) implies that \ell takes one of the forms \ell_{1}, \ell_{2}, \ell_{3} or \ell_{4} with C=1. To see (52), we assume by contradiction that there exists x_{0}\in\mathbb{R} such that |\ell^{\prime}(x_{0})|<1 and \ell(x_{0})>0 (note that |\ell^{\prime}(x_{0})|\in(0,1) implies \ell(x_{0})>0). Define

k=max{|(x0+2ε)(x0)2ε|,|(x0)(x02ε)2ε|},k=\max\left\{\left|\frac{\ell(x_{0}+2\varepsilon)-\ell(x_{0})}{2\varepsilon}\right|,\left|\frac{\ell(x_{0})-\ell(x_{0}-2\varepsilon)}{2\varepsilon}\right|\right\},

and it holds that 0(x0+v)(x0)+k|v|0\leqslant\ell(x_{0}+v)\leqslant\ell(x_{0})+k|v| for all v[2ε,2ε]v\in[-2\varepsilon,2\varepsilon] as \ell is nonnegative and convex. Further, noting that |(x0)|<1|\ell^{\prime}(x_{0})|<1 and Lip()1{\rm Lip}(\ell)\leqslant 1, we have k<1k<1 and

((x0)+|v|)p((x0)+k|v|)ppp1(x0)(1k)|v|,v.\displaystyle(\ell(x_{0})+|v|)^{p}-(\ell(x_{0})+k|v|)^{p}\geqslant p\ell^{p-1}(x_{0})(1-k)|v|,~{}\forall v\in\mathbb{R}. (53)

Therefore,

supVpε𝔼[p(x0+V)]\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\mathbb{E}[\ell^{p}(x_{0}+V)] =supVpε{𝔼[p(x0+V)𝟙{|V|2ε}]+𝔼[p(x0+V)𝟙{|V|>2ε}]}\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\mathbb{E}[\ell^{p}(x_{0}+V)\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]+\mathbb{E}[\ell^{p}(x_{0}+V)\mathds{1}_{\{|V|>2\varepsilon\}}]\right\}
supVpε{𝔼[((x0)+k|V|)p𝟙{|V|2ε}]+𝔼[p(x0+V)𝟙{|V|>2ε}]}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\mathbb{E}[(\ell(x_{0})+k|V|)^{p}\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]+\mathbb{E}[\ell^{p}(x_{0}+V)\mathds{1}_{\{|V|>2\varepsilon\}}]\right\}
supVpε{𝔼[((x0)+k|V|)p𝟙{|V|2ε}]+𝔼[((x0)+|V|)p𝟙{|V|>2ε}]}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\mathbb{E}[(\ell(x_{0})+k|V|)^{p}\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]+\mathbb{E}[(\ell(x_{0})+|V|)^{p}\mathds{1}_{\{|V|>2\varepsilon\}}]\right\} (54)
=supVpε{𝔼[((x0)+|V|)p]𝔼[(((x0)+|V|)p((x0)+k|V|)p)𝟙{|V|2ε}]}\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\mathbb{E}[(\ell(x_{0})+|V|)^{p}]-\mathbb{E}[((\ell(x_{0})+|V|)^{p}-(\ell(x_{0})+k|V|)^{p})\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]\right\}
supVpε{𝔼[((x0)+|V|)p]pp1(x0)(1k)𝔼[|V|𝟙{|V|2ε}]}\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\left\{\mathbb{E}[(\ell(x_{0})+|V|)^{p}]-p\ell^{p-1}(x_{0})(1-k)\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]\right\} (55)
=:I,\displaystyle=:I,

where (54) holds because \ell is nonnegative with Lip()1{\rm Lip}(\ell)\leqslant 1, and (55) follows from (53). Recalling 𝒱1\mathcal{V}_{1} and 𝒱2\mathcal{V}_{2} defined by (36) in the proof of Theorem 3, we can rewrite I=max{I1,I2}I=\max\{I_{1},I_{2}\} with

Ii=supV𝒱i{𝔼[((x0)+|V|)p]pp1(x0)(1k)𝔼[|V|𝟙{|V|2ε}]},i=1,2.\displaystyle I_{i}=\sup_{V\in\mathcal{V}_{i}}\left\{\mathbb{E}[(\ell(x_{0})+|V|)^{p}]-p\ell^{p-1}(x_{0})(1-k)\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]\right\},~{}~{}i=1,2.

By Lemma 5 and the definition of 𝒱1\mathcal{V}_{1}, we have

𝔼[|V|]ε2p/q+(12p/q)ε2=(1+2p/q)ε2<ε,V𝒱1.\displaystyle\mathbb{E}[|V|]\leqslant\varepsilon 2^{-p/q}+\frac{(1-2^{-p/q})\varepsilon}{2}=\frac{(1+2^{-p/q})\varepsilon}{2}<\varepsilon,~{}~{}\forall~{}V\in\mathcal{V}_{1}. (56)

It holds that

I1\displaystyle I_{1} supV𝒱1𝔼[((x0)+|V|)p]<((x0)+ε)p,\displaystyle\leqslant\sup_{V\in\mathcal{V}_{1}}\mathbb{E}[(\ell(x_{0})+|V|)^{p}]<(\ell(x_{0})+\varepsilon)^{p}, (57)

where the strict inequality follows from (56) and Statement (ii) of Lemma 7 by noting that (x0)>0\ell(x_{0})>0. For I2I_{2}, we have

I2\displaystyle I_{2} supV𝒱2𝔼[((x0)+|V|)p]infV𝒱2pp1(x0)(1k)𝔼[|V|𝟙{|V|2ε}]\displaystyle\leqslant\sup_{V\in\mathcal{V}_{2}}\mathbb{E}[(\ell(x_{0})+|V|)^{p}]-\inf_{V\in\mathcal{V}_{2}}p\ell^{p-1}(x_{0})(1-k)\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]
((x0)+ε)ppp1(x0)(1k)infV𝒱2𝔼[|V|𝟙{|V|2ε}]\displaystyle\leqslant(\ell(x_{0})+\varepsilon)^{p}-p\ell^{p-1}(x_{0})(1-k)\inf_{V\in\mathcal{V}_{2}}\mathbb{E}[|V|\mathds{1}_{\{|V|\leqslant 2\varepsilon\}}]
((x0)+ε)ppp1(x0)(1k)(12p/q)ε2\displaystyle\leqslant(\ell(x_{0})+\varepsilon)^{p}-p\ell^{p-1}(x_{0})(1-k)\frac{(1-2^{-p/q})\varepsilon}{2}
<((x0)+ε)p,\displaystyle<(\ell(x_{0})+\varepsilon)^{p}, (58)

where the second inequality follows from Statement (i) of Lemma 7, and the third inequality is due to the definition of \mathcal{V}_{2}. Combining (57) and (58), we have

supVpε(x0+V)pI1/p=max{I11/p,I21/p}<(x0)+ε,\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\|\ell(x_{0}+V)\|_{p}\leqslant I^{1/p}=\max\{I_{1}^{1/p},I_{2}^{1/p}\}<\ell(x_{0})+\varepsilon,

which yields a contradiction to (51). Hence, (52) holds, which completes the proof.

As outlined in Remark 2, we detail below the steps to verify that, for the functions \ell listed in (ii) of Theorems 3 and 4, the regularization given by Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) can be reformulated as the corresponding regularization in our framework. To be specific, Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) states that for \ell that is proper, upper-semicontinuous and satisfies \ell(x)\leqslant C(1+|x|^{p}) for some C>0 and all x\in\mathbb{R},

supFp(G0,ε)𝔼F[(𝜷𝐙)]=infλ0{𝔼G0[p(𝜷𝐙,λ)]+λεp𝜷p},\displaystyle\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\mathbb{E}^{F}[\ell(\bm{\beta}^{\top}\mathbf{Z})]=\inf_{\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[\ell_{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda)]+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\},

where p(x,λ)=supy{(y)λ|yx|p}\ell_{p}(x,\lambda)=\sup_{y\in\mathbb{R}}\{\ell(y)-\lambda|y-x|^{p}\} and \|\cdot\|_{*} is the dual norm of \|\cdot\|. When \ell is given by (ii) of Theorem 3, we have

infλ0{𝔼G0[p(𝜷𝐙,λ)]+λεp𝜷p}=𝔼G0[(𝜷𝐙)]+Cε𝜷;\displaystyle\inf_{\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[\ell_{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda)]+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}=\mathbb{E}^{G_{0}}[\ell(\bm{\beta}^{\top}\mathbf{Z})]+C\varepsilon\|\bm{\beta}\|_{*}; (59)

when \ell is given by (ii) of Theorem 4, we have

infλ0{𝔼G0[(p)p(𝜷𝐙,λ)]+λεp𝜷p}=((𝔼G0[p(𝜷𝐙)])1/p+Cε𝜷)p.\displaystyle\inf_{\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[(\ell^{p})_{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda)]+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}=\left(\left(\mathbb{E}^{G_{0}}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z})]\right)^{1/p}+C\varepsilon||\bm{\beta}||_{*}\right)^{p}. (60)
Proof.

Proof. We only give the proof of (59) for the case \ell(x)=|x-b| for some b\in\mathbb{R} and the proof of (60) for \ell(x)=(|x-b_{1}|-b_{2})_{+} for some b_{1}\in\mathbb{R} and b_{2}\geqslant 0, as the other cases can be proved similarly.

In the case that (x)=|xb|\ell(x)=|x-b|, the function p\ell_{p} reduces to p(x,λ)=supyfλ,x(y)\ell_{p}(x,\lambda)=\sup_{y\in\mathbb{R}}f_{\lambda,x}(y), where fλ,x(y):=|yb|λ|yx|pf_{\lambda,x}(y):=|y-b|-\lambda|y-x|^{p} for λ0\lambda\geqslant 0 and x,yx,y\in\mathbb{R}. We first show

p(x,λ)=|xb|+(λp)1/(p1)λ(λp)p/(p1)\displaystyle\ell_{p}(x,\lambda)=|x-b|+(\lambda p)^{-1/(p-1)}-\lambda(\lambda p)^{-p/(p-1)} (61)

by considering the following two cases:

  • (i)

    If xbx\geqslant b, then we have p(x,λ)=supyfλ,x(y)=max{I1,I2,I3}\ell_{p}(x,\lambda)=\sup_{y\in\mathbb{R}}f_{\lambda,x}(y)=\max\{I_{1},I_{2},I_{3}\} where

    I1=supybfλ,x(y),I2=supb<y<xfλ,x(y)andI3=supyxfλ,x(y).\displaystyle I_{1}=\sup_{y\leqslant b}f_{\lambda,x}(y),~{}~{}I_{2}=\sup_{b<y<x}f_{\lambda,x}(y)~{}~{}{\rm and}~{}~{}I_{3}=\sup_{y\geqslant x}f_{\lambda,x}(y).

    For I1I_{1}, by calculating the derivative of fλ,x(y)f_{\lambda,x}(y) for y<by<b, one can verify that

    I1=fλ,x(b)=λ|bx|p\displaystyle I_{1}=f_{\lambda,x}(b)=-\lambda|b-x|^{p}

    if xb+(λp)1/(p1)x\geqslant b+(\lambda p)^{-1/(p-1)}, and

    I1=fλ,x(x(λp)1/(p1))=bx+(λp)1/(p1)λ(λp)p/(p1),\displaystyle I_{1}=f_{\lambda,x}\left(x-(\lambda p)^{-1/(p-1)}\right)=b-x+(\lambda p)^{-1/(p-1)}-\lambda(\lambda p)^{-p/(p-1)},

    otherwise. For I2I_{2}, since fλ,xf_{\lambda,x} is increasing on [b,x][b,x], we have

    I2=fλ,x(b)I3.I_{2}=f_{\lambda,x}(b)\leqslant I_{3}.

    Similarly, by calculating the derivative of fλ,x(y)f_{\lambda,x}(y) for y>xy>x, we have

    I3=fλ,x(x+(λp)1/(p1))=xb+(λp)1/(p1)λ(λp)p/(p1).\displaystyle I_{3}=f_{\lambda,x}\left(x+(\lambda p)^{-1/(p-1)}\right)=x-b+(\lambda p)^{-1/(p-1)}-\lambda(\lambda p)^{-p/(p-1)}.

    Note that xbx\geqslant b. Comparing I1I_{1}, I2I_{2} and I3I_{3}, it is straightforward to check that I1I3I_{1}\leqslant I_{3} and I2I3I_{2}\leqslant I_{3}, and hence,

    p(x,λ)=I3=xb+(λp)1/(p1)λ(λp)p/(p1).\displaystyle\ell_{p}(x,\lambda)=I_{3}=x-b+(\lambda p)^{-1/(p-1)}-\lambda(\lambda p)^{-p/(p-1)}.
  • (ii)

    If x<bx<b, then by similar analysis, we obtain

    p(x,λ)=fλ,x(x(λp)1/(p1))=bx+(λp)1/(p1)λ(λp)p/(p1).\displaystyle\ell_{p}(x,\lambda)=f_{\lambda,x}\left(x-(\lambda p)^{-1/(p-1)}\right)=b-x+(\lambda p)^{-1/(p-1)}-\lambda(\lambda p)^{-p/(p-1)}.

Combining the above two cases, we have (61) holds. Substituting (61) into the left-hand side of (59) yields

infλ0{𝔼G0[p(𝜷𝐙,λ)]+λεp𝜷p}\displaystyle\inf_{\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[\ell_{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda)]+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}
=infλ0{𝔼G0[|𝜷𝐙b|]+(λp)1/(p1)λ(λp)p/(p1)+λεp𝜷p}\displaystyle=\inf_{\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[|\bm{\beta}^{\top}\mathbf{Z}-b|]+(\lambda p)^{-1/(p-1)}-\lambda(\lambda p)^{-p/(p-1)}+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}
=𝔼G0[(𝜷𝐙)]+infλ0{pp/(p1)(p1)λ1/(p1)+λεp𝜷p}\displaystyle=\mathbb{E}^{G_{0}}[\ell(\bm{\beta}^{\top}\mathbf{Z})]+\inf_{\lambda\geqslant 0}\{p^{-p/(p-1)}(p-1)\lambda^{-1/(p-1)}+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\}
=𝔼G0[(𝜷𝐙)]+ε𝜷.\displaystyle=\mathbb{E}^{G_{0}}[\ell(\bm{\beta}^{\top}\mathbf{Z})]+\varepsilon\|\bm{\beta}\|_{*}.

This completes the proof of (59).
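
Both the closed form (61) and the final minimization over \lambda can be confirmed numerically. The sketch below is our own addition; the parameter values and grid ranges are arbitrary, and the supremum in (61) is approximated over a fine grid of y.

```python
import numpy as np

p, b, lam, x = 1.8, 0.4, 0.9, 1.3
ys = np.linspace(x - 10.0, x + 10.0, 400_001)
brute = np.max(np.abs(ys - b) - lam * np.abs(ys - x) ** p)
closed = abs(x - b) + (lam * p) ** (-1 / (p - 1)) - lam * (lam * p) ** (-p / (p - 1))
print(brute, "~", closed)                       # equation (61)

# The remaining term p^{-p/(p-1)}(p-1) lam^{-1/(p-1)} + lam * eps^p * ||beta||_*^p
# is minimized over lam >= 0 at the value eps * ||beta||_*.
eps, beta_dual = 0.6, 2.3
lams = np.linspace(1e-4, 50.0, 200_000)
term = p ** (-p / (p - 1)) * (p - 1) * lams ** (-1 / (p - 1)) + lams * eps ** p * beta_dual ** p
print(term.min(), "~", eps * beta_dual)
```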

In the case that (x)=(|xb1|b2)+\ell(x)=(|x-b_{1}|-b_{2})_{+}, that is, p(x)=(|xb1|b2)+p\ell^{p}(x)=(|x-b_{1}|-b_{2})_{+}^{p}, we have

(p)p(x,λ)=supyfλ,x(y)=max{I1,I2,I3},\displaystyle(\ell^{p})_{p}(x,\lambda)=\sup_{y\in\mathbb{R}}f_{\lambda,x}(y)=\max\{I_{1},I_{2},I_{3}\},

where fλ,x(y)=(|yb1|b2)+pλ|yx|pf_{\lambda,x}(y)=(|y-b_{1}|-b_{2})_{+}^{p}-\lambda|y-x|^{p} and

I1=supyb1b2fλ,x(y),I2=supb1b2<y<b1+b2fλ,x(y)andI3=supyb1+b2fλ,x(y).\displaystyle I_{1}=\sup_{y\leqslant b_{1}-b_{2}}f_{\lambda,x}(y),~{}~{}I_{2}=\sup_{b_{1}-b_{2}<y<b_{1}+b_{2}}f_{\lambda,x}(y)~{}~{}{\rm and}~{}~{}I_{3}=\sup_{y\geqslant b_{1}+b_{2}}f_{\lambda,x}(y).

It is straightforward to check that (p)p(x,λ)=(\ell^{p})_{p}(x,\lambda)=\infty if λ<1\lambda<1. Hence, the constraint of λ\lambda can be replaced by λ1\lambda\geqslant 1. Let now λ1\lambda\geqslant 1 and denote by k=λ1/(p1)k=\lambda^{1/(p-1)}. By considering the first-order condition, we have

I1={kp1(xb1+b2)p,x>b1b2,(kk1)p1(b1b2x)p,xb1b2,\displaystyle I_{1}=\begin{cases}-k^{p-1}(x-b_{1}+b_{2})^{p},~{}~{}&x>b_{1}-b_{2},\\[5.0pt] \left(\frac{k}{k-1}\right)^{p-1}(b_{1}-b_{2}-x)^{p},~{}~{}&x\leqslant b_{1}-b_{2},\end{cases}
I2={0,b1b2<x<b1+b2,kp1(b1b2x)p,xb1b2,kp1(xb1b2)p,xb1+b2,\displaystyle I_{2}=\begin{cases}0,~{}~{}&b_{1}-b_{2}<x<b_{1}+b_{2},\\ -k^{p-1}(b_{1}-b_{2}-x)^{p},~{}~{}&x\leqslant b_{1}-b_{2},\\ -k^{p-1}(x-b_{1}-b_{2})^{p},~{}~{}&x\geqslant b_{1}+b_{2},\end{cases}

and

I3={kp1(b1+b2x)p,x<b1+b2,(kk1)p1(xb1b2)p,xb1+b2\displaystyle I_{3}=\begin{cases}-k^{p-1}(b_{1}+b_{2}-x)^{p},~{}~{}&x<b_{1}+b_{2},\\[5.0pt] \left(\frac{k}{k-1}\right)^{p-1}(x-b_{1}-b_{2})^{p},~{}~{}&x\geqslant b_{1}+b_{2}\end{cases}

where we use the convention that s/0=s/0=\infty if s>0s>0. Comparing I1I_{1}, I2I_{2} and I3I_{3}, we have

(p)p(x,λ)=(kk1)p1(|xb1|b2)+pwithk=λ1/(p1).\displaystyle(\ell^{p})_{p}(x,\lambda)=\left(\frac{k}{k-1}\right)^{p-1}(|x-b_{1}|-b_{2})_{+}^{p}~{}~{}{\rm with}~{}k=\lambda^{1/(p-1)}.

Denote by A=\mathbb{E}^{G_{0}}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z})]=\mathbb{E}^{G_{0}}[(|\bm{\beta}^{\top}\mathbf{Z}-b_{1}|-b_{2})_{+}^{p}]. Substituting the above equation into the left-hand side of (60), and noting that the constraint \lambda\geqslant 1 translates into k\geqslant 1, yields

infλ0{𝔼G0[(p)p(𝜷𝐙,λ)]+λεp𝜷p}\displaystyle\inf_{\lambda\geqslant 0}\left\{\mathbb{E}^{G_{0}}[(\ell^{p})_{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda)]+\lambda\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\} =infk1{(kk1)p1𝔼G0[(|𝜷𝐙b1|b2)+p]+kp1εp𝜷p}\displaystyle=\inf_{k\geqslant 1}\left\{\left(\frac{k}{k-1}\right)^{p-1}\mathbb{E}^{G_{0}}[(|\bm{\beta}^{\top}\mathbf{Z}-b_{1}|-b_{2})_{+}^{p}]+k^{p-1}\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}
=infk1{(kk1)p1A+kp1εp𝜷p}\displaystyle=\inf_{k\geqslant 1}\left\{\left(\frac{k}{k-1}\right)^{p-1}A+k^{p-1}\varepsilon^{p}\|\bm{\beta}\|_{*}^{p}\right\}
=(A1/p+ε𝜷)p=((𝔼G0[p(𝜷𝐙)])1/p+ε𝜷)p.\displaystyle=\left(A^{1/p}+\varepsilon\|\bm{\beta}\|_{*}\right)^{p}=\left(\left(\mathbb{E}^{G_{0}}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z})]\right)^{1/p}+\varepsilon||\bm{\beta}||_{*}\right)^{p}.

This completes the proof of (60). ∎
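
The closed form of (\ell^{p})_{p} and the minimization over k\geqslant 1 can be checked in the same way. The sketch below is our own addition (arbitrary parameter values with \lambda>1, so that the supremum is finite; x is taken with |x-b_{1}|>b_{2}, so the nonzero branch is exercised).

```python
import numpy as np

p, b1, b2, lam, x = 2.2, 0.1, 0.6, 1.7, 1.5
k = lam ** (1 / (p - 1))
ys = np.linspace(x - 20.0, x + 20.0, 800_001)
brute = np.max(np.maximum(np.abs(ys - b1) - b2, 0.0) ** p - lam * np.abs(ys - x) ** p)
closed = (k / (k - 1)) ** (p - 1) * max(abs(x - b1) - b2, 0.0) ** p
print(brute, "~", closed)                                  # closed form of (l^p)_p(x, lam)

# inf_{k >= 1} (k/(k-1))^{p-1} A + k^{p-1} eps^p ||beta||_*^p = (A^{1/p} + eps ||beta||_*)^p
A, eps, beta_dual = 0.8, 0.5, 1.4
ks = np.linspace(1.0 + 1e-6, 100.0, 200_000)
vals = (ks / (ks - 1.0)) ** (p - 1) * A + ks ** (p - 1) * eps ** p * beta_dual ** p
print(vals.min(), "~", (A ** (1 / p) + eps * beta_dual) ** p)
```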

The proof of Corollary 1 relies on the following lemma and the regularization results in Theorem 4. Recall that

π1,(F,t)=t+(𝔼F[p(𝜷𝐙,t)])1/pandπ2,(F,t)=𝔼F[p(𝜷𝐙,t)].\displaystyle\pi_{1,\ell}(F,t)=t+\left(\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t)]\right)^{1/p}~{}~{}{\rm and}~{}~{}\pi_{2,\ell}(F,t)=\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t)].
Lemma 8.

For any p[1,)p\in[1,\infty), G0p(n)G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}) and ε0\varepsilon\geqslant 0, the following two statements hold.

  • (i)

    Suppose that (z,t)\ell(z,t) is nonnegative on 2\mathbb{R}^{2}, and convex in tt with limt(z,t)/t<1\lim_{t\to-\infty}{\partial\ell(z,t)}/{\partial t}<-1 for all zz\in\mathbb{R}, and Lipschitz continuous in zz for all tt\in\mathbb{R} with a uniform Lipschitz constant, i.e., there exists M>0M>0 such that

    |(z1,t)(z2,t)|M|z1z2|,t,z1,z2.|\ell(z_{1},t)-\ell(z_{2},t)|\leqslant M|z_{1}-z_{2}|,~{}~{}\forall t,z_{1},z_{2}\in\mathbb{R}.

    Then we have

    supFp(G0,ε)inftπ1,(F,t)=inftsupFp(G0,ε)π1,(F,t).\displaystyle\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\inf_{t\in\mathbb{R}}\pi_{1,\ell}(F,t)=\inf_{t\in\mathbb{R}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\pi_{1,\ell}(F,t).
  • (ii)

    Suppose that (z,t)\ell(z,t) is convex in tt with limt(z,t)/t<0\lim_{t\to-\infty}{\partial\ell(z,t)}/{\partial t}<0 and limt(z,t)/t>0\lim_{t\to\infty}{\partial\ell(z,t)}/{\partial t}>0 for all zz\in\mathbb{R}, and Lipschitz continuous in zz for all tt\in\mathbb{R} with a uniform Lipschitz constant. Then we have

    supFp(G0,ε)inftπ2,(F,t)=inftsupFp(G0,ε)π2,(F,t).\displaystyle\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\inf_{t\in\mathbb{R}}\pi_{2,\ell}(F,t)=\inf_{t\in\mathbb{R}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\pi_{2,\ell}(F,t).
Proof.

Proof. (i) First, we show three facts below. (a) π1,(F,t)\pi_{1,\ell}(F,t) is concave in FF for all tt\in\mathbb{R}; (b) π1,(F,t)\pi_{1,\ell}(F,t) is convex in tt for all Fp(n)F\in\mathcal{M}_{p}(\mathbb{R}^{n}); (c) limt±π1,(F,t)=\lim_{t\to\pm\infty}\pi_{1,\ell}(F,t)=\infty for all Fp(n)F\in\mathcal{M}_{p}(\mathbb{R}^{n}). The fact (a) is trivial. For Fp(n)F\in\mathcal{M}_{p}(\mathbb{R}^{n}), λ[0,1]\lambda\in[0,1] and t1,t2t_{1},t_{2}\in\mathbb{R}, it holds that

(𝔼F[p(𝜷𝐙,λt1+(1λ)t2)])1/p\displaystyle\left(\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},\lambda t_{1}+(1-\lambda)t_{2})]\right)^{1/p} (𝔼F[(λ(𝜷𝐙,t1)+(1λ)(𝜷𝐙,t2))p])1/p\displaystyle\leqslant\left(\mathbb{E}^{F}\left[\left(\lambda\ell(\bm{\beta}^{\top}\mathbf{Z},t_{1})+(1-\lambda)\ell(\bm{\beta}^{\top}\mathbf{Z},t_{2})\right)^{p}\right]\right)^{1/p}
λ(𝔼F[p(𝜷𝐙,t1)])1/p+(1λ)(𝔼F[p(𝜷𝐙,t2)])1/p,\displaystyle\leqslant\lambda\left(\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t_{1})]\right)^{1/p}+(1-\lambda)\left(\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t_{2})]\right)^{1/p},

where the first step holds because \ell is nonnegative, and the second follows from the triangle inequality. This implies (b). To see (c), it is obvious that limtπ1,(F,t)=\lim_{t\to\infty}\pi_{1,\ell}(F,t)=\infty. Note that (𝔼F[p(𝜷𝐙,t)])1/p𝔼F[(𝜷𝐙,t)].(\mathbb{E}^{F}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t)])^{1/p}\geqslant\mathbb{E}^{F}[\ell(\bm{\beta}^{\top}\mathbf{Z},t)]. Combining with limt(z,t)/t<1\lim_{t\to-\infty}{\partial\ell(z,t)}/{\partial t}<-1 and the convexity of (z,t)\ell(z,t) in tt, we have limtπ1,(F,t)=\lim_{t\to-\infty}\pi_{1,\ell}(F,t)=\infty. Hence, we conclude the proof of (c). Using (b) and (c), the set of all minimizers of the problem inftπ1,(F,t)\inf_{t\in\mathbb{R}}\pi_{1,\ell}(F,t) is a closed interval. Denote by t(F):=infargmintπ1,(F,t)t(F):=\inf\arg\min_{t}\pi_{1,\ell}(F,t). We will show that {t(F):Fp(G0,ε)}\{t(F):F\in\mathcal{B}_{p}(G_{0},\varepsilon)\} is a subset of a compact set. For any FBp(G0,ε)F\in B_{p}(G_{0},\varepsilon) and tt\in\mathbb{R}, let 𝐙F\mathbf{Z}\sim F and 𝐙𝟎G0\mathbf{Z_{0}}\sim G_{0}, and we have

\displaystyle|\pi_{1,\ell}(F,t)-\pi_{1,\ell}(G_{0},t)| =\left|\left(\mathbb{E}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z},t)]\right)^{1/p}-\left(\mathbb{E}[\ell^{p}(\bm{\beta}^{\top}\mathbf{Z_{0}},t)]\right)^{1/p}\right|
(𝔼[|(𝜷𝐙,t)(𝜷𝐙0,t)|p])1/p\displaystyle\leqslant\left(\mathbb{E}[|\ell(\bm{\beta}^{\top}\mathbf{Z},t)-\ell(\bm{\beta}^{\top}\mathbf{Z}_{0},t)|^{p}]\right)^{1/p}
(𝔼[Mp|𝜷(𝐙𝐙0)|p])1/p\displaystyle\leqslant\left(\mathbb{E}[M^{p}|\bm{\beta}^{\top}(\mathbf{Z}-\mathbf{Z}_{0})|^{p}]\right)^{1/p}
M𝜷(𝔼[𝐙𝐙0p])1/pM𝜷ε,\displaystyle\leqslant M\|\bm{\beta}\|_{*}\left(\mathbb{E}[\|\mathbf{Z}-\mathbf{Z}_{0}\|^{p}]\right)^{1/p}\leqslant M\|\bm{\beta}\|_{*}\varepsilon, (62)

where the first and the third inequalities follow from the triangle inequality and Hölder’s inequality, respectively, and we have used the definition of the Wasserstein ball p(G0,ε)\mathcal{B}_{p}(G_{0},\varepsilon) in the last step. Hence, it holds that

π1,(F,t(G0))π1,(G0,t(G0))+M𝜷ε.\displaystyle\pi_{1,\ell}(F,t(G_{0}))\leqslant\pi_{1,\ell}(G_{0},t(G_{0}))+M\|\bm{\beta}\|_{*}\varepsilon. (63)

Note that \pi_{1,\ell}(G_{0},t)\to\infty as t\to\pm\infty. Hence, there exists \Delta>0 such that \pi_{1,\ell}(G_{0},t)>\pi_{1,\ell}(G_{0},t(G_{0}))+2M\|\bm{\beta}\|_{*}\varepsilon for all t\notin[t(G_{0})-\Delta,t(G_{0})+\Delta]. This, combined with (62), implies that

π1,(F,t)π1,(G0,t)M𝜷ε>π1,(G0,t(G0))+M𝜷ε,t[t(G0)Δ,t(G0)+Δ].\displaystyle\pi_{1,\ell}(F,t)\geqslant\pi_{1,\ell}(G_{0},t)-M\|\bm{\beta}\|_{*}\varepsilon>\pi_{1,\ell}(G_{0},t(G_{0}))+M\|\bm{\beta}\|_{*}\varepsilon,~{}~{}\forall t\notin[t(G_{0})-\Delta,t(G_{0})+\Delta]. (64)

Applying (63) and (64), we have {t(F):Fp(G0,ε)}[t(G0)Δ,t(G0)+Δ]\{t(F):F\in\mathcal{B}_{p}(G_{0},\varepsilon)\}\subseteq[t(G_{0})-\Delta,t(G_{0})+\Delta]. Using a minimax theorem (see e.g., Sion (1958)), it holds that

supFp(G0,ε)inftπ1,(F,t)\displaystyle\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\inf_{t\in\mathbb{R}}\pi_{1,\ell}(F,t) =supFp(G0,ε)inft[t(G0)Δ,t(G0)+Δ]π1,(F,t)\displaystyle=\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\inf_{t\in[t(G_{0})-\Delta,t(G_{0})+\Delta]}\pi_{1,\ell}(F,t)
=inft[t(G0)Δ,t(G0)+Δ]supFp(G0,ε)π1,(F,t)inftsupFp(G0,ε)π1,(F,t).\displaystyle=\inf_{t\in[t(G_{0})-\Delta,t(G_{0})+\Delta]}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\pi_{1,\ell}(F,t)\geqslant\inf_{t\in\mathbb{R}}\sup_{F\in\mathcal{B}_{p}(G_{0},\varepsilon)}\pi_{1,\ell}(F,t).

The converse direction is trivial. Hence, we complete the proof.

(ii) The proof is similar to (i). ∎

Proof of Corollary 1. (i) For the case (z,t):=c(zt)\ell(z,t):=c\ell(z-t) with =3\ell=\ell_{3} or 4\ell_{4} in Theorem 4, one can verify that (z,t)\ell(z,t) satisfies the conditions in Lemma 8 (ii). Hence, the equivalency result follows immediately from Lemma 8 and Theorem 4. For the case (z,t):=c(zt)\ell(z,t):=c\ell(z-t) with =1\ell=\ell_{1} or 2\ell_{2} in Theorem 4, we assume without loss of generality that 1(x)=x+\ell_{1}(x)=x_{+} and 2(x)=x\ell_{2}(x)=x_{-}. In this case, we have

𝒱F(𝜷𝐙)=inftcp𝔼F[ip(𝜷𝐙t)]={limtcp𝔼F[(𝜷𝐙t)+p]=0,i=1,limtcp𝔼F[(𝜷𝐙t)p]=0,i=2.\displaystyle\mathcal{V}^{F}(\bm{\beta}^{\top}\mathbf{Z})=\inf_{t\in\mathbb{R}}c^{p}\mathbb{E}^{F}[\ell_{i}^{p}(\bm{\beta}^{\top}\mathbf{Z}-t)]=\begin{cases}\lim_{t\to\infty}c^{p}\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{Z}-t)_{+}^{p}]=0,~{}~{}&i=1,\\[8.0pt] \lim_{t\to-\infty}c^{p}\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{Z}-t)_{-}^{p}]=0,~{}~{}&i=2.\end{cases}

(ii) For (z,t):=c(zt)\ell(z,t):=c\ell(z-t) with =1\ell=\ell_{1}, 3\ell_{3} or 4\ell_{4} in Theorem 4, it holds that (z,t)\ell(z,t) satisfies the conditions in Lemma 8 (i). Applying Lemma 8 and Theorem 4, the equivalency result holds. For (z,t):=c(zt)\ell(z,t):=c\ell(z-t) with =2\ell=\ell_{2} in Theorem 4, we assume without loss of generality that 2(x)=x\ell_{2}(x)=x_{-}. In this case, for any c>1c>1, 𝜷𝒟\bm{\beta}\in\mathcal{D} and Fp(n)F\in\mathcal{M}_{p}(\mathbb{R}^{n}), it holds that

ρF(𝜷𝐙)=inft{t+c(𝔼F[(𝜷𝐙t)p])1/p}=limt{t+c(𝔼F[(𝜷𝐙t)p])1/p}=.\displaystyle\rho^{F}(\bm{\beta}^{\top}\mathbf{Z})=\inf_{t\in\mathbb{R}}\left\{t+c\left(\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{Z}-t)_{-}^{p}]\right)^{1/p}\right\}=\lim_{t\to-\infty}\left\{t+c\left(\mathbb{E}^{F}[(\bm{\beta}^{\top}\mathbf{Z}-t)_{-}^{p}]\right)^{1/p}\right\}=-\infty.

This completes the proof.   

B.2 Proofs of Section 4.2

To prove the results in Section 4.2, we list the definition of comonotonicity and some basic properties of distortion functionals as follows; see e.g., Wang et al. (2020).

Definition 1 (Comonotonicity).

Two random variables Z1Z_{1} and Z2Z_{2} are said to be comonotonic if (Z1,Z2)(Z_{1},Z_{2}) is distributionally equivalent to (FZ11(U),FZ21(U))(F^{-1}_{Z_{1}}(U),F^{-1}_{Z_{2}}(U)), where FZ1F_{Z}^{-1} denotes the left-quantile function of ZZ, and UU is a random variable uniformly distributed on the interval [0,1][0,1] (see e.g., Dhaene et al. (2002) for a discussion of comonotonic random variables).

Lemma 9.

Let h:[0,1]h:[0,1]\to\mathbb{R} be a distortion function. The following statements hold.

  • (i)

    ρh\rho_{h} is monotone, i.e., ρh(Z1)ρh(Z2)\rho_{h}(Z_{1})\leqslant\rho_{h}(Z_{2}) for Z1Z2Z_{1}\leqslant Z_{2} a.s. if hh is increasing.

  • (ii)

    ρh\rho_{h} is translation invariant, i.e., ρh(Z+c)=ρh(Z)+(h(1)h(0))c\rho_{h}(Z+c)=\rho_{h}(Z)+(h(1)-h(0))c for any ZZ and cc\in\mathbb{R}.

  • (iii)

    ρh\rho_{h} is positively homogeneous, i.e., ρh(λZ)=λρh(Z)\rho_{h}(\lambda Z)=\lambda\rho_{h}(Z) for any ZZ and λ0\lambda\geqslant 0.

  • (iv)

    ρh\rho_{h} is subadditive, i.e., ρh(Z1+Z2)ρh(Z1)+ρh(Z2)\rho_{h}(Z_{1}+Z_{2})\leqslant\rho_{h}(Z_{1})+\rho_{h}(Z_{2}) for any Z1Z_{1} and Z2Z_{2} if hh is convex. The equality holds when Z1Z_{1} and Z2Z_{2} are comonotonic.

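To make Lemma 9 concrete, the following numerical sketch (an illustration of ours, not part of the formal development) approximates the distortion functional on an empirical sample, assuming the representation \rho_{h}(Z)=\int_{0}^{1}F_{Z}^{-1}(u)\,\mathrm{d}h(u), which is consistent with the properties listed above, and checks translation invariance, positive homogeneity, subadditivity for a convex h, and additivity for a comonotonic pair. The choices h(u)=u^{2} and the two increasing transforms of U are for illustration only.

```python
import numpy as np

def rho_h(z, h):
    # Empirical distortion functional: sum_i z_(i) * [h(i/N) - h((i-1)/N)],
    # i.e., a discrete version of rho_h(Z) = int_0^1 F_Z^{-1}(u) dh(u).
    z = np.sort(np.asarray(z, dtype=float))
    N = len(z)
    u = np.arange(N + 1) / N
    return float(z @ (h(u[1:]) - h(u[:-1])))

h = lambda u: u ** 2              # convex, increasing distortion with h(0) = 0, h(1) = 1

N = 100_000
U = (np.arange(N) + 0.5) / N      # grid proxy for U ~ Uniform[0, 1]
Z1 = 10.0 * U ** 2                # Z1 = F_{Z1}^{-1}(U)
Z2 = np.exp(U)                    # Z2 = F_{Z2}^{-1}(U); Z1 and Z2 are comonotonic

print(abs(rho_h(Z1 + 3.0, h) - (rho_h(Z1, h) + 3.0)) < 1e-8)          # (ii) translation invariance
print(abs(rho_h(2.0 * Z1, h) - 2.0 * rho_h(Z1, h)) < 1e-8)            # (iii) positive homogeneity
print(rho_h(Z1 + Z2, h) <= rho_h(Z1, h) + rho_h(Z2, h) + 1e-8)        # (iv) subadditivity (h convex)
print(abs(rho_h(Z1 + Z2, h) - (rho_h(Z1, h) + rho_h(Z2, h))) < 1e-8)  # (iv) equality: comonotonic pair
```
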
Proof of Theorem 5. We first consider the case p=1p=1. Assume without loss of generality that Lip()=1{\rm Lip}(\ell)=1. By Lemma 4, it suffices to prove that for any ZL1Z\in L^{1} and ε0\varepsilon\geqslant 0,

supV1ερh((Z+V))=ρh((Z))+hε.\displaystyle\sup_{\|V\|_{1}\leqslant\varepsilon}\rho_{h}(\ell(Z+V))=\rho_{h}(\ell(Z))+\|h^{\prime}_{-}\|_{\infty}\varepsilon.

To see it, first note that

supV1ερh((Z+V))\displaystyle\sup_{\|V\|_{1}\leqslant\varepsilon}\rho_{h}(\ell(Z+V)) supV1ερh((Z)+|V|)ρh((Z))+supV1ερh(|V|)\displaystyle\leqslant\sup_{\|V\|_{1}\leqslant\varepsilon}\rho_{h}(\ell(Z)+|V|)\leqslant\rho_{h}(\ell(Z))+\sup_{\|V\|_{1}\leqslant\varepsilon}\rho_{h}(|V|)
ρh((Z))+supV1εhV1=ρh((Z))+hε,\displaystyle\leqslant\rho_{h}(\ell(Z))+\sup_{\|V\|_{1}\leqslant\varepsilon}\|h^{\prime}_{-}\|_{\infty}\|V\|_{1}=\rho_{h}(\ell(Z))+\|h^{\prime}_{-}\|_{\infty}\varepsilon, (65)

where the first inequality follows from (Z+V)(Z)|V|\ell(Z+V)-\ell(Z)\leqslant|V| and the monotonicity of ρh\rho_{h}, the second inequality follows from the subadditivity of ρh\rho_{h}, and the last inequality is due to Hölder’s inequality. Let us now verify the other direction. Suppose that limx(x)=1\lim_{x\to\infty}\ell^{\prime}_{-}(x)=1 where \ell^{\prime}_{-} is the left-derivative of \ell. Denote by UU a uniform random variable on [0,1][0,1] such that UU and (Z)\ell(Z) are comonotonic. For nn\in\mathbb{N} and δ>0\delta>0, define

An,δ={ω:1εnδ<U(ω)1δ}andVn,δ=n𝟙An,δ.\displaystyle A_{n,\delta}=\left\{\omega:1-\frac{\varepsilon}{n}-\delta<U(\omega)\leqslant 1-\delta\right\}~{}~{}{\rm and}~{}~{}V_{n,\delta}=n\mathds{1}_{A_{n,\delta}}. (66)

One can check that Vn,δ1=ε\|V_{n,\delta}\|_{1}=\varepsilon. Moreover, we have

ρh((Z+Vn,δ))\displaystyle\rho_{h}(\ell(Z+V_{n,\delta})) =ρh((Z)𝟙An,δc+(Z+n)𝟙An,δ)\displaystyle=\rho_{h}\left(\ell(Z)\mathds{1}_{A_{n,\delta}^{c}}+\ell(Z+n)\mathds{1}_{A_{n,\delta}}\right)
=ρh((Z)+((Z+n)(Z))𝟙An,δ)\displaystyle=\rho_{h}\left(\ell(Z)+(\ell(Z+n)-\ell(Z))\mathds{1}_{A_{n,\delta}}\right)
𝔼[((Z)+((Z+n)(Z))𝟙An,δ)h(U)]\displaystyle\geqslant\mathbb{E}\left[\left(\ell(Z)+(\ell(Z+n)-\ell(Z))\mathds{1}_{A_{n,\delta}}\right)h_{-}^{\prime}(U)\right]
=ρh((Z))+𝔼[((Z+n)(Z))𝟙An,δh(U)],\displaystyle=\rho_{h}(\ell(Z))+\mathbb{E}\left[(\ell(Z+n)-\ell(Z))\mathds{1}_{A_{n,\delta}}h_{-}^{\prime}(U)\right], (67)

where the inequality follows from the dual representation of \rho_{h} (see e.g., Theorem 4.79 of Föllmer and Schied (2016)). Since \ell(Z) and U are comonotonic, \ell(Z) must be uniformly bounded on A_{n,\delta} for all n>\varepsilon/(1-\delta); combined with the convexity of \ell, this yields some x_{0}\in\mathbb{R} such that Z\geqslant x_{0} on A_{n,\delta} for all n>\varepsilon/(1-\delta). Therefore, we have

𝔼[((Z+n)(Z))𝟙An,δh(U)]\displaystyle\mathbb{E}\left[(\ell(Z+n)-\ell(Z))\mathds{1}_{A_{n,\delta}}h_{-}^{\prime}(U)\right] =𝔼[(Z+n)(Z)nVn,δh(U)]\displaystyle=\mathbb{E}\left[\frac{\ell(Z+n)-\ell(Z)}{n}V_{n,\delta}h^{\prime}_{-}(U)\right]
(x0+n)(x0)n𝔼[Vn,δh(U)]\displaystyle\geqslant\frac{\ell(x_{0}+n)-\ell(x_{0})}{n}\mathbb{E}\left[V_{n,\delta}h^{\prime}_{-}(U)\right]
=(x0+n)(x0)n1ε/nδ1δh(s)dsε/nεh(1δ)εasn,\displaystyle=\frac{\ell(x_{0}+n)-\ell(x_{0})}{n}\frac{\int_{1-\varepsilon/n-\delta}^{1-\delta}h^{\prime}_{-}(s)\mathrm{d}s}{\varepsilon/n}\varepsilon\rightarrow h^{\prime}_{-}(1-\delta)\varepsilon~{}~{}{\rm as}~{}n\to\infty,

where the inequality follows from the convexity of \ell and Z\geqslant x_{0} on A_{n,\delta}, and the convergence is due to \lim_{x\to\infty}\ell^{\prime}_{-}(x)=1. Note that h^{\prime}_{-} is left continuous on (0,1] (see e.g., Proposition A.4 of Föllmer and Schied (2016)), so \lim_{x\to 1-}h^{\prime}_{-}(x)=h^{\prime}_{-}(1). Therefore, letting \delta\to 0 and combining with (67), we conclude that

supV1ερh((Z+V))ρh((Z))+h(1)ε=ρh((Z))+hε.\sup_{\|V\|_{1}\leqslant\varepsilon}\rho_{h}(\ell(Z+V))\geqslant\rho_{h}(\ell(Z))+h^{\prime}_{-}(1)\varepsilon=\rho_{h}(\ell(Z))+\|h^{\prime}_{-}\|_{\infty}\varepsilon.

This verifies the other direction in the case \lim_{x\to\infty}\ell^{\prime}_{-}(x)=1. If \lim_{x\to-\infty}\ell^{\prime}_{-}(x)=-1, the proof is similar, with V_{n,\delta}=-n\mathds{1}_{A_{n,\delta}}. This completes the proof of the case p=1.
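
Before turning to p>1, the construction in (66) can be visualized numerically. The following sketch (ours, with illustrative choices \ell(x)=x_{+}, h(u)=u^{2}, Z uniform on [-2,2], and a discrete approximation of \rho_{h} via \rho_{h}(Z)=\int_{0}^{1}F_{Z}^{-1}(u)\,\mathrm{d}h(u)) shows the value attained by V_{n,\delta} approaching the claimed bound \rho_{h}(\ell(Z))+\|h^{\prime}_{-}\|_{\infty}\varepsilon as n grows, in line with (67).

```python
import numpy as np

def rho_h(z, h):
    # discrete distortion functional, rho_h(Z) = int_0^1 F_Z^{-1}(u) dh(u)
    z = np.sort(np.asarray(z, dtype=float))
    N = len(z)
    u = np.arange(N + 1) / N
    return float(z @ (h(u[1:]) - h(u[:-1])))

h   = lambda u: u ** 2                 # convex distortion with ||h'_-||_inf = h'_-(1) = 2
ell = lambda x: np.maximum(x, 0.0)     # Lip(ell) = 1 and lim_{x -> inf} ell'_-(x) = 1

N, eps, delta = 1_000_000, 0.5, 1e-3
U = (np.arange(N) + 0.5) / N           # grid proxy for U, comonotonic with ell(Z) below
Z = 4.0 * U - 2.0                      # Z = F_Z^{-1}(U), uniform on [-2, 2]

print("upper bound:", rho_h(ell(Z), h) + 2.0 * eps)   # rho_h(ell(Z)) + ||h'_-||_inf * eps
for n in (10, 100, 1000):
    # V_{n,delta} from (66): mass eps placed just below the (1 - delta)-quantile of U
    V = n * ((U > 1.0 - eps / n - delta) & (U <= 1.0 - delta)).astype(float)
    print(n, "||V||_1 =", round(float(np.mean(np.abs(V))), 4),
          "rho_h(ell(Z+V)) =", round(rho_h(ell(Z + V), h), 4))
```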

Let us now focus on the case p>1p>1. In the rest of the proof, we assume C=1C=1 and h(0)=1h(1)=0h(0)=1-h(1)=0 without loss of generality. By Lemma 4, it suffices to show that (ii) holds if and only if the following (i)’ holds.

  • (i)’

    For any ZLpZ\in L^{p} and ε0\varepsilon\geqslant 0, we have (33) holds, i.e.,

    supVpερh((Z+V))=ρh((Z))+εhq,ZLp,ε0.\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(\ell(Z+V))=\rho_{h}(\ell(Z))+\varepsilon\|h^{\prime}_{-}\|_{q},~{}~{}\forall Z\in L^{p},~{}\varepsilon\geqslant 0. (68)

To see (ii) \Rightarrow (i)’, we note that ρh(Z)=ρh~(Z)\rho_{h}(-Z)=\rho_{\widetilde{h}}(Z), where h~(s)=h(1s)\widetilde{h}(s)=h(1-s) for s[0,1]s\in[0,1]. It is easy to see that h~\widetilde{h} is convex whenever hh is convex. Hence, (a) is a special case of Proposition 2, and we choose to omit its proof here for brevity. In the case of (b), we assume without loss of generality that (x)=|x|\ell(x)=|x| for xx\in\mathbb{R}. Note that hh is an increasing and convex distortion function. By Lemma 9, we have for any VV with Vpε\|V\|_{p}\leqslant\varepsilon,

ρh(|Z+V|)\displaystyle\rho_{h}(|Z+V|) ρh(|Z|)+ρh(|V|)ρh(|Z|)+Vphqρh(|Z|)+εhq,\displaystyle\leqslant\rho_{h}(|Z|)+\rho_{h}(|V|)\leqslant\rho_{h}(|Z|)+\|V\|_{p}\|h^{\prime}_{-}\|_{q}\leqslant\rho_{h}(|Z|)+\varepsilon\|h^{\prime}_{-}\|_{q}, (69)

where the second inequality follows from Hölder’s inequality. Suppose that |Z|=F|Z|1(U)|Z|=F_{|Z|}^{-1}(U) a.s. where UU is a uniform random variable on [0,1][0,1]. Define V~=sign(Z)ε(h(U))q/p/hqq/p.\widetilde{V}={\rm sign}(Z){\varepsilon(h^{\prime}_{-}(U))^{q/p}}/{\|h^{\prime}_{-}\|_{q}^{q/p}}. One can verify that V~p=ε\|\widetilde{V}\|_{p}=\varepsilon and ρh(|Z+V~|)=ρh(|Z|)+εhq\rho_{h}(|Z+\widetilde{V}|)=\rho_{h}(|Z|)+\varepsilon\|h^{\prime}_{-}\|_{q}. Therefore, we conclude that (68) holds. Hence, we complete the proof of (i)’.
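
A quick numerical check of this attaining perturbation (ours; with the illustrative choices p=q=2, h(u)=u^{2}, \ell(x)=|x|, |Z| uniform on [0,2], and the same discrete approximation of \rho_{h}) confirms that \|\widetilde{V}\|_{p}=\varepsilon and that \rho_{h}(|Z+\widetilde{V}|)=\rho_{h}(|Z|)+\varepsilon\|h^{\prime}_{-}\|_{q}.

```python
import numpy as np

def rho_h(z, h):
    # discrete distortion functional, rho_h(Z) = int_0^1 F_Z^{-1}(u) dh(u)
    z = np.sort(np.asarray(z, dtype=float))
    N = len(z)
    u = np.arange(N + 1) / N
    return float(z @ (h(u[1:]) - h(u[:-1])))

h, hp = (lambda u: u ** 2), (lambda u: 2.0 * u)    # h'_-(u) = 2u
p = q = 2.0                                        # conjugate exponents: 1/p + 1/q = 1
eps = 0.5

N = 1_000_000
U = (np.arange(N) + 0.5) / N                       # |Z| = F_{|Z|}^{-1}(U)
absZ = 2.0 * U                                     # |Z| uniform on [0, 2]
sgn = np.where(np.random.default_rng(1).uniform(size=N) < 0.5, -1.0, 1.0)
Z = sgn * absZ

hq = float(np.mean(hp(U) ** q)) ** (1.0 / q)       # ||h'_-||_q  (= sqrt(4/3) here)
V = sgn * eps * hp(U) ** (q / p) / hq ** (q / p)   # the attaining perturbation V~

print(float(np.mean(np.abs(V) ** p)) ** (1.0 / p))                 # = eps
print(rho_h(np.abs(Z + V), h), rho_h(np.abs(Z), h) + eps * hq)     # attained value vs. RHS of (68)
```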

To see (i)’ \Rightarrow (ii), note that h is increasing with h(0)=1-h(1)=0. It holds that \rho_{h} satisfies monotonicity and translation invariance, i.e., \rho_{h}(Z+c)=\rho_{h}(Z)+c for all Z\in L^{p} and c\in\mathbb{R}. We first show that {\rm Lip}(\ell)\leqslant 1 by contradiction. Otherwise, since a convex function is differentiable almost everywhere, there exists x such that |\ell^{\prime}(x)|>1. Define V_{1}\in L^{p} with quantile function F_{V_{1}}^{-1}(s)={\varepsilon(h^{\prime}_{-}(s))^{q/p}}/{\|h^{\prime}_{-}\|_{q}^{q/p}},~s\in(0,1); one can easily check that V_{1}\geqslant 0 and \|V_{1}\|_{p}=\varepsilon. If \ell^{\prime}(x)>1, then

supVpερh((x+V))\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(\ell(x+V)) supVpε,V0ρh((x)+(x)V)\displaystyle\geqslant\sup_{\|V\|_{p}\leqslant\varepsilon,V\geqslant 0}\rho_{h}(\ell(x)+\ell^{\prime}(x)V)
=(x)+(x)supVpε,V0ρh(V)\displaystyle=\ell(x)+\ell^{\prime}(x)\sup_{\|V\|_{p}\leqslant\varepsilon,V\geqslant 0}\rho_{h}(V)
(x)+(x)ρh(V1)\displaystyle\geqslant\ell(x)+\ell^{\prime}(x)\rho_{h}(V_{1})
=(x)+(x)hqε>(x)+hqε.\displaystyle=\ell(x)+\ell^{\prime}(x)\|h^{\prime}_{-}\|_{q}\varepsilon>\ell(x)+\|h^{\prime}_{-}\|_{q}\varepsilon.

If (x)<1\ell^{\prime}(x)<-1, then define V2=V1V_{2}=-V_{1}, and we have V20V_{2}\leqslant 0 and V2p=ε\|V_{2}\|_{p}=\varepsilon. Hence,

supVpερh((x+V))\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(\ell(x+V)) supVpε,V0ρh((x)+(x)V)\displaystyle\geqslant\sup_{\|V\|_{p}\leqslant\varepsilon,V\leqslant 0}\rho_{h}(\ell(x)+\ell^{\prime}(x)V)
=(x)(x)supVpε,V0ρh(V)\displaystyle=\ell(x)-\ell^{\prime}(x)\sup_{\|V\|_{p}\leqslant\varepsilon,V\leqslant 0}\rho_{h}(-V)
(x)(x)ρh(V2)\displaystyle\geqslant\ell(x)-\ell^{\prime}(x)\rho_{h}(-V_{2})
=(x)(x)ρh(V1)\displaystyle=\ell(x)-\ell^{\prime}(x)\rho_{h}(V_{1})
=(x)(x)hqε>(x)+hqε.\displaystyle=\ell(x)-\ell^{\prime}(x)\|h^{\prime}_{-}\|_{q}\varepsilon>\ell(x)+\|h^{\prime}_{-}\|_{q}\varepsilon.

Both cases yield a contradiction to (68), and hence {\rm Lip}(\ell)\leqslant 1.

Next, we aim to show

\displaystyle|\ell^{\prime}(x)|=1~\text{ for all }x\in\mathbb{R}~\text{ at which }\ell~\text{ is differentiable}, (70)

as this will complete the proof, since a convex function satisfying (70) must take the form of \ell_{1} or \ell_{2} with C=1. To show (70), suppose by contradiction that \ell is differentiable at some x_{0}\in\mathbb{R} with |\ell^{\prime}(x_{0})|<1. If p=\infty, then we have

\displaystyle\sup_{\|V\|_{\infty}\leqslant\varepsilon}\rho_{h}(\ell(x_{0}+V))=\max\{\ell(x_{0}-\varepsilon),\ell(x_{0}+\varepsilon)\}<\ell(x_{0})+\varepsilon,

where the strict inequality follows from |(x0)|<1|\ell^{\prime}(x_{0})|<1 and Lip()1{\rm Lip}(\ell)\leqslant 1, which yields a contradiction. Suppose now p(1,)p\in(1,\infty). To yield a contradiction, it suffices to prove that

supVpερh((x0+V))<(x0)+εhq.\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(\ell(x_{0}+V))<\ell(x_{0})+\varepsilon\|h^{\prime}_{-}\|_{q}. (71)

Let UU[0,1]U\sim{\rm U}[0,1] be a uniformly distributed random variable, and define

𝒳p={F1(U):01|F1(u)|p𝑑u<,F1(U)0}.\mathcal{X}_{p}=\left\{F^{-1}(U):\int_{0}^{1}|F^{-1}(u)|^{p}du<\infty,~{}F^{-1}(U)\geqslant 0\right\}.

We have that \{F_{X}:X\in L^{p},\,X\geqslant 0\}=\{F_{X}:X\in\mathcal{X}_{p}\} and the random variables in \mathcal{X}_{p} are all comonotonic. Define

α=sup{α[0,1]:h(α)=0}<1andM=ε(1α2)1/p.\alpha^{*}=\sup\{\alpha\in[0,1]:h(\alpha)=0\}<1~{}~{}{\rm and}~{}~{}M=\varepsilon\left(\frac{1-\alpha^{*}}{2}\right)^{-1/p}. (72)

It follows from Chebyshev’s inequality that

(V>M)𝔼[Vp]MpεpMp=1α2ifVpεandV0.\mathbb{P}(V>M)\leqslant\frac{\mathbb{E}[V^{p}]}{M^{p}}\leqslant\frac{\varepsilon^{p}}{M^{p}}=\frac{1-\alpha^{*}}{2}~{}~{}{\rm if}~{}\|V\|_{p}\leqslant\varepsilon~{}{\rm and}~{}V\geqslant 0. (73)

Define

k=max{|(x0+M)(x0)M|,|(x0)(x0M)M|}.k=\max\left\{\left|\frac{\ell(x_{0}+M)-\ell(x_{0})}{M}\right|,\left|\frac{\ell(x_{0})-\ell(x_{0}-M)}{M}\right|\right\}.

We have that |\ell(x_{0}+v)-\ell(x_{0})|\leqslant k|v| for all v\in[-M,M] as \ell is convex. Further, it holds that k<1 as |\ell^{\prime}(x_{0})|<1 and {\rm Lip}(\ell)\leqslant 1. Therefore,

supVpερh((x0+V))\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(\ell(x_{0}+V)) =supVpερh((x0+V)𝟙{|V|M}+(x0+V)𝟙{|V|>M})\displaystyle=\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}\left(\ell(x_{0}+V)\mathds{1}_{\{|V|\leqslant M\}}+\ell(x_{0}+V)\mathds{1}_{\{|V|>M\}}\right)
supVpερh(((x0)+k|V|)𝟙{|V|M}+(x0+V)𝟙{|V|>M})\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}\left((\ell(x_{0})+k|V|)\mathds{1}_{\{|V|\leqslant M\}}+\ell(x_{0}+V)\mathds{1}_{\{|V|>M\}}\right)
supVpερh(((x0)+k|V|)𝟙{|V|M}+((x0)+|V|)𝟙{|V|>M})\displaystyle\leqslant\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}\left((\ell(x_{0})+k|V|)\mathds{1}_{\{|V|\leqslant M\}}+(\ell(x_{0})+|V|)\mathds{1}_{\{|V|>M\}}\right)
=(x0)+supVpερh(k|V|𝟙{|V|M}+|V|𝟙{|V|>M})\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}\left(k|V|\mathds{1}_{\{|V|\leqslant M\}}+|V|\mathds{1}_{\{|V|>M\}}\right)
=(x0)+supVpε,V0ρh(kV𝟙{VM}+V𝟙{V>M})\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon,V\geqslant 0}\rho_{h}\left(kV\mathds{1}_{\{V\leqslant M\}}+V\mathds{1}_{\{V>M\}}\right)
=(x0)+supVpε,V𝒳pρh(kV𝟙{VM}+V𝟙{V>M})\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon,V\in\mathcal{X}_{p}}\rho_{h}\left(kV\mathds{1}_{\{V\leqslant M\}}+V\mathds{1}_{\{V>M\}}\right)
=(x0)+supVpε,V𝒳p𝔼[(kV𝟙{VM}+V𝟙{V>M})h(U)]\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon,V\in\mathcal{X}_{p}}\mathbb{E}\left[\left(kV\mathds{1}_{\{V\leqslant M\}}+V\mathds{1}_{\{V>M\}}\right)h^{\prime}_{-}(U)\right]
=(x0)+supVpε,V𝒳p𝔼[(V(1k)V𝟙{VM})h(U)]\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon,V\in\mathcal{X}_{p}}\mathbb{E}\left[\left(V-(1-k)V\mathds{1}_{\{V\leqslant M\}}\right)h^{\prime}_{-}(U)\right]
=(x0)+supVpε,V𝒳p{ρh(V)(1k)𝔼[V𝟙{VM}h(U)]}\displaystyle=\ell(x_{0})+\sup_{\|V\|_{p}\leqslant\varepsilon,V\in\mathcal{X}_{p}}\{\rho_{h}(V)-(1-k)\mathbb{E}[V\mathds{1}_{\{V\leqslant M\}}h^{\prime}_{-}(U)]\}
=:(x0)+I,\displaystyle=:\ell(x_{0})+I, (74)

where the second inequality follows from Lip()1{\rm Lip}(\ell)\leqslant 1, the fourth equality is due to the distribution-invariance of ρh\rho_{h}, and the fifth equality holds because V𝒳pV\in\mathcal{X}_{p} implies that kV𝟙{VM}+V𝟙{V>M}kV\mathds{1}_{\{V\leqslant M\}}+V\mathds{1}_{\{V>M\}} and h(U)h^{\prime}_{-}(U) are comonotonic. For δ>0\delta>0, define

𝒱1={V𝒳p:Vpε,𝔼[V𝟙{VM}h(U)]δ},𝒱2={V𝒳p:Vpε}𝒱1.\displaystyle\mathcal{V}_{1}=\left\{V\in\mathcal{X}_{p}:\|V\|_{p}\leqslant\varepsilon,~{}\mathbb{E}\left[V\mathds{1}_{\{V\leqslant M\}}h^{\prime}_{-}(U)\right]\leqslant\delta\right\},~{}~{}\mathcal{V}_{2}=\{V\in\mathcal{X}_{p}:\|V\|_{p}\leqslant\varepsilon\}\setminus\mathcal{V}_{1}. (75)

We can rewrite I=max{I1,I2}I=\max\{I_{1},I_{2}\} with

Ii=supV𝒱i{ρh(V)(1k)𝔼[V𝟙{VM}h(U)]},i=1,2.I_{i}=\sup_{V\in\mathcal{V}_{i}}\{\rho_{h}(V)-(1-k)\mathbb{E}[V\mathds{1}_{\{V\leqslant M\}}h^{\prime}_{-}(U)]\},~{}~{}i=1,2.

Below we aim to demonstrate that Ii<εhqI_{i}<\varepsilon\|h^{\prime}_{-}\|_{q} for i=1,2i=1,2 by selecting an appropriate δ\delta. Note that

I1\displaystyle I_{1} =supV𝒱1{ρh(V)(1k)𝔼[V𝟙{VM}h(U)]}\displaystyle=\sup_{V\in\mathcal{V}_{1}}\{\rho_{h}(V)-(1-k)\mathbb{E}[V\mathds{1}_{\{V\leqslant M\}}h^{\prime}_{-}(U)]\}
supV𝒱1ρh(V)\displaystyle\leqslant\sup_{V\in\mathcal{V}_{1}}\rho_{h}(V)
=supV𝒱1{𝔼[V𝟙{V>M}h(U)]+𝔼[V𝟙{VM}h(U)]}\displaystyle=\sup_{V\in\mathcal{V}_{1}}\left\{\mathbb{E}[V\mathds{1}_{\{V>M\}}h^{\prime}_{-}(U)]+\mathbb{E}[V\mathds{1}_{\{V\leqslant M\}}h^{\prime}_{-}(U)]\right\}
supV𝒱1h(U)𝟙{V>M}qVp+δ\displaystyle\leqslant\sup_{V\in\mathcal{V}_{1}}\|h^{\prime}_{-}(U)\mathds{1}_{\{V>M\}}\|_{q}\|V\|_{p}+\delta
supV𝒱1h(U)𝟙{V>M}qε+δ\displaystyle\leqslant\sup_{V\in\mathcal{V}_{1}}\|h^{\prime}_{-}(U)\mathds{1}_{\{V>M\}}\|_{q}\varepsilon+\delta
(1+α21(h(s))qds)1/qε+δ=(hqqα1+α2(h(s))qds)1/qε+δ,\displaystyle\leqslant\left(\int_{\frac{1+\alpha^{*}}{2}}^{1}(h^{\prime}_{-}(s))^{q}\mathrm{d}s\right)^{1/q}\varepsilon+\delta=\left(\|h^{\prime}_{-}\|_{q}^{q}-\int_{\alpha^{*}}^{\frac{1+\alpha^{*}}{2}}(h^{\prime}_{-}(s))^{q}\mathrm{d}s\right)^{1/q}\varepsilon+\delta, (76)

where the second inequality follows from Hölder’s inequality and the definition of 𝒱1\mathcal{V}_{1} in (75), the last inequality is due to (73), and the last equality follows from the definition of α\alpha^{*} in (72). Denote by

A=α1+α2(h(s))qdsandΔ=[hq(hqqA)1/q]ε.\displaystyle A=\int_{\alpha^{*}}^{\frac{1+\alpha^{*}}{2}}(h^{\prime}_{-}(s))^{q}\mathrm{d}s~{}~{}{\rm and}~{}~{}\Delta=\left[\|h^{\prime}_{-}\|_{q}-(\|h^{\prime}_{-}\|_{q}^{q}-A)^{1/q}\right]\varepsilon.

Recalling the definition of \alpha^{*} in (72) again, we have A>0 and \Delta>0, and hence, I_{1}<\varepsilon\|h^{\prime}_{-}\|_{q} whenever \delta<\Delta. Moreover, for any \delta>0, we have

\displaystyle I_{2}=\sup_{V\in\mathcal{V}_{2}}\{\rho_{h}(V)-(1-k)\mathbb{E}[V\mathds{1}_{\{V\leqslant M\}}h^{\prime}_{-}(U)]\}\leqslant\sup_{V\in\mathcal{V}_{2}}\|h^{\prime}_{-}\|_{q}\|V\|_{p}-(1-k)\delta<\varepsilon\|h^{\prime}_{-}\|_{q},

where the first inequality follows from Hölder’s inequality. Hence, choosing \delta\in(0,\Delta), we have \max\{I_{1},I_{2}\}<\varepsilon\|h^{\prime}_{-}\|_{q}. Combining with (74), we have verified (71). This completes the proof.

To substantiate the claim that our analysis is not limited to the examples presented in this paper, the following proposition shows that the arguments in the proof of Theorem 5 can also be used to establish the equivalence between Wasserstein DRO and regularization for a class of non-convex loss functions, provided that the support of the reference distribution is a finite set.

Proposition 3.

For p=1, let h be given as in Theorem 5. If the support of G_{0}\in\mathcal{M}_{1}(\mathbb{R}^{n}) is finite, and \ell:\mathbb{R}\to\mathbb{R} with {\rm Lip}(\ell)<\infty satisfies \lim_{t\to\infty}(\ell(x+t)-\ell(x))/t={\rm Lip}(\ell) for all x\in\mathbb{R} or \lim_{t\to-\infty}(\ell(x)-\ell(x+t))/t={\rm Lip}(\ell) for all x\in\mathbb{R}, then, for every \varepsilon\geqslant 0 and \mathcal{D}\subseteq\mathbb{R}^{n},

inf𝜷𝒟supF1(G0,ε)ρhF((𝜷𝐙))=inf𝜷𝒟{ρhG0((𝜷𝐙))+Lip()hε𝜷}.\displaystyle\inf_{\bm{\beta}\in\mathcal{D}}\sup_{F\in\mathcal{B}_{1}(G_{0},\varepsilon)}\rho_{h}^{F}(\ell(\bm{\beta}^{\top}\mathbf{Z}))=\inf_{\bm{\beta}\in\mathcal{D}}\left\{\rho_{h}^{G_{0}}(\ell(\bm{\beta}^{\top}\mathbf{Z}))+{\rm Lip}(\ell)\|h^{\prime}_{-}\|_{\infty}\varepsilon\|\bm{\beta}\|_{*}\right\}.
Proof.

Proof. Without loss of generality, assume {\rm Lip}(\ell)=1. We only give the proof for the case \lim_{t\to\infty}(\ell(x+t)-\ell(x))/t=1 for x\in\mathbb{R}, as the other case is similar. By Remark 5, it suffices to show that (33) holds with p=1 and Z=\bm{\beta}^{\top}\mathbf{Z}_{0}, where \mathbf{Z}_{0}\sim G_{0}. For n\in\mathbb{N} and \delta>0, define A_{n,\delta} and V_{n,\delta} as in (66). First note that both (65) and (67) hold for Z=\bm{\beta}^{\top}\mathbf{Z}_{0}. Noting that the support of Z=\bm{\beta}^{\top}\mathbf{Z}_{0} is finite, we denote it by \{x_{1},\dots,x_{k}\}. For Z=\bm{\beta}^{\top}\mathbf{Z}_{0}, it holds that

𝔼[((Z+n)(Z))𝟙An,δh(U)]\displaystyle\mathbb{E}\left[(\ell(Z+n)-\ell(Z))\mathds{1}_{A_{n,\delta}}h_{-}^{\prime}(U)\right] =𝔼[(Z+n)(Z)nVn,δh(U)]\displaystyle=\mathbb{E}\left[\frac{\ell(Z+n)-\ell(Z)}{n}V_{n,\delta}h^{\prime}_{-}(U)\right]
mini[k](xi+n)(xi)n𝔼[Vn,δh(U)]\displaystyle\geqslant\min_{i\in[k]}\frac{\ell(x_{i}+n)-\ell(x_{i})}{n}\mathbb{E}\left[V_{n,\delta}h^{\prime}_{-}(U)\right]
\displaystyle=\min_{i\in[k]}\frac{\ell(x_{i}+n)-\ell(x_{i})}{n}\frac{\int_{1-\varepsilon/n-\delta}^{1-\delta}h^{\prime}_{-}(s)\mathrm{d}s}{\varepsilon/n}\varepsilon
\displaystyle\rightarrow h^{\prime}_{-}(1-\delta)\varepsilon~~{\rm as}~n\to\infty.

Letting \delta\to 0 and combining this with (67), we conclude that

supV1ερh((Z+V))ρh((Z))+h(1)ε=ρh((Z))+hε.\displaystyle\sup_{\|V\|_{1}\leqslant\varepsilon}\rho_{h}(\ell(Z+V))\geqslant\rho_{h}(\ell(Z))+h^{\prime}_{-}(1)\varepsilon=\rho_{h}(\ell(Z))+\|h^{\prime}_{-}\|_{\infty}\varepsilon.

Hence, (33) holds with p=1p=1 and Z=𝜷𝐙0Z=\bm{\beta}^{\top}\mathbf{Z}_{0}, and thus we complete the proof. ∎

Proof of Proposition 2. By Lemma 4, it suffices to prove that

supVpερh(Z+V)=ρh(Z)+εhq,ZLp.\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(Z+V)=\rho_{h}(Z)+\varepsilon\|h^{\prime}_{-}\|_{q},~{}~{}\forall Z\in L^{p}. (77)

By Lemma 9, we have for any Vpε\|V\|_{p}\leqslant\varepsilon,

ρh(Z+V)\displaystyle\rho_{h}(Z+V) ρh(Z)+ρh(V)ρh(Z)+Vphqρh(Z)+εhq,\displaystyle\leqslant\rho_{h}(Z)+\rho_{h}(V)\leqslant\rho_{h}(Z)+\|V\|_{p}\|h^{\prime}_{-}\|_{q}\leqslant\rho_{h}(Z)+\varepsilon\|h^{\prime}_{-}\|_{q}, (78)

where the second inequality follows from Hölder’s inequality. Suppose that Z=FZ1(U)Z=F_{Z}^{-1}(U) a.s., where UU is a uniform random variable on [0,1][0,1] (see Lemma A.28 of Föllmer and Schied (2016) for the existence of UU). In the following, we show (77) by considering the following two cases.

  • (a)

    If p=1p=1, then q=q=\infty and h=max{|h(0+)|,|h(1)|}=max{h(0+),h(1)}\|h^{\prime}_{-}\|_{\infty}=\max\{|h^{\prime}_{-}(0+)|,|h^{\prime}_{-}(1)|\}=\max\{-h^{\prime}_{-}(0+),h^{\prime}_{-}(1)\}, where the second equality follows from hh^{\prime}_{-} is increasing. We first assume that h(1)h(0+)h^{\prime}_{-}(1)\geqslant-h^{\prime}_{-}(0+). Define a sequence {Vn}n\{V_{n}\}_{n\in\mathbb{N}} such that Vn=nε𝟙{U>11/n}V_{n}=n\varepsilon\mathds{1}_{\{U>1-1/n\}}. For all nn\in\mathbb{N}, it holds that Vn1=ε\|V_{n}\|_{1}=\varepsilon, and VnV_{n} and ZZ are comonotonic. Hence, we have

    supVpερh(Z+V)\displaystyle\sup_{\|V\|_{p}\leqslant\varepsilon}\rho_{h}(Z+V) ρh(Z+Vn)=ρh(Z)+nε11/n1h(s)ds\displaystyle\geqslant\rho_{h}(Z+V_{n})=\rho_{h}(Z)+n\varepsilon\int_{1-1/n}^{1}h^{\prime}_{-}(s)\mathrm{d}s
    ρh(Z)+εh(1)=ρh(Z)+εh,asn.\displaystyle\to\rho_{h}(Z)+\varepsilon h^{\prime}_{-}(1)=\rho_{h}(Z)+\varepsilon\|h^{\prime}_{-}\|_{\infty},~{}~{}{\rm as}~{}n\to\infty. (79)

    Combining (78) and (79), we have verified (77). Assume now h^{\prime}_{-}(1)<-h^{\prime}_{-}(0+). We can construct a sequence \{V_{n}\}_{n\in\mathbb{N}} with V_{n}=-n\varepsilon\mathds{1}_{\{U<1/n\}}, and then follow the same analysis to verify the result.

  • (b)

    If p(1,]p\in(1,\infty], then we define V~=sgn(h(U))ε|h(U)|q/p/hqq/p.\widetilde{V}={\rm sgn}(h^{\prime}_{-}(U)){\varepsilon|h^{\prime}_{-}(U)|^{q/p}}/{\|h^{\prime}_{-}\|_{q}^{q/p}}. One can verify that V~p=ε\|\widetilde{V}\|_{p}=\varepsilon and ρh(Z+V~)=ρh(Z)+εhq\rho_{h}(Z+\widetilde{V})=\rho_{h}(Z)+\varepsilon\|h^{\prime}_{-}\|_{q}. This, together with (78), implies that (77) holds.

Combining (a) and (b) establishes (77), which completes the proof.
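
As in the earlier sketch for Theorem 5, the extremal sequence in case (a) can be visualized numerically (ours; with h(u)=u^{2}, so that h^{\prime}_{-}(1)=\|h^{\prime}_{-}\|_{\infty}=2, a logistic reference Z, and the same discrete approximation of \rho_{h}); the attained values \rho_{h}(Z+V_{n}) increase to \rho_{h}(Z)+\varepsilon\|h^{\prime}_{-}\|_{\infty} as n grows.

```python
import numpy as np

def rho_h(z, h):
    # discrete distortion functional, rho_h(Z) = int_0^1 F_Z^{-1}(u) dh(u)
    z = np.sort(np.asarray(z, dtype=float))
    N = len(z)
    u = np.arange(N + 1) / N
    return float(z @ (h(u[1:]) - h(u[:-1])))

h = lambda u: u ** 2           # h'_-(1) = 2 >= -h'_-(0+) = 0, so ||h'_-||_inf = 2
eps, N = 0.5, 1_000_000
U = (np.arange(N) + 0.5) / N
Z = np.log(U / (1.0 - U))      # Z = F_Z^{-1}(U) (logistic), comonotonic with each V_n below

print("limit:", rho_h(Z, h) + eps * 2.0)                 # rho_h(Z) + eps * ||h'_-||_inf
for n in (10, 100, 1000):
    Vn = n * eps * (U > 1.0 - 1.0 / n).astype(float)     # V_n = n * eps * 1{U > 1 - 1/n}
    print(n, round(float(np.mean(np.abs(Vn))), 4), round(rho_h(Z + Vn, h), 4))
```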

Appendix C C Proofs of Section 6

Before presenting the proofs in Section 6, we first highlight some key elements required in our analysis for the non-affine case. As stated in Section 6, while our analysis for the non-affine case continues to rely on a projection set perspective, the equivalence between the projection sets of a Wasserstein ball and a max-sliced Wasserstein ball no longer holds. Instead, we explore the possibility of first establishing a confidence bound for a fixed decision 𝜷\bm{\beta} by identifying a set inclusion relationship with a one-dimensional Wasserstein ball and leveraging its measure concentration property (Lemma 10). We then extend these results to derive generalization bounds that hold uniformly across all 𝜷𝒟\bm{\beta}\in{\cal D} using covering number techniques (see e.g., Gao (2022) and Section 3.5 of Mohri et al. (2018)). For τ>0\tau>0, a τ\tau-cover of 𝒟\mathcal{D}, denoted by 𝒟τ\mathcal{D}_{\tau}, is a subset of 𝒟\mathcal{D} such that for each 𝜷𝒟\bm{\beta}\in\mathcal{D}, there exists 𝜷~𝒟τ\widetilde{\bm{\beta}}\in\mathcal{D}_{\tau} satisfying 𝜷~𝜷𝒟τ\|\widetilde{\bm{\beta}}-\bm{\beta}\|_{\mathcal{D}}\leqslant\tau, where 𝒟\|\cdot\|_{\mathcal{D}} is a norm on 𝒟\mathcal{D}. The covering number 𝒩(τ;𝒟,𝒟)\mathcal{N}(\tau;\mathcal{D},\|\cdot\|_{\mathcal{D}}) of 𝒟\mathcal{D} with respect to 𝒟\|\cdot\|_{\mathcal{D}} is the smallest cardinality of such a τ\tau-cover of 𝒟\mathcal{D}.
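
For intuition, the following sketch (ours; it takes \|\cdot\|_{\mathcal{D}} to be the Euclidean norm and \mathcal{D} a Euclidean ball, purely for illustration) builds an explicit \tau-cover of a ball from a cubic grid, projects the grid points back into the ball so that the cover is a subset of \mathcal{D} as required by the definition above, and compares its size with the volumetric bound (1+2U_{\mathcal{D}}/\tau)^{m} invoked later via Example 5.8 of Wainwright (2019).

```python
import itertools
import numpy as np

def ball_cover(radius, tau, m):
    # A (non-minimal) tau-cover of the Euclidean ball of the given radius in R^m:
    # cubic grid with spacing 2 * tau / sqrt(m); grid points slightly outside the
    # ball are projected back onto it, which does not increase distances to points
    # inside the ball, so the result is a tau-cover that is a subset of the ball.
    step = 2.0 * tau / np.sqrt(m)
    axis = np.arange(-radius, radius + step, step)
    cover = []
    for pt in itertools.product(axis, repeat=m):
        pt = np.array(pt)
        r = np.linalg.norm(pt)
        if r <= radius + tau:
            cover.append(pt if r <= radius else pt * (radius / r))
    return np.array(cover)

m, U_D, tau = 2, 1.0, 0.25
cover = ball_cover(U_D, tau, m)
print(len(cover), (1.0 + 2.0 * U_D / tau) ** m)     # cover size vs. (1 + 2 U_D / tau)^m

# sanity check: every point of a random sample from the ball is within tau of the cover
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, m))
pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
pts = pts * rng.uniform(size=(1000, 1)) ** (1.0 / m) * U_D
dists = np.min(np.linalg.norm(pts[:, None, :] - cover[None, :, :], axis=2), axis=1)
print(bool(dists.max() <= tau + 1e-12))
```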

Lemma 10 (Theorem 2 of Fournier and Guillin (2015)).

Let p\geqslant 1 and \eta\in(0,1). For \widetilde{F}^{*}\in\mathcal{M}_{p}(\mathbb{R}), suppose that A_{0}:=\mathbb{E}^{\widetilde{F}^{*}}[\exp(|Z|^{a})]<\infty for some a>p, and let \widetilde{F}_{N} be the empirical distribution based on an independent sample of size N drawn from \widetilde{F}^{*}. Then, we have

(Wp(F~N,F~)εp,N(η))1η,\displaystyle\mathbb{P}\left(W_{p}(\widetilde{F}_{N},\widetilde{F}^{*})\leqslant\varepsilon_{p,N}(\eta)\right)\geqslant 1-\eta,

where

εp,N(η)={(log(c1/η)c2N)1/(2p)ifNlog(c1/η)c2,(log(c1/η)c2N)1/aifN<log(c1/η)c2,\displaystyle\varepsilon_{p,N}(\eta)=\begin{cases}\left(\frac{\log(c_{1}/\eta)}{c_{2}N}\right)^{1/(2p)}~{}~{}&\text{if}~{}N\geqslant\frac{\log(c_{1}/\eta)}{c_{2}},\\ \left(\frac{\log(c_{1}/\eta)}{c_{2}N}\right)^{1/a}~{}~{}&\text{if}~{}N<\frac{\log(c_{1}/\eta)}{c_{2}},\end{cases} (80)

and c1,c2c_{1},c_{2} are positive constants that only depend on aa, A0A_{0}, and pp.
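
For concreteness, the radius \varepsilon_{p,N}(\eta) in (80) can be evaluated as in the following sketch (ours; the constants c_{1},c_{2} are not made explicit by Lemma 10, so the values used below are placeholders chosen only to visualize the N^{-1/(2p)} decay in the large-sample regime).

```python
import numpy as np

def eps_pN(eta, N, p, a, c1, c2):
    # the radius from (80); c1, c2 are the unspecified constants of Lemma 10
    r = np.log(c1 / eta) / c2
    return (r / N) ** (1.0 / (2.0 * p)) if N >= r else (r / N) ** (1.0 / a)

# placeholder constants c1 = 2, c2 = 1, with p = 1 and a = 2 (so that a > p)
for N in (10, 100, 1_000, 10_000, 100_000):
    print(N, eps_pN(eta=0.05, N=N, p=1, a=2, c1=2.0, c2=1.0))
```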

With these tools in place, we now proceed to the proofs for the generalization results in Section 6.

C.1 Proof of Theorem 6

We proceed in three steps. First, we verify the following set inclusion result:

¯p|f𝜷(F0,ε)p(FY0f𝜷(𝐗0),g(ε)),\displaystyle\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon)\supseteq\mathcal{B}_{p}(F_{Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)), (81)

under Assumption 3, where (Y0,𝐗0)F0(Y_{0},\mathbf{X}_{0})\sim F_{0} and ¯p|f𝜷\overline{\mathcal{B}}_{p|f_{\bm{\beta}}} is the projection set defined by (23) in Section 6:

¯p|f𝜷(F0,ε)={FYf𝜷(𝐗):F(Y,𝐗)¯p(F0,ε)}.\displaystyle\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon)=\{F_{Y\cdot f_{\bm{\beta}}(\mathbf{X})}:F_{(Y,\mathbf{X})}\in\overline{\mathcal{B}}_{p}(F_{0},\varepsilon)\}. (82)

Second, using the set inclusion result (81) and the concentration result for the one-dimensional Wasserstein ball (Lemma 10), we establish a confidence bound for a fixed 𝜷\bm{\beta}. Finally, we apply covering number techniques to derive union bounds from the confidence bounds established in the second step, overcoming the challenge posed by the non-finiteness of the set 𝒟{\cal D}.

Step 1: Proving the set inclusion result (81). We first give a lemma which will be used in the proof of the set inclusion result (81).

Lemma 11.

Let p1p\geqslant 1 and g:++g:\mathbb{R}_{+}\to\mathbb{R}_{+} be an increasing function with g(0)=0g(0)=0. If gg is convex (resp. concave), then the function x(g(x1/p))px\mapsto(g(x^{1/p}))^{p} is convex (resp. concave) on +\mathbb{R}_{+}.

Proof.

Proof. We provide the proof of convexity only for the case where gg is twice differentiable, as the concavity case is analogous and the general case can be established by approximating gg with a twice differentiable convex function. Denote by h(x)=(g(x1/p))ph(x)=(g(x^{1/p}))^{p}, x+x\in\mathbb{R}_{+}. By standard calculations, we have

h′′(x)\displaystyle h^{\prime\prime}(x) =p1pg(y)(g(y))p2y12p[yg(y)g(y)]+1pg′′(y)(g(y))p1y22p\displaystyle=\frac{p-1}{p}g^{\prime}(y)(g(y))^{p-2}y^{1-2p}[yg^{\prime}(y)-g(y)]+\frac{1}{p}g^{\prime\prime}(y)(g(y))^{p-1}y^{2-2p}
p1pg(y)(g(y))p2y12p[yg(y)g(y)]\displaystyle\geqslant\frac{p-1}{p}g^{\prime}(y)(g(y))^{p-2}y^{1-2p}[yg^{\prime}(y)-g(y)]
=sgnyg(y)g(y),\displaystyle\stackrel{{\scriptstyle\rm sgn}}{{=}}yg^{\prime}(y)-g(y),

where y=x^{1/p}\in\mathbb{R}_{+}, the inequality follows from g^{\prime\prime}\geqslant 0 and g\geqslant 0, A\stackrel{\rm sgn}{=}B means that A and B have the same sign, and the last relation holds because (p-1)g^{\prime}(y)(g(y))^{p-2}y^{1-2p}\geqslant 0. By g(0)=0 and the convexity of g, we have g(y)/y=(g(y)-g(0))/y\leqslant g^{\prime}(y), that is, yg^{\prime}(y)-g(y)\geqslant 0. It then follows that h^{\prime\prime}\geqslant 0, that is, h is convex. This completes the proof. ∎
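
Lemma 11 can also be checked numerically. The following sketch (ours, using discrete second differences on a uniform grid, with the illustrative choices g(t)=e^{t}-1 for the convex case and g(t)=\sqrt{t} for the concave case, both increasing with g(0)=0) verifies the claimed convexity and concavity of x\mapsto(g(x^{1/p}))^{p}.

```python
import numpy as np

def second_diff(f, x):
    # discrete second differences on a uniform grid; they are >= 0 for a convex f
    # and <= 0 for a concave f
    y = f(x)
    return y[2:] - 2.0 * y[1:-1] + y[:-2]

p = 3.0
g_cvx = lambda t: np.expm1(t)      # convex, increasing, g(0) = 0
g_ccv = lambda t: np.sqrt(t)       # concave, increasing, g(0) = 0

x = np.linspace(1e-6, 5.0, 2001)   # uniform grid on (0, 5]
print(bool(np.all(second_diff(lambda s: g_cvx(s ** (1.0 / p)) ** p, x) >= -1e-9)))   # convex
print(bool(np.all(second_diff(lambda s: g_ccv(s ** (1.0 / p)) ** p, x) <= 1e-9)))    # concave
```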

Now we are ready to show (81) under Assumption 3.

Lemma 12.

For p\in[1,\infty], \varepsilon\geqslant 0, F_{0}\in\mathcal{M}_{p}(\Xi), and (Y_{0},\mathbf{X}_{0})\sim F_{0}, if Assumption 3 holds, then (81) holds for every \bm{\beta}\in\mathcal{D}.

Proof.

Proof. For 𝜷𝒟\bm{\beta}\in\mathcal{D}, let Fp(FY0f𝜷(𝐗0),g(ε))F\in\mathcal{B}_{p}(F_{Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)). There exists ZZ such that ZFZ\sim F and 𝔼[|ZY0f𝜷(𝐗0)|p](g(ε))p\mathbb{E}[|Z-Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})|^{p}]\leqslant(g(\varepsilon))^{p}. Our aim is to construct a random vector (Y,𝐗)(Y,\mathbf{X}) such that F(Y,𝐗)¯p(F0,ε)F_{(Y,\mathbf{X})}\in\overline{\mathcal{B}}_{p}(F_{0},\varepsilon) and Yf𝜷(𝐗)FY\cdot f_{\bm{\beta}}(\mathbf{X})\sim F as this implies that F¯p|f𝜷(F0,ε)F\in\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon). Denote by T=Z/Y0f𝜷(𝐗0)T=Z/Y_{0}-f_{\bm{\beta}}(\mathbf{X}_{0}) and we have

𝔼[|T|p]=𝔼[|ZY0f𝜷(𝐗0)|p]=𝔼[|ZY0f𝜷(𝐗0)|p](g(ε))p.\displaystyle\mathbb{E}[|T|^{p}]=\mathbb{E}\left[\left|\frac{Z}{Y_{0}}-f_{\bm{\beta}}(\mathbf{X}_{0})\right|^{p}\right]=\mathbb{E}[|Z-Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})|^{p}]\leqslant(g(\varepsilon))^{p}. (83)

We first assert that there exist measurable mappings 𝐒1\mathbf{S}_{1} and 𝐒2\mathbf{S}_{2} such that

𝐒1(ω)argmax𝐲g1(|T(ω)|)f𝜷(𝐗0(ω)+𝐲)and𝐒2(ω)argmin𝐲g1(|T(ω)|)f𝜷(𝐗0(ω)+𝐲),ωΩ.\displaystyle\mathbf{S}_{1}(\omega)\in\operatorname*{arg\,max}_{\|\mathbf{y}\|\leqslant g^{-1}(|T(\omega)|)}f_{\bm{\beta}}({\bf X}_{0}(\omega)+\mathbf{y})~{}~{}{\rm and}~{}~{}\mathbf{S}_{2}(\omega)\in\operatorname*{arg\,min}_{\|\mathbf{y}\|\leqslant g^{-1}(|T(\omega)|)}f_{\bm{\beta}}({\bf X}_{0}(\omega)+\mathbf{y}),~{}~{}\omega\in\Omega.

To see this, we apply a measurable selection theorem (Theorem 3.5 in Rieder (1978)) to show the existence of \mathbf{S}_{1}; the existence of \mathbf{S}_{2} follows similarly. Define

v(ω)=sup𝐲D(ω)u(ω,𝐲),ωΩ,\displaystyle v(\omega)=\sup_{\mathbf{y}\in D(\omega)}u(\omega,\mathbf{y}),~{}~{}\omega\in\Omega,

where D(ω)={𝐲n:𝐲g1(|T(ω)|)}D(\omega)=\{\mathbf{y}\in\mathbb{R}^{n}:\|\mathbf{y}\|\leqslant g^{-1}(|T(\omega)|)\} and u(ω,𝐲)=f𝜷(𝐗0(ω)+𝐲)u(\omega,\mathbf{y})=f_{\bm{\beta}}(\mathbf{X}_{0}(\omega)+\mathbf{y}). Further define

E={(ω,𝐲)Ω×n:𝐲g1(|T(ω)|)}.\displaystyle E=\{(\omega,\mathbf{y})\in\Omega\times\mathbb{R}^{n}:\|\mathbf{y}\|\leqslant g^{-1}(|T(\omega)|)\}.

Let P1(E){\rm P_{1}}(E) be the projection set of EE on the first argument. The first three conditions in Theorem 3.5 of Rieder (1978) are trivial in our setting. To see the last condition, for cc\in\mathbb{R} and ωP1(E)\omega\in{\rm P_{1}}(E), we have

{𝐲D(ω):u(ω,𝐲)c}=D(ω){𝐲n:f𝜷(𝐗0(ω)+𝐲)c}.\displaystyle\{\mathbf{y}\in D(\omega):u(\omega,\mathbf{y})\geqslant c\}=D(\omega)\cap\{\mathbf{y}\in\mathbb{R}^{n}:f_{\bm{\beta}}(\mathbf{X}_{0}(\omega)+\mathbf{y})\geqslant c\}.

Note that D(ω)D(\omega) is a compact set and the continuity of f𝜷f_{\bm{\beta}} implies that {𝐲n:f𝜷(𝐗0(ω)+𝐲)c}\{\mathbf{y}\in\mathbb{R}^{n}:f_{\bm{\beta}}(\mathbf{X}_{0}(\omega)+\mathbf{y})\geqslant c\} is a closed set. Hence, we conclude that {𝐲D(ω):u(ω,𝐲)c}\{\mathbf{y}\in D(\omega):u(\omega,\mathbf{y})\geqslant c\} is a compact set, which verifies the last condition in Theorem 3.5 of Rieder (1978). This implies the existence of a measurable maximizer 𝐒1{\bf S}_{1}. Denote by A+={ω:T(ω)0}A_{+}=\{\omega:T(\omega)\geqslant 0\} and A={ω:T(ω)<0}A_{-}=\{\omega:T(\omega)<0\}. We define 𝐗\mathbf{X}^{\prime} as

𝐗(ω)=𝐗0(ω)+𝐒1(ω),ωA+and𝐗(ω)=𝐗0(ω)+𝐒2(ω),ωA.\displaystyle\mathbf{X}^{\prime}(\omega)=\mathbf{X}_{0}(\omega)+\mathbf{S}_{1}(\omega),~{}~{}\omega\in A_{+}~{}~{}{\rm and}~{}~{}\mathbf{X}^{\prime}(\omega)=\mathbf{X}_{0}(\omega)+\mathbf{S}_{2}(\omega),~{}~{}\omega\in A_{-}.

Since \mathbf{S}_{1} and \mathbf{S}_{2} are measurable, \mathbf{X}^{\prime} is measurable. Note that \Omega=A_{+}\cup A_{-}, and hence,

𝐗(ω)𝐗0(ω)max{𝐒1(ω),𝐒2(ω)}g1(|T(ω)|),ωΩ.\displaystyle\|\mathbf{X}^{\prime}(\omega)-\mathbf{X}_{0}(\omega)\|\leqslant\max\{\|\mathbf{S}_{1}(\omega)\|,\|\mathbf{S}_{2}(\omega)\|\}\leqslant g^{-1}(|T(\omega)|),~{}~{}\omega\in\Omega. (84)

By Assumption 3, we have

f𝜷(𝐗(ω))f𝜷(𝐗0(ω))g(g1(|T(ω)|))=T(ω),ωA+\displaystyle f_{\bm{\beta}}(\mathbf{X}^{\prime}(\omega))-f_{\bm{\beta}}(\mathbf{X}_{0}(\omega))\geqslant g(g^{-1}(|T(\omega)|))=T(\omega),~{}~{}\omega\in A_{+}

and

f𝜷(𝐗0(ω))f𝜷(𝐗(ω))g(g1(|T(ω)|))=T(ω),ωA.\displaystyle f_{\bm{\beta}}(\mathbf{X}_{0}(\omega))-f_{\bm{\beta}}(\mathbf{X}^{\prime}(\omega))\geqslant g(g^{-1}(|T(\omega)|))=-T(\omega),~{}~{}\omega\in A_{-}.

Note that f_{\bm{\beta}} is continuous; hence, by the intermediate value theorem applied for each \omega, there exists a random variable R taking values in [0,1] such that

f𝜷(𝐗0+R(𝐗𝐗0))f𝜷(𝐗0)=T.\displaystyle f_{\bm{\beta}}(\mathbf{X}_{0}+R(\mathbf{X}^{\prime}-\mathbf{X}_{0}))-f_{\bm{\beta}}(\mathbf{X}_{0})=T. (85)

Define 𝐗~=𝐗0+R(𝐗𝐗0)\widetilde{\mathbf{X}}=\mathbf{X}_{0}+R(\mathbf{X}^{\prime}-\mathbf{X}_{0}). It follows from (85) that

Y0f𝜷(𝐗~)=Y0f𝜷(𝐗0)+Y0T=ZF.\displaystyle Y_{0}\cdot f_{\bm{\beta}}(\widetilde{\mathbf{X}})=Y_{0}\cdot f_{\bm{\beta}}(\mathbf{X}_{0})+Y_{0}T=Z\sim F. (86)

Moreover, we have

𝔼[𝐗~𝐗0p]=𝔼[R(𝐗𝐗0)p]𝔼[(g1(|T|))p](g1((𝔼[|T|p])1/p))pεp,\displaystyle\mathbb{E}[\|\widetilde{\mathbf{X}}-\mathbf{X}_{0}\|^{p}]=\mathbb{E}[\|R(\mathbf{X}^{\prime}-\mathbf{X}_{0})\|^{p}]\leqslant\mathbb{E}\left[(g^{-1}(|T|))^{p}\right]\leqslant\left(g^{-1}\big{(}(\mathbb{E}[|T|^{p}])^{1/p}\big{)}\right)^{p}\leqslant\varepsilon^{p},

where the first inequality is due to (84); the second inequality follows from Jensen’s inequality and the concavity of the function x\mapsto(g^{-1}(x^{1/p}))^{p} by Lemma 11; and the last inequality follows from (83). Hence, F_{(Y_{0},\widetilde{\mathbf{X}})}\in\overline{\mathcal{B}}_{p}(F_{0},\varepsilon). Combining this with (86) yields F\in\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon). This completes the proof. ∎

Step 2: Establishing confidence bounds for fixed β\bm{\beta}. Based on Lemma 12 and applying Lemma 10, we are able to derive the following confidence bounds under Assumptions 2 and 3.

Lemma 13.

Let p1p\geqslant 1 and η(0,1)\eta\in(0,1). Suppose that Assumptions 2 and 3 hold. Then for each 𝛃𝒟\bm{\beta}\in\mathcal{D}, we have

(ρF(Yf𝜷(𝐗))supF¯p(F^N,g1(εp,N(η)))ρF(Yf𝜷(𝐗)))1η,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},g^{-1}(\varepsilon_{p,N}(\eta)))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\right)\geqslant 1-\eta,

where εp,N(η)\varepsilon_{p,N}(\eta) is defined by (80) with constants c1,c2c_{1},c_{2} depending only on aa, AA and pp.

Proof.

Proof. Denote by F^𝜷,N=1Ni=1Nδy^if𝜷(𝐱^i)\widehat{F}_{\bm{\beta},N}=\frac{1}{N}\sum_{i=1}^{N}\delta_{\widehat{y}_{i}\cdot f_{\bm{\beta}}(\widehat{\mathbf{x}}_{i})} and F𝜷=FYf𝜷(𝐗)F_{\bm{\beta}}^{*}=F_{Y^{*}\cdot f_{\bm{\beta}}(\mathbf{X}^{*})}, where (Y,𝐗)F(Y^{*},\mathbf{X}^{*})\sim F^{*}. It holds that F^𝜷,N\widehat{F}_{\bm{\beta},N} is an empirical distribution of F𝜷F_{\bm{\beta}}^{*}. Note that

(ρF(Yf𝜷(𝐗))supF¯p(F^N,g1(εp,N(η)))ρF(Yf𝜷(𝐗)))\displaystyle\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},g^{-1}(\varepsilon_{p,N}(\eta)))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\right)
=(ρF𝜷(Z)supF¯p|f𝜷(F^N,g1(εp,N(η)))ρF(Z))\displaystyle=\mathbb{P}\left(\rho^{F^{*}_{\bm{\beta}}}(Z)\leqslant\sup_{F\in\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}\left(\widehat{F}_{N},g^{-1}(\varepsilon_{p,N}(\eta))\right)}\rho^{F}(Z)\right)
(F𝜷¯p|f𝜷(F^N,g1(εp,N(η))))\displaystyle\geqslant\mathbb{P}\left(F^{*}_{\bm{\beta}}\in\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}\left(\widehat{F}_{N},g^{-1}(\varepsilon_{p,N}(\eta))\right)\right)
(F𝜷p(F^𝜷,N,εp,N(η)))1η,\displaystyle\geqslant\mathbb{P}\left(F_{\bm{\beta}}^{*}\in\mathcal{B}_{p}(\widehat{F}_{\bm{\beta},N},\varepsilon_{p,N}(\eta))\right)\geqslant 1-\eta,

where ¯p|f𝜷(F0,ε)\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(F_{0},\varepsilon) is defined by (82), the second inequality follows from Lemma 12, which states that p(F^𝜷,N,ε)¯p|f𝜷(F^N,g1(ε))\mathcal{B}_{p}(\widehat{F}_{\bm{\beta},N},\varepsilon)\subseteq\overline{\mathcal{B}}_{p|f_{\bm{\beta}}}(\widehat{F}_{N},g^{-1}(\varepsilon)), and the last inequality is due to Lemma 10 by noting that Assumption 2 implies 𝔼F𝜷[exp(|Z|a)]=𝔼F[exp(|f𝜷(𝐗)|a)]A<\mathbb{E}^{F_{\bm{\beta}}^{*}}[\exp(|Z|^{a})]=\mathbb{E}^{F^{*}}[\exp(|f_{\bm{\beta}}(\mathbf{X})|^{a})]\leqslant A<\infty. Hence, we complete the proof. ∎

Step 3: Union bounds for β𝒟\bm{\beta}\in\mathcal{D}. Now we are ready to prove union bounds in Theorem 6.

Proof of Theorem 6. Note that for any (Y,𝐗){1,1}×n(Y,\mathbf{X})\in\{-1,1\}\times\mathbb{R}^{n} and 𝜷,𝜷~𝒟\bm{\beta},\widetilde{\bm{\beta}}\in\mathcal{D}, we have

|ρF(Yf𝜷(𝐗))ρF(Yf𝜷~(𝐗))|\displaystyle\left|\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))-\rho^{F}(Y\cdot f_{\widetilde{\bm{\beta}}}(\mathbf{X}))\right| M(𝔼F[|f𝜷(𝐗)f𝜷~(𝐗)|k])1/k\displaystyle\leqslant M\left(\mathbb{E}^{F}[|f_{\bm{\beta}}(\mathbf{X})-f_{\widetilde{\bm{\beta}}}(\mathbf{X})|^{k}]\right)^{1/k}
M𝜷𝜷~𝒟(𝔼F[(a1𝐗r+a2)k])1/k\displaystyle\leqslant M\|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}}(\mathbb{E}^{F}[(a_{1}\|\mathbf{X}\|^{r}+a_{2})^{k}])^{1/k}
\displaystyle\leqslant M\|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}}\left(a_{1}\left(\mathbb{E}^{F}[\|\mathbf{X}\|^{rk}]\right)^{1/k}+a_{2}\right), (87)

where the three inequalities follow from Assumption 5, Assumption 4, and the triangle inequality for the L^{k}-norm, respectively. Then, it holds that

|supF¯p(F^N,ε)ρF(Yf𝜷(𝐗))supF¯p(F^N,ε)ρF(Yf𝜷~(𝐗))|\displaystyle\left|\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))-\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon)}\rho^{F}(Y\cdot f_{\widetilde{\bm{\beta}}}(\mathbf{X}))\right|
supF¯p(F^N,ε)|ρF(Yf𝜷(𝐗))ρF(Yf𝜷~(𝐗))|\displaystyle\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon)}\left|\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))-\rho^{F}(Y\cdot f_{\widetilde{\bm{\beta}}}(\mathbf{X}))\right|
M𝜷𝜷~𝒟supF¯p(F^N,ε)(a1(𝔼F[𝐗rk])1/k+a2)\displaystyle\leqslant M\|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}}\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon)}\left(a_{1}\left(\mathbb{E}^{F}[\|\mathbf{X}\|^{rk}]\right)^{1/k}+a_{2}\right)
M𝜷𝜷~𝒟(2r1a1(𝔼F^N[𝐗rk])1/k+2r1a1εr+a2),\displaystyle\leqslant M\|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}}\left(2^{r-1}a_{1}\left(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}+2^{r-1}a_{1}\varepsilon^{r}+a_{2}\right), (88)

where the second inequality follows from (87), and the last inequality holds because (\mathbb{E}^{F}[\|\mathbf{X}\|^{rk}])^{1/(rk)}\leqslant(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}])^{1/(rk)}+\varepsilon for any F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon) with rk\in[1,p], which, together with the elementary inequality (x+y)^{r}\leqslant 2^{r-1}(x^{r}+y^{r}) for x,y\geqslant 0 and r\geqslant 1, further implies

(𝔼F[𝐗rk])1/k((𝔼F^N[𝐗rk])1/(rk)+ε)r2r1((𝔼F^N[𝐗rk])1/k+εr).(\mathbb{E}^{F}[\|\mathbf{X}\|^{rk}])^{1/k}\leqslant((\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}])^{1/(rk)}+\varepsilon)^{r}\leqslant 2^{r-1}((\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}])^{1/k}+\varepsilon^{r}).

For 0<xy0<x\leqslant y and t0t\geqslant 0, it holds that yx>t1/ky-x>t^{1/k} implies ykxk>ty^{k}-x^{k}>t. Hence, we have

((𝔼F^N[𝐗rk])1/k(𝔼F[𝐗rk])1/k>(𝕍arF(𝐗rk))1/(2k))\displaystyle\mathbb{P}\left(\left(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}-\left(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}>\left({\rm\mathbb{V}ar}^{F^{*}}(\|\mathbf{X}\|^{rk})\right)^{1/(2k)}\right)
(𝔼F^N[𝐗rk]𝔼F[𝐗rk]>𝕍arF(𝐗rk))1N,\displaystyle\leqslant\mathbb{P}\left(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}]-\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}]>\sqrt{{\rm\mathbb{V}ar}^{F^{*}}(\|\mathbf{X}\|^{rk})}\right)\leqslant\frac{1}{N}, (89)

where the second inequality follows from Chebyshev’s inequality. Let \tau=1/N and let \mathcal{D}_{\tau} be a \tau-cover of \mathcal{D} with respect to the norm \|\cdot\|_{\mathcal{D}}. Set t=\mathcal{N}(\tau;\mathcal{D},\|\cdot\|_{\mathcal{D}}) and \varepsilon_{N}(\eta/t):=g^{-1}(\varepsilon_{p,N}(\eta/t)), where s\mapsto\varepsilon_{p,N}(s) is defined by (80) in Lemma 10. It holds that

1(ρF(Yf𝜷(𝐗))supF¯p(F^N,εN(η/t))ρF(Yf𝜷(𝐗))+NττN,𝜷𝒟)\displaystyle 1-\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))+N\tau\tau_{N},~{}~{}\forall~{}\bm{\beta}\in\mathcal{D}\right)
=(𝜷𝒟s.t.ρF(Yf𝜷(𝐗))>supF¯p(F^N,εN(η/t))ρF(Yf𝜷(𝐗))\displaystyle=\mathbb{P}\left(\exists\bm{\beta}\in\mathcal{D}~{}{\rm s.t.}~{}\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))>\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\right.
+τM[(2r1+1)a1(𝔼F[𝐗rk])1/k+2r1a1(𝕍arF(𝐗rk))1/(2k)+2r1a1εNr(η/t)+2a2])\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\left.+\tau M\left[(2^{r-1}+1)a_{1}\left(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}\!\!\!\!\!+2^{r-1}a_{1}\left({\rm\mathbb{V}ar}^{F^{*}}(\|\mathbf{X}\|^{rk})\right)^{1/(2k)}\!\!\!\!\!+2^{r-1}a_{1}\varepsilon_{N}^{r}(\eta/t)+2a_{2}\right]\right)
1N+(𝜷𝒟s.t.ρF(Yf𝜷(𝐗))>supF¯p(F^N,εN(η/t))ρF(Yf𝜷(𝐗))\displaystyle\leqslant\frac{1}{N}+\mathbb{P}\left(\exists\bm{\beta}\in\mathcal{D}~{}{\rm s.t.}~{}\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))>\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\right.
+τM[a1(𝔼F[𝐗rk])1/k+2r1a1(𝔼F^N[𝐗rk])1/k+2r1a1εNr(η/t)+2a2])\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\left.+\tau M\left[a_{1}\left(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}\!\!\!+2^{r-1}a_{1}\left(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}\!\!\!+2^{r-1}a_{1}\varepsilon_{N}^{r}(\eta/t)+2a_{2}\right]\right)
=1N+(𝜷𝒟s.t.ρF(Yf𝜷(𝐗))τM[a1(𝔼F[𝐗rk])1/k+a2]>supF¯p(F^N,εN(η/t))ρF(Yf𝜷(𝐗))\displaystyle=\frac{1}{N}+\mathbb{P}\left(\exists\bm{\beta}\in\mathcal{D}~{}{\rm s.t.}~{}\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))-\tau M\left[a_{1}\left(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}\!\!\!\!+a_{2}\right]>\!\!\!\!\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\!\!\!\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\right.
+τM[2r1a1(𝔼F^N[𝐗rk])1/k+2r1a1εNr(η/t)+a2])\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\left.+\tau M\left[2^{r-1}a_{1}\left(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}\!\!\!+2^{r-1}a_{1}\varepsilon_{N}^{r}(\eta/t)+a_{2}\right]\right) (90)
1N+(𝜷~𝒟τs.t.ρF(Yf𝜷~(𝐗))>supF¯p(F^N,εN(η/t))ρF(Yf𝜷~(𝐗)))\displaystyle\leqslant\frac{1}{N}+\mathbb{P}\left(\exists{\widetilde{\bm{\beta}}}\in\mathcal{D}_{\tau}~{}{\rm s.t.}~{}\rho^{F^{*}}(Y\cdot f_{{\widetilde{\bm{\beta}}}}(\mathbf{X}))>\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot f_{\mathbf{\widetilde{\bm{\beta}}}}(\mathbf{X}))\right) (91)
1N+𝜷~𝒟τ(ρF(Yf𝜷~(𝐗))>supF¯p(F^N,εN(η/t))ρF(Yf𝜷~(𝐗)))\displaystyle\leqslant\frac{1}{N}+\sum_{{\widetilde{\bm{\beta}}}\in\mathcal{D}_{\tau}}\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{{\widetilde{\bm{\beta}}}}(\mathbf{X}))>\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot{f_{\widetilde{\bm{\beta}}}}(\mathbf{X}))\right)
1N+𝒩(τ;𝒟,𝒟)ηt=1N+η,\displaystyle\leqslant\frac{1}{N}+\frac{\mathcal{N}(\tau;\mathcal{D},\|\cdot\|_{\mathcal{D}})\eta}{t}=\frac{1}{N}+\eta,

where the first inequality follows from (89), the last inequality is due to Lemma 13, and the second inequality holds because if the event in (90) happens with some \bm{\beta}\in\mathcal{D}, then there exists \widetilde{\bm{\beta}}\in\mathcal{D}_{\tau} such that \|\bm{\beta}-\widetilde{\bm{\beta}}\|_{\mathcal{D}}\leqslant\tau and

ρF(Yf𝜷~(𝐗))\displaystyle\rho^{F^{*}}(Y\cdot f_{\widetilde{\bm{\beta}}}(\mathbf{X})) ρF(Yf𝜷(𝐗))τM[a1(𝔼F[𝐗rk])1/k+a2]\displaystyle\geqslant\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))-\tau M\left[a_{1}\left(\mathbb{E}^{F^{*}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}+a_{2}\right]
>supF¯p(F^N,εN(η/t))ρF(Yf𝜷(𝐗))+τM[2r1a1(𝔼F^N[𝐗rk])1/k+2r1a1εNr(η/t)+a2]\displaystyle>\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))+\tau M\left[2^{r-1}a_{1}\left(\mathbb{E}^{\widehat{F}_{N}}[\|\mathbf{X}\|^{rk}]\right)^{1/k}+2^{r-1}a_{1}\varepsilon_{N}^{r}(\eta/t)+a_{2}\right]
supF¯p(F^N,εN(η/t))ρF(Yf𝜷~(𝐗)),\displaystyle\geqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},\varepsilon_{N}(\eta/t))}\rho^{F}(Y\cdot f_{\mathbf{\widetilde{\bm{\beta}}}}(\mathbf{X})),

where the first and the last inequalities follow from (87) and (88), respectively, which implies that the event in (91) happens. Hence, we have

(ρF(Yf𝜷(𝐗))supF¯p(F^N,g1(εp,N(η/t)))ρF(Yf𝜷(𝐗))+τN,𝜷𝒟)1η1N.\displaystyle\mathbb{P}\left(\rho^{F^{*}}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in\overline{\cal B}_{p}(\widehat{F}_{N},g^{-1}(\varepsilon_{p,N}(\eta/t)))}\rho^{F}(Y\cdot f_{\bm{\beta}}(\mathbf{X}))+\tau_{N},~{}~{}\forall~{}\bm{\beta}\in\mathcal{D}\right)\geqslant 1-\eta-\frac{1}{N}. (92)

We recall that t=\mathcal{N}(\tau;\mathcal{D},\|\cdot\|_{\mathcal{D}}) with \tau=1/N. Note that \mathcal{D} is contained in U_{\mathcal{D}}B, where B is the unit ball of \mathbb{R}^{m}. Then we have (see Example 5.8 of Wainwright (2019))

t=𝒩(1N;𝒟,𝒟)(1+2U𝒟N)m=:t~.\displaystyle t=\mathcal{N}\left(\frac{1}{N};\mathcal{D},\|\cdot\|_{\mathcal{D}}\right)\leqslant\left(1+{2U_{\mathcal{D}}N}\right)^{m}=:\widetilde{t}.

Let εp,N\varepsilon_{p,N} be defined in Theorem 6. Noting that g1g^{-1} is increasing and sεp,N(s)s\mapsto\varepsilon_{p,N}(s) is decreasing, we obtain

g1(εp,N)=g1(εp,N(η/t~))g1(εp,N(η/t)).g^{-1}(\varepsilon_{p,N})=g^{-1}\left(\varepsilon_{p,N}\left(\eta/\widetilde{t}\right)\right)\geqslant g^{-1}(\varepsilon_{p,N}(\eta/t)).

Combining with (92), we complete the proof.

C.2 Proof of Theorem 7

The proof of Theorem 7 closely follows the three-step structure used for Theorem 6, with a key difference in the first step. Specifically, establishing the inclusion relation (81) becomes challenging because Assumption 6 is weaker than Assumption 3. However, as shown in the first step of the proof below, the “full” inclusion relation (81) is unnecessary; instead, only a partial inclusion relation is required. Specifically, the worst-case risk evaluated over the full Wasserstein ball coincides with that evaluated over a subset of the Wasserstein ball, and this subset can be shown to belong to the projection set. This enables the derivation of confidence bounds for fixed 𝜷\bm{\beta} in the second step, and subsequently, the generalization bounds using the same covering number arguments as in Theorem 6.

Step 1. Proving partial set inclusion and coincidence of worst-case risk. For H0p()H_{0}\in\mathcal{M}_{p}(\mathbb{R}) and ε0\varepsilon\geqslant 0, define a subset of a one-dimensional Wasserstein ball as

p+(H0,ε)={Fp(H0,ε):FstH0},\displaystyle\mathcal{B}_{p}^{+}(H_{0},\varepsilon)=\{F\in\mathcal{B}_{p}(H_{0},\varepsilon):F\succeq_{\rm st}H_{0}\}, (93)

where st\succeq_{\rm st} denotes the usual stochastic order (see Shaked and Shanthikumar (2007)), also known as the first-order stochastic dominance, and is defined as: for two distributions H1,H2H_{1},H_{2} on \mathbb{R}, H1stH2H_{1}\succeq_{\rm st}H_{2} if H1(t)H2(t)H_{1}(t)\leqslant H_{2}(t) for all tt\in\mathbb{R}. We aim to show the following set inclusion result:

p|f𝜷(G0,ε)p+(Ff𝜷(𝐗0),g(ε)),\displaystyle\mathcal{B}_{p|f_{\bm{\beta}}}(G_{0},\varepsilon)\supseteq\mathcal{B}_{p}^{+}(F_{f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)), (94)

where 𝐗0G0p(n)\mathbf{X}_{0}\sim G_{0}\in\mathcal{M}_{p}(\mathbb{R}^{n}), p+(Ff𝜷(𝐗0),g(ε))\mathcal{B}_{p}^{+}(F_{f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)) is defined by (93), and

p|f𝜷(G0,ε):={Ff𝜷(𝐗):F𝐗p(G0,ε)}.\displaystyle\mathcal{B}_{p|f_{\bm{\beta}}}(G_{0},\varepsilon):=\{F_{f_{\bm{\beta}}(\mathbf{X})}:F_{\mathbf{X}}\in{\mathcal{B}}_{p}(G_{0},\varepsilon)\}. (95)
Lemma 14.

For p\in[1,\infty], \varepsilon\geqslant 0, G_{0}\in\mathcal{M}_{p}({\mathbb{R}}^{n}), and \mathbf{X}_{0}\sim G_{0}, the set inclusion result (94) holds for each \bm{\beta}\in\mathcal{D} under Assumption 6.

Proof.

Proof. Let \bm{\beta}\in\mathcal{D}. For F\in\mathcal{B}_{p}^{+}(F_{f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)), we have F\in\mathcal{B}_{p}(F_{f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)) and F\succeq_{\rm st}F_{f_{\bm{\beta}}(\mathbf{X}_{0})}. Let U be a uniform random variable on [0,1] such that f_{\bm{\beta}}(\mathbf{X}_{0})=F_{f_{\bm{\beta}}(\mathbf{X}_{0})}^{-1}(U) almost surely (see Lemma A.28 of Föllmer and Schied (2016) for the existence of U). Let Z=F^{-1}(U)\sim F and T=Z-f_{\bm{\beta}}(\mathbf{X}_{0}). Since F\succeq_{\rm st}F_{f_{\bm{\beta}}(\mathbf{X}_{0})}, we have T\geqslant 0. Moreover, it holds that

𝔼[|T|p]=𝔼[|Zf𝜷(𝐗0)|p]=01|F1(s)Ff𝜷(𝐗0)1(s)|pds=Wp(F,Ff𝜷(𝐗0))pg(ε)p,\displaystyle\mathbb{E}[|T|^{p}]=\mathbb{E}[|Z-f_{\bm{\beta}}(\mathbf{X}_{0})|^{p}]=\int_{0}^{1}|F^{-1}(s)-F_{f_{\bm{\beta}}(\mathbf{X}_{0})}^{-1}(s)|^{p}\mathrm{d}s=W_{p}(F,F_{f_{\bm{\beta}}(\mathbf{X}_{0})})^{p}\leqslant g(\varepsilon)^{p}, (96)

where the inequality is due to F\in\mathcal{B}_{p}(F_{f_{\bm{\beta}}(\mathbf{X}_{0})},g(\varepsilon)). Using arguments similar to those in the proof of Lemma 12, which rely on a measurable selection theorem of Rieder (1978), we know that there exists a measurable mapping \mathbf{S} such that

𝐒(ω)argmax𝐲g1(T(ω))f𝜷(𝐗0(ω)+𝐲),ωΩ.\displaystyle\mathbf{S}(\omega)\in\operatorname*{arg\,max}_{\|\mathbf{y}\|\leqslant g^{-1}(T(\omega))}f_{\bm{\beta}}({\mathbf{X}}_{0}(\omega)+\mathbf{y}),~{}~{}\omega\in\Omega.

Using Assumption 6, we have

f𝜷(𝐗0(ω)+𝐒(ω))f𝜷(𝐗0(ω))g(g1(T(ω)))=T(ω),ωΩ.\displaystyle f_{\bm{\beta}}(\mathbf{X}_{0}(\omega)+\mathbf{S}(\omega))-f_{\bm{\beta}}({\mathbf{X}}_{0}(\omega))\geqslant g(g^{-1}(T(\omega)))=T(\omega),~{}~{}\omega\in\Omega.

Note that f_{\bm{\beta}} is continuous. Hence, there exists a random variable R taking values in [0,1] such that

f𝜷(𝐗0+R𝐒)f𝜷(𝐗0)=T.\displaystyle f_{\bm{\beta}}({\mathbf{X}}_{0}+R\mathbf{S})-f_{\bm{\beta}}({\mathbf{X}}_{0})=T. (97)

Define 𝐗~=𝐗0+R𝐒\widetilde{\mathbf{X}}={\mathbf{X}}_{0}+R\mathbf{S}. By (97), we have f𝜷(𝐗~)=f𝜷(𝐗0)+T=ZFf_{\bm{\beta}}(\widetilde{\mathbf{X}})=f_{\bm{\beta}}({\mathbf{X}}_{0})+T=Z\sim F. By 𝐒g1(T)\|\mathbf{S}\|\leqslant g^{-1}(T), we have

𝔼[𝐗~𝐗0p]𝔼[(g1(T))p](g1((𝔼[Tp])1/p))pεp,\displaystyle\mathbb{E}[\|\widetilde{\mathbf{X}}-{\mathbf{X}}_{0}\|^{p}]\leqslant\mathbb{E}\left[(g^{-1}(T))^{p}\right]\leqslant\left(g^{-1}((\mathbb{E}[T^{p}])^{1/p})\right)^{p}\leqslant\varepsilon^{p},

where the second inequality follows from Jensen’s inequality and the concavity of the mapping x(g1(x1/p))px\mapsto(g^{-1}(x^{1/p}))^{p} by Lemma 11, and the last inequality is due to (96). Hence, we have F𝐗~p(G0,ε)F_{\widetilde{\mathbf{X}}}\in\mathcal{B}_{p}(G_{0},\varepsilon), which implies F=FZ=Ff𝜷(𝐗~)p|f𝜷(G0,ε)F=F_{Z}=F_{f_{\bm{\beta}}(\widetilde{\mathbf{X}})}\in\mathcal{B}_{p|f_{\bm{\beta}}}(G_{0},\varepsilon). This completes the proof.∎

Below we demonstrate that the worst-case values of a monotone risk mapping are identical under p(H0,ε)\mathcal{B}_{p}(H_{0},\varepsilon) and p+(H0,ε)\mathcal{B}_{p}^{+}(H_{0},\varepsilon) for any H0p()H_{0}\in\mathcal{M}_{p}(\mathbb{R}) and ε0\varepsilon\geqslant 0.

Lemma 15.

Let p[1,]p\in[1,\infty] and suppose that ρ:Lp\rho:L^{p}\to\mathbb{R} satisfies monotonicity. For H0p()H_{0}\in\mathcal{M}_{p}(\mathbb{R}) and ε0\varepsilon\geqslant 0, we have

supFp+(H0,ε)ρF(Z)=supFp(H0,ε)ρF(Z).\displaystyle\sup_{F\in\mathcal{B}_{p}^{+}(H_{0},\varepsilon)}\rho^{F}(Z)=\sup_{F\in\mathcal{B}_{p}(H_{0},\varepsilon)}\rho^{F}(Z). (98)
Proof.

Proof. Note that p+(H0,ε)p(H0,ε)\mathcal{B}_{p}^{+}(H_{0},\varepsilon)\subseteq\mathcal{B}_{p}(H_{0},\varepsilon). It suffices to show

supFp+(H0,ε)ρF(Z)supFp(H0,ε)ρF(Z).\displaystyle\sup_{F\in\mathcal{B}_{p}^{+}(H_{0},\varepsilon)}\rho^{F}(Z)\geqslant\sup_{F\in\mathcal{B}_{p}(H_{0},\varepsilon)}\rho^{F}(Z). (99)

For any Fp(H0,ε)F\in\mathcal{B}_{p}(H_{0},\varepsilon), define H(x)=min{F(x),H0(x)}H(x)=\min\{F(x),H_{0}(x)\} for xx\in\mathbb{R}. It holds that H1(s)=max{F1(s),H01(s)}H^{-1}(s)=\max\{F^{-1}(s),H_{0}^{-1}(s)\} for s[0,1]s\in[0,1], and thus,

Wp(H,H0)p\displaystyle W_{p}(H,H_{0})^{p} =01|max{F1(s),H01(s)}H01(s)|pds\displaystyle=\int_{0}^{1}\left|\max\{F^{-1}(s),H_{0}^{-1}(s)\}-H_{0}^{-1}(s)\right|^{p}\mathrm{d}s
01|F1(s)H01(s)|pds=Wp(F,H0)pεp,\displaystyle\leqslant\int_{0}^{1}\left|F^{-1}(s)-H_{0}^{-1}(s)\right|^{p}\mathrm{d}s=W_{p}(F,H_{0})^{p}\leqslant\varepsilon^{p},

where the last inequality is due to F\in\mathcal{B}_{p}(H_{0},\varepsilon). This means H\in\mathcal{B}_{p}(H_{0},\varepsilon). This, together with H\leqslant H_{0}, that is, H\succeq_{\rm st}H_{0}, implies H\in\mathcal{B}_{p}^{+}(H_{0},\varepsilon). By the monotonicity of \rho, we have \rho^{H}(Z)\geqslant\rho^{F}(Z) as H\succeq_{\rm st}F. Therefore, (99) holds, and thus we complete the proof. ∎
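
The construction H(x)=\min\{F(x),H_{0}(x)\} is easy to visualize numerically (a sketch of ours, representing distributions on \mathbb{R} by quantile values on a uniform grid and taking, purely for illustration, a distortion functional with an increasing h as the monotone \rho, together with a specific H_{0} and F).

```python
import numpy as np

def w_p(q1, q2, p):
    # p-Wasserstein distance on the line from quantile values on a common uniform grid
    return float(np.mean(np.abs(q1 - q2) ** p)) ** (1.0 / p)

def rho_h(q, h):
    # distortion functional from quantile values on a uniform grid
    N = len(q)
    u = np.arange(N + 1) / N
    return float(np.sort(q) @ (h(u[1:]) - h(u[:-1])))

p, N = 2, 100_000
u = (np.arange(N) + 0.5) / N
qH0 = np.log(u / (1.0 - u))                   # reference H_0 (logistic quantile function)
qF = qH0 + 0.3 * np.sin(2.0 * np.pi * u)      # some F; still an increasing quantile function
qH = np.maximum(qF, qH0)                      # H^{-1}(s) = max{F^{-1}(s), H_0^{-1}(s)}

h = lambda v: v ** 2                          # increasing distortion => rho_h is monotone
print(w_p(qH, qH0, p) <= w_p(qF, qH0, p) + 1e-12)   # H stays in the same Wasserstein ball
print(bool(np.all(qH >= qH0)))                      # H >=_st H_0
print(rho_h(qH, h) >= rho_h(qF, h) - 1e-9)          # monotonicity: rho_h^H >= rho_h^F
```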

Step 2: Establishing confidence bounds for fixed β\bm{\beta}. Similar to Lemma 13 in Step 2 for Theorem 6, we can obtain the confidence bounds for fixed 𝜷\bm{\beta} as stated in the following lemma.

Lemma 16.

For p\geqslant 1 and \eta\in(0,1), suppose that Assumptions 2 and 6 hold. Then, for each \bm{\beta}\in\mathcal{D} and any monotone risk measure \rho, we have

(ρF(f𝜷(𝐗))supFp(F^𝐗,N,g1(εp,N(η)))ρF(f𝜷(𝐗)))1η,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{\mathbf{X},N},g^{-1}(\varepsilon_{p,N}(\eta)))}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))\right)\geqslant 1-\eta,

where εp,N(η)\varepsilon_{p,N}(\eta) is defined by (80) with constants c1,c2c_{1},c_{2} depending only on aa, AA and pp.

Proof.

Proof. Denote by F^𝜷,N=1Ni=1Nδf𝜷(𝐱^i)\widehat{F}_{\bm{\beta},N}=\frac{1}{N}\sum_{i=1}^{N}\delta_{f_{\bm{\beta}}(\widehat{\mathbf{x}}_{i})} and F𝜷=Ff𝜷(𝐗)F^{*}_{\bm{\beta}}=F_{f_{\bm{\beta}}(\mathbf{X}^{*})}, where (Y,𝐗)F(Y^{*},\mathbf{X}^{*})\sim F^{*}. Note that by Lemma 14, we have

p|f𝜷(F^𝐗,N,ε)p+(F^𝜷,N,g(ε)),\displaystyle\mathcal{B}_{p|f_{\bm{\beta}}}(\widehat{F}_{\mathbf{X},N},\varepsilon)\supseteq\mathcal{B}_{p}^{+}(\widehat{F}_{\bm{\beta},N},g(\varepsilon)),

where p|f𝜷(F^𝐗,N,ε)={Ff𝜷(𝐗):F𝐗p(F^𝐗,N,ε)}\mathcal{B}_{p|f_{\bm{\beta}}}(\widehat{F}_{\mathbf{X},N},\varepsilon)=\{F_{f_{\bm{\beta}}(\mathbf{X})}:F_{\mathbf{X}}\in\mathcal{B}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon)\} and p+(F^𝜷,N,g(ε))={Fp(F^𝜷,N,g(ε)):FstF^𝜷,N}.\mathcal{B}_{p}^{+}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))=\{F\in\mathcal{B}_{p}(\widehat{F}_{\bm{\beta},N},g(\varepsilon)):F\succeq_{\rm st}\widehat{F}_{\bm{\beta},N}\}. Therefore, we have

supFp|f𝜷(F^𝐗,N,ε)ρF(Z)supFp+(F^𝜷,N,g(ε))ρF(Z)=supFp(F^𝜷,N,g(ε))ρF(Z),\displaystyle\sup_{F\in\mathcal{B}_{p|f_{\bm{\beta}}}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(Z)\geqslant\sup_{F\in\mathcal{B}_{p}^{+}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))}\rho^{F}(Z)=\sup_{F\in\mathcal{B}_{p}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))}\rho^{F}(Z), (100)

where the equality follows from Lemma 15 as ρ\rho is monotone. It then follows that

(ρF(f𝜷(𝐗))supFp(F^𝐗,N,ε)ρF(f𝜷(𝐗)))\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))\right) =(ρF𝜷(Z)supFp|f𝜷(F^𝐗,N,ε)ρF(Z))\displaystyle=\mathbb{P}\left(\rho^{F_{\bm{\beta}}^{*}}(Z)\leqslant\sup_{F\in{\mathcal{B}}_{p|f_{\bm{\beta}}}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(Z)\right)
(ρF𝜷(Z)supFp(F^𝜷,N,g(ε))ρF(Z))\displaystyle\geqslant\mathbb{P}\left(\rho^{F_{\bm{\beta}}^{*}}(Z)\leqslant\sup_{F\in\mathcal{B}_{p}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))}\rho^{F}(Z)\right)
(F𝜷p(F^𝜷,N,g(ε))),\displaystyle\geqslant\mathbb{P}\left(F^{*}_{\bm{\beta}}\in{{\cal B}}_{p}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))\right),

where the first inequality follows from (100). Substituting ε:=εN=g1(εp,N(η))\varepsilon:=\varepsilon_{N}=g^{-1}(\varepsilon_{p,N}(\eta)) into the above inequalities, we have

\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon_{N})}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))\right) \displaystyle\geqslant\mathbb{P}\left(F^{*}_{\bm{\beta}}\in{{\cal B}}_{p}(\widehat{F}_{\bm{\beta},N},\varepsilon_{p,N}(\eta))\right)\geqslant 1-\eta,

where the last inequality follows from Lemma 10, similar to the proof of Lemma 13. This completes the proof. ∎

Step 3: Union bounds for \bm{\beta}\in{\cal D}. We are now ready to carry out the last step, the union bound over \bm{\beta}\in\mathcal{D}, and complete the proof of Theorem 7.

Proof of Theorem 7. Building on the confidence bound in Lemma 16, the union bound over \bm{\beta}\in\mathcal{D} follows in the same way as in the proof of Theorem 6, by applying the covering-number arguments. ∎

C.3 Proof of Theorem 8

Proof of Theorem 8. Note that for Z1,Z2L1,Z_{1},Z_{2}\in L^{1},

ρ(Z1)sup𝔼[|V|]𝔼[|Z1Z2|]ρ(Z2+V)sup|V|λ𝔼[|Z1Z2|]ρ(Z2+V)ρ(Z2)+Mλ𝔼[|Z1Z2|],\displaystyle\rho(Z_{1})\leqslant\sup_{\mathbb{E}[|V|]\leqslant\mathbb{E}[|Z_{1}-Z_{2}|]}\rho(Z_{2}+V)\leqslant\sup_{|V|\leqslant\lambda\mathbb{E}[|Z_{1}-Z_{2}|]}\rho(Z_{2}+V)\leqslant\rho(Z_{2})+M\lambda\mathbb{E}[|Z_{1}-Z_{2}|],

where the first inequality follows by taking V=Z_{1}-Z_{2}, the second inequality follows from Assumption 1, and the last inequality is due to the Lipschitz continuity of \rho with respect to the L^{\infty}-norm, which implies |\rho(Z_{2}+V)-\rho(Z_{2})|\leqslant M\,\mathrm{ess\mbox{-}sup}|V|. By symmetry, we conclude that

|ρ(Z1)ρ(Z2)|λM𝔼[|Z1Z2|],Z1,Z2L1.\displaystyle|\rho(Z_{1})-\rho(Z_{2})|\leqslant\lambda M\mathbb{E}[|Z_{1}-Z_{2}|],~{}~{}\forall Z_{1},Z_{2}\in L^{1}.

Hence, Assumption 5 holds with k=1k=1 and MM being replaced by λM\lambda M. Further, let ZNF^𝜷,NZ_{N}\sim\widehat{F}_{\bm{\beta},N}, and it holds for any ε0\varepsilon\geqslant 0 that

\displaystyle\sup_{F\in\mathcal{B}_{\infty}(\widehat{F}_{\bm{\beta},N},\varepsilon)}\rho^{F}(Z)=\sup_{|Y-Z_{N}|\leqslant\varepsilon}\rho(Y) \displaystyle=\sup_{|V|\leqslant\varepsilon}\rho(Z_{N}+V)
\displaystyle\geqslant\sup_{\mathbb{E}[|V|]\leqslant\varepsilon/\lambda}\rho(Z_{N}+V)
\displaystyle=\sup_{\mathbb{E}[|Y-Z_{N}|]\leqslant\varepsilon/\lambda}\rho(Y)=\sup_{F\in\mathcal{B}_{1}(\widehat{F}_{\bm{\beta},N},\varepsilon/\lambda)}\rho^{F}(Z), (101)

where the inequality is due to Assumption 1. Then, we have the following chain of inequalities:

(ρF(f𝜷(𝐗))supFp(F^𝐗,N,ε)ρF(f𝜷(𝐗)))\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))\right) (ρF(f𝜷(𝐗))supF(F^𝐗,N,ε)ρF(f𝜷(𝐗)))\displaystyle\geqslant\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{\infty}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))\right)
=(ρF𝜷(Z)supF|f𝜷(F^𝐗,N,ε)ρF(Z))\displaystyle=\mathbb{P}\left(\rho^{F_{\bm{\beta}}^{*}}(Z)\leqslant\sup_{F\in\mathcal{B}_{\infty|f_{\bm{\beta}}}(\widehat{F}_{\mathbf{X},N},\varepsilon)}\rho^{F}(Z)\right)
(ρF𝜷(Z)supF+(F^𝜷,N,g(ε))ρF(Z))\displaystyle\geqslant\mathbb{P}\left(\rho^{F_{\bm{\beta}}^{*}}(Z)\leqslant\sup_{F\in\mathcal{B}_{\infty}^{+}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))}\rho^{F}(Z)\right)
=(ρF𝜷(Z)supF(F^𝜷,N,g(ε))ρF(Z))\displaystyle=\mathbb{P}\left(\rho^{F_{\bm{\beta}}^{*}}(Z)\leqslant\sup_{F\in\mathcal{B}_{\infty}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))}\rho^{F}(Z)\right)
(ρF𝜷(Z)supF1(F^𝜷,N,g(ε)/λ)ρF(Z))\displaystyle\geqslant\mathbb{P}\left(\rho^{F_{\bm{\beta}}^{*}}(Z)\leqslant\sup_{F\in\mathcal{B}_{1}(\widehat{F}_{\bm{\beta},N},\,g(\varepsilon)/\lambda)}\rho^{F}(Z)\right)
(F𝜷1(F^𝜷,N,g(ε)/λ)),\displaystyle\geqslant\mathbb{P}\left(F^{*}_{\bm{\beta}}\in{{\cal B}}_{1}(\widehat{F}_{\bm{\beta},N},\,g(\varepsilon)/\lambda)\right), (102)

where the first inequality is due to {\mathcal{B}}_{\infty}(\widehat{F}_{\mathbf{X},N},\varepsilon)\subseteq{\mathcal{B}}_{p}(\widehat{F}_{\mathbf{X},N},\varepsilon), the second inequality follows from Lemma 14, which states that \mathcal{B}_{\infty}^{+}(\widehat{F}_{\bm{\beta},N},g(\varepsilon))\subseteq\mathcal{B}_{\infty|f_{\bm{\beta}}}(\widehat{F}_{\mathbf{X},N},\varepsilon), the second equality is due to Lemma 15, and the third inequality follows from (101). For \eta\in(0,1), combining (102) with Lemma 10 yields the following confidence bound:

(ρF(f𝜷(𝐗))supFp(F^𝐗,N,g1(λε1,N(η)))ρF(f𝜷(𝐗)))(F𝜷1(F^𝜷,N,ε1,N(η)))1η,\displaystyle\mathbb{P}\left(\rho^{F^{*}}(f_{\bm{\beta}}(\mathbf{X}))\leqslant\sup_{F\in{\cal B}_{p}(\widehat{F}_{\mathbf{X},N},g^{-1}(\lambda\varepsilon_{1,N}(\eta)))}\rho^{F}(f_{\bm{\beta}}(\mathbf{X}))\right)\geqslant\mathbb{P}\left(F^{*}_{\bm{\beta}}\in{{\cal B}}_{1}(\widehat{F}_{\bm{\beta},N},\varepsilon_{1,N}(\eta))\right)\geqslant 1-\eta,

where \varepsilon_{1,N}(\eta) is defined by (80). Note that Assumptions 2 and 3 hold, Assumption 4 holds with r=1, and Assumption 5 holds with k=1 and M being replaced by \lambda M. The rest of the proof is similar to that of Theorem 6, by applying the covering-number arguments. ∎
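A simple consequence of monotonicity, which gives intuition for the first two equalities in (101), is that the worst case over the W_{\infty} ball around \widehat{F}_{\bm{\beta},N} is attained by shifting every atom upward by \varepsilon, that is, \sup_{|V|\leqslant\varepsilon}\rho(Z_{N}+V)=\rho(Z_{N}+\varepsilon). The sketch below checks this numerically, taking \rho to be the empirical expected shortfall as one example of a monotone risk measure; the level \alpha and the radius \varepsilon are illustrative choices, not values from the paper.

```python
import numpy as np

def expected_shortfall(z, alpha=0.9):
    # Empirical expected shortfall of the losses z at level alpha:
    # the mean of the worst ceil((1 - alpha) * N) outcomes. This estimator is
    # monotone: z' >= z componentwise implies ES(z') >= ES(z).
    k = int(np.ceil((1 - alpha) * len(z)))
    return np.sort(z)[-k:].mean()

rng = np.random.default_rng(1)
z_n = rng.normal(size=500)   # atoms of the empirical distribution \hat F_{beta,N}
eps = 0.25                   # illustrative W_infty radius

worst = expected_shortfall(z_n + eps)   # uniform upward shift of every atom by eps
for _ in range(200):
    v = rng.uniform(-eps, eps, size=z_n.shape)   # arbitrary perturbation with |V| <= eps
    assert expected_shortfall(z_n + v) <= worst + 1e-12
print("worst case over the W_infty ball:", worst)
```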

C.4 A sufficient condition for Assumption 2

Below we present a sufficient condition for Assumption 2 that is more straightforward to verify.

Proposition 4.

Assumption 2 holds if there exist b1,b20b_{1},b_{2}\geqslant 0, s[1,p]s\in[1,p] and a>pa>p such that

|f𝜷(𝐱)|b1𝐱s+b2,𝐱n,𝜷𝒟,\displaystyle|f_{\bm{\beta}}(\mathbf{x})|\leqslant b_{1}\|\mathbf{x}\|^{s}+b_{2},~{}~{}\forall\mathbf{x}\in\mathbb{R}^{n},~{}\bm{\beta}\in\mathcal{D}, (103)

and

𝔼F[exp(2a1b1a𝐗as)]<.\displaystyle\mathbb{E}^{F^{*}}[\exp(2^{a-1}b_{1}^{a}\|\mathbf{X}\|^{as})]<\infty. (104)

Proof. Note that

\displaystyle\sup_{\bm{\beta}\in\mathcal{D}}\mathbb{E}^{F^{*}}[\exp(|f_{\bm{\beta}}(\mathbf{X})|^{a})] \displaystyle\leqslant\mathbb{E}^{F^{*}}[\exp((b_{1}\|\mathbf{X}\|^{s}+b_{2})^{a})]
\displaystyle\leqslant\mathbb{E}^{F^{*}}[\exp(2^{a-1}b_{1}^{a}\|\mathbf{X}\|^{as}+2^{a-1}b_{2}^{a})]
\displaystyle=\exp(2^{a-1}b_{2}^{a})\,\mathbb{E}^{F^{*}}[\exp(2^{a-1}b_{1}^{a}\|\mathbf{X}\|^{as})]<\infty,

where the first inequality follows from (103), the second inequality follows from the convexity of t\mapsto t^{a} on [0,\infty), which gives (u+v)^{a}\leqslant 2^{a-1}(u^{a}+v^{a}) for all u,v\geqslant 0, and the finiteness is due to (104). This confirms Assumption 2 and completes the proof. ∎
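The elementary inequality used in the second step above, namely (u+v)^{a}\leqslant 2^{a-1}(u^{a}+v^{a}) for u,v\geqslant 0 and a\geqslant 1, can be checked numerically as follows; the constants a, s, b_{1}, b_{2} below are illustrative placeholders rather than values prescribed by Proposition 4.

```python
import numpy as np

rng = np.random.default_rng(2)
a, s, b1, b2 = 3.0, 1.5, 0.7, 2.0          # illustrative constants with a >= 1, s >= 1
x_norms = rng.exponential(size=10_000)     # stand-ins for ||x||

lhs = (b1 * x_norms ** s + b2) ** a
# (u + v)^a <= 2^{a-1} (u^a + v^a) with u = b1 ||x||^s and v = b2,
# by convexity of t -> t^a on [0, infinity).
rhs = 2 ** (a - 1) * (b1 ** a * x_norms ** (a * s) + b2 ** a)
print(np.all(lhs <= rhs + 1e-9))
```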

Proposition 4 shows that Assumption 2 can be verified through a combination of a uniform growth condition (103) on \{f_{\bm{\beta}}\}_{\bm{\beta}\in\mathcal{D}} and a light-tail condition (104) on the data-generating distribution F^{*}. It is worth noting that (103) holds if f_{\bm{\beta}_{0}}(\mathbf{x})\equiv 0 for \bm{\beta}_{0}=\mathbf{0}\in\mathbb{R}^{m}, Assumption 4 holds on \mathcal{D}\cup\{\bm{\beta}_{0}\}, and U_{\mathcal{D}}:=\sup_{\bm{\beta}\in\mathcal{D}}\|\bm{\beta}\|_{\mathcal{D}}<\infty. To see this, note that by Assumption 4, we have

|f𝜷(𝐱)|(a1𝐱r+a2)𝜷𝒟a1U𝒟𝐱r+a2U𝒟,𝐱n,𝜷𝒟,\displaystyle|f_{\bm{\beta}}(\mathbf{x})|\leqslant(a_{1}\|\mathbf{x}\|^{r}+a_{2})\|\bm{\beta}\|_{\mathcal{D}}\leqslant a_{1}U_{\mathcal{D}}\|\mathbf{x}\|^{r}+a_{2}U_{\mathcal{D}},~{}~{}\forall\mathbf{x}\in\mathbb{R}^{n},~{}\bm{\beta}\in\mathcal{D},

and thus, (103) holds with b1=a1U𝒟b_{1}=a_{1}U_{\mathcal{D}}, b2=a2U𝒟b_{2}=a_{2}U_{\mathcal{D}} and s=rs=r.
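As a concrete instance, consider the affine decision rule f_{\bm{\beta}}(\mathbf{x})=\bm{\beta}^{\top}\mathbf{x} with the Euclidean norm playing the role of both \|\cdot\| and \|\cdot\|_{\mathcal{D}} (an illustrative choice rather than the general setting of Assumption 4). The Cauchy–Schwarz inequality then gives |f_{\bm{\beta}}(\mathbf{x})|\leqslant\|\bm{\beta}\|_{2}\,\|\mathbf{x}\|_{2}\leqslant U_{\mathcal{D}}\|\mathbf{x}\|_{2}, so (103) holds with b_{1}=U_{\mathcal{D}}, b_{2}=0 and s=1, as the following sketch verifies on random draws.

```python
import numpy as np

rng = np.random.default_rng(3)
n, U_D = 5, 2.0   # illustrative dimension and radius of the decision set

for _ in range(1000):
    beta = rng.normal(size=n)
    beta *= U_D * rng.uniform() / np.linalg.norm(beta)   # any beta with ||beta||_2 <= U_D
    x = rng.normal(size=n)
    # Cauchy-Schwarz: |beta^T x| <= ||beta||_2 ||x||_2 <= U_D ||x||_2,
    # i.e., condition (103) with b1 = U_D, b2 = 0 and s = 1.
    assert abs(beta @ x) <= U_D * np.linalg.norm(x) + 1e-12
print("(103) verified for the affine rule on random draws")
```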
