On Generalization and Regularization via Wasserstein Distributionally Robust Optimization
Abstract
Wasserstein distributionally robust optimization (DRO) has gained prominence in operations research and machine learning as a powerful method for achieving solutions with favorable out-of-sample performance. Two compelling explanations for its success are the generalization bounds derived from Wasserstein DRO and its equivalence to regularization schemes commonly used in machine learning. However, existing results on generalization bounds and regularization equivalence are largely limited to settings where the Wasserstein ball is of a specific type, and the decision criterion takes certain forms of expected functions. In this paper, we show that generalization bounds and regularization equivalence can be obtained in a significantly broader setting, where the Wasserstein ball is of a general type and the decision criterion accommodates any form, including general risk measures. This not only addresses important machine learning and operations management applications but also expands to general decision-theoretical frameworks previously unaddressed by Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a by-product, we show that Wasserstein DRO coincides with the recent max-sliced Wasserstein DRO for any decision criterion under affine decision rules – resulting in both being efficiently solvable as convex programs via our general regularization results. These general assurances provide a strong foundation for expanding the application of Wasserstein DRO across diverse domains of data-driven decision problems.
Key-words: distributionally robust optimization, Wasserstein metrics, finite-sample guarantees, regularization
1 Introduction
Stochastic optimization problems of the form
(1)
naturally arise in many machine learning (ML) and operations research (OR) applications, where denotes a binary random variable, taking values from , and is a random vector in . The function , parameterized by the decision variable , can generally be interpreted as an affine decision rule. In ML, the binary random variable often represents outcomes in classification problems. When is constant, the formulation (1) applies to a broad range of regression problems in ML and risk minimization problems in OR. The function represents a measure of risk, mapping a random variable to a real number that quantifies its risk. The notion of risk in this paper is broadly defined as the undesirability of a random variable ; that is, a larger value of indicates that is less preferable.
In this paper, we study the Wasserstein distributionally robust counterpart of (1), specifically:
(2)
where denotes a type- Wasserstein ball of radius centred at a reference distribution (c.f. Kuhn et al. (2019)). The superscript in indicates that the measure depends on a joint distribution of . This problem has been extensively studied and applied in various fields when takes the form of an expected value function, i.e., for some (see e.g., Kuhn et al. (2019); Shafieezadeh-Abadeh et al. (2019); Gao et al. (2017); Gao (2022)). This setup reveals two key findings that underscore the potential strength of the Wasserstein distributionally robust optimization (DRO) model in achieving strong out-of-sample performance. First, the model enjoys generalization bounds under mild assumptions (Shafieezadeh-Abadeh et al. (2019)). More notably, Gao (2022) shows that when the data-generating distribution satisfies transportation inequalities, -generalization bounds for the type- Wasserstein ball with can be established, free from the curse of dimensionality. Second, the model has an exact regularization equivalent to the nominal problem (1) with an added regularizer on the decision variable , when the Wasserstein ball is of type-1 and the loss function is Lipschitz continuous (Shafieezadeh-Abadeh et al. (2019)), as well as when the ball is of type-2 and the loss function is quadratic (Blanchet et al. (2019)). This connection to the classical regularization scheme, prevalent in machine learning, offers a compelling interpretation of the DRO model and has spurred considerable interest in its application within ML and OR (see e.g., Blanchet and Kang (2021), Blanchet et al. (2019), Chen and Paschalidis (2018), Gao et al. (2017), Gao et al. (2022)).
It remains largely unclear, however, whether the generalization bounds established for the specific case of the expected value function can be achieved in the broader setting of (2) – namely, for a general measure and any type- Wasserstein ball with – without suffering from the curse of dimensionality. In particular, the result of Gao (2022), aside from the limitations imposed by the transportation inequality assumption on the underlying distribution and the restriction to , fundamentally relies on the unbiased property of the sample average – a property that does not extend to risk measures such as Conditional Value-at-Risk (CVaR) (Rockafellar and Uryasev (2002)). It is also worth noting that, even in the literature on ML, the question of how to obtain generalization bounds, possibly through the classical regularization scheme, for the nominal problem (1) has largely been left open (c.f. Shalev-Shwartz and Ben-David (2014)). In this paper, we shed new light on the efficacy of Wasserstein DRO in broad out-of-sample tests by showing that the Wasserstein DRO model (2) can circumvent the curse of dimensionality for virtually any decision criterion under affine decision rules. As a crucial insight, we uncover that the Wasserstein DRO remarkably coincides with the recent max-sliced Wasserstein DRO (Olea et al. (2022)) for any decision criterion, even though the ball defined in the former is a far less conservative choice – specifically, it is smaller and tighter than the one defined in the latter. Additionally, we extend our results to general decision rules under the light-tail assumption by leveraging the measure concentration property of a Wasserstein ball.
Another focus of this work is to investigate whether the key equivalence of Wasserstein DRO to classical regularization in ML can be extended to a more general setting in (2). To answer this, we first pay particular attention to the setting where the Wasserstein ball is of any type, i.e., , and the measure is an expected function with a loss function having a growth order . We show that for many loss functions arising from practical applications, it is possible to establish an exact relation between the Wasserstein DRO model (2) and the classical regularization scheme in more general terms. Notably, we prove that no loss functions beyond those we identify satisfy this equivalence, offering a definitive answer to the extent to which Wasserstein DRO can be interpreted and solved from a regularization perspective. Our results also reveal the tractability of solving Wasserstein DRO for many non-Lipschitz continuous loss functions and higher-order Wasserstein balls (). While Sim et al. (2021) approach such problems via approximation due to reformulation challenges, we show that exact solutions are possible through regularization reformulations. Moreover, we extend this equivalence to settings where goes beyond expected value functions. Previously, this exact relation was known only for variance (Blanchet et al. (2022)) or distortion risk measures (Wozabal (2014)). We expand it to a significantly broader class of measures, encompassing higher moment coherent risk measures (Krokhmal (2007)), criteria emerging more recently from OR and ML (Rockafellar and Uryasev (2013), Rockafellar et al. (2008), Gotoh and Uryasev (2017)), and notably, general decision-theoretical models such as the celebrated rank-dependent expected utility (RDEU) (Quiggin (1982)).
Related Work
From the perspective of generalization bounds. A series of works by Blanchet et al. (2019), Blanchet and Kang (2021), and Blanchet et al. (2022) takes a different approach to tackling the curse of dimensionality. They study the classical setting of Wasserstein DRO, where the measure is an expected function, and propose a radius selection rule for Wasserstein balls. They show that the rule can be used to build a confidence region for the optimal solution, and that the radius can be chosen to decay at a square-root rate as the sample size goes to infinity. Although this allows the curse of dimensionality to be bypassed, the bounds derived from the rule are valid only in an asymptotic sense. Blanchet et al. (2022) also take this approach to obtain generalization bounds for mean-variance portfolio selection problems. In contrast, the generalization bounds established in this paper, like those in Shafieezadeh-Abadeh et al. (2019), Chen and Paschalidis (2018), and Gao (2022), break the curse of dimensionality in a non-asymptotic sense for finite samples.
From the perspective of the equivalency between Wasserstein DRO and regularization. While the focus of this work is on studying the exact equivalency between Wasserstein DRO and regularization, there is an active stream of works studying the asymptotic equivalence in the setting where the measure is an expected function (see Gao et al. (2017); Blanchet et al. (2019); Blanchet et al. (2022); Volpi et al. (2018); Bartl et al. (2020)). In particular, Gao et al. (2022) introduce the notion of variation regularization and show that for a broad class of loss functions, Wasserstein DRO is asymptotically equivalent to a variation regularization problem.
Recap of Major Contributions and Practical Implications
1. We underscore the fundamental gap in achieving -generalization bounds for diverse decision criteria in Wasserstein DRO and establish such guarantees, for the first time, by leveraging the projection set of the Wasserstein ball, revealing a broad equivalence with the recent max-sliced Wasserstein DRO for any decision criterion.
2. By proving the precise conditions under which Wasserstein DRO has an exact regularization equivalent, we provide key insights into a wide range of OR/MS applications, highlighting when a simpler, intuitive, and computationally efficient regularization approach suffices to achieve the generalization guarantees of Wasserstein DRO – and when solving the full Wasserstein DRO formulation is necessary.
3. By extending generalization bounds beyond affine decision rules, we uncover important OR/MS applications, such as foundational two-stage stochastic programs and the increasingly popular feature-based newsvendor problem, with generalization guarantees that scale effectively across diverse decision criteria.
2 Wasserstein Distributionally Robust Optimization Model
Let be an atomless probability space. A random vector is a measurable mapping from to , . Denote by the distribution of under . For , let be the set of all random variables with a finite th moment, and be the set of all distributions on with finite th moments in each component. Let be the set of all bounded random variables, and be the set of all distributions on with bounded support. Let denote the Hölder conjugate of , i.e., . We use to represent the essential supremum of the random variable , i.e., . Recall that for any two distributions , the type- Wasserstein metric is defined as
(3)
where is a metric on . The set denotes the set of all joint distributions of and with marginal distributions and , respectively. In this paper, we consider with and apply the type- Wasserstein metric (3) with , defined by the following additively separable form
(4)
where is any given norm on with its dual norm defined by . The function satisfies if and otherwise. Thus, the function (4) assigns an infinitely large cost to any discrepancy in , i.e., , and reduces to a general norm on when there is no discrepancy in , i.e., . Using this norm, for and , we define the ball of distributions as
(5)
which we refer to as the type- Wasserstein ball throughout this paper.
In the remainder of this paper, we first demonstrate that the DRO problem (2), using the above definition of the Wasserstein ball:
(6)
can achieve generalization guarantees for any measure and type- Wasserstein ball. We then elucidate this guarantee for the widely used affine decision rules by exploring the connection between the optimization model (6) and the classical regularization scheme in Section 4. This framework covers various problem types, including:
• Classification:
• Regression:
• Risk minimization:
where in the cases of regression and risk minimization. These formulations accommodate numerous machine learning and operations research applications not yet addressed in the existing literature on Wasserstein DRO. Table 1 highlights examples considered in this paper.
| Applications | Measure | Formulation | References |
| Classification | HO hinge loss; HO SVM; -SVM | | Lee and Mangasarian (2001), Suykens and Vandewalle (1999), Schölkopf et al. (2000); Gotoh and Uryasev (2017) |
| Regression | HO regression; HO -insensitive; -SVR | | the well-known least-square regression; Drucker et al. (1997); Lee et al. (2005); Schölkopf et al. (1998) |
| Risk minimization | LPM; CVaR-Deviation; HM risk measure | | Bawa (1975); Fishburn (1977); Chen et al. (2011); Rockafellar et al. (2008); Krokhmal (2007) |
Notes: HO: higher-order; SVM: support vector machine; SVR: support vector regression; LPM: lower partial moments; HM: higher moment; CVaR: conditional Value-at-Risk, , , where is the left-quantile function of . The range of parameters: ; ; ; ; .
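To make the CVaR entries in Table 1 concrete, the following minimal sketch estimates CVaR at level α from an i.i.d. sample of losses. It assumes the standard convention CVaR_α(X) = (1/(1−α)) ∫_α^1 q_X(s) ds (larger values are worse), which matches the left-quantile-based definition referenced in the notes above; the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np

def empirical_cvar(x, alpha):
    """Estimate CVaR_alpha of a loss sample, assuming the convention
    CVaR_alpha(X) = (1/(1-alpha)) * integral_alpha^1 q_X(s) ds,
    i.e., the average of the worst (1-alpha) fraction of outcomes."""
    x = np.sort(np.asarray(x, dtype=float))            # empirical left-quantile function
    n = len(x)
    upper = np.arange(1, n + 1) / n                    # right endpoint i/n of the i-th atom
    weights = np.clip(upper - alpha, 0.0, 1.0 / n)     # mass of atom i that falls in (alpha, 1]
    return float(weights @ x) / (1.0 - alpha)

rng = np.random.default_rng(0)
losses = rng.standard_normal(100_000)
print(empirical_cvar(losses, 0.95))   # roughly 2.06 for a standard normal loss
```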
3 Generalization Bounds
Wasserstein DRO is typically applied in a data-driven setting, where the data-generating distribution of random variables is only partially observed through sample data points for , independently drawn from . In this context, the empirical distribution is used as the reference distribution in Wasserstein DRO, where and denotes the point-mass at . The central question is whether the (Wasserstein) in-sample risk
can provide an upper confidence bound on the out-of-sample risk
across all possible decisions , with a radius that decays at a rate scaling gracefully with the dimensionality of the random vector – in other words, generalization bounds that overcome the curse of dimensionality.
Current methods face significant limitations. While generalization bounds based on the measure concentration properties of type- Wasserstein balls offer flexibility in accommodating general risk measures , they are constrained by the curse of dimensionality inherent in high-dimensional Wasserstein balls (see e.g., Esfahani and Kuhn (2018)). Approaches such as Gao (2022) address this issue by focusing on expected value functions, i.e., , leveraging the concentration properties of sample averages. However, extending these methods beyond expected value functions proves challenging, as most risk measures lack analogous concentration properties.
We take a novel perspective to tackle this challenge by focusing on the structural properties of the projection set:
(7)
where , and . We uncover that this set coincides with two particularly revealing one-dimensional sets: a one-dimensional Wasserstein ball and the projection set of a high-dimensional max-sliced Wasserstein ball. To formalize this, for , we first recall the definition of a type- Wasserstein ball on with the metric being a norm:
(8)
and a type- max-sliced Wasserstein ball (Olea et al. (2022)):
(9)
Here, is the norm defined in (4) on . Without loss of generality, and to be consistent with the definition of the classical one-dimensional Wasserstein metric, for , i.e., on , we assume that is the absolute-value norm.
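It is also worth noting why the one-dimensional ball (8) is computationally convenient: on the real line with the absolute-value cost, the optimal transport plan in (3) couples quantiles, so the type-p distance between two equally weighted empirical distributions of the same size reduces to matching order statistics. The sketch below illustrates this standard fact; it is not part of the paper's formal development.

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """Type-p Wasserstein distance between two equally weighted empirical
    distributions on the real line (absolute-value cost). In one dimension the
    quantile (order-statistic) coupling is optimal, so sorting suffices."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "this sketch assumes equal sample sizes and uniform weights"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
print(wasserstein_p_1d(z, z + 0.5, p=2))   # a pure shift by 0.5 gives distance 0.5
```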
With these definitions in place, we can now present the following equivalence results.
Proposition 1.
The equivalence established above has profound implications for both Wasserstein DRO and max-sliced Wasserstein DRO. It shows that the two frameworks coincide for any decision criterion under affine decision rules, both reducing to the problem of minimizing worst-case risk over a one-dimensional type- Wasserstein ball – an inherently more tractable problem, as detailed in Section 4. Notably, this equivalence reveals that, although the Wasserstein distance is a significantly stronger metric than the max-sliced Wasserstein distance, i.e., the type- Wasserstein ball is a much less conservative choice of an uncertainty set than its max-sliced counterpart (Olea et al. (2022)), Wasserstein DRO can still achieve the same generalization bounds. This insight directly leads to the following generalization bounds for general measures , free from the curse of dimensionality.
Theorem 1.
Given , and , if for some , then we have
where and
The decay rate of the radius is approximately of the order . While this rate is unaffected by dimensionality, its dependency on the type of Wasserstein ball, specifically the parameter , is noteworthy. This dependency, often overlooked in the literature compared to the curse of dimensionality, is potentially concerning, as it suggests that the radius may shrink very slowly for higher-order Wasserstein balls. This issue arises from the use of stronger metrics – specifically, norms raised to higher powers – when defining any ball constructed similarly to the Wasserstein ball, including the max-sliced Wasserstein ball (Olea et al. (2022)). Intuitively, a higher-order Wasserstein ball is smaller than a lower-order one, requiring a larger radius to ensure the data-generating distribution is covered with high probability. In the extreme case of a type- Wasserstein ball, it would never contain a data-generating distribution with unbounded support, regardless of the radius.
We circumvent this issue, which we refer to as the “curse of the order ”, by recognizing that ensuring high-probability coverage of the data-generating distribution, while sufficient, is not necessary for developing generalization bounds. In fact, we show that even for the type- Wasserstein ball – where the probability of covering a data-generating distribution with unbounded support is effectively zero – the worst-case risk from the type- ball can still bound the out-of-sample risk of the data-generating distribution with high probability. This guarantee requires only the following mild assumption about the measure .
Assumption 1.
There exists such that
For any measures satisfying Assumption 1, one can leverage the generalization bounds for the type-1 Wasserstein ball to derive similar bounds for the type- Wasserstein ball, with a radius scaled by a constant . These bounds apply to any type- Wasserstein ball, for , resulting in generalization bounds where the decay rate of the radius is no longer dependent on the order .
Theorem 2.
The core idea presented in this section can be extended beyond affine decision rules, as discussed in greater detail in Section 6. Here, we further illustrate the generality of Assumption 1 by providing additional examples.
Example 1 (Expected function).
Let , where is quasi-convex and Lipschitz continuous with Lipschitz constant , and satisfies, for any differentiable point , . Since is quasi-convex, there exists such that is decreasing on and increasing otherwise, and thus is differentiable almost everywhere. It follows that
This, together with by the Lipschitz continuity of , implies satisfies Assumption 1 with . In particular, the function , defined as with , , satisfies Assumption 1 with .
Example 2 (Risk measure).
Suppose that satisfies monotonicity and translation invariance, i.e., whenever , and for all and . Such functionals are referred to as monetary risk measures (see, e.g., Föllmer and Schied (2016)). Further, assume that is Lipschitz continuous with respect to the -norm, i.e., there exists such that for all . Note that for any , This, together with , yields that satisfies Assumption 1 with . In particular, CVaR, as defined in Table 1 by , , satisfies Assumption 1 with , where is the quantile function of .
4 A Regularization Perspective
Regularization generally refers to any means applied to avoid overfitting and enhance generalization. In this regard, with the generalization guarantees established in the previous section, the Wasserstein DRO model (6) is well justified as a general regularization model. In this section, we provide further insights into the regularization effect of model (6) on the decision variable , offering a practically useful alternative interpretation. Specifically, we show that the previously observed equivalence between the Wasserstein DRO model and regularized empirical optimization in ML holds across much broader settings, while precisely identifying the boundary of its validity. This equivalence significantly broadens the range of both Wasserstein and max-sliced Wasserstein DRO problems that can now be solved efficiently through their regularization counterparts. Proposition 1 plays a key role in facilitating this analysis. By (10) in Proposition 1, the Wasserstein DRO model (6) can be recast as
(12)
where , and is a Wasserstein ball on defined by (8).
4.1 The case of expected function
When is an expected function, the Wasserstein DRO model (12):
(13)
is equivalent to the regularized model:
(14)
for (the type-1 Wasserstein ball), where is the Lipschitz constant of . For higher-order Wasserstein balls () or non-Lipschitz loss functions, this relationship is less understood, except for and (Blanchet et al. (2019)). It remains an open question whether (13) can be tractably solved in these cases. Higher-order Wasserstein balls are less conservative than type-1 Wasserstein balls and offer practical appeal.
We show that an equivalence with a regularized model (14) exists for if and only if is linear or takes the form of an absolute-value function.
Theorem 3.
Let be a convex function. For , suppose that for all . Then the following statements are equivalent.
(i) There exists such that for any , and , it holds that
(15)
(ii) The function takes one of the following two forms, multiplied by :
(a) or with some ;
(b) with some .
This result, which holds for any type- Wasserstein ball, is somewhat surprising as the Wasserstein DRO model is equivalent to the same regularized model, regardless of the order . This equivalence occurs when the slope of the loss function takes values only from a constant and its negative. It further indicates that no regularized model in the form of (14) can be derived for any other loss function . In other words, if there is an equivalence between the Wasserstein DRO model (13) and a regularized model for another loss function, the regularized model must take a different form from (14). Before discussing other forms of regularization, it is worth noting the following application of this result in regression.
Example 3 (Regression).
- (Least absolute deviation (LAD) regression) Applying in Theorem 3 and setting , , and , we arrive at the distributionally robust counterpart of the least absolute deviation regression, i.e., It is equivalent to for any .
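As a hedged illustration of this equivalence, the following sketch solves the regularized counterpart of the distributionally robust LAD regression in cvxpy, assuming the penalty is the radius times a dual norm of the coefficient vector; the ℓ2 norm and the intercept below are illustrative choices, since the paper's exact display is not reproduced here.

```python
import numpy as np
import cvxpy as cp

def dr_lad_regression(X, y, eps):
    """Regularized counterpart of the distributionally robust LAD regression:
    minimize (1/n) * sum_i |y_i - x_i'beta - b| + eps * ||beta||_2.
    The l2 penalty (dual of an l2 transport cost on x) and the intercept are
    illustrative; the paper allows any norm."""
    n, d = X.shape
    beta, b = cp.Variable(d), cp.Variable()
    objective = cp.sum(cp.abs(y - X @ beta - b)) / n + eps * cp.norm(beta, 2)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value, b.value
```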
We now explore whether an alternative form of regularization on exists that is equivalent to the Wasserstein DRO model (13). Specifically, we focus on cases where the loss function is not Lipschitz continuous, such as the higher-order loss functions arising from the examples presented in Table 1. It is known that the worst-case expectation problem based on the type-1 Wasserstein ball can become unbounded when the loss function is not Lipschitz continuous. To address cases beyond Lipschitz continuity, we consider the following formulation of the Wasserstein DRO model (13):
(16)
where denotes the function raised to the power of , matching the order of the type- Wasserstein ball. Below, we present an alternative equivalency relationship and specify the exact cases where it holds.
Theorem 4.
Let be a Lipschitz continuous and convex function. For any , the following statements are equivalent.
(i) There exists such that for any , and , it holds that
(17)
(ii) The function takes one of the following four forms, multiplied by :
(a) with some ;
(b) with some ;
(c) with some and ;
(d) with some and .
This result is also an “impossibility” theorem, providing a definitive answer to whether the equivalence relationship can hold more broadly for other loss functions. It settles any attempts to establish this equivalence for other convex Lipschitz continuous functions . This theorem is fundamentally important to the study of Wasserstein DRO, especially given ongoing efforts to understand its relationship with a classical regularization perspective. It shows precisely the limits of how far this perspective can be extended.
Remark 1.
As previously discussed, Wasserstein DRO and empirical risk regularization are two widely used and competing approaches for data-driven decision-making in ML and OR/MS. Wasserstein DRO is valued for its universality, offering simple generalization guarantees under minimal assumptions, such as not relying on notions of hypothesis complexity (Shafieezadeh-Abadeh et al. (2019)), while empirical risk regularization is predominantly embraced by practitioners as a simple yet powerful heuristic, despite its desirable theoretical properties (Shafieezadeh-Abadeh et al. (2019); Abu-Mostafa et al. (2012)). While we have expanded the universality of Wasserstein DRO by establishing generalization guarantees across diverse decision criteria, free from the curse of dimensionality, a natural managerial question arises: can simpler, more intuitive heuristics, such as empirical risk regularization, achieve the same generalization guarantees? Our equivalency results provide a positive answer for many applications by specifying the exact form of regularization required. Conversely, our impossibility theorem identifies when Wasserstein DRO becomes essential for decision problems beyond these cases. To underscore the importance of our regularization results for OR/MS applications, we present below an extensive series of examples, including those highlighted in Table 1 and the celebrated rank-dependent expected utility (RDEU) model.
Remark 2.
It is worth noting that Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) provided a regularized version of (13) and (16) based on a duality scheme. They showed that when is proper, upper-semicontinuous and satisfies for some , then
(18)
where . While this formula can be applied to general loss functions and used to verify the implication (ii) ⇒ (i) in both Theorems 3 and 4, it offers limited insight into its connection with the empirical nominal problem and typically requires additional analysis to establish a regularized empirical risk minimization formulation. In contrast, our Theorems 3 and 4 directly leverage the structure of regularized empirical risk minimization to clearly identify the necessary and sufficient conditions for its existence, where the necessary "impossibility" would be difficult to derive through general duality arguments alone.
We detail in Appendix B.1 that the regularized models (15) and (17) can be established based on (18) when is given in (ii) of Theorems 3 and 4, respectively. In particular, for the in (ii) of Theorem 3, although the regularization in (18) appears to depend on , we demonstrate that this regularization can, in fact, be reduced to a formulation independent of (Theorem 3). As will be shown, the calculation is lengthy, making the derivation from (18) somewhat cumbersome.
Remark 3.
Our regularization results offer deeper and more practically significant insights than Proposition 3.9 in Shafieezadeh-Abadeh et al. (2023), particularly regarding the required regularization term. While Proposition 3.9 suggests the need for a term of the form in (18), with the regularization raised to the power , this can create computational challenges, especially as . In contrast, our regularization result reveals that the regularization term is actually independent of , namely . This insight significantly enhances the computational tractability of solving the regularization problem. Moreover, the regularized formulations (17) in Theorem 4 are all convex programs in , with complexity comparable to nominal problems. In contrast, the right-hand side of (18) is nonconvex in and , and its objective function is more costly to evaluate. Thus, Theorem 4 can also be viewed as enabling a convex program solution to (18). Finally, the equivalency between Wasserstein DRO and max-sliced Wasserstein DRO established in the previous section implies that max-sliced Wasserstein DRO can likewise be solved efficiently via convex programs for the cases covered in Theorem 4.
Example 4 (Classification).
- (Higher-order hinge loss) Applying and setting , we have that the following classification problem with a higher-order hinge loss is equivalent to the regularization problem
- (Higher-order SVM) Applying and setting and , we have that the higher-order SVM classification problem is equivalent to the regularization problem
Example 5 (Regression).
- (Higher-order measure) Applying and setting and , we have that the regression with a higher-order measure is equivalent to the regularization problem
- (higher-order -insensitive measure) Applying and setting and , we have that the following regression problem with a higher-order -insensitive measure is equivalent to the regularization problem
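For the quadratic special case (order 2 with the squared loss), the known equivalence (Blanchet et al. (2019); see also Olea et al. (2022)) yields a square-root-LASSO type objective. The sketch below solves one such instance, using an ℓ1 penalty that corresponds to an ℓ∞ transport cost on the inputs; the exact relation between the penalty multiplier and the Wasserstein radius follows the references and is not reproduced here.

```python
import numpy as np
import cvxpy as cp

def sqrt_lasso(X, y, eps):
    """Square-root-LASSO form of the order-2 Wasserstein DRO regression with the
    squared loss (cf. Blanchet et al. (2019); Olea et al. (2022)):
    minimize sqrt((1/n) * sum_i (y_i - x_i'beta)^2) + eps * ||beta||_1.
    The l1 penalty corresponds to an l_inf transport cost on x; other costs
    give the corresponding dual norm."""
    n, d = X.shape
    beta = cp.Variable(d)
    rmse = cp.norm(y - X @ beta, 2) / np.sqrt(n)
    cp.Problem(cp.Minimize(rmse + eps * cp.norm(beta, 1))).solve()
    return beta.value
```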
Example 6 (Risk minimization).
- (Lower partial moments) Applying and setting , we have that the risk minimization with lower partial moments is equivalent to the regularization problem
Even though Theorem 4 highlights the impossibility of establishing an equivalence relation for a more general loss function , it can still be applied more broadly as a powerful foundation for deriving alternative equivalence relations for a richer family of measures. Specifically, there is a large family of measures that can be generally expressed in the following two forms:
(19)
for some loss function .
We show in the appendix (see Lemma 8) that for a wide range of loss functions , the following switching of and is valid
where
This, combined with Theorem 4, leads to the following.
Corollary 1.
For any and , let and be defined by (19) with . Take and .
Corollary 1 can accommodate cases where variance is used as the measure. Variance is not listed as an example in Section 2 because it has already been studied in Blanchet et al. (2022). However, our method for deriving the equivalency relation is different from, and significantly more general than, the approach in Blanchet et al. (2022). The following example illustrates the broader applicability of our approach.
Example 7.
The following example demonstrates how Corollary 1 can be applied to derive the regularization counterpart for higher moment risk measures in risk minimization, a previously unknown result.
Example 8 (Risk minimization). For , setting (i.e., with ), we have that the following problem of minimizing higher moment risk measures is equivalent to the regularization problem
4.2 The case of non-expected function with distortion
A powerful perspective in the risk measure literature, especially when studying non-expected functions such as Conditional Value-at-Risk (CVaR), is to view these functions as equivalent to taking an expectation with respect to a distorted probability distribution. The concept of distorted expectation is central to foundational theories, including Yaari’s dual utility theory (Yaari (1987)), Choquet Expected Utility (Schmeidler (1989)), and various business applications. In this section, we adopt this perspective to study the problem (13) with a distorted expectation:
(20)
where
is an "expectation" taken with respect to a distortion function.[1] Here, represents the left-quantile function of , i.e., for , and . The problem (20) encompasses (13) as a special case, since when , , and accommodates CVaR, since when , . A comprehensive list of risk measures with distorted expectation representations can be found in Wang et al. (2020) and Cai et al. (2023).

[1] Strictly speaking, a distortion function should additionally satisfy and be increasing, so that can be considered as a distorted expectation. However, our results do not hinge on these requirements.
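For an empirical distribution, a distorted expectation reduces to a weighted sum of order statistics, with weights given by increments of the distortion function. The sketch below assumes the convention ρ_h(X) = ∫_0^1 q_X(s) dh(s) with h increasing, h(0) = 0 and h(1) = 1, under which h(s) = max(0, (s − α)/(1 − α)) recovers CVaR_α and h(s) = s recovers the ordinary expectation; since the paper's displayed formula is not reproduced here, this orientation of the distortion is an assumption.

```python
import numpy as np

def distorted_expectation(x, h):
    """Distorted expectation of an empirical sample x, assuming the convention
    rho_h(X) = integral_0^1 q_X(s) dh(s) with h increasing, h(0) = 0, h(1) = 1
    (a convex h places more weight on the worst outcomes)."""
    x = np.sort(np.asarray(x, dtype=float))   # empirical quantiles, ascending
    n = len(x)
    grid = np.arange(n + 1) / n
    weights = np.diff(h(grid))                # h(i/n) - h((i-1)/n) for each atom
    return float(weights @ x)

alpha = 0.9
cvar_h = lambda s: np.maximum(0.0, (s - alpha) / (1.0 - alpha))   # recovers CVaR_alpha
mean_h = lambda s: s                                              # recovers the expectation
rng = np.random.default_rng(1)
sample = rng.standard_normal(50_000)
print(distorted_expectation(sample, cvar_h), distorted_expectation(sample, mean_h))
```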
It is intriguing to consider how the problem (20) might be solved, particularly whether a regularization counterpart, similar to that of the expected function (13), remains available. Our findings are surprisingly general: for any problem (20) with an increasing and convex distortion function , a regularization counterpart always exists for the type-1 Wasserstein ball. This similarity to the classical regularization result for (13) is particularly notable, given the broader applicability of (20). This insight even extends to the general type- Wasserstein ball. As we show below, the necessary and sufficient conditions for a regularization counterpart to exist are remarkably similar for both (20) and (13) in the general type- case.
Before presenting the result, we introduce the notation that for a function with and , and . For a convex distortion function with , we denote by the left-derivative of on , noting that may be infinity.
Theorem 5.
For , let be an increasing and convex distortion function satisfying and , and let be a convex function.
For , if , then for any , and , we have
For , if for all , then the following statements are equivalent.
(i) For any , and , there exists such that
(ii) The function takes one of the following two forms, multiplied by :
(a) or with some ;
(b) with some .
Interestingly, the “exact” class of loss functions that admit a regularization counterpart is the same as in the previous section, despite distorted expectations being significantly more general than expected functions. This result is particularly significant from a decision-theoretical perspective, as the formulation (20) closely resembles the celebrated rank-dependent expected utility (RDEU) model, known for resolving the paradox in expected utility (Quiggin (1982)). Given its importance, we highlight below the application of Theorem 5 in RDEU.
Example 9 (Rank-Dependent Expected Utility (RDEU)). The decision criterion of RDEU admits the form , where is a distortion function and is a disutility function. For decision makers who are risk-averse (Chew et al. (1987)), i.e., and are increasing convex with , we have
for any convex loss function with , , and .
We further demonstrate the practical applicability of the above theorem by showing how it can be directly applied to derive the regularization counterpart for the -support vector regression example.
Example 10 (Regression).
- (-support vector regression) For , let , . Applying and setting and , we have from Theorem 5 that the -support vector regression is equivalent to the regularization problem
Measures like the CVaR-Deviation example in risk minimization, as shown below, require a convex distortion function , which is not covered by the theorem above. We now show that when the loss function in (20) is linear, a regularization counterpart can also be found for any convex distortion function .
Proposition 2.
For , let be a convex function with and . Then for any , and , we have
Note first that applying Proposition 2 to classification problems is fairly straightforward, namely yielding the following equivalency:
This equivalency is readily applicable to specific cases, such as the -support vector machine.
Example 11 (Classification).
- (-support vector machine) Setting , with we have that the classification problem with -support vector machine is equivalent to the regularization problem
Proposition 2 can be further applied to general deviation measures defined by a distorted expectation, such as the CVaR-Deviation example in risk minimization. Observe that holds for any distortion function , where for . Moreover, is convex whenever is convex. Thus, we apply Proposition 2 and arrive at the following equivalency:
This leads us to our final example.
Example 12 (Risk minimization).
- (CVaR-Deviation) Setting , with we have that the risk minimization with CVaR-Deviation is equivalent to the regularization problem
Remark 4.
We offer some comments here on the recent work by Chu et al. (2024), which was released well after our work had been made available online. While Chu et al. (2024) initially set out to establish an equivalency relationship for a general setting of Wasserstein DRO – considering a general loss function and cost function as defined in (3) – their actual results do not extend much beyond our settings and findings in Section 4.1 (expected functions and extensions) and are limited to discussing sufficient conditions for the equivalency. Although some examples presented in Table 2 of Chu et al. (2024) are not covered in Section 4.1, we attribute this largely to our focus on cases that admit a final convex optimization formulation, rather than a limitation of our analysis. Indeed, we believe our analysis is foundational, reaching the very core of the equivalency problem. This is evident in our ability to prove not only sufficient conditions but also necessary ones for both the case of expected functions and general distorted expectations – an achievement that appears beyond the reach of Chu et al. (2024), as it inevitably hinges on Lemma 4 provided in our analysis. To substantiate our claim that our analysis is not limited to the examples presented in this paper, we demonstrate in Proposition 3 in Appendix B.2 how to generalize Theorem 5 (the case ) to non-convex loss functions , including the truncated pinball loss from Chu et al. (2024) as a special case.[2] As another example, it is not difficult to confirm that our findings in Proposition 1, Lemma 4, and all the results in Section 4 can be readily extended to a general Hilbert space , with the norm induced by its inner product. This extension covers the case of nonparametric scalar-on-function linear regression from Chu et al. (2024), where takes the forms outlined in our Theorem 4.[3]

[2] In Chu et al. (2024), the center of the Wasserstein ball is assumed to be the empirical distribution, which has a finite set of vectors as its support. Consequently, Proposition 3 is directly applicable.

[3] The expression can be interpreted as the inner product between and on , where is the space of all square integrable functions on . In this setting, the decision rule is the inner product for each decision vector, which is affine, and the Wasserstein ball , similar in form to (5), is now encompassed within the set of probability measures on .
5 Numerical Demonstration of Generalization Bounds
In this section, we seek to provide a further demonstration of the generalization result obtained in this paper. In particular, we numerically illustrate the decay rate of the Wasserstein radius for measures that are non-expected functions. We closely follow the experimental setup presented in Shafieezadeh-Abadeh et al. (2019), where the authors investigate the scaling behaviour of the smallest Wasserstein radius for the synthetic threenorm classification problem (Breiman (1996)). While Shafieezadeh-Abadeh et al. (2019) use the classical support vector machine for classification, in which the measure is an expected function defined by a hinge loss function , we employ the -support vector machine, where is represented by CVaR. Specifically, we consider the distributionally robust counterpart of the standard -support vector machine formulation,
(21)
By choosing the -norm as the norm on the input space in the transport cost and setting , as in Shafieezadeh-Abadeh et al. (2019), and following Proposition 2 or Example 11, we have the following regularized form of the Wasserstein robust -support vector machine problem:
(22)
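A hedged sketch of one plausible instantiation of (22) is given below: the -SVM criterion is written as the CVaR of the negative margin via the Rockafellar-Uryasev variational formula, and the Wasserstein robustification appears as an ℓ1 penalty on the weight vector (the dual of the ℓ∞ transport cost chosen above). The exact constants and normalization of (22) are not reproduced here, so the formulation should be treated as illustrative.

```python
import numpy as np
import cvxpy as cp

def dr_nu_svm(X, y, nu, eps):
    """One plausible reading of the regularized nu-SVM (22):
    minimize CVaR_{1-nu}( -y_i (w'x_i + b) ) + eps * ||w||_1,
    with CVaR expressed through its Rockafellar-Uryasev variational form.
    The l1 penalty assumes the l_inf transport cost on the inputs; constants
    and normalization are illustrative."""
    n, d = X.shape
    w, b, tau = cp.Variable(d), cp.Variable(), cp.Variable()
    margins = cp.multiply(y, X @ w + b)                       # y_i (w'x_i + b)
    cvar = tau + cp.sum(cp.pos(-margins - tau)) / (nu * n)    # CVaR_{1-nu} of the negative margin
    cp.Problem(cp.Minimize(cvar + eps * cp.norm(w, 1))).solve()
    return w.value, b.value
```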
The experiment is based on an output drawn uniformly from and a -dimensional input from a multivariate normal distribution. Specifically, if , then is drawn from a standard multivariate normal distribution shifted by or with equal probabilities, where . If , then is drawn from a standard multivariate normal distribution shifted by . We consider different sizes of training samples in the set as well as test samples. Each training sample size involves simulation trials.
Throughout the experiment, the search space of radius is chosen as . Similar to Shafieezadeh-Abadeh et al. (2019), we use the following three approaches to choose the smallest Wasserstein radius:
• Cross validation: For a set of training samples, denoted as , we partition them into subsets. We use one subset as the validation dataset and combine the remaining subsets as the training dataset. This results in pairs of validation and training datasets. We choose the radius in such that the average validation error over these pairs is the smallest. This operation is repeated across all trials, and we then report the average of the radii. (A code sketch of this selection rule follows the list.)
• Optimal radius: In each trial, we choose the radius in that has the smallest testing error and then report the average of the radii across all trials.
• Generalization bound: We choose the smallest radius such that the optimal value of (22) exceeds the value of the nominal problem on the test samples in at least of all 100 trials.
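The following sketch implements the cross-validation rule described above for a grid of candidate radii, using the misclassification rate as the validation error; the fold count, error metric, and the fitting routine passed in are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def select_radius_by_cv(X, y, radii, fit, n_folds=5, seed=0):
    """Grid-search the Wasserstein radius by K-fold cross-validation: for each
    candidate radius, average the validation misclassification rate over the
    folds and keep the minimizer. `fit(X, y, eps)` is any routine returning
    (w, b), e.g. lambda X, y, eps: dr_nu_svm(X, y, nu=0.5, eps=eps)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    avg_err = []
    for eps in radii:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w, b = fit(X[trn], y[trn], eps)
            errs.append(np.mean(np.sign(X[val] @ w + b) != y[val]))
        avg_err.append(np.mean(errs))
    return radii[int(np.argmin(avg_err))]
```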
In our initial experiments, we set . Figure 1 displays the selected Wasserstein radii in relation to various sample sizes for the three approaches above. Note that while the first two approaches determine the radius by averaging the radii induced by all simulation trials (potentially leading to a radius not necessarily in the set ), the third approach specifically selects a radius from based on the percentage criteria, utilizing the testing samples across all 100 trials. One can observe that the Wasserstein radii of all three approaches decay approximately as , which is consistent with the theoretical generalization bound of Theorem 1.
[Figure 1: Selected Wasserstein radii versus training sample size for the three selection approaches.]
It is natural to wonder whether varying choices of could affect the scaling behaviour. This becomes particularly intriguing when considering higher values of . However, the literature on the -support vector machine suggests that the optimal solution of (21) can become degenerate (i.e., equal to zero) when surpasses a specific threshold. We indeed observe that the experimental setup of Shafieezadeh-Abadeh et al. (2019) only allows us to generate meaningful solutions for lower values of . To address the potential degeneracy issue at higher values, we modified the method for generating simulated data. Specifically, for each sample , if , then is drawn from a standard multivariate normal distribution shifted by or with equal probabilities, where , whereas if , then is drawn from a standard multivariate normal distribution shifted by . Figure 2 contains three panels, each displaying the selected radii for a distinct level of : . Notably, the decay rate of the radius is remarkably similar across different levels of , approximately as . This is in line with our theoretical finding that the rate in general does not hinge on the specification of the measure .
[Figure 2: Selected Wasserstein radii versus training sample size, one panel for each level of .]
6 Beyond Affine Decision Rules
While our results thus far have focused on affine decision rules, the underlying principles of our approach to establishing generalization bounds for a broad class of measures can be extended to a wider range of decision rules. In this section, we illustrate this extension by deriving generalization bounds for the following Wasserstein DRO problem with a decision rule denoted by :
where with and is the type- Wasserstein ball defined by (5). Throughout this section, we assume that for some , with representing any given norm on .
The key to establishing generalization bounds applicable to a broad class of measures continues to lie in focusing on the projection set, now expressed in a more general form:
(23)
Although the exact equivalence of this set to a one-dimensional Wasserstein ball or the projection set of a max-sliced Wasserstein ball, as shown in Proposition 1, no longer holds in the general case, we can still establish generalization bounds by instead identifying a one-dimensional Wasserstein ball dominated by , i.e.
(24)
for some increasing function , where . Indeed, if this dominance relationship (24) can be established, one can then leverage the measure concentration property of a one-dimensional Wasserstein ball to determine the radius of this one-dimensional ball such that
with , where and , defined in Section 3, are the data-generating distribution and empirical distribution, respectively. This leads to the confidence bound:[4]
(25)
for a fixed .[5] Extending this to generalization bounds – i.e., the union bound of the above – is achievable using covering number techniques. We leave the exact technical details and derivations to Appendix C, and present here the final generalization bounds along with the necessary assumptions.

[4] One may wish to further gauge how tight (i.e., small) the Wasserstein in-sample risk, , is – e.g., relative to the empirical in-sample risk – when the radius is sufficiently small. To this end, we note that, under Assumption 5 and the condition that is Lipschitz continuous with a Lipschitz constant (a condition satisfied in the affine case), the relationship holds.

[5] Notably, Aolaritei et al. (2022) also studied the set inclusion property between the projection of a Wasserstein ball and a lower-dimensional Wasserstein ball. However, their focus was on establishing the existence of a lower-dimensional Wasserstein ball as a superset of the projection (i.e., the converse of (24)), which cannot be used to derive generalization bounds based on (Wasserstein) in-sample risk, i.e. .
Assumption 2.
There exists such that
Assumption 3.
The decision rule is continuous for each , and there exists a strictly increasing convex function such that for every , and ,
Assumption 4.
There exist and such that
Assumption 5.
There exist and such that
Assumption 2 is a fairly standard light-tail condition, necessary for invoking the measure concentration property of a one-dimensional Wasserstein ball; see Proposition 4 in Appendix C.4 for a sufficient condition for Assumption 2 for a class of decision rules . Assumption 3, however, is more notable, as it successfully characterizes decision rules that ensure the set inclusion (24) holds. Simply put, decision rules must exhibit sufficiently increasing and decreasing behaviour; otherwise, a Wasserstein ball dominated by the projection set would not exist. Assumptions 4 and 5 are regularity conditions required for constructing union bounds through covering number techniques.
Theorem 6.
The decay rate here, , maintains an order independent of dimensionality, mirroring the rate reported in Gao (2022). Although our rate may not be as rapid, due to its dependence on the function and the parameter , it is derived under less restrictive distributional assumptions and is valid for any type- Wasserstein ball. Most importantly, it accommodates a significantly broader class of measures . A notable class of applications for these bounds includes any regression problem and decision rule, with a rate that, as demonstrated below, is independent of the function .
Example 13 (Regression).
In the regression problem, the data-generating distribution satisfies and the decision rule has the form: , , where is the first component of and represents the remaining components, i.e., we use to predict the output . Suppose that the norm on satisfies . Then, Assumption 3 holds for any with . Indeed, we have for any and , and Hence, any decision rule for a regression problem satisfies Assumption 3 with .
We further illustrate the application of this example in the context of operations management. Specifically, the feature-based newsvendor problem (Zhang et al. (2024)), also known as the big-data newsvendor (Ban and Rudin (2018)), is a special case of regression problems.
Example 14 (Feature-based newsvendor).
In the newsvendor problem, the decision maker seeks the optimal order quantity for a product facing uncertain demand , with holding costs of and back-order costs of . In the world of big data, the decision maker has access to a feature vector , which can be leveraged to enhance ordering decisions based on this additional information. In this setting, the decision rule is defined as with , and the data-driven distributionally robust feature-based newsvendor problem has the form
Applying Example 13, we know that any decision rule of the feature-based newsvendor problem satisfies Assumption 3 with .
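As a hedged illustration, the empirical (nominal) feature-based newsvendor with a linear ordering rule can be written as the convex program below; the optional norm penalty is a regularized surrogate in the spirit of Section 4, and all symbols (holding cost, back-order cost, the linear rule) are illustrative stand-ins for the paper's notation.

```python
import numpy as np
import cvxpy as cp

def feature_based_newsvendor(X, d, h_cost, b_cost, eps=0.0):
    """Empirical feature-based newsvendor with a linear ordering rule q = w'x + w0:
    minimize (1/n) * sum_i [ h*(q_i - d_i)^+ + b*(d_i - q_i)^+ ]  (+ eps * ||w||_1).
    The optional norm penalty is an illustrative regularized surrogate;
    symbols are stand-ins for the paper's notation."""
    n, p = X.shape
    w, w0 = cp.Variable(p), cp.Variable()
    q = X @ w + w0
    cost = h_cost * cp.pos(q - d) + b_cost * cp.pos(d - q)
    cp.Problem(cp.Minimize(cp.sum(cost) / n + eps * cp.norm(w, 1))).solve()
    return w.value, w0.value
```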
The above generalization bound can be extended to an even broader class of decision rules when the measure satisfies monotonicity, i.e., whenever , a natural property in problems such as risk minimization. Specifically, we consider the following data-driven Wasserstein DRO problem:
where satisfies monotonicity, represents the empirical distribution of the component , and is defined by (8). In these cases, Assumption 3 can be relaxed, requiring only that each decision rule exhibits sufficiently increasing behavior.
Assumption 6.
The decision rule is continuous, and there exists a strictly increasing convex function such that for every ,
Theorem 7.
In particular, Assumption 6 accommodates non-linear decision rules, including quadratic and max-affine functions (cf. Esfahani and Kuhn (2018) and Example 16), which can be bounded below.
Example 15 (Quadratic function).
Define the decision variable as , where , , and is a positive semi-definite matrix. Let be the feasible set of such that , where is the spectral norm of and is defined as the largest absolute value of all eigenvalues. Define the norm on as Let for and . For this specific case, we first show that satisfies Assumption 4 with and Assumption 6 with . For and , we have
Hence, Assumption 4 holds with . For any , and , it follows from the orthogonal decomposition of , i.e., where is an orthogonal matrix and is a diagonal matrix, that
Hence, Assumption 6 holds with . Further, if we assume that and there exists such that , then one can verify that Assumption 2 holds by Proposition 4 in Appendix C.4.
To further emphasize its implications for operations management, we next illustrate how Theorem 7 applies to two-stage stochastic programs.
Example 16 (Two-stage stochastic program).
In the context of two-stage stochastic programs with right-hand-side uncertainty, the recourse function is defined as (Esfahani and Kuhn (2018)): where with , denotes the first-stage decision variable, and represents the recourse decision. As shown in Esfahani and Kuhn (2018), this function can be reformulated as a max-affine function where , are the vertices of the dual feasible set . Suppose . We show that for satisfies Assumption 4 with and and Assumption 6 with . For , we have
Hence, Assumption 4 holds with and . For any , and , we have
Hence, Assumption 6 holds with . Further, if we assume that and there exists such that with , then one can verify that Assumption 2 holds by Proposition 4 in Appendix C.4. We note that extending this discussion to cases where the matrix also represents a first-stage decision is straightforward, and verifying this extension closely mirrors our existing analysis, thus it is omitted here for brevity.
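To make the recourse structure concrete, the sketch below evaluates a generic second-stage LP with right-hand-side uncertainty; by LP duality its value equals the max-affine function over the vertices of the dual feasible set, as described in this example. All problem data in the sketch are illustrative assumptions rather than the paper's specification.

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(q, W, T, h_of_xi, x, xi):
    """Value of a generic second-stage LP with right-hand-side uncertainty,
    Q(x, xi) = min_y { q'y : W y >= h(xi) - T x, y >= 0 }.
    By LP duality this equals the max-affine function over the vertices of
    {v >= 0 : W'v <= q}, as described in Example 16. All data are illustrative."""
    rhs = h_of_xi(xi) - T @ x
    # linprog solves min c'y s.t. A_ub y <= b_ub, so flip the inequality sign
    res = linprog(c=q, A_ub=-W, b_ub=-rhs, bounds=[(0, None)] * len(q), method="highs")
    return res.fun if res.success else np.inf   # +inf if the second stage is infeasible
```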
Finally, we show that for problems such as regression or risk minimization, the “curse of the order ” – i.e., the dependence of the decay rate on – can be overcome for any non-linear decision rules covered by Assumption 6.
Theorem 8.
7 Conclusion
As ML and OR/MS applications increasingly adopt diverse decision criteria, identifying methods that ensure robust, data-driven solutions with strong generalization guarantees across a wide range of criteria is essential. This paper establishes the universality of Wasserstein DRO in achieving such guarantees. We show that Wasserstein DRO attains generalization bounds comparable to max-sliced Wasserstein DRO for any decision criterion under affine decision rules, even though the Wasserstein ball is a significantly less conservative ambiguity set. Furthermore, we extend these guarantees to general criteria and decision rules under light-tail assumptions, avoiding the curse of dimensionality. Our projection-based analysis of the Wasserstein ball offers key insights into why these guarantees require minimal reliance on specific properties of decision criteria.
Beyond developing generalization guarantees, our work emphasizes their practical significance for OR/MS problems. We demonstrate that these guarantees can often be achieved through regularization equivalents, as detailed in Theorems 3, 4, and 5, all of which are efficiently solvable as convex programs with complexity comparable to nominal problems. Importantly, our impossibility theorems identify precisely when solving the full Wasserstein DRO problem becomes necessary, defining the boundaries where simpler regularization approaches no longer suffice. These results deepen the theoretical foundation of Wasserstein DRO while charting a clear course for developing more efficient methods to address full Wasserstein DRO formulations.
References
- Abu-Mostafa et al. (2012) Abu-Mostafa, Y. S., Magdon-Ismail, M. and Lin, H. T. (2012). Learning from data. New York: AMLBook.
- Aolaritei et al. (2022) Aolaritei, L., Lanzetti, N., Chen, H. and Dörfler, F. (2022). Distributional uncertainty propagation via optimal transport. arXiv:2205.00343.
- Ban and Rudin (2018) Ban, G.-Y. and Rudin, C. (2018). The big data newsvendor: Practical insights from machine learning. Operations Research, 67(1), 90–108.
- Bartl et al. (2020) Bartl, D., Drapeau, S., Obloj, J. and Wiesel, J. (2020). Robust uncertainty sensitivity analysis. arXiv:2006.12022.
- Bawa (1975) Bawa, V. S. (1975). Optimal rules for ordering uncertain prospects. Journal of Financial Economics, 2(1), 95–1.
- Blanchet et al. (2022) Blanchet, J., Chen, L. and Zhou, X. (2022). Distributionally robust mean-variance portfolio selection with Wasserstein distances. Management Science, 68(9), 6382–6410.
- Blanchet and Kang (2021) Blanchet, J. and Kang, Y. (2021). Sample out-of-sample inference based on Wasserstein distance. Operations Research, 69(3), 985–1013.
- Blanchet et al. (2019) Blanchet, J., Kang, Y. and Murthy, K. (2019). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3), 830–857.
- Blanchet et al. (2022) Blanchet, J., Murthy, K. and Si, N. (2022). Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109(2), 295–315.
- Breiman (1996) Breiman, L. (1996). Bias, variance, and arcing classifiers (Technical Report 460). Statistics Department, University of California.
- Cai et al. (2023) Cai, J., Li, J. Y. M. and Mao, T. (2023). Distributionally robust optimization under distorted expectations. Operations Research, forthcoming.
- Chen and Paschalidis (2018) Chen, R. and Paschalidis, I. C. (2018). A Robust Learning Approach for Regression Models Based on Distributionally Robust Optimization. Journal of Machine Learning Research, 19(1), 1–48.
- Chen et al. (2011) Chen, L., He, S. and Zhang, S. (2011). Tight Bounds for Some Risk Measures, with Applications to Robust Portfolio Selection. Operations Research, 59(4), 847–865.
- Chew et al. (1987) Chew, H. C., Karni, E., and Safra, Z. (1987). Risk aversion in the theory of expected utility with rank dependent probabilities. Journal of Economic theory, 42(2), 370–381.
- Chu et al. (2024) Chu, H., Lin, M. and Toh, K. C. (2024). Wasserstein distributionally robust optimization and its tractable regularization formulations. arXiv:2402.03942.
- Drucker et al. (1997) Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. and Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 28(7), 779–784.
- Esfahani and Kuhn (2018) Esfahani, P. M. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1), 115–166.
- Fishburn (1977) Fishburn, P. C. (1977). Mean-risk analysis with risk associated with below target returns. American Economic Review, 67(2), 116–126.
- Gao (2022) Gao, R. (2022). Finite-Sample Guarantees for Wasserstein Distributionally Robust Optimization: Breaking the Curse of Dimensionality. Operations Research, 71(6), 2291–2306.
- Gao et al. (2017) Gao, R., Chen, X. and Kleywegt, A. J. (2017). Distributional robustness and regularization in statistical learning. arXiv:1712.06050.
- Gao et al. (2022) Gao, R., Chen, X. and Kleywegt, A. J. (2022). Wasserstein Distributionally Robust Optimization and Variation Regularization. Operations Research, 72(3), 1177–1191.
- Gotoh and Uryasev (2017) Gotoh, J. and Uryasev, S. (2017). Support vector machines based on convex risk functions and general norms. Annals of Operations Research, 249, 301–328.
- Krokhmal (2007) Krokhmal, P. (2007). Higher moment coherent risk measures. Quantitative Finance, 7(4), 373–387.
- Kuhn et al. (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A. and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: theory and applications in machine learning. INFORMS TutORials in Operations Research, 130–166.
- Lee and Mangasarian (2001) Lee, Y. J. and Mangasarian, O. L. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20, 5–22.
- Lee et al. (2005) Lee, Y. J., Hsieh, W. F. and Huang, C. M. (2005). $\varepsilon$-SSVR: A smooth support vector machine for $\varepsilon$-insensitive regression. IEEE Transactions on Knowledge and Data Engineering, 17(5), 678–685.
- Mohri et al. (2018) Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
- Olea et al. (2022) Olea, J. L. M., Rush, C., Velez, A. and Wiesel, J. (2022). The out-of-sample prediction error of the square-root-LASSO and related estimators. arXiv:2211.07608.
- Quiggin (1982) Quiggin, J. (1982). A theory of anticipated utility. Journal of Economic Behavior & Organization, 3(4), 323–343.
- Rockafellar and Uryasev (2002) Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26(7), 1443–1471.
- Rockafellar et al. (2008) Rockafellar, R. T., Uryasev, S. and Zabarankin, M. (2008). Risk tuning with generalized linear regression. Mathematics of Operations Research, 33(3), 712–729.
- Rockafellar and Uryasev (2013) Rockafellar, R. T. and Uryasev, S. (2013). The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surveys in Operations Research and Management Science, 18(1-2), 33–53.
- Schölkopf et al. (1998) Schölkopf, B., Bartlett, P., Smola, A. and Williamson, R. (1998). Support vector regression with automatic accuracy control. In ICANN 98; Springer: Heidelberg, Germany, 111–116.
- Schölkopf et al. (2000) Schölkopf, B., Smola, A. J., Williamson, R. C. and Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12(5), 1207–1245.
- Schmeidler (1989) Schmeidler, D. (1989). Subjective probability and expected utility without additivity. Econometrica, 57(3), 571–587.
- Shafieezadeh-Abadeh et al. (2023) Shafieezadeh-Abadeh, S., Aolaritei, L., Dörfler, F. and Kuhn, D. (2023). New perspectives on regularization and computation in optimal transport-based distributionally robust optimization. arXiv:2303.03900.
- Shafieezadeh-Abadeh et al. (2019) Shafieezadeh-Abadeh, S., Kuhn, D. and Esfahani, P. M. (2019). Regularization via mass transportation. Journal of Machine Learning Research, 20, 1–68.
- Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge.
- Sim et al. (2021) Sim, M., Zhao, L. and Zhou, M. (2021). Tractable robust supervised learning models. SSRN:3981205.
- Suykens and Vandewalle (1999) Suykens, J. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
- Volpi et al. (2018) Volpi, R., Namkoong, H., Sener, O., Duchi, J. C., Murino, V. and Savarese, S. (2018). Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 5334–5344.
- Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-asymptotic Viewpoint. Volume 48 (Cambridge University Press).
- Wang et al. (2020) Wang, R., Wei, Y. and Willmot, G. E. (2020). Characterization, robustness and aggregation of signed Choquet integrals. Mathematics of Operations Research, 45(3), 993–1015.
- Wozabal (2014) Wozabal, D. (2014). Robustifying convex risk measures for linear portfolios: A nonparametric approach. Operations Research, 62(6), 1302–1315.
- Yaari (1987) Yaari, M.E. (1987). The dual theory of choice under risk. Econometrica, 55(1), 95–115.
- Zhang et al. (2024) Zhang, L., Yang, J. and Gao, R. (2024). Optimal robust policy for feature-based newsvendor. Management Science, 70(4), 2315–2329.
Appendix A Proofs of Section 3
Before presenting the proofs of Section 3, we first introduce a lemma inspired by the optimal coupling (see, e.g., Theorem 4.1 of Villani (2009)); we thank a referee for bringing this to our attention. This lemma will be used throughout the proofs in the appendix.
For , a function is a cost function if for all . For a given cost function , define for two distributions on .
Lemma 1.
Let with be a lower semicontinuous cost function. Given two distributions on and , there exists such that . As a result, we have for ,
(26) |
Proof. Note that by the optimal coupling (see e.g., Theorem 4.1 of Villani (2009)), there exists and such that Since , by Theorem 8.17 of Kallenberg (2021), there exists such that , and thus, This completes the proof.
Consider the type- Wasserstein balls and defined by (5) and (8), respectively. By Lemma 1, for , it holds that
The two formulas above will be used frequently throughout the appendix, and we may apply them directly without referencing Lemma 1.
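The projection step in Lemma 1 is easy to explore numerically. Below is a minimal sketch (purely illustrative, not part of the proof) that, for a fixed affine decision rule, compares two joint empirical samples through the type-1 Wasserstein distance of their one-dimensional projections; the synthetic data, the chosen direction, and the use of scipy.stats.wasserstein_distance are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D type-1 Wasserstein distance

rng = np.random.default_rng(0)

# Illustrative joint samples (y, x) with y in {-1, +1} and x in R^3 (assumed data).
n, d = 200, 3
x1 = rng.normal(size=(n, d));           y1 = rng.choice([-1.0, 1.0], size=n)
x2 = rng.normal(0.3, 1.2, size=(n, d)); y2 = rng.choice([-1.0, 1.0], size=n)

# A fixed affine decision rule, projected as y * (<beta, x> + beta0) (assumed form).
beta, beta0 = np.array([1.0, -0.5, 0.2]), 0.1
proj1 = y1 * (x1 @ beta + beta0)
proj2 = y2 * (x2 @ beta + beta0)

# Type-1 Wasserstein distance between the one-dimensional projected distributions.
print("W1 of projections:", round(wasserstein_distance(proj1, proj2), 4))
```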
Proof of Proposition 1. Note that
where the first equality is due to Lemma 1, the second equality follows from the definition in (4) which implies almost surely for any satisfying , and the third equality follows from as . Additionally, it follows from Lemma 1 that
it holds that obviously. To see the converse inclusion, note that for any , there exists such that and Taking , we have and
Hence, we have , and thus . Therefore, we have . This completes the proof of (10). To show (11), denote by
We first verify that . With the aid of (10), we can verify this similarly to the proof of Theorem 5 in Mao et al. (2022). For completeness, we give a proof of here. The case of is trivial, and we assume now . Note that
(27) |
where the last equality follows from the definition in (4) and Lemma 1. On one hand, for , there exists such that and . Thus, we have
(28) |
where we use Hölder’s inequality in the first inequality. This means , and hence, . On the other hand, for , using Lemma 1 yields that there exists such that . It follows from the definition of dual norm that there exists such that and . Define
It holds that , and thus,
It follows from (A) that . Noting that and , we have . Hence, we conclude . This completes the proof of .
It remains to prove that . For , denote
and thus, For and using (A), there exists such that and . It holds that
This implies , and hence, . Suppose now . There exists such that and . It follows that
where the second inequality follows from . This implies
Hence, we have , which yields . This completes the proof. ∎
The proof of Theorem 1 relies on our projection result in Proposition 1 and a concentration result for the max-sliced Wasserstein distance in Theorem 3 of Olea et al. (2022), which is given as follows.
Lemma 2 (Theorem 3 of Olea et al. (2022)).
Let and . Suppose that satisfies for some and is defined in Theorem 1. Then, we have
where is the empirical distribution of
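As a rough numerical companion to Lemma 2, the sketch below estimates a max-sliced type-1 Wasserstein distance between an empirical sample and a large reference sample standing in for the underlying distribution, by maximizing the projected one-dimensional distance over randomly drawn unit directions; the Gaussian model, the sample sizes, and the crude random search over directions are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
d, n, n_ref, n_dirs = 5, 300, 20_000, 500

sample = rng.normal(size=(n, d))          # empirical sample (assumed Gaussian model)
reference = rng.normal(size=(n_ref, d))   # large sample standing in for the true law

def max_sliced_w1(a, b, n_dirs, rng):
    """Crude estimate: maximize the projected 1-D W1 distance over random unit directions."""
    best = 0.0
    for _ in range(n_dirs):
        u = rng.normal(size=a.shape[1])
        u /= np.linalg.norm(u)
        best = max(best, wasserstein_distance(a @ u, b @ u))
    return best

print("estimated max-sliced W1:", round(max_sliced_w1(sample, reference, n_dirs, rng), 4))
```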
Proof of Theorem 1. Denote by , , where , and . It is clear that is an empirical distribution of . It holds that
where the second equality follows from by (ii) of Proposition 1 and we have used Lemma 2 in the last inequality. This completes the proof.
Proof of Theorem 2.
Let . By the definition of , it follows that
where the second inequality follows from (11) in Proposition 1, which states that
In particular, we have for any
(29) | ||||
(30) |
Therefore, we have
(31) |
where the first inequality is due to , the second inequality is due to Assumption 1, and the two equalities come from (29) and (30), respectively. Combining (A) and Theorem 1 with type- Wasserstein ball yields the desired result.
Appendix B Proofs of Section 4
For ease of exposition, we introduce the following notation. For a random variable , we define . We use to represent the indicator function, i.e., if , and otherwise. The sign function on is defined as
To prove the results in Section 4, we need the following auxiliary lemmas. The first lemma follows immediately from Proposition 1.
Lemma 3.
Given , we have
where and is the distribution of .
Based on Lemma 3, we present the following lemma, which will be used in the proofs of the main results.
Lemma 4.
Let and . For , the following statements are equivalent.
-
(i)
For any , and ,
(32) -
(ii)
For any and ,
(33)
Proof. (i) $\Rightarrow$ (ii): For , take with , and let be the distribution of . It holds that , and thus, the left-hand side of (32) reduces to
where the second equality follows from Lemma 3. Note that the right-hand side of (32) reduces to
Combining the above two equations with (32) yields (33). This completes the proof of the implication (i) $\Rightarrow$ (ii).
Remark 5.
B.1 Proofs of Section 4.1
Lemma 5.
Let , and . Define . We have for all .
Proof. Using Chebyshev's inequality, we have which implies for all . Hence, for any , it holds that
where the first inequality follows from Hölder’s inequality. This completes the proof. ∎
Proof of Theorem 3. Throughout the proof, we assume without loss of generality. By Lemma 4, it suffices to show that (ii) holds if and only if the following (i)’ holds.
-
(i)’
For any and , we have (33) holds, i.e.,
(34)
To see (ii)(i)’, note that and thus,
and
The inequality reduces to equality if . Hence, we complete the proof of (ii) $\Rightarrow$ (i)'.
To see (i)’(ii), we first show that . Otherwise, by that a convex function has derivative almost everywhere, there exists such that . If , then we have
If , then we have
Those two cases both yield a contradiction to (34).
Next, we aim to verify the following fact
(35) |
This will complete the proof since a convex function that satisfies (35) must be one of the forms of and with . To see (35), assume by contradiction that there exists such that . We first consider the case . In this case, we have
where the strict inequality follows from and , which yields a contradiction. Suppose now . Define
We have that for all by the convexity of , and thus,
where the second inequality follows from . By defining
(36) |
we can rewrite as with
Since and , we know that . Hence, it holds that
(37) |
where the second inequality follows from Lemma 5. Further note that
(38) |
where the first inequality follows from the definition of , and the second one holds because implies . Combining (37) and (B.1), we conclude that
This yields a contradiction. Hence, (35) is verified, and thus we complete the proof.
To prove Theorem 4, we begin by introducing two auxiliary lemmas. Specifically, Lemma 7 will be employed in the proof of Theorem 4, and Lemma 6 serves to establish Lemma 7.
Lemma 6.
For , and , the following problem has a positive optimal value.
(39) | ||||
Proof. Note that Problem (39) has two moment constraints. By Smith (1995), the optimal value of Problem (39) equals that of
(40) | ||||
where is the set of all random variables that take at most three values. Let be a feasible solution of Problem (40). Denote by , and let be a random variable such that
(41) |
with the conditions that , and . It is straightforward to check that and for all . It follows from Theorem 3.A.1 of Shaked and Shanthikumar (2007) that . (For two random variables and , is said to be smaller than in the concave order, denoted by , if for any concave function .) This implies that since is a concave function on . Therefore, is also a feasible solution of Problem (40). In addition, we have
Hence, to solve Problem (40), it suffices to consider all random variables whose distributions take the form of (41), i.e., with , and . The corresponding problem can be represented as follows:
(42) |
Note that whenever is a feasible solution of Problem (42), and thus, Problem (42) is equivalent to
Further, letting yields the following equivalent problem
(43) |
Let be a feasible solution of Problem (43). It holds that and and . Substituting into the second constraint of Problem (43) yields
which implies . Hence, the optimal value of Problem (43) is no less than which is a positive value. This completes the proof. ∎
Lemma 7.
Let , and . For , the following statements hold.
-
(i)
If , then .
-
(ii)
If and , then there exists that only depends on such that . In particular, if is an integer, then .
Proof. (i) Suppose that and denote by . It holds that and
where we have used Jensen’s inequality in the first inequality by noting that is concave on .
(ii) The case that is an integer can be verified by the following direct calculations:
where the inequality holds because implies for .
For the general case , it suffices to verify that the optimal value of the following problem is smaller than :
(44) |
We first assert that Problem (44) is equivalent to
(45) |
To see this, let be a feasible solution of Problem (44) with . Let be a uniform random variable on and for , define
It is straightforward to check that for all . Moreover, it holds that
(46) |
Note that the mapping is continuous as is continuous, and if . Combining with (46), there exists such that . Moreover, one can check that for any , for all . It follows from Theorem 2.5 of Bäuerle and Müller (2006) that . Hence, we have as is a concave function on . Therefore, Problem (44) is equivalent to Problem (45), which is further equivalent to
(47) |
where is a strictly concave function on . It remains to verify that the optimal value of Problem (47) is strictly less than .
To see this, let be a feasible solution of Problem (47). Define
where and satisfy with . Note that by Lemma 6, there exists that only depends on such that . By definition of and , this implies and . Therefore, we have
(48) |
Further note that and also note that is concave on . This implies
where the second inequality is due to the concavity of , and (48); and the strict inequality is due to and the strict concavity of . Noting that is independent of the random variable , this completes the proof.
∎
Proof of Theorem 4.
(ii) $\Rightarrow$ (i): By Lemma 4, it suffices to show that (33) holds for any and . For , we have
(49) |
If , then all the inequalities are equalities, and the maximizer can be chosen as with some such that . If a.s., then we take such that has distribution , and where is the left-quantile function of for all . We have and
Combining with (B.1), we verify that (33) holds with for all a.s. This completes the proof of the case . For , the proof is similar to that of . For , we have
(50) |
If , then one can check that all inequalities are equalities, and the maximizer can be chosen as with some such that . If a.s., then we have . Taking a sequence as shown in the case of , i.e., with distribution , it holds that for large enough such that ,
Combining with (B.1), we have that (33) holds with for all a.s. This completes the proof of the case . For , we have
where all inequalities can hold as equalities, and the maximizer can be chosen as for some such that . Hence, we conclude that (33) holds with and complete the proof of (ii) $\Rightarrow$ (i).
(i) $\Rightarrow$ (ii): The proof of this direction is in the same spirit as that of Theorem 3. Assume without loss of generality that . By Lemma 4, we have for any and ,
(51) |
By the same arguments as in the proof of Theorem 3, we can show that . Next, assume that is differentiable at when we use the notation , and we show the following facts.
(52) |
This will complete the proof since one can check that (52) implies that has one of the forms of , , and with . To see (52), we assume by contradiction that there exists such that and (note that implies ). Define
and it holds that for all as is nonnegative and convex. Further, noting that and , we have and
(53) |
Therefore,
(54) | ||||
(55) | ||||
where (54) holds because is nonnegative with , and (55) follows from (53). Recalling and defined by (36) in the proof of Theorem 3, we can rewrite with
By Lemma 5 and the definition of , we have
(56) |
It holds that
(57) |
where the strict inequality follows from (56) and Statement (ii) of Lemma 7 by noting that . For , we have
(58) |
where the second inequality follows from Statement (i) of Lemma 7, and the third inequality is due to the definition of . Combining (57) and (B.1), we have
which yields a contradiction to (51). Hence, (52) holds, which completes the proof.
As outlined in Remark 2, we detail below the steps to verify that for the function listed in (ii) of Theorems 3 and 4, the regularization version given by Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) can be reformulated as the corresponding regularization version in our framework. To be specific, Proposition 3.9 of Shafieezadeh-Abadeh et al. (2023) states that for that is proper, upper-semicontinuous and satisfies for some and all ,
where and is the dual norm of . When is given by (ii) of Theorem 3, we have
(59) |
when is given by (ii) of Theorem 4, we have
(60) |
Proof. We only give the proof of (59) for the case for some and the proof of (60) for for some and , as the other cases can be proved similarly.
In the case that , the function reduces to , where for and . We first show
(61) |
by considering the following two cases:
-
(i)
If , then we have where
For , by calculating the derivative of for , one can verify that
if , and
otherwise. For , since is increasing on , we have
Similarly, by calculating the derivative of for , we have
Note that . Comparing , and , it is straightforward to check that and , and hence,
-
(ii)
If , then by similar analysis, we obtain
Combining the above two cases, we have (61) holds. Substituting (61) into the left-hand side of (59) yields
This completes the proof of (59).
In the case that , that is, , we have
where and
It is straightforward to check that if . Hence, the constraint of can be replaced by . Let now and denote by . By considering the first-order condition, we have
and
where we use the convention that if . Comparing , and , we have
Denote by . Substituting the above equation into the left-hand side of (60) and noting that the constraint of can be replaced by yield
This completes the proof of (60). ∎
The proof of Corollary 1 relies on the following lemma and the regularization results in Theorem 4. Recall that
Lemma 8.
For any , and , the following two statements hold.
-
(i)
Suppose that is nonnegative on , and convex in with for all , and Lipschitz continuous in for all with a uniform Lipschitz constant, i.e., there exists such that
Then we have
-
(ii)
Suppose that is convex in with and for all , and Lipschitz continuous in for all with a uniform Lipschitz constant. Then we have
Proof. (i) First, we show three facts below. (a) is concave in for all ; (b) is convex in for all ; (c) for all . The fact (a) is trivial. For , and , it holds that
where the first step holds because is nonnegative, and the second follows from the triangle inequality. This implies (b). To see (c), it is obvious that . Note that Combining with and the convexity of in , we have . Hence, we conclude the proof of (c). Using (b) and (c), the set of all minimizers of the problem is a closed interval. Denote by . We will show that is a subset of a compact set. For any and , let and , and we have
(62) |
where the first and the third inequalities follow from the triangle inequality and Hölder’s inequality, respectively, and we have used the definition of the Wasserstein ball in the last step. Hence, it holds that
(63) |
Note that as . There exists such that for all . This, combined with (B.1), implies that
(64) |
Applying (63) and (64), we have . Using a minimax theorem (see e.g., Sion (1958)), it holds that
The converse direction is trivial. Hence, we complete the proof.
(ii) The proof is similar to (i). ∎
Proof of Corollary 1. (i) For the case with or in Theorem 4, one can verify that satisfies the conditions in Lemma 8 (ii). Hence, the equivalency result follows immediately from Lemma 8 and Theorem 4. For the case with or in Theorem 4, we assume without loss of generality that and . In this case, we have
B.2 Proofs of Section 4.2
To prove the results in Section 4.2, we list the definition of comonotonicity and some basic properties of distortion functionals as follows; see, e.g., Wang et al. (2020).
Definition 1 (Comonotonicity).
Two random variables and are said to be comonotonic if is distributionally equivalent to , where denotes the left-quantile function of , and is a random variable uniformly distributed on the interval (see e.g., Dhaene et al. (2002) for a discussion of comonotonic random variables).
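The quantile-based construction in Definition 1 is straightforward to reproduce on simulated data. The following sketch builds a comonotonic pair from a common uniform random variable and checks that Expected Shortfall (used here only as a concrete example of a functional that is additive for comonotonic risks, in line with property (iv) of Lemma 9 below) is approximately additive on the pair; the chosen distributions and the empirical Expected Shortfall formula are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Comonotonic pair: both variables are increasing transforms (left-quantiles) of one uniform U.
U = rng.uniform(size=n)
X = np.quantile(rng.normal(1.0, 2.0, size=n), U)      # quantile function of a normal sample
Y = np.quantile(rng.lognormal(0.0, 0.5, size=n), U)   # quantile function of a lognormal sample

def expected_shortfall(z, alpha=0.9):
    """Average of the worst (1 - alpha) fraction of outcomes (larger = less preferable)."""
    q = np.quantile(z, alpha)
    return z[z >= q].mean()

lhs = expected_shortfall(X + Y)
rhs = expected_shortfall(X) + expected_shortfall(Y)
print(f"ES(X + Y) = {lhs:.4f},  ES(X) + ES(Y) = {rhs:.4f}")  # approximately equal
```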
Lemma 9.
Let be a distortion function. The following statements hold.
-
(i)
is monotone, i.e., for a.s. if is increasing.
-
(ii)
is translation invariant, i.e., for any and .
-
(iii)
is positively homogeneous, i.e., for any and .
-
(iv)
is subadditive, i.e., for any and if is convex. The equality holds when and are comonotonic.
Proof of Theorem 5. We first consider the case . Assume without loss of generality that . By Lemma 4, it suffices to prove that for any and ,
To see it, first note that
(65) |
where the first inequality follows from and the monotonicity of , the second inequality follows from the subadditivity of , and the last inequality is due to Hölder’s inequality. Let us now verify the other direction. Suppose that where is the left-derivative of . Denote by a uniform random variable on such that and are comonotonic. For and , define
(66) |
One can check that . Moreover, we have
(67) |
where the inequality follows from the dual representation of (see e.g., Theorem 4.79 of Föllmer and Schied (2016)). Noting that must be uniformly bounded on for all as and are comonotonic, there exists such that on for all as is convex. Therefore, we have
where the inequality follows from the convexity of and on , and the convergence is due to . Note that is left continuous on (see e.g., Proposition A.4 of Föllmer and Schied (2016)). It holds that . Therefore, letting and combining with (B.2), we conclude that
Hence, we have verified the other direction for the case of . If , the proof is similar by letting . Hence, we complete the proof of the case .
Let us now focus on the case . In the rest of the proof, we assume and without loss of generality. By Lemma 4, it suffices to show that (ii) holds if and only if the following (i)’ holds.
-
(i)’
For any and , we have (33) holds, i.e.,
(68)
To see (ii) (i)’, we note that , where for . It is easy to see that is convex whenever is convex. Hence, (a) is a special case of Proposition 2, and we choose to omit its proof here for brevity. In the case of (b), we assume without loss of generality that for . Note that is an increasing and convex distortion function. By Lemma 9, we have for any with ,
(69) |
where the second inequality follows from Hölder’s inequality. Suppose that a.s. where is a uniform random variable on . Define One can verify that and . Therefore, we conclude that (68) holds. Hence, we complete the proof of (i)’.
To see (i)’ (ii), note that is increasing with . It holds that satisfies monotonicity and translation invariance, i.e., for all and . We first show that by contradiction. Otherwise, by that a convex function has derivative almost everywhere, there exists such that . In this case, define with quantile function as and one can easily check that and . Hence,
If , then define , and we have and . Hence,
Those two cases both yield a contradiction to (68).
Next, we aim to show
(70) |
as this will complete the proof since a convex function that satisfies (70) must be one of the forms of and with . To show (70), we assume by contradiction that there exists such that . If , then we have
where the strict inequality follows from and , which yields a contradiction. Suppose now . To yield a contradiction, it suffices to prove that
(71) |
Let be a uniformly distributed random variable, and define
We have that and the random variables in are all comonotonic. Define
(72) |
It follows from Chebyshev’s inequality that
(73) |
Define
We have that for all as is convex. Further, it holds that as and . Therefore,
(74) |
where the second inequality follows from , the fourth equality is due to the distribution-invariance of , and the fifth equality holds because implies that and are comonotonic. For , define
(75) |
We can rewrite with
Below we aim to demonstrate that for by selecting an appropriate . Note that
(76) |
where the second inequality follows from Hölder’s inequality and the definition of in (75), the last inequality is due to (73), and the last equality follows from the definition of in (72). Denote by
Recalling the definition of in (72) again, we have and , and hence, whenever . For , we have
where the first inequality follows from Hölder’s inequality. Hence, we have that . Combining with (B.2), we have verified (71). This completes the proof.
To substantiate our claim that our analysis is not limited to the examples presented in this paper, we present the following proposition to demonstrate that the arguments in the proof of Theorem 5 can be applied to establish the equivalence between Wasserstein DRO and regularization for a class of non-convex loss functions, when the support of the reference distribution is a finite set.
Proposition 3.
For , let be given in Theorem 5. If the support of is finite, and with satisfies for all or for all , then, for and
Proof. Without loss of generality, assume . We only give the proof for the case for , as the other case is similar. By Remark 5, it suffices to show that (33) holds with and where . For and , define and as (66). First note that both (B.2) and (B.2) hold for . Noting that the support of is finite, we denote it by . For , it holds that
Letting and combining this with (B.2), we conclude that
Hence, (33) holds with and , and thus we complete the proof. ∎
Proof of Proposition 2. By Lemma 4, it suffices to prove that
(77) |
By Lemma 9, we have for any ,
(78) |
where the second inequality follows from Hölder’s inequality. Suppose that a.s., where is a uniform random variable on (see Lemma A.28 of Föllmer and Schied (2016) for the existence of ). We show (77) by considering the following two cases.
-
(a)
If , then and , where the second equality follows from the fact that is increasing. We first assume that . Define a sequence such that . For all , it holds that , and and are comonotonic. Hence, we have
(79) Combining (78) and ((a)), we have verified (77). Assume now . We can construct a sequence as , and then follow the same analysis as the previous argument to verify the result.
- (b)
Combining (a) and (b), we complete the proof of (77), and thus complete the proof.
Appendix C Proofs of Section 6
Before presenting the proofs of the results in Section 6, we first highlight some key elements required in our analysis for the non-affine case. As stated in Section 6, while our analysis for the non-affine case continues to rely on a projection set perspective, the equivalence between the projection sets of a Wasserstein ball and a max-sliced Wasserstein ball no longer holds. Instead, we explore the possibility of first establishing a confidence bound for a fixed decision by identifying a set inclusion relationship with a one-dimensional Wasserstein ball and leveraging its measure concentration property (Lemma 10). We then extend these results to derive generalization bounds that hold uniformly across all using covering number techniques (see e.g., Gao (2022) and Section 3.5 of Mohri et al. (2018)). For , a -cover of , denoted by , is a subset of such that for each , there exists satisfying , where is a norm on . The covering number of with respect to is the smallest cardinality of such a -cover of .
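To make the covering-number device concrete, the following sketch constructs a $\delta$-cover of (a sampled approximation of) a Euclidean ball by a greedy procedure and compares its cardinality with the standard bound $(1+2r/\delta)^d$ invoked later via Example 5.8 of Wainwright (2019); the radius, the dimension, and the greedy construction are illustrative assumptions, not the cover used in the proofs.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, delta = 2, 1.0, 0.25

# A fine random cloud approximating the Euclidean ball of radius r in R^d.
m = 20_000
directions = rng.normal(size=(m, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = r * rng.uniform(size=(m, 1)) ** (1.0 / d)   # uniform-in-volume radii
pts = directions * radii

# Greedy delta-cover: keep a point, discard everything within distance delta, repeat.
cover = []
remaining = pts
while len(remaining) > 0:
    center = remaining[0]
    cover.append(center)
    remaining = remaining[np.linalg.norm(remaining - center, axis=1) > delta]

print("greedy cover size:", len(cover))
print("standard bound (1 + 2r/delta)^d:", (1 + 2 * r / delta) ** d)
```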
Lemma 10 (Theorem 2 of Fournier and Guillin (2015)).
Let and . For , suppose that for some and is the empirical distribution based on the independent sample drawn from . Then, we have
where
(80) |
and are positive constants that only depend on , , and .
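The measure-concentration phenomenon behind Lemma 10 can also be visualized directly. The sketch below simulates the one-dimensional type-1 Wasserstein distance between an empirical measure and a large reference sample standing in for the true distribution as the sample size grows; the Gaussian model and the Monte Carlo averaging are assumptions made purely to illustrate the decay in the sample size.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
reference = rng.normal(size=100_000)   # large sample standing in for the true distribution

for n in [50, 200, 800, 3200, 12800]:
    dists = [wasserstein_distance(rng.normal(size=n), reference) for _ in range(20)]
    print(f"n = {n:6d}   average W1(empirical, reference) = {np.mean(dists):.4f}")
```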
With these tools in place, we now proceed to the proofs for the generalization results in Section 6.
C.1 Proof of Theorem 6
We proceed in three steps. First, we verify the following set inclusion result:
(81) |
under Assumption 3, where and is the projection set defined by (23) in Section 6:
(82) |
Second, using the set inclusion result (81) and the concentration result for the one-dimensional Wasserstein ball (Lemma 10), we establish a confidence bound for a fixed . Finally, we apply covering number techniques to derive union bounds from the confidence bounds established in the second step, overcoming the challenge posed by the non-finiteness of the set .
Step 1: Proving the set inclusion result (81). We first give a lemma which will be used in the proof of the set inclusion result (81).
Lemma 11.
Let and be an increasing function with . If is convex (resp. concave), then the function is convex (resp. concave) on .
Proof. We provide the proof of convexity only for the case where is twice differentiable, as the concavity case is analogous and the general case can be established by approximating with a twice differentiable convex function. Denote by , . By standard calculations, we have
where we have denoted by , the inequality follows from and , means that and have the same sign, and comes from . By and the convexity of , we have that , that is, . It then follows that , that is, is convex. This completes the proof. ∎
Proof. For , let . There exists such that and . Our aim is to construct a random vector such that and as this implies that . Denote by and we have
(83) |
We first assert that there exist measurable mappings and such that
To see this, we apply a measurable selection theorem (Theorem 3.5 in Rieder (1978)) to show the existence of ; that of is similar. Define
where and . Further define
Let be the projection set of on the first argument. The first three conditions in Theorem 3.5 of Rieder (1978) are trivial in our setting. To see the last condition, for and , we have
Note that is a compact set and the continuity of implies that is a closed set. Hence, we conclude that is a compact set, which verifies the last condition in Theorem 3.5 of Rieder (1978). This implies the existence of a measurable maximizer . Denote by and . We define as
Since and are measurable, is measurable. Note that , and hence,
(84) |
By Assumption 3, we have
and
Note that is continuous. There exists a random variable taking values on such that
(85) |
Define . It follows from (85) that
(86) |
Moreover, we have
where the first inequality is due to (84); the second inequality follows from Jensen’s inequality and the concavity of the function by Lemma 11; and the last inequality follows from (83). Hence, . Combining this with (86) yields . This completes the proof. ∎
Step 2: Establishing confidence bounds for fixed . Based on Lemma 12 and applying Lemma 10, we are able to derive the following confidence bounds under Assumptions 2 and 3.
Lemma 13.
Proof.
Step 3: Union bounds for . Now we are ready to prove union bounds in Theorem 6.
Proof of Theorem 6. Note that for any and , we have
(87) |
where the three inequalities follow from Assumption 5, Assumption 4 and the triangle inequality for the -norm, respectively. Then, it holds that
(88) |
where the second inequality follows from (C.1), and the last inequality holds because for any with , which further implies
For and , it holds that implies . Hence, we have
(89) |
where the second inequality follows from Chebyshev’s inequality. Let and be an -cover of with respect to the norm . Denote by , and , where is defined by (80) in Lemma 10. It holds that
(90) | |||
(91) | |||
where the first inequality follows from (C.1), the last inequality is due to Lemma 13, and the second inequality holds because if the event in (90) happens with some , then there exists such that and
where the first and the last inequalities follow from (C.1) and (C.1), respectively, which implies that the event in (91) happens. Hence, we have
(92) |
We recall that with . Note that is contained in where is a unit ball of . For any , we have (see Example 5.8 of Wainwright (2019))
Let be defined in Theorem 6. Noting that is increasing and is decreasing, we obtain
Combining with (92), we complete the proof.
C.2 Proof of Theorem 7
The proof of Theorem 7 closely follows the three-step structure used for Theorem 6, with a key difference in the first step. Specifically, establishing the inclusion relation (81) becomes challenging because Assumption 6 is weaker than Assumption 3. However, as shown in the first step of the proof below, the “full” inclusion relation (81) is unnecessary; instead, only a partial inclusion relation is required. Specifically, the worst-case risk evaluated over the full Wasserstein ball coincides with that evaluated over a subset of the Wasserstein ball, and this subset can be shown to belong to the projection set. This enables the derivation of confidence bounds for fixed in the second step, and subsequently, the generalization bounds using the same covering number arguments as in Theorem 6.
Step 1. Proving partial set inclusion and coincidence of worst-case risk. For and , define a subset of a one-dimensional Wasserstein ball as
(93) |
where denotes the usual stochastic order (see Shaked and Shanthikumar (2007)), also known as first-order stochastic dominance, and is defined as: for two distributions on , if for all . We aim to show the following set inclusion result:
(94) |
where , is defined by (93), and
(95) |
Proof. Let . For , we have and . Let be a uniform random variable on such that almost surely (see Lemma A.28 of Föllmer and Schied (2016) for the existence of ). Let and . Since , we have . Moreover, it holds that
(96) |
where the inequality is due to . Using the similar arguments in the proof of Lemma 12 where a measurable selection theorem in Rieder (1978) has been used, we know that there exists a measurable mapping such that
Using Assumption 6, we have
Note that is continuous. There exists a random variable taking values on such that
(97) |
Define . By (97), we have . By , we have
where the second inequality follows from Jensen’s inequality and the concavity of the mapping by Lemma 11, and the last inequality is due to (96). Hence, we have , which implies . This completes the proof.∎
Below we demonstrate that the worst-case values of a monotone risk mapping are identical under and for any and .
Lemma 15.
Let and suppose that satisfies monotonicity. For and , we have
(98) |
Proof. Note that . It suffices to show
(99) |
For any , define for . It holds that for , and thus,
where the last inequality is due to . This means . This, together with , that is, , implies . By the monotonicity of , we have as . Therefore, we have (99) holds, and thus, we complete the proof. ∎
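The usual stochastic order used in this step is simple to check empirically. The sketch below compares two empirical distribution functions on a grid and confirms that a monotone functional (an empirical Expected Shortfall, chosen here only as an illustrative example) respects the order, mirroring the monotonicity argument in Lemma 15; the location-shifted Gaussian samples and the tolerance used in the check are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
sample1 = rng.normal(0.0, 1.0, size=n)   # sample from F1
sample2 = rng.normal(0.5, 1.0, size=n)   # sample from F2; F1 is dominated by F2

# Empirical check of first-order stochastic dominance: the CDF of F2 lies below that of F1.
grid = np.linspace(-4.0, 5.0, 200)
cdf1 = np.searchsorted(np.sort(sample1), grid, side="right") / n
cdf2 = np.searchsorted(np.sort(sample2), grid, side="right") / n
print("F1 <=_st F2 on the grid (up to sampling noise):", bool(np.all(cdf2 <= cdf1 + 1e-2)))

def expected_shortfall(z, alpha=0.9):
    q = np.quantile(z, alpha)
    return z[z >= q].mean()

# A monotone functional preserves the order: ES under F1 should not exceed ES under F2.
print("ES(F1) =", round(expected_shortfall(sample1), 3),
      " ES(F2) =", round(expected_shortfall(sample2), 3))
```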
Step 2: Establishing confidence bounds for fixed . Similar to Lemma 13 in Step 2 for Theorem 6, we can obtain the confidence bounds for fixed as stated in the following lemma.
Lemma 16.
Proof. Denote by and , where . Note that by Lemma 14, we have
where and Therefore, we have
(100) |
where the equality follows from Lemma 15 as is monotone. It then follows that
where the first inequality follows from (100). Substituting into the above inequalities, we have
where the last inequality follows from Lemma 10, similar to the proof of Lemma 13. This completes the proof. ∎
Step 3: Union bounds for . Now we are ready to prove the last step, union bounds, for Theorem 7.
C.3 Proof of Theorem 8
Proof of Theorem 8. Note that for
where the second inequality follows from Assumption 1 and the last inequality is due to the Lipschitz continuity of with respect to -norm, which implies . By symmetry, we conclude that
Hence, Assumption 5 holds with and being replaced by . Further, let , and it holds for any that
(101) |
where the inequality is due to Assumption 1. Then, we have the following chain of inequalities:
(102) |
where the first inequality is due to , the second inequality follows from Lemma 14 which states that , the second equality is due to Lemma 15, and the third inequality follows from (C.3). For , combining (C.3) with Lemma 10 yields the following confidence bound:
where is defined by (80). Note that Assumptions 2 and 3 hold, Assumption 4 holds with , and Assumption 5 holds with and being replaced by . The rest of the proof is similar to that of Theorem 6 by applying the covering number arguments.
C.4 A sufficient condition for Assumption 2
Below we present a sufficient condition for Assumption 2 that is more straightforward to verify.
Proposition 4.
Proposition 4 shows that Assumption 2 can be verified through a combination of a union-bound condition (103) on and a light-tail condition (104) on the data-generating distribution . It is worth noting that (103) holds if, when , Assumption 4 holds on and . To see this, note that by Assumption 4, we have
and thus, (103) holds with , and .
References
- Bäuerle and Müller (2006) Bäuerle, N. and Müller, A. (2006). Stochastic orders and risk measures: Consistency and bounds. Insurance: Mathematics and Economics, 38(1), 132–148.
- Bobkov and Ledoux (2019) Bobkov, S. and Ledoux, M. (2019). One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances. American Mathematical Society.
- Dhaene et al. (2002) Dhaene, J., Denuit, M., Goovaerts, M. J., Kaas, R. and Vyncke, D. (2002). The concept of comonotonicity in actuarial science and finance: Theory. Insurance: Mathematics and Economics, 31(1), 3–33.
- Föllmer and Schied (2016) Föllmer, H. and Schied, A. (2016). Stochastic Finance. An Introduction in Discrete Time. Fourth Edition. Walter de Gruyter, Berlin.
- Fournier and Guillin (2015) Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3), 707–738.
- Kallenberg (2021) Kallenberg, O. (2021). Foundations of Modern Probability. Third Edition. New York: Springer.
- Mao et al. (2022) Mao, T., Wang, R. and Wu, Q. (2022). Model aggregation for risk evaluation and robust optimization. arXiv:2201.06370.
- Rieder (1978) Rieder, U. (1978). Measurable selection theorems for optimization problems. manuscripta mathematica, 24(1), 115–131.
- Shaked and Shanthikumar (2007) Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer New York.
- Sion (1958) Sion, M. (1958). On general minimax theorems. Pacific Journal of Mathematics, 8(1), 171–176.
- Smith (1995) Smith, J. E. (1995). Generalized Chebychev inequalities: theory and applications in decision analysis. Operations Research, 43(5), 807–825.
- Villani (2009) Villani, C. (2009). Optimal Transport: Old and New. Berlin: Springer.
- Wang et al. (2020) Wang, R., Wei, Y. and Willmot, G. E. (2020). Characterization, robustness and aggregation of signed Choquet integrals. Mathematics of Operations Research, 45(3), 993–1015.