A data-driven approach to beating SAA out-of-sample
Abstract.
While solutions of Distributionally Robust Optimization (DRO) problems can sometimes have a higher out-of-sample expected reward than the Sample Average Approximation (SAA), there is no guarantee. In this paper, we introduce a class of Distributionally Optimistic Optimization (DOO) models and show that it is always possible to “beat” SAA out-of-sample if we consider not just worst-case (DRO) models but also best-case (DOO) ones. We also show, however, that this comes at a cost: optimistic solutions are more sensitive to model error than either worst-case or SAA optimizers, and hence are less robust, and calibrating the worst- or best-case model to outperform SAA may be difficult when data is limited.
Keywords: Distributionally Optimistic Optimization (DOO), Distributionally Robust Optimization (DRO), Sample Average Approximation (SAA), data-driven optimization, model uncertainty, worst-case sensitivity, out-of-sample performance.
1. Introduction
It is well known that solutions of optimization problems calibrated from data can perform poorly out-of-sample. This occurs due to errors in both the modeling assumptions and the estimation of model parameters and has motivated various optimization models which account for misspecification in the in-sample model. Distributionally Robust Optimization (DRO), where the decision maker optimizes against worst-case perturbations from the nominal (in-sample) distribution, is one such approach. Similar models have been introduced in a number of communities [2, 7, 8, 14, 24].
The performance of DRO is often evaluated by comparing its out-of-sample expected reward to that of the SAA optimizer. This is often done experimentally, and recent papers characterize finite-sample and asymptotic properties of DRO solutions and the associated expected reward [3, 9, 10, 11, 16].
It has been observed that solutions of DRO and other worst-case models can sometimes have a higher out-of-sample expected reward than the Sample Average Approximation (SAA) optimizer [1, 4, 11, 15]. However, there is no guarantee, and it is of interest whether a decision that beats SAA out-of-sample can be found when DRO is unable to. In this paper, we introduce a class of Distributionally Optimistic Optimization (DOO) problems, where nature works together with the decision maker to optimize the in-sample reward, and show under relatively mild assumptions that it is always possible to “beat” the SAA optimizer out-of-sample if we consider best-case (DOO) and worst-case (DRO) decisions.
As tempting as this may sound, there is a catch. While the out-of-sample expected reward might be larger, the expected reward under a best-case optimizer is always more sensitive to worst-case perturbations from the nominal model than the SAA optimizer, and hence is less robust. More generally, worst and best-case optimizers make a tradeoff between (in-sample) expected reward and sensitivity to model error. We potentially miss the benefits of sensitivity reduction if we only focus on the mean reward, and could even make things worse if this leads to choosing a best-case solution.
In contrast to worst-case problems, the literature on DOO is small, though recent activity suggests an uptick in interest that is generally centered around the concern that worst-case optimization is “too pessimistic.” One disadvantage of optimistic optimization, however, is that convexity can be lost, making it potentially more difficult to solve (computationally) than its worst-case cousin. Non-convexity is considered in [23] where optimistic optimization problems are shown to be related to non-convex regularizers used in machine learning. A majorization minimization algorithm is employed in [23] to efficiently obtain a critical point of their nonconvex optimization problem by making use of its “difference of convex” structure. Also related is [21], which uses optimistic optimization over probability distributions to construct a nonparametric approximation of the likelihood function that can be used in Bayesian statistics. The paper [26] uses DOO to solve the Trust Region Policy Optimization problem in the context of Reinforcement Learning, while optimistic optimization is also proposed in [6] for online contextual decision making. Finally, a recent paper [6] considers best-case optimization in stochastic optimization with model uncertainty using the “Rockafellian” framework for perturbation analysis. Best-case problems are also used in the empirical likelihood approach to generating confidence intervals of the optimal expected reward under the population distribution [10, 19, 20, 28]. We also note that the National Science Foundation recently awarded a grant [29] on “Favorable optimization under distributional distortions”. These papers do not consider the out-of-sample performance of solutions of best-case problems, nor the impact on the sensitivity of the out-of-sample expected reward to errors in the in-sample model.
A recent paper [18] shows that it is impossible, asymptotically, for a large class of data-driven problems, including DRO, DOO, and popular regularization techniques, to beat the out-of-sample expected reward of SAA. While this appears to directly contradict our results, there is actually no inconsistency. The main difference is that [18] considers the large-data limit, whereas our results apply to data sets of moderate size. Taken together: it is always possible to beat SAA using DRO/DOO with a finite data set (our result), but impossible in the limit [18]. We discuss [18] in more detail at the end of this paper.
Organization
We introduce the Distributionally Optimistic Optimization (DOO) and Distributionally Robust Optimization (DRO) models in Section 2. For concreteness, we adopt a penalty framework with smooth -divergence as the penalty function. In Section 3 we show that the family of worst- and best-case solutions is continuously differentiable in a neighborhood of the SAA solution and characterize their distributional properties. In Section 4 we show that it is generally possible to find a DRO or DOO solution with a higher out-of-sample expected reward than SAA, and in Section 5 that a best-case solution is always less robust than SAA. The key ideas are illustrated using a data-driven inventory problem.
2. Setup
Let , and be an -valued random vector with population distribution . Consider the problem
(2.1) |
Let
(2.2) |
denote the solution of (2.1). We assume the following.
Assumption 2.1.
The function and random vector are such that
• is strictly concave and twice continuously differentiable in for -almost surely every ;
• for each fixed , the mappings , , are measurable and all moments of the random variables , , exist;
• there exists a solution of (2.1).
In many applications, we do not know the population distribution but only have independent and identically distributed (iid) samples drawn from . This naturally leads to replacing (2.1) with a Sample Average Approximation (SAA):
(2.3) |
Here is the empirical distribution; we assume without loss of generality that for all . We denote the solution of the in-sample problem (2.3) by
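To make the SAA step concrete, the following self-contained sketch (our illustration; the reward function, the parameter values, and the helper name `saa_solve` are hypothetical, not from the paper) solves an instance of (2.3) by grid search for a one-dimensional strictly concave reward:

```python
import numpy as np

def saa_solve(reward, xi_samples, x_grid):
    """Solve the SAA problem max_x (1/N) sum_i reward(x, xi_i) by grid search."""
    means = np.array([reward(x, xi_samples).mean() for x in x_grid])
    return x_grid[np.argmax(means)]

rng = np.random.default_rng(0)
reward = lambda x, xi: -0.5 * x ** 2 + xi * x   # hypothetical strictly concave reward
xi = rng.normal(loc=1.0, scale=2.0, size=200)   # iid draws standing in for the data
x_grid = np.linspace(-2.0, 4.0, 601)
x_saa = saa_solve(reward, xi, x_grid)           # lands at the sample mean of xi
print(x_saa)
```

For this quadratic reward the SAA maximizer is the sample mean of the data, so the grid search returns the grid point nearest to it.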
It is well known that the SAA solution may not perform well out-of-sample. This has motivated worst-case versions of SAA, called Distributionally Robust Optimization (DRO), where the decision is chosen to maximize the expected reward under worst-case perturbations of the probability distribution () from the empirical distribution :
(2.4) |
where
(2.5) |
is the -divergence of relative to . We assume throughout that is a convex lower semi-continuous function such that for and for .
Distributionally optimistic optimization (DOO)
We now consider an optimistic version of (2.3) where nature works together with the decision maker to optimize the expected reward. The Distributionally Optimistic Optimization (DOO) model is
(2.6) |
Here, nature selects to maximize the expected reward and incurs a (negative) penalty for deviating from the nominal ; the parameter controls the size of the deviations. The decision maker accepts that may be misspecified, and makes a decision that maximizes the expected reward if nature is cooperative. DOO may be of interest in applications where the decision maker is looking for upside opportunities and does not wish to be conservative in the face of model uncertainty. We also note that one of the likelihood approximations in [21] has the form of the inner problem in (2.6). That paper does not consider optimization, though an optimistic version of Maximum Likelihood Estimation is a natural step in this direction and an example of (2.6). Though our results provide insights about general properties of DOO and its solutions, applications of DOO are not the focus of this paper.
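For intuition about how the penalty works, consider the special case where the divergence is Kullback–Leibler (one member of the smooth family). The inner worst- and best-case problems then have well-known closed forms via exponential tilting. The sketch below is our illustration, with a hypothetical penalty parameter `delta`; it checks that, in-sample, the best-case value dominates SAA, which dominates the worst-case value:

```python
import numpy as np

def kl_worst_case(rewards, delta):
    """Worst case with a KL penalty: min_Q E_Q[r] + (1/delta) KL(Q||P_N)
    = -(1/delta) * log E_{P_N}[exp(-delta * r)], delta > 0."""
    return -np.log(np.mean(np.exp(-delta * rewards))) / delta

def kl_best_case(rewards, delta):
    """Best case with a KL penalty: max_Q E_Q[r] - (1/delta) KL(Q||P_N)
    = (1/delta) * log E_{P_N}[exp(delta * r)], delta > 0."""
    return np.log(np.mean(np.exp(delta * rewards))) / delta

rng = np.random.default_rng(1)
r = rng.normal(size=1000)       # rewards under the empirical distribution
delta = 0.5
worst, saa, best = kl_worst_case(r, delta), r.mean(), kl_best_case(r, delta)
print(worst, saa, best)         # worst <= saa <= best
```

As `delta` shrinks, both closed forms collapse to the empirical mean, recovering SAA.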
It is convenient to introduce population versions of the worst-case and best-case problems. If
(2.7) |
denotes -divergence, where is the Radon-Nikodym derivative (likelihood ratio) of with respect to , then the population version of the worst-case problem is
(2.8) |
and
(2.9) |
is the best-case problem.
We denote the solutions of (2.2), (2.8) and (2.9) by with , and , respectively, and for the sample versions (2.3), (2.4) and (2.6). It will be shown in Section 3 that and are continuously differentiable in a neighborhood of . We consider the case where the decision variable is unconstrained as this allows us to directly connect changes to the SAA solution from DRO/DOO to out-of-sample performance. We defer analysis of the constrained case to future work.
Dual characterization of worst/best-case objective
Let
denote the convex conjugate of . We have the following dual representation of the worst-case problems (2.4) and (2.8), which is well known and can be established using convex duality. We state it without proof; the interested reader is referred to [12].
Proposition 2.2 (Dual characterization for DRO).
Suppose that is a convex lower-semicontinuous function such that for and for . If , then
(2.10) |
and
(2.11) |
Proposition 2.3 (Dual characterization for DOO).
Suppose that is a convex lower-semicontinuous function such that for and for . If , then
(2.12) |
and
(2.13) |
3. Characterization of the solution
3.1. In-sample problems
It follows from Proposition 2.2 that the worst-case problems (2.4) and (2.8) are equivalent to
(3.1) | |||
while by Proposition 2.3, the best case problems (2.6) and (2.9) become
(3.2) | |||
We study properties of the solution of these problems using the first-order conditions. The assumptions about the function need to be strengthened.
Assumption 3.1.
is a convex lower-semicontinuous function such that for , for , and is twice continuously differentiable around with .
The first order conditions for the sample and population problems in (3.1) and (3.2) can be written in the form
(3.5) |
and
(3.8) |
respectively, where for DRO and for DOO, and
(3.13) |
(see Appendix A for a derivation). Since
(see Theorem 3.2 in [12]), it follows that
and
To ease notation, we suppress the dependence on in the function ; it should be clear from the context whether we are talking about the worst-case () or the best-case () problem.
Let and denote the solutions of the in-sample (3.5) and population (3.8) problems, respectively. The following result shows that the family of solutions parameterized by exists and is continuously differentiable in a neighborhood of . There is a similar result in [11] though the focus there is limited to DRO problems (). Proposition 3.2 shows that the continuation to negative values of are solutions of DOO problems. The proof is in the Appendix.
Proposition 3.2.
Suppose that satisfies Assumption 2.1 and satisfies Assumption 3.1. Then there is an open neighborhood of where the solutions and of (3.5) and (3.8), respectively, exist and are continuously differentiable. In particular,
(3.17) |
where (in a footnote: if is three times continuously differentiable, an explicit expression for the first-order term can be shown; however, it is not needed in our analysis)
(3.18) |
and
(3.21) |
where
(3.22) |
Proposition 3.2 shows that worst- and best-case optimization adds a bias in the direction to the SAA maximizer (and similarly for ).
Clearly, the in-sample expected reward under the worst- or best-case optimizer is smaller than that of the SAA optimizer, no matter the sign of . When applied out-of-sample, however, this might not be the case because the solutions are random variables under the population distribution . To compare the out-of-sample reward of the best- and worst-case solutions and the SAA optimizer, we need to understand the impact of data variability on the solution.
3.2. Statistical properties of the SAA solution
The following regularity assumptions on will be needed to characterize the statistical properties of .
Assumption 3.3.
(1) The function is three times continuously differentiable in for -almost all ;
(2) As
(3) for every
(4) The and order derivatives of exist for any in a neighborhood of and
where is the -valued Hessian of the real-valued function .
(5) For any in some neighborhood of
(6) For any in some neighborhood of and random variable such that ,
We now characterize the statistical properties of the SAA solution. Let
and random vectors
where
By condition (1) of Assumption 3.3, is guaranteed to exist.
The following result gives the statistical properties of the solution of the SAA problem.
Proposition 3.4.
Proof.
Equation (3.24) shows that the SAA solution has a bias relative to the solution of the population problem.
3.3. Statistical properties of the DRO/DOO solution
The following assumptions for are a direct parallel of Assumption 3.3 for .
Assumption 3.5.
(1) The function is three times continuously differentiable in for -almost all , and is three times continuously differentiable in , and hence is three times continuously differentiable in .
Let be defined by (3.13). The parameter in is such that
(2) As
(3) for every
(4) The and order derivatives of exist in a neighborhood of and
where denotes the component of .
(5) For some neighborhood of
(6) For some neighborhood of and random variable such that ,
Condition (1) guarantees that the Hessian exists. The third condition implies that is such that the first-order conditions (3.8) for the population problem have a unique solution. Under Assumptions 2.1 and 3.1, this is satisfied for every (i.e. for SAA and the worst-case problems), and for all in an open neighborhood of , which includes negative values.
The following result characterizes the statistical properties of the DRO/DOO solution .
Proposition 3.6.
Suppose that data is drawn iid from , satisfies Assumptions 2.1 and 3.3, satisfies Assumption 3.1, and is such that Assumption 3.5 holds. Then there is a unique solution of the first-order conditions (3.8), as , and
(3.29) |
where has mean and covariance matrix and is with mean . and are continuously differentiable in a neighborhood of .
Proof.
As in the case of the variance of and the mean of for SAA, and do not depend on . There is a bias of relative to the population DRO/DOO optimizer and the variance is different from that of the SAA solution.
4. Out-of-sample expected reward
Let be a solution of the best- or worst-case model, the out-of-sample expected reward given , and
be the expected out-of-sample reward after averaging over datasets of size and outcome . Under the assumptions of our model, has distribution , the distribution of is characterized in Proposition 3.6, and and are independent. We now explore the relationship between and the out-of-sample expected reward of the SAA optimizer, .
We denote the expected out-of-sample reward given by the function defined by
(4.1) |
Observe that . Concavity of in also implies that is concave.
4.1. Sample Average Approximation
Independence of and , the concavity of and hence of in , and Jensen’s inequality imply that the out-of-sample expected reward under the SAA optimizer is no greater than the reward at its mean
(4.2) |
This inequality is strict if is strictly concave and is random.
Equation (4.2) suggests that we write the out-of-sample expected reward as a sum of the reward under the mean of the decision and a loss that we call the Jensen gap:
(4.3) |
The Jensen gap is negative when is concave (and not linear) on the support of , while the optimality of for the population problem implies that is less than if the solution is biased ().
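The decomposition (4.3) can be checked numerically. In the following self-contained sketch (our illustration; the concave out-of-sample reward `G` and the sampling distribution of the solution are hypothetical stand-ins), the estimated Jensen gap is negative and, because `G` is quadratic, approximately equal to minus the variance of the solution:

```python
import numpy as np

rng = np.random.default_rng(2)
G = lambda x: -(x - 1.0) ** 2          # concave out-of-sample reward, peak at x* = 1
# Simulate SAA solutions: a sample-mean estimator of the population optimum x* = 1
N, reps = 50, 20000
x_N = rng.normal(loc=1.0, scale=2.0, size=(reps, N)).mean(axis=1)
jensen_gap = G(x_N).mean() - G(x_N.mean())
# For quadratic G the gap equals -Var(x_N) = -(2**2)/N = -0.08, up to Monte Carlo noise
print(jensen_gap)
```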
Additional regularity (Assumptions 2.1 and 3.3) allows us to characterize the mean and variance of the SAA solution and to study each component in the decomposition (4.3) separately. Specifically, we know from Proposition 3.4 that
We make the assumption that the second-order error terms of the expectation and variance of are . Sufficient conditions for this additional level of regularity are provided in Appendix D.
Assumption 4.1.
The SAA solution is such that
(4.4) | |||||
Under Assumption 4.1, a Taylor series expansion of the first component of (4.3) around the population optimizer gives:
(4.5) | |||||
(4.6) |
which shows that the expected out-of-sample loss due to the finite-sample bias is small, being on the order of . In the case of the Jensen gap, a Taylor series expansion around gives
(4.7) | |||||
where the second and third equalities follow from (4.4) and (4.5). The Jensen gap is negative because is concave and becomes more negative when the variance of increases or the curvature becomes more negative.
Adding (4.6) and (4.7) we obtain an expression for the out-of-sample expected reward in terms of the population optimizer:
(4.8) |
This expression shows that the loss in expected reward is of order and comes from the uncertainty in the solution through the Jensen gap. There is also a loss due to the finite-sample bias in the SAA solution, but this is an order of magnitude smaller.
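The 1/N scaling in (4.8) can also be verified by simulation. In this sketch (ours; `G` and the sample-mean estimator are hypothetical stand-ins for the population reward and the SAA solution), the expected out-of-sample loss approximately halves when the sample size doubles:

```python
import numpy as np

rng = np.random.default_rng(6)
G = lambda x: -(x - 1.0) ** 2        # out-of-sample reward; population optimum G(1) = 0

def expected_loss(N, reps=40000):
    """Monte Carlo estimate of G(x*) - E[G(x_N)] for a sample-mean SAA solution."""
    x_N = rng.normal(1.0, 2.0, size=(reps, N)).mean(axis=1)
    return G(1.0) - G(x_N).mean()

l1, l2 = expected_loss(50), expected_loss(100)
print(l1, l2, l1 / l2)               # loss roughly halves when N doubles: O(1/N)
```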
4.2. Distributionally Robust/Optimistic Optimization
To compare the out-of-sample expected reward of the DRO solution with that of SAA, we write in terms of the SAA reward and additional terms that show how changes in the mean and variance of the solution affect the expected reward:
(4.9) | |||||
The term (A) shows how a change in the expected value of the solution changes the reward, while (B) is the associated change in the Jensen gap.
DRO and DOO add a bias and change the variance of the SAA optimizer. Specifically, it follows from Proposition 3.6 and equations (3.21) and (4.4), under Assumptions 2.1 and 3.1, that (3.29) holds. As in the case of the SAA solution, we make the additional assumption that the second-order error terms of the mean and variance of are . A sufficient condition is given in Appendix D.
Assumption 4.2.
There exists a neighborhood of such that
(4.10) | |||||
Under this assumption and for sufficiently large, there is an open neighborhood of such that
(4.11) | |||||
Here, and denote derivatives of and at , which exist because and are continuously differentiable at (Proposition 3.6).
To the first order, the change in the bias relative to the SAA solution is , and is the change in the variance. The expression for follows from the previous equation together with (3.21) and (4.4), and shows that the difference between the mean of and comes from two sources: the change in the population solution from to due to robustness, which contributes , and the change in the finite-sample bias, which gives . The variance of the solution also changes by . An explicit relationship between terms (A) and (B) in (4.9) and the robustness parameter can be obtained by substituting (4.11) into each term in (4.9).
For the first term (A), it is shown in Appendix E that
In the case of the Jensen gap (B):
where
This expression shows that the change in the Jensen gap comes from the change in the curvature of the objective function due to the DRO/DOO bias (first term) and the change in the variance of the solution (second term).
The out-of-sample expected reward under a DRO/DOO solution can be written in terms of the expected reward under the SAA optimizer by adding the expansions of (A) and (B) above.
Proposition 4.3.
It is worth reiterating the origins of each term in . The first is the change in the expected reward that results from the change to the mean of the SAA solution, while the second is the change in the Jensen gap that results from the change in the variance of the solution. The last term is the change in the Jensen gap resulting from a change in the curvature of the objective function due to the DRO/DOO bias. When is concave but not linear, it is clear from (4.15) that is unlikely to be zero, except in pathological cases.
The following result shows that it is always possible to select so that the out-of-sample expected reward under the DRO/DOO solution exceeds that of and provides an estimate of the size of the outperformance.
Theorem 4.4.
Proof.
Equation (4.14) shows how the out-of-sample expected reward depends on the robustness parameter when it is small. If is non-zero, the change in the expected reward
is dominated by the linear term when is small, and we can always choose so that the expected reward from exceeds that from the SAA solution . If is positive, a positive choice of (DRO) does the job, while a negative value of (DOO) should be chosen if is negative. The expression for is obtained by maximizing the quadratic on the right-hand side of (4.14).
Although we can always out-perform SAA, the analysis above suggests that the improvement is on the order of , regardless of whether it was achieved by best-case or worst-case optimization. Out-performance from either DRO or DOO is unlikely to be large if is large or is small. More generally, is the sensitivity of the sum of (A) and (B) in (4.9) to changes in . Our assumptions allow us to estimate this explicitly. (A) and (B) may depend differently on and in applications where our assumptions do not hold, though we have not explored this issue.
It is difficult to compute because it depends on the solution of the population problem and the population distribution , both of which are unknown, and hence to determine a priori whether the optimistic or worst-case solution will out-perform SAA. This can be sidestepped by optimizing a bootstrap or cross-validation estimate of the out-of-sample reward over , which we explore in an inventory problem in Section 5.
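A bootstrap selection of the robustness parameter might look as follows. This is our sketch, not the paper's procedure: the reward, the KL-penalized grid solver, the grid of `delta` values (negative values correspond to DOO), and all parameters are hypothetical. For each candidate `delta` we train on bootstrap resamples and score the resulting decisions on the original data:

```python
import numpy as np

def f(x, xi):                                   # hypothetical concave reward
    return -0.5 * x ** 2 + xi * x

def solve(xi, delta, x_grid):
    """KL-penalized solver: delta > 0 is DRO, delta < 0 is DOO, delta = 0 is SAA."""
    F = f(x_grid[:, None], xi[None, :])         # reward for every (decision, sample)
    if delta == 0.0:
        vals = F.mean(axis=1)
    else:
        vals = -np.log(np.exp(-delta * F).mean(axis=1)) / delta
    return x_grid[np.argmax(vals)]

def bootstrap_score(xi, delta, x_grid, B, rng):
    """Average reward, on the original data, of decisions trained on B resamples."""
    score = 0.0
    for _ in range(B):
        boot = rng.choice(xi, size=xi.size, replace=True)
        score += f(solve(boot, delta, x_grid), xi).mean()
    return score / B

rng = np.random.default_rng(3)
xi = rng.normal(1.0, 1.0, size=40)              # a small dataset
x_grid = np.linspace(-3.0, 5.0, 161)
deltas = np.arange(-4, 5) / 10.0                # negative = DOO, positive = DRO
scores = [bootstrap_score(xi, d, x_grid, B=40, rng=rng) for d in deltas]
delta_star = float(deltas[int(np.argmax(scores))])
print(delta_star)
```

As in the experiments below, the selected `delta_star` inherits the sampling variability of the data, so even its sign can be estimated incorrectly on small datasets.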
One limitation of our analysis is that we make strong assumptions about the differentiability of . These assumptions enable us to characterize the statistical properties of the SAA and DRO/DOO solutions and to analyze out-of-sample performance. However, there are many applications where they are not satisfied, the inventory problem below being one such case. Our assumptions enable us to illustrate what is possible with DOO and to provide insight into why it occurs. We do not rule out the possibility that either DRO or DOO will out-perform SAA when the differentiability assumptions we impose are not satisfied. All predictions that stem from our analysis, both in Theorem 4.4 and later sections, are observed in the inventory example, where the objective function violates Assumption 2.1.
Example 4.5.
In [1], many examples where the out-of-sample expected reward from DRO exceeds that of SAA are presented for an objective function of the form
where is strictly negative definite. Since , and all derivatives of of order or higher are zero, it follows from (3.28) that and
and hence by (4.15)
The only term from that remains corresponds to the change in the Jensen gap due to a change in the variance of the solution.
If is one-dimensional, is a negative constant. If DRO reduces the variance of the solution relative to SAA, is negative and is positive, so DRO will also out-perform SAA if is chosen appropriately. If DRO increases the variance of the solution (see [11] for an example where this happens), is negative and a DOO solution will have a higher out-of-sample expected reward than SAA.
Example 4.6.
Consider an inventory problem with reward
(4.17) |
where the selling price , the purchase cost , and the shortage and scrap costs, and , are zero. Under the population distribution, demand , where is exponential with mean (), is Bernoulli with , and is a constant; and are independent. In this experiment, , , and .
We note that the objective function (4.17) does not satisfy the differentiability assumptions that facilitate our analysis (Assumptions 2.1, 3.3, and 3.5). It will be seen, however, that all predictions from our analysis will be observed in this inventory model. More generally, this provides some evidence that the insights from our analysis are true under less stringent assumptions about the degree of smoothness of the objective function.
We generated datasets, each of size , and solved the DRO/DOO problems with a modified- penalty function () over a range of for each dataset. We then computed the out-of-sample expected reward for the family of DRO/DOO decisions , giving conditional expected-reward functions. The out-of-sample expected reward shown in Figure 4.1 was estimated by averaging these functions.
If we restrict ourselves to worst-case solutions (DRO), the SAA solution () is optimal and the out-of-sample expected reward is . However, the out-of-sample expected reward is maximized by an optimistic solution with and has value . With different parameter values and/or a different population distribution, a worst-case model () could be optimal.
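The experiment can be reproduced in spirit with the following simplified sketch. All specifics are our stand-ins, not the paper's: a KL penalty in place of the modified divergence, a plain exponential demand in place of the paper's mixture model, and hypothetical price/cost values. We train on a single small dataset and compare out-of-sample rewards across a few robustness parameters:

```python
import numpy as np

p, c = 2.0, 1.0                                  # hypothetical price and unit cost

def reward(x, d):                                # newsvendor reward; zero shortage/scrap costs
    return p * np.minimum(x, d) - c * x

def solve(d, delta, x_grid):
    """KL-penalized DRO (delta > 0) / DOO (delta < 0) / SAA (delta = 0) solver."""
    F = reward(x_grid[:, None], d[None, :])
    if delta == 0.0:
        vals = F.mean(axis=1)
    else:
        vals = -np.log(np.exp(-delta * F).mean(axis=1)) / delta
    return x_grid[np.argmax(vals)]

rng = np.random.default_rng(4)
demand_pop = rng.exponential(scale=10.0, size=200_000)   # stand-in "population"
train = rng.exponential(scale=10.0, size=20)             # one small training dataset
x_grid = np.linspace(0.0, 40.0, 401)

results = {}
for delta in (-0.05, 0.0, 0.05):                 # DOO, SAA, DRO
    x_d = solve(train, delta, x_grid)
    results[delta] = (x_d, reward(x_d, demand_pop).mean())  # out-of-sample reward
for delta, (x_d, oos) in results.items():
    print(f"delta={delta:+.2f}  x={x_d:5.2f}  out-of-sample reward={oos:.3f}")
```

Averaging this comparison over many training datasets, as in the paper's experiment, traces out the out-of-sample expected reward as a function of the robustness parameter; whether the maximizer lies on the DRO or DOO side depends on the population and the sample size.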

In summary, DRO models typically come with a free parameter – the size of the uncertainty set or the robustness parameter – which is often selected by optimizing an estimate of the out-of-sample expected reward via cross-validation or the bootstrap. However, it is not always possible to beat SAA if we restrict ourselves to worst-case models [1, 11]. If the ultimate goal is to beat SAA out-of-sample, it is reasonable to consider both worst- and best-case models to ensure this is possible.
5. So, what is the catch?
5.1. Best-case solutions are not robust
We have focused on the out-of-sample expected reward of solutions of best- and worst-case problems. However, it has also been shown that DRO is intrinsically a tradeoff between (in-sample) expected reward and worst-case sensitivity [11, 12, 13]. Worst-case sensitivity is a quantitative measure of robustness and it is natural to evaluate the worst-case sensitivity of the best-case optimizer.
We recall the definition of worst-case sensitivity [12]. Suppose the decision is fixed and let denote the expected reward under the nominal distribution . Worst-case sensitivity is the rate of decrease of the in-sample expected reward under “worst-case deviations” from the nominal distribution. When the difference between probability distributions is measured by smooth -divergence, it is natural to define the worst-case distribution as the minimizer
Here, the penalty on -divergence controls the size of the deviation of from , which is increasing in . Worst-case sensitivity is
(5.1) | |||||
which, in the case of smooth -divergence, is equal to the in-sample variance of the reward. There are other ways to define worst-case sensitivity: we can use a different divergence measure, or we can control the “distance” of from through a constraint and look at the limit when it vanishes [13]. However, they all capture a similar idea.
Worst-case sensitivity is an in-sample measure of robustness. Given a nominal model and a decision , it quantifies the sensitivity of the expected reward to errors in the nominal model and is one way to evaluate whether a given decision is more or less robust than another. Worst-case sensitivity depends on the choice of uncertainty set, though it is always a generalized measure of deviation (spread) [13]. Intuitively, the expected reward is sensitive to mis-specification when it has a large spread because small changes in the probability of extreme rewards (positive or negative) can have a big impact on the mean.
Instead of , we can substitute the solution of the best/worst-case problem into (5.1). To see the impact of best/worst case optimization on the “robustness” of the solution, we write the sensitivity of in terms of the sensitivity of the SAA solution . Using the expansion (3.17), the in-sample variance of the reward under is
where
It now follows from (5.1) that worst-case sensitivity
(5.2) | |||||
Strict concavity of in implies that
so solutions of the optimistic DOO problem () have a larger sensitivity than SAA, while worst-case solutions () have a smaller sensitivity and are therefore more robust.
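The ordering in (5.2) can be observed directly. In the sketch below (our illustration; the reward, the KL-penalized solver, and the parameter values are hypothetical), worst-case sensitivity is computed as the in-sample variance of the reward, as in (5.1) for smooth divergences, and the DOO solution is visibly less robust than SAA while the DRO solution is more robust:

```python
import numpy as np

def f(x, xi):                                    # hypothetical concave reward
    return -0.5 * x ** 2 + xi * x

def solve(xi, delta, x_grid):
    """KL-penalized solver: delta > 0 is DRO, delta < 0 is DOO, delta = 0 is SAA."""
    F = f(x_grid[:, None], xi[None, :])
    if delta == 0.0:
        vals = F.mean(axis=1)
    else:
        vals = -np.log(np.exp(-delta * F).mean(axis=1)) / delta
    return x_grid[np.argmax(vals)]

def sensitivity(x, xi):
    """Worst-case sensitivity under a smooth divergence penalty:
    the in-sample variance of the reward, as in (5.1)."""
    return f(x, xi).var()

rng = np.random.default_rng(5)
xi = rng.normal(1.0, 2.0, size=200)
x_grid = np.linspace(-2.0, 6.0, 801)
x_dro, x_saa, x_doo = (solve(xi, d, x_grid) for d in (0.1, 0.0, -0.1))
s_dro, s_saa, s_doo = (sensitivity(x, xi) for x in (x_dro, x_saa, x_doo))
print(s_dro, s_saa, s_doo)                       # increasing: DRO < SAA < DOO
```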
5.2. The optimal value of may be difficult to estimate
Although there exists an optimal DOO or DRO decision that has a larger out-of-sample expected reward than SAA, the parameter corresponding to this decision needs to be estimated. It is natural to use the bootstrap or cross-validation, though the estimate will depend on the data set, so there will be estimation error. If the optimal value of is positive but we erroneously select a negative value of (i.e. DOO), not only do we lose a little out-of-sample expected reward, but we also increase worst-case sensitivity, leaving a solution that is less robust than SAA.
Example 5.1 (Inventory control (continued)).
For each of the sampled datasets (each with data points), we compute worst-case sensitivity as a function of . Figure 1(a) is obtained by averaging these. As shown in (5.2), optimistic solutions have a larger sensitivity than SAA and DRO solutions in the neighborhood of , and sensitivity is linear in . For each of the datasets, we can compute the in-sample mean-sensitivity frontier corresponding to the DRO/DOO solutions over a range of . Figure 1(b) is obtained by averaging these frontiers.
Figure 1(c) shows the out-of-sample mean-variance frontier. SAA can be beaten by an optimistic decision, but this comes at the cost of a large increase in the out-of-sample sensitivity (variance).
Figure 1(d) shows the distribution of bootstrap estimates of the optimal as a function of the number of data points. Here we simulated independent datasets of size and , and used 50 bootstrap samples for each dataset to estimate the out-of-sample expected reward, which we optimized over . When , is optimal; when , . The bootstrap estimates are not very accurate when (mean = , SD = ), with only of experiments giving an estimate with the correct sign, and there is a long right tail. This is not very surprising given that a sample of size is small when the population demand has a standard deviation of . The chance of getting a data set that is not representative of the population is high, leading to poor estimates of (e.g. the extreme values in the right tail of the distribution). Accuracy improves with data points (mean = , SD = ), with estimates having the correct sign of the time and a smaller skew. It could well be the case that other methods estimate the optimal , or at least its sign, more accurately. We have not explored this issue further.




5.3. Other remarks
The paper [18] (posted on arXiv one day after the first version of this paper; the two were written independently) studies data-driven algorithms that admit solutions of the form
(5.3) |
where is the population optimizer (2.2), is the data-driven solution when there are data points, and is the free parameter of the algorithm (e.g. a regularization parameter, the size of an uncertainty set, or a robustness parameter).
To evaluate the performance of solutions of the form (5.3), [18] considers the difference (regret) between the expected rewards of the data-driven and population optimizers
where was defined in (4.1), and shows that the weak limit () of the scaled regret is second-order stochastically larger than the weak limit of the scaled regret of the SAA optimizer. In this sense it is impossible to beat SAA asymptotically when an algorithm has solutions of the form (5.3).
While this appears to contradict Theorem 4.4, which says that it is always possible to beat SAA using DRO/DOO, there is actually no inconsistency. [18] shows the optimality of the limiting reward distribution of SAA under second-order stochastic dominance, whereas we use the asymptotic distribution of the DRO/DOO and SAA solutions to estimate the difference between the out-of-sample expected rewards for finite (4.14). This difference is of order , and the advantage of DRO/DOO over SAA vanishes at a rate faster than if itself goes to . Indeed, it was shown in Theorem 4.4 that the optimal choice of is of order and the improvement over SAA is of the order . If the goal is to beat the expected reward of SAA out-of-sample, the improvement from DRO/DOO is at best modest when data sets are of moderate size and diminishes quickly in . There is no advantage in the limit, consistent with [18].
We have argued, however, that the advantage of DRO is more than the possibility that it can sometimes, but not always, beat SAA out-of-sample by just a little bit: It also reduces the sensitivity of the expected reward to misspecification of the nominal distribution which, as the reader may recall, was the point of robust decision making in the first place. More generally, it captures the tradeoff between maximizing expected reward and controlling sensitivity, which is relevant in many applications of data-driven decision making.
6. Conclusion
Solutions of DRO problems can sometimes have a larger out-of-sample expected reward than SAA. Indeed, much of the literature seems to suggest that this is the aspiration of DRO: uncertainty sets are often calibrated by optimizing an estimate of the out-of-sample expected reward, recent papers focus on theoretically characterizing the out-of-sample expected reward of DRO solutions, and applications of DRO are hailed as a success when the out-of-sample mean exceeds that of SAA. However, beating SAA may not be possible if we only consider worst-case models.
We have shown that it is always possible to beat SAA out-of-sample if we consider both optimistic (DOO) and pessimistic (DRO) perturbations of the nominal model. If the only concern is beating the out-of-sample expected reward of the SAA optimizer, a decision maker should consider best-case and worst-case models.
As tempting as this may sound, there is a catch. While the out-of-sample expected reward might be larger, the worst-case sensitivity of a best-case (DOO) optimizer is larger than that of SAA so it will also be less robust. Indeed, the improvement in the expected reward from using the optimal DOO optimizer is small (of order ) relative to the robustness cost. The optimal value of the parameter also needs to be estimated, and bootstrap or cross-validation estimates may be unreliable if the training data set is small.
References
- [1] Anderson, E.J., Philpott, A. 2020. Improving sample average approximation using distributional robustness. INFORMS Journal on Optimization, 4(1), 90–124 (https://doi.org/10.1287/ijoo.2021.0061).
- [2] Ben-Tal, A., A. Nemirovski. 1999. Robust solutions to uncertain programs. Oper. Res. Lett. 25 1–13.
- [3] Bertsimas, D., Gupta, V., Kallus, N. 2014. Robust SAA. arXiv:1408.4445 (https://arxiv.org/abs/1408.4445).
- [4] Bertsimas, D., Litvinov, E., Sun, A.X., Zhao, J., Zheng, T. 2013. Adaptive robust optimization for the security constrained unit commitment problem. IEEE Transactions on Power Systems, 28(1), 52–63.
- [5] Cao, J., Gao, R. 2021. Contextual decision-making under parametric uncertainty and data-driven optimistic optimization. (https://optimization-online.org/2021/10/8634/).
- [6] Chen, L.L., Royset, J.O. 2022. Rockafellian Relaxation in Optimization under Uncertainty: Asymptotically Exact Formulations (https://arxiv.org/abs/2204.04762).
- [7] Doyle, J.C., Glover, K., Khargonekar, P.P., Francis, B.A. 1989. State-space solutions to standard H2 and H∞ control problems. IEEE Transactions on Automatic Control, 34(8), 831–847.
- [8] El Ghaoui, L., Lebret, H. 1997. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18, 1035–1064.
- [9] Duchi, J.C., Namkoong, H. 2016. Variance-based regularization with convex objectives. arXiv:1610.02581v2 (https://arxiv.org/abs/1610.02581).
- [10] Duchi, J.C., Glynn, P.W., Namkoong, H. 2016. Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach. arXiv:1610.03425 (https://arxiv.org/abs/1610.03425).
- [11] Gotoh, J., Kim, M.J., Lim, A.E.B. 2020. Calibration of robust empirical optimization models. Operations Research, 69(5), 1630–1650 (https://doi.org/10.1287/opre.2020.2041).
- [12] Gotoh, J., Kim, M.J., Lim, A.E.B. 2018. Robust empirical optimization is almost the same as mean-variance optimization. Operations Research Letters, 46 (4), 448–452, https://doi.org/10.1016/j.orl.2018.05.005.
- [13] Gotoh, J., Kim, M.J., Lim, A.E.B. 2020. Worst-case sensitivity. arXiv:2010.10794 (https://arxiv.org/abs/2010.10794).
- [14] Hansen, L.P., Sargent, T.J. 2008. Robustness. Princeton University Press, Princeton, New Jersey.
- [15] Kim, M.J., Lim, A.E.B. 2014. Robust multi-armed bandit problems. Management Science, 62(1), 264–285.
- [16] Kuhn, D., Esfahani, P.M., Nguyen, V.A., Shafieezadeh-Abadeh, S. 2019. Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning. INFORMS TutORials in Operations Research, 130–166 (https://doi.org/10.1287/educ.2019.0198).
- [17] Kundhi, G., Rilstone, P. 2008. The third order bias of nonlinear estimators. Communications in Statistics – Theory and Methods, 37, 2617–2633.
- [18] Lam, H. 2021. On the Impossibility of Statistically Improving Empirical Optimization: A Second-Order Stochastic Dominance Perspective. arXiv:2105.13419 (https://arxiv.org/pdf/2105.13419.pdf).
- [19] Lam, H. 2019. Recovering Best Statistical Guarantees via the Empirical Divergence-Based Distributionally Robust Optimization. Operations Research, 67(4), 1090–1105.
- [20] Lam, H., Zhou, E. 2017. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4), 301–307.
- [21] Nguyen, V.A., Shafieezadeh-Abadeh, S., Yue, M.C., Kuhn, D., Wiesemann, W. 2019. Optimistic Distributionally Robust Optimization for Nonparametric Likelihood Approximation. arXiv:1910.10583 (https://arxiv.org/abs/1910.10583).
- [22] Nishiyama, Y. 2010. Moment Convergence of M-Estimators. Statistica Neerlandica, 64(4), 505–507.
- [23] Norton, M., Takeda, A., Mafusalov, A. 2017. Optimistic robust optimization with applications to machine learning. arXiv:1711.07511 (https://arxiv.org/abs/1711.07511).
- [24] Petersen, I.R., James, M.R., Dupuis, P. 2000. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45, 398–412.
- [25] Rilstone, P., Srivastava, V.K., Ullah, A. (1996). The second-order bias and mean squared error of nonlinear estimators. Journal of Econometrics, 75: 369–395.
- [26] Song, J., Zhao, C. 2020. Optimistic Distributionally Robust Policy Optimization. arXiv:2006.07815v1 (https://arxiv.org/pdf/2006.07815.pdf).
- [27] van der Vaart, A.W. 2000. Asymptotic Statistics. Cambridge University Press.
- [28] Wang, Z., Glynn, P.W., Ye, Y. 2016. Likelihood robust optimization for data-driven problems. Computational Management Science, 13(2), 241–261.
- [29] Xie, W. 2021. CAREER: Favorable Optimization under Distributional Distortions: Frameworks, Algorithms, and Applications (Award Number 2046426). Division of Civil, Mechanical & Manufacturing Innovation, National Science Foundation.
Appendix A First order conditions (3.5)-(3.8)
Consider the population version of the DRO problem (3.1)
where
(A.1) |
Differentiating with respect to , the first order conditions are
(A.2) | |||||
with solutions . We eventually wish to show that is continuously differentiable in a neighborhood of , which depends on the properties of the convex conjugate . Under Assumption 3.1 (see Theorem 3.2 in [12]), is twice continuously differentiable in a neighborhood of and satisfies
(A.3) |
It follows that is continuously differentiable in a neighborhood of and
(A.4) |
so by (A.2)
and we can write the first order conditions as
This does not change the solution of (A.2), but allows us to use the Implicit Function Theorem to establish the level of smoothness of in Proposition 3.2 (in particular, that (B.11) is invertible in the proof in Appendix B). It follows that is given by (3.13). First-order conditions for the sample DRO problem and the DOO problems can be derived similarly.
Appendix B Proof of Proposition 3.2
We use the Implicit Function Theorem to show existence of the solution of (3.8) and to characterize its smoothness. (Existence and smoothness of can be shown in a similar way). With a mild abuse of notation, let
The first-order conditions for the population problems (3.8) are
(B.4) |
Since is continuously differentiable in a neighborhood of (Theorem 3.2 in [12]), is continuously differentiable in in some open neighborhood of for every fixed . By (A.4)
and
(B.6) |
where is the solution of the empirical problem. Since is twice continuously differentiable in , is continuously differentiable in a neighborhood of , and
(B.8) | |||||
(B.11) |
is invertible. It follows from the Implicit Function Theorem that exists and is continuously differentiable in an open neighborhood of so we can write
where and is a constant. Substituting into the first equation in (B.6), a Taylor series expansion around gives
Optimality of for the population problem implies while the coefficient of the term is zero if
Smoothness of and the expression (3.17)-(3.18) can be established in a similar manner.
Appendix C Proof of Proposition 3.6
We begin with some preliminaries. Let be defined by (3.13), and
be the Jacobian of and the matrices
We also define the valued random vectors
(C.2) |
where
Here, is the component of the dimensional function
that was defined in (3.13), and is the Hessian evaluated at . Note that condition (2) of Assumption 3.5 implies that is three times continuously differentiable in a neighborhood of (it can be shown, along the lines of the proof of Theorem 3.2 in [12], that , where is the derivative of with respect to ). Together with condition (1), it follows that the Hessian evaluated at above is well defined.
The following result characterizes the distribution of and is obtained by applying Theorem 5.21 in [27] and Lemma 1 from [17] to (3.13).
Proposition C.1.
Suppose that data is drawn iid from , satisfies Assumption 2.1, satisfies Assumption 3.1, and is such that Assumption 3.5 holds. Then there is a unique solution of the first-order conditions (3.8), the matrix is invertible, , and for sufficiently large
has mean and covariance matrix
and is with mean
Proposition 3.6 is largely a restatement of Proposition C.1 for the component of . In particular, expression (3.29) for holds because we can extract the valued random vectors and of (C.2) corresponding to in (C.1)
and and from
and are continuously differentiable in a neighborhood of because the pair is continuously differentiable in a neighborhood of (Proposition 3.2) and because of the conditions imposed on in Assumption 3.5.
Appendix D Assumptions 4.1 and 4.2
We know from (3.24) that the solution of the SAA problem has an expansion of the form
(D.1) |
where the error term is . Assumption 4.1 holds if for . We now derive an expression for to determine conditions under which this holds. To ease notation, we assume is scalar. Our derivation shows that is a product of derivatives of with respect to of up to order , and Assumption 4.1 holds if the first two moments of are finite.
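Although the specific symbols in (D.1) are not recoverable here, the generic shape of such a higher-order expansion of an M-estimator (in hypothetical notation; cf. [25]) is

```latex
\hat{\theta}_n \;=\; \theta^{*} \;+\; \frac{A_n}{\sqrt{n}} \;+\; \frac{B_n}{n} \;+\; R_n,
\qquad R_n = o_p\!\left(n^{-1}\right),
```

where $A_n$ is asymptotically normal, $B_n$ is built from products of derivatives of the objective of order up to four, and the moment condition amounts to requiring that the first two moments of the terms making up the remainder are finite.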
Our approach generalizes to the case where the decision variable is multi-dimensional, which is required for DRO and DOO (Assumption 4.2), where the pair is the decision, and for SAA with vector-valued , at the cost of even uglier equations and tensor notation. Not surprisingly, the same insight holds: products of (partial) derivatives of order up to 4 must have finite first and second moments. The equations are long but not particularly insightful (beyond this observation) and the derivation is arduous. We outline how this can be done and leave the details to the reader.
To derive the expansion (D.1), we embed the first order conditions for our sample problem in a family of fixed-point equations parameterized by where
Here
is the deviation of the random reward from its expectation , while is the derivative of the centered random variable. We denote the solution of as . (We change the argument in from to to minimize the chance of confusion between , the solution of the SAA problem, and the solution of the fixed point problem). The case
corresponds to the population problem with solution ; is the SAA problem with solution
To begin, we find the derivatives of with respect to , or , and , such that
Since we do this by expanding around and choosing , and to ensure the first-order condition holds.
Specifically, a Taylor series expansion around gives (after some work)
Since , the coefficients of the powers of must be zero and hence
where
which gives us the derivatives of we are looking for. This gives the Taylor series expansion
where is a random variable taking values in that depends on the data. In particular, with and noting that when
where
and is a random variable taking values between and . Observe that and , as defined in (3.24) and that the error term in (D.1) is . Assumption 4.1 holds if for . This will be the case if there is a constant such that for every and the derivatives of order are uniformly bounded, though this is not the mildest assumption.
In the case of DRO/DOO, we have first-order conditions
(D.4) |
for the solution of the population problem, and
(D.7) |
for the solution of the empirical version. Analogous to the case of SAA, we can define as the solution of () where
This allows us to embed (D.4) and (D.7) in the problem of solving . Expressions for , and the remainder term such that
where is a random variable between and , can be derived, though the equations are long and unwieldy. Setting we get
where , and the remainder term is a product of partial derivatives of
with respect to and of up to order , evaluated at for some random that depends on the data. Assumption 4.2 holds if for . Again, this will be the case if, for example, the partial derivatives up to order are uniformly bounded.
Appendix E Proof of Proposition 4.3
Recall from (4.4) that
(E.1) |
and hence
(E.2) |
From Proposition 3.6, the DRO/DOO solution
where . Taking expectations gives
(E.3) | |||||
Using (E.1) we can write (E.3) as
(E.4) | |||||
We now expand around :
where the second equality follows from (E.2) and (E.4). It now follows that the expected reward at the mean of the robust solution
In the case of the loss from Jensen’s inequality, recall that the variance of the solution
It now follows from a Taylor series expansion that
where the second equality follows from (E.3), the fourth is a Taylor series expansion of the function around , and the last equality follows from (4.7). Proposition 4.3 follows after substituting (E) and (E) into the decomposition (4.9).