
A data-driven approach to beating SAA out-of-sample

Jun-ya Gotoh†, Michael Jong Kim‡, and Andrew E. B. Lim*
†Department of Data Science for Business Innovation, Chuo University, Tokyo, Japan. Email: [email protected]
‡Sauder School of Business, University of British Columbia, Vancouver, Canada. Email: [email protected]
*Department of Analytics and Operations, Department of Finance, and Institute for Operations Research and Analytics, National University of Singapore, Singapore. Email: [email protected]
Abstract.

While solutions of Distributionally Robust Optimization (DRO) problems can sometimes have a higher out-of-sample expected reward than the Sample Average Approximation (SAA), there is no guarantee. In this paper, we introduce a class of Distributionally Optimistic Optimization (DOO) models, and show that it is always possible to “beat” SAA out-of-sample if we consider not just worst-case (DRO) models but also best-case (DOO) ones. We also show, however, that this comes at a cost: optimistic solutions are more sensitive to model error than either worst-case or SAA optimizers, and hence are less robust; moreover, calibrating the worst- or best-case model so that it outperforms SAA may be difficult when data is limited.

Keywords: Distributionally Optimistic Optimization (DOO), Distributionally Robust Optimization (DRO), Sample Average Approximation (SAA), data-driven optimization, model uncertainty, worst-case sensitivity, out-of-sample performance.

1. Introduction

It is well known that solutions of optimization problems calibrated from data can perform poorly out-of-sample. This occurs due to errors in both the modeling assumptions and the estimation of model parameters and has motivated various optimization models which account for misspecification in the in-sample model. Distributionally Robust Optimization (DRO), where the decision maker optimizes against worst-case perturbations from the nominal (in-sample) distribution, is one such approach. Similar models have been introduced in a number of communities [2, 7, 8, 14, 24].

The performance of DRO is often evaluated by comparing its out-of-sample expected reward to that of the SAA optimizer. This is often done experimentally, and recent papers characterize finite-sample and asymptotic properties of DRO solutions and the associated expected reward [3, 9, 10, 11, 16].

It has been observed that solutions of DRO and other worst-case models can sometimes have a higher out-of-sample expected reward than the Sample Average Approximation (SAA) optimizer [1, 4, 11, 15]. However, there is no guarantee, and it is of interest whether a decision that beats SAA out-of-sample can be found when DRO is unable to. In this paper, we introduce a class of Distributionally Optimistic Optimization (DOO) problems, where nature works together with the decision maker to optimize the in-sample reward, and show under relatively mild assumptions that it is always possible to “beat” the SAA optimizer out-of-sample if we consider best-case (DOO) and worst-case (DRO) decisions.

As tempting as this may sound, there is a catch. While the out-of-sample expected reward might be larger, the expected reward under a best-case optimizer is always more sensitive to worst-case perturbations from the nominal model than the expected reward under the SAA optimizer, and hence is less robust. More generally, worst- and best-case optimizers make a tradeoff between (in-sample) expected reward and sensitivity to model error. We potentially miss the benefits of sensitivity reduction if we only focus on the mean reward, and could even make things worse if this leads to choosing a best-case solution.

In contrast to worst-case problems, the literature on DOO is small, though recent activity suggests an uptick in interest that is generally centered around the concern that worst-case optimization is “too pessimistic.” One disadvantage of optimistic optimization, however, is that convexity can be lost, making it potentially more difficult to solve (computationally) than its worst-case cousin. Non-convexity is considered in [23] where optimistic optimization problems are shown to be related to non-convex regularizers used in machine learning. A majorization minimization algorithm is employed in [23] to efficiently obtain a critical point of their nonconvex optimization problem by making use of its “difference of convex” structure. Also related is [21], which uses optimistic optimization over probability distributions to construct a nonparametric approximation of the likelihood function that can be used in Bayesian statistics. The paper [26] uses DOO to solve the Trust Region Policy Optimization problem in the context of Reinforcement Learning, while optimistic optimization is also proposed in [6] for online contextual decision making. Finally, a recent paper [6] considers best-case optimization in stochastic optimization with model uncertainty using the “Rockafellian” framework for perturbation analysis. Best-case problems are also used in the empirical likelihood approach to generating confidence intervals of the optimal expected reward under the population distribution [10, 19, 20, 28]. We also note that the National Science Foundation recently awarded a grant [29] on “Favorable optimization under distributional distortions”. These papers do not consider the out-of-sample performance of solutions of best-case problems, nor the impact on the sensitivity of the out-of-sample expected reward to errors in the in-sample model.

A recent paper [18] shows that it is impossible, asymptotically, for a large class of data-driven problems, including DRO and DOO and popular regularization techniques, to beat the out-of-sample expected reward of SAA. While this appears to directly contradict our results, there is actually no inconsistency. The main difference is that [18] considers the large data limit, whereas our results apply to data sets of moderate size. Taken together, it is always possible to beat SAA using DRO/DOO with a finite data set (this paper), but impossible in the limit [18]. We discuss [18] in more detail at the end of this paper.

Organization

We introduce the Distributionally Optimistic Optimization (DOO) and Distributionally Robust Optimization (DRO) models in Section 2. For concreteness, we adopt a penalty framework with a smooth $\phi$-divergence as the penalty function. In Section 3 we show that the family of worst- and best-case solutions is continuously differentiable in a neighborhood of the SAA solution and characterize their distributional properties. In Section 4 we show that it is generally possible to find a DRO or DOO solution with a higher out-of-sample expected reward than SAA, and in Section 5 that a best-case solution is always less robust than SAA. The key ideas are illustrated using a data-driven inventory problem.

2. Setup

Let $f:\mathbb{R}^{d}\times\mathbb{R}^{l}\rightarrow\mathbb{R}$, and let $Y$ be an $\mathbb{R}^{l}$-valued random vector with population distribution $\mathbb{P}$. Consider the problem

\max_{x}\;\mathbb{E}_{\mathbb{P}}\big[f(x,\,Y)\big].   (2.1)

Let

x^{\star}(0):=\arg\max_{x}\Big\{\mathbb{E}_{\mathbb{P}}\big[f(x,\,Y)\big]\Big\}   (2.2)

denote the solution of (2.1). We assume the following.

Assumption 2.1.

The function $f(x,\,Y)$ and random vector $Y\sim\mathbb{P}$ are such that

  • $f(x,\,Y)$ is strictly concave and twice continuously differentiable in $x\in\mathbb{R}^{d}$ for $\mathbb{P}$-almost every $Y\in\mathbb{R}^{l}$;

  • for each fixed $x\in\mathbb{R}^{d}$, the mappings $y\mapsto f(x,y),\ \nabla_{x}f(x,y),\ \nabla^{2}_{x}f(x,y)$, $y\in\mathbb{R}^{l}$, are measurable and all moments of the random variables $f(x,Y)$, $\nabla_{x}f(x,Y)$, $\nabla^{2}_{x}f(x,Y)$ exist;

  • there exists a solution $x^{\star}(0)$ of (2.1).

In many applications, we do not know the population distribution $\mathbb{P}$ but only have independent and identically distributed (iid) samples $Y_{1},\cdots,Y_{n}$ drawn from $\mathbb{P}$. This naturally leads to replacing (2.1) with a Sample Average Approximation (SAA):

\max_{x}\;\mathbb{E}_{\mathbb{P}_{n}}\big[f(x,\,Y)\big]\equiv\sum_{i=1}^{n}p_{i}^{n}f(x,\,Y_{i}).   (2.3)

Here $\mathbb{P}_{n}=[p_{1}^{n},\cdots,p_{n}^{n}]$ is the empirical distribution; we assume without loss of generality that $p^{n}_{i}>0$ for all $i$. We denote the solution of the in-sample problem (2.3) by

x_{n}(0):=\operatorname*{arg\,max}_{x}\sum_{i=1}^{n}p_{i}^{n}f(x,\,Y_{i}).
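To fix ideas, the following minimal sketch (ours, not from the paper) solves an SAA problem of the form (2.3) with an off-the-shelf solver; the concave reward $f(x,Y)=-(x-Y)^{2}$ is an illustrative choice whose SAA maximizer is simply the sample mean.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
Y = rng.exponential(scale=2.0, size=30)     # iid samples from an (assumed) population P
p = np.full(len(Y), 1.0 / len(Y))           # empirical weights p_i^n

def f(x, y):
    # illustrative concave reward; its SAA maximizer is the sample mean
    return -(x - y) ** 2

def saa_objective(x):
    # negative of sum_i p_i^n f(x, Y_i), since scipy minimizes
    return -np.dot(p, f(x[0], Y))

res = minimize(saa_objective, x0=[0.0])
x_saa = res.x[0]
print(x_saa, Y.mean())                      # the two agree up to solver tolerance
```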

It is well known that the SAA solution $x_{n}(0)$ may not perform well out-of-sample. This has motivated worst-case versions of SAA, called Distributionally Robust Optimization (DRO), where the decision is chosen to maximize the expected reward under worst-case perturbations of the probability distribution $\mathbb{Q}$ from the empirical distribution $\mathbb{P}_{n}$:

\max_{x}\min_{\mathbb{Q}}\sum_{i=1}^{n}q_{i}f(x,\,Y_{i})+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}_{n}),\quad\delta>0,   (2.4)

where

\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}_{n}):=\sum_{i=1}^{n}p^{n}_{i}\phi\Big(\frac{q_{i}}{p^{n}_{i}}\Big),\qquad \sum_{i=1}^{n}q_{i}=1,\;q_{i}\geq 0,   (2.5)

is the $\phi$-divergence of $\mathbb{Q}=[q_{1},\cdots,q_{n}]$ relative to $\mathbb{P}_{n}=[p_{1}^{n},\cdots,p_{n}^{n}]$. We assume throughout that $\phi:\mathbb{R}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a convex lower semi-continuous function such that $\phi(z)\geq\phi(1)=0$ for $z\geq 0$ and $\phi(z)=+\infty$ for $z<0$.

Distributionally optimistic optimization (DOO)

We now consider an optimistic version of (2.3) where nature works together with the decision maker to optimize the expected reward. The Distributionally Optimistic Optimization (DOO) model is

\max_{x}\max_{\mathbb{Q}}\sum_{i=1}^{n}q_{i}f(x,\,Y_{i})+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}_{n}),\quad\delta<0.   (2.6)

Here, nature selects $\mathbb{Q}$ to maximize the expected reward and incurs a (negative) penalty for deviating from the nominal $\mathbb{P}_{n}$; the parameter $\delta$ controls the size of the deviations. The decision maker accepts that $\mathbb{P}_{n}$ may be misspecified, and makes a decision that maximizes the expected reward if nature is cooperative. DOO may be of interest in applications where the decision maker is looking for upside opportunities and does not wish to be conservative in the face of model uncertainty. We also note that one of the likelihood approximations in [21] has the form of the inner problem in (2.6). They do not consider optimization, though an optimistic version of Maximum Likelihood Estimation is a natural step in this direction and an example of (2.6). Though our results provide insights about general properties of DOO and its solutions, applications of DOO are not the focus of this paper.
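As a numerical illustration (ours), the sketch below solves the inner problem of (2.4) and (2.6) for a fixed decision $x$ and a given vector of rewards $f(x,Y_{i})$, using the modified-$\chi^{2}$ divergence $\phi(z)=\frac{1}{2}(z-1)^{2}$ that also appears in the inventory example of Section 4. The sign of $\delta$ determines whether nature is adversarial ($\delta>0$, DRO) or cooperative ($\delta<0$, DOO).

```python
import numpy as np
from scipy.optimize import minimize

def inner_value(f_vals, p, delta):
    """Worst-case (delta > 0) or best-case (delta < 0) inner problem of (2.4)/(2.6),
    with the modified chi-square divergence phi(z) = 0.5*(z - 1)^2."""
    phi = lambda z: 0.5 * (z - 1.0) ** 2

    def objective(q):
        val = q @ f_vals + np.sum(p * phi(q / p)) / delta
        # nature minimizes when delta > 0 (DRO) and maximizes when delta < 0 (DOO);
        # scipy minimizes, so flip the sign in the DOO case
        return val if delta > 0 else -val

    n = len(p)
    res = minimize(objective, x0=p.copy(), method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}])
    q = res.x
    return q @ f_vals + np.sum(p * phi(q / p)) / delta, q

# toy data: rewards f(x, Y_i) at a fixed decision x
rng = np.random.default_rng(1)
f_vals = rng.normal(size=20)
p = np.full(20, 1.0 / 20)

v_dro, _ = inner_value(f_vals, p, delta=+0.5)   # <= the empirical mean
v_doo, _ = inner_value(f_vals, p, delta=-0.5)   # >= the empirical mean
print(p @ f_vals, v_dro, v_doo)
```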

It is convenient to introduce population versions of the worst-case and best-case problems. If

\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}):=\mathbb{E}_{\mathbb{P}}\Big[\phi\Big(\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\mathbb{P}}\Big)\Big]   (2.7)

denotes the $\phi$-divergence, where $\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\mathbb{P}}$ is the Radon-Nikodym derivative (likelihood ratio) of $\mathbb{Q}$ with respect to $\mathbb{P}$, then the population version of the worst-case problem is

\max_{x}\min_{\mathbb{Q}}\;\mathbb{E}_{\mathbb{Q}}[f(x,\,Y)]+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}),\quad\delta>0,   (2.8)

and

\max_{x}\max_{\mathbb{Q}}\;\mathbb{E}_{\mathbb{Q}}[f(x,\,Y)]+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}),\quad\delta<0,   (2.9)

is the best-case problem.

We denote the solutions of (2.2), (2.8) and (2.9) by $x^{\star}(\delta)$ with $\delta=0$, $\delta>0$ and $\delta<0$, respectively, and $x_{n}(\delta)$ for the sample versions (2.3), (2.4) and (2.6). It will be shown in Section 3 that $x^{\star}(\delta)$ and $x_{n}(\delta)$ are continuously differentiable in a neighborhood of $\delta=0$. We consider the case where the decision variable $x$ is unconstrained as this allows us to directly connect changes to the SAA solution from DRO/DOO to out-of-sample performance. We defer analysis of the constrained case to future work.

Dual characterization of worst/best-case objective

Let

\phi^{*}(\zeta):=\max_{z}\big\{\zeta z-\phi(z)\big\}

denote the convex conjugate of $\phi(z)$. We have the following dual representation of the worst-case problems (2.4) and (2.8), which is well known and can be established using convex duality. We state it without proof; the interested reader can see [12].

Proposition 2.2 (Dual characterization for DRO).

Suppose that $\phi:\mathbb{R}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a convex lower-semicontinuous function such that $\phi(z)\geq\phi(1)=0$ for $z\geq 0$ and $\phi(z)=+\infty$ for $z<0$. If $\delta>0$, then

\min_{\mathbb{Q}}\Big\{\mathbb{E}_{\mathbb{Q}}[f(x,\,Y)]+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P})\Big\}=\max_{c}\Big\{-\frac{1}{\delta}\mathbb{E}_{\mathbb{P}}\Big[\phi^{*}\Big(-\delta\big[f(x,\,Y)+c\big]\Big)\Big]-c\Big\}   (2.10)

and

\min_{\mathbb{Q}}\Big\{\sum_{i=1}^{n}q_{i}f(x,\,Y_{i})+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}_{n})\Big\}=\max_{c}\Big\{-\frac{1}{\delta}\sum_{i=1}^{n}p_{i}^{n}\phi^{*}\Big(-\delta\big[f(x,\,Y_{i})+c\big]\Big)-c\Big\}.   (2.11)

Similarly, we can derive the dual representation for the optimistic problems (2.6) and (2.9).

Proposition 2.3 (Dual characterization for DOO).

Suppose that $\phi:\mathbb{R}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a convex lower-semicontinuous function such that $\phi(z)\geq\phi(1)=0$ for $z\geq 0$ and $\phi(z)=+\infty$ for $z<0$. If $\delta<0$, then

\max_{\mathbb{Q}}\Big\{\mathbb{E}_{\mathbb{Q}}[f(x,\,Y)]+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P})\Big\}=\min_{c}\Big\{-\frac{1}{\delta}\mathbb{E}_{\mathbb{P}}\Big[\phi^{*}\Big(-\delta\big[f(x,\,Y)+c\big]\Big)\Big]-c\Big\}   (2.12)

and

\max_{\mathbb{Q}}\Big\{\sum_{i=1}^{n}q_{i}f(x,\,Y_{i})+\frac{1}{\delta}\mathcal{H}_{\phi}(\mathbb{Q}\,|\,\mathbb{P}_{n})\Big\}=\min_{c}\Big\{-\frac{1}{\delta}\sum_{i=1}^{n}p_{i}^{n}\phi^{*}\Big(-\delta\big[f(x,\,Y_{i})+c\big]\Big)-c\Big\}.   (2.13)
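The dual problems (2.11) and (2.13) replace an $n$-dimensional optimization over the simplex by a one-dimensional optimization over $c$. The sketch below (ours) checks the two identities numerically for the modified-$\chi^{2}$ divergence, whose conjugate is $\phi^{*}(\zeta)=\zeta+\zeta^{2}/2$ for $\zeta\geq-1$ and $\phi^{*}(\zeta)=-1/2$ otherwise.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(2)
f_vals = rng.normal(size=15)                 # rewards f(x, Y_i) at a fixed decision x
p = np.full(15, 1.0 / 15)
phi = lambda z: 0.5 * (z - 1.0) ** 2                              # modified chi-square divergence
phi_star = lambda t: np.where(t >= -1.0, t + 0.5 * t ** 2, -0.5)  # its convex conjugate

def primal(delta):
    """Inner problem over Q on the probability simplex (min for delta > 0, max for delta < 0)."""
    sign = 1.0 if delta > 0 else -1.0
    obj = lambda q: sign * (q @ f_vals + np.sum(p * phi(q / p)) / delta)
    res = minimize(obj, p.copy(), method="SLSQP", bounds=[(0.0, 1.0)] * len(p),
                   constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}])
    return sign * res.fun

def dual(delta):
    """One-dimensional dual (2.11)/(2.13): optimize over the scalar c only."""
    g = lambda c: -np.sum(p * phi_star(-delta * (f_vals + c))) / delta - c
    res = minimize_scalar(lambda c: -g(c)) if delta > 0 else minimize_scalar(g)
    return g(res.x)

for delta in (+0.5, -0.5):
    print(delta, primal(delta), dual(delta))   # primal and dual values agree to solver tolerance
```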

3. Characterization of the solution

3.1. In-sample problems

It follows from Proposition 2.2 that the worst-case problems (2.4) and (2.8) are equivalent to

\max_{x}\max_{c}\Big\{-\frac{1}{\delta}\mathbb{E}_{\mathbb{P}}\Big[\phi^{*}\Big(-\delta\big[f(x,\,Y)+c\big]\Big)\Big]-c\Big\},\quad\delta>0,   (3.1)
\max_{x}\max_{c}\Big\{-\frac{1}{\delta}\sum_{i=1}^{n}p_{i}^{n}\phi^{*}\Big(-\delta\big[f(x,\,Y_{i})+c\big]\Big)-c\Big\},\quad\delta>0,

while by Proposition 2.3, the best-case problems (2.6) and (2.9) become

\max_{x}\min_{c}\Big\{-\frac{1}{\delta}\mathbb{E}_{\mathbb{P}}\Big[\phi^{*}\Big(-\delta\big[f(x,\,Y)+c\big]\Big)\Big]-c\Big\},\quad\delta<0,   (3.2)
\max_{x}\min_{c}\Big\{-\frac{1}{\delta}\sum_{i=1}^{n}p_{i}^{n}\phi^{*}\Big(-\delta\big[f(x,\,Y_{i})+c\big]\Big)-c\Big\},\quad\delta<0.

We study properties of the solutions of these problems using the first-order conditions. The assumptions on the function $\phi(z)$ need to be strengthened.

Assumption 3.1.

$\phi:\mathbb{R}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a convex lower-semicontinuous function such that $\phi(z)\geq\phi(1)=0$ for $z\geq 0$, $\phi(z)=+\infty$ for $z<0$, and $\phi$ is twice continuously differentiable around $z=1$ with $\phi^{\prime\prime}(1)>0$.

The first-order conditions for the sample and population problems in (3.1) and (3.2) can be written in the form

\mathbb{E}_{\mathbb{P}_{n}}\big[\psi(x,\,c,\,Y)\big]=\begin{bmatrix}0\\ 0\end{bmatrix}   (3.5)

and

\mathbb{E}_{\mathbb{P}}\big[\psi(x,\,c,\,Y)\big]=\begin{bmatrix}0\\ 0\end{bmatrix},   (3.8)

respectively, where $\delta>0$ for DRO and $\delta<0$ for DOO, and

\psi(x,\,c,\,Y)\equiv\begin{bmatrix}\psi_{1}(x,\,c,\,Y)\\ \psi_{2}(x,\,c,\,Y)\end{bmatrix}:=\begin{bmatrix}[\phi^{*}]^{\prime}\big(-\delta\big[f(x,\,Y)+c\big]\big)\nabla_{x}f(x,\,Y)\\[5pt] -\dfrac{\phi^{\prime\prime}(1)}{\delta}\Big\{[\phi^{*}]^{\prime}\big(-\delta\big(f(x,\,Y)+c\big)\big)-1\Big\}\end{bmatrix}.   (3.13)

(see Appendix A for a derivation). Since

\phi^{*}(\zeta)=\zeta+\frac{1}{2!}\frac{\zeta^{2}}{\phi^{\prime\prime}(1)}+o(\zeta^{2})

(see Theorem 3.2 in [12]), it follows that

[\phi^{*}]^{\prime}(\zeta)=1+\frac{\zeta}{\phi^{\prime\prime}(1)}+o(\zeta),

and

\psi(x,\,c,\,Y)=\begin{bmatrix}\nabla_{x}f(x,Y)-\dfrac{\delta}{\phi^{\prime\prime}(1)}\Big(f(x,Y)+c\Big)\nabla_{x}f(x,Y)+o(\delta)\\[10pt] f(x,Y)+c+O(\delta)\end{bmatrix}.

To ease notation, we suppress the dependence on $\delta$ in the function $\psi(x,\,c,\,Y)$; it should be clear from the context whether we are talking about the worst-case ($\delta>0$) or the best-case ($\delta<0$) problem.

Let $(x_{n}(\delta),\,c_{n}(\delta))$ and $(x^{\star}(\delta),\,c^{\star}(\delta))$ denote the solutions of the in-sample (3.5) and population (3.8) problems, respectively. The following result shows that the family of solutions parameterized by $\delta$ exists and is continuously differentiable in a neighborhood of $\delta=0$. There is a similar result in [11], though the focus there is limited to DRO problems ($\delta\geq 0$). Proposition 3.2 shows that the continuation to negative values of $\delta$ consists of solutions of DOO problems. The proof is in the Appendix.

Proposition 3.2.

Suppose that $f(x,\,Y)$ satisfies Assumption 2.1 and $\phi(z)$ satisfies Assumption 3.1. Then there is an open neighborhood of $\delta=0$ where the solutions $(x_{n}(\delta),\,c_{n}(\delta))$ and $(x^{\star}(\delta),\,c^{\star}(\delta))$ of (3.5) and (3.8), respectively, exist and are continuously differentiable. In particular,

\left\{\begin{array}{l}x_{n}(\delta)=x_{n}(0)+\pi_{n}\delta+o(\delta),\\[5pt] c_{n}(\delta)=-\mathbb{E}_{\mathbb{P}_{n}}[f(x_{n}(0),\,Y)]+O(\delta),\end{array}\right.   (3.17)

where¹

\pi_{n}:=\frac{1}{\phi^{\prime\prime}(1)}\,\Big(\mathbb{E}_{\mathbb{P}_{n}}[\nabla^{2}_{x}f(x_{n}(0),\,Y)]\Big)^{-1}\mathrm{Cov}_{\mathbb{P}_{n}}\Big[\nabla_{x}f(x_{n}(0),\,Y),\,f(x_{n}(0),\,Y)\Big],   (3.18)

and

\left\{\begin{array}{l}x^{\star}(\delta)=x^{\star}(0)+\pi\delta+o(\delta),\\[5pt] c^{\star}(\delta)=-\mathbb{E}_{\mathbb{P}}[f(x^{\star}(0),\,Y)]+O(\delta),\end{array}\right.   (3.21)

where

\pi:=\frac{1}{\phi^{\prime\prime}(1)}\,\Big(\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),\,Y)]\Big)^{-1}\mathrm{Cov}_{\mathbb{P}}\Big[\nabla_{x}f(x^{\star}(0),\,Y),\,f(x^{\star}(0),\,Y)\Big].   (3.22)

¹If $\phi(z)$ is three times continuously differentiable, it can be shown that $c_{n}(\delta)=-\mathbb{E}_{\mathbb{P}_{n}}[f(x_{n}(0),\,Y)]-\frac{\delta}{2}\frac{\phi^{\prime\prime\prime}(1)}{[\phi^{\prime\prime}(1)]^{2}}\mathbb{V}_{\mathbb{P}_{n}}\big[f(x_{n}(0),\,Y)\big]+o(\delta)$. However, the first-order term is not needed in our analysis.

Proposition 3.2 shows that worst- and best-case optimization adds a bias in the direction $\pi_{n}$ to the SAA maximizer $x_{n}(0)$ (and similarly for $x^{\star}(\delta)$).
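Since $\pi_{n}$ in (3.18) involves only in-sample moments, it can be computed directly from data. A minimal sketch (ours) for a scalar decision, with the illustrative reward $f(x,Y)=-(x-Y)^{2}$ and the modified-$\chi^{2}$ divergence (so $\phi^{\prime\prime}(1)=1$):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.exponential(scale=2.0, size=200)

# illustrative reward f(x, Y) = -(x - Y)^2 with a scalar decision x
x_saa = Y.mean()                                 # SAA maximizer x_n(0)
f_vals = -(x_saa - Y) ** 2                       # f(x_n(0), Y_i)
grad   = -2.0 * (x_saa - Y)                      # d/dx f(x_n(0), Y_i)
hess   = -2.0 * np.ones_like(Y)                  # d^2/dx^2 f(x_n(0), Y_i)

phi_dd = 1.0                                     # phi''(1) for the modified chi-square divergence
cov_grad_f = np.mean(grad * f_vals) - np.mean(grad) * np.mean(f_vals)
pi_n = (1.0 / phi_dd) * (1.0 / np.mean(hess)) * cov_grad_f    # formula (3.18), scalar case

delta = 0.05
print(x_saa, x_saa + pi_n * delta)   # first-order prediction of x_n(delta) from (3.17)
```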

Clearly, the in-sample expected reward under the robust optimizer is smaller than that of the empirical optimizer, no matter the sign of $\delta$. When applied out-of-sample, however, this need not be the case because the solutions are random variables under the population distribution $\mathbb{P}$. To compare the out-of-sample reward of the best- and worst-case solutions with that of the SAA optimizer, we need to understand the impact of data variability on the solution.

3.2. Statistical properties of the SAA solution

The following regularity assumptions on $f(x,Y)$ will be needed to characterize the statistical properties of $x_{n}(0)$.

Assumption 3.3.

  (1) The function $f(x,Y)$ is three times continuously differentiable in $x\in\mathbb{R}^{d}$ for $\mathbb{P}$-almost all $Y\in\mathbb{R}^{l}$;

  (2) as $n\rightarrow\infty$,

    \sup_{x}\Big|\mathbb{E}_{\mathbb{P}_{n}}[\nabla_{x}f(x,\,Y)]-\mathbb{E}_{\mathbb{P}}[\nabla_{x}f(x,\,Y)]\Big|\overset{P}{\longrightarrow}0;

  (3) for every $\epsilon>0$,

    \inf_{x:\,|x-x^{\star}(0)|\geq\epsilon}\Big|\mathbb{E}_{\mathbb{P}}[\nabla_{x}f(x,\,Y)]\Big|>0=\Big|\mathbb{E}_{\mathbb{P}}[\nabla_{x}f(x^{\star}(0),\,Y)]\Big|;

  (4) the 2nd and 3rd order derivatives of $f(x,\,Y)$ exist for any $x$ in a neighborhood of $x^{\star}(0)$ and

    \mathbb{E}_{\mathbb{P}}\big|\nabla_{x}^{2}f(x,Y)\big|^{2}<\infty,\qquad \mathbb{E}_{\mathbb{P}}\Big|\nabla_{x}^{2}\Big(\frac{\partial f}{\partial x_{i}}(x,\,Y)\Big)\Big|^{2}<\infty,\quad i=1,\cdots,d,

    where $\nabla_{x}^{2}\big(\frac{\partial f}{\partial x_{i}}(x,\,Y)\big)$ is the $\mathbb{R}^{d\times d}$-valued Hessian of the real-valued function $\frac{\partial f}{\partial x_{i}}(x,\,Y)$;

  (5) for any $x$ in some neighborhood of $x^{\star}(0)$,

    \Big(\mathbb{E}_{\mathbb{P}_{n}}\big[\nabla_{x}^{2}f(x,Y)\big]\Big)^{-1}=O_{P}(1);

  (6) for any $x$ in some neighborhood of $x^{\star}(0)$ and a random variable $M$ such that $\mathbb{E}_{\mathbb{P}}|M|<\infty$,

    \mathbb{E}_{\mathbb{P}}\Big|\nabla_{x}^{2}f(x,Y)-\nabla_{x}^{2}f(x^{\star}(0),Y)\Big|<M|x-x^{\star}(0)|,\qquad \mathbb{E}_{\mathbb{P}}\Big|\nabla_{x}^{2}\Big(\frac{\partial f}{\partial x_{i}}(x,\,Y)\Big)-\nabla_{x}^{2}\Big(\frac{\partial f}{\partial x_{i}}(x^{\star}(0),\,Y)\Big)\Big|<M|x-x^{\star}(0)|.

We now characterize the statistical properties of the SAA solution. Let

A(0):=\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}f(x^{\star}(0),\,Y)\big]\in\mathbb{R}^{d\times d},\qquad B(0):=\mathbb{E}_{\mathbb{P}}\big[\nabla_{x}f(x^{\star}(0),\,Y)\,\nabla_{x}f(x^{\star}(0),\,Y)^{\prime}\big]\in\mathbb{R}^{d\times d},

and define the random vectors

W_{n}(0):=-A(0)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla_{x}f(x^{\star}(0),\,Y_{i}),\qquad V_{n}(0):=H_{n}(0)W_{n}(0)-I_{n}(0),

where

H_{n}(0):=-\frac{1}{\sqrt{n}}A(0)^{-1}\sum_{i=1}^{n}\Big\{\nabla^{2}_{x}f(x^{\star}(0),\,Y_{i})-\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}f(x^{\star}(0),\,Y)\big]\Big\},

I_{n}(0):=A(0)^{-1}\begin{bmatrix}W_{n}(0)^{\prime}\,\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}\big(\frac{\partial f}{\partial x_{1}}(x^{\star}(0),\,Y)\big)\big]\,W_{n}(0)\\ \vdots\\ W_{n}(0)^{\prime}\,\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}\big(\frac{\partial f}{\partial x_{d}}(x^{\star}(0),\,Y)\big)\big]\,W_{n}(0)\end{bmatrix}.

By condition (1) of Assumption 3.3, $\nabla^{2}_{x}\big(\frac{\partial f}{\partial x_{i}}(x^{\star}(0),\,Y)\big)$ is guaranteed to exist.

The following result gives the statistical properties of the solution of the SAA problem.

Proposition 3.4.

Suppose that data $\{Y_{1},\cdots,Y_{n}\}$ is drawn iid from $\mathbb{P}$ and $f(x,\,Y)$ satisfies Assumptions 2.1 and 3.3. Then there is a unique solution $x^{\star}(0)$ of the first-order conditions $\mathbb{E}_{\mathbb{P}}[\nabla_{x}f(x,Y)]=0$ of the population problem (2.1), the matrix $A(0)$ is invertible, $x_{n}(0)\overset{P}{\longrightarrow}x^{\star}(0)$ as $n\rightarrow\infty$, and

x_{n}(0)=x^{\star}(0)+\frac{1}{\sqrt{n}}W_{n}(0)+\frac{1}{n}V_{n}(0)+o_{P}(n^{-3/2}),   (3.24)

where $W_{n}(0)$ has mean $0$ and covariance matrix

\xi(0)=A(0)^{-1}B(0)A(0)^{-1\,\prime}=\mathbb{V}_{\mathbb{P}}\Big[\Big(\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}f(x^{\star}(0),\,Y)\big]\Big)^{-1}\nabla_{x}f(x^{\star}(0),\,Y)\Big],

and $V_{n}(0)$ is $O_{P}(1)$ with mean

\overline{V}(0)=\mathbb{E}_{\mathbb{P}}\Big[A(0)^{-1}\nabla^{2}_{x}f(x^{\star}(0),\,Y)\,A(0)^{-1}\nabla_{x}f(x^{\star}(0),\,Y)\Big]-A(0)^{-1}\begin{bmatrix}\mathrm{tr}\big\{\xi(0)\,\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}\frac{\partial f}{\partial x_{1}}(x^{\star}(0),\,Y)\big]\big\}\\ \vdots\\ \mathrm{tr}\big\{\xi(0)\,\mathbb{E}_{\mathbb{P}}\big[\nabla^{2}_{x}\frac{\partial f}{\partial x_{d}}(x^{\star}(0),\,Y)\big]\big\}\end{bmatrix}.   (3.28)
Proof.

Uniqueness of $x^{\star}(0)$ was established in Proposition 3.2, while consistency of $x_{n}(0)$ follows from Theorem 5.9 in [27]. Equation (3.24) is obtained by applying Lemma 2.1 in [17] or Lemma 3.1 in [25] to the first-order conditions of the SAA problem (2.3). Invertibility of $A(0)$ follows from Assumption 2.1. ∎

Equation (3.24) shows that the SAA solution has a bias $\frac{1}{n}\overline{V}(0)$ relative to the solution $x^{\star}(0)$ of the population problem.

3.3. Statistical properties of the DRO/DOO solution

The following assumptions on $\psi(x,c,Y)$ are a direct parallel of Assumption 3.3 for $f(x,Y)$.

Assumption 3.5.

  (1) The function $f(x,Y)$ is three times continuously differentiable in $x\in\mathbb{R}^{d}$ for $\mathbb{P}$-almost all $Y\in\mathbb{R}^{l}$, and $\phi(z)$ is three times continuously differentiable in $z\in\mathbb{R}$; hence $\phi^{*}(\zeta)$ is three times continuously differentiable in $\zeta$.

Let $\psi(x,c,Y)$ be defined by (3.13). The parameter $\delta$ in $\psi(x,c,Y)$ is such that

  (2) as $n\rightarrow\infty$,

    \sup_{(x,\,c)}\Big|\mathbb{E}_{\mathbb{P}_{n}}[\psi(x,\,c,\,Y)]-\mathbb{E}_{\mathbb{P}}[\psi(x,\,c,\,Y)]\Big|\overset{P}{\longrightarrow}0;

  (3) for every $\epsilon>0$,

    \inf_{(x,\,c):\,|(x,\,c)-(x^{\star}(\delta),\,c^{\star}(\delta))|\geq\epsilon}\Big|\mathbb{E}_{\mathbb{P}}[\psi(x,\,c,\,Y)]\Big|>0=\Big|\mathbb{E}_{\mathbb{P}}[\psi(x^{\star}(\delta),\,c^{\star}(\delta),\,Y)]\Big|;

  (4) the 1st and 2nd order derivatives of $\psi(x,c,Y)$ exist in a neighborhood of $(x^{\star}(\delta),c^{\star}(\delta))$ and

    \mathbb{E}_{\mathbb{P}}\big|\nabla_{(x,c)}\psi(x,c,Y)\big|^{2}<\infty,\qquad \mathbb{E}_{\mathbb{P}}\big|\nabla_{(x,c)}^{2}\psi^{(i)}(x,c,Y)\big|^{2}<\infty,\quad i=1,\cdots,d+1,

    where $\psi^{(i)}$ denotes the $i$th component of $\psi=[\psi^{(1)},\cdots,\psi^{(d)},\psi^{(d+1)}]^{\prime}$;

  (5) for some neighborhood of $(x^{\star}(\delta),c^{\star}(\delta))$,

    \Big(\mathbb{E}_{\mathbb{P}_{n}}\big[\nabla_{(x,c)}\psi(x,c,Y_{i})\big]\Big)^{-1}=O_{P}(1);

  (6) for some neighborhood of $(x^{\star}(0),c^{\star}(0))$ and a random variable $M$ such that $\mathbb{E}_{\mathbb{P}}|M|<\infty$,

    \mathbb{E}_{\mathbb{P}}\Big|\nabla_{(x,c)}\psi(x,c,Y)-\nabla_{(x,c)}\psi(x^{\star}(0),c^{\star}(0),Y)\Big|<M\big(|x-x^{\star}(0)|+|c-c^{\star}(0)|\big),
    \mathbb{E}_{\mathbb{P}}\Big|\nabla_{(x,c)}^{2}\psi^{(i)}(x,c,Y)-\nabla_{(x,c)}^{2}\psi^{(i)}(x^{\star}(0),c^{\star}(0),Y)\Big|<M\big(|x-x^{\star}(0)|+|c-c^{\star}(0)|\big).

Condition (1) guarantees that the Hessian $\nabla_{(x,c)}^{2}\psi^{(i)}(x,c,Y)$ exists. The third condition implies that $\delta$ is such that the first-order conditions (3.8) for the population problem have a unique solution. Under Assumptions 2.1 and 3.1, this is satisfied for every $\delta\geq 0$ (i.e. for SAA and the worst-case problems), and for all $\delta$ in an open neighborhood of $0$, which includes negative values.

The following result characterizes the statistical properties of the DRO/DOO solution $x_{n}(\delta)$.

Proposition 3.6.

Suppose that data $\{Y_{1},\cdots,Y_{n}\}$ is drawn iid from $\mathbb{P}$, $f(x,\,Y)$ satisfies Assumptions 2.1 and 3.3, $\phi(z)$ satisfies Assumption 3.1, and $\delta$ is such that Assumption 3.5 holds. Then there is a unique solution $(x^{\star}(\delta),\,c^{\star}(\delta))$ of the first-order conditions (3.8), $(x_{n}(\delta),\,c_{n}(\delta))\overset{P}{\longrightarrow}(x^{\star}(\delta),\,c^{\star}(\delta))$ as $n\rightarrow\infty$, and

x_{n}(\delta)=x^{\star}(\delta)+\frac{1}{\sqrt{n}}W_{n}(\delta)+\frac{1}{n}V_{n}(\delta)+o_{P}(n^{-3/2}),   (3.29)

where $W_{n}(\delta)$ has mean $0$ and covariance matrix $\xi(\delta)\equiv\mathbb{V}_{\mathbb{P}}[W_{n}(\delta)]$, and $V_{n}(\delta)$ is $O_{P}(1)$ with mean $\overline{V}(\delta)=\mathbb{E}_{\mathbb{P}}[V_{n}(\delta)]$. Both $\xi(\delta)$ and $\overline{V}(\delta)$ are continuously differentiable in a neighborhood of $\delta=0$.

Proof.

Uniqueness of $(x^{\star}(\delta),c^{\star}(\delta))$ was established in Proposition 3.2, while consistency of $(x_{n}(\delta),c_{n}(\delta))$ follows from Theorem 5.9 in [27]. A proof of the remaining results can be found in the Appendix, where expressions for the random vectors $W_{n}(\delta)$ and $V_{n}(\delta)$, the covariance matrix $\xi(\delta)$ of $W_{n}(\delta)$, and the mean $\overline{V}(\delta)$ of $V_{n}(\delta)$ can also be found. ∎

As in the case of the variance of $W_{n}(0)$ and the mean of $V_{n}(0)$ for SAA, $\xi(\delta)$ and $\overline{V}(\delta)$ do not depend on $n$. There is a bias of $\frac{1}{n}\overline{V}(\delta)$ relative to the population DRO/DOO optimizer $x^{\star}(\delta)$, and the variance $\frac{1}{n}\xi(\delta)$ is different from that of the SAA solution.

4. Out-of-sample expected reward

Let $x_{n}(\delta)$ be a solution of the best- or worst-case model, let $\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta),\,Y_{n+1})\,|\,x_{n}(\delta)\big]$ be the out-of-sample expected reward given $x_{n}(\delta)$, and let

\mu_{n}(\delta):=\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta),\,Y_{n+1})\big]\equiv\mathbb{E}_{\mathbb{P}}\big\{\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta),\,Y_{n+1})\,|\,x_{n}(\delta)\big]\big\}

be the expected out-of-sample reward after averaging over datasets $\{Y_{1},\cdots,Y_{n}\}$ of size $n$ and the outcome $Y_{n+1}$. Under the assumptions of our model, $Y_{n+1}$ has distribution $\mathbb{P}$, the distribution of $x_{n}(\delta)$ is characterized in Proposition 3.6, and $x_{n}(\delta)$ and $Y_{n+1}$ are independent. We now explore the relationship between $\mu_{n}(\delta)$ and the out-of-sample expected reward of the SAA optimizer, $\mu_{n}(0)$.

We denote the expected out-of-sample reward given $x$ by the function $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ defined by

g(x):=\mathbb{E}_{\mathbb{P}}[f(x,Y)\,|\,x].   (4.1)

Observe that $\mu_{n}(\delta)=\mathbb{E}_{\mathbb{P}}[g(x_{n}(\delta))]$. Concavity of $f(x,Y)$ in $x$ also implies that $g(x)$ is concave.

4.1. Sample Average Approximation

Independence of $x_{n}(0)$ and $Y$, the concavity of $f(x,Y)$ and hence of $g(x)$ in $x$, and Jensen’s inequality imply that the out-of-sample expected reward under the SAA optimizer $x_{n}(0)$ is less than that under its mean $\mathbb{E}_{\mathbb{P}}[x_{n}(0)]$:

\mathbb{E}_{\mathbb{P}}[f(x_{n}(0),Y)]=\mathbb{E}_{\mathbb{P}}[g(x_{n}(0))]\leq g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])=\mathbb{E}_{\mathbb{P}}\Big[f\big(\mathbb{E}_{\mathbb{P}}[x_{n}(0)],Y\big)\Big]\leq g(x^{\star}(0)).   (4.2)

This inequality is strict if $g(x)=\mathbb{E}_{\mathbb{P}}[f(x,Y)]$ is strictly concave and $x_{n}(0)$ is random.

Equation (4.2) suggests that we write the out-of-sample expected reward as the sum of the reward under the mean of the decision $\mathbb{E}_{\mathbb{P}}[x_{n}(0)]$ and a loss that we call the Jensen gap:

\underbrace{\mathbb{E}_{\mathbb{P}}[g(x_{n}(0))]}_{\mu_{n}(0)}=g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])+\underbrace{\mathbb{E}_{\mathbb{P}}\Big[g(x_{n}(0))-g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])\Big]}_{\text{Jensen gap}}\leq g(x^{\star}(0)).   (4.3)

The Jensen gap is negative when $g(x)$ is concave (and not linear) on the support of $x_{n}(0)$, while the optimality of $x^{\star}(0)$ for the population problem implies that $\mathbb{E}_{\mathbb{P}}[g(x_{n}(0))]$ is less than $g(x^{\star}(0))$ if the solution is biased ($\mathbb{E}_{\mathbb{P}}[x_{n}(0)]\neq x^{\star}(0)$).

Additional regularity (Assumptions 2.1 and 3.3) allows us to characterize the mean and variance of the SAA solution and to study each component of the decomposition (4.3) separately. Specifically, we know from Proposition 3.4 that

x_{n}(0)=x^{\star}(0)+\frac{1}{\sqrt{n}}W_{n}(0)+\frac{1}{n}V_{n}(0)+o_{P}(n^{-1}).

We make the assumption that the second-order error terms in the expectation and variance of $x_{n}(0)$ are $o(n^{-1})$. Sufficient conditions for this additional level of regularity are provided in Appendix D.

Assumption 4.1.

The SAA solution is such that

\mathbb{E}_{\mathbb{P}}[x_{n}(0)]=x^{\star}(0)+\frac{1}{n}\overline{V}(0)+o(n^{-1}),\qquad \mathbb{V}_{\mathbb{P}}[x_{n}(0)]=\frac{1}{n}\xi(0)+o(n^{-1}).   (4.4)

Under Assumption 4.1, a Taylor series expansion of the first component of (4.3) around the population optimizer $x^{\star}(0)$ gives

g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])=g\Big(x^{\star}(0)+\frac{1}{n}\overline{V}(0)+o(n^{-1})\Big)   (4.5)
=g(x^{\star}(0))+\frac{1}{2n^{2}}\overline{V}(0)^{\prime}\big[\nabla^{2}_{x}g(x^{\star}(0))\big]\overline{V}(0)+o(n^{-2}),   (4.6)

which shows that the expected out-of-sample loss due to the finite-sample bias $\frac{1}{n}\overline{V}(0)$ is small, being of order $n^{-2}$. In the case of the Jensen gap, a Taylor series expansion around $\mathbb{E}_{\mathbb{P}}[x_{n}(0)]$ gives

\mathbb{E}_{\mathbb{P}}\Big[g(x_{n}(0))-g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])\Big]=\frac{1}{2}\mathbb{E}_{\mathbb{P}}\Big[\big(x_{n}(0)-\mathbb{E}_{\mathbb{P}}[x_{n}(0)]\big)^{\prime}\nabla^{2}_{x}g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])\big(x_{n}(0)-\mathbb{E}_{\mathbb{P}}[x_{n}(0)]\big)\Big]+o(n^{-1})   (4.7)
=\frac{1}{2}\mathrm{tr}\Big\{\mathbb{E}_{\mathbb{P}}\Big[\big(x_{n}(0)-\mathbb{E}_{\mathbb{P}}[x_{n}(0)]\big)\big(x_{n}(0)-\mathbb{E}_{\mathbb{P}}[x_{n}(0)]\big)^{\prime}\Big]\nabla^{2}_{x}g\Big(x^{\star}(0)+\frac{1}{n}\overline{V}(0)+o(n^{-1})\Big)\Big\}+o(n^{-1})
=\frac{1}{2n}\mathrm{tr}\big\{\xi(0)\nabla^{2}_{x}g(x^{\star}(0))\big\}+o(n^{-1}),

where the second and third equalities follow from (4.4) and (4.5). The Jensen gap is negative because $g(x)$ is concave, and it becomes more negative when the variance $\frac{1}{n}\xi(0)$ of $x_{n}(0)$ increases or the curvature $\nabla^{2}_{x}g(x^{\star}(0))$ becomes more negative.

Adding (4.6) and (4.7) we obtain an expression for the out-of-sample expected reward in terms of the population optimizer:

\underbrace{\mathbb{E}_{\mathbb{P}}[f(x_{n}(0),Y)]}_{\mu_{n}(0)}=\mathbb{E}_{\mathbb{P}}[f(x^{\star}(0),Y)]+\frac{1}{2n}\underbrace{\mathrm{tr}\big\{\xi(0)\,\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),Y)]\big\}}_{\mathbb{E}_{\mathbb{P}}[W_{n}(0)^{\prime}\nabla^{2}_{x}f(x^{\star}(0),Y)W_{n}(0)]<0}+o(n^{-1}).   (4.8)

This expression shows that the loss in expected reward is of order n1n^{-1} and comes from the uncertainty in the solution through the Jensen gap. There is also a loss due to the finite-sample bias in the SAA solution, but this is an order of magnitude smaller.
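Expression (4.8) can be sanity-checked by Monte Carlo. For the illustrative quadratic reward $f(x,Y)=-(x-Y)^{2}$ (our choice, not the paper's), $x^{\star}(0)=\mathbb{E}[Y]$, the SAA solution is the sample mean, $\xi(0)=\mathbb{V}[Y]$, and (4.8) predicts $\mu_{n}(0)\approx-\mathbb{V}[Y](1+1/n)$, which in this case is exact:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, trials = 4.0, 30, 200_000

# f(x, Y) = -(x - Y)^2 with population P = N(0, sigma2); x*(0) = E[Y] = 0
datasets = rng.normal(scale=np.sqrt(sigma2), size=(trials, n))
x_saa = datasets.mean(axis=1)                    # SAA solutions, one per dataset
Y_new = rng.normal(scale=np.sqrt(sigma2), size=trials)

mu_n_mc = np.mean(-(x_saa - Y_new) ** 2)         # Monte Carlo estimate of mu_n(0)
mu_n_formula = -sigma2 * (1.0 + 1.0 / n)         # E_P[f(x*,Y)] + (1/2n) tr{xi(0) E[grad^2 f]}
print(mu_n_mc, mu_n_formula)                     # the two agree up to Monte Carlo error
```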

4.2. Distributionally Robust/Optimistic Optimization

To compare the out-of-sample expected reward of the DRO solution with that of SAA, we write $\mathbb{E}_{\mathbb{P}}[g(x_{n}(\delta))]$ in terms of the SAA reward $\mathbb{E}_{\mathbb{P}}[g(x_{n}(0))]$ and additional terms that show how changes in the mean and variance of the solution affect the expected reward:

\mathbb{E}_{\mathbb{P}}[g(x_{n}(\delta))]=\mathbb{E}_{\mathbb{P}}[g(x_{n}(0))]-\overbrace{\Big\{g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])-g(\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)])\Big\}}^{\text{(A) Loss from change in mean of solution}}-\Big\{\underbrace{\mathbb{E}_{\mathbb{P}}\Big[g(x_{n}(0))-g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])\Big]-\mathbb{E}_{\mathbb{P}}\Big[g(x_{n}(\delta))-g(\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)])\Big]}_{\text{(B) Change in the Jensen gap}}\Big\}.   (4.9)

The term (A) shows how a change in the expected value of the solution changes the reward, while (B) is the associated change in the Jensen gap.

DRO and DOO add a bias and change the variance of the SAA optimizer. Specifically, under Assumptions 2.1 and 3.1, it follows from Proposition 3.6 and equations (3.21) and (4.4) that (3.29) holds. As in the case of the SAA solution, we make the additional assumption that the second-order error terms in the mean and variance of $x_{n}(\delta)$ are $o(n^{-1})$. A sufficient condition is given in Appendix D.

Assumption 4.2.

There exists a neighborhood of $\delta=0$ such that

\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)]=x^{\star}(\delta)+\frac{1}{n}\overline{V}(\delta)+o(n^{-1}),\qquad \mathbb{V}_{\mathbb{P}}[x_{n}(\delta)]=\frac{1}{n}\xi(\delta)+o(n^{-1}).   (4.10)

Under this assumption and for $n$ sufficiently large, there is an open neighborhood of $\delta=0$ such that

\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)]=\mathbb{E}_{\mathbb{P}}[x_{n}(0)]+\delta\Big(\pi+\frac{1}{n}\overline{V}_{\delta}(0)\Big)+O(\delta^{2})+\frac{1}{n}O(\delta^{2})+o(n^{-1}),   (4.11)
\mathbb{V}_{\mathbb{P}}[x_{n}(\delta)]=\mathbb{V}_{\mathbb{P}}[x_{n}(0)]+\frac{\delta}{n}\xi_{\delta}(0)+\frac{1}{n}O(\delta^{2})+o(n^{-1}).

Here, $\overline{V}_{\delta}(0)=\frac{\mathrm{d}}{\mathrm{d}\delta}\overline{V}(0)$ and $\xi_{\delta}(0)=\frac{\mathrm{d}}{\mathrm{d}\delta}\xi(0)$ denote the derivatives of $\overline{V}(\delta)$ and $\xi(\delta)$ at $\delta=0$, which exist because $\xi(\delta)$ and $\overline{V}(\delta)$ are continuously differentiable at $\delta=0$ (Proposition 3.6).

To first order, the change in the bias relative to the SAA solution is $\delta\big(\pi+\frac{1}{n}\overline{V}_{\delta}(0)\big)$, and $\frac{\delta}{n}\xi_{\delta}(0)$ is the change in the variance. The expression for $\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)]$ follows from the previous equation together with (3.21) and (4.4), and shows that the difference between the means of $x_{n}(\delta)$ and $x_{n}(0)$ comes from two sources: the change in the population solution from $x^{\star}(0)$ to $x^{\star}(\delta)$ due to robustness, which contributes the term $\delta\pi$, and the change in the finite-sample bias, which contributes $\frac{\delta}{n}\overline{V}_{\delta}(0)$. The variance of the solution also changes, by $\frac{\delta}{n}\xi_{\delta}(0)$. An explicit relationship between the terms (A) and (B) in (4.9) and the robustness parameter can be obtained by substituting (4.11) into each term in (4.9).

For the first term (A), it is shown in Appendix E that

g(\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)])-g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])=\underbrace{\frac{\delta^{2}}{2}\pi^{\prime}\big[\nabla^{2}_{x}g(x^{\star}(0))\big]\pi+\frac{\delta}{n}\Big\{\pi^{\prime}\nabla^{2}_{x}g(x^{\star}(0))\overline{V}(0)\Big\}}_{\text{Impact of robustness}}+o(n^{-2})+o(\delta^{2})+\frac{1}{n}o(\delta^{2})+\frac{1}{n^{2}}O(\delta).

In the case of the Jensen gap (B):

\mathbb{E}_{\mathbb{P}}\Big\{g(x_{n}(\delta))-g(\mathbb{E}_{\mathbb{P}}[x_{n}(\delta)])\Big\}-\mathbb{E}_{\mathbb{P}}\Big\{g(x_{n}(0))-g(\mathbb{E}_{\mathbb{P}}[x_{n}(0)])\Big\}=\underbrace{\Big(\frac{\delta}{2n}\Big)\frac{\mathrm{d}}{\mathrm{d}\delta}\Big\{\mathrm{tr}\Big(\nabla^{2}_{x}g(x^{\star}(\delta))\xi(\delta)\Big)\Big\}\Big|_{\delta=0}}_{\text{Impact of robustness}}+\frac{1}{n}O(\delta^{2})+O(n^{-3/2}),

where

\frac{\mathrm{d}}{\mathrm{d}\delta}\Big\{\mathrm{tr}\Big(\nabla^{2}_{x}g(x^{\star}(\delta))\xi(\delta)\Big)\Big\}\Big|_{\delta=0}=\pi^{\prime}\nabla_{x}\Big(\mathrm{tr}\Big\{\nabla^{2}_{x}g(x)\xi(0)\Big\}\Big)\Big|_{x=x^{\star}(0)}+\mathrm{tr}\Big(\nabla^{2}_{x}g(x^{\star}(0))\,\xi_{\delta}(0)\Big).

This expression shows that the change in the Jensen gap comes from the change in the curvature of the objective function due to the DRO/DOO bias (first term) and the change in the variance of the solution (second term).

The out-of-sample expected reward under a DRO/DOO solution can be written in terms of the expected reward under the SAA optimizer by adding the expansions of (A) and (B) above.

Proposition 4.3.

Suppose $\{Y_{1},\cdots,Y_{n}\}$ and $Y_{n+1}$ are drawn iid from $\mathbb{P}$, $f(x,Y)$ satisfies Assumptions 2.1 and 3.3, $\phi(z)$ satisfies Assumption 3.1, and Assumptions 3.5, 4.1 and 4.2 hold. Then for every $n$ sufficiently large, there is an open neighborhood of $\delta=0$ such that

\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta),\,Y_{n+1})\big]=\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(0),\,Y_{n+1})\big]+\delta\frac{\rho}{n}+\frac{\delta^{2}}{2}\pi^{\prime}\mathbb{E}_{\mathbb{P}}\big[\nabla_{x}^{2}f(x^{\star}(0),\,Y_{n+1})\big]\pi+o(\delta^{2})+o(n^{-1}),   (4.14)

where

\rho=\overline{V}(0)^{\prime}\,\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),Y)]\,\pi+\frac{1}{2}\,\frac{\mathrm{d}}{\mathrm{d}\delta}\mathrm{tr}\Big(\xi(\delta)\,\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(\delta),Y)]\Big)\Big|_{\delta=0}   (4.15)
=\overline{V}(0)^{\prime}\,\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),Y)]\,\pi+\frac{1}{2}\mathrm{tr}\Big(\xi_{\delta}(0)\,\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),\,Y)]\Big)+\frac{1}{2}\pi^{\prime}\nabla_{x}\Big[\mathrm{tr}\Big(\xi(0)\,\mathbb{E}_{\mathbb{P}}\big[\nabla_{x}^{2}f(x^{\star}(0),\,Y_{n+1})\big]\Big)\Big]

and $\pi$ is given by (3.22).

It is worth reiterating the origins of each term in $\rho$. The first is the change in the expected reward that results from the change to the mean of the SAA solution (term (A)); the second is the change in the Jensen gap that results from the change in the variance of the solution; and the last is the change in the Jensen gap resulting from a change in the curvature of the objective function due to the DRO/DOO bias (both from term (B)). When $f(x,Y)$ is concave but not linear, it is clear from (4.15) that $\rho$ is unlikely to be zero, except in pathological cases.

The following result shows that it is always possible to select $\delta$ so that the out-of-sample expected reward under the DRO/DOO solution $x_{n}(\delta)$ exceeds that of $x_{n}(0)$, and provides an estimate of the size of the outperformance.

Theorem 4.4.

Suppose that the assumptions of Proposition 4.3 hold. If $\rho\neq 0$, then for $n$ sufficiently large, $\delta$ can always be selected so that the expected reward under $x_{n}(\delta)$ exceeds that of the SAA optimizer out-of-sample:

\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta),\,Y_{n+1})\big]>\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(0),\,Y_{n+1})\big].

If $n$ is such that the constant

\delta_{n}=-\frac{1}{n}\frac{\rho}{\pi^{\prime}\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),Y)]\pi}

is in the open neighborhood of $\delta=0$ where (4.14) holds, then

\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta_{n}),\,Y_{n+1})\big]-\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(0),\,Y_{n+1})\big]=-\frac{1}{2n^{2}}\underbrace{\Big(\frac{\rho^{2}}{\pi^{\prime}\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),Y)]\pi}\Big)}_{\text{negative}}+o(n^{-2})>0.
Proof.

Equation (4.14) shows how the out-of-sample expected reward depends on the robustness parameter $\delta$ when it is small. If $\rho$ is non-zero, the change in the expected reward

\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(\delta),\,Y_{n+1})\big]-\mathbb{E}_{\mathbb{P}}\big[f(x_{n}(0),\,Y_{n+1})\big]=\delta\frac{\rho}{n}+\frac{\delta^{2}}{2}\pi^{\prime}\mathbb{E}_{\mathbb{P}}\big[\nabla_{x}^{2}f(x^{\star}(0),\,Y_{n+1})\big]\pi+o(\delta^{2})+o(n^{-1})

is dominated by the linear term $\delta\frac{\rho}{n}$ when $\delta$ is small, and we can always choose $\delta$ so that the expected reward from $x_{n}(\delta)$ exceeds that from the SAA solution $x_{n}(0)$. If $\rho$ is positive, a positive choice of $\delta$ (DRO) does the job, while a negative value of $\delta$ (DOO) should be chosen if $\rho$ is negative. The expression for $\delta_{n}$ is obtained by maximizing the quadratic on the right-hand side of (4.14).

The inequality in the second part of the theorem follows by substituting the expression for $\delta_{n}$ into (4.14). ∎

Although we can always out-perform SAA, the bound above suggests that the out-performance is of order $n^{-2}$, regardless of whether it was achieved by best-case or worst-case optimization. Out-performance from either DRO or DOO is unlikely to be large if $n$ is large or $\rho$ is small. More generally, $\frac{\rho}{n}$ is the sensitivity of the sum of (A) and (B) in (4.9) to changes in $\delta$. Our assumptions allow us to estimate this explicitly. (A) and (B) may depend differently on $n$ and $\delta$ in applications where our assumptions do not hold, though we have not explored this issue.

It is difficult to compute $\rho$ because it depends on the solution of the population problem $x^{\star}(0)$ and the population distribution $\mathbb{P}$, both of which are unknown, and hence it is difficult to determine a priori whether the optimistic or worst-case solution will out-perform SAA. This can be sidestepped by optimizing a bootstrap or cross-validation estimate of the out-of-sample reward over $\delta$, which we explore in an inventory problem in Section 5.
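A simple way to implement this is $K$-fold cross-validation over a grid of $\delta$ values: for each candidate $\delta$, solve the in-sample DRO/DOO problem on the training folds (here via the one-dimensional dual in $c$ for a scalar decision) and average the reward on the held-out fold. The sketch below (ours) uses the modified-$\chi^{2}$ divergence and an illustrative concave reward; it is not the paper's experimental code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi_star = lambda t: np.where(t >= -1.0, t + 0.5 * t ** 2, -0.5)  # conjugate of 0.5*(z-1)^2

def f(x, y):
    return -(x - y) ** 2            # illustrative concave reward (our assumption)

def dual_value(x, Y, delta):
    """Worst-case (delta > 0) / best-case (delta < 0) in-sample value at x, via the scalar dual in c."""
    p = 1.0 / len(Y)
    g = lambda c: -np.sum(p * phi_star(-delta * (f(x, Y) + c))) / delta - c
    res = minimize_scalar(lambda c: -g(c)) if delta > 0 else minimize_scalar(g)
    return g(res.x)

def solve(Y, delta, lo, hi):
    """x_n(delta): maximize the in-sample SAA / DRO / DOO objective over a scalar decision."""
    if delta == 0.0:
        obj = lambda x: -np.mean(f(x, Y))
    else:
        obj = lambda x: -dual_value(x, Y, delta)
    return minimize_scalar(obj, bounds=(lo, hi), method="bounded").x

def cv_reward(Y, delta, K=5):
    """K-fold cross-validation estimate of the out-of-sample expected reward of x_n(delta)."""
    folds = np.array_split(np.arange(len(Y)), K)
    lo, hi = Y.min() - 1.0, Y.max() + 1.0
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.setdiff1d(np.arange(len(Y)), val)
        x_hat = solve(Y[train], delta, lo, hi)
        scores.append(np.mean(f(x_hat, Y[val])))
    return np.mean(scores)

rng = np.random.default_rng(5)
Y = rng.exponential(scale=2.0, size=30)
grid = np.arange(-5, 6) * 0.01                  # candidate delta values (DOO < 0 < DRO)
best = max(grid, key=lambda d: cv_reward(Y, d))
print(best)
```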

One limitation of our analysis is that we make strong assumptions about the differentiability of $f(x,Y)$. These assumptions enable us to characterize the statistical properties of the SAA and DRO/DOO solutions and to analyze out-of-sample performance. However, there are many applications where they are not satisfied, the inventory problem below being one such case. Our assumptions enable us to illustrate what is possible with DOO and to provide insight into why it occurs. We do not rule out the possibility that either DRO or DOO will out-perform SAA when the differentiability assumptions we impose are not satisfied. All predictions that stem from our analysis, both in Theorem 4.4 and later sections, are observed in the inventory example, where the objective function violates Assumption 2.1.

Example 4.5.

In [1], many examples where the out-of-sample expected reward from DRO exceeds that of SAA are presented for an objective function of the form

f(x,Y)=\frac{1}{2}x^{\prime}Hx+v(Y)^{\prime}x+u(Y),

where $H$ is strictly negative definite. Since $\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),\,Y)]=H$, $\mathbb{E}_{\mathbb{P}}[\nabla_{x}f(x^{\star}(0),\,Y)]=0$, and all derivatives of $f(x,Y)$ of order $3$ or higher are zero, it follows from (3.28) that $\overline{V}(0)=0$ and

\frac{1}{2}\pi^{\prime}\nabla_{x}\Big[\mathrm{tr}\Big(\xi(0)\,\mathbb{E}_{\mathbb{P}}\big[\nabla_{x}^{2}f(x^{\star}(0),\,Y_{n+1})\big]\Big)\Big]=0,

and hence by (4.15)

\rho=\frac{1}{2}\mathrm{tr}\Big(\xi_{\delta}(0)\,\mathbb{E}_{\mathbb{P}}[\nabla^{2}_{x}f(x^{\star}(0),\,Y)]\Big)=\frac{1}{2}\mathrm{tr}\Big(\xi_{\delta}(0)H\Big).

The only term of $\rho$ that remains corresponds to the change in the Jensen gap due to a change in the variance of the solution.

If $x$ is one-dimensional, $H$ is a negative constant. If DRO reduces the variance of the solution relative to SAA, $\xi_{\delta}(0)$ is negative and $\rho=\frac{1}{2}\xi_{\delta}(0)H$ is positive, so DRO will also out-perform SAA if $\delta>0$ is chosen appropriately. If DRO increases the variance of the solution (see [11] for an example where this happens), $\rho$ is negative and a DOO solution will have a higher out-of-sample expected reward than SAA.

All examples in [1] where DRO out-performs SAA are ones where the change in variance is such that $\rho=\frac{1}{2}\mathrm{tr}(\xi_{\delta}(0)H)$ is positive. Optimistic optimization will beat SAA in the examples from [1] when DRO does not.

Example 4.6.

Consider an inventory problem with reward

f(x,Y)=r\min\{x,Y\}+q\max\{x-Y,0\}-s\max\{Y-x,0\}-cx,   (4.17)

where the selling price $r=10$, the purchase cost $c=9$, and the shortage and scrap costs, $s$ and $q$, are zero. Under the population distribution, demand $Y=\max\{m+IX_{1}-(1-I)X_{2},\,0\}$, where $X_{i}$ is exponential with mean $\mu_{i}$ ($i=1,2$), $I$ is Bernoulli with $p=P[I=1]$, and $m$ is a constant; $X_{1},X_{2}$ and $I$ are independent. In this experiment, $m=250$, $\mu_{1}=10$, $\mu_{2}=60$ and $p=0.9$.

We note that the objective function (4.17) does not satisfy the differentiability assumptions that facilitate our analysis (Assumptions 2.1, 3.3, and 3.5). It will be seen, however, that all predictions from our analysis are observed in this inventory model. More generally, this provides some evidence that the insights from our analysis hold under less stringent assumptions about the degree of smoothness of the objective function.

We generated 5,000 datasets, each of size $n=30$, and solved the DRO/DOO problems with a modified-$\chi^{2}$ penalty function ($\phi(z)=\frac{1}{2}(z-1)^{2}$) over a range of $\delta$ for each dataset. We then computed the out-of-sample expected reward for the family of DRO/DOO decisions $x_{n}(\delta)$, giving 5,000 conditional expected-reward functions. The out-of-sample expected reward $\mu_{n}(\delta)=\mathbb{E}_{\mathbb{P}}[f(x_{n}(\delta),Y_{n+1})]$ shown in Figure 4.1 was estimated by averaging these 5,000 functions.
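The following scaled-down sketch (ours) of the experiment uses the same demand model, reward, and modified-$\chi^{2}$ penalty, but far fewer datasets and a coarse grid search over the order quantity, so the printed values will only roughly track Figure 4.1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Scaled-down version of the experiment in Example 4.6 (100 datasets instead of 5,000).
r, c, m, mu1, mu2, p_mix = 10.0, 9.0, 250.0, 10.0, 60.0, 0.9
phi_star = lambda t: np.where(t >= -1.0, t + 0.5 * t ** 2, -0.5)   # conjugate of 0.5*(z-1)^2

def reward(x, y):                      # f(x, Y) of (4.17) with q = s = 0
    return r * np.minimum(x, y) - c * x

def demand(size, rng):                 # Y = max(m + I*X1 - (1 - I)*X2, 0)
    I = rng.random(size) < p_mix
    X1 = rng.exponential(mu1, size)
    X2 = rng.exponential(mu2, size)
    return np.maximum(m + np.where(I, X1, -X2), 0.0)

def x_delta(Y, delta, x_grid):
    """DRO/DOO (or SAA when delta == 0) order quantity via grid search over x."""
    def value(x):
        fv = reward(x, Y)
        if delta == 0.0:
            return fv.mean()
        g = lambda cc: -np.mean(phi_star(-delta * (fv + cc))) / delta - cc
        res = minimize_scalar(lambda cc: -g(cc)) if delta > 0 else minimize_scalar(g)
        return g(res.x)
    return x_grid[np.argmax([value(x) for x in x_grid])]

rng = np.random.default_rng(6)
Y_test = demand(200_000, rng)                       # large sample standing in for P
x_grid = np.linspace(100.0, 280.0, 46)
deltas = [-4e-3, -2e-3, -1e-3, 0.0, 1e-3, 2e-3, 4e-3]

n_datasets, n = 100, 30
out = np.zeros(len(deltas))
for _ in range(n_datasets):
    Y = demand(n, rng)
    for j, d in enumerate(deltas):
        out[j] += reward(x_delta(Y, d, x_grid), Y_test).mean() / n_datasets

for d, v in zip(deltas, out):
    print(f"delta = {d:+.1e}   estimated out-of-sample reward = {v:.1f}")
```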

If we restrict ourselves to worst-case solutions (DRO), the SAA solution ($\delta=0$) is optimal and the out-of-sample expected reward is 189. However, the out-of-sample expected reward is maximized with an optimistic solution $x_{n}(\delta)$ with $\delta=-1.3\times 10^{-3}$ and has value 193. With different parameter values and/or population distribution, a worst-case model ($\delta>0$) could be optimal.

Figure 4.1. The plot shows the out-of-sample expected reward $\mu_{n}(\delta)=\mathbb{E}_{\mathbb{P}}[f(x_{n}(\delta),Y_{n+1})]$ as a function of $\delta$. The optimistic solution $x_{n}(\delta)$ with $\delta=-1.3\times 10^{-3}$ maximizes the expected reward (193). The expected reward associated with SAA ($\delta=0$) is 189.

In summary, DRO models typically come with a free parameter (the size of the uncertainty set or the robustness parameter $\delta$), which is commonly selected by optimizing an estimate of the out-of-sample expected reward via cross-validation or the bootstrap. However, it is not always possible to beat SAA if we restrict ourselves to worst-case models [1, 11]. If the ultimate goal is to beat SAA out-of-sample, it is reasonable to consider both worst- and best-case models so that this is always possible.

5. So, what is the catch?

5.1. Best-case solutions are not robust

We have focused on the out-of-sample expected reward of solutions of best- and worst-case problems. However, it has also been shown that DRO is intrinsically a tradeoff between (in-sample) expected reward and worst-case sensitivity [11, 12, 13]. Worst-case sensitivity is a quantitative measure of robustness and it is natural to evaluate the worst-case sensitivity of the best-case optimizer.

We recall the definition of worst-case sensitivity [12]. Suppose the decision $x$ is fixed and let ${\mathbb{E}}_{{\mathbb{P}}_{n}}[f(x,Y)]$ be the expected reward under the nominal distribution ${\mathbb{P}}_{n}$. Worst-case sensitivity ${\mathcal{S}}_{{\mathbb{P}}_{n}}(f(x,\,Y))$ is the rate of decrease of the in-sample expected reward under “worst-case deviations” from the nominal distribution. When the difference between probability distributions is measured by a smooth $\phi$-divergence, it is natural to define the worst-case distribution as the minimizer

\displaystyle{\mathbb{Q}}(\varepsilon):=\left\{\begin{array}[]{cl}\displaystyle\arg\min_{\mathbb{Q}}\Big{\{}\sum_{i=1}^{n}q_{i}f(x,\,Y_{i})+\frac{1}{\varepsilon}\sum_{i=1}^{n}{p}^{n}_{i}\phi\Big{(}\frac{q_{i}}{{p}^{n}_{i}}\Big{)}\Big{\}},&\varepsilon>0,\\[10.0pt] {\mathbb{P}}_{n},&\varepsilon=0.\end{array}\right.

Here, the penalty $1/\varepsilon>0$ on the $\phi$-divergence controls the size of the deviation of ${\mathbb{Q}}(\varepsilon)$ from ${\mathbb{P}}_{n}$, which is increasing in $\varepsilon$. Worst-case sensitivity is

\displaystyle{\mathcal{S}}_{{\mathbb{P}}_{n}}(f(x,\,Y)):=-\frac{\mathrm{d}}{\mathrm{d}\varepsilon}{\mathbb{E}}_{{\mathbb{Q}}(\varepsilon)}[f(x,\,Y)]\Big{|}_{\varepsilon=0} (5.1)
=-\lim_{\varepsilon\downarrow 0}\frac{{\mathbb{E}}_{{\mathbb{Q}}(\varepsilon)}[f(x,\,Y)]-{\mathbb{E}}_{{\mathbb{P}}_{n}}[f(x,\,Y)]}{\varepsilon}
=\frac{1}{\phi^{\prime\prime}(1)}{\mathbb{V}}_{{\mathbb{P}}_{n}}[f(x,\,Y)],

which, in the case of a smooth $\phi$-divergence, is the in-sample variance of the reward scaled by $1/\phi^{\prime\prime}(1)$. There are other ways to define worst-case sensitivity: we can use a different divergence measure, or we can control the “distance” of $\mathbb{Q}$ from ${\mathbb{P}}_{n}$ through a constraint and look at the limit as it vanishes [13]. However, they all capture a similar idea.

Worst-case sensitivity is an in-sample measure of robustness. Given a nominal model n{\mathbb{P}}_{n} and a decision xx, it quantifies the sensitivity of the expected reward to errors in the nominal model and is one way to evaluate whether a given decision is more or less robust than another. Worst-case sensitivity depends on the choice of uncertainty set, though it is always a generalized measure of deviation (spread) [13]. Intuitively, the expected reward is sensitive to mis-specification when it has a large spread because small changes in the probability of extreme rewards (positive or negative) can have a big impact on the mean.
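As a quick numerical check of (5.1) (our illustration, not from the paper), the worst-case distribution for the modified-$\chi^{2}$ penalty has the closed form $q_{i}={p}^{n}_{i}\big(1-\varepsilon(f_{i}-{\mathbb{E}}_{{\mathbb{P}}_{n}}[f])\big)$ for small $\varepsilon$, so the finite-difference slope of ${\mathbb{E}}_{{\mathbb{Q}}(\varepsilon)}[f]$ at $\varepsilon=0$ can be compared directly with ${\mathbb{V}}_{{\mathbb{P}}_{n}}[f]/\phi^{\prime\prime}(1)$; the rewards below are synthetic.

```python
# Numerical check of the worst-case sensitivity formula (5.1) for the
# modified-chi^2 penalty phi(z) = (z-1)^2/2, for which phi''(1) = 1.
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic in-sample rewards f(x, Y_i)
p = np.full(f.size, 1.0 / f.size)               # nominal (empirical) weights

def tilted_mean(eps):
    q = p * (1.0 - eps * (f - f.mean()))        # minimizer Q(eps); valid for small eps
    return np.dot(q, f)

eps = 1e-6
sensitivity = -(tilted_mean(eps) - tilted_mean(0.0)) / eps
print(sensitivity, f.var())                     # both approximate V_{P_n}[f] / phi''(1)
```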

Instead of a fixed $x$, we can substitute the solution $x_{n}(\delta)$ of the best/worst-case problem into (5.1). To see the impact of best/worst-case optimization on the “robustness” of the solution, we write the sensitivity of $x_{n}(\delta)$ in terms of the sensitivity of the SAA solution $x_{n}(0)$. Using the expansion (3.17), the in-sample variance of the reward under $x_{n}(\delta)$ is

\displaystyle{\mathbb{V}}_{{\mathbb{P}}_{n}}[f(x_{n}(\delta),\,Y)]={\mathbb{V}}_{{\mathbb{P}}_{n}}[f(x_{n}(0),\,Y)]+2\frac{\delta}{\phi^{\prime\prime}(1)}\,\beta_{n}^{\prime}\Big{(}{\mathbb{E}}_{{\mathbb{P}}_{n}}\big{[}\nabla_{x}^{2}f(x_{n}(0),\,Y)\big{]}\Big{)}^{-1}\beta_{n}+O(\delta^{2})

where

\displaystyle\beta_{n}=\mathrm{Cov}_{\mathbb{P}_{n}}\Big{[}{\nabla_{x}f(x_{n}(0),\,Y)},\,f(x_{n}(0),\,Y)\Big{]}.

It now follows from (5.1) that worst-case sensitivity

\displaystyle{\mathcal{S}}_{{\mathbb{P}}_{n}}(f(x_{n}(\delta),\,Y)) (5.2)
=\underbrace{{\mathcal{S}}_{{\mathbb{P}}_{n}}(f(x_{n}(0),\,Y))}_{\tiny\mbox{Sensitivity of SAA solution}}+2\frac{\delta}{[\phi^{\prime\prime}(1)]^{2}}\,\beta_{n}^{\prime}\Big{(}{\mathbb{E}}_{{\mathbb{P}}_{n}}\big{[}\nabla_{x}^{2}f(x_{n}(0),\,Y)\big{]}\Big{)}^{-1}\beta_{n}+O(\delta^{2}).

Strict concavity of f(x,Y)f(x,Y) in xx implies that

\displaystyle\beta_{n}^{\prime}\Big{(}{\mathbb{E}}_{{\mathbb{P}}_{n}}\big{[}\nabla_{x}^{2}f(x_{n}(0),\,Y)\big{]}\Big{)}^{-1}\beta_{n}<0

so solutions of the optimistic DOO problem (δ<0\delta<0) have a larger sensitivity than SAA, while worst-case solutions (δ>0\delta>0) have a smaller sensitivity and are therefore more robust.
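This ordering can be seen numerically in the inventory model of Example 4.6. The snippet below (ours) reuses sample_demand, reward, dro_doo_decision and grid from the sketch following Example 4.6 and evaluates the in-sample worst-case sensitivity ${\mathbb{V}}_{{\mathbb{P}}_{n}}[f(x_{n}(\delta),Y)]/\phi^{\prime\prime}(1)$ (with $\phi^{\prime\prime}(1)=1$) for an optimistic, the SAA, and a robust choice of $\delta$; for a typical dataset the DOO value is the largest.

```python
# In-sample worst-case sensitivity of the DRO/DOO decision for a few values of delta,
# computed as Var_{P_n}[f(x_n(delta), Y)] (phi''(1) = 1 for the modified chi^2 penalty).
data = sample_demand(30)                       # one dataset of size n = 30
for d in (-2e-3, 0.0, 2e-3):                   # DOO, SAA, DRO
    x = dro_doo_decision(data, d, grid)
    print(f"delta = {d:+.0e}:  sensitivity = {reward(x, data).var():.1f}")
```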

5.2. The optimal value of δ\delta may be difficult to estimate

Although there exists an optimal DOO or DRO decision with a larger out-of-sample expected reward than SAA, the parameter $\delta$ corresponding to this decision needs to be estimated. It is natural to use the bootstrap or cross-validation, though the estimate will depend on the data set, so there will be estimation error. If the optimal value of $\delta$ is positive but we erroneously select a negative value (i.e., DOO), not only do we lose a little out-of-sample expected reward, but we also increase worst-case sensitivity, so the solution will be less robust than SAA.

Example 5.1 (Inventory control (continued)).

For each of the 5,000 sampled datasets (each with $n=30$ datapoints), we compute the worst-case sensitivity ${\mathcal{S}}_{{\mathbb{P}}_{n}}(f(x_{n}(\delta),\,Y))$ as a function of $\delta$. Figure 5.1(a) is obtained by averaging these. As shown in (5.2), optimistic solutions have a larger sensitivity than SAA and DRO solutions in a neighborhood of $\delta=0$, where sensitivity is linear in $\delta$. For each of the 5,000 datasets, we can also compute the in-sample mean-sensitivity frontier corresponding to the DRO/DOO solutions over a range of $\delta$. Figure 5.1(b) is obtained by averaging these frontiers.

Figure 5.1(c) shows the out-of-sample mean-sensitivity (variance) frontier. SAA can be beaten by an optimistic decision, but this comes at the cost of a large increase in the out-of-sample sensitivity (variance).

Figure 5.1(d) shows the distribution of bootstrap estimates of the optimal $\delta$ as a function of the number of data points. Here we simulated $500$ independent datasets of size $n=15$ and $n=30$, and used 50 bootstrap samples for each dataset to estimate the out-of-sample expected reward, which we then optimized over $\delta$. When $n=15$, $\delta=-1.6\times 10^{-3}$ is optimal; when $n=30$, $\delta=-1.3\times 10^{-3}$. The bootstrap estimates are not very accurate when $n=15$ (mean $=1.4\times 10^{-2}$, SD $=2.8\times 10^{-2}$), with only $49\%$ of experiments giving an estimate with the correct sign, and there is a long right tail. This is not very surprising given that a sample of size $n=15$ is small when the population demand has a standard deviation of $28$. The chance of getting a data set that is not representative of the population is high, leading to poor estimates of $\delta$ (e.g. the extreme values in the right tail of the distribution). Accuracy improves with $n=30$ data points (mean $=-1.3\times 10^{-2}$, SD $=4.7\times 10^{-3}$), with estimates having the correct sign $88\%$ of the time and a smaller skew. It could well be the case that other methods estimate the optimal $\delta$, or at least its sign, more accurately. We have not explored this issue further.
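A bootstrap selection of $\delta$ along these lines can be sketched as follows. This is our illustration of one common variant, which may differ from the procedure used to produce Figure 5.1(d): each bootstrap resample plays the role of the training data, the original dataset serves as a proxy for out-of-sample data, and the functions sample_demand, reward and dro_doo_decision (and the arrays deltas and grid) are those from the sketch following Example 4.6.

```python
# Bootstrap estimate of the reward-maximizing delta for a single dataset (our sketch).
import numpy as np

def bootstrap_delta(data, deltas, grid, n_boot=50, seed=2):
    rng = np.random.default_rng(seed)
    scores = np.zeros_like(deltas)
    for _ in range(n_boot):
        resample = rng.choice(data, size=data.size, replace=True)
        for j, d in enumerate(deltas):
            x = dro_doo_decision(resample, d, grid)       # train on the resample
            scores[j] += reward(x, data).mean() / n_boot  # evaluate on the original data
    return deltas[int(np.argmax(scores))]

delta_hat = bootstrap_delta(sample_demand(30), deltas, grid)
print("bootstrap estimate of the optimal delta:", delta_hat)
```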

(a) Worst-case sensitivity
(b) Average in-sample mean-sensitivity frontier
(c) Out-of-sample mean-sensitivity frontier
(d) Density of bootstrap estimates of δ
Figure 5.1. (A) shows the average worst-case sensitivity as a function of $\delta$. Consistent with (5.2), it changes linearly around $\delta=0$. (B) shows the (average) in-sample mean-sensitivity frontier corresponding to a range of $\delta$. The change in the in-sample expected reward around $\delta=0$ (SAA) is small compared to changes in the worst-case sensitivity. (C) shows the out-of-sample mean-sensitivity (variance) frontier that is mapped out when $\delta$ is varied. The out-of-sample expected reward is maximized when $\delta=-1.3\times 10^{-3}$, which is on the optimistic part of the frontier. The expected reward is $193$, which exceeds that of the SAA optimizer ($189$). (D) shows the distribution of bootstrap estimates of the optimal $\delta$ when there are $n=15$ data points (optimal value $\delta=-1.6\times 10^{-3}$) and $n=30$ data points (optimal value $\delta=-1.3\times 10^{-3}$). Estimation error is large when $n=15$ and there is a long right tail. Accuracy improves significantly when there are $n=30$ data points.

5.3. Other remarks

The paper [18] (posted on arXiv one day after the first version of this paper; the two were written independently) studies data-driven algorithms that admit solutions of the form

\displaystyle\tilde{x}_{n}(\delta)=x^{\star}(0)+\delta\tilde{\pi}+\frac{1}{\sqrt{n}}\tilde{W}_{n}+o_{P}(\delta+n^{-\frac{1}{2}}), (5.3)

where $x^{\star}(0)$ is the population optimizer (2.2), $\tilde{x}_{n}(\delta)$ is the data-driven solution when there are $n$ data points, and $\delta$ is the free parameter of the algorithm (e.g., a regularization parameter, the size of an uncertainty set, or a robustness parameter).

To evaluate the performance of solutions of the form (5.3), [18] considers the difference (regret) between the expected rewards of the data-driven and population optimizers

\displaystyle{\mathcal{G}}(\tilde{x}_{n}(\delta)):=g(x^{\star}(0))-g(\tilde{x}_{n}(\delta))

where g(x)g(x) was defined in (4.1), and shows that the weak limit (nn\rightarrow\infty) of the scaled regret n𝒢(x~n(δn))n{\mathcal{G}}(\tilde{x}_{n}(\delta_{n})) is second-order stochastically larger than the weak limit of the scaled regret n𝒢(x~n(0))n{\mathcal{G}}(\tilde{x}_{n}(0)) of the SAA optimizer. In this sense it is impossible to beat SAA asymptotically when an algorithm has solutions of the form (5.3).

While this appears to contradict Theorem 4.4, which says that it is always possible to beat SAA using DRO/DOO, there is actually no inconsistency. [18] shows the optimality of the limiting reward distribution of SAA under second-order stochastic dominance, whereas we use the asymptotic distribution of the DRO/DOO and SAA solutions to estimate the difference between the out-of-sample expected rewards for finite $n$ (4.14). This difference is of order $\delta/n$, and the advantage of DRO/DOO over SAA vanishes at a rate faster than $1/n$ if $\delta\equiv\delta_{n}$ itself goes to $0$. Indeed, it was shown in Theorem 4.4 that the optimal choice of $\delta$ is of order $1/n$ and the improvement over SAA is of order $1/n^{2}$. If the goal is to beat the expected reward of SAA out-of-sample, the improvement from DRO/DOO is at best modest when data sets are of moderate size and diminishes quickly in $n$. There is no advantage in the limit, consistent with [18].

We have argued, however, that the advantage of DRO is more than the possibility that it can sometimes, but not always, beat SAA out-of-sample by a small amount: It also reduces the sensitivity of the expected reward to misspecification in the nominal model which, as the reader may recall, was the point of robust decision making in the first place. More generally, it captures the tradeoff between maximizing expected reward and controlling sensitivity, which is relevant in many applications of data-driven decision making.

6. Conclusion

Solutions of DRO problems can sometimes have a larger out-of-sample expected reward than SAA. Indeed, much of the literature seems to suggest that this is the aspiration of DRO: uncertainty sets are often calibrated by optimizing an estimate of the out-of-sample expected reward, recent papers focus on theoretically characterizing the out-of-sample expected reward of DRO solutions, and applications of DRO are hailed as a success when the out-of-sample mean exceeds that of SAA. However, beating SAA may not be possible if we only consider worst-case models.

We have shown that it is always possible to beat SAA out-of-sample if we consider both optimistic (DOO) and pessimistic (DRO) perturbations of the nominal model. If the only concern is beating the out-of-sample expected reward of the SAA optimizer, a decision maker should consider best-case and worst-case models.

As tempting as this may sound, there is a catch. While the out-of-sample expected reward might be larger, the worst-case sensitivity of a best-case (DOO) optimizer is larger than that of SAA so it will also be less robust. Indeed, the improvement in the expected reward from using the optimal DOO optimizer is small (of order n2n^{-2}) relative to the robustness cost. The optimal value of the parameter δ\delta also needs to be estimated, and bootstrap or cross-validation estimates may be unreliable if the training data set is small.

References

  • [1] Anderson, E.J., Philpott, A. 2020. Improving sample average approximation using distributional robustness. INFORMS Journal on Optimization, 4(1), 90–124 (https://doi.org/10.1287/ijoo.2021.0061).
  • [2] Ben-Tal, A., A. Nemirovski. 1999. Robust solutions to uncertain programs. Oper. Res. Lett. 25 1–13.
  • [3] Bertsimas, D., Gupta, V., Kallus, N. 2014. Robust SAA. arXiv:1408.4445 (https://arxiv.org/abs/1408.4445).
  • [4] Bertsimas, D., Litvinov, E., Sun, A.X., Zhao, J., Zheng, T. 2013. Adaptive robust optimization for the security constrained unit commitment problem. IEEE Transactions on Power Systems, 28(1), 52–63.
  • [5] Cao, J., Gao, R. 2021. Contextual decision-making under parametric uncertainty and data-driven optimistic optimization. (https://optimization-online.org/2021/10/8634/).
  • [6] Chen, L.L., Royset, J.O. 2022. Rockafellian Relaxation in Optimization under Uncertainty: Asymptotically Exact Formulations (https://arxiv.org/abs/2204.04762).
  • [7] Doyle, J.C., Glover, K., Khargonekar, P.P., Francis, B.A. 1989. State-space solutions to standard ${\mathcal{H}}_{2}$ and ${\mathcal{H}}_{\infty}$ control problems. IEEE Transactions on Automatic Control, 34(8), 831–847.
  • [8] El Ghaoui, L., Lebret, H. 1997. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18, 1035–1064.
  • [9] Duchi, J.C., Namkoong, H. 2016. Variance-based regularization with convex objectives. arXiv:1610.02581v2 (https://arxiv.org/abs/1610.02581).
  • [10] Duchi, J.C., Glynn, P.W., Namkoong, H. 2016. Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach. arXiv:1610.03425 (https://arxiv.org/abs/1610.03425).
  • [11] Gotoh, J., Kim, M.J., Lim, A.E.B. 2020. Calibration of robust empirical optimization models. Operations Research, 69(5), 1630–1650, https://doi.org/10.1287/opre.2020.2041.
  • [12] Gotoh, J., Kim, M.J., Lim, A.E.B. 2018. Robust empirical optimization is almost the same as mean-variance optimization. Operations Research Letters, 46 (4), 448–452, https://doi.org/10.1016/j.orl.2018.05.005.
  • [13] Gotoh, J., Kim, M.J., Lim, A.E.B. 2020. Worst-case sensitivity. arXiv:2010.10794 (https://arxiv.org/abs/2010.10794).
  • [14] Hansen, L.P., Sargent, T.J. 2008. Robustness. Princeton University Press, Princeton, New Jersey.
  • [15] Kim, M.J., Lim, A.E.B. 2014. Robust multi-armed bandit problems. Management Science, 62(1), 264–285.
  • [16] Kuhn, D., Esfahani, P.M., Nguyen,V.A., Shafieezadeh-Abadeh, S. 2019. Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning. INFORMS TutORials in Operations Research, 130-166, https://doi.org/10.1287/educ.2019.0198.
  • [17] Kundhi, G., Rilstone, P. 2008. The third order bias of nonlinear estimators. Communications in Statistics – Theory and Methods, 37: 2617 – 2633.
  • [18] Lam, H. 2021. On the Impossibility of Statistically Improving Empirical Optimization: A Second-Order Stochastic Dominance Perspective. arXiv:2105.13419 (https://arxiv.org/pdf/2105.13419.pdf).
  • [19] Lam, H. 2019. Recovering Best Statistical Guarantees via the Empirical Divergence-Based Distributionally Robust Optimization. Operations Research, 67(4), pp 1090 – 1105.
  • [20] Lam H., Zhou E. 2017. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters 45(4), pp 301 – 307.
  • [21] Nguyen, V.A., Shafieezadeh-Abadeh, S., Yue, M.C., Kuhn, D., Wiesemann, W. 2019. Optimistic Distributionally Robust Optimization for Nonparametric Likelihood Approximation. arXiv:1910.10583 (https://arxiv.org/abs/1910.10583).
  • [22] Nishiyama, Y. 2010. Moment convergence of M-estimators. Statistica Neerlandica, 64(4), 505–507.
  • [23] Norton, M., Takeda, A., Mafusalov, A. 2017. Optimistic robust optimization with applications to machine learning. arXiv:1711.07511 (https://arxiv.org/abs/1711.07511).
  • [24] Petersen, I.R., James, M.R., Dupuis, P. 2000. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45, 398–412.
  • [25] Rilstone, P., Srivastava, V.K., Ullah, A. (1996). The second-order bias and mean squared error of nonlinear estimators. Journal of Econometrics, 75: 369–395.
  • [26] Song, J., Zhao, C. 2020. Optimistic Distributionally Robust Policy Optimization. arXiv:2006.07815v1, https://arxiv.org/pdf/2006.07815.pdf.
  • [27] van der Vaart, A.W. 2000. Asymptotic Statistics. Cambridge University Press.
  • [28] Wang, Z., Glynn, P.W., Ye. Y. (2016). Likelihood robust optimization for data-driven problems. Computational Management Science, 13(2), pp. 241 – 261.
  • [29] Xie, W. 2021. CAREER: Favorable Optimization under Distributional Distortions: Frameworks, Algorithms, and Applications (Award Number 2046426). Division of Civil, Mechanical & Manufacturing Innovation, National Science Foundation.

Appendix A First order conditions (3.5)-(3.8)

Consider the population version of the DRO problem (3.1)

\displaystyle\max_{x,c}G(\delta,x,c)

where

\displaystyle G(\delta,x,c)=-\frac{1}{\delta}{\mathbb{E}}_{\mathbb{P}}\Big{[}\phi^{*}\Big{(}-\delta\big{[}f(x,\,Y)+c\big{]}\Big{)}\Big{]}-c. (A.1)

Differentiating with respect to $x$ and $c$, the first-order conditions are

\displaystyle\nabla_{x}G(\delta,x,c)={\mathbb{E}}_{\mathbb{P}}\Big{[}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y)+c\big{]}\Big{)}\nabla_{x}f(x,Y)\Big{]}=0, (A.2)
\displaystyle\nabla_{c}G(\delta,x,c)={\mathbb{E}}_{\mathbb{P}}\Big{[}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y)+c\big{]}\Big{)}\Big{]}-1=0

with solution $(x^{\star}(\delta),c^{\star}(\delta))$. We eventually wish to show that $(x^{\star}(\delta),c^{\star}(\delta))$ is continuously differentiable in a neighborhood of $\delta=0$, which depends on the properties of the convex conjugate $\phi^{*}(\zeta)$. Under Assumption 3.1 (see Theorem 3.2 in [12]), $\phi^{*}(\zeta)$ is twice continuously differentiable in a neighborhood of $\zeta=0$ and satisfies

\displaystyle\phi^{*}(\zeta)=\zeta+\frac{1}{2!}\Big{(}\frac{1}{\phi^{\prime\prime}(1)}\Big{)}\zeta^{2}+o(\zeta^{2}). (A.3)

It follows that [ϕ](ζ)[\phi^{*}]^{\prime}(\zeta) is continuously differentiable in a neighborhood of ζ=0\zeta=0 and

\displaystyle[\phi^{*}]^{\prime}(\zeta)=1+\frac{\zeta}{\phi^{\prime\prime}(1)}+o(\zeta), (A.4)

so by (A.2)

\displaystyle\nabla_{c}G(\delta,x,c)={\mathbb{E}}_{\mathbb{P}}\Big{[}-\frac{\delta}{\phi^{\prime\prime}(1)}\big{[}f(x,Y)+c\big{]}\Big{]}+o(\delta)

and we can write the first order conditions as

\displaystyle\nabla_{x}G(\delta,x,c)={\mathbb{E}}_{\mathbb{P}}\Big{[}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y)+c\big{]}\Big{)}\nabla_{x}f(x,Y)\Big{]}=0,
\displaystyle-\frac{\phi^{\prime\prime}(1)}{\delta}\nabla_{c}G(\delta,x,c)={\mathbb{E}}_{\mathbb{P}}\Big{[}-\frac{\phi^{\prime\prime}(1)}{\delta}\Big{\{}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y)+c\big{]}\Big{)}-1\Big{\}}\Big{]}=0.

This rescaling does not change the solution of (A.2), but it allows us to use the Implicit Function Theorem to establish the level of smoothness of $(x^{\star}(\delta),c^{\star}(\delta))$ in Proposition 3.2 (in particular, that (B.11) is invertible in the proof in Appendix B). It follows that $\psi(x,c,Y)$ is given by (3.13). First-order conditions for the sample DRO problem and for the DOO problems can be derived similarly.
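As a concrete check (our addition, not part of the original argument), take the modified-$\chi^{2}$ divergence used in Example 4.6, $\phi(z)=\frac{1}{2}(z-1)^{2}$. For $\zeta$ in a neighborhood of $0$ the supremum defining the conjugate is attained at $z=1+\zeta>0$, so

\displaystyle\phi^{*}(\zeta)=\sup_{z\geq 0}\Big{\{}\zeta z-\tfrac{1}{2}(z-1)^{2}\Big{\}}=\zeta+\tfrac{1}{2}\zeta^{2},\qquad[\phi^{*}]^{\prime}(\zeta)=1+\zeta,

which agrees with (A.3) and (A.4) with $\phi^{\prime\prime}(1)=1$ and, in this case, no remainder term.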

Appendix B Proof of Proposition 3.2

We use the Implicit Function Theorem to show existence of the solution (x(δ),c(δ))(x^{\star}(\delta),\,c^{\star}(\delta)) of (3.8) and to characterize its smoothness. (Existence and smoothness of (xn(δ),cn(δ))(x_{n}(\delta),\,c_{n}(\delta)) can be shown in a similar way). With a mild abuse of notation, let

g(δ,x,c)[g1(δ,x,c)g2(δ,x,c)]:=𝔼[ψ(x,c,Y)].\displaystyle g(\delta,\,x,\,c)\equiv\left[\begin{array}[]{c}g_{1}(\delta,\,x,\,c)\\ g_{2}(\delta,\,x,\,c)\end{array}\right]:={\mathbb{E}}_{\mathbb{P}}[\psi(x,\,c,\,Y)].

The first-order conditions for the population problems (3.8) are

g(δ,x,c)=[00].\displaystyle g(\delta,\,x,\,c)=\left[\begin{array}[]{cc}0\\ 0\end{array}\right]. (B.4)

Since [ϕ](ζ)[\phi^{*}]^{\prime}(\zeta) is continuously differentiable in a neighborhood of ζ=0\zeta=0 (Theorem 3.2 in [12]), g(δ,x,c)g(\delta,\,x,\,c) is continuously differentiable in δ\delta in some open neighborhood of δ=0\delta=0 for every fixed (x,c)(x,\,c). By (A.4)

g(δ,x,c)=[𝔼[xf(x,Y)]δϕ′′(1)𝔼[(f(x,Y)+c)xf(x,Y)]+o(δ)𝔼[f(x,Y)+c]+O(δ)],\displaystyle g(\delta,\,x,\,c)=\left[\begin{array}[]{c}{\mathbb{E}}_{{\mathbb{P}}}\Big{[}\nabla_{x}f(x,Y)\Big{]}-\frac{\delta}{\phi^{\prime\prime}(1)}{\mathbb{E}}_{\mathbb{P}}\Big{[}\Big{(}f(x,Y)+c\Big{)}\nabla_{x}f(x,Y)\Big{]}+o(\delta)\\[10.0pt] \mathbb{E}_{{\mathbb{P}}}\Big{[}f(x,Y)+c\Big{]}+O(\delta)\end{array}\right],

and

g(0,x(0),𝔼[f(x(0),Y)])=0,\displaystyle g\big{(}0,\,x^{\star}(0),\,-{\mathbb{E}}_{{\mathbb{P}}}[f(x^{\star}(0),\,Y)]\big{)}=0, (B.6)

where $x^{\star}(0)$ is the solution of the population problem. Since $f(x,\,Y)$ is twice continuously differentiable in $x$, $g(\delta,\,x,\,c)$ is continuously differentiable in a neighborhood of $\big(0,\,x^{\star}(0),\,-{\mathbb{E}}_{{\mathbb{P}}}[f(x^{\star}(0),\,Y)]\big)$, and

Jg,(x,c)(δ,x,c)|(0,x(0),𝔼[f(x(0),Y)])\displaystyle J_{g,\,(x,\,c)}(\delta,\,x,\,c)\Big{|}_{(0,\,x^{\star}(0),\,-{\mathbb{E}}_{{\mathbb{P}}}[f(x^{\star}(0),\,Y)])} \displaystyle\equiv [xg1(δ,x,c)cg1(δ,x,c)xg2(δ,x,c)cg2(δ,x,c)]|(0,x(0),𝔼[f(x(0),Y)])\displaystyle\left.\left[\begin{array}[]{cc}\nabla_{x}g_{1}(\delta,\,x,\,c)&\nabla_{c}g_{1}(\delta,\,x,\,c)\\ \nabla_{x}g_{2}(\delta,\,x,\,c)&\nabla_{c}g_{2}(\delta,\,x,\,c)\end{array}\right]\right|_{(0,\,x^{\star}(0),\,-{\mathbb{E}}_{{\mathbb{P}}}[f(x^{\star}(0),\,Y)])} (B.8)
=\displaystyle= [𝔼[x2f(x(0),Y)]001]\displaystyle\left[\begin{array}[]{cc}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(x^{\star}(0),\,Y)]&0\\ 0&1\end{array}\right] (B.11)

is invertible. It follows from the Implicit Function Theorem that $(x^{\star}(\delta),\,c^{\star}(\delta))$ exists and is continuously differentiable in $\delta$ in an open neighborhood of $\delta=0$, so we can write

x(δ)\displaystyle x^{\star}(\delta) =\displaystyle= x(0)+δπ+o(δ)\displaystyle x^{\star}(0)+\delta\pi+o(\delta)
c(δ)\displaystyle c^{\star}(\delta) =\displaystyle= 𝔼[f(x(0),Y)]+δκ+o(δ)\displaystyle-{\mathbb{E}}_{\mathbb{P}}[f(x^{\star}(0),Y)]+\delta\kappa+o(\delta)

where πd\pi\in{\mathbb{R}}^{d} and κ\kappa is a constant. Substituting into the first equation in (B.6), a Taylor series expansion around δ=0\delta=0 gives

\displaystyle{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla_{x}f(x^{\star}(0),Y)\big{]}+\delta\pi^{\prime}{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}_{x}f(x^{\star}(0),Y)\big{]}
\displaystyle-\frac{\delta}{\phi^{\prime\prime}(1)}{\mathbb{E}}_{\mathbb{P}}\Big{[}\Big{(}f(x^{\star}(0),Y)-{\mathbb{E}}_{\mathbb{P}}[f(x^{\star}(0),Y)]\Big{)}\nabla_{x}f(x^{\star}(0),Y)\Big{]}+o(\delta)=0.

Optimality of x(0)x^{\star}(0) for the population problem implies 𝔼[xf(x(0),Y)]=0{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla_{x}f(x^{\star}(0),Y)\big{]}=0 while the coefficient of the δ\delta term is zero if

π=1ϕ′′(1)(𝔼[x2f(x(0),Y)])1Cov(f(x(0),Y),xf(x(0),Y)).\displaystyle\pi=\frac{1}{\phi^{{}^{\prime\prime}}(1)}\Big{(}{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}_{x}f(x^{\star}(0),Y)\big{]}\Big{)}^{-1}\mbox{Cov}_{\mathbb{P}}\Big{(}f(x^{\star}(0),Y),\nabla_{x}f(x^{\star}(0),Y)\Big{)}.

Smoothness of $(x_{n}(\delta),\,c_{n}(\delta))$ and the expressions (3.17)-(3.18) can be established in a similar manner.
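As a side remark (our restatement, not part of the proof), when $x$ is one-dimensional the expression for $\pi$ above reduces to

\displaystyle\pi=\frac{\mathrm{Cov}_{\mathbb{P}}\big{(}f(x^{\star}(0),Y),\,\nabla_{x}f(x^{\star}(0),Y)\big{)}}{\phi^{\prime\prime}(1)\,{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}_{x}f(x^{\star}(0),Y)\big{]}},

which is the population analogue of $\frac{1}{\phi^{\prime\prime}(1)}\big({\mathbb{E}}_{{\mathbb{P}}_{n}}[\nabla_{x}^{2}f(x_{n}(0),\,Y)]\big)^{-1}\beta_{n}$, the quantity that drives the in-sample variance expansion preceding (5.2).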

Appendix C Proof of Proposition 3.6

We begin with some preliminaries. Let ψ(x,c,Y)\psi(x,c,Y) be defined by (3.13), and

ψ(x(δ),c(δ),Y)=[x1ψ1xmψ1cψ1x1ψ2xmψ2cψ2](x(δ),c(δ),Y)\displaystyle\nabla\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)=\left[\begin{array}[]{ccc|c}\nabla_{x_{1}}\psi_{1}&\ldots&\nabla_{x_{m}}\psi_{1}&\nabla_{c}\psi_{1}\\ \hline\cr\nabla_{x_{1}}\psi_{2}&\ldots&\nabla_{x_{m}}\psi_{2}&\nabla_{c}\psi_{2}\\ \end{array}\right](x^{\star}(\delta),\,c^{\star}(\delta),Y)

be the Jacobian of ψ(x,c,Y)\psi(x,c,Y) and the matrices

A~(δ)\displaystyle\widetilde{A}(\delta) =\displaystyle= 𝔼[ψ(x(δ),c(δ),Y)](d+1)×(d+1),\displaystyle{\mathbb{E}}_{\mathbb{P}}[\nabla\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)]\in{\mathbb{R}}^{(d+1)\times(d+1)},
B~(δ)\displaystyle\widetilde{B}(\delta) =\displaystyle= 𝔼[ψ(x(δ),c(δ),Y)ψ(x(δ),c(δ),Y)](d+1)×(d+1).\displaystyle{\mathbb{E}}_{\mathbb{P}}[\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)\,\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)^{\prime}]\in{\mathbb{R}}^{(d+1)\times(d+1)}.

We also define the d+1{\mathbb{R}}^{d+1} valued random vectors

W~n(δ)\displaystyle\widetilde{W}_{n}(\delta) =\displaystyle= A~(δ)11ni=1nψ(x(δ),c(δ),Yi),\displaystyle-\widetilde{A}(\delta)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(x^{\star}(\delta),\,c^{\star}(\delta),\,Y_{i}),
V~n(δ)\displaystyle\widetilde{V}_{n}(\delta) =\displaystyle= H~n(δ)W~n(δ)I~n(δ)\displaystyle\widetilde{H}_{n}(\delta)\widetilde{W}_{n}(\delta)-\widetilde{I}_{n}(\delta) (C.2)

where

H~n(δ)\displaystyle\widetilde{H}_{n}(\delta) =\displaystyle= 1nA~(δ)1i=1n{ψ(x(δ),c(δ),Yi)𝔼[ψ(x(δ),c(δ),Y)]}\displaystyle-\frac{1}{\sqrt{n}}\widetilde{A}(\delta)^{-1}\sum_{i=1}^{n}\Big{\{}\nabla\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y_{i})-{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)\big{]}\Big{\}}
I~n(δ)\displaystyle\tilde{I}_{n}(\delta) =\displaystyle= A~(δ)1[W~n(δ)𝔼[2ψ(1)(x(δ),c(δ),Y)]W~n(δ)W~n(δ)𝔼[2ψ(d+1)(x(δ),c(δ),Y)]W~n(δ)]\displaystyle\widetilde{A}(\delta)^{-1}\left[\begin{array}[]{c}\widetilde{W}_{n}(\delta)^{\prime}{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}\psi^{(1)}(x^{\star}(\delta),\,c^{\star}(\delta),\,Y)\big{]}\widetilde{W}_{n}(\delta)\\ \vdots\\ \widetilde{W}_{n}(\delta)^{\prime}{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}\psi^{(d+1)}(x^{\star}(\delta),\,c^{\star}(\delta),\,Y)\big{]}\widetilde{W}_{n}(\delta)\end{array}\right]

Here, ψ(i)(x,c,Y)\psi^{(i)}(x,c,Y) is the ithi^{th} component of the d+1d+1 dimensional function

ψ(x,c,Y)=[ψ(1)(x,c,Y)ψ(d+1)(x,c,Y)]\displaystyle\psi(x,c,Y)=\left[\begin{array}[]{c}\psi^{(1)}(x,c,Y)\\ \vdots\\ \psi^{(d+1)}(x,c,Y)\end{array}\right]

that was defined in (3.13), and $\nabla^{2}\psi^{(i)}(x^{\star}(\delta),\,c^{\star}(\delta),\,Y)$ is the Hessian $\nabla^{2}\psi^{(i)}(x,\,c,\,Y)$ evaluated at $(x^{\star}(\delta),c^{\star}(\delta))$. Note that condition (2) of Assumption 3.5 implies that $\phi^{*}(\zeta)$ is three times continuously differentiable in a neighborhood of $\zeta=0$; it can be shown, along the lines of the proof of Theorem 3.2 in [12], that $\phi^{*}(\zeta)=\zeta+\frac{\zeta^{2}}{2!}\Big(\frac{1}{\phi^{(2)}(1)}\Big)-\frac{\zeta^{3}}{3!}\Big(\frac{\phi^{(3)}(1)}{[\phi^{(2)}(1)]^{3}}\Big)+o(\zeta^{3})$, where $\phi^{(i)}(z)$ is the $i$th derivative of $\phi(z)$ with respect to $z$. Together with condition (1), it follows that the Hessian $\nabla^{2}\psi^{(i)}(x,\,c,\,Y)$ evaluated at $(x^{\star}(\delta),c^{\star}(\delta))$ above is well defined.

The following result characterizes the distribution of (xn(δ),cn(δ))(x_{n}(\delta),\,c_{n}(\delta)) and is obtained by applying Theorem 5.21 in [27] and Lemma 1 from [17] to (3.13).

Proposition C.1.

Suppose that data {Y1,,Yn}\{Y_{1},\cdots,\,Y_{n}\} is drawn iid from \mathbb{P}, f(x,Y)f(x,\,Y) satisfies Assumption 2.1, ϕ(z)\phi(z) satisfies Assumption 3.1, and δ\delta is such that Assumption 3.5 holds. Then there is a unique solution (x(δ),c(δ))(x^{\star}(\delta),\,c^{\star}(\delta)) of the first-order conditions (3.8), the matrix A~(δ)\widetilde{A}(\delta) is invertible, (xn(δ),cn(δ))𝑃(x(δ),c(δ))(x_{n}(\delta),\,c_{n}(\delta))\overset{P}{\longrightarrow}(x^{\star}(\delta),\,c^{\star}(\delta)), and for nn sufficiently large

[xn(δ)cn(δ)]=[x(δ)c(δ)]+1nW~n(δ)+1nV~n(δ)+oP(n3/2).\displaystyle\left[\begin{array}[]{c}x_{n}(\delta)\\ c_{n}(\delta)\end{array}\right]=\left[\begin{array}[]{c}x^{\star}(\delta)\\ c^{\star}(\delta)\end{array}\right]+\frac{1}{\sqrt{n}}\widetilde{W}_{n}(\delta)+\frac{1}{n}\widetilde{V}_{n}(\delta)+o_{P}(n^{-3/2}).

W~n(δ)\widetilde{W}_{n}(\delta) has mean 0 and covariance matrix

ξ~(δ)𝕍[W~n(δ)]=A~(δ)1B~(δ)A~(δ)1,\displaystyle\widetilde{\xi}(\delta)\equiv{\mathbb{V}}_{\mathbb{P}}[\widetilde{W}_{n}(\delta)]=\widetilde{A}(\delta)^{-1}\widetilde{B}(\delta){\widetilde{A}(\delta)^{-1}}^{\prime},

and V~n(δ)\widetilde{V}_{n}(\delta) is OP(1)O_{P}(1) with mean

𝔼[V~n(δ)]\displaystyle{\mathbb{E}}_{\mathbb{P}}[\widetilde{V}_{n}(\delta)] =\displaystyle= 𝔼[A~(δ)1ψ(x(δ),c(δ),Y)A~(δ)1ψ(x(δ),c(δ),Y)]\displaystyle{\mathbb{E}}_{\mathbb{P}}\Big{[}\widetilde{A}(\delta)^{-1}\nabla\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)\widetilde{A}(\delta)^{-1}\psi(x^{\star}(\delta),\,c^{\star}(\delta),Y)\Big{]}
A~(δ)1[tr{ξ~(δ)𝔼[2ψ(1)(x(δ),c(δ),Y)]}tr{ξ~(δ)𝔼[2ψ(d+1)(x(δ),c(δ),Y)]}].\displaystyle-\widetilde{A}(\delta)^{-1}\left[\begin{array}[]{c}{\rm tr}\big{\{}\widetilde{\xi}(\delta)\,{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}\psi^{(1)}(x^{\star}(\delta),\,c^{\star}(\delta),\,Y)\big{]}\big{\}}\\ \vdots\\ {\rm tr}\big{\{}\widetilde{\xi}(\delta)\,{\mathbb{E}}_{\mathbb{P}}\big{[}\nabla^{2}\psi^{(d+1)}(x^{\star}(\delta),\,c^{\star}(\delta),\,Y)\big{]}\big{\}}\end{array}\right].

Proposition 3.6 is largely a restatement of Proposition C.1 for the xn(δ)x_{n}(\delta) component of (xn(δ),cn(δ))(x_{n}(\delta),c_{n}(\delta)). In particular, expression (3.29) for xn(δ)x_{n}(\delta) holds because we can extract the d{\mathbb{R}}^{d} valued random vectors Wn(δ)W_{n}(\delta) and Vn(δ)V_{n}(\delta) of (C.2) corresponding to xn(δ)x_{n}(\delta) in (C.1)

W~n(δ)=[Wn(δ)Un(δ)]d+1,V~n(δ)=[Vn(δ)Tn(δ)]d+1,\displaystyle\widetilde{W}_{n}(\delta)=\left[\begin{array}[]{c}W_{n}(\delta)\\ U_{n}(\delta)\end{array}\right]\in{\mathbb{R}}^{d+1},\hskip 14.22636pt\widetilde{V}_{n}(\delta)=\left[\begin{array}[]{c}V_{n}(\delta)\\ T_{n}(\delta)\end{array}\right]\in{\mathbb{R}}^{d+1},

and ξ(δ)d×d\xi(\delta)\in{\mathbb{R}}^{d\times d} and V¯(δ)d\overline{V}(\delta)\in{\mathbb{R}}^{d} from

ξ~(δ)=[ξ(δ)κ(δ)κ(δ)η(δ)],𝔼[V~n(δ)]=[V¯(δ)S¯(δ)].\displaystyle\widetilde{\xi}(\delta)=\left[\begin{array}[]{cc}\xi(\delta)&\kappa(\delta)\\ \kappa(\delta)^{\prime}&\eta(\delta)\end{array}\right],\hskip 14.22636pt{\mathbb{E}}_{\mathbb{P}}[\widetilde{V}_{n}(\delta)]=\left[\begin{array}[]{c}\overline{V}(\delta)\\ \overline{S}(\delta)\end{array}\right].

$\xi(\delta)$ and $\overline{V}(\delta)$ are continuously differentiable in a neighborhood of $\delta=0$ because the pair $(x^{\star}(\delta),c^{\star}(\delta))$ is continuously differentiable in a neighborhood of $\delta=0$ (Proposition 3.2) and because of the conditions imposed on $\psi(x,c,Y)$ in Assumption 3.5.

Appendix D Assumptions 4.1 and 4.2

We know from (3.24) that the solution of the SAA problem has an expansion of the form

xn(0)=x(0)+1nWn(0)+1nVn(0)+Hn(0)\displaystyle x_{n}(0)=x^{\star}(0)+\frac{1}{\sqrt{n}}W_{n}(0)+\frac{1}{n}V_{n}(0)+H_{n}(0) (D.1)

where the error term $H_{n}(0)$ is $O_{p}(n^{-\frac{3}{2}})$. Assumption 4.1 holds if $\sup_{n}{\mathbb{E}}_{\mathbb{P}}\|n^{\frac{3}{2}}{H}_{n}(0)\|^{p}<\infty$ for $p=1,2$. We now derive an expression for $H_{n}(0)$ to determine conditions under which this holds. To ease notation, we assume $x$ is scalar. Our derivation shows that $n^{\frac{3}{2}}H_{n}(0)$ is a product of derivatives of $f(x,Y)$ with respect to $x$ of up to order $4$, and Assumption 4.1 holds if the first two moments of $n^{\frac{3}{2}}H_{n}(0)$ are finite.

Our approach generalizes to the case where the decision variable is multi-dimensional, which is required for DRO and DOO (Assumption 4.2), where the pair $(x,c)$ is the decision, and for SAA with vector-valued $x$, at the cost of even uglier equations and tensor notation. Not surprisingly, the same insight holds: the assumptions are satisfied if products of (partial) derivatives of order up to $4$ have finite first and second moments. The equations are long but not particularly insightful (beyond this observation) and the derivation is arduous. We outline how this can be done and leave the details to the reader.

To derive the expansion (D.1), we embed the first order conditions for our sample problem in a family of fixed-point equations G(y,ε)=0G(y,\varepsilon)=0 parameterized by ε[0,1]\varepsilon\in[0,1] where

G(y,ε)\displaystyle G(y,\varepsilon) =\displaystyle= 𝔼[xf(y,Y)]+ε1ni=1n{xf(y,Yi)𝔼[xf(y,Y)]}\displaystyle{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}f(y,Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\Big{\{}\nabla_{x}f(y,Y_{i})-{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}f(y,Y)]\Big{\}}
=\displaystyle= 𝔼[xf(y,Y)]+ε1ni=1nxf~(y,Yi).\displaystyle{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}f(y,Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}\widetilde{f}(y,Y_{i}).

Here

f~(y,Y)=f(y,Y)𝔼[f(y,Y)]\displaystyle\widetilde{f}(y,Y)=f(y,Y)-{\mathbb{E}}_{\mathbb{P}}[f(y,Y)]

is the deviation of the random reward f(y,Y)f(y,Y) from its expectation 𝔼[f(y,Y)]{\mathbb{E}}_{\mathbb{P}}[f(y,Y)], while xf~(y,Y)\nabla_{x}\widetilde{f}(y,Y) is the derivative of the centered random variable. We denote the solution of G(y,ε)=0G(y,\varepsilon)=0 as y(ε)y(\varepsilon). (We change the argument in GG from xx to yy to minimize the chance of confusion between xn(0)x_{n}(0), the solution of the SAA problem, and the solution y(ε)y(\varepsilon) of the fixed point problem). The case ε=0\varepsilon=0

G(y(0),0)=𝔼[xf(y(0),Y)]=0\displaystyle G(y(0),0)={\mathbb{E}}_{\mathbb{P}}[\nabla_{x}f(y(0),Y)]=0

corresponds to the population problem with solution y(0)=x(0)y(0)=x^{\star}(0); ε=1\varepsilon=1 is the SAA problem with solution y(1)=xn(0)y(1)=x_{n}(0)

G(xn(0),1)=1ni=1nxf(xn(0),Yi)=0.\displaystyle G(x_{n}(0),1)=\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}f(x_{n}(0),Y_{i})=0.

To begin, we find the derivatives of $y(\varepsilon)$ with respect to $\varepsilon$, namely $A(\varepsilon)$, $B(\varepsilon)$ and $C(\varepsilon)$, such that

y(ε+Δ)=y(ε)+ΔA(ε)+Δ22B(ε)+Δ33!C(ε)+o(Δ3).\displaystyle y(\varepsilon+\Delta)=y(\varepsilon)+\Delta A(\varepsilon)+\frac{\Delta^{2}}{2}B(\varepsilon)+\frac{\Delta^{3}}{3!}C(\varepsilon)+o(\Delta^{3}).

Since G(y(ε+Δ),ε+Δ)=0G(y(\varepsilon+\Delta),\varepsilon+\Delta)=0 we do this by expanding G(y(ε+Δ),ε+Δ)G(y(\varepsilon+\Delta),\varepsilon+\Delta) around G(y(ε),ε)G(y(\varepsilon),\varepsilon) and choosing A(ε)A(\varepsilon), B(ε)B(\varepsilon) and C(ε)C(\varepsilon) to ensure the first-order condition holds.

Specifically, a Taylor series expansion around ε\varepsilon gives (after some work)

G(y(ε+Δ),ε+Δ)=G(y(ε),ε)\displaystyle G(y(\varepsilon+\Delta),\varepsilon+\Delta)=G(y(\varepsilon),\varepsilon)
+Δ{[𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)]A(ε)+1ni=1nxf(y(ε),Yi)}\displaystyle+\Delta\left\{\Big{[}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})\Big{]}A(\varepsilon)+\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}f(y(\varepsilon),Y_{i})\right\}
+Δ22!{B(ε)(𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi))\displaystyle+\frac{\Delta^{2}}{2!}\left\{B(\varepsilon)\Big{(}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}\right.
+A(ε)2(𝔼[x3f(y(ε),Y)]+ε1ni=1nx3f~(y(ε),Yi))+A(ε)1ni=1n2x2f~(y(ε),Y)}\displaystyle\left.+A(\varepsilon)^{2}\Big{(}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{3}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{3}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}+A(\varepsilon)\frac{1}{n}\sum_{i=1}^{n}2\nabla^{2}_{x}\widetilde{f}(y(\varepsilon),Y)\right\}
+Δ33!{C(ε)(𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi))\displaystyle+\frac{\Delta^{3}}{3!}\left\{C(\varepsilon)\Big{(}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}\right.
+A(ε)3(𝔼[x4f(y(ε),Y)]+ε1ni=1nx4f~(y(ε),Yi))\displaystyle+A(\varepsilon)^{3}\Big{(}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{4}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{4}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}
+3A(ε)B(ε)(𝔼[x3f(y(ε),Y)]+ε1ni=1nx3f~(y(ε),Yi))\displaystyle+3A(\varepsilon)B(\varepsilon)\Big{(}{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{3}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{3}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}
+3(1ni=1nx2f~(y(ε),Yi))B(ε)+3(1ni=1nx3f~(y(ε),Yi))A(ε)2}+o(Δ3).\displaystyle\left.+3\Big{(}\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{x}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}B(\varepsilon)+3\Big{(}\frac{1}{n}\sum_{i=1}^{n}\nabla^{3}_{x}\widetilde{f}(y(\varepsilon),Y_{i})\Big{)}A(\varepsilon)^{2}\right\}+o(\Delta^{3}).

Since G(y(ε+Δ),ε+Δ)=0G(y(\varepsilon+\Delta),\varepsilon+\Delta)=0, the coefficients of the powers of Δ\Delta must be zero and hence

A(ε)=1nΛn(ε),B(ε)=1nΓn(ε),C(ε)=n32Θn(ε)\displaystyle A(\varepsilon)=\frac{1}{\sqrt{n}}\Lambda_{n}(\varepsilon),\hskip 14.22636ptB(\varepsilon)=\frac{1}{{n}}\Gamma_{n}(\varepsilon),\hskip 14.22636ptC(\varepsilon)=n^{-\frac{3}{2}}\Theta_{n}(\varepsilon)

where

Λn(ε)\displaystyle\Lambda_{n}(\varepsilon) =\displaystyle= 1ni=1nxf(y(ε),Yi)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)\displaystyle\frac{-\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla_{x}f(y(\varepsilon),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}
Γn(ε)\displaystyle\Gamma_{n}(\varepsilon) =\displaystyle= [Λn(ε)]2{𝔼[x3f(y(ε),Y)]+ε1ni=1nx3f~(y(ε),Yi)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)}\displaystyle-[\Lambda_{n}(\varepsilon)]^{2}\;\left\{\frac{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{3}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{3}\widetilde{f}(y(\varepsilon),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}\right\}
Λn(ε){1ni=1n2x2f~(y(ε),Y)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)}\displaystyle-\Lambda_{n}(\varepsilon)\left\{\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}2\nabla^{2}_{x}\widetilde{f}(y(\varepsilon),Y)}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}\right\}
Θn(ε)\displaystyle\Theta_{n}(\varepsilon) =\displaystyle= 3Γn(ε){1ni=1nx2f~(y(ε),Yi)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)}\displaystyle-3{\Gamma_{n}(\varepsilon)}\left\{\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla^{2}_{x}\widetilde{f}(y(\varepsilon),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}\right\}
3[Λn(ε)]2{1ni=1nx3f~(y(ε),Yi)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)}\displaystyle-3[\Lambda_{n}(\varepsilon)]^{2}\left\{\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla^{3}_{x}\widetilde{f}(y(\varepsilon),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}\right\}
[Λn(ε)]3{𝔼[x4f(y(ε),Y)]+ε1ni=1nx4f~(y(ε),Yi)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)}\displaystyle-[\Lambda_{n}(\varepsilon)]^{3}\left\{\frac{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{4}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{4}\widetilde{f}(y(\varepsilon),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}\right\}
3Λn(ε)Γn(ε){𝔼[x3f(y(ε),Y)]+ε1ni=1nx3f~(y(ε),Yi)𝔼[x2f(y(ε),Y)]+ε1ni=1nx2f~(y(ε),Yi)}\displaystyle-3\Lambda_{n}(\varepsilon)\Gamma_{n}(\varepsilon)\left\{\frac{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{3}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{3}\widetilde{f}(y(\varepsilon),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(y(\varepsilon),Y)]+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\nabla_{x}^{2}\widetilde{f}(y(\varepsilon),Y_{i})}\right\}

which are the derivatives of $y(\varepsilon)$ we are looking for. This yields the Taylor series expansion

y(ε)=y(0)+εA(0)+ε22B(0)+ε33!C(η)\displaystyle y(\varepsilon)=y(0)+\varepsilon A(0)+\frac{\varepsilon^{2}}{2}B(0)+\frac{\varepsilon^{3}}{3!}C(\eta)

where η\eta is a random variable taking values in [0,1][0,1] that depends on the data. In particular, with ε=1\varepsilon=1 and noting that y(ε)=xn(0)y(\varepsilon)=x_{n}(0) when ε=1\varepsilon=1

xn(0)=x(0)+1nΛn(0)+1nΓn(0)+n32Θn(η)\displaystyle x_{n}(0)=x^{\star}(0)+\frac{1}{\sqrt{n}}\Lambda_{n}(0)+\frac{1}{n}\Gamma_{n}(0)+n^{-\frac{3}{2}}\Theta_{n}(\eta)

where

Λn(0)\displaystyle\Lambda_{n}(0) =\displaystyle= 1ni=1nxf(x(0),Yi)𝔼[x2f(x(0),Y)]\displaystyle\frac{-\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla_{x}f(x^{\star}(0),Y_{i})}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(x^{\star}(0),Y)]}
Γn(0)\displaystyle\Gamma_{n}(0) =\displaystyle= [Λn(0)]2{𝔼[x3f(x(0),Y)]𝔼[x2f(x(0),Y)]}Λn(0){1ni=1n2x2f~(x(0),Y)𝔼[x2f(x(0),Y)]},\displaystyle-[\Lambda_{n}(0)]^{2}\left\{\frac{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{3}f(x^{\star}(0),Y)]}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(x^{\star}(0),Y)]}\right\}-\Lambda_{n}(0)\left\{\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}2\nabla^{2}_{x}\widetilde{f}(x^{\star}(0),Y)}{{\mathbb{E}}_{\mathbb{P}}[\nabla_{x}^{2}f(x^{\star}(0),Y)]}\right\},

and $\eta$ is a random variable taking values between $0$ and $1$. Observe that $\Lambda_{n}(0)=W_{n}(0)$ and $\Gamma_{n}(0)=V_{n}(0)$, as defined in (3.24), and that the error term in (D.1) is $H_{n}(0)=n^{-\frac{3}{2}}\Theta_{n}(\eta)$. Assumption 4.1 holds if $\sup_{n}{\mathbb{E}}_{\mathbb{P}}\|\Theta_{n}(\eta)\|^{p}<\infty$ for $p=1,2$. This will be the case if there is a constant $\gamma>0$ such that $\nabla^{2}_{x}f(x,Y)\leq-\gamma$ for every $(x,Y)$ and the derivatives $\|\nabla_{x}^{k}f(x,Y)\|$ of order $k=1,2,3,4$ are uniformly bounded, though this is not the mildest assumption.

In the case of DRO/DOO, we have first-order conditions

\displaystyle{\mathbb{E}}_{\mathbb{P}}\left[\begin{array}[]{c}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y)+c\big{]}\Big{)}\nabla_{x}f(x,Y)\\[10.0pt] -\frac{\phi^{\prime\prime}(1)}{\delta}\Big{\{}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y)+c\big{]}\Big{)}-1\Big{\}}\end{array}\right]=0 (D.4)

for the solution (x(δ),c(δ))(x^{\star}(\delta),c^{\star}(\delta)) of the population problem, and

\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left[\begin{array}[]{c}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y_{i})+c\big{]}\Big{)}\nabla_{x}f(x,Y_{i})\\[10.0pt] -\frac{\phi^{\prime\prime}(1)}{\delta}\Big{\{}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(x,Y_{i})+c\big{]}\Big{)}-1\Big{\}}\end{array}\right]=0 (D.7)

for the solution (xn(δ),cn(δ))(x^{\star}_{n}(\delta),c^{\star}_{n}(\delta)) of the empirical version. Analogous to the case of SAA, we can define (y(ε),ω(ε))(y(\varepsilon),\omega(\varepsilon)) as the solution of G((y(ε),ω(ε)),ε)=[0,0]G((y(\varepsilon),\omega(\varepsilon)),\varepsilon)=[0,0]^{\prime} (ε[0,1]\varepsilon\in[0,1]) where

\displaystyle G((y(\varepsilon),\omega(\varepsilon)),\varepsilon)=(1-\varepsilon){\mathbb{E}}_{\mathbb{P}}\left[\begin{array}[]{c}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(y(\varepsilon),Y)+\omega(\varepsilon)\big{]}\Big{)}\nabla_{x}f(y(\varepsilon),Y)\\[10.0pt] -\frac{\phi^{\prime\prime}(1)}{\delta}\Big{\{}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(y(\varepsilon),Y)+\omega(\varepsilon)\big{]}\Big{)}-1\Big{\}}\end{array}\right]
\displaystyle+\varepsilon\frac{1}{n}\sum_{i=1}^{n}\left[\begin{array}[]{c}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(y(\varepsilon),Y_{i})+\omega(\varepsilon)\big{]}\Big{)}\nabla_{x}f(y(\varepsilon),Y_{i})\\[10.0pt] -\frac{\phi^{\prime\prime}(1)}{\delta}\Big{\{}[\phi^{*}]^{\prime}\Big{(}-\delta\big{[}f(y(\varepsilon),Y_{i})+\omega(\varepsilon)\big{]}\Big{)}-1\Big{\}}\end{array}\right].

This allows us to embed (D.4) and (D.7) in the problem of solving G((y(ε),ω(ε)),ε)=[0,0]G((y(\varepsilon),\omega(\varepsilon)),\varepsilon)=[0,0]^{\prime}. Expressions for Λn(ε)\Lambda_{n}(\varepsilon), Γn(ε)\Gamma_{n}(\varepsilon) and the remainder term Θn(ε)\Theta_{n}(\varepsilon) such that

[y(ε)ω(ε)]=1nΛn(ε)+1nΓn(ε)+n32Θn(η),\displaystyle\left[\begin{array}[]{c}y(\varepsilon)\\ \omega(\varepsilon)\end{array}\right]=\frac{1}{\sqrt{n}}\Lambda_{n}(\varepsilon)+\frac{1}{n}\Gamma_{n}(\varepsilon)+n^{-\frac{3}{2}}\Theta_{n}(\eta),

where η\eta is a random variable between 0 and ε\varepsilon, can be derived, though equations are long and unwieldy. Setting ε=1\varepsilon=1 we get

[xn(δ)cn(δ)]=1nWn(δ)+1nVn(δ)+n32Θn(η),\displaystyle\left[\begin{array}[]{c}x_{n}(\delta)\\ c_{n}(\delta)\end{array}\right]=\frac{1}{\sqrt{n}}W_{n}(\delta)+\frac{1}{n}V_{n}(\delta)+n^{-\frac{3}{2}}\Theta_{n}(\eta),

where Wn(δ)=Λn(1)W_{n}(\delta)=\Lambda_{n}(1), Vn(δ)=Γn(1)V_{n}(\delta)=\Gamma_{n}(1) and the remainder term Θn(η)\Theta_{n}(\eta) is a product of partial derivatives of

Φ(y,ω,Y):=1δϕ(δ[f(y,Y)+ω])ω\displaystyle\Phi(y,\omega,Y):=-\frac{1}{\delta}\phi^{*}\Big{(}-\delta\Big{[}f(y,Y)+\omega\Big{]}\Big{)}-\omega

with respect to $y$ and $\omega$ of up to order $4$, evaluated at $(y(\eta),\omega(\eta))$ for some random $\eta\in[0,1]$ that depends on the data. Assumption 4.2 holds if $\sup_{n}{\mathbb{E}}_{\mathbb{P}}\|\Theta_{n}(\eta)\|^{p}<\infty$ for $p=1,2$. Again, this will be the case if, for example, the partial derivatives up to order $4$ are uniformly bounded.

As a final note, [22] provides conditions for the error term in an $M$-estimation problem to converge in $L^{p}$ for every $p\geq 1$. It can also be used to derive conditions for $\sup_{n}{\mathbb{E}}_{\mathbb{P}}\|n^{\frac{3}{2}}H_{n}\|^{p}<\infty$ for every $p\geq 1$, and hence for Assumptions 4.1 and 4.2 to hold.

Appendix E Proof of Proposition 4.3

Recall from (4.4) that

𝔼[xn(0)]=x(0)+1nV¯(0)+o(n1)\displaystyle{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]=x^{\star}(0)+\frac{1}{n}\overline{V}(0)+o(n^{-1}) (E.1)

and hence

xg(𝔼[xn(0)])\displaystyle\nabla_{x}g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]) =\displaystyle= xg(x(0)+1nV¯(0)+o(n1))\displaystyle\nabla_{x}g\Big{(}x^{\star}(0)+\frac{1}{n}\overline{V}(0)+o(n^{-1})\Big{)}
=\displaystyle= xg(x(0))+x2g(x(0))(1nV¯(0))+o(n1)\displaystyle\nabla_{x}g(x^{\star}(0))+\nabla^{2}_{x}g(x^{\star}(0))\Big{(}\frac{1}{n}\overline{V}(0)\Big{)}+o(n^{-1})
=\displaystyle= 1nx2g(x(0))V¯(0)+o(n1)\displaystyle\frac{1}{n}\nabla^{2}_{x}g(x^{\star}(0))\overline{V}(0)+o(n^{-1})
x2g(𝔼[xn(0)])\displaystyle\nabla^{2}_{x}g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]) =\displaystyle= x2g(x(0))+O(n1).\displaystyle\nabla^{2}_{x}g(x^{\star}(0))+O(n^{-1}). (E.2)

From Proposition 3.6, the DRO/DOO solution

xn(δ)=x(δ)+1nWn(δ)+1nVn(δ)+oP(n1)\displaystyle x_{n}(\delta)=x^{\star}(\delta)+\frac{1}{\sqrt{n}}W_{n}(\delta)+\frac{1}{n}V_{n}(\delta)+o_{P}(n^{-1})

where x(δ)=x(0)+δπ+o(δ)x^{\star}(\delta)=x^{\star}(0)+\delta\pi+o(\delta). Taking expectations gives

𝔼[xn(δ)]\displaystyle{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)] =\displaystyle= x(δ)+1nV¯(δ)+o(n1)\displaystyle x^{\star}(\delta)+\frac{1}{n}\overline{V}(\delta)+o(n^{-1}) (E.3)
𝕍[xn(δ)]\displaystyle{\mathbb{V}}_{\mathbb{P}}[x_{n}(\delta)] =\displaystyle= 1nξ(δ)+o(n1).\displaystyle\frac{1}{n}\xi(\delta)+o(n^{-1}).

Using (E.1) we can write (E.3) as

𝔼[xn(δ)]\displaystyle{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)] =\displaystyle= 𝔼[xn(0)]+[x(δ)x(0)]+1n[V¯(δ)V¯(0)]+o(n1)\displaystyle{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]+[x^{\star}(\delta)-x^{\star}(0)]+\frac{1}{n}[\overline{V}(\delta)-\overline{V}(0)]+o(n^{-1}) (E.4)
=\displaystyle= 𝔼[xn(0)]+δπ+o(δ)+1n{δV¯δ(0)+o(δ)}+o(n1).\displaystyle{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]+\delta\pi+o(\delta)+\frac{1}{n}\{\delta\overline{V}_{\delta}(0)+o(\delta)\}+o(n^{-1}).

We now expand g(𝔼[xn(δ)])g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]) around g(𝔼[xn(0)])g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]):

g(𝔼[xn(δ)])\displaystyle g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])
=\displaystyle= g(𝔼[xn(0)])+(xg(𝔼[xn(0)])){𝔼[xn(δ)]𝔼[xn(0)]}\displaystyle g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)])+\Big{(}\nabla_{x}g\big{(}{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]\big{)}\Big{)}^{\prime}\Big{\{}{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]-{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]\Big{\}}
+12{𝔼[xn(δ)]𝔼[xn(0)]}x2g(𝔼[xn(0)]){𝔼[xn(δ)]𝔼[xn(0)]}\displaystyle+\frac{1}{2}\Big{\{}{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]-{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]\Big{\}}^{\prime}\nabla^{2}_{x}g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)])\Big{\{}{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]-{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]\Big{\}}
+o(xn(δ)]𝔼[xn(0)]2)\displaystyle+o\big{(}\|x_{n}(\delta)]-{\mathbb{E}}_{\mathbb{P}}[x_{n}(0)]\|^{2}\big{)}
=\displaystyle= g(𝔼[xn(0)])+(1nx2g(x(0))V¯(0)+o(n1)){δπ+O(δ2)+1nO(δ)}\displaystyle g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)])+\Big{(}\frac{1}{n}\,\nabla^{2}_{x}g(x^{\star}(0))\overline{V}(0)+o(n^{-1})\Big{)}^{\prime}\Big{\{}\delta\pi+O(\delta^{2})+\frac{1}{n}O(\delta)\Big{\}}
+12(δπ+O(δ2)+1nO(δ)){x2g(x(0))+O(n1)}(δπ+O(δ2)+1nO(δ))\displaystyle+\frac{1}{2}\Big{(}\delta\pi+O(\delta^{2})+\frac{1}{n}O(\delta)\Big{)}^{\prime}\Big{\{}\nabla^{2}_{x}g(x^{\star}(0))+O(n^{-1})\Big{\}}\Big{(}\delta\pi+O(\delta^{2})+\frac{1}{n}O(\delta)\Big{)}

where the second equality follows from (E.2) and (E.4). It now follows that the expected reward at the mean of the robust solution

g(𝔼[xn(δ)])\displaystyle g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])
=\displaystyle= g(𝔼[xn(0)])+δn{πx2g(x(0))V¯(0)}+δ22π[x2g(x(0))]πImpact of robustness\displaystyle g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)])+\underbrace{\frac{\delta}{n}\Big{\{}\pi^{\prime}\nabla^{2}_{x}g(x^{\star}(0))\overline{V}(0)\Big{\}}+\frac{\delta^{2}}{2}\pi^{\prime}\big{[}\nabla^{2}_{x}g(x^{\star}(0))\big{]}\pi}_{\tiny\mbox{Impact of robustness}}
+o(1/n2)+o(δ2)+1no(δ2)+1n2O(δ)\displaystyle+o(1/n^{2})+o(\delta^{2})+\frac{1}{n}o(\delta^{2})+\frac{1}{n^{2}}O(\delta)

In the case of the loss from Jensen’s inequality, recall that the variance of the solution

𝔼[(xn(δ))𝔼[xn(δ)])(xn(δ))𝔼[xn(δ)])]\displaystyle{\mathbb{E}}_{\mathbb{P}}\Big{[}\big{(}x_{n}(\delta))-{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]\big{)}\big{(}x_{n}(\delta))-{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]\big{)}^{\prime}\Big{]}
=\displaystyle= 1nξ(δ)+O(n3/2)\displaystyle\frac{1}{n}\xi(\delta)+O(n^{-3/2})
=\displaystyle= 1n{ξ(0)+δξδ(0)}+1no(δ)+O(n3/2).\displaystyle\frac{1}{n}\Big{\{}\xi(0)+\delta\xi_{\delta}(0)\Big{\}}+\frac{1}{n}o(\delta)+O(n^{-3/2}).

It now follows from a Taylor series expansion that

𝔼[g(xn(δ))g(𝔼[xn(δ)])]\displaystyle{\mathbb{E}}_{\mathbb{P}}\Big{[}g(x_{n}(\delta))-g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])\Big{]}
=\displaystyle= 𝔼[xg(𝔼[xn(δ)])(xn(δ)𝔼[xn(δ)])]\displaystyle{\mathbb{E}}_{\mathbb{P}}\Big{[}\nabla_{x}g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])\Big{(}x_{n}(\delta)-{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]\Big{)}\Big{]}
+12𝔼[(xn(δ))𝔼[xn(δ)])x2g(𝔼[xn(δ)])(xn(δ))𝔼[xn(δ)])]+O(n3/2)\displaystyle+\frac{1}{2}{\mathbb{E}}_{\mathbb{P}}\Big{[}\big{(}x_{n}(\delta))-{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]\big{)}^{\prime}\nabla^{2}_{x}g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])\big{(}x_{n}(\delta))-{\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)]\big{)}\Big{]}+O(n^{-3/2})
=\displaystyle= 12tr{x2g(x(δ)+1nV¯(δ)+o(n1))ξ(δ)n}+O(n3/2)\displaystyle\frac{1}{2}{\rm tr}\Big{\{}\nabla^{2}_{x}g\Big{(}x^{\star}(\delta)+\frac{1}{n}\overline{V}(\delta)+o(n^{-1})\Big{)}\frac{\xi(\delta)}{n}\Big{\}}+O(n^{-3/2})
=\displaystyle= 12ntr{x2g(x(δ))ξ(δ)}+O(n3/2)\displaystyle\frac{1}{2n}\mbox{tr}\Big{\{}\nabla^{2}_{x}g(x^{\star}(\delta))\xi(\delta)\Big{\}}+O(n^{-3/2})
=\displaystyle= 12ntr{x2g(x(0))ξ(0)}+δ2nddδ{tr(x2g(x(δ))ξ(δ))}|δ=0+1nO(δ2)+O(n3/2)\displaystyle\frac{1}{2n}\mbox{tr}\Big{\{}\nabla^{2}_{x}g(x^{\star}(0))\xi(0)\Big{\}}+\frac{\delta}{2n}\frac{\mbox{d}}{\mbox{d}\delta}\Big{\{}\mbox{tr}\Big{(}\nabla^{2}_{x}g(x^{\star}(\delta))\xi(\delta)\Big{)}\Big{\}}\Big{|}_{\delta=0}+\frac{1}{n}O(\delta^{2})+O(n^{-3/2})
=\displaystyle= 𝔼[g(xn(0))g(𝔼[xn(0)])]+(δ2n)ddδ{tr(x2g(x(δ))ξ(δ))}|δ=0Impact of robustness\displaystyle{\mathbb{E}}_{\mathbb{P}}\Big{[}g(x_{n}(0))-g({\mathbb{E}}_{\mathbb{P}}[x_{n}(0)])\Big{]}+\underbrace{\Big{(}\frac{\delta}{2n}\Big{)}\frac{\mbox{d}}{\mbox{d}\delta}\Big{\{}\mbox{tr}\Big{(}\nabla^{2}_{x}g(x^{\star}(\delta))\xi(\delta)\Big{)}\Big{\}}\Big{|}_{\delta=0}}_{\tiny\mbox{Impact of robustness}}
+1nO(δ2)+O(n3/2)\displaystyle+\frac{1}{n}O(\delta^{2})+O(n^{-3/2})

where the second equality follows from (E.3), the fourth is a Taylor series expansion of the function $\mbox{tr}\big\{\nabla^{2}_{x}g(x^{\star}(\delta))\xi(\delta)\big\}$ around $\delta=0$, and the last equality follows from (4.7). Proposition 4.3 follows after substituting the two expansions above, for $g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])$ and for the Jensen gap ${\mathbb{E}}_{\mathbb{P}}\big[g(x_{n}(\delta))-g({\mathbb{E}}_{\mathbb{P}}[x_{n}(\delta)])\big]$, into the decomposition (4.9).