
Statistical Limit Theorems in Distributionally Robust Optimization

Abstract

The goal of this paper is to develop methodology for the systematic analysis of asymptotic statistical properties of data driven DRO formulations based on their corresponding non-DRO counterparts. We illustrate our approach in various settings, including both phi-divergence and Wasserstein uncertainty sets. Different types of asymptotic behaviors are obtained depending on the rate at which the uncertainty radius decreases to zero as a function of the sample size and the geometry of the uncertainty sets.

Jose Blanchet
Department of Management Science
and Engineering
Stanford University
[email protected]
Alexander Shapiro
Georgia Institute of Technology
Atlanta, Georgia 30332-0205, USA,
[email protected]

1 Introduction

The statistical analysis of Empirical Risk Minimization (ERM) estimators is a well investigated topic both in statistics (e.g., [18]) and stochastic optimization (e.g., [17]). In recent years, there has been significant interest in the investigation of distributionally robust optimization (DRO) estimators (e.g., [13]). The goal of this paper is to develop methodology for the study of asymptotic statistical properties of data driven DRO formulations based on their corresponding non-DRO counterparts.

Our objective is to illustrate the main conceptual strategies for the statistical development, emphasizing qualitative features, for instance, the different types of behavior arising from the interaction between the distributional uncertainty size and the sample size. Consequently, in order to keep the discussion easily accessible, we do not necessarily work under the most general assumptions for which our results apply.

To set the stage, let us introduce some notation. We use 𝔓(𝒮){\mathfrak{P}}(\mathcal{S}) to denote the set of Borel probability measures supported on a closed (nonempty) set 𝒮d\mathcal{S}\subset{\mathbb{R}}^{d}. Let X1,,XnX_{1},...,X_{n} be a sequence of independent identically distributed (i.i.d.) random vectors viewed as realizations (or i.i.d. copies) of random vector XX having distribution P𝔓(𝒮)P_{\ast}\in{\mathfrak{P}}(\mathcal{S}). Consider the corresponding empirical measure Pn=n1i=1nδXiP_{n}=n^{-1}\sum_{i=1}^{n}\delta_{X_{i}}, where δx\delta_{x} denotes the Dirac measure of mass one at the point xdx\in{\mathbb{R}}^{d}. The sample mean of a function ψ:𝒮\psi:\mathcal{S}\rightarrow{\mathbb{R}} is 𝔼Pn[ψ(X)]=n1i=1nψ(Xi){\mathbb{E}}_{P_{n}}[\psi(X)]=n^{-1}\sum_{i=1}^{n}\psi(X_{i}). By the Strong Law of Large Numbers we have that 𝔼Pn[ψ(X)]{\mathbb{E}}_{P_{n}}[\psi(X)] converges with probability one (w.p.1) to 𝔼P[ψ(X)]{\mathbb{E}}_{P_{\ast}}[\psi(X)], provided the expectation 𝔼P[ψ(X)]{\mathbb{E}}_{P_{\ast}}[\psi(X)] is well defined111Throughout our discussion, every function whose expectation is considered will be assumed to be Borel measurable, so we will not be concerned with making this assumption repeatedly..

By the Central Limit Theorem,

n1/2(𝔼Pn[ψ(X)]𝔼P[ψ(X)])N(0,σ2),n^{1/2}({\mathbb{E}}_{P_{n}}\left[\psi\left(X\right)\right]-{\mathbb{E}}_{P_{\ast}}[\psi\left(X\right)]){\rightsquigarrow}N\left(0,\sigma^{2}\right),

where “{\rightsquigarrow}” denotes the weak convergence (converges in distribution) and N(0,σ2)N\left(0,\sigma^{2}\right) represents the normal distribution with mean zero and variance σ2=VarP[ψ(X)]\sigma^{2}=\mathrm{Var}_{P_{\ast}}[\psi(X)], provided this variance is finite.
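As a quick numerical sanity check of this limit, the following minimal sketch estimates the variance of n^{1/2}(E_{P_n}[ψ(X)] − E_{P_*}[ψ(X)]) by Monte Carlo; the choice of P_* (a unit-rate exponential), of ψ(x)=x², and of the sample sizes is an illustrative assumption, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (assumptions): X ~ Exp(1) on S = [0, inf), psi(x) = x**2.
psi = lambda x: x ** 2
mean_true = 2.0            # E[X^2] for Exp(1)
var_true = 24.0 - 4.0      # Var[X^2] = E[X^4] - (E[X^2])^2 = 24 - 4 = 20

n, reps = 1000, 4000
samples = rng.exponential(1.0, size=(reps, n))
# n^{1/2} (E_{P_n}[psi(X)] - E_{P*}[psi(X)]) for each replication
z = np.sqrt(n) * (psi(samples).mean(axis=1) - mean_true)
print(z.var(), var_true)   # the empirical variance should be close to 20
```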

We consider a loss function of the form l:d×Θl:{\mathbb{R}}^{d}\times\Theta\rightarrow{\mathbb{R}}, with Θm\Theta\subset{\mathbb{R}}^{m} being the parameter space. Unless stated otherwise, we assume that the set Θ\Theta is compact and l(x,θ)l(x,\theta) is continuous on 𝒮×Θ\mathcal{S}\times\Theta. We define

fn(θ):=𝔼Pn[l(X,θ)] and f(θ):=𝔼P[l(X,θ)].f_{n}\left(\theta\right):={\mathbb{E}}_{P_{n}}\left[l\left(X,\theta\right)\right]\text{ \ and }f\left(\theta\right):={\mathbb{E}}_{P_{\ast}}\left[l\left(X,\theta\right)\right]. (1.1)

So, the standard ERM formulation takes the form

minθΘfn(θ),\min_{\theta\in\Theta}f_{n}\left(\theta\right), (1.2)

which is viewed as an empirical counterpart of the “true” (or limiting) form

minθΘf(θ).\min_{\theta\in\Theta}f\left(\theta\right). (1.3)
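As a minimal sketch of how the ERM problem (1.2) is solved in practice, the snippet below minimizes f_n over a compact Θ for an illustrative quadratic loss l(x, θ) = (x − θ)², whose ERM solution is the sample mean; the loss, the data-generating distribution and Θ are assumptions made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
X = rng.normal(loc=1.5, scale=2.0, size=500)        # i.i.d. sample from P_* (an assumption)

# Empirical objective f_n(theta) = E_{P_n}[ l(X, theta) ] with l(x, theta) = (x - theta)^2
f_n = lambda theta: np.mean((X - theta) ** 2)

res = minimize_scalar(f_n, bounds=(-10.0, 10.0), method="bounded")   # Theta = [-10, 10]
print(res.x, X.mean())     # the ERM minimizer coincides with the sample mean for this loss
```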

Statistical properties such as consistency and asymptotic normality of the ERM estimates have been widely studied in significant generality as the sample size nn\rightarrow\infty. These types of results hold under structural properties of the function f()f\left(\cdot\right) and natural stability assumptions (to be reviewed) which guarantee a functional Central Limit Theorem for fn()f_{n}\left(\cdot\right). Our goal is to present a largely parallel development for the associated distributionally robust counterpart of (1.2).

More precisely, (1.2) can be endowed with distributional robustness by defining a set of probability measures, referred to as the ambiguity set, 𝔐δ(Pn)𝔓(𝒮){\mathfrak{M}}_{\delta}(P_{n})\subset{\mathfrak{P}}(\mathcal{S}), which are seen as “reasonable” (according to some criterion) perturbations of the empirical measure. The parameter δ0\delta\geq 0 is the uncertainty size and the family of sets {𝔐δ(Pn):δ0}\{{\mathfrak{M}}_{\delta}(P_{n}):\delta\geq 0\} is typically nondecreasing in δ\delta (in the inclusion partial order sense). The ambiguity set can be defined around any reference probability measure, but unless stated otherwise, we will center the ambiguity set around PnP_{n}. In this paper we deal with ambiguity sets of the form

𝔐δ(Pn):={P𝔓(𝒮):D(P,Pn)δ},{\mathfrak{M}}_{\delta}(P_{n}):=\{P\in{\mathfrak{P}}(\mathcal{S}):D(P,P_{n})\leq\delta\}, (1.4)

where D(Q,P)D(Q,P) is a divergence between Q,P𝔓(𝒮)Q,P\in{\mathfrak{P}}(\mathcal{S}). Specifically, we consider the phi-divergence and Wasserstein distance cases.

In order to state the DRO version of (1.2) we define

n(θ,δn):=supP𝔐δn(Pn)𝔼P[l(X,θ)],\mathcal{F}_{n}(\theta,\delta_{n}):=\sup_{P\in{\mathfrak{M}}_{\delta_{n}}(P_{n})}{\mathbb{E}}_{P}\left[l\left(X,\theta\right)\right], (1.5)

where δn\delta_{n} is a monotonically decreasing sequence tending to zero as nn\rightarrow\infty. The DRO version of (1.2) takes the form

minθΘn(θ,δn).\min_{\theta\in\Theta}\mathcal{F}_{n}(\theta,\delta_{n}). (1.6)

The aim of this paper is to investigate asymptotic statistical properties of the optimal value and optimal solutions of the DRO problem (1.6). In particular, under natural assumptions (to be discussed), we will show that both in phi-divergence and Wasserstein DRO formulations, there are typically (but not always) three types of cases involving the limiting asymptotic statistics depending on the rate of convergence of δn\delta_{n} to zero. These can be seen both in terms of the value function error

minθΘn(θ,δn)minθΘf(θ),\min_{\theta\in\Theta}\mathcal{F}_{n}(\theta,\delta_{n})-\min_{\theta\in\Theta}f\left(\theta\right),

and the optimal solution error (assuming it is unique for the limiting version of the problem and sufficient regularity conditions are in place).

Intuitively, if δn\delta_{n} is smaller than a certain (to be characterized) critical rate relative to the canonical parametric statistical error rate n1/2n^{-1/2}, then the DRO effect is negligible compared to the statistical error implicit in a sample of size nn. If δn\delta_{n} decreases to zero right at the critical rate, the DRO effect is comparable with this statistical error and can be quantified in the form of an asymptotic bias. If δn\delta_{n} is bigger than the critical rate, the DRO effect overwhelms the statistical noise. These critical rates depend on the sensitivity of the optimal value function with respect to a small change in the size of uncertainty.

Our objective is to provide accessible principles that can be used to obtain explicit limiting distributions for the errors, both for value functions and optimizers, when δn0\delta_{n}\rightarrow 0 in these three cases; see Theorems 3.1 and 3.2 for general principles and Theorems 4.1 and 4.2 for the application of these principles to the value functions of phi-divergence and Wasserstein DRO, respectively; and Theorems 5.1 and 5.2 for the corresponding application to phi-divergence and Wasserstein DRO optimal solutions, respectively.

It is important to note that it is common in the data-driven DRO literature to suggest choosing δn\delta_{n} in order to enforce that PP_{\ast} is inside 𝔐δn(Pn){\mathfrak{M}}_{\delta_{n}}(P_{n}) with high probability. Such a selection will typically fall into the third case, that is, this choice will induce estimates that are substantially larger than standard statistical noise. Therefore, prescriptions corresponding to the third case should be adopted only if the optimizer perceives that the out-of-sample environment is substantially different from the observed (empirical) environment due to errors or fluctuations that fall outside of standard statistical noise.

The rest of the paper is organized as follows. In Section 2 we will quickly review the elements of statistical analysis of Empirical Risk Minimization (ERM) – also known as Empirical Optimization or Sample Average Approximation – which corresponds to the case δn=0\delta_{n}=0. Then, in Section 3, we will follow a parallel discussion to that of Section 2 and discuss assumptions for the data-driven DRO version of the problem. The objective is to use these assumptions so that we can obtain a flexible and disciplined approach that can be systematically applied to various DRO formulations. Then, in Section 4 we will discuss the application of this approach to the explicit development of asymptotics for the optimal value in phi-divergence and Wasserstein DRO and, finally, in Section 5, we also develop these explicit results for associated optimal solutions.

We use the following notation throughout the paper. For a sequence YnY_{n} of random variables, by writing Yn=op(nγ)Y_{n}=o_{p}(n^{-\gamma}) we mean that nγYnn^{\gamma}Y_{n} tends in probability to zero as nn\to\infty. In particular Yn=op(1)Y_{n}=o_{p}(1) means that YnY_{n} tends in probability to zero. The notation QPQ\ll P means that Q𝔓(𝒮)Q\in{\mathfrak{P}}(\mathcal{S}) is absolutely continuous with respect to P𝔓(𝒮)P\in{\mathfrak{P}}(\mathcal{S}). Unless stated otherwise probabilistic statements like “almost every” (a.e.), are made with respect to the probability measure PP_{\ast}. By saying that a function h:𝒮h:\mathcal{S}\to{\mathbb{R}} is integrable we mean that 𝔼P|h(X)|<{\mathbb{E}}_{P_{\ast}}|h(X)|<\infty. It is said that a mapping ϕ:mk\phi:{\mathbb{R}}^{m}\to{\mathbb{R}}^{k} is directionally differentiable at a point θm\theta\in{\mathbb{R}}^{m} if the directional derivative

ϕ(θ,d):=limt0ϕ(θ+td)ϕ(θ)t\phi^{\prime}(\theta,d):=\lim_{t\downarrow 0}\frac{\phi(\theta+td)-\phi(\theta)}{t} (1.7)

exists for every dmd\in{\mathbb{R}}^{m}. We will use the term ϵn(θ)\epsilon_{n}(\theta), θΘ\theta\in\Theta, to denote a random field such that

supθΘ|ϵn(θ)|=op(1).\sup_{\theta\in\Theta}\left|\epsilon_{n}\left(\theta\right)\right|=o_{p}(1). (1.8)

2 Statistics of ERM: Review

In addition to the population objective function f(θ):=𝔼P[l(X,θ)]f(\theta):={\mathbb{E}}_{P_{\ast}}\left[l\left(X,\theta\right)\right], introduced in (1.1), we also let

ϑ:=infθΘf(θ)andΘ:=argminθΘf(θ),\vartheta:=\inf_{\theta\in\Theta}f(\theta)\;\;\text{and}\;\;\Theta^{\ast}:=\mathop{\rm arg\,min}_{\theta\in\Theta}f(\theta), (2.1)

be the optimal value and the set of optimal solutions of the population version of the optimization problem, respectively.

As defined in (1.1), fn(θ)=𝔼Pn[l(X,θ)]f_{n}(\theta)={\mathbb{E}}_{P_{n}}\left[l\left(X,\theta\right)\right] is the objective function of the ERM version of the problem and

ϑn:=infθΘfn(θ)andθnargminθΘfn(θ)\vartheta_{n}:=\inf_{\theta\in\Theta}f_{n}(\theta)\;\;\text{and}\;\;\theta_{n}\in\mathop{\rm arg\,min}_{\theta\in\Theta}f_{n}(\theta) (2.2)

are the respective optimal value and an optimal solution of the ERM problem. We will now quickly review the development of the asymptotic statistics of the optimal value in ERM and then discuss the corresponding results for optimal solutions.

2.1 Asymptotics of the Optimal Value

In order to analyze the statistical error in the difference between the optimal values ϑnϑ\vartheta_{n}-\vartheta, we start from enforcing a functional Central Limit Theorem (CLT) for fn()f_{n}\left(\cdot\right). In particular, one imposes assumptions which guarantee an expansion of the form222Recall that ϵn()\epsilon_{n}(\cdot) denotes a random field satisfying condition (1.8).

fn(θ)=f(θ)+n1/2rn(θ)+n1/2ϵn(θ),f_{n}\left(\theta\right)=f\left(\theta\right)+n^{-1/2}r_{n}\left(\theta\right)+n^{-1/2}\epsilon_{n}\left(\theta\right), (2.3)

where we have functional weak convergence

rn()𝔤()r_{n}\left(\cdot\right){\rightsquigarrow}{\mathfrak{g}}\left(\cdot\right) (2.4)

in the uniform topology on compact sets, with 𝔤(){\mathfrak{g}}\left(\cdot\right) being a mean zero Gaussian random field with covariance function

Cov(𝔤(θ),𝔤(θ))=CovP(l(X,θ),l(X,θ)).\mathrm{Cov}\left({\mathfrak{g}}\left(\theta\right),{\mathfrak{g}}\left(\theta^{\prime}\right)\right)=\mathrm{Cov}_{P_{\ast}}\big{(}l\left(X,\theta\right),l\left(X,\theta^{\prime}\right)\big{)}. (2.5)

There are several ways to enforce (2.3); a simple set of sufficient conditions ensuring this is given next (cf., [18, example 19.7]).

Assumption 2.1

(i) For some θ¯Θ\bar{\theta}\in\Theta the expectation 𝔼P[l(X,θ¯)2]{\mathbb{E}}_{P_{\ast}}[l(X,\bar{\theta})^{2}] is finite. (ii) There is a measurable function ψ:𝒮+\psi:\mathcal{S}\rightarrow{\mathbb{R}}_{+} such that 𝔼P[ψ(X)2]{\mathbb{E}}_{P_{\ast}}[\psi(X)^{2}] is finite and

|l(X,θ)l(X,θ)|ψ(X)θθ|l(X,\theta)-l(X,\theta^{\prime})|\leq\psi(X)\|\theta-\theta^{\prime}\| (2.6)

for all θ,θΘ\theta,\theta^{\prime}\in\Theta and a.e. X𝒮X\in\mathcal{S}.

In particular under this assumption, it follows that the expectation function f(θ)f(\theta) and variance

σ2(θ):=VarP(l(X,θ))\sigma^{2}(\theta):=\mathrm{Var}_{P_{\ast}}(l(X,\theta)) (2.7)

are finite valued and continuous on Θ\Theta. Furthermore, since the set Θ\Theta is compact, it follows that the optimal value ϑn\vartheta_{n} of the ERM problem converges to ϑ\vartheta in probability (in fact almost surely). Moreover, it is not difficult to show from (2.3) that the distance from θn\theta_{n} to Θ\Theta^{\ast} converges in probability to zero (actually, the convergence occurs almost surely) as nn\rightarrow\infty. Finally, since the functional V(ϕ):=infθΘϕ(θ)V(\phi):=\inf_{\theta\in\Theta}\phi(\theta), mapping continuous functions ϕ:Θ\phi:\Theta\to{\mathbb{R}} to the real line, is directionally differentiable, the following classical result is a direct consequence of the (functional) Delta Theorem (cf., [14]).

Proposition 2.1

Under Assumption 2.1,

n1/2(ϑnϑ)infθΘ𝔤(θ)n^{1/2}(\vartheta_{n}-\vartheta){\rightsquigarrow}\inf_{\theta\in\Theta^{\ast}}{\mathfrak{g}}(\theta) (2.8)

as nn\rightarrow\infty. In particular, if Θ={θ}\Theta^{*}=\{\theta^{*}\} is a singleton, i.e., θ\theta^{*} is the unique optimal solution of the true problem, then n1/2(ϑnϑ)n^{1/2}(\vartheta_{n}-\vartheta) converges in distribution to normal N(0,σ2(θ))N(0,\sigma^{2}(\theta^{*})).
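A minimal Monte Carlo sketch of (2.8) in the unique-solution case, again with the illustrative loss l(x, θ) = (x − θ)²: here θ* = E[X], ϑ = Var(X), and σ²(θ*) = Var[(X − θ*)²], which equals 2s⁴ for a normal population with standard deviation s. All distributional choices below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, s = 1.5, 2.0
n, reps = 2000, 3000

X = rng.normal(mu, s, size=(reps, n))
# theta_n = sample mean; vartheta_n = f_n(theta_n) = (1/n) sum (X_i - mean)^2
vartheta_n = X.var(axis=1)              # ddof=0 matches min_theta (1/n) sum (X_i - theta)^2
vartheta = s ** 2                       # true optimal value f(theta*) = Var(X)
z = np.sqrt(n) * (vartheta_n - vartheta)

sigma2_star = 2 * s ** 4                # Var[(X - mu)^2] = 2 s^4 for a normal population
print(z.var(), sigma2_star)             # the two numbers should be close
```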

2.2 Asymptotics of Optimal Solutions

We assume now that Θ={θ}\Theta^{\ast}=\{\theta^{\ast}\} is a singleton, i.e., θ\theta^{\ast} is the unique optimal solution of the true (population) problem (1.3). We also assume that for a.e. XX, the function l(X,)l(X,\cdot) is continuously differentiable333Unless stated otherwise all first and second order derivatives will be taken with respect to vector θ\theta.. As argued in the previous section, the asymptotics of the optimal value is governed by the asymptotics of the objective function. On the other hand, the asymptotics of the optimal solutions can be derived from the asymptotics of the gradients of the objective function.

Let us consider the following parametrisation of problem (1.3):

minθΘf(θ)+vTθ,\min_{\theta\in\Theta}f\left(\theta\right)+v^{T}\theta, (2.9)

with parameter vector vmv\in{\mathbb{R}}^{m}. Denote by θ(v)\theta_{*}(v) an optimal solution of the above problem (2.9) viewed as a function of vector vv. Of course, we have that θ(0)=θ\theta_{*}(0)=\theta^{*}.

Assumption 2.2 (uniform second order growth)

There is a neighborhood 𝒱\mathcal{V} of θ\theta^{\ast} and a positive constant κ\kappa such that for every vv in a neighborhood of 0m0\in{\mathbb{R}}^{m}, problem (2.9) has an optimal solution θ(v)𝒱\theta_{*}(v)\in\mathcal{V} and

f(θ)+vT(θθ(v))f(θ(v))+κθθ(v)2,f(\theta)+v^{T}(\theta-\theta_{*}\left(v\right))\geq f(\theta_{*}\left(v\right))+\kappa\|\theta-\theta_{*}\left(v\right)\|^{2}, (2.10)

for all θΘ𝒱\theta\in\Theta\cap\mathcal{V}.

The following assumption can be viewed as a counterpart of Assumption 2.1 applied to the gradients of the objective function.

Assumption 2.3

(i) For some θ¯Θ\bar{\theta}\in\Theta the expectation 𝔼P[l(X,θ¯)2]{\mathbb{E}}_{P_{\ast}}\left[\|\nabla l(X,\bar{\theta})\|^{2}\right] is finite. (ii) There is a measurable function Ψ:𝒮+\Psi:\mathcal{S}\rightarrow{\mathbb{R}}_{+} such that 𝔼P[Ψ(X)2]{\mathbb{E}}_{P_{\ast}}[\Psi(X)^{2}] is finite and

l(X,θ)l(X,θ)Ψ(X)θθ,\|\nabla l(X,\theta)-\nabla l(X,\theta^{\prime})\|\leq\Psi(X)\|\theta-\theta^{\prime}\|, (2.11)

for all θ,θΘ\theta,\theta^{\prime}\in\Theta and a.e. X𝒮X\in\mathcal{S}.

By the functional CLT it follows that

fn(θ)=f(θ)+n1/2dn(θ)+n1/2ϵn(θ),\nabla f_{n}\left(\theta\right)=\nabla f\left(\theta\right)+n^{-1/2}d_{n}\left(\theta\right)+n^{-1/2}\epsilon_{n}\left(\theta\right), (2.12)

where we have a functional weak convergence dn()𝔊()d_{n}\left(\cdot\right){\rightsquigarrow}\,{\mathfrak{G}}\left(\cdot\right) in the uniform topology on a closed neighborhood of θ\theta^{\ast}, with 𝔊(){\mathfrak{G}}\left(\cdot\right) being a continuous mean zero Gaussian random field with covariance function

Cov[𝔊(θ),𝔊(θ)]=𝔼P[(l(X,θ)f(θ))(l(X,θ)f(θ))T].\mathrm{Cov}\left[{\mathfrak{G}}(\theta),{\mathfrak{G}}(\theta^{\prime})\right]={\mathbb{E}}_{P_{\ast}}[(\nabla l(X,\theta)-\nabla f(\theta))(\nabla l(X,\theta^{\prime})-\nabla f(\theta^{\prime}))^{T}].

It follows from (2.12) that

[fn(θ)f(θ)][fn(θ)f(θ)]=n1/2[dn(θ)dn(θ)+ϵn(θ)ϵn(θ)].\left[\nabla f_{n}(\theta)-\nabla f(\theta)\right]-\left[\nabla f_{n}(\theta^{\ast})-\nabla f(\theta^{\ast})\right]=n^{-1/2}\left[d_{n}\left(\theta\right)-d_{n}\left(\theta^{\ast}\right)+\epsilon_{n}\left(\theta\right)-\epsilon_{n}\left(\theta^{\ast}\right)\right]. (2.13)

Also since ρn:=θnθ\rho_{n}:=\|\theta_{n}-\theta^{\ast}\| tends in probability to zero, we have

supθ:θθρn[dn(θ)dn(θ)+ϵn(θ)ϵn(θ)]=op(1).\sup_{\theta:\|\theta-\theta^{\ast}\|\leq\rho_{n}}\left[d_{n}\left(\theta\right)-d_{n}\left(\theta^{\ast}\right)+\epsilon_{n}\left(\theta\right)-\epsilon_{n}\left(\theta^{\ast}\right)\right]=o_{p}(1). (2.14)

Thus we have the following result from [15, Theorem 2.1], where the respective regularity conditions are ensured by the above property (2.14).

Proposition 2.2

Suppose that Assumptions 2.2 and 2.3 hold. Then it follows that

θn=θ(Zn)+op(n1/2),\theta_{n}=\theta_{*}(Z_{n})+o_{p}(n^{-1/2}), (2.15)

where Zn:=fn(θ)f(θ)Z_{n}:=\nabla f_{n}(\theta^{*})-\nabla f(\theta^{*}).

The above result reduces the analysis of asymptotic properties of the optimal solutions to investigation of asymptotic behavior of the optimal solutions of the finite dimensional problem (2.9). By the (finite dimensional) Central Limit Theorem, n1/2Znn^{1/2}Z_{n} converges in distribution to normal N(0,Σ)N(0,\Sigma) with covariance matrix Σ=Cov(l(X,θ))\Sigma=\mathrm{Cov}(\nabla l(X,\theta^{*})). Moreover, if the mapping θ(v)\theta_{*}(v) is directionally differentiable at v=0v=0 (in the Hadamard sense), then by the finite dimensional Delta Theorem it follows from (2.15) that

n1/2(θnθ)θ(0,Z),n^{1/2}(\theta_{n}-\theta^{*})\,{\rightsquigarrow}\,\theta_{*}^{\prime}(0,Z), (2.16)

where ZN(0,Σ)Z\sim N(0,\Sigma). In particular, if θ(0,w)=Aw\theta_{*}^{\prime}(0,w)=Aw is linear (i.e., θ(v)\theta_{*}(v) is differentiable at v=0v=0 with Jacobian matrix AA), then n1/2(θnθ)n^{1/2}(\theta_{n}-\theta^{*}) converges in distribution to normal with null mean vector and covariance matrix AΣATA\Sigma A^{T}.

Directional differentiability of optimal solutions of parameterized problems is well investigated. For example, if θ\theta^{\ast} is an interior point of Θ\Theta, f(θ)f(\theta) is twice continuously differentiable at θ\theta^{*} and the Hessian matrix H:=2f(θ)H:=\nabla^{2}f(\theta^{\ast}) is nonsingular, then the uniform second order growth (Assumption 2.2) holds, and θ(v)\theta_{\ast}\left(v\right) is differentiable at v=0v=0 with θ(0,w)=H1w\theta_{\ast}^{\prime}\left(0,w\right)=H^{-1}w. When θ\theta^{\ast} is on the boundary of the set Θ\Theta, the sensitivity analysis of the parameterized problem (2.9) is more delicate and involves a certain measure of the curvature of the set Θ\Theta at the point θ\theta^{\ast}. This is discussed extensively in [6]. We also refer to [17, sections 5.1.3 and 7.1.5] for a basic summary of such results.
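When θ* is an interior point and H is nonsingular, (2.16) therefore gives the familiar "sandwich" covariance H⁻¹ΣH⁻¹ for n^{1/2}(θ_n − θ*). A minimal plug-in sketch of this estimate for an illustrative loss l(x, θ) = ‖x − θ‖², whose Hessian is known in closed form; the data-generating covariance below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 2000
Sigma_X = np.array([[1.0, 0.3, 0.0],
                    [0.3, 2.0, 0.5],
                    [0.0, 0.5, 1.5]])
X = rng.multivariate_normal(np.zeros(m), Sigma_X, size=n)

# Illustrative loss l(x, theta) = ||x - theta||^2:
#   grad_theta l = -2 (x - theta),  Hessian of f in theta = 2 I
theta_n = X.mean(axis=0)                      # ERM solution for this loss

grads = -2.0 * (X - theta_n)                  # per-sample gradients at theta_n
Sigma_hat = np.cov(grads, rowvar=False)       # plug-in estimate of Sigma = Cov(grad l)
H_hat = 2.0 * np.eye(m)                       # plug-in Hessian of f at theta_n

sandwich = np.linalg.inv(H_hat) @ Sigma_hat @ np.linalg.inv(H_hat)
print(sandwich)    # estimated asymptotic covariance of n^{1/2} (theta_n - theta*)
print(Sigma_X)     # for this particular loss the sandwich reduces to Cov(X)
```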

It is worthwhile to note at this point that the regularity conditions of assumptions 2.2 and 2.3 address different properties of the considered setting. Assumption 2.2 deals with the limiting optimization problem and is of deterministic nature. The uniform second order growth condition was introduced in [15], and in a more general form was discussed in [6, section 5.1.3]. On the other hand Assumption 2.3 is related to the stochastic behavior of the ERM problem (1.2).

3 Statistics of DRO: General Principles

We now provide sufficient conditions for the development of DRO statistical principles based on assumptions which are parallel to those imposed in the ERM section. Define

ϑ¯n:=infθΘn(θ,δn)andθ¯nargminθΘn(θ,δn),\bar{\vartheta}_{n}:=\inf_{\theta\in\Theta}\mathcal{F}_{n}(\theta,\delta_{n})\;\;\text{and}\;\;\bar{\theta}_{n}\in\mathop{\rm arg\,min}_{\theta\in\Theta}\mathcal{F}_{n}(\theta,\delta_{n}), (3.1)

the optimal value and an optimal solution of the DRO problem (1.6) (recall the definition of n(θ,δn)\mathcal{F}_{n}(\theta,\delta_{n}) in (1.5)).

3.1 DRO Asymptotics of the Optimal Value

Similar to the ERM case, in the DRO setting, we will typically have an expansion of the form

n(θ,δn)=fn(θ)+δnγRn(θ)+δnγϵn(θ),\mathcal{F}_{n}(\theta,\delta_{n})=f_{n}\left(\theta\right)+\delta_{n}^{\gamma}R_{n}\left(\theta\right)+\delta_{n}^{\gamma}\epsilon_{n}\left(\theta\right), (3.2)

for some γ>0\gamma>0, where Rn()R_{n}(\cdot) converges in probability in the uniform topology over Θ\Theta to a continuous deterministic process ϱ()\varrho(\cdot),

Rn(θ)=ϱ(θ)+ϵn(θ).R_{n}(\theta)=\varrho(\theta)+\epsilon_{n}(\theta). (3.3)

Since n(,δn)fn()\mathcal{F}_{n}(\cdot,\delta_{n})\geq f_{n}(\cdot), it follows then that ϱ()0\varrho(\cdot)\geq 0. We will characterize γ>0\gamma>0 and ϱ()\varrho\left(\cdot\right) explicitly in the next sections in the context of phi-divergence and Wasserstein DRO formulations under suitable conditions on the distributional uncertainty set – in addition to Assumption 2.1 (which is clearly independent of the distributional uncertainty). The following result, which is immediate from the application of the functional Delta Theorem summarizes the type of behavior that we expect in DRO formulations depending on the geometry of the distributional uncertainty set and the rate of decay to zero of the uncertainty size δn\delta_{n}. Recall that 𝔤(θ){\mathfrak{g}}(\theta) denotes a mean zero Gaussian random field with covariance function (2.5).

Theorem 3.1

Suppose that Assumption 2.1 and conditions (3.2) - (3.3) hold. Then there are three types of asymptotic behavior of the DRO optimal value:
(a) If δnγ=o(n1/2)\delta_{n}^{\gamma}=o\left(n^{-1/2}\right), then

ϑ¯n=ϑn+op(n1/2),\bar{\vartheta}_{n}=\vartheta_{n}+o_{p}\left(n^{-1/2}\right), (3.4)

and hence

n1/2(ϑ¯nϑ)infθΘ𝔤(θ),n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\inf_{\theta\in\Theta^{\ast}}{\mathfrak{g}}(\theta), (3.5)

which coincides with (2.8) and thus the DRO formulation has no asymptotic impact.
(b) If δnγ=n1/2\delta_{n}^{\gamma}=n^{-1/2}, then

n1/2(ϑ¯nϑ)infθΘ{𝔤(θ)+ϱ(θ)},n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\inf_{\theta\in\Theta^{\ast}}\left\{{\mathfrak{g}}(\theta)+\varrho\left(\theta\right)\right\}, (3.6)

so the DRO formulation introduces an explicit and quantifiable asymptotic bias which can be interpreted as a regularization term.
(c) If o(δnγ)=n1/2o\left(\delta_{n}^{\gamma}\right)=n^{-1/2}, then444The right hand side of (3.7) is a deterministic number. Therefore convergence in distribution ‘{\rightsquigarrow}’ there is the same as convergence in probability.

δnγ(ϑ¯nϑ)infθΘϱ(θ),\delta_{n}^{-\gamma}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\inf_{\theta\in\Theta^{\ast}}\varrho\left(\theta\right), (3.7)

so the bias term induced by the DRO formulation is larger than the statistical error.

Proof. Part (a). By (3.2) and (3.3) we have that in the considered case

n(θ,δn)=fn(θ)+o(n1/2)ϵn(θ),{\cal F}_{n}(\theta,\delta_{n})=f_{n}(\theta)+o(n^{-1/2})\epsilon_{n}\left(\theta\right),

where ϵn(θ)\epsilon_{n}\left(\theta\right) is the generic term satisfying (1.8). Thus (3.4) follows.

Part (b). By (3.2) and (3.3) in the considered case we can write

n1/2(n(θ,δn)f(θ))=n1/2(fn(θ)f(θ))+ϱ(θ)+ϵn(θ).n^{1/2}\left(\mathcal{F}_{n}(\theta,\delta_{n})-f(\theta)\right)=n^{1/2}(f_{n}(\theta)-f(\theta))+\varrho(\theta)+\epsilon_{n}\left(\theta\right).

Under Assumption 2.1, by the functional CLT we have that n1/2(fn(θ)f(θ))+ϱ(θ)n^{1/2}(f_{n}(\theta)-f(\theta))+\varrho(\theta) converges in distribution to 𝔤(θ)+ϱ(θ){\mathfrak{g}}(\theta)+\varrho(\theta), and hence (3.6) follows by the Delta Theorem.

Part (c) may appear somewhat different because the right hand side is deterministic; however, under case (c) we can simply write

n(θ,δn)=f(θ)+δnγRn(θ)+δnγϵn(θ),\mathcal{F}_{n}(\theta,\delta_{n})=f\left(\theta\right)+\delta_{n}^{\gamma}R_{n}\left(\theta\right)+\delta_{n}^{\gamma}\epsilon_{n}\left(\theta\right),

so case (c) also follows from the standard analysis since Rn()R_{n}\left(\cdot\right) converges uniformly to ϱ()\varrho\left(\cdot\right) in probability (thus it converges weakly in the uniform topology). \hfill\square

3.2 DRO Asymptotics of the Optimal Solutions

As in the ERM development, in addition to Assumption 2.2, it is convenient to guarantee that for all nn large enough, n(θ,δn)\mathcal{F}_{n}(\theta,\delta_{n}) is differentiable in a neighborhood 𝒱\mathcal{V} of θ\theta^{\ast} and

n(θ,δn)=fn(θ)+δnγDn(θ)+δnγϵn(θ),\nabla\mathcal{F}_{n}(\theta,\delta_{n})=\nabla f_{n}\left(\theta\right)+\delta_{n}^{\gamma}D_{n}\left(\theta\right)+\delta_{n}^{\gamma}\epsilon_{n}\left(\theta\right), (3.8)

for some γ>0\gamma>0, where Dn(θ)D_{n}\left(\theta\right) converges in probability to ϱ(θ)\nabla\varrho\left(\theta\right) uniformly around a closed neighborhood 𝒱¯\mathcal{\bar{V}} of θ\theta^{\ast}. In consequence, we obtain the following analog of Theorem 3.1, which follows from the finite dimensional Delta Theorem. Recall that θ(v)\theta_{*}(v) is an optimal solution of problem (2.9) and θ(0,)\theta^{\prime}_{*}(0,\cdot) is its directional derivative at v=0v=0.

Theorem 3.2

Suppose that: Assumptions 2.2 and 2.3 hold, conditions (3.2) - (3.3) are satisfied, identity (3.8) holds with Dn()D_{n}\left(\cdot\right) converging in probability to ϱ()\nabla\varrho\left(\cdot\right) uniformly around a closed neighborhood 𝒱¯\mathcal{\bar{V}} of θ\theta^{\ast}, and that θ(v)\theta_{\ast}(v) is directionally differentiable at v=0v=0 (in the Hadamard sense). Let ZN(0,Σ)Z\sim N(0,\Sigma) with covariance matrix Σ=Cov(l(X,θ))\Sigma=\mathrm{Cov}(\nabla l(X,\theta^{\ast})). Then the DRO optimal solutions can have three types of asymptotic behavior:
(A) If δnγ=o(n1/2)\delta_{n}^{\gamma}=o\left(n^{-1/2}\right), then

θ¯n=θn+op(n1/2),\bar{\theta}_{n}=\theta_{n}+o_{p}\left(n^{-1/2}\right), (3.9)

thus

n1/2(θ¯nθ)θ(0,Z).n^{1/2}\left(\bar{\theta}_{n}-\theta^{\ast}\right){\rightsquigarrow}\,\theta_{\ast}^{\prime}\left(0,Z\right)\mathfrak{.} (3.10)

(B) If δnγ=n1/2\delta_{n}^{\gamma}=n^{-1/2}, then

n1/2(θ¯nθ)θ(0,Z+ϱ(θ)).n^{1/2}\left(\bar{\theta}_{n}-\theta^{\ast}\right){\rightsquigarrow}\,\theta_{\ast}^{\prime}\left(0,Z+\nabla\varrho(\theta^{\ast})\right). (3.11)

(C) If o(δnγ)=n1/2o\left(\delta_{n}^{\gamma}\right)=n^{-1/2}, then

δnγ(θ¯nθ)θ(0,ϱ(θ)).\delta_{n}^{-\gamma}\left(\bar{\theta}_{n}-\theta^{\ast}\right){\rightsquigarrow}\,\theta_{\ast}^{\prime}\left(0,\nabla\varrho\left(\theta^{\ast}\right)\right). (3.12)

4 General Principle in Action: Optimal Values

In this section, we apply the general principle to the asymptotics of the value function in two of the main types of DRO formulations, namely, phi-divergence and Wasserstein DRO.

4.1 The Phi-Divergence Case

We recall the definition of the distributional uncertainty set for the phi-divergence case. Consider a convex lower semi-continuous function ϕ:+{+}\phi\colon{\mathbb{R}}\rightarrow{\mathbb{R}}_{+}\cup\{+\infty\} such that ϕ(1)=0\phi(1)=0 and ϕ(t)=+\phi(t)=+\infty for t<0t<0. For probability measures Q,P𝔓(𝒮)Q,P\in{\mathfrak{P}}(\mathcal{S}) such that QQ is absolutely continuous with respect to PP with the corresponding density dQ/dPdQ/dP, the ϕ\phi-divergence is defined as (cf., [7],[12])

Dϕ(QP):=𝔼P[ϕ(dQ/dP)]=ϕ(dQ/dP)𝑑P.D_{\phi}(Q\|P):={\mathbb{E}}_{P}[\phi(dQ/dP)]=\int\phi(dQ/dP)dP. (4.1)

In particular, for ϕ(t):=tlog(t)t+1\phi(t):=t\log\left(t\right)-t+1, t0t\geq 0, this becomes the Kullback–Leibler (KL) divergence of QQ from PP. The ambiguity set 𝔐δ(P){\mathfrak{M}}_{\delta}(P) associated with Dϕ(P)D_{\phi}(\cdot\|P) is defined as

𝔐δ(P):={QP:Dϕ(QP)δ}.{\mathfrak{M}}_{\delta}\left(P\right):=\{Q\ll P:D_{\phi}(Q\|P)\leq\delta\}. (4.2)

By duality arguments the corresponding distributionally robust functional can be written in the form (cf., [2], [3], [16])

supQ𝔐δ(P)𝔼Q[Y]=infμ,λ>0{λδ+μ+λ𝔼P[ϕ((Yμ)/λ)]},\sup_{Q\in{\mathfrak{M}}_{\delta}\left(P\right)}{\mathbb{E}}_{Q}[Y]=\inf_{\mu,\lambda>0}\left\{\lambda\delta+\mu+\lambda{\mathbb{E}}_{P}[\phi^{\ast}((Y-\mu)/\lambda)]\right\}, (4.3)

where ϕ(y)=supt{ytϕ(t)}\phi^{\ast}(y)=\sup_{t\in{\mathbb{R}}}\{yt-\phi(t)\} is the convex conjugate of ϕ\phi. Using this representation we can obtain an asymptotic expansion for (4.3) as a function of δ\delta. This expansion can be helpful to suggest the form of the expansion in (3.2) and (3.8). For this, we need to assume certain regularity properties of ϕ(t)\phi\left(t\right) around t=1t=1.

Assumption 4.1

Assume that ϕ(t)\phi(t) is two times continuously differentiable in a neighborhood of t=1t=1 with κ:=2/ϕ′′(1)>0\kappa:=2/\phi^{\prime\prime}(1)>0.

Under this condition we have the following expansion which, in order to simplify our exposition, is obtained under the assumption that the probability measure PP has compact support. See also the results in [11], which provide additional correction terms under a fixed PP. The uniform feature of the statement below is helpful in the statistical analysis. Our development here will also be used in the expansion of the optimal solutions.

Proposition 4.1

Suppose that Assumption 4.1 holds and that P(|Y|ν)=1P\left(\left|Y\right|\leq\nu\right)=1 for some ν(0,)\nu\in\left(0,\infty\right). Then, for any b0>0b_{0}>0,

supQ𝔐δ(P)𝔼Q[Y]𝔼P(Y)δ1/2κ1/2VarP[Y]=o(δ1/2),\sup_{Q\in{\mathfrak{M}}_{\delta}\left(P\right)}{\mathbb{E}}_{Q}[Y]-{\mathbb{E}}_{P}(Y)-\delta^{1/2}\kappa^{1/2}\sqrt{\mathrm{Var}_{P}[Y]}=o\left(\delta^{1/2}\right), (4.4)

uniformly over Borel probability measures PP supported on [ν,ν][-\nu,\nu] such that VarP[Y]b0\mathrm{Var}_{P}[Y]\geq b_{0}. Moreover, there is δ¯>0\bar{\delta}>0 such that for all δ<δ¯\delta<\bar{\delta}

argmax{𝔼Q[Y]:Q𝔐δ(P)}\arg\max\{{\mathbb{E}}_{Q}[Y]:Q\in{\mathfrak{M}}_{\delta}\left(P\right)\}

is unique.

Proof. Note that we can write

supQ𝔐δ(P)𝔼Q[Y]=sup𝔼P(Z)=1,𝔼P(ϕ(Z))δ𝔼P[YZ],\sup_{Q\in{\mathfrak{M}}_{\delta}\left(P\right)}{\mathbb{E}}_{Q}[Y]=\sup_{{\mathbb{E}}_{P}\left(Z\right)=1,{\mathbb{E}}_{P}\left(\phi\left(Z\right)\right)\leq\delta}{\mathbb{E}}_{P}[YZ],

where the sup is taken over the set of positive random variables ZZ satisfying the specified moment constraints. We may assume that 𝔼P[Y]=0{\mathbb{E}}_{P}\left[Y\right]=0 for simplicity since we can always center the objective function around 𝔼P[Y]{\mathbb{E}}_{P}\left[Y\right]. In turn, by letting Δ¯=(Z1)/δ1/2\bar{\Delta}=(Z-1)/\delta^{1/2}, the previous optimization problem is equivalent to

δ1/2sup𝔼P(Δ¯)=0,Δ¯δ1/2,𝔼P(ϕ(1+δ1/2Δ¯))δ𝔼P[YΔ¯].\delta^{1/2}\sup_{{\mathbb{E}}_{P}\left(\bar{\Delta}\right)=0,\bar{\Delta}\geq-\delta^{-1/2},{\mathbb{E}}_{P}(\phi(1+\delta^{1/2}\bar{\Delta}))\leq\delta}{\mathbb{E}}_{P}[Y\bar{\Delta}]. (4.5)

Since |Y|ν\left|Y\right|\leq\nu and 𝔼P[Y]=0{\mathbb{E}}_{P}\left[Y\right]=0, the choice Δ¯=aY\bar{\Delta}=aY is feasible for any a>0a>0 provided that aνδ1/2a\nu\leq\delta^{-1/2} and

𝔼P[ϕ(1+δ1/2Δ¯)]δ.{\mathbb{E}}_{P}[\phi(1+\delta^{1/2}\bar{\Delta})]\leq\delta.

In turn, since ϕ(t)\phi\left(t\right) is two times continuously differentiable at t=1t=1, we have that

δ1ϕ(1+δ1/2ay)a2y2ϕ′′(1)/2\delta^{-1}\phi(1+\delta^{1/2}ay)\rightarrow a^{2}y^{2}\phi^{\prime\prime}\left(1\right)/2

as δ0\delta\rightarrow 0 uniformly over compact sets. Therefore, we conclude that there exists δ0>0\delta_{0}>0 such that for any δ<δ0\delta<\delta_{0}

sup𝔼P(Δ¯)=0,Δ¯δ1/2,𝔼P(ϕ(1+δ1/2Δ¯))δ𝔼P[YΔ¯]\displaystyle\sup_{{\mathbb{E}}_{P}\left(\bar{\Delta}\right)=0,\bar{\Delta}\geq-\delta^{-1/2},{\mathbb{E}}_{P}(\phi(1+\delta^{1/2}\bar{\Delta}))\leq\delta}{\mathbb{E}}_{P}[Y\bar{\Delta}]
supa>0,a2𝔼P(Y2)/2(1δ0)/ϕ′′(1)𝔼P[aY2]=κ(1δ0)𝔼P[Y2].\displaystyle\geq\sup_{a>0,a^{2}{\mathbb{E}}_{P}\left(Y^{2}\right)/2\leq\left(1-\delta_{0}\right)/\phi^{\prime\prime}\left(1\right)}{\mathbb{E}}_{P}[aY^{2}]=\sqrt{\kappa\left(1-\delta_{0}\right)}\cdot\sqrt{{\mathbb{E}}_{P}[Y^{2}]}.

Since δ0>0\delta_{0}>0 can be chosen to be arbitrarily small, we obtain an asymptotic lower bound which recovers (4.4). For the upper bound, applying the duality result (4.3) in the form corresponding to (4.5), we obtain

sup𝔼P(Δ¯)=0,Δ¯δ1/2,δ1𝔼P(ϕ(1+δ1/2Δ¯))1𝔼P[YΔ¯]\displaystyle\sup_{{\mathbb{E}}_{P}\left(\bar{\Delta}\right)=0,\bar{\Delta}\geq-\delta^{-1/2},\delta^{-1}{\mathbb{E}}_{P}(\phi(1+\delta^{1/2}\bar{\Delta}))\leq 1}{\mathbb{E}}_{P}[Y\bar{\Delta}]
=minλ¯>0,μ¯{λ¯+𝔼P[supΔ¯δ1/2{(Y+μ¯)Δ¯λ¯δ1/2ϕ(1+δ1/2Δ¯)}]}\displaystyle=\min_{\bar{\lambda}>0,\bar{\mu}}\{\bar{\lambda}+{\mathbb{E}}_{P}[\sup_{\bar{\Delta}\geq-\delta^{-1/2}}\{\left(Y+\bar{\mu}\right)\bar{\Delta}-\bar{\lambda}\delta^{-1/2}\phi(1+\delta^{1/2}\bar{\Delta})\}]\}
minλ¯>0{λ¯+𝔼P[supΔ¯δ1/2{YΔ¯λ¯δ1/2ϕ(1+δ1/2Δ¯)}]}.\displaystyle\leq\min_{\bar{\lambda}>0}\{\bar{\lambda}+{\mathbb{E}}_{P}[\sup_{\bar{\Delta}\geq-\delta^{-1/2}}\{Y\bar{\Delta}-\bar{\lambda}\delta^{-1/2}\phi(1+\delta^{1/2}\bar{\Delta})\}]\}. (4.6)

We will plug in

λ¯0=argmin{λ¯+κ𝔼P[Y2]/4λ¯:λ¯>0}=21κ𝔼P[Y2]>0\bar{\lambda}_{0}=\arg\min\{\bar{\lambda}+\kappa{\mathbb{E}}_{P}[Y^{2}]/4\bar{\lambda}:\bar{\lambda}>0\}=2^{-1}\sqrt{\kappa{\mathbb{E}}_{P}[Y^{2}]}>0

into (4.6) to obtain our upper bound. Using that λ¯0>0\bar{\lambda}_{0}>0 and that ϕ\phi is convex with ϕ′′(1)>0\phi^{\prime\prime}\left(1\right)>0, we have that the family of (continuous) functions

sδ(y):=supΔ¯δ1/2{yΔ¯λ¯δ1/2ϕ(1+δ1/2Δ¯)}s_{\delta}\left(y\right):=\sup_{\bar{\Delta}\geq-\delta^{-1/2}}\{y\bar{\Delta}-\bar{\lambda}\delta^{-1/2}\phi(1+\delta^{1/2}\bar{\Delta})\}

converges uniformly on compact sets to

s0(y)=supΔ¯{yΔ¯λ¯Δ¯2/κ}=κy24λ¯.s_{0}\left(y\right)=\sup_{\bar{\Delta}}\{y\bar{\Delta}-\bar{\lambda}\bar{\Delta}^{2}/\kappa\}=\frac{\kappa y^{2}}{4\bar{\lambda}}.

Therefore we obtain that

minλ¯>0{λ¯+𝔼P[supΔ¯δ1/2{YΔ¯λ¯δ1/2ϕ(1+δ1/2Δ¯)}]}\displaystyle\min_{\bar{\lambda}>0}\{\bar{\lambda}+{\mathbb{E}}_{P}[\sup_{\bar{\Delta}\geq-\delta^{-1/2}}\{Y\bar{\Delta}-\bar{\lambda}\delta^{-1/2}\phi(1+\delta^{1/2}\bar{\Delta})\}]\}
λ¯0+𝔼P[supΔ¯δ1/2{YΔ¯λ¯0δ1/2ϕ(1+δ1/2Δ¯)}]κ𝔼P[Y2].\displaystyle\leq\bar{\lambda}_{0}+{\mathbb{E}}_{P}[\sup_{\bar{\Delta}\geq-\delta^{-1/2}}\{Y\bar{\Delta}-\bar{\lambda}_{0}\delta^{-1/2}\phi(1+\delta^{1/2}\bar{\Delta})\}]\rightarrow\sqrt{\kappa}\cdot\sqrt{{\mathbb{E}}_{P}[Y^{2}]}.

These estimates, which are uniform given that |Y|ν\left|Y\right|\leq\nu, yield the estimate in the proposition. The uniqueness is standard; it follows from the local strong convexity of ϕ()\phi\left(\cdot\right) at the origin. \hfill\square
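A minimal numerical sketch of the dual representation (4.3) and of expansion (4.4), here for the KL divergence φ(t) = t log t − t + 1, whose conjugate is φ*(s) = e^s − 1 and for which κ = 2/φ''(1) = 2; the discrete distribution, the radius δ, and the optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative discrete distribution for (Y, P): an assumption, not taken from the paper.
y = np.array([0.0, 1.0, 2.0, 5.0])
p = np.array([0.4, 0.3, 0.2, 0.1])
delta = 0.01

mean = np.sum(p * y)
sd = np.sqrt(np.sum(p * (y - mean) ** 2))

# KL divergence: phi(t) = t log t - t + 1, conjugate phi*(s) = exp(s) - 1, kappa = 2/phi''(1) = 2.
phi_conj = lambda s: np.exp(s) - 1.0

def dual_objective(v):
    mu, log_lam = v
    lam = np.exp(log_lam)                 # parametrize lambda > 0 through its logarithm
    return lam * delta + mu + lam * np.sum(p * phi_conj((y - mu) / lam))

x0 = np.array([mean, np.log(sd / np.sqrt(2 * delta))])
res = minimize(dual_objective, x0, method="Nelder-Mead")

print(res.fun)                            # sup_{Q in M_delta(P)} E_Q[Y] computed via the dual (4.3)
print(mean + np.sqrt(2 * delta) * sd)     # first-order expansion E_P[Y] + sqrt(kappa*delta)*sd, cf. (4.4)
```

For a small radius δ the two printed values agree up to the o(δ^{1/2}) error predicted by (4.4).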

Recall that σ2(θ):=VarP(l(X,θ))\sigma^{2}(\theta):=\mathrm{Var}_{P_{\ast}}(l(X,\theta)), and that 𝔤(){\mathfrak{g}}\left(\cdot\right) is a mean zero Gaussian random field. Expansion (4.4) immediately yields, at least when supθΘ|l(X,θ)|\sup_{\theta\in\Theta}\left|l\left(X,\theta\right)\right| is PP_{\ast}-bounded, that

n(θ,δn)=fn(θ)+δn1/2κ1/2σ(θ)+δn1/2ϵn(θ).\mathcal{F}_{n}(\theta,\delta_{n})=f_{n}\left(\theta\right)+\delta_{n}^{1/2}\kappa^{1/2}\sigma(\theta)+\delta_{n}^{1/2}\epsilon_{n}\left(\theta\right). (4.7)

Consequently, we obtain the following result.

Theorem 4.1

Suppose that supθΘ|l(X,θ)|\sup_{\theta\in\Theta}\left|l\left(X,\theta\right)\right| is PP_{\ast}-essentially bounded, that Assumption 2.1 and Assumption 4.1 hold, and that σ2(θ)>0\sigma^{2}(\theta)>0 for all θΘ\theta\in\Theta^{*}. Then, we have the following types of asymptotic behavior of the DRO optimal values.
(a-phi) If δn=o(n1)\delta_{n}=o\left(n^{-1}\right), then

n1/2(ϑ¯nϑ)infθΘ𝔤(θ).n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\inf_{\theta\in\Theta^{\ast}}{\mathfrak{g}}(\theta). (4.8)

(b-phi) If δn=βn1\delta_{n}=\beta n^{-1}, then

n1/2(ϑ¯nϑ)infθΘ{𝔤(θ)+κ1/2β1/2σ(θ)}.n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\inf_{\theta\in\Theta^{\ast}}\left\{{\mathfrak{g}}(\theta)+\kappa^{1/2}\beta^{1/2}\sigma(\theta)\right\}. (4.9)

(c-phi) If o(δn)=n1o\left(\delta_{n}\right)=n^{-1}, then

δn1/2(ϑ¯nϑ)κ1/2infθΘσ(θ),\delta_{n}^{-1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\,\kappa^{1/2}\inf_{\theta\in\Theta^{\ast}}\sigma(\theta), (4.10)

so the bias term induced by the DRO formulation dominates the statistical error.

Proof. The proof of this theorem is quite standard (cf., [17, proof of Theorem 5.7]). For the sake of completeness we briefly outline the proof of case (b-phi). Note that our assumptions imply Assumption 2.1, and hence σ2(θ)\sigma^{2}(\theta) is a continuous function of θ\theta. Therefore there is a compact neighborhood Θ¯\bar{\Theta} of Θ\Theta^{*} such that σ2(θ)>0\sigma^{2}(\theta)>0 for all θΘ¯\theta\in\bar{\Theta}. We can restrict the minimization to Θ¯\bar{\Theta}, over which the expansion (4.7) holds.

Consider the space C(Θ¯)C(\bar{\Theta}) of continuous functions g:Θ¯g:\bar{\Theta}\to{\mathbb{R}} equipped with the sup-norm, and functional V(g):=infθΘ¯g(θ)V(g):=\inf_{\theta\in\bar{\Theta}}g(\theta), mapping C(Θ¯)C(\bar{\Theta}) into the real line. This functional is directionally differentiable in the Hadamard sense with the directional derivative at a point μC(Θ¯)\mu\in C(\bar{\Theta}) given by V(μ,h)=infθΘ¯(μ)h(θ)V^{\prime}(\mu,h)=\inf_{\theta\in\bar{\Theta}(\mu)}h(\theta), where Θ¯(μ):=argminθΘ¯μ(θ)\bar{\Theta}(\mu):=\mathop{\rm arg\,min}_{\theta\in\bar{\Theta}}\mu(\theta). We have that ϑ¯n=V(n)\bar{\vartheta}_{n}=V(\mathcal{F}_{n}) and ϑ=V(f)\vartheta=V(f), where n():=n(,δn)\mathcal{F}_{n}(\cdot):=\mathcal{F}_{n}(\cdot,\delta_{n}). By the functional CLT and (4.7) it follows that n1/2(nf)n^{1/2}(\mathcal{F}_{n}-f) converges in distribution (weakly) to 𝔤(θ)+κ1/2β1/2σ(θ){\mathfrak{g}}(\theta)+\kappa^{1/2}\beta^{1/2}\sigma(\theta). We can apply now the functional Delta Theorem to conclude (4.9). \hfill\square

Given that ϕ()\phi(\cdot) is only assumed to satisfy Assumption 4.1, without imposing any growth condition, situations such as the (c-phi) case require imposing stronger moment conditions than just assuming VarP[l(X,θ)]<\mathrm{Var}_{P_{\ast}}[l\left(X,\theta\right)]<\infty. This can be seen in the KL-divergence case in which ϕ(t)=tlog(t)t+1\phi\left(t\right)=t\log\left(t\right)-t+1. For fixed δ>0\delta>0, the population solution requires that l(X,θ)l\left(X,\theta\right) has a finite moment generating function in a neighborhood of the origin. Therefore, if δn\delta_{n} converges to zero sufficiently slowly and l(X,θ)l\left(X,\theta\right) has infinite moments of order 2+ε2+\varepsilon, an expansion such as (4.7) may not hold. However, if ϕ(t)=(t1)2\phi\left(t\right)=\left(t-1\right)^{2}, it follows that expansion (4.7) holds exactly with ϵn(θ)=0\epsilon_{n}\left(\theta\right)=0.

On the other hand, the result in [8, Theorem 2] provides a stronger result for the case (b-phi) since it does not require compact support (although it requires ϕ\phi to be three times continuously differentiable). The following example shows that the smoothness of ϕ()\phi\left(\cdot\right) is important in deriving the asymptotics in the previous result with δn=n1/2\delta_{n}=n^{-1/2}.

Example 4.1

Consider ϕ(t):=|t1|\phi(t):=|t-1|, t0t\geq 0. In that case (e.g., [16, Example 3.12]), for δ(0,2)\delta\in(0,2) and essentially bounded YY,

supQ𝔐δ(P)𝔼Q[Y]=(δ/2)esssup(Y)+(1δ/2)𝖠𝖵@𝖱P,1δ/2(Y),\sup_{Q\in{\mathfrak{M}}_{\delta}(P)}{\mathbb{E}}_{Q}[Y]=(\delta/2){\rm ess}\sup(Y)+(1-\delta/2)\mathsf{AV@R}_{P,1-\delta/2}(Y), (4.11)

where

𝖠𝖵@𝖱P,α(Y):=infτ{τ+α1𝔼P[Yτ]+},α(0,1].\mathsf{AV@R}_{P,\alpha}(Y):=\inf_{\tau\in{\mathbb{R}}}\left\{\tau+\alpha^{-1}{\mathbb{E}}_{P}[Y-\tau]_{+}\right\},\;\alpha\in(0,1]. (4.12)

Note that 𝖠𝖵@𝖱P,1(Y)=𝔼P[Y]\mathsf{AV@R}_{P,1}(Y)={\mathbb{E}}_{P}[Y] and as α\alpha tends to one,

|𝖠𝖵@𝖱P,α(Y)𝔼P[Y]|=O(1α),\big{|}\mathsf{AV@R}_{P,\alpha}(Y)-{\mathbb{E}}_{P}[Y]\big{|}=O(1-\alpha), (4.13)

provided YY is essentially bounded.

Suppose that l(x,θ)l(x,\theta) is bounded on 𝒮×Θ\mathcal{S}\times\Theta, and hence

n(θ,δn)=(δn/2)max1inl(Xi,θ)+(1δn/2)𝖠𝖵@𝖱Pn,1δn/2(l(X,θ)).\mathcal{F}_{n}(\theta,\delta_{n})=(\delta_{n}/2)\max_{1\leq i\leq n}l(X_{i},\theta)+(1-\delta_{n}/2)\mathsf{AV@R}_{P_{n},1-\delta_{n}/2}(l(X,\theta)). (4.14)

Consider δn=βn1\delta_{n}=\beta n^{-1} with β>0\beta>0. Then the first term in (4.14) is of order O(n1)O(n^{-1}), and by (4.13) the second term is 𝔼Pn[l(X,θ)]+O(n1){\mathbb{E}}_{P_{n}}[l(X,\theta)]+O\left(n^{-1}\right). Consequently, in that case ϑ¯n=ϑn+op(n1/2),\bar{\vartheta}_{n}=\vartheta_{n}+o_{p}(n^{-1/2}), and hence this corresponds to case (a) in Theorem 3.1. This shows that the assumption of smoothness (differentiability) of ϕ()\phi(\cdot) is essential for the derivation of the asymptotics of ϑ¯n\bar{\vartheta}_{n}. Here some additional terms in the asymptotics of ϑ¯n\bar{\vartheta}_{n} appear when δn\delta_{n} is of order O(n1/2)O(n^{-1/2}), rather than O(n1)O(n^{-1}). \hfill\square
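A minimal sketch checking the identity (4.11)-(4.12) of this example on a small discrete distribution: the primal supremum over the ambiguity set with φ(t) = |t − 1| is solved as a linear program and compared with the right-hand side of (4.11) evaluated from (4.12). The distribution, the radius δ, and the use of scipy.optimize.linprog are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative discrete distribution (an assumption, not from the paper).
y = np.array([0.0, 1.0, 2.0, 5.0])
p = np.array([0.4, 0.3, 0.2, 0.1])
delta = 0.3
n = len(y)

# Primal: maximize E_P[Y Z] over densities Z >= 0 with E_P[Z] = 1 and E_P[|Z - 1|] <= delta.
# Variables x = (z_1..z_n, u_1..u_n), with u_i >= |z_i - 1| linearized as two inequalities.
c = np.concatenate([-p * y, np.zeros(n)])                 # minimize -E_P[Y Z]
A_eq = np.concatenate([p, np.zeros(n)])[None, :]
b_eq = [1.0]
A_ub = np.vstack([
    np.hstack([np.eye(n), -np.eye(n)]),                   #  z_i - u_i <= 1
    np.hstack([-np.eye(n), -np.eye(n)]),                  # -z_i - u_i <= -1
    np.concatenate([np.zeros(n), p])[None, :],            #  E_P[u] <= delta
])
b_ub = np.concatenate([np.ones(n), -np.ones(n), [delta]])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
primal = -res.fun

# Right-hand side of (4.11): the AV@R in (4.12) is piecewise linear in tau,
# so the minimizing tau can be searched over the support points {y_i}.
alpha = 1.0 - delta / 2.0
avar = min(tau + np.sum(p * np.maximum(y - tau, 0.0)) / alpha for tau in y)
rhs = (delta / 2.0) * y.max() + (1.0 - delta / 2.0) * avar

print(primal, rhs)    # the two values agree
```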

4.2 The Wasserstein Distance Case

We use 𝔓(𝒮×𝒮){\mathfrak{P}}(\mathcal{S}\times\mathcal{S}) to denote the set of Borel probability measures on the product space 𝒮×𝒮\mathcal{S\times S}. Let c:𝒮×𝒮+{+}c:\mathcal{S\times S}\rightarrow{\mathbb{R}}_{+}\cup\{+\infty\} be a lower semi-continuous function such that c(x,y)=0c(x,y)=0 if and only if x=yx=y. This function measures the marginal cost of transporting a unit of mass from a source location to a target location. The optimal transport cost between P,Q𝔓(𝒮)P,Q\in{\mathfrak{P}}(\mathcal{S}) is given by

Dc(P,Q):=min{𝔼π[c(X,Y)]:π𝔓(𝒮×𝒮), πX=P, πY=Q},D_{c}\left(P,Q\right):=\min\{{\mathbb{E}}_{\pi}\left[c\left(X,Y\right)\right]:\pi\in{\mathfrak{P}}\left(\mathcal{S}\times\mathcal{S}\right),\text{ }\pi_{X}=P,\text{ }\pi_{Y}=Q\}, (4.15)

where 𝔼π[]{\mathbb{E}}_{\pi}[\,\cdot\,] is the expectation under a joint distribution π𝔓(𝒮×𝒮)\pi\in{\mathfrak{P}}\left(\mathcal{S}\times\mathcal{S}\right) and πX\pi_{X} and πY\pi_{Y} denote the marginal distributions of XX and YY, respectively. It turns out that the minimum is always attained, and thus we write ‘min\min’ instead of ‘inf\inf’. Let \|\cdot\| be a norm on the space d{\mathbb{R}}^{d}. An important special case corresponds to the choice c(x,y):=xypc\left(x,y\right):=\|x-y\|^{p} for some p>0p>0, in which case Dc(P,Q)1/pD_{c}\left(P,Q\right)^{1/p} is the so-called pp-Wasserstein distance. The reader is referred to the text of Villani [19] for more background on optimal transport.

For any given P𝔓(𝒮)P\in{\mathfrak{P}}(\mathcal{S}) and δ0\delta\geq 0 we have the following dual result (cf., [9], [4], [10]) assuming that 𝔥(){\mathfrak{h}}\left(\cdot\right) is upper semi-continuous and 𝔥(X){\mathfrak{h}}(X) is PP-integrable,

supQ:Dc(P,Q)δ𝔼Q[𝔥(Y)]=minλ0{λδ+𝔼P[𝔥¯λ(X)]},\sup_{Q:\,D_{c}(P,Q)\leq\delta}{\mathbb{E}}_{Q}[{\mathfrak{h}}(Y)]=\min_{\lambda\geq 0}\left\{\lambda\delta+{\mathbb{E}}_{P}\left[{\mathfrak{\bar{h}}}_{\lambda}(X)\right]\right\}, (4.16)

where

𝔥¯λ(x):=supy𝒮{𝔥(y)λc(x,y)},λ0.{\mathfrak{\bar{h}}}_{\lambda}(x):=\sup_{y\in\mathcal{S}}\{{\mathfrak{h}}(y)-\lambda c(x,y)\},\;\lambda\geq 0. (4.17)
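A minimal sketch evaluating the dual (4.16)-(4.17) for an empirical measure on the real line with c(x, y) = |x − y|², where the inner supremum (4.17) is approximated on a finite grid and the outer minimization over λ is one-dimensional; the loss 𝔥, the data, the grid, and the bracketing interval for λ are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(size=200)                       # sample defining the empirical measure P = P_n
delta = 0.05

h = lambda y: np.log(1.0 + np.exp(y))          # illustrative smooth loss h (an assumption)
y_grid = np.linspace(-10.0, 10.0, 4001)        # grid used to approximate the inner sup in (4.17)

def h_bar(lam):
    # h_bar_lambda(x_i) = sup_y { h(y) - lam * |x_i - y|^2 }, approximated over y_grid
    vals = h(y_grid)[None, :] - lam * (x[:, None] - y_grid[None, :]) ** 2
    return vals.max(axis=1)

dual = lambda lam: lam * delta + h_bar(lam).mean()
res = minimize_scalar(dual, bounds=(1e-3, 50.0), method="bounded")

print(res.fun)             # sup_{Q: D_c(P_n, Q) <= delta} E_Q[h(Y)] via the dual (4.16)
print(h(x).mean())         # nominal value E_{P_n}[h(X)], which is necessarily smaller
```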

Throughout the rest of our discussion, we will choose c(x,y):=xypc\left(x,y\right):=\|x-y\|^{p} for p(1,)p\in\left(1,\infty\right) and therefore write Dp(P,Q)D_{p}\left(P,Q\right) for this choice of cost function. Further, we use \left\|\cdot\right\|_{\ast} to denote the dual norm, namely,

y=sup{xTy:x=1}.\left\|y\right\|_{\ast}=\sup\{x^{T}y:\left\|x\right\|=1\}.

As in the phi-divergence case, assuming that PP is fixed and has compact support, for example, we can obtain an asymptotic expansion for (4.16) as a function of δ\delta. By writing 𝔼P(p1)/p[]{\mathbb{E}}_{P}^{(p-1)/p}[\,\cdot\,] we mean (𝔼P[])(p1)/p({\mathbb{E}}_{P}[\,\cdot\,])^{(p-1)/p}.

Proposition 4.2

Suppose that 𝔥(){\mathfrak{h}}\left(\cdot\right) is continuously differentiable and the mapping

xsup{𝔥(x+Δ)𝔥(x)/(1+Δp1):Δd}x\mapsto\sup\{\left\|\nabla{\mathfrak{h}}\left(x+\Delta\right)-\nabla{\mathfrak{h}}\left(x\right)\right\|/(1+\left\|\Delta\right\|^{p-1}):\Delta\in{\mathbb{R}}^{d}\} (4.18)

is bounded on compact sets. Then, for any b0>0b_{0}>0,

supQ:Dp(P,Q)δ𝔼Q[𝔥(Y)]𝔼P[𝔥(X)]δ1/p𝔼P(p1)/p[𝔥(X)p/(p1)]=o(δ1/p),\sup_{Q:\,D_{p}(P,Q)\leq\delta}{\mathbb{E}}_{Q}[{\mathfrak{h}}(Y)]-{\mathbb{E}}_{P}[{\mathfrak{h}}(X)]-\delta^{1/p}{\mathbb{E}}_{P}^{\left(p-1\right)/p}[\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{p/\left(p-1\right)}]=o\left(\delta^{1/p}\right),

uniformly over P𝔓([ν,ν]d)P\in{\mathfrak{P}}(\mathcal{[}-\nu,\nu\mathcal{]}^{d}) such that 𝔼P𝔥(Y)b0{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}\left(Y\right)\right\|\geq b_{0}.

Proof. The proof of this result is similar to the one given in the phi-divergence case. We start by observing that

supQ:Dp(P,Q)δ𝔼Q[𝔥(Y)]=𝔼P[𝔥(X)]+sup𝔼PΔpδ𝔼P[𝔥(X+Δ)𝔥(X)],\sup_{Q:\,D_{p}(P,Q)\leq\delta}{\mathbb{E}}_{Q}[{\mathfrak{h}}(Y)]={\mathbb{E}}_{P}[{\mathfrak{h}}(X)]+\sup_{{\mathbb{E}}_{P}\left\|\Delta\right\|^{p}\leq\delta}{\mathbb{E}}_{P}[{\mathfrak{h}}(X+\Delta)-{\mathfrak{h}}(X)],

where the optimization in the right hand side is taken over random variables Δ\Delta. We let δ1/pΔ¯=Δ\delta^{1/p}\bar{\Delta}=\Delta and note that

sup𝔼PΔpδ𝔼P[𝔥(X+Δ)𝔥(X)]\displaystyle\sup_{{\mathbb{E}}_{P}\left\|\Delta\right\|^{p}\leq\delta}{\mathbb{E}}_{P}[{\mathfrak{h}}(X+\Delta)-{\mathfrak{h}}(X)]
=δ1/psup𝔼PΔ¯p1𝔼P[(𝔥(X+δ1/pΔ¯)𝔥(X))/δ1/p]\displaystyle=\delta^{1/p}\sup_{{\mathbb{E}}_{P}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P}[\left({\mathfrak{h}}(X+\delta^{1/p}\bar{\Delta})-{\mathfrak{h}}(X)\right)/\delta^{1/p}]
=δ1/psup𝔼PΔ¯p1𝔼P[01𝔥(X+tδ1/pΔ¯)Δ¯𝑑t].\displaystyle=\delta^{1/p}\sup_{{\mathbb{E}}_{P}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P}\left[\int_{0}^{1}\nabla{\mathfrak{h}}(X+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}dt\right].

Next, we can obtain a lower bound by considering a specific form of Δ¯\bar{\Delta} suggested by the formal asymptotic limit as δ0\delta\rightarrow 0. Note that

𝔼P[𝔥(X)Δ¯]𝔼P[𝔥(X)Δ¯],{\mathbb{E}}_{P}[\nabla{\mathfrak{h}}(X)\cdot\bar{\Delta}]\leq{\mathbb{E}}_{P}[\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}\left\|\bar{\Delta}\right\|],

and the equality is achieved if we choose any Δ¯\bar{\Delta}_{\ast} which is a constant multiple of

Δ¯1(X)argmax{𝔥(X)Δ¯:Δ¯=1},\bar{\Delta}_{1}\left(X\right)\in\arg\max\{\nabla{\mathfrak{h}}(X)\cdot\bar{\Delta}:\left\|\bar{\Delta}\right\|=1\},

(The function Δ¯1()\bar{\Delta}_{1}\left(\cdot\right) can be selected in a measurable way using the uniformization technique of Jankov-von Neumann.) Next, if Δ¯=a𝔥(X)γ\left\|\bar{\Delta}^{\ast}\right\|=a\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{\gamma}, then

𝔼P[𝔥(X)Δ¯]=a𝔼P[𝔥(X)γ+1]{\mathbb{E}}_{P}[\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}\left\|\bar{\Delta}^{\ast}\right\|]=a{\mathbb{E}}_{P}[\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{\gamma+1}]

and

𝔼P(Δ¯p)=ap𝔼P𝔥(X)γp=1.{\mathbb{E}}_{P}\left(\left\|\bar{\Delta}^{\ast}\right\|^{p}\right)=a^{p}{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{\gamma p}=1.

Letting γp=γ+1\gamma p=\gamma+1 we have that γ=1/(p1)\gamma=1/(p-1) and therefore

sup𝔼PΔ¯p1𝔼P[𝔥(X)Δ¯]=𝔼P(p1)/p[𝔥(X)p/(p1)],\sup_{{\mathbb{E}}_{P}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P}[\nabla{\mathfrak{h}}(X)\cdot\bar{\Delta}^{\ast}]={\mathbb{E}}_{P}^{\left(p-1\right)/p}[\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{p/\left(p-1\right)}],

with

Δ¯(X)=Δ¯1(X)𝔥(X)1/(p1)𝔼P1/p𝔥(X)p/(p1).\bar{\Delta}^{\ast}\left(X\right)=\bar{\Delta}_{1}\left(X\right)\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{1/\left(p-1\right)}{\mathbb{E}}_{P}^{-1/p}\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{p/\left(p-1\right)}.

The denominator is well defined since 𝔼P𝔥(Y)>0{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}\left(Y\right)\right\|>0 and the random variable Δ¯(X)\bar{\Delta}^{\ast}\left(X\right) is essentially bounded uniformly over the family P𝔓([ν,ν]d)P\in{\mathfrak{P}}(\mathcal{[}-\nu,\nu\mathcal{]}^{d}) and 𝔼P𝔥(Y)b0{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}\left(Y\right)\right\|\geq b_{0}. Since the gradient 𝔥()\nabla{\mathfrak{h}}(\cdot) is continuous, then it is uniformly continuous over compact sets and, consequently, uniformly over Δ¯\bar{\Delta} in compact sets,

01𝔥(x+tδ1/pΔ¯)𝔥(x)Δ¯𝑑t=o(1)\int_{0}^{1}\left\|\nabla{\mathfrak{h}}(x+t\delta^{1/p}\bar{\Delta})-\nabla{\mathfrak{h}}(x)\right\|\bar{\Delta}dt=o\left(1\right)

as δ0\delta\rightarrow 0. This yields that

sup𝔼PΔ¯p1𝔼P[01𝔥(X+tδ1/pΔ¯)Δ¯𝑑t]𝔼P(p1)/p[𝔥(X)p/(p1)]+o(1)\sup_{{\mathbb{E}}_{P}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P}\left[\int_{0}^{1}\nabla{\mathfrak{h}}(X+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}dt\right]\geq{\mathbb{E}}_{P}^{\left(p-1\right)/p}\left[\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}^{p/\left(p-1\right)}\right]+o\left(1\right)

uniformly over P𝔓([ν,ν]d)P\in{\mathfrak{P}}(\mathcal{[}-\nu,\nu\mathcal{]}^{d}) and 𝔼P𝔥(Y)b0{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}\left(Y\right)\right\|\geq b_{0}. For the upper bound, we can apply the duality representation, just as we did in the phi-divergence case. Using duality, we have that

sup𝔼PΔ¯p1𝔼P[01𝔥(X+tδ1/pΔ¯)Δ¯𝑑t]\displaystyle\sup_{{\mathbb{E}}_{P}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P}\left[\int_{0}^{1}\nabla{\mathfrak{h}}(X+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}dt\right]
=minλ¯>0{λ¯+𝔼P[supΔ¯01𝔥(X+tδ1/pΔ¯)Δ¯𝑑tλ¯Δ¯p]}.\displaystyle=\min_{\bar{\lambda}>0}\left\{\bar{\lambda}+{\mathbb{E}}_{P}\left[\sup_{\bar{\Delta}}\int_{0}^{1}\nabla{\mathfrak{h}}(X+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}dt-\bar{\lambda}\left\|\bar{\Delta}\right\|^{p}\right]\right\}.

Once again, we select a specific choice λ¯0\bar{\lambda}_{0} given by

0<λ¯0=argmin{λ¯+𝔼P[supΔ¯{𝔥(X)Δ¯λ¯Δ¯p}]:λ¯0}.0<\bar{\lambda}_{0}=\arg\min\left\{\bar{\lambda}+{\mathbb{E}}_{P}[\sup_{\bar{\Delta}}\{\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}\cdot\left\|\bar{\Delta}\right\|-\bar{\lambda}\left\|\bar{\Delta}\right\|^{p}\}]:\bar{\lambda}\geq 0\right\}.

The fact that λ¯0>0\bar{\lambda}_{0}>0 follows because 𝔼P𝔥(X)>0{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}(X)\right\|_{\ast}>0. We then obtain

sup𝔼PΔ¯p1𝔼P[01𝔥(X+tδ1/pΔ¯)Δ¯]\displaystyle\sup_{{\mathbb{E}}_{P}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P}\left[\int_{0}^{1}\nabla{\mathfrak{h}}(X+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}\right]
λ¯0+𝔼P[supΔ¯{01𝔥(X+tδ1/pΔ¯)Δ¯𝑑tλ¯0Δ¯p}].\displaystyle\leq\bar{\lambda}_{0}+{\mathbb{E}}_{P}\left[\sup_{\bar{\Delta}}\{\int_{0}^{1}\nabla{\mathfrak{h}}(X+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}dt-\bar{\lambda}_{0}\left\|\bar{\Delta}\right\|^{p}\}\right].

Next, we argue that the family of functions

sδ(x):=supΔ¯[01𝔥(x+tδ1/pΔ¯)Δ¯𝑑tλ¯0Δ¯p]s_{\delta}\left(x\right):=\sup_{\bar{\Delta}}\left[\int_{0}^{1}\nabla{\mathfrak{h}}(x+t\delta^{1/p}\bar{\Delta})\cdot\bar{\Delta}dt-\bar{\lambda}_{0}\left\|\bar{\Delta}\right\|^{p}\right]

converges uniformly on compact sets to the function s0(x)s_{0}\left(x\right). Let us consider the sup over Δ¯>ε/δ1/p\left\|\bar{\Delta}\right\|>\varepsilon/\delta^{1/p} and note that, because (4.18) is bounded on compact sets, there exists a constant c0c_{0} independent of x[ν,ν]dx\in[-\nu,\nu]^{d} such that

01(𝔥(x+tδ1/pΔ¯)𝔥(x))Δ¯𝑑tλ¯0Δ¯p\displaystyle\int_{0}^{1}\left(\nabla{\mathfrak{h}}(x+t\delta^{1/p}\bar{\Delta})-\nabla{\mathfrak{h}}(x)\right)\cdot\bar{\Delta}dt-\bar{\lambda}_{0}\left\|\bar{\Delta}\right\|^{p}
c0(1+δ(p1)/pΔ¯p1)Δ¯λ¯0Δ¯p.\displaystyle\leq c_{0}(1+\delta^{\left(p-1\right)/p}\left\|\bar{\Delta}\right\|^{p-1})\left\|\bar{\Delta}\right\|-\bar{\lambda}_{0}\left\|\bar{\Delta}\right\|^{p}.

By selecting δ\delta small enough (depending only on c0>0c_{0}>0 and λ¯0>0\bar{\lambda}_{0}>0) we see that the right hand side can be made arbitrarily negative uniformly over x[ν,ν]dx\in[-\nu,\nu]^{d} as δ0\delta\rightarrow 0. So, it suffices to consider only Δ¯ε/δ1/p\left\|\bar{\Delta}\right\|\leq\varepsilon/\delta^{1/p}. In this case, since 𝔥()\nabla{\mathfrak{h}}(\cdot) is continuous, it is uniformly continuous on compact sets. So, we can write, in terms of the (uniform) modulus of continuity 𝔪(){\mathfrak{m}}\left(\cdot\right),

𝔥(x+tδ1/pΔ¯)𝔥(x)𝔪(ε),\left\|\nabla{\mathfrak{h}}(x+t\delta^{1/p}\bar{\Delta})-\nabla{\mathfrak{h}}(x)\right\|\leq{\mathfrak{m}}\left(\varepsilon\right),

where 𝔪(ε)0{\mathfrak{m}}\left(\varepsilon\right)\rightarrow 0. In conclusion, we have that

supΔε/δ1/p[𝔥(x)Δ¯(1𝔪(ε))λ¯0Δ¯p]\displaystyle\sup_{\left\|\Delta\right\|\leq\varepsilon/\delta^{1/p}}[\nabla{\mathfrak{h}}(x)\cdot\bar{\Delta}\left(1-{\mathfrak{m}}\left(\varepsilon\right)\right)-\bar{\lambda}_{0}\left\|\bar{\Delta}\right\|^{p}]
sδ(x)supΔε/δ1/p[𝔥(x)Δ¯(1+𝔪(ε))λ¯0Δ¯p].\displaystyle\leq s_{\delta}\left(x\right)\leq\sup_{\left\|\Delta\right\|\leq\varepsilon/\delta^{1/p}}[\nabla{\mathfrak{h}}(x)\cdot\bar{\Delta}\left(1+{\mathfrak{m}}\left(\varepsilon\right)\right)-\bar{\lambda}_{0}\left\|\bar{\Delta}\right\|^{p}].

Further, the range Δε/δ1/p\left\|\Delta\right\|\leq\varepsilon/\delta^{1/p} in the upper and lower envelopes above can be further constrained to be compact (independent of ε\varepsilon and δ\delta, but depending on λ¯0>0\bar{\lambda}_{0}>0). From the above expressions, we deduce the required uniform convergence of sδ()s0()s_{\delta}\left(\cdot\right)\rightarrow s_{0}\left(\cdot\right) on compacts. The asymptotic upper bound then follows from these estimates.\hfill\square

Similar results have appeared in the literature (cf., [1]). An important difference which is useful in our analysis is that the above result is uniform over a class P𝔓([ν,ν]d)P\in{\mathfrak{P}}(\mathcal{[}-\nu,\nu\mathcal{]}^{d}) such that 𝔼P𝔥(Y)b0{\mathbb{E}}_{P}\left\|\nabla{\mathfrak{h}}\left(Y\right)\right\|\geq b_{0}.

In order to write the expansion of n(θ,δn)\mathcal{F}_{n}(\theta,\delta_{n}) we clarify that here we use xl(x,θ)\nabla_{x}l\left(x,\theta\right) to denote the gradient with respect to xx. Under suitable boundedness and smoothness assumptions, the previous result yields

n(θ,δn)=fn(θ)+δn1/p𝔼Pn(p1)/p[xl(X,θ)p/(p1)]+δn1/pϵn(θ).\mathcal{F}_{n}(\theta,\delta_{n})=f_{n}\left(\theta\right)+\delta_{n}^{1/p}{\mathbb{E}}_{P_{n}}^{\left(p-1\right)/p}[\left\|\nabla_{x}l\left(X,\theta\right)\right\|_{\ast}^{p/\left(p-1\right)}]+\delta_{n}^{1/p}\epsilon_{n}\left(\theta\right). (4.19)
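A minimal sketch of the regularization term in (4.19) evaluated at the ERM solution for an illustrative loss: squared-error linear regression with data point x = (a, b), the Euclidean norm (which is self-dual), and p = 2, in which case the term reduces to 2δ_n^{1/2} times the root mean squared residual times (1 + ‖θ‖²)^{1/2}. The model, the data, and the choice δ_n = 1/n are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 3
A = rng.normal(size=(n, d))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=n)
theta = np.linalg.lstsq(A, b, rcond=None)[0]     # ERM (least-squares) solution

p = 2                                            # 2-Wasserstein with the (self-dual) Euclidean norm
delta_n = 1.0 / n

# For l((a, b), theta) = (b - a^T theta)^2 the x-gradient is
#   grad_a l = -2 (b - a^T theta) theta,   grad_b l = 2 (b - a^T theta),
# so ||grad_x l|| = 2 |b - a^T theta| * sqrt(||theta||^2 + 1).
resid = b - A @ theta
grad_norm = 2.0 * np.abs(resid) * np.sqrt(theta @ theta + 1.0)

# Regularization term: delta_n^{1/p} * ( E_{P_n}[ ||grad_x l||_*^{p/(p-1)} ] )^{(p-1)/p}
reg = delta_n ** (1 / p) * np.mean(grad_norm ** (p / (p - 1))) ** ((p - 1) / p)
print(np.mean(resid ** 2), reg)   # empirical risk f_n(theta) and its first-order DRO correction
```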

We collect the precise statement of our result next. The proof is similar to that of Theorem 4.1 and thus omitted.

Theorem 4.2

Suppose l(,θ)l\left(\cdot,\theta\right) is continuously differentiable, that

(x,θ)sup{l(x+Δ,θ)l(x,θ)/(1+Δp1):Δ0}\left(x,\theta\right)\mapsto\sup\{\left\|\nabla l\left(x+\Delta,\theta\right)-\nabla l\left(x,\theta\right)\right\|/(1+\left\|\Delta\right\|^{p-1}):\left\|\Delta\right\|\geq 0\} (4.20)

is locally bounded, that PP_{\ast} has compact support, l(x,)l\left(x,\cdot\right) is Lipschitz continuous and

infθΘ𝔼P[xl(X,θ)]>0.\inf_{\theta\in\Theta^{\ast}}{\mathbb{E}}_{P_{\ast}}[\left\|\nabla_{x}l\left(X,\theta\right)\right\|]>0. (4.21)

Then, we have the following types of asymptotic behavior of the optimal values.
(a-W) If δn1/p=o(n1/2)\delta_{n}^{1/p}=o\left(n^{-1/2}\right), then

n1/2(ϑ¯nϑ)minθΘ𝔤(θ).n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\min_{\theta\in\Theta^{\ast}}{\mathfrak{g}}(\theta). (4.22)

(b-W) If δn1/p=βn1/2\delta_{n}^{1/p}=\beta n^{-1/2}, then

n1/2(ϑ¯nϑ)minθΘ{𝔤(θ)+β𝔼P(p1)/p[xl(X,θ)p/(p1)]}.n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\min_{\theta\in\Theta^{\ast}}\left\{{\mathfrak{g}}(\theta)+\beta\,{\mathbb{E}}_{P_{\ast}}^{\left(p-1\right)/p}\big{[}\left\|\nabla_{x}l\left(X,\theta\right)\right\|_{\ast}^{p/\left(p-1\right)}\big{]}\right\}.

(c-W) If o(δn1/p)=n1/2o\left(\delta_{n}^{1/p}\right)=n^{-1/2}, then

δn1/p(ϑ¯nϑ)minθΘ𝔼P(p1)/p[xl(X,θ)p/(p1)].\delta_{n}^{-1/p}\left(\bar{\vartheta}_{n}-\vartheta\right){\rightsquigarrow}\min_{\theta\in\Theta^{\ast}}{\mathbb{E}}_{P_{\ast}}^{\left(p-1\right)/p}[\left\|\nabla_{x}l\left(X,\theta\right)\right\|_{\ast}^{p/\left(p-1\right)}].
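Informally (this reading is only a heuristic we add as a guide), the three regimes can be seen from the expansion (4.19): up to the error terms,

n^{1/2}\left(\bar{\vartheta}_{n}-\vartheta\right)\approx\min_{\theta\in\Theta^{\ast}}\left\{n^{1/2}\left(f_{n}(\theta)-f(\theta)\right)+n^{1/2}\delta_{n}^{1/p}{\mathbb{E}}_{P_{\ast}}^{\left(p-1\right)/p}[\left\|\nabla_{x}l\left(X,\theta\right)\right\|_{\ast}^{p/\left(p-1\right)}]\right\},

where the first term inside the minimum converges weakly to {\mathfrak{g}}(\theta); the second term is negligible in case (a-W), equals \beta times the penalty in case (b-W), and dominates the sampling error in case (c-W), in which case one rescales by \delta_{n}^{-1/p} instead of n^{1/2}.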

A situation completely analogous to Example 4.1 can also be constructed in the Wasserstein DRO setting to show that both the differentiability of l(,θ)l\left(\cdot,\theta\right) and the condition p>1p>1 are important in deriving the asymptotics in the critical case. The case p=2p=2 was covered in [5] under suitable quadratic growth conditions and the existence of second moments.

5 General Principle in Action: Optimal Solutions

We complete our discussion in this section by considering optimal solutions of the phi-divergence and Wasserstein DRO problems. A key observation is that in both the phi-divergence case and the Wasserstein DRO case the uncertainty set is compact in the weak topology; therefore, if Assumption 2.3 holds, the function n(,δn)\mathcal{F}_{n}(\cdot,\delta_{n}) is differentiable and its gradient has the expansion (3.8). In fact, the derivative can be shown to exist if we are able to argue that, for δ\delta sufficiently small, the worst case measure is unique. This is precisely the strategy that we pursue in this section. Throughout the section we impose the condition that Θ={θ}\Theta^{\ast}=\left\{\theta^{\ast}\right\}. Recall that σ2(θ):=VarP(l(X,θ))\sigma^{2}(\theta):=\mathrm{Var}_{P_{\ast}}(l(X,\theta)).

5.1 The Phi-Divergence Case

Theorem 5.1

Suppose that Assumptions 2.2, 2.3 and 4.1 hold, that l(x,)l\left(x,\cdot\right) is essentially bounded under PP_{\ast} and σ2(θ)>0\sigma^{2}(\theta^{\ast})>0, and that θ(v)\theta_{\ast}(v) is directionally differentiable at v=0v=0 (in the Hadamard sense). Let ZN(0,Σ)Z\sim N(0,\Sigma), where Σ\Sigma is the covariance matrix of l(X,θ)\nabla l(X,\theta^{\ast}). Then we have the following.
(A-phi) If δn=o(n1)\delta_{n}=o\left(n^{-1}\right), then

n1/2(θ¯nθ)θ(0,Z).n^{1/2}\left(\bar{\theta}_{n}-\theta_{\ast}\right){\rightsquigarrow}\theta_{\ast}^{\prime}\left(0,Z\right).

(B-phi) If δn=βn1\delta_{n}=\beta n^{-1}, then

n1/2(θ¯nθ)θ(0,Z+κ1/2β1/2σ(θ)).n^{1/2}\left(\bar{\theta}_{n}-\theta_{\ast}\right){\rightsquigarrow}\,\theta_{\ast}^{\prime}\left(0,Z+\kappa^{1/2}\beta^{1/2}\nabla\sigma(\theta^{\ast})\right).

(C-phi) If o(δn)=n1o\left(\delta_{n}\right)=n^{-1}, then

δn1/2(θ¯nθ)θ(0,κ1/2σ(θ)).\delta_{n}^{-1/2}\left(\bar{\theta}_{n}-\theta_{\ast}\right){\rightsquigarrow}\,\theta_{\ast}^{\prime}\left(0,\kappa^{1/2}\nabla\sigma(\theta^{\ast})\right).

Proof. Applying the centering and scaling used to obtain (4.5), we can write

n(θ,δn)=fn(θ)+δn1/2𝒟n(θ,δn),\mathcal{F}_{n}(\theta,\delta_{n})=f_{n}\left(\theta\right)+\delta_{n}^{1/2}\mathcal{D}_{n}\left(\theta,\delta_{n}\right),

where

𝒟n(θ,δn)=sup𝔼Pn(Δ)=0,Δδn1/2,δn1𝔼Pn(ϕ(1+δn1/2Δ))1𝔼Pn[l¯n(X,θ)Δ],\mathcal{D}_{n}\left(\theta,\delta_{n}\right)=\sup_{{\mathbb{E}}_{P_{n}}\left(\Delta\right)=0,\,\Delta\geq-\delta_{n}^{-1/2},\,\delta_{n}^{-1}{\mathbb{E}}_{P_{n}}(\phi(1+\delta_{n}^{1/2}\Delta))\leq 1}{\mathbb{E}}_{P_{n}}[\bar{l}_{n}\left(X,\theta\right)\Delta], (5.1)

and

l¯n(X,θ)=l(X,θ)fn(θ).\bar{l}_{n}\left(X,\theta\right)=l\left(X,\theta\right)-f_{n}\left(\theta\right).

It suffices to show that

𝒟n(θ,δn)ϱ(θ)\nabla\mathcal{D}_{n}\left(\theta,\delta_{n}\right)\rightarrow\nabla\varrho\left(\theta\right)

uniformly over some region θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0} for some δ0>0\delta_{0}>0. Note that the optimization region in (5.1) is compact in the weak topology and therefore, by Danskin’s Theorem (see [17, Sections 5.1.3 and 7.1.5]), we have that 𝒟n(,δn)\mathcal{D}_{n}\left(\cdot,\delta_{n}\right) is directionally differentiable and, by the uniqueness of the optimal Δ¯n\bar{\Delta}_{n} for δn\delta_{n} sufficiently small, that

𝒟n(θ,δn)=𝔼Pn[l¯n(X,θ)Δ¯n(θ)].\nabla\mathcal{D}_{n}\left(\theta,\delta_{n}\right)={\mathbb{E}}_{P_{n}}[\nabla\bar{l}_{n}\left(X,\theta\right)\bar{\Delta}_{n}\left(\theta\right)].

We can precisely characterize Δ¯n(θ)\bar{\Delta}_{n}\left(\theta\right) from Proposition 4.1 over a region θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0} on which we can guarantee that VarPn[l(X,θ)]>0Var_{P_{n}}[l\left(X,\theta\right)]>0. Such a δ0>0\delta_{0}>0 can be found for n>Nn>N (for some random, but almost surely finite, NN) by the Strong Law of Large Numbers and continuity, since VarP[l(X,θ)]>0Var_{P_{\ast}}[l\left(X,\theta_{\ast}\right)]>0. We have, uniformly over θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0}, for n>Nn>N,

Δ¯n(θ)=κl¯n(X,θ)ϕ′′(1)VarPn[l(X,θ)]+ϵn(θ).\bar{\Delta}_{n}\left(\theta\right)=\sqrt{\kappa}\frac{\bar{l}_{n}\left(X,\theta\right)}{\sqrt{\phi^{\prime\prime}\left(1\right)Var_{P_{n}}[l\left(X,\theta\right)]}}+\epsilon_{n}\left(\theta\right).
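Observe that the leading term above is mean zero under P_{n}: recalling that f_{n}(\theta)={\mathbb{E}}_{P_{n}}[l(X,\theta)], we have {\mathbb{E}}_{P_{n}}[\bar{l}_{n}(X,\theta)]=0, so the constraint {\mathbb{E}}_{P_{n}}(\Delta)=0 in (5.1) is met up to the error term \epsilon_{n}(\theta), while the normalization by \sqrt{\phi^{\prime\prime}(1)Var_{P_{n}}[l(X,\theta)]} keeps the divergence constraint asymptotically tight (up to the constant \kappa inherited from Proposition 4.1).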

On the other hand, defining

l¯(X,θ)=l(X,θ)f(θ),\bar{l}\left(X,\theta\right)=l\left(X,\theta\right)-f\left(\theta\right),

we have that

ϱ(θ)=𝔼P[l¯(X,θ)Δ¯(θ)],\nabla\varrho\left(\theta\right)={\mathbb{E}}_{P_{\ast}}[\nabla\bar{l}(X,\theta)\cdot\bar{\Delta}\left(\theta\right)],

where

Δ¯(θ)=κl¯(X,θ)ϕ′′(1)VarP[l(X,θ)].\bar{\Delta}\left(\theta\right)=\sqrt{\kappa}\frac{\bar{l}\left(X,\theta\right)}{\sqrt{\phi^{\prime\prime}\left(1\right)Var_{P_{\ast}}[l\left(X,\theta\right)]}}.
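The following elementary computation, which we record here for convenience, explains why the drift terms in parts (B-phi) and (C-phi) involve \nabla\sigma(\theta^{\ast}): since {\mathbb{E}}_{P_{\ast}}[\bar{l}(X,\theta)]=0 and {\mathbb{E}}_{P_{\ast}}[\bar{l}(X,\theta)^{2}]=\sigma^{2}(\theta),

{\mathbb{E}}_{P_{\ast}}[\bar{l}(X,\theta)\bar{\Delta}(\theta)]=\sqrt{\kappa/\phi^{\prime\prime}(1)}\,\sigma(\theta),

which is consistent with the limiting penalty \varrho(\theta) being proportional to \sigma(\theta), and hence with its gradient being a multiple of \nabla\sigma(\theta).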

We obtain

𝒟n(θ,δn)ϱ(θ)\displaystyle\nabla\mathcal{D}_{n}\left(\theta,\delta_{n}\right)-\nabla\varrho\left(\theta\right)
=𝔼Pn[l¯n(X,θ)Δ¯n(θ)]𝔼P[l¯(X,θ)Δ¯(θ)]\displaystyle={\mathbb{E}}_{P_{n}}[\nabla\bar{l}_{n}\left(X,\theta\right)\bar{\Delta}_{n}\left(\theta\right)]-{\mathbb{E}}_{P_{\ast}}[\nabla\bar{l}(X,\theta)\cdot\bar{\Delta}\left(\theta\right)]
=𝔼Pn[(l¯n(X,θ)l¯(X,θ))Δ¯n(θ)]\displaystyle={\mathbb{E}}_{P_{n}}[\left(\nabla\bar{l}_{n}\left(X,\theta\right)-\nabla\bar{l}(X,\theta)\right)\bar{\Delta}_{n}\left(\theta\right)]
+𝔼Pn[l¯(X,θ)(Δ¯n(θ)Δ¯(θ))]\displaystyle+{\mathbb{E}}_{P_{n}}[\nabla\bar{l}(X,\theta)(\bar{\Delta}_{n}\left(\theta\right)-\bar{\Delta}\left(\theta\right))]
+𝔼Pn[l¯(X,θ)Δ¯(θ)]𝔼P[l¯(X,θ)Δ¯(θ)].\displaystyle+{\mathbb{E}}_{P_{n}}[\nabla\bar{l}\left(X,\theta\right)\bar{\Delta}\left(\theta\right)]-{\mathbb{E}}_{P_{\ast}}[\nabla\bar{l}(X,\theta)\cdot\bar{\Delta}\left(\theta\right)].

It follows that

Δ¯n(θ)Δ¯(θ)\bar{\Delta}_{n}\left(\theta\right)\rightarrow\bar{\Delta}\left(\theta\right)

uniformly over θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0}, and

(l¯n(X,θ)l¯(X,θ))0\left(\nabla\bar{l}_{n}\left(X,\theta\right)-\nabla\bar{l}(X,\theta)\right)\rightarrow 0

uniformly in probability (in fact almost surely) as nn\rightarrow\infty. Uniform convergence in probability over θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0} follows from these observations.\hfill\square

5.2 The Wasserstein Distance Case

As in Proposition 4.2, in order to simplify the exposition, we assume that PP_{\ast} has compact support. We also let \left\|\cdot\right\| be the p¯\ell_{\bar{p}} norm for p¯(1,)\bar{p}\in\left(1,\infty\right). This choice, in particular, guarantees that for any xx such that x=1\left\|x\right\|=1, the set

argmax{zTx:z=1} is a singleton.\arg\max\{z^{T}x:\left\|z\right\|=1\}\text{ is a singleton.} (5.2)
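For instance (an illustration we add here), for \bar{p}=2 and \left\|x\right\|=1 the maximizer in (5.2) is z=x, which is unique; in contrast, for the excluded case \bar{p}=1, taking x=(1/2,1/2) gives \arg\max\{z^{T}x:\left\|z\right\|_{1}=1\}=\{(t,1-t):t\in[0,1]\}, which is not a singleton. The point is that the unit ball of the \ell_{\bar{p}} norm is strictly convex precisely when \bar{p}\in(1,\infty).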

This will help us argue, in the presence of Lipschitz gradients, that the worst case adversarial distribution is unique when the distributional uncertainty size δ\delta is sufficiently small, and this, in turn, will guarantee differentiability. In this section, we use θ\nabla_{\theta} and x\nabla_{x} to denote the derivatives with respect to θ\theta and xx, respectively. The derivative with respect to all of the arguments is simply denoted by \nabla.

Theorem 5.2

Suppose that Assumptions 2.2 and 2.3 hold. Further, assume that conditions (4.20) - (4.21) hold and that θ(v)\theta_{\ast}(v) is directionally differentiable at v=0v=0 (in the Hadamard sense). Let ZN(0,Σ)Z\sim N(0,\Sigma), where Σ\Sigma is the covariance matrix of l(X,θ)\nabla l(X,\theta^{\ast}). Then, we have the following.
(A-W) If δn1/p=o(n1/2)\delta_{n}^{1/p}=o\left(n^{-1/2}\right), then

n1/2(θ¯nθ)θ(0,Z).n^{1/2}\left(\bar{\theta}_{n}-\theta_{\ast}\right){\rightsquigarrow}\theta_{\ast}^{\prime}\left(0,Z\right). (5.3)

(B-W) If δn1/p=βn1/2\delta_{n}^{1/p}=\beta n^{-1/2}, then

n1/2(θ¯nθ)θ(0,Z+βθ𝔼P(p1)/p[xl(X,θ)p/(p1)]).n^{1/2}\left(\bar{\theta}_{n}-\theta_{\ast}\right){\rightsquigarrow}\theta_{\ast}^{\prime}\left(0,Z+\beta\,\nabla_{\theta}{\mathbb{E}}_{P_{\ast}}^{\left(p-1\right)/p}[\left\|\nabla_{x}l\left(X,\theta_{\ast}\right)\right\|_{\ast}^{p/\left(p-1\right)}]\right). (5.4)

(C-W) If o(δn1/p)=n1/2o\left(\delta_{n}^{1/p}\right)=n^{-1/2}, then

δn1/p(θ¯nθ)θ(0,θ𝔼P(p1)/p[xl(X,θ)p/(p1)]).\delta_{n}^{-1/p}\left(\bar{\theta}_{n}-\theta_{\ast}\right){\rightsquigarrow}\theta_{\ast}^{\prime}\left(0,\nabla_{\theta}{\mathbb{E}}_{P_{\ast}}^{\left(p-1\right)/p}[\left\|\nabla_{x}l\left(X,\theta_{\ast}\right)\right\|_{\ast}^{p/\left(p-1\right)}]\right). (5.5)

Proof. Most of the work has already been done in the proof of Proposition 4.2. We have that

n(θ,δn)=fn(θ)+δn1/p𝒟n(θ,δn),\mathcal{F}_{n}(\theta,\delta_{n})=f_{n}\left(\theta\right)+\delta_{n}^{1/p}\mathcal{D}_{n}\left(\theta,\delta_{n}\right),

where

𝒟n(θ,δn):=sup𝔼PnΔ¯p1𝔼Pn[01xl(X+tδn1/pΔ¯,θ)Δ¯𝑑t].\mathcal{D}_{n}\left(\theta,\delta_{n}\right):=\sup_{{\mathbb{E}}_{P_{n}}\left\|\bar{\Delta}\right\|^{p}\leq 1}{\mathbb{E}}_{P_{n}}\left[\int_{0}^{1}\nabla_{x}l(X+t\delta_{n}^{1/p}\bar{\Delta},\theta)\cdot\bar{\Delta}dt\right]. (5.6)
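To recall where (5.6) comes from (as in the proof of Proposition 4.2), one restricts attention to perturbations that move each sample point X to X+\delta_{n}^{1/p}\bar{\Delta} with {\mathbb{E}}_{P_{n}}\left\|\bar{\Delta}\right\|^{p}\leq 1; by the fundamental theorem of calculus,

l(X+\delta_{n}^{1/p}\bar{\Delta},\theta)-l(X,\theta)=\delta_{n}^{1/p}\int_{0}^{1}\nabla_{x}l(X+t\delta_{n}^{1/p}\bar{\Delta},\theta)\cdot\bar{\Delta}\,dt,

and subtracting f_{n}(\theta) and dividing by \delta_{n}^{1/p} yields (5.6).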

It suffices to show uniform convergence of 𝒟n(θ,δn)\nabla\mathcal{D}_{n}\left(\theta,\delta_{n}\right) to ϱ(θ)\nabla\varrho\left(\theta\right) in some neighborhood θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0} for some δ0>0\delta_{0}>0.

From the proof of Proposition 4.2 we can collect several facts. Note that we are assuming that l()\nabla l(\cdot) is LL-Lipschitz, which guarantees (4.20).

I) First,

supθΘ|𝒟n(θ,δn)ϱ(θ)|0\sup_{\theta\in\Theta}\left|\mathcal{D}_{n}\left(\theta,\delta_{n}\right)-\varrho\left(\theta\right)\right|\rightarrow 0

in probability.

II) Moreover, we also saw that there exists a random NN (finite with probability one) such that

𝒟n(θ,δn)\displaystyle\mathcal{D}_{n}\left(\theta,\delta_{n}\right)
=λ¯n(θ)+𝔼Pn[maxΔ¯01xl(X+tδn1/pΔ¯,θ)Δ¯𝑑tλ¯n(θ)Δ¯p],\displaystyle=\bar{\lambda}_{n}\left(\theta\right)+{\mathbb{E}}_{P_{n}}\left[\max_{\bar{\Delta}}\int_{0}^{1}\nabla_{x}l(X+t\delta_{n}^{1/p}\bar{\Delta},\theta)\cdot\bar{\Delta}dt-\bar{\lambda}_{n}\left(\theta\right)\left\|\bar{\Delta}\right\|^{p}\right],

with λ¯n(θ)>0\bar{\lambda}_{n}\left(\theta\right)>0 for all n>Nn>N uniformly over θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0} for δ0>0\delta_{0}>0 small enough so that 𝔼P[l(X,θ)]>δ0{\mathbb{E}}_{P_{\ast}}\left[\left\|\nabla l(X,\theta)\right\|\right]>\delta_{0}. Note that such δ0\delta_{0} exists by continuity since we assume that 𝔼P[l(X,θ)]>0{\mathbb{E}}_{P_{\ast}}\left[\left\|\nabla l(X,\theta_{\ast})\right\|\right]>0.

III) Finally, also on the set θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0} from II), since l()\nabla l(\cdot) is Lipschitz, for all δn\delta_{n} sufficiently small the maximizer Δ¯n(X,θ)\bar{\Delta}_{n}\left(X,\theta\right) inside the expectation is unique because of (5.2), and it converges, uniformly on compacts in the variable XX and over θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0}, to

Δ¯(X,θ)=Δ¯1(X,θ)l(X,θ)1/(p1)𝔼P1/pl(X,θ)p/(p1),\bar{\Delta}\left(X,\theta\right)=\bar{\Delta}_{1}\left(X,\theta\right)\left\|\nabla l(X,\theta)\right\|_{\ast}^{1/\left(p-1\right)}{\mathbb{E}}_{P_{\ast}}^{-1/p}\left\|\nabla l(X,\theta)\right\|_{\ast}^{p/\left(p-1\right)}, (5.7)

where

Δ¯1(X,θ)=argmax{l(X,θ)Δ¯:Δ¯=1}.\bar{\Delta}_{1}\left(X,\theta\right)=\arg\max\{\nabla l(X,\theta)\cdot\bar{\Delta}:\left\|\bar{\Delta}\right\|=1\}.
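For completeness, we record the direct verification (elementary, and added here for the reader's convenience) that \bar{\Delta}(X,\theta) in (5.7) is feasible and optimal for the limiting problem: since \left\|\bar{\Delta}_{1}(X,\theta)\right\|=1 and \nabla l(X,\theta)\cdot\bar{\Delta}_{1}(X,\theta)=\left\|\nabla l(X,\theta)\right\|_{\ast},

{\mathbb{E}}_{P_{\ast}}\left\|\bar{\Delta}(X,\theta)\right\|^{p}=1\quad\text{and}\quad{\mathbb{E}}_{P_{\ast}}[\nabla l(X,\theta)\cdot\bar{\Delta}(X,\theta)]={\mathbb{E}}_{P_{\ast}}^{\left(p-1\right)/p}[\left\|\nabla l(X,\theta)\right\|_{\ast}^{p/\left(p-1\right)}],

the latter being the optimal value of the problem defining \varrho(\theta) below, by Hölder's inequality.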

Next, by Danskin’s Theorem (see [17, Sections 5.1.3 and 7.1.5]), because the uncertainty set is compact in the weak topology, we have that 𝒟n(,δn)\mathcal{D}_{n}\left(\cdot,\delta_{n}\right) is differentiable by uniqueness of Δ¯n(X,θ)\bar{\Delta}_{n}\left(X,\theta\right). The most convenient representation to see this is (5.6). It is also direct that ϱ()\varrho\left(\cdot\right) is differentiable everywhere. Moreover, since

ϱ(θ)=sup𝔼PΔp1𝔼P[xl(X,θ)Δ],\varrho\left(\theta\right)=\sup_{{\mathbb{E}}_{P_{\ast}}\left\|\Delta\right\|^{p}\leq 1}{\mathbb{E}}_{P_{\ast}}[\nabla_{x}l(X,\theta)\cdot\Delta],

Danskin’s Theorem also applies and we have that

θϱ(θ)=𝔼P[θ,xl(X,θ)Δ¯],\nabla_{\theta}\varrho\left(\theta\right)={\mathbb{E}}_{P_{\ast}}[\nabla_{\theta,x}l(X,\theta)\cdot\bar{\Delta}],

where Δ¯\bar{\Delta} is given in (5.7). So, we have (using Δ¯n\bar{\Delta}_{n} instead of Δ¯n(X,θ)\bar{\Delta}_{n}\left(X,\theta\right)),

θ𝒟n(θ,δn)θϱ(θ)\displaystyle\nabla_{\theta}\mathcal{D}_{n}\left(\theta,\delta_{n}\right)-\nabla_{\theta}\varrho\left(\theta\right)
=𝔼Pn[01θ,xl(X+tδn1/pΔ¯n,θ)Δ¯n𝑑t]𝔼P[θ,xl(X,θ)Δ¯]\displaystyle={\mathbb{E}}_{P_{n}}\left[\int_{0}^{1}\nabla_{\theta,x}l(X+t\delta_{n}^{1/p}\bar{\Delta}_{n},\theta)\cdot\bar{\Delta}_{n}dt\right]-{\mathbb{E}}_{P_{\ast}}[\nabla_{\theta,x}l(X,\theta)\cdot\bar{\Delta}]
=𝔼Pn[01(θ,xl(X+tδn1/pΔ¯n,θ)θ,xl(X,θ))Δ¯n𝑑t]\displaystyle={\mathbb{E}}_{P_{n}}\left[\int_{0}^{1}\left(\nabla_{\theta,x}l(X+t\delta_{n}^{1/p}\bar{\Delta}_{n},\theta)-\nabla_{\theta,x}l(X,\theta)\right)\cdot\bar{\Delta}_{n}dt\right]
+𝔼Pn[θ,xl(X,θ)Δ¯n]𝔼P[θ,xl(X,θ)Δ¯].\displaystyle+{\mathbb{E}}_{P_{n}}[\nabla_{\theta,x}l(X,\theta)\cdot\bar{\Delta}_{n}]-{\mathbb{E}}_{P_{\ast}}[\nabla_{\theta,x}l(X,\theta)\cdot\bar{\Delta}].

Since Δ¯n(X,)\bar{\Delta}_{n}\left(X,\cdot\right) converges (in probability) uniformly over compact sets in XX and over θθδ0\left\|\theta-\theta_{\ast}\right\|\leq\delta_{0}, and is bounded almost surely, the required uniform convergence in probability follows from the fact that l()\nabla l(\cdot) is Lipschitz continuous.\hfill\square

A similar result is obtained in [5] under quadratic growth conditions and the existence of second moments (thus relaxing the compactness assumptions). However, [5] primarily focuses on the case in which the optimal solution lies in the interior of the feasible region. Our discussion here can be used in combination with the analysis in [5] to deal with boundary cases.

Acknowledgement
J. Blanchet’s research was partially supported by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-20-1-0397, with additional support from NSF grants 1915967 and 2118199. The research of A. Shapiro was partially supported by AFOSR Grant FA9550-22-1-0244.

References

  • [1] D. Bartl, S. Drapeau, J. Obłój, and J. Wiesel. Sensitivity analysis of Wasserstein distributionally robust optimization problems. Proc. of the Royal Society A, 447:2256, 2021.
  • [2] G. Bayraksan and D. K. Love. Data-driven stochastic programming using phi-divergences. Tutorials in Operations Research, INFORMS, pages 1563–1581, 2015.
  • [3] A. Ben-Tal and M. Teboulle. Penalty functions and duality in stochastic programming via phi-divergence functionals. Mathematics of Operations Research, 12:224–240, 1987.
  • [4] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56:830–857, 2019.
  • [5] J. Blanchet, K. Murthy, and N. Si. Confidence regions in Wasserstein distributionally robust estimation. Biometrika, 109:295–315, 2022.
  • [6] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.
  • [7] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85–108, 1963.
  • [8] John C. Duchi, Peter W. Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3):946–969, 2021.
  • [9] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171:115–166, 2018.
  • [10] R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
  • [11] H. Lam. Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research, 41:1248–1275, 2016.
  • [12] T. Morimoto. Markov processes and the h-theorem. J. Phys. Soc. Jap., 18(3):328–333, 1963.
  • [13] H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
  • [14] A. Shapiro. Asymptotic analysis of stochastic programs. Annals of Operations Research, 30:169–186, 1991.
  • [15] A. Shapiro. Asymptotic behavior of optimal solutions in stochastic programming. Mathematics of Operations Research, 18:829–845, 1993.
  • [16] A. Shapiro. Distributionally robust stochastic programming. SIAM J. Optimization, 27:2258–2275, 2017.
  • [17] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, 2009.
  • [18] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998.
  • [19] C. Villani. Topics in Optimal Transportation. American Mathematical Society, Graduate Studies in Mathematics, Vol. 58, 2003.