Stochastic Decomposition Method for Two-Stage Distributionally Robust Optimization

Harsha Gangammanavar and Manish Bansal Department of Engineering Management, Information, and Systems, Southern Methodist University, Dallas, TX 75275 ([email protected]).Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA 24061 ([email protected]).
(First Submitted: November 4, 2020)

Abstract. In this paper, we present a sequential sampling-based algorithm for the two-stage distributionally robust linear programming (2-DRLP) models. The 2-DRLP models are defined over a general class of ambiguity sets with discrete or continuous probability distributions. The algorithm is a distributionally robust version of the well-known stochastic decomposition algorithm of Higle and Sen (Math. of OR 16(3), 650-669, 1991) for a two-stage stochastic linear program. We refer to the algorithm as the distributionally robust stochastic decomposition (DRSD) method. The key features of the algorithm include (1) it works with data-driven approximations of ambiguity sets that are constructed using samples of increasing size, and (2) it efficiently constructs approximations of the worst-case expectation function by solving only two second-stage subproblems in every iteration. We identify conditions under which the ambiguity set approximations converge to the true ambiguity sets and show that the DRSD method asymptotically identifies an optimal solution, with probability one. We also computationally evaluate the performance of the DRSD method for solving distributionally robust versions of instances considered in the stochastic programming literature. The numerical results corroborate the analytical behavior of the DRSD method and illustrate the computational advantage over an external sampling-based decomposition approach (the distributionally robust L-shaped method).

1 Introduction

Many applications of practical interest have been formulated as stochastic programming (SP) models. The models with recourse, particularly in a two-stage setting, have gained wide acceptance across application domains. These two-stage stochastic linear programs (2-SLPs) can be stated as follows:

min{cx+𝔼P[Q(x,ω~)]|x𝒳}.\displaystyle\min~{}\{c^{\top}x+\mathbb{E}_{P^{\star}}[Q(x,\tilde{\omega})]~{}|~{}x\in\mathcal{X}\}. (1)

One needs complete knowledge of the probability distribution P^{\star} to formulate the above problem. Alternatively, one must have an appropriate means to simulate observations of the random variable so that a sample average approximation (SAA) problem with a finite number of scenarios can be formulated and solved. In many practical applications, the distribution associated with the random parameters in the optimization model is not precisely known. It either has to be estimated from data or constructed from expert judgments, which tend to be subjective. In any case, identifying a distribution using available information may be cumbersome at best. Stochastic min-max programming, which has gained significant attention in recent years under the name of distributionally robust optimization (DRO), is intended to alleviate this ambiguity in distributional information.

In this paper, we study a particular manifestation of the DRO problem in the two-stage setting, viz., the two-stage distributionally robust linear programming (2-DRLP) problem. This problem is stated as:

min{f(x)=cx+(x)|x𝒳}.\displaystyle\min~{}\{f(x)=c^{\top}x+\mathbb{Q}(x)~{}|~{}x\in\mathcal{X}\}. (2)

Here, cc is the coefficient vector of a linear cost function and 𝒳dx\mathcal{X}\subseteq\mathbb{R}^{d_{x}} is the feasible set of the first-stage decision vector. The feasible region 𝒳\mathcal{X} takes the form of a compact polyhedron 𝒳={x|Axb,x0}\mathcal{X}=\{x~{}|~{}Ax\geq b,x\geq 0\}, where Am1×dxA\in\mathbb{R}^{m_{1}\times d_{x}} and bm1b\in\mathbb{R}^{m_{1}}. The function (x)\mathbb{Q}(x) is the worst-case expected recourse cost, which is formally defined as follows:

(x)=maxP𝔓{𝒬(x;P):=𝔼P[Q(x,ω~)]}.\displaystyle\mathbb{Q}(x)=\max_{P\in\mathfrak{P}}~{}\bigg{\{}\mathcal{Q}(x;P):=\mathbb{E}_{P}[Q(x,\tilde{\omega})]\bigg{\}}. (3)

The random vector ω~dω\tilde{\omega}\in\mathbb{R}^{d_{\omega}} is defined on a measurable space (Ω,)(\Omega,\cal F), where Ω\Omega is the sample space equipped with the sigma-algebra \mathcal{F} and 𝔓\mathfrak{P} is a set of continuous or discrete probability distributions defined on the measurable space (Ω,)(\Omega,\mathcal{F}). The set of probability distributions is often referred to as the ambiguity set. The expectation operation 𝔼P[]\mathbb{E}_{P}[\cdot] is taken with respect to the probability distribution P𝔓P\in\mathfrak{P}. For a given x𝒳x\in\mathcal{X}, we refer to the optimization problem in (3) as the distribution separation problem. For a given realization ω\omega of the random vector ω~\tilde{\omega}, the recourse cost in (3) is the optimal value of the following second-stage linear program:

Q(x,ω):=min\displaystyle Q(x,\omega):=\min\quad g(ω)y\displaystyle g(\omega)^{\top}y (4)
s.t. y𝒴(x,ω):={W(ω)y=r(ω)T(ω)x,y0}dy.\displaystyle y\in\mathcal{Y}(x,\omega):=\big{\{}W(\omega)y=r(\omega)-T(\omega)x,~{}y\geq 0\big{\}}\subset\mathbb{R}^{d_{y}}.

The second-stage parameters gdyg\in\mathbb{R}^{d_{y}}, Wm×dyW\in\mathbb{R}^{m\times d_{y}}, rmr\in\mathbb{R}^{m}, and Tm×dxT\in\mathbb{R}^{m\times d_{x}} can depend on uncertainty.

Most data-driven approaches for 2-SLP, such as SAA, tackle the problem in two steps. In the first step, an uncertainty representation is generated using a finite set of observations that serves as an approximation of Ω\Omega and the corresponding empirical distribution serves as an approximation of PP^{\star}. For a given uncertainty representation, one obtains a deterministic approximation of (1). In the second step, the approximate problem is solved using deterministic optimization methods. Such a two-step approach may lead to poor out-of-sample performance, forcing the entire process to be repeated from scratch with an improved uncertainty representation. Since sampling is performed prior to the optimization step, the two-step approach is also referred to as the external sampling procedure.

The data-driven approaches for DRO problems avoid working with a fixed approximation of PP^{\star} in the first step. However, the ambiguity set is still defined either over the original sample space Ω\Omega or a finite approximation of it. Therefore, the resulting ambiguity set is a deterministic representation of the true ambiguity set in (2). Once again, deterministic optimization tools are employed to solve the DRO problem. In many data-driven settings, prior knowledge of the sample space may not be available, and using a finite sample to approximate the original sample space may result in similar out-of-sample performance as in the case of the external sampling approach for 2-SLP.

1.1 Contributions

In light of the above observations regarding the two-step procedure for data-driven optimization, the main contributions of this manuscript are highlighted in the following.

  1.

    A Sequential Sampling Algorithm: We present a sequential sampling approach for solving 2-DRLP. We refer to this algorithm as the distributionally robust stochastic decomposition (DRSD) algorithm following its risk-neutral predecessor, the two-stage stochastic decomposition (SD) method [20]. The algorithm uses a sequence of ambiguity sets that evolve over the course of the algorithm due to the sequential inclusion of new observations. While the simulation step improves the representation of the ambiguity set, the optimization step improves the solution in an online manner. Therefore, the DRSD method concurrently performs simulation and optimization steps. Moreover, the algorithm design does not depend on any specific ambiguity set description, and hence, is suitable for a general family of ambiguity sets.

  2.

    Convergence Analysis: The DRSD method is an inexact bundle method that creates outer linearization for the dynamically evolving approximation of the first-stage problem. We provide the asymptotic analysis of DRSD and identify conditions on ambiguity sets under which the sequential sampling approach identifies an optimal solution to the 2-DRLP problem in (2) with probability one.

  3.

    Computational Evidence of Performance: We provide the first evidence that illustrates the advantages of a sequential sampling approach for DRO through computational experiments conducted on well-known instances in SP literature.

1.2 Related work

The DRSD method has its roots in two-stage SP, in particular two-stage SD. In this subsection, we review the related two-stage SP literature along with decomposition and reformulation-based approaches for 2-DRLP.

For 2-SLP problems with finite support, including the SAA problem, the L-shaped method due to Van Slyke and Wets [40] has proven to be very effective. Other algorithms for 2-SLPs, such as the Dantzig-Wolfe decomposition [12] and the progressive hedging (PH) algorithm [29], also operate on problems with finite support. The well-established theory of SAA (see Chapter 5 in [37]) supports the external sampling procedure for 2-SLP. The quality of the solution obtained by solving an SAA problem is assessed using the procedures developed, e.g., in [4]. When the quality of the SAA solution is not acceptable, a new SAA is constructed with a larger number of observations. Prior works, such as [5] and [31], provide rules on how to choose the sequence of sample sizes in a sequential SAA procedure.

In contrast to the above, SD-based methods incorporate one new observation in every iteration to create approximations of the dynamically updated SAA of (1). First proposed in [20], this method has seen significant development in the past three decades with the introduction of a quadratic regularization term in [21], statistical optimality rules [23], and extensions to multistage stochastic linear programs [18, 34]. The DRSD method presented in this manuscript extends the sequential sampling approach (i.e., SD) for 2-SLPs to DRO problems. Since the simulation of new observations and the optimization steps are carried out in every iteration of the SD-based methods, they can also be viewed as internal sampling methods.

The concept of DRO dates back to the work of Scarf [33] and has gained significant attention in recent years. We refer the reader to [27] for a comprehensive treatment of various aspects of DRO. Algorithmic works on DRO follow either decomposition-based or reformulation-based approaches. The decomposition-based methods for 2-DRLP mimic the two-stage SP approach of using a deterministic representation of the sample space based on a finite number of observations. As a consequence, SP solution methods with suitable adaptations can be applied to solve 2-DRLP problems. For instance, Breton and El Hachem [10, 11] apply the PH algorithm to a 2-DRLP model with a moment-based ambiguity set. Riis and Anderson [28] extend the L-shaped method to 2-DRLP with continuous recourse and a moment-based ambiguity set. Bansal et al. [1] extend the algorithm in [28], which they refer to as the distributionally robust (DR) L-shaped method, to solve 2-DRLPs with an ambiguity set defined over a polytope in finitely many iterations. Further extensions of this decomposition approach are presented in [1] and [2] for DRO with mixed-binary recourse and disjunctive programs, respectively.

Another predominant approach to solve 2-DRLP problems is to reformulate the distribution separation problem in (3) as a minimization problem, pose the problem in (2) as a single deterministic optimization problem, and use off-the-shelf deterministic optimization tools to solve the reformulation. For example, Shapiro and Kleywegt [38] and Shapiro and Ahmed [35] provided approaches for the 2-DRLP problem with a moment-matching set to derive an equivalent stochastic program with a certain reference distribution. Bertsimas et al. [7] provided tight semidefinite programming reformulations for 2-DRLP where the ambiguity set is defined using multivariate distributions with known first and second moments. Likewise, Hanasusanto and Kuhn [19] provided a conic programming reformulation for the 2-DRLP problem where the ambiguity set is a 2-Wasserstein ball centered at a discrete distribution. Xie [41] provided similar reformulations to tractable convex programs for 2-DRLP problems with ambiguity sets defined using the \infty-Wasserstein metric. Jiang and Guan [24] reduced the worst-case expectation in 2-DRLP, where the ambiguity set is defined using the l_1-norm on the space of all (continuous and discrete) probability distributions, to a convex combination of the CVaR and an essential supremum. By taking the dual of the inner maximization, Love and Bayraksan [3] demonstrated that a 2-DRLP whose ambiguity set is defined using \phi-divergence over a finite sample space is equivalent to a 2-SLP with a coherent risk measure. When reformulations result in equivalent stochastic programs (in [24, 3, 36], for instance), an SAA of the reformulation is used to obtain an approximate solution.

Data-driven approaches for DRO have been presented for specific ambiguity sets. In [13], problems with an ellipsoidal moment-based ambiguity set whose parameters are estimated using sampled data are addressed. Esfahani et al. tackled data-driven problems with Wasserstein metric-based ambiguity sets using convex reformulations in [26]. In both of these works, the authors provide finite-sample performance guarantees that probabilistically bound the gap between the approximate and true DRO problems. Sun and Xu presented an asymptotic convergence analysis of DRO problems with ambiguity sets that are based on moments and mixture distributions constructed using a finite set of observations in [39]. A practical approach to incorporating the results of these works to identify a high-quality DRO solution would be similar to the sequential SAA procedure for SP in [5]. Such an approach involves the following steps performed in series: a deterministic representation of the ambiguity set using sampled observations, applying an appropriate reformulation, and solving the resulting deterministic optimization problem. If the quality of the solution is deemed insufficient, then the entire series of steps is repeated with an improved representation of the ambiguity set (possibly with a larger number of observations).

Organization

The remainder of the paper is organized as follows. In §2, we present the two key ideas of the DRSD, viz., the sequential approximation of the ambiguity set and the recourse function. We provide a detailed description of the DRSD method in §3. We show the convergence of the value functions and solutions generated by the DRSD method in §4. We present the results of our computational experiments in §5, and finally, we conclude and discuss potential extensions of this work in §6.

Notations and Definitions

We define the ambiguity sets over \mathcal{M}, the set of all finite signed measures on the measurable space (Ω,)(\Omega,\mathcal{F}). A nonnegative measure (written as P0P\succeq 0) that satisfies P(Ω)=1P(\Omega)=1 is a probability measure. For probability distributions P,P𝔓P,P^{\prime}\in\mathfrak{P}, we define

dist(P,P):=supF|𝔼P[F(ω~)]𝔼P[F(ω~)]|\displaystyle\text{dist}(P,P^{\prime}):=\sup_{F\in\mathcal{F}}\Big{|}\mathbb{E}_{P}[F(\tilde{\omega})]-\mathbb{E}_{P^{\prime}}[F(\tilde{\omega})]\Big{|} (5)

as the uniform distance of expectation, where \mathcal{F} is a class of measurable functions. The above is the distance with \zeta-structure that is used for stability analysis in SP [30]. The distance between a single probability distribution P and a set of distributions \mathfrak{P} is given as \text{dist}(P,\mathfrak{P})=\inf_{P^{\prime}\in\mathfrak{P}}\text{dist}(P,P^{\prime}). The distance between two sets of probability distributions \mathfrak{P} and \widehat{\mathfrak{P}} is given as

𝔻(𝔓,𝔓^):=supP𝔓^dist(P,𝔓).\displaystyle\mathbb{D}(\mathfrak{P},\widehat{\mathfrak{P}}):=\sup_{P\in\widehat{\mathfrak{P}}}\text{dist}(P,\mathfrak{P}).

Finally, the Hausdorff distance between 𝔓\mathfrak{P} and 𝔓^\widehat{\mathfrak{P}} is defined as

(𝔓,𝔓^):=max{𝔻(𝔓,𝔓^),𝔻(𝔓^,𝔓)}.\displaystyle\mathbb{H}(\mathfrak{P},\widehat{\mathfrak{P}}):=\max\{\mathbb{D}(\mathfrak{P},\widehat{\mathfrak{P}}),~{}\mathbb{D}(\widehat{\mathfrak{P}},\mathfrak{P})\}.

With suitable definitions of the set \mathcal{F}, the distance in (5) includes the bounded Lipschitz metric, the Kantorovich metric, and the p-th order Fortet-Mourier metric (see [30]).
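As a concrete illustration of these definitions, the sketch below computes the distance (5), the deviation \mathbb{D}, and the Hausdorff distance \mathbb{H} for small sets of discrete distributions over a common finite support. This is a minimal Python sketch in which the test-function class is represented by a user-supplied matrix of function values; the data are purely illustrative.

import numpy as np

def zeta_distance(P, Q, F_values):
    # dist(P, Q) = sup over the function class of | E_P[F] - E_Q[F] |, cf. (5);
    # F_values[i, j] = value of the i-th test function at the j-th support point.
    return np.max(np.abs(F_values @ (np.asarray(P) - np.asarray(Q))))

def deviation(P_set, P_hat_set, F_values):
    # D(P_set, P_hat_set) = sup_{P in P_hat_set} inf_{P' in P_set} dist(P, P')
    return max(min(zeta_distance(P, Pp, F_values) for Pp in P_set) for P in P_hat_set)

def hausdorff(P_set, P_hat_set, F_values):
    # H = maximum of the two one-sided deviations
    return max(deviation(P_set, P_hat_set, F_values), deviation(P_hat_set, P_set, F_values))

# Example: two ambiguity sets over a 3-point support, with two test functions
F_values = np.array([[1.0, 2.0, 4.0], [1.0, 4.0, 16.0]])
set_A = [np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])]
set_B = [np.array([1/3, 1/3, 1/3])]
print(hausdorff(set_A, set_B, F_values))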

2 Approximating Ambiguity Set and Recourse Function

In this section, we present the building blocks for the DRSD method. Specifically, we present the procedures to obtain approximations of the ambiguity set 𝔓\mathfrak{P} and the recourse function Q(x,ω)Q(x,\omega). These procedures will be embedded within a sequential sampling-based approach. Going forward we make the following assumptions on the 2-DRLP models:

  (A1)

    The first-stage feasible region 𝒳\mathcal{X} is a non-empty and compact set.

  (A2)

    The recourse function satisfies relatively complete recourse. The dual feasible region of the recourse problem is a nonempty, compact polyhedral set. The transfer (or technology) matrix satisfies \sup_{P\in\mathfrak{P}}\mathbb{E}_{P}[T(\tilde{\omega})]<\infty.

  (A3)

    The randomness affects only the right-hand sides of the constraints in (4).

  (A4)

    The sample space \Omega is a compact metric space and the ambiguity set \mathfrak{P}\neq\emptyset.

As a consequence of (A2), the recourse function satisfies Q(x,\tilde{\omega})<\infty with probability one for all x\in\mathcal{X}. It also implies that the second-stage feasible region, i.e., \{y~:~Wy=r(\omega)-T(\omega)x,~y\geq 0\}, is non-empty for all x\in\mathcal{X} and every \omega\in\Omega. The non-empty dual feasible region \Pi implies that there exists an L>-\infty such that Q(x,\tilde{\omega})>L. Without loss of generality, we assume that Q(x,\tilde{\omega})\geq 0. As a consequence of (A3), the cost coefficient vector g and the recourse matrix W are not affected by uncertainty; problems that satisfy (A3) are said to have fixed recourse. Finally, the compactness of the support \Omega in (A4) guarantees that every probability measure P\in\mathfrak{P} is tight.

2.1 Approximating the Ambiguity Set

The DRO approach assumes only partial knowledge about the underlying uncertainty that is captured by a suitable description of the ambiguity set. An ambiguity set must capture the true distribution with absolute or a high degree of certainty, and must be computationally manageable. In this section, we present a family of ambiguity sets that are of interest to us in this paper.

The computational aspects of solving a DRO problem rely heavily on the structure of the ambiguity set. The descriptions of these structures involve parameters that are determined based on the practitioner's risk preferences. The ambiguity set descriptions that are prevalent in the literature include moment-based ambiguity sets with linear constraints (e.g., [14]) or conic constraints (e.g., [13]); Kantorovich distance or Wasserstein metric-based ambiguity sets [25]; \zeta-structure metrics [42]; \phi-divergences such as the \chi^2 distance and the Kullback-Leibler divergence [6]; and Prokhorov metrics [15], among others. Although the design of the DRSD method can work with any ambiguity set description defined over a compact sample space, we use 2-DRLPs with moment-based and Wasserstein distance-based ambiguity sets to illustrate the algorithm in detail.

In a data-driven setting, the parameters used in the description of ambiguity sets are estimated using a finite set of independent observations which can either be past realizations of the random variable ω~\tilde{\omega} or simulated by an oracle. We will denote such a sample by ΩkΩ\Omega^{k}\subseteq\Omega with Ωk={ωj}j=1k\Omega^{k}=\{\omega^{j}\}_{j=1}^{k}. Naturally, we can view Ωk\Omega^{k} as a random sample and define the empirical frequency

p^k(ωj)=κ(ωj)kfor all ωjΩk,\displaystyle\hat{p}^{k}(\omega^{j})=\frac{\kappa(\omega^{j})}{k}\qquad\text{for all }\omega^{j}\in\Omega^{k}, (6)

where \kappa(\omega^{j}) denotes the number of times \omega^{j} is observed in the sample. Since, in the sequential sampling setting, the sample set is updated within the optimization algorithm, it is worth noting that the empirical frequencies can be updated using the following recursive equations:

p^k(ω)={θkp^k1(ω)ifωΩk1,ωωkθkp^k1(ω)+(1θk)ifωΩk1,ω=ωk(1θk)ifωΩk1,ω=ωk.\displaystyle\hat{p}^{k}(\omega)=\left\{\begin{array}[]{ll}\theta^{k}\hat{p}^{k-1}(\omega)&\text{if}~{}\omega\in\Omega^{k-1},\omega\neq\omega^{k}\\ \theta^{k}\hat{p}^{k-1}(\omega)+(1-\theta^{k})&\text{if}~{}\omega\in\Omega^{k-1},\omega=\omega^{k}\\ (1-\theta^{k})&\text{if}~{}\omega\notin\Omega^{k-1},\omega=\omega^{k}.\end{array}\right. (10)

where \theta^{k}\in[0,1]. We will succinctly denote the above operation using the mapping \Theta^{k}:\mathbb{R}^{|\Omega^{k-1}|}\rightarrow\mathbb{R}^{|\Omega^{k}|}.
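A minimal Python sketch of the recursion (10) and the mapping \Theta^{k}, with the empirical frequencies stored in a dictionary keyed by the (hashable) observations; the data structures here are illustrative and not prescribed by the algorithm.

def update_empirical_frequency(p_hat, omega_k, k):
    # p_hat: dict mapping observations in Omega^{k-1} to empirical frequencies \hat{p}^{k-1}
    # omega_k: the new observation; k: sample size after including omega_k
    theta_k = (k - 1) / k
    p_new = {omega: theta_k * p for omega, p in p_hat.items()}   # first case of (10)
    p_new[omega_k] = p_new.get(omega_k, 0.0) + (1.0 - theta_k)   # second and third cases of (10)
    return p_new

# Example: observing 1.0, 2.0, 1.0 sequentially yields {1.0: 2/3, 2.0: 1/3}
p_hat = {}
for k, omega in enumerate([1.0, 2.0, 1.0], start=1):
    p_hat = update_empirical_frequency(p_hat, omega, k)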

In the remainder of this section, we present alternate descriptions of ambiguity sets and show the construction of what we will refer to as approximate ambiguity sets, denoted by \widehat{\mathfrak{P}}^{k}. Let \mathcal{F}^{k}=\sigma(\omega^{j}~|~j\leq k) be the \sigma-algebra generated by the observations in the sample \Omega^{k}. Notice that \mathcal{F}^{k-1}\subseteq\mathcal{F}^{k}, and hence, \{\mathcal{F}^{k}\}_{k\geq 1} is a filtration. We will define the approximate ambiguity sets over the measurable space (\Omega^{k},\mathcal{F}^{k}). These sets should be interpreted to include all distributions that could have been generated using the sample \Omega^{k} and that share a certain relationship with the sample statistics. We will use \mathcal{M}^{k} to denote the set of finite signed measures on (\Omega^{k},\mathcal{F}^{k}).

2.1.1 Moment-based Ambiguity Sets

Given the first qq moments associated with the random variable ω~\tilde{\omega}, the moment-based ambiguity set can be defined as

𝔓mom={P|Ω𝑑P(ω~)=1,Ωψi(ω~)𝑑P(ω~)=bii=1,,q}.\displaystyle\mathfrak{P}_{\text{mom}}=\left\{P\in\mathcal{M}\left|\begin{array}[]{l}\int_{\Omega}dP(\tilde{\omega})=1,\\ \int_{\Omega}\psi_{i}(\tilde{\omega})dP(\tilde{\omega})=b_{i}\qquad i=1,\ldots,q\end{array}\right.\right\}. (13)

While the first constraint ensures the definition of a probability measure, the moment requirements are guaranteed by the second set of constraints. Here, \psi_{i}(\tilde{\omega}) denotes a real-valued measurable function on (\Omega,\mathcal{F}) and b_{i}\in\mathbb{R} is a scalar for i=1,\ldots,q. The existence of moments ensures that b_{i}<\infty for all i=1,\ldots,q. Notice that the description of the ambiguity set requires explicit knowledge of the following statistics: the support \Omega and the moments b_{i} for i=1,\ldots,q. In the data-driven setting, the support is approximated by \Omega^{k} and the sample moments \hat{b}_{i}^{k}=(1/k)\sum_{j=1}^{k}\psi_{i}(\omega^{j}) are used to define the following approximate ambiguity set

𝔓^momk={Pk|ωΩkp(ω)=1,ωΩkp(ω)ψi(ω)=b^iki=1,,q}.\displaystyle\widehat{\mathfrak{P}}_{\textup{mom}}^{k}=\left\{P\in\mathcal{M}^{k}\left|\begin{array}[]{l}\sum_{\omega\in\Omega^{k}}p(\omega)=1,\\ \sum_{\omega\in\Omega^{k}}p(\omega)\psi_{i}(\omega)=\hat{b}_{i}^{k}\qquad i=1,\ldots,q\end{array}\right.\right\}. (16)
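For a fixed first-stage decision, the distribution separation problem (3) over the approximate set (16) is a small linear program in the weights p(\omega), \omega\in\Omega^{k}. The following Python sketch solves it with scipy.optimize.linprog; the recourse values Q_vals, the moment functions \psi_i, and the sample moments are placeholders supplied purely for illustration.

import numpy as np
from scipy.optimize import linprog

def separate_moment(Q_vals, psi_vals, b_hat):
    # Q_vals[j]      = Q(x, omega^j) for omega^j in Omega^k
    # psi_vals[i, j] = psi_i(omega^j); b_hat[i] = sample moment \hat{b}_i^k
    k = len(Q_vals)
    A_eq = np.vstack([np.ones((1, k)), np.atleast_2d(psi_vals)])   # probability and moment constraints of (16)
    b_eq = np.concatenate(([1.0], np.atleast_1d(b_hat)))
    res = linprog(c=-np.asarray(Q_vals, dtype=float), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method="highs")
    return res.x, -res.fun                                         # extremal weights and worst-case value

# Example with first-moment matching (q = 1): psi_1(omega) = omega
omegas = np.array([1.0, 2.0, 4.0])
p_k, worst_case = separate_moment(Q_vals=[3.0, 5.0, 2.0],
                                  psi_vals=omegas.reshape(1, -1),
                                  b_hat=[omegas.mean()])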

The following result characterizes the relationship between distributions drawn from the above approximate ambiguity set, as well as asymptotic behavior of the sequence {𝔓^momk}k1\{\widehat{\mathfrak{P}}_{\textup{mom}}^{k}\}_{k\geq 1}.

Proposition 2.1.

For any P\in\widehat{\mathfrak{P}}_{\textup{mom}}^{k-1}, we have \Theta^{k}(P)\in\widehat{\mathfrak{P}}_{\textup{mom}}^{k} where \theta^{k}=\frac{k-1}{k}. Further, if \widehat{\mathfrak{P}}_{\textup{mom}}^{k}\neq\emptyset for all k\geq 0, then \mathbb{H}(\widehat{\mathfrak{P}}_{\textup{mom}}^{k},\mathfrak{P}_{\textup{mom}})\rightarrow 0 as k\rightarrow\infty, almost surely.

Proof.

See Appendix §A. ∎

In the context of DRO, similar ambiguity sets have been studied in [8, 14] where only the first moment (i.e., q=1) was considered. The form of the ambiguity set above also relates to those used in [13, 28, 33, 39] where constraints were imposed only on the mean and covariance. In the data-driven settings of [13] and [39], statistical estimates are used in constructing the approximate ambiguity set, as in the case of (16). However, the ambiguity sets in these previous works are defined over the original sample space \Omega, as opposed to \Omega^{k} that is used in (16). This marks a critical deviation in the way the approximate ambiguity sets are constructed.

Remark 2.1.

When moment information is available about the underlying distribution P^{\star}, an approximate moment-based ambiguity set with constant parameters in (16) (i.e., with \hat{b}_{i}^{k}=b_{i} for all k) can be constructed. Such approximate ambiguity sets defined over \Omega^{k} are studied in [28]. Notice that these approximate ambiguity sets satisfy \cup_{k\geq 1}\widehat{\mathfrak{P}}^{k}\subseteq\mathfrak{P} and \widehat{\mathfrak{P}}^{k}\subseteq\widehat{\mathfrak{P}}^{k+1}, for all k\geq 1.

2.1.2 Wasserstein distance-based Ambiguity Sets

We next present approximations of another class of ambiguity sets that has gained significant attention in the DRO literature, viz., the Wasserstein distance-based ambiguity sets. Consider probability distributions μ1,μ2\mu_{1},\mu_{2}\in\mathcal{M}, and a function ν:Ω×Ω+{}\nu:\Omega\times\Omega\rightarrow\mathbb{R}_{+}\cup\{\infty\} such that ν\nu is symmetric, ν1r()\nu^{\frac{1}{r}}(\cdot) satisfies triangle inequality for 1r<1\leq r<\infty, and ν(ω1,ω2)=0\nu(\omega_{1},\omega_{2})=0 whenever ω1=ω2\omega_{1}=\omega_{2}. If Π(μ1,μ2)\Pi(\mu_{1},\mu_{2}) denotes the joint distribution of random vectors ω1\omega_{1} and ω2\omega_{2} with marginals μ1\mu_{1} and μ2\mu_{2}, respectively, then the Wasserstein metric of order rr is given by

dw(μ1,μ2)=infηΠ(μ1,μ2){Ω×Ων(ω1,ω2)η(dω1,dω2)}.\displaystyle d_{\text{w}}(\mu_{1},\mu_{2})=\inf_{\eta\in\Pi(\mu_{1},\mu_{2})}\bigg{\{}\int_{\Omega\times\Omega}\nu(\omega_{1},\omega_{2})\eta(d\omega_{1},d\omega_{2})\bigg{\}}. (17)

In the above definition, the decision variable \eta\in\Pi(\mu_{1},\mu_{2}) can be viewed as a plan to transport goods/mass from an entity whose spatial distribution is given by the measure \mu_{1} to another entity with spatial distribution \mu_{2}. Therefore, d_{\text{w}}(\mu_{1},\mu_{2}) measures the optimal transport cost between the measures. Notice that an arbitrary norm \|\bullet\|^{r} on \mathbb{R}^{d_{\omega}} satisfies the requirements on the function \nu(\cdot). In this paper, we will use \|\bullet\| as the choice of our metric, in which case we obtain the Wasserstein metric of order 1. Using this metric, we define an ambiguity set as follows:

𝔓w={P|dW(P,P)ϵ}\displaystyle\mathfrak{P}_{\textup{w}}=\{P\in\mathcal{M}~{}|~{}d_{W}(P,P^{*})\leq\epsilon\} (18)

for a given \epsilon>0 and a reference distribution P^{*}. As was done in §2.1.1, we define approximate ambiguity sets over the measurable space (\Omega^{k},\mathcal{F}^{k}) as follows:

𝔓^wk={Pk|dW(P,P^k)ϵ}.\displaystyle\widehat{\mathfrak{P}}_{\textup{w}}^{k}=\{P\in\mathcal{M}^{k}~{}|~{}d_{W}(P,\widehat{P}^{k})\leq\epsilon\}. (19)

For the above approximate ambiguity set, the distribution separation problem in (3) takes the following form:

max\displaystyle\max~{} ωΩkp(ω)Q(x,ω)\displaystyle\sum_{\omega\in\Omega^{k}}p(\omega)Q(x,\omega) (20a)
subject to P𝔓^wk={Pk|ωΩkp(ω)=1ωΩkη(ω,ω)=p(ω)ωΩk,ωΩkη(ω,ω)=p^k(ω)ωΩk,(ω,ω)Ωk×Ωkωωη(ω,ω)ϵη(ω,ω)0ω,ωΩk}.\displaystyle P\in\widehat{\mathfrak{P}}^{k}_{\textup{w}}=\left\{P\in\mathcal{M}^{k}\left|\begin{array}[]{l}\sum_{\omega\in\Omega^{k}}p(\omega)=1\\ \sum_{\omega^{\prime}\in\Omega^{k}}\eta(\omega,\omega^{\prime})=p(\omega)\qquad\forall\omega\in\Omega^{k},\\ \sum_{\omega\in\Omega^{k}}\eta(\omega,\omega^{\prime})=\hat{p}^{k}(\omega^{\prime})\qquad\forall\omega^{\prime}\in\Omega^{k},\\ \sum_{(\omega,\omega^{\prime})\in\Omega^{k}\times\Omega^{k}}\|\omega-\omega^{\prime}\|\eta(\omega,\omega^{\prime})\leq\epsilon\\ \eta(\omega,\omega^{\prime})\geq 0\quad\forall\omega,\omega^{\prime}\in\Omega^{k}\end{array}\right.\right\}. (20g)
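A minimal Python sketch of the linear program (20), with the joint weights \eta stored as a flattened k\times k block of variables; the inputs (recourse values, observations, empirical frequencies, and radius \epsilon) are illustrative placeholders.

import numpy as np
from scipy.optimize import linprog

def separate_wasserstein(Q_vals, omegas, p_hat, eps):
    # Q_vals[i] = Q(x, omega^i); omegas: (k, d) array of observations in Omega^k;
    # p_hat[i]  = empirical frequency \hat{p}^k(omega^i); eps: Wasserstein radius.
    k = len(Q_vals)
    dist = np.linalg.norm(omegas[:, None, :] - omegas[None, :, :], axis=2)  # ||omega - omega'||
    n = k + k * k                                    # variables: p (first k entries), eta flattened row-major
    c = np.concatenate([-np.asarray(Q_vals, dtype=float), np.zeros(k * k)])
    A_eq = np.zeros((1 + 2 * k, n))
    b_eq = np.zeros(1 + 2 * k)
    A_eq[0, :k] = 1.0
    b_eq[0] = 1.0                                    # sum_omega p(omega) = 1
    for i in range(k):                               # sum_{omega'} eta(omega^i, omega') = p(omega^i)
        A_eq[1 + i, k + i * k : k + (i + 1) * k] = 1.0
        A_eq[1 + i, i] = -1.0
    for j in range(k):                               # sum_{omega} eta(omega, omega^j) = p_hat(omega^j)
        A_eq[1 + k + j, k + j :: k] = 1.0
        b_eq[1 + k + j] = p_hat[j]
    A_ub = np.concatenate([np.zeros(k), dist.flatten()]).reshape(1, -1)     # transport cost <= eps
    res = linprog(c, A_ub=A_ub, b_ub=[eps], A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.x[:k], -res.fun                       # extremal distribution and worst-case value

# Example: three one-dimensional observations with equal empirical weights
omegas = np.array([[1.0], [2.0], [4.0]])
p_k, worst_case = separate_wasserstein(Q_vals=[3.0, 5.0, 2.0], omegas=omegas,
                                       p_hat=[1/3, 1/3, 1/3], eps=0.5)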

The following result characterizes the distributions drawn from the approximate ambiguity sets of the form in (19), or equivalently (20g).

Proposition 2.2.

The sequence of Wasserstein distance-based approximate ambiguity sets satisfies the following properties: (1) for any P\in\widehat{\mathfrak{P}}_{\textup{w}}^{k-1}, we have \Theta^{k}(P)\in\widehat{\mathfrak{P}}_{\textup{w}}^{k} where \theta^{k}=\frac{k-1}{k}; and (2) \mathbb{H}(\widehat{\mathfrak{P}}_{\textup{w}}^{k},\mathfrak{P}_{\textup{w}})\rightarrow 0 as k\rightarrow\infty, almost surely.

Proof.

See appendix §A. ∎

The approximate ambiguity set in [26] is a ball constructed in the space of probability distributions that are defined over the sample space \Omega and whose radius shrinks as the number of observations increases. Using Wasserstein balls of shrinking radii, the authors of [26] show that the optimal value of the sequence of DRO problems converges to the optimal value of the expectation-valued SP problem in (1) associated with the true distribution P^{\star}. The approximate ambiguity set in (19), on the other hand, uses a constant radius for all k\geq 1. In this regard, we consider settings where the ambiguity is not necessarily resolved with an increasing number of observations. This is reflected in the approximate ambiguity sets (16) and (19) that converge to their respective true ambiguity sets (13) and (18).

2.2 Approximating the Recourse Problem

Cutting plane methods for 2-SLPs use an outer linearization-based approximation of the first-stage objective function in (1). In such algorithms, the challenging aspect of computing the expectation is addressed by taking advantage of the structure of the recourse problem (4). Specifically, for a given \omega, the recourse value Q(\cdot,\omega) is known to be convex in the right-hand side parameters, which include the first-stage decision vector x. Additionally, if the support of \tilde{\omega} is finite and (A2) holds, then the function Q(\cdot,\omega) is polyhedral. Under assumptions (A2) and (A4), this structural property of convexity extends to the expected recourse value \mathcal{Q}(x;P).

Due to strong duality of linear programs, the recourse value is also equal to the optimal value of the dual of (4), i.e.,

Q(x,ω)=max\displaystyle Q(x,\omega)=\max~{} π[r(ω)T(ω)x]\displaystyle\pi^{\top}[r(\omega)-T(\omega)x] (21)
subject to πΠ:={π|Wπg}.\displaystyle\pi\in\Pi:=\{\pi~{}|~{}W^{\top}\pi\leq g\}.

Due to (A2) and (A3), the dual feasible region \Pi is a polytope that is not affected by the uncertainty. Since \Pi is a polytope with a finite set of extreme points, the recourse value can also be expressed as the pointwise maximum of the affine functions generated by these extreme points, as given below.

Q(x,ω)=maxπΠπ[r(ω)T(ω)x].\displaystyle Q(x,\omega)=\max_{\pi\in\Pi}\pi^{\top}[r(\omega)-T(\omega)x]. (22)

The outer linearization approaches approximate the above form of the recourse function by identifying extreme points (optimal solutions to (21)) at a sequence of candidate (or trial) solutions \{x^{k}\}, and generating the corresponding affine functions. If \pi(x^{k},\omega) is the optimal dual solution obtained by solving (21) with x^{k} as input, then the affine function \alpha^{k}(\omega)+(\beta^{k}(\omega))^{\top}x is obtained by computing the coefficients \alpha^{k}(\omega)=(\pi(x^{k},\omega))^{\top}r(\omega) and \beta^{k}(\omega)=-T(\omega)^{\top}\pi(x^{k},\omega). Following linear programming duality, notice that this affine function is a supporting hyperplane of Q(\cdot,\omega) at x^{k}, and lower bounds the function at every other x\in\mathcal{X}.
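As an illustration of this step, the sketch below solves the primal subproblem (4) with scipy.optimize.linprog and recovers the dual multipliers from the solver's equality-constraint marginals, from which the hyperplane coefficients are assembled. The problem data (g, W, r, T) are placeholders, and the marginal sign convention should be verified against the solver version in use.

import numpy as np
from scipy.optimize import linprog

def solve_subproblem(g, W, r, T, x):
    # Solves (4): min g'y s.t. W y = r - T x, y >= 0, and returns Q(x, omega) and pi(x, omega).
    rhs = r - T @ x
    res = linprog(c=g, A_eq=W, b_eq=rhs, bounds=[(0, None)] * len(g), method="highs")
    pi = res.eqlin.marginals        # sensitivities w.r.t. the right-hand side, i.e., the dual solution
    return res.fun, np.asarray(pi)

def supporting_hyperplane(pi, r, T):
    # Coefficients of the affine minorant alpha^k(omega) + (beta^k(omega))' x of Q(., omega)
    return pi @ r, -T.T @ pi

# Example with a tiny subproblem (illustrative data)
g = np.array([1.0, 2.0]); W = np.array([[1.0, 1.0]])
r = np.array([3.0]); T = np.array([[1.0, 0.0]]); x = np.array([1.0, 0.0])
Q_val, pi = solve_subproblem(g, W, r, T, x)
alpha, beta = supporting_hyperplane(pi, r, T)   # alpha + beta' x equals Q_val at x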

If the support \Omega is finite, then one can solve a dual subproblem for every \omega\in\Omega with the candidate solution as input, generate the affine functions, and collate them to obtain an approximation of the first-stage objective function. This is the essence of the L-shaped method applied to the 2-SLP in (1). In each iteration of the L-shaped method, the affine functions generated at a candidate solution x^{k} from the individual observations are weighted by the probability of the observation to update the approximate first-stage objective function. Notice that the L-shaped method can also be applied to an SAA of the 2-SLP constructed using a sample \Omega_{N}\subset\Omega of size N. A similar approximation strategy is used in the DR L-shaped method for 2-DRLP problems.

Alternatively, we can consider the following approximation of the recourse function expressed in the form given in (22):

Q^{k}(x,\omega)=\max_{\pi\in\Pi^{k}}\pi^{\top}[r(\omega)-T(\omega)x]. (23)

Notice that the above approximation is built using only a subset ΠkΠ\Pi^{k}\subset\Pi of extreme points, and therefore, satisfies Qk(x,ω)Q(x,ω)Q^{k}(x,\omega)\leq Q(x,\omega). Since Q(x,ω)0Q(x,\omega)\geq 0, we begin with Π0={0}\Pi^{0}=\{0\}. Subsequently, we construct a sequence of sets {Πk}\{\Pi^{k}\} such that ΠkΠk+1Π\Pi^{k}\subseteq\Pi^{k+1}\subseteq\ldots\subset\Pi that ensures Qk(x,ω)0Q^{k}(x,\omega)\geq 0 for all kk. The following result from [20] captures the behavior of the sequence of approximation {Qk}\{Q^{k}\}.

Proposition 2.3.

The sequence {Qk(x,ω)}k1\{Q^{k}(x,\omega)\}_{k\geq 1} converges uniformly to a continuous function on 𝒳\mathcal{X} for any ωΩ\omega\in\Omega.

Proof.

See Appendix A. ∎

The approximation of the form in (23) is one of the principal features of the SD algorithm (see [20, 21]). While the L-shaped and DR L-shaped methods require a finite support for \tilde{\omega}, SD is applicable even for problems with continuous support. The algorithm uses an "incremental" SAA of the first-stage objective function by adding one new observation in each iteration. Therefore, the first-stage objective function approximation used in SD is built using the recourse problem approximation in (23) and the incremental SAA. This approximation is given by:

𝒬k(x)=cx+1kj=1kQk(x,ωj).\displaystyle\mathcal{Q}^{k}(x)=c^{\top}x+\frac{1}{k}\sum_{j=1}^{k}Q^{k}(x,\omega^{j}). (24)

The affine functions generated in SD provide an outer linearization for the approximation in (24). The sequence of sets that grow monotonically in size, viz. {Πk}\{\Pi^{k}\}, is generated by adding one new vertex to the previous set Πk1\Pi^{k-1} to obtain the updated set Πk\Pi^{k}. The newly added vertex is the optimal dual solution obtained by solving (21) with the most recent observation ωk\omega^{k} and candidate solution xkx^{k} as input.

We refer the reader to [9], [1, 28], and [20, 22] for a detailed exposition of the L-shaped, the DR L-shaped, and the SD methods, respectively, and note only the key differences between these methods here. First, the sample used in the (DR) L-shaped method is fixed prior to optimization. In SD, this sample is updated dynamically throughout the course of the algorithm. Second, exact subproblem optimization for all observations in the sample is used in every iteration of the (DR) L-shaped method. In contrast, SD uses exact optimization only for the subproblems corresponding to the latest observation, and an "argmax" procedure (to be described in the next section) is used for observations encountered in earlier iterations.

3 Distributionally Robust Stochastic Decomposition

In this paper, we focus on a setting where the ambiguity set \mathfrak{P} is approximated by a sequence of ambiguity sets \{\widehat{\mathfrak{P}}^{k}\}_{k>0} such that the following properties are satisfied: (i) for any P\in\widehat{\mathfrak{P}}^{k-1}, there exists \theta^{k}\in[0,1] such that \Theta^{k}(P)\in\widehat{\mathfrak{P}}^{k}, and (ii) \mathbb{H}(\widehat{\mathfrak{P}}^{k},\mathfrak{P})\rightarrow 0 as k\rightarrow\infty, almost surely. The moment-based ambiguity set \mathfrak{P}_{\text{mom}} and the Wasserstein distance-based ambiguity set \mathfrak{P}_{\text{w}} are two sets that satisfy these properties (Propositions 2.1 and 2.2, respectively). Recall that the approximate ambiguity set in an iteration (say k) is constructed using a finite set of observations \Omega^{k} that progressively grows in size. Note that the sequence of approximate ambiguity sets may not necessarily converge to a single distribution. In other words, we do not assume that increasing the sample size will overcome the ambiguity asymptotically, as is the case in [26, 42].

Algorithm 1 Distributionally Robust Stochastic Decomposition
1:Input: Incumbent solution x^1𝒳\hat{x}^{1}\in\mathcal{X}; initial sample Ω0Ω\Omega^{0}\subseteq\Omega; stopping tolerance τ>0\tau>0; γ(0,1]\gamma\in(0,1], and minimum iterations kmink^{\min}.
2:Initialization: Set iteration counter k1k\leftarrow 1; Π0=\Pi^{0}=\emptyset; 0=\mathcal{L}^{0}=\emptyset, and f0(x)=0f^{0}(x)=0.
3:while (k < k^{\min} or f^{k-1}(\hat{x}^{k-1})-f^{k-1}(x^{k})\geq\tau f^{k-1}(\hat{x}^{k-1})) do
4:     Solve the master problem k\mathcal{M}^{k} (25) to obtain a candidate solution xkx^{k}.
5:     Generate a scenario ωkΩ\omega^{k}\in\Omega to get sample ΩkΩk1{ωk}\Omega^{k}\leftarrow\Omega^{k-1}\cup\{\omega^{k}\}.
6:     for ωΩk\omega\in\Omega^{k} do
7:         if ω=ωk\omega=\omega^{k} then
8:              Solve the second-stage linear program (4) with (xk,ωx^{k},\omega) as input;
9:              Obtain the optimal value Q(xk,ω)Q(x^{k},\omega) and optimal dual solution π(xk,ω)\pi(x^{k},\omega);
10:              Update dual vertex set ΠkΠk1{π(xk,ω)}\Pi^{k}\leftarrow\Pi^{k-1}\cup\{\pi(x^{k},\omega)\}.
11:         else
12:              Use the argmax procedure (26) to identify the dual vertex \pi(x^{k},\omega);
13:              Store Qk(xk,ω)=(π(xk,ω))[r(ω)T(ω)xk]Q^{k}(x^{k},\omega)=(\pi(x^{k},\omega))^{\top}[r(\omega)-T(\omega)x^{k}].
14:         end if
15:     end for
16:     Solve the distribution separation problem using the ambiguity set \widehat{\mathfrak{P}}^{k} and \{Q^{k}(x^{k},\omega)\}_{\omega\in\Omega^{k}} to get an extremal distribution P^{k}:=(p^{k}(\omega))_{\omega\in\Omega^{k}}.
17:     Derive affine function kk(x)=αkk+(βkk)x{\ell}_{k}^{k}(x)={\alpha}_{k}^{k}+({\beta}_{k}^{k})^{\top}x using {π(xk,ω)}ωΩk\{\pi(x^{k},\omega)\}_{\omega\in\Omega^{k}} and PkP^{k} to get lower bound approximation of k(x)\mathbb{Q}^{k}(x) as in (29);
18:     Perform Steps 6-17 with x^k1\hat{x}^{k-1} (incumbent solution) to obtain ^kk()\hat{\ell}_{k}^{k}(\cdot).
19:     for jk1k1\ell_{j}^{k-1}\in\mathcal{L}^{k-1} do
20:         Update previously generated affine functions \ell^{k-1}_{j}(x)={\alpha}_{j}^{k-1}+({\beta}_{j}^{k-1})^{\top}x:
\alpha^{k}_{j}=\theta^{k}\alpha^{k-1}_{j}\text{ and }\beta^{k}_{j}=\theta^{k}\beta^{k-1}_{j};
21:         Set jk(x)=αjk+(βjk)x{\ell}^{k}_{j}(x)={\alpha}^{k}_{j}+({\beta}^{k}_{j})^{\top}x that provides lower bound approx. of k(x)\mathbb{Q}^{k}(x);
22:     end for
23:     Build a collection of these affine functions, denoted by k\mathcal{L}^{k};
24:     Update approximation of the first-stage objective function:
c^{\top}x+\mathbb{Q}^{k}(x)\geq f^{k}(x)=c^{\top}x+\max_{j\in\mathcal{L}^{k}}~\{\alpha^{k}_{j}+(\beta^{k}_{j})^{\top}x\};
25:     if Incumbent update rule (33) is satisfied then
26:         x^kx^k1\hat{x}^{k}\leftarrow\hat{x}^{k-1} and ikik1i_{k}\leftarrow i_{k-1};
27:     else
28:         x^kxk\hat{x}^{k}\leftarrow x^{k} and ikki_{k}\leftarrow k;
29:     end if
30:     Update the master problem by replacing fk1(x)f^{k-1}(x) with fk(x)f^{k}(x) to obtain k+1\mathcal{M}^{k+1};
31:     kk+1k\leftarrow k+1;
32:end while
33:return xkx^{k}

The pseudocode of the DRSD method is given in Algorithm 1. We present the main steps of the DRSD method in iteration k (Steps 4-31 of Algorithm 1). At the beginning of iteration k, we have a certain approximation of the first-stage objective function that we denote as f^{k-1}(x), a finite set of observations \Omega^{k-1}, and an incumbent solution \hat{x}^{k-1}. We use the term incumbent solution to refer to the best solution discovered by the algorithm until iteration k. The solution identified in the current iteration is referred to as the candidate solution and is denoted by x^{k} (without the \hat{\bullet} accent).

Iteration k begins by identifying the candidate solution by solving the following master problem (Step 4):

xkargmin{fk1(x)|x𝒳},\displaystyle x^{k}\in\arg\min~{}\{f^{k-1}(x)~{}|~{}x\in\mathcal{X}\}, (25)

denoted by k\mathcal{M}^{k}. Following this, a new observation ωkΩ\omega^{k}\in\Omega is realized, and added to the current sample of observations Ωk1\Omega^{k-1} to get Ωk=Ωk1{ωk}\Omega^{k}=\Omega^{k-1}\cup\{\omega^{k}\} (Step 5).

In order to build the first-stage objective function approximation, we rely upon the recourse function approximation presented in Section 2.2. For the most recent observation \omega^{k} and the candidate solution x^{k}, we evaluate the recourse function value Q(x^{k},\omega^{k}) by solving (4), and obtain the optimal dual solution \pi(x^{k},\omega^{k}). Likewise, we obtain the optimal dual solution \pi(\hat{x}^{k-1},\omega^{k}) by solving (4) for the incumbent solution \hat{x}^{k-1} (Steps 7-10). These dual vectors are added to the set \Pi^{k-1} of previously discovered optimal dual vectors. In other words, we recursively update \Pi^{k}\leftarrow\Pi^{k-1}\cup\{\pi(x^{k},\omega^{k}),\pi(\hat{x}^{k-1},\omega^{k})\}. For all other observations (\omega\in\Omega^{k},\ \omega\neq\omega^{k}), we identify a dual vector in \Pi^{k} that provides the best lower bounding approximation of Q(x^{k},\omega) using the following operation (Steps 11-13):

π(xk,ω)argmax{π[r(ω)T(ω)xk]|πΠk}.\pi(x^{k},\omega)\in\arg\max~{}\{\pi^{\top}[r(\omega)-T(\omega)x^{k}]~{}|~{}\pi\in\Pi^{k}\}. (26)

Note that the calculations in (26) are carried out only for previous observations as π(xk,ωk)\pi(x^{k},\omega^{k}) provides the best lower bound at Q(xk,ωk)Q(x^{k},\omega^{k}). Further, notice that

π(xk,ω)[r(ω)T(ω)xk]=Qk(xk,ω),\pi(x^{k},\omega)^{\top}[r(\omega)-T(\omega)x^{k}]=Q^{k}(x^{k},\omega),

the approximate recourse function value at xkx^{k} defined in (23), for all ωΩk\omega\in\Omega^{k}, and Qk(xk,ωk)=Q(xk,ωk)Q^{k}(x^{k},\omega^{k})=Q(x^{k},\omega^{k}).
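A minimal sketch of the argmax step (26): every previously seen observation is matched with the stored dual vertex that yields the largest lower bound, which gives the value Q^{k}(x^{k},\omega) in (23). The data structures below are illustrative.

import numpy as np

def argmax_dual_vertex(Pi_k, r_omega, T_omega, x_k):
    # Pi_k: list of dual vertices discovered so far (1-D numpy arrays)
    # Returns the selected vertex pi(x^k, omega) and the value Q^k(x^k, omega), cf. (23) and (26).
    rhs = r_omega - T_omega @ x_k
    values = [pi @ rhs for pi in Pi_k]
    j = int(np.argmax(values))
    return Pi_k[j], values[j]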

Using {Qk(xk,ωj)}j=1k\{Q^{k}(x^{k},\omega^{j})\}_{j=1}^{k}, we solve a distribution separation problem (in Step 16):

k(xk)=max{ωΩkp(ω)Qk(xk,ω)|p(ω)𝔓^k}.\displaystyle\mathbb{Q}^{k}(x^{k})=\max~{}\bigg{\{}\sum_{\omega\in\Omega^{k}}p(\omega)Q^{k}(x^{k},\omega)~{}|~{}p(\omega)\in\widehat{\mathfrak{P}}^{k}\bigg{\}}. (27)

Let P^{k}=(p^{k}(\omega))_{\omega\in\Omega^{k}} denote the optimal solution of the above problem, which we identify as the maximal/extremal probability distribution. Since the problem is solved over measures \mathcal{M}^{k} that are defined only over the observed set \Omega^{k}, the maximal probability distribution has weights p^{k}(\omega^{j}) for \omega^{j}\in\Omega^{k}, and p^{k}(\omega)=0 for \omega\in\Omega\setminus\Omega^{k}. Notice that the problem in (3) differs from the distribution separation problem (27) as the latter uses the recourse function approximation Q^{k}(\cdot) and the approximate ambiguity set \widehat{\mathfrak{P}}^{k} as opposed to the true recourse function Q(\cdot) and ambiguity set \mathfrak{P}, respectively. For the ambiguity sets in §2.1, the distribution separation problem is a deterministic linear program. In general, the distribution separation problems associated with well-known ambiguity sets remain deterministic convex optimization problems [27], and off-the-shelf solvers can be used to obtain the extremal distribution.

In Step 17 of Algorithm 1, we use the dual vectors {π(xk,ωj)}jk\{\pi(x^{k},\omega^{j})\}_{j\leq k} and the maximal probability distribution PkP^{k} to generate a lower bounding affine function:

k(x)=maxP𝔓^k𝔼P[Qk(x,ω~)]ωjΩkpk(ωj)(π(xk,ωj))[r(ωj)T(ωj)x],\displaystyle\mathbb{Q}^{k}(x)=\max_{P\in\widehat{\mathfrak{P}}^{k}}\mathbb{E}_{P}[Q^{k}(x,\tilde{\omega})]\geq\sum_{\omega^{j}\in\Omega^{k}}p^{k}(\omega^{j})\cdot(\pi(x^{k},\omega^{j}))^{\top}[r(\omega^{j})-T(\omega^{j})x], (28)

for the worst case expected recourse function measured with respect to the maximal probability distribution Pk𝔓^kP^{k}\in\widehat{\mathfrak{P}}^{k} which is obtained by solving the distribution separation problem (27). We denote the coefficients of the affine function on the right-hand side of (28) by

αkk=ωjΩkpk(ωj)π(xk,ωj)r(ωj) and βkk=ωjΩkpk(ωj)T(ωj)π(xk,ωj),\displaystyle\alpha_{k}^{k}=\sum_{\omega^{j}\in\Omega^{k}}p^{k}(\omega^{j})\pi(x^{k},\omega^{j})^{\top}r(\omega^{j})\text{ and }\beta_{k}^{k}=-\sum_{\omega^{j}\in\Omega^{k}}p^{k}(\omega^{j})T(\omega^{j})^{\top}\pi(x^{k},\omega^{j}), (29)

and succinctly write the affine function as \ell_{k}^{k}(x)=\alpha_{k}^{k}+(\beta_{k}^{k})^{\top}x. Similar calculations are carried out using the incumbent solution \hat{x}^{k-1} to identify a maximal probability distribution and a lower bounding affine function, resulting in the affine function \hat{\ell}_{k}^{k}(x)=\hat{\alpha}_{k}^{k}+(\hat{\beta}_{k}^{k})^{\top}x.
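The coefficients in (29) are probability-weighted aggregations of the observation-wise supporting hyperplanes; a short sketch with illustrative data structures follows.

import numpy as np

def candidate_cut(p_k, duals, r_list, T_list):
    # p_k[j]   : extremal probability p^k(omega^j) from (27)
    # duals[j] : pi(x^k, omega^j); r_list[j], T_list[j]: subproblem data for omega^j
    alpha = sum(p * (pi @ r) for p, pi, r in zip(p_k, duals, r_list))
    beta = sum(-p * (T.T @ pi) for p, pi, T in zip(p_k, duals, T_list))
    return alpha, beta      # cut: ell_k^k(x) = alpha + beta' x, cf. (29)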

While the latest affine functions provide lower bounds on \mathbb{Q}^{k}, the affine functions generated in previous iterations are not guaranteed to lower bound \mathbb{Q}^{k}. To see this, let us consider the moment-based approximate ambiguity sets \{\widehat{\mathfrak{P}}^{k}_{\text{mom}}\}_{k\geq 1}. Let P^{j}_{\text{mom}}\in\widehat{\mathfrak{P}}^{j}_{\text{mom}} be the maximal distribution identified in an iteration j<k which was used to compute the affine function \ell_{j}^{j}(x). By assigning p^{j}(\omega)=0 for all new observations encountered after iteration j, i.e., \omega\in\Omega^{k}\setminus\Omega^{j}, we can construct a probability distribution \bar{P}=((p^{j}(\omega))_{\omega\in\Omega^{j}},(0)_{\omega\in\Omega^{k}\setminus\Omega^{j}})\in\mathbb{R}^{|\Omega^{k}|}_{+}. This reconstructed distribution satisfies \sum_{\omega\in\Omega^{k}}\bar{p}(\omega)=1. However, while it satisfies \sum_{\omega\in\Omega^{k}}\psi_{i}(\omega)\bar{p}(\omega)=\hat{b}^{j}_{i}, in general \hat{b}^{j}_{i}\neq\hat{b}^{k}_{i}, so the moment constraints defining \widehat{\mathfrak{P}}^{k}_{\text{mom}} are violated and \bar{P}\notin\widehat{\mathfrak{P}}^{k}_{\text{mom}}. In other words, while the coefficients (\alpha_{j}^{j},\beta_{j}^{j}) are \mathcal{F}^{j}-measurable, the corresponding measure is not feasible for the approximate ambiguity set \widehat{\mathfrak{P}}^{k}_{\text{mom}}. Therefore, \ell_{j}^{j}(x) is not a valid lower bound on \mathbb{Q}^{k}. The arguments for the Wasserstein distance-based approximate ambiguity set are more involved, but the persistence of a similar issue can be demonstrated.

In order to address this, we update the previously generated affine functions \ell^{k-1}_{j}(x)={\alpha}_{j}^{k-1}+({\beta}_{j}^{k-1})^{\top}x for j<k, as follows (Steps 19 - 22):

\alpha_{j}^{k}=\theta^{k}\alpha_{j}^{k-1},\ \ \beta_{j}^{k}=\theta^{k}\beta_{j}^{k-1},\text{ and }{\ell}^{k}_{j}(x)={\alpha}_{j}^{k}+({\beta}_{j}^{k})^{\top}x\qquad\text{ for all }j<k, (30)

such that jk(x)\ell^{k}_{j}(x) provides lower bound approximation of k(x)\mathbb{Q}^{k}(x) for all j{1,,k1}j\in\{1,\ldots,k-1\}. Similarly, we update the affine functions ^jk(x)\hat{\ell}^{k}_{j}(x), j<kj<k, associated with incumbent solution (Step 18). Recall that for 𝔓^mom\widehat{\mathfrak{P}}_{\text{mom}} and 𝔓^w\widehat{\mathfrak{P}}_{w}, the parameter θk=k1k\theta^{k}=\frac{k-1}{k} (Propositions 2.1 and 2.2). The candidate and the incumbent affine functions (kk(x)\ell_{k}^{k}(x) and ^kk(x)\hat{\ell}_{k}^{k}(x), respectively), as well as the updated collection of previously generated affine functions are used to build the set of affine functions which we will denote by k\mathcal{L}^{k} (Step 23). The lower bounding property of this first-stage objective function approximation is captured in the following result.

Theorem 3.1.

Under assumption (A2), the first-stage objective function approximation in (32) satisfies

fk(x)cx+k(x) for all x𝒳 and k1.\displaystyle f^{k}(x)\leq c^{\top}x+\mathbb{Q}^{k}(x)\text{ for all }x\in\mathcal{X}\text{ and }k\geq 1.
Proof.

For a non-empty approximate ambiguity set \widehat{\mathfrak{P}}^{1} of the ambiguity set \mathfrak{P}, the construction of the affine function ensures that \ell_{1}^{1}(x)\leq\mathbb{Q}^{1}(x). Assume that \ell(x)\leq\mathbb{Q}^{k-1}(x) for all \ell\in\mathcal{L}^{k-1} and k>1. The maximality of the probability distribution P^{k} implies that:

ωΩkpk(ω)Qk(x,ω)\displaystyle\sum_{\omega\in\Omega^{k}}p^{k}(\omega)Q^{k}(x,\omega) ωΩkp(ω)Qk(x,ω)P𝔓^k.\displaystyle\geq\sum_{\omega\in\Omega^{k}}p(\omega)Q^{k}(x,\omega)\qquad\forall P\in\widehat{\mathfrak{P}}^{k}.

Using the above and the monotone property of the approximate recourse functions (Q^{k}\geq Q^{k-1}, since \Pi^{k-1}\subseteq\Pi^{k}), we have

ωΩkpk(ω)Qk(x,ω)\displaystyle\sum_{\omega\in\Omega^{k}}p^{k}(\omega)Q^{k}(x,\omega) ωΩkp(ω)Qk1(x,ω)\displaystyle\geq\sum_{\omega\in\Omega^{k}}p(\omega)Q^{k-1}(x,\omega)
=ωΩkωkp(ω)Qk1(x,ω)+p(ωk)Qk1(x,ωk),\displaystyle=\sum_{\omega\in\Omega^{k}\setminus\omega^{k}}p(\omega)Q^{k-1}(x,\omega)+p(\omega^{k})Q^{k-1}(x,\omega^{k}), (31)

for all \{p(\omega)\}_{\omega\in\Omega^{k}}\in\widehat{\mathfrak{P}}^{k}. Based on the properties of \mathfrak{P} and \{\widehat{\mathfrak{P}}^{k}\}_{k\geq 1} (similar to Propositions 2.1 and 2.2), we know that for every P\in\widehat{\mathfrak{P}}^{k-1} we can construct a probability distribution in \widehat{\mathfrak{P}}^{k} using the mapping \Theta^{k} defined by (10). Considering a probability distribution P^{\prime}=\{p^{\prime}(\omega)\}_{\omega\in\Omega^{k-1}}\in\widehat{\mathfrak{P}}^{k-1} such that \Theta^{k}(P^{\prime})\in\widehat{\mathfrak{P}}^{k}, the inequality (31) reduces to

ωΩkpk(ω)Qk(x,ω)\displaystyle\sum_{\omega\in\Omega^{k}}p^{k}(\omega)Q^{k}(x,\omega) ωΩkωk[θkp(ω)Qk1(x,ω)]+[θkp(ωk)+(1θk)]Qk1(x,ωk)\displaystyle\geq\sum_{\omega\in\Omega^{k}\setminus\omega^{k}}[\theta^{k}p^{\prime}(\omega)Q^{k-1}(x,\omega)]+[\theta^{k}p^{\prime}(\omega^{k})+(1-\theta^{k})]Q^{k-1}(x,\omega^{k})
=θk[ωΩk1p(ω)Qk1(x,ω)]+(1θk)Qk1(x,ωk)\displaystyle=\theta^{k}\bigg{[}\sum_{\omega\in\Omega^{k-1}}p^{\prime}(\omega)Q^{k-1}(x,\omega)\bigg{]}+(1-\theta^{k})Q^{k-1}(x,\omega^{k})
θk[ωΩk1p(ω)Qk1(x,ω)].\displaystyle\geq\theta^{k}\bigg{[}\sum_{\omega\in\Omega^{k-1}}p^{\prime}(\omega)Q^{k-1}(x,\omega)\bigg{]}.

The last inequality is due to assumption (A2), i.e., Q(x,\omega^{k})\geq 0, and the construction of the recourse function approximation Q^{k} described in §2.2. Since \ell(x) lower bounds the term in the brackets, we have

ωΩkpk(ω)Qk(x,ω)\displaystyle\sum_{\omega\in\Omega^{k}}p^{k}(\omega)Q^{k}(x,\omega) θk(x).\displaystyle\geq\theta^{k}\ell(x).

Using the same arguments for all \ell\in\mathcal{L}^{k-1}, and the fact that \ell_{k}^{k}(x) and \hat{\ell}_{k}^{k}(x) are constructed as lower bounds on \mathbb{Q}^{k}, we have f^{k}(x)\leq c^{\top}x+\mathbb{Q}^{k}(x). This completes the proof by induction. ∎

Using the collection of affine functions \mathcal{L}^{k}, we update the approximation of the first-stage objective function in Step 24, as follows:

f^{k}(x)=c^{\top}x+\max_{j\in\mathcal{L}^{k}}~\{\alpha_{j}^{k}+(\beta_{j}^{k})^{\top}x\}. (32)

Once the approximation is updated, the performance of the candidate solution is compared relative to the incumbent solution (Steps 2532). This comparison is performed by verifying if the following inequality

f^{k}(x^{k})-f^{k}(\hat{x}^{k-1})<\gamma[f^{k-1}(x^{k})-f^{k-1}(\hat{x}^{k-1})], (33)

where the parameter \gamma\in(0,1], is satisfied. If so, the candidate solution is designated as the next incumbent solution, i.e., \hat{x}^{k}=x^{k}. If the inequality is not satisfied, the previous incumbent solution is retained, i.e., \hat{x}^{k}=\hat{x}^{k-1}. This completes an iteration of the DRSD method.
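The remaining bookkeeping of an iteration, namely rescaling the earlier cuts as in (30), evaluating the piecewise-linear model (32), and applying the incumbent test (33), amounts to a few lines. The sketch below uses illustrative data structures (cuts stored as (alpha, beta) pairs) and assumes theta^k = (k-1)/k, as in Propositions 2.1 and 2.2.

import numpy as np

def rescale_cuts(cuts, theta_k):
    # (30): previously generated cuts remain valid lower bounds on Q^k after scaling by theta_k
    return [(theta_k * alpha, theta_k * beta) for alpha, beta in cuts]

def f_approx(x, c, cuts):
    # (32): piecewise-linear approximation of the first-stage objective
    return c @ x + max(alpha + beta @ x for alpha, beta in cuts)

def incumbent_update(x_cand, x_inc, c, cuts_k, cuts_km1, gamma):
    # (33): accept the candidate if it improves the new model f^k by at least a
    # gamma-fraction of the improvement predicted by the previous model f^{k-1}
    lhs = f_approx(x_cand, c, cuts_k) - f_approx(x_inc, c, cuts_k)
    rhs = gamma * (f_approx(x_cand, c, cuts_km1) - f_approx(x_inc, c, cuts_km1))
    return x_cand if lhs < rhs else x_inc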

Remark 3.1.

The algorithm design can be extended to incorporate 2-DRLP models for which the relatively complete recourse assumption in (A2) and/or assumption (A3) is not satisfied. For problems where the relatively complete recourse condition is not met, a candidate solution may render one or more subproblems infeasible. In this case, the dual extreme rays can be used to compute a feasibility cut that is included in the first-stage approximation. The argmax procedure in (26) is valid only when assumption (A3) is satisfied. For problems where the uncertainty also affects the cost coefficients, the argmax procedure presented in [17] can be utilized. These algorithmic enhancements can be incorporated without affecting the convergence properties of DRSD that we present in the next section.

4 Convergence Analysis

In this section we provide the convergence result of the sequential sampling-based approach to solve DRO problems. In order to facilitate the exposition of our theoretical results, we will define certain quantities for notational convenience that are not necessarily computed during the course of the algorithm in the form presented in the previous section. Our convergence results are built upon stability analyses presented in [39] and convergence analysis of the SD algorithm in [20].

We define a function over the approximate ambiguity set using the true recourse function Q(\cdot,\cdot), that is,

gk(x):=cx+maxP𝔓^k𝔼P[Q(x,ω~)]\displaystyle g^{k}(x):=c^{\top}x+\max_{P\in\widehat{\mathfrak{P}}^{k}}\mathbb{E}_{P}[Q(x,\tilde{\omega})] (34)

for a fixed x𝒳x\in\mathcal{X}. We begin by analyzing the behavior of the sequence {gk}k1\{g^{k}\}_{k\geq 1} as kk\rightarrow\infty. In particular, we will assess the sequence of function evaluations at a converging subsequence of first-stage solutions. The result is captured in the following proposition.

Proposition 4.1.

Suppose {x^kn}\{\hat{x}^{k_{n}}\} denotes a subsequence of {x^k}\{\hat{x}^{k}\} such that x^knx¯\hat{x}^{k_{n}}\rightarrow\bar{x}, then limn|gkn(x^kn)f(x¯)|=0\lim_{n\rightarrow\infty}|g^{k_{n}}(\hat{x}^{k_{n}})-f(\bar{x})|=0, with probability one.

Proof.

Consider the ambiguity set \widehat{\mathfrak{P}}^{k}. For i=1,2 and x_{i}\in\mathcal{X}, let P(x_{i})\in\mathrm{argmax}_{P\in\widehat{\mathfrak{P}}^{k}}\{\mathbb{E}_{P}[Q(x_{i},\tilde{\omega})]\}. Then,

g^{k}(x_{1}) =\ c^{\top}x_{1}+\mathbb{E}_{P(x_{1})}[Q(x_{1},\tilde{\omega})]
\geq\ c^{\top}x_{1}+\mathbb{E}_{P(x_{2})}[Q(x_{1},\tilde{\omega})]
=\ c^{\top}x_{2}+\mathbb{E}_{P(x_{2})}[Q(x_{2},\tilde{\omega})]+c^{\top}(x_{1}-x_{2})+\mathbb{E}_{P(x_{2})}[Q(x_{1},\tilde{\omega})]-\mathbb{E}_{P(x_{2})}[Q(x_{2},\tilde{\omega})]
=\ g^{k}(x_{2})+c^{\top}(x_{1}-x_{2})+\mathbb{E}_{P(x_{2})}[Q(x_{1},\tilde{\omega})]-\mathbb{E}_{P(x_{2})}[Q(x_{2},\tilde{\omega})].

The inequality in the above follows from optimality of P(x1)P(x_{1}). The above implies that

gk(x2)gk(x1)\displaystyle g^{k}(x_{2})-g^{k}(x_{1})\leq~{} c(x2x1)+𝔼P(x2)[Q(x2,ω~)]𝔼P(x2)[Q(x1,ω~)]\displaystyle c^{\top}(x_{2}-x_{1})+\mathbb{E}_{P(x_{2})}[Q(x_{2},\tilde{\omega})]-\mathbb{E}_{P(x_{2})}[Q(x_{1},\tilde{\omega})]
\displaystyle\leq~{} |c(x2x1)|+|𝔼P(x2)[Q(x2,ω~)]𝔼P(x2)[Q(x1,ω~)]|\displaystyle|c^{\top}(x_{2}-x_{1})|+\bigg{|}\mathbb{E}_{P(x_{2})}[Q(x_{2},\tilde{\omega})]-\mathbb{E}_{P(x_{2})}[Q(x_{1},\tilde{\omega})]\bigg{|}
\displaystyle\leq~{} (c+C)x2x1.\displaystyle(\|c\|+C)\|x_{2}-x_{1}\|. (35)

The second relationship is due to the triangle inequality. The third inequality follows from the uniform Lipschitz continuity of the recourse function Q(x,ω~)Q(x,\tilde{\omega}), with probability one, under assumption (A2), which implies that there exists a constant CC such that |𝔼P[Q(x1,ω~)]𝔼P[Q(x2,ω~)]|Cx1x2|\mathbb{E}_{P}[Q(x_{1},\tilde{\omega})]-\mathbb{E}_{P}[Q(x_{2},\tilde{\omega})]|\leq C\|x_{1}-x_{2}\| for any PP. Therefore, the family of functions {gk}\{g^{k}\} is equicontinuous on 𝒳\mathcal{X}. Starting with x2x_{2} and using the same arguments, we have

gk(x1)gk(x2)\displaystyle g^{k}(x_{1})-g^{k}(x_{2})\leq~{} (c+C)x2x1.\displaystyle(\|c\|+C)\|x_{2}-x_{1}\|. (36)

Now consider ambiguity sets 𝔓\mathfrak{P} and 𝔓^k\widehat{\mathfrak{P}}^{k}. Note that

|f(x)gk(x)|=\displaystyle|f(x)-g^{k}(x)|= |maxP𝔓𝔼P[Q(x,ω~)]maxP𝔓^k𝔼P[Q(x,ω~)]|x𝒳\displaystyle~{}\bigg{|}\max_{P\in\mathfrak{P}}\mathbb{E}_{P}[Q(x,\tilde{\omega})]-\max_{P^{\prime}\in\widehat{\mathfrak{P}}^{k}}\mathbb{E}_{P^{\prime}}[Q(x,\tilde{\omega})]\bigg{|}\qquad\forall x\in\mathcal{X}
\displaystyle\leq maxP𝔓minP𝔓^k|𝔼P[Q(x,ω~)]𝔼P[Q(x,ω~)]|x𝒳\displaystyle~{}\max_{P\in\mathfrak{P}}\min_{P^{\prime}\in\widehat{\mathfrak{P}}^{k}}\big{|}\mathbb{E}_{P}[Q(x,\tilde{\omega})]-\mathbb{E}_{P^{\prime}}[Q(x,\tilde{\omega})]\big{|}\qquad\forall x\in\mathcal{X}
\displaystyle\leq maxP𝔓minP𝔓^ksupx𝒳|𝔼P[Q(x,ω~)]𝔼P[Q(x,ω~)]|.\displaystyle~{}\max_{P\in\mathfrak{P}}\min_{P^{\prime}\in\widehat{\mathfrak{P}}^{k}}\sup_{x\in\mathcal{X}}\big{|}\mathbb{E}_{P}[Q(x,\tilde{\omega})]-\mathbb{E}_{P^{\prime}}[Q(x,\tilde{\omega})]\big{|}.

Using the definition of deviation between ambiguity sets 𝔓\mathfrak{P} and 𝔓^k\widehat{\mathfrak{P}}^{k} as well as its relationship with Hausdorff distance, we have

|f(x)gk(x)|𝔻(𝔓,𝔓^k)(𝔓,𝔓^k).\displaystyle|f(x)-g^{k}(x)|\leq\mathbb{D}(\mathfrak{P},\widehat{\mathfrak{P}}^{k})\leq\mathbb{H}(\mathfrak{P},\widehat{\mathfrak{P}}^{k}). (37)

For x^kn\hat{x}^{k_{n}} and x¯\bar{x}, combining (35), (36), and (37), we have

|f(x¯)gkn(x^kn)|\displaystyle|f(\bar{x})-g^{k_{n}}(\hat{x}^{k_{n}})|\leq |f(x¯)gkn(x¯)|+|gkn(x¯)gkn(x^kn)|\displaystyle~{}|f(\bar{x})-g^{k_{n}}(\bar{x})|+|g^{k_{n}}(\bar{x})-g^{k_{n}}(\hat{x}^{k_{n}})|
\displaystyle\leq (𝔓,𝔓^kn)+(c+C)x¯x^kn.\displaystyle~{}\mathbb{H}(\mathfrak{P},\widehat{\mathfrak{P}}^{k_{n}})+(\|c\|+C)\|\bar{x}-\hat{x}^{k_{n}}\|.

As nn\rightarrow\infty, the family of ambiguity sets considered satisfies (𝔓,𝔓^kn)0\mathbb{H}(\mathfrak{P},\widehat{\mathfrak{P}}^{k_{n}})\rightarrow 0, and by hypothesis x^knx¯\hat{x}^{k_{n}}\rightarrow\bar{x}. Therefore, we conclude that gkn(x^kn)f(x¯)g^{k_{n}}(\hat{x}^{k_{n}})\rightarrow f(\bar{x}) as nn\rightarrow\infty, with probability one. ∎

Notice that the behavior of the approximate ambiguity sets defined in §2.1, in particular the condition (𝔓,𝔓^k)0\mathbb{H}(\mathfrak{P},\widehat{\mathfrak{P}}^{k})\rightarrow 0 as kk\rightarrow\infty, plays a central role in the above proof. Recall that for the moment-based and Wasserstein distance-based ambiguity sets, this condition is established in Propositions 2.1 and 2.2, respectively. It is also worthwhile to note that, under the foregoing conditions, (37) implies uniform convergence of the sequence {gk}\{g^{k}\} to ff, with probability one.

The above result applies to any algorithm that generates a converging sequence of iterates {xk}\{x^{k}\} and a corresponding sequence of extremal distributions. Such an algorithm is guaranteed to exhibit convergence to the optimal distributionally robust objective function value. Therefore, this result is applicable to the sequence of instances constructed using external sampling and solved, for example, using reformulation-based methods. Such an approach was adopted in [28] and [39]. The analysis in [28] relies upon two rather restrictive assumptions. The first assumption is that for all P𝔓P\in\mathfrak{P} there exists a sequence of measures {Pk}\{P^{k}\} such that Pk𝔓^kP^{k}\in\widehat{\mathfrak{P}}^{k} and converges weakly to PP. The second assumption requires the approximate ambiguity sets to be strict subsets of the true ambiguity set, i.e., 𝔓^k𝔓\widehat{\mathfrak{P}}^{k}\subset\mathfrak{P}. Both of these assumptions are very difficult to satisfy in a data-driven setting (also see Remark 2.1).

The analysis in [39], on the other hand, does not make the above assumptions. Therefore, their analysis is more broadly applicable in settings where external sampling is used to generate Ωk\Omega^{k}. DRO instances are constructed based on statistics estimated using Ωk\Omega^{k} and solved to optimality for each k1k\geq 1. They show the convergence of optimal objective function values and optimal solution sets of approximate problems to the optimal objective function value and solutions of the true DRO problem, respectively. In this regard, the result in Proposition 4.1 can alternatively be derived using Theorem 1(i) in [39]. While the function gkg^{k} defined in (34) is not computed during the course of the sequential sampling algorithm, it provides the necessary benchmark for our convergence analysis.

One of the main points of departure in our analysis stems from the fact that we use objective function approximations that are built from the approximate recourse function in (23). In order to study the piecewise affine approximation of the first-stage objective function, we introduce another benchmark function

ϕk(x):=cx+maxP𝔓^k𝔼P[Qk(x,ω~)].\displaystyle\phi^{k}(x):=c^{\top}x+\max_{P\in\widehat{\mathfrak{P}}^{k}}\mathbb{E}_{P}[Q^{k}(x,\tilde{\omega})]. (38)

Notice that the above function uses the approximation of the ambiguity set (as in (34)) as well as the approximation of the recourse function. This construction ensures that ϕk(x)gk(x)\phi^{k}(x)\leq g^{k}(x) for all x𝒳x\in\mathcal{X} and k1k\geq 1, which follows from the fact that Qk(x,ω~)Q(x,ω~)Q^{k}(x,\tilde{\omega})\leq Q(x,\tilde{\omega}), almost surely. Further, the result in Theorem 3.1 ensures that fk(x)ϕk(x)f^{k}(x)\leq\phi^{k}(x). Putting these together, we obtain the following relationship:

fk(x)ϕk(x)gk(x)x𝒳,k1.\displaystyle f^{k}(x)\leq\phi^{k}(x)\leq g^{k}(x)\qquad\forall x\in\mathcal{X},k\geq 1. (39)

While the previous proposition focused on the upper bound in the above relationship, the following results present the asymptotic behavior of the sequence {fk}\{f^{k}\}.

Lemma 4.2.

Suppose {x^kn}\{\hat{x}^{k_{n}}\} denotes a subsequence of {x^k}\{\hat{x}^{k}\} such that x^knx¯\hat{x}^{k_{n}}\rightarrow\bar{x}. Then limnfkn(x^kn)f(x¯)=0\lim_{n\rightarrow\infty}f^{k_{n}}(\hat{x}^{k_{n}})-f(\bar{x})=0, with probability one.

Proof.

From Proposition 4.1, we have limn|f(x¯)gkn(x^kn)|=0\lim_{n\rightarrow\infty}|f(\bar{x})-g^{k_{n}}(\hat{x}^{k_{n}})|=0. Therefore, for any ϵ1>0\epsilon_{1}>0 there exists N1<N_{1}<\infty such that

|maxP𝔓𝔼P[Q(x¯,ω~)]maxP𝔓^kn𝔼P[Q(x^kn,ω~)]|<ϵ1/2n>N1.\displaystyle\bigg{|}\max_{P\in\mathfrak{P}}\mathbb{E}_{P}[Q(\bar{x},\tilde{\omega})]-\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[Q(\hat{x}^{k_{n}},\tilde{\omega})]\bigg{|}<\epsilon_{1}/2\quad\forall n>N_{1}. (40)

Now consider,

maxP𝔓^kn𝔼P[Q(x^kn,ω~)]\displaystyle\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[Q(\hat{x}^{k_{n}},\tilde{\omega})] maxP𝔓^kn𝔼P[Qkn(x^kn,ω~)]\displaystyle-\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[Q^{k_{n}}(\hat{x}^{k_{n}},\tilde{\omega})]
=\displaystyle= maxP𝔓^kn(𝔼P[Q(x^kn,ω~)]𝔼P[Qkn(x^kn,ω~)])\displaystyle\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\big{(}\mathbb{E}_{P}[Q(\hat{x}^{k_{n}},\tilde{\omega})]-\mathbb{E}_{P}[Q^{k_{n}}(\hat{x}^{k_{n}},\tilde{\omega})]\big{)}
\displaystyle\leq maxP𝔓^kn|𝔼P[Q(x^kn,ω~)]𝔼P[Qkn(x^kn,ω~)]|\displaystyle~{}\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\big{|}\mathbb{E}_{P}[Q(\hat{x}^{k_{n}},\tilde{\omega})]-\mathbb{E}_{P}[Q^{k_{n}}(\hat{x}^{k_{n}},\tilde{\omega})]\big{|}
=\displaystyle= maxP𝔓^kn𝔼P[|Q(x^kn,ω~)Qkn(x^kn,ω~)|].\displaystyle~{}\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[|Q(\hat{x}^{k_{n}},\tilde{\omega})-Q^{k_{n}}(\hat{x}^{k_{n}},\tilde{\omega})|].

The last equality follows from the fact that Q(x,ω~)Qk(x,ω~)Q(x,\tilde{\omega})\geq Q^{k}(x,\tilde{\omega}) for all x𝒳x\in\mathcal{X} and k1k\geq 1, almost surely. Moreover, because of the uniform convergence of {Qk}\{Q^{k}\} (Proposition 2.3), the sequence of approximate functions {ϕk}\{\phi^{k}\} also converges uniformly. This implies that there exists N2<N_{2}<\infty such that, for all n>N2n>N_{2},

|maxP𝔓^kn𝔼P[Q(x^kn,ω~)]maxP𝔓^kn𝔼P[Qkn(x^kn,ω~)]|<ϵ1/2.\displaystyle\bigg{|}\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[Q(\hat{x}^{k_{n}},\tilde{\omega})]-\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[Q^{k_{n}}(\hat{x}^{k_{n}},\tilde{\omega})]\bigg{|}<\epsilon_{1}/2. (41)

Let N=max{N1,N2}N=\max\{N_{1},N_{2}\}. Using (40) and (41), we have for all n>Nn>N

|maxP𝔓𝔼P[Q(x¯,ω~)]maxP𝔓^kn𝔼P[Qkn(xkn,ω~)]|<ϵ1.\displaystyle\bigg{|}\max_{P\in\mathfrak{P}}\mathbb{E}_{P}[Q(\bar{x},\tilde{\omega})]-\max_{P\in\widehat{\mathfrak{P}}^{k_{n}}}\mathbb{E}_{P}[Q^{k_{n}}(x^{k_{n}},\tilde{\omega})]\bigg{|}<\epsilon_{1}.

This implies that |f(x¯)ϕkn(x^kn)|0|f(\bar{x})-\phi^{k_{n}}(\hat{x}^{k_{n}})|\rightarrow 0 as nn\rightarrow\infty. Based on (26), we have Qkn(x^kn,ω)=(π(x^kn,ω))[r(ω)T(ω)x^kn]Q^{k_{n}}(\hat{x}^{k_{n}},\omega)=(\pi(\hat{x}^{k_{n}},\omega))^{\top}[r(\omega)-T(\omega)\hat{x}^{k_{n}}] and Qkn(x,ω)(π(x^kn,ω))[r(ω)T(ω)x]Q^{k_{n}}(x,\omega)\geq(\pi(\hat{x}^{k_{n}},\omega))^{\top}[r(\omega)-T(\omega)x] for all x𝒳x\in\mathcal{X} and ωΩkn\omega\in\Omega^{k_{n}}. Let

αknkn=ωΩknpkn(ω)(π(x^kn,ω))r(ω) and βknkn=ωΩknpkn(ω)T(ω)π(x^kn,ω),\displaystyle\alpha^{k_{n}}_{k_{n}}=\sum_{\omega\in\Omega^{k_{n}}}p^{k_{n}}(\omega)(\pi(\hat{x}^{k_{n}},\omega))^{\top}r(\omega)\text{ and }\beta^{k_{n}}_{k_{n}}=-\sum_{\omega\in\Omega^{k_{n}}}p^{k_{n}}(\omega)T(\omega)^{\top}\pi(\hat{x}^{k_{n}},\omega),

where {pkn(ω)}ωΩkn\{p^{k_{n}}(\omega)\}_{\omega\in\Omega^{k_{n}}} is an optimal solution of the distribution separation problem (27) with index kk replaced by knk_{n}. Then, the affine function αknkn+(c+βknkn)x\alpha^{k_{n}}_{k_{n}}+(c+\beta^{k_{n}}_{k_{n}})^{\top}x provides a lower-bounding approximation of the function ϕkn(x)\phi^{k_{n}}(x), i.e.,

ϕkn(x)αknkn+(c+βknkn)x for all x𝒳,\displaystyle\phi^{k_{n}}(x)\geq\alpha^{k_{n}}_{k_{n}}+(c+\beta^{k_{n}}_{k_{n}})^{\top}x\qquad\text{ for all }x\in\mathcal{X},

with equality holding at x^kn\hat{x}^{k_{n}}. Therefore, using the definition of fk(x)f^{k}(x), we have limn[αknkn+(c+βknkn)x^kn]=limnfkn(x^kn)=limnϕkn(x^kn)=f(x¯)\lim_{n\rightarrow\infty}[\alpha^{k_{n}}_{k_{n}}+(c+\beta^{k_{n}}_{k_{n}})^{\top}\hat{x}^{k_{n}}]=\lim_{n\rightarrow\infty}f^{k_{n}}(\hat{x}^{k_{n}})=\lim_{n\rightarrow\infty}\phi^{k_{n}}(\hat{x}^{k_{n}})=f(\bar{x}), almost surely. This completes the proof. ∎
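The cut coefficients used above are probability-weighted aggregates of the dual vectors obtained from the argmax procedure and the distribution separation problem. A minimal sketch of this aggregation is given below; the dictionaries p, pi, r, and T that hold the separation probabilities, duals, right-hand sides, and technology matrices are assumptions for illustration.

import numpy as np

def aggregate_cut(p, pi, r, T):
    """Aggregate dual information into an affine cut alpha + beta^T x (sketch).

    p  : dict omega -> probability p^k(omega) from the separation problem.
    pi : dict omega -> dual vector pi(xhat, omega) from the argmax procedure.
    r  : dict omega -> right-hand side r(omega); T : dict omega -> matrix T(omega).
    """
    alpha = sum(p[w] * float(pi[w] @ r[w]) for w in p)
    beta = -sum(p[w] * (T[w].T @ pi[w]) for w in p)
    # The first-stage approximation then uses the affine function alpha + (c + beta)^T x.
    return alpha, beta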

Lemma 4.2 characterizes the behavior of the sequence of affine functions generated during the course of the algorithm. In particular, the sequence {fk(x^k)}k1\{f^{k}(\hat{x}^{k})\}_{k\geq 1} accumulates at the objective value of the original DRO problem (2). Recall that the candidate solution xkx^{k} is a minimizer of fk1(x)f^{k-1}(x), and an affine function is generated at this point such that fk(xk)=ϕk(xk)f^{k}(x^{k})=\phi^{k}(x^{k}) in every iteration k1k\geq 1. However, due to the update procedure in (30), the quality of the estimates at xkx^{k} gradually diminishes, potentially leading to a large value of (ϕk(xk)fk(xk))(\phi^{k}(x^{k})-f^{k}(x^{k})) as kk increases. This emphasizes the role of the incumbent solution and of computing the incumbent affine function ^(x)\hat{\ell}(x) during the course of the algorithm. By updating the incumbent solution and frequently reevaluating the affine functions at the incumbent solution, we ensure that the approximation is “sufficiently good” in a neighborhood of the incumbent solution. In order to assess the improvement in approximation quality, we define

δk:=fk1(xk)fk1(x^k1)0k1.\displaystyle\delta^{k}:=f^{k-1}(x^{k})-f^{k-1}(\hat{x}^{k-1})\leq 0\qquad\forall k\geq 1. (42)

The inequality follows from the optimality of xkx^{k} with respect to the objective function fk1f^{k-1}. The quantity δk\delta^{k} measures the error in objective function estimate at the candidate solution with respect to the estimate at the current incumbent solution. The following result captures the asymptotic behavior of this error term.

Lemma 4.3.

Let 𝒦\mathcal{K} denote the sequence of iterations at which the incumbent solution changes. There exists a subsequence of iterations, denoted 𝒦𝒦\mathcal{K}^{*}\subseteq\mathcal{K}, such that limk𝒦δk=0\lim_{k\in\mathcal{K}^{*}}\delta^{k}=0, with probability one.

Proof.

We will consider two cases depending on whether the set 𝒦\mathcal{K} is finite or not. First, suppose that |𝒦||\mathcal{K}| is not finite. By the incumbent update rule and (42),

fkn(xkn)fkn(x^kn1)<γ[fkn1(xkn)fkn1(x^kn1)]=γδkn0kn𝒦.\displaystyle f^{k_{n}}(x^{k_{n}})-f^{k_{n}}(\hat{x}^{k_{n}-1})<\gamma[f^{k_{n}-1}(x^{k_{n}})-f^{{k_{n}}-1}(\hat{x}^{k_{n}-1})]=\gamma\delta^{k_{n}}\leq 0\qquad\forall k_{n}\in\mathcal{K}.

Consequently, we have lim supnδkn0\limsup_{n\rightarrow\infty}\delta^{k_{n}}\leq 0. Since xkn=x^knx^{k_{n}}=\hat{x}^{k_{n}} and x^kn1=x^kn1\hat{x}^{k_{n}-1}=\hat{x}^{k_{n-1}}, we have

fkn(x^kn)fkn(x^kn1)γδkn0.\displaystyle f^{k_{n}}(\hat{x}^{k_{n}})-f^{k_{n}}(\hat{x}^{k_{n-1}})\leq\gamma\delta^{k_{n}}\leq 0.

The left-hand side of the above inequality captures the improvement in the objective function value at the current incumbent solution over the previous incumbent solution. Using the above, we can write the average improvement attained over nn incumbent changes as

1nj=1n[fkj(x^kj)fkj(x^kj1)]1nj=1nγδkj0 for all n.\displaystyle\frac{1}{n}\sum_{j=1}^{n}\bigg{[}f^{k_{j}}(\hat{x}^{k_{j}})-f^{k_{j}}(\hat{x}^{k_{j-1}})\bigg{]}\leq\frac{1}{n}\sum_{j=1}^{n}\gamma\delta^{k_{j}}\leq 0\qquad\text{ for all }n.

This implies that

1n(fkn(x^kn)fk1(x^k0))(a)+1n[j=1n1(fkj(x^kj)fkj+1(x^kj))(b)]1nj=1nγδkj0,n.\displaystyle\frac{1}{n}\underbrace{\bigg{(}f^{k_{n}}(\hat{x}^{k_{n}})-f^{k_{1}}(\hat{x}^{k_{0}})\bigg{)}}_{(a)}+\frac{1}{n}\bigg{[}\sum_{j=1}^{n-1}\underbrace{\bigg{(}f^{k_{j}}(\hat{x}^{k_{j}})-f^{k_{j+1}}(\hat{x}^{k_{j}})\bigg{)}}_{(b)}\bigg{]}\leq\frac{1}{n}\sum_{j=1}^{n}\gamma\delta^{k_{j}}\leq 0,\ \ \forall n.

Under the assumption that the dual feasible region is non-empty and bounded (which is ensured by relatively complete recourse, (A2)), {fk}\{f^{k}\} is a sequence of Lipschitz continuous functions. This, along with the compactness of 𝒳\mathcal{X} (A1), implies that fkn(x^kn)fk1(x^k0)f^{k_{n}}(\hat{x}^{k_{n}})-f^{k_{1}}(\hat{x}^{k_{0}}) is bounded. Hence, the term (a) vanishes as nn\rightarrow\infty. The term (b) converges to zero, with probability one, due to the uniform convergence of {fk}\{f^{k}\}. Since γ(0,1]\gamma\in(0,1], we have

limn1nj=1nδkj=0\displaystyle\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{j=1}^{n}\delta^{k_{j}}=0

with probability one. Further,

limn1nj=1nδkjlim supnδkn0.\displaystyle\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{j=1}^{n}\delta^{k_{j}}\leq\limsup_{n\rightarrow\infty}\delta^{k_{n}}\leq 0.

Thus, there exists a subsequence indexed by the set 𝒦\mathcal{K}^{*} such that limk𝒦δk=0\lim_{k\in\mathcal{K}^{*}}\delta^{k}=0, with probability one.

Now, if |𝒦||\mathcal{K}| is finite, then there exist x^\hat{x} and K<K<\infty such that x^k=x^\hat{x}^{k}=\hat{x} for all kKk\geq K. Notice that if limk𝒦xk=x¯\lim_{k\in\mathcal{K}^{*}}x^{k}=\bar{x} along some subsequence of iterations indexed by 𝒦\mathcal{K}^{*}, the uniform convergence of the sequence {fk}\{f^{k}\} and Lemma 4.2 ensure that

limk𝒦fk(xk)=limk𝒦fk1(xk)=f(x¯)\displaystyle\lim_{k\in\mathcal{K}^{*}}f^{k}(x^{k})=\lim_{k\in\mathcal{K}^{*}}f^{k-1}(x^{k})=f(\bar{x}) (43a)
limk𝒦fk(x^)=limk𝒦fk1(x^)=f(x^).\displaystyle\lim_{k\in\mathcal{K}^{*}}f^{k}(\hat{x})=\lim_{k\in\mathcal{K}^{*}}f^{k-1}(\hat{x})=f(\hat{x}). (43b)

Further, since the incumbent is not updated in iterations kKk\geq K, we must have from the update rule in (33) that

fk(xk)fk(x^)γ[fk1(xk)fk1(x^)]=γδk for all kK.\displaystyle f^{k}(x^{k})-f^{k}(\hat{x})\geq\gamma[f^{k-1}(x^{k})-f^{k-1}(\hat{x})]=\gamma\delta^{k}\quad\text{ for all }k\geq K.

Using (43), we have

limk𝒦(fk(xk)fk(x^))\displaystyle\lim_{k\in\mathcal{K}^{*}}\big{(}f^{k}(x^{k})-f^{k}(\hat{x})\big{)} γlimk𝒦(fk1(xk)fk1(x^)),\displaystyle\geq\gamma\lim_{k\in\mathcal{K}^{*}}\big{(}f^{k-1}(x^{k})-f^{k-1}(\hat{x})\big{)},

which implies

f(x¯)f(x^)\displaystyle f(\bar{x})-f(\hat{x}) γ(f(x¯)f(x^)).\displaystyle\geq\gamma(f(\bar{x})-f(\hat{x})).

Moreover, since δk0\delta^{k}\leq 0 for all kk, taking limits along 𝒦\mathcal{K}^{*} and using (43) yields f(x¯)f(x^)0f(\bar{x})-f(\hat{x})\leq 0. Combining this with the above inequality and γ(0,1]\gamma\in(0,1], we must have f(x¯)f(x^)=0f(\bar{x})-f(\hat{x})=0. Hence, limk𝒦δk=f(x¯)f(x^)=0\lim_{k\in\mathcal{K}^{*}}\delta^{k}=f(\bar{x})-f(\hat{x})=0, with probability one. ∎

Equipped with the results in Lemmas 4.2 and 4.3, we state the main theorem, which establishes the existence of a subsequence of the incumbent solution sequence for which every accumulation point is an optimal solution of (2).

Theorem 4.4.

Let {xk}k=1\{x^{k}\}_{k=1}^{\infty} and {x^k}k=1\{\hat{x}^{k}\}_{k=1}^{\infty} be the sequences of candidate and incumbent solutions generated by the algorithm. There exists a subsequence {x^k}k𝒦\{\hat{x}^{k}\}_{k\in\mathcal{K}} for which every accumulation point is an optimal solution of the 2-DRLP (2), with probability one.

Proof.

Let x𝒳x^{*}\in\mathcal{X} be an optimal solution of (2). Consider a subsequence indexed by 𝒦\mathcal{K} such that limk𝒦x^k=x¯\lim_{k\in\mathcal{K}}\hat{x}^{k}=\bar{x}. The compactness of 𝒳\mathcal{X} ensures the existence of such an accumulation point x¯𝒳\bar{x}\in\mathcal{X}, and therefore,

f(x)f(x¯).\displaystyle f(x^{*})\leq f(\bar{x}). (44)

From Theorem 3.1, we have

fk(x)\displaystyle f^{k}(x)\leq~{} cx+k(x)\displaystyle c^{\top}x+\mathbb{Q}^{k}(x)
\displaystyle\leq~{} cx+maxP𝔓^k𝔼P[Q(x,ω~)]=gk(x)k,x𝒳.\displaystyle c^{\top}x+\max_{P\in\widehat{\mathfrak{P}}^{k}}\mathbb{E}_{P}[Q(x,\tilde{\omega})]=g^{k}(x)\qquad\forall k,x\in\mathcal{X}.

Thus, using the uniform convergence of {gk}\{g^{k}\} (Proposition 4.1) we have

lim supk𝒦fk(x)limk𝒦gk(x)=f(x)\displaystyle\limsup_{k\in\mathcal{K}^{\prime}}f^{k}(x^{*})\leq\lim_{k\in\mathcal{K}^{\prime}}g^{k}(x^{*})=f(x^{*}) (45)

for all subsequences indexed by 𝒦{1,2,}\mathcal{K}^{\prime}\subseteq\{1,2,\ldots\}, with probability one. Recall that,

δk=fk1(xk)fk1(x^k1)\displaystyle\delta^{k}=f^{{k}-1}(x^{k})-f^{{k}-1}(\hat{x}^{{k}-1}) fk1(x)fk1(x^k1)for all k1.\displaystyle\leq f^{{k}-1}(x^{*})-f^{{k}-1}(\hat{x}^{{k}-1})\qquad\text{for all }k\geq 1.

The inequality in the above follows from the optimality of xkx^{k} with respect to fk1(x)f^{k-1}(x). Taking limit over 𝒦\mathcal{K}, we have

limk𝒦δk\displaystyle\lim_{k\in\mathcal{K}}\delta^{k}\leq~{} limk𝒦(fk1(x)fk1(x^k1))\displaystyle\lim_{k\in\mathcal{K}}\big{(}f^{k-1}(x^{*})-f^{{k}-1}(\hat{x}^{{k}-1})\big{)}
\displaystyle\leq~{} lim supk𝒦fk1(x)lim infk𝒦fk1(x^k1)\displaystyle\limsup_{k\in\mathcal{K}}f^{k-1}(x^{*})-\liminf_{k\in\mathcal{K}}f^{{k}-1}(\hat{x}^{{k}-1})
\displaystyle\leq~{} f(x)f(x¯).\displaystyle f(x^{*})-f(\bar{x}).

The last inequality follows from (45) and limk𝒦fk1(x^k1)=f(x¯)\lim_{k\in\mathcal{K}}f^{k-1}(\hat{x}^{k-1})=f(\bar{x}) (Lemma 4.2). From Lemma 4.3, there exists a subsequence indexed by 𝒦𝒦\mathcal{K}^{*}\subseteq\mathcal{K} such that limk𝒦δk=0\lim_{k\in\mathcal{K}^{*}}\delta^{k}=0. Therefore, since {x^k}k𝒦x¯\{\hat{x}^{k}\}_{k\in\mathcal{K}^{*}}\rightarrow\bar{x}, we have

f(x)f(x¯)0.\displaystyle f(x^{*})-f(\bar{x})\geq 0.

Using (44) and the above inequality, we conclude that x¯\bar{x} is an optimal solution with probability one. ∎

5 Computational Experiment

In this section, we evaluate the effectiveness and efficiency of the DRSD method in solving 2-DRLP problems. For our preliminary experiments, we consider 2-DRLP problems with the moment-based ambiguity set 𝔓mom\mathfrak{P}_{\text{mom}} defined by the first two moments (q=2). We report results from computational experiments conducted on four well-known SP test problems: CEP, PGP, BAA, and STORM. The test problems have supports of size 216, 576, 10^{18}, and 10^{81}, respectively.

We use an external sampling-based approach as a benchmark for comparison. The external sampling-based instances are constructed as approximate problems of the form (34) with a pre-determined number of observations N{100,250,500,1000,2000}N\in\{100,250,500,1000,2000\} (these observations are not necessarily unique). The resulting instances are solved using the DR L-shaped method. Specifically, we compare the solution quality provided by these methods along with the solution time.

We conduct 30 independent replications for each problem instance with different seeds for the random number generator. The algorithms are implemented in the C programming language, and the experiments are conducted on a 64-bit machine with an Intel Core i7-4770 CPU at 3.4 GHz (8 cores) and 32 GB of memory. All linear programs, i.e., the master problems, subproblems, and distribution separation problems, are solved using CPLEX 12.10 callable subroutines. The results from the experiments are presented in Tables 1 and 2. The results for the instances with a finite sample size obtained from the DR L-shaped method are labeled DRLS-N, where N denotes the number of observations used to approximate the ambiguity set. The DRSD method is run for a minimum of 256 iterations, while no such limit is imposed on the DR L-shaped method.
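The values reported in parentheses in Tables 1 and 2 are confidence-interval half-widths computed across the 30 replications. The following sketch illustrates this calculation; the 95% Student-t convention is an assumption made for illustration.

import numpy as np
from scipy import stats

def half_width(samples, level=0.95):
    """Half-width of a two-sided confidence interval across replications (sketch).

    Only the use of 30 replications is stated in the text; the 95% Student-t
    convention is an assumption.
    """
    x = np.asarray(samples, dtype=float)
    n = x.size                             # e.g., 30 replications
    se = x.std(ddof=1) / np.sqrt(n)        # standard error of the mean
    t = stats.t.ppf(0.5 * (1.0 + level), df=n - 1)
    return t * se

# Usage: report np.mean(estimates) together with half_width(estimates).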

Method        # Iterations          Objective Estimate            # Unique Obs.
PGP
DRLS-100      17.867 (±0.92)        457.610 (±3.28)               38.233 (±1.10)
DRLS-250      20.467 (±0.73)        462.922 (±2.28)               53.267 (±1.74)
DRLS-500      20.467 (±0.63)        464.704 (±1.95)               68.667 (±1.62)
DRLS-1000     20.500 (±0.84)        466.104 (±1.78)               85.833 (±1.78)
DRLS-2000     20.900 (±0.57)        469.579 (±2.53)               102.867 (±1.87)
DRSD          503.667 (±687.01)     463.185 (±16.28)              65.667 (±10.04)
CEP
DRLS-100      2.600 (±0.19)         658817.129 (±14457.30)        80.700 (±1.21)
DRLS-250      2.267 (±0.17)         680735.606 (±10511.47)        147.900 (±2.05)
DRLS-500      2.067 (±0.09)         683252.019 (±5948.62)         195.167 (±1.44)
DRLS-1000     2.000 (±0.00)         679665.728 (±4926.88)         214.100 (±0.47)
DRLS-2000     2.000 (±0.00)         680744.118 (±3872.39)         215.967 (±0.07)
DRSD          257.000 (±0.00)       681772.359 (±10348.32)        150.033 (±2.12)
BAA
DRLS-100      247.000 (±5.17)       248677.974 (±1238.19)         100.000 (±0.00)
DRLS-250      240.233 (±5.74)       249173.795 (±770.38)          250.000 (±0.00)
DRLS-500      236.067 (±6.28)       249827.472 (±499.59)          500.000 (±0.00)
DRLS-1000     229.067 (±6.03)       250640.029 (±355.53)          1000.000 (±0.00)
DRLS-2000     219.900 (±5.36)       251405.806 (±243.54)          2000.000 (±0.00)
DRSD          316.367 (±31.53)      250235.142 (±737.21)          315.367 (±31.53)
STORM
DRLS-100      11.667 (±0.51)        15742456.082 (±12191.85)      100.000 (±0.00)
DRLS-250      11.167 (±0.52)        15781724.510 (±8753.59)       250.000 (±0.00)
DRLS-500      11.733 (±0.59)        15797019.902 (±5345.83)       500.000 (±0.00)
DRLS-1000     11.767 (±0.52)        15806575.387 (±3771.91)       1000.000 (±0.00)
DRLS-2000     11.900 (±0.46)        15817039.917 (±2502.88)       2000.000 (±0.00)
DRSD          516.200 (±108.53)     15786864.662 (±9155.49)       515.200 (±108.53)

Table 1: Comparison of results obtained from the DR L-shaped method and the DRSD method

Table 1 shows the average number of iterations, the average objective function value, and the average number of unique observations in Columns 2, 3, and 4, respectively. The values in parentheses are the half-widths of the corresponding confidence intervals. Notice that, for DRSD, the number of iterations is also equal to the number of observations used to approximate the ambiguity set. The results show the ability of DRSD to dynamically determine the number of observations by assessing the progress made during the algorithm. The objective function estimate obtained using the DRSD is comparable to that obtained using the DR L-shaped method with a sample of comparable size. For instance, the DRSD objective function estimate for STORM, which is based upon a sample of average size 516.2, is within 0.1% of the objective function estimate of DRLS-500. These results show that the optimal objective function estimates obtained from DRSD are comparable to those obtained using an external sampling-based approach.

Method        Total       Master      Subproblem   Optimality   Argmax    Separation
PGP
DRLS-100      0.0517      0.0019      0.0422       0.0000       -         0.0023
DRLS-250      0.0770      0.0021      0.0651       0.0000       -         0.0025
DRLS-500      0.0959      0.0020      0.0822       0.0000       -         0.0026
DRLS-1000     0.1214      0.0021      0.1050       0.0000       -         0.0030
DRLS-2000     0.1463      0.0022      0.1270       0.0000       -         0.0033
DRSD          0.3498      0.1225      0.0549       0.0003       0.0018    0.0697
CEP
DRLS-100      0.0154      0.0002      0.0086       0.0000       -         0.0004
DRLS-250      0.0237      0.0001      0.0124       0.0000       -         0.0004
DRLS-500      0.0283      0.0001      0.0138       0.0000       -         0.0003
DRLS-1000     0.0285      0.0001      0.0134       0.0000       -         0.0003
DRLS-2000     0.0286      0.0001      0.0137       0.0000       -         0.0003
DRSD          0.1154      0.0299      0.0262       0.0001       0.0009    0.0292
BAA
DRLS-100      2.6497      0.0778      2.1698       0.0067       -         0.1837
DRLS-250      6.1376      0.0756      5.2157       0.0078       -         0.3252
DRLS-500      11.1848     0.0655      9.7257       0.0056       -         0.4624
DRLS-1000     21.1755     0.0616      18.5514      0.0043       -         0.8286
DRLS-2000     44.2579     0.0685      39.2027      0.0056       -         1.2710
DRSD          2.0943      0.1570      0.0577       0.0002       0.0550    0.9003
STORM
DRLS-100      0.4336      0.0022      0.3197       0.0001       -         0.0391
DRLS-250      1.0080      0.0021      0.7429       0.0001       -         0.0885
DRLS-500      2.1167      0.0025      1.5629       0.0001       -         0.1957
DRLS-1000     4.3179      0.0027      3.1438       0.0001       -         0.4571
DRLS-2000     9.0015      0.0027      6.2653       0.0001       -         1.3203
DRSD          30.4394     0.7815      0.3084       0.0003       0.4781    23.8544

Table 2: Computational time (in seconds) comparison between the DR L-shaped method and DRSD

Table 2 shows the average total computational time (Column 2) for each instance. The table also includes the average time spent solving the master problem (the first-stage approximate problem) and the subproblems, verifying the optimality conditions, completing the argmax procedure (only for DRSD), and solving the distribution separation problem (Columns 3–7, respectively). The results for the small-scale instances (PGP and CEP) show that both DRSD and the DR L-shaped method take a fraction of a second, but the computational time for DRSD is higher than that of the DR L-shaped method for all N. This behavior can be attributed to the fact that (i) the subproblems are relatively easy to solve, so the computational effort to solve all the subproblems does not increase significantly with N, and (ii) the DRSD is run for a minimum number of iterations (256), thereby contributing to the total time taken to solve master problems and distribution separation problems. This observation is in line with our computational experience with the SD method for 2-SLPs. It is important to note that, while the computational time for the DR L-shaped method on an individual instance may be lower, the iterative procedure necessary to identify a sufficient sample size may require solving several instances with increasing sample sizes. This may result in a significantly higher cumulative computational time. The DRSD method, and the sequential sampling idea in general, mitigates the need for this iterative process.

On the other hand, for the large-scale problems (BAA and STORM), we observe a noticeable increase in the computational time for the DR L-shaped method as N increases. A significant portion of this time is spent on solving the subproblems. Since the DRSD solves only two subproblems in each iteration, the time taken to solve the subproblems is significantly less than that of the DR L-shaped method, which solves all subproblems corresponding to unique observations in each iteration. Notice that, for STORM, the average number of iterations taken by DRSD is at least 43 times the average number of iterations taken by the DR L-shaped method for each N. This is reflected in the significant increase in the computational time for solving the master problems and the distribution separation problems. In contrast, for BAA, the average number of DRSD iterations is only 28% higher than the number of DR L-shaped iterations. As a result, the increase in the master and distribution separation problem solution times is overshadowed by the computational gains attained by solving only two subproblems in each iteration. Moreover, the computational time associated with solving the distribution separation problem can be reduced by using column-generation procedures that take advantage of the problem structure; a sketch of the separation problem in its basic LP form is given below. Such an implementation was not undertaken for our current experiments and is a fruitful avenue for future research.
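As a reference for the preceding discussion, the following is a minimal sketch of the distribution separation problem for the moment-based ambiguity set in its basic LP form. The equality form of the moment constraints and the use of scipy.optimize.linprog are assumptions made for illustration; our implementation solves this problem with CPLEX callable subroutines.

import numpy as np
from scipy.optimize import linprog

def separation_lp(Q_vals, psi, b_hat):
    """Distribution separation over a moment-based ambiguity set (sketch).

    Q_vals : length-n array of (approximate) recourse values, one per
             observation in Omega^k.
    psi    : (q, n) array with psi[i, j] = psi_i(omega_j).
    b_hat  : length-q array of sample moment targets.

    Solves  max_p  sum_j p_j * Q_vals[j]
            s.t.   sum_j p_j = 1,  psi @ p = b_hat,  p >= 0.
    """
    n = len(Q_vals)
    A_eq = np.vstack([np.ones((1, n)), np.asarray(psi, dtype=float)])
    b_eq = np.concatenate([[1.0], np.asarray(b_hat, dtype=float)])
    res = linprog(c=-np.asarray(Q_vals, dtype=float), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.x, -res.fun   # extremal probabilities and worst-case expectation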

6 Conclusions

We presented a new decomposition approach for solving two-stage distributionally robust linear programs (2-DRLPs) with a general ambiguity set that is defined using continuous and/or discrete probability distributions with a very large sample space. Since this approach extends the stochastic decomposition approach of Higle and Sen [20] for 2-DRLPs with a singleton ambiguity set, we refer to it as the distributionally robust stochastic decomposition (DRSD) method. The DRSD is a sequential sampling-based approach that allows sampling within the optimization step, where only two second-stage subproblems are solved in each iteration. In contrast, an external sampling procedure utilizes the distributionally robust L-shaped method [1] for solving a 2-DRLP with a finite number of scenarios, where all subproblems are solved in each iteration. While the design of DRSD accommodates general ambiguity sets, we provided its asymptotic convergence analysis for a family of ambiguity sets that includes the well-known moment-based and Wasserstein metric-based ambiguity sets. Furthermore, we performed computational experiments to evaluate the efficiency and effectiveness of the DRSD method in solving distributionally robust variants of four well-known stochastic programming test problems with supports of size ranging from 216 to 10^{81}. Based on our results, we observed that the objective function estimates obtained using the DRSD and the DR L-shaped method are statistically comparable. These DRSD estimates are obtained while providing computational improvements that are critical for large-scale problem instances.

Appendix A Proofs

In this appendix, we provide the proofs of the propositions related to the asymptotic behavior of the ambiguity set approximations defined in §2.1 and the recourse function approximation presented in §2.2.

Proof.

(Proposition 2.1) For P𝔓^momk1P\in\widehat{\mathfrak{P}}_{\textup{mom}}^{k-1}, it is easy to verify that p=Θk(P)p^{\prime}=\Theta^{k}(P) satisfies the support constraint, viz., ωΩkp(ω)=1\sum_{\omega\in\Omega^{k}}p^{\prime}(\omega)=1. Now, for i=1,,qi=1,\ldots,q, we have

ωΩkp(ω)ψi(ω)=\displaystyle\sum_{\omega\in\Omega^{k}}p^{\prime}(\omega)\psi_{i}(\omega)=~{} ωΩk1,ωωkp(ω)ψi(ω)+p(ωk)ψi(ωk)\displaystyle\sum_{\omega\in\Omega^{k-1},\omega\neq\omega^{k}}p^{\prime}(\omega)\psi_{i}(\omega)+p^{\prime}(\omega^{k})\psi_{i}(\omega^{k})
=\displaystyle=~{} θkωΩk1,ωωkp(ω)ψi(ω)+θkp(ωk)ψi(ωk)+(1θk)ψi(ωk)\displaystyle\theta^{k}\sum_{\omega\in\Omega^{k-1},\omega\neq\omega^{k}}p(\omega)\psi_{i}(\omega)+\theta^{k}p(\omega^{k})\psi_{i}(\omega^{k})+(1-\theta^{k})\psi_{i}(\omega^{k})
=\displaystyle=~{} θkωΩk1p(ω)ψi(ω)+(1θk)ψi(ωk)\displaystyle\theta^{k}\sum_{\omega\in\Omega^{k-1}}p(\omega)\psi_{i}(\omega)+(1-\theta^{k})\psi_{i}(\omega^{k})
=\displaystyle=~{} θkb^ik1+(1θk)ψi(ωk)\displaystyle\theta^{k}\hat{b}_{i}^{k-1}+(1-\theta^{k})\psi_{i}(\omega^{k})
=\displaystyle=~{} θkωΩk1p^k1(ω)ψi(ω)+(1θk)ψi(ωk)=b^ik.\displaystyle\theta^{k}\sum_{\omega\in\Omega^{k-1}}\hat{p}^{k-1}(\omega)\psi_{i}(\omega)+(1-\theta^{k})\psi_{i}(\omega^{k})=\hat{b}^{k}_{i}.

This implies that Θk(P)𝔓^momk\Theta^{k}(P)\in\widehat{\mathfrak{P}}_{\textup{mom}}^{k}.

Using Proposition 4 in [39], there exists a positive constant χ\chi such that

0(𝔓^momk,𝔓mom)χ𝐛^k𝐛.\displaystyle 0\leq\mathbb{H}(\widehat{\mathfrak{P}}_{\textup{mom}}^{k},\mathfrak{P}_{\textup{mom}})\leq\chi\|\hat{\mathbf{b}}^{k}-\mathbf{b}\|.

Here, 𝐛=(bi)i=1q\mathbf{b}=(b_{i})_{i=1}^{q} and 𝐛^k=(b^ik)i=1q\hat{\mathbf{b}}^{k}=(\hat{b}_{i}^{k})_{i=1}^{q}, and \|\cdot\| denotes the Euclidean norm. Since the approximate ambiguity sets are constructed using independent and identically distributed samples of ω~\tilde{\omega}, the law of large numbers yields b^ikbi\hat{b}_{i}^{k}\rightarrow b_{i} for all i=1,,qi=1,\ldots,q, with probability one. This completes the proof. ∎
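The reweighting map Θ^k used above corresponds to a recursive update of the empirical probabilities and sample moments when the k-th observation arrives. The sketch below illustrates this update; the choice θ^k = (k−1)/k is an assumption consistent with running sample averages and is not spelled out here.

def update_empirical(p_hat, b_hat, psi, omega_k, k):
    """Recursive update of empirical probabilities and sample moments (sketch).

    p_hat   : dict omega -> p^{k-1}(omega); b_hat : list of moments b^{k-1}_i.
    psi     : list of callables psi_i; omega_k : the k-th observation (hashable).
    theta = (k - 1) / k is an assumption consistent with running averages.
    """
    theta = (k - 1) / k
    # Shrink existing probabilities and place mass (1 - theta) on omega_k.
    p_new = {w: theta * pw for w, pw in p_hat.items()}
    p_new[omega_k] = p_new.get(omega_k, 0.0) + (1.0 - theta)
    # Sample moments follow the same recursion used in the proof above.
    b_new = [theta * b + (1.0 - theta) * psi_i(omega_k)
             for b, psi_i in zip(b_hat, psi)]
    return p_new, b_new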

Proof.

(Proposition 2.2) Consider the approximate ambiguity sets 𝔓^wk1\widehat{\mathfrak{P}}^{k-1}_{\text{w}} and 𝔓^wk\widehat{\mathfrak{P}}^{k}_{\text{w}} of the form given in (20g). Let P=(p(ω))ωΩk1𝔓^wk1P^{\prime}=(p^{\prime}(\omega))_{\omega\in\Omega^{k-1}}\in\widehat{\mathfrak{P}}^{k-1}_{\text{w}}, and let the reconstructed probability distribution be denoted by PP. We can easily check that P=Θk(P)P=\Theta^{k}(P^{\prime}) is indeed a probability distribution. With PP fixed, it suffices to show that the polyhedron

(P,P^k)={ηΩk×Ωk|ωΩkη(ω,ω)=p(ω)ωΩk,ωΩkη(ω,ω)=p^k(ω)ωΩk,(ω,ω)Ωk×Ωkωωη(ω,ω)ϵ}.\displaystyle\mathcal{E}(P,\widehat{P}^{k})=\left\{\eta\in\mathbb{R}^{\Omega^{k}\times\Omega^{k}}\left|\begin{array}[]{l}\sum_{\omega^{\prime}\in\Omega^{k}}\eta(\omega,\omega^{\prime})=p(\omega)\qquad\forall\omega\in\Omega^{k},\\ \sum_{\omega\in\Omega^{k}}\eta(\omega,\omega^{\prime})=\hat{p}^{k}(\omega^{\prime})\qquad\forall\omega^{\prime}\in\Omega^{k},\\ \sum_{(\omega,\omega^{\prime})\in\Omega^{k}\times\Omega^{k}}\|\omega-\omega^{\prime}\|\eta(\omega,\omega^{\prime})\leq\epsilon\end{array}\right.\right\}. (49)

is non-empty. Since P𝔓^wk1P^{\prime}\in\widehat{\mathfrak{P}}^{k-1}_{\text{w}}, there exist η(ω,ω)\eta^{\prime}(\omega,\omega^{\prime}) for all (ω,ω)Ωk1×Ωk1(\omega,\omega^{\prime})\in\Omega^{k-1}\times\Omega^{k-1} such that the constraints in the description of the approximate ambiguity set in (20g) are satisfied. We show the non-emptiness by analyzing two possibilities:

  1.

    We encounter a previously seen observation, i.e., ωkΩk1\omega^{k}\in\Omega^{k-1} and Ωk=Ωk1\Omega^{k}=\Omega^{k-1}. Let η(ω,ω)=θkη(ω,ω)\eta(\omega,\omega^{\prime})=\theta^{k}\eta^{\prime}(\omega,\omega^{\prime}) for ω,ωΩk1\omega,\omega^{\prime}\in\Omega^{k-1} with (ω,ω)(ωk,ωk)(\omega,\omega^{\prime})\neq(\omega^{k},\omega^{k}), and η(ωk,ωk)=θkη(ωk,ωk)+(1θk)\eta(\omega^{k},\omega^{k})=\theta^{k}\eta^{\prime}(\omega^{k},\omega^{k})+(1-\theta^{k}). We verify the feasibility of this choice by checking the three sets of constraints in (49).

    ωΩkη(ω,ω)=\displaystyle\sum_{\omega^{\prime}\in\Omega^{k}}\eta(\omega,\omega^{\prime})=~{} ωΩk{ωk}η(ω,ω)+η(ω,ωk)\displaystyle\sum_{\omega^{\prime}\in\Omega^{k}\setminus\{\omega^{k}\}}\eta(\omega,\omega^{\prime})+\eta(\omega,\omega^{k})
    =\displaystyle=~{} ωΩk1{ωk}θkη(ω,ω)+θkη(ω,ωk)+𝟏ω=ωk(1θk)\displaystyle\sum_{\omega^{\prime}\in\Omega^{k-1}\setminus\{\omega^{k}\}}\theta^{k}\eta^{\prime}(\omega,\omega^{\prime})+\theta^{k}\eta^{\prime}(\omega,\omega^{k})+\mathbf{1}_{\omega=\omega^{k}}(1-\theta^{k})
    =\displaystyle=~{} θk(ωΩk1η(ω,ω))+𝟏ω=ωk(1θk)\displaystyle\theta^{k}\bigg{(}\sum_{\omega^{\prime}\in\Omega^{k-1}}\eta^{\prime}(\omega,\omega^{\prime})\bigg{)}+\mathbf{1}_{\omega=\omega^{k}}(1-\theta^{k})
    =\displaystyle=~{} θkp(ω)+𝟏ω=ωk(1θk)=p(ω)ωΩk.\displaystyle\theta^{k}p^{\prime}(\omega)+\mathbf{1}_{\omega=\omega^{k}}(1-\theta^{k})=p(\omega)\qquad\forall\omega\in\Omega^{k}.
    ωΩkη(ω,ω)=\displaystyle\sum_{\omega\in\Omega^{k}}\eta(\omega,\omega^{\prime})=~{} ωΩk{ωk}η(ω,ω)+η(ωk,ω)\displaystyle\sum_{\omega\in\Omega^{k}\setminus\{\omega^{k}\}}\eta(\omega,\omega^{\prime})+\eta(\omega^{k},\omega^{\prime})
    =\displaystyle=~{} ωΩk1{ωk}θkη(ω,ω)+θkη(ωk,ω)+𝟏ω=ωk(1θk)\displaystyle\sum_{\omega\in\Omega^{k-1}\setminus\{\omega^{k}\}}\theta^{k}\eta^{\prime}(\omega,\omega^{\prime})+\theta^{k}\eta^{\prime}(\omega^{k},\omega^{\prime})+\mathbf{1}_{\omega^{\prime}=\omega^{k}}(1-\theta^{k})
    =\displaystyle=~{} θk(ωΩk1η(ω,ω))+𝟏ω=ωk(1θk)\displaystyle\theta^{k}\bigg{(}\sum_{\omega\in\Omega^{k-1}}\eta^{\prime}(\omega,\omega^{\prime})\bigg{)}+\mathbf{1}_{\omega^{\prime}=\omega^{k}}(1-\theta^{k})
    =\displaystyle=~{} θkp^k1(ω)+𝟏ω=ωk(1θk)=p^k(ω)ωΩk.\displaystyle\theta^{k}\hat{p}^{k-1}(\omega^{\prime})+\mathbf{1}_{\omega^{\prime}=\omega^{k}}(1-\theta^{k})=\hat{p}^{k}(\omega^{\prime})\qquad\forall\omega^{\prime}\in\Omega^{k}.

    Finally,

    (ω,ω)Ωk×Ωkωω\displaystyle\sum_{(\omega,\omega^{\prime})\in\Omega^{k}\times\Omega^{k}}\|\omega-\omega^{\prime}\| η(ω,ω)\displaystyle\eta(\omega,\omega^{\prime})
    =\displaystyle=~{} (ω,ω)Ωk1×Ωk1(ω,ω)(ωk,ωk)θkωωη(ω,ω)+ωkωkη(ωk,ωk)\displaystyle\sum_{\begin{subarray}{c}(\omega,\omega^{\prime})\in\Omega^{k-1}\times\Omega^{k-1}\\ (\omega,\omega^{\prime})\neq(\omega^{k},\omega^{k})\end{subarray}}\theta^{k}\|\omega-\omega^{\prime}\|\eta^{\prime}(\omega,\omega^{\prime})+\|\omega^{k}-\omega^{k}\|\eta(\omega^{k},\omega^{k})
    =\displaystyle=~{} θk((ω,ω)Ωk1×Ωk1ωωη(ω,ω))θkϵϵ.\displaystyle\theta^{k}\bigg{(}\sum_{(\omega,\omega^{\prime})\in\Omega^{k-1}\times\Omega^{k-1}}\|\omega-\omega^{\prime}\|\eta^{\prime}(\omega,\omega^{\prime})\bigg{)}\leq\theta^{k}\epsilon\leq\epsilon.

    Since all three constraints are satisfied, the chosen values of η\eta constitute an element of the polyhedron \mathcal{E}, and therefore, \mathcal{E}\neq\emptyset.

  2.

    We encounter a new observation, i.e., ωkΩk1\omega^{k}\notin\Omega^{k-1}. Let η(ω,ω)=θkη(ω,ω)\eta(\omega,\omega^{\prime})=\theta^{k}\eta^{\prime}(\omega,\omega^{\prime}) for ω,ωΩk1\omega,\omega^{\prime}\in\Omega^{k-1}, η(ωk,ω)=0\eta(\omega^{k},\omega^{\prime})=0 for ωΩk1\omega^{\prime}\in\Omega^{k-1}, η(ω,ωk)=0\eta(\omega,\omega^{k})=0 for ωΩk1\omega\in\Omega^{k-1}, and η(ωk,ωk)=(1θk)\eta(\omega^{k},\omega^{k})=(1-\theta^{k}). Let us verify the three conditions defining (49) with this choice for the η\eta variables.

    ωΩkη(ω,ω)=\displaystyle\sum_{\omega^{\prime}\in\Omega^{k}}\eta(\omega,\omega^{\prime})=~{} ωΩk{ωk}η(ω,ω)+η(ω,ωk)\displaystyle\sum_{\omega^{\prime}\in\Omega^{k}\setminus\{\omega^{k}\}}\eta(\omega,\omega^{\prime})+\eta(\omega,\omega^{k})
    =\displaystyle=~{} ωΩk1θkη(ω,ω)+𝟏ω=ωk(1θk)\displaystyle\sum_{\omega^{\prime}\in\Omega^{k-1}}\theta^{k}\eta^{\prime}(\omega,\omega^{\prime})+\mathbf{1}_{\omega=\omega^{k}}(1-\theta^{k})
    =\displaystyle=~{} θkp(ω)+𝟏ω=ωk(1θk)=p(ω).\displaystyle\theta^{k}p^{\prime}(\omega)+\mathbf{1}_{\omega=\omega^{k}}(1-\theta^{k})=p(\omega).
    ωΩkη(ω,ω)=\displaystyle\sum_{\omega\in\Omega^{k}}\eta(\omega,\omega^{\prime})=~{} ωΩk{ωk}η(ω,ω)+η(ωk,ω)\displaystyle\sum_{\omega\in\Omega^{k}\setminus\{\omega^{k}\}}\eta(\omega,\omega^{\prime})+\eta(\omega^{k},\omega^{\prime})
    =\displaystyle=~{} ωΩk1θkη(ω,ω)+𝟏ω=ωk(1θk)\displaystyle\sum_{\omega\in\Omega^{k-1}}\theta^{k}\eta^{\prime}(\omega,\omega^{\prime})+\mathbf{1}_{\omega^{\prime}=\omega^{k}}(1-\theta^{k})
    =\displaystyle=~{} θkp^k1(ω)+𝟏ω=ωk(1θk)=p^k(ω)\displaystyle\theta^{k}\hat{p}^{k-1}(\omega^{\prime})+\mathbf{1}_{\omega^{\prime}=\omega^{k}}(1-\theta^{k})=\hat{p}^{k}(\omega^{\prime})

    Consider,

    (ω,ω)Ωk×Ωkωω\displaystyle\sum_{(\omega,\omega^{\prime})\in\Omega^{k}\times\Omega^{k}}\|\omega-\omega^{\prime}\| η(ω,ω)\displaystyle\eta(\omega,\omega^{\prime})
    =\displaystyle=~{} (ω,ω)Ωk1×Ωk1θkωωη(ω,ω)+ωkωkη(ωk,ωk)\displaystyle\sum_{(\omega,\omega^{\prime})\in\Omega^{k-1}\times\Omega^{k-1}}\theta^{k}\|\omega-\omega^{\prime}\|\eta^{\prime}(\omega,\omega^{\prime})+\|\omega^{k}-\omega^{k}\|\eta(\omega^{k},\omega^{k})
    +ωΩkωωkη(ω,ωk)+ωΩkωkωη(ωk,ω)\displaystyle\hskip 28.45274pt+\sum_{\omega\in\Omega^{k}}\|\omega-\omega^{k}\|\eta(\omega,\omega^{k})+\sum_{\omega^{\prime}\in\Omega^{k}}\|\omega^{k}-\omega^{\prime}\|\eta(\omega^{k},\omega^{\prime})
    \displaystyle\leq~{} θkϵϵ.\displaystyle\theta^{k}\epsilon\leq\epsilon.

    Therefore, the values of the η\eta variables satisfy the constraints and \mathcal{E}\neq\emptyset. This implies that Θk(P)𝔓^wk\Theta^{k}(P^{\prime})\in\widehat{\mathfrak{P}}_{\textup{w}}^{k}.

Next, let us consider a distribution Q𝔓^wkQ\in\widehat{\mathfrak{P}}_{\textup{w}}^{k}. Then,

dw(Q,P)\displaystyle d_{\textup{w}}(Q,P^{*})\leq~{} dw(Q,P^k)+dw(P^k,P)ϵ+dw(P^k,P).\displaystyle d_{\textup{w}}(Q,\widehat{P}^{k})+d_{\textup{w}}(\widehat{P}^{k},P^{*})\leq\epsilon+d_{\textup{w}}(\widehat{P}^{k},P^{*}).

The above inequality is a consequence of the triangle inequality for the Wasserstein distance. Since Q𝔓^wkQ\in\widehat{\mathfrak{P}}^{k}_{\textup{w}}, we have dw(Q,P^k)ϵd_{\textup{w}}(Q,\widehat{P}^{k})\leq\epsilon, which gives the second inequality above. Under the compactness assumption for Ω\Omega, d>2d>2, and 𝔼P[exp(ω~a)]<\mathbb{E}_{P^{*}}[\text{exp}(\|\tilde{\omega}\|^{a})]<\infty, Theorem 2 in [16] guarantees

Prob[dw(P^k,P)δ]{Cexp(ckδa)ifδ>1Cexp(ckδd)ifδ1\displaystyle\text{Prob}\big{[}d_{\textup{w}}(\widehat{P}^{k},P^{*})\geq\delta\big{]}\leq\left\{\begin{array}[]{ll}C~{}\text{exp}(-ck\delta^{a})&\text{if}~{}\delta>1\\ C~{}\text{exp}(-ck\delta^{d})&\text{if}~{}\delta\leq 1\end{array}\right.

for all k1k\geq 1. This, together with the Borel–Cantelli lemma, implies that limkdw(P^k,P)=0\lim_{k\rightarrow\infty}d_{\textup{w}}(\widehat{P}^{k},P^{*})=0, almost surely. Consequently, we obtain that dw(Q,P)ϵd_{\textup{w}}(Q,P^{*})\leq\epsilon (or equivalently, Q𝔓wQ\in\mathfrak{P}_{\textup{w}}) as kk\rightarrow\infty, almost surely. This completes the proof. ∎
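The membership argument above can also be checked numerically for discrete distributions by solving the transportation LP in (49). The following sketch does so; the use of scipy.optimize.linprog is an assumption made for illustration.

import numpy as np
from scipy.optimize import linprog

def within_wasserstein_ball(support, p, p_hat, eps):
    """Check whether d_w(P, P_hat^k) <= eps for discrete distributions (sketch).

    support : (n, d) array of observations in Omega^k.
    p, p_hat: length-n probability vectors for P and P_hat^k.
    """
    support = np.asarray(support, dtype=float)
    p, p_hat = np.asarray(p, dtype=float), np.asarray(p_hat, dtype=float)
    n = len(p)
    # Pairwise transport costs ||omega - omega'||, flattened row-major.
    cost = np.linalg.norm(support[:, None, :] - support[None, :, :], axis=2).ravel()
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_{omega'} eta(omega_i, omega') = p_i
        A_eq[n + i, i::n] = 1.0            # sum_{omega} eta(omega, omega_i) = p_hat_i
    b_eq = np.concatenate([p, p_hat])
    res = linprog(c=cost, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n * n), method="highs")
    return res.fun <= eps                  # minimal transport cost within the radius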

Proof.

(Proposition 2.3) Recall that 𝒳×Ω\mathcal{X}\times\Omega is a compact set because of assumptions (A1) and (A4), and {Qk}\{Q^{k}\} is a sequence of continuous (piecewise linear and convex) functions. Further, the construction of the sets of dual vertices satisfies Π0={𝟎}ΠkΠk+1Π\Pi^{0}=\{{\bf 0}\}\subseteq\ldots\subseteq\Pi^{k}\subseteq\Pi^{k+1}\subseteq\ldots\subseteq\Pi, which ensures that 0Qk(x,ω)Qk+1(x,ω)Q(x,ω)0\leq Q^{k}(x,\omega)\leq Q^{k+1}(x,\omega)\leq Q(x,\omega) for all (x,ω)𝒳×Ω(x,\omega)\in\mathcal{X}\times\Omega. Since {Qk}\{Q^{k}\} increases monotonically and is bounded by the finite function QQ (due to (A2)), this sequence converges pointwise to some function ξ(x,ω)Q(x,ω)\xi(x,\omega)\leq Q(x,\omega). Once again due to (A2), the set of dual vertices Π\Pi is finite, and since ΠkΠk+1Π\Pi^{k}\subseteq\Pi^{k+1}\subseteq\Pi, the set limkΠk:=Π¯(Π)\lim_{k\rightarrow\infty}\Pi^{k}:=\overline{\Pi}~{}(\subseteq\Pi) is also finite. Clearly,

ξ(x,ω)=limkQk(x,ω)=max{π[r(ω)T(ω)x]|πΠ¯}\displaystyle\xi(x,\omega)=\lim_{k\rightarrow\infty}Q^{k}(x,\omega)=\max~{}\{\pi^{\top}[r(\omega)-T(\omega)x]~{}|~{}\pi\in\overline{\Pi}\}

is the optimal value of an LP and, hence, is a continuous function. The compactness of 𝒳×Ω\mathcal{X}\times\Omega, together with the continuity, monotonicity, and pointwise convergence of {Qk}\{Q^{k}\} to ξ\xi, guarantees that the sequence converges uniformly to ξ\xi (Dini's theorem; Theorem 7.13 in [32]). ∎

References

  • [1] M. Bansal, K. Huang, and S. Mehrotra. Decomposition Algorithms for Two-Stage Distributionally Robust Mixed Binary Programs. SIAM Journal on Optimization, pages 2360–2383, January 2018.
  • [2] M. Bansal and S. Mehrotra. On solving two-stage distributionally robust disjunctive programs with a general ambiguity set. European Journal of Operational Research, 279(2):296–307, December 2019.
  • [3] G. Bayraksan and D. K. Love. Data-Driven Stochastic Programming Using Phi-Divergences. In The Operations Research Revolution, INFORMS TutORials in Operations Research, pages 1–19. INFORMS, September 2015.
  • [4] G. Bayraksan and D.P. Morton. Assessing solution quality in stochastic programs. Mathematical Programming, 108(2):495–514, Sep 2006.
  • [5] G. Bayraksan and D.P. Morton. A sequential sampling procedure for stochastic programming. Operations Research, 59(4):898–913, 2011.
  • [6] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities. Management Science, 59(2):341–357, November 2012.
  • [7] D. Bertsimas, X. V. Doan, K. Natarajan, and C. Teo. Models for minimax stochastic linear optimization problems with risk aversion. Mathematics of Operations Research, 35(3):580–602, 2010.
  • [8] D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780–804, 2005.
  • [9] J. R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer Series in Operations Research and Financial Engineering. Springer, 2011.
  • [10] M. Breton and S. El Hachem. Algorithms for the solution of stochastic dynamic minimax problems. Computational Optimization and Applications, 4(4):317–345, October 1995.
  • [11] M. Breton and S. El Hachem. A scenario aggregation algorithm for the solution of stochastic dynamic minimax problems. Stochastics and Stochastic Reports, 53(3-4):305–322, May 1995.
  • [12] G. B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations Research, 8(1):101–111, 1960.
  • [13] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58:595–612, 2010.
  • [14] J. Dupacová. The minimax approach to stochastic programming and an illustrative application. Stochastics, 20:73–88, 1987.
  • [15] E. Erdogan and G. Iyengar. Ambiguous chance constrained problems and robust optimization. Mathematical Programming, 107(1-2):37–61, 2006.
  • [16] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
  • [17] H. Gangammanavar, Y. Liu, and S. Sen. Stochastic decomposition for two-stage stochastic linear programs with random cost coefficients. INFORMS Journal on Computing, 2020.
  • [18] H. Gangammanavar and S. Sen. Stochastic dynamic linear programming: A sequential sampling algorithm for multistage stochastic linear programming. Technical report, Southern Methodist University, 2019. Available on Optimization Online: http://www.optimization-online.org/DB_HTML/2019/09/7395.html.
  • [19] G. A. Hanasusanto and D. Kuhn. Conic programming reformulations of two-stage distributionally robust linear programs over Wasserstein balls. Operations Research, 66(3):849–869, 2018.
  • [20] J. L. Higle and S. Sen. Stochastic decomposition: An algorithm for two-stage linear programs with recourse. Mathematics of Operations Research, 16(3):650–669, 1991.
  • [21] J. L. Higle and S. Sen. Finite master programs in regularized stochastic decomposition. Mathematical Programming, 67(1-3):143–168, 1994.
  • [22] J. L. Higle and S. Sen. Stochastic Decomposition: A Statistical Method for Large Scale Stochastic Linear Programming. Kluwer Academic Publishers, Boston, MA, 1996.
  • [23] J. L. Higle and S. Sen. Statistical approximations for stochastic linear programming problems. Annals of Operations Research, 85(0):173–193, 1999.
  • [24] R. Jiang and Y. Guan. Risk-averse two-stage stochastic program with distributional ambiguity. Operations Research, 66(5):1390–1405, 2018.
  • [25] S. Mehrotra and H. Zhang. Models and algorithms for distributionally robust least squares problems. Mathematical Programming, 148(1–2):123–141, 2014.
  • [26] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, Sep 2018.
  • [27] H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
  • [28] M. Riis and K. A. Andersen. Applying the minimax criterion in stochastic recourse programs. European Journal of Operational Research, 165(3):569–584, September 2005.
  • [29] R. T. Rockafellar and Roger J.-B. Wets. Scenarios and Policy Aggregation in Optimization Under Uncertainty. Mathematics of Operations Research, 16(1):119–147, February 1991.
  • [30] W. Römisch. Stability of stochastic programming problems. Handbooks in operations research and management science, 10:483–554, 2003.
  • [31] J. O. Royset and R. Szechtman. Optimal budget allocation for sample average approximation. Operations Research, 61(3):762–776, 2013.
  • [32] W. Rudin. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition, 1976. International Series in Pure and Applied Mathematics.
  • [33] H. Scarf. A min-max solution of an inventory problem. In Studies in the Mathematical Theory of Inventory and Production, chapter 12, pages 201–209. RAND Corporation, Santa Monica CA, 1958.
  • [34] S. Sen and Z. Zhou. Multistage Stochastic Decomposition: A bridge between Stochastic Programming and Approximate Dynamic Programming. SIAM Journal on Optimization, 24(1):127–153, 2014.
  • [35] A. Shapiro and S. Ahmed. On a class of minimax stochastic programs. SIAM Journal on Optimization, 14(4):1237–1249, 2004.
  • [36] A. Shapiro and S. Ahmed. On a class of minimax stochastic programs. SIAM Journal on Optimization, 14(4):1237–1249, 2004.
  • [37] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2014.
  • [38] A. Shapiro and A. Kleywegt. Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3):523–542, 2002.
  • [39] H. Sun and H. Xu. Convergence analysis for distributionally robust optimization and equilibrium problems. Mathematics of Operations Research, 41(2):377–401, May 2016.
  • [40] R. M. Van Slyke and R. J. B. Wets. L-shaped linear programs with applications to optimal control and stochastic programming. SIAM Journal on Applied Mathematics, 17(4):638–663, 1969.
  • [41] W. Xie. Tractable reformulations of distributionally robust two-stage stochastic programs with ∞\infty-Wasserstein distance. arXiv preprint arXiv:1908.08454, 2019.
  • [42] C. Zhao and Y. Guan. Data-driven risk-averse two-stage stochastic program with ζ\zeta-structure probability metrics. Available at http://www.optimization-online.org/DB_FILE/2015/07/5014, 2015.