
Numerical Methods for Convex Multistage Stochastic Optimization

Guanghui Lan Georgia Institute of Technology, Atlanta, Georgia 30332, USA, [email protected]
Research of this author was partially supported by the NSF grant DMS-1953199 and the NSF AI Institute grant NSF-2112533.
   Alexander Shapiro Georgia Institute of Technology, Atlanta, Georgia 30332, USA, [email protected]
Research of this author was partially supported by Air Force Office of Scientific Research (AFOSR) under Grant FA9550-22-1-0244.

Optimization problems involving sequential decisions in a stochastic environment have been studied in Stochastic Programming (SP), Stochastic Optimal Control (SOC) and Markov Decision Processes (MDP). In this paper we mainly concentrate on the SP and SOC modelling approaches. In these frameworks there are natural situations when the considered problems are convex. The classical approach to sequential optimization is based on dynamic programming. It suffers from the so-called “Curse of Dimensionality”, in that its computational complexity increases exponentially with the dimension of the state variables. Recent progress in solving convex multistage stochastic problems is based on cutting planes approximations of the cost-to-go (value) functions of dynamic programming equations. Cutting planes type algorithms in dynamical settings are one of the main topics of this paper. We also discuss Stochastic Approximation type methods applied to multistage stochastic optimization problems. From the computational complexity point of view, these two types of methods seem to be complementary to each other. Cutting plane type methods can handle multistage problems with a large number of stages, but a relatively smaller number of state (decision) variables. On the other hand, stochastic approximation type methods can only deal with a small number of stages, but a large number of decision variables.


Key words: Stochastic Programming, Stochastic Optimal Control, Markov Decision Process, dynamic programming, risk measures, Stochastic Dual Dynamic Programming, Stochastic Approximation method, cutting planes algorithm.

AMS subject classifications: 65K05, 90C15, 90C39, 90C40.

1 Introduction

Traditionally, different communities of researchers dealt with optimization problems involving uncertainty, modelled in stochastic terms, using different terminology and modeling frameworks. In this respect we can point to the fields of Stochastic Programming (SP), Stochastic Optimal Control (SOC) and Markov Decision Processes (MDP). Historically, the developments in SP on the one hand, and SOC and MDP on the other, went along different directions with different modeling frameworks and solution methods. In this paper we mainly concentrate on the SP and SOC approaches. In these frameworks there are natural situations when the considered problems are convex. In continuous optimization, convexity is the main consideration allowing the development of efficient numerical algorithms [30].

The main goal of this paper is to present some recent developments in numerical approaches to solving convex optimization problems involving sequential decision making. We do not try to give a comprehensive review of the subject with a complete list of references, etc. Rather the aim is to present a certain point of view about some recent developments in solving convex multistage stochastic problems.

The SOC modeling is convenient for demonstrating some basic ideas since, on one hand, it can be naturally written in MDP terms and, on the other hand, it can be formulated in the Stochastic Programming framework. Stochastic Programming (SP) has a long history. Two stage stochastic programming (with recourse) was introduced in Dantzig [8] and Beale [2], and was intrinsically connected with linear programming. From the beginning SP was aimed at numerical solutions. Until about twenty years ago, the modeling approach to two and multistage SP was predominantly based on the construction of scenarios represented by scenario trees. This approach allows one to formulate the so-called deterministic equivalent optimization problem, with the number of decision variables more or less proportional to the number of scenarios. When the deterministic equivalent could be represented as a linear program, such problems were considered to be numerically solvable. Because of that, the topic of SP was often viewed as large scale linear programming. For a presentation of these developments we refer to Birge [6] and references therein.

From the point of view of the scenario construction approach there is not much difference between two stage and multistage SP. In both cases the numerical effort in solving the deterministic equivalent is more or less proportional to the number of generated scenarios. This view of SP started to change with the development of randomization methods and the sample complexity theory. From the point of view of solving the deterministic equivalent, even two stage linear stochastic programs are computationally intractable; their computational complexity is #P-hard for a sufficiently high accuracy (cf., [10],[16]). On the other hand, under reasonable assumptions, the number of randomly generated scenarios (by Monte Carlo sampling techniques) required to solve two stage SP problems with accuracy \varepsilon>0 and high probability is of order O(\varepsilon^{-2}), [47, Section 5.3]. While randomization methods were reasonably successful in solving two stage problems, as far as multistage SP is concerned the situation is different. The number of scenarios needed to solve multistage SP problems grows exponentially with the number of stages, [50],[47, Section 5.8.2].

The classical approach to sequential optimization is based on dynamic programming, [3]. Dynamic programming also has a long history and is at the heart of the SOC and MDP modeling. It suffers from the so-called “Curse of Dimensionality”, the term coined by Bellman: its computational complexity increases exponentially with the dimension of the state variables. There is a large literature trying to deal with this problem by using various approximations of dynamic programming equations (e.g., [37]). Most of these methods are heuristics and do not give verifiable guarantees for the accuracy of the obtained solutions.

Recent progress in solving convex multistage SP problems is based on cutting planes approximations of the cost-to-go (value) functions of dynamic programming equations. These methods make it possible to estimate the error of the computed solution. Cutting planes type algorithms in dynamical settings are one of the main topics of this paper. We discuss such algorithms in the frameworks of SP and SOC. Moreover, we will also present extensions of stochastic approximation (a.k.a. stochastic gradient descent) type methods [38, 29, 28, 21] for multistage stochastic optimization, referred to as dynamic stochastic approximation in [24]. From the computational complexity point of view, these two types of methods seem to be complementary to each other in the following sense. Cutting plane type methods can handle multistage problems with a large number of stages, but a relatively smaller number of state (decision) variables. On the other hand, stochastic approximation type methods can only deal with a small number of stages, but a large number of decision variables. These methods share the following common features: (a) both methods utilize the convex structure of the cost-to-go (value) functions of dynamic programming equations, (b) both methods do not require explicit discretization of the state space, (c) both methods guarantee convergence to global optimality, (d) rates of convergence for both methods have been established.

We use the following notation and terminology throughout the paper. For a\in{\mathbb{R}} we denote [a]_{+}:=\max\{0,a\}. Unless stated otherwise, \|\cdot\| denotes the Euclidean norm in {\mathbb{R}}^{n}. By \mathop{\rm dist}(x,S):=\inf_{y\in S}\|x-y\| we denote the distance from a point x\in{\mathbb{R}}^{n} to a set S\subset{\mathbb{R}}^{n}. We write x^{\top}y or \langle x,y\rangle for the scalar product \sum_{i=1}^{n}x_{i}y_{i} of vectors x,y\in{\mathbb{R}}^{n}. It is said that a set S\subset{\mathbb{R}}^{n} is polyhedral if it can be represented by a finite number of affine constraints; it is said that a function f:{\mathbb{R}}^{n}\to{\mathbb{R}} is polyhedral if it can be represented as the maximum of a finite number of affine functions. For a process \xi_{1},\xi_{2},..., we denote by \xi_{[t]}=(\xi_{1},...,\xi_{t}) its history up to time t. By {\mathbb{E}}_{|X}[\,\cdot\,] we denote the conditional expectation, conditional on random variable (random vector) X. We use the same notation \xi_{t} for the random vector and for its particular realization (vector variable); the particular meaning will be clear from the context. For a probability space (\Omega,{\cal F},{\mathbb{P}}), by L_{p}(\Omega,{\cal F},{\mathbb{P}}), p\in[1,\infty), we denote the space of random variables Z:\Omega\to{\mathbb{R}} having finite p-th order moment, i.e., such that \int|Z|^{p}d{\mathbb{P}}<\infty. Equipped with the norm \|Z\|_{p}:=\left(\int|Z|^{p}d{\mathbb{P}}\right)^{1/p}, L_{p}(\Omega,{\cal F},{\mathbb{P}}) becomes a Banach space. The dual of {\cal Z}:=L_{p}(\Omega,{\cal F},{\mathbb{P}}) is the space {\cal Z}^{*}=L_{q}(\Omega,{\cal F},{\mathbb{P}}) with q\in(1,\infty] such that 1/p+1/q=1.

2 Stochastic Programming

In Stochastic Programming (SP) the T-stage optimization problem can be written as

\min\limits_{x_{t}\in{\cal X}_{t}}\;{\mathbb{E}}\Big[\sum_{t=1}^{T}f_{t}(x_{t},\xi_{t})\Big] \quad (2.1)
{\rm s.t.}\;\;B_{t}x_{t-1}+A_{t}x_{t}=b_{t},\;t=1,...,T. \quad (2.2)

Here ft:nt×dtf_{t}:{\mathbb{R}}^{n_{t}}\times{\mathbb{R}}^{d_{t}}\to{\mathbb{R}} are objective functions, ξtdt\xi_{t}\in{\mathbb{R}}^{d_{t}} are random vectors, 𝒳tnt{\cal X}_{t}\subset{\mathbb{R}}^{n_{t}} are nonempty closed sets, Bt=Bt(ξt)B_{t}=B_{t}(\xi_{t}), At=At(ξt)A_{t}=A_{t}(\xi_{t}) and bt=bt(ξt)b_{t}=b_{t}(\xi_{t}) are matrix and vector functions of ξt\xi_{t}, t=1,,Tt=1,...,T, with B1=0B_{1}=0.

Constraints (2.2) often represent some type of balance equations between successive stages, while the set 𝒳t{\cal X}_{t} can be viewed as representing local constraints at time tt. It is possible to consider balance equations in a more general form, nevertheless formulation (2.2) will be sufficient for our purpose of discussing convex problems. In particular, if the objective functions ft(xt,ξt):=ctxtf_{t}(x_{t},\xi_{t}):=c_{t}^{\top}x_{t} are linear, with ct=ct(ξt)c_{t}=c_{t}(\xi_{t}), and the sets 𝒳t{\cal X}_{t} are polyhedral, then problem (2.1) - (2.2) becomes a linear multistage stochastic program.

The sequence of random vectors ξ1,\xi_{1},..., is viewed as a data process. Unless stated otherwise we make the following assumption throughout the paper.

Assumption 2.1

The probability distribution of the random process \xi_{t}, t=1,..., does not depend on our decisions, i.e., it is independent of the chosen decision policy.

At every time period t=1,..., we have information about the (observed) realization \xi_{[t]}=(\xi_{1},...,\xi_{t}) of the data process. Naturally our decisions should be based on the information available at the time of the decision and should not depend on the future unknown values \xi_{t+1},...,\xi_{T}; this is the so-called principle of nonanticipativity. That is, the decision variables x_{t}=x_{t}(\xi_{[t]}) are functions of the data process, and a sequence of such functions for every stage t=1,...,T, is called a policy or a decision rule. The optimization in (2.1) - (2.2) is performed over policies satisfying the feasibility constraints. The feasibility constraints of problem (2.1) - (2.2) should be satisfied with probability one, and the expectation is taken with respect to the probability distribution of the random vector \xi_{[T]}=(\xi_{1},...,\xi_{T}). It could be noted that the dependence of decision variables x_{t}, as well as of the parameters (B_{t},A_{t},b_{t}), on the data process is often suppressed in the notation. It is worthwhile to emphasize that, because of Assumption 2.1, it suffices to consider policies depending only on the history of the data process, without the history of the decisions.

The first stage decision x1x_{1} is of the main interest and assumed to be deterministic, i.e., made before observing future realization of the data process. In that framework vector ξ1\xi_{1} and the corresponding parameters (A1,b1)(A_{1},b_{1}) are deterministic, i.e., their values are known at the time of first stage decision. Also the first stage objective function f1(x1)=f1(x1,ξ1)f_{1}(x_{1})=f_{1}(x_{1},\xi_{1}) is a function of x1x_{1} alone.

We can write the following dynamic programming equations for problem (2.1) - (2.2) (e.g., [47, Section 3.1]). At the last stage the cost-to-go (value) function is

QT(xT1,ξT)=infxT𝒳T{fT(xT,ξT):BTxT1+ATxT=bT}Q_{T}(x_{T-1},\xi_{T})=\inf_{x_{T}\in{\cal X}_{T}}\left\{f_{T}(x_{T},\xi_{T}):B_{T}x_{T-1}+A_{T}x_{T}=b_{T}\right\} (2.3)

and then going backward in time for t=T1,,2t=T-1,...,2, the cost-to-go function is

Qt(xt1,ξ[t])=infxt𝒳t{ft(xt,ξt)+𝔼|ξ[t][Qt+1(xt,ξ[t+1])]:Btxt1+Atxt=bt}.Q_{t}(x_{t-1},\xi_{[t]})=\inf_{x_{t}\in{\cal X}_{t}}\left\{f_{t}(x_{t},\xi_{t})+{\mathbb{E}}_{|\xi_{[t]}}[Q_{t+1}(x_{t},\xi_{[t+1]})]:B_{t}x_{t-1}+A_{t}x_{t}=b_{t}\right\}. (2.4)

The optimal value of problem (2.1) - (2.2) is given by the optimal value of the first stage problem

minx1𝒳1f1(x1)+𝔼[Q2(x1,ξ2)]s.t.A1x1=b1.\min\limits_{x_{1}\in{\cal X}_{1}}f_{1}(x_{1})+{\mathbb{E}}\left[Q_{2}(x_{1},\xi_{2})\right]\;{\rm s.t.}\;A_{1}x_{1}=b_{1}. (2.5)

The optimization in (2.3) - (2.4) is over (nonrandom) variables xtntx_{t}\in{\mathbb{R}}^{n_{t}} (recall that Bt=Bt(ξt)B_{t}=B_{t}(\xi_{t}), At=At(ξt)A_{t}=A_{t}(\xi_{t}) and bt=bt(ξt)b_{t}=b_{t}(\xi_{t}) are functions of ξt\xi_{t}). The corresponding optimal policy is defined by x¯t=x¯t(ξ[t])\bar{x}_{t}=\bar{x}_{t}(\xi_{[t]}), t=2,,Tt=2,...,T, where

x¯targminxt𝒳t{ft(xt,ξt)+𝔼|ξ[t][Qt+1(xt,ξ[t+1])]:Btx¯t1+Atxt=bt},\bar{x}_{t}\in\mathop{\rm arg\,min}_{x_{t}\in{\cal X}_{t}}\left\{f_{t}(x_{t},\xi_{t})+{\mathbb{E}}_{|\xi_{[t]}}[Q_{t+1}(x_{t},\xi_{[t+1]})]:B_{t}\bar{x}_{t-1}+A_{t}x_{t}=b_{t}\right\}, (2.6)

with the expectation term of Q_{T+1}(\cdot,\cdot) at the last stage omitted. Of main interest is the first stage solution given by the optimal solution \bar{x}_{1} of the first stage problem (2.5).

Since x¯t1\bar{x}_{t-1} is a function of ξ[t1]\xi_{[t-1]}, the cost-to-go function Qt(x¯t1,ξ[t])Q_{t}(\bar{x}_{t-1},\xi_{[t]}) can be considered as a function of ξ[t]\xi_{[t]}. It represents the optimal value of problem

minxτ𝒳τ𝔼[τ=t+1Tfτ(xτ,ξτ)]s.t.Bτxτ1+Aτxτ=bτ,τ=t+1,,T,\min\limits_{x_{\tau}\in{\cal X}_{\tau}}{\mathbb{E}}\Big{[}\sum_{\tau=t+1}^{T}f_{\tau}(x_{\tau},\xi_{\tau})\Big{]}\;\;{\rm s.t.}\;B_{\tau}x_{\tau-1}+A_{\tau}x_{\tau}=b_{\tau},\;\tau=t+1,...,T, (2.7)

conditional on realization ξ[t]\xi_{[t]} of the data process. This is the dynamic programming (Bellman) principle of optimality. This is also the motivation for the name “cost-to-go” function. Another name for the cost-to-go functions is “value functions”.

Remark 2.1

Let us recall the following simple result. Consider an extended real valued function ϕ:n×m{+}\phi:{\mathbb{R}}^{n}\times{\mathbb{R}}^{m}\to{\mathbb{R}}\cup\{+\infty\}. We have that if ϕ(x,y)\phi(x,y) is a convex function of (x,y)n×m(x,y)\in{\mathbb{R}}^{n}\times{\mathbb{R}}^{m}, then the min-value function φ(x):=infymϕ(x,y)\varphi(x):=\inf_{y\in{\mathbb{R}}^{m}}\phi(x,y) is also convex. This can be applied to establish convexity of the cost-to-go functions. That is, suppose that the function fT(xT,ξT)f_{T}(x_{T},\xi_{T}) is convex in xTx_{T} and the set 𝒳T{\cal X}_{T} is convex. Then the extended real valued function taking value fT(xT,ξT)f_{T}(x_{T},\xi_{T}) if xT𝒳Tx_{T}\in{\cal X}_{T} and BTxT1+ATxT=bTB_{T}x_{T-1}+A_{T}x_{T}=b_{T}, and value ++\infty otherwise, is convex. Since the cost-to-go function QT(xT1,ξT)Q_{T}(x_{T-1},\xi_{T}) can be viewed as the min-function of this extended real valued function, it follows that it is convex in xT1x_{T-1}. Also note that if the cost-to-go function Qt+1(xt,ξ[t+1])Q_{t+1}(x_{t},\xi_{[t+1]}) is convex in xtx_{t}, then its expected value in (2.4) is also convex. Consequently by induction going backward in time it is not difficult to show the following convexity property of the cost-to-go functions.

Proposition 2.1

If the objective functions ft(xt,ξt)f_{t}(x_{t},\xi_{t}) are convex in xtx_{t} and the sets 𝒳t{\cal X}_{t} are convex, t=1,,Tt=1,...,T, then the cost-to-go functions Qt(xt1,ξ[t])Q_{t}(x_{t-1},\xi_{[t]}), t=2,,Tt=2,...,T, are convex in xt1x_{t-1} and the first stage problem (2.5) is convex. In particular such convexity follows for linear multistage stochastic programs.

This convexity property is crucial for the development of efficient numerical algorithms. It is worthwhile to note that Assumption 2.1 is essential here. Dependence of the distribution of the random data on decisions typically destroys convexity even in the linear case. \hfill\square

The dynamic programming equations reduce the problem from optimization over policies, i.e., over infinite dimensional functional spaces, to evaluation of the cost-to-go functions Q_{t}(x_{t-1},\xi_{[t]}) of finite dimensional vectors. However, the dependence of the cost-to-go functions on the whole history of the data process makes them numerically unmanageable as the number of stages increases. The situation is simplified dramatically if we make the following assumption of stagewise independence.

Assumption 2.2 (Stagewise independence)

The probability distribution of \xi_{t+1} is independent of \xi_{[t]} for t\geq 1.

Under Assumption 2.2 of stagewise independence, the conditional expectations in dynamic equations (2.4) become the corresponding unconditional expectations. It is straightforward to show then by induction, going backward in time, that the dynamic equations (2.4) can be written as

Qt(xt1,ξt)=infxt𝒳t{ft(xt,ξt)+𝒬t+1(xt):Btxt1+Atxt=bt},Q_{t}(x_{t-1},\xi_{t})=\inf_{x_{t}\in{\cal X}_{t}}\left\{f_{t}(x_{t},\xi_{t})+{\cal Q}_{t+1}(x_{t}):B_{t}x_{t-1}+A_{t}x_{t}=b_{t}\right\}, (2.8)

where

𝒬t+1(xt):=𝔼[Qt+1(xt,ξt+1)],{\cal Q}_{t+1}(x_{t}):={\mathbb{E}}[Q_{t+1}(x_{t},\xi_{t+1})], (2.9)

with the expectation taken with respect to the marginal distribution of ξt+1\xi_{t+1}. In that setting one needs to keep track of the (expectation) cost-to-go functions 𝒬t+1(xt){\cal Q}_{t+1}(x_{t}) only. The optimal policy is defined by the equations

x¯targminxt𝒳t{ft(xt,ξt)+𝒬t+1(xt):Btx¯t1+Atxt=bt}.\bar{x}_{t}\in\mathop{\rm arg\,min}_{x_{t}\in{\cal X}_{t}}\left\{f_{t}(x_{t},\xi_{t})+{\cal Q}_{t+1}(x_{t}):B_{t}\bar{x}_{t-1}+A_{t}x_{t}=b_{t}\right\}. (2.10)

Note that here the optimal policy values can be computed iteratively, starting from the optimal solution \bar{x}_{1} of the first stage problem, and then sequentially computing a minimizer on the right hand side of (2.10) going forward in time for t=2,...,T. Of course, this procedure requires availability of the cost-to-go functions {\cal Q}_{t+1}(x_{t}).

It is worthwhile to note that here the optimal policy decision \bar{x}_{t} can be viewed as a function of \bar{x}_{t-1} and \xi_{t}, for t=2,...,T. Of course, \bar{x}_{t-1} is a function of \bar{x}_{t-2} and \xi_{t-1}, and so on. So eventually \bar{x}_{t} can be represented as a function of the history \xi_{[t]} of the data process. Assumption 2.1 was crucial for this conclusion, since then we need to consider only the history of the data process rather than also the history of our decisions. This is an important point; we will return to it later.

Remark 2.2

In the SP setting the process \xi_{t} is viewed as a data process, and the assumption of stagewise independence could be unrealistic. In order to model stagewise dependence we assume the following Markovian property of the data process: the conditional distribution of \xi_{t+1} given \xi_{[t]} is the same as the conditional distribution of \xi_{t+1} given \xi_{t}. In such Markovian setting the dynamic equations (2.4) become

Qt(xt1,ξt)=infxt𝒳t{ft(xt,ξt)+𝔼|ξt[Qt+1(xt,ξt+1)]:Btxt1+Atxt=bt}.Q_{t}(x_{t-1},\xi_{t})=\inf_{x_{t}\in{\cal X}_{t}}\left\{f_{t}(x_{t},\xi_{t})+{\mathbb{E}}_{|\xi_{t}}[Q_{t+1}(x_{t},\xi_{t+1})]:B_{t}x_{t-1}+A_{t}x_{t}=b_{t}\right\}. (2.11)

That is, compared with the stagewise independent case, in the Markovian setting the expected cost-to-go function

𝒬t+1(xt,ξt)=𝔼|ξt[Qt+1(xt,ξt+1)]{\cal Q}_{t+1}(x_{t},\xi_{t})={\mathbb{E}}_{|\xi_{t}}[Q_{t+1}(x_{t},\xi_{t+1})] (2.12)

also depends on ξt\xi_{t}. \hfill\square

3 Stochastic Optimal Control

Consider the classical Stochastic Optimal Control (SOC) (discrete time, finite horizon) model (e.g., Bertsekas and Shreve [4]):

\min\limits_{u_{t}\in{\cal U}_{t}(x_{t})}\;{\mathbb{E}}^{\pi}\Big[\sum_{t=1}^{T}c_{t}(x_{t},u_{t},\xi_{t})+c_{T+1}(x_{T+1})\Big], \quad (3.1)
{\rm s.t.}\;\;x_{t+1}=F_{t}(x_{t},u_{t},\xi_{t}),\;t=1,...,T. \quad (3.2)

Variables x_{t}\in{\mathbb{R}}^{n_{t}}, t=1,...,T+1, represent the state of the system, u_{t}\in{\mathbb{R}}^{m_{t}}, t=1,...,T, are controls, \xi_{t}\in{\mathbb{R}}^{d_{t}}, t=1,...,T, are random vectors, c_{t}:{\mathbb{R}}^{n_{t}}\times{\mathbb{R}}^{m_{t}}\times{\mathbb{R}}^{d_{t}}\to{\mathbb{R}}, t=1,...,T, are cost functions, c_{T+1}(x_{T+1}) is a final cost function, F_{t}:{\mathbb{R}}^{n_{t}}\times{\mathbb{R}}^{m_{t}}\times{\mathbb{R}}^{d_{t}}\to{\mathbb{R}}^{n_{t+1}} are (measurable) mappings and {\cal U}_{t}(x_{t}) is a (measurable) multifunction mapping x_{t}\in{\mathbb{R}}^{n_{t}} to a subset of {\mathbb{R}}^{m_{t}}. Values x_{1} and \xi_{0} are deterministic (initial conditions); it is also possible to view x_{1} as random with a given distribution, but this is not essential for the following discussion. The optimization in (3.1) - (3.2) is performed over policies \pi determined by decisions u_{t} and state variables x_{t}. This is emphasized in the notation {\mathbb{E}}^{\pi}; we will discuss this below.

Unless stated otherwise we assume that probability distribution of the random process ξt\xi_{t} does not depend on our decisions (Assumption 2.1). We can consider problem (3.1)-(3.2) in the framework of SP if we view yt=(xt,ut)y_{t}=(x_{t},u_{t}), t=1,,Tt=1,...,T, as decision variables. Suppose that the mapping Ft(xt,ut,ξt)F_{t}(x_{t},u_{t},\xi_{t}) is affine, i.e.,

Ft(xt,ut,ξt):=Atxt+Btut+bt,t=1,,T,F_{t}(x_{t},u_{t},\xi_{t}):=A_{t}x_{t}+B_{t}u_{t}+b_{t},\;t=1,...,T, (3.3)

where At=At(ξt)A_{t}=A_{t}(\xi_{t}), Bt=Bt(ξt)B_{t}=B_{t}(\xi_{t}) and bt=bt(ξt)b_{t}=b_{t}(\xi_{t}) are matrix and vector valued functions of ξt\xi_{t}. The constraints ut𝒰t(xt)u_{t}\in{\cal U}_{t}(x_{t}) can be viewed as local constraints in the SP framework; in particular if 𝒰t(xt)𝒰t{\cal U}_{t}(x_{t})\equiv{\cal U}_{t} are independent of xtx_{t}, then 𝒰t{\cal U}_{t} can be viewed as a counterpart of the set 𝒳t{\cal X}_{t} in problem (2.1) - (2.2). For the affine mapping of the form (3.3), the state equations (3.2) become

xt+1AtxtBtut=bt,t=1,,T.x_{t+1}-A_{t}x_{t}-B_{t}u_{t}=b_{t},\;t=1,...,T. (3.4)

These state equations are linear in yt=(xt,ut)y_{t}=(x_{t},u_{t}) and can be viewed as a particular case of the balance equations (2.2) of the stochastic program (2.1) - (2.2). Note, however, that here decision utu_{t} should be made before realization of ξt\xi_{t} becomes known, and xtx_{t} and utu_{t} are functions of ξ[t1]\xi_{[t-1]} rather than ξ[t]\xi_{[t]}. We emphasize that xtx_{t} and utu_{t} can be considered as functions of the history ξ[t1]\xi_{[t-1]} of the random process ξt\xi_{t}, without dependence on the history of the decisions. As in the SP framework, this follows from the dynamic programming equations discussed below, and Assumption 2.1 is essential for that conclusion.

There is a time shift in equations (3.4) as compared with the SP equations (2.2). We emphasize that in both the SOC and SP approaches the decisions are made based on the information available at the time of the decision; this is the principle of nonanticipativity. Because of the time shift, the dynamic programming equations for the SOC problem are slightly different from the respective equations in SP. More importantly, in the SOC framework there is a clear separation between the state and control variables. This will have an important consequence for the respective dynamic programming equations, which we consider next.

Similar to SP, the dynamic programming equations for the SOC problem (3.1) - (3.2) can be written as follows. At the last stage, the value function VT+1(xT+1)=cT+1(xT+1)V_{T+1}(x_{T+1})=c_{T+1}(x_{T+1}) and, going backward in time for t=T,,1t=T,...,1, the value functions

Vt(xt,ξ[t1])=infut𝒰t(xt)xt+1=Ft(xt,ut,ξt)𝔼|ξ[t1][ct(xt,ut,ξt)+Vt+1(xt+1,ξ[t])].V_{t}(x_{t},\xi_{[t-1]})=\inf\limits_{{u_{t}\in{\cal U}_{t}(x_{t})}\atop{x_{t+1}=F_{t}(x_{t},u_{t},\xi_{t})}}{\mathbb{E}}_{|\xi_{[t-1]}}\left[c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}\big{(}x_{t+1},\xi_{[t]}\big{)}\right]. (3.5)

The optimization in the right hand side of (3.5) is performed jointly in utu_{t} and xt+1x_{t+1}. The optimal value of the SOC problem (3.1)-(3.2) is given by the first stage value function V1(x1)V_{1}(x_{1}), and can be viewed as a function of the initial conditions x1x_{1}. Note that because of the state equation (3.2), equations (3.5) can be written in the following equivalent form

Vt(xt,ξ[t1])=infut𝒰t(xt)𝔼|ξ[t1][ct(xt,ut,ξt)+Vt+1(Ft(xt,ut,ξt),ξ[t])].V_{t}(x_{t},\xi_{[t-1]})=\inf\limits_{u_{t}\in{\cal U}_{t}(x_{t})}{\mathbb{E}}_{|\xi_{[t-1]}}\left[c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}\big{(}F_{t}(x_{t},u_{t},\xi_{t}),\xi_{[t]}\big{)}\right]. (3.6)

The corresponding optimal policy is defined by

(u¯t,x¯t+1)argminut𝒰t(x¯t)xt+1=Ft(x¯t,ut,ξt)𝔼|ξ[t1][ct(x¯t,ut,ξt)+Vt+1(xt+1,ξ[t])].(\bar{u}_{t},\bar{x}_{t+1})\in\mathop{\rm arg\,min}\limits_{{u_{t}\in{\cal U}_{t}(\bar{x}_{t})}\atop{x_{t+1}=F_{t}(\bar{x}_{t},u_{t},\xi_{t})}}{\mathbb{E}}_{|\xi_{[t-1]}}\left[c_{t}(\bar{x}_{t},u_{t},\xi_{t})+V_{t+1}\big{(}x_{t+1},\xi_{[t]}\big{)}\right]. (3.7)

Similar to SP, it follows that \bar{u}_{t} can be considered as a function of \xi_{[t-1]}, and \bar{x}_{t+1} as a function of \xi_{[t]}. That is, (\bar{u}_{t},\bar{x}_{t}) is a function of \xi_{[t-1]}; this was already mentioned in the previous paragraphs. Again Assumption 2.1 is essential for the above derivations and conclusions.

Remark 3.1

Let us discuss convexity of the value functions. Consider graph of the multifunction 𝒰t(){\cal U}_{t}(\cdot):

𝖦𝗋(𝒰t):={(xt,ut):ut𝒰t(xt),xtnt}.{\sf Gr}({\cal U}_{t}):=\left\{(x_{t},u_{t}):u_{t}\in{\cal U}_{t}(x_{t}),\;x_{t}\in{\mathbb{R}}^{n_{t}}\right\}. (3.8)

Note that minimization minut𝒰t(xt)φ(xt,ut)\min_{u_{t}\in{\cal U}_{t}(x_{t})}\varphi(x_{t},u_{t}) of a function φ:nt×mt{+}\varphi:{\mathbb{R}}^{n_{t}}\times{\mathbb{R}}^{m_{t}}\to{\mathbb{R}}\cup\{+\infty\} can equivalently be written as minut𝒰t(xt)φ^(xt,ut)\min_{u_{t}\in{\cal U}_{t}(x_{t})}\hat{\varphi}(x_{t},u_{t}), where function φ^(xt,ut)\hat{\varphi}(x_{t},u_{t}) coincides with φ(xt,ut)\varphi(x_{t},u_{t}) when ut𝒰t(xt)u_{t}\in{\cal U}_{t}(x_{t}), and is ++\infty otherwise. Of course, the constraint ut𝒰t(xt)u_{t}\in{\cal U}_{t}(x_{t}) is the same as (xt,ut)𝖦𝗋(𝒰t)(x_{t},u_{t})\in{\sf Gr}({\cal U}_{t}). If function φ(xt,ut)\varphi(x_{t},u_{t}) is convex and the set 𝖦𝗋(𝒰t){\sf Gr}({\cal U}_{t}) is convex, then the corresponding function φ^(xt,ut)\hat{\varphi}(x_{t},u_{t}) is convex. Consequently in that case the min-function minut𝒰t(xt)φ(xt,ut)\min_{u_{t}\in{\cal U}_{t}(x_{t})}\varphi(x_{t},u_{t}) is a convex function of xtx_{t}. By applying this observation and using induction going backward in time, it is not difficult to show that value functions Vt(xt,ξ[t1])V_{t}\big{(}x_{t},\xi_{[t-1]}\big{)} are convex in xtx_{t}, t=1,,Tt=1,...,T, if the following assumption holds.

Assumption 3.1

The function cT+1()c_{T+1}(\cdot) is convex, and for every t=1,,T,t=1,...,T, the cost function ct(xt,ut,ξt)c_{t}(x_{t},u_{t},\xi_{t}) is convex in (xt,ut)(x_{t},u_{t}), the mapping FtF_{t} is affine of the form (3.3), the graph 𝖦𝗋(𝒰t){\sf Gr}({\cal U}_{t}) is a convex subset of nt×mt{\mathbb{R}}^{n_{t}}\times{\mathbb{R}}^{m_{t}}, t=1,,Tt=1,...,T.

Note that, in particular, 𝖦𝗋(𝒰t){\sf Gr}({\cal U}_{t}) is convex if 𝒰t(xt)𝒰t{\cal U}_{t}(x_{t})\equiv{\cal U}_{t} is independent of xtx_{t} and the set 𝒰t{\cal U}_{t} is convex. When 𝒰t(xt){\cal U}_{t}(x_{t}) is dependent on xtx_{t}, we need some constructive way to represent it in order to ensure convexity of 𝖦𝗋(𝒰t){\sf Gr}({\cal U}_{t}). For example suppose that the feasibility sets are defined by inequality constraints as:

𝒰t(xt):={utmt:gt(xt,ut)0},{\cal U}_{t}(x_{t}):=\{u_{t}\in{\mathbb{R}}^{m_{t}}:g_{t}(x_{t},u_{t})\leq 0\}, (3.9)

where gt:nt×mttg_{t}:{\mathbb{R}}^{n_{t}}\times{\mathbb{R}}^{m_{t}}\to{\mathbb{R}}^{\ell_{t}}. Then 𝖦𝗋(𝒰t)={(xt,ut):gt(xt,ut)0}{\sf Gr}({\cal U}_{t})=\{(x_{t},u_{t}):g_{t}(x_{t},u_{t})\leq 0\}. Consequently if every component of mapping gt(xt,ut)g_{t}(x_{t},u_{t}) is a convex function, then 𝖦𝗋(𝒰t){\sf Gr}({\cal U}_{t}) is convex. \hfill\square

In the SOC framework the random process \xi_{1},..., is often viewed as noise or disturbances (note that here \xi_{1} is also random). In such cases the assumption of stagewise independence (Assumption 2.2) is natural. In the stagewise independent case the dynamic programming equations (3.6) simplify, with the value functions V_{t}(x_{t}), t=1,...,T, being functions of the state variables only, that is

Vt(xt)=infut𝒰t(xt)𝔼[ct(xt,ut,ξt)+Vt+1(Ft(xt,ut,ξt))],V_{t}(x_{t})=\inf\limits_{u_{t}\in{\cal U}_{t}(x_{t})}{\mathbb{E}}\left[c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}\big{(}F_{t}(x_{t},u_{t},\xi_{t})\big{)}\right], (3.10)

where the expectation is taken with respect to the marginal distribution of ξt\xi_{t}. Consequently the optimal policy is defined by the optimal controls u¯t=πt(xt)\bar{u}_{t}=\pi_{t}(x_{t}), where

u¯targminut𝒰t(xt)𝔼[ct(xt,ut,ξt)+Vt+1(Ft(xt,ut,ξt))].\bar{u}_{t}\in\mathop{\rm arg\,min}_{u_{t}\in{\cal U}_{t}(x_{t})}{\mathbb{E}}\left[c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}\big{(}F_{t}(x_{t},u_{t},\xi_{t})\big{)}\right]. (3.11)

That is, in the stagewise independent case it suffices to consider policies with controls of the form u_{t}=\pi_{t}(x_{t}).

Remark 3.2

Suppose that the distribution of \xi_{t} can depend on (x_{t},u_{t}), i.e., the probability distribution P_{t}(\cdot|x_{t},u_{t}) of \xi_{t} is conditional on (x_{t},u_{t}). That is, the distribution of the random process is affected by our decisions and consequently Assumption 2.1 is not satisfied. Under the stagewise independence Assumption 2.2, the dynamic programming equations (3.10) still hold and the optimal policy is determined by (3.11). However in that case \bar{u}_{t} depends on the history (\bar{x}_{1},\bar{u}_{1},...,\bar{u}_{t-1},\bar{x}_{t}) of the decision process. This history depends on the decisions (on the considered policy) and cannot be represented in terms of the process \xi_{t} alone. It could also be noted that in the stagewise independence case, SOC can be formulated as a Markov Decision Process (MDP) with transition probabilities P_{t}(\cdot|x_{t},u_{t}) determined by the state equations (3.2). We will not pursue the MDP formulation in the following discussion, and will rather deal with the SOC model instead. \hfill\square

As an example let us consider the classical inventory problem.

Example 3.1 (Inventory problem)

Consider problem [54],

\min\limits_{u_{t}\geq 0}\;{\mathbb{E}}\Big[\sum_{t=1}^{T}c_{t}u_{t}+\psi_{t}(x_{t}+u_{t},D_{t})\Big] \quad (3.12)
{\rm s.t.}\;\;x_{t+1}=x_{t}+u_{t}-D_{t},\;t=1,...,T, \quad (3.13)

where ct,bt,htc_{t},b_{t},h_{t} are the ordering cost, backorder penalty cost and holding cost per unit, respectively, xtx_{t} is the current inventory level, utu_{t} is the order quantity, DtD_{t} is the demand at time t=1,,Tt=1,...,T, and

ψt(yt,dt):=bt[dtyt]++ht[ytdt]+.\psi_{t}(y_{t},d_{t}):=b_{t}[d_{t}-y_{t}]_{+}+h_{t}[y_{t}-d_{t}]_{+}.

In that format x_{t} are the state variables, u_{t} are the controls, D_{t} are the random disturbances, and {\cal U}_{t}:={\mathbb{R}}_{+} is the feasibility set (independent of the state variables). At time t=1 the inventory level x_{1} and demand value D_{0} are known (initial conditions). The decision about u_{t} at stage t=1,...,T, should be made before a realization of the demand D_{t} becomes known. That is, at time t the history D_{[t-1]} of the demand process is available, but the future realizations D_{t},...,D_{T} are not known.

By making the change of variables y_{t}=x_{t}+u_{t}, problem (3.12) - (3.13) is often written in the following equivalent form

\min\limits_{y_{t}\geq x_{t}}\;{\mathbb{E}}\Big[\sum_{t=1}^{T}c_{t}(y_{t}-x_{t})+\psi_{t}(y_{t},D_{t})\Big] \quad (3.14)
{\rm s.t.}\;\;x_{t+1}=y_{t}-D_{t},\;t=1,...,T. \quad (3.15)

Problem (3.14) -(3.15) can be viewed as an SP problem in decision variables yty_{t} and xtx_{t}.

It is convenient to write the dynamic programming equations for the inventory problem in the form (3.14) - (3.15); of course, this can be reformulated for the form (3.12) - (3.13) as well. As before we assume here that the distribution of the demand process does not depend on our decisions. At stage t=T,...,1, the value functions satisfy the following equation

Vt(xt,D[t1])=infytxt{ct(ytxt)+𝔼|D[t1][ψt(yt,Dt)+Vt+1(ytDt,D[t])]},V_{t}(x_{t},D_{[t-1]})=\inf\limits_{y_{t}\geq x_{t}}\left\{c_{t}(y_{t}-x_{t})+{\mathbb{E}}_{|D_{[t-1]}}\big{[}\psi_{t}(y_{t},D_{t})+V_{t+1}\left(y_{t}-D_{t},D_{[t]}\right)\big{]}\right\}, (3.16)

with VT+1()0V_{T+1}(\cdot)\equiv 0. The objective functions ct(ytxt)+ψt(yt,Dt)c_{t}(y_{t}-x_{t})+\psi_{t}(y_{t},D_{t}) are convex in (xt,yt)(x_{t},y_{t}) and hence value functions Vt(xt,D[t1])V_{t}(x_{t},D_{[t-1]}) are convex in xtx_{t} (see Remark 3.1).

The optimal policy of how much to order at stage tt is given by u¯t=y¯tx¯t\bar{u}_{t}=\bar{y}_{t}-\bar{x}_{t}, where

y¯targminytx¯t{ct(ytx¯t)+𝔼|D[t1][ψt(yt,Dt)+Vt+1(ytDt,D[t])]}.\bar{y}_{t}\in\mathop{\rm arg\,min}\limits_{y_{t}\geq\bar{x}_{t}}\left\{c_{t}(y_{t}-\bar{x}_{t})+{\mathbb{E}}_{|D_{[t-1]}}\big{[}\psi_{t}(y_{t},D_{t})+V_{t+1}\left(y_{t}-D_{t},D_{[t]}\right)\big{]}\right\}. (3.17)

That is, starting with the initial value \bar{x}_{1}=x_{1}, the value \bar{y}_{1} is computed as an optimal solution of the minimization problem on the right hand side of (3.17) for t=1 (at the first stage, since D_{0} is deterministic, the conditional expectation in (3.17) is just the unconditional expectation with respect to D_{1}). Consequently the next inventory level \bar{x}_{2}=\bar{y}_{1}-D_{1} is computed given an observed realization of D_{1}. Then \bar{y}_{2} is computed as an optimal solution of the minimization problem on the right hand side of (3.17) for t=2, the inventory level is updated as \bar{x}_{3}=\bar{y}_{2}-D_{2}, and so forth going forward in time. This defines an optimal policy with the inventory level and order quantity at stage t being functions of the observed realization of the history D_{[t-1]} of the demand process.

If the demand process DtD_{t} is stagewise independent, then value functions Vt(xt)V_{t}(x_{t}) do not depend on the history of the demand process, and equations (3.16) become

Vt(xt)=infytxt{ct(ytxt)+𝔼[ψt(yt,Dt)+Vt+1(ytDt)]},V_{t}(x_{t})=\inf\limits_{y_{t}\geq x_{t}}\left\{c_{t}(y_{t}-x_{t})+{\mathbb{E}}\big{[}\psi_{t}(y_{t},D_{t})+V_{t+1}\left(y_{t}-D_{t}\right)\big{]}\right\}, (3.18)

with the expectations taken with respect to the marginal distributions of the demand process. By using convexity of the objective function it is possible to show that in the stagewise independence case the so-called basestock policy is optimal. That is, optimal control (optimal order quantity) u¯t\bar{u}_{t} is a function of state xtx_{t}, and is given by

u¯t=max{xt,yt}xt=[ytxt]+,\bar{u}_{t}=\max\{x_{t},y^{*}_{t}\}-x_{t}=[y^{*}_{t}-x_{t}]_{+}, (3.19)

where yty^{*}_{t} is the (unconstrained) minimizer

ytargminyt{ctyt+𝔼[ψt(yt,Dt)+Vt+1(ytDt)]}.y^{*}_{t}\in\mathop{\rm arg\,min}\limits_{y_{t}}\left\{c_{t}y_{t}+{\mathbb{E}}\big{[}\psi_{t}(y_{t},D_{t})+V_{t+1}\left(y_{t}-D_{t}\right)\big{]}\right\}. (3.20)

That is, if at stage t the current inventory level x_{t} is greater than or equal to y^{*}_{t}, then order nothing; otherwise order y^{*}_{t}-x_{t}. Of course, computation of the (critical) values y^{*}_{t} is based on availability of the value functions V_{t}(\cdot). \hfill\square
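To make this concrete, the following sketch computes the last-stage basestock level by minimizing (3.20) with V_{T+1}\equiv 0 over a grid of candidate order-up-to levels, with the expectation replaced by a sample average; the cost parameters and the gamma demand distribution are hypothetical choices made only for this illustration.

```python
import numpy as np

# Sketch of the last-stage basestock computation (3.19)-(3.20) for the
# inventory problem, with V_{T+1} = 0.  The cost parameters and the demand
# distribution are hypothetical, and the expectation in (3.20) is replaced by
# a sample average over demand scenarios.
rng = np.random.default_rng(2)
c, b, h = 1.0, 4.0, 0.5                                  # ordering, backorder, holding costs
demand = rng.gamma(shape=4.0, scale=5.0, size=20_000)    # sample of D_T

def stage_cost(y):
    """Sample estimate of c*y + E[ psi_T(y, D_T) ] for an order-up-to level y."""
    return c * y + np.mean(b * np.maximum(demand - y, 0.0)
                           + h * np.maximum(y - demand, 0.0))

# Approximate minimizer y* of (3.20) by a search over a grid of candidate levels.
grid = np.linspace(0.0, demand.max(), 2_000)
y_star = grid[np.argmin([stage_cost(y) for y in grid])]

# For this newsvendor-type cost, y* is the (b-c)/(b+h) quantile of the demand,
# which gives a quick sanity check of the sample-based minimizer.
print(y_star, np.quantile(demand, (b - c) / (b + h)))

# Basestock policy (3.19): order nothing if x_T >= y*, otherwise order up to y*.
for x in [5.0, 15.0, 40.0]:
    print("inventory", x, "order", max(y_star - x, 0.0))
```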

4 Risk Averse and Distributionally Robust Optimization

In the risk averse approach the expectation functional is replaced by a risk measure. Let (Ω,,)(\Omega,{\cal F},{\mathbb{P}}) be a probability space and 𝒵{\cal Z} be a linear space of measurable functions (random variables) Z:ΩZ:\Omega\to{\mathbb{R}}. Risk measure {\cal R} is a functional assigning a number to random variable Z𝒵Z\in{\cal Z}, that is :𝒵{\cal R}:{\cal Z}\to{\mathbb{R}}. In the risk neutral case, {\cal R} is the expectation operator, i.e., (Z):=𝔼[Z]=ΩZ(ω)𝑑(ω){\cal R}(Z):={\mathbb{E}}[Z]=\int_{\Omega}Z(\omega)d{\mathbb{P}}(\omega). In order for this expectation to be well defined we have to restrict the space of considered random variables; for the expectation it is natural to take the space of integrable functions, i.e., 𝒵:=L1(Ω,,){\cal Z}:=L_{1}(\Omega,{\cal F},{\mathbb{P}}).

It was suggested in Artzner et al [1] that reasonable risk measures should satisfy the following axioms: subadditivity, monotonicity, translation equivariance, positive homogeneity. Risk measures satisfying these axioms were called coherent. For a thorough discussion of coherent risk measures we can refer to [11],[47]. It is possible to show that if the space 𝒵{\cal Z} is a Banach lattice, in particular if 𝒵=Lp(Ω,,){\cal Z}=L_{p}(\Omega,{\cal F},{\mathbb{P}}), then a (real valued) coherent risk measure :𝒵{\cal R}:{\cal Z}\to{\mathbb{R}} is continuous in the norm topology of 𝒵{\cal Z} (cf., [43, Proposition 3.1]). It follows then by the Fenchel - Moreau theorem that {\cal R} has the following dual representation

(Z)=supζ𝔄ΩZ(ω)ζ(ω)𝑑(ω),Z𝒵,{\cal R}(Z)=\sup_{\zeta\in{\mathfrak{A}}}\int_{\Omega}Z(\omega)\zeta(\omega)d{\mathbb{P}}(\omega),\;\;Z\in{\cal Z}, (4.1)

where 𝔄{\mathfrak{A}} is a convex weakly compact subset of 𝒵{\cal Z}^{*}. Moreover, the axioms of coherent risk measures imply that every ζ𝔄\zeta\in{\mathfrak{A}} is a density function, i.e., almost surely ζ(ω)0\zeta(\omega)\geq 0 and Ωζ(ω)𝑑(ω)=1\int_{\Omega}\zeta(\omega)d{\mathbb{P}}(\omega)=1 (cf., [43, Theorem 2.2]).

An important example of coherent risk measure is the Average Value-at-Risk:

{\sf AV@R}_{\alpha}(Z) := (1-\alpha)^{-1}\int_{\alpha}^{1}F^{-1}_{Z}(\tau)d\tau \quad (4.2)
= \inf_{\theta\in{\mathbb{R}}}\left\{\theta+(1-\alpha)^{-1}{\mathbb{E}}[Z-\theta]_{+}\right\},\;\;\alpha\in(0,1] \quad (4.3)

where FZ(z):=(Zz)F_{Z}(z):={\mathbb{P}}(Z\leq z) is the cumulative distribution function (cdf) of ZZ and FZ1(τ):=inf{z:FZ(z)τ}F^{-1}_{Z}(\tau):=\inf\{z:F_{Z}(z)\geq\tau\} is the respective quantile. It is natural in this example to use the space 𝒵=L1(Ω,,){\cal Z}=L_{1}(\Omega,{\cal F},{\mathbb{P}}) and its dual 𝒵=L(Ω,,){\cal Z}^{*}=L_{\infty}(\Omega,{\cal F},{\mathbb{P}}). In various equivalent forms this risk measure was introduced in different contexts by different authors under different names, such as Expected Shortfall, Expected Tail Loss, Conditional Value-at-Risk; variational form (4.3) was suggested in [32],[41]. In the dual form, it has representation (4.1) with

𝔄={ζ𝒵:0ζ(1α)1,Ωζ𝑑=1}.\begin{array}[]{l}{\mathfrak{A}}=\left\{\zeta\in{\cal Z}^{*}:0\leq\zeta\leq(1-\alpha)^{-1},\;\int_{\Omega}\zeta d{\mathbb{P}}=1\right\}.\end{array}
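For a finite sample these two expressions can be computed directly. The following sketch (with an arbitrary normal sample and an arbitrary level \alpha, chosen only for illustration) evaluates {\sf AV@R}_{\alpha} both via the variational form (4.3), whose infimum is attained at an \alpha-quantile of Z, and via the tail-average form (4.2).

```python
import numpy as np

# Sketch of the Average Value-at-Risk of a finite sample, computed via the
# variational formula (4.3) and via the tail-average form (4.2).  The sample
# and the level alpha are arbitrary choices for illustration.
rng = np.random.default_rng(1)
z = rng.normal(size=10_000)          # sample of the random variable Z
alpha = 0.9

def avar_variational(z, alpha):
    """AV@R as inf_theta { theta + (1-alpha)^{-1} E[Z - theta]_+ };
    for a finite sample the infimum is attained at an alpha-quantile of Z."""
    theta = np.quantile(z, alpha)
    return theta + np.mean(np.maximum(z - theta, 0.0)) / (1.0 - alpha)

def avar_tail(z, alpha):
    """AV@R as the average of F_Z^{-1}(tau) over tau in [alpha, 1]; for an
    empirical distribution with n*alpha integer this is the mean of the
    largest n*(1-alpha) observations."""
    k = int(np.ceil(len(z) * (1.0 - alpha)))
    return np.sort(z)[-k:].mean()

print(avar_variational(z, alpha), avar_tail(z, alpha))   # the two values agree
```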

Representation (4.1) can be viewed from the distributionally robust point of view. Recall that if P is a probability measure on (\Omega,{\cal F}) which is absolutely continuous with respect to {\mathbb{P}}, then by the Radon - Nikodym Theorem it has density \zeta=dP/d{\mathbb{P}}. Consider the set {\mathfrak{M}} of probability measures P, absolutely continuous with respect to {\mathbb{P}}, defined as {\mathfrak{M}}:=\{P:dP/d{\mathbb{P}}\in{\mathfrak{A}}\}. Then representation (4.1) of the functional {\cal R} can be written as (by writing {\mathbb{E}}_{P}[Z] we emphasize that the expectation is taken with respect to the probability measure (probability distribution) P)

(Z)=supP𝔐𝔼P[Z].{\cal R}(Z)=\sup_{P\in{\mathfrak{M}}}{\mathbb{E}}_{P}[Z]. (4.4)

In Distributionally Robust Optimization (DRO) it is argued that the “true” distribution is not known, and consequently a set {\mathfrak{M}} of probability measures (distributions) is constructed in some way and referred to as the ambiguity set. In that framework we can view (4.4) as the definition of the corresponding functional. It is straightforward to verify that the obtained functional {\cal R}, defined on an appropriate space of random variables, satisfies the axioms of coherent risk measures. So we have a certain duality relation between the risk averse and distributionally robust approaches to stochastic optimization. However, there is an essential difference in the way the corresponding ambiguity set {\mathfrak{M}} is constructed. In the risk averse approach the set {\mathfrak{M}} consists of probability measures absolutely continuous with respect to a specified (reference) measure {\mathbb{P}}. On the other hand, in the DRO setting such a dominating measure may not be naturally defined. This happens, for instance, in popular settings where the ambiguity set is defined by moment constraints or is the set of probability measures within a prescribed Wasserstein distance from the empirical distribution defined by the data.

In order to extend the risk averse and distributionally robust optimization to the dynamical setting of multistage stochastic optimization, we need to construct a conditional counterpart of the respective functional {\cal R}. In writing dynamic equations (2.4) and (3.6) we used conditional expectations in a somewhat informal manner. For the expectation operator there is a classical definition of conditional expectation. That is, let PP be a probability measure and 𝒢{\cal G} be a sigma subalgebra of {\cal F}. It is said that random variable, denoted 𝔼|𝒢[Z]{\mathbb{E}}_{|{\cal G}}[Z], is the conditional expected value of random variable ZZ given 𝒢{\cal G} if the following two properties hold: (i) 𝔼|𝒢[Z]:Ω{\mathbb{E}}_{|{\cal G}}[Z]:\Omega\to{\mathbb{R}} is 𝒢{\cal G}-measurable, (ii) A𝔼|𝒢[Z](ω)𝑑P(ω)=AZ(ω)𝑑P(ω)\int_{A}{\mathbb{E}}_{|{\cal G}}[Z](\omega)dP(\omega)=\int_{A}Z(\omega)dP(\omega) for any A𝒢A\in{\cal G}. There are many versions of 𝔼|𝒢[Z](ω){\mathbb{E}}_{|{\cal G}}[Z](\omega) which can differ from each other on sets of PP-measure zero. The conditional expectation depends on measure PP, therefore we sometimes write 𝔼P|𝒢[Z]{\mathbb{E}}_{P|{\cal G}}[Z] to emphasize this when various probability measures are considered.

It seems natural to define the conditional counterpart of the distributionally robust functional, defined in (4.4), as \sup_{P\in{\mathfrak{M}}}{\mathbb{E}}_{P|{\cal G}}[Z]. However there are technical issues with the precise meaning of such a definition since there are different versions of conditional expectations related to different probability measures P\in{\mathfrak{M}}. Even more importantly, there are conceptual problems with such a definition, and in fact this is not how conditional risk measures are defined (cf., [34]). In the risk averse setting there is a natural concept of law invariance. For a random variable Z its cumulative distribution function (cdf), with respect to the reference probability measure {\mathbb{P}}, is F_{Z}(z):={\mathbb{P}}(Z\leq z), z\in{\mathbb{R}}. It is said that random variables Z and Z^{\prime} are distributionally equivalent if their cumulative distribution functions F_{Z}(z) and F_{Z^{\prime}}(z) coincide for all z\in{\mathbb{R}}. It is said that a risk measure {\cal R} is law invariant if {\cal R}(Z)={\cal R}(Z^{\prime}) for any distributionally equivalent Z and Z^{\prime}. For example the Average Value-at-Risk measure {\cal R}={\sf AV@R}_{\alpha} is law invariant.

Note that law invariance is defined with respect to the reference measure {\mathbb{P}}. Law invariant coherent risk measure (Z){\cal R}(Z) can be considered as a function of the cumulative distribution function FZF_{Z}, and consequently its conditional counterpart can be defined as the respective function of the conditional counterpart of FZF_{Z}; this definition can be made precise. On the other hand, in the DRO setting where there is no dominating probability measure and the concept of law invariance is not applicable, it is not completely clear how to define a conditional counterpart of functional {\cal R} defined in (4.4), although an attempt was made in [34]. See Remark 5.6 below for a further discussion of law invariance.

In the risk neutral setting we were particularly interested in the stagewise independent case (Assumption 2.2). Under that assumption the dynamic programming equations simplify to (2.8) - (2.9) in the SP and to (3.10) in the SOC frameworks, respectively. Natural counterparts of these equations are obtained by replacing the expectation operator with a coherent risk measure. That is, in the risk averse SP framework the counterpart of (2.9) is

𝒬t+1(xt):=t+1[Qt+1(xt,ξt+1)],t=1,,T1,{\cal Q}_{t+1}(x_{t}):={\cal R}_{t+1}[Q_{t+1}(x_{t},\xi_{t+1})],\;t=1,...,T-1, (4.5)

where t+1{\cal R}_{t+1} is a coherent risk measure applied to the random variable Qt+1(xt,ξt+1)Q_{t+1}(x_{t},\xi_{t+1}). For example,

t():=(1λt)𝔼[]+λt𝖠𝖵@𝖱αt(),λt[0,1],{\cal R}_{t}(\cdot):=(1-\lambda_{t}){\mathbb{E}}[\,\cdot\,]+\lambda_{t}{\sf AV@R}_{\alpha_{t}}(\cdot),\;\;\lambda_{t}\in[0,1], (4.6)

is a law invariant coherent risk measure representing a convex combination of the expectation and the Average Value-at-Risk. Such a choice of risk measure aims at a compromise between minimizing the cost on average and controlling the risk of upper deviations from the average (upper quantiles of the cost) at every stage of the process. In applications the parameters \lambda_{t} and \alpha_{t} are often constants, independent of the stage t, i.e., the risk measure {\cal R}_{t}={\cal R} is the same for every stage (cf., [51]).

Remark 4.1

The dynamic programming equations, with (4.5) replacing (2.9), correspond to the nested formulation of the risk averse multistage SP problem (cf., [42],[47, Section 6.5]). In the nested risk averse multistage optimization the risk is controlled at every stage of the process, and the total nested value of the cost does not have a practical interpretation. Nevertheless, when solving the corresponding risk averse optimization problem, it would be useful to estimate its optimal value for the purpose of evaluating the error of the computed solution. \hfill\square

In the DRO setting the counterpart of equations (2.9) can be written as

𝒬t+1(xt):=supP𝔐t+1𝔼P[Qt+1(xt,ξt+1)],{\cal Q}_{t+1}(x_{t}):=\sup_{P\in{\mathfrak{M}}_{t+1}}{\mathbb{E}}_{P}[Q_{t+1}(x_{t},\xi_{t+1})], (4.7)

where {\mathfrak{M}}_{t+1} is a set of probability distributions of the vector \xi_{t+1}. Note that by the convexity and monotonicity properties of coherent risk measures, convexity of the cost-to-go functions is preserved here. That is, convexity of Q_{t+1}(\cdot,\xi_{t+1}) implies convexity of {\cal Q}_{t+1}(\cdot) on the left hand side of equations (4.5) and (4.7). This in turn implies convexity of Q_{t}(\cdot,\xi_{t}) on the left hand side of equation (2.8), provided the function f_{t}(\cdot,\xi_{t}) is convex and the set {\cal X}_{t} is convex. Therefore, similarly to the risk neutral case (compare with Proposition 2.1), we have the following convexity property.

Proposition 4.1

If the objective functions ft(xt,ξt)f_{t}(x_{t},\xi_{t}) are convex in xtx_{t} and the sets 𝒳t{\cal X}_{t} are convex for t=1,,Tt=1,...,T, then the cost-to-go functions Qt+1(xt,ξt+1)Q_{t+1}(x_{t},\xi_{t+1}) and 𝒬t+1(xt){\cal Q}_{t+1}(x_{t}) are convex in xtx_{t}. In particular such convexity of the cost-to-go functions follows for risk averse linear multistage stochastic programs.

Similar considerations apply to the SOC framework, with the risk averse counterpart of equations (3.10) written as

Vt(xt)=infut𝒰t(xt)t[ct(xt,ut,ξt)+Vt+1(Ft(xt,ut,ξt))],V_{t}(x_{t})=\inf\limits_{u_{t}\in{\cal U}_{t}(x_{t})}{\cal R}_{t}\left[c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}\big{(}F_{t}(x_{t},u_{t},\xi_{t})\big{)}\right], (4.8)

where t{\cal R}_{t}, t=1,,Tt=1,...,T, are coherent risk measures. Again these dynamic equations correspond to the nested formulation of the respective SOC problem. Similar to the risk neutral case, the value functions Vt(xt)V_{t}(x_{t}) are convex if Assumption 3.1 holds.

5 Dynamic Cutting Planes Algorithms

The basic idea of cutting planes algorithms for solving convex problems is approximation of the (convex) objective function by cutting planes. That is, let f:n{+}f:{\mathbb{R}}^{n}\to{\mathbb{R}}\cup\{+\infty\} be an extended real valued function. Its domain is dom(f):={xn:f(x)<+}{\rm dom}(f):=\{x\in{\mathbb{R}}^{n}:f(x)<+\infty\}. We say that an affine function (x)=α+βx\ell(x)=\alpha+\beta^{\top}x is a cutting plane of f(x)f(x) if f(x)(x)f(x)\geq\ell(x) for all xnx\in{\mathbb{R}}^{n}. If moreover f(x¯)=(x¯)f(\bar{x})=\ell(\bar{x}) for some x¯dom(f)\bar{x}\in{\rm dom}(f), then (x)\ell(x) is said to be a supporting plane of f(x)f(x), at the point x¯\bar{x}. When the function ff is convex, its supporting plane at a point x¯dom(f)\bar{x}\in{\rm dom}(f) is given by (x)=f(x¯)+g(xx¯)\ell(x)=f(\bar{x})+g^{\top}(x-\bar{x}), where gg is a subgradient of ff at x¯\bar{x}. The set of all subgradients, at a point xdom(f)x\in{\rm dom}(f), is called the subdifferential of ff at xx and denoted f(x)\partial f(x). The subdifferential of a convex function at every interior point of its domain is nonempty, [40, Theorem 23.4].
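As a small illustration of this construction (not taken from the paper), the following Python sketch maintains a max-of-affine cutting plane model of the convex function f(x)=x^{2}; the function and the trial points at which supporting planes are generated are hypothetical choices made only for this example.

```python
import numpy as np

# Minimal illustration of a cutting-plane (max-of-affine) model of a convex
# function.  The function f(x) = x**2 and the trial points below are
# hypothetical choices made only for this sketch.

def f(x):            # convex function to be approximated
    return x ** 2

def f_subgrad(x):    # a subgradient of f at x (here f is differentiable)
    return 2 * x

cuts = []            # list of (alpha, beta) representing l(x) = alpha + beta * x

def add_cut(x_bar):
    """Add the supporting plane of f at x_bar: l(x) = f(x_bar) + g*(x - x_bar)."""
    g = f_subgrad(x_bar)
    cuts.append((f(x_bar) - g * x_bar, g))

def model(x):
    """Current cutting-plane approximation: the pointwise maximum of all cuts."""
    return max(alpha + beta * x for alpha, beta in cuts)

for x_bar in [-2.0, 0.5, 1.5]:        # trial points where cuts are generated
    add_cut(x_bar)

for x in [-1.0, 0.0, 1.0]:
    assert model(x) <= f(x) + 1e-12   # the model is a lower bound on f
    print(x, model(x), f(x))
```

By construction the model never exceeds f; this lower bounding property is what the dynamic cutting planes algorithms discussed below exploit.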

For multistage linear SP problems a cutting planes type algorithm, called Stochastic Dual Dynamic Programming (SDDP), was introduced in Pereira and Pinto [31], building on the nested decomposition algorithm of Birge [5]. The SDDP algorithm moves backward and forward in time in order to build piecewise linear approximations of the cost-to-go functions by their cutting planes.

Remark 5.1

Let us recall how to compute a subgradient of the optimal value function of a linear program. That is, let Q(x)Q(x) be the optimal value of linear program

miny0cys.t.Bx+Ay=b,\min_{y\geq 0}c^{\top}y\;\;{\rm s.t.}\;\;Bx+Ay=b, (5.1)

considered as a function of the vector x (if for some x problem (5.1) does not have a feasible solution, then Q(x):=+\infty). The dual of (5.1) is the linear program

maxλλ(bBx)s.t.Aλc.\max_{\lambda}\lambda^{\top}(b-Bx)\;\;{\rm s.t.}\;\;A^{\top}\lambda\leq c. (5.2)

Function Q(\cdot) is piecewise linear convex, and its subgradient at a point x where Q(x) is finite is given by -B^{\top}\bar{\lambda}, with \bar{\lambda} being an optimal solution of the dual problem (5.2) (e.g., [47, Section 2.1.1]). That is, a subgradient of the optimal value function of a linear program is computed by solving the dual of that linear program. A more general setting amenable to a linear programming formulation is when the objective function of (5.1) is polyhedral. In such cases a subgradient of the respective optimal value function can be computed by solving the dual of the obtained linear program. This principle is applied in the SDDP algorithm to the approximation of the cost-to-go functions by their cutting planes in a dynamical setting. This motivates the “Dual Dynamic” part in the name of the algorithm. \hfill\square
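The following sketch illustrates Remark 5.1 numerically under hypothetical data: it evaluates Q(x) with a generic linear programming solver, obtains the subgradient -B^{\top}\bar{\lambda} from an optimal solution of the dual (5.2), and checks the subgradient inequality at a few points.

```python
import numpy as np
from scipy.optimize import linprog

# Small numerical illustration of Remark 5.1: a subgradient of the value
# function Q(x) = min_{y>=0} {c^T y : Bx + Ay = b} is obtained from an optimal
# solution of the dual LP (5.2).  The data below are hypothetical.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
B = np.eye(2)
b = np.array([4.0, 5.0])
c = np.array([1.0, 2.0, 4.0])

def Q(x):
    """Optimal value of the primal LP (5.1) for a given x."""
    res = linprog(c, A_eq=A, b_eq=b - B @ x, bounds=(0, None), method="highs")
    return res.fun

def subgrad_Q(x):
    """Solve the dual LP (5.2): max lambda^T (b - Bx) s.t. A^T lambda <= c."""
    res = linprog(-(b - B @ x), A_ub=A.T, b_ub=c,
                  bounds=(None, None), method="highs")
    lam = res.x
    return -B.T @ lam            # subgradient of Q at x

x = np.array([1.0, 1.0])
g = subgrad_Q(x)
print("Q(x) =", Q(x), "subgradient =", g)
# The subgradient inequality Q(z) >= Q(x) + g^T (z - x) holds at test points:
for z in [np.array([1.5, 1.0]), np.array([0.5, 2.0])]:
    assert Q(z) >= Q(x) + g @ (z - x) - 1e-8
```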

5.1 SDDP algorithm for SP problems

Consider SP problem (2.1) - (2.2). Suppose that Assumptions 2.1 and 2.2 hold, and hence the corresponding cost-to-go functions are defined by equations (2.8) - (2.9). If probability distributions of random vectors ξt\xi_{t} are continuous, these distributions should be discretized in order to compute the respective expectations (see Remark 5.2 below). So we also make the following assumption.

Assumption 5.1

The probability distribution of \xi_{t} has finite support \{\xi_{t1},...,\xi_{tN}\} with respective probabilities p_{t1},...,p_{tN}, t=2,...,T (for the sake of simplicity we assume that the number N of realizations of \xi_{t} is the same for every stage t=2,...,T; recall that in the SP framework \xi_{1} is deterministic). Denote A_{tj}:=A_{t}(\xi_{tj}), B_{tj}:=B_{t}(\xi_{tj}), b_{tj}:=b_{t}(\xi_{tj}), and by Q_{tj}(x_{t-1}):=Q_{t}(x_{t-1},\xi_{tj}) the cost-to-go functions defined in (2.8).

The corresponding expected value cost-to-go functions can be written as

𝒬t(xt1)=j=1NptjQtj(xt1).\begin{array}[]{ll}{\cal Q}_{t}(x_{t-1})=\sum_{j=1}^{N}p_{tj}Q_{tj}(x_{t-1}).\end{array} (5.1)

Suppose further that the objective functions ft(xt,ξtj):=ctjxtf_{t}(x_{t},\xi_{tj}):=c_{tj}^{\top}x_{t} are linear in xtx_{t}, and the sets 𝒳t:=+nt{\cal X}_{t}:={\mathbb{R}}^{n_{t}}_{+}, i.e., the constraint xt𝒳tx_{t}\in{\cal X}_{t} can be written as xt0x_{t}\geq 0, t=1,,Tt=1,...,T. In that case (2.1) - (2.2) becomes the standard linear multistage SP problem:

minxt0𝔼[t=1Tctxt]s.t.Btxt1+Atxt=bt,t=1,,T.\min\limits_{x_{t}\geq 0}{\mathbb{E}}\Big{[}\sum_{t=1}^{T}c_{t}^{\top}x_{t}\Big{]}\;\;{\rm s.t.}\;B_{t}x_{t-1}+A_{t}x_{t}=b_{t},\;t=1,...,T. (5.2)

It could be noted that if the sets 𝒳t{\cal X}_{t} are polyhedral and the functions ft(xt,ξt)f_{t}(x_{t},\xi_{t}) are polyhedral in xtx_{t}, then the problem is linear and can be formulated in the standard form by the well known techniques of linear programming.

Remark 5.2

From the modeling point of view, random vectors \xi_{t} are often assumed to have continuous distributions. In such cases these continuous distributions have to be discretized in order to perform numerical calculations. One possible approach to such discretization is to use Monte Carlo sampling techniques. That is, a random sample \xi_{t1},...,\xi_{tN} of size N is generated from the probability distribution of the random vector \xi_{t}, t=2,...,T. This approach is often referred to as the Sample Average Approximation (SAA) method. The constructed discretized problem can be considered in the above framework with equal probabilities p_{tj}=1/N, j=1,...,N. Statistical properties of the SAA method in the multistage setting are discussed, e.g., in [47, Section 5.8]. It could be mentioned that the question of the sample size N, i.e., how many discretization points per stage to use, can only be addressed in the framework of the original model with continuous distributions. One of the advantages of the SAA method is that statistical tests can be applied to the results of computational procedures; this is discussed in [9]. \hfill\square
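A minimal sketch of this discretization step, assuming a hypothetical lognormal model for the stagewise independent data process, could look as follows.

```python
import numpy as np

# Sketch of the SAA discretization described in Remark 5.2: for each stage
# t = 2,...,T an i.i.d. sample of size N is drawn from the (continuous)
# distribution of xi_t and used as its discrete approximation with equal
# probabilities 1/N.  The lognormal model below is a hypothetical choice.
rng = np.random.default_rng(0)
T, N = 5, 100

scenarios = {}                       # scenarios[t]: array of N realizations of xi_t
probs = np.full(N, 1.0 / N)
for t in range(2, T + 1):
    scenarios[t] = rng.lognormal(mean=0.0, sigma=0.5, size=N)

# Any expectation E[g(xi_t)] in the dynamic programming equations is then
# replaced by the sample average sum_j p_{tj} g(xi_{tj}):
def g(xi):                           # an arbitrary test function
    return np.maximum(xi - 1.0, 0.0)

print({t: float(probs @ g(scenarios[t])) for t in scenarios})
```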

The SDDP algorithm for linear multistage SP problems, having a finite number of scenarios, was described in detail in a number of publications (see, e.g., the recent survey paper [12] and references therein). For later reference we briefly discuss it below. The SDDP algorithm consists of backward and forward steps. At each iteration of the algorithm, in the backward step the current cutting planes approximations of the cost-to-go functions {\cal Q}_{t}(\cdot) are updated by adding respective cuts. That is, at the last stage the cost-to-go function Q_{Tj}(x_{T-1}), j=1,...,N, is given by the optimal value of the linear program

minxT0cTjxTs.t.BTjxT1+ATjxT=bTj.\min_{x_{T}\geq 0}c_{Tj}^{\top}x_{T}\;\;{\rm s.t.}\;\;B_{Tj}x_{T-1}+A_{Tj}x_{T}=b_{Tj}. (5.3)

The subgradient of QTj(xT1)Q_{Tj}(x_{T-1}) at a chosen point x~T1\tilde{x}_{T-1} is given by gTj=BTjλ~Tjg_{Tj}=-B_{Tj}\tilde{\lambda}_{Tj}, where λ~Tj\tilde{\lambda}_{Tj} is an optimal solution of the dual of problem (5.3) (see Remark 5.1 above). Consequently the subgradient of 𝒬T(xT1){\cal Q}_{T}(x_{T-1}) at x~T1\tilde{x}_{T-1} is computed as 𝗀T=j=1NpTjgTj{\sf g}_{T}=\sum_{j=1}^{N}p_{Tj}g_{Tj}, and hence the cutting plane

T(xT1)=𝒬T(x~T1)+𝗀T(xT1x~T1)\ell_{T}(x_{T-1})={\cal Q}_{T}(\tilde{x}_{T-1})+{\sf g}_{T}^{\top}(x_{T-1}-\tilde{x}_{T-1}) (5.4)

is added to the current family of cutting planes of 𝒬T(){\cal Q}_{T}(\cdot). Actually at the last stage the above cutting plane T()\ell_{T}(\cdot) is the supporting plane of 𝒬T(){\cal Q}_{T}(\cdot) at x~T1\tilde{x}_{T-1}. This is because 𝒬T(x~T1){\cal Q}_{T}(\tilde{x}_{T-1}) is explicitly computed by solving problem (5.3).
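A sketch of how the last stage cut (5.4) could be assembled is given below (Python, using scipy.optimize.linprog with the HiGHS solver; we assume each subproblem (5.3) is feasible and bounded, and rely on the documented convention that the equality-constraint marginals are derivatives of the optimal value with respect to the right hand side):

import numpy as np
from scipy.optimize import linprog

def last_stage_cut(x_trial, c, A, B, b, p):
    """Value Q_T(x_trial) and subgradient g_T of the expected cost-to-go
    function at the trial point, cf. (5.3)-(5.4).  c, A, B, b are lists of
    per-realization data (j = 1,...,N) and p the probabilities p_{Tj}."""
    val = 0.0
    grad = np.zeros_like(x_trial, dtype=float)
    for cj, Aj, Bj, bj, pj in zip(c, A, B, b, p):
        # default bounds (0, None) encode x_T >= 0 in problem (5.3)
        res = linprog(cj, A_eq=Aj, b_eq=bj - Bj @ x_trial, method="highs")
        lam = res.eqlin.marginals          # duals: d(optimal value)/d(rhs)
        val += pj * res.fun                # contribution to Q_T(x_trial)
        grad += pj * (-(Bj.T @ lam))       # g_{Tj} = -B_{Tj}^T lambda_{Tj}
    return val, grad

The returned pair defines the cutting plane (5.4) at the trial point \tilde{x}_{T-1}.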

At one step back, at stage T-1, the cost-to-go function Q_{T-1,j}(x_{T-2}), j=1,...,N, is given by the optimal value of the program

minxT10cT1,jxT1+𝒬T(xT1)s.t.BT1,jxT2+AT1jxT1=bT1,j.\min_{x_{T-1}\geq 0}c_{T-1,j}^{\top}x_{T-1}+{\cal Q}_{T}(x_{T-1})\;\;{\rm s.t.}\;\;B_{T-1,j}x_{T-2}+A_{T-1j}x_{T-1}=b_{T-1,j}. (5.5)

The cost-to-go function 𝒬T(){\cal Q}_{T}(\cdot) is not known. Therefore it is replaced by the approximation 𝒬¯T()\underline{\cal Q}_{T}(\cdot) given by the maximum of the current family of cutting planes including the cutting plane added at stage TT as described above. Since 𝒬¯T()\underline{\cal Q}_{T}(\cdot) is the maximum of a finite number of affine functions, the obtained problems

minxT10cT1,jxT1+𝒬¯T(xT1)s.t.BT1,jxT2+AT1jxT1=bT1,j,\min_{x_{T-1}\geq 0}c_{T-1,j}^{\top}x_{T-1}+\underline{\cal Q}_{T}(x_{T-1})\;\;{\rm s.t.}\;\;B_{T-1,j}x_{T-2}+A_{T-1j}x_{T-1}=b_{T-1,j}, (5.6)

j=1,,Nj=1,...,N, are linear. Let 𝔔T1,j(xT2){\mathfrak{Q}}_{T-1,j}(x_{T-2}) be the optimal value function of problem (5.6) and 𝔔¯T1(xT2)=j=1NpT1,j𝔔T1,j(xT2)\bar{{\mathfrak{Q}}}_{T-1}(x_{T-2})=\sum_{j=1}^{N}p_{T-1,j}{\mathfrak{Q}}_{T-1,j}(x_{T-2}) be the corresponding estimate of the expected value cost function. Note that since 𝒬T()𝒬¯T(){\cal Q}_{T}(\cdot)\geq\underline{\cal Q}_{T}(\cdot), we have that QT1,j()𝔔T1,j()Q_{T-1,j}(\cdot)\geq{\mathfrak{Q}}_{T-1,j}(\cdot), j=1,,Nj=1,...,N, and hence 𝒬T1()𝔔¯T1(){\cal Q}_{T-1}(\cdot)\geq\bar{{\mathfrak{Q}}}_{T-1}(\cdot). Gradient gT1,jg_{T-1,j} of the optimal value function 𝔔T1,j(xT2){\mathfrak{Q}}_{T-1,j}(x_{T-2}) can be computed at a chosen point x~T2\tilde{x}_{T-2} by solving the dual of problem (5.6) and hence computing its optimal solution. Then 𝗀T1=j=1NpT1,jgT1,j{\sf g}_{T-1}=\sum_{j=1}^{N}p_{T-1,j}g_{T-1,j} is the gradient of 𝔔¯T1(xT2)\bar{{\mathfrak{Q}}}_{T-1}(x_{T-2}) at x~T2\tilde{x}_{T-2}. Consequently

\ell_{T-1}(x_{T-2})=\bar{{\mathfrak{Q}}}_{T-1}(\tilde{x}_{T-2})+{\sf g}_{T-1}^{\top}(x_{T-2}-\tilde{x}_{T-2}) (5.7)

is a cutting plane of 𝒬T1(){\cal Q}_{T-1}(\cdot), and is added to the current collection of cutting planes of 𝒬T1(){\cal Q}_{T-1}(\cdot). And so on going backward in time until the corresponding approximation of the first stage problem is obtained:

minx10c1x1+𝒬¯2(x1)s.t.A1x1=b1.\min_{x_{1}\geq 0}c_{1}^{\top}x_{1}+\underline{\cal Q}_{2}(x_{1})\;\;{\rm s.t.}\;\;A_{1}x_{1}=b_{1}. (5.8)

Let us note that by construction, {\cal Q}_{t}(\cdot)\geq\underline{\cal Q}_{t}(\cdot) for t=2,...,T. Therefore any cutting plane of \underline{\cal Q}_{t}(\cdot) is also a cutting plane of {\cal Q}_{t}(\cdot). Also it follows that the optimal value of problem (5.8) is less than or equal to the optimal value of problem (2.1) - (2.2). Therefore the optimal value of problem (5.8) provides a (deterministic) lower bound for the optimal value of the considered multistage problem.
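A possible way to store and evaluate the collection of cutting planes defining \underline{\cal Q}_{t}(\cdot) is sketched below (the class name and interface are ours and are not taken from any particular SDDP implementation):

import numpy as np

class CutApproximation:
    """Lower approximation of a convex cost-to-go function by the maximum
    of affine cuts, as used for the functions underline{Q}_t."""

    def __init__(self):
        self.cuts = []                           # list of (value, gradient, point)

    def add_cut(self, value, grad, point):
        self.cuts.append((float(value), np.asarray(grad, dtype=float),
                          np.asarray(point, dtype=float)))

    def __call__(self, x):
        if not self.cuts:
            return -np.inf                       # no cuts yet: trivial lower bound
        return max(v + g @ (x - p) for v, g, p in self.cuts)

Each backward step adds one cut per stage via add_cut, and the forward step evaluates the resulting piecewise linear minorant.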

The computed estimates 𝒬¯t()\underline{\cal Q}_{t}(\cdot), t=2,,Tt=2,...,T, of the cost-to-go functions, and optimal solution x¯1\bar{x}_{1} of the first stage problem (5.8), define a feasible policy in the same way as in (2.10). Since this policy is feasible, its value is greater than or equal to the optimal value of the considered problem (2.1) - (2.2). In the forward step this value is estimated by randomly generating realizations of the data process and evaluating the corresponding policy values. That is, consider a sample path (scenario) ξ^2,,ξ^T\hat{\xi}_{2},...,\hat{\xi}_{T} of the data process. Let B^t=Bt(ξ^t)\hat{B}_{t}=B_{t}(\hat{\xi}_{t}), A^t=At(ξ^t)\hat{A}_{t}=A_{t}(\hat{\xi}_{t}), b^t=bt(ξ^t)\hat{b}_{t}=b_{t}(\hat{\xi}_{t}) and c^t=ct(ξ^t)\hat{c}_{t}=c_{t}(\hat{\xi}_{t}) be realizations of the random parameters corresponding to the generated sample path. Starting with first stage solution x¯1\bar{x}_{1}, next values x¯t\bar{x}_{t} are computed iteratively going forward in time by computing an optimal solution of the optimization problem in (2.10) with 𝒬t+1(){\cal Q}_{t+1}(\cdot) replaced by 𝒬¯t+1()\underline{\cal Q}_{t+1}(\cdot). That is

x¯targminxt0{c^txt+𝒬¯t+1(xt):B^tx¯t1+A^txt=b^t}.\bar{x}_{t}\in\mathop{\rm arg\,min}_{x_{t}\geq 0}\left\{\hat{c}_{t}^{\top}x_{t}+\underline{\cal Q}_{t+1}(x_{t}):\hat{B}_{t}\bar{x}_{t-1}+\hat{A}_{t}x_{t}=\hat{b}_{t}\right\}. (5.9)

The corresponding total sum

ϑ^:=c^1x¯1+t=2Tc^tx¯t\hat{\vartheta}:=\hat{c}_{1}^{\top}\bar{x}_{1}+\sum_{t=2}^{T}\hat{c}_{t}^{\top}\bar{x}_{t} (5.10)

is a function of the generated sample and hence is random. Its expected value 𝔼[ϑ^]{\mathbb{E}}[\hat{\vartheta}] is equal to the value of the constructed policy, and thus 𝔼[ϑ^]{\mathbb{E}}[\hat{\vartheta}] is greater than or equal to the optimal value of problem (2.1) - (2.2).

The forward step of the algorithm proceeds to evaluate value of the constructed policy by randomly generating the sample paths (scenarios). At every stage t=2,,Tt=2,...,T, a point ξt\xi_{t} is generated at random according to the distribution of the corresponding random vector. That is, a point ξ^t\hat{\xi}_{t} is chosen with probability ptip_{ti}, i=1,,Ni=1,...,N, from the set {ξt1,,ξtN}\{\xi_{t1},...,\xi_{tN}\}. This generates the corresponding sample path (scenario) ξ^2,,ξ^T\hat{\xi}_{2},...,\hat{\xi}_{T} of the data process. Consequently the policy values for the generated sample path are computed, by using the rule (5.9), and thus the total sum ϑ^\hat{\vartheta}, defined in (5.10), is evaluated. The computed value ϑ^\hat{\vartheta} gives a point estimate of value of the constructed policy. This procedure can be repeated several times by randomly (and independently from each other) generating MM scenarios. Let ϑ^1,,ϑ^M\hat{\vartheta}^{1},...,\hat{\vartheta}^{M} be values, calculated in accordance with (5.10), for each generated scenario. Their average ϑ¯:=M1i=1Mϑ^i\bar{\vartheta}:=M^{-1}\sum_{i=1}^{M}\hat{\vartheta}^{i} gives an unbiased estimate of the value of the considered policy. An upper edge ϑ¯+zαS/M\bar{\vartheta}+z_{\alpha}S/\sqrt{M} of the corresponding confidence interval is viewed as a statistical estimate of an upper bound for the optimal value of the considered multistage problem. Here S2=(M1)1i=1M(ϑ^iϑ¯)2S^{2}=(M-1)^{-1}\sum_{i=1}^{M}(\hat{\vartheta}^{i}-\bar{\vartheta})^{2} is an estimate of the variance Var(ϑ^){\rm Var}(\hat{\vartheta}), and zαz_{\alpha} is the critical level, say zα=2z_{\alpha}=2. The justification for this bound is that by the Central Limit Theorem (CLT) for large MM, the average ϑ¯\bar{\vartheta} has approximately normal distribution and zαz_{\alpha} corresponds to mass 100(1α)%100(1-\alpha)\% of the standard normal distribution. For MM not too large it is a common practice to use critical values from tt-distribution with M1M-1 degrees of freedom, rather than normal. In any case this is just a heuristic justified by the CLT. Nevertheless it is useful and typically gives a reasonable estimate of the upper bound.
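The statistical upper bound described above amounts to the following computation (a sketch; policy_values stands for the simulated totals \hat{\vartheta}^{1},...,\hat{\vartheta}^{M} and z_alpha for the chosen critical value):

import numpy as np

def statistical_upper_bound(policy_values, z_alpha=2.0):
    """Upper edge of the confidence interval for the value of the policy."""
    v = np.asarray(policy_values, dtype=float)
    M = v.size
    mean = v.mean()                  # bar{vartheta}
    s = v.std(ddof=1)                # S, with the (M-1)^{-1} normalization
    return mean + z_alpha * s / np.sqrt(M)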

Remark 5.3

The constructed approximations 𝒬¯t()\underline{\cal Q}_{t}(\cdot), of the expected value cost-to-go functions, also define a feasible policy for the original problem with continuous distributions of the data process vectors ξt\xi_{t}. Its value can be estimated in the same way by generating a sample path ξ^2,,ξ^T\hat{\xi}_{2},...,\hat{\xi}_{T} from the original continuous distribution and computing the corresponding policy values in the forward step in accordance with (5.9). Consequently the policy value is estimated by averaging the computed values for a number of randomly generated sample paths. This can be used for construction of statistical upper bound for the original problem. Expectation of the optimal value of the SAA problem, based on randomly generated sample, is less than or equal to the optimal value of the original problem. Therefore the lower bound generated by the SDDP algorithm gives a point estimate of a lower bound of the original problem. We can refer to [9] for various related statistical testing procedures. \hfill\square

The difference between the (statistical) upper bound and (deterministic) lower bound provides an upper bound for the optimality gap of the constructed policy. It can be used as a stopping rule by stopping the iteration procedure when this estimated gap is smaller than a specified precision level. The bound is quite conservative and for larger problems may not be reducible to the specified accuracy level in a reasonable computational time. A more practical heuristic approach is to stop the iterations after the lower bound stabilizes and new iterations, with additional cutting planes, do not significantly improve (increase) the lower bound.

The above procedure depends on a choice of the points \tilde{x}_{t}, referred to as the trial points. The forward step suggests to use the computed values \bar{x}_{t} as trial points in the next iteration of the backward step of the algorithm. The rationale for this can be explained as follows. In general, in order to approximate the cost-to-go functions uniformly with a given accuracy in a specified region, the number of cutting planes grows exponentially with increase of the dimension of the decision variables x_{t}. In many applications the regions where the cost-to-go functions should be approximated more accurately are considerably smaller than the regions of possible variability of the decision variables. By randomly generating scenarios, the forward step generates trial points where the cost-to-go functions are likely to require a more accurate approximation.

Remark 5.4

The trial points are supposed to be chosen at stages t=2,...,T-1. This suggests that the above procedure is relevant for the number of stages T\geq 3. For two stage problems, i.e., when T=2, there is no sampling involved and this becomes the classical Kelley's cutting plane algorithm, [18]. It is well known in deterministic convex optimization that Kelley's algorithm can behave poorly with increase of the number of decision variables, and much better cutting plane algorithms can be designed by using regularization or trust region/level set strategies [26, 20]. However, it is not clear how such methods can be extended to the dynamic multistage settings (although such attempts were made). The reason is that in multistage problems the optimization is performed over policies which are functions of the data process. It is not known a priori where realizations of the considered policies may happen in future stages, and an attempt to restrict the regions of search by some form of regularization may not be helpful. \hfill\square

Proofs of convergence of SDDP type algorithms were studied in several publications (e.g., [13] and references therein). These proofs are based on the argument that eventually, almost surely, the forward step of the algorithm will go through every possible scenario many times, and hence the cost-to-go functions will be approximated with any accuracy in the relevant regions. Since the number of scenarios grows exponentially with increase of the number of stages, this is not a very realistic assumption. Numerical experiments indicate that, in a certain sense, the numerical complexity of the SDDP method grows almost linearly with increase of the number of stages while the dimension of the decision vectors is kept constant. This property of the rate of convergence was investigated in the recent paper [22].

5.1.1 Interstage dependence

Suppose now that the data process \xi_{t}\in{\mathbb{R}}^{d} is Markovian, and hence the cost-to-go functions are defined by equations (2.11) - (2.12). There are different ways in which such Markovian structure can be modeled. One classical approach is time series analysis. The data process is modeled as, say, a first order autoregressive process, i.e.,

ξt=μ+Φξt1+εt,\xi_{t}=\mu+\Phi\xi_{t-1}+\varepsilon_{t}, (5.11)

where \Phi\in{\mathbb{R}}^{d\times d} and \mu\in{\mathbb{R}}^{d} are parameters of the process and \varepsilon_{t} is a sequence of random error vectors. It is assumed that the error process \varepsilon_{t} is stagewise independent (in fact it is usually assumed that the \varepsilon_{t} are i.i.d.). Suppose further that only the right hand side variables b_{t} in the considered problem are random, and set \xi_{t}:=b_{t}. Then we can write the cost-to-go functions in terms of x_{t} and \xi_{t}, treating \varepsilon_{t} as the corresponding random process:

Qt(xt1,ξt1,εt)=infxt0,ξt{ctxt+𝒬t+1(xt,ξt):Btxt1+Atxt=ξt,ξtμΦξt1=εt},Q_{t}(x_{t-1},\xi_{t-1},\varepsilon_{t})=\inf_{x_{t}\geq 0,\,\xi_{t}}\left\{c_{t}^{\top}x_{t}+{\cal Q}_{t+1}(x_{t},\xi_{t}):\begin{array}[]{lll}B_{t}x_{t-1}+A_{t}x_{t}=\xi_{t},\\ \xi_{t}-\mu-\Phi\xi_{t-1}=\varepsilon_{t}\end{array}\right\}, (5.12)

with

𝒬t+1(xt,ξt)=𝔼[Qt+1(xt,ξt,εt+1)].{\cal Q}_{t+1}(x_{t},\xi_{t})={\mathbb{E}}[Q_{t+1}(x_{t},\xi_{t},\varepsilon_{t+1})]. (5.13)

Here (x_{t},\xi_{t}) are treated as decision variables, \varepsilon_{t} is viewed as the corresponding random process, and the expectation in (5.13) is taken with respect to \varepsilon_{t+1}.

There are two drawbacks in the above approach. First, the number of decision variables is artificially increased, with the cutting planes in the corresponding SDDP algorithm computed jointly in x_{t} and \xi_{t}. This increases the computational complexity of the numerical procedure. Second, in order to have a linear problem, only the right hand side parameters are allowed to be random. An alternative approach is to approximate the Markovian process \xi_{t} by a Markov chain, which we are going to discuss next.

Suppose that the distribution of the random vector \xi_{t} is supported on a set \Xi_{t}\subset{\mathbb{R}}^{d_{t}}, t=2,...,T. Consider the partition444For the sake of simplicity we assume that the number N of cells is the same at every stage. \Xi_{ti}, i=1,...,N, of \Xi_{t} by Voronoi cells, i.e.,

Ξti:={ξtΞt:ξtηtiξtηtjfor allji},\Xi_{ti}:=\left\{\xi_{t}\in\Xi_{t}:\|\xi_{t}-\eta_{ti}\|\leq\|\xi_{t}-\eta_{tj}\|\;\text{for all}\;j\neq i\right\},

where \eta_{ti}\in\Xi_{t} are given points. That is, \Xi_{ti} is the set of points of \Xi_{t} closest to \eta_{ti}. There are various ways in which the center points \eta_{ti} can be chosen. For example, \eta_{ti} could be determined by minimizing the squared distances:

minηt1,,ηtNmini=1,,Nξtηti2dP(ξt).\min_{\eta_{t1},...,\eta_{tN}}\int\min_{i=1,...,N}\|\xi_{t}-\eta_{ti}\|^{2}dP(\xi_{t}). (5.14)

The above optimization problem is not convex. Nevertheless it can be reasonably well solved by various methods (cf., [27]). Consider the conditional probabilities p_{tij}:=P(\xi_{t+1}\in\Xi_{t+1,j}|\xi_{t}\in\Xi_{ti}). Note that different Voronoi cells can have common boundary points; we assume that the probabilities of such events are zero, so that the probabilities p_{tij} are well defined. This leads to an approximation of the data process by a (possibly nonhomogeneous) Markov chain with transition probabilities p_{tij} of moving from point \eta_{ti} to point \eta_{t+1,j} at the next stage.
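A sketch of the Markov chain construction is given below (Python; the centers are computed by k-means, which is one possible method for (5.14), and the transition probabilities are estimated as empirical frequencies from available sample paths of the data process; the array layout of the paths is our assumption):

import numpy as np
from scipy.cluster.vq import kmeans2

def fit_markov_chain(paths, N):
    """paths: array of shape (S, T-1, d) with S sample paths of xi_2,...,xi_T.
    Returns per stage centers eta_{ti} and transition matrices p_{tij}."""
    n_paths, stages, d = paths.shape
    centers, labels = [], []
    for t in range(stages):
        c, lab = kmeans2(paths[:, t, :], N, minit="++")    # Voronoi centers, cf. (5.14)
        centers.append(c)
        labels.append(lab)
    trans = []
    for t in range(stages - 1):
        P = np.zeros((N, N))
        for i, j in zip(labels[t], labels[t + 1]):
            P[i, j] += 1.0                                  # count transitions cell i -> cell j
        P /= np.maximum(P.sum(axis=1, keepdims=True), 1.0)  # empirical p_{tij}
        trans.append(P)
    return centers, trans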

Consider cti:=ct(ηti)c_{ti}:=c_{t}(\eta_{ti}), Ati:=At(ηti)A_{ti}:=A_{t}(\eta_{ti}), Bti:=Bt(ηti)B_{ti}:=B_{t}(\eta_{ti}) and bti:=bt(ηti)b_{ti}:=b_{t}(\eta_{ti}), i=1,,Ni=1,...,N. Recall dynamic equations (2.11) - (2.12) for Markovian data process. For the constructed Markov chain these dynamic equations can be written as

Qt(xt1,ηti)=infxt0{ctixt+𝒬t+1(xt,ηti):Btixt1+Atixt=bti},i=1,,N,Q_{t}(x_{t-1},\eta_{ti})=\inf_{x_{t}\geq 0}\Big{\{}c_{ti}^{\top}x_{t}+{\cal Q}_{t+1}(x_{t},\eta_{ti}):B_{ti}x_{t-1}+A_{ti}x_{t}=b_{ti}\Big{\}},\;i=1,...,N, (5.15)

where

𝒬t+1(xt,ηti)=j=1NptijQt+1(xt,ηt+1,j).{\cal Q}_{t+1}(x_{t},\eta_{ti})=\sum_{j=1}^{N}p_{tij}Q_{t+1}(x_{t},\eta_{t+1,j}). (5.16)

Note that the cost-to-go functions Qt(xt1,ηt)Q_{t}(x_{t-1},\eta_{t}) are convex in xt1x_{t-1}, while the argument ηt\eta_{t} can take only the values ηt1,,ηtN\eta_{t1},...,\eta_{tN}.

It is possible to apply the SDDP method separately for every ηt=ηti\eta_{t}=\eta_{ti}, i=1,,Ni=1,...,N (cf., [27]). That is, in the backward step of the algorithm the cost-to-go functions 𝒬t+1,i(xt):=𝒬t+1(xt,ηti){\cal Q}_{t+1,i}(x_{t}):={\cal Q}_{t+1}(x_{t},\eta_{ti}) are approximated by their cutting planes (in xtx_{t}) separately for each ηti\eta_{ti}, i=1,,Ni=1,...,N. This defines a policy for every realization (sample path) η1,,\eta_{1},..., of the constructed Markov chain. This can be used in the forward step of the algorithm.

It could happen that a sample path ξ1,,\xi_{1},..., of the original data process is different from every sample path η1,,\eta_{1},..., of the constructed Markov chain. In order to extend the constructed policy to a policy for the original data process, one approach is to define the policy values for a sample path of the original data process by taking the corresponding policy values from the sample path of the Markov chain closest in some sense to the considered sample path of the original data process.

The above approach is based on discretization (approximation) of the Markov process by a Markov chain. Compared with the approach based on autoregressive modeling of the data process, the Markov chain discretization is not restricted to the case where only the right hand side parameters are random. Also it uses cutting planes approximations only with respect to the variables x_{t}, which could be computationally advantageous. On the other hand, its computational complexity and required computer memory grow with increase of the number N of per stage discretization points. Also, unlike the SAA method, there is no available inference describing statistical properties of such discretization. From an intuitive point of view, such discretization could become problematic with increase of the dimension of the random vectors of the data process.

5.1.2 Duality of multistage linear stochastic programs

The linear multistage stochastic program (5.2), with a finite number of scenarios, can be viewed as a large scale linear program, the so called deterministic equivalent. By the standard theory of linear programming it has the (Lagrangian) dual. Suppose that the primal problem (5.2) has finite optimal value. Then the optimal values of the primal and its dual are equal to each other and both problems have optimal solutions. Dualization of the feasibility constraints leads to the following (Lagrangian) dual of problem (5.2) (cf., [47, Section 3.2.3])

maxπ𝔼[t=1Tbtπt]s.t.ATπTcT,At1πt1+𝔼|ξ[t1][Btπt]ct1,t=2,,T.\begin{array}[]{cll}\max\limits_{\pi}&{\mathbb{E}}\big{[}\sum_{t=1}^{T}b_{t}^{\top}\pi_{t}\big{]}\\ {\rm s.t.}&A^{\top}_{T}\pi_{T}\leq c_{T},\\ &A_{t-1}^{\top}\pi_{t-1}+{\mathbb{E}}_{|\xi_{[t-1]}}\left[B_{t}^{\top}\pi_{t}\right]\leq c_{t-1},\;t=2,...,T.\end{array} (5.17)

The optimization in (5.17) is over policies πt=πt(ξ[t])\pi_{t}=\pi_{t}(\xi_{[t]}), t=1,,Tt=1,...,T.

It is possible to write dynamic programming equations for the dual problem. The idea of applying the SDDP algorithm to the dual problem was introduced in [25]. In what follows we use the approach of [15] based on formulation (5.17). As before we make Assumptions 2.1, 2.2 and 5.1. At the last stage t=Tt=T, given πT1\pi_{T-1} and ξ[T1]\xi_{[T-1]}, we need to solve the following problem with respect to πT\pi_{T}:

maxπT1,,πTNj=1NpTjbTjπTjs.t.ATjπTjcTj,j=1,,N,AT1πT1+j=1NpTjBTjπTjcT1.\begin{array}[]{cll}\max\limits_{\pi_{T1},\ldots,\pi_{TN}}&\sum\limits_{j=1}^{N}p_{Tj}b_{Tj}^{\top}\pi_{Tj}\\ {\rm s.t.}&A_{Tj}^{\top}\pi_{Tj}\leq c_{Tj},\;j=1,...,N,\\ &A_{T-1}^{\top}\pi_{T-1}+\sum\limits_{j=1}^{N}p_{Tj}B_{Tj}^{\top}\pi_{Tj}\leq c_{T-1}.\end{array} (5.18)

Note that j=1NpTjbTjπTj=𝔼[bTπT]\sum\limits_{j=1}^{N}p_{Tj}b_{Tj}^{\top}\pi_{Tj}={\mathbb{E}}[b_{T}^{\top}\pi_{T}] and j=1NpTjBTjπTj=𝔼[BTπT]\sum\limits_{j=1}^{N}p_{Tj}B_{Tj}^{\top}\pi_{Tj}={\mathbb{E}}\left[B_{T}^{\top}\pi_{T}\right]. Since ξT\xi_{T} is independent of ξ[T1]\xi_{[T-1]}, the expectation in (5.18) is unconditional with respect to the distribution of ξT\xi_{T}.

The optimal value VT(πT1,ξT1)V_{T}(\pi_{T-1},\xi_{T-1}) and an optimal solution (π¯T1,,π¯TN)(\bar{\pi}_{T1},\ldots,\bar{\pi}_{TN}) of problem (5.18) are functions of vectors πT1\pi_{T-1} and cT1=cT1(ξT1)c_{T-1}=c_{T-1}(\xi_{T-1}) and matrix AT1=AT1(ξT1)A_{T-1}=A_{T-1}(\xi_{T-1}). And so on going backward in time, using the stagewise independence assumption, we can write the respective dynamic programming equations for t=T1,,2t=T-1,...,2, as

maxπt1,,πtNj=1Nptj[btjπtj+Vt+1(πtj,ξtj)]s.t.At1πt1+j=1NptjBtjπtjct1,\begin{array}[]{cll}\max\limits_{\pi_{t1},\ldots,\pi_{tN}}&\sum\limits_{j=1}^{N}p_{tj}\left[b_{tj}^{\top}\pi_{tj}+V_{t+1}(\pi_{tj},\xi_{tj})\right]\\ {\rm s.t.}&A_{t-1}^{\top}\pi_{t-1}+\sum\limits_{j=1}^{N}p_{tj}B_{tj}^{\top}\pi_{tj}\leq c_{t-1},\end{array} (5.19)

with Vt(πt1,ξt1)V_{t}(\pi_{t-1},\xi_{t-1}) being the optimal value of problem (5.19). Finally at the first stage the following problem should be solved555Value functions VtV_{t} here should not be confused with the value functions of the SOC framework.

maxπ1b1π1+V2(π1,ξ1).\max_{\pi_{1}}b_{1}^{\top}\pi_{1}+V_{2}(\pi_{1},\xi_{1}). (5.20)

Let us make the following observations about the dual problem. Unlike the primal problem, the optimization (maximization) problems (5.18) and (5.19) do not decompose into separate problems with respect to each πtj\pi_{tj} and should be solved as one linear program with respect to (πt1,,πtN)(\pi_{t1},...,\pi_{tN}). If AtA_{t} and ctc_{t}, t=2,,Tt=2,...,T, are deterministic, then Vt(πt1)V_{t}(\pi_{t-1}) is only a function of πt1\pi_{t-1}. The value (cost-to-go) function Vt(πt1,ξt1)V_{t}(\pi_{t-1},\xi_{t-1}) is a concave function of πt1\pi_{t-1}. This allows to apply an SDDP type algorithm to the dual problem (see [15] for details). For any feasible policy of the dual problem, value of the first stage problem (5.20) gives an upper bound for the optimal value of the primal problem (5.2). Since the dual is a maximization problem, the SDDP algorithm applied to the dual problem produces a deterministic upper bound for the optimal value of problem (5.2).

It is interesting to note that it could happen that the dual problem does not have relatively complete recourse even if the primal problem has it. It is said that the problem (primal or dual) has relatively complete recourse, if at every stage t=2,,Tt=2,...,T, for any generated solutions by the forward process at the previous stages, the respective dynamic program has a feasible solution at the stage tt for every realization of the random data. Without relatively complete recourse it could happen at a certain stage that for some realizations of the data process, the corresponding optimization problem is infeasible and the forward process cannot continue. Note that the relatively complete recourse is related to the policy construction in the forward dynamic process and may not hold even if the problem has finite optimal value. It is still possible to proceed with construction of the dual upper bound by introducing a penalty term to the dual problem (cf., [15]).

5.2 Cutting planes algorithm for SOC problems

Consider the SOC problem (3.1) - (3.2). Suppose that Assumptions 2.1 and 2.2 hold, and hence equation (3.10) for the value function follows. Suppose further that Assumption 3.1 is satisfied, and hence value functions Vt()V_{t}(\cdot) are convex. Then a cutting planes algorithm of the SDDP type can be applied in a way similar to the SP setting. There are, however, some subtle differences which we are going to discuss next.

In the backward step of the algorithm an approximation V¯t()\underline{V}_{t}(\cdot) of the value function Vt()V_{t}(\cdot) is constructed by maximum of a family of cutting planes, similar to the SP setting, going backward in time. Let x~t\tilde{x}_{t}, t=2,,T+1t=2,...,T+1, be trial points, generated say by the forward step of the algorithm at the previous iteration. At the backward step, starting at time t=T+1t=T+1 the cutting (supporting) plane T+1(xT+1)=cT+1(x~T+1)+𝗀T+1(xT+1x~T+1)\ell_{T+1}(x_{T+1})=c_{T+1}(\tilde{x}_{T+1})+{\sf g}_{T+1}^{\top}(x_{T+1}-\tilde{x}_{T+1}), where 𝗀T+1{\sf g}_{T+1} is a subgradient of cT+1()c_{T+1}(\cdot) at x~T+1\tilde{x}_{T+1}, is added to the current family of cutting planes of VT+1()V_{T+1}(\cdot). Going backward in time, at stage tt we need to compute a subgradient of function

infut𝒰t(xt)j=1Nptj[ctj(xt,ut)+V¯t+1(Atjxt+Btjut+btj)]\inf\limits_{u_{t}\in{\cal U}_{t}(x_{t})}\sum_{j=1}^{N}p_{tj}\left[c_{tj}(x_{t},u_{t})+\underline{V}_{t+1}\big{(}A_{tj}x_{t}+B_{tj}u_{t}+b_{tj}\big{)}\right] (5.21)

at the trial point xt=x~tx_{t}=\tilde{x}_{t}, where V¯t+1()=max1iMt+1,i()\underline{V}_{t+1}(\cdot)=\max_{1\leq i\leq M}\ell_{t+1,i}(\cdot) is the current approximation of Vt+1()V_{t+1}(\cdot) by cutting planes t+1,i()\ell_{t+1,i}(\cdot). The minimization problem (5.21) is linear if functions ctj(xt,ut)c_{tj}(x_{t},u_{t}) are linear (polyhedral) in utu_{t}, and 𝒰t(xt){\cal U}_{t}(x_{t}) is of the form (3.9) with the corresponding mapping gt(xt,ut)g_{t}(x_{t},u_{t}) being affine in utu_{t}.

Let us first consider the case where 𝒰t(xt)𝒰t{\cal U}_{t}(x_{t})\equiv{\cal U}_{t} does not depend on xtx_{t}. Then the required subgradient is given by (cf., [39, Theorem 23.8])

𝗀t=j=1Nptj[ctj(x~t,u~t)+AtjV¯t+1(Atjx~t+Btju~t+btj)],{\sf g}_{t}=\sum_{j=1}^{N}p_{tj}\left[\nabla c_{tj}(\tilde{x}_{t},\tilde{u}_{t})+A^{\top}_{tj}\nabla\underline{V}_{t+1}\big{(}A_{tj}\tilde{x}_{t}+B_{tj}\tilde{u}_{t}+b_{tj}\big{)}\right], (5.22)

where \tilde{u}_{t}\in{\cal U}_{t} is a minimizer of problem (5.21) for x_{t}=\tilde{x}_{t}, and \nabla c_{tj}(\tilde{x}_{t},\tilde{u}_{t}) is a subgradient of c_{tj}(\cdot,\tilde{u}_{t}) at \tilde{x}_{t}. We also need to compute a subgradient \nabla\underline{V}_{t+1}(\cdot) at the points A_{tj}\tilde{x}_{t}+B_{tj}\tilde{u}_{t}+b_{tj}, j=1,...,N. The subgradient \nabla\underline{V}_{t+1}(\cdot) at a point x_{t+1} is given by the gradient \nabla\ell_{t+1,\nu}(x_{t+1}) of its cutting plane with \nu\in\{1,...,M\} such that \ell_{t+1,\nu}(x_{t+1})=\underline{V}_{t+1}(x_{t+1}), i.e., \ell_{t+1,\nu} is the supporting plane of \underline{V}_{t+1}(\cdot) at x_{t+1}. Of course, the gradient of an affine function \ell(x)=\alpha+\beta^{\top}x is \nabla\ell(x)=\beta for any x. The cutting plane is then \underline{V}_{t}(\tilde{x}_{t})+{\sf g}_{t}^{\top}(x_{t}-\tilde{x}_{t}), where

V¯t(x~t)=j=1Nptj[ctj(x~t,u~t)+V¯t+1(Atjx~t+Btju~t+btj)].\underline{V}_{t}(\tilde{x}_{t})=\sum_{j=1}^{N}p_{tj}\left[c_{tj}(\tilde{x}_{t},\tilde{u}_{t})+\underline{V}_{t+1}\big{(}A_{tj}\tilde{x}_{t}+B_{tj}\tilde{u}_{t}+b_{tj}\big{)}\right].
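The computation of this cut in the case {\cal U}_{t}(x_{t})\equiv{\cal U}_{t} can be sketched as follows (Python; the minimizer \tilde{u}_{t} of (5.21) is assumed to have been computed elsewhere, the costs c_{tj} are passed as callables, their subgradients with respect to x_{t} at the trial point as vectors, and the cuts of \underline{V}_{t+1} as triples (value, gradient, point)):

import numpy as np

def value_and_subgrad(cuts, y):
    """Evaluate the maximum of affine cuts at y and return the gradient of an active cut."""
    vals = [v + g @ (y - p) for v, g, p in cuts]
    k = int(np.argmax(vals))
    return vals[k], cuts[k][1]

def soc_cut(x_trial, u_opt, costs, grad_costs_x, A, B, b, p, cuts_next):
    """Value underline{V}_t(x_trial) and subgradient g_t as in (5.22)."""
    val = 0.0
    g = np.zeros_like(x_trial, dtype=float)
    for cj, qj, Aj, Bj, bj, pj in zip(costs, grad_costs_x, A, B, b, p):
        y = Aj @ x_trial + Bj @ u_opt + bj               # next state realization
        vy, gy = value_and_subgrad(cuts_next, y)
        val += pj * (cj(x_trial, u_opt) + vy)
        g += pj * (qj + Aj.T @ gy)                       # formula (5.22)
    return val, g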

It could be noted that in the present case the required subgradient, and hence the corresponding cutting plane, can be computed without solving the dual problem. Therefore it may be inappropriate in this setting to call it Stochastic Dual Dynamic Programming.

When 𝒰t(xt){\cal U}_{t}(x_{t}) depends on xtx_{t}, calculation of a subgradient of function (5.21) could be more involved; nevertheless in some cases it is still possible to proceed. Suppose that 𝒰t(xt){\cal U}_{t}(x_{t}) is of the form (3.9) with the corresponding mapping gt(,)g_{t}(\cdot,\cdot) having convex components. That is, minimization problem (5.21) can be written as

minutj=1Nptj[ctj(xt,ut)+V¯t+1(Atjxt+Btjut+btj)]s.t.gt(xt,ut)0.\min\limits_{u_{t}}\sum_{j=1}^{N}p_{tj}\left[c_{tj}(x_{t},u_{t})+\underline{V}_{t+1}\big{(}A_{tj}x_{t}+B_{tj}u_{t}+b_{tj}\big{)}\right]\;\;{\rm s.t.}\;g_{t}(x_{t},u_{t})\leq 0. (5.23)

Then the required gradient is given by the gradient of the Lagrangian Lt(xt,u¯t,λ¯t)L_{t}(x_{t},\bar{u}_{t},\bar{\lambda}_{t}) of problem (5.23) with u¯t\bar{u}_{t} being the respective optimal solution and λ¯t\bar{\lambda}_{t} being the corresponding Lagrange multipliers vector, provided the optimal solution and Lagrange multipliers are unique (e.g., [7, Theorem 4.24]).

The forward step of the algorithm is performed similarly to the SP setting. Starting with the initial value x_{1}=\bar{x}_{1} and a generated sample path \hat{\xi}_{t}, t=1,..., the policy \bar{u}_{t}=\pi_{t}(x_{t}) is computed iteratively going forward in time (compare with (3.11)). That is, at stage t=1,..., given the value \bar{x}_{t} of the state vector, the control value is computed as

u¯targminut𝒰t(x¯t)j=1Nptj[ctj(x¯t,ut)+V¯t+1(Atjx¯t+Btjut+btj)].\bar{u}_{t}\in\mathop{\rm arg\,min}_{u_{t}\in{\cal U}_{t}(\bar{x}_{t})}\sum_{j=1}^{N}p_{tj}\left[c_{tj}(\bar{x}_{t},u_{t})+\underline{V}_{t+1}\big{(}A_{tj}\bar{x}_{t}+B_{tj}u_{t}+b_{tj}\big{)}\right]. (5.24)

Consequently the next stage state value is computed

x¯t+1=At(ξ^t)x¯t+Bt(ξ^t)u¯t+bt(ξ^t),\bar{x}_{t+1}=A_{t}(\hat{\xi}_{t})\bar{x}_{t}+B_{t}(\hat{\xi}_{t})\bar{u}_{t}+b_{t}(\hat{\xi}_{t}), (5.25)

and so on. Recall that in the SOC framework, ξ1\xi_{1} is also random. The computed values x¯t\bar{x}_{t} of the state variables can be used as trial points in the next iteration of the backward step of the algorithm.

Remark 5.5 (QQ-factor approach)

Consider dynamic equations (3.10) with affine state equations of the form (3.3), and define666The QQ-functions here should not be confused with the cost-to-go functions of the SP framework.

Q_{t}(x_{t},u_{t}):={\mathbb{E}}\left[c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}(A_{t}x_{t}+B_{t}u_{t}+b_{t})\right]. (5.26)

We have that

Vt(xt)=infut𝒰t(xt)Qt(xt,ut).V_{t}(x_{t})=\inf\limits_{u_{t}\in{\cal U}_{t}(x_{t})}Q_{t}(x_{t},u_{t}). (5.27)

The dynamic equations (3.10) can be written in terms of Qt(xt,ut)Q_{t}(x_{t},u_{t}) as

Qt(xt,ut)=𝔼[ct(xt,ut,ξt)+infut+1𝒰t+1(Atxt+Btut+bt)Qt+1(Atxt+Btut+bt,ut+1)].Q_{t}(x_{t},u_{t})={\mathbb{E}}\Big{[}c_{t}(x_{t},u_{t},\xi_{t})+\inf\limits_{u_{t+1}\in{\cal U}_{t+1}(A_{t}x_{t}+B_{t}u_{t}+b_{t})}Q_{t+1}\big{(}A_{t}x_{t}+B_{t}u_{t}+b_{t},u_{t+1}\big{)}\Big{]}. (5.28)

The optimal policy u¯t=πt(xt)\bar{u}_{t}=\pi_{t}(x_{t}), defined by (3.11), can be written in terms of the QQ-functions as

u¯targminut𝒰t(xt)Qt(xt,ut).\bar{u}_{t}\in\mathop{\rm arg\,min}_{u_{t}\in{\cal U}_{t}(x_{t})}Q_{t}(x_{t},u_{t}). (5.29)

Under Assumption 3.1, the functions Q_{t}(x_{t},u_{t}) are convex jointly in x_{t} and u_{t}. Consequently the cutting plane algorithm can be applied directly to the dynamic equations (5.28). That is, in the backward step the current lower approximations \underline{Q}_{t}(x_{t},u_{t}) of the respective functions Q_{t}(x_{t},u_{t}) are updated by adding cutting planes \ell_{t}(x_{t},u_{t}) computed at trial points (\tilde{x}_{t},\tilde{u}_{t}) generated in the forward step of the algorithm. The constructed approximations \underline{Q}_{t}(\cdot,\cdot) are used in the forward step of the algorithm for generating a point estimate of the expected value of the corresponding policy for a randomly generated sample path. \hfill\square
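For the Q-factor variant, the policy (5.29) can be read directly off a cut approximation of Q_{t}; a small sketch follows (replacing the exact minimization over {\cal U}_{t}(x_{t}) by enumeration of a finite candidate set of controls is our simplification, not part of the method):

import numpy as np

def q_policy(x, U_candidates, q_cuts):
    """Greedy control from cuts of Q_t(x, u), cf. (5.29).
    q_cuts: list of (value, grad_x, grad_u, point_x, point_u)."""
    def Q(u):
        return max(v + gx @ (x - px) + gu @ (u - pu)
                   for v, gx, gu, px, pu in q_cuts)
    return min(U_candidates, key=Q)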

5.3 SDDP algorithms for risk averse problems.

Cutting planes algorithms of SDDP type can be extended to risk averse problems (cf., [33],[44]). Let us start with the SP framework.

5.3.1 Stochastic Programming framework.

We follow the assumptions and notation of Section 5.1, and assume that equation (2.9) for the cost-to-go functions, is replaced by its risk averse counterpart (4.5). The backward step of the algorithm is performed similar to the risk neutral case. We only need to make adjustment to calculation of subgradients of the cutting planes approximations 𝒬¯t+1()\underline{\cal Q}_{t+1}(\cdot) of the cost-to-go functions 𝒬t+1(){\cal Q}_{t+1}(\cdot).

In numerical algorithms, we deal with discrete distributions having a finite support. So we assume in this section that Ω={ω1,,ωN}\Omega=\{\omega_{1},...,\omega_{N}\} is a finite space equipped with sigma algebra {\cal F} of all subsets of Ω\Omega, and reference probability measure {\mathbb{P}} with the respective nonzero probabilities pi=({ωi})p_{i}={\mathbb{P}}(\{\omega_{i}\}), i=1,,Ni=1,...,N. For a random variable (a function) Z:ΩZ:\Omega\to{\mathbb{R}} , we denote Zi:=Z(ωi)Z_{i}:=Z(\omega_{i}), i=1,,Ni=1,...,N.

Remark 5.6

Recall definition of law invariant risk measures discussed in section 4. Law invariant coherent risk measures have naturally defined conditional counterparts used in construction of the corresponding nested risk measures (see equations (5.51) - (5.52) below). Suppose for the moment that the probabilities pip_{i} are such that for ,𝒥{1,,N}{\cal I},{\cal J}\subset\{1,...,N\} the equality ipi=j𝒥pj\sum_{i\in{\cal I}}p_{i}=\sum_{j\in{\cal J}}p_{j} holds only if =𝒥{\cal I}={\cal J}. Then two variables Z,Z:ΩZ,Z^{\prime}:\Omega\to{\mathbb{R}} are distributionally equivalent iff they do coincide. Therefore in that case any risk measure defined on the space of variables Z:ΩZ:\Omega\to{\mathbb{R}} is law invariant. In the theory of law invariant coherent risk measures it is usually assumed that the reference probability space (Ω,,)(\Omega,{\cal F},{\mathbb{P}}) is atomless; of course an atomless space cannot be finite. In the case of finite space Ω\Omega it makes sense to consider law invariant risk measures when all probabilities pip_{i} are equal to each other, i.e., pi=1/Np_{i}=1/N, i=1,,Ni=1,...,N. In that case Z,Z:ΩZ,Z^{\prime}:\Omega\to{\mathbb{R}} are distributionally equivalent iff there exists a permutation π:ΩΩ\pi:\Omega\to\Omega such that Zi=Zπ(i)Z^{\prime}_{i}=Z_{\pi(i)}, i=1,,Ni=1,...,N. Anyway we will not restrict the following discussion to the case of all probabilities pip_{i} being equal to each other. \hfill\square

In the considered setting 𝔼[Z]=i=1NpiZi{\mathbb{E}}_{\mathbb{P}}[Z]=\sum_{i=1}^{N}p_{i}Z_{i}, and the dual representation (4.1) of coherent risk measure can be written as

(Z)=supζ𝔄i=1NpiζiZi,\begin{array}[]{ll}{\cal R}(Z)=\sup\limits_{\zeta\in{\mathfrak{A}}}\sum_{i=1}^{N}p_{i}\zeta_{i}Z_{i},\end{array} (5.30)

where 𝔄{\mathfrak{A}} is a convex closed subset of N{\mathbb{R}}^{N} consisting of vectors ζ0\zeta\geq 0 such that i=1Npiζi=1\sum_{i=1}^{N}p_{i}\zeta_{i}=1. Since it is assumed that (Z){\cal R}(Z) is real valued, the set 𝔄{\mathfrak{A}} is bounded and hence is compact.

The subdifferential of (){\cal R}(\cdot) at ZZ is given by

(Z)=argmaxζ𝔄i=1NpiζiZi.\begin{array}[]{ll}\partial{\cal R}(Z)=\mathop{\rm arg\,max}\limits_{\zeta\in{\mathfrak{A}}}\sum_{i=1}^{N}p_{i}\zeta_{i}Z_{i}.\end{array} (5.31)

Specific formulas for the subgradients, in various examples of coherent risk measures, are listed in [47, Section 6.3.2]. These formulas can be applied for computation of cutting planes in the backward step of the algorithm. Several such examples were discussed and implemented in an SDDP type algorithm in [51].

As it was described in Section 5.1, in the risk neutral setting the forward step of the algorithm has two functions; namely, construction of statistical upper bounds and generation of trial points. The constructed approximations 𝒬¯t+1(xt)\underline{\cal Q}_{t+1}(x_{t}) of the cost-to-go functions define a feasible policy in the same way as in the risk neutral case by using formula (5.9). Therefore the forward step can be used for generation of trial points similar to the risk neutral case. However, because of the nonlinearity of the risk measures, it does not produce an unbiased estimate of the risk averse value of the constructed policy and cannot be applied in a straightforward way for computing statistical upper bounds similar to the risk neutral case. At the moment there are no known efficient numerical algorithms for computing statistical upper bounds for risk averse problems in the SP framework. Interestingly, as we will argue below, in the SOC setting it is possible to construct reasonably efficient statistical upper bounds for a certain class of risk measures.

5.3.2 Stochastic Optimal Control framework.

As before we suppose that Assumptions 2.1 and 2.2 hold, and thus value functions Vt(xt)V_{t}(x_{t}), associated with coherent risk measure {\cal R}, are determined by the dynamic equations (4.8). Furthermore, we suppose that Assumption 3.1 is satisfied, and hence value functions Vt(xt)V_{t}(x_{t}) are convex.

We consider in this section the following class of coherent risk measures. This class is quite broad and includes many risk measures used in practical applications (the presentation below is based on [14]). We assume that the considered risk measures can be represented as

t(Z):=infθΘ𝔼t[Ψ(Z,θ)],Z𝒵,{\cal R}_{t}(Z):=\inf_{\theta\in\Theta}{\mathbb{E}}_{{\mathbb{P}}_{t}}[\Psi(Z,\theta)],\;\;Z\in{\cal Z}, (5.32)

where Θ\Theta is a nonempty convex closed subset of a finite dimensional vector space and Ψ:×Θ\Psi:{\mathbb{R}}\times\Theta\to{\mathbb{R}} is a real valued function. For the sake of simplicity, we assume that the function Ψ(z,θ)\Psi(z,\theta) is the same at every stage, while the reference probability distribution t{\mathbb{P}}_{t} of ξt\xi_{t} can be different at different stages. We make the following assumptions about function Ψ(z,θ)\Psi(z,\theta). (i) For every Z𝒵Z\in{\cal Z}, the expectation in the right hand side of (5.32) is well defined and the infimum is finite valued. (ii) Function Ψ(z,θ)\Psi(z,\theta) is convex in (z,θ)×Θ(z,\theta)\in{\mathbb{R}}\times\Theta. (iii) Ψ(,θ)\Psi(\cdot,\theta) is monotonically nondecreasing, i.e., if z1z2z_{1}\leq z_{2}, then Ψ(z1,θ)Ψ(z2,θ)\Psi(z_{1},\theta)\leq\Psi(z_{2},\theta) for all θΘ\theta\in\Theta.

Under these assumptions, the functional defined in (5.32) satisfies the axioms of convexity and monotonicity. Convex combination of expectation and Average Value-at-Risk, defined in (4.6), is of that form with Θ:=\Theta:={\mathbb{R}}, 𝒵:=L1(Ω,,){\cal Z}:=L_{1}(\Omega,{\cal F},{\mathbb{P}}) and

Ψ(z,θ):=(1λt)z+λt(θ+(1αt)1[zθ]+).\Psi(z,\theta):=(1-\lambda_{t})z+\lambda_{t}\left(\theta+(1-\alpha_{t})^{-1}[z-\theta]_{+}\right). (5.33)

For risk measures of the form (5.32), dynamic programming equations (4.8) can be written as

Vt(xt)=infut𝒰t(xt),θΘj=1NptjΨ(ctj(xt,ut)+Vt+1(Atjxt+Btjut+btj),θ)𝔼t[Ψ(ct(xt,ut,ξt)+Vt+1(Atxt+Btut+bt),θ)].V_{t}(x_{t})=\inf\limits_{u_{t}\in{\cal U}_{t}(x_{t}),\,\theta\in\Theta}\underbrace{\sum_{j=1}^{N}p_{tj}\Psi\big{(}c_{tj}(x_{t},u_{t})+V_{t+1}(A_{tj}x_{t}+B_{tj}u_{t}+b_{tj}),\theta\big{)}}_{{\mathbb{E}}_{{\mathbb{P}}_{t}}[\Psi(c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}(A_{t}x_{t}+B_{t}u_{t}+b_{t}),\theta)]}. (5.34)

Note that the minimization in the right hand side of (5.34) is performed jointly in controls utu_{t} and parameter vector θ\theta.

The backward step of the algorithm can be performed similarly to the risk neutral case. In case {\cal U}_{t}(x_{t})\equiv{\cal U}_{t} does not depend on x_{t}, by the chain rule of differentiation, formula (5.22) for the required subgradient is extended to

j=1Nptj[Ψ(ytj,θ¯t)(ctj(xt,u¯t)+AtjV¯t+1(Atjxt+Btju¯t+btj))],\sum_{j=1}^{N}p_{tj}\left[\Psi^{\prime}(y_{tj},\bar{\theta}_{t})\left(\nabla c_{tj}(x_{t},\bar{u}_{t})+A^{\top}_{tj}\nabla\underline{V}_{t+1}\big{(}A_{tj}x_{t}+B_{tj}\bar{u}_{t}+b_{tj}\big{)}\right)\right], (5.35)

where u¯t\bar{u}_{t} and θ¯t\bar{\theta}_{t} are minimizers in the right hand side of (5.34), with Vt+1()V_{t+1}(\cdot) replaced by its current approximation V¯t+1()\underline{V}_{t+1}(\cdot), and Ψ(ytj,θ¯t)\Psi^{\prime}(y_{tj},\bar{\theta}_{t}) is a subgradient of Ψ(,θ¯t)\Psi(\cdot,\bar{\theta}_{t}) at ytj:=ctj(xt,u¯t)+V¯t+1(Atjxt+Btju¯t+btj)y_{tj}:=c_{tj}(x_{t},\bar{u}_{t})+\underline{V}_{t+1}\big{(}A_{tj}x_{t}+B_{tj}\bar{u}_{t}+b_{tj}\big{)}.

The subgradient Ψ(,θ)\Psi^{\prime}(\cdot,\theta) can be easily computed in many interesting examples (e.g., [47, Section 6.3.2]). For example, consider convex combination of expectation and Average Value-at-Risk measure, defined in (4.6). The corresponding function Ψ(z,θ)\Psi(z,\theta) of the form (5.33), is piecewise linear in zz. Then Ψ(z,θ)=1λt\Psi^{\prime}(z,\theta)=1-\lambda_{t} if z<θz<\theta, Ψ(z,θ)=1λt+λt(1αt)1\Psi^{\prime}(z,\theta)=1-\lambda_{t}+\lambda_{t}(1-\alpha_{t})^{-1} if z>θz>\theta; and if z=θz=\theta, then Ψ(z,θ)\Psi^{\prime}(z,\theta) can be any point in the interval [1λt,1λt+λt(1αt)1][1-\lambda_{t},1-\lambda_{t}+\lambda_{t}(1-\alpha_{t})^{-1}].
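For the convex combination of expectation and Average Value-at-Risk, the function \Psi of (5.33) and the subgradient just described take the following form (a sketch; at the kink z=\theta we return the left endpoint of the admissible interval):

def psi(z, theta, lam, alpha):
    """Psi(z, theta) of (5.33): convex combination of expectation and AV@R."""
    return (1.0 - lam) * z + lam * (theta + max(z - theta, 0.0) / (1.0 - alpha))

def psi_prime(z, theta, lam, alpha):
    """A subgradient of Psi(., theta) at z (left endpoint chosen at z = theta)."""
    return (1.0 - lam) + (lam / (1.0 - alpha) if z > theta else 0.0)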

The computed approximations V¯t+1()\underline{V}_{t+1}(\cdot) of the value functions define a feasible policy in the same way as in the risk neutral case. Also by construction we have that Vt()V¯t()V_{t}(\cdot)\geq\underline{V}_{t}(\cdot), t=1,,Tt=1,...,T. Therefore the algorithm produces a (deterministic) lower bound for the optimal value of the problem.

Formula (5.34) shows that the controls and value of the parameter θ\theta can be computed simultaneously. This leads to the following statistical upper bound for the (risk averse) value of the policy defined by the computed approximation. Let ξ^1,,ξ^T\hat{\xi}_{1},...,\hat{\xi}_{T} be a sample path of the data process. Suppose for the moment that t=𝔼{\cal R}_{t}={\mathbb{E}}, i.e., consider the risk neutral case. Let π\pi be a policy determined by controls u¯t\bar{u}_{t} and states x¯t\bar{x}_{t}, and c^t:=ct(x¯t,u¯t,ξ^t)\hat{c}_{t}:=c_{t}(\bar{x}_{t},\bar{u}_{t},\hat{\xi}_{t}), t=1,,Tt=1,...,T, and c^T+1:=cT+1(x¯T+1)\hat{c}_{T+1}:=c_{T+1}(\bar{x}_{T+1}), be cost values associated with the sample path. Then the corresponding point estimate of the expectation (3.1) of the total cost for the considered policy π\pi, can be computed iteratively going backward in time with 𝔳T+1:=c^T+1{\mathfrak{v}}_{T+1}:=\hat{c}_{T+1} and

𝔳t:=c^t+𝔳t+1,t=T,,1.{\mathfrak{v}}_{t}:=\hat{c}_{t}+{\mathfrak{v}}_{t+1},\;\;t=T,...,1. (5.36)

Value 𝔳1{\mathfrak{v}}_{1} gives an unbiased estimate of the corresponding expected total cost.

This process can be adapted for general risk averse case as follows. Let A^t=At(ξ^t)\hat{A}_{t}=A_{t}(\hat{\xi}_{t}), B^t=Bt(ξ^t)\hat{B}_{t}=B_{t}(\hat{\xi}_{t}) and b^t=bt(ξ^t)\hat{b}_{t}=b_{t}(\hat{\xi}_{t}), t=1,,Tt=1,...,T, be realizations of the parameters corresponding to the generated sample path. For the approximations V¯t()\underline{V}_{t}(\cdot) consider the corresponding policy with controls u¯t\bar{u}_{t} and parameters θ¯t\bar{\theta}_{t} computed going forward, starting with initial value x¯1\bar{x}_{1} and using (compare with (5.24) - (5.25))

(u¯t,θ¯t)\displaystyle(\bar{u}_{t},\bar{\theta}_{t}) \displaystyle\in argminut𝒰t(x¯t),θΘj=1NptjΨ(ctj(x¯t,ut)+V¯t+1(Atjx¯t+Btjut+btj),θ),\displaystyle\mathop{\rm arg\,min}\limits_{u_{t}\in{\cal U}_{t}(\bar{x}_{t}),\,\theta\in\Theta}\sum_{j=1}^{N}p_{tj}\Psi\big{(}c_{tj}(\bar{x}_{t},u_{t})+\underline{V}_{t+1}(A_{tj}\bar{x}_{t}+B_{tj}u_{t}+b_{tj}),\theta\big{)}, (5.37)
x¯t+1\displaystyle\bar{x}_{t+1} =\displaystyle= A^tx¯t+B^tu¯t+b^t.\displaystyle\hat{A}_{t}\ \bar{x}_{t}+\hat{B}_{t}\bar{u}_{t}+\hat{b}_{t}. (5.38)

Let c^t:=ct(x¯t,u¯t,ξ^t)\hat{c}_{t}:=c_{t}(\bar{x}_{t},\bar{u}_{t},\hat{\xi}_{t}), t=1,,Tt=1,...,T, and c^T+1:=cT+1(x¯T+1)\hat{c}_{T+1}:=c_{T+1}(\bar{x}_{T+1}), be cost values associated with the generated sample path. Consider the following values defined iteratively going backward in time: 𝔳T+1:=c^T+1{\mathfrak{v}}_{T+1}:=\hat{c}_{T+1} and

𝔳t:=Ψ(c^t+𝔳t+1,θ¯t),t=T,,1.{\mathfrak{v}}_{t}:=\Psi(\hat{c}_{t}+{\mathfrak{v}}_{t+1},\bar{\theta}_{t}),\;\;t=T,...,1. (5.39)

It is possible to show that 𝔼[𝔳1]{\mathbb{E}}[{\mathfrak{v}}_{1}] is greater than or equal to value of the policy defined by the considered approximate value functions, and hence is an upper bound on the optimal value of the risk averse problem (cf., [14]). Expectation 𝔼[𝔳1]{\mathbb{E}}[{\mathfrak{v}}_{1}] can be estimated by randomly generating sample paths (scenarios) and averaging the computed realizations of random variable 𝔳1{\mathfrak{v}}_{1}. This is similar to construction of statistical upper bound in the risk neutral setting. The inequality: “𝔼[𝔳1]{\mathbb{E}}[{\mathfrak{v}}_{1}]\geq value of the constructed policy”, is based on Jensen’s inequality applied to convex function Ψ(,θ)\Psi(\cdot,\theta) and, unlike the risk neutral case, can be strict. Therefore the above procedure introduces an additional error in the estimated optimality gap. This additional error becomes larger when function Ψ(,θ)\Psi(\cdot,\theta) is “more nonlinear”. Numerical experiments indicate that this procedure gives a reasonable upper bound, for instance for risk measures of the form (4.6) (cf., [14]).
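The backward recursions (5.36) and (5.39) along a single sample path can be sketched as follows (the argument psi stands for the function \Psi; passing the identity in z recovers the risk neutral recursion (5.36)):

def path_estimate(costs, thetas, psi):
    """costs  = [hat c_1, ..., hat c_T, hat c_{T+1}] along one sample path,
    thetas = [bar theta_1, ..., bar theta_T] from (5.37).
    Returns v_1; averaging over independent paths estimates E[v_1]."""
    v = costs[-1]                                  # v_{T+1} = hat c_{T+1}
    for c, theta in zip(reversed(costs[:-1]), reversed(thetas)):
        v = psi(c + v, theta)                      # v_t = Psi(hat c_t + v_{t+1}, bar theta_t)
    return v

For the risk neutral estimate (5.36) one would call path_estimate(costs, thetas, lambda z, th: z).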

QQ-factor approach.

Consider the dynamic equations (5.34) and define (see Remark 5.5)

Qt(xt,ut,θt):=𝔼t[Ψ(ct(xt,ut,ξt)+Vt+1(Atxt+Btut+bt),θt)].Q_{t}(x_{t},u_{t},\theta_{t}):={\mathbb{E}}_{{\mathbb{P}}_{t}}\left[\Psi\big{(}c_{t}(x_{t},u_{t},\xi_{t})+V_{t+1}(A_{t}x_{t}+B_{t}u_{t}+b_{t}),\theta_{t}\big{)}\right]. (5.40)

We have that

Vt(xt)=infut𝒰t,θtΘQt(xt,ut,θt),V_{t}(x_{t})=\inf\limits_{u_{t}\in{\cal U}_{t},\,\theta_{t}\in\Theta}Q_{t}(x_{t},u_{t},\theta_{t}),

and hence dynamic equations (5.34) can be written in terms of Qt(xt,ut,θt)Q_{t}(x_{t},u_{t},\theta_{t}) as

Qt(xt,ut,θt)=𝔼t[Ψ(ct(xt,ut,ξt)+infut+1𝒰t+1,θt+1ΘQt+1(Atxt+Btut+bt,ut+1,θt+1),θt)].Q_{t}(x_{t},u_{t},\theta_{t})={\mathbb{E}}_{{\mathbb{P}}_{t}}\left[\Psi\Big{(}c_{t}(x_{t},u_{t},\xi_{t})+\inf\limits_{u_{t+1}\in{\cal U}_{t+1},\,\theta_{t+1}\in\Theta}Q_{t+1}\big{(}A_{t}x_{t}+B_{t}u_{t}+b_{t},u_{t+1},\theta_{t+1}\big{)},\theta_{t}\Big{)}\right].

The cutting planes algorithm can be applied directly to functions Qt(xt,ut,θt)Q_{t}(x_{t},u_{t},\theta_{t}) rather than to the value functions Vt(xt)V_{t}(x_{t}). In the backward step of the algorithm, subgradients with respect to xt,utx_{t},u_{t} and θt\theta_{t}, of the current approximations of Qt(xt,ut,θt)Q_{t}(x_{t},u_{t},\theta_{t}), should be computed at the respective trial points, and the obtained cutting plane t(xt,ut,θt)\ell_{t}(x_{t},u_{t},\theta_{t}) is added to the current family of cutting planes of Qt(xt,ut,θt)Q_{t}(x_{t},u_{t},\theta_{t}). An advantage of that approach could be that the calculation of the respective subgradients does not require solving nonlinear optimization programs even if the function Ψ\Psi is not polyhedral.

5.3.3 Infinite horizon setting

An infinite horizon (stationary) counterpart of the SOC problem (3.1) - (3.2) is

minπΠ𝔼π[t=1γt1c(xt,ut,ξt)],\min\limits_{\pi\in\Pi}{\mathbb{E}}^{\pi}\Big{[}\sum_{t=1}^{\infty}\gamma^{t-1}c(x_{t},u_{t},\xi_{t})\Big{]}, (5.41)

where \gamma\in(0,1) is the so-called discount factor. The optimization (minimization) in (5.41) is over the set of feasible policies \Pi satisfying777When the number of scenarios is finite, the feasibility constraints should be satisfied for all possible realizations of the random process \xi_{t}. (w.p.1):

ut𝒰,xt+1=F(xt,ut,ξt),t1.u_{t}\in{\cal U},\;x_{t+1}=F(x_{t},u_{t},\xi_{t}),\;t\geq 1. (5.42)

It is assumed that ξtP\xi_{t}\sim P, t=1,,t=1,..., is an i.i.d. (independent identically distributed) sequence of random vectors, with common distribution PP. The cost c(x,u,ξ)c(x,u,\xi) and the mapping F(x,u,ξ)F(x,u,\xi) are assumed to be the same at every stage, and the (nonempty) feasibility set 𝒰{\cal U} of controls does not depend on the state and stage.

Bellman equation for the value function, associated with problem (5.41), can be written as

V(x)=infu𝒰𝔼P[c(x,u,ξ)+γV(F(x,u,ξ))].V(x)=\inf_{u\in{\cal U}}{\mathbb{E}}_{P}[c(x,u,\xi)+\gamma V(F(x,u,\xi))]. (5.43)

Assuming that c(x,u,\xi) is bounded, equation (5.43) has a unique solution \bar{V}(\cdot). The optimal policy for problem (5.41) is given by u_{t}=\pi(x_{t}), t=1,..., with

π(x)argminu𝒰𝔼P[c(x,u,ξ)+γV¯(F(x,u,ξ))].\pi(x)\in\mathop{\rm arg\,min}_{u\in{\cal U}}{\mathbb{E}}_{P}\big{[}c(x,u,\xi)+\gamma\bar{V}(F(x,u,\xi))\big{]}. (5.44)

Similar to Assumption 3.1, let us make the following assumption ensuring convexity of the solution \bar{V}(\cdot) of equation (5.43).

Assumption 5.2

The function c(x,u,ξ)c(x,u,\xi) is convex in (x,u)(x,u), the mapping F(x,u,ξ)=A(ξ)x+B(ξ)u+b(ξ)F(x,u,\xi)=A(\xi)x+B(\xi)u+b(\xi) is affine, and the set 𝒰{\cal U} is convex closed.

In order to solve equation (5.43) numerically we need to discretize the possibly continuous distribution PP. Suppose that the SAA method is applied, i.e., a sample ξ1,,ξN\xi^{1},...,\xi^{N} of size NN from the distribution PP is generated (say by Monte Carlo sampling techniques), and the distribution PP in Bellman equation (5.43) is replaced by its empirical estimate888By δξ\delta_{\xi} we denote measure assigning mass one at point ξ\xi, the so-called Dirac measure. PN=N1j=1NδξjP_{N}=N^{-1}\sum_{j=1}^{N}\delta_{\xi^{j}}, assigning probability 1/N1/N to each generated point. This raises the question of the involved sample complexity, i.e., how large should be the sample size NN in order for the SAA problem to give an accurate approximation of the original problem. In some applications the discount factor is very close to one. It is well known that as the discount factor γ\gamma approaches one, it becomes more difficult to solve the problem. Note that the optimal value of problem (5.41) increases at the rate of O((1γ)1)O((1-\gamma)^{-1}) as γ\gamma approaches one. It is shown in [45] that the required sample size NN is of order O((1γ)1ε2)O((1-\gamma)^{-1}\varepsilon^{-2}) as a function of the discount factor γ\gamma and the error level ε>0\varepsilon>0. That is, the error grows more or less proportionally to the optimal value as γ\gamma approaches one. This indicates that the sample size required to achieve a specified relative error is not sensitive to the discount factor being close to one.

Suppose now that random vector ξ\xi has a finite number ξ1,,ξN\xi^{1},...,\xi^{N} of possible values with respective probabilities p1,,pNp_{1},...,p_{N}. For example, in the SAA approach this is a random sample generated by Monte Carlo sampling techniques, in which case pi=1/Np_{i}=1/N, i=1,,Ni=1,...,N. In that setting Bellman equation (5.43) can be written as

V(x)=infu𝒰j=1Npj[cj(x,u)+γV(Ajx+Bju+bj)],V(x)=\inf_{u\in{\cal U}}\,\sum_{j=1}^{N}p_{j}[c_{j}(x,u)+\gamma V(A_{j}x+B_{j}u+b_{j})], (5.45)

where cj(x,u):=c(x,u,ξj)c_{j}(x,u):=c(x,u,\xi^{j}), Aj:=A(ξj)A_{j}:=A(\xi^{j}), Bj:=B(ξj)B_{j}:=B(\xi^{j}) and bj:=b(ξj)b_{j}:=b(\xi^{j}), j=1,,Nj=1,...,N.

A cutting planes algorithm of SDDP type can be applied to equation (5.45). Given a current approximation V¯()\underline{V}(\cdot) of the value function by cutting planes and a trial point x~\tilde{x}, new cutting plane (x)=v~+𝗀(xx~)\ell(x)=\tilde{v}+{\sf g}^{\top}(x-\tilde{x}) is added to the set of cutting planes of the value function, where

u~\displaystyle\tilde{u}\in argminu𝒰j=1Npj[cj(x~,u)+γV¯(Ajx~+Bju+bj)],\displaystyle\mathop{\rm arg\,min}\limits_{u\in{\cal U}}\sum_{j=1}^{N}p_{j}[c_{j}(\tilde{x},u)+\gamma\underline{V}(A_{j}\tilde{x}+B_{j}u+b_{j})], (5.46)
v~=\displaystyle\tilde{v}= j=1Npj[cj(x~,u~)+γV¯(Ajx~+Bju~+bj)],\displaystyle\sum_{j=1}^{N}p_{j}[c_{j}(\tilde{x},\tilde{u})+\gamma\underline{V}(A_{j}\tilde{x}+B_{j}\tilde{u}+b_{j})], (5.47)
𝗀=\displaystyle{\sf g}= j=1Npj[cj(x~,u~)+γAjV¯(Ajx~+Bju~+bj)].\displaystyle\sum_{j=1}^{N}p_{j}[\nabla c_{j}(\tilde{x},\tilde{u})+\gamma A_{j}^{\top}\nabla\underline{V}(A_{j}\tilde{x}+B_{j}\tilde{u}+b_{j})]. (5.48)

If c(x,u,ξ)c(x,u,\xi) is linear (polyhedral) in uu and the set 𝒰{\cal U} is polyhedral given by linear inequalities, the minimization problem in the right hand side of (5.46) can be formulated as a linear programming problem. The subgradient of V¯()\underline{V}(\cdot) at a point xx can be computed in the same way as in (5.22) by using subgradient of its supporting plane at xx.
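A sketch of one cut update (5.46) - (5.48) for the stationary value function is given below (Python; for simplicity the minimization over u\in{\cal U} is replaced by enumeration of a finite candidate set of controls, which is our simplification; costs[j](x,u) and grad_costs_x[j](x,u) are callables evaluating c_{j} and a subgradient of c_{j}(\cdot,u), the cuts of \underline{V} are stored as triples (value, gradient, point), and the cut list is assumed nonempty):

import numpy as np

def stationary_cut(x_trial, U_candidates, costs, grad_costs_x, A, B, b, p, cuts, gamma):
    """Returns (v_tilde, g) defining the new cut at x_trial, cf. (5.46)-(5.48)."""
    def V(y):                                    # evaluate underline{V} and an active gradient
        vals = [v + g @ (y - pt) for v, g, pt in cuts]
        k = int(np.argmax(vals))
        return vals[k], cuts[k][1]

    def objective(u):                            # expected stage cost plus discounted underline{V}
        return sum(pj * (cj(x_trial, u) + gamma * V(Aj @ x_trial + Bj @ u + bj)[0])
                   for cj, Aj, Bj, bj, pj in zip(costs, A, B, b, p))

    u_best = min(U_candidates, key=objective)    # approximate minimizer in (5.46)
    v_tilde = objective(u_best)                  # value (5.47)
    g = np.zeros_like(x_trial, dtype=float)
    for cj, qj, Aj, Bj, bj, pj in zip(costs, grad_costs_x, A, B, b, p):
        y = Aj @ x_trial + Bj @ u_best + bj
        g += pj * (qj(x_trial, u_best) + gamma * Aj.T @ V(y)[1])   # subgradient (5.48)
    return v_tilde, g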

Similar to the case of finite horizon, by construction we have that \bar{V}(\cdot)\geq\underline{V}(\cdot), and hence \underline{V}(x_{1}) gives a (deterministic) lower bound for the optimal value of problem (5.41) with initial value x_{1} of the state vector. Important questions related to the above algorithm are how to generate trial points and how to construct an upper bound for the optimal value of problem (5.41). The current implementation, which was tested in numerical experiments, is to apply the forward step of the cutting planes algorithm to a truncated version of problem (5.41). Note that

|t=T+1γt1c(xt,ut,ξt)|t=T+1γt1|c(xt,ut,ξt)|κγT1γ,\left|\sum_{t=T+1}^{\infty}\gamma^{t-1}c(x_{t},u_{t},\xi_{t})\right|\leq\sum_{t=T+1}^{\infty}\gamma^{t-1}\left|c(x_{t},u_{t},\xi_{t})\right|\leq\frac{\kappa\gamma^{T}}{1-\gamma}, (5.49)

where κ\kappa is an upper bound for |c(x,u,ξ)||c(x,u,\xi)|. Therefore the suggestion is to choose the horizon TT such that κγT(1γ)1ε\kappa\gamma^{T}(1-\gamma)^{-1}\leq\varepsilon, where ε>0\varepsilon>0 is a prescribed accuracy. Then the forward steps are applied to the finite horizon version of the problem using current approximation V¯()\underline{V}(\cdot) of the value function. One run of the forward step, applied to the truncated version, will generate a point estimate (up to accuracy ε\varepsilon) of value of the constructed policy, and T1T-1 trial points. It is possible to choose, say at random, one of these trial points for construction of the cutting plane for the value function.
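The choice of the truncation horizon amounts to the following simple computation (for instance, with \kappa=10, \gamma=0.99 and \varepsilon=0.1 it gives T=917, illustrating that the required horizon may be large when \gamma is close to one):

import math

def truncation_horizon(kappa, gamma, eps):
    """Smallest T with kappa * gamma**T / (1 - gamma) <= eps, cf. (5.49)."""
    return max(1, math.ceil(math.log(eps * (1.0 - gamma) / kappa) / math.log(gamma)))

print(truncation_horizon(10.0, 0.99, 0.1))   # prints 917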

Such a choice of the trial points may not be optimal. The question of the “best way” to construct trial points in the above algorithm has not been carefully investigated. Another problem with this procedure is that when the discount factor is close to one, the required horizon T could be too large for a numerical implementation; recall that in order to construct the statistical upper bound the forward step should be run a reasonable number of times. In such cases, if the problem can be formulated as a linear program, deterministic upper bounds obtained from the dual problem can be more efficient (cf., [46]).

Risk averse case.

The infinite horizon risk neutral formulation (5.41) can be extended to the risk averse setting by replacing the expectation operator in Bellman equation (5.43) with a coherent risk measure {\cal R}. That is, risk averse counterpart of Bellman equation (5.43) is

V(x)=\inf_{u\in{\cal U}}{\cal R}[c(x,u,\xi)+\gamma V(F(x,u,\xi))]. (5.50)

This equation corresponds to the nested counterpart of the risk neutral problem (5.41):

minπΠlimTT(t=1Tγt1c(xt,ut,ξt)),\begin{array}[]{ll}\min\limits_{\pi\in\Pi}\ \lim\limits_{T\to\infty}{\mathfrak{R}}_{T}\left(\sum_{t=1}^{T}\gamma^{t-1}c(x_{t},u_{t},\xi_{t})\right),\end{array} (5.51)

where T{\mathfrak{R}}_{T} is the corresponding TT-stage nested risk measure (cf., [42]). That is, as it was discussed in section 3, for a considered policy, state xtx_{t} and control utu_{t} at stage tt are functions of ξ[t1]\xi_{[t-1]}. Consider the corresponding cost Zt:=γt1c(xt,ut,ξt)Z_{t}:=\gamma^{t-1}c(x_{t},u_{t},\xi_{t}) which is a function of ξ[t]\xi_{[t]}. Then

{\mathfrak{R}}_{T}\left(\sum_{t=1}^{T}Z_{t}\right)={\cal R}\Big{(}Z_{1}+{\cal R}_{|\xi_{[1]}}\big{(}Z_{2}+...+{\cal R}_{|\xi_{[T-1]}}(Z_{T})\big{)}\Big{)}, (5.52)

where |ξ[t]{\cal R}_{|\xi_{[t]}} is the conditional counterpart of {\cal R}.

The backward step of the cutting planes algorithm in the risk averse setting is basically the same as the one discussed above for the risk neutral case. In order to construct a statistical upper bound we can proceed similar to the construction of section 5.3.2. That is, suppose that risk measure {\cal R} is of the form (5.32) with function Ψ(z,θ)\Psi(z,\theta) satisfying conditions (i) - (iii) specified in section 5.3.2. Suppose also that ξ\xi has a finite number ξ1,,ξN\xi^{1},...,\xi^{N} of possible values with reference probabilities p1,,pNp_{1},...,p_{N}. Then Bellman equation (5.50) can be written as

V(x)=infu𝒰,θΘj=1NpjΨ(cj(x,u)+γV(Ajx+Bju+bj),θ).V(x)=\inf_{u\in{\cal U},\,\theta\in\Theta}\,\sum_{j=1}^{N}p_{j}\Psi\big{(}c_{j}(x,u)+\gamma V(A_{j}x+B_{j}u+b_{j}),\theta\big{)}. (5.53)

A statistical upper bound can be constructed in a way similar to section 5.3.2 using a truncated version of problem (5.41).

Periodical setting

In various applications the data process $\xi_{t}$ has a periodical behavior. One such example is the hydropower generation planning discussed in [51]. The uncertain process in that example is water inflows recorded on a monthly basis. Historical data of 79 years of records of the natural monthly energy inflow are available. This allows modeling the uncertain process as a periodical random process with period ${\sf p}=12$. The planning in that example was for 5 years, with another 5 years added to eliminate the end of horizon effect. This resulted in a $T=120$ stage stochastic optimization problem. The periodical approach, which we discuss below, allows a drastic reduction in the number of stages while giving almost the same policy value for the first stage.

In [48] such a periodical approach was introduced from the stochastic programming point of view. We discuss this in the framework of SOC. Consider the following infinite horizon SOC problem with discount factor $\gamma\in(0,1)$:

minut𝒰t\displaystyle\min\limits_{u_{t}\in{\cal U}_{t}} 𝔼π[t=1γt1ct(xt,ut,ξt)]\displaystyle{\mathbb{E}}^{\pi}\left[\sum\limits_{t=1}^{\infty}\gamma^{t-1}c_{t}(x_{t},u_{t},\xi_{t})\right] (5.54)
s.t.\displaystyle{\rm s.t.} xt+1=Ft(xt,ut,ξt),t1.\displaystyle x_{t+1}=F_{t}(x_{t},u_{t},\xi_{t}),\;t\geq 1. (5.55)

We assume the following periodic structure, with period ${\sf p}\geq 1$: the random data process $\xi_{t}$ is stagewise independent with $\xi_{t}$ and $\xi_{t+{\sf p}}$ having the same probability distribution, and with $c_{t}(\cdot,\cdot,\cdot)=c_{t+{\sf p}}(\cdot,\cdot,\cdot)$, $F_{t}(\cdot,\cdot,\cdot)=F_{t+{\sf p}}(\cdot,\cdot,\cdot)$ and ${\cal U}_{t}={\cal U}_{t+{\sf p}}$, for $t\geq 1$.

Bellman equations for this problem take the form (compare with (3.10))

Vt(xt)\displaystyle V_{t}(x_{t}) =\displaystyle= infut𝒰t𝔼[ct(xt,ut,ξt)+γVt+1(Ft(xt,ut,ξt))],t=1,,𝗉1,\displaystyle\inf\limits_{u_{t}\in{\cal U}_{t}}{\mathbb{E}}\left[c_{t}(x_{t},u_{t},\xi_{t})+\gamma V_{t+1}\big{(}F_{t}(x_{t},u_{t},\xi_{t})\big{)}\right],\;t=1,...,{\sf p}-1, (5.56)
V𝗉(x𝗉)\displaystyle V_{\sf p}(x_{\sf p}) =\displaystyle= infu𝗉𝒰𝗉𝔼[c𝗉(x𝗉,u𝗉,ξ𝗉)+γV1(F𝗉(x𝗉,u𝗉,ξ𝗉))],t=𝗉.\displaystyle\inf\limits_{u_{\sf p}\in{\cal U}_{\sf p}}{\mathbb{E}}\left[c_{\sf p}(x_{\sf p},u_{\sf p},\xi_{\sf p})+\gamma V_{1}\big{(}F_{\sf p}(x_{\sf p},u_{\sf p},\xi_{\sf p})\big{)}\right],\;t={\sf p}. (5.57)

For period ${\sf p}=1$, the periodical problem (5.54)-(5.55) coincides with the stationary problem (5.41) of section 5.3.3, and equation (5.57) becomes Bellman equation (5.43). The optimal policy for problem (5.54)-(5.55) is given by $u_{t}=\pi_{t}(x_{t})$ with

\pi_{t}(x_{t})\in\mathop{\rm arg\,min}\limits_{u_{t}\in{\cal U}_{t}}{\mathbb{E}}\left[c_{t}(x_{t},u_{t},\xi_{t})+\gamma\bar{V}_{t+1}\big{(}F_{t}(x_{t},u_{t},\xi_{t})\big{)}\right],\;t=1,\ldots,{\sf p}-1, (5.58)
\pi_{\sf p}(x_{\sf p})\in\mathop{\rm arg\,min}\limits_{u_{\sf p}\in{\cal U}_{\sf p}}{\mathbb{E}}\left[c_{\sf p}(x_{\sf p},u_{\sf p},\xi_{\sf p})+\gamma\bar{V}_{1}\big{(}F_{\sf p}(x_{\sf p},u_{\sf p},\xi_{\sf p})\big{)}\right],\;t={\sf p}, (5.59)

and πt+𝗉()=πt()\pi_{t+{\sf p}}(\cdot)=\pi_{t}(\cdot) for t1t\geq 1, where (V¯1,,V¯𝗉)(\bar{V}_{1},...,\bar{V}_{\sf p}) is the solution of equations (5.56) - (5.57).
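
To make the fixed point system (5.56)-(5.57) concrete, the following minimal Python sketch performs backward sweeps of value iteration on a discretized one-dimensional state space with sampled realizations of $\xi_{t}$; the cost, dynamics, grids and discount factor are hypothetical stand-ins, not data from [51].

```python
import numpy as np

# Minimal sketch: fixed-point iteration for the periodic Bellman equations
# (5.56)-(5.57).  One-dimensional state, discretized state/control grids and
# sampled realizations of xi_t; all problem data below are illustrative.
rng = np.random.default_rng(0)
p, gamma = 12, 0.8                                 # period and discount factor
x_grid = np.linspace(0.0, 1.0, 41)                 # discretized state space
u_grid = np.linspace(0.0, 1.0, 11)                 # discretized control set U_t
xi = [rng.uniform(0.0, 0.5, size=20) for _ in range(p)]   # samples of xi_t

def cost(x, u, xi_t):                              # stage cost c_t(x, u, xi)
    return (x - u) ** 2 + 0.1 * u + xi_t

def dynamics(x, u, xi_t):                          # F_t(x, u, xi), kept in [0, 1]
    return np.clip(x + u - xi_t, 0.0, 1.0)

V = np.zeros((p, x_grid.size))                     # V_t on the grid, t = 1,...,p
for sweep in range(100):
    V_old = V.copy()
    for t in reversed(range(p)):                   # backward sweep over one period
        V_next = V[(t + 1) % p]                    # V_{p+1} is identified with V_1
        for i, x in enumerate(x_grid):
            q = [np.mean(cost(x, u, xi[t])
                         + gamma * np.interp(dynamics(x, u, xi[t]), x_grid, V_next))
                 for u in u_grid]
            V[t, i] = min(q)                       # expectation, then minimization over u
    if np.max(np.abs(V - V_old)) < 1e-4:           # stop when a sweep no longer changes V
        break
print("approximate V_1(0.5):", np.interp(0.5, x_grid, V[0]))
```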

Such a periodical formulation can be extended to the (nested) risk averse setting, and cutting planes type algorithms can be applied in a way similar to what was discussed above (cf., [48]).

6 Computational Complexity of Cutting Plane Methods

As mentioned earlier, SDDP reduces to Kelley’s cutting plane method for two stage problems, and it can behave poorly as the number of decision variables increases. On the other hand, numerical experiments indicate that this type of algorithm can scale well with respect to the number of stages $T$. In this section we summarize some recent theoretical studies in [22, 17] on the rate of convergence of SDDP. We first establish the number of iterations required by the SDDP method, and then present a variant of SDDP that can improve this complexity bound in terms of the dependence on the number of stages. Related developments, with a focus more on mixed-integer nonlinear optimization, can also be found in [53].

6.1 Computational complexity of SDDP

Consider the multi-stage stochastic programming problem (2.1)-(2.2). Similar to the previous section, we assume that the $\xi_{t}$'s are stagewise independent (i.e., Assumption 2.2) and finitely supported (i.e., Assumption 5.1). For simplicity we further assume that the probabilities for $\xi_{ti}$, $i=1,\ldots,N$, are equal, i.e., $p_{ti}=1/N$, $t=2,\ldots,T$, $i=1,\ldots,N$. In that case, the dynamic equations in (2.8)-(2.9) reduce to

Qt(xt1,ξti)=infxt𝒳t{Fti(xt):=ft(xt,ξti)+𝒬t+1(xt):Btixt1+Atixt=bti},Q_{t}(x_{t-1},\xi_{ti})=\inf_{x_{t}\in{\cal X}_{t}}\left\{F_{ti}(x_{t}):=f_{t}(x_{t},\xi_{ti})+{\cal Q}_{t+1}(x_{t}):B_{ti}x_{t-1}+A_{ti}x_{t}=b_{ti}\right\}, (6.1)

where

𝒬t+1(xt)=1Ni=1NQt+1(xt,ξt+1,i).{\cal Q}_{t+1}(x_{t})=\tfrac{1}{N}{\textstyle\sum}_{i=1}^{N}Q_{t+1}(x_{t},\xi_{t+1,i}). (6.2)

The basic scheme of the SDDP method has been discussed in Section 5.1. Here we add a few more details to facilitate its complexity analysis. Each iteration of the SDDP method to solve problem (2.1)-(2.2) contains a forward step and a backward step. In the forward step of the SDDP method, we randomly pick an index $i_{t}$ out of $\{1,\ldots,N\}$ and solve the problem

xtkargmin{F¯titk1(xt):=ft(xt,ξtit)+𝒬¯t+1k1(xt)}s.t.Btitxt1k+Atitxt=btit,xt𝒳t,\begin{array}[]{lll}x_{t}^{k}&\in&\mathop{\rm arg\,min}\left\{\underline{F}_{t{i_{t}}}^{k-1}(x_{t}):=f_{t}(x_{t},\xi_{t{i_{t}}})+\underline{\cal Q}_{t+1}^{k-1}(x_{t})\right\}\\ &&\ \ \ {\mbox{s.t.}\ \ \ B_{ti_{t}}x_{t-1}^{k}+A_{ti_{t}}x_{t}=b_{ti_{t}}},\;x_{t}\in{\cal X}_{t},\end{array}

to update xtkx_{t}^{k}. Equivalently, one can view xtkx_{t}^{k} as being randomly chosen from x~tik\tilde{x}_{ti}^{k}, i=1,,Ni=1,\ldots,N, defined as

x~tikargmin{F¯tik1(xt):=ft(xt,ξti)+𝒬¯t+1k1(xt)}s.t.Btixt1k+Atixt=bti,xt𝒳t.\begin{array}[]{lll}\tilde{x}_{ti}^{k}&\in&\mathop{\rm arg\,min}\left\{\underline{F}_{ti}^{k-1}(x_{t}):=f_{t}(x_{t},\xi_{ti})+\underline{\cal Q}_{t+1}^{k-1}(x_{t})\right\}\\ &&\ \ \ {\mbox{s.t.}\ \ \ B_{ti}x_{t-1}^{k}+A_{ti}x_{t}=b_{ti}},\;x_{t}\in{\cal X}_{t}.\end{array} (6.3)

Note that we do not need to compute $\tilde{x}_{ti}^{k}$ for $i\neq i_{t}$, even though they will be used in the analysis of the SDDP method. In the next section, we will present a deterministic dual dynamic programming method which chooses the feasible solution $x_{t}^{k}$ in the forward step in a more aggressive manner.

In the backward step of SDDP, starting from the last stage, we update the cutting plane models 𝒬¯tk\underline{\cal Q}_{t}^{k} according to

\underline{\cal Q}_{t}^{k}(x_{t-1})=\max\left\{\underline{\cal Q}_{t}^{k-1}(x_{t-1}),\tfrac{1}{N}{\textstyle\sum}_{i=1}^{N}\left[{\mathfrak{Q}}_{ti}^{k}(x_{t-1}^{k})+\langle({\mathfrak{Q}}_{ti}^{k})^{\prime}(x_{t-1}^{k}),x_{t-1}-x_{t-1}^{k}\rangle\right]\right\},

for t=T,T1,,2t=T,T-1,\ldots,2. Here 𝔔tik(xt1k){\mathfrak{Q}}_{ti}^{k}(x_{t-1}^{k}) and (𝔔tik)(xt1k)({\mathfrak{Q}}_{ti}^{k})^{\prime}(x_{t-1}^{k}), respectively, denote the optimal value and a subgradient at the point xt1kx_{t-1}^{k} for the following approximate value function

𝔔tik(xt1k)=min{F¯tik(xt):=ft(xt,ξti)+𝒬¯t+1k(xt)}s.t.Btixt1k+Atixt=bti,xt𝒳t.\begin{array}[]{lll}{\mathfrak{Q}}_{ti}^{k}(x_{t-1}^{k})&=&\min\left\{\underline{F}_{ti}^{k}(x_{t}):=f_{t}(x_{t},\xi_{ti})+\underline{\cal Q}_{t+1}^{k}(x_{t})\right\}\\ &&\ \ {\mbox{s.t.}\ \ B_{ti}x_{t-1}^{k}+A_{ti}x_{t}=b_{ti}},x_{t}\in{\cal X}_{t}.\end{array}
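
The cutting plane model maintained in the backward step can be represented simply as a collection of affine minorants. The following minimal Python sketch shows this data structure and the averaged-cut update displayed above; the per-scenario optimal values ${\mathfrak{Q}}_{ti}^{k}(x_{t-1}^{k})$ and subgradients are assumed to be supplied by a stage subproblem solver, which is not shown.

```python
import numpy as np

class CutModel:
    """Piecewise linear lower model  Q_t(x) >= max_j (alpha_j + <beta_j, x>).

    Minimal sketch of the object updated in the backward step of SDDP/EDDP;
    how the scenario values/subgradients are obtained (stage solves) is left out.
    """
    def __init__(self, n, lower_bound=0.0):
        self.alphas = [lower_bound]              # start with a trivial constant cut
        self.betas = [np.zeros(n)]

    def value(self, x):
        """Evaluate the current lower approximation at x."""
        return max(a + b @ x for a, b in zip(self.alphas, self.betas))

    def add_average_cut(self, x_trial, values, subgrads):
        """Backward-step update: average the scenario cuts at the trial point.

        values[i]   ~  optimal value of the i-th scenario subproblem at x_trial
        subgrads[i] ~  a subgradient of that value function at x_trial
        """
        v = np.mean(values)
        g = np.mean(np.asarray(subgrads), axis=0)
        self.alphas.append(v - g @ x_trial)      # intercept of the new averaged cut
        self.betas.append(g)

# toy usage with made-up numbers for a 2-dimensional state
model = CutModel(n=2)
model.add_average_cut(np.array([1.0, 0.5]), values=[3.0, 4.0],
                      subgrads=[np.array([1.0, 0.0]), np.array([0.0, 2.0])])
print(model.value(np.array([1.0, 0.5])))         # 3.5 = average scenario value
```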

In order to assess the progress made by each SDDP iteration, we need to introduce a few notions. We say that a search point $x_{t}^{k}$ is $\epsilon_{t}$-saturated at iteration $k$ if ${\cal Q}_{t+1}(x_{t}^{k})-\underline{\cal Q}_{t+1}^{k}(x_{t}^{k})\leq\epsilon_{t}$. Intuitively, a point $x_{t}^{k}$ is saturated if we have a good approximation of the value function ${\cal Q}_{t+1}$ at $x_{t}^{k}$. Moreover, we say that an $\epsilon_{t}$-saturated search point $x_{t}^{k}$ at stage $t$ is $\delta_{t}$-distinguishable if $\|x_{t}^{k}-x_{t}^{j}\|>\delta_{t}$ for all other $\epsilon_{t}$-saturated search points $x_{t}^{j}$, $j\leq k-1$, that have been generated for stage $t$ so far by the algorithm. Let us denote by $S_{t}^{k-1}$ the set of saturated points at stage $t$. An $\epsilon_{t}$-saturated search point $x_{t}^{k}$ is $\delta_{t}$-distinguishable if $\mathop{\rm dist}(x_{t}^{k},S_{t}^{k-1})>\delta_{t}$. Recall that $\mathop{\rm dist}(x,S)$ denotes the distance from the point $x$ to the set $S$. Hence, in this case, the search point $x_{t}^{k}$ is far enough away from the other saturated points. The rate of convergence of SDDP depends on how fast it generates $\epsilon_{t}$-saturated and $\delta_{t}$-distinguishable search points. For simplicity we assume in this paper that $\epsilon_{t}=\delta_{t}=\epsilon$, $t=1,\ldots,T$, for a given tolerance $\epsilon>0$.

Under certain regularity assumptions, it can be shown that the functions $F_{ti}$ and the lower approximation functions $\underline{F}_{ti}^{k}$ are Lipschitz continuous with constant $M$. Utilizing these continuity assumptions, it can be shown that the probability for SDDP to find a new $\epsilon$-distinguishable and $\epsilon$-saturated search point at the $k$-th iteration can be bounded from below by

1N¯(1𝖯𝗋{g~tkϵ,t=1,,T1}),\tfrac{1}{\bar{N}}(1-{\sf Pr}\{\tilde{g}_{t}^{k}\leq\epsilon,t=1,\ldots,T-1\}), (6.4)

where

N¯:=NT2andg~tk:=1Ni=1Ndist(x~tik,Stk1).\bar{N}:=N^{T-2}\ \ \mbox{and}\ \ \ \tilde{g}_{t}^{k}:=\tfrac{1}{N}{\textstyle\sum}_{i=1}^{N}\mathop{\rm dist}(\tilde{x}_{ti}^{k},S_{t}^{k-1}). (6.5)

On the other hand, if for some iteration kk, we have g~tkϵ\tilde{g}_{t}^{k}\leq\epsilon for all t=1,,T1t=1,\ldots,T-1, then

F11(x1k)FF11(x1k)F¯11k1(x1k)2M(T1)ϵ.F_{11}(x_{1}^{k})-F^{*}\leq F_{11}(x_{1}^{k})-\underline{F}^{k-1}_{11}(x_{1}^{k})\leq 2M(T-1)\,\epsilon. (6.6)

Note that the upper bound 2M(T1)2M(T-1) can be further reduced so that it does not depend on TT if a discounting factor is incorporated in each stage of the problem.

Now let KK denote the number of iterations performed by the SDDP method before it finds a forward path (x1k,,xTk)(x_{1}^{k},\ldots,x_{T}^{k}) such that (6.6) holds. Using the previous observation in (6.4), it can be shown (see [22]) that 𝔼[K]K¯ϵN¯+2{\mathbb{E}}[K]\leq\bar{K}_{\epsilon}\bar{N}+2, where N¯\bar{N} is defined in (6.5) and

K¯ϵ:=(T1)(Dϵ+1)n.\bar{K}_{\epsilon}:=(T-1)\left(\tfrac{D}{\epsilon}+1\right)^{n}. (6.7)

Here ntnn_{t}\leq n and DtDD_{t}\leq D for t=1,,Tt=1,\ldots,T denote the dimension and diameter of the feasible sets 𝒳t{\cal X}_{t}, respectively. In addition, for any α1\alpha\geq 1, we have

𝖯𝗋{KαK¯ϵN¯+1}exp((α1)2K¯ϵ22αN¯).{\sf Pr}\{K\geq\alpha\bar{K}_{\epsilon}\bar{N}+1\}\leq\exp\left(-\tfrac{(\alpha-1)^{2}\bar{K}_{\epsilon}^{2}}{2\alpha\bar{N}}\right).

We now add a few remarks about the above complexity results for SDDP. Firstly, since SDDP is a randomized algorithm, we provide bounds on the expected number of iterations required to find an approximate solution of problem (2.1)-(2.2). We also show that the probability of having large deviations from these expected bounds for SDDP decays exponentially fast. Secondly, the complexity bound for the SDDP method depends on N¯\bar{N}, which increases exponentially with respect to TT. We will show in the next section how to reduce the dependence on TT by presenting a variant of the SDDP method.
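
For a rough feeling of how these bounds scale, the following snippet evaluates $\bar{K}_{\epsilon}$ from (6.7), together with the resulting EDDP and SDDP iteration bounds, for purely illustrative parameter values (the numbers are hypothetical, not taken from any experiment).

```python
# Illustrative evaluation of the bound (6.7) and of the SDDP/EDDP iteration bounds;
# all parameter values below are hypothetical.
T, N, n, D, eps = 5, 20, 4, 1.0, 0.1
K_eps = (T - 1) * (D / eps + 1) ** n             # (6.7): exponential in the dimension n
N_bar = N ** (T - 2)                             # (6.5): exponential in the horizon T
print(f"K_eps      ~ {K_eps:.2e}")
print(f"EDDP bound ~ {K_eps + 1:.2e}")           # independent of N_bar
print(f"SDDP bound ~ {K_eps * N_bar + 2:.2e}")   # grows with N_bar, i.e., with T
```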

6.2 Explorative Dual Dynamic Programming

The SDDP method in the previous section chooses the feasible solution xtkx_{t}^{k} in a randomized manner. In this section, we will introduce a deterministic method, called Explorative Dual Dynamic Programming (EDDP) first presented in [22], which chooses the feasible solution in the forward step in an aggressive manner. As we will see, the latter approach will exhibit better iteration complexity while the former one is easier to implement.

The EDDP method consists of the forward step and backward step. In the forward step of EDDP, for each stage tt, we solve NN subproblems defined in (6.3) to compute the search points x~tik\tilde{x}_{ti}^{k}, i=1,,Ni=1,\ldots,N. For each x~tik\tilde{x}_{ti}^{k}, we further compute the quantity dist(x~tik,Stk1)\mathop{\rm dist}(\tilde{x}_{ti}^{k},S_{t}^{k-1}), the distance between x~tik\tilde{x}_{ti}^{k} and the set Stk1S_{t}^{k-1} of currently saturated search points in stage tt. Then we will choose from x~tik\tilde{x}_{ti}^{k}, i=1,,Ni=1,\ldots,N, the one with the largest value of dist(x~tik,Stk1)\mathop{\rm dist}(\tilde{x}_{ti}^{k},S_{t}^{k-1}) as xtkx_{t}^{k}, i.e., dist(xtk,Stk1)=maxi=1,,Ndist(x~tik,Stk1)\mathop{\rm dist}(x_{t}^{k},S_{t}^{k-1})=\max_{i=1,\ldots,N}\mathop{\rm dist}(\tilde{x}_{ti}^{k},S_{t}^{k-1}). We can break the ties arbitrarily. In view of the above discussion, the EDDP method always chooses the most “distinguishable” forward path to encourage exploration in an aggressive manner. This also explains the origin of the name EDDP. The backward step of EDDP is similar to SDDP.
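
The forward-step selection rule of EDDP can be written in a few lines. The sketch below (in Python) assumes the candidate points $\tilde{x}_{ti}^{k}$ have already been computed by solving the subproblems (6.3), and simply picks the candidate farthest from the current set of saturated points.

```python
import numpy as np

def eddp_select(candidates, saturated):
    """EDDP forward step at stage t: pick the most 'distinguishable' candidate.

    candidates : array of shape (N, n), the points  x~_{ti}^k, i = 1,...,N
    saturated  : array of shape (m, n), the set S_t^{k-1} of saturated points
    Returns the chosen x_t^k and its distance to S_t^{k-1}.
    """
    if saturated.size == 0:                      # nothing saturated yet
        return candidates[0], np.inf
    # dist(x, S) for every candidate x
    dists = np.linalg.norm(candidates[:, None, :] - saturated[None, :, :],
                           axis=2).min(axis=1)
    i_star = int(np.argmax(dists))               # ties broken arbitrarily
    return candidates[i_star], dists[i_star]

# toy usage with hypothetical candidate solutions of (6.3)
cands = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
S = np.array([[0.1, 0.0]])
x_k, d = eddp_select(cands, S)
print(x_k, d)        # picks [1., 1.], the candidate farthest from the saturated set
```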

Under the same regularity assumptions as in SDDP, it can be shown that each EDDP iteration will either generate at least one ϵ\epsilon-distinguishable and ϵ\epsilon-saturated search point at some stage t=1,,Tt=1,\ldots,T, or find a feasible solution x1kx_{1}^{k} of problem (2.1)-(2.2) such that

F11(x1k)F\displaystyle F_{11}(x_{1}^{k})-F^{*} 2M(T1)ϵ.\displaystyle\leq 2M(T-1)\,\epsilon. (6.8)

As a result, the total number of iterations performed by EDDP before finding a feasible policy of problem (2.1)-(2.2) satisfying (6.8) is bounded by $\bar{K}_{\epsilon}+1$, where $\bar{K}_{\epsilon}$ is defined in (6.7) (see [22]). This complexity bound does not depend on the number of scenarios $\bar{N}$, and hence significantly outperforms the one for the original SDDP method. Moreover, we can show that the relation $\mathop{\rm dist}(x_{1}^{k},S_{1}^{k-1})\leq\delta$ implies the relation in (6.8) for a properly chosen value of $\delta$, and hence it can be used as a termination criterion for the EDDP method. Note that it is possible to further reduce the dependence of the computational complexity of EDDP on $T$ so that the term $\bar{K}_{\epsilon}$ in (6.7) does not depend on $T$. However, this would require us to further modify EDDP by incorporating early termination of the forward step. More specifically, we need to terminate the forward step at some stage $\bar{t}\leq T$ and start the backward step from $\bar{t}$, whenever $\mathop{\rm dist}(x_{\bar{t}}^{k},S_{\bar{t}}^{k-1})$ falls within a certain threshold value (see [53, 17]).

It should be noted that although the complexity of SDDP is worse than that of EDDP, its performance in earlier steps of the algorithm should be similar to that of EDDP. Intuitively, for earlier iterations, the tolerance parameter $\delta_{t}$ used to define $\delta_{t}$-distinguishable points is large. As long as $\delta_{t}$ is large enough so that the solutions $\tilde{x}^{k}_{ti}$ are contained within a ball with diameter roughly in the order of $\delta_{t}$, one can choose any point randomly from $\tilde{x}^{k}_{ti}$ as $x^{k}_{t}$. In this case, SDDP will perform similarly to EDDP. This may explain why SDDP exhibits good practical performance in the low accuracy region. In the high accuracy region, the new EDDP algorithm seems to be a much better choice in terms of its theoretical complexity.

6.3 Complexity of EDDP and SDDP over an infinite horizon

It is possible to generalize EDDP and SDDP to solve stochastic programs over an infinite horizon (see Section 5.3.3), when the number of stages $T=\infty$, the cost function $f_{t}=f$, the feasible set ${\cal X}_{t}={\cal X}$ for all $t$, and the random variables $\xi_{t}$ are iid (independent identically distributed) and can be viewed as realizations of a random vector $\xi$. A common approach is to approximate the infinite horizon with a finite horizon. For example, using the relation (5.49) in the SOC problem, a horizon length of $T=O((\epsilon(1-\gamma))^{-1})$ suffices for a target accuracy $\epsilon$. A direct extension of the EDDP method shows that the computational complexity depends only quadratically on the time horizon $T$. Exploiting the fact that the infinite-horizon problem is stationary, it has been shown that a simple modification that combines the forward and backward steps can improve the dependence on $T$ from quadratic to linear. Alternative selection strategies, including the use of upper bounds on the cost-to-go function and random sampling as in SDDP, can also be incorporated (see [17] for more details).

7 Dynamic Stochastic Approximation Algorithms

7.1 Extension of Stochastic Approximation

Stochastic approximation (SA) has attracted much attention recently for solving static stochastic optimization problems given in the form of

minx𝒳{f(x):=𝔼ξ[F(x,ξ)]},\min_{x\in{\cal X}}\left\{f(x):={\mathbb{E}}_{\xi}[F(x,\xi)]\right\}, (7.1)

where 𝒳{\cal X} is a closed convex set, ξ\xi denotes the random vector and F(,ξ)F(\cdot,\xi) is a lower semicontinuous convex function. Observe that we can cast two-stage stochastic optimization in the form of (7.1) by viewing FF as the value function of the second-stage problem (see [28, 23, 19]).

The basic SA algorithm, initially proposed by Robbins and Monro [38], mimics the simple projected gradient descent method by replacing the exact gradient with its unbiased estimator. Important improvements of the SA methods have been made by Nemirovski and Yudin [29] and later by Polyak and Juditsky [35, 36]. During the past few years, significant progress has been made in SA methods, including the incorporation of momentum and variance reduction, and generalization to the nonconvex setting [21]. It has been shown in [28, 23] that SA type methods can significantly outperform the SAA approach for solving static (or two-stage) stochastic programming problems. However, it remains unclear whether these SA methods can be generalized for multi-stage stochastic optimization problems with $T\geq 3$.

In this section, we discuss a recently developed dynamic stochastic approximation (DSA) method for multi-stage stochastic optimization [24]. The basic idea of the DSA method is to apply an inexact primal-dual SA method for solving the tt-th stage optimization problem to compute an approximate stochastic subgradient for its associated value functions. For simplicity, we focus on the basic scheme of the DSA algorithm and its main convergence properties for solving three-stage stochastic optimization problems. We also briefly discuss how to generalize it for solving more general form of multi-stage stochastic optimization with T>3T>3.

The main problem of interest in this section is the following three-stage SP problem:

min\displaystyle\min f1(x1,ξ1)+\displaystyle f_{1}(x_{1},\xi_{1})+ 𝔼|ξ1[min\displaystyle{\mathbb{E}}_{|\xi_{1}}[\min f2(x2,ξ2)\displaystyle\ f_{2}(x_{2},\xi_{2}) +𝔼|ξ[2][min\displaystyle+{\mathbb{E}}_{|\xi_{[2]}}[\min f3(x3,ξ3)]]\displaystyle\ f_{3}(x_{3},\xi_{3})]] (7.2)
s.t. A1x1=b1,\displaystyle\ A_{1}x_{1}=b_{1}, s.t. A2x2+B2x1=b2,\displaystyle\ A_{2}x_{2}+B_{2}x_{1}=b_{2}, s.t. A3x3+B3x2=b3,\displaystyle\ A_{3}x_{3}+B_{3}x_{2}=b_{3},
x1𝒳1,\displaystyle x_{1}\in{\cal X}_{1},\ \ \ x2𝒳2,\displaystyle\ x_{2}\in{\cal X}_{2},\ \ \ x3𝒳3.\displaystyle\ x_{3}\in{\cal X}_{3}.

We can write problem (7.2) in a more compact form by using value functions as discussed in (2.3)-(2.5). More specifically, let $Q_{3}(x_{2},\xi_{[3]})$ be the value function at the third stage and ${\cal Q}_{3}(x_{2},\xi_{[2]})$ be the corresponding expected value function conditional on $\xi_{[2]}$:

Q3(x2,ξ[3]):=minf3(x3,ξ3) s.t.A3x3+B3x2=b3,x3𝒳3.𝒬3(x2,ξ[2]):=𝔼|ξ[2][Q3(x2,ξ[3])].\begin{array}[]{lll}Q_{3}(x_{2},\xi_{[3]})&:=&\min\ f_{3}(x_{3},\xi_{3})\\ &&\text{ s.t.}\ \ A_{3}x_{3}+B_{3}x_{2}=b_{3},\\ &&\quad\quad\quad x_{3}\in{\cal X}_{3}.\\ {\cal Q}_{3}(x_{2},\xi_{[2]})&:=&{\mathbb{E}}_{|\xi_{[2]}}[Q_{3}(x_{2},\xi_{[3]})].\end{array} (7.3)

We can then define the stochastic cost-to-go function Q2(x1,ξ[2])Q_{2}(x_{1},\xi_{[2]}) and its corresponding (expected) value function as

Q2(x1,ξ[2]):=minf2(x2,ξ2)+𝒬3(x2,ξ[2]) s.t.A2x2+B2x1=b2,x2𝒳2.𝒬2(x1,ξ1):=𝔼|ξ1[Q2(x1,ξ[2])]=𝔼[Q2(x1,ξ2)].\begin{array}[]{lll}Q_{2}(x_{1},\xi_{[2]})&:=&\min\ f_{2}(x_{2},\xi_{2})+{\cal Q}_{3}(x_{2},\xi_{[2]})\\ &&\ \text{ s.t.}\ \ A_{2}x_{2}+B_{2}x_{1}=b_{2},\\ &&\quad\quad\quad x_{2}\in{\cal X}_{2}.\\ {\cal Q}_{2}(x_{1},\xi_{1})&:=&{\mathbb{E}}_{|\xi_{1}}[Q_{2}(x_{1},\xi_{[2]})]={\mathbb{E}}[Q_{2}(x_{1},\xi_{2})].\end{array} (7.4)

Problem (7.2) can then be formulated equivalently as

minf1(x1,ξ1)+𝒬2(x1,ξ1) s.t.A1x1=b1,x1𝒳1.\begin{array}[]{ll}\min\ f_{1}(x_{1},\xi_{1})+{\cal Q}_{2}(x_{1},\xi_{1})\\ \text{ s.t.}\ \ A_{1}x_{1}=b_{1},\\ \quad\quad\quad x_{1}\in{\cal X}_{1}.\end{array} (7.5)

7.2 Approximate stochastic subgradients

In order to solve problem (7.5) by stochastic approximation type methods, we need to understand how to compute first-order information of the value functions ${\cal Q}_{2}$ and ${\cal Q}_{3}$. Recall that for a given closed convex set ${\cal X}\subseteq{\mathbb{R}}^{n}$ and a closed convex function ${\cal Q}:{\cal X}\to{\mathbb{R}}$, $g(x)$ is called an $\epsilon$-subgradient of ${\cal Q}$ at $x\in{\cal X}$ if

𝒬(y)𝒬(x)+g(x),yxϵy𝒳.{\cal Q}(y)\geq{\cal Q}(x)+\langle g(x),y-x\rangle-\epsilon\ \ \forall y\in{\cal X}. (7.6)

The collection of all such $\epsilon$-subgradients of ${\cal Q}$ at $x$ is called the $\epsilon$-subdifferential of ${\cal Q}$ at $x$, denoted by $\partial_{\epsilon}{\cal Q}(x)$. Since both ${\cal Q}_{2}$ and ${\cal Q}_{3}$ are given in the form of (conditional) expectation, their exact first-order information is hard to compute. We resort to the computation of a stochastic $\epsilon$-subgradient of these value functions, defined as follows. Observe that we do not assume stagewise independence throughout this section.

Definition 7.1

G(x,ξ[t])G(x,\xi_{[t]}) is called a stochastic ϵ\epsilon-subgradient of the value function 𝒬t(x,ξ[t1])=𝔼|ξ[t1][Qt(x,ξ[t])]{\cal Q}_{t}(x,\xi_{[t-1]})={\mathbb{E}}_{|\xi_{[t-1]}}[Q_{t}(x,\xi_{[t]})] if G(x,ξ[t])G(x,\xi_{[t]}) is an unbiased estimator of an ϵ\epsilon-subgradient of 𝒬t(x,ξ[t1]){\cal Q}_{t}(x,\xi_{[t-1]}) with respect to xx, i.e.,

𝔼|ξ[t1][G(x,ξ[t])]=g(x,ξ[t1])andg(x,ξ[t1])ϵ𝒬t(x,ξ[t1]).{\mathbb{E}}_{|\xi_{[t-1]}}[G(x,\xi_{[t]})]=g(x,\xi_{[t-1]})\ \ \mbox{and}\ \ g(x,\xi_{[t-1]})\in\partial_{\epsilon}{\cal Q}_{t}(x,\xi_{[t-1]}). (7.7)

To compute a stochastic $\epsilon$-subgradient of ${\cal Q}_{2}$ (resp., ${\cal Q}_{3}$), we have to compute an approximate subgradient of the corresponding stochastic value function $Q_{2}(x_{1},\xi_{[2]})$ (resp., $Q_{3}(x_{2},\xi_{[3]})$). To this end, we further assume that strong Lagrange duality holds for the optimization problems defined in (7.4) (resp., (7.3)) almost surely. In other words, these problems can be formulated as saddle point problems:

Q2(x1,ξ[2])\displaystyle Q_{2}(x_{1},\xi_{[2]}) =maxy2minx2𝒳2b2B2x1A2x2,y2+f2(x2,ξ2)+𝒬3(x2,ξ[2]),\displaystyle=\max_{y_{2}}\min_{x_{2}\in{\cal X}_{2}}\langle b_{2}-B_{2}x_{1}-A_{2}x_{2},y_{2}\rangle+f_{2}(x_{2},\xi_{2})+{\cal Q}_{3}(x_{2},\xi_{[2]}), (7.8)
Q3(x2,ξ[3])\displaystyle Q_{3}(x_{2},\xi_{[3]}) =maxy3minx3𝒳3b3B3x2A3x3,y3+f3(x3,ξ3).\displaystyle=\max_{y_{3}}\min_{x_{3}\in{\cal X}_{3}}\langle b_{3}-B_{3}x_{2}-A_{3}x_{3},y_{3}\rangle+f_{3}(x_{3},\xi_{3}). (7.9)

One set of sufficient conditions to guarantee the equivalence between (7.4) (resp.,(7.3)) and (7.8) (resp., (7.9)) is that (7.4) (resp.,(7.3)) is solvable and the Slater condition holds.

Observe that (7.8) and (7.9) are special cases of a more generic saddle point problem:

Q(u,ξ):=maxyminx𝒳bBuAx,y+f(x,ξ)+𝒬(x),Q(u,\xi):=\max_{y}\min_{x\in{\cal X}}\langle b-Bu-Ax,y\rangle+f(x,\xi)+{\cal Q}(x), (7.10)

where bb, AA and BB are linear mappings of the random variable ξ\xi. For example, (7.9) is a special case of (7.10) with u=x2u=x_{2}, y=y3y=y_{3}, ξ=ξ3\xi=\xi_{3}, f=f3f=f_{3} and 𝒬=0{\cal Q}=0. It is worth noting that the first stage problem can also be viewed as a special case of (7.10), since (7.5) is equivalent to

maxy1minx1𝒳1{b1A1x1,y1+f1(x1,ξ1)+𝒬2(x1,ξ1)}.\max_{y_{1}}\min_{x_{1}\in{\cal X}_{1}}\ \left\{\langle b_{1}-A_{1}x_{1},y_{1}\rangle+f_{1}(x_{1},\xi_{1})+{\cal Q}_{2}(x_{1},\xi_{1})\right\}. (7.11)

We now discuss how to relate an approximate solution of the saddle point problem in (7.10) to an approximate subgradient of QQ. Let (x,y)(x_{*},y_{*}) be a pair of optimal solutions of the saddle point problem (7.10), i.e.,

Q(u,ξ)\displaystyle Q(u,\xi) =y,bBuAx+f(x,ξ)+𝒬(x)=f(x,ξ)+𝒬(x),\displaystyle=\langle y_{*},b-Bu-Ax_{*}\rangle+f(x_{*},\xi)+{\cal Q}(x_{*})=f(x_{*},\xi)+{\cal Q}(x_{*}), (7.12)

where the second identity follows from the complementary slackness of Lagrange duality. It can be shown (see Lemma 1 of [24]) that if

gap(z¯;x,y)\displaystyle\mathop{\rm gap}(\bar{z};x,y_{*}) :=y,bBuAx¯+f(x¯,ξ)+𝒬(x¯)\displaystyle:=\langle y_{*},b-Bu-A\bar{x}\rangle+f(\bar{x},\xi)+{\cal Q}(\bar{x}) (7.13)
y¯,bBuAxf(x,ξ)𝒬(x)ϵ,x𝒳,\displaystyle\quad-\langle\bar{y},b-Bu-Ax\rangle-f(x,\xi)-{\cal Q}(x)\leq\epsilon,\ \forall x\in{\cal X},

for a given z¯:=(x¯,y¯)Z\bar{z}:=(\bar{x},\bar{y})\in Z and un0u\in{\mathbb{R}}^{n_{0}}, then BTy¯-B^{T}\bar{y} is an ϵ\epsilon-subgradient of Q(u,ξ)Q(u,\xi) at uu.

In view of this observation, in order to compute a stochastic subgradient of 𝒬t(u,ξ[t1])=𝔼[Qt(u,ξ[t])|ξ[t1]]{\cal Q}_{t}(u,\xi_{[t-1]})={\mathbb{E}}[Q_{t}(u,\xi_{[t]})|\xi_{[t-1]}] at a given point uu, we can first generate a random realization ξt\xi_{t} conditionally on ξ[t1]\xi_{[t-1]} and then try to find a pair of solutions (x¯,y¯)(\bar{x},\bar{y}) satisfying

yt,btBtuAtx¯+f(x¯,ξt)+𝒬t+1(x¯,ξ[t])\displaystyle\langle y_{t*},b_{t}-B_{t}u-A_{t}\bar{x}\rangle+f(\bar{x},\xi_{t})+{\cal Q}_{t+1}(\bar{x},\xi_{[t]})
y¯,btBtuAtxf(x,ξt)𝒬t+1(x,ξ[t])ϵ,x𝒳t,\displaystyle-\langle\bar{y},b_{t}-B_{t}u-A_{t}x\rangle-f(x,\xi_{t})-{\cal Q}_{t+1}(x,\xi_{[t]})\leq\epsilon,\ \forall x\in{\cal X}_{t},

where $y_{t*}\equiv y_{t*}(\xi_{[t]})$ denotes the optimal solution for the $t$-th stage problem associated with the random realization $\xi_{[t]}$. We will then use $-B^{T}\bar{y}$ as a stochastic $\epsilon$-subgradient of ${\cal Q}_{t}(u,\xi_{[t-1]})$ at $u$. The difficulty is that the function ${\cal Q}_{t+1}(\bar{x},\xi_{[t]})$ is also given in the form of an expectation, and requires a numerical procedure to estimate its value. We will discuss this in more detail in the next section.

7.3 The DSA algorithm and its convergence properties

Our goal in this section is to present the basic scheme of the dynamic stochastic approximation algorithm applied to problem (7.5). This algorithm relies on the following three key primal-dual steps, referred to as stochastic primal-dual transformation (SPDT), applied to the generic saddle point problem in (7.10) at every stage.

(p+,d+,d~)=SPDT(p,d,d_,𝒬,u,ξ,f,X,θ,τ,η)(p_{+},d_{+},\tilde{d})={\rm SPDT}(p,d,d_{\_},{\cal Q}^{\prime},u,\xi,f,X,\theta,\tau,\eta):

d~\displaystyle\tilde{d} =θ(dd_)+d.\displaystyle=\theta(d-d_{\_})+d. (7.14)
p+\displaystyle p_{+} =argminxXbBuAx,d~+f(x,ξ)+𝒬,x+τ2xp2.\displaystyle=\mathop{\rm arg\,min}_{x\in X}\langle b-Bu-Ax,\tilde{d}\rangle+f(x,\xi)+\langle{\cal Q}^{\prime},x\rangle+\tfrac{\tau}{2}\|x-p\|^{2}. (7.15)
d+\displaystyle d_{+} =argminyb+Bu+Ap+,y+η2yd2.\displaystyle=\mathop{\rm arg\,min}_{y}\langle-b+Bu+Ap_{+},y\rangle+\tfrac{\eta}{2}\|y-d\|^{2}. (7.16)

In the above primal-dual transformation, the input (p,d,d_)(p,d,d_{\_}) denotes the current primal solution, dual solution, and the previous dual solution, respectively. Moreover, the input 𝒬{\cal Q}^{\prime} denotes a stochastic ϵ\epsilon-subgradient for 𝒬{\cal Q} at the current search point pp. The parameters (u,ξ,f,X)(u,\xi,f,X) describe the problem in (7.10) and (θ,τ,η)(\theta,\tau,\eta) are certain algorithmic parameters to be specified. Given these input parameters, the relation in (7.14) defines a dual extrapolation (or prediction) step to estimate the dual variable d~\tilde{d} for the next iterate. Based on this estimate, (7.15) performs a primal projection to compute p+p_{+}, and then (7.16) updates in the dual space to compute d+d_{+} by using the updated p+p_{+}. Note that the Euclidean projection in (7.15) can be extended to the non-Euclidean setting by replacing the term xp2/2\|x-p\|^{2}/2 with Bregman divergence. We assume that the above SPDT operator can be performed very fast or even has explicit expressions. The primal-dual transformation is closely related to the alternating direction method of multipliers and the primal-dual hybrid gradient method (see [24] for an account of history for this procedure).
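
For illustration, the following minimal Python sketch spells out one application of (7.14)-(7.16), assuming (as a simplification not required by the method) a linear stage cost $f(x,\xi)=\langle c,x\rangle$ and a box feasible set $X$, for which both prox steps have closed forms; all data and parameter values are hypothetical.

```python
import numpy as np

def spdt(p, d, d_prev, Qp, u, c, A, B, b, X_lo, X_hi, theta, tau, eta):
    """One stochastic primal-dual transformation (7.14)-(7.16).

    Sketch for the Euclidean case with linear stage cost f(x, xi) = <c, x> and a
    box feasible set X = [X_lo, X_hi]; Qp is a stochastic epsilon-subgradient of
    the next-stage value function at the current primal point p.
    """
    d_tilde = theta * (d - d_prev) + d                        # (7.14) dual extrapolation
    # (7.15): argmin_x <b - B u - A x, d_tilde> + <c + Qp, x> + tau/2 ||x - p||^2 over the box
    p_plus = np.clip(p - (c + Qp - A.T @ d_tilde) / tau, X_lo, X_hi)
    # (7.16): argmin_y <-b + B u + A p_plus, y> + eta/2 ||y - d||^2
    d_plus = d + (b - B @ u - A @ p_plus) / eta
    return p_plus, d_plus, d_tilde

# toy usage with illustrative random data
rng = np.random.default_rng(1)
n, m, n0 = 3, 2, 3
A, B, b = rng.standard_normal((m, n)), rng.standard_normal((m, n0)), rng.standard_normal(m)
c, Qp, u = rng.standard_normal(n), np.zeros(n), rng.standard_normal(n0)
p, d = np.zeros(n), np.zeros(m)
p, d_new, _ = spdt(p, d, d, Qp, u, c, A, B, b, 0.0, 1.0, theta=1.0, tau=2.0, eta=2.0)
print(p, d_new)
```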

In order to solve problem (7.5), we will combine the above primal-dual transformation applied to all the three stages, the scenario generation for the random variables ξ2\xi_{2} and ξ3\xi_{3} in the second and third stage, and certain averaging steps in both the primal and dual spaces to compute an approximate pair of primal and dual solution for the saddle point problem in the form of (7.10) at each stage. More specifically, the DSA method consists of three loops. The innermost (third) loop runs N3N_{3} steps of SPDT in order to compute an approximate stochastic subgradient of the value function 𝒬3{\cal Q}_{3} of the third stage. The second loop consists of N2N_{2} SPDTs applied to the saddle point formulation of the second-stage problem, which requires the output from the third loop. The outer loop applies N1N_{1} SPDTs to the saddle point formulation of the first-stage optimization problem in (7.5), using the approximate stochastic subgradients for 𝒬2{\cal Q}_{2} computed by the second loop. In this algorithm, we need to generate N1N_{1} and N1×N2N_{1}\times N_{2} realizations for the random vectors ξ2\xi_{2} and ξ3\xi_{3}, respectively.

Observe that the DSA algorithm described above is conceptual only, since we have not specified any algorithmic parameters $(\theta,\tau,\eta)$ yet. Two sets of parameters will be chosen depending on the stages where the SPDTs are applied. One set of parameters, which may lead to a slightly slower rate of convergence but can guarantee the boundedness of the generated dual iterates, will be used for the second and third stages, while another parameter setting is used in the first stage to achieve a faster rate of convergence. Suppose that $f_{t}$, $t=1,2,3$, are generally (not necessarily strongly) convex. By properly specifying the algorithmic parameters, it can be shown that by setting $N_{3}={\cal O}(1/\epsilon^{2})$ and $N_{2}={\cal O}(1/\epsilon^{2})$, we will find an approximate $\epsilon$-solution, i.e., a point $\bar{x}_{1}\in{\cal X}_{1}$ such that

𝔼[f1(x¯1,ξ1)+𝒬2(x¯1,ξ1)(f1(x,ξ1)+𝒬2(x,ξ1))]ϵ,𝔼[Ax¯1b]ϵ,\begin{array}[]{l}{\mathbb{E}}[f_{1}(\bar{x}_{1},\xi_{1})+{\cal Q}_{2}(\bar{x}_{1},\xi_{1})-(f_{1}(x_{*},\xi_{1})+{\cal Q}_{2}(x_{*},\xi_{1}))]\leq\epsilon,\\ {\mathbb{E}}[\|A\bar{x}_{1}-b\|]\leq\epsilon,\end{array}

in at most $N_{1}={\cal O}(1/\epsilon^{2})$ outer iterations. As a consequence, the numbers of random samples of $\xi_{2}$ and $\xi_{3}$ used in the DSA method are bounded by

N1=𝒪(1/ϵ2)andN1×N2=𝒪(1/ϵ4),N_{1}=\mathcal{O}(1/\epsilon^{2})\ \ \mbox{and}\ \ \ N_{1}\times N_{2}=\mathcal{O}(1/\epsilon^{4}), (7.17)

respectively. Now consider the case when the objective functions $f_{t}$, $t=1,2,3$, are strongly convex. In that case the above complexity results can be significantly improved. In particular, by choosing $N_{3}={\cal O}(1/\sqrt{\epsilon})$ and $N_{2}={\cal O}(1/\epsilon)$, we will find an approximate $\epsilon$-solution in at most $N_{1}={\cal O}(1/\epsilon)$ outer iterations. As a result, it can be shown that the number of random samples of $\xi_{2}$ and $\xi_{3}$ will be bounded by $N_{1}$ and $N_{1}\times N_{2}$, i.e., ${\cal O}(1/\epsilon)$ and ${\cal O}(1/\epsilon^{2})$, respectively.

It should be noted that our analysis of DSA focuses on the optimality of the first-stage decisions, and the decisions we generated for the later stages are mainly used for computing the approximate stochastic subgradients for the value functions at each stage. Except for the first stage decision x¯1\bar{x}_{1}, the performance guarantees (e.g., feasibility and optimality) that we can provide for later stages are dependent on the sequences of random variables (or scenarios) we generated.

7.4 DSA for general multistage stochastic optimization

In this section, we consider a multistage stochastic optimization problem given by

minf1(x1,ξ1)+𝒬2(x1,ξ1) s.t.A1x1=b1,x1𝒳1,\begin{array}[]{ll}\min\ f_{1}(x_{1},\xi_{1})+{\cal Q}_{2}(x_{1},\xi_{1})\\ \text{ s.t.}\ \ A_{1}x_{1}=b_{1},\\ \quad\quad\quad x_{1}\in{\cal X}_{1},\end{array} (7.18)

where the value functions ${\cal Q}_{t}$, $t=2,\ldots,T$, are recursively defined by

𝒬t(xt1,ξ[t1]):=Ft1(xt1,ξt1)+𝔼|ξ[t1][Qt(xt1,ξ[t])],t=2,,T1,Qt(xt1,ξ[t]):=minft(xt,ξt)+𝒬t+1(xt) s.t.Atxt+Btxt1=bt,xt𝒳t,\begin{array}[]{lll}{\cal Q}_{t}(x_{t-1},\xi_{[t-1]})&:=&F_{t-1}(x_{t-1},\xi_{t-1})+{\mathbb{E}}_{|\xi_{[t-1]}}[Q_{t}(x_{t-1},\xi_{[t]})],\ \ t=2,\ldots,T-1,\\ Q_{t}(x_{t-1},\xi_{[t]})&:=&\min\ f_{t}(x_{t},\xi_{t})+{\cal Q}_{t+1}(x_{t})\\ &&\ \text{ s.t.}\ \ A_{t}x_{t}+B_{t}x_{t-1}=b_{t},\\ &&\quad\quad\quad x_{t}\in{\cal X}_{t},\end{array} (7.19)

and

\begin{array}[]{lll}{\cal Q}_{T}(x_{T-1},\xi_{[T-1]})&:=&{\mathbb{E}}_{\xi_{T}|\xi_{[T-1]}}[Q_{T}(x_{T-1},\xi_{[T]})],\\ Q_{T}(x_{T-1},\xi_{[T]})&:=&\min\ f_{T}(x_{T},\xi_{T})\\ &&\text{ s.t.}\ \ A_{T}x_{T}+B_{T}x_{T-1}=b_{T},\\ &&\quad\quad\quad x_{T}\in{\cal X}_{T}.\end{array} (7.20)

Here $\xi_{t}$ are random variables, $f_{t}(\cdot,\xi_{t})$ are relatively simple functions, and $F_{t}(\cdot,\xi_{t})$ are general (not necessarily simple) Lipschitz continuous convex functions. We also assume that one can compute a subgradient $F_{t}^{\prime}(x_{t},\xi_{t})$ of the function $F_{t}(x_{t},\xi_{t})$ at any point $x_{t}\in{\cal X}_{t}$ for a given $\xi_{t}$.

Problem (7.18) is more general than problem (7.2) (or equivalently problem (7.5)) in the following sense. First, we are dealing with a more complicated multistage stochastic optimization problem where the number of stages TT in (7.18) can be greater than three. Second, the value function 𝒬t(xt1,ξ[t1]){\cal Q}_{t}(x_{t-1},\xi_{[t-1]}) in (7.19) is defined as the summation of Ft1(xt1,ξt1)F_{t-1}(x_{t-1},\xi_{t-1}) and 𝔼|ξ[t1][Qt(xt1,ξ[t])]{\mathbb{E}}_{|\xi_{[t-1]}}[Q_{t}(x_{t-1},\xi_{[t]})], where Ft1F_{t-1} is not necessarily simple. The DSA algorithm can be generalized for solving problem (7.18). More specifically, we will call the SPDT operators to compute a stochastic ϵ\epsilon-subgradient of 𝒬t+1{\cal Q}_{t+1} at xtx_{t}, t=1,,T2t=1,\ldots,T-2, in a recursive manner until we obtain the ϵ\epsilon-subgradient of 𝒬T{\cal Q}_{T} at xT1x_{T-1}.

Similar to the three-stage problem, let $N_{t}$ be the number of iterations for the stage $t$ subproblem. For the last stage $T$, we will set $N_{T}={\cal O}(1/\epsilon)$ and $N_{T}={\cal O}(1/\sqrt{\epsilon})$ for the generally convex and strongly convex cases, respectively. For the middle stages $t=2,\ldots,T-1$, we set $N_{t}={\cal O}(1/\epsilon^{2})$ and $N_{t}={\cal O}(1/\epsilon)$. Different algorithmic settings will be used for the generally convex and strongly convex cases. Moreover, a less aggressive stepsize selection will be used for the inner loops to guarantee the boundedness of the generated dual variables. Under this parameter selection, we can show that the number of outer loops performed by the DSA method to find an approximate $\epsilon$-solution can be bounded by $N_{1}={\cal O}(1/\epsilon^{2})$ and ${\cal O}(1/\epsilon)$, respectively, for the generally convex and strongly convex problems.

In view of these results, the total number of scenarios required to find an $\epsilon$-solution of (7.18) is given by $N_{2}\times N_{3}\times\cdots\times N_{T}$, and hence grows exponentially with respect to $T$, no matter whether the objective functions are strongly convex or not. These sampling complexity bounds match well the lower bounds in [49, 52], implying that multi-stage stochastic optimization problems are essentially intractable for $T\geq 5$ and a moderate target accuracy. Hence, it is reasonable to use the DSA algorithm only for multi-stage stochastic optimization problems with $T=3$ or $4$ and $\epsilon$ relatively large. However, it is interesting to point out that the DSA algorithm only needs to go through the scenario tree once, and hence its memory requirement increases only linearly with respect to $T$.
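
To illustrate, under the parameter choices discussed above for the generally convex case (and ignoring constant factors), the total number of scenarios behaves like

N_{2}\times N_{3}\times\cdots\times N_{T}={\cal O}\big(\epsilon^{-2(T-2)}\cdot\epsilon^{-1}\big)={\cal O}\big(\epsilon^{-(2T-3)}\big),

so that already for $T=5$ one needs of order $10^{7}$ scenarios for $\epsilon=10^{-1}$, and of order $10^{14}$ scenarios for $\epsilon=10^{-2}$.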

7.5 Combined EDDP and DSA for hierarchical problems

In this section, we discuss how EDDP/SDDP type algorithms can be applied together with DSA type algorithms for solving a class of hierarchical stationary stochastic programs (HSSPs). HSSPs can model problems with a hierarchy of decision-making, e.g., how managerial decisions influence day-to-day operations in a factory. More specifically, the upper level decisions are made by solving a stationary stochastic program over an infinite horizon (see Section 5.3.3), where the number of stages T=T=\infty, the cost function ft=ff_{t}=f, the feasible set 𝒳t=𝒳{\cal X}_{t}={\cal X}, and the random variables ξt\xi_{t} are iid having the same distribution as ξ\xi. The lower level decisions are determined by a stochastic two-stage program,

f(x,ξ):=minz1Z1(x,ξ)f1(z1,ξ)+𝔼ζ[minz2Z2(z1,ζ)f2(z2,ζ)],\displaystyle f(x,\xi):=\min_{z_{1}\in Z_{1}(x,\xi)}f_{1}(z_{1},\xi)+\mathbb{E}_{\zeta}\big{[}\min_{z_{2}\in Z_{2}(z_{1},\zeta)}f_{2}(z_{2},\zeta)\big{]}, (7.21)

where ζ\zeta denotes the random variable independent of ξ\xi involved in the second stage problem.

We extend the dual dynamic programming method to a so-called hierarchical dual dynamic programming method by accounting for inexact solutions to the subproblems. To solve the lower-level stochastic multi-stage problem, we approximate it using the SAA approach and solve the resulting problem using a primal-dual stochastic approximation method. This method can be generalized to a DSA-type method when the number of stages is three or more. Our results show that when solving an infinite-horizon hierarchical problem where the top-level decision variable is of modest size (e.g., $n=4$) and the lower-level stochastic multistage program has a modest number of stages (2 or 3), then by integrating dual dynamic programming to handle the top-level decisions and a stochastic approximation type method for the lower-level decisions, we can get a polynomial computational complexity that is independent of the dimension of the lower-level decision variables (see [17] for details).

References

  • [1] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9:203–228, 1999.
  • [2] E. M. L. Beale. On minimizing a convex function subject to linear inequalities. Journal of the Royal Statistical Society, Series B, 17:173–184, 1955.
  • [3] R.E. Bellman. Dynamic Programming. Princeton University Press, 1957.
  • [4] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control, The Discrete Time Case. Academic Press, New York, 1978.
  • [5] J.R. Birge. Decomposition and partitioning methods for multistage stochastic linear programs. Operations Research, 33:989–1007, 1985.
  • [6] J.R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer, New York, 2nd edition, 2011.
  • [7] J. F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.
  • [8] G.B. Dantzig. Linear programming under uncertainty. Management Science, 1:197–206, 1955.
  • [9] L. Ding, S. Ahmed, and A. Shapiro. A python package for multi-stage stochastic programming. Optimization online, 2019.
  • [10] M. Dyer and L. Stougie. Computational complexity of stochastic programming problems. Mathematical Programming, 106:423–432, 2006.
  • [11] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, Berlin, 2nd edition, 2004.
  • [12] C. Füllner and S. Rebennack. Stochastic dual dynamic programming and its variants. Optimization online, 2021.
  • [13] P. Girardeau, V. Leclére, and A. B. Philpott. On the convergence of decomposition methods for multistage stochastic convex programs. Mathematics of Operations Research, 40:130–145, 2016.
  • [14] V. Guigues, A. Shapiro, and Y. Cheng. Risk-averse stochastic optimal control: an efficiently computable statistical upper bound. Optimization online, 2022.
  • [15] V. Guigues, A. Shapiro, and Y. Cheng. Duality and sensitivity analysis of multistage linear stochastic programs. European Journal of Operational Research, 308:752–767, 2023.
  • [16] G.A. Hanasusanto, D. Kuhn, and W. Wiesemann. A comment on computational complexity of stochastic programming problems. Mathematical Programming, 159:557–569, 2015.
  • [17] C. Ju and G. Lan. Dual dynamic programming for stochastic programs over an infinite horizon. arXiv, 2023.
  • [18] J.E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8:703–712, 1960.
  • [19] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
  • [20] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Mathematical Programming, 149(1-2):1–45, 2015.
  • [21] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer Nature, Switzerland AG, 2020.
  • [22] G. Lan. Complexity of stochastic dual dynamic programming. Mathematical Programming, 191:717–754, 2022.
  • [23] G. Lan, A. S. Nemirovski, and A. Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134:425–458, 2012.
  • [24] G. Lan and Z. Zhou. Dynamic stochastic approximation for multi-stage stochastic optimization. Mathematical Programming, 187:487–532, 2021.
  • [25] V. Leclére, P. Carpentier, J-P Chancelier, A. Lenoir, and F. Pacaud. Exact converging bounds for Stochastic Dual Dynamic Programming via Fenchel duality. SIAM J. Optimization, 30:1223–1250, 2020.
  • [26] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69:111–147, 1995.
  • [27] N. Löhndorf and A. Shapiro. Modeling time-dependent randomness in stochastic dual dynamic programming. European Journal of Operational Research, 273:650–661, 2019.
  • [28] A. S. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
  • [29] A. S. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, XV, 1983.
  • [30] Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.
  • [31] M.V.F. Pereira and L.M.V.G. Pinto. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1-3):359–375, 1991.
  • [32] G. Pflug. Some remarks on the value-at-risk and the conditional value-at-risk. In Probabilistic Constrained Optimization: Methodology and Applications, S. Uryasev (Ed.). Kluwer Academic Publishers, Norwell, MA, 2000.
  • [33] A.B. Philpott, V.L. de Matos, and E. Finardi. On solving multistage stochastic programs with coherent risk measures. Operations Research, 61(4):957–970, 2013.
  • [34] A. Pichler and A. Shapiro. Mathematical foundations of distributionally robust multistage optimization. SIAM J. Optimization, 31:3044–3067, 2021.
  • [35] B.T. Polyak. New stochastic approximation type procedures. Automat. i Telemekh., 7:98–107, 1990.
  • [36] B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30:838–855, 1992.
  • [37] W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons, New York, 2nd edition, 2011.
  • [38] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
  • [39] R. T Rockafellar. Conjugate Duality and Optimization. Society for Industrial and Applied Mathematics, Philadelphia, 1974.
  • [40] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
  • [41] R.T Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. J. of Banking and Finance, 26(7):1443–1471, 2002.
  • [42] A. Ruszczyński and A. Shapiro. Conditional risk mappings. Mathematics of Operations Research, 31:544–561, 2006.
  • [43] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31:433–452, 2006.
  • [44] A. Shapiro. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research, 209:63–72, 2011.
  • [45] A. Shapiro and Y. Cheng. Central limit theorem and sample complexity of stationary stochastic programs. Operations Research Letters, 49:676–681, 2021.
  • [46] A. Shapiro and Y. Cheng. Dual bounds for periodical stochastic programs. Operations Research, 2021.
  • [47] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, third edition, 2021.
  • [48] A. Shapiro and L. Ding. Periodical multistage stochastic programs. SIAM J. Optimization, 30:2083–2102, 2020.
  • [49] A. Shapiro and A. Nemirovski. On complexity of stochastic programming problems. E-print available at: http://www.optimization-online.org, 2004.
  • [50] A. Shapiro and A. Nemirovski. On complexity of stochastic programming problems. In V. Jeyakumar and A.M. Rubinov, editors, Continuous Optimization: Current Trends and Applications: Current Trends and Applications, pages 111–144. Springer, 2005.
  • [51] A. Shapiro, W. Tekaya, J.P. da Costa, and M. Pereira Soares. Risk neutral and risk averse stochastic dual dynamic programming method. European Journal of Operational Research, 224:375–391, 2013.
  • [52] A. Shapiro. On complexity of multistage stochastic programs. Operations Research Letters, 34:1–8, 2006.
  • [53] S. Zhang and X. Sun. Stochastic dual dynamic programming for multistage stochastic mixed-integer nonlinear optimization. Mathematical Programming, 196:935–985, 2022.
  • [54] P.H. Zipkin. Foundations of Inventory Management. McGraw-Hill, 2000.