Numerical Methods for Convex Multistage Stochastic Optimization
Optimization problems involving sequential decisions in a stochastic environment have been studied in Stochastic Programming (SP), Stochastic Optimal Control (SOC) and Markov Decision Processes (MDP). In this paper we mainly concentrate on the SP and SOC modelling approaches. In these frameworks there are natural situations in which the considered problems are convex. The classical approach to sequential optimization is based on dynamic programming. It suffers from the so-called “Curse of Dimensionality”, in that its computational complexity increases exponentially with the dimension of the state variables. Recent progress in solving convex multistage stochastic problems is based on cutting planes approximations of the cost-to-go (value) functions of the dynamic programming equations. Cutting planes type algorithms in dynamical settings are one of the main topics of this paper. We also discuss Stochastic Approximation type methods applied to multistage stochastic optimization problems. From the computational complexity point of view, these two types of methods appear to be complementary to each other. Cutting plane type methods can handle multistage problems with a large number of stages, but a relatively small number of state (decision) variables. On the other hand, stochastic approximation type methods can only deal with a small number of stages, but a large number of decision variables.
Key words: Stochastic Programming, Stochastic Optimal Control,
Markov Decision Process, dynamic programming, risk measures, Stochastic Dual Dynamic Programming, Stochastic Approximation method, cutting planes algorithm.
AMS subject classifications: 65K05, 90C15, 90C39, 90C40.
1 Introduction
Traditionally, different communities of researchers have dealt with optimization problems involving uncertainty, modelled in stochastic terms, using different terminology and modeling frameworks. In this respect we can point to the fields of Stochastic Programming (SP), Stochastic Optimal Control (SOC) and Markov Decision Processes (MDP). Historically, the developments in SP on one hand, and in SOC and MDP on the other, went along different directions, with different modeling frameworks and solution methods. In this paper we mainly concentrate on the SP and SOC approaches. In these frameworks there are natural situations in which the considered problems are convex. In continuous optimization, convexity is the main consideration allowing the development of efficient numerical algorithms [30].
The main goal of this paper is to present some recent developments in numerical approaches to solving convex optimization problems involving sequential decision making. We do not try to give a comprehensive review of the subject with a complete list of references, etc. Rather the aim is to present a certain point of view about some recent developments in solving convex multistage stochastic problems.
The SOC modeling is convenient for demonstrating some basic ideas since, on one hand, it can be naturally written in MDP terms, and on the other hand it can be formulated in the Stochastic Programming framework. Stochastic Programming (SP) has a long history. Two stage stochastic programming (with recourse) was introduced in Dantzig [8] and Beale [2], and was intrinsically connected with linear programming. From the beginning SP was aimed at numerical solutions. Until about twenty years ago, the modeling approach to two and multistage SP was predominantly based on construction of scenarios represented by scenario trees. This approach allows one to formulate the so-called deterministic equivalent optimization problem, with the number of decision variables more or less proportional to the number of scenarios. When the deterministic equivalent could be represented as a linear program, such problems were considered to be numerically solvable. Because of that, the topic of SP was often viewed as large scale linear programming. For a presentation of these developments we can refer to Birge [6] and references therein.
From the point of view of the scenario construction approach there is not much difference between two stage and multistage SP. In both cases the numerical effort in solving the deterministic equivalent is more or less proportional to the number of generated scenarios. This view of SP started to change with the development of randomization methods and the sample complexity theory. From the point of view of solving the deterministic equivalent, even two stage linear stochastic programs are computationally intractable; their computational complexity is #P-hard for a sufficiently high accuracy (cf., [10],[16]). On the other hand, under reasonable assumptions, the number of randomly generated scenarios (by Monte Carlo sampling techniques) required to solve two stage SP problems with accuracy and high probability is of order , [47, Section 5.3]. While randomization methods were reasonably successful in solving two stage problems, as far as multistage SP is concerned the situation is different. The number of scenarios needed to solve multistage SP problems grows exponentially with the number of stages, [50],[47, Section 5.8.2].
The classical approach to sequential optimization is based on dynamic programming, [3]. Dynamic programming also has a long history and is at the heart of the SOC and MDP modeling. It suffers from the so-called “Curse of Dimensionality”, a term coined by Bellman: its computational complexity increases exponentially with the dimension of the state variables. There is a large literature trying to deal with this problem by using various approximations of the dynamic programming equations (e.g., [37]). Most of these methods are heuristics and do not give verifiable guarantees for the accuracy of the obtained solutions.
Recent progress in solving convex multistage SP problems is based on cutting planes approximations of the cost-to-go (value) functions of dynamic programming equations.
These methods allow one to estimate the error of the computed solution.
Cutting planes type algorithms in dynamical settings are one of the main topics of this paper. We discuss such algorithms in the frameworks of SP and SOC.
Moreover, we will also present extensions of stochastic approximation (a.k.a. stochastic gradient descent) type methods [38, 29, 28, 21]
for multistage stochastic optimization, referred to as dynamic stochastic approximation in [24].
From the computational complexity point of view, these two types of methods seem to be complementary to each other in the following sense.
Cutting plane type methods can handle multistage problems with a large number of stages, but a relatively
smaller number of state (decision) variables. On the other hand, stochastic approximation type methods can only deal with a small number of stages, but
a large number of decision variables. These methods share the following common features:
(a) both methods utilize the convex structure of the cost-to-go (value) functions of dynamic programming equations,
(b) both methods do not require explicit discretization of state space,
(c) both methods guarantee convergence to global optimality,
(d) rates of convergence for both methods have been established.
We use the following notation and terminology throughout the paper. For we denote . Unless stated otherwise denotes Euclidean norm in . By we denote the distance from a point to a set . We write or for the scalar product of vectors . It is said that a set is polyhedral if it can be represented by a finite number of affine constraints; it is said that a function is polyhedral if it can be represented as the maximum of a finite number of affine functions. For a process , we denote by its history up to time . By we denote the conditional expectation, conditional on random variable (random vector) . We use the same notation viewed as a random vector or as a vector variable; the particular meaning will be clear from the context. For a probability space , by , , we denote the space of random variables having finite -th order moment, i.e., such that . Equipped with norm , becomes a Banach space. The dual of is the space with such that .
2 Stochastic Programming
In Stochastic Programming (SP) the -stage optimization problem can be written as
(2.1)
(2.2)
Here are objective functions, are random vectors, are nonempty closed sets, , and are matrix and vector functions of , , with .
Constraints (2.2) often represent some type of balance equations between successive stages, while the set can be viewed as representing local constraints at time . It is possible to consider balance equations in a more general form, nevertheless formulation (2.2) will be sufficient for our purpose of discussing convex problems. In particular, if the objective functions are linear, with , and the sets are polyhedral, then problem (2.1) - (2.2) becomes a linear multistage stochastic program.
The sequence of random vectors , is viewed as a data process. Unless stated otherwise we make the following assumption throughout the paper.
Assumption 2.1
Probability distribution of the random process , , does not depend on our decisions, i.e., is independent of the chosen decision policy.
At every time period we have information about the (observed) realization of the data process. Naturally, our decisions should be based on the information available at the time of the decision and should not depend on the future unknown values ; this is the so-called principle of nonanticipativity. That is, the decision variables are functions of the data process, and a sequence of such functions, one for every stage , is called a policy or a decision rule. The optimization in (2.1) - (2.2) is performed over policies satisfying the feasibility constraints. The feasibility constraints of problem (2.1) - (2.2) should be satisfied with probability one, and the expectation is taken with respect to the probability distribution of the random vector . It could be noted that the dependence of the decision variables , as well as of the parameters , , , on the data process is often suppressed in the notation. It is worthwhile to emphasize that, because of Assumption 2.1, it suffices to consider policies depending only on the history of the data process and not on the history of the decisions.
The first stage decision is of the main interest and assumed to be deterministic, i.e., made before observing future realization of the data process. In that framework vector and the corresponding parameters are deterministic, i.e., their values are known at the time of first stage decision. Also the first stage objective function is a function of alone.
We can write the following dynamic programming equations for problem (2.1) - (2.2) (e.g., [47, Section 3.1]). At the last stage the cost-to-go (value) function is
(2.3)
and then going backward in time for , the cost-to-go function is
(2.4)
The optimal value of problem (2.1) - (2.2) is given by the optimal value of the first stage problem
(2.5)
The optimization in (2.3) - (2.4) is over (nonrandom) variables (recall that , and are functions of ). The corresponding optimal policy is defined by , , where
(2.6)
with the expectation term of at the last stage omitted. Of main interest is the first stage solution given by the optimal solution of the first stage problem (2.5).
Since is a function of , the cost-to-go function can be considered as a function of . It represents the optimal value of problem
(2.7)
conditional on realization of the data process. This is the dynamic programming (Bellman) principle of optimality. This is also the motivation for the name “cost-to-go” function. Another name for the cost-to-go functions is “value functions”.
Remark 2.1
Let us recall the following simple result. Consider an extended real valued function . We have that if is a convex function of , then the min-value function is also convex. This can be applied to establish convexity of the cost-to-go functions. That is, suppose that the function is convex in and the set is convex. Then the extended real valued function taking value if and , and value otherwise, is convex. Since the cost-to-go function can be viewed as the min-function of this extended real valued function, it follows that it is convex in . Also note that if the cost-to-go function is convex in , then its expected value in (2.4) is also convex. Consequently by induction going backward in time it is not difficult to show the following convexity property of the cost-to-go functions.
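To make the argument explicit, in generic notation (the notation here is ours): if an extended real valued function F(x,y) is convex jointly in (x,y), then the min-value function g(x) := inf_y F(x,y) is convex, since for any points x_1, x_2, any y_1, y_2 and any lambda in [0,1],
\[
g\bigl(\lambda x_1 + (1-\lambda) x_2\bigr)
\;\le\; F\bigl(\lambda x_1 + (1-\lambda) x_2,\; \lambda y_1 + (1-\lambda) y_2\bigr)
\;\le\; \lambda F(x_1,y_1) + (1-\lambda) F(x_2,y_2),
\]
and taking the infimum over y_1 and y_2 on the right hand side gives g(lambda x_1 + (1-lambda) x_2) <= lambda g(x_1) + (1-lambda) g(x_2).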
Proposition 2.1
If the objective functions are convex in and the sets are convex, , then the cost-to-go functions , , are convex in and the first stage problem (2.5) is convex. In particular such convexity follows for linear multistage stochastic programs.
Such convexity property is crucial for development of efficient numerical algorithms. It is worthwhile to note that Assumption 2.1 is essential here. Dependence of the distribution of the random data on decisions typically destroys convexity even in the linear case.
The dynamic programming equations reduce the problem from optimization over policies, i.e., over infinite dimensional functional spaces, to evaluation of the cost-to-go functions of finite dimensional vectors. However, dependence of the cost-to-go functions on the whole history of the data process makes it numerically unmanageable with increase of the number of stages. The situation is simplified dramatically if we make the following assumption of stagewise independence.
Assumption 2.2 (Stagewise independence)
Probability distribution of is independent of for .
Under Assumption 2.2 of stagewise independence, the conditional expectations in dynamic equations (2.4) become the corresponding unconditional expectations. It is straightforward to show then by induction, going backward in time, that the dynamic equations (2.4) can be written as
(2.8)
where
(2.9)
with the expectation taken with respect to the marginal distribution of . In that setting one needs to keep track of the (expectation) cost-to-go functions only. The optimal policy is defined by the equations
(2.10)
Note that here the optimal policy values can be computed iteratively, starting from optimal solution of the first stage problem, and then by sequentially computing a minimizer in the right hand side of (2.10) going forward in time for . Of course, this procedure requires availability of the cost-to-go functions .
It is worthwhile to note that here the optimal policy decision can be viewed as a function of and , for . Of course, is a function of and , and so on. So eventually can be represented as a function of the history of the data process. Assumption 2.1 was crucial for this conclusion, since then we need to consider only the history of the data process rather than also the history of our decisions. This is an important point; we will return to it later.
Remark 2.2
In SP setting the process is viewed as a data process, and the assumption of stagewise independence could be unrealistic. In order to model stagewise dependence we assume the following Markovian property of the data process: the conditional distribution of given is the same as the conditional distribution of given . In such Markovian setting the dynamic equations (2.4) become
(2.11)
That is, compared with the stagewise independent case, in the Markovian setting the expected cost-to-go function
(2.12)
also depends on .
3 Stochastic Optimal Control
Consider the classical Stochastic Optimal Control (SOC) (discrete time, finite horizon) model (e.g., Bertsekas and Shreve [4]):
(3.1)
(3.2)
Variables , , represent the state of the system, , , are controls, , , are random vectors, , , are cost functions, is a final cost function, are (measurable) mappings and is a (measurable) multifunction mapping to a subset of . Values and are deterministic (initial conditions); it is also possible to view as random with a given distribution, but this is not essential for the following discussion. The optimization in (3.1) - (3.2) is performed over policies determined by decisions and state variables . This is emphasized in the notation ; we are going to discuss this below.
Unless stated otherwise we assume that probability distribution of the random process does not depend on our decisions (Assumption 2.1). We can consider problem (3.1)-(3.2) in the framework of SP if we view , , as decision variables. Suppose that the mapping is affine, i.e.,
(3.3)
where , and are matrix and vector valued functions of . The constraints can be viewed as local constraints in the SP framework; in particular if are independent of , then can be viewed as a counterpart of the set in problem (2.1) - (2.2). For the affine mapping of the form (3.3), the state equations (3.2) become
(3.4)
These state equations are linear in and can be viewed as a particular case of the balance equations (2.2) of the stochastic program (2.1) - (2.2). Note, however, that here decision should be made before realization of becomes known, and and are functions of rather than . We emphasize that and can be considered as functions of the history of the random process , without dependence on the history of the decisions. As in the SP framework, this follows from the dynamic programming equations discussed below, and Assumption 2.1 is essential for that conclusion.
There is a time shift in equations (3.4) as compared with the SP equations (2.2). We emphasize that in both the SOC and SP approaches, the decisions are made based on information available at time of the decision; this is the principle of nonanticipativity. Because of the time shift, the dynamic programming equations for the SOC problem are slightly different from the respective equations in SP. More important is that in the SOC framework there is a clear separation between the state and control variables. This will have an important consequence for the respective dynamic programming equations which we consider next.
Similar to SP, the dynamic programming equations for the SOC problem (3.1) - (3.2) can be written as follows. At the last stage, the value function and, going backward in time for , the value functions
(3.5)
The optimization in the right hand side of (3.5) is performed jointly in and . The optimal value of the SOC problem (3.1)-(3.2) is given by the first stage value function , and can be viewed as a function of the initial conditions . Note that because of the state equation (3.2), equations (3.5) can be written in the following equivalent form
(3.6)
The corresponding optimal policy is defined by
(3.7)
Similar to SP, it follows that can be considered as a function of , and as a function of . That is, is a function of , as was already mentioned in the previous paragraphs. Again, Assumption 2.1 is essential for the above derivations and conclusions.
Remark 3.1
Let us discuss convexity of the value functions. Consider graph of the multifunction :
(3.8)
Note that minimization of a function can equivalently be written as , where function coincides with when , and is otherwise. Of course, the constraint is the same as . If function is convex and the set is convex, then the corresponding function is convex. Consequently in that case the min-function is a convex function of . By applying this observation and using induction going backward in time, it is not difficult to show that value functions are convex in , , if the following assumption holds.
Assumption 3.1
The function is convex, and for every the cost function is convex in , the mapping is affine of the form (3.3), the graph is a convex subset of , .
Note that, in particular, is convex if is independent of and the set is convex. When is dependent on , we need some constructive way to represent it in order to ensure convexity of . For example suppose that the feasibility sets are defined by inequality constraints as:
(3.9)
where . Then . Consequently if every component of mapping is a convex function, then is convex.
In the SOC framework the random process is often viewed as noise or disturbances (note that here is also random). In such cases the assumption of stagewise independence (Assumption 2.2) is natural. In the stagewise independent case the dynamic programming equations (3.6) simplify, with the value functions , , being functions of the state variables only, that is
(3.10)
where the expectation is taken with respect to the marginal distribution of . Consequently the optimal policy is defined by the optimal controls , where
(3.11)
That is, in the stagewise independent case it suffices to consider policies with controls of the form .
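As a small illustration of such policies, the following Python sketch simulates the state trajectory and the accumulated cost of a policy whose controls are functions of the current state only, under stagewise independent noise; the function names and arguments are generic placeholders of our choosing, not objects defined in the model above.

    import numpy as np

    def simulate_policy(x0, T, policy, dynamics, stage_cost, final_cost, sample_noise, seed=0):
        # Forward simulation of an SOC policy u_t = policy(t, x_t) under
        # stagewise independent noise.  The callables dynamics, stage_cost,
        # final_cost and sample_noise are placeholders for the model data.
        rng = np.random.default_rng(seed)
        x, total = np.asarray(x0, dtype=float), 0.0
        for t in range(T):
            u = policy(t, x)              # the control uses the current state only
            xi = sample_noise(t, rng)     # xi_t, sampled independently across stages
            total += stage_cost(t, x, u, xi)
            x = dynamics(t, x, u, xi)     # e.g. affine dynamics as in (3.4)
        return total + final_cost(x)

Averaging the returned totals over many simulation runs gives a Monte Carlo estimate of the expected cost of the policy.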
Remark 3.2
Suppose that the distribution of can depend on , i.e., the probability distribution of is conditional on . That is, the distribution of the random process is affected by our decisions, and consequently Assumption 2.1 is not satisfied. Under the stagewise independence Assumption 2.2, the dynamic programming equations (3.10) still hold and the optimal policy is determined by (3.11). However, in that case depends on the history of the decision process. This history depends on the decisions (on the considered policy) and cannot be represented in terms of the process alone. It could also be noted that in the stagewise independence case, SOC can be formulated as a Markov Decision Process (MDP) with transition probabilities determined by the state equations (3.2). We will not pursue the MDP formulation in the following discussion, and will rather deal with the SOC model instead.
As an example let us consider the classical inventory problem.
Example 3.1 (Inventory problem)
Consider the following problem (cf. [54]):
(3.12)
(3.13)
where are the ordering cost, backorder penalty cost and holding cost per unit, respectively, is the current inventory level, is the order quantity, is the demand at time , and
In that format are state variables, are controls, are random disturbances, and is the feasibility set (independent of state variables). At time the inventory level and demand value are known (initial conditions). Decision about at stage should be made before a realization of the demand becomes known. That is, at time history of the demand process is available, but future realizations are not known.
By making change of variables , problem (3.12) -(3.13) is often written in the following equivalent form
(3.14)
(3.15)
Problem (3.14) -(3.15) can be viewed as an SP problem in decision variables and .
It is convenient to write dynamic programming equations for the inventory problem in the form (3.14)-(3.15); of course, this can be reformulated for the form (3.12)-(3.13) as well. As before, we assume here that the distribution of the demand process does not depend on our decisions. At stage the value functions satisfy the following equation
(3.16)
with . The objective functions are convex in and hence value functions are convex in (see Remark 3.1).
The optimal policy of how much to order at stage is given by , where
(3.17)
That is, starting with the initial value , value is computed as an optimal solution of the minimization problem in the right hand side of (3.17) for (at the first stage, since is deterministic, the conditional expectation in (3.17) is just the unconditional expectation with respect to ). Consequently the next inventory level is computed given an observed realization of . Then is computed as an optimal solution of the minimization problem in the right hand side of (3.17) for , the inventory level is updated as , and so forth going forward in time. This defines an optimal policy with the inventory level and order quantity at stage being functions of the observed realization of the history of the demand process.
If the demand process is stagewise independent, then value functions do not depend on the history of the demand process, and equations (3.16) become
(3.18)
with the expectations taken with respect to the marginal distributions of the demand process. By using convexity of the objective function it is possible to show that in the stagewise independence case the so-called basestock policy is optimal. That is, optimal control (optimal order quantity) is a function of state , and is given by
(3.19)
where is the (unconstrained) minimizer
(3.20)
That is, if at stage the current inventory level is greater than or equal to , then order nothing; otherwise order . Of course, computation of the (critical) values is based on availability of value functions .
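For illustration, the following Python sketch computes approximate basestock levels by backward induction on a grid, using sampled demands to approximate the expectations in the stagewise independent dynamic equations; the cost parameters, demand distribution and grid below are illustrative assumptions, not data taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)
    T, c, b, h = 4, 1.0, 3.0, 0.5                    # ordering, backorder, holding costs
    demand = [rng.uniform(0.0, 10.0, 500) for _ in range(T)]   # sampled demands D_t
    grid = np.linspace(-30.0, 30.0, 601)             # grid of inventory / order-up-to levels

    V_next = np.zeros_like(grid)                     # terminal value function is zero
    basestock = [None] * T
    for t in reversed(range(T)):
        D = demand[t]
        # G_t(y) = c*y + E[ b*(D_t - y)_+ + h*(y - D_t)_+ + V_{t+1}(y - D_t) ]
        G = np.empty_like(grid)
        for i, y in enumerate(grid):
            shortage = np.maximum(D - y, 0.0)
            excess = np.maximum(y - D, 0.0)
            cont = np.interp(y - D, grid, V_next)    # V_{t+1} interpolated on the grid
            G[i] = c * y + np.mean(b * shortage + h * excess + cont)
        j_star = int(np.argmin(G))
        basestock[t] = grid[j_star]                  # unconstrained minimizer y_t^*
        # V_t(x) = -c*x + min_{y >= x} G_t(y): order up to y_t^* whenever x is below it
        V_next = np.array([-c * x + G[max(i, j_star)] for i, x in enumerate(grid)])

    print("approximate basestock levels:", np.round(basestock, 2))

The resulting policy orders up to the computed level whenever the current inventory is below it, and orders nothing otherwise, in line with the basestock structure described above.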
4 Risk Averse and Distributionally Robust Optimization
In the risk averse approach the expectation functional is replaced by a risk measure. Let be a probability space and be a linear space of measurable functions (random variables) . Risk measure is a functional assigning a number to random variable , that is . In the risk neutral case, is the expectation operator, i.e., . In order for this expectation to be well defined we have to restrict the space of considered random variables; for the expectation it is natural to take the space of integrable functions, i.e., .
It was suggested in Artzner et al. [1] that reasonable risk measures should satisfy the following axioms: subadditivity, monotonicity, translation equivariance and positive homogeneity. Risk measures satisfying these axioms were called coherent. For a thorough discussion of coherent risk measures we can refer to [11],[47]. It is possible to show that if the space is a Banach lattice, in particular if , then a (real valued) coherent risk measure is continuous in the norm topology of (cf., [43, Proposition 3.1]). It then follows by the Fenchel-Moreau theorem that has the following dual representation
(4.1)
where is a convex weakly∗ compact subset of . Moreover, the axioms of coherent risk measures imply that every is a density function, i.e., almost surely and (cf., [43, Theorem 2.2]).
An important example of coherent risk measure is the Average Value-at-Risk:
(4.2)
(4.3)
where is the cumulative distribution function (cdf) of and is the respective quantile. It is natural in this example to use the space and its dual . In various equivalent forms this risk measure was introduced in different contexts by different authors under different names, such as Expected Shortfall, Expected Tail Loss, Conditional Value-at-Risk; variational form (4.3) was suggested in [32],[41]. In the dual form, it has representation (4.1) with
Representation (4.1) can be viewed from the distributionally robust point of view. Recall that if is a probability measure on which is absolutely continuous with respect to , then by the Radon-Nikodym Theorem it has density . Consider the set of probability measures , absolutely continuous with respect to , defined as . Then representation (4.1) of the functional can be written as follows (by writing we emphasize that the expectation is taken with respect to the probability measure (probability distribution) ):
(4.4)
In Distributionally Robust Optimization (DRO) it is argued that the “true” distribution is not known, and consequently a set of probability measures (distributions) is constructed in some way and referred to as the ambiguity set. In that framework we can view (4.4) as the definition of the corresponding functional. It is straightforward to verify that the obtained functional , defined on an appropriate space of random variables, satisfies the axioms of coherent risk measures. So we have a certain duality relation between the risk averse and distributionally robust approaches to stochastic optimization. However, there is an essential difference in the way the corresponding ambiguity set is constructed. In the risk averse approach the set consists of probability measures absolutely continuous with respect to a specified (reference) measure . On the other hand, in DRO such a dominating measure may not be naturally defined. This happens, for instance, in popular settings where the ambiguity set is defined by moment constraints or is the set of probability measures within a prescribed Wasserstein distance from the empirical distribution defined by the data.
In order to extend the risk averse and distributionally robust optimization approaches to the dynamical setting of multistage stochastic optimization, we need to construct a conditional counterpart of the respective functional . In writing the dynamic equations (2.4) and (3.6) we used conditional expectations in a somewhat informal manner. For the expectation operator there is a classical definition of conditional expectation. That is, let be a probability measure and be a sigma subalgebra of . It is said that a random variable, denoted , is the conditional expected value of a random variable given if the following two properties hold: (i) is -measurable, (ii) for any . There are many versions of which can differ from each other on sets of -measure zero. The conditional expectation depends on the measure ; therefore we sometimes write to emphasize this when various probability measures are considered.
It may seem natural to define the conditional counterpart of the distributionally robust functional, defined in (4.4), as . However, there are technical issues with the precise meaning of such a definition, since there are different versions of conditional expectations related to different probability measures . Even more importantly, there are conceptual problems with such a definition, and in fact this is not how conditional risk measures are defined (cf., [34]). In the risk averse setting there is a natural concept of law invariance. For a random variable its cumulative distribution function (cdf), with respect to the reference probability measure , is , . It is said that random variables and are distributionally equivalent if their cumulative distribution functions and coincide for all . It is said that a risk measure is law invariant if for any distributionally equivalent and .
Note that law invariance is defined with respect to the reference measure . Law invariant coherent risk measure can be considered as a function of the cumulative distribution function , and consequently its conditional counterpart can be defined as the respective function of the conditional counterpart of ; this definition can be made precise. On the other hand, in the DRO setting where there is no dominating probability measure and the concept of law invariance is not applicable, it is not completely clear how to define a conditional counterpart of functional defined in (4.4), although an attempt was made in [34]. See Remark 5.6 below for a further discussion of law invariance.
In the risk neutral setting we were particularly interested in the stagewise independent case (Assumption 2.2). Under that assumption the dynamic programming equations simplify to (2.8) - (2.9) in the SP and to (3.10) in the SOC frameworks, respectively. Natural counterparts of these equations are obtained by replacing the expectation operator with a coherent risk measure. That is, in the risk averse SP framework the counterpart of (2.9) is
(4.5)
where is a coherent risk measure applied to the random variable . For example,
(4.6)
is a law invariant coherent risk measure representing a convex combination of the expectation and the Average Value-at-Risk. Such a choice of risk measure tries to reach a compromise between minimizing the cost on average and controlling the risk of upper deviations from the average (upper quantiles of the cost) at every stage of the process. In applications the parameters and are often constants, independent of the stage , i.e., the risk measure is the same for every stage (cf., [51]).
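As a numerical illustration, the following Python sketch evaluates such a risk measure on a sample of cost realizations via a variational form of the Average Value-at-Risk; the sign and level conventions follow one common choice (conventions vary across the literature), and the parameter values are illustrative assumptions.

    import numpy as np

    def avar(z, alpha):
        # Average Value-at-Risk of the losses z at level alpha, computed from the
        # variational form  AV@R_alpha(Z) = min_u { u + E[Z - u]_+ / alpha },
        # whose minimizer can be taken as the (1 - alpha)-quantile of Z.
        z = np.asarray(z, dtype=float)
        u = np.quantile(z, 1.0 - alpha)
        return u + np.mean(np.maximum(z - u, 0.0)) / alpha

    def mean_avar(z, lam=0.2, alpha=0.05):
        # Convex combination (1 - lam) * E[Z] + lam * AV@R_alpha(Z), cf. (4.6).
        return (1.0 - lam) * np.mean(z) + lam * avar(z, alpha)

    costs = np.random.default_rng(0).normal(loc=100.0, scale=10.0, size=10_000)
    print(mean_avar(costs))      # larger than the plain mean: the upper tail is penalized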
Remark 4.1
The dynamic programming equations, with (4.5) replacing (2.9), correspond to the nested formulation of the risk averse multistage SP problem (cf., [42],[47, Section 6.5]). In the nested risk averse multistage optimization the risk is controlled at every stage of the process and the total nested value of the cost does not have practical interpretation. Nevertheless, when solving the corresponding risk averse optimization problem it would be useful to estimate its optimal value for the purpose of evaluating error of the computed solution.
In the DRO setting the counterpart of equations (2.9) can be written as
(4.7)
where is a set of probability distributions of vector . Note that by the convexity and monotonicity properties of coherent risk measures, convexity of the cost-to-go functions is preserved here. That is, convexity of implies convexity of in the left hand side of equations (4.5) and (4.7). This in turn implies convexity of in the left hand side of equation (2.8), provided the function is convex and the set is convex. Therefore, similar to the risk neutral case (compare with Proposition 2.1), we have the following convexity property.
Proposition 4.1
If the objective functions are convex in and the sets are convex for , then the cost-to-go functions and are convex in . In particular such convexity of the cost-to-go functions follows for risk averse linear multistage stochastic programs.
Similar considerations apply to the SOC framework, with the risk averse counterpart of equations (3.10) written as
(4.8)
where , , are coherent risk measures. Again these dynamic equations correspond to the nested formulation of the respective SOC problem. Similar to the risk neutral case, the value functions are convex if Assumption 3.1 holds.
5 Dynamic Cutting Planes Algorithms
The basic idea of cutting planes algorithms for solving convex problems is approximation of the (convex) objective function by cutting planes. That is, let be an extended real valued function. Its domain is . We say that an affine function is a cutting plane of if for all . If moreover for some , then is said to be a supporting plane of , at the point . When the function is convex, its supporting plane at a point is given by , where is a subgradient of at . The set of all subgradients, at a point , is called the subdifferential of at and denoted . The subdifferential of a convex function at every interior point of its domain is nonempty, [40, Theorem 23.4].
For multistage linear SP problems a cutting planes type algorithm, called Stochastic Dual Dynamic Programming (SDDP), was introduced in Pereira and Pinto [31], building on the nested decomposition algorithm of Birge [5]. The SDDP algorithm goes backward and forward in time in order to build piecewise linear approximations of the cost-to-go functions by their cutting planes.
Remark 5.1
Let us recall how to compute a subgradient of the optimal value function of a linear program. That is, let be the optimal value of linear program
(5.1)
considered as a function of vector (if for some problem (5.1) does not have a feasible solution, then ). The dual of (5.1) is the linear program
(5.2)
Function is piecewise linear convex, and its subgradient at a point where is finite is given by , with being an optimal solution of the dual problem (5.2) (e.g., [47, Section 2.1.1]). That is, a subgradient of the optimal value function of a linear program is computed by solving the dual of that linear program. A more general setting amenable to a linear programming formulation is when the objective function of (5.1) is polyhedral. In such cases a subgradient of the respective optimal value function can be computed by solving the dual of the obtained linear program. This principle is applied in the SDDP algorithm to the approximation of the cost-to-go functions by their cutting planes in a dynamical setting. This motivates the “Dual Dynamic” part in the name of the algorithm.
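The following Python sketch illustrates Remark 5.1 on a small linear program, using SciPy's linear programming solver; the data are illustrative, and the dual multipliers returned by the solver are used as a subgradient of the optimal value function with respect to the right hand side (this relies on SciPy's documented convention that the marginals are the sensitivities of the objective with respect to the right hand side).

    import numpy as np
    from scipy.optimize import linprog

    # Q(b) = min { c'y : A y = b, y >= 0 }, considered as a function of b.
    c = np.array([1.0, 2.0, 0.0])
    A = np.array([[1.0, 1.0, 1.0],
                  [1.0, -1.0, 0.0]])
    b = np.array([4.0, 1.0])

    res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")
    value = res.fun                   # Q(b)
    subgrad = res.eqlin.marginals     # optimal dual solution, a subgradient of Q at b
    # cutting plane of Q at the point b:  Q(b') >= value + subgrad @ (b' - b)
    print("Q(b) =", value, " subgradient =", subgrad)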
5.1 SDDP algorithm for SP problems
Consider SP problem (2.1) - (2.2). Suppose that Assumptions 2.1 and 2.2 hold, and hence the corresponding cost-to-go functions are defined by equations (2.8) - (2.9). If probability distributions of random vectors are continuous, these distributions should be discretized in order to compute the respective expectations (see Remark 5.2 below). So we also make the following assumption.
Assumption 5.1
The probability distribution of has finite support with respective probabilities , (for the sake of simplicity we assume that the number of realizations of is the same for every stage ; recall that in the SP framework, is deterministic). Denote , , , and the cost-to-go functions defined in (2.8).
The corresponding expected value cost-to-go functions can be written as
(5.1)
Suppose further that the objective functions are linear in , and the sets , i.e., the constraint can be written as , . In that case (2.1) - (2.2) becomes the standard linear multistage SP problem:
(5.2)
It could be noted that if the sets are polyhedral and the functions are polyhedral in , then the problem is linear and can be formulated in the standard form by the well known techniques of linear programming.
Remark 5.2
From the modeling point of view, random vectors are often assumed to have continuous distributions. In such cases these continuous distributions have to be discretized in order to perform numerical calculations. One possible approach to such discretization is to use Monte Carlo sampling techniques. That is, a random sample of size is generated from the probability distribution of random vector , . This approach is often referred to as the Sample Average Approximation (SAA) method. The constructed discretized problem can be considered in the above framework with equal probabilities , . Statistical properties of the SAA method in the multistage setting are discussed, e.g., in [47, Section 5.8]. It could be mentioned that the question of the sample size , i.e., how many discretization points per stage to use, can only be addressed in the framework of the original model with continuous distributions. One of the advantages of the SAA method is that statistical tests can be applied to the results of computational procedures; this is discussed in [9].
The SDDP algorithm for linear multistage SP problems, having a finite number of scenarios, was described in detail in a number of publications (see, e.g., the recent survey paper [12] and references therein). For later reference we briefly discuss it below. The SDDP algorithm consists of backward and forward steps. At each iteration of the algorithm, in the backward step the current approximations of the cost-to-go functions , by cutting planes, are updated by adding respective cuts. That is, at the last stage the cost-to-go function , , is given by the optimal value of the linear program
(5.3)
The subgradient of at a chosen point is given by , where is an optimal solution of the dual of problem (5.3) (see Remark 5.1 above). Consequently the subgradient of at is computed as , and hence the cutting plane
(5.4)
is added to the current family of cutting planes of . Actually at the last stage the above cutting plane is the supporting plane of at . This is because is explicitly computed by solving problem (5.3).
One step back, at stage , the cost-to-go function , , is given by the optimal value of the program
(5.5)
The cost-to-go function is not known. Therefore it is replaced by the approximation given by the maximum of the current family of cutting planes including the cutting plane added at stage as described above. Since is the maximum of a finite number of affine functions, the obtained problems
(5.6)
, are linear. Let be the optimal value function of problem (5.6) and be the corresponding estimate of the expected value cost function. Note that since , we have that , , and hence . A subgradient of the optimal value function can be computed at a chosen point by solving the dual of problem (5.6) and hence computing its optimal solution. Then is a subgradient of at . Consequently
(5.7)
is a cutting plane of , and is added to the current collection of cutting planes of . And so on going backward in time until the corresponding approximation of the first stage problem is obtained:
(5.8)
Let us note that by the construction, for . Therefore any cutting plane of is also a cutting plane of . It also follows that the optimal value of problem (5.8) is less than or equal to the optimal value of problem (2.1) - (2.2). Therefore the optimal value of problem (5.8) provides a (deterministic) lower bound for the optimal value of the considered multistage problem.
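The construction of the cut (5.7) from the per-realization problems (5.6) amounts to averaging: given the optimal values and the dual-based subgradients of the stage problems at the trial point, a cutting plane of the expected value cost-to-go function is obtained as the probability-weighted average of the individual planes. A minimal Python sketch of this step (the inputs are assumed to come from solving (5.6) and its dual for each realization):

    import numpy as np

    def average_cut(values, subgradients, probs, x_trial):
        # values[j] and subgradients[j] are the optimal value and a subgradient of the
        # j-th stage problem at x_trial; probs[j] are the corresponding probabilities.
        # The averaged plane  l(x) = sum_j p_j * ( values[j] + subgradients[j]'(x - x_trial) )
        # is a cutting plane of the expected value cost-to-go function.
        v = np.asarray(values, dtype=float)
        G = np.asarray(subgradients, dtype=float)     # shape (N, n)
        p = np.asarray(probs, dtype=float)
        intercept = float(p @ (v - G @ x_trial))
        slope = p @ G
        return intercept, slope                       # l(x) = intercept + slope @ x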
The computed estimates , , of the cost-to-go functions, and optimal solution of the first stage problem (5.8), define a feasible policy in the same way as in (2.10). Since this policy is feasible, its value is greater than or equal to the optimal value of the considered problem (2.1) - (2.2). In the forward step this value is estimated by randomly generating realizations of the data process and evaluating the corresponding policy values. That is, consider a sample path (scenario) of the data process. Let , , and be realizations of the random parameters corresponding to the generated sample path. Starting with first stage solution , next values are computed iteratively going forward in time by computing an optimal solution of the optimization problem in (2.10) with replaced by . That is
(5.9)
The corresponding total sum
(5.10)
is a function of the generated sample and hence is random. Its expected value is equal to the value of the constructed policy, and thus is greater than or equal to the optimal value of problem (2.1) - (2.2).
The forward step of the algorithm proceeds to evaluate the value of the constructed policy by randomly generating sample paths (scenarios). At every stage , a point is generated at random according to the distribution of the corresponding random vector. That is, a point is chosen with probability , , from the set . This generates the corresponding sample path (scenario) of the data process. Consequently the policy values for the generated sample path are computed by using the rule (5.9), and thus the total sum , defined in (5.10), is evaluated. The computed value gives a point estimate of the value of the constructed policy. This procedure can be repeated several times by randomly (and independently of each other) generating scenarios. Let be the values, calculated in accordance with (5.10), for each generated scenario. Their average gives an unbiased estimate of the value of the considered policy. The upper edge of the corresponding confidence interval is viewed as a statistical estimate of an upper bound for the optimal value of the considered multistage problem. Here is an estimate of the variance , and is the critical level, say . The justification for this bound is that, by the Central Limit Theorem (CLT), for large the average is approximately normally distributed and corresponds to mass of the standard normal distribution. For not too large it is common practice to use critical values from the -distribution with degrees of freedom, rather than from the normal distribution. In any case this is just a heuristic justified by the CLT. Nevertheless it is useful and typically gives a reasonable estimate of the upper bound.
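In code, the statistical upper bound described above can be computed as follows (a sketch of the standard construction; the sample of policy values used in the example is hypothetical):

    import numpy as np
    from scipy import stats

    def statistical_upper_bound(policy_values, level=0.95):
        # One-sided confidence upper bound for the value of the constructed policy:
        # sample mean plus a t-critical value times the standard error, where
        # policy_values are the totals (5.10) over independently generated scenarios.
        v = np.asarray(policy_values, dtype=float)
        M = v.size
        t_crit = stats.t.ppf(level, df=M - 1)
        return v.mean() + t_crit * v.std(ddof=1) / np.sqrt(M)

    # hypothetical policy values from 20 forward-pass scenarios
    sample = np.random.default_rng(1).normal(loc=100.0, scale=5.0, size=20)
    print(statistical_upper_bound(sample))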
Remark 5.3
The constructed approximations , of the expected value cost-to-go functions, also define a feasible policy for the original problem with continuous distributions of the data process vectors . Its value can be estimated in the same way by generating a sample path from the original continuous distribution and computing the corresponding policy values in the forward step in accordance with (5.9). Consequently the policy value is estimated by averaging the computed values for a number of randomly generated sample paths. This can be used for construction of statistical upper bound for the original problem. Expectation of the optimal value of the SAA problem, based on randomly generated sample, is less than or equal to the optimal value of the original problem. Therefore the lower bound generated by the SDDP algorithm gives a point estimate of a lower bound of the original problem. We can refer to [9] for various related statistical testing procedures.
The difference between the (statistical) upper bound and (deterministic) lower bound provides an upper bound for the optimality gap of the constructed policy. It can be used as a stopping rule by stopping the iteration procedure when this estimated gap is smaller than a specified precision level. The bound is quite conservative and for larger problems may not be reducible to the specified accuracy level in a reasonable computational time. A more practical heuristic approach is to stop the iterations after the lower bound stabilizes and new iterations, with additional cutting planes, do not significantly improve (increase) the lower bound.
The above procedure depends on a choice of the points , referred to as the trial points. The forward step suggests using the computed values as trial points in the next iteration of the backward step of the algorithm. The rationale for this can be explained as follows. In general, in order to approximate the cost-to-go functions uniformly with a given accuracy in a specified region, the number of cutting planes grows exponentially with the dimension of the decision variables . In many applications the regions where the cost-to-go functions should be approximated more accurately are considerably smaller than the regions of possible variability of the decision variables. By randomly generating scenarios, the forward step generates trial points where the cost-to-go functions are likely to require a more accurate approximation.
Remark 5.4
The trial points are supposed to be chosen at stages . This suggests that the above procedure is relevant for number of stages . For two stage problems, i.e., when , there is no sampling involved and this becomes the classical Kelley cutting plane algorithm, [18]. It is well known in deterministic convex optimization that Kelley's algorithm can behave poorly as the number of decision variables increases, and much better cutting plane algorithms can be designed by using regularization or trust region/level set strategies [26, 20]. However, it is not clear how such methods can be extended to the dynamic multistage settings (although such attempts were made). The reason is that in multistage problems the optimization is performed over policies which are functions of the data process. It is not known a priori where realizations of the considered policies may fall at future stages, and an attempt to restrict the search region by some form of regularization could be unhelpful.
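For reference, here is a bare-bones Python sketch of Kelley's cutting plane method for minimizing a convex function over a box; the master problem minimizes the current piecewise linear model built from the collected cuts, and the oracle, box and tolerance below are illustrative assumptions.

    import numpy as np
    from scipy.optimize import linprog

    def kelley(oracle, lb, ub, n_iter=50, tol=1e-6):
        # Kelley's method for min f(x) over the box [lb, ub]; oracle(x) returns
        # (f(x), subgradient).  The master problem minimizes the piecewise linear
        # model built from the cuts collected so far.
        lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
        n = lb.size
        x, cuts, best = (lb + ub) / 2.0, [], np.inf
        for _ in range(n_iter):
            fx, g = oracle(x)
            best = min(best, fx)                      # upper bound
            cuts.append((fx, np.asarray(g, dtype=float), x.copy()))
            # master LP in (x, r):  min r  s.t.  r >= f_k + g_k'(x - x_k),  lb <= x <= ub
            c = np.r_[np.zeros(n), 1.0]
            A = np.array([np.r_[gk, -1.0] for _, gk, _ in cuts])
            rhs = np.array([gk @ xk - fk for fk, gk, xk in cuts])
            bounds = [(l, u) for l, u in zip(lb, ub)] + [(None, None)]
            res = linprog(c, A_ub=A, b_ub=rhs, bounds=bounds, method="highs")
            x, lower = res.x[:n], res.fun             # lower bound from the model
            if best - lower < tol:
                break
        return x, best, lower

    # example: minimize ||x - 0.3||^2 over the box [-2, 2]^2
    oracle = lambda x: (float(np.sum((x - 0.3) ** 2)), 2.0 * (x - 0.3))
    print(kelley(oracle, [-2.0, -2.0], [2.0, 2.0]))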
Proofs of convergence of SDDP type algorithms were given in several publications (e.g., [13] and references therein). These proofs are based on the argument that eventually, almost surely, the forward step of the algorithm will go through every possible scenario many times, and hence the cost-to-go functions will be approximated with any accuracy in the relevant regions. Since the number of scenarios grows exponentially with the number of stages, this is not a very realistic assumption. Numerical experiments indicate that, in a certain sense, the numerical complexity of the SDDP method grows almost linearly with the number of stages when the dimension of the decision vectors is kept constant. This property of the rate of convergence was investigated in the recent paper [22].
5.1.1 Interstage dependence
Suppose now that the data process is Markovian, and hence the cost-to-go functions are defined by equations (2.11) - (2.12). There are different ways in which such Markovian structure can be modeled. One classical approach is time series analysis. The data process is modeled as a, say first order, autoregressive process, i.e.,
(5.11)
where and are parameters of the process and is a sequence of random error vectors. It is assumed that the errors process is stagewise independent (in fact it is usually assumed that are i.i.d.). Suppose further that only the right hand side variables in the considered problem are random and set . Then we can write the cost-to-go functions in terms of and treating as the corresponding random process:
(5.12)
with
(5.13)
Here are treated as decision variables, is viewed as the corresponding random process, and the expectation in (5.13) is taken with respect to .
There are two drawbacks in the above approach. First, the number of decision variables is artificially increased, with the cutting planes in the corresponding SDDP algorithm computed jointly in and . This increases the computational complexity of the numerical procedure. Second, in order to have a linear problem, only the right hand side parameters are allowed to be random. An alternative approach is to approximate the Markovian process by a Markov chain, which we are going to discuss next.
Suppose that the distribution of random vector is supported on set , . Consider the partition , , of by Voronoi cells (for the sake of simplicity we assume that the number of cells is the same at every stage), i.e.,
where are given points. That is, is the set of points of closest to . There are various ways how the center points can be chosen. For example, could be determined by minimizing the squared distances:
(5.14)
The above optimization problem is not convex. Nevertheless it can be reasonably solved by various methods (cf., [27]). Consider the conditional probabilities . Note that different Voronoi cells can have common boundary points; we assume that the probabilities of such events are zero, so that the probabilities are well defined. This leads to an approximation of the data process by a (possibly nonhomogeneous) Markov chain with transition probabilities of moving from point to point at the next stage.
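A small Python sketch of this discretization, using k-means clustering to choose the per-stage points and empirical frequencies for the transition probabilities; clustering is only one of several possibilities for choosing the centers, and the array shapes and the synthetic data below are our assumptions.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def fit_markov_chain(paths, n_cells):
        # paths has shape (n_paths, T, d): sampled trajectories of the data process.
        # At every stage the samples are clustered (the cluster centers play the role
        # of the points defining the Voronoi cells), and transition probabilities are
        # estimated by empirical frequencies between consecutive stages.
        n_paths, T, d = paths.shape
        centers, labels = [], []
        for t in range(T):
            ctr, lab = kmeans2(paths[:, t, :], n_cells, minit="++")
            centers.append(ctr)
            labels.append(lab)
        transitions = []
        for t in range(T - 1):
            P = np.zeros((n_cells, n_cells))
            for i, j in zip(labels[t], labels[t + 1]):
                P[i, j] += 1.0
            row_sums = P.sum(axis=1, keepdims=True)
            transitions.append(P / np.maximum(row_sums, 1.0))   # empty rows stay zero
        return centers, transitions      # one transition matrix per pair of consecutive stages

    # example with a synthetic (Markovian) data process
    rng = np.random.default_rng(0)
    sample_paths = np.cumsum(rng.normal(size=(200, 5, 2)), axis=1)
    centers, transitions = fit_markov_chain(sample_paths, n_cells=10)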
Consider , , and , . Recall dynamic equations (2.11) - (2.12) for Markovian data process. For the constructed Markov chain these dynamic equations can be written as
(5.15)
where
(5.16)
Note that the cost-to-go functions are convex in , while the argument can take only the values .
It is possible to apply the SDDP method separately for every , (cf., [27]). That is, in the backward step of the algorithm the cost-to-go functions are approximated by their cutting planes (in ) separately for each , . This defines a policy for every realization (sample path) of the constructed Markov chain. This can be used in the forward step of the algorithm.
It could happen that a sample path of the original data process is different from every sample path of the constructed Markov chain. In order to extend the constructed policy to a policy for the original data process, one approach is to define the policy values for a sample path of the original data process by taking the corresponding policy values from the sample path of the Markov chain closest in some sense to the considered sample path of the original data process.
The above approach is based on discretization (approximation) of the Markov process by a Markov chain. Compared with the approach based on autoregressive modeling of the data process, the Markov chain discretization is not restricted to the case where only the right hand side parameters are random. Also, it uses cutting planes approximations only with respect to the variables , which could be computationally advantageous. On the other hand, its computational complexity and required computer memory grow with the number of per-stage discretization points. Also, unlike the SAA method, there is no available inference describing statistical properties of such a discretization. From an intuitive point of view, such a discretization could become problematic as the dimension of the random vectors of the data process increases.
5.1.2 Duality of multistage linear stochastic programs
The linear multistage stochastic program (5.2), with a finite number of scenarios, can be viewed as a large scale linear program, the so called deterministic equivalent. By the standard theory of linear programming it has the (Lagrangian) dual. Suppose that the primal problem (5.2) has finite optimal value. Then the optimal values of the primal and its dual are equal to each other and both problems have optimal solutions. Dualization of the feasibility constraints leads to the following (Lagrangian) dual of problem (5.2) (cf., [47, Section 3.2.3])
(5.17)
The optimization in (5.17) is over policies , .
It is possible to write dynamic programming equations for the dual problem. The idea of applying the SDDP algorithm to the dual problem was introduced in [25]. In what follows we use the approach of [15] based on formulation (5.17). As before we make Assumptions 2.1, 2.2 and 5.1. At the last stage , given and , we need to solve the following problem with respect to :
(5.18)
Note that and . Since is independent of , the expectation in (5.18) is unconditional with respect to the distribution of .
The optimal value and an optimal solution of problem (5.18) are functions of vectors and and matrix . And so on going backward in time, using the stagewise independence assumption, we can write the respective dynamic programming equations for , as
(5.19)
with being the optimal value of problem (5.19). Finally, at the first stage the following problem should be solved (the value functions here should not be confused with the value functions of the SOC framework):
(5.20)
Let us make the following observations about the dual problem. Unlike the primal problem, the optimization (maximization) problems (5.18) and (5.19) do not decompose into separate problems with respect to each and should be solved as one linear program with respect to . If and , , are deterministic, then is only a function of . The value (cost-to-go) function is a concave function of . This makes it possible to apply an SDDP type algorithm to the dual problem (see [15] for details). For any feasible policy of the dual problem, the value of the first stage problem (5.20) gives an upper bound for the optimal value of the primal problem (5.2). Since the dual is a maximization problem, the SDDP algorithm applied to the dual problem produces a deterministic upper bound for the optimal value of problem (5.2).
It is interesting to note that it could happen that the dual problem does not have relatively complete recourse even if the primal problem has it. It is said that the problem (primal or dual) has relatively complete recourse if, at every stage , for any solutions generated by the forward process at the previous stages, the respective dynamic program has a feasible solution at stage for every realization of the random data. Without relatively complete recourse it could happen at a certain stage that, for some realizations of the data process, the corresponding optimization problem is infeasible and the forward process cannot continue. Note that relatively complete recourse is related to the policy construction in the forward dynamic process and may not hold even if the problem has a finite optimal value. It is still possible to proceed with construction of the dual upper bound by introducing a penalty term into the dual problem (cf., [15]).
5.2 Cutting planes algorithm for SOC problems
Consider the SOC problem (3.1) - (3.2). Suppose that Assumptions 2.1 and 2.2 hold, and hence equation (3.10) for the value function follows. Suppose further that Assumption 3.1 is satisfied, and hence value functions are convex. Then a cutting planes algorithm of the SDDP type can be applied in a way similar to the SP setting. There are, however, some subtle differences which we are going to discuss next.
In the backward step of the algorithm an approximation of the value function is constructed by maximum of a family of cutting planes, similar to the SP setting, going backward in time. Let , , be trial points, generated say by the forward step of the algorithm at the previous iteration. At the backward step, starting at time the cutting (supporting) plane , where is a subgradient of at , is added to the current family of cutting planes of . Going backward in time, at stage we need to compute a subgradient of function
(5.21)
at the trial point , where is the current approximation of by cutting planes . The minimization problem (5.21) is linear if functions are linear (polyhedral) in , and is of the form (3.9) with the corresponding mapping being affine in .
Let us first consider the case where does not depend on . Then the required subgradient is given by (cf., [39, Theorem 23.8])
(5.22)
where is a minimizer of problem (5.21) for , and is a subgradient of at . We also need to compute a subgradient of at the points , . The subgradient at a point is given by the gradient of its cutting plane with such that , i.e., is the supporting plane of at . Of course, the gradient of affine function is for any . The cutting plane is then , where
It could be noted that in the present case the required subgradient, and hence the corresponding cutting plane, can be computed without solving the dual problem. Therefore it may be inappropriate in this setting to call it Stochastic Dual Dynamic Programming.
When depends on , calculation of a subgradient of function (5.21) could be more involved; nevertheless in some cases it is still possible to proceed. Suppose that is of the form (3.9) with the corresponding mapping having convex components. That is, minimization problem (5.21) can be written as
(5.23) |
Then the required gradient is given by the gradient of the Lagrangian of problem (5.23) with being the respective optimal solution and being the corresponding Lagrange multipliers vector, provided the optimal solution and Lagrange multipliers are unique (e.g., [7, Theorem 4.24]).
The forward step of the algorithm is performed similarly to the SP setting. Starting with initial value and generated sample path , the policy is computed iteratively going forward in time (compare with (3.11)). That is, at stage , given the value of the state vector, the control value is computed as
(5.24) |
Consequently the next stage state value is computed as
(5.25) |
and so on. Recall that in the SOC framework, is also random. The computed values of the state variables can be used as trial points in the next iteration of the backward step of the algorithm.
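A schematic sketch of this forward step is given below. All names (stage_cost, dynamics, candidate_controls, V_hat) are illustrative assumptions; for simplicity the stage problem is solved by enumerating a finite set of candidate controls rather than by a convex optimization solver.

```python
import numpy as np

def forward_pass(x0, xi_path, stage_cost, dynamics, candidate_controls, V_hat):
    """Illustrative forward step: at each stage choose the control minimizing the
    stage cost plus the current cutting-plane model of the next value function,
    then propagate the state along the sampled path xi_path (compare (5.24)-(5.25)).
    V_hat[t] is a list of (alpha, beta) pairs defining the cuts of the stage-t
    value function approximation."""
    x = np.asarray(x0, dtype=float)
    states, controls, total_cost = [x], [], 0.0
    for t, xi in enumerate(xi_path):
        def objective(u):
            y = dynamics(x, u, xi)
            cuts = V_hat[t + 1]
            v_next = max(a + b @ y for a, b in cuts) if cuts else 0.0
            return stage_cost(x, u, xi) + v_next
        u = min(candidate_controls, key=objective)
        total_cost += stage_cost(x, u, xi)
        x = dynamics(x, u, xi)
        states.append(x)
        controls.append(u)
    return states, controls, total_cost
```

The generated states can then serve as trial points for the next backward step, as noted above.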
Remark 5.5 (-factor approach)
Consider dynamic equations (3.10) with affine state equations of the form (3.3), and define (the -functions here should not be confused with the cost-to-go functions of the SP framework)
(5.26) |
We have that
(5.27) |
The dynamic equations (3.10) can be written in terms of as
(5.28) |
The optimal policy , defined by (3.11), can be written in terms of the -functions as
(5.29) |
Under Assumption 3.1, functions are convex jointly in and . Consequently the cutting plane algorithm can be applied directly to dynamic equations (5.28). That is, in the backward step current lower approximations of the respective functions are updated by adding cutting planes computed at trial points generated in the forward step of the algorithm. The constructed approximations are used in the forward step of the algorithm for generating a point estimate of the expected value of the corresponding policy for a randomly generated sample path.
5.3 SDDP algorithms for risk averse problems.
Cutting planes algorithms of SDDP type can be extended to risk averse problems (cf., [33],[44]). Let us start with the SP framework.
5.3.1 Stochastic Programming framework.
We follow the assumptions and notation of Section 5.1, and assume that equation (2.9) for the cost-to-go functions is replaced by its risk averse counterpart (4.5). The backward step of the algorithm is performed similarly to the risk neutral case. We only need to adjust the calculation of subgradients of the cutting planes approximations of the cost-to-go functions .
In numerical algorithms we deal with discrete distributions having a finite support. So we assume in this section that is a finite space equipped with the sigma algebra of all subsets of , and a reference probability measure with the respective nonzero probabilities , . For a random variable (a function) , we denote , .
Remark 5.6
Recall the definition of law invariant risk measures discussed in Section 4. Law invariant coherent risk measures have naturally defined conditional counterparts used in construction of the corresponding nested risk measures (see equations (5.51) - (5.52) below). Suppose for the moment that the probabilities are such that for the equality holds only if . Then two variables are distributionally equivalent iff they coincide. Therefore in that case any risk measure defined on the space of variables is law invariant. In the theory of law invariant coherent risk measures it is usually assumed that the reference probability space is atomless; of course an atomless space cannot be finite. In the case of a finite space it makes sense to consider law invariant risk measures when all probabilities are equal to each other, i.e., , . In that case are distributionally equivalent iff there exists a permutation such that , . In any case, we will not restrict the following discussion to the case of all probabilities being equal to each other.
In the considered setting , and the dual representation (4.1) of coherent risk measure can be written as
(5.30) |
where is a convex closed subset of consisting of vectors such that . Since it is assumed that is real valued, the set is bounded and hence is compact.
The subdifferential of at is given by
(5.31) |
Specific formulas for the subgradients, in various examples of coherent risk measures, are listed in [47, Section 6.3.2]. These formulas can be applied for computation of cutting planes in the backward step of the algorithm. Several such examples were discussed and implemented in an SDDP type algorithm in [51].
As described in Section 5.1, in the risk neutral setting the forward step of the algorithm serves two purposes; namely, construction of statistical upper bounds and generation of trial points. The constructed approximations of the cost-to-go functions define a feasible policy in the same way as in the risk neutral case, by using formula (5.9). Therefore the forward step can be used for generation of trial points similar to the risk neutral case. However, because of the nonlinearity of the risk measures, it does not produce an unbiased estimate of the risk averse value of the constructed policy and cannot be applied in a straightforward way for computing statistical upper bounds as in the risk neutral case. At the moment there are no known efficient numerical algorithms for computing statistical upper bounds for risk averse problems in the SP framework. Interestingly, as we will argue below, in the SOC setting it is possible to construct reasonably efficient statistical upper bounds for a certain class of risk measures.
5.3.2 Stochastic Optimal Control framework.
As before we suppose that Assumptions 2.1 and 2.2 hold, and thus value functions , associated with coherent risk measure , are determined by the dynamic equations (4.8). Furthermore, we suppose that Assumption 3.1 is satisfied, and hence value functions are convex.
We consider in this section the following class of coherent risk measures. This class is quite broad and includes many risk measures used in practical applications (the presentation below is based on [14]). We assume that the considered risk measures can be represented as
(5.32) |
where is a nonempty convex closed subset of a finite dimensional vector space and is a real valued function. For the sake of simplicity, we assume that the function is the same at every stage, while the reference probability distribution of can be different at different stages. We make the following assumptions about function . (i) For every , the expectation in the right hand side of (5.32) is well defined and the infimum is finite valued. (ii) Function is convex in . (iii) is monotonically nondecreasing, i.e., if , then for all .
Under these assumptions, the functional defined in (5.32) satisfies the axioms of convexity and monotonicity. The convex combination of expectation and Average Value-at-Risk, defined in (4.6), is of this form with , and
(5.33) |
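As an illustration of the representation (5.32), the following sketch numerically checks on a Monte Carlo sample that minimizing the expectation of a function of this type over the auxiliary parameter recovers a convex combination of the expectation and the Average Value-at-Risk. The particular weighting convention, the names lam and alpha, and the formula for psi below are our own assumptions for the illustration and are not taken from (5.33).

```python
import numpy as np

# Assumed convention: rho(Z) = (1 - lam) * E[Z] + lam * AVaR_alpha(Z), with
# AVaR_alpha(Z) = min_u { u + E[(Z - u)_+] / alpha }.
lam, alpha = 0.5, 0.05

def psi(z, u):
    # convex in (z, u) and nondecreasing in z, as required by conditions (i)-(iii)
    return (1 - lam) * z + lam * (u + np.maximum(z - u, 0.0) / alpha)

rng = np.random.default_rng(0)
Z = rng.normal(size=50_000)                     # sample of the random cost

# rho(Z) via the infimum representation, minimizing over u on a grid
u_grid = np.linspace(Z.min(), Z.max(), 1001)
rho_inf = min(psi(Z, u).mean() for u in u_grid)

# rho(Z) computed directly; the minimizer of the inner problem is the
# (1 - alpha)-quantile of Z
var = np.quantile(Z, 1 - alpha)
avar = var + np.maximum(Z - var, 0.0).mean() / alpha
rho_direct = (1 - lam) * Z.mean() + lam * avar

print(rho_inf, rho_direct)                      # the two values should be close
```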
For risk measures of the form (5.32), dynamic programming equations (4.8) can be written as
(5.34) |
Note that the minimization in the right hand side of (5.34) is performed jointly in controls and parameter vector .
The backward step of the algorithm can be performed similarly to the risk neutral case. In case does not depend on , by the chain rule of differentiation, formula (5.22) for the required subgradient is extended to
(5.35) |
where and are minimizers in the right hand side of (5.34), with replaced by its current approximation , and is a subgradient of at .
The subgradient can be easily computed in many interesting examples (e.g., [47, Section 6.3.2]). For example, consider convex combination of expectation and Average Value-at-Risk measure, defined in (4.6). The corresponding function of the form (5.33), is piecewise linear in . Then if , if ; and if , then can be any point in the interval .
The computed approximations of the value functions define a feasible policy in the same way as in the risk neutral case. Also by construction we have that , . Therefore the algorithm produces a (deterministic) lower bound for the optimal value of the problem.
Formula (5.34) shows that the controls and value of the parameter can be computed simultaneously. This leads to the following statistical upper bound for the (risk averse) value of the policy defined by the computed approximation. Let be a sample path of the data process. Suppose for the moment that , i.e., consider the risk neutral case. Let be a policy determined by controls and states , and , , and , be cost values associated with the sample path. Then the corresponding point estimate of the expectation (3.1) of the total cost for the considered policy , can be computed iteratively going backward in time with and
(5.36) |
Value gives an unbiased estimate of the corresponding expected total cost.
This process can be adapted to the general risk averse case as follows. Let , and , , be realizations of the parameters corresponding to the generated sample path. For the approximations consider the corresponding policy with controls and parameters computed going forward, starting with the initial value and using (compare with (5.24) - (5.25))
(5.37) |
(5.38) |
Let , , and , be cost values associated with the generated sample path. Consider the following values defined iteratively going backward in time: and
(5.39) |
It is possible to show that is greater than or equal to the value of the policy defined by the considered approximate value functions, and hence is an upper bound on the optimal value of the risk averse problem (cf., [14]). The expectation can be estimated by randomly generating sample paths (scenarios) and averaging the computed realizations of the random variable . This is similar to the construction of the statistical upper bound in the risk neutral setting. The inequality “ value of the constructed policy” is based on Jensen’s inequality applied to the convex function and, unlike the risk neutral case, can be strict. Therefore the above procedure introduces an additional error in the estimated optimality gap. This additional error becomes larger when the function is “more nonlinear”. Numerical experiments indicate that this procedure gives a reasonable upper bound, for instance for risk measures of the form (4.6) (cf., [14]).
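The following sketch indicates how the backward recursion along a single sampled path might be organized. It assumes, consistently with (5.34) and the Jensen-inequality argument above, that the recursion has the form v_t = c_t + Psi(v_{t+1}, q_t) with the terminal value given by the final-stage cost; the exact formulas are given in [14], and all names below are illustrative.

```python
import numpy as np

def path_value(costs, q_params, psi):
    """Backward recursion along one sampled path (compare (5.36) and (5.39)):
    costs = [c_1, ..., c_T, c_{T+1}] are the realized stage costs and
    q_params = [q_1, ..., q_T] the realized auxiliary parameters produced by the
    forward pass; psi(z, q) is the function from the representation (5.32).
    Assumed recursion: v_{T+1} = c_{T+1}, v_t = c_t + psi(v_{t+1}, q_t)."""
    v = costs[-1]
    for c_t, q_t in zip(reversed(costs[:-1]), reversed(q_params)):
        v = c_t + psi(v, q_t)
    return v

def upper_bound_estimate(sampled_paths, psi):
    """Average of path values over independently sampled paths, giving the
    statistical upper bound described in the text; each element of sampled_paths
    is a (costs, q_params) pair produced by the forward pass (5.37)-(5.38)."""
    return np.mean([path_value(c, q, psi) for c, q in sampled_paths])
```

In the risk neutral case, taking psi(z, q) = z reduces the recursion to the simple summation in (5.36).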
-factor approach.
Consider the dynamic equations (5.34) and define (see Remark 5.5)
(5.40) |
We have that
and hence dynamic equations (5.34) can be written in terms of as
The cutting planes algorithm can be applied directly to functions rather than to the value functions . In the backward step of the algorithm, subgradients with respect to and , of the current approximations of , should be computed at the respective trial points, and the obtained cutting plane is added to the current family of cutting planes of . An advantage of that approach could be that the calculation of the respective subgradients does not require solving nonlinear optimization programs even if the function is not polyhedral.
5.3.3 Infinite horizon setting
An infinite horizon (stationary) counterpart of the SOC problem (3.1) - (3.2) is
(5.41) |
where is the so-called discount factor. The optimization (minimization) in (5.41) is over the set of feasible policies satisfying (w.p.1) the following constraints (when the number of scenarios is finite, the feasibility constraints should be satisfied for all possible realizations of the random process ):
(5.42) |
It is assumed that , is an i.i.d. (independent identically distributed) sequence of random vectors, with common distribution . The cost and the mapping are assumed to be the same at every stage, and the (nonempty) feasibility set of controls does not depend on the state and stage.
Bellman equation for the value function, associated with problem (5.41), can be written as
(5.43) |
Assuming that is bounded, equation (5.43) has a unique solution . The optimal policy for problem (5.41) is given by , with
(5.44) |
Similar to Assumption 3.1, let us make the following assumption ensuring convexity of the solution of equation (5.43).
Assumption 5.2
The function is convex in , the mapping is affine, and the set is convex closed.
In order to solve equation (5.43) numerically we need to discretize the possibly continuous distribution . Suppose that the SAA method is applied, i.e., a sample of size from the distribution is generated (say by Monte Carlo sampling techniques), and the distribution in Bellman equation (5.43) is replaced by its empirical estimate (by we denote the measure assigning mass one at the point , the so-called Dirac measure), assigning probability to each generated point. This raises the question of the involved sample complexity, i.e., how large the sample size should be in order for the SAA problem to give an accurate approximation of the original problem. In some applications the discount factor is very close to one. It is well known that as the discount factor approaches one, it becomes more difficult to solve the problem. Note that the optimal value of problem (5.41) increases at the rate of as approaches one. It is shown in [45] that the required sample size is of order as a function of the discount factor and the error level . That is, the error grows more or less proportionally to the optimal value as approaches one. This indicates that the sample size required to achieve a specified relative error is not sensitive to the discount factor being close to one.
Suppose now that random vector has a finite number of possible values with respective probabilities . For example, in the SAA approach this is a random sample generated by Monte Carlo sampling techniques, in which case , . In that setting Bellman equation (5.43) can be written as
(5.45) |
where , , and , .
A cutting planes algorithm of SDDP type can be applied to equation (5.45). Given a current approximation of the value function by cutting planes and a trial point , a new cutting plane is added to the set of cutting planes of the value function, where
(5.46) |
(5.47) |
(5.48) |
If is linear (polyhedral) in and the set is polyhedral, given by linear inequalities, the minimization problem on the right hand side of (5.46) can be formulated as a linear programming problem. The subgradient of at a point can be computed in the same way as in (5.22), by using the subgradient of its supporting plane at .
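To illustrate this backward update (5.46)-(5.48), the sketch below computes one new cut for the stationary value function under additional simplifying assumptions that are ours, not the text's: the stage cost is linear in the control, the dynamics are affine with scenario-dependent intercepts, the controls are box constrained, and at least one valid initial cut (e.g., the zero function when costs are nonnegative) is already present.

```python
import numpy as np
from scipy.optimize import linprog

def backward_cut(x_bar, cuts, A, B, b_scen, q, gamma, u_bounds):
    """One backward update at a trial point x_bar for the Bellman equation (5.45),
    under the assumed structure: stage cost q'u, next state A x + B u + b_j for
    scenario j, and current cut model Vhat(y) = max_k (alpha_k + beta_k' y), with
    beta_k given as NumPy vectors.  For each scenario the stage problem is a linear
    program in (u, theta), where theta represents Vhat at the next state; values and
    chain-rule subgradients are then averaged over scenarios to form the new cut."""
    n_u = B.shape[1]
    vals, grads = [], []
    for b_j in b_scen:
        c_obj = np.concatenate([q, [gamma]])          # minimize q'u + gamma * theta
        A_ub, b_ub = [], []
        for alpha_k, beta_k in cuts:
            # theta >= alpha_k + beta_k' (A x_bar + B u + b_j)
            A_ub.append(np.concatenate([beta_k @ B, [-1.0]]))
            b_ub.append(-(alpha_k + beta_k @ (A @ x_bar + b_j)))
        res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=list(u_bounds) + [(None, None)])
        u_opt = res.x[:n_u]
        y = A @ x_bar + B @ u_opt + b_j
        # a cut active at the optimal next state gives a subgradient of Vhat there
        k_star = max(range(len(cuts)), key=lambda k: cuts[k][0] + cuts[k][1] @ y)
        vals.append(res.fun)
        grads.append(gamma * (A.T @ cuts[k_star][1]))
    v_bar, g_bar = np.mean(vals), np.mean(grads, axis=0)
    # new cut in intercept/slope form: V(x) >= (v_bar - g_bar' x_bar) + g_bar' x
    return v_bar - g_bar @ x_bar, g_bar
```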
Similar to the case of finite horizon, by construction we have that , and hence gives a (deterministic) lower bound for the optimal value of problem (5.41) with initial value of the state vector. Important questions related to the above algorithm are how to generate trial points and how to construct an upper bound for the optimal value of problem (5.41). The current implementation, which was tested in numerical experiments, applies the forward step of the cutting planes algorithm to a truncated version of problem (5.41). Note that
(5.49) |
where is an upper bound for . Therefore the suggestion is to choose the horizon such that , where is a prescribed accuracy. Then the forward steps are applied to the finite horizon version of the problem using the current approximation of the value function. One run of the forward step, applied to the truncated version, generates a point estimate (up to accuracy ) of the value of the constructed policy, and trial points. It is possible to choose, say at random, one of these trial points for construction of the cutting plane for the value function.
Such a choice of trial points may not be optimal. The question of the “best way” to construct trial points in the above algorithm has not been carefully investigated. Another problem with this procedure is that when the discount factor is close to one, the required horizon could be too large for a numerical implementation; recall that in order to construct the statistical upper bound the forward step should be run a reasonable number of times. In such cases, if the problem can be formulated as a linear program, deterministic upper bounds based on the dual problem can be more efficient (cf., [46]).
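For instance, under the bound (5.49), the smallest horizon whose discounted tail is below a prescribed accuracy can be computed as follows (a small sketch; the names gamma, c_bar and eps are generic, with c_bar standing for the upper bound on the stage costs). It also illustrates how quickly the required horizon grows as the discount factor approaches one.

```python
import math

def truncation_horizon(gamma, c_bar, eps):
    """Smallest T such that the discounted tail gamma**T * c_bar / (1 - gamma)
    of the sum in (5.49) does not exceed eps."""
    return math.ceil(math.log(eps * (1 - gamma) / c_bar) / math.log(gamma))

# Example: gamma = 0.99, c_bar = 1, eps = 0.01 gives T = 917.
print(truncation_horizon(0.99, 1.0, 0.01))
```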
Risk averse case.
The infinite horizon risk neutral formulation (5.41) can be extended to the risk averse setting by replacing the expectation operator in Bellman equation (5.43) with a coherent risk measure . That is, the risk averse counterpart of Bellman equation (5.43) is
(5.50) |
This equation corresponds to the nested counterpart of the risk neutral problem (5.41):
(5.51) |
where is the corresponding -stage nested risk measure (cf., [42]). That is, as discussed in Section 3, for a considered policy, the state and control at stage are functions of . Consider the corresponding cost , which is a function of . Then
(5.52) |
where is the conditional counterpart of .
The backward step of the cutting planes algorithm in the risk averse setting is basically the same as the one discussed above for the risk neutral case. In order to construct a statistical upper bound we can proceed similarly to the construction of Section 5.3.2. That is, suppose that the risk measure is of the form (5.32) with function satisfying conditions (i) - (iii) specified in Section 5.3.2. Suppose also that has a finite number of possible values with reference probabilities . Then Bellman equation (5.50) can be written as
(5.53) |
A statistical upper bound can be constructed in a way similar to section 5.3.2 using a truncated version of problem (5.41).
Periodical setting
In various applications the data process has a periodical behavior. One such example is hydropower generation planning discussed in [51]. The uncertain process in that example is water inflows recorded on a monthly basis. Historical data of 79 years of records of the natural monthly energy inflow are available. This makes it possible to model the uncertain process as a periodical random process with period . The planning in that example was for 5 years, with another 5 years added to eliminate the end of horizon effect. This resulted in a stage stochastic optimization problem. The periodical approach, which we discuss below, allows a drastic reduction in the number of stages while giving almost the same policy value for the first stage.
In [48] such a periodical approach was introduced from the stochastic programming point of view. We discuss this in the framework of SOC. Consider the following infinite horizon SOC problem with discount factor :
(5.54) |
(5.55) |
We assume the following periodic structure, with period : the random data process is stagewise independent with and having the same probability distribution, and with , and , for .
Bellman equations for this problem take the form (compare with (3.10))
(5.56) |
(5.57) |
For period , the periodical problem (5.54) - (5.55) coincides with the stationary problem (5.41) of Section 5.3.3, and equation (5.57) becomes Bellman equation (5.43). The optimal policy for problem (5.54) - (5.55) is given by with
(5.58) |
(5.59) |
and for , where is the solution of equations (5.56) - (5.57).
Such a periodical formulation can be extended to the (nested) risk averse setting, and a cutting planes type algorithm can be applied in a way similar to what was discussed above (cf., [48]).
6 Computational Complexity of Cutting Plane Methods
As mentioned earlier, SDDP reduces to Kelley’s cutting plane method for two stage problems, and it can behave poorly as the number of decision variables increases. On the other hand, numerical experiments indicate that this type of algorithm can scale well with respect to the number of stages . In this section we summarize some recent theoretical studies in [22, 17] on the rate of convergence of SDDP. We first establish the number of iterations required by the SDDP method, and then present a variant of SDDP that can improve this complexity bound in terms of the dependence on the number of stages. Related developments, with a focus more on mixed-integer nonlinear optimization, can also be found in [53].
6.1 Computational complexity of SDDP
Consider the multi-stage stochastic programming problem (2.1)-(2.2). Similar to the previous section, we assume that ’s are stagewise independent (i.e., Assumption 2.2) and finitely supported (i.e., Assumption 5.1). For simplicity we further assume that the probabilities for , , are equal, i.e., , , . In that case, the dynamic equations in (2.8)-(2.9) reduce to
(6.1) |
where
(6.2) |
The basic scheme of the SDDP method has been discussed in Section 5.1. Here we add a few more details to facilitate its complexity analysis. Each iteration of the SDDP method to solve problem (2.1)-(2.2) contains a forward step and a backward step. In the forward step of the SDDP method, we randomly pick an index out of and solve the problem
to update . Equivalently, one can view as being randomly chosen from , , defined as
(6.3) |
Note that we do not need to compute for , even though they will be used in the analysis of the SDDP method. In the next section, we will present a deterministic dual dynamic programming method which chooses the feasible solution in the forward step in a more aggressive manner.
In the backward step of SDDP, starting from the last stage, we update the cutting plane models according to
for . Here and , respectively, denote the optimal value and a subgradient at the point for the following approximate value function
In order to assess the progress made by each SDDP iteration, we need to introduce a few notions. We say that a search point is -saturated at iteration if . Intuitively, a point is saturated if we have a good approximation for the value function at . Moreover, we say an -saturated search point at stage is -distinguishable if for all other -saturated search points , , that have been generated for stage so far by the algorithm. Let us denote the set of saturated points at stage . An -saturated search point is -distinguishable if . Recall that denotes the distance from the point to the set . Hence, in this case, this search point is far enough away from other saturated points. The rate of convergence of SDDP depends on how fast it generates -saturated and -distinguishable search points. For simplicity we assume in this paper that , for a given tolerance .
Under certain regularity assumptions, it can be shown that the function and lower approximation functions are Lipschitz continuous with constant . Utilizing these continuity assumptions, it can be shown that the probability for SDDP to find a new -distinguishable and -saturated search point at the -iteration can be bounded from below by
(6.4) |
where
(6.5) |
On the other hand, if for some iteration , we have for all , then
(6.6) |
Note that the upper bound can be further reduced so that it does not depend on if a discounting factor is incorporated in each stage of the problem.
Now let denote the number of iterations performed by the SDDP method before it finds a forward path such that (6.6) holds. Using the previous observation in (6.4), it can be shown (see [22]) that , where is defined in (6.5) and
(6.7) |
Here and for denote the dimension and diameter of the feasible sets , respectively. In addition, for any , we have
We now add a few remarks about the above complexity results for SDDP. Firstly, since SDDP is a randomized algorithm, we provide bounds on the expected number of iterations required to find an approximate solution of problem (2.1)-(2.2). We also show that the probability of having large deviations from these expected bounds for SDDP decays exponentially fast. Secondly, the complexity bound for the SDDP method depends on , which increases exponentially with respect to . We will show in the next section how to reduce the dependence on by presenting a variant of the SDDP method.
6.2 Explorative Dual Dynamic Programming
The SDDP method in the previous section chooses the feasible solution in a randomized manner. In this section, we will introduce a deterministic method, called Explorative Dual Dynamic Programming (EDDP) first presented in [22], which chooses the feasible solution in the forward step in an aggressive manner. As we will see, the latter approach will exhibit better iteration complexity while the former one is easier to implement.
The EDDP method consists of the forward step and backward step. In the forward step of EDDP, for each stage , we solve subproblems defined in (6.3) to compute the search points , . For each , we further compute the quantity , the distance between and the set of currently saturated search points in stage . Then we will choose from , , the one with the largest value of as , i.e., . We can break the ties arbitrarily. In view of the above discussion, the EDDP method always chooses the most “distinguishable” forward path to encourage exploration in an aggressive manner. This also explains the origin of the name EDDP. The backward step of EDDP is similar to SDDP.
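The explorative selection rule of the forward step can be sketched as follows (illustrative code; the candidate states and the set of saturated points are represented simply as arrays).

```python
import numpy as np

def eddp_select(candidates, saturated):
    """Explorative forward-step selection: among the candidate states for the
    current stage, pick the one farthest (in Euclidean distance) from the set of
    already saturated search points; ties are broken by the first maximizer.
    If no point has been saturated yet, every candidate is equally 'new' and the
    first one is returned."""
    candidates = [np.asarray(x, dtype=float) for x in candidates]
    if len(saturated) == 0:
        return candidates[0], np.inf
    S = np.asarray(saturated, dtype=float)
    dists = [np.min(np.linalg.norm(S - x, axis=1)) for x in candidates]
    j = int(np.argmax(dists))
    return candidates[j], dists[j]
```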
Under the same regularity assumptions as in SDDP, it can be shown that each EDDP iteration will either generate at least one -distinguishable and -saturated search point at some stage , or find a feasible solution of problem (2.1)-(2.2) such that
(6.8) |
As a result, EDDP finds a feasible policy of problem (2.1)-(2.2) satisfying (6.8) within at most iterations, where is defined in (6.7) (see [22]). This complexity bound does not depend on the number of scenarios , and hence significantly outperforms the one for the original SDDP method. Moreover, we can show that the relation implies the relation in (6.8) for a properly chosen value of , and hence it can be used as a termination criterion for the EDDP method. Note that it is possible to further reduce the dependence of the computational complexity of EDDP on so that the term in (6.7) does not depend on . However, this would require us to further modify EDDP by incorporating early termination of the forward step. More specifically, we need to terminate the forward step at some stage and start the backward step from , whenever falls within a certain threshold value (see [53, 17]).
It should be noted that although the complexity of SDDP is worse than that of EDDP, its performance in earlier steps of the algorithm should be similar to that of EDDP. Intuitively, for earlier iterations, the tolerance parameter used to define -distinguishable points is large. As long as are large enough so that the solutions are contained within a ball with diameter roughly in the order of , one can choose any point randomly from as . In this case, SDDP will perform similarly to EDDP. This may explain why SDDP exhibits good practical performance in the low accuracy region. In the high accuracy region, the new EDDP algorithm seems to be a much better choice in terms of its theoretical complexity.
6.3 Complexity of EDDP and SDDP over an infinite horizon
It is possible to generalize EDDP and SDDP to solve stochastic programs over an infinite horizon (see Section 5.3.3), when the number of stages , the cost function , the feasible set for all , and the random variables are iid (independent identically distributed) and can be viewed as realizations of a random vector . A common approach is to approximate the infinite horizon with a finite horizon. For example, using the relation (5.49) in the SOC problem, a horizon length of suffices for a target accuracy . A direct extension of the EDDP method shows that the computational complexity depends only quadratically on the time horizon . Exploiting the fact that the infinite-horizon problem is stationary, it has been shown that a simple modification that combines the forward and backward steps can improve the dependence on from quadratic to linear. Alternative selection strategies, including the use of upper bounds on the cost-to-go function and random sampling as in SDDP, can also be incorporated (see [17] for more details).
7 Dynamic Stochastic Approximation Algorithms
7.1 Extension of Stochastic Approximation
Stochastic approximation (SA) has attracted much attention recently for solving static stochastic optimization problems given in the form of
(7.1) |
where is a closed convex set, denotes the random vector and is a lower semicontinuous convex function. Observe that we can cast two-stage stochastic optimization in the form of (7.1) by viewing as the value function of the second-stage problem (see [28, 23, 19]).
The basic SA algorithm, initially proposed by Robbins and Monro [38], mimics the simple projected gradient descent method by replacing the exact gradient with its unbiased estimator. Important improvements of the SA method were made by Nemirovski and Yudin [29] and later by Polyak and Juditsky [35, 36]. During the past few years, significant progress has been made in SA methods, including the incorporation of momentum and variance reduction, and generalization to the nonconvex setting [21]. It has been shown in [28, 23] that SA type methods can significantly outperform the SAA approach for solving static (or two-stage) stochastic programming problems. However, it remains unclear whether these SA methods can be generalized for multi-stage stochastic optimization problems with .
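For reference, a minimal sketch of the basic method for the static problem (7.1) is given below: a projected stochastic subgradient iteration with iterate averaging, in the spirit of [29, 35, 36] (the stepsize rule and names are illustrative, not the tuned choices analyzed in those references).

```python
import numpy as np

def projected_sa(x0, stoch_subgrad, project, n_iters=1000, step0=1.0):
    """Projected stochastic subgradient method for min_{x in X} E[F(x, xi)]:
    x_{k+1} = Proj_X(x_k - gamma_k g_k), where g_k = stoch_subgrad(x_k) is an
    unbiased stochastic subgradient obtained from a fresh sample of xi, and the
    returned point is a stepsize-weighted average of the iterates."""
    x = np.asarray(x0, dtype=float)
    x_avg = np.zeros_like(x)
    total_weight = 0.0
    for k in range(1, n_iters + 1):
        g = stoch_subgrad(x)
        gamma_k = step0 / np.sqrt(k)       # O(1/sqrt(k)) stepsizes for convex F
        x = project(x - gamma_k * g)
        total_weight += gamma_k
        x_avg += (gamma_k / total_weight) * (x - x_avg)
    return x_avg
```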
In this section, we discuss a recently developed dynamic stochastic approximation (DSA) method for multi-stage stochastic optimization [24]. The basic idea of the DSA method is to apply an inexact primal-dual SA method for solving the -th stage optimization problem to compute an approximate stochastic subgradient for its associated value functions. For simplicity, we focus on the basic scheme of the DSA algorithm and its main convergence properties for solving three-stage stochastic optimization problems. We also briefly discuss how to generalize it for solving a more general form of multi-stage stochastic optimization with .
The main problem of interest in this section is the following three-stage SP problem:
(7.2) |
We can write problem (7.2) in a more compact form by using value functions as discussed in (2.3)-(2.5). More specifically, let be the value function at the third stage and be the corresponding expected value function conditionally on :
(7.3) |
We can then define the stochastic cost-to-go function and its corresponding (expected) value function as
(7.4) |
Problem (7.2) can then be formulated equivalently as
(7.5) |
7.2 Approximate stochastic subgradients
In order to solve problem (7.5) by stochastic approximation type methods, we need to understand how to compute first-order information of the value functions and . Recall that for a given closed convex set and a closed convex function , is called an -subgradient of at if
(7.6) |
The collection of all such -subgradients of at is called the -subdifferential of at , denoted by . Since both and are given in the form of (conditional) expectation, their exact first-order information is hard to compute. We resort to the computation of a stochastic -subgradient of these value functions, defined as follows. Observe that we do not assume stagewise independence throughout this section.
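A toy numeric illustration of the -subgradient inequality (7.6), with our own choice of function and constants, is the following.

```python
import numpy as np

# Toy example (our own): for psi(x) = |x|, the slope g = 1 is an exact subgradient
# at x0 = 0.1, and in particular an eps-subgradient for any eps >= 0, since
# |y| >= |x0| + g * (y - x0) - eps holds for all y.
eps, x0, g = 0.2, 0.1, 1.0
y = np.linspace(-5, 5, 10_001)
assert np.all(np.abs(y) >= np.abs(x0) + g * (y - x0) - eps)
```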
Definition 7.1
is called a stochastic -subgradient of the value function if is an unbiased estimator of an -subgradient of with respect to , i.e.,
(7.7) |
To compute a stochastic -subgradient of (resp., ), we have to compute an approximate subgradient of the corresponding stochastic value function (resp., ). To this end, we further assume that strong Lagrange duality holds for the optimization problems defined in (7.4) (resp.,(7.3)) almost surely. In other words, these problems can be formulated as saddle point problems:
(7.8) |
(7.9) |
One set of sufficient conditions to guarantee the equivalence between (7.4) (resp.,(7.3)) and (7.8) (resp., (7.9)) is that (7.4) (resp.,(7.3)) is solvable and the Slater condition holds.
Observe that (7.8) and (7.9) are special cases of a more generic saddle point problem:
(7.10) |
where , and are linear mappings of the random variable . For example, (7.9) is a special case of (7.10) with , , , and . It is worth noting that the first stage problem can also be viewed as a special case of (7.10), since (7.5) is equivalent to
(7.11) |
We now discuss how to relate an approximate solution of the saddle point problem in (7.10) to an approximate subgradient of . Let be a pair of optimal solutions of the saddle point problem (7.10), i.e.,
(7.12) |
where the second identity follows from the complementary slackness of Lagrange duality. It can be shown (see Lemma 1 of [24]) that if
(7.13) |
for a given and , then is an -subgradient of at .
In view of this observation, in order to compute a stochastic subgradient of at a given point , we can first generate a random realization conditionally on and then try to find a pair of solutions satisfying
where denotes the optimal solution for the -th stage problem associated with the random realization . We will then use as a stochastic -subgradient of at . The difficulty is that the function is also given in the form of expectation, and a numerical procedure is required to estimate its value. We will discuss this in more detail in the next section.
7.3 The DSA algorithm and its convergence properties
Our goal in this section is to present the basic scheme of the dynamic stochastic approximation algorithm applied to problem (7.5). This algorithm relies on the following three key primal-dual steps, referred to as stochastic primal-dual transformation (SPDT), applied to the generic saddle point problem in (7.10) at every stage.
:
(7.14) |
(7.15) |
(7.16) |
In the above primal-dual transformation, the input denotes the current primal solution, dual solution, and the previous dual solution, respectively. Moreover, the input denotes a stochastic -subgradient for at the current search point . The parameters describe the problem in (7.10), and are certain algorithmic parameters to be specified. Given these input parameters, the relation in (7.14) defines a dual extrapolation (or prediction) step to estimate the dual variable for the next iterate. Based on this estimate, (7.15) performs a primal projection to compute , and then (7.16) updates in the dual space to compute by using the updated . Note that the Euclidean projection in (7.15) can be extended to the non-Euclidean setting by replacing the term with a Bregman divergence. We assume that the above SPDT operator can be performed very fast or even has an explicit expression. The primal-dual transformation is closely related to the alternating direction method of multipliers and the primal-dual hybrid gradient method (see [24] for an account of the history of this procedure).
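A schematic version of one such transformation, written for a saddle point problem of the bilinear form min over x in X, max over pi, of phi(x) + <A x - b, pi> with Euclidean proximal terms, is sketched below. The parameter names theta, eta, tau and the specific update formulas are our own simplifications; the precise relations (7.14)-(7.16) and their parameter choices are given in [24].

```python
import numpy as np

def spdt_step(x, pi, pi_prev, g, A, b, project_X, theta, eta, tau):
    """One schematic primal-dual step for min_{x in X} max_pi phi(x) + <A x - b, pi>:
    (i) extrapolate (predict) the dual variable, (ii) take a projected primal step
    using a stochastic subgradient g of phi at x and the predicted dual, and
    (iii) update the dual variable at the new primal point."""
    pi_tilde = pi + theta * (pi - pi_prev)              # dual extrapolation (prediction)
    x_new = project_X(x - eta * (g + A.T @ pi_tilde))   # primal projection step
    pi_new = pi + tau * (A @ x_new - b)                 # dual update
    return x_new, pi_new, pi                            # pi becomes the previous dual
```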
In order to solve problem (7.5), we will combine the above primal-dual transformation applied to all three stages, the scenario generation for the random variables and in the second and third stages, and certain averaging steps in both the primal and dual spaces, to compute an approximate pair of primal and dual solutions for the saddle point problem in the form of (7.10) at each stage. More specifically, the DSA method consists of three loops. The innermost (third) loop runs steps of SPDT in order to compute an approximate stochastic subgradient of the value function of the third stage. The second loop consists of SPDTs applied to the saddle point formulation of the second-stage problem, which requires the output from the third loop. The outer loop applies SPDTs to the saddle point formulation of the first-stage optimization problem in (7.5), using the approximate stochastic subgradients for computed by the second loop. In this algorithm, we need to generate and realizations of the random vectors and , respectively.
Observe that the DSA algorithm described above is conceptual only, since we have not specified any algorithmic parameters yet. Two sets of parameters will be chosen, depending on the stages where SPDTs are applied. One set of parameters, which may lead to a slightly slower rate of convergence but can guarantee the boundedness of the generated dual iterates, will be used for the second and third stages, while another parameter setting is used in the first stage to achieve a faster rate of convergence. Suppose that , , are generally (not necessarily strongly) convex. By properly specifying the algorithmic parameters, it can be shown that by setting and , we will find an approximate -solution, i.e., a point such that
in at most outer iterations. As a consequence, the number of random samples and used in the DSA method are bounded by
(7.17) |
respectively. Now consider the case when the objective functions , , are strongly convex. In that case the above complexity results can be significantly improved. In particular, by choosing and , we will find an approximate -solution in at most outer iterations. As a result, it can be shown that the number of random samples of and will be bounded by and , i.e., and , respectively.
It should be noted that our analysis of DSA focuses on the optimality of the first-stage decisions, and the decisions we generated for the later stages are mainly used for computing the approximate stochastic subgradients for the value functions at each stage. Except for the first stage decision , the performance guarantees (e.g., feasibility and optimality) that we can provide for later stages are dependent on the sequences of random variables (or scenarios) we generated.
7.4 DSA for general multistage stochastic optimization
In this section, we consider a multistage stochastic optimization problem given by
(7.18) |
where the value functions , , are recursively defined by
(7.19) |
and
(7.20) |
Here are random variables, are relatively simple functions, and are general (not necessarily simple) Lipschitz continuous convex functions. We also assume that one can compute the subgradient of function at any point for a given .
Problem (7.18) is more general than problem (7.2) (or equivalently problem (7.5)) in the following sense. First, we are dealing with a more complicated multistage stochastic optimization problem where the number of stages in (7.18) can be greater than three. Second, the value function in (7.19) is defined as the summation of and , where is not necessarily simple. The DSA algorithm can be generalized for solving problem (7.18). More specifically, we will call the SPDT operators to compute a stochastic -subgradient of at , , in a recursive manner until we obtain the -subgradient of at .
Similar to the three-stage problem, let be the number of iterations for the stage subproblem. For the last stage , we will set and for the generally convex and strongly convex cases, respectively. For the middle stages , we set and . Different algorithmic settings will be used for the generally convex and strongly convex cases. Moreover, a less aggressive stepsize selection will be used for the inner loops, to guarantee the boundedness of the generated dual variables. Under this parameter selection, we can show that the number of outer loops performed by the DSA method to find an approximate -solution can be bounded by and , respectively, for the generally convex and strongly convex problems.
In view of these results, the total number of scenarios required to find an -solution of (7.18) is given by , and hence grows exponentially with respect to , regardless of whether the objective functions are strongly convex or not. These sampling complexity bounds match well the lower bounds in [49, 52], implying that multi-stage stochastic optimization problems are essentially intractable for and a moderate target accuracy. Hence, it is reasonable to use the DSA algorithm only for multi-stage stochastic optimization problems with or and relatively large. However, it is interesting to point out that the DSA algorithm only needs to go through the scenario tree once, and hence its memory requirement increases only linearly with respect to .
7.5 Combined EDDP and DSA for hierarchical problems
In this section, we discuss how EDDP/SDDP type algorithms can be applied together with DSA type algorithms for solving a class of hierarchical stationary stochastic programs (HSSPs). HSSPs can model problems with a hierarchy of decision-making, e.g., how managerial decisions influence day-to-day operations in a factory. More specifically, the upper level decisions are made by solving a stationary stochastic program over an infinite horizon (see Section 5.3.3), where the number of stages , the cost function , the feasible set , and the random variables are iid having the same distribution as . The lower level decisions are determined by a stochastic two-stage program,
(7.21) |
where denotes the random variable independent of involved in the second stage problem.
We extend the dual dynamic programming method to a so-called hierarchical dual dynamic programming by accounting for inexact solutions to the subproblems. To solve the lower-level stochastic multi-stage problem, we approximate it using the SAA approach and solve the resulting problem using a primal-dual stochastic approximation method. This method can be generalized to a DSA-type method when the number of stages is three or more. Our results show that when solving an infinite-horizon hierarchical problem where the top-level decision variable is of modest size (i.e., ) and the lower-level stochastic multistage program has a modest number of stages (2 or 3), then by integrating dual dynamic programming to handle the top-level decisions and a stochastic approximation-type method for the lower-level decisions, we can get a polynomial computational complexity that is independent of the dimension of the lower-level decision variables (see [17] for details).
References
- [1] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9:203–228, 1999.
- [2] E. M. L. Beale. On minimizing a convex function subject to linear inequalities. Journal of the Royal Statistical Society, Series B, 17:173–184, 1955.
- [3] R.E. Bellman. Dynamic Programming. Princeton University Press, 1957.
- [4] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control, The Discrete Time Case. Academic Press, New York, 1978.
- [5] J.R. Birge. Decomposition and partitioning methods for multistage stochastic linear programs. Operations Research, 33:989–1007, 1985.
- [6] J.R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer, New York, 2nd edition, 2011.
- [7] J. F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.
- [8] G.B. Dantzig. Linear programming under uncertainty. Management Science, 1:197–206, 1955.
- [9] L. Ding, S. Ahmed, and A. Shapiro. A python package for multi-stage stochastic programming. Optimization online, 2019.
- [10] M. Dyer and L. Stougie. Computational complexity of stochastic programming problems. Mathematical Programming, 106:423–432, 2006.
- [11] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, Berlin, 2nd edition, 2004.
- [12] C. Füllner and S. Rebennack. Stochastic dual dynamic programming and its variants. Optimization online, 2021.
- [13] P. Girardeau, V. Leclére, and A. B. Philpott. On the convergence of decomposition methods for multistage stochastic convex programs. Mathematics of Operations Research, 40:130–145, 2016.
- [14] V. Guigues, A. Shapiro, and Y. Cheng. Risk-averse stochastic optimal control: an efficiently computable statistical upper bound. Optimization online, 2022.
- [15] V. Guigues, A. Shapiro, and Y. Cheng. Duality and sensitivity analysis of multistage linear stochastic programs. European Journal of Operational Research, 308:752–767, 2023.
- [16] G.A. Hanasusanto, D. Kuhn, and W. Wiesemann. A comment on computational complexity of stochastic programming problems. Mathematical Programming, 159:557–569, 2015.
- [17] C. Ju and G. Lan. Dual dynamic programming for stochastic programs over an infinite horizon. arXiv, 2023.
- [18] J.E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8:703–712, 1960.
- [19] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
- [20] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Mathematical Programming, 149(1-2):1–45, 2015.
- [21] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer Nature, Switzerland AG, 2020.
- [22] G. Lan. Complexity of stochastic dual dynamic programming. Mathematical Programming, 191:717–754, 2022.
- [23] G. Lan, A. S. Nemirovski, and A. Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134:425–458, 2012.
- [24] G. Lan and Z. Zhou. Dynamic stochastic approximation for multi-stage stochastic optimization. Mathematical Programming, 187:487–532, 2021.
- [25] V. Leclére, P. Carpentier, J-P Chancelier, A. Lenoir, and F. Pacaud. Exact converging bounds for Stochastic Dual Dynamic Programming via Fenchel duality. SIAM J. Optimization, 30:1223–1250, 2020.
- [26] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69:111–147, 1995.
- [27] N. Löhndorf and A. Shapiro. Modeling time-dependent randomness in stochastic dual dynamic programming. European Journal of Operational Research, 273:650–661, 2019.
- [28] A. S. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
- [29] A. S. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, XV, 1983.
- [30] Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.
- [31] M.V.F. Pereira and L.M.V.G. Pinto. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1-3):359–375, 1991.
- [32] G. Pflug. Some remarks on the value-at-risk and the conditional value-at-risk. In Probabilistic Constrained Optimization: Methodology and Applications, S. Uryasev (Ed.). Kluwer Academic Publishers, Norwell, MA, 2000.
- [33] A.B. Philpott, V.L. de Matos, and E. Finardi. On solving multistage stochastic programs with coherent risk measures. Operations Research, 61(4):957–970, 2013.
- [34] A. Pichler and A. Shapiro. Mathematical foundations of distributionally robust multistage optimization. SIAM J. Optimization, 31:3044–3067, 2021.
- [35] B.T. Polyak. New stochastic approximation type procedures. Automat. i Telemekh., 7:98–107, 1990.
- [36] B.T. Polyak and A.B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30:838–855, 1992.
- [37] W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons, New York, 2nd edition, 2011.
- [38] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- [39] R.T. Rockafellar. Conjugate Duality and Optimization. Society for Industrial and Applied Mathematics, Philadelphia, 1974.
- [40] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
- [41] R.T. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. J. of Banking and Finance, 26(7):1443–1471, 2002.
- [42] A. Ruszczyński and A. Shapiro. Conditional risk mappings. Mathematics of Operations Research, 31:544–561, 2006.
- [43] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31:433–452, 2006.
- [44] A. Shapiro. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research, 209:63–72, 2011.
- [45] A. Shapiro and Y. Cheng. Central limit theorem and sample complexity of stationary stochastic programs. Operations Research Letters, 49:676–681, 2021.
- [46] A. Shapiro and Y. Cheng. Dual bounds for periodical stochastic programs. Operations Research, 2021.
- [47] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, third edition, 2021.
- [48] A. Shapiro and L. Ding. Periodical multistage stochastic programs. SIAM J. Optimization, 30:2083–2102, 2020.
- [49] A. Shapiro and A. Nemirovski. On complexity of stochastic programming problems. E-print available at: http://www.optimization-online.org, 2004.
- [50] A. Shapiro and A. Nemirovski. On complexity of stochastic programming problems. In V. Jeyakumar and A.M. Rubinov, editors, Continuous Optimization: Current Trends and Applications, pages 111–144. Springer, 2005.
- [51] A. Shapiro, W. Tekaya, J.P. da Costa, and M. Pereira Soares. Risk neutral and risk averse stochastic dual dynamic programming method. European Journal of Operational Research, 224:375–391, 2013.
- [52] A. Shapiro. On complexity of multistage stochastic programs. Operations Research Letters, 34:1–8, 2006.
- [53] S. Zhang and X. Sun. Stochastic dual dynamic programming for multistage stochastic mixed-integer nonlinear optimization. Mathematical Programming, 196:935–985, 2022.
- [54] P.H. Zipkin. Foundation of inventory management. McGraw-Hill, 2000.