
Set-based Value Operators for Non-stationary and Uncertain Markov Decision Processes

Sarah H.Q. Li    Assalé Adjé    Pierre-Loïc Garoche    Behçet Açikmeşe Department of Aeronautics and Astronautics, University of Washington, Seattle, USA. (e-mail:{sarahli, behcet}@uw.edu). LAMPS, Université de Perpignan Via Domitia, Perpignan, France. (e-mail: [email protected]). École Nationale de l’Aviation Civile, Université de Toulouse, Toulouse, France. (e-mail: [email protected]).
Abstract

This paper analyzes finite state Markov Decision Processes (MDPs) with uncertain parameters in compact sets and re-examines results from robust MDPs via set-based fixed point theory. To this end, we generalize the Bellman and policy evaluation operators to contracting operators on the value function space and denote them as value operators. We lift these value operators to act on sets of value functions and denote them as set-based value operators. We prove that the set-based value operators are contractions in the space of compact value function sets. Leveraging insights from set theory, we generalize the rectangularity condition in the classic robust MDP literature to a containment condition for all value operators, which is weaker and can be applied to a larger set of parameter-uncertain MDPs and contracting operators in dynamic programming. We prove that both the rectangularity condition and the containment condition are sufficient to ensure that the set-based value operator’s fixed point set contains its own extremal elements. For convex and compact sets of uncertain MDP parameters, we show equivalence between the classic robust value function and the supremum of the fixed point set of the set-based Bellman operator. Under dynamically changing MDP parameters in compact sets, we prove a set convergence result for value iteration, which otherwise may not converge to a single value function. Finally, we derive novel guarantees for probabilistic path planning problems in planet exploration and stratospheric station-keeping.

keywords:
Markov decision process, contraction operator, stochastic control, decision making and autonomy
This research is partly funded by NSF grant CMMI-210563 and the University of Washington Aero&Astro Condit fellowship. Corresponding author: Sarah H.Q. Li. Email: [email protected].


1 Introduction

The Markov decision process (MDP) is a versatile model for decision making in stochastic environments and is widely used in trajectory planning [1], robotics [20], and operations research [4]. Given state-action costs and transition probabilities, finding an optimal policy of the MDP is equivalent to solving for the fixed point value function of the corresponding Bellman operator.

Many application settings of MDPs, including traffic light control, motion planning, and dexterous manipulation, deal with environmental non-stationarity—dynamically changing MDP cost and transition probabilities due to external factors or the presence of interfering decision makers. This environmental non-stationarity corresponds to uncertainty in the MDP transition and cost parameters and differs from an MDP’s internal stochasticity, which is modeled by stationary stochastic dynamics whose probability distributions do not change over time.

Example 1 (Navigating in changing wind).

An autonomous aircraft navigates in a two-dimensional and time-varying wind field towards a non-stationary target, where the wind field varies between NN major patterns over time. The aircraft’s transition probabilities are constructed from global averages of local wind observations, and the aircraft’s objective is to reach the location of the non-stationary target, which is also affected by wind. If the wind pattern strictly switches between the discrete wind trends, then the transition uncertainty at state s[S]s\in[S] is given by the set 𝒫s={Ps1,,PsN}\mathcal{P}_{s}=\{P^{1}_{s},\ldots,P^{N}_{s}\}. Similarly, the reachability cost of each state-action is also given by 𝒞s={Cs1,,CsN}\mathcal{C}_{s}=\{C^{1}_{s},\ldots,C^{N}_{s}\}. Collectively, we say that the MDP has time-varying parameters s={ms1,,msN}=𝒞s×𝒫s\mathcal{M}_{s}=\{m^{1}_{s},\ldots,m^{N}_{s}\}=\mathcal{C}_{s}\times\mathcal{P}_{s} at each state s[S]s\in[S], where msi=(Csi,Psi)m^{i}_{s}=(C^{i}_{s},P^{i}_{s}).
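To make the parameter set concrete, the following is a minimal Python sketch of the per-state set ℳ_s in Example 1, assuming hypothetical values of N = 2 wind patterns, A = 2 actions, and S = 3 states; all numbers are placeholders rather than data from the paper.

```python
# Minimal sketch of M_s = {m_s^1, ..., m_s^N} from Example 1 (hypothetical numbers).
import numpy as np

S, A, N = 3, 2, 2
# One cost vector C_s^i in R^A and one transition matrix P_s^i in R^{S x A} per wind
# pattern; each column p_{sa} of P_s^i is a distribution over next states (Delta_S).
C_s = [np.array([1.0, 2.0]),
       np.array([1.5, 0.5])]
P_s = [np.array([[0.7, 0.2],
                 [0.2, 0.6],
                 [0.1, 0.2]]),
       np.array([[0.4, 0.1],
                 [0.4, 0.8],
                 [0.2, 0.1]])]
M_s = list(zip(C_s, P_s))       # m_s^i = (C_s^i, P_s^i), i = 1, ..., N
assert all(np.allclose(P.sum(axis=0), 1.0) for _, P in M_s)   # columns lie in Delta_S
print(len(M_s))                 # N = 2 admissible parameter values at state s
```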

Environmental non-stationarity differs from parameter uncertainty and yet is closely related. Parameter-uncertain dynamic programming assumes that the MDP has stationary yet unknown stochastic dynamics within a bounded set, and its performance can be bounded by the worst-case value functions studied in robust MDPs, risk-sensitive reinforcement learning, and zero-sum stochastic games—value functions that result from adversarial selections of the MDP parameters. Under environmental non-stationarity, we assume that at every time instant, the MDP parameters are known but vary unpredictably over time. Environmental non-stationarity is a more appropriate assumption than parameter uncertainty for scenarios such as Example 1, where the dynamics are stochastic, observable, non-adversarial, yet time-varying.

Consider dynamic programming under environmental non-stationarity: at every time instant, the dynamic program is updated with respect to known but time-varying MDP parameters. Although interesting and highly relevant to many trajectory planning problems, this setting lacks classic convergence guarantees: value iteration can fail to converge, as simple examples demonstrate. Does this mean that dynamic programming has no convergence guarantees under environmental non-stationarity?

In this paper, we introduce a set-based framework for non-stationary MDPs that provides convergence guarantees under the Hausdorff distance, and demonstrate that this set-based convergence also applies to parameter-uncertain MDPs and is related to robust dynamic programming.

Contributions. For environmental nonstationarity bounded by compact sets, we propose the set-extensions of value operators: a general class of contraction operators that extends the Bellman operator and the policy evaluation operator. We prove the existence of compact fixed point sets of the set-based value operators and show that the set-based value iteration converges. In a non-stationary Markovian environment, standard value iteration may not converge. However, we can show that the point-to-set distance of the resulting value function trajectory to the fixed point set always goes to zero in the limit. We derive a containment condition that is sufficient for the fixed point sets to contain their own extremal elements. Within robust MDPs, we show that the containment condition generalizes the rectangularity condition, such that the optimal worst-case policy, or the robust policy, exists when the containment condition is satisfied. We then derive the relationship between the fixed point sets of 1) the set-based optimistic policy evaluation operator, 2) the set-based robust policy evaluation operator, and 3) the set-based Bellman operator. Given a value operator and a compact MDP parameter uncertainty set, we present an algorithm that computes the bounds of the corresponding fixed point set and derive its convergence guarantees. Finally, we apply our results to the wind-assisted navigation of high altitude platform systems relevant to space exploration [22] and show that our algorithms can be used to derive policies with better guarantees.

Related research. MDPs with parameter uncertainty are well studied in robust control and reinforcement learning. In control theory, the worst-case cost-to-go with respect to state-decoupled parameter uncertainties is derived via a minmax variation of the Bellman operator in [5, 8, 15, 21]. The cost-to-go under parameter uncertainty with coupling between states and time steps is similarly bounded in [12, 6]. The effect of statistical uncertainty on the optimal cost-to-go is studied in [15, 12, 21, 23]. More recently, MDPs with parameter uncertainty have gained traction in the reinforcement learning community due to the presence of uncertainty in real world problems such as traffic signal control and multi-agent coordination [9, 10, 16]. Most RL research extends the minmax worst-case analysis to methods such as Q-learning and SARSA. Methods for value-based RL using non-contracting operators have also been investigated in [3].

As opposed to the worst-case approach to analyzing MDPs under parameter uncertainty, we do not assume adversarial MDP parameter selection. Instead, we derive a set of cost-to-go vectors that is invariant with respect to the compact parameter uncertainty sets for order-preserving, α\alpha-contracting operators, a class to which the Bellman operator belongs. We continue from our previous work [11], in which we analyzed the set-based Bellman operator for cost uncertainty only.

Notation: A set of NN elements is given by [N]={0,,N1}[N]=\{0,\ldots,N-1\}. We denote the set of matrices of ii rows and jj columns with real (non-negative) entries as i×j{\mathbb{R}}^{i\times j} (+i×j{\mathbb{R}}_{+}^{i\times j}), respectively. Matrices and some integers are denoted by capital letters, XX, while sets are denoted by cursive typeset 𝒳\mathcal{X}. The set of all compact subsets of d{\mathbb{R}}^{d} is denoted by 𝒦(d)\mathcal{K}({\mathbb{R}}^{d}). The column vector of ones of size NN\in\mathbb{N} is denoted by 𝟏N=[1,,1]TN×1\mathbf{1}_{N}=[1,\ldots,1]^{T}\in{\mathbb{R}}^{N\times 1}. The identity matrix of size SS is denoted by ISI_{S}. The simplex of dimension SS is denoted by

ΔS={pS| 1Sp=1,p0}.\Delta_{S}=\{p\in{\mathbb{R}}^{S}\ |\ \mathbf{1}_{S}^{\top}p=1,\ p\geq 0\}. (1)

A vector hSh\in{\mathbb{R}}^{S} has equivalent notation (h1,,hS)(h_{1},\ldots,h_{S}), where hsh_{s} is the value of hh in the sths^{th} coordinate, s[S]s\in[S]. Throughout the paper, \left\lVert\cdot\right\rVert denotes the infinity norm in S{\mathbb{R}}^{S}.

2 Discounted infinite-horizon MDP

A discounted infinite-horizon finite state MDP is given by ([S],[A],([S],[A], P,C,γ)P,C,\gamma), where γ(0,1)\gamma\in(0,1) is the discount factor, [S]={1,,S}[S]=\{1,\ldots,S\} is the finite set of states and [A]={1,,A}[A]=\{1,\ldots,A\} is the finite set of actions. Without loss of generality, assume that each action is admissible from each state s[S]s\in[S].

MDP Costs. CS×AC\in{\mathbb{R}}^{S\times A} is the matrix encoding the MDP cost. Each CsaC_{sa}\in{\mathbb{R}} is the cost of taking action a[A]a\in[A] from state s[S]s\in[S]. We also denote the cost of all actions at state ss by cs=[Cs1,,CsA]Ac_{s}=[C_{s1},\ldots,C_{sA}]\in{\mathbb{R}}^{A}, such that C=[c1,,cS]C=[c_{1},\ldots,c_{S}]^{\top}.

MDP Transition Dynamics. The transition probabilities when action aa is taken from state ss are given by psaΔSp_{sa}\in\Delta_{S}. Collectively, all possible transition probabilities from state s[S]s\in[S] are given by the matrix Ps=[ps1,,psA]ΔSAS×A\textstyle P_{s}=[p_{s1},\ldots,p_{sA}]\in\Delta_{S}^{A}\subset{\mathbb{R}}^{S\times A}, and all possible transition probabilities in the MDP are given by the matrix P=[P1,,PS]ΔSSAS×SAP=[P_{1},\ldots,P_{S}]\in\Delta_{S}^{SA}\subset{\mathbb{R}}^{S\times SA}.

MDP Objective. We want to minimize the expected cost-to-go, or the value vector VSV\in{\mathbb{R}}^{S}, defined per state as

Vs:=𝔼s{t=0γtCstat|s0=s},s[S],\textstyle V_{s}:=\ \mathbb{E}_{s}\Big{\{}\sum_{t=0}^{\infty}\gamma^{t}C_{s^{t}a^{t}}\ |\ s^{0}=s\Big{\}},\ \forall\ s\in[S], (2)

where 𝔼s{}\mathbb{E}_{s}\{\cdot\} is the expected value of the input with respect to initial state ss, and (st,ats^{t},a^{t}) are the state and action at time tt.

Remark 2.

Although value function is the standard term for the expected cost-to-go, we use value vector in this paper to emphasize that the cost-to-go values of finite MDPs belong to a finite-dimensional space.

MDP Policy. The decision maker controls the policy, denoted as π=[π1,,πS]ΔAS\pi=[\pi_{1},\ldots,\pi_{S}]\in\Delta_{A}^{S}, where the atha^{th} element of πsΔA\pi_{s}\in\Delta_{A} is the conditional probability of action aa being chosen from state ss. Using the policy, we can minimize the value vector (2) in a closed-loop fashion.

Vs:=minπtΔAS𝔼s{t=0γtCstπt(st)|s0=s},s[S],\textstyle V_{s}^{\star}:=\ \underset{\pi^{t}\in\Delta_{A}^{S}}{\min}\ \mathbb{E}_{s}\Big{\{}\sum_{t=0}^{\infty}\gamma^{t}C_{s^{t}\pi^{t}(s^{t})}\ |\ s^{0}=s\Big{\}},\ \forall\ s\in[S], (3)

Under policy πs\pi_{s}, the expected immediate cost at ss is given by csπsc_{s}^{\top}\pi_{s}\in{\mathbb{R}} and the expected transition probabilities from ss are given by PsπsΔSP_{s}\pi_{s}\in\Delta_{S}.

2.1 Value operators

Solving an MDP is equivalent to finding the value vector and the associated policy that minimizes the objective (3). Typical solution methods utilize order preserving [19, Def.3.1], α\alpha-contractive operators whose fixed points are the optimal value vectors (e.g. Bellman operator [17, Thm.6.2.3], QQ-value operator [13]).

Definition 3 (α\alpha-Contraction).

Let (𝒳,d)(\mathcal{X},d) be a metric space with metric dd. The operator H:𝒳𝒳H:\mathcal{X}\mapsto\mathcal{X} is an α\alpha-contraction if and only if there exists α[0,1)\alpha\in[0,1) such that

d(H(V),H(V))αd(V,V),V,V𝒳.d(H(V),H(V^{\prime}))\leq\alpha d(V,V^{\prime}),\quad\forall\ V,\ V^{\prime}\in\mathcal{X}. (4)
Definition 4 (Order Preservation).

Let (𝒳,)(\mathcal{X},\leq) be an ordered space with partial order \leq. The operator H:𝒳𝒳H:\mathcal{X}\mapsto\mathcal{X} is order preserving if for all V,V𝒳V,V^{\prime}\in\mathcal{X} such that VVV\leq V^{\prime}, H(V)H(V)H(V)\leq H(V^{\prime}).

These operators are typically locally Lipschitz in MDP parameter space.

Definition 5 (K(V)K(V)-Lipschitz).

Let (𝒳,d𝒳)(\mathcal{X},d_{\mathcal{X}}) be a metric space with metric d𝒳d_{\mathcal{X}} and (𝒴,d𝒴)(\mathcal{Y},d_{{\color[rgb]{0,0,0}\mathcal{Y}}}) be a metric space with metric d𝒴d_{\mathcal{Y}}. The operator H:𝒳×𝒴𝒳H:\mathcal{X}\times\mathcal{Y}\mapsto\mathcal{X} is K(V)K(V)-Lipschitz with respect to 𝒴\mathcal{M}\subset\mathcal{Y} if for all V𝒳V\in\mathcal{X}, there exists K(V)+K(V)\in{\mathbb{R}}_{+} such that

d𝒳(H(V,m),H(V,m))K(V)d𝒴(m,m),m,m.d_{\mathcal{X}}(H(V,m),H(V,m^{\prime}))\leq K(V)d_{\mathcal{Y}}(m,m^{\prime}),\ \forall m,m^{\prime}\in\mathcal{M}. (5)
Remark 6.

The property α\alpha-contraction is a special case of Lipschitz continuity when the input and output spaces are identical and the Lipschitz constant is less than 11.

To capture operators with these properties, we define a value operator that takes inputs: value vector, MDP cost, and MDP transition probability. The MDP cost and transition probability are selected from an MDP parameter set \mathcal{M}.

Definition 7 (Value operator).

Consider the operator hh,

h:S×S,S×A×ΔSSA.h:{\mathbb{R}}^{S}\times\mathcal{M}\mapsto{\mathbb{R}}^{S},\ \mathcal{M}\subseteq{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA}. (6)

We say hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} if

  1. For all mm\in\mathcal{M}, h(,m)h(\cdot,m) is an α\alpha-contraction in S{\mathbb{R}}^{S}.

  2. For all mm\in\mathcal{M}, h(,m)h(\cdot,m) is order preserving in S{\mathbb{R}}^{S}.

  3. For all VSV\in{\mathbb{R}}^{S}, h(V,m)h(V,m) is K(V)K(V)-Lipschitz on \mathcal{M}.

Remark 8.

While we only consider value operators whose input’s first component is S{\mathbb{R}}^{S}, Definition 7 and the subsequent results can be extended to the space of QQ-value functions by swapping S{\mathbb{R}}^{S} for SA{\mathbb{R}}^{SA} in Definition 7 [13].

Figure 1: Illustration of the three value operator properties. (a) α\alpha-contraction on S{\mathbb{R}}^{S}, (b) Order preservation on S{\mathbb{R}}^{S}, and (c) K(V)K(V)-Lipschitz in input space \mathcal{M}.

An immediate consequence of the value operator hh being an α\alpha-contractive and order-preserving operator on S{\mathbb{R}}^{S} is that hh is continuous on S×{\mathbb{R}}^{S}\times\mathcal{M}.

Lemma 9 (Continuity).

If hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M}, hh is continuous on S×{\mathbb{R}}^{S}\times\mathcal{M}.

See App. B for proof. Examples of value operators include the Bellman operator and the policy evaluation operators when the MDP cost and transition probability are input parameters rather than fixed parameters.

Definition 10 (Policy evaluation operator).

Given a policy πΠ\pi\in\Pi, the vector-valued operator gπ=(g1π,,gSπ):S×S×A×ΔSSASg^{\pi}=(g^{\pi}_{1},\ldots,g^{\pi}_{S}):{\mathbb{R}}^{S}\times{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA}\mapsto{\mathbb{R}}^{S} is defined per state as

gsπ(V,C,P):=csπs+γ(Psπs)V,s[S].g^{\pi}_{s}(V,C,P):=c_{s}^{\top}\pi_{s}+\gamma\Big{(}P_{s}\pi_{s}\Big{)}^{\top}V,\ \forall s\in[S]. (7)

Given (C,P)(C,P), gπ(,C,P):SSg^{\pi}(\cdot,C,P):{\mathbb{R}}^{S}\mapsto{\mathbb{R}}^{S} is a vector-valued operator whose fixed point is the expected cost-to-go of the MDP ([S],[A],C,P,γ)([S],[A],C,P,\gamma) under π\pi, denoted as Vπ(C,P)V^{\pi}(C,P) [17, Thm.6.2.5].

Vπ(C,P)=gπ(Vπ,C,P),Vπ(C,P)S.~{}V^{\pi}(C,P)=g^{\pi}(V^{\pi},C,P),\ V^{\pi}(C,P)\in{\mathbb{R}}^{S}. (8)

When the context is clear, we denote Vπ(C,P)V^{\pi}(C,P) as VπV^{\pi}.
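As a concrete illustration, the sketch below implements the policy evaluation operator (7) for a hypothetical 2-state, 2-action MDP and iterates it to approximate the fixed point in (8); the MDP data and the policy are placeholders, not values from the paper.

```python
# Minimal sketch of the policy evaluation operator (7) and its fixed point (8).
import numpy as np

gamma = 0.9
C = np.array([[1.0, 4.0],
              [0.0, 2.0]])                      # C[s, a]: cost of action a at state s
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])        # P[s, a, s']: transition probabilities
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])                     # pi[s, a]: probability of a at s

def g_pi(V, C, P, pi):
    """g^pi_s(V, C, P) = c_s^T pi_s + gamma * (P_s pi_s)^T V."""
    expected_cost = np.sum(C * pi, axis=1)              # c_s^T pi_s for each s
    expected_next = np.einsum("saj,sa->sj", P, pi)      # P_s pi_s for each s
    return expected_cost + gamma * expected_next @ V

V = np.zeros(2)
for _ in range(200):                   # gamma-contraction: iterate to the fixed point
    V = g_pi(V, C, P, pi)
print("V^pi ≈", V)
```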

Definition 11 (Bellman operator).

The vector-valued operator f=(f1,,f=(f_{1},\ldots, fS):S×S×A×ΔSSASf_{S}):{\mathbb{R}}^{S}\times{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA}\mapsto{\mathbb{R}}^{S} is defined per each state as

fs(V,C,P):=infπsΔAgsπ(V,C,P),s[S].f_{s}(V,C,P):=\inf_{\pi_{s}\in\Delta_{A}}\ g^{\pi}_{s}(V,C,P),\ \forall\,s\in[S]. (9)

The corresponding optimal policy π=(π1,,πs)\pi^{\star}=(\pi_{1}^{\star},\ldots,\pi_{s}^{\star}) is defined per state as πsargminπsgsπ(V,C,P)\pi_{s}^{\star}\in\mathop{\rm argmin}_{\pi_{s}}g^{\pi}_{s}(V,C,P) (9). One such policy is defined for all (s,a)[S]×[A](s,a)\in[S]\times[A] by

πsa:={>0aargmina[A]Csa+γpsaV0otherwise,a[A]πsa=1.\pi^{\star}_{sa}:=\begin{cases}>0&a\in\underset{a\in[A]}{\mathop{\rm argmin}}\ C_{sa}+\gamma p_{sa}^{\top}V\\ 0&\text{otherwise}\end{cases},\sum_{a\in[A]}\pi^{\star}_{sa}=1. (10)

where argmina[A](h)\mathop{\rm argmin}_{a\in[A]}(h) is the set of minimizing actions for the function hh. An optimal policy in the form (10) always exists for a discounted infinite horizon MDP [17, Thm 6.2.10]. Given parameters (C,P)(C,P), f(,C,P):SSf(\cdot,C,P):{\mathbb{R}}^{S}\mapsto{\mathbb{R}}^{S} is a vector operator whose fixed point is the optimal cost-to-go for the MDP ([S],[A],P,C,γ)([S],[A],P,C,\gamma), denoted as VB(C,P)V^{B}(C,P).

VB(C,P)=f(VB,C,P),VB(C,P)S.~{}V^{B}(C,P)=f(V^{B},C,P),\ V^{B}(C,P)\in{\mathbb{R}}^{S}. (11)

When the context is clear, we denote VB(C,P)V^{B}(C,P) as VB.V^{B}.
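Similarly, the sketch below implements the Bellman operator (9) and extracts one greedy policy of the form (10), reusing the same hypothetical MDP data; since (9) is a linear program over the simplex ΔA\Delta_{A}, its minimum is attained at a deterministic action, so the code minimizes over actions.

```python
# Minimal sketch of the Bellman operator (9), value iteration, and a greedy policy (10).
import numpy as np

gamma = 0.9
C = np.array([[1.0, 4.0],
              [0.0, 2.0]])                          # C[s, a]
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])            # P[s, a, s']

def bellman(V, C, P):
    """f_s(V, C, P) = min_a ( C[s, a] + gamma * p_{sa}^T V )."""
    Q = C + gamma * P @ V                           # Q[s, a]
    return Q.min(axis=1)

V = np.zeros(C.shape[0])
for _ in range(200):                                # standard value iteration
    V = bellman(V, C, P)

greedy = (C + gamma * P @ V).argmin(axis=1)         # a deterministic policy satisfying (10)
print("V^B ≈", V, " greedy actions:", greedy)
```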

We show that both (7) and (9) are value operators.

Lemma 12.

The Bellman operator (9) and the policy evaluation operators (7) for all πΠ\pi\in\Pi are value operators on S×{\mathbb{R}}^{S}\times\mathcal{M} where S×A×ΔSSA\mathcal{M}\subseteq{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA} (6).

See App. C for proof.

Remark 13.

Beyond the policy evaluation operator and the Bellman operator, many algorithms in reinforcement learning can be reformulated using value operators. For example, it is not difficult to show that the Q-learning operator [13] is a value operator on the vector space SA{\mathbb{R}}^{SA}.

3 Set-based value operators

Motivated by stochastic and time-varying Markovian dynamics, we now consider set-based value operators with respect to a compact set of MDP parameters. We first introduce Hausdorff-type set distances.

Definition 14 (Point-to-set Distance).

The distance between a value vector and a set 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S} is given by

Wd(W,𝒱):=infV𝒱WV.\textstyle W\mapsto d(W,\mathcal{V}):=\underset{V\in\mathcal{V}}{\inf}\left\lVert W-V\right\rVert_{{\color[rgb]{0,0,0}\infty}}. (12)

On the space of compact subsets of S{\mathbb{R}}^{S}, given by 𝒦(S)\mathcal{K}({\mathbb{R}}^{S}), the distance between value vector sets extends (12) and is given by the Hausdorff distance [7].

Definition 15 (Set-to-set Distance).

The Hausdorff distance between two value vector sets 𝒱,𝒲S\mathcal{V},\mathcal{W}\subseteq{\mathbb{R}}^{S} is given by

d𝒦(𝒱,𝒲):=max{supV𝒱d(V,𝒲),supW𝒲d(W,𝒱)}.\displaystyle d_{\mathcal{K}}(\mathcal{V},\mathcal{W}):=\max\left\{\sup_{V\in\mathcal{V}}d(V,\mathcal{W}),\sup_{W\in\mathcal{W}}d(W,\mathcal{V})\right\}. (13)

We use (𝒦(S),d𝒦)(\mathcal{K}({\mathbb{R}}^{S}),d_{\mathcal{K}}) to denote the metric space formed by the set of all compact subsets of S{\mathbb{R}}^{S} under the Hausdorff distance d𝒦d_{\mathcal{K}}. The induced Hausdorff space is complete if and only if the original metric space is complete [7, Thm 3.3]. Therefore, (𝒦(S),d𝒦)(\mathcal{K}({\mathbb{R}}^{S}),d_{\mathcal{K}}) is a complete metric space.
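Both distances can be evaluated directly for finite sets of value vectors; the sketch below is a minimal implementation of (12) and (13) under the infinity norm, with hypothetical two-dimensional data.

```python
# Minimal sketch of the point-to-set distance (12) and the Hausdorff distance (13).
import numpy as np

def point_to_set(W, V_set):
    """d(W, V_set) = inf_{V in V_set} ||W - V||_inf; V_set is an (n, S) array."""
    return np.min(np.max(np.abs(V_set - W), axis=1))

def hausdorff(V_set, W_set):
    """d_K(V_set, W_set) = max( sup_V d(V, W_set), sup_W d(W, V_set) )."""
    d_vw = max(point_to_set(V, W_set) for V in V_set)
    d_wv = max(point_to_set(W, V_set) for W in W_set)
    return max(d_vw, d_wv)

V_set = np.array([[0.0, 0.0], [1.0, 1.0]])    # hypothetical value vector sets in R^2
W_set = np.array([[0.0, 0.5]])
print(hausdorff(V_set, W_set))                # max(1.0, 0.5) = 1.0
```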

For a value operator hh (6), we ask the following question: what is the set of possible value vectors when the MDP has parameter non-stationarity given by \mathcal{M}? To resolve this, we define the set-based value operator HH.

Definition 16 (Set-based Value Operator).

The set-valued operator HH is induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M} (6) and is defined as

H(𝒱):={h(V,m)|(V,m)𝒱×}S,H(\mathcal{V}):=\left\{h(V,m)\ |\ (V,m)\in\mathcal{V}\times\mathcal{M}\right\}\subseteq{\mathbb{R}}^{S}, (14)

where 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S} is a subset of the value vector space.

Figure 2: Illustration of the set-based operator H(𝒱)H(\mathcal{V}) applied to the singleton set 𝒱={V}S\mathcal{V}=\{V\}\subset{\mathbb{R}}^{S}: we compute h(V,m)h(V,m) for every parameter mm\in\mathcal{M} and collect the outputs, such that H(𝒱)=mh(V,m)H(\mathcal{V})=\cup_{m\in\mathcal{M}}h(V,m).

We denote the set-based value operator induced by the Bellman operator (9) and policy evaluation operators (7) as FF and GπG^{\pi}, respectively, such that for any value vector set 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S},

F(𝒱):={f(V,C,P)|(V,C,P)𝒱×},F(\mathcal{V}):=\left\{f(V,C,P)\ |\ (V,C,P)\in\mathcal{V}\times\mathcal{M}\right\}, (15)
Gπ(𝒱):={gπ(V,C,P)|(V,C,P)𝒱×},πΠ.G^{\pi}(\mathcal{V}):=\left\{g^{\pi}(V,C,P)\ |\ (V,C,P)\in\mathcal{V}\times\mathcal{M}\right\},\ \forall\ \pi\in\Pi. (16)

The set-based Bellman operator F(𝒱)F(\mathcal{V}) is the union over all one-step optimal value vectors, which may result from different policies, while Gπ(𝒱)G^{\pi}(\mathcal{V}) is the union over all value vectors that result from the same policy π\pi.

We can ask the following question: is there a set of value vectors that is invariant with respect to HH? Similar to the value operators hh from Definition 7, we can affirmatively answer this question by showing that HH is α\alpha-contractive on 𝒦(S)\mathcal{K}({\mathbb{R}}^{S}).

Theorem 17.

If hh is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} (6) and \mathcal{M} is compact, then the induced set-based value operator HH (14) satisfies

  1. For all 𝒱𝒦(S)\mathcal{V}\in\mathcal{K}({\mathbb{R}}^{S}), H(𝒱)𝒦(S)H\left(\mathcal{V}\right)\in\mathcal{K}({\mathbb{R}}^{S});

  2. HH is α\alpha-contractive on (𝒦(S),d𝒦)(\mathcal{K}({\mathbb{R}}^{S}),d_{\mathcal{K}}) (13) with a unique fixed point set 𝒱\mathcal{V}^{\star} given by

     H(𝒱)=𝒱,𝒱𝒦(S);\textstyle H(\mathcal{V}^{\star})=\mathcal{V}^{\star},\quad\mathcal{V}^{\star}\in\mathcal{K}({\mathbb{R}}^{S}); (17)

  3. The sequence {𝒱k}k\{\mathcal{V}^{k}\}_{k\in\mathbb{N}} where 𝒱k+1=H(𝒱k)\mathcal{V}^{k+1}=H(\mathcal{V}^{k}) converges to 𝒱\mathcal{V}^{\star} for any 𝒱0𝒦(S)\mathcal{V}^{0}\in\mathcal{K}({\mathbb{R}}^{S}).

In particular, these hold for FF (15) and GπG^{\pi} (16), whose fixed point sets are denoted as 𝒱B\mathcal{V}^{B} and 𝒱π\mathcal{V}^{\pi}, respectively.

F(𝒱B)=𝒱B𝒦(S),Gπ(𝒱π)=𝒱π𝒦(S),πΠ.F(\mathcal{V}^{B})=\mathcal{V}^{B}\in\mathcal{K}({\mathbb{R}}^{S}),\ G^{\pi}(\mathcal{V}^{\pi})=\mathcal{V}^{\pi}\in\mathcal{K}({\mathbb{R}}^{S}),\ \forall\pi\in\Pi. (18)
Proof.

The first statement follows from Lemma 9, since the image of a compact set by a continuous function is compact [18]. Let us prove the second statement: we show that for all 𝒱,𝒱𝒦(S)\mathcal{V},\mathcal{V}^{\prime}\in\mathcal{K}({\mathbb{R}}^{S}),

d𝒦(H(𝒱),H(𝒱))\displaystyle d_{\mathcal{K}}(H(\mathcal{V}),H(\mathcal{V}^{\prime}))
=\displaystyle= max{supV𝒱md(h(V,m),H(𝒱)),supV𝒱md(h(V,m),H(𝒱))}\displaystyle\max\left\{\sup_{\begin{subarray}{c}V\in\mathcal{V}\\ m\in\mathcal{M}\end{subarray}}d\left(h(V,m),H(\mathcal{V}^{\prime})\right),\sup_{\begin{subarray}{c}V^{\prime}\in\mathcal{V}^{\prime}\\ m^{\prime}\in\mathcal{M}\end{subarray}}d(h(V^{\prime},m^{\prime}),H(\mathcal{V}))\right\}
\displaystyle\leq αd𝒦(𝒱,𝒱)\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime})

Take (V,m)𝒱×(V,m)\in\mathcal{V}\times\mathcal{M}, then d(h(V,m),H(𝒱))infV𝒱h(V,m)h(V,m)αinfV𝒱VVd(h(V,m),H(\mathcal{V}^{\prime}))\leq\inf_{V^{\prime}\in\mathcal{V}^{\prime}}\left\lVert h(V,m)-h(V^{\prime},m)\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\alpha\inf_{V^{\prime}\in\mathcal{V}^{\prime}}\left\lVert V-V^{\prime}\right\rVert_{{\color[rgb]{0,0,0}\infty}} holds from the α\alpha-contractive property of hh. Finally,

supV𝒱md(h(V,m),H(𝒱))\displaystyle\sup_{\begin{subarray}{c}V\in\mathcal{V}\\ m\in\mathcal{M}\end{subarray}}d(h(V,m),H(\mathcal{V}^{\prime}))\leq αsupV𝒱infV𝒱VV\displaystyle\alpha\sup_{V\in\mathcal{V}}\inf_{V^{\prime}\in\mathcal{V}^{\prime}}\left\lVert V-V^{\prime}\right\rVert_{\infty}
\displaystyle\leq αd𝒦(𝒱,𝒱)\displaystyle\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime})

We use the same technique to prove that

supV𝒱md(h(V,m),H(𝒱))αd𝒦(𝒱,𝒱).\sup_{\begin{subarray}{c}V^{\prime}\in\mathcal{V}^{\prime}\\ m^{\prime}\in\mathcal{M}\end{subarray}}d(h(V^{\prime},m^{\prime}),H(\mathcal{V}))\leq\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime}).

Finally, d𝒦(H(𝒱),H(𝒱))αd𝒦(𝒱,𝒱)d_{\mathcal{K}}(H(\mathcal{V}),H(\mathcal{V}^{\prime}))\leq\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime}). From the Banach fixed point theorem and the completeness of (𝒦(S),d𝒦)({\mathcal{K}}({\mathbb{R}}^{S}),d_{\mathcal{K}}) [7, Thm 3.3], HH has a unique fixed point 𝒱\mathcal{V}^{\star} in 𝒦(S){\mathcal{K}}({\mathbb{R}}^{S}).

The third point is a consequence of the Banach fixed point theorem. Finally, ff and gπg^{\pi} are value operators (6) on S×{\mathbb{R}}^{S}\times\mathcal{M}, therefore this theorem’s statements apply. ∎

Remark 18 (Set-based value iteration).

An important consequence of Theorem 17 is the convergence of the set-based value iteration, given by

𝒱k+1=H(𝒱k),𝒱0𝒦(S).\mathcal{V}^{k+1}=H(\mathcal{V}^{k}),\ \mathcal{V}^{0}\in\mathcal{K}({\mathbb{R}}^{S}). (19)

Analogous to standard value iteration, (19) is a sequence of value vector sets in 𝒦(S)\mathcal{K}({\mathbb{R}}^{S}) that converges to the fixed point set 𝒱𝒦(S)\mathcal{V}^{\star}\in\mathcal{K}({\mathbb{R}}^{S}).
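When \mathcal{M} is a finite set and the initial set 𝒱0\mathcal{V}^{0} is finite, every iterate of (19) is itself a finite set, so the set-based value iteration can be enumerated explicitly. The sketch below does this for the set-based Bellman operator FF (15) with a hypothetical two-element \mathcal{M} and monitors the Hausdorff distance between successive iterates; the naive enumeration grows exponentially in kk and is intended only to illustrate the contraction in Theorem 17.

```python
# Minimal sketch of the set-based value iteration (19) for F (15) with finite M.
import numpy as np
from itertools import product

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]                          # hypothetical two-element parameter set

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

def F(V_set):
    """F(V_set) = { f(V, C, P) : V in V_set, (C, P) in M }."""
    return [bellman(V, C, P) for V, (C, P) in product(V_set, M)]

def hausdorff(A, B):
    d = lambda x, Y: min(np.max(np.abs(np.array(Y) - x), axis=1))
    return max(max(d(a, B) for a in A), max(d(b, A) for b in B))

V_set = [np.zeros(2)]                             # V^0 = {0}
for k in range(10):
    V_next = F(V_set)                             # |V^(k+1)| = |V^k| * |M| in this enumeration
    print(f"k={k:2d}  d_K(V^(k+1), V^k) = {hausdorff(V_next, V_set):.4f}")
    V_set = V_next
```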

4 Properties of the fixed point set

For the MDP parameters (C,P)(C,P), the fixed point of h(,C,P)h(\cdot,C,P) is typically meaningful for the corresponding MDP. For example, the fixed point of a policy evaluation operator gπ(,C,P)g^{\pi}(\cdot,C,P) (7) is the expected cost-to-go under policy π\pi, and the fixed point of the Bellman operator f(,C,P)f(\cdot,C,P) (9) is the minimum cost-to-go when π\pi can be freely chosen. In this section, we derive properties of the fixed point set 𝒱\mathcal{V} of HH (14) in the context of non-stationary value iteration.

4.1 Non-stationary value iteration

Given a value operator hh on S×{\mathbb{R}}^{S}\times\mathcal{M}, we consider value iteration under a dynamic parameter uncertainty model discussed in [15], where at every iteration, a new set of MDP parameters mkm^{k} is chosen from \mathcal{M} as

Vk+1=h(Vk,mk),V0S,mk,k.V^{k+1}=h(V^{k},m^{k}),\ V^{0}\in{\mathbb{R}}^{S},\ m^{k}\in\mathcal{M},\forall k\in\mathbb{N}. (20)

In robust MDP literature [8, 15], mkm^{k} is modified by an adversarial opponent of the MDP decision maker such that (20) converges to a worst-case value vector. We consider a more general scenario in which mkm^{k} is chosen from the closed and bounded set \mathcal{M} without any probabilistic prior. In this scenario, convergence of VkV^{k} in S{\mathbb{R}}^{S} need not occur for arbitrary sequences {mk}k\{m^{k}\}_{k\in{\mathbb{N}}}. However, we can show convergence results on the set domain by leveraging our fixed point analysis of the set-based operator HH (14).
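The sketch below simulates the non-stationary value iteration (20) for the Bellman operator with a hypothetical two-element parameter set \mathcal{M}, drawing mkm^{k} arbitrarily (here, uniformly at random) at each step. The iterates do not settle to a single vector, which is exactly the behavior addressed by the set-based results that follow.

```python
# Minimal sketch of the non-stationary value iteration (20) with arbitrarily chosen m^k.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]                        # hypothetical parameter uncertainty set

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

V = np.zeros(2)
for k in range(60):
    C, P = M[rng.integers(len(M))]              # environment selects m^k; no prior assumed
    V = bellman(V, C, P)
print("V^60 =", V)   # remains bounded but need not converge; see Proposition 19 below
```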

Proposition 19.

Let 𝒱\mathcal{V}^{\star} be the fixed point set of the set-based value operator HH (14) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M} (6). If the non-stationary value iteration (20) satisfies {mk}k\{m^{k}\}_{k\in\mathbb{N}}\subset\mathcal{M}, then the sequence {Vk}k\{V^{k}\}_{k\in\mathbb{N}} defined by (20) satisfies

  1. limk+d(Vk,𝒱)=0\lim_{k\to+\infty}d(V^{k},\mathcal{V}^{\star})=0,

  2. there exists a sub-sequence {Vφ(k)}k\{V^{\varphi(k)}\}_{k\in\mathbb{N}} that converges to a point in 𝒱\mathcal{V}^{\star}, i.e., limkVφ(k)𝒱\lim_{k\rightarrow\infty}V^{\varphi(k)}\in\mathcal{V}^{\star}.

Proof.

Let {𝒱k}k\{\mathcal{V}^{k}\}_{k\in\mathbb{N}} be the sequence of sets defined by 𝒱0={V0}\mathcal{V}^{0}=\{V^{0}\} and 𝒱k+1=H(𝒱k)\mathcal{V}^{k+1}=H(\mathcal{V}^{k}), where HH (14) is the set operator induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. By induction on (20), Vk𝒱kV^{k}\in\mathcal{V}^{k} for all kk\in\mathbb{N}. We first show statement 1). From Theorem 17, 𝒱k\mathcal{V}^{k} converges to 𝒱\mathcal{V}^{\star} in d𝒦d_{\mathcal{K}}. Therefore, 0d(Vk,𝒱)=infy𝒱Vkysupx𝒱kinfy𝒱xyd𝒦(𝒱k,𝒱)00\leq d(V^{k},\mathcal{V}^{\star})=\inf_{y\in\mathcal{V}^{\star}}\left\lVert V^{k}-y\right\rVert_{\infty}\leq\sup_{x\in\mathcal{V}^{k}}\inf_{y\in\mathcal{V}^{\star}}\left\lVert x-y\right\rVert_{\infty}\leq d_{{\color[rgb]{0,0,0}\mathcal{K}}}(\mathcal{V}^{k},\mathcal{V}^{\star})\to 0 as kk tends to ++\infty.

Next, for all kk\in\mathbb{N}, there exists NN\in\mathbb{N} such that for all nNn\geq N, d(Vn,𝒱)(k+1)1d(V^{n},\mathcal{V}^{\star})\leq(k+1)^{-1}. We define the strictly increasing function ψ1:\psi_{1}:\mathbb{N}\to\mathbb{N}, such that ψ1(0)=0\psi_{1}(0)=0 and for all k0k\neq 0, ψ1(k):=min{N>ψ1(k1):nN,d(Vn,𝒱)<(k+1)1}\psi_{1}(k):=\min\{N>\psi_{1}(k-1):\forall n\geq N,\ d(V^{n},\mathcal{V}^{\star})<(k+1)^{-1}\}. Then, for all kk\in\mathbb{N}^{\star}, there exists yψ1(k)𝒱y^{\psi_{1}(k)}\in\mathcal{V}^{\star} such that Vψ1(k)yψ1(k)<(k+1)1\left\lVert V^{\psi_{1}(k)}-y^{\psi_{1}(k)}\right\rVert_{{\color[rgb]{0,0,0}\infty}}<(k+1)^{-1}. As 𝒱\mathcal{V}^{\star} is compact, there exists ψ2:\psi_{2}:\mathbb{N}\to\mathbb{N} strictly increasing such that (yψ1(ψ2(k)))k(y^{\psi_{1}(\psi_{2}(k))})_{k} converges to some y𝒱y^{\star}\in\mathcal{V}^{\star} [18, Thm 3.6]. Finally, let ε>0\varepsilon>0, there exist K1,K2K_{1},K_{2}\in\mathbb{N} such that for all lK1l\geq K_{1}, (ψ2(l))1<ε/2(\psi_{2}(l))^{-1}<\varepsilon/2 and for all lK2l^{\prime}\geq K_{2}, yψ1(ψ2(l))y<ε/2\left\lVert y^{\psi_{1}(\psi_{2}(l^{\prime}))}-y^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}<\varepsilon/2. So, taking kmax{K1,K2}k\geq\max\{K_{1},K_{2}\}, we have Vψ1(ψ2(k))yVψ1(ψ2(k))yψ1(ψ2(k))+yψ1(ψ2(k))yε\left\lVert V^{\psi_{1}(\psi_{2}(k))}-y^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\left\lVert V^{\psi_{1}(\psi_{2}(k))}-y^{\psi_{1}(\psi_{2}(k))}\right\rVert_{{\color[rgb]{0,0,0}\infty}}+\left\lVert y^{\psi_{1}(\psi_{2}(k))}-y^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\varepsilon and (Vψ1(ψ2(k)))k(V^{\psi_{1}(\psi_{2}(k))})_{k} converges to y𝒱y^{\star}\in\mathcal{V}^{\star}. ∎ In addition to containing all asymptotic behavior of value vector trajectories under time-varying value iteration, the fixed point set 𝒱\mathcal{V} also contains all fixed points of the value operator h(,C,P)h(\cdot,C,P) when (C,P)(C,P)\in\mathcal{M} (6) are fixed.

Corollary 20.

Let hh (6) be a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} where \mathcal{M} is compact, and let 𝒱\mathcal{V}^{\star} be the fixed point set of the induced set-based value operator HH (14). For any mm\in\mathcal{M}, if VSV\in{\mathbb{R}}^{S} satisfies V=h(V,m)V=h(V,m), then V𝒱V\in\mathcal{V}^{\star}.

Proof.

We construct the sequence {Vk}\{V^{k}\} where Vk+1=h(Vk,m)V^{k+1}=h(V^{k},m) and V0=VV^{0}=V. Then Vk=VV^{k}=V for all kk\in\mathbb{N}. From the second point of Proposition 19, V𝒱V\in\mathcal{V}^{\star} follows. ∎ Going further, we can bound the transient behavior of (20) when V0V^{0} is an element of the fixed point set 𝒱\mathcal{V}^{\star}.

Corollary 21 (Transient behavior).

Let 𝒱\mathcal{V}^{\star} be the fixed point set of the set-based value operator HH (14) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. If \mathcal{M} is compact and V0𝒱V^{0}\in\mathcal{V}^{\star}, then the sequence generated by (20) satisfies {Vk}k𝒱\{V^{k}\}_{k\in\mathbb{N}}\subseteq\mathcal{V}^{\star}.

Proof.

As a fixed point set of HH (14), 𝒱\mathcal{V}^{\star} (17) satisfies 𝒱=H(𝒱)\mathcal{V}^{\star}=H(\mathcal{V}^{\star}), so the following is true by definition of HH: if Vk𝒱V^{k}\in\mathcal{V}^{\star}, then Vk+1=h(Vk,mk)𝒱V^{k+1}=h(V^{k},m^{k})\in\mathcal{V}^{\star}. If V0𝒱V^{0}\in\mathcal{V}^{\star}, then {Vk}k𝒱\{V^{k}\}_{k\in\mathbb{N}}\subseteq\mathcal{V}^{\star} follows by induction. ∎

Remark 22.

Proposition 19 and Corollary 21 bound the asymptotic and transient behavior of the sequence {h(Vk,mk)}k\{h(V^{k},m^{k})\}_{k\in{\mathbb{N}}} generated from (20), regardless of the convergence of the value vector sequence. This is a more general result than the classic convergence results for MDPs and robust MDPs.

Remark 23.

Corollary 21 also implies that 𝒱\mathcal{V}^{\star} is invariant with respect to the non-stationary value iteration (20), and may prove useful in the analysis and design of MDPs with known parameter uncertainties.

4.2 Bounds of the fixed point set

In Theorem 17, the compactness of \mathcal{M} implied the compactness of 𝒱\mathcal{V}^{\star}. The structure of \mathcal{M} also determines whether 𝒱\mathcal{V}^{\star} contains its own extremal elements: if \mathcal{M} satisfies Assumption 30 with respect to hh, then 𝒱\mathcal{V}^{\star} contains its own supremum and infimum elements.

Greatest and least elements. We define the supremum and infimum elements of a value vector set 𝒱𝒦(S)\mathcal{V}\in\mathcal{K}({\mathbb{R}}^{S}) element-wise as follows,

V¯s:=supV𝒱Vs,V¯s:=infV𝒱Vs,s[S].\overline{V}_{s}:=\sup_{V\in\mathcal{V}}V_{s},\ \underline{V}_{s}:=\inf_{V\in\mathcal{V}}V_{s},\forall\ s\in[S]. (21)
Figure 3: The greatest and least elements of three different value function sets 𝒱i2\mathcal{V}^{i}\subset{\mathbb{R}}^{2}, where the origin (0,0)(0,0) is located at the lower left. Note that 𝒱2\mathcal{V}^{2} and 𝒱3\mathcal{V}^{3} contain their own greatest and least elements, but 𝒱1\mathcal{V}^{1} does not. In 𝒱1\mathcal{V}^{1}, each coordinate-wise greatest and least value is achieved by some element of 𝒱1\mathcal{V}^{1}, but no single element achieves them simultaneously.

If a set 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S} is compact, the projection of 𝒱\mathcal{V} onto each state ss is compact. Then, the coordinate-wise supremum and infimum values for each state ss are achieved by elements of 𝒱\mathcal{V}. However, in general no single element of 𝒱\mathcal{V} simultaneously achieves the supremum (or infimum) over all states; i.e., V¯\overline{V} (V¯\underline{V}) may not be an element of 𝒱\mathcal{V}. This is illustrated in Figure 3.
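A small numerical instance of (21), assuming a hypothetical finite set that mirrors 𝒱1\mathcal{V}^{1} in Figure 3: every coordinate-wise extremum is attained by some member of the set, yet the vectors V¯\overline{V} and V¯\underline{V} themselves are not members.

```python
# Minimal sketch of the coordinate-wise supremum and infimum (21) of a finite set.
import numpy as np

V_set = np.array([[1.0, 3.0],
                  [2.0, 1.0]])                 # hypothetical compact (finite) set in R^2
V_sup = V_set.max(axis=0)                      # [2., 3.]
V_inf = V_set.min(axis=0)                      # [1., 1.]
sup_in = any(np.array_equal(V_sup, v) for v in V_set)
inf_in = any(np.array_equal(V_inf, v) for v in V_set)
print(V_sup, V_inf, sup_in, inf_in)            # neither bound belongs to the set
```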

Given hh and parameter uncertainty set \mathcal{M}, we wish to 1) bound the supremum and infimum elements of the fixed point set 𝒱\mathcal{V}^{\star} (17) and 2) derive sufficient conditions for when they are elements of 𝒱\mathcal{V}^{\star}. To facilitate bounding 𝒱\mathcal{V}^{\star}, we introduce the following bound operators.

Definition 24 (Bound Operators).

The bound operators induced by the value operator hh on S×{\mathbb{R}}^{S}\times\mathcal{M} are coordinate-wise defined at each s[S]s\in[S] as

h¯s(V)=infmhs(V,m),h¯s(V)=supmhs(V,m).\underline{h}_{s}(V)=\inf_{m\in\mathcal{M}}h_{s}(V,m),\ \overline{h}_{s}(V)=\sup_{m\in\mathcal{M}}h_{s}(V,m). (22)
Figure 4: We visualize the bound operators for H(𝒱)H(\mathcal{V}) for a given value operator hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. The input set 𝒱\mathcal{V} is a singleton {V}\{V\} in 2{\mathbb{R}}^{2}. Here, because h¯1\underline{h}_{1} and h¯2\underline{h}_{2} are attained at two different parameters mm\in\mathcal{M}, the resulting h¯(V)\underline{h}(V) lies outside of the fixed point set.
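For a finite parameter set, the bound operators (22) reduce to coordinate-wise minima and maxima over \mathcal{M}, and their fixed points (23) can be computed by the contraction iteration of Lemma 25. A minimal sketch, with the Bellman operator standing in for hh and hypothetical data:

```python
# Minimal sketch of the bound operators (22) and their fixed points (23) over a finite M.
import numpy as np

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

def h_lower(V):
    """Coordinate-wise infimum over M: h_lower_s(V) = min_m h_s(V, m)."""
    return np.min([bellman(V, C, P) for C, P in M], axis=0)

def h_upper(V):
    """Coordinate-wise supremum over M: h_upper_s(V) = max_m h_s(V, m)."""
    return np.max([bellman(V, C, P) for C, P in M], axis=0)

X_lo, X_hi = np.zeros(2), np.zeros(2)
for _ in range(300):                 # both operators are alpha-contractions (Lemma 25)
    X_lo, X_hi = h_lower(X_lo), h_upper(X_hi)
print("X_lower ≈", X_lo, " X_upper ≈", X_hi)   # bounds on the fixed point set, cf. (24)
```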

We want to bound the fixed point set 𝒱\mathcal{V}^{\star} of the set-based value operator HH (14) by the bound operators h¯/h¯\underline{h}/\overline{h} (22). First, we show that h¯/h¯\underline{h}/\overline{h} are themselves α\alpha-contractive and order preserving on S{\mathbb{R}}^{S}.

Lemma 25 (α\alpha-Contraction).

If hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} and \mathcal{M} is compact, then h¯\underline{h} and h¯\overline{h} (22) are α\alpha-contractions with fixed points X¯,X¯\underline{X},\overline{X}, respectively.

h¯(X¯)=X¯,h¯(X¯)=X¯,X¯,X¯S.\textstyle\overline{h}(\overline{X})=\overline{X},\quad\underline{h}(\underline{X})=\underline{X},\ \underline{X},\overline{X}\in{\mathbb{R}}^{S}. (23)
Proof.

From Lemma 9, hh is continuous and \mathcal{M} is compact, then for all X,YSX,Y\in{\mathbb{R}}^{S}, there exists m^(s)\hat{m}(s)\in\mathcal{M} such that h¯s(Y)=hs(Y,m^(s))\underline{h}_{s}(Y)=h_{s}(Y,\hat{m}(s)) and h¯s(X)hs(X,m^(s))\underline{h}_{s}(X)\leq h_{s}(X,\hat{m}(s)). We upper-bound h¯s(X)h¯s(Y)\underline{h}_{s}(X)-\underline{h}_{s}(Y) by hs(X,m^(s))hs(Y,m^(s))h_{s}(X,\hat{m}(s))-h_{s}(Y,\hat{m}(s)), and use the α\alpha-contraction property of hh to derive

h¯s(X)h¯s(Y)\displaystyle\underline{h}_{s}(X)-\underline{h}_{s}(Y) |hs(X,m^(s))hs(Y,m^(s))|\displaystyle\leq|h_{s}(X,\hat{m}(s))-h_{s}(Y,\hat{m}(s))|
αXY.\displaystyle\leq\alpha\left\lVert X-Y\right\rVert_{\infty}.

Since the roles of XX and YY can be exchanged, we conclude that h¯(X)h¯(Y)αXY\left\lVert\underline{h}(X)-\underline{h}(Y)\right\rVert_{\infty}\leq\alpha\left\lVert X-Y\right\rVert_{\infty}. The proof for h¯\overline{h} follows a similar reasoning with m^(s)argmaxmhs(X,m)\hat{m}(s)\in{\color[rgb]{0,0,0}\mathop{\rm argmax}}_{m\in\mathcal{M}}h_{s}(X,m). The existence of X¯(X¯)\underline{X}(\overline{X}) follows from applying Banach’s fixed point theorem. ∎

Lemma 26 (Order Preservation).

The bound operators h¯\underline{h} and h¯\overline{h} (22) are order-preserving on S{\mathbb{R}}^{S} (Definition 4).

U,VS,UVh¯(U)h¯(V),h¯(U)h¯(V).\forall\ U,V\in{\mathbb{R}}^{S},\ U\leq V\Rightarrow\underline{h}(U)\leq\underline{h}(V),\quad\overline{h}(U)\leq\overline{h}(V).
Proof.

The lemma statement follows directly from the fact that taking the infimum and supremum over mm\in\mathcal{M} preserves order. If h(U,m)h(V,m)h(U,m)\leq h(V,m) for all mm\in\mathcal{M}, then infmh(U,m)infmh(V,m)\inf_{m\in\mathcal{M}}h(U,m)\leq\inf_{m\in\mathcal{M}}h(V,m). A similar argument follows for h¯()=supmh(,m)\overline{h}(\cdot)=\sup_{m\in\mathcal{M}}h(\cdot,m). ∎ Next, we show that the fixed points X¯\underline{X} and X¯\overline{X} bound the fixed point set 𝒱\mathcal{V}^{\star} of the set-based value operator HH (14).

Theorem 27 (Bounding fixed point sets).

If hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} and \mathcal{M} is compact,

X¯VX¯,V𝒱,\underline{X}\leq V\leq\overline{X},\ \forall\ V\in\mathcal{V}^{\star}, (24)

where X¯\underline{X} and X¯\overline{X} (23) are the fixed points of the bound operators h¯\underline{h} and h¯\overline{h} (22), and 𝒱\mathcal{V}^{\star} is the fixed point set of the set-based value operator HH (14) induced by hh (6) on S×{\mathbb{R}}^{S}\times\mathcal{M}.

Proof.

For 𝒱0={X¯,X¯}\mathcal{V}^{0}=\{\underline{X},\overline{X}\} and 𝒱k+1=H(𝒱k)\mathcal{V}^{k+1}=H(\mathcal{V}^{k}) (19), we first show

X¯VX¯,V𝒱k,\textstyle\underline{X}\leq V\leq\overline{X},\ \forall\ V\in\mathcal{V}^{k}, (25)

via induction. Suppose that (25) is satisfied for 𝒱k\mathcal{V}^{k}. The order preserving property of h(,m)h(\cdot,m) implies that h(X¯,m)h(V,m)h(X¯,m)h(\underline{X},m)\leq h(V,m)\leq h(\overline{X},m) holds for all (V,m)𝒱k×(V,m)\in\mathcal{V}^{k}\times\mathcal{M}. We take the infimum and supremum over h(X¯,m)h(\underline{X},m) and h(X¯,m)h(\overline{X},m), respectively, to show that for all (V,m)𝒱k×(V,m)\in\mathcal{V}^{k}\times\mathcal{M} and s[S]s\in[S],

infmhs(X¯,m)hs(V,m)supmhs(X¯,m).\inf_{m^{\prime}\in\mathcal{M}}h_{s}(\underline{X},m^{\prime})\leq h_{s}(V,m)\leq\sup_{m^{\prime}\in\mathcal{M}}h_{s}(\overline{X},m^{\prime}).

Since X¯\underline{X} and X¯\overline{X} are the fixed points of infmhs(,m)\inf_{m^{\prime}\in\mathcal{M}}h_{s}(\cdot,m^{\prime}) and supmhs(,m)\sup_{m^{\prime}\in\mathcal{M}}h_{s}(\cdot,m^{\prime}) for all s[S]s\in[S], respectively, we conclude that (25) holds for 𝒱k+1\mathcal{V}^{k+1}.

Next, we show that X¯\underline{X} and X¯\overline{X} bound the fixed point set 𝒱\mathcal{V}^{\star} for the hh-induced operator HH (14). From Lemma 45, we know that for all V𝒱V\in\mathcal{V}^{\star}, there exists a strictly increasing sequence ϕ:\phi:\mathbb{N}\mapsto\mathbb{N} and corresponding value vectors {Wϕ(n)}\{W^{\phi(n)}\} such that limnWϕ(n)=V\lim_{n\to\infty}W^{\phi(n)}=V and Wϕ(n)𝒱ϕ(n)W^{\phi(n)}\in\mathcal{V}^{\phi(n)} for the sequence of value vector sets generated from 𝒱0={X¯,X¯}\mathcal{V}^{0}=\{\underline{X},\overline{X}\}. Since X¯Wϕ(n)X¯\underline{X}\leq W^{\phi(n)}\leq\overline{X} holds for all nn, we conclude (24) holds. ∎

5 Revisiting robust MDP

In this section, we re-examine robust MDPs using the set-theoretic analysis developed above, and show that Assumption 30 generalizes the rectangularity assumption made in robust MDPs, thus making robust dynamic programming techniques available to a wider class of MDP problems and contraction operators.

Recall the optimistic value vector WoSW^{o}\in{\mathbb{R}}^{S} and the robust value vector WrSW^{r}\in{\mathbb{R}}^{S} of a discounted MDP ([S],[A],C,P,γ)([S],[A],C,P,\gamma) from [8, 15] as the fixed points of the following operators.

Wso=minπsΔAmin(C,P)gsπ(Wo,C,P),s[S]W^{o}_{s}=\min_{\pi_{s}\in\Delta_{A}}\min_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(W^{o},C,P),\forall s\in[S] (26)
Wsr=minπsΔAmax(C,P)gsπ(Wr,C,P),s[S]W^{r}_{s}=\min_{\pi_{s}\in\Delta_{A}}\max_{(C,P)\in\mathcal{M}}g_{s}^{\pi}(W^{r},C,P),\forall s\in[S] (27)

The optimistic policy πo\pi^{o} and robust policy πr\pi^{r} are the optimal policies corresponding to (26) and (27), respectively.

πsoargminπsΔAmin(C,P)gsπ(Wo,C,P),s[S]\pi^{o}_{s}\in\mathop{\rm argmin}_{\pi_{s}\in\Delta_{A}}\min_{(C,P)\in\mathcal{M}}g_{s}^{\pi}(W^{o},C,P),\forall s\in[S] (28)
πsrargminπsΔAmax(C,P)gsπ(Wr,C,P),s[S]\pi^{r}_{s}\in\mathop{\rm argmin}_{\pi_{s}\in\Delta_{A}}\max_{(C,P)\in\mathcal{M}}g_{s}^{\pi}(W^{r},C,P),\forall s\in[S] (29)

For readability, we denote the policy evaluation operator (7) under πo\pi^{o} as gog^{o} and the policy evaluation operator (7) under πr\pi^{r} as grg^{r}.

When \mathcal{M} is (s,a)(s,a)-rectangular (39), the set of policies satisfying (28) and (29) is non-empty and includes deterministic policies [8, Thm 3.1]. When \mathcal{M} is ss-rectangular and convex, the set of policies satisfying (29) is non-empty but may consist of mixed policies [21, Thm 4]. When \mathcal{M} is compact and convex, we show that the policies (28) and (29) exist.

Proposition 28.

If the MDP parameter set \mathcal{M} is compact and convex, then

  1. WoW^{o} (26) and WrW^{r} (27) exist and satisfy f¯(Wr)=Wr,f¯(Wo)=Wo\overline{f}(W^{r})=W^{r},\ \underline{f}(W^{o})=W^{o}, where f¯\overline{f} and f¯\underline{f} (22) are the bound operators of the Bellman operator (9).

  2. πo\pi^{o} (28) and πr\pi^{r} (29) exist.

Proof.

Recall the Bellman operator ff (9). When ×ΔA\mathcal{M}\times\Delta_{A} is compact, the formulation of the fixed point of f¯\underline{f} (22) is equivalently given by

f¯s(X¯)=min(C,P)minπsΔAgsπ(X¯,C,P),s[S].{\color[rgb]{0,0,0}\underline{f}_{s}}(\underline{X})=\min_{(C,P)\in\mathcal{M}}\min_{\pi_{s}\in\Delta_{A}}g^{\pi}_{s}(\underline{X},C,P),\ \forall s\in[S]. (30)

We note that (30) is identical to the formulation of WoW^{o} (26). Therefore, Wo=X¯W^{o}=\underline{X} is the fixed point of f¯\underline{f}. When \mathcal{M} is compact, WoW^{o} exists due to Lemma 25. From (28), πso\pi^{o}_{s} is the optimal argument of gsπ(Wo,C,P)g^{\pi}_{s}(W^{o},C,P), a continuous function in πs,C,P\pi_{s},C,P minimized over compact sets ΔA×\Delta_{A}\times\mathcal{M} for all s[S]s\in[S]. Therefore πso\pi^{o}_{s} exists. Since πo=(π1o,,πSo)\pi^{o}=(\pi^{o}_{1},\ldots,\pi^{o}_{S}), the optimal πoΠ\pi^{o}\in\Pi exists.

For the robust scenario: when \mathcal{M} is compact, the fixed point of f¯\overline{f} (22), X¯\overline{X}, exists from Lemma 25 and is given by

X¯s=max(C,P)minπsΔAgsπ(X¯,C,P),s[S].\overline{X}_{s}=\max_{(C,P)\in\mathcal{M}}\min_{\pi_{s}\in\Delta_{A}}g^{\pi}_{s}(\overline{X},C,P),\ \forall s\in[S]. (31)

The function gsπ(X¯,C,P)g_{s}^{\pi}(\overline{X},C,P) is concave in (C,P)(C,P) and convex in π\pi. If \mathcal{M} is convex, then we apply the minimax theorem [14] to switch the order of min\min and max\max in (31) to derive

X¯s=minπsΔAmax(C,P)gsπ(X¯,C,P),s[S].\overline{X}_{s}=\min_{\pi_{s}\in\Delta_{A}}\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P),\ \forall s\in[S]. (32)

Equation (32) is identical to (27), therefore Wr=X¯W^{r}=\overline{X} and exists by Lemma 25. In (32), max(C,P)gsπ(X¯,C,P)\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P) is piece-wise linear in πs\pi_{s} and ΔA\Delta_{A} is compact for all s[S]s\in[S], thus argminπsΔA\mathop{\rm argmin}_{\pi_{s}\in\Delta_{A}} max(C,P)gsπ(X¯,C,P)\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P) is non-empty. Finally, since πr=(π1r,,πSr)\pi^{r}=(\pi^{r}_{1},\ldots,\pi^{r}_{S}), πr\pi^{r} exists. ∎
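A minimal sketch of the first statement of Proposition 28 for a hypothetical finite \mathcal{M}: the optimistic value vector WoW^{o} is computed as the fixed point of f¯\underline{f} via (30), and the fixed point X¯\overline{X} of f¯\overline{f} is computed via (31). The identification Wr=X¯W^{r}=\overline{X} in (32) additionally requires \mathcal{M} to be convex, which a two-element set is not, so the computed upper fixed point should be read only as the fixed point of (31).

```python
# Minimal sketch: fixed points of the Bellman bound operators (30) and (31) for finite M.
import numpy as np

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]

def bellman(V, C, P):                 # min over actions = min over pi_s in Delta_A
    return (C + gamma * P @ V).min(axis=1)

W_o = np.zeros(2)                     # iterate (30): min over M of the action-wise minimum
X_up = np.zeros(2)                    # iterate (31): max over M of the action-wise minimum
for _ in range(300):
    W_o = np.min([bellman(W_o, C, P) for C, P in M], axis=0)
    X_up = np.max([bellman(X_up, C, P) for C, P in M], axis=0)
print("W^o ≈", W_o, " X_bar ≈", X_up)
```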

Remark 29.

Since max(C,P)gsπ(X¯,C,P)\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P) is piecewise linear in πs\pi_{s}, the optimal πsr\pi_{s}^{r} is a mixed policy in general. This is consistent with the results in [21].

Proposition 28 generalizes the results from [21] to show that (27) exists when \mathcal{M} is compact and convex instead of ss-rectangular and convex. By the definitions (26)-(29), WoW^{o} and WrW^{r} are the fixed points of the bound operators g¯πo\underline{g}^{\pi^{o}} and g¯πr\overline{g}^{\pi^{r}} (22), respectively. They become infimum and supremum elements when \mathcal{M} satisfies Assumption 30 with respect to gog^{o} and grg^{r}. We explicitly derive this result next. First, we introduce some notation: let Go=GπoG^{o}=G^{\pi^{o}} with fixed point set 𝒱o\mathcal{V}^{o}, and Gr=GπrG^{r}=G^{\pi^{r}} with fixed point set 𝒱r\mathcal{V}^{r}.

𝒱o={go(V,C,P)|(C,P),V𝒱o},\mathcal{V}^{o}=\{g^{o}(V,C,P)\ |\ (C,P)\in\mathcal{M},V\in\mathcal{V}^{o}\}, (33)
𝒱r={gr(V,C,P)|(C,P),V𝒱r}.\mathcal{V}^{r}=\{g^{r}(V,C,P)\ |\ (C,P)\in\mathcal{M},V\in\mathcal{V}^{r}\}. (34)

Additionally, the supremum elements of 𝒱o\mathcal{V}^{o} and 𝒱r\mathcal{V}^{r} are V¯o\overline{V}^{o} and V¯r\overline{V}^{r} respectively and the infimum elements are V¯o\underline{V}^{o} and V¯r\underline{V}^{r}, respectively.

V¯sr=minV𝒱rVs,V¯sr=maxV𝒱rVs,s[S].\underline{V}_{s}^{r}=\min_{V\in\mathcal{V}^{r}}V_{s},\ \overline{V}_{s}^{r}=\max_{V\in\mathcal{V}_{r}}V_{s},\ \forall s\in[S]. (35)
V¯so=minV𝒱oVs,V¯so=maxV𝒱oVs,s[S].\underline{V}_{s}^{o}=\min_{V\in\mathcal{V}^{o}}V_{s},\ \overline{V}_{s}^{o}=\max_{V\in\mathcal{V}^{o}}V_{s},\ \forall s\in[S]. (36)

We compare these with the fixed point set of the set-based Bellman operator, 𝒱B={minπgπ(V,C,P)|(C,P),V𝒱B}\mathcal{V}^{B}=\{\min_{\pi}g^{\pi}(V,C,P)\ |\ (C,P)\in\mathcal{M},V\in\mathcal{V}^{B}\} (17), whose supremum and infimum elements V¯B\overline{V}^{B} and V¯B\underline{V}^{B} are defined as

V¯sB=minV𝒱BVs,V¯sB=maxV𝒱BVs,s[S].\underline{V}_{s}^{B}=\min_{V\in\mathcal{V}^{B}}{\color[rgb]{0,0,0}V_{s}},\ \overline{V}_{s}^{B}=\max_{V\in\mathcal{V}^{B}}V_{s},\ \forall s\in[S]. (37)

6 Fixed-point set containing its infimum/supremum

We make the following assumption on the MDP parameter set \mathcal{M} with respect to hh.

Assumption 30 (Containment condition).

The MDP parameter set \mathcal{M} satisfies the containment condition with respect to hh if \mathcal{M} is compact and for all VSV\in{\mathbb{R}}^{S},

s[S]argminmhs(V,m),s[S]argmaxmhs(V,m).\bigcap_{s\in[S]}\mathop{\rm argmin}_{m\in\mathcal{M}}h_{s}(V,m)\neq\emptyset,\ \bigcap_{s\in[S]}\mathop{\rm argmax}_{m\in\mathcal{M}}h_{s}(V,m)\neq\emptyset. (38)
Refer to caption
Figure 5: We illustrate argmaxmhs(V,m)\mathop{\rm argmax}_{m\in\mathcal{M}}h_{s}(V,m) for a value operator hh when S=2S=2. Here, argmaxmh1(V,m)={m2,m3}\mathop{\rm argmax}_{m\in\mathcal{M}}h_{1}(V,m)=\{m_{2},m_{3}\}, argmaxmh2(V,m)={m1,m2}\mathop{\rm argmax}_{m\in\mathcal{M}}h_{2}(V,m)=\{m_{1},m_{2}\}. Therefore, m2m_{2} is the common parameter that achieves maxmhs(V,m)\max_{m\in\mathcal{M}}h_{s}(V,m) for all s[S]s\in[S].
Remark 31.

Assumption 30 is an hh-dependent condition imposed on the structure of \mathcal{M}, and is independent of \mathcal{M}’s convexity and connectivity.
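For a finite parameter set, the containment condition (38) can be checked directly at a given VV by intersecting the per-state argmin and argmax index sets over \mathcal{M}; Assumption 30 requires the check to hold for every VSV\in{\mathbb{R}}^{S}. A minimal sketch, with the Bellman operator as the value operator and hypothetical data:

```python
# Minimal sketch: checking the containment condition (38) at a single V for finite M.
import numpy as np

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

def containment_holds_at(V, tol=1e-12):
    H = np.array([bellman(V, C, P) for C, P in M])   # H[i, s] = h_s(V, m^i)
    argmin_s = [set(np.flatnonzero(H[:, s] <= H[:, s].min() + tol)) for s in range(H.shape[1])]
    argmax_s = [set(np.flatnonzero(H[:, s] >= H[:, s].max() - tol)) for s in range(H.shape[1])]
    return bool(set.intersection(*argmin_s)) and bool(set.intersection(*argmax_s))

print(containment_holds_at(np.zeros(2)))   # Assumption 30 requires this for all V in R^S
```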

6.1 Containment-satisfying MDP parameter sets

Assumption 30 restricts the structure of \mathcal{M} with respect to the value operator hh. Thus whether or not \mathcal{M} satisfies Assumption 30 must always be determined with respect to the operator hh. With respect to the Bellman operator ff (9) and the policy evaluation operators gπg^{\pi} (7), the following conditions in robust MDP are sufficient to satisfy Assumption 30.

Definition 32 ((s,a)(s,a)-rectangular sets [8, 15]).

The uncertainty set S×A×ΔSSA\mathcal{M}\subset{\mathbb{R}}^{S\times A}\times\Delta^{SA}_{S} is (s,a)(s,a)-rectangular if

=(s,a)[S]×[A]sa,sa×ΔS,(s,a)[S]×[A].\mathcal{M}=\bigtimes_{(s,a)\in[S]\times[A]}\mathcal{M}_{sa},\ \mathcal{M}_{sa}\subset{\mathbb{R}}\times\Delta_{S},\ \forall(s,a)\in[S]\times[A]. (39)

Intuitively, (s,a)(s,a)-rectangularity implies that the MDP parameter uncertainty is decoupled across state-action pairs. A more general condition only requires the uncertainty to be decoupled across different states, while allowing coupling between actions within the same state.

Definition 33 (ss-rectangular sets).

The uncertainty set S×A×ΔSSA\mathcal{M}\subset{\mathbb{R}}^{S\times A}\times\Delta^{SA}_{S} is ss-rectangular if

=s[S]s,sA×ΔSA,s[S].\mathcal{M}=\bigtimes_{s\in[S]}\mathcal{M}_{s},\ \mathcal{M}_{s}\subset{\mathbb{R}}^{A}\times\Delta^{A}_{S},\ \forall s\in[S]. (40)

ss-rectangularity generalizes (s,a)(s,a)-rectangularity—i.e. (s,a)(s,a)-rectangularity implies ss-rectangularity.

Example 34 (Wind uncertainty).

Consider the navigation problem presented in Example 1. If the wind pattern strictly switches between the discrete wind trends, then the transition uncertainty at state s[S]s\in[S] is 𝒫s={Ps1,,PsN}\mathcal{P}_{s}=\{P^{1}_{s},\ldots,P^{N}_{s}\}. If the wind pattern is a mixture of the discrete wind trends, the transition uncertainty at state s[S]s\in[S] is 𝒫s={iαiPsi|αΔN}\mathcal{P}_{s}=\{\sum_{i}\alpha_{i}P^{i}_{s}\ |\ \alpha\in\Delta_{N}\}. Both wind patterns lead to ss-rectangular uncertainty, given by 𝒫=s[S]𝒫s\mathcal{P}=\bigtimes_{s\in[S]}\mathcal{P}_{s}.
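A minimal sketch of the ss-rectangular structure (40): the overall uncertainty set is the Cartesian product of per-state sets, so parameters are selected independently at each state. The per-state entries below are string placeholders standing in for the (cs,Ps)(c_{s},P_{s}) pairs of two hypothetical wind patterns.

```python
# Minimal sketch of an s-rectangular uncertainty set (40) built as a Cartesian product.
from itertools import product

M_s1 = [("c_s1_pattern1", "P_s1_pattern1"), ("c_s1_pattern2", "P_s1_pattern2")]
M_s2 = [("c_s2_pattern1", "P_s2_pattern1"), ("c_s2_pattern2", "P_s2_pattern2")]

M = list(product(M_s1, M_s2))   # each element picks one per-state parameter independently
print(len(M))                   # |M| = |M_s1| * |M_s2| = 4
```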

We show that the rectangularity conditions indeed are sufficient for satisfying Assumption 30 with respect to ff (9) and gπg^{\pi} (7).

Proposition 35.

If \mathcal{M} is compact and ss-rectangular (Definition 33), \mathcal{M} satisfies Assumption 30 with respect to ff (9) and gπg^{\pi} (7) for all πΠ\pi\in\Pi.

Proof.

We first show that \mathcal{M} satisfies Assumption 30 with respect to the Bellman operator. Given s[S]s\in[S], fs(V,C,P)f_{s}(V,C,P) only depends on the ss component of CC and PP. From Lemma 9, fsf_{s} is continuous in (cs,Ps)(c_{s},P_{s}). Since s\mathcal{M}_{s} is compact, for each s[S]s\in[S] there exists (cs,Ps)argmin(cs,Ps)sfs(V,C,P)(c_{s}^{\star},P_{s}^{\star})\in\mathop{\rm argmin}_{(c_{s},P_{s})\in\mathcal{M}_{s}}f_{s}(V,C,P) with (cs,Ps)s(c_{s}^{\star},P^{\star}_{s})\in\mathcal{M}_{s}. We can construct C=[c1,,cS]C^{\star}=[c_{1}^{\star},\ldots,c^{\star}_{S}] and P=[P1,,PS]P^{\star}=[P_{1}^{\star},\ldots,P_{S}^{\star}]. If \mathcal{M} is ss-rectangular, then (C,P)(C^{\star},P^{\star})\in\mathcal{M} and (C,P)argmin(C,P)fs(V,C,P)(C^{\star},P^{\star})\in\mathop{\rm argmin}_{{\color[rgb]{0,0,0}(C,P)}\in\mathcal{M}}f_{s}(V,C,P) for all s[S]s\in[S]. An identical argument with argmax\mathop{\rm argmax} in place of argmin\mathop{\rm argmin} yields the second intersection in (38). We conclude that \mathcal{M} satisfies Assumption 30 with respect to ff.

Given πΠ\pi\in\Pi and s[S]s\in[S], gsπg^{\pi}_{s} only depends on csc_{s} and PsP_{s} as well. We can similarly show that there exists an optimal parameter (C,P)argmin(C,P)gsπ(V,C,P)(C^{\star},P^{\star})\in\mathop{\rm argmin}_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(V,C,P) for all s[S]s\in[S] such that (C,P)(C^{\star},P^{\star})\in\mathcal{M}. ∎ Beyond ss-rectangularity, there are sets that satisfy Assumption 30 with respect to specific value operators.

Example 36 (Beyond rectangularity).
Refer to caption
Figure 6: MDP with parameter coupling in transition probability across different states.

In Figure 6, we visualize a four state MDP with transition uncertainty \mathcal{M} parameterized by α\alpha. MDP states are the nodes and MDP actions are the arrows. Actions that transition to multiple states are visualized by multi-headed arrows. Each head has an associated tuple (csa,psa,s)(c_{sa},p_{sa,s^{\prime}}) denoting its state-action cost and transition probability. All states have a single action except for state s4s_{4}, where two actions exist and are distinguished by different colors. Both s2s_{2} and s3s_{3} are absorbing states with a unique action, such that V2=11γV_{2}=\frac{1}{1-\gamma} and V3=0V_{3}=0 for both ff and gπg^{\pi} for all πΠ\pi\in\Pi, where γ\gamma is the discount factor.

The states s1s_{1} and s4s_{4} have transition uncertainty parametrized by α[0,1]\alpha\in[0,1]. Therefore, \mathcal{M} violates ss-rectangularity (Definition 33). The optimal cost-to-go values V1V_{1} and V4V_{4} occur at different α\alpha’s. Therefore, \mathcal{M} violates Assumption 30 with respect to ff. However, suppose that at s4s_{4}, we only consider policies that exclusively choose the action colored green in Fig. 6. Then the expected cost-to-go at s4s_{4}, V4V_{4}, is independent of α\alpha. The minimum and maximum values of V1V_{1} under π\pi occur at α=1\alpha=1 and α=0\alpha=0, respectively. Therefore, \mathcal{M} satisfies Assumption 30 with respect to operator gπg^{\pi} for all π=[πs1,,πs4]\pi=[\pi_{s_{1}},\ldots,\pi_{s_{4}}] where πs4=[1,0]\pi_{s_{4}}=[1,0].

When Assumption 30 is satisfied, the fixed point of HH (14) contains its own supremum and infimum.

Theorem 37.

If hh (6) on S×{\mathbb{R}}^{S}\times\mathcal{M} satisfies Assumption 30, then there exist m¯,m¯\underline{m},\overline{m}\in\mathcal{M} such that h¯\underline{h} and h¯\overline{h} (22) and their fixed points X¯\underline{X} and X¯\overline{X} (23) satisfy

h¯(X¯)=h(X¯,m¯)=X¯,h¯(X¯)=h(X¯,m¯)=X¯.\underline{h}(\underline{X})=h(\underline{X},\underline{m})=\underline{X},\ \overline{h}(\overline{X})=h(\overline{X},\overline{m})=\overline{X}. (41)

Additionally, X¯\underline{X} and X¯\overline{X} are the least and the greatest elements of HH’s fixed point set 𝒱\mathcal{V}^{\star}, V¯,V¯\underline{V}^{\star},\overline{V}^{\star} (21) respectively, and both belong to 𝒱\mathcal{V}^{\star} (17).

X¯=V¯,X¯=V¯,X¯,X¯𝒱.\textstyle\underline{X}=\underline{V}^{\star},\ \overline{X}=\overline{V}^{\star},\ \underline{X},\overline{X}\in\mathcal{V}^{\star}.
Proof.

From Theorem 27, X¯\underline{X} and X¯\overline{X} are the lower and upper bounds on the fixed point set 𝒱\mathcal{V}^{\star}. We show that these are the infimum and supremum elements of 𝒱\mathcal{V}^{\star} by showing that they are also elements of 𝒱\mathcal{V}^{\star}. From Assumption 30, there exist m¯,m¯\underline{m},\overline{m}\in\mathcal{M} such that hs(X¯,m¯)=minmhs(X¯,m){h}_{s}(\underline{X},\underline{m})=\min_{m\in\mathcal{M}}h_{s}(\underline{X},m) and hs(X¯,m¯)=maxmhs(X¯,m){h}_{s}(\overline{X},\overline{m})=\max_{m\in\mathcal{M}}h_{s}(\overline{X},m) for all s[S]s\in[S]. Since X¯\underline{X} and X¯\overline{X} are fixed points of h(,m¯)h(\cdot,\underline{m}) and h(,m¯)h(\cdot,\overline{m}), we apply Corollary 20 to conclude that X¯,X¯𝒱\underline{X},\overline{X}\in\mathcal{V}^{\star}. ∎ Our next result establishes the relationship between V¯B,V¯o,V¯r,\underline{V}^{B},\underline{V}^{o},\underline{V}^{r}, V¯B,V¯o,V¯r\overline{V}^{B},\overline{V}^{o},\overline{V}^{r} when f,gof,g^{o}, and grg^{r} on S×{\mathbb{R}}^{S}\times\mathcal{M} satisfy Assumption 30.

Theorem 38.

If f,go,grf,g^{o},g^{r} satisfy Assumption 30 on S×{\mathbb{R}}^{S}\times\mathcal{M}, then the bounding elements (37) (36) (35) of the corresponding fixed point sets 𝒱B\mathcal{V}^{B},𝒱o\mathcal{V}^{o} (33) and 𝒱r\mathcal{V}^{r} (34) are ordered as

V¯B=V¯oV¯r,V¯B=V¯rV¯o.\underline{V}^{B}=\underline{V}^{o}\leq\underline{V}^{r},\ \overline{V}^{B}=\overline{V}^{r}\leq\overline{V}^{o}. (42)
{pf}

Since V¯o\underline{V}^{o} is the infimum element for the fixed point set 𝒱o\mathcal{V}^{o} (36), we can apply Theorem 37 to derive

V¯o=min(C,P)go(V¯o,C,P).\textstyle\underline{V}^{o}=\min_{(C,P)\in\mathcal{M}}g^{o}(\underline{V}^{o},C,P). (43)

By the definition of \pi^{o} (28), \min_{(C,P)\in\mathcal{M}}g^{o}(\underline{V}^{o},C,P)=\min_{\pi\in\Pi}\min_{(C,P)\in\mathcal{M}}g^{\pi}(\underline{V}^{o},C,P). As the two minima commute,

\min_{\pi\in\Pi}\min_{(C,P)\in\mathcal{M}}g^{\pi}(\underline{V}^{o},C,P)=\min_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\underline{V}^{o},C,P). (44)

Combining (43) and (44), V¯o\underline{V}^{o} is exactly the unique fixed point of min(C,P)minπΠgπ(,C,P)\min_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\cdot,C,P). However, by applying Theorem 37 to ff on S×{\mathbb{R}}^{S}\times\mathcal{M}, V¯B\underline{V}^{B} is also the unique fixed point of min(C,P)minπΠgπ(,C,P)\min_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\cdot,C,P). Therefore V¯o=V¯B\underline{V}^{o}=\underline{V}^{B}.

From (35), \underline{V}^{r}=\min_{(C,P)\in\mathcal{M}}g^{r}(\underline{V}^{r},C,P). Minimizing additionally over the policy space lower bounds \underline{V}^{r} as

\underline{V}^{r}\geq\min_{\pi\in\Pi}\min_{(C,P)\in\mathcal{M}}g^{\pi}(\underline{V}^{r},C,P). (45)

Since the right-hand side of (45) equals \underline{f}(\underline{V}^{r}), (45) is equivalent to \underline{V}^{r}\geq\underline{f}(\underline{V}^{r}). From Lemma 26, \underline{f} is order-preserving in V; since \underline{f} is also a contraction, iterating \underline{f} on this inequality yields \underline{V}^{r}\geq\lim_{k\to\infty}\underline{f}^{k}(\underline{V}^{r})=\underline{V}^{B}. We conclude that \underline{V}^{o}=\underline{V}^{B}\leq\underline{V}^{r}.

From Theorem 37, V¯r\overline{V}^{r} is the fixed point of g¯r\overline{g}^{r}, such that

V¯r=max(C,P)gr(V¯r,C,P).\overline{V}^{r}=\max_{(C,P)\in\mathcal{M}}g^{r}(\overline{V}^{r},C,P). (46)

We apply \min_{\pi\in\Pi} to both sides of (46) and use the definition of \pi^{r} to derive that \overline{V}^{r} is the fixed point of \min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\cdot,C,P). From Assumption 30, there exists (\overline{C},\overline{P})\in\mathcal{M} that maximizes g^{\pi}(\overline{V}^{r},C,P) over \mathcal{M}, so \overline{V}^{r} equivalently satisfies

V¯r=minπΠgπ(V¯r,C¯,P¯).\overline{V}^{r}=\min_{\pi\in\Pi}g^{\pi}(\overline{V}^{r},\overline{C},\overline{P}).

From Corollary 20, this implies that \overline{V}^{r}\in\mathcal{V}^{B} and therefore \overline{V}^{r}\leq\overline{V}^{B}. Next we show \overline{V}^{B}\leq\overline{V}^{r}. From Theorem 37, \overline{V}^{B} is the fixed point of \overline{f}, such that \overline{V}^{B}=\max_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\overline{V}^{B},C,P). From the min-max inequality,

V¯BminπΠmax(C,P)gπ(V¯B,C,P).\overline{V}^{B}\leq\min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\overline{V}^{B},C,P).

Since πrΠ\pi^{r}\in\Pi,

V¯Bmax(C,P)gr(V¯B,C,P).\overline{V}^{B}\leq\max_{(C,P)\in\mathcal{M}}g^{r}(\overline{V}^{B},C,P). (47)

The right-hand side of (47) is \overline{g}^{r}(\overline{V}^{B}) (22), so (47) is equivalent to \overline{V}^{B}\leq\overline{g}^{r}(\overline{V}^{B}). Consider the sequence V^{k+1}=\overline{g}^{r}(V^{k}) with V^{1}=\overline{V}^{B}. Since \overline{g}^{r} is a contraction, \lim_{k\to\infty}V^{k}=\overline{V}^{r}, the fixed point of \overline{g}^{r}. From Lemma 26, \overline{g}^{r} is order-preserving, so V^{1}\leq\overline{g}^{r}(V^{1})=V^{2} implies that the sequence \{V^{k}\} is non-decreasing. Therefore \overline{V}^{B}=V^{1}\leq\lim_{k\to\infty}V^{k}=\overline{V}^{r}.

Finally, Theorem 37 implies that \overline{V}^{o} is the fixed point of \overline{g}^{o}: \overline{V}^{o}=\max_{(C,P)\in\mathcal{M}}g^{o}(\overline{V}^{o},C,P). Since \pi^{o}\in\Pi, \overline{V}^{o}\geq\min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\overline{V}^{o},C,P). From the min-max inequality,

minπΠmax(C,P)gπ(V¯o,C,P)max(C,P)minπΠgπ(V¯o,C,P),\min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\overline{V}^{o},C,P)\geq\max_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\overline{V}^{o},C,P),

where the right-hand side of the inequality equals \overline{f}(\overline{V}^{o}). Hence \overline{V}^{o}\geq\overline{f}(\overline{V}^{o}); following the monotonicity of the Bellman operator f [17, Thm. 6.2.2] and iterating \overline{f}, we conclude that \overline{V}^{o}\geq\overline{V}^{B}. ∎

Remark 39.

Through our fixed-point analysis, we see that in addition to having the best worst-case performance among {𝒱o,𝒱B,𝒱r}\{\mathcal{V}^{o},\mathcal{V}^{B},\mathcal{V}^{r}\}, 𝒱r\mathcal{V}^{r} also has the smallest variation in performance for the same uncertainty set \mathcal{M}.

Finally, we generalize the ss-rectangularity condition by showing that the optimistic and robust policies exist when the MDP parameter set \mathcal{M} satisfies Assumption 30.

Corollary 40 (Robust MDP under Assumption 30).

If \mathcal{M} is compact and convex, and f,go,grf,g^{o},g^{r} satisfy Assumption 30 on S×{\mathbb{R}}^{S}\times\mathcal{M}, then WoW^{o} (26) and WrW^{r} (27) are the infimum and supremum value vectors for the policy evaluation operator under πo\pi^{o} (28) and πr\pi^{r} (29), respectively.

Wso=infV𝒱oVs,Wsr=supV𝒱rVs,s[S],W^{o}_{s}=\inf_{V\in\mathcal{V}^{o}}{\color[rgb]{0,0,0}V_{s}},W^{r}_{s}=\sup_{V\in\mathcal{V}^{r}}{\color[rgb]{0,0,0}V_{s}},\ \forall s\in[S], (48)

where 𝒱o\mathcal{V}^{o} (33) and 𝒱r\mathcal{V}^{r} (34) are the fixed point sets of policies πo\pi^{o} and πr\pi^{r} under parameter uncertainty \mathcal{M}, respectively.

{pf}

When f satisfies Assumption 30 on \mathbb{R}^{S}\times\mathcal{M}, Theorem 37 shows that \underline{V}^{B}=W^{o} and \overline{V}^{B}=W^{r}. If f, g^{o}, and g^{r} also satisfy Assumption 30 on \mathbb{R}^{S}\times\mathcal{M}, then we apply Theorem 38 to derive W^{o}=\underline{V}^{o} and W^{r}=\overline{V}^{r}. This proves the corollary statement. ∎

Remark 41.

When Assumption 30 is not satisfied, WoW^{o} and WrW^{r} still bound V¯o\underline{V}^{o} and V¯r\overline{V}^{r}. This result is also stated in [21].

Figure 7: Illustration of Theorem 38. The purple, green, and blue regions indicate the ranges of \mathcal{V}^{r}, \mathcal{V}^{o}, and \mathcal{V}^{B}, respectively.

7 Value iteration for fixed point set computation

In the previous sections, we proved the existence of a fixed point set for value operators with compact parameter uncertainty sets and re-examined robust MDP results through our techniques. Next, we derive an iterative algorithm for computing the bounds of the fixed point set \mathcal{V}^{\star} given a value operator h and a parameter uncertainty set \mathcal{M}.

Algorithm Sketch. Based on the set-based value iteration (19), we iteratively compute the one-step bounds of H(\mathcal{V}^{k}); these iterates converge to the bounds of the fixed point set.

For any compact set 𝒱𝒦(S)\mathcal{V}\in\mathcal{K}({\mathbb{R}}^{S}), the one step bounds of H(𝒱)H(\mathcal{V}) are equivalent to the one-step output of the bound operators h¯\underline{h} and h¯\overline{h} (22) applied to the extremal points of 𝒱\mathcal{V}.

Theorem 42 (One step HH bounds).

Consider a set operator HH (14) and its bound operators h¯\underline{h} and h¯\overline{h} (22) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M} (6). For a compact set 𝒱S\mathcal{V}\subset{\mathbb{R}}^{S}, H(𝒱)H(\mathcal{V}) is bounded by h¯(V¯)\underline{h}(\underline{V}) and h¯(V¯)\overline{h}(\overline{V}) (22) as

h¯(V¯)Vh¯(V¯),VH(𝒱).\underline{h}(\underline{V})\leq V\leq\overline{h}(\overline{V}),\quad\forall\ V\ \in H(\mathcal{V}). (49)

where \underline{V} and \overline{V} (21) are the extremal elements of \mathcal{V}. If h satisfies Assumption 30 on \mathbb{R}^{S}\times\mathcal{M} and \underline{V},\overline{V}\in\mathcal{V}, then \underline{h}(\underline{V}) and \overline{h}(\overline{V}) are the infimum and supremum elements of H(\mathcal{V}), respectively; that is, for all s\in[S], \underline{h}_{s}(\underline{V}) and \overline{h}_{s}(\overline{V}) satisfy

h¯s(V¯)=inf(V,m)𝒱×hs(V,m),h¯s(V¯)=sup(V,m)𝒱×hs(V,m).\underline{h}_{s}(\underline{V})=\underset{(V,m)\in\mathcal{V}\times\mathcal{M}}{\inf}{h_{s}(V,m)},\ \overline{h}_{s}(\overline{V})=\underset{(V,m)\in\mathcal{V}\times\mathcal{M}}{\sup}h_{s}(V,m). (50)
{pf}

For all s\in[S], h_{s}(V,m)\leq\overline{h}_{s}(V) for all m\in\mathcal{M}. Since h is K(V)-Lipschitz in \mathcal{M} and an \alpha-contraction in V, \overline{h} is order-preserving (Lemma 26), such that \overline{h}_{s}(V)\leq\overline{h}_{s}(\overline{V}) for all V\in\mathcal{V}. We conclude that

h(V,m)h¯(V¯),(V,m)𝒱×.h(V,m)\leq\overline{h}(\overline{V}),\ \forall(V,m)\in\mathcal{V}\times\mathcal{M}. (51)

Since \overline{h}(\overline{V}) is an upper bound and \sup is the least upper bound, it holds that \sup_{(V,m)\in\mathcal{V}\times\mathcal{M}}[h(V,m)]_{s}\leq\overline{h}_{s}(\overline{V}) for all s\in[S]. We use the definition of H(\mathcal{V}) (14) to conclude that V\leq\overline{h}(\overline{V}) for all V\in H(\mathcal{V}). The inequality \underline{h}(\underline{V})\leq V for all V\in H(\mathcal{V}) can be proved similarly.

If h satisfies Assumption 30 on \mathbb{R}^{S}\times\mathcal{M} and \underline{V},\overline{V}\in\mathcal{V}, Assumption 30 states that there exists \underline{m}\in\mathcal{M} such that h(\underline{V},\underline{m})=\underline{h}(\underline{V}). Therefore, \underline{h}(\underline{V})\in H(\mathcal{V}). Since \underline{h}(\underline{V}) also lower bounds all elements of H(\mathcal{V}), it is the infimum element of H(\mathcal{V}). The fact that the greatest element of H(\mathcal{V}) is \overline{h}(\overline{V}) can be proved similarly. ∎ Based on Theorem 42, we propose the following bound approximation algorithm for the fixed point set \mathcal{V}^{\star} (17) of the set-based operator H (14).

Algorithm 7: Bound approximation of the fixed point set \mathcal{V}^{\star}

1: Input: \mathcal{C}, \mathcal{P}, V^{0}, \epsilon
2: Output: \underline{V}, \overline{V}
3: \underline{V}^{0}:=\overline{V}^{0}:=V^{0},\ k:=0
4:e0=1γγϵe^{0}=\frac{1-\gamma}{\gamma}\epsilon
5:while γ1γekϵ\frac{\gamma}{1-\gamma}e^{k}\geq\epsilon do
6:     V¯sk+1=minmhs(V¯k,m),s[S]\underline{V}^{k+1}_{s}=\min_{m\in\mathcal{M}}h_{s}(\underline{V}^{k},m),\quad\forall s\in[S]
7:     V¯sk+1=maxmhs(V¯k,m),s[S]\overline{V}^{k+1}_{s}=\max_{m\in\mathcal{M}}h_{s}(\overline{V}^{k},m),\quad\forall s\in[S]
8:     ek+1=max{V¯k+1V¯k,V¯k+1V¯k}e^{k+1}=\max\Big{\{}\left\lVert\underline{V}^{k+1}-\underline{V}^{k}\right\rVert,\left\lVert\overline{V}^{k+1}-\overline{V}^{k}\right\rVert\Big{\}}
9:     k=k+1k=k+1
10:end while
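As a concrete illustration, the following is a minimal Python sketch of Algorithm 7 for the case where h is the Bellman operator f and \mathcal{M} is a finite collection of (C,P) pairs; the array shapes, function names, and the NumPy dependency are our own assumptions and not part of the algorithm statement.

```python
# Minimal sketch of Algorithm 7 (hypothetical shapes and names): h is the
# Bellman operator f, and M is a finite list of (C, P) pairs with C of shape
# (S, A) and P of shape (S, A, S), where P[s, a] is a distribution over s'.
import numpy as np

def bellman(V, C, P, gamma):
    """f(V, C, P): minimize the state-action values over actions."""
    Q = C + gamma * (P @ V)        # Q[s, a] = C[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    return Q.min(axis=1)

def bound_value_iteration(M, gamma, V0, eps=1e-6):
    """Approximate the extremal fixed points of the set-based operator H."""
    V_lo, V_hi = V0.copy(), V0.copy()
    err = (1.0 - gamma) / gamma * eps          # line 4: enter the loop once
    while gamma / (1.0 - gamma) * err >= eps:  # line 5
        # Lines 6-7: state-wise min / max of h over the finite parameter set M.
        V_lo_next = np.min([bellman(V_lo, C, P, gamma) for C, P in M], axis=0)
        V_hi_next = np.max([bellman(V_hi, C, P, gamma) for C, P in M], axis=0)
        err = max(np.abs(V_lo_next - V_lo).max(),       # line 8
                  np.abs(V_hi_next - V_hi).max())
        V_lo, V_hi = V_lo_next, V_hi_next
    return V_lo, V_hi
```

For the policy evaluation operator g^{\pi}, it suffices to replace the minimization over actions in `bellman` with the policy-weighted average `(pi * Q).sum(axis=1)`.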

7.1 Computing one-step optimal parameters

Algorithm 7 is stated for a general MDP parameter set \mathcal{M} and does not specify how to compute lines 6 and 7. Here we discuss solution methods for different shapes of \mathcal{M}.

  1. 1.

    Finite \mathcal{M}. If \mathcal{M}=\{m_{1},\ldots,m_{N}\} is a set with a finite number of elements, we can directly compute line 6 as

    \underline{V}^{k+1}_{s}=\min\big\{h_{s}(\underline{V}^{k},m_{i})\ \big|\ i\in\{1,\ldots,N\}\big\},\quad\forall s\in[S]. (52)

    For line 7, we replace min\min with max\max in (52).

  2. 2.

    Convex \mathcal{M}. When \mathcal{M} is a convex set, the computation depends on h. If h=g^{\pi} is the policy operator, lines 6 and 7 can be solved as convex optimization problems (see the sketch following this list). If h is the Bellman operator f, lines 6 and 7 take on a min-max formulation and are NP-hard to solve in general [21]. When \mathcal{M} can be characterized by an ellipsoidal set of parameters, the solutions to lines 6 and 7 are given in [21].
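For illustration, below is a minimal sketch of line 6 for the policy evaluation operator when the transition uncertainty at a single state-action pair (s,a) is an interval polytope around a nominal distribution; the interval model, the SciPy dependency, and the function names are our assumptions, not the ellipsoidal construction of [21].

```python
# Minimal sketch (hypothetical names): line 6 at one state-action pair (s, a)
# for g^pi, with transition uncertainty {p : p_lo <= p <= p_hi, sum(p) = 1}.
import numpy as np
from scipy.optimize import linprog

def min_transition_q(V, c_sa, p_lo, p_hi, gamma):
    """Best-case Q-value: c_sa + gamma * min_{p in polytope} p^T V."""
    S = len(V)
    res = linprog(
        c=gamma * np.asarray(V),               # linear objective gamma * p^T V
        A_eq=np.ones((1, S)), b_eq=[1.0],      # p must be a probability vector
        bounds=list(zip(p_lo, p_hi)),          # interval bounds on each entry of p
        method="highs",
    )
    return c_sa + res.fun

# Line 7 (the worst case) is obtained by negating the objective; the state-wise
# update then averages these Q-values with the policy weights pi_s.
```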

We recall the stochastic path planning problem from Example 1 with its two parameter uncertainty scenarios: when the wind field uncertainty is discrete, \mathcal{M} is finite; when the wind field is a combination of the major wind trends, \mathcal{M} is convex.

7.2 Algorithm Convergence Rate

When lines 6 and 7 are solvable, Algorithm 7 asymptotically converges to approximations of the bounding elements of \mathcal{V}^{\star}. If \mathcal{M} satisfies Assumption 30 with respect to h, these bounds coincide with the extremal elements of \mathcal{V}^{\star}. Algorithm 7 has a rate of convergence in the Hausdorff distance similar to that of standard value iteration using h on \mathbb{R}^{S}.

Theorem 43.

Consider the value operator hh, compact uncertainty set \mathcal{M}, and the fixed point set 𝒱\mathcal{V}^{\star} of the set-based operator HH (14) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. If \mathcal{M} satisfies Assumption 30 with respect to hh, then at each iteration kk,

\left\lVert\underline{V}^{k+1}-\underline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{\infty},\quad\left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}, (53)

where all norms are infinity norms, and \underline{V}^{\star},\overline{V}^{\star} are the infimum and supremum elements of \mathcal{V}^{\star}, respectively. At Algorithm 7's termination, \underline{V}^{k} and \overline{V}^{k} satisfy

max{V¯kV¯,V¯kV¯}<ϵ.\max\{\left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}},\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\}<\epsilon. (54)
{pf}

From Algorithm 7, \overline{V}^{k+1}=\overline{h}(\overline{V}^{k}). From Lemma 25, \overline{h} is an \alpha-contraction, so \left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}; the analogous bound holds for \underline{V}^{k+1}, which establishes (53). Next, we apply the triangle inequality to \left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty} to derive

V¯kV¯V¯kV¯k+1+V¯k+1V¯.\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\left\lVert\overline{V}^{k}-\overline{V}^{k+1}\right\rVert_{{\color[rgb]{0,0,0}\infty}}+\left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}. (55)

We then use \left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty} to bound (55) as \left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}\leq\frac{1}{1-\alpha}\left\lVert\overline{V}^{k}-\overline{V}^{k+1}\right\rVert_{\infty}. Moreover, \left\lVert\overline{V}^{k+1}-\overline{V}^{k}\right\rVert_{\infty}=\left\lVert\overline{h}(\overline{V}^{k})-\overline{h}(\overline{V}^{k-1})\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{k-1}\right\rVert_{\infty}\leq\alpha e^{k}, so \left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}\leq\frac{\alpha}{1-\alpha}e^{k}. A similar argument shows \left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{\infty}\leq\frac{\alpha}{1-\alpha}e^{k}. When Algorithm 7 terminates, its while condition fails, so that \frac{\gamma}{1-\gamma}e^{k}<\epsilon; with \alpha=\gamma, this yields \max\big\{\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty},\left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{\infty}\big\}<\epsilon. This concludes our proof. ∎ In particular, the Bellman operator f and the policy evaluation operator g^{\pi} are \gamma-contractions on \mathbb{R}^{S}, where \gamma is the discount factor; therefore, Theorem 43 applies with \alpha=\gamma.

Remark 44.

Theorem 43 implies that at the termination of Algorithm 7, the fixed point set 𝒱\mathcal{V}^{\star} can be over-approximated by

𝒱𝒱approx:=s[S][V¯sk+1ϵ,V¯sk+1+ϵ],\mathcal{V}^{\star}\subseteq\mathcal{V}_{approx}:=\prod_{s\in[S]}[\underline{V}_{s}^{k+1}-\epsilon,\overline{V}_{s}^{k+1}+\epsilon],

where kk is the last iterate before Algorithm 7 terminates.

8 Path Planning in Time-varying Wind Fields

We apply set-based value iteration to wind-assisted probabilistic path planning of a balloon in strong, uncertain wind fields [22]. MDPs have recently gained traction as models for wind-assisted path planning of balloons in the stratosphere and on other planets [22, 2], and discrete state-action MDPs have been shown to be a viable high-level path planning model for such applications [22].

Mission Objective. In the two-dimensional wind field, we assume that the wind-assisted balloon is tasked with reaching target state (8,8) in Figure 8 using minimum fuel.

Uncertain Wind Fields. By collecting wind data on the environment, an MDP model can be constructed and a policy that handles stochastic planning can be deployed. However, the wind may vary over time, which can cause the nominally optimal policy to have worse-than-expected worst-case performance. We construct an idealized uncertain wind field to demonstrate how the set-based Bellman operator can be used to predict the best- and worst-case behavior of a robust policy.

MDP Modeling Assumptions. Following the framework described in [22], we model the path planning problem in an uncertain wind field as an infinite horizon, discounted MDP with discrete state-actions in a two-dimensional space. While balloons typically traverse in three dimensions, we assume that the wind is consistent in the vertical direction and that the final target is any vertical position along the given two-dimensional coordinates. As a result, we can disregard the vertical position during planning.

Figure 8: (a) Wind field traversed by the balloon, discretized into 81 states. (b) At each state, 9 actions corresponding to different thrust vectors are available.

States. A total of 81 states represent the two-dimensional space, composed of three different regions characterized by their wind variability, as shown in Figure 8; a sketch of the corresponding state sets follows the list below.

  1. 1.

    Calm wind. In calm states ScalmS_{calm}, the wind magnitude varies uniformly between [0,0.5][0,0.5], and the wind direction is uniformly sampled between [0,2π][0,2\pi]. Scalm={(i,j)|(0,0)(i,j)(2,8),(6,0)(i,j)(8,8)}.S_{calm}=\{(i,j)\ |\ (0,0)\leq(i,j)\leq(2,8),\ (6,0)\leq(i,j)\leq(8,8)\}.

  2. 2.

    Gusty wind. In states with gusts SgustyS_{gusty}, wind magnitude is consistently 11, while the wind direction is uniformly sampled between [0,2π][0,2\pi]. Sgusty={(i,j)|(3,3)(i,j)<(6,6)}S_{gusty}=\{(i,j)\ |\ (3,3)\leq(i,j)<(6,6)\}.

  3. 3.

    Unreliable wind. In unreliable states SunreliableS_{unreliable}, a predictable wind front occasionally moves across an otherwise windless region. In other words, the wind magnitude is either 0 or 11 and the wind direction varies uniformly between [π/4,π/2][\pi/4,\pi/2].
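A minimal Python sketch of the three region index sets is given below; treating the unreliable region as all states outside the calm and gusty regions is our reading of Figure 8, not an explicit definition in the text.

```python
# Minimal sketch of the three wind regions on the 9 x 9 grid. The unreliable
# region is assumed to be every state not in the calm or gusty regions.
GRID = [(i, j) for i in range(9) for j in range(9)]
S_calm = {(i, j) for (i, j) in GRID if i <= 2 or i >= 6}            # (0,0)-(2,8) and (6,0)-(8,8)
S_gusty = {(i, j) for (i, j) in GRID if 3 <= i < 6 and 3 <= j < 6}  # (3,3) <= (i,j) < (6,6)
S_unreliable = set(GRID) - S_calm - S_gusty                         # assumed: the remaining states
```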

Actions. The balloon is equipped with an actuator that provides a constant thrust of 11 in 88 discretized directions shown in Figure 8b. The only stationary action vector with zero magnitude is highlighted in blue in the center of Figure 8b. We assume that the actuation force is enough to move the balloon across one state in wind with magnitude 0.5\leq 0.5, and is otherwise not strong enough to overcome wind effects.

Figure 9: Transition probabilities for the three different wind regions: (a) state transition in calm wind, (b) state transition in unreliable wind, (c) state transition in gusty wind.

Transition Probabilities. The transition probabilities are region-dependent. In the states [Scalm][S_{calm}] and [Sgusty][S_{gusty}], the transition dynamics are stochastic but stationary in time. In the states [Sunreliable][S_{unreliable}], the transition dynamics are stochastic but change over time. We define the following neighboring states for each state s[S]s\in[S].

  1. 1.

    𝒩(s)\mathcal{N}(s): all 88 neighboring states of state ss.

  2. 2.

    𝒩(s,a,0)\mathcal{N}(s,a,0): the neighboring state of ss in the direction of aa.

  3. 3.

    𝒩(s,a,1)\mathcal{N}(s,a,1): the neighboring state of ss in the direction of aa plus the two adjacent states as shown in Figure 9a.

  4. 4.

    𝒩(s,a,2)\mathcal{N}(s,a,2): the up and upper-right neighbors of ss, as shown in Figure 9b.

In the calm wind region, the transition probabilities are given by

P_{sa,s^{\prime}}=\begin{cases}\frac{1}{|\mathcal{N}(s,a,1)|},&s^{\prime}\in\mathcal{N}(s,a,1)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{calm}]. (56)

In the gusty wind region, the transition probabilities are given by

P_{sa,s^{\prime}}=\begin{cases}\frac{1}{|\mathcal{N}(s)|},&s^{\prime}\in\mathcal{N}(s)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{gusty}],\ \forall\ a\in[A]. (57)

In the unreliable wind region, the transition probabilities vary between transition dynamics Ps1P_{s}^{1} and Ps2P_{s}^{2}.

P^{1}_{sa,s^{\prime}}=\begin{cases}1,&s^{\prime}\in\mathcal{N}(s,a,0)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{unreliable}],\ \forall\ a\in[A]. (58)
P^{2}_{sa,s^{\prime}}=\begin{cases}0.5,&s^{\prime}\in\mathcal{N}(s,a,2)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{unreliable}],\ \forall\ a\in[A]. (59)

Together, P^{1}_{s} and P^{2}_{s} form the uncertainty set \mathcal{P}_{s}\subset\Delta_{S}^{A} defined at each state.

\mathcal{P}_{s}=\{P^{i}_{s}\ |\ i\in\{1,2\}\},\quad\forall s\in[S_{unreliable}]. (60)
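For concreteness, the following is a minimal sketch of the two transition modes (58)-(59) and of the uncertainty set (60) at an unreliable state; the grid indexing s = i·9 + j, the choice of "up" and "upper-right" offsets, and the clamping of transitions at the grid boundary are our assumptions.

```python
# Minimal sketch of (58)-(59) at an unreliable state s = (i, j), with state
# index s = i * 9 + j (assumed) and boundary transitions clamped to the grid.
import numpy as np

N = 9
def idx(i, j):
    return i * N + j

def clamp(i, j):
    return min(max(i, 0), N - 1), min(max(j, 0), N - 1)

def P1_row(i, j, a):
    """(58): deterministic move to the neighbour N(s, a, 0) in the action direction a = (di, dj)."""
    row = np.zeros(N * N)
    row[idx(*clamp(i + a[0], j + a[1]))] = 1.0
    return row

def P2_row(i, j, a):
    """(59): the wind front pushes to the two neighbours in N(s, a, 2) with probability 0.5 each."""
    row = np.zeros(N * N)
    for di, dj in [(0, 1), (1, 1)]:            # assumed "up" and "upper-right" offsets
        row[idx(*clamp(i + di, j + dj))] += 0.5
    return row

# Uncertainty set (60) at state (i, j): two candidate transition rows per action.
def P_candidates(i, j, a):
    return [P1_row(i, j, a), P2_row(i, j, a)]
```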

Cost. We define the following state-action cost to achieve the mission objective: at each state-action pair, the cost is the sum of the current distance from the target position s_{targ}=(8,8) and the fuel expended by the given action,

C((i,j),a)=\sqrt{(i-s_{targ}[0])^{2}+(j-s_{targ}[1])^{2}}+\tfrac{1}{2}\left\lVert a\right\rVert_{2}.

We take \left\lVert a\right\rVert_{2}=1 for all actions except the stay-still action, for which \left\lVert a\right\rVert_{2}=0.
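A minimal sketch of this cost on the 81-state grid is given below; the action encoding (eight unit-thrust directions plus one stay-still action) and the state ordering i·9 + j follow Figures 8b and 12, and are otherwise our assumptions.

```python
# Minimal sketch of the state-action cost C((i, j), a): distance to the target
# plus half the thrust magnitude. Action 0 is the stay-still action (assumed).
import numpy as np

ACTIONS = [(0, 0)] + [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
TARGET = (8, 8)

def cost(i, j, a_idx):
    di, dj = ACTIONS[a_idx]
    dist = np.hypot(i - TARGET[0], j - TARGET[1])   # distance to s_targ = (8, 8)
    thrust = 0.0 if (di, dj) == (0, 0) else 1.0     # ||a||_2 = 1 for thrust actions, 0 otherwise
    return dist + 0.5 * thrust

# Cost matrix of shape (81, 9), with state index s = i * 9 + j.
C = np.array([[cost(i, j, a) for a in range(len(ACTIONS))]
              for i in range(9) for j in range(9)])
```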

8.1 Bellman, optimistic policy, and robust policy

We first compute the optimistic and robust bounds of the MDP with parameter uncertainty in 𝒫\mathcal{P} when s[Sunreliable]s\in[S_{unreliable}] by running Algorithm 7. The results are shown in Figure 10.

Figure 10: (a) Optimistic case with expected objective of 54.2. (b) Robust case with expected objective of 96.7.

We denote the optimistic policy as πo\pi^{o} and the robust policy as πr\pi^{r}, and derive the bounds of their respective value vector sets 𝒱o\mathcal{V}^{o} (33) and 𝒱r\mathcal{V}^{r} (34) using Algorithm 7. The output is compared against the bounds of the set-based Bellman operator’s fixed point set 𝒱\mathcal{V}^{\star} in Table 1.

Set Maximum value Minimum value
𝒱\mathcal{V}^{\star} 70.61 62.25
𝒱o\mathcal{V}^{o} 101.58 62.25
𝒱r\mathcal{V}^{r} 70.63 70.52
Table 1: Bellman, optimistic policy, robust policy value bounds of the uncertain wind field.

Time-varying wind field. Next, we consider a time-varying wind field: at each time step k, the transition probability P^{k} is chosen at random from \mathcal{P} (60). In this time-varying wind field, we compare three different policy deployments: 1) the stationary optimistic policy \pi^{o} applied via the policy operator g^{o} (33), 2) the stationary robust policy \pi^{r} applied via the policy operator g^{r} (34), and 3) a dynamically changing policy that is optimal for the MDP ([S],[A],P^{k},C,\gamma), applied via f (9). These three deployments are given by

Vk+1\displaystyle V^{k+1} =go(Vk,C,Pk),\displaystyle=g^{o}(V^{k},C,P^{k}), (61)
Vk+1\displaystyle V^{k+1} =gr(Vk,C,Pk),\displaystyle=g^{r}(V^{k},C,P^{k}), (62)
Vk+1\displaystyle V^{k+1} =f(Vk,C,Pk).\displaystyle=f(V^{k},C,P^{k}). (63)
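The following is a minimal sketch of these three deployments for a finite transition uncertainty set; the random draw of P^k, the array shapes, and the function and variable names are our assumptions for illustration.

```python
# Minimal sketch of deployments (61)-(63): at each time step a transition
# model P_k is drawn at random from the finite uncertainty set P_models.
import numpy as np

def policy_eval(V, C, P, pi, gamma):
    """g^pi(V, C, P): Q-values averaged under the stochastic policy pi (shape (S, A))."""
    Q = C + gamma * (P @ V)
    return (pi * Q).sum(axis=1)

def bellman(V, C, P, gamma):
    """f(V, C, P): greedy minimization of the Q-values over actions."""
    return (C + gamma * (P @ V)).min(axis=1)

def simulate(P_models, C, gamma, pi_opt, pi_rob, V0, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    V_o, V_r, V_f = V0.copy(), V0.copy(), V0.copy()
    for _ in range(steps):
        P_k = P_models[rng.integers(len(P_models))]     # time-varying wind draw
        V_o = policy_eval(V_o, C, P_k, pi_opt, gamma)   # deployment (61)
        V_r = policy_eval(V_r, C, P_k, pi_rob, gamma)   # deployment (62)
        V_f = bellman(V_f, C, P_k, gamma)               # deployment (63)
    return V_o, V_r, V_f
```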

The resulting cost-to-go at state s_{orig}=[0,0] is plotted in Figure 11. The optimistic policy deployment (61) has the greatest variation in value over the course of 50 MDP time steps. Both the robust policy deployment (62) and the dynamically changing policy deployment (63) achieve a better upper bound at each MDP time step. The dynamically changing policy deployment (63) achieves a cost-to-go of less than 70 on average, the best among all three deployments. As discussed in Remark 39, the robust policy deployment has the smallest variation in value in the presence of wind uncertainty, achieving a value difference of less than 0.1.

Figure 11: Comparison of the robust policy, optimistic policy, and Bellman policy value trajectories in time-varying wind fields: (a) optimistic policy with \mathcal{V}^{o}'s bounding values, (b) robust policy with \mathcal{V}^{r}'s bounding values, (c) dynamically changing policy with \mathcal{V}^{B}'s bounding values. The center blue line is the average over 50 trials; the shaded blue region denotes one standard deviation. The top and bottom lines are the supremum and infimum values of the fixed points.

Sampled solutions. We compute sampled MDP models based on 50 samples of wind vectors at each state. For each sample, we add the action vector and estimate the empirical distribution of state transitions. We then compute the optimal values of these stationary sampled MDPs and compare the values of 9 randomly selected states. The resulting scatter plot is shown in Figure 12.

Figure 12: Comparison of different optimal value vectors under the Bellman operator for 50 randomly sampled MDPs. On the x-axis, the state number is computed as i\times 9+j.

9 Conclusion

In this paper, we categorized a class of operators utilized to solve Markov decision processes as value operators and lifted their input space from vectors to compact sets of vectors. We showed using fixed point analysis that the set extensions of value operators have fixed point sets that remain invariant given a compact set of MDP parameter uncertainties. These sets were applied to robust dynamic programming to further enrich existing results and generalize the kk-rectangularity assumption for robust MDPs. Finally, we applied our results to a path planning problem for time-varying wind fields. For future work, we plan on applying set-based value operators to stochastic games in the presence of uncoordinated players such as humans, as well as applying value operators to reinforcement learning to synthesize robust learning algorithms.

References

  • [1] Wesam H Al-Sabban, Luis F Gonzalez, and Ryan N Smith. Wind-energy based path planning for unmanned aerial vehicles using Markov decision processes. In 2013 IEEE International Conference on Robotics and Automation, pages 784–789. IEEE, 2013.
  • [2] Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588:77–82, 2020.
  • [3] Marc G Bellemare, Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • [4] Prashant Doshi, Richard Goodwin, Rama Akkiraju, and Kunal Verma. Dynamic workflow composition: Using Markov decision processes. International Journal of Web Services Research (IJWSR), 2(1):1–17, 2005.
  • [5] Robert Givan, Sonia Leach, and Thomas Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1-2):71–109, 2000.
  • [6] Vineet Goyal and Julien Grand-Clément. Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research, 2022.
  • [7] Jeff Henrikson. Completeness and total boundedness of the Hausdorff metric. In MIT Undergraduate Journal of Mathematics, 1999.
  • [8] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
  • [9] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • [10] Erwan Lecarpentier and Emmanuel Rachelson. Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [11] Sarah HQ Li, Assalé Adjé, Pierre-Loïc Garoche, and Behçet Açıkmeşe. Bounding fixed points of set-based Bellman operator and Nash equilibria of stochastic games. Automatica, 130:109685, 2021.
  • [12] Shie Mannor, Ofir Mebel, and Huan Xu. Robust MDPs with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509, 2016.
  • [13] Francisco S Melo. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pages 1–4, 2001.
  • [14] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
  • [15] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • [16] Sindhu Padakandla, Prabuchandran KJ, and Shalabh Bhatnagar. Reinforcement learning algorithm for non-stationary environments. Applied Intelligence, 50(11):3590–3606, 2020.
  • [17] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [18] Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.
  • [19] Bernd SW Schröder. Ordered Sets. Springer, 2003.
  • [20] Herke Van Hoof, Tucker Hermans, Gerhard Neumann, and Jan Peters. Learning robot in-hand manipulation with tactile features. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 121–127. IEEE, 2015.
  • [21] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
  • [22] Michael T Wolf, Lars Blackmore, Yoshiaki Kuwata, Nanaz Fathpour, Alberto Elfes, and Claire Newman. Probabilistic motion planning of balloons in strong, uncertain wind fields. In 2010 IEEE International Conference on Robotics and Automation, pages 1123–1129. IEEE, 2010.
  • [23] Insoon Yang. A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance. IEEE Control Systems Letters, 1(1):164–169, 2017.

Appendix A Set sequence convergence

Lemma 45.

Let \{\mathcal{V}_{n}\}\subseteq\mathcal{K}(\mathbb{R}^{S}) be a sequence converging to \mathcal{V} in d_{\mathcal{K}} as n\to\infty. For every V\in\mathcal{V}, there exists a sequence \{V^{\varphi(n)}\}_{n\in\mathbb{N}} with V^{\varphi(n)}\in\mathcal{V}_{\varphi(n)} that converges to V in \left\lVert\cdot\right\rVert.

{pf}

Let V\in\mathcal{V}. We define a strictly increasing function \varphi on \mathbb{N} as follows: \varphi(0):=0 and, for all n\in\mathbb{N}, \varphi(n+1):=\min\{j>\varphi(n)\mid\exists\,V^{j}\in\mathcal{V}_{j},\ \left\lVert V-V^{j}\right\rVert=d(V,\mathcal{V}_{j})\leq(n+1)^{-1}\}. Such a j always exists because d(V,\mathcal{V}_{j})\leq d_{\mathcal{K}}(\mathcal{V},\mathcal{V}_{j})\to 0 as j\to\infty. Finally, since \left\lVert V-V^{\varphi(n)}\right\rVert\leq n^{-1} for all n\in\mathbb{N}^{*}, the result holds. ∎

Appendix B Proof of Lemma 9

{pf}

Let (V,m)S×(V,m)\in{\mathbb{R}}^{S}\times\mathcal{M} and consider a sequence {(Vk,mk)}kS×\{(V_{k},m_{k})\}_{k\in\mathbb{N}}\subset{\mathbb{R}}^{S}\times\mathcal{M} that converges to (V,m)(V,m). It holds that h(Vk,mk)h(V,m)\left\lVert h(V_{k},m_{k})-h(V,m)\right\rVert h(Vk,mk)h(V,mk)+h(V,mk)h(V,m)\leq\left\lVert h(V_{k},m_{k})-h(V,m_{k})\right\rVert+\left\lVert h(V,m_{k})-h(V,m)\right\rVert, where from the α\alpha-contractive property of h(,mk)h(\cdot,m^{k}), h(Vk,mk)h(V,mk)αVkV\left\lVert h(V_{k},m_{k})-h(V,m_{k})\right\rVert\leq\alpha\left\lVert V_{k}-V\right\rVert. From the K(V)K(V)-Lipschitz property of h(V,)h(V,\cdot),

h(V,mk)h(V,m)K(V)mkm.\textstyle\left\lVert h(V,m_{k})-h(V,m)\right\rVert\leq K(V)\left\lVert m_{k}-m\right\rVert.

Since \left\lVert V_{k}-V\right\rVert\to 0 and \left\lVert m_{k}-m\right\rVert\to 0 as k\to\infty, we have \left\lVert h(V_{k},m_{k})-h(V,m)\right\rVert\to 0, and h is continuous. ∎

Appendix C Proof of Lemma 25

{pf}

We show that both the Bellman operator ff and the policy evaluation operator gπg^{\pi} satisfy the contractive, order preserving, and Lipschitz properties given in Definition 7. Contraction: given (C,P)(C,P)\in\mathcal{M}, gπ(,C,P)g^{\pi}(\cdot,C,P) and f(,C,P)f(\cdot,C,P) are both γ\gamma-contractions [17, Prop.6.2.4] on the complete metric space (S,)({\mathbb{R}}^{S},\left\lVert\cdot\right\rVert_{\infty}), where γ<1\gamma<1 is the discount factor.

Order preservation: given (C,P)\in\mathcal{M}, the operator g^{\pi}(\cdot,C,P) is order-preserving [17, Lem. 6.1.2]. Consider U,V\in\mathbb{R}^{S} with U\leq V. Since g^{\pi}(\cdot,C,P) is order-preserving, g^{\pi}(U,C,P)\leq g^{\pi}(V,C,P) for all \pi\in\Pi. Taking the infimum over \Pi, we have f(U,C,P)=\inf_{\pi\in\Pi}g^{\pi}(U,C,P)\leq\inf_{\pi\in\Pi}g^{\pi}(V,C,P)=f(V,C,P).

K(V)K(V)-Lipschitz: given (C,P),(C,P)(C,P),(C^{\prime},P^{\prime})\in\mathcal{M} and VSV\in{\mathbb{R}}^{S}, we prove the following for each s[S]s\in[S],

|fs(V,C,P)fs(V,C,P)|cscs+γPsPsmax{πs,π^s}V.|f_{s}(V,C^{\prime},P^{\prime})-f_{s}(V,C,P)|\\ \leq\left\lVert c_{s}^{\prime}-c_{s}\right\rVert_{\infty}+\gamma\left\lVert P^{\prime}_{s}-P_{s}\right\rVert_{\infty}\max\{\left\lVert\pi^{\star}_{s}\right\rVert_{\infty},\left\lVert\hat{\pi}_{s}\right\rVert_{\infty}\}\left\lVert V\right\rVert_{\infty}. (64)

We prove (64) for each of the following cases: 1) fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\geq f_{s}(V,C,P), and 2) fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\leq f_{s}(V,C,P). For case 1), let π^\hat{\pi} (10) be the optimal policy for f(V,C,P)f(V,C^{\prime},P^{\prime}) and π\pi^{\star} be the optimal policy for f(V,C,P)f(V,C,P). For s[S]s\in[S], suppose fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\geq f_{s}(V,C,P), then 0fs(V,C,P)fs(V,C,P)(cs)π^scsπs+γ(Psπ^s)Vγ(Psπs)V0\leq f_{s}(V,C^{\prime},P^{\prime})-f_{s}(V,C,P)\leq(c_{s}^{\prime})^{\top}\hat{\pi}_{s}-c_{s}^{\top}\pi^{\star}_{s}+\gamma(P^{\prime}_{s}\hat{\pi}_{s})^{\top}V-\gamma(P_{s}{\pi}^{\star}_{s})^{\top}V. Since π\pi^{\star} is sub-optimal for f(V,C,P)f(V,C^{\prime},P^{\prime}), we can upper bound |fs(V,C,P)fs(V,C,P)|(cscs)πs+γ[(PsPs)πs]V|f_{s}(V,C^{\prime},P^{\prime})-f_{s}(V,C,P)|\leq(c_{s}^{\prime}-c_{s})^{\top}\pi^{\star}_{s}+\gamma[(P^{\prime}_{s}-P_{s}){\pi_{s}}^{\star}]^{\top}V. Since πs,π^sΔA\pi^{\star}_{s},\hat{\pi}_{s}\in\Delta_{A}, πs1\left\lVert\pi^{\star}_{s}\right\rVert_{\infty}\leq 1. We conclude that (64) holds when fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\geq f_{s}(V,C,P). For case 2), fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\leq f_{s}(V,C,P), (64) also holds by similar arguments.

Letting m^{\prime}=(C^{\prime},P^{\prime}) and m=(C,P), and writing f=f(V,m) and f^{\prime}=f(V,m^{\prime}), we can upper bound \left\lVert f-f^{\prime}\right\rVert_{\infty} as

ff\displaystyle\left\lVert f-f^{\prime}\right\rVert_{\infty} maxs[S]{cscs+γ(PsPs)V}\displaystyle\leq\max_{s\in[S]}\{\left\lVert c^{\prime}_{s}-c_{s}\right\rVert_{\infty}+\gamma\left\lVert(P_{s}-P^{\prime}_{s})^{\top}V\right\rVert_{\infty}\} (65)
max(1,γV)mm.\displaystyle\leq\max(1,\gamma\left\lVert V\right\rVert_{\infty})\left\lVert m-m^{\prime}\right\rVert_{\infty}. (66)

The policy evaluation operator gπg^{\pi} satisfies (64) if max{πs,π^s}\max\{\left\lVert\pi^{\star}_{s}\right\rVert_{\infty},\left\lVert\hat{\pi}_{s}\right\rVert_{\infty}\} is replaced by πs\left\lVert{\pi}_{s}\right\rVert_{\infty}. Since πs1\left\lVert{\pi}_{s}\right\rVert_{\infty}\leq 1, gπg^{\pi} is K(V)K(V)-Lipschitz. ∎