
Set-based Value Operators for Non-stationary and Uncertain Markov Decision Processes

Sarah H.Q. Li    Assalé Adjé    Pierre-Loïc Garoche    Behçet Açikmeşe Department of Aeronautics and Astronautics, University of Washington, Seattle, USA. (e-mail:{sarahli, behcet}@uw.edu). LAMPS, Université de Perpignan Via Domitia, Perpignan, France. (e-mail: [email protected]). École Nationale de l’Aviation Civile, Université de Toulouse, Toulouse, France. (e-mail: [email protected]).
Abstract

This paper analyzes finite state Markov Decision Processes (MDPs) with uncertain parameters in compact sets and re-examines results from robust MDPs via set-based fixed point theory. To this end, we generalize the Bellman and policy evaluation operators to contracting operators on the value function space and denote them as value operators. We lift these value operators to act on sets of value functions and denote them as set-based value operators. We prove that the set-based value operators are contractions in the space of compact value function sets. Leveraging insights from set theory, we generalize the rectangularity condition in the classic robust MDP literature to a containment condition for all value operators, which is weaker and can be applied to a larger set of parameter-uncertain MDPs and contracting operators in dynamic programming. We prove that both the rectangularity condition and the containment condition are sufficient to ensure that the set-based value operator’s fixed point set contains its own extremal elements. For convex and compact sets of uncertain MDP parameters, we show equivalence between the classic robust value function and the supremum of the fixed point set of the set-based Bellman operator. Under dynamically changing MDP parameters in compact sets, we prove a set convergence result for value iteration, which otherwise may not converge to a single value function. Finally, we derive novel guarantees for probabilistic path planning problems in planet exploration and stratospheric station-keeping.

keywords:
Markov decision process, contraction operator, stochastic control, decision making and autonomy
This research is partly funded by NSF grant CMMI-210563 and the University of Washington Aero&Astro Condit fellowship. Corresponding author: Sarah H.Q. Li. Email: [email protected].


1 Introduction

The Markov decision process (MDP) is a versatile model for decision making in stochastic environments and is widely used in trajectory planning [1], robotics [20], and operations research [4]. Given state-action costs and transition probabilities, finding an optimal policy of the MDP is equivalent to solving for the fixed point value function of the corresponding Bellman operator.

Many application settings of MDPs, including traffic light control, motion planning, and dexterous manipulation, deal with environmental non-stationarity—dynamically changing MDP cost and transition probabilities due to external factors or the presence of interfering decision makers. This environmental non-stationarity corresponds to uncertainty in the MDP transition and cost parameters and differs from an MDP’s internal stochasticity, which is modeled by stationary stochastic dynamics whose probability distributions do not change over time.

Example 1 (Navigating in changing wind).

An autonomous aircraft navigates in a two-dimensional and time-varying wind field towards a non-stationary target, where the wind field varies between NN major patterns over time. The aircraft’s transition probabilities are constructed from global averages of local wind observations, and the aircraft’s objective is to reach the location of the non-stationary target, which is also affected by wind. If the wind pattern strictly switches between the discrete wind trends, then the transition uncertainty at state s[S]s\in[S] is given by the set 𝒫s={Ps1,,PsN}\mathcal{P}_{s}=\{P^{1}_{s},\ldots,P^{N}_{s}\}. Similarly, the reachability cost of each state-action is also given by 𝒞s={Cs1,,CsN}\mathcal{C}_{s}=\{C^{1}_{s},\ldots,C^{N}_{s}\}. Collectively, we say that the MDP has time-varying parameters s={ms1,,msN}=𝒞s×𝒫s\mathcal{M}_{s}=\{m^{1}_{s},\ldots,m^{N}_{s}\}=\mathcal{C}_{s}\times\mathcal{P}_{s} at each state s[S]s\in[S], where msi=(Csi,Psi)m^{i}_{s}=(C^{i}_{s},P^{i}_{s}).
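To make the parameter set concrete, the following is a minimal Python sketch of the per-state set ℳ_s in Example 1, assuming hypothetical values of N = 2 wind patterns, A = 2 actions, and S = 3 states; all numbers are placeholders rather than data from the paper.

```python
# Minimal sketch of M_s = {m_s^1, ..., m_s^N} from Example 1 (hypothetical numbers).
import numpy as np

S, A, N = 3, 2, 2
# One cost vector C_s^i in R^A and one transition matrix P_s^i in R^{S x A} per wind
# pattern; each column p_{sa} of P_s^i is a distribution over next states (Delta_S).
C_s = [np.array([1.0, 2.0]),
       np.array([1.5, 0.5])]
P_s = [np.array([[0.7, 0.2],
                 [0.2, 0.6],
                 [0.1, 0.2]]),
       np.array([[0.4, 0.1],
                 [0.4, 0.8],
                 [0.2, 0.1]])]
M_s = list(zip(C_s, P_s))       # m_s^i = (C_s^i, P_s^i), i = 1, ..., N
assert all(np.allclose(P.sum(axis=0), 1.0) for _, P in M_s)   # columns lie in Delta_S
print(len(M_s))                 # N = 2 admissible parameter values at state s
```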

Environmental non-stationarity differs from parameter uncertainty and yet is closely related. Parameter-uncertain dynamic programming assumes that the MDP has stationary yet unknown stochastic dynamics within a bounded set, and its performance can be bounded by the worst-case value functions studied in robust MDPs, risk-sensitive reinforcement learning, and zero-sum stochastic games—value functions that result from adversarial selections of the MDP parameters. Under environmental non-stationarity, we assume that at every time instant, the MDP parameters are known but vary unpredictably over time. Environmental non-stationarity is a more appropriate assumption than parameter uncertainty for scenarios such as Example 1, where the dynamics are stochastic, observable, non-adversarial, yet time-varying.

Consider dynamic programming under environmental non-stationarity: at every time instant, the dynamic program is updated with respect to known but time-varying MDP parameters. Although interesting and highly relevant to many trajectory planning problems, this setting lacks classic convergence guarantees: value iteration can fail to converge, as simple examples demonstrate. Does this mean that dynamic programming has no convergence guarantees under environmental non-stationarity?

In this paper, we introduce a set-based framework for non-stationary MDPs that provides convergence guarantees under the Hausdorff distance, and demonstrate that this set-based convergence also applies to parameter-uncertain MDPs and is related to robust dynamic programming.

Contributions. For environmental nonstationarity bounded by compact sets, we propose the set-extensions of value operators: a general class of contraction operators that extends the Bellman operator and the policy evaluation operator. We prove the existence of compact fixed point sets of the set-based value operators and show that the set-based value iteration converges. In a non-stationary Markovian environment, standard value iteration may not converge. However, we can show that the point-to-set distance of the resulting value function trajectory to the fixed point set always goes to zero in the limit. We derive a containment condition that is sufficient for the fixed point sets to contain their own extremal elements. Within robust MDPs, we show that the containment condition generalizes the rectangularity condition, such that the optimal worst-case policy, or the robust policy, exists when the containment condition is satisfied. We then derive the relationship between the fixed point sets of 1) the set-based optimistic policy evaluation operator, 2) the set-based robust policy evaluation operator, and 3) the set-based Bellman operator. Given a value operator and a compact MDP parameter uncertainty set, we present an algorithm that computes the bounds of the corresponding fixed point set and derive its convergence guarantees. Finally, we apply our results to the wind-assisted navigation of high altitude platform systems relevant to space exploration [22] and show that our algorithms can be used to derive policies with better guarantees.

Related research. MDPs with parameter uncertainty are well studied in robust control and reinforcement learning. In control theory, the worst-case cost-to-go with respect to state-decoupled parameter uncertainties is derived via a minmax variation of the Bellman operator in [5, 8, 15, 21]. The cost-to-go under parameter uncertainty with coupling between states and time steps is similarly bounded in [12, 6]. The effect of statistical uncertainty on the optimal cost-to-go is studied in [15, 12, 21, 23]. More recently, MDPs with parameter uncertainty have gained traction in the reinforcement learning community due to the presence of uncertainty in real world problems such as traffic signal control and multi-agent coordination [9, 10, 16]. Most RL research extends the minmax worst-case analysis to methods such as Q-learning and SARSA. Methods for value-based RL using non-contracting operators have also been investigated in [3].

As opposed to the worst-case approach to analyzing MDPs under parameter uncertainty, we do not assume adversarial MDP parameter selection. Instead, we derive a set of cost-to-go vectors that is invariant with respect to the compact parameter uncertainty sets for order-preserving, α\alpha-contracting operators, a class to which the Bellman operator belongs. We continue from our previous work [11], in which we analyzed the set-based Bellman operator for cost uncertainty only.

Notation: A set of NN elements is given by [N]={0,,N1}[N]=\{0,\ldots,N-1\}. We denote the set of matrices of ii rows and jj columns with real (non-negative) entries as i×j{\mathbb{R}}^{i\times j} (+i×j{\mathbb{R}}_{+}^{i\times j}), respectively. Matrices and some integers are denoted by capital letters, XX, while sets are denoted by cursive typeset 𝒳\mathcal{X}. The set of all compact subsets of d{\mathbb{R}}^{d} is denoted by 𝒦(d)\mathcal{K}({\mathbb{R}}^{d}). The column vector of ones of size NN\in\mathbb{N} is denoted by 𝟏N=[1,,1]TN×1\mathbf{1}_{N}=[1,\ldots,1]^{T}\in{\mathbb{R}}^{N\times 1}. The identity matrix of size SS is denoted by ISI_{S}. The simplex of dimension SS is denoted by

ΔS={pS| 1Sp=1,p0}.\Delta_{S}=\{p\in{\mathbb{R}}^{S}\ |\ \mathbf{1}_{S}^{\top}p=1,\ p\geq 0\}. (1)

A vector hSh\in{\mathbb{R}}^{S} has equivalent notation (h1,,hS)(h_{1},\ldots,h_{S}), where hsh_{s} is the value of hh in the sths^{th} coordinate, s[S]s\in[S]. Throughout the paper, \left\lVert\cdot\right\rVert denotes the infinity norm in S{\mathbb{R}}^{S}.

2 Discounted infinite-horizon MDP

A discounted infinite-horizon finite state MDP is given by ([S],[A],([S],[A], P,C,γ)P,C,\gamma), where γ(0,1)\gamma\in(0,1) is the discount factor, [S]={1,,S}[S]=\{1,\ldots,S\} is the finite set of states and [A]={1,,A}[A]=\{1,\ldots,A\} is the finite set of actions. Without loss of generality, assume that each action is admissible from each state s[S]s\in[S].

MDP Costs. CS×AC\in{\mathbb{R}}^{S\times A} is the matrix encoding the MDP cost. Each CsaC_{sa}\in{\mathbb{R}} is the cost of taking action a[A]a\in[A] from state s[S]s\in[S]. We also denote the cost of all actions at state ss by cs=[Cs1,,CsA]Ac_{s}=[C_{s1},\ldots,C_{sA}]\in{\mathbb{R}}^{A}, such that C=[c1,,cS]C=[c_{1},\ldots,c_{S}]^{\top}.

MDP Transition Dynamics. The transition probabilities when action aa is taken from state ss are given by psaΔSp_{sa}\in\Delta_{S}. Collectively, all possible transition probabilities from state s[S]s\in[S] are given by the matrix Ps=[ps1,,psA]ΔSAS×A\textstyle P_{s}=[p_{s1},\ldots,p_{sA}]\in\Delta_{S}^{A}\subset{\mathbb{R}}^{S\times A}, and all possible transition probabilities in the MDP are given by the matrix P=[P1,,PS]ΔSSAS×SAP=[P_{1},\ldots,P_{S}]\in\Delta_{S}^{SA}\subset{\mathbb{R}}^{S\times SA}.

MDP Objective. We want to minimize the expected cost-to-go, or the value vector VSV\in{\mathbb{R}}^{S}, defined per state as

Vs:=𝔼s{t=0γtCstat|s0=s},s[S],\textstyle V_{s}:=\ \mathbb{E}_{s}\Big{\{}\sum_{t=0}^{\infty}\gamma^{t}C_{s^{t}a^{t}}\ |\ s^{0}=s\Big{\}},\ \forall\ s\in[S], (2)

where 𝔼s{}\mathbb{E}_{s}\{\cdot\} is the expected value of the input with respect to initial state ss, and (st,ats^{t},a^{t}) are the state and action at time tt.

Remark 2.

Although value function is the standard term for the expected cost-to-go, we use value vector in this paper to emphasize that the cost-to-go values of finite MDPs belong to a finite-dimensional space.

MDP Policy. The decision maker controls the policy, denoted as π=[π1,,πS]ΔAS\pi=[\pi_{1},\ldots,\pi_{S}]\in\Delta_{A}^{S}, where the atha^{th} element of πsΔA\pi_{s}\in\Delta_{A} is the conditional probability of action aa being chosen from state ss. Using the policy, we can minimize the value vector (2) in a closed-loop fashion.

Vs:=minπtΔAS𝔼s{t=0γtCstπt(st)|s0=s},s[S],\textstyle V_{s}^{\star}:=\ \underset{\pi^{t}\in\Delta_{A}^{S}}{\min}\ \mathbb{E}_{s}\Big{\{}\sum_{t=0}^{\infty}\gamma^{t}C_{s^{t}\pi^{t}(s^{t})}\ |\ s^{0}=s\Big{\}},\ \forall\ s\in[S], (3)

Under policy πs\pi_{s}, the expected immediate cost at ss is given by csπsc_{s}^{\top}\pi_{s}\in{\mathbb{R}} and the expected transition probabilities from ss are given by PsπsΔSP_{s}\pi_{s}\in\Delta_{S}.

2.1 Value operators

Solving an MDP is equivalent to finding the value vector and the associated policy that minimizes the objective (3). Typical solution methods utilize order preserving [19, Def.3.1], α\alpha-contractive operators whose fixed points are the optimal value vectors (e.g. Bellman operator [17, Thm.6.2.3], QQ-value operator [13]).

Definition 3 (α\alpha-Contraction).

Let (𝒳,d)(\mathcal{X},d) be a metric space with metric dd. The operator H:𝒳𝒳H:\mathcal{X}\mapsto\mathcal{X} is an α\alpha-contraction if and only if there exists α[0,1)\alpha\in[0,1) such that

d(H(V),H(V))αd(V,V),V,V𝒳.d(H(V),H(V^{\prime}))\leq\alpha d(V,V^{\prime}),\quad\forall\ V,\ V^{\prime}\in\mathcal{X}. (4)
Definition 4 (Order Preservation).

Let (𝒳,)(\mathcal{X},\leq) be an ordered space with partial order \leq. The operator H:𝒳𝒳H:\mathcal{X}\mapsto\mathcal{X} is order preserving if for all V,V𝒳V,V^{\prime}\in\mathcal{X} such that VVV\leq V^{\prime}, H(V)H(V)H(V)\leq H(V^{\prime}).

These operators are typically locally Lipschitz in MDP parameter space.

Definition 5 (K(V)K(V)-Lipschitz).

Let (𝒳,d𝒳)(\mathcal{X},d_{\mathcal{X}}) be a metric space with metric d𝒳d_{\mathcal{X}} and (𝒴,d𝒴)(\mathcal{Y},d_{{\color[rgb]{0,0,0}\mathcal{Y}}}) be a metric space with metric d𝒴d_{\mathcal{Y}}. The operator H:𝒳×𝒴𝒳H:\mathcal{X}\times\mathcal{Y}\mapsto\mathcal{X} is K(V)K(V)-Lipschitz with respect to 𝒴\mathcal{M}\subset\mathcal{Y} if for all V𝒳V\in\mathcal{X}, there exists K(V)+K(V)\in{\mathbb{R}}_{+} such that

d𝒳(H(V,m),H(V,m))K(V)d𝒴(m,m),m,m.d_{\mathcal{X}}(H(V,m),H(V,m^{\prime}))\leq K(V)d_{\mathcal{Y}}(m,m^{\prime}),\ \forall m,m^{\prime}\in\mathcal{M}. (5)
Remark 6.

The property α\alpha-contraction is a special case of Lipschitz continuity when the input and output spaces are identical and the Lipschitz constant is less than 11.

To capture operators with these properties, we define a value operator that takes inputs: value vector, MDP cost, and MDP transition probability. The MDP cost and transition probability are selected from an MDP parameter set \mathcal{M}.

Definition 7 (Value operator).

Consider the operator hh,

h:S×S,S×A×ΔSSA.h:{\mathbb{R}}^{S}\times\mathcal{M}\mapsto{\mathbb{R}}^{S},\ \mathcal{M}\subseteq{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA}. (6)

We say hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} if

  1. For all mm\in\mathcal{M}, h(,m)h(\cdot,m) is an α\alpha-contraction in S{\mathbb{R}}^{S}.

  2. For all mm\in\mathcal{M}, h(,m)h(\cdot,m) is order preserving in S{\mathbb{R}}^{S}.

  3. For all VSV\in{\mathbb{R}}^{S}, h(V,m)h(V,m) is K(V)K(V)-Lipschitz on \mathcal{M}.

Remark 8.

While we only consider value operators whose input’s first component is S{\mathbb{R}}^{S}, Definition 7 and the subsequent results can be extended to the space of QQ-value functions by swapping S{\mathbb{R}}^{S} for SA{\mathbb{R}}^{SA} in Definition 7 [13].

Figure 1: Illustration of the three value operator properties. (a) α\alpha-contraction on S{\mathbb{R}}^{S}, (b) Order preservation on S{\mathbb{R}}^{S}, and (c) K(V)K(V)-Lipschitz in input space \mathcal{M}.

An immediate consequence of the value operator hh being an α\alpha-contractive and order-preserving operator on S{\mathbb{R}}^{S} is that hh is continuous on S×{\mathbb{R}}^{S}\times\mathcal{M}.

Lemma 9 (Continuity).

If hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M}, hh is continuous on S×{\mathbb{R}}^{S}\times\mathcal{M}.

See App. B for proof. Examples of value operators include the Bellman operator and the policy evaluation operators when the MDP cost and transition probability are input parameters rather than fixed parameters.

Definition 10 (Policy evaluation operator).

Given a policy πΠ\pi\in\Pi, the vector-valued operator gπ=(g1π,,gSπ):S×S×A×ΔSSASg^{\pi}=(g^{\pi}_{1},\ldots,g^{\pi}_{S}):{\mathbb{R}}^{S}\times{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA}\mapsto{\mathbb{R}}^{S} is defined per state as

gsπ(V,C,P):=csπs+γ(Psπs)V,s[S].g^{\pi}_{s}(V,C,P):=c_{s}^{\top}\pi_{s}+\gamma\Big{(}P_{s}\pi_{s}\Big{)}^{\top}V,\ \forall s\in[S]. (7)

Given (C,P)(C,P), gπ(,C,P):SSg^{\pi}(\cdot,C,P):{\mathbb{R}}^{S}\mapsto{\mathbb{R}}^{S} is a vector-valued operator whose fixed point is the expected cost-to-go of the MDP ([S],[A],C,P,γ)([S],[A],C,P,\gamma) under π\pi, denoted as Vπ(C,P)V^{\pi}(C,P) [17, Thm.6.2.5].

Vπ(C,P)=gπ(Vπ,C,P),Vπ(C,P)S.~{}V^{\pi}(C,P)=g^{\pi}(V^{\pi},C,P),\ V^{\pi}(C,P)\in{\mathbb{R}}^{S}. (8)

When the context is clear, we denote Vπ(C,P)V^{\pi}(C,P) as VπV^{\pi}.
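As a concrete illustration, the sketch below implements the policy evaluation operator (7) for a hypothetical 2-state, 2-action MDP and iterates it to approximate the fixed point in (8); the MDP data and the policy are placeholders, not values from the paper.

```python
# Minimal sketch of the policy evaluation operator (7) and its fixed point (8).
import numpy as np

gamma = 0.9
C = np.array([[1.0, 4.0],
              [0.0, 2.0]])                      # C[s, a]: cost of action a at state s
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])        # P[s, a, s']: transition probabilities
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])                     # pi[s, a]: probability of a at s

def g_pi(V, C, P, pi):
    """g^pi_s(V, C, P) = c_s^T pi_s + gamma * (P_s pi_s)^T V."""
    expected_cost = np.sum(C * pi, axis=1)              # c_s^T pi_s for each s
    expected_next = np.einsum("saj,sa->sj", P, pi)      # P_s pi_s for each s
    return expected_cost + gamma * expected_next @ V

V = np.zeros(2)
for _ in range(200):                   # gamma-contraction: iterate to the fixed point
    V = g_pi(V, C, P, pi)
print("V^pi ≈", V)
```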

Definition 11 (Bellman operator).

The vector-valued operator f=(f1,,f=(f_{1},\ldots, fS):S×S×A×ΔSSASf_{S}):{\mathbb{R}}^{S}\times{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA}\mapsto{\mathbb{R}}^{S} is defined per each state as

fs(V,C,P):=infπsΔAgsπ(V,C,P),s[S].f_{s}(V,C,P):=\inf_{\pi_{s}\in\Delta_{A}}\ g^{\pi}_{s}(V,C,P),\ \forall\,s\in[S]. (9)

The corresponding optimal policy π=(π1,,πs)\pi^{\star}=(\pi_{1}^{\star},\ldots,\pi_{s}^{\star}) is defined per state as πsargminπsgsπ(V,C,P)\pi_{s}^{\star}\in\mathop{\rm argmin}_{\pi_{s}}g^{\pi}_{s}(V,C,P) (9). One such policy is defined for all (s,a)[S]×[A](s,a)\in[S]\times[A] by

πsa:={>0aargmina[A]Csa+γpsaV0otherwise,a[A]πsa=1.\pi^{\star}_{sa}:=\begin{cases}>0&a\in\underset{a\in[A]}{\mathop{\rm argmin}}\ C_{sa}+\gamma p_{sa}^{\top}V\\ 0&\text{otherwise}\end{cases},\sum_{a\in[A]}\pi^{\star}_{sa}=1. (10)

where argmina[A](h)\mathop{\rm argmin}_{a\in[A]}(h) is the set of minimizing actions for the function hh. An optimal policy in the form (10) always exists for a discounted infinite horizon MDP [17, Thm 6.2.10]. Given parameters (C,P)(C,P), f(,C,P):SSf(\cdot,C,P):{\mathbb{R}}^{S}\mapsto{\mathbb{R}}^{S} is a vector operator whose fixed point is the optimal cost-to-go for the MDP ([S],[A],P,C,γ)([S],[A],P,C,\gamma), denoted as VB(C,P)V^{B}(C,P).

VB(C,P)=f(VB,C,P),VB(C,P)S.~{}V^{B}(C,P)=f(V^{B},C,P),\ V^{B}(C,P)\in{\mathbb{R}}^{S}. (11)

When the context is clear, we denote VB(C,P)V^{B}(C,P) as VB.V^{B}.
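Similarly, the sketch below implements the Bellman operator (9) and extracts one greedy policy of the form (10), reusing the same hypothetical MDP data; since (9) is a linear program over the simplex ΔA\Delta_{A}, its minimum is attained at a deterministic action, so the code minimizes over actions.

```python
# Minimal sketch of the Bellman operator (9), value iteration, and a greedy policy (10).
import numpy as np

gamma = 0.9
C = np.array([[1.0, 4.0],
              [0.0, 2.0]])                          # C[s, a]
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])            # P[s, a, s']

def bellman(V, C, P):
    """f_s(V, C, P) = min_a ( C[s, a] + gamma * p_{sa}^T V )."""
    Q = C + gamma * P @ V                           # Q[s, a]
    return Q.min(axis=1)

V = np.zeros(C.shape[0])
for _ in range(200):                                # standard value iteration
    V = bellman(V, C, P)

greedy = (C + gamma * P @ V).argmin(axis=1)         # a deterministic policy satisfying (10)
print("V^B ≈", V, " greedy actions:", greedy)
```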

We show that both (7) and (9) are value operators.

Lemma 12.

The Bellman operator (9) and the policy evaluation operators (7) for all πΠ\pi\in\Pi are value operators on S×{\mathbb{R}}^{S}\times\mathcal{M} where S×A×ΔSSA\mathcal{M}\subseteq{\mathbb{R}}^{S\times A}\times\Delta_{S}^{SA} (6).

See App. C for proof.

Remark 13.

Beyond the policy evaluation operator and the Bellman operator, many algorithms in reinforcement learning can be reformulated using value operators. For example, it is not difficult to show that the Q-learning operator [13] is a value operator on the vector space SA{\mathbb{R}}^{SA}.

3 Set-based value operators

Motivated by stochastic and time-varying Markovian dynamics, we now consider set-based value operators with respect to a compact set of MDP parameters. We first introduce Hausdorff-type set distances.

Definition 14 (Point-to-set Distance).

The distance between a value vector and a set 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S} is given by

Wd(W,𝒱):=infV𝒱WV.\textstyle W\mapsto d(W,\mathcal{V}):=\underset{V\in\mathcal{V}}{\inf}\left\lVert W-V\right\rVert_{{\color[rgb]{0,0,0}\infty}}. (12)

On the space of compact subsets of S{\mathbb{R}}^{S}, given by 𝒦(S)\mathcal{K}({\mathbb{R}}^{S}), the distance between value vector sets extends (12) and is given by the Hausdorff distance [7].

Definition 15 (Set-to-set Distance).

The Hausdorff distance between two value vector sets 𝒱,𝒲S\mathcal{V},\mathcal{W}\subseteq{\mathbb{R}}^{S} is given by

d𝒦(𝒱,𝒲):=max{supV𝒱d(V,𝒲),supW𝒲d(W,𝒱)}.\displaystyle d_{\mathcal{K}}(\mathcal{V},\mathcal{W}):=\max\left\{\sup_{V\in\mathcal{V}}d(V,\mathcal{W}),\sup_{W\in\mathcal{W}}d(W,\mathcal{V})\right\}. (13)

We use (𝒦(S),d𝒦)(\mathcal{K}({\mathbb{R}}^{S}),d_{\mathcal{K}}) to denote the metric space formed by the set of all compact subsets of S{\mathbb{R}}^{S} under the Hausdorff distance d𝒦d_{\mathcal{K}}. The induced Hausdorff space is complete if and only if the original metric space is complete [7, Thm 3.3]. Therefore, (𝒦(S),d𝒦)(\mathcal{K}({\mathbb{R}}^{S}),d_{\mathcal{K}}) is a complete metric space.
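Both distances can be evaluated directly for finite sets of value vectors; the sketch below is a minimal implementation of (12) and (13) under the infinity norm, with hypothetical two-dimensional data.

```python
# Minimal sketch of the point-to-set distance (12) and the Hausdorff distance (13).
import numpy as np

def point_to_set(W, V_set):
    """d(W, V_set) = inf_{V in V_set} ||W - V||_inf; V_set is an (n, S) array."""
    return np.min(np.max(np.abs(V_set - W), axis=1))

def hausdorff(V_set, W_set):
    """d_K(V_set, W_set) = max( sup_V d(V, W_set), sup_W d(W, V_set) )."""
    d_vw = max(point_to_set(V, W_set) for V in V_set)
    d_wv = max(point_to_set(W, V_set) for W in W_set)
    return max(d_vw, d_wv)

V_set = np.array([[0.0, 0.0], [1.0, 1.0]])    # hypothetical value vector sets in R^2
W_set = np.array([[0.0, 0.5]])
print(hausdorff(V_set, W_set))                # max(1.0, 0.5) = 1.0
```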

For a value operator hh (6), we ask the following question: what is the set of possible value vectors when the MDP has parameter non-stationarity given by \mathcal{M}? To resolve this, we define the set-based value operator HH.

Definition 16 (Set-based Value Operator).

The set-valued operator HH is induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M} (6) and is defined as

H(𝒱):={h(V,m)|(V,m)𝒱×}S,H(\mathcal{V}):=\left\{h(V,m)\ |\ (V,m)\in\mathcal{V}\times\mathcal{M}\right\}\subseteq{\mathbb{R}}^{S}, (14)

where 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S} is a subset of the value vector space.

Figure 2: Illustration of the set-based operator H(𝒱)H(\mathcal{V}) applied to the singleton set 𝒱={V}S\mathcal{V}=\{V\}\subset{\mathbb{R}}^{S}: we compute h(V,m)h(V,m) for every parameter mm\in\mathcal{M} and collect the outputs, such that H(𝒱)=mh(V,m)H(\mathcal{V})=\cup_{m\in\mathcal{M}}h(V,m).

We denote the set-based value operator induced by the Bellman operator (9) and policy evaluation operators (7) as FF and GπG^{\pi}, respectively, such that for any value vector set 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S},

F(𝒱):={f(V,C,P)|(V,C,P)𝒱×},F(\mathcal{V}):=\left\{f(V,C,P)\ |\ (V,C,P)\in\mathcal{V}\times\mathcal{M}\right\}, (15)
Gπ(𝒱):={gπ(V,C,P)|(V,C,P)𝒱×},πΠ.G^{\pi}(\mathcal{V}):=\left\{g^{\pi}(V,C,P)\ |\ (V,C,P)\in\mathcal{V}\times\mathcal{M}\right\},\ \forall\ \pi\in\Pi. (16)

The set-based Bellman operator F(𝒱)F(\mathcal{V}) is the union over all one-step optimal value vectors, which may result from different policies, while Gπ(𝒱)G^{\pi}(\mathcal{V}) is the union over all value vectors that result from the same policy π\pi.

We can ask the following question: is there a set of value vectors that is invariant with respect to HH? Similar to the value operators hh from Definition 7, we can affirmatively answer this question by showing that HH is α\alpha-contractive on 𝒦(S)\mathcal{K}({\mathbb{R}}^{S}).

Theorem 17.

If hh is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} (6) and \mathcal{M} is compact, then the induced set-based value operator HH (14) satisfies

  1. For all 𝒱𝒦(S)\mathcal{V}\in\mathcal{K}({\mathbb{R}}^{S}), H(𝒱)𝒦(S)H\left(\mathcal{V}\right)\in\mathcal{K}({\mathbb{R}}^{S});

  2. HH is α\alpha-contractive on (𝒦(S),d𝒦)(\mathcal{K}({\mathbb{R}}^{S}),d_{\mathcal{K}}) (13) with a unique fixed point set 𝒱\mathcal{V}^{\star} given by

     H(𝒱)=𝒱,𝒱𝒦(S);\textstyle H(\mathcal{V}^{\star})=\mathcal{V}^{\star},\quad\mathcal{V}^{\star}\in\mathcal{K}({\mathbb{R}}^{S}); (17)

  3. The sequence {𝒱k}k\{\mathcal{V}^{k}\}_{k\in\mathbb{N}} where 𝒱k+1=H(𝒱k)\mathcal{V}^{k+1}=H(\mathcal{V}^{k}) converges to 𝒱\mathcal{V}^{\star} for any 𝒱0𝒦(S)\mathcal{V}^{0}\in\mathcal{K}({\mathbb{R}}^{S}).

In particular, these hold for FF (15) and GπG^{\pi} (16), whose fixed point sets are denoted as 𝒱B\mathcal{V}^{B} and 𝒱π\mathcal{V}^{\pi}, respectively.

F(𝒱B)=𝒱B𝒦(S),Gπ(𝒱π)=𝒱π𝒦(S),πΠ.F(\mathcal{V}^{B})=\mathcal{V}^{B}\in\mathcal{K}({\mathbb{R}}^{S}),\ G^{\pi}(\mathcal{V}^{\pi})=\mathcal{V}^{\pi}\in\mathcal{K}({\mathbb{R}}^{S}),\ \forall\pi\in\Pi. (18)
Proof.

The first statement follows from Lemma 9, since the image of a compact set by a continuous function is compact [18]. Let us prove the second statement: we show that for all 𝒱,𝒱𝒦(S)\mathcal{V},\mathcal{V}^{\prime}\in\mathcal{K}({\mathbb{R}}^{S}),

d𝒦(H(𝒱),H(𝒱))\displaystyle d_{\mathcal{K}}(H(\mathcal{V}),H(\mathcal{V}^{\prime}))
=\displaystyle= max{supV𝒱md(h(V,m),H(𝒱)),supV𝒱md(h(V,m),H(𝒱))}\displaystyle\max\left\{\sup_{\begin{subarray}{c}V\in\mathcal{V}\\ m\in\mathcal{M}\end{subarray}}d\left(h(V,m),H(\mathcal{V}^{\prime})\right),\sup_{\begin{subarray}{c}V^{\prime}\in\mathcal{V}^{\prime}\\ m^{\prime}\in\mathcal{M}\end{subarray}}d(h(V^{\prime},m^{\prime}),H(\mathcal{V}))\right\}
\displaystyle\leq αd𝒦(𝒱,𝒱)\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime})

Take (V,m)𝒱×(V,m)\in\mathcal{V}\times\mathcal{M}, then d(h(V,m),H(𝒱))infV𝒱h(V,m)h(V,m)αinfV𝒱VVd(h(V,m),H(\mathcal{V}^{\prime}))\leq\inf_{V^{\prime}\in\mathcal{V}^{\prime}}\left\lVert h(V,m)-h(V^{\prime},m)\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\alpha\inf_{V^{\prime}\in\mathcal{V}^{\prime}}\left\lVert V-V^{\prime}\right\rVert_{{\color[rgb]{0,0,0}\infty}} holds from the α\alpha-contractive property of hh. Finally,

supV𝒱md(h(V,m),H(𝒱))\displaystyle\sup_{\begin{subarray}{c}V\in\mathcal{V}\\ m\in\mathcal{M}\end{subarray}}d(h(V,m),H(\mathcal{V}^{\prime}))\leq αsupV𝒱infV𝒱VV\displaystyle\alpha\sup_{V\in\mathcal{V}}\inf_{V^{\prime}\in\mathcal{V}^{\prime}}\left\lVert V-V^{\prime}\right\rVert_{\infty}
\displaystyle\leq αd𝒦(𝒱,𝒱)\displaystyle\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime})

We use the same technique to prove that

supV𝒱md(h(V,m),H(𝒱))αd𝒦(𝒱,𝒱).\sup_{\begin{subarray}{c}V^{\prime}\in\mathcal{V}^{\prime}\\ m^{\prime}\in\mathcal{M}\end{subarray}}d(h(V^{\prime},m^{\prime}),H(\mathcal{V}))\leq\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime}).

Finally, d𝒦(H(𝒱),H(𝒱))αd𝒦(𝒱,𝒱)d_{\mathcal{K}}(H(\mathcal{V}),H(\mathcal{V}^{\prime}))\leq\alpha d_{\mathcal{K}}(\mathcal{V},\mathcal{V}^{\prime}). From the Banach fixed point theorem and the completeness of (𝒦(S),d𝒦)({\mathcal{K}}({\mathbb{R}}^{S}),d_{\mathcal{K}}) [7, Thm 3.3], HH has a unique fixed point 𝒱\mathcal{V}^{\star} in 𝒦(S){\mathcal{K}}({\mathbb{R}}^{S}).

The third point is a consequence of the Banach fixed point theorem. Finally, ff and gπg^{\pi} are value operators (6) on S×{\mathbb{R}}^{S}\times\mathcal{M}, therefore this theorem’s statements apply. ∎

Remark 18 (Set-based value iteration).

An important consequence of Theorem 17 is the convergence of the set-based value iteration, given by

𝒱k+1=H(𝒱k),𝒱0𝒦(S).\mathcal{V}^{k+1}=H(\mathcal{V}^{k}),\ \mathcal{V}^{0}\in\mathcal{K}({\mathbb{R}}^{S}). (19)

Analogous to standard value iteration, (19) is a sequence of value vector sets in 𝒦(S)\mathcal{K}({\mathbb{R}}^{S}) that converges to the fixed point set 𝒱𝒦(S)\mathcal{V}^{\star}\in\mathcal{K}({\mathbb{R}}^{S}).
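When \mathcal{M} is a finite set and the initial set 𝒱0\mathcal{V}^{0} is finite, every iterate of (19) is itself a finite set, so the set-based value iteration can be enumerated explicitly. The sketch below does this for the set-based Bellman operator FF (15) with a hypothetical two-element \mathcal{M} and monitors the Hausdorff distance between successive iterates; the naive enumeration grows exponentially in kk and is intended only to illustrate the contraction in Theorem 17.

```python
# Minimal sketch of the set-based value iteration (19) for F (15) with finite M.
import numpy as np
from itertools import product

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]                          # hypothetical two-element parameter set

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

def F(V_set):
    """F(V_set) = { f(V, C, P) : V in V_set, (C, P) in M }."""
    return [bellman(V, C, P) for V, (C, P) in product(V_set, M)]

def hausdorff(A, B):
    d = lambda x, Y: min(np.max(np.abs(np.array(Y) - x), axis=1))
    return max(max(d(a, B) for a in A), max(d(b, A) for b in B))

V_set = [np.zeros(2)]                             # V^0 = {0}
for k in range(10):
    V_next = F(V_set)                             # |V^(k+1)| = |V^k| * |M| in this enumeration
    print(f"k={k:2d}  d_K(V^(k+1), V^k) = {hausdorff(V_next, V_set):.4f}")
    V_set = V_next
```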

4 Properties of the fixed point set

For the MDP parameters (C,P)(C,P), the fixed point of h(,C,P)h(\cdot,C,P) is typically meaningful for the corresponding MDP. For example, the fixed point of a policy evaluation operator gπ(,C,P)g^{\pi}(\cdot,C,P) (7) is the expected cost-to-go under policy π\pi, and the fixed point of the Bellman operator f(,C,P)f(\cdot,C,P) (9) is the minimum cost-to-go when π\pi can be freely chosen. In this section, we derive properties of the fixed point set 𝒱\mathcal{V} of HH (14) in the context of non-stationary value iteration.

4.1 Non-stationary value iteration

Given a value operator hh on S×{\mathbb{R}}^{S}\times\mathcal{M}, we consider value iteration under a dynamic parameter uncertainty model discussed in [15], where at every iteration, a new set of MDP parameters mkm^{k} is chosen from \mathcal{M} as

Vk+1=h(Vk,mk),V0S,mk,k.V^{k+1}=h(V^{k},m^{k}),\ V^{0}\in{\mathbb{R}}^{S},\ m^{k}\in\mathcal{M},\forall k\in\mathbb{N}. (20)

In robust MDP literature [8, 15], mkm^{k} is modified by an adversarial opponent of the MDP decision maker such that (20) converges to a worst-case value vector. We consider a more general scenario in which mkm^{k} is chosen from the closed and bounded set \mathcal{M} without any probabilistic prior. In this scenario, convergence of VkV^{k} in S{\mathbb{R}}^{S} need not occur for arbitrary sequences {mk}k\{m^{k}\}_{k\in{\mathbb{N}}}. However, we can show convergence results on the set domain by leveraging our fixed point analysis of the set-based operator HH (14).
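The sketch below simulates the non-stationary value iteration (20) for the Bellman operator with a hypothetical two-element parameter set \mathcal{M}, drawing mkm^{k} arbitrarily (here, uniformly at random) at each step. The iterates do not settle to a single vector, which is exactly the behavior addressed by the set-based results that follow.

```python
# Minimal sketch of the non-stationary value iteration (20) with arbitrarily chosen m^k.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]                        # hypothetical parameter uncertainty set

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

V = np.zeros(2)
for k in range(60):
    C, P = M[rng.integers(len(M))]              # environment selects m^k; no prior assumed
    V = bellman(V, C, P)
print("V^60 =", V)   # remains bounded but need not converge; see Proposition 19 below
```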

Proposition 19.

Let 𝒱\mathcal{V}^{\star} be the fixed point set of the set-based value operator HH (14) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M} (6). If the non-stationary value iteration (20) satisfies {mk}k\{m^{k}\}_{k\in\mathbb{N}}\subset\mathcal{M}, then the sequence {Vk}k\{V^{k}\}_{k\in\mathbb{N}} defined by (20) satisfies

  1. limk+d(Vk,𝒱)=0\lim_{k\to+\infty}d(V^{k},\mathcal{V}^{\star})=0,

  2. there exists a sub-sequence {Vφ(k)}k\{V^{\varphi(k)}\}_{k\in\mathbb{N}} that converges to a point in 𝒱\mathcal{V}^{\star}, i.e., limkVφ(k)𝒱\lim_{k\rightarrow\infty}V^{\varphi(k)}\in\mathcal{V}^{\star}.

Proof.

Let {𝒱k}k\{\mathcal{V}^{k}\}_{k\in\mathbb{N}} be the sequence of sets defined by 𝒱0={V0}\mathcal{V}^{0}=\{V^{0}\} and 𝒱k+1=H(𝒱k)\mathcal{V}^{k+1}=H(\mathcal{V}^{k}), where HH (14) is the set operator induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. By induction on (20), Vk𝒱kV^{k}\in\mathcal{V}^{k} for all kk\in\mathbb{N}. We first show statement 1). From Theorem 17, 𝒱k\mathcal{V}^{k} converges to 𝒱\mathcal{V}^{\star} in d𝒦d_{\mathcal{K}}. Therefore, 0d(Vk,𝒱)=infy𝒱Vkysupx𝒱kinfy𝒱xyd𝒦(𝒱k,𝒱)00\leq d(V^{k},\mathcal{V}^{\star})=\inf_{y\in\mathcal{V}^{\star}}\left\lVert V^{k}-y\right\rVert_{\infty}\leq\sup_{x\in\mathcal{V}^{k}}\inf_{y\in\mathcal{V}^{\star}}\left\lVert x-y\right\rVert_{\infty}\leq d_{{\color[rgb]{0,0,0}\mathcal{K}}}(\mathcal{V}^{k},\mathcal{V}^{\star})\to 0 as kk tends to ++\infty.

Next, for all kk\in\mathbb{N}, there exists NN\in\mathbb{N} such that for all nNn\geq N, d(Vn,𝒱)(k+1)1d(V^{n},\mathcal{V}^{\star})\leq(k+1)^{-1}. We define the strictly increasing function ψ1:\psi_{1}:\mathbb{N}\to\mathbb{N}, such that ψ1(0)=0\psi_{1}(0)=0 and for all k0k\neq 0, ψ1(k):=min{N>ψ1(k1):nN,d(Vn,𝒱)<(k+1)1}\psi_{1}(k):=\min\{N>\psi_{1}(k-1):\forall n\geq N,\ d(V^{n},\mathcal{V}^{\star})<(k+1)^{-1}\}. Then, for all kk\in\mathbb{N}^{\star}, there exists yψ1(k)𝒱y^{\psi_{1}(k)}\in\mathcal{V}^{\star} such that Vψ1(k)yψ1(k)<(k+1)1\left\lVert V^{\psi_{1}(k)}-y^{\psi_{1}(k)}\right\rVert_{{\color[rgb]{0,0,0}\infty}}<(k+1)^{-1}. As 𝒱\mathcal{V}^{\star} is compact, there exists ψ2:\psi_{2}:\mathbb{N}\to\mathbb{N} strictly increasing such that (yψ1(ψ2(k)))k(y^{\psi_{1}(\psi_{2}(k))})_{k} converges to some y𝒱y^{\star}\in\mathcal{V}^{\star} [18, Thm 3.6]. Finally, let ε>0\varepsilon>0, there exist K1,K2K_{1},K_{2}\in\mathbb{N} such that for all lK1l\geq K_{1}, (ψ2(l))1<ε/2(\psi_{2}(l))^{-1}<\varepsilon/2 and for all lK2l^{\prime}\geq K_{2}, yψ1(ψ2(l))y<ε/2\left\lVert y^{\psi_{1}(\psi_{2}(l^{\prime}))}-y^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}<\varepsilon/2. So, taking kmax{K1,K2}k\geq\max\{K_{1},K_{2}\}, we have Vψ1(ψ2(k))yVψ1(ψ2(k))yψ1(ψ2(k))+yψ1(ψ2(k))yε\left\lVert V^{\psi_{1}(\psi_{2}(k))}-y^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\left\lVert V^{\psi_{1}(\psi_{2}(k))}-y^{\psi_{1}(\psi_{2}(k))}\right\rVert_{{\color[rgb]{0,0,0}\infty}}+\left\lVert y^{\psi_{1}(\psi_{2}(k))}-y^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\varepsilon and (Vψ1(ψ2(k)))k(V^{\psi_{1}(\psi_{2}(k))})_{k} converges to y𝒱y^{\star}\in\mathcal{V}^{\star}. ∎ In addition to containing all asymptotic behavior of value vector trajectories under time-varying value iteration, the fixed point set 𝒱\mathcal{V} also contains all fixed points of the value operator h(,C,P)h(\cdot,C,P) when (C,P)(C,P)\in\mathcal{M} (6) are fixed.

Corollary 20.

Let hh (6) be a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} where \mathcal{M} is compact, and let 𝒱\mathcal{V}^{\star} be the fixed point set of the induced set-based value operator HH (14). For any mm\in\mathcal{M}, if VSV\in{\mathbb{R}}^{S} satisfies V=h(V,m)V=h(V,m), then V𝒱V\in\mathcal{V}^{\star}.

Proof.

We construct the sequence {Vk}\{V^{k}\} where Vk+1=h(Vk,m)V^{k+1}=h(V^{k},m) and V0=VV^{0}=V. Then Vk=VV^{k}=V for all kk\in\mathbb{N}. From the second point of Proposition 19, V𝒱V\in\mathcal{V}^{\star} follows. ∎ Going further, we can bound the transient behavior of (20) when V0V^{0} is an element of the fixed point set 𝒱\mathcal{V}^{\star}.

Corollary 21 (Transient behavior).

Let 𝒱\mathcal{V}^{\star} be the fixed point set of the set-based value operator HH (14) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. If \mathcal{M} is compact and V0𝒱V^{0}\in\mathcal{V}^{\star}, then the sequence generated by (20) satisfies {Vk}k𝒱\{V^{k}\}_{k\in\mathbb{N}}\subseteq\mathcal{V}^{\star}.

Proof.

As a fixed point set of HH (14), 𝒱\mathcal{V}^{\star} (17) satisfies 𝒱=H(𝒱)\mathcal{V}^{\star}=H(\mathcal{V}^{\star}), so the following is true by definition of HH: if Vk𝒱V^{k}\in\mathcal{V}^{\star}, then Vk+1=h(Vk,mk)𝒱V^{k+1}=h(V^{k},m^{k})\in\mathcal{V}^{\star}. If V0𝒱V^{0}\in\mathcal{V}^{\star}, then {Vk}k𝒱\{V^{k}\}_{k\in\mathbb{N}}\subseteq\mathcal{V}^{\star} follows by induction. ∎

Remark 22.

Proposition 19 and Corollary 21 bound the asymptotic and transient behavior of the sequence {h(Vk,mk)}k\{h(V^{k},m^{k})\}_{k\in{\mathbb{N}}} generated from (20), regardless of the convergence of the value vector sequence. This is a more general result than the classic convergence results for MDPs and robust MDPs.

Remark 23.

Corollary 21 also implies that 𝒱\mathcal{V}^{\star} is invariant with respect to the non-stationary value iteration (20), and may prove useful in the analysis and design of MDPs with known parameter uncertainties.

4.2 Bounds of the fixed point set

In Theorem 17, the compactness of \mathcal{M} implied the compactness of 𝒱\mathcal{V}^{\star}. The structure of \mathcal{M} also determines whether 𝒱\mathcal{V}^{\star} contains its own extremal elements: if \mathcal{M} satisfies Assumption 30 with respect to hh, then 𝒱\mathcal{V}^{\star} contains its own supremum and infimum elements.

Greatest and least elements. We define the supremum and infimum elements of a value vector set 𝒱𝒦(S)\mathcal{V}\in\mathcal{K}({\mathbb{R}}^{S}) element-wise as follows,

V¯s:=supV𝒱Vs,V¯s:=infV𝒱Vs,s[S].\overline{V}_{s}:=\sup_{V\in\mathcal{V}}V_{s},\ \underline{V}_{s}:=\inf_{V\in\mathcal{V}}V_{s},\forall\ s\in[S]. (21)
Figure 3: The greatest and least elements of three different value function sets 𝒱i2\mathcal{V}^{i}\subset{\mathbb{R}}^{2}, where the origin (0,0)(0,0) is located at the lower left. Note that 𝒱2\mathcal{V}^{2} and 𝒱3\mathcal{V}^{3} contain their own greatest and least elements, but 𝒱1\mathcal{V}^{1} does not. In 𝒱1\mathcal{V}^{1}, each coordinate-wise greatest and least value is achieved by some element of 𝒱1\mathcal{V}^{1}, but no single element achieves them simultaneously.

If a set 𝒱S\mathcal{V}\subseteq{\mathbb{R}}^{S} is compact, the projection of 𝒱\mathcal{V} onto each state ss is compact. Then, the coordinate-wise supremum and infimum values for each state ss are achieved by elements of 𝒱\mathcal{V}. However, in general no single element of 𝒱\mathcal{V} simultaneously achieves the supremum (or infimum) over all states; i.e., V¯\overline{V} (V¯\underline{V}) may not be an element of 𝒱\mathcal{V}. This is illustrated in Figure 3.
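A small numerical instance of (21), assuming a hypothetical finite set that mirrors 𝒱1\mathcal{V}^{1} in Figure 3: every coordinate-wise extremum is attained by some member of the set, yet the vectors V¯\overline{V} and V¯\underline{V} themselves are not members.

```python
# Minimal sketch of the coordinate-wise supremum and infimum (21) of a finite set.
import numpy as np

V_set = np.array([[1.0, 3.0],
                  [2.0, 1.0]])                 # hypothetical compact (finite) set in R^2
V_sup = V_set.max(axis=0)                      # [2., 3.]
V_inf = V_set.min(axis=0)                      # [1., 1.]
sup_in = any(np.array_equal(V_sup, v) for v in V_set)
inf_in = any(np.array_equal(V_inf, v) for v in V_set)
print(V_sup, V_inf, sup_in, inf_in)            # neither bound belongs to the set
```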

Given hh and parameter uncertainty set \mathcal{M}, we wish to 1) bound the supremum and infimum elements of the fixed point set 𝒱\mathcal{V}^{\star} (17) and 2) derive sufficient conditions for when they are elements of 𝒱\mathcal{V}^{\star}. To facilitate bounding 𝒱\mathcal{V}^{\star}, we introduce the following bound operators.

Definition 24 (Bound Operators).

The bound operators induced by the value operator hh on S×{\mathbb{R}}^{S}\times\mathcal{M} are coordinate-wise defined at each s[S]s\in[S] as

h¯s(V)=infmhs(V,m),h¯s(V)=supmhs(V,m).\underline{h}_{s}(V)=\inf_{m\in\mathcal{M}}h_{s}(V,m),\ \overline{h}_{s}(V)=\sup_{m\in\mathcal{M}}h_{s}(V,m). (22)
Figure 4: We visualize the bound operators for H(𝒱)H(\mathcal{V}) for a given value operator hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. The input set 𝒱\mathcal{V} is a singleton {V}\{V\} in 2{\mathbb{R}}^{2}. Here, because h¯1\underline{h}_{1} and h¯2\underline{h}_{2} are attained at two different parameters mm\in\mathcal{M}, the resulting h¯(V)\underline{h}(V) lies outside of the fixed point set.
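For a finite parameter set, the bound operators (22) reduce to coordinate-wise minima and maxima over \mathcal{M}, and their fixed points (23) can be computed by the contraction iteration of Lemma 25. A minimal sketch, with the Bellman operator standing in for hh and hypothetical data:

```python
# Minimal sketch of the bound operators (22) and their fixed points (23) over a finite M.
import numpy as np

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

def h_lower(V):
    """Coordinate-wise infimum over M: h_lower_s(V) = min_m h_s(V, m)."""
    return np.min([bellman(V, C, P) for C, P in M], axis=0)

def h_upper(V):
    """Coordinate-wise supremum over M: h_upper_s(V) = max_m h_s(V, m)."""
    return np.max([bellman(V, C, P) for C, P in M], axis=0)

X_lo, X_hi = np.zeros(2), np.zeros(2)
for _ in range(300):                 # both operators are alpha-contractions (Lemma 25)
    X_lo, X_hi = h_lower(X_lo), h_upper(X_hi)
print("X_lower ≈", X_lo, " X_upper ≈", X_hi)   # bounds on the fixed point set, cf. (24)
```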

We want to bound the fixed point set 𝒱\mathcal{V}^{\star} of the set-based value operator HH (14) by the bound operators h¯/h¯\underline{h}/\overline{h} (22). First, we show that h¯/h¯\underline{h}/\overline{h} are themselves α\alpha-contractive and order preserving on S{\mathbb{R}}^{S}.

Lemma 25 (α\alpha-Contraction).

If hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} and \mathcal{M} is compact, then h¯\underline{h} and h¯\overline{h} (22) are α\alpha-contractions with fixed points X¯,X¯\underline{X},\overline{X}, respectively.

h¯(X¯)=X¯,h¯(X¯)=X¯,X¯,X¯S.\textstyle\overline{h}(\overline{X})=\overline{X},\quad\underline{h}(\underline{X})=\underline{X},\ \underline{X},\overline{X}\in{\mathbb{R}}^{S}. (23)
Proof.

From Lemma 9, hh is continuous and \mathcal{M} is compact, then for all X,YSX,Y\in{\mathbb{R}}^{S}, there exists m^(s)\hat{m}(s)\in\mathcal{M} such that h¯s(Y)=hs(Y,m^(s))\underline{h}_{s}(Y)=h_{s}(Y,\hat{m}(s)) and h¯s(X)hs(X,m^(s))\underline{h}_{s}(X)\leq h_{s}(X,\hat{m}(s)). We upper-bound h¯s(X)h¯s(Y)\underline{h}_{s}(X)-\underline{h}_{s}(Y) by hs(X,m^(s))hs(Y,m^(s))h_{s}(X,\hat{m}(s))-h_{s}(Y,\hat{m}(s)), and use the α\alpha-contraction property of hh to derive

h¯s(X)h¯s(Y)\displaystyle\underline{h}_{s}(X)-\underline{h}_{s}(Y) |hs(X,m^(s))hs(Y,m^(s))|\displaystyle\leq|h_{s}(X,\hat{m}(s))-h_{s}(Y,\hat{m}(s))|
αXY.\displaystyle\leq\alpha\left\lVert X-Y\right\rVert_{\infty}.

Since the roles of XX and YY can be exchanged, we conclude that h¯(X)h¯(Y)αXY\left\lVert\underline{h}(X)-\underline{h}(Y)\right\rVert_{\infty}\leq\alpha\left\lVert X-Y\right\rVert_{\infty}. The proof for h¯\overline{h} follows a similar reasoning with m^(s)argmaxmhs(X,m)\hat{m}(s)\in{\color[rgb]{0,0,0}\mathop{\rm argmax}}_{m\in\mathcal{M}}h_{s}(X,m). The existence of X¯(X¯)\underline{X}(\overline{X}) follows from applying Banach’s fixed point theorem. ∎

Lemma 26 (Order Preservation).

The bound operators h¯\underline{h} and h¯\overline{h} (22) are order-preserving on S{\mathbb{R}}^{S} (Definition 4).

U,VS,UVh¯(U)h¯(V),h¯(U)h¯(V).\forall\ U,V\in{\mathbb{R}}^{S},\ U\leq V\Rightarrow\underline{h}(U)\leq\underline{h}(V),\quad\overline{h}(U)\leq\overline{h}(V).
Proof.

The lemma statement follows directly from the fact that taking the infimum and supremum over mm\in\mathcal{M} preserves order. If h(U,m)h(V,m)h(U,m)\leq h(V,m) for all mm\in\mathcal{M}, then infmh(U,m)infmh(V,m)\inf_{m\in\mathcal{M}}h(U,m)\leq\inf_{m\in\mathcal{M}}h(V,m). A similar argument follows for h¯()=supmh(,m)\overline{h}(\cdot)=\sup_{m\in\mathcal{M}}h(\cdot,m). ∎ Next, we show that the fixed points X¯\underline{X} and X¯\overline{X} bound the fixed point set 𝒱\mathcal{V}^{\star} of the set-based value operator HH (14).

Theorem 27 (Bounding fixed point sets).

If hh (6) is a value operator on S×{\mathbb{R}}^{S}\times\mathcal{M} and \mathcal{M} is compact,

X¯VX¯,V𝒱,\underline{X}\leq V\leq\overline{X},\ \forall\ V\in\mathcal{V}^{\star}, (24)

where X¯\underline{X} and X¯\overline{X} (23) are the fixed points of the bound operators h¯\underline{h} and h¯\overline{h} (22), and 𝒱\mathcal{V}^{\star} is the fixed point set of the set-based value operator HH (14) induced by hh (6) on S×{\mathbb{R}}^{S}\times\mathcal{M}.

Proof.

For 𝒱0={X¯,X¯}\mathcal{V}^{0}=\{\underline{X},\overline{X}\} and 𝒱k+1=H(𝒱k)\mathcal{V}^{k+1}=H(\mathcal{V}^{k}) (19), we first show

X¯VX¯,V𝒱k,\textstyle\underline{X}\leq V\leq\overline{X},\ \forall\ V\in\mathcal{V}^{k}, (25)

via induction. Suppose that (25) is satisfied for 𝒱k\mathcal{V}^{k}. The order preserving property of h(,m)h(\cdot,m) implies that h(X¯,m)h(V,m)h(X¯,m)h(\underline{X},m)\leq h(V,m)\leq h(\overline{X},m) holds for all (V,m)𝒱k×(V,m)\in\mathcal{V}^{k}\times\mathcal{M}. We take the infimum and supremum over h(X¯,m)h(\underline{X},m) and h(X¯,m)h(\overline{X},m), respectively, to show that for all (V,m)𝒱k×(V,m)\in\mathcal{V}^{k}\times\mathcal{M} and s[S]s\in[S],

infmhs(X¯,m)hs(V,m)supmhs(X¯,m).\inf_{m^{\prime}\in\mathcal{M}}h_{s}(\underline{X},m^{\prime})\leq h_{s}(V,m)\leq\sup_{m^{\prime}\in\mathcal{M}}h_{s}(\overline{X},m^{\prime}).

Since X¯\underline{X} and X¯\overline{X} are the fixed points of infmhs(,m)\inf_{m^{\prime}\in\mathcal{M}}h_{s}(\cdot,m^{\prime}) and supmhs(,m)\sup_{m^{\prime}\in\mathcal{M}}h_{s}(\cdot,m^{\prime}) for all s[S]s\in[S], respectively, we conclude that (25) holds for 𝒱k+1\mathcal{V}^{k+1}.

Next, we show that X¯\underline{X} and X¯\overline{X} bound the fixed point set 𝒱\mathcal{V}^{\star} for the hh-induced operator HH (14). From Lemma 45, we know that for all V𝒱V\in\mathcal{V}^{\star}, there exists a strictly increasing sequence ϕ:\phi:\mathbb{N}\mapsto\mathbb{N} and corresponding value vectors {Wϕ(n)}\{W^{\phi(n)}\} such that limnWϕ(n)=V\lim_{n\to\infty}W^{\phi(n)}=V and Wϕ(n)𝒱ϕ(n)W^{\phi(n)}\in\mathcal{V}^{\phi(n)} for the sequence of value vector sets generated from 𝒱0={X¯,X¯}\mathcal{V}^{0}=\{\underline{X},\overline{X}\}. Since X¯Wϕ(n)X¯\underline{X}\leq W^{\phi(n)}\leq\overline{X} holds for all nn, we conclude (24) holds. ∎

5 Revisiting robust MDP

In this section, we re-examine robust MDPs using the set-theoretic analysis developed above, and show that Assumption 30 generalizes the rectangularity assumption made in robust MDPs, thus making robust dynamic programming techniques available to a wider class of MDP problems and contraction operators.

Recall the optimistic value vector WoSW^{o}\in{\mathbb{R}}^{S} and the robust value vector WrSW^{r}\in{\mathbb{R}}^{S} of a discounted MDP ([S],[A],C,P,γ)([S],[A],C,P,\gamma) from [8, 15] as the fixed points of the following operators.

Wso=minπsΔAmin(C,P)gsπ(Wo,C,P),s[S]W^{o}_{s}=\min_{\pi_{s}\in\Delta_{A}}\min_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(W^{o},C,P),\forall s\in[S] (26)
Wsr=minπsΔAmax(C,P)gsπ(Wr,C,P),s[S]W^{r}_{s}=\min_{\pi_{s}\in\Delta_{A}}\max_{(C,P)\in\mathcal{M}}g_{s}^{\pi}(W^{r},C,P),\forall s\in[S] (27)

The optimistic policy πo\pi^{o} and robust policy πr\pi^{r} are the optimal policies corresponding to (26) and (27), respectively.

πsoargminπsΔAmin(C,P)gsπ(Wo,C,P),s[S]\pi^{o}_{s}\in\mathop{\rm argmin}_{\pi_{s}\in\Delta_{A}}\min_{(C,P)\in\mathcal{M}}g_{s}^{\pi}(W^{o},C,P),\forall s\in[S] (28)
πsrargminπsΔAmax(C,P)gsπ(Wr,C,P),s[S]\pi^{r}_{s}\in\mathop{\rm argmin}_{\pi_{s}\in\Delta_{A}}\max_{(C,P)\in\mathcal{M}}g_{s}^{\pi}(W^{r},C,P),\forall s\in[S] (29)

For readability, we denote the policy evaluation operator (7) under πo\pi^{o} as gog^{o} and the policy evaluation operator (7) under πr\pi^{r} as grg^{r}.

When \mathcal{M} is (s,a)(s,a)-rectangular (39), the set of policies satisfying (28) and (29) is non-empty and includes deterministic policies [8, Thm 3.1]. When \mathcal{M} is ss-rectangular and convex, the set of policies satisfying (29) is non-empty but may consist of mixed policies [21, Thm 4]. When \mathcal{M} is compact and convex, we show that the policies (28) and (29) exist.

Proposition 28.

If the MDP parameter set \mathcal{M} is compact and convex, then

  1. WoW^{o} (26) and WrW^{r} (27) exist and satisfy f¯(Wr)=Wr,f¯(Wo)=Wo\overline{f}(W^{r})=W^{r},\ \underline{f}(W^{o})=W^{o}, where f¯\overline{f} and f¯\underline{f} (22) are the bound operators of the Bellman operator (9).

  2. πo\pi^{o} (28) and πr\pi^{r} (29) exist.

Proof.

Recall the Bellman operator ff (9). When ×ΔA\mathcal{M}\times\Delta_{A} is compact, the formulation of the fixed point of f¯\underline{f} (22) is equivalently given by

f¯s(X¯)=min(C,P)minπsΔAgsπ(X¯,C,P),s[S].{\color[rgb]{0,0,0}\underline{f}_{s}}(\underline{X})=\min_{(C,P)\in\mathcal{M}}\min_{\pi_{s}\in\Delta_{A}}g^{\pi}_{s}(\underline{X},C,P),\ \forall s\in[S]. (30)

We note that (30) is identical to the formulation of WoW^{o} (26). Therefore, Wo=X¯W^{o}=\underline{X} is the fixed point of f¯\underline{f}. When \mathcal{M} is compact, WoW^{o} exists due to Lemma 25. From (28), πso\pi^{o}_{s} is the optimal argument of gsπ(Wo,C,P)g^{\pi}_{s}(W^{o},C,P), a continuous function in πs,C,P\pi_{s},C,P minimized over compact sets ΔA×\Delta_{A}\times\mathcal{M} for all s[S]s\in[S]. Therefore πso\pi^{o}_{s} exists. Since πo=(π1o,,πSo)\pi^{o}=(\pi^{o}_{1},\ldots,\pi^{o}_{S}), the optimal πoΠ\pi^{o}\in\Pi exists.

For the robust scenario: when \mathcal{M} is compact, the fixed point of f¯\overline{f} (22), X¯\overline{X}, exists from Lemma 25 and is given by

X¯s=max(C,P)minπsΔAgsπ(X¯,C,P),s[S].\overline{X}_{s}=\max_{(C,P)\in\mathcal{M}}\min_{\pi_{s}\in\Delta_{A}}g^{\pi}_{s}(\overline{X},C,P),\ \forall s\in[S]. (31)

The function gsπ(X¯,C,P)g_{s}^{\pi}(\overline{X},C,P) is concave in (C,P)(C,P) and convex in π\pi. If \mathcal{M} is convex, then we apply the minimax theorem [14] to switch the order of min\min and max\max in (31) to derive

X¯s=minπsΔAmax(C,P)gsπ(X¯,C,P),s[S].\overline{X}_{s}=\min_{\pi_{s}\in\Delta_{A}}\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P),\ \forall s\in[S]. (32)

Equation (32) is identical to (27), therefore Wr=X¯W^{r}=\overline{X} and exists by Lemma 25. In (32), max(C,P)gsπ(X¯,C,P)\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P) is piece-wise linear in πs\pi_{s} and ΔA\Delta_{A} is compact for all s[S]s\in[S], thus argminπsΔA\mathop{\rm argmin}_{\pi_{s}\in\Delta_{A}} max(C,P)gsπ(X¯,C,P)\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P) is non-empty. Finally, since πr=(π1r,,πSr)\pi^{r}=(\pi^{r}_{1},\ldots,\pi^{r}_{S}), πr\pi^{r} exists. ∎
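A minimal sketch of the first statement of Proposition 28 for a hypothetical finite \mathcal{M}: the optimistic value vector WoW^{o} is computed as the fixed point of f¯\underline{f} via (30), and the fixed point X¯\overline{X} of f¯\overline{f} is computed via (31). The identification Wr=X¯W^{r}=\overline{X} in (32) additionally requires \mathcal{M} to be convex, which a two-element set is not, so the computed upper fixed point should be read only as the fixed point of (31).

```python
# Minimal sketch: fixed points of the Bellman bound operators (30) and (31) for finite M.
import numpy as np

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]

def bellman(V, C, P):                 # min over actions = min over pi_s in Delta_A
    return (C + gamma * P @ V).min(axis=1)

W_o = np.zeros(2)                     # iterate (30): min over M of the action-wise minimum
X_up = np.zeros(2)                    # iterate (31): max over M of the action-wise minimum
for _ in range(300):
    W_o = np.min([bellman(W_o, C, P) for C, P in M], axis=0)
    X_up = np.max([bellman(X_up, C, P) for C, P in M], axis=0)
print("W^o ≈", W_o, " X_bar ≈", X_up)
```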

Remark 29.

Since max(C,P)gsπ(X¯,C,P)\max_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(\overline{X},C,P) is piecewise linear in πs\pi_{s}, the optimal πsr\pi_{s}^{r} is a mixed policy in general. This is consistent with the results in [21].

Proposition 28 generalizes the results from [21] to show that (27) exists when \mathcal{M} is compact and convex instead of ss-rectangular and convex. By the definitions (26)-(29), WoW^{o} and WrW^{r} are the fixed points of the bound operators g¯πo\underline{g}^{\pi^{o}} and g¯πr\overline{g}^{\pi^{r}} (22), respectively. They become infimum and supremum elements when \mathcal{M} satisfies Assumption 30 with respect to gog^{o} and grg^{r}. We explicitly derive this result next. First, we introduce some notation: let Go=GπoG^{o}=G^{\pi^{o}} with fixed point set 𝒱o\mathcal{V}^{o}, and Gr=GπrG^{r}=G^{\pi^{r}} with fixed point set 𝒱r\mathcal{V}^{r}.

𝒱o={go(V,C,P)|(C,P),V𝒱o},\mathcal{V}^{o}=\{g^{o}(V,C,P)\ |\ (C,P)\in\mathcal{M},V\in\mathcal{V}^{o}\}, (33)
𝒱r={gr(V,C,P)|(C,P),V𝒱r}.\mathcal{V}^{r}=\{g^{r}(V,C,P)\ |\ (C,P)\in\mathcal{M},V\in\mathcal{V}^{r}\}. (34)

Additionally, the supremum elements of 𝒱o\mathcal{V}^{o} and 𝒱r\mathcal{V}^{r} are V¯o\overline{V}^{o} and V¯r\overline{V}^{r} respectively and the infimum elements are V¯o\underline{V}^{o} and V¯r\underline{V}^{r}, respectively.

V¯sr=minV𝒱rVs,V¯sr=maxV𝒱rVs,s[S].\underline{V}_{s}^{r}=\min_{V\in\mathcal{V}^{r}}V_{s},\ \overline{V}_{s}^{r}=\max_{V\in\mathcal{V}_{r}}V_{s},\ \forall s\in[S]. (35)
V¯so=minV𝒱oVs,V¯so=maxV𝒱oVs,s[S].\underline{V}_{s}^{o}=\min_{V\in\mathcal{V}^{o}}V_{s},\ \overline{V}_{s}^{o}=\max_{V\in\mathcal{V}^{o}}V_{s},\ \forall s\in[S]. (36)

We compare these with the fixed point set of the set-based Bellman operator, 𝒱B={minπgπ(V,C,P)|(C,P),V𝒱B}\mathcal{V}^{B}=\{\min_{\pi}g^{\pi}(V,C,P)\ |\ (C,P)\in\mathcal{M},V\in\mathcal{V}^{B}\} (17), whose supremum and infimum elements V¯B\overline{V}^{B} and V¯B\underline{V}^{B} are defined as

V¯sB=minV𝒱BVs,V¯sB=maxV𝒱BVs,s[S].\underline{V}_{s}^{B}=\min_{V\in\mathcal{V}^{B}}{\color[rgb]{0,0,0}V_{s}},\ \overline{V}_{s}^{B}=\max_{V\in\mathcal{V}^{B}}V_{s},\ \forall s\in[S]. (37)

6 Fixed-point set containing its infimum/supremum

We make the following assumption on the MDP parameter set \mathcal{M} with respect to hh.

Assumption 30 (Containment condition).

The MDP parameter set \mathcal{M} satisfies the containment condition with respect to hh if \mathcal{M} is compact and for all VSV\in{\mathbb{R}}^{S},

s[S]argminmhs(V,m),s[S]argmaxmhs(V,m).\bigcap_{s\in[S]}\mathop{\rm argmin}_{m\in\mathcal{M}}h_{s}(V,m)\neq\emptyset,\ \bigcap_{s\in[S]}\mathop{\rm argmax}_{m\in\mathcal{M}}h_{s}(V,m)\neq\emptyset. (38)
Refer to caption
Figure 5: We illustrate argmaxmhs(V,m)\mathop{\rm argmax}_{m\in\mathcal{M}}h_{s}(V,m) for a value operator hh when S=2S=2. Here, argmaxmh1(V,m)={m2,m3}\mathop{\rm argmax}_{m\in\mathcal{M}}h_{1}(V,m)=\{m_{2},m_{3}\}, argmaxmh2(V,m)={m1,m2}\mathop{\rm argmax}_{m\in\mathcal{M}}h_{2}(V,m)=\{m_{1},m_{2}\}. Therefore, m2m_{2} is the common parameter that achieves maxmhs(V,m)\max_{m\in\mathcal{M}}h_{s}(V,m) for all s[S]s\in[S].
Remark 31.

Assumption 30 is an hh-dependent condition imposed on the structure of \mathcal{M}, and is independent of \mathcal{M}’s convexity and connectivity.
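For a finite parameter set, the containment condition (38) can be checked directly at a given VV by intersecting the per-state argmin and argmax index sets over \mathcal{M}; Assumption 30 requires the check to hold for every VSV\in{\mathbb{R}}^{S}. A minimal sketch, with the Bellman operator as the value operator and hypothetical data:

```python
# Minimal sketch: checking the containment condition (38) at a single V for finite M.
import numpy as np

gamma = 0.9
C1 = np.array([[1.0, 4.0], [0.0, 2.0]])
C2 = np.array([[1.5, 3.0], [0.5, 2.5]])
P1 = np.array([[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.9, 0.1]]])
P2 = np.array([[[0.6, 0.4], [0.2, 0.8]], [[0.4, 0.6], [0.7, 0.3]]])
M = [(C1, P1), (C2, P2)]

def bellman(V, C, P):
    return (C + gamma * P @ V).min(axis=1)

def containment_holds_at(V, tol=1e-12):
    H = np.array([bellman(V, C, P) for C, P in M])   # H[i, s] = h_s(V, m^i)
    argmin_s = [set(np.flatnonzero(H[:, s] <= H[:, s].min() + tol)) for s in range(H.shape[1])]
    argmax_s = [set(np.flatnonzero(H[:, s] >= H[:, s].max() - tol)) for s in range(H.shape[1])]
    return bool(set.intersection(*argmin_s)) and bool(set.intersection(*argmax_s))

print(containment_holds_at(np.zeros(2)))   # Assumption 30 requires this for all V in R^S
```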

6.1 Containment-satisfying MDP parameter sets

Assumption 30 restricts the structure of \mathcal{M} with respect to the value operator hh. Thus whether or not \mathcal{M} satisfies Assumption 30 must always be determined with respect to the operator hh. With respect to the Bellman operator ff (9) and the policy evaluation operators gπg^{\pi} (7), the following conditions in robust MDP are sufficient to satisfy Assumption 30.

Definition 32 ((s,a)(s,a)-rectangular sets [8, 15]).

The uncertainty set S×A×ΔSSA\mathcal{M}\subset{\mathbb{R}}^{S\times A}\times\Delta^{SA}_{S} is (s,a)(s,a)-rectangular if

=(s,a)[S]×[A]sa,sa×ΔS,(s,a)[S]×[A].\mathcal{M}=\bigtimes_{(s,a)\in[S]\times[A]}\mathcal{M}_{sa},\ \mathcal{M}_{sa}\subset{\mathbb{R}}\times\Delta_{S},\ \forall(s,a)\in[S]\times[A]. (39)

Intuitively, (s,a)(s,a)-rectangularity implies that the MDP parameter uncertainty is decoupled across state-action pairs. A more general condition only requires the uncertainty to be decoupled across different states, while allowing coupling between actions within the same state.

Definition 33 (ss-rectangular sets).

The uncertainty set S×A×ΔSSA\mathcal{M}\subset{\mathbb{R}}^{S\times A}\times\Delta^{SA}_{S} is ss-rectangular if

=s[S]s,sA×ΔSA,s[S].\mathcal{M}=\bigtimes_{s\in[S]}\mathcal{M}_{s},\ \mathcal{M}_{s}\subset{\mathbb{R}}^{A}\times\Delta^{A}_{S},\ \forall s\in[S]. (40)

ss-rectangularity generalizes (s,a)(s,a)-rectangularity—i.e. (s,a)(s,a)-rectangularity implies ss-rectangularity.

Example 34 (Wind uncertainty).

Consider the navigation problem presented in Example 1. If the wind pattern strictly switches between the discrete wind trends, then the transition uncertainty at state s[S]s\in[S] is 𝒫s={Ps1,,PsN}\mathcal{P}_{s}=\{P^{1}_{s},\ldots,P^{N}_{s}\}. If the wind pattern is a mixture of the discrete wind trends, the transition uncertainty at state s[S]s\in[S] is 𝒫s={iαiPsi|αΔN}\mathcal{P}_{s}=\{\sum_{i}\alpha_{i}P^{i}_{s}\ |\ \alpha\in\Delta_{N}\}. Both wind patterns lead to ss-rectangular uncertainty, given by 𝒫=s[S]𝒫s\mathcal{P}=\bigtimes_{s\in[S]}\mathcal{P}_{s}.
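A minimal sketch of the ss-rectangular structure (40): the overall uncertainty set is the Cartesian product of per-state sets, so parameters are selected independently at each state. The per-state entries below are string placeholders standing in for the (cs,Ps)(c_{s},P_{s}) pairs of two hypothetical wind patterns.

```python
# Minimal sketch of an s-rectangular uncertainty set (40) built as a Cartesian product.
from itertools import product

M_s1 = [("c_s1_pattern1", "P_s1_pattern1"), ("c_s1_pattern2", "P_s1_pattern2")]
M_s2 = [("c_s2_pattern1", "P_s2_pattern1"), ("c_s2_pattern2", "P_s2_pattern2")]

M = list(product(M_s1, M_s2))   # each element picks one per-state parameter independently
print(len(M))                   # |M| = |M_s1| * |M_s2| = 4
```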

We show that the rectangularity conditions indeed are sufficient for satisfying Assumption 30 with respect to ff (9) and gπg^{\pi} (7).

Proposition 35.

If \mathcal{M} is compact and ss-rectangular (Definition 33), \mathcal{M} satisfies Assumption 30 with respect to ff (9) and gπg^{\pi} (7) for all πΠ\pi\in\Pi.

Proof.

We first show that \mathcal{M} satisfies Assumption 30 with respect to the Bellman operator. Given s[S]s\in[S], fs(V,C,P)f_{s}(V,C,P) only depends on the ss component of CC and PP. From Lemma 9, fsf_{s} is continuous in (cs,Ps)(c_{s},P_{s}). Since s\mathcal{M}_{s} is compact, for each s[S]s\in[S] there exists (cs,Ps)argmin(cs,Ps)sfs(V,C,P)(c_{s}^{\star},P_{s}^{\star})\in\mathop{\rm argmin}_{(c_{s},P_{s})\in\mathcal{M}_{s}}f_{s}(V,C,P) with (cs,Ps)s(c_{s}^{\star},P^{\star}_{s})\in\mathcal{M}_{s}. We can construct C=[c1,,cS]C^{\star}=[c_{1}^{\star},\ldots,c^{\star}_{S}] and P=[P1,,PS]P^{\star}=[P_{1}^{\star},\ldots,P_{S}^{\star}]. If \mathcal{M} is ss-rectangular, then (C,P)(C^{\star},P^{\star})\in\mathcal{M} and (C,P)argmin(C,P)fs(V,C,P)(C^{\star},P^{\star})\in\mathop{\rm argmin}_{{\color[rgb]{0,0,0}(C,P)}\in\mathcal{M}}f_{s}(V,C,P) for all s[S]s\in[S]. An identical argument with argmax\mathop{\rm argmax} in place of argmin\mathop{\rm argmin} yields the second intersection in (38). We conclude that \mathcal{M} satisfies Assumption 30 with respect to ff.

Given πΠ\pi\in\Pi and s[S]s\in[S], gsπg^{\pi}_{s} only depends on csc_{s} and PsP_{s} as well. We can similarly show that there exists an optimal parameter (C,P)argmin(C,P)gsπ(V,C,P)(C^{\star},P^{\star})\in\mathop{\rm argmin}_{(C,P)\in\mathcal{M}}g^{\pi}_{s}(V,C,P) for all s[S]s\in[S] such that (C,P)(C^{\star},P^{\star})\in\mathcal{M}. ∎ Beyond ss-rectangularity, there are sets that satisfy Assumption 30 with respect to specific value operators.

Example 36 (Beyond rectangularity).
Refer to caption
Figure 6: MDP with parameter coupling in transition probability across different states.

In Figure 6, we visualize a four state MDP with transition uncertainty \mathcal{M} parameterized by α\alpha. MDP states are the nodes and MDP actions are the arrows. Actions that transition to multiple states are visualized by multi-headed arrows. Each head has an associated tuple (csa,psa,s)(c_{sa},p_{sa,s^{\prime}}) denoting its state-action cost and transition probability. All states have a single action except for state s4s_{4}, where two actions exist and are distinguished by different colors. Both s2s_{2} and s3s_{3} are absorbing states with a unique action, such that V2=11γV_{2}=\frac{1}{1-\gamma} and V3=0V_{3}=0 for both ff and gπg^{\pi} for all πΠ\pi\in\Pi, where γ\gamma is the discount factor.

The states s1s_{1} and s4s_{4} have transition uncertainty parametrized by α[0,1]\alpha\in[0,1]. Therefore, \mathcal{M} violates ss-rectangularity (Definition 33). The optimal cost-to-go values V1V_{1} and V4V_{4} occur at different α\alpha’s. Therefore, \mathcal{M} violates Assumption 30 with respect to ff. However, suppose that at s4s_{4}, we only consider policies that exclusively choose the action colored green in Fig. 6. Then the expected cost-to-go at s4s_{4}, V4V_{4}, is independent of α\alpha. The minimum and maximum values of V1V_{1} under π\pi occur at α=1\alpha=1 and α=0\alpha=0, respectively. Therefore, \mathcal{M} satisfies Assumption 30 with respect to operator gπg^{\pi} for all π=[πs1,,πs4]\pi=[\pi_{s_{1}},\ldots,\pi_{s_{4}}] where πs4=[1,0]\pi_{s_{4}}=[1,0].

When Assumption 30 is satisfied, the fixed point of HH (14) contains its own supremum and infimum.

Theorem 37.

If hh (6) on S×{\mathbb{R}}^{S}\times\mathcal{M} satisfies Assumption 30, then there exist m¯,m¯\underline{m},\overline{m}\in\mathcal{M} such that h¯\underline{h} and h¯\overline{h} (22) and their fixed points X¯\underline{X} and X¯\overline{X} (23) satisfy

h¯(X¯)=h(X¯,m¯)=X¯,h¯(X¯)=h(X¯,m¯)=X¯.\underline{h}(\underline{X})=h(\underline{X},\underline{m})=\underline{X},\ \overline{h}(\overline{X})=h(\overline{X},\overline{m})=\overline{X}. (41)

Additionally, X¯\underline{X} and X¯\overline{X} are the least and the greatest elements of HH’s fixed point set 𝒱\mathcal{V}^{\star}, V¯,V¯\underline{V}^{\star},\overline{V}^{\star} (21) respectively, and both belong to 𝒱\mathcal{V}^{\star} (17).

X¯=V¯,X¯=V¯,X¯,X¯𝒱.\textstyle\underline{X}=\underline{V}^{\star},\ \overline{X}=\overline{V}^{\star},\ \underline{X},\overline{X}\in\mathcal{V}^{\star}.
Proof.

From Theorem 27, X¯\underline{X} and X¯\overline{X} are the lower and upper bounds on the fixed point set 𝒱\mathcal{V}^{\star}. We show that these are the infimum and supremum elements of 𝒱\mathcal{V}^{\star} by showing that they are also elements of 𝒱\mathcal{V}^{\star}. From Assumption 30, there exist m¯,m¯\underline{m},\overline{m}\in\mathcal{M} such that hs(X¯,m¯)=minmhs(X¯,m){h}_{s}(\underline{X},\underline{m})=\min_{m\in\mathcal{M}}h_{s}(\underline{X},m) and hs(X¯,m¯)=maxmhs(X¯,m){h}_{s}(\overline{X},\overline{m})=\max_{m\in\mathcal{M}}h_{s}(\overline{X},m) for all s[S]s\in[S]. Since X¯\underline{X} and X¯\overline{X} are fixed points of h(,m¯)h(\cdot,\underline{m}) and h(,m¯)h(\cdot,\overline{m}), we apply Corollary 20 to conclude that X¯,X¯𝒱\underline{X},\overline{X}\in\mathcal{V}^{\star}. ∎ Our next result establishes the relationship between V¯B,V¯o,V¯r,\underline{V}^{B},\underline{V}^{o},\underline{V}^{r}, V¯B,V¯o,V¯r\overline{V}^{B},\overline{V}^{o},\overline{V}^{r} when f,gof,g^{o}, and grg^{r} on S×{\mathbb{R}}^{S}\times\mathcal{M} satisfy Assumption 30.

Theorem 38.

If f,go,grf,g^{o},g^{r} satisfy Assumption 30 on S×{\mathbb{R}}^{S}\times\mathcal{M}, then the bounding elements (37) (36) (35) of the corresponding fixed point sets 𝒱B\mathcal{V}^{B},𝒱o\mathcal{V}^{o} (33) and 𝒱r\mathcal{V}^{r} (34) are ordered as

V¯B=V¯oV¯r,V¯B=V¯rV¯o.\underline{V}^{B}=\underline{V}^{o}\leq\underline{V}^{r},\ \overline{V}^{B}=\overline{V}^{r}\leq\overline{V}^{o}. (42)
{pf}

Since V¯o\underline{V}^{o} is the infimum element for the fixed point set 𝒱o\mathcal{V}^{o} (36), we can apply Theorem 37 to derive

V¯o=min(C,P)go(V¯o,C,P).\textstyle\underline{V}^{o}=\min_{(C,P)\in\mathcal{M}}g^{o}(\underline{V}^{o},C,P). (43)

By the definition of \pi^{o} (28), \min_{(C,P)\in\mathcal{M}}g^{o}(\underline{V}^{o},C,P)=\min_{\pi\in\Pi}\min_{(C,P)\in\mathcal{M}}g^{\pi}(\underline{V}^{o},C,P). As the two minima commute,

\min_{\pi\in\Pi}\min_{(C,P)\in\mathcal{M}}g^{\pi}(\underline{V}^{o},C,P)=\min_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\underline{V}^{o},C,P). (44)

Combining (43) and (44), V¯o\underline{V}^{o} is exactly the unique fixed point of min(C,P)minπΠgπ(,C,P)\min_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\cdot,C,P). However, by applying Theorem 37 to ff on S×{\mathbb{R}}^{S}\times\mathcal{M}, V¯B\underline{V}^{B} is also the unique fixed point of min(C,P)minπΠgπ(,C,P)\min_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\cdot,C,P). Therefore V¯o=V¯B\underline{V}^{o}=\underline{V}^{B}.

From (35), \underline{V}^{r}=\min_{(C,P)\in\mathcal{M}}g^{r}(\underline{V}^{r},C,P). Minimizing additionally over the policy space lower bounds \underline{V}^{r} as

\underline{V}^{r}\geq\min_{\pi\in\Pi}\min_{(C,P)\in\mathcal{M}}g^{\pi}(\underline{V}^{r},C,P). (45)

Since the right-hand side of (45) equals \underline{f}(\underline{V}^{r}), (45) is equivalent to \underline{V}^{r}\geq\underline{f}(\underline{V}^{r}). From Lemma 26, \underline{f} is order-preserving in V; since \underline{f} is also a contraction, iterating \underline{f} on this inequality yields \underline{V}^{r}\geq\lim_{k\to\infty}\underline{f}^{k}(\underline{V}^{r})=\underline{V}^{B}. We conclude that \underline{V}^{o}=\underline{V}^{B}\leq\underline{V}^{r}.

From Theorem 37, V¯r\overline{V}^{r} is the fixed point of g¯r\overline{g}^{r}, such that

V¯r=max(C,P)gr(V¯r,C,P).\overline{V}^{r}=\max_{(C,P)\in\mathcal{M}}g^{r}(\overline{V}^{r},C,P). (46)

We apply \min_{\pi\in\Pi} to both sides of (46) and use the definition of \pi^{r} to derive that \overline{V}^{r} is the fixed point of \min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\cdot,C,P). From Assumption 30, there exists (\overline{C},\overline{P})\in\mathcal{M} that maximizes g^{\pi}(\overline{V}^{r},C,P) over \mathcal{M}, so \overline{V}^{r} equivalently satisfies

V¯r=minπΠgπ(V¯r,C¯,P¯).\overline{V}^{r}=\min_{\pi\in\Pi}g^{\pi}(\overline{V}^{r},\overline{C},\overline{P}).

From Corollary 20, this implies that \overline{V}^{r}\in\mathcal{V}^{B} and therefore \overline{V}^{r}\leq\overline{V}^{B}. Next we show \overline{V}^{B}\leq\overline{V}^{r}. From Theorem 37, \overline{V}^{B} is the fixed point of \overline{f}, such that \overline{V}^{B}=\max_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\overline{V}^{B},C,P). From the min-max inequality,

V¯BminπΠmax(C,P)gπ(V¯B,C,P).\overline{V}^{B}\leq\min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\overline{V}^{B},C,P).

Since πrΠ\pi^{r}\in\Pi,

V¯Bmax(C,P)gr(V¯B,C,P).\overline{V}^{B}\leq\max_{(C,P)\in\mathcal{M}}g^{r}(\overline{V}^{B},C,P). (47)

The right-hand side of (47) is \overline{g}^{r}(\overline{V}^{B}) (22), so (47) is equivalent to \overline{V}^{B}\leq\overline{g}^{r}(\overline{V}^{B}). Consider the sequence V^{k+1}=\overline{g}^{r}(V^{k}) with V^{1}=\overline{V}^{B}. Since \overline{g}^{r} is a contraction, \lim_{k\to\infty}V^{k}=\overline{V}^{r}, the fixed point of \overline{g}^{r}. From Lemma 26, \overline{g}^{r} is order-preserving, so V^{1}\leq\overline{g}^{r}(V^{1})=V^{2} implies that the sequence \{V^{k}\} is non-decreasing. Therefore \overline{V}^{B}=V^{1}\leq\lim_{k\to\infty}V^{k}=\overline{V}^{r}.

Finally, Theorem 37 implies that \overline{V}^{o} is the fixed point of \overline{g}^{o}: \overline{V}^{o}=\max_{(C,P)\in\mathcal{M}}g^{o}(\overline{V}^{o},C,P). Since \pi^{o}\in\Pi, \overline{V}^{o}\geq\min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\overline{V}^{o},C,P). From the min-max inequality,

minπΠmax(C,P)gπ(V¯o,C,P)max(C,P)minπΠgπ(V¯o,C,P),\min_{\pi\in\Pi}\max_{(C,P)\in\mathcal{M}}g^{\pi}(\overline{V}^{o},C,P)\geq\max_{(C,P)\in\mathcal{M}}\min_{\pi\in\Pi}g^{\pi}(\overline{V}^{o},C,P),

where the right-hand side of the inequality equals \overline{f}(\overline{V}^{o}). Hence \overline{V}^{o}\geq\overline{f}(\overline{V}^{o}); following the monotonicity of the Bellman operator f [17, Thm. 6.2.2] and iterating \overline{f}, we conclude that \overline{V}^{o}\geq\overline{V}^{B}. ∎

Remark 39.

Through our fixed-point analysis, we see that in addition to having the best worst-case performance among {𝒱o,𝒱B,𝒱r}\{\mathcal{V}^{o},\mathcal{V}^{B},\mathcal{V}^{r}\}, 𝒱r\mathcal{V}^{r} also has the smallest variation in performance for the same uncertainty set \mathcal{M}.

Finally, we generalize the ss-rectangularity condition by showing that the optimistic and robust policies exist when the MDP parameter set \mathcal{M} satisfies Assumption 30.

Corollary 40 (Robust MDP under Assumption 30).

If \mathcal{M} is compact and convex, and f,go,grf,g^{o},g^{r} satisfy Assumption 30 on S×{\mathbb{R}}^{S}\times\mathcal{M}, then WoW^{o} (26) and WrW^{r} (27) are the infimum and supremum value vectors for the policy evaluation operator under πo\pi^{o} (28) and πr\pi^{r} (29), respectively.

Wso=infV𝒱oVs,Wsr=supV𝒱rVs,s[S],W^{o}_{s}=\inf_{V\in\mathcal{V}^{o}}{\color[rgb]{0,0,0}V_{s}},W^{r}_{s}=\sup_{V\in\mathcal{V}^{r}}{\color[rgb]{0,0,0}V_{s}},\ \forall s\in[S], (48)

where 𝒱o\mathcal{V}^{o} (33) and 𝒱r\mathcal{V}^{r} (34) are the fixed point sets of policies πo\pi^{o} and πr\pi^{r} under parameter uncertainty \mathcal{M}, respectively.

{pf}

When f satisfies Assumption 30 on \mathbb{R}^{S}\times\mathcal{M}, Theorem 37 shows that \underline{V}^{B}=W^{o} and \overline{V}^{B}=W^{r}. If f, g^{o}, and g^{r} also satisfy Assumption 30 on \mathbb{R}^{S}\times\mathcal{M}, then we apply Theorem 38 to derive W^{o}=\underline{V}^{o} and W^{r}=\overline{V}^{r}. This proves the corollary statement. ∎

Remark 41.

When Assumption 30 is not satisfied, WoW^{o} and WrW^{r} still bound V¯o\underline{V}^{o} and V¯r\overline{V}^{r}. This result is also stated in [21].

Figure 7: Illustration of Theorem 38. The purple, green, and blue regions indicate the ranges of \mathcal{V}^{r}, \mathcal{V}^{o}, and \mathcal{V}^{B}, respectively.

7 Value iteration for fixed point set computation

In the previous sections, we proved the existence of a fixed point set for value operators with compact parameter uncertainty sets and re-examined robust MDP results through our techniques. Next, we derive an iterative algorithm for computing the bounds of the fixed point set \mathcal{V}^{\star} given a value operator h and a parameter uncertainty set \mathcal{M}.

Algorithm Sketch. Based on the set-based value iteration (19), we iteratively compute the one-step bounds of H(\mathcal{V}^{k}); these iterates converge to the bounds of the fixed point set.

For any compact set 𝒱𝒦(S)\mathcal{V}\in\mathcal{K}({\mathbb{R}}^{S}), the one step bounds of H(𝒱)H(\mathcal{V}) are equivalent to the one-step output of the bound operators h¯\underline{h} and h¯\overline{h} (22) applied to the extremal points of 𝒱\mathcal{V}.

Theorem 42 (One step HH bounds).

Consider a set operator HH (14) and its bound operators h¯\underline{h} and h¯\overline{h} (22) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M} (6). For a compact set 𝒱S\mathcal{V}\subset{\mathbb{R}}^{S}, H(𝒱)H(\mathcal{V}) is bounded by h¯(V¯)\underline{h}(\underline{V}) and h¯(V¯)\overline{h}(\overline{V}) (22) as

h¯(V¯)Vh¯(V¯),VH(𝒱).\underline{h}(\underline{V})\leq V\leq\overline{h}(\overline{V}),\quad\forall\ V\ \in H(\mathcal{V}). (49)

where \underline{V} and \overline{V} (21) are the extremal elements of \mathcal{V}. If h satisfies Assumption 30 on \mathbb{R}^{S}\times\mathcal{M} and \underline{V},\overline{V}\in\mathcal{V}, then \underline{h}(\underline{V}) and \overline{h}(\overline{V}) are the infimum and supremum elements of H(\mathcal{V}), respectively; that is, for all s\in[S], \underline{h}_{s}(\underline{V}) and \overline{h}_{s}(\overline{V}) satisfy

h¯s(V¯)=inf(V,m)𝒱×hs(V,m),h¯s(V¯)=sup(V,m)𝒱×hs(V,m).\underline{h}_{s}(\underline{V})=\underset{(V,m)\in\mathcal{V}\times\mathcal{M}}{\inf}{h_{s}(V,m)},\ \overline{h}_{s}(\overline{V})=\underset{(V,m)\in\mathcal{V}\times\mathcal{M}}{\sup}h_{s}(V,m). (50)
{pf}

For all s\in[S], h_{s}(V,m)\leq\overline{h}_{s}(V) for all m\in\mathcal{M}. Since h is K(V)-Lipschitz in \mathcal{M} and an \alpha-contraction in V, \overline{h} is order-preserving (Lemma 26), such that \overline{h}_{s}(V)\leq\overline{h}_{s}(\overline{V}) for all V\in\mathcal{V}. We conclude that

h(V,m)h¯(V¯),(V,m)𝒱×.h(V,m)\leq\overline{h}(\overline{V}),\ \forall(V,m)\in\mathcal{V}\times\mathcal{M}. (51)

Since \overline{h}(\overline{V}) is an upper bound and \sup is the least upper bound, it holds that \sup_{(V,m)\in\mathcal{V}\times\mathcal{M}}[h(V,m)]_{s}\leq\overline{h}_{s}(\overline{V}) for all s\in[S]. We use the definition of H(\mathcal{V}) (14) to conclude that V\leq\overline{h}(\overline{V}) for all V\in H(\mathcal{V}). The inequality \underline{h}(\underline{V})\leq V for all V\in H(\mathcal{V}) can be proved similarly.

If h satisfies Assumption 30 on \mathbb{R}^{S}\times\mathcal{M} and \underline{V},\overline{V}\in\mathcal{V}, Assumption 30 states that there exists \underline{m}\in\mathcal{M} such that h(\underline{V},\underline{m})=\underline{h}(\underline{V}). Therefore, \underline{h}(\underline{V})\in H(\mathcal{V}). Since \underline{h}(\underline{V}) also lower bounds all elements of H(\mathcal{V}), it is the infimum element of H(\mathcal{V}). The fact that the greatest element of H(\mathcal{V}) is \overline{h}(\overline{V}) can be proved similarly. ∎ Based on Theorem 42, we propose the following bound approximation algorithm for the fixed point set \mathcal{V}^{\star} (17) of the set-based operator H (14).

Algorithm 7: Bound approximation of the fixed point set \mathcal{V}^{\star}

1: Input: \mathcal{C}, \mathcal{P}, V^{0}, \epsilon
2: Output: \underline{V}, \overline{V}
3: \underline{V}^{0}:=\overline{V}^{0}:=V^{0},\ k:=0
4:e0=1γγϵe^{0}=\frac{1-\gamma}{\gamma}\epsilon
5:while γ1γekϵ\frac{\gamma}{1-\gamma}e^{k}\geq\epsilon do
6:     V¯sk+1=minmhs(V¯k,m),s[S]\underline{V}^{k+1}_{s}=\min_{m\in\mathcal{M}}h_{s}(\underline{V}^{k},m),\quad\forall s\in[S]
7:     V¯sk+1=maxmhs(V¯k,m),s[S]\overline{V}^{k+1}_{s}=\max_{m\in\mathcal{M}}h_{s}(\overline{V}^{k},m),\quad\forall s\in[S]
8:     ek+1=max{V¯k+1V¯k,V¯k+1V¯k}e^{k+1}=\max\Big{\{}\left\lVert\underline{V}^{k+1}-\underline{V}^{k}\right\rVert,\left\lVert\overline{V}^{k+1}-\overline{V}^{k}\right\rVert\Big{\}}
9:     k=k+1k=k+1
10:end while
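As a concrete illustration, the following is a minimal Python sketch of Algorithm 7 for the case where h is the Bellman operator f and \mathcal{M} is a finite collection of (C,P) pairs; the array shapes, function names, and the NumPy dependency are our own assumptions and not part of the algorithm statement.

```python
# Minimal sketch of Algorithm 7 (hypothetical shapes and names): h is the
# Bellman operator f, and M is a finite list of (C, P) pairs with C of shape
# (S, A) and P of shape (S, A, S), where P[s, a] is a distribution over s'.
import numpy as np

def bellman(V, C, P, gamma):
    """f(V, C, P): minimize the state-action values over actions."""
    Q = C + gamma * (P @ V)        # Q[s, a] = C[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    return Q.min(axis=1)

def bound_value_iteration(M, gamma, V0, eps=1e-6):
    """Approximate the extremal fixed points of the set-based operator H."""
    V_lo, V_hi = V0.copy(), V0.copy()
    err = (1.0 - gamma) / gamma * eps          # line 4: enter the loop once
    while gamma / (1.0 - gamma) * err >= eps:  # line 5
        # Lines 6-7: state-wise min / max of h over the finite parameter set M.
        V_lo_next = np.min([bellman(V_lo, C, P, gamma) for C, P in M], axis=0)
        V_hi_next = np.max([bellman(V_hi, C, P, gamma) for C, P in M], axis=0)
        err = max(np.abs(V_lo_next - V_lo).max(),       # line 8
                  np.abs(V_hi_next - V_hi).max())
        V_lo, V_hi = V_lo_next, V_hi_next
    return V_lo, V_hi
```

For the policy evaluation operator g^{\pi}, it suffices to replace the minimization over actions in `bellman` with the policy-weighted average `(pi * Q).sum(axis=1)`.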

7.1 Computing one-step optimal parameters

Algorithm 7 is stated for a general MDP parameter set \mathcal{M} and does not specify how to compute lines 6 and 7. Here we discuss solution methods for different shapes of \mathcal{M}.

  1. 1.

    Finite \mathcal{M}. If \mathcal{M}=\{m_{1},\ldots,m_{N}\} is a set with a finite number of elements, we can directly compute line 6 as

    \underline{V}^{k+1}_{s}=\min\big\{h_{s}(\underline{V}^{k},m_{i})\ \big|\ i\in\{1,\ldots,N\}\big\},\quad\forall s\in[S]. (52)

    For line 7, we replace min\min with max\max in (52).

  2. 2.

    Convex \mathcal{M}. When \mathcal{M} is a convex set, the computation depends on h. If h=g^{\pi} is the policy operator, lines 6 and 7 can be solved as convex optimization problems (see the sketch following this list). If h is the Bellman operator f, lines 6 and 7 take on a min-max formulation and are NP-hard to solve in general [21]. When \mathcal{M} can be characterized by an ellipsoidal set of parameters, the solutions to lines 6 and 7 are given in [21].
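For illustration, below is a minimal sketch of line 6 for the policy evaluation operator when the transition uncertainty at a single state-action pair (s,a) is an interval polytope around a nominal distribution; the interval model, the SciPy dependency, and the function names are our assumptions, not the ellipsoidal construction of [21].

```python
# Minimal sketch (hypothetical names): line 6 at one state-action pair (s, a)
# for g^pi, with transition uncertainty {p : p_lo <= p <= p_hi, sum(p) = 1}.
import numpy as np
from scipy.optimize import linprog

def min_transition_q(V, c_sa, p_lo, p_hi, gamma):
    """Best-case Q-value: c_sa + gamma * min_{p in polytope} p^T V."""
    S = len(V)
    res = linprog(
        c=gamma * np.asarray(V),               # linear objective gamma * p^T V
        A_eq=np.ones((1, S)), b_eq=[1.0],      # p must be a probability vector
        bounds=list(zip(p_lo, p_hi)),          # interval bounds on each entry of p
        method="highs",
    )
    return c_sa + res.fun

# Line 7 (the worst case) is obtained by negating the objective; the state-wise
# update then averages these Q-values with the policy weights pi_s.
```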

We recall the stochastic path planning problem from Example 1 with its two parameter uncertainty scenarios: when the wind field uncertainty is discrete, \mathcal{M} is finite; when the wind field is a combination of the major wind trends, \mathcal{M} is convex.

7.2 Algorithm Convergence Rate

When lines 6 and 7 are solvable, Algorithm 7 asymptotically converges to approximations of the bounding elements of \mathcal{V}^{\star}. If \mathcal{M} satisfies Assumption 30 with respect to h, these bounds coincide with the extremal elements of \mathcal{V}^{\star}. Algorithm 7 has a rate of convergence in the Hausdorff distance similar to that of standard value iteration using h on \mathbb{R}^{S}.

Theorem 43.

Consider the value operator hh, compact uncertainty set \mathcal{M}, and the fixed point set 𝒱\mathcal{V}^{\star} of the set-based operator HH (14) induced by hh on S×{\mathbb{R}}^{S}\times\mathcal{M}. If \mathcal{M} satisfies Assumption 30 with respect to hh, then at each iteration kk,

\left\lVert\underline{V}^{k+1}-\underline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{\infty},\quad\left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}, (53)

where all norms are infinity norms, and \underline{V}^{\star},\overline{V}^{\star} are the infimum and supremum elements of \mathcal{V}^{\star}, respectively. At Algorithm 7's termination, \underline{V}^{k} and \overline{V}^{k} satisfy

max{V¯kV¯,V¯kV¯}<ϵ.\max\{\left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}},\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\}<\epsilon. (54)
{pf}

From Algorithm 7, \overline{V}^{k+1}=\overline{h}(\overline{V}^{k}). From Lemma 25, \overline{h} is an \alpha-contraction, so \left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}; the analogous bound holds for \underline{V}^{k+1}, which establishes (53). Next, we apply the triangle inequality to \left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty} to derive

V¯kV¯V¯kV¯k+1+V¯k+1V¯.\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}\leq\left\lVert\overline{V}^{k}-\overline{V}^{k+1}\right\rVert_{{\color[rgb]{0,0,0}\infty}}+\left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{{\color[rgb]{0,0,0}\infty}}. (55)

We then use \left\lVert\overline{V}^{k+1}-\overline{V}^{\star}\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty} to bound (55) as \left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}\leq\frac{1}{1-\alpha}\left\lVert\overline{V}^{k}-\overline{V}^{k+1}\right\rVert_{\infty}. Moreover, \left\lVert\overline{V}^{k+1}-\overline{V}^{k}\right\rVert_{\infty}=\left\lVert\overline{h}(\overline{V}^{k})-\overline{h}(\overline{V}^{k-1})\right\rVert_{\infty}\leq\alpha\left\lVert\overline{V}^{k}-\overline{V}^{k-1}\right\rVert_{\infty}\leq\alpha e^{k}, so \left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty}\leq\frac{\alpha}{1-\alpha}e^{k}. A similar argument shows \left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{\infty}\leq\frac{\alpha}{1-\alpha}e^{k}. When Algorithm 7 terminates, its while condition fails, so that \frac{\gamma}{1-\gamma}e^{k}<\epsilon; with \alpha=\gamma, this yields \max\big\{\left\lVert\overline{V}^{k}-\overline{V}^{\star}\right\rVert_{\infty},\left\lVert\underline{V}^{k}-\underline{V}^{\star}\right\rVert_{\infty}\big\}<\epsilon. This concludes our proof. ∎ In particular, the Bellman operator f and the policy evaluation operator g^{\pi} are \gamma-contractions on \mathbb{R}^{S}, where \gamma is the discount factor; therefore, Theorem 43 applies with \alpha=\gamma.

Remark 44.

Theorem 43 implies that at the termination of Algorithm 7, the fixed point set 𝒱\mathcal{V}^{\star} can be over-approximated by

𝒱𝒱approx:=s[S][V¯sk+1ϵ,V¯sk+1+ϵ],\mathcal{V}^{\star}\subseteq\mathcal{V}_{approx}:=\prod_{s\in[S]}[\underline{V}_{s}^{k+1}-\epsilon,\overline{V}_{s}^{k+1}+\epsilon],

where kk is the last iterate before Algorithm 7 terminates.

8 Path Planning in Time-varying Wind Fields

We apply set-based value iteration to wind-assisted probabilistic path planning of a balloon in strong, uncertain wind fields [22]. MDPs have recently gained traction as models for wind-assisted path planning of balloons in the stratosphere and on other planets [22, 2], and discrete state-action MDPs have been shown to be a viable high-level path planning model for such applications [22].

Mission Objective. In the two-dimensional wind field, we assume that the wind-assisted balloon is tasked with reaching target state (8,8) in Figure 8 using minimum fuel.

Uncertain Wind Fields. By collecting wind data on the environment, an MDP model can be constructed and a policy that handles stochastic planning can be deployed. However, the wind may vary over time, which can cause the nominally optimal policy to have worse-than-expected worst-case performance. We construct an idealized uncertain wind field to demonstrate how the set-based Bellman operator can be used to predict the best- and worst-case behavior of a robust policy.

MDP Modeling Assumptions. Following the framework described in [22], we model the path planning problem in an uncertain wind field as an infinite horizon, discounted MDP with discrete state-actions in a two-dimensional space. While balloons typically traverse in three dimensions, we assume that the wind is consistent in the vertical direction and that the final target is any vertical position along the given two-dimensional coordinates. As a result, we can disregard the vertical position during planning.

Figure 8: (a) Wind field traversed by the balloon, discretized into 81 states. (b) At each state, 9 actions corresponding to different thrust vectors are available.

States. A total of 81 states represent the two-dimensional space, composed of three different regions characterized by their wind variability, as shown in Figure 8; a sketch of the corresponding state sets follows the list below.

  1. 1.

    Calm wind. In calm states ScalmS_{calm}, the wind magnitude varies uniformly between [0,0.5][0,0.5], and the wind direction is uniformly sampled between [0,2π][0,2\pi]. Scalm={(i,j)|(0,0)(i,j)(2,8),(6,0)(i,j)(8,8)}.S_{calm}=\{(i,j)\ |\ (0,0)\leq(i,j)\leq(2,8),\ (6,0)\leq(i,j)\leq(8,8)\}.

  2. 2.

    Gusty wind. In states with gusts SgustyS_{gusty}, wind magnitude is consistently 11, while the wind direction is uniformly sampled between [0,2π][0,2\pi]. Sgusty={(i,j)|(3,3)(i,j)<(6,6)}S_{gusty}=\{(i,j)\ |\ (3,3)\leq(i,j)<(6,6)\}.

  3. 3.

    Unreliable wind. In unreliable states SunreliableS_{unreliable}, a predictable wind front occasionally moves across an otherwise windless region. In other words, the wind magnitude is either 0 or 11 and the wind direction varies uniformly between [π/4,π/2][\pi/4,\pi/2].
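A minimal Python sketch of the three region index sets is given below; treating the unreliable region as all states outside the calm and gusty regions is our reading of Figure 8, not an explicit definition in the text.

```python
# Minimal sketch of the three wind regions on the 9 x 9 grid. The unreliable
# region is assumed to be every state not in the calm or gusty regions.
GRID = [(i, j) for i in range(9) for j in range(9)]
S_calm = {(i, j) for (i, j) in GRID if i <= 2 or i >= 6}            # (0,0)-(2,8) and (6,0)-(8,8)
S_gusty = {(i, j) for (i, j) in GRID if 3 <= i < 6 and 3 <= j < 6}  # (3,3) <= (i,j) < (6,6)
S_unreliable = set(GRID) - S_calm - S_gusty                         # assumed: the remaining states
```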

Actions. The balloon is equipped with an actuator that provides a constant thrust of 11 in 88 discretized directions shown in Figure 8b. The only stationary action vector with zero magnitude is highlighted in blue in the center of Figure 8b. We assume that the actuation force is enough to move the balloon across one state in wind with magnitude 0.5\leq 0.5, and is otherwise not strong enough to overcome wind effects.

Figure 9: Transition probabilities for the three different wind regions: (a) state transition in calm wind, (b) state transition in unreliable wind, (c) state transition in gusty wind.

Transition Probabilities. The transition probabilities are region-dependent. In the states [Scalm][S_{calm}] and [Sgusty][S_{gusty}], the transition dynamics are stochastic but stationary in time. In the states [Sunreliable][S_{unreliable}], the transition dynamics are stochastic but change over time. We define the following neighboring states for each state s[S]s\in[S].

  1. 1.

    𝒩(s)\mathcal{N}(s): all 88 neighboring states of state ss.

  2. 2.

    𝒩(s,a,0)\mathcal{N}(s,a,0): the neighboring state of ss in the direction of aa.

  3. 3.

    𝒩(s,a,1)\mathcal{N}(s,a,1): the neighboring state of ss in the direction of aa plus the two adjacent states as shown in Figure 9a.

  4. 4.

    𝒩(s,a,2)\mathcal{N}(s,a,2): the up and upper-right neighbors of ss, as shown in Figure 9b.

In the calm wind region, the transition probabilities are given by

P_{sa,s^{\prime}}=\begin{cases}\frac{1}{|\mathcal{N}(s,a,1)|},&s^{\prime}\in\mathcal{N}(s,a,1)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{calm}]. (56)

In the gusty wind region, the transition probabilities are given by

P_{sa,s^{\prime}}=\begin{cases}\frac{1}{|\mathcal{N}(s)|},&s^{\prime}\in\mathcal{N}(s)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{gusty}],\ \forall\ a\in[A]. (57)

In the unreliable wind region, the transition probabilities vary between transition dynamics Ps1P_{s}^{1} and Ps2P_{s}^{2}.

P^{1}_{sa,s^{\prime}}=\begin{cases}1,&s^{\prime}\in\mathcal{N}(s,a,0)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{unreliable}],\ \forall\ a\in[A]. (58)
P^{2}_{sa,s^{\prime}}=\begin{cases}0.5,&s^{\prime}\in\mathcal{N}(s,a,2)\\ 0,&\text{otherwise}\end{cases},\quad\forall\ s\in[S_{unreliable}],\ \forall\ a\in[A]. (59)

Together, P^{1}_{s} and P^{2}_{s} form the uncertainty set \mathcal{P}_{s}\subset\Delta_{S}^{A} defined at each state.

\mathcal{P}_{s}=\{P^{i}_{s}\ |\ i\in\{1,2\}\},\quad\forall s\in[S_{unreliable}]. (60)
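For concreteness, the following is a minimal sketch of the two transition modes (58)-(59) and of the uncertainty set (60) at an unreliable state; the grid indexing s = i·9 + j, the choice of "up" and "upper-right" offsets, and the clamping of transitions at the grid boundary are our assumptions.

```python
# Minimal sketch of (58)-(59) at an unreliable state s = (i, j), with state
# index s = i * 9 + j (assumed) and boundary transitions clamped to the grid.
import numpy as np

N = 9
def idx(i, j):
    return i * N + j

def clamp(i, j):
    return min(max(i, 0), N - 1), min(max(j, 0), N - 1)

def P1_row(i, j, a):
    """(58): deterministic move to the neighbour N(s, a, 0) in the action direction a = (di, dj)."""
    row = np.zeros(N * N)
    row[idx(*clamp(i + a[0], j + a[1]))] = 1.0
    return row

def P2_row(i, j, a):
    """(59): the wind front pushes to the two neighbours in N(s, a, 2) with probability 0.5 each."""
    row = np.zeros(N * N)
    for di, dj in [(0, 1), (1, 1)]:            # assumed "up" and "upper-right" offsets
        row[idx(*clamp(i + di, j + dj))] += 0.5
    return row

# Uncertainty set (60) at state (i, j): two candidate transition rows per action.
def P_candidates(i, j, a):
    return [P1_row(i, j, a), P2_row(i, j, a)]
```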

Cost. We define the following state-action cost to achieve the mission objective: at each state-action pair, the cost is the sum of the current distance from the target position s_{targ}=(8,8) and the fuel expended by the given action,

C((i,j),a)=\sqrt{(i-s_{targ}[0])^{2}+(j-s_{targ}[1])^{2}}+\tfrac{1}{2}\left\lVert a\right\rVert_{2}.

We take \left\lVert a\right\rVert_{2}=1 for all actions except the stay-still action, for which \left\lVert a\right\rVert_{2}=0.
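A minimal sketch of this cost on the 81-state grid is given below; the action encoding (eight unit-thrust directions plus one stay-still action) and the state ordering i·9 + j follow Figures 8b and 12, and are otherwise our assumptions.

```python
# Minimal sketch of the state-action cost C((i, j), a): distance to the target
# plus half the thrust magnitude. Action 0 is the stay-still action (assumed).
import numpy as np

ACTIONS = [(0, 0)] + [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
TARGET = (8, 8)

def cost(i, j, a_idx):
    di, dj = ACTIONS[a_idx]
    dist = np.hypot(i - TARGET[0], j - TARGET[1])   # distance to s_targ = (8, 8)
    thrust = 0.0 if (di, dj) == (0, 0) else 1.0     # ||a||_2 = 1 for thrust actions, 0 otherwise
    return dist + 0.5 * thrust

# Cost matrix of shape (81, 9), with state index s = i * 9 + j.
C = np.array([[cost(i, j, a) for a in range(len(ACTIONS))]
              for i in range(9) for j in range(9)])
```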

8.1 Bellman, optimistic policy, and robust policy

We first compute the optimistic and robust bounds of the MDP with parameter uncertainty in 𝒫\mathcal{P} when s[Sunreliable]s\in[S_{unreliable}] by running Algorithm 7. The results are shown in Figure 10.

Figure 10: (a) Optimistic case with expected objective of 54.2. (b) Robust case with expected objective of 96.7.

We denote the optimistic policy as πo\pi^{o} and the robust policy as πr\pi^{r}, and derive the bounds of their respective value vector sets 𝒱o\mathcal{V}^{o} (33) and 𝒱r\mathcal{V}^{r} (34) using Algorithm 7. The output is compared against the bounds of the set-based Bellman operator’s fixed point set 𝒱\mathcal{V}^{\star} in Table 1.

Set Maximum value Minimum value
𝒱\mathcal{V}^{\star} 70.61 62.25
𝒱o\mathcal{V}^{o} 101.58 62.25
𝒱r\mathcal{V}^{r} 70.63 70.52
Table 1: Bellman, optimistic policy, robust policy value bounds of the uncertain wind field.

Time-varying wind field. Next, we consider a time-varying wind field: at each time step k, the transition probability P^{k} is chosen at random from \mathcal{P} (60). In this time-varying wind field, we compare three different policy deployments: 1) the stationary optimistic policy \pi^{o} applied via the policy operator g^{o} (33), 2) the stationary robust policy \pi^{r} applied via the policy operator g^{r} (34), and 3) a dynamically changing policy that is optimal for the MDP ([S],[A],P^{k},C,\gamma), applied via f (9). These three deployments are given by

Vk+1\displaystyle V^{k+1} =go(Vk,C,Pk),\displaystyle=g^{o}(V^{k},C,P^{k}), (61)
Vk+1\displaystyle V^{k+1} =gr(Vk,C,Pk),\displaystyle=g^{r}(V^{k},C,P^{k}), (62)
Vk+1\displaystyle V^{k+1} =f(Vk,C,Pk).\displaystyle=f(V^{k},C,P^{k}). (63)
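The following is a minimal sketch of these three deployments for a finite transition uncertainty set; the random draw of P^k, the array shapes, and the function and variable names are our assumptions for illustration.

```python
# Minimal sketch of deployments (61)-(63): at each time step a transition
# model P_k is drawn at random from the finite uncertainty set P_models.
import numpy as np

def policy_eval(V, C, P, pi, gamma):
    """g^pi(V, C, P): Q-values averaged under the stochastic policy pi (shape (S, A))."""
    Q = C + gamma * (P @ V)
    return (pi * Q).sum(axis=1)

def bellman(V, C, P, gamma):
    """f(V, C, P): greedy minimization of the Q-values over actions."""
    return (C + gamma * (P @ V)).min(axis=1)

def simulate(P_models, C, gamma, pi_opt, pi_rob, V0, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    V_o, V_r, V_f = V0.copy(), V0.copy(), V0.copy()
    for _ in range(steps):
        P_k = P_models[rng.integers(len(P_models))]     # time-varying wind draw
        V_o = policy_eval(V_o, C, P_k, pi_opt, gamma)   # deployment (61)
        V_r = policy_eval(V_r, C, P_k, pi_rob, gamma)   # deployment (62)
        V_f = bellman(V_f, C, P_k, gamma)               # deployment (63)
    return V_o, V_r, V_f
```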

The resulting cost-to-go at state s_{orig}=[0,0] is plotted in Figure 11. The optimistic policy deployment (61) has the greatest variation in value over the course of 50 MDP time steps. Both the robust policy deployment (62) and the dynamically changing policy deployment (63) achieve a better upper bound at each MDP time step. The dynamically changing policy deployment (63) achieves a cost-to-go of less than 70 on average, the best among all three deployments. As discussed in Remark 39, the robust policy deployment has the smallest variation in value in the presence of wind uncertainty, achieving a value difference of less than 0.1.

Figure 11: Comparison of the robust policy, optimistic policy, and Bellman policy value trajectories in time-varying wind fields: (a) optimistic policy with \mathcal{V}^{o}'s bounding values, (b) robust policy with \mathcal{V}^{r}'s bounding values, (c) dynamically changing policy with \mathcal{V}^{B}'s bounding values. The center blue line is the average over 50 trials; the shaded blue region denotes one standard deviation. The top and bottom lines are the supremum and infimum values of the fixed points.

Sampled solutions. We compute sampled MDP models based on 50 samples of wind vectors at each state. For each sample, we add the action vector and estimate the empirical distribution of state transitions. We then compute the optimal values of these stationary sampled MDPs and compare the values of 9 randomly selected states. The resulting scatter plot is shown in Figure 12.

Figure 12: Comparison of different optimal value vectors under the Bellman operator for 50 randomly sampled MDPs. On the x-axis, the state number is computed as i\times 9+j.

9 Conclusion

In this paper, we categorized a class of operators utilized to solve Markov decision processes as value operators and lifted their input space from vectors to compact sets of vectors. We showed using fixed point analysis that the set extensions of value operators have fixed point sets that remain invariant given a compact set of MDP parameter uncertainties. These sets were applied to robust dynamic programming to further enrich existing results and generalize the kk-rectangularity assumption for robust MDPs. Finally, we applied our results to a path planning problem for time-varying wind fields. For future work, we plan on applying set-based value operators to stochastic games in the presence of uncoordinated players such as humans, as well as applying value operators to reinforcement learning to synthesize robust learning algorithms.

References

  • [1] Wesam H Al-Sabban, Luis F Gonzalez, and Ryan N Smith. Wind-energy based path planning for unmanned aerial vehicles using Markov decision processes. In 2013 IEEE International Conference on Robotics and Automation, pages 784–789. IEEE, 2013.
  • [2] Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588:77–82, 2020.
  • [3] Marc G Bellemare, Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • [4] Prashant Doshi, Richard Goodwin, Rama Akkiraju, and Kunal Verma. Dynamic workflow composition: Using Markov decision processes. International Journal of Web Services Research (IJWSR), 2(1):1–17, 2005.
  • [5] Robert Givan, Sonia Leach, and Thomas Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1-2):71–109, 2000.
  • [6] Vineet Goyal and Julien Grand-Clément. Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research, 2022.
  • [7] Jeff Henrikson. Completeness and total boundedness of the Hausdorff metric. In MIT Undergraduate Journal of Mathematics, 1999.
  • [8] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
  • [9] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • [10] Erwan Lecarpentier and Emmanuel Rachelson. Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [11] Sarah HQ Li, Assalé Adjé, Pierre-Loïc Garoche, and Behçet Açıkmeşe. Bounding fixed points of set-based Bellman operator and Nash equilibria of stochastic games. Automatica, 130:109685, 2021.
  • [12] Shie Mannor, Ofir Mebel, and Huan Xu. Robust MDPs with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509, 2016.
  • [13] Francisco S Melo. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pages 1–4, 2001.
  • [14] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
  • [15] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • [16] Sindhu Padakandla, Prabuchandran KJ, and Shalabh Bhatnagar. Reinforcement learning algorithm for non-stationary environments. Applied Intelligence, 50(11):3590–3606, 2020.
  • [17] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [18] Walter Rudin. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.
  • [19] Bernd SW Schröder. Ordered Sets. Springer, 2003.
  • [20] Herke Van Hoof, Tucker Hermans, Gerhard Neumann, and Jan Peters. Learning robot in-hand manipulation with tactile features. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pages 121–127. IEEE, 2015.
  • [21] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
  • [22] Michael T Wolf, Lars Blackmore, Yoshiaki Kuwata, Nanaz Fathpour, Alberto Elfes, and Claire Newman. Probabilistic motion planning of balloons in strong, uncertain wind fields. In 2010 IEEE International Conference on Robotics and Automation, pages 1123–1129. IEEE, 2010.
  • [23] Insoon Yang. A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance. IEEE Control Systems Letters, 1(1):164–169, 2017.

Appendix A Set sequence convergence

Lemma 45.

Let \{\mathcal{V}_{n}\}\subseteq\mathcal{K}(\mathbb{R}^{S}) be a sequence converging to \mathcal{V} in d_{\mathcal{K}} as n\to\infty. For every V\in\mathcal{V}, there exists a sequence \{V^{\varphi(n)}\}_{n\in\mathbb{N}} with V^{\varphi(n)}\in\mathcal{V}_{\varphi(n)} that converges to V in \left\lVert\cdot\right\rVert.

{pf}

Let V\in\mathcal{V}. We define a strictly increasing function \varphi on \mathbb{N} as follows: \varphi(0):=0 and, for all n\in\mathbb{N}, \varphi(n+1):=\min\{j>\varphi(n)\mid\exists\,V^{j}\in\mathcal{V}_{j},\ \left\lVert V-V^{j}\right\rVert=d(V,\mathcal{V}_{j})\leq(n+1)^{-1}\}. Such a j always exists because d(V,\mathcal{V}_{j})\leq d_{\mathcal{K}}(\mathcal{V},\mathcal{V}_{j})\to 0 as j\to\infty. Finally, since \left\lVert V-V^{\varphi(n)}\right\rVert\leq n^{-1} for all n\in\mathbb{N}^{*}, the result holds. ∎

Appendix B Proof of Lemma 9

{pf}

Let (V,m)S×(V,m)\in{\mathbb{R}}^{S}\times\mathcal{M} and consider a sequence {(Vk,mk)}kS×\{(V_{k},m_{k})\}_{k\in\mathbb{N}}\subset{\mathbb{R}}^{S}\times\mathcal{M} that converges to (V,m)(V,m). It holds that h(Vk,mk)h(V,m)\left\lVert h(V_{k},m_{k})-h(V,m)\right\rVert h(Vk,mk)h(V,mk)+h(V,mk)h(V,m)\leq\left\lVert h(V_{k},m_{k})-h(V,m_{k})\right\rVert+\left\lVert h(V,m_{k})-h(V,m)\right\rVert, where from the α\alpha-contractive property of h(,mk)h(\cdot,m^{k}), h(Vk,mk)h(V,mk)αVkV\left\lVert h(V_{k},m_{k})-h(V,m_{k})\right\rVert\leq\alpha\left\lVert V_{k}-V\right\rVert. From the K(V)K(V)-Lipschitz property of h(V,)h(V,\cdot),

h(V,mk)h(V,m)K(V)mkm.\textstyle\left\lVert h(V,m_{k})-h(V,m)\right\rVert\leq K(V)\left\lVert m_{k}-m\right\rVert.

Since \left\lVert V_{k}-V\right\rVert\to 0 and \left\lVert m_{k}-m\right\rVert\to 0 as k\to\infty, we have \left\lVert h(V_{k},m_{k})-h(V,m)\right\rVert\to 0, and h is continuous. ∎

Appendix C Proof of Lemma 25

{pf}

We show that both the Bellman operator ff and the policy evaluation operator gπg^{\pi} satisfy the contractive, order preserving, and Lipschitz properties given in Definition 7. Contraction: given (C,P)(C,P)\in\mathcal{M}, gπ(,C,P)g^{\pi}(\cdot,C,P) and f(,C,P)f(\cdot,C,P) are both γ\gamma-contractions [17, Prop.6.2.4] on the complete metric space (S,)({\mathbb{R}}^{S},\left\lVert\cdot\right\rVert_{\infty}), where γ<1\gamma<1 is the discount factor.

Order preservation: given (C,P)\in\mathcal{M}, the operator g^{\pi}(\cdot,C,P) is order-preserving [17, Lem. 6.1.2]. Consider U,V\in\mathbb{R}^{S} with U\leq V. Since g^{\pi}(\cdot,C,P) is order-preserving, g^{\pi}(U,C,P)\leq g^{\pi}(V,C,P) for all \pi\in\Pi. Taking the infimum over \Pi, we have f(U,C,P)=\inf_{\pi\in\Pi}g^{\pi}(U,C,P)\leq\inf_{\pi\in\Pi}g^{\pi}(V,C,P)=f(V,C,P).

K(V)K(V)-Lipschitz: given (C,P),(C,P)(C,P),(C^{\prime},P^{\prime})\in\mathcal{M} and VSV\in{\mathbb{R}}^{S}, we prove the following for each s[S]s\in[S],

|fs(V,C,P)fs(V,C,P)|cscs+γPsPsmax{πs,π^s}V.|f_{s}(V,C^{\prime},P^{\prime})-f_{s}(V,C,P)|\\ \leq\left\lVert c_{s}^{\prime}-c_{s}\right\rVert_{\infty}+\gamma\left\lVert P^{\prime}_{s}-P_{s}\right\rVert_{\infty}\max\{\left\lVert\pi^{\star}_{s}\right\rVert_{\infty},\left\lVert\hat{\pi}_{s}\right\rVert_{\infty}\}\left\lVert V\right\rVert_{\infty}. (64)

We prove (64) for each of the following cases: 1) fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\geq f_{s}(V,C,P), and 2) fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\leq f_{s}(V,C,P). For case 1), let π^\hat{\pi} (10) be the optimal policy for f(V,C,P)f(V,C^{\prime},P^{\prime}) and π\pi^{\star} be the optimal policy for f(V,C,P)f(V,C,P). For s[S]s\in[S], suppose fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\geq f_{s}(V,C,P), then 0fs(V,C,P)fs(V,C,P)(cs)π^scsπs+γ(Psπ^s)Vγ(Psπs)V0\leq f_{s}(V,C^{\prime},P^{\prime})-f_{s}(V,C,P)\leq(c_{s}^{\prime})^{\top}\hat{\pi}_{s}-c_{s}^{\top}\pi^{\star}_{s}+\gamma(P^{\prime}_{s}\hat{\pi}_{s})^{\top}V-\gamma(P_{s}{\pi}^{\star}_{s})^{\top}V. Since π\pi^{\star} is sub-optimal for f(V,C,P)f(V,C^{\prime},P^{\prime}), we can upper bound |fs(V,C,P)fs(V,C,P)|(cscs)πs+γ[(PsPs)πs]V|f_{s}(V,C^{\prime},P^{\prime})-f_{s}(V,C,P)|\leq(c_{s}^{\prime}-c_{s})^{\top}\pi^{\star}_{s}+\gamma[(P^{\prime}_{s}-P_{s}){\pi_{s}}^{\star}]^{\top}V. Since πs,π^sΔA\pi^{\star}_{s},\hat{\pi}_{s}\in\Delta_{A}, πs1\left\lVert\pi^{\star}_{s}\right\rVert_{\infty}\leq 1. We conclude that (64) holds when fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\geq f_{s}(V,C,P). For case 2), fs(V,C,P)fs(V,C,P)f_{s}(V,C^{\prime},P^{\prime})\leq f_{s}(V,C,P), (64) also holds by similar arguments.

Letting m^{\prime}=(C^{\prime},P^{\prime}) and m=(C,P), and writing f=f(V,m) and f^{\prime}=f(V,m^{\prime}), we can upper bound \left\lVert f-f^{\prime}\right\rVert_{\infty} as

ff\displaystyle\left\lVert f-f^{\prime}\right\rVert_{\infty} maxs[S]{cscs+γ(PsPs)V}\displaystyle\leq\max_{s\in[S]}\{\left\lVert c^{\prime}_{s}-c_{s}\right\rVert_{\infty}+\gamma\left\lVert(P_{s}-P^{\prime}_{s})^{\top}V\right\rVert_{\infty}\} (65)
max(1,γV)mm.\displaystyle\leq\max(1,\gamma\left\lVert V\right\rVert_{\infty})\left\lVert m-m^{\prime}\right\rVert_{\infty}. (66)

The policy evaluation operator gπg^{\pi} satisfies (64) if max{πs,π^s}\max\{\left\lVert\pi^{\star}_{s}\right\rVert_{\infty},\left\lVert\hat{\pi}_{s}\right\rVert_{\infty}\} is replaced by πs\left\lVert{\pi}_{s}\right\rVert_{\infty}. Since πs1\left\lVert{\pi}_{s}\right\rVert_{\infty}\leq 1, gπg^{\pi} is K(V)K(V)-Lipschitz. ∎