An Adaptive State Aggregation Algorithm for Markov Decision Processes

Guanting Chen^† Johann Demetrio Gaebler^† Matt Peng^‡ Chunlin Sun^† Yinyu Ye^§
^† Institute for Computational and Mathematical Engineering, Stanford University
^‡ Department of Electrical Engineering & Computer Sciences, University of California, Berkeley
^§ Department of Management Science and Engineering, Stanford University

\{

guanting, jgaeb, chunlin, yyye

\}

@stanford.edu [email protected]

Abstract

Value iteration is a well-known method of solving Markov Decision Processes (MDPs) that is simple to implement and boasts strong theoretical convergence guarantees. However, the computational cost of value iteration quickly becomes infeasible as the size of the state space increases. Various methods have been proposed to overcome this issue for value iteration in large state and action space MDPs, often at the price, however, of generalizability and algorithmic simplicity. In this paper, we propose an intuitive algorithm for solving MDPs that reduces the cost of value iteration updates by dynamically grouping together states with similar cost-to-go values. We also prove that our algorithm converges almost surely to within $2\varepsilon/(1-\gamma)$ of the true optimal value in the $\ell^{\infty}$ norm, where $\gamma$ is the discount factor and aggregated states differ by at most $\varepsilon$ . Numerical experiments on a variety of simulated environments confirm the robustness of our algorithm and its ability to solve MDPs with much cheaper updates especially as the scale of the MDP problem increases.

1 Introduction

State aggregation is a long-standard approach to solving large-scale Markov decision processes (MDPs). The main idea of state aggregation is to define the similarity between states, and work with a system of reduced complexity size by grouping similar states into aggregate, or “mega-”, states. Although there has been a variety of results on the performance of the policy using state aggregation li2006 ; van2006 ; abel2016 , a common assumption is that states are aggregated according to the similarity of their optimal cost-to-go values. Such a scheme, which we term pre-specified aggregation, is generally infeasible unless the MDP is already solved. To address the difficulties inherent in pre-specified aggregation, algorithms that learn how to effectively aggregate states have been proposed ortner2013 ; duan2018 ; sinclair2019 .

This paper aims to provide new insights into adaptive online learning of the correct state aggregates. We propose a simple and efficient state aggregation algorithm for calculating the optimal value and policy of an infinite-horizon discounted MDP that can be applied in planning problems baras2000 and generative MDP problems sidford2018near . The algorithm alternates between two distinct phases. During the first phase (“global updates”), the algorithm updates the cost-to-go values as in the standard value iteration, trading off some efficiency to more accurately guide the cost-to-go values in the right direction; in the second phase (“aggregate updates”), the algorithm groups together states with similar cost-to-go values based on the last sequence of global updates then efficiently updates the states in each mega-state in tandem as it optimizes over the reduced space of aggregate states.

The algorithm is simple: the inputs needed for state aggregation are the current cost-to-go values and the parameter $\varepsilon$ , which bounds the difference between (current) cost to go values of states within a given mega-state. Compared to prior works on state aggregation that use information on the state-action value ( $Q$ -value) sinclair2019 , transition density ortner2013 , and methods such as upper confidence intervals, our method does not require strong assumptions or extra information, and only performs updates in a manner similar to standard value iteration.

Our contribution is a feasible online algorithm for learning aggregate states and cost-to-go values that requires no extra information beyond that required for standard value iteration. We showcase in our experimental results that our method provides significantly faster convergence than standard value iteration especially for problems with larger state and action spaces. We also provide theoretical guarantees for the convergence, accuracy, and convergence rate of our algorithm.

The simplicity and robustness of our novel state aggregation algorithm demonstrates its utility and general applicability in comparison to existing approaches for solving large MDPs.

1.1 Related literature

When a pre-spcified aggregation is given, tsitsiklis1996 ; van2006 give performance bounds on the aggregated system and propose variants of value iteration. There are also a variety of ways to perform state aggregation based on different criteria. ferns2012 ; dean1997 analyze partitioning the state space by grouping states whose transition model and reward function are close. mccallum1997 proposes aggregate states that have the same optimal action and similar $Q$ -values for these actions. jong2005 develops aggregation techniques such that states are aggregated if they have the same optimal action. We refer the readers to li2006 ; abel2016 ; abel2020 for a more comprehensive survey on this topic.

On the dynamical learning side, our adaptive state-aggregated value iteration algorithm is also related to the aggregation-disaggregation method used to accelerate the convergence of value iterations (bertsekas1988 ; schweitzer1985 ; mendelssohn1982 ; bertsekas2018 ) in policy evaluation, i.e., to evaluate the value function of a policy. Among those works, that of bertsekas1988 is closest to our approach. Assuming the underlying Markov processes to be ergodic, the authors propose to group states based on Bellman residuals in between runs of value iteration. They also allow states to be aggregated or disaggregated at every abstraction step. However, the value iteration step in our setting is different from the value iteration in bertsekas1988 for the reason that value iteration in policy evaluation will not involve the Bellman operator. Aside from state aggregation, a variety of other methods have been studied to accelerate value iteration. herzberg1994 proposes iterative algorithms based on a one-step look-ahead approach. shlakhter2010 combines the so called “projective operator” with value iteration and achieve better efficiency. anderson1965 ; fang2009 ; zhang2020 analyze the Anderson mixing approach to speed up the convergence of fixed-point problems.

Dynamic learning of the aggregate states has also been studied more generally in MDP and reinforcement learning (RL) settings. hostetler2014 proposes a class of $Q$ -value–based state aggregations and applies them to Monte Carlo tree search. slivkins2011 uses data-driven discretization to adaptively discretize state and action space in a contextual bandit setting. ortner2013 develops an algorithm for learning state aggregation in an online setting by leveraging confidence intervals. sinclair2019 designs a $Q$ -learning algorithm based on data-driven adaptive discretization of the state-action space. For more state abstraction techniques see baras2000 ; dean1997 ; jiang2014 ; abel2019 .

2 Preliminaries

2.1 Markov decision process

We consider an infinite-horizon Markov Decision Process $M=(\mathcal{S},\mathcal{A},P,r,\gamma,\rho)$ , consisting of: a finite state space $\mathcal{S}$ ; a finite action space $\mathcal{A}$ ; the probability transition model $P$ , where $P(s^{\prime}|s,a)$ denotes the probability of transitioning to state $s^{\prime}$ conditioned on the state-action pair $(s,a)$ ; the immediate cost function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ , which denotes the immediate cost—or reward—obtained from a particular state-action pair; the discount factor $\gamma\in[0,1)$ ; and the initial distribution over states in $\mathcal{S}$ , which we denote by $\rho$ .

A policy $\pi:\mathcal{S}\to\mathcal{A}$ specifies the agent’s action based on the current state, either deterministically or stochastically. A policy induces a distribution over the trajectories $\tau=(s_{t},a_{t},r_{t})_{t=0}^{\infty}$ , where $s_{0}\sim\rho$ , and $a_{t}\sim\pi(\cdot|s_{t})$ and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for $t\geq 0$ .

A value function $\bm{V}:\mathcal{S}\to\mathbb{R}^{|\mathcal{S}|}$ assigns a value to each state; as $|\mathcal{S}|<\infty$ , $\bm{V}$ can also be represented by a finite-length vector $(V(1),...,V(|\mathcal{S}|))^{\top}\in\mathbb{R}^{|\mathcal{S}|}$ . (In this paper, we view all vector as vector functions mapping from the index to the corresponding entry.) Moreover, each policy $\pi$ is associated with a value function $\bm{V}^{\pi}:\mathcal{S}\to\mathbb{R}$ , which is defined to be the discounted sum of future rewards starting at $s$ and with policy $\pi$ :

\bm{V}^{\pi}(s):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|\pi,s_{0}=s\right].

As noted above, we represent both the value function corresponding to the policy $\pi$ and the value vector $(V^{\pi}(1),...,V^{\pi}(|\mathcal{S}|))^{\top}$ as $\bm{V}^{\pi}$ .

For each state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$ belonging to state $s$ , let $\bm{P}_{s,a}\in\mathbb{R}^{|\mathcal{S}|}$ denote the vector of transition probabilities resulting from taking the action $a$ in state $s$ . We call a policy $\pi$ greedy with respect to a given value $\bm{V}\in\mathbb{R}^{|\mathcal{S}|}$ if

\pi(s)\in\operatorname*{arg\,min}_{a\in\mathcal{A}}\left(r(s,a)+\gamma\cdot\bm{P}_{s,a}^{\top}\bm{V})\right).

We define $\bm{T}:\mathbb{R}^{|\mathcal{S}|}\to\mathbb{R}^{|\mathcal{S}|}$ to be the dynamic programming operator such that $(\bm{T}\bm{V})(s)=T_{s}(\bm{V})$ where

T_{s}(\bm{V})=\min_{a\in\mathcal{A}}\left(r(s,a)+\gamma\cdot\bm{P}_{s,a}^{\top}\bm{V}\right).

The optimal value function $\bm{V}^{*}$ is the unique solution of the equation $\bm{V}^{*}=T\bm{V}^{*}$ . A common approach to find $\bm{V}^{*}$ is value iteration, which, given an initial guess $\bm{V}_{0}$ , generates a sequence of value functions $\{\bm{V}_{t}\}_{t=0}^{\infty}$ such that $\bm{V}_{t+1}=\bm{T}\bm{V}_{t}$ . The sequence $\{\bm{V}_{t}\}_{t=0}^{\infty}$ converges to $V^{*}$ as $t$ goes to $+\infty$ Bel .

2.2 State aggregation

The state space of MDPs can be very large. State aggregation divides the state space $\mathcal{S}$ into $K$ subsets and views each collection of states as a mega-state. Then, the value function generated by the mega-states can be used to approximate the optimal value $\bm{V}^{*}$ .

To represent a state aggregation, we define the matrix $\bm{\Phi}\in\mathbb{R}^{|\mathcal{S}|\times K}$ . We set $\phi_{i,j}=1$ if state $i$ is in the $j$ -th mega-state, and let $\phi_{i,j}=0$ otherwise; i.e., column $j$ of $\bm{\Phi}$ indicates whether each state belongs to mega-state $j$ . The state-reduction matrix $\bm{\Phi}$ also induces a partition $\{S_{i}\}_{i=1}^{K}$ on $\mathcal{S}$ , i.e., $\mathcal{S}=\bigcup_{i=1}^{K}S_{i}$ and $S_{i}\cap S_{j}=\emptyset$ for $i\neq j$ . Denote by $\bm{W}\in\mathbb{R}^{K}$ the cost-to-go value function for the aggregated state, and note that the current value of $\bm{W}$ induces a value function $\tilde{\bm{V}}(\bm{W})\in\mathbb{R}^{|\mathcal{S}|}$ on the original state space, where

\tilde{\bm{V}}(s,\bm{W})=\bm{W}(j),\,\,\text{ for }s\in S_{j}.

3 Algorithm design

In this section, we first introduce a state aggregation algorithm which assumes knowledge of the optimal value function. The algorithm is proposed in tsitsiklis1996 which also provides the corresponding convergence result. Based on existing theory, we design our adaptive algorithm and discuss its convergence properties.

3.1 A pre-speficied aggregation algorithm

Given a pre-specified aggregation, we seek the value function $\bm{W}$ for the aggregated states such that

\displaystyle\|\tilde{\bm{V}}(\bm{W})-\bm{V}^{*}\|_{\infty}=O(\|\bm{e}\|_{\infty}),

(1)

where $\bm{e}=(e_{1},...,e_{K})^{\top}$ and $e_{j}=\max_{s_{1},s_{2}\in S_{j}}|V^{*}(s_{1})-V^{*}(s_{2})|$ for $j=1,...,K$ . Intuitively, Eq. $\eqref{eqn_prefix_error}$ justifies our approach of aggregating states that have similar optimal cost-to-go values. We then state the algorithm that will converge to the correct cost-to-go values while satisfying Eq. $\eqref{eqn_prefix_error}$ .

Algorithm 1 Random Value Iteration with Aggregation

1:Input:

\bm{P}

r

\gamma

\bm{\Phi}

\{\alpha_{t}\}_{t=1}^{\infty}

2:Initialize

\bm{W}_{0}=\bm{0}

3:for

t=1,...,n

4: for

j=1,...,K

5: Sample state

s

uniformly from set

S_{j}

W_{t+1}(j)=(1-\alpha_{t})W_{t}(j)+\alpha_{t}T_{s}\tilde{\bm{V}}(\bm{W}_{t})

7: end for

8:end for

9:Output:

\tilde{\bm{V}}_{n}

Algorithm 1 takes a similar form in Stochastic Approximation robbins1951 ; wasan2004 , and will converge almost surely to a unique cost-to-go value. Here $\alpha_{t}$ is the step size of the learning algorithm; by taking, e.g., $\alpha_{t}=1$ , we recover the formula of value iteration. The following convergence result is proved in tsitsiklis1996 .

Proposition 1 (Theorem 1, tsitsiklis1996 ).

When $\sum_{t=1}^{\infty}\alpha_{t}=\infty$ and $\sum_{t=1}^{\infty}\alpha_{t}^{2}<\infty$ , $\{\bm{W}_{t}\}_{t=1}^{\infty}$ in Line 6 of Algorithm 1 will converge almost surely to $\bm{W}^{*}$ entry-wise, where $\bm{W}^{*}$ is the solution of

W^{*}(j)=\cfrac{1}{|S_{j}|}\sum_{s\in S_{j}}T_{s}\tilde{\bm{V}}(\bm{W}^{*}).

(2)

Define $\pi^{\bm{W}^{*}}$ to be the greedy policy with respect to $\tilde{\bm{V}}(\bm{W}^{*})$ . Then, we have, morevoer, that

	$\displaystyle\\|\tilde{\bm{V}}(\bm{W}^{})-\bm{V}^{}\\|_{\infty}$	$\displaystyle\leq\frac{\\|\bm{e}\\|_{\infty}}{1-\gamma},$		(3)
	$\displaystyle\\|{\bm{V}}^{\pi^{\bm{W}^{}}}-\bm{V}^{}\\|_{\infty}$	$\displaystyle\leq\frac{2\gamma\\|\bm{e}\\|_{\infty}}{(1-\gamma)^{2}},$		(3)

where ${\bm{V}}^{\pi^{\bm{W}^{*}}}$ is the value function associated with policy $\pi^{\bm{W}^{*}}$ .

Proposition 1 states that if we are able to partition the state space such that the maximum difference of the optimal value function within each mega-state is small, the value function produced by Algorithm 1 can approximate the optimal value up to $\frac{\|\bm{e}\|_{\infty}}{1-\gamma}$ , and the policy associated with the approximated value function will also be close to the optimal policy.

3.2 Value iteration with state aggregation

In order to generate an efficient approximation, Proposition 1 requires a pre-specified aggregation scheme such that $\max_{s_{1},s_{2}\in S_{i}}|V^{*}(s_{1})-V^{*}(s_{2})|$ is small for every $i$ to guarantee the appropriate level of convergence for Algorithm 1. Without knowing $\bm{V}^{*}$ , is it still possible to control the approximation error? In this section we answer in the affirmative by introducing an adaptive state aggregation scheme that learns the correct state aggregations online as it learns the true cost-to-go values.

Given the current cost-to-go value vector $V\in\mathbb{R}^{|\mathcal{S}|}$ , let $b_{1}=\min_{s\in\mathcal{S}}V(s)$ , let $b_{2}=\max_{s\in\mathcal{S}}V(s)$ . Group the cost-to-go values among disjoint subintervals of the form $\lceil(b_{2}-b_{1})/\varepsilon\rceil$ . Next, let $\Delta=(b_{2}-b_{1})/\varepsilon$ , and let $S_{j}$ to be the $j$ -th mega-state, which contains all the states whose current estimated cost-to-go value falls in the interval $[b_{1}+(j-1)\varepsilon,b_{1}+j\varepsilon)$ . Grouping the states in this way reduces the state size from $|\mathcal{S}|$ states to at most $\lceil(b_{2}-b_{1})/\varepsilon\rceil$ mega-states. See Algorithm 2 for further details.

Algorithm 2 Value-based Aggregation

1:Input:

\varepsilon

\bm{V}=(V(1),...,V(|\mathcal{S}|))^{\top}

b_{1}=\min\limits_{s\in|\mathcal{S}|}V(s)

b_{2}=\max\limits_{s\in|\mathcal{S}|}V(s)

\Delta=(b_{2}-b_{1})/\varepsilon

3:for

i=1,...,\lceil\Delta\rceil

\hat{S}_{i}=\left\{s|V(s)\in[b_{1}+(i-1)\varepsilon,b_{1}+i\varepsilon)\right\}

\hat{W}(i)=b_{1}+(i-\frac{1}{2})\varepsilon

5:end for

6:Delete the empty sets in

\{\hat{S}_{i}\}_{i=1}^{\lceil\Delta\rceil}

while keep the same order, and define the modified partition to be

\{S_{i}\}_{i=1}^{K}

, where

K

is the cardinally of the modified set of mega-states. Modify

\hat{W}

and generate

W\in\mathbb{R}^{K}

the similar way.

7:Return

\{S_{i}\}_{i=1}^{K}

and

W

Without the knowledge of $\bm{V}^{*}$ in advance, one must periodically perform value iteration on $\mathcal{S}$ to learn the correct aggregation to help with adapting the aggregation scheme. As a result, our algorithm alternates between two phases: in the global update phase the algorithm performs value iteration on $\mathcal{S}$ ; in the aggregated update phase, the algorithm starts to group together states with similar cost-to-go values based on the result of the last global update, and then performs aggregated updates as in Algorithm 1.

We denote by $\{\mathcal{A}_{i}\}_{i=1}^{\infty}$ the intervals of iterations in which the algorithm performs state-aggregated updates, and we denote by $\{\mathcal{B}_{i}\}_{i=1}^{\infty}$ the intervals of iterations in which the algorithm performs global update. As a consequence, $b<a$ for any $a\in\mathcal{A}_{i}$ and $b\in\mathcal{B}_{i}$ ; likewise, $a<b$ for any $a\in\mathcal{A}_{i}$ and $b\in\mathcal{B}_{i+1}$ .

We then present our adaptive algorithm. For a pre-speficied number of iterations $n$ , the time horizon $[1,n)$ is divided into intervals of the form $\mathcal{B}_{1},\mathcal{A}_{1},\mathcal{B}_{2},\mathcal{A}_{2},\ldots$ . Every time the algorithm exits an interval of global updates $\mathcal{B}_{i}$ , it runs Algorithm 2 based on the current cost-to-go value and the parameter $\varepsilon$ , using the output of Algorithm 2 for the current state aggregation and cost-to-go values for $\mathcal{A}_{i}$ . Similarly, every time the algorithm exits an interval of state-aggregated updates $\mathcal{A}_{i}$ , it sets $\tilde{\bm{V}}(\bm{W})$ , where $\bm{W}$ is the current cost-to-go value for the aggregated space, as the initial cost-to-go value for the subsequent interval of global iterations.

Algorithm 3 Value Iteration with Adaptive Aggregation

1:Input:

P

r

\varepsilon

\gamma

\{\alpha_{t}\}_{t=1}^{\infty}

\{\mathcal{A}_{i}\}_{i=1}^{\infty}

\{\mathcal{B}_{i}\}_{i=1}^{\infty}

2:Initialize

W_{0}=\bm{0}

V_{1}=\bm{0}

t_{sa}=1

3:for

t=1,...,n

4: if

t\in\mathcal{B}_{i}

then

5: if

t=\min\{\mathcal{B}_{i}\}

then

\bm{V}_{t-1}=\tilde{\bm{V}}(\bm{W}_{t-1}).

6: end if

7: for

j=1,...,|\mathcal{S}|

V_{t}(j)=T_{j}\bm{V}_{t-1}.

8: end for

9: else

10: Find current

i

s.t.

t\in\mathcal{A}_{i}

11: if

t=\min\{\mathcal{A}_{i}\}

then

12: Define

\{S_{i}\}_{i=1}^{K}

and

\bm{W}_{t}

to be the output of Algorithm 2 with input

\varepsilon

\bm{V}_{t-1}

13: end if

14: for

j=1,...,K

15: Sample state

s

uniformly from set

S_{j}

16:

\displaystyle W_{t}(j)=(1-\alpha_{t_{sa}}){W}_{t-1}(j)+\alpha_{t_{sa}}T_{s}\tilde{\bm{V}}(\bm{W}_{t-1})

(4)

17: end for

18:

t_{sa}=t_{sa}+1

19: end if

20:end for

21:if

n\in\mathcal{B}_{i}

then return

\bm{V}_{n}

22:end if

23:return

\tilde{\bm{V}}(\bm{W}_{n})

3.3 Convergence

From Proposition 1, if we fix the state-aggregation parameter $\varepsilon$ , even with perfect information, state aggregated value iteration will generate an approximation of the cost-to-go values with $\ell^{\infty}$ error bounded by $\varepsilon/(1-\gamma)$ . This bound is sharp, as shown in tsitsiklis1996 . Such error is negligible in the early phase of the algorithm, but the error would accumulate in the later phase of the algorithm and prevent the algorithm from converging to the optimal value. As a result, it is not desirable for $\limsup|\mathcal{A}_{t}|\to\infty$ .¹¹1 By setting $\varepsilon$ adaptively, one might achieve better complexity and error bound by setting $\lim\sup|\mathcal{A}_{t}|\to\infty$ ; however, adaptively choosing $\varepsilon$ lies beyond the scope of this work.

We state asymptotic convergence results for Algorithm 3; proofs can be found in the supplementary materials. For the remainder of the paper, by a slight abuse of notation, we denote by $\bm{V}_{t}$ the current cost-to-go value at iteration $t$ . More specifically, if the current algorithm is in phase $\mathcal{B}_{i}$ , $\bm{V}_{t}$ is the updated cost-to-go value for global value iteration. If the algorithm is in phase $\mathcal{A}_{i}$ , $\bm{V}_{t}$ represents $\tilde{\bm{V}}(\bm{W}_{t})$ .

Theorem 1.

If $\limsup\alpha_{t}\to 0$ , $\limsup_{t\to\infty}|\mathcal{A}_{i}|<\infty$ and $\liminf_{t\to\infty}|\mathcal{B}_{i}|>0$ , we have

\limsup_{t\rightarrow\infty}\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{2\varepsilon}{1-\gamma}.

Notice that the result of Theorem 1 is consistent with Proposition 1: due to state aggregation we suffer from the same $\frac{1}{1-\gamma}$ for the error bound. However, our algorithm maintains the same order of error bound without knowing the optimal value function $\bm{V}^{*}$ .

We also identify the existence of “stable field” for our algorithm, and we prove that under some specific choice of the learning rate $\alpha_{t}$ , with probability one the value function will stay within the stable field.

Proposition 2.

If $\alpha_{t}<\frac{\varepsilon}{(1-\gamma)\sup_{i}|\mathcal{A}_{i}|}$ for all $t$ , we have that after $O\left(\max\left\{1,\frac{\log(\varepsilon)}{\log(\gamma)}\right\}\right)$ iterations, with probability one the estimated approximation $\bm{V}_{t}$ satisfies

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{3\varepsilon}{1-\gamma}

for any choice of initialization $\bm{V}_{0}\in\left[0,\frac{1}{1-\gamma}\right]^{|\mathcal{S}|}$ .

Proposition 3.

For $\beta>0$ , if $\alpha_{t}\leq t^{-\beta}$ , after $O(\frac{\log(\varepsilon)}{\log(\gamma)}+(1-\gamma)^{-\frac{1}{\beta}}\varepsilon^{-\frac{1}{\beta}})$ iterations, with probability one the estimated approximation $\bm{V}_{t}$ satisfies

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{3\varepsilon}{1-\gamma}

for any choice of initialization $\bm{V}_{0}\in\left[0,\frac{1}{1-\gamma}\right]^{|\mathcal{S}|}$ .

Such results provide guidance for choice of parameters. Indeed, the experimental results are consistent with the theory presented above.

4 Experiments

To test the theory developed in Sections 3, we perform a number of numerical experiments.²²2All experiments were performed in parallel using forty Xeon E5-2698 v3 @ 2.30GHz CPUs. Total compute time was approximately 60 hours. Code and replication materials are available in the Supplementary Materials. We test our methods on a variety of MDPs of different sizes and complexity. Our results show that state aggregation can achieve faster convergence than standard value iteration; that state aggregation scales appropriately as the size and dimensionality of the underlying MDP increases; and that state aggregation is reasonably robust to measurement error (simulated by adding noise to the action costs) and varying levels of stochasticity in the transition matrix.

4.1 MDP problems

We consider two problems in testing our algorithm. The first, which we term the “standard maze problem” consists of a $d_{1}\times d_{2}\times\cdots d_{n}$ grid of positions. Each position is connected to one or more adjacent positions. Moving from position to position incurs a constant cost, except for moving to the terminal state of the maze (the position $(1,\ldots,1)$ ) which incurs a constant reward.³³3 We rescale action costs in both the standard and terrain mazes to ensure that the maximum cost-to-go is exactly 100. There is a unique path from each position to the terminal state. A two-dimensional $20\times 20$ standard maze, in which the player can move, depending on their position, up, down left, or right is illustrated in Figure 1(a).

The second problem is the “terrain maze problem.” As in a standard maze, each state in the terrain maze represents a position in a $d_{1}\times\cdots\times d_{n}$ grid. As before, we imagine that the player can move from state to state only by travelling a single unit in any direction. (The player is only constrained from moving out of the grid entirely.) The player receives a reward for reaching the final square of the maze, which again we place at position $(1,\ldots,1)$ . However, in contrast to the standard maze game, the player’s movements incur different costs at different positions. In particular, the maze is determined by a “height function” $H:\{1,\ldots,d_{1}\}\times\cdots\times\{1,\ldots,d_{n}\}\to\mathbb{R}$ . The cost of movement is set to be the difference in heights between the player’s destination position and their current position, normalized appropriately.

In both problems, we also allow for stochasticity controlled by a parameter $p$ which gives the probability that a player moves in their intended direction. For $p=1$ , the MDP is deterministic; otherwise, with probability $1-p$ , the player moves in a different direction chosen uniformly at random.

In both problems, the actions available at any state correspond to a very sparse transition probability vectors, since players are constrained to move along cardinal directions at a rate of a single unit. However, in the standard maze game, the cost-to-go at any position is extremely sensitive to the costs-to-go at states that are very distant. In a $10\times 10$ standard maze, the initial tile (i.e., position (1, 1)) is often between 25 and 30 units away from the destination tile (i.e., position (10, 10)). In contrast, in the terrain maze game, the cost-to-go is much less sensitive to far away positions, because local immediate costs, dictated by the slopes one must climb or go down to move locally, are more significant.

4.1.1 Benchmarks and parameters

We measure convergence by the $\ell^{\infty}$ -distance (hereafter “error”) between the current cost-to-go vector and the true cost-to-go vector, and we evaluate the speed based on the size of error and the number of updates performed. Notice that for global value iteration, an update for state $s$ has the form

T_{s}(\bm{V})=\min_{a\in\mathcal{A}}\left(r(s,a)+\gamma\bm{P}_{s,a}^{\top}\bm{V}\right),

and an update for mega-state $j$ (represented by $s$ ) has the form

W_{t}(j)=(1-\alpha_{t-t_{0}})W_{t-1}(j)+\alpha_{t-t_{0}}T_{s}\tilde{\bm{V}}(\bm{W}_{t-1}).

Because the transition matrix is not dense in our examples, the computational resources required for global value iteration update and aggregated update are roughly the same. For each iteration, the value iteration will always perform $|\mathcal{S}|$ updates, and for Algorithm 3, only $K$ updates, one for each mega-state, will be performed if in the aggregation phase.

All experiments are performed with a discount factor $\gamma=0.95$ . We set $|\mathcal{A}_{i}|=5$ and $|\mathcal{B}_{i}|=2$ for every $i$ , and for learning rate we set $\alpha_{t}=\frac{1}{\sqrt{t}}$ . The cost function is normalized such that $\|\bm{V}^{*}\|_{\infty}=100$ , and we choose aggregation constant to be $\varepsilon=0.5$ (unless otherwise indicated). We set the initial cost vector $\bm{V}_{0}$ to be the zero vector $\bm{0}$ .

4.1.2 Results

Influence of $\varepsilon$ .

We test the effect of $\varepsilon$ on the error Algorithm 3 produces. We run experiments with aggregation constant $\varepsilon=0.05,0.1$ , and $0.5$ for $500\times 500$ standard and terrain mazes. For each $\varepsilon$ , we run 1,000 iterations of Algorithm 3, and each experiment is repeated 20 times; the results, shown in Figure 2(a), indicate that the approximation error scales in proportion to $\varepsilon$ , which is consistent with Proposition 1 and Theorem 1.

Efficiency.

We test the convergence rate of Algorithm 3 against value iteration on $500\times 500$ standard and terrain mazes, repeating each experiment 20 times. From Figure 2(b) and 2(c), we see that state-aggregated value iteration is very efficient at the beginning phase, converging in fewer updates than value iteration.

Type	Dims.	Error	95% CI
Trn.	$100\times 100$	4.41	$\pm 0.14$
Trn.	$200\times 200$	4.34	$\pm 0.16$
Trn.	$300\times 300$	4.65	$\pm 0.18$
Trn.	$500\times 500$	4.27	$\pm 0.17$
Trn.	$1000\times 1000$	4.27	$\pm 0.16$
Std.	$100\times 100$	1.43	$\pm 0.16$
Std.	$200\times 200$	1.39	$\pm 0.15$
Std.	$300\times 300$	1.42	$\pm 0.20$
Std.	$500\times 500$	1.11	$\pm 0.16$
Std.	$1000\times 1000$	1.40	$\pm 0.16$

Type	Dims.	Error	95% CI
Trn.	$10^{3}$	1.91	$\pm 0.007$
Trn.	$10^{4}$	3.02	$\pm 0.009$
Trn.	$10^{5}$	3.59	$\pm 0.008$
Trn.	$10^{6}$	3.85	$\pm 0.005$
Std.	$10^{3}$	1.36	$\pm 0.019$
Std.	$10^{4}$	1.36	$\pm 0.017$
Std.	$10^{5}$	1.23	$\pm 0.013$
Std.	$10^{6}$	1.31	$\pm 0.013$

Table 1: Scaling properties of state aggregation value iteration. Reported errors represent the mean value from running each experiment 20 times.

Type	$p$	$\sigma$	Error	95% CI
Terrain	0.92	0.00	4.44	$\pm 0.24$
Terrain	0.92	0.01	4.41	$\pm 0.18$
Terrain	0.92	0.05	4.97	$\pm 0.17$
Terrain	0.92	0.10	6.36	$\pm 0.19$
Terrain	0.95	0.00	4.43	$\pm 0.17$
Terrain	0.95	0.01	4.32	$\pm 0.14$
Terrain	0.95	0.05	4.93	$\pm 0.17$
Terrain	0.95	0.10	6.39	$\pm 0.17$
Terrain	0.98	0.00	4.37	$\pm 0.19$
Terrain	0.98	0.01	4.31	$\pm 0.14$
Terrain	0.98	0.05	5.01	$\pm 0.14$
Terrain	0.98	0.10	6.52	$\pm 0.20$

Type	$p$	$\sigma$	Error	95% CI
Standard	0.92	0.00	1.39	$\pm 0.19$
Standard	0.92	0.01	1.61	$\pm 0.23$
Standard	0.92	0.05	2.62	$\pm 0.15$
Standard	0.92	0.10	5.56	$\pm 0.46$
Standard	0.95	0.00	1.49	$\pm 0.17$
Standard	0.95	0.01	1.57	$\pm 0.14$
Standard	0.95	0.05	2.86	$\pm 0.17$
Standard	0.95	0.10	5.72	$\pm 0.39$
Standard	0.98	0.00	1.43	$\pm 0.18$
Standard	0.98	0.01	1.88	$\pm 0.16$
Standard	0.98	0.05	3.48	$\pm 0.28$
Standard	0.98	0.10	5.86	$\pm 0.26$

Table 2: Numerical experiments illustrating the robustness of state-aggregated value iteration to stochasticity and noisy action costs. Errors represent the average

\ell_{\infty}

-distance to the true cost-to-go values in twenty independent runs.

Scalibility of state aggregation.

We run state-abstracted value iteration on standard and terrain mazes of size $100\times 100$ , $200\times 200$ , $300\times 300$ , $500\times 500$ , and $1000\times 1000$ for 1000 iterations. We repeat each experiment 20 times, displaying the results in Table 1. Next, we run state-aggregated value iteration on terrain mazes of increasingly large underlying dimension, as shown in the right side of Table 1, likewise for 20 repetitions for each size. The difference with the previous experiment is that not only does the state space increase, the action space also increases exponentially. Our results show that the added complexity of the high-dimensional problems does not appear to substantially affect the convergence of state-aggregated value iteration and our method is able to scale with very large MDP problems.

Robustness.

We examine the robustness of state-aggregated value iteration to two sources of noise. We generate $500\times 500$ standard and terrain mazes, varying the level of stochasiticity by setting $p=0.92,0.95,0.98$ . We also vary the amount of noise in the cost vector by adding a mean 0, standard deviation $\sigma=0.0,0.01,0.05,0.1$ normal vector to the action costs. The results, shown in Table 2, indicate that state-aggregated value iteration is reasonably robust to stochasiticity and measurement error.

4.2 Continuous Control Problems

We conclude the experiments section by showing the performance of our method on a real-world use case in continuous control. These problems often involve solving complex tasks with high-dimensional sensory input. The idea typically involves teaching an autonomous agent, usually a robot, to successfully complete some task or achieve some goal state. These problems are often very tough as they reside in the continuous state space (and many times action space) domain.

We already showcased the significant reduction in algorithm update costs during learning on grid-world problems in comparison with value iteration. Our goal for this section is to showcase real world examples of how our method may be practically applied in the field of continuous control. This is important not only to emphasize that our idea has a practical use case, but also to further showcase its ability to scale into extremely large (and continuous) dimensional problems.

4.2.1 Environment

We choose to use the common baseline control problem in the "CartPole" system. The CartPole system is a typical physics control problem where a pole is attached to a central joint of a cart, which moves along an endless friction-less track. The system is controlled by applying a force to either move to the right or left with the goal to balance the pole, which starts upright. The system terminates if (1) the pole angle is more than $12$ degrees from the vertical axis, or (2) the cart position is more than $2.4$ cm from the center, or (3) the episode length is greater than $200$ time-steps.

During each episode, the agent will receive constant reward for each step it takes. The state-space is the continuous position and angle of the pole for the cartpole pendulum system and the action-space involves two actions - applying a force to move the cart left or right. A more in-depth explanation of the problem can be found in DBLP:journals/corr/abs-2006-04938 .

4.2.2 Results

Since the CartPole problem is a multi-dimensional continuous control problem, there is no ground truth $v$ -values so we choose to utilize the dense reward nature of this problem and rely on the accumulated reward to quantify algorithm performance as opposed to error. In addition, since value iteration requires discrete states, we discretize all continuous dimensions of the state space into bins and generate policies on the discretized environment. More specifically, we discretize the continuous state space into bins by dividing each dimension of the domain into equidistant intervals.

For our aggregation algorithm, we use $\gamma=0.99$ following what is commonly used in this problem in past works. We set the initial cost vector $V_{0}$ to be the zero vector. We fix $\varepsilon=0.2$ and set $\alpha_{t}=\frac{1}{\sqrt{t}}$ . We note that given the symmetric nature of this continuous control problem, we do not need to alternate between global and aggregate updates. The adaptive aggregation of states already groups states effectively together for strong performance and significant speedups in the number of updates required. We first do a sweep of number of bins to discretize the problem space to determine the number that best maximizes the performance of value iteration in terms of both update number and reward and found this to be around 2000 bins. We then compare the performances of state aggregation and value iteration on this setting in figure 3. To also further showcase the adaptive advantage of our method, we choose a larger bin (10000) number that likely would have been chosen if no bin sweeping had occurred and show that our method acts as an "automatic bin adjuster" and offers the significant speedups without any prior tuning in figure 3. Interestingly in both situations, the experimental results show that the speedup offered by our method seem significant across different bin settings of the CartPole problem. These results may prove to have strong theoretical directions in future works.

5 Discussion

Value iteration is an effective tool for solving Markov decision processes but can quickly become infeasible in problems with large state and action spaces. To address this difficulty, we develop adaptive state-aggregated value iteration, a novel method of solving Markov decision processes by aggregating states with similar cost-to-go values and updating them in tandem. Unlike previous methods of reducing the dimensions of state and action spaces, our method is general and does not require prior knowledge of the true cost-to-go values to form aggregate states. Instead our algorithm learns the cost-to-go values online and uses them to form aggregate states in an adaptive manner. We prove theoretical guarantees for the accuracy, convergence, and convergence rate of our state-aggregated value iteration algorithm, and demonstrate its applicability through a variety of numerical experiments.

State- and action-space reduction techniques are an area of active research. Our contribution in the dynamic state-aggregated value iteration algorithm provides a general framework for approaching large MDPs with strong numerical performances justifying our method. We believe our algorithm can serve as a foundational ground for both future empirical and theoretical work.

We conclude by discussing promising directions for future work on adaptive state aggregation. First, we believe that reducing the number of updates per state by dynamically aggregating states can also be extended more generally in RL settings with model-free methods. Second, our proposed algorithm’s complexity is of the same order as value iteration. Future work may seek to eliminate the dependence on $\gamma$ in the error bound. Lastly, by adaptively choosing $\varepsilon$ , it may be possible to achieve better complexity bounds not only for planning problems but also for generative MDP models sidford2018 ; sidford2018near .

References

[1] D. Abel, D. Arumugam, K. Asadi, Y. Jinnai, M. L. Littman, and L. L. Wong. State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3134–3142, 2019.
[2] D. Abel, D. Hershkowitz, and M. Littman. Near optimal behavior via approximate state abstraction. In International Conference on Machine Learning, pages 2915–2923. PMLR, 2016.
[3] D. Abel, N. Umbanhowar, K. Khetarpal, D. Arumugam, D. Precup, and M. Littman. Value preserving state-action abstractions. In International Conference on Artificial Intelligence and Statistics, pages 1639–1650. PMLR, 2020.
[4] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
[5] J. Baras and V. Borkar. A learning algorithm for markov decision processes with adaptive state aggregation. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), volume 4, pages 3351–3356. IEEE, 2000.
[6] R. Bellman. A markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.
[7] D. P. Bertsekas. Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6(1):1–31, 2018.
[8] D. P. Bertsekas, D. A. Castanon, et al. Adaptive aggregation methods for infinite horizon dynamic programming. 1988.
[9] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing approximately optimal solutions for markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI’97, page 124–131, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[10] Y. Duan, Z. T. Ke, and M. Wang. State aggregation learning from markov transition data. arXiv preprint arXiv:1811.02619, 2018.
[11] H.-r. Fang and Y. Saad. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009.
[12] N. Ferns, P. Panangaden, and D. Precup. Metrics for markov decision processes with infinite state spaces. arXiv preprint arXiv:1207.1386, 2012.
[13] M. Herzberg and U. Yechiali. Accelerating procedures of the value iteration algorithm for discounted markov decision processes, based on a one-step lookahead analysis. Operations Research, 42(5):940–946, 1994.
[14] J. Hostetler, A. Fern, and T. Dietterich. State aggregation in monte carlo tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[15] N. Jiang, S. Singh, and R. Lewis. Improving uct planning via approximate homomorphisms. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1289–1296, 2014.
[16] N. K. Jong and P. Stone. State abstraction discovery from irrelevant state variables. In IJCAI, volume 8, pages 752–757. Citeseer, 2005.
[17] S. Kumar. Balancing a cartpole system with reinforcement learning - A tutorial. CoRR, abs/2006.04938, 2020.
[18] L. Li, T. J. Walsh, and M. L. Littman. Towards a unified theory of state abstraction for mdps. ISAIM, 4:5, 2006.
[19] R. McCallum. Reinforcement learning with selective perception and hidden state. 1997.
[20] R. Mendelssohn. An iterative aggregation procedure for markov decision processes. Operations Research, 30(1):62–73, 1982.
[21] R. Ortner. Adaptive aggregation for reinforcement learning in average reward markov decision processes. Annals of Operations Research, 208(1):321–336, 2013.
[22] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[23] P. J. Schweitzer, M. L. Puterman, and K. W. Kindle. Iterative aggregation-disaggregation procedures for discounted semi-markov reward processes. Operations Research, 33(3):589–605, 1985.
[24] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for markov decision processes. Operations research, 58(1):193–202, 2010.
[25] A. Sidford, M. Wang, X. Wu, L. F. Yang, and Y. Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5192–5202, 2018.
[26] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. SIAM, 2018.
[27] S. R. Sinclair, S. Banerjee, and C. L. Yu. Adaptive discretization for episodic reinforcement learning in metric spaces. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–44, 2019.
[28] A. Slivkins. Contextual bandits with similarity information. In Proceedings of the 24th annual Conference On Learning Theory, pages 679–702. JMLR Workshop and Conference Proceedings, 2011.
[29] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1):59–94, 1996.
[30] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
[31] M. T. Wasan. Stochastic approximation. Number 58. Cambridge University Press, 2004.
[32] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-i anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.

Appendix A Appendix

A.1 Preliminaries

Since the number state-action pairs is finite, without loss of generality we can assume that the cost function $r$ is bounded by $1$ . By a slight abuse of notation, we denote by $\bm{V}_{t}$ the current cost-to-go value at iteration $t$ generated by Algorithm 3. More specifically, if the current algorithm is in global value iteration phase $\mathcal{B}_{i}$ , $\bm{V}_{t}$ is the updated cost-to-go value for global value iteration. If the algorithm is in state aggregation phase $\mathcal{A}_{i}$ , $\bm{V}_{t}$ represents $\tilde{\bm{V}}(\bm{W}_{t})$ . Also, denote by $\bm{E}_{t}=\bm{V}_{t}-\bm{V}^{*}$ the error between the current estimate and the optimal value function. We first state results on the $\ell^{\infty}$ norm of $\bm{V}^{*}$ and $\bm{V}_{t}$ .

Lemma 1.

For any $t>0$ we have $||\bm{V}_{t}||_{\infty}\leq\frac{1}{1-\gamma}$ and $||\bm{V}^{*}||_{\infty}\leq\frac{1}{1-\gamma}$ .

Lemma 1 states that both of the optimal cost-to-go value and the value obtained by Algorithm 3 is uniformly bounded by $\frac{1}{1-\gamma}$ entry-wisely at all aggregated and global iterations.

Next, it is well known that $||\bm{E}_{t}||_{\infty}$ can only decrease in the global value iteration phase due to the fact that the Bellman operator is a contraction mapping. However, $||\bm{E}_{t}||_{\infty}$ can potentially grown when we switch into a state aggregation phase. To evaluate the fluctuation of $||\bm{E}_{t}||_{\infty}$ during the state aggregation phase, we firstly notice that the step 4 in Algorithm 2 (which is the initialization step for state aggregation) will generate a tight bound: if we denote $\bm{W}(\bm{V}_{t})$ to be the initialization result in step 4 Algorithm 2, we will have

\displaystyle||\tilde{\bm{V}}(\bm{W}(\bm{V}_{t}))-\bm{V}^{*}||_{\infty}\leq||\bm{V}_{t}-\bm{V}^{*}||_{\infty}+\frac{\varepsilon}{2}.

(5)

The inequality above is tight because by construction there exists a scenario such that $\bm{V}_{t}=\bm{V}^{*}$ and $\tilde{\bm{V}}(\bm{W}(\bm{V}_{t}))(s)-\bm{V}^{*}(s)=\frac{\varepsilon}{2}$ for some $s\in\mathcal{S}$ .

After the error introduced in the initialization stage of stage aggregation, the value updates process may also accumulate error. Recall that during each state aggregation phase $\mathcal{A}_{i}$ , for $t\in\mathcal{A}_{i}$ we have the following updates for mega-state $j$

W_{t}(j)=(1-\alpha_{t_{sa}})W_{t-1}(j)+\alpha_{t_{sa}}T_{s}\tilde{\bm{V}}(\bm{W}_{t-1}).

Hence by combining Lemma 1, we know that for every SA iteration we have

|W_{t}(j)-W_{t-1}(j)|\leq\alpha_{t_{sa}}(||T_{s}\tilde{\bm{V}}(\bm{W}_{t-1})||_{\infty}+||\bm{W}_{t-1})||_{\infty})\leq\alpha_{t_{sa}}\frac{2}{1-\gamma}.

We then state a general condition that will control the error bound of the state aggregation value update, and we will show that all of our theorem and propositions satisfy such condition.

Condition 1.

There exists $M>0$ such that for any state aggregation phase $\mathcal{A}_{i}$ coming after $M$ state aggregation iterations, we have

\displaystyle||\bm{W}_{t_{0}+|\mathcal{A}_{i}|}-\bm{W}_{t_{0}}||_{\infty}\leq\sum_{i=0}^{|\mathcal{A}_{i}|-1}\alpha_{t_{sa}}(||T_{s}\tilde{\bm{V}}(\bm{W}_{t_{0}+i})||_{\infty}+||\bm{W}_{t_{0}+i}||_{\infty})\leq\frac{\varepsilon}{2}.

(6)

During state aggregation phase, from $\eqref{ineq_error_1}$ we know that the if we switch from $\mathcal{B}_{i}$ to $\mathcal{A}_{i}$ , $\bm{E}_{t}$ can increase by $\frac{\varepsilon}{2}$ , and from Condition 1 we know that after $|\mathcal{A}_{i}|$ steps of value iteration, $\bm{E}_{t}$ can increase by another $\frac{\varepsilon}{2}$ .

More specifically, after $M$ iterations suggested by Condition 1, if we take $t_{0}=\max\{t\mid t\in\mathcal{B}_{i}\}$ and $t_{1}=\min\{t\mid t\in\mathcal{B}_{i+1}\}$ , from our discussion above we have

\displaystyle||\bm{V}_{t_{1}}-\bm{V}^{*}||_{\infty}\leq||\bm{V}_{t_{0}}-\bm{V}^{*}||_{\infty}+\varepsilon.

(7)

Moreover, the above equation also holds for all $t\in\mathcal{B}_{i+1}$ , as the Bellman operator is a contraction.

Then, for notational convenience, we re-index the first global iteration phase after $M$ SA iterations to be $\mathcal{B}_{1}$ , the following time interval to be $\mathcal{A}_{1},\mathcal{B}_{2},\mathcal{A}_{2},\cdots$ . We also denote $\tau_{i}=\min\{t\mid t\in\mathcal{B}_{i}\}$ and $t_{i}=\max\{t\mid t\in\mathcal{B}_{i}\}$ ; see Figure 4.

Since $\lim\inf|\mathcal{B}_{i}|\geq 1$ , there exists $I>0$ such that for any $i>I$ and $t\in(t_{i},\tau_{i+1}-1]$ we have

	$\displaystyle\|\|\bm{V}_{t_{i}}-\bm{V}^{*}\|\|_{\infty}$	$\displaystyle\leq\gamma\|\|\bm{V}_{\tau_{i}-1}-\bm{V}^{*}\|\|_{\infty}$		(8)
	$\displaystyle\|\|\bm{V}_{t}-\bm{V}^{*}\|\|_{\infty}$	$\displaystyle\leq\|\|\bm{V}_{t_{i}}-\bm{V}^{*}\|\|_{\infty}+\varepsilon,$		(8)

where the first inequality comes from the fact that during $[\tau_{i},t_{i}]$ the global value iteration is performed. The second inequality comes from combining $\eqref{ineq_error_1}$ and $\eqref{ineq_error_2_general}$ .

The intuition of $\eqref{ineq_recur}$ is that from $\tau_{i}-1$ to $t_{i}$ , our approximation will only get better because we are using global value iteration only. From time $t_{i}$ to $\tau_{i+1}-1$ , the approximation within the entire SA value iteration phase will stay within the previous error up to another $\varepsilon$ . By balancing those bounds we are able to get a “stable field” for all phases.

We then state the bounds for the error at $t_{i}$ for $i>0$

Lemma 2.

For all $n\geq 1$ , we have

\displaystyle||\bm{V}_{t_{n+1}}-\bm{V}^{*}||_{\infty}\leq\gamma^{n}||\bm{V}_{\tau_{1}}-\bm{V}^{*}||_{\infty}+\frac{\varepsilon}{1-\gamma}.

(9)

From Lemma 2 and $\eqref{ineq_recur}$ , we can generate bounds for all $t$ that is large.

A.2 Proof of Theorem 1

Since

\limsup_{t\to\infty}|\mathcal{A}_{t}|<\infty,\ \liminf_{t\to\infty}|\mathcal{B}_{t}|>0,

there exists $N$ such that we have that when $i>N$ , $|\mathcal{A}_{i}|$ will be bounded from above and $|\mathcal{B}_{i}|$ will be bounded from below. We denotes those two constants for the bound as $a$ and $b$ , respectively. Then, since the sequence $\{\alpha_{t}\}_{t=1}^{\infty}$ goes to zero as $t$ goes to infinity, there exists a constant $T$ such that for $t>T$ we have $a\cdot\alpha_{t}<\frac{\varepsilon}{2(1-\gamma)}$ . Thus, we have

||\bm{W}_{t_{0}+|\mathcal{A}_{i}|}-\bm{W}_{t_{0}}||_{\infty}\leq\sum_{i=0}^{a}\alpha_{t_{sa}}(||T_{s}\tilde{\bm{V}}(\bm{W}_{t_{0}+i})||_{\infty}+||\bm{W}_{t_{0}+i}||_{\infty})\leq\frac{\varepsilon}{2}.

Therefore, we have shown that Condition 1 has been satisfied.

Next, we show that $\bm{V}_{t_{i}}$ grows nearer to $\bm{V}^{*}$ as $t$ increases. To see this, by directly applying Lemma 2, we have

\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\gamma^{i-1}\|\bm{V}_{t_{1}}-\bm{V}^{*}\|_{\infty}+\frac{\varepsilon}{1-\gamma},

and, thus,

\lim\sup_{i\to\infty}\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\frac{\varepsilon}{1-\gamma}.

Then, in order to prove the final statement of the theorem, we aim to show that for any positive constant $\lambda>0$ , there exists a constant $T_{\lambda}$ so that for any $t>T_{\lambda}$ , the following inequality holds

\displaystyle\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{2\varepsilon}{1-\gamma}+\lambda.

(10)

Since $\lim\sup\limits_{i\rightarrow\infty}\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\frac{\varepsilon}{1-\gamma}$ , we can find $i_{\lambda}>0$ such that for any $i\geq i_{\lambda}$ , we have

\displaystyle\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\frac{\varepsilon}{1-\gamma}+\lambda.

(11)

Then, by defining $T_{\lambda}=T+t_{i_{\lambda}}$ , it suffices to show Eq. (10) for $t\in[T_{\lambda},+\infty)\cap\mathcal{A}_{i}$ and $t\in[T_{\lambda},+\infty)\cap\mathcal{B}_{i}$ separately. If $t\in[T_{\lambda},+\infty)\cap\mathcal{A}_{i}$ for some $i\in\mathbb{N}$ , from inequality $\eqref{ineq_recur}$ we have

\|\bm{V}_{t}-\bm{V}_{t_{i}}\|\leq\varepsilon.

Thus,

	$\displaystyle\\|\bm{V}_{t}-\bm{V}^{*}\\|_{\infty}$	$\displaystyle\leq\\|\bm{V}_{t}-\bm{V}_{t_{i}}\\|_{\infty}+\\|\bm{V}_{t_{i}}-\bm{V}^{*}\\|_{\infty}$
		$\displaystyle\leq\varepsilon+\frac{\varepsilon}{1-\gamma}+\lambda$
		$\displaystyle\leq\frac{2\varepsilon}{1-\gamma}+\lambda,$

where the first line is obtained by the triangle inequality and Eq. (11). If $t\in[T_{\lambda},+\infty)\cap\mathcal{B}_{i}$ for some $i\in\mathbb{N}$ , based on the contraction property of Value-Iteration method, we have $\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\gamma^{t-\tau_{i}}\|\bm{V}_{\tau_{i}}-\bm{V}^{*}\|_{\infty}$ . Hence, we have shown that Eq. (10) holds for all $t>T_{\lambda}$ and any $\lambda>0$ , which is equivalent to

\lim\sup_{t\rightarrow\infty}\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{2\varepsilon}{1-\gamma}.

Moreover, if we set $\lambda=\varepsilon$ and the step size satisfies $\alpha_{t}<\frac{\varepsilon}{(1-\gamma)\sup_{i}|\mathcal{A}_{i}|}$ for all $t$ , we have $T=1$ and $t_{i_{\lambda}}=O\left(\max\left\{1,\frac{\log(\varepsilon)}{\log(\gamma)}\right\}\right)$ . Then, we also can obtain Proposition 2 using a similar argument.

A.3 Proof of Proposition 3

Without loss of generality, assume $|\mathcal{A}_{i}|=a$ and $|\mathcal{B}_{i}|=b$ for all $i$ . By taking $\alpha_{t}=t^{-\beta}$ , after $\frac{(4a)^{1/\beta}}{(1-\gamma)^{1/\beta}\varepsilon^{1/\beta}}$ state aggregated iterations (which correspond to $\frac{(a+b)(4a)^{1/\beta}}{a(1-\gamma)^{1/\beta}\varepsilon^{1/\beta}}$ iterations in Algorithm 3), during any state aggregation phase $\mathcal{A}_{i}$ we have that $\sum_{t\in{\mathcal{A}_{i}}}\alpha_{t_{sa}}\leq\frac{(1-\gamma)\varepsilon}{4}$ .

Therefore, after the state aggregation phase $\mathcal{A}_{i}$ , if we let $t_{0}=\min_{t\in\mathcal{A}_{i}}$ , we have

\displaystyle||\bm{W}_{t_{0}+a}-\bm{W}_{t_{0}}||_{\infty}\leq\sum_{i=1}^{a}\alpha_{t_{sa}}(||T_{s}\tilde{\bm{V}}(\bm{W}_{t_{0}})||_{\infty}+||\bm{W}_{t_{0}-1}||_{\infty})\leq\frac{\varepsilon}{2}.

(12)

Hence Condition 1 is also satisfied. From Lemma 2, we know that if we run more than $\frac{\log(\varepsilon)}{\log(\gamma)}$ global iteration phases (corresponds to $\frac{(a+b)\log(\varepsilon)}{\log(\gamma)}$ iterations in Algorithm 3), for the following $t_{n}$ we have

\displaystyle||\bm{V}_{t_{n}}-\bm{V}^{*}||_{\infty}\leq\gamma^{n-1}\frac{1}{1-\gamma}+\frac{\varepsilon}{1-\gamma}\leq\frac{2\varepsilon}{1-\gamma}.

(13)

Then, it suffices to prove that for all $t\geq t_{n}$ , we have $||\bm{V}_{t_{n}}-\bm{V}^{*}||_{\infty}\leq\frac{3\varepsilon}{1-\gamma}$ . We first check that the bound hold for state aggregated iterations. For any $i\geq n$ and any $t\in[t_{i},\tau_{i+1}-1]$ , from $\eqref{ineq_recur}$ and $\eqref{ineq_skeleton}$ we know that

\displaystyle||\bm{V}_{t}-\bm{V}^{*}||_{\infty}

\displaystyle\leq||\bm{V}_{t_{i}}-\bm{V}^{*}||_{\infty}+\varepsilon\leq\frac{3\varepsilon}{1-\gamma}.

Lastly, for any $i>n$ and any $t\in[\tau_{i},t_{i}]$ , from contraction property we know that

\displaystyle||\bm{V}_{t}-\bm{V}^{*}||_{\infty}

\displaystyle\leq\gamma^{t-\tau_{i}+1}||\bm{V}_{\tau_{i}-1}-\bm{V}^{*}||_{\infty}\leq\gamma^{t-\tau_{i}+1}\frac{3\varepsilon}{1-\gamma}\leq\frac{3\varepsilon}{1-\gamma}.

Therefore, the $\bm{V}_{t}$ process will stay stable for $t\geq t_{n}$ , where $t_{n}=\frac{(a+b)(4a)^{1/\beta}}{a(1-\gamma)^{1/\beta}\varepsilon^{1/\beta}}+\frac{(a+b)\log(\varepsilon)}{\log(\gamma)}$ .

A.4 Proof of Lemma 1

First, we show that the optimal cost-to-go value is bounded by $\frac{1}{1-\gamma}$ . For any fixed policy $\pi$ , we have

V^{\pi}(s)=\mathbb{E}\left[\sum\limits_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})|s_{0}=s\right]\leq\frac{1}{1-\gamma},

where the last inequality is obtained by the assumption that the reward function is bounded by 1. The inequality also holds for the optimal policy, which implies $\|\bm{V}^{*}\|_{\infty}\leq\frac{1}{1-\gamma}$ .

Then, we show that the estimated cost-to-go value obtained by our algorithm is bounded by $\frac{1}{1-\gamma}$ by contradiction. Let $\tau$ be the smallest index such that

\displaystyle\|\bm{V}_{\tau}\|_{\infty}>\frac{1}{1-\gamma}

(14)

if $\tau$ indexes a global iteration, or

\displaystyle\|\bm{W}_{\tau}\|_{\infty}>\frac{1}{1-\gamma}

(15)

if it indexes an aggregated iteration. If the $\tau$ -th iteration is a global iteration, we have $\|\bm{V}_{\tau-1}\|_{\infty}\leq\frac{1}{1-\gamma}$ and so

	$\displaystyle\|V_{\tau}(s)\|$	$\displaystyle=\|\min_{a\in\mathcal{A}}R(s,a)+\gamma\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}\|$
		$\displaystyle\leq\max_{a\in\mathcal{A}}\left\{\|R(s,a)\|+\|\gamma\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}\|\right\}$
		$\displaystyle\leq 1+\gamma\max_{a\in\mathcal{A}}\\|\bm{P}_{s,a}\\|_{1}\\|\bm{V}_{\tau-1}\\|_{\infty}$
		$\displaystyle\leq 1+\frac{\gamma}{1-\gamma}$
		$\displaystyle=\frac{1}{1-\gamma},$

for all $s\in\mathcal{S}$ . Here, the first line is obtained by the definition of value iteration process; the second line is obtained by the triangle inequality; the third line holds because $r(s,a)\leq 1$ and because Holder’s inequality gives that $|\bm{P}_{a}\bm{V}_{\tau-1}|\leq\|\bm{P}_{s,a}\|_{1}\|\bm{V}_{\tau-1}\|_{\infty}$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$ ; the fourth line follows from $\|\bm{P}_{s,a}\|_{1}=1$ and the condition that $\|\bm{V}_{\tau-1}\|_{\infty}\leq\frac{1}{1-\gamma}$ ; and the last line is obtained by direct calculation. However, this inequality contradicts Eq. (14).

Alternatively, if the $\tau$ -th iteration is a aggregated iteration, we find the contradiction in two cases. On the one hand, if the $\tau$ -th iteration is the first iteration in a aggregated process or $\tau=\min\{\mathcal{A}_{i}\}$ for some $i$ , we have $\|\bm{V}_{\tau-1}\|_{\infty}\leq\frac{1}{1-\gamma}$ . Then, for any aggregated state $j$ , from Algorithm 2 we have

\displaystyle|\tilde{\bm{V}}(\bm{W}(\bm{V}_{\tau-1}))(s)|

\displaystyle\leq\|\bm{V}_{\tau-1}\|_{\infty}-\frac{\varepsilon}{2}\leq\frac{1}{1-\gamma},

which also contradicts Eq. (15). On the other hand, if $\tau$ is such that the $\tau-1$ -th iteration is also a aggregated iteration, we can use a similar method as in the global iteration case to derive a contradiction. Thus, we have that the estimated cost-to-go value obtained by our algorithm is bounded by $\frac{1}{1-\gamma}$ .

A.5 Proof of Lemma 2

The proof of lemma 2 follows an induction argument on a similar inequality: For all $n\geq 1$ , we have

\displaystyle||\bm{V}_{t_{n+1}}-\bm{V}^{*}||_{\infty}\leq\gamma^{n}||\bm{V}_{\tau_{1}}-\bm{V}^{*}||_{\infty}+\sum_{i=1}^{n-1}\gamma^{i}\varepsilon.

(16)

For the initial case, we have that

||\bm{V}_{t_{2}}-\bm{V}^{*}||_{\infty}\leq\gamma||\bm{V}_{\tau_{2}}-\bm{V}^{*}||_{\infty}\leq\gamma(||\bm{V}_{t_{1}}-\bm{V}^{*}||_{\infty}+\varepsilon)\leq\gamma^{2}||\bm{V}_{\tau_{1}}-\bm{V}^{*}||_{\infty}+\gamma\varepsilon.

For $n>1$ , by assuming $\eqref{ineq_induction}$ , we have that

	$\displaystyle\|\|\bm{V}_{t_{n+2}}-\bm{V}^{*}\|\|_{\infty}$	$\displaystyle\leq\gamma\|\|\bm{V}_{\tau_{n+2}-1}-\bm{V}^{}\|\|_{\infty}\leq\gamma(\|\|\bm{V}_{t_{n+1}}-\bm{V}^{}\|\|_{\infty}+\varepsilon)$
		$\displaystyle\leq\gamma^{n+1}\|\|\bm{V}_{\tau_{1}}-\bm{V}^{*}\|\|_{\infty}+\sum_{i=1}^{n}\gamma^{i}\varepsilon,$

where the second last inequality comes from Eq. (8). After showing Eq. (16), the lemma follows upon noticing that $\sum_{i=1}^{n}\gamma^{i}\varepsilon<\frac{\varepsilon}{1-\gamma}$ .

	$\displaystyle\\|\tilde{\bm{V}}(\bm{W}^{})-\bm{V}^{}\\|_{\infty}$	$\displaystyle\leq\frac{\\|\bm{e}\\|_{\infty}}{1-\gamma},$		(3)
	$\displaystyle\\|{\bm{V}}^{\pi^{\bm{W}^{}}}-\bm{V}^{}\\|_{\infty}$	$\displaystyle\leq\frac{2\gamma\\|\bm{e}\\|_{\infty}}{(1-\gamma)^{2}},$		(3)

	$\displaystyle\|V_{\tau}(s)\|$	$\displaystyle=\|\min_{a\in\mathcal{A}}R(s,a)+\gamma\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}\|$
		$\displaystyle\leq\max_{a\in\mathcal{A}}\left\{\|R(s,a)\|+\|\gamma\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}\|\right\}$
		$\displaystyle\leq 1+\gamma\max_{a\in\mathcal{A}}\\|\bm{P}_{s,a}\\|_{1}\\|\bm{V}_{\tau-1}\\|_{\infty}$
		$\displaystyle\leq 1+\frac{\gamma}{1-\gamma}$
		$\displaystyle=\frac{1}{1-\gamma},$

An Adaptive State Aggregation Algorithm for Markov Decision Processes

Abstract

1 Introduction

1.1 Related literature

2 Preliminaries

2.1 Markov decision process

2.2 State aggregation

3 Algorithm design

3.1 A pre-speficied aggregation algorithm

Proposition 1 (Theorem 1, tsitsiklis1996 ).

3.2 Value iteration with state aggregation

3.3 Convergence

Theorem 1.

Proposition 2.

Proposition 3.

4 Experiments

4.1 MDP problems

4.1.1 Benchmarks and parameters

4.1.2 Results

Influence of ε\varepsilon.

Efficiency.

Scalibility of state aggregation.

Robustness.

4.2 Continuous Control Problems

4.2.1 Environment

4.2.2 Results

5 Discussion

References

Appendix A Appendix

A.1 Preliminaries

Lemma 1.

Condition 1.

Lemma 2.

A.2 Proof of Theorem 1

A.3 Proof of Proposition 3

A.4 Proof of Lemma 1

A.5 Proof of Lemma 2

Influence of $\varepsilon$ .