
An Adaptive State Aggregation Algorithm for Markov Decision Processes

Guanting Chen      Johann Demetrio Gaebler     Matt Peng     Chunlin Sun      Yinyu Ye§
Institute for Computational and Mathematical Engineering, Stanford University
Department of Electrical Engineering & Computer Sciences, University of California, Berkeley
§ Department of Management Science and Engineering, Stanford University
{guanting, jgaeb, chunlin, yyye}@stanford.edu    [email protected]
Abstract

Value iteration is a well-known method of solving Markov Decision Processes (MDPs) that is simple to implement and boasts strong theoretical convergence guarantees. However, the computational cost of value iteration quickly becomes infeasible as the size of the state space increases. Various methods have been proposed to overcome this issue for value iteration in large state and action space MDPs, often at the price of generalizability and algorithmic simplicity. In this paper, we propose an intuitive algorithm for solving MDPs that reduces the cost of value iteration updates by dynamically grouping together states with similar cost-to-go values. We also prove that our algorithm converges almost surely to within $2\varepsilon/(1-\gamma)$ of the true optimal value in the $\ell^{\infty}$ norm, where $\gamma$ is the discount factor and aggregated states differ by at most $\varepsilon$. Numerical experiments on a variety of simulated environments confirm the robustness of our algorithm and its ability to solve MDPs with much cheaper updates, especially as the scale of the MDP problem increases.

1 Introduction

State aggregation is a long-standing approach to solving large-scale Markov decision processes (MDPs). The main idea of state aggregation is to define a notion of similarity between states and to work with a system of reduced size by grouping similar states into aggregate, or “mega-”, states. Although there has been a variety of results on the performance of policies obtained using state aggregation li2006 ; van2006 ; abel2016 , a common assumption is that states are aggregated according to the similarity of their optimal cost-to-go values. Such a scheme, which we term pre-specified aggregation, is generally infeasible unless the MDP is already solved. To address the difficulties inherent in pre-specified aggregation, algorithms that learn how to effectively aggregate states have been proposed ortner2013 ; duan2018 ; sinclair2019 .

This paper aims to provide new insights into adaptive online learning of the correct state aggregates. We propose a simple and efficient state aggregation algorithm for calculating the optimal value and policy of an infinite-horizon discounted MDP that can be applied to planning problems baras2000 and generative MDP problems sidford2018near . The algorithm alternates between two distinct phases. During the first phase (“global updates”), the algorithm updates the cost-to-go values as in standard value iteration, trading off some efficiency to more accurately guide the cost-to-go values in the right direction; in the second phase (“aggregate updates”), the algorithm groups together states with similar cost-to-go values based on the last sequence of global updates and then efficiently updates the states in each mega-state in tandem as it optimizes over the reduced space of aggregate states.

The algorithm is simple: the inputs needed for state aggregation are the current cost-to-go values and the parameter $\varepsilon$, which bounds the difference between the (current) cost-to-go values of states within a given mega-state. Compared to prior works on state aggregation that use information on the state-action value ($Q$-value) sinclair2019 , the transition density ortner2013 , and methods such as upper confidence intervals, our method does not require strong assumptions or extra information, and only performs updates in a manner similar to standard value iteration.

Our contribution is a feasible online algorithm for learning aggregate states and cost-to-go values that requires no extra information beyond that required for standard value iteration. We showcase in our experimental results that our method provides significantly faster convergence than standard value iteration especially for problems with larger state and action spaces. We also provide theoretical guarantees for the convergence, accuracy, and convergence rate of our algorithm.

The simplicity and robustness of our novel state aggregation algorithm demonstrates its utility and general applicability in comparison to existing approaches for solving large MDPs.

1.1 Related literature

When a pre-specified aggregation is given, tsitsiklis1996 ; van2006 give performance bounds on the aggregated system and propose variants of value iteration. There are also a variety of ways to perform state aggregation based on different criteria. ferns2012 ; dean1997 analyze partitioning the state space by grouping states whose transition models and reward functions are close. mccallum1997 proposes aggregating states that have the same optimal action and similar $Q$-values for these actions. jong2005 develops aggregation techniques in which states are aggregated if they have the same optimal action. We refer the reader to li2006 ; abel2016 ; abel2020 for a more comprehensive survey on this topic.

On the dynamic learning side, our adaptive state-aggregated value iteration algorithm is also related to the aggregation-disaggregation methods used to accelerate the convergence of value iteration (bertsekas1988 ; schweitzer1985 ; mendelssohn1982 ; bertsekas2018 ) in policy evaluation, i.e., in evaluating the value function of a fixed policy. Among these works, that of bertsekas1988 is closest to our approach. Assuming the underlying Markov processes to be ergodic, the authors propose to group states based on Bellman residuals in between runs of value iteration. They also allow states to be aggregated or disaggregated at every abstraction step. However, the value iteration step in our setting differs from the value iteration in bertsekas1988 , because value iteration for policy evaluation does not involve the Bellman optimality operator. Aside from state aggregation, a variety of other methods have been studied to accelerate value iteration. herzberg1994 proposes iterative algorithms based on a one-step look-ahead approach. shlakhter2010 combines the so-called “projective operator” with value iteration to achieve better efficiency. anderson1965 ; fang2009 ; zhang2020 analyze the Anderson mixing approach to speed up the convergence of fixed-point problems.

Dynamic learning of the aggregate states has also been studied more generally in MDP and reinforcement learning (RL) settings. hostetler2014 proposes a class of $Q$-value-based state aggregations and applies them to Monte Carlo tree search. slivkins2011 uses data-driven discretization to adaptively discretize the state and action space in a contextual bandit setting. ortner2013 develops an algorithm for learning state aggregations in an online setting by leveraging confidence intervals. sinclair2019 designs a $Q$-learning algorithm based on data-driven adaptive discretization of the state-action space. For more state abstraction techniques, see baras2000 ; dean1997 ; jiang2014 ; abel2019 .

2 Preliminaries

2.1 Markov decision process

We consider an infinite-horizon Markov Decision Process $M=(\mathcal{S},\mathcal{A},P,r,\gamma,\rho)$, consisting of: a finite state space $\mathcal{S}$; a finite action space $\mathcal{A}$; the probability transition model $P$, where $P(s'|s,a)$ denotes the probability of transitioning to state $s'$ conditioned on the state-action pair $(s,a)$; the immediate cost function $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, which gives the immediate cost (or reward) obtained from a particular state-action pair; the discount factor $\gamma\in[0,1)$; and the initial distribution over states in $\mathcal{S}$, which we denote by $\rho$.

A policy $\pi:\mathcal{S}\to\mathcal{A}$ specifies the agent's action based on the current state, either deterministically or stochastically. A policy induces a distribution over trajectories $\tau=(s_{t},a_{t},r_{t})_{t=0}^{\infty}$, where $s_{0}\sim\rho$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for $t\geq 0$.

A value function $\bm{V}:\mathcal{S}\to\mathbb{R}$ assigns a value to each state; as $|\mathcal{S}|<\infty$, $\bm{V}$ can also be represented by a finite-length vector $(V(1),\ldots,V(|\mathcal{S}|))^{\top}\in\mathbb{R}^{|\mathcal{S}|}$. (In this paper, we view all vectors as functions mapping indices to the corresponding entries.) Moreover, each policy $\pi$ is associated with a value function $\bm{V}^{\pi}:\mathcal{S}\to\mathbb{R}$, defined as the discounted sum of future rewards starting at $s$ and following policy $\pi$:

\bm{V}^{\pi}(s):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\Big|\,\pi,s_{0}=s\right].

As noted above, we represent both the value function corresponding to the policy $\pi$ and the value vector $(V^{\pi}(1),\ldots,V^{\pi}(|\mathcal{S}|))^{\top}$ as $\bm{V}^{\pi}$.

For each state $s\in\mathcal{S}$ and each action $a\in\mathcal{A}$ available in state $s$, let $\bm{P}_{s,a}\in\mathbb{R}^{|\mathcal{S}|}$ denote the vector of transition probabilities resulting from taking action $a$ in state $s$. We call a policy $\pi$ greedy with respect to a given value $\bm{V}\in\mathbb{R}^{|\mathcal{S}|}$ if

\pi(s)\in\operatorname*{arg\,min}_{a\in\mathcal{A}}\left(r(s,a)+\gamma\cdot\bm{P}_{s,a}^{\top}\bm{V}\right).

We define $\bm{T}:\mathbb{R}^{|\mathcal{S}|}\to\mathbb{R}^{|\mathcal{S}|}$ to be the dynamic programming operator such that $(\bm{T}\bm{V})(s)=T_{s}(\bm{V})$, where

T_{s}(\bm{V})=\min_{a\in\mathcal{A}}\left(r(s,a)+\gamma\cdot\bm{P}_{s,a}^{\top}\bm{V}\right).

The optimal value function $\bm{V}^{*}$ is the unique solution of the equation $\bm{V}^{*}=\bm{T}\bm{V}^{*}$. A common approach to finding $\bm{V}^{*}$ is value iteration, which, given an initial guess $\bm{V}_{0}$, generates a sequence of value functions $\{\bm{V}_{t}\}_{t=0}^{\infty}$ such that $\bm{V}_{t+1}=\bm{T}\bm{V}_{t}$. The sequence $\{\bm{V}_{t}\}_{t=0}^{\infty}$ converges to $\bm{V}^{*}$ as $t$ goes to $+\infty$ Bel .
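For concreteness, the following is a minimal NumPy sketch of the Bellman operator $\bm{T}$ and of standard value iteration in the cost-minimization form above; the dense tensor encoding of $P$ and $r$, the stopping tolerance, and the function names are illustrative choices of ours, not specifications from the paper.

```python
import numpy as np

def bellman_operator(V, P, r, gamma):
    """Apply T to V: (T V)(s) = min_a [ r(s, a) + gamma * P[s, a] @ V ]."""
    # P has shape (S, A, S); r has shape (S, A); V has shape (S,).
    Q = r + gamma * P @ V          # shape (S, A): one-step cost plus discounted cost-to-go
    return Q.min(axis=1)

def value_iteration(P, r, gamma, tol=1e-8, max_iters=10_000):
    """Iterate V <- T V until successive iterates are within tol in the sup norm."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        V_next = bellman_operator(V, P, r, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V
```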

2.2 State aggregation

The state space of MDPs can be very large. State aggregation divides the state space $\mathcal{S}$ into $K$ subsets and views each collection of states as a mega-state. Then, the value function generated by the mega-states can be used to approximate the optimal value $\bm{V}^{*}$.

To represent a state aggregation, we define the matrix $\bm{\Phi}\in\mathbb{R}^{|\mathcal{S}|\times K}$. We set $\phi_{i,j}=1$ if state $i$ is in the $j$-th mega-state, and $\phi_{i,j}=0$ otherwise; i.e., column $j$ of $\bm{\Phi}$ indicates whether each state belongs to mega-state $j$. The state-reduction matrix $\bm{\Phi}$ also induces a partition $\{S_{i}\}_{i=1}^{K}$ of $\mathcal{S}$, i.e., $\mathcal{S}=\bigcup_{i=1}^{K}S_{i}$ and $S_{i}\cap S_{j}=\emptyset$ for $i\neq j$. Denote by $\bm{W}\in\mathbb{R}^{K}$ the cost-to-go value function for the aggregated states, and note that the current value of $\bm{W}$ induces a value function $\tilde{\bm{V}}(\bm{W})\in\mathbb{R}^{|\mathcal{S}|}$ on the original state space, where

\tilde{\bm{V}}(s,\bm{W})=\bm{W}(j),\quad\text{for }s\in S_{j}.
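With these definitions, the lifted value function is simply $\tilde{\bm{V}}(\bm{W})=\bm{\Phi}\bm{W}$. A small sketch of this bookkeeping (the helper names are ours; a partition is represented as a list of index lists) might look as follows:

```python
import numpy as np

def make_phi(partition, num_states):
    """Build the |S| x K aggregation matrix Phi from a partition (list of index lists)."""
    Phi = np.zeros((num_states, len(partition)))
    for j, mega_state in enumerate(partition):
        Phi[mega_state, j] = 1.0
    return Phi

def lift(W, Phi):
    """Lift aggregate values W in R^K to a value function on S: V_tilde = Phi @ W."""
    return Phi @ W

# Example: 5 states split into K = 2 mega-states.
partition = [[0, 1, 2], [3, 4]]
Phi = make_phi(partition, num_states=5)
W = np.array([1.5, 4.0])
print(lift(W, Phi))   # -> [1.5, 1.5, 1.5, 4.0, 4.0]
```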

3 Algorithm design

In this section, we first introduce a state aggregation algorithm that assumes knowledge of the optimal value function. The algorithm was proposed in tsitsiklis1996 , which also provides the corresponding convergence result. Building on this existing theory, we design our adaptive algorithm and discuss its convergence properties.

3.1 A pre-specified aggregation algorithm

Given a pre-specified aggregation, we seek a value function $\bm{W}$ for the aggregated states such that

\|\tilde{\bm{V}}(\bm{W})-\bm{V}^{*}\|_{\infty}=O(\|\bm{e}\|_{\infty}), (1)

where $\bm{e}=(e_{1},\ldots,e_{K})^{\top}$ and $e_{j}=\max_{s_{1},s_{2}\in S_{j}}|V^{*}(s_{1})-V^{*}(s_{2})|$ for $j=1,\ldots,K$. Intuitively, Eq. (1) justifies our approach of aggregating states that have similar optimal cost-to-go values. We then state an algorithm that converges to cost-to-go values satisfying Eq. (1).

Algorithm 1 Random Value Iteration with Aggregation
1:Input: $\bm{P}$, $r$, $\gamma$, $\bm{\Phi}$, $\{\alpha_{t}\}_{t=1}^{\infty}$
2:Initialize $\bm{W}_{0}=\bm{0}$
3:for $t=1,\ldots,n$ do
4:     for $j=1,\ldots,K$ do
5:         Sample state $s$ uniformly from set $S_{j}$
6:         $W_{t+1}(j)=(1-\alpha_{t})W_{t}(j)+\alpha_{t}T_{s}\tilde{\bm{V}}(\bm{W}_{t})$
7:     end for
8:end for
9:Output: $\tilde{\bm{V}}(\bm{W}_{n})$

Algorithm 1 takes a form similar to stochastic approximation robbins1951 ; wasan2004 , and converges almost surely to a unique cost-to-go value. Here $\alpha_{t}$ is the step size of the learning algorithm; by taking, e.g., $\alpha_{t}=1$, we recover the value iteration update. The following convergence result is proved in tsitsiklis1996 .
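A hedged sketch of Algorithm 1 is given below, reusing the dense $P$, $r$ encoding from the earlier snippet; the sampling and lifting details are one straightforward reading of Lines 5-6, not the authors' reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def T_s(s, V, P, r, gamma):
    """Single-state Bellman update: T_s(V) = min_a [ r(s, a) + gamma * P[s, a] @ V ]."""
    return np.min(r[s] + gamma * P[s] @ V)

def aggregated_value_iteration(P, r, gamma, partition, alphas):
    """Sketch of Algorithm 1: stochastic updates of the aggregate cost-to-go values W."""
    num_states = P.shape[0]
    W = np.zeros(len(partition))
    for alpha in alphas:
        # Lift the current aggregate values to the original state space: V_tilde = Phi W_t.
        V_tilde = np.empty(num_states)
        for j, mega_state in enumerate(partition):
            V_tilde[mega_state] = W[j]
        for j, mega_state in enumerate(partition):
            s = rng.choice(mega_state)          # Line 5: sample a representative of S_j
            W[j] = (1 - alpha) * W[j] + alpha * T_s(s, V_tilde, P, r, gamma)   # Line 6
    # Output the lifted value function corresponding to the final W.
    V_out = np.empty(num_states)
    for j, mega_state in enumerate(partition):
        V_out[mega_state] = W[j]
    return V_out
```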

Proposition 1 (Theorem 1, tsitsiklis1996 ).

When $\sum_{t=1}^{\infty}\alpha_{t}=\infty$ and $\sum_{t=1}^{\infty}\alpha_{t}^{2}<\infty$, the sequence $\{\bm{W}_{t}\}_{t=1}^{\infty}$ defined by Line 6 of Algorithm 1 converges almost surely, entry-wise, to $\bm{W}^{*}$, where $\bm{W}^{*}$ is the solution of

W^{*}(j)=\frac{1}{|S_{j}|}\sum_{s\in S_{j}}T_{s}\tilde{\bm{V}}(\bm{W}^{*}). (2)

Define $\pi^{\bm{W}^{*}}$ to be the greedy policy with respect to $\tilde{\bm{V}}(\bm{W}^{*})$. Then we have, moreover, that

\|\tilde{\bm{V}}(\bm{W}^{*})-\bm{V}^{*}\|_{\infty}\leq\frac{\|\bm{e}\|_{\infty}}{1-\gamma}, (3)
\|\bm{V}^{\pi^{\bm{W}^{*}}}-\bm{V}^{*}\|_{\infty}\leq\frac{2\gamma\|\bm{e}\|_{\infty}}{(1-\gamma)^{2}},

where $\bm{V}^{\pi^{\bm{W}^{*}}}$ is the value function associated with the policy $\pi^{\bm{W}^{*}}$.

Proposition 1 states that if we can partition the state space so that the maximum difference of the optimal value function within each mega-state is small, then the value function produced by Algorithm 1 approximates the optimal value up to $\frac{\|\bm{e}\|_{\infty}}{1-\gamma}$, and the policy associated with the approximate value function is also close to the optimal policy.

3.2 Value iteration with state aggregation

In order to guarantee an efficient approximation, Proposition 1 requires a pre-specified aggregation scheme such that $\max_{s_{1},s_{2}\in S_{i}}|V^{*}(s_{1})-V^{*}(s_{2})|$ is small for every $i$, so as to ensure the appropriate level of convergence for Algorithm 1. Without knowing $\bm{V}^{*}$, is it still possible to control the approximation error? In this section we answer in the affirmative by introducing an adaptive state aggregation scheme that learns the correct state aggregates online as it learns the true cost-to-go values.

Given the current cost-to-go value vector $V\in\mathbb{R}^{|\mathcal{S}|}$, let $b_{1}=\min_{s\in\mathcal{S}}V(s)$ and $b_{2}=\max_{s\in\mathcal{S}}V(s)$. We group the cost-to-go values into $\lceil(b_{2}-b_{1})/\varepsilon\rceil$ disjoint subintervals of length $\varepsilon$. Specifically, let $\Delta=(b_{2}-b_{1})/\varepsilon$, and let $S_{j}$ be the $j$-th mega-state, which contains all states whose current estimated cost-to-go value falls in the interval $[b_{1}+(j-1)\varepsilon,b_{1}+j\varepsilon)$. Grouping the states in this way reduces the problem from $|\mathcal{S}|$ states to at most $\lceil(b_{2}-b_{1})/\varepsilon\rceil$ mega-states. See Algorithm 2 for further details.

Algorithm 2 Value-based Aggregation
1:Input: $\varepsilon$, $\bm{V}=(V(1),\ldots,V(|\mathcal{S}|))^{\top}$
2:$b_{1}=\min_{s\in\mathcal{S}}V(s)$, $b_{2}=\max_{s\in\mathcal{S}}V(s)$, $\Delta=(b_{2}-b_{1})/\varepsilon$
3:for $i=1,\ldots,\lceil\Delta\rceil$ do
4:     $\hat{S}_{i}=\{s\mid V(s)\in[b_{1}+(i-1)\varepsilon,b_{1}+i\varepsilon)\}$, $\hat{W}(i)=b_{1}+(i-\frac{1}{2})\varepsilon$
5:end for
6:Delete the empty sets in $\{\hat{S}_{i}\}_{i=1}^{\lceil\Delta\rceil}$ while keeping the original order, and define the resulting partition to be $\{S_{i}\}_{i=1}^{K}$, where $K$ is the cardinality of the modified set of mega-states. Modify $\hat{W}$ in the same way to generate $W\in\mathbb{R}^{K}$.
7:Return $\{S_{i}\}_{i=1}^{K}$ and $W$.
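A possible NumPy rendering of Algorithm 2 is given below; the function name is ours, and, as a small practical tweak, the state attaining the maximum value is folded into the last bin rather than excluded by the half-open intervals.

```python
import numpy as np

def value_based_aggregation(V, eps):
    """Sketch of Algorithm 2: bucket states by their current cost-to-go values."""
    b1, b2 = V.min(), V.max()
    num_bins = max(1, int(np.ceil((b2 - b1) / eps)))
    partition, W = [], []
    for i in range(num_bins):
        lo = b1 + i * eps
        # Half-open bins [lo, lo + eps); the maximum value goes into the last bin.
        if i < num_bins - 1:
            members = np.flatnonzero((V >= lo) & (V < lo + eps))
        else:
            members = np.flatnonzero(V >= lo)
        if members.size > 0:                     # step 6: drop empty mega-states
            partition.append(members)
            W.append(lo + eps / 2)               # step 4: midpoint as the aggregate value
    return partition, np.array(W)
```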

Without knowledge of $\bm{V}^{*}$ in advance, one must periodically perform value iteration on $\mathcal{S}$ in order to learn the correct aggregation and adapt the aggregation scheme. As a result, our algorithm alternates between two phases: in the global update phase, the algorithm performs value iteration on $\mathcal{S}$; in the aggregated update phase, the algorithm groups together states with similar cost-to-go values based on the result of the last global update and then performs aggregated updates as in Algorithm 1.

We denote by $\{\mathcal{A}_{i}\}_{i=1}^{\infty}$ the intervals of iterations in which the algorithm performs state-aggregated updates, and by $\{\mathcal{B}_{i}\}_{i=1}^{\infty}$ the intervals of iterations in which the algorithm performs global updates. As a consequence, $b<a$ for any $a\in\mathcal{A}_{i}$ and $b\in\mathcal{B}_{i}$; likewise, $a<b$ for any $a\in\mathcal{A}_{i}$ and $b\in\mathcal{B}_{i+1}$.

We now present our adaptive algorithm. For a pre-specified number of iterations $n$, the time horizon $[1,n)$ is divided into intervals of the form $\mathcal{B}_{1},\mathcal{A}_{1},\mathcal{B}_{2},\mathcal{A}_{2},\ldots$. Every time the algorithm exits an interval of global updates $\mathcal{B}_{i}$, it runs Algorithm 2 on the current cost-to-go values with parameter $\varepsilon$, and uses the output of Algorithm 2 as the state aggregation and aggregate cost-to-go values for $\mathcal{A}_{i}$. Similarly, every time the algorithm exits an interval of state-aggregated updates $\mathcal{A}_{i}$, it sets $\tilde{\bm{V}}(\bm{W})$, where $\bm{W}$ is the current cost-to-go value for the aggregated space, as the initial cost-to-go value for the subsequent interval of global iterations.

Algorithm 3 Value Iteration with Adaptive Aggregation
1:Input: $P$, $r$, $\varepsilon$, $\gamma$, $\{\alpha_{t}\}_{t=1}^{\infty}$, $\{\mathcal{A}_{i}\}_{i=1}^{\infty}$, $\{\mathcal{B}_{i}\}_{i=1}^{\infty}$
2:Initialize $W_{0}=\bm{0}$, $V_{0}=\bm{0}$, $t_{sa}=1$
3:for $t=1,\ldots,n$ do
4:     if $t\in\mathcal{B}_{i}$ then
5:         if $t=\min\{\mathcal{B}_{i}\}$ then $\bm{V}_{t-1}=\tilde{\bm{V}}(\bm{W}_{t-1})$.
6:         end if
7:         for $j=1,\ldots,|\mathcal{S}|$ do $V_{t}(j)=T_{j}\bm{V}_{t-1}$.
8:         end for
9:     else
10:         Find the current $i$ such that $t\in\mathcal{A}_{i}$
11:         if $t=\min\{\mathcal{A}_{i}\}$ then
12:              Define $\{S_{i}\}_{i=1}^{K}$ and $\bm{W}_{t}$ to be the output of Algorithm 2 with input $\varepsilon$, $\bm{V}_{t-1}$.
13:         end if
14:         for $j=1,\ldots,K$ do
15:              Sample state $s$ uniformly from set $S_{j}$.
16:              $W_{t}(j)=(1-\alpha_{t_{sa}})W_{t-1}(j)+\alpha_{t_{sa}}T_{s}\tilde{\bm{V}}(\bm{W}_{t-1})$ (4)
17:         end for
18:         $t_{sa}=t_{sa}+1$
19:     end if
20:end for
21:if $n\in\mathcal{B}_{i}$ then return $\bm{V}_{n}$.
22:end if
23:return $\tilde{\bm{V}}(\bm{W}_{n})$.
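Putting the pieces together, a minimal sketch of Algorithm 3 with fixed-length phases is shown below; it reuses bellman_operator, T_s, and value_based_aggregation from the earlier sketches, and the fixed phase lengths and in-place bookkeeping are simplifications of ours rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_vi(P, r, gamma, eps, alphas, n_iters, global_len=2, agg_len=5):
    """Sketch of Algorithm 3 with |B_i| = global_len and |A_i| = agg_len for every i."""
    V = np.zeros(P.shape[0])                     # current cost-to-go estimate V_t
    partition, W = None, None
    t_sa = 0                                     # index of the state-aggregated step size
    phase_len = global_len + agg_len
    for t in range(n_iters):
        if t % phase_len < global_len:
            # Global update phase B_i: one sweep of standard value iteration.
            V = bellman_operator(V, P, r, gamma)
        else:
            if t % phase_len == global_len:
                # Entering A_i: re-aggregate states based on the latest global values.
                partition, W = value_based_aggregation(V, eps)
            # One round of Algorithm 1 updates on the aggregate values W.
            V_tilde = V.copy()
            for j, mega_state in enumerate(partition):
                V_tilde[mega_state] = W[j]
            for j, mega_state in enumerate(partition):
                s = rng.choice(mega_state)
                W[j] = (1 - alphas[t_sa]) * W[j] + alphas[t_sa] * T_s(s, V_tilde, P, r, gamma)
            t_sa += 1
            # Keep V in sync with the lifted aggregate values, V_t = V_tilde(W_t).
            for j, mega_state in enumerate(partition):
                V[mega_state] = W[j]
    return V
```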

3.3 Convergence

From Proposition 1, even with perfect information, if we fix the state-aggregation parameter $\varepsilon$, state-aggregated value iteration will generate an approximation of the cost-to-go values with $\ell^{\infty}$ error bounded by $\varepsilon/(1-\gamma)$. This bound is sharp, as shown in tsitsiklis1996 . Such error is negligible in the early phase of the algorithm, but it would accumulate in the later phases and prevent the algorithm from converging to the optimal value. As a result, it is not desirable to let $\limsup_{i\to\infty}|\mathcal{A}_{i}|=\infty$.¹

¹By setting $\varepsilon$ adaptively, one might achieve a better complexity and error bound while letting $\limsup_{i\to\infty}|\mathcal{A}_{i}|=\infty$; however, adaptively choosing $\varepsilon$ lies beyond the scope of this work.

We state asymptotic convergence results for Algorithm 3; proofs can be found in the supplementary materials. For the remainder of the paper, by a slight abuse of notation, we denote by $\bm{V}_{t}$ the current cost-to-go value at iteration $t$. More specifically, if the algorithm is currently in phase $\mathcal{B}_{i}$, then $\bm{V}_{t}$ is the updated cost-to-go value from global value iteration; if the algorithm is in phase $\mathcal{A}_{i}$, then $\bm{V}_{t}$ represents $\tilde{\bm{V}}(\bm{W}_{t})$.

Theorem 1.

If $\limsup_{t\to\infty}\alpha_{t}=0$, $\limsup_{i\to\infty}|\mathcal{A}_{i}|<\infty$, and $\liminf_{i\to\infty}|\mathcal{B}_{i}|>0$, we have

\limsup_{t\rightarrow\infty}\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{2\varepsilon}{1-\gamma}.

Notice that the result of Theorem 1 is consistent with Proposition 1: due to state aggregation, we suffer the same $\frac{1}{1-\gamma}$ factor in the error bound. However, our algorithm maintains an error bound of the same order without knowing the optimal value function $\bm{V}^{*}$.

We also identify the existence of a “stable field” for our algorithm, and we prove that under specific choices of the learning rate $\alpha_{t}$, the value function stays within the stable field with probability one.

Proposition 2.

If $\alpha_{t}<\frac{\varepsilon}{(1-\gamma)\sup_{i}|\mathcal{A}_{i}|}$ for all $t$, we have that after $O\left(\max\left\{1,\frac{\log(\varepsilon)}{\log(\gamma)}\right\}\right)$ iterations, with probability one the estimated approximation $\bm{V}_{t}$ satisfies

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{3\varepsilon}{1-\gamma}

for any choice of initialization $\bm{V}_{0}\in\left[0,\frac{1}{1-\gamma}\right]^{|\mathcal{S}|}$.

Proposition 3.

For $\beta>0$, if $\alpha_{t}\leq t^{-\beta}$, then after $O\left(\frac{\log(\varepsilon)}{\log(\gamma)}+(1-\gamma)^{-\frac{1}{\beta}}\varepsilon^{-\frac{1}{\beta}}\right)$ iterations, with probability one the estimated approximation $\bm{V}_{t}$ satisfies

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{3\varepsilon}{1-\gamma}

for any choice of initialization $\bm{V}_{0}\in\left[0,\frac{1}{1-\gamma}\right]^{|\mathcal{S}|}$.

These results provide guidance for the choice of parameters. Indeed, the experimental results are consistent with the theory presented above.

Figure 1: (a) Standard maze; (b) terrain maze; (c) 3-dimensional standard maze. Left: In the standard maze, the player's objective is to navigate to the bottom left. Middle: In the terrain maze, the player proceeds to the bottom left corner. Greater costs are incurred for moving uphill than downhill. High positions are indicated by red colors, and low positions are indicated by blue colors. Right: An example multi-dimensional problem. Here too the player's objective is to navigate to the lower-left corner (i.e., $(1,1,1)$).

4 Experiments

To test the theory developed in Section 3, we perform a number of numerical experiments.² We test our methods on a variety of MDPs of different sizes and complexity. Our results show that state aggregation can achieve faster convergence than standard value iteration; that state aggregation scales appropriately as the size and dimensionality of the underlying MDP increase; and that state aggregation is reasonably robust to measurement error (simulated by adding noise to the action costs) and to varying levels of stochasticity in the transition matrix.

²All experiments were performed in parallel using forty Xeon E5-2698 v3 @ 2.30GHz CPUs. Total compute time was approximately 60 hours. Code and replication materials are available in the Supplementary Materials.

Figure 2: (a) Influence of $\varepsilon$; (b) efficiency test on the standard maze; (c) efficiency test on the terrain maze. Left: The error of state-aggregated value iteration after convergence as a function of $\varepsilon$. Middle: Average convergence speed and 95% confidence intervals of state-aggregated value iteration on $500\times 500$ standard mazes. Right: Average convergence speed and 95% confidence intervals of state-aggregated value iteration on $500\times 500$ terrain mazes.

4.1 MDP problems

We consider two problems in testing our algorithm. The first, which we term the “standard maze problem,” consists of a $d_{1}\times d_{2}\times\cdots\times d_{n}$ grid of positions. Each position is connected to one or more adjacent positions. Moving from position to position incurs a constant cost, except for moving to the terminal state of the maze (the position $(1,\ldots,1)$), which yields a constant reward.³ There is a unique path from each position to the terminal state. A two-dimensional $20\times 20$ standard maze, in which the player can move, depending on their position, up, down, left, or right, is illustrated in Figure 1(a).

³We rescale action costs in both the standard and terrain mazes to ensure that the maximum cost-to-go is exactly 100.

The second problem is the “terrain maze problem.” As in a standard maze, each state in the terrain maze represents a position in a $d_{1}\times\cdots\times d_{n}$ grid. As before, we imagine that the player can move from state to state only by travelling a single unit in any direction. (The player is only constrained from moving out of the grid entirely.) The player receives a reward for reaching the final square of the maze, which again we place at position $(1,\ldots,1)$. However, in contrast to the standard maze game, the player's movements incur different costs at different positions. In particular, the maze is determined by a “height function” $H:\{1,\ldots,d_{1}\}\times\cdots\times\{1,\ldots,d_{n}\}\to\mathbb{R}$. The cost of movement is set to be the difference in heights between the player's destination position and their current position, normalized appropriately.

In both problems, we also allow for stochasticity, controlled by a parameter $p$ which gives the probability that the player moves in their intended direction. For $p=1$, the MDP is deterministic; otherwise, with probability $1-p$, the player moves in a different direction chosen uniformly at random.
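As an illustration of how the slip probability $p$ enters the transition model, the sketch below builds a dense transition tensor for an open $d_{1}\times d_{2}$ grid; it clamps moves at the boundary rather than encoding maze walls, so it is a simplified stand-in for the actual maze generator, which is not described in code in the paper.

```python
import numpy as np

def grid_transitions(d1, d2, p):
    """Transition tensor for a d1 x d2 open grid: the agent moves in the intended
    direction with probability p, and in a uniformly random other direction otherwise."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    S, A = d1 * d2, len(moves)
    P = np.zeros((S, A, S))

    def step(i, j, move):
        di, dj = move
        ni = min(max(i + di, 0), d1 - 1)         # clamp at the boundary
        nj = min(max(j + dj, 0), d2 - 1)
        return ni * d2 + nj                      # flatten the grid coordinate

    for i in range(d1):
        for j in range(d2):
            s = i * d2 + j
            for a, _ in enumerate(moves):
                for b, other in enumerate(moves):
                    prob = p if b == a else (1 - p) / (A - 1)
                    P[s, a, step(i, j, other)] += prob
    return P
```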

In both problems, the actions available at any state correspond to very sparse transition probability vectors, since players are constrained to move along cardinal directions at a rate of a single unit. However, in the standard maze game, the cost-to-go at any position is extremely sensitive to the costs-to-go at states that are very distant. In a $10\times 10$ standard maze, the initial tile (i.e., position $(1,1)$) is often between 25 and 30 units away from the destination tile (i.e., position $(10,10)$). In contrast, in the terrain maze game, the cost-to-go is much less sensitive to faraway positions, because local immediate costs, dictated by the slopes one must climb or descend to move locally, are more significant.

4.1.1 Benchmarks and parameters

We measure convergence by the $\ell^{\infty}$-distance (hereafter “error”) between the current cost-to-go vector and the true cost-to-go vector, and we evaluate speed based on the size of the error and the number of updates performed. Notice that for global value iteration, an update for state $s$ has the form

T_{s}(\bm{V})=\min_{a\in\mathcal{A}}\left(r(s,a)+\gamma\bm{P}_{s,a}^{\top}\bm{V}\right),

and an update for mega-state $j$ (represented by $s$) has the form

W_{t}(j)=(1-\alpha_{t-t_{0}})W_{t-1}(j)+\alpha_{t-t_{0}}T_{s}\tilde{\bm{V}}(\bm{W}_{t-1}).

Because the transition matrix is not dense in our examples, the computational resources required for a global value iteration update and for an aggregated update are roughly the same. In each iteration, value iteration always performs $|\mathcal{S}|$ updates, whereas Algorithm 3, when in the aggregation phase, performs only $K$ updates, one for each mega-state.

All experiments are performed with a discount factor $\gamma=0.95$. We set $|\mathcal{A}_{i}|=5$ and $|\mathcal{B}_{i}|=2$ for every $i$, and for the learning rate we set $\alpha_{t}=\frac{1}{\sqrt{t}}$. The cost function is normalized such that $\|\bm{V}^{*}\|_{\infty}=100$, and we choose the aggregation constant to be $\varepsilon=0.5$ (unless otherwise indicated). We set the initial cost vector $\bm{V}_{0}$ to be the zero vector $\bm{0}$.
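For concreteness, these settings map onto the adaptive_vi sketch above roughly as follows; the driver is illustrative only, and $P$ and $r$ would come from the maze construction.

```python
import numpy as np

# Hypothetical driver matching the reported settings: gamma = 0.95, |A_i| = 5, |B_i| = 2,
# alpha_t = 1/sqrt(t), epsilon = 0.5, and V_0 = 0 (set inside adaptive_vi).
gamma, eps = 0.95, 0.5
n_iters = 1_000
alphas = 1.0 / np.sqrt(np.arange(1, n_iters + 1))
# V_hat = adaptive_vi(P, r, gamma, eps, alphas, n_iters, global_len=2, agg_len=5)
```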

4.1.2 Results

Influence of $\varepsilon$.

We test the effect of $\varepsilon$ on the error produced by Algorithm 3. We run experiments with aggregation constants $\varepsilon=0.05$, $0.1$, and $0.5$ on $500\times 500$ standard and terrain mazes. For each $\varepsilon$, we run 1,000 iterations of Algorithm 3, and each experiment is repeated 20 times; the results, shown in Figure 2(a), indicate that the approximation error scales in proportion to $\varepsilon$, which is consistent with Proposition 1 and Theorem 1.

Efficiency.

We test the convergence rate of Algorithm 3 against value iteration on $500\times 500$ standard and terrain mazes, repeating each experiment 20 times. From Figures 2(b) and 2(c), we see that state-aggregated value iteration is very efficient in the early phase, converging in fewer updates than value iteration.

Type   Dims.         Error   95% CI
Trn.   100 × 100     4.41    ±0.14
Trn.   200 × 200     4.34    ±0.16
Trn.   300 × 300     4.65    ±0.18
Trn.   500 × 500     4.27    ±0.17
Trn.   1000 × 1000   4.27    ±0.16
Std.   100 × 100     1.43    ±0.16
Std.   200 × 200     1.39    ±0.15
Std.   300 × 300     1.42    ±0.20
Std.   500 × 500     1.11    ±0.16
Std.   1000 × 1000   1.40    ±0.16

Type   Dims.   Error   95% CI
Trn.   10^3    1.91    ±0.007
Trn.   10^4    3.02    ±0.009
Trn.   10^5    3.59    ±0.008
Trn.   10^6    3.85    ±0.005
Std.   10^3    1.36    ±0.019
Std.   10^4    1.36    ±0.017
Std.   10^5    1.23    ±0.013
Std.   10^6    1.31    ±0.013

Table 1: Scaling properties of state-aggregated value iteration. Reported errors represent the mean value from running each experiment 20 times.
Type       p      σ      Error   95% CI
Terrain    0.92   0.00   4.44    ±0.24
Terrain    0.92   0.01   4.41    ±0.18
Terrain    0.92   0.05   4.97    ±0.17
Terrain    0.92   0.10   6.36    ±0.19
Terrain    0.95   0.00   4.43    ±0.17
Terrain    0.95   0.01   4.32    ±0.14
Terrain    0.95   0.05   4.93    ±0.17
Terrain    0.95   0.10   6.39    ±0.17
Terrain    0.98   0.00   4.37    ±0.19
Terrain    0.98   0.01   4.31    ±0.14
Terrain    0.98   0.05   5.01    ±0.14
Terrain    0.98   0.10   6.52    ±0.20
Standard   0.92   0.00   1.39    ±0.19
Standard   0.92   0.01   1.61    ±0.23
Standard   0.92   0.05   2.62    ±0.15
Standard   0.92   0.10   5.56    ±0.46
Standard   0.95   0.00   1.49    ±0.17
Standard   0.95   0.01   1.57    ±0.14
Standard   0.95   0.05   2.86    ±0.17
Standard   0.95   0.10   5.72    ±0.39
Standard   0.98   0.00   1.43    ±0.18
Standard   0.98   0.01   1.88    ±0.16
Standard   0.98   0.05   3.48    ±0.28
Standard   0.98   0.10   5.86    ±0.26

Table 2: Numerical experiments illustrating the robustness of state-aggregated value iteration to stochasticity and noisy action costs. Errors represent the average $\ell^{\infty}$-distance to the true cost-to-go values in twenty independent runs.
Scalability of state aggregation.

We run state-aggregated value iteration on standard and terrain mazes of size $100\times 100$, $200\times 200$, $300\times 300$, $500\times 500$, and $1000\times 1000$ for 1,000 iterations. We repeat each experiment 20 times, displaying the results in Table 1. Next, we run state-aggregated value iteration on terrain mazes of increasingly large underlying dimension, as shown in the right side of Table 1, likewise with 20 repetitions for each size. The difference from the previous experiment is that not only does the state space increase, but the action space also increases exponentially. Our results show that the added complexity of the high-dimensional problems does not appear to substantially affect the convergence of state-aggregated value iteration, and that our method is able to scale to very large MDP problems.

Robustness.

We examine the robustness of state-aggregated value iteration to two sources of noise. We generate $500\times 500$ standard and terrain mazes, varying the level of stochasticity by setting $p=0.92,0.95,0.98$. We also vary the amount of noise in the cost vector by adding a normal vector with mean 0 and standard deviation $\sigma=0.0,0.01,0.05,0.1$ to the action costs. The results, shown in Table 2, indicate that state-aggregated value iteration is reasonably robust to stochasticity and measurement error.

4.2 Continuous Control Problems

We conclude the experiments section by showing the performance of our method on a real-world use case in continuous control. These problems often involve solving complex tasks with high-dimensional sensory input. The idea typically involves teaching an autonomous agent, usually a robot, to successfully complete some task or achieve some goal state. These problems are often very challenging because they reside in the continuous state space (and, many times, action space) domain.

We have already showcased the significant reduction in algorithm update costs during learning on grid-world problems in comparison with value iteration. Our goal in this section is to showcase a real-world example of how our method may be practically applied in the field of continuous control. This is important not only to emphasize that our idea has a practical use case, but also to further showcase its ability to scale to extremely large (and continuous) dimensional problems.

4.2.1 Environment

We use the common baseline control problem given by the “CartPole” system. The CartPole system is a typical physics control problem in which a pole is attached to a central joint of a cart that moves along an endless frictionless track. The system is controlled by applying a force to move the cart either right or left, with the goal of balancing the pole, which starts upright. The system terminates if (1) the pole angle is more than $12$ degrees from the vertical axis, (2) the cart position is more than $2.4$ units from the center, or (3) the episode length is greater than $200$ time steps.

During each episode, the agent receives a constant reward for each step it takes. The state space is the continuous position and angle of the pole for the cart-pole pendulum system, and the action space consists of two actions: applying a force to move the cart left or right. A more in-depth explanation of the problem can be found in DBLP:journals/corr/abs-2006-04938 .

4.2.2 Results

Since the CartPole problem is a multi-dimensional continuous control problem, there are no ground-truth $v$-values, so we exploit the dense-reward nature of this problem and rely on the accumulated reward, rather than the error, to quantify algorithm performance. In addition, since value iteration requires discrete states, we discretize all continuous dimensions of the state space into bins and generate policies on the discretized environment. More specifically, we discretize the continuous state space into bins by dividing each dimension of the domain into equidistant intervals.
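A sketch of the kind of equidistant binning described here is shown below; the state bounds, bin counts, and helper names are illustrative values of ours rather than the exact configuration used in the experiments.

```python
import numpy as np

def make_grid(lows, highs, bins_per_dim):
    """Equidistant bin edges for each continuous state dimension (interior edges only)."""
    return [np.linspace(lo, hi, num=b + 1)[1:-1]
            for lo, hi, b in zip(lows, highs, bins_per_dim)]

def discretize(state, grid):
    """Map a continuous state to a single integer index of its hyper-rectangular bin."""
    idx = 0
    for x, edges in zip(state, grid):
        idx = idx * (len(edges) + 1) + int(np.digitize(x, edges))
    return idx

# Illustrative bounds for a cart-pole-like state (position, velocity, angle, angular velocity).
grid = make_grid(lows=[-2.4, -3.0, -0.21, -3.0],
                 highs=[2.4, 3.0, 0.21, 3.0],
                 bins_per_dim=[8, 8, 8, 8])
print(discretize([0.1, 0.0, 0.02, -0.5], grid))
```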

For our aggregation algorithm, we use $\gamma=0.99$, following what is commonly used for this problem in past works. We set the initial cost vector $V_{0}$ to be the zero vector, fix $\varepsilon=0.2$, and set $\alpha_{t}=\frac{1}{\sqrt{t}}$. We note that, given the symmetric nature of this continuous control problem, we do not need to alternate between global and aggregate updates: the adaptive aggregation of states already groups states together effectively, yielding strong performance and significant reductions in the number of updates required. We first sweep over the number of bins used to discretize the problem space to determine the number that best maximizes the performance of value iteration in terms of both update count and reward, and found this to be around 2000 bins. We then compare the performance of state aggregation and value iteration in this setting in Figure 3. To further showcase the adaptive advantage of our method, we also choose a larger bin number (10000), which likely would have been chosen had no bin sweep been performed, and show that our method acts as an “automatic bin adjuster,” offering significant speedups without any prior tuning (Figure 3). Interestingly, in both situations the experimental results show that the speedup offered by our method is significant across different bin settings of the CartPole problem. These results may point to promising theoretical directions for future work.

Figure 3: CartPole problem reward versus number of updates, comparing value iteration and our adaptive state aggregation method. Even in use cases such as continuous control, our method offers valuable speedups in the learning process of MDPs and stays robust against different discretization schemes.

5 Discussion

Value iteration is an effective tool for solving Markov decision processes but can quickly become infeasible in problems with large state and action spaces. To address this difficulty, we develop adaptive state-aggregated value iteration, a novel method of solving Markov decision processes by aggregating states with similar cost-to-go values and updating them in tandem. Unlike previous methods of reducing the dimensions of state and action spaces, our method is general and does not require prior knowledge of the true cost-to-go values to form aggregate states. Instead our algorithm learns the cost-to-go values online and uses them to form aggregate states in an adaptive manner. We prove theoretical guarantees for the accuracy, convergence, and convergence rate of our state-aggregated value iteration algorithm, and demonstrate its applicability through a variety of numerical experiments.

State- and action-space reduction techniques are an area of active research. Our contribution, the dynamic state-aggregated value iteration algorithm, provides a general framework for approaching large MDPs, with strong numerical performance justifying the method. We believe our algorithm can serve as a foundation for both future empirical and theoretical work.

We conclude by discussing promising directions for future work on adaptive state aggregation. First, we believe that reducing the number of updates per state by dynamically aggregating states can also be extended more generally to RL settings with model-free methods. Second, our proposed algorithm's complexity is of the same order as that of value iteration; future work may seek to eliminate the dependence on $\gamma$ in the error bound. Lastly, by adaptively choosing $\varepsilon$, it may be possible to achieve better complexity bounds not only for planning problems but also for generative MDP models sidford2018 ; sidford2018near .

References

  • [1] D. Abel, D. Arumugam, K. Asadi, Y. Jinnai, M. L. Littman, and L. L. Wong. State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3134–3142, 2019.
  • [2] D. Abel, D. Hershkowitz, and M. Littman. Near optimal behavior via approximate state abstraction. In International Conference on Machine Learning, pages 2915–2923. PMLR, 2016.
  • [3] D. Abel, N. Umbanhowar, K. Khetarpal, D. Arumugam, D. Precup, and M. Littman. Value preserving state-action abstractions. In International Conference on Artificial Intelligence and Statistics, pages 1639–1650. PMLR, 2020.
  • [4] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
  • [5] J. Baras and V. Borkar. A learning algorithm for Markov decision processes with adaptive state aggregation. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No. 00CH37187), volume 4, pages 3351–3356. IEEE, 2000.
  • [6] R. Bellman. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.
  • [7] D. P. Bertsekas. Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6(1):1–31, 2018.
  • [8] D. P. Bertsekas, D. A. Castanon, et al. Adaptive aggregation methods for infinite horizon dynamic programming. 1988.
  • [9] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI’97, pages 124–131, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
  • [10] Y. Duan, Z. T. Ke, and M. Wang. State aggregation learning from Markov transition data. arXiv preprint arXiv:1811.02619, 2018.
  • [11] H.-r. Fang and Y. Saad. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009.
  • [12] N. Ferns, P. Panangaden, and D. Precup. Metrics for Markov decision processes with infinite state spaces. arXiv preprint arXiv:1207.1386, 2012.
  • [13] M. Herzberg and U. Yechiali. Accelerating procedures of the value iteration algorithm for discounted Markov decision processes, based on a one-step lookahead analysis. Operations Research, 42(5):940–946, 1994.
  • [14] J. Hostetler, A. Fern, and T. Dietterich. State aggregation in Monte Carlo tree search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
  • [15] N. Jiang, S. Singh, and R. Lewis. Improving UCT planning via approximate homomorphisms. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 1289–1296, 2014.
  • [16] N. K. Jong and P. Stone. State abstraction discovery from irrelevant state variables. In IJCAI, volume 8, pages 752–757. Citeseer, 2005.
  • [17] S. Kumar. Balancing a CartPole system with reinforcement learning - A tutorial. CoRR, abs/2006.04938, 2020.
  • [18] L. Li, T. J. Walsh, and M. L. Littman. Towards a unified theory of state abstraction for MDPs. ISAIM, 4:5, 2006.
  • [19] R. McCallum. Reinforcement learning with selective perception and hidden state. 1997.
  • [20] R. Mendelssohn. An iterative aggregation procedure for Markov decision processes. Operations Research, 30(1):62–73, 1982.
  • [21] R. Ortner. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Annals of Operations Research, 208(1):321–336, 2013.
  • [22] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [23] P. J. Schweitzer, M. L. Puterman, and K. W. Kindle. Iterative aggregation-disaggregation procedures for discounted semi-Markov reward processes. Operations Research, 33(3):589–605, 1985.
  • [24] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for Markov decision processes. Operations Research, 58(1):193–202, 2010.
  • [25] A. Sidford, M. Wang, X. Wu, L. F. Yang, and Y. Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5192–5202, 2018.
  • [26] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. SIAM, 2018.
  • [27] S. R. Sinclair, S. Banerjee, and C. L. Yu. Adaptive discretization for episodic reinforcement learning in metric spaces. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–44, 2019.
  • [28] A. Slivkins. Contextual bandits with similarity information. In Proceedings of the 24th annual Conference On Learning Theory, pages 679–702. JMLR Workshop and Conference Proceedings, 2011.
  • [29] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1):59–94, 1996.
  • [30] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
  • [31] M. T. Wasan. Stochastic approximation. Number 58. Cambridge University Press, 2004.
  • [32] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.

Appendix A Appendix

A.1 Preliminaries

Since the number of state-action pairs is finite, without loss of generality we can assume that the cost function $r$ is bounded by $1$. By a slight abuse of notation, we denote by $\bm{V}_{t}$ the current cost-to-go value at iteration $t$ generated by Algorithm 3. More specifically, if the algorithm is in a global value iteration phase $\mathcal{B}_{i}$, then $\bm{V}_{t}$ is the updated cost-to-go value from global value iteration; if the algorithm is in a state aggregation phase $\mathcal{A}_{i}$, then $\bm{V}_{t}$ represents $\tilde{\bm{V}}(\bm{W}_{t})$. Also, denote by $\bm{E}_{t}=\bm{V}_{t}-\bm{V}^{*}$ the error between the current estimate and the optimal value function. We first state results on the $\ell^{\infty}$ norm of $\bm{V}^{*}$ and $\bm{V}_{t}$.

Lemma 1.

For any $t>0$ we have $\|\bm{V}_{t}\|_{\infty}\leq\frac{1}{1-\gamma}$ and $\|\bm{V}^{*}\|_{\infty}\leq\frac{1}{1-\gamma}$.

Lemma 1 states that both the optimal cost-to-go value and the value obtained by Algorithm 3 are uniformly bounded, entry-wise, by $\frac{1}{1-\gamma}$ at all aggregated and global iterations.

Next, it is well known that $\|\bm{E}_{t}\|_{\infty}$ can only decrease during the global value iteration phase, because the Bellman operator is a contraction mapping. However, $\|\bm{E}_{t}\|_{\infty}$ can potentially grow when we switch into a state aggregation phase. To evaluate the fluctuation of $\|\bm{E}_{t}\|_{\infty}$ during the state aggregation phase, we first notice that step 4 in Algorithm 2 (the initialization step for state aggregation) yields a tight bound: if we denote by $\bm{W}(\bm{V}_{t})$ the initialization produced in step 4 of Algorithm 2, we have

\|\tilde{\bm{V}}(\bm{W}(\bm{V}_{t}))-\bm{V}^{*}\|_{\infty}\leq\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}+\frac{\varepsilon}{2}. (5)

The inequality above is tight because, by construction, there exists a scenario in which $\bm{V}_{t}=\bm{V}^{*}$ and $\tilde{\bm{V}}(\bm{W}(\bm{V}_{t}))(s)-\bm{V}^{*}(s)=\frac{\varepsilon}{2}$ for some $s\in\mathcal{S}$.

Beyond the error introduced in the initialization stage of state aggregation, the value update process may also accumulate error. Recall that during each state aggregation phase $\mathcal{A}_{i}$, for $t\in\mathcal{A}_{i}$ we have the following update for mega-state $j$:

W_{t}(j)=(1-\alpha_{t_{sa}})W_{t-1}(j)+\alpha_{t_{sa}}T_{s}\tilde{\bm{V}}(\bm{W}_{t-1}).

Hence, combining this with Lemma 1, we know that for every state-aggregated iteration we have

|W_{t}(j)-W_{t-1}(j)|\leq\alpha_{t_{sa}}\left(\|T_{s}\tilde{\bm{V}}(\bm{W}_{t-1})\|_{\infty}+\|\bm{W}_{t-1}\|_{\infty}\right)\leq\alpha_{t_{sa}}\frac{2}{1-\gamma}.

We now state a general condition that controls the error accumulated by the state-aggregated value updates, and we will show that all of our theorem and propositions satisfy this condition.

Condition 1.

There exists $M>0$ such that for any state aggregation phase $\mathcal{A}_{i}$ coming after $M$ state aggregation iterations, with $t_{0}=\min\{t\mid t\in\mathcal{A}_{i}\}$, we have

\|\bm{W}_{t_{0}+|\mathcal{A}_{i}|}-\bm{W}_{t_{0}}\|_{\infty}\leq\sum_{k=0}^{|\mathcal{A}_{i}|-1}\alpha_{t_{sa}}\left(\|T_{s}\tilde{\bm{V}}(\bm{W}_{t_{0}+k})\|_{\infty}+\|\bm{W}_{t_{0}+k}\|_{\infty}\right)\leq\frac{\varepsilon}{2}. (6)

During the state aggregation phase, from (5) we know that if we switch from $\mathcal{B}_{i}$ to $\mathcal{A}_{i}$, $\bm{E}_{t}$ can increase by $\frac{\varepsilon}{2}$, and from Condition 1 we know that after $|\mathcal{A}_{i}|$ steps of aggregated value iteration, $\bm{E}_{t}$ can increase by another $\frac{\varepsilon}{2}$.

More specifically, after the $M$ iterations required by Condition 1, if we take $t_{0}=\max\{t\mid t\in\mathcal{B}_{i}\}$ and $t_{1}=\min\{t\mid t\in\mathcal{B}_{i+1}\}$, from our discussion above we have

\|\bm{V}_{t_{1}}-\bm{V}^{*}\|_{\infty}\leq\|\bm{V}_{t_{0}}-\bm{V}^{*}\|_{\infty}+\varepsilon. (7)

Moreover, the above inequality also holds for all $t\in\mathcal{B}_{i+1}$, as the Bellman operator is a contraction.

Then, for notational convenience, we re-index so that the first global iteration phase after the $M$ state-aggregated iterations is $\mathcal{B}_{1}$, and the subsequent time intervals are $\mathcal{A}_{1},\mathcal{B}_{2},\mathcal{A}_{2},\cdots$. We also denote $\tau_{i}=\min\{t\mid t\in\mathcal{B}_{i}\}$ and $t_{i}=\max\{t\mid t\in\mathcal{B}_{i}\}$; see Figure 4.

Since $\liminf|\mathcal{B}_{i}|\geq 1$, there exists $I>0$ such that for any $i>I$ and $t\in(t_{i},\tau_{i+1}-1]$ we have

\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\gamma\|\bm{V}_{\tau_{i}-1}-\bm{V}^{*}\|_{\infty} (8)
\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}+\varepsilon,

where the first inequality comes from the fact that global value iteration is performed during $[\tau_{i},t_{i}]$, and the second inequality comes from combining (5) and (6).

The intuition behind (8) is that from $\tau_{i}-1$ to $t_{i}$, our approximation can only improve, because only global value iteration is used. From time $t_{i}$ to $\tau_{i+1}-1$, the approximation throughout the entire state-aggregated value iteration phase stays within the previous error up to an additional $\varepsilon$. By balancing these bounds we are able to obtain a “stable field” for all phases.

We then state the bound on the error at $t_{i}$ for $i>0$.

Lemma 2.

For all $n\geq 1$, we have

\|\bm{V}_{t_{n+1}}-\bm{V}^{*}\|_{\infty}\leq\gamma^{n}\|\bm{V}_{\tau_{1}}-\bm{V}^{*}\|_{\infty}+\frac{\varepsilon}{1-\gamma}. (9)

From Lemma 2 and (8), we can derive bounds for all sufficiently large $t$.

Figure 4: Notation after re-indexing.

A.2 Proof of Theorem 1

Since

\limsup_{i\to\infty}|\mathcal{A}_{i}|<\infty,\ \liminf_{i\to\infty}|\mathcal{B}_{i}|>0,

there exists $N$ such that when $i>N$, $|\mathcal{A}_{i}|$ is bounded from above and $|\mathcal{B}_{i}|$ is bounded from below. We denote these two bounds by $a$ and $b$, respectively. Then, since the sequence $\{\alpha_{t}\}_{t=1}^{\infty}$ goes to zero as $t$ goes to infinity, there exists a constant $T$ such that for $t>T$ we have $a\cdot\alpha_{t}<\frac{\varepsilon}{2(1-\gamma)}$. Thus, we have

\|\bm{W}_{t_{0}+|\mathcal{A}_{i}|}-\bm{W}_{t_{0}}\|_{\infty}\leq\sum_{k=0}^{a}\alpha_{t_{sa}}\left(\|T_{s}\tilde{\bm{V}}(\bm{W}_{t_{0}+k})\|_{\infty}+\|\bm{W}_{t_{0}+k}\|_{\infty}\right)\leq\frac{\varepsilon}{2}.

Therefore, we have shown that Condition 1 has been satisfied.

Next, we show that $\bm{V}_{t_{i}}$ grows nearer to $\bm{V}^{*}$ as $i$ increases. To see this, by directly applying Lemma 2, we have

\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\gamma^{i-1}\|\bm{V}_{t_{1}}-\bm{V}^{*}\|_{\infty}+\frac{\varepsilon}{1-\gamma},

and, thus,

\limsup_{i\to\infty}\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\frac{\varepsilon}{1-\gamma}.

Then, in order to prove the final statement of the theorem, we aim to show that for any positive constant $\lambda>0$, there exists a constant $T_{\lambda}$ such that for any $t>T_{\lambda}$, the following inequality holds:

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{2\varepsilon}{1-\gamma}+\lambda. (10)

Since $\limsup_{i\rightarrow\infty}\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\frac{\varepsilon}{1-\gamma}$, we can find $i_{\lambda}>0$ such that for any $i\geq i_{\lambda}$, we have

\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}\leq\frac{\varepsilon}{1-\gamma}+\lambda. (11)

Then, by defining $T_{\lambda}=T+t_{i_{\lambda}}$, it suffices to show Eq. (10) for $t\in[T_{\lambda},+\infty)\cap\mathcal{A}_{i}$ and $t\in[T_{\lambda},+\infty)\cap\mathcal{B}_{i}$ separately. If $t\in[T_{\lambda},+\infty)\cap\mathcal{A}_{i}$ for some $i\in\mathbb{N}$, from inequality (8) we have

\|\bm{V}_{t}-\bm{V}_{t_{i}}\|_{\infty}\leq\varepsilon.

Thus,

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\|\bm{V}_{t}-\bm{V}_{t_{i}}\|_{\infty}+\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}
\leq\varepsilon+\frac{\varepsilon}{1-\gamma}+\lambda
\leq\frac{2\varepsilon}{1-\gamma}+\lambda,

where the first line is obtained by the triangle inequality and the second by Eq. (11). If $t\in[T_{\lambda},+\infty)\cap\mathcal{B}_{i}$ for some $i\in\mathbb{N}$, then by the contraction property of the value iteration method, we have $\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\gamma^{t-\tau_{i}}\|\bm{V}_{\tau_{i}}-\bm{V}^{*}\|_{\infty}$. Hence, we have shown that Eq. (10) holds for all $t>T_{\lambda}$ and any $\lambda>0$, which is equivalent to

\limsup_{t\rightarrow\infty}\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{2\varepsilon}{1-\gamma}.

Moreover, if we set $\lambda=\varepsilon$ and the step size satisfies $\alpha_{t}<\frac{\varepsilon}{(1-\gamma)\sup_{i}|\mathcal{A}_{i}|}$ for all $t$, we have $T=1$ and $t_{i_{\lambda}}=O\left(\max\left\{1,\frac{\log(\varepsilon)}{\log(\gamma)}\right\}\right)$. Then, we can also obtain Proposition 2 using a similar argument.

A.3 Proof of Proposition 3

Without loss of generality, assume $|\mathcal{A}_{i}|=a$ and $|\mathcal{B}_{i}|=b$ for all $i$. By taking $\alpha_{t}=t^{-\beta}$, after $\frac{(4a)^{1/\beta}}{(1-\gamma)^{1/\beta}\varepsilon^{1/\beta}}$ state-aggregated iterations (which correspond to $\frac{(a+b)(4a)^{1/\beta}}{a(1-\gamma)^{1/\beta}\varepsilon^{1/\beta}}$ iterations of Algorithm 3), during any state aggregation phase $\mathcal{A}_{i}$ we have $\sum_{t\in\mathcal{A}_{i}}\alpha_{t_{sa}}\leq\frac{(1-\gamma)\varepsilon}{4}$.

Therefore, after the state aggregation phase $\mathcal{A}_{i}$, if we let $t_{0}=\min\{t\mid t\in\mathcal{A}_{i}\}$, we have

\|\bm{W}_{t_{0}+a}-\bm{W}_{t_{0}}\|_{\infty}\leq\sum_{k=1}^{a}\alpha_{t_{sa}}\left(\|T_{s}\tilde{\bm{V}}(\bm{W}_{t_{0}})\|_{\infty}+\|\bm{W}_{t_{0}-1}\|_{\infty}\right)\leq\frac{\varepsilon}{2}. (12)

Hence Condition 1 is also satisfied. From Lemma 2, we know that if we run more than $\frac{\log(\varepsilon)}{\log(\gamma)}$ global iteration phases (corresponding to $\frac{(a+b)\log(\varepsilon)}{\log(\gamma)}$ iterations of Algorithm 3), then for the subsequent $t_{n}$ we have

\|\bm{V}_{t_{n}}-\bm{V}^{*}\|_{\infty}\leq\gamma^{n-1}\frac{1}{1-\gamma}+\frac{\varepsilon}{1-\gamma}\leq\frac{2\varepsilon}{1-\gamma}. (13)

Then, it suffices to prove that for all $t\geq t_{n}$, we have $\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\frac{3\varepsilon}{1-\gamma}$. We first check that the bound holds for state-aggregated iterations. For any $i\geq n$ and any $t\in[t_{i},\tau_{i+1}-1]$, from (8) and (13) we know that

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\|\bm{V}_{t_{i}}-\bm{V}^{*}\|_{\infty}+\varepsilon\leq\frac{3\varepsilon}{1-\gamma}.

Lastly, for any $i>n$ and any $t\in[\tau_{i},t_{i}]$, from the contraction property we know that

\|\bm{V}_{t}-\bm{V}^{*}\|_{\infty}\leq\gamma^{t-\tau_{i}+1}\|\bm{V}_{\tau_{i}-1}-\bm{V}^{*}\|_{\infty}\leq\gamma^{t-\tau_{i}+1}\frac{3\varepsilon}{1-\gamma}\leq\frac{3\varepsilon}{1-\gamma}.

Therefore, the process $\bm{V}_{t}$ stays stable for $t\geq t_{n}$, where $t_{n}=\frac{(a+b)(4a)^{1/\beta}}{a(1-\gamma)^{1/\beta}\varepsilon^{1/\beta}}+\frac{(a+b)\log(\varepsilon)}{\log(\gamma)}$.

A.4 Proof of Lemma 1

First, we show that the optimal cost-to-go value is bounded by $\frac{1}{1-\gamma}$. For any fixed policy $\pi$, we have

V^{\pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})\,\Big|\,s_{0}=s\right]\leq\frac{1}{1-\gamma},

where the last inequality is obtained by the assumption that the reward function is bounded by 1. The inequality also holds for the optimal policy, which implies $\|\bm{V}^{*}\|_{\infty}\leq\frac{1}{1-\gamma}$.

Then, we show by contradiction that the estimated cost-to-go value obtained by our algorithm is bounded by $\frac{1}{1-\gamma}$. Let $\tau$ be the smallest index such that

\|\bm{V}_{\tau}\|_{\infty}>\frac{1}{1-\gamma} (14)

if $\tau$ indexes a global iteration, or

\|\bm{W}_{\tau}\|_{\infty}>\frac{1}{1-\gamma} (15)

if it indexes an aggregated iteration. If the $\tau$-th iteration is a global iteration, we have $\|\bm{V}_{\tau-1}\|_{\infty}\leq\frac{1}{1-\gamma}$ and so

|V_{\tau}(s)|=\left|\min_{a\in\mathcal{A}}R(s,a)+\gamma\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}\right|
\leq\max_{a\in\mathcal{A}}\left\{|R(s,a)|+|\gamma\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}|\right\}
\leq 1+\gamma\max_{a\in\mathcal{A}}\|\bm{P}_{s,a}\|_{1}\|\bm{V}_{\tau-1}\|_{\infty}
\leq 1+\frac{\gamma}{1-\gamma}
=\frac{1}{1-\gamma},

for all $s\in\mathcal{S}$. Here, the first line is obtained from the definition of the value iteration process; the second line by the triangle inequality; the third line holds because $r(s,a)\leq 1$ and because Hölder's inequality gives $|\bm{P}_{s,a}^{\top}\bm{V}_{\tau-1}|\leq\|\bm{P}_{s,a}\|_{1}\|\bm{V}_{\tau-1}\|_{\infty}$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$; the fourth line follows from $\|\bm{P}_{s,a}\|_{1}=1$ and the condition that $\|\bm{V}_{\tau-1}\|_{\infty}\leq\frac{1}{1-\gamma}$; and the last line is obtained by direct calculation. However, this inequality contradicts Eq. (14).

Alternatively, if the $\tau$-th iteration is an aggregated iteration, we find the contradiction in two cases. On the one hand, if the $\tau$-th iteration is the first iteration in an aggregated phase, i.e., $\tau=\min\{\mathcal{A}_{i}\}$ for some $i$, we have $\|\bm{V}_{\tau-1}\|_{\infty}\leq\frac{1}{1-\gamma}$. Then, for any aggregated state $j$, from Algorithm 2 we have

|\tilde{\bm{V}}(\bm{W}(\bm{V}_{\tau-1}))(s)|\leq\|\bm{V}_{\tau-1}\|_{\infty}-\frac{\varepsilon}{2}\leq\frac{1}{1-\gamma},

which also contradicts Eq. (15). On the other hand, if $\tau$ is such that the $(\tau-1)$-th iteration is also an aggregated iteration, we can use a similar argument as in the global iteration case to derive a contradiction. Thus, we have that the estimated cost-to-go value obtained by our algorithm is bounded by $\frac{1}{1-\gamma}$.

A.5 Proof of Lemma 2

The proof of Lemma 2 follows from an induction argument on a similar inequality: for all $n\geq 1$, we have

\|\bm{V}_{t_{n+1}}-\bm{V}^{*}\|_{\infty}\leq\gamma^{n}\|\bm{V}_{\tau_{1}}-\bm{V}^{*}\|_{\infty}+\sum_{i=1}^{n-1}\gamma^{i}\varepsilon. (16)

For the initial case, we have that

\|\bm{V}_{t_{2}}-\bm{V}^{*}\|_{\infty}\leq\gamma\|\bm{V}_{\tau_{2}-1}-\bm{V}^{*}\|_{\infty}\leq\gamma(\|\bm{V}_{t_{1}}-\bm{V}^{*}\|_{\infty}+\varepsilon)\leq\gamma^{2}\|\bm{V}_{\tau_{1}}-\bm{V}^{*}\|_{\infty}+\gamma\varepsilon.

For $n>1$, by assuming (16), we have that

\|\bm{V}_{t_{n+2}}-\bm{V}^{*}\|_{\infty}\leq\gamma\|\bm{V}_{\tau_{n+2}-1}-\bm{V}^{*}\|_{\infty}\leq\gamma(\|\bm{V}_{t_{n+1}}-\bm{V}^{*}\|_{\infty}+\varepsilon)
\leq\gamma^{n+1}\|\bm{V}_{\tau_{1}}-\bm{V}^{*}\|_{\infty}+\sum_{i=1}^{n}\gamma^{i}\varepsilon,

where the second-to-last inequality comes from Eq. (8). Having established Eq. (16), the lemma follows upon noticing that $\sum_{i=1}^{n}\gamma^{i}\varepsilon<\frac{\varepsilon}{1-\gamma}$.