
Corresponding author:

Marcus Hoerger, School of Mathematics & Physics, The University of Queensland, Australia

Adaptive Discretization using Voronoi Trees for Continuous POMDPs

Marcus Hoerger¹, Hanna Kurniawati², Dirk Kroese¹ and Nan Ye¹
¹School of Mathematics & Physics, The University of Queensland, Australia
²School of Computing, Australian National University, Australia
[email protected]
Abstract

Solving continuous Partially Observable Markov Decision Processes (POMDPs) is challenging, particularly for high-dimensional continuous action spaces. To alleviate this difficulty, we propose a new sampling-based online POMDP solver, called Adaptive Discretization using Voronoi Trees (ADVT). It uses Monte Carlo Tree Search in combination with an adaptive discretization of the action space as well as optimistic optimization to efficiently sample high-dimensional continuous action spaces and compute the best action to perform. Specifically, we adaptively discretize the action space for each sampled belief using a hierarchical partition called a Voronoi tree, which is a Binary Space Partitioning tree that implicitly maintains the partition of a cell as the Voronoi diagram of two points sampled from the cell. ADVT uses the estimated diameters of the cells to form an upper-confidence bound on the action value function within the cell, guiding the Monte Carlo Tree Search expansion and further discretization of the action space. This enables ADVT to better exploit local information with respect to the action value function, allowing faster identification of the most promising regions in the action space, compared to existing solvers. Voronoi trees keep the cost of partitioning and estimating the diameter of each cell low, even in high-dimensional spaces where many sampled points are required to cover the space well. ADVT additionally handles continuous observation spaces, by adopting an observation progressive widening strategy, along with a weighted particle representation of beliefs. Experimental results indicate that ADVT scales substantially better to high-dimensional continuous action spaces, compared to state-of-the-art methods.

keywords:
Planning under Uncertainty, Motion and Path Planning, Autonomy for Mobility and Manipulation

1 Introduction

Planning in scenarios with non-deterministic action effects and partial observability is an essential, yet challenging problem for autonomous robots. The Partially Observable Markov Decision Process (POMDP) (Kaelbling et al., 1998; Sondik, 1971) is a general principled framework for such planning problems. POMDPs lift the planning problem from the state space to the belief space — that is, the set of all probability distributions over the state space. By doing so, POMDPs enable robots to systematically account for uncertainty caused by stochastic actions and incomplete or noisy observations in computing the optimal strategy. Although computing the optimal strategy exactly is intractable in general (Papadimitriou and Tsitsiklis, 1987), the past two decades have seen a surge of sampling-based POMDP solvers (reviewed in (Kurniawati, 2022)) that trade optimality with approximate optimality for computational tractability, enabling POMDPs to become practical for a variety of realistic robotics problems.

Despite these advances, POMDPs with continuous state, action and observation spaces remain a challenge, particularly for high-dimensional continuous action spaces. Recent solvers for continuous-action POMDPs (Fischer and Tas, 2020; Mern et al., 2021; Seiler et al., 2015; Sunberg and Kochenderfer, 2018) are generally online —that is, planning and execution are interleaved— and exploit Monte Carlo Tree Search (MCTS) to find the best action among a finite representative subset of the action space. MCTS interleaves guided belief space sampling, value estimation and action subset refinement to incrementally improve the possibility that the selected subset of actions contains the best action. They generally use UCB1 (Auer et al., 2002) to guide belief space sampling and Monte Carlo backup for value estimation, but differ in the action subset refinement.

Several approaches use the Progressive Widening strategy (Couëtoux et al., 2011) to continuously add new randomly sampled actions once current actions have been sufficiently explored. Examples include POMCPOW (Sunberg and Kochenderfer, 2018) and IPFT (Fischer and Tas, 2020). More recent algorithms combine Progressive Widening with more informed methods for adding new actions: VOMCPOW (Lim et al., 2021) uses Voronoi Optimistic Optimization (Kim et al., 2020) and BOMCP (Mern et al., 2021) uses Bayesian optimization. All of these solvers use UCT-style simulations and Monte Carlo backups. An early line of work, GPS-ABT (Seiler et al., 2015), takes a different approach: It uses Generalized Pattern Search to iteratively select an action subset that is more likely to contain the best action and add it to the set of candidate actions. GPS-ABT uses UCT-style simulations and Bellman backup (following the implementation of ABT (Hoerger et al., 2018; Klimenko et al., 2014)), though the distinction between Monte Carlo and Bellman backup was not clarified nor explored. All of these solvers have been successful in finding good solutions to POMDPs with continuous action spaces, though for a relatively low (≤ 4) dimension.

To compute good strategies for POMDPs with high-dimensional action spaces, we propose a new MCTS-based online POMDP solver, called Adaptive Discretization using Voronoi Trees (ADVT). ADVT is designed for problems with continuous action spaces, while the state and observation spaces can either be discrete or continuous. ADVT contains three key ideas, as briefly described below.

The first key idea of ADVT is a new data structure for action space discretization, called Voronoi tree. A Voronoi tree represents a hierarchical partition of the action space for a single sampled belief. It follows the structure of a Binary Space Partitioning (BSP) tree, but each partitioning hyper-plane is only implicitly maintained and computed based on the Voronoi diagram of a pair of sampled actions, which results in a partitioning that is much more adaptive to the spatial locations of the sampled actions, compared to state-of-the-art methods (Bubeck et al., 2011; Lim et al., 2021; Mansley et al., 2011; Sunberg and Kochenderfer, 2018; Valko et al., 2013). We additionally maintain estimates of the diameters of the cells in the Voronoi tree. These diameters are used in the other two key ideas, namely, action selection and Voronoi tree refinement, as described in the next two paragraphs. The hierarchical structure of the Voronoi tree allows us to estimate the diameters with an efficient sampling-based algorithm that scales well to high-dimensional action spaces. Section 5 provides a more detailed description on the computational and representational advantages of the Voronoi tree.

The second key idea of ADVT is the use of a cell-diameter-aware upper-confidence bound on the values of actions to guide action selection during planning. This bound represents an upper-confidence bound on the values of all actions within a cell, based on the estimated value of the corresponding sampled action and the diameter of the cell. Our bound is a generalization of a bound that was developed for Lipschitz continuum-arm bandit problems  (Wang et al., 2020). It is motivated by the observation that in many continuous action POMDPs for robotics problems, the distance between two actions can often be used as an indication of how similar their values are. Using this observation, ADVT assumes that the action value for a belief is Lipschitz continuous in the action space, which in turn allows us to derive the upper bound. The upper bound helps ADVT to exploit local information (i.e., the estimated value of the representative action of a cell and the cell diameter) with respect to the action value function and bias its search towards the most promising regions of the action space.

The third key idea is a diameter-aware cell refinement rule. We use the estimated diameters of the cells to help ADVT decide if a cell needs further refinement: a larger cell will be split into smaller ones after a small number of simulations, while a smaller cell will require more simulations before it is split. This helps ADVT to avoid an unnecessarily fine-grained partitioning of non-promising regions in the action space.

To further support ADVT in efficiently finding approximately optimal actions, we use a stochastic version of the Bellman backup (Klimenko et al., 2014) rather than the typical Monte-Carlo backup to estimate the values of sampled actions. Stochastic Bellman backups help ADVT to backpropagate the value of good actions deep in the search tree, instead of averaging them out. This strategy of estimating the action values is particularly helpful for problems with sparse rewards.

Aside from continuous action spaces, continuous observation spaces pose an additional challenge for MCTS-based online POMDP solvers. Recent approaches such as POMCPOW and VOMCPOW apply Progressive Widening in the observation space in conjunction with an explicit representation of the sampled beliefs via a set of weighted particles. Due to its simplicity, we adopt POMCPOW’s strategy to handle continuous observation spaces. This enables ADVT to scale to problems with continuous state, action and observation spaces.

Experimental results on a variety of benchmark problems with increasing dimension (up to 12-D) of the action space and problems with continuous observation spaces indicate that ADVT substantially outperforms state-of-the-art methods (Lim et al., 2021; Sunberg and Kochenderfer, 2018). Our C++ implementation of ADVT is available at https://github.com/hoergems/ADVT.

This paper extends our previous work (Hoerger et al., 2022) in three ways: First is the extension of ADVT to handle continuous observation spaces. Experimental results on an additional POMDP benchmark problem demonstrate the effectiveness of ADVT in handling purely continuous POMDP problems. Second is a substantially expanded discussion on the technical concepts of ADVT. And third is an extended ablation study in which we investigate the effectiveness of stochastic Bellman backups when applied to two baseline solvers POMCPOW and VOMCPOW.

2 Background and Related Work

A POMDP provides a general mathematical framework for sequential decision making under uncertainty. Formally, it is an 8-tuple $\langle\mathcal{S},\mathcal{A},\mathcal{O},T,Z,R,b_{0},\gamma\rangle$. The robot is initially in a hidden state $s_{0}\in\mathcal{S}$; the uncertainty about this state is represented by an initial belief $b_{0}\in\mathcal{B}$, which is a probability distribution on the state space $\mathcal{S}$, where $\mathcal{B}$ is the set of all possible beliefs. At each step $t\geq 0$, the robot executes an action $a_{t}\in\mathcal{A}$ according to some policy $\pi$. It transitions to a next state $s_{t+1}\in\mathcal{S}$ according to the transition model $T(s_{t},a_{t},s_{t+1})=p(s_{t+1}|s_{t},a_{t})$. For discrete state spaces, $T(s_{t},a_{t},s_{t+1})$ represents a probability mass function, whereas for continuous state spaces, it represents a probability density function. The robot does not know the state $s_{t+1}$ exactly, but perceives an observation $o_{t}\in\mathcal{O}$ according to the observation model $Z(s_{t+1},a_{t},o_{t})=p(o_{t}|s_{t+1},a_{t})$, which represents a probability mass function for discrete observation spaces, or a probability density function for continuous observation spaces. In addition, the robot receives an immediate reward $r_{t}=R(s_{t},a_{t})\in\mathbb{R}$. The robot's goal is to find a policy $\pi$ that maximizes the expected total discounted reward, or the policy value

V_{\pi}(b_{0})=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\bigg|\,b_{0},\pi\right], (1)

where the discount factor $0<\gamma<1$ ensures that $V_{\pi}(b)$ is finite and well-defined.

The robot's decision space is the set $\Pi$ of policies, defined as mappings from beliefs to actions. The POMDP solution is then the optimal policy, denoted as $\pi^{*}$ and defined as

\pi^{*}=\operatorname*{arg\,max}_{\pi\in\Pi}V_{\pi}(b). (2)

In designing solvers, it is often convenient to work with the action value or $Q$-value

Q(b,a)=R(b,a)+\gamma\,\mathbb{E}_{o\in\mathcal{O}}\left[V_{\pi^{*}}(b_{a}^{o})\,|\,b\right], (3)

where $R(b,a)=\int_{s\in\mathcal{S}}b(s)R(s,a)\,\mathrm{d}s$ is the expected reward of executing action $a$ at belief $b$, while $b_{a}^{o}=\tau(b,a,o)$ is the updated robot's belief estimate after it performs action $a\in\mathcal{A}$ while at belief $b$, and subsequently perceives observation $o\in\mathcal{O}$. The optimal value function is then

V^{*}(b)=\max_{a\in\mathcal{A}}Q(b,a). (4)

A more elaborate explanation is available in (Kaelbling et al., 1998).

Belief trees are convenient data structures to find good approximations to the optimal solutions via sampling-based approaches, which has been shown to significantly improve the scalability of POMDP solving (Kurniawati, 2022). Each node in a belief tree represents a sampled belief. It has outgoing edges labelled by actions, and each action edge is followed by outgoing edges labelled by observations which lead to updated belief nodes. Naïvely, bottom-up dynamic programming can be applied to a truncated belief tree to obtain a near-optimal policy, but many scalable POMDP solvers use more sophisticated sampling strategies to construct a compact belief tree, from which a close-to-optimal policy can be computed efficiently. ADVT uses such a sampling-based approach and belief tree representation too.

Various efficient sampling-based offline and online POMDP solvers have been developed for increasingly complex discrete and continuous POMDPs in the last two decades. Offline solvers (e.g., Bai et al. 2014; Kurniawati et al. 2011, 2008; Pineau et al. 2003; Smith and Simmons 2005) compute an approximately optimal policy for all beliefs first before deploying it for execution. In contrast, online solvers (e.g., Kurniawati and Yadav 2013; Silver and Veness 2010; Ye et al. 2017) aim to further scale to larger and more complex problems by interleaving planning and execution, and focusing on computing an optimal action for only the current belief during planning. For scalability purposes, ADVT follows the online solving approach.

Some online solvers have been designed for continuous POMDPs. In addition to the general solvers discussed in Section 1, some solvers (Agha-Mohammadi et al., 2011; Sun et al., 2015; Van Den Berg et al., 2011, 2012) restrict beliefs to be Gaussian and use Linear-Quadratic-Gaussian (LQG) control (Lindquist, 1973) to compute the best action. This strategy generally performs well in high-dimensional action spaces. However, they tend to perform poorly in problems with large uncertainties (Hoerger et al., 2020).

Last but not least, hierarchical rectangular partitions have been commonly used to discretize action spaces when solving continuous action bandits and MDPs (the fully observed version of POMDPs), such as HOO (Bubeck et al., 2011) and HOOT (Mansley et al., 2011). However, the partitions used in these algorithms are typically predefined, which are less adaptive than Voronoi-based partitions constructed dynamically during the search. On the other hand, Voronoi partitions have been proposed in VOOT (Kim et al., 2020) and VOMCPOW (Lim et al., 2021). However, their partitions are based on the Voronoi diagram of all sampled actions, which makes the computation of cell diameters and sampling relatively complex in high-dimensional action spaces. ADVT is computationally efficient, just like hierarchical rectangular partitions, and yet adaptive, just like the Voronoi partitions, getting the best of both worlds.

3 ADVT: Overview

Algorithm 1 ADVT(Initial belief $b_{0}$)
1: $b = b_{0}$
2: $\mathcal{T} =$ initializeBeliefTree($b$)
3: $\mathcal{H}(b) =$ Initialize Voronoi tree for belief $b$
4: isTerminal $=$ False
5: while isTerminal is False do
6:     while planning budget not exceeded do
7:         $(e, b_{c}) =$ SampleEpisode($\mathcal{T}$, $b$) ▷ Algorithm 2
8:         for $i = |e|-1$ to $1$ do
9:             $(s, a, o, r) = e_{i}$
10:             Backup($\mathcal{T}$, $b_{c}$, $a$, $r$) ▷ Algorithm 3
11:             $b_{c} =$ Parent node of $b_{c}$ in $\mathcal{T}$
12:             RefineVoronoiTree($\mathcal{H}(b_{c})$, $a$) ▷ Algorithm 5
13:         end for
14:     end while
15:     $a^{*} = \operatorname{arg\,max}_{a\in\mathcal{A}(b)}\widehat{Q}(b,a)$
16:     $(o,$ isTerminal$) =$ Execute $a^{*}$
17:     $b' = \tau(b, a^{*}, o)$
18:     $b = b'$
19: end while

In this paper, we consider a POMDP $\mathcal{P}=\langle\mathcal{S},\mathcal{A},\mathcal{O},T,Z,R,b_{0},\gamma\rangle$, where the action space $\mathcal{A}$ is continuous and embedded in a bounded metric space with distance function $d$. Typically, we define the metric space to be a $D$-dimensional bounded Euclidean space, though ADVT can also be used with other types of bounded metric spaces. We further consider the state space $\mathcal{S}$ and observation space $\mathcal{O}$ to be either discrete or continuous.

ADVT assumes that the $Q$-value function is Lipschitz continuous in the action space; that is, for any belief $b\in\mathcal{B}$, there exists a Lipschitz constant $L_{b}$ such that for any actions $a,a'\in\mathcal{A}$, we have $|Q(b,a)-Q(b,a')|\leq L_{b}\,d(a,a')$. Since generally we do not know a tight Lipschitz constant, in the implementation, ADVT uses the same Lipschitz constant $L$ for all beliefs in $\mathcal{B}$, as discussed in Section 4.1.

ADVT is an anytime online solver for POMDPs. It interleaves belief space sampling and action space sampling to find the best action to perform from the current belief $b\in\mathcal{B}$. The sampled beliefs are maintained in a belief tree, denoted as $\mathcal{T}$, while the sampled actions $\mathcal{A}(b)$ for a belief $b$ are maintained in a Voronoi tree, denoted as $\mathcal{H}(b)$, which is adaptively refined. The Voronoi trees form part of the belief tree in ADVT: they determine the sampled action branches for the belief nodes.

Algorithm 1 presents the overall algorithm of ADVT, with details in the sections below. At each planning step, ADVT follows the MCTS approach of constructing a belief tree by sampling a set of episodes (line 7 in Algorithm 1), starting from the current belief. Details on the belief-tree construction are provided in Section 4. After sampling an episode, the estimated action values $\widehat{Q}(b,a)$ along the sequence of actions selected by the episode are updated using a backup operation (line 10 in Algorithm 1). In addition, ADVT refines the Voronoi tree $\mathcal{H}(b)$ as needed for each belief visited by the episode (line 12 in Algorithm 1), as discussed in Section 5. Once a planning budget is exceeded, ADVT selects an action $a^{*}$ from the current belief according to

a^{*}=\operatorname*{arg\,max}_{a\in\mathcal{A}(b)}\widehat{Q}(b,a), (5)

executes $a^{*}$ in the environment to obtain an observation $o\in\mathcal{O}$, updates the current belief to $b'=\tau(b,a^{*},o)$ (lines 15-17 in Algorithm 1), and proceeds planning from the updated belief. This process repeats until the robot enters a terminal state or a maximum number of planning steps has been exceeded.

4 ADVT: Construction of the Belief Tree

Algorithm 2 SampleEpisode(Belief tree $\mathcal{T}$, Belief node $b_{c}$)
1: Notations: $\mathcal{H}(b) =$ Voronoi tree associated with belief $b$; $\mathcal{A}(b) =$ Set of candidate actions associated to the leaf nodes of $\mathcal{H}(b)$
2:
3: $e =$ Empty sequence of state-action-observation-reward quadruples; $b = b_{c}$; $s =$ A random state sampled from $b$; newBelief $=$ False
4: while newBelief is False and $s$ not terminal do
5:     $a = \operatorname{arg\,max}_{a_{k}\in\mathcal{A}(b)} U(b, a_{k})$ ▷ eq. 6
6:     $(b', s', o, r) =$ Step($b$, $s$, $a$) ▷ Algorithm 4
7:     Append $(s, a, o, r)$ to $e$
8:     $N(b,a) = N(b,a)+1$; $N(b) = N(b)+1$
9:     if $\mathcal{A}(b') = \emptyset$ then
10:         $a =$ Sample uniformly from $\mathcal{A}$
11:         $\mathcal{H}(b') =$ Initialize Voronoi tree for belief $b'$
12:         Associate $(a, \mathcal{A})$ with the root node of $\mathcal{H}(b')$
13:         $N(b') = 0$; $N(b', a) = 0$
14:         newBelief = True
15:     end if
16:     $s = s'$
17:     $b = b'$
18: end while
19: $r = 0$
20: if newBelief is True then
21:     $h =$ calculateRolloutHeuristic($s$, $b$)
22:     Initialize $\widehat{V}^{*}(b)$ with $h$
23: end if
24: insert $(s, -, -, r)$ to $e$
25: return $(e, b)$

The belief tree $\mathcal{T}$ is a tree whose nodes represent beliefs and whose edges are associated with action-observation pairs $(a,o)$, where $a\in\mathcal{A}$ and $o\in\mathcal{O}$. A node $b'$ is a child of node $b$ via edge $(a,o)$ if and only if $b'=\tau(b,a,o)$.

To construct the belief tree $\mathcal{T}$, ADVT interleaves the iterative select-expand-simulate-backup operations used in many MCTS algorithms with adaptive discretization. We assume that we have access to a generative model $G:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}\times\mathcal{O}\times\mathbb{R}$ that simulates the dynamics, observation and reward models. In particular, for a given state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, we have that $(s',o,r)=G(s,a)$, where $(s',o)$ is distributed according to $p(s',o|s,a)=T(s,a,s')Z(s',a,o)$, and $r=R(s,a)$. At each iteration, we first select a path starting from the root by sampling an episode $s_{0},a_{0},o_{0},r_{0},s_{1},a_{1},o_{1},r_{1},\ldots$ as follows: We first set the current node $b$ as the root node, and sample $s_{0}$ from $b$. At each step $i\geq 0$, we choose an action $a_{i}\in\mathcal{A}(b)$ for $b$ using an action selection strategy (discussed in Section 4.1), execute $a_{i}$ from state $s_{i}$ via the generative model $G$ to obtain a next state $s'$, an observation $o$ and the immediate reward $r_{i}$. For problems with discrete observation spaces, we set the episode's next state $s_{i+1}$ and observation $o_{i}$ to $s_{i+1}=s'$ and $o_{i}=o$ respectively. For problems with continuous observation spaces, we select $s_{i+1}$ and $o_{i}$ according to an observation sampling strategy, as discussed in Section 4.3. Finally, we update $b$ to $b$'s child node via $(a_{i},o_{i})$. The process terminates when encountering a terminal state or when the child node does not exist; in the latter case, the tree is expanded by adding a new node, and a rollout policy is simulated to provide an estimated value for the new node. In either case, backup operations are performed to update the estimated values for all actions selected by the episode. In addition, new actions are added to $\mathcal{A}(b)$ by refining the associated Voronoi tree $\mathcal{H}(b)$ as needed for each encountered belief. Algorithm 2 presents the pseudo-code for constructing $\mathcal{T}$, while the backup operation and the refinement of $\mathcal{H}(b)$ are discussed in Section 4.2 and Section 5, respectively.
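To make the interface concrete, the following is a minimal Python sketch (illustrative only, not the authors' C++ implementation) of such a black-box generative model for a hypothetical 2D problem; the class, noise model and reward are assumptions, but the call signature mirrors $(s',o,r)=G(s,a)$ as used above.

import numpy as np

# Minimal sketch of a black-box generative model (s', o, r) = G(s, a) for a
# hypothetical 2D problem with Gaussian motion noise; illustrative only.
class GenerativeModel:
    def __init__(self, rng=None):
        self.rng = rng or np.random.default_rng()

    def step(self, s, a):
        s_next = s + a + self.rng.normal(0.0, 0.05, size=s.shape)  # dynamics T(s, a, s')
        o = np.round(s_next, 1)                                    # observation Z(s', a, o)
        r = -float(np.linalg.norm(s_next))                         # reward R(s, a)
        return s_next, o, r

# Usage inside episode sampling: given a state s drawn from the current belief
# and a selected action a, one call advances the episode by one step.
G = GenerativeModel()
s_next, o, r = G.step(np.zeros(2), np.array([0.3, -0.1]))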

4.1 Action Selection Strategy

In contrast to many existing online solvers, which use UCB1 to select the action to expand a node $b$ of $\mathcal{T}$, ADVT treats the action selection problem as a continuum-arm bandit problem. Specifically, it selects an action from the set of candidate actions $\mathcal{A}(b)$ according to (Wang et al., 2020)

a^{*} = \operatorname*{arg\,max}_{a\in\mathcal{A}(b)}U(b,a), \qquad\text{with} (6)
U(b,a) = \widehat{Q}(b,a)+C\sqrt{\frac{\log N(b)}{N(b,a)}}+L\,\mathrm{diam}(P), (7)

where $N(b)$ is the number of times node $b$ has been visited so far, $N(b,a)$ is the number of times $a$ has been selected at $b$, $P\subseteq\mathcal{A}$ is the unique leaf cell containing $a$ in $\mathcal{H}(b)$ (see Section 5 for details on the Voronoi tree), and $\mathrm{diam}(P)=\sup_{a,a'\in P}d(a,a')$ is the diameter of $P$ with respect to the distance metric $d$. The constant $C$ is an exploration constant, where larger values of $C$ encourage exploration. In case $N(b,a)=0$, we set $U(b,a)=\infty$. With the Lipschitz continuity assumption, the value $U(b,a)$ can be seen as an upper-confidence bound for the maximum possible $Q$-value $\max_{a'\in P}Q(b,a')$ within $P$, as follows: $\widehat{Q}(b,a)+C\sqrt{\frac{\log N(b)}{N(b,a)}}$ is the standard UCB1 bound for the $Q$-value $Q(b,a)$, and whenever this upper bounds $Q(b,a)$, we have $U(b,a)\geq Q(b,a')$ for any $a'\in P$, because $U(b,a)\geq Q(b,a)+L\,\mathrm{diam}(P)\geq Q(b,a')$, where the last inequality holds due to the Lipschitz assumption. Since $L$ is unknown, we try different values of $L$ in our experiments and choose the best.
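The bound in eq. 7 is cheap to evaluate from per-node statistics. Below is a small Python sketch of the selection rule in eq. 6 under an assumed node layout (the field names Q, N and diam are hypothetical); it is a sketch, not the authors' implementation.

import math

# Sketch of the action-selection rule in eqs. 6-7 (assumed node layout:
# node.N is N(b); node.children maps an action to its statistics).
def select_action(node, C, L):
    best_action, best_u = None, -math.inf
    for action, stats in node.children.items():
        if stats["N"] == 0:
            return action                                      # unvisited actions have U(b, a) = infinity
        u = (stats["Q"]                                        # \hat{Q}(b, a)
             + C * math.sqrt(math.log(node.N) / stats["N"])    # UCB1 exploration term
             + L * stats["diam"])                              # Lipschitz term over the leaf cell
        if u > best_u:
            best_action, best_u = action, u
    return best_action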

4.2 Backup

Algorithm 3 Backup(Belief tree $\mathcal{T}$, Belief node $b'$, Action $a$, Reward $r$)
1: $b =$ Parent node of $b'$ in $\mathcal{T}$
2: $\widehat{Q}(b,a)\leftarrow\widehat{Q}(b,a)+\frac{r+\gamma\widehat{V}^{*}(b')-\widehat{Q}(b,a)}{N(b,a)}$
3: $\widehat{V}^{*}(b)=\max_{a\in\mathcal{A}(b)}\widehat{Q}(b,a)$

After sampling an episode $e$, ADVT updates the estimates $\widehat{Q}(b,a)$ as well as the statistics $N(b)$ and $N(b,a)$ along the sequence of beliefs visited by the episode. To update $\widehat{Q}(b,a)$, we use a stochastic version of the Bellman backup (Algorithm 3): Suppose $r$ is the immediate reward sampled by the episode after selecting $a$ from $b$. We then update $\widehat{Q}(b,a)$ according to

\widehat{Q}(b,a)\leftarrow\widehat{Q}(b,a)+\frac{r+\gamma\widehat{V}^{*}(b')-\widehat{Q}_{i}(b,a)}{N(b,a)}, (8)

where $b'$ is the child of $b$ in the belief tree $\mathcal{T}$ via edge $(a,o)$, i.e., the belief we arrived at after performing action $a\in\mathcal{A}$ and perceiving observation $o\in\mathcal{O}$ from $b$, and $\widehat{Q}_{i}(b,a)$ is the previous estimate of $Q(b,a)$. This rule is in contrast to POMCP, POMCPOW and VOMCPOW, where the $Q$-value estimates are updated via Monte Carlo backup, i.e.,

\widehat{Q}(b,a)\leftarrow\widehat{Q}(b,a)+\frac{r+\gamma\widehat{V}_{e}(b')-\widehat{Q}_{i}(b,a)}{N(b,a)}, (9)

where $\widehat{V}_{e}(b')$ is the total discounted reward of episode $e$, starting from belief $b'$.

Our update rule in eq. 8 is akin to the rule used in $Q$-Learning (Watkins and Dayan, 1992) and was implemented in the ABT software (Hoerger et al., 2018; Klimenko et al., 2014), though never explicitly compared with Monte Carlo backup.

The update rule in eq. 8 helps ADVT to focus its search on promising parts of the belief tree, particularly for problems where good rewards are sparse. For sparse-reward problems, the values of good actions deep in the search tree tend to get averaged out near the root when using Monte Carlo backups, thus their influence on the action values at the current belief diminishes. In contrast, since stochastic Bellman backups always backpropagate the current largest action value for a visited belief, large action values deep in the search tree have a larger influence on the action values near the root. As we will demonstrate in the experiments, using stochastic Bellman backups instead of Monte Carlo backups can lead to a significant performance benefit for many POMDP problems.
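To make the difference between the two update rules concrete, here is a minimal Python sketch of eq. 8 next to eq. 9, using assumed node fields (Q, N_a and V are hypothetical names); it is a sketch, not the authors' implementation.

# Sketch of the stochastic Bellman backup (eq. 8) and the Monte Carlo
# backup (eq. 9); b.Q maps actions to value estimates, b.N_a to visit counts.
def bellman_backup(b, a, r, b_next, gamma):
    # Target bootstraps from the child's current best estimate \hat{V}^*(b').
    target = r + gamma * b_next.V
    b.Q[a] += (target - b.Q[a]) / b.N_a[a]
    b.V = max(b.Q.values())          # update \hat{V}^*(b), as in Algorithm 3

def monte_carlo_backup(b, a, r, return_from_b_next, gamma):
    # Target uses the sampled discounted return of the episode from b' onwards.
    b.Q[a] += (r + gamma * return_from_b_next - b.Q[a]) / b.N_a[a]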

4.3 Observation Sampling Strategy

Algorithm 4 Step(Belief node $b$, State $s$, Action $a$)
1: $(s', o, r) = G(s, a)$ ▷ Generative model
2: $b' =$ null
3: if $\mathcal{O}$ is discrete then
4:     $b' =$ Child node of $b$ via edge $(a, o)$. (If no such child exists, create one)
5: else
6:     if $|\mathcal{O}(b,a)| > k_{o}N(b,a)^{\alpha_{o}}$ then
7:         $o =$ Sample uniformly at random from $\mathcal{O}(b,a)$
8:     end if
9:     $b' =$ Child node of $b$ via edge $(a, o)$. (If no such child exists, create one)
10:     $w = Z(s', a, o)$
11:     $b' = b' \cup \{(s', w)\}$
12:     $s' \sim b'$
13:     $r = R(s, a, s')$
14: end if
15: return $(b', s', o, r)$

During the episode-sampling process, when ADVT selects an action $a$ at a belief $b$ according to eq. 6, we must sample an observation to advance the search to the next belief. ADVT uses different observation sampling strategies, depending on whether the observation space $\mathcal{O}$ is discrete or continuous. Algorithm 4 summarizes ADVT's observation sampling strategy for both discrete and continuous observation spaces.

For discrete observation spaces, ADVT follows the common strategy of sampling an observation from the generative model, given the current state of the episode and the selected action, and representing each sampled observation via an observation edge in the search tree. Since the number of possible observations is finite in the discrete setting, this strategy typically works well for moderately sized observation spaces.

For continuous observation spaces, however, the above strategy is unsuitable. In this setting, each sampled observation is generally unique, which leads to a possibly infinite number of observations that need to be represented in the search tree. As a consequence, we cannot expand the search beyond the first step, resulting in policies that are too myopic. Thus, to handle continuous observation spaces, we adopt a strategy similar to the one used by POMCPOW (Sunberg and Kochenderfer, 2018) and VOMCPOW (Lim et al., 2021). This strategy consists of two components. The first component is to apply Progressive Widening (Couëtoux et al., 2011) to limit the number of sampled observation edges per action edge as a function of $N(b,a)$, i.e., the number of times we selected $a$ from $b$. In particular, let $\mathcal{O}(b,a)$ be the set of observation children of action $a$ at belief $b$. We sample a new observation as a child of $a$ whenever $|\mathcal{O}(b,a)|$ satisfies $|\mathcal{O}(b,a)|\leq k_{o}N(b,a)^{\alpha_{o}}$, where $k_{o}\geq 0$ and $\alpha_{o}\geq 0$ are user-defined parameters that control the rate at which new observations are added to the tree. If this condition is not satisfied, we uniformly sample an observation from $\mathcal{O}(b,a)$. This enables ADVT to visit sampled observation edges multiple times and expand the search tree beyond one step. The second component is to use an explicit representation of each sampled belief in the search tree via a set of weighted particles $\{(s_{i},w_{i})\}_{i=1}^{n}$, which allows us to obtain state samples that are correctly distributed according to the beliefs. When sampling an episode, suppose ADVT selects action $a$ from belief $b$, samples a next state $s'$ from the generative model and selects an observation $o$ using the strategy above. The weight corresponding to $s'$ is then computed according to $w=Z(s',a,o)$ and $(s',w)$ is added to the particle set representing belief $b'$, whose parent is $b$ via the edge $(a,o)$. To obtain a next state $s'$ that is distributed according to $b'$, we resample $s'$ from $b'$ with a probability proportional to the particle weights and continue from $b'$.
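A condensed Python sketch of this two-component strategy is shown below (illustrative only; the helper names obs_children, get_or_create_child and resample are assumptions, and the widening criterion follows Algorithm 4).

import random

# Sketch of observation progressive widening with a weighted-particle belief
# representation (continuous observation spaces); follows Algorithm 4.
def step_continuous_obs(b, s, a, G, Z, k_o, alpha_o):
    s_next, o, r = G(s, a)
    if len(b.obs_children[a]) > k_o * b.N_a[a] ** alpha_o:
        o = random.choice(list(b.obs_children[a]))   # reuse an existing observation edge
    b_next = b.get_or_create_child(a, o)             # child belief node via edge (a, o)
    w = Z(s_next, a, o)                              # particle weight
    b_next.particles.append((s_next, w))
    s_next = resample(b_next.particles)              # draw s' ~ b' for the next step
    return b_next, s_next, o, r                      # (Algorithm 4 also recomputes r = R(s, a, s'))

def resample(particles):
    states, weights = zip(*particles)
    return random.choices(states, weights=weights, k=1)[0]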

Note that, in contrast to online POMDP solvers for discrete observation spaces that only require a black-box model to sample observations, we additionally require access to the observation function $Z(s,a,o)$. However, this is a common assumption for solvers that are designed for continuous observation spaces (Sunberg and Kochenderfer, 2018; Hoerger and Kurniawati, 2021).

Additionally, note that the observation sampling strategy for continuous observation spaces could also be applied to discrete observation spaces. However, for discrete observation spaces, this introduces unnecessary computational overhead. As discussed above, the purpose of explicit belief representations is to obtain state samples that are correctly distributed according to the beliefs. For discrete observation spaces, this is achieved by sampling a next state $s'$ according to the transition model $T(s,a,s')$ and an observation $o$ according to the observation model $Z(s',a,o)$, and then using $s'$ as a sample from the belief $b'=\tau(b,a,o)$. Since the sampled observations are always distributed according to $Z(s',a,o)$, $s'$ is a sample of $b'$. This is in contrast to continuous observation spaces, where ADVT samples observations from a distribution that is potentially different from $Z(s',a,o)$ (line 7 in Algorithm 4). Therefore, ADVT handles discrete and continuous observation spaces separately and avoids the weighting and resampling steps in the discrete case, thus saving computation time.

5 ADVT: Construction and Refinement of Voronoi Trees

Algorithm 5 RefineVoronoiTree(Voronoi tree $\mathcal{H}(b)$, Action $a$)
1: $(a, P) =$ leaf node of $\mathcal{H}(b)$ with its action component being $a$
2: if $C_{r}N(b,a) \geq 1/\mathrm{diam}(P)^{2}$ then
3:     $a' =$ sample from $P$
4:     $(P_{1}, P_{2}) =$ Child cells of $P$ induced by $a$ and $a'$
5:     Compute diameters of $P_{1}$ and $P_{2}$
6:     Add $(a, P_{1})$ and $(a', P_{2})$ as $(a, P)$'s children
7: end if

For each belief node $b$ in the belief tree, its Voronoi tree $\mathcal{H}(b)$ is a BSP tree for $\mathcal{A}$. Each node in $\mathcal{H}(b)$ consists of a pair $(a,P)$ with $P\subseteq\mathcal{A}$ and $a\in P$ the representative action of $P$, and each non-leaf node is partitioned into two child nodes. The partition of each cell in a Voronoi tree is a Voronoi diagram for two actions sampled from the cell.

Figure 1: Illustration of the relation between a belief tree $\mathcal{T}$ (left), the Voronoi tree $\mathcal{H}(b)$ associated with belief $b$ (middle) and the partition of the action space induced by the Voronoi tree (right).

To construct $\mathcal{H}(b)$, ADVT first samples an action $a_{0}$ uniformly at random from the action space $\mathcal{A}$, and sets the pair $(a_{0},\mathcal{A})$ as the root of $\mathcal{H}(b)$. When ADVT decides to expand a leaf node $(a,P)$, it first samples an action $a'$ uniformly at random from $P\subseteq\mathcal{A}$. ADVT then implicitly constructs the Voronoi diagram between $a$ and $a'$ within the cell $P$, splitting it into two regions: one is $P_{1}$, the set of all actions $a''\in P$ for which $d(a'',a)\leq d(a'',a')$, and the other is $P_{2}=P\backslash P_{1}$. The nodes $(a,P_{1})$ and $(a',P_{2})$ are then inserted as children of $(a,P)$ in $\mathcal{H}(b)$. The leaf nodes of $\mathcal{H}(b)$ form the partition of the action space $\mathcal{A}$ used by belief $b$, while the finite action subset $\mathcal{A}(b)\subset\mathcal{A}$ used to find the best action from $b$ is the set of actions associated with the leaves of $\mathcal{H}(b)$. Figure 1 illustrates the relationship between a belief, the Voronoi tree $\mathcal{H}(b)$ and the partition of $\mathcal{A}$.
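The sketch below (illustrative Python, not the authors' C++ code) shows one way to represent a Voronoi-tree node implicitly: a cell stores only its representative action and the representative of its sibling, the split operation creates the two child cells induced by a newly sampled action, and membership in a cell is resolved by distance comparisons along the path from the cell to the root, as used later in Section 5.2.

import numpy as np

# Sketch of an implicitly represented Voronoi-tree node (illustrative only).
class VoronoiNode:
    def __init__(self, action, parent=None, sibling_action=None):
        self.action = np.asarray(action, dtype=float)  # representative action of the cell
        self.parent = parent
        self.sibling_action = sibling_action           # inducing point of the sibling cell
        self.children = []

    def contains(self, a):
        # a lies in this cell iff, at every level from this cell up to the root,
        # a is closer to that level's representative than to its sibling's.
        node = self
        while node.parent is not None:
            if np.linalg.norm(a - node.action) > np.linalg.norm(a - node.sibling_action):
                return False
            node = node.parent
        return True

    def split(self, a_new):
        # Implicitly partition the cell by the Voronoi diagram of (action, a_new).
        a_new = np.asarray(a_new, dtype=float)
        left = VoronoiNode(self.action, parent=self, sibling_action=a_new)
        right = VoronoiNode(a_new, parent=self, sibling_action=self.action)
        self.children = [left, right]
        return left, right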

Voronoi trees have a number of representational and computational advantages compared to existing partitioning methods, such as hierarchical rectangular partitions (Bubeck et al., 2011; Mansley et al., 2011) or Voronoi diagrams (Kim et al., 2020; Lim et al., 2021). In contrast to hierarchical rectangular partitions, Voronoi trees are much more adaptive to the spatial locations of sampled actions, since the geometries of the cells are induced by the sampled actions. Furthermore, in contrast to Voronoi diagrams, the hierarchical structure of Voronoi trees allows us to derive efficient algorithms for estimating the diameters of the cells (discussed in Section 5.2) and sampling new actions from a cell (detailed in Section 5.3). For Voronoi diagrams, the diameter computation can be prohibitively expensive in the context of online POMDP planning, since each sampled action results in a re-partitioning of the action space; thus, the diameters of the cells have to be re-computed from scratch. Voronoi trees combine a hierarchical representation of the partition with an implicit construction of the cells via sampled actions, thereby achieving both adaptivity and computational efficiency.

The next section describes how ADVT decides which node of (b)\mathcal{H}(b) to expand.

5.1 Refining the Partition

ADVT decides how to refine the partitioning $\mathcal{H}(b)$ in two steps. First, it selects a leaf node of $\mathcal{H}(b)$ to be refined next. This step relies on the action selection strategy used for expanding the belief tree $\mathcal{T}$ (Section 4.1). The selected leaf node $(a,P)$ of $\mathcal{H}(b)$ is the unique leaf node with $a$ chosen according to eq. 6.

In the second step, ADVT decides if the cell $P$ should indeed be refined. This decision is based on the quality of the estimate $\widehat{Q}(b,a)$, as reflected by the number of samples used to estimate $\widehat{Q}(b,a)$, and the variation of the $Q$-values for the actions contained in $P$, as reflected by the diameter of $P$. Specifically, ADVT refines the cell $P$ only when the following criterion is satisfied:

C_{r}N(b,a)\geq\frac{1}{\mathrm{diam}(P)^{2}}, (10)

where $C_{r}$ is an exploration constant. $N(b,a)$, i.e., the number of times that $a$ has been selected at $b$, provides a rough estimate of the quality of the $\widehat{Q}(b,a)$ estimate, whereas $\mathrm{diam}(P)$ serves as a measure of the variation of the $Q$-value function within $P$ due to the Lipschitz assumption. This criterion, which is inspired by (Touati et al., 2020), limits the growth of the finite set of candidate actions $\mathcal{A}(b)$ and ensures that a cell is only refined when its corresponding action has been played sufficiently often. Larger values of $C_{r}$ cause cells to be refined earlier, thereby encouraging exploration.
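In code, this amounts to a one-line test on the selected leaf, followed by the split and diameter estimation described in the following subsections. A Python sketch, reusing the hypothetical VoronoiNode class above (the diam field and the two helper callables are assumptions):

# Sketch of the refinement step: apply eq. 10 to the selected leaf and, if it
# fires, split the cell and estimate the diameters of the two new cells.
def refine_if_needed(leaf, N_ba, C_r, sample_in_cell, estimate_diameter):
    # sample_in_cell and estimate_diameter are callables, e.g. closures over the
    # action-space bounds (Sections 5.2 and 5.3).
    if C_r * N_ba >= 1.0 / leaf.diam ** 2:      # refinement criterion, eq. 10
        a_new = sample_in_cell(leaf)            # e.g. Hit & Run, Section 5.3
        left, right = leaf.split(a_new)         # implicit Voronoi split
        left.diam = estimate_diameter(left)     # sampling-based estimate, Section 5.2
        right.diam = estimate_diameter(right)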

Our refinement strategy is highly adaptive, in the sense that we use local information (i.e., the diameters of the cells, induced by the representative actions, and the quality of the $Q$-value estimates of the representative actions) to influence the choice of the cell to be partitioned and when the chosen cell is partitioned, while the geometries of our cells are dependent on the sampled actions. This strategy is in contrast to other hierarchical decompositions, such as those used in HOO and HOOT, where the cell that corresponds to an action is refined immediately after the action is selected for the first time, which generally means the $Q$-value of an action is estimated based only on a single play of the action, which is grossly insufficient for our problem. In addition, our strategy is more adaptive than VOMCPOW. For VOMCPOW, the decision on when to refine the partition is solely based on the number of times a belief has been visited and takes neither the quality of the $Q$-value estimates nor the diameters of the Voronoi cells into account. Furthermore, VOMCPOW's refinement strategy is more global, in the sense that each sampled action results in a different Voronoi diagram of the action space. On the other hand, ADVT's strategy is much more local, since each refinement only affects a single cell.

5.2 Estimating the Voronoi Cell Diameters

ADVT uses the diameters of the cells in the action selection strategy and the cell refinement rule, but computing these diameters exactly is challenging in high-dimensional spaces. We give an efficient approximation algorithm for computing the Voronoi cell diameters below.

Since the cells in $\mathcal{H}(b)$ are only implicitly defined, we use a sampling-based approach to approximate a cell's diameter. Suppose we want to estimate the diameter of the cell $P$ corresponding to the node $(a,P)$ of the Voronoi tree $\mathcal{H}(b)$. Then, we first sample a set of $k$ boundary points $\mathcal{A}_{P}(b)$ of $P$, where $k$ is a user-defined parameter. In our experiments, we typically set $k$ to be between 20 and 50. To sample a boundary point $a_{P}\in\mathcal{A}_{P}(b)$, we first sample a point $\alpha$ uniformly at random on the sphere centered at $a$ with diameter $\mathrm{diam}(\mathcal{A})$, which can be easily computed for our benchmark problems. The point $\alpha$ lies either on the boundary or outside of $P$. We then use the Bisection method (Burden et al., 2016) with $a$ and $\alpha$ as the initial end-points, until the two end-points are less than a small threshold $\epsilon$ away from each other, but one still lies inside $P$ and the other outside $P$. The point that lies inside $P$ is then a boundary point $a_{P}$. The diameter of a bounding sphere that encloses all the sampled boundary points in $\mathcal{A}_{P}(b)$ (Welzl, 1991) is then an approximation of the diameter of $P$.

The above strategy to estimate the diameter of a cell requires us to determine whether a sampled action $a$ lies inside or outside a cell $P$. Fortunately, a Voronoi tree allows us to determine if $a\in P$ easily. Observe that for each cell $P$, we have that $P\subseteq\mathrm{PARENT}(P)$, where $\mathrm{PARENT}(P)$ is the cell associated with the parent node of $P$ in the Voronoi tree. As a result, $a\in P$ implies that $a\in\mathrm{PARENT}(P)$. Therefore, to check whether $a\in P$, it is sufficient to check if $a$ is contained in all cells associated with the path in the Voronoi tree from the root to $P$. Suppose $\zeta=\left((a_{0},P_{0}),(a_{1},P_{1}),\ldots,(a_{L},P_{L})\right)$ is the sequence of nodes corresponding to a path in $\mathcal{H}(b)$ and we want to check whether $a\in P_{L}$. The point $a$ is inside $P_{L}$ if, for each $(a_{i},P_{i})\in\zeta$, we have that $d(a,a_{i})\leq d(a,a'_{i})$, where $a'_{i}$ is the inducing point of the cell corresponding to the sibling node of $(a_{i},P_{i})$, and $d(\cdot,\cdot)$ is the distance function on the action space. If this condition is not satisfied, $a$ is outside $P_{L}$, i.e., $a\notin P_{L}$.
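A compact Python sketch of the diameter estimator follows. It is illustrative only: it assumes a box-shaped action space with bounds lo and hi, reuses the contains test of the hypothetical VoronoiNode class above, starts the bisection ray at a point guaranteed to lie outside the bounded action space, and replaces Welzl's minimum enclosing sphere with the maximum pairwise distance between boundary points for brevity.

import numpy as np

# Sketch of the sampling-based cell-diameter estimate (Section 5.2).
def in_cell(cell, x, lo, hi):
    return bool(np.all(x >= lo) and np.all(x <= hi) and cell.contains(x))

def sample_boundary_point(cell, start, lo, hi, eps=1e-3, rng=None):
    rng = rng or np.random.default_rng()
    direction = rng.normal(size=lo.shape[0])
    direction /= np.linalg.norm(direction)
    inside = np.asarray(start, dtype=float)
    outside = inside + np.linalg.norm(hi - lo) * direction  # outside the action space
    while np.linalg.norm(outside - inside) > eps:           # bisection along the ray
        mid = 0.5 * (inside + outside)
        if in_cell(cell, mid, lo, hi):
            inside = mid
        else:
            outside = mid
    return inside                                           # approximate boundary point of P

def estimate_diameter(cell, lo, hi, k=30):
    pts = [sample_boundary_point(cell, cell.action, lo, hi) for _ in range(k)]
    return max(float(np.linalg.norm(p - q)) for p in pts for q in pts)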

To further increase the computational efficiency of our diameter estimator, we re-use the sampled boundary points $\mathcal{A}_{P}(b)$ when ADVT decides to split $P$ into two child cells $P'$ and $P''$. Suppose the diameter of $P$ was estimated using $k$ boundary points. Since every point in $\mathcal{A}_{P}(b)$ lies either on the boundary of $P'$ or $P''$, we divide $\mathcal{A}_{P}(b)$ into $\mathcal{A}_{P'}(b)$ and $\mathcal{A}_{P''}(b)$, such that $\mathcal{A}_{P'}(b)=\left\{a\in\mathcal{A}_{P}(b)\ |\ d(a,a')\leq d(a,a'')\right\}$ and $\mathcal{A}_{P''}(b)=\mathcal{A}_{P}(b)\backslash\mathcal{A}_{P'}(b)$, where $a'$ and $a''$ are the inducing points of $P'$ and $P''$ respectively. This will leave us with $|\mathcal{A}_{P'}(b)|\leq k$ and $|\mathcal{A}_{P''}(b)|\leq k$ boundary points for cells $P'$ and $P''$ respectively. For both $\mathcal{A}_{P'}(b)$ and $\mathcal{A}_{P''}(b)$ we then sample $k-|\mathcal{A}_{P'}(b)|$ and $k-|\mathcal{A}_{P''}(b)|$ additional boundary points using the method above, such that $|\mathcal{A}_{P'}(b)|=k$ and $|\mathcal{A}_{P''}(b)|=k$. Using this method, we only need to sample $k$ new boundary points instead of $2k$ when ADVT decides to split the cell $P$, leading to increased computational efficiency.

5.3 Sampling from the Voronoi Cells

To sample an action that is approximately uniformly distributed in a cell $P$, we use a simple Hit & Run approach (Smith, 1984) that performs a random walk within $P$. Suppose $P$ is the cell corresponding to the node $(a,P)$ of the Voronoi tree $\mathcal{H}(b)$. We first sample an action $a_{P}$ on the boundary of $P$ using the method described in Section 5.2. Subsequently, we take a random step from $a$ in the direction towards $a_{P}$, resulting in a new action $a'\in P$. We then use $a'$ as the starting point, and iteratively perform this process for $m$ steps, which gives us a point that is approximately uniformly distributed in $P$.
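A minimal Python sketch of this sampler, reusing the hypothetical sample_boundary_point helper above; the choice to shoot each boundary ray from the current point is an assumption about the implementation.

import numpy as np

# Sketch of the Hit & Run sampler for drawing an approximately uniform action
# from a Voronoi cell (Section 5.3); illustrative only.
def hit_and_run(cell, lo, hi, m=10, rng=None):
    rng = rng or np.random.default_rng()
    a = np.asarray(cell.action, dtype=float)              # start at the representative action
    for _ in range(m):
        a_boundary = sample_boundary_point(cell, a, lo, hi)
        a = a + rng.uniform(0.0, 1.0) * (a_boundary - a)  # random step towards the boundary
    return a                                              # stays in P, since each cell is convex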

6 Experiments and Results

We evaluated ADVT on 5 robotics tasks, formulated as continuous-action POMDPs. The following section provides details regarding the tasks, while Table 1 summarizes the state, action and observation spaces for each problem scenario.

6.1 Problem Scenarios

Figure 2: Illustrations of (a) the Pushbox2D, (b) the Parking2D, (c) the SensorPlacement-8, (d) the VDP-Tag and (e) the LunarLander problems. Goal regions are marked as green circles or a green flag.
Table 1: Summary of the state, action and observation spaces for each problem scenario.
Problem | $\mathcal{S}$ | $\mathcal{A}$ | $\mathcal{O}$
Pushbox2D | $\mathbb{R}^{4}$ | $[-1,1]^{2}$ | 24 observations
Pushbox3D | $\mathbb{R}^{6}$ | $[-1,1]^{3}$ | 288 observations
Parking2D | $\mathbb{R}^{4}$ | $[-\frac{\pi}{2},\frac{\pi}{2}]\times[-8,8]$ | 4 observations
Parking3D | $\mathbb{R}^{5}$ | $[-\frac{\pi}{2},\frac{\pi}{2}]\times[-8,8]\times[-\frac{7}{2},\frac{7}{2}]$ | 4 observations
SensorPlacement-$D$ | $\mathbb{R}^{D}$ | $\mathbb{R}^{D}$ | 4 observations
VDP-Tag | $\mathbb{R}^{4}$ | $[0,2\pi)\times\{0,1\}$ | $\mathbb{R}^{8}$
LunarLander | $\mathbb{R}^{6}$ | $\mathbb{R}^{+}\times\mathbb{R}$ | $\mathbb{R}^{3}$

6.1.1 Pushbox

Pushbox is a scalable motion planning problem proposed in (Seiler et al., 2015) which is motivated by air hockey. A disk-shaped robot (blue disk in Figure 2(a)) has to push a disk-shaped puck (red disk in Figure 2(a)) into a goal region (green circle in Figure 2(a)) by bumping into it, while avoiding any collision of itself and the puck with a boundary region (black region in Figure 2(a)). The robot receives a reward of 1,000 when the puck is pushed into the goal region, while it receives a penalty of -1,000 if the robot or the puck collides with the boundary region. Additionally, the robot receives a penalty of -10 for every step. The robot can move freely in the environment by choosing a displacement vector. Upon bumping into the puck, the puck is pushed away and the motion of the puck is affected by noise. We consider two variants of the problem, Pushbox2D and Pushbox3D, that differ in the dimensionality of the state and action spaces. For the Pushbox2D problem (illustrated in Figure 2(a)), the robot and the puck operate on a 2D-plane, whereas for Pushbox3D, both the robot and the puck operate inside a 3D-environment. Let us first describe the Pushbox2D variant. The state space consists of the $xy$-locations of both the robot and the puck, i.e., $\mathcal{S}=\mathbb{R}^{4}$, while the action space is defined by $\mathcal{A}=[-1,1]\times[-1,1]$. If the robot is not in contact with the puck during a move, the state evolves deterministically according to $f(s,a)=(x_{r}+a_{x},y_{r}+a_{y},x_{p},y_{p})^{T}$, where $(x_{r},y_{r})$ and $(x_{p},y_{p})$ are the $xy$-coordinates of the robot and the puck respectively, corresponding to state $s$, and $(a_{x},a_{y})$ is the displacement vector corresponding to action $a$. If the robot bumps into the puck, the next position $(x_{p}',y_{p}')$ of the puck is computed as

\begin{pmatrix}x_{p}'\\ y_{p}'\end{pmatrix}=\begin{pmatrix}x_{p}\\ y_{p}\end{pmatrix}+5r_{s}\left(\begin{pmatrix}a_{x}\\ a_{y}\end{pmatrix}\cdot\vec{n}\right)\left(\vec{n}+\begin{pmatrix}r_{x}\\ r_{y}\end{pmatrix}\right), (11)

where the "$\cdot$" operator denotes the dot product, $\vec{n}$ is the unit directional vector from the center of the robot to the center of the puck at the time of contact, and $r_{s}$ is a random variable drawn from a truncated Gaussian distribution $N(\mu,\sigma^{2},l,u)$, which is the Gaussian distribution $N(\mu,\sigma^{2})$ truncated to the interval $[l,u]$. For our experiments, we used $\mu=1.0$, $\sigma=0.1$, $l=\mu-\sigma=0.9$ and $u=\mu+\sigma=1.1$. The variables $r_{x}$ and $r_{y}$ are random variables drawn from a truncated Gaussian distribution $N(0.0,0.1^{2},-0.1,0.1)$.
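For concreteness, the puck update of eq. 11 can be simulated as in the following Python sketch (illustrative only; the truncated Gaussians are drawn by simple rejection sampling and the constants are those stated above).

import numpy as np

# Sketch of the stochastic puck update in eq. 11 (Pushbox2D).
def truncated_normal(mu, sigma, lo, hi, rng):
    while True:                      # rejection sampling from N(mu, sigma^2) on [lo, hi]
        x = rng.normal(mu, sigma)
        if lo <= x <= hi:
            return x

def push_puck(robot_xy, puck_xy, action_xy, rng=None):
    rng = rng or np.random.default_rng()
    n = puck_xy - robot_xy
    n = n / np.linalg.norm(n)                                  # contact direction \vec{n}
    r_s = truncated_normal(1.0, 0.1, 0.9, 1.1, rng)
    r_xy = np.array([truncated_normal(0.0, 0.1, -0.1, 0.1, rng) for _ in range(2)])
    return puck_xy + 5.0 * r_s * np.dot(action_xy, n) * (n + r_xy)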

The initial position of the robot is known and is set to $x_{r}=5.5$ and $y_{r}=9.5$. The initial puck position, however, is uncertain. Its initial $x_{p}$ and $y_{p}$ coordinates are drawn from a truncated Gaussian distribution $N(5.5,2.0^{2},3.5,7.5)$, but the robot has access to a noisy bearing sensor to localize the puck and a noise-free collision sensor which detects contacts between the robot and the puck. In particular, given a state $s\in\mathcal{S}$, an observation $(o_{c},o_{b})$ consists of a binary component $o_{c}$ which indicates whether or not a contact between the robot and the puck occurred, and a discretized bearing component $o_{b}$ calculated as

o_{b}=\mathrm{floor}\left(\frac{\mathrm{atan2}(y_{o}-y_{r},x_{o}-x_{r})+r_{o}}{\pi/6}\right), (12)

where $x_{r},y_{r}$ and $x_{o},y_{o}$ are the $xy$-coordinates of the robot and the puck corresponding to the state $s$, and $r_{o}$ is a random angle (expressed in radians) drawn from a truncated Gaussian distribution $N(0.0,(\frac{\pi}{18})^{2},-\frac{\pi}{18},\frac{\pi}{18})$. Due to the $\mathrm{floor}$ operator in eq. 12, the number of discretized bearing observations is 12, thus the observation space consists of 24 unique observations.

Pushbox3D is a straightforward extension of the Pushbox2D problem: The state space is defined as $\mathcal{S}=\mathbb{R}^{6}$, consisting of the $xyz$-locations of the robot and the puck. The action space is $\mathcal{A}=[-1,1]\times[-1,1]\times[-1,1]$, where each $a\in\mathcal{A}$ describes a 3D-displacement vector of the robot. The transition dynamics are defined similarly to Pushbox2D, except that all quantities are computed in 3D. The observation space is extended with an additional bearing observation, computed according to eq. 12, but with the term $\mathrm{atan2}(y_{o}-y_{r},x_{o}-x_{r})$ being replaced by $\mathrm{atan2}(z_{o}-z_{r},y_{o}-y_{r})$, where $z_{r}$ and $z_{o}$ are the $z$-coordinates of the robot and the puck respectively. The reward function is the same as in the Pushbox2D problem.

For both variants of the problem the discount factor is $\gamma=0.95$ and a run is considered successful if the robot manages to push the puck into a goal region within 50 steps, while avoiding collisions of itself and the puck with the boundary region.

6.1.2 Parking

An autonomous vehicle with deterministic dynamics operates in a 3D-environment populated by obstacles, shown in Figure 2(b). The goal of the vehicle is to safely navigate to a goal area located between the obstacles while avoiding any collision with the obstacles (black regions in Figure 2(b)). The vehicle receives a reward of 100 when reaching the goal area, while collisions with the obstacles are penalized by -100. Additionally, the vehicle receives a penalty of -1 for every step. We consider two variants of the problem, Parking2D and Parking3D. For Parking2D, the vehicle navigates on a 2D-plane, whereas for Parking3D, the vehicle operates in 3D-space. We first describe the Parking2D variant: The state space is $\mathcal{S}=\mathbb{R}^{4}$ and consists of the $xy$-position of the vehicle on the plane, its orientation $\theta$ and its velocity $v$. The vehicle is controlled via a steering wheel angle $a_{\theta}$ and acceleration $a_{v}$, i.e., the action space is $\mathcal{A}=\Omega\times\Phi$, where $\Omega=[-\frac{\pi}{2},\frac{\pi}{2}]$ is the continuous set of steering wheel angles and $\Phi=[-8,8]$ is the continuous set of accelerations. We assume that for a given state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, the state of the vehicle evolves according to the following deterministic second-order discrete-time dynamics model:

f(x,y,\theta,v,a_{\theta},a_{v})=\begin{bmatrix}x+v\cos(\theta)\Delta\\ y+v\sin(\theta)\Delta\\ \theta+a_{\theta}\Delta\\ v+a_{v}\Delta\end{bmatrix}, (13)

where $x$, $y$, $\theta$ and $v$ are the 2D-position, orientation and velocity corresponding to state $s$, $\Delta=\frac{1}{3}\,\mathrm{s}$ is a fixed control duration, and $a_{\theta}$, $a_{v}$ are the steering wheel angle and acceleration components of the action. To compute the next state $s'$, given $s$ and $a$, we numerically integrate eq. 13 for 3 steps.
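A small Python sketch of this update (illustrative only); it simply applies eq. 13 for three sub-steps of duration $\Delta=\frac{1}{3}$.

import numpy as np

# Sketch of the deterministic Parking2D vehicle update: eq. 13 applied for
# three sub-steps of duration Delta = 1/3.
def parking2d_step(state, action, delta=1.0 / 3.0, substeps=3):
    x, y, theta, v = state
    a_theta, a_v = action
    for _ in range(substeps):
        x += v * np.cos(theta) * delta
        y += v * np.sin(theta) * delta
        theta += a_theta * delta
        v += a_v * delta
    return np.array([x, y, theta, v])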

There are three distinct areas in the environment, each consisting of a different type of terrain (colored areas in Figure 2(b)). Upon traversal, the vehicle receives an observation regarding the terrain type, which is only correct 70% of the time due to sensor noise. If the vehicle is outside the terrains, we assume that it deterministically receives a NULL observation. Initially, the vehicle starts near one of three possible starting locations (red areas in Figure 2(b)) with equal probability. The exact initial position of the vehicle along the horizontal $y$-axis is then drawn uniformly from $U[-0.175,0.175]$ around the starting location. For Parking3D, the vehicle operates in the full 3D space, and we have additional continuous state and action components that model the vehicle's elevation and change in elevation respectively. We assume that the elevation of the vehicle changes according to $z+\Delta a_{h}$, where $z\in\mathbb{R}$ is the elevation component of the state, and $a_{h}\in[-\frac{7}{2},\frac{7}{2}]$ is the elevation-change component of the action. The discount factor for both variants is $\gamma=0.95$ and a run is considered successful if the vehicle enters the goal area within 50 steps while avoiding collisions with the obstacles.

Two properties make this problem challenging. The first is the multi-modal beliefs, which require the vehicle to traverse the different terrains for a sufficient amount of time to localize itself before attempting to reach the goal. The second is that, due to the narrow passage that leads to the goal, small perturbations from the optimal action can quickly result in collisions with the obstacles. As a consequence, good rewards are sparse and a POMDP solver must discover them quickly in order to compute a near-optimal strategy.

6.1.3 SensorPlacement

We propose a scalable motion planning under uncertainty problem in which a $D$-DOF manipulator with $D$ revolute joints operates in muddy water inside a 3D environment. The robot is located in front of a marine structure, represented as four distinct walls, and its task is to mount a sensor at a particular goal area between the walls (rewarded with 1,000) while having imperfect information regarding its exact joint configuration. To localize itself, the robot's end-effector is equipped with a touch sensor. Upon touching a wall, it provides noise-free information regarding which wall is being touched, while we assume that the sensor deterministically outputs a NULL observation in a contact-free state. However, in order to avoid damage, the robot must avoid collisions (penalized by -500) between any of its other links and the walls. The state space of the robot consists of the set of joint angles for each joint. The action space is $\mathcal{A}\subset\mathbb{R}^{D}$, where $a\in\mathcal{A}$ is a vector of joint velocities. Due to underwater currents, the robot is subject to random control errors and the joint angles corresponding to a state evolve according to

\theta' = \theta + a + r, \qquad (14)

where $\theta$ is the set of joint angles corresponding to the current state, and $r$ is a random vector sampled from a multivariate Gaussian distribution $N(\mathbf{0},\sigma^{2}I)$, where $I$ is the identity matrix of size $D$ and $\sigma^{2}=10^{-3}$. Initially the robot is uncertain regarding its exact joint-angle configuration. We assume that the initial joint angles are distributed uniformly according to $U[\theta_{l},\theta_{u}]$, where $\theta_{l}=\theta_{0}-h$ and $\theta_{u}=\theta_{0}+h$, with $\theta_{0}$ corresponding to the configuration where all joint angles are zero, except for the second and third joint whose angles are $-1.57$ and $1.57$ respectively, and $h=(0.1,\ldots,0.1)\in\mathbb{R}^{D}$ (units are in radians). We consider four variants of the problem, denoted as SensorPlacement-$D$, with $D\in\{6,8,10,12\}$, that differ in the degrees of freedom (number of revolute joints and thus the dimensionality of the action space) of the manipulator. Figure 2(c) illustrates the SensorPlacement-8 problem, where the colored areas represent the walls and the green sphere represents the goal area. The discount factor is $\gamma=0.95$ and a run is considered successful if the manipulator mounts the sensor within 50 steps while avoiding collisions with the walls. To successfully mount the sensor at the target location, the robot must use its touch sensor to carefully reduce uncertainty regarding its configuration. This is challenging, since a slight variation in the performed actions can quickly lead to collisions with the walls.
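As an illustration, a minimal sketch of the transition and initial-belief sampling described above might look as follows (variable names and the use of NumPy are our own choices, not part of the problem specification):

```python
import numpy as np

D = 8            # degrees of freedom (e.g. SensorPlacement-8)
SIGMA_SQ = 1e-3  # variance of the control error

def sensor_placement_step(theta, a, rng):
    """Sketch of eq. 14: next joint angles are the current joint angles
    plus the commanded joint velocities plus Gaussian control error."""
    r = rng.multivariate_normal(np.zeros(D), SIGMA_SQ * np.eye(D))
    return theta + a + r

def sample_initial_state(rng):
    """Sketch of the initial belief: uniform over [theta_0 - h, theta_0 + h]."""
    theta_0 = np.zeros(D)
    theta_0[1], theta_0[2] = -1.57, 1.57   # second and third joint
    h = 0.1 * np.ones(D)
    return rng.uniform(theta_0 - h, theta_0 + h)

rng = np.random.default_rng(0)
theta = sample_initial_state(rng)
theta_next = sensor_placement_step(theta, a=0.05 * np.ones(D), rng=rng)
```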

6.1.4 Van Der Pol Tag

Van Der Pol Tag (VDP-Tag) is a benchmark problem introduced in (Sunberg and Kochenderfer, 2018) in which an agent (blue particle in Figure 2(d)) operates in a 2D environment. The goal is to tag a moving target (red particle in Figure 2(d)) whose motion is described by the Van Der Pol differential equation and disturbed by Gaussian noise with standard deviation $\sigma=0.05$. Initially, the position of the target is unknown. The agent travels at a constant speed but can pick its direction of travel at each step and decide whether to activate a costly range sensor, i.e., the action space is $\mathcal{A}=[0,2\pi)\times\{0,1\}$, where the first component is the direction of travel and the second component is the activation/deactivation of the range sensor. The robot receives observations from its range sensor via 8 beams (i.e., $\mathcal{O}=\mathbb{R}^{8}$) that measure the agent's distance to the target if the target is within the beam's range. These measurements are more accurate when the range sensor is active. While the target moves freely in the environment, the agent's movements are restricted by four obstacles, shown in Figure 2(d). Catching the target is rewarded with $100$, while activating the range sensor is penalized by $-5$. Additionally, each step incurs a penalty of $-1$. The discount factor is $\gamma=0.95$ and a run is considered successful if the agent catches the target within 50 steps. More details of the VDP-Tag problem can be found in (Sunberg and Kochenderfer, 2018).

Note that in this problem, the action space consists of two disconnected components, namely $[0,2\pi)\times\{0\}$ and $[0,2\pi)\times\{1\}$. To allow a Voronoi tree $\mathcal{H}(b)$ in ADVT to cover the entire action space, we ensure that once the root node of $\mathcal{H}(b)$ is split, the range-sensor component of the representative actions of the two resulting child cells is set to $0$ and $1$ respectively, such that one child cell covers the component $[0,2\pi)\times\{0\}$ and the other covers the component $[0,2\pi)\times\{1\}$.
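A minimal sketch of this root split (purely illustrative; the function and variable names are ours and not part of ADVT's implementation) is:

```python
import numpy as np

def split_vdp_root(rng):
    """Illustrative root split for VDP-Tag: the direction-of-travel
    component of each representative action is sampled freely, while
    the range-sensor component is pinned to 0 and 1 respectively, so
    that each child cell covers one connected component of the action
    space [0, 2*pi) x {0, 1}."""
    rep_off = np.array([rng.uniform(0.0, 2.0 * np.pi), 0.0])  # child covering sensor off
    rep_on = np.array([rng.uniform(0.0, 2.0 * np.pi), 1.0])   # child covering sensor on
    return rep_off, rep_on

rep_off, rep_on = split_vdp_root(np.random.default_rng(0))
```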

6.1.5 LunarLander

This problem is a partially observable adaptation of Atari's Lunar Lander game. For this problem, the state, action and observation spaces are all continuous. The objective is to control a lander vehicle to safely land at a target zone located on the moon's surface. The lander operates on an $xy$-plane and its state is a 6D vector $(x,y,\theta,\dot{x},\dot{y},\dot{\theta})$, where $x\in\mathbb{R}$ and $y\in\mathbb{R}$ are the horizontal and vertical positions of the lander and $\theta\in\mathbb{R}$ its orientation, while $\dot{x}\in\mathbb{R}$, $\dot{y}\in\mathbb{R}$ and $\dot{\theta}\in\mathbb{R}$ represent the lander's horizontal, vertical and angular velocities respectively. The action space is $\mathcal{A}=\Lambda\times\Psi$, where $\Lambda=\mathbb{R}^{+}$ is the set of linear accelerations along the lander's vertical axis, and $\Psi=\mathbb{R}$ is the set of angular accelerations about the lander's geometric center. The initial belief of the lander is a multivariate Gaussian distribution with mean $\mu=(0,10,0,0,-10,0)^{T}$ and covariance matrix $\Sigma=\mathrm{diag}(1.5^{2},1.0^{2},0.1^{2},0^{2},0.5^{2},0.1^{2})$.

We assume that the state of the lander evolves according to the following second-order discrete-time stochastic model:

f(x,y,\theta,\dot{x},\dot{y},\dot{\theta},\tilde{\lambda},\tilde{\psi})=\begin{bmatrix}x+\dot{x}\Delta\\ y+\dot{y}\Delta\\ \theta+\dot{\theta}\Delta\\ \dot{x}+(-\tilde{\lambda}\sin(\theta)M)\Delta\\ \dot{y}+(\tilde{\lambda}\cos(\theta)M-G)\Delta\\ \dot{\theta}+H\tilde{\psi}\Delta\end{bmatrix}, \qquad (15)

where $\tilde{\lambda}=\lambda+e_{\lambda}$ and $\tilde{\psi}=\psi+e_{\psi}$, with $\lambda$ and $\psi$ being the vertical and angular accelerations corresponding to the action, and $e_{\lambda}$ and $e_{\psi}$ are random control errors drawn from zero-mean Gaussian distributions with standard deviations $\sigma_{\lambda}=1\times 10^{-4}$ and $\sigma_{\psi}=5\times 10^{-2}$ respectively. The variable $\Delta=0.2\,\mathrm{s}$ is a constant step size, whereas $M=40$ and $H=2$ are motor constants. The variable $G=-9.81\,\mathrm{m/s^{2}}$ is a constant describing the gravitational acceleration along the $y$-axis. To obtain the next state, given the current state and an action, we numerically integrate the system in eq. 15 for 5 steps.
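A minimal sketch of one transition step follows (our illustration; we assume the control error is drawn once per action and that the discrete update in eq. 15 is simply applied for five sub-steps, which the problem description does not prescribe):

```python
import numpy as np

DT, M, H, G = 0.2, 40.0, 2.0, -9.81   # step size, motor constants, gravity constant (as given in the text)
SIGMA_LAMBDA, SIGMA_PSI = 1e-4, 5e-2  # control-error standard deviations

def lunar_lander_step(state, action, rng, substeps=5):
    """Sketch of eq. 15: perturb the commanded accelerations with
    Gaussian control error, then apply the discrete update `substeps` times."""
    x, y, th, xd, yd, thd = state
    lam, psi = action
    lam_t = lam + rng.normal(0.0, SIGMA_LAMBDA)
    psi_t = psi + rng.normal(0.0, SIGMA_PSI)
    for _ in range(substeps):
        x_n = x + xd * DT
        y_n = y + yd * DT
        th_n = th + thd * DT
        xd_n = xd + (-lam_t * np.sin(th) * M) * DT
        yd_n = yd + (lam_t * np.cos(th) * M - G) * DT
        thd_n = thd + H * psi_t * DT
        x, y, th, xd, yd, thd = x_n, y_n, th_n, xd_n, yd_n, thd_n
    return np.array([x, y, th, xd, yd, thd])
```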

The lander perceives information regarding its state via three noisy sensors: The first two sensors measure the lander’s horizontal and angular velocities, whereas the third sensor measures the distance to the ground along the lander’s vertical axis (dashed line in Figure 2(e)). The readings of all three sensors are disturbed by standard-Gaussian noise.

The reward function is defined by

r_{t}=\begin{cases}-1000,&\text{if }\theta_{t}\geq 0.5\ \text{or }y_{t}<0\\ 100-\left|x_{t}\right|-\left|\theta_{t}\right|-\dot{y}_{t}^{2},&\text{if }y_{t}\leq 0.3\\ -1,&\text{otherwise.}\end{cases} \qquad (16)

The first case in eq. 16 penalizes dangerous attitudes and crashing into the ground. The second case encourages the lander to land safely at $x=0$ with an upright angle and a small vertical velocity. The third case encourages the lander to land as quickly as possible. The discount factor is $\gamma=0.95$ and a run is considered successful if the lander's vertical position reaches $y\leq 0.3$ within 50 steps, without crashing into the ground ($y<0$).
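For concreteness, a direct transcription of eq. 16 into code, assuming the cases are evaluated top-down so that the crash/attitude case takes precedence over the landing case:

```python
def lunar_lander_reward(x_t, y_t, theta_t, y_dot_t):
    """Transcription of the reward in eq. 16, with cases checked top-down."""
    if theta_t >= 0.5 or y_t < 0.0:
        return -1000.0                                          # dangerous attitude or crash
    if y_t <= 0.3:
        return 100.0 - abs(x_t) - abs(theta_t) - y_dot_t ** 2   # landing reward
    return -1.0                                                 # per-step penalty
```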

6.2 Experimental Setup

The purpose of our experiments is three-fold: First is to evaluate ADVT and compare it with two state-of-the-art online POMDP solvers for continuous action spaces, POMCPOW (Sunberg and Kochenderfer, 2018) and VOMCPOW (Lim et al., 2021). The results of these experiments are discussed in Section 6.3.1.

Second is to investigate the importance of the different components of ADVT, specifically the Voronoi tree-based partitioning, the cell-diameter-aware exploration term in eq. 7, and the stochastic Bellman backups. For this purpose, we implemented the original ADVT and three modifications. First is ADVT-R, which replaces the Voronoi decomposition of ADVT with a simple rectangular-based method: Each cell in the partition is a hyper-rectangle that is subdivided by cutting it in the middle along its longest side (with ties broken arbitrarily). The second variant is ADVT (L=0), which is ADVT where eq. 7 reduces to the standard UCB1 bound. For the third variant, ADVT-MC, we replace ADVT's stochastic Bellman backup with the Monte Carlo backup in eq. 9, as used by POMCPOW and VOMCPOW. The results of this study are discussed in Section 6.3.2.
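For reference, a minimal sketch of the rectangular refinement rule used by ADVT-R (our illustration; the function names and the tie-breaking by lowest index are our own choices):

```python
import numpy as np

def split_rectangle(lows, highs):
    """Sketch of ADVT-R's refinement: bisect the hyper-rectangle
    [lows, highs] in the middle of its longest side (ties broken
    arbitrarily; here, by the lowest index)."""
    lows, highs = np.asarray(lows, dtype=float), np.asarray(highs, dtype=float)
    d = int(np.argmax(highs - lows))   # index of the longest side
    mid = 0.5 * (lows[d] + highs[d])
    left_highs, right_lows = highs.copy(), lows.copy()
    left_highs[d] = mid                # left child:  [lows, left_highs]
    right_lows[d] = mid                # right child: [right_lows, highs]
    return (lows, left_highs), (right_lows, highs)

# Example: splitting a 2x1 rectangle bisects its longer (first) side at 1.0.
left_cell, right_cell = split_rectangle([0.0, 0.0], [2.0, 1.0])
```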

Third is to test the effects of two algorithmic components of ADVT when applied to the baselines VOMCPOW and POMCPOW: In the original implementation of VOMCPOW and POMCPOW provided by the authors, the policy is recomputed after every planning step using a new search tree. In contrast, for discrete observation spaces, ADVT applies ABT’s (Kurniawati and Yadav, 2013) strategy that re-uses the partial search tree (starting from the updated belief) constructed in the previous planning steps and improves the policy within the partial search tree. Therefore, for problems with discrete observation spaces, we also tested modified versions of VOMCPOW and POMCPOW, where we follow ADVT’s strategy of re-using the partial search trees. Note that for the VDP-Tag and LunarLander problems, we did not test the variants of VOMCPOW and POMCPOW that re-use partial search trees, since each observation that the agent perceives from the environment leads to a new search tree due to the continuous observation spaces. Moreover, to test the effects of ADVT’s stochastic Bellman backup strategy further, we implemented variants of VOMCPOW and POMCPOW to use stochastic Bellman backups instead of Monte Carlo backups. The results are discussed in Section 6.3.3.

To approximately determine the best parameters for each solver and problem, we ran a set of systematic preliminary trials by performing a grid search over the parameter space. For each solver and problem, we used the best parameters and ran 1,000 simulation runs, with a fixed planning time of 1 s CPU time per step for each solver and scenario. All tested solvers and scenarios were implemented in C++ using the OPPT framework (Hoerger et al., 2018). All simulations were run single-threaded on an Intel Xeon Platinum 8274 CPU at 3.2 GHz with 4 GB of memory.

6.3 Results

Table 2: Average total discounted rewards and 95% confidence intervals of ADVT, VOMCPOW and POMCPOW on the Pushbox, Parking, VDP-Tag and LunarLander problems. The average is taken over 1,000 simulation runs per solver and problem, with a planning time of 1 s per step. The best result for each problem is highlighted in bold.
            Pushbox2D              Pushbox3D              Parking2D            Parking3D            VDP-Tag              LunarLander
ADVT        \textbf{351.6 ± 10.0}  \textbf{322.1 ± 14.9}  \textbf{35.2 ± 1.9}  \textbf{32.6 ± 3.5}  30.5 ± 1.0           \textbf{26.01 ± 1.2}
VOMCPOW     129.8 ± 13.3           73.5 ± 13.8            -0.78 ± 2.8          -18.4 ± 1.4          \textbf{32.9 ± 0.9}  13.3 ± 2.2
POMCPOW     82.1 ± 14.2            3.6 ± 12.9             -5.1 ± 3.0           -25.7 ± 1.4          28.2 ± 1.1           13.5 ± 2.1
Figure 3: Average total discounted rewards of all tested solvers on the SensorPlacement problems. The average is taken over 1,000 simulation runs per solver and problem.
Table 3: Average total discounted rewards and 95% confidence intervals of all tested solvers on the Pushbox, Parking, VDP-Tag and LunarLander problems. The average is taken over 1,000 simulation runs per solver and problem, with a planning time of 1 s per step.
                        Pushbox2D      Pushbox3D      Parking2D     Parking3D     VDP-Tag      LunarLander
ADVT-R                  371.4 ± 9.8    321.2 ± 15.1   38.9 ± 1.8    24.3 ± 3.4    30.2 ± 1.0   29.6 ± 1.1
ADVT (L=0)              340.8 ± 14.7   294.6 ± 13.3   29.2 ± 3.5    18.6 ± 1.7    28.7 ± 1.1   24.7 ± 1.2
ADVT-MC                 319.6 ± 13.7   311.0 ± 16.2   -3.2 ± 1.8    -14.7 ± 0.5   33.5 ± 0.8   21.9 ± 1.7
VOMCPOW+our RT+our BB   322.9 ± 12.1   274.9 ± 14.2   28.2 ± 1.8    24.4 ± 2.4    -            -
VOMCPOW+our BB          316.3 ± 13.6   134.2 ± 17.4   27.5 ± 1.9    23.7 ± 2.5    29.9 ± 1.0   19.9 ± 1.6
VOMCPOW+our RT          316.0 ± 12.3   268.9 ± 14.2   -0.42 ± 2.8   -15.7 ± 1.5   -            -
POMCPOW+our RT+our BB   314.2 ± 13.0   245.7 ± 14.1   27.7 ± 1.8    8.8 ± 2.6     -            -
POMCPOW+our BB          300.6 ± 12.6   128.8 ± 17.5   24.2 ± 1.9    -10.4 ± 2.1   27.5 ± 1.2   18.4 ± 2.2
POMCPOW+our RT          270.6 ± 18.9   203.7 ± 14.3   -5.2 ± 2.9    -22.8 ± 1.3   -            -
Table 4: Average total discounted rewards and 95% confidence intervals of all tested solvers on the SensorPlacement problems. The average is taken over 1,000 simulation runs per solver and problem, with a planning time of 1 s per step. The best result for each problem is highlighted in bold.

                        SensorPlacement-6      SensorPlacement-8      SensorPlacement-10     SensorPlacement-12
ADVT                    \textbf{842.8 ± 9.5}   \textbf{706.8 ± 17.5}  \textbf{565.1 ± 21.7}  \textbf{303.0 ± 19.8}
ADVT-R                  676.3 ± 19.7           238.1 ± 33.5           28.7 ± 18.4            -17.3 ± 7.4
ADVT (L=0)              780.4 ± 12.6           448.8 ± 15.9           325.3 ± 16.2           102.5 ± 7.4
ADVT-MC                 812.6 ± 11.4           692.7 ± 17.7           551.2 ± 18.3           293.5 ± 19.6
VOMCPOW                 768.5 ± 16.4           305.6 ± 25.8           110.1 ± 24.5           -8.2 ± 13.2
VOMCPOW+our RT+our BB   823.4 ± 15.1           679.1 ± 17.9           481.6 ± 22.2           191.3 ± 17.6
VOMCPOW+our BB          779.2 ± 16.1           654.9 ± 16.3           453.6 ± 22.8           182.4 ± 17.7
VOMCPOW+our RT          817.2 ± 15.8           663.4 ± 18.6           476.0 ± 22.7           189.9 ± 18.0
POMCPOW                 377.6 ± 23.5           113.4 ± 24.2           -36.8 ± 11.3           -74.3 ± 12.9
POMCPOW+our RT+our BB   659.3 ± 17.2           428.7 ± 21.5           114.6 ± 16.6           -1.9 ± 6.6
POMCPOW+our BB          562.5 ± 18.3           408.7 ± 19.1           102.6 ± 14.4           -6.5 ± 7.1
POMCPOW+our RT          653.2 ± 17.3           425.2 ± 21.8           111.3 ± 16.8           -2.1 ± 6.8

Table 5: Success rates of all tested solvers on the Pushbox, Parking, VDP-Tag and LunarLander problems. The success rate is computed over 1,000 simulation runs per solver and problem, with a planning time of 1 s per step.
                        Pushbox2D   Pushbox3D   Parking2D   Parking3D   VDP-Tag   LunarLander
ADVT                    0.985       0.969       0.912       0.916       0.985     0.969
ADVT-R                  0.987       0.968       0.943       0.906       0.984     0.970
ADVT (L=0)              0.966       0.965       0.857       0.898       0.980     0.931
ADVT-MC                 0.989       0.972       0.417       0.337       0.991     0.957
VOMCPOW                 0.754       0.815       0.512       0.297       0.987     0.859
VOMCPOW+our RT+our BB   0.985       0.970       0.885       0.886       -         -
VOMCPOW+our BB          0.976       0.893       0.882       0.885       0.986     0.958
VOMCPOW+our RT          0.975       0.939       0.597       0.314       -         -
POMCPOW                 0.712       0.692       0.401       0.125       0.979     0.876
POMCPOW+our RT+our BB   0.974       0.953       0.853       0.534       -         -
POMCPOW+our BB          0.970       0.891       0.793       0.246       0.976     0.943
POMCPOW+our RT          0.963       0.969       0.409       0.122       -         -
Table 6: Success rates of all tested solvers on the SensorPlacement problems. The success rate is computed over 1,000 simulation runs per solver and problem, with a planning time of 1 s per step.

                        SensorPlacement-6   SensorPlacement-8   SensorPlacement-10   SensorPlacement-12
ADVT                    0.981               0.962               0.834                0.724
ADVT-R                  0.832               0.692               0.756                0.557
ADVT (L=0)              0.937               0.726               0.791                0.601
ADVT-MC                 0.964               0.959               0.828                0.719
VOMCPOW                 0.923               0.721               0.645                0.583
VOMCPOW+our RT+our BB   0.979               0.951               0.807                0.703
VOMCPOW+our BB          0.924               0.883               0.798                0.702
VOMCPOW+our RT          0.967               0.891               0.803                0.698
POMCPOW                 0.738               0.636               0.519                0.321
POMCPOW+our RT+our BB   0.829               0.794               0.646                0.575
POMCPOW+our BB          0.794               0.779               0.641                0.563
POMCPOW+our RT          0.826               0.781               0.657                0.578

6.3.1 Comparison with State-of-the-Art Methods

Table 2 shows the average total discounted rewards of all tested solvers on the Pushbox, Parking, VDP-Tag and LunarLander problems, while Figure 3 shows the results for the SensorPlacement problems. Detailed results for the SensorPlacement problems are presented in Table 4, while the success rates of the tested solvers are shown in Table 5 and Table 6.

ADVT generally outperforms all other methods, except for VDP-Tag, where VOMCPOW performs better. Interestingly, as we will see in Section 6.3.2, the variant of ADVT that uses Monte-Carlo backups instead of stochastic Bellman backups (ADVT-MC) outperforms all other methods in this problem, which supports ADVT’s effectiveness in handling continuous actions. The results for the SensorPlacement problems indicate that ADVT scales well to higher-dimensional action spaces. Additionally, the results on the VDP-Tag and LunarLander problems indicate that ADVT is capable of handling continuous observation spaces well.

ADVT performs well in terms of the success rate, too (Table 5 and Table 6). ADVT maintains a success rate of more than 90% in the Pushbox, Parking, VDP-Tag and LunarLander problems. In the Parking3D problem, VOMCPOW's and POMCPOW's success rates drop to roughly 30% and 12.5% respectively. Similarly, in the SensorPlacement problems ADVT achieves a higher success rate than VOMCPOW and POMCPOW. While in the SensorPlacement-6 problem the gap between ADVT and VOMCPOW is relatively small (around 98% for ADVT and 92% for VOMCPOW), the gap increases as the dimensionality of the action space increases. In the SensorPlacement-12 problem, ADVT achieves a success rate of more than 72%, while VOMCPOW's success rate drops below 60%. POMCPOW achieves a success rate below 74% in the SensorPlacement-6 problem, and below 32% in the SensorPlacement-12 problem.

6.3.2 Understanding the Effects of Different Components of ADVT

In this section, we present results demonstrating the importance of three key algorithmic components of ADVT, namely the Voronoi-based partitioning, the cell-size-aware optimistic upper-confidence bound, and the stochastic Bellman backups.

Effects of Voronoi-based partitioning for ADVT.

To understand the benefit of our Voronoi-based partitioning method, we compare the results of ADVT with those of ADVT-R. Table 3 shows that ADVT-R slightly outperforms ADVT in the Pushbox2D, Parking2D, VDP-Tag and LunarLander problems, indicating that a rectangular-based partitioning works well for low-dimensional action spaces. However, Table 4 shows that ADVT-R becomes uncompetitive in the SensorPlacement problems as the dimensionality of the action space increases. For rectangular-based partitionings, the diameters of the cells can shrink very slowly in higher-dimensional action spaces. Additionally, the cell refinement method is independent of the spatial locations of the sampled actions. Both properties result in loose optimistic upper-confidence bounds on the $Q$-values, leading to excessive exploration of sub-optimal regions of the action space. For Voronoi trees, the geometries (and therefore the diameters) of the cells are much more adaptive to the spatial locations of the sampled actions, leading to more accurate optimistic upper-confidence bounds on the associated $Q$-values, which avoids over-exploration of regions of the action space that already contain sufficiently many sampled actions.
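As a simple illustration of this slow shrinkage (our own calculation, not part of the reported experiments): if the action space is the unit cube $[0,1]^{D}$ and a cell is refined $k<D$ times along a single branch by longest-side bisection, then $k$ of its sides have length $\frac{1}{2}$ while the remaining $D-k$ still have length $1$, so its diameter is $\sqrt{(D-k)+k/4}=\sqrt{D-3k/4}>\sqrt{D}/2$; the diameter of a cell is halved only once every one of the $D$ coordinates has been cut.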

Effects of cell-size-aware optimistic upper-confidence bound.

To investigate the importance of the component $L\,\mathrm{diam}(P)$ in the optimistic upper-confidence bound in eq. 7, we compare ADVT and ADVT (L=0). The results in Table 3 and Table 4 indicate that the cell-diameter-aware component in the upper-confidence bound in eq. 7 is important for ADVT to perform well, particularly in the SensorPlacement problems. The reason is that in the early stages of planning, the partitions associated with the beliefs are still coarse, i.e., only a few candidate actions are considered per belief. If some of those candidate actions have small estimated $Q$-values, ADVT (L=0) may discard large portions of the action space for a very long time, even if they potentially contain useful actions. The cell-diameter-aware bias term in eq. 7 alleviates this issue by encouraging ADVT to explore cells with large diameters. This is particularly important for problems with large action spaces, such as the SensorPlacement problems.
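Schematically, and only as a reconstruction consistent with the description above (the constant names are ours, and this is not a verbatim restatement of eq. 7), the bound for a cell $P$ with representative action $a$ has the form $U(b,P)=\hat{Q}(b,a)+c\sqrt{\log N(b)/N(b,a)}+L\,\mathrm{diam}(P)$, so that setting $L=0$ recovers the standard UCB1 bound, while a positive $L$ keeps large, sparsely sampled cells attractive for exploration.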

Effects of Stochastic Bellman backups.

To investigate the effects of this component, we compare ADVT with ADVT-MC. Table 3 and Table 4 reveal that ADVT, which uses stochastic Bellman backups, often performs significantly better, particularly in the Parking2D and Parking3D problems. The reason is that in both problems good rewards are sparse, particularly for beliefs where the vehicle is located between the walls and slight deviations from the optimal actions can lead to collisions. The stochastic Bellman backups help to focus the search on the more promising regions of the action space. On the other hand, in the VDP-Tag problem, ADVT-MC performs better than ADVT. In this problem, it is important to reduce the uncertainty with respect to the target location during the first few steps by activating the range sensor. However, due to the cost of activating the range sensor, this strategy often seems suboptimal during the early stages of planning, when only a few episodes have been sampled. At the same time, the stochastic Bellman backups in ADVT tend to overestimate the $Q$-values of actions that do not activate the range sensor (this effect is known as maximization bias in $Q$-learning (Sutton and Barto, 2018)). As a result, ADVT tends to discard strategies that reduce the uncertainty with respect to the target location during the first few steps. ADVT-MC, which uses Monte-Carlo backups, does not suffer from maximization bias, which helps it perform better than ADVT in this problem.
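To illustrate the difference between the two backup styles, the following schematic sketch (not ADVT's exact update rule; the function names and toy data are ours) contrasts a Monte-Carlo-style estimate with a Bellman-style estimate that maximizes over the child action values:

```python
import numpy as np

def monte_carlo_backup(returns):
    """Monte-Carlo-style estimate (schematic): average the sampled
    episode returns that passed through the belief-action pair."""
    return float(np.mean(returns))

def stochastic_bellman_backup(rewards, q_children, gamma=0.95):
    """Bellman-style estimate (schematic): average of immediate rewards
    plus the discounted value of the best action at each sampled child
    belief. The max propagates improvements quickly, but can
    overestimate noisy values (maximization bias)."""
    return float(np.mean([r + gamma * max(q.values())
                          for r, q in zip(rewards, q_children)]))

# Toy example: two sampled child beliefs with per-action value estimates.
q_children = [{"a1": 1.0, "a2": 4.0}, {"a1": 2.0, "a2": 0.5}]
print(monte_carlo_backup(returns=[3.1, 0.4]))
print(stochastic_bellman_backup(rewards=[0.0, -1.0], q_children=q_children))
```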

6.3.3 Ablation Study

Here we investigate how two components of ADVT, namely re-using partial search trees and using stochastic Bellman backups instead of Monte-Carlo backups, affect the baselines VOMCPOW and POMCPOW when we apply these ideas to them. The variants of VOMCPOW and POMCPOW that re-use partial search trees are denoted by VOMCPOW+our RT and POMCPOW+our RT respectively. The variants of the baselines that use stochastic Bellman backups are denoted by VOMCPOW+our BB and POMCPOW+our BB. The variants VOMCPOW+our RT+our BB and POMCPOW+our RT+our BB both re-use partial search trees and perform stochastic Bellman backups.

Generally, the results in Table 3 and Table 4 indicate that the baselines that re-use partial search trees perform much better than the original VOMCPOW and POMCPOW, particularly in the Pushbox problems. These results (as well as those of ADVT and all its variants) indicate the benefit of re-using partial search trees generated in previous planning steps instead of re-computing the policy at every planning step. Similarly, the baselines that use stochastic Bellman backups tend to outperform the ones that use Monte-Carlo backups, except in the VDP-Tag problem. This is consistent with our results for ADVT and ADVT-MC and indicates that stochastic Bellman backups are a simple yet effective way to improve the performance of MCTS-based solvers.

7 Conclusion

We propose a new sampling-based online POMDP solver, called ADVT, that scales well to POMDPs with high-dimensional continuous action and observation spaces. Our solver builds on a number of works that use adaptive discretization of the action space and introduces a more effective adaptive discretization method based on several novel ideas: a Voronoi tree-based adaptive hierarchical discretization of the action space, a cell-size-aware refinement rule, and a cell-size-aware upper-confidence bound. For continuous observation spaces, our solver adopts an observation progressive widening strategy together with an explicit belief representation, enabling ADVT to scale to higher-dimensional observation spaces. ADVT shows strong empirical results against state-of-the-art algorithms on several challenging benchmarks. We hope this work further expands the applicability of general-purpose POMDP solvers.

Acknowledgements

We thank Jerzy Filar for many helpful discussions. This work is partially supported by the Australian Research Council (ARC) Discovery Project 200101049.

References

  • Agha-Mohammadi et al. (2011) Agha-Mohammadi AA, Chakravorty S and Amato NM (2011) FIRM: Feedback controller-based information-state roadmap. A framework for motion planning under uncertainty. In: IROS. IEEE, pp. 4284–4291.
  • Auer et al. (2002) Auer P, Cesa-Bianchi N and Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3): 235–256.
  • Bai et al. (2014) Bai H, Hsu D and Lee WS (2014) Integrated perception and planning in the continuous space: A POMDP approach. IJRR 33(9): 1288–1302.
  • Bubeck et al. (2011) Bubeck S, Munos R, Stoltz G and Szepesvári C (2011) X-armed bandits. Journal of Machine Learning Research 12(5).
  • Burden et al. (2016) Burden RL, Faires JD and Burden AM (2016) Numerical analysis. Tenth edition. Boston, MA: Cengage Learning. ISBN 9781305253667.
  • Couëtoux et al. (2011) Couëtoux A, Hoock JB, Sokolovska N, Teytaud O and Bonnard N (2011) Continuous upper confidence trees. In: LION. Springer, pp. 433–445.
  • Fischer and Tas (2020) Fischer J and Tas ÖS (2020) Information particle filter tree: An online algorithm for POMDPs with belief-based rewards on continuous domains. In: ICML. PMLR, pp. 3177–3187.
  • Hoerger and Kurniawati (2021) Hoerger M and Kurniawati H (2021) An on-line POMDP solver for continuous observation spaces. In: Proc. IEEE/RSJ Int. Conference on Robotics and Automation (ICRA). IEEE.
  • Hoerger et al. (2020) Hoerger M, Kurniawati H, Bandyopadhyay T and Elfes A (2020) Linearization in motion planning under uncertainty. In: Algorithmic Foundations of Robotics XII. Springer, Cham, pp. 272–287.
  • Hoerger et al. (2018) Hoerger M, Kurniawati H and Elfes A (2018) A software framework for planning under partial observability. In: IROS. IEEE, pp. 1–9. URL https://github.com/RDLLab/oppt.
  • Hoerger et al. (2022) Hoerger M, Kurniawati H, Kroese D and Ye N (2022) Adaptive discretization using voronoi trees for continuous-action pomdps. In: Proc. Int. Workshop on the Algorithmic Foundations of Robotics.
  • Kaelbling et al. (1998) Kaelbling LP, Littman ML and Cassandra AR (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101(1-2): 99–134.
  • Kim et al. (2020) Kim B, Lee K, Lim S, Kaelbling L and Lozano-Pérez T (2020) Monte carlo tree search in continuous spaces using Voronoi optimistic optimization with regret bounds. AAAI 34(06): 9916–9924.
  • Klimenko et al. (2014) Klimenko D, Song J and Kurniawati H (2014) Tapir: A software toolkit for approximating and adapting POMDP solutions online. In: ACRA. URL https://github.com/rdl-algorithm/tapir.
  • Kurniawati (2022) Kurniawati H (2022) Partially observable Markov decision processes and robotics. Annual Review of Control, Robotics, and Autonomous Systems 5(1). To appear.
  • Kurniawati et al. (2011) Kurniawati H, Du Y, Hsu D and Lee WS (2011) Motion planning under uncertainty for robotic tasks with long time horizons. IJRR 30(3): 308–323.
  • Kurniawati et al. (2008) Kurniawati H, Hsu D and Lee WS (2008) SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: RSS.
  • Kurniawati and Yadav (2013) Kurniawati H and Yadav V (2013) An online POMDP solver for uncertainty planning in dynamic environment. In: Proc. Int. Symp. on Robotics Research.
  • Lim et al. (2021) Lim MH, Tomlin CJ and Sunberg ZN (2021) Voronoi progressive widening: Efficient online solvers for continuous state, action, and observation POMDPs. In: 60th IEEE Conference on Decision and Control (CDC). pp. 4493–4500.
  • Lindquist (1973) Lindquist A (1973) On feedback control of linear stochastic systems. SIAM Journal on Control 11(2): 323–343. 10.1137/0311025.
  • Mansley et al. (2011) Mansley C, Weinstein A and Littman M (2011) Sample-based planning for continuous action Markov decision processes. In: ICAPS.
  • Mern et al. (2021) Mern J, Yildiz A, Sunberg Z, Mukerji T and Kochenderfer MJ (2021) Bayesian optimized Monte Carlo planning. In: AAAI, volume 35. pp. 11880–11887.
  • Papadimitriou and Tsitsiklis (1987) Papadimitriou CH and Tsitsiklis JN (1987) The complexity of Markov decision processes. Mathematics of operations research 12(3): 441–450.
  • Pineau et al. (2003) Pineau J, Gordon G and Thrun S (2003) Point-based Value Iteration: An anytime algorithm for POMDPs. In: IJCAI.
  • Seiler et al. (2015) Seiler KM, Kurniawati H and Singh SP (2015) An online and approximate solver for POMDPs with continuous action space. In: ICRA. IEEE, pp. 2290–2297.
  • Silver and Veness (2010) Silver D and Veness J (2010) Monte Carlo planning in large POMDPs. In: Advances in Neural Information Processing Systems. pp. 2164–2172.
  • Smith (1984) Smith RL (1984) Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research 32(6): 1296–1308.
  • Smith and Simmons (2005) Smith T and Simmons R (2005) Point-based POMDP algorithms: Improved analysis and implementation. In: UAI.
  • Sondik (1971) Sondik EJ (1971) The Optimal Control of Partially Observable Markov Decision Processes. PhD Thesis, Stanford, California.
  • Sun et al. (2015) Sun W, Patil S and Alterovitz R (2015) High-frequency replanning under uncertainty using parallel sampling-based motion planning. IEEE TRO 31(1): 104–116.
  • Sunberg and Kochenderfer (2018) Sunberg ZN and Kochenderfer MJ (2018) Online algorithms for POMDPs with continuous state, action, and observation spaces. In: ICAPS.
  • Sutton and Barto (2018) Sutton RS and Barto AG (2018) Reinforcement learning: An introduction. MIT press.
  • Touati et al. (2020) Touati A, Taiga AA and Bellemare MG (2020) Zooming for efficient model-free reinforcement learning in metric spaces. arXiv preprint 2003.04069 .
  • Valko et al. (2013) Valko M, Carpentier A and Munos R (2013) Stochastic simultaneous optimistic optimization. In: ICML. PMLR, pp. 19–27.
  • Van Den Berg et al. (2011) Van Den Berg J, Abbeel P and Goldberg K (2011) LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. The International Journal of Robotics Research 30(7): 895–913.
  • Van Den Berg et al. (2012) Van Den Berg J, Patil S and Alterovitz R (2012) Motion planning under uncertainty using iterative local optimization in belief space. IJRR 31(11): 1263–1278.
  • Wang et al. (2020) Wang T, Ye W, Geng D and Rudin C (2020) Towards practical Lipschitz bandits. In: FODS. pp. 129–138.
  • Watkins and Dayan (1992) Watkins CJ and Dayan P (1992) Q-learning. Machine learning 8(3): 279–292.
  • Welzl (1991) Welzl E (1991) Smallest enclosing disks (balls and ellipsoids). In: Maurer H (ed.) New Results and New Trends in Computer Science. Berlin: Springer. ISBN 978-3-540-46457-0, pp. 359–370.
  • Ye et al. (2017) Ye N, Somani A, Hsu D and Lee WS (2017) DESPOT: Online POMDP planning with regularization. Journal of Artificial Intelligence Research 58: 231–266.