Efficient Constraint Generation for Stochastic Shortest Path Problems
Abstract
Current methods for solving Stochastic Shortest Path Problems (SSPs) find states’ costs-to-go by applying Bellman backups, where state-of-the-art methods employ heuristics to select states to back up and to prune states. A fundamental limitation of these algorithms is their need to compute the cost-to-go for every applicable action during each state backup, leading to unnecessary computation for actions identified as sub-optimal. We present new connections between planning and operations research and, using this framework, we address this issue of unnecessary computation by introducing an efficient version of constraint generation for SSPs. This technique allows algorithms to ignore sub-optimal actions and avoid computing their costs-to-go. We also apply our novel technique to iLAO∗, resulting in a new algorithm, CG-iLAO∗. Our experiments show that CG-iLAO∗ ignores up to 57% of iLAO∗’s actions and that it solves problems up to 8× and 3× faster than LRTDP and iLAO∗.
1 Introduction
Planning is an important facet of AI that provides efficient algorithms for solving real-world problems. Stochastic Shortest Path problems (SSPs) (Bertsekas and Tsitsiklis 1991) generalise classical (deterministic) planning by introducing actions with probabilistic effects, which lets us model problems where the actions are intrinsically probabilistic. Value Iteration (VI) (Bellman 1957) is a dynamic programming algorithm that forms the basis of optimal algorithms for solving SSPs. VI finds the cost-to-go for each state, which describes the solution of an SSP. A state $s$’s cost-to-go is the minimum expected cost of reaching a goal from $s$, and similarly an action $a$’s cost-to-go is the minimum expected cost of reaching a goal after applying $a$. VI finds the optimal cost-to-go by iteratively applying Bellman backups, which update each state’s cost-to-go with the minimum of its applicable actions’ costs-to-go.
LRTDP (Bonet and Geffner 2003) and iLAO∗ (Hansen and Zilberstein 2001), the state-of-the-art algorithms for optimally solving SSPs, build on VI and offer significant speedups by using heuristics to apply Bellman backups only to promising states and by pruning states that are deemed too expensive. A shortcoming of such algorithms is that each Bellman backup must consider all applicable actions. For instance, let $s$ be a state and $a$ an applicable action: even if all successors of $a$ will be pruned because they are too expensive, a Bellman backup on $s$ still computes the $Q$-value of $a$, so these algorithms can prune unpromising states but not actions. This issue is compounded because algorithms for SSPs require arbitrarily many Bellman backups on each state to find the optimal solution, thus wasting time on computing the $Q$-values of such actions many times.
This issue of computing unnecessary $Q$-values for a state is addressed by action elimination (Bertsekas 1995), which can be implemented in search algorithms to prune useless actions. Action elimination looks for pairs of applicable actions $a$ and $a'$ in a state, such that a lower bound on $a$’s cost-to-go exceeds an upper bound on $a'$’s cost-to-go, in which case $a$ is proved to be a useless action and can be pruned. Although domain-independent lower bounds (heuristics) can be computed efficiently, finding an efficient, domain-independent upper bound remains an open question to the best of our knowledge. This gap has limited the use of action elimination in domain-independent planning. In the context of optimal heuristic planning for SSPs, the only algorithm we are aware of that utilises action elimination to prune actions is FTVI (Dai, Weld et al. 2009). Other algorithms, such as BRTDP (McMahan, Likhachev, and Gordon 2005), FRTDP (Smith and Simmons 2006), and VPI-RTDP (Sanner et al. 2009), use upper bounds in conjunction with lower bounds to guide their search. However, unlike FTVI, they do not perform action elimination.
We present a general technique for ignoring actions that does not rely on upper bounds. In contrast to action elimination that incrementally prunes useless actions, our approach initially treats all actions as inactive, i.e., not contributing to the solution. It then iteratively adds actions to the search that are guaranteed to improve the current solution. To develop our approach, we strengthen the connections between planning and operations research by relating heuristic search to variable and constraint generation. Similar to heuristic search, variable and constraint generation enable the solving of large Linear Programs (LPs) by considering only a subset of variables and constraints. We show that algorithms such as iLAO∗ implicitly perform constraint generation, albeit in a trivial manner, to the LP encoding of VI. Building on this, we introduce an efficient algorithm for constraint generation for SSPs that leads to inactive actions being ignored. We apply our approach to iLAO∗ to get the novel algorithm: CG-iLAO∗.
In our experiments, CG-iLAO∗ solves problems up to 8× and 3× faster than LRTDP and iLAO∗, respectively. CG-iLAO∗ is faster than the others over various problem difficulties: its improvement is apparent from problems that require only 4 minutes to solve, and the improvement gap increases as problems take longer to solve. To explain this, we quantify that CG-iLAO∗ only considers 43–65% of iLAO∗’s total actions, thus fewer actions’ costs-to-go are computed. Investigating further, we empirically show that CG-iLAO∗ combines iLAO∗’s efficient use of backups and LRTDP’s strong pruning capability, thereby displaying the best characteristics of both in a single algorithm.
The structure of this paper is as follows: we first introduce the background for SSPs and existing methods for solving SSPs. Second, we give a brief background to linear programming and connect linear programming techniques to heuristic search. Then we explain and motivate our novel algorithm CG-iLAO∗ and prove its correctness. Finally, we empirically evaluate the performance of CG-iLAO∗.
2 Background
A Stochastic Shortest Path problem (SSP) (Bertsekas and Tsitsiklis 1991) is a tuple $\mathbb{S} = \langle S, s_0, G, A, P, C \rangle$ where: $S$ is a finite set of states; $s_0 \in S$ is the initial state; $G \subseteq S$ is the set of goal states; $A$ is a finite set of actions, and $A(s) \subseteq A$ denotes the actions applicable in state $s$; $P(s' \mid s, a)$ gives the probability that applying action $a \in A(s)$ in state $s$ results in state $s'$; and $C(s,a) \in \mathbb{R}_{>0}$ is the cost of applying $a \in A(s)$ in $s$.
The states immediately reachable by applying $a$ to $s$ are called successors and are given by $\mathrm{succ}(s,a) = \{s' \in S : P(s' \mid s,a) > 0\}$. A solution to an SSP is given by a map $\pi$ from states to actions, called a policy. A policy $\pi$ is closed w.r.t. $s_0$ if each state $s$ that can be reached by following $\pi$ from $s_0$ is either a goal or has $\pi(s)$ defined. A policy $\pi$ is proper w.r.t. $s_0$ if it reaches the goal with probability 1 from $s_0$, and it is improper otherwise. An optimal policy $\pi^*$ is any proper policy that minimises the expected cost of reaching the goal from the initial state $s_0$.
For simplicity, we make two standard assumptions: (i) there exists at least one proper policy w.r.t. $s_0$, which is called the reachability assumption; and (ii) all improper policies have infinite expected cost. A consequence of assumption (ii) is that any optimal policy is proper. Note that we define $C$ to be strictly positive in order to avoid zero-cost cycles that would violate assumption (ii). In our experiments, we relax the reachability assumption by applying the fixed-penalty transformation of SSPs (Trevizan, Teichteil-Königsbuch, and Thiébaux 2017), resulting in a new SSP without dead ends. Other approaches for handling SSPs with dead ends, such as S3P (Teichteil-Königsbuch 2012), are also compatible with our approach.
An SSP’s optimal solution is uniquely represented by the optimal value function $V^*$, which is the unique fixed-point solution to the Bellman equations:

$$V^*(s) = \begin{cases} 0 & \text{if } s \in G \\ \min_{a \in A(s)} Q^*(s,a) & \text{otherwise} \end{cases} \qquad (1)$$
where $Q^*(s,a) = C(s,a) + \sum_{s' \in \mathrm{succ}(s,a)} P(s' \mid s,a)\, V^*(s')$ is known as the $Q$-value of $s$ and $a$. An (optimal) value function $V$ ($V^*$) and the associated $Q$-values represent the (minimum) expected cost to reach the goal from state $s$ and after executing action $a$ in state $s$, respectively. Given a value function $V$, the policy associated with it is defined as $\pi_V(s) \in \operatorname{argmin}_{a \in A(s)} Q_V(s,a)$ and is known as the greedy policy for $V$. Ties can be broken arbitrarily thanks to assumption (ii), and for simplicity we assume tie-breaking rules that ensure the greedy policy is unique.
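To make these definitions concrete, here is a minimal sketch in Python; the `ssp` interface used below (`ssp.actions(s)`, `ssp.cost(s, a)`, and `ssp.transition(s, a)` returning a dict from successor states to probabilities) is our assumption for illustration, not an interface from the paper.

```python
# Minimal sketch of Q-values and greedy policies; the `ssp` interface
# (actions, cost, transition) is assumed for illustration only.

def q_value(ssp, V, s, a):
    """Q_V(s, a) = C(s, a) + sum_{s'} P(s' | s, a) * V(s')."""
    return ssp.cost(s, a) + sum(p * V[t] for t, p in ssp.transition(s, a).items())

def greedy_action(ssp, V, s):
    """An action minimising Q_V(s, .); ties broken deterministically by min()."""
    return min(ssp.actions(s), key=lambda a: q_value(ssp, V, s, a))
```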
The Bellman equations (1) can be iteratively solved by Value Iteration (VI) (Bellman 1957): VI starts with an arbitrary value function $V_0$ and computes $V_{t+1}(s) = \min_{a \in A(s)} Q_{V_t}(s,a)$ over all states, where $Q_{V_t}$ uses the previous value function $V_t$. This process of computing $V_{t+1}(s)$ using $V_t$ is called a Bellman backup. VI is guaranteed to asymptotically converge to $V^*$ regardless of $V_0$. For practical reasons, VI is terminated at iteration $t$ when the Bellman residual $|V_t(s) - V_{t-1}(s)|$ is less than or equal to $\epsilon$ for all $s \in S$.
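Building on the sketch above, a VI loop with the residual-based stop criterion might look as follows; this is a Gauss–Seidel-style variant that updates $V$ in place, and `ssp.is_goal(s)` is again an assumed helper.

```python
def value_iteration(ssp, states, h, epsilon=1e-4):
    """Sketch of VI: initialise V (here from a heuristic h) and apply Bellman
    backups to every non-goal state until the residual is at most epsilon."""
    V = {s: 0.0 if ssp.is_goal(s) else h(s) for s in states}
    while True:
        residual = 0.0
        for s in states:
            if ssp.is_goal(s):
                continue
            new_v = min(q_value(ssp, V, s, a) for a in ssp.actions(s))  # backup
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual <= epsilon:
            return V
```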
Given a policy $\pi$, its policy envelope $S^{\pi}$ is the set of all states reachable when following $\pi$ from $s_0$. Note that VI explores the complete state space regardless of the size of the optimal policy envelope $S^{\pi^*}$. To address this shortcoming, heuristic search algorithms such as iLAO∗ (Hansen and Zilberstein 2001) and LRTDP (Bonet and Geffner 2003) use a heuristic function $h$ to initialise $V$, which guides the exploration of the state space in a way that expands as few states as possible. To find $V^*$, heuristic search algorithms require the heuristic to be admissible, that is, $h(s) \le V^*(s)$ for all $s \in S$. Often heuristics are also monotonic, which immediately implies admissibility. A value function $V$ is monotonic if $V(s) \le \min_{a \in A(s)} Q_V(s,a)$ for all $s \in S \setminus G$, and the definition is analogous for $h$. Similar to VI, heuristic search algorithms converge to $V^*$ asymptotically and require a practical stop criterion. This stop criterion is known as $\epsilon$-consistency (Bonet and Geffner 2003) and is defined as:
Definition 1 ($\epsilon$-consistency).
A value function $V$ is $\epsilon$-consistent if $|V(s) - \min_{a \in A(s)} Q_V(s,a)| \le \epsilon$ for all $s \in S^{\pi_V}$.
Notice that $\epsilon$-consistency checks the residual only on the policy envelope of a greedy policy, and states outside the envelope are permitted to have a residual larger than $\epsilon$.
We close this section by reviewing iLAO∗ (Hansen and Zilberstein 2001). iLAO∗ (alg. 1) is an iterative algorithm which works by incrementally growing a partial SSP (called the explicit graph in the original paper):
Definition 2 (Partial SSP).
Given an SSP $\mathbb{S} = \langle S, s_0, G, A, P, C \rangle$ and a heuristic $h$, a partial SSP is an SSP with terminal costs $\hat{\mathbb{S}}$ defined over a subset of states $\hat S \subseteq S$ with $s_0 \in \hat S$, a subset of actions $\hat A(s) \subseteq A(s)$ for each $s \in \hat S$, goals $\hat G \supseteq G \cap \hat S$, and terminal cost $h(s)$ for each $s \in \hat G \setminus G$.
SSPs with terminal costs have a one-time terminal cost that is incurred when a terminal state is reached, so their Bellman equations are (1) with the goal case replaced by $V^*(s) = h(s)$ for all $s \in \hat G \setminus G$. It is trivial to see that SSPs with terminal costs are equivalent to SSPs, and we use them to simplify our presentation. In a partial SSP $\hat{\mathbb{S}}$, we refer to the states in $\hat G \setminus G$ as artificial goals, and we define $\hat\pi_V$ to be the greedy policy for $V$ over $\hat{\mathbb{S}}$ restricted to $\hat A$.
At each iteration, iLAO∗ grows its partial SSP by expanding the artificial goals reachable by $\hat\pi_V$ into regular states. To expand a state $s$, iLAO∗ adds $s$'s applicable actions to $\hat A$ and adds the newly reachable states as artificial goals (alg. 1). This artificial goal expansion is done to make $\hat\pi_V$ eventually closed w.r.t. $s_0$ for the original SSP $\mathbb{S}$. Simultaneously, iLAO∗ also works towards making $V$ $\epsilon$-consistent by applying a Bellman backup to all the states reachable by $\hat\pi_V$ (alg. 1). These Bellman backups are ordered by a post-order DFS traversal of the envelope, so states that occur closer to artificial goals are updated first, and $s_0$ is updated last. Note that, when a state $s$ is expanded, it may have a successor $s'$ already within the partial SSP. If this happens, the DFS in alg. 1 must keep traversing from $s'$ to ensure the policy envelope is accurate.
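The following high-level sketch illustrates the structure of one iLAO∗ iteration as described above; it is a simplification of alg. 1, not a faithful reproduction, and `greedy_envelope` is an assumed helper that returns the states reachable by the greedy policy from $s_0$ in post-order DFS order ($s_0$ last), treating unexpanded states as artificial goals.

```python
def ilao_star(ssp, s0, h, epsilon=1e-4):
    """Simplified sketch of iLAO*'s main loop: expand artificial goals
    reachable by the greedy policy, then back up the envelope in post-order."""
    V = {s0: h(s0)}
    expanded = set()                           # fully expanded states
    while True:
        envelope = greedy_envelope(ssp, V, s0, expanded)
        fringe = [s for s in envelope if s not in expanded and not ssp.is_goal(s)]
        for s in fringe:                       # expand artificial goals
            expanded.add(s)
            for a in ssp.actions(s):
                for t in ssp.transition(s, a):
                    V.setdefault(t, h(t))      # new states become artificial goals
        residual = 0.0
        for s in envelope:                     # backups in post-order, s0 last
            if s in expanded and not ssp.is_goal(s):
                new_v = min(q_value(ssp, V, s, a) for a in ssp.actions(s))
                residual = max(residual, abs(new_v - V[s]))
                V[s] = new_v
        if not fringe and residual <= epsilon:
            return V
```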
In the next section, we show how to interpret iLAO∗ through the lens of Operations Research by relating it to techniques used for handling large linear programs.
3 iLAO∗ as Linear Program
SSPs can be solved by the Linear Program (LP) presented in LP 1. This LP is known as the primal LP or VI LP since it directly encodes the Bellman equations.
(LP 1) $\max_V \; V(s_0)$ subject to:
$V(s) \le C(s,a) + \sum_{s' \in \mathrm{succ}(s,a)} P(s' \mid s,a)\, V(s') \quad \forall s \in S \setminus G,\ a \in A(s)$ (C1)
$V(s) = 0 \quad \forall s \in G$ (C2)
Each variable $V(s)$ represents the cost-to-go of state $s$, and for each state $s$ the relevant constraints C1 encode $V(s) \le Q(s,a)$ for all $a \in A(s)$. When clear from context, we use C1$(s,a)$ to refer to the constraint C1 for $s$ and $a$, and $Q(s,a)$ for its right-hand side. Together with the objective that maximises $V(s_0)$, we obtain the Bellman equations (1) for the states in the optimal policy envelope $S^{\pi^*}$, i.e., the constraints are active (tight) for the pairs $(s, \pi^*(s))$ for all $s \in S^{\pi^*}$ and inactive (slack) everywhere else. We reframe iLAO∗’s incremental growing of its partial SSP as solving LP 1 with variable and constraint generation (Bertsimas and Tsitsiklis 1997). (Variable generation is also known as column generation, and constraint generation as the cutting plane method.)
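As an illustration, consider a tiny SSP with states $\{s_0, s_1, g\}$: from $s_0$, action $a_1$ (cost 1) leads to $s_1$ and action $a_2$ (cost 4) leads to $g$; from $s_1$, action $b$ (cost 1) reaches $g$ with probability 0.5 and stays in $s_1$ otherwise. The sketch below solves LP 1 for this SSP with SciPy; the instance is ours, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: x = [V(s0), V(s1)]; the goal g is fixed to V(g) = 0 (constraint C2).
# Constraints C1, rewritten as V(s) - sum_{s'} P(s'|s,a) V(s') <= C(s,a):
#   s0, a1 (cost 1, -> s1):            V(s0) - V(s1)     <= 1
#   s0, a2 (cost 4, -> g):             V(s0)             <= 4
#   s1, b  (cost 1, 0.5 g / 0.5 s1):   V(s1) - 0.5*V(s1) <= 1
A_ub = np.array([[1.0, -1.0],
                 [1.0,  0.0],
                 [0.0,  0.5]])
b_ub = np.array([1.0, 4.0, 1.0])
c = np.array([-1.0, 0.0])     # maximise V(s0) == minimise -V(s0)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)                  # -> [3., 2.]: V*(s0) = 3 and V*(s1) = 2
```

At the optimum, the constraints for $(s_0, a_1)$ and $(s_1, b)$ are tight while the one for $(s_0, a_2)$ is slack, matching the discussion above.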
Variable generation is a technique from Operations Research that enables us to solve LPs with a large number of variables by considering only a subset of them. Given an LP with missing variables, called the Reduced Master Problem (RMP), variable generation finds a set of variables outside the RMP whose addition lets the RMP’s solution quality improve. Variable generation provides a sound and complete method to select such variables and a stop criterion that ensures the optimal solution for the RMP is also optimal for the original LP. Heuristic search algorithms, such as iLAO∗, can be seen as variable generation algorithms over LP 1, where each partial SSP represents an RMP with the subset of variables $\{V(s) : s \in \hat S\}$. For iLAO∗, the variable selection mechanism is inherited from A∗ and is represented by the expansion of the artificial goals reachable by $\hat\pi_V$ (alg. 1): for each expanded state $s$ and $a \in A(s)$, we add the variables $V(s')$ for $s' \in \mathrm{succ}(s,a)$ with $s' \notin \hat S$.
Constraint generation is a similar technique which enables the solving of LPs with a large (potentially infinite) number of constraints. The key idea is that the optimal solution of an LP with many constraints only makes a small number of constraints active, thus only a subset of constraints is needed to characterise this optimal solution. In constraint generation, the intermediate LPs are known as relaxed LPs since they relax the original LP by removing one or more constraints. Given a relaxed LP, constraint generation finds one or more constraints in the original LP that are violated by the optimal solution of the relaxed LP. By iteratively adding these violated constraints and re-optimising the new relaxed LP, a sequence of relaxed LPs with an increasing number of constraints is generated. When no violations are found, the optimal solution of the relaxed LP is an optimal solution for the original LP. The algorithm used to check constraint violations is called a separation oracle and the effectiveness of constraint generation relies on the availability of an efficient separation oracle for the original LP.
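In its generic form, the loop just described can be sketched as follows; `solve_relaxed_lp` and `separation_oracle` are placeholders for problem-specific components, not names from the paper.

```python
def constraint_generation(solve_relaxed_lp, separation_oracle, constraints):
    """Generic constraint generation: solve the relaxed LP, ask the separation
    oracle which original constraints the relaxed solution violates, add them,
    and repeat; when nothing is violated, the relaxed optimum is optimal for
    the original LP."""
    while True:
        solution = solve_relaxed_lp(constraints)
        violated = separation_oracle(solution)  # empty when no violation exists
        if not violated:
            return solution
        constraints.extend(violated)
```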
In the case of iLAO∗, constraint generation adds all the actions of the expanded artificial goal $s$ (alg. 1), where each action $a$ added to the partial SSP implicitly represents the constraint C1$(s,a)$. This separation oracle is computationally cheap since no checks are performed to detect whether a new constraint is needed or not, at the cost that inactive constraints are unnecessarily added to the partial SSP. In the next section, we present our new algorithm that uses an efficient separation oracle that only adds violated constraints.
4 CG-iLAO∗
In iLAO∗, as in most search algorithms, each state is either unexpanded or fully expanded, i.e., either none or all of its applicable actions are considered, and it is not possible to ignore just a subset of the applicable actions. We start by defining which actions can be safely ignored in a partial SSP.
Definition 3 (Inactive Action).
Consider an SSP $\mathbb{S}$, its partial SSP $\hat{\mathbb{S}}$, a value function $V$, and a state $s \in \hat S$. An action $a \in A(s) \setminus \hat A(s)$ is inactive in state $s$ if $V(s) < Q_V(s,a)$.
An inactive action $a$ for $s$ represents the inactive constraint C1$(s,a)$ in LP 1. Since inactive actions are not in the partial SSP and their associated constraints are inactive, adding them to the partial SSP does not change the solution and only adds overhead in the form of $Q$-value computations for sub-optimal actions.
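Under the assumed interface from the earlier sketches, this definition is a direct predicate; `partial_actions[s]` is a hypothetical container holding $\hat A(s)$.

```python
def is_inactive(ssp, V, s, a, partial_actions):
    """Defn. 3 as a predicate: an action a outside the partial SSP is inactive
    in s when its constraint C1(s, a) is slack under V, i.e., V(s) < Q_V(s, a);
    such an action cannot improve the current solution and can be ignored."""
    return a not in partial_actions[s] and V[s] < q_value(ssp, V, s, a)
```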
We generalise iLAO∗ by allowing states to be partially expanded, so these states only have a subset of actions available in the partial SSP. Under the lens of linear programming, we use constraint generation to identify and add actions that may be needed to encode the optimal solution and to ignore inactive actions. We call this algorithm Constraint-Generation iLAO∗ (CG-iLAO∗).
CG-iLAO∗ is presented in alg. 2. One of the defining changes from iLAO∗ is in CG-iLAO∗’s expanding phase (alg. 2), where Partly-Expand-Fringes only expands a state $s$ with the actions that are greedy for $V$ in $s$, rather than all applicable actions. This introduces two challenges: (i) an action $a$ that was not added by the partial expansion may need to be added later, when $V$ is more accurate; and (ii) when such an action $a$ is added, $V(s)$ must be updated to reflect $a$’s availability. If (ii) is not addressed, the reduction to $V(s)$ offered by $a$ is not propagated, potentially leading to a suboptimal solution since $V$ would overestimate $V^*$. The key insight of our algorithm is that both of these challenges are instances of constraint violation. Thus, we can solve both issues by finding which constraints are violated with a separation oracle and enforcing them in the style of constraint generation.
The trivial separation oracle checks the constraints of all pairs $s \in \hat S$ and $a \in A(s)$ for violations; this is needlessly expensive since some non-violated constraints remain non-violated from one iteration to the next. Our separation oracle exploits this persistence between iterations by tracking changes in $V$ to compute a subset of constraints that could potentially be violated, using the following rules (alg. 2): suppose $V(s')$ is assigned a new value, then there are three cases:

1. $V(s')$ stays the same. No new constraint violations.

2. $V(s')$ increases. The constraints that may be violated are C1$(s',a)$ for $a \in A(s') \setminus \hat A(s')$. Note that if $a$ is already inside the partial SSP, i.e., $a \in \hat A(s')$, its constraint can not be violated since the Bellman backup guarantees $V(s') \le Q_V(s',a)$.

3. $V(s')$ decreases. The only constraints that may be violated are C1$(s,a)$ for pairs with $a \in A(s) \setminus \hat A(s)$ and $s' \in \mathrm{succ}(s,a)$.

We store potential violations in the set $\mathcal{C}$, i.e., if $V(s')$ increases, the pairs $\{(s',a) : a \in A(s') \setminus \hat A(s')\}$ are added to $\mathcal{C}$ (alg. 2); and if $V(s')$ decreases, the pairs $\{(s,a) : a \in A(s) \setminus \hat A(s),\ s' \in \mathrm{succ}(s,a)\}$ are added to $\mathcal{C}$ (alg. 2). Elements are removed from $\mathcal{C}$ after they are checked. As we prove later in this section, checking the constraints in $\mathcal{C}$ is sufficient to find all constraint violations; a sketch of this bookkeeping follows.
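The rules above amount to the following bookkeeping, sketched with the same assumed interface as before; `predecessors[s_prime]` is a hypothetical precomputed map from a state $s'$ to the pairs $(s, a)$ with $s' \in \mathrm{succ}(s,a)$.

```python
def track_potential_violations(ssp, C, s_prime, old_v, new_v,
                               partial_actions, predecessors):
    """Bookkeeping behind CG-iLAO*'s separation oracle: after V(s') changes
    from old_v to new_v, record the only constraints that may have become
    violated in the set C of (state, action) pairs."""
    if new_v > old_v:
        # Case 2: constraints C1(s', a) of actions still outside the partial
        # SSP may now be violated; in-SSP actions are safe after a backup.
        for a in ssp.actions(s_prime):
            if a not in partial_actions[s_prime]:
                C.add((s_prime, a))
    elif new_v < old_v:
        # Case 3: Q(s, a) dropped for every pair (s, a) with s' in succ(s, a),
        # so out-of-SSP constraints of s'-predecessors may now be violated.
        for (s, a) in predecessors[s_prime]:
            if a not in partial_actions[s]:
                C.add((s, a))
    # Case 1 (unchanged value) introduces no new violations.
```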
CG-iLAO∗ fixes a violated constraint C1$(s,a)$ by adding $a$ to the partial SSP and setting $V(s) = Q_V(s,a)$ (alg. 2). This change in $V(s)$ may create a new violation in another state, so we must track such potential violations in the same way as before. This ensures that all constraint violations are tracked and eventually fixed before termination.
Note that $V(s)$ may decrease after an update (case 3 of the constraint violations) in CG-iLAO∗ even if the heuristic used is monotonic. This is a departure from all other algorithms based on Bellman backups, where $V$ is guaranteed to be monotonically non-decreasing during execution when initialised with a monotonic heuristic. To illustrate a scenario where $V$ decreases in CG-iLAO∗, consider the SSP in fig. 1, where the monotonic heuristic $h$ is represented inside nodes. The first iterations of CG-iLAO∗ applied to this SSP are:
- Iter. 1: expands $s_0$.
- Iter. 2: partly expands the fringe state reached by $\hat\pi_V$, adding only its greedy action.
- Iter. 3: expands the next fringe state and, after CG-Backups, the values of the states on the greedy envelope are updated. Since the corresponding pair is in $\mathcal{C}$, Fix-Constrs verifies that an action outside the partial SSP is currently better than that state's existing action, so the action is added to the partial SSP and the state's value is changed from 4 to 3. Recall that this state is also a successor of another state via an action outside the partial SSP, so when its value is updated, the corresponding pair is inserted into $\mathcal{C}$, and no further changes are made in this iteration.
- Iter. 4: expands the remaining fringe state and CG-Backups reduces $V(s_0)$ from 5 to 4, i.e., $V$ has been decreased by CG-Backups.
CG-iLAO∗ generalises iLAO∗ by using a more precise separation oracle that only adds violated constraints, which translates to CG-iLAO∗ ignoring inactive actions. For a state $s$, any action that has been ignored and left out of the partial SSP is not considered by a Bellman backup of $s$, so such actions' $Q$-values are not computed. However, CG-iLAO∗ needs to compute additional $Q$-values in its separation oracle to check violations. As our experiments in section 5 show, the $Q$-values saved by ignoring inactive actions outweigh the additional $Q$-values computed by the separation oracle, which lets CG-iLAO∗ outperform iLAO∗.
To close the section, we prove that CG-iLAO∗: (i) terminates (thm. 1); (ii) tracks all constraint violations (lem. 1); and (iii) returns an $\epsilon$-consistent value function (thm. 2). Thus, CG-iLAO∗ is optimal for SSPs.
Theorem 1.
CG-iLAO∗ terminates.
Proof.
For contradiction, suppose CG-iLAO∗ does not terminate. $S$ is finite and we do not add duplicate states nor constraints, so eventually the partial SSP is fixed and no further states or actions are added. Then there must be a finite set of states $S'$ that are updated with Bellman backups infinitely often by CG-Backups and/or Fix-Constrs. But $S'$ induces a new partial SSP, and applying Bellman backups infinitely often to all of $S'$ solves this new partial SSP with VI, so $V$ must converge to a fixed point and the residual will be less than $\epsilon$ in finite time. Thus, $V$ will not be updated further, and all remaining termination conditions will be satisfied, giving us the desired contradiction. ∎
Lemma 1.
If there is $s \in \hat S$ and $a \in A(s) \setminus \hat A(s)$ such that the constraint C1$(s,a)$ is violated, i.e., $V(s) > Q_V(s,a)$, then $(s,a) \in \mathcal{C}$.
Proof.
We prove by induction over $n$, the number of updates to $V$. In the base case, $n = 0$, the partial SSP is the initial one containing only $s_0$ and no actions, so the claim is vacuously true. Now, we show the claim holds after $n+1$ updates to $V$, assuming that the claim holds for $n$ updates. Any new violations must have been introduced by the latest update to $V$ by CG-Backups or Fix-Constrs, but both add all potentially violated pairs to $\mathcal{C}$ (alg. 2), so the claim holds. ∎
Theorem 2.
CG-iLAO∗ outputs an $\epsilon$-consistent value function.
Proof.
For contradiction, suppose CG-iLAO∗ has terminated and outputs $V$ such that some $s \in S^{\hat\pi_V}$ has residual greater than $\epsilon$. By CG-iLAO∗'s termination condition (alg. 2) we know that no fringe state remains and $\mathcal{C} = \emptyset$, so CG-Backups applies Bellman backups to all states in the envelope until the residual is at most $\epsilon$ (alg. 2). Therefore, the inconsistency of $s$ must be introduced by Fix-Constrs, either directly by updating $V(s)$, or indirectly by forcing a policy change. But the residual is tracked (alg. 2) and policy changes are flagged, and both are checked in the termination condition. So, Fix-Constrs can not introduce any inconsistency either. But these two methods are the only ones affecting $V$, which yields the desired contradiction. This proves $\epsilon$-consistency (defn. 1), but previous heuristic search methods rely on the invariant $V \le V^*$ to safely prune states that can not be part of an optimal policy's envelope, and CG-iLAO∗ does not maintain this invariant. We must ensure that states outside the policy envelope whose values have become inadmissible can not lead to a cheaper policy if we apply more Bellman backups to them. Consider such an $s$ outside the greedy policy envelope with $V(s) > V^*(s)$, and for contradiction suppose that additional backups involving $s$ would yield a cheaper policy. Since states are initialised with an admissible $h$, we know that $V(s)$ can only have risen above $V^*(s)$ through an update, so by lem. 1 the corresponding violated constraint is in $\mathcal{C}$. Since Fix-Constrs overwrites $V(s)$ when processing such a violation, it must have been added in the previous call, but then $\mathcal{C} \neq \emptyset$ (alg. 2), so the termination criteria (alg. 2) are not satisfied, giving us a contradiction. Therefore, no state with an inadmissible value can improve the returned policy. Thus, if CG-iLAO∗ terminates with $V$, additional backups to states with $V(s) > V^*(s)$ would not change the solution, so we can conclude that CG-iLAO∗ outputs an $\epsilon$-consistent value function. ∎
5 Experiments
In this section we empirically compare CG-iLAO∗ to two state-of-the-art optimal heuristic search planners: iLAO∗ (Hansen and Zilberstein 2001) and LRTDP (Bonet and Geffner 2003). We also compared CG-iLAO∗ against FTVI (Dai, Weld et al. 2009), the only algorithm we are aware of that uses action elimination to prune actions, but it is uncompetitive, and its results are reported in section 7. We consider the following admissible heuristics: h-max ($h^{\max}$); lm-cut ($h^{\mathrm{lmc}}$) (Helmert and Domshlak 2009); and h-roc ($h^{\mathrm{roc}}$) (Trevizan, Thiébaux, and Haslum 2017). As in (Trevizan, Thiébaux, and Haslum 2017), we use $h^{\max}$ as a dead-end detection mechanism for problems with dead ends. We use the same $\epsilon$ for all planners and convert SSPs into dead-end free SSPs (Trevizan, Teichteil-Königsbuch, and Thiébaux 2017) with a fixed penalty for all domains except the Parc Printer variants, where a larger penalty is used due to the large cost of single actions.
On all problems, we collected 50 runs with different random seeds for each combination of planner and heuristic. We refer to a problem paired with a fixed seed as an instance. All runs have a cutoff of 30 minutes of CPU time and 8GB of memory. The experiments were conducted on a cluster of Intel Xeon 3.2 GHz CPUs and each run used a single CPU core. The LP solver used for computing $h^{\mathrm{roc}}$ was CPLEX version 20.1. We consider the following domains:
Triangle Tire World with Head-start (TW)
In the original Triangle Tire World domain (Little, Thiebaux et al. 2007; Buffet 2008), the agent is provided a map of locations in a triangular layout, as in fig. 2. The agent's task is to travel from one corner to the goal corner, but it gets a flat tyre with some probability every time it moves. Once the car gets a flat tyre, it must change the tyre if a spare is available, otherwise no action is available and the goal is no longer reachable. The agent can only store one spare at a time, and it can only obtain spare tyres in select locations (circles in fig. 2). A shortcoming of this domain is that its difficulty scales exponentially, so it may be easy to solve a problem of size $n$ and impractical to solve size $n+1$. For this reason we extend the domain by allowing the agent a head-start, that is, its starting location may be anywhere along one edge of the triangle; for instance, on problem 2 these are locations 1-1, 2-1, …, 5-1 (fig. 2 (right)). Let TW($n$, $k$) denote an instance of Triangle Tire World with Head-start, where $n$ is the problem size and $k$ denotes the distance between the agent's starting location and the goal corner. When $k$ is maximal, we obtain the original problem of size $n$, and reducing $k$ makes the problem easier until we reach the minimal $k$, the easiest variant. Experiments using LRTDP suggest a consistent relation between $n$, $k$, and CPU time. Therefore, for each size $n$, we consider 5 problems with varying head-start distances.
Probabilistic Blocks World (BW) (Buffet 2008)
As in the deterministic Blocks World from IPC, the agent is tasked with arranging blocks on a table into a particular configuration with actions to pick up blocks, put them down, or stack them. The probabilistic version adds to each action a 0.25 probability that the handled block falls onto the table. Furthermore, actions are added that allow the agent to pick up, put down, and stack a tower of three blocks; these have a 0.9 probability of the whole tower falling onto the table.
Exploding Blocks World (ExBW) (Buffet 2008)
Another variation of the deterministic Blocks World, but this time each block is rigged with an explosive that can detonate once and destroy the table or the block immediately underneath it. When a block is placed on the table or on another block, it detonates with probability 0.4 and 0.1, respectively. Once the table or a block has been destroyed, the agent can not interact with them anymore; therefore, if they are not in their goal position, the goal will be unreachable.
Probabilistic PARC Printer (PARC) (Trevizan, Thiébaux, and Haslum 2017)
This domain is a probabilistic extension of the PARC Printer domain from IPC. It models a modular printer consisting of various components, and each page scheduled for printing needs to pass through multiple components in a particular order. The goal is to optimise how each page is directed through the different components to satisfy the printing requirements. With probability 0.1, a component jams, ruining the relevant page and forcing it to be reprinted. The domain comes in two flavours: with repair (PARC-R), where jammed components can be repaired and then used again; and without repair (PARC-N), where jammed components remain unusable.
Our code and benchmarks are available at Schmalz and Trevizan (2023). We now present a summary of our findings.
What is the best planner and heuristic combination?
A common metric to evaluate planners is coverage, i.e., the total number of instances solved in a given amount of time; larger coverage is better. Fig. 3 (left) shows the coverage of each combination of planner and heuristic as a function of time. The top three combinations, all using $h^{\mathrm{roc}}$, and their total coverages are CG-iLAO∗ (1300), LRTDP (1202), and iLAO∗ (1200). Note that CG-iLAO∗ and LRTDP alternate in the top spot up to 220 seconds, after which CG-iLAO∗ has the best coverage until the experiment cutoff. Moreover, from 300 seconds onwards, CG-iLAO∗'s lead varies from 50 to 243 instances. We present a breakdown of coverage per domain in tab. 1. For each domain considered, CG-iLAO∗ reaches the highest coverage over all other planners and heuristics. For all three heuristics considered, CG-iLAO∗ also obtains the highest coverage in all domains against the other planners for the same heuristic. In tab. 2, we present the minimum and maximum speedup per domain of CG-iLAO∗ over the other planners for $h^{\mathrm{roc}}$. The speedups w.r.t. iLAO∗ range from 0.9 (i.e., 11% slower) in the largest PARC-R problem up to 3.0 in problem #9 of ExBW. Against LRTDP, the speedups range from their lowest in problem #8 of ExBW up to 8.4 in PARC-N problem s4-c3. For the performance of each planner and heuristic per problem, see section 7.
How many actions can CG-iLAO∗ ignore?
To answer this question, we look at the density of the states $s$ in the final partial SSP of CG-iLAO∗, defined as $|\hat A(s)| / |A(s)|$. Fig. 4 shows, for each domain, the cumulative plot of density, i.e., how many states contain up to and including a given proportion of their applicable actions. In all instances, at least one third of the states contain at most 50% of their applicable actions. The density is high for TW because many states have only one applicable action. The density is also high for ExBW because heuristics are comparatively weak for this domain. For the other domains the results are much stronger: between 56% and 75% of states contain at most 50% of the actions. Overall, CG-iLAO∗ added only a fraction of all possible actions to its own partial SSP, amounting to between 43% and 65% of iLAO∗'s actions.
Table 1: Coverage per domain for each planner and heuristic combination.
| | BW | ExBW | PARC-N | PARC-R | TW | Total |
Num. of instances | 300 | 250 | 300 | 250 | 200 | 1300 | |
CG-iLAO∗ | |||||||
iLAO∗ | 200 | 150 | 1200 | ||||
LRTDP | 257 | 200 | 195 | 1202 | |||
CG-iLAO∗ | 150 | 200 | 150 | 1030 | |||
iLAO∗ | 150 | 200 | 200 | 140 | 990 | ||
LRTDP | 0 | 200 | 50 | 149 | 699 | ||
CG-iLAO∗ | 150 | 200 | 150 | 0 | 161 | 661 | |
iLAO∗ | 150 | 150 | 150 | 0 | 150 | 600 | |
LRTDP | 150 | 200 | 150 | 0 | 150 | 650 |
Table 2: Minimum and maximum speedup of CG-iLAO∗ per domain using $h^{\mathrm{roc}}$.
| | BW | ExBW | PARC-R | PARC-N | TW |
iLAO | 1.1–1.4 | 1.0–3.0 | 0.9–1.3 | 1.3–2.2 | 1.4–1.6 |
LRTDP | 1.3–2.0 | 0.9–2.9 | 2.0–8.4 | 0.7–0.9 | 1.2–1.6 |
Are $Q$-values being saved?
CG-iLAO∗ can save $Q$-value computations by ignoring inactive actions, but at the cost of computing additional $Q$-values in its separation oracle. The cumulative plot over $Q$-values in fig. 3 (right) shows that the savings in $Q$-values outweigh the overhead, i.e., given a budget of $Q$-value computations, CG-iLAO∗ is capable of solving more instances than the other planners for the same heuristic. At their maximum coverage, iLAO∗ and LRTDP use substantially more $Q$-values than CG-iLAO∗. Moreover, CG-iLAO∗ reaches its maximum coverage of 1300 using a number of $Q$-values for which iLAO∗ and LRTDP only solve 1149 and 1052 instances, respectively. A similar trend is observed when using $h^{\mathrm{lmc}}$; however, when using $h^{\max}$, the least informative heuristic considered, LRTDP is slightly more $Q$-value efficient than CG-iLAO∗.
What is the impact of the heuristic on CG-iLAO∗?
Note that, in fig. 3, as the heuristic becomes more informative, the performance gain of CG-iLAO∗ over iLAO∗ and LRTDP increases. To explore this trend, we use the heuristic $h_\alpha(s) = w \cdot V^*(s)$ defined for $\alpha \in [0,1]$, where $w$ is a number selected uniformly at random from an interval around $\alpha$; this lets us quantify how informative a heuristic is (on average) through $\alpha$. The randomness in the weight ensures that the ordering of states induced by $h_\alpha$ differs from the one induced by $V^*$. Due to the high cost of computing $V^*$, we only consider the smallest problems of BW, ExBW, PARC-N, and TW. Over these problems and 50 instances each, fig. 5 shows the mean search time and 95% C.I. as $\alpha$ varies. Search time excludes time spent computing the heuristic. The ratio between CG-iLAO∗'s and iLAO∗'s runtime supports that CG-iLAO∗ scales better with better heuristics: the ratio starts at 86% and decreases to 49% and 46% for the two largest values of $\alpha$ considered. The reason for this behaviour is that good heuristics (i.e., tighter lower bounds) prevent CG-iLAO∗ from adding inactive actions to its partial SSP, resulting in more savings in $Q$-value computation. Over all values of $\alpha$, there is no statistically significant difference between CG-iLAO∗ and LRTDP, but both offer a substantial improvement over iLAO∗. This suggests that LRTDP's sampling approach leverages the information provided by the heuristic more efficiently than iLAO∗, and that CG-iLAO∗ bridges the gap between them, allowing a non-sampling-based planner to use the heuristic as effectively as LRTDP.
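For concreteness, the construction of $h_\alpha$ can be sketched as follows; the interval half-width `delta` and the per-state caching of weights are our assumptions, since the text only specifies that the weight is drawn uniformly at random around $\alpha$.

```python
import random

def make_h_alpha(V_star, alpha, delta=0.05, seed=0):
    """Sketch of the alpha-weighted heuristic: h_alpha(s) = w_s * V*(s), with
    w_s drawn uniformly from [alpha - delta, alpha + delta] (delta is our
    assumption). Weights are fixed per state so the heuristic is well defined
    across repeated evaluations."""
    rng = random.Random(seed)
    weights = {s: rng.uniform(alpha - delta, alpha + delta) for s in V_star}
    return lambda s: weights[s] * V_star[s]
```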
6 Conclusion, Related and Future Work
Building on existing connections between operations research and planning, we presented a new interpretation of heuristic search on SSPs as solving LPs using variable and constraint generation. We exploit this equivalence to introduce a new and efficient separation oracle for SSPs, which enables a search algorithm to selectively add actions when they are deemed necessary to find the optimal solution. This addresses the shortcoming of state-of-the-art algorithms that add all applicable actions during state expansion, with no mechanism for ignoring actions that will not contribute to the solution. Using this principle, we generalised iLAO∗ into a new optimal heuristic search algorithm, CG-iLAO∗. Empirical evaluation showed that CG-iLAO∗'s ability to consider a subset of actions results in significant savings in the number of $Q$-values computed, which in turn reduces the runtime of the algorithm compared to the state-of-the-art.
Regarding related work, LP 1 has been approximated for factored MDPs to get a more compact LP, called the approximate LP (ALP). Schuurmans and Patrascu (2001) apply constraint generation to the ALP; their separation oracle has a similar condition for adding constraints as CG-iLAO∗; however, it checks for the condition naively, which is only practical on the compact ALP offered by factored MDPs, and is infeasible for SSPs. Constraint generation lends itself well to complex planning problems where a relaxation can be efficiently solved and constraint violations by the relaxed solution can be efficiently detected, e.g., in multiagent planning (Calliess and Roberts 2021) and metric hybrid factored planning in nonlinear domains (Say and Sanner 2019). For POMDPs, an LP with constraint generation can be used to prune unneeded vectors from the set of vectors used to represent the value function (Walraven and Spaan 2017). In all these works, the separation oracle either naively checks all possible constraints or relies on sampling to find violations.
As future work, we aim to expand the application of CG-iLAO∗ to more complex models that can benefit from our iterative method of generating applicable actions. Models with imprecise parameters, such as MDPIPs and MDPSTs (White III and Eldeib 1994; Trevizan, Cozman, and Barros 2007), are suitable candidates for our approach since they have a minimax semantics for the Bellman equations. In this minimax semantics, the value function minimises the expected cost-to-go assuming that an adversary aims to maximise the cost-to-go by selecting the values of the imprecise parameters. As a result, computing a $Q$-value in these models requires solving a maximisation problem; therefore, ignoring inactive actions could lead to significant improvements in performance.
Other suitable models include SSPs with PLTL constraints (Baumgartner, Thiébaux, and Trevizan 2018; Mallet, Thiébaux, and Trevizan 2021) in which both the state space and action space are augmented to keep track of constraint violations. In these models, the concept of inactive actions can be extended to also prevent adding actions that lead to constraint violations to their partial problems. The methods presented in this paper may also be applicable to model checking more broadly. In particular, there has been work investigating how to use heuristics to guide the search for probabilistic reachability (Brázdil et al. 2014), in which action elimination is applicable.
Acknowledgements
We thank the anonymous reviewers for their feedback. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.
References
- Baumgartner, Thiébaux, and Trevizan (2018) Baumgartner, P.; Thiébaux, S.; and Trevizan, F. 2018. Heuristic Search Planning With Multi-Objective Probabilistic LTL Constraints. In Proc. of 16th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR).
- Bellman (1957) Bellman, R. 1957. Dynamic programming. Princeton University Press.
- Bertsekas (1995) Bertsekas, D. 1995. Dynamic Programming and Optimal Control, volume 2. Athena Scientific.
- Bertsekas and Tsitsiklis (1991) Bertsekas, D.; and Tsitsiklis, J. 1991. An Analysis of Stochastic Shortest Path Problems. Mathematics of Operations Research.
- Bertsimas and Tsitsiklis (1997) Bertsimas, D.; and Tsitsiklis, J. 1997. Introduction to Linear Optimization. Athena Scientific.
- Bonet and Geffner (2003) Bonet, B.; and Geffner, H. 2003. Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming. In Proc. of 13th Int. Conf. on Automated Planning and Scheduling (ICAPS).
- Brázdil et al. (2014) Brázdil, T.; Chatterjee, K.; Chmelík, M.; Forejt, V.; Křetínský, J.; Kwiatkowska, M.; Parker, D.; and Ujma, M. 2014. Verification of Markov Decision Processes Using Learning Algorithms. In Automated Technology for Verification and Analysis.
- Buffet (2008) Buffet, D. 2008. International Planning Competition Uncertainty Part: Benchmarks and Results.
- Calliess and Roberts (2021) Calliess, J.-P.; and Roberts, S. 2021. Multi-Agent Planning with Mixed-Integer Programming and Adaptive Interaction Constraint Generation (Extended Abstract). Proc. of 14th Symposium on Combinatorial Search (SoCS).
- Dai, Weld et al. (2009) Dai, P.; Weld, D.; et al. 2009. Focused topological value iteration. In Proc. of 19th Int. Conf. on Automated Planning and Scheduling (ICAPS), volume 19, 82–89.
- Hansen and Zilberstein (2001) Hansen, E.; and Zilberstein, S. 2001. LAO∗: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence.
- Helmert and Domshlak (2009) Helmert, M.; and Domshlak, C. 2009. Landmarks, Critical Paths and Abstractions: What’s the Difference Anyway? Proc. of 19th Int. Conf. on Automated Planning and Scheduling (ICAPS).
- Little, Thiebaux et al. (2007) Little, I.; Thiebaux, S.; et al. 2007. Probabilistic planning vs. replanning. In ICAPS Workshop on IPC: Past, Present and Future.
- Mallet, Thiébaux, and Trevizan (2021) Mallet, I.; Thiébaux, S.; and Trevizan, F. 2021. Progression Heuristics for Planning with Probabilistic LTL Constraints. In Proc. of 35th AAAI Conference on Artificial Intelligence.
- McMahan, Likhachev, and Gordon (2005) McMahan, B.; Likhachev, M.; and Gordon, G. 2005. Bounded Real-Time Dynamic Programming: RTDP with Monotone Upper Bounds and Performance Guarantees. In Proc. of 22nd Int. Conf. on Machine Learning.
- Sanner et al. (2009) Sanner, S.; Goetschalckx, R.; Driessens, K.; and Shani, G. 2009. Bayesian real-time dynamic programming. In Proc. of 21st Int. Joint Conf. on AI (IJCAI).
- Say and Sanner (2019) Say, B.; and Sanner, S. 2019. Metric Hybrid Factored Planning in Nonlinear Domains with Constraint Generation, 502–518. Springer International Publishing.
- Schmalz and Trevizan (2023) Schmalz, J.; and Trevizan, F. 2023. Code, benchmarks, and technical report for AAAI 2024 paper “Efficient Constraint Generation for Stochastic Shortest Path Problems”. https://doi.org/10.5281/zenodo.10344842.
- Schuurmans and Patrascu (2001) Schuurmans, D.; and Patrascu, R. 2001. Direct value-approximation for factored MDPs. In Advances in Neural Information Processing Systems (NeurIPS).
- Smith and Simmons (2006) Smith, T.; and Simmons, R. 2006. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In Proc. of 20th AAAI Conf. on Artificial Intelligence.
- Teichteil-Königsbuch (2012) Teichteil-Königsbuch, F. 2012. Stochastic safest and shortest path problems. In Proc. of 26th AAAI Conf. on AI.
- Trevizan, Cozman, and Barros (2007) Trevizan, F.; Cozman, F. G.; and Barros, L. N. 2007. Planning under Risk and Knightian Uncertainty. In Proc. of 20th Int. Joint Conf. on AI (IJCAI).
- Trevizan, Teichteil-Königsbuch, and Thiébaux (2017) Trevizan, F.; Teichteil-Königsbuch, F.; and Thiébaux, S. 2017. Efficient Solutions for Stochastic Shortest Path Problems with Dead Ends. In Proc. of 33rd Int. Conf. on Uncertainty in Artificial Intelligence (UAI).
- Trevizan, Thiébaux, and Haslum (2017) Trevizan, F.; Thiébaux, S.; and Haslum, P. 2017. Occupation Measure Heuristics for Probabilistic Planning. In Proc. of 27th Int. Conf. on Automated Planning and Scheduling (ICAPS).
- Walraven and Spaan (2017) Walraven, E.; and Spaan, M. 2017. Accelerated Vector Pruning for Optimal POMDP Solvers. Proc. of 31st AAAI Conf. on AI.
- White III and Eldeib (1994) White III, C. C.; and Eldeib, H. K. 1994. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4): 739–749.
7 Appendix
We revisit the key questions of our experiments section by presenting additional results and explaining them.
What is the best planner and heuristic combination?
We report the performance statistics of each planner and heuristic combination per benchmark problem in tabs. 4 to 8. Concretely, we report the coverage over 50 runs, and over the runs that terminated we report the mean and 95% confidence interval associated with CPU time, $Q$-values computed, and the number of calls to the heuristic. For each problem (column) we highlight the planner and heuristic combinations with highest coverage, and among these combinations with tied highest coverage, we highlight the 95% confidence intervals that are tied as the best (lowest) for each problem. Formally, an interval $[a, b]$ is highlighted if there is no other interval $[a', b']$ for that problem s.t. $b' < a$. The highlighted intervals are also known as non-dominated intervals since there is no interval that is strictly better. If a planner and heuristic combination is missing from a table, then that combination had a 0% coverage over the relevant domain, e.g., all FTVI combinations are missing from PARC-N.
How many actions can CG-iLAO∗ ignore?
Following on from the relevant experiment in section 5, which considers the largest solved problems of each domain, we report how many actions CG-iLAO∗ can ignore in tab. 3. In particular, we show the number of actions in CG-iLAO∗'s final partial SSP, compared with the potential number of actions that could be in CG-iLAO∗'s partial SSP if all applicable actions were added for all partial states, and the number of actions in iLAO∗'s final partial SSP. CG-iLAO∗'s partial actions are significantly fewer than iLAO∗'s total actions, i.e., CG-iLAO∗ adds significantly fewer actions than iLAO∗, which supports our claim that, with a sufficiently informative heuristic, CG-iLAO∗ considers fewer actions and thereby saves computation.
Table 3: Number of actions in the final partial SSP.
partial actions CG-iLAO∗ | potential actions CG-iLAO∗ | total actions iLAO∗ | |
---|---|---|---|
BW | |||
ExBW | |||
PARC-R | |||
PARC-N | |||
TW |
What is the impact of the heuristic on CG-iLAO∗?
We have already shown how the search time of CG-iLAO∗, iLAO∗, and LRTDP is affected as $\alpha$ varies; we now show how the number of $Q$-values is affected in fig. 6. As $\alpha$ increases, CG-iLAO∗ uses the fewest $Q$-values. This is in line with our other results, but it is important to note that, for small $\alpha$, CG-iLAO∗ does not use fewer $Q$-values than LRTDP, which suggests that CG-iLAO∗ relies on a reasonably informative heuristic, and then scales best as informativeness increases.
Table 4: Results for BW.
algorithm | heuristic | 8-24967 | 8-23171 | 8-25241 | 10-14262 | 10-19475 | 12-19848
CG-iLAO∗ | |||||||
FTVI | |||||||
iLAO∗ | |||||||
LRTDP | 7 | ||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP |
Table 5: Results for ExBW.
algorithm | heuristic | p07-n7-N9-s7 | p08-n8-N10-s8 | p09-n9-N11-s9 | p10-n10-N12-s10 | p11-n11-N13-s11
CG-iLAO∗ | ||||||
FTVI | ||||||
iLAO∗ | ||||||
LRTDP | ||||||
CG-iLAO∗ | ||||||
FTVI | ||||||
iLAO∗ | ||||||
LRTDP | ||||||
CG-iLAO∗ | ||||||
iLAO∗ | ||||||
LRTDP |
Table 6: Results for PARC-N.
algorithm | heuristic | s4-c1 | s4-c2 | s4-c3 | s5-c1 | s5-c2 | s5-c3
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP | |||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP | |||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP |
Table 7: Results for PARC-R.
algorithm | heuristic | s4-c1 | s4-c2 | s4-c3 | s5-c1 | s5-c2
CG-iLAO∗ | ||||||
FTVI | ||||||
iLAO∗ | ||||||
LRTDP | ||||||
CG-iLAO∗ | ||||||
FTVI | 20 | |||||
iLAO∗ | ||||||
LRTDP |
Table 8: Results for TW.
algorithm | heuristic | (original TW 4) | (original TW 5) | |
CG-iLAO∗ | |||||
FTVI | |||||
iLAO∗ | |||||
LRTDP | 45 | ||||
CG-iLAO∗ | |||||
FTVI | 37 | ||||
iLAO∗ | 40 | ||||
LRTDP | 49 | ||||
CG-iLAO∗ | 11 | ||||
FTVI | |||||
iLAO∗ | |||||
LRTDP |