Efficient Constraint Generation for Stochastic Shortest Path Problems
Abstract
Current methods for solving Stochastic Shortest Path Problems (SSPs) find states’ costs-to-go by applying Bellman backups, where state-of-the-art methods employ heuristics to select states to back up and to prune states. A fundamental limitation of these algorithms is their need to compute the cost-to-go for every applicable action during each state backup, leading to unnecessary computation for actions identified as sub-optimal. We present new connections between planning and operations research and, using this framework, we address this issue of unnecessary computation by introducing an efficient version of constraint generation for SSPs. This technique allows algorithms to ignore sub-optimal actions and avoid computing their costs-to-go. We also apply our novel technique to iLAO∗, resulting in a new algorithm, CG-iLAO∗. Our experiments show that CG-iLAO∗ ignores up to 57% of iLAO∗’s actions and that it solves problems up to 8× and 3× faster than LRTDP and iLAO∗.
1 Introduction
Planning is an important facet of AI that provides efficient algorithms for solving real-world problems. Stochastic Shortest Path problems (SSPs) (Bertsekas and Tsitsiklis 1991) generalise classical (deterministic) planning by introducing actions with probabilistic effects, which lets us model problems where the actions are intrinsically probabilistic. Value Iteration (VI) (Bellman 1957) is a dynamic programming algorithm that forms the basis of optimal algorithms for solving SSPs. VI finds the cost-to-go for each state, which describes the solution of an SSP. A state $s$’s cost-to-go is the minimum expected cost of reaching a goal from $s$, and similarly an action $a$’s cost-to-go is the minimum expected cost of reaching a goal after applying $a$. VI finds the optimal cost-to-go by iteratively applying Bellman backups, which update each state’s cost-to-go with the minimum of its applicable actions’ costs-to-go.
LRTDP (Bonet and Geffner 2003) and iLAO∗ (Hansen and Zilberstein 2001), the state-of-the-art algorithms for optimally solving SSPs, build on VI and offer significant speedups by using heuristics to apply Bellman backups only to promising states and by pruning states that are deemed too expensive. A shortcoming of such algorithms is that each Bellman backup must consider all applicable actions. For instance, let $s$ be a state and $a$ an applicable action: even if all successors of $a$ will be pruned because they are too expensive, a Bellman backup on $s$ still computes the $Q$-value of $a$, so these algorithms can prune unpromising states but not actions. This issue is compounded because algorithms for SSPs require arbitrarily many Bellman backups on each state to find the optimal solution, thus wasting time on computing the $Q$-values of such actions many times.
This issue of computing unnecessary $Q$-values for a state is addressed by action elimination (Bertsekas 1995), which can be implemented in search algorithms to prune useless actions. Action elimination looks for pairs of applicable actions $a$ and $a'$ in a state, such that a lower bound on $a$’s cost-to-go exceeds an upper bound on $a'$’s cost-to-go, in which case $a$ is proved to be a useless action and can be pruned. Although domain-independent lower bounds (heuristics) can be computed efficiently, finding an efficient, domain-independent upper bound remains an open question to the best of our knowledge. This gap has limited the use of action elimination in domain-independent planning. In the context of optimal heuristic planning for SSPs, the only algorithm we are aware of that utilises action elimination to prune actions is FTVI (Dai, Weld et al. 2009). Other algorithms, such as BRTDP (McMahan, Likhachev, and Gordon 2005), FRTDP (Smith and Simmons 2006), and VPI-RTDP (Sanner et al. 2009), use upper bounds in conjunction with lower bounds to guide their search. However, unlike FTVI, they do not perform action elimination.
We present a general technique for ignoring actions that does not rely on upper bounds. In contrast to action elimination that incrementally prunes useless actions, our approach initially treats all actions as inactive, i.e., not contributing to the solution. It then iteratively adds actions to the search that are guaranteed to improve the current solution. To develop our approach, we strengthen the connections between planning and operations research by relating heuristic search to variable and constraint generation. Similar to heuristic search, variable and constraint generation enable the solving of large Linear Programs (LPs) by considering only a subset of variables and constraints. We show that algorithms such as iLAO∗ implicitly perform constraint generation, albeit in a trivial manner, to the LP encoding of VI. Building on this, we introduce an efficient algorithm for constraint generation for SSPs that leads to inactive actions being ignored. We apply our approach to iLAO∗ to get the novel algorithm: CG-iLAO∗.
In our experiments, CG-iLAO∗ solves problems up to 8× and 3× faster than LRTDP and iLAO∗, respectively. CG-iLAO∗ is faster than the others over various problem difficulties: its improvement is apparent from problems that require only 4 minutes to solve, and the improvement gap increases as problems take longer to solve. To explain this, we quantify that CG-iLAO∗ only considers 43–65% of iLAO∗’s total actions, thus fewer actions’ costs-to-go are computed. Investigating further, we empirically show that CG-iLAO∗ combines iLAO∗’s efficient use of backups and LRTDP’s strong pruning capability, thereby displaying the best characteristics of both in a single algorithm.
The structure of this paper is as follows: we first introduce the background for SSPs and existing methods for solving SSPs. Second, we give a brief background to linear programming and connect linear programming techniques to heuristic search. Then we explain and motivate our novel algorithm CG-iLAO∗ and prove its correctness. Finally, we empirically evaluate the performance of CG-iLAO∗.
2 Background
A Stochastic Shortest Path problem (SSP) (Bertsekas and Tsitsiklis 1991) is a tuple $\mathbb{S} = \langle S, s_0, G, A, P, C \rangle$ where: $S$ is a finite set of states; $s_0 \in S$ is the initial state; $G \subseteq S$ is the set of goal states; $A$ is a finite set of actions, and $A(s) \subseteq A$ denotes the actions applicable in state $s$; $P(s' \mid s, a)$ gives the probability that applying action $a \in A(s)$ in state $s$ results in state $s'$; and $C(s,a) \in \mathbb{R}_{>0}$ is the cost of applying $a \in A(s)$ in $s$.
The states immediately reachable by applying $a$ to $s$ are called successors and are given by $\mathrm{succ}(s,a) = \{s' \in S : P(s' \mid s,a) > 0\}$. A solution to an SSP is given by a map $\pi$ from states to actions, called a policy. A policy $\pi$ is closed w.r.t. $s_0$ if each state $s$ that can be reached by following $\pi$ from $s_0$ is either a goal or has $\pi(s)$ defined. A policy $\pi$ is proper w.r.t. $s_0$ if it reaches the goal with probability 1 from $s_0$, and it is improper otherwise. An optimal policy $\pi^*$ is any proper policy that minimises the expected cost of reaching the goal from the initial state $s_0$.
For simplicity, we make two standard assumptions: (i) there exists at least one proper policy w.r.t. $s_0$, which is called the reachability assumption; and (ii) all improper policies have infinite expected cost. A consequence of assumption (ii) is that any optimal policy is proper. Note that we define $C$ to be strictly positive in order to avoid zero-cost cycles that would violate assumption (ii). In our experiments, we relax the reachability assumption by applying the fixed-penalty transformation of SSPs (Trevizan, Teichteil-Königsbuch, and Thiébaux 2017), resulting in a new SSP without dead ends. Other approaches for handling SSPs with dead ends, such as S3P (Teichteil-Königsbuch 2012), are also compatible with our approach.
An SSP’s optimal solution is uniquely represented by the optimal value function $V^*$, which is the unique fixed-point solution to the Bellman equations:

$$V^*(s) = \begin{cases} 0 & \text{if } s \in G \\ \min_{a \in A(s)} Q^*(s,a) & \text{otherwise} \end{cases} \qquad (1)$$
where $Q^*(s,a) = C(s,a) + \sum_{s' \in \mathrm{succ}(s,a)} P(s' \mid s,a)\, V^*(s')$ is known as the $Q$-value of $s$ and $a$. An (optimal) value function $V$ ($V^*$) and the associated $Q$-values represent the (minimum) expected cost to reach the goal from state $s$ and after executing action $a$ in state $s$, respectively. Given a value function $V$, the policy associated with it is defined as $\pi_V(s) \in \operatorname{argmin}_{a \in A(s)} Q_V(s,a)$ and is known as the greedy policy for $V$. Ties can be broken arbitrarily thanks to assumption (ii), and for simplicity we assume tie-breaking rules that ensure the greedy policy is unique.
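To make these definitions concrete, here is a minimal sketch in Python; the `ssp` interface used below (`ssp.actions(s)`, `ssp.cost(s, a)`, and `ssp.transition(s, a)` returning a dict from successor states to probabilities) is our assumption for illustration, not an interface from the paper.

```python
# Minimal sketch of Q-values and greedy policies; the `ssp` interface
# (actions, cost, transition) is assumed for illustration only.

def q_value(ssp, V, s, a):
    """Q_V(s, a) = C(s, a) + sum_{s'} P(s' | s, a) * V(s')."""
    return ssp.cost(s, a) + sum(p * V[t] for t, p in ssp.transition(s, a).items())

def greedy_action(ssp, V, s):
    """An action minimising Q_V(s, .); ties broken deterministically by min()."""
    return min(ssp.actions(s), key=lambda a: q_value(ssp, V, s, a))
```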
The Bellman equations (1) can be iteratively solved by Value Iteration (VI) (Bellman 1957): VI starts with an arbitrary value function $V_0$ and computes $V_{t+1}(s) = \min_{a \in A(s)} Q_{V_t}(s,a)$ over all states, where $Q_{V_t}$ uses the previous value function $V_t$. This process of computing $V_{t+1}(s)$ using $V_t$ is called a Bellman backup. VI is guaranteed to asymptotically converge to $V^*$ regardless of $V_0$. For practical reasons, VI is terminated at iteration $t$ when the Bellman residual $|V_t(s) - V_{t-1}(s)|$ is less than or equal to $\epsilon$ for all $s \in S$.
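Building on the sketch above, a VI loop with the residual-based stop criterion might look as follows; this is a Gauss–Seidel-style variant that updates $V$ in place, and `ssp.is_goal(s)` is again an assumed helper.

```python
def value_iteration(ssp, states, h, epsilon=1e-4):
    """Sketch of VI: initialise V (here from a heuristic h) and apply Bellman
    backups to every non-goal state until the residual is at most epsilon."""
    V = {s: 0.0 if ssp.is_goal(s) else h(s) for s in states}
    while True:
        residual = 0.0
        for s in states:
            if ssp.is_goal(s):
                continue
            new_v = min(q_value(ssp, V, s, a) for a in ssp.actions(s))  # backup
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual <= epsilon:
            return V
```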
Given a policy $\pi$, its policy envelope $S^{\pi}$ is the set of all states reachable when following $\pi$ from $s_0$. Note that VI explores the complete state space regardless of the size of the optimal policy envelope $S^{\pi^*}$. To address this shortcoming, heuristic search algorithms such as iLAO∗ (Hansen and Zilberstein 2001) and LRTDP (Bonet and Geffner 2003) use a heuristic function $h$ to initialise $V$, which guides the exploration of the state space in a way that expands as few states as possible. To find $V^*$, heuristic search algorithms require the heuristic to be admissible, that is, $h(s) \le V^*(s)$ for all $s \in S$. Often heuristics are also monotonic, which immediately implies admissibility. A value function $V$ is monotonic if $V(s) \le \min_{a \in A(s)} Q_V(s,a)$ for all $s \in S \setminus G$, and the definition is analogous for $h$. Similar to VI, heuristic search algorithms converge to $V^*$ asymptotically and require a practical stop criterion. This stop criterion is known as $\epsilon$-consistency (Bonet and Geffner 2003) and is defined as:
Definition 1 ($\epsilon$-consistency).
A value function $V$ is $\epsilon$-consistent if $|V(s) - \min_{a \in A(s)} Q_V(s,a)| \le \epsilon$ for all $s \in S^{\pi_V}$.
Notice that $\epsilon$-consistency checks the residual only on the policy envelope of a greedy policy, and states outside the envelope are permitted to have a residual larger than $\epsilon$.
We close this section by reviewing iLAO∗ (Hansen and Zilberstein 2001). iLAO∗ (alg. 1) is an iterative algorithm which works by incrementally growing a partial SSP (called the explicit graph in the original paper):
Definition 2 (Partial SSP).
Given an SSP $\mathbb{S} = \langle S, s_0, G, A, P, C \rangle$ and a heuristic $h$, a partial SSP is an SSP with terminal costs $\hat{\mathbb{S}}$ defined over a subset of states $\hat S \subseteq S$ with $s_0 \in \hat S$, a subset of actions $\hat A(s) \subseteq A(s)$ for each $s \in \hat S$, goals $\hat G \supseteq G \cap \hat S$, and terminal cost $h(s)$ for each $s \in \hat G \setminus G$.
SSPs with terminal costs have a one-time terminal cost that is incurred when a terminal state is reached, so their Bellman equations are (1) with the goal case replaced by $V^*(s) = h(s)$ for all $s \in \hat G \setminus G$. It is trivial to see that SSPs with terminal costs are equivalent to SSPs, and we use them to simplify our presentation. In a partial SSP $\hat{\mathbb{S}}$, we refer to the states in $\hat G \setminus G$ as artificial goals, and we define $\hat\pi_V$ to be the greedy policy for $V$ over $\hat{\mathbb{S}}$ restricted to $\hat A$.
At each iteration, iLAO∗ grows its partial SSP by expanding the artificial goals reachable by $\hat\pi_V$ into regular states. To expand a state $s$, iLAO∗ adds $s$'s applicable actions to $\hat A$ and adds the newly reachable states as artificial goals (alg. 1). This artificial goal expansion is done to make $\hat\pi_V$ eventually closed w.r.t. $s_0$ for the original SSP $\mathbb{S}$. Simultaneously, iLAO∗ also works towards making $V$ $\epsilon$-consistent by applying a Bellman backup to all the states reachable by $\hat\pi_V$ (alg. 1). These Bellman backups are ordered by a post-order DFS traversal of the envelope, so states that occur closer to artificial goals are updated first, and $s_0$ is updated last. Note that, when a state $s$ is expanded, it may have a successor $s'$ already within the partial SSP. If this happens, the DFS in alg. 1 must keep traversing from $s'$ to ensure the policy envelope is accurate.
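The following high-level sketch illustrates the structure of one iLAO∗ iteration as described above; it is a simplification of alg. 1, not a faithful reproduction, and `greedy_envelope` is an assumed helper that returns the states reachable by the greedy policy from $s_0$ in post-order DFS order ($s_0$ last), treating unexpanded states as artificial goals.

```python
def ilao_star(ssp, s0, h, epsilon=1e-4):
    """Simplified sketch of iLAO*'s main loop: expand artificial goals
    reachable by the greedy policy, then back up the envelope in post-order."""
    V = {s0: h(s0)}
    expanded = set()                           # fully expanded states
    while True:
        envelope = greedy_envelope(ssp, V, s0, expanded)
        fringe = [s for s in envelope if s not in expanded and not ssp.is_goal(s)]
        for s in fringe:                       # expand artificial goals
            expanded.add(s)
            for a in ssp.actions(s):
                for t in ssp.transition(s, a):
                    V.setdefault(t, h(t))      # new states become artificial goals
        residual = 0.0
        for s in envelope:                     # backups in post-order, s0 last
            if s in expanded and not ssp.is_goal(s):
                new_v = min(q_value(ssp, V, s, a) for a in ssp.actions(s))
                residual = max(residual, abs(new_v - V[s]))
                V[s] = new_v
        if not fringe and residual <= epsilon:
            return V
```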
In the next section, we show how to interpret iLAO∗ through the lens of Operations Research by relating it to techniques used for handling large linear programs.
3 iLAO∗ as Linear Program
SSPs can be solved by the Linear Program (LP) presented in LP 1. This LP is known as the primal LP or VI LP since it directly encodes the Bellman equations.
(LP 1) $\max_V \; V(s_0)$ subject to:
$V(s) \le C(s,a) + \sum_{s' \in \mathrm{succ}(s,a)} P(s' \mid s,a)\, V(s') \quad \forall s \in S \setminus G,\ a \in A(s)$ (C1)
$V(s) = 0 \quad \forall s \in G$ (C2)
Each variable $V(s)$ represents the cost-to-go of state $s$, and for each state $s$ the relevant constraints C1 encode $V(s) \le Q(s,a)$ for all $a \in A(s)$. When clear from context, we use C1$(s,a)$ to refer to the constraint C1 for $s$ and $a$, and $Q(s,a)$ for its right-hand side. Together with the objective that maximises $V(s_0)$, we obtain the Bellman equations (1) for the states in the optimal policy envelope $S^{\pi^*}$, i.e., the constraints are active (tight) for the pairs $(s, \pi^*(s))$ for all $s \in S^{\pi^*}$ and inactive (slack) everywhere else. We reframe iLAO∗’s incremental growing of its partial SSP as solving LP 1 with variable and constraint generation (Bertsimas and Tsitsiklis 1997). (Variable generation is also known as column generation, and constraint generation as the cutting plane method.)
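As an illustration, consider a tiny SSP with states $\{s_0, s_1, g\}$: from $s_0$, action $a_1$ (cost 1) leads to $s_1$ and action $a_2$ (cost 4) leads to $g$; from $s_1$, action $b$ (cost 1) reaches $g$ with probability 0.5 and stays in $s_1$ otherwise. The sketch below solves LP 1 for this SSP with SciPy; the instance is ours, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: x = [V(s0), V(s1)]; the goal g is fixed to V(g) = 0 (constraint C2).
# Constraints C1, rewritten as V(s) - sum_{s'} P(s'|s,a) V(s') <= C(s,a):
#   s0, a1 (cost 1, -> s1):            V(s0) - V(s1)     <= 1
#   s0, a2 (cost 4, -> g):             V(s0)             <= 4
#   s1, b  (cost 1, 0.5 g / 0.5 s1):   V(s1) - 0.5*V(s1) <= 1
A_ub = np.array([[1.0, -1.0],
                 [1.0,  0.0],
                 [0.0,  0.5]])
b_ub = np.array([1.0, 4.0, 1.0])
c = np.array([-1.0, 0.0])     # maximise V(s0) == minimise -V(s0)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)                  # -> [3., 2.]: V*(s0) = 3 and V*(s1) = 2
```

At the optimum, the constraints for $(s_0, a_1)$ and $(s_1, b)$ are tight while the one for $(s_0, a_2)$ is slack, matching the discussion above.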
Variable generation is a technique from Operations Research that enables us to solve LPs with a large number of variables by considering only a subset of them. Given an LP with missing variables, called the Reduced Master Problem (RMP), variable generation finds a set of variables outside the RMP whose addition lets the RMP’s solution quality improve. Variable generation provides a sound and complete method to select such variables and a stop criterion that ensures the optimal solution for the RMP is also optimal for the original LP. Heuristic search algorithms, such as iLAO∗, can be seen as variable generation algorithms over LP 1, where each partial SSP represents an RMP with the subset of variables $\{V(s) : s \in \hat S\}$. For iLAO∗, the variable selection mechanism is inherited from A∗ and is represented by the expansion of the artificial goals reachable by $\hat\pi_V$ (alg. 1): for each expanded state $s$ and $a \in A(s)$, we add the variables $V(s')$ for $s' \in \mathrm{succ}(s,a)$ with $s' \notin \hat S$.
Constraint generation is a similar technique which enables the solving of LPs with a large (potentially infinite) number of constraints. The key idea is that the optimal solution of an LP with many constraints only makes a small number of constraints active, thus only a subset of constraints is needed to characterise this optimal solution. In constraint generation, the intermediate LPs are known as relaxed LPs since they relax the original LP by removing one or more constraints. Given a relaxed LP, constraint generation finds one or more constraints in the original LP that are violated by the optimal solution of the relaxed LP. By iteratively adding these violated constraints and re-optimising the new relaxed LP, a sequence of relaxed LPs with an increasing number of constraints is generated. When no violations are found, the optimal solution of the relaxed LP is an optimal solution for the original LP. The algorithm used to check constraint violations is called a separation oracle and the effectiveness of constraint generation relies on the availability of an efficient separation oracle for the original LP.
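In its generic form, the loop just described can be sketched as follows; `solve_relaxed_lp` and `separation_oracle` are placeholders for problem-specific components, not names from the paper.

```python
def constraint_generation(solve_relaxed_lp, separation_oracle, constraints):
    """Generic constraint generation: solve the relaxed LP, ask the separation
    oracle which original constraints the relaxed solution violates, add them,
    and repeat; when nothing is violated, the relaxed optimum is optimal for
    the original LP."""
    while True:
        solution = solve_relaxed_lp(constraints)
        violated = separation_oracle(solution)  # empty when no violation exists
        if not violated:
            return solution
        constraints.extend(violated)
```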
In the case of iLAO∗, constraint generation adds all the actions of the expanded artificial goal $s$ (alg. 1), where each action $a$ added to the partial SSP implicitly represents the constraint C1$(s,a)$. This separation oracle is computationally cheap since no checks are performed to detect whether a new constraint is needed or not, at the cost that inactive constraints are unnecessarily added to the partial SSP. In the next section, we present our new algorithm that uses an efficient separation oracle that only adds violated constraints.
4 CG-iLAO∗
In iLAO∗, as in most search algorithms, each state is either unexpanded or fully expanded, i.e., either none or all of its applicable actions are considered, and it is not possible to ignore just a subset of the applicable actions. We start by defining which actions can be safely ignored in a partial SSP.
Definition 3 (Inactive Action).
Consider an SSP $\mathbb{S}$, its partial SSP $\hat{\mathbb{S}}$, a value function $V$, and a state $s \in \hat S$. An action $a \in A(s) \setminus \hat A(s)$ is inactive in state $s$ if $V(s) < Q_V(s,a)$.
An inactive action $a$ for $s$ represents the inactive constraint C1$(s,a)$ in LP 1. Since inactive actions are not in the partial SSP and their associated constraints are inactive, adding them to the partial SSP does not change the solution and only adds overhead in the form of $Q$-value computations for sub-optimal actions.
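Under the assumed interface from the earlier sketches, this definition is a direct predicate; `partial_actions[s]` is a hypothetical container holding $\hat A(s)$.

```python
def is_inactive(ssp, V, s, a, partial_actions):
    """Defn. 3 as a predicate: an action a outside the partial SSP is inactive
    in s when its constraint C1(s, a) is slack under V, i.e., V(s) < Q_V(s, a);
    such an action cannot improve the current solution and can be ignored."""
    return a not in partial_actions[s] and V[s] < q_value(ssp, V, s, a)
```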
We generalise iLAO∗ by allowing states to be partially expanded, so these states only have a subset of actions available in the partial SSP. Under the lens of linear programming, we use constraint generation to identify and add actions that may be needed to encode the optimal solution and to ignore inactive actions. We call this algorithm Constraint-Generation iLAO∗ (CG-iLAO∗).
CG-iLAO∗ is presented in alg. 2. One of the defining changes from iLAO∗ is in CG-iLAO∗’s expanding phase (alg. 2), where Partly-Expand-Fringes only expands a state $s$ with the actions that are greedy for $V$ in $s$, rather than all applicable actions. This introduces two challenges: (i) an action $a$ that was not added by the partial expansion may need to be added later, when $V$ is more accurate; and (ii) when such an action $a$ is added, $V(s)$ must be updated to reflect $a$’s availability. If (ii) is not addressed, the reduction to $V(s)$ offered by $a$ is not propagated, potentially leading to a suboptimal solution since $V$ would overestimate $V^*$. The key insight of our algorithm is that both of these challenges are instances of constraint violation. Thus, we can solve both issues by finding which constraints are violated with a separation oracle and enforcing them in the style of constraint generation.
The trivial separation oracle checks the constraints of all pairs $s \in \hat S$ and $a \in A(s)$ for violations; this is needlessly expensive since some non-violated constraints remain non-violated from one iteration to the next. Our separation oracle exploits this persistence between iterations by tracking changes in $V$ to compute a subset of constraints that could potentially be violated, using the following rules (alg. 2): suppose $V(s')$ is assigned a new value, then there are three cases:

1. $V(s')$ stays the same. No new constraint violations.

2. $V(s')$ increases. The constraints that may be violated are C1$(s',a)$ for $a \in A(s') \setminus \hat A(s')$. Note that if $a$ is already inside the partial SSP, i.e., $a \in \hat A(s')$, its constraint can not be violated since the Bellman backup guarantees $V(s') \le Q_V(s',a)$.

3. $V(s')$ decreases. The only constraints that may be violated are C1$(s,a)$ for pairs with $a \in A(s) \setminus \hat A(s)$ and $s' \in \mathrm{succ}(s,a)$.

We store potential violations in the set $\mathcal{C}$, i.e., if $V(s')$ increases, the pairs $\{(s',a) : a \in A(s') \setminus \hat A(s')\}$ are added to $\mathcal{C}$ (alg. 2); and if $V(s')$ decreases, the pairs $\{(s,a) : a \in A(s) \setminus \hat A(s),\ s' \in \mathrm{succ}(s,a)\}$ are added to $\mathcal{C}$ (alg. 2). Elements are removed from $\mathcal{C}$ after they are checked. As we prove later in this section, checking the constraints in $\mathcal{C}$ is sufficient to find all constraint violations; a sketch of this bookkeeping follows.
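The rules above amount to the following bookkeeping, sketched with the same assumed interface as before; `predecessors[s_prime]` is a hypothetical precomputed map from a state $s'$ to the pairs $(s, a)$ with $s' \in \mathrm{succ}(s,a)$.

```python
def track_potential_violations(ssp, C, s_prime, old_v, new_v,
                               partial_actions, predecessors):
    """Bookkeeping behind CG-iLAO*'s separation oracle: after V(s') changes
    from old_v to new_v, record the only constraints that may have become
    violated in the set C of (state, action) pairs."""
    if new_v > old_v:
        # Case 2: constraints C1(s', a) of actions still outside the partial
        # SSP may now be violated; in-SSP actions are safe after a backup.
        for a in ssp.actions(s_prime):
            if a not in partial_actions[s_prime]:
                C.add((s_prime, a))
    elif new_v < old_v:
        # Case 3: Q(s, a) dropped for every pair (s, a) with s' in succ(s, a),
        # so out-of-SSP constraints of s'-predecessors may now be violated.
        for (s, a) in predecessors[s_prime]:
            if a not in partial_actions[s]:
                C.add((s, a))
    # Case 1 (unchanged value) introduces no new violations.
```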
CG-iLAO∗ fixes a violated constraint C1$(s,a)$ by adding $a$ to the partial SSP and setting $V(s) = Q_V(s,a)$ (alg. 2). This change in $V(s)$ may create a new violation in another state, so we must track such potential violations in the same way as before. This ensures that all constraint violations are tracked and eventually fixed before termination.
Note that $V(s)$ may decrease after an update (case 3 of the constraint violations) in CG-iLAO∗ even if the heuristic used is monotonic. This is a departure from all other algorithms based on Bellman backups, where $V$ is guaranteed to be monotonically non-decreasing during execution when initialised with a monotonic heuristic. To illustrate a scenario where $V$ decreases in CG-iLAO∗, consider the SSP in fig. 1, where the monotonic heuristic $h$ is represented inside nodes. The first iterations of CG-iLAO∗ applied to this SSP are:
- Iter. 1: expands $s_0$.
- Iter. 2: partly expands the fringe state reached by $\hat\pi_V$, adding only its greedy action.
- Iter. 3: expands the next fringe state and, after CG-Backups, the values of the states on the greedy envelope are updated. Since the corresponding pair is in $\mathcal{C}$, Fix-Constrs verifies that an action outside the partial SSP is currently better than that state's existing action, so the action is added to the partial SSP and the state's value is changed from 4 to 3. Recall that this state is also a successor of another state via an action outside the partial SSP, so when its value is updated, the corresponding pair is inserted into $\mathcal{C}$, and no further changes are made in this iteration.
- Iter. 4: expands the remaining fringe state and CG-Backups reduces $V(s_0)$ from 5 to 4, i.e., $V$ has been decreased by CG-Backups.
CG-iLAO∗ generalises iLAO∗ by using a more precise separation oracle that only adds violated constraints, which translates to CG-iLAO∗ ignoring inactive actions. For a state $s$, any action that has been ignored and left out of the partial SSP is not considered by a Bellman backup of $s$, so such actions' $Q$-values are not computed. However, CG-iLAO∗ needs to compute additional $Q$-values in its separation oracle to check violations. As our experiments in section 5 show, the $Q$-values saved by ignoring inactive actions outweigh the additional $Q$-values computed by the separation oracle, which lets CG-iLAO∗ outperform iLAO∗.
To close the section, we prove that CG-iLAO∗: (i) terminates (thm. 1); (ii) tracks all constraint violations (lem. 1); and (iii) returns an $\epsilon$-consistent value function (thm. 2). Thus, CG-iLAO∗ is optimal for SSPs.
Theorem 1.
CG-iLAO∗ terminates.
Proof.
For contradiction, suppose CG-iLAO∗ does not terminate. $S$ is finite and we do not add duplicate states nor constraints, so eventually the partial SSP is fixed and no further states or actions are added. Then there must be a finite set of states $S'$ that are updated with Bellman backups infinitely often by CG-Backups and/or Fix-Constrs. But $S'$ induces a new partial SSP, and applying Bellman backups infinitely often to all of $S'$ solves this new partial SSP with VI, so $V$ must converge to a fixed point and the residual will be less than $\epsilon$ in finite time. Thus, $V$ will not be updated further, and all remaining termination conditions will be satisfied, giving us the desired contradiction. ∎
Lemma 1.
If there is $s \in \hat S$ and $a \in A(s) \setminus \hat A(s)$ such that the constraint C1$(s,a)$ is violated, i.e., $V(s) > Q_V(s,a)$, then $(s,a) \in \mathcal{C}$.
Proof.
We prove by induction over $n$, the number of updates to $V$. In the base case, $n = 0$, the partial SSP is the initial one containing only $s_0$ and no actions, so the claim is vacuously true. Now, we show the claim holds after $n+1$ updates to $V$, assuming that the claim holds for $n$ updates. Any new violations must have been introduced by the latest update to $V$ by CG-Backups or Fix-Constrs, but both add all potentially violated pairs to $\mathcal{C}$ (alg. 2), so the claim holds. ∎
Theorem 2.
CG-iLAO∗ outputs an $\epsilon$-consistent value function.
Proof.
For contradiction, suppose CG-iLAO∗ has terminated and outputs $V$ such that some $s \in S^{\hat\pi_V}$ has residual greater than $\epsilon$. By CG-iLAO∗'s termination condition (alg. 2) we know that no fringe state remains and $\mathcal{C} = \emptyset$, so CG-Backups applies Bellman backups to all states in the envelope until the residual is at most $\epsilon$ (alg. 2). Therefore, the inconsistency of $s$ must be introduced by Fix-Constrs, either directly by updating $V(s)$, or indirectly by forcing a policy change. But the residual is tracked (alg. 2) and policy changes are flagged, and both are checked in the termination condition. So, Fix-Constrs can not introduce any inconsistency either. But these two methods are the only ones affecting $V$, which yields the desired contradiction. This proves $\epsilon$-consistency (defn. 1), but previous heuristic search methods rely on the invariant $V \le V^*$ to safely prune states that can not be part of an optimal policy's envelope, and CG-iLAO∗ does not maintain this invariant. We must ensure that states outside the policy envelope whose values have become inadmissible can not lead to a cheaper policy if we apply more Bellman backups to them. Consider such an $s$ outside the greedy policy envelope with $V(s) > V^*(s)$, and for contradiction suppose that additional backups involving $s$ would yield a cheaper policy. Since states are initialised with an admissible $h$, we know that $V(s)$ can only have risen above $V^*(s)$ through an update, so by lem. 1 the corresponding violated constraint is in $\mathcal{C}$. Since Fix-Constrs overwrites $V(s)$ when processing such a violation, it must have been added in the previous call, but then $\mathcal{C} \neq \emptyset$ (alg. 2), so the termination criteria (alg. 2) are not satisfied, giving us a contradiction. Therefore, no state with an inadmissible value can improve the returned policy. Thus, if CG-iLAO∗ terminates with $V$, additional backups to states with $V(s) > V^*(s)$ would not change the solution, so we can conclude that CG-iLAO∗ outputs an $\epsilon$-consistent value function. ∎
5 Experiments
In this section we empirically compare CG-iLAO∗ to two state-of-the-art optimal heuristic search planners: iLAO∗ (Hansen and Zilberstein 2001) and LRTDP (Bonet and Geffner 2003). We also compared CG-iLAO∗ against FTVI (Dai, Weld et al. 2009), the only algorithm we are aware of that uses action elimination to prune actions, but it is uncompetitive, and its results are reported in section 7. We consider the following admissible heuristics: h-max ($h^{\max}$); lm-cut ($h^{\mathrm{lmc}}$) (Helmert and Domshlak 2009); and h-roc ($h^{\mathrm{roc}}$) (Trevizan, Thiébaux, and Haslum 2017). As in (Trevizan, Thiébaux, and Haslum 2017), we use $h^{\max}$ as a dead-end detection mechanism for problems with dead ends. We use the same $\epsilon$ for all planners and convert SSPs into dead-end free SSPs (Trevizan, Teichteil-Königsbuch, and Thiébaux 2017) with a fixed penalty for all domains except the Parc Printer variants, where a larger penalty is used due to the large cost of single actions.
On all problems, we collected 50 runs with different random seeds for each combination of planner and heuristic. We refer to a problem paired with a fixed seed as an instance. All runs have a cutoff of 30 minutes of CPU time and 8GB of memory. The experiments were conducted on a cluster of Intel Xeon 3.2 GHz CPUs and each run used a single CPU core. The LP solver used for computing $h^{\mathrm{roc}}$ was CPLEX version 20.1. We consider the following domains:
Triangle Tire World with Head-start (TW)
In the original Triangle Tire World domain (Little, Thiebaux et al. 2007; Buffet 2008), the agent is provided a map of locations in a triangular layout, as in fig. 2. The agent's task is to travel from one corner to the goal corner, but it gets a flat tyre with some probability every time it moves. Once the car gets a flat tyre, it must change the tyre if a spare is available, otherwise no action is available and the goal is no longer reachable. The agent can only store one spare at a time, and it can only obtain spare tyres in select locations (circles in fig. 2). A shortcoming of this domain is that its difficulty scales exponentially, so it may be easy to solve a problem of size $n$ and impractical to solve size $n+1$. For this reason we extend the domain by allowing the agent a head-start, that is, its starting location may be anywhere along one edge of the triangle; for instance, on problem 2 these are locations 1-1, 2-1, …, 5-1 (fig. 2 (right)). Let TW($n$, $k$) denote an instance of Triangle Tire World with Head-start, where $n$ is the problem size and $k$ denotes the distance between the agent's starting location and the goal corner. When $k$ is maximal, we obtain the original problem of size $n$, and reducing $k$ makes the problem easier until we reach the minimal $k$, the easiest variant. Experiments using LRTDP suggest a consistent relation between $n$, $k$, and CPU time. Therefore, for each size $n$, we consider 5 problems with varying head-start distances.
Probabilistic Blocks World (BW) (Buffet 2008)
As in the deterministic Blocks World from IPC, the agent is tasked with arranging blocks on a table into a particular configuration with actions to pick up blocks, put them down, or stack them. The probabilistic version adds to each action a 0.25 probability that the handled block falls onto the table. Furthermore, actions are added that allow the agent to pick up, put down, and stack a tower of three blocks; these have a 0.9 probability of the whole tower falling onto the table.
Exploding Blocks World (ExBW) (Buffet 2008)
Another variation of the deterministic Blocks World, but this time each block is rigged with an explosive that can detonate once and destroy the table or the block immediately underneath it. When a block is placed on the table or on another block, it detonates with probability 0.4 and 0.1, respectively. Once the table or a block has been destroyed, the agent can not interact with them anymore; therefore, if they are not in their goal position, the goal will be unreachable.
Probabilistic PARC Printer (PARC) (Trevizan, Thiébaux, and Haslum 2017)
This domain is a probabilistic extension of the PARC Printer domain from IPC. It models a modular printer consisting of various components, and each page scheduled for printing needs to pass through multiple components in a particular order. The goal is to optimise how each page is directed through the different components to satisfy the printing requirements. With probability 0.1, a component jams, ruining the relevant page and forcing it to be reprinted. The domain comes in two flavours: with repair (PARC-R), where jammed components can be repaired and then used again; and without repair (PARC-N), where jammed components remain unusable.
Our code and benchmarks are available at Schmalz and Trevizan (2023). We now present a summary of our findings.
What is the best planner and heuristic combination?
A common metric to evaluate planners is coverage, i.e., the total number of instances solved in a given amount of time; larger coverage is better. Fig. 3 (left) shows the coverage of each combination of planner and heuristic as a function of time. The top three combinations, all using $h^{\mathrm{roc}}$, and their total coverages are CG-iLAO∗ (1300), LRTDP (1202), and iLAO∗ (1200). Note that CG-iLAO∗ and LRTDP alternate in the top spot up to 220 seconds, after which CG-iLAO∗ has the best coverage until the experiment cutoff. Moreover, from 300 seconds onwards, CG-iLAO∗'s lead varies from 50 to 243 instances. We present a breakdown of coverage per domain in tab. 1. For each domain considered, CG-iLAO∗ reaches the highest coverage over all other planners and heuristics. For all three heuristics considered, CG-iLAO∗ also obtains the highest coverage in all domains against the other planners for the same heuristic. In tab. 2, we present the minimum and maximum speedup per domain of CG-iLAO∗ over the other planners for $h^{\mathrm{roc}}$. The speedups w.r.t. iLAO∗ range from 0.9 (i.e., 11% slower) in the largest PARC-R problem up to 3.0 in problem #9 of ExBW. Against LRTDP, the speedups range from their lowest in problem #8 of ExBW up to 8.4 in PARC-N problem s4-c3. For the performance of each planner and heuristic per problem, see section 7.
How many actions can CG-iLAO∗ ignore?
To answer this question, we look at the density of the states $s$ in the final partial SSP of CG-iLAO∗, defined as $|\hat A(s)| / |A(s)|$. Fig. 4 shows, for each domain, the cumulative plot of density, i.e., how many states contain up to and including a given proportion of their applicable actions. In all instances, at least one third of the states contain at most 50% of their applicable actions. The density is high for TW because many states have only one applicable action. The density is also high for ExBW because heuristics are comparatively weak for this domain. For the other domains the results are much stronger: between 56% and 75% of states contain at most 50% of the actions. Overall, CG-iLAO∗ added only a fraction of all possible actions to its own partial SSP, amounting to between 43% and 65% of iLAO∗'s actions.
Table 1: Coverage per domain for each planner and heuristic combination.
| | BW | ExBW | PARC-N | PARC-R | TW | Total |
Num. of instances | 300 | 250 | 300 | 250 | 200 | 1300 | |
CG-iLAO∗ | |||||||
iLAO∗ | 200 | 150 | 1200 | ||||
LRTDP | 257 | 200 | 195 | 1202 | |||
CG-iLAO∗ | 150 | 200 | 150 | 1030 | |||
iLAO∗ | 150 | 200 | 200 | 140 | 990 | ||
LRTDP | 0 | 200 | 50 | 149 | 699 | ||
CG-iLAO∗ | 150 | 200 | 150 | 0 | 161 | 661 | |
iLAO∗ | 150 | 150 | 150 | 0 | 150 | 600 | |
LRTDP | 150 | 200 | 150 | 0 | 150 | 650 |
Table 2: Minimum and maximum speedup of CG-iLAO∗ per domain using $h^{\mathrm{roc}}$.
| | BW | ExBW | PARC-R | PARC-N | TW |
iLAO | 1.1–1.4 | 1.0–3.0 | 0.9–1.3 | 1.3–2.2 | 1.4–1.6 |
LRTDP | 1.3–2.0 | 0.9–2.9 | 2.0–8.4 | 0.7–0.9 | 1.2–1.6 |
Are $Q$-values being saved?
CG-iLAO∗ can save $Q$-value computations by ignoring inactive actions, but at the cost of computing additional $Q$-values in its separation oracle. The cumulative plot over $Q$-values in fig. 3 (right) shows that the savings in $Q$-values outweigh the overhead, i.e., given a budget of $Q$-value computations, CG-iLAO∗ is capable of solving more instances than the other planners for the same heuristic. At their maximum coverage, iLAO∗ and LRTDP use substantially more $Q$-values than CG-iLAO∗. Moreover, CG-iLAO∗ reaches its maximum coverage of 1300 using a number of $Q$-values for which iLAO∗ and LRTDP only solve 1149 and 1052 instances, respectively. A similar trend is observed when using $h^{\mathrm{lmc}}$; however, when using $h^{\max}$, the least informative heuristic considered, LRTDP is slightly more $Q$-value efficient than CG-iLAO∗.
What is the impact of the heuristic on CG-iLAO∗?
Note that, in fig. 3, as the heuristic becomes more informative, the performance gain of CG-iLAO∗ over iLAO∗ and LRTDP increases. To explore this trend, we use the heuristic $h_\alpha(s) = w \cdot V^*(s)$ defined for $\alpha \in [0,1]$, where $w$ is a number selected uniformly at random from an interval around $\alpha$; this lets us quantify how informative a heuristic is (on average) through $\alpha$. The randomness in the weight ensures that the ordering of states induced by $h_\alpha$ differs from the one induced by $V^*$. Due to the high cost of computing $V^*$, we only consider the smallest problems of BW, ExBW, PARC-N, and TW. Over these problems and 50 instances each, fig. 5 shows the mean search time and 95% C.I. as $\alpha$ varies. Search time excludes time spent computing the heuristic. The ratio between CG-iLAO∗'s and iLAO∗'s runtime supports that CG-iLAO∗ scales better with better heuristics: the ratio starts at 86% and decreases to 49% and 46% for the two largest values of $\alpha$ considered. The reason for this behaviour is that good heuristics (i.e., tighter lower bounds) prevent CG-iLAO∗ from adding inactive actions to its partial SSP, resulting in more savings in $Q$-value computation. Over all values of $\alpha$, there is no statistically significant difference between CG-iLAO∗ and LRTDP, but both offer a substantial improvement over iLAO∗. This suggests that LRTDP's sampling approach leverages the information provided by the heuristic more efficiently than iLAO∗, and that CG-iLAO∗ bridges the gap between them, allowing a non-sampling-based planner to use the heuristic as effectively as LRTDP.
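For concreteness, the construction of $h_\alpha$ can be sketched as follows; the interval half-width `delta` and the per-state caching of weights are our assumptions, since the text only specifies that the weight is drawn uniformly at random around $\alpha$.

```python
import random

def make_h_alpha(V_star, alpha, delta=0.05, seed=0):
    """Sketch of the alpha-weighted heuristic: h_alpha(s) = w_s * V*(s), with
    w_s drawn uniformly from [alpha - delta, alpha + delta] (delta is our
    assumption). Weights are fixed per state so the heuristic is well defined
    across repeated evaluations."""
    rng = random.Random(seed)
    weights = {s: rng.uniform(alpha - delta, alpha + delta) for s in V_star}
    return lambda s: weights[s] * V_star[s]
```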
6 Conclusion, Related and Future Work
Building on existing connections between operations research and planning, we presented a new interpretation of heuristic search on SSPs as solving LPs using variable and constraint generation. We exploit this equivalence to introduce a new and efficient separation oracle for SSPs, which enables a search algorithm to selectively add actions when they are deemed necessary to find the optimal solution. This addresses the shortcoming of state-of-the-art algorithms that add all applicable actions during state expansion, with no mechanism for ignoring actions that will not contribute to the solution. Using this principle, we generalised iLAO∗ into a new optimal heuristic search algorithm, CG-iLAO∗. Empirical evaluation showed that CG-iLAO∗'s ability to consider a subset of actions results in significant savings in the number of $Q$-values computed, which in turn reduces the runtime of the algorithm compared to the state-of-the-art.
Regarding related work, LP 1 has been approximated for factored MDPs to get a more compact LP, called the approximate LP (ALP). Schuurmans and Patrascu (2001) apply constraint generation to the ALP; their separation oracle has a similar condition for adding constraints as CG-iLAO∗; however, it checks for the condition naively, which is only practical on the compact ALP offered by factored MDPs, and is infeasible for SSPs. Constraint generation lends itself well to complex planning problems where a relaxation can be efficiently solved and constraint violations by the relaxed solution can be efficiently detected, e.g., in multiagent planning (Calliess and Roberts 2021) and metric hybrid factored planning in nonlinear domains (Say and Sanner 2019). For POMDPs, an LP with constraint generation can be used to prune unneeded vectors from the set of vectors used to represent the value function (Walraven and Spaan 2017). In all these works, the separation oracle either naively checks all possible constraints or relies on sampling to find violations.
As future work, we aim to expand the application of CG-iLAO∗ to more complex models that can benefit from our iterative method of generating applicable actions. Models with imprecise parameters, such as MDPIPs and MDPSTs (White III and Eldeib 1994; Trevizan, Cozman, and Barros 2007), are suitable candidates for our approach since they have a minimax semantics for the Bellman equations. In this minimax semantics, the value function minimises the expected cost-to-go assuming that an adversary aims to maximise the cost-to-go by selecting the values of the imprecise parameters. As a result, computing a $Q$-value in these models requires solving a maximisation problem; therefore, ignoring inactive actions could lead to significant improvements in performance.
Other suitable models include SSPs with PLTL constraints (Baumgartner, Thiébaux, and Trevizan 2018; Mallet, Thiébaux, and Trevizan 2021) in which both the state space and action space are augmented to keep track of constraint violations. In these models, the concept of inactive actions can be extended to also prevent adding actions that lead to constraint violations to their partial problems. The methods presented in this paper may also be applicable to model checking more broadly. In particular, there has been work investigating how to use heuristics to guide the search for probabilistic reachability (Brázdil et al. 2014), in which action elimination is applicable.
Acknowledgements
We thank the anonymous reviewers for their feedback. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.
References
- Baumgartner, Thiébaux, and Trevizan (2018) Baumgartner, P.; Thiébaux, S.; and Trevizan, F. 2018. Heuristic Search Planning With Multi-Objective Probabilistic LTL Constraints. In Proc. of 16th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR).
- Bellman (1957) Bellman, R. 1957. Dynamic programming. Princeton University Press.
- Bertsekas (1995) Bertsekas, D. 1995. Dynamic Programming and Optimal Control, volume 2. Athena Scientific.
- Bertsekas and Tsitsiklis (1991) Bertsekas, D.; and Tsitsiklis, J. 1991. An Analysis of Stochastic Shortest Path Problems. Mathematics of Operations Research.
- Bertsimas and Tsitsiklis (1997) Bertsimas, D.; and Tsitsiklis, J. 1997. Introduction to Linear Optimization. Athena Scientific.
- Bonet and Geffner (2003) Bonet, B.; and Geffner, H. 2003. Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming. In Proc. of 13th Int. Conf. on Automated Planning and Scheduling (ICAPS).
- Brázdil et al. (2014) Brázdil, T.; Chatterjee, K.; Chmelík, M.; Forejt, V.; Křetínský, J.; Kwiatkowska, M.; Parker, D.; and Ujma, M. 2014. Verification of Markov Decision Processes Using Learning Algorithms. In Automated Technology for Verification and Analysis.
- Buffet (2008) Buffet, D. 2008. International Planning Competition Uncertainty Part: Benchmarks and Results.
- Calliess and Roberts (2021) Calliess, J.-P.; and Roberts, S. 2021. Multi-Agent Planning with Mixed-Integer Programming and Adaptive Interaction Constraint Generation (Extended Abstract). Proc. of 14th Symposium on Combinatorial Search (SoCS).
- Dai, Weld et al. (2009) Dai, P.; Weld, D.; et al. 2009. Focused topological value iteration. In Proc. of 19th Int. Conf. on Automated Planning and Scheduling (ICAPS), volume 19, 82–89.
- Hansen and Zilberstein (2001) Hansen, E.; and Zilberstein, S. 2001. LAO∗: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence.
- Helmert and Domshlak (2009) Helmert, M.; and Domshlak, C. 2009. Landmarks, Critical Paths and Abstractions: What’s the Difference Anyway? Proc. of 19th Int. Conf. on Automated Planning and Scheduling (ICAPS).
- Little, Thiebaux et al. (2007) Little, I.; Thiebaux, S.; et al. 2007. Probabilistic planning vs. replanning. In ICAPS Workshop on IPC: Past, Present and Future.
- Mallet, Thiébaux, and Trevizan (2021) Mallet, I.; Thiébaux, S.; and Trevizan, F. 2021. Progression Heuristics for Planning with Probabilistic LTL Constraints. In Proc. of 35th AAAI Conference on Artificial Intelligence.
- McMahan, Likhachev, and Gordon (2005) McMahan, B.; Likhachev, M.; and Gordon, G. 2005. Bounded Real-Time Dynamic Programming: RTDP with Monotone Upper Bounds and Performance Guarantees. In Proc. of 22nd Int. Conf. on Machine Learning.
- Sanner et al. (2009) Sanner, S.; Goetschalckx, R.; Driessens, K.; and Shani, G. 2009. Bayesian real-time dynamic programming. In Proc. of 21st Int. Joint Conf. on AI (IJCAI).
- Say and Sanner (2019) Say, B.; and Sanner, S. 2019. Metric Hybrid Factored Planning in Nonlinear Domains with Constraint Generation, 502–518. Springer International Publishing.
- Schmalz and Trevizan (2023) Schmalz, J.; and Trevizan, F. 2023. Code, benchmarks, and technical report for AAAI 2024 paper “Efficient Constraint Generation for Stochastic Shortest Path Problems”. https://doi.org/10.5281/zenodo.10344842.
- Schuurmans and Patrascu (2001) Schuurmans, D.; and Patrascu, R. 2001. Direct value-approximation for factored MDPs. In Advances in Neural Information Processing Systems (NeurIPS).
- Smith and Simmons (2006) Smith, T.; and Simmons, R. 2006. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In Proc. of 20th AAAI Conf. on Artificial Intelligence.
- Teichteil-Königsbuch (2012) Teichteil-Königsbuch, F. 2012. Stochastic safest and shortest path problems. In Proc. of 26th AAAI Conf. on AI.
- Trevizan, Cozman, and Barros (2007) Trevizan, F.; Cozman, F. G.; and Barros, L. N. 2007. Planning under Risk and Knightian Uncertainty. In Proc. of 20th Int. Joint Conf. on AI (IJCAI).
- Trevizan, Teichteil-Königsbuch, and Thiébaux (2017) Trevizan, F.; Teichteil-Königsbuch, F.; and Thiébaux, S. 2017. Efficient Solutions for Stochastic Shortest Path Problems with Dead Ends. In Proc. of 33rd Int. Conf. on Uncertainty in Artificial Intelligence (UAI).
- Trevizan, Thiébaux, and Haslum (2017) Trevizan, F.; Thiébaux, S.; and Haslum, P. 2017. Occupation Measure Heuristics for Probabilistic Planning. In Proc. of 27th Int. Conf. on Automated Planning and Scheduling (ICAPS).
- Walraven and Spaan (2017) Walraven, E.; and Spaan, M. 2017. Accelerated Vector Pruning for Optimal POMDP Solvers. Proc. of 31st AAAI Conf. on AI.
- White III and Eldeib (1994) White III, C. C.; and Eldeib, H. K. 1994. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4): 739–749.
7 Appendix
We revisit the key questions of our experiments section by presenting additional results and explaining them.
What is the best planner and heuristic combination?
We report the performance statistics of each planner and heuristic combination per benchmark problem in tabs. 4 to 8. Concretely, we report the coverage over 50 runs, and over the runs that terminated we report the mean and 95% confidence interval associated with CPU time, $Q$-values computed, and the number of calls to the heuristic. For each problem (column) we highlight the planner and heuristic combinations with highest coverage, and among these combinations with tied highest coverage, we highlight the 95% confidence intervals that are tied as the best (lowest) for each problem. Formally, an interval $[a, b]$ is highlighted if there is no other interval $[a', b']$ for that problem s.t. $b' < a$. The highlighted intervals are also known as non-dominated intervals since there is no interval that is strictly better. If a planner and heuristic combination is missing from a table, then that combination had a 0% coverage over the relevant domain, e.g., all FTVI combinations are missing from PARC-N.
How many actions can CG-iLAO∗ ignore?
Following on from the relevant experiment in section 5, which considers the largest solved problems of each domain, we report how many actions CG-iLAO∗ can ignore in tab. 3. In particular, we show the number of actions in CG-iLAO∗'s final partial SSP, compared with the potential number of actions that could be in CG-iLAO∗'s partial SSP if all applicable actions were added for all partial states, and the number of actions in iLAO∗'s final partial SSP. CG-iLAO∗'s partial actions are significantly fewer than iLAO∗'s total actions, i.e., CG-iLAO∗ adds significantly fewer actions than iLAO∗, which supports our claim that, with a sufficiently informative heuristic, CG-iLAO∗ considers fewer actions and thereby saves computation.
Table 3: Number of actions in the final partial SSP.
partial actions CG-iLAO∗ | potential actions CG-iLAO∗ | total actions iLAO∗ | |
---|---|---|---|
BW | |||
ExBW | |||
PARC-R | |||
PARC-N | |||
TW |
What is the impact of the heuristic on CG-iLAO∗?
We have already shown how the search time of CG-iLAO∗, iLAO∗, and LRTDP is affected as $\alpha$ varies; we now show how the number of $Q$-values is affected in fig. 6. As $\alpha$ increases, CG-iLAO∗ uses the fewest $Q$-values. This is in line with our other results, but it is important to note that, for small $\alpha$, CG-iLAO∗ does not use fewer $Q$-values than LRTDP, which suggests that CG-iLAO∗ relies on a reasonably informative heuristic, and then scales best as informativeness increases.
Table 4: Results for BW.
algorithm | heuristic | 8-24967 | 8-23171 | 8-25241 | 10-14262 | 10-19475 | 12-19848
CG-iLAO∗ | |||||||
FTVI | |||||||
iLAO∗ | |||||||
LRTDP | 7 | ||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP |
Table 5: Results for ExBW.
algorithm | heuristic | p07-n7-N9-s7 | p08-n8-N10-s8 | p09-n9-N11-s9 | p10-n10-N12-s10 | p11-n11-N13-s11
CG-iLAO∗ | ||||||
FTVI | ||||||
iLAO∗ | ||||||
LRTDP | ||||||
CG-iLAO∗ | ||||||
FTVI | ||||||
iLAO∗ | ||||||
LRTDP | ||||||
CG-iLAO∗ | ||||||
iLAO∗ | ||||||
LRTDP |
Table 6: Results for PARC-N.
algorithm | heuristic | s4-c1 | s4-c2 | s4-c3 | s5-c1 | s5-c2 | s5-c3
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP | |||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP | |||||||
CG-iLAO∗ | |||||||
iLAO∗ | |||||||
LRTDP |
Table 7: Results for PARC-R.
algorithm | heuristic | s4-c1 | s4-c2 | s4-c3 | s5-c1 | s5-c2
CG-iLAO∗ | ||||||
FTVI | ||||||
iLAO∗ | ||||||
LRTDP | ||||||
CG-iLAO∗ | ||||||
FTVI | 20 | |||||
iLAO∗ | ||||||
LRTDP |
Table 8: Results for TW.
algorithm | heuristic | (original TW 4) | (original TW 5) | |
CG-iLAO∗ | |||||
FTVI | |||||
iLAO∗ | |||||
LRTDP | 45 | ||||
CG-iLAO∗ | |||||
FTVI | 37 | ||||
iLAO∗ | 40 | ||||
LRTDP | 49 | ||||
CG-iLAO∗ | 11 | ||||
FTVI | |||||
iLAO∗ | |||||
LRTDP |