
Heuristic Search for Multi-Objective Probabilistic Planning

Dillon Z. Chen,1 Felipe Trevizan,1 Sylvie Thiébaux1,2
Abstract

Heuristic search is a powerful approach that has successfully been applied to a broad class of planning problems, including classical planning, multi-objective planning, and probabilistic planning modelled as a stochastic shortest path (SSP) problem. Here, we extend the reach of heuristic search to a more expressive class of problems, namely multi-objective stochastic shortest paths (MOSSPs), which require computing a coverage set of non-dominated policies. We design new heuristic search algorithms MOLAO* and MOLRTDP, which extend well-known SSP algorithms to the multi-objective case. We further construct a spectrum of domain-independent heuristic functions differing in their ability to take into account the stochastic and multi-objective features of the problem to guide the search. Our experiments demonstrate the benefits of these algorithms and the relative merits of the heuristics.

1 Introduction

Stochastic shortest path problems (SSPs) are the de facto model for planning under uncertainty. Solving an SSP involves computing a policy which maps states to actions so as to minimise the expected (scalar) cost to reach the goal from a given initial state. Multi-objective stochastic shortest path problems (MOSSPs) are a useful generalisation of SSPs where multiple objectives (e.g. time, fuel, risk) need to be optimised without the user being able to a priori weigh these objectives against each other (Roijers and Whiteson 2017). In this more complex case, we now aim to compute a set of non-dominated policies whose vector-valued costs represent all the possible trade-offs between the objectives.

There exist two approaches to solving MOMDPs (Roijers and Whiteson 2017), a special case of MOSSPs: inner loop planning which consists in extending SSP solvers by generalising single objective operators to the multi-objective (MO) case, and outer loop planning which consists in solving a scalarised MOSSP multiple times with an SSP solver. We focus on the former. The canonical inner loop method, MO Value Iteration (MOVI) (White 1982) and its variants (Wiering and de Jong 2007; Barrett and Narayanan 2008; Roijers and Whiteson 2017), require enumerating the entire state space of the problem and are unable to scale to the huge state spaces typically found in automated planning.

Therefore, this paper focuses on heuristic search methods which explore only a small fraction of the state space when guided with informative heuristics. Heuristic search has been successfully applied to a range of optimal planning settings, including single objective (SO) or constrained SSPs (Hansen and Zilberstein 2001; Bonet and Geffner 2003; Trevizan et al. 2017), and MO deterministic planning (Mandow and Pérez-de-la-Cruz 2010; Khouadjia et al. 2013; Ulloa et al. 2020). Moreover, a technical report by Bryce et al. (2007) advocated the need for heuristic search and outlined an extension of LAO* (Hansen and Zilberstein 2001) for finite horizon problems involving multiple objectives and partial observability. Convex Hull Monte-Carlo Tree-Search (Painter, Lacerda, and Hawes 2020) extends the Trial-Based Heuristic Tree Search framework (Keller and Helmert 2013) to the MO setting but applies only to finite-horizon MOMDPs. Yet to the best of our knowledge there is no investigation of heuristic search for MOSSPs.

This paper fills this gap in heuristic search for MOSSPs. First, we characterise the necessary conditions for MOVI to converge in the general case of MOSSPs. We then extend the well-known SSP heuristic search algorithms LAO* (Hansen and Zilberstein 2001) and LRTDP (Bonet and Geffner 2003) to the multi-objective case, leading to two new MOSSP algorithms, MOLAO* and MOLRTDP, which we describe along with sufficient conditions for their convergence. We also consider the problem of guiding the search of these algorithms with domain-independent heuristics. A plethora of domain-independent heuristics exist for classical planning, but works on constructing heuristics for (single objective) probabilistic or multi-objective (deterministic) planning are much more recent (Trevizan, Thiébaux, and Haslum 2017; Klößner and Hoffmann 2021; Geißer et al. 2022). Building on these recent works, we investigate a spectrum of heuristics for MOSSPs differing in their ability to account for the probabilistic and multi-objective features of the problem.

Finally, we conduct an experimental comparison of these algorithms and of the guidance obtained via these heuristics. We observe the superiority of heuristic search over value iteration methods for MOSSPs, and of heuristics that are able to account for the tradeoffs between competing objectives.

2 Background

A multi-objective stochastic shortest path problem (MOSSP) is a tuple (S, s_0, G, A, P, \vec{C}) where: S is a finite set of states, one of which is the initial state s_0; G \subseteq S is a set of goal states; A is a finite set of actions; P(s'|s,a) is the probability of reaching s' after applying action a in s; and \vec{C}(a) \in \mathbb{R}^n_{\geq 0} is the n-dimensional vector representing the cost of action a. Two special cases of MOSSPs are stochastic shortest path problems (SSPs) and bi-objective SSPs, which are obtained when n equals 1 and 2, respectively.

A solution for an SSP is a deterministic policy \pi, i.e., a mapping from states to actions. A policy \pi is proper or closed w.r.t. s_0 if the probability of reaching G when following \pi from s_0 is 1; if this probability is less than 1, then \pi is an improper policy. We denote by S^\pi \subseteq S the set of states visited when following a policy \pi from s_0. The expected cost of reaching G when using a proper policy \pi from a state s \in S^\pi is given by the policy value function defined as

V^\pi(s) = C(\pi(s)) + \sum_{s' \in S} P(s'|s,\pi(s)) V^\pi(s')    (1)

for s \in S^\pi \setminus G, and V^\pi(g) = 0 for g \in G. An optimal policy for an SSP is any proper policy \pi^* such that V^{\pi^*}(s_0) \leq V^{\pi'}(s_0) for all proper policies \pi'. Although \pi^* might not be unique, the optimal value function V^* is unique and equals V^{\pi^*} for any \pi^*.

Stochastic policies are a generalisation of deterministic policies which map states to probability distributions over actions. The definitions of proper, improper and policy value function generalise trivially to stochastic policies. A key result for SSPs is that at least one of their optimal policies is deterministic (Bertsekas and Tsitsiklis 1991); thus it suffices to search for deterministic policies when solving SSPs.

Coverage sets and solutions for MOSSPs

In the context of MOSSPs, given a policy \pi, we denote by \vec{V}^\pi : S \to \mathbb{R}^n_{\geq 0} the vector value function for \pi. The function \vec{V}^\pi is computed by replacing V and C by \vec{V} and \vec{C} in (1), respectively. In order to define the solution of an MOSSP, we first need to define how to compare two different vectors: a cost vector \vec{v} dominates \vec{u}, denoted \vec{v} \preceq \vec{u}, if \vec{v}_i \leq \vec{u}_i for i = 1, \ldots, n. A coverage set for a set of vectors \mathbf{V}, denoted CS(\mathbf{V}), is any set satisfying \forall \vec{v} \in CS(\mathbf{V}), \nexists \vec{u} \in CS(\mathbf{V}) s.t. \vec{u} \preceq \vec{v} and \vec{u} \neq \vec{v}. (We denote sets of vectors, and functions which map to sets of vectors, with bold face, e.g. \mathbf{V}, and single vectors, and functions which map to single vectors, with vector notation, e.g. \vec{V}.) An example of a coverage set is the Pareto coverage set (PCS), which is the largest possible coverage set. For the remainder of the paper we focus on the convex coverage set (CCS) (Barrett and Narayanan 2008; Roijers and Whiteson 2017), which is defined as the convex hull of the PCS. Details for computing the CCS of a set \mathbf{V} with a linear program (LP) can be found in (Roijers and Whiteson 2017, Sec. 4.1.3). We say that a set of vectors \mathbf{U} dominates another set of vectors \mathbf{V}, denoted \mathbf{U} \preceq \mathbf{V}, if for all \vec{v} \in \mathbf{V} there exists \vec{u} \in \mathbf{U} such that \vec{u} \preceq \vec{v}.
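
As a concrete illustration of these definitions, the following is a minimal C++ sketch of the dominance test and of a Pareto coverage-set filter; the function names (dominates, pareto_coverage_set) are ours, and the LP-based CCS computation mentioned above is not shown.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // a cost vector in R^n_{>=0}

// v dominates u iff v_i <= u_i for every dimension i.
bool dominates(const Vec& v, const Vec& u) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] > u[i]) return false;
    return true;
}

// Keep only non-dominated vectors, yielding a Pareto coverage set (PCS).
// The convex coverage set (CCS) would additionally prune vectors that are not
// optimal for any linear scalarisation, typically via a linear program.
std::vector<Vec> pareto_coverage_set(const std::vector<Vec>& vecs) {
    std::vector<Vec> cs;
    for (const Vec& v : vecs) {
        bool keep = true;
        for (const Vec& u : vecs)
            if (u != v && dominates(u, v)) { keep = false; break; }
        if (keep) cs.push_back(v);
    }
    return cs;
}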

Given an MOSSP, define the optimal value function \mathbf{V}^* by \mathbf{V}^*(s) = CCS(\{\vec{V}^\pi(s) \mid \pi is a proper policy\}). Then we define a solution to the MOSSP to be any set of proper policies \Pi such that the function \varphi : \Pi \to \mathbf{V}^*(s_0) with \varphi(\pi) = \vec{V}^\pi(s_0) is a bijection. By choosing the CCS as our coverage set operator, we may restrict our attention to non-dominated deterministic policies, where non-dominated stochastic policies are implicitly given by the points on the surface of the polyhedron drawn out by the CCS. In this way, we avoid having to explicitly compute infinitely many non-dominated stochastic policies.

Figure 1: An MOSSP with action costs given by \vec{C}(a_1) = [1,0] and \vec{C}(a_2) = [0,1].

To illustrate this statement, consider the MOSSP in Fig. 1 with actions a_1 and a_2 where P(s_0|s_0,a_1) = P(g_1|s_0,a_1) = P(s_0|s_0,a_2) = P(g_2|s_0,a_2) = 0.5, \vec{C}(a_1) = [1,0] and \vec{C}(a_2) = [0,1]. One solution consists of only two deterministic policies \pi_1(s_0) = a_1 and \pi_2(s_0) = a_2 with corresponding expected costs [2,0] and [0,2]. Notice that there are uncountably many stochastic policies obtained by the convex combinations of \pi_1 and \pi_2, i.e., \pi_t(a_1|s_0) = 1-t, \pi_t(a_2|s_0) = t for t \in [0,1]. The expected cost of each \pi_t is [2-2t, 2t] and these stochastic policies do not dominate each other. Therefore, if the PCS is used instead of the CCS in the definition of optimal value function, then \mathbf{V}^*(s_0) would be \{[2-2t, 2t] \mid t \in [0,1]\} and the corresponding solution would be \{\pi_t \mid t \in [0,1]\}. The CCS allows us to compactly represent the potentially infinite PCS by storing only the deterministic policies: in this example, the actual solution is \{\pi_1, \pi_2\} and the optimal value function is \mathbf{V}^*(s_0) = \{[2,0], [0,2]\}.
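
These expected costs follow directly from the fixed point of (1); for instance, for \pi_1 (and symmetrically for \pi_2):

\vec{V}^{\pi_1}(s_0) = \vec{C}(a_1) + 0.5\,\vec{V}^{\pi_1}(s_0) + 0.5\,\vec{V}^{\pi_1}(g_1) = [1,0] + 0.5\,\vec{V}^{\pi_1}(s_0) \;\Rightarrow\; \vec{V}^{\pi_1}(s_0) = [2,0],

and for the stochastic policy \pi_t, \vec{V}^{\pi_t}(s_0) = [1-t, t] + 0.5\,\vec{V}^{\pi_t}(s_0), giving \vec{V}^{\pi_t}(s_0) = [2-2t, 2t].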

Value Iteration for SSPs

We finish this section by reviewing how to compute the optimal value function V^* for SSPs; we extend these results to MOSSPs in the next section. The optimal (scalar) value function V^* for an SSP can be computed using the Value Iteration (VI) algorithm (Bertsekas and Tsitsiklis 1996): given an initial value function V^0, it computes the sequence V^1, \ldots, V^k where V^{t+1} is obtained by applying a Bellman backup, that is, V^{t+1}(g) = 0 if g \in G and, for s \in S \setminus G,

V^{t+1}(s) = \min_{a \in A} Q^{t+1}(s,a)    (2)
Q^{t+1}(s,a) = C(a) + \sum_{s' \in S} P(s'|s,a) V^t(s').    (3)

VI guarantees that V^k converges to V^* as k \to \infty under the following conditions: (i) for all s \in S, there exists a proper policy w.r.t. s; and (ii) all improper policies have infinite cost for at least one state. The former condition is known as the reachability assumption and is equivalent to requiring that no dead ends exist, while the latter condition can be seen as preventing cycles with 0 cost. Notice that finite-horizon Markov Decision Processes (MDPs) and infinite-horizon discounted MDPs are special cases of SSPs in which all policies are guaranteed to be proper (Bertsekas and Tsitsiklis 1996).

3 Value Iteration for MOSSPs

Value Iteration has been extended to the MO setting for special cases of SSPs, e.g. infinite-horizon discounted MDPs (White 1982). In this section, we present the MO version of Value Iteration for the general MOSSP case. While the changes in the algorithm are minimal, the key contribution of this section is the generalisation of the assumptions needed for convergence. We start by generalising the Bellman backup operation from the SO case, i.e. (2) and (3), to the MO case. For s \in S \setminus G we have

\mathbf{V}^{t+1}(s) = CS\Big(\bigcup_{a \in A} \mathbf{Q}^{t+1}(s,a)\Big)    (4)
\mathbf{Q}^{t+1}(s,a) = \{\vec{C}(a)\} \oplus \Big(\bigoplus_{s' \in S} P(s'|s,a)\, \mathbf{V}^t(s')\Big),    (5)

and \mathbf{V}^{t+1}(g) = \{\vec{0}\} for g \in G, where \oplus denotes the sum of two sets of vectors \mathbf{U} and \mathbf{V} defined as \{\vec{u} + \vec{v} \mid \vec{u} \in \mathbf{U}, \vec{v} \in \mathbf{V}\}, and \bigoplus is the generalisation of \oplus to several sets.
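
To make the set operations concrete, the following is a minimal C++ sketch of the set sum \oplus and of the Q-backup (5); the function names (scale, set_sum, mo_q_backup) are ours, and pruning with a coverage-set operator as in (4) is omitted.

#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using VecSet = std::vector<Vec>;

// Scale every vector in a set: p * V = { p*v | v in V }.
VecSet scale(double p, const VecSet& V) {
    VecSet out;
    for (Vec v : V) {
        for (double& x : v) x *= p;
        out.push_back(v);
    }
    return out;
}

// Set sum U (+) V = { u + v | u in U, v in V }.
VecSet set_sum(const VecSet& U, const VecSet& V) {
    VecSet out;
    for (const Vec& u : U)
        for (const Vec& v : V) {
            Vec w(u.size());
            for (std::size_t i = 0; i < u.size(); ++i) w[i] = u[i] + v[i];
            out.push_back(w);
        }
    return out;
}

// Q^{t+1}(s,a) = {C(a)} (+) (+)_{s'} P(s'|s,a) V^t(s'), as in (5).
// 'successors' pairs the probability P(s'|s,a) of each reachable s' with V^t(s').
VecSet mo_q_backup(const Vec& cost,
                   const std::vector<std::pair<double, VecSet>>& successors) {
    VecSet q = {cost};
    for (const auto& [p, Vs] : successors)
        q = set_sum(q, scale(p, Vs));
    return q;
}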

Data: MOSSP problem P = (S, s_0, G, A, P, \vec{C}), initial values \mathbf{V}(s) for each state s (default to \mathbf{V}(s) = \{\vec{0}\}), and consistency threshold \varepsilon.
1  while \max_{s \in S} res(s) \geq \varepsilon do
2      for s \in S do
3          if s \in G then \mathbf{V}_{new}(s) \leftarrow \{\vec{0}\}
4          else \mathbf{V}_{new}(s) \leftarrow BellmanBackup(s)
5          res(s) \leftarrow D(\mathbf{V}(s), \mathbf{V}_{new}(s))
6      \mathbf{V} \leftarrow \mathbf{V}_{new}
7  return \mathbf{V}
Algorithm 1 MOVI

Alg. 1 shows the MO version of Value Iteration (MOVI), which is very similar to the single-objective VI algorithm, with the notable difference that the convergence criterion is generalised to handle sets of vectors. The Hausdorff distance between two sets of vectors \mathbf{U} and \mathbf{V} is given by

D(\mathbf{U}, \mathbf{V}) = \max\big\{ \max_{\vec{u} \in \mathbf{U}} \min_{\vec{v} \in \mathbf{V}} d(\vec{u}, \vec{v}),\; \max_{\vec{v} \in \mathbf{V}} \min_{\vec{u} \in \mathbf{U}} d(\vec{u}, \vec{v}) \big\}    (6)

for some choice of metric d such as the Euclidean metric. We use this distance to define the residuals computed in line 5 of Alg. 1. As with VI, MOVI converges to the optimal value function in the limit (White 1982; Barrett and Narayanan 2008) under certain strong assumptions presented in the next section. Given the value function, we can extract policies for any choice of scalarising weight \vec{w} (Barrett and Narayanan 2008).
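
For concreteness, a minimal C++ sketch of the distance (6) with the Euclidean metric; the function names are ours.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;
using VecSet = std::vector<Vec>;

double euclidean(const Vec& u, const Vec& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i) s += (u[i] - v[i]) * (u[i] - v[i]);
    return std::sqrt(s);
}

// max over u in U of the distance from u to its nearest vector in V.
double directed_hausdorff(const VecSet& U, const VecSet& V) {
    double worst = 0.0;
    for (const Vec& u : U) {
        double best = std::numeric_limits<double>::infinity();
        for (const Vec& v : V) best = std::min(best, euclidean(u, v));
        worst = std::max(worst, best);
    }
    return worst;
}

// Symmetric Hausdorff distance D(U, V) as in (6), used for the residuals res(s).
double hausdorff(const VecSet& U, const VecSet& V) {
    return std::max(directed_hausdorff(U, V), directed_hausdorff(V, U));
}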

Data: MOSSP problem P = (S, s_0, G, A, P, \vec{C}), initial values \mathbf{V}(s) for each state s (default to \mathbf{V}(s) = \{\vec{0}\}), consistency threshold \varepsilon and upper bound \vec{b}.
1  while \max_{s \in S} res(s) \geq \varepsilon do
2      for s \in S do
3          if s \in G then \mathbf{V}_{new}(s) \leftarrow \{\vec{0}\}
4          else \mathbf{V}_{new}(s) \leftarrow BellmanBackup_B(s)
5          res(s) \leftarrow D(\mathbf{V}(s), \mathbf{V}_{new}(s))
6      \mathbf{V} \leftarrow \mathbf{V}_{new}
7  for s \in S do \mathbf{V}(s) \leftarrow \mathbf{V}(s) \setminus \{\vec{b}\}
8  return \mathbf{V}
Algorithm 2 MOVI under Assumption 1

Assumptions for the convergence of MOVI

Similarly to the SO version of VI, MOVI requires that the reachability assumption holds; however, the assumption that improper policies have infinite cost needs to be generalised to the MO case. One option is to require that all improper policies cost \vec{\infty}, which we call the strong improper policy assumption; if it holds together with the reachability assumption, then MOVI (Alg. 1) is sound and complete for MOSSPs. The strong assumption is very restrictive because it implies that any cycle in an MOSSP must have a cost greater than zero in all dimensions; however, it is common for MOSSPs to have cycles of zero cost in one or more dimensions. For instance, in navigation domains, actions such as wait, load and unload do not consume fuel and can be used to generate zero-fuel-cost loops.

Figure 2: An MOSSP with action costs given by \vec{C}(a_1) = [1,0], \vec{C}(a_2) = [1,0], \vec{C}(a_g) = [0,1].

Another possible generalisation for the cost of improper policies is that they cost infinity in at least one dimension. This requirement is not enough, as illustrated by the (deterministic) MOSSP in Fig. 2. The only deterministic proper policy is \pi(s_0) = a_g with cost [0,1], and the other deterministic policy is \pi'(s_0) = a_1, \pi'(s_1) = a_2, which is improper and costs [\infty, 0]. Since neither [0,1] nor [\infty,0] dominates the other, two issues arise: MOVI tries to converge to infinity and thus never terminates; and even if it did terminate, the cost of an improper policy would be wrongly added to the CCS.

These issues stem from VI and its generalisations not explicitly pruning improper policies. Instead they rely on the assumption that improper policies have infinite cost to implicitly prune them. This approach is enough in the SO case because any proper policy dominates all improper policies since domination is reduced to the ‘\leq’ operator; however, as shown in this example, the implicit pruning approach for improper policies is not correct for the MO case.

We address these issues by providing a version of MOVI that explicitly prunes improper policies and to do so we need a new assumption:

Assumption 1 (Weak Improper Policy Assumption).

There exists a vector \vec{b} \in \mathbb{R}^n_{\geq 0} such that for every improper policy \pi and s \in S, \vec{V}^\pi(s) \not\preceq \vec{b}, and for every proper policy \pi and s \in S, \vec{V}^\pi(s) \preceq \vec{b} and \vec{V}^\pi(s) \neq \vec{b}.

The weak assumption allows improper policies to have finite cost in some but not all dimensions, at the expense of knowing an upper bound \vec{b} on the cost of all proper policies. This upper bound \vec{b} lets us infer that a policy \pi is improper whenever \vec{V}^\pi(s) \not\preceq \vec{b}, i.e., whenever \vec{V}^\pi(s) is greater than \vec{b} in at least one dimension.

Alg. 2 outlines the new MOVI; the differences can be found in lines 4 and 7. In line 4, we use a modified Bellman backup which detects vectors associated with improper policies and assigns them the upper bound \vec{b} given by the weak improper policy assumption. The modified backup is given by

\mathbf{V}^{t+1}(s) = CS_B\Big(\bigcup_{a \in A} \mathbf{Q}_B^{t+1}(s,a)\Big)    (7)
\mathbf{Q}_B^{t+1}(s,a) = \{\vec{C}(a)\} \oplus_B \Big({\bigoplus_B}_{\,s' \in S}\, P(s'|s,a)\, \mathbf{V}^t(s')\Big)    (8)

where \oplus_B is a modified set addition operator defined on pairs of vectors \vec{u} and \vec{v} by

\vec{u} \oplus_B \vec{v} = \begin{cases} \vec{b} & \text{if } \vec{u} \oplus \vec{v} \not\preceq \vec{b} \\ \vec{u} \oplus \vec{v} & \text{otherwise,} \end{cases}    (9)

and similarly for the generalised version \bigoplus_B. We also define CS_B to be the modified coverage set operator which does not prune away \vec{b} if it is in the input set. The modified backup (7) enforces a cap on value functions (i.e., \mathbf{V}^t(s) \preceq \{\vec{b}\}) through the operator \oplus_B (9). This guarantees that a finite fixed point exists for all \vec{V}^\pi(s) and, as a result, that Alg. 2 terminates. Once the algorithm converges, we need to explicitly prune the expected cost of improper policies, which is done in line 7. By Assumption 1, no proper policy costs \vec{b}, thus we can safely remove \vec{b} from the solution. Note that the overhead of applying this modified backup and post-processing pruning is negligible. Theorem 3.1 shows that MOVI converges to the correct solution.
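
A minimal C++ sketch of the capped addition \oplus_B from (9), applied to a pair of vectors; dominance is the componentwise \leq of Section 2, and the function names are ours.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

bool dominates(const Vec& v, const Vec& u) {   // v <= u componentwise
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] > u[i]) return false;
    return true;
}

// u (+)_B v: the usual vector sum, except that any sum not dominated by the
// upper bound b (i.e., exceeding b in some dimension) is replaced by b itself,
// marking it as the value of an improper policy.
Vec capped_add(const Vec& u, const Vec& v, const Vec& b) {
    Vec w(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) w[i] = u[i] + v[i];
    return dominates(w, b) ? w : b;
}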

Theorem 3.1.

Given an MOSSP in which the reachability and weak improper policy assumptions hold for an upper bound \vec{b}, and given a set of vectors \mathbf{V}^0 such that \mathbf{V}^0 \preceq \mathbf{V}^*, the sequence \mathbf{V}^1, \ldots, \mathbf{V}^k computed by MOVI converges to the MOSSP solution \mathbf{V}^* as k \to \infty.

Proof sketch.

By the assumption that \mathbf{V}^0 \preceq \mathbf{V}^*, we have that MOVI will not prune away vectors associated with proper policies which contribute to a solution. If \mathbf{V}^0 \not\preceq \mathbf{V}^*, e.g., \mathbf{V}^0(s) = \{\vec{b}\} for all s, then Alg. 2 is not guaranteed to find the optimal value function since it will incorrectly prune proper policies. Otherwise, we have that \mathbf{V}^1, \ldots, \mathbf{V}^k converges to \mathbf{V}^\dagger = \mathbf{V}^* \cup \{\vec{b}\} if the original MOVI in Alg. 1 does not converge, and to \mathbf{V}^\dagger = \mathbf{V}^* otherwise. For example, \mathbf{V}^* = \{[0,1]\} in the example of Fig. 2, but \mathbf{V}^\dagger = \{[2,2], [0,1]\} if we choose \vec{b} = [2,2].

This convergence result follows by noticing that, by definition of the modified backup in (7), every vector in \mathbf{V}^t(s) for all t dominates \vec{b}. We may then apply the proof for the convergence of MOVI with convex coverage sets by Barrett and Narayanan (2008), which reduces to the convergence of scalarised SSPs using VI in the limit, of which there are finitely many since the number of deterministic policies is finite. Here, we have for every \vec{w} that \vec{w} \cdot \vec{b} is an upper bound for the problem scalarised by \vec{w}. Finally, line 7 of Alg. 2 removes \vec{b} from \mathbf{V}^\dagger, so that we correctly return \mathbf{V}^*. ∎

Note that the original proof for MOMDPs by Barrett and Narayanan (2008) does not directly work for MOSSPs as some of the scalarised SSPs have a solution of infinite expected cost such that VI never converges. The upper bound b\vec{b} we introduce solves this issue as achieving this bound is the same as detecting improper policies. The proof remains correct in the original setting applied to discounted MOMDPs in which there are no improper policies.

Relaxing the reachability assumption

The reachability assumption can also be relaxed by transforming an MOSSP with dead ends into a new MOSSP without dead ends. Formally, given an MOSSP (S, s_0, G, A, P, \vec{C}) with dead ends, let A' = A \cup \{give-up\}; P(s_g | s, give-up) = 1 for all s \in S and any s_g \in G; \vec{C}'(a) = [\vec{C}(a) : 0] for all a \in A; and \vec{C}'(give-up) = [\vec{0} : 1].

Then the MOSSP (S, s_0, G, A', P, \vec{C}') does not have dead ends (i.e., the reachability assumption holds) since the give-up action is applicable in all states s \in S. This transformation is similar to the finite-penalty transformation for SSPs (Trevizan, Teichteil-Königsbuch, and Thiébaux 2017); however, it does not require a large penalty constant. Instead, the MOSSP transformation uses an extra cost dimension to encode the cost of giving up, and the value of the (n+1)-th dimension of \vec{C}'(give-up) is irrelevant as long as it is greater than 0. Since the give-up action can only be applied once, defining its cost as 1 lets us interpret the (n+1)-th dimension of any \vec{V}^\pi(s) as the probability of giving up when following \pi from s.

4 Heuristic search

Value iteration gives us one method for solving MOSSPs but is limited by the fact that it requires enumerating the whole state space. This is impractical for planning problems, as their state spaces grow exponentially with the size of the problem encoding. This motivates the need for heuristic search algorithms in the vein of LAO* and LRTDP for SSPs (Hansen and Zilberstein 2001; Bonet and Geffner 2003), which perform backups on a promising subset of states at each iteration. To guide the choice of the states to expand and explore at each iteration, they use heuristic estimates of state values, initialised when a state is first encountered. In this section, we extend the concept of heuristics to the more general MOSSP setting, discuss a range of ways these heuristics can be constructed, and provide multi-objective versions of LAO* and LRTDP.

Heuristics

For the remainder of the paper, we will call a set of vectors a value. Thus, a heuristic value for a state is a set of vectors \mathbf{H}(s) \subset \mathbb{R}^n_{\geq 0}. We implicitly assume that any heuristic value \mathbf{H}(s) has already been pruned with some coverage set operator. The definition of an admissible heuristic for MOSSPs should be at most as strong as the definition for deterministic MO search (Mandow and Pérez-de-la-Cruz 2010).

Definition 4.1.

A heuristic \mathbf{H} for an MOSSP is admissible if \forall s \in S \setminus G, \mathbf{H}(s) \preceq \mathbf{V}^*(s), where \mathbf{V}^* is the optimal value function, and \forall g \in G, \mathbf{H}(g) = \{\vec{0}\}.

For example, if for some MOSSP we have \mathbf{V}^*(s) = \{[0,2]\} for a state s, then \mathbf{H}(s) = \{[0,1], [3,0]\} and \mathbf{H}(s') = \{\vec{0}\} for all other states s' is an admissible heuristic. As in the single-objective case, admissible heuristics ensure that the algorithms below converge to near-optimal value functions in finitely many iterations for \varepsilon > 0, and to optimal value functions in possibly infinitely many iterations when \varepsilon = 0.

Definition 4.2.

A heuristic \mathbf{H} for an MOSSP is consistent if for all s \in S, a \in A we have \mathbf{H}(s) \preceq \{\vec{C}(a)\} \oplus \big(\bigoplus_{s' \in S} P(s'|s,a)\, \mathbf{H}(s')\big).

The definition of a consistent heuristic is derived and generalised from the definition of consistent heuristics for deterministic search: h(s) \leq c(s,a,s') + h(s'). The main idea is that, assuming non-negative costs, state costs increase monotonically, which results in no re-expansions of nodes in A* search. Similarly, we desire monotonically increasing value functions for faster convergence.

We now turn to the question of constructing domain-independent heuristics satisfying these properties from the encoding of a planning domain. This question has only recently been addressed in the deterministic multi-objective planning setting (Geißer et al. 2022): with the exception of this latter work, MO heuristic search algorithms have been evaluated using “ideal-point” heuristics, which apply a single-objective heuristic h_i to each objective in isolation, resulting in a single vector \mathbf{H}_{ideal}(s) = \{[h_1(s), \ldots, h_n(s)]\}. Moreover, informative domain-independent heuristics for single-objective SSPs are also relatively new (Trevizan, Thiébaux, and Haslum 2017; Klößner and Hoffmann 2021): a common practice was to resort to classical planning heuristics obtained after determinisation of the SSP (Jimenez, Coles, and Smith 2006).

We consider a spectrum of options when constructing domain-independent heuristics for MOSSPs, which we instantiate using the most promising families of heuristics identified in (Geißer et al. 2022) for the deterministic case: critical paths and abstraction heuristics. All heuristics are consistent and admissible unless otherwise mentioned.

  • The baseline option is to ignore both the MO and stochastic aspects of the problem, and resort to an ideal-point heuristic constructed from the determinised problem. In our experiments below, we apply the classical h^max heuristic (Bonet and Geffner 2001) to each objective and call this \mathbf{H}_{ideal}^{max}.

  • A more elaborate option is to consider only the stochastic aspects of the problem, resulting in an ideal-point SSP heuristic. In our experiments, we apply the recent SSP canonical PDB abstraction heuristic by Klößner and Hoffmann (2021) to each objective, which we call \mathbf{H}_{ideal}^{pdb2} and \mathbf{H}_{ideal}^{pdb3} for patterns of size 2 and 3, respectively.

  • Alternatively, one might consider only the multi-objective aspects, by applying some of the MO deterministic heuristics (Geißer et al. 2022) to the determinised SSP. The heuristics we consider in our experiments are the MO extensions of h^max and canonical PDBs: \mathbf{H}_{mo}^{comax}, \mathbf{H}_{mo}^{pdb2}, and \mathbf{H}_{mo}^{pdb3}. These extend classical planning heuristics by using MO deterministic search to solve subproblems and combining solutions by taking the maximum of two sets of vectors with the admissibility-preserving operator comax (Geißer et al. 2022).

  • The best option is to combine the power of SSP and MO heuristics. We do so with a novel heuristic \mathbf{H}_{mossp}^{pdb} which extends the SSP PDB abstraction heuristic (Klößner and Hoffmann 2021) to the MO case by using an MO extension of topological VI (TVI) (Dai et al. 2014) to solve each projection and the comax operator to combine the results. Our experiments use \mathbf{H}_{mossp}^{pdb2} and \mathbf{H}_{mossp}^{pdb3}.

(i)MOLAO*

Readers familiar with the LAO* heuristic search algorithm and the multi-objective backup can convince themselves that an extension of LAO* to the multi-objective case can be obtained by replacing the backup operator with its multi-objective version. This idea was first outlined by Bryce, Cushing, and Kambhampati (2007) for finite-horizon MDPs, which are a special case of SSPs without improper policies. In the same vein as (Hansen and Zilberstein 2001), we provide MOLAO* alongside an ‘improved’ version, iMOLAO*, in Alg. 3 and 4, respectively.

MOLAO* begins in line 1 by lazily assigning an initial value function \mathbf{V} to each state with the heuristic function, as opposed to explicitly initialising all values at once. \Pi is a dictionary representing our partial solution which maps states to sets of optimal actions corresponding to their current value function. The main while loop of the algorithm terminates once there are no nongoal states on the frontier, as described in line 2, at which point we have achieved a set of closed policies.

The loop begins in lines 3-5 by removing a nongoal state s from the frontier, representing a state on the boundary of the partial solution graph, and adding it to the interior set I. The nodes of the partial solution graph are partial solution states, and the edges are the probabilistic transitions under all partial solution actions. Next, in line 6, we add the set of yet unexplored successors of s to the frontier: any state s' \in S \setminus I such that \exists a \in A, P(s'|s,a) > 0. Then we extract the set Z of all ancestor states of s in the partial solution \Pi using graph search in line 7. We run MOVI to \varepsilon-consistency on the MOSSP problem restricted to the set of states Z and update the value functions for the corresponding states in line 8. The partial solution is updated in line 9 by extracting the actions corresponding to the value function with

getActions(s, \mathbf{V}) = \{a \in A \mid \mathbf{Q}(s,a) \cap \mathbf{V}(s) \neq \emptyset\}.    (10)

This function can be seen as a generalisation of arg min to the MO case, where here we select the actions whose \mathbf{Q}-value at s contributes to the current value \mathbf{V}(s). Next, we extract the set of states N corresponding to all states reachable from s_0 by the partial solution \Pi in line 10.
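
For concreteness, a minimal C++ sketch of getActions (10): an action is greedy at s if some vector of its backed-up Q-set also appears in \mathbf{V}(s). In practice vector equality is checked up to a small tolerance; the function names and the tolerance are ours.

#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using VecSet = std::vector<Vec>;

bool approx_equal(const Vec& u, const Vec& v, double tol = 1e-6) {
    for (std::size_t i = 0; i < u.size(); ++i)
        if (std::fabs(u[i] - v[i]) > tol) return false;
    return true;
}

// getActions(s, V) = { a | Q(s,a) intersects V(s) }, as in (10).
// q_sets[a] holds the backed-up Q-set of action index a at state s, and v_s is V(s).
std::vector<int> get_actions(const std::vector<VecSet>& q_sets, const VecSet& v_s) {
    std::vector<int> greedy;
    for (int a = 0; a < static_cast<int>(q_sets.size()); ++a) {
        bool intersects = false;
        for (const Vec& q : q_sets[a]) {
            for (const Vec& v : v_s)
                if (approx_equal(q, v)) { intersects = true; break; }
            if (intersects) break;
        }
        if (intersects) greedy.push_back(a);
    }
    return greedy;
}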

Data: MOSSP problem P = (S, s_0, G, A, P, \vec{C}), heuristic \mathbf{H}, and consistency threshold \varepsilon
1   \mathbf{V} \leftarrow \mathbf{H};  \Pi \leftarrow \emptyset;  F \leftarrow \{s_0\};  I \leftarrow \emptyset;  N \leftarrow \{s_0\}
2   while (F \cap N) \setminus G \neq \emptyset do
3       s \leftarrow any element from (F \cap N) \setminus G
4       F \leftarrow F \setminus \{s\}
5       I \leftarrow I \cup \{s\}
6       F \leftarrow F \cup (successors(s) \setminus I)
7       Z \leftarrow ancestorStates(s, \Pi)
8       \mathbf{V}|_Z \leftarrow MOVI(P|_Z, \mathbf{V}|_Z, \varepsilon)
9       for s \in Z do \Pi(s) \leftarrow getActions(s, \mathbf{V})
10      N \leftarrow solutionGraph(s_0, \Pi)
11  return \mathbf{V}
Algorithm 3 MOLAO*

The completeness and correctness of MOLAO* follow from the same reasoning as for LAO*, extended to the MO case. Specifically, given an admissible heuristic, the algorithm achieves \varepsilon-consistency upon termination.

Data: MOSSP problem P = (S, s_0, G, A, P, \vec{C}), heuristic \mathbf{H}, and consistency threshold \varepsilon
1   \mathbf{V} \leftarrow \mathbf{H};  \Pi \leftarrow \emptyset;  F \leftarrow \{s_0\};  I \leftarrow \emptyset;  N \leftarrow \{s_0\}
2   while ((F \cap N) \setminus G \neq \emptyset) \vee (\max_{s \in N} res(s) \geq \varepsilon) do
3       F \leftarrow \emptyset
4       N \leftarrow postorderTraversalDFS(s_0, \Pi)
5       for s \in N in the computed order do
6           \mathbf{V}(s) \leftarrow BellmanBackup(s)
7           \Pi(s) \leftarrow getActions(s, \mathbf{V})
8           if s \notin I then F \leftarrow F \cup \{s\}
9           I \leftarrow I \cup \{s\}
10  return \mathbf{V}
Algorithm 4 iMOLAO*

One potential issue with MOLAO* is that we may waste many backups by running MOVI to convergence several times on partial solution graphs which do not end up contributing to the final solution. The original authors of LAO* proposed the iLAO* algorithm to deal with this, and we provide its MO extension, namely iMOLAO*. The main idea of iMOLAO* is to run only one round of backups every time we (re-)expand a state, instead of running VI to convergence, in the loop in lines 5 to 9. Backups are also performed asynchronously using a DFS postorder traversal of the states in the partial solution graph, computed in line 4, allowing for faster convergence.
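
As an illustration, a minimal C++ sketch of the postorder traversal used in line 4 of Alg. 4, assuming states are integer IDs and successors(s) returns the states reachable in one step under the greedy actions \Pi(s); the names are ours and recursion depth is not a concern for the sketch.

#include <functional>
#include <unordered_set>
#include <vector>

// Postorder DFS over the greedy partial solution graph rooted at s0:
// successors (under greedy actions) are ordered before their ancestors, so a
// single sweep of backups propagates values towards the initial state.
std::vector<int> postorder_traversal(
        int s0, const std::function<std::vector<int>(int)>& successors) {
    std::vector<int> order;
    std::unordered_set<int> visited;
    std::function<void(int)> dfs = [&](int s) {
        if (!visited.insert(s).second) return;   // already visited
        for (int t : successors(s)) dfs(t);
        order.push_back(s);                      // postorder: after all successors
    };
    dfs(s0);
    return order;
}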

Data: MOSSP problem P = (S, s_0, G, A, P, \vec{C}), heuristic \mathbf{H}, and consistency threshold \varepsilon
procedure MOLRTDP(P, \varepsilon, \mathbf{H})
1       \mathbf{V} \leftarrow \mathbf{H}
2       while \neg s_0.solved do
3           visited \leftarrow \emptyset
4           s \leftarrow s_0
5           while \neg s.solved do
6               visited.push(s)
7               if s \in G then break
8               \mathbf{V}(s) \leftarrow BellmanBackup(s)
9               a \leftarrow sampleUnsolvedGreedyAction(s)
10              s \leftarrow sampleUnsolvedNextState(s, a)
11          while \neg visited.empty() do
12              s \leftarrow visited.pop()
13              if \neg checkSolved(s) then break
14      return \mathbf{V}
routine checkSolved(s)
1       rv \leftarrow true;  open \leftarrow \emptyset;  closed \leftarrow \emptyset
2       if \neg s.solved then open.push(s)
3       while \neg open.empty() do
4           s \leftarrow open.pop()
5           closed.push(s)
6           if res(s) > \varepsilon then
7               rv \leftarrow false
8               continue
9           for a \in getActions(s, \mathbf{V}) do
10              for s' \in successors(s, a) do
11                  if \neg s'.solved \wedge s' \notin open \cup closed then open.push(s')
12      if rv then for s \in closed do s.solved \leftarrow true
13      else
14          while closed \neq \emptyset do
15              s \leftarrow closed.pop()
16              \mathbf{V}(s) \leftarrow BellmanBackup(s)
17      return rv
Algorithm 5 MOLRTDP

To summarise, the two main changes required to extend (i)LAO* to the MO case are (1) replacing the backup operator with its MO version, and (2) storing possibly more than one greedy action at each state, corresponding to incomparable \mathbf{Q}-value vectors, which results in a larger set of successors associated with each state. These ideas can also be applied to other asynchronous VI methods such as Prioritised VI (Wingate and Seppi 2005) and Topological VI (Dai et al. 2014).

MOLRTDP

LRTDP (Bonet and Geffner 2003) is another heuristic search algorithm for solving SSPs. The idea of LRTDP is to perform random trials using greedy actions based on the current value function or heuristic, performing backups along the way, and labelling states as solved when the consistency threshold has been reached, in order to gradually narrow down the search space and reach the convergence criterion. A multi-objective extension of LRTDP is not as straightforward as extending the backup operator, given that each state may have more than one greedy action to account for. The two main changes from the original LRTDP are (1) the sampling of a random greedy action before sampling successor states in the main loop, and (2) defining the descendants of a state by considering the successors of all greedy actions (as opposed to just one in LRTDP). Alg. 5 outlines the MO extension of LRTDP.

MOLRTDP consists of repeatedly running trials through the state space and performing backups until the initial state is marked as solved. Trials are run by randomly sampling a greedy action a from getActions(s, \mathbf{V}) at each state s, followed by a next state sampled from the probability distribution of a, until a goal state is reached, as in the first inner loop in lines 5 to 10. The second inner loop in lines 11 to 13 calls the checkSolved routine in the reverse trial order to label states as solved, by checking whether the residuals of all descendant states under greedy actions are less than the consistency threshold.

The checkSolved routine begins by inserting s into the open set if it has not been solved, and returns true otherwise due to line 2. The main loop in lines 3 to 11 collects the descendant states under all greedy actions (lines 9-11) and checks whether the residuals of all the descendant states are small (lines 6-8). If so, all descendants are labelled as solved in line 12. Otherwise, backups are performed in the reverse order of the explored states in lines 13 to 16.

The completeness and correctness of MOLRTDP follow similarly from its single-objective ancestor. We note specifically that the labelling feature works similarly, in the sense that whenever a state is marked as solved, it is known for certain that the values of all its descendants are \varepsilon-consistent and remain unchanged when backups are performed elsewhere.

5 Experiments

In this section we empirically evaluate the different algorithms and heuristics for MOSSPs on several domains. Since no benchmark domains for MOSSPs exist, we adapt domains from a variety of sources to capture challenging features of both SSPs and MO deterministic planning. Our benchmark set is introduced in the next section, followed by our experimental setup and an analysis of the results.

Benchmarks

kk-d Exploding Blocksworld

Exploding Blocksworld (ExBw) was first introduced by Younes et al. (2005) and later slightly modified for IPPC'08 (Bryce and Buffet 2008). In ExBw, a robot has to pick up and stack blocks on top of each other to reach a target tower configuration. Blocks have a chance of detonating and destroying the block or table underneath them. We consider a version with no dead ends, using an action to repair the table (Trevizan et al. 2017). We extend ExBw with multi-objective costs. ExBw-2d has two objectives: the number of actions required to stack the blocks and the number of times the table needs to be repaired. ExBw-3d has an additional action to repair blocks, with an objective to minimise the number of block repairs.

MO Search and Rescue

Search and Rescue (SAR-n) was first introduced by Trevizan et al. (2017) as a constrained SSP. The goal is to find, board and escort to a safe location any one survivor in an n × n grid as quickly as possible, constrained by a fuel limit. Probabilities are introduced when modelling fuel consumption and partial observability of whether a survivor is present at a given location. The location of only one survivor is known for certain. We extend the problem by making fuel consumption an additional objective instead of a constraint. A solution for a SAR MOSSP is a set of policies with different trade-offs between fuel and time.

MO Triangle Tireworld

Triangle Tireworld, introduced by Little and Thiébaux (2007) as a probabilistically interesting problem, consists of a triangular grid of locations. The goal is to travel from an initial to a target location where each location has a probability of getting a flat tire. Some locations contain a spare tire which the agent can load into the car for future use. These problems have dead ends and they occur when a tire is flat and no spare is available. We apply the give-up transformation from Section 3 resulting in an MOSSP with two objectives: total number of actions and probability of using the give-up action.

Probabilistic VisitAll

The deterministic MO VisitAll (Geißer et al. 2022) is a TSP problem on a grid, where the agents must collectively visit all locations, and each agent’s objective is to minimise its own number of moves. This problem is considered MO interesting because any admissible ideal-point heuristic returns the zero vector for all states since it is possible for a single agent to visit all locations while no actions are performed by the other agents. We extend this domain by adding a probabilistic action move-risky which has probability 0.5 of acting as the regular move action and 0.5 of teleporting the agent back to its initial location. The cost of the original move action was changed to 2 while the cost of the move-risky action is 1.

VisitAllTire

This domain combines features of both the probabilistic Tireworld and the deterministic MO VisitAll domains into one that is both probabilistically and MO interesting. The underlying structure and dynamics are the same as the deterministic MO VisitAll, except that the move action can now result in a flat tire with probability 0.5. We also added actions for loading and changing tires for each of the agents. Similarly to Triangle Tireworld, problems in this domain can have dead ends when a flat tire occurs and no spare tire is available. Applying the give-up transformation from Section 3 to this domain results in k+1 cost functions, where k is the number of agents.

Setup

We implemented the MO versions of the VI, TVI, (i)LAO* and LRTDP algorithms and the MO version of the PDB abstraction heuristics (\mathbf{H}_{mossp}^{pdb}) in C++ (code at https://github.com/DillonZChen/cpp-mossp-planner). PDB heuristics are computed using TVI, \varepsilon = 0.001 and \vec{b} = \vec{100}. We include in our experiments the following heuristics for deterministic MO planning from (Geißer et al. 2022): the ideal-point version of h^max (\mathbf{H}_{ideal}^{max}); the MO extension of h^max (\mathbf{H}_{mo}^{comax}); and the MO canonical PDBs of size 2 and 3 (\mathbf{H}_{mo}^{pdb2} and \mathbf{H}_{mo}^{pdb3}). The SO PDB heuristics for SSPs from (Klößner and Hoffmann 2021) of size 2 and 3 (\mathbf{H}_{ideal}^{pdb2} and \mathbf{H}_{ideal}^{pdb3}) are also included. All problem configurations are run with a timeout of 1800 seconds, a memory limit of 4GB, and a single CPU core. The experiments are run on a cluster with Intel Xeon 3.2 GHz CPUs. We use CPLEX version 22.1 as the LP solver for computing the CCS. The consistency threshold is set to \varepsilon = 0.001 and we set \vec{b} = \vec{100}. Each experimental configuration is run 6 times and averages are taken to reduce variance in the data.

We also modify the inequality constraint in Alg. 4.3 of Roijers and Whiteson (2017) for computing the CCS to \vec{w} \cdot (\vec{v} - \vec{v}') + x \leq -\lambda, \forall \vec{v}' \in \mathbf{V}, with \lambda = 0.01, to deal with inevitable numerical precision errors when solving the LP. If the \lambda term is not added, the function may fail to prune unnecessary points, such as those corresponding to stochastic policies in the CCS, resulting in slower convergence. However, an overly large \lambda may return incorrect results by discarding important points in the CCS. One may alternatively view \lambda as an MO parameter for trading off solution accuracy against search time, similarly to the \varepsilon parameter.
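
For reference, a sketch of the resulting pruning test in the cost-minimisation convention used here: a candidate vector \vec{v} is kept if the following LP, whose inequality constraint is the modified one above and whose objective and weight-simplex constraints are assumed to follow the standard formulation of Roijers and Whiteson (2017), admits a solution with x \geq 0:

\max_{\vec{w},\, x} \; x \quad \text{s.t.} \quad \vec{w} \cdot (\vec{v} - \vec{v}') + x \leq -\lambda \;\; \forall \vec{v}' \in \mathbf{V}, \qquad \sum_{i=1}^{n} w_i = 1, \quad w_i \geq 0,

in which case there is a scalarising weight \vec{w} under which \vec{v} is strictly better (by more than \lambda) than every vector already in \mathbf{V}.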

Fig. 3 shows the distribution of CCS sizes for the different domains. Recall that the CCS implicitly represents a potentially infinite set of stochastic policies and their value functions. As a result, a small CCS is sufficient to dominate a large number of solutions; for instance, in Triangle Tireworld, a CCS of size 3 to 4 is enough to dominate all other policies, even though the number of deterministic policies for this domain is doubly exponential in its parameter n.

Figure 3: Boxplot of CCS size of instances across domains which have been solved at least once. Number of solved and total instances for each domain is indicated in parentheses.

Results

Figure 4: Cumulative coverage of: (a) planner and heuristic combinations (low-performing planners omitted); (b) planners only, i.e., summation across different heuristics; (c) heuristics only, i.e., summation across different planners; (d) the PDB approaches considered, i.e., summation across the ideal-point, MO, and MOSSP approaches. Notice that the x-axis is the same for all plots but the y-axis differs and might not start at 0.
planner + heuristic          coverage
LRTDP + H_mossp^pdb2         307.6 ± 2.7
LRTDP + H_mo^pdb2            302.8 ± 2.1
LRTDP + H_mo^pdb3            291.2 ± 1.3
iLAO* + H_mossp^pdb3         276.7 ± 0.7
iLAO* + H_mo^pdb3            272.0 ± 0.0
LRTDP + H_mossp^pdb3         271.2 ± 1.7
LRTDP + H_ideal^max          263.2 ± 0.7
iLAO* + H_mo^pdb2            262.0 ± 0.0

heuristic                    coverage
H_mossp^pdb3                 893.5 ± 1.9
H_mossp^pdb2                 893.3 ± 3.2
H_mo^pdb3                    871.2 ± 1.8
H_mo^pdb2                    851.8 ± 2.1
H_ideal^pdb3                 768.7 ± 0.7
H_ideal^max                  755.2 ± 0.7
H_ideal^pdb2                 737.2 ± 1.2
blind                        572.9 ± 1.4

planner                      coverage
LRTDP                        2297.3 ± 3.0
iLAO*                        2111.7 ± 0.7
LAO*                         1529.0 ± 1.2
TVI                          460.0 ± 0.5
VI                           450.3 ± 0.7

PDB                          coverage
H_mossp^pdb                  1786.8 ± 1.6
H_mo^pdb                     1723.0 ± 2.0
H_ideal^pdb                  1505.8 ± 1.9

Table 1: Average marginalised cumulative coverages with 95% confidence intervals. Higher values are better.

Fig. 4 summarises our results by showing the cumulative coverage of different planners and heuristics across all considered problems. The following is a summary of our findings (we omit the MO in the planner names for simplicity):

What is the best planner and heuristic combination?

Referring to the overall cumulative coverage plot in Fig. 4 and the coverage tables in Tab. 1, we notice that LRTDP+\mathbf{H}_{mossp}^{pdb2} performs best, followed by LRTDP+\mathbf{H}_{mo}^{pdb2}, LRTDP+\mathbf{H}_{mo}^{pdb3} and iLAO*+\mathbf{H}_{mossp}^{pdb3}. The ranking of the top 3 planners remains the same if we normalise the coverage by domain.

We note that the considered ExBw-3d problems had several improper policies, resulting in slow convergence of TVI when solving PDBs of size 3. This is due to Assumption 1 and the chosen \vec{b} = \vec{100} for our experiments, which requires improper policies in each PDB to cost at least 100 in one of the dimensions. Ignoring the ExBw-3d domain, the top 3 configurations are LRTDP+\mathbf{H}_{mossp}^{pdb3}, LRTDP+\mathbf{H}_{mossp}^{pdb2} and iLAO*+\mathbf{H}_{mossp}^{pdb3}, and their 95% confidence intervals all overlap. The ranking of the top few planners remains the same if the coverage is normalised.

What planner is best?

To answer this question, we look at the cumulative coverage marginalised over heuristics, that is, we sum the coverage of a planner across the different heuristics. The results are shown in Fig. 4 and Tab. 1, and the best three planners in order are LRTDP, iLAO* and LAO*. Notice that the difference in coverage between LRTDP and iLAO* is 185.6 solved instances while the difference between TVI and VI is only 9.7. The ranking between planners remains the same when the coverage is normalised per domain. Considering domains individually, LRTDP is the best planner for ExBw and VisitAllTire, while iLAO* is the best planner for SAR and Probabilistic VisitAll. For MO Tireworld, LRTDP and iLAO* are tied in first place.

Columns are grouped into critical-path and abstraction heuristics.

Domain        blind  H_ideal^max  H_mo^comax  H_ideal^pdb2  H_ideal^pdb3  H_mo^pdb2  H_mo^pdb3  H_mossp^pdb2  H_mossp^pdb3
SAR-4         100    97           44          93            92            44         44         38            26
SAR-5         100    97           43          93            92            43         43         37            27
ExBw-2d       100    59           22          51            45            51         45         51            45
ExBw-3d       100    58           22          52            44            52         44         52            44
Tireworld     100    100          68          100           100           68         68         57            13
VisitAll      100    100          54          100           100           61         53         22            12
VisitAllTire  100    100          50          100           100           66         45         66            45

Table 2: The mean relative error (%) of the heuristic value with respect to the optimal value at the initial state, computed with the directed Hausdorff distance divided by the norm of the largest vector in the optimal value: \max_{\vec{v} \in \mathbf{V}^*} \min_{\vec{u} \in \mathbf{H}} d(\vec{v}, \vec{u}) / \max_{\vec{v} \in \mathbf{V}^*} \|\vec{v}\|. Only solved instances for which all heuristics were computed for the initial state within the time limit were considered. Cells ranked first to third are shaded. Lower values are better.

What heuristic is best?

The cumulative coverage marginalised over planners for selected heuristics is shown in Fig. 4 and Tab. 1. The MOSSP PDB heuristic of size 3 (\mathbf{H}_{mossp}^{pdb3}) has the best performance, followed by \mathbf{H}_{mossp}^{pdb2}, \mathbf{H}_{mo}^{pdb3} and \mathbf{H}_{mo}^{pdb2}. We note there is a large overlap in the 95% confidence intervals of \mathbf{H}_{mossp}^{pdb2} and \mathbf{H}_{mossp}^{pdb3} due to the slow computation of \mathbf{H}_{mossp}^{pdb3} on ExBw-3d. The overlap disappears when we remove the ExBw-3d results, with coverage and confidence intervals of 862.9 ± 3.0 and 807.6 ± 2.9 for \mathbf{H}_{mossp}^{pdb3} and \mathbf{H}_{mossp}^{pdb2}, respectively. We further quantify heuristic accuracy using the directed Hausdorff distance between heuristic and optimal values in Tab. 2. We notice that \mathbf{H}_{mo}^{comax} achieves strong accuracy relative to other heuristics but has the worst coverage of 504.5 due to its high computation time.

What feature of MOSSPs is more important to capture in a heuristic?

To answer this question, consider the performance of the 3 classes of PDB heuristics: probabilistic-only PDBs (\mathbf{H}_{ideal}^{pdb}), MO-only PDBs (\mathbf{H}_{mo}^{pdb}), and MO probabilistic PDBs (\mathbf{H}_{mossp}^{pdb}). The cumulative coverage marginalised over the different PDB heuristics, shown in Fig. 4 and Tab. 1, highlights the effectiveness of MO PDBs. \mathbf{H}_{mossp}^{pdb} is able to solve 63.8 and 281 more instances than \mathbf{H}_{mo}^{pdb} and \mathbf{H}_{ideal}^{pdb}, respectively. The ranking remains the same when the coverage is normalised per domain. Moreover, notice in Tab. 2 that the \mathbf{H}_{mo}^{pdb} heuristics are at least as informative as the \mathbf{H}_{ideal}^{pdb} heuristics on all domains. Lastly, we note that any admissible ideal-point heuristic is upper bounded by \{\vec{u}\} where \vec{u}_i = \min_{\vec{v} \in \mathbf{V}^*} \vec{v}_i for i = 1, \ldots, n. These results suggest that it is more important for a heuristic to capture the MO cost structure of MOSSPs than the stochastic structure of actions.

6 Conclusion

In this work we define a new general class of problems, namely multi-objective stochastic shortest path problems (MOSSPs). We adapt MOVI, originally designed for solving multi-objective Markov Decision Processes, to the MOSSP setting and give conditions under which it converges. We further design new, more powerful heuristic search algorithms, (i)MOLAO* and MOLRTDP, for solving MOSSPs. These algorithms are complemented by a range of MOSSP heuristics and an extensive set of experiments on benchmarks of varying difficulty and features. Our evaluation shows that MOLRTDP is the best performing MOSSP solver, and that abstraction heuristics which consider both the MO and probabilistic aspects of MOSSPs are the best performing heuristics. Our future work includes adding probabilistic LTL constraints as additional objectives in order to compute solutions to MO-PLTL SSPs (Baumgartner, Thiébaux, and Trevizan 2018) that are robust to the choice of probability thresholds.

Acknowledgements

This work was supported by ARC project DP180103446, On-line planning for constrained autonomous agents in an uncertain world. The computational resources for this project were provided by the Australian Government through the National Computational Infrastructure (NCI) under the ANU Startup Scheme.

References

  • Barrett and Narayanan (2008) Barrett, L.; and Narayanan, S. 2008. Learning All Optimal Policies with Multiple Criteria. In Proc. 25th International Conference on Machine Learning (ICML), 41–47.
  • Baumgartner, Thiébaux, and Trevizan (2018) Baumgartner, P.; Thiébaux, S.; and Trevizan, F. 2018. Heuristic Search Planning With Multi-Objective Probabilistic LTL Constraints. In Proc. of 16th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR).
  • Bertsekas and Tsitsiklis (1991) Bertsekas, D. P.; and Tsitsiklis, J. N. 1991. An Analysis of Stochastic Shortest Path Problems. Mathematics of Operations Research, 16(3): 580–595.
  • Bertsekas and Tsitsiklis (1996) Bertsekas, D. P.; and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming, volume 3 of Optimization and Neural Computation Series. Athena Scientific.
  • Bonet and Geffner (2001) Bonet, B.; and Geffner, H. 2001. Planning as Heuristic Search. Artif. Intell., 129(1-2): 5–33.
  • Bonet and Geffner (2003) Bonet, B.; and Geffner, H. 2003. Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming. In Proc. 13th International Conference on Automated Planning and Scheduling (ICAPS), 12–21.
  • Bryce and Buffet (2008) Bryce, D.; and Buffet, O. 2008. 6th Int. Planning Competition: Uncertainty Track. 3rd Int. Probabilistic Planning Competition.
  • Bryce, Cushing, and Kambhampati (2007) Bryce, D.; Cushing, W.; and Kambhampati, S. 2007. Probabilistic Planning is Multi-Objective. Technical Report 07-006, ASU CSE.
  • Dai et al. (2014) Dai, P.; Mausam; Weld, D. S.; and Goldsmith, J. 2014. Topological Value Iteration Algorithms. J. Artif. Intell. Res., 42: 181–209.
  • Geißer et al. (2022) Geißer, F.; Haslum, P.; Thiébaux, S.; and Trevizan, F. 2022. Admissible Heuristics for Multi-Objective Planning. In Proc. 32nd International Conference on Automated Planning and Scheduling (ICAPS), 100–109.
  • Hansen and Zilberstein (2001) Hansen, E. A.; and Zilberstein, S. 2001. LAO*: A Heuristic Search Algorithm that Finds Solutions with Loops. Artif. Intell., 129(1-2): 35–62.
  • Jimenez, Coles, and Smith (2006) Jimenez, S.; Coles, A.; and Smith, A. 2006. Planning in Probabilistic Domains using a Deterministic Numeric Planner. In Proc. 25th Workshop of the UK Planning and Scheduling Special Interest Group (PLANSIG), 74–79.
  • Keller and Helmert (2013) Keller, T.; and Helmert, M. 2013. Trial-Based Heuristic Tree Search for Finite Horizon MDPs. In Proc. 23rd International Conference on Automated Planning and Scheduling (ICAPS). AAAI.
  • Khouadjia et al. (2013) Khouadjia, M.; Schoenauer, M.; Vidal, V.; Dréo, J.; and Savéant, P. 2013. Pareto-Based Multiobjective AI Planning. In Proc. 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2321–2327.
  • Klößner and Hoffmann (2021) Klößner, T.; and Hoffmann, J. 2021. Pattern Databases for Stochastic Shortest Path Problems. In Ma, H.; and Serina, I., eds., Proc. 14th International Symposium on Combinatorial Search (SOCS), 131–135.
  • Little and Thiébaux (2007) Little, I.; and Thiébaux, S. 2007. Probabilistic Planning vs Replanning. In Proc. ICAPS Workshop International Planning Competition: Past, Present and Future.
  • Mandow and Pérez-de-la-Cruz (2010) Mandow, L.; and Pérez-de-la-Cruz, J. 2010. Multiobjective A* Search with Consistent Heuristics. J. ACM, 57(5): 27:1–27:25.
  • Painter, Lacerda, and Hawes (2020) Painter, M.; Lacerda, B.; and Hawes, N. 2020. Convex Hull Monte-Carlo Tree-Search. In Proc. 30th International Conference on Automated Planning and Scheduling (ICAPS), 217–225.
  • Roijers and Whiteson (2017) Roijers, D. M.; and Whiteson, S. 2017. Multi-Objective Decision Making. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.
  • Trevizan, Teichteil-Königsbuch, and Thiébaux (2017) Trevizan, F.; Teichteil-Königsbuch, F.; and Thiébaux, S. 2017. Efficient Solutions for Stochastic Shortest Path Problems with Dead Ends. In Proc. 33rd International Conference on Uncertainty in Artificial Intelligence (UAI).
  • Trevizan et al. (2017) Trevizan, F.; Thiébaux, S.; Santana, P.; and Williams, B. 2017. I-dual: Solving Constrained SSPs via Heuristic Search in the Dual Space. In Proc. 26th International Joint Conference on Artificial Intelligence (IJCAI).
  • Trevizan, Thiébaux, and Haslum (2017) Trevizan, F. W.; Thiébaux, S.; and Haslum, P. 2017. Occupation Measure Heuristics for Probabilistic Planning. In Proc. 27th International Conference on Automated Planning and Scheduling (ICAPS), 306–315.
  • Ulloa et al. (2020) Ulloa, C. H.; Yeoh, W.; Baier, J. A.; Zhang, H.; Suazo, L.; and Koenig, S. 2020. A Simple and Fast Bi-Objective Search Algorithm. In Proc. 30th International Conference on Automated Planning and Scheduling (ICAPS), 143–151.
  • White (1982) White, D. J. 1982. Multi-objective infinite-horizon discounted Markov decision processes. Journal of Mathematical Analysis and Applications, 89(2): 639–647.
  • Wiering and de Jong (2007) Wiering, M. A.; and de Jong, E. D. 2007. Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes. In Proc. 1st IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 158–165.
  • Wingate and Seppi (2005) Wingate, D.; and Seppi, K. D. 2005. Prioritization Methods for Accelerating MDP Solvers. J. Mach. Learn. Res., 6: 851–881.
  • Younes et al. (2005) Younes, H. L. S.; Littman, M. L.; Weissman, D.; and Asmuth, J. 2005. The First Probabilistic Track of the International Planning Competition. J. Artif. Intell. Res., 24: 851–887.

Appendix
Domain (tasks)  Planner  blind  H^max_ideal  H^comax_mo  H^pdb2_ideal  H^pdb3_ideal  H^pdb2_mo  H^pdb3_mo  H^pdb2_mossp  H^pdb3_mossp  Sum
The heuristic columns comprise the blind heuristic, the critical path heuristics H^max_ideal and H^comax_mo, and the abstraction (PDB) heuristics.

SAR-4 (120)
  VI      39.7   39.0   -      39.0   39.0   35.0   38.0   39.0   39.0   307.7
  TVI     40.0   40.0   -      40.0   40.0   40.0   40.0   40.0   40.0   320.0
  LAO     51.0   54.0   83.0   62.0   66.0   76.0   77.0   84.3   91.0   644.3
  iLAO    80.0   90.0   75.0   94.0   100.0  90.0   97.0   93.0   104.0  823.0
  LRTDP   72.2   90.8   80.8   93.8   95.8   91.0   93.6   98.2   99.4   815.5
SAR-5 (120)
  VI      -      -      -      -      -      -      -      -      0.7    0.7
  TVI     -      -      -      -      -      -      -      -      0.7    0.7
  LAO     42.0   38.0   54.0   54.0   58.0   71.0   76.7   82.0   87.7   563.3
  iLAO    65.0   81.0   43.0   84.0   97.0   88.0   96.0   94.0   100.0  748.0
  LRTDP   55.5   80.8   41.2   86.0   89.5   88.0   91.0   91.2   93.8   717.0
ExBw-2d (155)
  VI      -      -      -      -      -      -      -      -      -      -
  TVI     -      -      -      -      -      -      -      -      -      -
  LAO     4.0    29.0   10.0   7.7    11.7   18.0   20.3   16.3   22.3   139.3
  iLAO    5.0    26.0   8.0    17.0   21.0   31.0   32.0   30.0   32.0   202.0
  LRTDP   22.0   39.8   11.5   33.0   27.0   52.8   44.2   47.6   42.2   320.1
ExBw-3d (155)
  VI      -      -      -      -      -      -      -      -      -      -
  TVI     -      -      -      -      -      -      -      -      -      -
  LAO     2.0    21.0   9.0    10.0   9.0    15.0   17.3   14.7   7.3    105.3
  iLAO    5.0    21.0   8.0    16.0   20.0   28.0   27.0   26.0   14.7   165.7
  LRTDP   20.0   36.0   11.0   31.5   26.0   50.0   41.4   45.0   8.6    269.5
Tireworld (10)
  VI      3.0    3.0    2.0    3.0    3.0    3.0    3.0    3.0    3.0    26.0
  TVI     2.0    2.0    2.0    3.0    3.0    3.0    2.7    3.0    3.0    23.7
  LAO     2.0    2.0    2.0    2.0    2.0    2.0    2.0    2.0    3.0    19.0
  iLAO    3.0    3.0    2.0    3.0    3.0    3.0    3.0    3.0    4.0    27.0
  LRTDP   3.0    3.0    2.0    3.0    3.0    3.0    3.0    3.0    4.0    27.0
Pr. VisitAll (26)
  VI      13.0   13.0   12.0   13.0   13.0   12.0   12.0   13.0   15.0   116.0
  TVI     13.0   13.0   12.0   13.0   13.0   12.0   12.0   13.7   14.0   115.7
  LAO     2.0    2.0    6.0    1.0    1.0    5.0    6.0    11.7   18.0   52.7
  iLAO    14.0   14.0   14.0   14.0   14.0   15.0   14.0   15.0   19.0   133.0
  LRTDP   10.5   10.2   13.0   10.2   9.8    13.0   13.0   17.6   18.2   115.5
VisitAllTire (24)
  VI      -      -      -      -      -      -      -      -      -      -
  TVI     -      -      -      -      -      -      -      -      -      -
  LAO     -      -      1.0    -      -      1.0    1.0    1.0    1.0    5.0
  iLAO    1.0    1.0    1.0    1.0    1.0    1.0    3.0    1.0    3.0    13.0
  LRTDP   3.0    2.8    1.0    3.0    3.0    5.0    5.0    5.0    5.0    32.8
Sum (610)
  VI      55.7   55.0   14.0   55.0   55.0   50.0   53.0   55.0   57.7   450.3
  TVI     55.0   55.0   14.0   56.0   56.0   55.0   54.7   56.7   57.7   460.0
  LAO     103.0  146.0  165.0  136.7  147.7  188.0  200.3  212.0  230.3  1529.0
  iLAO    173.0  236.0  151.0  229.0  256.0  256.0  272.0  262.0  276.7  2111.7
  LRTDP   186.2  263.2  160.5  260.5  254.0  302.8  291.2  307.6  271.2  2297.3

Table 3: Number of tasks solved with different MOSSP solvers and heuristics. SAR-5 is more difficult than SAR-4 due to its larger search space, and ExBw-3d is more difficult than ExBw-2d due to its larger number of objectives. The rightmost Sum column marginalises over heuristics, whereas the bottom Sum rows marginalise over domains.