Separate Generation and Evaluation for Parallel Greedy Best-First Search

Takumi Shimoda, Alex Fukunaga
Abstract

Parallelization of Greedy Best First Search (GBFS) has been difficult because straightforward parallelization can result in search behavior which differs significantly from sequential GBFS, exploring states which would not be explored by sequential GBFS with any tie-breaking strategy. Recent work has proposed a class of parallel GBFS algorithms which constrains search to exploration of the Bench Transition System (BTS), which is the set of states that can be expanded by GBFS under some tie-breaking policy. However, enforcing this constraint is costly, as such BTS-constrained algorithms are forced to spend much of their time waiting so that only states which are guaranteed to be in the BTS are expanded. We propose an improvement to parallel search which decouples state generation and state evaluation and significantly improves the state evaluation rate, resulting in better search performance.

1 Introduction

Parallelization of combinatorial search algorithms is important in order to maximize search algorithm performance on modern, multi-core CPUs. In the case of cost-optimal search, parallelization of the standard A* algorithm (Hart, Nilsson, and Raphael 1968) is somewhat well understood, and viable, practical approaches have been proposed. The optimality requirement imposes a relatively strong constraint on the set of states which must be expanded by any parallel A* (all nodes with f-values less than the optimal path cost C* must be expanded), so previous work has focused on approaches for expanding those required states while minimizing synchronization and communication overheads and avoiding expansion of non-required states (Burns et al. 2010; Kishimoto, Fukunaga, and Botea 2013; Phillips, Likhachev, and Koenig 2014; Fukunaga et al. 2017).

For satisficing search, where the objective is to quickly find any valid solution path (regardless of path cost), parallelization is not well understood. Greedy Best First Search (GBFS; Doran and Michie (1966)) is a widely used satisficing search algorithm. However, the performance of straightforward parallelizations of GBFS is non-monotonic with respect to resource usage: there is a significant risk that using k threads can result in significantly worse performance than using fewer than k threads. It has been shown experimentally that parallel GBFS can expand orders of magnitude more states than GBFS (Kuroiwa and Fukunaga 2019), and it has been shown theoretically that KPGBFS, a straightforward parallelization of GBFS, can expand arbitrarily many more states than GBFS (Kuroiwa and Fukunaga 2020).

Unlike parallel cost-optimal search, there is no obvious set of states which a parallel satisficing search must explore in order to be considered a “correct” parallelization of the sequential algorithm. Recent theoretical analysis of GBFS has yielded a promising direction for determining which states should be expanded. Heusner, Keller, and Helmert (2017) identified the Bench Transition System (BTS), the set of all states that can be expanded by GBFS under some tie-breaking policy (conversely, if a state is not in the BTS, there does not exist any tie-breaking strategy for GBFS which will expand that state). Limiting search to states in the BTS provides a natural constraint for parallel GBFS.

Recent work has proposed parallel GBFS algorithms which expand states from Open only if some constraint is satisfied. Kuroiwa and Fukunaga (2020) proposed PUHF, a parallel GBFS which guarantees that only states which are in the BTS will be expanded. However, this guarantee comes at a cost in performance. Since PUHF prevents expansion of any state unless it is certain that the state is in the BTS, threads can be forced to be idle while they wait until a state which is guaranteed to be in the BTS becomes available, resulting in significantly worse performance than KPGBFS. Improved versions of PUHF with looser constraints (which still guarantee that only states in the BTS are expanded) have been proposed (Shimoda and Fukunaga 2023), but these still have a significantly lower state evaluation rate than parallel GBFS without expansion constraints.

In this paper, we propose Separate Generation and Evaluation (SGE), which decouples state expansion and evaluation so that instead of waiting for a single thread to fully expand a state (generating and evaluating its successors), multiple threads can be used to evaluate the successors. We show that this significantly improves the state evaluation rate in parallel GBFS with expansion constraints.

The rest of the paper is structured as follows. Section 2 reviews background and previous work. Section 3 presents Constrained Parallel GBFS (CPGBFS), which unifies KPGBFS and all previous versions of PUHF as instances of a class of search algorithms with various state expansion constraints. Section 4 discusses the state expansion bottlenecks in CPGBFS. Section 5 proposes SGE. Section 6 experimentally evaluates SGE and compares it to PUHF and KPGBFS. Section 7 concludes with a discussion and directions for future work.

2 Preliminaries and Background

State Space Topology

State space topologies are defined following Heusner, Keller, and Helmert (2018).

Definition 1.

A state space is a 4-tuple 𝒮 = ⟨S, succ, s_init, S_goal⟩, where S is a finite set of states, succ: S → 2^S is the successor function, s_init ∈ S is the initial state, and S_goal ⊆ S is the set of goal states. If s′ ∈ succ(s), we say that s′ is a successor of s and that s → s′ is a (state) transition. ∀s ∈ S_goal, succ(s) = ∅. A heuristic for 𝒮 is a function h: S → ℕ₀ with ∀s ∈ S_goal, h(s) = 0. A state space topology is a pair ⟨𝒮, h⟩, where 𝒮 is a state space.

We call a sequence of states ⟨s_0, ..., s_n⟩ a path from s_0 to s_n, and denote the set of paths from s to s′ as P(s, s′). p_i is the i-th state in a path p and |p| is the length of p. A solution of a state space topology is a path p from s_init to a goal state. We assume at least one goal state is reachable from s_init, and ∀s ∈ S, s ∉ succ(s).

Best-First Search

Best-First Search (BFS) is a class of search algorithms that use an evaluation function f: S → ℝ and a tie-breaking strategy τ. BFS searches states in the order of evaluation function values (f-values). States with the same f-value are prioritized by τ. In Greedy Best-First Search (GBFS; Doran and Michie (1966)), f(s) = h(s).
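To make the expansion order concrete, sequential GBFS can be sketched as follows. This is an illustrative sketch, not the implementation evaluated in this paper; the toy state space and heuristic in the usage example are hypothetical.

```python
import heapq
from itertools import count

def gbfs(s_init, succ, h, is_goal):
    """Greedy Best-First Search: expand states in order of h-value.
    The counter implements a FIFO tie-breaking strategy tau among
    states with equal h-values."""
    tie = count()
    open_list = [(h(s_init), next(tie), s_init)]
    parent = {s_init: None}            # also serves as Closed (duplicate detection)
    while open_list:
        _, _, s = heapq.heappop(open_list)
        if is_goal(s):
            path = []
            while s is not None:       # reconstruct the solution path
                path.append(s)
                s = parent[s]
            return path[::-1]
        for t in succ(s):
            if t not in parent:
                parent[t] = s
                heapq.heappush(open_list, (h(t), next(tie), t))
    return None

# Hypothetical toy state space: states are integers, the goal is 7.
SUCC = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [7], 5: [6], 6: [], 7: []}
path = gbfs(0, lambda s: SUCC[s], lambda s: abs(7 - s), lambda s: s == 7)
```

On this toy space GBFS greedily follows low h-values (first exploring 2, 5, 6) before backtracking to the branch containing the goal.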

K-Parallel GBFS (KPGBFS)

K-Parallel BFS (Vidal, Bordeaux, and Hamadi 2010) is a straightforward, baseline parallelization of BFS. All threads share a single Open and Closed. Each thread locks Open to remove a state s with the lowest f-value in Open, locks Closed to check for duplicates and add succ(s) to Closed, and locks Open to add succ(s) to Open. KPGBFS is KPBFS with f(s) = h(s).

The set of states explored by KPGBFS may be very different from those explored by GBFS. Kuroiwa and Fukunaga (2020) showed that straightforward parallelizations of GBFS with shared Open and/or Closed, including KPGBFS, can expand arbitrarily more states than GBFS.

Bench Transition Systems

Heusner et al. (2017) defined the bench transition system (BTS) in order to characterize the behavior of GBFS, building upon the definition of high-water marks by Wilt and Ruml (2014). The BTS is defined as the set of all states which can be expanded by GBFS with some tie-breaking policy, i.e., a state s is in the BTS if there exists some tie-breaking policy under which s is expanded. Conversely, states not in the BTS will not be expanded by GBFS under any tie-breaking policy.

BTS-Constrained Search

Restricting the search to expand only states which are in the BTS is a natural constraint for parallel GBFS.

Definition 2.

A search algorithm is BTS-constrained if it expands only states which are in the BTS (Shimoda and Fukunaga 2023).

PUHF: A BTS-Constrained Parallel GBFS

Kuroiwa and Fukunaga (2020) proposed Parallel Under High-water mark First (PUHF), a BTS-constrained parallel GBFS. PUHF marks states which are guaranteed to be in the BTS as certain, and only expands states marked as certain. The criterion used by PUHF to mark states as certain was a restrictive, sufficient (but not necessary) condition for being in the BTS. Recently, looser sufficient conditions for marking states as certain were proposed, resulting in PUHF2–4, which significantly improved performance over PUHF (Shimoda and Fukunaga 2023).

3 Constrained Parallel GBFS

As described above, the original work proposing PUHF and PUHF2–4 presented these algorithms as marking states guaranteed to be in the BTS as certain, and only expanding nodes marked as certain. The stage in the algorithm where states were marked as certain was specific to each algorithm (PUHF, PUHF2–4). However, the similarities and differences among KPGBFS, PUHF, and PUHF2–4 can be clarified by reframing this behavior in terms of satisfying an algorithm-specific expansion constraint.

We present a unified framework, Constrained Parallel GBFS (CPGBFS) (Algorithm 1), which subsumes KPGBFS, PUHF, and PUHF2–4. CPGBFS is a schema for a class of parallel search algorithms based on KPGBFS which only expand nodes satisfying some algorithm-specific constraint (line 7), where satisfies(s) is a function which returns true if and only if s satisfies the algorithm-specific expansion constraint.

KPGBFS is a special case of CPGBFS where satisfies(s) always returns true. The previously proposed BTS-constrained search algorithms (PUHF and PUHF2–4) are instances of CPGBFS where satisfies(s) implements a check for a sufficient condition which guarantees that s is in the BTS; the specific implementation of satisfies depends on whether the algorithm is PUHF, PUHF2, PUHF3, or PUHF4. Algorithm-specific auxiliary computations related to satisfies are omitted for clarity.

1:  Open ← {s_init}, Closed ← {s_init}; ∀i, s_i ← NULL
2:  for i ← 0, ..., k−1 in parallel do
3:      loop
4:          lock(Open)
5:          if Open = ∅ then
6:              if ∀j, s_j = NULL then unlock(Open); return NULL
7:          else if satisfies(top(Open)) = true then
8:              s_i ← top(Open); Open ← Open \ {s_i}
9:          unlock(Open)
10:         if s_i = NULL then continue
11:         if s_i ∈ S_goal then return Path(s_i)
12:         for s_i′ ∈ succ(s_i) do
13:             lock(Closed)
14:             if s_i′ ∉ Closed then
15:                 Closed ← Closed ∪ {s_i′}
16:                 unlock(Closed)
17:                 children(s_i) ← children(s_i) ∪ {s_i′}
18:                 evaluate(s_i′)
19:             else
20:                 unlock(Closed)
21:         lock(Open)
22:         for s_i′ ∈ children(s_i) do
23:             Open ← Open ∪ {s_i′}
24:         unlock(Open)
25:         s_i ← NULL
Algorithm 1 CPGBFS: Constrained Parallel GBFS Template
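The template above can be rendered in executable form. The sketch below is an illustration under simplifying assumptions (a single lock guards both Open and Closed, and successors are evaluated before the duplicate check), not the implementation evaluated in the paper. The usage example instantiates satisfies(s) = true, i.e., the KPGBFS special case; a constrained instance would plug in a different satisfies together with the auxiliary bookkeeping omitted from Algorithm 1.

```python
import heapq
import threading
from itertools import count

def cpgbfs(s_init, succ, h, is_goal, satisfies, k=4):
    """Executable sketch of the CPGBFS template (Algorithm 1)."""
    tie = count()                                   # FIFO tie-breaking
    open_list = [(h(s_init), next(tie), s_init)]
    closed = {s_init}
    parent = {s_init: None}
    lock = threading.Lock()                         # guards Open and Closed
    busy = [0]                                      # threads mid-expansion
    result = []

    def worker():
        while True:
            with lock:
                if result:                          # another thread finished
                    return
                if not open_list:
                    if busy[0] == 0:                # search space exhausted
                        result.append(None)
                        return
                    continue                        # wait for in-flight work
                if not satisfies(open_list[0][2]):
                    continue                        # idle: constraint unmet
                _, _, s = heapq.heappop(open_list)
                busy[0] += 1
            if is_goal(s):
                with lock:
                    result.append(s)
                return
            children = [(h(t), t) for t in succ(s)]  # generate and evaluate
            with lock:                               # batch insert (lines 21-24)
                for hv, t in children:
                    if t not in closed:              # duplicate detection
                        closed.add(t)
                        parent[t] = s
                        heapq.heappush(open_list, (hv, next(tie), t))
                busy[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    goal = result[0] if result else None
    if goal is None:
        return None
    path = []
    while goal is not None:
        path.append(goal)
        goal = parent[goal]
    return path[::-1]

# Hypothetical toy state space; satisfies always true gives KPGBFS.
SUCC = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [7], 5: [6], 6: [], 7: []}
path = cpgbfs(0, lambda s: SUCC[s], lambda s: abs(7 - s),
              lambda s: s == 7, lambda s: True, k=4)
```

Note that successors enter Open only in a single batch, under the lock, after all of them have been evaluated, mirroring lines 21–24 of the template.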

4 State Expansion Bottlenecks in Constrained Parallel Search

Previous work has shown that CPGBFS algorithms (all of the PUHF variants) have a significantly lower state evaluation rate than unconstrained parallel search (KPGBFS). There are two related reasons for the low state evaluation rate in CPGBFS: (1) the expansion constraint, and (2) the eager evaluation policy.

Expansion Constraint Bottleneck

Figure 1: Example for SGE

Unconstrained parallel search algorithms such as KPGBFS will unconditionally expand the top states in Open. Threads in unconstrained parallel search algorithms are only idle when waiting for a mutex lock for the shared Open and Closed structures. If the shared Open/Closed data structures are implemented efficiently (e.g., using sharding to internally partition the data structures), then waiting and mutual exclusion overhead can be greatly reduced.

In contrast, constrained parallel GBFS algorithms such as PUHF force threads to be idle in order to guarantee that only states which satisfy the search constraints are expanded: even if there is a state in Open and the mutex lock is available, CPGBFS cannot expand the top state top(Open) unless the expansion constraint is satisfied (satisfies(top(Open)) = true).

One approach to decreasing idle time is to improve the expansion constraint to be closer to a necessary condition, as the previously proposed expansion constraints are all sufficient (overly restrictive) conditions for a state to be in the BTS. Previous work used this approach to improve the expansion constraint for PUHF, resulting in a significantly improved evaluation rate (Shimoda and Fukunaga 2023). While this approach can be effective, finding safe improvements to the expansion constraint can be nontrivial.

Furthermore, even an ideal constraint (e.g., a necessary and sufficient expansion constraint for the BTS) would not eliminate idle threads. For example, in Figure 1 the only states in the BTS are the circular nodes (s_0, s_{1,1}, s_{2,1}, s_{3,1}, s_goal), so any BTS-constrained algorithm can only expand one of these states at a time, while all other threads must wait.

Batch Successor Insertion Bottleneck

A second, closely related bottleneck is caused by the requirement that CPGBFS insert successor states into Open in a single batch. Consider Figure 1. In a standard, single-threaded implementation of best-first search with eager evaluation, the expansion of s_{1,1} includes (1) generating succ(s_{1,1}), (2) evaluating all states in succ(s_{1,1}) = {s_{2,1}, s_{2,2}^1, ..., s_{2,2}^x} with a heuristic evaluation function, and (3) inserting succ(s_{1,1}) into Open. In many cases, computing the heuristic evaluation function consumes the majority of the time spent expanding a state, so the full expansion of a single state such as s_{1,1} can take a significant amount of time due to the evaluation of all of its successors.

A constrained parallel search which seeks to expand a similar set of nodes as GBFS has an additional requirement not present in single-threaded search: the successors of a state s are all inserted into Open simultaneously, only after all successors of s have been evaluated (Algorithm 1, lines 21–24). This ensures that the successors of s are expanded in best-first order. Without this batch insertion (e.g., if states were inserted one at a time directly into Open immediately after being evaluated), a state s′ with a worse f-value than its sibling s′′ might be inserted into Open and expanded by another thread before s′′ has even been evaluated.
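The effect of batch insertion can be illustrated with a small single-threaded simulation (a toy model: the h-values and the order in which evaluations finish are hypothetical, and the "concurrent" pop by another thread is simulated by popping after each push):

```python
import heapq

def pop_order_incremental(sibling_h, eval_order):
    """Siblings enter Open one at a time, as soon as each is evaluated;
    another thread pops between evaluations, so a worse sibling can be
    expanded before a better sibling has even been evaluated."""
    open_list, popped = [], []
    for i in eval_order:
        heapq.heappush(open_list, sibling_h[i])
        popped.append(heapq.heappop(open_list))   # simulated concurrent pop
    return popped

def pop_order_batch(sibling_h, eval_order):
    """All siblings enter Open together after the last evaluation
    finishes, so they are popped in best-first (lowest-h) order."""
    open_list = []
    for i in eval_order:
        heapq.heappush(open_list, sibling_h[i])   # buffered until complete
    return [heapq.heappop(open_list) for _ in sibling_h]

h_values = [5, 1, 3]   # h-values of three siblings
order = [2, 0, 1]      # evaluations happen to finish in this order
batch = pop_order_batch(h_values, order)              # best-first: [1, 3, 5]
incremental = pop_order_incremental(h_values, order)  # [3, 5, 1]
```

In the incremental variant, the sibling with h = 3 is expanded before the sibling with h = 1 has even been evaluated, which is exactly the out-of-order behavior that batch insertion prevents.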

In the case of unconstrained parallel search such as KPGBFS, the overall evaluation rate is not significantly affected by whether the successors are inserted in a single batch or one at a time, because available threads can freely expand the top states from Open.

However, in CPGBFS, the interaction of the expansion constraint and the batch successor insertion requirement causes a significant bottleneck. For example, in Figure 1, if a thread executing Algorithm 1 starts to expand s_{1,1}, all other threads must stop and wait until all of succ(s_{1,1}) have been fully evaluated and inserted into Open.

5 Separate Generation and Evaluation (SGE)

In this section, we propose SGE, an approach for increasing the state evaluation rate in constrained best-first search algorithms such as PUHF. SGE alleviates the batch successor insertion bottleneck described above.

Continuing the example from the previous section, in the case of Figure 1, instead of waiting idly while one thread expands s_{1,1} (which includes computing all of the heuristic values for s_{2,1}, s_{2,2}^1, ..., s_{2,2}^x), it would be more efficient to parallelize the evaluation of these successors among the available threads.

1:  Open ← {s_init}, Closed ← {s_init}; ∀i, s_i ← NULL
2:  for i ← 0, ..., k−1 in parallel do
3:      loop
4:          lock(Unevaluated)
5:          if Unevaluated ≠ ∅ then
6:              s_i ← top(Unevaluated)
7:              Unevaluated ← Unevaluated \ {s_i}
8:              unlock(Unevaluated)
9:              evaluate(s_i)  ▷ using a hashtable to prevent reevaluation of states
10:             if all siblings of s_i have been evaluated then
11:                 lock(Open), lock(Closed)
12:                 for s_i′ ∈ siblings of s_i do
13:                     if s_i′ ∉ Closed then
14:                         Closed ← Closed ∪ {s_i′}
15:                         Open ← Open ∪ {s_i′}
16:                 unlock(Open), unlock(Closed)
17:         else
18:             unlock(Unevaluated), lock(Open)
19:             if Open = ∅ then
20:                 unlock(Open)
21:                 if ∀j, s_j = NULL then
22:                     return NULL
23:             else if satisfies(top(Open)) = true then
24:                 s_i ← top(Open); Open ← Open \ {s_i}
25:                 unlock(Open)
26:             if s_i = NULL then continue
27:             if s_i ∈ S_goal then
28:                 return Path(s_i)
29:             lock(Unevaluated)
30:             Unevaluated ← Unevaluated ∪ succ(s_i)
31:             unlock(Unevaluated)
32:         s_i ← NULL
Algorithm 2 Constrained Parallel GBFS with SGE Template

We propose Separate Generation and Evaluation (SGE), which parallelizes state evaluations in addition to expansions. The main idea is to decompose the expansion of a state s into separate units of work which can be parallelized: (1) successor generation, which generates succ(s), the successors of s, and (2) successor evaluation, which evaluates succ(s).

Algorithm 2 shows Constrained Parallel GBFS with SGE. After a thread selects a state s for expansion from the shared Open, it generates succ(s) and inserts succ(s) into the shared Unevaluated queue. The evaluation of states in Unevaluated is done in parallel and takes precedence over selection of states for expansion: a thread will select a state for expansion from Open only if Unevaluated is currently empty (Algorithm 2, line 5).

Evaluated states are not immediately inserted into Open. Instead, we insert all of the successors of s into Open simultaneously, after they have all been evaluated (lines 10–16). This is so that the parallel search is able to prioritize succ(s) similarly to GBFS (otherwise, succ(s) could be expanded in a completely different order than by GBFS, which is what we are trying to prevent).
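One way to implement the "all siblings evaluated" check of lines 10–16 is a per-parent countdown shared by the sibling states. The class below is a hypothetical sketch of this bookkeeping, not the authors' implementation:

```python
import threading

class SiblingBatch:
    """Per-parent bookkeeping for SGE batch insertion: the successors of
    a parent state become visible in Open only after the last sibling
    has been evaluated."""

    def __init__(self, siblings):
        self.siblings = list(siblings)
        self.remaining = len(self.siblings)   # siblings not yet evaluated
        self.lock = threading.Lock()

    def sibling_evaluated(self):
        """Called by the thread that just evaluated one sibling. Returns
        the full batch exactly once, to the thread that finished last;
        that thread then locks Open/Closed and inserts the batch."""
        with self.lock:
            self.remaining -= 1
            if self.remaining == 0:
                return self.siblings
        return None

# Three hypothetical sibling states; only the third call releases the batch.
batch = SiblingBatch(["s1", "s2", "s3"])
results = [batch.sibling_evaluated() for _ in range(3)]
```

In Algorithm 2, each entry in Unevaluated would carry a reference to its parent's batch object; whichever thread's call returns the batch performs the duplicate check and the simultaneous insertion into Open.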

Consider the search behavior of a BTS-constrained parallel search such as PUHF with SGE on the search space in Figure 1. First, a thread pops s_0 from Open (satisfies(s_0) = true) and generates its successors (s_{1,1}, s_{1,2}^1, ..., s_{1,2}^x), which are inserted into Unevaluated. Available threads pop these successors from Unevaluated and evaluate them. When all successors of s_0 have been evaluated, they are all inserted into Open. Next, some thread removes s_{1,1} (satisfies(s_{1,1}) = true), generates its successors (s_{2,1}, s_{2,2}^1, ..., s_{2,2}^x), and inserts them into Unevaluated. While the successors of s_{1,1} are being generated, other threads may try to pop a state from Open, but since the top state at that time (s_{1,2}^i) is not in the BTS (satisfies(s_{1,2}^i) = false), Open remains untouched. After the successors of s_{1,1} have been inserted into Unevaluated, the available threads remove them from Unevaluated and evaluate them. The search continues similarly until s_goal is found. Each time a state’s successors are generated, multiple available threads evaluate the successors in parallel. This clearly improves thread utilization compared to BTS-constrained search without SGE, which uses only one thread to expand (generate and evaluate the successors of) each state, so that only 1 thread is active throughout the search space in Figure 1.

The basic idea of decoupling state generation and state evaluation is similar to that of ePA*SE, which decouples state generation and edge evaluations in a parallel A* (Mukherjee, Aine, and Likhachev 2022), where an edge evaluation is the computation required to evaluate the application of an operator (e.g., collision checking using a simulation model in robot motion planning). In both cases, using only a single thread to fully process a state expansion causes a bottleneck which forces other threads to be idle, as the algorithmic constraints of constrained parallel GBFS and A* require waiting until that expansion is fully processed; decomposing the expansion process and parallelizing it among the available threads therefore leads to better processor utilization. Because the requirements and objectives of GBFS (satisficing search) and A* (cost-optimal search) differ, the implementation of SGE is somewhat simpler (using an Unevaluated queue instead of dummy/real edges as in ePA*SE).

SGE and Thread Utilization

Consider a situation where all threads are currently simultaneously available. In unconstrained parallel BFS with k threads, the top k nodes in Open can be expanded simultaneously, so all k threads will be utilized. On the other hand, in CPGBFS without SGE, if there are currently a states which satisfy the expansion constraint, then the number of states which can be expanded simultaneously is min(a, k), and k − min(a, k) threads will be idle until more states in Open satisfy the expansion constraint. With SGE, although there will be a brief time when only min(a, k) threads are active (the generation phase), the number of active threads while the successors of the a states are being evaluated will be min(ab, k), where b is the average number of previously unevaluated successors per state.
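The utilization argument can be checked with a line of arithmetic. The values below are purely illustrative: k = 16 threads, a = 1 expandable state (as in the BTS of Figure 1), and b = 8 previously unevaluated successors per state.

```python
def active_threads_without_sge(a, k):
    # a states satisfy the expansion constraint; each is expanded by one thread
    return min(a, k)

def active_threads_with_sge(a, b, k):
    # the a*b generated successors are evaluated by up to k threads
    return min(a * b, k)

k, a, b = 16, 1, 8
without = active_threads_without_sge(a, k)   # 1 thread busy, 15 idle
with_sge = active_threads_with_sge(a, b, k)  # 8 threads busy during evaluation
```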

SGE and Search Behavior

The state expansion order of a parallel search algorithm A with SGE will differ from the expansion order of A without SGE. Although it is nontrivial to precisely characterize the difference between the expansion order of an algorithm with and without SGE, a simple approximation is that executing a parallel search algorithm A with SGE on k threads is somewhat similar to executing A without SGE on m < k threads, where each of the m “threads” is faster than each of the actual k threads.

As a simple example, consider searching a state space which is a tree with uniform branching factor 2, where computing the heuristic evaluation function is the computational bottleneck, and assume that Open currently contains many nodes. With k = 16 threads, KPGBFS will expand 16 states at a time: each thread expands 1 state, where the expansion includes generation and heuristic evaluation of the state’s 2 successors. In contrast, KPGBFS with SGE would be evaluating 16 states at a time: each thread, after quickly generating the successors of a state, would then be assigned to evaluate 1 successor state. This is essentially the same as KPGBFS without SGE expanding 8 states, i.e., similar to KPGBFS without SGE running on 8 threads.

SGE and Expansion Constraints

The state expansion constraints (i.e., the satisfies check) for the various CPGBFS algorithms known to date are defined based on either (a) comparisons between the h-value of the parent of s and the h-values of the siblings of s (for PUHF), or (b) comparisons between h(s) and the h-values of other states currently being expanded (for PUHF2, PUHF3, PUHF4). Therefore, distributing the evaluation of the successors of s among multiple threads has no impact on the correctness of the expansion constraint (i.e., the guarantee that the node being expanded is in the BTS).
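The independence of the constraint from SGE can be seen from its inputs. The check below is a simplified, hypothetical stand-in for a type-(b) constraint (it is NOT the actual PUHF2–4 condition, which is given in the cited papers): it consumes only h-values, so its result does not depend on which thread computed each evaluation.

```python
def satisfies_type_b(h_s, h_expanding):
    """Hypothetical sketch of a type-(b) expansion check: compare h(s)
    against the h-values of states currently being expanded. Only
    h-values are consumed; nothing here depends on which thread
    evaluated which successor, so SGE leaves the check unaffected."""
    return all(h_s <= h_other for h_other in h_expanding)

ok = satisfies_type_b(3, [5, 7])       # h(s) no worse than any in-flight state
blocked = satisfies_type_b(6, [5, 7])  # an in-flight state has lower h
```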

Overall Effect on Search Performance

SGE can be applied to any parallel best-first search algorithm, such as PUHF and KPGBFS. The impact on performance will depend on the extent to which state expansion/evaluation is a performance bottleneck. In the case of a constrained algorithm such as PUHF, SGE should allow higher utilization of threads that would otherwise be idle (waiting for a state which satisfies the expansion constraint), resulting in significantly higher state evaluation rates, which should in turn result in better overall search performance (more problems solved within a given time limit).

On the other hand, in the case of an unconstrained algorithm such as KPGBFS, the evaluation rate should be minimally affected, but search performance may be slightly affected because of the different state expansion order due to SGE.

6 Experimental Evaluation of SGE

In this section, we experimentally evaluate SGE. We emphasize that the goal of this experimental study is to evaluate and understand SGE as an implementation technique which can be applied to a wide range of parallel best-first search algorithms. Thus, we focus on comparing PUHF2 vs. PUHF2_S (PUHF2 with SGE) to evaluate the effect of SGE on constrained parallel best-first search, and we compare KPGBFS vs. KPGBFS_S (KPGBFS with SGE) to evaluate the effect of SGE on unconstrained parallel best-first search. Comparing constrained parallel search vs. unconstrained parallel search (e.g., PUHF2_S vs. KPGBFS) is beyond the scope of this study.

To evaluate SGE, we use the satisficing instances of the Autoscale-21.11 benchmark set (42 STRIPS domains, 30 instances/domain, 1260 total instances) (Torralba, Seipp, and Sievers 2021), an improved benchmark suite based on the IPC classical planning benchmarks. All search algorithms use the FF heuristic (Hoffmann and Nebel 2001). Each run has a time limit of 5 minutes and a limit of 3 GB RAM/thread (e.g., 24 GB total for an 8-thread run) on an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz processor.

All tie-breaking is First-In-First-Out.

We evaluate all algorithms on k ∈ {4, 8, 16} threads. We also report some results for single-threaded GBFS as a baseline.

Table 1(a) and Figure 2 compare the state evaluation rates of the algorithms. Table 1(b) and Figure 3 compare the number of states expanded by the algorithms. Table 1(c) and Figure 4 compare the search times of the algorithms. Table 1(d) shows the speedup compared to single-threaded GBFS.

Table 2(a) shows the number of instances solved by each algorithm, and Table 2(b) shows the number of problems solved by algorithm x but not by algorithm y for x, y ∈ {KPGBFS, KPGBFS_S, PUHF2, PUHF2_S}.

In Tables 1(a)–1(d), we include only the 412 instances solved by all algorithms so that means can be computed. Figures 2–4 show all instances; instances which were not solved (due to either timeout or memory exhaustion) are shown as either x = fail or y = fail.

#threads    1 thread   4 threads   8 threads   16 threads
GBFS        5791       -           -           -
KPGBFS      -          18502       33944       58790
KPGBFS_S    -          17683       32074       53880
PUHF2       -          14994       22716       33632
PUHF2_S     -          16350       26126       40097
(a) State evaluation rate (geometric mean)

#threads    1 thread   4 threads   8 threads   16 threads
GBFS        1123       -           -           -
KPGBFS      -          1896        2559        3400
KPGBFS_S    -          1601        2016        2557
PUHF2       -          1631        2009        2422
PUHF2_S     -          1456        1634        1945
(b) Number of states expanded (geometric mean)

#threads    1 thread   4 threads   8 threads   16 threads
GBFS        1.2605     -           -           -
KPGBFS      -          0.672       0.497       0.385
KPGBFS_S    -          0.588       0.401       0.302
PUHF2       -          0.723       0.595       0.498
PUHF2_S     -          0.601       0.426       0.334
(c) Search time (geometric mean)

#threads    1 thread   4 threads   8 threads   16 threads
GBFS        1.0        -           -           -
KPGBFS      -          3.17        10.22       12.69
KPGBFS_S    -          3.06        8.29        12.38
PUHF2       -          2.38        5.22        6.67
PUHF2_S     -          3.81        7.02        11.03
(d) Speedup (Search time[GBFS] / Search time, arithmetic mean)
Table 1: Comparison on the Autoscale-21.11 IPC-based planning benchmark set. Means over the 412 instances solved by all algorithms.
#threads    1 thread   4 threads   8 threads   16 threads
GBFS        448        -           -           -
KPGBFS      -          510         534         567
KPGBFS_S    -          507         529         565
PUHF2       -          500         517         547
PUHF2_S     -          507         524         551
(a) Number of instances solved

            KPGBFS   KPGBFS_S   PUHF2   PUHF2_S
KPGBFS      -        34         41      43
KPGBFS_S    32       -          38      33
PUHF2       21       20         -       20
PUHF2_S     27       19         24      -
(b) For k = 16 threads, number of problems solved by the algorithm in the row but not by the algorithm in the column, e.g., KPGBFS solved 41 problems not solved by PUHF2
Table 2: Coverage results on the Autoscale-21.11 IPC-based planning benchmark set.

Evaluation Rates

First, comparing the state evaluation rates of KPGBFS and PUHF2 (Table 1(a), Figure 2), we confirm that, as shown in previous work (Shimoda and Fukunaga 2023), PUHF2 has a significantly lower evaluation rate than KPGBFS. The only difference between KPGBFS and PUHF2 is that KPGBFS can always expand the top state in Open, while PUHF2 only expands the top state in Open if a more restrictive constraint is satisfied (Algorithm 1, line 7). Thus, the lower state evaluation rate of PUHF2 is due to threads waiting idly when top(Open) does not satisfy the constraint.

Comparing the state evaluation rates (states/second) of PUHF2S and PUHF2 (Table 1(a), Figure 2), we see that for k ∈ {4, 8, 16} threads, PUHF2S has a significantly higher evaluation rate than PUHF2. This shows that SGE successfully achieves the goal of improving the evaluation rate for constrained parallel best-first search.

Comparing KPGBFS and KPGBFSS, the evaluation rate of KPGBFSS is somewhat lower than that of KPGBFS for k ∈ {4, 8, 16} threads. Thus, managing the separate Unevaluated queue introduced by SGE imposes a noticeable penalty; in the comparison of KPGBFS vs. KPGBFSS in Figure 2, the effect of this overhead is most noticeable on the problems with the highest evaluation rates (top right).
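As a concrete illustration of the decoupling and of the extra queue to be managed, a single worker step under SGE might look like the following. The names and the queue discipline are hypothetical, not the paper's exact implementation.

```python
import heapq
from collections import deque

def worker_step(open_list, unevaluated, can_expand, successors, evaluate):
    """One step of a worker thread under SGE (illustrative sketch).

    Expansion only *generates* successors into a shared Unevaluated
    queue; the heuristic is not called at expansion time.  A thread that
    cannot expand (constraint unsatisfied, or Open empty) evaluates a
    generated state instead of idling.  Maintaining the Unevaluated
    queue is the extra management overhead discussed above.
    """
    if open_list and can_expand(open_list[0]):
        _, state = heapq.heappop(open_list)
        for succ in successors(state):      # generation only, no evaluation
            unevaluated.append(succ)
    elif unevaluated:
        succ = unevaluated.popleft()
        heapq.heappush(open_list, (evaluate(succ), succ))  # evaluation decoupled
```

For example, expanding a state first fills `unevaluated`; a later step (here with `len` standing in for the heuristic) moves an evaluated successor back into `open_list`.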

Number of States Expanded

The number of states expanded to solve an instance measures search efficiency. For all values of k, PUHF2S expanded fewer states than PUHF2, and KPGBFSS expanded fewer states than KPGBFS, so SGE has the effect of reducing the search required to solve problem instances for both constrained and unconstrained parallel GBFS.

A possible explanation is that, as described in Section 5, an algorithm with SGE tends to behave as if there were fewer threads. Previous work has shown that the search efficiency of parallel search tends to decrease as the number of threads increases (Kuroiwa and Fukunaga 2019), so behaving as if there were fewer threads leads to better search efficiency.

Search Time and Speedup

The wall-clock runtime required to find a solution (search time) is a measure of overall performance.

As a result of its higher state evaluation rate and comparable search efficiency, PUHF2S has significantly lower search times than PUHF2. On the other hand, KPGBFSS runs only slightly faster than KPGBFS, because the reduction in the number of states expanded due to SGE is offset by the lower evaluation rate.

Speedup compared to single-threaded search (GBFS) measures how successful a method is as a parallelization technique. Comparing PUHF2S and PUHF2, we observe that PUHF2S achieves almost k-fold speedup relative to single-threaded GBFS, while the speedup for PUHF2 is far from k.
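Note that Table 1(d) is an arithmetic mean of per-instance speedups, while Table 1(c) reports geometric means of search times, so dividing the entries of (c) does not reproduce (d). The toy numbers below are purely illustrative (not taken from Table 1) and show how the two summaries diverge.

```python
import math

def geometric_mean(xs):
    """Geometric mean, the summary used for times in Table 1(c)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Toy per-instance search times in seconds (illustrative only):
gbfs_times = [2.0, 8.0, 0.5]
parallel_times = [0.4, 1.0, 0.25]

# Ratio of geometric means (a Table 1(c)-style comparison):
geo_ratio = geometric_mean(gbfs_times) / geometric_mean(parallel_times)

# Arithmetic mean of per-instance speedups (a Table 1(d)-style summary):
per_instance = [g / p for g, p in zip(gbfs_times, parallel_times)]
mean_speedup = sum(per_instance) / len(per_instance)

print(geo_ratio, mean_speedup)  # the two summaries generally differ
```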

Coverage

Table 2(a) shows that applying SGE increases the number of instances solved by PUHF2, but slightly lowers the number of instances solved by KPGBFS. Table 2(b) shows that for k = 16 threads, each algorithm (PUHF2, PUHF2S, KPGBFS, KPGBFSS) solves some instances not solved by the other algorithms (results for 4 and 8 threads are similar; not shown due to space).

7 Discussion and Conclusion

We proposed SGE, an approach to increasing state evaluation rates in constrained parallel search algorithms by separating successor generation from successor evaluation. We showed that SGE significantly increases the state evaluation rate of PUHF2, resulting in improved overall performance of PUHF2S compared to PUHF2. Preliminary experiments with PUHF and PUHF3 have yielded similar results, suggesting that SGE consistently yields performance improvements for parallel GBFS algorithms with state expansion constraints. For unconstrained search such as KPGBFS, SGE decreases the evaluation rate: unconstrained search has no expansion bottleneck, so the additional complexity of SGE only imposes overhead.

We also showed that SGE results in more efficient search (fewer expansions to solve an instance) for both PUHF2 (constrained search) and KPGBFS (unconstrained search). While we hypothesize that this is because SGE with k threads behaves as if there were fewer, faster threads, as explained in Section 5, a deeper experimental investigation of the expansion order with and without SGE is an avenue for future work.

The batch successor insertion bottleneck addressed by SGE (Section 4) is a byproduct of implementing a parallel search algorithm which is limited to expanding a set of states similar to those expanded by single-threaded GBFS with a standard eager evaluation policy, where states are evaluated immediately after they are generated and before being inserted into Open. In lazy (deferred) evaluation (Richter and Helmert 2009), where states are inserted into Open unevaluated, prioritized by their parent's f-value, and evaluated only when they are later expanded, the batch successor insertion bottleneck would not apply. Search with lazy evaluation behaves quite differently from search with eager evaluation, and parallel satisficing search with lazy evaluation is an avenue for future work.
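The contrast between the two insertion policies can be sketched as follows (an illustrative sketch, not any planner's actual implementation; `h` stands for the heuristic function):

```python
import heapq

def insert_eager(open_list, successors, h):
    """Eager evaluation: each successor is evaluated immediately after
    generation, before insertion into Open -- the setting in which the
    batch successor insertion bottleneck arises."""
    for s in successors:
        heapq.heappush(open_list, (h(s), s))  # heuristic called at generation time

def insert_lazy(open_list, successors, parent_value):
    """Lazy (deferred) evaluation (Richter and Helmert 2009): successors
    are inserted with their parent's value and evaluated only when later
    popped for expansion, so there is no batch of evaluations to insert."""
    for s in successors:
        heapq.heappush(open_list, (parent_value, s))  # no heuristic call here
```

For example, with `h = len`, eager insertion of `["bb", "a"]` orders them by their own values, while lazy insertion queues both under the parent's value and defers the heuristic calls.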

[Six scatter plots omitted.]
Figure 2: State evaluation rates (states/second); diagonal lines are y = 0.1x, y = x, and y = 10x.
[Six scatter plots omitted.]
Figure 3: Number of states expanded; "fail" = out of time/memory; diagonal lines are y = 0.1x, y = x, and y = 10x.
[Six scatter plots omitted.]
Figure 4: Search time (seconds); "fail" = out of time/memory; diagonal lines are y = 0.1x, y = x, and y = 10x.

References

  • Burns et al. (2010) Burns, E.; Lemons, S.; Ruml, W.; and Zhou, R. 2010. Best-First Heuristic Search for Multicore Machines. J. Artif. Intell. Res., 39: 689–743.
  • Doran and Michie (1966) Doran, J.; and Michie, D. 1966. Experiments with the Graph Traverser Program. In Proc. Royal Society A: Mathematical, Physical and Engineering Sciences, volume 294, 235–259.
  • Fukunaga et al. (2017) Fukunaga, A.; Botea, A.; Jinnai, Y.; and Kishimoto, A. 2017. A Survey of Parallel A*. CoRR, abs/1708.05296.
  • Hart, Nilsson, and Raphael (1968) Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. on Systems Science and Cybernetics, 4(2): 100–107.
  • Heusner, Keller, and Helmert (2017) Heusner, M.; Keller, T.; and Helmert, M. 2017. Understanding the Search Behaviour of Greedy Best-First Search. In Proc. SOCS, 47–55.
  • Heusner, Keller, and Helmert (2018) Heusner, M.; Keller, T.; and Helmert, M. 2018. Best-Case and Worst-Case Behavior of Greedy Best-First Search. In Proc. IJCAI, 1463–1470.
  • Hoffmann and Nebel (2001) Hoffmann, J.; and Nebel, B. 2001. The FF Planning System: Fast Plan Generation through Heuristic Search. J. Artif. Intell. Res., 14: 253–302.
  • Kishimoto, Fukunaga, and Botea (2013) Kishimoto, A.; Fukunaga, A.; and Botea, A. 2013. Evaluation of a simple, scalable, parallel best-first search strategy. Artif. Intell., 195: 222–248.
  • Kuroiwa and Fukunaga (2019) Kuroiwa, R.; and Fukunaga, A. 2019. On the Pathological Search Behavior of Distributed Greedy Best-First Search. In Proc. ICAPS, 255–263.
  • Kuroiwa and Fukunaga (2020) Kuroiwa, R.; and Fukunaga, A. 2020. Analyzing and Avoiding Pathological Behavior in Parallel Best-First Search. In Proc. ICAPS, 175–183.
  • Mukherjee, Aine, and Likhachev (2022) Mukherjee, S.; Aine, S.; and Likhachev, M. 2022. ePA*SE: Edge-Based Parallel A* for Slow Evaluations. In Chrpa, L.; and Saetti, A., eds., Proc. SOCS, 136–144.
  • Phillips, Likhachev, and Koenig (2014) Phillips, M.; Likhachev, M.; and Koenig, S. 2014. PA*SE: Parallel A* for Slow Expansions. In Proc. ICAPS, 208–216.
  • Richter and Helmert (2009) Richter, S.; and Helmert, M. 2009. Preferred Operators and Deferred Evaluation in Satisficing Planning. In Proc. ICAPS, volume 19, 273–280.
  • Shimoda and Fukunaga (2023) Shimoda, T.; and Fukunaga, A. 2023. Improved Exploration of the Bench Transition System in Parallel Greedy Best First Search. In Proc. SOCS, 74–82.
  • Torralba, Seipp, and Sievers (2021) Torralba, Á.; Seipp, J.; and Sievers, S. 2021. Automatic Instance Generation for Classical Planning. In Proc. ICAPS, 376–384.
  • Vidal, Bordeaux, and Hamadi (2010) Vidal, V.; Bordeaux, L.; and Hamadi, Y. 2010. Adaptive K-Parallel Best-First Search: A Simple but Efficient Algorithm for Multi-Core Domain-Independent Planning. In Proc. SOCS, 100–107.
  • Wilt and Ruml (2014) Wilt, C. M.; and Ruml, W. 2014. Speedy Versus Greedy Search. In Proc. SOCS, 184–192.