
Efficient Neural Neighborhood Search for Pickup and Delivery Problems

Yining Ma1,∗    Jingwen Li1,∗    Zhiguang Cao2,†    Wen Song3    Hongliang Guo4    Yuejiao Gong5    Yeow Meng Chee1

∗ Equally contributed. † Zhiguang Cao is the corresponding author.

1National University of Singapore
2Singapore Institute of Manufacturing Technology, A*STAR
3Institute of Marine Science and Technology, Shandong University
4Institute for Infocomm Research, A*STAR
5South China University of Technology

{yiningma, lijingwen}@u.nus.edu, [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract

We present an efficient Neural Neighborhood Search (N2S) approach for pickup and delivery problems (PDPs). Specifically, we design a powerful Synthesis Attention that allows the vanilla self-attention to synthesize various types of features regarding a route solution. We also exploit two customized decoders that automatically learn to perform removal and reinsertion of a pickup-delivery node pair to tackle the precedence constraint. Additionally, a diversity enhancement scheme is leveraged to further improve performance. Our N2S is generic, and extensive experiments on two canonical PDP variants show that it produces state-of-the-art results among existing neural methods. Moreover, it even outstrips the well-known LKH3 solver on the more constrained PDP variant. Our implementation of N2S is available online.1

1 Code is available at https://github.com/yining043/PDP-N2S. The published version of this preprint can be found at IJCAI-22 (see https://www.ijcai.org/proceedings/2022/662). This preprint fixes some typos and contains additional discussions and results.

1 Introduction

Figure 1: An example of a PDTSP-7 instance to illustrate our N2S approach. From left to right: the encoding, decoding, and state transition processes. The encoders process raw features of the current solution to produce node embeddings, which are then fed into the two decoders to sample an action. In the state transition, the node pair $(1^{+},1^{-})$ is removed and then reinserted after the depot (0) and node $3^{+}$, respectively.

Efficient neighborhood search functions as the key component of powerful heuristics for pickup and delivery problems (PDPs) Parragh et al. (2008). It involves an iterative search process that transforms a solution into another candidate in its current neighborhood, hopefully in an efficient way. The design of the neighborhood and search rules usually determines the efficiency, and even the success, of a solver. However, they are often problem-specific and manually engineered with a lot of trial and error, and must be redesigned whenever constraints or objectives change. These limitations may hinder their application in rapidly evolving industries.

On the other hand, recent neural methods for vehicle routing problems (VRPs) have emerged as promising alternatives to traditional heuristics (e.g., Ma et al. (2021b)). They are usually faster and, more importantly, could automate the design of heuristics for new variants where no hand-crafted rule is available Kwon et al. (2020). However, the prevailing neural methods mainly focus on the travelling salesman problem (TSP) or the capacitated vehicle routing problem (CVRP), while efficient neural solvers for PDPs are rarely studied. The PDP is ubiquitous in logistics, robotics, meal-delivery services, etc. Parragh et al. (2008); it optimizes the route for pickup-delivery requests and is characterised by the precedence constraint (pickup before delivery). Despite the first attempt in Li et al. (2021b), which learns a construction method to build a PDP solution (route) in seconds, a considerable gap to traditional heuristics in solution quality remains.

To reduce the gap, we propose an efficient Neural Neighborhood Search (N2S) approach for PDPs, based on a novel Transformer-styled policy network with an encoder-decoder structure. To our knowledge, the existing policy network most similar to ours is the DACT in Ma et al. (2021b). Also an improvement method, DACT learns to encode the current solution and transform it into another one using the 2-opt, insert, or swap decoder. Among them, the 2-opt decoder, which considers reversing a segment of the solution, performed best for TSP and CVRP. However, it is not suitable for PDPs since the precedence constraint can easily be violated by segment inversion. Though the insert decoder, which considers removing and reinserting a single node, performed well on small-scale PDPs, its performance drops significantly on larger-scale or highly constrained PDPs (see Section 5). Conversely, our N2S tackles the precedence constraint more efficiently by allowing a pair of pickup-delivery nodes to be operated simultaneously in the neighborhood search, through two customized removal and reinsertion decoders, as shown in Figure 1.

Another challenge lies in the design of the encoders. It was well revealed in Ma et al. (2021b) that the vanilla Transformer encoder Vaswani et al. (2017) fails to correctly encode route solutions, since the embeddings of node features (i.e., coordinates) and node positional features (i.e., node positions) involve two different aspects of a route solution that are not directly compatible during encoding. They thus proposed the dual-aspect collaborative attention (DAC-Att) to learn dual representations for each feature aspect. In this paper, we propose a simple yet powerful Synthesis Attention (Synth-Att), where the attention scores from various types of node feature embeddings are synthesized to attain a comprehensive representation. It not only has the potential to encode more aspects than DAC-Att, but also reduces the computation costs while preserving competitive performance.

Additionally, we design a diversity enhancement scheme to further improve performance. The proposed N2S approach was trained through reinforcement learning Ma et al. (2021b), and we evaluated it on two canonical problems in the PDP family, i.e., the pickup and delivery travelling salesman problem (PDTSP) and its variant with the last-in-first-out constraint (PDTSP-LIFO), to verify our design. Experimental results show that our N2S outperforms the state-of-the-art and becomes the first neural method to surpass the well-known LKH3 solver Helsgaun (2017) in solving the synthesized PDP instances.

The contributions of this paper are summarized as follows: 1) we present an efficient N2S approach for PDPs, which learns to perform removal and reinsertion of a pickup-delivery node pair automatically; 2) we propose Synth-Att, which allows the vanilla self-attention mechanism to synthesize node relationships from various feature embeddings in a simple way, and achieves superior expressiveness with much lower computation costs compared to DAC-Att; 3) we explore a diversity enhancement scheme, which further makes our N2S the first neural method with almost no domain knowledge to surpass the LKH3 solver on the synthesized PDP instances (when there is no significant distribution shift w.r.t. the training ones). All these advantages highlight the potential of N2S as a powerful neural solver for new PDP variants without much need for manual trial and error.

2 Related Work

Neural Methods for VRPs.

We classify recent neural methods into construction and improvement ones. Construction methods, e.g., Nazari et al. (2018); Joshi et al. (2019); Kool et al. (2018), learn a distribution over node selections to autoregressively build solutions from scratch. Despite being fast, they lack the ability to search for (near-)optimal solutions, even when armed with sampling (e.g., Kool et al. (2018)), local search (e.g., Kim et al. (2021)), or Monte-Carlo tree search (e.g., Fu et al. (2021)). Among them, POMO Kwon et al. (2020), which explored diverse rollouts and data augmentations, is recognized as the best construction method. Differently, improvement methods often hinge on a neighborhood search procedure, such as node swap in Chen and Tian (2019), ruin-and-repair in Hottung and Tierney (2020), and 2-opt in Wu et al. (2021). Ma et al. (2021b) extended the Transformer-styled model of Wu et al. (2021) to the Dual-Aspect Collaborative Transformer (DACT) and achieved state-of-the-art performance, which was also competitive with hybrid neural methods, e.g., those combined with differential evolution Hottung et al. (2021) and dynamic programming Kool et al. (2021). Despite the success of the above methods on CVRP or TSP, they have not been verified on precedence-constrained PDPs. Though Li et al. (2021b) made the first attempt to learn a construction solver for PDPs by introducing heterogeneous attention into Kool et al. (2018), the resulting solution quality is still far from optimality.

Neighbourhood Search for PDPs.

Various heuristics based on neighborhood search have been proposed for PDPs. For PDTSP, Savelsbergh (1990) studied the $k$-interchange neighborhood. A ruin-and-repair neighborhood was later proposed in Renaud et al. (2000) and further extended in Renaud et al. (2002) with multiple perturbation methods. Aside from PDTSP, other PDP variants have also been studied, normally solved by designing new problem-specific neighborhoods Parragh et al. (2008); e.g., Veenstra et al. (2017) proposed five neighborhoods to tackle the PDTSP with handling costs. Among them, PDTSP-LIFO attracts much attention due to its largely constrained search space. To solve it, Carrabs et al. (2007) introduced additional neighborhoods such as double-bridge and shake. In Li et al. (2011), neighborhoods with tree structures were further proposed. Different from the above PDP solvers, the well-known LKH3 solver Helsgaun (2017) combines neighborhood restriction strategies to achieve more effective local search and can solve various VRPs with superior performance. Recently, LKH3 was extended to tackle several PDP variants including PDTSP and PDTSP-LIFO, delivering performance comparable to Renaud et al. (2002) on PDTSP and to Li et al. (2011) on PDTSP-LIFO. Given also its open-source nature,2 we use LKH3 as the benchmark heuristic in this paper.

2 http://webhotel4.ruc.dk/~keld/research/LKH-3/

3 Problem Formulation

Figure 2: Two PDP solutions. (a) All goods are on top of the stack at their delivery nodes; (b) the goods from $1^{+}$ are blocked by $2^{+}$ at $1^{-}$.

3.1 Notations

We define the studied PDPs over a graph $G=(V,E)$, where nodes in $V=P\cup D\cup\{0\}$ represent locations and edges in $E$ represent trips between locations. With $n$ one-to-one pickup-delivery requests, a PDP instance contains $|V|=2n+1$ different locations, where node 0 is the depot, set $P=\{1^{+},2^{+},...,n^{+}\}$ contains the pickup nodes, and set $D=\{1^{-},2^{-},...,n^{-}\}$ contains the delivery nodes.3 Each pickup node $i^{+}$ has a number of goods to be transported to its delivery node $i^{-}$. The objective is to find the shortest Hamiltonian cycle that fulfills all requests. In this paper, we consider two representative PDP variants, i.e., PDTSP and PDTSP-LIFO. The solution $\delta$ is defined as a cyclic sequence $(x_{0},...,x_{2n+1})$, where $x_{0}$ and $x_{2n+1}$ are the depot and the rest is a permutation of the nodes in $P\cup D$. For PDTSP, such a permutation is under the precedence constraint, which requires each pickup $i^{+}$ to be visited before its delivery $i^{-}$. For PDTSP-LIFO, the last-in-first-out constraint is further imposed, which requires loading and unloading to be executed in the corresponding order. This implies that unloading at a delivery node is allowed if and only if the goods are at the top of the stack. In Figure 2, we present two example solutions with $n=3$ and $|V|=7$. Both solutions are feasible for PDTSP; however, solution $\delta_{2}$ in Figure 2(b) is infeasible for PDTSP-LIFO as the goods from $1^{+}$ are NOT at the top of the stack when they need to be delivered at $1^{-}$.

3 From here on, we also refer to $x_{0}$ as the depot, $\{x_{1},...,x_{n}\}$ as the pickup nodes, and $\{x_{n+1},...,x_{2n}\}$ as the delivery nodes.
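To make these constraints concrete, the following is a minimal sketch (ours, not the authors' code) of a stack-based feasibility check under the node indexing of the footnote above, i.e., pickups are $1,...,n$ and the delivery of pickup $i$ is $i+n$; the helper name and example routes are illustrative.

```python
def is_feasible(route, n, lifo=False):
    """Check precedence (PDTSP) and optionally LIFO (PDTSP-LIFO) feasibility.

    `route` is the cyclic sequence (x_0, ..., x_{2n+1}) with the depot (0)
    at both ends; pickups are 1..n and the delivery of pickup i is i + n.
    """
    picked, stack = set(), []
    for node in route[1:-1]:
        if node <= n:                  # pickup i+: load goods i onto the stack
            picked.add(node)
            stack.append(node)
        else:                          # delivery i-: unload goods i
            i = node - n
            if i not in picked:        # precedence constraint violated
                return False
            if lifo:
                if stack[-1] != i:     # goods i not on top: LIFO violated
                    return False
                stack.pop()
            else:
                stack.remove(i)
    return True

# Toy routes in the spirit of Figure 2 (n = 3): delta_2 visits 1+, 2+, 1-, ...
# so goods 1 are blocked by goods 2 and the LIFO constraint is violated.
delta_1 = [0, 1, 4, 2, 5, 3, 6, 0]
delta_2 = [0, 1, 2, 4, 5, 3, 6, 0]
assert is_feasible(delta_2, 3) and not is_feasible(delta_2, 3, lifo=True)
```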

3.2 MDP Formulations

We define the process of solving PDPs by our N2S as a Markov Decision Process $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma)$ as follows.

State $\mathcal{S}$. At time $t$, the state is defined to include: 1) features of the current solution $\delta_{t}$, 2) the action history, and 3) the objective value of the best incumbent solution, i.e.,

s_{t}=\left\{\{l(x)\}_{x\in V},\{p_{t}(x)\}_{x\in V},\mathcal{H}(t,K),f(\delta_{t}^{*})\right\},   (1)

where $\delta_{t}$ is described from two aspects following Ma et al. (2021b): $l(x)$ contains the 2-dim coordinates of node $x$ (i.e., node features) and $p_{t}(x)$ indicates the index position of $x$ in $\delta_{t}$ (i.e., node positional feature); $\mathcal{H}(t,K)$ stores the most recent $K$ actions at time $t$, if any; and $f(\cdot)$ denotes the objective function with $\delta_{t}^{*}=\operatorname{arg\,min}_{\delta_{t^{\prime}}\in\{\delta_{0},...,\delta_{t}\}}f(\delta_{t^{\prime}})$.

Action $\mathcal{A}$. With action $a_{t}=\{(i^{+},i^{-}),(j,k)\}$ where $j,k\in V\backslash\{i^{+},i^{-}\}$, the agent removes the node pair $(i^{+},i^{-})$, and then reinserts node $i^{+}$ and node $i^{-}$ after node $j$ and node $k$, respectively.

State Transition $\mathcal{T}$. We utilize a deterministic transition rule to perform $a_{t}$. An example is illustrated in Figure 1.

Reward $\mathcal{R}$. The reward function is defined as $r_{t}=f(\delta_{t}^{*})-\min\left[f(\delta_{t+1}),f(\delta_{t}^{*})\right]$, i.e., the immediate reduced cost w.r.t. $f(\delta_{t}^{*})$. The N2S agent aims to maximize the expected total reduced cost w.r.t. $\delta_{0}$ with a discount factor $\gamma<1$.
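To illustrate the action semantics and reward concretely, below is a minimal Python sketch under the tour-length objective; the function names are ours, and the feasibility of $(j,k)$ is assumed to be guaranteed by the decoder's masking (Section 4.2).

```python
import math

def f(route, coords):
    # objective: total length of the cyclic route (depot at both ends)
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))

def step(route, action):
    # a_t = {(i+, i-), (j, k)}: remove the pair, then reinsert i+ after j and
    # i- after k, where j, k refer to nodes of the route after the removal
    (pickup, delivery), (j, k) = action
    new = [x for x in route if x not in (pickup, delivery)]
    new.insert(new.index(k) + 1, delivery)  # i- goes right after k
    new.insert(new.index(j) + 1, pickup)    # i+ right after j (before i- if j == k)
    return new

def reward(best_cost, next_route, coords):
    # r_t = f(delta_t^*) - min(f(delta_{t+1}), f(delta_t^*)), always >= 0
    return best_cost - min(f(next_route, coords), best_cost)
```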

4 Methodology

4.1 Encoder and Synth-Att

Given state $s=\left\{\{l(x)\}_{x\in V},\{p(x)\}_{x\in V},\mathcal{H}(t,K),f(\delta^{*})\right\}$, the N2S encoder takes $\{l(x)\}_{x\in V}$ and $\{p(x)\}_{x\in V}$ as inputs4 to learn embeddings representing the current solution. Following DACT, we first project these raw features into two sets of embeddings, i.e., node feature embeddings (NFEs) $\{h_{i}\}_{i=0}^{|V|}$ and positional feature embeddings (PFEs) $\{g_{i}\}_{i=0}^{|V|}$. Different from Ma et al. (2021b), we treat NFEs as the primary set of embeddings and PFEs as auxiliary ones.

4 $\mathcal{H}(t,K)$ is input to the decoder and $f(\delta^{*})$ is input to the critic network (introduced later). We omit the time step $t$ here for readability.

NFEs.

We define $h_{i}$ as the linear projection of the node features $l(x_{i})$ for each $x_{i}\in V$, with output dimension $d_{h}=128$.

PFEs.

By extending the absolute positional encoding in the vanilla Transformer Vaswani et al. (2017), the cyclic positional encoding (CPE) was proposed in Ma et al. (2021b), which enables the Transformer to encode cyclic sequences (such as our PDP solutions) more accurately. The PFEs $g_{i}$ with output dimension $d_{g}=128$ are initialized by CPE as follows,

g_{i}^{(d)}=\begin{cases}\sin\left(\omega_{d}\cdot\left|(z(i)\bmod\frac{4\pi}{\omega_{d}})-\frac{2\pi}{\omega_{d}}\right|\right), & \text{if }d\text{ is even}\\ \cos\left(\omega_{d}\cdot\left|(z(i)\bmod\frac{4\pi}{\omega_{d}})-\frac{2\pi}{\omega_{d}}\right|\right), & \text{if }d\text{ is odd}\end{cases}   (2)

where the superscript $d$ of $g_{i}^{(d)}$ refers to the $d$-th dimension of $g_{i}$, and the scalar $z(i)$ as well as the angular frequency $\omega_{d}=\frac{2\pi}{T_{d}}$ are defined according to Eq. (3) and Eq. (4), respectively.

z(i)=\frac{i}{|V|}\cdot\frac{2\pi}{\omega_{d}}\left\lceil\frac{|V|}{2\pi/\omega_{d}}\right\rceil,   (3)

T_{d}=\begin{cases}\frac{3\lfloor d/3\rfloor+1}{d_{g}}\left(|V|-|V|^{\frac{1}{\lfloor d_{g}/2\rfloor}}\right)+|V|^{\frac{1}{\lfloor d_{g}/2\rfloor}}, & \text{if }d<\lfloor\frac{d_{g}}{2}\rfloor\\ |V|, & \text{otherwise}\end{cases}   (4)
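For reference, the numpy sketch below instantiates Eqs. (2)-(4) directly, using the identities $2\pi/\omega_{d}=T_{d}$ and $4\pi/\omega_{d}=2T_{d}$; it reflects our reading of the formulas rather than the released implementation.

```python
import numpy as np

def cpe(num_nodes, d_g=128):
    """Cyclic positional encoding: one d_g-dim PFE per index position."""
    G = np.zeros((num_nodes, d_g))
    for d in range(d_g):
        if d < d_g // 2:                              # period T_d, Eq. (4)
            base = num_nodes ** (1.0 / (d_g // 2))
            T = (3 * (d // 3) + 1) / d_g * (num_nodes - base) + base
        else:
            T = num_nodes
        for i in range(num_nodes):
            z = i / num_nodes * T * np.ceil(num_nodes / T)          # Eq. (3)
            x = (2 * np.pi / T) * np.abs(np.mod(z, 2 * T) - T)      # Eq. (2)
            G[i, d] = np.sin(x) if d % 2 == 0 else np.cos(x)
    return G

g = cpe(21)   # PFE initialization for a size-21 PDP solution
```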
Figure 3: Comparison of attention modules for VRPs: (a) DAC-Att; (b) our Synth-Att. The blue and orange squares represent self-attention score matrices.

According to Ma et al. (2021b), directly fusing the two sets of embeddings (i.e., $h_{i}+g_{i}$) may introduce undesired noise into the vanilla self-attention. As shown in Figure 3(a), they thus proposed DAC-Att, in which each embedding set independently computes attention scores and shares them with the other aspect to learn dual-aspect representations. Different from it, we propose a simple and generic mechanism by incorporating a multilayer perceptron (MLP). As shown in Figure 3(b), besides the original self-attention scores (the orange squares), multiple auxiliary attention scores learned from other feature embeddings (the blue squares) are leveraged and fed into an element-wise MLP, which synthesizes the heterogeneous attention relationships into comprehensive ones. We call it Synthesis Attention (Synth-Att). It is able to not only leverage more attention scores from various feature embeddings, but also achieve performance competitive with DAC-Att at lower computation costs. Below, we present more details.

Auxiliary Attention Scores.

In our N2S, PFEs are used to generate multi-head auxiliary attention scores as follows,

\alpha_{i,j,m}^{\text{aux}}=\frac{1}{\sqrt{d_{k}}}\left(g_{i}W^{Q_{\text{aux}}}_{m}\right)\left(g_{j}W^{K_{\text{aux}}}_{m}\right)^{T},   (5)

where $W^{Q_{\text{aux}}}_{m}\in\mathbb{R}^{d_{g}\times d_{q}}$ and $W^{K_{\text{aux}}}_{m}\in\mathbb{R}^{d_{g}\times d_{k}}$ are trainable matrices for each head $m$. We set $m=4$ and $d_{q}=d_{k}=d_{g}/m$.

Synth-Att.

Synth-Att is defined as follows,

\tilde{h}_{i}=\textbf{Synth-Att}(W^{Q},W^{K},W^{V},W^{O},\text{MLP}).   (6)

Specifically, it first computes the multi-head self-attention scores $\alpha_{i,j,m}^{\text{self}}$ for the NFEs based on the trainable matrices $W^{Q}_{m}\in\mathbb{R}^{d_{h}\times d_{q}}$ and $W^{K}_{m}\in\mathbb{R}^{d_{h}\times d_{k}}$ for head $m$ using Eq. (7),

\alpha_{i,j,m}^{\text{self}}=\frac{1}{\sqrt{d_{k}}}\left(h_{i}W^{Q}_{m}\right)\left(h_{j}W^{K}_{m}\right)^{T}.   (7)

Thereafter, the attention scores $\alpha^{\text{aux}}$ and $\alpha^{\text{self}}$ are fed into a three-layer MLP with structure ($2m\times 2m\times m$) to compute the synthesized multi-head attention scores as follows,

\alpha_{i,j,1}^{\text{Synth}},...,\alpha_{i,j,m}^{\text{Synth}}=\text{MLP}\left(\alpha_{i,j,1}^{\text{self}},...,\alpha_{i,j,m}^{\text{self}},\alpha_{i,j,1}^{\text{aux}},...,\alpha_{i,j,m}^{\text{aux}}\right).   (8)

The scores are then normalized to $\tilde{\alpha}_{i,j,m}$ through Softmax, which are further used to calculate the attention values for each head as in Eq. (9). Finally, the outputs are given by Eq. (10) with the trainable matrix $W^{O}\in\mathbb{R}^{md_{v}\times d_{h}}$ ($d_{v}=d_{h}/m$).

\text{head}_{i,m}=\sum_{j=1}^{|V|}\tilde{\alpha}_{i,j,m}\left(h_{j}W^{V}_{m}\right),   (9)

\tilde{h}_{i}=\text{Concat}\left[\text{head}_{i,1},...,\text{head}_{i,m}\right]W^{O}.   (10)
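Putting Eqs. (5)-(10) together, a condensed PyTorch sketch of one Synth-Att layer could look as follows; the class and argument names are ours, and the auxiliary scores of Eq. (5) are assumed to be precomputed from the PFEs and passed in, matching the sharing scheme noted in the next paragraph.

```python
import torch
import torch.nn as nn

class SynthAtt(nn.Module):
    """A sketch of multi-head Synth-Att with d_h = 128 and m = 4 heads."""
    def __init__(self, d_h=128, m=4):
        super().__init__()
        self.m, self.d_k = m, d_h // m
        self.Wq = nn.Linear(d_h, d_h, bias=False)   # W^Q (all heads stacked)
        self.Wk = nn.Linear(d_h, d_h, bias=False)   # W^K
        self.Wv = nn.Linear(d_h, d_h, bias=False)   # W^V
        self.Wo = nn.Linear(d_h, d_h, bias=False)   # W^O
        # element-wise MLP of structure (2m x 2m x m) synthesizing the scores, Eq. (8)
        self.mlp = nn.Sequential(nn.Linear(2 * m, 2 * m), nn.ReLU(),
                                 nn.Linear(2 * m, m))

    def forward(self, h, alpha_aux):
        # h: (B, |V|, d_h) NFEs; alpha_aux: (B, m, |V|, |V|) scores from Eq. (5)
        B, N, _ = h.shape
        q = self.Wq(h).view(B, N, self.m, self.d_k).transpose(1, 2)
        k = self.Wk(h).view(B, N, self.m, self.d_k).transpose(1, 2)
        v = self.Wv(h).view(B, N, self.m, self.d_k).transpose(1, 2)
        alpha_self = q @ k.transpose(-2, -1) / self.d_k ** 0.5      # Eq. (7)
        scores = torch.cat([alpha_self, alpha_aux], dim=1)          # 2m score maps
        synth = self.mlp(scores.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # Eq. (8)
        heads = torch.softmax(synth, dim=-1) @ v                    # Eq. (9)
        return self.Wo(heads.transpose(1, 2).reshape(B, N, -1))     # Eq. (10)
```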

N2S Encoder.

We stack $L$ ($L=3$) encoders, each of which is the same as the Transformer encoder, except that the vanilla multi-head self-attention is replaced with our multi-head Synth-Att and we use the same instance normalization layer as Ma et al. (2021b). Note that the auxiliary attention scores $\alpha_{i,j,m}^{\text{aux}}$ are computed only once and shared among all stacked encoders to reduce the computation costs.

4.2 Decoder

The N2S decoder first adopts the max-pooling layer in Wu et al. (2021) to aggregate the global representation of all embeddings into each individual one as follows,

\hat{h}_{i}=\tilde{h}_{i}^{(L)}W_{h}^{\textit{Local}}+\max\left[\{\tilde{h}_{i}^{(L)}\}_{i=1}^{|V|}\right]W_{h}^{\textit{Global}}.   (11)

Node-Pair Removal Decoder.

Given the enhanced embeddings $\{\hat{h}_{i}\}_{i=1}^{|V|}$ and the set $\mathcal{H}(t,K)$, the removal decoder outputs a categorical distribution over the $n$ requests for the removal action. Specifically, it first computes a score $\lambda_{i}$ for each $x_{i}\in V$ indicating the closeness between node $x_{i}$ and its neighbors,

\begin{split}\lambda_{i}=(\hat{h}_{\text{pred}(x_{i})}W^{Q}_{\lambda})(\hat{h}_{i}W^{K}_{\lambda})^{T}+(\hat{h}_{i}W^{Q}_{\lambda})(\hat{h}_{\text{succ}(x_{i})}W^{K}_{\lambda})^{T}\\-(\hat{h}_{\text{pred}(x_{i})}W^{Q}_{\lambda})(\hat{h}_{\text{succ}(x_{i})}W^{K}_{\lambda})^{T},\end{split}   (12)

where $\text{pred}(x_{i})$ and $\text{succ}(x_{i})$ refer to the predecessor and successor nodes of $x_{i}$, respectively, and $W^{Q}_{\lambda},W^{K}_{\lambda}\in\mathbb{R}^{d_{h}\times d_{h}}$. We use the multi-head technique to obtain $\lambda_{i,1}$ to $\lambda_{i,m}$. The decoder then aggregates the scores for each pickup-delivery pair $(i^{+},i^{-})$ via a three-layer MLP$_{\lambda}$,

\begin{split}\tilde{\Lambda}_{(i^{+},i^{-})}=\text{MLP}_{\lambda}(&\lambda_{i^{+},1},...,\lambda_{i^{+},m},\lambda_{i^{-},1},...,\lambda_{i^{-},m},\\&c(i),\mathds{1}_{\text{last}(1)=i},\mathds{1}_{\text{last}(2)=i},\mathds{1}_{\text{last}(3)=i}),\end{split}   (13)

where the MLP structure is ($2m+4$, 32, 32, 1), the scalar $c(i)$ counts the frequency of request $(i^{+},i^{-})$ being selected for removal in the past $K$ steps, and $\mathds{1}_{\text{last}(t)=i^{\prime}}$ is a binary variable indicating whether request $i^{\prime}$ was selected at the $t$-th last step. An activation layer $\hat{\Lambda}=C\cdot\text{Tanh}(\tilde{\Lambda})$ is then applied ($C=6$), followed by Softmax to normalize the distribution, from which a node pair $(i^{+},i^{-})$ is sampled as the removal action.
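For illustration, a single-head sketch of this removal decoder (the paper uses $m$ heads) could look as follows; the class name is ours, and the four history features are assumed to be precomputed per request.

```python
import torch
import torch.nn as nn

class RemovalDecoder(nn.Module):
    """A single-head sketch of Eqs. (12)-(13)."""
    def __init__(self, d_h=128, C=6):
        super().__init__()
        self.Wq = nn.Linear(d_h, d_h, bias=False)
        self.Wk = nn.Linear(d_h, d_h, bias=False)
        # per-pair MLP of structure (2*1 + 4, 32, 32, 1), i.e., m = 1 here
        self.mlp = nn.Sequential(nn.Linear(6, 32), nn.ReLU(),
                                 nn.Linear(32, 32), nn.ReLU(),
                                 nn.Linear(32, 1))
        self.C = C

    def forward(self, h_hat, pred, succ, hist):
        # h_hat: (|V|, d_h); pred/succ: (|V|,) neighbor indices; hist: (n, 4)
        q, k = self.Wq(h_hat), self.Wk(h_hat)
        lam = ((q[pred] * k).sum(-1) + (q * k[succ]).sum(-1)
               - (q[pred] * k[succ]).sum(-1))                       # Eq. (12)
        n = (h_hat.size(0) - 1) // 2
        feats = torch.cat([lam[1:n + 1, None], lam[n + 1:, None], hist], -1)
        logits = self.C * torch.tanh(self.mlp(feats)).squeeze(-1)   # C * Tanh
        return torch.softmax(logits, -1)    # distribution over the n requests
```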

Node-Pair Reinsertion Decoder.

Given the request $(i^{+},i^{-})$ selected for removal, the reinsertion decoder outputs the joint distribution for reinserting the two nodes back into the solution. We first define two scores, $\mu^{\text{p}}[x_{\alpha},x_{\beta}]$ and $\mu^{\text{s}}[x_{\alpha},x_{\beta}]$, indicating the degree of preference of node $x_{\alpha}$ for accepting node $x_{\beta}$ as its new predecessor and successor, respectively,

\begin{split}\mu^{\text{p}}\left[x_{\alpha},x_{\beta}\right]=(\hat{h}_{\alpha}W^{Q_{\text{p}}}_{\mu})(\hat{h}_{\beta}W^{K_{\text{p}}}_{\mu})^{T},\\ \mu^{\text{s}}\left[x_{\alpha},x_{\beta}\right]=(\hat{h}_{\alpha}W^{Q_{\text{s}}}_{\mu})(\hat{h}_{\beta}W^{K_{\text{s}}}_{\mu})^{T},\end{split}   (14)

where $W^{Q_{\text{p}}}_{\mu},W^{Q_{\text{s}}}_{\mu},W^{K_{\text{p}}}_{\mu},W^{K_{\text{s}}}_{\mu}\in\mathbb{R}^{d_{h}\times d_{h}}$. Again, we use multiple heads. Based on the scores, the decoder predicts the distribution of reinserting node $i^{+}$ after node $j$ and node $i^{-}$ after node $k$ using MLP$_{\mu}$,

\begin{split}\tilde{\mu}[j,k]=\text{MLP}_{\mu}(&\mu_{1}^{\text{p}}[i^{+},\text{succ}(j)],...,\mu_{m}^{\text{p}}[i^{+},\text{succ}(j)],\\&\mu_{1}^{\text{p}}[i^{-},\text{succ}(k)],...,\mu_{m}^{\text{p}}[i^{-},\text{succ}(k)],\\&\mu_{1}^{\text{s}}[i^{+},j],...,\mu_{m}^{\text{s}}[i^{+},j],\mu_{1}^{\text{s}}[i^{-},k],...,\mu_{m}^{\text{s}}[i^{-},k]),\end{split}   (15)

where the MLP structure is ($4m$, 32, 32, 1). Note that $\text{pred}(\cdot)$ and $\text{succ}(\cdot)$ here refer to the new (partial) solution from which nodes $i^{+},i^{-}$ have already been removed. Afterwards, $\hat{\mu}=C\cdot\text{Tanh}(\tilde{\mu})$ is applied, and infeasible choices are masked as $-\infty$ before normalization by Softmax. Finally, a node pair $(j,k)$ is sampled from the resulting distribution as the reinsertion action, indicating the positions at which the node pair $(i^{-},i^{+})$ is reinserted back into the solution.
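The masking step can be made concrete with a toy example: for PDTSP, a pair $(j,k)$ is precedence-feasible only if $i^{-}$ does not end up before $i^{+}$; the sketch below (ours; PDTSP-LIFO would require additional masking) builds such a mask and the joint distribution.

```python
import torch

def precedence_mask(positions):
    """positions: (|V'|,) tour positions of the nodes remaining after removal.
    Returns a (|V'|, |V'|) boolean mask, True where (j, k) keeps i+ before i-."""
    pos_j = positions[:, None]      # insertion position of i+ (right after j)
    pos_k = positions[None, :]      # insertion position of i- (right after k)
    return pos_k >= pos_j           # i- lands at or after i+

mu_hat = torch.randn(7, 7)          # toy C*tanh(MLP_mu) scores for |V'| = 7
mu_hat[~precedence_mask(torch.arange(7))] = float('-inf')
probs = torch.softmax(mu_hat.flatten(), 0).view(7, 7)           # joint P(j, k)
j, k = divmod(torch.multinomial(probs.flatten(), 1).item(), 7)  # sample (j, k)
```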

Algorithm 1 n-step PPO with CL strategy

Input: policy $\pi_{\theta}$, critic $v_{\phi}$, PPO clipping threshold $\varepsilon$, learning rates $\eta_{\theta}$, $\eta_{\phi}$, learning rate decay $\beta$, epochs $E$, batches $B$, mini-batch $\kappa$, training steps $T_{\text{train}}$, CL scalar $\rho^{CL}$

1:  for $e=1$ to $E$ do
2:     for $b=1$ to $B$ do
3:        Generate training data $\mathcal{D}_{b}$ on the fly;
4:        Initialize random solutions $\{\delta_{i}\}$ for $\mathcal{D}_{b}$;
5:        Improve $\{\delta_{i}\}$ to $\{\delta^{\prime}_{i}\}$ via $\pi_{\theta}$ for $T=e/\rho^{CL}$ steps;
6:        Set initial state $s_{0}$ based on $\{\delta^{\prime}_{i}\}$ and Eq. (1); $t\leftarrow 0$;
7:        while $t<T_{\text{train}}$ do
8:           Get $\{(s_{t^{\prime}},a_{t^{\prime}},r_{t^{\prime}})\}^{t+n}_{t^{\prime}=t}$ where $a_{t^{\prime}}\sim\pi_{\theta}(a_{t^{\prime}}|s_{t^{\prime}})$;
9:           $t\leftarrow t+n$, $\pi_{old}\leftarrow\pi_{\theta}$, $v_{old}\leftarrow v_{\phi}$;
10:          for $k=1$ to $\kappa$ do
11:             $\hat{R}_{t+1}=v_{\phi}(s_{t+1})$;
12:             for $t^{\prime}\in\{t,t-1,...,t-n\}$ do
13:                $\hat{R}_{t^{\prime}}\leftarrow r_{t^{\prime}}+\gamma\hat{R}_{t^{\prime}+1}$;
14:                $\hat{A}_{t^{\prime}}\leftarrow\hat{R}_{t^{\prime}}-v_{\phi}(s_{t^{\prime}})$;
15:             end for
16:             Compute the RL loss $J_{RL}(\theta)$ using Eq. (18) and the clipped critic loss $L_{BL}(\phi)$ using Eq. (19);
17:             $\theta\leftarrow\theta+\eta_{\theta}\nabla J_{RL}(\theta)$;
18:             $\phi\leftarrow\phi-\eta_{\phi}\nabla L_{BL}(\phi)$;
19:          end for
20:       end while
21:    end for
22:    $\eta_{\theta}\leftarrow\beta\eta_{\theta}$, $\eta_{\phi}\leftarrow\beta\eta_{\phi}$;
23: end for
Algorithm 2 N2S-A Inference

Input: instance $\mathcal{I}$ with size $|V|$, policy $\pi_{\theta}$, maximum step $T$

1:  for $i=1,...,\lfloor\frac{1}{2}|V|\rfloor$ do
2:     $\mathcal{I}_{i}\leftarrow\mathcal{I}$;
3:     $\mathcal{A}_{i}\leftarrow\textbf{RandomShuffle}([\text{flip-x-y},\text{1-x},\text{1-y},\text{rotate}])$;
4:     for each augment method $j\in\mathcal{A}_{i}$ do
5:        $\varrho_{j}\leftarrow\textbf{RandomConfig}(j)$;
6:        $\mathcal{I}_{i}\leftarrow$ perform augment $j$ on $\mathcal{I}_{i}$ with config $\varrho_{j}$;
7:     end for
8:  end for
9:  Solve all instances $\mathcal{I}_{i}$ in parallel with $\pi_{\theta}$ for $T$ steps;
10: return the best solution found among all $\mathcal{I}_{i}$

4.3 Training Algorithm

As presented in Algorithm 1, the training algorithm for N2S is the same as in Ma et al. (2021b), namely proximal policy optimization (PPO) with a curriculum learning (CL) strategy. It learns the policy $\pi_{\theta}$ (our N2S) with the help of a critic $v_{\phi}$ as follows.

Critic Network.

Given the embeddings $\{\tilde{h}_{i}^{(L)}\}_{i=0}^{|V|}$ from the policy network $\pi_{\theta}$, the critic network first enhances them by a vanilla multi-head attention layer (with $m=4$ heads) to obtain $\{y_{i}\}_{i=0}^{|V|}$. The enhanced embeddings are then fed into the mean-pooling layer of Wu et al. (2021) to aggregate the global representation of all embeddings into each individual one,

\hat{y}_{i}=y_{i}W_{v}^{\textit{Local}}+\text{mean}\left[\{y_{i}\}_{i=1}^{|V|}\right]W_{v}^{\textit{Global}},   (16)

where $W_{v}^{\textit{Local}},W_{v}^{\textit{Global}}\in\mathbb{R}^{d_{h}\times\frac{d_{h}}{2}}$ are trainable matrices. Lastly, the state value $v(s_{t})$ is output by a four-layer MLP$_{v}$ in Eq. (17) with structure (129, 128, 64, 1), where $f(\delta^{*})$ is added as an additional input.

v(s_{t})=\text{MLP}_{v}\left(\max\left[\{\hat{y}_{i}\}_{i=1}^{|V|}\right],\text{mean}\left[\{\hat{y}_{i}\}_{i=1}^{|V|}\right],f(\delta^{*})\right)   (17)

We train N2S for $E$ epochs with $B$ batches per epoch, where the training dataset $\mathcal{D}_{b}$ is randomly generated on the fly from the uniform distribution (line 3). Following Ma et al. (2021b), we adopt a similar CL strategy (lines 4 to 6) to further improve training. Specifically, it improves the randomly generated solutions $\{\delta_{i}\}$ to $\{\delta^{\prime}_{i}\}$ by running the current policy $\pi_{\theta}$ for $T=e/\rho^{CL}$ steps. The improved solutions $\{\delta^{\prime}_{i}\}$ of higher quality are then used to initialize the first state $s_{0}$. In this way, the hardness of the neighborhood search is gradually increased per epoch, following the idea of curriculum learning. The rest of the algorithm follows the original design of n-step PPO, with the reinforcement learning loss given in Eq. (18) and the baseline loss of the critic given in Eq. (19). Here, we clip the estimated value $\hat{v}(s_{t})$ around the previous value estimate using $v_{\phi}^{clip}(s_{t^{\prime}})=\text{clip}\left[v_{\phi}(s_{t^{\prime}}),v_{old}(s_{t^{\prime}})-\varepsilon,v_{old}(s_{t^{\prime}})+\varepsilon\right]$.

\begin{split}J_{RL}(\theta)=\frac{1}{n|\mathcal{D}_{b}|}\sum_{\mathcal{D}_{b}}\sum_{t^{\prime}=t}^{t+n}\min\left(\frac{\pi_{\theta}(a_{t^{\prime}}|s_{t^{\prime}})}{\pi_{old}(a_{t^{\prime}}|s_{t^{\prime}})}\hat{A}_{t^{\prime}},\right.\\\left.\text{clip}\left[\frac{\pi_{\theta}(a_{t^{\prime}}|s_{t^{\prime}})}{\pi_{old}(a_{t^{\prime}}|s_{t^{\prime}})},1-\varepsilon,1+\varepsilon\right]\hat{A}_{t^{\prime}}\right),\end{split}   (18)

\begin{split}L_{BL}(\phi)=\frac{1}{n|\mathcal{D}_{b}|}\sum_{\mathcal{D}_{b}}\sum_{t^{\prime}=t}^{t+n}\max\left(\left|v_{\phi}(s_{t^{\prime}})-\hat{R}_{t^{\prime}}\right|,\right.\\\left.\left|v_{\phi}^{clip}(s_{t^{\prime}})-\hat{R}_{t^{\prime}}\right|\right)^{2}.\end{split}   (19)
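In code, Eqs. (18)-(19) reduce to a few lines; the sketch below assumes flattened per-sample tensors of log-probabilities, advantages, value estimates, and returns, with names of our choosing.

```python
import torch

def ppo_losses(logp, logp_old, adv, values, values_old, returns, eps=0.1):
    ratio = torch.exp(logp - logp_old)
    # Eq. (18): clipped surrogate objective (to be maximized)
    j_rl = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
    # Eq. (19): critic loss with value estimates clipped around the old ones
    v_clip = values_old + torch.clamp(values - values_old, -eps, eps)
    l_bl = torch.max((values - returns).abs(),
                     (v_clip - returns).abs()).pow(2).mean()
    return j_rl, l_bl   # ascend j_rl for the actor, descend l_bl for the critic
```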

4.4 Diversity Enhancement

To be more resistant to local minima, we further equip our N2S with an augmentation-based inference scheme, which leads to N2S-A in Algorithm 2. Such an idea was originally explored in Kwon et al. (2020) for a neural construction method, and we extend it to an improvement one. The rationale is that an instance $\mathcal{I}$ can be transformed into different ones for searching while preserving the same optimal solution, e.g., by rotating all node locations by $\pi/2$ radians. For an instance of size $|V|$, our N2S-A performs $\lfloor\frac{1}{2}|V|\rfloor$ augments, each generated by sequentially applying the four preset invariant transformations with different orders and configurations, as listed in Table 1 (see the sketch after the table). Although these transformations are conducted on instances defined in the Euclidean space, we believe the idea also has favourable potential in non-Euclidean spaces, as long as proper invariant transformations exist for coordinates in the target space. Meanwhile, we use $K=|V|$ for training and $K^{\prime}=\lfloor\frac{1}{2}|V|\rfloor$ for inference, because we found that when a specific $K$ in $\mathcal{H}(t,K)$ is used for training, a smaller $K$ during inference improves the diversity of the solutions and thus the performance.

Transformations | Formulations | Configurations
flip-x-y | $(x^{\prime},y^{\prime})=(y,x)$ | perform or skip
1-x | $(x^{\prime},y^{\prime})=(1-x,y)$ | perform or skip
1-y | $(x^{\prime},y^{\prime})=(x,1-y)$ | perform or skip
rotate | $(x^{\prime},y^{\prime})=(x\cos\vartheta-y\sin\vartheta,\;x\sin\vartheta+y\cos\vartheta)$ | $\vartheta\in\{0,\pi/2,\pi,3\pi/2\}$
Table 1: Descriptions of the four invariant transformations.
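As referenced above, lines 3-7 of Algorithm 2 can be sketched as follows; we rotate about the instance center $(0.5,0.5)$ so that right-angle rotations keep coordinates inside the unit square, which we assume is the intended behavior.

```python
import math
import random

def augment(coords):
    """One random augmentation of a list of (x, y) coordinates in [0, 1]^2."""
    ops = ['flip-x-y', '1-x', '1-y', 'rotate']
    random.shuffle(ops)                      # random order (line 3 of Alg. 2)
    for op in ops:                           # random configs (lines 4-7)
        if op == 'rotate':
            t = random.choice([0, math.pi / 2, math.pi, 3 * math.pi / 2])
            coords = [(0.5 + (x - 0.5) * math.cos(t) - (y - 0.5) * math.sin(t),
                       0.5 + (x - 0.5) * math.sin(t) + (y - 0.5) * math.cos(t))
                      for x, y in coords]
        elif random.random() < 0.5:          # 'perform or skip'
            if op == 'flip-x-y':
                coords = [(y, x) for x, y in coords]
            elif op == '1-x':
                coords = [(1 - x, y) for x, y in coords]
            else:                            # '1-y'
                coords = [(x, 1 - y) for x, y in coords]
    return coords
```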

5 Evaluation

We design experiments to answer the following questions:

1. How good is the proposed N2S approach against the baselines, including the state-of-the-art neural methods and the strong LKH3 solver? (see Tables 3 and 4)

2. Can Synth-Att reduce computation costs while achieving competitive performance to DAC-Att? (see Table 5)

3. How crucial are the proposed learnable node-pair removal and node-pair reinsertion decoders for achieving an efficient search? (see Table 6)

4. Can our N2S generalize well to benchmark instances that differ from the training ones? (see Table 7)

5.1 Setup

We evaluate N2S on PDTSP and PDTSP-LIFO with three sizes, $|V|=21,51,101$, following the conventions in Kwon et al. (2020); Ma et al. (2021b), where the node coordinates of instances are randomly and uniformly generated in the unit square $[0,1]\times[0,1]$. For our N2S, the initial solution $\delta_{0}$ is sequentially constructed in a random fashion. Our experiments were conducted on a server equipped with 8 RTX 2080 Ti GPU cards and an Intel E5-2680 CPU @ 2.4GHz. The training time of our N2S varies with problem size, i.e., around 1 day for $|V|=21$, 3 days for $|V|=51$, and 7 days for $|V|=101$, all of which are shorter than those of the baselines in Table 2.

Hyper-parameters.

Our N2S is trained for $E=200$ epochs with $B=20$ batches per epoch using batch size 600. We set $n=5$ and $T_{\text{train}}=250$ for the n-step PPO with $\kappa=3$ mini-batch updates and a clipping threshold $\varepsilon=0.1$. The Adam optimizer is used with learning rate $\eta_{\theta}=8\times10^{-5}$ for $\pi_{\theta}$ and $\eta_{\phi}=2\times10^{-5}$ for $v_{\phi}$ (both decayed by $\beta=0.985$ per epoch). The discount factor $\gamma$ is set to 0.999 for both PDPs. We clip the gradient norm to be within 0.05, 0.15, 0.35, and set the curriculum learning scalar $\rho^{CL}$ to 2, 1.5, 1, for the three problem sizes, respectively.

5.2 Comparison Evaluation

We compare our N2S with the state-of-the-art (SOTA) neural methods and the highly-optimized LKH3 solver.

Regarding the former, we consider the SOTA improvement method DACT Ma et al. (2021b)5 and the SOTA construction method Heter-AM Li et al. (2021b) (specially designed for PDPs). To make a fair comparison with our N2S-A (with the diversity enhancement), we upgrade Heter-AM to Heter-POMO, given also the known superiority of POMO over AM. Specifically, we keep the policy network of Heter-AM as the backbone while adopting the diverse rollouts and data augmentation techniques of POMO Kwon et al. (2020) to leverage their advantages for the best performance. Each neural baseline is trained using its publicly available implementation code. For the upgraded Heter-POMO, we adapt and combine the model architectures of the original Heter-AM and the original POMO. The links to their original implementations, the approximate training time for size $|V|=101$, and the used hyper-parameters are presented in Table 2. For other hyper-parameters, we follow the recommendations in their papers. Regarding the latter, LKH3 is a strong heuristic (as reviewed in Section 2) which is widely used as a baseline to benchmark neural methods in recent studies (e.g., Ma et al. (2021b); Hottung et al. (2021); Li et al. (2021c); Wu et al. (2021)). We report its results with two settings of iterations, i.e., LKH (5k) and LKH (10k).

5 We use the insert decoder, which performs best for PDPs.

All baselines are evaluated on a test dataset with 2,000 instances, and we report the averaged objective values, the averaged gaps to LKH (10k), and the total solving time. Note that it is hard to perform an absolutely fair time comparison between running Python code on GPUs (neural methods) and running ANSI C code on CPUs (LKH solver). Thus we follow the guidelines in Accorsi et al. (2022) and let each method make full use of the best settings available on our machine. In particular, we report the time of LKH3 when run in parallel with 16 CPU cores and the time of each neural method when all 8 GPU cards are available (though they need not be fully used).

Method | Code | Training Time | Hyper-parameters
Heter-AM | online6 | ~20 days | train 800 epochs as per the original setting
Heter-POMO | online7 | ~14 days | train 2,000 epochs as per the original setting
DACT | online8 | ~10 days | $\xi^{CL}=0.25,1,4$ for sizes $|V|=21,51,101$, respectively; $n=5$, $T_{\text{train}}=250$ (same as ours); train 200 epochs (same as ours)
Table 2: Training details of the adopted neural baselines.

6 https://github.com/Demon0312/Heterogeneous-Attentions-PDP-DRL
7 https://github.com/yd-kwon/POMO
8 https://github.com/yining043/VRP-DACT
Methods | PDTSP-21 (Obj. Value, Gap to LKH, Total Time) | PDTSP-51 (Obj. Value, Gap to LKH, Total Time) | PDTSP-101 (Obj. Value, Gap to LKH, Total Time)
LKH (5k) 4.563 0.00% 3m 6.866 0.06% 10m 9.443 0.16% 49m
LKH (10k) 4.563 0.00% 5m 6.862 0.00% 19m 9.428 0.00% 98m
Heter-AM (gr.) 4.655 2.02% (0s) 7.333 6.86% (1s) 10.348 9.76% (2s)
Heter-AM (5k) 4.578 0.33% (33s) 7.108 3.58% (1.5m) 10.051 6.61% (5m)
DACT (1k) 4.572 0.20% (18s) 7.245 5.57% (29s) 10.551 11.91% (51s)
DACT (2k) 4.566 0.07% (37s) 7.118 3.72% (1m) 10.312 9.38% (1.5m)
DACT (3k) 4.564 0.03% (1m) 7.057 2.83% (1.5m) 10.195 8.13% (2.5m)
N2S (1k) 4.573 0.21% (21s) 7.103 3.51% (31s) 10.030 6.38% (1m)
N2S (2k) 4.567 0.09% (42s) 7.053 2.77% (1m) 9.905 5.06% (2m)
N2S (3k) 4.565 0.05% (1m) 7.027 2.40% (1.5m) 9.846 4.44% (3m)
Heter-POMO (gr.) 4.634 1.56% (0s) 7.168 4.45% (1s) 10.060 6.70% (2s)
Heter-POMO-A (gr.) 4.584 0.46% (1s) 6.995 1.93% (5s) 9.681 2.68% (11s)
Heter-POMO-A (3k) 4.564 0.03% (7m) 6.916 0.77% (32m) 9.567 1.47% (135m)
N2S-A (1k) 4.563 0.01% (1m) 6.865 0.03% (8m) 9.475 0.50% (40m)
N2S-A (2k) 4.563 0.00% (2m) 6.860 -0.03% (16m) 9.427 -0.01% (80m)
N2S-A (3k) 4.563 0.00% (3m) 6.860 -0.04% (24m) 9.409 -0.20% (121m)
Table 3: Comparison results for solving PDTSP instances of sizes $|V|=21$, 51, and 101.

Results on PDTSP.

Table 3 shows the results on PDTSP. In the first group, we compare N2S with Heter-AM (greedy and sampling) and DACT. Compared to Heter-AM (5k), our N2S with only 1k steps attains lower gaps in less time for all sizes. Although DACT offers the best gap on PDTSP-21, its performance drops significantly as the problem size increases, partly because its decoder is less efficient than our node-pair removal and reinsertion ones on larger-scale problems. Instead, our N2S achieves higher performance and consistently dominates DACT in terms of both gap and time on PDTSP-51 and PDTSP-101. In the second group, our augmented N2S-A is compared to the upgraded Heter-POMO method with three variants.9 Even with only 1k steps, our N2S-A attains a significantly smaller gap than all three Heter-POMO variants, by almost an order of magnitude. Although Heter-POMO-A (gr.) is competitive and fast, its gap is hard to reduce further by increasing the inference time, as seen from Heter-POMO-A (3k). Moreover, our N2S-A (2k) keeps abreast of, or even slightly exceeds, the strong LKH3 solver, achieving gaps of -0.03% and -0.01% in less time on PDTSP-51 and PDTSP-101, respectively. These gaps are further reduced to -0.04% and -0.20% with more steps, i.e., 3k.

9 Heter-POMO (gr.), Heter-POMO-A (gr.), and Heter-POMO-A (3k) refer to: greedily generating $|V|$ solutions; greedily generating $|V|$ solutions with 8 augments; and sampling $|V|\times$3k solutions with 8 augments, respectively.

Methods | PDTSP-LIFO-21 (Obj. Value, Gap to LKH, Total Time) | PDTSP-LIFO-51 (Obj. Value, Gap to LKH, Total Time) | PDTSP-LIFO-101 (Obj. Value, Gap to LKH, Total Time)
LKH (5k) 5.539 0.00% 1m 10.218 0.17% 8m 17.115 0.34% 33m
LKH (10k) 5.539 0.00% 3m 10.200 0.00% 16m 17.057 0.00% 67m
Heter-POMO (gr.) 5.636 1.75% (0s) 10.540 3.33% (1s) 17.583 3.08% (2s)
Heter-POMO-A (gr.) 5.567 0.51% (1s) 10.353 1.50% (5s) 17.276 1.29% (10s)
Heter-POMO-A (4k) 5.545 0.12% (10m) 10.209 0.09% (48m) 16.890 -0.98% (180m)
N2S-A (1k) 5.539 0.00% (3m) 10.144 -0.55% (13m) 16.976 -0.47% (54m)
N2S-A (2k) 5.539 0.00% (6m) 10.137 -0.62% (27m) 16.859 -1.16% (107m)
N2S-A (3k) 5.539 0.00% (9m) 10.135 -0.64% (40m) 16.806 -1.47% (161m)
Table 4: Comparison results for solving PDTSP-LIFO instances of sizes $|V|=21$, 51, and 101.

Results on PDTSP-LIFO.

In Table 4, we report the results on PDTSP-LIFO. Due to the more constrained search space, DACT fails to work well here (see Table 6). Therefore, we mainly compare our N2S-A with Heter-POMO (the best neural baseline in Table 3) and the LKH3 solver. As exhibited, the advantages of neural methods over LKH3 are further enhanced on this harder problem, and our approach consistently outperforms Heter-POMO for all sizes. Compared to LKH3, our N2S-A with 3k steps again presents superior performance, attaining gaps of -0.64% and -1.47% on PDTSP-LIFO-51 and PDTSP-LIFO-101, respectively.

5.3 Ablation Evaluation

Effects of Different Encoding Methods.

We replace the proposed Synth-Att in our N2S encoder with vanilla-Att and DAC-Att, respectively. Using only one GPU card, we report the number of model parameters, the time, and the gaps for solving 2,000 PDTSP-51 instances with 3k steps in Table 5. As exhibited, Synth-Att attains slightly smaller gaps than DAC-Att at much lower computation costs, and considerably lower gaps than vanilla-Att at only slight extra computation costs.

Att. in Encoders | Dim. | # Param. (M) | Time (s) | Gap (%)
Vanilla-Att | 64 | 0.18 (1.00×) | 239 (1.00×) | 4.66
DAC-Att | 64 | 0.31 (1.72×) | 285 (1.19×) | 2.89
Synth-Att | 64 | 0.19 (1.06×) | 255 (1.07×) | 2.88
Vanilla-Att | 128 | 0.72 (1.00×) | 322 (1.00×) | 3.92
DAC-Att | 128 | 1.25 (1.73×) | 400 (1.24×) | 2.43
Synth-Att | 128 | 0.76 (1.06×) | 340 (1.06×) | 2.40
Table 5: Effects of different encoding methods.

Effects of Different Decoding Methods.

Removal Decoder | Reinsertion Decoder | Gap(%) on PDTSP | Gap(%) on PDTSP-LIFO
✗ (random) | ✗ (random) | 210.82 | 112.37
✗ (random) | ✗ ($\epsilon$-greedy) | 18.03 | 17.87
✗ ($\epsilon$-greedy) | ✗ (random) | 86.31 | 47.57
✗ ($\epsilon$-greedy) | ✗ ($\epsilon$-greedy) | 15.62 | 12.64
✓ | ✗ (random) | 43.12 | 17.75
✓ | ✗ ($\epsilon$-greedy) | 3.54 | 3.88
✗ (random) | ✓ | 6.38 | 4.38
✗ ($\epsilon$-greedy) | ✓ | 8.42 | 6.28
✗ (DACT) | ✗ (DACT) | 2.83 | 11.73
✓ | ✓ | 2.40 | 0.64
Table 6: Effects of different decoding methods.

To highlight the desirability of our two decoders, we replace the trainable decoders with hand-crafted ones (i.e., random and $\epsilon$-greedy with $\epsilon=0.1$) while ensuring feasibility. We gather the results on PDTSP-51 and PDTSP-LIFO-51 with 3k steps in Table 6, where the mark '✓' means our proposed decoder is used (the one without augments) and the mark '✗' means the decoder in parentheses is used instead. As revealed, methods retaining at least one trainable decoder always attain much lower gaps than ones equipped with only hand-crafted decoders. The method with both trainable decoders (i.e., N2S) achieves the best performance. We also notice that DACT fails to solve PDTSP-LIFO well. This might be because its decoder removes and reinserts only one node instead of a pickup-delivery node pair in each action, which is a serious limitation for more constrained PDPs. For example, on PDTSP-LIFO, over 92% of the action space needs to be masked for DACT, which leads to extremely low efficiency.

Figure 4: Example instances from Renaud et al. (2002).
Figure 5: Example instances from Carrabs et al. (2007).

5.4 Generalization Evaluation

We further evaluate our N2S on benchmark instances, including all instances from Renaud et al. (2002) for PDTSP and those with size $|V|\leq 201$ from Carrabs et al. (2007) for PDTSP-LIFO, which differ considerably from our training instances, e.g., in size (i.e., 200 nodes) as shown in Figure 4 and in node distribution as shown in Figure 5. In Table 7, we report the best and average gaps (over 10 independent runs) achieved by N2S-A and the neural baseline Heter-POMO-A w.r.t. the optimal solutions for PDTSP, or the heuristic baselines B1 Carrabs et al. (2007) and B2 Li et al. (2011) for PDTSP-LIFO. It can be seen that our N2S significantly outstrips Heter-POMO in all cases. Without re-training or leveraging other techniques, this is fairly hard to achieve, because machine learning often suffers from mediocre out-of-distribution zero-shot generalization Zhou et al. (2021); Li et al. (2021a). The results also imply a slight inferiority of our N2S to the LKH3 solver, given that LKH3 reports performance similar to the B2 baseline on those instances Helsgaun (2017). Accordingly, we will focus on improving the out-of-distribution generalization of our N2S in future work.

Problem | $|V|$ | Gaps to | Heter-POMO-A (Avg. / Best) | N2S-A (Avg. / Best)
PDTSP | 101 | Opt. | 1.64% / 1.46% | 0.08% / 0.00%
PDTSP | 201 | Opt. | 9.71% / 8.66% | 2.82% / 2.19%
PDTSP-LIFO | ≤101 | B1 | 10.06% / 9.19% | 0.85% / -0.18%
PDTSP-LIFO | ≤101 | B2 | 11.46% / 10.58% | 2.11% / 1.06%
Table 7: Generalization performance on benchmark instances.

6 Conclusion

We present an efficient N2S approach for PDPs. It utilizes a novel Synth-Att to synthesize node relationships from various types of solution features, and exploits the node-pair removal and reinsertion decoders to tackle the precedence constraint. Extensive experiments on PDTSP and PDTSP-LIFO verified our design, where N2S achieves state-of-the-art performance among existing neural methods. Further equipped with a diversity enhancement scheme, it even becomes the first neural method to surpass LKH3 on synthesized PDP instances. In the future, we will 1) deploy N2S as a low-level agent in the hierarchical framework of Ma et al. (2021a) for dynamic PDPs, 2) combine N2S with a divide-and-conquer strategy similar to Li et al. (2021c) for much larger-scale instances, and 3) enhance the out-of-distribution generalization to exceed LKH3 on instances of arbitrary distributions.

Acknowledgments

This work was supported by the IEO Decentralised GAP project of Continuous Last-Mile Logistics (CLML) at SIMTech under Grant I22D1AG003. It was also supported in part by the National Natural Science Foundation of China under Grant 62102228, and in part by Guangdong Regional Joint Fund for Basic and Applied Research under Grant 2021B1515120078.

References

  • Accorsi et al. [2022] Luca Accorsi, Andrea Lodi, and Daniele Vigo. Guidelines for the computational testing of machine learning approaches to vehicle routing problems. Operations Research Letters, 50(2):229–234, 2022.
  • Carrabs et al. [2007] Francesco Carrabs, Jean-François Cordeau, and Gilbert Laporte. Variable neighborhood search for the pickup and delivery traveling salesman problem with LIFO loading. INFORMS Journal on Computing, 19(4):618–632, 2007.
  • Chen and Tian [2019] Xinyun Chen and Yuandong Tian. Learning to perform local rewriting for combinatorial optimization. In Advances in Neural Information Processing Systems, volume 32, pages 6281–6292, 2019.
  • Fu et al. [2021] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI Conference on Artificial Intelligence, 2021.
  • Helsgaun [2017] Keld Helsgaun. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Roskilde: Roskilde University, 2017.
  • Hottung and Tierney [2020] André Hottung and Kevin Tierney. Neural large neighborhood search for the capacitated vehicle routing problem. In European Conference on Artificial Intelligence, 2020.
  • Hottung et al. [2021] André Hottung, Bhanu Bhandari, and Kevin Tierney. Learning a latent search space for routing problems using variational autoencoders. In International Conference on Learning Representations, 2021.
  • Joshi et al. [2019] Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.
  • Kim et al. [2021] Minsu Kim, Jinkyoo Park, and Joungho Kim. Learning collaborative policies to solve NP-hard routing problems. In Advances in Neural Information Processing Systems, volume 34, pages 10418–10430, 2021.
  • Kool et al. [2018] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2018.
  • Kool et al. [2021] Wouter Kool, Herke van Hoof, Joaquim Gromicho, and Max Welling. Deep policy dynamic programming for vehicle routing problems. arXiv preprint arXiv:2102.11756, 2021.
  • Kwon et al. [2020] Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Youngjune Gwon, and Seungjai Min. POMO: Policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 21188–21198, 2020.
  • Li et al. [2011] Yongquan Li, Andrew Lim, Wee-Chong Oon, Hu Qin, and Dejian Tu. The tree representation for the pickup and delivery traveling salesman problem with LIFO loading. European Journal of Operational Research, 212(3):482–496, 2011.
  • Li et al. [2021a] Jingwen Li, Yining Ma, Ruize Gao, Zhiguang Cao, Andrew Lim, Wen Song, and Jie Zhang. Deep reinforcement learning for solving the heterogeneous capacitated vehicle routing problem. IEEE Transactions on Cybernetics, 2021.
  • Li et al. [2021b] Jingwen Li, Liang Xin, Zhiguang Cao, Andrew Lim, Wen Song, and Jie Zhang. Heterogeneous attentions for solving pickup and delivery problem via deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 2021.
  • Li et al. [2021c] Sirui Li, Zhongxia Yan, and Cathy Wu. Learning to delegate for large-scale vehicle routing. In Advances in Neural Information Processing Systems, volume 34, pages 26198–26211, 2021.
  • Ma et al. [2021a] Yi Ma, Xiaotian Hao, Jianye Hao, Jiawen Lu, Xing Liu, Tong Xialiang, Mingxuan Yuan, Zhigang Li, Jie Tang, and Zhaopeng Meng. A hierarchical reinforcement learning based optimization framework for large-scale dynamic pickup and delivery problems. In Advances in Neural Information Processing Systems, volume 34, pages 23609–23620, 2021.
  • Ma et al. [2021b] Yining Ma, Jingwen Li, Zhiguang Cao, Wen Song, Le Zhang, Zhenghua Chen, and Jing Tang. Learning to iteratively solve routing problems with dual-aspect collaborative Transformer. In Advances in Neural Information Processing Systems, volume 34, pages 11096–11107, 2021.
  • Nazari et al. [2018] Mohammadreza Nazari, Afshin Oroojlooy, Martin Takáč, and Lawrence V Snyder. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9861–9871, 2018.
  • Parragh et al. [2008] Sophie N Parragh, Karl F Doerner, and Richard F Hartl. A survey on pickup and delivery problems. Journal für Betriebswirtschaft, 58(2):81–117, 2008.
  • Renaud et al. [2000] Jacques Renaud, Fayez F Boctor, and Jamal Ouenniche. A heuristic for the pickup and delivery traveling salesman problem. Computers & Operations Research, 27(9):905–916, 2000.
  • Renaud et al. [2002] Jacques Renaud, Fayez F Boctor, and Gilbert Laporte. Perturbation heuristics for the pickup and delivery traveling salesman problem. Computers & Operations Research, 29(9):1129–1141, 2002.
  • Savelsbergh [1990] Martin WP Savelsbergh. An efficient implementation of local search algorithms for constrained routing problems. European Journal of Operational Research, 47(1):75–85, 1990.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 6000–6010, 2017.
  • Veenstra et al. [2017] Marjolein Veenstra, Kees Jan Roodbergen, Iris FA Vis, and Leandro C Coelho. The pickup and delivery traveling salesman problem with handling costs. European Journal of Operational Research, 257(1):118–132, 2017.
  • Wu et al. [2021] Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang, and Andrew Lim. Learning improvement heuristics for solving routing problems. IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • Zhou et al. [2021] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. arXiv preprint arXiv:2103.02503, 2021.

Appendix A Full Results of Generalization

We present more details of our generalization evaluation on benchmark datasets. For both Heter-POMO-A and N2S-A, the coordinates of instances are normalized to $[0,1]\times[0,1]$ as in training, and we adopt the "closest" model learned in Section 5 to infer the instances, e.g., models trained on PDP-21 are used to infer PDP-25 instances, and models trained on PDP-101 are used to infer instances with $|V|\geq 101$. We increase $C$ in our decoders during generalization since it boosts the performance. The full results of Table 7 are gathered in Table 8 (PDTSP) and Table 9 (PDTSP-LIFO). The gaps in Table 8 are computed w.r.t. the optimal solutions and the gaps in Table 9 are computed w.r.t. the heuristic baselines B1 Carrabs et al. [2007] and B2 Li et al. [2011]. In all cases, our N2S-A significantly outperforms Heter-POMO-A. Specifically, we achieve best gaps of 0.00% ($|V|=101$) and 2.19% ($|V|=201$) on PDTSP, while the gaps of Heter-POMO are 1.46% ($|V|=101$) and 8.66% ($|V|=201$); we achieve best gaps of -0.18% (w.r.t. B1) and 1.06% (w.r.t. B2) on PDTSP-LIFO, while the gaps of Heter-POMO are 9.19% (w.r.t. B1) and 10.58% (w.r.t. B2). This further reveals the desirable generalization of our N2S approach.

Instances | $|V|$ | Optimal | Heter-POMO-A (3k): Avg. Cost, Best Cost | N2S-A (3k): Avg. Cost, Best Cost | Heter-POMO-A (3k): Avg. Gap(%), Best Gap(%) | N2S-A (3k): Avg. Gap(%), Best Gap(%)
N101P1 101 799.0 824.0 820.0 800.0 799.0 3.13 2.63 0.13 0.00
N101P2 101 729.0 736.0 735.0 730.0 729.0 0.96 0.82 0.14 0.00
N101P3 101 748.0 751.0 751.0 748.0 748.0 0.40 0.40 0.00 0.00
N101P4 101 807.0 815.0 814.0 808.0 807.0 0.99 0.87 0.12 0.00
N101P5 101 783.0 794.0 791.0 783.0 783.0 1.40 1.02 0.00 0.00
N101P6 101 755.0 766.0 763.0 755.0 755.0 1.46 1.06 0.00 0.00
N101P7 101 767.0 787.0 787.0 768.0 767.0 2.61 2.61 0.13 0.00
N101P8 101 762.0 783.0 782.0 764.0 762.0 2.76 2.62 0.26 0.00
N101P9 101 766.0 766.0 766.0 766.0 766.0 0.00 0.00 0.00 0.00
N101P10 101 754.0 774.0 773.0 754.0 754.0 2.65 2.52 0.00 0.00
Average 767.0* 779.6 778.2 767.6 767.0* 1.64 1.46 0.08 0.00
N201P1 201 1039.0 1109.0 1095.0 1068.0 1059.0 6.74 5.39 2.79 1.92
N201P2 201 1086.0 1117.0 1114.0 1102.0 1099.0 2.85 2.58 1.47 1.20
N201P3 201 1070.0 1198.0 1177.0 1112.0 1107.0 11.96 10.00 3.93 3.46
N201P4 201 1050.0 1114.0 1108.0 1073.0 1066.0 6.10 5.52 2.19 1.52
N201P5 201 1052.0 1187.0 1169.0 1094.0 1078.0 12.83 11.12 3.99 2.47
N201P6 201 1059.0 1120.0 1111.0 1077.0 1072.0 5.76 4.91 1.70 1.23
N201P7 201 1037.0 1170.0 1164.0 1076.0 1065.0 12.83 12.25 3.76 2.70
N201P8 201 1079.0 1215.0 1196.0 1112.0 1108.0 12.60 10.84 3.06 2.69
N201P9 201 1050.0 1208.0 1198.0 1076.0 1073.0 15.05 14.10 2.48 2.19
N201P10 201 1085.0 1198.0 1192.0 1116.0 1112.0 10.41 9.86 2.86 2.49
Average 1060.7* 1163.6 1152.4 1090.6 1083.9 9.71 8.66 2.82 2.19
Table 8: Generalization performance on benchmark instances from Renaud et al. [2002] for PDTSP using the trained model in Section 5.
Instances | $|V|$ | B1 (2007) | B2 (2011) | Heter-POMO-A (3k): Avg. Cost, Best Cost | N2S-A (3k): Avg. Cost, Best Cost | Heter-POMO-A (3k): Avg. Gap to B1(%), Best Gap to B1(%) | N2S-A (3k): Avg. Gap to B1(%), Best Gap to B1(%) | Heter-POMO-A (3k): Avg. Gap to B2(%), Best Gap to B2(%) | N2S-A (3k): Avg. Gap to B2(%), Best Gap to B2(%)
brd14051 25 4682.2 4672.0 4705.0 4705.0 4680.0 4672.0 0.49 0.49 -0.05 -0.22 0.71 0.71 0.17 0.00
51 7763.2 7740.0 8276.0 8201.0 7948.0 7828.0 6.61 5.64 2.38 0.83 6.93 5.96 2.69 1.14
75 7309.1 7232.4 9059.0 8938.0 7775.0 7554.0 23.94 22.29 6.37 3.35 25.26 23.58 7.50 4.45
101 10005.2 9735.0 13315.0 13000.0 10539.0 10370.0 33.08 29.93 5.34 3.65 36.77 33.54 8.26 6.52
pr1002 25 16221.0 16221.0 16221.0 16221.0 16221.0 16221.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
51 31186.7 30936.0 30936.0 30936.0 30936.0 30936.0 -0.80 -0.80 -0.80 -0.80 0.00 0.00 0.00 0.00
75 46911.0 46673.0 47404.0 47202.0 47284 46923 1.05 0.62 0.80 0.03 1.57 1.13 1.31 0.54
101 63611.1 61433.0 62569.0 62565.0 62787.0 62353.0 -1.64 -1.64 -1.30 -1.98 1.85 1.84 2.20 1.50
fnl4461 25 2168.0 2168.0 2168.0 2168.0 2168.0 2168.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
51 4020.0 4020.0 4038.0 4037.0 4020.0 4020.0 0.45 0.42 0.00 0.00 0.45 0.42 0.00 0.00
75 5865.0 5739.0 6118.0 6038.0 5905.0 5763.0 4.31 2.95 0.68 -1.74 6.60 5.21 2.89 0.42
101 8852.8 8562.0 9186.0 9065.0 8827.0 8702.0 3.76 2.40 -0.29 -1.70 7.29 5.87 3.10 1.64
d18512 25 4683.4 4672.0 4707.0 4705.0 4680.0 4672.0 0.50 0.46 -0.07 -0.24 0.75 0.71 0.17 0.00
51 7565.6 7502.0 8215.0 8126.0 7696.0 7519.0 8.58 7.41 1.72 -0.62 9.50 8.32 2.59 0.23
75 8781.5 8629.0 10282.0 10215.0 8884.0 8802.0 17.09 16.32 1.17 0.23 19.16 18.38 2.96 2.00
101 10332.4 10256.4 13499.0 13218.0 10729.0 10555.0 30.65 27.93 3.84 2.15 31.62 28.88 4.61 2.91
d15112 25 93981.0 93981.0 93981.0 93981.0 93981.0 93981.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
51 143575.2 142113.0 144120.0 143716.0 142113.0 142113.0 0.38 0.10 -1.02 -1.02 1.41 1.13 0.00 0.00
75 201385.4 199047.8 206442.0 204944.0 200004.0 199076.0 2.51 1.77 -0.69 -1.15 3.71 2.96 0.48 0.01
101 276876.8 266925.3 272784.0 273693.0 269160.0 267305.0 -1.48 -1.15 -2.79 -3.46 2.19 2.54 0.84 0.14
nrw1379 25 3194.8 3192.0 3192.0 3192.0 3192.0 3192.0 -0.09 -0.09 -0.09 -0.09 0.00 0.00 0.00 0.00
51 5095.0 5055.0 6369.0 6517.0 5086.0 5056.0 25.00 27.91 -0.18 -0.77 25.99 28.92 0.61 0.02
75 6865.1 6831.0 9647.0 9272.0 7081.0 6960.0 40.52 35.06 3.14 1.38 41.22 35.73 3.66 1.89
101 10197.5 9889.4 13898.0 13592.0 10330.0 9996.0 36.29 33.29 1.30 -1.98 40.53 37.44 4.46 1.08
Average 42518.9 41740.6 43388.7 43263.3 42123.2 41893.3 10.06 9.19 0.85 -0.18 11.46 10.58 2.11 1.06
Table 9: Generalization performance on benchmark instances from Li et al. [2011] for PDTSP-LIFO using the trained model in Section 5.

Appendix B Dealing with Capacity Constraint

Our N2S is generic enough to handle the capacity constraint, similar to how DACT handles capacity in CVRP Ma et al. [2021b]. Specifically, we can 1) make copies of the depot (i.e., dummy depots) so that N2S can automatically search solutions with different numbers of vehicles; 2) add capacity/demand features to the NFEs; 3) mask out infeasible choices in the reinsertion decoder; and 4) use the diversity enhancement as usual, since it only affects the node coordinates. Nevertheless, the capacity constraint in PDPs might not be as crucial, since the vehicle may always alternately load and unload goods along the route.