
Patrol Security Game: Defending Against Adversary with Freedom in Attack Timing, Location, and Duration

Hao-Tsung Yang [email protected] 0000-0003-4463-1616 Ting-Kai Weng Ting-Yu Chang Department of Computer Science and Information Science, National Central UniversityTaiwan, R.O.C. Kin Sum Liu Department of Ads Quality, DoorDash inc.USA Shan Lin [email protected] Department of Electrical and Computer Engineering, Stony Brook UniversityUSA Jie Gao [email protected] 0000-0001-5083-6082 Department of Computer Science, Rutgers UniversityUSA  and  Shih-Yu Tsai [email protected] Department of Information Management and Finance, National Yang Ming Chiao Tung UniversityTaiwan, R.O.C.
Abstract.

We explored the Patrol Security Game (PSG), a robotic patrolling problem modeled as an extensive-form Stackelberg game, where the attacker determines the timing, location, and duration of their attack. Our objective is to devise a patrolling schedule with an infinite time horizon that minimizes the attacker’s payoff. We demonstrated that PSG can be transformed into a combinatorial minimax problem with a closed-form objective function. By constraining the defender’s strategy to a time-homogeneous first-order Markov chain (i.e., the patroller’s next move depends solely on their current location), we proved that the optimal solution in cases of zero penalty involves either minimizing the expected hitting time or return time, depending on the attacker model, and that these solutions can be computed efficiently. Additionally, we observed that increasing the randomness in the patrol schedule reduces the attacker’s expected payoff in high-penalty cases. However, the minimax problem becomes non-convex in other scenarios. To address this, we formulated a bi-criteria optimization problem incorporating two objectives: expected maximum reward and entropy. We proposed three graph-based algorithms and one deep reinforcement learning model, designed to efficiently balance the trade-off between these two objectives. Notably, the third algorithm can identify the optimal deterministic patrol schedule, though its runtime grows exponentially with the number of patrol spots.

Experimental results validate the effectiveness and scalability of our solutions, demonstrating that our approaches outperform state-of-the-art baselines on both synthetic and real-world crime datasets.

Stackelberg Game, Patrol Algorithm, Traveling Salesman Problem, Deep Reinforcement Learning
ccs: Theory of computation Random walks and Markov chains
ccs: Theory of computation Computational geometry
ccs: Computing methodologies Markov decision processes
ccs: Computing methodologies Stochastic games
ccs: Computing methodologies Neural networks
ccs: Computer systems organization Robotic autonomy

1. Introduction

Public safety is crucial for ensuring a thriving and harmonious society. In responding to criminal activities, it is essential to account for game-theoretic models and strategic behaviors, a core concept of Stackelberg security games (see (Sinha et al., 2018) for further detail). In this framework, the problem is modeled as a Stackelberg game, wherein a defender, with a limited set of resources, protects a set of targets, and an attacker plans attacks after observing the defender’s strategy. The goal is to compute a Stackelberg equilibrium—a mixed strategy for the defender that maximizes their utility, taking into account that the attacker is aware of this strategy and will respond optimally. This approach extends to cyber-physical systems, including the deployment of mobile robots for autonomous security enforcement  (Rubio et al., 2019). This domain is often referred to as patrolling security games or adversarial patrolling games (Gatti, 2008; Bucarey et al., 2021; Vorobeychik et al., 2014; Basilico et al., 2012; Agmon et al., 2008; Bošanskỳ et al., 2011; Basilico et al., 2017). These games are modeled as two-player, multi-stage interactions over an infinite time horizon, in which the defender controls a patroller moving between vertices on a graph to protect targets, while the attacker chooses when and where to launch an attack.

A common approach to analyzing or solving patrolling security games is to formulate them as mixed-integer linear programming problems and compute approximate optimal strategies for the defender. However, given the infinite time horizon in these games, the number of pure strategies is infinite. To address this, additional constraints are often imposed to reduce the strategy space. For instance, time constraints may be simplified by ignoring the time it takes for a patroller to move between locations, assuming that movement time is negligible compared to time spent guarding (Pita et al., 2008; Fang et al., 2013; Shieh et al., 2013; Basilico et al., 2012; Bucarey et al., 2021). Other works adopt specific attacker models, such as attackers that require a fixed period to execute an attack (Basilico et al., 2017), or models that introduce an exponential discount factor on the attacker’s utility (Vorobeychik et al., 2014). Despite these constraints, which limit the number of pure strategies, scalability remains a significant challenge due to the exponential growth of the strategy space (Nguyen et al., 2016).

Problem Statement We consider a generalization of the zero-sum patrolling security game (PSG) in which the attacker has the freedom to decide not only when and where to launch the attack but also how long the attack lasts, in order to maximize the expected payoff. The attacker's payoff is the utility acquired during the attack minus a penalty if the attacker is caught by the defender on patrol. To the best of our knowledge, this is the first work considering varying attack durations in the patrolling game. We consider three different attacker models, which differ in how much information the attacker can possibly gain by observing the patrol routes. The game is converted to a minimax problem with geometric properties. One main challenge is the exponential growth of the solution space caused by the varying attack durations. Furthermore, for general utility functions, the problem of finding the optimal defender strategy is not convex in general.

Our Contribution To tackle the problem, we first focus on a subset of defender strategies restricted to time-homogeneous first-order Markov chains. Finding the optimal defender strategy within this subset can be formulated as a closed-form minimax problem. In the special cases with zero penalty, the optimal solutions can be linked to minimizing the expected pairwise/average hitting time or return time, depending on the visibility model of the attacker. In high-penalty scenarios, increasing the entropy of the visiting times of each site helps reduce the attacker's expected payoff, since the attacker pays a high price if he is caught, even with only a small probability. Thus, a randomized patrol schedule with high visiting-time entropy is beneficial for decreasing the attacker's payoff. Based on these observations, we formulate a bi-criteria problem of balancing the attacker's expected maximum reward against the randomness of the patrol schedule, and use its solutions as defender strategies for the original game. This is the first work to consider randomized strategies in vehicle routing problems. We propose four algorithms: TSP-based solution (TSP-b), Biased random walk (Bwalk), Walk on state graph (SG), and Graph pointer network (GPN-b). The first two are related to TSP and random walk solutions. SG is a state machine mechanism. GPN-b follows the framework of deep reinforcement learning, where the model is a graph convolutional network equipped with an LSTM mechanism. All proposed algorithms can balance the two criteria through a parameter $\alpha$. In addition, SG can be used to find an optimal deterministic patrol schedule for the original game with any utility functions. Experiments show that the four algorithms adapt to various utility functions and penalties, and that TSP-b, Bwalk, and GPN-b scale well as the number of sites increases. Our solutions also outperform (achieving a lower expected payoff for the attacker) other baselines such as Markov chains of minimum hitting time (Patel et al., 2015) and Max-entropic Markov chains (George et al., 2018).

A preliminary version of this work was published in AAMAS 2019 under the title "Patrol scheduling against adversaries with varying attack durations" (Yang et al., 2019). This full version includes a newly proposed algorithm based on deep reinforcement learning and new, more detailed experiments to evaluate the performance of the solutions. The rest of the paper is organized as follows. Section 2 covers related work. Section 3 gives the formal definition of the patrol game. Section 4 discusses the optimal (mixed) strategy in the special cases when the strategy is restricted to a first-order Markov chain. Section 5 provides three graph-based algorithms and Section 6 proposes a deep reinforcement learning based algorithm for general cases. Section 7 presents the experiments and Section 8 concludes.

2. Related Work

2.1. Surveillance and Security Game

Patrolling and surveillance problems have been extensively studied in the fields of robotics (see the detailed survey by Basilico et al. (Basilico, 2022)) and operations research (Samanta et al., 2022). In non-strategic settings, algorithms are designed for traversing a specified region using centralized optimization to achieve specific objectives (Elmaliach et al., 2008; Portugal and Rocha, 2011; Iocchi et al., 2011; Stump and Michael, 2011; Liu et al., 2017). For example, Alamdari et al. (Alamdari et al., 2014) address the problem of minimizing the maximum duration between consecutive visits to a particular location, providing a $\log n$-approximation algorithm for general graphs. Subsequent research has expanded on these results for specific graph structures, such as chains and trees (Pasqualetti et al., 2012). Recently, this objective has been extended to multi-agent scenarios (Afshani et al., 2021, 2022).

In strategic settings, patrol strategies are designed to defend against intelligent intruders who seek to avoid detection. Consequently, many studies model patroller movements as Markov chains or random walks to introduce unpredictability into patrol routes (Grace and Baillieul, 2005; Patel et al., 2015; Duan et al., 2018; Cannata and Sgorbissa, 2011; Asghar and Smith, 2016), focusing on metrics such as efficiency or entropy. For instance, Patel et al. (Patel et al., 2015) investigate minimizing the first-passage time, while Duan et al. (Duan et al., 2018) examine maximizing the entropy of return times. Salih et al. (Çam, 2023) combine game theory with these approaches to estimate expected return times. These objectives can also be discussed in more advanced random walk settings (Branquinho et al., 2021; Dshalalow and White, 2021), corresponding to different defender models.

A more advanced strategic setting, which explicitly defines the interaction between defenders and attackers, is the Stackelberg security game. In this field, Kiekintveld et al. (Kiekintveld et al., 2009) introduced a general framework for security resource allocation, which has since been widely applied in various security domains with differing scenarios (Li et al., 2020; Agmon et al., 2011; Basilico et al., 2009; Vorobeychik et al., 2012; Abdallah et al., 2021). Notable applications include the deployment of randomized checkpoints and canine patrol routes at airports (Pita et al., 2008), deployment scheduling for U.S. Federal Air Marshals (Tsai et al., 2009; Janssen et al., 2020), and patrol schedules for U.S. Coast Guard operations (Fang et al., 2013; Shieh et al., 2012) and wildlife protection rangers (Yang et al., 2014; Kirkland et al., 2020). On the other hand, various adaptations of the security game model have been proposed to fit specific application scenarios by altering the utility functions and the dynamics of attacker-defender interactions. Vorobeychik et al. (Vorobeychik et al., 2014) introduced a discounted time factor in the attacker’s payoff function to account for the increased likelihood of detection over prolonged attack periods. Bošanský et al. (Bošanskỳ et al., 2011) explored scenarios where targets move according to deterministic schedules, while Huang et al. (Huang and Zhu, 2020) introduced a dynamic game framework to model the interaction between a stealthy attacker and a proactive defender in cyber-physical systems. Song et al. (Song et al., 2023) investigated security games involving multiple heterogeneous defenders.

Addressing the scalability of these models remains a significant challenge as suggested in Hunt et al. (Hunt and Zhuang, 2024). Current solutions are often tailored to specific applications. For example, in the ASPEN framework (Jain et al., 2010), multiple patrollers are deployed, with each patroller’s strategy solved independently to prevent combinatorial explosion in schedule allocation. This approach has been extended in the RUGGED system (Jain et al., 2011) for road network security. Shieh et al. (Shieh et al., 2013) built on previous work, utilizing the Traveling Salesman Problem (TSP) as a heuristic tool to order the search space, providing efficient heuristic solutions for each patroller. Basilico et al. (Basilico et al., 2012) assumed the attacker requires a fixed period to execute an attack and employed reduction techniques to address scalability, though this approach is limited to unweighted graphs where the attacker cannot control the attack duration. More recently, Wang et al. (Wang et al., 2020) applied graph convolutional networks to model attacker behavior, using randomized block updates to reduce time complexity. Wang et al. (Wang, 2020) also examined the trade-offs between linear and non-linear formulations, analyzing the impact of linearization on scalability. However, due to the specificity of these designs, these approaches cannot be directly applied to all problems.

2.2. TSP

The problem of planning patrol routes is related to the general family of vehicle routing problems (VRPs) and traveling salesman problems (TSPs) with constraints (Ausiello et al., 2000; Laporte, 1992; Pillac et al., 2013; Yu and Liu, 2014). The literature is vast, so we only introduce the most relevant papers.

TSP is a well-known NP-complete problem in combinatorial optimization and has been widely discussed in operations research (Angel et al., 1972; Kim and Park, 2004; Carter and Ragsdale, 2002). The Christofides algorithm (Christofides, 1976) provides a tour whose length is at most 1.5 times the minimum possible. Additionally, two independent papers by Mitchell and Arora provide polynomial-time approximation schemes (PTAS) for Euclidean TSP (Mitchell, 1999; Arora, 1996). There are many variations of TSP that consider multiple objectives (Bienstock et al., 1993; Awerbuch et al., 1995). In this work, one objective is to increase the randomness between generated tours. A closely related objective called “diversity” has recently been discussed for other combinatorial problems such as diverse vertex cover (Hanaka et al., 2021) and diverse spanning trees (Gao et al., 2022). However, to the best of our knowledge, TSP has not been studied in terms of diversity or tours with high randomness. The other objective is to minimize the maximum weighted latency among the sites of the tour, which has been discussed in several works (Alamdari et al., 2014; Min and Radzik, 2017; Afshani et al., 2021, 2022). One difference is that this work generalizes the “weighted latency” to functions rather than constant weights.

In recent years, the integration of deep learning into combinatorial optimization problems, including TSP, has seen significant advances. A pivotal development in this domain was introduced by Vinyals et al., who developed the Pointer Network (PN), a model that utilizes an attention mechanism to output a permutation of the input and is trained to solve TSP (Vinyals et al., 2015). Building on this, Bello et al. enhanced the PN by employing an Actor-Critic algorithm trained through reinforcement learning, further refining the network’s ability to optimize combinatorial structures (Bello et al., 2016). The application of PN was further extended by Kool et al., who incorporated a transformer-like structure to tackle not only TSP but also the Vehicle Routing Problem (VRP), the Orienteering Problem (OP), and a stochastic variant of the Prize Collecting TSP (PCTSP) (Kool et al., 2018). In addressing challenges associated with large graph sizes, Ma et al. (Ma et al., 2020) introduced a hierarchical structure that scales the model effectively, enabling it to manage and solve larger instances of combinatorial problems. Additionally, Hottung et al. (Hottung et al., [n. d.]) use a Variational Autoencoder (VAE) and explore the latent space for routing problems. Furthermore, Kim et al. (Kim et al., 2021) proposed a novel method for solving TSP that involves a seeding and revision process, which generates tours with an element of randomness and subsequently refines them to find superior routes.

3. Problem Definition

The patrol game is structured as a Stackelberg zero-sum game. That is, the defender executes a strategy first and the attacker chooses the best strategy based on the defender’s executed strategy. The attacker’s objective is to choose a strategy that maximizes his (expected) payoff and the defender’s objective is to choose a strategy that minimizes the attacker’s maximum expected payoff.

Mathematically, we are given a tuple $(G,H,M)$, where $G=(V,E,W)$ is a weighted graph with vertex set $V=\{1,2,\cdots,n\}$, edge set $E$, and edge-weight matrix $W$ representing the traveling costs. $M$ is the penalty cost ($M\geq 0$) and each vertex $j$ has a utility function $h_{j}\in H$. Time is discretized into time slots. The attacker can launch one attack and can decide where ($j$), when ($\tau$), and how long ($T$) the attack lasts. During the attack, at the $(\tau+t)$-th time slot the attacker collects a utility $h_{j}(t)$, where $1\leq t\leq T$. Note that the utility function can be node dependent. We assume that $h_{j}(t)\geq 0$ always.

If the attacker is caught by the defender at the $(\tau+t^{\prime})$-th time slot, the attacker pays a penalty $M$ and is forced to stop the attack. Thus, the total collected utility of the attacker is $\sum_{t=1}^{t^{\prime}}h_{j}(t)-M$. Otherwise, the total collected utility is $\sum_{t=1}^{T}h_{j}(t)$ if the attacker is not caught.
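To make the payoff definition concrete, the following minimal Python sketch computes the attacker's realized payoff for a single attack, given a utility function, the penalty, the planned duration, and the time slot (if any) at which the patroller catches him. The function name and interface are illustrative only, not part of the paper's formulation.

```python
def attack_payoff(h, M, T, catch_time=None):
    """Realized payoff of one attack at a fixed site.

    h          -- utility function of the attacked site: h(t) >= 0 for t = 1..T
    M          -- penalty paid if the attacker is caught (M >= 0)
    T          -- planned attack duration in time slots
    catch_time -- slot t' (1 <= t' <= T) at which the patroller catches the
                  attacker, or None if the attacker is never caught
    """
    if catch_time is None:
        # Attack runs to completion: collect h(1) + ... + h(T).
        return sum(h(t) for t in range(1, T + 1))
    # Caught at slot t': collect utility up to t', then pay the penalty M.
    return sum(h(t) for t in range(1, catch_time + 1)) - M


# Example: constant utility 2 per slot, penalty 10, planned duration 5.
h = lambda t: 2.0
print(attack_payoff(h, M=10.0, T=5))                # 10.0 (never caught)
print(attack_payoff(h, M=10.0, T=5, catch_time=2))  # -6.0 (caught early)
```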

Notice that in the adversarial patrolling games, it is possible that the attacker waits for a long time and acquires additional information such as when the patroller passes by. In the literature, there are different models which specify how much information the attacker can collect.

  • Full visibility: The attacker has a probe at each site that reports the position of the patroller whenever he arrives at any site during the game. This model is used in Patrolling Security Games (Basilico et al., 2012; Vorobeychik et al., 2014).

  • Local visibility: The attacker has to choose a site $j$ first and launches an attack right after the patroller leaves site $j$ (Asghar and Smith, 2016).

  • No visibility: The attacker cannot observe the patroller’s positions during the whole game. This is a common assumption in (An et al., 2012; Pita et al., 2008).

In all three attacker models, we assume the attacker knows the strategy used by the defender before the game starts.

4. Strategy with First-order Markov Chain

To tackle the problem, the defender’s strategy is restricted to a time-homogeneous first-order Markov chain (only in this section). That is, the patroller’s movement is modeled as a Markov process over graph $G$ with a transition matrix $P$, which is known by the attacker. Notice that any higher-order Markov chain can be “flattened” into a first-order one by standard methods (which take time exponential in the order of the Markov chain) (Basilico et al., 2012).

To calculate the attacker’s payoff we use the notion of the first visit matrix $F$ (Asghar and Smith, 2016), whose elements describe the first-visit probability distribution from a site $i$ to another site $j$. In detail, given graph $G$ and transition matrix $P$, the probability that the patroller, starting at $i$, takes $k$ slots to reach $j$ for the first time is given by

F_{k}(i,j)=\begin{cases}p_{ij}\mathbb{1}_{w_{ij}=k},&k=1\\ \sum_{h\neq j}p_{ih}F_{k-w_{ih}}(h,j)+p_{ij}\mathbb{1}_{w_{ij}=k},&k\geq 2,\end{cases}

where $w_{ij}$ is the travel cost from site $i$ to $j$ and $\mathbb{1}_{w_{ij}=k}$ is the indicator function which returns 1 if $w_{ij}=k$, and 0 otherwise. $F_{k}(i,j)=0$ when $k$ is non-positive. Accordingly, we define the expected hitting time matrix $A$, where each entry $a_{i,j}=\sum_{k=1}^{\infty}k\cdot F_{k}(i,j)$.
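As a concrete illustration, the recursion for $F_k(i,j)$ and the entries of the expected hitting time matrix $A$ can be evaluated numerically. The sketch below assumes unit edge weights ($w_{ij}=1$ on every edge), truncates the infinite sum at a hypothetical horizon K_max, and is not part of the paper's implementation.

```python
import numpy as np

def first_visit_probs(P, K_max=500):
    """F[k, i, j]: probability the first visit from i to j takes exactly k slots.

    Assumes unit edge weights, so F_1(i, j) = p_ij and
    F_k(i, j) = sum_{h != j} p_ih * F_{k-1}(h, j) for k >= 2.
    """
    n = P.shape[0]
    F = np.zeros((K_max + 1, n, n))
    F[1] = P
    for k in range(2, K_max + 1):
        for j in range(n):
            P_avoid = P.copy()
            P_avoid[:, j] = 0.0          # walks that have not reached j yet
            F[k, :, j] = P_avoid @ F[k - 1, :, j]
    return F

def expected_hitting_times(P, K_max=500):
    """A[i, j] ~= sum_k k * F_k(i, j), truncated at K_max slots."""
    F = first_visit_probs(P, K_max)
    ks = np.arange(K_max + 1).reshape(-1, 1, 1)
    return (ks * F).sum(axis=0)

# Example: a lazy random walk on a 4-cycle.
P = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])
print(np.round(expected_hitting_times(P), 2))
```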

4.1. Attacker has full visibility

In the model of full visibility, the attacker knows the exact position of the patroller among all sites. Denote by $Z_{i,j,T}$ the expected payoff if the attacker launches an attack at $j$ with attack period $T$ when the patroller is at $i$. In any time slot $t$ during the attack, where $1\leq t\leq T$, there are only 3 possible events: the patroller comes to site $j$ (after visiting $i$) in the period of time 1 to $t-1$, the patroller comes exactly at time $t$, or the patroller comes after time $t$. In the first case, the attacker cannot collect utility at time $t$ since the attack was forced to stop at some $t^{\prime}$, where $t^{\prime}<t$ (the penalty was also paid at time $t^{\prime}$). In the second case, the attacker is caught at time $t$, so the penalty $M$ is subtracted from the attacker’s payoff. In the third case, the attacker collects utility $h_{j}(t)$. Thus, the expected payoff at time $t$, $1\leq t\leq T$, can be expressed in a closed form in terms of $F$:

(1) z_{i,j}(t)=(h_{j}(t)-M)\cdot F_{t}(i,j)+h_{j}(t)\left(\sum_{k=t+1}^{\infty}F_{k}(i,j)\right).

The total (expected) payoff during the whole attack period is $Z_{i,j,T}=\sum_{t=1}^{T}z_{i,j}(t)$, which we call the payoff matrix. The attacker chooses an element of $Z$ with the highest payoff, which determines his strategy of when, where, and how long the attack lasts.
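A direct way to evaluate a candidate transition matrix is to tabulate $Z_{i,j,T}$ from Equation 1 and take its maximum entry, which is exactly the quantity the defender wants to minimize in the minimax problem below. The sketch reuses the hypothetical first_visit_probs helper from the earlier sketch, truncates both the attack period and the infinite tail at K_max, and is only meant to illustrate the computation.

```python
import numpy as np

def payoff_matrix(P, h, M, K_max=300):
    """Z[i, j, T-1] = expected attacker payoff for attacking j over T slots
    when the patroller currently sits at i (full-visibility model)."""
    n = P.shape[0]
    F = first_visit_probs(P, K_max)            # from the earlier sketch
    tail = F[::-1].cumsum(axis=0)[::-1]        # tail[t] = sum_{k >= t} F_k
    Z = np.zeros((n, n, K_max))
    for j in range(n):
        z = np.zeros((K_max, n))               # z[t-1, i] = z_{i,j}(t)
        for t in range(1, K_max + 1):
            after_t = tail[t + 1] if t + 1 <= K_max else np.zeros((n, n))
            z[t - 1] = (h[j](t) - M) * F[t, :, j] + h[j](t) * after_t[:, j]
        Z[:, j, :] = z.cumsum(axis=0).T        # Z_{i,j,T} = sum_{t<=T} z_{i,j}(t)
    return Z

def defender_objective(P, h, M, K_max=300):
    """max over (i, j, T) of Z_{i,j,T}, with T truncated at K_max."""
    return payoff_matrix(P, h, M, K_max).max()
```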

For the defender, the problem of choosing a best strategy can be formulated as a minimax problem:

\min_{P}f(P),\text{ where }f(P)=\max_{i,j,T}Z_{i,j,T}.

For a general utility function $h_{j}$ and penalty $M$, the Hessian matrix of $f$ is not guaranteed to be positive semidefinite, so $f(P)$ is not convex in general. However, in special cases $f(P)$ has a strong connection with the expected hitting time matrix $A$.

Observation 4.1. If $M=0$ and the utility functions are all constant functions, then $f(P)$ is either $\infty$ or the maximum weighted expected hitting time over all pairs $(i,j)$, where the weight for $(i,j)$ is the constant value of the utility function $h_{j}$.

Proof.

If the transition matrix $P$ is reducible, i.e., there exists a pair of vertices $i,j$ such that the patroller starting at $i$ would never visit site $j$, then the attacker can choose to attack $j$ for infinitely long. In this case $Z_{i,j,\infty}=\infty$.

Now, assume that the transition matrix is irreducible. Denote by $h_{j}$ the constant value of the utility function at site $j$. Given an attack period $T$ and $M=0$, Equation 1 allows $Z_{i,j,T}$ to be simplified as

(2) Z_{i,j,T}=h_{j}\cdot\sum_{k=1}^{T}k\cdot F_{k}(i,j)+h_{j}\cdot T\cdot\sum_{k=T+1}^{\infty}F_{k}(i,j).

Since $z_{i,j}(t)\geq 0$ for any $t$, choosing the period $T=\infty$ maximizes the attacker’s payoff. That is,

(3) f(P)=\max_{i,j}Z_{i,j,\infty}=\max_{i,j}h_{j}\cdot\sum_{k=1}^{\infty}k\cdot F_{k}(i,j)=\max_{i,j}h_{j}\cdot a_{i,j},

where $a_{i,j}$ is the expected first hitting time from $i$ to $j$. ∎

On the defender’s side, minimizing the maximum of all pairwise expected hitting times is, to the best of our knowledge, still an open question. A relevant work provides a lower bound and discusses the complications of this question (Breen and Kirkland, 2017).

4.2. Attacker has local visibility

In this model, assume the attacker’s strategy is to attack site $j$ with attack period $T$. Denote by $z^{\prime}_{j}(t)$ the utility he collects at each time $t$, where $1\leq t\leq T$:

(4) z^{\prime}_{j}(t)=z_{j,j}(t).

By a discussion similar to Observation 4.1, one can infer that if the utility functions are all constants and the penalty is zero, the best strategy for the attacker is to attack the site with the longest expected (weighted) return time. If all edges have weight one, the optimal defender strategy can be derived by constructing an ergodic Markov chain with stationary distribution $\pi^{*}$, where $\pi^{*}_{j}=\frac{h_{j}}{\sum_{i=1}^{n}h_{i}}$, since the expected return time of a site $j$ is $1/\pi^{*}_{j}$ (Serfozo, 2009).
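One standard way to realize a Markov chain on a graph with a prescribed stationary distribution is a Metropolis-Hastings random walk; the sketch below builds such a transition matrix with $\pi^{*}_{j}\propto h_{j}$. This is a generic construction offered for illustration, not necessarily the construction intended by the authors.

```python
import numpy as np

def metropolis_chain(adj, h):
    """Transition matrix of a Metropolis-Hastings walk on an undirected graph
    whose stationary distribution is proportional to the site utilities h."""
    n = len(h)
    pi = np.asarray(h, dtype=float)
    pi /= pi.sum()
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                # Propose j uniformly among neighbors, accept with the MH ratio.
                prop = 1.0 / deg[i]
                accept = min(1.0, (pi[j] * deg[i]) / (pi[i] * deg[j]))
                P[i, j] = prop * accept
        P[i, i] = 1.0 - P[i].sum()    # lazy self-loop absorbs the remaining mass
    return P

# Example: a 4-cycle with utilities h = (1, 2, 3, 4); check that pi P = pi.
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
P = metropolis_chain(adj, [1, 2, 3, 4])
```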

4.3. Attacker has no visibility

In this case, the attacker has no information about the patroller’s trace, so it is meaningless for the attacker to choose when to launch an attack; instead, the payoff of attacking site $j$ is the expected payoff when the patroller is either at a random site $i$ or travels on a random edge $(i,j)$. For the following analysis, we only consider attacks that start at a time when the patroller is at exactly one of the sites. For general cases, this underestimates the attacker’s expected payoff by at most $\max_{i,j}\sum_{t=1}^{w_{ij}}h_{j}(t)$ utility.

Denote by $Z^{\prime\prime}_{j,T}$ the cumulative expected payoff for attacking $j$ with period $T$ and by $z^{\prime\prime}_{j}(t)$ the expected payoff at time $t$. Assuming the attack is launched at a random time slot, $z^{\prime\prime}_{j}(t)$ is

z^{\prime\prime}_{j}(t)=\sum_{i=1}^{n}\pi_{i}\cdot z_{i,j}(t),

where $\pi$ is the stationary distribution of transition matrix $P$. Thus, the cumulative expected payoff is

(5) Z^{\prime\prime}_{j,T}=\sum_{t=1}^{T}z^{\prime\prime}_{j}(t)=\sum_{t=1}^{T}\sum_{i=1}^{n}\pi_{i}\cdot z_{i,j}(t)=\sum_{i=1}^{n}\pi_{i}Z_{i,j,T}.

Denote by $\kappa_{i}$ the Kemeny constant (Kemeny and Snell, 1960), the expected hitting time to a target drawn from the stationary distribution when the walk starts at $i$, i.e., $\kappa_{i}=\sum_{j=1}^{n}a_{i,j}\pi_{j}$. It is known that the Kemeny constant is independent of the start node (Kemeny and Snell, 1983). Thus, the Kemeny constant can also be written as

(6) \kappa=\sum_{i=1}^{n}\pi_{i}\sum_{j=1}^{n}a_{i,j}\pi_{j}.

Equation 6 can be written as an expression with matrix $A$:

(7) \kappa=\pi^{T}A\pi.

Now, let $f^{\prime\prime}(P)=\max_{j,T}Z^{\prime\prime}_{j,T}$ be the attacker’s maximum expected payoff; the following observation holds. If $M=0$ and the utility functions are all constant functions, $f^{\prime\prime}(P)$ is either $\infty$ or the Kemeny constant multiplied by the maximum constant among all utility functions.

Proof.

By the same argument as in Observation 4.1, $f^{\prime\prime}(P)$ goes to $\infty$ when the Markov chain is reducible. Now, consider an irreducible Markov chain; from Equation 5, we have

(8) f^{\prime\prime}(P)=\max_{j}\sum_{i=1}^{n}\pi_{i}Z_{i,j,\infty}=\max_{j}h_{j}\sum_{i=1}^{n}a_{i,j}\pi_{i}.

On the other hand, taking the transpose of both sides of Equation 7, we have

(9) \kappa=(\pi^{T}A\pi)^{T}=\pi^{T}A^{T}\pi.

Thus, $A$ and $A^{T}$ have the same Kemeny constant. The Kemeny constant of $A^{T}$ is exactly $\kappa_{j}=\sum_{i=1}^{n}a_{i,j}\pi_{i}$ for site $j$, which means

(10) f^{\prime\prime}(P)=\kappa\max_{j}h_{j}.

The observation above shows that when the penalty is zero and the utility functions are constant, the attacker’s best strategy is to attack the site with the highest utility. On the defender’s side, it has to determine $P$ such that the Kemeny constant is minimized. When all edges have weight 1, a simple solution is to construct $P$ as the adjacency matrix of a directed $n$-cycle in $G$ (Kirkland, 2010). In other cases, it has to minimize the Kemeny constant subject to a given stationary distribution (Patel et al., 2015).
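For completeness, the Kemeny constant of an irreducible chain can be computed directly from the stationary distribution and the hitting-time matrix, as in the sketch below; it reuses the hypothetical expected_hitting_times helper from the earlier sketch (a closed-form computation via the fundamental matrix would avoid the truncation).

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P with eigenvalue 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def kemeny_constant(P, K_max=500):
    """kappa = pi^T A pi (Equation 7), with A from the first-visit recursion."""
    pi = stationary_distribution(P)
    A = expected_hitting_times(P, K_max)   # truncated hitting-time sketch above
    return float(pi @ A @ pi)
```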

4.4. High penalty scenarios

When $M\gg h_{j}(t)$ for all sites $j$ and all times $t$, Equation 1 can be simplified as

z_{i,j}(t)=h_{j}(t)\left(\sum_{k=t+1}^{\infty}F_{k}(i,j)\right)-M\cdot F_{t}(i,j).

Assume that the attacker has full visibility and all utility functions are constants. Then

(11) f(P)=\max_{i,j,T}\left(h_{j}\cdot T-(M+1)\cdot\sum_{t=1}^{T}F_{t}(i,j)\right).

On the defender’s side, it is beneficial to increase $\sum_{t=1}^{T}F_{t}(i,j)$ for all pairs $(i,j)$. Thus, a schedule that is more random helps in this case. This observation also holds in the other two attacker models.

5. Graph-based Algorithmic Strategy

In the previous section, we showed that in special cases (e.g., when the attacker has no visibility, the penalty is zero, and the utility functions are all constants) the minimax problem of the zero-sum game is possibly solvable. In general, the optimization problem is not convex. Our solution for general cases is motivated by two observations. First, when the penalty is zero, the optimal schedule minimizes the expected (pairwise/average) hitting time or return time. Second, if the penalty is significant, it is better to increase the randomness of the patrol schedule to “scare” the attacker away. In fact, there are prior works emphasizing each of these as the objective of the patrol mission (George et al., 2018; Duan et al., 2018; Patel et al., 2015; Duan et al., 2021; Basilico, 2022) but, to the best of our knowledge, this is the first work to incorporate both objectives at the same time.

Specifically, we consider two optimization criteria: expected maximum reward (EMR) and entropy rate ($\mathcal{H_{r}}$). A patrol schedule $X=(X_{1},X_{2},\cdots)$ is given as a sequence of random variables and $(\omega_{1},\omega_{2},\cdots)$ is one of its possible realizations. Denote by $U_{j}=(u_{1},u_{2},\cdots)$ the sequence of times at which the patroller visits $j$, i.e., $\forall u_{r}\in U_{j},\omega_{u_{r}}=j$. Then, the maximum return time is

\phi_{j}=\max_{u_{r}\in U_{j}}\left\{\sum_{k=u_{r}}^{u_{r+1}}w_{\omega_{k}\omega_{k+1}}\right\}

and the maximum cumulative reward of $j$ is $\sum_{t=1}^{\phi_{j}}h_{j}(t)$.

Since $\{\omega\}$ comes from a randomized process, we define EMR as the expectation of the maximum (cumulative) reward over all sites:

\text{EMR}=\max_{j\in\{1,2,\cdots n\}}\mathbb{E}\left[\sum_{t=1}^{\phi_{j}}h_{j}(t)\right].

In the following paragraphs, EMR($X$) is used to emphasize the EMR value of schedule $X$. As a reminder, minimizing the maximum reward can be NP-hard since this problem has TSP as a special case.

On the other hand, the entropy rate quantifies the randomness of a schedule $X$. It is defined as follows:

\mathcal{H_{r}}(X)=\lim_{m\to\infty}\frac{\sum_{k=1}^{m}\mathcal{H}(X_{k})}{m},

where $\mathcal{H}$ is the entropy function in information theory (Jaynes, 1957).
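Both criteria can be estimated empirically from a Markov-chain schedule by simulation, as in the hedged sketch below (unit travel times, constant utilities, finite horizon); the entropy rate of a time-homogeneous chain is approximated by the stationary-weighted entropy of its transition rows.

```python
import numpy as np

def estimate_emr_and_entropy(P, h, steps=20000, seed=0):
    """Rough Monte-Carlo estimate of EMR (constant utilities h, unit travel
    times) and the entropy rate of the schedule with transition matrix P."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    state, last_visit = 0, np.zeros(n, dtype=int)
    max_gap = np.zeros(n, dtype=int)
    for t in range(1, steps + 1):
        state = rng.choice(n, p=P[state])
        max_gap = np.maximum(max_gap, t - last_visit)  # running max return time
        last_visit[state] = t
    emr = float(np.max(np.asarray(h) * max_gap))       # constant h: reward is linear in the gap

    # Entropy rate = sum_i pi_i * H(P[i, :]) for a stationary first-order chain.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi /= pi.sum()
    row_entropy = -(P * np.log2(P, where=P > 0, out=np.zeros_like(P))).sum(axis=1)
    return emr, float(pi @ row_entropy)
```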

In the following, three tractable algorithms are proposed: TSP-based solution (TSP-b), Biased Random Walk (Bwalk), and Walk on State Graph (SG). These algorithms balance the two criteria via a hyper-parameter $\alpha$. Intuitively, a higher value of $\alpha$ induces a schedule with lower EMR and lower entropy, i.e., the most “efficient” but least random tours. Notice that one can track the influence of $\alpha$ explicitly in these three algorithms, which makes them transparent and self-explainable.

5.1. TSP-based solution

The Algorithm TSP-based solution (TSP-b) perturbs the optimal (or approximately optimal) deterministic EMR solution using a parameter $\alpha$. Adjusting this skipping parameter $\alpha$ balances the two criteria. Roughly speaking, the main idea is to traverse a deterministic tour but visit each vertex only with probability $\alpha$ (i.e., with probability $1-\alpha$ it is skipped). Obviously, Algorithm TSP-b generates a randomized schedule. Also, since the algorithm works in a metric space (with the triangle inequality), the total travel distance after one round along the tour is bounded by the original tour length. Hence, the expected reward can be bounded. A minimal sketch of the schedule generator is given below.
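The sketch assumes a TSP tour is already given (any heuristic suffices); skipped vertices are bypassed directly, relying on the triangle inequality of the underlying metric. Names and interface are illustrative.

```python
import random

def tspb_schedule(tour, alpha, length, seed=None):
    """Randomized TSP-b schedule: walk along the fixed tour, but visit each
    vertex only with probability alpha; otherwise skip it and move on.

    tour   -- an (approximately) optimal TSP tour, e.g. [3, 0, 2, 1]
    alpha  -- visiting probability in (0, 1]
    length -- number of visited sites to output
    """
    rng = random.Random(seed)
    schedule, pos = [], 0
    while len(schedule) < length:
        site = tour[pos % len(tour)]
        if rng.random() < alpha:        # visit with probability alpha
            schedule.append(site)
        pos += 1                        # skipped sites are flown over (metric shortcut)
    return schedule

# Example: alpha near 1 reproduces the tour; small alpha yields a sparse, random visit order.
print(tspb_schedule([0, 1, 2, 3], alpha=0.7, length=12, seed=42))
```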

The following is the analysis of EMR and entropy rate for TSP-b when the utility functions are polynomial functions with maximum degree $d$.

5.1.1. Analysis of TSP-b with the uniform utility functions

When the utility functions at all sites are the same, Algorithm TSP-b first generates an approximate TSP tour $Q=\{q_{1},q_{2},\cdots q_{n}\},q_{i}\in\{1,2,\cdots n\}$ by, for example, a PTAS (Mitchell, 1999; Arora, 1998). Denote by $Y$ the randomized schedule perturbed by $\alpha$. Now, assume the site at an arbitrary index $k$ in the schedule is $i$, i.e., $Y_{k}=i$; without loss of generality, the tour $Q$ is shifted such that $q_{1}=i$. Thus, the probability that the next site to visit is $q_{j}$ is

\mathbb{P}(Y_{k+1}=q_{j}|Y_{k}=q_{1})=\begin{cases}\sum_{x=1}^{\infty}(1-\alpha)^{xn-1}\alpha&\text{if }j=1\\ \sum_{x=0}^{\infty}(1-\alpha)^{xn+j-2}\alpha&\text{if }j\in\{2,3,\cdots n\}.\end{cases}

Denote $\mathbb{P}(Y_{k+1}=q_{j}|Y_{k}=q_{1})$ by $\gamma_{j}$; then the entropy rate of $Y$ is

(12) \mathcal{H_{r}}(Y)=\sum_{j=1}^{n}\gamma_{j}\log\frac{1}{\gamma_{j}}.

On the other hand, to bound $\mathbb{E}[\phi_{i}]$ we mainly need to determine how many rounds the patroller tours around $Q$ before site $i$ is visited again (a round is defined as the number of time slots needed to tour $Q$). Suppose the time taken for $Q$ is $T(Q)$. By the triangle inequality, each such tour has length at most $T(Q)$. Define $\beta_{i}$ as the number of rounds traveled until $i$ is visited again. The distribution of $\beta_{i}$ is

\mathbb{P}(\beta_{i}=k)=(1-\alpha)^{k-1}\alpha.

Denote $\beta=\max_{i}\beta_{i}$. The probability distribution of $\beta$ is bounded by

\mathbb{P}(\beta\leq k)=\prod_{i}\mathbb{P}(\beta_{i}\leq k)=(1-(1-\alpha)^{k-1})^{n}.

The expected value of $\beta$ is

E[\beta]=\sum_{k=1}^{\infty}\mathbb{P}(\beta\geq k)=\sum_{k=1}^{\infty}1-\mathbb{P}(\beta\leq k-1)=\sum_{k=1}^{\infty}(1-(1-(1-\alpha)^{k-1})^{n}).

By tuning the probability $\alpha$, TSP-b obtains different bounds on EMR and the entropy rate $\mathcal{H_{r}}$. For a small $\alpha$, many sites are skipped, creating a schedule with high randomness, but EMR is also higher. On the other hand, for a large $\alpha$, the sites are visited more frequently, with a correspondingly reduced entropy rate. With some calculation, the analysis over $\alpha$ is summarized in Table 1. Remark that when $\alpha$ is sufficiently small ($\alpha<\frac{1}{n}$), TSP-b achieves maximum entropy, and when $\alpha$ is sufficiently large ($\alpha>\frac{n-1}{n}$), it provides a $(1+\frac{n}{n-1})^{d+1}$-approximation for EMR compared to the TSP tour $Q$, where $d$ is the maximum degree among all the utility functions. In between, when $\alpha$ is a constant in $(0,1)$, a constant entropy rate and roughly a $\log^{d+1}n$ extra factor on EMR are obtained.

$\alpha$: $\alpha<\frac{1}{n}$ | $\alpha=\Theta(1)$ | $\alpha>\frac{n-1}{n}$
EMR: $O(n^{d+1}\log^{d+1}n)$ | $O(\log^{d+1}n)$ | $O((1+\frac{n}{n-1})^{d+1})$
$\mathcal{H_{r}}$: $\Theta(\log n)$ | $\Theta(1)$ | $\frac{\log n}{n}$
Table 1. Summary of the analysis for TSP-b when all the utility functions are the same with maximum degree $d$ ($0<\alpha\leq 1$).

5.1.2. Analysis of TSP-b with non-uniform utility functions

In the case of non-uniform utility functions, TSP-b first generates a deterministic schedule by the Bamboo Garden Trimming (BGT) algorithm (Min and Radzik, 2017) and then perturbs it into a randomized schedule with $\alpha$. One can describe BGT as a vertex-weighted version of TSP: the objective is to output a schedule such that the maximum weighted visit time among all sites is minimized. For the input, the graph is set up as $G$ and each vertex $j$ has a weight $l_{j}$, which is the coefficient of the degree-$d$ term in $h_{j}$, where $d$ is the maximum degree among all sites. BGT divides the sites into groups such that the weight of each group is less than 2. Then, the patroller visits one group within a constant distance and switches to another until all sites are visited. In this way, it is not hard to see that the schedule generated by BGT gives an $O(\log^{d+1}n)$ approximation of EMR.

For the analysis of EMR in TSP-b, notice that when a certain site $i$ is skipped, the attacker can collect $O(\log^{d+1}n)$ additional utility if he attacks $i$. Thus, the expected reward of the attacker is $E[\beta_{i}]\cdot O(\log^{d+1}n)$, where $\beta_{i}-1$ is the number of times $i$ is skipped between two consecutive visits of $i$ in this randomized schedule. Following an analysis of $\beta_{i}$ similar to the case of identical utility functions, the bounds on EMR are the values in the second row of Table 1 multiplied by $O(\log^{d+1}n)$.

5.2. Biased Random Walk

Algorithm Biased Random Walk (Bwalk) uses a biased random walk to decide the patrol schedule. Define the matrix $W^{\prime}=(w^{\prime}(i,j))\in\mathbb{Z}^{n\times n}_{\geq 0}$. For each pair $(i,j)$,

w^{\prime}(i,j)=1/\alpha^{w(i,j)},\quad\alpha>1,

where $\alpha$ is an input parameter. Define the stochastic matrix $P^{\prime}$ as

P^{\prime}(i,j)=\begin{cases}\frac{w^{\prime}(i,j)}{\sum_{(i,j^{\prime})\in E}{w^{\prime}(i,j^{\prime})}}&\text{if }(i,j)\text{ is an edge}\\ 0&\text{otherwise.}\end{cases}
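A direct translation of these definitions into code (a sketch; the edge weights and adjacency are assumed given as matrices):

```python
import numpy as np

def biased_transition_matrix(W, adj, alpha):
    """Bwalk transition matrix P': w'(i,j) = 1 / alpha**w(i,j) on edges,
    normalized row-wise over the neighbors of i (alpha > 1)."""
    Wp = np.where(adj > 0, 1.0 / (alpha ** W), 0.0)   # biased edge weights
    row_sums = Wp.sum(axis=1, keepdims=True)
    return np.divide(Wp, row_sums, out=np.zeros_like(Wp), where=row_sums > 0)

# Example: a triangle with travel costs 1, 2, 3; larger alpha favors short edges.
W = np.array([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
adj = (W > 0).astype(int)
print(np.round(biased_transition_matrix(W, adj, alpha=2.0), 3))
```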

5.2.1. Analysis of Bwalk with same utility functions

In this case, Bwalk repeatedly generates a set of randomized tours $\{S_{1},S_{2},\cdots\}$. Each tour $S_{l}$ is an Euler tour traversing a randomized spanning tree $\Gamma_{l}$, where $\Gamma_{l}$ is generated by the biased random walk with transition probability $P^{\prime}$.

Let $(B_{k};k\geq 0)$ be the biased walk on $G$ with $B_{0}$ arbitrary. For each site $i$, let $\nu_{i}$ be the first hitting time:

\nu_{i}=\min\{k\geq 0:B_{k}=i\}.

From $(B_{k};k\geq 0)$, a randomized spanning tree $\Gamma$ can be constructed, consisting of the $n-1$ edges

(B_{\nu_{i}-1},B_{\nu_{i}});\;i\neq B_{0}.

Notice that the probability of generating a specific tree $\Gamma$ is proportional to the product of $w^{\prime}(i,j)$ over all edges $(i,j)\in\Gamma$ (Mosbah and Saheb, 1999). Thus, by controlling the input parameter $\alpha$, the two criteria can be balanced.
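The tree construction above keeps, for every site, the edge along which it was first entered by the walk; a minimal sketch of this rule and of the Euler tour on the resulting tree is given below (it reuses the hypothetical biased_transition_matrix helper).

```python
import numpy as np

def random_spanning_tree(P_biased, start=0, seed=None):
    """Sample a spanning tree by running the biased walk and recording, for each
    site i != start, the edge (B_{nu_i - 1}, B_{nu_i}) on which i is first entered."""
    rng = np.random.default_rng(seed)
    n = P_biased.shape[0]
    visited = {start}
    tree_edges = []
    current = start
    while len(visited) < n:
        nxt = int(rng.choice(n, p=P_biased[current]))
        if nxt not in visited:            # first entrance: keep the entering edge
            tree_edges.append((current, nxt))
            visited.add(nxt)
        current = nxt
    return tree_edges

def euler_tour(tree_edges, root=0):
    """Euler tour of the tree (DFS that records every arrival, so each tree
    edge is traversed twice); used as one randomized patrol round S_l."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    tour = []
    def dfs(u, parent):
        tour.append(u)
        for v in adj.get(u, []):
            if v != parent:
                dfs(v, u)
                tour.append(u)            # come back through u
    dfs(root, None)
    return tour
```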

Denote the schedule generated by Bwalk by $Y_{B}$. If $\alpha=1$ and graph $G$ is a complete graph, $S_{l}$ is actually a random permutation of the $n$ sites, which has entropy $\log n+\log(n-1)+\cdots+\log 1=\log(n!)=O(n\log n)$. Thus, the entropy rate of $Y_{B}$ is

\mathcal{H_{r}}(Y_{B})=\lim_{m\rightarrow\infty}\sum_{l=1}^{m}\frac{\mathcal{H}(S_{l})}{m}=\lim_{m\rightarrow\infty}\sum_{l=1}^{m}\frac{\frac{n\log n}{n}}{m}=O(\log n).

On the other hand, the expected reward is bounded by the expected time of a traversal of the uniform random spanning tree. Since each edge is traversed at most twice, the length of the tour is less than $2n\eta$, where $\eta$ is the maximum distance among all edges. Thus, the maximum payoff of the attacker is at most $\max_{j}\sum_{t=1}^{2n\eta}h_{j}(t)=O(n^{d+1})$, if the utility function is polynomial with maximum degree $d$.

In other cases where $\alpha>1$, the generated spanning tree is more likely to be a low-weight tree. Thus, the traversal distance is lower, which makes EMR lower. However, the entropy also becomes lower because the probability distribution over all generated spanning trees is more “biased”.

5.2.2. Analysis of Bwalk with different utility functions

When the utility functions are not all the same, Bwalk uses BGT (introduced in Section 5.1.2) as a backbone. That is, when the patroller visits the sites in each group within a constant distance, the tour he follows is not a deterministic tour but an Euler tour traversing a randomized spanning tree of the vertices in the group. Similar to the case of identical utility functions, the randomized tour in each group is regenerated every time the patroller has visited all sites in the group.

5.3. Walk on State Graph

Algorithm Walk on State Graph (SG), with a parameter $\alpha$, generates the schedule via a state machine whose transition process is another random walk.

5.3.1. Deterministic SG

One characteristic of deterministic SG is that it generates the optimal deterministic schedule for any utility functions and has the running time exponential in the number of sites.

Define $D$ as a state machine in which each state $x$ is an $(n+1)$-dimensional vector $x=(x_{1},x_{2},\cdots,x_{n},k_{x})$, where $x_{j}\in\mathbb{R},x_{j}\geq 0$ and $k_{x}\in\{1,\cdots,n\}$. Here $x_{i}$ represents the maximum utility the attacker could have collected since the last time the defender left site $i$. The last coordinate represents the defender’s current position.

States $x,y$ are said to have an arc from $x$ to $y$ if $y=(y_{1},y_{2},\cdots,y_{n},k_{y})$, where

y_{i}=\begin{cases}h_{i}(x_{i}+d(k_{x},k_{y})),&\text{if }i\neq k_{y}\\ 0,&\text{otherwise.}\end{cases}

$d(k_{x},k_{y})$ represents the time needed to travel from $k_{x}$ to $k_{y}$. An arc represents the change of state from $x$ to $y$ when the defender moves from $k_{x}$ to $k_{y}$.

Clearly, any periodic schedule of the defender can be represented as a cycle on the state machine defined above. Further, the state diagram captures all the information needed to decide on the next stop. Although there could be infinitely many states as defined above, only a finite number of them is needed. Basically, take a periodic schedule $S$ whose kernel is some traveling salesman tour $C$. Suppose the maximum utility of this schedule is $Z$. $Z$ is finite and is an upper bound on the optimal value. Thus, all states $x$ that have any current utility $x_{j}$ greater than $Z$ can be removed. This reduces the size of the state machine to at most $O(Z^{n})$.

Now we attach to each edge $(x,y)$ a weight equal to the maximum payoff among all coordinates of states $x,y$. That is,

w(x,y)=\max\{x_{1},\cdots x_{n},y_{1},\cdots y_{n}\}.

For any cycle/path in this state machine, define the bottleneck weight as the highest weight on the edges of the cycle/path. The optimal deterministic schedule is exactly the cycle of this state machine with the minimum bottleneck weight. To find this cycle, the first step is to find the minimum bottleneck path from any state $u$ to any state $v$ by the Floyd-Warshall algorithm, which takes time $O(|V|^{3})$, where $|V|$ is the number of vertices (states) in the state machine. The optimal tour is obtained by taking the cycle $u\rightsquigarrow v\rightsquigarrow u$ with the minimum bottleneck value over all possible $u,v$. The total running time is still bounded by $O(|V|^{3})$.
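The minimum-bottleneck cycle search can be sketched with a Floyd-Warshall-style recurrence in which path costs combine by max instead of sum; the snippet below is a generic illustration on an explicit arc-weight matrix (building the state machine itself is omitted).

```python
import numpy as np

def min_bottleneck_cycle(w):
    """w[x, y]: bottleneck weight of arc (x, y), np.inf if there is no arc.
    Returns the smallest achievable bottleneck value over all directed cycles."""
    n = w.shape[0]
    B = w.copy()                      # B[u, v]: min bottleneck over paths u ~> v
    for k in range(n):                # Floyd-Warshall on the (min, max) semiring
        B = np.minimum(B, np.maximum(B[:, k][:, None], B[k, :][None, :]))
    # After the loop, B[u, u] is the minimum bottleneck over all cycles through u.
    return float(np.min(np.diag(B)))
```

Recovering the optimal cycle itself (rather than just its bottleneck value) requires keeping predecessor pointers, as in standard Floyd-Warshall path reconstruction.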

5.3.2. Non-deterministic SG

Since the state graph records the utility that would be collected at each site from the historical trace at each state, we run a random walk on the state graph with a probability dependent on the utility of the state.

Each state is defined as in the aforementioned state machine $D$. From each state, the random walker can move to $\deg(k_{x})$ different states, where $\deg(k_{x})$ is the degree of site $k_{x}$ in $G$. The probability of moving from state $x$ to $y$ is proportional to

c_{x,y}=\min_{i\in\{1,2,\cdots,n\}}\frac{1}{y_{i}^{\alpha}},

where $\alpha$ is the given input parameter. That is, the transition probability from state $x$ to each possible $y$ is proportional to this edge weight:

\mathbb{P}(x,y)=\frac{c_{x,y}}{\sum\limits_{(x,w)\in E(D)}c_{x,w}},

where $E(D)$ is the edge set of $D$.

Although there are (in the worst case) exponentially many states with respect to the number of sites in the state graph, the probability of walking to each possible next state is determined by the local information $\{y_{1},y_{2},\cdots y_{n}\}$. Thus, the running time of the random walk depends only on the desired length of the output schedule.

6. Reinforcement Learning Strategy

Following the discussion in Section 4, the patrol game is formulated as a Markov decision process. It is natural to consider solving this game using deep reinforcement learning (DRL), as explored in works such as (Rupprecht and Wang, 2022). However, applying DRL directly to our scenario presents several challenges.

  (1) The existence of an infinite number of patrol strategies.

  (2) The lack of a closed-form solution for the attacker’s optimal strategy.

  (3) Delayed feedback on the attacker’s strategies.

The second challenge implies the need for an additional heuristic method to effectively model attacker behavior, complicating the framework (e.g., incorporating a GAN structure). These challenges also increase convergence difficulties (Lei et al., 2020; Chen et al., 2021) and, more critically, may result in the model becoming trapped in local minima, yielding suboptimal solutions. To address these challenges, we propose the Graph pointer network-based (GPN-b) model, which draws on the intuitions outlined in Section 5, focusing on two criteria: Expected Maximal Reward (EMR) and entropy. GPN-b seeks to solve the traveling salesman problem while incorporating randomness, using a hyper-parameter $\alpha$ to balance these criteria.

The learning process of GPN-b integrates several advanced deep learning techniques recently applied to NP-hard routing problems such as the TSP. The framework adopts a transformer-like architecture and utilizes a “rollout baseline” to smooth the training process (Kool et al., 2018; Kim et al., 2021). Since the model is an MDP, one can compute the policy’s distribution in each state and then derive the entropy of the generated tours. We add an additional loss term for randomness, based on the derived entropy, in the training phase so that the model learns to generate efficient tours while simultaneously increasing the entropy of its policy. On the architecture side, the model is based on the graph pointer network (Ma et al., 2020), an autoencoder design with a graph convolutional network (GCN) as the encoder, augmented by an LSTM. The decoder employs an attention mechanism, enabling adaptability to various graph structures (Vinyals et al., 2015). In the following sections, we give the details of our model design, the training methodology, and preliminary experimental results.

6.1. Model and Training

Given a graph, a tour of the graph can be treated as a rearrangement or permutation $\pi$ of the input nodes. We can formulate it as a Markov decision process (MDP):

(13) \mathbf{\pi}=(\pi_{0},...,\pi_{N-1}),\;\mathrm{where}\;\pi_{t}\in(0,...,N-1)\;\mathrm{and}\;\pi_{i}\neq\pi_{j}\;\mathrm{iff}\;i\neq j.

At each time step $t$, the state consists of the sequence of actions made from step 0 to $t-1$, while the action space at time step $t$ is the set of remaining unvisited nodes. Our objective is to minimize the tour length corresponding to the action sequence produced by the MDP, so we define the negative tour length as the cumulative reward $R$ for each solution sequence. Now we can define the policy $p(\pi|g)$, which generates a solution for instance $g$ and is parameterized by $\theta$, as below:

(14) p(\pi|g)=\prod_{t=0}^{N-1}p_{\theta}(\pi_{t}|g,\pi_{1:t-1}),

where $p_{\theta}(\pi_{t}|g,\pi_{1:t-1})$ provides the action $a_{t}$ at each time step $t$ on instance $g$. Overall, the model takes as input the coordinates of all sites $X$ and feeds them into the encoder. Then, the decoder outputs the desired policy (i.e., a site) step by step, recurrently, until the sequence includes all the sites. The model structure of the encoder and decoder is shown in Fig. 1, which is based on the work of Ma et al. (Ma et al., 2020).

Figure 1. The model structure
  (1) Encoder: The encoder consists of a Graph Convolutional Network (GCN) and a Long Short-Term Memory (LSTM) network. The GCN is responsible for processing the graph, encoding the node coordinates into the “context” $X^{l}$. Meanwhile, the LSTM handles the trail, processing the action sequence $\pi_{1:t-1}$ into a hidden variable $h_{t}$ at each time step $t$. Both the context and the hidden variable are passed to the decoder at the current step. Additionally, the hidden variable $h_{t}$ is passed to the encoder at the next time step to process the sequence $\pi_{1:t+1}$. The GCN layer can be described as below:

    (15) x_{i}^{l}=\gamma x_{i}^{l-1}\Theta+(1-\gamma)\phi_{\theta}\left(\frac{1}{\left|\mathcal{N}(i)\right|}\left\{x_{j}^{l-1}\right\}_{j\in\mathcal{N}(i)\cup\{i\}}\right),

    where $\gamma$ is a learnable weight, $\Theta$ denotes trainable parameters, and $\phi_{\theta}$ is the aggregation function over the adjacent nodes $\mathcal{N}(i)$ and node $x_{i}$.

  (2) Decoder: The decoder uses an attention mechanism to generate a pointer to an input node $x_{j}$ as the output action $a_{t}$, similar to pointer networks. The attention mechanism provides a pointer vector $u_{t}^{(j)}$ and passes it through a softmax layer to obtain a distribution over the candidate nodes, which can be sampled or chosen greedily as the output $a_{t}$:

    (16) u_{t}^{(j)}=\begin{cases}w^{\top}\tanh(W_{r}r_{j}+W_{q}q)&\text{if }j\neq\pi_{t^{\prime}},\;\forall t^{\prime}<t\\ -\infty&\text{otherwise.}\end{cases}

    The decoder masks a node in the attention mechanism if the candidate node $x_{j}\in\pi_{1:t-1}$. Here, $q$ represents the hidden variable $h_{t}$, and $r_{j}$ denotes the context $X_{j}^{l}$ from the encoder. Both $W_{r}$ and $W_{q}$ are trainable parameters, and $\pi_{t}$ can be expressed as follows:

    (17) \pi_{t}=a_{t}\sim\mathrm{softmax}(\mathbf{u}_{t})

To encourage the policy to sample more diverse solutions, we add an entropy constraint to the objective function.

Entropy: At each time step, the output of the solver provides a probability distribution over the candidate cities. Here, we define the entropy as below:

(18) \mathcal{H}_{p_{\theta}}=\sum_{t=1}^{N}\mathcal{H}(\pi_{t}\sim p_{\theta}(\pi_{t}|\pi_{1:t-1},g)).

The entropy of the policy effectively captures the randomness of the solution. However, computing the exact entropy of the policy $p_{\theta}(\mathbf{\pi}|g)$ is computationally expensive. Therefore, we approximate it by summing the entropy computed at each time step $t$.

Training: To train the solver, we use the REINFORCE algorithm with a rollout baseline $b$ (Ma et al., 2020; Kool et al., 2018). The gradient of the objective function can be described as follows:

(19) \nabla_{\theta}J(\theta|s)=E_{\pi\sim p_{\theta}}\left[(L(\pi|g)-b(g))\nabla\log(p_{\theta})-\alpha\nabla\mathcal{H}_{p_{\theta}}\right],

where $L(\pi|s)=\sum_{t=1}^{N-1}\left\|x_{\pi_{t+1}}-x_{\pi_{t}}\right\|_{2}+\left\|x_{\pi_{N}}-x_{\pi_{1}}\right\|_{2}$ and $\alpha$ is the weight parameter of the constraint, with the importance of randomness increasing as the value of $\alpha$ grows. We use the ADAM optimizer to obtain the optimal parameter $\theta$ that minimizes this objective. During training, we use the central self-critic baseline introduced by Ma et al. (Ma et al., 2020). The baseline $b(g)$ is expressed as

(20) b(g)=\sum_{t=1}^{N}\left(R(\tilde{\mathbf{s}}_{t},\tilde{\mathbf{a}}_{t})\right)+\sum_{t=1}^{N}\left(R(\mathbf{s}_{t},\mathbf{a}_{t})-R(\tilde{\mathbf{s}}_{t},\tilde{\mathbf{a}}_{t})\right),

where the action $\tilde{\mathbf{a}}_{t}$ is picked by the greedy policy $p_{\theta}^{Greedy}$, i.e., each action is the candidate node with the highest probability at each time step, and $\tilde{\mathbf{s}}_{t}$ is the corresponding state, i.e., the instance $g$ and $\tilde{\mathbf{\pi}}_{1:t-1}$ mentioned in Equation 14.
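A compact PyTorch-style sketch of this entropy-regularized REINFORCE update is given below; the policy network itself (the GPN encoder-decoder) is abstracted as any module that returns per-step log-probabilities and entropies for a sampled tour, so all names and interfaces here are illustrative assumptions.

```python
import torch

def reinforce_entropy_step(policy, baseline_policy, coords, alpha, optimizer):
    """One training step: REINFORCE with a greedy-rollout baseline and an
    entropy bonus weighted by alpha (larger alpha -> more random tours).

    policy(coords, greedy=...) is assumed to return
      tour      -- LongTensor [B, N], a permutation per instance
      log_probs -- Tensor [B, N], log p_theta(pi_t | ...) of the sampled steps
      entropies -- Tensor [B, N], entropy of each per-step distribution
    """
    tour, log_probs, entropies = policy(coords, greedy=False)
    with torch.no_grad():
        greedy_tour, _, _ = baseline_policy(coords, greedy=True)
        baseline_len = tour_length(coords, greedy_tour)      # b(g)
    sample_len = tour_length(coords, tour)                   # L(pi | g)

    advantage = sample_len - baseline_len                    # [B]
    loss = (advantage.detach() * log_probs.sum(dim=1)).mean() \
           - alpha * entropies.sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def tour_length(coords, tour):
    """Closed tour length for coords [B, N, 2] and tour indices [B, N]."""
    ordered = torch.gather(coords, 1, tour.unsqueeze(-1).expand(-1, -1, 2))
    rolled = torch.roll(ordered, shifts=-1, dims=1)
    return (ordered - rolled).norm(dim=2).sum(dim=1)
```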

6.1.1. Preliminary experiment

Before applying our model to the patrol problem, we conducted a preliminary experiment to observe the model’s edge usage and path length statistics on graphs (Fig. 2) for each setting of $\alpha$.

In these tests, all graphs were generated by sampling 10 sites from a uniform distribution within a unit square, and all graphs are complete graphs. Each model, with a different value of $\alpha$, is trained on these graphs for 20 epochs with a batch size of 512. Each epoch includes 2500 steps, which takes around 4 minutes on an NVIDIA RTX 4090 GPU. Thus, training one model takes around 80 minutes in total.

For the edge usage test, we generated a single graph and ran each model 1,000 times, recording the usage frequency of each edge. For the path length statistics, we generated 1,000 graphs as instances for the models and recorded the resulting path lengths. In this experiment, we can see the trade-off between total length and randomness at various values of $\alpha$. Figure 2 (a) shows that when $\alpha=1$, the distribution of used edges is highly skewed, and as $\alpha$ increases, the edge usage distribution becomes more uniform. Concurrently, Figure 2 (b) illustrates that the path length tends to increase with rising $\alpha$ values. Based on these observations, we can confidently assert that the model effectively adjusts the two conflicting criteria according to the setting of $\alpha$.

(a) The edge usage in a complete graph with 10 sites, with $\alpha=1,3,7,9$ from left to right. The x-axis represents the edge ID, and the y-axis represents the edge usage count.
(b) The tour length (y-axis) for models with different $\alpha$ (x-axis).
Figure 2. The preliminary experiment on total length and randomness

6.2. Incorporating BGT

For cases involving uniform utility weights, where the utility functions are consistent across all sites, directly applying the aforementioned model with an appropriate hyperparameter α\alpha yields the desired tours that effectively balance Expected Maximal Reward (EMR) and entropy. In scenarios with non-uniform utility weights, it is necessary to integrate the BGT algorithm (Min and Radzik, 2017) into the generated schedule. However, an intriguing observation about the BGT algorithm is that it visits each group, regardless of whether they have high or low weights, with equal frequency. Theoretically, this uniform visitation does not impact the analysis of the approximation factor, since the total number of groups is O(logn)O(\log n). Empirically, however, increasing the visitation frequency of high-weight sites can significantly enhance the overall EMR.

We present a method to carefully modify the BGT algorithm so that the overall EMR is empirically increased while preserving the O(logn)O(\log n) approximation factor. Notably, within the TSP-based solution procedure outlined in Section 5.1.2, modifying the BGT is unnecessary since the skipping parameter can be controlled concerning the weight coefficient of the sites (i.e., the skipping probability is inversely proportional to the site’s weight).

Algorithm 1 describes the DRL implementation that incorporates BGT. Let $G$ represent a graph instance. The graph's diameter, denoted $D$, is the longest distance between any two nodes in $G$. For a given node $j$, let $w_j$ be the coefficient of the highest-degree term in the utility function $h_j$ of node $j$, and let $w$ be the set of these coefficients over all nodes $j\in G$. Let $L$ denote the length of the sequence to be generated. The overall algorithm follows the standard BGT procedure (Min and Radzik, 2017), with exceptions at Lines 11 and 16-17. In Line 11, the generated tour is produced by our DRL model instead. In Lines 16-17, the visiting order of the weight groups $V_i$ is determined by the inorder traversal of a complete binary tree. Specifically, we construct a complete binary tree whose height equals the number of groups: the group with the lowest weight is placed at the root, the second lowest at the first level, and so forth, with every node on a level corresponding to the same group. The visiting order then follows the sequence of nodes encountered in the inorder traversal. See Figure 3 for an example.

Figure 3. An example of visiting groups $V_0,V_1,V_2$. The visiting order is the inorder traversal of the tree: $V_2,V_1,V_2,V_0,V_2,V_1,V_2$.
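The visiting order in Figure 3 can be generated compactly. The short sketch below assumes the groups are indexed from lowest weight ($V_0$, the root level) to highest weight (the leaf level); it is an illustration of the traversal, not the paper's implementation.

def group_visit_order(num_groups):
    # inorder traversal of a complete binary tree whose level k is filled with
    # group V_k; the leaf level (highest-weight group) is visited most often
    def inorder(level):
        if level == num_groups - 1:
            return [level]
        sub = inorder(level + 1)
        return sub + [level] + sub
    return inorder(0)

print([f"V{i}" for i in group_visit_order(3)])
# -> ['V2', 'V1', 'V2', 'V0', 'V2', 'V1', 'V2'], matching Figure 3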
Algorithm 1 DRL-BGT
1:  Input: $G, w, D, L$
2:  Output: $S$
3:  for each $v_k \in V$ do
4:     $w_k \leftarrow$ the normalized coefficient $w_k$
5:  end for
6:  $S \leftarrow \{\}$
7:  $s \leftarrow \lceil 2\log n \rceil$
8:  $V_0 \leftarrow \{v_i \in V \mid w_i \leq n^{-2}\}$
9:  for $i \in \{1,2,\ldots,s\}$ do
10:     $V_i \leftarrow \{v_j \in V \mid 2^{i-1}\cdot n^{-2} < w_j \leq 2^{i}\cdot n^{-2}\}$
11:     $T_i \leftarrow$ GPN-b($V_i$)
12:  end for
13:  Let $V_0 = \{v'_0, v'_1, \ldots, v'_{l-1}\}$
14:  $C \leftarrow$ a random permutation of $V_0$
15:  $C_{last} \leftarrow c_t$, where $t=0$
16:  $B \leftarrow$ the binary tree for the collection $\{V_i \mid V_i \neq \emptyset\}$
17:  $V_{order} \leftarrow$ the inorder-traversal order of the groups in $B$
18:  while length of $S < L$ do
19:     $i \leftarrow$ next index in $V_{order}$
20:     $S_i \leftarrow$ truncate $T_i$ such that the traveling time of $S_i$ is at most $D$
21:     Update the last visited point of $T_i$
22:     $S_i \leftarrow S_i + C_{last}$
23:     $C_{last} \leftarrow c_{t+1}$
24:     Append $S_i$ to $S$
25:  end while
26:  return $S$
Figure 4. The values of expected maximum reward (EMR) and entropy rate for the input parameter $\alpha=(1,4,7,9)$. TSP-b has the most efficient trade-off since it achieves the lowest EMR with the highest entropy rate.

7. Experiments

We evaluate the proposed algorithms, GPN-based (GPN-b), TSP-based (TSP-b), Biased random walk (Bwalk), and State graph walk (SG), against two baselines: Markov chains with minimal Kemeny constant (minKC) (Patel et al., 2015) and Markov chains with maximum entropy (maxEn) (George et al., 2018). The experiments are based on artificial datasets and the Denver crime dataset (City and of Denver, 2016) with three different attacker models: Full visibility (Full vis.), Local visibility (Local vis.), and No visibility (No vis.). There are four major observations.

(1) Our algorithms realize the trade-off between expected maximum reward (EMR) and entropy rate. For comparison, SG can achieve higher entropy, but TSP-b offers more freedom to control EMR and entropy rate via the parameter $\alpha$. GPN-b generates the most efficient tours since it can achieve low EMR at a given entropy level (Figure 4).

(2) For all algorithms, the attacker's (expected) payoff decreases as the penalty increases. For the same evaluation setup, the attacker's payoff is lowest under the no-visibility model and highest under full visibility. Roughly speaking, the proposed algorithms perform well when the utility function is not constant (Figures 6 and 7), while minKC has comparable performance when the utility function is constant.

(3) None of our four proposed algorithms dominates the others. In general, SG performs best when the penalty is high under full and local visibility, while GPN-b performs well for non-constant utility functions.

(4) GPN-b, TSP-b, and Bwalk scale well as the number of sites increases. One reason is that these algorithms perturb tours from TSP/BGT, which are more delicately designed routes (Figure 15).

The patroller takes one time slot to travel one unit of length. In the experiments, all sites are randomly generated in a $1000\times 1000$ square. Unless otherwise specified, the number of sites in a setup is 30. We use minKC and maxEn as baselines, each subject to constraints dictated by the stationary distribution. To stay consistent with the utility function structures, the stationary distribution at each location $j$ is set proportional to $b_j$, the coefficient of the highest-degree term in the utility function $h_j$ of that location. For the Denver dataset, the geographic range covers Denver City only, which has 77 neighborhoods. In the general case, the coefficients of the utility functions are generated uniformly at random within the range $0.001$ to $1$. In the practical application, we use the Denver Crime Dataset to determine these coefficients based on the frequency of various types of crimes across neighborhoods.

In the game, the defender's strategy is formed by the patrol schedule, with each proposed solution generating a distinct strategy. The attacker collects a payoff, denoted $Z$, by targeting a specific site $i$ from start time $t_s$ to end time $t_e$. We empirically estimate the expected payoff for each possible target site $i$ over all feasible attack periods $(t_s, t_e)$ under each attacker model, and extract the maximum expected payoff for the attacker. To manage the high raw values of these payoffs, we normalize them by dividing by $\zeta$, the attacker's payoff against the base tour generated by BGT.
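The sketch below illustrates this empirical procedure for a no-visibility attacker under a simplified payoff rule (a reward if the attack window is never patrolled, a penalty otherwise); the paper's exact payoff definition and the other attacker models are not reproduced here, and all names are illustrative.

def max_expected_payoff(schedules, site, horizon, reward_fn, penalty):
    # schedules: sampled patrol schedules, each a list of visited sites per time slot
    # returns the largest empirical expected payoff over all attack windows on `site`
    best = float("-inf")
    for ts in range(horizon):
        for te in range(ts, horizon):
            payoffs = []
            for sched in schedules:
                caught = any(sched[t] == site for t in range(ts, te + 1))
                payoffs.append(-penalty if caught else reward_fn(te - ts + 1))
            best = max(best, sum(payoffs) / len(payoffs))
    return best

# toy usage with two sampled schedules and a linear reward in the attack duration
schedules = [[0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 2, 1]]
print(max_expected_payoff(schedules, site=1, horizon=6,
                          reward_fn=lambda d: 0.5 * d, penalty=1.0))
# normalization sketch: divide by zeta, the payoff obtained against the BGT base tour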

In each experiment, each bi-criteria algorithm generates around 8 to 10 schedules based on different values of the parameter $\alpha$. The values of $\alpha$ are generated uniformly within the following domains: GPN-b: $[1,10]$, TSP-b: $[0.1,1]$, Bwalk: $[1,1.5]$, SG: $[0,80]$. GPN-b is the same model as in Section 6.1.1 without any further training. Generally speaking, increasing the number of $\alpha$ values improves the performance of an algorithm but takes more computation time, which constitutes a performance-complexity trade-off.
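A small configuration sketch of these per-algorithm $\alpha$ grids is shown below, using nine evenly spaced values per algorithm; the experiments may instead draw the values uniformly at random within the same ranges.

import numpy as np

ALPHA_RANGES = {"GPN-b": (1.0, 10.0), "TSP-b": (0.1, 1.0),
                "Bwalk": (1.0, 1.5), "SG": (0.0, 80.0)}
alpha_grids = {name: np.linspace(lo, hi, num=9).tolist()
               for name, (lo, hi) in ALPHA_RANGES.items()}
print(alpha_grids["TSP-b"])  # nine values between 0.1 and 1.0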

7.1. EMR vs. entropy rate

Figure 4 reports the performance of the algorithms in terms of expected maximum reward and entropy rate. In this experiment, GPN-b has an additional version, GPN-b:I, that incorporates BGT with the inorder-traversal method (see Section 6.2). On the y-axis, EMR is scaled so that the maximum reward generated by BGT equals 1. Each point represents a schedule generated by one of the algorithms, and the number beside each point denotes the value of the input parameter $\alpha$; for example, in TSP-b, the skip probability is 0.2 when $\alpha=0.8$. For TSP-b and Bwalk, lower values of $\alpha$ indicate higher randomness of the schedule; for GPN-b, GPN-b:I, and SG, higher values of $\alpha$ indicate higher randomness.

The results in Figure 4 demonstrate that all proposed algorithms can balance the two criteria by adjusting the parameter $\alpha$. Notably, GPN-b:I exhibits greater efficiency than GPN-b regarding the trade-off between EMR and entropy, indicating that the inorder-traversal method empirically outperforms the approach that directly utilizes BGT. Conversely, while SG achieves the highest entropy, its performance varies with different values of $\alpha$, indicating some instability.

Figure 5. Constant utility, Full vis. Figure 6. Linear utility, Full vis. Figure 7. Quadratic utility, Full vis.
Figure 8. Constant utility, Local vis. Figure 9. Linear utility, Local vis. Figure 10. Quadratic utility, Local vis.
Figure 11. Constant utility, No vis. Figure 12. Linear utility, No vis. Figure 13. Quadratic utility, No vis.
Figure 14. Denver Dataset.
Table 2. Comparison of the attacker's expected payoff (the lower the value, the better the patrol route) for our algorithms (GPN-b, TSP-b, Bwalk, SG) and the baselines (minKC, maxEn) under different utility functions (constant, linear, or quadratic) and attacker models (Full, Local, or No visibility). Figure 14 shows the simulation on the Denver Crime Dataset with full visibility.

7.2. Attacker’s payoff in artificial and real-world scenario

The experiments vary the following factors: the penalty value, the utility function class (constant, linear, or quadratic), and the attacker model (Full vis., Local vis., No vis.). The last figure reports the simulation results on the Denver crime dataset. Each figure shows the attacker's (expected) payoff under different penalties. Each configuration is run 10 times, and the y-axis reports the average attacker payoff with standard errors. We say an algorithm performs better if and only if the attacker obtains a lower payoff under the schedule generated by that algorithm.

Generally speaking, the attacker's payoff drops as the penalty increases, and a weaker attacker model (e.g., local vis. versus full vis.) also yields a lower payoff. In the experiments with constant utility functions (Figures 5, 8, and 11), although TSP-b performs best under Full vis., minKC is comparable under Local and No vis. This reflects our observation in Section 4 that the optimal solution is strongly correlated with the minimum hitting time.

In the experiments with non-constant utility functions (Figures 6, 7, and 14), our algorithms clearly outperform the baselines in most cases. For example, with linear utility functions and a penalty value of 2.18, the attacker's payoff is around 4 for minKC but only around 0.25 for GPN-b. One possible reason is that minKC and maxEn are designed only for constant vertex weights and are not suitable for non-constant utility functions. In contrast, our algorithms focus on the two objectives, EMR and entropy rate, which are not limited to constant utility functions. Comparing the four proposed algorithms, SG performs best in high-penalty scenarios with full and local visibility. In these cases, the adversary can learn more if the visiting sequence is correlated with the visiting history, which is the case for the other three algorithms. On the other hand, SG no longer dominates in the No vis. cases.

Figure 15. The attacker's relative payoff as the number of sites increases. GPN-b, TSP-b, and Bwalk show stable performance at larger scales.

7.3. Scalability

Figure 15 reports the scalability of the solutions. For demonstration, the settings are the full-visibility attacker model, constant utility functions, and a penalty value of 0. To compare performance across setups, every attacker expected payoff is divided by the payoff of the BGT patrol route. Since the number of constraints in minKC increases exponentially with the number of sites, the solution does not converge after 200k iterations once the number of sites exceeds 100 (the minKC solution is computed with CVXOPT on a desktop with an i7-13700K at 3.40 GHz and 96 GB of RAM).

Among the other solutions, the four proposed algorithms perform much better than maxEn, whose attacker payoffs are 1.2, 174.6, 132.8, and 132.9 for 50, 100, 150, and 200 sites, respectively (these values are too large to show in the plot, so we report them here). Among the proposed algorithms, GPN-b performs best and SG performs worst. One reason is that the schedules generated by SG have higher randomness; as the number of sites increases and the topology becomes more complicated, more delicately designed patrol routes are favored.

8. Conclusion

We studied a general patrolling game in which the attacker can also choose the attack period. Instead of formulating it as a mixed-integer linear program and searching over combinatorial defense strategies, whose number grows exponentially, we focus on two objectives: minimizing the expected maximum reward and maximizing the entropy rate. Based on this, we formulate the Randomized TSP problem and propose four algorithms that trade off the two criteria. We also design a framework that uses the proposed algorithms to solve patrol security games efficiently. Experiments show that our approach is scalable and adaptable to various utility functions and penalties.

Acknowledgements.
This work is supported by NSTC 111-2222-E-008-008-MY2, NSF DMS-1737812, CNS-1618391, CNS-1553273, and CCF-1535900. The authors would like to thank sociologist Prof. Yue Zhuo for helpful discussions on the criminology literature.

References

  • Abdallah et al. (2021) Mustafa Abdallah, Timothy Cason, Saurabh Bagchi, and Shreyas Sundaram. 2021. The effect of behavioral probability weighting in a simultaneous multi-target attacker-defender game. In 2021 European Control Conference (ECC). IEEE, 933–938.
  • Afshani et al. (2021) Peyman Afshani, Mark De Berg, Kevin Buchin, Jie Gao, Maarten Löffler, Amir Nayyeri, Benjamin Raichel, Rik Sarkar, Haotian Wang, and Hao-Tsung Yang. 2021. Approximation algorithms for multi-robot patrol-scheduling with min-max latency. In Algorithmic Foundations of Robotics XIV: Proceedings of the Fourteenth Workshop on the Algorithmic Foundations of Robotics 14. Springer, 107–123.
  • Afshani et al. (2022) Peyman Afshani, Mark de Berg, Kevin Buchin, Jie Gao, Maarten Löffler, Amir Nayyeri, Benjamin Raichel, Rik Sarkar, Haotian Wang, and Hao Tsung Yang. 2022. On Cyclic Solutions to the Min-Max Latency Multi-Robot Patrolling Problem. In 38th International Symposium on Computational Geometry, SoCG 2022. Schloss Dagstuhl-Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 2.
  • Agmon et al. (2011) Noa Agmon, Sarit Kraus, and Gal A. Kaminka. 2011. Multi-Robot Adversarial Patrolling: Facing a Full-Knowledge Opponent. 42 (December 2011), 887–916.
  • Agmon et al. (2008) Noa Agmon, Vladimir Sadov, Gal A Kaminka, and Sarit Kraus. 2008. The impact of adversarial knowledge on adversarial planning in perimeter patrol. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 55–62.
  • Alamdari et al. (2014) Soroush Alamdari, Elaheh Fata, and Stephen L Smith. 2014. Persistent monitoring in discrete environments: Minimizing the maximum weighted latency between observations. The International Journal of Robotics Research 33, 1 (2014), 138–154.
  • An et al. (2012) Bo An, Eric Shieh, Milind Tambe, Rong Yang, Craig Baldwin, Joseph DiRenzo, Ben Maule, and Garrett Meyer. 2012. PROTECT–A Deployed Game Theoretic System for Strategic Security Allocation for the United States Coast Guard. Ai Magazine 33, 4 (2012), 96.
  • Angel et al. (1972) RD Angel, WL Caudle, R Noonan, and ANDA Whinston. 1972. Computer-assisted school bus scheduling. Management Science 18, 6 (1972), B–279.
  • Arora (1996) Sanjeev Arora. 1996. Polynomial time approximation schemes for Euclidean TSP and other geometric problems. In Foundations of Computer Science, 1996. Proceedings., 37th Annual Symposium on. IEEE, 2–11.
  • Arora (1998) Sanjeev Arora. 1998. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. Journal of the ACM (JACM) 45, 5 (1998), 753–782.
  • Asghar and Smith (2016) Ahmad Bilal Asghar and Stephen L Smith. 2016. Stochastic patrolling in adversarial settings. In American Control Conference (ACC), 2016. IEEE, 6435–6440.
  • Ausiello et al. (2000) Giorgio Ausiello, Stefano Leonardi, and Alberto Marchetti-Spaccamela. 2000. On Salesmen, Repairmen, Spiders, and Other Traveling Agents. In Algorithms and Complexity, Giancarlo Bongiovanni, Rossella Petreschi, and Giorgio Gambosi (Eds.). Lecture Notes in Computer Science, Vol. 1767. Springer Berlin Heidelberg, 1–16. https://doi.org/10.1007/3-540-46521-9_1
  • Awerbuch et al. (1995) Baruch Awerbuch, Yossi Azar, Avrim Blum, and Santosh Vempala. 1995. Improved Approximation Guarantees for Minimum-Weight k-Trees and Prize-Collecting Salesmen. In SIAM JOURNAL ON COMPUTING. 277–283.
  • Basilico (2022) Nicola Basilico. 2022. Recent trends in robotic patrolling. Current Robotics Reports 3, 2 (2022), 65–76.
  • Basilico et al. (2017) Nicola Basilico, Giuseppe De Nittis, and Nicola Gatti. 2017. Adversarial patrolling with spatially uncertain alarm signals. Artificial Intelligence 246 (2017), 220–257.
  • Basilico et al. (2009) Nicola Basilico, Nicola Gatti, and Francesco Amigoni. 2009. Leader-follower Strategies for Robotic Patrolling in Environments with Arbitrary Topologies. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1 (Budapest, Hungary) (AAMAS ’09). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 57–64.
  • Basilico et al. (2012) Nicola Basilico, Nicola Gatti, and Francesco Amigoni. 2012. Patrolling security games: Definition and algorithms for solving large instances with single patroller and single intruder. Artificial Intelligence 184 (2012), 78–123.
  • Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016).
  • Bienstock et al. (1993) Daniel Bienstock, Michel X. Goemans, David Simchi-Levi, and David Williamson. 1993. A Note on the Prize Collecting Traveling Salesman Problem. Math. Program. 59, 3 (May 1993), 413–420. https://doi.org/10.1007/BF01581256
  • Bošanskỳ et al. (2011) Branislav Bošanskỳ, Viliam Lisỳ, Michal Jakob, and Michal Pěchouček. 2011. Computing time-dependent policies for patrolling games with mobile targets. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3. International Foundation for Autonomous Agents and Multiagent Systems, 989–996.
  • Branquinho et al. (2021) Amílcar Branquinho, Ana Foulquié-Moreno, Manuel Mañas, Carlos Álvarez-Fernández, and Juan E Fernández-Díaz. 2021. Multiple orthogonal polynomials and random walks. arXiv preprint arXiv:2103.13715 (2021).
  • Breen and Kirkland (2017) Jane Breen and Steve Kirkland. 2017. Minimising the largest mean first passage time of a Markov chain: The influence of directed graphs. Linear Algebra Appl. 520 (2017), 306–334.
  • Bucarey et al. (2021) Víctor Bucarey, Carlos Casorrán, Martine Labbé, Fernando Ordoñez, and Oscar Figueroa. 2021. Coordinating resources in stackelberg security games. European Journal of Operational Research 291, 3 (2021), 846–861.
  • Çam (2023) Salih Çam. 2023. Asset Allocation with Combined Models Based on Game-Theory Approach and Markov Chain Models. EKOIST Journal of Econometrics and Statistics 39 (2023), 26–36.
  • Cannata and Sgorbissa (2011) Giorgio Cannata and Antonio Sgorbissa. 2011. A minimalist algorithm for multirobot continuous coverage. IEEE Transactions on Robotics 27, 2 (2011), 297–312.
  • Carter and Ragsdale (2002) Arthur E Carter and Cliff T Ragsdale. 2002. Scheduling pre-printed newspaper advertising inserts using genetic algorithms. Omega 30, 6 (2002), 415–421.
  • Chen et al. (2021) Baiming Chen, Mengdi Xu, Liang Li, and Ding Zhao. 2021. Delay-aware model-based reinforcement learning for continuous control. Neurocomputing 450 (2021), 119–128.
  • Christofides (1976) Nicos Christofides. 1976. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical Report. Carnegie-Mellon Univ Pittsburgh Pa Management Sciences Research Group.
  • City and of Denver (2016) City and County of Denver. 2016. Denver Open Data Catalog. City and County of Denver (2016).
  • Dshalalow and White (2021) Jewgeni H Dshalalow and Ryan T White. 2021. Current trends in random walks on random lattices. Mathematics 9, 10 (2021), 1148.
  • Duan et al. (2018) Xiaoming Duan, Mishel George, and Francesco Bullo. 2018. Markov Chains with Maximum Return Time Entropy for Robotic Surveillance. arXiv preprint arXiv:1803.07705 (2018).
  • Duan et al. (2021) Xiaoming Duan, Dario Paccagnan, and Francesco Bullo. 2021. Stochastic strategies for robotic surveillance as stackelberg games. IEEE Transactions on Control of Network Systems 8, 2 (2021), 769–780.
  • Elmaliach et al. (2008) Yehuda Elmaliach, Asaf Shiloni, and Gal A. Kaminka. 2008. A Realistic Model of Frequency-based Multi-robot Polyline Patrolling. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1 (Estoril, Portugal) (AAMAS ’08). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 63–70.
  • Fang et al. (2013) Fei Fang, Albert Xin Jiang, and Milind Tambe. 2013. Optimal patrol strategy for protecting moving targets with multiple mobile resources. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 957–964.
  • Gao et al. (2022) Jie Gao, Mayank Goswami, CS Karthik, Meng-Tsung Tsai, Shih-Yu Tsai, and Hao-Tsung Yang. 2022. Obtaining approximately optimal and diverse solutions via dispersion. In Latin American Symposium on Theoretical Informatics. Springer, 222–239.
  • Gatti (2008) Nicola Gatti. 2008. Game Theoretical Insights in Strategic Patrolling: Model and Algorithm in Normal-Form.. In ECAI. 403–407.
  • George et al. (2018) Mishel George, Saber Jafarpour, and Francesco Bullo. 2018. Markov chains with maximum entropy for robotic surveillance. IEEE Trans. Automat. Control (2018).
  • Grace and Baillieul (2005) Jeremy Grace and John Baillieul. 2005. Stochastic strategies for autonomous robotic surveillance. In Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC’05. 44th IEEE Conference on. IEEE, 2200–2205.
  • Hanaka et al. (2021) Tesshu Hanaka, Yasuaki Kobayashi, Kazuhiro Kurita, and Yota Otachi. 2021. Finding diverse trees, paths, and more. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3778–3786.
  • Hottung et al. ([n. d.]) André Hottung, Bhanu Bhandari, and Kevin Tierney. [n. d.]. Learning a latent search space for routing problems using variational autoencoders. In International Conference on Learning Representations.
  • Huang and Zhu (2020) Linan Huang and Quanyan Zhu. 2020. A dynamic games approach to proactive defense strategies against advanced persistent threats in cyber-physical systems. Computers & Security 89 (2020), 101660.
  • Hunt and Zhuang (2024) Kyle Hunt and Jun Zhuang. 2024. A review of attacker-defender games: Current state and paths forward. European Journal of Operational Research 313, 2 (2024), 401–417.
  • Iocchi et al. (2011) L. Iocchi, L. Marchetti, and D. Nardi. 2011. Multi-robot patrolling with coordinated behaviours in realistic environments. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2796–2801. https://doi.org/10.1109/IROS.2011.6094844
  • Jain et al. (2010) Manish Jain, Erim Kardes, Christopher Kiekintveld, Fernando Ordónez, and Milind Tambe. 2010. Security Games with Arbitrary Schedules: A Branch and Price Approach.. In AAAI.
  • Jain et al. (2011) Manish Jain, Dmytro Korzhyk, Ondřej Vaněk, Vincent Conitzer, Michal Pěchouček, and Milind Tambe. 2011. A double oracle algorithm for zero-sum security games on graphs. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 327–334.
  • Janssen et al. (2020) Stef Janssen, Diogo Matias, and Alexei Sharpanskykh. 2020. An agent-based empirical game theory approach for airport security patrols. Aerospace 7, 1 (2020), 8.
  • Jaynes (1957) Edwin T Jaynes. 1957. Information theory and statistical mechanics. Physical review 106, 4 (1957), 620.
  • Kemeny and Snell (1960) John G Kemeny and J Laurie Snell. 1960. Finite Markov Chains. D Van Nostad Co. Inc., Princeton, NJ (1960).
  • Kemeny and Snell (1983) John G Kemeny and J Laurie Snell. 1983. Finite Markov chains: with a new appendix” Generalization of a fundamental matrix”. Springer.
  • Kiekintveld et al. (2009) Christopher Kiekintveld, Manish Jain, Jason Tsai, James Pita, Fernando Ordóñez, and Milind Tambe. 2009. Computing optimal randomized resource allocations for massive security games. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 689–696.
  • Kim and Park (2004) Kap Hwan Kim and Young-Man Park. 2004. A crane scheduling method for port container terminals. European Journal of operational research 156, 3 (2004), 752–768.
  • Kim et al. (2021) Minsu Kim, Jinkyoo Park, et al. 2021. Learning collaborative policies to solve np-hard routing problems. Advances in Neural Information Processing Systems 34 (2021), 10418–10430.
  • Kirkland et al. (2020) Lisa-Ann Kirkland, Alta De Waal, and Johan Pieter De Villiers. 2020. Evaluation of a Pure-Strategy Stackelberg Game for Wildlife Security in a Geospatial Framework. In Southern African Conference for Artificial Intelligence Research. Springer, 101–118.
  • Kirkland (2010) Steve Kirkland. 2010. Fastest expected time to mixing for a Markov chain on a directed graph. Linear Algebra Appl. 433, 11-12 (2010), 1988–1996.
  • Kool et al. (2018) Wouter Kool, Herke Van Hoof, and Max Welling. 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475 (2018).
  • Laporte (1992) Gilbert Laporte. 1992. The vehicle routing problem: An overview of exact and approximate algorithms. European Journal of Operational Research 59, 3 (1992), 345–358. https://doi.org/10.1016/0377-2217(92)90192-C
  • Lei et al. (2020) Lei Lei, Yue Tan, Kan Zheng, Shiwen Liu, Kuan Zhang, and Xuemin Shen. 2020. Deep reinforcement learning for autonomous internet of things: Model, applications and challenges. IEEE Communications Surveys & Tutorials 22, 3 (2020), 1722–1760.
  • Li et al. (2020) Zhongkai Li, Chengcheng Huo, and Xiangwei Qi. 2020. Analysis and Study of Several Game Algorithms for Public Safety. In 2020 International Conference on Computer Engineering and Application (ICCEA). IEEE, 575–579.
  • Liu et al. (2017) Kin Sum Liu, Tyler Mayer, Hao Tsung Yang, Esther Arkin, Jie Gao, Mayank Goswami, Matthew P. Johnson, Nirman Kumar, and Shan Lin. 2017. Joint Sensing Duty Cycle Scheduling for Heterogeneous Coverage Guarantee. In INFOCOM 2017 - IEEE Conference on Computer Communications. IEEE, 1–9.
  • Ma et al. (2020) Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. 2020. Combinatorial Optimization by Graph Pointer Networks and Hierarchical Reinforcement Learning. In AAAI Workshop on Deep Learning on Graphs: Methodologies and Applications.
  • Min and Radzik (2017) Jie Min and Tomasz Radzik. 2017. Bamboo Garden Trimming Problem. In SOFSEM 2017: Theory and Practice of Computer Science: 43rd International Conference on Current Trends in Theory and Practice of Computer Science, Limerick, Ireland, January 16-20, 2017, Proceedings, Vol. 10139. Springer, 229.
  • Mitchell (1999) Joseph SB Mitchell. 1999. Guillotine subdivisions approximate polygonal subdivisions: A simple polynomial-time approximation scheme for geometric TSP, k-MST, and related problems. SIAM Journal on computing 28, 4 (1999), 1298–1309.
  • Mosbah and Saheb (1999) Mohamed Mosbah and Nasser Saheb. 1999. Non-uniform random spanning trees on weighted graphs. Theoretical computer science 218, 2 (1999), 263–271.
  • Nguyen et al. (2016) Thanh H. Nguyen, Debarun Kar, Matthew Brown, Arunesh Sinha, Albert Xin Jiang, and Milind Tambe. 2016. Towards a Science of Security Games. In New Frontiers of Multidisciplinary Research in STEAM-H, B. Toni (Ed.).
  • Pasqualetti et al. (2012) F. Pasqualetti, A. Franchi, and F. Bullo. 2012. On Cooperative Patrolling: Optimal Trajectories, Complexity Analysis, and Approximation Algorithms. IEEE Transactions on Robotics 28, 3 (June 2012), 592–606. https://doi.org/10.1109/TRO.2011.2179580
  • Patel et al. (2015) Rushabh Patel, Pushkarini Agharkar, and Francesco Bullo. 2015. Robotic surveillance and Markov chains with minimal weighted Kemeny constant. IEEE Trans. Automat. Control 60, 12 (2015), 3156–3167.
  • Pillac et al. (2013) Victor Pillac, Michel Gendreau, Christelle Guéret, and Andrés L. Medaglia. 2013. A review of dynamic vehicle routing problems. European Journal of Operational Research 225, 1 (2013), 1–11. https://doi.org/10.1016/j.ejor.2012.08.015
  • Pita et al. (2008) James Pita, Manish Jain, Janusz Marecki, Fernando Ordóñez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. 2008. Deployed ARMOR protection: the application of a game theoretic model for security at the Los Angeles International Airport. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems: industrial track. International Foundation for Autonomous Agents and Multiagent Systems, 125–132.
  • Portugal and Rocha (2011) D. Portugal and R. P. Rocha. 2011. On the performance and scalability of multi-robot patrolling algorithms. In 2011 IEEE International Symposium on Safety, Security, and Rescue Robotics. 50–55. https://doi.org/10.1109/SSRR.2011.6106761
  • Rubio et al. (2019) Francisco Rubio, Francisco Valero, and Carlos Llopis-Albert. 2019. A review of mobile robots: Concepts, methods, theoretical framework, and applications. International Journal of Advanced Robotic Systems 16, 2 (2019), 1729881419839596.
  • Rupprecht and Wang (2022) Timothy Rupprecht and Yanzhi Wang. 2022. A survey for deep reinforcement learning in markovian cyber–physical systems: Common problems and solutions. Neural Networks 153 (2022), 13–36.
  • Samanta et al. (2022) Sukanya Samanta, Goutam Sen, and Soumya Kanti Ghosh. 2022. A literature review on police patrolling problems. Annals of Operations Research 316, 2 (2022), 1063–1106.
  • Serfozo (2009) Richard Serfozo. 2009. Basics of applied stochastic processes. Springer Science & Business Media.
  • Shieh et al. (2012) Eric Shieh, Bo An, Rong Yang, Milind Tambe, Craig Baldwin, Joseph DiRenzo, Ben Maule, and Garrett Meyer. 2012. Protect: A deployed game theoretic system to protect the ports of the united states. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 13–20.
  • Shieh et al. (2013) Eric Shieh, Manish Jain, Albert Xin Jiang, and Milind Tambe. 2013. Efficiently solving joint activity based security games. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 346–352.
  • Sinha et al. (2018) Arunesh Sinha, Fei Fang, Bo An, Christopher Kiekintveld, and Milind Tambe. 2018. Stackelberg security games: Looking beyond a decade of success. IJCAI.
  • Song et al. (2023) Zimeng Song, Chun Kai Ling, and Fei Fang. 2023. Multi-defender Security Games with Schedules. In International Conference on Decision and Game Theory for Security. Springer, 65–85.
  • Stump and Michael (2011) E. Stump and N. Michael. 2011. Multi-robot persistent surveillance planning as a Vehicle Routing Problem. In Automation Science and Engineering (CASE), 2011 IEEE Conference on. 569–575. https://doi.org/10.1109/CASE.2011.6042503
  • Tsai et al. (2009) Jason Tsai, Christopher Kiekintveld, Fernando Ordonez, Milind Tambe, and Shyamsunder Rathi. 2009. IRIS-a tool for strategic security allocation in transportation networks. (2009).
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. Advances in neural information processing systems 28 (2015).
  • Vorobeychik et al. (2012) Yevgeniy Vorobeychik, Bo An, and Milind Tambe. 2012. Adversarial Patrolling Games. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3 (Valencia, Spain) (AAMAS ’12). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1307–1308.
  • Vorobeychik et al. (2014) Yevgeniy Vorobeychik, Bo An, Milind Tambe, and Satinder P Singh. 2014. Computing Solutions in Infinite-Horizon Discounted Adversarial Patrolling Games. In ICAPS.
  • Wang (2020) Kai Wang. 2020. Balance Between Scalability and Optimality in Network Security Games.. In AAMAS. 2228–2230.
  • Wang et al. (2020) Kai Wang, Andrew Perrault, Aditya Mate, and Milind Tambe. 2020. Scalable Game-Focused Learning of Adversary Models: Data-to-Decisions in Network Security Games.. In AAMAS. 1449–1457.
  • Yang et al. (2019) Hao-Tsung Yang, Shih-Yu Tsai, Kin Sum Liu, Shan Lin, and Jie Gao. 2019. Patrol scheduling against adversaries with varying attack durations. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. 1179–1188.
  • Yang et al. (2014) Rong Yang, Benjamin Ford, Milind Tambe, and Andrew Lemieux. 2014. Adaptive resource allocation for wildlife protection against illegal poachers. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 453–460.
  • Yu and Liu (2014) Wei Yu and Zhaohui Liu. 2014. Vehicle routing problems with regular objective functions on a path. Naval Research Logistics (NRL) 61, 1 (2014), 34–43. https://doi.org/10.1002/nav.21564