
Optimal Dispatch in Emergency Service System via Reinforcement Learning

Cheng Hua    Tauhid Zaman
School of Management, Yale University
{cheng.hua, tauhid.zaman}@yale.edu
Abstract

In the United States, medical responses by fire departments have increased by 367% over the last four decades. This has made it critical for decision makers in emergency response departments to use existing resources efficiently. In this paper, we model the ambulance dispatch problem as an average-cost Markov decision process and present a policy iteration approach to find an optimal dispatch policy. We then propose an alternative formulation using post-decision states that is shown to be mathematically equivalent to the original model, but with a much smaller state space. We present a temporal difference learning approach to the dispatch problem based on the post-decision states. In our numerical experiments, we show that the obtained temporal-difference policy outperforms the benchmark myopic policy. Our findings suggest that emergency response departments can improve their performance at minimal to no cost.

1 Introduction

In the United States, medical responses by fire departments have increased by 367% over the last four decades (Figure 1), as reported by Evarts (2019). The reasons for this dramatic increase in medical calls include the aging population and health insurance that covers most of the ambulance costs (Boston Globe, Nov 29, 2015). Cities with tight budgets are short of response units to respond to the growing number of medical calls in time. NBC10 in Philadelphia reported on Feb 28, 2019 that two thirds of the emergency medical calls had an ambulance response time of more than nine minutes. Thus, how to efficiently use existing resources has become an important question for decision makers in emergency response departments.

Figure 1: Annual emergency medical service (EMS) incidents from 1980 to 2018.

In current practice, cities dispatch ambulance units according to a fixed dispatch rule that always sends the closest available unit. This dispatch policy is myopic, as it only considers the current call and ignores the impact of the dispatch decision on future calls. In this paper, we model the ambulance dispatch problem as an average-cost Markov decision process (MDP). We find a dispatch policy that minimizes the mean response time using reinforcement learning. We propose an alternative MDP formulation for the problem using post-decision states that we show is mathematically equivalent to the original MDP formulation, but with a much smaller state space.

Due to the curse of dimensionality, the application of both formulations is restricted to small problems. To solve larger problems, we use temporal difference learning (TD-learning) with the post-decision states. In our numerical experiments, we show that the TD-learning algorithm converges quickly and that the policies obtained from our method outperform the myopic policy that always dispatches the closest available unit.

The remainder of this paper is organized as follows. In Section 2, we provide a review of the relevant literature. In Section 3, we present the Markov decision process formulation. Section 4 presents the formulation using post-decision states which reduces the state space of the original formulation. In Section 5, we present the temporal difference learning algorithm that is based on the post-decision states and its theoretical properties, while in Section 6, we show the performance of the algorithm in numerical experiments. We conclude the paper in Section 7.

2 Literature Review

There are a number of papers that study dispatch policies for emergency medical services (EMS) in different settings. The optimal dispatch rule was first studied in Carter et al. (1972). The authors studied a simple system of two units and provided a closed form solution that determines the response areas to be served by each unit to minimize average response time. However, such a closed form solution no longer exists in a system that has more than two units, and finding the optimal dispatch rule has been an important topic.

Bandara et al. (2014) developed a simulation model to evaluate priority dispatching strategies for ambulance systems with calls of different priorities, with the goal of improving patient survival probability. The optimal policy was found via full enumeration using a simulation model, and the authors found that dispatching the closest available unit is not always optimal in this setting. The limitation of their method is its exponential computational complexity; a system with more than four units already takes a significant amount of time. McLay and Mayorga (2013b) developed a Markov decision process to optimally dispatch ambulances to emergency calls with classification errors in identifying priorities, so as to maximize the expected coverage of true high-risk patients. They compared the optimal policy to the myopic policy that sends the closest unit and found that the optimal policy increases the expected utility. The authors applied the value iteration algorithm to find the optimal solution, but this solution method is also limited to small problems. They extended this work in McLay and Mayorga (2013a) to consider an optimal policy that balances equity and efficiency by adding four equity constraints.

Jagtenberg et al. (2017a) studied whether dispatching the closest available unit is optimal in a dynamic ambulance dispatching problem based on a Markov decision process. The problem was discretized in time using one-minute intervals. Two objectives were considered: the mean response time and the fraction of calls with response time beyond a certain threshold. The value iteration method was used to find the optimal solution. A heuristic that leverages the dynamic maximum expected covering location problem (MEXCLP) was proposed by Jagtenberg et al. (2015); the MEXCLP model was introduced by Daskin (1983) to determine the best unit locations, using expected covered demand as a proxy objective rather than response time. They found an improvement of 18% in the fraction of late arrivals using the optimal policy compared to the myopic policy, which provides insight into the value of deviating from the myopic policy. Jagtenberg et al. (2017b) empirically provided a bound for the gap between existing solutions and the optimal solution.

The aforementioned papers use enumeration or MDP models to analyze dispatch policies. While they are guaranteed to find the optimal policy, their computational complexity limits their use to small systems due to the curse of dimensionality; for larger problems they become computationally intractable. Thus, many works have tried to resolve this issue using approximate dynamic programming (ADP). A comprehensive introduction to the application of approximate dynamic programming in the operations research field can be found in Powell (2010) and Powell (2007).

Schmid (2012) followed the formulation introduced in Powell (2010) and applied ADP with aggregation-based function approximation to an ambulance relocation and dispatching problem to reduce the mean response time. A seminal paper by Maxwell et al. (2010) applied ADP to the ambulance redeployment problem, using Monte Carlo simulation with a one-step bootstrap to estimate complex expectations and least squares regression with linear function approximation to learn the approximate value functions. Nasrollahzadeh et al. (2018) studied the problem of real-time ambulance dispatching and relocation, which is formulated as a stochastic dynamic program and solved using ADP. Their formulation and method are the same as those proposed in Maxwell et al. (2010). One issue with this approach is that, while it usually beats the benchmark policy, it rarely yields the optimal policy. Moreover, the learning method is not guaranteed to converge, and finding useful basis functions for the approximation is more of an art than a science, requiring domain knowledge and testing.

Our paper differs from the existing work in the following three aspects: 1) We model the problem as an average-cost MDP, which solves for the optimal policy that minimizes the mean response time. The papers mentioned above use simpler discounted-reward schemes, which minimize the sum of discounted response times as a surrogate for the mean response time; however, discounted response times are more difficult to interpret than the mean response time. 2) We propose an alternative formulation with post-decision states and show its mathematical equivalence to the original problem formulation. 3) We show that the proposed TD-learning algorithm based on post-decision states is guaranteed to converge to the optimal solution, whereas few or no theoretical guarantees exist for the aforementioned ADP-based methods.

3 Markov Decision Process Formulation

Consider a geographical region $R\subset\mathbb{R}^{2}$ that is served by a total of $N$ emergency units, all of the same type, e.g., ambulance units. Calls arrive in region $R$ according to a Poisson point process with arrival intensity $\{\Lambda(x,y):(x,y)\in R\}$ at location $(x,y)$. We partition the region $R$ into $J$ sub-regions and associate a center of mass with each sub-region $R_{j}\subset R$, which is also referred to as demand node $j$. Note that $J$ can be as large as needed. Denote the total call rate in node $j$ as

$\lambda_{j}=\int_{R_{j}}\Lambda(x,y)\,dx\,dy.$ (1)

The overall call rate in region $R$ is denoted by

$\lambda=\sum_{j}\lambda_{j}=\int_{R}\Lambda(x,y)\,dx\,dy.$ (2)

We assume that the service time of unit $i$ follows a negative exponential distribution with rate $\mu_{i}$, where the service time includes the turnout time, travel time, and time at the scene. The justification for this assumption is that travel time is usually a small portion of the total service time. For systems with longer travel times, Jarvis (1975) described a mean service time calibration that adjusts the service time so as to maintain the Markov property.

Let $b_{i}$ represent the state of unit $i$, where $b_{i}=0$ if the unit is available and $b_{i}=1$ if the unit is busy. We denote the joint status of all units by an $N$-dimensional vector $B=\{b_{N}\cdots b_{1}\}\in\mathscr{B}$, written in backward order similar to the representation of binary numbers. We write $B(i)=b_{i}$ for the status of unit $i$. Note that $|\mathscr{B}|=2^{N}$.
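Since $B$ is read like a binary number, each status vector maps to an integer index, which is convenient for storing quantities indexed by $\mathscr{B}$. The following minimal Python sketch is purely illustrative (the helper names and the zero-based bit positions are ours, not part of the model):

```python
from itertools import product

N = 3  # number of units (illustrative)

def encode(B):
    """Map a status tuple (b_1, ..., b_N) to an integer index by treating
    unit i as bit i-1, mirroring the binary-number ordering of B."""
    return sum(b << (i - 1) for i, b in enumerate(B, start=1))

def decode(index, N):
    """Recover the status tuple (b_1, ..., b_N) from an integer index."""
    return tuple((index >> (i - 1)) & 1 for i in range(1, N + 1))

# Enumerate the full status space: |B| = 2^N vectors.
all_statuses = [tuple(bits) for bits in product((0, 1), repeat=N)]
assert len(all_statuses) == 2 ** N
```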

If all units are busy when a call is received, we assume that it is handled by a unit outside of the system, such as a private ambulance company or a unit from a neighboring jurisdiction, which is a common mutual aid policy.

Define $t_{ij}$ as the average response time from the base of unit $i$ to node $j$. In this paper, we aim to find the optimal dispatch policy that minimizes the average response time of all served calls.

3.1 State Space

Define $S$ as the state space and $s\in S$ as the state of the system. We have

$s=(j,B),$ (3)

which is a tuple of $j$ and $B$, consisting of the location of the current call and the statuses of all units (available or busy) at the time of the arrival. We denote $s(0)=j$ and $s(1)=B$ for state $s$. The entire state space has size $|S|=J\times 2^{N}$.

3.2 Action Space

When a call is received, we must decide which unit to dispatch to it. An action in this problem is to dispatch a particular unit upon receiving a call, so the action space is given as

$A=\{1,2,\cdots,N\}.$ (4)

Note that only an available unit may be dispatched. We define $A_{s}\subset A$ as the set of feasible actions in state $s$, where

$A_{s}=\{i:B(i)=0,\ i\in\{1,2,\cdots,N\}\}.$ (5)

We define $a\in A_{s}$ as an action from the feasible action space.

3.3 Policy Space

We define the policy space as $\Pi$, the set of all feasible policies. A policy $\pi\in\Pi$ is a mapping from the state space to the action space, $\pi:S\rightarrow A$. Specifically, $\pi(s)=a$ with $a\in A_{s}$. The optimal policy $\pi^{*}\in\Pi$ is the policy that minimizes the average cost over all time steps. Our goal is to find this optimal policy.

We use a benchmark policy that sends the closest available unit, denoted by $\pi^{m}\in\Pi$:

$\pi^{m}(s)=\arg\min_{i\in A_{s}}t_{ij},\qquad\forall s\in S.$ (6)

Sending the closest available unit is a policy widely used in emergency service systems. This policy is myopic as it does not consider potential future calls: reserving the closest unit for a future call, rather than dispatching it to the current one, might greatly reduce the expected response time of that future call.
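For concreteness, the myopic rule in (6) can be written in a few lines. The sketch below is illustrative only; it assumes zero-based unit indices and a response-time matrix `t` with `t[i][j]` giving the time from unit $i$'s base to node $j$:

```python
def myopic_dispatch(j, B, t):
    """Return the closest available unit for a call at demand node j.

    j : index of the demand node of the current call
    B : tuple of unit statuses; B[i] == 0 means unit i is available
    t : response-time matrix; t[i][j] is the time from unit i's base to node j
    """
    available = [i for i, b in enumerate(B) if b == 0]
    if not available:
        return None  # all units busy: the call is handled outside the system
    return min(available, key=lambda i: t[i][j])
```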

3.4 Costs

Define $c^{\pi}(s)$ as the cost of being in state $s$ under policy $\pi$, which equals the response time

$c^{\pi}(s)=t_{ij}$ (7)

when the call location in state $s$ is $j$, i.e., $s(0)=j$, and the policy dispatches unit $i$ in state $s$, i.e., $\pi(s)=i$.

3.5 Transition Probabilities with Augmented Transitions

Define $p^{\pi}(s,s^{\prime})$ as the transition probability from state $s=(j,B)$ to state $s^{\prime}=(j^{\prime},B^{\prime})$ under policy $\pi$. In determining the transition probabilities, we consider an augmented transition in which a unit completes a service and no dispatch action is needed. This is because the number of services completed between two arrivals is a random variable whose distribution is complicated to compute. Introducing the augmented transition reduces the number of transition possibilities. Denote by $I_{i}$ the vector of all 0's except for the $i$th element, which is 1. The transition probability $p^{\pi}(s,s^{\prime})$ with augmented transitions is given as

$p^{\pi}(s,s^{\prime})=\begin{cases}\dfrac{\lambda_{j^{\prime}}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s^{\prime}=(j^{\prime},B+I_{i}),\\[1ex]\dfrac{\mu_{l}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s^{\prime}=(\emptyset,B+I_{i}-I_{l}),\end{cases}$ (8)

where the top expression corresponds to the transition from state $s$ to state $s^{\prime}$ when action $\pi(s)=i$ is taken and a new call arrives at node $j^{\prime}$, and the bottom expression corresponds to the augmented transition in which no call arrives but a busy unit $l\in A\setminus A_{s}$ completes its current service. No action is needed in this transition since no call arrives.
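To make (8) concrete, the successor states and their probabilities for a given state $s=(j,B)$ and dispatch decision $i$ can be enumerated as follows. This is an illustrative sketch with zero-based unit indices, and `None` stands in for the empty call location $\emptyset$ of the augmented states:

```python
def transitions(j, B, i, lam, mu, lam_total):
    """Enumerate successor states and probabilities after dispatching
    available unit i in state s = (j, B), following Eq. (8).

    lam       : list of node arrival rates lambda_j
    mu        : list of unit service rates mu_i
    lam_total : overall arrival rate lambda = sum(lam)
    Returns a list of ((call_node, status), probability) pairs, where
    call_node is None for the augmented (service-completion) states.
    """
    B_after = list(B)
    B_after[i] = 1                    # the dispatched unit becomes busy
    busy = [k for k, b in enumerate(B_after) if b == 1]
    denom = lam_total + sum(mu[k] for k in busy)  # lambda + sum of mu over busy units

    succ = []
    # Top case of Eq. (8): a new call arrives at node j' before any completion.
    for j_next, lam_j in enumerate(lam):
        succ.append(((j_next, tuple(B_after)), lam_j / denom))
    # Bottom case: augmented transition, a busy unit l completes its service.
    for l in busy:
        B_free = list(B_after)
        B_free[l] = 0
        succ.append(((None, tuple(B_free)), mu[l] / denom))
    return succ
```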

3.6 Bellman’s Equation

Define $V^{\pi}:S^{+}\mapsto\mathbb{R}$ as the value function of the MDP under policy $\pi$, where the value of state $s$ is $V^{\pi}(s)$ and $S^{+}$ is the augmented state space, which has size $|S^{+}|=(J+1)2^{N}$. Let $\mu^{\pi}$ be the average cost under policy $\pi$. The Bellman equation for the average-cost problem is

$V^{\pi}(s)=c^{\pi}(s)-\frac{\mu^{\pi}}{2}+\sum_{s^{\prime}\in S^{+}}p^{\pi}(s,s^{\prime})V^{\pi}(s^{\prime}),\quad\forall s\in S^{+}.$ (9)

Note that the $1/2$ in the above equation is due to the augmented transitions: a transition caused by a service completion has zero cost, and the number of service completions always equals the number of calls served, so $\mu^{\pi}/2$ is the average cost per transition while $\mu^{\pi}$ is the average cost per served call.

Define $V^{\pi}$ as the vector of all state values, $c^{\pi}$ as the vector of all state costs, $P^{\pi}$ as the transition matrix, and $e$ as the vector of all ones. The vector form of the Bellman equation is

$V^{\pi}=c^{\pi}-\frac{\mu^{\pi}e}{2}+P^{\pi}V^{\pi}.$ (10)

The solution to the above Bellman equation is not unique. Instead, the set of all value functions takes the form $\{V^{\pi}+me\mid m\in\mathbb{R}\}$. Since shifting the value function by a constant does not change the relative differences between state values, once we obtain a set of state values $V^{\pi}$, the policy can be updated as

$\pi^{\prime}(s)=\arg\min_{a\in A_{s}}t_{aj}+\sum_{s^{\prime}\in S^{+}}p(s,s^{\prime}|a)V^{\pi}(s^{\prime}),$ (11)

where $p(s,s^{\prime}|a)$ is the one-step transition probability when taking action $a$ instead of following policy $\pi$. This policy iteration method for solving the MDP is summarized in Algorithm 1.

Algorithm 1 Policy Iteration Method
1:  Pick a random policy $\pi_{0}$. Set $k=0$.
2:  while $\pi_{k}\not=\pi_{k+1}$ do
3:     Compute the cost $c^{\pi_{k}}$ and transition matrix $P^{\pi_{k}}$.
4:     Policy Evaluation: Solve for the state values $V^{\pi_{k}}$ from the Bellman equation (10).
5:     Policy Improvement: For each state $s\in S$, update the action by
       $\pi_{k+1}(s)=\arg\min_{a\in A_{s}}t_{aj}+\sum_{s^{\prime}\in S^{+}}p(s,s^{\prime}|a)V^{\pi_{k}}(s^{\prime}).$
6:     $k=k+1$
7:  end while
8:  Output: Optimal Policy $\pi^{*}=\pi_{k}$
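The policy evaluation step (line 4 of Algorithm 1) amounts to solving the linear system (10), which determines $V^{\pi_{k}}$ only up to an additive constant together with the unknown average cost. A minimal numpy sketch, assuming the transition matrix and cost vector have already been assembled (pinning one state value to zero is our implementation choice, not prescribed by the paper):

```python
import numpy as np

def evaluate_policy(P, c):
    """Solve the average-cost Bellman equation (10): V = c - (mu/2) e + P V.

    The solution is unique only up to an additive constant, so we pin
    V[0] = 0 and solve for the remaining values together with mu.
    P : (n, n) numpy array, transition matrix under the current policy
    c : (n,)   numpy array, cost (response time) of each state
    """
    n = len(c)
    I = np.eye(n)
    # Unknowns: V[1], ..., V[n-1] and mu, with V[0] fixed to 0.
    M = np.hstack([(I - P)[:, 1:], 0.5 * np.ones((n, 1))])
    x = np.linalg.solve(M, c)
    V = np.concatenate([[0.0], x[:-1]])
    mu = x[-1]
    return V, mu
```

The improvement step (line 5) then applies (11) state by state, evaluating the one-step expectation for each feasible action, for example with the transition enumeration sketched in Section 3.5.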

4 Post-Decision State Formulation

The policy iteration method is guaranteed to converge to the optimal dispatch policy that minimizes the average response time, but it requires repeatedly solving a linear system with $(J+1)2^{N}$ states. In this section, by exploiting the structure of the state transitions of the dispatch problem, we introduce the notion of post-decision states and use them as the new states of our problem. We show that the MDP formulation using post-decision states reduces the state space to $2^{N}$ while still guaranteeing that the optimal dispatch policy is found. In the next section, we develop the temporal difference learning method based on this formulation with post-decision states.

In the original formulation, a state is a tuple of the call location and the statuses of all units, $s=(j,B)$. A post-decision state $s_{x}$ is the state the system is in immediately after observing state $s$ and taking action $a=\pi(s)$, before the next piece of random information, namely the location of the next call, arrives in the system. Thus, given state $s$ and unit $i$ being dispatched, i.e., $a=\pi(s)=i$, the post-decision state $s_{x}$ is

$s_{x}=B+I_{i}.$ (12)

Note that by defining the post-decision state this way, we only need information about the statuses of all units. Define $S_{x}$ as the post-decision state space; we have $|S_{x}|=2^{N}$. Indeed, $S_{x}=\mathscr{B}$.
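In the status-vector encoding used earlier, forming the post-decision state simply marks the dispatched unit as busy; an illustrative helper:

```python
def post_decision_state(B, i):
    """Post-decision state s_x = B + I_i: mark the dispatched unit i as busy."""
    sx = list(B)
    sx[i] = 1
    return tuple(sx)
```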

Lemma 1.

Let $p_{x}^{\pi}(s_{x},s_{x}^{\prime})$ be the corresponding transition probability from $s_{x}$ to $s_{x}^{\prime}$. We have

$p_{x}^{\pi}(s_{x},s_{x}^{\prime})=\begin{cases}\dfrac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s_{x}^{\prime}=s_{x}+I_{l},\\[1ex]\dfrac{\mu_{l}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s_{x}^{\prime}=s_{x}-I_{l},\end{cases}$ (13)

where $\mathscr{R}_{l|s_{x}}^{\pi}$ is the set of demand nodes for which policy $\pi$ dispatches unit $l$, i.e.,

$\mathscr{R}_{l|s_{x}}^{\pi}=\{j:\pi(s^{\prime})=l,\ s^{\prime}=(j,s_{x})\}.$ (14)
Proof.

For $s_{x}^{\prime}=B+I_{i}-I_{l}$, where the transition happens because unit $l$ completes its current service, the post-decision state transition is the same as the transition from $s$ to the augmented state $s^{\prime}=(\emptyset,B+I_{i}-I_{l})$, since no call arrives and no action is needed in that state; thus $s^{\prime}=s_{x}^{\prime}$.

For $s_{x}^{\prime}=B+I_{i}+I_{l}$, where a call arrives in post-decision state $s_{x}$, we need to capture the randomness of the exogenous information, namely the location of the call that arrives in $s_{x}$. We thus have

$p_{x}^{\pi}(s_{x},s_{x}^{\prime})=\sum_{j}p^{\pi}\big(s,s^{\prime}=(j,s_{x})\big)\mathbbm{1}_{\{\pi(s^{\prime})=l\}}$
$=\sum_{j}p^{\pi}(s,s^{\prime})\mathbbm{1}_{\{s^{\prime}=(j,s_{x})\}}\mathbbm{1}_{\{\pi(s^{\prime})=l\}}$
$=\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}p^{\pi}\big(s,s^{\prime}=(j,s_{x})\big)$
$=\frac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}.$ ∎

Lemma 2.

The cost of post-decision state $s_{x}$, $c^{\pi}(s_{x})$, is given as

$c^{\pi}(s_{x})=\dfrac{\sum_{l\in A_{s}}\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}.$ (15)
Proof.

The cost of $s_{x}$ is the expected one-step transition cost from $s_{x}$ to $s_{x}^{\prime}$ under policy $\pi$. Let $c^{\pi}_{l}(s_{x})$ be the expected cost of dispatching unit $l$ in $s_{x}$. We have

$c^{\pi}_{l}(s_{x})=\begin{cases}\dfrac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}},&\text{if }l\in A_{s},\\[1ex]0,&\text{otherwise}.\end{cases}$ (16)

We thus have

$c^{\pi}(s_{x})=\sum_{l\in A_{s}}c^{\pi}_{l}(s_{x})\,p_{x}^{\pi}(s_{x},s_{x}+I_{l})$
$=\sum_{l\in A_{s}}\frac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}\cdot\frac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}$
$=\frac{\sum_{l\in A_{s}}\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}.$ ∎

Note that all components defining $c^{\pi}(s_{x})$ and $p_{x}^{\pi}(s_{x},s_{x}^{\prime})$ are known and can be computed beforehand. Define $J^{\pi}:S_{x}\mapsto\mathbb{R}$ as the value function on the post-decision state space $S_{x}$. Let $\mu_{x}^{\pi}$ be the average cost following policy $\pi$ in the post-decision state space. The Bellman equation is

$J^{\pi}(s_{x})=c^{\pi}(s_{x})-\frac{\mu_{x}^{\pi}}{2}+\sum_{s_{x}^{\prime}\in S_{x}}p_{x}^{\pi}(s_{x},s_{x}^{\prime})J^{\pi}(s_{x}^{\prime}),\quad\forall s_{x}\in S_{x}.$ (17)

Let $J^{\pi}$, $c_{x}^{\pi}$, and $P_{x}^{\pi}$ be the corresponding vector representations. The vector form of the Bellman equation around post-decision states is

$J^{\pi}=c_{x}^{\pi}-\frac{\mu_{x}^{\pi}e}{2}+P_{x}^{\pi}J^{\pi}.$ (18)
Theorem 1.

The MDP formulation around post-decision states is equivalent to the original formulation. In particular

  i) $\mu_{x}^{\pi}=\mu^{\pi}$;

  ii) For $s_{x}=B$, let $\Gamma=\lambda+\sum_{k:B(k)=1}\mu_{k}$. We have

     $J^{\pi}(s_{x})=\sum_{j}\frac{\lambda_{j}}{\Gamma}V^{\pi}\big(s^{(j)}\big)+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}V^{\pi}\big(s^{[k]}\big),$ (19)

     where $s^{(j)}=(j,B)$ and $s^{[k]}=(\emptyset,B-I_{k})$.

Proof.

Expanding the value functions $V^{\pi}$ in (19) using (9) and collecting terms, we have

$\sum_{j}\frac{\lambda_{j}}{\Gamma}V^{\pi}\big(s^{(j)}\big)+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}V^{\pi}\big(s^{[k]}\big)$
$=\sum_{j}\frac{\lambda_{j}}{\Gamma}c^{\pi}\big(s^{(j)}\big)-\sum_{j}\frac{\lambda_{j}}{\Gamma}\frac{\mu^{\pi}}{2}+\sum_{j}\frac{\lambda_{j}}{\Gamma}\sum_{s^{\prime}\in S}p^{\pi}(s^{(j)},s^{\prime})V^{\pi}(s^{\prime})$
$\quad+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\underbrace{c^{\pi}\big(s^{[k]}\big)}_{=0}-\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\frac{\mu^{\pi}}{2}+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\sum_{s^{\prime}\in S}p^{\pi}(s^{[k]},s^{\prime})V^{\pi}(s^{\prime})$
$=\underbrace{\frac{\sum_{l\in A_{s^{(j)}}}\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\Gamma}}_{c^{\pi}(s_{x})}-\Big(\underbrace{\sum_{j}\frac{\lambda_{j}}{\Gamma}+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}}_{=1}\Big)\frac{\mu^{\pi}}{2}$
$\quad+\sum_{j}\frac{\lambda_{j}}{\Gamma}\Big[\frac{\sum_{j^{\prime}}\lambda_{j^{\prime}}}{\Gamma+\mu_{\pi(s^{(j)})}}V^{\pi}\big(s^{(j^{\prime})}\big)+\frac{\sum_{l:B(l)=1}\mu_{l}}{\Gamma+\mu_{\pi(s^{(j)})}}V^{\pi}\big(s^{[l]}\big)\Big]$
$\quad+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\Big[\frac{\sum_{j^{\prime}}\lambda_{j^{\prime}}}{\Gamma}V^{\pi}\big(s^{(j^{\prime})[k]}\big)+\frac{\mu_{k}}{\Gamma}\frac{\sum_{l:B(l)=1}\mu_{k}}{\Gamma}V^{\pi}\big(s^{[k][l]}\big)\Big]$
$=c^{\pi}(s_{x})-\frac{\mu^{\pi}}{2}+\sum_{s_{x}^{\prime}\in S_{x}}p_{x}^{\pi}(s_{x},s_{x}^{\prime})J^{\pi}(s_{x}^{\prime}).$

The last equality holds by noting that the first part of the summation corresponds to the transition in which a call arrives and a unit is dispatched, and the last part corresponds to the transition in which a service is completed and the unit returns to its base and becomes available. ∎

Under this formulation, the new policy $\pi^{\prime}$ for state $s=(j,B)$ is updated as

$\pi^{\prime}(s)=\arg\min_{a\in A_{s}}t_{aj}+J^{\pi}(s_{x}=B+I_{a}).$ (20)

We note that even though the state space is smaller when using post-decision states, the policy space remains the same. The new policy iteration around post-decision states is summarized in Algorithm 2.

Algorithm 2 Policy Iteration with Post-Decision States
1:  Pick a random policy $\pi_{0}$. Set $k=0$.
2:  while $\pi_{k}\not=\pi_{k+1}$ do
3:     Compute the cost $c_{x}^{\pi_{k}}$ and transition matrix $P_{x}^{\pi_{k}}$.
4:     Policy Evaluation: Solve for the state values $J^{\pi_{k}}$ from the Bellman equation (18).
5:     Policy Improvement: For each state $s\in S$, update the action by
       $\pi_{k+1}(s)=\arg\min_{a\in A_{s}}t_{aj}+J^{\pi_{k}}(s_{x}=B+I_{a}).$
6:     $k=k+1$
7:  end while
8:  Output: Optimal Policy $\pi^{*}=\pi_{k}$
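Step 3 of Algorithm 2 assembles $c_{x}^{\pi_{k}}$ and $P_{x}^{\pi_{k}}$ directly from (13) and (15). The sketch below is illustrative: the `policy` dictionary, the zero-based unit indices, and the treatment of the all-units-busy case (the call is served outside the system and the status vector is unchanged, a case not spelled out in Lemma 1) are our assumptions.

```python
import numpy as np

def build_post_decision_model(policy, lam, mu, t):
    """Assemble the post-decision cost vector and transition matrix.

    policy : dict mapping (j, status_tuple) -> index of the dispatched unit
    lam    : list of node arrival rates lambda_j
    mu     : list of unit service rates mu_i
    t      : response-time matrix; t[i][j]
    Post-decision states are indexed by the integer encoding of the status tuple.
    """
    N, J = len(mu), len(lam)
    lam_total = sum(lam)
    n_states = 2 ** N
    P = np.zeros((n_states, n_states))
    c = np.zeros(n_states)

    for sx in range(n_states):
        status = [(sx >> i) & 1 for i in range(N)]
        busy = [i for i in range(N) if status[i] == 1]
        gamma = lam_total + sum(mu[i] for i in busy)  # denominator in (13) and (15)

        # Service completions: a busy unit l becomes available (s_x - I_l).
        for l in busy:
            P[sx, sx ^ (1 << l)] += mu[l] / gamma

        # Call arrivals: the policy dispatches some available unit l (s_x + I_l).
        for j in range(J):
            if len(busy) == N:
                # All units busy: call served outside the system (our assumption).
                P[sx, sx] += lam[j] / gamma
            else:
                l = policy[(j, tuple(status))]
                P[sx, sx | (1 << l)] += lam[j] / gamma
                c[sx] += lam[j] * t[l][j] / gamma   # contribution to Eq. (15)
    return c, P
```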

5 Temporal Difference Learning with Post-Decision States

The policy iteration method with post-decision states is also guaranteed to find the optimal policy while reducing the number of equations from $(J+1)2^{N}$ to $2^{N}$. However, the exponential complexity still limits its application to small or medium-sized problems. In this section, we present a temporal difference learning method that uses value function approximation based on the formulation with post-decision states, which reduces the dimension of the problem.

Let $\phi_{[p]}:S_{x}\mapsto\mathbb{R}$, $p=1,2,\cdots,P$, be the basis functions of the post-decision states, and let $r=\{r_{[p]}:p=1,2,\cdots,P\}$ be the tunable parameters. The value function approximation is given by $\tilde{J}(s_{x},r)$, where

$\tilde{J}(s_{x},r)=\sum_{p=1}^{P}r_{[p]}\phi_{[p]}(s_{x}).$ (21)

Let $\tilde{J}(r)$ be the vector of approximate state values of all states given the parameter vector $r$, and let $\Phi$ be a $2^{N}\times P$ matrix whose $p$th column equals the vector of $\phi_{[p]}$ evaluated at all states in $S_{x}$. The vector form of the above equation is

$\tilde{J}(r)=\Phi r.$ (22)

Define $\{x_{t}\mid t=0,1,\ldots\}$ as the Markov chain on the post-decision state space $S_{x}$ with transition matrix $P_{x}^{\pi}$.

Lemma 3.

The Markov chain corresponding to $P_{x}^{\pi}$ is irreducible and has a unique stationary distribution.

The proof of the above lemma is straightforward: the Markov chain on the post-decision state space forms a hypercube loss model, whose properties can be found in Larson (1974). We define the temporal difference $d_{t}$ as

$d_{t}=c(x_{t})-\frac{\mu_{t}}{2}+\tilde{J}(x_{t+1},r_{t})-\tilde{J}(x_{t},r_{t}),$ (23)

where $c(x_{t})-\frac{\mu_{t}}{2}+\tilde{J}(x_{t+1},r_{t})$ is the differential cost at state $x_{t}$ based on the one-step bootstrap and $\tilde{J}(x_{t},r_{t})$ is the old approximate differential cost at state $x_{t}$. Define $r_{t}$ as the parameter vector at time $t$. We update the parameter vector $r$ using this temporal difference by

$r_{t+1}=r_{t}+\gamma_{t}d_{t}\phi(x_{t}),$ (24)
$\mu_{t+1}=(1-\gamma_{t})\mu_{t}+2\gamma_{t}c(x_{t}).$ (25)

We let $\gamma_{t}=\frac{a}{a+t}$, where $a\geq 1$ is a hyper-parameter that controls the learning speed. When $a=1$, $\gamma_{t}$ forms the harmonic series, which usually does not perform well empirically, as discussed in Powell (2007). A larger $a$ slows the learning at the beginning of the process, and choosing an appropriate value of $a$ depends on the nature of the problem.

In this paper, we present a simple way of defining the basis functions. We give an index $i_{x}$ to each post-decision state $s_{x}=B$, where $i_{x}=\sum_{i=1}^{N}2^{i-1}\mathbbm{1}_{B(i)=1}$. We let the basis function $\phi_{[p]}(s_{x})$ be

$\phi_{[p]}(s_{x})=\begin{cases}1,&\text{if }p=i_{x},\\0,&\text{if }p\not=i_{x}.\end{cases}$ (26)
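With this choice, the approximation is simply a lookup table over the $2^{N}$ post-decision states, since exactly one basis function is active in each state. An illustrative sketch (using zero-based bit positions, a relabeling of the index $i_{x}$ above):

```python
import numpy as np

def state_index(status):
    """Index of a post-decision state, reading the status tuple as bits."""
    return sum(b << i for i, b in enumerate(status))

def phi(status, N):
    """One-hot basis vector of Eq. (26): entry p equals 1 iff p == i_x."""
    vec = np.zeros(2 ** N)
    vec[state_index(status)] = 1.0
    return vec

# With this basis, J_tilde(s_x, r) = r[state_index(s_x)], i.e., a table lookup.
```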

The algorithm is described in Algorithm 3.

Algorithm 3 Policy Iteration with TD-Learning
1:  Pick a random policy $\pi_{0}$. Set $k=0$.
2:  Specify $T$ and $K$.
3:  while $k\leq K$ do
4:     Set $t=0$. Initialize $r_{0}$ and $\mu_{0}$.
5:     Starting from a random state $x_{0}$, generate a state trajectory $\{x_{t}\mid t=0,1,\ldots,T\}$ of the Markov chain with transition probability matrix $P_{x}^{\pi_{k}}$ defined by the policy $\pi_{k}$.
6:     for $t=0$ to $T$ do
7:        Calculate the temporal difference $d_{t}$ by
          $d_{t}=c(x_{t})-\frac{\mu_{t}}{2}+\tilde{J}(x_{t+1},r_{t})-\tilde{J}(x_{t},r_{t}).$
8:        Update the parameters by
          $r_{t+1}=r_{t}+\gamma_{t}d_{t}\phi(x_{t}),$
          $\mu_{t+1}=(1-\gamma_{t})\mu_{t}+2\gamma_{t}c(x_{t}).$
9:     end for
10:     For each state $s\in S$, update the policy
          $\pi_{k+1}(s)=\arg\min_{a\in A_{s}}t_{aj}+\tilde{J}(s_{x}=B+I_{a},r_{T}).$
11:     $k=k+1$
12:  end while
13:  Output: Policy $\tilde{\pi}^{*}=\pi_{K}$
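As an illustration of the inner loop of Algorithm 3 (steps 4-9) under the one-hot basis, where the parameter vector $r$ is just a table of post-decision state values, the sketch below draws the trajectory directly from the post-decision Markov chain and assumes $c_{x}^{\pi_{k}}$ and $P_{x}^{\pi_{k}}$ have been precomputed (for example with the construction sketched after Algorithm 2):

```python
import numpy as np

def td_inner_loop(P, c, T=200_000, a=100.0, seed=0):
    """One TD-learning pass for a fixed policy pi_k (steps 4-9 of Algorithm 3).

    P : (n, n) numpy array, post-decision transition matrix P_x^pi
    c : (n,)   numpy array, post-decision costs c^pi(s_x)
    a : step-size hyper-parameter; gamma_t = a / (a + t)
    Returns the learned value table r and the average-cost estimate mu.
    """
    rng = np.random.default_rng(seed)
    n = len(c)
    r = np.zeros(n)          # approximate differential values J_tilde(s_x)
    mu = 0.0                 # estimate of the average cost per call, mu_x^pi
    x = rng.integers(n)      # random starting state x_0
    for t in range(T):
        x_next = rng.choice(n, p=P[x])
        gamma = a / (a + t)
        d = c[x] - mu / 2.0 + r[x_next] - r[x]      # temporal difference (23)
        r[x] += gamma * d                           # update (24); phi(x_t) is one-hot
        mu = (1 - gamma) * mu + 2 * gamma * c[x]    # update (25)
        x = x_next
    return r, mu
```

The outer loop of Algorithm 3 then recomputes the policy via (20) using the learned table and repeats.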
Theorem 2.

Algorithm 3 has the following properties:

  1. It converges with probability 1.

  2. The limit of the sequence $\frac{\mu_{t}}{2}$ at the $k$th iteration of the algorithm is the average cost $\frac{\mu_{x}^{\pi_{k}}}{2}$, i.e.,

     $\lim_{t\rightarrow\infty}\mu_{t}=\mu_{x}^{\pi_{k}}.$ (27)

  3. The limit of the sequence $r_{t}$ at the $k$th iteration of the algorithm, denoted by $r^{k*}$, is the unique solution of the equation

     $T\left(\Phi r^{k*}\right)=\Phi r^{k*},$ (28)

     where $T:\mathbb{R}^{2^{N}}\mapsto\mathbb{R}^{2^{N}}$ is the operator defined by

     $TJ=c_{x}^{\pi_{k}}-\frac{\mu^{\pi_{k}}_{x}e}{2}+P_{x}^{\pi_{k}}J.$ (29)

The proof of the above theorem follows from Lemma 3 and the fact that the basis functions $\phi(s_{x})$ are linearly independent across states. It is also necessary that the step-size sequence $\gamma_{t}$ be positive, deterministic, and satisfy $\sum_{t=0}^{\infty}\gamma_{t}=\infty$ and $\sum_{t=0}^{\infty}\gamma_{t}^{2}<\infty$. A detailed proof is given in Tsitsiklis and Van Roy (1999).

Theorem 2 guarantees that the TD-learning algorithm with post-decision states shown in Algorithm 3 always returns the optimal policy if $T$ is large enough. When $T$ is moderately large, it is sufficient to obtain a policy that is close to optimal, as we will show in the next section. Our algorithm avoids solving the complex Bellman equation, which has exponential complexity. Once the cost vector $c_{x}^{\pi}$ and transition matrix $P_{x}^{\pi}$ are calculated, we easily obtain the temporal differences $d_{t}$ by Monte Carlo simulation and evaluate the value functions needed for policy improvement in the policy iteration algorithm. One can also leverage parallel computing to generate multiple sample paths at the same time, which further reduces the computation time.

6 Numerical Results

In this section, we present numerical results comparing the policy obtained from the TD-learning method to the myopic policy that always dispatches the closest available unit, for systems with $N=5$, $10$, and $15$ units. We create an artificial region partitioned into $J=30$ demand nodes. We randomly locate units in the region and obtain the corresponding response times from each unit to each demand node.

The myopic policy $\pi^{m}$ is obtained by choosing, in each state $s\in S$, the available unit that results in the shortest response time. We solve for the exact mean response time of the myopic policy using the hypercube queuing model developed by Larson (1974) for the cases of $N=5$ and $10$ units. For $N=15$, the hypercube model becomes computationally costly due to its exponential complexity, so we obtain the approximate mean response time using the approximation method developed by Larson (1975), which has been shown to be very close to the exact value. An alternative is to estimate the mean response time via simulation, which could be more accurate but takes longer.

The policy from the proposed TD-learning method with post-decision states is obtained by running the algorithm for 25 iterations. We perform a roll-out with 200,000 state transitions in each iteration and update the parameter vector $r$ using the temporal differences $d$. We record the sample average response time in each iteration, and the results are shown in Figures 2, 3, and 4, respectively.

Figure 2: Mean response time comparison for $N=5$ units.
Figure 3: Mean response time comparison for $N=10$ units.
Figure 4: Mean response time comparison for $N=15$ units.
Figure 5: State value updates in one TD-learning iteration.

From the three experiments, we observe that the TD-learning algorithm converges quickly in all cases, as expected. We show the updates of the post-decision state values $\tilde{J}$ in one TD-learning iteration for the case of $N=5$ in Figure 5. The resulting policies in all three cases outperform the myopic policy that always dispatches the closest available unit. We also observe that our algorithm obtains a superior policy fairly quickly, in about three iterations. In the case of $N=15$, solving the Bellman equation requires solving a system with $31\times 2^{15}=1{,}015{,}808$ states, which is very computationally costly, if not infeasible, on most modern computers. In contrast, our method obtains a good policy in less than two minutes in this case, and it applies to systems of virtually any size, as guaranteed by its theoretical properties.

Note that the policies obtained from our algorithm reduce the response time by an average of three seconds with no additional resources. This suggests that myopic policies unnecessarily waste critical resources, and that slightly more intelligent policies can provide improved performance at little to no cost.

7 Conclusion

In this paper, we model the ambulance dispatch problem as an average-cost Markov decision process and aim to find the optimal dispatch policy that minimizes the mean response time. The standard MDP formulation has a state space of size $(J+1)\cdot 2^{N}$, where $J$ is the number of demand nodes that partition the region and $N$ is the total number of ambulances in the system. Policy iteration is able to find the optimal policy. We propose an alternative MDP formulation that uses post-decision states and reduces the state space to $2^{N}$, and we show that this formulation is mathematically equivalent to the original MDP formulation.

Even though the state space is reduced in the new formulation, the curse of dimensionality still restricts its application to small problems. We therefore present a TD-learning algorithm based on the post-decision states that is guaranteed to converge to the optimal solution. In our numerical experiments, we show that the TD-learning algorithm with post-decision states converges quickly and that the policies obtained from our method outperform the myopic policy that always dispatches the closest available unit in all cases.

References

  • Bandara et al. [2014] Damitha Bandara, Maria E Mayorga, and Laura A McLay. Priority dispatching strategies for EMS systems. Journal of the Operational Research Society, 65(4):572–587, 2014.
  • Carter et al. [1972] Grace M Carter, Jan M Chaiken, and Edward Ignall. Response areas for two emergency units. Operations Research, 20(3):571–594, 1972.
  • Daskin [1983] Mark S Daskin. A maximum expected covering location model: formulation, properties and heuristic solution. Transportation Science, 17(1):48–70, 1983.
  • Evarts [2019] Ben Evarts. Fire loss in the United States during 2018. NFPA National Fire Protection Association, Quincy, 2019.
  • Jagtenberg et al. [2015] Caroline J Jagtenberg, Sandjai Bhulai, and Robert D van der Mei. An efficient heuristic for real-time ambulance redeployment. Operations Research for Health Care, 4:27–35, 2015.
  • Jagtenberg et al. [2017a] Caroline J Jagtenberg, Sandjai Bhulai, and Robert D van der Mei. Dynamic ambulance dispatching: is the closest-idle policy always optimal? Health Care Management Science, 20(4):517–531, 2017.
  • Jagtenberg et al. [2017b] Caroline J Jagtenberg, Pieter L van den Berg, and Robert D van der Mei. Benchmarking online dispatch algorithms for emergency medical services. European Journal of Operational Research, 258(2):715–725, 2017.
  • Jarvis [1975] James Patrick Jarvis. Optimization in stochastic service systems with distinguishable servers. PhD thesis, Massachusetts Institute of Technology, 1975.
  • Larson [1974] Richard C Larson. A hypercube queuing model for facility location and redistricting in urban emergency services. Computers & Operations Research, 1(1):67–95, 1974.
  • Larson [1975] Richard C. Larson. Approximating the performance of urban emergency service systems. Operations Research, 23(5):845–868, 1975.
  • Maxwell et al. [2010] Matthew S Maxwell, Mateo Restrepo, Shane G Henderson, and Huseyin Topaloglu. Approximate dynamic programming for ambulance redeployment. INFORMS Journal on Computing, 22(2):266–281, 2010.
  • McLay and Mayorga [2013a] Laura A McLay and Maria E Mayorga. A dispatching model for server-to-customer systems that balances efficiency and equity. Manufacturing & Service Operations Management, 15(2):205–220, 2013.
  • McLay and Mayorga [2013b] Laura A McLay and Maria E Mayorga. A model for optimally dispatching ambulances to emergency calls with classification errors in patient priorities. IIE Transactions, 45(1):1–24, 2013.
  • Nasrollahzadeh et al. [2018] Amir Ali Nasrollahzadeh, Amin Khademi, and Maria E Mayorga. Real-time ambulance dispatching and relocation. Manufacturing & Service Operations Management, 20(3):467–480, 2018.
  • Powell [2007] Warren B Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons, 2007.
  • Powell [2010] Warren B Powell. Merging AI and OR to solve high-dimensional stochastic optimization problems using approximate dynamic programming. INFORMS Journal on Computing, 22(1):2–17, 2010.
  • Schmid [2012] Verena Schmid. Solving the dynamic ambulance relocation and dispatching problem using approximate dynamic programming. European Journal of Operational Research, 219(3):611–621, 2012.
  • Tsitsiklis and Van Roy [1999] John N Tsitsiklis and Benjamin Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.