
Optimal Dispatch in Emergency Service System via Reinforcement Learning

Cheng Hua    Tauhid Zaman
School of Management, Yale University
{cheng.hua, tauhid.zaman}@yale.edu
Abstract

In the United States, medical responses by fire departments have increased by 367% over the last four decades. This has made it critical for decision makers in emergency response departments to use existing resources efficiently. In this paper, we model the ambulance dispatch problem as an average-cost Markov decision process and present a policy iteration approach to find an optimal dispatch policy. We then propose an alternative formulation using post-decision states that is shown to be mathematically equivalent to the original model, but with a much smaller state space. We present a temporal difference learning approach to the dispatch problem based on the post-decision states. In our numerical experiments, we show that the obtained temporal-difference policy outperforms the benchmark myopic policy. Our findings suggest that emergency response departments can improve their performance at minimal to no cost.

1 Introduction

In the United States, medical responses by fire departments have increased by 367% over the last four decades (Figure 1), as reported by Evarts (2019). The reasons for this dramatic increase in medical calls include the aging population and health insurance that covers most of the ambulance costs (Boston Globe, Nov 29, 2015). Cities with tight budgets are short of response units to respond to the growing number of medical calls in time. NBC10 in Philadelphia reported on Feb 28, 2019 that two thirds of the emergency medical calls had an ambulance response time of more than nine minutes. Thus, how to efficiently use existing resources has become an important question for decision makers in emergency response departments.

Figure 1: Annual emergency medical service (EMS) incidents from 1980 to 2018.

In current practice, cities dispatch ambulance units according to a fixed dispatch rule that always sends the closest available unit. This dispatch policy is myopic, as it only considers the current call and ignores the impact of the dispatch decision on future calls. In this paper, we model the ambulance dispatch problem as an average-cost Markov decision process (MDP). We find a dispatch policy that minimizes the mean response time using reinforcement learning. We propose an alternative MDP formulation for the problem using post-decision states that we show is mathematically equivalent to the original MDP formulation, but with a much smaller state space.

Due to the curse of dimensionality, the application of both formulations is restricted to small problems. To solve larger problems, we use temporal difference learning (TD-learning) with the post-decision states. In our numerical experiments, we show that the TD-learning algorithm converges quickly and that the policies obtained from our method outperform the myopic policy that always dispatches the closest available unit.

The remainder of this paper is organized as follows. In Section 2, we provide a review of the relevant literature. In Section 3, we present the Markov decision process formulation. Section 4 presents the formulation using post-decision states which reduces the state space of the original formulation. In Section 5, we present the temporal difference learning algorithm that is based on the post-decision states and its theoretical properties, while in Section 6, we show the performance of the algorithm in numerical experiments. We conclude the paper in Section 7.

2 Literature Review

There are a number of papers that study dispatch policies for emergency medical services (EMS) in different settings. The optimal dispatch rule was first studied in Carter et al. (1972). The authors studied a simple system of two units and provided a closed form solution that determines the response areas to be served by each unit to minimize average response time. However, such a closed form solution no longer exists in a system that has more than two units, and finding the optimal dispatch rule has been an important topic.

Bandara et al. (2014) developed a simulation model to evaluate priority dispatching strategies for ambulance systems with calls of different priorities, with the goal of improving patient survival probability. The optimal policy was found via full enumeration using a simulation model, and the authors found that dispatching the closest available unit is not always optimal in this setting. The limitation of their method is its exponential computational complexity; a system with more than four units already takes a significant amount of time. McLay and Mayorga (2013b) developed a Markov decision process to optimally dispatch ambulances to emergency calls with classification errors in identifying priorities, so as to maximize the expected coverage of true high-risk patients. They compared the optimal policy to the myopic policy that sends the closest unit and found that the optimal policy increases the expected utility. The authors applied the value iteration algorithm to find the optimal solution, but this solution method is also limited to small problems. They extended this work in McLay and Mayorga (2013a) to consider an optimal policy that balances equity and efficiency by adding four equity constraints.

Jagtenberg et al. (2017a) studied whether dispatching the closest available unit is optimal in a dynamic ambulance dispatching problem based on a Markov decision process. The problem was discretized in time using one-minute intervals. Two objectives were considered: the mean response time and the fraction of calls with response time beyond a certain threshold. The value iteration method was used to find the optimal solution. A heuristic that leverages the dynamic maximum expected covering location problem (MEXCLP) was proposed by Jagtenberg et al. (2015); the MEXCLP model was introduced by Daskin (1983) to determine the best unit locations, using expected covered demand as a proxy objective rather than response time. They found an improvement of 18% in the fraction of late arrivals using the optimal policy compared to the myopic policy, which provides insight into the value of deviating from the myopic policy. Jagtenberg et al. (2017b) empirically provided a bound for the gap between existing solutions and the optimal solution.

The aforementioned papers use enumeration or MDP models to analyze dispatch policies. While they are guaranteed to find the optimal policy, their computational complexity limits their use to small systems due to the curse of dimensionality; for larger problems they become computationally intractable. Thus, many works have tried to resolve this issue using approximate dynamic programming (ADP). A comprehensive introduction to the application of approximate dynamic programming in the operations research field can be found in Powell (2010) and Powell (2007).

Schmid (2012) followed the formulation introduced in Powell (2010) and applied ADP with aggregation-based function approximation to an ambulance relocation and dispatching problem to reduce the mean response time. A seminal paper by Maxwell et al. (2010) applied ADP to the ambulance redeployment problem, using Monte Carlo simulation with a one-step bootstrap to estimate complex expectations and least squares regression with linear function approximation to learn the approximate value functions. Nasrollahzadeh et al. (2018) studied the problem of real-time ambulance dispatching and relocation, which is formulated as a stochastic dynamic program and solved using ADP. Their formulation and method are the same as those proposed in Maxwell et al. (2010). One issue with this approach is that, while it usually beats the benchmark policy, it rarely yields the optimal policy. Moreover, the learning method is not guaranteed to converge, and finding useful basis functions for the approximation is more of an art than a science, requiring domain knowledge and testing.

Our paper differs from the existing work in the following three aspects: 1) We model the problem as an average-cost MDP, which solves for the optimal policy that minimizes the mean response time. The papers mentioned above use simpler discounted-reward schemes, which minimize the sum of discounted response times as a surrogate for the mean response time; however, discounted response times are more difficult to interpret than the mean response time. 2) We propose an alternative formulation with post-decision states and show its mathematical equivalence to the original problem formulation. 3) We show that the proposed TD-learning algorithm based on post-decision states is guaranteed to converge to the optimal solution, whereas few or no theoretical guarantees exist for the aforementioned ADP-based methods.

3 Markov Decision Process Formulation

Consider a geographical region $R\subset\mathbb{R}^{2}$ that is served by a total of $N$ emergency units, all of the same type, e.g., ambulance units. Calls arrive in region $R$ according to a Poisson point process with arrival intensity $\{\Lambda(x,y):(x,y)\in R\}$ at location $(x,y)$. We partition the region $R$ into $J$ sub-regions and associate a center of mass with each sub-region $R_{j}\subset R$, which is also referred to as demand node $j$. Note that $J$ can be as large as needed. Denote the total call rate in node $j$ as

$\lambda_{j}=\int_{R_{j}}\Lambda(x,y)\,dx\,dy.$ (1)

The overall call rate in region $R$ is denoted by

$\lambda=\sum_{j}\lambda_{j}=\int_{R}\Lambda(x,y)\,dx\,dy.$ (2)

We assume that the service time of unit $i$ follows a negative exponential distribution with rate $\mu_{i}$, where the service time includes the turnout time, travel time, and time at the scene. The justification for this assumption is that travel time is usually a small portion of the total service time. For systems with longer travel times, Jarvis (1975) described a mean service time calibration that adjusts the service time so as to maintain the Markov property.

Let $b_{i}$ represent the state of unit $i$, where $b_{i}=0$ if the unit is available and $b_{i}=1$ if the unit is busy. We denote the joint status of all units by an $N$-dimensional vector $B=\{b_{N}\cdots b_{1}\}\in\mathscr{B}$, written in backward order similar to the representation of binary numbers. We write $B(i)=b_{i}$ for the status of unit $i$. Note that $|\mathscr{B}|=2^{N}$.
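Since $B$ is read like a binary number, each status vector maps to an integer index, which is convenient for storing quantities indexed by $\mathscr{B}$. The following minimal Python sketch is purely illustrative (the helper names and the zero-based bit positions are ours, not part of the model):

```python
from itertools import product

N = 3  # number of units (illustrative)

def encode(B):
    """Map a status tuple (b_1, ..., b_N) to an integer index by treating
    unit i as bit i-1, mirroring the binary-number ordering of B."""
    return sum(b << (i - 1) for i, b in enumerate(B, start=1))

def decode(index, N):
    """Recover the status tuple (b_1, ..., b_N) from an integer index."""
    return tuple((index >> (i - 1)) & 1 for i in range(1, N + 1))

# Enumerate the full status space: |B| = 2^N vectors.
all_statuses = [tuple(bits) for bits in product((0, 1), repeat=N)]
assert len(all_statuses) == 2 ** N
```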

If all units are busy when a call is received, we assume that it is handled by a unit outside of the system, such as a private ambulance company or a unit from a neighboring jurisdiction, which is a common mutual aid policy.

Define $t_{ij}$ as the average response time from the base of unit $i$ to node $j$. In this paper, we aim to find the optimal dispatch policy that minimizes the average response time of all served calls.

3.1 State Space

Define $S$ as the state space and $s\in S$ as the state of the system. We have

$s=(j,B),$ (3)

which is a tuple of $j$ and $B$, consisting of the location of the current call and the statuses of all units (available or busy) at the time of the arrival. We denote $s(0)=j$ and $s(1)=B$ for state $s$. The entire state space has size $|S|=J\times 2^{N}$.

3.2 Action Space

When a call is received, we must decide which unit to dispatch to it. An action in this problem is to dispatch a particular unit upon receiving a call, so the action space is given as

$A=\{1,2,\cdots,N\}.$ (4)

Note that only an available unit may be dispatched. We define $A_{s}\subset A$ as the set of feasible actions in state $s$, where

$A_{s}=\{i:B(i)=0,\ i\in\{1,2,\cdots,N\}\}.$ (5)

We define $a\in A_{s}$ as an action from the feasible action space.

3.3 Policy Space

We define the policy space as $\Pi$, the set of all feasible policies. A policy $\pi\in\Pi$ is a mapping from the state space to the action space, $\pi:S\rightarrow A$. Specifically, $\pi(s)=a$ with $a\in A_{s}$. The optimal policy $\pi^{*}\in\Pi$ is the policy that minimizes the average cost over all time steps. Our goal is to find this optimal policy.

We use a benchmark policy that sends the closest available unit, denoted by $\pi^{m}\in\Pi$:

$\pi^{m}(s)=\arg\min_{i\in A_{s}}t_{ij},\qquad\forall s\in S.$ (6)

Sending the closest available unit is a policy widely used in emergency service systems. This policy is myopic as it does not consider potential future calls: reserving the closest unit for a future call, rather than dispatching it to the current one, might greatly reduce the expected response time of that future call.
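For concreteness, the myopic rule in (6) can be written in a few lines. The sketch below is illustrative only; it assumes zero-based unit indices and a response-time matrix `t` with `t[i][j]` giving the time from unit $i$'s base to node $j$:

```python
def myopic_dispatch(j, B, t):
    """Return the closest available unit for a call at demand node j.

    j : index of the demand node of the current call
    B : tuple of unit statuses; B[i] == 0 means unit i is available
    t : response-time matrix; t[i][j] is the time from unit i's base to node j
    """
    available = [i for i, b in enumerate(B) if b == 0]
    if not available:
        return None  # all units busy: the call is handled outside the system
    return min(available, key=lambda i: t[i][j])
```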

3.4 Costs

Define $c^{\pi}(s)$ as the cost of being in state $s$ under policy $\pi$, which equals the response time

$c^{\pi}(s)=t_{ij}$ (7)

when the call location in state $s$ is $j$, i.e., $s(0)=j$, and the policy dispatches unit $i$ in state $s$, i.e., $\pi(s)=i$.

3.5 Transition Probabilities with Augmented Transitions

Define $p^{\pi}(s,s^{\prime})$ as the transition probability from state $s=(j,B)$ to state $s^{\prime}=(j^{\prime},B^{\prime})$ under policy $\pi$. In determining the transition probabilities, we consider an augmented transition in which a unit completes a service and no dispatch action is needed. This is because the number of services completed between two arrivals is a random variable whose distribution is complicated to compute. Introducing the augmented transition reduces the number of transition possibilities. Denote by $I_{i}$ the vector of all 0's except for the $i$th element, which is 1. The transition probability $p^{\pi}(s,s^{\prime})$ with augmented transitions is given as

$p^{\pi}(s,s^{\prime})=\begin{cases}\dfrac{\lambda_{j^{\prime}}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s^{\prime}=(j^{\prime},B+I_{i}),\\[1ex]\dfrac{\mu_{l}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s^{\prime}=(\emptyset,B+I_{i}-I_{l}),\end{cases}$ (8)

where the top expression corresponds to the transition from state $s$ to state $s^{\prime}$ when action $\pi(s)=i$ is taken and a new call arrives at node $j^{\prime}$, and the bottom expression corresponds to the augmented transition in which no call arrives but a busy unit $l\in A\setminus A_{s}$ completes its current service. No action is needed in this transition since no call arrives.
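To make (8) concrete, the successor states and their probabilities for a given state $s=(j,B)$ and dispatch decision $i$ can be enumerated as follows. This is an illustrative sketch with zero-based unit indices, and `None` stands in for the empty call location $\emptyset$ of the augmented states:

```python
def transitions(j, B, i, lam, mu, lam_total):
    """Enumerate successor states and probabilities after dispatching
    available unit i in state s = (j, B), following Eq. (8).

    lam       : list of node arrival rates lambda_j
    mu        : list of unit service rates mu_i
    lam_total : overall arrival rate lambda = sum(lam)
    Returns a list of ((call_node, status), probability) pairs, where
    call_node is None for the augmented (service-completion) states.
    """
    B_after = list(B)
    B_after[i] = 1                    # the dispatched unit becomes busy
    busy = [k for k, b in enumerate(B_after) if b == 1]
    denom = lam_total + sum(mu[k] for k in busy)  # lambda + sum of mu over busy units

    succ = []
    # Top case of Eq. (8): a new call arrives at node j' before any completion.
    for j_next, lam_j in enumerate(lam):
        succ.append(((j_next, tuple(B_after)), lam_j / denom))
    # Bottom case: augmented transition, a busy unit l completes its service.
    for l in busy:
        B_free = list(B_after)
        B_free[l] = 0
        succ.append(((None, tuple(B_free)), mu[l] / denom))
    return succ
```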

3.6 Bellman’s Equation

Define $V^{\pi}:S^{+}\mapsto\mathbb{R}$ as the value function of the MDP under policy $\pi$, where the value of state $s$ is $V^{\pi}(s)$ and $S^{+}$ is the augmented state space, which has size $|S^{+}|=(J+1)2^{N}$. Let $\mu^{\pi}$ be the average cost under policy $\pi$. The Bellman equation for the average-cost problem is

$V^{\pi}(s)=c^{\pi}(s)-\frac{\mu^{\pi}}{2}+\sum_{s^{\prime}\in S^{+}}p^{\pi}(s,s^{\prime})V^{\pi}(s^{\prime}),\quad\forall s\in S^{+}.$ (9)

Note that the $1/2$ in the above equation is due to the augmented transitions: a transition caused by a service completion has zero cost, and the number of service completions always equals the number of calls served, so $\mu^{\pi}/2$ is the average cost per transition while $\mu^{\pi}$ is the average cost per served call.

Define $V^{\pi}$ as the vector of all state values, $c^{\pi}$ as the vector of all state costs, $P^{\pi}$ as the transition matrix, and $e$ as the vector of all ones. The vector form of the Bellman equation is

$V^{\pi}=c^{\pi}-\frac{\mu^{\pi}e}{2}+P^{\pi}V^{\pi}.$ (10)

The solution to the above Bellman equation is not unique. Instead, the set of all value functions takes the form $\{V^{\pi}+me\mid m\in\mathbb{R}\}$. Since shifting the value function by a constant does not change the relative differences between state values, once we obtain a set of state values $V^{\pi}$, the policy can be updated as

$\pi^{\prime}(s)=\arg\min_{a\in A_{s}}t_{aj}+\sum_{s^{\prime}\in S^{+}}p(s,s^{\prime}|a)V^{\pi}(s^{\prime}),$ (11)

where $p(s,s^{\prime}|a)$ is the one-step transition probability when taking action $a$ instead of following policy $\pi$. This policy iteration method for solving the MDP is summarized in Algorithm 1.

Algorithm 1 Policy Iteration Method
1:  Pick a random policy $\pi_{0}$. Set $k=0$.
2:  while $\pi_{k}\not=\pi_{k+1}$ do
3:     Compute the cost $c^{\pi_{k}}$ and transition matrix $P^{\pi_{k}}$.
4:     Policy Evaluation: Solve for the state values $V^{\pi_{k}}$ from the Bellman equation (10).
5:     Policy Improvement: For each state $s\in S$, update the action by
       $\pi_{k+1}(s)=\arg\min_{a\in A_{s}}t_{aj}+\sum_{s^{\prime}\in S^{+}}p(s,s^{\prime}|a)V^{\pi_{k}}(s^{\prime}).$
6:     $k=k+1$
7:  end while
8:  Output: Optimal Policy $\pi^{*}=\pi_{k}$
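The policy evaluation step (line 4 of Algorithm 1) amounts to solving the linear system (10), which determines $V^{\pi_{k}}$ only up to an additive constant together with the unknown average cost. A minimal numpy sketch, assuming the transition matrix and cost vector have already been assembled (pinning one state value to zero is our implementation choice, not prescribed by the paper):

```python
import numpy as np

def evaluate_policy(P, c):
    """Solve the average-cost Bellman equation (10): V = c - (mu/2) e + P V.

    The solution is unique only up to an additive constant, so we pin
    V[0] = 0 and solve for the remaining values together with mu.
    P : (n, n) numpy array, transition matrix under the current policy
    c : (n,)   numpy array, cost (response time) of each state
    """
    n = len(c)
    I = np.eye(n)
    # Unknowns: V[1], ..., V[n-1] and mu, with V[0] fixed to 0.
    M = np.hstack([(I - P)[:, 1:], 0.5 * np.ones((n, 1))])
    x = np.linalg.solve(M, c)
    V = np.concatenate([[0.0], x[:-1]])
    mu = x[-1]
    return V, mu
```

The improvement step (line 5) then applies (11) state by state, evaluating the one-step expectation for each feasible action, for example with the transition enumeration sketched in Section 3.5.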

4 Post-Decision State Formulation

The policy iteration method is guaranteed to converge to the optimal dispatch policy that minimizes the average response time, but it requires repeatedly solving a linear system with $(J+1)2^{N}$ states. In this section, by exploiting the structure of the state transitions of the dispatch problem, we introduce the notion of post-decision states and use them as the new states of our problem. We show that the MDP formulation using post-decision states reduces the state space to $2^{N}$ while still guaranteeing that the optimal dispatch policy is found. In the next section, we develop the temporal difference learning method based on this formulation with post-decision states.

In the original formulation, a state is a tuple of the call location and the statuses of all units, $s=(j,B)$. A post-decision state $s_{x}$ is the state the system is in immediately after observing state $s$ and taking action $a=\pi(s)$, before the next piece of random information, namely the location of the next call, arrives in the system. Thus, given state $s$ and unit $i$ being dispatched, i.e., $a=\pi(s)=i$, the post-decision state $s_{x}$ is

$s_{x}=B+I_{i}.$ (12)

Note that by defining the post-decision state this way, we only need information about the statuses of all units. Define $S_{x}$ as the post-decision state space; we have $|S_{x}|=2^{N}$. Indeed, $S_{x}=\mathscr{B}$.
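In the status-vector encoding used earlier, forming the post-decision state simply marks the dispatched unit as busy; an illustrative helper:

```python
def post_decision_state(B, i):
    """Post-decision state s_x = B + I_i: mark the dispatched unit i as busy."""
    sx = list(B)
    sx[i] = 1
    return tuple(sx)
```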

Lemma 1.

Let $p_{x}^{\pi}(s_{x},s_{x}^{\prime})$ be the corresponding transition probability from $s_{x}$ to $s_{x}^{\prime}$. We have

$p_{x}^{\pi}(s_{x},s_{x}^{\prime})=\begin{cases}\dfrac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s_{x}^{\prime}=s_{x}+I_{l},\\[1ex]\dfrac{\mu_{l}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}},&\text{if }s_{x}^{\prime}=s_{x}-I_{l},\end{cases}$ (13)

where $\mathscr{R}_{l|s_{x}}^{\pi}$ is the set of demand nodes for which policy $\pi$ dispatches unit $l$, i.e.,

$\mathscr{R}_{l|s_{x}}^{\pi}=\{j:\pi(s^{\prime})=l,\ s^{\prime}=(j,s_{x})\}.$ (14)
Proof.

For $s_{x}^{\prime}=B+I_{i}-I_{l}$, where the transition happens because unit $l$ completes its current service, the post-decision state transition is the same as the transition from $s$ to the augmented state $s^{\prime}=(\emptyset,B+I_{i}-I_{l})$, since no call arrives and no action is needed in that state; thus $s^{\prime}=s_{x}^{\prime}$.

For $s_{x}^{\prime}=B+I_{i}+I_{l}$, where a call arrives in post-decision state $s_{x}$, we need to capture the randomness of the exogenous information, namely the location of the call that arrives in $s_{x}$. We thus have

$p_{x}^{\pi}(s_{x},s_{x}^{\prime})=\sum_{j}p^{\pi}\big(s,s^{\prime}=(j,s_{x})\big)\mathbbm{1}_{\{\pi(s^{\prime})=l\}}$
$=\sum_{j}p^{\pi}(s,s^{\prime})\mathbbm{1}_{\{s^{\prime}=(j,s_{x})\}}\mathbbm{1}_{\{\pi(s^{\prime})=l\}}$
$=\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}p^{\pi}\big(s,s^{\prime}=(j,s_{x})\big)$
$=\frac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}.$ ∎

Lemma 2.

The cost of post-decision state $s_{x}$, $c^{\pi}(s_{x})$, is given as

$c^{\pi}(s_{x})=\dfrac{\sum_{l\in A_{s}}\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}.$ (15)
Proof.

The cost of $s_{x}$ is the expected one-step transition cost from $s_{x}$ to $s_{x}^{\prime}$ under policy $\pi$. Let $c^{\pi}_{l}(s_{x})$ be the expected cost of dispatching unit $l$ in $s_{x}$. We have

$c^{\pi}_{l}(s_{x})=\begin{cases}\dfrac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}},&\text{if }l\in A_{s},\\[1ex]0,&\text{otherwise}.\end{cases}$ (16)

We thus have

$c^{\pi}(s_{x})=\sum_{l\in A_{s}}c^{\pi}_{l}(s_{x})\,p_{x}^{\pi}(s_{x},s_{x}+I_{l})$
$=\sum_{l\in A_{s}}\frac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}\cdot\frac{\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}$
$=\frac{\sum_{l\in A_{s}}\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\lambda+\sum_{k:B(k)=1}\mu_{k}+\mu_{i}}.$ ∎

Note that all components defining $c^{\pi}(s_{x})$ and $p_{x}^{\pi}(s_{x},s_{x}^{\prime})$ are known and can be computed beforehand. Define $J^{\pi}:S_{x}\mapsto\mathbb{R}$ as the value function on the post-decision state space $S_{x}$. Let $\mu_{x}^{\pi}$ be the average cost following policy $\pi$ in the post-decision state space. The Bellman equation is

$J^{\pi}(s_{x})=c^{\pi}(s_{x})-\frac{\mu_{x}^{\pi}}{2}+\sum_{s_{x}^{\prime}\in S_{x}}p_{x}^{\pi}(s_{x},s_{x}^{\prime})J^{\pi}(s_{x}^{\prime}),\quad\forall s_{x}\in S_{x}.$ (17)

Let $J^{\pi}$, $c_{x}^{\pi}$, and $P_{x}^{\pi}$ be the corresponding vector representations. The vector form of the Bellman equation around post-decision states is

$J^{\pi}=c_{x}^{\pi}-\frac{\mu_{x}^{\pi}e}{2}+P_{x}^{\pi}J^{\pi}.$ (18)
Theorem 1.

The MDP formulation around post-decision states is equivalent to the original formulation. In particular

  i) $\mu_{x}^{\pi}=\mu^{\pi}$;

  ii) For $s_{x}=B$, let $\Gamma=\lambda+\sum_{k:B(k)=1}\mu_{k}$. We have

     $J^{\pi}(s_{x})=\sum_{j}\frac{\lambda_{j}}{\Gamma}V^{\pi}\big(s^{(j)}\big)+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}V^{\pi}\big(s^{[k]}\big),$ (19)

     where $s^{(j)}=(j,B)$ and $s^{[k]}=(\emptyset,B-I_{k})$.

Proof.

Expanding the value functions $V^{\pi}$ in (19) using (9) and collecting terms, we have

$\sum_{j}\frac{\lambda_{j}}{\Gamma}V^{\pi}\big(s^{(j)}\big)+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}V^{\pi}\big(s^{[k]}\big)$
$=\sum_{j}\frac{\lambda_{j}}{\Gamma}c^{\pi}\big(s^{(j)}\big)-\sum_{j}\frac{\lambda_{j}}{\Gamma}\frac{\mu^{\pi}}{2}+\sum_{j}\frac{\lambda_{j}}{\Gamma}\sum_{s^{\prime}\in S}p^{\pi}(s^{(j)},s^{\prime})V^{\pi}(s^{\prime})$
$\quad+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\underbrace{c^{\pi}\big(s^{[k]}\big)}_{=0}-\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\frac{\mu^{\pi}}{2}+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\sum_{s^{\prime}\in S}p^{\pi}(s^{[k]},s^{\prime})V^{\pi}(s^{\prime})$
$=\underbrace{\frac{\sum_{l\in A_{s^{(j)}}}\sum_{j\in\mathscr{R}_{l|s_{x}}^{\pi}}\lambda_{j}t_{lj}}{\Gamma}}_{c^{\pi}(s_{x})}-\Big(\underbrace{\sum_{j}\frac{\lambda_{j}}{\Gamma}+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}}_{=1}\Big)\frac{\mu^{\pi}}{2}$
$\quad+\sum_{j}\frac{\lambda_{j}}{\Gamma}\Big[\frac{\sum_{j^{\prime}}\lambda_{j^{\prime}}}{\Gamma+\mu_{\pi(s^{(j)})}}V^{\pi}\big(s^{(j^{\prime})}\big)+\frac{\sum_{l:B(l)=1}\mu_{l}}{\Gamma+\mu_{\pi(s^{(j)})}}V^{\pi}\big(s^{[l]}\big)\Big]$
$\quad+\sum_{k:B(k)=1}\frac{\mu_{k}}{\Gamma}\Big[\frac{\sum_{j^{\prime}}\lambda_{j^{\prime}}}{\Gamma}V^{\pi}\big(s^{(j^{\prime})[k]}\big)+\frac{\mu_{k}}{\Gamma}\frac{\sum_{l:B(l)=1}\mu_{k}}{\Gamma}V^{\pi}\big(s^{[k][l]}\big)\Big]$
$=c^{\pi}(s_{x})-\frac{\mu^{\pi}}{2}+\sum_{s_{x}^{\prime}\in S_{x}}p_{x}^{\pi}(s_{x},s_{x}^{\prime})J^{\pi}(s_{x}^{\prime}).$

The last equality holds by noting that the first part of the summation corresponds to the transition in which a call arrives and a unit is dispatched, and the last part corresponds to the transition in which a service is completed and the unit returns to its base and becomes available. ∎

Under this formulation, the new policy $\pi^{\prime}$ for state $s=(j,B)$ is updated as

$\pi^{\prime}(s)=\arg\min_{a\in A_{s}}t_{aj}+J^{\pi}(s_{x}=B+I_{a}).$ (20)

We note that even though the state space is smaller when using post-decision states, the policy space remains the same. The new policy iteration around post-decision states is summarized in Algorithm 2.

Algorithm 2 Policy Iteration with Post-Decision States
1:  Pick a random policy $\pi_{0}$. Set $k=0$.
2:  while $\pi_{k}\not=\pi_{k+1}$ do
3:     Compute the cost $c_{x}^{\pi_{k}}$ and transition matrix $P_{x}^{\pi_{k}}$.
4:     Policy Evaluation: Solve for the state values $J^{\pi_{k}}$ from the Bellman equation (18).
5:     Policy Improvement: For each state $s\in S$, update the action by
       $\pi_{k+1}(s)=\arg\min_{a\in A_{s}}t_{aj}+J^{\pi_{k}}(s_{x}=B+I_{a}).$
6:     $k=k+1$
7:  end while
8:  Output: Optimal Policy $\pi^{*}=\pi_{k}$
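Step 3 of Algorithm 2 assembles $c_{x}^{\pi_{k}}$ and $P_{x}^{\pi_{k}}$ directly from (13) and (15). The sketch below is illustrative: the `policy` dictionary, the zero-based unit indices, and the treatment of the all-units-busy case (the call is served outside the system and the status vector is unchanged, a case not spelled out in Lemma 1) are our assumptions.

```python
import numpy as np

def build_post_decision_model(policy, lam, mu, t):
    """Assemble the post-decision cost vector and transition matrix.

    policy : dict mapping (j, status_tuple) -> index of the dispatched unit
    lam    : list of node arrival rates lambda_j
    mu     : list of unit service rates mu_i
    t      : response-time matrix; t[i][j]
    Post-decision states are indexed by the integer encoding of the status tuple.
    """
    N, J = len(mu), len(lam)
    lam_total = sum(lam)
    n_states = 2 ** N
    P = np.zeros((n_states, n_states))
    c = np.zeros(n_states)

    for sx in range(n_states):
        status = [(sx >> i) & 1 for i in range(N)]
        busy = [i for i in range(N) if status[i] == 1]
        gamma = lam_total + sum(mu[i] for i in busy)  # denominator in (13) and (15)

        # Service completions: a busy unit l becomes available (s_x - I_l).
        for l in busy:
            P[sx, sx ^ (1 << l)] += mu[l] / gamma

        # Call arrivals: the policy dispatches some available unit l (s_x + I_l).
        for j in range(J):
            if len(busy) == N:
                # All units busy: call served outside the system (our assumption).
                P[sx, sx] += lam[j] / gamma
            else:
                l = policy[(j, tuple(status))]
                P[sx, sx | (1 << l)] += lam[j] / gamma
                c[sx] += lam[j] * t[l][j] / gamma   # contribution to Eq. (15)
    return c, P
```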

5 Temporal Difference Learning with Post-Decision States

The policy iteration method with post-decision states is also guaranteed to find the optimal policy while reducing the number of equations from $(J+1)2^{N}$ to $2^{N}$. However, the exponential complexity still limits its application to small or medium-sized problems. In this section, we present a temporal difference learning method that uses value function approximation based on the formulation with post-decision states, which reduces the dimension of the problem.

Let $\phi_{[p]}:S_{x}\mapsto\mathbb{R}$, $p=1,2,\cdots,P$, be the basis functions of the post-decision states, and let $r=\{r_{[p]}:p=1,2,\cdots,P\}$ be the tunable parameters. The value function approximation is given by $\tilde{J}(s_{x},r)$, where

$\tilde{J}(s_{x},r)=\sum_{p=1}^{P}r_{[p]}\phi_{[p]}(s_{x}).$ (21)

Let $\tilde{J}(r)$ be the vector of approximate state values of all states given the parameter vector $r$, and let $\Phi$ be a $2^{N}\times P$ matrix whose $p$th column equals the vector of $\phi_{[p]}$ evaluated at all states in $S_{x}$. The vector form of the above equation is

$\tilde{J}(r)=\Phi r.$ (22)

Define $\{x_{t}\mid t=0,1,\ldots\}$ as the Markov chain on the post-decision state space $S_{x}$ with transition matrix $P_{x}^{\pi}$.

Lemma 3.

The Markov chain corresponding to $P_{x}^{\pi}$ is irreducible and has a unique stationary distribution.

The proof of the above lemma is straightforward: the Markov chain on the post-decision state space forms a hypercube loss model, whose properties can be found in Larson (1974). We define the temporal difference $d_{t}$ as

$d_{t}=c(x_{t})-\frac{\mu_{t}}{2}+\tilde{J}(x_{t+1},r_{t})-\tilde{J}(x_{t},r_{t}),$ (23)

where $c(x_{t})-\frac{\mu_{t}}{2}+\tilde{J}(x_{t+1},r_{t})$ is the differential cost at state $x_{t}$ based on the one-step bootstrap and $\tilde{J}(x_{t},r_{t})$ is the old approximate differential cost at state $x_{t}$. Define $r_{t}$ as the parameter vector at time $t$. We update the parameter vector $r$ using this temporal difference by

$r_{t+1}=r_{t}+\gamma_{t}d_{t}\phi(x_{t}),$ (24)
$\mu_{t+1}=(1-\gamma_{t})\mu_{t}+2\gamma_{t}c(x_{t}).$ (25)

We let $\gamma_{t}=\frac{a}{a+t}$, where $a\geq 1$ is a hyper-parameter that controls the learning speed. When $a=1$, $\gamma_{t}$ forms the harmonic series, which usually does not perform well empirically, as discussed in Powell (2007). A larger $a$ slows the learning at the beginning of the process, and choosing an appropriate value of $a$ depends on the nature of the problem.

In this paper, we present a simple way of defining the basis functions. We give an index $i_{x}$ to each post-decision state $s_{x}=B$, where $i_{x}=\sum_{i=1}^{N}2^{i-1}\mathbbm{1}_{B(i)=1}$. We let the basis function $\phi_{[p]}(s_{x})$ be

$\phi_{[p]}(s_{x})=\begin{cases}1,&\text{if }p=i_{x},\\0,&\text{if }p\not=i_{x}.\end{cases}$ (26)
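With this choice, the approximation is simply a lookup table over the $2^{N}$ post-decision states, since exactly one basis function is active in each state. An illustrative sketch (using zero-based bit positions, a relabeling of the index $i_{x}$ above):

```python
import numpy as np

def state_index(status):
    """Index of a post-decision state, reading the status tuple as bits."""
    return sum(b << i for i, b in enumerate(status))

def phi(status, N):
    """One-hot basis vector of Eq. (26): entry p equals 1 iff p == i_x."""
    vec = np.zeros(2 ** N)
    vec[state_index(status)] = 1.0
    return vec

# With this basis, J_tilde(s_x, r) = r[state_index(s_x)], i.e., a table lookup.
```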

The algorithm is described in Algorithm 3.

Algorithm 3 Policy Iteration with TD-Learning
1:  Pick a random policy $\pi_{0}$. Set $k=0$.
2:  Specify $T$ and $K$.
3:  while $k\leq K$ do
4:     Set $t=0$. Initialize $r_{0}$ and $\mu_{0}$.
5:     Starting from a random state $x_{0}$, generate a state trajectory $\{x_{t}\mid t=0,1,\ldots,T\}$ of the Markov chain with transition probability matrix $P_{x}^{\pi_{k}}$ defined by the policy $\pi_{k}$.
6:     for $t=0$ to $T$ do
7:        Calculate the temporal difference $d_{t}$ by
          $d_{t}=c(x_{t})-\frac{\mu_{t}}{2}+\tilde{J}(x_{t+1},r_{t})-\tilde{J}(x_{t},r_{t}).$
8:        Update the parameters by
          $r_{t+1}=r_{t}+\gamma_{t}d_{t}\phi(x_{t}),$
          $\mu_{t+1}=(1-\gamma_{t})\mu_{t}+2\gamma_{t}c(x_{t}).$
9:     end for
10:     For each state $s\in S$, update the policy
          $\pi_{k+1}(s)=\arg\min_{a\in A_{s}}t_{aj}+\tilde{J}(s_{x}=B+I_{a},r_{T}).$
11:     $k=k+1$
12:  end while
13:  Output: Policy $\tilde{\pi}^{*}=\pi_{K}$
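As an illustration of the inner loop of Algorithm 3 (steps 4-9) under the one-hot basis, where the parameter vector $r$ is just a table of post-decision state values, the sketch below draws the trajectory directly from the post-decision Markov chain and assumes $c_{x}^{\pi_{k}}$ and $P_{x}^{\pi_{k}}$ have been precomputed (for example with the construction sketched after Algorithm 2):

```python
import numpy as np

def td_inner_loop(P, c, T=200_000, a=100.0, seed=0):
    """One TD-learning pass for a fixed policy pi_k (steps 4-9 of Algorithm 3).

    P : (n, n) numpy array, post-decision transition matrix P_x^pi
    c : (n,)   numpy array, post-decision costs c^pi(s_x)
    a : step-size hyper-parameter; gamma_t = a / (a + t)
    Returns the learned value table r and the average-cost estimate mu.
    """
    rng = np.random.default_rng(seed)
    n = len(c)
    r = np.zeros(n)          # approximate differential values J_tilde(s_x)
    mu = 0.0                 # estimate of the average cost per call, mu_x^pi
    x = rng.integers(n)      # random starting state x_0
    for t in range(T):
        x_next = rng.choice(n, p=P[x])
        gamma = a / (a + t)
        d = c[x] - mu / 2.0 + r[x_next] - r[x]      # temporal difference (23)
        r[x] += gamma * d                           # update (24); phi(x_t) is one-hot
        mu = (1 - gamma) * mu + 2 * gamma * c[x]    # update (25)
        x = x_next
    return r, mu
```

The outer loop of Algorithm 3 then recomputes the policy via (20) using the learned table and repeats.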
Theorem 2.

Algorithm 3 has the following properties:

  1. It converges with probability 1.

  2. The limit of the sequence $\frac{\mu_{t}}{2}$ at the $k$th iteration of the algorithm is the average cost $\frac{\mu_{x}^{\pi_{k}}}{2}$, i.e.,

     $\lim_{t\rightarrow\infty}\mu_{t}=\mu_{x}^{\pi_{k}}.$ (27)

  3. The limit of the sequence $r_{t}$ at the $k$th iteration of the algorithm, denoted by $r^{k*}$, is the unique solution of the equation

     $T\left(\Phi r^{k*}\right)=\Phi r^{k*},$ (28)

     where $T:\mathbb{R}^{2^{N}}\mapsto\mathbb{R}^{2^{N}}$ is the operator defined by

     $TJ=c_{x}^{\pi_{k}}-\frac{\mu^{\pi_{k}}_{x}e}{2}+P_{x}^{\pi_{k}}J.$ (29)

The proof of the above theorem follows from Lemma 3 and the fact that the basis functions $\phi(s_{x})$ are linearly independent across states. It is also necessary that the step-size sequence $\gamma_{t}$ be positive, deterministic, and satisfy $\sum_{t=0}^{\infty}\gamma_{t}=\infty$ and $\sum_{t=0}^{\infty}\gamma_{t}^{2}<\infty$. A detailed proof is given in Tsitsiklis and Van Roy (1999).

Theorem 2 guarantees that the TD-learning algorithm with post-decision states shown in Algorithm 3 always returns the optimal policy if $T$ is large enough. When $T$ is moderately large, it is sufficient to obtain a policy that is close to optimal, as we will show in the next section. Our algorithm avoids solving the complex Bellman equation, which has exponential complexity. Once the cost vector $c_{x}^{\pi}$ and transition matrix $P_{x}^{\pi}$ are calculated, we easily obtain the temporal differences $d_{t}$ by Monte Carlo simulation and evaluate the value functions needed for policy improvement in the policy iteration algorithm. One can also leverage parallel computing to generate multiple sample paths at the same time, which further reduces the computation time.

6 Numerical Results

In this section, we present numerical results comparing the policy obtained from the TD-learning method to the myopic policy that always dispatches the closest available unit, for systems with $N=5$, $10$, and $15$ units. We create an artificial region partitioned into $J=30$ demand nodes. We randomly locate units in the region and obtain the corresponding response times from each unit to each demand node.

The myopic policy $\pi^{m}$ is obtained by choosing, in each state $s\in S$, the available unit that results in the shortest response time. We solve for the exact mean response time of the myopic policy using the hypercube queuing model developed by Larson (1974) for the cases of $N=5$ and $10$ units. For $N=15$, the hypercube model becomes computationally costly due to its exponential complexity, so we obtain the approximate mean response time using the approximation method developed by Larson (1975), which has been shown to be very close to the exact value. An alternative is to estimate the mean response time via simulation, which could be more accurate but takes longer.

The policy from the proposed TD-learning method with post-decision states is obtained by running the algorithm for 25 iterations. We perform a roll-out with 200,000 state transitions in each iteration and update the parameter vector $r$ using the temporal differences $d$. We record the sample average response time in each iteration, and the results are shown in Figures 2, 3, and 4, respectively.

Figure 2: Mean response time comparison for $N=5$ units.
Figure 3: Mean response time comparison for $N=10$ units.
Figure 4: Mean response time comparison for $N=15$ units.
Figure 5: State value updates in one TD-learning iteration.

From the three experiments, we observe that the TD-learning algorithm converges quickly in all cases, as expected. We show the updates of the post-decision state values $\tilde{J}$ in one TD-learning iteration for the case of $N=5$ in Figure 5. The resulting policies in all three cases outperform the myopic policy that always dispatches the closest available unit. We also observe that our algorithm obtains a superior policy fairly quickly, in about three iterations. In the case of $N=15$, solving the Bellman equation requires solving a system with $31\times 2^{15}=1{,}015{,}808$ states, which is very computationally costly, if not infeasible, on most modern computers. In contrast, our method obtains a good policy in less than two minutes in this case, and it applies to systems of virtually any size, as guaranteed by its theoretical properties.

Note that the policies obtained from our algorithm reduce the response time by an average of three seconds with no additional resources. This suggests that myopic policies unnecessarily waste critical resources, and that slightly more intelligent policies can provide improved performance at little to no cost.

7 Conclusion

In this paper, we model the ambulance dispatch problem as an average-cost Markov decision process and aim to find the optimal dispatch policy that minimizes the mean response time. The standard MDP formulation has a state space of size $(J+1)\cdot 2^{N}$, where $J$ is the number of demand nodes that partition the region and $N$ is the total number of ambulances in the system. Policy iteration is able to find the optimal policy. We propose an alternative MDP formulation that uses post-decision states and reduces the state space to $2^{N}$, and we show that this formulation is mathematically equivalent to the original MDP formulation.

Even though the state space is reduced in the new formulation, the curse of dimensionality still restricts its application to small problems. We therefore present a TD-learning algorithm based on the post-decision states that is guaranteed to converge to the optimal solution. In our numerical experiments, we show that the TD-learning algorithm with post-decision states converges quickly and that the policies obtained from our method outperform the myopic policy that always dispatches the closest available unit in all cases.

References

  • Bandara et al. [2014] Damitha Bandara, Maria E Mayorga, and Laura A McLay. Priority dispatching strategies for EMS systems. Journal of the Operational Research Society, 65(4):572–587, 2014.
  • Carter et al. [1972] Grace M Carter, Jan M Chaiken, and Edward Ignall. Response areas for two emergency units. Operations Research, 20(3):571–594, 1972.
  • Daskin [1983] Mark S Daskin. A maximum expected covering location model: formulation, properties and heuristic solution. Transportation Science, 17(1):48–70, 1983.
  • Evarts [2019] Ben Evarts. Fire loss in the United States during 2018. NFPA National Fire Protection Association, Quincy, 2019.
  • Jagtenberg et al. [2015] Caroline J Jagtenberg, Sandjai Bhulai, and Robert D van der Mei. An efficient heuristic for real-time ambulance redeployment. Operations Research for Health Care, 4:27–35, 2015.
  • Jagtenberg et al. [2017a] Caroline J Jagtenberg, Sandjai Bhulai, and Robert D van der Mei. Dynamic ambulance dispatching: is the closest-idle policy always optimal? Health Care Management Science, 20(4):517–531, 2017.
  • Jagtenberg et al. [2017b] Caroline J Jagtenberg, Pieter L van den Berg, and Robert D van der Mei. Benchmarking online dispatch algorithms for emergency medical services. European Journal of Operational Research, 258(2):715–725, 2017.
  • Jarvis [1975] James Patrick Jarvis. Optimization in stochastic service systems with distinguishable servers. PhD thesis, Massachusetts Institute of Technology, 1975.
  • Larson [1974] Richard C Larson. A hypercube queuing model for facility location and redistricting in urban emergency services. Computers & Operations Research, 1(1):67–95, 1974.
  • Larson [1975] Richard C. Larson. Approximating the performance of urban emergency service systems. Operations Research, 23(5):845–868, 1975.
  • Maxwell et al. [2010] Matthew S Maxwell, Mateo Restrepo, Shane G Henderson, and Huseyin Topaloglu. Approximate dynamic programming for ambulance redeployment. INFORMS Journal on Computing, 22(2):266–281, 2010.
  • McLay and Mayorga [2013a] Laura A McLay and Maria E Mayorga. A dispatching model for server-to-customer systems that balances efficiency and equity. Manufacturing & Service Operations Management, 15(2):205–220, 2013.
  • McLay and Mayorga [2013b] Laura A McLay and Maria E Mayorga. A model for optimally dispatching ambulances to emergency calls with classification errors in patient priorities. IIE Transactions, 45(1):1–24, 2013.
  • Nasrollahzadeh et al. [2018] Amir Ali Nasrollahzadeh, Amin Khademi, and Maria E Mayorga. Real-time ambulance dispatching and relocation. Manufacturing & Service Operations Management, 20(3):467–480, 2018.
  • Powell [2007] Warren B Powell. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons, 2007.
  • Powell [2010] Warren B Powell. Merging AI and OR to solve high-dimensional stochastic optimization problems using approximate dynamic programming. INFORMS Journal on Computing, 22(1):2–17, 2010.
  • Schmid [2012] Verena Schmid. Solving the dynamic ambulance relocation and dispatching problem using approximate dynamic programming. European Journal of Operational Research, 219(3):611–621, 2012.
  • Tsitsiklis and Van Roy [1999] John N Tsitsiklis and Benjamin Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.