Stochastic Dynamic Network Utility Maximization with Application to Disaster Response
Abstract
In this paper, we are interested in solving Network Utility Maximization (NUM) problems whose underlying local utilities and constraints depend on a complex stochastic dynamic environment. While the general model applies broadly, this work is motivated by resource sharing during disasters concurrently occurring in multiple areas. In such situations, hierarchical layers of Incident Command Systems (ICS) are engaged; specifically, a central entity (e.g., the federal government) typically coordinates the incident response, allocating resources to different sites, which are then distributed to the affected population by local entities. The benefits of an allocation decision to the different sites are generally not expressed explicitly as a closed-form utility function because of the complexity of the response and the random nature of the underlying phenomenon we try to contain. We use the classic approach of decomposing the NUM formulation and applying a primal-dual algorithm to achieve optimal higher-level decisions under coupled constraints, while modeling the optimized response to the local dynamics with deep reinforcement learning algorithms. The decomposition we propose has several benefits: 1) the entities respond to their local utilities based on a congestion signal conveyed by the ICS upper layers; 2) the complexity of capturing the utility of local responses and their diversity is addressed effectively without sharing local parameters and priorities with the ICS layers above; 3) utilities that are not known as explicit functions are approximated as concave functions of the resources allocated; 4) decisions rely on up-to-date data from the ground along with future forecasts.
I Introduction
This paper’s main contribution is a novel Network Utility Maximization (NUM) framework to address complex and heterogeneous resource allocation problems for hierarchical systems, where the objective is maximizing social utility. The application we study is Incident Command Systems (ICS). During an incident, resources and services from first responders are necessary to contain its impact. These resources and services are made available by various entities with limited supply at any given time. ICS are typically hierarchical, whereby higher layers are engaged only if the lower layers cannot manage the response alone. As the scale and frequency of health crises and natural disasters continue to rise, it is increasingly common to require a higher layer of ICS to coordinate resource allocation. In this context, it is critical to study beforehand how to make optimal (or near-optimal), foresighted, and practical decisions.
Utilizing a distributed primal-dual formulation, the proposed solution is decomposed so that it becomes possible to coordinate the management of wide-area crises. Primal-dual decompositions of NUM formulations are similar to a free-market mechanism: the higher layers communicate a congestion signal, akin to a shadow price for each resource, to the lower layers. The latter add the resource cost when maximizing their local utility. In turn, the information about the local situation and priorities does not have to be explicitly shared; only the final demand for resources is reported to update the congestion signal. The localities can even use different approaches to compute their complex dynamic optimal responses based on their specific geography, needs, or priorities.
What distinguishes our formulation from the more conventional NUM formulations is that we consider cases where one needs to allocate resources facing a stochastic dynamic environment. Therefore, each resource allocation decision should account for not only the short-term but also the long-term utility. While we decompose the NUM problem in space and layers, the resource allocation problem is not decomposable in time, and the local sub-problems fall into the class of stochastic dynamic optimization problems. For instance, a relatively limited allocation of firefighter units to a wildfire may cause the fire to spread vastly, requiring substantially more resources to control it in the future. Therefore, we need foresighted decisions that use future forecasts built upon the present real-time data. A necessary condition for decomposability across localities and layers is that the allocations do not directly affect the utility of other sites except through the scarcity of the cumulative resources for which the sites contend. In reality, this is often an approximation but reflects the local nature of the response on the ground. Not just during crises but also for planning, one can adopt the proposed framework to shed light on the cost of potential disasters for a given availability of resources, even when all parties involved are responding optimally.
We study the proposed approach in two compelling scenarios: pandemic response and wildfire response. In these scenarios, we model a pandemic evolution or a wildfire propagation as Markov Decision Processes (MDPs), where we obtain the approximate optimal local response for an allocation via deep reinforcement learning algorithms. These algorithms rely on trial-and-error type training on agent-based simulation systems, where the necessary training is highly dependent on the scale of the simulations and possible action spaces. Therefore, our divide-and-conquer strategy with the decomposition on the higher layer is significantly helpful in obtaining multiple local scalable reinforcement learning systems. In turn, the higher layer does not require local data, simulation model parameters, or any other local complexities, and only establishes a market mechanism to find the best allocation across localities.
We run these agent-based simulations over a finite time horizon window starting from the present to obtain foresighted allocations. The initial state of the MDP (or the simulation) is the most recent data from the ground. When the local ICS gathers more data, that new current time becomes time zero, and we repeat the simulation and the overall optimization. Therefore, in a shifting time window manner, we find foresighted optimal allocations while using up-to-date ground data as simulation initializations. During that process, even though we make decisions for a future time window, only the near-term decisions are finalized, while future ones potentially get revised based on new incoming data from the ground.
I-A Related Work
The NUM framework establishes an analytical foundation for designing distributed and modularized control of networks by investigating resource-constrained utility maximization problems. While initial works mostly use dual decomposition-based distributed algorithms, as in the seminal works [1, 2], various alternative decompositions of network resource allocation problems have been developed since. We refer the interested reader to the tutorial paper [3] in this area. More recently, a multi-layered hierarchical NUM [4, 5] and a fully-decentralized federated variant [6] have been investigated, and in [7] a linear convergence rate for NUM algorithms (when the resource constraints are linear equalities) has been shown without the time-scale separation assumption between the iteration layers. An overwhelming majority of this literature considers a static setup, whereas in this paper we consider a dynamic problem where the underlying systems evolve with the allocated resources in time. Among the works considering a dynamic allocation, a common approach is the Lyapunov drift-plus-penalty method when the problem has soft intertemporal constraints [8, 9, 10]. In the case of delivery contracts, where feasibility should be guaranteed by a deadline, [11] provides an exact solution with distributed implementation for the case where all the constraints in time are known exactly, and an approximate solution based on model predictive control when there is uncertainty in the future constraint set. To improve the convergence speed in this setup, a Newton method-based distributed algorithm has been proposed in [12]. For the case where the utility functions at localities change following a known set of deterministic dynamical equations, [13] proposes an iterative method based on dual decomposition. This setup is similar to ours. The difference lies in the fact that we consider large-scale disaster response problems whose underlying dynamics follow stochastic computer simulations. The decomposed problems at localities, in our setting, are stochastic and complex and require approximate solutions such as deep reinforcement learning, rather than being deterministic closed-form problems.
The paper is organized as follows. Next, we discuss the general modeling of a stochastic dynamic network utility maximization (NUM) problem for optimal resource allocation in an incident response followed by the proposed distributed solution approach. In the later sections, we present the details of the considered case studies. Finally, we end with numerical results.
II Problem Formulation
We formulate the resource allocation problem as maximizing the sum utility of independent MDPs subject to a common resources constraint. The notation is as follows:
1. $N_i$ is the population in location $i \in \{1, \dots, M\}$.
2. $s_i[t]$ is the state of location $i$ at future horizon window time $t \in \{0, 1, \dots, T\}$. These states contain all the available information regarding how the disaster can impact locations, including information regarding the entities as well as the environmental factors.
3. $s_i[0]$ is the state of location $i$ at present (or $t = 0$), i.e., it denotes the most recent available data from the ground.
4. $\mathcal{S}_i$ is the set of all possible state values, i.e., the state space.
5. $r_i[t]$ is the vector of resources allocated to location $i$ at time $t$ in the horizon window.
6. $\mathcal{X}$ is the set constraining the individual resources (e.g., non-negativity).
7. $\boldsymbol{r}[t] = (r_1[t], \dots, r_M[t])$ is the collection of $r_i[t]$, for a given $t$. It denotes an allocation decision, i.e., an allocation action in the MDP framework.
8. $R_i$ represents the resources assigned to location $i$ per unit of time, i.e., $\mathbf{1}^\top r_i[t] \le R_i$, $\forall t$.
9. $\mathcal{A}(s)$ denotes the set of all available actions from state $s$, i.e., it is the action space from $s$.
10. We assume that the disaster dynamics are Markovian, i.e., they can be defined with state transition probabilities $\Pr(s_i[t+1] \mid s_i[t], r_i[t])$. We denote these dynamics with $f_i$, such that $s_i[t+1] = f_i(s_i[t], r_i[t])$, mapping the previous state and the action pair to the new state probabilistically based on the underlying dynamics that define the evolution of the specific disaster.
11. It is common to have a probabilistic policy in stochastic dynamic optimization. $\pi_i$ denotes a potentially probabilistic mapping from state space to action space, i.e., $\pi_i : \mathcal{S}_i \to \mathcal{A}$, and $r_i[t] \sim \pi_i(s_i[t])$.
12. $u_i(s_i[t], r_i[t])$ is the immediate utility for location $i$, with the allocation $r_i[t]$ at state $s_i[t]$.
13. $R$ represents the total resource supply per time unit. This leads to the constraint $\sum_{i=1}^{M} R_i \le R$, which couples the resources across all locations.
The NUM problem formulation captures the basic problem of assigning resources under congestion/capacity constraints to maximize the sum utility for the population, i.e.:
$$\max_{\{R_i\},\,\{\pi_i\}} \;\; \sum_{i=1}^{M} \mathbb{E}\!\left[\, \sum_{t=0}^{T} \gamma^t\, u_i(s_i[t], r_i[t]) \right] \tag{1}$$
$$\text{s.t.} \quad r_i[t] \sim \pi_i(s_i[t]), \quad \forall\, i, t \tag{2}$$
$$r_i[t] \in \mathcal{X}, \quad \forall\, i, t \tag{3}$$
$$\mathbf{1}^\top r_i[t] \le R_i, \quad \forall\, i, t \tag{4}$$
$$\sum_{i=1}^{M} R_i \le R \tag{5}$$
$$s_i[t+1] = f_i(s_i[t], r_i[t]), \quad \forall\, i, t \tag{6}$$
where $\gamma \in (0, 1]$ is a utility (or reward) discount factor, capturing the penalty for delaying the allocation of resources (commonly, $\gamma < 1$). The expectation (denoted by $\mathbb{E}$) is over the sources of the stochasticity in the system as well as the randomized policies $\pi_i$. To ensure that the demand met is feasible, i.e., less than or equal to the supply, we have the constraint sets (4), (5). Finally, (6) accounts for the impact of the incident dynamics.
These dynamics start with present-time data $s_i[0]$, and we simulate the future states $s_i[t]$ via agent-based simulations. These simulations, which produce probabilistic future realizations branched off from $s_i[0]$ by moving agents in the system, establish a playground for deep reinforcement learning algorithms to train and reach effective foresighted allocation strategies. As time moves forward and new data becomes available (at time $\Delta$, for example), it replaces $s_i[0]$ and the horizon window shifts, meaning time $\Delta$ becomes the new $t = 0$, and so on. The simulations are then restarted from the updated $s_i[0]$, and decisions are revised accordingly. Consequently, the higher-level allocations $R_i$ for all $i$ are updated only at a slower pace, such as at time intervals of $\Delta$. For simplicity, in a single simulation run, it is assumed that $R_i$ remains constant for the duration of the future horizon window.
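To make the shifting-window operation concrete, the following is a minimal Python sketch of the loop just described. All names (`Location`, `get_ground_state`, `apply_local_policy`, `solve_allocation`) are hypothetical placeholders for the components developed in the following sections, and the window lengths are illustrative values.

```python
# Minimal sketch of the shifting-horizon re-planning loop described above.
# All names below are hypothetical placeholders, not part of the framework.

T_HORIZON = 10  # look-ahead window length T (illustrative value)
DELTA = 5       # higher-layer allocation update period (illustrative value)

def rolling_horizon_response(locations, R_total, n_periods):
    """locations: dict mapping a location id to a Location object."""
    for t0 in range(0, n_periods, DELTA):
        # 1) Refresh s_i[0] with the most recent data from the ground.
        states = {i: loc.get_ground_state() for i, loc in locations.items()}
        # 2) Higher layer: compute per-location supplies R_i for the next
        #    window via the primal-dual market mechanism (Sec. III).
        R = solve_allocation(states, R_total, horizon=T_HORIZON)
        # 3) Local layer: each ICS spends its share for DELTA steps; only
        #    these near-term decisions are final, later ones get revised.
        for i, loc in locations.items():
            for _ in range(DELTA):
                loc.apply_local_policy(R[i])
```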
There might be a need for more than one type of resource. In that case, we can assume that $r_i[t]$ and $R$ are vectors where each entry is the quantity of a particular item type to be allocated or supplied. The division of the population over sites affected by the crisis is known here, and we assume that the dynamics of the event are independent across localities, i.e., they are disjoint MDPs.
We assume that each of the sites has its own local ICS that is activated to address the local resource allocation problem, i.e., to choose the values of $r_i[t]$ for a given $R_i$. Owing to the hierarchical nature of ICS, our tenet is that we can cast the optimal resource allocation problem for various disaster response scenarios in such a formulation in general, with the discriminating aspects lying in the resource type and, more importantly, in the dynamics that depend on the type of incident. The set of constraints in (6) refers to the underlying dynamics of the incident in consideration. For instance, if the incident is a natural disaster, such as an earthquake, the destruction peaks at time zero and is generally non-increasing. In a pandemic, instead, the additional exposure of people causes waves of infections.
Now, this view of the problem reflects the task of allocating resources to local ICSs when the event footprint is so vast that entities and suppliers at a higher layer need to be activated. However, it does not capture the information bottlenecks and the coarse information available about the individuals’ utilities. Even though it might sometimes be possible to estimate the impact of an incident over populations with a well-defined set of equations, in other cases, the way to tackle the response locally and the best practices vary significantly. By decomposing the problem into localities and by creating a market-like mechanism for the allocation of global resources to the local ICSs, where the global ICS only uses the result of the local optimization problems regardless of their complexity or choice of solution, our approach allows significant flexibility: it is compatible with different applications, different models of the underlying dynamics, and different solution methods.
The real-life optimality of the response depends on various factors, first and foremost on how well localities gather data to gain situation awareness about what is unfolding and, second, how they use the information to decide what to do with the resources optimally. In this paper’s case studies, we consider two dynamic incident models and assume the information regarding the situation is perfect (the present state $s_i[0]$ is available, the future states are observable through simulations, and the dynamics are known); our focus is on exemplifying how to obtain a strategy to use the resources optimally. Another imprecision in our model is that it does not capture the external impact a local response may have in increasing the utility of other sites. For example, wildfires and epidemics may not be confined to localities, and the allocation in one place can reduce the risk for others since both can spread from one site to the other. However, it is mostly the case that for local ICSs, the other localities are more of an afterthought.
III Distributed Solution via Primal-Dual Decomposition Theory
III-A Decomposing the Global Problem
In this section, we apply decomposition theory, and in particular the primal-dual decomposition, to the problem in (1) and obtain a distributed solution that reduces the burden of exchanging data from the local sites to the central one. That is possible because the Markov Decision Process (MDP) dynamics in each location are modeled independently from each other and coupled only through the total resources, $R$, being finite. That is, for a given $R_i$, the values $r_i[t]$ affect only the utilities inside location $i$. It follows that for a given $R_i$, each location can solve the inner problem:
$$U_i(R_i;\, s_i[0]) \;\triangleq\; \max_{\pi_i} \;\; \mathbb{E}\!\left[\, \sum_{t=0}^{T} \gamma^t\, u_i(s_i[t], r_i[t]) \right] \tag{7}$$
$$\text{s.t.} \quad r_i[t] \sim \pi_i(s_i[t]), \;\; r_i[t] \in \mathcal{X}, \;\; \mathbf{1}^\top r_i[t] \le R_i, \;\; s_i[t+1] = f_i(s_i[t], r_i[t]), \quad \forall\, t$$
where $U_i(R_i; s_i[0])$ denotes the resulting optimal utility attainable for a given allocation $R_i$ starting from the present-time state $s_i[0]$. Under the congestion constraint, the problem to solve for the higher layer is:
$$\max_{\{R_i \ge 0\}} \;\; \sum_{i=1}^{M} U_i(R_i;\, s_i[0]) \qquad \text{s.t.} \quad \sum_{i=1}^{M} R_i \le R \tag{8}$$
This formulation corresponds to a NUM resource allocation [14], which can leverage the dual decomposition. Let $\lambda \ge 0$ denote the Lagrange multiplier. The Lagrangian of the upper layer of ICS is as follows
$$L(\{R_i\},\, \lambda) \;=\; \sum_{i=1}^{M} U_i(R_i;\, s_i[0]) \;-\; \lambda \left( \sum_{i=1}^{M} R_i - R \right) \tag{9}$$
and the dual problem to solve becomes
$$\min_{\lambda \ge 0} \;\; \max_{\{R_i \ge 0\}} \;\; L(\{R_i\},\, \lambda) \tag{10}$$
For a fixed $\lambda$, the primal variables can be found as
$$R_i^*(\lambda) \;=\; \arg\max_{R_i \ge 0} \;\big[\, U_i(R_i;\, s_i[0]) - \lambda R_i \,\big], \quad \forall\, i \tag{11}$$
and we find the optimum dual variable via an iterative gradient method with the following update:
$$\lambda^{(k+1)} \;=\; \left[\, \lambda^{(k)} - \alpha \left( R - \sum_{i=1}^{M} R_i^*\big(\lambda^{(k)}\big) \right) \right]_{+} \tag{12}$$
where $\alpha$ denotes a step size, $k$ denotes the iteration index, $[\cdot]_+$ denotes the projection onto the non-negative orthant, and $R_i^*(\lambda^{(k)})$ is the solution of problem (11) evaluated at the current iterate $\lambda^{(k)}$. To sum up, as illustrated in Fig. 1, the distributed solution finds $R_i^*(\lambda)$ in each location separately for a given $\lambda$, whose value is then updated using the new demands $R_i^*(\lambda)$ from all locations. This process continues iteratively until a convergence criterion (market equilibrium) is satisfied. The $\lambda$ values serve as pseudo-prices, which are determined based on the total supply $R$ and the total demand $\sum_i R_i^*(\lambda)$.
In this method, defined by Eqns. (11)-(12), the convergence to a global optimum is guaranteed if each $U_i$ is an increasing and differentiable concave function [4]. However, in our case, rather than being a well-defined closed-form function, $U_i$ is the result of a challenging optimization problem, the local stochastic optimization in (7). Note that, here, our algorithm that solves the global allocation problem cooperatively, with (12), does not use any information from the local optimizations except the returned demand $R_i^*(\lambda)$ for a given $\lambda$. Therefore, as a result, it is unaffected by the choice of optimization methods in each locality, including whether one chooses a deterministic policy, a randomized policy, or any heuristic approximation to solve the dynamic sub-problem, as long as they achieve near-optimal results.
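As a concrete illustration of the iterations (11)-(12), the sketch below implements the higher-layer market mechanism in Python under the assumption that each locality exposes a single callable returning its demand $R_i^*(\lambda)$; the function names, the fixed step size, and the stopping tolerance are our own illustrative choices, not prescribed by the framework.

```python
import numpy as np

def dual_ascent_allocation(local_demand_fns, R_total, alpha=0.1,
                           tol=1e-4, max_iter=1000):
    """Higher-layer market mechanism of Eqns. (11)-(12), sketched.

    local_demand_fns: list of callables; local_demand_fns[i](lmbda) returns
    the demand R_i*(lmbda) = argmax_{R_i >= 0} [U_i(R_i) - lmbda * R_i].
    How each locality computes it (DRL, heuristics, ...) is invisible here.
    """
    lmbda = 1.0                                   # initial pseudo-price
    for _ in range(max_iter):
        demands = np.array([f(lmbda) for f in local_demand_fns])
        excess = demands.sum() - R_total          # total demand vs. supply
        lmbda = max(0.0, lmbda + alpha * excess)  # projected gradient step (12)
        if abs(excess) < tol or (lmbda == 0.0 and excess < 0):
            break                                  # market equilibrium reached
    return demands, lmbda
```

The stopping rule mirrors the market-equilibrium criterion above: the pseudo-price stops moving either when total demand matches the supply or when the price hits zero with excess supply.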
We highlight that $U_i$ is affected heavily by the initial state $s_i[0]$, i.e., the real-time data. For example, in the case of wildfire, our simulations do not create out-of-nowhere new fires; they only model the propagation of existing ones. As a result, if there is no ongoing wildfire in a location, the utility of any firefighting resources will be zero. The utility, in this case, is a function of many factors in the real-time data, such as wind speed and direction, exact fire location, et cetera, that affect the simulation branches. Therefore, we also need to revise the $U_i$ values when we shift the time horizon window with incoming data in time.
III-B Solving the Sub-Problems in Localities
Owing to the complexity of solving problems such as (7), many different approximation methods have been proposed in the literature to reach sub-optimal allocations. The modular nature of our decomposition provides flexibility in choosing how to approach the solution of the inner local optimization. In fact, the gradient step at the higher layer only requires the value of the optimum demand $R_i^*(\lambda)$ in each location, allowing different localities to use different methods to determine it. The decomposition separates the solution processes of the locations except for the exchange of $\lambda$, which is analogous to a shadow price for the total resource allocation.
The inner problem in (7) is a stochastic dynamic optimization modeled as an MDP. To solve such problems, dynamic programming and reinforcement learning have traditionally been used, and more recently deep reinforcement learning (which utilizes neural networks) has become common thanks to computational advancements and new techniques. In this paper’s case studies, we choose to use deep reinforcement learning algorithms along with agent-based simulations to solve the local problems in (7) due to their capacity to handle relatively large-scale complex problems. Nonetheless, our framework is flexible enough to accommodate the preferred method among dynamic optimization/reinforcement learning algorithms for the application in mind. We refer interested readers to [15, 16] for detailed background on different algorithms to solve dynamic optimization problems.
III-C Concave and Non-decreasing Interpolation
The critical aspect regarding the solution of these local complex dynamic problems is that finding $U_i(R_i; s_i[0])$ for a given $R_i$ is non-trivial. If a time-consuming computational model is available to find $U_i$ for a given upper-layer allocation (i.e., to solve the inner dynamic problem), an offline computation might be preferable to an online one. In such cases, we may want to estimate $U_i$ from samples obtained from numerical or experimental evidence of what is attainable with a set of possible values of $R_i$, and interpolate the response to the set of all possible values.
The option we propose is to resort to a piece-wise linear interpolation. Let us denote the curve we fit to the samples with $\tilde{U}(\rho)$, where $\rho$ denotes a realization of $R_i$. Assume one has $K$ samples $\hat{U}^{(j)}$ of the attainable utility for certain allocations $\rho^{(j)}$, $j = 1, \dots, K$. We can cast the interpolation into an optimization that minimizes the mean square error from the sample points under the constraint that $\tilde{U}$ is non-decreasing and concave, which is due to the observation that additional resources do not reduce the utility and generally provide diminishing returns. Thus, we can formulate the constrained curve fitting problem:
$$\min_{\tilde{U}} \;\; \sum_{j=1}^{K} \big( \tilde{U}(\rho^{(j)}) - \hat{U}^{(j)} \big)^2 \qquad \text{s.t.} \quad \tilde{U} \text{ concave and non-decreasing} \tag{13}$$
If $\tilde{U}$ is concave and non-decreasing, then

$$\tilde{U}(\rho') \;\le\; \tilde{U}(\rho) + g(\rho)^\top (\rho' - \rho), \quad \forall\, \rho, \rho' \tag{14}$$
$$g(\rho) \;\ge\; 0, \quad \forall\, \rho \tag{15}$$

where $g(\cdot)$ is the gradient.
To simplify the notation, let us omit the location index and define:

$$f_j \;\triangleq\; \tilde{U}(\rho^{(j)}), \qquad g_j \;\triangleq\; g(\rho^{(j)}), \qquad j = 1, \dots, K \tag{16}$$
We can write a problem to find the piece-wise linear approximation as follows:
$$\min_{\{f_j,\, g_j\}_{j=1}^{K}} \;\; \sum_{j=1}^{K} \big( f_j - \hat{U}^{(j)} \big)^2 \qquad \text{s.t.} \quad f_k \le f_j + g_j^\top \big( \rho^{(k)} - \rho^{(j)} \big), \;\; g_j \ge 0, \quad \forall\, j, k \tag{17}$$
which is a quadratic program (QP) since all the constraints are linear and the objective function is quadratic; therefore, it can be solved by one of the many available QP solvers. The optimization outcome is a set of lines that pass through the points $(\rho^{(j)}, f_j)$ with slopes $g_j$ at minimum distance from the data points. Then, we can approximate the utility function at any point $\rho$ with:
$$\tilde{U}(\rho) \;=\; \min_{j = 1, \dots, K} \;\big\{\, f_j + g_j^\top \big( \rho - \rho^{(j)} \big) \,\big\} \tag{18}$$
With this function, we can estimate the result of the local optimization for any possible $R_i$ on the run. This can be suitable for cases where solving the inner optimization online is impossible with the computation power available.
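For the scalar-resource case, the QP in (17) and the evaluation rule in (18) can be written directly with the CVXPY modeling library; the sketch below is a minimal illustration under that assumption, and the dense set of pairwise constraints (quadratic in the number of samples $K$) is acceptable for the small sample counts used here.

```python
import cvxpy as cp
import numpy as np

def fit_concave_nondecreasing(rho, u_hat):
    """Fit the piecewise-linear concave, non-decreasing curve of Eqn. (17).

    rho:   sampled allocation levels rho^(1..K) (scalar resources assumed)
    u_hat: corresponding sampled utilities from the local optimizations
    Returns (f, g): fitted values and slopes at the sample points.
    """
    rho, u_hat = np.asarray(rho, float), np.asarray(u_hat, float)
    K = len(rho)
    f = cp.Variable(K)                 # f_j = tilde{U}(rho^(j))
    g = cp.Variable(K)                 # g_j = slope at rho^(j)
    cons = [g >= 0]                    # non-decreasing, Eqn. (15)
    for j in range(K):                 # concavity, Eqn. (14): every fitted
        for k in range(K):             # point lies below every tangent line
            cons.append(f[k] <= f[j] + g[j] * (rho[k] - rho[j]))
    cp.Problem(cp.Minimize(cp.sum_squares(f - u_hat)), cons).solve()
    return f.value, g.value

def evaluate(r, rho, f, g):
    """Eqn. (18): the approximation is the lower envelope of the lines."""
    return np.min(f + g * (r - np.asarray(rho, float)))
```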
III-C1 Optimality gap
We want to characterize the difference between the optimal solution of the problem in consideration and the solution we reach using the approximate objective functions $\tilde{U}_i$. In the latter, we reach the optimal solution using the dynamics of the primal-dual iterations (11) and (12) with the approximate functions $\tilde{U}_i$, i.e.:
$$\tilde{R}_i\big(\lambda^{(k)}\big) \;=\; \arg\max_{R_i \ge 0} \;\big[\, \tilde{U}_i(R_i) - \lambda^{(k)} R_i \,\big], \quad \forall\, i \tag{19}$$
$$\lambda^{(k+1)} \;=\; \left[\, \lambda^{(k)} - \alpha \left( R - \sum_{i=1}^{M} \tilde{R}_i\big(\lambda^{(k)}\big) \right) \right]_{+} \tag{20}$$
to solve the problem defined in (8). Let us define $U(\boldsymbol{R}) \triangleq \sum_{i} U_i(R_i)$ and $\tilde{U}(\boldsymbol{R}) \triangleq \sum_{i} \tilde{U}_i(R_i)$, where $\boldsymbol{R}$ is a long vector stacking all the $R_i$ values. Furthermore, let us denote the optimal solutions with and without the function approximation as $\tilde{\boldsymbol{R}}^*$ and $\boldsymbol{R}^*$, respectively.
Let us refer to the maximum total approximation error over the locations’ sum-utility at any value of the vector $\boldsymbol{R}$ as $\epsilon$, i.e.,

$$\epsilon \;\triangleq\; \max_{\boldsymbol{R}} \;\big|\, U(\boldsymbol{R}) - \tilde{U}(\boldsymbol{R}) \,\big| \tag{21}$$
Theorem 1
Assume the sum-utility function $U$ is strongly concave, i.e., $\nabla^2 U \preceq -m\, I$ with a positive $m$, and an increasing function. Then, we have

$$\big\| \boldsymbol{R}^* - \tilde{\boldsymbol{R}}^* \big\| \;\le\; 2 \sqrt{\epsilon / m} \tag{22}$$
Proof:
The first important observation about the solutions $\boldsymbol{R}^*$ and $\tilde{\boldsymbol{R}}^*$ of the resource allocation problem in (8) is that they satisfy the supply constraint with equality since the objective utilities are increasing with additional resources. Therefore, they are both feasible points of either problem, i.e., with or without the approximation.
Notice that, due to the definition of the maximum, we can write

$$U(\boldsymbol{R}^*) \;\ge\; U(\tilde{\boldsymbol{R}}^*) \tag{23}$$
$$\tilde{U}(\tilde{\boldsymbol{R}}^*) \;\ge\; \tilde{U}(\boldsymbol{R}^*) \tag{24}$$

since the algorithms would converge to the other (feasible) points otherwise. Combining these two inequalities with the definition of $\epsilon$ in (21) yields

$$0 \;\le\; U(\boldsymbol{R}^*) - U(\tilde{\boldsymbol{R}}^*) \;\le\; 2\epsilon \tag{25}$$
Due to the strong concavity of $U$,

$$U(\tilde{\boldsymbol{R}}^*) \;\le\; U(\boldsymbol{R}^*) + \nabla U(\boldsymbol{R}^*)^\top \big( \tilde{\boldsymbol{R}}^* - \boldsymbol{R}^* \big) - \frac{m}{2} \big\| \tilde{\boldsymbol{R}}^* - \boldsymbol{R}^* \big\|^2 \tag{26}$$
All of our constraints are linear, and the sum-capacity constraint is over disjoint sets of variables (the indices of $\boldsymbol{R}$); therefore, the allowable movement at the intersection of the set of linear equalities occurs on a hyperplane on which both $\boldsymbol{R}^*$ and $\tilde{\boldsymbol{R}}^*$ lie. Furthermore, at the constrained maximum $\boldsymbol{R}^*$, $\nabla U(\boldsymbol{R}^*)$ should be perpendicular to this hyperplane since, otherwise, we could increase $U$ by moving towards the direction of the projection of the gradient onto the hyperplane. Therefore, $\nabla U(\boldsymbol{R}^*)^\top (\tilde{\boldsymbol{R}}^* - \boldsymbol{R}^*) = 0$. In such a case, combining (25) and (26),

$$\frac{m}{2} \big\| \tilde{\boldsymbol{R}}^* - \boldsymbol{R}^* \big\|^2 \;\le\; U(\boldsymbol{R}^*) - U(\tilde{\boldsymbol{R}}^*) \;\le\; 2\epsilon \tag{27}$$
An exception to the orthogonality of the gradient at the solution and the hyperplane might occur if there are corner points due to additional constraints, such as the non-negativity of individual variables. Even if these constraints pull the solutions to corner points, this mapping is non-expansive, i.e., $\|\boldsymbol{R}^* - \tilde{\boldsymbol{R}}^*\|$ gets smaller or stays the same. Therefore, in general, we can conclude:

$$\big\| \boldsymbol{R}^* - \tilde{\boldsymbol{R}}^* \big\| \;\le\; 2 \sqrt{\epsilon / m} \tag{28}$$
∎
IV Case Studies
We consider two disaster response scenarios of high current relevance: wildfire and pandemic response. More specifically, we focus on firefighting unit allocation in wildfires and vaccine allocation in pandemics. Even though there are many differences between these two scenarios in resource types, actions, and underlying dynamics, we can cast both in our general formulation of Sec. II.
Before getting into the differences and the application-specific details, we first highlight the fact that the stochastic models for the dynamics can in both cases be described as instances of stochastic multi-agent interactions over a particular graph, which in one case (the fire dynamics) represents the topological characteristics of the terrain and in the other (the pandemic) the social interactions among individuals and the contagion.
IV-A Propagation Through Multi-Agent Networks
Assume we have a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each node $j \in \mathcal{V}$ denotes a person in the dynamics of a pandemic or a small forest area in a wildfire. In both scenarios, the edges represent the possible contacts that spread the phenomenon among neighboring nodes: the social contact causing the spread of disease from an infected individual, and the ignition of a neighboring zone due to the proximity to others that are burning. Each node is in only one of several possible discrete states: for pandemics, states can be susceptible, infected, recovered, vaccinated, or dead, and for wildfire, states are vulnerable, on fire, extinguished, or burnt.
Therefore, omitting the location index $i$ for simplicity, node $j$’s state at time $t$, $x_j[t]$, is a categorical random variable that can change due to the outcome of interactions and the local state. If we normalize the time between decision/action epochs to be $1$, we can divide each decision period into infinitesimal intervals of time $\delta$, during which there can be zero or at most one interaction between a node and its neighbors with probability proportional to $\delta$ (two or more would be $o(\delta)$) that can result in a change of state. Let us assume that $1/\delta$ is an integer number. Let $\partial_j$ be the neighbors of node $j$ and $\partial_j^s \subseteq \partial_j$ be the set of neighbors of node $j$ that are susceptible. Over each of the $1/\delta$ sub-intervals between two decision epochs, each node $k$ can meet a neighbor with probability $\beta\delta$, and infect/ignite a susceptible neighbor $j \in \partial_k^s$, or make no contact, with probability $1 - \beta\delta$. Let us assume that:
- $x_j[t] = 0$ corresponds to the susceptible/vulnerable state;
- $x_j[t] = 1$ is infected/on fire;
- $x_j[t] = 2$ is dead/burnt;
- $x_j[t] = 3$ is vaccinated/extinguished;
- $x_j[t] = 4$ is recovered (this state does not exist in the case of wildfire).
Let us also introduce the random indicator variable $m_{kj}[\tau]$ such that:

$$m_{kj}[\tau] \;=\; \begin{cases} 1 & \text{w.p. } \beta\delta, \;\text{ if } x_k[t + \tau\delta] = 1 \text{ and } j \in \partial_k^s \\ 0 & \text{otherwise} \end{cases} \tag{29}$$

and $m_{kj}[\tau] = 0$ if $x_k[t + \tau\delta] \neq 1$, so that only infected (burning) nodes can spread the phenomenon. Also, we can introduce the node action $a_j[t]$ such that $a_j[t] = 1$ if $j$ is vaccinated or the cell fire is extinguished at time $t$, and $a_j[t] = 0$ else. For the pandemic, we can write:

$$x_j[t + (\tau+1)\delta] \;=\; \begin{cases} 3, & \text{if } x_j[t + \tau\delta] = 0 \text{ and } a_j[t] = 1 \\ 1, & \text{if } x_j[t + \tau\delta] = 0,\; a_j[t] = 0, \text{ and } \sum_{k \in \partial_j} m_{kj}[\tau] \ge 1 \\ 2, & \text{w.p. } p_{d,j}\,\delta, \;\text{ if } x_j[t + \tau\delta] = 1 \\ 4, & \text{w.p. } p_{r,j}\,\delta, \;\text{ if } x_j[t + \tau\delta] = 1 \\ x_j[t + \tau\delta], & \text{otherwise} \end{cases} \tag{30}$$
for $\tau = 0, \dots, 1/\delta - 1$, where the death and recovery probabilities, say $p_{d,j}$ and $p_{r,j}$, are nodal characteristics and $1 - (p_{d,j} + p_{r,j})\delta$ is the probability that the agent remains infected over a sub-interval. Note that states 2, 3, and 4 are absorbing states. For the wildfire, the states evolve similarly, with the main difference that state 4 does not exist, and the fire gets extinguished while burning, i.e., while $x_j[t + \tau\delta] = 1$:
$$x_j[t + (\tau+1)\delta] \;=\; \begin{cases} 1, & \text{if } x_j[t + \tau\delta] = 0 \text{ and } \sum_{k \in \partial_j} m_{kj}[\tau] \ge 1 \\ 3, & \text{if } x_j[t + \tau\delta] = 1 \text{ and } a_j[t] = 1 \\ 2, & \text{w.p. } p_{b,j}\,\delta, \;\text{ if } x_j[t + \tau\delta] = 1 \text{ and } a_j[t] = 0 \\ x_j[t + \tau\delta], & \text{otherwise} \end{cases} \tag{31}$$

where $p_{b,j}$ denotes the burnout probability per unit time of cell $j$.
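A compact sketch of one decision epoch of these dynamics is given below for the pandemic variant of (29)-(30); the wildfire variant (31) differs only in which state the action maps to and in the absence of recovery. The dictionary-based data layout and the single-contact-per-sub-interval sampling are implementation choices of ours, not prescribed by the model.

```python
import numpy as np

# State coding from this section: 0 susceptible/vulnerable, 1 infected/on
# fire, 2 dead/burnt, 3 vaccinated/extinguished, 4 recovered.
S, I, D, V, REC = 0, 1, 2, 3, 4

def epoch_step(x, nbrs, a, beta, p_d, p_r, delta=0.1):
    """One decision epoch of the pandemic dynamics (Eqns. 29-30), sketched.

    x: dict node -> state, nbrs: dict node -> list of neighbors,
    a: dict node -> 1 if the node is vaccinated at this epoch,
    p_d, p_r: dicts of nodal death/recovery probabilities per unit time.
    """
    for j, act in a.items():                     # vaccination: state 0 -> 3
        if act == 1 and x[j] == S:
            x[j] = V
    for _ in range(int(1 / delta)):              # the 1/delta sub-intervals
        new_x = dict(x)
        for k, state in x.items():
            if state != I:
                continue                         # only infected nodes evolve/spread
            u = np.random.random()
            if u < p_d[k] * delta:               # death: 1 -> 2
                new_x[k] = D
            elif u < (p_d[k] + p_r[k]) * delta:  # recovery: 1 -> 4
                new_x[k] = REC
            elif nbrs[k] and np.random.random() < beta * delta:
                j = nbrs[k][np.random.randint(len(nbrs[k]))]
                if x[j] == S:                    # contact indicator, Eqn. (29)
                    new_x[j] = I
        x = new_x
    return x
```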
IV-B Case Study I: Pandemic Response
IV-B1 Background
The first case study for our framework is that of pandemic response, which has gained significant attention due to the COVID-19 pandemic. Much of the research in this field centers on finding the best ways to distribute vaccines, but the same principles and techniques can also be applied to other resources, such as hospital beds and medical personnel, as well as to determining the most effective policies for stay-at-home orders and travel restrictions.
In general, two different approaches are employed to model the infection dynamics. In compartmental models, the population is divided into compartments (e.g., susceptible, exposed, infected, recovered), where each individual in a compartment is modeled as identical to the others in terms of contact probability, etc. In contrast, in agent-based simulation models, the infections are simulated via computationally expensive methods where individuals are modeled more heterogeneously, and the studies that use these models mostly focus on comparing a number of predetermined vaccination strategies [17]. In the paper [18], the authors compare the outcomes of optimal vaccine distribution under both compartmental and agent-based simulation models. They found that policies developed using more realistic agent-based simulations were more effective in reducing the overall number of infections. However, compartmental models are still commonly used because they are more amenable to mathematical optimization. Prior research in this field has mainly focused on influenza pandemics. In [19], the authors study which age group should be prioritized in influenza vaccinations; the result is that the optimal strategy prioritizes schoolchildren and adults aged 30 to 39 years. The work in [20] proposes an optimization formulation for vaccine distribution in developing countries with detailed supply-chain constraints. In [21], the authors study optimal influenza vaccine distribution by considering heterogeneous populations with multiple subgroups based on geographic location and age, as well as fairness as a criterion. In that formulation, the authors model the population dynamics by also incorporating the contact between different age groups, and the optimization aims to keep the reproduction number (i.e., expected secondary infection rate) under $1$ with the minimum amount of vaccine use. In [22], the authors estimate social contact matrices among different age groups. Also related are [23], which studies epidemic control over short time horizons, and [24], which investigates an optimal vaccine allocation problem over multiple populations to maximize the benefits of herd immunity. For the other studies in this area, we refer the interested reader to the survey in [17] and references therein. In recent studies, the attention naturally shifted towards COVID-19. In [25], the authors investigate optimal static and dynamic allocation policies for COVID-19 vaccinations over different age groups and show that dynamic policies provide a significant improvement over static policies. In [26], the optimal allocation over different geographic locations and age groups is investigated using an epidemiological model called DELPHI. The study uses real data from the United States and specifies which states and age groups should be prioritized. Finally, [27] provides an analysis of COVID-19 vaccine prioritization based on age-based mortality rates and occupation-based exposure risks.
IV-B2 Modeling Details
We employ agent-based simulations due to their capability of highlighting the heterogeneity of populations. In our simulations, every agent is characterized by a state variable, $x_j[t]$, that indicates whether they are susceptible, infected, etc., as well as a fixed age-group variable that indicates which age group they belong to: teen, adult, or elderly. The simulations iterate random social interactions over a graph network based on the rules outlined in (30). The death probability, $p_{d,j}$, of the adult group is set to be 10 times higher than that of the teen group and one-tenth that of the elderly group. Recovery times, which are obtained from a Poisson distribution with a certain mean time after infection, do not vary by age group. The model is implemented using the Python library Mesa [28, 29].
The underlying social graphs are produced as follows. First, we form families between teens and some of the adult population, and we connect all teens to each other, imagining a school setting. Later, we connect the elements in the union of the elderly and adult node sets randomly in an Erdős–Rényi fashion. In addition, we ensure that each graph we use is connected, i.e., there are no isolated nodes. An example graph is illustrated in Fig. 2.
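The following NetworkX sketch mirrors this construction; the exact family structure (two adult contacts per teen) and the Erdős–Rényi probability are our illustrative assumptions, as the construction details beyond the description above are not fixed by the paper.

```python
import networkx as nx
import random

def make_location_graph(n_teen, n_adult, n_elderly, p_er=0.05, seed=0):
    """Layered social graph: teen-adult family links, a school clique over
    the teens, and Erdos-Renyi links among adults and elderly (sketch)."""
    rng = random.Random(seed)
    teens = list(range(n_teen))
    adults = list(range(n_teen, n_teen + n_adult))
    elderly = list(range(n_teen + n_adult, n_teen + n_adult + n_elderly))
    G = nx.Graph()
    G.add_nodes_from(teens + adults + elderly)
    for t in teens:                      # families: link each teen to adults
        G.add_edges_from((t, rng.choice(adults)) for _ in range(2))
    G.add_edges_from(nx.complete_graph(teens).edges)   # school setting
    er = nx.erdos_renyi_graph(len(adults) + len(elderly), p_er, seed=seed)
    mapping = dict(enumerate(adults + elderly))
    G.add_edges_from(nx.relabel_nodes(er, mapping).edges)
    comps = list(nx.connected_components(G))           # enforce connectivity,
    for c in comps[1:]:                                # as assumed in the text
        G.add_edge(next(iter(comps[0])), next(iter(c)))
    return G
```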
[Fig. 2: An example social interaction graph.]
Once the simulation tool is developed as described above, we can use it to solve the local problem (7), the task of selecting which nodes to vaccinate with a total supply of $R_i$ per time unit. In this problem, $a_j[t] = 1$ denotes the action of vaccinating node $j$ at time $t$, and $\sum_j a_j[t] \le R_i$ for all $t$. The goal is to find $\pi_i$, a randomized policy for determining which nodes to vaccinate by observing the state $s_i[t]$ (the collection of the node states $x_j[t]$), with the objective of maximizing the total expected utility over a horizon of length $T$. In this case study, we set the utility as the negative of the new deaths in location $i$ at time $t$, i.e., the policy aims to minimize the total (technically, the discounted sum with discount factor $\gamma$) number of deaths over the horizon.
To find a near-optimal policy that maximizes the total utility at location $i$, we employ a deep reinforcement learning (DRL) model. To train our DRL model, we utilize the Proximal Policy Optimization (PPO) algorithm with invalid action masking [30]. PPO is a policy gradient method for reinforcement learning (see [31] for details), and invalid action masking helps ensure the model takes feasible actions during training and execution. As the action space, we use the set of all nodes, where, with the use of masking, the set of actionable nodes is reduced to only those in a susceptible state, i.e., we take infected, dead, recovered, or already vaccinated nodes out. As for the neural network architecture, we use a single-layer Graph Convolutional Network (GCN) [32] and a single linear layer as the shared feature extractor network, followed by 2-layer multi-layer perceptron (MLP) networks for both the policy and value networks. As the activation function, ReLU is used throughout the model. The implementation is done in Python utilizing libraries such as Stable Baselines 3 [33], OpenAI Gym [34], and PyTorch Geometric [35].
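A minimal sketch of such a masked-PPO setup is shown below using MaskablePPO from the sb3-contrib companion package of Stable Baselines 3, which is one concrete implementation of invalid action masking [30]; `PandemicEnv` and its `node_states` attribute are hypothetical stand-ins for our Gym wrapper around the Mesa simulation, and the training budget is illustrative.

```python
import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

SUSCEPTIBLE = 0  # state coding from Sec. IV-A

def mask_fn(env):
    """Action j (vaccinate node j) is valid only if node j is susceptible."""
    return np.array([s == SUSCEPTIBLE for s in env.node_states])

# PandemicEnv is a hypothetical Gym wrapper around the Mesa simulation,
# exposing a `node_states` list and a Discrete(n_nodes) action space.
env = ActionMasker(PandemicEnv(), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=0)
# A GCN-based feature extractor, as used in the paper, can be plugged in
# through policy_kwargs=dict(features_extractor_class=...), which SB3 supports.
model.learn(total_timesteps=200_000)
model.save("vaccination_policy")
```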
Recently, there have been papers [36, 37, 38, 39] that use DRL to study vaccine allocation. In these studies, the vaccination policy searched for by DRL is typically restricted to group-based allocation, such as age-group-based vaccination. While this is a practical strategy for limiting the action space and making DRL training easier, we believe that making each susceptible node a separate potential action utilizes the detailed heterogeneous simulation models more effectively. For example, with our model parameters where the elderly group has a 10x and 100x higher mortality rate than the adult and teenager groups, respectively, if we restrict the action space to selecting age groups and then vaccinating randomly within the group, the policy converges to an elderly-first vaccination policy since mortality rate directly impacts the total number of deaths significantly. However, according to our observations, it appears that our DRL model selects nodes based on their age group, their node degree (how connected they are to others), and their distance to infected nodes. The extra freedom given to the model in taking actions opens the door to more innovative strategies.
IV-B3 System Integration and Numerical Examples
In this case study, we generate 5 locations with distinct underlying graphs and demographic characteristics, as outlined in Table I. The table shows the number of nodes in each group, as well as the average degree of the elderly group (EAD) and the average degree excluding the teenager group (AD). The teenager group is excluded because of the high level of connectivity among them, which is a result of the school setting assumed during graph creation.
Table I: Demographic and connectivity characteristics of the generated locations.

Location | # teen | # adult | # elderly | EAD | AD
---|---|---|---|---|---
Loc.1 | 20 | 50 | 30 | 7.56 | 8.24
Loc.2 | 30 | 60 | 10 | 6.7 | 7.97
Loc.3 | 20 | 60 | 20 | 8.55 | 8.3
Loc.4 | 20 | 60 | 20 | 8.7 | 9.05
Loc.5 | 30 | 60 | 10 | 7.3 | 7.48
[Fig. 3: Average number of deaths under each policy, normalized by the no-vaccination policy, for the five locations.]
In Fig. 3, we compare the performance of various policies, namely, no vaccination, an old-first policy (which randomly vaccinates the elderly group first, the adult group second, and finally the teenager population), uniformly random selection among susceptible nodes, and our DRL model. All these policies are run for 50 time units, where each policy applies 1 vaccine per time unit. In addition, the contact probability of two nodes sharing an edge is 0.02, the per-time-unit death probabilities of the age groups follow the 1:10:100 teen:adult:elderly ratio given earlier, and the average recovery time is the same for all infected. We start all the experiments with 5 randomly infected nodes and run each policy 10000 times. On the y-axis, we plot the average number of deaths at the end of 50 time units, normalized with respect to the no vaccination policy. The results are consistent across different locations. Our results show that even if vaccinations are given out randomly, they still greatly lower the number of deaths on average. The old-first policy, which is intuitively a good strategy (especially with our mortality rate parameters across different age groups), also provides a significant improvement compared to the random policy. Our DRL model convincingly outperforms the old-first policy across all locations. It is worth noting that the RL models presented for each location are relatively simple, and each has been trained with random initialization (randomly infected nodes at the start) for about 90 minutes on a single Intel Xeon 6226R CPU. When we train a model with fixed initialization (the same nodes are infected at the start of each run) for the same amount of time, for example, training becomes easier, and it achieves roughly a 10 percent reduction in the average number of deaths (in the environment with the same fixed initialization) compared to the model trained with random initialization. There is room for improvement on these results with more training time, usage of parallel computing and GPUs, different network architectures, etc. Since our aim here is to illustrate the proposed method in a case study, we opted for a DRL model that is fairly simple to reproduce and can handle different initializations.
After training DRL models for each location, we can solve the inner optimizations to obtain $U_i(R_i; s_i[0])$ for a given allocation $R_i$, i.e., we now know how to distribute $R_i$ to the population at location $i$ efficiently. $U_i$ in this case denotes the total utility achievable with $R_i$, which is the only value needed from the local optimizations to solve the higher-layer allocations. This value hides all the intricacies of the complex dynamics at locality $i$. Next, we use our simulation tool and DRL model to generate samples of $U_i$ for different $R_i$ values. At this stage, we do not train our model any further; we only record its performance. Once we have collected these samples, we use the concave and non-decreasing interpolation to obtain a piecewise linear approximate utility function as described in Sec. III-C. An example run is shown in Fig. 4. In this example, we take samples for all possible $R_i$ values between 0 and 5, so we do not use the approximate function to interpolate but rather to enforce concavity if some samples do not follow a concave behavior.
[Fig. 4: Sampled utilities and the fitted concave, non-decreasing piecewise linear approximation for an example location.]
Finally, the results obtained from the local models are integrated utilizing the methodology outlined in Eqns. (11)-(12) in order to arrive at higher-layer allocations. An exemplary execution of the complete system, with a look-ahead horizon length of 10 and a higher-layer allocation update period of 5 time units, is presented in Fig. 5 and Fig. 6. In a nutshell, the example system works as follows. In the beginning, 5 nodes are randomly infected at each location. By looking at a horizon of 10 days, the system makes higher-layer allocations, wherein resources are primarily directed towards locations with a higher concentration of elderly individuals. Subsequently, the local DRL systems employ these allocations to vaccinate nodes in accordance with their own models. After the 5th day, the local models observe their own location, make predictions based on this information for the next 10 days, and share their $U_i$ curves after the concave interpolation. The upper layer subsequently utilizes this information to calculate updated allocations, and this process is repeated iteratively on a rolling horizon basis. When we run this setup repeatedly, we observe unique realizations and, as a result, unique higher-level allocations. Even though the allocations are skewed towards the locations with more elderly at the beginning, as expected, the system adapts effectively to the underlying realizations and manages to keep death numbers under control. For instance, in the presented example, the system stops allocating to Loc. 1, which has the highest elderly population, completely after the tenth day, and redirects resources to the other locations upon observing that they are in a statistically more precarious situation. It is important to note that, due to the utilization of numerous approximations, it is not possible to make claims of optimality with regard to the system under consideration. Nevertheless, the system gives an impression of efficient resource allocation according to our observations.
[Fig. 5: An exemplary execution of the complete system: higher-layer allocations over time.]
[Fig. 6: An exemplary execution of the complete system: evolution of infections and deaths at the five locations.]
IV-C Case Study II: Wildfire Response
IV-C1 Background
As a second case study, we examine the problem of wildfire response. In this case, we aim to incorporate the dynamics of wildfire propagation across different regions as an input for optimizing response and resource allocation for extinguishing fires. To effectively address this problem, it is crucial to have a stochastic model for the evolution of wildfires in order to find the best policies through dynamic optimization. Wildfire propagation has been widely studied in the literature. Different approaches consider various factors affecting the fire spread, such as fuel type (type of vegetation), humidity, wind speed and direction, forest topography (slope and natural barriers), and fuel continuity (vegetation thickness) [40]. The earlier works (e.g., [41]) primarily focused on developing dynamic equations that capture the physical nature of the propagation through controlled laboratory experiments. When it comes to characterizing propagation over a large geographical area with more randomness, approaches can be classified into two types in terms of spatial representation: 1) models based on continuous planes, and 2) models based on grid representation. We adopt the latter in our modeling, due to its relative computational efficiency. A popular approach in grid-based methods is using a cellular automaton (CA) model. This method involves dividing a finite area into a large number of grid cells, where the evolution of the state in each cell is based on the state of its nearby neighbors and its local state through a set of rules that map these state values into transition probabilities. This discrete model allows for efficient computation and simulation while also accounting for the stochasticity of state transitions. Examples of works that use CA models include [42, 40, 43].
IV-C2 Modeling Details
Similar to the pandemic response, we again employ an agent-based simulation approach to evaluate wildfire response strategies. The simulation model employed is consistent with the methodology described in Section IV-A. The underlying graph structure of the simulation is a mesh grid neighborhood network, wherein each cell, denoted by $j$, possesses a state variable, $x_j[t]$, that reflects its condition, which can be one of four possibilities: vulnerable, ignited, burnt, or extinguished. Additionally, each cell possesses unique characteristics, such as vegetation density and type, that are taken into account within the simulation. The agent-based simulation tool employed is able to simulate the dynamic evolution of these cell states, as described by the mathematical formulation presented in (31).
[Fig. 7: Snapshots of a wildfire simulation run with strong eastward wind.]
The probability of cell $j$ being ignited due to the fire in neighboring cell $k$ (represented by $\beta_{kj}$) is influenced by factors such as the vegetation type and density in cell $j$, and the wind speed and direction. We model this propagation probability using a multiplicative function that takes into account multiple factors, similar to the approach in [44]:

$$\beta_{kj} \;=\; p_0 \; p_{\mathrm{veg}}(j) \; p_{\mathrm{den}}(j) \; p_w(k, j) \tag{32}$$

where $p_0$ represents a normalization constant, $p_{\mathrm{veg}}(j)$ is a constant dependent on the type of vegetation present in cell $j$, $p_{\mathrm{den}}(j)$ is a constant dependent on the density of vegetation within cell $j$, and $p_w(k, j)$ is a constant that is contingent upon the wind direction and speed. An illustration of a simulation run with strong eastward wind is shown in Fig. 7.
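As an illustration, the multiplicative model of (32) can be evaluated as in the sketch below; the exponential form of the wind factor and the constants $c_1$, $c_2$ follow cellular-automaton wildfire models such as [44], and the specific values shown are illustrative assumptions.

```python
import math

def wind_factor(V, theta, c1=0.045, c2=0.131):
    """Wind effect p_w(k, j): V is the wind speed and theta the angle between
    the wind direction and the k->j direction.  The exponential form and the
    constants follow CA wildfire models (e.g., [44]); values are illustrative."""
    return math.exp(V * (c1 + c2 * (math.cos(theta) - 1.0)))

def ignition_prob(p0, p_veg_j, p_den_j, p_w_kj):
    """Multiplicative ignition model of Eqn. (32), clipped so that the result
    remains a valid probability."""
    return min(1.0, p0 * p_veg_j * p_den_j * p_w_kj)
```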
We use this simulation tool to address the inner problem outlined in (7), which in the context of this case study pertains to determining the optimal cells to extinguish in order to maximize the discounted sum of utilities over a given horizon, given the allocation of firefighting units $R_i$ to location $i$. In contrast to the pandemic response scenario, there are additional constraints on the action of extinguishing cell $j$, $a_j[t]$, in the context of wildfire management. These constraints stem from the movement limitations of the firefighters. Specifically, we impose a restriction on the potential travel distance of a firefighting unit at each discrete time step, and each unit can only be at one specific location at a time. As a result, the problem can then be formulated as identifying the optimal trajectories for the allocated firefighting units such that the cells that are on fire and situated within these trajectories are extinguished.
We incorporate the movement constraints directly into the simulation tool, ensuring that every action selected by the DRL model is a feasible movement. To determine the optimal movement of the firefighting units, we employ the PPO algorithm to train the DRL model. The neural network architecture comprises a two-layer Convolutional Neural Network (CNN) and a single linear layer serving as the shared feature extractor network. This is followed by 2-layer MLP networks for both the policy and value networks. The ReLU activation function is employed throughout the model. The implementation is carried out in Python using libraries such as Stable Baselines 3 and OpenAI Gym.
IV-C3 System Integration and Numerical Examples
In this evaluation, we created two locations with different wind and vegetation characteristics, as shown in Table II. The table displays the wind direction, speed (as a percentage of the typical maximum), and the normalized average vegetation coefficient (NAVegC). The NAVegC indicates the average impact of vegetation type and density on the rate of wildfire spread in a location, using a normalized scale. It is important to note that the vegetation density and type vary between cells, and the NAVegC is an average representation.
Table II: Wind and vegetation characteristics of the two locations.

Location | Wind Dir. | Wind Speed | NAVegC
---|---|---|---
Loc.1 | | | 1
Loc.2 | - | | 0.75
[Fig. 8: Total discounted reward versus the number of allocated firefighting units for the two locations.]
We use a multi-objective reward function to incentivize extinguishing cells that are on fire while penalizing the overall spread of fire at each time unit. The two locations are grids of equal size. We fix the discount factor and train the DRL model for 15 minutes on a single CPU. This simple training provides us with a fairly effective and intuitive firefighting behavior, which allows us to calculate $U_i(R_i; s_i[0])$. An example of this is shown in Fig. 8, where we observe a diminishing return in the total discounted reward with respect to the allocated resources. The environmental conditions at Loc. 1, including stronger winds and a specific vegetation profile, make it more susceptible to fire spread. Consequently, we have observed higher levels of total utility in fire suppression efforts at this location, as the DRL model employed is able to more effectively identify and navigate routes that lead to the extinguishment of ablaze cells in a shorter time frame. Furthermore, deploying a greater number of agents at this location has a more substantial impact on fire suppression efforts, in comparison to Loc. 2.
As the next step, we integrate the results from the local systems to determine higher-level allocations in a rolling window manner, similar to the approach used in the pandemic response. This integration is carried out using the system outlined in Eqns. (11)-(12). An example of the higher-layer allocation process can be seen in Figs. 9 and 10. In this exemplary system, the look-ahead horizon is set to 24 time units and the allocation update period is 10 time units. The allocations are based on both the actual fire spread and predictions, resulting in unique outcomes for each run. In this specific run, it is observed that the fire spreads significantly more in the first location, leading to the reallocation of one unit from Loc. 2 to Loc. 1. Clearly, this example is just a simple exposition, and in a real-life implementation, there would be significant practical considerations to take into account. Nevertheless, it illustrates the theoretical approach of our study.
[Fig. 9: Fire evolution at the two locations in an example run.]
[Fig. 10: Higher-layer allocation of firefighting units over time in the same run.]
V Conclusions
We present a methodology for addressing the complex problem of stochastic dynamic network utility maximization in the context of resource allocation, with a specific emphasis on the domain of disaster management. By taking into account the increasingly prevalent reality of the heterogeneous and hierarchical nature of large-scale incident response, the proposed approach utilizes a divide-and-conquer strategy through the implementation of a primal-dual framework to decompose the problem into more manageable localities. The local allocation of resources to individual entities is then addressed through the utilization of advanced deep reinforcement learning algorithms, wherein agent-based simulations are employed to model the underlying dynamics of the disaster scenario. These locally-derived solutions are informed by both real-time data from the ground as well as predictive forecasts, and the proposed methodology also incorporates a market mechanism for higher-level resource allocations. As such, the proposed approach does not necessitate a detailed understanding of the internal workings of the local systems at the upper level, with local entities effectively bidding for resources while the upper level dynamically sets pricing.
In this study, a comprehensive examination of the proposed methodology is presented, including a thorough exposition of the underlying theoretical foundations. The effectiveness of the method is subsequently demonstrated through the case studies of two distinct scenarios, specifically pandemic response and wildfire response. The results of the study provide insight into the potential utility of the proposed methodology in real-world applications.
References
- [1] F. P. Kelly, A. K. Maulloo, and D. K. H. Tan, “Rate control for communication networks: shadow prices, proportional fairness and stability,” Journal of the Operational Research Soc., vol. 49, no. 3, pp. 237–252, 1998.
- [2] S. H. Low and D. E. Lapsley, “Optimization flow control I: Basic algorithm and convergence,” IEEE/ACM Transactions on Networking, vol. 7, no. 6, pp. 861–874, 1999.
- [3] D. P. Palomar and M. Chiang, “A tutorial on decomposition methods for network utility maximization,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 8, pp. 1439–1451, 2006.
- [4] N. Karakoç, A. Scaglione, A. Nedić, and M. Reisslein, “Multi-layer decomposition of network utility maximization problems,” IEEE/ACM Transactions on Networking, vol. 28, no. 5, pp. 2077–2091, 2020.
- [5] N. Karakoc, A. Scaglione, and A. Nedić, “Multi-layer decomposition of optimal resource sharing problems,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 178–183.
- [6] N. Karakoç and A. Scaglione, “Federated network utility maximization,” in 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), 2020, pp. 1–6.
- [7] N. Karakoç, A. Scaglione, M. Reisslein, and R. Wu, “Federated edge network utility maximization for a multi-server system: Algorithm and convergence,” IEEE/ACM Transactions on Networking, 2022.
- [8] M. J. Neely, E. Modiano, and C. E. Rohrs, “Dynamic power allocation and routing for time varying wireless networks,” in IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), vol. 1. IEEE, 2003, pp. 745–755.
- [9] M. Wang, N. Karakoc, L. Ferrari, P. Shantharama, A. S. Thyagaturu, M. Reisslein, and A. Scaglione, “A multi-layer multi-timescale network utility maximization framework for the SDN-based layback architecture enabling wireless backhaul resource sharing,” Electronics, vol. 8, no. 9, p. 937, 2019.
- [10] L. Ferrari, N. Karakoc, A. Scaglione, M. Reisslein, and A. Thyagaturu, “Layered cooperative resource sharing at a wireless SDN backhaul,” in 2018 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 2018, pp. 1–6.
- [11] N. Trichakis, A. Zymnis, and S. Boyd, “Dynamic network utility maximization with delivery contracts,” IFAC Proceedings Volumes, vol. 41, no. 2, pp. 2907–2912, 2008.
- [12] E. Wei, A. Ozdaglar, A. Eryilmaz, and A. Jadbabaie, “A distributed newton method for dynamic network utility maximization with delivery contracts,” in 2012 46th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2012, pp. 1–6.
- [13] K. Ma, P. Wang, T. Ramachandran, J. Lian, and D. J. Hammerstrom, “Optimal iterative method for network utility maximization with intertemporal dynamic constraints,” in 2019 IEEE Conference on Control Technology and Applications (CCTA). IEEE, 2019, pp. 543–548.
- [14] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle, “Layering as optimization decomposition: A mathematical theory of network architectures,” Proceedings of the IEEE, vol. 95, no. 1, pp. 255–312, 2007.
- [15] D. Bertsekas, Reinforcement learning and optimal control. Athena Scientific, 2019.
- [16] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
- [17] L. E. Duijzer, W. van Jaarsveld, and R. Dekker, “Literature review: The vaccine supply chain,” European Journal of Operational Research, vol. 268, no. 1, pp. 174–192, 2018.
- [18] Ö. O. Dalgıç, O. Y. Özaltın, W. A. Ciccotelli, and F. S. Erenay, “Deriving effective vaccine allocation strategies for pandemic influenza: Comparison of an agent-based simulation and a compartmental model,” PloS one, vol. 12, no. 2, p. e0172261, 2017.
- [19] J. Medlock and A. P. Galvani, “Optimizing influenza vaccine distribution,” Science, vol. 325, no. 5948, pp. 1705–1708, 2009.
- [20] S.-I. Chen, B. A. Norman, J. Rajgopal, T. M. Assi, B. Y. Lee, and S. T. Brown, “A planning model for the WHO-EPI vaccine distribution network in developing countries,” IIE Transactions, vol. 46, no. 8, pp. 853–865, 2014.
- [21] S. Enayati and O. Y. Özaltın, “Optimal influenza vaccine distribution with equity,” European Journal of Operational Research, vol. 283, no. 2, pp. 714–725, 2020.
- [22] K. T. Eames, N. L. Tilston, E. Brooks-Pollock, and W. J. Edmunds, “Measured dynamic social contact patterns explain the spread of H1N1v influenza,” PLoS Comput Biol, vol. 8, no. 3, p. e1002425, 2012.
- [23] G. S. Zaric and M. L. Brandeau, “Resource allocation for epidemic control over short time horizons,” Mathematical Biosciences, vol. 171, no. 1, pp. 33–58, 2001.
- [24] E. Duijzer, W. van Jaarsveld, J. Wallinga, and R. Dekker, “The most efficient critical vaccination coverage and its equivalence with maximizing the herd effect,” Mathematical biosciences, vol. 282, pp. 68–81, 2016.
- [25] X. Chen, M. Li, D. Simchi-Levi, and T. Zhao, “Allocation of COVID-19 vaccines under limited supply,” Available at SSRN 3678986, 2020.
- [26] D. Bertsimas, J. K. Ivanhoe, A. Jacquillat, M. L. Li, A. Previero, O. S. Lami, and H. T. Bouardi, “Optimizing vaccine allocation to combat the COVID-19 pandemic,” medRxiv, 2020.
- [27] A. Babus, S. Das, and S. Lee, “The optimal allocation of COVID-19 vaccines,” medRxiv, 2020.
- [28] J. Kazil, D. Masad, and A. Crooks, “Utilizing python for agent-based modeling: The mesa framework,” in Social, Cultural, and Behavioral Modeling, R. Thomson, H. Bisgin, C. Dancy, A. Hyder, and M. Hussain, Eds. Cham: Springer International Publishing, 2020, pp. 308–317.
- [29] “A network agent based infection model with mesa.” [Online]. Available: https://dmnfarrell.github.io/bioinformatics/abm-mesa-network
- [30] S. Huang and S. Ontañón, “A closer look at invalid action masking in policy gradient algorithms,” arXiv preprint arXiv:2006.14171, 2020.
- [31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- [32] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
- [33] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann, “Stable baselines3,” https://github.com/DLR-RM/stable-baselines3, 2019.
- [34] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
- [35] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch geometric,” arXiv preprint arXiv:1903.02428, 2019.
- [36] F. Trad and S. El Falou, “Towards using deep reinforcement learning for better covid-19 vaccine distribution strategies,” in 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA). IEEE, 2022, pp. 7–12.
- [37] X. Wei, C. Pu, Z. He, and P. Liò, “Deep reinforcement learning-based vaccine distribution strategies,” in 2021 2nd International Conference on Electronics, Communications and Information Technology (CECIT). IEEE, 2021, pp. 427–436.
- [38] S. Bushaj, X. Yin, A. Beqiri, D. Andrews, and İ. E. Büyüktahtakın, “A simulation-deep reinforcement learning (sirl) approach for epidemic control optimization,” Annals of Operations Research, pp. 1–33, 2022.
- [39] L. Ling, W. U. Mondal, and S. V. Ukkusuri, “Cooperating graph neural networks with deep reinforcement learning for vaccine prioritization,” IEEE Journal of Biomedical and Health Informatics, 2024.
- [40] A. Alexandridis, D. Vakalis, C. I. Siettos, and G. V. Bafas, “A cellular automata model for forest fire spread prediction: The case of the wildfire that swept through Spetses Island in 1990,” Applied Mathematics and Computation, vol. 204, no. 1, pp. 191–201, 2008.
- [41] R. C. Rothermel, A mathematical model for predicting fire spread in wildland fuels. Intermountain Forest & Range Experiment Station, Forest Service, US …, 1972, vol. 115.
- [42] I. Karafyllidis and A. Thanailakis, “A model for predicting forest fire spreading using cellular automata,” Ecological Modelling, vol. 99, no. 1, pp. 87–97, 1997.
- [43] J. G. Freire and C. C. DaCamara, “Using cellular automata to simulate wildfire propagation and to assist in fire management,” Natural Hazards and Earth System Sciences, vol. 19, no. 1, pp. 169–179, 2019.
- [44] A. Alexandridis, L. Russo, D. Vakalis, G. Bafas, and C. Siettos, “Wildland fire spread modelling using cellular automata: evolution in large-scale spatially heterogeneous environments under fire suppression tactics,” International Journal of Wildland Fire, vol. 20, no. 5, pp. 633–647, 2011.