
Data-driven Optimization for Drone Delivery Service Planning
with Online Demand

Aditya Paul aditya.paul@unsw.edu.au Michael W. Levin mlevin@umn.edu S. Travis Waller s.travis.waller@gmail.com David Rey david.rey@skema.edu
Abstract

In this study, we develop an innovative data-driven optimization approach to solve the drone delivery service planning problem with online demand. Drone-based logistics are expected to improve operations by enhancing flexibility and reducing congestion effects induced by last-mile deliveries. With rising digitalization and urbanization, however, logistics service providers are constantly grappling with the challenge of uncertain real-time demand. This study investigates the problem of planning drone delivery service through an urban air traffic network to fulfil online and stochastic demand. Customer requests – if accepted – generate profit and are serviced by individual drone flights as per request origins, destinations and time windows. We cast this stochastic optimization problem as a Markov decision process. We present a novel data-driven optimization approach which generates predictive prescriptions of parameters of a surrogate optimization formulation. Our solution method consists of synthesizing training data via lookahead simulations to train a supervised machine learning model for predicting relative link priority based on the state of the network. This knowledge is then leveraged to selectively create weighted reserve capacity in the network via a surrogate objective function that controls the trade-off between reserve capacity and immediate profit, with the goal of maximizing the cumulative profit earned. Using numerical experiments based on benchmark transportation networks, the resulting data-driven optimization policy is shown to outperform a myopic policy. Sensitivity analyses on learning parameters reveal insights into the design of efficient policies for drone delivery service planning with online demand.

keywords:
Stochastic processes, Data-driven optimization, Online demand, Drone delivery, Service planning
Affiliations:

[1] School of Civil and Environmental Engineering, UNSW Sydney, NSW 2052, Sydney, Australia

[2] Lighthouse Professorship “Transport Modelling and Simulation”, Faculty of Transport and Traffic Sciences, Technische Universität Dresden, Germany

[3] Department of Civil, Environmental, and Geo-Engineering, University of Minnesota, Minneapolis 55455, USA

[4] SKEMA Business School, Université Côte d’Azur, Sophia Antipolis, France

1 Introduction

Unmanned Aerial Vehicles (UAVs) have been growing in both prevalence and application. UAVs provide the advantage of performing a task aerially with limited or no human supervision. Additionally, most UAV designs are light-weight and can be deployed in large numbers. These features cause UAVs to be in high demand in many different domains, particularly logistics. UAV applications broadly fall into two categories: either some form of data acquisition and transmission (also called monitoring or reconnaissance), or payload pickup and delivery. This study falls into the second category and addresses the drone delivery service planning problem (DDSP) under stochastic online demand. The DDSP takes the perspective of a transportation network manager that plans service for drone deliveries. We consider discrete time steps grouped into time intervals. At each time step, customers may submit delivery requests that are to be served by a fleet of UAVs subject to time windows and link capacity constraints. The probability distributions behind request parameters, and the request arrival process, are unknown to the network manager and estimated from historical data. Decisions are made at every time interval to accept or reject the requests that arrive and to route accepted requests. We explore the potential of data-driven optimization methods for this online DDSP problem.

In stochastic optimization problems, some or all parameters are subject to uncertainty. These parameters are not deterministic, but instead emerge from probability distributions or uncertainty sets. Traditional solution methods to stochastic problems exploit a priori knowledge of these probability distributions. Stochastic problems can further be sub-divided into two categories: (a) static, where all problem parameters are available beforehand, and some problem-specific condition determines if and when the stochastic parameters are instantiated; (b) dynamic, where problem parameter data is revealed over time. In contrast, online optimization makes no assumption about the nature of the uncertain data and focuses on the immediacy and real-time nature of incoming problem parameters. Online problems are always dynamic in nature. Nevertheless, in many cases, it is reasonable to presume that the probability distribution, or an estimate thereof, can be extracted either through historical data or predictive models (Van Hentenryck et al., 2010). From a computational standpoint, stochastic and online optimization problems lead to significant challenges. The rise of e-commerce has compelled logistic service providers to tackle online demand. In this study, we consider the setting of the static DDSP in the context of online demand, i.e. user requests that are revealed over time.

Our solution methodology is based on an innovative data-driven optimization and lies at the intersection of predictive analytics (machine learning) and prescriptive analytics (stochastic optimization). The literature on the combination of machine learning (ML) and stochastic optimization reveals two distinct avenues for a successful collaboration:

  1. ML-enabled optimization: Combinatorial optimization problems can be very hard to solve, and special techniques or methods can be used to solve a particular kind of problem or problem instance. Instead of relying on experts to create such techniques, algorithms and models can be developed to automatically learn from the implicit distribution of problem instances and adapt to those instances. ML has been found to be useful here.

  2. Prediction and optimization: Many real-world practical applications involve uncertain parameters which directly or indirectly affect the objective function, but which must first be predicted before optimization. In such predict-then-optimize situations, ML is being recognized as a vital tool to make accurate data-driven predictions.

Research into transportation service planning under online demand, with a dedicated integration of ML and mathematical programming (MP), is still in its nascent stages. In this study, we incorporate strategies from both these avenues to develop a customized data-driven optimization approach for the DDSP with online demand. This study makes the following contributions to the field.

  1. We extend the integer linear programming (ILP) formulation of the deterministic DDSP to incorporate online demand, i.e. delivery requests are revealed over time and routing decisions must be taken to accommodate accepted requests, and cast this problem as a Markov Decision Process (MDP).

  2. We propose a novel solution method for the online DDSP where predictive prescriptions are computed over time. Historical demand data, including spatial and temporal distributions as well as the request arrival process, are used to determine link-based features that inform a surrogate optimization formulation.

  3. We demonstrate through numerical experiments that our data-driven optimization approach outperforms a myopic optimization approach. Sensitivity analyses reveal how learning parameters affect optimal policies for the online DDSP.

Since we base our solution method on link-based features, our approach is applicable to other network-based logistics systems. The rest of the paper is organized as follows. In Section 2, we review the relevant literature. In Section 3, we present the online DDSP formulation and outline the MDP framework. Section 4 develops the solution methodology and the proposed algorithms. Section 5 reports the numerical experiments and their results. Our findings and research perspectives are discussed in Section 6.

2 Literature review

We examine the state of the art relevant to the online DDSP. Section 2.1 reviews recent advances in data-driven optimization, notably the interplay between ML and optimization. Section 2.2 records the advent of stochastic and online demand as well as the latest solution approaches arising in urban logistics. Finally, Section 2.3 outlines the widespread adoption of UAVs in commercial and industrial sectors.

2.1 Data-driven optimization

Machine learning (ML) has proved useful for optimization purposes along two lines: to assist directly within the optimization process by exploiting particular problem characteristics, or as a data-driven predictive framework to estimate uncertain parameters that are involved in the optimization formulation. We review these two streams of literature hereafter.

2.1.1 Machine learning-enabled optimization

It is common in optimization that a solution algorithm is tuned to a specific problem scenario by refining its parameters or policy to that scenario. Such tuning (or adaptation) need not come from research expertise and intuition. ML can assist an optimization algorithm to this end (Bengio et al., 2021). This direct assistance provided by ML can be analysed based on its motivation or on its structure. In terms of motivation, ML can either: (a) replace some heavy computation with a fast, reliable approximation or (b) explore the decision space (defined by the optimization model) and discover better policies. As for structure, the collaboration between ML and optimization can be sequential, whereby the ML model supports the optimization algorithm with valuable information. Or it can be parallel, wherein the ML and MP components work alongside each other. If one of the components works within the framework of the other, then the solutions of the nested component help the external one make better decisions.

Kruber et al. (2017) provide an illustration of the sequential implementation. They use supervised ML to decide if a given Dantzig-Wolfe decomposition will be effective in solving instances of a mixed-integer linear programming (MILP) problem. Bonami et al. (2018) address a similar question. They employ supervised ML to determine if linearizing a given instance of a mixed integer quadratic programming problem will reduce solve time. An example of parallel implementation is found in Lodi and Zarpellon (2017). The ML model works within a branch-and-bound framework, and is used to make decisions about branching. Specifically, the goal of the ML component is to learn an efficient policy for selecting the branching variable and branching direction. Yilmaz and Büyüktahtakın (2024) address sequential combinatorial optimization problems in industrial contexts, where problem instances arise with slight parameter variations. A neural network generates a relaxed problem instance by predicting binding constraints, then uses this relaxed instance to eliminate infeasible predictions of decision variables through rapid feasibility checks. This reduces solution time by up to three orders of magnitude, while maintaining an optimality gap below 0.1%. Larsen et al. (2022) introduce a supervised ML model to quickly predict expected tactical descriptions of operational solutions in stochastic optimization. This addresses a common challenge in two-stage stochastic programming, where the second stage is computationally demanding. Their approach aids in solving the overall two-stage problem by circumventing the need for online generation of multiple second stage scenarios and solutions. Another implementation possibility is to leverage an MP formulation to aid an ML algorithm in the solution process. Tran-Thanh et al. (2012) employ an unbounded knapsack formulation to improve the performance of the $\epsilon$-greedy algorithm for multi-armed bandits, where the knapsack solution governs the probability of pulling a given arm of the bandit. A comprehensive overview of ML-enabled optimization methods is reported in Bengio et al. (2021).

2.1.2 Predict-then-optimize

Many decision-making problems deal with both prediction and optimization (or decision). These two tasks are complex on their own and are hence often treated separately and sequentially. The prediction model predicts the unknown parameters of the optimization model, which the latter uses to perform the optimization. ML has been a popular candidate for the prediction task in recent times. But the conventional approach is to train the ML prediction model to minimize the prediction error. This does not consider the nature and attributes of the successive optimization problem and may result in sub-optimal decisions (Elmachtoub and Grigas, 2022). As a brief but standard example, consider two possible paths between an origin O and a destination D. Travel times for both paths are unknown to the decision-maker, but may be predicted with contextual data (weather, congestion, etc.). Suppose the actual travel times are 10 and 20 minutes, respectively. ML model 1 predicts travel times on the two paths to be 15 and 13 minutes, respectively, while ML model 2 predicts them to be 30 and 40 minutes. Although model 1 has better accuracy, it leads to a worse decision.
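The gap between prediction accuracy and decision quality can be made concrete in a few lines of code. The Python sketch below reproduces the two-path example above; the error metric and model names are illustrative only.

# Two-path example: lower prediction error does not imply a better routing decision.
actual = [10, 20]        # true travel times (minutes) on paths 1 and 2
model_1 = [15, 13]       # smaller errors, but reverses the ranking of the paths
model_2 = [30, 40]       # larger errors, but preserves the ranking

def mean_abs_error(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def realized_travel_time(pred, true):
    chosen = min(range(len(pred)), key=lambda k: pred[k])  # pick the path predicted fastest
    return true[chosen]                                     # travel time actually incurred

for name, pred in [("model 1", model_1), ("model 2", model_2)]:
    print(name, mean_abs_error(pred, actual), realized_travel_time(pred, actual))
# model 1: MAE 6.0,  realized travel time 20  -> more accurate, worse decision
# model 2: MAE 20.0, realized travel time 10  -> less accurate, better decision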

Two frameworks have been proposed which integrate prediction with optimization by taking the downstream optimization model into account (Yan and Wang, 2022). The first framework is called smart “predict, then optimize” (SPO), and it evaluates the performance of the ML prediction model based on the final decision error induced by the prediction made, not on the initial prediction error (Elmachtoub and Grigas, 2022). The authors propose a design which applies to the optimization of linear objective functions over a convex feasible region. They introduce the SPO loss function which captures the decision error of a prediction by utilizing the objective function of the downstream optimization model. Since training with the SPO loss function can be cumbersome, a more manageable surrogate loss function called SPO+ loss is also introduced, which is equivalent to the original loss under specific conditions. The second framework, predictive prescriptions, aims to harness the joint distributions between decision variables and auxiliary data, both of which are uncertain (Bertsimas and Kallus, 2020). The principle of predictive prescriptions is to simulate various scenarios in which the values of the unknown parameters are estimated by the ML prediction model. These estimations lead to a cost in the optimization process, and the costs are weighted by the data-driven exploration of the scenarios. The optimal decision policy then minimizes the weighted sum of costs thus generated. Both the SPO and predictive prescriptions frameworks have been increasingly adopted by the scientific community, and a survey of contextual optimization methods for decision-making under uncertainty is available in Sadana et al. (2023).

2.2 Stochastic and online demand

Online demand, sometimes referred to as on-demand logistics, deals with that aspect of e-commerce which provides an affordable delivery service to satisfy the customer’s place and time requirements at short notice. It concerns the sale of immediately deliverable products, sometimes at the cost of short-term losses incurred by the e-commerce business or logistic service provider. With the consolidation of the requisite digital mechanisms (internet ubiquity, personal computers, mobile phones, online payment apps) and supply chain infrastructure (warehouses, highways, enterprise resource planning, GPS), e-commerce has experienced a consistent surge in online demand in the 21st century. E-commerce is now experiencing over 10% growth per annum worldwide, and B2C e-commerce deliveries rose by nearly 25% in 2020 alone (Lozzi et al., 2022). In addition, the disruption caused by COVID-19 has caused an acceleration in internet usage and a lowering of traditional offline purchases, both factors contributing further to e-commerce home deliveries (Han et al., 2022).

Even in the 2000s the research literature observed the benefits of transitioning from brick-and-mortar retail to online e-commerce. Particularly, engaging in both physical offline sales and e-commerce – what is called multi-channel retailing, or click-and-mortar – was identified as a crucial paradigm shift. In the USA, more than 50% of multi-channel retailers reported positive operating margins as early as 2001 (Kim and Park, 2005). Investing in e-commerce gave companies access to new markets at relatively low capital costs. Those customers who shopped both online and offline exhibited greater loyalty to the business. The stable customer base and retailer trust acquired from years of brick-and-mortar operation could be extended to the online domain with e-commerce expansions. Engaging with online demand also gives businesses low-cost direct contact with their customers (Fruhling and Digman, 2000). This not only helps build customer relationships faster, but also enables the construction of customer profiles, the analysis of customer interests and purchasing behaviour, and comparison of similar profiles to make accurate and relevant recommendations. The comprehensive and coherent advantages of online e-commerce have even led to the evolution of omni-channel retailing, which is the modification of multi-channel retailing such that customers can experience and exploit a seamless array of multiple purchase possibilities (Verhoef et al., 2015).

We next discuss the impact of online demand on vehicle routing, both in terms of emerging obstacles and recent solution approaches. We then describe solution strategies which incorporate ML, in line with the methodologies outlined in Section 2.1.

2.2.1 Online demand in vehicle routing

The benefits of e-commerce come with the challenges of online demand. By its dynamic and uncertain nature, online demand is difficult to fulfill. Failed first deliveries can account for up to 60% of all orders for a service provider (Song et al., 2009), the primary reason being the absence of the customer during package handover. This is partially due to modern lifestyle changes such as increase in single person households and flexible (but demanding) work patterns. But a major contributing factor also is that no delivery time slot is explicitly agreed upon between the customer and the e-commerce retailer. This leads to several detrimental effects. If packages are left unattended then lack of security is a concern. Repeated deliveries or customers personally travelling to collection points result in economic losses as well as excess carbon emissions (Edwards et al., 2010). A high rate of unsatisfied or partially satisfied demand also strains driver-customer relationships (Masorgo et al., 2023). However, if a delivery time window is agreed upon through some form of online demand management, then the success of last mile deliveries is shown to increase (Van Duin et al., 2016).

Many studies dealing with online demand employ some form of heuristic approach to modify routes as and when new demand requests emerge. Hong et al. (2019) consider a shipping company that faces online demand, but delivers products to pre-determined collection centres that the customer must visit on their own. They implement an ant colony based heuristic which changes the previously decided collection centre for a particular customer in light of incoming demand requests. Mobility services also fall under stochastic and online demand, where the commodity sold is the route itself. Bertsimas et al. (2019) analyse online taxi ride-sharing with time windows for pickups. They employ a cost metric to measure waiting or empty driving time between customers. The network is pruned by keeping, for each customer, the top $k$ subsequent customers with the lowest costs. A maxflow heuristic is applied by reducing pickup time windows to a single randomly selected time instant. The resulting route arcs are retained, and the process is iteratively repeated until a specified number of arcs is reached. Finally, a mixed-integer formulation is applied to the modified network for obtaining results. Consolidation of delivery requests can also save costs by revealing efficient routes. Muñoz-Villamizar et al. (2024) explore stochastic demand where customer requests arrive daily. The city of operation is divided into regions, each with an expected customer density, estimated average velocity and travel distance per order. A region-specific postponement cost function is proposed for quantifying the delay of a request in a given region to a later date. A tabu-search metaheuristic modifies vehicle routes by identifying better (lower cost) routes in their local neighbourhood. The neighbourhood is traversed by applying insertion, swap and inversion operators that transform the customer sequence of a vehicle route.

2.2.2 ML-based solution methods

Using ML to assist optimization was given much attention in the transport literature. The same-day delivery problem is a rich base for such joint implementations as mentioned in Section 2.1.1. Chen et al. (2023) study same-day delivery under fairness, dividing the service area into regions. Service availability is defined as the ratio of expected accepted requests per day to the expected overall number of requests. The objective is to maximize both the service availability in the entire area and the minimum service availability across all regions. The learning agent receives rewards for each fulfilled request, guiding it to improve its policy by considering changes in both components of the objective function. Feng et al. (2022) apply reinforcement learning to the real-time ride-sourcing problem with incorporation of public transport services. For every customer ride request, feasible routes are heuristically generated. For every driver capable of responding to the request, the learning agent estimates the perceived value of the route-driver pairs. Finally an integer linear program is solved to determine the optimal allocation of drivers to routes. If electric vehicles are involved, then charge depletion adds an additional factor of uncertainty. Basso et al. (2022) study the electric vehicle routing problem where both arrival of customer requests and energy consumption per unit road length are stochastic. Charging locations are deterministic. At each state of the MDP formulation, the reinforcement learning agent estimates the energy cost of moving to possible subsequent states and, in doing so, the probability of battery charge falling below a safety threshold.

The predict-then-optimize frameworks discussed in Section 2.1.2 have also been applied in stochastic transport optimization. Chu et al. (2021) implement the SPO method to solve the last-mile problem for an online food delivery service. They consider driver behaviour, traffic conditions and route lengths as input features to predict travel times. Simulated annealing is used to generate feasible routes, which are further updated with customer exchange operators to derive new routes. Baty et al. (2024) design a machine learning pipeline which relies on a downstream optimization component to improve predictions. The authors study the dynamic vehicle routing problem with time windows (VRPTW), where all requests must be fulfilled and fleet size is unlimited. They demonstrate that solving an equivalent prize-collecting VRPTW, where each unfulfilled customer is assigned a prize or profit, addresses this version of the VRP. An ML model determines the optimal allocation of prizes, yielding a near-optimal solution for the original problem. A hybrid genetic search algorithm solves the prize-collecting VRPTW based on input from the ML component. The ML model refines itself through a loss function, which measures the disparity between the target and acquired objective function values. The target objective value is derived by solving a static VRPTW from historical data, utilizing the same hybrid genetic search algorithm.

2.3 UAV for urban logistics

The increasing adoption of UAVs is due to their autonomy and efficiency. They are also versatile and affordable in relatively large numbers, and these attributes are improving with time. Recent advancements in drone technology exhibit improvements along multiple dimensions (Chen et al., 2021). Upgrades in battery and propulsion technology allow enhanced manoeuvrability, longer flight times, greater altitude capabilities and higher payload capacity. For example, Hecken et al. (2022) study the construction of a heavy-lift UAV for low-altitude flight. The incorporation of machine learning and swarm technology permits greater degrees of autonomy and coordination in large groups. Progress in materials science and UAV design creates scalability in drone manufacture, reduction in overall weight and better operational safety. Kornatowski et al. (2020) design a drone with collapsible rotor arms to improve safety while landing and aerodynamic efficiency during flight. Efforts are also being made to conceptualize novel UAV infrastructure, such as docking stations, precision landing sites, and innovative battery charging modes (Mohsan et al., 2023).

UAV deployment for traffic monitoring has witnessed growing attention from a research perspective. UAV patrols involve repeated deployments to specific parts of a city to monitor traffic flow based on volume (Chow, 2016). Roadside surveillance by UAVs also extends to the conception of intelligent transport systems. Working as accident report agents, UAVs can alert nearby autonomous vehicles and assist rescue teams to minimize and mitigate the damage caused (Menouar et al., 2017). Aerial data analysis can also reveal possible blockages at both traffic and railroad intersections (Congress et al., 2021). Visual and spatial information gathered by planned UAV flights can help make crucial decisions for disaster management. Glock and Meyer (2020) explore route planning for UAVs to detect sampling locations and collect atmospheric samples after a fire or hazardous chemical accident. This data is then used to predict the atmospheric distribution of toxic substances in the proximity of the affected region.

There are many models for payload delivery using UAVs in the literature. UAVs can either be deployed independently, or in association with conventional vehicles such as trucks (with or without synchronization). This paper focuses on the former, i.e. the deployment of a homogeneous fleet of UAVs without ground vehicular support. However, for completeness, we delve into the relevant literature for both drone-only and truck and drone logistics, in that order.

2.3.1 Drone-only routing

UAVs are cheaper to maintain, reduce labour and delay costs and have a lower carbon footprint compared to trucks. Amazon and Alibaba have been frontrunners in adopting UAVs for logistics, in moderate density urban regions and to deliver time-sensitive packages respectively. Many other companies across the world are following suit (Han et al., 2022). Additionally, a majority of customers in urban locations live close to a supermarket, and most delivery packages are relatively light-weight (Moshref-Javadi and Winkenbach, 2021). This includes food, beverages, groceries and also medical and pharmaceutical supplies, all of which have a perishability constraint which makes UAVs an attractive choice.

Levin and Rey (2023) explore the drone delivery service planning problem and propose a novel branch-and-price algorithm for its solution. They also design a reservation heuristic to generate feasible solutions expediently. Vlahovic et al. (2017) perform a case study of a UAV delivery model for pharmaceutical supplies, exhibiting cost reductions and improvements in service responsiveness. Retail facility placements can be improved when network design is conducted under the availability of a UAV delivery service (Baloch and Gzara, 2020). Liu (2019) propose a mixed integer programming formulation for online meal delivery using UAVs. The objective function penalizes excess battery depletion, motivates fast delivery and minimizes unnecessary movement for re-routing. The model is executed in a progressive rolling fashion across discrete time-steps. Chen et al. (2021) explore key strategic and tactical decisions for retailers implementing drone-based delivery operations, focusing on when to offer drone delivery and the appropriate pricing. Using a Markov decision process (MDP) framework, they introduce heuristic procedures to obtain near-optimal closed-form solutions efficiently. The aim is to assist online retailers in real-time decision-making regarding the feasibility and extent of offering drone-based delivery for specific product categories in various service zones. Additionally, they identify effective pricing strategies for drone-based deliveries. Payload delivery expands outside the realm of delivery to consumers. For example, UAVs can be used in precision agriculture, to simultaneously monitor crop sites and perform spraying tasks. This increases the efficiency of both pesticide and fertiliser use (Radoglou-Grammatikis et al., 2020).

The study of drone-only logistics also attracts ML-enriched solution strategies. Asadi et al. (2022) study drone delivery of medical supplies to hospitals, categorizing demand based on hospital distance from the drone hub. Each demand class includes a demand distribution and the required battery charge for successful delivery. The problem is formulated as a MDP where the system state is a vector containing the number of batteries at specific charge levels within the hub. The concise state expression allows reinforcement learning where a lookup table suffices for value function approximation. The action space involves charging batteries from lower to higher levels within decision epochs. This model optimizes battery usage, preserving charge for future deliveries and reducing replacement and charging costs. UAVs also have potential as a supplementary resource in two-echelon logistics networks. Li et al. (2024) study the last-mile delivery problem, where trucks transport goods from a primary depot to satellite depots, and UAVs complete the delivery to customers. Note that we categorize this study as drone-only logistics since trucks are treated as a stochastic externality. The stochastic nature of parcel arrival times at satellites is handled through Bayesian estimation, converting continuous arrival scenarios into discrete expected arrival intervals. A two-stage stochastic model is applied: parcel consolidation is first performed to release a subset for immediate delivery, and the remaining parcels are reserved for routing with future arrivals. Next, the route is treated as a particle in a particle swarm optimization problem. Reinforcement learning using search operators incrementally permutes the route in each iteration to discover better routes in the local neighbourhood.

2.3.2 Truck and drone logistics

The integration of UAVs into the logistic service fleet (together with conventional vehicles) can result in considerable benefits (Rejeb et al., 2023). Such a multi-modal delivery model decreases delivery times, overcomes traffic congestion and minimises human participation in the delivery process. Dayarian et al. (2020) investigate a same-day delivery model where a fleet of trucks is assisted by UAVs (through commodity resupply) at mutually determined meeting locations, allowing trucks to make longer trips before returning to the depot. Gu et al. (2023) investigate a single truck-and-drone scenario for online demand, consisting of deterministic delivery requests along with online pickup requests. The drone revisits the truck for resupply when needed, such revisits dividing the route into segments. An insertion heuristic compares the costs of placing pickup customers at various segments of the initial route. A 15% increase in profits is reported, with UAVs contributing a 50% increase in the acceptance rate of online requests. Yang et al. (2023) study a robust drone-truck delivery problem with customer time windows. They leverage the precision of drone routes to mitigate uncertainties in truck travel times caused by ground traffic congestion. Unlike Gu et al. (2023), here the truck remains at a customer location until the drone serves other customers and returns. Numerical experiments show that the drone-truck combination can safe-guard against the risk of failed deliveries due to unmet time windows. Rave et al. (2023) broaden the truck-drone route planning framework by introducing deployable drone launching stations known as microdepots. In this setup, trucks leave a central distribution centre equipped with both onboard drones and microdepots. The authors address a truck-drone fleet planning problem for a single logistics service provider using adaptive large neighbourhood search. Experiments demonstrate that mixed fleets provide cost-savings over a truck-only fleet, and that the cost-optimal choice of fleet composition and delivery method involves considering all possible combinations.

Heterogeneous fleets logistics, such as truck and drone models, also welcome ML-supported solutions. Chen et al. (2022) handle a mixed fleet of vehicles and UAVs to handle online requests. Network evolution is modelled as a MDP. Routing heuristics determine route feasibility by vehicle or drone, and deep reinforcement learning is then used to determine if a vehicle or a drone should be dispatched. Ghiasvand et al. (2024) tackle a two-echelon multi-trip truck-drone routing problem with uncertainty. Trucks make multiple trips from the depot to local distribution centres (LDCs), while UAVs make single trips from LDCs to customers. The research formulates a mixed-integer linear programming model to minimize total customer waiting times, managing uncertain customer demand through kernel-function based machine learning algorithms.

In an urban scenario, the operation environment of UAVs can mimic the road network. Buildings and other infrastructure can impede flight paths or require excessive energy to overcome aerially. But the airspace above roadways is generally free of obstructions. A possible drone flight network can be a map of the road traffic network itself, where links are above roads and nodes are above traffic intersections and several flight levels may be considered (Levin and Rey, 2023). We adopt this problem and context for the online DDSP.

3 Problem description and formulation

We first introduce the deterministic demand formulation for the DDSP in Section 3.1. We next present the online demand model in Section 3.2 and the MDP formulation in Section 3.3.

3.1 Deterministic DDSP

The deterministic (or static) version of the DDSP is examined by Levin and Rey (2023) through an integer linear programming (ILP) formulation. This formulation is discussed below and serves as the foundation for the model and solution approach designed in this study for the online DDSP.

Let $G=(N,A)$ be a network across which customers send transport requests. These requests, if accepted, are fulfilled by a fleet of drones. Every link $i\in A$ has a length $\rho_{i}$ and capacity $C_{i}$. All links are directed. The drones traverse the network through these links at a constant velocity $v$. Each node $n$ has a set of incoming and outgoing links $\Gamma^{-}_{n}$ and $\Gamma^{+}_{n}$, respectively. The time horizon is discretized into time steps $\{0,1,\ldots,t,\ldots,T\}$, where $T$ marks the end of the horizon. Customers submit requests to the network before the period of network operation $T$. Each customer request $r$ consists of (a) spatial parameters – the request origin $o_{r}$ and destination $d_{r}$ such that $o_{r},d_{r}\in N$, (b) temporal parameters – the earliest permissible time of departure $e_{r}$ and the arrival time window at the destination $[l_{r},u_{r}]$, and (c) a profit $p_{r}$ earned upon fulfilling this request. The set of all requests that arrive over the entire time horizon is denoted by $\mathcal{R}$. In the static version, the network manager is aware of the complete request set $\mathcal{R}$ before the time horizon begins.

Levin and Rey (2023) introduce a link-based network formulation. Every link is divided into an upstream and a downstream segment (Figure 1). The key decision variable $\chi^{r\uparrow}_{i}(t)\in\{0,1\}$ indicates whether request $r$ occupies the upstream segment of link $i\in A$ at time $t$. The counterpart of $\chi^{r\uparrow}_{i}(t)$ for the downstream segment is $\chi^{r\downarrow}_{i}(t)$.

Figure 1: Link-based formulation

Two successive links connected by a node form a turn. The decision variable $\gamma_{ij}^{r}(t)\in\{0,1\}$ indicates whether request $r$ occupies the turn from link $i$ to link $j$ at time $t$. Note that the movement of the drone from the downstream segment of link $i$ to the upstream segment of link $j$ through the connecting node is assumed to be instantaneous. A single node can serve as the junction for many turns. For a given turn $(i,j)$, the set of all other turns sharing the same junction node is denoted by $\mathcal{C}_{ij}$. The decision variable $\phi_{ij}(t)\in\{0,1\}$ resolves turn conflicts at every node by indicating which turn is active at a given time $t$, consequently deactivating all other turns which share that node as a junction.

Finally, the decision variable $z_{r}\in\{0,1\}$ indicates the acceptance or rejection of request $r$, and the objective is to maximize the total profit acquired. The ILP Formulation (1) is presented below:

$\max\ Z=\sum_{r\in\mathcal{R}}z_{r}p_{r}$ (1a)

s.t. $\sum_{i\in\Gamma_{o_{r}}^{+}}\sum_{t=e_{r}}^{T}\chi^{r\uparrow}_{i}(t)=z_{r} \quad \forall r\in\mathcal{R}$ (1b)

$\sum_{i\in\Gamma_{d_{r}}^{-}}\sum_{t=l_{r}}^{u_{r}}\chi^{r\downarrow}_{i}(t)=\sum_{i\in\Gamma_{o_{r}}^{+}}\sum_{t=e_{r}}^{T}\chi^{r\uparrow}_{i}(t) \quad \forall r\in\mathcal{R}$ (1c)

$\chi^{r\downarrow}_{i}\left(t+\frac{\rho_{i}}{v}\right)=\chi^{r\uparrow}_{i}(t) \quad \forall r\in\mathcal{R},\ \forall i\in A,\ \forall t\in\{0,\ldots,T\}$ (1d)

$\sum_{r\in\mathcal{R}}\chi^{r\uparrow}_{i}(t)\leq C_{i} \quad \forall i\in A,\ \forall t\in\{0,\ldots,T\}$ (1e)

$\sum_{r\in\mathcal{R}}\chi^{r\downarrow}_{i}(t)\leq C_{i} \quad \forall i\in A,\ \forall t\in\{0,\ldots,T\}$ (1f)

$\sum_{i\in\Gamma_{n}^{-}}\chi^{r\downarrow}_{i}(t)=\sum_{i\in\Gamma_{n}^{+}}\chi^{r\uparrow}_{i}(t) \quad \forall n\in N,\ \forall r\in\mathcal{R},\ \forall t\in\{0,\ldots,T\}$ (1g)

$\gamma^{r}_{ij}(t)\leq\chi^{r\downarrow}_{i}(t) \quad \forall r\in\mathcal{R},\ \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1h)

$\gamma^{r}_{ij}(t)\leq\chi^{r\uparrow}_{j}(t) \quad \forall r\in\mathcal{R},\ \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1i)

$\gamma^{r}_{ij}(t)\geq\chi^{r\downarrow}_{i}(t)+\chi^{r\uparrow}_{j}(t)-1 \quad \forall r\in\mathcal{R},\ \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1j)

$\phi_{ij}(t)\geq\frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\gamma^{r}_{ij}(t) \quad \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1k)

$\phi_{ij}(t)+\sum_{(i^{\prime},j^{\prime})\in\mathcal{C}_{ij}}\phi_{i^{\prime}j^{\prime}}(t)\leq 1 \quad \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1l)

$\gamma^{r}_{ij}(t)\in\{0,1\} \quad \forall r\in\mathcal{R},\ \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1m)

$\phi_{ij}(t)\in\{0,1\} \quad \forall(i,j)\in A^{2},\ \forall t\in\{0,\ldots,T\}$ (1n)

$\chi^{r\uparrow}_{i}(t),\chi^{r\downarrow}_{i}(t)\in\{0,1\} \quad \forall r\in\mathcal{R},\ \forall i\in A,\ \forall t\in\{0,\ldots,T\}$ (1o)

$z_{r}\in\{0,1\} \quad \forall r\in\mathcal{R}$ (1p)

Constraints (1b) and (1c) ensure that only accepted requests are routed and that each route reaches the destination $d_{r}$ within the time window. Constraint (1d) tracks link travel time, while constraints (1e)–(1f) prevent capacity violations. Constraint (1g) conserves link flow, and constraints (1h)–(1j) construct valid turns. Lastly, constraints (1k)–(1l) resolve conflicts at turn junctions (i.e., nodes with non-empty $\Gamma_{n}^{+}$ and $\Gamma_{n}^{-}$) in the network.
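For illustration, the conflict sets $\mathcal{C}_{ij}$ used in constraints (1k)–(1l) can be enumerated directly from the link list. The following Python sketch assumes links are given as (tail, head) node pairs; the function and variable names are ours, not the authors'.

from collections import defaultdict

def build_conflict_sets(links):
    # Group turns (i, j) by their junction node and derive, for each turn,
    # the set of other turns sharing that junction (the C_ij of constraints 1k-1l).
    # `links` maps a link id to its (tail, head) node pair; names are illustrative.
    turns_by_node = defaultdict(list)
    for i, (_, head_i) in links.items():
        for j, (tail_j, _) in links.items():
            if i != j and head_i == tail_j:
                turns_by_node[head_i].append((i, j))
    return {
        turn: [other for other in turns if other != turn]
        for turns in turns_by_node.values()
        for turn in turns
    }

# example: links 0 (a->b), 1 (b->c) and 2 (b->d) share junction node b
conflicts = build_conflict_sets({0: ("a", "b"), 1: ("b", "c"), 2: ("b", "d")})
# conflicts[(0, 1)] == [(0, 2)] and conflicts[(0, 2)] == [(0, 1)]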

3.2 Online demand model

We build upon the deterministic DDSP to establish the online demand model. The time horizon is divided into $I$ intervals, each of duration $D$, such that $DI=T$. The successive execution of all $I$ intervals concludes a single instance (or day) of network operation. The number of requests arriving at the network in each interval is governed by a Poisson process with rate $\lambda$. This implies that, on average, the expected total number of requests submitted to the network during the entire time horizon is $\lambda I$. Within a specific interval, the time of submission of each request (by the customer) is assumed to be distributed uniformly.
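To make the demand process concrete, the following sketch samples one interval of online demand: a Poisson number of requests with uniform submission times within the interval. The spatial, temporal and profit distributions below are placeholders (the paper does not fix their exact form), so all numbers and names are purely illustrative.

import numpy as np

rng = np.random.default_rng(seed=0)

def sample_interval_demand(i, D, lam, nodes):
    # Requests arriving in interval i of length D: count ~ Poisson(lam),
    # submission times uniform within the interval. Origin/destination,
    # time-window and profit distributions are illustrative placeholders.
    n_requests = rng.poisson(lam)
    requests = []
    for _ in range(n_requests):
        t_submit = (i - 1) * D + rng.uniform(0, D)        # uniform submission time
        o, d = rng.choice(len(nodes), size=2, replace=False)
        e_r = int(np.ceil(t_submit))                       # earliest departure e_r
        l_r = e_r + int(rng.integers(5, 15))               # arrival window [l_r, u_r]
        u_r = l_r + int(rng.integers(3, 10))
        p_r = float(rng.integers(1, 11))                   # profit p_r
        requests.append((nodes[o], nodes[d], e_r, l_r, u_r, p_r))
    return requests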

Customers submit requests to the network during each interval $i\in I$. The current incoming requests in interval $i$ (denoted by $\mathcal{R}_{C}^{i}\subset\mathcal{R}$) can either be accepted or rejected. Those which are accepted are added to the set $\mathcal{R}_{\checkmark}\subset\mathcal{R}$ and must then be routed through the network. Once a request is accepted (assigned a drone and a planned route), it cannot later be rejected or left unfulfilled. As soon as a drone initiates its route within the network, the corresponding request is termed active. Alternatively, if a request is accepted (and routed) but the assigned drone has not left the origin, the request is termed idle. Let $\mathcal{R}_{I}\subset\mathcal{R}_{\checkmark}$ and $\mathcal{R}_{A}\subset\mathcal{R}_{\checkmark}$ be the sets of idle and active requests, respectively. All idle requests in the current interval are carried over to the next one, where their planned routes are subject to change. Once departed, no route changes are permitted and the drone is assumed to complete the delivery and leave the network. Upon route completion the corresponding request is removed from $\mathcal{R}_{A}$ and $\mathcal{R}_{\checkmark}$. The problem objective is to maximize the total profit earned by servicing requests in the given time horizon. This demand model is illustrated in Figure 2.

Figure 2: Online demand model: the circles represent requests while $i$, $i+1$, $i+2$ are time intervals

3.3 Formulation as a Markov decision process

The dynamic evolution of the network (arrival and acceptance of requests, followed by drone traffic routing and optimization) can be formulated as a Markov Decision Process (MDP). In an MDP, the problem is decomposed into an environment and an agent. In the online DDSP, the network is the environment, and the network manager (who accepts and routes requests) is the agent. The environment produces states, based on which the agent takes actions, to receive rewards. The agent decides the best action at a given state. The environment subsequently controls the transition to the next state.

3.3.1 State space

The evolution of drone traffic in the network is represented as a sequential MDP moving through states. The state of the network captures all relevant information about its current structure. The state $s_{r}$ is constructed with the following components:

  1. $t_{r}$: the time step at which a new request arrives. This triggers the creation of a new state. Note that two successive requests may arrive several time steps apart, i.e., $t_{r+1}-t_{r}\geq 1$.

  2. $\vec{r}$: the request vector contains the details of the request: the request sequence number, origin node, destination node, earliest departure time, time window of arrival at the destination, and request profit. We write $\vec{r}=(r,o_{r},d_{r},e_{r},[l_{r},u_{r}],p_{r})$.

  3. $\vec{\Theta}_{r}^{pre}$: the route vector $\vec{\theta}_{r}$ contains the details of the drone route assigned to request $r$; the collection of all route vectors in state $s_{r}$ is $\vec{\Theta}_{r}^{pre}$. The route vector comprises the request sequence number, the time of drone departure, and the sequence of nodes in the route: $\vec{\theta}_{r}=(r,t_{o_{r}},[o_{r},\ldots,d_{r}])$. For the sake of brevity, the sequence of nodes in the route is also denoted by $\mu_{r}=[o_{r},\ldots,d_{r}]$. Whether request $r$ is active or idle can be determined from whether $t_{o_{r}}$ lies within the current interval or not. Once a request becomes active, its route vector is immune to any further change. If a request is fulfilled by the time the new state is triggered, its corresponding $\vec{\theta}_{r}$ is removed from $\vec{\Theta}_{r}^{pre}$.

Hence the state $s_{r}$ is represented as the tuple $s_{r}=(t_{r},\vec{r},\vec{\Theta}_{r}^{pre})$. Every interval witnesses the creation of as many states as there are request arrivals in that interval. Between the acceptance (and subsequent routing) or rejection of a request and the arrival of the successive request, the only change in the network is the movement of active drones as per their routes. Since this is a deterministic change (the routes of active drones are known and fixed), a new state need not be declared until the arrival of the next request.
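A minimal data structure for this state representation might look as follows; the class and field names are ours and purely illustrative.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Request:                      # request vector (r, o_r, d_r, e_r, [l_r, u_r], p_r)
    r: int
    origin: int
    destination: int
    earliest_departure: int
    window: Tuple[int, int]         # (l_r, u_r)
    profit: float

@dataclass
class Route:                        # route vector theta_r = (r, t_{o_r}, mu_r)
    r: int
    departure_time: int             # t_{o_r}
    nodes: List[int]                # mu_r = [o_r, ..., d_r]

    def is_active(self, interval_start: int) -> bool:
        # following Algorithm 1, a request is treated as active once t_{o_r} <= (i-1)D
        return self.departure_time <= interval_start

@dataclass
class State:                        # s_r = (t_r, request vector, Theta_r^pre)
    t_r: int                        # time step of the triggering request arrival
    request: Request
    routes: List[Route] = field(default_factory=list)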

3.3.2 Action space

After the arrival of a new request $r$ triggers the creation of state $s_{r}$, the agent must take an action $a_{r}$ from the feasible action space $\mathcal{A}(s_{r})$. The action space allows the rejection of request $r$, but also contains all possible ways in which request $r$ can be routed through the network, provided four conditions are met:

  1. The drone must not exceed the capacity of a link pre-occupied by other active drones.

  2. The drone must not, at any time-step, utilize a turn junction which is already utilized by an active drone at that time-step.

  3. The drone must not depart the request origin before $e_{r}$.

  4. The drone must reach the request destination within the time window $[l_{r},u_{r}]$.

The action space $\mathcal{A}(s_{r})$ also permits re-routing idle requests from previous intervals in order to accommodate request $r$ or to achieve a better objective value. The action taken is then represented as the tuple $a_{r}=(z_{r},\vec{\Theta}_{r}^{post})$, where $z_{r}$ is a binary decision regarding request $r$ and $\vec{\Theta}_{r}^{post}$ is the updated collective route vector, which would include a route $\vec{\theta}_{r}$ for request $r$ if $z_{r}=1$. At a given state, action selection is done by the agent’s policy, which maps states to actions. Conventionally, a policy is represented as $\pi(a_{r}|s_{r})$, and governs the probability of taking action $a_{r}$ when in state $s_{r}$.
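The four feasibility conditions above can be checked for a candidate route as in the following simplified sketch, which treats time as continuous and checks link capacity only at the entry time of each link; all names and the data layout are illustrative, not the authors' implementation.

def route_is_feasible(departure, path, e_r, l_r, u_r, cap, occupancy, junction_busy, v, rho):
    # Check the four routing conditions for one candidate action.
    # `occupancy[(link, t)]` counts active drones on a link at time t and
    # `junction_busy[(node, t)]` flags turn junctions used by active drones.
    if departure < e_r:                                        # condition 3: no early departure
        return False
    t = departure
    for a, b in zip(path, path[1:]):
        link = (a, b)
        if occupancy.get((link, t), 0) + 1 > cap[link]:        # condition 1: link capacity
            return False
        t += rho[link] / v                                     # traverse link (a, b)
        if b != path[-1] and junction_busy.get((b, t), False): # condition 2: junction conflict
            return False
    return l_r <= t <= u_r                                     # condition 4: arrival window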

3.3.3 Reward and value functions

Accepting a request increases the value of the objective function and should therefore indicate a reward. The reward received in state $s_{r}=(t_{r},\vec{r},\vec{\Theta}_{r}^{pre})$ with action $a_{r}=(z_{r},\vec{\Theta}_{r}^{post})$ is simply:

$R(s_{r},a_{r})=z_{r}p_{r}$ (2)

This gives a reward of $p_{r}$ for accepting a request and a reward of 0 for rejecting it. Consequently, the reward accumulated over the course of interval $i\in I$ is given by $\sum_{r\in\mathcal{R}_{C}^{i}}R(s_{r},a_{r})=\sum_{r\in\mathcal{R}_{C}^{i}}z_{r}p_{r}$. The overall cumulative reward is $\sum_{t_{r}\in[0,T]}z_{r}p_{r}$, which is equivalent to $\sum_{r\in\mathcal{R}}z_{r}p_{r}$.

The state-action pair $(s_{r},a_{r})$ captures the process of taking an action $a_{r}$ while in a state $s_{r}$. In the application of ML under the MDP framework, every state-action pair is assigned a relative value in comparison to all other pairs. This value indicates the expected final objective (or reward) that can be achieved given that the agent occupies state $s_{r}$ and takes action $a_{r}$. The learning model presented in this study seeks to estimate the value of taking a certain action while occupying a certain state. In other words, it seeks to discover and exploit the action value function $Q(s_{r},a_{r})$. Thus “learning” is equivalent to improving the accuracy of the value estimations of $Q(s_{r},a_{r})$ to a satisfactory degree.

3.3.4 Online information and state transition

The transition to the next state occurs under transition probabilities $P^{a_{r}}_{s_{r}\rightarrow s_{r}^{\prime}}$. The transition probability determines which state the environment jumps to once the agent undertakes a specific action in the current state. There are many application scenarios where the transition probability matrix is either unknown to the agent, or the size of the state space prohibits its use. In such scenarios the agent can only control its policy, and has no prior knowledge of how the environment will react to its actions. The drone delivery network with online demand falls under this scenario.

As mentioned, the arrival of requests across the time horizon follows a Poisson process with rate $\lambda$. The origins and destinations of the requests are generated via spatial distributions, the time windows of arrival at the destinations are governed by a temporal distribution, and the request profits are sourced from a profit distribution. However, the agent or network manager has no prior information about any of these distributions or the Poisson process. This makes the problem online, and the nature of request arrival becomes exogenous to the agent.

3.3.5 Objective function and optimal policy

The objective is to maximize the cumulative profit earned from requests served over the entire horizon. This is identical to maximizing the cumulative reward acquired in the MDP formulation. In MP terminology, this becomes:

$\max\ Z=\sum_{r\in\mathcal{R}}z_{r}p_{r}$ (3)

Given an (accurate) action value function $Q(s_{r},a_{r})$, an optimal policy $\pi^{*}$ would aim to choose action $a_{r}$ in state $s_{r}$ such that the state-action pair $(s_{r},a_{r})$ returns the maximum achievable $Q$-value in that state, i.e.:

$\pi^{*}(a_{r}|s_{r})=\begin{cases}1,&\text{if }a_{r}=\underset{a}{\arg\max}\ Q(s_{r},a)\\ 0,&\text{otherwise}\end{cases}$ (4)

This is equivalent to stating that the policy, out of the set of all policies $\Pi$, that maximizes the objective function, i.e. the expected total reward gained over the course of the time horizon, is $\pi^{*}$:

$\pi^{*}\in\underset{\pi\in\Pi}{\arg\max}\ \mathbf{E}\left[\sum_{r=1}^{|\mathcal{R}|}R(s_{r},A_{r}^{\pi}(s_{r}))\,\Big{|}\,s_{r-1}\right]$ (5)

Here $A_{r}^{\pi}(s_{r})$ is the action taken by policy $\pi$ in state $s_{r}$. The state $s_{0}$ depicts the initial empty network where no request has yet arrived. Note that the limits of summation in (3) and (5) are slightly different because, unlike in (3), in (5) we emphasize operating upon requests sequentially to demonstrate the evolution of the MDP.
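In code, the greedy policy of Equation (4) simply scores every feasible action with the current Q-value estimate and picks the best one, and the objective of Equation (5) is the cumulative reward collected over one day of operation. The sketch below assumes a `q_estimate` approximation of $Q(s_r,a_r)$, an action object exposing its acceptance decision, and an environment wrapper with a `step` method; none of these are part of the paper's notation.

def greedy_action(state, feasible_actions, q_estimate):
    # Equation (4): choose the feasible action with the largest estimated Q-value.
    # Rejecting the request (z_r = 0 with unchanged routes) is one of the feasible actions.
    return max(feasible_actions, key=lambda action: q_estimate(state, action))

def episode_reward(policy, initial_state, environment):
    # Cumulative reward of Equation (5) over one day, assuming `environment.step`
    # reveals the next request online and returns None at the end of the horizon.
    total, state = 0.0, initial_state
    while state is not None:
        action = policy(state)
        total += action.z_r * state.request.profit   # per-state reward of Equation (2)
        state = environment.step(state, action)
    return total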

4 Solution methodology

A rolling implementation of Formulation (1) in consecutive intervals $i\in I$ is developed to solve the online DDSP. Section 4.1 first outlines the rolling ILP implementation as a solution framework. This is followed by Section 4.2, which illustrates the motivation to construct a surrogate ILP objective function. The determination of predictive prescriptions via a training scheme is provided in Section 4.3. The resulting data-driven optimization approach and the proposed policy for solving the online DDSP are summarized in Section 4.4.

4.1 Myopic ILP

Five modifications need to be made to Formulation (1) for the dynamic online DDSP. First, since Formulation (1) was developed for the static DDSP, all time-indexed variables and constraints extend over the entire time horizon $T$. However, in the online demand context, the ILP will be solved repeatedly for each interval $i\in I$ in a rolling fashion. This requires that, in a given interval, the time-indexed variables and constraints extend from the beginning to the end of that interval. Hence the limits of $t$ change from $\{0,\ldots,T\}$ to $\{(i-1)D,\ldots,iD\}$ for interval $i$.

Second, since idle requests are carried over from previous intervals, any change in their $z_{r}$ values is not permissible. Requests once accepted must be fulfilled. The same is true for active requests from preceding intervals as well. This implies that in each interval,

$z_{r}=1\qquad\forall r\in\mathcal{R}_{I}\cup\mathcal{R}_{A}$ (6a)
Third, the objective function needs to be implemented for all requests which have arrived in the current interval $i$, and for idle requests from previous intervals. The latter is because idle requests are still a part of the solution process, since their routes are subject to re-optimization. This is captured as:
$\max\ Z=\sum_{r\in\mathcal{R}_{O}^{i}}z_{r}p_{r}\quad\text{where}\ \mathcal{R}_{O}^{i}=\mathcal{R}_{C}^{i}\cup\mathcal{R}_{I}\quad\forall i\in I$ (6b)

Fourth, since idle requests may still be re-routed, their earliest time of departure $e_{r}$ cannot precede the start of the current interval $i$. An idle request cannot be re-routed such that its new route begins at a time-step that has already passed.

$e_{r}=\begin{cases}(i-1)D,&\text{if}\ e_{r}\leq(i-1)D\\ e_{r},&\text{otherwise}\end{cases}\qquad\forall r\in\mathcal{R}_{I}$ (6c)

Finally, the unfinished portion of active request routes which will be undertaken in the current interval must be preserved. Suppose we are in interval $i$ and the last updated route vector for (currently) active request $r$ is $\vec{\theta}_{r}=(r,t_{o_{r}},[o_{r},\ldots,d_{r}])$. Let the number of nodes in its path $\mu_{r}$ be $P$ and let $\omega(m,n)$ denote the directed link from node $m$ to node $n$. Then the definitive path for request $r$ is $\mu_{r}=[o_{r},n_{2},\ldots,n_{k},\ldots,n_{P-1},d_{r}]$ where $n_{1}=o_{r}$ and $n_{P}=d_{r}$. The following impositions must be made for all such active requests $r\in\mathcal{R}_{A}$:

$\begin{rcases}\chi^{r\downarrow}_{\omega(n_{k-1},n_{k})}(t_{n_{k}}^{r})=1,&\text{if}\ n_{k}\neq o_{r}\\ \chi^{r\uparrow}_{\omega(n_{k},n_{k+1})}(t_{n_{k}}^{r})=1,&\text{if}\ n_{k}\neq d_{r}\\ \gamma^{r}_{\omega(n_{k-1},n_{k})\,\omega(n_{k},n_{k+1})}(t_{n_{k}}^{r})=1,&\text{if}\ n_{k}\notin\{o_{r},d_{r}\}\end{rcases}\qquad\forall\ n_{k}\in\mu_{r}\mid(i-1)D\leq t_{n_{k}}^{r}\leq iD$ (6d)

Here $t_{n_{k}}^{r}$ denotes the time at which the drone responsible for request $r$ reaches node $n_{k}$; it can be determined since $t_{o_{r}}$, link lengths and drone velocity are known. Solving Formulation (1) with these five modifications at every interval completes the rolling ILP implementation. The pseudo-code is presented in Algorithm 1. Since this model does not consider the impact of current decisions on the fate of requests submitted in future intervals, and only focuses on the current interval, we hereby refer to it as the Myopic ILP.
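As a small illustration of the computation of $t_{n_{k}}^{r}$ needed by constraints (6d), the following sketch walks an active route and accumulates link travel times; the function and argument names are ours.

def node_arrival_times(t_o, mu_r, rho, v):
    # Arrival time t_{n_k}^r at each node of an active route mu_r = [o_r, ..., d_r],
    # given departure time t_{o_r}, link lengths rho[(m, n)] and drone velocity v.
    times = {mu_r[0]: t_o}
    t = t_o
    for m, n in zip(mu_r, mu_r[1:]):
        t += rho[(m, n)] / v          # constant-velocity traversal of link (m, n)
        times[n] = t
    return times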

for $d\leftarrow 1$ to $N_{days}$ do
    initialize request set $\mathcal{R}_{\checkmark}$ and collective route vector $\vec{\Theta}$
    for $i\leftarrow 1$ to $I$ do
        initialize request sets $\mathcal{R}_{A},\mathcal{R}_{I},\mathcal{R}_{C}^{i}$
        // request segregation
        for $r\in\mathcal{R}_{\checkmark}$ do
            if $t_{o_{r}}\leq(i-1)D$ then
                $\mathcal{R}_{A}\leftarrow\mathcal{R}_{A}\cup\{r\}$
            else
                if $e_{r}\leq(i-1)D$ then
                    $e_{r}\leftarrow(i-1)D$
                $\mathcal{R}_{I}\leftarrow\mathcal{R}_{I}\cup\{r\}$
        populate $\mathcal{R}_{C}^{i}$
        // model formulation
        add constraints (1b)–(1p) with $t\in\{(i-1)D,\ldots,iD\}$ wherever applicable
        for $r\in\mathcal{R}_{I}\cup\mathcal{R}_{A}$ do
            add constraint $z_{r}=1$
        for $r\in\mathcal{R}_{A}$ do
            add constraints (6d) for all $n_{k}\in\mu_{r}$ with $(i-1)D\leq t_{n_{k}}^{r}\leq iD$
        add objective $\max\ Z=\sum_{r\in\mathcal{R}_{O}^{i}}z_{r}p_{r}$
        solve modified Formulation (1)
        // update/add drone routes
        for $r\in\mathcal{R}_{O}^{i}$ do
            if $r\in\mathcal{R}_{I}$ then
                update $\vec{\theta}_{r}\leftarrow(r,t_{o_{r}},\mu_{r})$
            if $r\in\mathcal{R}_{C}^{i}$ then
                if $z_{r}^{sol}=1$ then
                    $\mathcal{R}_{\checkmark}\leftarrow\mathcal{R}_{\checkmark}\cup\{r\}$
                    $\vec{\theta}_{r}\leftarrow(r,t_{o_{r}},\mu_{r})$
                    $\vec{\Theta}\leftarrow\vec{\Theta}\cup\{\vec{\theta}_{r}\}$
        // remove completed requests
        for $r\in\mathcal{R}_{\checkmark}$ do
            if $t_{d_{r}}\leq iD$ then
                $\mathcal{R}_{\checkmark}\leftarrow\mathcal{R}_{\checkmark}\setminus\{r\}$
                $\vec{\Theta}\leftarrow\vec{\Theta}\setminus\{\vec{\theta}_{r}\}$

Algorithm 1: Pseudo-code for the Myopic ILP

4.2 Surrogate objective function

Maximizing profit acquisition at each time interval ignores future demand. Taking future demand into account requires making decisions that are sub-optimal in the current time interval but increase the cumulative reward by contributing to the objective of a later interval. We tolerate a sub-optimal reward per interval $\sum_{r\in\mathcal{R}_{C}^{i}}R(s_{r},a_{r})$ to achieve a greater reward-to-go $\sum_{r\in\mathcal{R}}R(s_{r},a_{r})$. This sub-optimality comes from allocating some resources to the future, thus preventing their use in the present. In other words, a sacrifice is made in the present to reap greater rewards in the future.

We define available link capacity per unit time as the network resource. This is captured as $C_{l}-\sum_{r\in\mathcal{R}_{O}^{i}}\chi^{r\uparrow}_{l}(t)$, where the link capacity occupied by idle and current requests is deducted from the total capacity of link $l$. The strategy then translates into (intentionally) creating reserve capacity in the network at the current interval, which prevents that capacity from being utilized at the current interval. This may lead to the rejection of requests or to longer routes in the current interval, but is expected to yield greater profit accumulation in future intervals. We motivate the choice of link capacity as the resource as follows: first, it is a quantitative definition, and link capacity at a given time can easily be measured during the execution of the optimization program; second, the spatial distribution of origins and destinations of customer requests, combined with conflict resolution of drone routes, leads to a heterogeneous resource space. In other words, the available capacities on different links have different value to the network. This creates a rich decision set, providing a sizeable margin with which to outperform myopic policies.

The priority of available capacity on link $l\in A$ at time interval $i$ is denoted $\beta_{i,l}$. This indicates the significance of the link in the network. If a link has higher priority, then any reserve capacity on it is relatively more important to network operation and route creation, compared to other links. The formulation of a data-driven prescription of $\beta_{i,l}$ is described in Section 4.3.1. Reserve capacity is generated on network links in proportion to their priority, since slack created on higher-priority links will have greater impact on maximizing profit in future intervals.

Indefinite and uncontrolled slack creation will adversely affect network performance and reduce cumulative profit. The reserve capacity created in the present should at some point be capitalized upon. Hence the balance between profit maximization and ($\beta_{i,l}$-proportionate) slack creation must be controlled. We construct a new objective function to incorporate this sacrificial strategy. Since this objective function is intended to replace the myopic objective function of the Myopic ILP, it is called the surrogate objective function and comprises two components:

$\max\ Z=\sum_{r\in\mathcal{R}_{O}^{i}}z_{r}p_{r}\ +\ \alpha_{i}\sum_{l\in A}\beta_{i,l}\sum_{t=(i-1)D}^{iD}\left[C_{l}-\sum_{r\in\mathcal{R}_{O}^{i}}\chi^{r\uparrow}_{l}(t)\right]$ (7)

The DDSP formulation with the surrogate objective function is named the Surrogate ILP. The first component is profit maximization, and the second is slack creation based on link priority. For every link $l\in A$, we first compute the slack capacity on $l$ during the current time interval and then multiply it by the relative priority of that link, $\beta_{i,l}$. We also define a parameter $\alpha_{i}$ which influences the balance between the two components. The value of $\alpha_{i}$ chosen for an interval determines to what extent reserve capacity will be created on network links (relative to profit maximization). Note that, independent of the value of $\alpha_{i}$, the slack generated on a link will be proportional to the relative priority of that link, unless $\alpha_{i}=0$, in which case the surrogate objective reduces to the myopic objective of pure profit maximization and $\beta_{i,l}$ plays no role.

In the computational experiments, both components of the surrogate objective function are standardized so that $\alpha_{i}$ can effectively control the balance. The first component, which is essentially the myopic objective function, is standardized by dividing it by the expected total profit over the entire time horizon, $p_{avg}\mathbf{E}[|\mathcal{R}|]$. This component is hereafter referred to as the Profit Component. The second component is standardized by dividing it by the total available network capacity in that interval, $(\sum_{l\in A}C_{l})D$. This component is hereafter referred to as the Slack Component.
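For illustration, the standardized surrogate objective of Eq. (7) can be assembled as a linear expression over the decision variables. The sketch below uses the docplex Python API for CPLEX (the solver used in Section 5) under the assumption that only the variables appearing in the objective are created here; constraints (1b)–(1p) would be added elsewhere, and all argument names are illustrative.

from docplex.mp.model import Model

def build_surrogate_model(requests, links, times, profit, cap, beta,
                          alpha_i, profit_norm, cap_norm):
    """Sketch of the standardized surrogate objective of Eq. (7); constraints omitted."""
    mdl = Model(name="surrogate_ddsp")
    z = mdl.binary_var_dict(requests, name="z")                 # request acceptance z_r
    chi_up = mdl.binary_var_dict(
        [(r, l, t) for r in requests for l in links for t in times],
        name="chi_up")                                          # upstream-link occupancy
    # Profit Component: myopic objective standardized by p_avg * E[|R|]
    profit_term = mdl.sum(profit[r] * z[r] for r in requests) / profit_norm
    # Slack Component: priority-weighted reserve capacity, standardized by (sum_l C_l) * D
    slack_term = mdl.sum(
        beta[l] * (cap[l] - mdl.sum(chi_up[r, l, t] for r in requests))
        for l in links for t in times) / cap_norm
    mdl.maximize(profit_term + alpha_i * slack_term)
    return mdl, z, chi_up

With alpha_i set to zero, the slack term vanishes and the objective reduces to the myopic one.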

4.3 Training procedure of relative link priority ($\beta_{i,l}$)

We next explain the training procedure used to determine predictive prescriptions of the relative link priority ($\beta_{i,l}$) used in the Surrogate ILP. The training procedure is articulated across two main steps: i) the synthesis of training data via lookahead simulation (Section 4.3.1) and ii) the training of a supervised ML model for predicting relative link priority (Section 4.3.2). To scale up the training process, we use a heuristic algorithm to solve the Myopic ILPs that arise therein (Section 4.3.3).

4.3.1 Training data synthesis via lookahead simulation

We use lookahead simulations to synthesize training data that will be used to train a predictive model which itself will be queried during the execution of the Surrogate ILP policy.

The priority of a link can be correlated with the frequency of link traversal by drones over a suitable span of time. Given a duration of network operation (e.g., $\delta$ intervals), crucial high-priority links will experience large throughput and are likely to create bottlenecks which impede request acceptance. The link priority $\beta_{i,l}$ is therefore formally defined as the number of times link $l$ is traversed by incoming requests between the beginning of interval $i$ and the end of interval $i+\delta$.

$\beta_{i,l}=\sum_{q=i}^{i+\delta}\ \sum_{r\in\mathcal{R}_{C}^{q}}\ \sum_{t=(q-1)D}^{qD}\chi^{r\uparrow,sol}_{l}(t)$ (8)

Here, $\chi^{r\uparrow,sol}_{l}(t)$ represents the routing solution value of the upstream-link variable $\chi^{r\uparrow}_{l}(t)$ for link $l$ and time $t$ at the end of each interval. We also define a vector $\bm{\vec{B}}$ to store the link priority values for all intervals.

$\bm{\vec{B}}=\begin{bmatrix}\bm{\vec{\beta}}_{i}\end{bmatrix}^{i\in\{1,\ldots,I\}}=\begin{bmatrix}\beta_{i,l}\end{bmatrix}_{l\in A}^{i\in\{1,\ldots,I\}}$ (9)

Link priority values as defined above are a function of drone traffic evolution in the network, which in turn depends on customer demand. From a snapshot of the network state at a certain time-step, the short-term future of the network may be estimated, provided historical patterns of customer demand are known. The introduction of the ML framework serves this purpose. We develop a supervised ML model which takes as input a snapshot of the network (number of drones occupying each link) and returns as output an estimate of link priorities for the near future. This input snapshot of link occupation is captured for the beginning of every time interval and is denoted by:

$s_{i,l}=\sum_{r\in\mathcal{R}_{A}}\ \sum_{t=(i-1)D-\Delta}^{(i-1)D+\Delta}\ \chi^{r\uparrow,sol}_{l}(t)\ \mathds{1}[t\leq(i-1)D<t+\Delta]$ (10)

where $\Delta=\frac{\rho_{l}}{v}$ is the time taken by a drone to traverse link $l$. Eq. (10) increments the value of $s_{i,l}$ for every active drone occupying link $l$ at the beginning of interval $i$ (i.e., $t=(i-1)D$). We define a collection vector to store the link occupancy values for all intervals.

$\bm{\vec{S}}=\begin{bmatrix}\bm{\vec{s}}_{i}\end{bmatrix}^{i\in\{1,\ldots,I\}}=\begin{bmatrix}s_{i,l}\end{bmatrix}_{l\in A}^{i\in\{1,\ldots,I\}}$ (11)

$\bm{\vec{S}}$ serves as the input feature dataset and $\bm{\vec{B}}$ as the target output dataset for training the ML model. Once trained, the model is deployed in the Surrogate ILP to estimate the values of $\beta_{i,l}$, which are subsequently used in the surrogate objective function (7). Note that in the training phase, all $\beta_{i,l}$, and hence $\bm{\vec{B}}$, are calculated via Eq. (8) to produce the output targets of the training dataset. In the testing and execution of the Surrogate ILP, however, the values of $\beta_{i,l}$ are predicted by the ML model.

In the training scheme, the number of training intervals is designed to be much larger than that in a test instance ($I_{training}\gg I$). This is equivalent to processing several test instances successively without emptying the network between two consecutive test instances. At the beginning of each training interval $i$, the link occupancy on each link $l$ by currently active requests is calculated and stored in $s_{i,l}$. For this we define the function LinkActivity, which computes $s_{i,l}$ as per Eq. (10). This function takes two inputs – the route $\mu_{r}$ of an active request and the link occupancy vector $\vec{\bm{s}}_{i}$ for the current interval. It iterates through the route to determine which link the UAV occupied at the beginning of the interval ($t=(i-1)D$) and increments the corresponding $s_{i,l}$ in $\vec{\bm{s}}_{i}$. Next, a lookahead is performed by virtually simulating the future of the network for the next $I_{virtual}$ intervals. For these intervals, virtual requests $\mathcal{R}_{V}$ are created, imitating historical data. These virtual requests are then routed through the network, obeying link capacity and conflict resolution constraints in the presence of active requests. The routes of successfully serviced virtual requests then contribute to the values of $\beta_{i,l}$ for interval $i$. For a link $l$, $\beta_{i,l}$ equals the number of times link $l$ was traversed by virtual requests in the lookahead simulation. This captures the perceived priority of link $l$ in the near future, from the perspective of the network at interval $i$, in accordance with Eq. (8) with $\delta=I_{virtual}$. We define the function BetaUpdate, which takes two inputs – the route $\mu_{r}$ of a virtual request and the link priority vector $\vec{\bm{\beta}}_{i}$ for the current interval – and updates $\beta_{i,l}$ for all the links that this request traverses according to $\mu_{r}$.

Once the virtual lookahead is complete, all virtual request paths are erased from the network. Thereafter, the process resembles the Myopic ILP. Idle requests from previous intervals and incoming requests of the current interval $i$ are routed. Successfully accepted requests are stored in $\mathcal{R}_{\checkmark}$. The successful path $\mu_{r}$ of a given request $r$ is also stored in a route vector $\theta_{r}=(r,t_{o_{r}},\mu_{r})$ and added to the collective route vector $\vec{\Theta}$. Then the next interval $i+1$ is initiated.
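The two bookkeeping functions referenced above admit compact implementations. The following plain-Python sketch assumes that a route is a list of node indices, that link travel times are known, and that the occupancy and priority vectors are dictionaries keyed by directed links; it is an illustration rather than the authors' code.

def link_activity(route, t_o, interval_start, travel_time, s_i):
    """LinkActivity (Eq. (10)): increment s_i for the link occupied at t = (i-1)D."""
    t = t_o                                   # time the drone leaves the route origin
    for u, v in zip(route[:-1], route[1:]):
        t_next = t + travel_time[(u, v)]
        if t <= interval_start < t_next:      # drone is on link (u, v) at the interval start
            s_i[(u, v)] += 1
            break
        t = t_next

def beta_update(route, beta_i):
    """BetaUpdate (Eq. (8)): count every link traversed by a virtual request route."""
    for u, v in zip(route[:-1], route[1:]):
        beta_i[(u, v)] += 1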

4.3.2 $k$-Nearest Neighbors multi-output regressor

We use $k$-Nearest Neighbors ($k$NN) regression to train a predictive model for estimating relative link priority based on network snapshots. The $k$NN regressor is a supervised ML algorithm used for prediction tasks. It predicts the output target value of a new data point by considering the average or weighted average of its $k$ nearest neighbors in the input feature space. The $k$NN algorithm uses a defined distance metric, such as the Euclidean distance, to measure the similarity between data points in the input feature space. By identifying the $k$ training feature vectors closest to a new data point, the algorithm establishes the nearest neighbors of that data point. The target value of the new data point is then predicted by averaging or weighting the target values associated with its $k$ nearest neighbors, leveraging the assumption that points that are close in the feature space exhibit similar target values. The $k$NN multi-output regressor extends this concept to handle multiple target values for each data point. Notably, the $k$NN regressor is non-parametric and makes minimal assumptions about the underlying data distribution, making it versatile in handling diverse datasets. It can also capture non-linear relationships between input features and output targets. We use the training data synthesized with the lookahead simulation procedure described in Section 4.3.1 to train the $k$NN regressor.
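A minimal training sketch with Scikit-learn, the package reported in Section 5.3, is shown below. The random arrays stand in for the dataset $(\bm{\vec{S}},\bm{\vec{B}})$ synthesized by Algorithm 3, and the hyperparameter values (60 neighbours, Euclidean distance) are those reported in Section 5.3.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_intervals, n_links = 2000, 76            # training intervals and Sioux Falls links

# Stand-ins for the synthesized dataset: S holds link occupancy snapshots,
# B the corresponding link priority targets computed via Eq. (8).
S = rng.integers(0, 2, size=(n_intervals, n_links))
B = rng.poisson(3.0, size=(n_intervals, n_links))

# KNeighborsRegressor supports multi-output targets natively: the prediction for
# a new snapshot is the average of the priority vectors of its k nearest
# training snapshots.
knn = KNeighborsRegressor(n_neighbors=60, metric="euclidean")
knn.fit(S, B)

beta_hat = knn.predict(S[:1])              # predicted priority vector, shape (1, n_links)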

4.3.3 Reservation heuristic

In practice, the training phase has an associated time budget, implying that link priority values must be estimated expediently. Solving the Myopic ILP for $I_{training}$ intervals demands excessive time and memory. Levin and Rey (2023) propose a reservation heuristic which routes incoming customer requests quickly, and this heuristic is used within the training phase. We illustrate the reservation heuristic in Algorithm 2 and formalize the training scheme in Algorithm 3.

The original spatial network $G=(N,A)$ can be converted into a spatio-temporal network $G^{T}$ as demonstrated in Figure 3. $G^{T}$ is the time-expanded dual graph of $G$. To construct $G^{T}$, the nodes and links are extended across time through discrete time levels, and connections are made on the basis of link travel times. Suppose a link in $G$ connects node $n_{1}$ to node $n_{2}$ and can be traversed in $\Delta t$ time steps. Then it connects two nodes $\Delta t$ time levels apart in $G^{T}$. The graph $G^{T}$ can be viewed as a network of node-time tuples $(n,t)$, where $n$ denotes the node and $t$ the time level on which it is located. The reservation heuristic works by routing customer requests on $G^{T}$ using a shortest-path algorithm. The route is then “reserved” for that request by securing link capacity in space-time, making that capacity unavailable for other requests. Since turn conflict constraints must also be respected in $G^{T}$, the heuristic removes connectivity between node-time tuples whenever constraints (1e)–(1f) and (1h)–(1l) are violated. Algorithm 2 defines the function NewReservation, which discovers a path $\mu_{r}^{T}$ for request $r$, makes the relevant reservations in $G^{T}$ and deduces the corresponding path $\mu_{r}$ in $G$. Two additional functions are also required in the training phase. A request may already have a path, but reservations for this path may either need to be made or erased. We use the functions Reserve($\mu_{r}^{T},G^{T}$) and EmptyReservation($\mu_{r}^{T},G^{T}$) for these purposes, respectively. They follow a similar procedure to NewReservation and are hence not illustrated.
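A minimal sketch of the reservation idea on the time-expanded graph is given below. It assumes unit link capacity (as in the experiments of Section 5), so that reserving a route simply deletes the space-time arcs it uses; turn-conflict bookkeeping and the time-window checks of the request are omitted, and the graph encoding is an assumption rather than the authors' data structure.

import heapq

def new_reservation(origin, dest, depart_time, GT):
    """Find and reserve an earliest-arrival space-time path in the time-expanded graph GT.

    GT maps a node-time tuple (n, t) to the set of node-time tuples reachable from it.
    Returns the spatial node sequence of the path, or None if no feasible path exists.
    """
    start = (origin, depart_time)
    dist, prev = {start: 0}, {}
    heap = [(0, start)]
    goal = None
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                          # stale heap entry
        if u[0] == dest:                      # first settled copy of dest = earliest arrival
            goal = u
            break
        for v in GT.get(u, ()):
            nd = d + (v[1] - u[1])            # cost = elapsed time steps
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal is None:
        return None
    path = [goal]                             # recover the space-time path
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    for u, v in zip(path[:-1], path[1:]):     # reserve: break connectivity (capacity 1)
        GT[u].discard(v)
    return [n for n, _ in path]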

Figure 3: (a) Original Network (b) Time-expanded Network
Input: $r, G^{T}$
Output: $\mu_{r}$
1   Function NewReservation($r, G^{T}$):
2       find space-time path $\mu_{r}^{T}$ in $G^{T}$ satisfying
3           1: previous reservations for (1e)–(1f) and (1h)–(1l)
4           2: spatial and temporal constraints of request $r$
5       if $\mu_{r}^{T}$ found then
6           make reservations (break connectivity) in $G^{T}$ as per (1e)–(1f) and (1h)–(1l) for $\mu_{r}^{T}$
7           create $\mu_{r}$ in $G$ from $\mu_{r}^{T}$
8           return $\mu_{r}$
9       else
10          return false
    End Function
Algorithm 2: Pseudo-code for the reservation heuristic
1   initialize request set $\mathcal{R}_{\checkmark}$ and collective route vector $\vec{\Theta}$
2   set $s_{i,l}\leftarrow 0$ and $\beta_{i,l}\leftarrow 0\ \forall\ l\in A,\ \forall\ i\in\{1,\ldots,I_{training}\}$
3   create $G^{T}$ from $G$
4   for $i\leftarrow 1$ to $I_{training}$ do
5       initialize request sets $\mathcal{R}_{A},\mathcal{R}_{I},\mathcal{R}_{C}^{i}$
        // request segregation as in Algorithm 1 (lines 6 to 12)
        // reserve active requests and record link traversals
6       for $r\in\mathcal{R}_{A}$ do
7           Reserve($\mu_{r}^{T},G^{T}$)
8           LinkActivity($\mu_{r},\vec{\bm{s}}_{i}$)
        // virtual requests
10      for $i_{v}\leftarrow 1$ to $I_{virtual}$ do
11          create virtual requests for interval $i_{v}$ and store in $\mathcal{R}_{V}$
13      for $r\in\mathcal{R}_{V}$ do
14          if NewReservation($r,G^{T}$) then
15              BetaUpdate($\mu_{r},\vec{\bm{\beta}}_{i}$)
18      for $r\in\mathcal{R}_{V}$ do
19          EmptyReservation($\mu_{r}^{T},G^{T}$)
        // solve real requests of current interval and update/add drone routes
21      populate $\mathcal{R}_{C}^{i}$
22      for $r\in\mathcal{R}_{O}^{i}$ do
23          $\mu_{r}\leftarrow$ NewReservation($r,G^{T}$)
24          if $r\in\mathcal{R}_{I}$ then
25              update $\theta_{r}\leftarrow(r,t_{o_{r}},\mu_{r})$
26          if $r\in\mathcal{R}_{C}^{i}$ then
27              if $\exists\mu_{r}$ then
28                  $\mathcal{R}_{\checkmark}\leftarrow\mathcal{R}_{\checkmark}\cup\{r\}$
29                  $\theta_{r}\leftarrow(r,t_{o_{r}},\mu_{r})$
30                  $\vec{\Theta}\leftarrow\vec{\Theta}\cup\{\theta_{r}\}$
        // empty all reservations to prepare for next interval
34      for $r\in\mathcal{R}_{O}^{i}\cup\mathcal{R}_{A}$ do
35          EmptyReservation($\mu_{r}^{T},G^{T}$)
        // remove completed requests as in Algorithm 1 (lines 33 to 36)
return $\bm{\vec{B}}$ and $\bm{\vec{S}}$
Algorithm 3: Training data synthesis for predicting relative link priority
1   for $d\leftarrow 1$ to $N_{days}$ do
2       initialize request set $\mathcal{R}_{\checkmark}$ and collective route vector $\vec{\Theta}$
3       initialize $\alpha$-profile $\begin{bmatrix}\alpha_{i}\end{bmatrix}^{i\in\{1,\ldots,I\}}$
4       set $s_{i,l}\leftarrow 0$ and $\beta_{i,l}\leftarrow 0\ \forall\ l\in A,\ \forall\ i\in\{1,\ldots,I\}$
5       for $i\leftarrow 1$ to $I$ do
6           initialize request sets $\mathcal{R}_{A},\mathcal{R}_{I},\mathcal{R}_{C}^{i}$
            // request segregation as in Algorithm 1 (lines 6 to 12)
7           for $r\in\mathcal{R}_{A}$ do
8               LinkActivity($\mu_{r},\vec{\bm{s}}_{i}$)
            // query ML component for link priorities & standardize
10          $\vec{\bm{\beta}}_{i}\leftarrow$ $k$NN($\vec{\bm{s}}_{i}$)
11          $\beta^{\max}\leftarrow\max(\vec{\bm{\beta}}_{i})$
12          $\vec{\bm{\beta}}_{i}\leftarrow\begin{bmatrix}\frac{\beta_{i,l}}{\beta^{\max}}\end{bmatrix}_{l\in A}$
13          populate $\mathcal{R}_{C}^{i}$
            // model formulation as in Algorithm 1 (lines 15 to 20)
14          add surrogate objective $\max\ Z=\sum_{r\in\mathcal{R}_{O}^{i}}z_{r}p_{r}+\alpha_{i}\sum_{l\in A}\beta_{i,l}\sum_{t=(i-1)D}^{iD}\left[C_{l}-\sum_{r\in\mathcal{R}_{O}^{i}}\chi^{r\uparrow}_{l}(t)\right]$
15          solve modified Formulation (1) with surrogate objective
            // update/add drone routes as in Algorithm 1 (lines 24 to 31)
            // remove completed requests as in Algorithm 1 (lines 33 to 36)
Algorithm 4: Surrogate ILP policy

4.4 Surrogate ILP policy

The Surrogate ILP policy is executed after completing the training procedure described in Section 4.3. The Surrogate ILP policy iteratively solves the Surrogate ILP, whose surrogate objective function is parameterized by the predictive prescriptions for relative link priority ($\beta_{i,l}$) and the balance parameter ($\alpha_{i}$). Recall that the ratio of profit maximization to link priority-proportional slack creation is controlled via the balance parameter $\alpha_{i}$. The magnitude of this balance parameter may be varied from one time interval to the next. Since $\alpha_{i}$ is coupled with the slack component, a greater magnitude favours slack creation and a smaller one favours profit maximization. Too large a value of $\alpha_{i}$, sustained over many intervals, injects excessive reserve capacity into the network and degrades the cumulative profit acquired. Too small a value makes its contribution insignificant, and a negative value is likely to suffocate the network by overloading link capacities. It follows that there exists a range of desirable values for $\alpha_{i}$, which we explore in our numerical experiments. The variation (or lack thereof) in the values of $\alpha_{i}$, denoted by the sequence $\begin{bmatrix}\alpha_{i}\end{bmatrix}^{i\in\{1,\ldots,I\}}$, is called the $\alpha$-profile. In our numerical experiments, three types of $\alpha$-profiles are investigated: constant, step-wise and continuous.

The pseudo-code of the Surrogate ILP policy is presented in Algorithm 4. At the start of every interval, the current active requests (which were accepted in previous intervals) are segregated. The function LinkActivity populates the link occupancy vector $\vec{\bm{s}}_{i}$ in accordance with the positions of active requests. The trained $k$NN regressor is then queried with $\vec{\bm{s}}_{i}$ as input, after which it supplies a predictive prescription of link priorities in the form of $\vec{\bm{\beta}}_{i}$. Once the link priority vector is standardized, it is implemented in the parameterized surrogate objective function, along with the chosen $\alpha$-profile.
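Lines 10–12 of Algorithm 4 reduce to a short query-and-standardize step. The sketch below assumes a fitted Scikit-learn regressor such as the one trained in Section 4.3.2 and a link occupancy vector for the current interval.

import numpy as np

def predict_link_priorities(knn, s_i):
    """Query the trained kNN regressor with the current snapshot and standardize
    the predicted priorities so that the largest beta_{i,l} equals 1."""
    beta_i = knn.predict(np.asarray(s_i).reshape(1, -1))[0]
    beta_max = beta_i.max()
    return beta_i / beta_max if beta_max > 0 else beta_i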

5 Numerical experiments

We conduct numerical experiments to assess the performance of the proposed data-driven optimization approach, i.e. the Surrogate ILP, for the online DDSP. We use the Myopic ILP as a baseline for benchmarking and we also conduct sensitivity analyses on learning parameters of the ML-enabled surrogate objective function.

We conduct these experiments on the Sioux Falls network, since it is an established data set for traffic routing problems and has been used by Levin and Rey (2023) for the deterministic DDSP. The network has 24 nodes and 76 links. The length of the time horizon is $T=60$ minutes, with 12 intervals of 5 minutes each ($I=12$ and $D=5$ minutes). For routing and capacity tracking, the time horizon is discretized into time steps of one minute. After 60 minutes elapse, the network remains operational until the final accepted request reaches its destination, i.e., until $\max_{r\in\mathcal{R}_{\checkmark}}d_{r}$. We focus on a proof of concept for the proposed novel solution methodology. In line with this goal, we maintain all link capacities at 1 drone per minute, even though this is probably more restrictive than is necessary to avoid collisions. The size of the ILPs is a function of the number of variables, and assigning small capacities permits solving the problem on a large network with capacity constraints. The number of customer requests submitted in each interval follows a Poisson process with rate $\lambda=100$, which amounts to an expected total of 1200 requests over the course of 12 intervals. Within an interval $i$, the request submission times are uniformly distributed between the start and end of the interval $[(i-1)D,iD)$. The spatial distribution of request origins $o_{r}$ and destinations $d_{r}$ is exponential with respect to the y-coordinate, with origins dominating the bottom of the network and destinations occupying the top. For each request, the earliest permissible departure time $e_{r}$ varies uniformly within $[0,10]$ minutes of the request submission time $t_{r}$. The arrival time window at the destination, $[l_{r},u_{r}]$, follows a skewed normal distribution between the expected travel time and the end of the time horizon, skewed towards the expected travel time. The price (profit) of each request is uniformly distributed between $[1,10]$.
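A minimal instance generator matching the distributional assumptions above could be written as follows. The exact parameterization of the exponential spatial weighting and of the skew-normal arrival window is not reported in the paper, so the shape and scale values below are placeholders, as are the variable names.

import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)

def generate_interval_requests(i, nodes_y, travel_time, D=5, lam=100, T=60):
    """Sample the requests submitted during interval i (all times in minutes).

    nodes_y     -- y-coordinate of each node (origins favour the bottom, destinations the top)
    travel_time -- matrix of expected node-to-node travel times
    """
    # exponential weighting on the y-coordinate for origins (low y) and destinations (high y)
    w_o = np.exp(-nodes_y / nodes_y.max())
    w_d = np.exp(nodes_y / nodes_y.max())
    requests = []
    for _ in range(rng.poisson(lam)):                      # Poisson number of submissions
        t_r = rng.uniform((i - 1) * D, i * D)              # submission time within the interval
        o = int(rng.choice(len(nodes_y), p=w_o / w_o.sum()))
        d = int(rng.choice(len(nodes_y), p=w_d / w_d.sum()))
        e_r = t_r + rng.uniform(0, 10)                     # earliest permissible departure
        ett = travel_time[o, d]                            # expected travel time
        # arrival-window end between the expected travel time and the horizon end,
        # skewed towards the expected travel time (shape a=4 is a placeholder)
        u_r = e_r + ett + abs(skewnorm.rvs(a=4, scale=max(T - ett, 1.0) / 4, random_state=rng))
        p_r = rng.uniform(1, 10)                           # request price (profit)
        requests.append({"t": t_r, "o": o, "d": d, "e": e_r, "u": u_r, "p": p_r})
    return requests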

We implemented the proposed algorithms in Python on a Windows machine with a 3.19 GHz processor and 64 GB of RAM. We use CPLEX 22.1.0.0 to solve the test instances, with a solve-time limit of 300 seconds for each interval of a given instance.

5.1 $\alpha$-profiles

We consider three types of $\alpha$-profiles: constant, step-wise and continuous. The performance of constant profiles can be used to determine the desirable range of $\alpha_{i}$, after which temporal variations can be introduced to extend the analysis. It should be noted that even though the time intervals are discrete, a continuous $\alpha$-profile is still permissible, because $\alpha_{i}$ is only evaluated at discrete points marking the beginning of each interval. A summary of the $\alpha$-profiles experimented with is provided in Table 1.

An intuitive approach consists of starting with a high value of $\alpha_{i}$ and reducing it over time. This realizes the strategy of preferentially creating link priority-based reserve capacity in the initial intervals and transitioning into profit accumulation towards the end of the time horizon. All non-constant profiles presented are non-increasing for this reason. However, constant $\alpha$-profiles have significant potential too, since the $k$NN model is adaptive and returns a different link priority vector at each interval. This implies that the priority-based slack creation is also adaptive and may relieve some of the adjustment burden from $\alpha_{i}$.

Name $\alpha$-profile composition Type
SP_CTE1 $\alpha=2$ Constant
SP_CTE2 $\alpha=1.5$
SP_CTE3 $\alpha=1.25$
SP_CTE4 $\alpha=1$
SP_CTE5 $\alpha=0.5$
SP_CTE6 $\alpha=-1$
SP_STP1 $\alpha$ = 1.5 → 1.25 | 6:6 Step-wise
SP_STP2 $\alpha$ = 1.5 → 1.25 | 8:4
SP_STP3 $\alpha$ = 1.5 → 1.25 → 1 | 4:4:4
SP_STP4 $\alpha$ = 1.5 → 1.25 → 1 | 7:3:2
SP_STP5 $\alpha$ = 1.25 → 1 | 6:6
SP_STP6 $\alpha$ = 1.25 → 1 | 8:4
SP_PLY1 $\alpha(t)=0.01t^{2}-0.15545t+1.5$ Continuous
SP_PLY2 $\alpha(t)=-0.01t^{2}+0.06455t+1.5$
SP_PLY3 $\alpha(t)=-0.04545t+1.5$
SP_PLY4 $\alpha(t)=e^{-0.063t}+0.5$
Table 1: Summary of $\alpha$-profiles for the Surrogate ILP
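The profiles in Table 1 are simple functions of the interval index and can be generated programmatically. The sketch below reproduces a few of them; whether the first interval corresponds to t = 0 or t = 1 in the continuous profiles is an assumption, as are the helper names.

import math

def constant_profile(alpha, n_intervals=12):
    """Constant profile, e.g. SP_CTE2 with alpha = 1.5."""
    return [alpha] * n_intervals

def step_profile(values, counts):
    """Step-wise profile, e.g. SP_STP3 = step_profile([1.5, 1.25, 1.0], [4, 4, 4])."""
    return [v for v, c in zip(values, counts) for _ in range(c)]

def continuous_profile(func, n_intervals=12):
    """Continuous profile evaluated at the start of each interval."""
    return [func(t) for t in range(n_intervals)]

# Examples corresponding to Table 1
sp_cte2 = constant_profile(1.5)
sp_stp3 = step_profile([1.5, 1.25, 1.0], [4, 4, 4])
sp_ply4 = continuous_profile(lambda t: math.exp(-0.063 * t) + 0.5)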

5.2 Myopic ILP policy

We first examine the performance of the Myopic ILP policy on 5 problem instances based on the Sioux Falls network as described above. Table 2 summarizes the results. The Myopic ILP achieved an average profit of 3178 across all 5 instances, with the expected maximum possible profit being $p_{avg}\mathbf{E}[|\mathcal{R}|]=6600$. The average service rate was 43.6%. These results serve as a baseline for benchmarking the proposed data-driven optimization approach. The average solve-time for an interval across all 5 instances was 123.3 seconds. We observed that if an optimal solution was reached, the solve-time was below 50 seconds, but for some intervals the solver reached the time limit of 300 seconds and returned the best feasible solution found. The latter case occurred in the initial three to four intervals, presumably because the network traffic was still evolving and had not reached saturation. In such a situation, many idle requests have not yet been dispatched and incoming requests continue to arrive at each interval, making it difficult to reach optimality. As idle requests begin to enter the network in greater numbers and network traffic increases, the pending request load subsides and optimality can be reached more easily.

Instance Total Requests Service Rate Profit
0 1203 41.6% 3076
1 1125 45.5% 3249
2 1260 41.7% 3260
3 1184 44.5% 3216
4 1147 44.0% 3089
Table 2: Performance of the Myopic ILP

5.3 Training the $k$NN multi-output regressor

The training instance contains 2000 intervals, instead of the usual 12 for test instances. The lookahead simulation uses 5 virtual intervals. We ran the training scheme described in Algorithm 3 on this instance and obtained the training dataset in which the input feature set is $\bm{\vec{S}}$ and the output target set is $\bm{\vec{B}}$. We employed the reservation heuristic to handle the size of the training instance. The time-expanded dual graph $G^{T}$ is constructed up to time layer $t_{\max}+\Delta$, where $t_{\max}=\max_{r\in\mathcal{R}}d_{r}$ over the training requests. The purpose of the buffer $\Delta$ is to account for the virtual requests created in the final interval, some of whose arrival windows may exceed $t_{\max}$.

The completed training dataset is passed to a $k$NN multi-output regressor. We implemented the $k$NN regressor with the Scikit-learn package in Python. As mentioned previously, the collection vectors $\bm{\vec{S}}$ and $\bm{\vec{B}}$ contain the link occupancy vectors $\bm{\vec{s}}_{i}=\begin{bmatrix}s_{i,l}\end{bmatrix}^{l\in A}$ and link priority vectors $\bm{\vec{\beta}}_{i}=\begin{bmatrix}\beta_{i,l}\end{bmatrix}^{l\in A}$, respectively, for all training intervals. During the execution of the training scheme, we converted the link occupancy vector $\bm{\vec{s}}_{i}$ into a 2-D node adjacency matrix $\bm{N_{s}}$ whose row and column indices are the nodes of the network. Each entry $s_{m,n}$ of this matrix represents a directed link from node $m$ to node $n$ along link $\omega(m,n)$, and the value of the entry equals the link occupancy. This transformation is useful because it encodes the state of the network in a more information-rich format. Apart from storing link occupancy, the node adjacency matrix also conveys the network topology (for example, two separate non-zero entries in the same row $m$ represent two outgoing links with $m$ as the common source). With $\bm{N_{s}}$ as input, the $k$NN regressor is trained to estimate the link priorities as output. The number of neighbours considered for each input is set to 60, and the distance metric is the Euclidean distance.
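The vector-to-matrix encoding described above is a small reshaping step. The sketch below assumes that links are stored as (tail, head) node pairs indexed consistently with the occupancy vector; feeding the matrix to the regressor requires flattening it back into a feature vector.

import numpy as np

def to_node_adjacency_matrix(s_i, links, n_nodes=24):
    """Encode the link occupancy vector s_i as a node adjacency matrix N_s,
    where entry (m, n) holds the occupancy of link omega(m, n)."""
    N_s = np.zeros((n_nodes, n_nodes))
    for k, (m, n) in enumerate(links):
        N_s[m, n] = s_i[k]
    return N_s

# Training features: one flattened adjacency matrix per training interval
# X = np.stack([to_node_adjacency_matrix(s, links).ravel() for s in S])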

Figure 4(a) shows one pair of input and target output vectors in the training scheme (training interval 500). The input is the snapshot of the network at the beginning of training interval 500, depicting the number of drones (UAVs) on each link. After simulating 5 virtual intervals, the link traversals of the accepted virtual requests give rise to the desired target output shown in Figure 4(b). The $k$NN model is trained on such input and target output pairs for the entire duration of the training scheme. We note that the spatial distribution of request origins and destinations influences the link priorities. Since origins dominate the bottom portion of the network and destinations dominate the top, links directed southwards (top to bottom) have the lowest priority. Links within the central portion of the network, which carry traffic in the opposite direction (bottom to top), have high priority. Some links responsible for lateral movement are also of moderate to high importance.

(a) Input: Link Occupancy vector $\bm{\vec{s}}_{i}=\begin{bmatrix}s_{i,l}\end{bmatrix}^{l\in A}$
(b) Target Output: Link Priority vector $\bm{\vec{\beta}}_{i}=\begin{bmatrix}\beta_{i,l}\end{bmatrix}^{l\in A}$
Figure 4: (a) Input and (b) Target output vectors for training interval 500 in the training scheme. Note: drones occupying the same link in the network need not be equidistant but have been shown as such for simplicity.

5.4 Constant $\alpha_{i}$

We conduct a sensitivity analysis to quantify the impact of the $\alpha$-profile on the performance of the Surrogate ILP policy. Since $\alpha_{i}=0\ \forall i\in\{1,\ldots,I\}$ is equivalent to the Myopic ILP, we perturbed $\alpha_{i}$ above and below 0 and observed the effect on the cumulative profit acquired. The cumulative profit of the Surrogate ILP is used to determine the Profit Gap % with respect to the Myopic ILP and serves as the main performance metric. We similarly recorded the cumulative service (total requests accepted) and the Service Gap %. The performance of the Surrogate ILP policy with constant $\alpha$-profiles is displayed in Table 3.

As expected, a fully negative $\alpha$-profile (SP_CTE6) performed worse than the Myopic ILP, because the surrogate objective then seeks to minimize slack and forces traffic through the network in a counter-productive manner. The influence of the link priorities $\beta_{i,l}$ further worsened the solution, because high-priority links are overloaded in proportion to their priority. In contrast, positive $\alpha$-profiles performed better than the benchmark, although the profit gap decreases if $\alpha$ is exceedingly large. This can be explained by observing that too large an $\alpha$ maintained over the entire time horizon drains the network excessively and prevents profitable requests from being routed successfully. For the given network and demand scenario, constant $\alpha$-profiles of 1.5, 1.25 and 1 performed best, with average gaps across the 5 problem instances of 9.43%, 10.62% and 9.73%, respectively. For these three $\alpha$-profiles, the average solve-times per interval across the 5 problem instances are 148.5, 150.0 and 142.4 seconds, respectively. A similar pattern to the Myopic ILP emerged, whereby the initial three to four intervals exhaust the solve-time limit, but thereafter optimality is reached well within it.

We observed that the Surrogate ILP policies served slightly fewer requests than the Myopic ILP policy but attained a larger profit. The Myopic ILP behaved greedily – optimal in the current interval (for the initial intervals) but negligent of the future – and lost out on potentially profitable requests in later intervals. In contrast, the Surrogate ILPs maintained reserve capacity on crucial links, which enabled the acceptance of high-profit requests regardless of the interval in which they arrived. It is possible, however, to create excessive reserves that begin to hinder traffic through the network, as evidenced by the inferior performance (relative to the other Surrogate ILPs) of SP_CTE1. With $\alpha_{i}=2$, SP_CTE1 had the lowest service rate among all Surrogate ILPs and, although it still outperformed the Myopic ILP policy, it was the worst-performing Surrogate ILP.

Instance Total Arrived Myopic ILP $\alpha$-profile Service Rate Profit Profit Gap Profit Gap % Service Gap Service Gap %
Service Rate Profit
0 1203 41.6% 3076 SP_CTE1 37.0% 3255 179 5.8% -56 -4.7%
SP_CTE2 40.6% 3497 421 13.7% -13 -1.0%
SP_CTE3 40.9% 3474 398 12.9% -9 -0.7%
SP_CTE4 41.6% 3413 337 11.0% -1 0.0%
SP_CTE5 42.6% 3308 232 7.5% 11 1.0%
SP_CTE6 41.5% 3062 -14 -0.5% -2 -0.1%
1 1125 45.5% 3249 SP_CTE1 40.1% 3290 41 1.3% -61 -5.4%
SP_CTE2 43.8% 3534 285 8.8% -19 -1.7%
SP_CTE3 45.0% 3590 341 10.5% -6 -0.5%
SP_CTE4 45.6% 3542 293 9.0% 1 0.1%
SP_CTE5 46.4% 3409 160 4.9% 10 0.9%
SP_CTE6 44.6% 3141 -108 -3.3% -10 -0.9%
2 1260 41.7% 3260 SP_CTE1 36.4% 3328 68 2.1% -67 -5.3%
SP_CTE2 39.7% 3563 303 9.3% -26 -2.0%
SP_CTE3 40.6% 3577 317 9.7% -14 -1.1%
SP_CTE4 41.3% 3564 304 9.3% -6 -0.4%
SP_CTE5 41.9% 3401 141 4.3% 2 0.2%
SP_CTE6 41.4% 3222 -38 -1.2% -4 -0.3%
3 1184 44.5% 3216 SP_CTE1 37.8% 3280 64 2.0% -80 -6.8%
SP_CTE2 40.7% 3443 227 7.1% -45 -3.8%
SP_CTE3 42.1% 3502 286 8.9% -29 -2.4%
SP_CTE4 43.3% 3515 299 9.3% -14 -1.2%
SP_CTE5 43.6% 3278 62 1.9% -11 -0.9%
SP_CTE6 43.7% 3195 -21 -0.7% -10 -0.8%
4 1147 44.0% 3089 SP_CTE1 37.1% 3148 59 1.9% -79 -6.9%
SP_CTE2 41.0% 3346 257 8.3% -35 -3.0%
SP_CTE3 43.0% 3430 341 11.0% -12 -1.0%
SP_CTE4 43.0% 3400 311 10.1% -12 -1.0%
SP_CTE5 43.9% 3590 501 16.2% -1 -0.1%
SP_CTE6 44.1% 2980 -109 -3.5% 1 0.1%
Table 3: Surrogate ILP policy with constant $\alpha$-profile vs. Myopic ILP policy

5.5 Step-wise & continuous $\alpha_{i}$

Having demonstrated the persistent superiority of SP_CTE2, SP_CTE3 and SP_CTE4, we explore step-wise changes in the $\alpha$-profile with 1.5, 1.25 and 1 as pivots. We name these Surrogate ILPs SP_STP1 through SP_STP6. For example, SP_STP1 has the $\alpha$-profile $1.5\rightarrow 1.25\ |\ 6{:}6$, which denotes 6 intervals of $\alpha_{i}=1.5$ followed by 6 intervals of $\alpha_{i}=1.25$. Similarly, SP_STP2 has the $\alpha$-profile $1.5\rightarrow 1.25\ |\ 8{:}4$, which denotes 8 intervals of $\alpha_{i}=1.5$ followed by 4 intervals of $\alpha_{i}=1.25$. The subsequent two step-wise profiles move from 1.5 all the way to 1, and the final two go from 1.25 to 1. In each of these three pairs of Surrogate ILP policies, the latter captures a delayed transition to a lower $\alpha_{i}$.

Table 4 records the performance of this class of $\alpha$-profiles. We observe higher-quality solutions from the group of step-wise $\alpha$-profiles than from the constant profiles. However, there is no clear superiority of one step-wise profile over another. Only SP_STP5 emerged as the best in two instances, while the remaining three instances are each dominated by a different step profile. As with SP_CTE2, SP_CTE3 and SP_CTE4, the service rate is lower than that of the Myopic ILP policy but the cumulative profit is greater.

The link priority vector returned by the $k$NN multi-output regressor is adaptive and differs in each interval, since it is extremely unlikely for the input snapshot vector of link occupancy to be identical in two consecutive intervals. For this reason, it is difficult to establish a prevailing improvement over constant $\alpha$-profiles with step-wise or continuous profiles. By a continuous $\alpha$-profile we mean a temporal function which is evaluated at discrete time steps (the start of each interval). We approximated the step-wise policies SP_STP3 and SP_STP4 as continuous functions of time. These policies (SP_PLY1 to SP_PLY4) outperformed the Myopic ILP as well, as shown in Table 5.

In Table 6, we compare the step-wise and continuous $\alpha$-profiles with the three best constant profiles. No sustained performance gap is observed, indicating that it is difficult to generate consistent improvement upon the adaptive $k$NN component, provided the value of $\alpha$ is not arbitrarily chosen. In Table 7 we summarize the profits and profit gaps (relative to the Myopic ILP) of all surrogate objective functions in the form of a heat map.

Instance Total Arrived Myopic ILP $\alpha$-profile Service Rate Profit Profit Gap Profit Gap % Service Gap Service Gap %
Service Rate Profit
0 1203 41.6% 3076 SP_STP1 40.2% 3444 368 12.0% -17 -1.4%
SP_STP2 40.5% 3498 422 13.7% -14 -1.1%
SP_STP3 41.3% 3501 425 13.8% -4 -0.3%
SP_STP4 40.2% 3481 405 13.2% -17 -1.4%
SP_STP5 41.6% 3511 435 14.1% -1 0.0%
SP_STP6 41.1% 3449 373 12.1% -7 -0.5%
1 1125 45.5% 3249 SP_STP1 43.8% 3568 319 9.8% -19 -1.7%
SP_STP2 44.4% 3584 335 10.3% -12 -1.1%
SP_STP3 43.6% 3485 236 7.3% -21 -1.9%
SP_STP4 44.7% 3588 339 10.4% -9 -0.8%
SP_STP5 45.2% 3547 298 9.2% -3 -0.3%
SP_STP6 44.4% 3517 268 8.2% -12 -1.1%
2 1260 41.7% 3260 SP_STP1 40.5% 3591 331 10.2% -16 -1.2%
SP_STP2 40.4% 3614 354 10.9% -17 -1.3%
SP_STP3 40.8% 3570 310 9.5% -12 -0.9%
SP_STP4 40.2% 3568 308 9.4% -19 -1.5%
SP_STP5 41.1% 3582 322 9.9% -8 -0.6%
SP_STP6 41.5% 3625 365 11.2% -3 -0.2%
3 1184 44.5% 3216 SP_STP1 41.9% 3536 320 10.0% -31 -2.6%
SP_STP2 41.6% 3492 276 8.6% -34 -2.9%
SP_STP3 42.1% 3470 254 7.9% -29 -2.4%
SP_STP4 42.4% 3552 336 10.4% -25 -2.1%
SP_STP5 43.6% 3556 340 10.6% -11 -0.9%
SP_STP6 42.5% 3470 254 7.9% -24 -2.0%
4 1147 44.0% 3089 SP_STP1 42.2% 3431 342 11.1% -21 -1.8%
SP_STP2 41.7% 3395 306 9.9% -27 -2.3%
SP_STP3 42.9% 3460 371 12.0% -13 -1.1%
SP_STP4 41.8% 3382 293 9.5% -26 -2.2%
SP_STP5 43.0% 3412 323 10.5% -12 -1.0%
SP_STP6 42.6% 3418 329 10.7% -16 -1.4%
Table 4: Surrogate ILP policy with step-wise $\alpha$-profile vs. Myopic ILP policy
Instance Total Arrived Myopic ILP $\alpha$-profile Service Rate Profit Profit Gap Profit Gap % Service Gap Service Gap %
Service Rate Profit
0 1203 41.6% 3076 SP_PLY1 41.1% 3461 385 12.5% -7 -0.5%
SP_PLY2 40.6% 3502 426 13.8% -13 -1.0%
SP_PLY3 41.7% 3542 466 15.1% 1 0.1%
SP_PLY4 40.9% 3458 382 12.4% -9 -0.7%
1 1125 45.5% 3249 SP_PLY1 44.6% 3514 265 8.2% -10 -0.9%
SP_PLY2 44.3% 3571 322 9.9% -14 -1.2%
SP_PLY3 45.0% 3562 313 9.6% -6 -0.5%
SP_PLY4 44.7% 3530 281 8.6% -9 -0.8%
2 1260 41.7% 3260 SP_PLY1 40.9% 3540 280 8.6% -11 -0.8%
SP_PLY2 40.9% 3648 388 11.9% -11 -0.8%
SP_PLY3 40.6% 3572 312 9.6% -14 -1.1%
SP_PLY4 40.8% 3576 316 9.7% -12 -0.9%
3 1184 44.5% 3216 SP_PLY1 43.2% 3501 285 8.9% -16 -1.3%
SP_PLY2 42.2% 3544 328 10.2% -27 -2.3%
SP_PLY3 42.8% 3546 330 10.3% -20 -1.7%
SP_PLY4 42.8% 3550 334 10.4% -20 -1.7%
4 1147 44.0% 3089 SP_PLY1 43.8% 3436 347 11.2% -3 -0.2%
SP_PLY2 41.8% 3379 290 9.4% -26 -2.2%
SP_PLY3 42.5% 3411 322 10.4% -18 -1.5%
SP_PLY4 42.9% 3453 364 11.8% -13 -1.1%
Table 5: Surrogate ILP policy with continuous $\alpha$-profile vs. Myopic ILP policy
Step-wise and Continuous $\alpha$-profiles
Constant $\alpha$-profile Instance SP_STP1 SP_STP2 SP_STP3 SP_STP4 SP_STP5 SP_STP6 SP_PLY1 SP_PLY2 SP_PLY3 SP_PLY4
0 -1.5% 0.0% 0.1% -0.5% 0.4% -1.4% -1.0% 0.1% 1.3% -1.1%
1 1.0% 1.4% -1.4% 1.5% 0.4% -0.5% -0.6% 1.0% 0.8% -0.1%
2 0.8% 1.4% 0.2% 0.1% 0.5% 1.7% -0.6% 2.4% 0.3% 0.4%
3 2.7% 1.4% 0.8% 3.2% 3.3% 0.8% 1.7% 2.9% 3.0% 3.1%
SP_CTE2 4 2.5% 1.5% 3.4% 1.1% 2.0% 2.2% 2.7% 1.0% 1.9% 3.2%
0 -0.9% 0.7% 0.8% 0.2% 1.1% -0.7% -0.4% 0.8% 2.0% -0.5%
1 -0.6% -0.2% -2.9% -0.1% -1.2% -2.0% -2.1% -0.5% -0.8% -1.7%
2 0.4% 1.0% -0.2% -0.3% 0.1% 1.3% -1.0% 2.0% -0.1% 0.0%
3 1.0% -0.3% -0.9% 1.4% 1.5% -0.9% 0.0% 1.2% 1.3% 1.4%
SP_CTE3 4 0.0% -1.0% 0.9% -1.4% -0.5% -0.3% 0.2% -1.5% -0.6% 0.7%
0 0.9% 2.5% 2.6% 2.0% 2.9% 1.1% 1.4% 2.6% 3.8% 1.3%
1 0.7% 1.2% -1.6% 1.3% 0.1% -0.7% -0.8% 0.8% 0.6% -0.3%
2 0.8% 1.4% 0.2% 0.1% 0.5% 1.7% -0.7% 2.4% 0.2% 0.3%
3 0.6% -0.7% -1.3% 1.1% 1.2% -1.3% -0.4% 0.8% 0.9% 1.0%
SP_CTE4 4 0.9% -0.1% 1.8% -0.5% 0.4% 0.5% 1.1% -0.6% 0.3% 1.6%
Table 6: Step-wise and continuous $\alpha$-profiles compared with constant $\alpha$-profiles. Values represent the profit gap % between the two, with respect to the constant profiles.
Profit
Instance Myopic SP_CTE1 SP_CTE2 SP_CTE3 SP_CTE4 SP_CTE5 SP_CTE6 SP_STP1 SP_STP2 SP_STP3 SP_STP4 SP_STP5 SP_STP6 SP_PLY1 SP_PLY2 SP_PLY3 SP_PLY4
0 3076 3255 3497 3474 3413 3308 3062 3444 3498 3501 3481 3511 3449 3461 3502 3542 3458
1 3249 3290 3534 3590 3542 3409 3141 3568 3584 3485 3588 3547 3517 3514 3571 3562 3530
2 3260 3328 3563 3577 3564 3401 3222 3591 3614 3570 3568 3582 3625 3540 3648 3572 3576
3 3216 3280 3443 3502 3515 3278 3195 3536 3492 3470 3552 3556 3470 3501 3544 3546 3550
4 3089 3148 3346 3430 3400 3290 2980 3431 3395 3460 3382 3412 3418 3436 3379 3411 3453
Profit Gap % with respect to the Myopic ILP
Instance Myopic SP_CTE1 SP_CTE2 SP_CTE3 SP_CTE4 SP_CTE5 SP_CTE6 SP_STP1 SP_STP2 SP_STP3 SP_STP4 SP_STP5 SP_STP6 SP_PLY1 SP_PLY2 SP_PLY3 SP_PLY4
0 0.0% 5.8% 13.7% 12.9% 11.0% 7.5% -0.5% 12.0% 13.7% 13.8% 13.2% 14.1% 12.1% 12.5% 13.8% 15.1% 12.4%
1 0.0% 1.3% 8.8% 10.5% 9.0% 4.9% -3.3% 9.8% 10.3% 7.3% 10.4% 9.2% 8.2% 8.2% 9.9% 9.6% 8.6%
2 0.0% 2.1% 9.3% 9.7% 9.3% 4.3% -1.2% 10.2% 10.9% 9.5% 9.4% 9.9% 11.2% 8.6% 11.9% 9.6% 9.7%
3 0.0% 2.0% 7.1% 8.9% 9.3% 1.9% -0.7% 10.0% 8.6% 7.9% 10.4% 10.6% 7.9% 8.9% 10.2% 10.3% 10.4%
4 0.0% 1.9% 8.3% 11.0% 10.1% 6.5% -3.5% 11.1% 9.9% 12.0% 9.5% 10.5% 10.7% 11.2% 9.4% 10.4% 11.8%
Table 7: Heat map of cumulative profit and profit gap (%) w.r.t. Myopic ILP policy

5.6 Profit acquired per interval

Figure 5 depicts the profit acquired per interval for test instances 0 and 4 by the Myopic ILP, SP_CTE2 and SP_STP3. In both instances, the first interval witnessed a larger profit acquisition by the Myopic ILP. Thereafter the surrogate models closed the gap and went on to maintain a lead. Presumably, in the first interval, link priority-based slack creation limits the number of profitable requests being serviced. But once this reserve capacity is capitalized upon in subsequent intervals, the profit gained per interval is larger for the surrogate models. Occasionally the Myopic ILP policy performs slightly better (interval 7 of instance 0 and interval 3 of instance 4), but by and large the Surrogate ILP policies dominate and acquire a larger cumulative profit.

As shown in Table 1, SP_CTE2 sets $\alpha_{i}=1.5\ \forall i\in I$, whereas SP_STP3 transitions from 1.5 to 1.25 to 1 in steps of 4 intervals each. This means that for the first four intervals, both surrogate policies have identical $\alpha_{i}$ values ($=1.5$). However, the profit acquired per interval differs from the second interval onwards, owing to the dynamic and adaptive nature of the $k$NN component of the Surrogate ILP policy.

Figure 5: Profit acquired per interval for the Myopic ILP, SP_CTE2 and SP_STP3 in (a) Instance 0 and (b) Instance 4

5.7 Significance of $\beta_{i,l}$

We conclude the numerical experiments by demonstrating the role of data-driven predictive prescriptions in determining efficient link priority vectors $\bm{\vec{\beta}}_{i}=\begin{bmatrix}\beta_{i,l}\end{bmatrix}^{l\in A}$. The value of these link priority estimates can be exhibited by comparison with an arbitrary policy which assumes equal importance (or priority) of all links in the network. Such a policy would assign a uniform link priority across the network and would thereby delegate all responsibility for managing the differential slack in the network to the $\alpha$-profile. Yet it would suffer from the disadvantage of being unable to create reserve capacity differentiated across space (links in the network), and would only do so across time (varying $\alpha_{i}$ across intervals). To this end, we conducted experiments using the Surrogate ILP policy with a uniform standardized link priority vector $\bm{\vec{\beta}}_{i}=\begin{bmatrix}\frac{1}{|A|}\end{bmatrix}^{l\in A}$ and report the outcome in Table 8. We observed that, with a uniform link priority vector, the performance dropped sharply and in many cases even fell below the Myopic ILP policy benchmark. This shows that uniform slack creation across all network links fails to improve performance because the reserve capacities are introduced in a manner that cannot be taken advantage of. It is crucial to create reserve capacity differentiated across both space and time to outperform the Myopic ILP benchmark.

Instance Total Arrived Myopic ILP $\alpha$-profile $\beta$ Application Service Rate Profit Profit Gap Profit Gap % Service Gap Service Gap %
Service Rate Profit
0 1203 41.6% 3076 SP_CTE2 kNN 40.6% 3497 421 13.7% -13 -1.0%
uniform 41.5% 3109 33 1.1% -2 -0.1%
SP_STP3 kNN 41.3% 3501 425 13.8% -4 -0.3%
uniform 42.5% 3157 81 2.6% 10 0.9%
SP_PLY2 kNN 40.6% 3502 426 13.8% -13 -1.0%
uniform 41.3% 3091 15 0.5% -4 -0.3%
1 1125 45.5% 3249 SP_CTE2 kNN 43.8% 3534 285 8.8% -19 -1.7%
uniform 46.4% 3244 -5 -0.2% 10 0.9%
SP_STP3 kNN 43.6% 3485 236 7.3% -21 -1.9%
uniform 46.0% 3219 -30 -0.9% 6 0.5%
SP_PLY2 kNN 44.3% 3571 322 9.9% -14 -1.2%
uniform 46.4% 3272 23 0.7% 10 0.9%
2 1260 41.7% 3260 SP_CTE2 kNN 39.7% 3563 303 9.3% -26 -2.0%
uniform 42.3% 3303 43 1.3% 7 0.6%
SP_STP3 kNN 40.8% 3570 310 9.5% -12 -0.9%
uniform 41.7% 3313 53 1.6% -1 0.0%
SP_PLY2 kNN 40.9% 3648 388 11.9% -11 -0.8%
uniform 40.7% 3221 -39 -1.2% -13 -1.0%
3 1184 44.5% 3216 SP_CTE2 kNN 40.7% 3443 227 7.1% -45 -3.8%
uniform 42.6% 3091 -125 -3.9% -23 -1.9%
SP_STP3 kNN 42.1% 3470 254 7.9% -29 -2.4%
uniform 44.0% 3168 -48 -1.5% -6 -0.5%
SP_PLY2 kNN 42.2% 3544 328 10.2% -27 -2.3%
uniform 43.9% 3185 -31 -1.0% -7 -0.6%
4 1147 44.0% 3089 SP_CTE2 kNN 41.0% 3346 257 8.3% -35 -3.0%
uniform 44.4% 3005 -84 -2.7% 4 0.4%
SP_STP3 kNN 42.9% 3460 371 12.0% -13 -1.1%
uniform 44.0% 2975 -114 -3.7% 0 0.0%
SP_PLY2 kNN 41.8% 3379 290 9.4% -26 -2.2%
uniform 43.8% 2976 -113 -3.7% -3 -0.2%
Table 8: Performance comparison with uniform $\beta$

6 Conclusion and perspectives

This study addresses the stochastic online version of the drone delivery service planning problem in urban airspace. The solution methods we explore are executed from the perspective of a network manager. Given an incoming Poisson process of drone delivery trip requests, we aim to find the optimal assignment of drones to space-time trajectories in an urban air traffic network within a pre-defined time horizon. We assume that drone deliveries must respect delivery time windows and that UAVs must fly at constant speed throughout the network to avoid airborne delays and congestion. Our main contribution is to extend the integer linear programming formulation of the deterministic DDSP to the stochastic variant with online demand. We also propose a novel solution methodology which leverages data-driven, machine learning-enabled predictions of network link priorities to construct a modified, surrogate objective function, and we outline a training scheme for the machine learning component. The numerical results indicate that the Surrogate ILP policy outperforms the Myopic ILP policy, which iteratively solves the ILP for the sole objective of profit maximization. We demonstrate that even though the training procedure might be extensive, it is a one-time investment that returns rewards in the form of a sustained profit margin over the Myopic ILP policy across multiple instances.

The escalating adoption of digital technology by the public and private sectors is producing real-time uncertain demand for logistics service providers globally. This is happening concurrently with the accelerating adoption of UAVs in transport fleets to service regions with varying degrees of urbanization. The utility of the solution methodology proposed in this study lies at the intersection of these two trends. By outperforming conventional online optimization methods, our approach unlocks considerable benefits for the service provider and for customers. Moreover, we establish link priority and available link capacity as pivotal factors for enhancing network performance. This implies that the strategy and procedure of non-myopic anticipatory resource allocation presented in this paper is transferable to other classes of network-based logistics problems such as pickup-and-delivery logistics and ride-sharing.

Multiple directions of further research can be identified. First, there is an opportunity to extend the data-driven optimization framework to learn efficient $\alpha$-profiles automatically. This also necessitates the design of a training scheme that explores the $\alpha$-profile space quickly and intelligently under time and computational budgets. Second, modifications can be imposed on the problem definition to make the DDSP more realistic, such as capacity constraints on each UAV, differential velocities and charging limitations. Extensions to the case of multiple deliveries per UAV also warrant further research. Although this stream of problems has been extensively studied through the lens of vehicle routing models, few studies have addressed the online demand case in a time-dependent routing context. The introduction and placement of charging platforms at selected nodes, or of a centralized drone hub somewhere in the network, is also a compelling network-design possibility.

References

  • Asadi et al. (2022) A. Asadi, S. N. Pinkley, and M. Mes. A Markov decision process approach for managing medical drone deliveries. Expert Systems with Applications, 204:117490, 2022.
  • Baloch and Gzara (2020) G. Baloch and F. Gzara. Strategic network design for parcel delivery with drones under competition. Transportation Science, 54(1):204–228, 2020.
  • Basso et al. (2022) R. Basso, B. Kulcsár, I. Sanchez-Diaz, and X. Qu. Dynamic stochastic electric vehicle routing with safe reinforcement learning. Transportation Research Part E: Logistics and Transportation Review, 157:102496, 2022.
  • Baty et al. (2024) L. Baty, K. Jungel, P. S. Klein, A. Parmentier, and M. Schiffer. Combinatorial optimization-enriched machine learning to solve the dynamic vehicle routing problem with time windows. Transportation Science, 2024.
  • Bengio et al. (2021) Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021.
  • Bertsimas and Kallus (2020) D. Bertsimas and N. Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.
  • Bertsimas et al. (2019) D. Bertsimas, P. Jaillet, and S. Martin. Online vehicle routing: The edge of optimization in large-scale applications. Operations Research, 67(1):143–162, 2019.
  • Bonami et al. (2018) P. Bonami, A. Lodi, and G. Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer, 2018.
  • Chen et al. (2021) H. Chen, Z. Hu, and S. Solak. Improved delivery policies for future drone-based delivery systems. European Journal of Operational Research, 294(3):1181–1201, 2021.
  • Chen et al. (2022) X. Chen, M. W. Ulmer, and B. W. Thomas. Deep Q-learning for same-day delivery with vehicles and drones. European Journal of Operational Research, 298(3):939–952, 2022.
  • Chen et al. (2023) X. Chen, T. Wang, B. W. Thomas, and M. W. Ulmer. Same-day delivery with fair customer service. European Journal of Operational Research, 308(2):738–751, 2023.
  • Chow (2016) J. Y. Chow. Dynamic UAV-based traffic monitoring under uncertainty as a stochastic arc-inventory routing policy. International Journal of Transportation Science and Technology, 5(3):167–185, 2016.
  • Chu et al. (2021) H. Chu, W. Zhang, P. Bai, and Y. Chen. Data-driven optimization for last-mile delivery. Complex & Intelligent Systems, pages 1–14, 2021.
  • Congress et al. (2021) S. S. C. Congress, A. J. Puppala, A. Banerjee, and U. D. Patil. Identifying hazardous obstructions within an intersection using unmanned aerial data analysis. International journal of Transportation Science and Technology, 10(1):34–48, 2021.
  • Dayarian et al. (2020) I. Dayarian, M. Savelsbergh, and J.-P. Clarke. Same-day delivery with drone resupply. Transportation Science, 54(1):229–249, 2020.
  • Edwards et al. (2010) J. Edwards, A. McKinnon, T. Cherrett, F. McLeod, and L. Song. Carbon dioxide benefits of using collection–delivery points for failed home deliveries in the united kingdom. Transportation Research Record, 2191(1):136–143, 2010.
  • Elmachtoub and Grigas (2022) A. N. Elmachtoub and P. Grigas. Smart “Predict, then Optimize”. Management Science, 68(1):9–26, 2022.
  • Feng et al. (2022) S. Feng, P. Duan, J. Ke, and H. Yang. Coordinating ride-sourcing and public transport services with a reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 138:103611, 2022.
  • Fruhling and Digman (2000) A. L. Fruhling and L. A. Digman. The impact of electronic commerce on business-level strategies. Journal of Electronic Commerce Research, 1(1):13, 2000.
  • Ghiasvand et al. (2024) M. R. Ghiasvand, D. Rahmani, and M. Moshref-Javadi. Data-driven robust optimization for a multi-trip truck-drone routing problem. Expert Systems with Applications, 241:122485, 2024.
  • Glock and Meyer (2020) K. Glock and A. Meyer. Mission planning for emergency rapid mapping with drones. Transportation Science, 54(2):534–560, 2020.
  • Gu et al. (2023) R. Gu, Y. Liu, and M. Poon. Dynamic truck–drone routing problem for scheduled deliveries and on-demand pickups with time-related constraints. Transportation Research Part C: Emerging Technologies, 151:104139, 2023.
  • Han et al. (2022) B. R. Han, T. Sun, L. Y. Chu, and L. Wu. Covid-19 and e-commerce operations: Evidence from alibaba. Manufacturing & Service Operations Management, 24(3):1388–1405, 2022.
  • Hecken et al. (2022) T. Hecken, S. Cumnuantip, and T. Klimmek. Structural design of heavy-lift unmanned cargo drones in low altitudes. Automated Low-Altitude Air Delivery: Towards Autonomous Cargo Transportation with Drones, pages 159–183, 2022.
  • Hong et al. (2019) J. Hong, M. Lee, T. Cheong, and H. C. Lee. Routing for an on-demand logistics service. Transportation Research Part C: Emerging Technologies, 103:328–351, 2019.
  • Kim and Park (2005) J. Kim and J. Park. A consumer shopping channel extension model: attitude shift toward the online store. Journal of Fashion Marketing and Management: An International Journal, 9(1):106–121, 2005.
  • Kornatowski et al. (2020) P. M. Kornatowski, M. Feroskhan, W. J. Stewart, and D. Floreano. A morphing cargo drone for safe flight in proximity of humans. IEEE Robotics and Automation Letters, 5(3):4233–4240, 2020.
  • Kruber et al. (2017) M. Kruber, M. E. Lübbecke, and A. Parmentier. Learning when to use a decomposition. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 202–210. Springer, 2017.
  • Larsen et al. (2022) E. Larsen, S. Lachapelle, Y. Bengio, E. Frejinger, S. Lacoste-Julien, and A. Lodi. Predicting tactical solutions to operational planning problems under imperfect information. INFORMS Journal on Computing, 34(1):227–242, 2022.
  • Levin and Rey (2023) M. W. Levin and D. Rey. Branch-and-price for drone delivery service planning in urban airspace. Transportation Science, 57(4):843–865, 2023.
  • Li et al. (2024) X. Li, P. Yan, K. Yu, P. Li, and Y. Liu. Parcel consolidation approach and routing algorithm for last-mile delivery by unmanned aerial vehicles. Expert Systems with Applications, 238:122149, 2024.
  • Liu (2019) Y. Liu. An optimization-driven dynamic vehicle routing algorithm for on-demand meal delivery using drones. Computers & Operations Research, 111:1–20, 2019.
  • Lodi and Zarpellon (2017) A. Lodi and G. Zarpellon. On learning and branching: a survey. Transactions in Operations Research, 25(2):207–236, 2017.
  • Lozzi et al. (2022) G. Lozzi, G. Iannaccone, I. Maltese, V. Gatta, E. Marcucci, and R. Lozzi. On-demand logistics: Solutions, barriers, and enablers. Sustainability, 14(15):9465, 2022.
  • Masorgo et al. (2023) N. Masorgo, S. Mir, and A. R. Hofer. You’re driving me crazy! how emotions elicited by negative driver behaviors impact customer outcomes in last mile delivery. Journal of Business Logistics, 2023.
  • Menouar et al. (2017) H. Menouar, I. Guvenc, K. Akkaya, A. S. Uluagac, A. Kadri, and A. Tuncer. Uav-enabled intelligent transportation systems for the smart city: Applications and challenges. IEEE Communications Magazine, 55(3):22–28, 2017.
  • Mohsan et al. (2023) S. A. H. Mohsan, N. Q. H. Othman, Y. Li, M. H. Alsharif, and M. A. Khan. Unmanned aerial vehicles (uavs): Practical aspects, applications, open challenges, security issues, and future trends. Intelligent Service Robotics, 16(1):109–137, 2023.
  • Moshref-Javadi and Winkenbach (2021) M. Moshref-Javadi and M. Winkenbach. Applications and research avenues for drone-based models in logistics: A classification and review. Expert Systems with Applications, 177:114854, 2021.
  • Muñoz-Villamizar et al. (2024) A. Muñoz-Villamizar, J. C. Velazquez-Martínez, and S. Caballero-Caballero. A large-scale last-mile consolidation model for e-commerce home delivery. Expert Systems with Applications, 235:121200, 2024.
  • Radoglou-Grammatikis et al. (2020) P. Radoglou-Grammatikis, P. Sarigiannidis, T. Lagkas, and I. Moscholios. A compilation of uav applications for precision agriculture. Computer Networks, 172:107148, 2020.
  • Rave et al. (2023) A. Rave, P. Fontaine, and H. Kuhn. Drone location and vehicle fleet planning with trucks and aerial drones. European Journal of Operational Research, 308(1):113–130, 2023.
  • Rejeb et al. (2023) A. Rejeb, K. Rejeb, S. J. Simske, and H. Treiblmaier. Drones for supply chain management and logistics: a review and research agenda. International Journal of Logistics Research and Applications, 26(6):708–731, 2023.
  • Sadana et al. (2023) U. Sadana, A. Chenreddy, E. Delage, A. Forel, E. Frejinger, and T. Vidal. A survey of contextual optimization methods for decision making under uncertainty. arXiv preprint arXiv:2306.10374, 2023.
  • Song et al. (2009) L. Song, T. Cherrett, F. McLeod, and W. Guan. Addressing the last mile problem: transport impacts of collection and delivery points. Transportation Research Record, 2097(1):9–18, 2009.
  • Tran-Thanh et al. (2012) L. Tran-Thanh, A. Chapman, A. Rogers, and N. Jennings. Knapsack based optimal policies for budget–limited multi–armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1134–1140, 2012.
  • Van Duin et al. (2016) J. Van Duin, W. de Goffau, B. Wiegmans, L. Tavasszy, and M. Saes. Improving home delivery efficiency by using principles of address intelligence for b2c deliveries. Transportation Research Procedia, 12:14–25, 2016.
  • Van Hentenryck et al. (2010) P. Van Hentenryck, R. Bent, and E. Upfal. Online stochastic optimization under time constraints. Annals of Operations Research, 177:151–183, 2010.
  • Verhoef et al. (2015) P. C. Verhoef, P. K. Kannan, and J. J. Inman. From multi-channel retailing to omni-channel retailing: introduction to the special issue on multi-channel retailing. Journal of Retailing, 91(2):174–181, 2015.
  • Vlahovic et al. (2017) N. Vlahovic, B. Knezevic, and P. Batalic. Implementing delivery drones in logistics business process: Case of pharmaceutical industry. International Journal of Mechanical and Industrial Engineering, 10(12):4026–4031, 2017.
  • Yan and Wang (2022) R. Yan and S. Wang. Integrating prediction with optimization: Models and applications in transportation management. Multimodal Transportation, 1(3):100018, 2022.
  • Yang et al. (2023) Y. Yang, C. Yan, Y. Cao, and R. Roberti. Planning robust drone-truck delivery routes under road traffic uncertainty. European Journal of Operational Research, 309(3):1145–1160, 2023.
  • Yilmaz and Büyüktahtakın (2024) D. Yilmaz and İ. E. Büyüktahtakın. An expandable machine learning-optimization framework to sequential decision-making. European Journal of Operational Research, 314(1):280–296, 2024.