
Differentially Private Stochastic Convex Optimization for Network Routing Applications

Matthew Tsao Lyft Inc. [email protected] Karthik Gopalakrishnan Stanford University [email protected] Kaidi Yang National University of Singapore [email protected] Marco Pavone Stanford University [email protected]
Abstract

Network routing problems are common across many engineering applications. Computing optimal routing policies requires knowledge about network demand, i.e., the origin and destination (OD) of all requests in the network. However, privacy considerations make it challenging to share individual OD data that would be required for computing optimal policies. Privacy can be particularly challenging in standard network routing problems because sources and sinks can be easily identified from flow conservation constraints, making feasibility and privacy mutually exclusive.

In this paper, we present a differentially private algorithm for network routing problems. The main ingredient is a reformulation of network routing which moves all user data-dependent parameters out of the constraint set and into the objective function. We then present an algorithm for solving this formulation based on a differentially private variant of stochastic gradient descent. In this algorithm, differential privacy is achieved by injecting noise, and one may wonder if this noise injection compromises solution quality. We prove that our algorithm is both differentially private and asymptotically optimal as the size of the training set goes to infinity. We corroborate the theoretical results with numerical experiments on a road traffic network which show that our algorithm provides differentially private and near-optimal solutions in practice.

1 Introduction

Network routing problems appear in many important topics in engineering, including traffic routing in transportation systems, power routing in electrical grids, and packet routing in distributed computer systems. Network routing problems study settings where resources must be delivered to customers through a network with limited bandwidth. The goal is typically to route resources to their respective customers as efficiently as possible, or equivalently, with as little network congestion as possible.

One key challenge in network routing problems is that the requests (i.e., network demand) are often not known in advance. Indeed, it is difficult to know exactly how much power a neighborhood will need, or exactly how many visits a particular website will receive on any given day. Since information about the demand is often necessary to develop optimal or near-optimal network routing solutions, network routing algorithms often need a way of obtaining or estimating future demand. With the advent of big data and internet-of-things systems, crowd-sourcing has gained popularity as a demand forecasting approach for network routing systems. In crowd-sourcing, customers submit their request history to the network operator. The network operator uses this historical data to train a statistical or machine learning model to predict future demand from historical demand.

While crowd-sourcing provides a bountiful supply of data for training demand forecasting models, it can also introduce potential privacy concerns. Since crowd-sourcing uses individual-level customer data to train demand forecasting models, the model’s outputs may reveal sensitive user information, especially if it overfits to its training data [CTW+21]. Such privacy risks are problematic because they may deter users from sharing their data with network operators, hence reducing the supply of training data for demand forecasting models.

To address such concerns, the demand forecasting pipeline should be augmented with privacy-preserving mechanisms. Differential privacy [DMNS06] is a principled and popular method to occlude the influence of a single user’s data on the result of a population study while also maintaining the study’s accuracy. This is done by carefully injecting noise into the desired computation so that data sets that differ by at most one data point will produce statistically indistinguishable results.

Providing differential privacy guarantees for the standard formulation of network routing is difficult because the constraints contain user data, meaning that in general feasibility and privacy become mutually exclusive. More specifically, in the standard network routing problem, the demand sources and sinks can be identified by checking for conservation of flow, and as a result, the presence of a user going to or from a very rare location can be detected from any feasible solution. Because differential privacy requires that the presence of any single user’s data be difficult to detect from the algorithm’s output, privacy and feasibility are at odds with one another in the standard formulation.

1.1 Statement of Contributions

In this paper we present a differentially private algorithm for network routing problems. The main ingredient is a reformulation of network routing which moves all user data-dependent parameters out of the constraint set and into the objective function. We then present an algorithm for solving this formulation based on differentially private variants of stochastic gradient descent. In this algorithm, differential privacy is achieved by injecting noise, and one may wonder if this noise injection compromises solution quality. We prove that our algorithm is differentially private and, under several reasonable regularity conditions, asymptotically optimal (as the size of the training set goes to infinity). We note that in exchange for becoming compatible with differentially private algorithms, this new formulation is more computationally expensive.

1.2 Related Work

Traffic assignment in transportation systems is one of the most well-known applications of network routing. Herein we focus our literature review on privacy research in transportation networks. Privacy works in transportation mainly focus on location privacy, where the objective is to prevent untrusted and/or external entities from learning geographic locations or location sequences of an individual [BS03]. Privacy-preserving approaches have been proposed for various location-based applications, e.g., trajectory publishing, mobile crowdsensing, traffic control, etc. These techniques are based on spatial cloaking [CML11] and differential privacy [Dwo08]. While not the setting of interest for this paper, there are many works that use Secure Multi-Party Computation (MPC) [GMW87] to achieve privacy in distributed mobility systems.

Spatial cloaking approaches aggregate users’ exact locations into coarse information. These approaches are often based on k-anonymity [Swe02], where a mobility dataset is divided into equivalence classes based on data attributes (e.g., geographical regions, time, etc.) so that each class contains at least k records [GFCA14, HC20]. These k-anonymity-based approaches can guarantee that every record in the dataset is indistinguishable from at least k−1 other records. However, k-anonymity is generally considered to be a weak privacy guarantee, especially when k is small. Furthermore, very coarse data aggregation is required to address outliers or sparse data, and in these cases spatial cloaking-based approaches provide low data accuracy.

Differential privacy provides a principled privacy guarantee by producing randomized responses to queries, whereby two datasets that differ in only one entry produce statistically indistinguishable responses [DMNS06]. In other words, differential privacy ensures that an adversary with arbitrary background information (e.g., query responses, other entries) cannot infer individual entries with high confidence. Within transportation research, [WHL+18, YLL+19] share noisy versions of location data for mobile crowd-sensing applications. [GZFS15, GLTY18, AHFI+18, LYH20] use differential privacy to publish noisy versions of trajectory data. [DKBS15] and [HTP17] apply differential privacy to gradient descent algorithms for federated learning in mobility systems.

Many of the works mentioned in the previous paragraph establish differential privacy of their proposed algorithms by using composition properties of differential privacy. Composition theorems for differential privacy describe how well privacy is preserved when conducting several computations one after another. In [DKBS15] and [HTP17], composition theorems are applied as black boxes without considering the mathematical properties of the gradient descent algorithm. As a result, the privacy guarantees are overly conservative, meaning that large amounts of noise are added to the algorithm, leading to suboptimal behavior both in theory and in practice. Similarly, [GZFS15, GLTY18, AHFI+18, LYH20] use composition rules as a black box, and while privacy is achieved in this way, there are no accuracy guarantees for the algorithms presented in those works.

While black-box applications of differential privacy can lead to impractical levels of noise injection, specialized applications of differential privacy have been discovered that provide privacy with much less noise. [WLK+17] show how a simple adjustment to stochastic gradient descent gives rise to an algorithm which is both differentially private and, under reasonable regularity assumptions, asymptotically optimal. [FMTT18] and [FKT20] refined this idea to develop stochastic gradient descent algorithms that are differentially private, computationally efficient, and have optimal convergence rates. These techniques cannot directly be used to solve the standard formulation of network routing because they study unconstrained optimization problems or optimization problems with public constraint sets (i.e., constraints that do not include any private data).

2 Model

In this section we define notations, network models, and assumptions that will allow us to formulate network routing as a data-driven convex optimization problem.

2.1 Preliminaries

Indicator Representation of Edge Sets: Let G=(V,E) be a graph with vertices V and edges E = {e_1, …, e_m}. For any subset of edges E′ ⊆ E, we define the indicator representation of E′, denoted 𝟙_{[E′]}, as a boolean vector of length m: the ith entry of 𝟙_{[E′]} is 1 if e_i ∈ E′, and is 0 otherwise.
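As a concrete illustration (not part of the paper's development), the indicator representation can be computed in a few lines of Python, where the edge ordering e_1, …, e_m is fixed by the order of a list:

```python
# Build the indicator representation 1_[E'] of an edge subset E' (Section 2.1).
# The ordering e_1, ..., e_m is given by the order of the `edges` list.
def indicator(edges, subset):
    subset = set(subset)
    return [1 if e in subset else 0 for e in edges]

edges = [("a", "b"), ("b", "c"), ("c", "a")]
print(indicator(edges, {("b", "c")}))  # [0, 1, 0]
```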

Derivative Notation: For a scalar-valued function x ↦ f(x), we use ∇f(x_0) to denote the gradient of f with respect to x evaluated at the point x = x_0. For a vector-valued function x ↦ g(x) and a variable z, we use 𝒟_z[g](x_0) to denote the derivative matrix of g with respect to z evaluated at the point x = x_0.

Projections: For a convex set 𝒮 ⊆ ℝ^d, we use Π_𝒮 : ℝ^d → ℝ^d to denote the projection operator for 𝒮. For any x ∈ ℝ^d, Π_𝒮(x) := argmin_{s∈𝒮} ‖x − s‖_2 is the point in 𝒮 that is closest to x.
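For intuition, here is a hedged Python sketch of the projection operator for one simple convex set, the box S = [lo, hi]^d, where the Euclidean projection reduces to coordinate-wise clipping (general convex sets require solving the stated minimization):

```python
import numpy as np

# Euclidean projection onto the box S = [lo, hi]^d: clip each coordinate.
# For this particular convex set, argmin_{s in S} ||x - s||_2 decomposes
# coordinate-wise, so clipping is exact.
def project_box(x, lo, hi):
    return np.clip(x, lo, hi)

x = np.array([1.5, -0.3, 0.7])
print(project_box(x, 0.0, 1.0))  # [1.  0.  0.7]
```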

2.2 Network, Demand, and Network Flow

In this section we will introduce a) a graph model of a network, b) a representation of network demand, c) the standard formulation for network routing and d) privacy requirements. The notation defined in this section is aggregated in table form in Section A for the reader’s convenience.

Definition 1 (Network Representation).

We use a directed graph G=(V,E) to represent the network, where V and E represent the set of vertices and edges in the network, respectively. We will use n := |V| and m := |E| to denote the number of vertices and edges in the graph, respectively. For vertex pairs (o,d) ∈ V×V we will use 𝒫_G(o,d) to denote the set of simple paths from o to d in G.

Definition 2 (Operation Period).

We use 𝒯 := [t_start, t_end] to denote the operation period within which a network operator wants to optimize its routing decisions. We will also use T to denote the number of minutes in the operation period. For example, t_start = 8:00am, t_end = 9:00am could represent a morning commute period where T = 60.

Definition 3 (Demand Representation).

We study a stationary demand model where demand within the operation period 𝒯 is specified by a matrix Λ ∈ ℝ_+^{n×n}. For each ordered pair of vertices (o,d) ∈ V×V, Λ(o,d) is the number of requests arriving per minute (i.e., the arrival rate) during 𝒯 that need routing from vertex o to vertex d.

Remark 1 (Estimating Λ from historical data).

The arrival rates from historical demand are computed empirically, i.e., if Λ_t represents the demand for day t, then Λ_t(o,d) is calculated by counting the number of (o,d) requests appearing on day t, and then dividing by T to obtain requests per minute.
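As a hedged sketch of this empirical estimate (function and variable names are illustrative, not from the paper), where a day's data is a list of (origin, destination) trips:

```python
import numpy as np

# Estimate the day-t arrival-rate matrix Lambda_t from that day's trip list.
# `trips` holds (origin, destination) vertex-index pairs; T is the length of
# the operation period in minutes (Remark 1).
def empirical_rates(trips, n, T):
    Lam = np.zeros((n, n))
    for o, d in trips:
        Lam[o, d] += 1.0   # count (o, d) requests
    return Lam / T         # convert counts to requests per minute

trips = [(0, 1), (0, 1), (2, 3)]
Lam = empirical_rates(trips, n=4, T=60)
print(Lam[0, 1])  # 2 requests over 60 minutes -> 1/30 requests per minute
```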

Definition 4 (Link Latency Functions).

To model congestion effects, each link e ∈ E has a latency function f_e : ℝ_+ → ℝ_+ which specifies the average travel time through the link as a function of the total flow on the link.

In this paper we study a setting where a network operator wants to route demand while minimizing the total travel time for the requests. With these definitions, the standard formulation of minimum travel time network routing is described in Definition 5.

Definition 5 (Standard Formulation of Network Flow).

For a network G=(V,E) with link latency functions {f_e}_{e∈E} and demand Λ, the standard network flow problem is the following minimization program:

minimize_x   ∑_{e∈E} y_e f_e(y_e)   (1a)
subject to   x = {x^{(o,d)}}_{(o,d)∈V×V},   (1b)
y_e = ∑_{(o,d)∈V×V} x_e^{(o,d)}  for all e ∈ E,   (1c)
∑_{v:(u,v)∈E} x_{(u,v)}^{(o,d)} − ∑_{w:(w,u)∈E} x_{(w,u)}^{(o,d)} = Λ(o,d)𝟙_{[u=o]} − Λ(o,d)𝟙_{[u=d]}  for all u ∈ V, (o,d) ∈ V×V.   (1d)

In (1), the decision variable x is a collection of flows {x^{(o,d)}}_{(o,d)∈V×V}, one for each non-zero entry of Λ, as represented by constraint (1b). Constraints (1d) are flow conservation constraints ensuring that x^{(o,d)} sends Λ(o,d) units of flow from o to d. Constraints (1c) ensure that {y_e}_{e∈E} represents the total amount of flow on each edge. Finally, the objective function (1a) minimizes the total travel time as a function of the total network flow.

In the next subsection we will describe the rigorous privacy requirements that we will mandate while designing algorithms for network flow. We then describe in Section 2.4 why privacy and feasibility are mutually exclusive in the standard network flow formulation.

2.3 Privacy Requirements

We will use differential privacy to reason about the privacy-preserving properties of our algorithms. At a high level, changing one data point of the input to a differentially private algorithm should lead to a statistically indistinguishable change in the output. To make this concept concrete we will need to define data sets and adjacency.

Definition 6 (Data sets).

Given a space of data points 𝒵, a data set L is any finite set of elements from 𝒵. In practice, each element of a data set is data collected from a single user, or data collected during a specific time period. We will use ℒ to denote the set of all data sets.

Definition 7 (Data Set Adjacency).

Given a space of data sets ℒ, an adjacency relation Adj is a mapping Adj : ℒ×ℒ → {0,1}. Two data sets L_1, L_2 are said to be adjacent if and only if Adj(L_1, L_2) = 1.

While the exact definition can vary across applications, adjacent data sets are sets that are very similar to one another. The most common definition of adjacency is the following: L and L′ are adjacent if and only if L′ can be obtained from L by adding, deleting, or modifying at most one data point, and vice versa. Thus comparing a function’s output on adjacent data sets measures how sensitive the function is to changes in a single data point.
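A minimal Python sketch of this add/delete/modify-one adjacency check (an illustration of the common definition above, treating data sets as multisets):

```python
from collections import Counter

# Add/delete/modify-at-most-one-point adjacency (Definition 7, common case).
def adjacent(L1, L2):
    c1, c2 = Counter(L1), Counter(L2)
    only1 = list((c1 - c2).elements())  # points of L1 unmatched in L2
    only2 = list((c2 - c1).elements())  # points of L2 unmatched in L1
    # at most one point was added, deleted, or modified
    return len(only1) <= 1 and len(only2) <= 1

print(adjacent([1, 2, 3], [1, 2]))     # True  (one point deleted)
print(adjacent([1, 2, 3], [1, 2, 9]))  # True  (one point modified)
print(adjacent([1, 2, 3], [1, 8, 9]))  # False (two points changed)
```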

With these definitions in place, we are now ready to define differential privacy.

Definition 8 (Differential Privacy).

For a given adjacency relation Adj, a function M : ℒ → 𝒳 is (ϵ,δ)-differentially private if for any L_1, L_2 ∈ ℒ with Adj(L_1, L_2) = 1, the following inequality

ℙ[M(L_1) ∈ E] ≤ e^ϵ ℙ[M(L_2) ∈ E] + δ

holds for every event E ⊆ 𝒳.

The definition of differential privacy requires that changing one data point in the input to a differentially private algorithm cannot change the distribution of the algorithm’s output by too much. Such a requirement ensures the following strong privacy guarantee: It is statistically impossible to reliably infer the properties of a single data point just by examining the algorithm’s output, even if the observer knows all of the other data points in the input [DMNS06].
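For intuition, the classical Laplace mechanism [DMNS06] is a minimal example of Definition 8: a counting query changes by at most 1 between adjacent data sets, so adding Laplace noise of scale 1/ϵ releases the count with (ϵ, 0)-differential privacy. A hedged Python sketch (names are ours):

```python
import numpy as np

# Laplace mechanism: a counting query has sensitivity 1 under add/remove-one
# adjacency, so Laplace(1/epsilon) noise yields (epsilon, 0)-differential privacy.
def private_count(dataset, predicate, epsilon, rng):
    true_count = sum(1 for z in dataset if predicate(z))
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
data = [3, 7, 7, 1, 9]
noisy = private_count(data, lambda z: z >= 5, epsilon=0.5, rng=rng)  # 3 + noise
```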

To proceed, we must specify what data set adjacency means in the context of network routing. For network routing problems, historical data is often a collection of routing requests through the network. To protect the privacy of those who submit requests, we want the routing policy we compute to not depend too much on any single request that appears in the historical data. This motivates the following notion of data set adjacency that we will use throughout this paper.

Definition 9 (Request Level Adjacency (RLA)).

We define a function RLA : ℒ×ℒ → {0,1} which maps pairs of data sets to booleans. For two historical data sets of network demand L := (Λ_1, …, Λ_N) and L′ := (Λ′_1, …, Λ′_N), we say L and L′ are request-level-adjacent (RLA) if exactly one of the following statements is true:

  1. L contains all of the requests in L′, and contains exactly one extra request that is not present in L′.

  2. L′ contains all of the requests in L, and contains exactly one extra request that is not present in L.

Mathematically, L and L′ are request-level-adjacent, i.e., RLA(L, L′) = 1, if and only if they satisfy all of the following relations:

  • There exists a day t so that Λ_k = Λ′_k for all k ≠ t.

  • There exist two vertices o and d so that Λ_t(o′,d′) = Λ′_t(o′,d′) for all (o′,d′) ≠ (o,d).

  • |Λ_t(o,d) − Λ′_t(o,d)| ≤ 1/T.

Indeed, these relations dictate that one of the data sets contains an extra request from o to d which occurred on day t. A difference of 1 request within a T-minute operation period leads to a change of 1/T in the arrival rate. Aside from this difference, the data sets are otherwise identical.
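A hedged Python sketch of this RLA check (illustrative names; each data set is a list of daily n×n demand matrices):

```python
import numpy as np

# Check request-level adjacency (Definition 9) between two historical data sets.
def is_rla(L1, L2, T, tol=1e-9):
    if len(L1) != len(L2):
        return False
    diffs = [t for t in range(len(L1)) if not np.allclose(L1[t], L2[t], atol=tol)]
    if len(diffs) != 1:          # the matrices must differ on exactly one day
        return False
    D = np.abs(L1[diffs[0]] - L2[diffs[0]])
    # exactly one (o, d) entry differs, by at most one request (1/T in rate)
    return np.count_nonzero(D > tol) == 1 and D.max() <= 1.0 / T + tol

T = 60
A = [np.zeros((2, 2)), np.zeros((2, 2))]
B = [np.zeros((2, 2)), np.zeros((2, 2))]
B[1][0, 1] = 1.0 / T  # one extra request from vertex 0 to vertex 1 on day 1
print(is_rla(A, B, T))  # True
```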

2.4 Differential Privacy Challenges in Standard Network Flow

In the introduction we mentioned that privacy and feasibility can be mutually exclusive in the standard formulation of network flow described in (1). In this section we formally show that if Λ is constructed from a data set of trips as described in Remark 1, then trips to or from uncommon locations can be easily detected from any feasible solution to (1). As a result, announcing or releasing a feasible solution to (1) is not, in general, differentially private. Formally, we will prove the following theorem in this section.

Theorem 1 (Differential Privacy Impossibility for Standard Network Flow).

Let M be a function that takes as input a matrix Λ with non-negative entries and returns a feasible solution to the optimization problem (1), where Λ is used as the demand matrix. Then M cannot be (ϵ,δ)-differentially private for any δ < 1.

We further note that (ϵ,δ)-differential privacy only provides a meaningful privacy guarantee when ϵ < 1 and δ < 1 [Dwo08].

Proof of Theorem 1.

Let G=(V,E) be a network, and let Λ be constructed from a historical data set of requests in G. Suppose there exist uncommon locations o and d for which Λ contains no trips to or from either o or d. Mathematically, this means that

Λ(o,u) = 0,  Λ(u,o) = 0,  Λ(d,u) = 0,  Λ(u,d) = 0  for all u ∈ V.

Such situations are not uncommon in transportation networks if, for example, o and d are the homes of two different people who do not drive (perhaps they walk or bike to and from work).

If we now add a trip from o to d to the data set, and let Λ′ be the resulting demand matrix, then we have

  (i) Λ(o′,d′) = Λ′(o′,d′) for all (o′,d′) ≠ (o,d), and

  (ii) Λ(o,d) = 0, Λ′(o,d) = 1/T.

Let Prob_1, Prob_2 be the optimization problem (1) using demand Λ, Λ′ respectively. Because Λ and Λ′ are request-level-adjacent, any differentially private algorithm must behave similarly when acting on Prob_1 and Prob_2. However, this is impossible because the feasible sets of Prob_1 and Prob_2 are disjoint. Examining constraint (1d) with u = d for the pair (o,d), any feasible solution to Prob_1 satisfies

∑_{u:(u,d)∈E} x_{(u,d)}^{(o,d)} − ∑_{w:(d,w)∈E} x_{(d,w)}^{(o,d)} = Λ(o,d)𝟙_{[d=d]} = 0.

However, any feasible solution to Prob_2 satisfies

∑_{u:(u,d)∈E} x_{(u,d)}^{(o,d)} − ∑_{w:(d,w)∈E} x_{(d,w)}^{(o,d)} = Λ′(o,d)𝟙_{[d=d]} = 1/T.

In other words, checking the net flow at node d will detect the presence of any trips going to or from d. We will now show that any algorithm which returns a feasible solution to (1) cannot be differentially private. To this end, define E to be the event that flow is conserved at node d. Then for any algorithm M that takes as input a demand matrix and returns a feasible solution to (1) with the specified demand matrix, we have ℙ[M(Λ) ∈ E] = 1 and ℙ[M(Λ′) ∈ E] = 0. Recalling Definition 8, M is (ϵ,δ)-differentially private only if ℙ[M(Λ) ∈ E] ≤ e^ϵ ℙ[M(Λ′) ∈ E] + δ. This inequality can only be satisfied if δ ≥ 1. ∎
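The detection argument in the proof can be illustrated numerically on a toy three-vertex network (the graph and numbers here are our illustration, not from the paper):

```python
import numpy as np

# Toy graph o -> v -> d. A feasible flow for Lambda' routes 1/T = 1/60 units
# along this path; a feasible flow for Lambda (no trips to d) routes nothing.
edges = [("o", "v"), ("v", "d")]
flow_with_trip = np.array([1 / 60, 1 / 60])  # feasible for Lambda'
flow_without_trip = np.array([0.0, 0.0])     # feasible for Lambda

def net_inflow(node, edges, flow):
    inflow = sum(f for (u, v), f in zip(edges, flow) if v == node)
    outflow = sum(f for (u, v), f in zip(edges, flow) if u == node)
    return inflow - outflow

# The event "flow is conserved at d" holds for one data set and not the other,
# so releasing a feasible solution reveals the presence of the extra trip.
print(net_inflow("d", edges, flow_without_trip) == 0)  # True
print(net_inflow("d", edges, flow_with_trip) == 0)     # False
```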

We have two remarks regarding Theorem 1.

Remark 2.

The same result holds if M returns the total flow y associated with a feasible solution (see (1c)), rather than the feasible solution itself. In other words, even total traffic measurements (without knowing the breakdown by (o,d) pairs) can already expose trips to or from uncommon locations.

Remark 3.

The vulnerability of trips to and from uncommon locations is not a purely theoretical concern. A study on the New York City Taxi and Limousine data set was able to identify trips from residential areas to a gentlemen’s club [Pan14]. Because the start locations were in residential areas, it was easy to re-identify the users who booked the taxi trips as the owners of the homes where the trips began. As a result, despite the data set being anonymized, users who had taken taxis to this gentlemen’s club were re-identified, and their reputations may have been negatively affected.

3 Routing Policy Formulation of Network Flow

To sidestep the impossibility result described in Theorem 1, we present an alternative formulation of the network flow problem in this section. The alternative formulation avoids the issues mentioned in Theorem 1 by moving all parameters related to user data from the constraints to the objective function, as described in (8). We note that (8) can only be solved if the demand Λ is known, which may not always be the case. For this reason, we present two variations of (8): (9) is the stochastic version of (8) where Λ is drawn from a distribution 𝒬, and (10) is the data-driven approximation to (9) that one would solve if 𝒬 is unknown.

Before formally defining the model, we provide a high-level description of how this formulation works. In this formulation, a feasible solution x = {x^{(o,d)}}_{(o,d)∈V×V} specifies, for each (o,d) ∈ V×V, a flow x^{(o,d)} that routes 1 unit of flow from o to d. We note that a flow is specified for (o,d) even if there is no demand for this origin and destination, i.e., even if Λ(o,d) = 0. We refer to x as a routing policy due to its connection to randomized routing, which Remark 5 discusses in further detail. Given a feasible solution x, the objective function first calculates the total traffic on each edge by taking a linear combination of the flows {x^{(o,d)}}_{(o,d)∈V×V}, where the coefficients of the linear combination are determined by the demand Λ, with high-demand (o,d) pairs having larger coefficients. The total travel time can then be computed from the total traffic in the same way as (1a). These ideas are formalized by the following definitions.

Definition 10 (Unit (o,d) flow).

For a given origin-destination pair (o,d), we say that a flow x^{(o,d)} ∈ ℝ_+^m is a unit (o,d) flow if and only if it routes exactly 1 unit of flow from o to d. Formally, this condition is represented by the following constraints:

∑_{v∈V:(v,u)∈E} x^{(o,d)}_{(v,u)} − ∑_{v∈V:(u,v)∈E} x^{(o,d)}_{(u,v)} = { −1 if u = o;  1 if u = d;  0 otherwise }  for all u ∈ V.   (5)

Indeed, (5) requires that the net flow entering o is −1, the net flow entering d is 1, and that flow is conserved at all other vertices in the graph.
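As an illustrative sketch, constraint (5) can be verified numerically for a candidate flow (the toy graph below is our example, not from the paper):

```python
import numpy as np

# Check constraint (5): x is a unit (o, d) flow iff the net inflow is
# -1 at o, +1 at d, and 0 at every other vertex.
def is_unit_od_flow(x, edges, nodes, o, d, tol=1e-9):
    for u in nodes:
        net = sum(xe for (a, b), xe in zip(edges, x) if b == u) \
            - sum(xe for (a, b), xe in zip(edges, x) if a == u)
        target = -1.0 if u == o else (1.0 if u == d else 0.0)
        if abs(net - target) > tol:
            return False
    return True

nodes = ["o", "v", "d"]
edges = [("o", "v"), ("v", "d"), ("o", "d")]
x = np.array([0.4, 0.4, 0.6])  # splits one unit of flow across two o-to-d paths
print(is_unit_od_flow(x, edges, nodes, "o", "d"))  # True
```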

Definition 11 (Unit Network Flow).

A unit network flow is a collection of flows x = {x^{(o,d)}}_{(o,d)∈V×V} so that x^{(o,d)} is a unit (o,d) flow for each (o,d) ∈ V×V. We use 𝒳 to denote the set of all unit network flows.

Remark 4.

We can represent x as a concatenation of the vectors {x^{(o,d)}}_{(o,d)∈V×V}. Since each unit (o,d) flow is an m-dimensional vector, we have x ∈ ℝ^{n²m}.

We will refer to unit network flows as routing policies, due to their connection with randomized routing, described in the following remark.

Remark 5 (Routing demand using unit flow policies).

A unit network flow represents a randomized routing policy. By the flow decomposition theorem [Tre11], every unit (o,d) flow can be written as a distribution over paths from o to d. Formally, for any unit (o,d) flow x^{(o,d)}, there exist paths p_1^{(o,d)}, p_2^{(o,d)}, …, p_m^{(o,d)} from o to d and weights w_1^{(o,d)}, …, w_m^{(o,d)} so that the following equations hold:

x^{(o,d)} = ∑_{i=1}^{m} w_i^{(o,d)} 𝟙_{[p_i^{(o,d)}]}   (6)
∑_{i=1}^{m} w_i^{(o,d)} = 1
w_i^{(o,d)} ≥ 0 for all 1 ≤ i ≤ m.

Furthermore, p_1^{(o,d)}, …, p_m^{(o,d)} and w_1^{(o,d)}, …, w_m^{(o,d)} are efficiently computable. Defining the probability distribution P_{x^{(o,d)}} by

P_{x^{(o,d)}}(𝟙_{[p_i^{(o,d)}]}) = w_i^{(o,d)} for all 1 ≤ i ≤ m,

we see that x_e^{(o,d)} represents the probability that a random path drawn from P_{x^{(o,d)}} contains e. Thus x describes the expected behavior of the randomized routing policy that routes each (o,d) request along a path drawn independently at random from P_{x^{(o,d)}}. In particular, when using this policy to serve demand Λ, by linearity of expectation, the expected number of requests from o to d whose assigned path contains e is exactly Λ(o,d)x_e^{(o,d)}. Furthermore, the average number of requests on each edge is given by the vector ∑_{(o,d)∈V×V} Λ(o,d)x^{(o,d)}.
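A hedged sketch of this randomized routing policy: the greedy path-stripping below computes a flow decomposition of an acyclic unit (o,d) flow, after which a request's path is sampled with probabilities given by the weights. (The paper only asserts that the decomposition is efficiently computable; this particular procedure is our illustration.)

```python
import numpy as np

# Flow decomposition by greedy path stripping (assumes the flow's support is
# acyclic, so the o -> d trace below terminates).
def decompose(x, edges, o, d, tol=1e-9):
    x = dict(zip(edges, x))
    paths, weights = [], []
    while True:
        path, u = [], o
        while u != d:  # trace a path o -> d through edges with remaining flow
            e = next(((a, b) for (a, b) in x if a == u and x[(a, b)] > tol), None)
            if e is None:
                return paths, weights
            path.append(e)
            u = e[1]
        w = min(x[e] for e in path)  # strip the bottleneck amount of flow
        for e in path:
            x[e] -= w
        paths.append(path)
        weights.append(w)

edges = [("o", "v"), ("v", "d"), ("o", "d")]
paths, weights = decompose([0.4, 0.4, 0.6], edges, "o", "d")
# Randomized routing: serve a request by sampling a path with probability w_i.
rng = np.random.default_rng()
routed = paths[rng.choice(len(paths), p=np.array(weights) / sum(weights))]
```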

3.1 Minimum Total Travel Time Network Flow

In the minimum travel time network flow problem, the network operator wants to find a stationary routing policy for each (o,d) pair that will lead to small total travel times for the requests. Due to the equivalence between stationary (o,d) routing policies and unit (o,d) flows, the network operator can instead search over the space of unit (o,d) flows.

The total travel time of a flow y through G is given by ∑_{e∈E} y_e f_e(y_e). The total flow resulting from a unit network flow x serving demand Λ is the sum of the flow contributions from each of the (o,d) pairs. With Remark 5 in mind, the total amount of flow on an edge e ∈ E when serving Λ according to x is given by ∑_{(o,d)∈V×V} Λ(o,d)x_e^{(o,d)}. We can thus define F(x,Λ), the total travel time experienced by requests Λ when routed by x, as follows:

F(x,Λ) := ∑_{e∈E} (∑_{(o,d)∈V×V} Λ(o,d)x_e^{(o,d)}) f_e(∑_{(o,d)∈V×V} Λ(o,d)x_e^{(o,d)}).   (7)
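As a hedged numerical illustration of (7), the sketch below evaluates F(x,Λ) on a toy network using a BPR-style latency f_e(y) = t0_e (1 + 0.15 (y/c_e)^4); this latency form is an assumption for the example only (the paper's latency model appears in Section 3.4):

```python
import numpy as np

# Evaluate the total travel time F(x, Lambda) from (7).
def total_travel_time(x, Lam, t0, cap):
    # x: (n, n, m) array where x[o, d] is the unit (o, d) flow;
    # Lam: (n, n) demand matrix; t0, cap: free-flow times and capacities.
    y = np.tensordot(Lam, x, axes=([0, 1], [0, 1]))   # total flow on each edge
    latency = t0 * (1 + 0.15 * (y / cap) ** 4)        # BPR-style f_e (assumed)
    return float(np.dot(y, latency))

# Toy network: 2 vertices, a single edge from vertex 0 to vertex 1.
x = np.zeros((2, 2, 1))
x[0, 1, 0] = 1.0   # the unit (0, 1) flow uses the single edge
Lam = np.zeros((2, 2))
Lam[0, 1] = 2.0    # 2 requests per minute from vertex 0 to vertex 1
print(total_travel_time(x, Lam, t0=np.array([1.0]), cap=np.array([2.0])))  # 2.3
```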

With these definitions in place, the unit network flow that minimizes the total travel time when serving the demand Λ can be found by solving the following optimization problem:

minimize_x   F(x,Λ)   (8)
subject to   x = {x^{(o,d)}}_{(o,d)∈V×V},
x^{(o,d)} is a unit (o,d) flow for every (o,d) ∈ V×V.

3.2 Network Flow with Stochastic Demand

In practice, demand may vary from day to day, and such variations can be modeled by treating Λ as a random variable with distribution 𝒬. If 𝒬 is known to the network operator, then rather than solving (8), the operator is interested in minimizing the expected total travel time through the following stochastic optimization problem:

minimize𝑥\displaystyle\underset{x}{\text{minimize }} 𝔼Λ𝒬[F(x,Λ)]\displaystyle\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x,\Lambda)\right] (9)
subject to x={x(o,d)}(o,d)V×V,\displaystyle x=\left\{x^{(o,d)}\right\}_{(o,d)\in V\times V},
x^{(o,d)} is a unit (o,d) flow for every (o,d)\in V\times V.

We note that (9) is a generalization of (1) to the case when Λ\Lambda is random.

In the more realistic case that 𝒬\mathcal{Q} is not known, the optimization problem (9) can be approximated from historical data. We study a situation where the network operator has demand data Λ1,Λ2,,ΛNi.i.d𝒬\Lambda_{1},\Lambda_{2},...,\Lambda_{N}\overset{\text{i.i.d}}{\sim}\mathcal{Q} collected from operations of previous days. Using this data it can solve the following empirical approximation to (9):

minimize𝑥\displaystyle\underset{x}{\text{minimize }} 1Nk=1NF(x,Λk)\displaystyle\frac{1}{N}\sum_{k=1}^{N}F(x,\Lambda_{k}) (10)
subject to x={x(o,d)}(o,d)V×V,\displaystyle x=\left\{x^{(o,d)}\right\}_{(o,d)\in V\times V},
x^{(o,d)} is a unit (o,d) flow for every (o,d)\in V\times V.

The optimization problem in (10) uses historical data to estimate (9). In line with Assumption 1, we assume that Λ_k(o,d) ≤ λ_max for all values of k, o, and d.
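As an illustration of why the sample average in (10) is a reasonable surrogate for (9), the sketch below uses a scalar stand-in for F and a Uniform demand distribution (both assumptions made purely for illustration) and shows the empirical average approaching the true expectation as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar stand-in for F(x, Lambda): one edge with latency f(y) = y and demand
# Lambda, so F(x, Lambda) = (Lambda * x)^2 for a scalar routing variable x.
def F(x, lam):
    return (lam * x) ** 2

x = 0.8
# With Lambda ~ Uniform[0, 2], E[Lambda^2] = 4/3, so E[F(x, Lambda)] = x^2 * 4/3.
exact = x ** 2 * 4.0 / 3.0
for N in [10, 1000, 100000]:
    samples = rng.uniform(0.0, 2.0, size=N)      # i.i.d. demand draws from Q
    saa = np.mean([F(x, lam) for lam in samples])  # the objective in (10) at x
    print(N, abs(saa - exact))
```

The printed gaps shrink (up to sampling noise) as N grows, which is the sense in which (10) approximates (9).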

In Section 4 we show how (10) can be solved in a request-level differentially private way.

3.3 Assumptions on Travel Time functions

In this section we will introduce some assumptions that will help us establish our technical results. We will make the following assumptions on the network demand:

Assumption 1 (Bounded Demand).

We assume there exists a non-negative constant λmax\lambda_{\max} so that Λ(o,d)λmax\Lambda(o,d)\leq\lambda_{\max} for every (o,d)V×V(o,d)\in V\times V. In practice, this constant can be estimated from historical data.

The following are assumptions we make on the objective function FF (see (7)). These assumptions are related to properties of the link latency functions {fe}eE\left\{f_{e}\right\}_{e\in E}. We present a typical model for link latency functions in Section 3.4 that satisfies all of the following assumptions.

Assumption 2 (Bounded Variance Gradients).

We assume there exists a non-negative constant K so that for every x, the second moment of the gradient ∇F(x,Λ) is upper bounded by K², i.e., 𝔼_{Λ∼𝒬}[||∇F(x,Λ)||₂²] ≤ K².

Assumption 3 (Twice Differentiability).

We assume that F(·,Λ) is twice differentiable for every Λ, so that the Hessian H(x,Λ) := ∂²F(x,Λ)/∂x² is defined on the entire domain of x.

Assumption 4 (Strong Convexity).

We assume that there exists α>0\alpha>0 for which F(,Λ)F(\cdot,\Lambda) is α\alpha-strongly convex for every Λ\Lambda. This means that for every Λ\Lambda, and any unit network flows x,xx,x^{\prime} we have

F(x,Λ)F(x,Λ)+xF(x,Λ)(xx)+α2xx22\displaystyle F(x^{\prime},\Lambda)\geq F(x,\Lambda)+\nabla_{x}F(x,\Lambda)^{\top}(x^{\prime}-x)+\frac{\alpha}{2}\left|\left|x^{\prime}-x\right|\right|_{2}^{2}
Assumption 5 (Smoothness).

We assume that there exists β>α\beta>\alpha for which F(,Λ)F(\cdot,\Lambda) is β\beta-smooth for every Λ\Lambda. This means that for every Λ\Lambda, and any unit network flows x,xx,x^{\prime} we have

F(x,Λ)F(x,Λ)+xF(x,Λ)(xx)+β2xx22.\displaystyle F(x^{\prime},\Lambda)\leq F(x,\Lambda)+\nabla_{x}F(x,\Lambda)^{\top}(x^{\prime}-x)+\frac{\beta}{2}\left|\left|x^{\prime}-x\right|\right|_{2}^{2}.
Assumption 6 (Bounded second order partial derivative).

We assume that there exists C>0C>0 so that 𝒟Λ(o,d)[xF](x,Λ)opC\left|\left|\mathcal{D}_{\Lambda(o,d)}\left[\nabla_{x}F\right](x,\Lambda)\right|\right|_{op}\leq C for all x,Λx,\Lambda and (o,d)V×V(o,d)\in V\times V.

Remark 6 (Satisfying Assumption 6).

We can satisfy Assumption 6 for any positive C by re-scaling. In particular, letting θ := max_{x′,Λ′,(o,d)} ||𝒟_{Λ(o,d)}[∇_x F](x′,Λ′)||₂, for any C we can define the re-scaled objective function

F~(x,Λ):=CθF(x,Λ).\displaystyle\widetilde{F}(x,\Lambda):=\frac{C}{\theta}F(x,\Lambda).

By construction, F̃ satisfies Assumption 6 with constant value C. Note, however, that the smoothness and strong convexity parameters of F̃ are rescaled accordingly to Cβ/θ and Cα/θ, respectively.

3.4 Transportation Model satisfying assumptions from Section 3.3

In this section, we present a transportation network model that satisfies Assumptions 3, 4, 5 and 6. We study a network where the link latency functions are all affine: for each e∈E, there are non-negative constants q_e, c_e so that f_e(y) = q_e y + c_e. Let Q∈ℝ^{m×m} be the diagonal matrix defined so that Q_{ee} = q_e, and let c∈ℝ^m be the concatenation of the zeroth-order coefficients of the link latency functions.

As mentioned in Remark 4, we will represent xx as a concatenation of {x(o,d)}(o,d)V×V\left\{x^{(o,d)}\right\}_{(o,d)\in V\times V}. As such, xn2mx\in\mathbb{R}^{n^{2}m}. Let {(oi,di)}i=1n2\left\{(o_{i},d_{i})\right\}_{i=1}^{n^{2}} be the order in which the unit (o,d)(o,d) flows are concatenated to produce xx so that

\displaystyle x=\begin{bmatrix}x^{(o_{1},d_{1})}\\ x^{(o_{2},d_{2})}\\ \vdots\\ x^{(o_{n^{2}},d_{n^{2}})}\end{bmatrix}.

The total flow on the links in the network when serving demand Λ\Lambda according to xx can then be written as:

y:=(o,d)V×VΛ(o,d)x(o,d)=BΛx\displaystyle y:=\sum_{(o,d)\in V\times V}\Lambda(o,d)x^{(o,d)}=B_{\Lambda}x

where B_Λ = vec(Λ) ⊗ I_m. Here vec(Λ) is an n²-dimensional row vector whose i-th entry is Λ(o_i, d_i), and ⊗ denotes the Kronecker product. Then, when Λ is routed according to x, the travel times on the links can be computed as

Qy+c=QBΛx+c,\displaystyle Qy+c=QB_{\Lambda}x+c,

which means that the total travel time can be written as

F^(x,Λ):=eEyefe(ye)\displaystyle\widehat{F}(x,\Lambda):=\sum_{e\in E}y_{e}f_{e}(y_{e}) =y(Qy+c)\displaystyle=y^{\top}\left(Qy+c\right)
=xBΛ(QBΛx+c)\displaystyle=x^{\top}B_{\Lambda}^{\top}\left(QB_{\Lambda}x+c\right)
=xBΛQBΛx+cBΛx.\displaystyle=x^{\top}B_{\Lambda}^{\top}QB_{\Lambda}x+c^{\top}B_{\Lambda}x. (11)
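The construction B_Λ = vec(Λ) ⊗ I_m and the quadratic form in (11) can be sanity-checked numerically; the small dimensions and random values in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n2, m = 4, 3                       # n^2 OD pairs and m edges (small illustrative sizes)
lam = rng.uniform(0.0, 1.0, n2)    # vec(Lambda)
x = rng.uniform(0.0, 1.0, n2 * m)  # concatenation of the x^{(o_i,d_i)}
q = rng.uniform(0.1, 1.0, m)
c = rng.uniform(0.0, 1.0, m)
Q = np.diag(q)

# B_Lambda = vec(Lambda) (x) I_m, so y = B_Lambda x sums the demand-weighted OD flows.
B = np.kron(lam, np.eye(m))        # shape (m, n2 * m)
y = B @ x
assert np.allclose(y, sum(lam[i] * x[i * m:(i + 1) * m] for i in range(n2)))

# Total travel time two ways: edgewise sum vs. the quadratic form in (11).
edgewise = float(np.sum(y * (q * y + c)))
quadratic = float(x @ B.T @ Q @ B @ x + c @ B @ x)
print(abs(edgewise - quadratic) < 1e-9)
```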

If we add a bit of 2\ell_{2} regularization, we obtain

F(x,Λ)\displaystyle F(x,\Lambda) =F^(x,Λ)+α2x22\displaystyle=\widehat{F}(x,\Lambda)+\frac{\alpha}{2}\left|\left|x\right|\right|_{2}^{2}
=xBΛQBΛx+cBΛx+α2x22\displaystyle=x^{\top}B_{\Lambda}^{\top}QB_{\Lambda}x+c^{\top}B_{\Lambda}x+\frac{\alpha}{2}\left|\left|x\right|\right|_{2}^{2}
=x(BΛQBΛ+α2I)x+cBΛx.\displaystyle=x^{\top}\left(B_{\Lambda}^{\top}QB_{\Lambda}+\frac{\alpha}{2}I\right)x+c^{\top}B_{\Lambda}x.

Recall that we use H(,Λ)H(\cdot,\Lambda) to denote the hessian of F(,Λ)F(\cdot,\Lambda) with respect to its first argument. We now make the following observations:

  • The Hessian of F(x,Λ)F(x,\Lambda) with respect to xx is defined for all xx and is equal to 2BΛQBΛ+αI2B_{\Lambda}^{\top}QB_{\Lambda}+\alpha I. Hence Assumption 3 is satisfied.

  • Since Q is a diagonal matrix with non-negative entries, Q ⪰ 0. This implies that B_{\Lambda}^{\top}QB_{\Lambda} ⪰ 0, and hence H(x,\Lambda) ⪰ \alpha I. This implies that F is \alpha-strongly convex, and hence Assumption 4 is satisfied.

  • Since H(x,\Lambda)=2B_{\Lambda}^{\top}QB_{\Lambda}+\alpha I, we have ||H(x,\Lambda)||_{op}\leq 2||Q||_{op}||B_{\Lambda}||_{op}^{2}+\alpha\leq 2||Q||_{op}||\text{vec}(\Lambda)||_{2}^{2}+\alpha. Defining \beta:=2n^{2}\left(\max_{e}q_{e}\right)\lambda_{\max}^{2}+\alpha, we see that ||H(x,\Lambda)||_{op}\leq\beta for all x\in\mathcal{X}, meaning that F is \beta-smooth, and thus Assumption 5 is satisfied.

  • By product rule,

    \displaystyle\mathcal{D}_{\Lambda(o_{i},d_{i})}\left[\nabla F\right](x,\Lambda)=\frac{\partial}{\partial\Lambda(o_{i},d_{i})}\left\{2\left(B_{\Lambda}^{\top}QB_{\Lambda}+\frac{\alpha}{2}I\right)x+B_{\Lambda}^{\top}c\right\}
    =\frac{\partial}{\partial\Lambda(o_{i},d_{i})}\left\{2B_{\Lambda}^{\top}Qy+\alpha x+B_{\Lambda}^{\top}c\right\}
    =2\left((e_{i}\otimes I)Qy+B_{\Lambda}^{\top}Qx^{(o_{i},d_{i})}\right)+e_{i}\otimes c

    where eie_{i} is the iith standard basis vector for n2\mathbb{R}^{n^{2}}. By triangle inequality we can then conclude that

    𝒟Λ(oi,di)[F](x,Λ)2\displaystyle\left|\left|\mathcal{D}_{\Lambda(o_{i},d_{i})}\left[\nabla F\right](x,\Lambda)\right|\right|_{2} 2Qy2+2BΛQx(o,d)2+c2\displaystyle\leq 2\left|\left|Qy\right|\right|_{2}+2\left|\left|B_{\Lambda}^{\top}Qx^{(o,d)}\right|\right|_{2}+\left|\left|c\right|\right|_{2}
    2Qopy2+2BΛopQopx(o,d)2+c2\displaystyle\leq 2\left|\left|Q\right|\right|_{op}\left|\left|y\right|\right|_{2}+2\left|\left|B_{\Lambda}\right|\right|_{op}\left|\left|Q\right|\right|_{op}\left|\left|x^{(o,d)}\right|\right|_{2}+\left|\left|c\right|\right|_{2}
    (a)2Qopy2+2nλmaxQopm+c2\displaystyle\overset{(a)}{\leq}2\left|\left|Q\right|\right|_{op}\left|\left|y\right|\right|_{2}+2n\lambda_{\max}\left|\left|Q\right|\right|_{op}\sqrt{m}+\left|\left|c\right|\right|_{2}
    (b)2λmaxQopn2m+2λmaxQopnm+c2.\displaystyle\overset{(b)}{\leq}2\lambda_{\max}\left|\left|Q\right|\right|_{op}n^{2}\sqrt{m}+2\lambda_{\max}\left|\left|Q\right|\right|_{op}n\sqrt{m}+\left|\left|c\right|\right|_{2}.

    Here (a) is due to the fact that x^{(o,d)} is a unit (o,d) flow, meaning ||x^{(o,d)}||_{\infty}\leq 1 and thus ||x^{(o,d)}||_{2}\leq\sqrt{m}, together with the fact that B_{\Lambda}=\text{vec}(\Lambda)\otimes I_{m} implies ||B_{\Lambda}||_{op}=||\text{vec}(\Lambda)||_{2}\leq n\lambda_{\max}. (b) is due to ||y||_{2}=||B_{\Lambda}x||_{2}\leq||B_{\Lambda}||_{op}||x||_{2}\leq n\lambda_{\max}\cdot n\sqrt{m}=n^{2}\sqrt{m}\lambda_{\max}, since x is the concatenation of n^{2} unit flows and thus ||x||_{2}\leq n\sqrt{m}.

    Hence Assumption 6 is satisfied with C=2λmaxQopmn(n+1)+c2C=2\lambda_{\max}\left|\left|Q\right|\right|_{op}\sqrt{m}n(n+1)+\left|\left|c\right|\right|_{2}.
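As a numerical check of strong convexity and smoothness for this model, the sketch below forms the Hessian 2B_Λ^⊤QB_Λ + αI for random illustrative data and verifies that its eigenvalues lie between α and a worst-case smoothness bound of the form 2n²(max_e q_e)λ_max² + α.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4                        # illustrative network sizes
n2 = n * n
lam_max = 2.0
lam = rng.uniform(0.0, lam_max, n2)  # vec(Lambda), entries bounded by lam_max
q = rng.uniform(0.1, 1.0, m)
alpha = 0.5

B = np.kron(lam, np.eye(m))
H = 2.0 * B.T @ np.diag(q) @ B + alpha * np.eye(n2 * m)  # Hessian of F(., Lambda)

eigs = np.linalg.eigvalsh(H)
beta = 2.0 * n2 * q.max() * lam_max ** 2 + alpha          # worst-case smoothness bound
print(eigs.min() >= alpha - 1e-9, eigs.max() <= beta + 1e-9)
```

The minimum eigenvalue is at least α (strong convexity, Assumption 4) and the maximum is at most β (smoothness, Assumption 5).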

4 Differentially Private Network Routing Optimization

Given the setup from Section 3, our objective is to design a request-level differentially private algorithm that returns a near optimal solution to (9). Since the true distribution 𝒬\mathcal{Q} of demand is unknown, we will design an algorithm for (10) and show that under the assumptions described in Section 3.2, the algorithm’s solution is also near optimal for (9).

Computing a near-optimal solution to (10) while maintaining differential privacy may seem like a daunting task, but it turns out that a single modification to a well-known optimization algorithm gives an accurate and differentially private solution.

We present a Private Projected Stochastic Gradient Descent algorithm, which is described in Algorithm 1. As the name suggests, Algorithm 1 is a modified version of stochastic gradient descent. The algorithm conducts a single pass over the historical data, using each data point to perform a noisy gradient step (see line 6). The key difference between Algorithm 1 and standard stochastic gradient descent is in line 11, where instead of returning the final gradient descent iterate, Algorithm 1 returns a noisy version of the last iterate. Algorithm 1 has the following privacy and performance guarantees.

Theorem 2 (Privacy Guarantee for Algorithm 1).

Algorithm 1 is (ϵ,δ)(\epsilon,\delta)-differentially private under request level adjacency defined in Definition 9.

Theorem 3 (Performance Guarantee for Algorithm 1).

If Λ1,,ΛNi.i.d𝒬\Lambda_{1},...,\Lambda_{N}\overset{\text{i.i.d}}{\sim}\mathcal{Q}, and xx^{*} is a solution to (9), then the output xalgx_{\text{alg}} of Algorithm 1 satisfies:

𝔼[xalgx2]\displaystyle\mathbb{E}\left[\left|\left|x_{\text{alg}}-x^{*}\right|\right|_{2}\right] 1NKCexp(β2π212α2)α+nmN(βexp(β2π212α2)αmin(1,2α)+CϵαT2ln(1.25δ))\displaystyle\leq\frac{1}{\sqrt{N}}\frac{KC\exp\left(\frac{\beta^{2}\pi^{2}}{12\alpha^{2}}\right)}{\alpha}+\frac{n\sqrt{m}}{N}\left(\frac{\beta\exp\left(\frac{\beta^{2}\pi^{2}}{12\alpha^{2}}\right)}{\alpha\min(1,2\alpha)}+\frac{C}{\epsilon\alpha T}\sqrt{2\ln\left(\frac{1.25}{\delta}\right)}\right)
O(1N)+O(1ϵNln1δ)\displaystyle\leq O\left(\frac{1}{\sqrt{N}}\right)+O\left(\frac{1}{\epsilon N}\sqrt{\ln\frac{1}{\delta}}\right)

In particular, xalgx_{\text{alg}} is (ϵ,δ)(\epsilon,\delta)-differentially private and converges to xx^{*} as NN\rightarrow\infty, meaning that privacy and asymptotic optimality are simultaneously achieved.

See Appendix B and Appendix C for proofs of Theorem 2 and Theorem 3 respectively.

1 Input: Historical demand data Λ1,,ΛN\Lambda_{1},...,\Lambda_{N}, Desired privacy level (ϵ,δ)(\epsilon,\delta);
2 Output: Unit network flow x𝒳x\in\mathcal{X};
3 Initialize x0𝒳x_{0}\in\mathcal{X} arbitrarily ;
4 for 1kN1\leq k\leq N do
5       ηk1min(1αk,min(1,2α)β)\eta_{k-1}\leftarrow\min\left(\frac{1}{\alpha k},\frac{\min(1,2\alpha)}{\beta}\right);
6       x_{k}\leftarrow\Pi_{\mathcal{X}}\left(x_{k-1}-\eta_{k-1}\nabla_{x}F(x_{k-1},\Lambda_{k})\right);
7      
8sCTmin(min(1,2α)β,1αN)s\leftarrow\frac{C}{T}\min\left(\frac{\min(1,2\alpha)}{\beta},\frac{1}{\alpha N}\right);
9 σ22s2ϵ2ln(1.25δ)\sigma^{2}\leftarrow 2\frac{s^{2}}{\epsilon^{2}}\ln\left(\frac{1.25}{\delta}\right);
10 Z𝒩(0,σ2I)Z\sim\mathcal{N}\left(0,\sigma^{2}I\right) ;
11 xalgΠ𝒳(xN+Z)x_{\text{alg}}\leftarrow\Pi_{\mathcal{X}}\left(x_{N}+Z\right);
12 Return xalgx_{\text{alg}};
Algorithm 1 Private Projected Stochastic Gradient Descent
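To make the listing concrete, a minimal Python sketch of Algorithm 1 is given below. The gradient oracle, projection operator, and problem constants (α, β, C, T) are supplied by the caller; the toy usage that follows uses a simple quadratic objective and a box projection, not the unit-flow polytope of the paper, purely to exercise the code.

```python
import numpy as np

def private_projected_sgd(data, grad_F, project, alpha, beta, C, T, eps, delta, x0, rng):
    """Sketch of Algorithm 1: a single pass of projected SGD over the demand
    samples, followed by Gaussian-mechanism noise on the final iterate."""
    N = len(data)
    x = x0
    for k, lam in enumerate(data, start=1):
        eta = min(1.0 / (alpha * k), min(1.0, 2.0 * alpha) / beta)  # stepsize (line 5)
        x = project(x - eta * grad_F(x, lam))                        # noise-free descent step
    # The sensitivity of the final iterate shrinks like 1/N, so the noise scale does too.
    s = (C / T) * min(min(1.0, 2.0 * alpha) / beta, 1.0 / (alpha * N))
    sigma = (s / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return project(x + rng.normal(0.0, sigma, size=np.shape(x)))

# Toy usage: F(x, lam) = 0.5 * ||x - lam||^2, projection onto the box [0, 1]^5.
rng = np.random.default_rng(3)
data = rng.uniform(0.2, 0.4, size=(200, 5))
out = private_projected_sgd(
    data,
    grad_F=lambda x, lam: x - lam,
    project=lambda v: np.clip(v, 0.0, 1.0),
    alpha=1.0, beta=1.0, C=1.0, T=60.0,
    eps=0.5, delta=0.1,
    x0=np.zeros(5), rng=rng)
print(out)  # close to the mean demand, roughly 0.3 in every coordinate
```

With this stepsize schedule and α = β = 1, the iterates track the running mean of the samples, and the added Gaussian noise has standard deviation on the order of 1/N.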

4.1 Discussion

Carefully adding noise to specific parts of existing algorithms is a principled approach for developing differentially private algorithms [DMNS06, FMTT18, FKT20]. The main challenge in such an approach is determining a) where and b) how much noise to add. Suppose the goal is to (approximately) compute a query function f(L)f(L) on a data set LL in a differentially private way. The latter question can be addressed by measuring the sensitivity of the desired query function.

Definition 12 (2\ell_{2} sensitivity).

Consider a function f:df:\mathcal{L}\rightarrow\mathbb{R}^{d} which maps data sets to real vectors. For a given adjacency relation Adj, the 2\ell_{2} sensitivity of ff, denoted sAdj(f)s_{\text{Adj}}(f), is the largest achievable difference in function value between adjacent data sets. Namely,

sAdj(f):=maxL1,L2Adj(L1,L2)=1f(L1)f(L2)2.\displaystyle s_{\text{Adj}}(f):=\max_{\begin{subarray}{c}L_{1},L_{2}\in\mathcal{L}\\ \text{Adj}(L_{1},L_{2})=1\end{subarray}}\left|\left|f(L_{1})-f(L_{2})\right|\right|_{2}.

Once the sensitivity of the query function is known, the required noise distribution can be determined using the Gaussian mechanism as described in Theorem 4.

Theorem 4 (From [DR14]).

Suppose f:𝒟pf:\mathcal{D}\rightarrow\mathbb{R}^{p} maps datasets to real vectors. If sAdjs_{\text{Adj}} is the 2\ell_{2} sensitivity of ff, then f^(D):=f(D)+Z\widehat{f}(D):=f(D)+Z where Z𝒩(0,2sAdj2ϵ2ln(1.25δ)Ip)Z\sim\mathcal{N}\left(0,2\frac{s_{\text{Adj}}^{2}}{\epsilon^{2}}\ln\left(\frac{1.25}{\delta}\right)I_{p}\right) is (ϵ,δ)(\epsilon,\delta)-differentially private with respect to the adjacency relation Adj.

Calculating the sensitivity of simple query functions (e.g., counting, voting, selecting the maximum value) is relatively easy, making the Gaussian mechanism straightforward to apply. However, for more complicated functions, noise calibration becomes more involved.
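The Gaussian mechanism of Theorem 4 is short to implement. The sketch below privately releases the mean of N bounded values, under the assumption that adjacent data sets differ by changing one value in [0,1], so the ℓ₂ sensitivity is 1/N; all numbers are illustrative.

```python
import numpy as np

def gaussian_mechanism(value, s_adj, eps, delta, rng):
    """Release f(D) = value with (eps, delta)-DP given its l2 sensitivity s_adj:
    add N(0, 2 * s_adj^2 * ln(1.25 / delta) / eps^2) noise to each coordinate."""
    sigma = (s_adj / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: the mean of N values in [0, 1] moves by at most 1/N when one value
# changes, so s_adj = 1/N under that adjacency relation.
rng = np.random.default_rng(4)
data = rng.uniform(0.0, 1.0, 1000)
private_mean = gaussian_mechanism(data.mean(), s_adj=1.0 / len(data),
                                  eps=0.5, delta=1e-3, rng=rng)
print(float(private_mean))
```

Note how the noise scale is proportional to the sensitivity: queries whose answers barely depend on any one record can be released almost exactly.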

Algorithm 1 is an application of the Gaussian mechanism. Moreover, Theorem 3 shows that the suboptimality of Algorithm 1 is O(1N)O\left(\frac{1}{\sqrt{N}}\right). The asymptotic optimality of Algorithm 1 comes from the fact that the 2\ell_{2} sensitivity of the final gradient descent iterate is actually converging to zero as NN approaches infinity. This fact enables us to add less noise as NN\rightarrow\infty. Indeed, the Gaussian noise added to the final gradient descent iterate in Algorithm 1 has standard deviation which is O(1N)O\left(\frac{1}{N}\right).

In the remainder of this section, we sketch some of the mathematical ideas behind the perhaps non-intuitive result that the sensitivity of gradient descent converges to zero as the number of data points increases. For simplicity of exposition we will a) use scalar notation in place of vector notation, and b) consider the simpler case of unconstrained gradient descent, which removes the need to perform projections. As a reminder, the full proof can be found in Appendix B.

Given a mobility data set L, we use x_t(L) to denote the t-th iterate of Algorithm 1 when using data set L. It suffices to show that |dx_N/dΛ_t| converges to zero for every t. This is because the sensitivity is obtained by integrating the derivative:

xN(L1)xN(L2)2=L1L2dxNdL𝑑L2.\displaystyle\left|\left|x_{N}(L_{1})-x_{N}(L_{2})\right|\right|_{2}=\left|\left|\int_{L_{1}}^{L_{2}}\frac{dx_{N}}{dL}dL\right|\right|_{2}.

Because L1,L2L_{1},L_{2} are adjacent, the distance between them is finite, and thus the above integral will converge to zero if its integrand converges to zero.

With this in mind, by chain rule, we can write

dxNdΛt\displaystyle\frac{dx_{N}}{d\Lambda_{t}} =dxtdΛtdxt+1dxtdxt+2dxt+1dxNdxN1.\displaystyle=\frac{dx_{t}}{d\Lambda_{t}}\frac{dx_{t+1}}{dx_{t}}\frac{dx_{t+2}}{dx_{t+1}}...\frac{dx_{N}}{dx_{N-1}}.

Next, by using properties of smooth and strongly convex functions, we show that |dxt+1dxt|1θ\left|\frac{dx_{t+1}}{dx_{t}}\right|\leq 1-\theta for some positive constant θ\theta. This result implies

\displaystyle\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq\left|\frac{dx_{t}}{d\Lambda_{t}}\right|(1-\theta)^{N-t}.

Next, recalling that x_t = x_{t−1} − η_t∇F(x_{t−1}, Λ_t), note that x_{t−1} is computed from the first t−1 gradient steps, which only depend on the first t−1 data points Λ_1, ..., Λ_{t−1}. Hence dx_{t−1}/dΛ_t = 0. From this we see that

dxtdΛt\displaystyle\frac{dx_{t}}{d\Lambda_{t}} =ddΛt(xt1ηtF(xt1,Λt))\displaystyle=\frac{d}{d\Lambda_{t}}\left(x_{t-1}-\eta_{t}\nabla F(x_{t-1},\Lambda_{t})\right)
=ηtddΛtF(xt1,Λt).\displaystyle=-\eta_{t}\frac{d}{d\Lambda_{t}}\nabla F(x_{t-1},\Lambda_{t}).

By using Assumption 6 we have the following inequality:

|dxtdΛt|=|ηtddΛtF(xt1,Λt)|ηtC.\displaystyle\left|\frac{dx_{t}}{d\Lambda_{t}}\right|=\left|\eta_{t}\frac{d}{d\Lambda_{t}}\nabla F(x_{t-1},\Lambda_{t})\right|\leq\eta_{t}C.

Putting everything together, we have

\displaystyle\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq C\eta_{t}(1-\theta)^{N-t}.

The choice of stepsize in Algorithm 1 ensures three things: a) ηt2\eta_{t}\leq 2 for every tt, b) ηt\eta_{t} is non-increasing, and c) ηt0\eta_{t}\rightarrow 0 as tt\rightarrow\infty. Given these facts, there are two cases to consider for |dxNdΛt|\left|\frac{dx_{N}}{d\Lambda_{t}}\right|.

  1. Case 1:

    t ≤ N/2. In this case, we can upper bound η_t ≤ 2 and (1−θ)^{N−t} ≤ (1−θ)^{N/2}. Hence |dx_N/dΛ_t| ≤ 2C(1−θ)^{N/2}. Since 1−θ < 1, this converges to zero as N → ∞.

  2. Case 2:

    tN/2t\geq N/2. In this case, we simply upper bound (1θ)Nt1(1-\theta)^{N-t}\leq 1 and ηtηN/2\eta_{t}\leq\eta_{N/2} which gives |dxNdΛt|CηN/2\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq C\eta_{N/2} which converges to zero as NN\rightarrow\infty.

Finally, this shows that

\displaystyle\max_{t}\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq\max\left(C\eta_{N/2},2C(1-\theta)^{N/2}\right),

and since both terms in the maximum converge to zero, we have lim_{N→∞} max_t |dx_N/dΛ_t| = 0.
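The two-case argument above can be checked numerically: with a stepsize schedule of the form η_t = min(1/(αt), const), the bound C η_t (1−θ)^{N−t}, maximized over t, vanishes as N grows. The constants C, α, θ below are illustrative.

```python
import numpy as np

# Numerical check that max_t C * eta_t * (1 - theta)^(N - t) -> 0 as N grows,
# with the schedule eta_t = min(1/(alpha*t), 1) and illustrative constants.
C, alpha, theta = 1.0, 1.0, 0.1

def sensitivity_bound(N):
    t = np.arange(1, N + 1)
    eta = np.minimum(1.0 / (alpha * t), 1.0)
    return float(np.max(C * eta * (1.0 - theta) ** (N - t)))

for N in [10, 100, 1000, 10000]:
    print(N, sensitivity_bound(N))
```

Early data points are forgotten geometrically through the (1−θ)^{N−t} factor, while recent ones enter with a vanishing stepsize, so the bound behaves like max((1−θ)^{N/2}, 1/N).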

5 Experiments

Differentially private mechanisms add noise to provide a principled privacy guarantee for individual-level user data. One immediate question is the degree to which the added noise degrades the quality of the obtained solution. In the previous section, we addressed this question theoretically in Theorem 3 by showing that Algorithm 1 is both differentially private and asymptotically optimal. In this section, we present empirical studies of privacy-performance trade-offs by comparing our algorithm's performance to that of a non-private network routing approach. To this end, we simulate a transportation network to evaluate the performance of our algorithm and the non-private optimal solution to the network flow problem (10).

We describe the dataset used for the experiments in Section 5.1, and the algorithms used in the experiments in Section 5.2. In Section 5.3, we evaluate the practical performance of our algorithm for different values of N; in particular, we study the effect of the number of samples and quantify the loss in system performance incurred by introducing our privacy-preserving algorithm. Finally, in Section 5.4, we study the sensitivity of our algorithm to the simulation parameters: we examine the convergence of the routing policy with increasing data under different demand levels, edge latency functions, and magnitudes of the ℓ₂ regularization introduced to make the objective strongly convex.

5.1 Data Set

We use data for the Sioux Falls network, which is available in the Transportation Network Test Problem (TNTP) dataset [Tra16]. This network has 24 nodes, 76 edges, and 528 OD pairs (see Figure 1 for an overview of the network topology). The distribution of mean hourly demand across different OD pairs is shown in Figure 1. The mean OD demand is 682 vehicles/hour. Furthermore, for more context on the scale of the network, the travel time on edges ranges from 2 to 10 minutes. Trip data is generated each hour, i.e., T=60, from a Poisson distribution with a mean value that is given by the data. Our objective is to learn a routing policy for these trips that minimizes the total travel costs for all users. To model congestion, we use the link latency model described in Section 3.4. In particular, for each e∈E, we estimate the free flow latency c_e directly from the data set. The sensitivity of the latency function to the traffic volume on the link, denoted by q_e, is chosen such that the travel time on the link is doubled when the link flow equals the link capacity. In later experiments, we will change this factor to study the sensitivity of our algorithm.

5.2 Algorithms

Throughout Sections 5.3 and 5.4 we compare the performance of two different algorithms: a Baseline algorithm and Algorithm 1, which we describe below.

Baseline: This algorithm computes the minimum travel time solution to (10) for the Sioux Fall model described in Section 3.4. Recall that in the Section 3.4 model, (11) describes the travel time incurred when serving demand Λ\Lambda with a routing policy xx. Given a data set Λ1,,ΛN\Lambda_{1},...,\Lambda_{N}, it computes the solution to the following minimization problem: minx𝒳1Ni=1NxBΛiQBΛix+cBΛix\min_{x\in\mathcal{X}}\frac{1}{N}\sum_{i=1}^{N}x^{\top}B_{\Lambda_{i}}^{\top}QB_{\Lambda_{i}}x+c^{\top}B_{\Lambda_{i}}x.

Algorithm 1: The travel time function described in (11) is not strongly convex because BΛQBΛB_{\Lambda}^{\top}QB_{\Lambda} is rank deficient. In order to satisfy Assumption 4, we introduce an 2\ell_{2} regularization. Namely, given a data set Λ1,,ΛN\Lambda_{1},...,\Lambda_{N}, we apply Algorithm 1 to the following minimization problem: minx𝒳αx22+1Ni=1NxBΛiQBΛix+cBΛix\min_{x\in\mathcal{X}}\alpha\left|\left|x\right|\right|_{2}^{2}+\frac{1}{N}\sum_{i=1}^{N}x^{\top}B_{\Lambda_{i}}^{\top}QB_{\Lambda_{i}}x+c^{\top}B_{\Lambda_{i}}x.

For the Sioux Falls network, the parameters for implementing Algorithm 1 are set as follows. The smoothness parameter β is set to the largest eigenvalue of B_Λ^⊤QB_Λ, which is equal to 2.08×10⁷. We set C=β and the regularization parameter α=10⁴. It is easy to check that these values satisfy the assumptions described in Section 3.4. Note that several different values of α could have been used to convexify the latency function. However, our choice is governed by two factors. A small value of α ensures that the regularized objective is a good estimate of the true objective, which is desirable. However, smaller values of α lead to a larger condition number β/α, resulting in slower convergence and necessitating more data to achieve similar performance. Thus, the particular value α=10⁴ balances both factors for our problem instance. In Section 5.4, we present a sensitivity analysis with different values of α.

Figure 1: Description of the Sioux Falls dataset: (a) network topology, (b) travel time on edges, (c) mean demand between OD pairs.

5.3 Performance of Algorithm 1

In this subsection, we study the convergence of the routing policy with each step of the gradient descent performed by our algorithm. Since the impact of a routing policy is directly reflected in the total travel time, we plot the travel cost induced by the learned routing policy as a function of the iterates. Recall that the number of iterations for a data set with N data points is N according to Algorithm 1. In our experiments we evaluate the cost of a routing policy x as F(x, (1/N)∑_{i=1}^{N}Λ_i) instead of using the sample average (1/N)∑_{i=1}^{N}F(x, Λ_i). This approximation is done solely to improve the run-time of our experiment (by up to 50X) and introduces less than 10⁻⁴% error in the evaluation of the system costs. Similarly, the optimal solution is also computed by minimizing the objective function F(x, (1/N)∑_{i=1}^{N}Λ_i) instead of the term described in Equation 10, achieving a computational speed up of 30X. Evaluating the cost of this solution using the average-demand approximation results in errors of less than 10⁻⁷%. Thus, in these experiments, we define the optimal costs as the approximate costs obtained through this procedure. Further details justifying these approximations are presented in Appendix D.

For our first set of experiments, we compare the objective values obtained by Algorithm 1 and the Baseline as a function of sample size N. In Figure 2, we plot the ratio of Algorithm 1's cost to Baseline's cost over the course of iterations for different values of N. For this set of experiments, we set the privacy parameters to ϵ=0.1 and δ=0.1. Note that Baseline's cost is fixed for a given N and is computed offline to serve as a benchmark. For a given N, we only have N iterations since each data point is used only once in Algorithm 1 to maintain privacy. For all three experiments (N=10, N=25, N=50), the cost decreases monotonically with additional iterations. It is therefore not surprising that the final costs for the N=50 case are the lowest, as we expect the routing policy learned with 50 data points to be better than the routing policy learned from 10 data points. It is, however, very interesting that even with a random routing policy initialization, our algorithm finds solutions that are just around 2% away from the optimal policy. We suspect that this 2% gap is due to the fact that the two algorithms have slightly different objective functions: Algorithm 1 has an ℓ₂ regularizer in the objective, but Baseline does not. Our results therefore show that although convergence is guaranteed only in the limit N→∞, we can obtain practically useful solutions with a relatively small number of data points.

Figure 2: Convergence dynamics for different values of N.

In the next set of experiments, we study the effect of different privacy parameters on total travel time. To this end, we compare the costs of the pre-noise and post-noise solutions x_N and x_alg from Algorithm 1. We conduct this comparison for ϵ∈{0.01, 0.1, 0.5} and δ∈{0.1, 0.5}. Table 1 presents the percentage increase in total travel time due to the addition of privacy noise. The results indicate that the price of privacy, i.e., the increase in total travel time due to the introduction of differential privacy noise, is less than 7.8×10⁻²% in the worst case. In fact, for more commonly used privacy parameters of ϵ=0.1 and δ=0.1, the cost of privacy is even smaller. One reason for this low cost of privacy is the high demand in the traffic network. From Figure 1(c), we observe that every OD pair typically has a few hundred trips. With over 500 OD pairs, tens of thousands of vehicles in the network thus contribute trip information to every data point. With such a large number of vehicles, the noise required to protect the identity of one vehicle is not too high.

ϵ\epsilon δ\delta Cost (% increase)
0.01 0.1 7.83×1027.83\times 10^{-2}
0.01 0.5 3.97×1033.97\times 10^{-3}
0.1 0.1 9.06×1039.06\times 10^{-3}
0.1 0.5 5.96×1035.96\times 10^{-3}
0.5 0.1 2.44×1032.44\times 10^{-3}
0.5 0.5 2.05×1032.05\times 10^{-3}
Table 1: Change in routing costs due to incorporating privacy.

5.4 Sensitivity

Figure 3: Sensitivity of the performance of Algorithm 1: (a) regularization parameter α, (b) slope of latency function, (c) traffic demand.

We now study the sensitivity of our algorithm to input parameters. For these experiments, we fix the number of data points to N=50N=50, since prior results suggest that most of the cost benefits are obtained with 50 data points. First, we study the effect of the regularizer term by setting α{102,103,104}\alpha\in\{10^{2},10^{3},10^{4}\} and plot the convergence of the normalized costs in Figure 3(a). We see that for larger values of α\alpha, the costs decrease faster. This makes sense because larger α\alpha results in a lower condition number βα\frac{\beta}{\alpha}, which leads to faster convergence. We also note that the cost ratio does not go to 1 because Algorithm 1 is minimizing a regularized objective, while Baseline has no regularization.

In Figure 3(b), we compare three scenarios with varying slope q_e of the latency function. Recall that our previous experiments set the slope q_e on each edge such that the travel time on the link doubles when the traffic equals the link capacity. This setting corresponds to a sensitivity factor of 2. We consider two more cases, where the travel time at capacity flow is 1.5 times and 5 times the free flow latency. Note that changing the latency sensitivity factor changes the matrix Q. Thus, for each of these experiments, we recompute β and C, setting each to the largest eigenvalue of the appropriate B_Λ^⊤QB_Λ. The optimal cost also varies for all three cases and is recomputed. The value of α is fixed at 10⁴. We observe that when the latency function is steeper, i.e., the sensitivity factor is higher, the algorithm takes more iterations to reduce the costs, but eventually ends up with the lowest costs. This is because a larger β leads to a larger condition number β/α, which makes convergence slow. However, a larger β also means that the ℓ₂ regularizer is a smaller proportion of the total cost, so the objectives of Algorithm 1 and Baseline become more similar, which is what we believe causes the cost ratio to improve as the latency function becomes steeper.

Finally, we present the sensitivity of our algorithm to varying traffic demand in Figure 3(c). In these experiments, we compare the nominal setting, where the demand equals the mean demand (a scale factor of 1), against two alternatives: a low-demand case, where the mean traffic is 0.5 times the nominal traffic, and a high-demand case, where the mean traffic is 1.5 times the nominal traffic. Again, as the demand changes, the matrix B changes, and we recompute \beta and C as before. We observe that for the same value of \alpha, higher demand leads to faster convergence and lower costs. This is because higher demand increases the travel time, making the \ell_{2} regularizer a smaller proportion of Algorithm 1's objective; as the objective functions become more similar, the cost ratio moves closer to 1.

6 Conclusions

In this paper, we studied the problem of learning network routing policies from sensitive user data. In particular, we considered the setting of a transportation network, where we want to learn and share a routing policy that does not reveal too much information about the individual trips that contributed to learning it. We presented a new approach for learning privacy-preserving routing policies by solving a reformulated network flow problem with a differentially private variant of stochastic gradient descent. We proved that our algorithm is asymptotically optimal, meaning that the cost of the routing policy it produces converges to the optimal non-private cost as the number of data points goes to infinity. Finally, our simulations on the Sioux Falls road network suggest that for realistic travel demands, we can learn differentially private routing policies that incur only 2% suboptimality in terms of total travel time.

There are several interesting directions for future work. First, because differentially private algorithms cannot be overly sensitive to single data points, they are naturally robust and may be useful for tracking non-stationary demand distributions, as opposed to the stationary demand models studied here. Second, we studied request-level differential privacy, where the goal is to occlude the influence of a single trip on the algorithm's output; another practical and important notion is user-level differential privacy, where the goal is to occlude the influence of all trips belonging to the same person. User-level privacy is harder to achieve but is important in practice. Finally, generalizing the results to non-smooth objective functions would expand the class of models to which this technique can be applied.

References

  • [AHFI+18] Khalil Al-Hussaeni, Benjamin CM Fung, Farkhund Iqbal, Gaby G Dagher, and Eun G Park. Safepath: Differentially-private publishing of passenger trajectories in transportation systems. Computer Networks, 143:126–139, 2018.
  • [BS03] Alastair R Beresford and Frank Stajano. Location privacy in pervasive computing. IEEE Pervasive computing, 2(1):46–55, 2003.
  • [CML11] Chi-Yin Chow, Mohamed F Mokbel, and Xuan Liu. Spatial cloaking for anonymous location-based services in mobile peer-to-peer environments. GeoInformatica, 15(2):351–380, 2011.
  • [CTW+21] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 2633–2650. USENIX Association, 2021.
  • [DKBS15] Roy Dong, Walid Krichene, Alexandre M. Bayen, and S. Shankar Sastry. Differential privacy of populations in routing games. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 2798–2803, 2015.
  • [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, volume 3876, pages 265–284. Springer, 2006.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • [Dwo08] Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pages 1–19. Springer, 2008.
  • [FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 439–449. ACM, 2020.
  • [FMTT18] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 521–532. IEEE Computer Society, 2018.
  • [GFCA14] Moein Ghasemzadeh, Benjamin CM Fung, Rui Chen, and Anjali Awasthi. Anonymizing trajectory data for passenger flow analysis. Transportation research part C: emerging technologies, 39:63–79, 2014.
  • [GLTY18] Mehmet Emre Gursoy, Ling Liu, Stacey Truex, and Lei Yu. Differentially private and utility preserving publication of trajectory data. IEEE Transactions on Mobile Computing, 18(10):2315–2329, 2018.
  • [GMW87] Oded Goldreich, Silvio Micali, and Avi Wigderson. How to play any mental game or A completeness theorem for protocols with honest majority. In Alfred V. Aho, editor, STOC 1987, New York, New York, USA, pages 218–229. ACM, 1987.
  • [GZFS15] Yanmin Gong, Chi Zhang, Yuguang Fang, and Jinyuan Sun. Protecting location privacy for task allocation in ad hoc mobile cloud computing. IEEE Transactions on Emerging Topics in Computing, 6(1):110–121, 2015.
  • [HC20] Brian Yueshuai He and Joseph YJ Chow. Optimal privacy control for transport network data sharing. Transportation Research Part C: Emerging Technologies, 113, 2020.
  • [HTP17] Shuo Han, Ufuk Topcu, and George J. Pappas. Differentially private distributed constrained optimization. IEEE Transactions on Automatic Control, 62(1):50–64, 2017.
  • [LYH20] Yang Li, Dasen Yang, and Xianbiao Hu. A differential privacy-based privacy-preserving data publishing algorithm for transit smart card data. Transportation Research Part C: Emerging Technologies, 115:102634, 2020.
  • [Pan14] Vijay Pandurangan. Riding with the stars: Passenger privacy in the nyc taxicab dataset. Available Online, 2014.
  • [Swe02] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
  • [Tra16] Transportation networks for research. github.com/bstabler/TransportationNetworks, 2016. Accessed January 20, 2021.
  • [Tre11] Luca Trevisan. Lecture notes on optimization and algorithmic paradigms (lecture 11). Available online, 2011.
  • [WHL+18] Zhibo Wang, Jiahui Hu, Ruizhao Lv, Jian Wei, Qian Wang, Dejun Yang, and Hairong Qi. Personalized privacy-preserving task allocation for mobile crowdsensing. IEEE Transactions on Mobile Computing, 18(6):1330–1341, 2018.
  • [WLK+17] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey F. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1307–1322. ACM, 2017.
  • [YLL+19] Ke Yan, Guoming Lu, Guangchun Luo, Xu Zheng, Ling Tian, and Akshita Maradapu Vera Venkata Sai. Location privacy-aware task bidding and assignment for mobile crowd-sensing. IEEE Access, 7:131929–131943, 2019.

Appendix A Notation

The terms and concepts defined and used throughout the paper are all described here for the sake of convenience.

  • G: Graph representation of the network. (Definition 1)
  • V: The vertex set of the network G. (Definition 1)
  • E: The edge set of the network G. (Definition 1)
  • n: The number of vertices in G, i.e., n=\left|V\right|. (Definition 1)
  • m: The number of edges in G, i.e., m=\left|E\right|. (Definition 1)
  • \mathcal{T}: Operation period \mathcal{T}=\left[t_{\text{start}},t_{\text{end}}\right], the time of day over which the network operator optimizes its decisions. (Definition 2)
  • T: Number of minutes in the operation period. (Definition 2)
  • \Lambda: Demand matrix. \Lambda(o,d) specifies the rate at which requests from o\in V to d\in V appear in the network. (Definition 3)
  • \mathcal{Q}: The distribution of \Lambda. (Defined in Section 3.2)
  • L: Dataset L:=(\Lambda_{1},...,\Lambda_{N}) containing N days of historical request data for the operation period \mathcal{T}. (Defined in Section 3.2)
  • x^{(o,d)}: Unit (o,d) flow. A flow that routes exactly 1 unit of flow from o to d through the network. (Definition 10)
  • x: Unit network flow. Specifies a unit (o,d) flow for all pairs of vertices (o,d) in the network. (Definition 11)
  • \mathcal{X}: The set of all unit network flows. It is also the feasible set for optimization problems (8), (9), and (10). (Definition 11)
  • x_{k}: A unit network flow and the kth gradient descent iterate from Algorithm 1. x_{k}(L) is used to show the explicit dependence of x_{k} on the historical training data L. (Algorithm 1)
  • f_{e}: Link latency function for the edge e\in E. If x is the total flow on e, then f_{e}(x) is the average travel time through the edge. (Definition 4)
  • F: Total travel time function. F(x,\Lambda) is the total travel time that requests in \Lambda experience when they are routed according to x. (Defined in (7))
  • (\epsilon,\delta): Differential privacy parameters. As \epsilon and \delta get smaller, the privacy guarantees offered by differential privacy get stronger. (Definition 8)
  • K: Upper bound on the standard deviation of the gradient of F. (Assumption 2)
  • H: Hessian of F with respect to x. Concretely, H(x,\Lambda) is the Hessian of F with respect to x evaluated at (x,\Lambda).
  • \alpha: Strong convexity parameter. (Assumption 4)
  • \beta: Smoothness parameter. (Assumption 5)
  • C: Upper bound on \left|\left|\mathcal{D}_{\Lambda}\left[\nabla F\right](x,\Lambda)\right|\right|_{2}. (Assumption 6)

Appendix B Privacy Analysis (Proof of Theorem  2)

Recall the privacy guarantee for the Gaussian Mechanism from Theorem 4. To show that Algorithm 1 is (\epsilon,\delta)-differentially private, it suffices to show the following lemma:

Lemma 1.

For every 1\leq t\leq N, any value of L:=(\Lambda_{1},...,\Lambda_{N}), and any (o,d)\in V\times V, we have \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq\min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right).

Lemma 1 is sufficient to prove privacy, because it implies that the \ell_{2} sensitivity of x_{N} is at most \min\left(\frac{C\min(1,2\alpha)}{\beta T},\frac{C}{\alpha TN}\right). To see why this is true, let L_{1}=(\Lambda_{1},...,\Lambda^{\prime}_{t},...,\Lambda_{N}) and L_{2}=(\Lambda_{1},...,\Lambda^{\prime\prime}_{t},...,\Lambda_{N}) be request-level-adjacent data sets that differ only on the tth day. Furthermore, let (o^{\prime},d^{\prime}) be the request on which \Lambda_{t}^{\prime} and \Lambda_{t}^{\prime\prime} differ, and let x_{N}(L_{1}) and x_{N}(L_{2}) be the final iterates of gradient descent obtained using the data sets L_{1} and L_{2}, respectively. By the fundamental theorem of calculus,

x_{N}(L_{2})-x_{N}(L_{1}) = \int_{L_{1}}^{L_{2}}\mathcal{D}_{\Lambda_{t}}[x_{N}](L)\;dL
= \int_{\Lambda^{\prime}_{t}}^{\Lambda^{\prime\prime}_{t}}\mathcal{D}_{\Lambda_{t}}[x_{N}](\Lambda_{1},...,\Lambda_{t-1},\Lambda,\Lambda_{t+1},...,\Lambda_{N})\;d\Lambda
\implies \left|\left|x_{N}(L_{2})-x_{N}(L_{1})\right|\right|_{2} = \left|\left|\int_{\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})}^{\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})}\mathcal{D}_{\Lambda_{t}(o^{\prime},d^{\prime})}[x_{N}](\Lambda_{1},...,\Lambda_{t-1},\Lambda,\Lambda_{t+1},...,\Lambda_{N})\;d\Lambda(o^{\prime},d^{\prime})\right|\right|_{2}
\leq \int_{\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})}^{\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})}\left|\left|\mathcal{D}_{\Lambda_{t}(o^{\prime},d^{\prime})}[x_{N}](\Lambda_{1},...,\Lambda_{t-1},\Lambda,\Lambda_{t+1},...,\Lambda_{N})\right|\right|_{2}\;d\Lambda(o^{\prime},d^{\prime})
\overset{(a)}{\leq} \int_{\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})}^{\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})}\min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right)d\Lambda(o^{\prime},d^{\prime})
\leq \min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right)\left|\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})-\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})\right|
\overset{(b)}{\leq} \min\left(\frac{C\min(1,2\alpha)}{\beta T},\frac{C}{\alpha TN}\right).

Here, (a) is due to Lemma 1, and (b) holds because request-level adjacency of L_{1} and L_{2} implies \left|\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})-\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})\right|\leq T^{-1}.

Thus, to establish (\epsilon,\delta)-differential privacy of Algorithm 1, all that remains is to prove Lemma 1.
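To make the role of the sensitivity bound concrete, the sketch below shows how the Gaussian mechanism (Theorem 4) would be calibrated to the \ell_{2} sensitivity derived above. The parameter values are toy values, and the \sqrt{2\ln(1.25/\delta)}/\epsilon scaling is the standard Gaussian mechanism calibration (cf. [DR14]), assumed here for illustration:

```python
import math

# Sensitivity bound from the argument above: min(C*min(1,2a)/(b*T), C/(a*T*N)).
def sensitivity_bound(C, alpha, beta, T, N):
    return min(C * min(1.0, 2 * alpha) / (beta * T), C / (alpha * T * N))

# Standard (eps, delta)-DP Gaussian mechanism calibration (assumed, cf. [DR14]):
# release x_N + Z with Z ~ N(0, sigma^2 I), sigma proportional to the sensitivity.
def gaussian_mechanism_sigma(l2_sensitivity, eps, delta):
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

sens = sensitivity_bound(C=1.0, alpha=0.5, beta=10.0, T=60, N=50)
sigma = gaussian_mechanism_sigma(sens, eps=1.0, delta=1e-5)
# More training days (larger N) lower the sensitivity, so less noise is needed.
assert sensitivity_bound(1.0, 0.5, 10.0, 60, 100) < sens
```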

Proof of Lemma 1.

Define L:=(\Lambda_{1},...,\Lambda_{N}). By Assumptions 3, 4, and 5, the functions F(\cdot,\Lambda_{1}),...,F(\cdot,\Lambda_{N}) are all twice differentiable, \alpha-strongly convex, and \beta-smooth. For each t\leq N and any (o,d)\in V\times V, the chain rule gives

\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)=\left(\prod_{k=t+1}^{N-1}\mathcal{D}_{x_{k}}[x_{k+1}](L)\right)\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L).

Since x_{k+1}=x_{k}-\eta_{k}\nabla_{x}F(x_{k},\Lambda_{k}), differentiating both sides with respect to x_{k} gives

\mathcal{D}_{x_{k}}[x_{k+1}](L) = I-\eta_{k}H(x_{k},\Lambda_{k}),

where H(\cdot,\Lambda_{k}) is the Hessian of F(\cdot,\Lambda_{k}). Since F(\cdot,\Lambda_{k}) is \alpha-strongly convex, we know H(x_{k},\Lambda_{k})\succeq\alpha I. By \beta-smoothness of F(\cdot,\Lambda_{k}), we also know that H(x_{k},\Lambda_{k})\preceq\beta I. Therefore, if \eta_{k}\leq\frac{1}{\beta}, we see that \left|\left|\mathcal{D}_{x_{k}}[x_{k+1}](L)\right|\right|_{op}\leq 1-\eta_{k}\alpha. From this we can conclude that

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\prod_{k=t+1}^{N-1}\left|\left|\mathcal{D}_{x_{k}}[x_{k+1}](L)\right|\right|_{op}
\leq \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\prod_{k=t+1}^{N-1}(1-\eta_{k}\alpha)
\leq \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\prod_{k=t+1}^{N-1}\exp\left(-\eta_{k}\alpha\right)
= \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right). (12)
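The per-step contraction factor 1-\eta_{k}\alpha used above can be checked numerically on toy matrices whose spectra lie in [\alpha,\beta] (all dimensions and parameter values below are illustrative):

```python
import numpy as np

# If alpha*I <= H <= beta*I and eta <= 1/beta, then every eigenvalue of
# I - eta*H lies in [0, 1 - eta*alpha], so ||I - eta*H||_op <= 1 - eta*alpha.
rng = np.random.default_rng(0)
alpha, beta = 0.5, 10.0
eta = 1.0 / beta

for _ in range(100):
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal basis
    eigs = rng.uniform(alpha, beta, size=5)            # spectrum inside [alpha, beta]
    H = Q @ np.diag(eigs) @ Q.T
    op_norm = np.linalg.norm(np.eye(5) - eta * H, ord=2)
    assert op_norm <= 1 - eta * alpha + 1e-9
```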

Finally, note that x_{t} depends only on x_{0} and \Lambda_{1},...,\Lambda_{t-1}; in particular, it does not depend on \Lambda_{t}. Therefore, differentiating both sides of x_{t+1}=x_{t}-\eta_{t}\nabla_{x}F(x_{t},\Lambda_{t}) with respect to \Lambda_{t}(o,d) gives

\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L) = -\eta_{t}\mathcal{D}_{\Lambda_{t}(o,d)}\left[\nabla_{x}F\right](x_{t},\Lambda_{t})
\implies \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op} = \eta_{t}\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}\left[\nabla_{x}F\right](x_{t},\Lambda_{t})\right|\right|_{op}\overset{(a)}{\leq}C\eta_{t}, (13)

where (a) is due to Assumption 6. Combining inequalities (12) and (13) gives

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq C\eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right). (14)

Letting t_{0}:=\max\left(\frac{\beta}{\alpha\min(1,2\alpha)}-1,0\right), note that

\eta_{t}=\begin{cases}\frac{1}{\alpha(t+1)}&\text{if }t\geq t_{0},\\ \frac{\min(1,2\alpha)}{\beta}&\text{otherwise.}\end{cases}

If t\geq t_{0}, then \eta_{k}=\frac{1}{\alpha(k+1)} for all k\geq t, and we see that

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq C\eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right)
= \frac{C}{\alpha(t+1)}\exp\left(-\alpha\sum_{k=t+1}^{N-1}\frac{1}{\alpha(k+1)}\right)
\leq \frac{C}{\alpha(t+1)}\exp\left(-\alpha\int_{t+1}^{N}\frac{1}{\alpha y}dy\right)
= \frac{C}{\alpha(t+1)}\exp\left(-\ln N+\ln(t+1)\right)
= \frac{C}{\alpha(t+1)}\cdot\frac{t+1}{N}=\frac{C}{\alpha N}.

On the other hand, if t<t_{0}, then \eta_{t}=\frac{\min(1,2\alpha)}{\beta} and we see that

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq C\eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right)
= \frac{C\min(1,2\alpha)}{\beta}\exp\left(-\alpha\sum_{k=t+1}^{N-1}\eta_{k}\right)
\leq \frac{C\min(1,2\alpha)}{\beta}\exp\left(-\alpha\sum_{k=\left\lfloor t_{0}\right\rfloor+1}^{N-1}\frac{1}{\alpha(k+1)}\right)
\leq \frac{C\min(1,2\alpha)}{\beta}\exp\left(-\alpha\int_{t_{0}+1}^{N}\frac{1}{\alpha y}dy\right)
\leq \frac{C\min(1,2\alpha)}{\beta}\cdot\frac{t_{0}+1}{N}
= \frac{C\min(1,2\alpha)}{\beta}\cdot\frac{\beta}{\alpha\min(1,2\alpha)N}=\frac{C}{\alpha N}.

Thus, in either case, we have \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq\frac{C}{\alpha N}.

Finally, note that (14) implies \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq C\eta_{t}\leq\frac{C\min(1,2\alpha)}{\beta}. Combining these two bounds gives the desired result:

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq\min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right).
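The step-size argument in the two cases above can be verified numerically. The sketch below checks the product form of the bound, C\eta_{t}\prod_{k=t+1}^{N-1}(1-\eta_{k}\alpha)\leq\frac{C}{\alpha N}, for toy values of \alpha, \beta, C, and N:

```python
import math

# Step-size schedule from Algorithm 1: eta_t = min(1/(alpha*(t+1)), min(1,2*alpha)/beta).
alpha, beta, C, N = 0.5, 10.0, 1.0, 200
eta = [min(1.0 / (alpha * (t + 1)), min(1.0, 2 * alpha) / beta) for t in range(N)]

# The influence of day t's data on x_N is at most C*eta_t times the product of
# the per-step contraction factors (1 - eta_k*alpha); it never exceeds C/(alpha*N).
for t in range(N):
    decay = math.prod(1.0 - eta[k] * alpha for k in range(t + 1, N))
    assert C * eta[t] * decay <= C / (alpha * N) + 1e-12
```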

Appendix C Performance Analysis (Proof of Theorem 3)

Let x^{*} be a solution to (9). Note that

\left|\left|x_{k+1}-x^{*}\right|\right|_{2}^{2} = \left|\left|\Pi_{\mathcal{X}}\left(x_{k}-\eta_{k}\nabla F(x_{k},\Lambda_{k})\right)-x^{*}\right|\right|_{2}^{2}
\overset{(a)}{=} \left|\left|\Pi_{\mathcal{X}}\left(x_{k}-\eta_{k}\nabla F(x_{k},\Lambda_{k})\right)-\Pi_{\mathcal{X}}\left(x^{*}-\eta_{k}\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)\right|\right|_{2}^{2}
\overset{(b)}{\leq} \left|\left|x_{k}-\eta_{k}\nabla F(x_{k},\Lambda_{k})-\left(x^{*}-\eta_{k}\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)\right|\right|_{2}^{2}
= \left|\left|x_{k}-x^{*}-\eta_{k}\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)\right|\right|_{2}^{2}
= \left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\eta_{k}^{2}\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}-2\eta_{k}\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*}),

where (a) holds because first-order optimality makes x^{*} a fixed point of the projected gradient map, and (b) holds because Euclidean projection onto the convex set \mathcal{X} is nonexpansive.

Taking the expectation of both sides conditioned on x_{k}, we see that

\mathbb{E}\left[\left.\left|\left|x_{k+1}-x^{*}\right|\right|_{2}^{2}\right|x_{k}\right] \leq \mathbb{E}\left[\left.\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\underbrace{\eta_{k}^{2}\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}}_{\text{Term 2}}\underbrace{-2\eta_{k}\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})}_{\text{Term 3}}\right|x_{k}\right].

To show that x_{k+1} is closer to x^{*} than x_{k} is, we will provide bounds for both Term 2 and Term 3.
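Step (b) in the expansion above relies on the fact that Euclidean projection onto a convex set is nonexpansive. A quick numerical sanity check, using the box [0,1]^n as a stand-in for the (also convex) flow polytope \mathcal{X}:

```python
import numpy as np

rng = np.random.default_rng(1)
proj = lambda v: np.clip(v, 0.0, 1.0)   # Euclidean projection onto the box [0,1]^n

# Nonexpansiveness: ||proj(x) - proj(y)||_2 <= ||x - y||_2 for all x, y.
for _ in range(1000):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    assert np.linalg.norm(proj(x) - proj(y)) <= np.linalg.norm(x - y) + 1e-12
```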

C.1 Bounding Term 2

To upper bound Term 2, we add and subtract \nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]:

\eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}\right|x_{k}\right]
= \eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|\underbrace{\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]}_{A}+\underbrace{\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]}_{B}\right|\right|_{2}^{2}\right|x_{k}\right]
= \eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|A\right|\right|_{2}^{2}+\left|\left|B\right|\right|_{2}^{2}+2A^{\top}B\right|x_{k}\right]. (15)

By Assumption 6 and the dominated convergence theorem, we have \nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]=\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\nabla F(x_{k},\Lambda)\right]. Hence we see that

\mathbb{E}\left[\left.A\right|x_{k}\right]=\mathbb{E}_{\Lambda_{k}\sim\mathcal{Q}}\left[\left.\nabla F(x_{k},\Lambda_{k})\right|x_{k}\right]-\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\left.\nabla F(x_{k},\Lambda)\right|x_{k}\right]=0.

From this observation we have the following three remarks:

  1. Since A is a zero-mean random vector, \mathbb{E}\left[\left.\left|\left|A\right|\right|_{2}^{2}\right|x_{k}\right] is the variance of \nabla F(x_{k},\Lambda) given x_{k}. By Assumption 2, \mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\left|\left|\nabla F(x,\Lambda)\right|\right|_{2}^{2}\right]\leq K^{2} for any x, which implies that \mathbb{E}\left[\left.\left|\left|A\right|\right|_{2}^{2}\right|x_{k}\right]\leq K^{2}.

  2. By Assumption 5, F(\cdot,\Lambda) is \beta-smooth for every \Lambda, which implies that \nabla F(\cdot,\Lambda) is \beta-Lipschitz. Thus we have

    \mathbb{E}\left[\left.\left|\left|B\right|\right|_{2}^{2}\right|x_{k}\right] = \left|\left|\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\nabla F(x_{k},\Lambda)-\nabla F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}
    \overset{(a)}{\leq} \left(\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\left|\left|\nabla F(x_{k},\Lambda)-\nabla F(x^{*},\Lambda)\right|\right|_{2}\right]\right)^{2}
    \overset{(b)}{\leq} \left(\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\beta\left|\left|x_{k}-x^{*}\right|\right|_{2}\right]\right)^{2}
    = \beta^{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2},

    where (a) is due to Jensen's inequality and (b) is due to \nabla F(\cdot,\Lambda) being \beta-Lipschitz.

  3. Conditioned on x_{k}, A is zero mean and B is constant, so A^{\top}B is a zero-mean random variable. Therefore \mathbb{E}\left[\left.A^{\top}B\right|x_{k}\right]=0.

Applying these three remarks to inequality (15), we see that

\eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}\right|x_{k}\right] \leq \eta_{k}^{2}\left(K^{2}+\beta^{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}\right). (16)
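The decomposition behind (16), a zero-mean deviation A plus a fixed term B with a vanishing cross term, can be checked with a quick Monte Carlo sketch (the distribution of A below is a toy stand-in, assumed for illustration):

```python
import numpy as np

# If E[A] = 0 and B is fixed, then E||A + B||^2 = E||A||^2 + ||B||^2,
# because the cross term 2*E[A^T B] vanishes.
rng = np.random.default_rng(2)
B = np.array([1.0, -2.0, 0.5])               # plays the role of the fixed term B
A = rng.standard_normal((200000, 3))         # zero-mean samples playing the role of A
lhs = np.mean(np.sum((A + B) ** 2, axis=1))
rhs = np.mean(np.sum(A ** 2, axis=1)) + np.sum(B ** 2)
assert abs(lhs - rhs) < 0.1                  # equal up to Monte Carlo error
```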

C.2 Bounding Term 3

By linearity of expectation, Term 3 is equal to

-2\eta_{k}\mathbb{E}\left[\left.\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})\right|x_{k}\right]
= -2\eta_{k}\left(\mathbb{E}\left[\left.\nabla F(x_{k},\Lambda_{k})\right|x_{k}\right]-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})
= -2\eta_{k}\left(\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\nabla F(x_{k},\Lambda)-\nabla F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*}).

Define f(x):=\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x,\Lambda)\right]. With this notation, the expression above satisfies

-2\eta_{k}\mathbb{E}\left[\left.\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})\right|x_{k}\right] \leq -2\eta_{k}\left(\nabla f(x_{k})-\nabla f(x^{*})\right)^{\top}(x_{k}-x^{*}).

Since F(\cdot,\Lambda) is \alpha-strongly convex for every \Lambda, we can conclude that f is also \alpha-strongly convex. To upper bound Term 3, we use \alpha-strong convexity of f to conclude that

f(x_{k}) \geq f(x^{*})+\nabla f(x^{*})^{\top}(x_{k}-x^{*})+\frac{\alpha}{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}
and
f(x^{*}) \geq f(x_{k})+\nabla f(x_{k})^{\top}(x^{*}-x_{k})+\frac{\alpha}{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}.

Adding these inequalities together gives

f(x_{k})+f(x^{*}) \geq f(x_{k})+f(x^{*})-\left(\nabla f(x_{k})-\nabla f(x^{*})\right)^{\top}(x_{k}-x^{*})+\alpha\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}
\implies -\alpha\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2} \geq -\left(\nabla f(x_{k})-\nabla f(x^{*})\right)^{\top}(x_{k}-x^{*}).

This inequality implies the following bound on Term 3:

-2\eta_{k}\mathbb{E}\left[\left.\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})\right|x_{k}\right] \leq -2\alpha\eta_{k}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}. (17)
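The strong-monotonicity inequality behind (17) can be sanity-checked on an \alpha-strongly convex quadratic (the matrix H below is a toy stand-in, assumed for illustration):

```python
import numpy as np

# For f(x) = 0.5 * x^T H x with H >= alpha*I,
# (grad f(x) - grad f(y))^T (x - y) >= alpha * ||x - y||^2.
rng = np.random.default_rng(3)
alpha = 0.5
M = rng.standard_normal((6, 6))
H = M @ M.T + alpha * np.eye(6)     # PSD matrix plus alpha*I, hence H >= alpha*I
grad = lambda v: H @ v              # gradient of the quadratic

for _ in range(1000):
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    lhs = (grad(x) - grad(y)) @ (x - y)
    assert lhs >= alpha * np.linalg.norm(x - y) ** 2 - 1e-9
```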

C.3 Putting everything together

Applying the bounds (16) and (17) to Term 2 and Term 3, respectively, we see that

\mathbb{E}\left[\left.\left|\left|x_{k+1}-x^{*}\right|\right|_{2}^{2}\right|x_{k}\right] \leq \left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\eta_{k}^{2}K^{2}+\beta^{2}\eta_{k}^{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}-2\alpha\eta_{k}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}
= \left(1-2\alpha\eta_{k}+\beta^{2}\eta_{k}^{2}\right)\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\eta_{k}^{2}K^{2}. (18)

By the tower property of expectation, we can write

\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right] = \mathbb{E}\left[...\left.\mathbb{E}\left[\left.\mathbb{E}\left[\left.\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right|x_{N-1}\right]\right|x_{N-2}\right]...\right|x_{0}\right]. (19)

Combining the recursive relation from (18) with (19) we see that

\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right] \leq \left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\prod_{t=0}^{N-1}\left(1-2\alpha\eta_{t}+\beta^{2}\eta_{t}^{2}\right)+K^{2}\sum_{t=0}^{N-1}\eta_{t}^{2}\prod_{k=t+1}^{N-1}\left(1-2\alpha\eta_{k}+\beta^{2}\eta_{k}^{2}\right)
\leq \left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(\sum_{t=0}^{N-1}\left(-2\alpha\eta_{t}+\beta^{2}\eta_{t}^{2}\right)\right)+K^{2}\sum_{t=0}^{N-1}\eta_{t}^{2}\exp\left(\sum_{k=t+1}^{N-1}\left(-2\alpha\eta_{k}+\beta^{2}\eta_{k}^{2}\right)\right).

Since we chose \eta_{k}:=\min\left(\frac{1}{\alpha(k+1)},\frac{\min(1,2\alpha)}{\beta}\right), we have \eta_{k}\leq\frac{1}{\alpha(k+1)}. This means that \sum_{k=0}^{\infty}\eta_{k}^{2}\leq\sum_{k=0}^{\infty}\frac{1}{\alpha^{2}(k+1)^{2}}=\frac{\pi^{2}}{6\alpha^{2}}. Thus, defining C_{\alpha,\beta}:=\exp\left(\frac{\beta^{2}\pi^{2}}{6\alpha^{2}}\right), we have

\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right] \leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=0}^{N-1}\eta_{t}\right)+K^{2}C_{\alpha,\beta}\sum_{t=0}^{N-1}\eta_{t}^{2}\exp\left(-2\alpha\sum_{k=t+1}^{N-1}\eta_{k}\right)
\overset{(a)}{\leq} C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=0}^{N-1}\eta_{t}\right)+K^{2}C_{\alpha,\beta}\sum_{t=0}^{N-1}\left(\frac{C}{\alpha N}\right)^{2}
= C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=0}^{N-1}\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N},

where (a) is because in Appendix B we showed that \eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\alpha\eta_{k}\right)\leq\frac{C}{\alpha N} for all 0\leq t\leq N-1. Next, let t_{0}:=\max\left(\frac{\beta}{\alpha\min(1,2\alpha)}-1,0\right) so that \eta_{t}=\frac{1}{\alpha(t+1)} if t\geq t_{0} and \eta_{t}=\frac{\min(1,2\alpha)}{\beta} otherwise. We then have

\begin{align*}
\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right]&\leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(\sum_{t=0}^{N-1}-2\alpha\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&=C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(\sum_{t=0}^{t_{0}}-2\alpha\eta_{t}\right)\exp\left(\sum_{t=t_{0}+1}^{N-1}-2\alpha\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&\leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=t_{0}+1}^{N-1}\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&\leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\int_{t_{0}+1}^{N}\frac{1}{\alpha y}\,dy\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&=C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\left(\frac{t_{0}+1}{N}\right)^{2}+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&=\frac{C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\beta^{2}}{\left(\alpha\min(1,2\alpha)\right)^{2}N^{2}}+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}.
\end{align*}

Finally, we have

\begin{align*}
\mathbb{E}\left[\left|\left|x_{\text{alg}}-x^{*}\right|\right|_{2}\right]&=\mathbb{E}\left[\left|\left|\Pi_{\mathcal{X}}\left(x_{N}+Z\right)-x^{*}\right|\right|_{2}\right]\\
&=\mathbb{E}\left[\left|\left|\Pi_{\mathcal{X}}\left(x_{N}+Z\right)-\Pi_{\mathcal{X}}(x^{*})\right|\right|_{2}\right]\\
&\leq\mathbb{E}\left[\left|\left|x_{N}+Z-x^{*}\right|\right|_{2}\right]\\
&\leq\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}\right]+\mathbb{E}\left[\left|\left|Z\right|\right|_{2}\right]\\
&\overset{(a)}{\leq}\sqrt{\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right]}+\sqrt{\mathbb{E}\left[\left|\left|Z\right|\right|_{2}^{2}\right]}\\
&\leq\sqrt{\frac{C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\beta^{2}}{\left(\alpha\min(1,2\alpha)\right)^{2}N^{2}}+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}}+\sqrt{n^{2}m\frac{2C^{2}}{\epsilon^{2}\alpha^{2}T^{2}N^{2}}\ln\left(\frac{1.25}{\delta}\right)}\\
&\leq\frac{\sqrt{C_{\alpha,\beta}}\left|\left|x_{0}-x^{*}\right|\right|_{2}\beta}{\alpha\min(1,2\alpha)N}+\frac{KC\sqrt{C_{\alpha,\beta}}}{\alpha\sqrt{N}}+\frac{Cn\sqrt{m}}{\epsilon\alpha TN}\sqrt{2\ln\left(\frac{1.25}{\delta}\right)}\\
&\overset{(b)}{\leq}\frac{\sqrt{C_{\alpha,\beta}}\,n\sqrt{m}\,\beta}{\alpha\min(1,2\alpha)N}+\frac{KC\sqrt{C_{\alpha,\beta}}}{\alpha\sqrt{N}}+\frac{Cn\sqrt{m}}{\epsilon\alpha TN}\sqrt{2\ln\left(\frac{1.25}{\delta}\right)},
\end{align*}

where $(a)$ is due to Jensen's inequality and $(b)$ is due to the fact that $\left|\left|x^{\prime}-x^{\prime\prime}\right|\right|_{2}\leq n\sqrt{m}$ for any pair of unit network flows $x^{\prime},x^{\prime\prime}$.
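The convergence behavior established in this proof can be sanity-checked numerically. The sketch below (illustrative only: the quadratic objective, noise level, and dimension are made-up toy choices, not the paper's routing problem) runs noisy gradient descent with the step-size schedule $\eta_{k}=\min\left(\frac{1}{\alpha k},\frac{\min(1,2\alpha)}{\beta}\right)$ on a strongly convex function and checks that the mean squared error decays at roughly the $O(1/N)$ rate that dominates the bound for large $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly convex objective f(x) = (alpha/2) * ||x - x_star||^2,
# observed through noisy gradients (the noise plays the role of the
# stochastic gradients in the paper's algorithm).
alpha, beta = 1.0, 1.0   # toy strong convexity / smoothness parameters
d = 5
x_star = rng.normal(size=d)

def noisy_grad(x):
    # True gradient plus zero-mean Gaussian noise.
    return alpha * (x - x_star) + 0.1 * rng.normal(size=d)

def run_sgd(N):
    # Gradient descent with eta_k = min(1/(alpha*k), min(1, 2*alpha)/beta).
    x = np.zeros(d)
    for k in range(1, N + 1):
        eta = min(1.0 / (alpha * k), min(1.0, 2.0 * alpha) / beta)
        x = x - eta * noisy_grad(x)
    return float(np.sum((x - x_star) ** 2))

# Average the squared error over repeated trials for two horizons.
err_100 = np.mean([run_sgd(100) for _ in range(200)])
err_1000 = np.mean([run_sgd(1000) for _ in range(200)])
# With an O(1/N) rate, a tenfold larger N should shrink the error roughly 10x.
print(err_100, err_1000, err_100 / err_1000)
```

The deterministic (bias) term of the bound decays faster, so for moderate $N$ the noise-driven $O(1/N)$ term dominates, which is what the ratio printed above reflects.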

Appendix D Computational approximations

In our experiments, we make several approximations for computational tractability. In this section, we provide empirical evidence that these approximations are reasonable and do not introduce significant errors. For ease of discussion, we define the following notation.

  • $x^{opt}_{\alpha}$: the solution obtained by solving the optimization problem (7) with regularizer $\alpha$

  • $x_{\alpha}$: the solution obtained by solving the optimization problem with the average demand and regularizer $\alpha$

  • $F_{\alpha}(x,\Lambda_{i})$: the cost of the routing policy $x$ on demand $\Lambda_{i}$ with regularizer $\alpha$

  • $\langle F_{\alpha}(x,\Lambda_{i})\rangle$: the average cost of the routing policy $x$ on the set of demands $\{\Lambda_{1},\ldots,\Lambda_{N}\}$ with regularizer $\alpha$. More precisely, $\langle F_{\alpha}(x,\Lambda_{i})\rangle=\frac{1}{N}\sum_{i=1}^{N}F_{\alpha}(x,\Lambda_{i})$

  • $F_{\alpha}(x,\langle\Lambda_{i}\rangle)$: the cost of the routing policy $x$ on the average demand $\langle\Lambda_{i}\rangle=\frac{1}{N}\sum_{i=1}^{N}\Lambda_{i}$ with regularizer $\alpha$
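To make the distinction between $\langle F_{\alpha}(x,\Lambda_{i})\rangle$ and $F_{\alpha}(x,\langle\Lambda_{i}\rangle)$ concrete, the sketch below uses a made-up scalar cost function (the real $F_{\alpha}$ is the paper's network routing objective): whenever the cost is nonlinear in the demand, the two quantities differ by a Jensen-type gap, which is what Error #1 in Table 2 measures for the true objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (illustrative only): demands Lambda_i are scalars, and
# F(x, lam) = lam * x**2 + (1 - x) * lam**2 is a made-up convex cost
# that is nonlinear (quadratic) in the demand lam.
demands = rng.uniform(0.5, 1.5, size=50)   # Lambda_1, ..., Lambda_N
x = 0.7                                     # a fixed routing policy

def F(x, lam):
    return lam * x ** 2 + (1 - x) * lam ** 2

avg_of_costs = np.mean([F(x, lam) for lam in demands])  # <F(x, Lambda_i)>
cost_of_avg = F(x, np.mean(demands))                    # F(x, <Lambda_i>)

# The gap comes entirely from the lam**2 term (the lam * x**2 term is
# linear in the demand, so it averages exactly).
print(avg_of_costs, cost_of_avg, avg_of_costs - cost_of_avg)
```

In this toy example the gap equals $(1-x)\,\mathrm{Var}(\Lambda_{i})$ exactly, so the approximation is accurate whenever the demand samples concentrate around their mean.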

$N$ | Error #1 | Error #2 | Error #3
10 | $8.97\times 10^{-7}$ | $<10^{-10}$ | $8.17\times 10^{-3}$
25 | $1.02\times 10^{-6}$ | $<10^{-10}$ | $8.16\times 10^{-3}$
50 | $9.84\times 10^{-7}$ | $<10^{-10}$ | $8.17\times 10^{-3}$

Table 2: Approximation errors, where Error #1 $=\frac{\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle-F_{\alpha}(x^{opt}_{\alpha},\langle\Lambda_{i}\rangle)}{\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle}$, Error #2 $=\frac{\langle F_{\alpha}(x_{\alpha},\Lambda_{i})\rangle-\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle}{\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle}$, and Error #3 $=\frac{\langle F_{\alpha=0}(x_{\alpha},\Lambda_{i})\rangle-\langle F_{\alpha=0}(x_{\alpha=0},\Lambda_{i})\rangle}{\langle F_{\alpha=0}(x_{\alpha=0},\Lambda_{i})\rangle}$.

Table 2 presents the errors from three different approximations. Our first approximation computes the system cost of a given policy using the average demand instead of averaging the cost over every observed demand. The first column (titled Error #1) lists the fractional error introduced by this approximation for different values of $N$ when using the optimal routing policy. The error is less than $10^{-6}$, and we observe a 30-50x improvement in computation time when evaluating costs with the average demand. This justifies the use of the average demand for estimating costs. Our second approximation solves an easier optimization problem to compute the routing policy. The exact approach would be to solve the optimization problem described in Equation 10; however, the size of this problem grows rapidly with the number of data points $N$. Our approximation instead optimizes $F_{\alpha}(x,\langle\Lambda_{i}\rangle)$ to obtain the routing policy $x_{\alpha}$, rather than solving the original optimization problem to obtain $x_{\alpha}^{opt}$. The second column (titled Error #2) presents the error this approximation introduces in the travel costs. The small errors indicate that the approximation is reasonable, and it yields up to a 30x speedup in solving the optimization problem. Finally, we show through numerical evaluation that adding the $\alpha=10^{3}$ regularizer does not change the travel costs significantly (third column, titled Error #3): it results in less than a 1% error in the total travel cost.
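As a self-contained illustration of the Error #3 computation (with a made-up one-dimensional cost standing in for the network objective, and a toy regularizer weight rather than the paper's $\alpha=10^{3}$), the sketch below solves a problem with and without a quadratic regularizer and measures the resulting fractional increase in the unregularized cost. Because the unregularized solution minimizes the unregularized cost, this fractional error is always nonnegative, and it shrinks as the regularizer weight decreases.

```python
# Toy illustration of the Error #3 computation; all functions and values
# here are hypothetical, chosen so the minimizers have closed forms.
def cost(x, c=1.0):
    """Unregularized travel cost (toy): c*x + x**2."""
    return c * x + x ** 2

def solve(alpha, c=1.0):
    """Minimizer of the regularized cost c*x + (1 + alpha)*x**2."""
    return -c / (2.0 * (1.0 + alpha))

x_unreg = solve(alpha=0.0)   # plays the role of x_{alpha=0}
x_reg = solve(alpha=0.01)    # plays the role of x_alpha

# Fractional error in the *unregularized* cost, mirroring Error #3.
frac_err = (cost(x_reg) - cost(x_unreg)) / abs(cost(x_unreg))
print(frac_err)
```

For a small regularizer weight the regularized minimizer barely moves, so the fractional cost error is tiny; the same qualitative behavior underlies the sub-1% figures in the Error #3 column.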