
Differentially Private Stochastic Convex Optimization for Network Routing Applications

Matthew Tsao Lyft Inc. [email protected] Karthik Gopalakrishnan Stanford University [email protected] Kaidi Yang National University of Singapore [email protected] Marco Pavone Stanford University [email protected]
Abstract

Network routing problems are common across many engineering applications. Computing optimal routing policies requires knowledge about network demand, i.e., the origin and destination (OD) of all requests in the network. However, privacy considerations make it challenging to share individual OD data that would be required for computing optimal policies. Privacy can be particularly challenging in standard network routing problems because sources and sinks can be easily identified from flow conservation constraints, making feasibility and privacy mutually exclusive.

In this paper, we present a differentially private algorithm for network routing problems. The main ingredient is a reformulation of network routing which moves all user data-dependent parameters out of the constraint set and into the objective function. We then present an algorithm for solving this formulation based on a differentially private variant of stochastic gradient descent. In this algorithm, differential privacy is achieved by injecting noise, and one may wonder if this noise injection compromises solution quality. We prove that our algorithm is both differentially private and asymptotically optimal as the size of the training set goes to infinity. We corroborate the theoretical results with numerical experiments on a road traffic network which show that our algorithm provides differentially private and near-optimal solutions in practice.

1 Introduction

Network routing problems appear in many important topics in engineering, including traffic routing in transportation systems, power routing in electrical grids, and packet routing in distributed computer systems. Network routing problems study settings where resources must be delivered to customers through a network with limited bandwidth. The goal is typically to route resources to their respective customers as efficiently as possible, or equivalently, with as little network congestion as possible.

One key challenge in network routing problems is that the requests (i.e., network demand) are often not known in advance. Indeed, it is difficult to know exactly how much power a neighborhood will need, or exactly how many visits a particular website will receive on any given day. Since information about the demand is often necessary to develop optimal or near-optimal network routing solutions, network routing algorithms often need a way of obtaining or estimating future demand. With the advent of big data and internet-of-things systems, crowd-sourcing has gained popularity as a demand forecasting approach for network routing systems. In crowd-sourcing, customers submit their request history to the network operator. The network operator uses this historical data to train a statistical or machine learning model to predict future demand from historical demand.

While crowd-sourcing provides a bountiful supply of data for training demand forecasting models, it can also introduce potential privacy concerns. Since crowd-sourcing uses individual-level customer data to train demand forecasting models, the model’s outputs may reveal sensitive user information, especially if it overfits to its training data [CTW+21]. Such privacy risks are problematic because they may deter users from sharing their data with network operators, hence reducing the supply of training data for demand forecasting models.

To address such concerns, the demand forecasting pipeline should be augmented with privacy-preserving mechanisms. Differential privacy [DMNS06] is a principled and popular method to occlude the influence of a single user’s data on the result of a population study while also maintaining the study’s accuracy. This is done by carefully injecting noise into the desired computation so that data sets that differ by at most one data point will produce statistically indistinguishable results.

Providing differential privacy guarantees for the standard formulation of network routing is difficult because the constraints contain user data, meaning that in general feasibility and privacy become mutually exclusive. More specifically, in the standard network routing problem, the demand sources and sinks can be identified by checking for conservation of flow, and as a result, the presence of a user going to or from a very rare location can be detected from any feasible solution. Because differential privacy requires that the presence of any single user’s data be difficult to detect from the algorithm’s output, privacy and feasibility are at odds with one another in the standard formulation.

1.1 Statement of Contributions

In this paper we present a differentially private algorithm for network routing problems. The main ingredient is a reformulation of network routing which moves all user data-dependent parameters out of the constraint set and into the objective function. We then present an algorithm for solving this formulation based on differentially private variants of stochastic gradient descent. In this algorithm, differential privacy is achieved by injecting noise, and one may wonder if this noise injection compromises solution quality. We prove that our algorithm is differentially private and, under several reasonable regularity conditions, asymptotically optimal (as the size of the training set goes to infinity). We note that in exchange for becoming compatible with differentially private algorithms, this new formulation is more computationally expensive.

1.2 Related Work

Traffic assignment in transportation systems is one of the most well-known applications of network routing. Herein we focus our literature review on privacy research in transportation networks. Privacy works in transportation mainly focus on location privacy, where the objective is to prevent untrusted and/or external entities from learning geographic locations or location sequences of an individual [BS03]. Privacy-preserving approaches have been proposed for various location-based applications, e.g., trajectory publishing, mobile crowdsensing, traffic control, etc. These techniques are based on spatial cloaking [CML11] and differential privacy [Dwo08]. While not the setting of interest for this paper, there are many works that use Secure Multi-Party Computation (MPC) [GMW87] to achieve privacy in distributed mobility systems.

Spatial cloaking approaches aggregate users’ exact locations into coarse information. These approaches are often based on k-anonymity [Swe02], where a mobility dataset is divided into equivalence classes based on data attributes (e.g., geographical regions, time, etc.) so that each class contains at least k records [GFCA14, HC20]. These k-anonymity-based approaches can guarantee that every record in the dataset is indistinguishable from at least k−1 other records. However, k-anonymity is generally considered to be a weak privacy guarantee, especially when k is small. Furthermore, very coarse data aggregation is required to address outliers or sparse data, and in these cases spatial cloaking-based approaches provide low data accuracy.

Differential privacy provides a principled privacy guarantee by producing randomized responses to queries, whereby two datasets that differ in only one entry produce statistically indistinguishable responses [DMNS06]. In other words, differential privacy ensures that an adversary with arbitrary background information (e.g., query responses, other entries) cannot infer individual entries with high confidence. Within transportation research, [WHL+18, YLL+19] share noisy versions of location data for mobile crowd-sensing applications. [GZFS15, GLTY18, AHFI+18, LYH20] use differential privacy to publish noisy versions of trajectory data. [DKBS15] and [HTP17] apply differential privacy to gradient descent algorithms for federated learning in mobility systems.

Many of the works mentioned in the previous paragraph establish differential privacy of their proposed algorithms by using composition properties of differential privacy. Composition theorems for differential privacy describe how well privacy is preserved when conducting several computations one after another. In [DKBS15] and [HTP17], composition theorems are applied as black boxes without considering the mathematical properties of the gradient descent algorithm. As a result, the privacy guarantees are overly conservative, meaning that large amounts of noise are added to the algorithm, leading to suboptimal behavior both in theory and in practice. Similarly, [GZFS15, GLTY18, AHFI+18, LYH20] use composition rules as a black box, and while privacy is achieved in this way, there are no accuracy guarantees for the algorithms presented in those works.

While black-box applications of differential privacy can lead to impractical levels of noise injection, specialized applications of differential privacy have been discovered that provide privacy with much less noise. [WLK+17] show how a simple adjustment to stochastic gradient descent gives rise to an algorithm which is both differentially private and, under reasonable regularity assumptions, asymptotically optimal. [FMTT18] and [FKT20] refined this idea to develop stochastic gradient descent algorithms that are differentially private, computationally efficient, and have optimal convergence rates. These techniques cannot directly be used to solve the standard formulation of network routing because they study unconstrained optimization problems or optimization problems with public constraint sets (i.e., constraints that do not include any private data).

2 Model

In this section we define notations, network models, and assumptions that will allow us to formulate network routing as a data-driven convex optimization problem.

2.1 Preliminaries

Indicator Representation of Edge Sets: Let G=(V,E) be a graph with vertices V and edges E = {e_1, …, e_m}. For any subset of edges E′ ⊆ E, we define the indicator representation of E′, denoted 𝟙_{[E′]}, as a boolean vector of length m: the ith entry of 𝟙_{[E′]} is 1 if e_i ∈ E′, and is 0 otherwise.
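As a concrete illustration (not part of the paper's development), the indicator representation can be computed in a few lines of Python, where the edge ordering e_1, …, e_m is fixed by the order of a list:

```python
# Build the indicator representation 1_[E'] of an edge subset E' (Section 2.1).
# The ordering e_1, ..., e_m is given by the order of the `edges` list.
def indicator(edges, subset):
    subset = set(subset)
    return [1 if e in subset else 0 for e in edges]

edges = [("a", "b"), ("b", "c"), ("c", "a")]
print(indicator(edges, {("b", "c")}))  # [0, 1, 0]
```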

Derivative Notation: For a scalar-valued function x ↦ f(x), we use ∇f(x_0) to denote the gradient of f with respect to x evaluated at the point x = x_0. For a vector-valued function x ↦ g(x) and a variable z, we use 𝒟_z[g](x_0) to denote the derivative matrix of g with respect to z evaluated at the point x = x_0.

Projections: For a convex set 𝒮 ⊆ ℝ^d, we use Π_𝒮 : ℝ^d → ℝ^d to denote the projection operator for 𝒮. For any x ∈ ℝ^d, Π_𝒮(x) := argmin_{s∈𝒮} ‖x − s‖_2 is the point in 𝒮 that is closest to x.
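For intuition, here is a hedged Python sketch of the projection operator for one simple convex set, the box S = [lo, hi]^d, where the Euclidean projection reduces to coordinate-wise clipping (general convex sets require solving the stated minimization):

```python
import numpy as np

# Euclidean projection onto the box S = [lo, hi]^d: clip each coordinate.
# For this particular convex set, argmin_{s in S} ||x - s||_2 decomposes
# coordinate-wise, so clipping is exact.
def project_box(x, lo, hi):
    return np.clip(x, lo, hi)

x = np.array([1.5, -0.3, 0.7])
print(project_box(x, 0.0, 1.0))  # [1.  0.  0.7]
```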

2.2 Network, Demand, and Network Flow

In this section we will introduce a) a graph model of a network, b) a representation of network demand, c) the standard formulation for network routing and d) privacy requirements. The notation defined in this section is aggregated in table form in Section A for the reader’s convenience.

Definition 1 (Network Representation).

We use a directed graph G=(V,E) to represent the network, where V and E represent the set of vertices and edges in the network, respectively. We will use n := |V| and m := |E| to denote the number of vertices and edges in the graph, respectively. For vertex pairs (o,d) ∈ V×V we will use 𝒫_G(o,d) to denote the set of simple paths from o to d in G.

Definition 2 (Operation Period).

We use 𝒯 := [t_start, t_end] to denote the operation period within which a network operator wants to optimize its routing decisions. We will also use T to denote the number of minutes in the operation period. For example, t_start = 8:00am, t_end = 9:00am could represent a morning commute period where T = 60.

Definition 3 (Demand Representation).

We study a stationary demand model where demand within the operation period 𝒯 is specified by a matrix Λ ∈ ℝ_+^{n×n}. For each ordered pair of vertices (o,d) ∈ V×V, Λ(o,d) is the number of requests arriving per minute (i.e., the arrival rate) during 𝒯 that need routing from vertex o to vertex d.

Remark 1 (Estimating Λ from historical data).

The arrival rates from historical demand are computed empirically, i.e., if Λ_t represents the demand for day t, then Λ_t(o,d) is calculated by counting the number of (o,d) requests appearing on day t, and then dividing by T to obtain requests per minute.
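As a hedged sketch of this empirical estimate (function and variable names are illustrative, not from the paper), where a day's data is a list of (origin, destination) trips:

```python
import numpy as np

# Estimate the day-t arrival-rate matrix Lambda_t from that day's trip list.
# `trips` holds (origin, destination) vertex-index pairs; T is the length of
# the operation period in minutes (Remark 1).
def empirical_rates(trips, n, T):
    Lam = np.zeros((n, n))
    for o, d in trips:
        Lam[o, d] += 1.0   # count (o, d) requests
    return Lam / T         # convert counts to requests per minute

trips = [(0, 1), (0, 1), (2, 3)]
Lam = empirical_rates(trips, n=4, T=60)
print(Lam[0, 1])  # 2 requests over 60 minutes -> 1/30 requests per minute
```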

Definition 4 (Link Latency Functions).

To model congestion effects, each link e ∈ E has a latency function f_e : ℝ_+ → ℝ_+ which specifies the average travel time through the link as a function of the total flow on the link.

In this paper we study a setting where a network operator wants to route demand while minimizing the total travel time for the requests. With these definitions, the standard formulation of minimum travel time network routing is described in Definition 5.

Definition 5 (Standard Formulation of Network Flow).

For a network G=(V,E) with link latency functions {f_e}_{e∈E} and demand Λ, the standard network flow problem is the following minimization program:

minimize_x   ∑_{e∈E} y_e f_e(y_e)   (1a)
subject to   x = {x^{(o,d)}}_{(o,d)∈V×V},   (1b)
y_e = ∑_{(o,d)∈V×V} x_e^{(o,d)}  for all e ∈ E,   (1c)
∑_{v:(u,v)∈E} x_{(u,v)}^{(o,d)} − ∑_{w:(w,u)∈E} x_{(w,u)}^{(o,d)} = Λ(o,d)𝟙_{[u=o]} − Λ(o,d)𝟙_{[u=d]}  for all u ∈ V, (o,d) ∈ V×V.   (1d)

In (1), the decision variable x is a collection of flows {x^{(o,d)}}_{(o,d)∈V×V}, one for each non-zero entry of Λ, as represented by constraint (1b). Constraints (1d) are flow conservation constraints ensuring that x^{(o,d)} sends Λ(o,d) units of flow from o to d. Constraints (1c) ensure that {y_e}_{e∈E} represents the total amount of flow on each edge. Finally, the objective function (1a) minimizes the total travel time as a function of the total network flow.

In the next subsection we will describe the rigorous privacy requirements that we will mandate while designing algorithms for network flow. We then describe in Section 2.4 why privacy and feasibility are mutually exclusive in the standard network flow formulation.

2.3 Privacy Requirements

We will use differential privacy to reason about the privacy-preserving properties of our algorithms. At a high level, changing one data point of the input to a differentially private algorithm should lead to a statistically indistinguishable change in the output. To make this concept concrete we will need to define data sets and adjacency.

Definition 6 (Data sets).

Given a space of data points 𝒵, a data set L is any finite set of elements from 𝒵. In practice, each element of a data set is data collected from a single user, or data collected during a specific time period. We will use ℒ to denote the set of all data sets.

Definition 7 (Data Set Adjacency).

Given a space of data sets ℒ, an adjacency relation Adj is a mapping Adj : ℒ×ℒ → {0,1}. Two data sets L_1, L_2 are said to be adjacent if and only if Adj(L_1, L_2) = 1.

While the exact definition can vary across applications, adjacent data sets are sets that are very similar to one another. The most common definition of adjacency is the following: L and L′ are adjacent if and only if L′ can be obtained from L by adding, deleting, or modifying at most one data point, and vice versa. Thus comparing a function’s output on adjacent data sets measures how sensitive the function is to changes in a single data point.
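A minimal Python sketch of this add/delete/modify-one adjacency check (an illustration of the common definition above, treating data sets as multisets):

```python
from collections import Counter

# Add/delete/modify-at-most-one-point adjacency (Definition 7, common case).
def adjacent(L1, L2):
    c1, c2 = Counter(L1), Counter(L2)
    only1 = list((c1 - c2).elements())  # points of L1 unmatched in L2
    only2 = list((c2 - c1).elements())  # points of L2 unmatched in L1
    # at most one point was added, deleted, or modified
    return len(only1) <= 1 and len(only2) <= 1

print(adjacent([1, 2, 3], [1, 2]))     # True  (one point deleted)
print(adjacent([1, 2, 3], [1, 2, 9]))  # True  (one point modified)
print(adjacent([1, 2, 3], [1, 8, 9]))  # False (two points changed)
```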

With these definitions in place, we are now ready to define differential privacy.

Definition 8 (Differential Privacy).

For a given adjacency relation Adj, a function M : ℒ → 𝒳 is (ϵ,δ)-differentially private if for any L_1, L_2 ∈ ℒ with Adj(L_1, L_2) = 1, the following inequality

ℙ[M(L_1) ∈ E] ≤ e^ϵ ℙ[M(L_2) ∈ E] + δ

holds for every event E ⊆ 𝒳.

The definition of differential privacy requires that changing one data point in the input to a differentially private algorithm cannot change the distribution of the algorithm’s output by too much. Such a requirement ensures the following strong privacy guarantee: It is statistically impossible to reliably infer the properties of a single data point just by examining the algorithm’s output, even if the observer knows all of the other data points in the input [DMNS06].
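For intuition, the classical Laplace mechanism [DMNS06] is a minimal example of Definition 8: a counting query changes by at most 1 between adjacent data sets, so adding Laplace noise of scale 1/ϵ releases the count with (ϵ, 0)-differential privacy. A hedged Python sketch (names are ours):

```python
import numpy as np

# Laplace mechanism: a counting query has sensitivity 1 under add/remove-one
# adjacency, so Laplace(1/epsilon) noise yields (epsilon, 0)-differential privacy.
def private_count(dataset, predicate, epsilon, rng):
    true_count = sum(1 for z in dataset if predicate(z))
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
data = [3, 7, 7, 1, 9]
noisy = private_count(data, lambda z: z >= 5, epsilon=0.5, rng=rng)  # 3 + noise
```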

To proceed, we must specify what data set adjacency means in the context of network routing. For network routing problems, historical data is often a collection of routing requests through the network. To protect the privacy of those who submit requests, we want the routing policy we compute to not depend too much on any single request that appears in the historical data. This motivates the following notion of data set adjacency that we will use throughout this paper.

Definition 9 (Request Level Adjacency (RLA)).

We define a function RLA : ℒ×ℒ → {0,1} which maps pairs of data sets to booleans. For two historical data sets of network demand L := (Λ_1, …, Λ_N) and L′ := (Λ′_1, …, Λ′_N), we say L and L′ are request-level-adjacent (RLA) if exactly one of the following statements is true:

  1. L contains all of the requests in L′, and contains exactly one extra request that is not present in L′.

  2. L′ contains all of the requests in L, and contains exactly one extra request that is not present in L.

Mathematically, L and L′ are request-level-adjacent, i.e., RLA(L, L′) = 1, if and only if they satisfy all of the following relations:

  • There exists a day t so that Λ_k = Λ′_k for all k ≠ t.

  • There exist two vertices o and d so that Λ_t(o′,d′) = Λ′_t(o′,d′) for all (o′,d′) ≠ (o,d).

  • |Λ_t(o,d) − Λ′_t(o,d)| ≤ 1/T.

Indeed, these relations dictate that one of the data sets contains an extra request from o to d which occurred on day t. A difference of 1 request within a T-minute operation period leads to a change of 1/T in the arrival rate. Aside from this difference, the data sets are otherwise identical.
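A hedged Python sketch of this RLA check (illustrative names; each data set is a list of daily n×n demand matrices):

```python
import numpy as np

# Check request-level adjacency (Definition 9) between two historical data sets.
def is_rla(L1, L2, T, tol=1e-9):
    if len(L1) != len(L2):
        return False
    diffs = [t for t in range(len(L1)) if not np.allclose(L1[t], L2[t], atol=tol)]
    if len(diffs) != 1:          # the matrices must differ on exactly one day
        return False
    D = np.abs(L1[diffs[0]] - L2[diffs[0]])
    # exactly one (o, d) entry differs, by at most one request (1/T in rate)
    return np.count_nonzero(D > tol) == 1 and D.max() <= 1.0 / T + tol

T = 60
A = [np.zeros((2, 2)), np.zeros((2, 2))]
B = [np.zeros((2, 2)), np.zeros((2, 2))]
B[1][0, 1] = 1.0 / T  # one extra request from vertex 0 to vertex 1 on day 1
print(is_rla(A, B, T))  # True
```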

2.4 Differential Privacy Challenges in Standard Network Flow

In the introduction we mentioned that privacy and feasibility can be mutually exclusive in the standard formulation of network flow described in (1). In this section we formally show that if Λ is constructed from a data set of trips as described in Remark 1, then trips to or from uncommon locations can be easily detected from any feasible solution to (1). As a result, announcing or releasing a feasible solution to (1) is not, in general, differentially private. Formally, we will prove the following theorem in this section.

Theorem 1 (Differential Privacy Impossibility for Standard Network Flow).

Let M be a function that takes as input a matrix Λ with non-negative entries and returns a feasible solution to the optimization problem (1), where Λ is used as the demand matrix. Then M cannot be (ϵ,δ)-differentially private for any δ < 1.

We further note that (ϵ,δ)-differential privacy only provides a meaningful privacy guarantee when ϵ < 1 and δ < 1 [Dwo08].

Proof of Theorem 1.

Let G=(V,E) be a network, and let Λ be constructed from a historical data set of requests in G. Suppose there exist uncommon locations o and d for which Λ contains no trips to or from either o or d. Mathematically, this means that

Λ(o,u) = 0,  Λ(u,o) = 0,  Λ(d,u) = 0,  Λ(u,d) = 0  for all u ∈ V.

Such situations are not uncommon in transportation networks if, for example, o and d are the homes of two different people who do not drive (perhaps they walk or bike to and from work).

If we now add a trip from o to d to the data set, and let Λ′ be the resulting demand matrix, then we have

  (i) Λ(o′,d′) = Λ′(o′,d′) for all (o′,d′) ≠ (o,d), and

  (ii) Λ(o,d) = 0, Λ′(o,d) = 1/T.

Let Prob_1, Prob_2 be the optimization problem (1) using demand Λ, Λ′ respectively. Because Λ and Λ′ are request-level-adjacent, any differentially private algorithm must behave similarly when acting on Prob_1 and Prob_2. However, this is impossible because the feasible sets of Prob_1 and Prob_2 are disjoint. Examining constraint (1d) with u = d for the pair (o,d), any feasible solution to Prob_1 satisfies

∑_{u:(u,d)∈E} x_{(u,d)}^{(o,d)} − ∑_{w:(d,w)∈E} x_{(d,w)}^{(o,d)} = Λ(o,d)𝟙_{[d=d]} = 0.

However, any feasible solution to Prob_2 satisfies

∑_{u:(u,d)∈E} x_{(u,d)}^{(o,d)} − ∑_{w:(d,w)∈E} x_{(d,w)}^{(o,d)} = Λ′(o,d)𝟙_{[d=d]} = 1/T.

In other words, checking the net flow at node d will detect the presence of any trips going to or from d. We will now show that any algorithm which returns a feasible solution to (1) cannot be differentially private. To this end, define E to be the event that flow is conserved at node d. Then for any algorithm M that takes as input a demand matrix and returns a feasible solution to (1) with the specified demand matrix, we have ℙ[M(Λ) ∈ E] = 1 and ℙ[M(Λ′) ∈ E] = 0. Recalling Definition 8, M is (ϵ,δ)-differentially private only if ℙ[M(Λ) ∈ E] ≤ e^ϵ ℙ[M(Λ′) ∈ E] + δ. This inequality can only be satisfied if δ ≥ 1. ∎
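The detection argument in the proof can be illustrated numerically on a toy three-vertex network (the graph and numbers here are our illustration, not from the paper):

```python
import numpy as np

# Toy graph o -> v -> d. A feasible flow for Lambda' routes 1/T = 1/60 units
# along this path; a feasible flow for Lambda (no trips to d) routes nothing.
edges = [("o", "v"), ("v", "d")]
flow_with_trip = np.array([1 / 60, 1 / 60])  # feasible for Lambda'
flow_without_trip = np.array([0.0, 0.0])     # feasible for Lambda

def net_inflow(node, edges, flow):
    inflow = sum(f for (u, v), f in zip(edges, flow) if v == node)
    outflow = sum(f for (u, v), f in zip(edges, flow) if u == node)
    return inflow - outflow

# The event "flow is conserved at d" holds for one data set and not the other,
# so releasing a feasible solution reveals the presence of the extra trip.
print(net_inflow("d", edges, flow_without_trip) == 0)  # True
print(net_inflow("d", edges, flow_with_trip) == 0)     # False
```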

We have two remarks regarding Theorem 1.

Remark 2.

The same result holds if M returns the total flow y associated with a feasible solution (see (1c)), rather than the feasible solution itself. In other words, even total traffic measurements (without knowing the breakdown by (o,d) pairs) can already expose trips to or from uncommon locations.

Remark 3.

The vulnerability of trips to and from uncommon locations is not a purely theoretical concern. A study on the New York City Taxi and Limousine data set was able to identify trips from residential areas to a gentlemen’s club [Pan14]. Because the start locations were in residential areas, it was easy to re-identify the users who booked the taxi trips as the owners of the homes where the trips began. As a result, despite the data set being anonymized, users who had taken taxis to this gentlemen’s club were re-identified, and their reputations may have been negatively affected.

3 Routing Policy Formulation of Network Flow

To sidestep the impossibility result described in Theorem 1, we present an alternative formulation of the network flow problem in this section. The alternative formulation avoids the issues mentioned in Theorem 1 by moving all parameters related to user data from the constraints to the objective function, as described in (8). We note that (8) can only be solved if the demand Λ is known, which may not always be the case. For this reason, we present two variations of (8): (9) is the stochastic version of (8) where Λ is drawn from a distribution 𝒬, and (10) is the data-driven approximation to (9) that one would solve if 𝒬 is unknown.

Before formally defining the model, we provide a high-level description of how this formulation works. In this formulation, a feasible solution x = {x^{(o,d)}}_{(o,d)∈V×V} specifies, for each (o,d) ∈ V×V, a flow x^{(o,d)} that routes 1 unit of flow from o to d. We note that a flow is specified for (o,d) even if there is no demand for this origin and destination, i.e., even if Λ(o,d) = 0. We refer to x as a routing policy due to its connection to randomized routing, which Remark 5 discusses in further detail. Given a feasible solution x, the objective function first calculates the total traffic on each edge by taking a linear combination of the flows {x^{(o,d)}}_{(o,d)∈V×V}, where the coefficients of the linear combination are determined by the demand Λ, with high-demand (o,d) pairs having larger coefficients. The total travel time can then be computed from the total traffic in the same way as (1a). These ideas are formalized by the following definitions.

Definition 10 (Unit (o,d) flow).

For a given origin-destination pair (o,d), we say that a flow x^{(o,d)} ∈ ℝ_+^m is a unit (o,d) flow if and only if it routes exactly 1 unit of flow from o to d. Formally, this condition is represented by the following constraints:

∑_{v∈V:(v,u)∈E} x^{(o,d)}_{(v,u)} − ∑_{v∈V:(u,v)∈E} x^{(o,d)}_{(u,v)} = { −1 if u = o;  1 if u = d;  0 otherwise }  for all u ∈ V.   (5)

Indeed, (5) requires that the net flow entering o is −1, the net flow entering d is 1, and that flow is conserved at all other vertices in the graph.
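As an illustrative sketch, constraint (5) can be verified numerically for a candidate flow (the toy graph below is our example, not from the paper):

```python
import numpy as np

# Check constraint (5): x is a unit (o, d) flow iff the net inflow is
# -1 at o, +1 at d, and 0 at every other vertex.
def is_unit_od_flow(x, edges, nodes, o, d, tol=1e-9):
    for u in nodes:
        net = sum(xe for (a, b), xe in zip(edges, x) if b == u) \
            - sum(xe for (a, b), xe in zip(edges, x) if a == u)
        target = -1.0 if u == o else (1.0 if u == d else 0.0)
        if abs(net - target) > tol:
            return False
    return True

nodes = ["o", "v", "d"]
edges = [("o", "v"), ("v", "d"), ("o", "d")]
x = np.array([0.4, 0.4, 0.6])  # splits one unit of flow across two o-to-d paths
print(is_unit_od_flow(x, edges, nodes, "o", "d"))  # True
```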

Definition 11 (Unit Network Flow).

A unit network flow is a collection of flows x = {x^{(o,d)}}_{(o,d)∈V×V} so that x^{(o,d)} is a unit (o,d) flow for each (o,d) ∈ V×V. We use 𝒳 to denote the set of all unit network flows.

Remark 4.

We can represent x as a concatenation of the vectors {x^{(o,d)}}_{(o,d)∈V×V}. Since each unit (o,d) flow is an m-dimensional vector, we have x ∈ ℝ^{n²m}.

We will refer to unit network flows as routing policies, due to their connection with randomized routing, described in the following remark.

Remark 5 (Routing demand using unit flow policies).

A unit network flow represents a randomized routing policy. By the flow decomposition theorem [Tre11], every unit (o,d) flow can be written as a distribution over paths from o to d. Formally, for any unit (o,d) flow x^{(o,d)}, there exist paths p_1^{(o,d)}, p_2^{(o,d)}, …, p_m^{(o,d)} from o to d and weights w_1^{(o,d)}, …, w_m^{(o,d)} so that the following equations hold:

x^{(o,d)} = ∑_{i=1}^{m} w_i^{(o,d)} 𝟙_{[p_i^{(o,d)}]}   (6)
∑_{i=1}^{m} w_i^{(o,d)} = 1
w_i^{(o,d)} ≥ 0 for all 1 ≤ i ≤ m.

Furthermore, p_1^{(o,d)}, …, p_m^{(o,d)} and w_1^{(o,d)}, …, w_m^{(o,d)} are efficiently computable. Defining the probability distribution P_{x^{(o,d)}} by

P_{x^{(o,d)}}(𝟙_{[p_i^{(o,d)}]}) = w_i^{(o,d)} for all 1 ≤ i ≤ m,

we see that x_e^{(o,d)} represents the probability that a random path drawn from P_{x^{(o,d)}} contains e. Thus x describes the expected behavior of the randomized routing policy that routes each (o,d) request along a path drawn independently at random from P_{x^{(o,d)}}. In particular, when using this policy to serve demand Λ, by linearity of expectation, the expected number of requests from o to d whose assigned path contains e is exactly Λ(o,d)x_e^{(o,d)}. Furthermore, the average number of requests on each edge is given by the vector ∑_{(o,d)∈V×V} Λ(o,d)x^{(o,d)}.
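A hedged sketch of this randomized routing policy: the greedy path-stripping below computes a flow decomposition of an acyclic unit (o,d) flow, after which a request's path is sampled with probabilities given by the weights. (The paper only asserts that the decomposition is efficiently computable; this particular procedure is our illustration.)

```python
import numpy as np

# Flow decomposition by greedy path stripping (assumes the flow's support is
# acyclic, so the o -> d trace below terminates).
def decompose(x, edges, o, d, tol=1e-9):
    x = dict(zip(edges, x))
    paths, weights = [], []
    while True:
        path, u = [], o
        while u != d:  # trace a path o -> d through edges with remaining flow
            e = next(((a, b) for (a, b) in x if a == u and x[(a, b)] > tol), None)
            if e is None:
                return paths, weights
            path.append(e)
            u = e[1]
        w = min(x[e] for e in path)  # strip the bottleneck amount of flow
        for e in path:
            x[e] -= w
        paths.append(path)
        weights.append(w)

edges = [("o", "v"), ("v", "d"), ("o", "d")]
paths, weights = decompose([0.4, 0.4, 0.6], edges, "o", "d")
# Randomized routing: serve a request by sampling a path with probability w_i.
rng = np.random.default_rng()
routed = paths[rng.choice(len(paths), p=np.array(weights) / sum(weights))]
```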

3.1 Minimum Total Travel Time Network Flow

In the minimum travel time network flow problem, the network operator wants to find a stationary routing policy for each (o,d) pair that will lead to small total travel times for the requests. Due to the equivalence between stationary (o,d) routing policies and unit (o,d) flows, the network operator can instead search over the space of unit (o,d) flows.

The total travel time of a flow y through G is given by ∑_{e∈E} y_e f_e(y_e). The total flow resulting from a unit network flow x serving demand Λ is the sum of the flow contributions from each of the (o,d) pairs. With Remark 5 in mind, the total amount of flow on an edge e ∈ E when serving Λ according to x is given by ∑_{(o,d)∈V×V} Λ(o,d)x_e^{(o,d)}. We can thus define F(x,Λ), the total travel time experienced by requests Λ when routed by x, as follows:

F(x,Λ) := ∑_{e∈E} (∑_{(o,d)∈V×V} Λ(o,d)x_e^{(o,d)}) f_e(∑_{(o,d)∈V×V} Λ(o,d)x_e^{(o,d)}).   (7)
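As a hedged numerical illustration of (7), the sketch below evaluates F(x,Λ) on a toy network using a BPR-style latency f_e(y) = t0_e (1 + 0.15 (y/c_e)^4); this latency form is an assumption for the example only (the paper's latency model appears in Section 3.4):

```python
import numpy as np

# Evaluate the total travel time F(x, Lambda) from (7).
def total_travel_time(x, Lam, t0, cap):
    # x: (n, n, m) array where x[o, d] is the unit (o, d) flow;
    # Lam: (n, n) demand matrix; t0, cap: free-flow times and capacities.
    y = np.tensordot(Lam, x, axes=([0, 1], [0, 1]))   # total flow on each edge
    latency = t0 * (1 + 0.15 * (y / cap) ** 4)        # BPR-style f_e (assumed)
    return float(np.dot(y, latency))

# Toy network: 2 vertices, a single edge from vertex 0 to vertex 1.
x = np.zeros((2, 2, 1))
x[0, 1, 0] = 1.0   # the unit (0, 1) flow uses the single edge
Lam = np.zeros((2, 2))
Lam[0, 1] = 2.0    # 2 requests per minute from vertex 0 to vertex 1
print(total_travel_time(x, Lam, t0=np.array([1.0]), cap=np.array([2.0])))  # 2.3
```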

With these definitions in place, the unit network flow that minimizes the total travel time when serving the demand Λ can be found by solving the following optimization problem:

minimize_x   F(x,Λ)   (8)
subject to   x = {x^{(o,d)}}_{(o,d)∈V×V},
x^{(o,d)} is a unit (o,d) flow for every (o,d) ∈ V×V.

3.2 Network Flow with Stochastic Demand

In practice, demand may vary from day to day, and such variations can be modeled by treating Λ as a random variable with distribution 𝒬. If 𝒬 is known to the network operator, then rather than solving (8), the operator is interested in minimizing the expected total travel time through the following stochastic optimization problem:

minimize𝑥\displaystyle\underset{x}{\text{minimize }} 𝔼Λ𝒬[F(x,Λ)]\displaystyle\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x,\Lambda)\right] (9)
subject to x={x(o,d)}(o,d)V×V,\displaystyle x=\left\{x^{(o,d)}\right\}_{(o,d)\in V\times V},
x^{(o,d)} is a unit (o,d) flow for every (o,d)\in V\times V.

We note that (9) is a generalization of (1) to the case when Λ\Lambda is random.

In the more realistic case that 𝒬\mathcal{Q} is not known, the optimization problem (9) can be approximated from historical data. We study a situation where the network operator has demand data Λ1,Λ2,,ΛNi.i.d𝒬\Lambda_{1},\Lambda_{2},...,\Lambda_{N}\overset{\text{i.i.d}}{\sim}\mathcal{Q} collected from operations of previous days. Using this data it can solve the following empirical approximation to (9):

minimize𝑥\displaystyle\underset{x}{\text{minimize }} 1Nk=1NF(x,Λk)\displaystyle\frac{1}{N}\sum_{k=1}^{N}F(x,\Lambda_{k}) (10)
subject to x={x(o,d)}(o,d)V×V,\displaystyle x=\left\{x^{(o,d)}\right\}_{(o,d)\in V\times V},
x^{(o,d)} is a unit (o,d) flow for every (o,d)\in V\times V.

The optimization problem in (10) uses historical data to estimate (9). In line with Assumption 1, we assume that Λ_k(o,d) ≤ λ_max for all values of k, o, and d.
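As an illustration of why the sample average in (10) is a reasonable surrogate for (9), the sketch below uses a scalar stand-in for F and a Uniform demand distribution (both assumptions made purely for illustration) and shows the empirical average approaching the true expectation as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar stand-in for F(x, Lambda): one edge with latency f(y) = y and demand
# Lambda, so F(x, Lambda) = (Lambda * x)^2 for a scalar routing variable x.
def F(x, lam):
    return (lam * x) ** 2

x = 0.8
# With Lambda ~ Uniform[0, 2], E[Lambda^2] = 4/3, so E[F(x, Lambda)] = x^2 * 4/3.
exact = x ** 2 * 4.0 / 3.0
for N in [10, 1000, 100000]:
    samples = rng.uniform(0.0, 2.0, size=N)      # i.i.d. demand draws from Q
    saa = np.mean([F(x, lam) for lam in samples])  # the objective in (10) at x
    print(N, abs(saa - exact))
```

The printed gaps shrink (up to sampling noise) as N grows, which is the sense in which (10) approximates (9).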

In Section 4 we show how (10) can be solved in a request-level differentially private way.

3.3 Assumptions on Travel Time functions

In this section we will introduce some assumptions that will help us establish our technical results. We will make the following assumptions on the network demand:

Assumption 1 (Bounded Demand).

We assume there exists a non-negative constant λmax\lambda_{\max} so that Λ(o,d)λmax\Lambda(o,d)\leq\lambda_{\max} for every (o,d)V×V(o,d)\in V\times V. In practice, this constant can be estimated from historical data.

The following are assumptions we make on the objective function FF (see (7)). These assumptions are related to properties of the link latency functions {fe}eE\left\{f_{e}\right\}_{e\in E}. We present a typical model for link latency functions in Section 3.4 that satisfies all of the following assumptions.

Assumption 2 (Bounded Variance Gradients).

We assume there exists a non-negative constant K so that for every x, the second moment of the gradient ∇F(x,Λ) is upper bounded by K², i.e., 𝔼_{Λ∼𝒬}[||∇F(x,Λ)||₂²] ≤ K².

Assumption 3 (Twice Differentiability).

We assume that F(·,Λ) is twice differentiable for every Λ, so that the Hessian H(x,Λ) := ∂²F(x,Λ)/∂x² is defined on the entire domain of x.

Assumption 4 (Strong Convexity).

We assume that there exists α>0\alpha>0 for which F(,Λ)F(\cdot,\Lambda) is α\alpha-strongly convex for every Λ\Lambda. This means that for every Λ\Lambda, and any unit network flows x,xx,x^{\prime} we have

F(x,Λ)F(x,Λ)+xF(x,Λ)(xx)+α2xx22\displaystyle F(x^{\prime},\Lambda)\geq F(x,\Lambda)+\nabla_{x}F(x,\Lambda)^{\top}(x^{\prime}-x)+\frac{\alpha}{2}\left|\left|x^{\prime}-x\right|\right|_{2}^{2}
Assumption 5 (Smoothness).

We assume that there exists β>α\beta>\alpha for which F(,Λ)F(\cdot,\Lambda) is β\beta-smooth for every Λ\Lambda. This means that for every Λ\Lambda, and any unit network flows x,xx,x^{\prime} we have

F(x,Λ)F(x,Λ)+xF(x,Λ)(xx)+β2xx22.\displaystyle F(x^{\prime},\Lambda)\leq F(x,\Lambda)+\nabla_{x}F(x,\Lambda)^{\top}(x^{\prime}-x)+\frac{\beta}{2}\left|\left|x^{\prime}-x\right|\right|_{2}^{2}.
Assumption 6 (Bounded second order partial derivative).

We assume that there exists C>0C>0 so that 𝒟Λ(o,d)[xF](x,Λ)opC\left|\left|\mathcal{D}_{\Lambda(o,d)}\left[\nabla_{x}F\right](x,\Lambda)\right|\right|_{op}\leq C for all x,Λx,\Lambda and (o,d)V×V(o,d)\in V\times V.

Remark 6 (Satisfying Assumption 6).

We can satisfy Assumption 6 for any positive C by re-scaling. In particular, letting θ := max_{x′,Λ′,(o,d)} ||𝒟_{Λ(o,d)}[∇_x F](x′,Λ′)||₂, for any C we can define the re-scaled objective function

F~(x,Λ):=CθF(x,Λ).\displaystyle\widetilde{F}(x,\Lambda):=\frac{C}{\theta}F(x,\Lambda).

By construction, F̃ satisfies Assumption 6 with constant value C. Note, however, that the smoothness and strong convexity parameters of F̃ are rescaled accordingly to Cβ/θ and Cα/θ, respectively.

3.4 Transportation Model satisfying assumptions from Section 3.3

In this section, we present a transportation network model that satisfies Assumptions 3, 4, 5 and 6. We study a network where the link latency functions are all affine: for each e∈E, there are non-negative constants q_e, c_e so that f_e(y) = q_e y + c_e. Let Q∈ℝ^{m×m} be the diagonal matrix defined so that Q_{ee} = q_e, and let c∈ℝ^m be the concatenation of the zeroth-order coefficients of the link latency functions.

As mentioned in Remark 4, we will represent xx as a concatenation of {x(o,d)}(o,d)V×V\left\{x^{(o,d)}\right\}_{(o,d)\in V\times V}. As such, xn2mx\in\mathbb{R}^{n^{2}m}. Let {(oi,di)}i=1n2\left\{(o_{i},d_{i})\right\}_{i=1}^{n^{2}} be the order in which the unit (o,d)(o,d) flows are concatenated to produce xx so that

\displaystyle x=\begin{bmatrix}x^{(o_{1},d_{1})}\\ x^{(o_{2},d_{2})}\\ \vdots\\ x^{(o_{n^{2}},d_{n^{2}})}\end{bmatrix}.

The total flow on the links in the network when serving demand Λ\Lambda according to xx can then be written as:

y:=(o,d)V×VΛ(o,d)x(o,d)=BΛx\displaystyle y:=\sum_{(o,d)\in V\times V}\Lambda(o,d)x^{(o,d)}=B_{\Lambda}x

where B_Λ = vec(Λ) ⊗ I_m. Here vec(Λ) is an n²-dimensional row vector whose i-th entry is Λ(o_i, d_i), and ⊗ denotes the Kronecker product. Then, when Λ is routed according to x, the travel times on the links can be computed as

Qy+c=QBΛx+c,\displaystyle Qy+c=QB_{\Lambda}x+c,

which means that the total travel time can be written as

F^(x,Λ):=eEyefe(ye)\displaystyle\widehat{F}(x,\Lambda):=\sum_{e\in E}y_{e}f_{e}(y_{e}) =y(Qy+c)\displaystyle=y^{\top}\left(Qy+c\right)
=xBΛ(QBΛx+c)\displaystyle=x^{\top}B_{\Lambda}^{\top}\left(QB_{\Lambda}x+c\right)
=xBΛQBΛx+cBΛx.\displaystyle=x^{\top}B_{\Lambda}^{\top}QB_{\Lambda}x+c^{\top}B_{\Lambda}x. (11)
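The construction B_Λ = vec(Λ) ⊗ I_m and the quadratic form in (11) can be sanity-checked numerically; the small dimensions and random values in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n2, m = 4, 3                       # n^2 OD pairs and m edges (small illustrative sizes)
lam = rng.uniform(0.0, 1.0, n2)    # vec(Lambda)
x = rng.uniform(0.0, 1.0, n2 * m)  # concatenation of the x^{(o_i,d_i)}
q = rng.uniform(0.1, 1.0, m)
c = rng.uniform(0.0, 1.0, m)
Q = np.diag(q)

# B_Lambda = vec(Lambda) (x) I_m, so y = B_Lambda x sums the demand-weighted OD flows.
B = np.kron(lam, np.eye(m))        # shape (m, n2 * m)
y = B @ x
assert np.allclose(y, sum(lam[i] * x[i * m:(i + 1) * m] for i in range(n2)))

# Total travel time two ways: edgewise sum vs. the quadratic form in (11).
edgewise = float(np.sum(y * (q * y + c)))
quadratic = float(x @ B.T @ Q @ B @ x + c @ B @ x)
print(abs(edgewise - quadratic) < 1e-9)
```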

If we add a bit of 2\ell_{2} regularization, we obtain

F(x,Λ)\displaystyle F(x,\Lambda) =F^(x,Λ)+α2x22\displaystyle=\widehat{F}(x,\Lambda)+\frac{\alpha}{2}\left|\left|x\right|\right|_{2}^{2}
=xBΛQBΛx+cBΛx+α2x22\displaystyle=x^{\top}B_{\Lambda}^{\top}QB_{\Lambda}x+c^{\top}B_{\Lambda}x+\frac{\alpha}{2}\left|\left|x\right|\right|_{2}^{2}
=x(BΛQBΛ+α2I)x+cBΛx.\displaystyle=x^{\top}\left(B_{\Lambda}^{\top}QB_{\Lambda}+\frac{\alpha}{2}I\right)x+c^{\top}B_{\Lambda}x.

Recall that we use H(,Λ)H(\cdot,\Lambda) to denote the hessian of F(,Λ)F(\cdot,\Lambda) with respect to its first argument. We now make the following observations:

  • The Hessian of F(x,Λ)F(x,\Lambda) with respect to xx is defined for all xx and is equal to 2BΛQBΛ+αI2B_{\Lambda}^{\top}QB_{\Lambda}+\alpha I. Hence Assumption 3 is satisfied.

  • Since Q is a diagonal matrix with non-negative entries, Q ⪰ 0. This implies that B_{\Lambda}^{\top}QB_{\Lambda} ⪰ 0, and hence H(x,\Lambda) ⪰ \alpha I. This implies that F is \alpha-strongly convex, and hence Assumption 4 is satisfied.

  • Since H(x,\Lambda)=2B_{\Lambda}^{\top}QB_{\Lambda}+\alpha I, we have ||H(x,\Lambda)||_{op}\leq 2||Q||_{op}||B_{\Lambda}||_{op}^{2}+\alpha\leq 2||Q||_{op}||\text{vec}(\Lambda)||_{2}^{2}+\alpha. Defining \beta:=2n^{2}\left(\max_{e}q_{e}\right)\lambda_{\max}^{2}+\alpha, we see that ||H(x,\Lambda)||_{op}\leq\beta for all x\in\mathcal{X}, meaning that F is \beta-smooth, and thus Assumption 5 is satisfied.

  • By product rule,

    \displaystyle\mathcal{D}_{\Lambda(o_{i},d_{i})}\left[\nabla F\right](x,\Lambda)=\frac{\partial}{\partial\Lambda(o_{i},d_{i})}\left\{2\left(B_{\Lambda}^{\top}QB_{\Lambda}+\frac{\alpha}{2}I\right)x+B_{\Lambda}^{\top}c\right\}
    =\frac{\partial}{\partial\Lambda(o_{i},d_{i})}\left\{2B_{\Lambda}^{\top}Qy+\alpha x+B_{\Lambda}^{\top}c\right\}
    =2\left((e_{i}\otimes I)Qy+B_{\Lambda}^{\top}Qx^{(o_{i},d_{i})}\right)+e_{i}\otimes c

    where eie_{i} is the iith standard basis vector for n2\mathbb{R}^{n^{2}}. By triangle inequality we can then conclude that

    𝒟Λ(oi,di)[F](x,Λ)2\displaystyle\left|\left|\mathcal{D}_{\Lambda(o_{i},d_{i})}\left[\nabla F\right](x,\Lambda)\right|\right|_{2} 2Qy2+2BΛQx(o,d)2+c2\displaystyle\leq 2\left|\left|Qy\right|\right|_{2}+2\left|\left|B_{\Lambda}^{\top}Qx^{(o,d)}\right|\right|_{2}+\left|\left|c\right|\right|_{2}
    2Qopy2+2BΛopQopx(o,d)2+c2\displaystyle\leq 2\left|\left|Q\right|\right|_{op}\left|\left|y\right|\right|_{2}+2\left|\left|B_{\Lambda}\right|\right|_{op}\left|\left|Q\right|\right|_{op}\left|\left|x^{(o,d)}\right|\right|_{2}+\left|\left|c\right|\right|_{2}
    (a)2Qopy2+2nλmaxQopm+c2\displaystyle\overset{(a)}{\leq}2\left|\left|Q\right|\right|_{op}\left|\left|y\right|\right|_{2}+2n\lambda_{\max}\left|\left|Q\right|\right|_{op}\sqrt{m}+\left|\left|c\right|\right|_{2}
    (b)2λmaxQopn2m+2λmaxQopnm+c2.\displaystyle\overset{(b)}{\leq}2\lambda_{\max}\left|\left|Q\right|\right|_{op}n^{2}\sqrt{m}+2\lambda_{\max}\left|\left|Q\right|\right|_{op}n\sqrt{m}+\left|\left|c\right|\right|_{2}.

    Here (a) is due to the fact that x^{(o,d)} is a unit (o,d) flow, meaning ||x^{(o,d)}||_{\infty}\leq 1 and thus ||x^{(o,d)}||_{2}\leq\sqrt{m}, together with the fact that B_{\Lambda}=\text{vec}(\Lambda)\otimes I_{m} implies ||B_{\Lambda}||_{op}=||\text{vec}(\Lambda)||_{2}\leq n\lambda_{\max}. (b) is due to ||y||_{2}=||B_{\Lambda}x||_{2}\leq||B_{\Lambda}||_{op}||x||_{2}\leq n\lambda_{\max}\cdot n\sqrt{m}=n^{2}\sqrt{m}\lambda_{\max}, since x is the concatenation of n^{2} unit flows and thus ||x||_{2}\leq n\sqrt{m}.

    Hence Assumption 6 is satisfied with C=2λmaxQopmn(n+1)+c2C=2\lambda_{\max}\left|\left|Q\right|\right|_{op}\sqrt{m}n(n+1)+\left|\left|c\right|\right|_{2}.
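As a numerical check of strong convexity and smoothness for this model, the sketch below forms the Hessian 2B_Λ^⊤QB_Λ + αI for random illustrative data and verifies that its eigenvalues lie between α and a worst-case smoothness bound of the form 2n²(max_e q_e)λ_max² + α.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4                        # illustrative network sizes
n2 = n * n
lam_max = 2.0
lam = rng.uniform(0.0, lam_max, n2)  # vec(Lambda), entries bounded by lam_max
q = rng.uniform(0.1, 1.0, m)
alpha = 0.5

B = np.kron(lam, np.eye(m))
H = 2.0 * B.T @ np.diag(q) @ B + alpha * np.eye(n2 * m)  # Hessian of F(., Lambda)

eigs = np.linalg.eigvalsh(H)
beta = 2.0 * n2 * q.max() * lam_max ** 2 + alpha          # worst-case smoothness bound
print(eigs.min() >= alpha - 1e-9, eigs.max() <= beta + 1e-9)
```

The minimum eigenvalue is at least α (strong convexity, Assumption 4) and the maximum is at most β (smoothness, Assumption 5).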

4 Differentially Private Network Routing Optimization

Given the setup from Section 3, our objective is to design a request-level differentially private algorithm that returns a near optimal solution to (9). Since the true distribution 𝒬\mathcal{Q} of demand is unknown, we will design an algorithm for (10) and show that under the assumptions described in Section 3.2, the algorithm’s solution is also near optimal for (9).

Computing a near-optimal solution to (10) while maintaining differential privacy may seem like a daunting task, but it turns out that a single modification to a well-known optimization algorithm gives an accurate and differentially private solution.

We present a Private Projected Stochastic Gradient Descent algorithm, which is described in Algorithm 1. As the name suggests, Algorithm 1 is a modified version of stochastic gradient descent. The algorithm conducts a single pass over the historical data, using each data point to perform a noisy gradient step (see line 6). The key difference between Algorithm 1 and standard stochastic gradient descent is in line 11, where instead of returning the final gradient descent iterate, Algorithm 1 returns a noisy version of the last iterate. Algorithm 1 has the following privacy and performance guarantees.

Theorem 2 (Privacy Guarantee for Algorithm 1).

Algorithm 1 is (ϵ,δ)(\epsilon,\delta)-differentially private under request level adjacency defined in Definition 9.

Theorem 3 (Performance Guarantee for Algorithm 1).

If Λ1,,ΛNi.i.d𝒬\Lambda_{1},...,\Lambda_{N}\overset{\text{i.i.d}}{\sim}\mathcal{Q}, and xx^{*} is a solution to (9), then the output xalgx_{\text{alg}} of Algorithm 1 satisfies:

𝔼[xalgx2]\displaystyle\mathbb{E}\left[\left|\left|x_{\text{alg}}-x^{*}\right|\right|_{2}\right] 1NKCexp(β2π212α2)α+nmN(βexp(β2π212α2)αmin(1,2α)+CϵαT2ln(1.25δ))\displaystyle\leq\frac{1}{\sqrt{N}}\frac{KC\exp\left(\frac{\beta^{2}\pi^{2}}{12\alpha^{2}}\right)}{\alpha}+\frac{n\sqrt{m}}{N}\left(\frac{\beta\exp\left(\frac{\beta^{2}\pi^{2}}{12\alpha^{2}}\right)}{\alpha\min(1,2\alpha)}+\frac{C}{\epsilon\alpha T}\sqrt{2\ln\left(\frac{1.25}{\delta}\right)}\right)
O(1N)+O(1ϵNln1δ)\displaystyle\leq O\left(\frac{1}{\sqrt{N}}\right)+O\left(\frac{1}{\epsilon N}\sqrt{\ln\frac{1}{\delta}}\right)

In particular, xalgx_{\text{alg}} is (ϵ,δ)(\epsilon,\delta)-differentially private and converges to xx^{*} as NN\rightarrow\infty, meaning that privacy and asymptotic optimality are simultaneously achieved.

See Appendix B and Appendix C for proofs of Theorem 2 and Theorem 3 respectively.

1 Input: Historical demand data Λ1,,ΛN\Lambda_{1},...,\Lambda_{N}, Desired privacy level (ϵ,δ)(\epsilon,\delta);
2 Output: Unit network flow x𝒳x\in\mathcal{X};
3 Initialize x0𝒳x_{0}\in\mathcal{X} arbitrarily ;
4 for 1kN1\leq k\leq N do
5       ηk1min(1αk,min(1,2α)β)\eta_{k-1}\leftarrow\min\left(\frac{1}{\alpha k},\frac{\min(1,2\alpha)}{\beta}\right);
6       x_{k}\leftarrow\Pi_{\mathcal{X}}\left(x_{k-1}-\eta_{k-1}\nabla_{x}F(x_{k-1},\Lambda_{k})\right);
7      
8sCTmin(min(1,2α)β,1αN)s\leftarrow\frac{C}{T}\min\left(\frac{\min(1,2\alpha)}{\beta},\frac{1}{\alpha N}\right);
9 σ22s2ϵ2ln(1.25δ)\sigma^{2}\leftarrow 2\frac{s^{2}}{\epsilon^{2}}\ln\left(\frac{1.25}{\delta}\right);
10 Z𝒩(0,σ2I)Z\sim\mathcal{N}\left(0,\sigma^{2}I\right) ;
11 xalgΠ𝒳(xN+Z)x_{\text{alg}}\leftarrow\Pi_{\mathcal{X}}\left(x_{N}+Z\right);
12 Return xalgx_{\text{alg}};
Algorithm 1 Private Projected Stochastic Gradient Descent
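To make the listing concrete, a minimal Python sketch of Algorithm 1 is given below. The gradient oracle, projection operator, and problem constants (α, β, C, T) are supplied by the caller; the toy usage that follows uses a simple quadratic objective and a box projection, not the unit-flow polytope of the paper, purely to exercise the code.

```python
import numpy as np

def private_projected_sgd(data, grad_F, project, alpha, beta, C, T, eps, delta, x0, rng):
    """Sketch of Algorithm 1: a single pass of projected SGD over the demand
    samples, followed by Gaussian-mechanism noise on the final iterate."""
    N = len(data)
    x = x0
    for k, lam in enumerate(data, start=1):
        eta = min(1.0 / (alpha * k), min(1.0, 2.0 * alpha) / beta)  # stepsize (line 5)
        x = project(x - eta * grad_F(x, lam))                        # noise-free descent step
    # The sensitivity of the final iterate shrinks like 1/N, so the noise scale does too.
    s = (C / T) * min(min(1.0, 2.0 * alpha) / beta, 1.0 / (alpha * N))
    sigma = (s / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return project(x + rng.normal(0.0, sigma, size=np.shape(x)))

# Toy usage: F(x, lam) = 0.5 * ||x - lam||^2, projection onto the box [0, 1]^5.
rng = np.random.default_rng(3)
data = rng.uniform(0.2, 0.4, size=(200, 5))
out = private_projected_sgd(
    data,
    grad_F=lambda x, lam: x - lam,
    project=lambda v: np.clip(v, 0.0, 1.0),
    alpha=1.0, beta=1.0, C=1.0, T=60.0,
    eps=0.5, delta=0.1,
    x0=np.zeros(5), rng=rng)
print(out)  # close to the mean demand, roughly 0.3 in every coordinate
```

With this stepsize schedule and α = β = 1, the iterates track the running mean of the samples, and the added Gaussian noise has standard deviation on the order of 1/N.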

4.1 Discussion

Carefully adding noise to specific parts of existing algorithms is a principled approach for developing differentially private algorithms [DMNS06, FMTT18, FKT20]. The main challenge in such an approach is determining a) where and b) how much noise to add. Suppose the goal is to (approximately) compute a query function f(L)f(L) on a data set LL in a differentially private way. The latter question can be addressed by measuring the sensitivity of the desired query function.

Definition 12 (2\ell_{2} sensitivity).

Consider a function f:df:\mathcal{L}\rightarrow\mathbb{R}^{d} which maps data sets to real vectors. For a given adjacency relation Adj, the 2\ell_{2} sensitivity of ff, denoted sAdj(f)s_{\text{Adj}}(f), is the largest achievable difference in function value between adjacent data sets. Namely,

sAdj(f):=maxL1,L2Adj(L1,L2)=1f(L1)f(L2)2.\displaystyle s_{\text{Adj}}(f):=\max_{\begin{subarray}{c}L_{1},L_{2}\in\mathcal{L}\\ \text{Adj}(L_{1},L_{2})=1\end{subarray}}\left|\left|f(L_{1})-f(L_{2})\right|\right|_{2}.

Once the sensitivity of the query function is known, the required noise distribution can be determined using the Gaussian mechanism as described in Theorem 4.

Theorem 4 (From [DR14]).

Suppose f:𝒟pf:\mathcal{D}\rightarrow\mathbb{R}^{p} maps datasets to real vectors. If sAdjs_{\text{Adj}} is the 2\ell_{2} sensitivity of ff, then f^(D):=f(D)+Z\widehat{f}(D):=f(D)+Z where Z𝒩(0,2sAdj2ϵ2ln(1.25δ)Ip)Z\sim\mathcal{N}\left(0,2\frac{s_{\text{Adj}}^{2}}{\epsilon^{2}}\ln\left(\frac{1.25}{\delta}\right)I_{p}\right) is (ϵ,δ)(\epsilon,\delta)-differentially private with respect to the adjacency relation Adj.

Calculating the sensitivity of simple query functions (e.g., counting, voting, selecting the maximum value) is relatively easy, making the Gaussian mechanism straightforward to apply. However, for more complicated functions, noise calibration becomes more involved.
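The Gaussian mechanism of Theorem 4 is short to implement. The sketch below privately releases the mean of N bounded values, under the assumption that adjacent data sets differ by changing one value in [0,1], so the ℓ₂ sensitivity is 1/N; all numbers are illustrative.

```python
import numpy as np

def gaussian_mechanism(value, s_adj, eps, delta, rng):
    """Release f(D) = value with (eps, delta)-DP given its l2 sensitivity s_adj:
    add N(0, 2 * s_adj^2 * ln(1.25 / delta) / eps^2) noise to each coordinate."""
    sigma = (s_adj / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: the mean of N values in [0, 1] moves by at most 1/N when one value
# changes, so s_adj = 1/N under that adjacency relation.
rng = np.random.default_rng(4)
data = rng.uniform(0.0, 1.0, 1000)
private_mean = gaussian_mechanism(data.mean(), s_adj=1.0 / len(data),
                                  eps=0.5, delta=1e-3, rng=rng)
print(float(private_mean))
```

Note how the noise scale is proportional to the sensitivity: queries whose answers barely depend on any one record can be released almost exactly.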

Algorithm 1 is an application of the Gaussian mechanism. Moreover, Theorem 3 shows that the suboptimality of Algorithm 1 is O(1N)O\left(\frac{1}{\sqrt{N}}\right). The asymptotic optimality of Algorithm 1 comes from the fact that the 2\ell_{2} sensitivity of the final gradient descent iterate is actually converging to zero as NN approaches infinity. This fact enables us to add less noise as NN\rightarrow\infty. Indeed, the Gaussian noise added to the final gradient descent iterate in Algorithm 1 has standard deviation which is O(1N)O\left(\frac{1}{N}\right).

In the remainder of this section, we sketch some of the mathematical ideas behind the perhaps non-intuitive result that the sensitivity of gradient descent converges to zero as the number of data points increases. For simplicity of exposition we will a) use scalar notation in place of vector notation, and b) consider the simpler case of unconstrained gradient descent, which removes the need to perform projections. As a reminder, the full proof can be found in Appendix B.

Given a mobility data set L, we use x_t(L) to denote the t-th iterate of Algorithm 1 when using data set L. It suffices to show that |dx_N/dΛ_t| converges to zero for every t. This is because the sensitivity is obtained by integrating the derivative:

xN(L1)xN(L2)2=L1L2dxNdL𝑑L2.\displaystyle\left|\left|x_{N}(L_{1})-x_{N}(L_{2})\right|\right|_{2}=\left|\left|\int_{L_{1}}^{L_{2}}\frac{dx_{N}}{dL}dL\right|\right|_{2}.

Because L1,L2L_{1},L_{2} are adjacent, the distance between them is finite, and thus the above integral will converge to zero if its integrand converges to zero.

With this in mind, by chain rule, we can write

dxNdΛt\displaystyle\frac{dx_{N}}{d\Lambda_{t}} =dxtdΛtdxt+1dxtdxt+2dxt+1dxNdxN1.\displaystyle=\frac{dx_{t}}{d\Lambda_{t}}\frac{dx_{t+1}}{dx_{t}}\frac{dx_{t+2}}{dx_{t+1}}...\frac{dx_{N}}{dx_{N-1}}.

Next, by using properties of smooth and strongly convex functions, we show that |dxt+1dxt|1θ\left|\frac{dx_{t+1}}{dx_{t}}\right|\leq 1-\theta for some positive constant θ\theta. This result implies

\displaystyle\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq\left|\frac{dx_{t}}{d\Lambda_{t}}\right|(1-\theta)^{N-t}.

Next, recalling that x_t = x_{t−1} − η_t∇F(x_{t−1}, Λ_t), note that x_{t−1} is computed from the first t−1 gradient steps, which only depend on the first t−1 data points Λ_1, ..., Λ_{t−1}. Hence dx_{t−1}/dΛ_t = 0. From this we see that

dxtdΛt\displaystyle\frac{dx_{t}}{d\Lambda_{t}} =ddΛt(xt1ηtF(xt1,Λt))\displaystyle=\frac{d}{d\Lambda_{t}}\left(x_{t-1}-\eta_{t}\nabla F(x_{t-1},\Lambda_{t})\right)
=ηtddΛtF(xt1,Λt).\displaystyle=-\eta_{t}\frac{d}{d\Lambda_{t}}\nabla F(x_{t-1},\Lambda_{t}).

By using Assumption 6 we have the following inequality:

|dxtdΛt|=|ηtddΛtF(xt1,Λt)|ηtC.\displaystyle\left|\frac{dx_{t}}{d\Lambda_{t}}\right|=\left|\eta_{t}\frac{d}{d\Lambda_{t}}\nabla F(x_{t-1},\Lambda_{t})\right|\leq\eta_{t}C.

Putting everything together, we have

\displaystyle\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq C\eta_{t}(1-\theta)^{N-t}.

The choice of stepsize in Algorithm 1 ensures three things: a) ηt2\eta_{t}\leq 2 for every tt, b) ηt\eta_{t} is non-increasing, and c) ηt0\eta_{t}\rightarrow 0 as tt\rightarrow\infty. Given these facts, there are two cases to consider for |dxNdΛt|\left|\frac{dx_{N}}{d\Lambda_{t}}\right|.

  1. Case 1:

    t ≤ N/2. In this case, we can upper bound η_t ≤ 2 and (1−θ)^{N−t} ≤ (1−θ)^{N/2}. Hence |dx_N/dΛ_t| ≤ 2C(1−θ)^{N/2}. Since 1−θ < 1, this converges to zero as N → ∞.

  2. Case 2:

    tN/2t\geq N/2. In this case, we simply upper bound (1θ)Nt1(1-\theta)^{N-t}\leq 1 and ηtηN/2\eta_{t}\leq\eta_{N/2} which gives |dxNdΛt|CηN/2\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq C\eta_{N/2} which converges to zero as NN\rightarrow\infty.

Finally, this shows that

\displaystyle\max_{t}\left|\frac{dx_{N}}{d\Lambda_{t}}\right|\leq\max\left(C\eta_{N/2},2C(1-\theta)^{N/2}\right),

and since both terms in the maximum converge to zero, we have lim_{N→∞} max_t |dx_N/dΛ_t| = 0.
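The two-case argument above can be checked numerically: with a stepsize schedule of the form η_t = min(1/(αt), const), the bound C η_t (1−θ)^{N−t}, maximized over t, vanishes as N grows. The constants C, α, θ below are illustrative.

```python
import numpy as np

# Numerical check that max_t C * eta_t * (1 - theta)^(N - t) -> 0 as N grows,
# with the schedule eta_t = min(1/(alpha*t), 1) and illustrative constants.
C, alpha, theta = 1.0, 1.0, 0.1

def sensitivity_bound(N):
    t = np.arange(1, N + 1)
    eta = np.minimum(1.0 / (alpha * t), 1.0)
    return float(np.max(C * eta * (1.0 - theta) ** (N - t)))

for N in [10, 100, 1000, 10000]:
    print(N, sensitivity_bound(N))
```

Early data points are forgotten geometrically through the (1−θ)^{N−t} factor, while recent ones enter with a vanishing stepsize, so the bound behaves like max((1−θ)^{N/2}, 1/N).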

5 Experiments

Differentially private mechanisms add noise to provide a principled privacy guarantee for individual-level user data. One immediate question is the degree to which the added noise degrades the quality of the obtained solution. In the previous section, we addressed this question theoretically in Theorem 3 by showing that Algorithm 1 is both differentially private and asymptotically optimal. In this section, we present empirical studies of privacy-performance trade-offs by comparing our algorithm's performance to that of a non-private network routing approach. To this end, we simulate a transportation network to evaluate the performance of our algorithm and the non-private optimal solution to the network flow problem (10).

We describe the dataset used for the experiments in Section 5.1, and the algorithms used in the experiments in Section 5.2. In Section 5.3, we evaluate the practical performance of our algorithm for different values of N; in particular, we study the effect of the number of samples and quantify the loss in system performance incurred by introducing our privacy-preserving algorithm. Finally, in Section 5.4, we study the sensitivity of our algorithm to the simulation parameters: we examine the convergence of the routing policy with increasing data under different demand levels, edge latency functions, and magnitudes of the ℓ₂ regularization introduced to make the objective strongly convex.

5.1 Data Set

We use data for the Sioux Falls network, which is available in the Transportation Network Test Problem (TNTP) dataset [Tra16]. This network has 24 nodes, 76 edges, and 528 OD pairs (see Figure 1 for an overview of the network topology). The distribution of mean hourly demand across different OD pairs is shown in Figure 1. The mean OD demand is 682 vehicles/hour. Furthermore, for more context on the scale of the network, the travel time on edges ranges from 2 to 10 minutes. Trip data is generated each hour, i.e., T=60, from a Poisson distribution with a mean value that is given by the data. Our objective is to learn a routing policy for these trips that minimizes the total travel costs for all users. To model congestion, we use the link latency model described in Section 3.4. In particular, for each e∈E, we estimate the free flow latency c_e directly from the data set. The sensitivity of the latency function to the traffic volume on the link, denoted by q_e, is chosen such that the travel time on the link is doubled when the link flow equals the link capacity. In later experiments, we will change this factor to study the sensitivity of our algorithm.

5.2 Algorithms

Throughout Sections 5.3 and 5.4 we compare the performance of two different algorithms: a Baseline algorithm and Algorithm 1, which we describe below.

Baseline: This algorithm computes the minimum travel time solution to (10) for the Sioux Fall model described in Section 3.4. Recall that in the Section 3.4 model, (11) describes the travel time incurred when serving demand Λ\Lambda with a routing policy xx. Given a data set Λ1,,ΛN\Lambda_{1},...,\Lambda_{N}, it computes the solution to the following minimization problem: minx𝒳1Ni=1NxBΛiQBΛix+cBΛix\min_{x\in\mathcal{X}}\frac{1}{N}\sum_{i=1}^{N}x^{\top}B_{\Lambda_{i}}^{\top}QB_{\Lambda_{i}}x+c^{\top}B_{\Lambda_{i}}x.

Algorithm 1: The travel time function described in (11) is not strongly convex because BΛQBΛB_{\Lambda}^{\top}QB_{\Lambda} is rank deficient. In order to satisfy Assumption 4, we introduce an 2\ell_{2} regularization. Namely, given a data set Λ1,,ΛN\Lambda_{1},...,\Lambda_{N}, we apply Algorithm 1 to the following minimization problem: minx𝒳αx22+1Ni=1NxBΛiQBΛix+cBΛix\min_{x\in\mathcal{X}}\alpha\left|\left|x\right|\right|_{2}^{2}+\frac{1}{N}\sum_{i=1}^{N}x^{\top}B_{\Lambda_{i}}^{\top}QB_{\Lambda_{i}}x+c^{\top}B_{\Lambda_{i}}x.

For the Sioux Falls network, the parameters for implementing Algorithm 1 are set as follows. The smoothness parameter β is set to the largest eigenvalue of B_Λ^⊤QB_Λ, which is equal to 2.08×10⁷. We set C=β and the regularization parameter α=10⁴. It is easy to check that these values satisfy the assumptions described in Section 3.4. Note that several different values of α could have been used to convexify the latency function. However, our choice is governed by two factors. A small value of α ensures that the regularized objective is a good estimate of the true objective, which is desirable. However, smaller values of α lead to a larger condition number β/α, resulting in slower convergence and necessitating more data to achieve similar performance. Thus, the particular value α=10⁴ balances both factors for our problem instance. In Section 5.4, we present a sensitivity analysis with different values of α.

Figure 1: Description of the Sioux Falls dataset: (a) network topology, (b) travel time on edges, (c) mean demand between OD pairs.

5.3 Performance of Algorithm 1

In this subsection, we study the convergence of the routing policy with each step of the gradient descent performed by our algorithm. Since the impact of a routing policy is directly reflected in the total travel time, we plot the travel cost induced by the learned routing policy as a function of the iterates. Recall that the number of iterations for a data set with N data points is N according to Algorithm 1. In our experiments we evaluate the cost of a routing policy x as F(x, (1/N)∑_{i=1}^{N}Λ_i) instead of using the sample average (1/N)∑_{i=1}^{N}F(x, Λ_i). This approximation is done solely to improve the run-time of our experiment (by up to 50X) and introduces less than 10⁻⁴% error in the evaluation of the system costs. Similarly, the optimal solution is also computed by minimizing the objective function F(x, (1/N)∑_{i=1}^{N}Λ_i) instead of the term described in Equation 10, achieving a computational speed up of 30X. Evaluating the cost of this solution using the average-demand approximation results in errors of less than 10⁻⁷%. Thus, in these experiments, we define the optimal costs as the approximate costs obtained through this procedure. Further details justifying these approximations are presented in Appendix D.

For our first set of experiments, we compare the objective values obtained by Algorithm 1 and the Baseline as a function of sample size N. In Figure 2, we plot the ratio of Algorithm 1's cost to Baseline's cost over the course of iterations for different values of N. For this set of experiments, we set the privacy parameters to ϵ=0.1 and δ=0.1. Note that Baseline's cost is fixed for a given N and is computed offline to serve as a benchmark. For a given N, we only have N iterations since each data point is used only once in Algorithm 1 to maintain privacy. For all three experiments (N=10, N=25, N=50), the cost decreases monotonically with additional iterations. It is therefore not surprising that the final costs for the N=50 case are the lowest, as we expect the routing policy learned with 50 data points to be better than the routing policy learned from 10 data points. It is, however, very interesting that even with a random routing policy initialization, our algorithm finds solutions that are just around 2% away from the optimal policy. We suspect that this 2% gap is due to the fact that the two algorithms have slightly different objective functions: Algorithm 1 has an ℓ₂ regularizer in the objective, but Baseline does not. Our results therefore show that although convergence is guaranteed only in the limit N→∞, we can obtain practically useful solutions with a relatively small number of data points.

Figure 2: Convergence dynamics for different values of N.

In the next set of experiments, we study the effect of different privacy parameters on total travel time. To this end, we compare the costs of the pre-noise and post-noise solutions x_N and x_alg from Algorithm 1. We conduct this comparison for ϵ∈{0.01, 0.1, 0.5} and δ∈{0.1, 0.5}. Table 1 presents the percentage increase in total travel time due to the addition of privacy noise. The results indicate that the price of privacy, i.e., the increase in total travel time due to the introduction of differential privacy noise, is less than 7.8×10⁻²% in the worst case. In fact, for more commonly used privacy parameters of ϵ=0.1 and δ=0.1, the cost of privacy is even smaller. One reason for this low cost of privacy is the high demand in the traffic network. From Figure 1(c), we observe that every OD pair typically has a few hundred trips. With over 500 OD pairs, tens of thousands of vehicles in the network thus contribute trip information to every data point. With such a large number of vehicles, the noise required to protect the identity of one vehicle is not too high.

ϵ\epsilon δ\delta Cost (% increase)
0.01 0.1 7.83×1027.83\times 10^{-2}
0.01 0.5 3.97×1033.97\times 10^{-3}
0.1 0.1 9.06×1039.06\times 10^{-3}
0.1 0.5 5.96×1035.96\times 10^{-3}
0.5 0.1 2.44×1032.44\times 10^{-3}
0.5 0.5 2.05×1032.05\times 10^{-3}
Table 1: Change in routing costs due to incorporating privacy.

5.4 Sensitivity

Figure 3: Sensitivity of the performance of Algorithm 1: (a) regularization parameter α, (b) slope of latency function, (c) traffic demand.

We now study the sensitivity of our algorithm to input parameters. For these experiments, we fix the number of data points to N=50N=50, since prior results suggest that most of the cost benefits are obtained with 50 data points. First, we study the effect of the regularizer term by setting α{102,103,104}\alpha\in\{10^{2},10^{3},10^{4}\} and plot the convergence of the normalized costs in Figure 3(a). We see that for larger values of α\alpha, the costs decrease faster. This makes sense because larger α\alpha results in a lower condition number βα\frac{\beta}{\alpha}, which leads to faster convergence. We also note that the cost ratio does not go to 1 because Algorithm 1 is minimizing a regularized objective, while Baseline has no regularization.

In Figure 3(b), we compare three scenarios with varying slope q_e of the latency function. Recall that our previous experiments set the slope q_e on each edge such that the travel time on the link doubles when the traffic equals the link capacity. This setting corresponds to a sensitivity factor of 2. We consider two more cases, where the travel time at capacity flow is 1.5 times and 5 times the free flow latency. Note that changing the latency sensitivity factor changes the matrix Q. Thus, for each of these experiments, we recompute β and C, setting each to the largest eigenvalue of the appropriate B_Λ^⊤QB_Λ. The optimal cost also varies for all three cases and is recomputed. The value of α is fixed at 10⁴. We observe that when the latency function is steeper, i.e., the sensitivity factor is higher, the algorithm takes more iterations to reduce the costs, but eventually ends up with the lowest costs. This is because a larger β leads to a larger condition number β/α, which makes convergence slow. However, a larger β also means that the ℓ₂ regularizer is a smaller proportion of the total cost, so the objectives of Algorithm 1 and Baseline become more similar, which is what we believe causes the cost ratio to improve as the latency function becomes steeper.

Finally, we present the sensitivity of our algorithm to varying traffic demand in Figure 3(c). In these experiments, we compare the nominal setting, where the demand equals the mean demand (a scale factor of 1), against two alternatives: a low-demand case, where the mean traffic is 0.5 times the nominal traffic, and a high-demand case, where the mean traffic is 1.5 times the nominal traffic. Again, as the demand changes, the matrix B changes, and we recompute \beta and C as before. We observe that for the same value of \alpha, higher demand leads to faster convergence and lower costs. This is because higher demand increases the travel time, making the \ell_{2} regularizer a smaller proportion of Algorithm 1's objective; as the objective functions become more similar, the cost ratio moves closer to 1.

6 Conclusions

In this paper, we studied the problem of learning network routing policies from sensitive user data. In particular, we considered the setting of a transportation network, where we want to learn and share a routing policy that does not reveal too much information about the individual trips that contributed to learning it. We presented a new approach for learning privacy-preserving routing policies by solving a reformulated network flow problem with a differentially private variant of stochastic gradient descent. We proved that our algorithm is asymptotically optimal, meaning that the cost of the routing policy it produces converges to the optimal non-private cost as the number of data points goes to infinity. Finally, our simulations on the Sioux Falls road network suggest that for realistic travel demands, we can learn differentially private routing policies that incur only 2% suboptimality in terms of total travel time.

There are several interesting directions for future work. First, because differentially private algorithms cannot be overly sensitive to single data points, they are naturally robust and may be useful for tracking non-stationary demand distributions, as opposed to the stationary demand models studied here. Second, we studied request-level differential privacy, where the goal is to occlude the influence of a single trip on the algorithm's output; another practical and important notion is user-level differential privacy, where the goal is to occlude the influence of all trips belonging to the same person. User-level privacy is harder to achieve but is important in practice. Finally, generalizing the results to non-smooth objective functions would expand the class of models to which this technique can be applied.

References

  • [AHFI+18] Khalil Al-Hussaeni, Benjamin CM Fung, Farkhund Iqbal, Gaby G Dagher, and Eun G Park. Safepath: Differentially-private publishing of passenger trajectories in transportation systems. Computer Networks, 143:126–139, 2018.
  • [BS03] Alastair R Beresford and Frank Stajano. Location privacy in pervasive computing. IEEE Pervasive computing, 2(1):46–55, 2003.
  • [CML11] Chi-Yin Chow, Mohamed F Mokbel, and Xuan Liu. Spatial cloaking for anonymous location-based services in mobile peer-to-peer environments. GeoInformatica, 15(2):351–380, 2011.
  • [CTW+21] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 2633–2650. USENIX Association, 2021.
  • [DKBS15] Roy Dong, Walid Krichene, Alexandre M. Bayen, and S. Shankar Sastry. Differential privacy of populations in routing games. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 2798–2803, 2015.
  • [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, volume 3876, pages 265–284. Springer, 2006.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • [Dwo08] Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pages 1–19. Springer, 2008.
  • [FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 439–449. ACM, 2020.
  • [FMTT18] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 521–532. IEEE Computer Society, 2018.
  • [GFCA14] Moein Ghasemzadeh, Benjamin CM Fung, Rui Chen, and Anjali Awasthi. Anonymizing trajectory data for passenger flow analysis. Transportation research part C: emerging technologies, 39:63–79, 2014.
  • [GLTY18] Mehmet Emre Gursoy, Ling Liu, Stacey Truex, and Lei Yu. Differentially private and utility preserving publication of trajectory data. IEEE Transactions on Mobile Computing, 18(10):2315–2329, 2018.
  • [GMW87] Oded Goldreich, Silvio Micali, and Avi Wigderson. How to play any mental game or A completeness theorem for protocols with honest majority. In Alfred V. Aho, editor, STOC 1987, New York, New York, USA, pages 218–229. ACM, 1987.
  • [GZFS15] Yanmin Gong, Chi Zhang, Yuguang Fang, and Jinyuan Sun. Protecting location privacy for task allocation in ad hoc mobile cloud computing. IEEE Transactions on Emerging Topics in Computing, 6(1):110–121, 2015.
  • [HC20] Brian Yueshuai He and Joseph YJ Chow. Optimal privacy control for transport network data sharing. Transportation Research Part C: Emerging Technologies, 113, 2020.
  • [HTP17] Shuo Han, Ufuk Topcu, and George J. Pappas. Differentially private distributed constrained optimization. IEEE Transactions on Automatic Control, 62(1):50–64, 2017.
  • [LYH20] Yang Li, Dasen Yang, and Xianbiao Hu. A differential privacy-based privacy-preserving data publishing algorithm for transit smart card data. Transportation Research Part C: Emerging Technologies, 115:102634, 2020.
  • [Pan14] Vijay Pandurangan. Riding with the stars: Passenger privacy in the nyc taxicab dataset. Available Online, 2014.
  • [Swe02] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
  • [Tra16] Transportation networks for research. github.com/bstabler/TransportationNetworks, 2016. Accessed January 20, 2021.
  • [Tre11] Luca Trevisan. Lecture notes on optimization and algorithmic paradigms (lecture 11). Available online, 2011.
  • [WHL+18] Zhibo Wang, Jiahui Hu, Ruizhao Lv, Jian Wei, Qian Wang, Dejun Yang, and Hairong Qi. Personalized privacy-preserving task allocation for mobile crowdsensing. IEEE Transactions on Mobile Computing, 18(6):1330–1341, 2018.
  • [WLK+17] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey F. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1307–1322. ACM, 2017.
  • [YLL+19] Ke Yan, Guoming Lu, Guangchun Luo, Xu Zheng, Ling Tian, and Akshita Maradapu Vera Venkata Sai. Location privacy-aware task bidding and assignment for mobile crowd-sensing. IEEE Access, 7:131929–131943, 2019.

Appendix A Notation

The terms and concepts defined and used throughout the paper are all described here for the sake of convenience.

  • G: Graph representation of the network. (Definition 1)
  • V: The vertex set of the network G. (Definition 1)
  • E: The edge set of the network G. (Definition 1)
  • n: The number of vertices in G, i.e., n=\left|V\right|. (Definition 1)
  • m: The number of edges in G, i.e., m=\left|E\right|. (Definition 1)
  • \mathcal{T}: Operation period \mathcal{T}=\left[t_{\text{start}},t_{\text{end}}\right], the time of day over which the network operator optimizes its decisions. (Definition 2)
  • T: Number of minutes in the operation period. (Definition 2)
  • \Lambda: Demand matrix. \Lambda(o,d) specifies the rate at which requests from o\in V to d\in V appear in the network. (Definition 3)
  • \mathcal{Q}: The distribution of \Lambda. (Defined in Section 3.2)
  • L: Dataset L:=(\Lambda_{1},...,\Lambda_{N}) containing N days of historical request data for the operation period \mathcal{T}. (Defined in Section 3.2)
  • x^{(o,d)}: Unit (o,d) flow. A flow that routes exactly 1 unit of flow from o to d through the network. (Definition 10)
  • x: Unit network flow. Specifies a unit (o,d) flow for all pairs of vertices (o,d) in the network. (Definition 11)
  • \mathcal{X}: The set of all unit network flows. It is also the feasible set for optimization problems (8), (9), and (10). (Definition 11)
  • x_{k}: A unit network flow and the kth gradient descent iterate from Algorithm 1. x_{k}(L) is used to show the explicit dependence of x_{k} on the historical training data L. (Algorithm 1)
  • f_{e}: Link latency function for the edge e\in E. If x is the total flow on e, then f_{e}(x) is the average travel time through the edge. (Definition 4)
  • F: Total travel time function. F(x,\Lambda) is the total travel time that requests in \Lambda experience when they are routed according to x. (Defined in (7))
  • (\epsilon,\delta): Differential privacy parameters. As \epsilon and \delta get smaller, the privacy guarantees offered by differential privacy get stronger. (Definition 8)
  • K: Upper bound on the standard deviation of the gradient of F. (Assumption 2)
  • H: Hessian of F with respect to x. Concretely, H(x,\Lambda) is the Hessian of F with respect to x evaluated at (x,\Lambda).
  • \alpha: Strong convexity parameter. (Assumption 4)
  • \beta: Smoothness parameter. (Assumption 5)
  • C: Upper bound on \left|\left|\mathcal{D}_{\Lambda}\left[\nabla F\right](x,\Lambda)\right|\right|_{2}. (Assumption 6)

Appendix B Privacy Analysis (Proof of Theorem  2)

Recall the privacy guarantee for the Gaussian Mechanism from Theorem 4. To show that Algorithm 1 is (\epsilon,\delta)-differentially private, it suffices to show the following lemma:

Lemma 1.

For every 1\leq t\leq N, any value of L:=(\Lambda_{1},...,\Lambda_{N}), and any (o,d)\in V\times V, we have \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq\min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right).

Lemma 1 is sufficient to prove privacy, because it implies that the \ell_{2} sensitivity of x_{N} is at most \min\left(\frac{C\min(1,2\alpha)}{\beta T},\frac{C}{\alpha TN}\right). To see why this is true, let L_{1}=(\Lambda_{1},...,\Lambda^{\prime}_{t},...,\Lambda_{N}) and L_{2}=(\Lambda_{1},...,\Lambda^{\prime\prime}_{t},...,\Lambda_{N}) be request-level-adjacent data sets that differ only on the tth day. Furthermore, let (o^{\prime},d^{\prime}) be the request on which \Lambda_{t}^{\prime} and \Lambda_{t}^{\prime\prime} differ, and let x_{N}(L_{1}) and x_{N}(L_{2}) be the final iterates of gradient descent obtained using the data sets L_{1} and L_{2}, respectively. By the fundamental theorem of calculus,

x_{N}(L_{2})-x_{N}(L_{1}) = \int_{L_{1}}^{L_{2}}\mathcal{D}_{\Lambda_{t}}[x_{N}](L)\;dL
= \int_{\Lambda^{\prime}_{t}}^{\Lambda^{\prime\prime}_{t}}\mathcal{D}_{\Lambda_{t}}[x_{N}](\Lambda_{1},...,\Lambda_{t-1},\Lambda,\Lambda_{t+1},...,\Lambda_{N})\;d\Lambda
\implies \left|\left|x_{N}(L_{2})-x_{N}(L_{1})\right|\right|_{2} = \left|\left|\int_{\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})}^{\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})}\mathcal{D}_{\Lambda_{t}(o^{\prime},d^{\prime})}[x_{N}](\Lambda_{1},...,\Lambda_{t-1},\Lambda,\Lambda_{t+1},...,\Lambda_{N})\;d\Lambda(o^{\prime},d^{\prime})\right|\right|_{2}
\leq \int_{\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})}^{\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})}\left|\left|\mathcal{D}_{\Lambda_{t}(o^{\prime},d^{\prime})}[x_{N}](\Lambda_{1},...,\Lambda_{t-1},\Lambda,\Lambda_{t+1},...,\Lambda_{N})\right|\right|_{2}\;d\Lambda(o^{\prime},d^{\prime})
\overset{(a)}{\leq} \int_{\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})}^{\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})}\min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right)d\Lambda(o^{\prime},d^{\prime})
\leq \min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right)\left|\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})-\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})\right|
\overset{(b)}{\leq} \min\left(\frac{C\min(1,2\alpha)}{\beta T},\frac{C}{\alpha TN}\right).

Here, (a) is due to Lemma 1, and (b) holds because request-level adjacency of L_{1} and L_{2} implies \left|\Lambda^{\prime\prime}_{t}(o^{\prime},d^{\prime})-\Lambda^{\prime}_{t}(o^{\prime},d^{\prime})\right|\leq T^{-1}.

Thus, to establish (\epsilon,\delta)-differential privacy of Algorithm 1, all that remains is to prove Lemma 1.
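To make the role of the sensitivity bound concrete, the sketch below shows how the Gaussian mechanism (Theorem 4) would be calibrated to the \ell_{2} sensitivity derived above. The parameter values are toy values, and the \sqrt{2\ln(1.25/\delta)}/\epsilon scaling is the standard Gaussian mechanism calibration (cf. [DR14]), assumed here for illustration:

```python
import math

# Sensitivity bound from the argument above: min(C*min(1,2a)/(b*T), C/(a*T*N)).
def sensitivity_bound(C, alpha, beta, T, N):
    return min(C * min(1.0, 2 * alpha) / (beta * T), C / (alpha * T * N))

# Standard (eps, delta)-DP Gaussian mechanism calibration (assumed, cf. [DR14]):
# release x_N + Z with Z ~ N(0, sigma^2 I), sigma proportional to the sensitivity.
def gaussian_mechanism_sigma(l2_sensitivity, eps, delta):
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

sens = sensitivity_bound(C=1.0, alpha=0.5, beta=10.0, T=60, N=50)
sigma = gaussian_mechanism_sigma(sens, eps=1.0, delta=1e-5)
# More training days (larger N) lower the sensitivity, so less noise is needed.
assert sensitivity_bound(1.0, 0.5, 10.0, 60, 100) < sens
```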

Proof of Lemma 1.

Define L:=(\Lambda_{1},...,\Lambda_{N}). By Assumptions 3, 4, and 5, the functions F(\cdot,\Lambda_{1}),...,F(\cdot,\Lambda_{N}) are all twice differentiable, \alpha-strongly convex, and \beta-smooth. For each t\leq N and any (o,d)\in V\times V, the chain rule gives

\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)=\left(\prod_{k=t+1}^{N-1}\mathcal{D}_{x_{k}}[x_{k+1}](L)\right)\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L).

Since x_{k+1}=x_{k}-\eta_{k}\nabla_{x}F(x_{k},\Lambda_{k}), differentiating both sides with respect to x_{k} gives

\mathcal{D}_{x_{k}}[x_{k+1}](L) = I-\eta_{k}H(x_{k},\Lambda_{k}),

where H(\cdot,\Lambda_{k}) is the Hessian of F(\cdot,\Lambda_{k}). Since F(\cdot,\Lambda_{k}) is \alpha-strongly convex, we know H(x_{k},\Lambda_{k})\succeq\alpha I. By \beta-smoothness of F(\cdot,\Lambda_{k}), we also know that H(x_{k},\Lambda_{k})\preceq\beta I. Therefore, if \eta_{k}\leq\frac{1}{\beta}, we see that \left|\left|\mathcal{D}_{x_{k}}[x_{k+1}](L)\right|\right|_{op}\leq 1-\eta_{k}\alpha. From this we can conclude that

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\prod_{k=t+1}^{N-1}\left|\left|\mathcal{D}_{x_{k}}[x_{k+1}](L)\right|\right|_{op}
\leq \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\prod_{k=t+1}^{N-1}(1-\eta_{k}\alpha)
\leq \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\prod_{k=t+1}^{N-1}\exp\left(-\eta_{k}\alpha\right)
= \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right). (12)
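The per-step contraction factor 1-\eta_{k}\alpha used above can be checked numerically on toy matrices whose spectra lie in [\alpha,\beta] (all dimensions and parameter values below are illustrative):

```python
import numpy as np

# If alpha*I <= H <= beta*I and eta <= 1/beta, then every eigenvalue of
# I - eta*H lies in [0, 1 - eta*alpha], so ||I - eta*H||_op <= 1 - eta*alpha.
rng = np.random.default_rng(0)
alpha, beta = 0.5, 10.0
eta = 1.0 / beta

for _ in range(100):
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal basis
    eigs = rng.uniform(alpha, beta, size=5)            # spectrum inside [alpha, beta]
    H = Q @ np.diag(eigs) @ Q.T
    op_norm = np.linalg.norm(np.eye(5) - eta * H, ord=2)
    assert op_norm <= 1 - eta * alpha + 1e-9
```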

Finally, note that x_{t} depends only on x_{0} and \Lambda_{1},...,\Lambda_{t-1}; in particular, it does not depend on \Lambda_{t}. Therefore, differentiating both sides of x_{t+1}=x_{t}-\eta_{t}\nabla_{x}F(x_{t},\Lambda_{t}) with respect to \Lambda_{t}(o,d) gives

\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L) = -\eta_{t}\mathcal{D}_{\Lambda_{t}(o,d)}\left[\nabla_{x}F\right](x_{t},\Lambda_{t})
\implies \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{t+1}](L)\right|\right|_{op} = \eta_{t}\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}\left[\nabla_{x}F\right](x_{t},\Lambda_{t})\right|\right|_{op}\overset{(a)}{\leq}C\eta_{t}, (13)

where (a) is due to Assumption 6. Combining inequalities (12) and (13) gives

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq C\eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right). (14)

Letting t_{0}:=\max\left(\frac{\beta}{\alpha\min(1,2\alpha)}-1,0\right), note that

\eta_{t}=\begin{cases}\frac{1}{\alpha(t+1)}&\text{if }t\geq t_{0},\\ \frac{\min(1,2\alpha)}{\beta}&\text{otherwise.}\end{cases}

If t\geq t_{0}, then \eta_{k}=\frac{1}{\alpha(k+1)} for all k\geq t, and we see that

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq C\eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right)
= \frac{C}{\alpha(t+1)}\exp\left(-\alpha\sum_{k=t+1}^{N-1}\frac{1}{\alpha(k+1)}\right)
\leq \frac{C}{\alpha(t+1)}\exp\left(-\alpha\int_{t+1}^{N}\frac{1}{\alpha y}dy\right)
= \frac{C}{\alpha(t+1)}\exp\left(-\ln N+\ln(t+1)\right)
= \frac{C}{\alpha(t+1)}\cdot\frac{t+1}{N}=\frac{C}{\alpha N}.

On the other hand, if t<t_{0}, then \eta_{t}=\frac{\min(1,2\alpha)}{\beta} and we see that

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op} \leq C\eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\eta_{k}\alpha\right)
= \frac{C\min(1,2\alpha)}{\beta}\exp\left(-\alpha\sum_{k=t+1}^{N-1}\eta_{k}\right)
\leq \frac{C\min(1,2\alpha)}{\beta}\exp\left(-\alpha\sum_{k=\left\lfloor t_{0}\right\rfloor+1}^{N-1}\frac{1}{\alpha(k+1)}\right)
\leq \frac{C\min(1,2\alpha)}{\beta}\exp\left(-\alpha\int_{t_{0}+1}^{N}\frac{1}{\alpha y}dy\right)
\leq \frac{C\min(1,2\alpha)}{\beta}\cdot\frac{t_{0}+1}{N}
= \frac{C\min(1,2\alpha)}{\beta}\cdot\frac{\beta}{\alpha\min(1,2\alpha)N}=\frac{C}{\alpha N}.

Thus, in either case, we have \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq\frac{C}{\alpha N}.

Finally, note that (14) implies \left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq C\eta_{t}\leq\frac{C\min(1,2\alpha)}{\beta}. Combining these two bounds gives the desired result:

\left|\left|\mathcal{D}_{\Lambda_{t}(o,d)}[x_{N}](L)\right|\right|_{op}\leq\min\left(\frac{C\min(1,2\alpha)}{\beta},\frac{C}{\alpha N}\right).
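The step-size argument in the two cases above can be verified numerically. The sketch below checks the product form of the bound, C\eta_{t}\prod_{k=t+1}^{N-1}(1-\eta_{k}\alpha)\leq\frac{C}{\alpha N}, for toy values of \alpha, \beta, C, and N:

```python
import math

# Step-size schedule from Algorithm 1: eta_t = min(1/(alpha*(t+1)), min(1,2*alpha)/beta).
alpha, beta, C, N = 0.5, 10.0, 1.0, 200
eta = [min(1.0 / (alpha * (t + 1)), min(1.0, 2 * alpha) / beta) for t in range(N)]

# The influence of day t's data on x_N is at most C*eta_t times the product of
# the per-step contraction factors (1 - eta_k*alpha); it never exceeds C/(alpha*N).
for t in range(N):
    decay = math.prod(1.0 - eta[k] * alpha for k in range(t + 1, N))
    assert C * eta[t] * decay <= C / (alpha * N) + 1e-12
```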

Appendix C Performance Analysis (Proof of Theorem 3)

Let x^{*} be a solution to (9). Note that

\left|\left|x_{k+1}-x^{*}\right|\right|_{2}^{2} = \left|\left|\Pi_{\mathcal{X}}\left(x_{k}-\eta_{k}\nabla F(x_{k},\Lambda_{k})\right)-x^{*}\right|\right|_{2}^{2}
\overset{(a)}{=} \left|\left|\Pi_{\mathcal{X}}\left(x_{k}-\eta_{k}\nabla F(x_{k},\Lambda_{k})\right)-\Pi_{\mathcal{X}}\left(x^{*}-\eta_{k}\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)\right|\right|_{2}^{2}
\overset{(b)}{\leq} \left|\left|x_{k}-\eta_{k}\nabla F(x_{k},\Lambda_{k})-\left(x^{*}-\eta_{k}\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)\right|\right|_{2}^{2}
= \left|\left|x_{k}-x^{*}-\eta_{k}\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)\right|\right|_{2}^{2}
= \left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\eta_{k}^{2}\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}-2\eta_{k}\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*}),

where (a) holds because first-order optimality makes x^{*} a fixed point of the projected gradient map, and (b) holds because Euclidean projection onto the convex set \mathcal{X} is nonexpansive.

Taking the expectation of both sides conditioned on x_{k}, we see that

\mathbb{E}\left[\left.\left|\left|x_{k+1}-x^{*}\right|\right|_{2}^{2}\right|x_{k}\right] \leq \mathbb{E}\left[\left.\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\underbrace{\eta_{k}^{2}\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}}_{\text{Term 2}}\underbrace{-2\eta_{k}\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})}_{\text{Term 3}}\right|x_{k}\right].

To show that x_{k+1} is closer to x^{*} than x_{k} is, we will provide bounds for both Term 2 and Term 3.
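Step (b) in the expansion above relies on the fact that Euclidean projection onto a convex set is nonexpansive. A quick numerical sanity check, using the box [0,1]^n as a stand-in for the (also convex) flow polytope \mathcal{X}:

```python
import numpy as np

rng = np.random.default_rng(1)
proj = lambda v: np.clip(v, 0.0, 1.0)   # Euclidean projection onto the box [0,1]^n

# Nonexpansiveness: ||proj(x) - proj(y)||_2 <= ||x - y||_2 for all x, y.
for _ in range(1000):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    assert np.linalg.norm(proj(x) - proj(y)) <= np.linalg.norm(x - y) + 1e-12
```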

C.1 Bounding Term 2

To upper bound Term 2, we add and subtract \nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]:

\eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}\right|x_{k}\right]
= \eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|\underbrace{\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]}_{A}+\underbrace{\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]}_{B}\right|\right|_{2}^{2}\right|x_{k}\right]
= \eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|A\right|\right|_{2}^{2}+\left|\left|B\right|\right|_{2}^{2}+2A^{\top}B\right|x_{k}\right]. (15)

By Assumption 6 and the dominated convergence theorem, we have \nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x_{k},\Lambda)\right]=\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\nabla F(x_{k},\Lambda)\right]. Hence we see that

\mathbb{E}\left[\left.A\right|x_{k}\right]=\mathbb{E}_{\Lambda_{k}\sim\mathcal{Q}}\left[\left.\nabla F(x_{k},\Lambda_{k})\right|x_{k}\right]-\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\left.\nabla F(x_{k},\Lambda)\right|x_{k}\right]=0.

From this observation we have the following three remarks:

  1. Since A is a zero-mean random vector, \mathbb{E}\left[\left.\left|\left|A\right|\right|_{2}^{2}\right|x_{k}\right] is the variance of \nabla F(x_{k},\Lambda) given x_{k}. By Assumption 2, \mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\left|\left|\nabla F(x,\Lambda)\right|\right|_{2}^{2}\right]\leq K^{2} for any x, which implies that \mathbb{E}\left[\left.\left|\left|A\right|\right|_{2}^{2}\right|x_{k}\right]\leq K^{2}.

  2. By Assumption 5, F(\cdot,\Lambda) is \beta-smooth for every \Lambda, which implies that \nabla F(\cdot,\Lambda) is \beta-Lipschitz. Thus we have

    \mathbb{E}\left[\left.\left|\left|B\right|\right|_{2}^{2}\right|x_{k}\right] = \left|\left|\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\nabla F(x_{k},\Lambda)-\nabla F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}
    \overset{(a)}{\leq} \left(\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\left|\left|\nabla F(x_{k},\Lambda)-\nabla F(x^{*},\Lambda)\right|\right|_{2}\right]\right)^{2}
    \overset{(b)}{\leq} \left(\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\beta\left|\left|x_{k}-x^{*}\right|\right|_{2}\right]\right)^{2}
    = \beta^{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2},

    where (a) is due to Jensen's inequality and (b) is due to \nabla F(\cdot,\Lambda) being \beta-Lipschitz.

  3. Conditioned on x_{k}, A is zero mean and B is constant, so A^{\top}B is a zero-mean random variable. Therefore \mathbb{E}\left[\left.A^{\top}B\right|x_{k}\right]=0.

Applying these three remarks to inequality (15), we see that

\eta_{k}^{2}\mathbb{E}\left[\left.\left|\left|\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right|\right|_{2}^{2}\right|x_{k}\right] \leq \eta_{k}^{2}\left(K^{2}+\beta^{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}\right). (16)
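The decomposition behind (16), a zero-mean deviation A plus a fixed term B with a vanishing cross term, can be checked with a quick Monte Carlo sketch (the distribution of A below is a toy stand-in, assumed for illustration):

```python
import numpy as np

# If E[A] = 0 and B is fixed, then E||A + B||^2 = E||A||^2 + ||B||^2,
# because the cross term 2*E[A^T B] vanishes.
rng = np.random.default_rng(2)
B = np.array([1.0, -2.0, 0.5])               # plays the role of the fixed term B
A = rng.standard_normal((200000, 3))         # zero-mean samples playing the role of A
lhs = np.mean(np.sum((A + B) ** 2, axis=1))
rhs = np.mean(np.sum(A ** 2, axis=1)) + np.sum(B ** 2)
assert abs(lhs - rhs) < 0.1                  # equal up to Monte Carlo error
```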

C.2 Bounding Term 3

By linearity of expectation, Term 3 is equal to

-2\eta_{k}\mathbb{E}\left[\left.\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})\right|x_{k}\right]
= -2\eta_{k}\left(\mathbb{E}\left[\left.\nabla F(x_{k},\Lambda_{k})\right|x_{k}\right]-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})
= -2\eta_{k}\left(\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[\nabla F(x_{k},\Lambda)-\nabla F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*}).

Define f(x):=\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x,\Lambda)\right]. With this notation, the expression above satisfies

-2\eta_{k}\mathbb{E}\left[\left.\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})\right|x_{k}\right] \leq -2\eta_{k}\left(\nabla f(x_{k})-\nabla f(x^{*})\right)^{\top}(x_{k}-x^{*}).

Since F(\cdot,\Lambda) is \alpha-strongly convex for every \Lambda, we can conclude that f is also \alpha-strongly convex. To upper bound Term 3, we use \alpha-strong convexity of f to conclude that

f(x_{k}) \geq f(x^{*})+\nabla f(x^{*})^{\top}(x_{k}-x^{*})+\frac{\alpha}{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}
and
f(x^{*}) \geq f(x_{k})+\nabla f(x_{k})^{\top}(x^{*}-x_{k})+\frac{\alpha}{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}.

Adding these inequalities together gives

f(x_{k})+f(x^{*}) \geq f(x_{k})+f(x^{*})-\left(\nabla f(x_{k})-\nabla f(x^{*})\right)^{\top}(x_{k}-x^{*})+\alpha\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}
\implies -\alpha\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2} \geq -\left(\nabla f(x_{k})-\nabla f(x^{*})\right)^{\top}(x_{k}-x^{*}).

This inequality implies the following bound on Term 3:

-2\eta_{k}\mathbb{E}\left[\left.\left(\nabla F(x_{k},\Lambda_{k})-\nabla\mathbb{E}_{\Lambda\sim\mathcal{Q}}\left[F(x^{*},\Lambda)\right]\right)^{\top}(x_{k}-x^{*})\right|x_{k}\right] \leq -2\alpha\eta_{k}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}. (17)
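The strong-monotonicity inequality behind (17) can be sanity-checked on an \alpha-strongly convex quadratic (the matrix H below is a toy stand-in, assumed for illustration):

```python
import numpy as np

# For f(x) = 0.5 * x^T H x with H >= alpha*I,
# (grad f(x) - grad f(y))^T (x - y) >= alpha * ||x - y||^2.
rng = np.random.default_rng(3)
alpha = 0.5
M = rng.standard_normal((6, 6))
H = M @ M.T + alpha * np.eye(6)     # PSD matrix plus alpha*I, hence H >= alpha*I
grad = lambda v: H @ v              # gradient of the quadratic

for _ in range(1000):
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    lhs = (grad(x) - grad(y)) @ (x - y)
    assert lhs >= alpha * np.linalg.norm(x - y) ** 2 - 1e-9
```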

C.3 Putting everything together

Applying the bounds (16) and (17) to Term 2 and Term 3, respectively, we see that

\mathbb{E}\left[\left.\left|\left|x_{k+1}-x^{*}\right|\right|_{2}^{2}\right|x_{k}\right] \leq \left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\eta_{k}^{2}K^{2}+\beta^{2}\eta_{k}^{2}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}-2\alpha\eta_{k}\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}
= \left(1-2\alpha\eta_{k}+\beta^{2}\eta_{k}^{2}\right)\left|\left|x_{k}-x^{*}\right|\right|_{2}^{2}+\eta_{k}^{2}K^{2}. (18)

By the tower property of expectation, we can write

\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right] = \mathbb{E}\left[...\left.\mathbb{E}\left[\left.\mathbb{E}\left[\left.\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right|x_{N-1}\right]\right|x_{N-2}\right]...\right|x_{0}\right]. (19)

Combining the recursive relation from (18) with (19) we see that

\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right] \leq \left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\prod_{t=0}^{N-1}\left(1-2\alpha\eta_{t}+\beta^{2}\eta_{t}^{2}\right)+K^{2}\sum_{t=0}^{N-1}\eta_{t}^{2}\prod_{k=t+1}^{N-1}\left(1-2\alpha\eta_{k}+\beta^{2}\eta_{k}^{2}\right)
\leq \left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(\sum_{t=0}^{N-1}\left(-2\alpha\eta_{t}+\beta^{2}\eta_{t}^{2}\right)\right)+K^{2}\sum_{t=0}^{N-1}\eta_{t}^{2}\exp\left(\sum_{k=t+1}^{N-1}\left(-2\alpha\eta_{k}+\beta^{2}\eta_{k}^{2}\right)\right).

Since we chose \eta_{k}:=\min\left(\frac{1}{\alpha(k+1)},\frac{\min(1,2\alpha)}{\beta}\right), we have \eta_{k}\leq\frac{1}{\alpha(k+1)}. This means that \sum_{k=0}^{\infty}\eta_{k}^{2}\leq\sum_{k=0}^{\infty}\frac{1}{\alpha^{2}(k+1)^{2}}=\frac{\pi^{2}}{6\alpha^{2}}. Thus, defining C_{\alpha,\beta}:=\exp\left(\frac{\beta^{2}\pi^{2}}{6\alpha^{2}}\right), we have

\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right] \leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=0}^{N-1}\eta_{t}\right)+K^{2}C_{\alpha,\beta}\sum_{t=0}^{N-1}\eta_{t}^{2}\exp\left(-2\alpha\sum_{k=t+1}^{N-1}\eta_{k}\right)
\overset{(a)}{\leq} C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=0}^{N-1}\eta_{t}\right)+K^{2}C_{\alpha,\beta}\sum_{t=0}^{N-1}\left(\frac{C}{\alpha N}\right)^{2}
= C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=0}^{N-1}\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N},

where (a) is because in Appendix B we showed that \eta_{t}\exp\left(-\sum_{k=t+1}^{N-1}\alpha\eta_{k}\right)\leq\frac{C}{\alpha N} for all 0\leq t\leq N-1. Next, let t_{0}:=\max\left(\frac{\beta}{\alpha\min(1,2\alpha)}-1,0\right) so that \eta_{t}=\frac{1}{\alpha(t+1)} if t\geq t_{0} and \eta_{t}=\frac{\min(1,2\alpha)}{\beta} otherwise. We then have

\begin{align*}
\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right]&\leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(\sum_{t=0}^{N-1}-2\alpha\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&=C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(\sum_{t=0}^{t_{0}}-2\alpha\eta_{t}\right)\exp\left(\sum_{t=t_{0}+1}^{N-1}-2\alpha\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&\leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\sum_{t=t_{0}+1}^{N-1}\eta_{t}\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&\leq C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\exp\left(-2\alpha\int_{t_{0}+1}^{N}\frac{1}{\alpha y}\,dy\right)+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&=C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\left(\frac{t_{0}+1}{N}\right)^{2}+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}\\
&=\frac{C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\beta^{2}}{\left(\alpha\min(1,2\alpha)\right)^{2}N^{2}}+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}.
\end{align*}

Finally, we have

\begin{align*}
\mathbb{E}\left[\left|\left|x_{\text{alg}}-x^{*}\right|\right|_{2}\right]&=\mathbb{E}\left[\left|\left|\Pi_{\mathcal{X}}\left(x_{N}+Z\right)-x^{*}\right|\right|_{2}\right]\\
&=\mathbb{E}\left[\left|\left|\Pi_{\mathcal{X}}\left(x_{N}+Z\right)-\Pi_{\mathcal{X}}(x^{*})\right|\right|_{2}\right]\\
&\leq\mathbb{E}\left[\left|\left|x_{N}+Z-x^{*}\right|\right|_{2}\right]\\
&\leq\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}\right]+\mathbb{E}\left[\left|\left|Z\right|\right|_{2}\right]\\
&\overset{(a)}{\leq}\sqrt{\mathbb{E}\left[\left|\left|x_{N}-x^{*}\right|\right|_{2}^{2}\right]}+\sqrt{\mathbb{E}\left[\left|\left|Z\right|\right|_{2}^{2}\right]}\\
&\leq\sqrt{\frac{C_{\alpha,\beta}\left|\left|x_{0}-x^{*}\right|\right|_{2}^{2}\beta^{2}}{\left(\alpha\min(1,2\alpha)\right)^{2}N^{2}}+\frac{K^{2}C^{2}C_{\alpha,\beta}}{\alpha^{2}N}}+\sqrt{n^{2}m\frac{2C^{2}}{\epsilon^{2}\alpha^{2}T^{2}N^{2}}\ln\left(\frac{1.25}{\delta}\right)}\\
&\leq\frac{\sqrt{C_{\alpha,\beta}}\left|\left|x_{0}-x^{*}\right|\right|_{2}\beta}{\alpha\min(1,2\alpha)N}+\frac{KC\sqrt{C_{\alpha,\beta}}}{\alpha\sqrt{N}}+\frac{Cn\sqrt{m}}{\epsilon\alpha TN}\sqrt{2\ln\left(\frac{1.25}{\delta}\right)}\\
&\overset{(b)}{\leq}\frac{\sqrt{C_{\alpha,\beta}}\,n\sqrt{m}\,\beta}{\alpha\min(1,2\alpha)N}+\frac{KC\sqrt{C_{\alpha,\beta}}}{\alpha\sqrt{N}}+\frac{Cn\sqrt{m}}{\epsilon\alpha TN}\sqrt{2\ln\left(\frac{1.25}{\delta}\right)},
\end{align*}

where $(a)$ is due to Jensen's inequality and $(b)$ is due to the fact that $\left|\left|x^{\prime}-x^{\prime\prime}\right|\right|_{2}\leq n\sqrt{m}$ for any pair of unit network flows $x^{\prime},x^{\prime\prime}$.
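The convergence behavior established in this proof can be sanity-checked numerically. The sketch below (illustrative only: the quadratic objective, noise level, and dimension are made-up toy choices, not the paper's routing problem) runs noisy gradient descent with the step-size schedule $\eta_{k}=\min\left(\frac{1}{\alpha k},\frac{\min(1,2\alpha)}{\beta}\right)$ on a strongly convex function and checks that the mean squared error decays at roughly the $O(1/N)$ rate that dominates the bound for large $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly convex objective f(x) = (alpha/2) * ||x - x_star||^2,
# observed through noisy gradients (the noise plays the role of the
# stochastic gradients in the paper's algorithm).
alpha, beta = 1.0, 1.0   # toy strong convexity / smoothness parameters
d = 5
x_star = rng.normal(size=d)

def noisy_grad(x):
    # True gradient plus zero-mean Gaussian noise.
    return alpha * (x - x_star) + 0.1 * rng.normal(size=d)

def run_sgd(N):
    # Gradient descent with eta_k = min(1/(alpha*k), min(1, 2*alpha)/beta).
    x = np.zeros(d)
    for k in range(1, N + 1):
        eta = min(1.0 / (alpha * k), min(1.0, 2.0 * alpha) / beta)
        x = x - eta * noisy_grad(x)
    return float(np.sum((x - x_star) ** 2))

# Average the squared error over repeated trials for two horizons.
err_100 = np.mean([run_sgd(100) for _ in range(200)])
err_1000 = np.mean([run_sgd(1000) for _ in range(200)])
# With an O(1/N) rate, a tenfold larger N should shrink the error roughly 10x.
print(err_100, err_1000, err_100 / err_1000)
```

The deterministic (bias) term of the bound decays faster, so for moderate $N$ the noise-driven $O(1/N)$ term dominates, which is what the ratio printed above reflects.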

Appendix D Computational approximations

In our experiments, we make several approximations for computational tractability. In this section, we provide empirical evidence that these approximations are reasonable and do not introduce significant errors. For ease of discussion, we define the following notation.

  • $x^{opt}_{\alpha}$: the solution obtained by solving the optimization problem (7) with regularizer $\alpha$

  • $x_{\alpha}$: the solution obtained by solving the optimization problem with the average demand and regularizer $\alpha$

  • $F_{\alpha}(x,\Lambda_{i})$: the cost of the routing policy $x$ on demand $\Lambda_{i}$ with regularizer $\alpha$

  • $\langle F_{\alpha}(x,\Lambda_{i})\rangle$: the average cost of the routing policy $x$ on the set of demands $\{\Lambda_{1},\ldots,\Lambda_{N}\}$ with regularizer $\alpha$. More precisely, $\langle F_{\alpha}(x,\Lambda_{i})\rangle=\frac{1}{N}\sum_{i=1}^{N}F_{\alpha}(x,\Lambda_{i})$

  • $F_{\alpha}(x,\langle\Lambda_{i}\rangle)$: the cost of the routing policy $x$ on the average demand $\langle\Lambda_{i}\rangle=\frac{1}{N}\sum_{i=1}^{N}\Lambda_{i}$ with regularizer $\alpha$
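To make the distinction between $\langle F_{\alpha}(x,\Lambda_{i})\rangle$ and $F_{\alpha}(x,\langle\Lambda_{i}\rangle)$ concrete, the sketch below uses a made-up scalar cost function (the real $F_{\alpha}$ is the paper's network routing objective): whenever the cost is nonlinear in the demand, the two quantities differ by a Jensen-type gap, which is what Error #1 in Table 2 measures for the true objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (illustrative only): demands Lambda_i are scalars, and
# F(x, lam) = lam * x**2 + (1 - x) * lam**2 is a made-up convex cost
# that is nonlinear (quadratic) in the demand lam.
demands = rng.uniform(0.5, 1.5, size=50)   # Lambda_1, ..., Lambda_N
x = 0.7                                     # a fixed routing policy

def F(x, lam):
    return lam * x ** 2 + (1 - x) * lam ** 2

avg_of_costs = np.mean([F(x, lam) for lam in demands])  # <F(x, Lambda_i)>
cost_of_avg = F(x, np.mean(demands))                    # F(x, <Lambda_i>)

# The gap comes entirely from the lam**2 term (the lam * x**2 term is
# linear in the demand, so it averages exactly).
print(avg_of_costs, cost_of_avg, avg_of_costs - cost_of_avg)
```

In this toy example the gap equals $(1-x)\,\mathrm{Var}(\Lambda_{i})$ exactly, so the approximation is accurate whenever the demand samples concentrate around their mean.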

$N$ | Error #1 | Error #2 | Error #3
10 | $8.97\times 10^{-7}$ | $<10^{-10}$ | $8.17\times 10^{-3}$
25 | $1.02\times 10^{-6}$ | $<10^{-10}$ | $8.16\times 10^{-3}$
50 | $9.84\times 10^{-7}$ | $<10^{-10}$ | $8.17\times 10^{-3}$

Table 2: Approximation errors, where Error #1 $=\frac{\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle-F_{\alpha}(x^{opt}_{\alpha},\langle\Lambda_{i}\rangle)}{\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle}$, Error #2 $=\frac{\langle F_{\alpha}(x_{\alpha},\Lambda_{i})\rangle-\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle}{\langle F_{\alpha}(x^{opt}_{\alpha},\Lambda_{i})\rangle}$, and Error #3 $=\frac{\langle F_{\alpha=0}(x_{\alpha},\Lambda_{i})\rangle-\langle F_{\alpha=0}(x_{\alpha=0},\Lambda_{i})\rangle}{\langle F_{\alpha=0}(x_{\alpha=0},\Lambda_{i})\rangle}$.

Table 2 presents the errors from three different approximations. Our first approximation computes the system cost of a given policy using the average demand instead of averaging the cost over every observed demand. The first column (titled Error #1) lists the fractional error introduced by this approximation for different values of $N$ when using the optimal routing policy. The error is less than $10^{-6}$, and we observe a 30-50x improvement in computation time when evaluating costs with the average demand. This justifies the use of the average demand for estimating costs. Our second approximation solves an easier optimization problem to compute the routing policy. The exact approach would be to solve the optimization problem described in Equation 10; however, the size of this problem grows rapidly with the number of data points $N$. Our approximation instead optimizes $F_{\alpha}(x,\langle\Lambda_{i}\rangle)$ to obtain the routing policy $x_{\alpha}$, rather than solving the original optimization problem to obtain $x_{\alpha}^{opt}$. The second column (titled Error #2) presents the error this approximation introduces in the travel costs. The small errors indicate that the approximation is reasonable, and it yields up to a 30x speedup in solving the optimization problem. Finally, we show through numerical evaluation that adding the $\alpha=10^{3}$ regularizer does not change the travel costs significantly (third column, titled Error #3): it results in less than a 1% error in the total travel cost.
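As a self-contained illustration of the Error #3 computation (with a made-up one-dimensional cost standing in for the network objective, and a toy regularizer weight rather than the paper's $\alpha=10^{3}$), the sketch below solves a problem with and without a quadratic regularizer and measures the resulting fractional increase in the unregularized cost. Because the unregularized solution minimizes the unregularized cost, this fractional error is always nonnegative, and it shrinks as the regularizer weight decreases.

```python
# Toy illustration of the Error #3 computation; all functions and values
# here are hypothetical, chosen so the minimizers have closed forms.
def cost(x, c=1.0):
    """Unregularized travel cost (toy): c*x + x**2."""
    return c * x + x ** 2

def solve(alpha, c=1.0):
    """Minimizer of the regularized cost c*x + (1 + alpha)*x**2."""
    return -c / (2.0 * (1.0 + alpha))

x_unreg = solve(alpha=0.0)   # plays the role of x_{alpha=0}
x_reg = solve(alpha=0.01)    # plays the role of x_alpha

# Fractional error in the *unregularized* cost, mirroring Error #3.
frac_err = (cost(x_reg) - cost(x_unreg)) / abs(cost(x_unreg))
print(frac_err)
```

For a small regularizer weight the regularized minimizer barely moves, so the fractional cost error is tiny; the same qualitative behavior underlies the sub-1% figures in the Error #3 column.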