
[JI] UM Joint Institute, Shanghai Jiao Tong University, Shanghai, China. [SEIEE] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. [PKU] School of Electronics, Peking University, Beijing, China. [UMJI] UM Joint Institute and School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China.

Learning-Based Adaptive Dynamic Routing with Stability Guarantee for a Single-Origin-Single-Destination Network

Yidan Wu [JI], Feixiang Shu [SEIEE], Jianan Zhang [PKU], Li Jin [UMJI]
Abstract

We consider learning-based adaptive dynamic routing for a single-origin-single-destination queuing network with stability guarantees. Specifically, we study a class of generalized shortest path policies that can be parameterized by only two constants via a piecewise-linear function. Using the Foster-Lyapunov stability theory, we develop a criterion on the parameters to ensure mean boundedness of the traffic state. Then, we develop a policy iteration algorithm that learns the parameters from realized sample paths. Importantly, the piecewise-linear function is both integrated into the Lyapunov function for stability analysis and used as a proxy of the value function for policy iteration; hence, stability is inherently ensured for the learned policy. Finally, we demonstrate via a numerical example that the proposed algorithm learns a near-optimal routing policy with an acceptable optimality gap but significantly higher computational efficiency compared with a standard neural network-based algorithm.

keywords:
Dynamic routing, queuing networks, reinforcement learning, stability
Footnote: This work was in part supported by National Natural Science Foundation of China Project 62103260, SJTU UM Joint Institute, and the J. Wu & J. Sun Foundation.

1 Introduction

Dynamic routing is a control scheme with extensive applications in network systems, including communications, manufacturing, and transportation [1, 2, 3, 4]. A primary objective of dynamic routing is to ensure stability, i.e., the boundedness of the long-time average of the traffic state. The typical tool for this class of problems is the Lyapunov function method [5]. On top of this, a secondary objective is to minimize the system time experienced by each job. Queuing theorists have made efforts to characterize the limiting distribution of the traffic state, but such results are usually complicated and non-scalable [6, 7, 8]. Recently, learning-based approaches have been proposed for dynamic routing. However, such approaches are usually purely numerical, without explicit theoretical guarantees on traffic stability [9].

Motivated by the above challenge, we consider a dynamic routing scheme for single-origin-single-destination queuing networks that integrates Lyapunov function methods [10] and approximate dynamic programming [11]. We use a class of piecewise-linear (PL) functions that are increasing in traffic states to measure the travel cost of a path. An incoming job is routed to the path with the smallest cost, so the routing scheme can be viewed as a generalized shortest-path (GSP) strategy. We use the PL functions to construct a Lyapunov function for stability analysis, and we show that the resultant routing scheme is throughput-optimal. We also use the PL functions as proxies for action-value functions to obtain a policy iteration algorithm. This algorithm computes a suboptimal routing scheme that is guaranteed to be stabilizing. We also show that our scheme gives comparable performance (i.e., average system time) with respect to neural network-based methods but with a significant reduction in computational workload. This paper focuses on the bridge network in Fig. 1,

Figure 1: A single-origin-single-destination network.

but our modeling, analysis, and computation approaches also provide hints for general networks.

Extensive results have been developed for traffic stabilization and throughput maximization [5, 12, 13, 14], which provide a basis for this paper. In terms of travel cost minimization, previous results are typically either restricted to simple scenarios (e.g., parallel queues), based on numerical simulation [15, 16], or based on optimization methods without traffic stability analysis [17, 18]. In particular, reinforcement learning methods have been proposed to compute optimal/suboptimal solutions to dynamic routing problems [19, 20]. Although learning methods usually give excellent performance, they focus on algorithm design and do not emphasize theoretical guarantees for stability/optimality. In addition, generic learning methods may be computationally inefficient for dynamic routing. In particular, Liu et al. proposed a mixed scheme that uses learning over a bounded set of states and applies a known stabilizing policy for the other states [21]; this notion of integrating learning with a stability guarantee is aligned with the idea of this paper, but we achieve the integration in a different manner.

We formulate our problem on the single-origin-single-destination queuing network in Fig. 1. Jobs arrive at the source node as a Poisson process. Each server has exponentially distributed service times. When a job arrives, the system operator assigns a path to the job, which will not be changed afterwards. Hence, every job belongs to a class that corresponds to the assigned path. The key of our routing scheme is a set of PL functions that approximate the expected travel times on each path. Such PL functions were introduced by Down and Meyn [10] to show positive Harris recurrence of the traffic state. Our previous work proposed an explicit construction of the parameters of the PL functions according to network topology, without solving model data-dependent inequalities [22]. In this paper, we further utilize the PL functions to bound the mean of the traffic state and to obtain our GSP routing policy.

We show that our GSP policy maximizes network throughput in the sense that it is stabilizing as long as the demand at the source is less than the min-cut capacity of the network (Theorem 1). The proof utilizes a critical insight about the PL functions: each linear regime corresponds to a particular configuration of bottlenecks. We construct a piecewise-quadratic Lyapunov function using the PL functions and show that the mean drift of the Lyapunov function is negative in heavy-traffic regimes. The drift condition also leads to a set of constraints on the PL parameters.

We then develop a policy iteration algorithm that learns the PL parameters (and thus the GSP policy). The policy iteration algorithm uses the PL functions as a proxy (as opposed to an immediate approximation) of the action-value functions; this connects the learning-based routing decisions with the above-mentioned theoretical properties. In each iteration, the current policy is evaluated using the PL function after a Monte-Carlo episode and improved by greedifying over the PL functions; the evaluation is done by an algorithm specifically designed for the PL functions. We compare our policy iteration algorithm with a standard neural-network dynamic programming algorithm. The implementation results indicate that our approach can attain a 1.9%-6.7% optimality gap with respect to the neural-network method with a 97.8%-99.8% reduction in training time.

In summary, our contributions are:

  1. A simple, insightful, and easy-to-implement dynamic routing scheme, i.e., the GSP policy.

  2. A theoretical guarantee on traffic stabilization and throughput maximization of the GSP policy over the bridge network.

  3. An efficient policy iteration algorithm for the GSP policy that attains comparable control performance but significantly reduces training time compared with neural network-based methods.

The rest of this paper is organized as follows. Section 2 introduces the network model and formulates the dynamic routing problem. Section 3 develops the stability guarantee for the GSP policy. Section 4 demonstrates the training algorithm for the GSP policy. Section 5 gives the concluding remarks.

2 Modeling and formulation

Consider the single-origin-single-destination (SOSD) network of queuing servers with infinite buffer sizes in Fig. 1. Let $\mathcal{N}=\{1,2,3,4,5\}$ be the set of servers. Each server $n$ has an exponential service rate $\bar{\mu}_n$ and job count $x_n(t)$ at time $t$, $t\in\mathbb{R}_{\geq 0}$. Jobs arrive at origin $S$ according to a Poisson process of rate $\bar{\lambda}>0$. There are three paths from the origin to the destination, $\mathcal{P}=\{(1,2),\ (1,3,5),\ (4,5)\}$. We write $n\in p$ if server $n$ is on path $p$.

When a job arrives, it is routed to one of the three paths according to a routing policy. The path of a job cannot be changed after it is decided at the origin. Thus, we divide the jobs into three classes according to their assigned paths. We use $x_n^p(t)$ to denote the number of jobs queuing in server $n$ and going on path $p\in\mathcal{P}$ at time $t$. For each server we have $x_n(t)=\sum_{p:\,n\in p}x_n^p(t)$. Thus we can define the state of the network as the vector $x=[x_1^{12},x_1^{135},x_2^{12},x_3^{135},x_4^{45},x_5^{135},x_5^{45}]^T$, with state space $\mathcal{X}=\mathbb{Z}_{\geq 0}^7$.

We consider two sets of actions, viz. 1) routing and 2) scheduling. Routing refers to determining the path of each job upon its arrival, which only occurs at $S$. The routing action can be formulated as a vector

$$\lambda=[\lambda_p]_{p\in\mathcal{P}}=\begin{bmatrix}\lambda_{12}\\ \lambda_{135}\\ \lambda_{45}\end{bmatrix}\in\{0,\bar{\lambda}\}^3.$$

Scheduling refers to selecting a job from the queue to serve, which only occurs at Servers 1 and 5; preemptive priority is assumed at both servers. The scheduling action can also be formulated as a vector

$$\mu=[\mu_n^p]_{p\ni n,\,n\in\mathcal{N}}=\begin{bmatrix}\mu_1^{12}\\ \mu_1^{135}\\ \mu_2^{12}\\ \mu_3^{135}\\ \mu_4^{45}\\ \mu_5^{135}\\ \mu_5^{45}\end{bmatrix}\in\prod_{n\in\mathcal{N}}\{0,\bar{\mu}_n\}^{m_n},$$

where $m_n$ is the number of paths through server $n$.

This paper focuses on a particular class of control policies which we call the generalized shortest path (GSP) policy, based on the following piecewise-linear (PL) functions of the traffic state:

$$\begin{aligned}
Q_{12}(x)&=\max\{\beta^2 x_1^{12},\ \beta(x_1^{12}+x_2^{12})\},\\
Q_{135}(x)&=\max\{\beta^2 x_1^{135},\ \beta(x_1^{135}+x_3^{135}),\ x_1^{135}+x_3^{135}+x_5^{135}\},\\
Q_{45}(x)&=\max\{\beta^2 x_4^{45},\ \beta(x_4^{45}+x_5^{45})\},
\end{aligned}$$

where $\beta>1$ is a design parameter. We then define the notion of bottlenecks as follows:

Definition 1 (Bottleneck)

Given $x\in\mathcal{X}$, a server $n\in p$ is a bottleneck of path $p$ if

  1. $\frac{\partial_+}{\partial x_n}Q_p(x)>0$, where $\partial_+$ denotes the right derivative, and

  2. either $n$ has no downstream server on $p$ or the immediately downstream server $n'$ satisfies $\frac{\partial_+}{\partial x_{n'}}Q_p(x)=0$.

Intuitively, a bottleneck is a server that immediately contributes to $Q_p$, while the immediately downstream server, if it exists, does not. Let $B=(b_{12},b_{135},b_{45})$ be a combination of bottlenecks, where $b_{12}\in\{1,2\}$, $b_{135}\in\{1,3,5\}$, $b_{45}\in\{4,5\}$, and write $n\in B$ if $n$ is a bottleneck in combination $B$. Let $\mathcal{B}$ be the set of possible combinations, which has 12 elements, each corresponding to a subset of $\mathcal{X}$ over which $\{Q_p\}$ are all linear in $x$.
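
To make these definitions concrete, the following sketch (a minimal illustration of our own; the state encoding and helper names are not part of the original formulation) evaluates the three PL functions and identifies the bottleneck of each path for a given state and $\beta$, following the right-derivative rule above (ties favor the downstream server).

```python
# Minimal sketch: PL path costs Q_p(x; beta) and per-path bottlenecks
# for the bridge network of Fig. 1. The state dictionary keys are our own encoding.

def q_functions(x, beta):
    """Return the PL costs Q_12, Q_135, Q_45 of the three paths."""
    q12 = max(beta**2 * x["x1_12"], beta * (x["x1_12"] + x["x2_12"]))
    q135 = max(beta**2 * x["x1_135"],
               beta * (x["x1_135"] + x["x3_135"]),
               x["x1_135"] + x["x3_135"] + x["x5_135"])
    q45 = max(beta**2 * x["x4_45"], beta * (x["x4_45"] + x["x5_45"]))
    return {"12": q12, "135": q135, "45": q45}

def bottlenecks(x, beta):
    """Bottleneck of each path: the furthest-downstream server whose queue
    still contributes to the active linear piece (right-derivative rule)."""
    b = {}
    # Path (1,2): piece beta^2*x1 strictly active -> bottleneck 1, otherwise 2.
    b["12"] = 1 if beta**2 * x["x1_12"] > beta * (x["x1_12"] + x["x2_12"]) else 2
    # Path (1,3,5): among the three pieces, ties favor the downstream server.
    pieces = [beta**2 * x["x1_135"],
              beta * (x["x1_135"] + x["x3_135"]),
              x["x1_135"] + x["x3_135"] + x["x5_135"]]
    top = max(pieces)
    b["135"] = [1, 3, 5][max(i for i, v in enumerate(pieces) if v == top)]
    # Path (4,5): piece beta^2*x4 strictly active -> bottleneck 4, otherwise 5.
    b["45"] = 4 if beta**2 * x["x4_45"] > beta * (x["x4_45"] + x["x5_45"]) else 5
    return b

if __name__ == "__main__":
    x = {"x1_12": 1, "x2_12": 3, "x1_135": 0, "x3_135": 2, "x5_135": 4,
         "x4_45": 2, "x5_45": 1}
    print(q_functions(x, beta=1.2))   # {'12': 4.8, '135': 6.0, '45': 3.6}
    print(bottlenecks(x, beta=1.2))   # {'12': 2, '135': 5, '45': 5}
```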

The GSP policy essentially sends an incoming job to the path with the smallest $Q$ function. However, the $Q$ functions only capture the traffic states and do not consider the influence of the service rates, especially the service rate of the current bottleneck. To account for this, we introduce a second parameter $\gamma>1$ that allows the GSP policy to prioritize the path with a higher bottleneck service rate. To be specific, for a combination $B\in\mathcal{B}$, we first sort the paths according to their bottleneck service rates:

$$\begin{aligned}
P_{1,B}&:=\{p\in\mathcal{P}:\ p\in\mathop{\arg\max}_{p'\in\mathcal{P}}\bar{\mu}_{b_{p'}}\},\\
P_{\gamma,B}&:=\{p\in\mathcal{P}:\ p\in\mathop{\arg\max}_{p'\in\mathcal{P}\setminus P_{1,B}}\bar{\mu}_{b_{p'}}\},\\
P_{\gamma^2,B}&:=\mathcal{P}\setminus(P_{1,B}\cup P_{\gamma,B}).
\end{aligned}$$

Then for $B\in\mathcal{B}$, define the weight of each path as

$$\gamma_p(B):=\begin{cases}1 & p\in P_{1,B},\\ \gamma & p\in P_{\gamma,B},\\ \gamma^2 & p\in P_{\gamma^2,B},\end{cases}\quad p\in\mathcal{P}. \qquad (1)$$

Now we are ready to formulate the GSP policy, with a slight abuse of notation, as follows. When there are no ties,

$$\begin{aligned}
\lambda_p(x)&=\bar{\lambda}\,\mathbb{I}\Big\{p=\mathop{\arg\min}_{p'\in\mathcal{P}}\gamma_{p'}Q_{p'}\Big\},\quad p\in\mathcal{P},\\
\mu_n^p(x)&=\bar{\mu}_n\,\mathbb{I}\Big\{p=\mathop{\arg\max}_{p':\,n\in B\cap p'}Q_{p'}\Big\},\quad n\in\{1,5\},\\
\mu_n^p(x)&=\bar{\mu}_n\,\mathbb{I}\{x_n>0\},\quad n\in\{2,3,4\};
\end{aligned}$$

ties are broken randomly.

The above joint routing-scheduling policy is called GSP because the PL $Q$ function is a generalized measurement of the "temporal length" of each path. Under this policy, server $n$ prioritizes the jobs of class $p$ if server $n$ is the bottleneck of path $p$. If server $n$ is the bottleneck of multiple paths, the GSP policy further prioritizes the jobs from the path with the larger $Q$ function. We also define

$$P^*=\mathop{\arg\min}_{p\in\mathcal{P}}\gamma_p Q_p(x);$$

note that $1\leq|P^*|\leq 3$.
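
For concreteness, the routing step can be sketched as follows (a minimal illustration of our own; the $Q$ values, bottleneck assignments, and rates below are hypothetical inputs that could be produced, e.g., by the previous sketch). Paths are ranked by their bottleneck service rate to obtain the weights $\gamma_p(B)$ of (1), and the arrival is sent to the path minimizing $\gamma_p Q_p$.

```python
# Sketch of the GSP routing decision: given the PL costs Q_p, the bottleneck of
# each path, and the service rates, compute gamma_p(B) and pick the path with
# the smallest weighted cost gamma_p * Q_p. Ties would be broken randomly.

def path_weights(bottleneck, mu_bar, gamma):
    """Assign weights 1, gamma, gamma^2 to paths ranked by bottleneck rate (descending)."""
    paths = list(bottleneck.keys())
    ranked = sorted(paths, key=lambda p: mu_bar[bottleneck[p]], reverse=True)
    rates = [mu_bar[bottleneck[p]] for p in ranked]
    weights, level = {}, 0
    for i, p in enumerate(ranked):
        if i > 0 and rates[i] < rates[i - 1]:
            level += 1                      # strictly smaller rate -> next weight level
        weights[p] = gamma ** min(level, 2)
    return weights

def gsp_route(Q, bottleneck, mu_bar, gamma):
    """Return the path chosen for an arriving job under the GSP policy."""
    w = path_weights(bottleneck, mu_bar, gamma)
    return min(Q, key=lambda p: w[p] * Q[p])

if __name__ == "__main__":
    mu_bar = {1: 0.15, 2: 0.1, 3: 0.25, 4: 0.15, 5: 0.2}   # illustrative rates
    Q = {"12": 4.8, "135": 6.0, "45": 3.6}                  # e.g., from the previous sketch
    bottleneck = {"12": 2, "135": 5, "45": 5}
    print(gsp_route(Q, bottleneck, mu_bar, gamma=1.1))      # -> "45"
```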

We say that the traffic in the network is stable if there exists $C<\infty$ such that for any initial condition,

$$\limsup_{t\rightarrow\infty}\frac{1}{t}\int_{0}^{t}\mathsf{E}[\|x(s)\|]\,ds<C.$$

We say the network is stabilizable if $\bar{\lambda}<\bar{\mu}$, where $\bar{\mu}$ is the min-cut capacity of the network. Note that $\bar{\lambda}<\bar{\mu}$ ensures the existence of a stabilizing Bernoulli routing policy.

3 Stability guarantee for GSP policy

In this section, we establish the stability condition for the bridge network under the GSP policy.

To state the main result, let $\mathcal{M}$ be the set of cuts of the network, and define

$$\begin{aligned}
m=\min\quad &\frac{\sum_{n\in M\setminus\tilde{M}}\bar{\mu}_n}{\bar{\lambda}-\sum_{n\in\tilde{M}}\bar{\mu}_n}\\
\mathrm{s.t.}\quad &\bar{\lambda}-\sum_{n\in\tilde{M}}\bar{\mu}_n>0,\quad \tilde{M}\subsetneq M,\quad M\in\mathcal{M};
\end{aligned}$$

one can show that $m>1$ if the network is stabilizable. Let $d_n$ be the number of links on the longest path from the origin to server $n$. For each $M\in\mathcal{M}$, let $\mu_{\max}^M=\max_{n\in M}\bar{\mu}_n$, $\mu_{\min}^M=\min_{n\in M}\bar{\mu}_n$, and define

$$G_1^M:=\sum_{n\in\arg\max_{\nu\in M}\bar{\mu}_\nu}d_n,\qquad G_2^M:=\min_{n\in\arg\min_{\nu\in M}\bar{\mu}_\nu}d_n,$$
$$G^M:=\mathrm{sgn}(\mu_{\max}^M-\mu_{\min}^M)\,(G_2^M-G_1^M)_+,$$

where $\mathrm{sgn}(\cdot)$ (resp. $(\cdot)_+$) is the sign (resp. positive part) function. Also define

$$\Delta_G:=\mathrm{sgn}\Big(\max_{M\in\mathcal{M}}G^M\Big).$$

Then, we can state the main result as follows:

Theorem 1

The bridge network is stable under the GSP policy if the network is stabilizable and

$$1<\gamma^{(2+\Delta_G)}<\beta^{(2+\Delta_G)}<m. \qquad (2)$$

The above result includes two key points. First, the GSP policy is stabilizing if the queuing network is stabilizable; i.e., the GSP policy maximizes the throughput of the network. Second, constraints on the parameters for the GSP policy are given, which determine the feasible set for the subsequent learning-based optimization (Section 4).

The parameters $m$ and $\Delta_G$ capture the relationship between the service rates and the arrival rate, while the design parameters $\beta$ and $\gamma$ characterize the GSP policy's "preference". Note that $\lim_{\bar{\lambda}\uparrow\bar{\mu}}m=1$, which means every server and every path is "equally emphasized" by the GSP policy as the network approaches saturation. The case $\Delta_G=1$ indicates that the path with the stronger service ability has its bottleneck closer to the origin. In this case the degree of "preference" should decrease and the constraints on $\beta$ and $\gamma$ should be stricter: although the other paths have weaker service ability, their bottlenecks are closer to the destination, so jobs on these paths reach the destination more quickly once they pass the bottleneck.
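
As a sanity check on these quantities, the following sketch (our own illustration; the cut enumeration and the depths $d_n$ are hard-coded by us for the topology of Fig. 1) computes $m$ and $\Delta_G$ by brute force. With the service and arrival rates of the numerical example in Section 4.3, it returns $m=1.5$ and $\Delta_G=0$, matching the values reported there.

```python
# Brute-force computation of m and Delta_G for the bridge network (ours).
from itertools import chain, combinations

mu = {1: 0.15, 2: 0.1, 3: 0.25, 4: 0.15, 5: 0.2}   # rates of the Section 4.3 example
lam = 0.2
cuts = [{1, 4}, {1, 5}, {2, 5}, {2, 3, 4}]          # cuts of the bridge network
d = {1: 1, 2: 2, 3: 2, 4: 1, 5: 3}                  # longest #links from origin to server n

def proper_subsets(M):
    s = sorted(M)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)))

def sgn(v):
    return (v > 0) - (v < 0)

# m: minimize residual cut capacity over residual demand, as in the definition above
ratios = [sum(mu[n] for n in M - set(Mt)) / (lam - sum(mu[n] for n in Mt))
          for M in cuts for Mt in proper_subsets(M)
          if lam - sum(mu[n] for n in Mt) > 0]
m = min(ratios)

# Delta_G: sign of the largest G^M over all cuts
G = []
for M in cuts:
    mu_max, mu_min = max(mu[n] for n in M), min(mu[n] for n in M)
    G1 = sum(d[n] for n in M if mu[n] == mu_max)
    G2 = min(d[n] for n in M if mu[n] == mu_min)
    G.append(sgn(mu_max - mu_min) * max(G2 - G1, 0))
delta_G = sgn(max(G))

print(round(m, 6), delta_G)   # 1.5 0 for these rates
```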

The proof of Theorem 1 uses a Lyapunov function constructed from the PL $Q$ functions:

$$V(x)=\sum_{p\in\mathcal{P}}\big(Q_p(x)\big)^2. \qquad (3)$$

The key technique is to utilize the connection between the locations of bottlenecks and the GSP policy to argue for the negative mean drift of the Lyapunov function. In particular, we show that there exist $\epsilon>0$, $C<\infty$ such that for all $x\in\mathcal{X}$,

$$\mathcal{L}V(x)\leq-\epsilon\|x\|_1+C, \qquad (4)$$

where $\mathcal{L}$ is the infinitesimal generator of the network model under the GSP policy. (4) implies the boundedness of the 1-norm of the traffic state and thus the stability of the network, according to the Foster-Lyapunov criterion [10].

Proof of Theorem 1. First, we consider the case where $Q_p=0$ for some $p\in\mathcal{P}$. Suppose $P^*=\{(1,2)\}$ and $Q_{12}=0$, which implies the servers on path $(1,2)$ are all empty. According to the GSP policy, $\lambda_{12}(x)=\bar{\lambda}$, $\lambda_{135}(x)=\lambda_{45}(x)=0$. New jobs are routed to path $(1,2)$, so this path only makes a finite positive contribution to the drift. Meanwhile, if paths $(1,3,5)$ and $(4,5)$ are not empty, they keep serving jobs and make a negative contribution to the drift:

$$\begin{aligned}
\mathcal{L}V(x)&=-\epsilon_{135}\sum_{n\in(1,3,5)}x_n^{135}-\epsilon_{45}\sum_{n\in(4,5)}x_n^{45}+C \qquad (5)\\
&<-\epsilon\sum_{n\in\mathcal{N}}x_n+C.
\end{aligned}$$

Thus the Lyapunov drift can only be negative or finitely positive when there are empty paths in the network; the cases $Q_{135}=0$ and $Q_{45}=0$ are analogous.

The rest of the proof is devoted to the case where $Q_p>0$ for all $p\in\mathcal{P}$. Specifically, we consider five qualitatively different cases.

Case 1: $B=(2,5,5)$.

In this case, the $Q$ functions take the forms

$$\begin{aligned}
Q_{12}&=\beta(x_1^{12}+x_2^{12}),\\
Q_{135}&=x_1^{135}+x_3^{135}+x_5^{135},\\
Q_{45}&=\beta(x_4^{45}+x_5^{45}).
\end{aligned}$$

We then argue according to the routing decision:

  i) $P^*=\{(1,2)\}$.

    Since $b_{135}=b_{45}=5$ and $b_{12}=2$, we have

    $$\begin{aligned}
    \gamma_{12}(B)=1,\ \gamma_{135}(B)=\gamma_{45}(B)=\gamma, &\quad \text{if } \bar{\mu}_{b_{12}}>\bar{\mu}_{b_{135}}=\bar{\mu}_{b_{45}},\\
    \gamma_{12}(B)=\gamma,\ \gamma_{135}(B)=\gamma_{45}(B)=1, &\quad \text{if } \bar{\mu}_{b_{135}}=\bar{\mu}_{b_{45}}>\bar{\mu}_{b_{12}}.
    \end{aligned}$$

    Then the newly arriving job goes to path $(1,2)$ when either

    $$1)\ \gamma Q_{12}<Q_{135},Q_{45}; \qquad (6a)$$
    $$2)\ Q_{12}<\gamma Q_{135},\gamma Q_{45}. \qquad (6b)$$

    The relevant service decisions are $\mu_2^{12}(x)=\bar{\mu}_2$ and $\mu_5^{45}(x)+\mu_5^{135}(x)=\bar{\mu}_5$, $\mu_5^{45}(x)\mu_5^{135}(x)=0$. Clearly (6a) implies $Q_{12}<Q_{135},Q_{45}$; in either case we have

    $$\begin{aligned}
    \mathcal{L}V(x)<\ & 2\beta^2(\bar{\lambda}-\mu_2^{12}(x))(x_1^{12}+x_2^{12})-2\beta^2\mu_5^{45}(x)(x_4^{45}+x_5^{45})\\
    &-2\mu_5^{135}(x)(x_1^{135}+x_3^{135}+x_5^{135})+C\\
    <\ & 2\beta(x_1^{12}+x_2^{12})\Big(\beta(\bar{\lambda}-\bar{\mu}_2)-\bar{\mu}_5/\gamma\Big)+C.
    \end{aligned}$$

    The drift is trivially negative when $\bar{\lambda}-\bar{\mu}_2\leq 0$, and it remains negative when $\bar{\lambda}-\bar{\mu}_2>0$, since (2) gives $1<\beta\gamma<\beta^2<m\leq\frac{\bar{\mu}_5}{\bar{\lambda}-\bar{\mu}_2}$. Thus (4) holds.

  ii) $P^*=\{(1,3,5)\}$.

    Similarly, the jobs go to path $(1,3,5)$ when $\gamma Q_{135}<Q_{12},\gamma Q_{45}$ or $Q_{135}<\gamma Q_{12},Q_{45}$. The drift satisfies

    $$\mathcal{L}V(x)<2(x_1^{135}+x_3^{135}+x_5^{135})(\bar{\lambda}-\beta\bar{\mu}_2/\gamma-\beta\bar{\mu}_5)+C.$$

  iii) $P^*=\{(4,5)\}$.

    $$\mathcal{L}V(x)<2\beta(x_4^{45}+x_5^{45})(\beta\bar{\lambda}-\beta\bar{\mu}_2/\gamma-\bar{\mu}_5)+C.$$

    Since $\gamma<\beta<\sqrt{m}$, the drift satisfies (4).

For the remaining cases, where there are ties and $|P^*|>1$, a job can only take a single path, so the argument is analogous.
Case 2: $B=(1,5,5)$.

When $P^*=\{(1,2)\}$, the maximal coefficient of $x_1^{12}$ arises when $\bar{\mu}_1>\bar{\mu}_5$ and $Q_{12}<\gamma Q_{135},\gamma Q_{45}$; the drift then becomes

$$\mathcal{L}V(x)<2\beta^2 x_1^{12}\Big(\beta^2(\bar{\lambda}-\bar{\mu}_1)-\bar{\mu}_5/\gamma\Big)+C.$$

Since $d_1=1$, $d_5=3$ and the two servers belong to the same cut, we have $\Delta_G=1$ and $1<\gamma<\beta<m^{\frac{1}{3}}$, which makes (4) hold. The proof for the other forms of $P^*$ is analogous to Case 1.
Case 3: $B=(2,5,4)$.

We only show the case where $P^*=\{(1,2)\}$; the other cases can be proved analogously. For the various possible values of $\gamma_p(B)$, the drift has the form

$$\mathcal{L}V(x)< 2\beta(x_1^{12}+x_2^{12})\Big(\beta(\bar{\lambda}-\bar{\mu}_2)-\bar{\mu}_5\frac{\gamma_{12}(B)}{\gamma_{135}(B)}-\beta^2\bar{\mu}_4\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)+C.$$

According to the GSP policy, the scaled ratios satisfy

$$\Big(\frac{\gamma_{12}(B)}{\gamma_{135}(B)},\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)\in\Big\{(1,1),\Big(\frac{1}{\gamma^2},\frac{1}{\gamma}\Big),\Big(\frac{1}{\gamma},\frac{1}{\gamma^2}\Big),\Big(\frac{1}{\gamma},1\Big),\Big(1,\frac{1}{\gamma}\Big),\Big(\frac{1}{\gamma},\frac{1}{\gamma}\Big)\Big\}.$$

Thus the possible maximal coefficient of $x_1^{12}+x_2^{12}$ arises when

$$\begin{aligned}
1)\ &\bar{\mu}_2>\bar{\mu}_4>\bar{\mu}_5,\ \Big(\frac{\gamma_{12}(B)}{\gamma_{135}(B)},\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)=\Big(\frac{1}{\gamma^2},\frac{1}{\gamma}\Big),\ Q_{12}<\gamma^2 Q_{135},\gamma Q_{45};\\
2)\ &\bar{\mu}_2>\bar{\mu}_5>\bar{\mu}_4,\ \Big(\frac{\gamma_{12}(B)}{\gamma_{135}(B)},\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)=\Big(\frac{1}{\gamma},\frac{1}{\gamma^2}\Big),\ Q_{12}<\gamma Q_{135},\gamma^2 Q_{45}.
\end{aligned}$$

We have

$$\begin{aligned}
\mathcal{L}V(x)<\ & C+2\beta(x_1^{12}+x_2^{12})\Big(\beta(\bar{\lambda}-\bar{\mu}_2)+\mathbb{I}_{\{\bar{\mu}_2>\bar{\mu}_4>\bar{\mu}_5\}}(-\bar{\mu}_5/\gamma^2-\beta^2\bar{\mu}_4/\gamma)\\
&+\mathbb{I}_{\{\bar{\mu}_2>\bar{\mu}_5>\bar{\mu}_4\}}(-\bar{\mu}_5/\gamma-\beta^2\bar{\mu}_4/\gamma^2)\Big).
\end{aligned}$$

Since $d_2=2$, $d_4=1$, $d_5=3$, and $\Delta_G=1$ when $\bar{\mu}_2>\bar{\mu}_4>\bar{\mu}_5$, we have $1<\beta\gamma^2<m$. Thus (4) holds. The case $B=(1,5,4)$ can be proved analogously.
Case 4: $B=(2,3,4)$.

We discuss the case $P^*=\{(4,5)\}$. When the possible maximal coefficient of $x_4^{45}$ arises, the drift is

$$\begin{aligned}
\mathcal{L}V(x)<\ & C+\beta x_4^{45}\Big(\beta(\bar{\lambda}-\bar{\mu}_4)+\mathbb{I}_{\{\bar{\mu}_4>\bar{\mu}_3>\bar{\mu}_2\}}(-\bar{\mu}_2/\gamma^2-\bar{\mu}_3/\gamma)\\
&+\mathbb{I}_{\{\bar{\mu}_4>\bar{\mu}_2>\bar{\mu}_3\}}(-\bar{\mu}_2/\gamma-\bar{\mu}_3/\gamma^2)\Big).
\end{aligned}$$

Since $d_2=2$, $d_3=2$, $d_4=1$, and $\Delta_G=1$ when $\bar{\mu}_4>\bar{\mu}_2>\bar{\mu}_3$ or $\bar{\mu}_4>\bar{\mu}_3>\bar{\mu}_2$, we have $1<\beta\gamma^2<\beta^3<m$ and (4) holds. The other forms of $P^*$ for $B=(2,3,4)$ and the case $B=(2,3,5)$ can be proved analogously.
Case 5: $B=(1,3,5)$.

Considering $P^*=\{(1,2)\}$, when the possible maximal coefficient of $x_1^{12}$ arises, the drift satisfies

$$\begin{aligned}
\mathcal{L}V(x)<\ & C+\beta^3 x_1^{12}\Big(\beta(\bar{\lambda}-\bar{\mu}_1)+\mathbb{I}_{\{\bar{\mu}_1>\bar{\mu}_5>\bar{\mu}_3\}}(-\bar{\mu}_3/\gamma^2-\bar{\mu}_5/\gamma)\\
&+\mathbb{I}_{\{\bar{\mu}_1>\bar{\mu}_3>\bar{\mu}_5\}}(-\bar{\mu}_3/\gamma-\bar{\mu}_5/\gamma^2)\Big).
\end{aligned}$$

We have $d_1=1$, $d_3=2$, $d_5=3$, and $\Delta_G=1$ when $\bar{\mu}_1>\bar{\mu}_3>\bar{\mu}_5$. Thus $1<\beta\gamma^2<\beta^3<m$, and (4) holds. The other forms of $P^*$ for $B=(1,3,5)$ and the case $B=(1,3,4)$ can be proved analogously, as can the remaining cases $B=(1,1,4)$, $B=(1,1,5)$, $B=(2,1,4)$, and $B=(2,1,5)$.

In conclusion, we show that the mean drift always satisfies (4), which ensures the stability of the network. \square

4 Iterative computation of GSP parameters

In this section, we develop a policy iteration (PI) algorithm that computes the parameters of the GSP policy. In particular, Theorem 1 gives a set of values of $\beta$ and $\gamma$ that ensure stability; the goal of the PI algorithm is to search this set for a specific combination of $\beta$ and $\gamma$ that attains near-optimal traffic efficiency. In addition, the PI algorithm does not require a priori knowledge of the arrival and service rates. We also compare the performance of the GSP PI algorithm with three benchmark policies, viz. neural network (NN)-based PI, simple shortest path, and optimized Bernoulli.

We formulate the joint routing-scheduling problem as a Markov decision process (MDP) with state space $\mathcal{X}=\mathbb{Z}_{\geq 0}^7$. Since the routing and scheduling decisions are only made when an arrival or a departure occurs (i.e., at transition epochs [23, p.72]), the problem can be formulated as a discrete-time (DT) MDP. We denote the state and action of the DT MDP as $x[k]$ and $a[k]$, respectively. With a slight abuse of notation, $x[k]=x(t_k)$, where $t_k$ is the $k$-th transition epoch of the continuous-time process. Similarly, the action is given by $a[k]=(\lambda[k],\mu[k])$, where $\lambda[k]\in\{0,\bar{\lambda}\}^3$ and $\mu[k]\in\prod_{n\in\mathcal{N}}\{0,\bar{\mu}_n\}^{m_n}$ are the route-specific arrival rates and service rates, respectively.

As indicated in Section 2, both the routing and the scheduling decisions can be parameterized via the design variables $\beta$ and $\gamma$; by Theorem 1, $\beta$ and $\gamma$ should satisfy (2). Thus, the action is given by the policy

$$a[k]=\pi(x[k];\beta,\gamma).$$

The transition probability $p(x'|x,a)$ of the DT process can be derived from the network model defined in Section 2 in a straightforward manner. The one-step cost of the MDP is given by

$$u[k]=\|x[k]\|_1(t_k-t_{k-1}).$$

The total cost over one episode of the MDP is thus given by

$$U_\pi^K(x)=\mathrm{E}_\pi\Big[\sum_{\ell=1}^{K}\|x[\ell]\|_1(t_\ell-t_{\ell-1})\,\Big|\,x[0]=x\Big],$$

where $K\in\mathbb{Z}_{>0}$ is the length (i.e., number of transition epochs) of one episode.

4.1 GSP PI algorithm

A closed-form expression for $U_\pi^K$ is not easy to obtain. Hence, we use the PL $Q$ functions that constitute (3) as a proxy for $U_\pi^K$. A key characteristic of our formulation is that $(\beta,\gamma)$ parameterizes both the policy $\pi(x;\beta,\gamma)$ and the PL functions $\gamma_p Q_p(x;\beta)$. Although this may lead to a gap with respect to the ground-truth optimal solution, it significantly simplifies the computation and, importantly, guarantees traffic stability.

The GSP PI algorithm performs two tasks in parallel: calibration of the model data $\bar{\lambda}$, $\bar{\mu}_n$ and optimization of the control parameters $\beta,\gamma$. The latter depends on the former, since $\Delta_G$ in (2) depends on the model data. Let $\hat{\lambda}^{(i)},\hat{\mu}_n^{(i)}$ be the estimates of $\bar{\lambda},\bar{\mu}_n$ after $i$ iterations (i.e., episodes), $\Delta_G^{(i)}$ be the corresponding $\Delta_G$, and $\beta^{(i)},\gamma^{(i)}$ be the control parameters after $i$ iterations. Then, the specific steps of the GSP PI algorithm can be summarized as follows:

  1. Initialize with $\hat{\lambda}^{(0)}\in\mathbb{R}_{>0}$, $\hat{\mu}_n^{(0)}\in\mathbb{R}_{>0}$ for each server $n$, $\beta^{(0)}\in(1,m^{\frac{1}{2+\Delta_G}})$, $\gamma^{(0)}\in(1,\beta^{(0)})$.

  2. Policy evaluation: Simulate a Monte-Carlo episode of the network model with the policy $\pi(x;\beta^{(i-1)},\gamma^{(i-1)})$. Update the estimates to obtain $\hat{\lambda}^{(i)},\hat{\mu}_n^{(i)}$. Update the parameters $\beta^{(i)},\gamma^{(i)}$. (See below for how the updates are done.)

  3. Policy improvement: Use $\beta^{(i)},\gamma^{(i)}$ to obtain a new policy $\pi(x;\beta^{(i)},\gamma^{(i)})$.

  4. Terminate when $|\beta^{(i)}-\beta^{(i-1)}|$ and $|\gamma^{(i)}-\gamma^{(i-1)}|$ are smaller than some thresholds.

The rest of this subsection is devoted to the update of $(\beta,\gamma)$. Given $(\beta,\gamma)$, suppose that a job arrives at epoch $k$ and takes path $p$ according to policy $\pi(x;\beta,\gamma)$. Then, we track the system time $W[k]$ of this particular job (this notation should cause no confusion, since at most one Poisson arrival can occur at an epoch), and use the squared error

$$\mathrm{SE}_{(\beta,\gamma)}[k]=\mathbb{I}_{\{W[k]>0\}}\Big(\gamma_{p[k]}Q_{p[k]}(x[k];\beta)-W[k]\Big)^2$$

to update $(\beta,\gamma)$, where $p[k]$ is the path that $\pi(x[k];\beta,\gamma)$ selects. Note that $W[k]$ records the time that a job spends in the continuous-time setting; we define $W[k]=0$ if no arrival occurs at epoch $k$, so $\mathbb{I}_{\{W[k]>0\}}$ indicates whether there is an arrival at epoch $k$. $(\beta,\gamma)$ are obtained by minimizing the sum-of-squares error:

$$(\beta^*,\gamma^*)=\mathop{\arg\min}_{(\beta,\gamma)\ \text{satisfying}\ (2)}\sum_{k=1}^{K}\mathrm{SE}_{(\beta,\gamma)}[k]. \qquad (7)$$

Note that the PL function actually approximates the job-specific system time $W$ rather than the total system time $U_\pi^K$; the two are related via

$$\lim_{K\to\infty}\Bigg\{\frac{U_\pi^K(x)}{t_K}-\frac{\sum_{k=1}^K W[k]}{\sum_{k=1}^K\mathbb{I}_{\{W[k]>0\}}}\Bigg\}=0\quad a.s.$$

4.1.1 Estimation of $\bar{\lambda}$, $\bar{\mu}_n$

In the $i$th iteration, we run a Monte-Carlo simulation of the network model. The simulation uses a discrete time step that is sufficiently small to approximate the continuous-time process; i.e., we use Bernoulli processes with a sufficiently small time increment to approximate the Poisson processes. We track every inter-arrival time at the origin and compute the average to obtain $1/\hat{\lambda}^{(i)}$. We also track every service time at each server $n$ and compute the average to obtain $1/\hat{\mu}_n^{(i)}$.
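
A minimal sketch of this calibration step (our own illustration, with a hypothetical event-log format; for brevity it samples exponential variates directly rather than simulating the Bernoulli-increment approximation described above) is:

```python
# Rate calibration sketch: lambda_hat and mu_hat_n are the reciprocals of the
# sample means of the observed inter-arrival and service times.
import numpy as np

def estimate_rates(arrival_times, service_times):
    """arrival_times: sorted arrival epochs at the origin;
    service_times: dict mapping server n to an array of realized service times."""
    lam_hat = 1.0 / np.mean(np.diff(arrival_times))
    mu_hat = {n: 1.0 / np.mean(s) for n, s in service_times.items()}
    return lam_hat, mu_hat

# Example with synthetic exponential data (rates as in Section 4.3):
rng = np.random.default_rng(0)
arrivals = np.cumsum(rng.exponential(1 / 0.2, size=5000))
services = {n: rng.exponential(1 / r, size=2000)
            for n, r in {1: 0.15, 2: 0.1, 3: 0.25, 4: 0.15, 5: 0.2}.items()}
lam_hat, mu_hat = estimate_rates(arrivals, services)
print(round(lam_hat, 3), {n: round(v, 3) for n, v in mu_hat.items()})
```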

4.1.2 Update of $\beta$

In the $i$th iteration, we record $\{x[k],p[k],W[k]\}$ whenever $W[k]>0$. Let $D^{(i)}$ be the resulting table of $\{x[k],p[k],W[k]\}$. We determine $\beta^{(i)}$ by a method similar to the least-squares partition algorithm in [24]. We first partition $D^{(i)}$ according to the boundaries of the PL functions $\{Q_p(x;\beta')\}$ with $\beta'=\beta^{(i-1)}$. Then we use the L-BFGS-B algorithm [25] to calculate a better $\beta'$ that minimizes the sum-of-squares error defined in (7), with $\gamma$ fixed at $\gamma^{(i-1)}$. This leads to a new partition according to $\beta'$. We repeat this procedure until the partition stabilizes or a maximum number of iterations is reached, and the last $\beta'$ is taken as $\beta^*$ of this episode.

4.1.3 Update of $\gamma$

After $\beta^{(i)}$ is obtained, we can determine the relation between $\gamma_p$ and $\gamma$ for each record in $D^{(i)}$. To be specific, we determine the combination of bottlenecks $B$ at state $x[k]$ based on $\beta^{(i)}$, and $\gamma_p(B)$ is given by (1). Then, we again use the L-BFGS-B algorithm to calculate the $\gamma^*$ that minimizes the sum-of-squares error defined in (7) with $\beta$ fixed at $\beta^{(i)}$. Once the update of $\gamma$ is completed, $\gamma^*$ of this iteration is set to be $\gamma^{(i)}$. The $i$th iteration of the updates of $\beta$ and $\gamma$ is summarized in Algorithm 1.

Algorithm 1 Computation of $\beta^{(i)}$ and $\gamma^{(i)}$
Input: $D^{(i)},\hat{\lambda}^{(i)},\hat{\mu}_n^{(i)},\beta^{(i-1)},\gamma^{(i-1)}$
1: partition the data in $D^{(i)}$ with $\beta'\leftarrow\beta^{(i-1)}$
2: repeat
3:     update $\beta'$ based on the current partition (minimizing (7) with $\gamma=\gamma^{(i-1)}$) and re-partition
4: until the partition stabilizes, yielding $\beta^*$
5: $\beta^{(i)}\leftarrow\beta^*$
6: determine the form of $\gamma_p$ for each record in $D^{(i)}$ using $\beta^{(i)}$
7: $\gamma^{(i)}\leftarrow\gamma^*$, the minimizer of (7) with $\beta=\beta^{(i)}$
Output: $\beta^{(i)}$, $\gamma^{(i)}$
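
A compact Python sketch of Algorithm 1 (our own rendering under stated assumptions, not the authors' implementation) could look as follows. The helpers `q_fn(x, p, beta)` and `gamma_fn(x, p, beta, gamma)` are assumed to evaluate $Q_p(x;\beta)$ and $\gamma_p(B)$ (e.g., built from the earlier sketches), and the explicit partition bookkeeping of [24] is folded into re-evaluating the PL maxima inside `q_fn` at the current $\beta$.

```python
# Sketch of Algorithm 1: alternating least-squares fit of (beta, gamma) on
# episode data D = [(x, p, W), ...], using scipy's L-BFGS-B within the box
# constraints implied by (2).
from scipy.optimize import minimize

def sse(beta, gamma, D, q_fn, gamma_fn):
    """Sum-of-squares error (7) over the recorded (state, path, system time) data."""
    return sum((gamma_fn(x, p, beta, gamma) * q_fn(x, p, beta) - W) ** 2
               for (x, p, W) in D)

def fit_beta_gamma(D, q_fn, gamma_fn, beta0, gamma0, m, delta_G, max_rounds=20):
    beta_hi = m ** (1.0 / (2 + delta_G)) - 1e-6        # upper bound on beta from (2)
    beta, gamma = beta0, gamma0
    for _ in range(max_rounds):                         # re-partitioning loop on beta
        res = minimize(lambda b: sse(b[0], gamma, D, q_fn, gamma_fn),
                       x0=[beta], method="L-BFGS-B",
                       bounds=[(1.0 + 1e-6, beta_hi)])
        if abs(float(res.x[0]) - beta) < 1e-4:          # partition has stabilized
            beta = float(res.x[0])
            break
        beta = float(res.x[0])
    res = minimize(lambda g: sse(beta, g[0], D, q_fn, gamma_fn),
                   x0=[gamma], method="L-BFGS-B",
                   bounds=[(1.0 + 1e-6, beta - 1e-6)])  # 1 < gamma < beta
    return beta, float(res.x[0])
```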

4.2 Benchmark policies and algorithms

To evaluate the performance of the GSP PI algorithm, we consider three benchmarks:

4.2.1 Neural network (NN) PI

We consider the action-value function $\Theta(x,a)$ and use an NN to approximate it. Initially, we obtain the first sequence of state-action-reward tuples $\{x[k],a[k],r[k]\}$ and set

$$\theta_0(x[k],a[k])=\sum_{k'=k}^{K}l^{(k'-k)}r_{k'},\qquad \Theta_0=\theta_0$$

for the initial iteration after a Monte-Carlo episode, where $l$ is a discount factor on future rewards. The value is updated as

$$\Theta_i(x[k],a[k])=\Theta_{i-1}(x[k],a[k])+\big(\theta_i(x[k],a[k])-\Theta_{i-1}(x[k],a[k])\big)/i$$

in iteration $i$, $i\in\mathbb{N}^*$. The current policy is then evaluated by a designed NN, which has two convolutional layers, two max-pooling layers, and two fully connected layers. We adopt the rectified linear unit (ReLU) activation function at each layer to avoid gradient explosion and vanishing. The weights of the NN are updated by the adaptive moment estimation (Adam) algorithm [26] to narrow the gap between the predicted and computed state-action values. Finally, we use the NN approximation of $\Theta$ for policy improvement via the $\epsilon$-greedy algorithm. Since an exact optimal policy of the original MDP is not readily available, the policy computed by NN PI is used as an approximate optimal policy.

4.2.2 Simple shortest-path (SSP) policy

For routing and scheduling decisions under the SSP policy, we simply calculate the cost of each path as $\Phi(x,p)=\sum_{n\in p}x_n^p$ and select the path with the smallest $\Phi(x,p)$ value.

4.2.3 Optimal Bernoulli (OB) policy

A Bernoulli policy routes any incoming job to a path $p\in\mathcal{P}$ with a time-invariant probability $\eta_p$ such that $\sum_{p\in\mathcal{P}}\eta_p=1$. The routing probabilities $\eta_p$ are then optimized using a simple numerical search scheme.
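
One possible form of such a search (our own sketch; `simulate_average_system_time` is a hypothetical placeholder standing in for a simulation of the network in Fig. 1 under fixed routing probabilities, and is not defined here) is a grid search over the probability simplex:

```python
# Sketch of a numerical search for the OB policy over eta = (eta_12, eta_135, eta_45).
import itertools

def optimize_bernoulli(simulate_average_system_time, step=0.02):
    """Grid search over the probability simplex; returns the best eta and its cost."""
    best_eta, best_cost = None, float("inf")
    grid = [round(i * step, 4) for i in range(int(1 / step) + 1)]
    for e12, e135 in itertools.product(grid, grid):
        e45 = 1.0 - e12 - e135
        if e45 < -1e-9:
            continue                                   # outside the simplex
        cost = simulate_average_system_time((e12, e135, max(e45, 0.0)))
        if cost < best_cost:
            best_eta, best_cost = (e12, e135, max(e45, 0.0)), cost
    return best_eta, best_cost
```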

4.3 Evaluation and comparison

We compare our GSP policy with the three benchmarks according to the following performance metrics:

  1) Computational efficiency: training time and number of iterations (if applicable).

  2) Implementation efficiency: the average system time experienced by all jobs that go through the network.

Consider the network in Fig. 1 and suppose $\bar{\mu}_1=0.15$, $\bar{\mu}_2=0.1$, $\bar{\mu}_3=0.25$, $\bar{\mu}_4=0.15$, $\bar{\mu}_5=0.2$, $\bar{\lambda}=0.2$, all in units of sec$^{-1}$. Then we have $m=1.50$, $\Delta_G=0$ for this set of parameters. For GSP PI, we initialize the estimates of $\bar{\mu}_n$ with 0.5 and the estimate of $\bar{\lambda}$ with 0.1. The OB policy turns out to be $\eta^*=[0.28,0.20,0.52]^T$. For simulation, a discrete time step of 0.1 sec is used. All experiments were implemented in Google Colab [27], using an Intel(R) Xeon(R) CPU with 12.7 GB memory. NN PI was trained on an NVIDIA GeForce RTX 2080 Ti with 40 GB memory.

The computational efficiency of our GSP policy is significantly better than that of the general NN algorithm, with a 97.8%-99.8% reduction in training time, as demonstrated in Fig. 2(a).

Figure 2: Average system time comparison of GSP PI and the NN algorithm (axis scales are uneven due to the wide range of values). (a) Average system time as the learning time grows. (b) Average system time as the number of iterations increases.

Our GSP PI algorithm converges within a few seconds, while NN PI takes more than twelve hours to converge. This is expected, since the PL functions have simple structures, while the NN model requires sophisticated training.

Table 1 lists the normalized average system times (NAST) under the various methods. The results are generated from test simulations of $5\times 10^6$ sec. Although NN PI performs better after long training, as expected, GSP PI performs better with just a small number of iterations, as demonstrated in Fig. 2(b), and attains a 1.9%-6.7% optimality gap with respect to NN PI.

Table 1: Normalized Average System Times (NAST) of various schemes

Algorithm                          NAST
Neural network (NN)                1.00
Generalized shortest-path (GSP)    1.03
Simple shortest-path (SSP)         1.12
Optimal Bernoulli (OB)             1.25

Under the SSP and OB policies, jobs spend more time going through the queuing network, with averages of 35.04 sec and 39.27 sec, respectively. Although the implementation efficiency of GSP PI is slightly worse than that of NN PI, GSP PI gives the best trade-off between computational efficiency and implementation efficiency: the average system times of the NN and GSP methods are 31.30 sec and 32.24 sec, which implies a modest optimality gap of about 3%.

5 Concluding remarks

In this article, we study a class of control policies for a single-origin-single-destination network, which we call the generalized shortest-path policy. We design a set of piecewise-linear functions as a proxy for the travel cost on each path of the network. We use the piecewise-linear functions to construct a Lyapunov function and derive a theoretical guarantee on the stability of the traffic state. We also develop a policy iteration algorithm to learn the parameters of the proposed policy. We then implement the algorithm and compare it with three benchmarks, including a standard neural network-based algorithm. The experimental results show that our policy and PI algorithm perform efficiently and reliably while significantly reducing the computational cost. Future work may include (i) extension of the theoretical results and the PI algorithm to general single-origin-single-destination acyclic networks and (ii) refinement of the PI algorithm into more efficient temporal-difference algorithms.

References

  • [1] Debasis Mitra and Judith B Seery. Comparative evaluations of randomized and dynamic routing strategies for circuit-switched networks. IEEE Transactions on Communications, 39(1):102–116, 1991.
  • [2] D Down and Sean P Meyn. Piecewise linear test functions for stability and instability of queueing networks. Queueing Systems, 27(3-4):205–226, 1997.
  • [3] Murat Alanyali and Bruce Hajek. Analysis of simple algorithms for dynamic load balancing. Mathematics of Operations Research, 22(4):840–871, 1997.
  • [4] Li Jin and Saurabh Amin. Stability of fluid queueing systems with parallel servers and stochastic capacities. IEEE Transactions on Automatic Control, 63(11):3948–3955, 2018.
  • [5] PR Kumar and Sean P Meyn. Stability of queueing networks and scheduling policies. IEEE Transactions on Automatic Control, 40(2):251–260, 1995.
  • [6] G. Foschini and Jack Salz. A basic dynamic routing problem and diffusion. IEEE Transactions on Communications, 26(3):320–327, 1978.
  • [7] Anthony Ephremides, P Varaiya, and Jean Walrand. A simple dynamic routing problem. IEEE Transactions on Automatic Control, 25(4):690–693, 1980.
  • [8] L Tassiulas and A Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, 1992.
  • [9] Zubair Md Fadlullah, Fengxiao Tang, Bomin Mao, Nei Kato, Osamu Akashi, Takeru Inoue, and Kimihiro Mizutani. State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s intelligent network traffic control systems. IEEE Communications Surveys & Tutorials, 19(4):2432–2455, 2017.
  • [10] Sean P. Meyn and R. L. Tweedie. Stability of Markovian processes III: Foster-Lyapunov criteria for continuous-time processes. Advances in Applied Probability, 25(3):518–548, 1993.
  • [11] Dimitri Bertsekas. Dynamic Programming and Optimal Control: Volume I, volume 1. Athena Scientific, 2012.
  • [12] Jim G Dai and Sean P Meyn. Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Transactions on Automatic Control, 40(11):1889–1904, 1995.
  • [13] Christopher V Hollot, Vishal Misra, Donald Towsley, and Weibo Gong. Analysis and design of controllers for AQM routers supporting TCP flows. IEEE Transactions on Automatic Control, 47(6):945–959, 2002.
  • [14] Leonidas Georgiadis, Michael J Neely, Leandros Tassiulas, et al. Resource allocation and cross-layer control in wireless networks. Foundations and Trends® in Networking, 1(1):1–144, 2006.
  • [15] Abhishek Sinha and Eytan Modiano. Optimal control for generalized network-flow problems. IEEE/ACM Transactions on Networking, 26(1):506–519, 2017.
  • [16] Jianan Zhang, Abhishek Sinha, Jaime Llorca, Antonia M Tulino, and Eytan Modiano. Optimal control of distributed computing networks with mixed-cast traffic flows. IEEE/ACM Transactions on Networking, 29(4):1760–1773, 2021.
  • [17] Keith W Ross. Optimal dynamic routing in markov queueing networks. Automatica, 22(3):367–370, 1986.
  • [18] Quan-Lin Li, Jing-Yu Ma, Rui-Na Fan, and Li Xia. An overview for markov decision processes in queues and networks. Stochastic Models in Reliability, Network Security and System Safety: Essays Dedicated to Professor Jinhua Cao on the Occasion of His 80th Birthday, pages 44–71, 2019.
  • [19] Zhiyuan Xu, Jian Tang, Jingsong Meng, Weiyi Zhang, Yanzhi Wang, Chi Harold Liu, and Dejun Yang. Experience-driven networking: A deep reinforcement learning based approach. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 1871–1879. IEEE, 2018.
  • [20] Guillermo Bernárdez, José Suárez-Varela, Albert López, Bo Wu, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, and Albert Cabellos-Aparicio. Is machine learning ready for traffic engineering optimization? In 2021 IEEE 29th International Conference on Network Protocols (ICNP), pages 1–11. IEEE, 2021.
  • [21] Bai Liu, Qiaomin Xie, and Eytan Modiano. RL-QN: A reinforcement learning framework for optimal control of queueing systems. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 7(1):1–35, 2022.
  • [22] Qian Xie and Li Jin. Stabilizing queuing networks with model data-independent control. IEEE Transactions on Control of Network Systems, 9(3):1317–1326, 2022.
  • [23] Robert G Gallager. Stochastic Processes: Theory for Applications. Cambridge University Press, 2013.
  • [24] Alessandro Magnani and Stephen P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10:1–17, 2009.
  • [25] Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] Google Colaboratory. https://colab.research.google.com/. Accessed: 2023-03-28.