
[JI] UM Joint Institute, Shanghai Jiao Tong University, Shanghai, China. [SEIEE] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. [PKU] School of Electronics, Peking University, Beijing, China. [UMJI] UM Joint Institute and School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China.

Learning-Based Adaptive Dynamic Routing with Stability Guarantee for a Single-Origin-Single-Destination Network

Yidan Wu [JI], Feixiang Shu [SEIEE], Jianan Zhang [PKU], Li Jin [UMJI]
Abstract

We consider learning-based adaptive dynamic routing for a single-origin-single-destination queuing network with stability guarantees. Specifically, we study a class of generalized shortest path policies that can be parameterized by only two constants via a piecewise-linear function. Using the Foster-Lyapunov stability theory, we develop a criterion on the parameters to ensure mean boundedness of the traffic state. Then, we develop a policy iteration algorithm that learns the parameters from realized sample paths. Importantly, the piecewise-linear function is both integrated into the Lyapunov function for stability analysis and used as a proxy of the value function for policy iteration; hence, stability is inherently ensured for the learned policy. Finally, we demonstrate via a numerical example that the proposed algorithm learns a near-optimal routing policy with an acceptable optimality gap but significantly higher computational efficiency compared with a standard neural network-based algorithm.

keywords:
Dynamic routing, queuing networks, reinforcement learning, stability
Footnote: This work was in part supported by National Natural Science Foundation of China Project 62103260, SJTU UM Joint Institute, and the J. Wu & J. Sun Foundation.

1 Introduction

Dynamic routing is a control scheme with extensive applications in network systems, including communications, manufacturing, and transportation [1, 2, 3, 4]. A primary objective of dynamic routing is to ensure stability, i.e., the boundedness of the long-time average of the traffic state. The typical tool for this class of problems is the Lyapunov function method [5]. On top of this, a secondary objective is to minimize the system time experienced by each job. Queuing theorists have made efforts to characterize the limiting distribution of the traffic state, but such results are usually complicated and non-scalable [6, 7, 8]. Recently, learning-based approaches have been proposed for dynamic routing. However, such approaches are usually purely numerical, without explicit theoretical guarantees on traffic stability [9].

Motivated by the above challenge, we consider a dynamic routing scheme for single-origin-single-destination queuing networks that integrates Lyapunov function methods [10] and approximate dynamic programming [11]. We use a class of piecewise-linear (PL) functions that are increasing in traffic states to measure the travel cost of a path. An incoming job is routed to the path with the smallest cost, so the routing scheme can be viewed as a generalized shortest-path (GSP) strategy. We use the PL functions to construct a Lyapunov function for stability analysis, and we show that the resultant routing scheme is throughput-optimal. We also use the PL functions as proxies for action-value functions to obtain a policy iteration algorithm. This algorithm computes a suboptimal routing scheme that is guaranteed to be stabilizing. We also show that our scheme gives comparable performance (i.e., average system time) with respect to neural network-based methods but with a significant reduction in computational workload. This paper focuses on the bridge network in Fig. 1,

Figure 1: A single-origin-single-destination network.

but our modeling, analysis, and computation approaches also provide hints for general networks.

Extensive results have been developed for traffic stabilization and throughput maximization [5, 12, 13, 14], which provide a basis for this paper. In terms of travel cost minimization, previous results are typically either restricted to simple scenarios (e.g., parallel queues), based on numerical simulation [15, 16], or based on optimization methods without traffic stability analysis [17, 18]. In particular, reinforcement learning methods have been proposed to compute optimal/suboptimal solutions to dynamic routing problems [19, 20]. Although learning methods usually give excellent performance, they focus on algorithm design and do not emphasize theoretical guarantees for stability/optimality. In addition, generic learning methods may be computationally inefficient for dynamic routing. In particular, Liu et al. proposed a mixed scheme that uses learning over a bounded set of states and applies a known stabilizing policy for the other states [21]; this notion of integrating learning with a stability guarantee is aligned with the idea of this paper, but we achieve the integration in a different manner.

We formulate our problem on the single-origin-single-destination queuing network in Fig. 1. Jobs arrive at the source node as a Poisson process. Each server has exponentially distributed service times. When a job arrives, the system operator assigns a path to the job, which will not be changed afterwards. Hence, every job belongs to a class that corresponds to the assigned path. The key of our routing scheme is a set of PL functions that approximate the expected travel times on each path. Such PL functions were introduced by Down and Meyn [10] to show positive Harris recurrence of the traffic state. Our previous work proposed an explicit construction of the parameters of the PL functions according to network topology, without solving model data-dependent inequalities [22]. In this paper, we further utilize the PL functions to bound the mean of the traffic state and to obtain our GSP routing policy.

We show that our GSP policy maximizes network throughput in the sense that it is stabilizing as long as the demand at the source is less than the min-cut capacity of the network (Theorem 1). The proof utilizes a critical insight about the PL functions: each linear regime corresponds to a particular configuration of bottlenecks. We construct a piecewise-quadratic Lyapunov function using the PL functions and show that the mean drift of the Lyapunov function is negative in heavy-traffic regimes. The drift condition also leads to a set of constraints on the PL parameters.

We then develop a policy iteration algorithm that learns the PL parameters (and thus the GSP policy). The policy iteration algorithm uses the PL functions as a proxy (as opposed to an immediate approximation) of the action-value functions; this connects the learning-based routing decisions with the above-mentioned theoretical properties. In each iteration, the current policy is evaluated using the PL function after a Monte-Carlo episode and improved by greedifying over the PL functions; the evaluation is done by an algorithm specifically designed for the PL functions. We compare our policy iteration algorithm with a standard neural-network dynamic programming algorithm. The implementation results indicate that our approach can attain a 1.9%-6.7% optimality gap with respect to the neural-network method with a 97.8%-99.8% reduction in training time.

In summary, our contributions are:

  1. A simple, insightful, and easy-to-implement dynamic routing scheme, i.e., the GSP policy.

  2. A theoretical guarantee on traffic stabilization and throughput maximization of the GSP policy over the bridge network.

  3. An efficient policy iteration algorithm for the GSP policy that attains comparable control performance but significantly reduces training time compared with neural network-based methods.

The rest of this paper is organized as follows. Section 2 introduces the network model and formulates the dynamic routing problem. Section 3 develops the stability guarantee for the GSP policy. Section 4 demonstrates the training algorithm for the GSP policy. Section 5 gives the concluding remarks.

2 Modeling and formulation

Consider the single-origin-single-destination (SOSD) network of queuing servers with infinite buffer sizes in Fig. 1. Let $\mathcal{N}=\{1,2,3,4,5\}$ be the set of servers. Each server $n$ has an exponential service rate $\bar{\mu}_n$ and job count $x_n(t)$ at time $t$, $t\in\mathbb{R}_{\geq 0}$. Jobs arrive at origin $S$ according to a Poisson process of rate $\bar{\lambda}>0$. There are three paths from the origin to the destination, $\mathcal{P}=\{(1,2),\ (1,3,5),\ (4,5)\}$. We write $n\in p$ if server $n$ is on path $p$.

When a job arrives, it is routed to one of the three paths according to a routing policy. The path of a job cannot be changed after it is decided at the origin. Thus, we divide the jobs into three classes according to their assigned paths. We use $x_n^p(t)$ to denote the number of jobs queuing in server $n$ and going on path $p\in\mathcal{P}$ at time $t$. For each server we have $x_n(t)=\sum_{p:\,n\in p}x_n^p(t)$. Thus we can define the state of the network as the vector $x=[x_1^{12},x_1^{135},x_2^{12},x_3^{135},x_4^{45},x_5^{135},x_5^{45}]^T$, with state space $\mathcal{X}=\mathbb{Z}_{\geq 0}^7$.

We consider two sets of actions, viz. 1) routing and 2) scheduling. Routing refers to determining the path of each job upon its arrival, which only occurs at $S$. The routing action can be formulated as a vector

$$\lambda=[\lambda_p]_{p\in\mathcal{P}}=\begin{bmatrix}\lambda_{12}\\ \lambda_{135}\\ \lambda_{45}\end{bmatrix}\in\{0,\bar{\lambda}\}^3.$$

Scheduling refers to selecting a job from the queue to serve, which only occurs at Servers 1 and 5; preemptive priority is assumed at both servers. The scheduling action can also be formulated as a vector

$$\mu=[\mu_n^p]_{p\ni n,\,n\in\mathcal{N}}=\begin{bmatrix}\mu_1^{12}\\ \mu_1^{135}\\ \mu_2^{12}\\ \mu_3^{135}\\ \mu_4^{45}\\ \mu_5^{135}\\ \mu_5^{45}\end{bmatrix}\in\prod_{n\in\mathcal{N}}\{0,\bar{\mu}_n\}^{m_n},$$

where $m_n$ is the number of paths through server $n$.

This paper focuses on a particular class of control policies which we call the generalized shortest path (GSP) policy, based on the following piecewise-linear (PL) functions of the traffic state:

$$\begin{aligned}
Q_{12}(x)&=\max\{\beta^2 x_1^{12},\ \beta(x_1^{12}+x_2^{12})\},\\
Q_{135}(x)&=\max\{\beta^2 x_1^{135},\ \beta(x_1^{135}+x_3^{135}),\ x_1^{135}+x_3^{135}+x_5^{135}\},\\
Q_{45}(x)&=\max\{\beta^2 x_4^{45},\ \beta(x_4^{45}+x_5^{45})\},
\end{aligned}$$

where $\beta>1$ is a design parameter. We then define the notion of bottlenecks as follows:

Definition 1 (Bottleneck)

Given $x\in\mathcal{X}$, a server $n\in p$ is a bottleneck of path $p$ if

  1. $\frac{\partial_+}{\partial x_n}Q_p(x)>0$, where $\partial_+$ denotes the right derivative, and

  2. either $n$ has no downstream server on $p$ or the immediately downstream server $n'$ satisfies $\frac{\partial_+}{\partial x_{n'}}Q_p(x)=0$.

Intuitively, a bottleneck is a server that immediately contributes to $Q_p$, while the immediately downstream server, if it exists, does not. Let $B=(b_{12},b_{135},b_{45})$ be a combination of bottlenecks, where $b_{12}\in\{1,2\}$, $b_{135}\in\{1,3,5\}$, $b_{45}\in\{4,5\}$, and write $n\in B$ if $n$ is a bottleneck in combination $B$. Let $\mathcal{B}$ be the set of possible combinations, which has 12 elements, each corresponding to a subset of $\mathcal{X}$ over which $\{Q_p\}$ are all linear in $x$.
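
To make these definitions concrete, the following sketch (a minimal illustration of our own; the state encoding and helper names are not part of the original formulation) evaluates the three PL functions and identifies the bottleneck of each path for a given state and $\beta$, following the right-derivative rule above (ties favor the downstream server).

```python
# Minimal sketch: PL path costs Q_p(x; beta) and per-path bottlenecks
# for the bridge network of Fig. 1. The state dictionary keys are our own encoding.

def q_functions(x, beta):
    """Return the PL costs Q_12, Q_135, Q_45 of the three paths."""
    q12 = max(beta**2 * x["x1_12"], beta * (x["x1_12"] + x["x2_12"]))
    q135 = max(beta**2 * x["x1_135"],
               beta * (x["x1_135"] + x["x3_135"]),
               x["x1_135"] + x["x3_135"] + x["x5_135"])
    q45 = max(beta**2 * x["x4_45"], beta * (x["x4_45"] + x["x5_45"]))
    return {"12": q12, "135": q135, "45": q45}

def bottlenecks(x, beta):
    """Bottleneck of each path: the furthest-downstream server whose queue
    still contributes to the active linear piece (right-derivative rule)."""
    b = {}
    # Path (1,2): piece beta^2*x1 strictly active -> bottleneck 1, otherwise 2.
    b["12"] = 1 if beta**2 * x["x1_12"] > beta * (x["x1_12"] + x["x2_12"]) else 2
    # Path (1,3,5): among the three pieces, ties favor the downstream server.
    pieces = [beta**2 * x["x1_135"],
              beta * (x["x1_135"] + x["x3_135"]),
              x["x1_135"] + x["x3_135"] + x["x5_135"]]
    top = max(pieces)
    b["135"] = [1, 3, 5][max(i for i, v in enumerate(pieces) if v == top)]
    # Path (4,5): piece beta^2*x4 strictly active -> bottleneck 4, otherwise 5.
    b["45"] = 4 if beta**2 * x["x4_45"] > beta * (x["x4_45"] + x["x5_45"]) else 5
    return b

if __name__ == "__main__":
    x = {"x1_12": 1, "x2_12": 3, "x1_135": 0, "x3_135": 2, "x5_135": 4,
         "x4_45": 2, "x5_45": 1}
    print(q_functions(x, beta=1.2))   # {'12': 4.8, '135': 6.0, '45': 3.6}
    print(bottlenecks(x, beta=1.2))   # {'12': 2, '135': 5, '45': 5}
```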

The GSP policy essentially sends an incoming job to the path with the smallest $Q$ function. However, the $Q$ functions only capture the traffic states and do not consider the influence of the service rates, especially the service rate of the current bottleneck. To account for this, we introduce a second parameter $\gamma>1$ that allows the GSP policy to prioritize the path with a higher bottleneck service rate. To be specific, for a combination $B\in\mathcal{B}$, we first sort the paths according to their bottleneck service rates:

$$\begin{aligned}
P_{1,B}&:=\{p\in\mathcal{P}:\ p\in\mathop{\arg\max}_{p'\in\mathcal{P}}\bar{\mu}_{b_{p'}}\},\\
P_{\gamma,B}&:=\{p\in\mathcal{P}:\ p\in\mathop{\arg\max}_{p'\in\mathcal{P}\setminus P_{1,B}}\bar{\mu}_{b_{p'}}\},\\
P_{\gamma^2,B}&:=\mathcal{P}\setminus(P_{1,B}\cup P_{\gamma,B}).
\end{aligned}$$

Then for $B\in\mathcal{B}$, define the weight of each path as

$$\gamma_p(B):=\begin{cases}1 & p\in P_{1,B},\\ \gamma & p\in P_{\gamma,B},\\ \gamma^2 & p\in P_{\gamma^2,B},\end{cases}\quad p\in\mathcal{P}. \qquad (1)$$

Now we are ready to formulate the GSP policy, with a slight abuse of notation, as follows. When there are no ties,

$$\begin{aligned}
\lambda_p(x)&=\bar{\lambda}\,\mathbb{I}\Big\{p=\mathop{\arg\min}_{p'\in\mathcal{P}}\gamma_{p'}Q_{p'}\Big\},\quad p\in\mathcal{P},\\
\mu_n^p(x)&=\bar{\mu}_n\,\mathbb{I}\Big\{p=\mathop{\arg\max}_{p':\,n\in B\cap p'}Q_{p'}\Big\},\quad n\in\{1,5\},\\
\mu_n^p(x)&=\bar{\mu}_n\,\mathbb{I}\{x_n>0\},\quad n\in\{2,3,4\};
\end{aligned}$$

ties are broken randomly.

The above joint routing-scheduling policy is called GSP because the PL $Q$ function is a generalized measurement of the "temporal length" of each path. Under this policy, server $n$ prioritizes the jobs of class $p$ if server $n$ is the bottleneck of path $p$. If server $n$ is the bottleneck of multiple paths, the GSP policy further prioritizes the jobs from the path with the larger $Q$ function. We also define

$$P^*=\mathop{\arg\min}_{p\in\mathcal{P}}\gamma_p Q_p(x);$$

note that $1\leq|P^*|\leq 3$.
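
For concreteness, the routing step can be sketched as follows (a minimal illustration of our own; the $Q$ values, bottleneck assignments, and rates below are hypothetical inputs that could be produced, e.g., by the previous sketch). Paths are ranked by their bottleneck service rate to obtain the weights $\gamma_p(B)$ of (1), and the arrival is sent to the path minimizing $\gamma_p Q_p$.

```python
# Sketch of the GSP routing decision: given the PL costs Q_p, the bottleneck of
# each path, and the service rates, compute gamma_p(B) and pick the path with
# the smallest weighted cost gamma_p * Q_p. Ties would be broken randomly.

def path_weights(bottleneck, mu_bar, gamma):
    """Assign weights 1, gamma, gamma^2 to paths ranked by bottleneck rate (descending)."""
    paths = list(bottleneck.keys())
    ranked = sorted(paths, key=lambda p: mu_bar[bottleneck[p]], reverse=True)
    rates = [mu_bar[bottleneck[p]] for p in ranked]
    weights, level = {}, 0
    for i, p in enumerate(ranked):
        if i > 0 and rates[i] < rates[i - 1]:
            level += 1                      # strictly smaller rate -> next weight level
        weights[p] = gamma ** min(level, 2)
    return weights

def gsp_route(Q, bottleneck, mu_bar, gamma):
    """Return the path chosen for an arriving job under the GSP policy."""
    w = path_weights(bottleneck, mu_bar, gamma)
    return min(Q, key=lambda p: w[p] * Q[p])

if __name__ == "__main__":
    mu_bar = {1: 0.15, 2: 0.1, 3: 0.25, 4: 0.15, 5: 0.2}   # illustrative rates
    Q = {"12": 4.8, "135": 6.0, "45": 3.6}                  # e.g., from the previous sketch
    bottleneck = {"12": 2, "135": 5, "45": 5}
    print(gsp_route(Q, bottleneck, mu_bar, gamma=1.1))      # -> "45"
```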

We say that the traffic in the network is stable if there exists $C<\infty$ such that for any initial condition,

$$\limsup_{t\rightarrow\infty}\frac{1}{t}\int_{0}^{t}\mathsf{E}[\|x(s)\|]\,ds<C.$$

We say the network is stabilizable if $\bar{\lambda}<\bar{\mu}$, where $\bar{\mu}$ is the min-cut capacity of the network. Note that $\bar{\lambda}<\bar{\mu}$ ensures the existence of a stabilizing Bernoulli routing policy.

3 Stability guarantee for GSP policy

In this section, we establish the stability condition for the bridge network under the GSP policy.

To state the main result, let $\mathcal{M}$ be the set of cuts of the network, and define

$$\begin{aligned}
m=\min\quad &\frac{\sum_{n\in M\setminus\tilde{M}}\bar{\mu}_n}{\bar{\lambda}-\sum_{n\in\tilde{M}}\bar{\mu}_n}\\
\mathrm{s.t.}\quad &\bar{\lambda}-\sum_{n\in\tilde{M}}\bar{\mu}_n>0,\quad \tilde{M}\subsetneq M,\quad M\in\mathcal{M};
\end{aligned}$$

one can show that $m>1$ if the network is stabilizable. Let $d_n$ be the number of links on the longest path from the origin to server $n$. For each $M\in\mathcal{M}$, let $\mu_{\max}^M=\max_{n\in M}\bar{\mu}_n$, $\mu_{\min}^M=\min_{n\in M}\bar{\mu}_n$, and define

$$G_1^M:=\sum_{n\in\arg\max_{\nu\in M}\bar{\mu}_\nu}d_n,\qquad G_2^M:=\min_{n\in\arg\min_{\nu\in M}\bar{\mu}_\nu}d_n,$$
$$G^M:=\mathrm{sgn}(\mu_{\max}^M-\mu_{\min}^M)\,(G_2^M-G_1^M)_+,$$

where $\mathrm{sgn}(\cdot)$ (resp. $(\cdot)_+$) is the sign (resp. positive part) function. Also define

$$\Delta_G:=\mathrm{sgn}\Big(\max_{M\in\mathcal{M}}G^M\Big).$$

Then, we can state the main result as follows:

Theorem 1

The bridge network is stable under the GSP policy if the network is stabilizable and

$$1<\gamma^{(2+\Delta_G)}<\beta^{(2+\Delta_G)}<m. \qquad (2)$$

The above result includes two key points. First, the GSP policy is stabilizing if the queuing network is stabilizable; i.e., the GSP policy maximizes the throughput of the network. Second, constraints on the parameters for the GSP policy are given, which determine the feasible set for the subsequent learning-based optimization (Section 4).

The parameters $m$ and $\Delta_G$ capture the relationship between the service rates and the arrival rate, while the design parameters $\beta$ and $\gamma$ characterize the GSP policy's "preference". Note that $\lim_{\bar{\lambda}\uparrow\bar{\mu}}m=1$, which means every server and every path is "equally emphasized" by the GSP policy as the network approaches saturation. The case $\Delta_G=1$ indicates that the path with the stronger service ability has its bottleneck closer to the origin. In this case the degree of "preference" should decrease and the constraints on $\beta$ and $\gamma$ should be stricter: although the other paths have weaker service ability, their bottlenecks are closer to the destination, so jobs on these paths reach the destination more quickly once they pass the bottleneck.
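
As a sanity check on these quantities, the following sketch (our own illustration; the cut enumeration and the depths $d_n$ are hard-coded by us for the topology of Fig. 1) computes $m$ and $\Delta_G$ by brute force. With the service and arrival rates of the numerical example in Section 4.3, it returns $m=1.5$ and $\Delta_G=0$, matching the values reported there.

```python
# Brute-force computation of m and Delta_G for the bridge network (ours).
from itertools import chain, combinations

mu = {1: 0.15, 2: 0.1, 3: 0.25, 4: 0.15, 5: 0.2}   # rates of the Section 4.3 example
lam = 0.2
cuts = [{1, 4}, {1, 5}, {2, 5}, {2, 3, 4}]          # cuts of the bridge network
d = {1: 1, 2: 2, 3: 2, 4: 1, 5: 3}                  # longest #links from origin to server n

def proper_subsets(M):
    s = sorted(M)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)))

def sgn(v):
    return (v > 0) - (v < 0)

# m: minimize residual cut capacity over residual demand, as in the definition above
ratios = [sum(mu[n] for n in M - set(Mt)) / (lam - sum(mu[n] for n in Mt))
          for M in cuts for Mt in proper_subsets(M)
          if lam - sum(mu[n] for n in Mt) > 0]
m = min(ratios)

# Delta_G: sign of the largest G^M over all cuts
G = []
for M in cuts:
    mu_max, mu_min = max(mu[n] for n in M), min(mu[n] for n in M)
    G1 = sum(d[n] for n in M if mu[n] == mu_max)
    G2 = min(d[n] for n in M if mu[n] == mu_min)
    G.append(sgn(mu_max - mu_min) * max(G2 - G1, 0))
delta_G = sgn(max(G))

print(round(m, 6), delta_G)   # 1.5 0 for these rates
```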

The proof of Theorem 1 uses a Lyapunov function constructed from the PL $Q$ functions:

$$V(x)=\sum_{p\in\mathcal{P}}\big(Q_p(x)\big)^2. \qquad (3)$$

The key technique is to utilize the connection between the locations of bottlenecks and the GSP policy to argue for the negative mean drift of the Lyapunov function. In particular, we show that there exist $\epsilon>0$, $C<\infty$ such that for all $x\in\mathcal{X}$,

$$\mathcal{L}V(x)\leq-\epsilon\|x\|_1+C, \qquad (4)$$

where $\mathcal{L}$ is the infinitesimal generator of the network model under the GSP policy. (4) implies the boundedness of the 1-norm of the traffic state and thus the stability of the network, according to the Foster-Lyapunov criterion [10].

Proof of Theorem 1. First, we consider the case where $Q_p=0$ for some $p\in\mathcal{P}$. Suppose $P^*=\{(1,2)\}$ and $Q_{12}=0$, which implies the servers on path $(1,2)$ are all empty. According to the GSP policy, $\lambda_{12}(x)=\bar{\lambda}$, $\lambda_{135}(x)=\lambda_{45}(x)=0$. New jobs are routed to path $(1,2)$, so this path only makes a finite positive contribution to the drift. Meanwhile, if paths $(1,3,5)$ and $(4,5)$ are not empty, they keep serving jobs and make a negative contribution to the drift:

$$\begin{aligned}
\mathcal{L}V(x)&=-\epsilon_{135}\sum_{n\in(1,3,5)}x_n^{135}-\epsilon_{45}\sum_{n\in(4,5)}x_n^{45}+C \qquad (5)\\
&<-\epsilon\sum_{n\in\mathcal{N}}x_n+C.
\end{aligned}$$

Thus the Lyapunov drift can only be negative or finitely positive when there are empty paths in the network; the cases $Q_{135}=0$ and $Q_{45}=0$ are analogous.

The rest of the proof is devoted to the case where $Q_p>0$ for all $p\in\mathcal{P}$. Specifically, we consider five qualitatively different cases.

Case 1: $B=(2,5,5)$.

In this case, the $Q$ functions take the forms

$$\begin{aligned}
Q_{12}&=\beta(x_1^{12}+x_2^{12}),\\
Q_{135}&=x_1^{135}+x_3^{135}+x_5^{135},\\
Q_{45}&=\beta(x_4^{45}+x_5^{45}).
\end{aligned}$$

We then argue according to the routing decision:

  i) $P^*=\{(1,2)\}$.

    Since $b_{135}=b_{45}=5$ and $b_{12}=2$, we have

    $$\begin{aligned}
    \gamma_{12}(B)=1,\ \gamma_{135}(B)=\gamma_{45}(B)=\gamma, &\quad \text{if } \bar{\mu}_{b_{12}}>\bar{\mu}_{b_{135}}=\bar{\mu}_{b_{45}},\\
    \gamma_{12}(B)=\gamma,\ \gamma_{135}(B)=\gamma_{45}(B)=1, &\quad \text{if } \bar{\mu}_{b_{135}}=\bar{\mu}_{b_{45}}>\bar{\mu}_{b_{12}}.
    \end{aligned}$$

    Then the newly arriving job goes to path $(1,2)$ when either

    $$1)\ \gamma Q_{12}<Q_{135},Q_{45}; \qquad (6a)$$
    $$2)\ Q_{12}<\gamma Q_{135},\gamma Q_{45}. \qquad (6b)$$

    The relevant service decisions are $\mu_2^{12}(x)=\bar{\mu}_2$ and $\mu_5^{45}(x)+\mu_5^{135}(x)=\bar{\mu}_5$, $\mu_5^{45}(x)\mu_5^{135}(x)=0$. Clearly (6a) implies $Q_{12}<Q_{135},Q_{45}$; in either case we have

    $$\begin{aligned}
    \mathcal{L}V(x)<\ & 2\beta^2(\bar{\lambda}-\mu_2^{12}(x))(x_1^{12}+x_2^{12})-2\beta^2\mu_5^{45}(x)(x_4^{45}+x_5^{45})\\
    &-2\mu_5^{135}(x)(x_1^{135}+x_3^{135}+x_5^{135})+C\\
    <\ & 2\beta(x_1^{12}+x_2^{12})\Big(\beta(\bar{\lambda}-\bar{\mu}_2)-\bar{\mu}_5/\gamma\Big)+C.
    \end{aligned}$$

    The drift is trivially negative when $\bar{\lambda}-\bar{\mu}_2\leq 0$, and it remains negative when $\bar{\lambda}-\bar{\mu}_2>0$, since (2) gives $1<\beta\gamma<\beta^2<m\leq\frac{\bar{\mu}_5}{\bar{\lambda}-\bar{\mu}_2}$. Thus (4) holds.

  ii) $P^*=\{(1,3,5)\}$.

    Similarly, the jobs go to path $(1,3,5)$ when $\gamma Q_{135}<Q_{12},\gamma Q_{45}$ or $Q_{135}<\gamma Q_{12},Q_{45}$. The drift satisfies

    $$\mathcal{L}V(x)<2(x_1^{135}+x_3^{135}+x_5^{135})(\bar{\lambda}-\beta\bar{\mu}_2/\gamma-\beta\bar{\mu}_5)+C.$$

  iii) $P^*=\{(4,5)\}$.

    $$\mathcal{L}V(x)<2\beta(x_4^{45}+x_5^{45})(\beta\bar{\lambda}-\beta\bar{\mu}_2/\gamma-\bar{\mu}_5)+C.$$

    Since $\gamma<\beta<\sqrt{m}$, the drift satisfies (4).

For the remaining cases, where there are ties and $|P^*|>1$, a job can only take a single path, so the argument is analogous.
Case 2: $B=(1,5,5)$.

When $P^*=\{(1,2)\}$, the maximal coefficient of $x_1^{12}$ arises when $\bar{\mu}_1>\bar{\mu}_5$ and $Q_{12}<\gamma Q_{135},\gamma Q_{45}$; the drift then becomes

$$\mathcal{L}V(x)<2\beta^2 x_1^{12}\Big(\beta^2(\bar{\lambda}-\bar{\mu}_1)-\bar{\mu}_5/\gamma\Big)+C.$$

Since $d_1=1$, $d_5=3$ and the two servers belong to the same cut, we have $\Delta_G=1$ and $1<\gamma<\beta<m^{\frac{1}{3}}$, which makes (4) hold. The proof for the other forms of $P^*$ is analogous to Case 1.
Case 3: $B=(2,5,4)$.

We only show the case where $P^*=\{(1,2)\}$; the other cases can be proved analogously. For the various possible values of $\gamma_p(B)$, the drift has the form

$$\mathcal{L}V(x)< 2\beta(x_1^{12}+x_2^{12})\Big(\beta(\bar{\lambda}-\bar{\mu}_2)-\bar{\mu}_5\frac{\gamma_{12}(B)}{\gamma_{135}(B)}-\beta^2\bar{\mu}_4\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)+C.$$

According to the GSP policy, the scaled ratios satisfy

$$\Big(\frac{\gamma_{12}(B)}{\gamma_{135}(B)},\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)\in\Big\{(1,1),\Big(\frac{1}{\gamma^2},\frac{1}{\gamma}\Big),\Big(\frac{1}{\gamma},\frac{1}{\gamma^2}\Big),\Big(\frac{1}{\gamma},1\Big),\Big(1,\frac{1}{\gamma}\Big),\Big(\frac{1}{\gamma},\frac{1}{\gamma}\Big)\Big\}.$$

Thus the possible maximal coefficient of $x_1^{12}+x_2^{12}$ arises when

$$\begin{aligned}
1)\ &\bar{\mu}_2>\bar{\mu}_4>\bar{\mu}_5,\ \Big(\frac{\gamma_{12}(B)}{\gamma_{135}(B)},\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)=\Big(\frac{1}{\gamma^2},\frac{1}{\gamma}\Big),\ Q_{12}<\gamma^2 Q_{135},\gamma Q_{45};\\
2)\ &\bar{\mu}_2>\bar{\mu}_5>\bar{\mu}_4,\ \Big(\frac{\gamma_{12}(B)}{\gamma_{135}(B)},\frac{\gamma_{12}(B)}{\gamma_{45}(B)}\Big)=\Big(\frac{1}{\gamma},\frac{1}{\gamma^2}\Big),\ Q_{12}<\gamma Q_{135},\gamma^2 Q_{45}.
\end{aligned}$$

We have

$$\begin{aligned}
\mathcal{L}V(x)<\ & C+2\beta(x_1^{12}+x_2^{12})\Big(\beta(\bar{\lambda}-\bar{\mu}_2)+\mathbb{I}_{\{\bar{\mu}_2>\bar{\mu}_4>\bar{\mu}_5\}}(-\bar{\mu}_5/\gamma^2-\beta^2\bar{\mu}_4/\gamma)\\
&+\mathbb{I}_{\{\bar{\mu}_2>\bar{\mu}_5>\bar{\mu}_4\}}(-\bar{\mu}_5/\gamma-\beta^2\bar{\mu}_4/\gamma^2)\Big).
\end{aligned}$$

Since $d_2=2$, $d_4=1$, $d_5=3$, and $\Delta_G=1$ when $\bar{\mu}_2>\bar{\mu}_4>\bar{\mu}_5$, we have $1<\beta\gamma^2<m$. Thus (4) holds. The case $B=(1,5,4)$ can be proved analogously.
Case 4: $B=(2,3,4)$.

We discuss the case $P^*=\{(4,5)\}$. When the possible maximal coefficient of $x_4^{45}$ arises, the drift is

$$\begin{aligned}
\mathcal{L}V(x)<\ & C+\beta x_4^{45}\Big(\beta(\bar{\lambda}-\bar{\mu}_4)+\mathbb{I}_{\{\bar{\mu}_4>\bar{\mu}_3>\bar{\mu}_2\}}(-\bar{\mu}_2/\gamma^2-\bar{\mu}_3/\gamma)\\
&+\mathbb{I}_{\{\bar{\mu}_4>\bar{\mu}_2>\bar{\mu}_3\}}(-\bar{\mu}_2/\gamma-\bar{\mu}_3/\gamma^2)\Big).
\end{aligned}$$

Since $d_2=2$, $d_3=2$, $d_4=1$, and $\Delta_G=1$ when $\bar{\mu}_4>\bar{\mu}_2>\bar{\mu}_3$ or $\bar{\mu}_4>\bar{\mu}_3>\bar{\mu}_2$, we have $1<\beta\gamma^2<\beta^3<m$ and (4) holds. The other forms of $P^*$ for $B=(2,3,4)$ and the case $B=(2,3,5)$ can be proved analogously.
Case 5: $B=(1,3,5)$.

Considering $P^*=\{(1,2)\}$, when the possible maximal coefficient of $x_1^{12}$ arises, the drift satisfies

$$\begin{aligned}
\mathcal{L}V(x)<\ & C+\beta^3 x_1^{12}\Big(\beta(\bar{\lambda}-\bar{\mu}_1)+\mathbb{I}_{\{\bar{\mu}_1>\bar{\mu}_5>\bar{\mu}_3\}}(-\bar{\mu}_3/\gamma^2-\bar{\mu}_5/\gamma)\\
&+\mathbb{I}_{\{\bar{\mu}_1>\bar{\mu}_3>\bar{\mu}_5\}}(-\bar{\mu}_3/\gamma-\bar{\mu}_5/\gamma^2)\Big).
\end{aligned}$$

We have $d_1=1$, $d_3=2$, $d_5=3$, and $\Delta_G=1$ when $\bar{\mu}_1>\bar{\mu}_3>\bar{\mu}_5$. Thus $1<\beta\gamma^2<\beta^3<m$, and (4) holds. The other forms of $P^*$ for $B=(1,3,5)$ and the case $B=(1,3,4)$ can be proved analogously, as can the remaining cases $B=(1,1,4)$, $B=(1,1,5)$, $B=(2,1,4)$, and $B=(2,1,5)$.

In conclusion, we show that the mean drift always satisfies (4), which ensures the stability of the network. \square

4 Iterative computation of GSP parameters

In this section, we develop a policy iteration (PI) algorithm that computes the parameters of the GSP policy. In particular, Theorem 1 gives a set of values of $\beta$ and $\gamma$ that ensure stability; the goal of the PI algorithm is to search this set for a specific combination of $\beta$ and $\gamma$ that attains near-optimal traffic efficiency. In addition, the PI algorithm does not require a priori knowledge of the arrival and service rates. We also compare the performance of the GSP PI algorithm with three benchmark policies, viz. neural network (NN)-based PI, simple shortest path, and optimized Bernoulli.

We formulate the joint routing-scheduling problem as a Markov decision process (MDP) with state space $\mathcal{X}=\mathbb{Z}_{\geq 0}^7$. Since the routing and scheduling decisions are only made when an arrival or a departure occurs (i.e., at transition epochs [23, p.72]), the problem can be formulated as a discrete-time (DT) MDP. We denote the state and action of the DT MDP as $x[k]$ and $a[k]$, respectively. With a slight abuse of notation, $x[k]=x(t_k)$, where $t_k$ is the $k$-th transition epoch of the continuous-time process. Similarly, the action is given by $a[k]=(\lambda[k],\mu[k])$, where $\lambda[k]\in\{0,\bar{\lambda}\}^3$ and $\mu[k]\in\prod_{n\in\mathcal{N}}\{0,\bar{\mu}_n\}^{m_n}$ are the route-specific arrival rates and service rates, respectively.

As indicated in Section 2, both the routing and the scheduling decisions can be parameterized via the design variables $\beta$ and $\gamma$; by Theorem 1, $\beta$ and $\gamma$ should satisfy (2). Thus, the action is given by the policy

$$a[k]=\pi(x[k];\beta,\gamma).$$

The transition probability $p(x'|x,a)$ of the DT process can be derived from the network model defined in Section 2 in a straightforward manner. The one-step cost of the MDP is given by

$$u[k]=\|x[k]\|_1(t_k-t_{k-1}).$$

The total cost over one episode of the MDP is thus given by

$$U_\pi^K(x)=\mathrm{E}_\pi\Big[\sum_{\ell=1}^{K}\|x[\ell]\|_1(t_\ell-t_{\ell-1})\,\Big|\,x[0]=x\Big],$$

where $K\in\mathbb{Z}_{>0}$ is the length (i.e., number of transition epochs) of one episode.

4.1 GSP PI algorithm

A closed-form expression for $U_\pi^K$ is not easy to obtain. Hence, we use the PL $Q$ functions that constitute (3) as a proxy for $U_\pi^K$. A key characteristic of our formulation is that $(\beta,\gamma)$ parameterizes both the policy $\pi(x;\beta,\gamma)$ and the PL functions $\gamma_p Q_p(x;\beta)$. Although this may lead to a gap with respect to the ground-truth optimal solution, it significantly simplifies the computation and, importantly, guarantees traffic stability.

The GSP PI algorithm performs two tasks in parallel: calibration of the model data $\bar{\lambda}$, $\bar{\mu}_n$ and optimization of the control parameters $\beta,\gamma$. The latter depends on the former, since $\Delta_G$ in (2) depends on the model data. Let $\hat{\lambda}^{(i)},\hat{\mu}_n^{(i)}$ be the estimates of $\bar{\lambda},\bar{\mu}_n$ after $i$ iterations (i.e., episodes), $\Delta_G^{(i)}$ be the corresponding $\Delta_G$, and $\beta^{(i)},\gamma^{(i)}$ be the control parameters after $i$ iterations. Then, the specific steps of the GSP PI algorithm can be summarized as follows:

  1. Initialize with $\hat{\lambda}^{(0)}\in\mathbb{R}_{>0}$, $\hat{\mu}_n^{(0)}\in\mathbb{R}_{>0}$ for each server $n$, $\beta^{(0)}\in(1,m^{\frac{1}{2+\Delta_G}})$, $\gamma^{(0)}\in(1,\beta^{(0)})$.

  2. Policy evaluation: Simulate a Monte-Carlo episode of the network model with the policy $\pi(x;\beta^{(i-1)},\gamma^{(i-1)})$. Update the estimates to obtain $\hat{\lambda}^{(i)},\hat{\mu}_n^{(i)}$. Update the parameters $\beta^{(i)},\gamma^{(i)}$. (See below for how the updates are done.)

  3. Policy improvement: Use $\beta^{(i)},\gamma^{(i)}$ to obtain a new policy $\pi(x;\beta^{(i)},\gamma^{(i)})$.

  4. Terminate when $|\beta^{(i)}-\beta^{(i-1)}|$ and $|\gamma^{(i)}-\gamma^{(i-1)}|$ are smaller than some thresholds.

The rest of this subsection is devoted to the update of $(\beta,\gamma)$. Given $(\beta,\gamma)$, suppose that a job arrives at epoch $k$ and takes path $p$ according to policy $\pi(x;\beta,\gamma)$. Then, we track the system time $W[k]$ of this particular job (this notation should cause no confusion, since at most one Poisson arrival can occur at an epoch), and use the squared error

$$\mathrm{SE}_{(\beta,\gamma)}[k]=\mathbb{I}_{\{W[k]>0\}}\Big(\gamma_{p[k]}Q_{p[k]}(x[k];\beta)-W[k]\Big)^2$$

to update $(\beta,\gamma)$, where $p[k]$ is the path that $\pi(x[k];\beta,\gamma)$ selects. Note that $W[k]$ records the time that a job spends in the continuous-time setting; we define $W[k]=0$ if no arrival occurs at epoch $k$, so $\mathbb{I}_{\{W[k]>0\}}$ indicates whether there is an arrival at epoch $k$. $(\beta,\gamma)$ are obtained by minimizing the sum-of-squares error:

$$(\beta^*,\gamma^*)=\mathop{\arg\min}_{(\beta,\gamma)\ \text{satisfying}\ (2)}\sum_{k=1}^{K}\mathrm{SE}_{(\beta,\gamma)}[k]. \qquad (7)$$

Note that the PL function actually approximates the job-specific system time $W$ rather than the total system time $U_\pi^K$; the two are related via

$$\lim_{K\to\infty}\Bigg\{\frac{U_\pi^K(x)}{t_K}-\frac{\sum_{k=1}^K W[k]}{\sum_{k=1}^K\mathbb{I}_{\{W[k]>0\}}}\Bigg\}=0\quad a.s.$$

4.1.1 Estimation of $\bar{\lambda}$, $\bar{\mu}_n$

In the $i$th iteration, we run a Monte-Carlo simulation of the network model. The simulation uses a discrete time step that is sufficiently small to approximate the continuous-time process; i.e., we use Bernoulli processes with a sufficiently small time increment to approximate the Poisson processes. We track every inter-arrival time at the origin and compute the average to obtain $1/\hat{\lambda}^{(i)}$. We also track every service time at each server $n$ and compute the average to obtain $1/\hat{\mu}_n^{(i)}$.
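
A minimal sketch of this calibration step (our own illustration, with a hypothetical event-log format; for brevity it samples exponential variates directly rather than simulating the Bernoulli-increment approximation described above) is:

```python
# Rate calibration sketch: lambda_hat and mu_hat_n are the reciprocals of the
# sample means of the observed inter-arrival and service times.
import numpy as np

def estimate_rates(arrival_times, service_times):
    """arrival_times: sorted arrival epochs at the origin;
    service_times: dict mapping server n to an array of realized service times."""
    lam_hat = 1.0 / np.mean(np.diff(arrival_times))
    mu_hat = {n: 1.0 / np.mean(s) for n, s in service_times.items()}
    return lam_hat, mu_hat

# Example with synthetic exponential data (rates as in Section 4.3):
rng = np.random.default_rng(0)
arrivals = np.cumsum(rng.exponential(1 / 0.2, size=5000))
services = {n: rng.exponential(1 / r, size=2000)
            for n, r in {1: 0.15, 2: 0.1, 3: 0.25, 4: 0.15, 5: 0.2}.items()}
lam_hat, mu_hat = estimate_rates(arrivals, services)
print(round(lam_hat, 3), {n: round(v, 3) for n, v in mu_hat.items()})
```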

4.1.2 Update of $\beta$

In the $i$th iteration, we record $\{x[k],p[k],W[k]\}$ whenever $W[k]>0$. Let $D^{(i)}$ be the resulting table of $\{x[k],p[k],W[k]\}$. We determine $\beta^{(i)}$ by a method similar to the least-squares partition algorithm in [24]. We first partition $D^{(i)}$ according to the boundaries of the PL functions $\{Q_p(x;\beta')\}$ with $\beta'=\beta^{(i-1)}$. Then we use the L-BFGS-B algorithm [25] to calculate a better $\beta'$ that minimizes the sum-of-squares error defined in (7), with $\gamma$ fixed at $\gamma^{(i-1)}$. This leads to a new partition according to $\beta'$. We repeat this procedure until the partition stabilizes or a maximum number of iterations is reached, and the last $\beta'$ is taken as $\beta^*$ of this episode.

4.1.3 Update of $\gamma$

After $\beta^{(i)}$ is obtained, we can determine the relation between $\gamma_p$ and $\gamma$ for each record in $D^{(i)}$. To be specific, we determine the combination of bottlenecks $B$ at state $x[k]$ based on $\beta^{(i)}$, and $\gamma_p(B)$ is given by (1). Then, we again use the L-BFGS-B algorithm to calculate the $\gamma^*$ that minimizes the sum-of-squares error defined in (7) with $\beta$ fixed at $\beta^{(i)}$. Once the update of $\gamma$ is completed, $\gamma^*$ of this iteration is set to be $\gamma^{(i)}$. The $i$th iteration of the updates of $\beta$ and $\gamma$ is summarized in Algorithm 1.

Algorithm 1 Computation of $\beta^{(i)}$ and $\gamma^{(i)}$
Input: $D^{(i)},\hat{\lambda}^{(i)},\hat{\mu}_n^{(i)},\beta^{(i-1)},\gamma^{(i-1)}$
1: partition the data in $D^{(i)}$ with $\beta'\leftarrow\beta^{(i-1)}$
2: repeat
3:     update $\beta'$ based on the current partition (minimizing (7) with $\gamma=\gamma^{(i-1)}$) and re-partition
4: until the partition stabilizes, yielding $\beta^*$
5: $\beta^{(i)}\leftarrow\beta^*$
6: determine the form of $\gamma_p$ for each record in $D^{(i)}$ using $\beta^{(i)}$
7: $\gamma^{(i)}\leftarrow\gamma^*$, the minimizer of (7) with $\beta=\beta^{(i)}$
Output: $\beta^{(i)}$, $\gamma^{(i)}$
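
A compact Python sketch of Algorithm 1 (our own rendering under stated assumptions, not the authors' implementation) could look as follows. The helpers `q_fn(x, p, beta)` and `gamma_fn(x, p, beta, gamma)` are assumed to evaluate $Q_p(x;\beta)$ and $\gamma_p(B)$ (e.g., built from the earlier sketches), and the explicit partition bookkeeping of [24] is folded into re-evaluating the PL maxima inside `q_fn` at the current $\beta$.

```python
# Sketch of Algorithm 1: alternating least-squares fit of (beta, gamma) on
# episode data D = [(x, p, W), ...], using scipy's L-BFGS-B within the box
# constraints implied by (2).
from scipy.optimize import minimize

def sse(beta, gamma, D, q_fn, gamma_fn):
    """Sum-of-squares error (7) over the recorded (state, path, system time) data."""
    return sum((gamma_fn(x, p, beta, gamma) * q_fn(x, p, beta) - W) ** 2
               for (x, p, W) in D)

def fit_beta_gamma(D, q_fn, gamma_fn, beta0, gamma0, m, delta_G, max_rounds=20):
    beta_hi = m ** (1.0 / (2 + delta_G)) - 1e-6        # upper bound on beta from (2)
    beta, gamma = beta0, gamma0
    for _ in range(max_rounds):                         # re-partitioning loop on beta
        res = minimize(lambda b: sse(b[0], gamma, D, q_fn, gamma_fn),
                       x0=[beta], method="L-BFGS-B",
                       bounds=[(1.0 + 1e-6, beta_hi)])
        if abs(float(res.x[0]) - beta) < 1e-4:          # partition has stabilized
            beta = float(res.x[0])
            break
        beta = float(res.x[0])
    res = minimize(lambda g: sse(beta, g[0], D, q_fn, gamma_fn),
                   x0=[gamma], method="L-BFGS-B",
                   bounds=[(1.0 + 1e-6, beta - 1e-6)])  # 1 < gamma < beta
    return beta, float(res.x[0])
```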

4.2 Benchmark policies and algorithms

To evaluate the performance of the GSP PI algorithm, we consider three benchmarks:

4.2.1 Neural network (NN) PI

We consider the action-value function $\Theta(x,a)$ and use an NN to approximate it. Initially, we obtain the first sequence of state-action-reward tuples $\{x[k],a[k],r[k]\}$ and set

$$\theta_0(x[k],a[k])=\sum_{k'=k}^{K}l^{(k'-k)}r_{k'},\qquad \Theta_0=\theta_0$$

for the initial iteration after a Monte-Carlo episode, where $l$ is a discount factor on future rewards. The value is updated as

$$\Theta_i(x[k],a[k])=\Theta_{i-1}(x[k],a[k])+\big(\theta_i(x[k],a[k])-\Theta_{i-1}(x[k],a[k])\big)/i$$

in iteration $i$, $i\in\mathbb{N}^*$. The current policy is then evaluated by a designed NN, which has two convolutional layers, two max-pooling layers, and two fully connected layers. We adopt the rectified linear unit (ReLU) activation function at each layer to avoid gradient explosion and vanishing. The weights of the NN are updated by the adaptive moment estimation (Adam) algorithm [26] to narrow the gap between the predicted and computed state-action values. Finally, we use the NN approximation of $\Theta$ for policy improvement via the $\epsilon$-greedy algorithm. Since an exact optimal policy of the original MDP is not readily available, the policy computed by NN PI is used as an approximate optimal policy.

4.2.2 Simple shortest-path (SSP) policy

For routing and scheduling decisions under the SSP policy, we simply calculate the cost of each path as $\Phi(x,p)=\sum_{n\in p}x_n^p$ and select the path with the smallest $\Phi(x,p)$ value.

4.2.3 Optimal Bernoulli (OB) policy

A Bernoulli policy routes any incoming job to a path $p\in\mathcal{P}$ with a time-invariant probability $\eta_p$ such that $\sum_{p\in\mathcal{P}}\eta_p=1$. The routing probabilities $\eta_p$ are then optimized using a simple numerical search scheme.
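
One possible form of such a search (our own sketch; `simulate_average_system_time` is a hypothetical placeholder standing in for a simulation of the network in Fig. 1 under fixed routing probabilities, and is not defined here) is a grid search over the probability simplex:

```python
# Sketch of a numerical search for the OB policy over eta = (eta_12, eta_135, eta_45).
import itertools

def optimize_bernoulli(simulate_average_system_time, step=0.02):
    """Grid search over the probability simplex; returns the best eta and its cost."""
    best_eta, best_cost = None, float("inf")
    grid = [round(i * step, 4) for i in range(int(1 / step) + 1)]
    for e12, e135 in itertools.product(grid, grid):
        e45 = 1.0 - e12 - e135
        if e45 < -1e-9:
            continue                                   # outside the simplex
        cost = simulate_average_system_time((e12, e135, max(e45, 0.0)))
        if cost < best_cost:
            best_eta, best_cost = (e12, e135, max(e45, 0.0)), cost
    return best_eta, best_cost
```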

4.3 Evaluation and comparison

We compare our GSP policy with the three benchmarks according to the following performance metrics:

  1) Computational efficiency: training time and number of iterations (if applicable).

  2) Implementation efficiency: the average system time experienced by all jobs that go through the network.

Consider the network in Fig. 1 and suppose $\bar{\mu}_1=0.15$, $\bar{\mu}_2=0.1$, $\bar{\mu}_3=0.25$, $\bar{\mu}_4=0.15$, $\bar{\mu}_5=0.2$, $\bar{\lambda}=0.2$, all in units of sec$^{-1}$. Then we have $m=1.50$, $\Delta_G=0$ for this set of parameters. For GSP PI, we initialize the estimates of $\bar{\mu}_n$ with 0.5 and the estimate of $\bar{\lambda}$ with 0.1. The OB policy turns out to be $\eta^*=[0.28,0.20,0.52]^T$. For simulation, a discrete time step of 0.1 sec is used. All experiments were implemented in Google Colab [27], using an Intel(R) Xeon(R) CPU with 12.7 GB memory. NN PI was trained on an NVIDIA GeForce RTX 2080 Ti with 40 GB memory.

The computational efficiency of our GSP policy is significantly better than that of the general NN algorithm, with a 97.8%-99.8% reduction in training time, as demonstrated in Fig. 2(a).

Figure 2: Average system time comparison of GSP PI and the NN algorithm (axis scales are uneven due to the wide range of values). (a) Average system time as the learning time grows. (b) Average system time as the number of iterations increases.

Our GSP PI algorithm converges within a few seconds, while NN PI takes more than twelve hours to converge. This is expected, since the PL functions have simple structures, while the NN model requires sophisticated training.

Table 1 lists the normalized average system times (NAST) under the various methods. The results are generated from test simulations of $5\times 10^6$ sec. Although NN PI performs better after long training, as expected, GSP PI performs better with just a small number of iterations, as demonstrated in Fig. 2(b), and attains a 1.9%-6.7% optimality gap with respect to NN PI.

Table 1: Normalized Average System Times (NAST) of various schemes

Algorithm                          NAST
Neural network (NN)                1.00
Generalized shortest-path (GSP)    1.03
Simple shortest-path (SSP)         1.12
Optimal Bernoulli (OB)             1.25

Under the SSP and OB policies, jobs spend more time going through the queuing network, with averages of 35.04 sec and 39.27 sec, respectively. Although the implementation efficiency of GSP PI is slightly worse than that of NN PI, GSP PI gives the best trade-off between computational efficiency and implementation efficiency: the average system times of the NN and GSP methods are 31.30 sec and 32.24 sec, which implies a modest optimality gap of about 3%.

5 Concluding remarks

In this article, we study a class of control policies for a single-origin-single-destination network, which we call the generalized shortest-path policy. We design a set of piecewise-linear functions as a proxy for the travel cost on each path of the network. We use the piecewise-linear functions to construct a Lyapunov function and derive a theoretical guarantee on the stability of the traffic state. We also develop a policy iteration algorithm to learn the parameters of the proposed policy. We then implement the algorithm and compare it with three benchmarks, including a standard neural network-based algorithm. The experimental results show that our policy and PI algorithm perform efficiently and reliably while significantly reducing the computational cost. Future work may include (i) extension of the theoretical results and the PI algorithm to general single-origin-single-destination acyclic networks and (ii) refinement of the PI algorithm into more efficient temporal-difference algorithms.

References

  • [1] Debasis Mitra and Judith B Seery. Comparative evaluations of randomized and dynamic routing strategies for circuit-switched networks. IEEE Transactions on Communications, 39(1):102–116, 1991.
  • [2] D Down and Sean P Meyn. Piecewise linear test functions for stability and instability of queueing networks. Queueing Systems, 27(3-4):205–226, 1997.
  • [3] Murat Alanyali and Bruce Hajek. Analysis of simple algorithms for dynamic load balancing. Mathematics of Operations Research, 22(4):840–871, 1997.
  • [4] Li Jin and Saurabh Amin. Stability of fluid queueing systems with parallel servers and stochastic capacities. IEEE Transactions on Automatic Control, 63(11):3948–3955, 2018.
  • [5] PR Kumar and Sean P Meyn. Stability of queueing networks and scheduling policies. IEEE Transactions on Automatic Control, 40(2):251–260, 1995.
  • [6] G. Foschini and Jack Salz. A basic dynamic routing problem and diffusion. IEEE Transactions on Communications, 26(3):320–327, 1978.
  • [7] Anthony Ephremides, P Varaiya, and Jean Walrand. A simple dynamic routing problem. IEEE Transactions on Automatic Control, 25(4):690–693, 1980.
  • [8] L Tassiulas and A Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, 1992.
  • [9] Zubair Md Fadlullah, Fengxiao Tang, Bomin Mao, Nei Kato, Osamu Akashi, Takeru Inoue, and Kimihiro Mizutani. State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s intelligent network traffic control systems. IEEE Communications Surveys & Tutorials, 19(4):2432–2455, 2017.
  • [10] Sean P. Meyn and R. L. Tweedie. Stability of Markovian processes III: Foster-Lyapunov criteria for continuous-time processes. Advances in Applied Probability, 25(3):518–548, 1993.
  • [11] Dimitri Bertsekas. Dynamic Programming and Optimal Control: Volume I, volume 1. Athena Scientific, 2012.
  • [12] Jim G Dai and Sean P Meyn. Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Transactions on Automatic Control, 40(11):1889–1904, 1995.
  • [13] Christopher V Hollot, Vishal Misra, Donald Towsley, and Weibo Gong. Analysis and design of controllers for AQM routers supporting TCP flows. IEEE Transactions on Automatic Control, 47(6):945–959, 2002.
  • [14] Leonidas Georgiadis, Michael J Neely, Leandros Tassiulas, et al. Resource allocation and cross-layer control in wireless networks. Foundations and Trends® in Networking, 1(1):1–144, 2006.
  • [15] Abhishek Sinha and Eytan Modiano. Optimal control for generalized network-flow problems. IEEE/ACM Transactions on Networking, 26(1):506–519, 2017.
  • [16] Jianan Zhang, Abhishek Sinha, Jaime Llorca, Antonia M Tulino, and Eytan Modiano. Optimal control of distributed computing networks with mixed-cast traffic flows. IEEE/ACM Transactions on Networking, 29(4):1760–1773, 2021.
  • [17] Keith W Ross. Optimal dynamic routing in markov queueing networks. Automatica, 22(3):367–370, 1986.
  • [18] Quan-Lin Li, Jing-Yu Ma, Rui-Na Fan, and Li Xia. An overview for markov decision processes in queues and networks. Stochastic Models in Reliability, Network Security and System Safety: Essays Dedicated to Professor Jinhua Cao on the Occasion of His 80th Birthday, pages 44–71, 2019.
  • [19] Zhiyuan Xu, Jian Tang, Jingsong Meng, Weiyi Zhang, Yanzhi Wang, Chi Harold Liu, and Dejun Yang. Experience-driven networking: A deep reinforcement learning based approach. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 1871–1879. IEEE, 2018.
  • [20] Guillermo Bernárdez, José Suárez-Varela, Albert López, Bo Wu, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, and Albert Cabellos-Aparicio. Is machine learning ready for traffic engineering optimization? In 2021 IEEE 29th International Conference on Network Protocols (ICNP), pages 1–11. IEEE, 2021.
  • [21] Bai Liu, Qiaomin Xie, and Eytan Modiano. RL-QN: A reinforcement learning framework for optimal control of queueing systems. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 7(1):1–35, 2022.
  • [22] Qian Xie and Li Jin. Stabilizing queuing networks with model data-independent control. IEEE Transactions on Control of Network Systems, 9(3):1317–1326, 2022.
  • [23] Robert G Gallager. Stochastic Processes: Theory for Applications. Cambridge University Press, 2013.
  • [24] Alessandro Magnani and Stephen P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10:1–17, 2009.
  • [25] Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] Google Colaboratory. https://colab.research.google.com/. Accessed: 2023-03-28.