
On Joint Convergence of Traffic State and Weight Vector in Learning-Based Dynamic Routing with Value Function Approximation

Yidan Wu, Jianan Zhang and Li Jin This work was in part supported by NSFC Project 62103260, SJTU UM Joint Institute, J. Wu & J. Sun Foundation, and US NSF CMMI-1949710. Y. Wu and L. Jin are with the UM Joint Institute, Shanghai Jiao Tong University, China. J. Zhang is with the School of Electronics, Peking University, China. (Emails: [email protected], [email protected], [email protected])
Abstract

Learning-based approaches are increasingly popular for traffic control problems. However, these approaches are typically applied as black boxes with limited theoretical guarantees and interpretability. In this paper, we study the theory of dynamic routing over parallel servers, a representative traffic control task, using a semi-gradient on-policy control algorithm, a representative reinforcement learning method. We consider a linear value function approximation on an infinite state space; a Lyapunov function is also derived from the approximator. In particular, the structure of the approximator naturally allows idling policies, which is an interesting and useful advantage over existing dynamic routing schemes. We show that the convergence of the approximation weights is coupled with the convergence of the traffic state: if the system is stabilizable, then (i) the weight vector converges to a bounded region, and (ii) the traffic state is bounded in the mean. We also empirically show that the proposed algorithm is computationally efficient with an insignificant optimality gap.

Index terms: Dynamic routing, reinforcement learning, Lyapunov method, value function approximation.

I Introduction

Dynamic routing is a classical control problem in transportation, manufacturing, and networking. The problem has long been challenging, because analytically characterizing the steady-state distribution of the traffic state, and thus the long-run performance metrics (e.g., queuing delay), is very difficult [1, 2]. Recently, there has been rapidly growing interest in applying reinforcement learning (RL) methods to dynamic routing and to network control problems in general. RL methods are attractive because of their computational efficiency and adaptivity to unknown/non-stationary environments [3]. However, there is still a non-trivial gap between the demand for theoretical guarantees on key performance metrics and the black-box nature of RL methods. In particular, most existing theoretical results on RL are developed in the context of finite Markov decision processes (MDPs), whereas dynamic routing is naturally posed on infinite state spaces, especially for stability and throughput analysis.

In this paper, we make an effort to respond to the above challenge by studying the behavior of a parallel service system (Fig. 1) controlled by a class of semi-gradient SARSA (SGS) algorithms with linear function approximation; these methods are attractive because of (i) adaptivity to unknown model parameters and (ii) potential to obtain near-optimal policies.


Figure 1: A parallel service system.

Importantly, we jointly consider the convergence of the algorithm training process and of the traffic state process. The specific research questions are:

  1.

    How is the convergence of the weight vector coupled with the convergence of the traffic state?

  2.

    Under what conditions does the proposed SGS algorithm ensure the joint convergence of the weights and the state?

The above questions involve two bodies of literature, viz. dynamic routing and reinforcement learning. Classical dynamic routing schemes rely on Lyapunov methods to study traffic stability and provide a solid foundation for our work [4, 5]. However, these methods are less suited to searching for routing policies that optimize the average system time. In particular, existing results usually assume non-idling policies, which may be quite restrictive. Recently, RL has been used to find optimal routing policies and provides important tools and insights for practice [6]. In particular, Liu et al. proposed a mixed scheme that uses learning over a bounded set of states and a known stabilizing policy for the other states [7]; Xu et al. proposed a deep RL-based framework for traffic engineering problems in communication networks [8]; Lin et al. proposed an adaptive routing algorithm utilizing RL in a hierarchical network [9]. However, to the best of our knowledge, existing theory on RL mostly considers MDPs with finite or bounded state spaces [10, 11, 12]; the theory on infinite/unbounded state spaces is limited to very special problems (e.g., linear-quadratic regulation [13]). Hence, existing learning-based routing algorithms either rely on empirical evidence for convergence or build on finite-MDP theories. Thus, there is a lack of solid theory on the convergence of value function approximation over unbounded state spaces; such a theory is essential for developing interpretable and reliable routing schemes.

In response to the above research gaps, we jointly consider the convergence of the traffic state and of the weight vector in dynamic routing. The traffic state characterizes the queuing process in the parallel service system illustrated in Fig. 1. The routing objective is to minimize the expected total system time, which includes waiting times and service times. The weights parameterize the approximate action value function and thus determine the routing policy. In particular, the algorithm naturally allows idling policies, which is an advantage over existing methods. The weights are updated by a semi-gradient on-policy algorithm.

Our main result (Theorem 1) states that the proposed algorithm ensures joint convergence of the traffic state and the weight vector if and only if the system is stabilizable. Importantly, we study the coupling between the long-time behavior of the traffic state and that of the weight vector, which extends the existing theory on finite-state RL [11] to unbounded state spaces. The convergence of traffic state results from a Lyapunov function associated with the approximate value function which verifies the drift criterion [14]. The convergence of weights results from stochastic approximation theory [15]. We compare the proposed algorithm with a much more sophisticated neural network-based algorithm and show that our algorithm converges much faster than the benchmark with only an 8% optimality gap.

In summary, the contributions of this paper are as follows.

  1.

    We propose a structurally intuitive and technically sound algorithm to learn near-optimal routing policies over parallel servers.

  2.

    We study joint convergence of traffic state and weight vector under the proposed algorithm; this is theoretically interesting in itself.

  3.

    We show empirical evidence for the computational efficiency and near-optimality of the proposed algorithm.

The rest of this paper is organized as follows. Section II introduces the parallel service system model, the MDP formulation, and the SGS algorithm. Section III presents and develops the main result on joint convergence. Section IV compares the SGS algorithm with two benchmarks. Section V gives the concluding remarks.

II Modeling and Formulation

Consider the system of parallel servers with infinite buffer sizes in Fig. 1. In this section, we model the dynamics of the system, formulate the dynamic routing problem as a Markov decision process (MDP), and introduce our semi-gradient SARSA (SGS) algorithm.

II-A System modeling

Let $\mathcal{N}=\{1,2,3,\ldots,N\}$ be the set of parallel servers. Each server $n$ has exponentially distributed service times with rate $\mu_{n}$ and $x_{n}(t)$ jobs at time $t\in\mathbb{R}_{\geq 0}$. The state of the system is $x=[x_{1},x_{2},\ldots,x_{N}]^{T}$, and the state space is $\mathbb{Z}^{N}_{\geq 0}$. Jobs arrive at origin $S$ according to a Poisson process of rate $\lambda>0$. When a job arrives, it is routed to one of the $N$ servers according to a routing policy

\pi:\mathcal{N}\times\mathbb{Z}^{N}_{\geq 0}\rightarrow[0,1].

That is, $\pi(a|x)$ is the probability of routing the new job to server $a$ conditional on state $x$. This paper focuses on a particular class of routing policies which we call the weighted shortest queue (WSQ) policy. WSQ is based on the approximation of the action value function $\hat{Q}:\mathbb{Z}^{N}_{\geq 0}\times\mathcal{N}\times\mathbb{R}^{N}_{>0}\to\mathbb{R}_{\geq 0}$ defined as

\hat{Q}(x,a;w):=\sum_{n=1}^{N}w_{n}(x_{n}+\mathbb{I}_{\{n=a\}})^{2}, (1)

where $w=[w_{1},w_{2},\ldots,w_{N}]^{T}\in\mathbb{R}^{N}_{>0}$ is the weight vector. For technical convenience, we consider the softmax version of WSQ

\pi_{w}(a|x)=\frac{\exp(-\hat{Q}(x,a;w)/\iota)}{\sum_{b=1}^{N}\exp(-\hat{Q}(x,b;w)/\iota)}, (2)

where $\iota\in(0,\infty)$ is the temperature of the softmax function. Note that $\pi_{w}(a|x)$ converges to a deterministic policy greedy w.r.t. $\hat{Q}$ as $\iota$ approaches 0 [11].
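
To make the policy concrete, the following is a minimal Python sketch of the quadratic approximator (1) and the softmax WSQ policy (2); the array-based representation and the example values of $w$, $x$, and $\iota$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def q_hat(x, a, w):
    """Approximate action value (1): sum_n w_n * (x_n + 1{n==a})^2."""
    x_after = np.array(x, dtype=float)
    x_after[a] += 1.0  # the arriving job joins server a
    return np.dot(w, x_after ** 2)

def wsq_policy(x, w, iota=0.01):
    """Softmax WSQ policy (2): a smaller Q-hat gives a larger routing probability."""
    q = np.array([q_hat(x, a, w) for a in range(len(w))])
    logits = -(q - q.min()) / iota  # shift by the minimum for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: three servers with two jobs at server 1 and equal weights.
x = np.array([0, 2, 0])
w = np.array([0.5, 0.5, 0.5])
print(wsq_policy(x, w))  # probability mass concentrates on the shorter queues
```

As $\iota\to 0$, the sketch reproduces the greedy behavior described above: the routing probability concentrates on the server minimizing $\hat{Q}$.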

We say that the traffic in the system is stable if there exists a constant $M<\infty$ such that for any initial condition,

\lim_{t\rightarrow\infty}\frac{1}{t}\int_{0}^{t}\mathsf{E}[\|x(s)\|_{1}]\,ds<M. (3)

We say that the system is stabilizable if

\lambda<\sum_{n=1}^{N}\mu_{n}. (4)

Note that the above ensures the existence of at least one stabilizing Bernoulli routing policy.

II-B MDP formulation

Since routing actions are made only at transition epochs [16, p. 72], the routing problem of the parallel queuing system can be formulated as a discrete-time (DT) MDP with countably infinite state space $\mathbb{Z}^{N}_{\geq 0}$ and finite action space $\mathcal{N}$. With a slight abuse of notation, we denote the state and action of the DT MDP as $x[k]\in\mathbb{Z}^{N}_{\geq 0}$ and $a[k]\in\mathcal{N}$, respectively. Specifically, $x[k]=x(t_{k})$, where $t_{k}$ is the $k$-th transition epoch of the continuous-time process. As indicated in Section II-A, the routing policy can be parameterized via the weight vector $w\in\mathbb{R}^{N}_{>0}$.

The transition probability $\mathsf{p}(x^{\prime}|x,a)$ of the DT MDP can be derived from the system dynamics in a straightforward manner. Let $e_{i}\in\{0,1\}^{N}$ denote the unit vector such that $e_{i,i}=1$ and $e_{i,j}=0$ for $j\neq i$. Then we have

\mathsf{p}(x^{\prime}|x,a)=\begin{cases}\frac{\lambda}{\lambda+\sum_{n=1}^{N}\mu_{n}\mathbb{I}_{\{x_{n}>0\}}}&x^{\prime}=x+e_{a},\\ \frac{\mu_{n}\mathbb{I}_{\{x_{n}>0\}}}{\lambda+\sum_{n=1}^{N}\mu_{n}\mathbb{I}_{\{x_{n}>0\}}}&x^{\prime}=x-e_{n}.\end{cases}

The one-step random cost of the MDP is given by

c[k+1]=\|x[k+1]\|_{1}\,(t_{k+1}-t_{k}),

where $\|\cdot\|_{1}$ is the 1-norm on $\mathbb{R}^{N}$. Let $\bar{c}(x,a)=\mathsf{E}[c[k+1]\,|\,x[k]=x,a[k]=a]$ denote the expected cost. The total discounted cost over the infinite horizon is thus given by

Q_{\pi}(x,a)=\mathsf{E}_{\pi}\Big[\sum^{\infty}_{\ell=0}\gamma^{\ell}\|x[\ell+1]\|_{1}(t_{\ell+1}-t_{\ell})\,\Big|\,x,a\Big],

where $\gamma\in(0,1)$ is the discount factor of the infinite-horizon MDP. Let $Q^{*}(x,a)$ denote the solution of the Bellman optimality equation, that is,

Q^{*}(x,a)=\bar{c}(x,a)+\gamma\sum_{x^{\prime}}\mathsf{p}(x^{\prime}|x,a)\min_{a^{\prime}}Q^{*}(x^{\prime},a^{\prime}),

and let $\pi^{*}$ denote the greedy policy with respect to $Q^{*}$.
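
As an illustration of the embedded DT MDP above, the following is a minimal sketch that samples one transition and the associated one-step cost; the random number generator and the helper name `step` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, a, lam, mu):
    """Sample (x', t_{k+1}-t_k, c[k+1]) from state x under routing action a.

    An arrival (rate lam) and the departures of busy servers (rates mu_n)
    compete; the one-step cost is ||x'||_1 times the sojourn time.
    """
    rates = np.concatenate(([lam], mu * (x > 0)))  # arrival + busy servers
    total = rates.sum()
    dt = rng.exponential(1.0 / total)              # time until the next event
    event = rng.choice(len(rates), p=rates / total)
    x_next = x.copy()
    if event == 0:
        x_next[a] += 1                             # arrival routed to server a
    else:
        x_next[event - 1] -= 1                     # departure from server event-1
    cost = x_next.sum() * dt                       # ||x'||_1 (t_{k+1} - t_k)
    return x_next, dt, cost
```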

A closed-form expression for $Q_{\pi}$ is not easy to obtain. Hence, we use the $\hat{Q}$ function defined by (1) as a proxy for $Q_{\pi}$. Motivated by [11, 15], we consider the function approximator

\hat{Q}(x,a;w)=\sum_{n=1}^{N}w_{n}\phi_{n}(x,a), (5)
\phi_{n}(x,a)=(x_{n}+\mathbb{I}_{\{n=a\}})^{2},\quad n=1,2,\ldots,N,

where $\phi_{n}:\mathbb{Z}^{N}_{\geq 0}\times\mathcal{N}\to\mathbb{R}_{\geq 0}$, and $\{\phi_{n}\}_{n=1}^{N}$ are linearly independent basis functions. Let $w^{*}$ denote the optimal solution to

\min_{w}\sum_{x\in\mathbb{Z}^{N}_{\geq 0}}\sum_{a\leq N}d^{*}(x)\pi^{*}(a|x)\Big(Q^{*}(x,a)-\hat{Q}(x,a;w)\Big)^{2},

where $d^{*}(x)$ is the invariant state distribution under policy $\pi^{*}$. We select quadratic basis functions because $Q_{\pi}$ is non-linear and the quadratic function is one of the simplest non-linear forms. Moreover, the analysis generalizes to higher-order polynomials.

II-C Semi-gradient SARSA algorithm

Inspired by [11], let $w[k]$ denote the weight vector at the $k$-th transition epoch, which is updated by a SARSA(0) algorithm

w[k+1]=\Gamma\Big(w[k]+\alpha_{k}\Delta[k]\nabla_{w}\hat{Q}(x[k],a[k];w[k])\Big);

in the above, $\Gamma:\mathbb{R}_{>0}^{N}\to\mathbb{R}_{>0}^{N}$ is a projection operator, $\alpha_{k}$ is the stochastic step size, $\Delta[k]$ is the temporal-difference (TD) error, and $\nabla_{w}\hat{Q}(x[k],a[k];w[k])$ is the gradient; these are specified as follows.

The projection $\Gamma(\cdot)$ is defined with a positive constant $C_{\Gamma}$:

\Gamma(w)=\begin{cases}w&\|w\|\leq C_{\Gamma},\\ C_{\Gamma}\frac{w}{\|w\|}&\|w\|>C_{\Gamma},\end{cases}

where $\|\cdot\|$ is the standard 2-norm. Besides, we use $\langle x,y\rangle:=x^{T}y$ to denote the standard inner product on Euclidean spaces.

The TD error $\Delta[k]$ and the gradient $\nabla_{w}\hat{Q}(x[k],a[k];w[k])$ are as follows. Let $\phi=[\phi_{1},\phi_{2},\ldots,\phi_{N}]^{T}$; then we can compactly write

\hat{Q}(x,a;w)=\phi^{T}(x,a)w,
\nabla_{w}\hat{Q}(x,a;w)=\phi(x,a).

Then, for any $k\in\mathbb{Z}_{\geq 0}$, the TD error and the gradient are collectively given by

\delta_{w[k]}(x[k],w[k])=\Delta[k]\nabla_{w}\hat{Q}(x[k],a[k];w[k])=\Big(-\phi^{T}(x[k],a[k])w[k]+c[k+1]+\gamma\phi^{T}(x[k+1],a[k+1])w[k]\Big)\phi(x[k],a[k]).

The step sizes $\{\alpha_{k}\}$ are generated by the following mechanism. We define an auxiliary sequence $\{\tilde{\alpha}_{\tilde{k}};\tilde{k}\in\mathbb{Z}_{\geq 0}\}$ satisfying the standard step size condition [10]

\sum_{\tilde{k}=0}^{\infty}\tilde{\alpha}_{\tilde{k}}=\infty,\quad\sum_{\tilde{k}=0}^{\infty}\tilde{\alpha}^{2}_{\tilde{k}}<\infty. (6)

The step sizes should not vanish so fast that the iteration stalls, nor remain so large that convergence is impeded. Let $B_{\alpha}$ denote a finite positive constant, and let $\tilde{k}$ count the epochs before $k$ at which the TD update is accepted, i.e.,

\tilde{k}=\sum_{k^{\prime}<k}\mathbb{I}_{\{\|\delta_{w[k^{\prime}]}(x[k^{\prime}],w[k^{\prime}])\|\leq B_{\alpha}\}}.

Then the step size sequence $\{\alpha_{k}\}$ can be constructed as

\alpha_{k}=\begin{cases}\tilde{\alpha}_{\tilde{k}}&\|\delta_{w[k]}(x[k],w[k])\|\leq B_{\alpha},\\ 0&\text{otherwise}.\end{cases}

That is, $\{\alpha_{k}\}$ consists of zeros and elements of the deterministic sequence $\{\tilde{\alpha}_{\tilde{k}}\}$, as demonstrated in Table I. Thus the weight vector $w[k+1]$ is updated only when the bound $B_{\alpha}$ is satisfied.

TABLE I: A sample of stochastic step sizes $\{\alpha_{k}\}$.
$k$  $\|\delta_{w[k]}(x[k],w[k])\|\leq B_{\alpha}$?  $\alpha_{k}$
0  Yes  $\tilde{\alpha}_{0}$
1  Yes  $\tilde{\alpha}_{1}$
2  No  0
3  Yes  $\tilde{\alpha}_{2}$
$\vdots$  $\vdots$  $\vdots$

The update equation thus becomes

w[k+1]=\Gamma\Big(w[k]+\alpha_{k}\delta_{w[k]}(x[k],w[k])\Big). (7)

It is known that SARSA chatters when combined with linear function approximation [10]. We say that Algorithm 1 converges to a bounded region if there exists a positive finite constant $B$ such that

\lim_{k\rightarrow\infty}\mathsf{E}[\|w[k]-w^{*}\|]\leq B, (8)

for every initial traffic state $x[0]\in\mathbb{Z}_{\geq 0}^{N}$ and every initial weight $w[0]\in\mathbb{R}^{N}_{>0}$.

Algorithm 1 (SGS) Computation of $\hat{Q}$ for $Q$
1: Require: initial weights $w[0]$ with $\|w[0]\|<C_{\Gamma}$, WSQ policy $\pi_{w[0]}$, step size sequence $\{\alpha_{k}\}$, discount factor $\gamma$
2: Initialize weights $w[0]$
3: for $k=0,1,\ldots$ do
4:     Execute action $a[k]$
5:     Obtain new state $x[k+1]$ and immediate cost $c[k+1]$
6:     Select $a[k+1]$ according to policy $\pi_{w[k]}$
7:     Calculate $\delta_{w[k]}(x[k],w[k])$
8:     $w[k+1]\leftarrow\Gamma(w[k]+\alpha_{k}\,\delta_{w[k]}(x[k],w[k]))$
9: end for
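
The following is a minimal Python sketch of one iteration of Algorithm 1 with the gated step sizes of Section II-C; the constants `gamma`, `b_alpha`, and `c_gamma` are illustrative assumptions rather than values used in the paper.

```python
import numpy as np

def phi(x, a):
    """Quadratic features: phi_n(x, a) = (x_n + 1{n==a})^2."""
    x_after = np.array(x, dtype=float)
    x_after[a] += 1.0
    return x_after ** 2

def project(w, c_gamma):
    """Projection Gamma onto the 2-norm ball of radius C_Gamma."""
    norm = np.linalg.norm(w)
    return w if norm <= c_gamma else c_gamma * w / norm

def sgs_update(w, x, a, cost, x_next, a_next, alpha_tilde,
               gamma=0.95, b_alpha=1e3, c_gamma=1e2):
    """One SGS update (7); returns the new weights and whether the update
    was accepted (i.e., whether the auxiliary step-size index advances)."""
    delta = (cost + gamma * phi(x_next, a_next) @ w - phi(x, a) @ w) * phi(x, a)
    if np.linalg.norm(delta) > b_alpha:  # TD update too large: alpha_k = 0
        return w, False
    return project(w + alpha_tilde * delta, c_gamma), True
```

In a training loop, `alpha_tilde` would be drawn from the auxiliary sequence and advanced only when the update is accepted, matching Table I.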

III Joint convergence guarantee

In this section, we develop the main result of this paper, which states that the proposed semi-gradient SARSA (SGS) algorithm ensures joint convergence of traffic state and weight vector if and only if the parallel service system is stabilizable.

Theorem 1.

(Joint convergence) Consider a stabilizable parallel service system with arrival rate $\lambda>0$ and service rates $\mu_{1},\mu_{2},\ldots,\mu_{N}>0$. Suppose that the step size condition (6) holds. Then the traffic state $x[k]$ converges in the sense of (3) and the weight vector $w[k]$ converges in the sense of (8).

The above result essentially states that the joint convergence of $w[k]$ and $x[k]$ relies on the step size condition (6). This condition is standard for reinforcement learning methods; it ensures that (i) sufficient updates will be made and (ii) randomness will not perturb the weights at steady state [15].

The rest of this section is devoted to the proof of the above theorem. Section III-A proves the stability of the traffic state, Section III-B establishes the unboundedness of $\sum_{k=0}^{\infty}\alpha_{k}$, and Section III-C proves the convergence of the approximation weights.

III-A Convergence of x[k]x[k]

In this section we construct a Lyapunov function and bound its drift to show the convergence of the traffic state $x[k]$ under the policy $\pi_{w}$ with a properly chosen temperature $\iota$. In particular, we consider the Lyapunov function

\hat{V}_{w}(x)=\sum_{n=1}^{N}w_{n}x_{n}^{2}

with the same weight vector as in (5). We show that there exist $\iota>0$, $\epsilon_{v}>0$, and $B_{v}<\infty$ such that for all $x\in\mathbb{Z}^{N}_{\geq 0}$ and all $w\in\mathbb{R}^{N}_{>0}$,

\mathcal{L}\hat{V}_{w}(x)\leq-\epsilon_{v}\sum_{n=1}^{N}w_{n}x_{n}+B_{v}, (9)

where $\mathcal{L}$ is the infinitesimal generator of the system under the (softmax) WSQ policy.
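
For completeness, we record the step behind the drift computation below: under the continuous-time dynamics of Section II-A, the generator acts on a test function $f$ as

\mathcal{L}f(x)=\lambda\sum_{a=1}^{N}\pi_{w}(a|x)\big(f(x+e_{a})-f(x)\big)+\sum_{n=1}^{N}\mu_{n}\mathbb{I}_{\{x_{n}\geq 1\}}\big(f(x-e_{n})-f(x)\big),

and applying it to $f=\hat{V}_{w}$ with the increments $\hat{V}_{w}(x\pm e_{n})-\hat{V}_{w}(x)=w_{n}(\pm 2x_{n}+1)$ gives the expression below.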

Let $l_{n}=\mathbb{I}_{\{x_{n}\geq 1\}}$ and $m=\arg\min_{n\leq N}w_{n}(2x_{n}+1)$. We have

\mathcal{L}\hat{V}_{w}(x)=\sum_{n=1}^{N}l_{n}\mu_{n}w_{n}(-2x_{n}+1)+\lambda\sum_{a=1}^{N}\pi_{w}(a|x)w_{a}(2x_{a}+1).

Suppose that $\iota$ is sufficiently small; since $(l_{n}-1)x_{n}=0$, we have

\mathcal{L}\hat{V}_{w}(x)\leq\sum_{n=1}^{N}\Big(\lambda\frac{\mu_{n}}{\sum_{k=1}^{N}\mu_{k}}-\mu_{n}\Big)w_{n}(2x_{n}+1)+B_{v}.

Then (9) holds when (4) holds, and the Foster–Lyapunov criterion [14, Theorem 4.3] allows us to conclude (3).

III-B Unboundedness of $\sum_{k=0}^{\infty}\alpha_{k}$

In this section, we show that $\sum_{k=0}^{\infty}\alpha_{k}=\infty$ a.s.

Lemma 1.

Under the assumptions of Theorem 1, let $W_{p}(x)=\sum_{n=1}^{N}e^{\nu w_{n}(2x_{n}+1)}/w_{n}$ with $\nu>0$. Then there exist a function $g_{p}\geq 1$ and a finite non-negative constant $B_{w}$ satisfying

\Delta W_{p}(x)=\mathsf{E}[W_{p}(x[k+1])-W_{p}(x[k])\,|\,x[k]=x]\leq-g_{p}(x)+B_{w}. (10)

Furthermore, we have

\lim_{K\rightarrow\infty}\frac{1}{K}\sum_{k=0}^{K-1}\mathsf{E}[g_{p}(x[k])]\leq B_{w}.

Proof: Consider the definitions of $m$ and $l_{n}$ from Section III-A. Under the (softmax) WSQ policy, we have

\Delta W_{p}(x)=\sum_{a=1}^{N}\pi_{w}(a|x)\Big[\sum_{n\neq a}\frac{l_{n}}{w_{n}}\Big(e^{\nu w_{n}(2x_{n}-2\mu_{n}+1)}-e^{\nu w_{n}(2x_{n}+1)}\Big)+\frac{1}{w_{a}}\Big((1-l_{a})e^{\nu w_{a}(2x_{a}+2\lambda+1)}+l_{a}e^{\nu w_{a}(2x_{a}+2(\lambda-\mu_{a})+1)}-e^{\nu w_{a}(2x_{a}+1)}\Big)\Big].

Note that $l_{n}e^{2\nu w_{n}x_{n}}=(l_{n}-1)+e^{2\nu w_{n}x_{n}}$, and suppose that $\iota$ is sufficiently small; then we have

\Delta W_{p}(x)=\sum_{n\neq m}e^{\nu w_{n}(2x_{n}+1)}(e^{-2\nu w_{n}\mu_{n}}-1)/w_{n}+e^{\nu w_{m}(2x_{m}+1)}(e^{2\nu w_{m}(\lambda-\mu_{m})}-1)/w_{m}+B_{0}, (11)

where $B_{0}=\frac{1}{w_{m}}\big(e^{\nu w_{m}(2x_{m}+1)}-e^{\nu w_{m}(2\lambda-2\mu_{m}+1)}\big)+\sum_{n\neq m}\frac{1}{w_{n}}e^{\nu w_{n}}\big(1-e^{-2\nu w_{n}\mu_{n}}\big)$ is a finite positive constant. In the case of $\lambda\leq\mu_{m}$, the drift equation (11) naturally satisfies (10). When $\lambda>\mu_{m}$, we have

\Delta W_{p}(x)\leq\sum_{n\neq m}e^{\nu w_{n}(2x_{n}+1)}\Lambda(\nu),
\Lambda(\nu)=\frac{\mu_{n}\big(e^{2\nu w_{m}(\lambda-\mu_{m})}-1\big)}{w_{m}\sum_{k\neq m}\mu_{k}}+\frac{e^{-2\nu w_{n}\mu_{n}}-1}{w_{n}}.

Note that $\Lambda(0)=0$ and $\Lambda(\nu)\rightarrow\infty$ as $\nu\rightarrow\infty$. The derivative of $\Lambda(\nu)$ at $\nu=0$ is

\frac{d\Lambda}{d\nu}\Big|_{\nu=0}=2\mu_{n}\Big(\frac{\lambda-\mu_{m}}{\sum_{k\neq m}\mu_{k}}-1\Big).

This derivative is negative when $\lambda<\sum_{n=1}^{N}\mu_{n}$, which implies that there exists $\nu_{0}>0$, the second zero of $\Lambda(\nu)$, such that $\Lambda(\nu)<0$ for $\nu\in(0,\nu_{0})$. We can now conclude that, with a proper selection of $\nu$, (10) is guaranteed; in particular, there exists a finite positive constant $B_{p}$ such that (10) holds with

g_{p}(x)=B_{p}\sum_{n=1}^{N}e^{\nu w_{n}(2x_{n}+1)}/w_{n}+1.

Following the proof in [17], summing the inequality (10) over epochs $k\in\{0,\ldots,K-1\}$ yields a telescoping series on the left-hand side, resulting in

\mathsf{E}[W_{p}(x[K])]-\mathsf{E}[W_{p}(x[0])]\leq K(B_{0}+1)-\sum_{k=0}^{K-1}\mathsf{E}\Big[B_{p}\sum_{n=1}^{N}e^{\nu w_{n}(2x_{n}+1)}/w_{n}+1\Big].

Since $W_{p}\geq 0$ and $\mathsf{E}[W_{p}(x[0])]<\infty$, we have

\lim_{K\rightarrow\infty}\frac{1}{K}\sum_{k=0}^{K-1}\mathsf{E}\Big[B_{p}\sum_{n=1}^{N}e^{\nu w_{n}(2x_{n}+1)}/w_{n}+1\Big]\leq B_{w},

where $B_{w}=B_{0}+1$. The above inequality implies the boundedness of $\mathsf{E}[\sum_{n=1}^{N}e^{\nu w_{n}(2x_{n}+1)}]$ and thus higher-order stability of the system state [18]. ∎

Proposition 1.

For every $w\in\mathbb{R}^{N}_{>0}$, the chain induced by $\pi_{w}$ is ergodic and positive Harris recurrent.

Proof: For the irreducibility of the chain, note that the state $x=\mathbf{0}$ is accessible from any initial condition with positive probability. By [19, Theorem 11.3.4], ergodicity and positive Harris recurrence then follow directly from Lemma 1. ∎

Proposition 2.

Under Proposition 1, the step size sequence $\{\alpha_{k}\}$ in SGS satisfies (6).

Proof: Let $\tilde{\mathcal{X}}:=\{x\in\mathbb{Z}^{N}_{\geq 0}:\|x\|\leq B_{\tilde{\mathcal{X}}}\}$, where $B_{\tilde{\mathcal{X}}}$ is a finite positive constant. Then the bound $\|\Delta[k]\nabla_{w}\hat{Q}(x[k],a[k];w[k])\|\leq B_{\alpha}$ can be satisfied by constraining $x[k]\in\tilde{\mathcal{X}}$ and $t_{k+1}-t_{k}\leq B_{T}$, where $B_{T}<\infty$ is a positive constant. That is, the weight vector $w[k+1]$ is updated only when the state and the time interval (i.e., the immediate cost) are not too large.

Let $\tau_{\tilde{\mathcal{X}}}:=\sum_{k=1}^{\infty}\mathbb{I}_{\{x[k]\in\tilde{\mathcal{X}}\}}$ denote the occupation time of the set $\tilde{\mathcal{X}}$. Since the chain is positive Harris recurrent, we have $P(\tau_{\tilde{\mathcal{X}}}=\infty)=1$. Since the arrival and service rates are well-defined, we have $P(t_{k+1}-t_{k}\leq B_{T})\geq P_{T}>0$. Then we have

\sum_{k=0}^{\infty}\alpha_{k}\geq P_{T}\sum_{k:\,x[k]\in\tilde{\mathcal{X}}}\tilde{\alpha}_{\tilde{k}}=\infty\quad\text{a.s.},
\sum_{k=0}^{\infty}\alpha_{k}^{2}\leq\sum_{\tilde{k}=0}^{\infty}\tilde{\alpha}^{2}_{\tilde{k}}<\infty\quad\text{a.s.},

as desired. ∎

III-C Convergence of w[k]

In the following, we establish the convergence of $w[k]$ by showing (i) that the weights converge under fixed-policy evaluation and (ii) that the difference among optimal weight vectors due to policy improvement is bounded.

Proposition 3.

(Lipschitz continuity of $\pi_{w}(a|x)$) For the WSQ policy, there exists $L_{\pi}>0$ such that for all $w,w^{\prime},a,x$,

\|\pi_{w}(a|x)-\pi_{w^{\prime}}(a|x)\|\leq L_{\pi}\|w-w^{\prime}\|.

Proof: Note that boundedness of the derivative with respect to $w$ implies Lipschitz continuity. With the constructed sequence $\{\alpha_{k}\}$, we have

|\pi_{w}(a|x)-\pi_{w^{\prime}}(a|x)|=0\leq L_{\pi}\|w-w^{\prime}\|,\quad x\notin\tilde{\mathcal{X}}.

For $x\in\tilde{\mathcal{X}}$, according to (2), we have $\pi_{w}(a|x)\in(0,1)$ and

\Big|\frac{d\pi_{w}(a|x)}{dw}\Big|\leq\Big|\frac{N}{\iota}\max_{a,b\leq N}2|x_{a}-x_{b}|\Big|,

which is bounded. ∎

With Propositions 1–3, we can analyze the fixed-policy performance and bound the difference among distinct policies. For clarity, we use the subscript $w^{*}[k]$ to denote the invariant steady-state quantities of the chain under the fixed policy $\pi_{w[k]}$, and the subscript $k$ to denote the actual quantities at the $k$-th transition. That is, $d_{w^{*}[k]}(\cdot)$ denotes the invariant distribution under policy $\pi_{w[k]}$, and $d_{k}(\cdot)$ denotes the state distribution at the $k$-th transition. Under the fixed policy $\pi_{w[k]}$, according to [20], we have

\bar{\delta}_{w[k]}(x[k],w[k])=A_{w^{*}[k]}w[k]+b_{w^{*}[k]},

where $A_{w^{*}[k]}$ is defined as

A_{w^{*}[k]}=\sum_{a^{\prime}\leq N,\,x^{\prime}\in\mathbb{Z}^{N}_{\geq 0}}\Big(\sum_{a\leq N,\,x\in\mathbb{Z}^{N}_{\geq 0}}d_{w^{*}[k]}(x)\pi_{w[k]}(a|x)\mathsf{p}_{w^{*}[k]}(x^{\prime}|a,x)\gamma\phi(x,a)-d_{w^{*}[k]}(x^{\prime})\phi(x^{\prime},a^{\prime})\Big)\pi_{w[k]}(a^{\prime}|x^{\prime})\phi^{T}(x^{\prime},a^{\prime})\leq(\gamma-1)\mathsf{E}_{w^{*}[k]}\Big[\phi(x^{\prime},a^{\prime})\phi^{T}(x^{\prime},a^{\prime})\Big],

which is negative definite since $\gamma\in(0,1)$. Analogously, we have $b_{w^{*}[k]}=\mathsf{E}_{w^{*}[k]}[\phi(x,a)c]$, $b_{k}=\mathsf{E}_{k}[\phi(x,a)c]$, $A_{k}=\mathsf{E}_{k}[\phi(x,a)(\gamma\phi^{T}(x^{\prime},a^{\prime})-\phi^{T}(x,a))]$, and $\mathsf{E}[\delta_{w[k]}(x,w)]=A_{k}w+b_{k}$.
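
For intuition only (this is not part of the proof), the following is a minimal sketch of a Monte Carlo estimate of $A_{k}$ and $b_{k}$ from sampled transitions; the batch format and the reuse of the `phi` helper sketched after Algorithm 1 are assumptions of this example.

```python
import numpy as np

def empirical_Ab(transitions, gamma=0.95):
    """Estimate A_k = E[phi(x,a)(gamma*phi(x',a') - phi(x,a))^T] and
    b_k = E[phi(x,a)*c] by averaging over (x, a, c, x', a') samples."""
    n = len(transitions)
    A = 0.0
    b = 0.0
    for (x, a, c, x_next, a_next) in transitions:
        f, f_next = phi(x, a), phi(x_next, a_next)  # quadratic features of Sec. II-C
        A = A + np.outer(f, gamma * f_next - f) / n
        b = b + f * c / n
    return A, b  # then E[delta_w(x, w)] is approximately A @ w + b
```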

According to [19, Theorem 14.0.1], we have

\sup_{f_{p}:|f_{p}|\leq g_{p}}\sum_{t=0}^{\infty}\Big|\sum_{x^{\prime}}\mathsf{p}_{k}^{t}(x^{\prime}|x)f_{p}(x^{\prime})-\sum_{x^{\prime}}d_{w^{*}[k]}(x^{\prime})f_{p}(x^{\prime})\Big|<B_{f}(W_{p}(x)+1),\quad x[k]=x,~x\in\tilde{\mathcal{X}}, (12)

where $B_{f}$ is a finite positive constant and $\mathsf{p}^{t}_{k}(x^{\prime}|x)$ denotes the transition probability from state $x$ to $x^{\prime}$ after $t$ steps under policy $\pi_{w[k]}$, with $\mathsf{p}(x^{\prime}|x)=\sum_{a=1}^{N}\pi(a|x)\mathsf{p}(x^{\prime}|a,x)$. According to [15], when (12) holds, the iterative algorithm (7) under a fixed policy has a unique solution $w^{*}$ satisfying $\bar{\delta}(x,w^{*})=0$. With a slight abuse of notation, let $w^{*}_{k}$ denote the solution of

\bar{\delta}_{w[k]}(x[k],w^{*}_{k})=0

in SGS under the fixed policy $\pi_{w[k]}$.

Since $e^{kx}\geq\frac{(kx)^{q}}{q!}$ for any $q\in\mathbb{Z}_{\geq 0}$ and $x\geq 0$, and since $\phi(x,a)=\mathcal{O}(\|x\|^{2})$, the boundedness of $g_{p}(x)$ defined in Lemma 1 implies the boundedness of $A_{w^{*}[k]}$ and $b_{w^{*}[k]}$. With the constructed step sizes $\{\alpha_{k}\}$, we have $\|A_{w^{*}[k]}\|<B_{\phi}$, $\|b_{w^{*}[k]}\|<B_{r}$, and $\mathsf{E}[\alpha^{2}_{k}\|\delta_{w[k]}(x,w)-\bar{\delta}_{w[k]}(x,w)\|^{2}]\leq\alpha^{2}_{k}B^{2}_{\delta}$, where $B_{\phi},B_{r},B_{\delta}$ are finite positive constants.

Following [11] and considering the weight update equation, the auxiliary sequence $\{u[k]\}$ is defined as

u[0]:=w[0],
u[k+1]:=\Gamma(u[k])+\alpha_{k}\delta_{w[k]}(x[k],\Gamma(u[k])).

Since $\hat{Q}$ and the policy $\pi$ are both linearly related to $w$, we have $w[k]=\Gamma(u[k])$ for all $k\in\mathbb{Z}_{\geq 0}$. Let $y^{\prime}=u[k+1]-w^{*}_{k+1}$ and $y=w[k]-w^{*}_{k}$; then we have

\mathsf{E}\Big[\frac{1}{2}\|y^{\prime}\|^{2}\Big]=\mathsf{E}\Big[\frac{1}{2}\|y\|^{2}\Big]+\underbrace{\mathsf{E}\Big[\big\langle y,\;\alpha_{k}\delta_{w[k]}(x[k],w[k])+w^{*}_{k}-w^{*}_{k+1}\big\rangle\Big]}_{T_{2}}+\underbrace{\mathsf{E}\Big[\frac{1}{2}\big\|\alpha_{k}\delta_{w[k]}(x[k],w[k])+w^{*}_{k}-w^{*}_{k+1}\big\|^{2}\Big]}_{T_{3}},

where $\langle\cdot,\cdot\rangle$ is defined as in Section II-C. The second term can be rewritten as

T_{2}=\Big\langle y,\mathsf{E}\Big[\underbrace{\alpha_{k}\delta_{w[k]}(x[k],w[k])}_{T_{21}}\Big]\Big\rangle+\langle y,w^{*}_{k}-w^{*}_{k+1}\rangle.

For $T_{3}$, we have

T_{3}=\mathsf{E}\Big[\frac{1}{2}\|w^{*}_{k}-w^{*}_{k+1}\|^{2}\Big]+\mathsf{E}\big[\big\langle w^{*}_{k}-w^{*}_{k+1},T_{21}\big\rangle\big]+\mathsf{E}\Big[\frac{1}{2}\|T_{21}\|^{2}\Big].

Now we are ready to bound each term. Note that $\alpha_{k}\neq 0$ only when $x[k]\in\tilde{\mathcal{X}}$. With [11, Theorem 4.4, Lemma C.5, Lemma D.10] and the analysis of $\delta_{w[k]}$ and $w^{*}_{k}$, we have

\alpha_{k}\|\delta_{w[k]}(x[k],w)\|\leq\alpha_{k}(L_{F}\|w\|+U_{F}),

and

\|w^{*}_{w}-w^{*}_{w^{\prime}}\|\leq\underbrace{(B^{2}_{A}B_{\phi}L_{DP}+B_{A}L_{D})L_{\pi}B_{r}}_{L_{w}}\|w-w^{\prime}\|\leq\alpha_{k}L_{w}(U_{F}+L_{F}C_{\Gamma}+C_{\Gamma})=\alpha_{k}L_{BW},

where $B_{A},U_{F},L_{F}$ are finite positive constants such that $B_{A}>\sup_{w}\|A^{-1}_{w^{*}}\|$, $\alpha_{k}U_{F}\geq\|\alpha_{k}b_{k}\|$, and $L_{F}\geq(1+\gamma)(B_{\tilde{\mathcal{X}}}+1)^{2}+1$, and $L_{D},L_{DP}$ are the Lipschitz constants of the state distribution and the transition probability, respectively. We have

T_{21}=\alpha_{k}\bar{\delta}_{w[k]}(\cdot)+\alpha_{k}\delta_{w[k]}(\cdot)-\alpha_{k}\bar{\delta}_{w[k]}(\cdot),

where $(\cdot)$ is short for $(x[k],w[k])$. By leveraging [11, Lemma D.2], supposing that $k$ is sufficiently large, we have $\langle y,\alpha_{k}\bar{\delta}_{w[k]}(\cdot)\rangle\leq(k_{\alpha}-1)\|y\|^{2}$, where $k_{\alpha}=\sqrt{1-B_{\phi}\alpha_{k}}\in(0,1)$. Then we have

T_{2}\leq\mathsf{E}\big[\big((k_{\alpha}-1)\|y\|+\alpha_{k}B_{\delta}+\alpha_{k}L_{BW}\big)\|y\|\big].

Analogously, we have

T_{3}\leq\mathsf{E}\Big[\frac{1}{2}(k_{\alpha}-1)^{2}\|y\|^{2}+\alpha_{k}(k_{\alpha}-1)(B_{\delta}+L_{BW})\|y\|+\alpha_{k}^{2}\Big(\frac{1}{2}B^{2}_{\delta}+B_{\delta}L_{BW}+\frac{1}{2}L^{2}_{BW}\Big)\Big].

According to the above analysis, we have

\mathsf{E}\Big[\frac{1}{2}\|u[k+1]-w^{*}_{k+1}\|^{2}\Big]\leq\frac{1}{2}\mathsf{E}\Big[\Big(k_{\alpha}\|w[k]-w^{*}_{k}\|+\alpha_{k}(L_{BW}+B_{\delta})\Big)^{2}\Big].

Then we have

z_{k+1}\leq k_{\alpha}z_{k}+\alpha_{k}(L_{BW}+B_{\delta}),

where $z_{k+1}=\sqrt{\mathsf{E}[\|u[k+1]-w^{*}_{k+1}\|^{2}]}$. Since $k_{\alpha}\in(0,1)$ and $\|w[k]-w^{*}_{k}\|\leq\|u[k]-w^{*}_{k}\|$, by iterating, we have

z_{k+1}\leq\sqrt{\prod^{k}_{\ell=k_{0}}(1-\alpha_{\ell}B_{\phi})}\,z_{k_{0}}+(L_{BW}+B_{\delta})\sum^{k}_{\ell=k_{0}}\alpha_{\ell}\sqrt{\prod^{k}_{j=\ell}(1-\alpha_{j}B_{\phi})}.

Since $\{\alpha_{k}\}$ satisfies (6) and the inequality $1-x\leq e^{-x}$ holds, we have

z_{k+1}\leq\frac{B_{\delta}+L_{BW}}{B_{\phi}}.

Considering the relationship between the policy and the weight vector, we have

\mathsf{E}[\|w[k]-w^{*}\|]\leq\mathsf{E}[\|w[k]-w^{*}_{k}\|]+\mathsf{E}[\|w^{*}_{k}-w^{*}_{w^{*}}\|]\leq\mathsf{E}[\|w[k]-w^{*}_{k}\|]+L_{w}\mathsf{E}[\|w[k]-w^{*}\|].

When the immediate cost $c$ is constrained such that $L_{w}<1$, we have

\mathsf{E}[\|w[k]-w^{*}\|]\leq\frac{1}{1-L_{w}}\mathsf{E}[\|w[k]-w^{*}_{k}\|]\leq\frac{1}{1-L_{w}}z_{k}.

Finally, we conclude that

\mathsf{E}[\|w[k]-w^{*}\|]\leq\frac{B_{\delta}+L_{BW}}{(1-L_{w})B_{\phi}},

which yields (8).

IV Experiments

To evaluate the performance of the semi-gradient SARSA (SGS) algorithm with weighted shortest queue (WSQ) policy, we consider two benchmarks:

  • Neural network (NN)-WSQ: We construct an NN to approximate $Q(x,a)$. The algorithm is similar to Algorithm 1, except that the weight update is replaced by an NN update with the adaptive moment estimation (Adam) algorithm [21]. Specifically, the NN has two fully connected layers with rectified linear unit (ReLU) activations. The loss function is the mean squared error between the one-step predicted and computed state-action values. Since an exact optimal policy of the original MDP is not readily available, the policy computed by the NN is used as an approximate optimal policy.

  • Join the shortest queue (JSQ) policy: Under the JSQ policy, we simply select the server with the shortest queue, that is, $a_{\mathrm{JSQ}}=\arg\min_{n\leq N}x_{n}$.

Consider the network in Fig. 1 with three parallel servers. Suppose that the service rates are $\mu_{1}=0.5$, $\mu_{2}=2.5$, $\mu_{3}=5$ and the arrival rate is $\lambda=2$, all in units of sec$^{-1}$. The WSQ temperature parameter is set to $\iota=0.01$. For the SGS algorithm, we initialize the weights as $w_{1}=w_{2}=w_{3}=0.5$. For simulation, a discrete time step of 0.1 sec is used. All experiments were implemented in Google Colab [22], using an Intel(R) Xeon(R) CPU with 12.7 GB memory. We trained SGS for $10^{6}$ episodes and the NN for $4\times 10^{6}$ episodes, evaluated each for $10^{6}$ episodes, and the results are as follows.
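
For illustration, a minimal simulation sketch of this comparison is given below, reusing the `step` and `wsq_policy` helpers sketched in Section II; the learned weights are taken from the result reported below, and the shortened horizon is an assumption made for a quick run rather than the paper's exact evaluation pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 2.0, np.array([0.5, 2.5, 5.0])  # rates from the experiment setup
w_sgs = np.array([0.60, 0.49, 0.15])      # weights reported for SGS

def average_system_time(policy, horizon=1e4):
    """Estimate the mean system time via Little's law: E[T] = E[||x||_1] / lambda."""
    x, t, area = np.zeros(3, dtype=int), 0.0, 0.0
    while t < horizon:
        a = policy(x)
        x_next, dt, _ = step(x, a, lam, mu)  # transition sampler from Section II-B
        area += x.sum() * dt                 # time integral of the total queue length
        x, t = x_next, t + dt
    return area / horizon / lam

wsq = lambda x: rng.choice(3, p=wsq_policy(x, w_sgs, iota=0.01))
jsq = lambda x: int(np.argmin(x))
print(average_system_time(wsq), average_system_time(jsq))
```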

Regarding performance, the weights of SGS converge to $w=[0.60\ 0.49\ 0.15]$, i.e., a slower server receives a larger weight. Our WSQ policy is not restricted to non-idling behavior: since $w_{1}(2\times 0+1)=0.60>w_{3}(2\times 1+1)=0.45$, the server with the higher service rate (i.e., server 3) is more likely to be selected even when the slower server (i.e., server 1) is empty.


Figure 2: Performance comparison between SGS and NN.
TABLE II: Average system times of various schemes.
Algorithm Normalized Average System Time
Neural network (NN) 1.00
Semi-gradient SARSA (SGS) 1.08
Join the shortest queue (JSQ) 2.78

Table II lists the normalized average system times under the various methods. The results are generated from test simulations of $10^{5}$ sec. Although NN performs better with long training as expected, SGS performs better after only a small number of iterations, as demonstrated in Fig. 2.

Under the JSQ policy, jobs spend more time traversing the queuing network, 0.8581 sec on average. For WSQ, although the implementation efficiency of SGS is slightly worse than that of NN, SGS gives the best trade-off between computational efficiency and implementation efficiency: the average system times of NN and SGS are 0.3078 sec and 0.3318 sec, respectively. More importantly, the SGS algorithm theoretically ensures the convergence of the routing decision, while the NN might diverge, let alone guarantee the existence of an optimal decision.

V Conclusion

In this paper, we propose a semi-gradient SARSA(0) (SGS) algorithm with linear value function approximation for dynamic routing over parallel servers. We extend the analysis of SGS to an infinite state space and show that the convergence of the weight vector in SGS is coupled with the convergence of the traffic state, and that the joint convergence is guaranteed if and only if the parallel service system is stabilizable. Specifically, the approximator is used as a Lyapunov function for traffic state stability analysis, and the constraint on and convergence analysis of the weight vector are based on stochastic approximation theory. Moreover, our analysis can be extended to higher-order polynomial approximators. We compare the proposed SGS algorithm with a neural network-based algorithm and show that our algorithm converges faster with higher computational efficiency and an insignificant optimality gap. Possible future work includes (i) extension of the joint convergence result as well as the SGS algorithm to general service networks and (ii) analysis of the fixed-point convergence condition.

References

  • [1] J. G. Dai and M. Gluzman, “Queueing network controls via deep reinforcement learning,” Stochastic Systems, vol. 12, no. 1, pp. 30–67, 2022.
  • [2] Q. Xie and L. Jin, “Stabilizing queuing networks with model data-independent control,” IEEE Transactions on Control of Network Systems, vol. 9, no. 3, pp. 1317–1326, 2022.
  • [3] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [4] P. Kumar and S. P. Meyn, “Stability of queueing networks and scheduling policies,” IEEE Transactions on Automatic Control, vol. 40, no. 2, pp. 251–260, 1995.
  • [5] J. G. Dai and S. P. Meyn, “Stability and convergence of moments for multiclass queueing networks via fluid limit models,” IEEE Transactions on Automatic Control, vol. 40, no. 11, pp. 1889–1904, 1995.
  • [6] S. Bradtke and M. Duff, “Reinforcement learning methods for continuous-time Markov decision problems,” Advances in Neural Information Processing Systems, vol. 7, 1994.
  • [7] B. Liu, Q. Xie, and E. Modiano, “RL-QN: A reinforcement learning framework for optimal control of queueing systems,” ACM Transactions on Modeling and Performance Evaluation of Computing Systems, vol. 7, no. 1, pp. 1–35, 2022.
  • [8] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang, “Experience-driven networking: A deep reinforcement learning based approach,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications.   IEEE, 2018, pp. 1871–1879.
  • [9] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, “QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach,” in 2016 IEEE International Conference on Services Computing (SCC).   IEEE, 2016, pp. 25–33.
  • [10] G. J. Gordon, “Reinforcement learning with function approximation converges to a region,” Advances in Neural Information Processing Systems, vol. 13, 2000.
  • [11] S. Zhang, R. T. Des Combes, and R. Laroche, “On the convergence of SARSA with linear function approximation,” in International Conference on Machine Learning.   PMLR, 2023, pp. 41 613–41 646.
  • [12] D. P. De Farias and B. Van Roy, “On the existence of fixed points for approximate value iteration and temporal-difference learning,” Journal of Optimization Theory and Applications, vol. 105, pp. 589–608, 2000.
  • [13] F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic programming for feedback control.   John Wiley & Sons, 2013.
  • [14] S. P. Meyn and R. L. Tweedie, “Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time processes,” Advances in Applied Probability, vol. 25, no. 3, pp. 518–548, 1993.
  • [15] J. Tsitsiklis and B. Van Roy, “Analysis of temporal-difference learning with function approximation,” Advances in Neural Information Processing Systems, vol. 9, 1996.
  • [16] R. G. Gallager, Stochastic Processes: Theory for Applications.   Cambridge University Press, 2013.
  • [17] L. Georgiadis, M. J. Neely, L. Tassiulas et al., “Resource allocation and cross-layer control in wireless networks,” Foundations and Trends® in Networking, vol. 1, no. 1, pp. 1–144, 2006.
  • [18] S. Meyn, Control techniques for complex networks.   Cambridge University Press, 2008.
  • [19] S. P. Meyn and R. L. Tweedie, Markov chains and stochastic stability.   Springer Science & Business Media, 2012.
  • [20] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforcement learning with function approximation,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664–671.
  • [21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [22] “Google Colaboratory,” https://colab.research.google.com/, accessed: 2023-03-28.