On Joint Convergence of Traffic State and Weight Vector in Learning-Based Dynamic Routing with Value Function Approximation
Abstract
Learning-based approaches are increasingly popular for traffic control problems. However, they are typically applied as black boxes with limited theoretical guarantees and interpretability. In this paper, we develop theory for dynamic routing over parallel servers, a representative traffic control task, under a semi-gradient on-policy control (SARSA) algorithm, a representative reinforcement learning method. We consider a linear value function approximation over an infinite state space; a Lyapunov function is also derived from the approximator. In particular, the structure of the approximator naturally admits idling policies, which is an interesting and useful advantage over existing dynamic routing schemes. We show that the convergence of the approximation weights is coupled with the convergence of the traffic state: if the system is stabilizable, then (i) the weight vector converges to a bounded region, and (ii) the traffic state is bounded in the mean. We also empirically show that the proposed algorithm is computationally efficient with an insignificant optimality gap.
Index terms: Dynamic routing, reinforcement learning, Lyapunov method, value function approximation.
I Introduction
Dynamic routing is a classical control problem in transportation, manufacturing, and networking. The problem is classically challenging because analytical characterization of the steady-state distribution of the traffic state, and thus of long-run performance metrics (e.g., queuing delay), is very difficult [1, 2]. Recently, there has been rapidly growing interest in applying reinforcement learning (RL) methods to dynamic routing and to network control problems in general. RL methods are attractive because of their computational efficiency and adaptivity to unknown/non-stationary environments [3]. However, there is still a non-trivial gap between the demand for theoretical guarantees on key performance metrics and the black-box nature of RL methods. In particular, most existing theoretical results on RL are developed in the context of finite Markov decision processes (MDPs), while dynamic routing must often be considered over infinite state spaces, especially for stability and throughput analysis.
In this paper, we make an effort to respond to the above challenge by studying the behavior of a parallel service system (Fig. 1) controlled by a class of semi-gradient SARSA (SGS) algorithms with linear function approximation; these methods are attractive because of (i) adaptivity to unknown model parameters and (ii) potential to obtain near-optimal policies.
Importantly, we jointly consider the convergence of the algorithm training process and of the traffic state process. The specific research questions are:
1. How is the convergence of the weight vector coupled with the convergence of the traffic state?
2. Under what conditions does the proposed SGS algorithm ensure the joint convergence of the weights and the state?
The above questions involve two bodies of literature, viz. dynamic routing and reinforcement learning. Classical dynamic routing schemes rely on Lyapunov methods to study traffic stability and provide a solid foundation for our work [4, 5]. However, these methods are less suited to searching for routing policies that optimize the average system time. In particular, existing results usually assume non-idling policies, which may be quite restrictive. Recently, RL has been used to find optimal routing policies and provides important tools and insights for practice [6]. In particular, Liu et al. proposed a mixed scheme that uses learning over a bounded set of states and a known stabilizing policy for the other states [7]; Xu et al. proposed a deep RL-based framework for traffic engineering problems in communication networks [8]; Lin et al. proposed an adaptive routing algorithm utilizing RL in a hierarchical network [9]. However, to the best of our knowledge, existing theory on RL mostly considers MDPs with finite or bounded state spaces [10, 11, 12]; the theory on infinite/unbounded state spaces is limited to rather special problems (e.g., linear-quadratic regulation [13]). Hence, existing learning-based routing algorithms either rely on empirical evidence for convergence or build on finite-MDP theories. Thus, there is a lack of solid theory on the convergence of value function approximation over unbounded state spaces; such a theory is essential for developing interpretable and reliable routing schemes.
In response to the above research gaps, we jointly consider the convergence of the traffic state and of the weight vector in dynamic routing. The traffic state characterizes the queuing process in the parallel service system illustrated in Fig. 1. The routing objective is to minimize the expected total system time, which includes waiting times and service times. The weights parameterize the approximate action value function and thus determine the routing policy. In particular, the algorithm naturally admits idling policies, which is an advantage over existing methods. The weights are updated by a semi-gradient on-policy algorithm.
Our main result (Theorem 1) states that the proposed algorithm ensures joint convergence of the traffic state and the weight vector if and only if the system is stabilizable. Importantly, we study the coupling between the long-time behavior of the traffic state and that of the weight vector, which extends the existing theory on finite-state RL [11] to unbounded state spaces. The convergence of the traffic state follows from a Lyapunov function, associated with the approximate value function, that verifies the drift criterion [14]. The convergence of the weights follows from stochastic approximation theory [15]. We compare the proposed algorithm with a much more sophisticated neural network-based algorithm and show that our algorithm converges much faster than the benchmark, with only an 8% optimality gap.
In summary, the contributions of this paper are as follows.
1. We propose a structurally intuitive and technically sound algorithm to learn near-optimal routing policies over parallel servers.
2. We study joint convergence of traffic state and weight vector under the proposed algorithm; this is theoretically interesting in itself.
3. We show empirical evidence for the computational efficiency and near-optimality of the proposed algorithm.
The rest of this paper is organized as follows. Section II introduces the parallel service system model, the MDP formulation, and the SGS algorithm. Section III presents and develops the main result on joint convergence. Section IV compares the SGS algorithm with two benchmarks. Section V gives the concluding remarks.
II Modeling and Formulation
Consider the system of parallel servers with infinite buffer sizes in Fig. 1. In this section, we model the dynamics of the system, formulate the dynamic routing problem as a Markov decision process (MDP), and introduce our semi-gradient SARSA (SGS) algorithm.
II-A System modeling
Let $\mathcal N=\{1,2,\ldots,n\}$ be the set of parallel servers. Each server $i\in\mathcal N$ has exponentially distributed service times with rate $\mu_i$ and holds $X_i(t)$ jobs at time $t$. The state of the system is $X(t)=[X_1(t),\ldots,X_n(t)]^\top$, and the state space is $\mathcal X=\mathbb Z_{\ge 0}^n$. Jobs arrive at the origin according to a Poisson process of rate $\lambda$. When a job arrives, it is routed to one of the servers according to a routing policy $\pi(a|x)$, $a\in\mathcal N$, $x\in\mathcal X$.
That is, $\pi(a|x)$ is the probability of routing the new job to server $a$ conditional on state $x$. This paper focuses on a particular class of routing policies which we call the weighted shortest queue (WSQ) policy. WSQ is based on the approximation $\hat q(x,a;w)$ for the action value function, defined as:
$$\hat q(x,a;w)=\sum_{i\in\mathcal N} w_i\big(x_i+\mathbb 1\{a=i\}\big)^2, \qquad (1)$$
where $w$ is the weight vector. For technical convenience, we consider the softmax version of WSQ
$$\pi_w(a|x)=\frac{\exp\big(-\hat q(x,a;w)/\tau\big)}{\sum_{a'\in\mathcal N}\exp\big(-\hat q(x,a';w)/\tau\big)}, \qquad (2)$$
where $\tau>0$ is the temperature of the softmax function. Note that $\pi_w$ converges to a deterministic policy that is greedy with respect to $\hat q$ as $\tau$ approaches 0 [11].
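As a minimal sketch of how the softmax WSQ rule can be evaluated, the following Python snippet assumes the weighted post-routing squared-queue form of (1); the function names, weights, state, and temperature are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def q_hat(x, a, w):
    """Approximate action value: weighted squared queue lengths after the
    arriving job is routed to server a (0-indexed); assumed form of (1)."""
    x_after = np.array(x, dtype=float)
    x_after[a] += 1.0
    return float(np.dot(w, x_after ** 2))

def wsq_softmax_policy(x, w, tau):
    """Softmax WSQ policy as in (2): a lower approximate cost-to-go gives a
    higher routing probability; tau -> 0 recovers the greedy WSQ rule."""
    q = np.array([q_hat(x, a, w) for a in range(len(x))])
    z = -(q - q.min()) / tau          # shift by the minimum for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Illustrative weights/state: although server 0 is empty, the faster server 1
# (smaller weight) receives most of the routing probability, an idling decision.
print(wsq_softmax_policy(x=[0, 1], w=np.array([2.0, 0.5]), tau=0.1))
```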
We say that the traffic in the system is stable if there exists a constant $C<\infty$ such that for any initial condition,
$$\limsup_{t\to\infty}\mathbb E\big[\|X(t)\|_1\big]\le C. \qquad (3)$$
We say that the system is stabilizable if
$$\lambda<\sum_{i\in\mathcal N}\mu_i. \qquad (4)$$
Note that condition (4) ensures the existence of at least one stabilizing Bernoulli routing policy.
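For example, the Bernoulli policy that routes each arrival to server $i$ with probability $p_i=\mu_i/\sum_{j\in\mathcal N}\mu_j$, independently of the state, splits the arrival stream into independent Poisson streams of rates $\lambda p_i$, so that each server behaves as an M/M/1 queue with utilization
$$\rho_i=\frac{\lambda p_i}{\mu_i}=\frac{\lambda}{\sum_{j\in\mathcal N}\mu_j}<1,\qquad i\in\mathcal N,$$
whenever (4) holds; this proportional split is one convenient stabilizing choice among many.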
II-B MDP formulation
Since routing actions are made only at transition epochs [16, p.72], the routing problem of the parallel queuing system can be formulated as a discrete-time (DT) MDP with countably infinite state space $\mathcal X$ and finite action space $\mathcal A=\mathcal N$. With a slight abuse of notation, we denote the state and action of the DT MDP as $X[k]$ and $A[k]$, respectively. Specifically, $X[k]=X(t_k)$, where $t_k$ is the $k$-th transition epoch of the continuous-time process. As indicated in Section II-A, the routing policy can be parameterized via the weight vector $w$.
The transition probability of the DT MDP can be derived from the system dynamics in a straightforward manner. Let $e_i$ denote the unit vector whose $i$-th entry is 1 and whose other entries are 0. Then we have
$$\Pr\big\{X[k+1]=x+e_a\,\big|\,X[k]=x,A[k]=a\big\}=\frac{\lambda}{\lambda+\sum_{j\in\mathcal N}\mu_j\mathbb 1\{x_j>0\}},$$
$$\Pr\big\{X[k+1]=x-e_i\,\big|\,X[k]=x,A[k]=a\big\}=\frac{\mu_i\mathbb 1\{x_i>0\}}{\lambda+\sum_{j\in\mathcal N}\mu_j\mathbb 1\{x_j>0\}},\qquad i\in\mathcal N.$$
The one-step random cost of the MDP is given by
$$c[k]=\big\|X[k]\big\|_1\,(t_{k+1}-t_k),$$
where $\|\cdot\|_1$ is the 1-norm on $\mathbb R^n$. Let $\bar c(x,a)$ denote the expected value of the cost. The total discounted cost over the infinite-horizon process is thus given by
$$J=\mathbb E\Big[\sum_{k=0}^{\infty}\gamma^k c[k]\Big],$$
where $\gamma\in(0,1)$ is the discount factor of the infinite-horizon MDP.
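To make the formulation concrete, the following sketch samples one transition of the embedded chain and its one-step cost under the assumptions above (the routing action applies to the next arriving job, and the cost equals the number of jobs in the system times the inter-transition time); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mdp_step(x, a, lam, mu):
    """Sample one transition of the embedded DT MDP (assumed dynamics):
    the next event is an arrival (routed to server a) or a departure from a
    non-empty server, and the one-step cost is ||x||_1 * (t_{k+1} - t_k)."""
    x = np.array(x, dtype=int)
    rates = np.concatenate(([lam], np.asarray(mu) * (x > 0)))  # arrival + active services
    total_rate = rates.sum()
    dt = rng.exponential(1.0 / total_rate)      # inter-transition time
    cost = x.sum() * dt                         # jobs in system times elapsed time
    event = rng.choice(len(rates), p=rates / total_rate)
    x_next = x.copy()
    if event == 0:
        x_next[a] += 1                          # arrival routed to server a
    else:
        x_next[event - 1] -= 1                  # departure from server (event - 1)
    return x_next, cost, dt

print(mdp_step(x=[2, 0, 1], a=2, lam=2.8, mu=[1.0, 1.5, 2.0]))
```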
Let $q^*$ denote the solution of the Bellman optimality equation, that is,
$$q^*(x,a)=\bar c(x,a)+\gamma\,\mathbb E\Big[\min_{a'\in\mathcal A}q^*\big(X[k+1],a'\big)\,\Big|\,X[k]=x,A[k]=a\Big],$$
and let $\pi^*$ denote the greedy policy with respect to $q^*$.
A closed-form solution for $q^*$ is not easy to obtain. Hence, we use the function $\hat q$ defined by (1) as a proxy for $q^*$. Motivated by [11, 15], we consider the function approximator
$$\hat q(x,a;w)=\sum_{i\in\mathcal N}w_i\,\phi_i(x,a)=\big\langle w,\phi(x,a)\big\rangle,\qquad \phi_i(x,a)=\big(x_i+\mathbb 1\{a=i\}\big)^2, \qquad (5)$$
where $\phi_1,\ldots,\phi_n$ are linearly independent basis functions. Let $w^*$ denote the optimal solution to
$$\min_{w}\ \sum_{x\in\mathcal X}\sum_{a\in\mathcal A}d_{\pi}(x)\,\pi(a|x)\big(q^{\pi}(x,a)-\hat q(x,a;w)\big)^2,$$
where $d_{\pi}$ is the invariant state distribution under policy $\pi$ and $q^{\pi}$ is the action value function of $\pi$. We select quadratic basis functions because $q^*$ is non-linear in the state and the quadratic function is one of the simplest non-linear forms; moreover, the analysis generalizes to polynomials of higher order.
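A brief sketch of the assumed quadratic basis and of a weighted least-squares fit of the weights against sampled action-value targets; the targets, sampling weights, and function names are placeholders of ours, not the paper's implementation.

```python
import numpy as np

def phi(x, a):
    """Assumed quadratic basis of (5): post-routing squared queue lengths,
    so that q_hat(x, a; w) = <w, phi(x, a)> is linear in w."""
    x_after = np.array(x, dtype=float)
    x_after[a] += 1.0
    return x_after ** 2

def fit_weights(Phi, q_targets, d_weights):
    """Weighted least squares: minimize sum_i d_i (q_i - <w, Phi_i>)^2,
    mimicking the projection onto the approximation class under d_pi.
    Assumes enough samples so that the normal equations are non-singular."""
    D = np.diag(d_weights)
    return np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ q_targets)
```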
II-C Semi-gradient SARSA algorithm
Inspired by [11], let $w[k]$ denote the weight vector at the $k$-th transition epoch, which is updated by a SARSA(0) algorithm
$$w[k+1]=\Gamma\Big(w[k]+\alpha[k]\,\delta[k]\,\nabla_w\hat q\big(X[k],A[k];w[k]\big)\Big);$$
in the above, $\Gamma$ is a projection operator, $\alpha[k]$ is the stochastic step size, $\delta[k]$ is the temporal-difference (TD) error, and $\nabla_w\hat q$ is the gradient, which are specified as follows.
The projection $\Gamma$ is defined with a positive constant $W$:
$$\Gamma(w)=\arg\min_{w':\,\|w'\|_2\le W}\ \|w'-w\|_2,$$
where $\|\cdot\|_2$ is the standard 2-norm. Besides, we use $\langle\cdot,\cdot\rangle$ to denote the standard inner product in Euclidean spaces.
The temporal-difference (TD) error and the gradient are as follows. Let $\phi[k]=\phi\big(X[k],A[k]\big)$; then we can compactly write
$$\hat q\big(X[k],A[k];w\big)=\big\langle w,\phi[k]\big\rangle.$$
Then, for any $k$, the TD error and the gradient are collectively given by
$$\delta[k]=c[k]+\gamma\big\langle w[k],\phi[k+1]\big\rangle-\big\langle w[k],\phi[k]\big\rangle,\qquad \nabla_w\hat q\big(X[k],A[k];w[k]\big)=\phi[k].$$
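For concreteness, a sketch of the TD error and semi-gradient for the linear approximator under the discounted-cost convention above; the variable names are ours.

```python
import numpy as np

def td_error_and_gradient(w, phi_k, phi_k1, cost_k, gamma):
    """SARSA(0) temporal-difference error and semi-gradient for q_hat = <w, phi>:
    delta[k] = c[k] + gamma * <w, phi[k+1]> - <w, phi[k]>, gradient = phi[k]."""
    delta = cost_k + gamma * np.dot(w, phi_k1) - np.dot(w, phi_k)
    return delta, phi_k
```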
The step sizes are generated by the following mechanism. We define an auxiliary deterministic sequence $\{\beta[j]\}_{j\ge0}$ satisfying the standard step-size condition [10]
$$\sum_{j=0}^{\infty}\beta[j]=\infty,\qquad \sum_{j=0}^{\infty}\beta[j]^2<\infty. \qquad (6)$$
The step sizes should not be so small that the iteration stalls, nor so large that convergence is impeded. Let $B$ denote a finite positive constant and define the set of update epochs
$$\mathcal K=\big\{k:\ c[k]\le B\big\}.$$
Then the step-size sequence can be constructed as
$$\alpha[k]=\begin{cases}\beta\big[\kappa(k)\big], & k\in\mathcal K,\\ 0, & \text{otherwise},\end{cases}$$
where $\kappa(k)=\big|\{k'<k:\ k'\in\mathcal K\}\big|$. That is, $\{\alpha[k]\}$ consists of zeros and elements of the deterministic sequence $\{\beta[j]\}$, as demonstrated in Table I (and in the sketch following the table). Thus the weight vector is updated only when the constraint is satisfied.
$k$ | Constraint satisfied? | $\alpha[k]$
---|---|---
0 | Yes | $\beta[0]$
1 | Yes | $\beta[1]$
2 | No | 0
3 | Yes | $\beta[2]$
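A sketch of the gating rule above, taking the harmonic sequence $\beta[j]=1/(j+1)$ as one admissible choice satisfying (6); the function name and numerical values are illustrative.

```python
def gated_step_sizes(costs, B):
    """Step sizes alpha[k]: take the next element of the deterministic
    sequence beta[j] = 1/(j+1) when c[k] <= B, and 0 otherwise (cf. Table I)."""
    alphas, j = [], 0
    for c in costs:
        if c <= B:
            alphas.append(1.0 / (j + 1))
            j += 1
        else:
            alphas.append(0.0)
    return alphas

# Matches the pattern of Table I when only the third epoch violates the constraint:
print(gated_step_sizes(costs=[1.0, 2.0, 9.0, 1.5], B=5.0))  # [1.0, 0.5, 0.0, 0.333...]
```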
The update equation thus becomes
$$w[k+1]=\Gamma\Big(w[k]+\alpha[k]\,\delta[k]\,\phi[k]\Big). \qquad (7)$$
It is known that SARSA chatters when combined with linear function approximation [10]. We say that Algorithm 1 is convergent to a bounded region if there exists a positive finite constant $\bar W$ such that
$$\limsup_{k\to\infty}\mathbb E\Big[\big\|w[k]-w^*\big\|_2\Big]\le \bar W \qquad (8)$$
for every initial traffic state $X[0]$ and every initial weight $w[0]$.
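Putting the pieces together, a minimal sketch of one iteration of (7); the Euclidean-ball form of the projection $\Gamma$ and the variable names are our assumptions.

```python
import numpy as np

def project(w, W_max):
    """Gamma: Euclidean projection onto the ball {w : ||w||_2 <= W_max}."""
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    return w if norm <= W_max else w * (W_max / norm)

def sgs_iteration(w, phi_k, phi_k1, cost_k, alpha_k, gamma, W_max):
    """One projected semi-gradient SARSA(0) step as in (7); alpha_k is the
    gated step size and equals zero on epochs where the constraint fails."""
    w = np.asarray(w, dtype=float)
    delta = cost_k + gamma * np.dot(w, phi_k1) - np.dot(w, phi_k)
    return project(w + alpha_k * delta * np.asarray(phi_k, dtype=float), W_max)
```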
III Joint convergence guarantee
In this section, we develop the main result of this paper, which states that the proposed semi-gradient SARSA (SGS) algorithm ensures joint convergence of traffic state and weight vector if and only if the parallel service system is stabilizable.
Theorem 1.
Suppose that the step sizes satisfy (6) and that the weights are updated by Algorithm 1. Then the traffic state converges in the sense of (3) and the weight vector converges to a bounded region in the sense of (8) if and only if the parallel service system is stabilizable, i.e., (4) holds.
The above result essentially states that the joint convergence of $X(t)$ and $w[k]$ relies on the step-size constraint (6). Such conditions are standard for reinforcement learning methods: they ensure that (i) sufficient updates will be made and (ii) randomness will not perturb the weights at steady state [15].
The rest of this section is devoted to the proof of the above theorem. Section III-A proves the stability of the traffic state, Section III-B establishes the unboundedness of the cumulative step sizes, and Section III-C proves the convergence of the approximation weights.
III-A Convergence of $X(t)$
In this section, we construct a Lyapunov function and establish a drift condition to show the convergence of the traffic state $X(t)$ under the policy $\pi_w$ with a properly chosen temperature parameter $\tau$. In particular, we consider the Lyapunov function
$$V(x)=\sum_{i\in\mathcal N}w_i x_i^2,$$
with the same weight vector $w$ as in (5). We show that there exist constants $b<\infty$ and $c>0$ such that for all $x\in\mathcal X$ and all admissible $w$,
$$\mathcal L V(x)\le -c\,\|x\|_1+b, \qquad (9)$$
where $\mathcal L$ is the infinitesimal generator of the system under the (softmax) WSQ policy.
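As a numerical illustration of the drift condition (9), under the reconstructed quadratic forms of $\hat q$ and $V$, the following sketch evaluates the generator drift at a given state; the rates, weights, and temperature are illustrative, not taken from the paper.

```python
import numpy as np

def drift_V(x, w, lam, mu, tau):
    """Generator drift L V(x) of the candidate Lyapunov function
    V(x) = sum_i w_i x_i^2 for the controlled CTMC under the softmax WSQ
    policy: arrivals routed by the policy plus departures from busy servers."""
    x = np.asarray(x, dtype=float)
    V = lambda y: float(np.dot(w, y ** 2))
    # Softmax WSQ routing probabilities (lower post-routing cost -> higher probability).
    q = np.array([V(x + np.eye(len(x))[a]) for a in range(len(x))])
    p = np.exp(-(q - q.min()) / tau)
    p /= p.sum()
    drift = 0.0
    for a in range(len(x)):                 # arrival routed to server a
        drift += lam * p[a] * (V(x + np.eye(len(x))[a]) - V(x))
    for i in range(len(x)):                 # departure from a non-empty server i
        if x[i] > 0:
            drift += mu[i] * (V(x - np.eye(len(x))[i]) - V(x))
    return drift

# Drift at a moderately large state for a stabilizable instance (negative here):
print(drift_V([10, 10], w=[1.0, 1.0], lam=1.5, mu=[1.0, 1.0], tau=0.1))
```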
III-B Unboundedness of $\sum_k \alpha[k]$
In this section, we show that $\sum_{k=0}^{\infty}\alpha[k]=\infty$ a.s.
Lemma 1.
Under the constraints in Theorem 1, let $V(x)=\sum_{i\in\mathcal N}w_i x_i^2$; then there exist a non-negative function $f$ on $\mathcal X$ and a finite non-negative constant $b$ satisfying
$$\mathbb E\Big[V\big(X[k+1]\big)-V\big(X[k]\big)\,\Big|\,X[k]=x\Big]\le -f(x)+b. \qquad (10)$$
Furthermore, we have
$$\limsup_{K\to\infty}\frac{1}{K}\sum_{k=0}^{K-1}\mathbb E\big[f\big(X[k]\big)\big]\le b.$$
Proof: Consider the Lyapunov function $V$ defined as in Section III-A. Similarly, under the (softmax) WSQ policy and for a sufficiently small temperature $\tau$, the one-step drift of $V$ along the embedded chain satisfies
$$\mathbb E\Big[V\big(X[k+1]\big)-V\big(X[k]\big)\,\Big|\,X[k]=x\Big]\le g(x)+b_1, \qquad (11)$$
where $b_1$ is a finite positive constant and $g$ collects the state-dependent part of the drift. When $x=0$, the drift inequality (11) naturally satisfies (10). When $x\neq 0$, we examine $g$ as a function of the total queue length: its derivative is negative beyond a finite threshold, which implies that $g$ admits a second zero and is negative for all larger states. We can therefore conclude that, with a proper selection of $\tau$, (10) is guaranteed with some finite positive constant $b$.
Following the proof in [17], summing the inequality (10) over epochs yields a telescoping series on the left-hand side, resulting in
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb E\big[f\big(X[k]\big)\big]\le \frac{\mathbb E\big[V\big(X[0]\big)\big]}{K}+b.$$
Since $\mathbb E\big[V(X[0])\big]<\infty$, we have
$$\limsup_{K\to\infty}\frac{1}{K}\sum_{k=0}^{K-1}\mathbb E\big[f\big(X[k]\big)\big]\le b.$$
The above inequality implies the boundedness of the time-averaged moments of the state, and thus the higher-order stability of the system states [18]. ∎
Proposition 1.
Under the constraints in Theorem 1, the chain induced by the (softmax) WSQ policy $\pi_w$ is ergodic and positive Harris recurrent.
Proof: To argue for the irreducibility of the chain, note that the empty state is accessible from any initial condition with positive probability. According to [19, Theorem 11.3.4], ergodicity and positive Harris recurrence then follow directly from Lemma 1. ∎
Proposition 2.
Under the constraints in Theorem 1, the step sizes satisfy $\sum_{k=0}^{\infty}\alpha[k]=\infty$ a.s.
Proof: Recall that $\alpha[k]$ is non-zero only when $c[k]\le B$, i.e., only when the state and the time interval (that is, the immediate cost) are not too large. Let $N_K$ denote the occupation time, up to the $K$-th transition, of the states for which the constraint can be satisfied; since the chain is positive Harris recurrent, we have $N_K\to\infty$ a.s. as $K\to\infty$. Since the arrival rate and the service rates are well-defined, the inter-transition times are not too large for a positive fraction of these epochs. Hence the constraint is satisfied infinitely often, so $\{\alpha[k]\}$ contains infinitely many elements of the divergent sequence $\{\beta[j]\}$ and
$$\sum_{k=0}^{\infty}\alpha[k]=\infty\quad\text{a.s.},$$
as desired. ∎
III-C Convergence of w[k]
In the following, we establish the convergence of $w[k]$ by showing (i) convergence under fixed-policy evaluation and (ii) that the difference among the optimal weight vectors induced by policy improvement is bounded.
Proposition 3.
(Lipschitz continuity of $\pi_w$) For our WSQ policy, there exists a finite constant $L_\pi$ such that, for all admissible $w_1,w_2$ and all state-action pairs visited when an update occurs,
$$\big|\pi_{w_1}(a|x)-\pi_{w_2}(a|x)\big|\le L_\pi\,\|w_1-w_2\|_2.$$
Proof: Note that the boundedness of the derivative of $\pi_w(a|x)$ with respect to $w$ implies the Lipschitz continuity. With the constructed step sizes, the features that enter the updates are bounded. Hence, for such state-action pairs, according to (2), the gradient of $\pi_w(a|x)$ with respect to $w$ can be expressed in terms of $\hat q$, its gradient, and the temperature $\tau$,
which is bounded. ∎
With Propositions 1-3 above, we can analyze the fixed-policy performance and bound the difference among distinct policies. For clarity, we use the subscript $w$ to denote the invariant steady-state dynamics of the chain under the fixed policy $\pi_w$, and the subscript $k$ to denote the actual dynamics at the $k$-th transition; that is, $d_w$ denotes the invariant state distribution under policy $\pi_w$, and $d_k$ denotes the state distribution at the $k$-th transition. Under the fixed policy $\pi_w$ and according to [20], the expected update direction can be written in the affine form $A_w w+b_w$, where $A_w$ is defined as
$$A_w=\mathbb E_{d_w}\Big[\phi[k]\big(\gamma\,\phi[k+1]-\phi[k]\big)^\top\Big],$$
which is negative definite since $\gamma<1$ and the expectation is taken under the invariant distribution of $\pi_w$. Analogously, we write $A_k$ and $b_k$ for the corresponding quantities under the state distribution at the $k$-th transition.
According to [19, Theorem 14.0.1], we have
$$\sum_{m=0}^{\infty}\big\|P_w^m(x,\cdot)-d_w\big\|\le C_1\big(V(x)+1\big), \qquad (12)$$
where $C_1$ is a finite positive constant and $P_w^m(x,\cdot)$ denotes the transition probability from state $x$ after $m$ steps under policy $\pi_w$. According to [15], when (12) holds, the iterative algorithm (7) under a fixed policy has a unique fixed point; with a slight abuse of notation, let $w_w^*$ denote the solution of
$$A_w w+b_w=0$$
in SGS under the fixed policy $\pi_w$.
Note that the boundedness established in Lemma 1 implies the boundedness of the quantities entering the updates. With the constructed step sizes, the features $\phi[k]$, the costs $c[k]$, and the fixed points $w_w^*$ that enter the analysis are bounded by finite positive constants.
Following [11] and considering the weight update equation, we define an auxiliary sequence that evolves according to the expected update under the policy frozen at each epoch. Since the approximator and the policy are both determined by $w$, the deviation between the actual update and the auxiliary update is linearly related to the weight error. Decomposing the update into (i) the expected update under the invariant distribution $d_w$, (ii) the mismatch between the actual state distribution $d_k$ and $d_w$, and (iii) a noise term, we analyze the three terms separately; the TD error and step sizes are as defined in Section II-C.
We are now ready to analyze the boundedness of each term. Note that the weights change only when $\alpha[k]>0$. With [11, Theorem 4.4, Lemma C.5, Lemma D.10] and the boundedness of the features and costs established above, the first two terms are bounded by finite positive constants that involve the Lipschitz constants of the state distribution and of the transition probability with respect to the weight vector. By leveraging [11, Lemma D.2] and supposing that the projection radius $W$ is sufficiently large, the noise term is bounded as well. Combining the three bounds yields a recursion for the expected weight error; iterating the recursion, and noting that the step sizes are constrained by (6), shows that the accumulated error remains bounded. Finally, considering the relationship between the policy and the weight vector (Proposition 3), and recalling that the weights are updated only when the immediate cost satisfies $c[k]\le B$, we conclude that
$$\limsup_{k\to\infty}\mathbb E\Big[\big\|w[k]-w^*\big\|_2\Big]\le \bar W,$$
which yields (8).
IV Experiments
To evaluate the performance of the semi-gradient SARSA (SGS) algorithm with weighted shortest queue (WSQ) policy, we consider two benchmarks:
• Neural network (NN)-WSQ: We construct a NN to approximate the action value function. The algorithm is similar to Algorithm 1, except that the weight update is replaced by an NN update with the adaptive moment estimation (Adam) algorithm [21]. Specifically, the NN has two fully connected layers with rectified linear unit (ReLU) activations. The loss function is the mean squared error between the predicted action value and the one-step target. Since an exact optimal policy of the original MDP is not readily available, the policy computed by the NN is used as an approximate optimal policy.
• Join the shortest queue (JSQ) policy: For routing decisions under the JSQ policy, we simply select the server with the shortest queue length, that is, $a=\arg\min_{i\in\mathcal N}X_i$.
Consider the network in Fig. 1 with three parallel servers, with service rates $\mu_1,\mu_2,\mu_3$ and arrival rate $\lambda$, all in sec$^{-1}$. The temperature parameter $\tau$ of the WSQ policy is fixed, and the SGS algorithm is initialized with weight vector $w[0]$. For simulation, a discrete time step of 0.1 sec is used. All experiments were implemented in Google Colab [22], using an Intel(R) Xeon(R) CPU with 12.7 GB of memory. We trained the SGS and NN policies and evaluated each of them over test episodes; the results are as follows.
In terms of performance, the SGS weights converge to a vector in which slower servers carry larger weights. Consequently, our WSQ policy is not restricted to non-idling behavior: since the weights differ across servers, the server with the higher service rate (i.e., server 3) is more likely to be selected, even when the slower server (i.e., server 1) is empty.
Algorithm | Normalized Average System Time |
---|---|
Neural network (NN) | 1.00 |
Semi-gradient SARSA (SGS) | 1.08 |
Join the shortest queue (JSQ) | 2.78 |
Table II lists the normalized average system time under the various methods. The results are generated from long test simulations. Although NN performs better after long training, as expected, SGS performs better with just a small number of iterations, as demonstrated in Fig. 2.
Jobs spend the most time traversing the queuing network under the JSQ policy. For WSQ, although the average system time achieved by SGS is slightly worse than that of NN (Table II), SGS gives the best trade-off between computational efficiency and achieved performance. More importantly, the SGS algorithm theoretically ensures convergence of the routing policy, whereas the NN may diverge and offers no guarantee of reaching an optimal decision.
V Conclusion
In this paper, we propose a semi-gradient SARSA(0) (SGS) algorithm with linear value function approximation for dynamic routing over parallel servers. We extend the analysis of SGS to an infinite state space and show that the convergence of the weight vector in SGS is coupled with the convergence of the traffic state, and that joint convergence is guaranteed if and only if the parallel service system is stabilizable. Specifically, the approximator is used as a Lyapunov function for the traffic state stability analysis, while the constraint on and convergence of the weight vector are analyzed via stochastic approximation theory. Moreover, our analysis can be extended to polynomial approximators of higher order. We compare the proposed SGS algorithm with a neural network-based algorithm and show that our algorithm converges faster, with higher computational efficiency and an insignificant optimality gap. Possible future work includes (i) extension of the joint convergence result and the SGS algorithm to general service networks and (ii) analysis of the fixed-point convergence condition.
References
- [1] J. G. Dai and M. Gluzman, “Queueing network controls via deep reinforcement learning,” Stochastic Systems, vol. 12, no. 1, pp. 30–67, 2022.
- [2] Q. Xie and L. Jin, “Stabilizing queuing networks with model data-independent control,” IEEE Transactions on Control of Network Systems, vol. 9, no. 3, pp. 1317–1326, 2022.
- [3] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
- [4] P. Kumar and S. P. Meyn, “Stability of queueing networks and scheduling policies,” IEEE Transactions on Automatic Control, vol. 40, no. 2, pp. 251–260, 1995.
- [5] J. G. Dai and S. P. Meyn, “Stability and convergence of moments for multiclass queueing networks via fluid limit models,” IEEE Transactions on Automatic Control, vol. 40, no. 11, pp. 1889–1904, 1995.
- [6] S. Bradtke and M. Duff, “Reinforcement learning methods for continuous-time Markov decision problems,” Advances in Neural Information Processing Systems, vol. 7, 1994.
- [7] B. Liu, Q. Xie, and E. Modiano, “RL-QN: A reinforcement learning framework for optimal control of queueing systems,” ACM Transactions on Modeling and Performance Evaluation of Computing Systems, vol. 7, no. 1, pp. 1–35, 2022.
- [8] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang, “Experience-driven networking: A deep reinforcement learning based approach,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 2018, pp. 1871–1879.
- [9] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, “QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach,” in 2016 IEEE International Conference on Services Computing (SCC). IEEE, 2016, pp. 25–33.
- [10] G. J. Gordon, “Reinforcement learning with function approximation converges to a region,” Advances in Neural Information Processing Systems, vol. 13, 2000.
- [11] S. Zhang, R. T. Des Combes, and R. Laroche, “On the convergence of sarsa with linear function approximation,” in International Conference on Machine Learning. PMLR, 2023, pp. 41 613–41 646.
- [12] D. P. De Farias and B. Van Roy, “On the existence of fixed points for approximate value iteration and temporal-difference learning,” Journal of Optimization Theory and Applications, vol. 105, pp. 589–608, 2000.
- [13] F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic programming for feedback control. John Wiley & Sons, 2013.
- [14] S. P. Meyn and R. L. Tweedie, “Stability of Markovian processes III: Foster-Lyapunov criteria for continuous-time processes,” Advances in Applied Probability, vol. 25, no. 3, pp. 518–548, 1993.
- [15] J. Tsitsiklis and B. Van Roy, “Analysis of temporal-difference learning with function approximation,” Advances in Neural Information Processing Systems, vol. 9, 1996.
- [16] R. G. Gallager, Stochastic Processes: Theory for Applications. Cambridge University Press, 2013.
- [17] L. Georgiadis, M. J. Neely, L. Tassiulas et al., “Resource allocation and cross-layer control in wireless networks,” Foundations and Trends® in Networking, vol. 1, no. 1, pp. 1–144, 2006.
- [18] S. Meyn, Control techniques for complex networks. Cambridge University Press, 2008.
- [19] S. P. Meyn and R. L. Tweedie, Markov chains and stochastic stability. Springer Science & Business Media, 2012.
- [20] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, “An analysis of reinforcement learning with function approximation,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664–671.
- [21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [22] “Google Colaboratory,” https://colab.research.google.com/, accessed: 2023-03-28.