
Learning Scalable Multi-Agent Coordination by Spatial Differentiation for Traffic Signal Control

Junjia Liu, Huimin Zhang, Zhuang Fu*, Yao Wang
The first two authors, Junjia Liu and Huimin Zhang, contributed equally to this paper. *Correspondence: [email protected] (Z. F.), Tel.: +86-138 1649 6926 (Z. F.)
State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University
National Engineering Laboratory for Automotive Electronic Control Technology, Shanghai Jiao Tong University
Shanghai, 200240, China
[email protected], [email protected], [email protected], [email protected]
Abstract

The intelligent control of traffic signals is critical to the optimization of transportation systems. To achieve globally optimal traffic efficiency in large-scale road networks, recent works have focused on coordination among intersections and have shown promising results. However, existing studies pay more attention to sharing observations among intersections (both explicitly and implicitly) and pay little attention to the consequences of decisions. In this paper, we design a multi-agent coordination framework based on Deep Reinforcement Learning for traffic signal control, defined as $\gamma$-Reward, which includes both the original $\gamma$-Reward and $\gamma$-Attention-Reward. Specifically, we propose the spatial differentiation method for coordination, which uses the temporal-spatial information in the replay buffer to amend the reward of each action. A concise theoretical analysis is given, proving that the proposed model converges to a Nash equilibrium. By extending the idea of the Markov Chain to the space-time dimension, this truly decentralized coordination mechanism replaces the graph attention method and realizes the decoupling of the road network, which is more scalable and more in line with practice. Simulation results show that the proposed model maintains state-of-the-art performance even without a centralized setting. Code is available at https://github.com/Skylark0924/Gamma_Reward.

Index Terms:
Multi-agent, coordination mechanism, $\gamma$-Reward, Deep Reinforcement Learning, spatial differentiation

I Introduction

Traffic congestion has been an increasingly critical matter for decades. It not only increases commuting time but also exacerbates noise and environmental pollution due to frequent acceleration and deceleration. According to relevant research, almost all collisions and delays in urban traffic are concentrated at intersections[1], and unreasonable signal control leads to a significant waste of traffic resources. Therefore, the key to solving urban congestion is to keep intersections clear.

Deep Reinforcement Learning (DRL) methods have been successfully applied to single-intersection traffic signal regulation and have shown better performance than traditional methods[2][3][4], such as Max-pressure[5]. Recent works have begun to apply DRL algorithms, especially multi-agent Reinforcement Learning (MARL)[6], to multi-intersection and even large-scale road networks. Different from the single-intersection problem, the intelligent regulation of large-scale road networks requires synergistic control among intersections, which can be regarded as a Multi-objective Optimization Problem (MOP) and a Markov/Stochastic Game (MG) with a cooperative setting[7]. In other words, multiple agents need to coordinate with each other: they need to keep their own intersections open and, at the same time, pay attention to the traffic flow of surrounding or even remote intersections, so that they can ultimately improve the efficiency of the overall road network. The latest research introduced the graph attention network (GAT) to let intersections implicitly share their observations of real-time traffic volume, and obtained inspiring results[8].

I-A Related work and Motivation

Existing traffic signal control (TSC) methods can be divided into two categories: rule-based methods and learning-based methods. The former transforms the problem into a rule-based optimization problem; the latter seeks control strategies from traffic flow data.

For rule-based methods, such as Webster[9], GreenWave[10] and Max-pressure[5], the traffic signal optimization problem is usually solved under assumptions such as a preset period or a fixed cycle-based phase sequence[4]. Webster is a widely used method for an isolated intersection. It assumes that the traffic flow is uniform during a certain period and constructs a closed-form equation to generate the optimal cycle length and phase split for a single intersection, minimizing the travel time of all vehicles passing the intersection. GreenWave is a classical method in the transportation field to implement coordination; it optimizes the offsets to reduce the number of stops for vehicles travelling along one specific direction. Max-pressure aims to reduce the risk of over-saturation by balancing queue lengths between adjacent intersections and minimizing the "pressure" of the phases of an intersection. However, the impractical assumptions in these methods might not lead to excellent performance.
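To make the pressure idea concrete, the following is a minimal sketch of a Max-pressure-style phase selection, assuming queue counts per entering and exiting lane are directly observable; the phase-to-movement mapping and the example numbers are hypothetical, not taken from the cited work.

```python
# Minimal sketch of the Max-pressure rule for one intersection.
# Assumptions: queue counts per lane are observable; the phase-to-movement
# mapping below is illustrative only.

def phase_pressure(phase_movements, queue_in, queue_out):
    """Pressure of a phase: sum over its movements of (upstream queue - downstream queue)."""
    return sum(queue_in[i] - queue_out[j] for i, j in phase_movements)

def max_pressure_phase(phases, queue_in, queue_out):
    """Return the index of the phase with the largest pressure."""
    pressures = [phase_pressure(m, queue_in, queue_out) for m in phases]
    return max(range(len(phases)), key=lambda k: pressures[k])

# Example: two hypothetical phases, each serving two (entering, exiting) lane pairs.
phases = [[(0, 4), (1, 5)], [(2, 6), (3, 7)]]
queue_in = [8, 3, 1, 0, 0, 0, 0, 0]   # vehicles queued on entering lanes
queue_out = [0, 0, 0, 0, 2, 1, 0, 0]  # vehicles on downstream exiting lanes
print(max_pressure_phase(phases, queue_in, queue_out))  # -> 0
```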

Recently, DRL, as a popular learning-based technique, has been proposed to control traffic signals due to its capability of online optimization without prior knowledge of the given environment. DRL has been successfully applied to single-intersection traffic signal regulation [3] by regarding the intersection as an agent. The results of various state-of-the-art DRL algorithms are compared in [11], showing that the Deep Q-Network algorithm is well suited to TSC tasks. However, the problem with multiple intersections remains an open frontier.

Existing MARL algorithms focus on collaboration among agents and can be divided into centralized and decentralized settings according to their information structures[7]. Independent reinforcement learning (IRL), as a fully decentralized setting, directly runs a DRL algorithm for each agent with neither explicit nor implicit information exchange. This method has been applied to the multi-intersection TSC problem[12][13]. However, the environment is shared in MARL and changes with the policy and state of every agent[14]. For any single agent, the environment is therefore dynamic and non-stationary, which leads to convergence problems. Tan et al.[15] compare IRL with Value Decomposition Networks (VDN) and illustrate the disadvantages of IRL. As a centralized method, VDN uses a central controller that integrates the value function of each agent to obtain a joint action-value function; the integration strategy is to add them directly. QMIX[16], as an extension of VDN, additionally uses state information, integrates the value functions in a nonlinear way, and obtains a stronger approximation ability than VDN. Both QMIX and VDN are typical centralized MARL algorithms with communication, and this joint-action idea has already been used in TSC[17].

Based on these centralized methods, recent TSC studies condense the global scope into a smaller neighborhood[8][18][19] and use graph convolutional networks (GCN) to achieve coordination. Colight[8] introduces the attention mechanism and realizes cooperation by implicitly integrating the observations within a neighborhood into a hidden state. However, as mentioned in Colight, the neighborhood scope is a constant, so the traffic information among intersections cannot be used to determine the range of the neighborhood. Moreover, due to the use of GCN and Multi-head Attention, these methods still need to gather information for centralized computation, which contradicts the intention of decentralization.

In fact, such a central controller does not exist in many applications, apart from those that easily have a global perspective, like video games[7]. Considering the demand for scalability in TSC, setting up a central controller is impractical. Therefore, we seek a compromise solution that is both convergent and scalable. This setting is referred to as a decentralized one with networked agents[7].

I-B Main contributions

In this paper, we first regard each intersection as a DRL agent and transform the TSC problem into a Markov Decision Process (MDP). Unlike existing work, this paper aims at improving scalability while ensuring state-of-the-art (SOTA) performance. To achieve this goal, we introduce a structural prior about road networks as an inductive bias and extend the Markov Chain theory to the temporal-spatial domain for coordination.

The change of future states and rewards from distant intersections is multiplied by the spatial discount rate $\gamma$ (note that it represents the temporal-spatial discount on multi-agent information, not the ordinary discount used in temporal difference; the two are distinguished in detail later) and taken into account when learning the current intersection's policy. This information is used as a penalty to correct the calculation of the current reward so that the agents gain the ability to collaborate. Since the traffic volume of each road varies, the influence of surrounding intersections may differ; therefore, an attention mechanism is introduced to correct the influence weights of the surroundings on the current intersection.

To summarize, our main contributions are as follows:

  • We propose a coordination framework, defined as $\gamma$-Reward, which communicates with adjacent intersections, and even further ones, in a scalable way by sharing future states and rewards, and achieves globally optimal control of the TSC problem.

  • Instead of Multi-head Attention, a spatial differentiation method is proposed to collect temporal-spatial information in a decentralized way and to amend the current reward by recursion.

  • We use attention only within the neighborhood to distinguish varying significance, and we update the attention score parameters in the spatial differentiation formula by imitating the idea of the target network.

  • Test results on various road networks show that the $\gamma$-Reward series, including the original $\gamma$-Reward and $\gamma$-Attention-Reward, maintains SOTA performance while achieving better scalability.

II Problem Formulation & Proposed MARL Method

In this section, we introduce the basic knowledge of the TSC problem and propose our MARL method.

II-A Preliminary & Formulation

  • Lane: A lane is part of a roadway designated to be used by a single line of vehicles [20]. There are two kinds of lanes: entering lanes and exiting lanes[21]. Each intersection consists of multiple lanes.

  • Phase: A phase is a combination of movement signals[4]. Figure 1(a) shows eight main directions of vehicles at the intersection. Note that the direction of turning right is usually ignored in these problems, since it can be executed at any time regardless of the traffic signal. The directions in the same phase do not conflict with each other, as shown in Figure 1(b). The phase is the unit of TSC, and only one phase can turn green at a time.

  • Neighbor intersection: The intersections that directly connect to the current intersection. In an ordinary road network, each intersection usually has at most four neighbors.

  • Waiting vehicle: If a vehicle on an entering lane has a speed lower than a threshold, we define it as a waiting vehicle, which means it is slowing down to wait for the red light (a minimal counting sketch is given after this list).
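As an illustration of the waiting-vehicle definition, the following minimal sketch counts waiting vehicles from lane speed readings; the speed threshold value is an assumption, not specified by the paper.

```python
# Count waiting vehicles on the entering lanes of one intersection.
WAIT_SPEED_THRESHOLD = 0.1  # m/s, assumed value

def count_waiting(entering_lane_speeds):
    """entering_lane_speeds: current speeds of vehicles on the entering lanes."""
    return sum(1 for v in entering_lane_speeds if v < WAIT_SPEED_THRESHOLD)

print(count_waiting([0.0, 0.05, 3.2, 8.7]))  # -> 2
```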

Figure 1: The foundation of the traffic problem. (a) Eight main directions in a single intersection; (b) eight kinds of primary phases. The phase is the fundamental unit of the TSC problem, and the two directions in each phase never conflict.

By using DRL, we regard the TSC problem as an MDP. An individual DRL agent controls one intersection. Each agent observes part of the environmental state $\mathcal{O}$ and chooses actions $\mathcal{A}$ according to these observations to determine which phase of the intersection turns green. The effect of the control is fed back from the environment in the form of a reward $\mathcal{R}$. The goal of the DRL agent is to maximize the reward function by continuously exploring and exploiting based on constant interaction with the environment. In this paper, the problem requires reducing the queue length $q_l$ or the travel time $T_w$ in the road network. To make the problem more suitable for DRL, we first abstract it into the tuple $\langle\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\pi,\gamma'\rangle$:

Figure 2: The TSC problem is regarded as an MDP. Each intersection is controlled by a unique agent that implements a DRL algorithm and learns an optimal action-decision strategy.
  • Observation $o_t^i$: $\boldsymbol{o}_t^i=(o_1^i,\ldots,o_t^i)$, where $\boldsymbol{o}_t^i\in\boldsymbol{\mathcal{O}}_t^i$. Every agent observes the length of the vehicle queue on the entering lanes of its intersection. Moreover, to cater to the design of the proposed $\gamma$-Reward algorithm, we also observe the number of vehicles on the exiting lanes that connect to neighbor intersections.

  • Action $a_t^i$: $\boldsymbol{a}_t^i=(a_1^i,\ldots,a_t^i)$, where $\boldsymbol{a}_t^i\in\boldsymbol{\mathcal{A}}_t^i$. The action is simply the serial number of the phase chosen to turn green.

  • Transition probability $\mathcal{P}$: $\mathcal{P}(o_{t+1}^i|o_t^i,a_t^i)$ describes the probability of moving from state $o_t^i$ to the next potential state $o_{t+1}^i$.

  • Reward $r_t^i$: After executing each action $a_t^i$, we receive return information that judges whether $a_t^i$ is good enough for $o_t^i$. We use the number of waiting vehicles on the entering lanes as the raw reward. For the amendatory reward, we use $R_t^i$ as the representation.

  • Policy $\pi$: The policy is what the agents learn in DRL. It represents the goal of reducing travel time and increasing average speed. For a single agent, $\pi^i:\mathcal{O}_t^i\mapsto\mathcal{A}_t^i$.

  • Discount rate $\gamma'$: This factor has the common meaning used in Temporal-Difference learning (see Appendix A.1). To avoid confusion with $\gamma$-Reward, we use $\gamma'$ here to replace the original symbol $\gamma$.

By using the Bellman equation, the relationship among these parameters can be formulated, and we can obtain the optimal phases after iterating the following equations:

Q(o_t^i, a_t^i) = Q(o_t^i, a_t^i) + \alpha\left(r_t^i + \gamma'\max_{a_{t+1}^i} Q(o_{t+1}^i, a_{t+1}^i) - Q(o_t^i, a_t^i)\right) \qquad (1)
a_t^i = \arg\max Q(o_t^i, a_t^i)
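The update in Equation (1), written per agent, can be sketched as follows. This is a tabular toy version for readability (the agents in this paper actually use a D3QN function approximator), and all names are illustrative.

```python
# Tabular sketch of Equation (1) for a single intersection agent.
# The paper's agents replace the table with a D3QN network.
from collections import defaultdict

ALPHA, GAMMA_PRIME = 0.1, 0.95        # learning rate and temporal discount gamma'
Q = defaultdict(float)                 # Q[(observation, action)], observations must be hashable

def td_update(o_t, a_t, r_t, o_next, actions):
    """One temporal-difference step of Equation (1)."""
    best_next = max(Q[(o_next, a)] for a in actions)
    Q[(o_t, a_t)] += ALPHA * (r_t + GAMMA_PRIME * best_next - Q[(o_t, a_t)])

def greedy_phase(o_t, actions):
    """a_t = argmax Q(o_t, a): the phase chosen to turn green."""
    return max(actions, key=lambda a: Q[(o_t, a)])
```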

II-B Proposed Spatial Differentiation Function

Coordination among agents plays a critical role in MARL algorithms, whether centralized or decentralized. In this paper, we propose a coordination mechanism among distributed agents. Each agent is based on the Dueling Double Deep Q-Network (D3QN)[22][23][24], one of the best Q-value-based models to date[25]; a detailed description can be found in Appendix A. Some studies already use D3QN directly on the TSC problem [26], but there it can only serve as an independent Q-learning (IQL) method in the multi-intersection setting. For the TSC problem, the idea of distributed agents is wise, and what we need is an appropriate decentralized coordination mechanism. Therefore, we propose the $\gamma$-Reward framework for coordination (Figure 3).
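For reference, below is a minimal sketch of the dueling head that D3QN builds on (the "Double" part concerns how targets are selected during training); the layer sizes and names are assumptions, not the authors' architecture.

```python
# Dueling Q-network head: Q(o, a) = V(o) + A(o, a) - mean_a A(o, a).
# Layer sizes are illustrative assumptions.
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_phases, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(o)
        self.advantage = nn.Linear(hidden, n_phases)   # advantage A(o, a) per phase

    def forward(self, obs):
        x = self.body(obs)
        v, a = self.value(x), self.advantage(x)
        return v + a - a.mean(dim=-1, keepdim=True)    # Q values over phases
```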

Figure 3: Left: MARL algorithms with a centralized setting. Right: the $\gamma$-Reward framework, a decentralized setting with networked agents. $\gamma$-Reward adds a coordination mechanism on top of D3QN.

The main idea of this article is inspired by the basic theory of DRL. For the TSC problem, not only should the temporal decision be regarded as an MDP, but the road network itself also resembles a Markov Chain, since the decision of the current intersection will affect other intersections in the future. Hence we should consider not only the TD-error (see Appendix A.1) in time, but also a "TD-error" in space, which we define as the spatial differentiation.

Figure 4: Diagram of spatial differentiation.

Figure 4 shows a diagram of a multi-intersection road network with $3\times 3$ intersections. Since the decision of intersection $I_{22}$ at time $t$ will affect the next intersection $I_{23}$ at time $t+n$, the result of intersection $I_{22}$ at time $t$ should be affected not only by the current intersection reward $r_t^{I_{22}}$, but also by the rewards $r_{t+n}^j$ of the surrounding intersections at time $t+n$, where $j\in\{I_{12},I_{21},I_{23},I_{32}\}$. The formula of spatial differentiation is as follows:

R^{I_{22}}_{t} = r^{I_{22}}_{t}\cdot\left\{1+\gamma\cdot\tanh\left[\sum_{j\in\mathcal{N}_{i}}\left(\frac{R^{j}_{t+n}}{r^{j}_{t}}-c\right)\right]\right\} \qquad (2)

where $\mathcal{N}_{I_{22}}=\{I_{12},I_{21},I_{23},I_{32}\}$ denotes the four intersections around intersection $I_{22}$. The parameter $n$ is called the delay span; it represents the time span for most vehicles to reach the next intersection after the current action $a_t$, and it is related to the length of the road, the average velocity on the exiting lane, and the sample interval. The whole formula can also be treated as a value function of the reward, so it is a spatial differentiation of the reward-value function.

In Equation 2, $R^j_{t+n}/r^j_t$ shows the change of traffic capacity at intersection $j$ between time $t$ and $t+n$. If $R^j_{t+n}/r^j_t$ is greater than the threshold $c$, the traffic capacity of intersection $j$ has deteriorated; in other words, the decision of intersection $I_{22}$ at time $t$ makes the adjacent intersection $j$ more congested. Conversely, $R^j_{t+n}/r^j_t$ smaller than $c$ indicates that the capacity of intersection $j$ has improved and the decision of intersection $I_{22}$ at time $t$ is good for $j$. Through some range transformations, the differential item serves as a penalty; it is finally multiplied by a spatial discount factor $\gamma$ as an amendment to the reward. Since the training goal of DRL is to maximize the reward function, the penalty item forces the agent to pay attention to the situation around the intersection while improving its own policy.
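A minimal sketch of Equation (2) follows, assuming the neighbors' delayed rewards $R^j_{t+n}$ and raw rewards $r^j_t$ are available (e.g., read back from the replay buffer) and nonzero; the toy values, their sign convention, and all names are assumptions for illustration.

```python
# Sketch of the spatial differentiation amendment in Equation (2).
import math

def spatial_differentiation(r_t_i, neighbor_R_future, neighbor_r_now,
                            gamma=0.5, c=0.8):
    """Amend agent i's raw reward r_t^i with its neighbors' delayed rewards.

    neighbor_R_future[j] is R_{t+n}^j (the neighbor's reward after the delay
    span n); neighbor_r_now[j] is its raw reward r_t^j at time t (nonzero).
    """
    penalty = sum(neighbor_R_future[j] / neighbor_r_now[j] - c
                  for j in neighbor_R_future)
    return r_t_i * (1.0 + gamma * math.tanh(penalty))

# Toy call with two hypothetical neighbors (illustrative values only).
R_t = spatial_differentiation(r_t_i=-4.0,
                              neighbor_R_future={"I12": -5.0, "I21": -3.0},
                              neighbor_r_now={"I12": -4.0, "I21": -3.0})
```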

This equation is also in line with the idea of MARL with networked agents, which dates back to Varshavskaya et al. [27]. $\mathcal{QD}$-learning[28] further gives a convergence proof for this kind of MARL algorithm. It incorporates the idea of consensus + innovation into the standard Q-learning algorithm and has the following Q-value update equation:

Q(o_t^i, a_t^i) \leftarrow Q(o_t^i, a_t^i) + \alpha_{t,o,a}\left[r_t^i + \gamma'\max_{a_{t+1}^i\in\mathcal{A}} Q(o_{t+1}^i, a_{t+1}^i) - Q(o_t^i, a_t^i)\right] - \beta_{t,o,a}\sum_{j\in\mathcal{N}_t^i}\left[Q(o_t^i, a_t^i) - Q(o_t^j, a_t^j)\right] \qquad (3)

where the term after $\alpha_{t,o,a}$ denotes a local innovation potential that incorporates newly obtained observations, and the term after $\beta_{t,o,a}$ is a consensus potential (agent collaboration) that captures the difference between the agent's Q-value estimates and those of its neighbors.

Analogously, our Q-value update equation can be changed as follows:

Q(o_t^i, a_t^i) = Q(o_t^i, a_t^i) + \alpha\left\{r_t^i + \gamma\cdot r_t^i\cdot\tanh\left[\sum_{j\in\mathcal{N}_i}\left(\frac{R^j_{t+n}}{r^j_t}-c\right)\right] + \gamma'\max_{a_{t+1}^i\in\mathcal{A}} Q(o_{t+1}^i, a_{t+1}^i) - Q(o_t^i, a_t^i)\right\} \qquad (4)

The spatial differentiation term plays the role of the consensus term in $\mathcal{QD}$-learning, and we further construct the communication in the temporal domain and the spatial domain at the same time. Although Equation 2 uses future information, which sounds contrary to causality, it is achievable in the implementation. Since we use Q-learning as the basic model, which is off-policy, the trajectory of state-action pairs and the obtained rewards are saved into a replay buffer. In this way, the raw reward $r_t^i$ is saved first and then corrected after $n$ steps, so the calculation can be realized (see Appendix B for the pseudocode).
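A possible sketch of this amend-after-$n$-steps procedure is given below; the buffer layout and all names are assumptions rather than the authors' implementation (Appendix B gives their pseudocode).

```python
# Sketch: the raw reward is stored first and corrected once the neighbors'
# rewards at step t + n become available. Layout and names are illustrative.
import math

class AmendableReplayBuffer:
    def __init__(self, delay_span=10, gamma=0.5, c=0.8):
        self.storage = []                      # dicts: o, a, r_raw, r, o_next
        self.n, self.gamma, self.c = delay_span, gamma, c

    def add(self, o, a, r_raw, o_next):
        self.storage.append({"o": o, "a": a, "r_raw": r_raw,
                             "r": r_raw, "o_next": o_next})

    def amend(self, t, neighbor_buffers):
        """Rewrite the reward stored at step t using neighbor data at t + n."""
        if t + self.n >= min(len(b.storage) for b in neighbor_buffers):
            return                              # the future step is not collected yet
        penalty = sum(b.storage[t + self.n]["r"] / b.storage[t]["r_raw"] - self.c
                      for b in neighbor_buffers)
        self.storage[t]["r"] = self.storage[t]["r_raw"] * \
            (1.0 + self.gamma * math.tanh(penalty))
```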

II-C Attention Mechanism for $\gamma$-Reward

Figure 5: Framework of the proposed $\gamma$-Attention-Reward model. In the inner cycle, the eval Attention layers and eval Q network (blue) evaluate the real-time Q value for control. In the external cycle, the target system (orange) predicts the long-term impact to improve the Q network and is updated periodically. On the right, we illustrate the message mechanism of the original $\gamma$-Reward with the example of an $Arterial_{1\times 5}$ road network.

The spatial differentiation formula proposed in the previous section assumes that every intersection in the road network is in the same situation; in other words, their levels are equal. In reality, the levels of intersections in the road network differ: some have only one or two lanes, others may have four to six. Imagine that intersection $I_{22}$ in Figure 4 is a two-way road with eight lanes, intersection $I_{23}$ is the same as $I_{22}$, while intersection $I_{32}$ is a two-way road with only two lanes. The effect of the decision of intersection $I_{22}$ on $I_{22}\Rightarrow I_{23}$ and on $I_{22}\Rightarrow I_{32}$ is inevitably different: the discharge of intersection $I_{22}$ can easily cause excessive congestion at intersection $I_{32}$, but may not be serious for intersection $I_{23}$. In addition to the number of lanes, the length between intersections differs, which means that the maximum congestion length each intersection can accept also differs. In summary, when the spatial differentiation formula corrects an intersection, different neighboring intersections should carry different influence weights.

The problems mentioned above can indeed be addressed by setting different thresholds to obtain weights, as in [29]. However, the actual road network situation is very complicated, and there is no way to take every parameter into consideration. Therefore, we learn the rules from the traffic data, and in this respect the attention mechanism gives us a good solution.

The attention mechanism [30][31][32] was first proposed to solve the "seq2seq" translation problem in NLP. Attention can be interpreted broadly as a vector of importance weights: to predict an element, such as a word in a sentence, the attention vector estimates how strongly it is related to other elements, and the weighted sum of their values serves as an approximation of the target. It breaks the limits of Euclidean distance between data, captures long-range dependencies in sentences, and provides smoother translations.

In addition to sequence data, attention can also be used for other types of problems. In the graph world, GCN [33] shows that combining local graph structures with node features can achieve good performance in node classification tasks. However, the way GCN combines the features of neighboring nodes is closely tied to the structure of the graph, which limits the generalization ability of the trained model to other graph structures. GAT [34] instead proposes a weighted summation of neighboring node features using the attention mechanism, so the weights depend entirely on the node features and are independent of the graph structure.

Recent research has begun to introduce the idea of attention into MARL algorithms. A Multi-Agent Actor-Critic (MAAC) algorithm combining the attention mechanism has been proposed[35]. MAAC encodes the states of the surrounding agents and obtains their contribution values to the current agent through the attention network; together with the $(o,a)$ pair of the current agent as input, the Q value is obtained through an MLP. While the Q network is updated backward, the attention network is updated as well, and the attention scores of the surrounding agents for the current agent are corrected. Colight applies the attention mechanism to the TSC problem of large-scale road networks: it encodes the state and obtains the Q value directly through a Multi-head Attention network.

II-C1 Attention

Every agent interacts with its environment and obtains its observation in real time. We first embed the observation from the environment by applying a layer of Multi-Layer Perceptron (MLP):

z_i = W_l o_i + b \qquad (5)

To obtain the weight of intersection $i$ with respect to the adjacent intersection $j$, we combine their hidden variables $z_i, z_j$ with the following dot product:

e_{ij} = z_j^T W_k^T W_q z_i \qquad (6)

$e_{ij}$ represents the influence of the adjacent intersection $j$ on the current intersection $i$; note that the influences between them are not necessarily equal. We then normalize them with the softmax function:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})} \qquad (7)

The impact value $v_{ij}$ can be calculated as $v_{ij}=\alpha_{ij}z_j$, which is the value that $i$ needs to consider from $j$. Finally, summing these values and passing the result through the ReLU activation function, the final characterization of intersection $i$ is obtained:

h_i = \sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}z_j\right) \qquad (8)
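Equations (5)-(8) can be sketched in plain NumPy as below; the dimensions, the random initialization, and all names are assumptions for illustration.

```python
# Sketch of Eqs. (5)-(8) for one intersection i and its neighbors.
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_hidden = 12, 32                                    # assumed sizes
W_l = rng.normal(scale=0.1, size=(d_hidden, d_obs))
b = np.zeros(d_hidden)
W_q = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_k = rng.normal(scale=0.1, size=(d_hidden, d_hidden))

def embed(o):
    return W_l @ o + b                                      # Eq. (5)

def attention(o_i, neighbor_obs):
    z_i = embed(o_i)
    z_js = [embed(o_j) for o_j in neighbor_obs]
    e = np.array([z_j @ W_k.T @ W_q @ z_i for z_j in z_js]) # Eq. (6)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                    # Eq. (7), softmax
    h_i = np.maximum(0.0, sum(a * z_j for a, z_j in zip(alpha, z_js)))  # Eq. (8), ReLU
    return alpha, h_i

alpha, h_i = attention(rng.normal(size=d_obs),
                       [rng.normal(size=d_obs) for _ in range(4)])
```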

II-C2 $\gamma$-Attention-Reward

We use the attention mechanism to make the spatial differentiation function more complete and interpretable by adding an attention score before the sum operation:

R^i_t = r^i_t\cdot\left\{1+\gamma\cdot\tanh\left[\sum_{j\in\mathcal{N}_i}\hat{\alpha}_{ji}\left(\frac{R^j_{t+n}}{r^j_t}-c\right)\right]\right\} \qquad (9)

The attention score is updated together with the policies:

\mathcal{J}(\pi_\theta^i) = \left(R^i + \gamma'\max\hat{Q}(o^i, a^i; \hat{\theta}^i, \boldsymbol{\hat{\alpha}}^i) - Q(o^i, a^i; \theta^i, \boldsymbol{\alpha}^i)\right)^2 \qquad (10)

It is worth emphasizing that $\alpha_{ij}$ represents the importance of $j$ to $i$, but for $\gamma$-Reward we seek the influence of $i$ on $j$. So in Equations 9 and 10, we need to use $\alpha_{ji}$ to amend the rewards of agent $i$.

Since the attention score $\alpha_{ji}$ is a value updated in real time, this paper uses it as an evaluation metric to dynamically assess the impact of surrounding intersections based on dynamic traffic data. The reward is also an evaluation indicator: timely feedback obtained after performing an action in a particular state, used to evaluate the quality of the state-action pair. If we introduced $\alpha_{ij}$ into the calculation of the reward value and updated it in real time, a problem would be bound to arise: the Attention layer would update in a biased direction, reducing the weights of important neighbor intersections and increasing those of intersections with less traffic flow merely to raise the reward value. In this way, introducing the attention score would be even worse than the original $\gamma$-Reward. To solve this problem, we follow the Q-learning algorithm and use the off-policy idea: the target attention scores $\hat{\alpha}_{ij}$ used in the reward calculation come from target Attention layers, whose parameters are updated together with the target Q parameters $\hat{\theta}$. The whole framework of the proposed $\gamma$-Attention-Reward model is shown in Figure 5.
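A sketch of this eval/target split follows, assuming the networks are torch.nn.Module instances and a periodic hard sync; the class, the method names, and the sync interval are all illustrative assumptions.

```python
# Sketch: Eq. (9) reads frozen "target" attention scores, which are only
# refreshed together with the target Q-network. Names and interval are assumed.
TARGET_UPDATE_PERIOD = 200     # training steps between hard syncs (assumed)

class GammaAttentionAgent:
    def __init__(self, eval_attention, eval_q, target_attention, target_q):
        self.eval_attention, self.eval_q = eval_attention, eval_q
        self.target_attention, self.target_q = target_attention, target_q
        self.step = 0

    def scores_for_reward(self, obs_i, neighbor_obs):
        # \hat{alpha}: frozen between syncs, so the reward amendment cannot
        # bias the Attention layers toward artificially inflated rewards.
        alpha_hat, _ = self.target_attention(obs_i, neighbor_obs)
        return alpha_hat

    def maybe_sync_targets(self):
        # Hard-copy eval parameters into both target networks periodically
        # (assuming torch.nn.Module instances).
        self.step += 1
        if self.step % TARGET_UPDATE_PERIOD == 0:
            self.target_attention.load_state_dict(self.eval_attention.state_dict())
            self.target_q.load_state_dict(self.eval_q.state_dict())
```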

In Colight, a section is devoted to the selection of the hyper-parameter neighbor scope. Experiments there show that the larger $|\mathcal{N}_i|$ is, the better the performance, but when it is greater than 5, learning takes more time. This is because all intersections within the scope $|\mathcal{N}_i|$ must aggregate their observations into one agent, while the total number of agents still equals the number of intersections, which inevitably increases the amount of calculation. Unlike Colight, $\gamma$-Reward does not need to tune the neighbor number: we simply treat it as the constant 5, which contains the current intersection and the four intersections directly connected to it. Note that we do not need to employ Multi-head Attention, which leads to centralization; for the information of further intersections, the recursion in the $\gamma$-Reward formula does the work.

From the original $\gamma$-Reward part of Figure 5, we can figure out its principle. The rows of the figure represent the state information collected in time series, and the columns represent the intersections. To prevent too much confusion, only the $\gamma$-Reward process of intersection $I_{13}$ is shown as a demonstration, and the interval of the scale variable $n$ is also not shown. The value of $R_{13}^t$ is related to $R_{12}^{t+n}$ and $R_{14}^{t+n}$; by analogy, these two rewards are related to $R_{11}^{t+2n}$ and $R_{15}^{t+2n}$, respectively. Note that only a two-dimensional replay buffer is drawn here; a real intersection has four or more surrounding intersections, so it should be a three-dimensional gradient.

III Theoretical Analysis in Game Theory

In this section, we establish theoretical results for the proposed algorithms. The key challenge of MARL algorithms is that the decision process may be non-stationary[36][37]. Since an agent is not aware of the actions of other agents and lacks communication, the transition probabilities $\mathcal{P}(o_{t+1}^i|o_t^i\in\mathcal{O}_t^i, a_t^i\in\mathcal{A}_t^i)$ are not stationary and change as the other agents change. So we first need to demonstrate that the decision process is stationary under the proposed algorithms. Specifically, MARL can be regarded as a game model[38], and we prove that the proposed algorithms can converge to a Nash equilibrium. The argument starts with the following definitions.

Definition III.1

A decentralized MARL decision process is stationary (or homogeneous) iff, for each agent $i$ and all $p,q\in\mathbb{N}$, $o^i\in\mathcal{O}^i$, $\boldsymbol{a}^i\in\mathcal{A}^i$[36],

\sum_{\boldsymbol{a}_p^{-i}\in\boldsymbol{A}_p^{-i}}\mathcal{P}_i\left(o_{p+1}^i|o_p^i,\langle a_p^i,\boldsymbol{a}_p^{-i}\rangle\right) = \sum_{\boldsymbol{a}_q^{-i}\in\boldsymbol{A}_q^{-i}}\mathcal{P}_i\left(o_{q+1}^i|o_q^i,\langle a_q^i,\boldsymbol{a}_q^{-i}\rangle\right) \qquad (11)

and $\mathcal{P}$ for the global state $s\in\mathcal{S}$ must be stationary as well:

\sum_{\boldsymbol{a}_p^{-i}\in\boldsymbol{A}_p^{-i}}\mathcal{P}\left(s_{p+1}^i|s_p^i,\langle a_p^i,\boldsymbol{a}_p^{-i}\rangle\right) = \sum_{\boldsymbol{a}_q^{-i}\in\boldsymbol{A}_q^{-i}}\mathcal{P}\left(s_{q+1}^i|s_q^i,\langle a_q^i,\boldsymbol{a}_q^{-i}\rangle\right) \qquad (12)

where $\boldsymbol{a}^{-i}=\boldsymbol{a}\backslash\{a^i\}$.

Based on the definition of a stationary MDP, we can define the optimal global reward, which is needed to prove the convergence of the proposed methods, by extending the definition of [39].

Definition III.2

For a stationary MDP, the global optimal reward can be decomposed into a sum of local optimal rewards, one for each reward function $f_i\in\mathcal{F}$:

\rho^* = \sum_{i=1}^m \rho_i^* \qquad (13)

Proof:

For a given stationary MDP, there must exist a stationary $\mathcal{P}_i^{\pi_*^i}$, where $\pi_*^i$ is the optimal policy of agent $i$:

\rho_i^{\pi_*^i} = \sum_{o^i\in\mathcal{O}^i}\mathcal{P}_i^{\pi_*^i}(o^i)\, f_i\left(o^i, a^i \,|\, a^i = \pi_*^i(o^i)\right) \qquad (14)

Definition III.3

In a stochastic game, a Nash equilibrium point is a tuple of $m$ strategies $(\pi_*^1,\ldots,\pi_*^m)$ such that for all global states $s\in\mathcal{S}$ and $i=1,\dots,m$ [40],

\nu^i(s|\pi_*^1,\ldots,\pi_*^m) \leq \nu^i(s|\pi_*^1,\ldots,\pi^i,\pi_*^{i+1},\ldots,\pi_*^m) \qquad (15)

for all $\pi^i\in\Pi^i$, where $\Pi^i$ is the policy set of agent $i$ among the total $m$ agents.

III-A Stationarity of the $\gamma$-Reward series

First, we give the proof of stationarity: unless the MDP is stationary, the model cannot guarantee convergence to the optimum.

Assumption III.1

The original reward function can be represented as

r_t^i = f(\boldsymbol{o}_t^i, \boldsymbol{a}_t^i) \qquad (16)

where $\boldsymbol{o}^i,\boldsymbol{a}^i$ include the state-action pairs from time step $1$ to $t$. Assume that the $\gamma$-Reward series is a special reward function $f(o,a)$:

R_t^i = f\left(\langle\boldsymbol{o}_t^i,\boldsymbol{o}_k^{-i}\rangle, \langle\boldsymbol{a}_t^i,\boldsymbol{a}_k^{-i}\rangle\right) \qquad (17)

where $k\in[1,\mathbb{N}]$.

Assumption III.2

As a continuing task without a definite ending, the expected reward $\mathcal{G}_t^i$ of the TSC problem is defined as follows, with discount rate $\gamma'$:

\mathcal{G}_t \doteq r_{t+1} + \gamma' r_{t+2} + (\gamma')^2 r_{t+3} + \cdots = \sum_{k=0}^{\mathbb{N}-t}(\gamma')^k r_{t+k+1} \qquad (18)

Assumption III.3

The Q function is based on the trajectory of the expected return:

Q(o,a|\pi) = \sum\mathcal{P}(path_{o,a}|\pi)\cdot\mathcal{G}(path_{o,a}) \qquad (19)
a_t = \operatorname{argmax}\, Q_{\max}(o,a|\pi) \qquad (20)
Theorem III.1

With $\gamma$-Reward as the coordination mechanism, the decision process $(\mathcal{RP})$ of the distributed DQN algorithm is stationary.

Proof:

According to Assumption III.2 and Assumption III.3, the Q value function with $\gamma$-Reward, $(\mathcal{RQ})$, can be written as follows:

(\mathcal{RQ})_t^i(o,a|\pi) = \sum_{t=0}^{\mathbb{N}}(\mathcal{RP})_t^i(path_{o^i,a^i}|\pi)\cdot(\mathcal{RG})_i(path_{o^i,a^i}) = Q\left(\langle\boldsymbol{o}_t^i,\boldsymbol{o}_k^{-i}\rangle, \langle\boldsymbol{a}_t^i,\boldsymbol{a}_k^{-i}\rangle \,|\, \pi\right) \qquad (21)

The calculation of $(\mathcal{RQ})$ is related to the expected reward path. $(\mathcal{RG})_t^i$ is a discounted sum of the amendatory reward $R_t^i$, which is bound up not only with $(o^i,a^i)$ but also with $(o^{-i},a^{-i})$ from the other agents. Since $(\mathcal{RG})_t^i$ records the future trajectory of the amendatory reward from time step $t$ to the end of the episode, it must contain the previous and posterior state-action pairs as a vector, as in Equation 17 of Assumption III.1. Using these two properties, we can decompose $(\mathcal{RP})$ with Equation 20:

(\mathcal{RP})_t^i(o_{t+1}^i|o_t^i, a_t^i) = \mathcal{P}_t^i\left(o_{t+1}^i|o_t^i, \operatorname{argmax}\,(\mathcal{RQ})_{t\max}^i(\boldsymbol{o},\boldsymbol{a})\right) = \mathcal{P}_t^i\left(o_{t+1}^i|o_t^i, \langle a_t^i, \boldsymbol{a}_k^{-i}\rangle\right) \qquad (22)

where $k\in[1,\mathbb{N}]$. Obviously, $(\mathcal{RP})$ satisfies the property in Definition III.1. Thus, the process with the $\gamma$-Reward series is a stationary process.

III-B Convergence of the $\gamma$-Reward series

Theorem III.2

The spatial differentiation formula in the original $\gamma$-Reward leads the reward value function to converge to the local optimal reward.

Proof:

As mentioned in Section II-B, our Q-value function shares a similar structure with $\mathcal{QD}$-learning, so the proof is inspired by the convergence analysis of $\mathcal{QD}$-learning. Since this article is not mainly about multi-agent theory, we simply transform $\gamma$-Reward into the form used in the $\mathcal{QD}$-learning convergence proof; by following the rest of the proof process in $\mathcal{QD}$-learning, we obtain the conclusion of convergence. We highly recommend that readers consult the whole analysis in $\mathcal{QD}$-learning.

According to Spectral Graph Theory, the multi-agent communication network is simplified to an undirected graph $G=(V,E)$, where $V$ represents the intersections and $E$ denotes the communication links. From the adjacency matrix $A$ ($A_{ij}=1$ if $(i,j)\in E$, $A_{ij}=0$ otherwise) and the degree matrix $D=\text{diag}\{d_1,\dots,d_N\}$ ($d_i=|\mathcal{N}_i|$), we obtain the positive semidefinite graph Laplacian matrix $L=D-A$, with eigenvalues ordered as $0=\lambda_1(L)\leq\lambda_2(L)\leq\cdots\leq\lambda_N(L)$. Letting $\beta=-\alpha\gamma$ and noting that $\beta\tanh[\sum_{j\in\mathcal{N}_i}(\frac{R^j_{t+n}}{r^j_t}-c)]=-\beta\tanh[\sum_{j\in\mathcal{N}_i}(c-\frac{R^j_{t+n}}{r^j_t})]$, we have

\boldsymbol{Q}_{\boldsymbol{o},\boldsymbol{a}}(t+1) = (I_N - \beta L_t + \alpha I_N)\boldsymbol{r}_t + \alpha\left(\mathcal{C}_{\boldsymbol{o},\boldsymbol{a}}(\boldsymbol{Q}_t) - \boldsymbol{Q}_{\boldsymbol{o},\boldsymbol{a}}(t) + \boldsymbol{\nu}_t\right) \qquad (23)

where $\boldsymbol{Q}_{\boldsymbol{o},\boldsymbol{a}}=[Q_{o,a}^1(t),\dots,Q_{o,a}^N(t)]^T$ and $\mathcal{C}_{\boldsymbol{o},\boldsymbol{a}}(\boldsymbol{Q}_t)=[\mathcal{C}_{o,a}^1(Q_t^1),\dots,\mathcal{C}_{o,a}^N(Q_t^N)]^T$. The local $\gamma$-Reward operator $\mathcal{C}^i$ of agent $i$ is

\mathcal{C}_{o,a}^i(Q) = \mathbb{E}(r_t^i) + \gamma'\sum_{j\in\mathcal{N}}p_{i,j}^a\max_{v\in\mathcal{A}}Q_{j,v} \qquad (24)

The residual $\nu_t^i$ for each agent $i$ in Equation 23 is

\nu_t^i(o,a) = \left(1-\frac{1}{\alpha}\right)r_t^i + \gamma'\max_{v\in\mathcal{A}}Q_{o_{t+1},v}^i(t) - \mathcal{C}_{o,a}^i(Q) \qquad (25)

By following the proof in Section V of $\mathcal{QD}$-learning, we obtain the same main result: for each agent $i$,

\mathbb{P}\left(\lim_{t\rightarrow\infty}\boldsymbol{Q}_t^i = \boldsymbol{Q}^*\right) = 1 \qquad (26)

We conclude that the spatial differentiation formula leads $\gamma$-Reward to converge to the local optimum. Obviously, $\gamma$-Attention-Reward has the same property.
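As a small illustration of the graph quantities used above, the following sketch builds $A$, $D$, and $L=D-A$ for a $3\times 3$ grid road network and checks that the smallest Laplacian eigenvalue is zero; the grid construction is an assumption for illustration.

```python
# Adjacency A, degree D, Laplacian L = D - A for a 3x3 grid road network,
# and its ordered eigenvalues (the smallest one is ~0).
import numpy as np

def grid_adjacency(rows, cols):
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                A[i, i + 1] = A[i + 1, i] = 1          # east-west link
            if r + 1 < rows:
                A[i, i + cols] = A[i + cols, i] = 1    # north-south link
    return A

A = grid_adjacency(3, 3)
D = np.diag(A.sum(axis=1))           # d_i = |N_i|
L = D - A
eigvals = np.sort(np.linalg.eigvalsh(L))
print(eigvals[:2])                    # the first eigenvalue is ~0
```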

III-C Optimality of the $\gamma$-Reward model

For a stochastic game with multiple agents, the optimal point of the whole system is a Nash equilibrium point, as declared in Definition III.3. $\nu^i(s|\pi_*^1,\ldots,\pi_*^m)$ can be interpreted as the discounted expected reward $\mathcal{G}$. According to the previous two theorems, we can draw the following conclusion.

Theorem III.3

With the $\gamma$-Reward model, the multi-agent system converges to a Nash equilibrium point.

Proof:

First, we give the definition of the optimal $\nu^*$ using Definition III.2:

\nu^i(s|\pi_*^1,\ldots,\pi_*^m) = \rho^* = \sum_{i=1}^m\rho_i^* \qquad (27)

From the convergence proof, we can see that the local optimal reward of $R^i$ depends on the local optima $\rho_j^*$ of the other agents. In other words, if any agent does not converge to its local optimal reward, the others will not be optimal either. From this, we conclude that $\gamma$-Reward forces agents to care about the others and lets the whole system finally converge to a Nash equilibrium point. ∎

IV Experiment

We conduct experiments on CityFlow, an open-source traffic simulator that supports large-scale traffic signal control[41], rather than the commonly used SUMO simulator[42], since it is more than twenty times faster than SUMO. Moreover, we use the Ray[43] framework for the DRL algorithms, an open-source reinforcement learning library that offers both high scalability and a unified API for a variety of applications.

IV-A Datasets

In the experiments, we use both synthetic data and real-world data. For convenience, we share the same real-world datasets as Colight. Each dataset mainly includes two parts, roadnet and flow. Roadnet describes the number of intersections in the road network, their coordinates, and the number of lanes of each intersection. Flow lists thousands of vehicles, each with its own properties, such as length, width, maximum acceleration, maximum speed and, most importantly, trajectory. The experiments use real-world data from Hangzhou and Jinan in China and from New York in the USA. Meanwhile, we use two kinds of synthetic data, arterial and grid. Their characteristics are summarized in Table I. $Grid_{3\times 3}uni$ carries one-way traffic, and $Grid_{3\times 3}bi$ carries two-way traffic on the same road network.

TABLE I: Situation of Datasets
DataSet | Intersections | Arrival rate (vehicles/300s): Mean | Std | Max | Min
$Arterial_{1\times 6}$† | 6 | 300 | - | - | -
$Grid_{3\times 3}uni$† | 9 | 300 | - | - | -
$Grid_{3\times 3}bi$† | 9 | 300 | - | - | -
$NewYork_{16\times 3}$ | 48 | 240.79 | 10.08 | 274 | 216
$Jinan_{3\times 4}$ | 12 | 526.63 | 86.70 | 676 | 256
$Hangzhou_{4\times 4}$ | 16 | 250.79 | 38.21 | 335 | 208
† Traffic flow in the synthetic data is uniform, so there is no need to report the other three values.

Figure 6: Evaluation performance compared among the baselines, the proposed $\gamma$-Reward, and $\gamma$-Attention-Reward.

IV-B Baseline Methods

In Section II, we already introduced methods for TSC, including traditional rule-based methods and learning-based methods. The most primitive rule-based methods are still the most common ones today; as a mature rule-based method, Max-Pressure serves as a representative.

Learning-based methods have prospered with the development of deep learning and data science in recent years. They are characterized by the use of large-scale data to approximate optimal strategies through iterative learning. We chose several methods as baselines:

  • IQL: Since $\gamma$-Reward is based on the decoupling idea of IQL and adds a coordination mechanism, comparing against IQL demonstrates the impact of the coordination mechanism. Here we use the original D3QN, as in [26], for comparison.

  • QMIX: A sophisticated MARL algorithm that integrates all agents into the same model and concentrates on learning a joint action-value function. As a typical one-model method, comparing it with $\gamma$-Reward shows the respective advantages and disadvantages of joint and independent learning for coordination.

  • Colight: Unlike $\gamma$-Reward, Colight is more like QMIX, but it does not learn a joint action. It uses Attention layers to encode the surrounding observations in place of the observation of the current intersection; thanks to its full collection of observations, it can apply Multi-head Attention, and in this way coordination between agents is realized. $\gamma$-Attention-Reward makes some improvements on this basis: the replay buffer amendment introduces the effect of the current intersection on the surrounding intersections, replacing the hyperparameter $|\mathcal{N}_i|$ in Colight and adding consideration of the impact of actions in both time and space.

IV-C Evaluation Performance

Figure 6 and Table II show the performance comparison among $\gamma$-Reward, the more comprehensive $\gamma$-Attention-Reward, and the baselines. Each model is trained for 100 iterations, and each iteration runs 3600 time steps in the simulator. Each action lasts at least 10 seconds to avoid impractically rapid phase switching. The delay span is $n=10\,s$ and the threshold is $c=0.8$.
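For reference, the hyperparameters stated in this section (plus the $\gamma=0.5$ chosen in Section IV-D) can be gathered into one illustrative configuration; the key names and the temporal discount value are assumptions, not the authors' code.

```python
# Experiment settings from this section, as an illustrative config dict.
EXPERIMENT_CONFIG = {
    "iterations": 100,            # training iterations per model
    "steps_per_iteration": 3600,  # simulated time steps per iteration
    "min_action_duration": 10,    # seconds a chosen phase stays green at least
    "delay_span_n": 10,           # seconds before neighbor rewards are read back
    "threshold_c": 0.8,           # capacity-change threshold in Eq. (2)
    "gamma_spatial": 0.5,         # spatial discount, chosen in Section IV-D
    "gamma_prime": 0.95,          # temporal discount (assumed value)
}
```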

We use the average travel time of vehicles to evaluate the performance of the models, which is the standard evaluation metric in the field of TSC.

The performance of the learning-based methods is significantly better than that of rule-based Max-Pressure (Table II), which has been widely shown in many studies.

Among the independent-learning DRL models, the performance of the $\gamma$-Reward series far exceeds that of the IQL model, demonstrating the importance of coordination between agents for global performance improvement.

It is worth noting that, in all road networks, the independent-learning DRL models show stronger convergence than the one-model QMIX. This is probably because, for a single model, excessive dimensionality makes the policy more difficult to learn. Colight, however, does not diverge and achieves excellent results, which may be because it does not make joint decisions through a joint action function but shares network parameters, so that all agents generate independent actions. Since the agents share model parameters, they inevitably ignore individual differences and sacrifice some performance. Compared with the proposed model, intense oscillations sometimes occur in Colight during training; this is also a result of parameter sharing, since once the model iterates in the wrong direction, it misleads all agents and leads to severe congestion across the whole road network. The performance of the proposed methods is even better than Colight, which means sharing sensation is not the only way to realize coordination: sharing results can also help a single intersection pay attention to the whole road network.

On real-world road networks, $\gamma$-Reward and $\gamma$-Attention-Reward do not show a large difference. The reason is that all the real-world road networks we used are two-way roads. We study the attention scores later, as shown in Figure 9 and Figure 10; the attention mechanism shows its effect on specific synthetic road networks. Comparing $Grid_{3\times 3}bi$ and $Grid_{3\times 3}uni$ in Figure 6, the Attention layers distinguish one-way from two-way traffic and help the agents achieve better performance. This is also described later when revealing the details of the Attention layers.

TABLE II: Performance Comparison (Average Travel Time)
Model | $Grid_{3\times 3}bi$ | $Grid_{3\times 3}uni$ | $NewYork_{16\times 3}$ | $Jinan_{3\times 4}$ | $Hangzhou_{4\times 4}$
Max-Pressure | 204.72 | 186.06 | 405.69 | 359.81 | 431.53
IQL | 191.05 | 157.51 | 248.46 | 371.74 | 406.27
QMIX | 565.70 | 619.32 | 216.56 | 571.78 | 587.46
Colight | 104.89 | 100.96 | 169.66 | 301.78 | 311.15
$\gamma$-Reward | 96.44 | 175.38 | 162.18 | 303.97 | 304.90
$\gamma$-Attention-Reward | 96.14 | 93.93 | 141.16 | 286.27 | 284.24

IV-D Study of hyperparameter $\gamma$

Figure 7: Study of hyperparameter $\gamma$; dark lines represent results after smoothing, light lines represent the deviation of the raw results.

We use the $Arterial_{1\times 6}$ road network to study the impact of different $\gamma$ values. We chose five values, [0.1, 0.3, 0.5, 0.7, 0.9], and compared the results. As shown in Figure 7, 0.5 appears to be a balance point for the penalty item, so we set the hyperparameter $\gamma$ to 0.5 in all experiments.

IV-E Visualization of Proposed method

The core idea of the $\gamma$-Reward algorithm is to correct the reward in the replay buffer. In this section, we use the $Arterial_{1\times 6}$ road network as an example to show how the reward values of different intersections affect each other, and we compare the $Grid_{3\times 3}uni$ and $Grid_{3\times 3}bi$ road networks to demonstrate the role of the attention mechanism in the $\gamma$-Reward improvement.

Figure 8: Visualization of the reward change caused by spatial differentiation; $r^i$ represents the raw reward from D3QN, and $R^i$ represents the reward after amendment.

IV-E1 Visualization of Spatial Differentiation Formula

In Figure 8, dashed lines represent the original reward, and solid lines represent the corrected reward. Since a linear road network is selected, each intersection has no more than two neighbors, so it is easier to observe the influence between intersections. For clarity, we select only the first two intersections.

Observe the red line in the light yellow area, which represents the reward of the second intersection. The solid line in this area is almost always above the dashed line, meaning that the rewards become better after amendment. The reason is shown in the light blue area: the dark blue and pink lines rise, which means the traffic situation improves at both intersection $I_{11}$ and intersection $I_{13}$. We believe that part of the improvement at these two intersections is contributed by intersection $I_{12}$. The lag between the yellow and blue areas corresponds to the delay span $n$.

From Figure 8, we can see that in actual training the $\gamma$-Reward framework introduces, as expected, the future changes of nearby intersections.

IV-E2 Effect of Attention Mechanism

Figure 9: Attention scores of intersection $I_{22}$, selected from the $Hangzhou_{4\times 4}$ road network.

Figure 9 shows attention scores from the Hangzhou road network. We find that, except for the current intersection $I_{22}$, the scores decline and tend toward the same value; therefore, attention does not play an important role in these real-world datasets. The reason may be that all of these roads are two-way and have the same number of lanes. To highlight its effect, we need to compare one-way and two-way traffic in the same road network, so we use $Grid_{3\times 3}uni$ and $Grid_{3\times 3}bi$ for visualization.

Figure 10: Attention layers learn the differences of the surroundings from the traffic flow. (a) Attention scores of intersection $I_{22}$ in $Grid_{3\times 3}bi$; (b) attention scores of intersection $I_{22}$ in $Grid_{3\times 3}uni$. The two road networks are compared to show the effect of the attention layers.

Figure 10 shows the comparison between the one-way and two-way $3\times 3$ networks. Comparing Figure 10(a) with Figure 10(b), the score change of intersection $I_{22}$ in Figure 10(a) is nearly equal across the surrounding intersections $I_{12}$, $I_{21}$, $I_{23}$, and $I_{32}$, while in Figure 10(b) the intersections in the exiting direction clearly hold a commanding edge. This means that the attention mechanism does have a significant effect in understanding the structure of the road network. With the attention mechanism, the reward amendment is more concerned with the results of the agent's own actions, which is crucial in the revision of the reward.

V Discussion about Real World Application

Figure 11: Training time of the $\gamma$-Reward series and Colight. Note that, in order to show the difference between centralized and decentralized algorithms, we report the training time of one agent for the $\gamma$-Reward series.

TSC is a real task that runs without a simulator, and the proposed control plan should aim at solving practical problems. Therefore, it is necessary to take into account the limitations that may exist in practice.

V-A Scalability

Scalability is an important criterion for judging whether a TSC method is meaningful. The proposed methods are both decentralized, so each D3QN agent can be deployed on an embedded device at its own intersection. Neither the training time nor the inference time of our methods depends on the number of intersections. Besides, message transmission among neighboring intersections takes only tolerable time, and all intersections communicate in parallel. Due to limited computing capacity, larger-scale experiments could not be conducted; instead, we analyze scalability by collecting the training time on datasets of different scales, as shown in Figure 11. As the number of intersections increases, the time cost of Colight grows significantly. In contrast, our methods keep a shorter training time thanks to parallel computation. Another advantage is that when the road network structure changes, we only need to train the newly added intersections separately, with a tiny impact on the existing ones. Therefore, the proposed methods are believed to be scalable.

V-B Real-time

Real-time traffic communication may suffer from communication delay, information security issues, and the risk of packet loss[44]. Therefore, it is necessary to decouple the calculation of the whole road network; decentralization is becoming the trend in solutions to the TSC problem[29][45]. Colight, by using Multi-head Attention and parameter sharing, summarizes the relationships of the global road network and reduces the time and space complexity of training. However, applying it in practice requires first transmitting the real-time global road network information to a central server for calculation, and the server then has to dispense the resulting actions to each intersection. Note that this means transmitting not just the actions of each traffic signal, but also their sensor observations. In contrast, communication in our methods occurs within a small neighborhood, and the messages needed are only a few bits. These attributes help the intersections keep in touch in real time.

V-C Further applications

The coordination mechanism in this paper can be extended to other multi-agent tasks that face scalability and real-time problems. For instance, power dispatching in power networks can be described in the same way as the TSC problem: power transmission is similar to the traffic flow between two neighboring intersections. Furthermore, the control of a multi-joint serial robot may be regarded as a multi-agent task, since actions of a joint near the base have consequences for the other joints and the end-effector. We can use the coordination mechanism to account for these effects on neighboring agents or joints and achieve globally optimal performance. There are already broad applications of this kind of coordination mechanism in engineering systems such as sensor networks[46], smart grids[47], and robotics[48].

VI Conclusion

In this paper, we propose the $\gamma$-Reward method and its variant $\gamma$-Attention-Reward, which introduces the attention mechanism, to solve the problem of intelligent traffic signal control. By extending the Markov Chain to the space-time domain, the methods become a scalable solution for TSC. In particular, we give a concise proof based on the spatial differentiation formula which shows that the two frameworks can converge to a Nash equilibrium. We conduct extensive experiments using both real-world and synthetic data, which confirm that our proposed methods outperform state-of-the-art methods. On asymmetric road networks, $\gamma$-Attention-Reward shows better results than $\gamma$-Reward thanks to the added attention mechanism. Moreover, we thoroughly investigate the effect of the reward amendment and the attention mechanism in achieving coordination. Compared with the recently proposed Colight, the $\gamma$-Reward series replaces graph attention with recursion, decouples the calculation of the whole road network, and is more suitable for practical applications.

Acknowledgment

This project is derived from the 46th project of DeeCamp 2019, which won the Best Technology Award, and we thank our teammates in “Multi-Commander” (our DeeCamp 2019 project is open-source on GitHub: https://github.com/multi-commander/Multi-Commander). Thanks to the practice platform provided by Dr. Kai-Fu Lee and the support of the APEX Lab of Shanghai Jiao Tong University. We also gratefully acknowledge the support from the National Natural Science Foundation of China (Grant No. 61973210) and the projects of SJTU (Grant Nos. YG2019ZDA17 and ZH2018QNB23).

References

  • [1] H. R. TO and M. M. Barker, “White paper european transport policy for 2010: time to decide,” 2001.
  • [2] X. Liang, X. Du, G. Wang, and Z. Han, “Deep reinforcement learning for traffic light control in vehicular networks,” arXiv preprint arXiv:1803.11115, 2018.
  • [3] H. Wei, G. Zheng, H. Yao, and Z. Li, “Intellilight: A reinforcement learning approach for intelligent traffic light control,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2496–2505, ACM, 2018.
  • [4] H. Wei, G. Zheng, V. V. Gayah, and Z. Li, “A survey on traffic signal control methods,” CoRR, vol. abs/1904.08117, 2019.
  • [5] P. Varaiya, “The max-pressure controller for arbitrary networks of signalized intersections,” in Advances in Dynamic Network Modeling in Complex Transportation Systems, pp. 27–66, Springer, 2013.
  • [6] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” in Innovations in multi-agent systems and applications-1, pp. 183–221, Springer, 2010.
  • [7] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
  • [8] H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, and Z. Li, “Colight: Learning network-level cooperation for traffic signal control,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1913–1922, 2019.
  • [9] P. Koonce and L. Rodegerdts, “Traffic signal timing manual.,” tech. rep., United States. Federal Highway Administration, 2008.
  • [10] J. Török and J. Kertész, “The green wave model of two-dimensional traffic: Transitions in the flow properties and in the geometry of the traffic jam,” Physica A: Statistical Mechanics and its Applications, vol. 231, no. 4, pp. 515–533, 1996.
  • [11] S. S. Mousavi, M. Schukat, and E. Howley, “Traffic light control using deep policy-gradient and value-function-based reinforcement learning,” IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
  • [12] Y. Xiong, G. Zheng, K. Xu, and Z. Li, “Learning traffic signal control from demonstrations,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2289–2292, 2019.
  • [13] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, and Z. Li, “Learning phase competition for traffic signal control,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1963–1972, 2019.
  • [14] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
  • [15] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, pp. 330–337, 1993.
  • [16] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning, pp. 4295–4304, 2018.
  • [17] E. Van der Pol and F. A. Oliehoek, “Coordinated deep reinforcement learners for traffic light control,” Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), 2016.
  • [18] T. Nishi, K. Otaki, K. Hayakawa, and T. Yoshimura, “Traffic signal control based on reinforcement learning with graph convolutional neural nets,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 877–883, IEEE, 2018.
  • [19] H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, and Z. Li, “Presslight: Learning max pressure control to coordinate traffic signals in arterial network,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1290–1298, 2019.
  • [20] Wikipedia contributors, “Lane — Wikipedia, the free encyclopedia,” 2019. [Online; accessed 1-November-2019].
  • [21] A. Stevanovic, Adaptive traffic control systems: domestic and foreign state of practice. No. Project 20-5 (Topic 40-03), 2010.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [23] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Thirtieth AAAI conference on artificial intelligence, 2016.
  • [24] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning, pp. 1995–2003, 2016.
  • [25] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in AAAI, 2018.
  • [26] X. Liang, X. Du, G. Wang, and Z. Han, “A deep reinforcement learning network for traffic light cycle control,” IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, 2019.
  • [27] P. Varshavskaya, L. P. Kaelbling, and D. Rus, “Efficient distributed reinforcement learning through agreement,” in Distributed Autonomous Robotic Systems 8, pp. 367–378, Springer, 2009.
  • [28] S. Kar, J. M. Moura, and H. V. Poor, “$\mathcal{QD}$-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1848–1862, 2013.
  • [29] W. Liu, G. Qin, Y. He, and F. Jiang, “Distributed cooperative reinforcement learning-based traffic signal control that integrates v2x networks’ dynamic clustering,” IEEE Transactions on Vehicular Technology, vol. 66, no. 10, pp. 8667–8681, 2017.
  • [30] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [31] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.
  • [33] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [34] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • [35] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” arXiv preprint arXiv:1810.02912, 2018.
  • [36] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2681–2690, JMLR. org, 2017.
  • [37] G. J. Laurent, L. Matignon, L. Fort-Piat, et al., “The world of independent learners is not markovian,” International Journal of Knowledge-based and Intelligent Engineering Systems, vol. 15, no. 1, pp. 55–64, 2011.
  • [38] R. Aragon-Gómez and J. B. Clempner, “Traffic-signal control reinforcement learning approach for continuous-time markov games,” Engineering Applications of Artificial Intelligence, vol. 89, p. 103415, 2020.
  • [39] D. T. Nguyen, W. Yeoh, H. C. Lau, S. Zilberstein, and C. Zhang, “Decentralized multi-agent reinforcement learning in average-reward dynamic dcops,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • [40] J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
  • [41] H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, and Z. Li, “Cityflow: A multi-agent reinforcement learning environment for large scale city traffic scenario,” in The World Wide Web Conference, pp. 3620–3624, ACM, 2019.
  • [42] D. Krajzewicz, “Traffic simulation with sumo–simulation of urban mobility,” in Fundamentals of traffic simulation, pp. 269–293, Springer, 2010.
  • [43] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al., “Ray: A distributed framework for emerging ai applications,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577, 2018.
  • [44] B. C. Ezell, R. M. Robinson, P. Foytik, C. Jordan, and D. Flanagan, “Cyber risk to transportation, industrial control systems, and traffic signal controllers,” Environment Systems and Decisions, vol. 33, no. 4, pp. 508–516, 2013.
  • [45] T. Chu, S. Qu, and J. Wang, “Large-scale traffic grid signal control with regional reinforcement learning,” in 2016 American Control Conference (ACC), pp. 815–820, IEEE, 2016.
  • [46] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd international symposium on Information processing in sensor networks, pp. 20–27, 2004.
  • [47] K. Zhang, W. Shi, H. Zhu, E. Dall’Anese, and T. Başar, “Dynamic power distribution system management with a locally connected communication network,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 4, pp. 673–687, 2018.
  • [48] P. Corke, R. Peterson, and D. Rus, “Networked robots: Flying robot navigation using a sensor net,” in Robotics research. The eleventh international symposium, pp. 234–243, Springer, 2005.
  • [49] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [50] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
  • [51] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.

Appendix

VI-A Fundamental DRL algorithms

VI-A1 Temporal-Difference Learning

Temporal-Difference (TD) learning, proposed by Sutton[49], combines ideas from Dynamic Programming (DP) and Monte Carlo (MC) methods and forms the core of DRL. Like Monte Carlo methods, it does not require a model of the environment and can learn directly from experience. On the other hand, it inherits bootstrapping from DP, which is the distinctive feature of TD learning: predictions are used as targets during learning[50]. Monte Carlo simulates (or experiences) an entire episode before estimating the value of each visited state from the observed return. In contrast, TD learning updates after every action (or after a few steps), estimating the value of the current state from the immediate reward and the estimated value of the successor state.

The Q-learning algorithm [51] is a groundbreaking off-policy algorithm built on TD learning. Its TD error is

\delta_{t}=r_{t+1}+\gamma^{\prime}\max_{a^{\prime}}Q\left(o_{t+1},a^{\prime}\right)-Q\left(o_{t},a_{t}\right)

where δ_t denotes the TD error.
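For reference, a minimal tabular Q-learning sketch that applies this TD error as an update. The tiny chain environment, the action semantics, and the hyper-parameter values are illustrative assumptions, not part of the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed values, for illustration only
ACTIONS = [0, 1]                        # e.g., keep phase / switch phase

Q = defaultdict(lambda: [0.0, 0.0])     # Q[o][a], initialized to zero

def td_update(o, a, r, o_next):
    """One Q-learning step: delta = r + gamma * max_a' Q(o', a') - Q(o, a)."""
    delta = r + GAMMA * max(Q[o_next]) - Q[o][a]
    Q[o][a] += ALPHA * delta
    return delta

def epsilon_greedy(o):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[o][a])

def step(o, a):                          # stand-in for a real simulator
    o_next = min(o + 1, 2) if a == 1 else o
    r = 1.0 if o_next == 2 else 0.0
    return o_next, r

o = 0
for t in range(100):
    a = epsilon_greedy(o)
    o_next, r = step(o, a)
    td_update(o, a, r, o_next)
    o = 0 if o_next == 2 else o_next     # restart the toy episode at the goal state
```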

VI-A2 Deep Q-network

Deep Q-network (DQN) is a powerful off-policy algorithm that has achieved excellent results in many fields since 2015[22]. It uses a neural network to approximate the Q-value function instead of a lookup table.

Q\left(o_{t},a_{t}\right)\leftarrow Q\left(o_{t},a_{t}\right)+\alpha\left[y_{t+1}-Q\left(o_{t},a_{t}\right)\right]
y_{t+1}=r_{t+1}+\gamma^{\prime}\max_{a^{\prime}}Q\left(o_{t+1},a^{\prime}\right)

VI-A3 Double Deep Q-network

Q-learning uses the max operator to select the best action, which causes a maximization bias problem. [23] addressed this by designing Double Q-learning, which only differs in the calculation of the target Q value:

y_{t+1}=r_{t+1}+\gamma^{\prime}Q^{\prime}\left(\phi\left(o^{\prime}\right),\arg\max_{a^{\prime}}Q\left(\phi\left(o^{\prime}\right),a^{\prime},w\right),w^{\prime}\right)
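The difference between the two targets is easiest to see in code. The following PyTorch-style sketch computes both the vanilla DQN target and the Double DQN target for a batch of transitions; the function and variable names are illustrative assumptions.

```python
import torch

def dqn_targets(q_net, target_net, rewards, next_obs, gamma):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    with torch.no_grad():
        next_q = target_net(next_obs)                        # [batch, num_actions]
        return rewards + gamma * next_q.max(dim=1).values

def double_dqn_targets(q_net, target_net, rewards, next_obs, gamma):
    """Double DQN: the online network selects a', the target network evaluates it."""
    with torch.no_grad():
        best_a = q_net(next_obs).argmax(dim=1, keepdim=True)        # argmax_a' Q(o', a'; w)
        eval_q = target_net(next_obs).gather(1, best_a).squeeze(1)  # Q'(o', a'; w')
        return rewards + gamma * eval_q
```

Decoupling action selection from action evaluation is what removes the overestimation caused by taking the max of noisy value estimates.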

VI-A4 Dueling Deep Q-network

Another improvement to DQN is Dueling DQN [24], which decomposes the Q network into two separate streams: a state-value function V(o) and a state-dependent action advantage function A(o,a). These two streams are combined into an estimate of the Q function through a special aggregating layer.

Q(o,a;\theta,\alpha,\beta)=V(o;\theta,\beta)+\left(A(o,a;\theta,\alpha)-\frac{1}{|\mathcal{A}|}\sum_{a^{\prime}}A\left(o,a^{\prime};\theta,\alpha\right)\right)
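Below is a minimal PyTorch-style dueling head implementing this aggregation; the layer sizes and example dimensions are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(o, a) = V(o) + (A(o, a) - mean over a' of A(o, a'))."""
    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)                 # V(o; theta, beta)
        self.advantage_stream = nn.Linear(hidden, num_actions)   # A(o, a; theta, alpha)

    def forward(self, obs):
        h = self.shared(obs)
        value = self.value_stream(h)                      # [batch, 1]
        advantage = self.advantage_stream(h)              # [batch, num_actions]
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Example: 8 observation features (e.g., queue lengths), 4 signal phases.
q_net = DuelingQNetwork(obs_dim=8, num_actions=4)
q_values = q_net(torch.rand(32, 8))    # shape [32, 4]
```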

VI-B Pseudocode for the γ-Reward series

Algorithm 1 γ-Attention-Reward Algorithm for MARL Traffic Signal Control
1: Initialize E parallel environments with N agents
2: Initialize replay memory D to capacity N_D
3: Initialize raw replay memory D_r to capacity N_D + n
4: Initialize action-value function Q with random weights θ
5: Initialize target action-value function Q̂ with weights θ⁻ = θ
6: Initialize attention scores α_{i,j}
7: T_update ← 0
8: for episode = 1, M do
9:     Reset environments, and get initial observation o_i for each agent i
10:     for t = 1, T do
11:         Select action a_i ∼ π_i(·|o_i) for each agent i in each environment e
12:         Send actions a_i to all parallel environments and get o_i′, r_i for all agents
13:         Store (a_i, o_i, r_i, o_i′) in D_r
14:         T_update ← T_update + N
15:         if T_update ≥ min steps per update + n then
16:             ReplayBufferAmendment(D_r, α_{i,j})
17:             Update policies: sample a batch B from D; calculate a_{1…N}^B ∼ π_i^θ̄(o_i′^B) for i ∈ 1…N; calculate Q_i^ψ(o_{1…N}^B, a_{1…N}^B) for all i in parallel; update policies using ∇_{θ,i}𝒥(π_θ) and Adam
18:             Update target Q parameters: θ̂ ← θ
19:             Update target attention parameters: α̂ ← α
20:             T_update ← 0
21:         end if
22:     end for
23: end for
Algorithm 2 Replay Buffer Amendment
1: function ReplayBufferAmendment(D_r, α_{i,j})
2:     for i = 1, N do
3:         j = len(D_r) − n
4:         while j > ex_index do
5:             (o_i, a_i, r_i, o_i′) = D_{r,i}(j)
6:             R_i = γ-Attention-Reward function(r_i, α_{i,j})
7:             Store (o_i, a_i, R_i, o_i′) in D
8:             j = j − 1
9:         end while
10:         ex_index = len(D_r) − n
11:     end for
12: end function
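For readers who prefer code, here is a minimal Python sketch of the buffer traversal in Algorithm 2. The amendment itself is left as a placeholder (a hypothetical gamma_attention_reward), since its exact form is given in the main text; the buffer layout, the use of neighbors' rewards n steps later, and all names are illustrative assumptions.

```python
def gamma_attention_reward(r_i, future_neighbor_rewards, attention_weights, gamma=0.8):
    """Placeholder for the spatial-differentiation amendment: mixes the raw reward
    with attention-weighted rewards observed later at neighboring intersections.
    The true formula is defined in the main text; this is only a stand-in."""
    correction = sum(w * r for w, r in zip(attention_weights, future_neighbor_rewards))
    return r_i + gamma * correction

def replay_buffer_amendment(raw_buffer, train_buffer, attention, neighbors, n, ex_index):
    """Mirror of Algorithm 2: walk backwards over transitions that already have n
    future steps recorded, amend their rewards, and move them to the training buffer.
    Assumes every agent's raw buffer has the same length."""
    for i in raw_buffer:                                  # one raw buffer per agent i
        j = len(raw_buffer[i]) - n - 1                    # newest transition with n future steps
        while j > ex_index:
            o, a, r, o_next = raw_buffer[i][j]
            # neighbors' rewards observed n steps after this transition (illustrative choice)
            future = [raw_buffer[k][j + n][2] for k in neighbors[i]]
            R = gamma_attention_reward(r, future, attention[i])
            train_buffer.append((o, a, R, o_next))
            j -= 1
    return len(next(iter(raw_buffer.values()))) - n - 1   # new ex_index for the next call
```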
12:end function