
Learning Scalable Multi-Agent Coordination by Spatial Differentiation for Traffic Signal Control

Junjia Liu, Huimin Zhang, Zhuang Fu*, Yao Wang
The first two authors, Junjia Liu and Huimin Zhang, contributed equally to this paper. *Correspondence: [email protected] (Z. F.), Tel.: +86-138 1649 6926 (Z. F.)
State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University
National Engineering Laboratory for Automotive Electronic Control Technology, Shanghai Jiao Tong University
Shanghai, 200240, China
[email protected], [email protected], [email protected], [email protected]
Abstract

The intelligent control of traffic signals is critical to the optimization of transportation systems. To achieve globally optimal traffic efficiency in large-scale road networks, recent works have focused on coordination among intersections and have shown promising results. However, existing studies pay more attention to sharing observations among intersections (both explicitly and implicitly) and pay little attention to the consequences of decisions. In this paper, we design a multi-agent coordination framework based on Deep Reinforcement Learning for traffic signal control, defined as $\gamma$-Reward, which includes both the original $\gamma$-Reward and $\gamma$-Attention-Reward. Specifically, we propose the spatial differentiation method for coordination, which uses the temporal-spatial information in the replay buffer to amend the reward of each action. A concise theoretical analysis is given, proving that the proposed model converges to a Nash equilibrium. By extending the idea of the Markov Chain to the space-time dimension, this truly decentralized coordination mechanism replaces the graph attention method and realizes the decoupling of the road network, which is more scalable and more in line with practice. Simulation results show that the proposed model maintains state-of-the-art performance even without a centralized setting. Code is available at https://github.com/Skylark0924/Gamma_Reward.

Index Terms:
Multi-agent, coordination mechanism, $\gamma$-Reward, Deep Reinforcement Learning, spatial differentiation

I Introduction

Traffic congestion has been an increasingly critical matter for decades. It not only increases commuting time but also exacerbates noise and environmental pollution due to frequent acceleration and deceleration. According to relevant research, almost all collisions and delays in urban traffic are concentrated at intersections[1], and unreasonable signal control leads to a significant waste of traffic resources. Therefore, the key to solving urban congestion is to keep intersections clear.

Deep Reinforcement Learning (DRL) methods have been successfully applied to single-intersection traffic signal regulation and have shown better performance than traditional methods[2][3][4], such as Max-pressure[5]. Recent works have begun to apply DRL algorithms, especially multi-agent Reinforcement Learning (MARL)[6], to multi-intersection and even large-scale road networks. Different from the single-intersection problem, the intelligent regulation of large-scale road networks requires synergistic control among intersections, which can be regarded as a Multi-objective Optimization Problem (MOP) and a Markov/Stochastic Game (MG) with a cooperative setting[7]. In other words, multiple agents need to coordinate with each other: they need to keep their own intersections open and, at the same time, pay attention to the traffic flow of surrounding or even remote intersections, so that they can ultimately improve the efficiency of the overall road network. The latest research introduced the graph attention network (GAT) to let intersections implicitly share their observations of real-time traffic volume, and obtained inspiring results[8].

I-A Related work and Motivation

Existing traffic signal control (TSC) methods can be divided into two categories: rule-based methods and learning-based methods. The former transforms the problem into a rule-based optimization problem; the latter seeks control strategies from traffic flow data.

For rule-based methods, such as Webster[9], GreenWave[10] and Max-pressure[5], the traffic signal optimization problem is usually solved under assumptions such as a preset period or a fixed cycle-based phase sequence[4]. Webster is a widely used method for an isolated intersection. It assumes that the traffic flow is uniform during a certain period and constructs a closed-form equation to generate the optimal cycle length and phase split for a single intersection, minimizing the travel time of all vehicles passing the intersection. GreenWave is a classical method in the transportation field to implement coordination; it optimizes the offsets to reduce the number of stops for vehicles travelling along one specific direction. Max-pressure aims to reduce the risk of over-saturation by balancing queue lengths between adjacent intersections and minimizing the "pressure" of the phases of an intersection. However, the impractical assumptions in these methods might not lead to excellent performance.
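To make the pressure idea concrete, the following is a minimal sketch of a Max-pressure-style phase selection, assuming queue counts per entering and exiting lane are directly observable; the phase-to-movement mapping and the example numbers are hypothetical, not taken from the cited work.

```python
# Minimal sketch of the Max-pressure rule for one intersection.
# Assumptions: queue counts per lane are observable; the phase-to-movement
# mapping below is illustrative only.

def phase_pressure(phase_movements, queue_in, queue_out):
    """Pressure of a phase: sum over its movements of (upstream queue - downstream queue)."""
    return sum(queue_in[i] - queue_out[j] for i, j in phase_movements)

def max_pressure_phase(phases, queue_in, queue_out):
    """Return the index of the phase with the largest pressure."""
    pressures = [phase_pressure(m, queue_in, queue_out) for m in phases]
    return max(range(len(phases)), key=lambda k: pressures[k])

# Example: two hypothetical phases, each serving two (entering, exiting) lane pairs.
phases = [[(0, 4), (1, 5)], [(2, 6), (3, 7)]]
queue_in = [8, 3, 1, 0, 0, 0, 0, 0]   # vehicles queued on entering lanes
queue_out = [0, 0, 0, 0, 2, 1, 0, 0]  # vehicles on downstream exiting lanes
print(max_pressure_phase(phases, queue_in, queue_out))  # -> 0
```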

Recently, DRL, as a popular learning-based technique, has been proposed to control traffic signals due to its capability of online optimization without prior knowledge of the given environment. DRL has been successfully applied to single-intersection traffic signal regulation [3] by regarding the intersection as an agent. The results of various state-of-the-art DRL algorithms are compared in [11], showing that the Deep Q-Network algorithm is well suited to TSC tasks. However, the problem with multiple intersections remains an open frontier.

Existing MARL algorithms focus on collaboration among agents and can be divided into centralized and decentralized settings according to their information structures[7]. Independent reinforcement learning (IRL), as a fully decentralized setting, directly runs a DRL algorithm for each agent with neither explicit nor implicit information exchange. This method has been applied to the multi-intersection TSC problem[12][13]. However, the environment is shared in MARL and changes with the policy and state of every agent[14]. For any single agent, the environment is therefore dynamic and non-stationary, which leads to convergence problems. Tan et al.[15] compare IRL with Value Decomposition Networks (VDN) and illustrate the disadvantages of IRL. As a centralized method, VDN uses a central controller that integrates the value function of each agent to obtain a joint action-value function; the integration strategy is to add them directly. QMIX[16], as an extension of VDN, additionally uses state information, integrates the value functions in a nonlinear way, and obtains a stronger approximation ability than VDN. Both QMIX and VDN are typical centralized MARL algorithms with communication, and this joint-action idea has already been used in TSC[17].

Based on these centralized methods, recent TSC studies condense the global scope into a smaller neighborhood[8][18][19] and use graph convolutional networks (GCN) to achieve coordination. Colight[8] introduces the attention mechanism and realizes cooperation by implicitly integrating the observations within a neighborhood into a hidden state. However, as mentioned in Colight, the neighborhood scope is a constant, so the traffic information among intersections cannot be used to determine the range of the neighborhood. Moreover, due to the use of GCN and Multi-head Attention, these methods still need to gather information for centralized computation, which contradicts the intention of decentralization.

In fact, such a central controller does not exist in many applications, apart from those that easily have a global perspective, like video games[7]. Considering the demand for scalability in TSC, setting up a central controller is impractical. Therefore, we seek a compromise solution that is both convergent and scalable. This setting is referred to as a decentralized one with networked agents[7].

I-B Main contributions

In this paper, we first regard each intersection as a DRL agent and transform the TSC problem into a Markov Decision Process (MDP). Unlike existing work, this paper aims at improving scalability while ensuring state-of-the-art (SOTA) performance. To achieve this goal, we introduce a structural prior about road networks as an inductive bias and extend the Markov Chain theory to the temporal-spatial domain for coordination.

The change of future states and rewards from distant intersections is multiplied by the spatial discount rate $\gamma$ (note that it represents the temporal-spatial discount on multi-agent information, not the ordinary discount used in temporal difference; the two are distinguished in detail later) and taken into account when learning the current intersection's policy. This information is used as a penalty to correct the calculation of the current reward so that the agents gain the ability to collaborate. Since the traffic volume of each road varies, the influence of surrounding intersections may differ; therefore, an attention mechanism is introduced to correct the influence weights of the surroundings on the current intersection.

To summarize, our main contributions are as follows:

  • We propose a coordination framework, defined as $\gamma$-Reward, which communicates with adjacent intersections, and even further ones, in a scalable way by sharing future states and rewards, and achieves globally optimal control of the TSC problem.

  • Instead of Multi-head Attention, a spatial differentiation method is proposed to collect temporal-spatial information in a decentralized way and to amend the current reward by recursion.

  • We use attention only within the neighborhood to distinguish varying significance, and we update the attention score parameters in the spatial differentiation formula by imitating the idea of the target network.

  • Test results on various road networks show that the $\gamma$-Reward series, including the original $\gamma$-Reward and $\gamma$-Attention-Reward, maintains SOTA performance while achieving better scalability.

II Problem Formulation & Proposed MARL Method

In this section, we introduce the basic knowledge of the TSC problem and propose our MARL method.

II-A Preliminary & Formulation

  • Lane: A lane is part of a roadway designated to be used by a single line of vehicles [20]. There are two kinds of lanes: entering lanes and exiting lanes[21]. Each intersection consists of multiple lanes.

  • Phase: A phase is a combination of movement signals[4]. Figure 1(a) shows eight main directions of vehicles at the intersection. Note that the direction of turning right is usually ignored in these problems, since it can be executed at any time regardless of the traffic signal. The directions in the same phase do not conflict with each other, as shown in Figure 1(b). The phase is the unit of TSC, and only one phase can turn green at a time.

  • Neighbor intersection: The intersections that directly connect to the current intersection. In an ordinary road network, each intersection usually has at most four neighbors.

  • Waiting vehicle: If a vehicle on an entering lane has a speed lower than a threshold, we define it as a waiting vehicle, which means it is slowing down to wait for the red light (a minimal counting sketch is given after this list).
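As an illustration of the waiting-vehicle definition, the following minimal sketch counts waiting vehicles from lane speed readings; the speed threshold value is an assumption, not specified by the paper.

```python
# Count waiting vehicles on the entering lanes of one intersection.
WAIT_SPEED_THRESHOLD = 0.1  # m/s, assumed value

def count_waiting(entering_lane_speeds):
    """entering_lane_speeds: current speeds of vehicles on the entering lanes."""
    return sum(1 for v in entering_lane_speeds if v < WAIT_SPEED_THRESHOLD)

print(count_waiting([0.0, 0.05, 3.2, 8.7]))  # -> 2
```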

Figure 1: The foundation of the traffic problem. (a) Eight main directions in a single intersection; (b) eight kinds of primary phases. The phase is the fundamental unit of the TSC problem, and the two directions in each phase never conflict.

By using DRL, we regard the TSC problem as an MDP. An individual DRL agent controls one intersection. Each agent observes part of the environmental state $\mathcal{O}$ and chooses actions $\mathcal{A}$ according to these observations to determine which phase of the intersection turns green. The effect of the control is fed back from the environment in the form of a reward $\mathcal{R}$. The goal of the DRL agent is to maximize the reward function by continuously exploring and exploiting based on constant interaction with the environment. In this paper, the problem requires reducing the queue length $q_l$ or the travel time $T_w$ in the road network. To make the problem more suitable for DRL, we first abstract it into the tuple $\langle\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\pi,\gamma'\rangle$:

Figure 2: The TSC problem is regarded as an MDP. Each intersection is controlled by a unique agent that implements a DRL algorithm and learns an optimal action-decision strategy.
  • Observation $o_t^i$: $\boldsymbol{o}_t^i=(o_1^i,\ldots,o_t^i)$, where $\boldsymbol{o}_t^i\in\boldsymbol{\mathcal{O}}_t^i$. Every agent observes the length of the vehicle queue on the entering lanes of its intersection. Moreover, to cater to the design of the proposed $\gamma$-Reward algorithm, we also observe the number of vehicles on the exiting lanes that connect to neighbor intersections.

  • Action $a_t^i$: $\boldsymbol{a}_t^i=(a_1^i,\ldots,a_t^i)$, where $\boldsymbol{a}_t^i\in\boldsymbol{\mathcal{A}}_t^i$. The action is simply the serial number of the phase chosen to turn green.

  • Transition probability $\mathcal{P}$: $\mathcal{P}(o_{t+1}^i|o_t^i,a_t^i)$ describes the probability of moving from state $o_t^i$ to the next potential state $o_{t+1}^i$.

  • Reward $r_t^i$: After executing each action $a_t^i$, we receive return information that judges whether $a_t^i$ is good enough for $o_t^i$. We use the number of waiting vehicles on the entering lanes as the raw reward. For the amendatory reward, we use $R_t^i$ as the representation.

  • Policy $\pi$: The policy is what the agents learn in DRL. It represents the goal of reducing travel time and increasing average speed. For a single agent, $\pi^i:\mathcal{O}_t^i\mapsto\mathcal{A}_t^i$.

  • Discount rate $\gamma'$: This factor has the common meaning used in Temporal-Difference learning (see Appendix A.1). To avoid confusion with $\gamma$-Reward, we use $\gamma'$ here to replace the original symbol $\gamma$.

By using the Bellman equation, the relationship among these parameters can be formulated, and we can obtain the optimal phases after iterating the following equations:

Q(o_t^i, a_t^i) = Q(o_t^i, a_t^i) + \alpha\left(r_t^i + \gamma'\max_{a_{t+1}^i} Q(o_{t+1}^i, a_{t+1}^i) - Q(o_t^i, a_t^i)\right) \qquad (1)
a_t^i = \arg\max Q(o_t^i, a_t^i)
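The update in Equation (1), written per agent, can be sketched as follows. This is a tabular toy version for readability (the agents in this paper actually use a D3QN function approximator), and all names are illustrative.

```python
# Tabular sketch of Equation (1) for a single intersection agent.
# The paper's agents replace the table with a D3QN network.
from collections import defaultdict

ALPHA, GAMMA_PRIME = 0.1, 0.95        # learning rate and temporal discount gamma'
Q = defaultdict(float)                 # Q[(observation, action)], observations must be hashable

def td_update(o_t, a_t, r_t, o_next, actions):
    """One temporal-difference step of Equation (1)."""
    best_next = max(Q[(o_next, a)] for a in actions)
    Q[(o_t, a_t)] += ALPHA * (r_t + GAMMA_PRIME * best_next - Q[(o_t, a_t)])

def greedy_phase(o_t, actions):
    """a_t = argmax Q(o_t, a): the phase chosen to turn green."""
    return max(actions, key=lambda a: Q[(o_t, a)])
```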

II-B Proposed Spatial Differentiation Function

Coordination among agents plays a critical role in MARL algorithms, whether centralized or decentralized. In this paper, we propose a coordination mechanism among distributed agents. Each agent is based on the Dueling Double Deep Q-Network (D3QN)[22][23][24], one of the best Q-value-based models to date[25]; a detailed description can be found in Appendix A. Some studies already use D3QN directly on the TSC problem [26], but there it can only serve as an independent Q-learning (IQL) method in the multi-intersection setting. For the TSC problem, the idea of distributed agents is wise, and what we need is an appropriate decentralized coordination mechanism. Therefore, we propose the $\gamma$-Reward framework for coordination (Figure 3).
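For reference, below is a minimal sketch of the dueling head that D3QN builds on (the "Double" part concerns how targets are selected during training); the layer sizes and names are assumptions, not the authors' architecture.

```python
# Dueling Q-network head: Q(o, a) = V(o) + A(o, a) - mean_a A(o, a).
# Layer sizes are illustrative assumptions.
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_phases, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(o)
        self.advantage = nn.Linear(hidden, n_phases)   # advantage A(o, a) per phase

    def forward(self, obs):
        x = self.body(obs)
        v, a = self.value(x), self.advantage(x)
        return v + a - a.mean(dim=-1, keepdim=True)    # Q values over phases
```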

Figure 3: Left: MARL algorithms with a centralized setting. Right: the $\gamma$-Reward framework, a decentralized setting with networked agents. $\gamma$-Reward adds a coordination mechanism on top of D3QN.

The main idea of this article is inspired by the basic theory of DRL. For the TSC problem, not only should the temporal decision be regarded as an MDP, but the road network itself also resembles a Markov Chain, since the decision of the current intersection will affect other intersections in the future. Hence we should consider not only the TD-error (see Appendix A.1) in time, but also a "TD-error" in space, which we define as the spatial differentiation.

Figure 4: Diagram of spatial differentiation.

Figure 4 shows a diagram of a multi-intersection road network with $3\times 3$ intersections. Since the decision of intersection $I_{22}$ at time $t$ will affect the next intersection $I_{23}$ at time $t+n$, the result of intersection $I_{22}$ at time $t$ should be affected not only by the current intersection reward $r_t^{I_{22}}$, but also by the rewards $r_{t+n}^j$ of the surrounding intersections at time $t+n$, where $j\in\{I_{12},I_{21},I_{23},I_{32}\}$. The formula of spatial differentiation is as follows:

R^{I_{22}}_{t} = r^{I_{22}}_{t}\cdot\left\{1+\gamma\cdot\tanh\left[\sum_{j\in\mathcal{N}_{i}}\left(\frac{R^{j}_{t+n}}{r^{j}_{t}}-c\right)\right]\right\} \qquad (2)

where $\mathcal{N}_{I_{22}}=\{I_{12},I_{21},I_{23},I_{32}\}$ denotes the four intersections around intersection $I_{22}$. The parameter $n$ is called the delay span; it represents the time span for most vehicles to reach the next intersection after the current action $a_t$, and it is related to the length of the road, the average velocity on the exiting lane, and the sample interval. The whole formula can also be treated as a value function of the reward, so it is a spatial differentiation of the reward-value function.

In Equation 2, $R^j_{t+n}/r^j_t$ shows the change of traffic capacity at intersection $j$ between time $t$ and $t+n$. If $R^j_{t+n}/r^j_t$ is greater than the threshold $c$, the traffic capacity of intersection $j$ has deteriorated; in other words, the decision of intersection $I_{22}$ at time $t$ makes the adjacent intersection $j$ more congested. Conversely, $R^j_{t+n}/r^j_t$ smaller than $c$ indicates that the capacity of intersection $j$ has improved and the decision of intersection $I_{22}$ at time $t$ is good for $j$. Through some range transformations, the differential item serves as a penalty; it is finally multiplied by a spatial discount factor $\gamma$ as an amendment to the reward. Since the training goal of DRL is to maximize the reward function, the penalty item forces the agent to pay attention to the situation around the intersection while improving its own policy.
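A minimal sketch of Equation (2) follows, assuming the neighbors' delayed rewards $R^j_{t+n}$ and raw rewards $r^j_t$ are available (e.g., read back from the replay buffer) and nonzero; the toy values, their sign convention, and all names are assumptions for illustration.

```python
# Sketch of the spatial differentiation amendment in Equation (2).
import math

def spatial_differentiation(r_t_i, neighbor_R_future, neighbor_r_now,
                            gamma=0.5, c=0.8):
    """Amend agent i's raw reward r_t^i with its neighbors' delayed rewards.

    neighbor_R_future[j] is R_{t+n}^j (the neighbor's reward after the delay
    span n); neighbor_r_now[j] is its raw reward r_t^j at time t (nonzero).
    """
    penalty = sum(neighbor_R_future[j] / neighbor_r_now[j] - c
                  for j in neighbor_R_future)
    return r_t_i * (1.0 + gamma * math.tanh(penalty))

# Toy call with two hypothetical neighbors (illustrative values only).
R_t = spatial_differentiation(r_t_i=-4.0,
                              neighbor_R_future={"I12": -5.0, "I21": -3.0},
                              neighbor_r_now={"I12": -4.0, "I21": -3.0})
```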

This equation is also in line with the idea of MARL with networked agents, which dates back to Varshavskaya et al. [27]. $\mathcal{QD}$-learning[28] further gives a convergence proof for this kind of MARL algorithm. It incorporates the idea of consensus + innovation into the standard Q-learning algorithm and has the following Q-value update equation:

Q(o_t^i, a_t^i) \leftarrow Q(o_t^i, a_t^i) + \alpha_{t,o,a}\left[r_t^i + \gamma'\max_{a_{t+1}^i\in\mathcal{A}} Q(o_{t+1}^i, a_{t+1}^i) - Q(o_t^i, a_t^i)\right] - \beta_{t,o,a}\sum_{j\in\mathcal{N}_t^i}\left[Q(o_t^i, a_t^i) - Q(o_t^j, a_t^j)\right] \qquad (3)

where the term after $\alpha_{t,o,a}$ denotes a local innovation potential that incorporates newly obtained observations, and the term after $\beta_{t,o,a}$ is a consensus potential (agent collaboration) that captures the difference between the agent's Q-value estimates and those of its neighbors.

Analogously, our Q-value update equation can be changed as follows:

Q(o_t^i, a_t^i) = Q(o_t^i, a_t^i) + \alpha\left\{r_t^i + \gamma\cdot r_t^i\cdot\tanh\left[\sum_{j\in\mathcal{N}_i}\left(\frac{R^j_{t+n}}{r^j_t}-c\right)\right] + \gamma'\max_{a_{t+1}^i\in\mathcal{A}} Q(o_{t+1}^i, a_{t+1}^i) - Q(o_t^i, a_t^i)\right\} \qquad (4)

The spatial differentiation term plays the role of the consensus term in $\mathcal{QD}$-learning, and we further construct the communication in the temporal domain and the spatial domain at the same time. Although Equation 2 uses future information, which sounds contrary to causality, it is achievable in the implementation. Since we use Q-learning as the basic model, which is off-policy, the trajectory of state-action pairs and the obtained rewards are saved into a replay buffer. In this way, the raw reward $r_t^i$ is saved first and then corrected after $n$ steps, so the calculation can be realized (see Appendix B for the pseudocode).
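A possible sketch of this amend-after-$n$-steps procedure is given below; the buffer layout and all names are assumptions rather than the authors' implementation (Appendix B gives their pseudocode).

```python
# Sketch: the raw reward is stored first and corrected once the neighbors'
# rewards at step t + n become available. Layout and names are illustrative.
import math

class AmendableReplayBuffer:
    def __init__(self, delay_span=10, gamma=0.5, c=0.8):
        self.storage = []                      # dicts: o, a, r_raw, r, o_next
        self.n, self.gamma, self.c = delay_span, gamma, c

    def add(self, o, a, r_raw, o_next):
        self.storage.append({"o": o, "a": a, "r_raw": r_raw,
                             "r": r_raw, "o_next": o_next})

    def amend(self, t, neighbor_buffers):
        """Rewrite the reward stored at step t using neighbor data at t + n."""
        if t + self.n >= min(len(b.storage) for b in neighbor_buffers):
            return                              # the future step is not collected yet
        penalty = sum(b.storage[t + self.n]["r"] / b.storage[t]["r_raw"] - self.c
                      for b in neighbor_buffers)
        self.storage[t]["r"] = self.storage[t]["r_raw"] * \
            (1.0 + self.gamma * math.tanh(penalty))
```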

II-C Attention Mechanism for $\gamma$-Reward

Figure 5: Framework of the proposed $\gamma$-Attention-Reward model. In the inner cycle, the eval Attention layers and eval Q network (blue) evaluate the real-time Q value for control. In the external cycle, the target system (orange) predicts the long-term impact to improve the Q network and is updated periodically. On the right, we illustrate the message mechanism of the original $\gamma$-Reward with the example of an $Arterial_{1\times 5}$ road network.

The spatial differentiation formula proposed in the previous section assumes that every intersection in the road network is in the same situation; in other words, their levels are equal. In reality, the levels of intersections in the road network differ: some have only one or two lanes, others may have four to six. Imagine that intersection $I_{22}$ in Figure 4 is a two-way road with eight lanes, intersection $I_{23}$ is the same as $I_{22}$, while intersection $I_{32}$ is a two-way road with only two lanes. The effect of the decision of intersection $I_{22}$ on $I_{22}\Rightarrow I_{23}$ and on $I_{22}\Rightarrow I_{32}$ is inevitably different: the discharge of intersection $I_{22}$ can easily cause excessive congestion at intersection $I_{32}$, but may not be serious for intersection $I_{23}$. In addition to the number of lanes, the length between intersections differs, which means that the maximum congestion length each intersection can accept also differs. In summary, when the spatial differentiation formula corrects an intersection, different neighboring intersections should carry different influence weights.

The problems mentioned above can indeed be addressed by setting different thresholds to obtain weights, as in [29]. However, the actual road network situation is very complicated, and there is no way to take every parameter into consideration. Therefore, we learn the rules from the traffic data, and in this respect the attention mechanism gives us a good solution.

The attention mechanism [30][31][32] was first proposed to solve the "seq2seq" translation problem in NLP. Attention can be interpreted broadly as a vector of importance weights: to predict an element, such as a word in a sentence, the attention vector estimates how strongly it is related to other elements, and the weighted sum of their values serves as an approximation of the target. It breaks the limits of Euclidean distance between data, captures long-range dependencies in sentences, and provides smoother translations.

In addition to sequence data, attention can also be used for other types of problems. In the graph world, GCN [33] shows that combining local graph structures with node features can achieve good performance in node classification tasks. However, the way GCN combines the features of neighboring nodes is closely tied to the structure of the graph, which limits the generalization ability of the trained model to other graph structures. GAT [34] instead proposes a weighted summation of neighboring node features using the attention mechanism, so the weights depend entirely on the node features and are independent of the graph structure.

Recent research has begun to introduce the idea of attention into MARL algorithms. A Multi-Agent Actor-Critic (MAAC) algorithm combining the attention mechanism has been proposed[35]. MAAC encodes the states of the surrounding agents and obtains their contribution values to the current agent through the attention network; together with the $(o,a)$ pair of the current agent as input, the Q value is obtained through an MLP. While the Q network is updated backward, the attention network is updated as well, and the attention scores of the surrounding agents for the current agent are corrected. Colight applies the attention mechanism to the TSC problem of large-scale road networks: it encodes the state and obtains the Q value directly through a Multi-head Attention network.

II-C1 Attention

Every agent interacts with its environment and obtains its observation in real time. We first embed the observation from the environment by applying a layer of Multi-Layer Perceptron (MLP):

z_i = W_l o_i + b \qquad (5)

To obtain the weight of intersection $i$ with respect to the adjacent intersection $j$, we combine their hidden variables $z_i, z_j$ with the following dot product:

e_{ij} = z_j^T W_k^T W_q z_i \qquad (6)

$e_{ij}$ represents the influence of the adjacent intersection $j$ on the current intersection $i$; note that the influences between them are not necessarily equal. We then normalize them with the softmax function:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})} \qquad (7)

The impact value $v_{ij}$ can be calculated as $v_{ij}=\alpha_{ij}z_j$, which is the value that $i$ needs to consider from $j$. Finally, summing these values and passing the result through the ReLU activation function, the final characterization of intersection $i$ is obtained:

h_i = \sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}z_j\right) \qquad (8)
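Equations (5)-(8) can be sketched in plain NumPy as below; the dimensions, the random initialization, and all names are assumptions for illustration.

```python
# Sketch of Eqs. (5)-(8) for one intersection i and its neighbors.
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_hidden = 12, 32                                    # assumed sizes
W_l = rng.normal(scale=0.1, size=(d_hidden, d_obs))
b = np.zeros(d_hidden)
W_q = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_k = rng.normal(scale=0.1, size=(d_hidden, d_hidden))

def embed(o):
    return W_l @ o + b                                      # Eq. (5)

def attention(o_i, neighbor_obs):
    z_i = embed(o_i)
    z_js = [embed(o_j) for o_j in neighbor_obs]
    e = np.array([z_j @ W_k.T @ W_q @ z_i for z_j in z_js]) # Eq. (6)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                    # Eq. (7), softmax
    h_i = np.maximum(0.0, sum(a * z_j for a, z_j in zip(alpha, z_js)))  # Eq. (8), ReLU
    return alpha, h_i

alpha, h_i = attention(rng.normal(size=d_obs),
                       [rng.normal(size=d_obs) for _ in range(4)])
```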

II-C2 $\gamma$-Attention-Reward

We use the attention mechanism to make the spatial differentiation function more complete and interpretable by adding an attention score before the sum operation:

R^i_t = r^i_t\cdot\left\{1+\gamma\cdot\tanh\left[\sum_{j\in\mathcal{N}_i}\hat{\alpha}_{ji}\left(\frac{R^j_{t+n}}{r^j_t}-c\right)\right]\right\} \qquad (9)

The attention score is updated together with the policies:

\mathcal{J}(\pi_\theta^i) = \left(R^i + \gamma'\max\hat{Q}(o^i, a^i; \hat{\theta}^i, \boldsymbol{\hat{\alpha}}^i) - Q(o^i, a^i; \theta^i, \boldsymbol{\alpha}^i)\right)^2 \qquad (10)

It is worth emphasizing that $\alpha_{ij}$ represents the importance of $j$ to $i$, but for $\gamma$-Reward we seek the influence of $i$ on $j$. So in Equations 9 and 10, we need to use $\alpha_{ji}$ to amend the rewards of agent $i$.

Since the attention score $\alpha_{ji}$ is a value updated in real time, this paper uses it as an evaluation metric to dynamically assess the impact of surrounding intersections based on dynamic traffic data. The reward is also an evaluation indicator: timely feedback obtained after performing an action in a particular state, used to evaluate the quality of the state-action pair. If we introduced $\alpha_{ij}$ into the calculation of the reward value and updated it in real time, a problem would be bound to arise: the Attention layer would update in a biased direction, reducing the weights of important neighbor intersections and increasing those of intersections with less traffic flow merely to raise the reward value. In this way, introducing the attention score would be even worse than the original $\gamma$-Reward. To solve this problem, we follow the Q-learning algorithm and use the off-policy idea: the target attention scores $\hat{\alpha}_{ij}$ used in the reward calculation come from target Attention layers, whose parameters are updated together with the target Q parameters $\hat{\theta}$. The whole framework of the proposed $\gamma$-Attention-Reward model is shown in Figure 5.
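A sketch of this eval/target split follows, assuming the networks are torch.nn.Module instances and a periodic hard sync; the class, the method names, and the sync interval are all illustrative assumptions.

```python
# Sketch: Eq. (9) reads frozen "target" attention scores, which are only
# refreshed together with the target Q-network. Names and interval are assumed.
TARGET_UPDATE_PERIOD = 200     # training steps between hard syncs (assumed)

class GammaAttentionAgent:
    def __init__(self, eval_attention, eval_q, target_attention, target_q):
        self.eval_attention, self.eval_q = eval_attention, eval_q
        self.target_attention, self.target_q = target_attention, target_q
        self.step = 0

    def scores_for_reward(self, obs_i, neighbor_obs):
        # \hat{alpha}: frozen between syncs, so the reward amendment cannot
        # bias the Attention layers toward artificially inflated rewards.
        alpha_hat, _ = self.target_attention(obs_i, neighbor_obs)
        return alpha_hat

    def maybe_sync_targets(self):
        # Hard-copy eval parameters into both target networks periodically
        # (assuming torch.nn.Module instances).
        self.step += 1
        if self.step % TARGET_UPDATE_PERIOD == 0:
            self.target_attention.load_state_dict(self.eval_attention.state_dict())
            self.target_q.load_state_dict(self.eval_q.state_dict())
```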

In Colight, a section is devoted to the selection of the hyper-parameter neighbor scope. Experiments there show that the larger $|\mathcal{N}_i|$ is, the better the performance, but when it is greater than 5, learning takes more time. This is because all intersections within the scope $|\mathcal{N}_i|$ must aggregate their observations into one agent, while the total number of agents still equals the number of intersections, which inevitably increases the amount of calculation. Unlike Colight, $\gamma$-Reward does not need to tune the neighbor number: we simply treat it as the constant 5, which contains the current intersection and the four intersections directly connected to it. Note that we do not need to employ Multi-head Attention, which leads to centralization; for the information of further intersections, the recursion in the $\gamma$-Reward formula does the work.

From the original $\gamma$-Reward part of Figure 5, we can figure out its principle. The rows of the figure represent the state information collected in time series, and the columns represent the intersections. To prevent too much confusion, only the $\gamma$-Reward process of intersection $I_{13}$ is shown as a demonstration, and the interval of the scale variable $n$ is also not shown. The value of $R_{13}^t$ is related to $R_{12}^{t+n}$ and $R_{14}^{t+n}$; by analogy, these two rewards are related to $R_{11}^{t+2n}$ and $R_{15}^{t+2n}$, respectively. Note that only a two-dimensional replay buffer is drawn here; a real intersection has four or more surrounding intersections, so it should be a three-dimensional gradient.

III Theoretical Analysis in Game Theory

In this section, we establish theoretical results for the proposed algorithms. The key challenge of MARL algorithms is that the decision process may be non-stationary[36][37]. Since an agent is not aware of the actions of other agents and lacks communication, the transition probabilities $\mathcal{P}(o_{t+1}^i|o_t^i\in\mathcal{O}_t^i, a_t^i\in\mathcal{A}_t^i)$ are not stationary and change as the other agents change. So we first need to demonstrate that the decision process is stationary under the proposed algorithms. Specifically, MARL can be regarded as a game model[38], and we prove that the proposed algorithms can converge to a Nash equilibrium. The argument starts with the following definitions.

Definition III.1

A decentralized MARL decision process is stationary (or homogeneous) iff, for each agent $i$ and all $p,q\in\mathbb{N}$, $o^i\in\mathcal{O}^i$, $\boldsymbol{a}^i\in\mathcal{A}^i$[36],

\sum_{\boldsymbol{a}_p^{-i}\in\boldsymbol{A}_p^{-i}}\mathcal{P}_i\left(o_{p+1}^i|o_p^i,\langle a_p^i,\boldsymbol{a}_p^{-i}\rangle\right) = \sum_{\boldsymbol{a}_q^{-i}\in\boldsymbol{A}_q^{-i}}\mathcal{P}_i\left(o_{q+1}^i|o_q^i,\langle a_q^i,\boldsymbol{a}_q^{-i}\rangle\right) \qquad (11)

and $\mathcal{P}$ for the global state $s\in\mathcal{S}$ must be stationary as well:

\sum_{\boldsymbol{a}_p^{-i}\in\boldsymbol{A}_p^{-i}}\mathcal{P}\left(s_{p+1}^i|s_p^i,\langle a_p^i,\boldsymbol{a}_p^{-i}\rangle\right) = \sum_{\boldsymbol{a}_q^{-i}\in\boldsymbol{A}_q^{-i}}\mathcal{P}\left(s_{q+1}^i|s_q^i,\langle a_q^i,\boldsymbol{a}_q^{-i}\rangle\right) \qquad (12)

where $\boldsymbol{a}^{-i}=\boldsymbol{a}\backslash\{a^i\}$.

Based on the definition of a stationary MDP, we can define the optimal global reward, which is needed to prove the convergence of the proposed methods, by extending the definition of [39].

Definition III.2

For a stationary MDP, the global optimal reward can be decomposed into a sum of local optimal rewards, one for each reward function $f_i\in\mathcal{F}$:

\rho^* = \sum_{i=1}^m \rho_i^* \qquad (13)

Proof:

For a given stationary MDP, there must exist a stationary $\mathcal{P}_i^{\pi_*^i}$, where $\pi_*^i$ is the optimal policy of agent $i$:

\rho_i^{\pi_*^i} = \sum_{o^i\in\mathcal{O}^i}\mathcal{P}_i^{\pi_*^i}(o^i)\, f_i\left(o^i, a^i \,|\, a^i = \pi_*^i(o^i)\right) \qquad (14)

Definition III.3

In a stochastic game, a Nash equilibrium point is a tuple of $m$ strategies $(\pi_*^1,\ldots,\pi_*^m)$ such that for all global states $s\in\mathcal{S}$ and $i=1,\dots,m$ [40],

\nu^i(s|\pi_*^1,\ldots,\pi_*^m) \leq \nu^i(s|\pi_*^1,\ldots,\pi^i,\pi_*^{i+1},\ldots,\pi_*^m) \qquad (15)

for all $\pi^i\in\Pi^i$, where $\Pi^i$ is the policy set of agent $i$ among the total $m$ agents.

III-A Stationarity of the $\gamma$-Reward series

First, we give the proof of stationarity: unless the MDP is stationary, the model cannot guarantee convergence to the optimum.

Assumption III.1

The original reward function can be represented as

r_t^i = f(\boldsymbol{o}_t^i, \boldsymbol{a}_t^i) \qquad (16)

where $\boldsymbol{o}^i,\boldsymbol{a}^i$ include the state-action pairs from time step $1$ to $t$. Assume that the $\gamma$-Reward series is a special reward function $f(o,a)$:

R_t^i = f\left(\langle\boldsymbol{o}_t^i,\boldsymbol{o}_k^{-i}\rangle, \langle\boldsymbol{a}_t^i,\boldsymbol{a}_k^{-i}\rangle\right) \qquad (17)

where $k\in[1,\mathbb{N}]$.

Assumption III.2

As a continuing task without a definite ending, the expected reward $\mathcal{G}_t^i$ of the TSC problem is defined as follows, with discount rate $\gamma'$:

\mathcal{G}_t \doteq r_{t+1} + \gamma' r_{t+2} + (\gamma')^2 r_{t+3} + \cdots = \sum_{k=0}^{\mathbb{N}-t}(\gamma')^k r_{t+k+1} \qquad (18)

Assumption III.3

The Q function is based on the trajectory of the expected return:

Q(o,a|\pi) = \sum\mathcal{P}(path_{o,a}|\pi)\cdot\mathcal{G}(path_{o,a}) \qquad (19)
a_t = \operatorname{argmax}\, Q_{\max}(o,a|\pi) \qquad (20)
Theorem III.1

With $\gamma$-Reward as the coordination mechanism, the decision process $(\mathcal{RP})$ of the distributed DQN algorithm is stationary.

Proof:

According to Assumption III.2 and Assumption III.3, the Q value function with $\gamma$-Reward, $(\mathcal{RQ})$, can be written as follows:

(\mathcal{RQ})_t^i(o,a|\pi) = \sum_{t=0}^{\mathbb{N}}(\mathcal{RP})_t^i(path_{o^i,a^i}|\pi)\cdot(\mathcal{RG})_i(path_{o^i,a^i}) = Q\left(\langle\boldsymbol{o}_t^i,\boldsymbol{o}_k^{-i}\rangle, \langle\boldsymbol{a}_t^i,\boldsymbol{a}_k^{-i}\rangle \,|\, \pi\right) \qquad (21)

The calculation of $(\mathcal{RQ})$ is related to the expected reward path. $(\mathcal{RG})_t^i$ is a discounted sum of the amendatory reward $R_t^i$, which is bound up not only with $(o^i,a^i)$ but also with $(o^{-i},a^{-i})$ from the other agents. Since $(\mathcal{RG})_t^i$ records the future trajectory of the amendatory reward from time step $t$ to the end of the episode, it must contain the previous and posterior state-action pairs as a vector, as in Equation 17 of Assumption III.1. Using these two properties, we can decompose $(\mathcal{RP})$ with Equation 20:

(\mathcal{RP})_t^i(o_{t+1}^i|o_t^i, a_t^i) = \mathcal{P}_t^i\left(o_{t+1}^i|o_t^i, \operatorname{argmax}\,(\mathcal{RQ})_{t\max}^i(\boldsymbol{o},\boldsymbol{a})\right) = \mathcal{P}_t^i\left(o_{t+1}^i|o_t^i, \langle a_t^i, \boldsymbol{a}_k^{-i}\rangle\right) \qquad (22)

where $k\in[1,\mathbb{N}]$. Obviously, $(\mathcal{RP})$ satisfies the property in Definition III.1. Thus, the process with the $\gamma$-Reward series is a stationary process.

III-B Convergence of the $\gamma$-Reward series

Theorem III.2

The spatial differentiation formula in the original $\gamma$-Reward leads the reward value function to converge to the local optimal reward.

Proof:

As mentioned in Section II-B, our Q-value function shares a similar structure with $\mathcal{QD}$-learning, so the proof is inspired by the convergence analysis of $\mathcal{QD}$-learning. Since this article is not mainly about multi-agent theory, we simply transform $\gamma$-Reward into the form used in the $\mathcal{QD}$-learning convergence proof; by following the rest of the proof process in $\mathcal{QD}$-learning, we obtain the conclusion of convergence. We highly recommend that readers consult the whole analysis in $\mathcal{QD}$-learning.

According to Spectral Graph Theory, the multi-agent communication network is simplified to an undirected graph $G=(V,E)$, where $V$ represents the intersections and $E$ denotes the communication links. From the adjacency matrix $A$ ($A_{ij}=1$ if $(i,j)\in E$, $A_{ij}=0$ otherwise) and the degree matrix $D=\text{diag}\{d_1,\dots,d_N\}$ ($d_i=|\mathcal{N}_i|$), we obtain the positive semidefinite graph Laplacian matrix $L=D-A$, with eigenvalues ordered as $0=\lambda_1(L)\leq\lambda_2(L)\leq\cdots\leq\lambda_N(L)$. Letting $\beta=-\alpha\gamma$ and noting that $\beta\tanh[\sum_{j\in\mathcal{N}_i}(\frac{R^j_{t+n}}{r^j_t}-c)]=-\beta\tanh[\sum_{j\in\mathcal{N}_i}(c-\frac{R^j_{t+n}}{r^j_t})]$, we have

\boldsymbol{Q}_{\boldsymbol{o},\boldsymbol{a}}(t+1) = (I_N - \beta L_t + \alpha I_N)\boldsymbol{r}_t + \alpha\left(\mathcal{C}_{\boldsymbol{o},\boldsymbol{a}}(\boldsymbol{Q}_t) - \boldsymbol{Q}_{\boldsymbol{o},\boldsymbol{a}}(t) + \boldsymbol{\nu}_t\right) \qquad (23)

where $\boldsymbol{Q}_{\boldsymbol{o},\boldsymbol{a}}=[Q_{o,a}^1(t),\dots,Q_{o,a}^N(t)]^T$ and $\mathcal{C}_{\boldsymbol{o},\boldsymbol{a}}(\boldsymbol{Q}_t)=[\mathcal{C}_{o,a}^1(Q_t^1),\dots,\mathcal{C}_{o,a}^N(Q_t^N)]^T$. The local $\gamma$-Reward operator $\mathcal{C}^i$ of agent $i$ is

\mathcal{C}_{o,a}^i(Q) = \mathbb{E}(r_t^i) + \gamma'\sum_{j\in\mathcal{N}}p_{i,j}^a\max_{v\in\mathcal{A}}Q_{j,v} \qquad (24)

The residual $\nu_t^i$ for each agent $i$ in Equation 23 is

\nu_t^i(o,a) = \left(1-\frac{1}{\alpha}\right)r_t^i + \gamma'\max_{v\in\mathcal{A}}Q_{o_{t+1},v}^i(t) - \mathcal{C}_{o,a}^i(Q) \qquad (25)

By following the proof in Section V of $\mathcal{QD}$-learning, we obtain the same main result: for each agent $i$,

\mathbb{P}\left(\lim_{t\rightarrow\infty}\boldsymbol{Q}_t^i = \boldsymbol{Q}^*\right) = 1 \qquad (26)

We conclude that the spatial differentiation formula leads $\gamma$-Reward to converge to the local optimum. Obviously, $\gamma$-Attention-Reward has the same property.
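As a small illustration of the graph quantities used above, the following sketch builds $A$, $D$, and $L=D-A$ for a $3\times 3$ grid road network and checks that the smallest Laplacian eigenvalue is zero; the grid construction is an assumption for illustration.

```python
# Adjacency A, degree D, Laplacian L = D - A for a 3x3 grid road network,
# and its ordered eigenvalues (the smallest one is ~0).
import numpy as np

def grid_adjacency(rows, cols):
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                A[i, i + 1] = A[i + 1, i] = 1          # east-west link
            if r + 1 < rows:
                A[i, i + cols] = A[i + cols, i] = 1    # north-south link
    return A

A = grid_adjacency(3, 3)
D = np.diag(A.sum(axis=1))           # d_i = |N_i|
L = D - A
eigvals = np.sort(np.linalg.eigvalsh(L))
print(eigvals[:2])                    # the first eigenvalue is ~0
```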

III-C Optimality of the $\gamma$-Reward model

For a stochastic game with multiple agents, the optimal point of the whole system is a Nash equilibrium point, as declared in Definition III.3. $\nu^i(s|\pi_*^1,\ldots,\pi_*^m)$ can be interpreted as the discounted expected reward $\mathcal{G}$. According to the previous two theorems, we can draw the following conclusion.

Theorem III.3

With the $\gamma$-Reward model, the multi-agent system converges to a Nash equilibrium point.

Proof:

First, we give the definition of the optimal $\nu^*$ using Definition III.2:

\nu^i(s|\pi_*^1,\ldots,\pi_*^m) = \rho^* = \sum_{i=1}^m\rho_i^* \qquad (27)

From the convergence proof, we can see that the local optimal reward of $R^i$ depends on the local optima $\rho_j^*$ of the other agents. In other words, if any agent does not converge to its local optimal reward, the others will not be optimal either. From this, we conclude that $\gamma$-Reward forces agents to care about the others and lets the whole system finally converge to a Nash equilibrium point. ∎

IV Experiment

We conduct experiments on CityFlow, an open-source traffic simulator that supports large-scale traffic signal control[41], rather than the commonly used SUMO simulator[42], since it is more than twenty times faster than SUMO. Moreover, we use the Ray[43] framework for the DRL algorithms, an open-source reinforcement learning library that offers both high scalability and a unified API for a variety of applications.

IV-A Datasets

In the experiments, we use both synthetic data and real-world data. For convenience, we share the same real-world datasets as Colight. Each dataset mainly includes two parts, roadnet and flow. Roadnet describes the number of intersections in the road network, their coordinates, and the number of lanes of each intersection. Flow lists thousands of vehicles, each with its own properties, such as length, width, maximum acceleration, maximum speed and, most importantly, trajectory. The experiments use real-world data from Hangzhou and Jinan in China and from New York in the USA. Meanwhile, we use two kinds of synthetic data, arterial and grid. Their characteristics are summarized in Table I. $Grid_{3\times 3}uni$ carries one-way traffic, and $Grid_{3\times 3}bi$ carries two-way traffic on the same road network.

TABLE I: Situation of Datasets
DataSet | Intersections | Arrival rate (vehicles/300s): Mean | Std | Max | Min
$Arterial_{1\times 6}$† | 6 | 300 | - | - | -
$Grid_{3\times 3}uni$† | 9 | 300 | - | - | -
$Grid_{3\times 3}bi$† | 9 | 300 | - | - | -
$NewYork_{16\times 3}$ | 48 | 240.79 | 10.08 | 274 | 216
$Jinan_{3\times 4}$ | 12 | 526.63 | 86.70 | 676 | 256
$Hangzhou_{4\times 4}$ | 16 | 250.79 | 38.21 | 335 | 208
† Traffic flow in the synthetic data is uniform, so there is no need to report the other three values.

Figure 6: Evaluation performance compared among the baselines, the proposed $\gamma$-Reward, and $\gamma$-Attention-Reward.

IV-B Baseline Methods

In Section II, we already introduced methods for TSC, including traditional rule-based methods and learning-based methods. The most primitive rule-based methods are still the most common ones today; as a mature rule-based method, Max-Pressure serves as a representative.

Learning-based methods have prospered with the development of deep learning and data science in recent years. They are characterized by the use of large-scale data to approximate optimal strategies through iterative learning. We chose several methods as baselines:

  • IQL: Since $\gamma$-Reward is based on the decoupling idea of IQL and adds a coordination mechanism, comparing against IQL demonstrates the impact of the coordination mechanism. Here we use the original D3QN, as in [26], for comparison.

  • QMIX: A sophisticated MARL algorithm that integrates all agents into the same model and concentrates on learning a joint action-value function. As a typical one-model method, comparing it with $\gamma$-Reward shows the respective advantages and disadvantages of joint and independent learning for coordination.

  • Colight: Unlike $\gamma$-Reward, Colight is more like QMIX, but it does not learn a joint action. It uses Attention layers to encode the surrounding observations in place of the observation of the current intersection; thanks to its full collection of observations, it can apply Multi-head Attention, and in this way coordination between agents is realized. $\gamma$-Attention-Reward makes some improvements on this basis: the replay buffer amendment introduces the effect of the current intersection on the surrounding intersections, replacing the hyperparameter $|\mathcal{N}_i|$ in Colight and adding consideration of the impact of actions in both time and space.

IV-C Evaluation Performance

Figure 6 and Table II show the performance comparison among $\gamma$-Reward, the more comprehensive $\gamma$-Attention-Reward, and the baselines. Each model is trained for 100 iterations, and each iteration runs 3600 time steps in the simulator. Each action lasts at least 10 seconds to avoid impractically rapid phase switching. The delay span is $n=10\,s$ and the threshold is $c=0.8$.
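For reference, the hyperparameters stated in this section (plus the $\gamma=0.5$ chosen in Section IV-D) can be gathered into one illustrative configuration; the key names and the temporal discount value are assumptions, not the authors' code.

```python
# Experiment settings from this section, as an illustrative config dict.
EXPERIMENT_CONFIG = {
    "iterations": 100,            # training iterations per model
    "steps_per_iteration": 3600,  # simulated time steps per iteration
    "min_action_duration": 10,    # seconds a chosen phase stays green at least
    "delay_span_n": 10,           # seconds before neighbor rewards are read back
    "threshold_c": 0.8,           # capacity-change threshold in Eq. (2)
    "gamma_spatial": 0.5,         # spatial discount, chosen in Section IV-D
    "gamma_prime": 0.95,          # temporal discount (assumed value)
}
```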

We use the average travel time of vehicles to evaluate the performance of the models, which is the standard evaluation metric in the field of TSC.

The performance of the learning-based methods is significantly better than that of rule-based Max-Pressure (Table II), which has been widely shown in many studies.

Among the independent-learning DRL models, the performance of the $\gamma$-Reward series far exceeds that of the IQL model, demonstrating the importance of coordination between agents for global performance improvement.

It is worth noting that, in all road networks, the independent-learning DRL models show stronger convergence than the one-model QMIX. This is probably because, for a single model, excessive dimensionality makes the policy more difficult to learn. Colight, however, does not diverge and achieves excellent results, which may be because it does not make joint decisions through a joint action function but shares network parameters, so that all agents generate independent actions. Since the agents share model parameters, they inevitably ignore individual differences and sacrifice some performance. Compared with the proposed model, intense oscillations sometimes occur in Colight during training; this is also a result of parameter sharing, since once the model iterates in the wrong direction, it misleads all agents and leads to severe congestion across the whole road network. The performance of the proposed methods is even better than Colight, which means sharing sensation is not the only way to realize coordination: sharing results can also help a single intersection pay attention to the whole road network.

On real-world road networks, $\gamma$-Reward and $\gamma$-Attention-Reward do not show a large difference. The reason is that all the real-world road networks we used are two-way roads. We study the attention scores later, as shown in Figure 9 and Figure 10; the attention mechanism shows its effect on specific synthetic road networks. Comparing $Grid_{3\times 3}bi$ and $Grid_{3\times 3}uni$ in Figure 6, the Attention layers distinguish one-way from two-way traffic and help the agents achieve better performance. This is also described later when revealing the details of the Attention layers.

TABLE II: Performance Comparison (Average Travel Time)
Model | $Grid_{3\times 3}bi$ | $Grid_{3\times 3}uni$ | $NewYork_{16\times 3}$ | $Jinan_{3\times 4}$ | $Hangzhou_{4\times 4}$
Max-Pressure | 204.72 | 186.06 | 405.69 | 359.81 | 431.53
IQL | 191.05 | 157.51 | 248.46 | 371.74 | 406.27
QMIX | 565.70 | 619.32 | 216.56 | 571.78 | 587.46
Colight | 104.89 | 100.96 | 169.66 | 301.78 | 311.15
$\gamma$-Reward | 96.44 | 175.38 | 162.18 | 303.97 | 304.90
$\gamma$-Attention-Reward | 96.14 | 93.93 | 141.16 | 286.27 | 284.24

IV-D Study of hyperparameter $\gamma$

Figure 7: Study of hyperparameter $\gamma$; dark lines represent results after smoothing, light lines represent the deviation of the raw results.

We use the $Arterial_{1\times 6}$ road network to study the impact of different $\gamma$ values. We chose five values, [0.1, 0.3, 0.5, 0.7, 0.9], and compared the results. As shown in Figure 7, 0.5 appears to be a balance point for the penalty item, so we set the hyperparameter $\gamma$ to 0.5 in all experiments.

IV-E Visualization of Proposed method

The core idea of the $\gamma$-Reward algorithm is to correct the reward in the replay buffer. In this section, we use the $Arterial_{1\times 6}$ road network as an example to show how the reward values of different intersections affect each other, and we compare the $Grid_{3\times 3}uni$ and $Grid_{3\times 3}bi$ road networks to demonstrate the role of the attention mechanism in the $\gamma$-Reward improvement.

Figure 8: Visualization of the reward change caused by spatial differentiation; $r^i$ represents the raw reward from D3QN, and $R^i$ represents the reward after amendment.

IV-E1 Visualization of Spatial Differentiation Formula

In Figure 8, dashed lines represent the original reward, and solid lines represent the corrected reward. Since a linear road network is selected, each intersection has no more than two neighbors, so it is easier to observe the influence between intersections. For clarity, we select only the first two intersections.

Observe the red line in the light yellow area, which represents the reward of the second intersection. The solid line in this area is almost always above the dashed line, meaning that the rewards become better after amendment. The reason is shown in the light blue area: the dark blue and pink lines rise, which means the traffic situation improves at both intersection $I_{11}$ and intersection $I_{13}$. We believe that part of the improvement at these two intersections is contributed by intersection $I_{12}$. The lag between the yellow and blue areas corresponds to the delay span $n$.

From Figure 8, we can see that in actual training the $\gamma$-Reward framework introduces, as expected, the future changes of nearby intersections.

IV-E2 Effect of Attention Mechanism

Figure 9: Attention scores of intersection $I_{22}$, selected from the $Hangzhou_{4\times 4}$ road network.

Figure 9 shows attention scores from the Hangzhou road network. We find that, except for the current intersection $I_{22}$, the scores decline and tend toward the same value; therefore, attention does not play an important role in these real-world datasets. The reason may be that all of these roads are two-way and have the same number of lanes. To highlight its effect, we need to compare one-way and two-way traffic in the same road network, so we use $Grid_{3\times 3}uni$ and $Grid_{3\times 3}bi$ for visualization.

Figure 10: Attention layers learn the differences of the surroundings from the traffic flow. (a) Attention scores of intersection $I_{22}$ in $Grid_{3\times 3}bi$; (b) attention scores of intersection $I_{22}$ in $Grid_{3\times 3}uni$. The two road networks are compared to show the effect of the attention layers.

Figure 10 shows the comparison between the one-way and two-way $3\times 3$ networks. Comparing Figure 10(a) with Figure 10(b), the score change of intersection $I_{22}$ in Figure 10(a) is nearly equal across the surrounding intersections $I_{12}$, $I_{21}$, $I_{23}$, and $I_{32}$, while in Figure 10(b) the intersections in the exiting direction clearly hold a commanding edge. This means that the attention mechanism does have a significant effect in understanding the structure of the road network. With the attention mechanism, the reward amendment is more concerned with the results of the agent's own actions, which is crucial in the revision of the reward.

V Discussion about Real World Application

Figure 11: Training time of the $\gamma$-Reward series and Colight. Note that, in order to show the difference between centralized and decentralized algorithms, we report the training time of one agent for the $\gamma$-Reward series.

TSC is a real task that runs without a simulator, and the proposed control plan should aim at solving practical problems. Therefore, it is necessary to take into account the limitations that may exist in practice.

V-A Scalability

Scalability is an important criterion for judging whether a TSC method is meaningful. The proposed methods are both decentralized, so each D3QN agent can be deployed on an embedded device at its own intersection. Neither the training time nor the inference time of our methods depends on the number of intersections. Besides, message transmission among neighboring intersections takes only tolerable time, and all intersections communicate in parallel. Due to limited computing capacity, larger-scale experiments could not be conducted; instead, we analyze scalability by collecting the training time on datasets of different scales, as shown in Figure 11. As the number of intersections increases, the time cost of Colight grows significantly. In contrast, our methods keep a shorter training time thanks to parallel computation. Another advantage is that when the road network structure changes, we only need to train the newly added intersections separately, with a tiny impact on the existing ones. Therefore, the proposed methods are believed to be scalable.

V-B Real-time

Real-time traffic communication may suffer from communication delay, information security issues, and the risk of packet loss[44]. Therefore, it is necessary to decouple the calculation of the whole road network; decentralization is becoming the trend in solutions to the TSC problem[29][45]. Colight, by using Multi-head Attention and parameter sharing, summarizes the relationships of the global road network and reduces the time and space complexity of training. However, applying it in practice requires first transmitting the real-time global road network information to a central server for calculation, and the server then has to dispense the resulting actions to each intersection. Note that this means transmitting not just the actions of each traffic signal, but also their sensor observations. In contrast, communication in our methods occurs within a small neighborhood, and the messages needed are only a few bits. These attributes help the intersections keep in touch in real time.

V-C Further applications

The coordination mechanism in this paper can be extended to other multi-agent tasks that face scalability and real-time problems. For instance, power dispatching in power networks can be described in the same way as the TSC problem: power transmission is similar to the traffic flow between two neighboring intersections. Furthermore, the control of a multi-joint serial robot may be regarded as a multi-agent task, since actions of a joint near the base have consequences for the other joints and the end-effector. We can use the coordination mechanism to account for these effects on neighboring agents or joints and achieve globally optimal performance. There are already broad applications of this kind of coordination mechanism in engineering systems such as sensor networks[46], smart grids[47], and robotics[48].

VI Conclusion

In this paper, we propose the $\gamma$-Reward method and its variant $\gamma$-Attention-Reward, which introduces the attention mechanism, to solve the problem of intelligent traffic signal control. By extending the Markov Chain to the space-time domain, the methods become a scalable solution for TSC. In particular, we give a concise proof based on the spatial differentiation formula which shows that the two frameworks can converge to a Nash equilibrium. We conduct extensive experiments using both real-world and synthetic data, which confirm that our proposed methods outperform state-of-the-art methods. On asymmetric road networks, $\gamma$-Attention-Reward shows better results than $\gamma$-Reward thanks to the added attention mechanism. Moreover, we thoroughly investigate the effect of the reward amendment and the attention mechanism in achieving coordination. Compared with the recently proposed Colight, the $\gamma$-Reward series replaces graph attention with recursion, decouples the calculation of the whole road network, and is more suitable for practical applications.

Acknowledgment

This project is derived from the 46th project of DeeCamp 2019, which won the Best Technology Award, and we thank our teammates in “Multi-Commander” (our DeeCamp 2019 project is open-source on GitHub: https://github.com/multi-commander/Multi-Commander). Thanks to the practice platform provided by Dr. Kai-Fu Lee and the support of the APEX Lab of Shanghai Jiao Tong University. We also gratefully acknowledge the support from the National Natural Science Foundation of China (Grant No. 61973210) and the projects of SJTU (Grant Nos. YG2019ZDA17 and ZH2018QNB23).

References

  • [1] H. R. TO and M. M. Barker, “White paper european transport policy for 2010: time to decide,” 2001.
  • [2] X. Liang, X. Du, G. Wang, and Z. Han, “Deep reinforcement learning for traffic light control in vehicular networks,” arXiv preprint arXiv:1803.11115, 2018.
  • [3] H. Wei, G. Zheng, H. Yao, and Z. Li, “Intellilight: A reinforcement learning approach for intelligent traffic light control,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2496–2505, ACM, 2018.
  • [4] H. Wei, G. Zheng, V. V. Gayah, and Z. Li, “A survey on traffic signal control methods,” CoRR, vol. abs/1904.08117, 2019.
  • [5] P. Varaiya, “The max-pressure controller for arbitrary networks of signalized intersections,” in Advances in Dynamic Network Modeling in Complex Transportation Systems, pp. 27–66, Springer, 2013.
  • [6] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” in Innovations in multi-agent systems and applications-1, pp. 183–221, Springer, 2010.
  • [7] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
  • [8] H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, and Z. Li, “Colight: Learning network-level cooperation for traffic signal control,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1913–1922, 2019.
  • [9] P. Koonce and L. Rodegerdts, “Traffic signal timing manual.,” tech. rep., United States. Federal Highway Administration, 2008.
  • [10] J. Török and J. Kertész, “The green wave model of two-dimensional traffic: Transitions in the flow properties and in the geometry of the traffic jam,” Physica A: Statistical Mechanics and its Applications, vol. 231, no. 4, pp. 515–533, 1996.
  • [11] S. S. Mousavi, M. Schukat, and E. Howley, “Traffic light control using deep policy-gradient and value-function-based reinforcement learning,” IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
  • [12] Y. Xiong, G. Zheng, K. Xu, and Z. Li, “Learning traffic signal control from demonstrations,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2289–2292, 2019.
  • [13] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, and Z. Li, “Learning phase competition for traffic signal control,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1963–1972, 2019.
  • [14] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
  • [15] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the tenth international conference on machine learning, pp. 330–337, 1993.
  • [16] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning, pp. 4295–4304, 2018.
  • [17] E. Van der Pol and F. A. Oliehoek, “Coordinated deep reinforcement learners for traffic light control,” Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), 2016.
  • [18] T. Nishi, K. Otaki, K. Hayakawa, and T. Yoshimura, “Traffic signal control based on reinforcement learning with graph convolutional neural nets,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 877–883, IEEE, 2018.
  • [19] H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, and Z. Li, “Presslight: Learning max pressure control to coordinate traffic signals in arterial network,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1290–1298, 2019.
  • [20] Wikipedia contributors, “Lane — Wikipedia, the free encyclopedia,” 2019. [Online; accessed 1-November-2019].
  • [21] A. Stevanovic, Adaptive traffic control systems: domestic and foreign state of practice. No. Project 20-5 (Topic 40-03), 2010.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [23] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Thirtieth AAAI conference on artificial intelligence, 2016.
  • [24] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning, pp. 1995–2003, 2016.
  • [25] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in AAAI, 2018.
  • [26] X. Liang, X. Du, G. Wang, and Z. Han, “A deep reinforcement learning network for traffic light cycle control,” IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, 2019.
  • [27] P. Varshavskaya, L. P. Kaelbling, and D. Rus, “Efficient distributed reinforcement learning through agreement,” in Distributed Autonomous Robotic Systems 8, pp. 367–378, Springer, 2009.
  • [28] S. Kar, J. M. Moura, and H. V. Poor, “$\mathcal{QD}$-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1848–1862, 2013.
  • [29] W. Liu, G. Qin, Y. He, and F. Jiang, “Distributed cooperative reinforcement learning-based traffic signal control that integrates v2x networks’ dynamic clustering,” IEEE Transactions on Vehicular Technology, vol. 66, no. 10, pp. 8667–8681, 2017.
  • [30] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [31] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.
  • [33] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [34] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
  • [35] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” arXiv preprint arXiv:1810.02912, 2018.
  • [36] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2681–2690, JMLR. org, 2017.
  • [37] G. J. Laurent, L. Matignon, L. Fort-Piat, et al., “The world of independent learners is not markovian,” International Journal of Knowledge-based and Intelligent Engineering Systems, vol. 15, no. 1, pp. 55–64, 2011.
  • [38] R. Aragon-Gómez and J. B. Clempner, “Traffic-signal control reinforcement learning approach for continuous-time markov games,” Engineering Applications of Artificial Intelligence, vol. 89, p. 103415, 2020.
  • [39] D. T. Nguyen, W. Yeoh, H. C. Lau, S. Zilberstein, and C. Zhang, “Decentralized multi-agent reinforcement learning in average-reward dynamic dcops,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • [40] J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
  • [41] H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, and Z. Li, “Cityflow: A multi-agent reinforcement learning environment for large scale city traffic scenario,” in The World Wide Web Conference, pp. 3620–3624, ACM, 2019.
  • [42] D. Krajzewicz, “Traffic simulation with sumo–simulation of urban mobility,” in Fundamentals of traffic simulation, pp. 269–293, Springer, 2010.
  • [43] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al., “Ray: A distributed framework for emerging ai applications,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577, 2018.
  • [44] B. C. Ezell, R. M. Robinson, P. Foytik, C. Jordan, and D. Flanagan, “Cyber risk to transportation, industrial control systems, and traffic signal controllers,” Environment Systems and Decisions, vol. 33, no. 4, pp. 508–516, 2013.
  • [45] T. Chu, S. Qu, and J. Wang, “Large-scale traffic grid signal control with regional reinforcement learning,” in 2016 American Control Conference (ACC), pp. 815–820, IEEE, 2016.
  • [46] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd international symposium on Information processing in sensor networks, pp. 20–27, 2004.
  • [47] K. Zhang, W. Shi, H. Zhu, E. Dall’Anese, and T. Başar, “Dynamic power distribution system management with a locally connected communication network,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 4, pp. 673–687, 2018.
  • [48] P. Corke, R. Peterson, and D. Rus, “Networked robots: Flying robot navigation using a sensor net,” in Robotics research. The eleventh international symposium, pp. 234–243, Springer, 2005.
  • [49] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, no. 1, pp. 9–44, 1988.
  • [50] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
  • [51] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.

Appendix

VI-A Fundamental DRL algorithms

VI-A1 Temporal-Difference Learning

Temporal-Difference (TD) learning, proposed by Sutton[49], combines ideas from Dynamic Programming (DP) and Monte Carlo (MC) methods and forms the core of DRL. Like Monte Carlo methods, it does not require a model of the environment and can learn directly from experience. On the other hand, it inherits bootstrapping from DP, which is the distinctive feature of TD learning: predictions are used as targets during learning[50]. Monte Carlo simulates (or experiences) an entire episode before estimating the value of each visited state from the observed return. In contrast, TD learning updates after every action (or after a few steps), estimating the value of the current state from the immediate reward and the estimated value of the successor state.

The Q-learning algorithm [51] is a groundbreaking off-policy algorithm built on TD learning. Its TD error is

\delta_{t}=r_{t+1}+\gamma^{\prime}\max_{a^{\prime}}Q\left(o_{t+1},a^{\prime}\right)-Q\left(o_{t},a_{t}\right)

where δ_t denotes the TD error.
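For reference, a minimal tabular Q-learning sketch that applies this TD error as an update. The tiny chain environment, the action semantics, and the hyper-parameter values are illustrative assumptions, not part of the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed values, for illustration only
ACTIONS = [0, 1]                        # e.g., keep phase / switch phase

Q = defaultdict(lambda: [0.0, 0.0])     # Q[o][a], initialized to zero

def td_update(o, a, r, o_next):
    """One Q-learning step: delta = r + gamma * max_a' Q(o', a') - Q(o, a)."""
    delta = r + GAMMA * max(Q[o_next]) - Q[o][a]
    Q[o][a] += ALPHA * delta
    return delta

def epsilon_greedy(o):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[o][a])

def step(o, a):                          # stand-in for a real simulator
    o_next = min(o + 1, 2) if a == 1 else o
    r = 1.0 if o_next == 2 else 0.0
    return o_next, r

o = 0
for t in range(100):
    a = epsilon_greedy(o)
    o_next, r = step(o, a)
    td_update(o, a, r, o_next)
    o = 0 if o_next == 2 else o_next     # restart the toy episode at the goal state
```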

VI-A2 Deep Q-network

Deep Q-network (DQN) is a powerful off-policy algorithm that has achieved excellent results in many fields since 2015[22]. It uses a neural network to approximate the Q-value function instead of a lookup table.

Q\left(o_{t},a_{t}\right)\leftarrow Q\left(o_{t},a_{t}\right)+\alpha\left[y_{t+1}-Q\left(o_{t},a_{t}\right)\right]
y_{t+1}=r_{t+1}+\gamma^{\prime}\max_{a^{\prime}}Q\left(o_{t+1},a^{\prime}\right)

VI-A3 Double Deep Q-network

Q-learning uses the max operator to select the best action, which causes a maximization bias problem. [23] addressed this by designing Double Q-learning, which only differs in the calculation of the target Q value:

y_{t+1}=r_{t+1}+\gamma^{\prime}Q^{\prime}\left(\phi\left(o^{\prime}\right),\arg\max_{a^{\prime}}Q\left(\phi\left(o^{\prime}\right),a^{\prime},w\right),w^{\prime}\right)
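The difference between the two targets is easiest to see in code. The following PyTorch-style sketch computes both the vanilla DQN target and the Double DQN target for a batch of transitions; the function and variable names are illustrative assumptions.

```python
import torch

def dqn_targets(q_net, target_net, rewards, next_obs, gamma):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    with torch.no_grad():
        next_q = target_net(next_obs)                        # [batch, num_actions]
        return rewards + gamma * next_q.max(dim=1).values

def double_dqn_targets(q_net, target_net, rewards, next_obs, gamma):
    """Double DQN: the online network selects a', the target network evaluates it."""
    with torch.no_grad():
        best_a = q_net(next_obs).argmax(dim=1, keepdim=True)        # argmax_a' Q(o', a'; w)
        eval_q = target_net(next_obs).gather(1, best_a).squeeze(1)  # Q'(o', a'; w')
        return rewards + gamma * eval_q
```

Decoupling action selection from action evaluation is what removes the overestimation caused by taking the max of noisy value estimates.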

VI-A4 Dueling Deep Q-network

Another improvement to DQN is Dueling DQN [24], which decomposes the Q network into two separate streams: a state-value function V(o) and a state-dependent action advantage function A(o,a). These two streams are combined into an estimate of the Q function through a special aggregating layer.

Q(o,a;\theta,\alpha,\beta)=V(o;\theta,\beta)+\left(A(o,a;\theta,\alpha)-\frac{1}{|\mathcal{A}|}\sum_{a^{\prime}}A\left(o,a^{\prime};\theta,\alpha\right)\right)
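Below is a minimal PyTorch-style dueling head implementing this aggregation; the layer sizes and example dimensions are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(o, a) = V(o) + (A(o, a) - mean over a' of A(o, a'))."""
    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)                 # V(o; theta, beta)
        self.advantage_stream = nn.Linear(hidden, num_actions)   # A(o, a; theta, alpha)

    def forward(self, obs):
        h = self.shared(obs)
        value = self.value_stream(h)                      # [batch, 1]
        advantage = self.advantage_stream(h)              # [batch, num_actions]
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Example: 8 observation features (e.g., queue lengths), 4 signal phases.
q_net = DuelingQNetwork(obs_dim=8, num_actions=4)
q_values = q_net(torch.rand(32, 8))    # shape [32, 4]
```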

VI-B Pseudocode for the γ-Reward series

Algorithm 1 γ-Attention-Reward Algorithm for MARL Traffic Signal Control
1: Initialize E parallel environments with N agents
2: Initialize replay memory D to capacity N_D
3: Initialize raw replay memory D_r to capacity N_D + n
4: Initialize action-value function Q with random weights θ
5: Initialize target action-value function Q̂ with weights θ⁻ = θ
6: Initialize attention scores α_{i,j}
7: T_update ← 0
8: for episode = 1, M do
9:     Reset environments, and get initial observation o_i for each agent i
10:     for t = 1, T do
11:         Select action a_i ∼ π_i(·|o_i) for each agent i in each environment e
12:         Send actions a_i to all parallel environments and get o_i′, r_i for all agents
13:         Store (a_i, o_i, r_i, o_i′) in D_r
14:         T_update ← T_update + N
15:         if T_update ≥ min steps per update + n then
16:             ReplayBufferAmendment(D_r, α_{i,j})
17:             Update policies: sample a batch B from D; calculate a_{1…N}^B ∼ π_i^θ̄(o_i′^B) for i ∈ 1…N; calculate Q_i^ψ(o_{1…N}^B, a_{1…N}^B) for all i in parallel; update policies using ∇_{θ,i}𝒥(π_θ) and Adam
18:             Update target Q parameters: θ̂ ← θ
19:             Update target attention parameters: α̂ ← α
20:             T_update ← 0
21:         end if
22:     end for
23: end for
Algorithm 2 Replay Buffer Amendment
1: function ReplayBufferAmendment(D_r, α_{i,j})
2:     for i = 1, N do
3:         j = len(D_r) − n
4:         while j > ex_index do
5:             (o_i, a_i, r_i, o_i′) = D_{r,i}(j)
6:             R_i = γ-Attention-Reward function(r_i, α_{i,j})
7:             Store (o_i, a_i, R_i, o_i′) in D
8:             j = j − 1
9:         end while
10:         ex_index = len(D_r) − n
11:     end for
12: end function
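For readers who prefer code, here is a minimal Python sketch of the buffer traversal in Algorithm 2. The amendment itself is left as a placeholder (a hypothetical gamma_attention_reward), since its exact form is given in the main text; the buffer layout, the use of neighbors' rewards n steps later, and all names are illustrative assumptions.

```python
def gamma_attention_reward(r_i, future_neighbor_rewards, attention_weights, gamma=0.8):
    """Placeholder for the spatial-differentiation amendment: mixes the raw reward
    with attention-weighted rewards observed later at neighboring intersections.
    The true formula is defined in the main text; this is only a stand-in."""
    correction = sum(w * r for w, r in zip(attention_weights, future_neighbor_rewards))
    return r_i + gamma * correction

def replay_buffer_amendment(raw_buffer, train_buffer, attention, neighbors, n, ex_index):
    """Mirror of Algorithm 2: walk backwards over transitions that already have n
    future steps recorded, amend their rewards, and move them to the training buffer.
    Assumes every agent's raw buffer has the same length."""
    for i in raw_buffer:                                  # one raw buffer per agent i
        j = len(raw_buffer[i]) - n - 1                    # newest transition with n future steps
        while j > ex_index:
            o, a, r, o_next = raw_buffer[i][j]
            # neighbors' rewards observed n steps after this transition (illustrative choice)
            future = [raw_buffer[k][j + n][2] for k in neighbors[i]]
            R = gamma_attention_reward(r, future, attention[i])
            train_buffer.append((o, a, R, o_next))
            j -= 1
    return len(next(iter(raw_buffer.values()))) - n - 1   # new ex_index for the next call
```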
12:end function